VLSI DESIGN, 2000, Vol. 00, No. 00, pp. 1–43
© 2000 OPA (Overseas Publishers Association) N.V. Published by license under the Gordon and Breach Science Publishers imprint. Printed in Malaysia.
Reprints available directly from the publisher. Photocopying permitted by license only.

Tutorial on VLSI Partitioning


SAO-JIE CHEN a,† and CHUNG-KUAN CHENG b,*

a Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10764; b Dept. of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093-0114

*Corresponding author. Tel: (858)534-6184, Fax: (858)534-7029, e-mail: [email protected]
†Tel: (8862)2363-5251 ext. 417, e-mail: [email protected]

(Received 1 March 1999; In final form 10 February 2000)

This tutorial introduces partitioning with applications to VLSI circuit design. The problem formulations include two-way, multiway, and multi-level partitioning, partitioning with replication, and performance driven partitioning. We depict the models of multiple pin nets for the partitioning processes. To derive the optimum solutions, we describe the branch and bound method and the dynamic programming method for a special case of circuits. We also explain several heuristics including the group migration algorithms, network flow approaches, programming methods, Lagrange multiplier methods, and clustering methods. We conclude the tutorial with research directions.

Keywords: Partitioning, clustering, network flow, hierarchical partitioning, replication, performance driven partitioning

1. INTRODUCTION

Automatic partitioning [5, 61, 78, 72] is becoming an important topic with the advent of deep submicron technologies. An efficient and effective partitioning [12, 17, 19, 48, 69, 70, 81, 94, 105, 77] tool can drastically reduce the complexity of the design process and handle engineering change orders in a manageable scope. Moreover, the quality of the partitioning differentiates the final product in terms of production cost and system performance.

The size of VLSI designs has increased to systems of hundreds of millions of transistors. The complexity of the circuit has become so high that it is very difficult to design and simulate the whole system without decomposing it into sets of smaller subsystems. This divide and conquer strategy relies on partitioning to organize the whole system into a hierarchical tree structure.

Partitioning is also needed to handle engineering change orders. For huge systems, design iterations require very fast turnaround time. A hierarchical partitioning methodology can localize the modifications and reduce the complexity.

Furthermore, a good partitioning tool can decrease the production cost and improve the
system performance. With the advance of fabrication technologies, the cost of a transistor drops while the cost of input/output pads remains fairly constant. Consequently, the size of the interface between partitions, e.g., between chips, determines a significant portion of the manufacturing expenses, and the quality of the partitioning has a strong effect on production cost. Furthermore, in submicron designs, interconnection delays tend to dominate gate delays [8]; therefore system performance is greatly influenced by the partitions.

Partitioning has been applied to solve various aspects of VLSI design problems [5, 36]:

• Physical packaging: Partitioning decomposes the system in order to satisfy the physical packaging constraints. The partitioning conforms to a physical hierarchy ranging from cabinets, cases, boards, and chips to modular blocks.
• Divide and conquer strategy: Partitioning is used to tackle the design complexity with a divide and conquer strategy [21]. This strategy is adopted to decompose the project between team members, to construct a logic hierarchy for logic synthesis, to transform the netlist into a physical hierarchy for floorplanning, to allocate cells into regions for placement and RLC extraction, and to manipulate hierarchies between logic and layout for simulation.
• System emulation and rapid prototyping: One approach for system emulation and prototyping is to construct the hardware with field programmable gate arrays. Usually, the capacity of these field programmable gate arrays is smaller than current VLSI designs. Thus, these prototyping machines are composed of a hierarchical structure of field programmable gate arrays. A partitioning tool is needed to map the netlist into the hardware [110].
• Hardware and software codesign: For hardware and software codesign, partitioning is used to decompose the designs into hardware and software.
• Management of design reuse: For huge designs, especially system-on-a-chip, we have to manage design reuse. Partitioning can identify clusters of the netlist and construct functional modules out of the clusters.

While partitioning is a tool required to manage huge systems in many fields, such as efficient storage of large databases on disks and data mining, in this tutorial we focus our efforts on partitioning with applications to VLSI circuit designs. In the next section, we describe the notations for the tutorial. In section three, the formulations of the partitioning problems are stated. Section four covers the models for multiple pin nets. Section five depicts the partitioning algorithms. The tutorial is concluded with research directions.

2. PRELIMINARIES

In this section, we establish the notations used and formulate the partitioning problems addressed in our approaches. A circuit is represented by a hypergraph, $H(V, E)$, where the vertex set $V = \{v_i \mid i = 1, 2, \ldots, n\}$ denotes the set of modules and the hyperedge set $E = \{e_j \mid j = 1, 2, \ldots, m\}$ denotes the set of nets. Each net $e_j$ is a subset of $V$ with cardinality $|e_j| \ge 2$. The modules in $e_j$ are called the pins of $e_j$.

The hypergraph representation for a circuit with 9 modules and 6 signal nets is shown in Figure 1, where nets $e_1$, $e_3$ and $e_5$ are two-pin nets, net $e_6$ is a three-pin net, and nets $e_2$ and $e_4$ are four-pin nets.

When the circuit has only two-pin nets, we can simplify the representation to a graph $G(V, E)$. A net connecting modules $v_i$ and $v_j$ is represented by $e_{ij}$ with a connectivity $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. We shall show later that for certain formulations we replace multiple pin nets with models of two-pin nets. The replacement is performed when the partitioning algorithm is devised for graph models.

(i) Module Size and Net Connectivity  Each module $v_i$ is attached with a size $s_i$ in $\mathbb{R}^+$, the positive real numbers. We define $S(V_j) = \sum_{v_i \in V_j} s_i$ to be the size of a partition $V_j$. Each net $e_i$ is attached with a
connectivity $c_i$ in $\mathbb{R}^+$. By default, $c_i = 1$. For a bus of multiple signal lines, we can represent the bus with a net $e_i$ of connectivity $c_i$ equal to the number of lines. We can also assign higher weights to some important nets; this will enable us to keep the modules of these nets in the same partition.

FIGURE 1 Hypergraph example.

In this tutorial, we will assume that circuits are represented as hypergraphs except when stated otherwise; hence, the terms circuit, netlist, and hypergraph are used interchangeably throughout the tutorial.

(ii) Partitions and Cuts  The set of hyperedges connecting any two-way partition $(V_1, V_2)$ of two disjoint vertex sets $V_1$ and $V_2$ is denoted by a cut $E(V_1, V_2) = \{e_j \in E \mid 0 < |e_j \cap V_1| \text{ and } 0 < |e_j \cap V_2|\}$, i.e., $e_j \in E(V_1, V_2)$ if there exist some pins of $e_j$ in $V_1$ and some different pins of $e_j$ in $V_2$. We define $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$ to be the cut count of the partition $(V_1, V_2)$.

For a multiway partition $(V_1, V_2, \ldots, V_k)$ where $k > 2$, the cut is $E(V_1, V_2, \ldots, V_k) = \{e_j \in E \mid \exists\, i \text{ s.t. } 0 < |e_j \cap V_i| < |e_j|\}$. For each subset $V_i$, we denote its external cut set by $E(V_i) = \{e_j \in E \mid 0 < |e_j \cap V_i| < |e_j|\}$. We denote its adjacent net set to be the nets with some pin contained in $V_i$, i.e., $I(V_i) = \{e_i \mid |e_i \cap V_i| > 0\}$.

(iii) Replication Cuts and Directed Cuts  For replication cuts and performance driven partitioning, the direction of the nets makes a difference in the process. We characterize the pins of each net into two types: source and sink. A directed net $e_i$ is denoted by $(a_i, b_i)$, where $a_i \subseteq V$ are the source pins of the net and $b_i \subseteq V$ are the sink pins of the net. We assume that $|a_i \cup b_i| \ge 2$, $|a_i| \ge 1$ and $|b_i| \ge 1$. Usually, each net has one source pin and multiple sink pins. However, some nets may have multiple sources which share the same interconnect line. Furthermore, one pin can be both a source pin and a sink pin of the same net. Therefore, $a_i$ and $b_i$ may have a nonempty intersection.

For two disjoint vertex sets $X$ and $Y$, we shall use $E(X \to Y)$ to denote the directed cut set from $X$ to $Y$. Net set $E(X \to Y)$ contains all the nets $e_i = (a_i, b_i)$ such that $X$ intersects the source pin set $a_i$ and $Y$ intersects the sink pin set $b_i$, i.e., $E(X \to Y) = \{e_i \mid e_i = (a_i, b_i),\ a_i \cap X \ne \emptyset,\ b_i \cap Y \ne \emptyset\}$. We use the function $C(X \to Y)$ to denote the total cut count of the nets in $E(X \to Y)$, i.e., $C(X \to Y) = \sum_{e_i \in E(X \to Y)} c_i$.

(iv) Performance Driven Partitioning  In performance driven partitioning [106], modules are distinguished into two types: combinational elements and globally clocked registers. In illustrations, we shall use circles to represent the combinational elements and rectangles to represent the registers (Fig. 13). Each module $v_i$ has an associated delay $d_i$.

A path of length $k$ from a module $v_i$ to a module $v_j$ is a sequence $\langle v_{i_0}, v_{i_1}, \ldots, v_{i_k} \rangle$ of modules such that $v_i = v_{i_0}$, $v_j = v_{i_k}$ and, for each $l \in \{1, 2, \ldots, k\}$, modules $v_{i_{l-1}}$ and $v_{i_l}$ are a source pin and a sink pin of a net in $E$, respectively.

(v) Clustering  Given a hypergraph $H(V, E)$, highly connected modules in $V$ can be grouped
together to form single supermodules called clusters. After this process, a clustering $\Gamma = \{V_1, V_2, \ldots, V_k\}$ of the original hypergraph $H$ is obtained and a contracted (i.e., coarser) hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ is induced, where $V_\Gamma = \{v^\Gamma_1, v^\Gamma_2, \ldots, v^\Gamma_k\}$. For every $e_j \in E$, the contracted net $e^\Gamma_j \in E_\Gamma$ if $|e^\Gamma_j| \ge 2$, where $e^\Gamma_j = \{v^\Gamma_i \mid e_j \cap V_i \ne \emptyset\}$, that is, $e^\Gamma_j$ spans the set of clusters containing modules of $e_j$. A contracted hypergraph, of course, can be used to induce another coarser contracted hypergraph based on the same clustering process. On the other hand, a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ can be unclustered to return to a finer hypergraph $H(V, E)$.

3. PROBLEM FORMULATIONS

In this section, we describe different formulations of the partitioning problems addressed in this tutorial. We will cover two-way partitioning, multiway partitioning, multiple level partitioning, partitioning with replication, and performance driven partitioning.

3.1. Two-way Partitioning or Bipartitioning

We consider several possible variations on the size constraints and cost functions in the formulation. Additionally, in certain formulations, we fix two modules $v_s$ and $v_t$ to be on the opposite sides of the cut as two seeds.

3.1.1. Min-cut Separating Two Modules $v_s$ and $v_t$

Given a hypergraph, we fix two modules denoted as $v_s$ and $v_t$ at two sides. A min-cut is a partition $(V_1, V_2)$, $v_s \in V_1$ and $v_t \in V_2$, such that the cut count $C(V_1, V_2)$ is minimized, i.e.,

$$\min_{v_s \in V_1,\ v_t \in V_2} C(V_1, V_2) \tag{1}$$

where $V_1$ and $V_2$ are disjoint and the union of the two sets is equal to $V$.

This partitioning is strongly related to a linear placement problem. In a linear placement, we have $|V|$ equally spaced slots on a straight line (Fig. 2). Modules $v_s$ and $v_t$ are fixed at the two extreme ends, i.e., $v_s$ on the first slot (left end) and $v_t$ on the last slot (right end). The goal is to assign all modules to distinct slots to minimize the total wire length. Let us use $x_i$ to denote the coordinate of module $v_i$ after it is assigned to a slot. The length of a net $e_i$ can be expressed as the difference of the maximum coordinate and the minimum coordinate of the modules in the net, i.e., $\max_{v_j \in e_i} x_j - \min_{v_k \in e_i} x_k$. The total wire length can be expressed as follows.

$$\sum_{e_i \in E} \left( \max_{v_j \in e_i} x_j - \min_{v_j \in e_i} x_j \right) \tag{2}$$

The relation between partitioning and placement can be derived under the assumption that all nets are two-pin nets [50].

THEOREM 3.1  Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be a min-cut partition separating modules $v_s$ and $v_t$. Let $v_s$ and $v_t$ be the two modules located at the two extreme ends of a linear placement. Then, there exists an optimal linear placement solution such that all modules in $V_2$ are on the slots right of all modules in $V_1$ (Fig. 2).

FIGURE 2 Suppose partition $(V_1, V_2)$ is a min-cut separating modules $v_s$ and $v_t$. There exists an optimal linear placement in which modules in $V_2$ are at the right side of modules in $V_1$.

Thus, we can use the min-cut to partition a linear
placement into two smaller problems and still maintain optimality. Conceptually, we can conceive that modules in $V_1$ or $V_2$ have stronger internal connection within the set than mutual connection to the other set. Thus, if the spans of modules in $V_1$ and in $V_2$ are mixed in a linear placement, we can slide all modules in $V_1$ to the left and all modules in $V_2$ to the right to reduce the total wire length. In fact, this is the procedure to prove the theorem.

The min-cut with no size constraints can be found in polynomial time using classical maximum flow techniques [1]. However, it may happen that the optimal solution separates only $v_s$ or $v_t$ from the rest of the modules, i.e., $V_1 = \{v_s\}$ or $V_2 = \{v_t\}$. This result is very likely to happen because most VLSI basic modules have very small degrees of connecting nets (e.g., the degree of a 3-input NAND gate is 4).

3.1.2. Minimum Cost Ratio Cut

The cost ratio cut formulation supplies a partition different from the min-cut that separates two fixed modules. Thus, if the min-cut cannot provide any nontrivial solution, we may adopt the cost ratio cut to perform another trial.

In cost ratio cut, we fix two modules $v_s$ and $v_t$ at two different sides. Our objective is to find a vertex set $A$ to minimize a cost ratio function:

$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} \tag{3}$$

where vertex set $A$ does not contain $v_s$ and $v_t$. Vertex set $A$ is non-empty, i.e., $S(A) > 0$.

Cost ratio cut is also strongly related to a linear placement. Assuming that all nets are two-pin nets, we can derive the following theorem [22]:

THEOREM 3.2  Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be an optimal cost ratio cut partition. There exists an optimal linear placement solution such that all modules in $A$ are on the slots left of all modules in $V - A - \{v_s\}$.

Conceptually, we can conceive that $C(A, V - A - \{v_s\})$ is the force to pull $A$ to the right and $C(A, \{v_s\})$ is the force to push $A$ to the left. The denominator $S(A)$ is the inertia of the set $A$. A set $A$ with the minimum cost ratio moves with the fastest acceleration toward the left end of the slots.

Example  In Figure 3, the circuit contains six modules. The optimum cost ratio cut solution has $A = \{v_1, v_2, v_3\}$. The cost ratio value is

$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} = \frac{4 - 3}{3} = \frac{1}{3}. \tag{4}$$

The cost ratio value of any other choice of set $A$ is larger than the value in Eq. (4).

FIGURE 3 A six module circuit to illustrate the cost ratio cut.
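
As a rough illustration of the cost ratio function in Eq. (3), the following Python sketch enumerates every candidate set A on a small graph of two-pin nets and returns the set with the smallest cost ratio. The four-module graph, the module names `s`, `a`, `b`, `t`, and all function names are assumptions made for this example; it is not the circuit of Figure 3, and exhaustive enumeration is only practical for tiny instances.

```python
from itertools import combinations


def cut_weight(edges, X, Y):
    """Total connectivity of two-pin nets with one pin in X and the other in Y."""
    return sum(c for (u, v), c in edges.items()
               if (u in X and v in Y) or (u in Y and v in X))


def cost_ratio(edges, sizes, A, vs):
    """Cost ratio of Eq. (3): (C(A, V - A - {vs}) - C(A, {vs})) / S(A)."""
    rest = set(sizes) - set(A) - {vs}
    pull_right = cut_weight(edges, A, rest)
    pull_left = cut_weight(edges, A, {vs})
    return (pull_right - pull_left) / sum(sizes[v] for v in A)


def best_cost_ratio_cut(edges, sizes, vs, vt):
    """Try every non-empty A that excludes vs and vt; return (ratio, sorted A)."""
    free = sorted(set(sizes) - {vs, vt})
    candidates = (frozenset(c)
                  for r in range(1, len(free) + 1)
                  for c in combinations(free, r))
    return min((cost_ratio(edges, sizes, A, vs), sorted(A)) for A in candidates)


# Hypothetical example: unit module sizes, weighted two-pin nets.
sizes = {"s": 1, "a": 1, "b": 1, "t": 1}
edges = {("s", "a"): 2, ("a", "b"): 1, ("b", "t"): 2, ("a", "t"): 1}

print(best_cost_ratio_cut(edges, sizes, "s", "t"))  # (0.0, ['a'])
```

In this toy instance the set {a} wins because its connection to the fixed module on the left (2) balances its connection to the remaining modules (2), so the numerator of Eq. (3) vanishes.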


The cost ratio cut solution can be found in polynomial time for the special case of series-parallel graphs [22]. We are unaware of algorithms for general cases. Note that the solution may have $V - A - \{v_s\}$ equal to the set $\{v_t\}$. In such a case, the partitioning result is not useful for decomposing the circuit.

3.1.3. Min-cut with Size Constraints

For min-cut with size constraints, we have lower and upper bounds on the partition size, $S_l$ and $S_u$, where $0 < S_l \le S_u < S(V)$ and $S_l + S_u = S(V)$. The bipartitioning problem is to divide vertex set $V$ into two nonempty partitions $V_1$, $V_2$, where $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$, with the objective of minimizing the cut count $C(V_1, V_2)$ subject to the following size constraints:

$$S_l \le S(V_b) \le S_u \quad \text{for } b = 1, 2 \tag{5}$$

The min-cut problem with size constraints is NP-complete [43]. However, because of the importance of the problem in many applications, many heuristic algorithms have been developed.

Random Partitioning  We use a random partition estimation of min-cut with size constraints to demonstrate that the quality variation of partitioning results can be significant. Let us simplify the case by assigning the modules uniform size, i.e., $s_i = 1$ for all $v_i$ in $V$, and the nets uniform connectivity, i.e., $c_i = 1$ for all $e_i$ in $E$.

Let us assume that the modules are partitioned into two sets $V_1$, $V_2$ with equal sizes: $S(V_1) = S(V_2)$. The partition is performed with an independent random process [10] so that each module has a 50% chance to go to either side. For a net $e_i$ of two pins, we can derive that net $e_i$ belongs to the cut set $E(V_1, V_2)$ with probability 0.5 (Fig. 4). Similarly, we can derive that for a net $e_i$ of $k$ pins ($k > 2$), the probability that net $e_i$ belongs to cut set $E(V_1, V_2)$ is $(2^k - 2)/2^k$. This probability is larger than 0.5 and approaches one as $k$ increases. In other words, the expected cut count $C(V_1, V_2)$ is equal to or larger than half the number of nets. For example, a circuit of one million modules usually has a number of nets of the same order, i.e., $|E| = O(|V|)$, or about 1,000,000 nets. The expected cut count would be $C(V_1, V_2) \ge 500{,}000$. This number is much worse than the results we can achieve. In practice, the cut counts on circuits of a million modules are usually no more than several thousand [34, 36]. In other words, the probability that a net belongs to a cut set is small, below one percent for a circuit of one million gates.

FIGURE 4 Four possible configurations of net $e_i = \{a, b\}$ in a random placement.

Suppose the two bounds of partitioned sizes are not equal, $S_l \ne S_u$. Using the proposed random graph model, the expected cut count $C(V_1, V_2)$ is proportional to the product of the two sizes, i.e., $S(V_1) \times S(V_2)$. Consequently, the expected cut count is smallest if the size of one partition approaches the upper bound, $S(V_i) = S_u$, and the size of the other partition approaches the lower bound, $S(V_j) = S_l$. In practice, we do observe this behavior. One partition is fully loaded to its maximum capacity, while the other partition is underutilized with a large capacity left unused. This phenomenon is not desirable for certain applications.

3.1.4. Ratio Cut

The ratio cut formulation integrates the cut count and a partition size balance criterion into a single objective function [87, 109]. Given a partition $(V_1, V_2)$ where $V_1$ and $V_2$ are disjoint and $V_1 \cup V_2 = V$, the objective function is defined as
$$\frac{C(V_1, V_2)}{S(V_1) \times S(V_2)} \tag{6}$$

The numerator of the objective function minimizes the cut count while the denominator avoids uneven partition sizes. Like many other partitioning problems, finding the ratio cut in a general network belongs to the class of NP-complete problems [87].

Example  Figure 5 shows a seven module example. The modules are of unit size and the nets are of unit connectivity. Partition $(V_1, V_2)$ has a cost $C(V_1, V_2)/(S(V_1) \times S(V_2)) = 2/(4 \times 3) = 1/6$. Any other partition corresponds to a much larger cost.

FIGURE 5 An example of seven modules, where partition $(V_1, V_2)$ is a minimum ratio cut.

The Clustering Property of the Ratio Cut  The clustering property of the ratio cut can be illustrated by a random graph model. Let us assume that the circuit is a uniformly distributed random graph with uniform module sizes, i.e., $s_i = 1$. We construct the nets connecting each pair of modules with identical independent probability $f$. Consider a cut which partitions the circuit into two subsets $V_1$ and $V_2$ with comparable sizes $\alpha \times |V|$ and $(1 - \alpha) \times |V|$ respectively, where $\alpha < 1$. The expected cut count equals the probability $f$ multiplied by the number of possible nets between $V_1$ and $V_2$.

$$\mathrm{Expec}(C(V_1, V_2)) = f \times |V_1| \times |V_2| = \alpha (1 - \alpha) |V|^2 \times f. \tag{7}$$

On the other hand, if another cut separates only one module $v_s$ from the rest of the modules, the expected cut count is

$$\mathrm{Expec}(C(\{v_s\}, V - \{v_s\})) = (|V| - 1) \times f \tag{8}$$

As $|V|$ approaches infinity, the value of Eq. (7) becomes much larger than that of Eq. (8).

This derivation provides another explanation of why the min-cut separating two fixed modules tends to generate very uneven sized subsets. The very uneven sized subsets naturally give the lowest cut value. Therefore, the ratio value $C(V_1, V_2)/(S(V_1) \times S(V_2))$ is proposed to alleviate the hidden size effect. As a consequence, the expected value of this ratio is a constant with respect to different cuts:

$$\mathrm{Expec}\left( \frac{C(V_1, V_2)}{S(V_1) \times S(V_2)} \right) = \frac{f \times |V_1| \times |V_2|}{|V_1| \times |V_2|} = f \tag{9}$$

Thus, if the nets of the graph are uniformly distributed, all cuts have the same ratio value. In other words, the choice of the cuts and the partition sizes does not make a difference in such a uniformly distributed random graph. In a general circuit, different cuts generate different ratios. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their corresponding ratios defines the sparsest cut, since this cut deviates the most from the expectation on a uniformly distributed graph.

3.2. Multi-way Partitioning

For multi-way partitioning, we discuss a k-way partitioning with fixed size constraints and a cluster ratio cut. These two problems are the extensions of the min-cut with fixed size constraints and the ratio cut from two-way to multi-way partitioning, respectively.

3.2.1. K-way Partitioning

For multi-way partitioning, we separate vertex set $V$ into $k$ disjoint subsets where $k > 2$, i.e., $(V_1, V_2, \ldots, V_k)$. There is an upper bound $S_u$ and a lower bound $S_l$ on the size of each subset $V_i$, i.e., $S_l \le S(V_i) \le S_u$.
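
The ratio cut objective of Eq. (6) in Section 3.1.4 can be evaluated by brute force on a small instance. The sketch below is a minimal Python illustration, assuming unit module sizes and unit connectivities; the six-module netlist (two triangles joined by a single net) and all function names are hypothetical and are not taken from Figure 5. Exhaustive search is used only because the instance is tiny; as noted above, the general problem is NP-complete.

```python
from fractions import Fraction
from itertools import combinations


def cut_count(nets, part1):
    """Number of unit-connectivity nets with pins on both sides of the partition."""
    return sum(1 for net in nets if 0 < len(net & part1) < len(net))


def ratio_cut(nets, modules, part1):
    """Eq. (6) with unit sizes: C(V1, V2) / (|V1| * |V2|)."""
    part2 = modules - part1
    return Fraction(cut_count(nets, part1), len(part1) * len(part2))


def min_ratio_cut(nets, modules):
    """Enumerate all two-way partitions; V1 always holds the smallest module id."""
    ordered = sorted(modules)
    anchor, rest = ordered[0], ordered[1:]
    best = None
    for r in range(len(rest)):  # r < len(rest) keeps V2 non-empty
        for extra in combinations(rest, r):
            part1 = frozenset((anchor,) + extra)
            value = ratio_cut(nets, modules, part1)
            if best is None or value < best[0]:
                best = (value, part1)
    return best


# Hypothetical netlist: triangles {1,2,3} and {4,5,6} joined by the net {3,4}.
modules = frozenset(range(1, 7))
nets = [frozenset(p) for p in
        [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]]

value, part1 = min_ratio_cut(nets, modules)
print(value, sorted(part1))  # 1/9 [1, 2, 3]
```

The search recovers the two triangles as the natural clusters: cutting the single joining net costs 1/(3 × 3), whereas isolating one module costs at least 2/(1 × 5).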
There are different ways to formulate the cut cost because of the different criteria used to count the cost of multiple pin nets. In the following we list a few possible objective functions.

(i) Minimize the cut count,

$$C(V_1, V_2, \ldots, V_k) = \sum_{e_i \in E(V_1, V_2, \ldots, V_k)} c_i \tag{10}$$

(ii) Minimize the sum of cut counts of all vertex sets. Let us denote the cut count of vertex set $V_i$ to be $C(V_i) = \sum_{e_i \in E(V_i)} c_i$. The sum of cut counts of all subsets can be expressed as

$$\sum_{i=1}^{k} C(V_i) = \sum_{i=1}^{k} \sum_{e_j \in E(V_i)} c_j \tag{11}$$

Thus, the cost of a net connecting three subsets is more expensive than the same net connecting two subsets.

(iii) Minimize the maximum cut count of all subsets, i.e.,

$$\max_{1 \le i \le k} C(V_i) \tag{12}$$

3.2.2. Cluster Ratio Cut

Cluster ratio cut is an extension of ratio cut from two-way partition to multiway partition. There is no bound on the size of each subset. Furthermore, the number of partitions, $k$, is not fixed, and instead is part of the objective function.

$$R_C = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{1 \le i \le k-1} \sum_{j > i} S(V_i) \times S(V_j)} \tag{13}$$

Note that we can rewrite the denominator to reduce the complexity of the derivation.

$$R_C = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{(1/2) \sum_{1 \le i \le k} S(V_i) \times [S(V) - S(V_i)]} \tag{14}$$

If the number of partitions is one, the denominator becomes zero. Thus, $k$ is restricted to be larger than one.

Example  Figure 6 shows a fifteen module circuit. The modules are of unit size and the nets are of unit connectivity. The square dot in the figure represents a hypernet. The partition shown by the dashed line is a minimum cluster ratio cut. The cost of the cut is

$$\frac{C(V_1, V_2, \ldots, V_4)}{(1/2) \sum_{1 \le i \le 4} S(V_i) [S(V) - S(V_i)]} = \frac{4}{(1/2)[4(15-4) + 3(15-3) + 4(15-4) + 4(15-4)]} = \frac{1}{21} \tag{15}$$

FIGURE 6 A fifteen module example to demonstrate cluster ratio cut.
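
The arithmetic of Eq. (15), and the equivalence of the two denominators in Eqs. (13) and (14), can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the partition sizes (4, 3, 4, 4) and cut count 4 are read off Eq. (15) rather than recomputed from the netlist of Figure 6, which is not reproduced here.

```python
from fractions import Fraction


def pairwise_denominator(sizes):
    """Denominator of Eq. (13): sum over i < j of S(V_i) * S(V_j)."""
    return sum(sizes[i] * sizes[j]
               for i in range(len(sizes))
               for j in range(i + 1, len(sizes)))


def complement_denominator(sizes):
    """Denominator of Eq. (14): (1/2) * sum of S(V_i) * (S(V) - S(V_i))."""
    total = sum(sizes)
    return Fraction(1, 2) * sum(s * (total - s) for s in sizes)


def cluster_ratio(cut, sizes):
    """Cluster ratio of one fixed partition (no minimization over partitions)."""
    return Fraction(cut, pairwise_denominator(sizes))


sizes = [4, 3, 4, 4]                   # subset sizes used in Eq. (15)
print(pairwise_denominator(sizes))     # 84
print(complement_denominator(sizes))   # 84
print(cluster_ratio(4, sizes))         # 1/21
```

Both denominators agree because each unordered pair of subsets is counted twice in the sum of Eq. (14), which the factor 1/2 compensates for.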


The physical intuition of cluster ratio can be explained using a random graph model [10]. Let $G$ be a uniformly distributed random graph. We construct the nets connecting each pair of modules with identical independent probability $f$. Since the nets are uniformly distributed, the probability of finding a subgraph which is significantly denser than the rest of the graph is very small, meaning that there is no distinct cluster structure in $G$. Consider a cut $E(V_1, V_2, \ldots, V_k)$; the expected value of $C(V_1, V_2, \ldots, V_k)$ equals

$$\mathrm{Expec}(C(V_1, V_2, \ldots, V_k)) = f \times \sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j| \tag{16}$$

and the expected value of cluster ratio equals

$$\mathrm{Expec}(R_C) = \mathrm{Expec}\left( \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j|} \right) = \frac{f \times \sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j|}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j|} = f \tag{17}$$

Since $f$ is a constant, all cuts have the same expected cluster ratio value. Therefore, if we use cluster ratio as the metric, all cuts would be equally favored, which is consistent with the fact that $G$ has no distinct clusters. However, in a general circuit, different cuts generate different ratio values. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their cluster ratio values defines the cluster structure of the circuit, since this cut deviates the most from the cuts of a uniformly distributed graph.

3.3. Multi-level Partitioning

In multi-level partitioning [4, 23, 47, 58, 67, 68, 109, 110], the final result is represented by a tree structure. All the modules are assigned to the leaves of the tree. The tree is directed from the root toward the leaves. The level of a node is defined to be the maximum number of nodes to traverse to reach the leaves. Thus, the leaves are ranked level zero. Each node is one level above the maximum level of its children. When the level of the root is only one, the problem degenerates to two-way or multiway partitioning.

Each net $e_i$ spans a set of leaves. Given a set of leaves, there is a unique lowest common ancestor. The level of the lowest common ancestor is defined to be the level $l(e_i)$ of the net.

The cost of a net $e_i$ is defined to be the product of its connectivity $c_i$ and the weight $w(l(e_i))$ of level $l(e_i)$ for net $e_i$ to communicate, i.e., $c_i\, w(l(e_i))$. The cost of the multi-level partition is the sum of the costs of all nets, i.e., $\sum_{e_i \in E} c_i\, w(l(e_i))$.

3.3.1. J-level K-way Partitioning

When the root of the partitioning tree is at level $j$ and the number of branches of each node is no more than $k$, we call it a $j$-level $k$-way partition. We can set different communication weights for each level. Usually, the function is monotone, i.e., $w(l)$ is larger when level $l$ increases. The vertex set $V_i$ of each leaf $i$ has its size bounded by $S_l \le S(V_i) \le S_u$.

For electronic packaging, the tree is bounded by the number of external connections. We say a leaf is covered by a node if there is a directed path from the node to the leaf in the tree representation. For each node $n_i$, we define $T_i$ to be the union of the modules in the leaves covered by node $n_i$. Let $E(T_i)$ be the external nets of $T_i$, i.e., $E(T_i) = \{e_i \mid 0 < |e_i \cap T_i| < |e_i|\}$. The cut count of each node should not exceed the capacity of the external connection of the packaging, i.e.,

$$C(T_i) = \sum_{e_j \in E(T_i)} c_j \le \mathrm{Cap}(l(n_i)) \tag{18}$$

where $\mathrm{Cap}(l(n_i))$ is the capacity of the external connection of level $l(n_i)$.

Example  Figure 7 shows an example of a 3-level 5-way partitioning structure. The leaves are at level 0 and the root is at level 3. Each node has at most five children. Net $e_i = \{v_1, v_2, v_3\}$ is covered by node $n_a$ at level $l(n_a) = 2$.
FIGURE 7 An example of a 3-level 5 way partitioning tree structure.
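
To make the multi-level cost concrete, the following Python sketch computes node levels, finds the lowest common ancestor of the leaves spanned by each net, and sums $c_i\, w(l(e_i))$. The tiny tree, the module-to-leaf assignment, the nets, and the function names are all hypothetical, and the weight $w(l) = 2^l$ is just one monotone choice assumed for the example; none of this reproduces the tree of Figure 7.

```python
def node_levels(children):
    """Level 0 for leaves; each internal node is one above its highest child."""
    levels = {}

    def level(node):
        if node not in levels:
            kids = children.get(node, [])
            levels[node] = 0 if not kids else 1 + max(level(k) for k in kids)
        return levels[node]

    for node in children:
        level(node)
    return levels


def lowest_common_ancestor(parent, leaves):
    """Lowest tree node whose subtree contains every leaf in `leaves`."""
    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    leaves = list(leaves)
    common = path_to_root(leaves[0])      # ordered from the leaf up to the root
    for leaf in leaves[1:]:
        other = set(path_to_root(leaf))
        common = [n for n in common if n in other]
    return common[0]


def multilevel_cost(children, leaf_of, nets, weight):
    """Sum of connectivity * weight(level of the net) over all nets."""
    parent = {kid: node for node, kids in children.items() for kid in kids}
    levels = node_levels(children)
    total = 0
    for pins, connectivity in nets:
        lca = lowest_common_ancestor(parent, {leaf_of[v] for v in pins})
        total += connectivity * weight(levels[lca])
    return total


# Hypothetical tree: root r -> x, y; x -> leaves A, B; y -> leaf C.
children = {"r": ["x", "y"], "x": ["A", "B"], "y": ["C"]}
leaf_of = {"v1": "A", "v2": "A", "v3": "B", "v4": "C"}
nets = [({"v1", "v2"}, 1), ({"v1", "v3"}, 1), ({"v3", "v4"}, 1)]

print(multilevel_cost(children, leaf_of, nets, lambda l: 2 ** l))  # 7
```

The three nets sit at levels 0, 1 and 2 respectively, so with $w(l) = 2^l$ they contribute 1 + 2 + 4 = 7; a net that must cross the root is the most expensive.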

3.3.2. Generic Binary Tree

A generic binary tree structure [110] is proposed to simplify the multi-level partitioning. There is only one constant $S_u$ to set in the binary tree. Thus, it is much easier to make a fair comparison between different algorithms.

In a generic binary tree, each internal node has exactly two children. The weight of each level is defined to be $w(l) = 2^l$. Thus, we have the objective function

$$\min \sum_{e_i \in E} c_i\, 2^{l(e_i)}$$

subject to the constraint on the capacity of the leaves, i.e., $S(V_i) \le S_u$, where $V_i$ is the vertex set of leaf $i$. The level of the root is adjusted according to the minimization of the objective function.

Example  Figure 8 illustrates a generic binary tree for partitioning. In this figure, the root is at level three. Each node has at most two children.

FIGURE 8 An example of a generic binary tree.

3.4. Replication Cut

In the replication cut problem, a subset of the circuit may be replicated to reduce the cut count of a partition [54, 64, 82]. In this section, we use a two-way partition to illustrate the problem. We fix two modules $v_s$ and $v_t$ at two sides of the cut. We use three vertex sets to represent the partition, $V_1$, $V_2$, and $R$, where $V_1$, $V_2$, and $R$ are disjoint and $V_1 \cup V_2 \cup R = V$, $v_s \in V_1$, $v_t \in V_2$. Subsets $V_1$ and $V_2$ are separated by the cut and subset $R$ is to be replicated at both sides (Fig. 9).

Each copy of $R$ needs to collect a complete set of input signals in order to compute the function


properly. Thus, the nets from $V_1$ to $R$ and from $V_2$ to $R$ are duplicated. However, the output signals of $R$ can be obtained from either copy of $R$. For example, nets from the right side $R$ to $V_1$ in Figure 9(b) are not duplicated because $V_1$ gets inputs from the left side $R$. For the same reason, we do not replicate the nets from the left side $R$ to $V_2$.

FIGURE 9 Replication cut problem: (a) the three sets of nodes $V_1$, $R$ and $V_2$; (b) the duplicated circuit with $R$ being replicated.

Given two disjoint sets $V_1$ and $V_2$, let a replication cut $R(V_1, V_2)$ denote the cut set of a partitioning with $R = V - V_1 - V_2$ being duplicated. From Figure 9(b), we can see that $R(V_1, V_2)$ is the union of four directed cuts, that is,

$$R(V_1, V_2) = E(V_1 \to V_2) \cup E(V_2 \to V_1) \cup E(V_1 \to R) \cup E(V_2 \to R).$$

Let $S_l$ and $S_u$ denote the size limits on the two partitioned subsets. We state the Replication Cut Problem as follows:

Given a directed circuit $G$, we want to find a replication cut $R(V_1, V_2)$ with an objective

$$\min C_R(V_1, V_2) = \sum_{e_i \in R(V_1, V_2)} c_i \tag{19}$$

subject to the size constraints

$$S_l \le S(V_1 \cup R) \le S_u \quad \text{and} \quad S_l \le S(V_2 \cup R) \le S_u,$$

and the feasible condition

$$V_1 \cap V_2 = \emptyset, \quad R = V - V_1 - V_2.$$

Interpretation of the Replication Cut  Suppose we rewrite the replication cut in the format:

$$R(V_1, V_2) = E(V_1 \to R) \cup E(V_1 \to V_2) \cup E(V_2 \to V_1) \cup E(V_2 \to R) = E(V_1 \to \bar{V}_1) \cup E(V_2 \to \bar{V}_2)$$

where $\bar{V}_1$ and $\bar{V}_2$ denote the complementary sets of $V_1$ and $V_2$, i.e., $\bar{V}_1 = V - V_1$ and $\bar{V}_2 = V - V_2$. The cut set becomes the union of $E(V_1 \to \bar{V}_1)$ and $E(V_2 \to \bar{V}_2)$. We can interpret the cut set of the replication cut $R(V_1, V_2)$ as two directed cuts on the original circuit $G$, as shown in Figure 10.

FIGURE 10 An interpretation of the replication cut, $R(V_1, V_2) = E(V_1 \to \bar{V}_1) \cup E(V_2 \to \bar{V}_2)$.

3.5. Performance Driven Partitioning

The goal of performance driven partitioning is to generate a partition that satisfies some timing constraints. Due to the physical geometric distance and interface technology limitations, inter-partition delay contributes the dominant portion of signal propagation delay. Consequently, instead of minimizing the number of crossing nets as the only objective during partitioning, we should take into account the interpartition delay to satisfy the timing constraints.

Clock period is a major measurement for circuit performance. It is determined by the longest signal propagation delay between registers. Each crossing net is associated with an interpartition delay $\delta$ determined by VLSI technologies. Given a path $p$ from one register to another register with no intervening registers, let $d_p$ be the sum of combinational block delays and $\hat{d}_p$ be the sum of interpartition delays along path $p$. The longest delay $d_p + \hat{d}_p$ among all paths $p$ should not exceed the clock period $T$, i.e.:

$$\max_p\ d_p + \hat{d}_p \le T. \tag{20}$$

Now we state the performance-driven partitioning problem as follows:

Given hypergraph $H(V, E)$, clock period $T$, two bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, and $\max_p\ d_p + \hat{d}_p \le T$.

Example  In Figure 11, path $p$ starts at register $v_i$ and ends at register $v_j$. The path crosses between the partition $(V_1, V_2)$ three times. Thus, the interpartition delay is $\hat{d}_p = 3\delta$.

FIGURE 11 An illustration of performance driven partitioning.

Replication can improve the performance of the partitioned results [83]. In Figure 12(a), vertex set $R$ is located on the side of $V_2$. Path $p$ crosses between the partition $(V_1, R \cup V_2)$ three times. By replicat-


I207T001015 . 207
T001015d.207

VLSI PARTITIONING 13

where Pij is the set of all paths from module vi to vj.


We de®ne a path p from vi to vj as a W-critical path
if rp equals W(i, j); W-critical path p is also called
an IO-W-critical path if modules vi and vj are the
primary input and output, respectively.
(i) Iteration Bound While retiming can reduce
the clock period of a circuit, there is a lower bound
imposed by the feedback loops in the hypergraph
[92]. Given a loop l, let dl, dbl and rl be the sum of
combinational block delays, the sum of interparti-
tion delays, and the number of registers in loop l,
respectively. The delay-to-register ratio of a loop l
is equal to …dl ‡ dbl †=rl . The iteration bound is de®-
ned as the maximum delay-to-register ratio, i.e.:
 
dl ‡ dbl
J…V1 ; V2 † ˆ max jl 2 L ; …21†
FIGURE 12 Illustration of replication and its e€ect on rl
partitioning. The ®gure shows path p (a) before and (b) after
vertex set R is replicated. where L is the set of all loops. Note that the
iteration bound of a given circuit yields a lower
bound on the achieved clock period by retiming.
ing vertex set R (Fig. 12(b)), path p needs to cross
the partition only once. (ii) Latency Bound Let p denote the IO-W-
critical path with maximum path delay among all
3.5.1. Retiming IO-W-critical paths from vi to vj. Since the number
of registers in path p is equal to W(i, j), the IO
Retiming shifts the locations of the registers to latency (i.e. (W(i, j) ÿ 1)  T ) between vi and vj is
improve the system performance [76]. It is an not less than dp ‡ dbp , where T denotes the clock
e€ective approach to reduce the clock period. period, and dp and dbp are the sum of combina-
Moreover, the process also reduces the primary tional block delays and the sum of interpartition
input to primary output latency which is another delays on path p, respectively. Thus, we de®ne
important measurement for circuit performance. latency bound M as follows [85, 86]:
As in [85], we assume that the combinational
blocks are ®ne-grained. A module is called ®ne- M…V1 ; V2 † ˆ maxfdp ‡ dbp j p 2 PIOW g; …22†
grained, if it can be split into several smaller
where PIOW is the set of all IO-W-critical paths.
modules. Alternatively, if a module cannot be
Latency bound also imposes a lower bound on the
split, it is called coarse-grained. The interpartition
system latency achieved by using retiming. An all-
delay  on crossing nets is inherently coarse-
pair shortest-path algorithm can be used to
grained and cannot be split.
calculate the latency bound.
Given a path p, we use rp to denote the number
of registers on the path. Let W(i, j) denote the We have two reasons to use the iteration and
minimum rp among all possible paths p from i to j, latency bounds. (i) It is faster to calculate these
i.e., bounds. (ii) The iteration and latency bounds
stand for the lower bounds of the clock period and
W…i; j† ˆ min frp j p 2 Pij g;
system latency achieved by adopting retiming,
respectively. The partition with lower iteration and

latency bounds can achieve better clock period and system latency by using retiming. Therefore, we want to generate a partition with small iteration and latency bounds.

Statement of the Problem Now we state the performance-driven partitioning problem as follows:

Given hypergraph $H(V, E)$, two numbers $\tilde{J}$ and $\tilde{M}$, bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum number of cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, $J(V_1, V_2) \le \tilde{J}$, and $M(V_1, V_2) \le \tilde{M}$.

Example Figure 13 illustrates the effect of replication on the iteration bound. Let us assume that the interpartition delay is $\delta = 4$. Before replication, the iteration bound is dominated by loop $l_1$. The bound is equal to

$$\frac{d_{l_1} + \hat{d}_{l_1}}{r_{l_1}} = \frac{8 + 2 \times 4}{4} = 4. \tag{23}$$

After replication [85], the bound contributed by loop $l_1$ is equal to

$$\frac{d_{l_1} + \hat{d}_{l_1}}{r_{l_1}} = \frac{8}{4} = 2. \tag{24}$$

The iteration bound now is dominated by the union of loops $l_1$ and $l_2$,

$$\frac{d_{l_1 + l_2} + \hat{d}_{l_1 + l_2}}{r_{l_1 + l_2}} = \frac{18 + 2 \times 4}{8} = 3.25, \tag{25}$$

which is smaller than the iteration bound before replication.

FIGURE 13 Illustration of replication and its effect on iteration bound.

3.6. Clustering

Clustering [6] is similar to multiway partitioning in that the process groups modules into $k$ subsets. However, for clustering the number of subsets is usually much greater than for a typical multiway partitioning problem, e.g., $k \gg 10$.

Often, a clustering process is used as part of a divide and conquer approach. Thus, it is important to choose an objective function that fits the target application. If the goal is to reduce problem complexity, we set the objective function to be:

$$\min \sum_{i=1}^{k} \frac{C(V_i)}{C_I(V_i)}, \tag{26}$$

where $V_i$'s are disjoint vertex sets and their union is equal to $V$. Function $C(V_i)$ is the external cut count of cluster $V_i$ and $C_I(V_i)$ is the count of nets connecting vertex set $V_i$, i.e., $\sum_{e_i \in I(V_i)} c_i$.

For performance driven clustering, the objective function is to minimize the number of cuts between registers.

4. MULTIPLE PIN NET MODELS

The handling of multiple pin nets strongly depends on the partitioning approach [102]. A proper model is needed to reflect the correct cut count and improve the efficiency. In this section, we first introduce a shift model which is used for iterations of



shifting a module or swapping a pair of modules. We then describe a clique model which is used to replace a multiple pin net. The star and loop models are variations of two pin net models, however, with less complexity than the clique model. Finally, a flow model is introduced for network flow approaches.

4.1. Shift Model

The shift model [101] for multiple pin net is useful when we perturb the partition by shifting one module to a different vertex set or by swapping two modules between different vertex sets. Let us simplify the description by assuming only one module is shifted to a different vertex set. A swap of a pair of modules can be treated as two steps of module shifting.

For each shift, we want to update the cut count. We also want to update the potential change in cost for each module if it were to be shifted, so that we can rank the modules for the next move. Such cost revision can be expensive if the circuit has large nets which contain huge numbers of pins, e.g., hundreds of thousands of pins.

The shift model reduces the complexity of the cost revision by utilizing the property that for huge nets most shifts of its pins do not change the cost of the other pins in the net.

Let us simplify the description by considering a two way partitioning. The model can be extended to multiple way partitioning according to the choice of objective functions. Let module $v_j$ be shifted from vertex set $V_1$ to $V_2$. The configuration of nets $e_i \in E(\{v_j\})$ connecting module $v_j$ is revised. For each net $e_i$, we denote $k_i$ to be the number of pins of $e_i$ in $V_1$ and $|e_i| - k_i$ the number of pins of $e_i$ in $V_2$ (Fig. 14). With respect to net $e_i$, we update the pin numbers $k_i$ and $|e_i| - k_i$ after module $v_j$ is shifted. We also update the cost of modules in nets $e_i$.

1. If the revised $k_i \ge 2$, the potential cost of pins due to net $e_i$ is zero. For the case that $|e_i| - k_i = 1$, we increase the cut count by $c_i$ and set the potential cost of pins in $e_i$. Otherwise, the move has no effect on the cut count and potential cost.
2. If the revised pin count $k_i = 1$, the shift of the last pin of $e_i$ in $V_1$ will decrease the cut count by $c_i$. We then update the potential cost of this last pin.
3. If $k_i = 0$, the cut count reduces by $c_i$. However, the shift of any pin $v_k \in e_i$ from $V_2$ to $V_1$ will increase the cut count. Thus, in this case, we reflect the cost of potential shift on the pins of $e_i$, which takes $O(|e_i|)$ operations.

FIGURE 14 Multiple pin net model of shifting process.

4.2. Clique of Two Pin Nets

Some researchers use cliques of two pin nets to model multiple pin nets. Given a multiple pin net $e_i$, we construct a clique of $(1/2)|e_i|(|e_i| - 1)$ two pin nets to connect all pairs of pins in the net. The clique model maintains the symmetric relation of the modules of the same net in the sense that the order of the pins in the net has no effect on the cost.

The weight of two pin nets in the clique model is adjusted by some factor. One approach is to use $2/|e_i|$ to scale down the connectivity. The total weight of all the nets in the clique is $(2/|e_i|) \times (1/2)|e_i|(|e_i| - 1)c_i = (|e_i| - 1)c_i$. Note that it takes $|e_i| - 1$ two pin nets to form a spanning tree of $|e_i|$ modules.

Other factors have been proposed, such as $1/(|e_i| - 1)$, which is based on a different probability model. However, no factor can exactly reflect the cost of a multiple pin net model.

Complexity of the Clique Model The complexity of the clique model is high. There are $O(|e_i|^2)$ two pin nets in a clique model. Suppose the

process of each two pin net takes a constant time. It takes $O(|e_i|^2)$ operations to process a multiple pin net $e_i$. Therefore, in practice, if the pin number is larger than a threshold, the net is ignored in the process.

4.3. Star of Two Pin Nets

A star model introduces less complexity than a clique model. Given a net $e_i$, we create a dummy module $\tilde{v}_i$. The dummy module $\tilde{v}_i$ connects every pin in $e_i$ with a two pin net. This module maintains the symmetry of the net. However, we need only $|e_i|$ two pin nets.

For the clique and star models, the cost of the partition depends on the number of pins on the two sides of the partition. The cost is higher when the pins are distributed more evenly on the two sides of the cut. Thus, these models discourage even partitioning of the pins in the nets.

4.4. Loop Model of Two Pin Nets

A loop model reflects the exact cut count [22], however, it is sensitive to the order of the pins. We can derive heuristic ordering of the pins using a linear placement. Modules are sequenced according to their x coordinates in the placement. We find the partition by collecting the modules according to the sequence.

Following the order of the modules in the x coordinates, we link the modules of a multiple pin net with two pin nets into a loop. We link the pins in a sequence (Fig. 15) alternating on every other module. The loop is formed by the two connections at the two ends.

A factor of (1/2) is assigned to the two pin nets so that the cut count separating modules according to the sequence is one. The model remains correct even if any two consecutive modules in the sequence swap their order.

FIGURE 15 A loop model of multiple pin net where modules are placed on an x axis.

4.5. Flow Model

For the network flow approach, we consider each net $e_i$ as a pipe. A set of saturated pipes forms a bottleneck of the flow. The union of the saturated pipes becomes the cut of the circuit. In such a model, we set the capacity of the pipe equal to the corresponding connectivity $c_i$ [52].

Let $x_{iu}$ be the amount of flow from pin $v_i$ to net $e_u$ and $x_{uj}$ be the amount of flow from net $e_u$ to pin $v_j$ (Fig. 16). The total flow injected into the net should be smaller than or equal to its capacity and the incoming flow is equal to the outgoing flow, i.e.,

$$\sum_{v_i \in e_u} x_{iu} \le c_u, \tag{27}$$

$$\sum_{v_i \in e_u} x_{iu} - \sum_{v_i \in e_u} x_{ui} = 0. \tag{28}$$

FIGURE 16 A flow model with respect to net $e_u$.

5. APPROACHES

In this section we introduce several approaches to partitioning. We first discuss two methods for optimal solutions: a branch and bound method and a dynamic programming algorithm. The branch and bound method is effective in searching exhaustively for the optimal solution for small circuits. The dynamic programming method presented runs in polynomial time and finds an optimal partition for a special class of circuits. We then explain a few heuristic algorithms:

group migration, network flow, nonlinear programming, Lagrangian, and clustering methods. The group-migration approach is a popular method in practice due to its flexibility and effectiveness. The network flow method gives us a different view of the partitioning problem by transforming the minimization of the cut count into the maximization of the flow via a duality in linear programming. This approach derives excellent results with respect to certain objective functions. The nonlinear programming method provides a global view of the whole problem. The Lagrangian method is a useful approach for performance driven problems. Finally, we depict a clustering method for the partitioning.

In most cases, we illustrate the method in question using two-way partitioning as the target problem. However, many methods can be extended to other problems or different objective functions. For example, we can apply group migration to multiway [98, 99] or multiple level partitioning problems [68, 67] with modification to the cost of the moves. Furthermore, some methods may be combined to solve a problem. For example, we can use clustering to reduce the size of an input circuit and then use group migration to find a partition of the reduced circuit with much greater efficiency [24, 59]. In fact, this strategy derives the best results in terms of CPU time and cut count in recent benchmark [2].

5.1. Branch and Bound Method

The branch and bound method is an exhaustive search technique that may be effectively applied to the min-cut problem with size constraints for small cases. In the branch and bound process, the modules are first ordered in a sequence. For each module, we try placing it to either side of the cut. The process can be represented by a complete binary tree with $|V|$ levels. The root of the tree is the first module in the sequence. The nodes in the $k$th level of the tree correspond to the $k$th module in the sequence. The two branches at each node represent the two trials where the $k$th module is placed on each of the two different sides. A path in the tree from the root to a leaf corresponds to one assignment for the partition.

We use a depth first search approach to traverse the binary tree. We prune the search space according to the size constraint and a partial cut count. In the binary tree, a node at level $k$ along with the path from the root to the node represents a partition assignment of the first $k$ modules. Let $V_1$ and $V_2$ be the two vertex sets of the partitions of the first $k$ modules. If $S(V_i) > S_u$ for $i = 1$ or $2$, the size constraint is violated, and there is no need to proceed. Thus, we prune the branches below.

We also use a partial cut count to prune the binary tree. The cut of the partial partition is expressed as: $E(V_1, V_2) = \{ e_i \mid |e_i \cap V_1| > 0 \text{ and } |e_i \cap V_2| > 0 \}$. The partial cut count is described as: $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$. If the partial cut count $C(V_1, V_2)$ is larger than the cut count of a known solution, the partition results below this node are going to be worse than the existing solution. We prune the branches of such a node.

Complexity of the Method Suppose the circuit has unit size $s_i = 1$ on each module and the size constraint requires an even size $S_l = S_u = |V|/2$ (assuming that $|V|$ is even). Applying Stirling's approximation [63], we have the number of possible partitions:

$$\frac{|V|!}{\left((|V|/2)!\right)^2} \approx \sqrt{\frac{2}{\pi |V|}} \, 2^{|V|}. \tag{29}$$

Although the number of combinations is huge, we have found that the application to small circuits is practical. We improve the efficiency of the pruning by ordering the modules according to their degrees, i.e., the number of nets connecting to the modules, in a descending order. With an elegant implementation, we can find optimal solutions when the number of modules is small, e.g., $|V| \le 60$.

5.2. Dynamic Programming for a Serial and Parallel Graph

For the special case where the circuit can be

represented by a serial and parallel graph of unit module size, we can find a minimum two way partition $(V_1, V_2)$ with size constraints in polynomial time. In this section, we first describe the serial and parallel graph. We then depict a dynamic programming algorithm that solves the partitioning problem on this class of graphs. We assume that all modules are of unit size, i.e., $s_i = 1$.

A serial and parallel graph can be constructed from smaller serial and parallel graphs by serial or parallel process. Each serial and parallel graph has a source module $v_s$ and a sink module $v_t$. A graph $G(V, E)$ with two modules, $V = \{v_s, v_t\}$ and one edge $E = \{e\}$, $e = \{v_s, v_t\}$ is a basic serial and parallel graph. A serial and parallel graph is constructed from the basic graph by a series of serial and parallel processes.

Serial Process Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the sink module $v_{t1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ (Fig. 17(a)). The source module $v_{s1}$ of graph $G_1$ becomes the source module of graph $G$, i.e., $v_s = v_{s1}$. The sink module $v_{t2}$ of graph $G_2$ becomes the sink module of graph $G$, i.e., $v_t = v_{t2}$.

Parallel Process Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the source module $v_{s1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ and by merging the sink module $v_{t1}$ of $G_1$ and the sink module $v_{t2}$ of $G_2$ (Fig. 17(b)). The merged source module and merged sink module become the source module $v_s$ and the sink module $v_t$ of graph $G$, respectively.

FIGURE 17 Construction of serial and parallel graphs.

Dynamic Programming The dynamic programming algorithm performs a bottom up process according to the construction of the serial and parallel graph. It starts from the basic serial and parallel graph. For each graph $G(V, E)$, we derive two tables.

$a(i, j)$: the minimum cut count with $i$ modules on the left hand side and $j$ modules on the right hand side under the condition that source module $v_s$ is on the left hand side and sink module $v_t$ is on the right hand side.

$b(i, j)$: the minimum cut count with $i$ modules on the left hand side and $j$ modules on the right hand side under the condition that both source module $v_s$ and sink module $v_t$ are on the left hand side.

Let graph $G(V, E)$ be constructed with $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ by one of the serial and parallel processes. Let $a_1$, $b_1$ be the tables of graph $G_1$ and $a_2$, $b_2$ be the tables of graph $G_2$. We construct the tables $a$, $b$ of graph $G(V, E)$ as follows.

Table Formulas for Parallel Process

$$a(i, j) = \min_{k + m = |V_2|} a_1(i + 1 - k, \, j + 1 - m) + a_2(k, m), \quad \forall \, i + j = |V|, \tag{30}$$

$$b(i, j) = \min_{k + m = |V_2|} b_1(i + 2 - k, \, j - m) + b_2(k, m), \quad \forall \, i + j = |V|. \tag{31}$$

For table $a(i, j)$, we try all combinations of tables $a_1$ and $a_2$ with the constraint that the number of modules on the left hand side is $i$ and the number of modules on the right hand side is $j$. Note that the extra addition of 1 in the index is used to compensate the merging of the two source modules or the sink modules. For table $b(i, j)$, we try all combinations of tables $b_1$ and $b_2$ with the same size constraint.
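As a rough illustration of the parallel-process formulas (30) and (31), the sketch below stores each table as a Python dictionary keyed by (i, j) and treats missing entries as infinite cost. The function names `basic_tables` and `parallel_merge`, the dictionary representation, and the `weight` parameter of the basic graph are assumptions made for this example; they are not part of the algorithm description above.

```python
import math

INF = math.inf


def basic_tables(weight=1):
    """Tables (a, b, n) of the basic graph: two modules joined by one edge."""
    a = {(1, 1): weight}   # source left, sink right: the edge is cut
    b = {(2, 0): 0}        # both modules on the left: nothing is cut
    return a, b, 2


def parallel_merge(t1, t2):
    """Combine the tables of G1 and G2 under the parallel process, Eqs. (30)-(31)."""
    a1, b1, n1 = t1
    a2, b2, n2 = t2
    n = n1 + n2 - 2        # the two sources and the two sinks are merged
    a, b = {}, {}
    for i in range(n + 1):
        j = n - i          # tables are only defined for i + j = |V|
        best_a = best_b = INF
        for k in range(n2 + 1):
            m = n2 - k     # k + m = |V2|
            best_a = min(best_a,
                         a1.get((i + 1 - k, j + 1 - m), INF) + a2.get((k, m), INF))
            best_b = min(best_b,
                         b1.get((i + 2 - k, j - m), INF) + b2.get((k, m), INF))
        if best_a < INF:
            a[(i, j)] = best_a
        if best_b < INF:
            b[(i, j)] = best_b
    return a, b, n
```

Merging two basic graphs in parallel yields two modules joined by two edges, so the only feasible entries are a(1, 1), which cuts both edges, and b(2, 0), which cuts none.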

Table Formula for Serial Process

$$a(i, j) = \min \Big( \min_{k + m = |V_2|} a_1(i - k, \, j + 1 - m) + b_2(k, m), \; \min_{k + m = |V_2|} b_1(i + 1 - k, \, j - m) + a_2(k, m) \Big), \quad \forall \, i + j = |V|, \tag{32}$$

$$b(i, j) = \min \Big( \min_{k + m = |V_2|} a_1(i - k, \, j + 1 - m) + a_2(m, k), \; \min_{k + m = |V_2|} b_1(i + 1 - k, \, j - m) + b_2(k, m) \Big), \quad \forall \, i + j = |V|. \tag{33}$$

For table $a(i, j)$, we try all combinations of tables $a_1$ and $b_2$ and all combinations of tables $b_1$ and $a_2$. For the combinations of tables $a_1$ and $b_2$, the merged module (by merging $v_{t1}$ and $v_{s2}$) is on the right hand side. For the combinations of tables $b_1$ and $a_2$, the merged module is on the left hand side. For table $b(i, j)$, we try all combinations of tables $a_1$ and $a_2$ and all combinations of tables $b_1$ and $b_2$. For the combinations of tables $a_1$ and $a_2$, the merged module is on the right hand side. In terms of $G_2$, its source module $v_{s2}$ is on the right hand side and its sink module $v_{t2}$ is on the left hand side. Thus, the indices of table $a_2$ are reversed, i.e., $a_2(m, k)$ instead of $a_2(k, m)$. For the combinations of tables $b_1$ and $b_2$, the merged module is on the left hand side.

5.3. Group Migration Algorithms

The group migration algorithm was first proposed by Kernighan and Lin [60] in 1970. Since then, many variations [15, 26, 27, 33, 39, 45, 49, 84, 97-99, 108, 111, 116] have been reported to improve the efficiency and effectiveness of the method. Today, it is still a popular method in practice.

The probability of finding the optimum solution in a single trial drops exponentially as the size of the circuit increases [60]. Using the original version, Kernighan and Lin showed that the probability of obtaining an optimal solution is a function of the problem size, $p(|V|) = 2^{-n/30}$. In other words, if the circuit size is large, then the heuristic Kernighan-Lin algorithm is unlikely to jump out of local minima, and so the optimum solution will not be found. The progress of the method has definitely pushed the envelope further.

In this section, we concentrate on two-way min-cut with size constraints. The method is flexible and can be extended to other partitioning problems with modifications of the moves and the cost function.

The algorithm performs a series of passes. At the beginning of a pass, each module is labeled unlocked. Once a module is shifted, it becomes locked in this pass. The group migration algorithm iteratively interchanges a pair of unlocked modules or shifts a single module to a different side with the largest reduction (gain) of the cost function. This continues until all modules are locked. The lowest cost along the whole sequence of swapping is recorded. The group migration takes the subsequence that produces the lowest cut count and undoes the moves after the point of the lowest cost. This partitioning result is then used as the initial solution for the next pass. The algorithm terminates when a pass fails to find a result with a cost lower than the cost of the previous pass.

Group Migration Algorithm Input: Hypergraph $H(V, E)$ and an initial partition. Cost function and size constraints.

1. One pass of moves.
1.1. Choose and perform the best move.
1.2. Lock the moved modules.
1.3. Update the gain of unlocked modules.
1.4. Repeat Steps 1.1-1.3 until all modules are locked or no move is feasible.
1.5. Find and execute the best subsequence of the move. Undo the rest of the sequence.
2. Use the previous result as an initial partition.
3. Repeat the pass (Steps 1 and 2) until there is no more improvement.

Figure 18 illustrates the cost of a sequence of moves. This algorithm escapes from local optima by a whole sequence of the moves even when a single move may produce a negative gain.
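The pass structure described in this section can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the implementation of [60] or [40]: it assumes unit module sizes and unit net weights, uses single-module shifts only, and recomputes the cut count from scratch for every candidate move instead of maintaining gains incrementally. The names `cut_count`, `one_pass`, `group_migration`, `s_lo` and `s_hi` are hypothetical.

```python
def cut_count(nets, side):
    """Number of nets with pins on both sides of the partition."""
    return sum(1 for net in nets if len({side[v] for v in net}) > 1)


def one_pass(nets, side, s_lo, s_hi):
    """Steps 1.1-1.5: shift and lock modules, then keep the best prefix of moves."""
    side = dict(side)
    sizes = [0, 0]
    for s in side.values():
        sizes[s] += 1
    cost = cut_count(nets, side)
    best_cost, best_len = cost, 0
    unlocked = set(side)
    moves = []
    while unlocked:
        best = None
        for v in sorted(unlocked):
            src, dst = side[v], 1 - side[v]
            if sizes[src] - 1 < s_lo or sizes[dst] + 1 > s_hi:
                continue  # the shift would violate the size constraints
            side[v] = dst
            gain = cost - cut_count(nets, side)
            side[v] = src
            if best is None or gain > best[0]:
                best = (gain, v)
        if best is None:
            break  # no feasible move is left
        gain, v = best  # the gain may be negative; the move is made anyway
        sizes[side[v]] -= 1
        side[v] = 1 - side[v]
        sizes[side[v]] += 1
        cost -= gain
        unlocked.remove(v)  # lock the moved module
        moves.append(v)
        if cost < best_cost:
            best_cost, best_len = cost, len(moves)
    for v in moves[best_len:]:  # undo everything after the lowest-cost point
        side[v] = 1 - side[v]
    return side, best_cost


def group_migration(nets, side, s_lo, s_hi):
    """Steps 2-3: repeat passes until a pass fails to lower the cost."""
    cost = cut_count(nets, side)
    while True:
        new_side, new_cost = one_pass(nets, side, s_lo, s_hi)
        if new_cost >= cost:
            return side, cost
        side, cost = new_side, new_cost
```

Here `side` maps each module to 0 or 1 and `nets` is a list of tuples of modules. Accepting the best move even when its gain is negative, and rolling back to the lowest-cost prefix afterwards, is what lets a pass climb out of a local optimum.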

FIGURE 18 Cost of a sequence of moves and subsequence selection.

In the following, we discuss variations of several parts in the process: basic moves (Step 1.1), data structure, gains (Steps 1.1 and 1.3). At the end of this subsection, we introduce a net based move and a simulated annealing approach.

5.3.1. Basic Moves

Basic moves cover the shifting of a single module and the swapping of a pair of modules. A swapping can be conceived as two consecutive shifts, however, with consideration of the mutual effect between the two shifts.

(i) Module Shifting For each unlocked module, we check its gain: the cost function reduction by shifting the module to a different side assuming that the rest of the modules are fixed. To select the best module to shift, we order on each side the modules according to their shift gains. If the size constraints are violated after the shift, the move is not feasible. We search for the best feasible module to move [40].

(ii) Pairwise Swapping We exchange two modules in two vertex sets of the partition. Note that the gain of the swap is not equal to the sum of the gains of two shifts. The mutual effect between the two modules needs to be included when we derive the gain. Thus, the best pair may not be the two modules on the top of the two sides. The search of all pairs takes $O(|V_1||V_2|)$ operations. In practice, we order modules according to their shift gain. The search of the best pair is limited to the top $k$ modules on each side, e.g., $k = 3$. Thus, the complexity is actually $O(k^2)$.

Pairwise swapping is a natural adoption when the size constraint is tight. When no single shift is feasible, we can use swapping to balance the size of the partition.

5.3.2. Data Structure

The choice of data structure strongly depends on the cost functions, gains, and the characteristic of VLSI circuitry. A sorting structure such as heap or AVL tree is a natural choice to sort for the top modules. However, for the case that the gain differs by very limited quantities, an array structure can simplify the coding and the complexity.

(i) Heap or AVL Tree We can use a heap or AVL tree to sort the modules according to their shift gain. Each side of the partition keeps a heap. The top of the heap is the module of the maximum gain. The sorting of each module takes $O(|V| \log(|V|))$ operations.

(ii) Array (Bucket) of Link List Figure 19 illustrates a bucket list data structure. The gain is transformed to the index of the bucket [40]. Modules of the same gain are stored in the same bucket by a link list. A bucket is an effective data structure when the objective function is the cut count. The gain of cut count is limited by the maximum degrees of the modules, i.e., $deg_{max} = \max_{v_i \in V} \sum_{e \in E(\{v_i\})} c_e$. Thus, the dimension of the bucket is set to be $2 deg_{max}$.

For VLSI applications, the degree of modules is much smaller than the number of modules. Thus, the dimension of the bucket is small. It is very efficient to search and revise the module order in the bucket structure. In fact, it is proven that using the bucket structure and cut count as the objective

FIGURE 19 Bucket list.

function, it takes linear time proportional to the total number of pins to perform each pass [40].

5.3.3. Gains

In this subsection, we use cut count as the objective function. The extension to other cost functions is possible. However, we may lose efficiency.

(i) Shift Gain We use shift model for multiple pin nets. Given a module $v_i$, we check the set $E(\{v_i\})$ of nets connecting to this module. The contribution of each net $e \in E(\{v_i\})$ by shifting module $v_i$ is the gain $g_e(v_i)$ of the net with respect to module $v_i$. The gain $g(v_i)$ of module $v_i$ is the total gains of all its adjacent nets, i.e., $g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i)$.

(ii) Swap Gain The swap gain is the sum of the gains of two modules $v_i$ and $v_j$, deducting the effect on common nets, i.e., $g(v_i) + g(v_j) - \sum_{e \in E(\{v_i\}) \cap E(\{v_j\})} (g_e(v_i) + g_e(v_j))$.

(iii) Weights of Multipin Nets The sequence of the move depends much on the gain calculation. For a circuit of 1,000,000 modules, suppose the degree of most modules is less than 100 and each net is of unit weight. We have roughly 1,000,000 modules/200 gain levels = 5,000 modules per gain level. To differentiate these 5,000 modules, we have to adjust the weight of multiple pin nets.

(iii) (a) Levels with Priority The first level gain is identical to the shift gain of cut count. The second level gain is equal to the number of nets that have one more pins on the same side. Thus, the $k$th level gain is equal to the number of nets that have $k$ more pins on the same side [65]. The pins on the other side will increase by one after the module is shifted. Thus, the negative gain of level $k$ is contributed by the nets with $k - 1$ pins on the other side.

Let us assume that module $v_i$ is in vertex set $V_1$ to simplify the notation. For each net $e_j \in E(\{v_i\})$, we denote $k_j = |e_j \cap V_1|$ the number of pins in $V_1$. Let us define $E(+, i, k)$ to be the set of nets $e_j \in E(\{v_i\})$ with $k_j = k + 1$ pins in $V_1$ (the extra one is used to count module $v_i$ itself) and nonzero pins in $V_2$, i.e., $|e_j| > k_j$. And $E(-, i, k)$ to be the set of nets $e_j \in E(\{v_i\})$ with no other pins in $V_1$ and $k - 1$ pins in $V_2$, i.e., $|e_j| = k$ and $k_j = 1$. Then, the $k$th level gain of module $v_i$, $g_i(k)$, is the weight difference of the two sets, $E(+, i, k)$ and $E(-, i, k)$.

$$g_i(k) = \sum_{e \in E(+, i, k)} c_e - \sum_{e \in E(-, i, k)} c_e \tag{34}$$

$$E(+, i, k) = \{ e_j \mid e_j \in E(\{v_i\}), \; k_j = k + 1, \; |e_j| > k_j \} \tag{35}$$

$$E(-, i, k) = \{ e_j \mid e_j \in E(\{v_i\}), \; k_j = 1, \; |e_j| = k \} \tag{36}$$
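Before moving on, the plain shift gain and swap gain of items (i) and (ii) can be made concrete with a small Python sketch. It assumes a two-way partition stored as a dictionary `side` mapping each module to 0 or 1, nets stored as tuples of modules, and cut count as the cost; the function names `net_gain`, `shift_gain` and `swap_gain` are hypothetical and chosen only for this example.

```python
def net_gain(net, weight, side, v):
    """Cut-count reduction contributed by one net if module v alone is shifted."""
    own = sum(1 for u in net if side[u] == side[v])
    other = len(net) - own
    if own == 1 and other > 0:
        return weight    # v is the last pin on its side: the net leaves the cut
    if own > 1 and other == 0:
        return -weight   # the net lies entirely on v's side: shifting v cuts it
    return 0


def shift_gain(nets, weights, side, v):
    """g(v): total gain over all nets connecting module v."""
    return sum(net_gain(e, c, side, v) for e, c in zip(nets, weights) if v in e)


def swap_gain(nets, weights, side, vi, vj):
    """Gain of exchanging vi and vj, which must lie on different sides."""
    common = sum(net_gain(e, c, side, vi) + net_gain(e, c, side, vj)
                 for e, c in zip(nets, weights) if vi in e and vj in e)
    return (shift_gain(nets, weights, side, vi)
            + shift_gain(nets, weights, side, vj)
            - common)
```

A net shared by the two swapped modules has a pin on each side both before and after the exchange, so it contributes nothing to the change in cut count; subtracting the two shift-gain contributions of the common nets accounts for exactly that.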

Q
We compare the modules with a priority on the to V2. Hence, ce  j6ˆi;vj 2e\V1 p…vj † is the expected
lower level gain. In other words, we compare the gain if module vi is shifted. The second term
Q
®rst level ®rst. If the modules are equal at the ®rst vj 2e\V2 p…vj † is the potential Q
that the pins in V2
level gain, we then compare the second level and so will shift to V1. Thus, ce  vj 2e\V2 p…vj † is the
on. In practice, we limit the number of levels by a threshold, e.g., $l \le 3$.

(iii) (b) Probabilistic Gain. In the probabilistic gain model [37], each module $v_i$ is assigned a weight $p(v_i)$. The weight $p(v_i)$ is a function of the gain $g(v_i)$ of module $v_i$ and reflects the belief level (potential) that the shift of module $v_i$ will be executed at the end of the pass. Thus, if module $v_i$ is unlocked,

$$p(v_i) = f(g(v_i)). \tag{37}$$

Otherwise, $p(v_i) = 0$. Figure 20 illustrates function $f$, which increases monotonically. The slope between $g_0$ and $g_{up}$ amplifies the difference of gains. The function is clamped at its two ends, $p_{max}$ and $p_{min}$ ($0 \le p_{min} < p_{max} \le 1$), which represent the maximum potential that the module will shift or stay.

For each net $e \in E(\{v_i\})$, its contribution $g_e(v_i)$ to the gain of module $v_i$ is the tendency that the whole net will shift with module $v_i$ to the other side. To simplify the notation, let us assume that module $v_i$ is in $V_1$. Thus, we have the following expression:

$$g_e(v_i) = c_e \left( \prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j) - \prod_{v_j \in e \cap V_2} p(v_j) \right) \tag{38}$$

where $\prod_{v_j \in S} p(v_j) = 1$ if $S$ is an empty set. The first term $\prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j)$ in the parentheses is the potential that all the pins will shift with module $v_i$, and the second term $\prod_{v_j \in e \cap V_2} p(v_j)$ is the expected loss if module $v_i$ is shifted.

The gain of a module $v_i$ is the total gain of the adjacent nets with respect to this module, i.e.,

$$g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i). \tag{39}$$

Net gain $g_e(v_i)$ and module potential $p(v_i)$ are mutually dependent. We derive the values via iterations. Initially, we use the plain shift gain (by cut count) to derive the potential $p(v_i) = f(g(v_i))$. From these initial potentials, we derive the probabilistic net gain. The net gain is then used to derive the module gain. In practice, we stop after a limited number of cycles, e.g., two iterations [37]. Note that there is no guarantee that the iteration will converge.

After each move, the associated module potentials and probabilistic net gains are updated and the plain cut count is recorded. The exact cut count is used when we select the subsequence of moves to execute.

It has been shown via benchmarks released by ACM/SIGDA that the probabilistic gain model produces excellent partitioning results; it outperforms the other gain models by wide margins.
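The following is a minimal sketch of Eqs. (37)-(39), not the implementation of [37]. The linear ramp used for $f$, the default thresholds `g0`, `gup`, `pmin`, `pmax`, and all function names are assumptions made for illustration; the text only requires $f$ to be monotonic and clamped. The sketch also relies on the observation that Eq. (38) with all potentials set to zero reduces to the plain cut-count gain, which gives a convenient starting point for the iteration.

```python
from math import prod

def potential(gain, g0=-1.0, gup=1.0, pmin=0.05, pmax=0.95):
    """Eq. (37): clamped, monotonically increasing map from gain to potential.
    The linear ramp and the default thresholds are illustrative assumptions."""
    if gain <= g0:
        return pmin
    if gain >= gup:
        return pmax
    return pmin + (pmax - pmin) * (gain - g0) / (gup - g0)

def net_gain(v, net, cost, side, p):
    """Eq. (38) for module v on one net, written symmetrically for both sides."""
    same = prod(p[u] for u in net if u != v and side[u] == side[v])
    other = prod(p[u] for u in net if side[u] != side[v])
    return cost * (same - other)   # an empty product is 1, as required below Eq. (38)

def module_gains(nets, costs, side, p):
    """Eq. (39): sum the net gains over the nets adjacent to each module."""
    gains = {v: 0.0 for v in side}
    for net, cost in zip(nets, costs):
        for v in net:
            gains[v] += net_gain(v, net, cost, side, p)
    return gains

def probabilistic_gains(nets, costs, side, locked=(), rounds=2):
    """Alternate between potentials and gains for a fixed number of rounds."""
    p = {v: 0.0 for v in side}     # zero potentials reproduce the plain cut-count gain
    gains = module_gains(nets, costs, side, p)
    for _ in range(rounds):
        p = {v: 0.0 if v in locked else potential(gains[v]) for v in side}
        gains = module_gains(nets, costs, side, p)
    return gains, p
```

For a three-module chain `a - b - c` with `a` and `b` on one side and `c` on the other, `module_gains` with zero potentials returns the usual gains of -1, 0 and 1, and `probabilistic_gains` then refines them for two rounds, mirroring the limited number of cycles mentioned above.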
FIGURE 20 Function of probabilistic gain.

5.3.4. Net-based Move

The net based process [115, 32] is similar to the module based approach except that all operations are based on the concept of the critical and complementary critical sets. The main differences are: (1) Instead of a single module, each move now shifts one critical or complementary critical set, depending on the type of objective function. For convenience, we say a move is initiated by a net $e_u$ if this move is composed of shifting the critical or complementary critical set associated with $e_u$. (2) The locking mechanism is operated on a net, that is, if the critical or complementary critical set of a
net has been moved then all the moves initiated by this net will be prohibited thereafter.

Given a net $e_u$ and a vertex set $V_b$, let us define the critical set of net $e_u$ with respect to set $V_b$ as

$$S_{ub} = e_u \cap V_b, \tag{40}$$

and the complementary critical set of $e_u$ with respect to set $V_b$ as

$$\bar{S}_{ub} = e_u \cap \bar{V}_b. \tag{41}$$

For a move associated with a net $e_u$, we can either place the critical set $S_{ub}$ into a partition other than $V_b$, or the complementary critical set $\bar{S}_{ub}$ into the partition $V_b$. The gain of each move is then computed by evaluating the change of the cost due to the move of the critical or complementary critical set.

Usage of Basic Module Moves. Although the net-based move model provides a different process to improve the current partition, it is more expensive than the module-based move model because more modules are involved in each move.

We can mimic the net based move by adding weights to the connectivity of desired nets [38]. The basic move is still based on the modules. However, after module $v_i$ is moved, we add more weight on the nets connecting to $v_i$, i.e., $E(\{v_i\})$. These extra weights encourage the adjacent modules to go along with module $v_i$ and thus achieve the effect of a net based move. Empirical study finds improvement on the partitioning results.

5.3.5. Simulated Annealing Approach

For simulated annealing [20, 81, 62, 56], we can adopt the basic moves such as module shifting and pairwise swapping. There is no need for a lock mechanism. To allow a larger searching space, we incorporate the size constraints into the objective function, e.g.,

$$C(V_1, V_2) + \alpha\,(S(V_1) - S(V_2))^2, \tag{42}$$

where $\alpha$ is a coefficient. We can adjust it according to the annealing temperature. As the temperature drops, we gradually increase $\alpha$ to enforce the size balance.

5.4. Flow Approaches

In this section, we assume that the circuit can be represented by a graph $G(V, E)$ with unit module size, i.e., $s_i = 1$, and that all nets are two pin nets. The flow approach can be extended to multiple pin nets using a flow model.

We first go through maximum flow minimum cut [1, 73] to introduce the duality [30] and the concept of shadow price. The derivation is then extended to a weighted cluster ratio cut and a replication cut. Finally, we introduce heuristic algorithms that accelerate the flow calculation. The flow approach can derive excellent results. Furthermore, exploiting its duality formulation, we can derive a tight bound of the optimal solutions.

5.4.1. Maximum Flow Minimum Cut

In the maximum flow minimum cut formulation, the flow injects into module $v_s$ and drains from module $v_t$. The flow is conservative at all other modules. The capacity of net $e_{ij}$ is equal to its connectivity, $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. The notation $x_{ij}$ denotes the amount of flow from module $v_i$ to module $v_j$ and $x_{ji}$ denotes the amount of flow from module $v_j$ to module $v_i$ on net $e_{ij}$. The objective is to maximize the flow injection $f$ into $v_s$.

$$\text{Obj}: \max f \tag{43}$$

subject to the constraints,

$$x_{ij} + x_{ji} \le c_{ij}, \quad \forall\, 1 \le i, j \le |V| \tag{44}$$

$$\sum_{j=1}^{|V|} x_{js} - \sum_{j=1}^{|V|} x_{sj} - f = 0 \tag{45}$$

$$\sum_{j=1}^{|V|} x_{jt} - \sum_{j=1}^{|V|} x_{tj} + f = 0 \tag{46}$$
$$\sum_{j=1}^{|V|} x_{ij} - \sum_{j=1}^{|V|} x_{ji} = 0, \quad \forall\, 1 \le i \le |V|,\ v_i \ne v_s, v_t \tag{47}$$

$$x_{ij} \ge 0, \quad \forall\, 1 \le i, j \le |V|. \tag{48}$$

To derive the duality, we use shadow prices: a bidirectional distance $d_{ij}$ for each net $e_{ij}$ (Eq. (44)) and a potential $\lambda_i$ for each module $v_i$ (Eqs. (45)-(47)). The dual problem can be expressed as follows [30].

$$\text{Obj}: \min \sum_{e_{ij} \in E} c_{ij} d_{ij} \tag{49}$$

subject to

$$d_{ij} \ge |\lambda_i - \lambda_j|, \quad \forall\, 1 \le i, j \le |V|, \tag{50}$$

$$\lambda_t - \lambda_s = 1. \tag{51}$$

Figure 21 illustrates the formulation. As we increase the flow, certain nets are going to saturate, i.e., the two sides of inequality (44) become equal. Once the saturated nets become a bottleneck of the flow, the set of nets forms a cut $E(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. In duality, the potential of modules in $V_2$ increases to one, and the potential of modules in $V_1$ remains zero, i.e., $\lambda_i = 1$, $\forall v_i \in V_2$ and $\lambda_i = 0$, $\forall v_i \in V_1$. The distance of nets in the cut is one, while the distance of nets outside the cut is zero, i.e., $d_{ij} = 1$, $\forall e_{ij} \in E(V_1, V_2)$ and $d_{ij} = 0$, $\forall e_{ij} \notin E(V_1, V_2)$.

FIGURE 21 Illustration of maximum flow minimum cut formulation.

5.4.2. The Weighted Cluster Ratio Metric and a Uniform Multi-commodity Flow Problem

In a uniform multi-commodity flow problem [74, 75], the demand of flow between each pair of modules is equal to an identical value $f$. As we keep increasing $f$, some of the nets become saturated. These saturated nets form a bottleneck of communication and thus prescribe a potential clustering of the communication system [71].

We simplify the notation by assuming a graph model $G(V, E)$. From each module $v_p$, we inject flow $f/2$ to each of the remaining modules. Summing up the flow in two directions, the flow between each pair of modules is $f$. We define the flow originated from module $v_p$ as commodity $p$. Let $x_{ij}^{(p)}$ be the flow for commodity $p$ on net $e_{ij}$. The objective is to maximize $f$:

$$\text{Obj}: \max f \tag{52}$$

subject to the flow demand from module $v_p$ to the other modules $v_i$,

$$\sum_{j=1}^{|V|} x_{ij}^{(p)} - \sum_{j=1}^{|V|} x_{ji}^{(p)} = \begin{cases} -f/2 & \text{if } i \ne p, \\ (|V| - 1) f/2 & \text{if } i = p, \end{cases} \quad 1 \le i, p \le |V|, \tag{53}$$

and the net capacity constraint,



$$\sum_{p=1}^{|V|} x_{ij}^{(p)} + \sum_{p=1}^{|V|} x_{ji}^{(p)} \le c_{ij}, \quad 1 \le i, j \le |V|. \tag{54}$$

We transform the above linear programming problem to its dual expression by assigning dual variables $\lambda_i^{(p)}$ to module $v_i$ with respect to commodity $p$ (Eq. (53)), and distance $d_{ij}$ to net $e_{ij}$ (Eq. (54)); then we have:

$$\text{Obj}: \min \sum_{e_{ij} \in E} c_{ij} d_{ij} \tag{55}$$

subject to

$$d_{ij} \ge \lambda_i^{(p)} - \lambda_j^{(p)}, \quad 1 \le i, j, p \le |V| \tag{56}$$

$$\frac{1}{2} \sum_{p=1}^{|V|} \sum_{i=1, i \ne p}^{|V|} \left( \lambda_i^{(p)} - \lambda_p^{(p)} \right) \ge 1 \tag{57}$$

The Properties of Shadow Prices. The shadow price $d_{ij}$ can be viewed as bidirectional, i.e., $d_{ij} = d_{ji}$. It represents the distance of net $e_{ij}$, which corresponds to the cost to transmit flow through $e_{ij}$. Variable $\lambda_i^{(p)}$ is the potential of module $v_i$ with respect to commodity $p$.

From constraints (56), (57), we can derive two properties for distance function $d_{ij}$ and potential $\lambda_i^{(p)}$ [71].

Property I: Triangular Inequality. The distance metric $d_{ij}$ satisfies the triangular inequality:

$$d_{ij} + d_{jk} \ge d_{ik}, \quad \forall v_i, v_j, v_k \in V \tag{58}$$

Property II: Potential Function. The term $\lambda_i^{(p)} - \lambda_p^{(p)}$ in expression (56) is equal to the shortest distance between modules $v_i$ and $v_p$ based on net distances $d_{ij}$. In fact, from the triangular inequality, we obtain $\lambda_i^{(p)} - \lambda_p^{(p)} = d_{ip}$.

We normalize the objective function (55) with the left hand side terms of inequality (57). The objective function can be expressed as:

$$\text{Obj}: \min \frac{\sum_{e_{ij} \in E} c_{ij} d_{ij}}{(1/2) \sum_{p=1}^{|V|} \sum_{i=1, i \ne p}^{|V|} \left( \lambda_i^{(p)} - \lambda_p^{(p)} \right)} = \frac{\sum_{e_{ij} \in E} c_{ij} d_{ij}}{(1/2) \sum_{p=1}^{|V|} \sum_{i=1, i \ne p}^{|V|} d_{ip}} \tag{59}$$

In the solution of linear programming problem (52)-(56), the nets with positive $d_{ij}$ values partition $V$ into vertex sets $V_1, V_2, \ldots, V_k$. More specifically, nets connecting modules in different sets, $V_i$, $V_j$, $i \ne j$, have the same distance $d_{ij}$ values (we use $d_{ij}$ to denote the distance between vertex sets $V_i$ and $V_j$ when this does not cause confusion), while nets connecting only modules in the same subgraph have zero distance, $d_{ij} = 0$ (Fig. 22). We can rewrite the denominator of the objective function and state the problem as follows.

FIGURE 22 Distance between clusters.

Statement of Weighted Cluster Ratio Cut [103]. Find the distance $d_{ij}$ and the number of partitions $k$ with an objective function of weighted cluster ratio:

$$\min_{d_{ij}, k} W_C(V_1, V_2, \ldots, V_k) = \min_{d_{ij}, k} \frac{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} d_{ij}\, C(V_i, V_j)}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} d_{ij}\, S(V_i) \times S(V_j)} \tag{60}$$

where distance $d_{ij}$ is subject to the property of triangular inequality.

According to the mechanism of the duality, the objective functions of the primal and dual formulations are equal when the solution is optimal [25].

THEOREM 5.1 For feasible solutions, we have the inequality $f \le W_C(V_1, V_2, \ldots, V_k)$. The equality holds when the solution is optimal, i.e., the maximum uniform multicommodity flow equals the minimum weighted cluster ratio of any cut, $\max_{x_{ij}} f = \min_{d_{ij}, k} W_C(V_1, V_2, \ldots, V_k)$.
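To make expression (60) concrete, the sketch below evaluates the weighted cluster ratio of a given clustering. It only evaluates the metric for fixed inputs and does not solve the minimization. The data layout (a dictionary of two-pin nets, a module-to-cluster map, and distances keyed by unordered cluster pairs) and the function names are assumptions chosen for illustration.

```python
from itertools import combinations

def cut_weight(nets, part_of, a, b):
    """C(V_a, V_b): total connectivity of the two-pin nets between clusters a and b."""
    return sum(c for (u, v), c in nets.items()
               if {part_of[u], part_of[v]} == {a, b})

def weighted_cluster_ratio(nets, part_of, size, dist):
    """Expression (60) for a fixed clustering and fixed inter-cluster distances."""
    clusters = sorted(set(part_of.values()))
    # S(V_k): total module size of each cluster
    total = {k: sum(size[v] for v in part_of if part_of[v] == k) for k in clusters}
    numerator = denominator = 0.0
    for a, b in combinations(clusters, 2):      # every unordered pair of clusters once
        d = dist[frozenset((a, b))]
        numerator += d * cut_weight(nets, part_of, a, b)
        denominator += d * total[a] * total[b]
    return numerator / denominator
```

With two clusters and a single distance value, the result is the ordinary ratio cut $C(V_1, V_2) / (S(V_1) \times S(V_2))$, since the common distance cancels between numerator and denominator.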
Expression (60), the weighted cluster ratio [103], is similar to the cluster ratio with a weighted metric $d_{ij}$. In general, the solution for the minimum weighted cluster ratio does not directly correspond to the partition of optimum cluster ratio. However, if distance $d_{ij}$ is a constant value between all pairs of vertex sets $V_i$ and $V_j$, then the weighted cluster ratio provides the solution for the cluster ratio.

When the nets with positive distance $d_{ij}$ form a two-way partition, we can show that the partition defines the ratio cut. When the nets with positive distances form a $k$-way partition with $k \le 4$, we also find that there exists a two-way partition that again defines the ratio cut [28].

THEOREM 5.2 Let net set $D = \{e_{ij} \mid d_{ij} > 0\}$ define a cut that separates the circuit into $k$ disconnected subsets. If $k \le 4$, then there exists a ratio cut that is a subset of $D$.

5.4.3. A Replication Cut for Two-way Partitioning

We adopt the linear programming formulation of the network flow problem [1, 30], where each module is assigned a potential and a cut is represented by the difference of module potentials as shown in Figure 23. With respect to the directed cut $E(V_1 \to \bar{V}_1)$, we use $w_{ij}$ to denote the potential difference between the cut from module $v_i \in V_1$ to module $v_j \notin V_1$. The potential of each module $v_i$ is denoted by $p_i$. For modules $v_i$ in $V_1$, $p_i = 1$, and for modules $v_i$ in $\bar{V}_1$, $p_i = 0$. Thus all nets $e_{ij} \in E(V_1 \to \bar{V}_1)$ have $w_{ij} = 1$. The remaining nets have $w_{ij} = 0$.

FIGURE 23 p potential and q potential of each module.

With respect to the directed cut $E(V_2 \to \bar{V}_2)$, we use $u_{ji}$ with a reversed subscript $ji$ to denote the potential difference between the cut from module $v_i \in V_2$ to module $v_j \notin V_2$ (Fig. 23). The potential of each module $v_i$ is denoted by $q_i$. For modules $v_i$ in $\bar{V}_2$, $q_i = 1$, and for modules $v_i$ in $V_2$, $q_i = 0$. The potential difference $u_{ji}$ has a reverse direction with net $e_{ij}$ because we set the potential on the $\bar{V}_2$ side high and the potential on the $V_2$ side low. All nets $e_{ij} \in E(V_2 \to \bar{V}_2)$ have $u_{ji} = 1$. The remaining nets have $u_{ji} = 0$.

Primal Linear Programming Formulation. The problem is to minimize the total weight of crossing nets:

$$\text{Obj}: \min \sum_{e_{ij} \in E} c_{ij} w_{ij} + \sum_{e_{ij} \in E} c_{ji} u_{ij} \tag{61}$$

subject to

$$w_{ij} - p_i + p_j \ge 0, \quad \forall\, 1 \le i, j \le |V| \tag{62}$$

$$u_{ij} - q_i + q_j \ge 0, \quad \forall\, 1 \le i, j \le |V| \tag{63}$$

$$q_i - p_i \ge 0, \quad \forall v_i \in V,\ v_i \ne v_s, v_t \tag{64}$$

$$p_s = 1 \tag{65}$$

$$q_s = 1 \tag{66}$$

$$p_t = 0 \tag{67}$$

$$q_t = 0 \tag{68}$$

$$w_{ij}, u_{ij} \ge 0, \quad \forall\, 1 \le i, j \le |V| \tag{69}$$

To minimize objective function (61), the equality of constraint (62) holds, i.e., $w_{ij} = p_i - p_j$ if $p_i \ge p_j$; otherwise, $w_{ij} = 0$. Similarly, constraint (63) requires $u_{ij} = q_i - q_j$ if $q_i \ge q_j$; otherwise $u_{ij} = 0$. Expression (64) demands that potential $q_i$ be not less than potential $p_i$ for any module $v_i \in V$. Since high
potential $p_i$ corresponds to set $V_1$, and high potential $q_i$ corresponds to set $\bar{V}_2$, inequality (64) enforces that $V_1$ be a subset of $\bar{V}_2$. Consequently, the requirement that $V_1 \cap V_2 = \emptyset$ is satisfied.

Constraints (65)-(68) set the potentials of modules $v_s$ and $v_t$. Constraint (69) requires the potential differences $w_{ij}$ and $u_{ij}$ to be nonnegative. Figure 23 shows one ideal potential configuration of the solution.

Dual Linear Programming Formulation. If we assign dual variables (Lagrangian multipliers) $x_{ij}$ to inequality (62) with respect to each net, $x'_{ij}$ to inequality (63), $\lambda_i$ to inequality (64) with respect to module $v_i$, and $a_s$, $b_s$, $a_t$, $b_t$ to constraints (65)-(68), respectively, then we have the dual formulation.

$$\text{Obj}: \max a_s + b_s \tag{70}$$

subject to

$$x_{ij} \le c_{ij}, \quad \forall\, 1 \le i, j \le |V| \tag{71}$$

$$x'_{ij} \le c_{ji}, \quad \forall\, 1 \le i, j \le |V| \tag{72}$$

$$\sum_{j=1}^{|V|} (-x_{ij} + x_{ji}) - \lambda_i = 0, \quad \forall v_i \in V,\ v_i \ne v_s, v_t \tag{73}$$

$$\sum_{j=1}^{|V|} (-x'_{ij} + x'_{ji}) + \lambda_i = 0, \quad \forall v_i \in V,\ v_i \ne v_s, v_t \tag{74}$$

$$\sum_{j=1}^{|V|} (-x_{sj} + x_{js}) + a_s = 0 \tag{75}$$

$$\sum_{j=1}^{|V|} (-x_{tj} + x_{jt}) + a_t = 0 \tag{76}$$

$$\sum_{j=1}^{|V|} (-x'_{sj} + x'_{js}) + b_s = 0 \tag{77}$$

$$\sum_{j=1}^{|V|} (-x'_{tj} + x'_{jt}) + b_t = 0 \tag{78}$$

$$\lambda_i, x_{ij}, x'_{ji} \ge 0, \quad \forall\, 1 \le i, j \le |V|,\ v_i \ne v_s, v_t \tag{79}$$

$$a_s, a_t, b_s, b_t \ \text{unrestricted} \tag{80}$$

where inequalities (71), (72) are derived with respect to each $w_{ij}$ and $u_{ij}$, respectively. Similarly, Eqs. (73)-(78) are derived with respect to each $p_i$, $q_i$, $p_s$, $p_t$, $q_s$ and $q_t$. The equality of Eqs. (73)-(78) holds because $p_i$, $q_i$, $p_s$, $p_t$, $q_s$ and $q_t$ are not restricted in sign in the primal formulation. Variables $\lambda_i$, $x_{ij}$, and $x'_{ij}$ are nonnegative in Eq. (79) because their corresponding expressions (62)-(64) are inequality constraints.

We can view $G(V, E)$ as a network flow problem and interpret $c_{ij}$ as the flow capacity and $x_{ij}$ as the flow of net $e_{ij}$. Constraint (71) requires that the flow $x_{ij}$ be not larger than the flow capacity $c_{ij}$ on each net $e_{ij}$. In constraint (72), the set of nets is in a reversed direction and flow $x'_{ij}$ is not larger than the capacity $c_{ji}$ of net $e_{ji}$ in $E$. Corresponding to $G(V, E)$, we use $G'(V', E')$ to denote the reversed graph.

Constraint (73) has the total flow $x_{ij}$ injected from module $v_i$ into $G$ be equal to $-\lambda_i$. On the other hand, constraint (74) has the total flow $x'_{ij}$ injected from module $v'_i$ into $G'$ be equal to $\lambda_i$. Combining Eqs. (73) and (74), we have

$$\sum_{j} (-x_{ij} + x_{ji}) = \lambda_i = \sum_{j} (x'_{ij} - x'_{ji}). \tag{81}$$

This means that the amount of flow $\lambda_i$ which emanates from module $v_i$ in $G$ enters its corresponding module $v'_i$ in $G'$.

Constraints (75)-(78) indicate that $a_s$ and $b_s$ are the flow injections to module $v_s$ in $G$ and its reversed circuit $G'$; $a_t$ and $b_t$ are the flow ejections from module $v_t$ in $G$ and its reversed circuit $G'$, respectively. Combining circuits $G$ and $G'$ together, we have the maximum total flow, $a_s + b_s$, be the optimum solution of the minimum replication cut problem.

5.4.4. The Optimum Partition

In this subsection, we describe the construction of the replication graph and take an example to describe
it. We then apply the maximum flow algorithm on the constructed replication graph to derive an optimum replication cut. The optimality of the derived replication cut is proved by using a network flow approach.

Construction of Replication Graph. Given a circuit $G(V, E)$ and modules $v_s$ and $v_t$, we construct another circuit $G'(V', E')$ where $|V'| = |V|$ with each module $v'_i$ in $V'$ corresponding to a module $v_i$ in $V$, and $|E'| = |E|$ with each directed net $e'_{ij}$ in $E'$ in the reverse direction of net $e_{ij}$ in $E$. We create super modules $v_s^*$ and $v_t^*$ and nets $(v_s^*, v_s)$, $(v_s^*, v'_s)$, $(v_t, v_t^*)$, and $(v'_t, v_t^*)$ with infinite capacity as shown in Figure 24. From every module $v_i$ in $V$ except $v_s$ and $v_t$, we add a directed net of infinite capacity to the corresponding module $v'_i$ in $V'$. We refer to the combined circuit as $G^*$.

FIGURE 24 The replication graph $G^*$.

Polynomial-time Algorithm. The optimum replication cut problem with respect to module pair $v_s$ and $v_t$ and without size constraints can be solved by a maximum-flow minimum-cut solution of the circuit $G^*$ with $v_s^*$ as the source and $v_t^*$ as the sink of the flow (Fig. 24). Suppose the maximum-flow minimum-cut finds partition $(X, \bar{X})$ of $V$ with $v_s \in X$ and $v_t \in \bar{X}$ and partition $(X', \bar{X}')$ of $V'$ with $v'_s \in X'$ and $v'_t \in \bar{X}'$. Then a replication cut $(V_1, V_2)$ of the original circuit with $V_1 = X$, $V_2 = \{v_i \mid v'_i \in \bar{X}'\}$ and $R = V - V_1 - V_2$ is an optimum solution. Note that $V_2$ is derived from the cut in vertex set $V'$. To simplify the notation, we shall use $(X, \bar{X}')$ to denote the derived replication cut of $G$.

Example. Given a circuit in Figure 25, its replication graph $G^*$ is constructed as shown in Figure 26. The maximum-flow minimum-cut of $G^*$ derives $(X, \bar{X}) = (\{v_s, v_a\}, \{v_b, v_c, v_t\})$ and $(X', \bar{X}') = (\{v'_s, v'_a, v'_b, v'_c\}, \{v'_t\})$ with a flow amount of 5 (Fig. 26). Thus the sets $V_1 = \{v_s, v_a\}$ and $V_2 = \{v_t\}$ define an optimum replication cut $R(V_1, V_2)$ with $R = \{v_b, v_c\}$ and a cut cost equal to 5 (Fig. 27).

The network flow approach leads to the optimality of the solution as stated in the following theorem.

THEOREM 5.3 The replication cut $R(X, \bar{X}')$ derived from the transformed circuit $G^*$ generates the minimum replication cut count $C_R(X, \bar{X}')$ (expression (19)).

5.4.5. Heuristic Flow Algorithms

We introduce the heuristic approaches that accelerate the flow calculation and take advantage of the optimality properties of the flow methods. We first introduce an approach that utilizes the maximum flow minimum cut method for the min cut with



size constraints. We then explain a shortest path method for multiple commodity flow calculation.

FIGURE 25 A five module circuit to demonstrate the replication cut.

FIGURE 26 The constructed replication graph of the circuit shown in Figure 25.

(i) Usage of Maximum Flow Minimum Cut. We adopt a heuristic approach [113] to get around the unbalanced partition of the maximum flow and minimum cut method. First, we find two seeds as the source and the sink modules, $v_s$, $v_t$. We then use the maximum flow and minimum cut method to find partition $(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. Suppose the size $S(V_1)$ of $V_1$ is larger than the size $S(V_2)$ of $V_2$; then we find from $V_1$ a module $v_i$ to merge with $V_2$ and shrink set $V_2$ as a new sink module. Otherwise, we find from $V_2$ a module $v_i$ to merge with $V_1$ and shrink set $V_1$ as a new source module. We repeat the maximum flow minimum cut process on the graph with the new source or sink module until the size of the partition fits the size constraint.

Two Way Partitioning using Maximum Flow Minimum Cut

1. Find two seeds as $v_s$ and $v_t$.
2. Call Maximum Flow Minimum Cut to find partition $(V_1, V_2)$.
3. If $S(V_1) > S(V_2)$, find a seed $v_i \in V_1$, merge $\{v_i\} \cup V_2$ into a new sink module $v_t$.
4. Else find a seed $v_i \in V_2$, merge $\{v_i\} \cup V_1$ into a new source module $v_s$.
5. Repeat Steps 1-4, until $S_l < S(V_1) < S_u$ and $S_l < S(V_2) < S_u$.

We can apply a parametric flow approach to the recursive maximum flow minimum cut problems (Step 2). The total complexity is then equivalent to that of a single maximum flow minimum cut.
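The loop in Steps 2-5 can be sketched as follows. This is an illustrative reading of the heuristic, not the implementation of [113]: it assumes unit module sizes, uses a plain Edmonds-Karp maximum flow that is recomputed from scratch on every pass (no parametric flow), models "merging into the sink or source" by tying the chosen module to a super-terminal with infinite capacity, picks as seed the module most connected to the other side, and treats the size bounds as inclusive. All function names and the super-terminal labels are hypothetical.

```python
from collections import deque

INF = float("inf")

def undirected(edges):
    """Symmetric capacity map cap[u][v] = cap[v][u] = c_uv for two-pin nets."""
    cap = {}
    for (u, v), c in edges.items():
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {})[u] = c
    return cap

def min_cut(cap, sources, sinks):
    """Edmonds-Karp max flow; returns the modules on the source side of a minimum cut."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}   # residual capacities
    S, T = "_source", "_sink"                          # hypothetical super-terminals
    res[S], res[T] = {}, {}
    for s in sources:
        res[S][s], res[s][S] = INF, 0
    for t in sinks:
        res[t][T], res[T][t] = INF, 0
    while True:
        parent = {S: None}
        queue = deque([S])
        while queue and T not in parent:               # breadth-first augmenting path
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if T not in parent:                            # no augmenting path: flow is maximum
            return set(parent) - {S}
        push, v = INF, T
        while parent[v] is not None:                   # bottleneck capacity on the path
            push = min(push, res[parent[v]][v])
            v = parent[v]
        v = T
        while parent[v] is not None:                   # augment along the path
            u = parent[v]
            res[u][v] -= push
            res[v][u] += push
            v = u

def balanced_bipartition(cap, vs, vt, lo, hi):
    """Steps 2-5 with unit module sizes and inclusive bounds lo <= |V1|, |V2| <= hi."""
    sources, sinks = {vs}, {vt}
    while True:
        V1 = min_cut(cap, sources, sinks)
        V2 = set(cap) - V1
        if lo <= len(V1) <= hi and lo <= len(V2) <= hi:
            return V1, V2
        big, small, fixed, grow = ((V1, V2, sources, sinks) if len(V1) > len(V2)
                                   else (V2, V1, sinks, sources))
        candidates = sorted(big - fixed)
        if not candidates:                             # nothing left to move
            return V1, V2
        # Seed choice (an assumption): the module most connected to the smaller side.
        seed = max(candidates,
                   key=lambda v: sum(c for u, c in cap[v].items() if u in small))
        grow.add(seed)
```

On a chain `a - b - c - d - e - f` whose only weak net is `b - c`, the first cut isolates `{a, b}`; the loop then absorbs `c` into the source side and the next cut yields the balanced partition `{a, b, c}` versus `{d, e, f}`.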
FIGURE 27 The duplicated circuit of the circuit shown in Figure 25.

The seeds are chosen according to their connectivity to the vertex set on the other side. The result is sensitive to the choice of the seeds. We can make multiple trials and choose the best results. Other methods such as the programming approach can serve as a guideline on the choice of the seeds [79, 80]. The method has been shown to derive excellent results with reasonable running time.

(ii) Approximation of Multiple Commodity Flow. Based on the multicommodity flow formulation [103], we try to solve a multiple way partitioning by deriving an approximate multiple commodity flow with a stochastic process [13, 55, 114, 117].

Given a circuit $H(V, E)$, the flow increment $\Delta$, and the distance coefficient $\alpha$, the algorithm starts with procedure Saturate-Network to saturate the circuit with flows. A stochastic flow injection algorithm is adopted to reduce the computational complexity. Then, Select-Cut is activated to select a set of nets by the flow values to constitute a cut. The conversion from weighted ratio cut to cluster ratio cut is performed by the Select-Cut routine, which selects a subset of the cut derived from Saturate-Network with a greedy approach.

Multiple Commodity Flow Approximation ($H$, $\Delta$, $\alpha$)

1. Iterate the following procedures
   1.1. Saturate-Network ($H$, $\Delta$, $\alpha$).
   1.2. Select-Cut ($H$)
   until the clustering results are satisfactory.
2. Output clustering result.

Procedure Saturate-Network ($H$, $\Delta$, $\alpha$)

1. Set the distance of each net $e$ to be one.
2. While ($H$ is connected) do 2.1 to 2.3.
   2.1. Randomly pick two distinct modules $v_s$ and $v_t$.
   2.2. Find the shortest path between $v_s$ and $v_t$.
   2.3. For each net $e$ on the shortest path, let $f(e)$ and $d_e$ be the flow and distance of net $e$.
        2.3.1. If $e$ is not saturated, increase $f(e)$ by $\Delta$ and set $d_e = \exp((\alpha \times f(e))/c_e)$.
        2.3.2. If $e$ is saturated, set $d_e$ to be $\infty$.
3. Output $E$ with flow information.

The initial distance of each net is one since there is no flow being injected (see the distance formulation in Step 2.3.1). Step 2.1 uses a random process with even distribution over all modules to pick two distinct modules, and Steps 2.2-2.3 inject $\Delta$ amount of flow along the shortest path between the modules.
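A minimal sketch of procedure Saturate-Network for two-pin nets is given below. It is an illustration under stated assumptions rather than the authors' code: a net is treated as saturated once it has no room for another increment of `delta`, a saturated net receives infinite distance so that the shortest-path search skips it, and the loop stops when a randomly chosen pair of modules has no unsaturated path between them. The function name, the default parameter values, and the `seed` argument are hypothetical.

```python
import heapq
import math
import random

def saturate_network(nets, delta=0.5, alpha=10.0, seed=0):
    """nets maps a two-pin net (u, v) to its capacity; returns the flow on each net."""
    rng = random.Random(seed)
    modules = sorted({m for e in nets for m in e})
    flow = {e: 0.0 for e in nets}
    dist = {e: 1.0 for e in nets}                    # Step 1: unit initial distance
    adj = {m: [] for m in modules}
    for e in nets:
        u, v = e
        adj[u].append((v, e))
        adj[v].append((u, e))

    def shortest_path(s, t):
        """Dijkstra over finite distances; returns the nets on the path, or None."""
        best, via, heap = {s: 0.0}, {}, [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == t:
                break
            if d > best[u]:
                continue
            for v, e in adj[u]:
                nd = d + dist[e]
                if math.isfinite(nd) and nd < best.get(v, math.inf):
                    best[v], via[v] = nd, (u, e)
                    heapq.heappush(heap, (nd, v))
        if t not in best:
            return None
        path, node = [], t
        while node != s:
            node, e = via[node]
            path.append(e)
        return path

    while True:                                      # Step 2
        s, t = rng.sample(modules, 2)                # Step 2.1
        path = shortest_path(s, t)                   # Step 2.2
        if path is None:                             # the pair is separated by saturated nets
            return flow
        for e in path:                               # Step 2.3
            flow[e] += delta
            if flow[e] + delta > nets[e] + 1e-12:    # no room for another increment
                dist[e] = math.inf                   # Step 2.3.2
            else:
                dist[e] = math.exp(alpha * flow[e] / nets[e])   # Step 2.3.1
```

The returned flows never exceed the net capacities, and the nets left with infinite distance are the candidates from which a Select-Cut step would pick the final cut.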
In Steps 2.3.1-2.3.2, the distances of the nets whose flow has been increased are recomputed using an exponential function $d_e = \exp((\alpha \times f(e))/c_e)$ to penalize the congested nets, where $d_e$ and $f(e)$ are the distance and flow of net $e$, respectively. Steps 2.1-2.3 are iteratively executed until a pair of modules is chosen where all possible paths between them are saturated by flows. These saturated nets identify a partition of the circuit.

Figure 28 shows a sample circuit saturated by flows after executing Saturate-Network with $\Delta = 0.01$ and $\alpha = 10$. The flow values are shown by the numbers right beside each net. The dashed lines indicate the cut lines along the set of saturated nets that form the three clusters. These saturated nets define an approximate weighted cluster ratio cut, which is a potential set of nets for the selection of a cluster ratio cut.

5.5. Programming Approaches

For programming approaches [7, 18, 35, 41, 46, 44], we adopt two way minimum cut with size constraints as the target problem. We assume that the nets are two pin nets and thus the circuit can be described as a graph $G(V, E)$. We also assume the modules are of unit size, i.e., $s_i = 1$.

The two way partition $(V_1, V_2)$ is represented by a linear placement with only two slots at coordinates $-1$ and $1$. For an even sized partition, half of the modules are assigned to each slot. Let $x_i$ denote the coordinate of module $v_i$. If $v_i \in V_1$, $x_i = 1$, else $x_i = -1$ for $v_i \in V_2$. The cut count can be expressed as follows.

$$C(V_1, V_2) = \frac{1}{4} \sum_{e_{ij} \in E} c_{ij} (x_i - x_j)^2 = \frac{1}{4} X^{\top} B X \tag{82}$$

where $X$ is a vector of $x_i$, and $X^{\top}$ is the transpose of vector $X$. Matrix $B$ has its entry $b_{ij} = -c_{ij}$ if $i \ne j$, else $b_{ii} = \sum_{1 \le j \le |V|} c_{ij}$. Suppose we relax the slot constraint by enforcing only the rules of the gravity center and the norm. The constraints on vector $X$ can be expressed as:

$$\mathbf{1}^{\top} X = 0, \tag{83}$$

$$X^{\top} X = |V|. \tag{84}$$

Matrix $B$ is symmetric and diagonally semidominant. Thus, it is positive semidefinite, i.e., all eigenvalues are nonnegative, and its eigenvectors are orthogonal.

FIGURE 28 The flow and partition generated by Saturate-Network.
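As a quick numerical check of Eq. (82), the sketch below builds matrix $B$ from a small set of two-pin nets and compares $\frac{1}{4} X^{\top} B X$ with the cut count obtained by direct enumeration. The example circuit, the integer module indices, and the function names are assumptions made for illustration; only the Python standard library is used.

```python
def build_b(n, nets):
    """Matrix B of Eq. (82): b_ij = -c_ij off the diagonal, b_ii = sum_j c_ij."""
    B = [[0.0] * n for _ in range(n)]
    for (i, j), c in nets.items():
        B[i][j] -= c
        B[j][i] -= c
        B[i][i] += c
        B[j][j] += c
    return B

def quadratic_form(B, x):
    """X^T B X for a coordinate vector x with entries +1 / -1."""
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))

def cut_count(nets, x):
    """Direct cut count: total connectivity of nets whose two pins lie in different slots."""
    return sum(c for (i, j), c in nets.items() if x[i] != x[j])

# Four modules on a ring; modules 0 and 1 sit at +1, modules 2 and 3 at -1.
nets = {(0, 1): 2, (1, 2): 1, (2, 3): 3, (0, 3): 1}
x = [1, 1, -1, -1]
B = build_b(4, nets)
print(cut_count(nets, x), quadratic_form(B, x) / 4)
```

The same vector also satisfies constraints (83) and (84), since its entries sum to zero and its squared norm equals the number of modules. Every row of $B$ sums to zero, so the all-ones vector is mapped to the zero vector.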



Let us order its eigenvalues from small to large, i.e., $\lambda_0 \le \lambda_1 \le \cdots \le \lambda_{|V|-1}$. The smallest eigenvalue is $\lambda_0 = 0$ with its eigenvector $X_0 = \mathbf{1}$. The second eigenvalue $\lambda_1$ is nonnegative with its eigenvector orthogonal to the first eigenvector, i.e., $X_0^{\top} X_1 = \mathbf{1}^{\top} X_1 = 0$. Therefore, the second eigenvector $X_1$ is an optimal solution to objective function (82) with constraints (83) [46]. Since $X^{\top} X = |V|$ (Eq. (84)), the solution is

$$\frac{1}{4} X_1^{\top} B X_1 = \frac{1}{4} \lambda_1 \times X_1^{\top} X_1 = \frac{1}{4} \lambda_1 \times |V|, \tag{85}$$

which is a lower bound of the min-cut problem.

To push for a higher lower bound, we can adjust the diagonal terms of matrix $B$ by adding constants $d_i$. Let

$$\tilde{C}(V_1, V_2) = C(V_1, V_2) + \frac{1}{4} \sum_{1 \le i \le |V|} d_i \times x_i^2 - \frac{1}{4} \sum_{1 \le i \le |V|} d_i = \frac{1}{4} \left( X^{\top} \tilde{B} X - \sum_{1 \le i \le |V|} d_i \right), \tag{86}$$

where matrix $\tilde{B}$ has its entry $\tilde{b}_{ij} = b_{ij}$ if $i \ne j$, else $\tilde{b}_{ii} = b_{ii} + d_i$. Since either $x_i = 1$ or $x_i = -1$, the last two terms cancel each other. The modification thus does not alter the optimal partition solution.

The new nonlinear programming problem is to find the assignment of $d_i$ to maximize the objective function [11]:

$$\frac{1}{4} \left( \tilde{\lambda}_1 \times |V| - \sum_{1 \le i \le |V|} d_i \right) \tag{87}$$

where $\tilde{\lambda}_1$ is the second smallest eigenvalue of matrix $\tilde{B}$. The solution is again a bound of the min-cut problem. It is at least as large as the bound (85) in the sense that $\lambda_1$ (obtained with $d_i = 0$) can serve as an initial feasible solution to maximize expression (87).

Remarks. The programming approach finds a global view of the problem [9, 79, 80, 118]. However, the formulation is very restricted. The extension to multiple pin nets and the incorporation of fixed modules will destroy the nice structure based on which we have the eigenvalue and eigenvector as optimal solutions. Therefore, it is difficult to utilize the approach recursively.

For a general case, we can view the problem as nonlinear programming with a Boolean quadratic objective function. Nonlinear programming techniques are adopted to derive the results [16, 107].

5.6. A Lagrange Multiplier Approach for Performance Driven Partitioning

The Lagrange multiplier is one useful tool for performance optimization. In this section, we demonstrate the usage of the Lagrange multiplier for performance driven partitioning. The problem is to optimize the performance of a two-way partition $(V_1, V_2)$ with retiming [86].

We first introduce a vector of binary variables to represent a partition. The performance-driven partitioning problem is thus represented by a Boolean quadratic programming formulation with nonlinear constraints. We then absorb the nonlinear constraints into the objective function as a Lagrangian. We use primal and dual subproblems to decompose the Lagrangian and derive the partitions. The Lagrange multiplier is adjusted in each iteration via a subgradient method to monitor the timing criticality and improve the performance.

5.6.1. Programming Formulation with Lagrange Multiplier

We assume that the circuit can be represented by a graph $G(V, E)$ with two pin nets and unit module size. The two-way partition is described by a vector $x = (x_{1,1}, \ldots, x_{1,n}, x_{2,1}, \ldots, x_{2,n})$, where $x_{b,i}$ is 1 if module $v_i$ is assigned to vertex set $V_b$, otherwise $x_{b,i}$ is 0. If modules $v_i$ and $v_j$ are in different vertex sets, the value of the term $x_{1,i} x_{2,j} + x_{2,i} x_{1,j}$ is equal to 1. This contributes one interpartition delay $\delta$ to the delay of net $e_{ij}$. Let $g_l(x)$ denote the delay to register ratio of loop $l$. Delay ratio $g_l(x)$ can be written as the following formula:
$$g_l(x) = \frac{d_l + \sum_{e_{ij} \in l} \delta \times (x_{1,i} x_{2,j} + x_{2,i} x_{1,j})}{r_l} \tag{88}$$

Given a path $p$, the total delay $h_p(x)$ of $p$ is as follows:

$$h_p(x) = d_p + \sum_{e_{ij} \in p} \delta \times (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) \tag{89}$$

To formulate the problem, we use an objective function of cut count:

$$\min \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}), \tag{90}$$

subject to the following constraints:

C1 (Size Constraints)

$$\sum_{i=1}^{|V|} x_{b,i}\, s_i \le S_u, \quad \forall\, b \in \{1, 2\}. \tag{91}$$

C2 (Variable Assignment Constraints)

$$\sum_{b=1}^{2} x_{b,i} = 1, \quad \forall\, v_i \in V. \tag{92}$$

C3 (Iteration Bound Constraints)

$$g_l(x) \le \tilde{J}, \quad \forall \text{ loop } l. \tag{93}$$

C4 (Latency Bound Constraints)

$$h_p(x) \le \tilde{M}, \quad \forall \text{ IO-critical path } p. \tag{94}$$

Actually, we do not need to consider all loops in C3. Because all loops are composed of simple loops, we have the following lemma:

LEMMA 1 Given a number $\tilde{J}$, if $g_l(x)$ is less than or equal to $\tilde{J}$ for any simple loop $l$, then $g_l(x)$ is less than or equal to $\tilde{J}$ for all loops $l$.

Let $c$ and $p$ represent the number of simple loops and the number of IO-critical paths, respectively. Let $\lambda$ denote the vector $(\lambda_1^g, \ldots, \lambda_c^g, \lambda_1^h, \ldots, \lambda_p^h)$. Using Lagrangian Relaxation [104], we absorb the constraints (93) and (94) into the objective function (90). The Lagrangian-relaxed problem is as follows.

$$\max_{\lambda \ge 0} \min_{x} L(x, \lambda) \tag{95}$$

subject to constraints C1 and C2, where

$$L(x, \lambda) = \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) + \sum_{\forall \text{ simple loop } l} \lambda_l^g (g_l(x) - \tilde{J}) + \sum_{\forall \text{ IO-critical path } p} \lambda_p^h (h_p(x) - \tilde{M}) \tag{96}$$

(i) The Dual Problem. Given vector $x$, we can represent (96) as a function of variable $\lambda$, i.e., $L_x(\lambda)$. Thus, the dual problem can be written as:

$$\max_{\lambda \ge 0} L_x(\lambda) \tag{97}$$

(ii) The Primal Problem. Let $F_{ij}$ and $Q_{ij}$ denote the sets of the simple loops and IO-critical paths passing through net $e_{ij}$. The cost $a_{ij}$ of net $e_{ij}$ is composed of connectivity $c_{ij}$ and the penalty of the timing constraints.

$$a_{ij} = c_{ij} + \sum_{l \in F_{ij}} \frac{\lambda_l^g\, \delta}{r_l} + \sum_{p \in Q_{ij}} \lambda_p^h\, \delta \tag{98}$$

Given vector $\lambda$, we can represent (96) as a function of vector $x$, i.e., $L_{\lambda}(x)$. Thus, the primal problem can be rewritten as:

$$\min L_{\lambda}(x) = \min \sum_{e_{ij} \in E} a_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) + \beta \tag{99}$$

subject to C1 and C2, where $\beta$ represents the constant contributed by $\lambda$.

5.6.2. Subgradient Method using Cycle Mean Method

We solve the partitioning problem through primal and dual iterations on the Lagrangian. A Quadratic Boolean Programming, QBP, [16] is used to
solve the primal problem and generate a solution x 5. Revise shadow price aij for all nets eij 2 E:
…k‡1† …k†
(Step 2). aij ˆ aij ;
…k‡1† …k†
For the dual problem based on x, we select the if net eij is in active loop, then aij ˆ aij ‡
set of loops and paths that violates the timing t …pij ÿ ~J†;
…k†
…k‡1† …k†
constraints as active loops and paths. The nets if net eij is in active path, then aij ˆ aij ‡
contained in the active loops or paths are termed ~
t…k† …qij ÿ M†.
active nets. 6. While k  MaxNumIter, set k k‡1 and goto
Active Loops and Paths Given a solution x, a 2.
loop l is called active, if gl (x) is not less than ~
J. A
path p is called active, if hp(x) is not less than M. ~ 5.7. Clustering Heuristics
Active Nets Given a net e, we de®ne e to be an
We ®rst discuss the usage of clustering heuristics.
active net, if net e is covered by an active loop or
We then discuss top down clustering and bottom
an active path.
up clustering approaches. At the last, we discuss
We call a minimum cycle mean algorithm [57]
and an all-pairs shortest-paths algorithm to mark all the nets on active loops and paths, respectively (Step 3). For every net $e_{ij}$ on active paths, we record $q_{ij}$: the maximum path delay among all paths passing through $e_{ij}$. For every net $e_{ij}$ on active loops, we record $p_{ij}$: the maximum delay-to-register ratio among all loops passing through $e_{ij}$. We then calculate the subgradient on the marked nets and update the constants $a_{ij}$ for the next primal dual iteration (Steps 4–5). We increase the costs of active nets using the subgradient approach [104]. The iteration proceeds until the bounds of all loops and paths are within the given limits.

Algorithm using Lagrange Multiplier

Input: Constants $\tilde{J}$, $\tilde{M}$, a constant set to 1.3, and an initial partition $(V_1^{(0)}, V_2^{(0)})$.

1. Initialize $k \leftarrow 1$; $a_{ij}^{(0)} = c_{ij}$.
2. Run QBP [16] to find a partition $(V_1^{(k)}, V_2^{(k)})$ with an objective to minimize cut count $C(V_1^{(k)}, V_2^{(k)}) = \sum_{e_{ij} \in E(V_1^{(k)}, V_2^{(k)})} a_{ij}$.
3. Calculate the iteration and latency bounds of the partition $(V_1^{(k)}, V_2^{(k)})$, respectively. Stop if timing constraints are satisfied. Otherwise, revise $p_{ij}$ and $q_{ij}$ for all nets $e_{ij}$.
4. Compute
   $$t^{(k)} = \frac{C(V_1^{(k)}, V_2^{(k)}) - C(V_1^{(0)}, V_2^{(0)})}{\sum_{e_{ij} \in E} (p_{ij} - \tilde{J})^2 + \sum_{e_{ij} \in E} (q_{ij} - \tilde{M})^2}$$

some variations of clustering metrics.

5.7.1. Usage of Clustering Heuristics

The usage of clustering heuristics plays an important role in determining the quality of the final results. In the following, we discuss the issue in different topics. We use a two-way partitioning with size constraints as the target problem.

1. Top Down Clustering versus Bottom Up Clustering: Top down clustering approach provides a global view of the solution. The operations are consistent with the target problem. However, it is more time consuming because the clustering operates on the whole circuit [29]. Bottom up clustering is efficient. However, because the process operates locally, the target solution is sensitive to the clustering heuristics [59].
2. The Level of the Clustering: Suppose we represent the clustering results with a hierarchical tree structure. Let the root correspond to the whole circuit, the leaves correspond to the smallest clusters, and the internal nodes correspond to the intermediate clusters. Hence, the size of the clusters grows with the level of the nodes. Top down clustering creates clusters corresponding to nodes in high levels, while bottom up clustering creates clusters corresponding to nodes in low levels.
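As a minimal sketch of the hierarchical tree described in item 2 above, the following Python fragment models clusters as tree nodes and lists the cluster sizes found at each level. The class name `ClusterNode`, the helpers `sizes_by_level` and `example_hierarchy`, and the eight-module example are illustrative assumptions and do not come from the text; levels are counted upward from the leaves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterNode:
    """A node of the clustering hierarchy (hypothetical helper class)."""
    name: str
    modules: List[str] = field(default_factory=list)          # used by leaves only
    children: List["ClusterNode"] = field(default_factory=list)

    def size(self) -> int:
        """Number of modules covered by this cluster."""
        if not self.children:
            return len(self.modules)
        return sum(child.size() for child in self.children)

    def level(self) -> int:
        """Height above the leaves; a leaf cluster is at level 0."""
        if not self.children:
            return 0
        return 1 + max(child.level() for child in self.children)


def sizes_by_level(root: ClusterNode) -> Dict[int, List[int]]:
    """Map each level to the sorted sizes of the clusters on that level."""
    table: Dict[int, List[int]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        table.setdefault(node.level(), []).append(node.size())
        stack.extend(node.children)
    return {lvl: sorted(sizes) for lvl, sizes in table.items()}


def example_hierarchy() -> ClusterNode:
    """Eight modules a..h: four leaf clusters, two intermediate clusters, one root."""
    leaves = [
        ClusterNode("c1", modules=["a", "b"]),
        ClusterNode("c2", modules=["c", "d"]),
        ClusterNode("c3", modules=["e", "f"]),
        ClusterNode("c4", modules=["g", "h"]),
    ]
    mid1 = ClusterNode("m1", children=leaves[:2])
    mid2 = ClusterNode("m2", children=leaves[2:])
    return ClusterNode("root", children=[mid1, mid2])


if __name__ == "__main__":
    for lvl, sizes in sorted(sizes_by_level(example_hierarchy()).items()):
        print(lvl, sizes)
```

In this toy tree the low levels hold the small clusters that a bottom up method would form first, and the high levels hold the large clusters that a top down method would produce first.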
For example, in [60], Kernighan and Lin proposed a top down clustering approach, which divides the whole circuit into four clusters only. In [59], Karypis et al. used a bottom up clustering which starts with clusters of two modules or a net. If we continue the application of bottom up clustering on intermediate clusters, the quality of the clusters degenerates as the size of the clusters grows bigger.
3. Iteration of Clustering and Unclustering: We go through the iterations of clustering and unclustering to improve the quality of the results. At each level of the hierarchical tree, we derive an intermediate target solution, e.g., a two-way partition. In unclustering, we go down the level of tree hierarchy to find an expanded circuit with more modules. In clustering, we go up the level of tree hierarchy with a circuit of a smaller number of modules. The previous partitioning result becomes the initial partition of the new partitioning problem. Note that the hierarchical tree is constructed dynamically. For each clustering, the modules can be grouped based on the current partitioning configuration.
4. The Clustering Operations and the Target Solution: The clustering operation has to be consistent with the target solution. For example, suppose the target is finding a two-way min-cut with size constraints. Then, it is natural to cluster modules based on net connectivity because the probability that a net is in an optimal cut set is small (see the subsection of min-cut with size constraints in problem formulations). Moreover, it is important that the clustering follows the current partitioning results, i.e., only modules in the same partition are clustered.

FIGURE 29 Strategy of top down clustering.

5.7.2. Top Down Clustering Approach for Partitioning

We use an application to two-way cut with size constraints to illustrate the top down clustering approach [24, 29]. The partitioning of huge designs is complicated and the results can be erratic. Our strategy (Fig. 29) is to reduce the circuit complexity by constructing a contracted hypergraph. The clusters for the contracted hypergraph are searched via a recursive top down partitioning method. The number of modules is much reduced after we contract the clusters. Hence, a group
migration approach can derive excellent two way cut results on the contracted hypergraph with much efficiency. Furthermore, since the clusters are grouped via a top down partitioning, conceptually a minimum cut on the hypergraph can take advantage of the previous results and generate better solutions.

In this section, we describe a top down clustering algorithm. A ratio cut is adopted to perform the top down clustering process. Other partition approaches can also be used to replace the ratio cut. A group migration method is used to find a minimum cut of the contracted hypergraph with size constraint. Finally, we apply a last run of the group migration algorithm to the original circuit to fine tune the result.

Input a hypergraph $H(V, E)$, an integer $k$ for the number of expected clusters, an integer num_of_reps for repetition, and $S_l$, $S_u$ for the size constraints of two resultant subsets.

1. Initialize $\Gamma = \{V\}$ and $V^* = V$.
2. Apply ratio cut [109] to obtain a partition $(A, A')$ of $V^* = A \cup A'$.
3. Set $\Gamma = (\Gamma - \{V^*\}) \cup \{A, A'\}$. Set $V^*$ to be a vertex set in $\Gamma$ such that $S(V^*) = \max_{V_i \in \Gamma} S(V_i)$.
4. While $S(V^*) > S(V)/k$, repeat Steps 2, 3.
5. Construct a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$.
6. Apply num_of_reps times of a group migration algorithm to $H_\Gamma$ with the size constraints $S_l$, $S_u$.
7. Use the best result from Step 6 to the circuit $H$ as an initial partition. Apply a group migration algorithm once to $H$ with the size constraints $S_l$, $S_u$.

The choice of cluster number k It was shown [24] that the cut count versus cluster number k is a concave curve. When k is small, the quality is not as good because the cluster is too coarse. When k is large, there are too many clusters. We lose the benefit of the clustering.

For the case that the circuit is large, we may need to adopt multiple levels of clustering to push for the performance and efficiency [58, 66].

5.7.3. Bottom Up Clustering Approaches

In this section, we discuss bottom up clustering [90] with two applications: linear placement and performance driven designs. We then show two strategies to perform the clustering: maximum matching and maximum pairing. We will demonstrate via examples the advantage of maximum pairing over maximum matching.

(i) Linear Placement For linear placement, we reduce the complexity of the problem by a bottom up clustering approach [96, 100, 53]. The clustering is based on the result of a tentative placement. We adopt a heuristic approach to generate tentative placements throughout iterations. In each iteration, we cluster modules only when they are in consecutive order of the placement. We then construct a contracted hypergraph. In the next iteration, the heuristic approach generates the placement of the contracted hypergraph. For each iteration, we either grow the size of the clusters or construct new clusters adaptively.

Inspired by the property of the minimum cut separating two modules (Theorem 3.1), we use a density as a measure to find the cluster. A density $d(i)$ at a slot $i$ of a linear placement is the total connectivity of nets connecting modules on the different sides of the slot. The following algorithm describes the clustering using a given placement. Each cluster size is between $L$ and $U$.

Input placement $P$, two parameters $L$ and $U$.

1. Initialize cluster boundary at slot $p = 1$.
2. Scan placement $P$ from slot $p$ toward the right end. Find slot $i$ such that $p + L \le i \le p + U$ and density $d(i)$ is minimum among $d(p+L), \ldots, d(p+U)$.
3. Cluster modules between slots $p$ and $i$. Set $p = i + 1$.
4. Repeat Steps 2, 3 until the scan reaches the right end.

Remark The proposed clustering process and the criteria are consistent with the target linear
placement application. The whole process depends on an efficient and effective linear placement.

(ii) Performance Driven Clustering For performance driven clustering [31, 112], nets which contribute to the longest delay are termed critical nets. Pins of the critical net are merged to form clusters.

For a special case that the circuit is a directed tree, we can find an optimal solution in polynomial time. Let us assume the tree has its leaves at the input and its root at the output. We use a dynamic programming approach to trace from the leaves toward the root. Each module is not traced until all its input modules are processed. For each module, we treat it as a root of a subtree and find the optimal clustering of the subtree. Since all the modules in the subtree except its root have been processed, we can derive an optimal solution of the root in polynomial time.

(iii) Maximum Matching The maximum matching pairs all modules into $|V|/2$ groups simultaneously. Given a measurement of pairing modules, we can find a matching that maximizes the total pairing measurement in polynomial time.

We can call maximum matching recursively to create clusters of equal sizes. However, this strategy may enforce unrelated pairs to merge. The enforcement will sacrifice the quality of final clustering results.

Example Figure 30 illustrates the clustering behavior of maximum matching. The circuit contains twelve modules of equal size. The first level maximum matching pairs modules (a, b), (d, e), (g, h), (j, k), (c, l), and (f, i). Modules in the first four pairs are strongly connected with their partners. However, the last two are not. Modules c and l have no common nets but are merged because their choices are taken by others.

Furthermore, as we proceed to the next level maximum matching, the merge of pairs (c, l) and (f, i) will enforce grouping modules into cluster {a, b, c, j, k, l} and cluster {d, e, f, g, h, i}. If we measure the quality of the results with cluster cost (expression (26)), the cost of the two clusters is $\sum_i C(V_i)/C_I(V_i) = 4/12 + 4/12 = 2/3$. For this case, we can find a better solution of clusters {a, b, c, d, e, f} and {g, h, i, j, k, l} of which the cluster cost is equal to zero.

FIGURE 30 Clustering of two module circuit.

Figure 31 shows another example of twelve modules with connectivities attached to the nets. The connectivity is 1 if not specified. Figure 31(a) shows an optimum cut with cut count 6.6. If a maximum matching [61] criterion is adopted in the bottom up clustering approach, then modules with a net of weight 1.1 between them will be merged. A minimum cut on the merged modules yields a cut count of 18 (Fig. 31(b)). In general, a $2n$ module circuit having a symmetric configuration as in Figure 31 will have a cut count of $n^2/2$ if the maximum matching criterion is applied to perform the clustering, while the optimum solution will have a cut weight of $1.1 \times n$. From this extreme case, we can claim the following theorem:

THEOREM 5.4 There is no constant factor of error bound of the cut count generated by the maximum matching approach, from the cut count of a minimum cut.

Proof As shown in the above example, the factor of error bound is $(n^2/2)/(1.1 \times n) = n/2.2$, which is not a constant. Q.E.D.

(iv) Maximum Pairing The maximum pairing is
similar to maximum matching, except that it does not enforce the matching of all modules. Only the top q percent of the modules are paired. Thus, we can avoid the enforced pairing of unrelated modules.

However, this strategy may cause certain modules to keep on growing and produce very uneven cluster results. Thus, we need to choose a proper cost function that discourages unlimited growth of the cluster size, e.g., cost function (26).

FIGURE 31 A twelve module example to demonstrate maximum matching.

5.7.4. Variations of Clustering Metric

In order to identify good clusters, we need to look beyond the direct adjacency between modules. It is useful if we can also extract the relation between the neighbors' neighbors, or even several levels of neighbors' neighbors. The probabilistic gain model of group migration approach is one good example of such approach [37, 42].

In this section, we will discuss a few different clustering metrics. For the case of k connectivity, we count the number of k-hop paths between two modules. Or, we use an analogy of a resistive network to check the conductance between the modules. Furthermore, we check beyond the hypergraph and use other information such as the module functions, pin locations, and control signals.

(i) kth Connectivity The number of k-hop paths between two modules provides a different aspect of information on the adjacency. Suppose the circuit has only two-pin nets. We can derive the kth connectivity with sparse matrix multiplication. Let $C$ be the connectivity matrix with connectivity $c_{ij}$ as its elements at row $i$ column $j$, and at row $j$ column $i$, and its diagonal entry $c_{ii} = 0$. Note that we set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$.

Let $c_{ij}^{(2)}$ be the element of the square of matrix $C$ ($C^2$), and $c_{ij}^{(k)}$ be the element of the kth order of matrix $C$ ($C^k$). Then we have $c_{ij}^{(k)}$ representing the number of distinct k-hop paths connecting modules $v_i$ and $v_j$.

(ii) Conductivity We use a resistive network analogy [21, 93] to derive the relation between modules. Suppose the circuit has only two pin nets. We replace each net $e_{ij}$ with a resistor of conductance $c_{ij}$. Hence, we can view the whole system as a resistive network and derive the conductance between modules. The system conductance between two modules $v_i$ and $v_j$ reveals the adjacency relation between the two modules.

The network conductance can be derived using circuit analysis. We can also approximate the conductance with a random walk approach. In a random network model, we start walking from a module $v_i$. At each module $v_k$, the probability to walk via net $e_{kl}$ to module $v_l$ is proportional to the connectivity, i.e., $c_{kl}/\sum_m c_{km}$. We can derive the relation between the random walk and the conductivity [89]:
$$h_{ij} + h_{ji} = \frac{2 \sum_{e \in E} c_e}{G_{ij}}, \qquad (100)$$

where $h_{ij}$ denotes the expected number of hops to walk from module $v_i$ to module $v_j$, and $G_{ij}$ denotes the conductance between $v_i$ and $v_j$.

(iii) Similarity of Signatures We can use certain features beyond connectivity for the clustering metric [88, 91]. For example, the index of data bits, sequence of the pins, function of logic, and relation with common control signals can serve as signatures of function blocks in data path designs. All these features form the first level adjacency. We can extend the relation to multiple levels. For example, two modules connecting a set of modules with strong similarity makes these two modules similar.

Example As shown in Figure 32, modules A and B are similar in signature because they are of the same OR function, connected to consecutive bit number at the same pin location, and controlled by the same control signal at the same pin location.

Modules C and D become similar because module C obtains signal from A, module D obtains signal from B, and modules A and B are similar.

FIGURE 32 Signature identifies data structure.

6. RESEARCH DIRECTIONS

Partitioning remains an important research problem. Many applications such as floorplanning, engineering change orders, and performance driven emulation demand effective and efficient partitioning solutions.

Recent efforts released benchmarks with reasonable complexity [3]. However, more design cases are still needed to represent the class of huge circuitry with details of functions and timing.

In this section, we touch on a few interesting research problems regarding the correlation between the partition of logic and physical designs, the manipulation of hierarchical tree structure, and the performance driven partitioning.

6.1. Correlation of Hierarchical Partitioning Structure Between Logic Synthesis and Physical Layout

It is desired to correlate the logic hierarchy with the physical design hierarchy. The main reason is the control of timing for huge designs. Currently, the design turnaround takes 2–8 months for ASIC and much longer for custom designs. Throughout the design process, designs keep on changing. We don't want to lose control of timing as design changes. A tight correlation of logic and physical hierarchies makes timing predictable. Without this kind of mechanism, the timing characteristics of a floorplan may become erratic after iterations of design changes.

6.2. Manipulation of Hierarchical Partitioning Structure

One main issue in mapping a huge hierarchical circuit is the utilization of the hierarchy to reduce the mapping complexity. We can drastically improve the efficiency of the mapping process, if
we properly exploit the structure of the design hierarchy. The generic binary tree is a good formulation to start with.

The handling of a hierarchy tree gives rise to many fundamental research problems. For example, finding k shortest-paths or exploring the maximum-flow minimum-cut of the whole circuit [51] embedded in a hierarchical tree can be useful for interconnect analysis and optimization. Such research can also benefit many different fields which have to handle huge hierarchical systems.

6.3. Performance Driven Partitioning

For performance driven partitioning, we need a fast evaluation on the hierarchical tree structure. The analysis needs to be incremental with incorporation of signal integrity.

The network flow method is a potential approach for the partitioning with timing constraints. More efforts are needed to improve the speed and derive desired results.

Acknowledgements

The authors thank the editor for the encouragement of preparing this manuscript. The authors would also like to thank Ted Carson, Lung-Tien Liu, and John Lillis for helpful discussions.

References

[1] Ahuja, R. K., Magnanti, T. L. and Orlin, J. B., Network Flows, Prentice Hall, 1993.
[2] Alpert, C. J., "The ISPD98 circuit benchmark suite", Int. Symp. on Physical Design, pp. 80–85, April, 1998.
[3] Alpert, C. J., Caldwell, A. E., Kahng, A. B. and Markov, I. L., "Partitioning with Terminals: a 'New' Problem and New Benchmarks", Int. Symp. on Physical Design, pp. 151–157, April, 1999.
[4] Alpert, C. J., Huang, J. H. and Kahng, A. B., "Multilevel circuit partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 530–533.
[5] Alpert, C. J. and Kahng, A. B., "Recent directions in netlist partitioning: a survey", Integration: The VLSI J., 19(1), 1–81, August, 1995.
[6] Alpert, C. J. and Kahng, A. B., "A general framework for vertex orderings with applications to circuit clustering", IEEE Trans. VLSI Syst., 4(2), 240–246, June, 1996.
[7] Alpert, C. J. and Yao, S. Z., "Spectral partitioning: the more eigenvectors, the better", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 195–200.
[8] Bakoglu, H. B., Circuits, Interconnections, and Packaging for VLSI, MA: Addison-Wesley, 1990.
[9] Blanks, J. (1989). "Partitioning by Probability Condensation", ACM/IEEE 26th Design Automation Conf., pp. 758–761.
[10] Bollobas, B. (1985). Random Graphs, Academic Press Inc., pp. 31–53.
[11] Boppana, R. B. (1987). "Eigenvalues and Graph Bisection: An Average Case Analysis", Annual Symp. on Foundations in Computer Science, pp. 280–285.
[12] Breuer, M. A., Design Automation of Digital Systems, Prentice-Hall, NY, 1972.
[13] Bui, T., Chaudhuri, S., Jones, C., Leighton, T. and Sipser, M. (1987). "Graph bisection algorithms with good average case behavior", Combinatorica, 7(2), 171–191.
[14] Bui, T., Heigham, C., Jones, C. and Leighton, T., "Improving the performance of the Kernighan-Lin and simulated annealing graph bisection algorithms", In: Proc. ACM/IEEE Design Automation Conf., June, 1989, pp. 775–778.
[15] Buntine, W. L., Su, L., Newton, A. R. and Mayer, A., "Adaptive methods for netlist partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 356–363.
[16] Burkard, R. E. and Bonniger, T. (1983). "A Heuristic for Quadratic Boolean Programs with Applications to Quadratic Assignment Problems", European Journal of Operational Research, 13, 372–386.
[17] Camposano, R. and Brayton, R. K. (1987). "Partitioning Before Logic Synthesis", Int. Conf. on Computer-Aided Design, pp. 324–326.
[18] Chan, P. K., Schlag, D. F. and Zien, J. Y., "Spectral k-way ratio-cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 13(9), 1088–1096, September, 1994.
[19] Charney, H. R. and Plato, D. L., "Efficient Partitioning of Components", IEEE Design Automation, July, 1968, pp. 16.0–16.21.
[20] Chatterjee, A. C. and Hartley, R., "A new Simultaneous Circuit Partitioning and Chip Placement Approach based on Simulated Annealing", In: Proc. ACM/IEEE Design Automation Conf., June, 1990, pp. 36–39.
[21] Cheng, C. K. and Kuh, E. S., "Module Placement Based on Resistive Network Optimization", IEEE Trans. on Computer-Aided Design, CAD-3, 218–225, July, 1984.
[22] Cheng, C. K., "Linear Placement Algorithms and Applications to VLSI Design", Networks, 17, 439–464, Winter, 1987.
[23] Cheng, C. K. and Hu, T. C., "Ancestor Tree for Arbitrary Multi-Terminal Cut Functions", Proc. Integer Programming/Combinatorial Optimization Conf., Univ. of Waterloo, May, 1990, pp. 115–127.
[24] Cheng, C. K. and Wei, Y. C. (1991). "An Improved Two-Way Partitioning Algorithm with Stable Performance", IEEE Trans. on Computer Aided Design, 10(12), 1502–1511.
[25] Cheng, C. K. (1992). "The Optimal Partitioning of Networks", Networks, 22, 297–315.
[26] Cherng, J. S. and Chen, S. J., "A Stable Partitioning Algorithm for VLSI Circuits", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1996, pp. 9.1.1–9.1.4.
[27] Cherng, J. S., Chen, S. J. and Ho, J. M., "Efficient Bipartitioning Algorithm for Size-Constrained Circuits", IEEE Proceedings-Computers and Digital Techniques, 145(1), 37–45, January, 1998.
[28] Cheng, C. K. and Hu, T. C. (1992). "Maximum Concurrent Flow and Minimum Ratio Cut", Algorithmica, 8, 233–249.
[29] Chou, N. C., Liu, L. T., Cheng, C. K., Dai, W. J. and Lindelof, R., "Local Ratio Cut and Set Covering Partitioning for Huge Logic Emulation Systems", IEEE Trans. Computer-Aided Design, pp. 1085–1092, September, 1995.
[30] Chvatal, V. (1983). Linear Programming, W. H. Freeman and Company.
[31] Cong, J. and Ding, Y., "FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs", IEEE Trans. Computer-Aided Design, January, 1994, 13, 1–12.
[32] Cong, J., Labio, W. and Shivakumar, N., "Multi-way VLSI circuit partitioning based on dual net representation", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 56–62.
[33] Cong, J., Li, H. P., Lim, S. K., Shibuya, T. and Xu, D., "Large scale circuit partitioning with loose/stable net removal and signal flow based clustering", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 441–446.
[34] Donath, W. E. and Hoffman, A. J. (1973). "Lower Bounds for the Partitioning of Graphs", IBM J. Res. Dev., pp. 420–425.
[35] Donath, W. E. and Hoffman, A. J. (1972). "Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices", IBM Technical Disclosure Bulletin 15, pp. 938–944.
[36] Donath, W. E. (1988). "Logic partitioning", In: Physical Design Automation of VLSI Systems, Preas, B. and Lorenzetti, M. (Eds.) Menlo Park, CA: Benjamin/Cummings, pp. 65–86.
[37] Dutt, S. and Deng, W., "A Probability-based Approach to VLSI Circuit Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 100–105.
[38] Dutt, S. and Deng, W., "VLSI Circuit Partitioning by Cluster-Removal Using Iterative Improvement Techniques", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1996, pp. 194–200.
[39] Enos, M., Hauck, S. and Sarrafzadeh, M., "Evaluation and optimization of Replication Algorithms for logic Bipartitioning", IEEE Trans. on Computer-Aided Design, September, 1999, 18, 1237–48.
[40] Fiduccia, C. M. and Mattheyses, R. M., "A Linear-Time Heuristic for Improving Network Partitions", In: Proc. ACM/IEEE Design Automation Conf., June, 1982, pp. 175–181.
[41] Frankle, J. and Karp, R. M. (1986). "Circuit Placement and Cost Bounds by Eigenvector Decomposition", Proc. Int. Conf. on Computer-Aided Design, pp. 414–417.
[42] Garbers, J., Promel, H. J. and Steger, A. (1990). "Finding clusters in VLSI circuits", In: Proc. IEEE Int. Conf. Computer-Aided Design, pp. 520–523.
[43] Garey, M. R. and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[44] Hagen, L. and Kahng, A. B., "New spectral methods for ratio cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 11(9), 1074–1085, September, 1992.
[45] Hagen, L. and Kahng, A. B., "Combining problem reduction and adaptive multistart: a new technique for superior iterative partitioning", IEEE Trans. Computer-Aided Design, 16(7), 709–717, July, 1997.
[46] Hall, K. M., "An r-dimensional Quadratic Placement Algorithm", Management Science, 17(3), 219–229, November, 1970.
[47] Hamada, T., Cheng, C. K. and Chau, P., "An Efficient Multi-Level Placement Technique Using Hierarchical Partitioning", IEEE Trans. Circuits and Systems, 39, 432–439, June, 1992.
[48] Hennessy, J. (1983). "Partitioning Programmable Logic Arrays Summary", Int. Conf. on Computer-Aided Design, pp. 180–181.
[49] Hoffmann, A. G., "The Dynamic Locking Heuristic - A New Graph Partitioning Algorithm", In: Proc. IEEE Int. Symp. Circuits and Systems, May, 1994, pp. 173–176.
[50] Adolphson, D. and Hu, T. C., "Optimal Linear Ordering", SIAM J. Appl. Math., 25(3), 403–423, November, 1973.
[51] Hu, T. C., "Decomposition Algorithm", pp. 17–22, In: Combinatorial Algorithms, Addison Wesley, 1982.
[52] Hu, T. C. and Moerder, K., "Multiterminal flows in a hypergraph", In: VLSI Circuit Layout: Theory and Design, Hu, T. C. and Kuh, E. (Eds.) NY: IEEE Press, 1985, pp. 87–93.
[53] Hur, S. W. and Lillis, J. (1999). "Relaxation and Clustering in a Local Search Framework: Application to Linear Placement", Design Automation Conference, pp. 360–366.
[54] Hwang, J. and Gamal, A. E., "Optimal Replication for Min-Cut Partitioning", Proc. IEEE/ACM Intl. Conf. Computer-Aided Design, November, 1992, pp. 432–435.
[55] Iman, S., Pedram, M., Fabian, C. and Cong, J., "Finding uni-directional cuts based on physical partitioning and logic restructuring", In: Proc. ACM/SIGDA Physical Design Workshop, May, 1993, pp. 187–198.
[56] Johnson, D. S., Aragon, C. R., McGeoch, L. A. and Schevon, C. (1989). "Optimization by Simulated Annealing: an Experimental Evaluation, Part I, Graph Partitioning", Operations Research, 37(5), 865–892.
[57] Karp, R. M. (1978). "A Characterization of The Minimum Cycle Mean in A Digraph", Discrete Mathematics, 23, 309–311.
[58] Karypis, G., Aggarwal, R., Kumar, V. and Shekhar, S., "Multilevel Hypergraph Partitioning: Application in VLSI Domain", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 526–529.
[59] Karypis, G., Aggarwal, R., Kumar, V. and Shekhar, S. (1998). "Multilevel Hypergraph Partitioning: Application in VLSI Domain", Manuscript of CS Dept., Univ. of Minnesota, pp. 1–25 (http://www.users.cs.umn.edu/karypis/metis/publications/).
[60] Kernighan, B. W. and Lin, S., "An Efficient Heuristic Procedure for Partitioning Graphs", Bell Syst. Tech. J., 49(2), 291–307, February, 1970.
[61] Khellaf, M., "On The Partitioning of Graphs and Hypergraphs", Ph.D. Dissertation, Indus. Engineering and Operations Research, Univ. of California, Berkeley, 1987.
[62] Kirkpatrick, S., Gelatt, C. and Vecchi, M., "Optimization by Simulated Annealing", Science, 220(4598), 671–680, May, 1983.
[63] Knuth, D. E., The Art of Computer Programming,
Addison Wesley, 1997.
[64] Kring, C. and Newton, A. R. (1991). "A Cell-Replicating Approach to Mincut Based Circuit Partitioning", Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 2–5.
[65] Krishnamurthy, B., "An Improved Min-Cut Algorithm for Partitioning VLSI Networks", IEEE Trans. Computers, C-33(5), 438–446, May, 1984.
[66] Krupnova, H., Abbara, A. and Saucier, G. (1997). "A Hierarchy-Driven FPGA Partitioning Method", Design Automation Conf., pp. 522–525.
[67] Kuo, M. T. and Cheng, C. K., "A New Network Flow Approach for Hierarchical Tree Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 512–517.
[68] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Network Partitioning into Tree Hierarchies", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 477–482.
[69] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Finite State Machine Decomposition for I/O Minimization", In: Proc. IEEE Int. Symp. on Circuits and Systems, May, 1995, pp. 1061–1064.
[70] Kuo, M. T., Wang, Y., Cheng, C. K. and Fujita, M., "BDD-Based Logic Partitioning for Sequential Circuits", In: Proc. ASP/DAC, Chiba, Japan, January, 1997, pp. 607–612.
[71] Lomonosov, M. V. (1985). "Combinatorial Approaches to Multiflow Problems", Discrete Applied Mathematics, 11(1), 1–94.
[72] Landman, B. S. and Russo, R. L., "On a Pin Versus Block Relationship for Partitioning of Logic Graphs", IEEE Trans. on Computers, C-20, 1469–1479, December, 1971.
[73] Lawler, E. L., Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York, 1976.
[74] Leighton, T. and Rao, S. (1988). "An Approximate Max-Flow Min-cut Theorem for Uniform Multicommodity Flow Problems with Applications to Approximation Algorithms", IEEE Symp. on Foundations of Computer Science, pp. 422–431.
[75] Leighton, T., Makedon, F., Plotkin, S., Stein, C., Tardos, E. and Tragoudas, S., "Fast Approximation Algorithms for Multicommodity Flow Problems", Tech. report no. STAN-CS-91-1375, Dept. of Computer Science, Stanford University.
[76] Leiserson, C. E. and Saxe, J. B. (1991). "Retiming Synchronous Circuitry", Algorithmica, 6(1), 5–35.
[77] Lengauer, T. and Muller, R. (1988). "Linear Arrangement Problems on Recursively Partitioned Graphs", Zeitschrift fur Operations Research, 32, 213–230.
[78] Lengauer, T., Combinatorial Algorithms for Integrated Circuit Layout, Wiley, 1990.
[79] Li, J., Lillis, J. and Cheng, C. K., "Linear decomposition algorithm for VLSI design applications", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1995, pp. 223–228.
[80] Li, J., Lillis, J., Liu, L. T. and Cheng, C. K., "New Spectral Linear Placement and Clustering Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 88–93.
[81] Liou, H. Y., Lin, T. T., Liu, L. T. and Cheng, C. K., "Circuit Partitioning for Pipelined Pseudo-Exhaustive Testing Using Simulated Annealing", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1994, pp. 417–420.
[82] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "A Replication Cut for Two-Way Partitioning", IEEE Trans. Computer-Aided Design, May, 1995, pp. 623–630.
[83] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "Performance-Driven Partitioning Using a Replication Graph Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 206–210.
[84] Liu, L. T., Kuo, M. T., Huang, S. C. and Cheng, C. K., "A gradient method on the initial partition of Fiduccia-Mattheyses algorithm", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 229–234.
[85] Liu, L. T., Shih, M., Chou, N. C., Cheng, C. K. and Ku, W., "Performance-Driven Partitioning Using Retiming and Replication", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 296–299.
[86] Liu, L. T., Shih, M. and Cheng, C. K., "Data Flow Partitioning for Clock Period and Latency Minimization", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 658–663.
[87] Matula, D. W. and Shahrokhi, F., "The Maximum Concurrent Flow Problem and Sparsest Cuts", Tech. Report, Southern Methodist Univ., 1986.
[88] McFarland, M. C., S.J., "Computer-aided partitioning of behavioral hardware descriptions", In: Proc. ACM/IEEE Design Automation Conf., June, 1983, pp. 472–478.
[89] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms, Cambridge University Press.
[90] Ng, T. K., Oldfield, J. and Pitchumani, V., "Improvements of a mincut partition algorithm", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1987, pp. 470–473.
[91] Nijssen, R. X. T., Jess, J. A. G. and Eindhoven, T. U., "Two-Dimensional Datapath Regularity Extraction", Physical Design Workshop, April, 1996, pp. 111–117.
[92] Parhi, K. K. and Messerschmitt, D. G. (1991). "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding", IEEE Trans. on Computers, 40(2), 178–195.
[93] Riess, B. M., Doll, K. and Johannes, F. M., "Partitioning very large circuits using analytical placement techniques", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 646–651.
[94] Roy, K. and Sechen, C., "A Timing Driven N-Way Chip and Multi-Chip Partitioner", Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 240–247, November, 1993.
[95] Russo, R. L., Oden, P. H. and Wolff, P. K. Sr., "A heuristic procedure for the partitioning and mapping of computer logic graphs", IEEE Trans. on Computers, C-20, 1455–1462, December, 1971.
[96] Saab, Y., "A fast and robust network bisection algorithm", IEEE Trans. Computers, 44(7), 903–913, July, 1995.
[97] Saab, Y. and Rao, V. (1989). "An Evolution-Based Approach to Partitioning ASIC Systems", ACM/IEEE 26th Design Automation Conf., pp. 767–770.
[98] Sanchis, L. A., "Multiple-Way Network Partitioning", IEEE Trans. Computers, 38(1), 62–81, January, 1989.
[99] Sanchis, L. A., "Multiple-Way Network Partitioning with Different Cost Functions", IEEE Trans. on Computers, pp. 1500–1504, December, 1993.
[100] Schuler, D. M. and Ulrich, E. G. (1972). "Clustering and Linear Placement", Proc. 9th Design Automation Workshop, pp. 50–56.
[101] Schweikert, D. G. and Kernighan, B. W. (1972). "A Proper Model for the Partitioning of Electrical Circuits", Proc. 9th Design Automation Workshop, pp. 57-62.
[102] Sechen, C. and Chen, D. (1988). "An Improved Objective Function for Mincut Circuit Partitioning", Proc. Int. Conf. on Computer-Aided Design, pp. 502-505.
[103] Shahrokhi, F. and Matula, D. W., "The Maximum Concurrent Flow Problem", Journal of the ACM, 37(2), 318-334, April, 1990.
[104] Shapiro, J. F. (1979). Mathematical Programming: Structures and Algorithms, Wiley, New York.
[105] Sherwani, N. A. (1999). Algorithms for VLSI Physical Design Automation, 3rd edn., Kluwer Academic.
[106] Shih, M., Kuh, E. S. and Tsay, R.-S. (1992). "Performance-Driven System Partitioning on Multi-Chip Modules", Proc. 29th ACM/IEEE Design Automation Conf., pp. 53-56.
[107] Shih, M. and Kuh, E. S. (1993). "Quadratic Boolean Programming for Performance-Driven System Partitioning", Proc. 30th ACM/IEEE Design Automation Conf., pp. 761-765.
[108] Shin, H. and Kim, C., "A Simple Yet Effective Technique for Partitioning", IEEE Trans. on Very Large Scale Integration Systems, pp. 380-386, September, 1993.
[109] Wei, Y. C. and Cheng, C. K. (1991). "Ratio Cut Partitioning for Hierarchical Designs", IEEE Trans. on Computer-Aided Design, 10(7), 911-921.
[110] Wei, Y. C., Cheng, C. K. and Wurman, Z., "Multiple Level Partitioning: An Application to the Very Large Scale Hardware Simulators", IEEE Journal of Solid State Circuits, 26, 706-716, May, 1991.
[111] Woo, N. S. and Kim, J. (1993). "An Efficient Method of Partitioning Circuits for Multiple-FPGA Implementation", Proc. ACM/IEEE Design Automation Conf., pp. 202-207.
[112] Yang, H. and Wong, D. F. (1994). "Edge-Map: Optimal Performance Driven Technology Mapping for Iterative LUT Based FPGA Designs", Int. Conf. on Computer-Aided Design, pp. 150-155.
[113] Yang, H. and Wong, D. F., "Efficient Network Flow based Min-Cut Balanced Partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 50-55.
[114] Yeh, C. W., "On the Acceleration of Flow-Oriented Circuit Clustering", IEEE Trans. Computer-Aided Design, 14(10), 1305-1308, October, 1995.
[115] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "A general purpose, multiple-way partitioning algorithm", IEEE Trans. Computer-Aided Design, 13(12), 1480-1488, December, 1994.
[116] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Optimization by iterative improvement: an experimental evaluation on two-way partitioning", IEEE Trans. Computer-Aided Design, 14(2), 145-153, February, 1995.
[117] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Circuit clustering using a stochastic flow injection method", IEEE Trans. Computer-Aided Design, 14(2), 154-162, February, 1995.
[118] Zien, J. Y., Chan, P. K. and Schlag, M., "Hybrid spectral/iterative partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 436-440.

Authors' Biographies

Sao-Jie Chen has been a member of the faculty in the Department of Electrical Engineering, National Taiwan University since 1982, where he is currently a full professor. During the fall of 1999, he held a visiting appointment at the Department of Computer Science and Engineering, University of California, San Diego. His current research interests include VLSI circuit design, VLSI physical design automation, and object-oriented software engineering. Dr. Chen is a member of the Association for Computing Machinery, the IEEE, and the IEEE Computer Society.

Chung-Kuan Cheng received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, and the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley in 1984. From 1984 to 1986 he was a senior CAD engineer at Advanced Micro Devices Inc. In 1986, he joined the University of California, San Diego, where he is a Professor in the Computer Science and Engineering Department and an Adjunct Professor in the Electrical and Computer Engineering Department. He served as a chief scientist at Mentor Graphics in 1999. He has been an associate editor of IEEE Trans. on Computer-Aided Design since 1994. He is a recipient of the best paper award, IEEE Trans. on Computer-Aided Design, 1997, and the NCR excellence in teaching award, School of Engineering, UCSD, 1991. His research interests include network optimization and design automation on microelectronic circuits.
