
A_simple_algorithm_for_fanout_optimization_using_high-performance_buffer_libraries

This paper presents an algorithm for optimizing fanout networks in high-speed custom CMOS designs, focusing on minimal-area solutions under delay constraints. It emphasizes the use of fanout chains and demonstrates that these structures can yield efficient results compared to traditional methods. The authors provide a systematic approach for sink assignments and inverter selection, ultimately showing that their algorithm can produce near-optimal solutions with reduced complexity.


A Simple Algorithm for Fanout Optimization
using High-Performance Buffer Libraries


K. Kodandapani, J. Grodstein, A. Domic and H. Touati
Digital Equipment Corporation
Hudson, MA, USA, and Paris, France

1063-6757/93 $03.00 © 1993 IEEE
Authorized licensed use limited to: TAMAL DAS. Downloaded on January 20, 2025 at 17:41:16 UTC from IEEE Xplore. Restrictions apply.

We present an algorithm for computing minimal-area fanout networks satisfying a delay constraint. We focus on one type of fanout network structure, the fanout chain. We show that, when using libraries designed for high-speed custom CMOS chips, the fanout chain typically produces minimal-area fanout networks for a given delay constraint. We then present fast, near-optimal algorithms to compute these fanout structures.

Introduction

This paper addresses the fanout optimization problem for high-speed, custom CMOS designs. Such a design style allows many degrees of freedom, ranging from the use of non-standard circuit structures to hand-tailored layout. Logic synthesis is feasible for many control sections, and necessary to meet design schedules. For optimal performance, logic synthesis tools should be adapted to exploit the extra freedom available in custom designs.

A characteristic of custom CMOS designs is the use of gates with a large, almost continuous range of sizes. The presence of a wide range of inverter sizes allows us to limit the search space to a fairly simple topology, a fanout chain. A fanout chain consists of inverters in series with arbitrary sinks tied in at any location in the chain.

We show that in this case, the restricted search space is large enough to provide high-quality solutions. When the objective is to minimize area under a delay constraint, our algorithm produces results that compare favorably with existing algorithms that analyze a much larger number of topologies. Our algorithm works as follows:

- we enumerate all possible sink assignments out of a subset of assignments guaranteed to contain the optimal one;

- for each sink assignment, we perform a MADC cost computation. We observe that a fanout chain is a tree, hence standard tree-covering techniques [1,2,3] can be used for the fanout problem, simplifying the software implementation.

The paper is organized as follows. First we make a comparison with previous work. Then we provide our definitions and assumptions. In the following section we argue our claim that the fanout chain is the best solution to the "custom fanout problem". Then we exhibit a subset of the set of all possible sink assignments that is guaranteed to contain an optimal assignment. In the following section we present two possible approaches to inverter selection in a fanout chain, and conclude by discussing experimental results. The Appendix includes detailed technology data to justify our assumptions.

Comparison with Previous Work

The basic techniques for fanout optimization (buffering, gate resizing, gate duplication, critical signal isolation) are not new. There is a vast literature on timing optimization ([1], [4], [5], [6], [7]). In [4], Berman et al. first singled out the fanout problem as an autonomous subject of study. Most previous work was motivated by standard-cell and gate-array designs, where the range of inverter sizes is restricted. The problem is known to be NP-complete [1]. Moving from libraries with discrete sizes to a library with near-continuous sizes greatly simplifies the problem. By taking advantage of specific inverter area/delay relations in CMOS technology, we are able to develop efficient algorithms.

In [1], Touati introduces several types of structures to address fanout optimization. He uses Type-1 LT-trees, another name for fanout chains, and Type-2
LT-trees, which are fanout chains terminated by a two-level tree of inverters for extra buffering. Dynamic programming techniques are applied to solve for the minimal-delay implementation. Yoshikawa et al. [6] propose a variation of combinatorial merging able to generate a broad spectrum of buffer-tree topologies. Singh et al. [7] recursively generate fanout trees, which is combined with gate repowering. These algorithms are complex and slow.

By contrast, our algorithm only builds fanout chains. Its simplicity allows us to obtain better solutions when the objective is to minimize area under delay constraints. We incur little or no penalty by reducing the search space because of the freedom we have in our technology to size inverters almost continuously.

Definitions

We now state the fanout problem. The problem inputs are:

- a set of $n$ sinks, labeled $S_1$ to $S_n$. Each sink has a given polarity, a load $L_k$, and a required time $R_k$. There are $p$ sinks with positive polarity and $q$ with negative polarity;

- a driving gate, or source, with a worst-case arrival time $A$;

- a set of inverters (differing in size) from a library.

We want as solution a fanout tree of minimum area that meets the delay constraints, i.e. such that the required arrival times for all the sinks are satisfied.

We place the sinks on a simple series chain of inverters, labeled as in the figure below.

Figure 1. Inverter Chain

We point out a property of the fanout problem that simplifies the analysis. The aggregation property says that several sinks at one chain location can be replaced by a single aggregate sink at the same location. This new sink has as load the sum of the loads of the replaced sinks, and as required time the earliest of the replaced sinks' required times.

Our aggregate sinks are labeled $S_1, S_2, \ldots$, with loads $L_1, L_2, \ldots$ and required times $R_1, R_2, \ldots$. Thus, chain node $N_k$ has load $L_k$, and its signal must arrive no later than $R_k$. The buffer driving $N_k$ is labeled $B_k$, and its delay is denoted by $D_k$. Specifying the sink assignment transforms a problem into a simpler but equivalent "aggregate fanout problem."

We now indicate a crucial characteristic of the inverters in our custom library: it has a wide range of inverter sizes, so we may assume they are continuous. We define two functions: $A(C_{out}, D)$ is the area required by a single inverter to drive load $C_{out}$ with delay $D$, and $C_{in}(C_{out}, D)$ is its input capacitance. Both $A$ and $C_{in}$ are monotonically increasing in $C_{out}$ and decreasing in $D$. Typically $C_{in}$ is proportional to the area $A$, but we handle $C_{in}$ as a separate function to clarify the equations. While size can grow arbitrarily, in practice the library has only inverters up to a "reasonable" size.

Assumptions

We have two classes of assumptions. First, technology assumptions (TA) 1 to 3, which are typical of CMOS technologies over their useful range, and so valid across processes. Data supporting 1 to 3 is found in the Appendix.

1. CMOS inverters have a high gain. Specifically, over the useful range of the library:

   [equation not recovered from the source]

   This isolation ratio varies from 0.13 for small inverters up to 0.6 for large ones driving at the limit of the process speed. Afterwards, inverters become so highly self-loaded that they are useless.

2. As inverters get faster, their self-loading increases. Specifically, the marginal area cost to drive an extra load unit at constant delay increases. This area penalty grows fourfold from slow to fast inverters over their useful range. So, for the same load $C$ and two delays such that $D_1 < D_2$:

   $$\frac{\partial A}{\partial C_{out}}(C, D_1) > \frac{\partial A}{\partial C_{out}}(C, D_2)$$

3. Driving a load in a given time with a single inverter is better than splitting the load into two.

   Duplicating gates implies extra wires and less efficient routing. Load splitting is common when driving large loads with semi-custom libraries, but a custom methodology allows very large inverters. For our processes, driving a load with one inverter results in 4 to 18% less area than using two smaller inverters.

We add two methodology assumptions (MA). These assumptions come from our design methodology, rather than CMOS physical properties, and may be violated in special circumstances.

1. No gate in the network can have a delay smaller than $D_{min}$, or larger than $D_{max}$. This avoids potential coupling noise and reduces power consumption. In practice, $D_{min}$ and $D_{max}$ are within a 1:4 range.

2. Critical paths should not contain inverters. This cannot be guaranteed in all cases. However, to speed up critical paths, signal polarities are chosen to minimize the number of inverters. If two paths need a signal in opposite polarities, the most critical path gets it inversion-free.

Near-optimality of Fanout Chains

We now argue that a fanout chain is very close to the minimal area under delay constraint solution to the fanout problem. Note these are heuristic arguments, not proofs.

Intuitively, we are choosing simpler circuits. By restricting our attention to these topologies, we minimize area and routing costs, and produce efficient circuits. Also, without precise routing data, delay models are only approximate, and there is no point in minimizing exactly an approximate cost function. We avoid complexity in the pursuit of optimizing a cost function that is not completely valid. Nevertheless, since we are not considering all topologies, we may overlook the exact optimal network.

But even if a complex tree absolutely minimizes the cost function, often a simple chain comes quite close to it. Our MA 1 says that in any tree, many inverters will have similar delays, and then, according to TA 3, many can be just merged into single inverters with less area.

Ordering of Sinks

We can place the sinks at different locations in a given fanout chain as long as we respect the polarities. For $n$ sinks and $c$ locations, there are $c^n$ possible sink assignments. An exhaustive enumeration of all possible assignments is too costly. Like most other fanout algorithms, our algorithm only considers monotone sink assignments, i.e. sink assignments that place sinks along a fanout chain in order of increasing required times.

Restricting the search to monotone sink assignments is suboptimal in general, as was shown by Touati [1, p. 46]. However, his example relies critically on the use of a discrete library. With a continuous library, restricting the search to monotone sink assignments does not incur any loss of optimality, as long as our technology and methodology assumptions remain valid.

Theorem 1: For a fanout chain with sinks of both polarities, given required times and a library with almost continuous inverter sizes, the minimal area solution has the sinks of a given polarity placed in order of required times.

Before proving the theorem, we consider a special case. Assume two sink assignments for a fanout chain, $P_A$ and $P_B$, such that:

- the only difference between $P_A$ and $P_B$ is that in $P_B$, a load $\Delta L_k$ was shifted to a sink of equal polarity but further away from the source.

Lemma: If Sol-A and Sol-B denote the minimum area solutions for assignments $P_A$ and $P_B$ respectively, then Sol-B has smaller area than Sol-A.

We prove this Lemma by considering several cases. We analyze only the basic situation where a load $\Delta L_3$ is moved from $N_1$ to $N_3$, as the same arguments apply to the general case by changing indexes.

Case 1: If Sol-A has $B_1$ sized smaller than $B_3$, then Sol-B has smaller area than Sol-A.

Proof: Use the sizes from $P_A$ as a starting solution for $P_B$. It is trivial to see that this is a solution for $P_B$ that is faster than Sol-A, and it has the same area. Intuitively, as large inverters are better at handling load, the transfer gives a better solution. So the minimal area Sol-B satisfies the claim.

Case 2: If Sol-A has $B_1$ sized so it is faster than $B_3$, then Sol-B has smaller area than Sol-A.

Proof: Again, start from Sol-A. First upsize the buffer $B_3$, driving $N_3$, so that its delay with the new load $L_3 + \Delta L_3$ stays the same as it was with $L_3$. This results in a larger area for buffer $B_3$ and a larger capacitance on node $N_2$. Specifically

$$\mathrm{AreaGain}_{B_3} = A(L_3 + \Delta L_3, D_3) - A(L_3, D_3)$$

but by the mean-value theorem

$$\mathrm{AreaGain}_{B_3} = \frac{\partial A}{\partial C_{out}}(\xi_3, D_3)\,\Delta L_3$$

for some intermediate value $L_3 \le \xi_3 \le L_3 + \Delta L_3$. As the load change at node $N_2$ is only due to the new input capacitance at $N_2$

[equation not recovered from the source]

then we have the inequality

[equation not recovered from the source]

The same process is repeated back along the chain. If we examine the area gain for $B_2$, we see it involves the product of $\Delta L_3$ with two partials, $\partial A/\partial C_{out}$ and $\partial C_{in}/\partial C_{out}$. By our TA 1 and 2, we neglect $\mathrm{AreaGain}_{B_2}$ when adding up the total area change.

As for $\mathrm{AreaGain}_{B_1}$, two terms cause load changes: one due to the removal of $\Delta L_3$, and another due to the increase of the input capacitance of $B_2$. The area gain due to the larger input capacitance involves the product of two partials of the type $\partial C_{in}/\partial C_{out}$ with $\Delta L_3$, as one can see by applying the inequality above twice. Thus, we can neglect this term. After these observations, we are left with a total area gain

$$\left[\frac{\partial A}{\partial C_{out}}(\xi_3, D_3) - \frac{\partial A}{\partial C_{out}}(\xi_1, D_1)\right]\Delta L_3$$

for some $L_1 - \Delta L_3 \le \xi_1 \le L_1$. As we assumed $B_1$ is faster than $B_3$, by TA 2 we obtain that the first partial is actually smaller than the second, so we have decreased the total area. This proves the claim.

Case 3: If Sol-A has $B_1$ sized larger and slower than $B_3$, Sol-B still has smaller area than Sol-A.

Due to lack of space, we omit the proof. Now we prove Theorem 1.

Proof of the Theorem: Consider a solution with out-of-order sinks. Some sinks with late required times are at earlier points on the chain than other sinks of the same polarity with smaller required times, i.e., there are sinks $S_i$ and $S_j$, of the same polarity, on nodes $N_k$ and $N_m$ with $k < m$, such that $R_i > R_j$ (so $S_i$ is less critical). Consider now moving $S_i$ to $N_m$. By changing one sink assignment, we create a new problem $P_B$ from the original one. Note that now $P_B$ has the same (or greater) required times as the original problem: $R^B_l \ge R^A_l$ for all $l$. $R_k$ can only have improved, since we relaxed its requirements by removing a sink. But $R_m$ has not changed since we added a noncritical sink.

By the Lemma, there is a solution for the new sink assignment at least as small as the best for the out-of-order sink assignment. By repeatedly moving out-of-order sinks to an in-order location without increasing area, we eventually obtain an in-order solution no worse than the original one.

Minimal Area under Delay Constraint

We present two techniques for attacking the problem of minimal area fanout chains under a delay constraint (MADC). Each is appropriate in a different environment. In both cases, the basic algorithm consists of the following steps:

- Try all possible sink assignments compatible with increasing order of required times and sink polarity.

- For each sink assignment, apply an inverter selection algorithm and determine the solution cost.

- Pick the best solution.

We now limit ourselves to chains with at most four inverters. In practice, longer chains are rarely necessary. So, we do not consider this an undue restriction. The arguments can be easily extended to an arbitrary chain, but at a loss in computational efficiency.

As a consequence of Theorem 1, when we analyze sink assignments we can restrict our search to assignments that are ordered with respect to required times. As the number of inverters in our chain is equal to 4, we then have $O(p \cdot q)$ assignments (where $p$ and $q$ indicate the number of positive and negative sinks, respectively), instead of a number of cases that grows exponentially with sink count. If we used inverter chains of length $2K$, the number of cases to consider is $O((p \cdot q)^{K-1})$, still polynomial in the number of sinks.

Integration with a Tree-based MADC Algorithm

Optimizing fanout trees is typically done when mapping the network. Since an inverter chain is a tree, sizing the chain can be combined with a minimum area under delay constraint (MADC) tree-covering algorithm ([1], [8]). We add the fanout chain to its fanin tree to make a single tree, and optimize the entire unit using the same covering algorithm. We perform fanout chain inverter selection and tree covering in a single step.

Since algorithms such as [9] proceed from tree leaves to tree root, calculating dynamic programs, the integration works well. The dynamic programs for all non-chain nodes (the nodes in the original fanin tree) are calculated first, only once. Then, the

dynamic program for the chain nodes is calculated $O(p \cdot q)$ times (once for each ordered sink assignment) and the best overall solution is chosen. The total complexity is thus $O(\mathrm{sinks}^2 \cdot \mathrm{treenodes})$.

Note that the algorithm naturally picks the optimal length of inverter chain to use. Even though the initial network has four inverters, it may map to fewer inverters if this produces less area and it can drive the loads under the delay constraints.

Device Sizing Approach

While we did not implement this approach, we mention it as a possible solution. First, assign sinks to chain locations in order of required times. As before, there are $O(p \cdot q)$ possible selections. Once the sink locations and the chain topology are fixed, what remains is a transistor sizing problem. Both optimal and fast, near-optimal solutions to this problem have been described in the literature ([10], [11]). The complexity of the optimization techniques in [11] is $O(\mathrm{locations}^3)$, and that of the fanout generation is $O(\mathrm{sinks}^2)$; thus the net complexity for this solution is $O(\mathrm{locations}^3 \cdot \mathrm{sinks}^2)$.

Results and Conclusions

This algorithm was implemented as part of SIS, the University of California at Berkeley program [3], using a simple MADC tree covering scheme similar to [1]. We compared results on a set of ten industrial problems. In these examples, the number of sinks ranges from 2 to 6. Our library contains 10 inverters, whose sizes scale up from 1 for the smallest one to 50 for the largest. In all problems, the most critical sink is of the same polarity as the driver. SIS was run with the map -B 1 -AFG -n 1 command. Results are presented in Table 1.

From a practical perspective, the new algorithm produces better or equal results. The exception is that in most cases, SIS produces better slacks but at a very high cost in area. This is relevant only when there is negative slack, such as in example 9, but we note that in this example the difference is 0.10 ns, a negligible amount gained by a fivefold increase in area. A reason for the dramatically different area results is that SIS uses a min-delay algorithm followed by area recovery to approximate a true MADC. Using a large number of inverters often produces the fastest circuit (maybe only by an infinitesimal amount), so SIS area recovery may be saddled with an inefficient topology. Similarly, SIS chooses its sink placement to maximize slack, regardless of whether it is positive or negative. Options such as a smaller number of inverters, or a different sink placement, are not fully considered.

In summary, we have focused on the case of libraries with a large, almost continuous range of inverter sizes. For such libraries, we have provided experimental evidence that the fanout chain is typically the minimal-area configuration of a fanout tree which satisfies the delay constraints. We present simple, efficient algorithms to explore the relevant space of fanout chains to find the best one. We showed that our techniques give similar or better results than existing techniques at lower cost in software complexity.

Appendix: Technology Data

We show the delay characteristics of inverters for a 1-micron CMOS process (numbers for smaller widths scale almost perfectly). In the plot below, lines represent inverter width vs. load capacitance at a constant delay. The lowest line corresponds to a delay of 0.9 ns, and lines above show a 0.1 ns decrease, except for the top line which represents a delay of 0.35 ns. Data was produced with an internal transistor sizer similar to TILOS [10]. Its delay model is based on linear interpolation between SPICE data points, and its error relative to SPICE is about 5%.

Figure 2. Width vs Load for Constant Delay

TA 1 states that inverters have a high gain. The isolation ratio, $\partial C_{in}/\partial C_{out}$ at a given constant delay, is proportional to the slope of the corresponding line.

The 0.9 ns line gives an isolation ratio of 0.13, and the 0.35 ns line corresponds to a ratio of 0.6. Thus, reasonable speeds do have high gains.

TA 2 states that as inverters become faster, they get more self-loaded. Fixing a load $C$, we see that the constant-delay lines are initially closely spaced and then they get much further apart, showing increased area penalties for the same delay reduction.

Finally, TA 3 says that driving a load in a given time with a single buffer is better than splitting the load into two. Note that the constant delay lines have a y-intercept greater than zero. This means that a "zero load" inverter still has positive area, so using two inverters doubles this cost. The numbers do not include the wiring costs; this effect would make the gain more dramatic. This justifies TA 3.

References

[1] "Performance-Oriented Technology Mapping," H. Touati, PhD Thesis, U. California at Berkeley, 1990.

[2] "Logic Synthesis for VLSI Design," R. Rudell, Memo No. UCB/ERL M89/49, U. California at Berkeley, 1989.

[3] "Sequential Circuit Design using Synthesis and Optimization," E. Sentovich et al., Proc. ICCD 1992, pp. 328-333.

[4] "The Fanout Problem: From Theory to Practice," C.L. Berman et al., in Advanced Research in VLSI: Proc. 1989 Decennial Caltech Conference, C.L. Seitz (Ed.), MIT Press, 1989, pp. 69-99.

[5] "Lattis: An Iterative Speedup Heuristic for Mapped Logic," J. Fishburn, Proc. DAC 1992, pp. 488-491.

[6] "Timing Optimization on Mapped Circuits," T. Yoshikawa et al., Proc. DAC 1991, pp. 112-117.

[7] "A Heuristic Algorithm for the Fanout Problem," K.J. Singh and A. Sangiovanni-Vincentelli, Proc. DAC 1990, pp. 357-360.

[8] "A Near-Optimal Algorithm for Technology Mapping Minimizing Area under Delay Constraints," K. Chaudhary and M. Pedram, Proc. DAC 1992, pp. 492-498.

[9] "DAGON: Technology Binding and Local Optimization by DAG Matching," K. Keutzer, Proc. DAC 1987, pp. 341-347.

[10] "TILOS: A Posynomial Approach to Transistor Sizing," J. Fishburn and A. Dunlop, Proc. ICCAD 1985, pp. 326-328.

[11] "A Convex Optimization Approach to Transistor Sizing for VLSI Circuits," S.S. Sapatnekar et al., Proc. ICCAD 1991, pp. 482-485.

Table 1
Results

                              SIS                           new algorithm
problem  # sinks   # inverters  area  slack (ns)   # inverters  area  slack (ns)
   1        6           1        360     0.47           1        144     0.37
   2        6           1        276     0.02              - identical -
   3        6           1        276     0.02              - identical -
   4        2           1         96     1.53           1         36     1.05
   5        5           2       2820     0.01           3       2124     0.10
   6        5           2       1920    -0.22              - identical -
   7        4           4        432     0.15           3        276     0.06
   8        3           3        432     0.04           3        192     0.05
   9        5           4       3156    -0.02           3        600    -0.12
  10        6           3       1620    -0.15              - identical -
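
The area columns of Table 1 can be compared directly. The following is a small sketch, with the table values transcribed by hand, that computes the ratio of SIS area to the new algorithm's area for the six problems where the two results differ. The names `ROWS` and `area_ratios` are illustrative and do not come from the paper.

```python
# Rows of Table 1 where the two tools differ: (problem, SIS area, new-algorithm area).
# Problems 2, 3, 6 and 10 are marked "identical" in the table and are left out.
ROWS = [
    (1, 360, 144),
    (4, 96, 36),
    (5, 2820, 2124),
    (7, 432, 276),
    (8, 432, 192),
    (9, 3156, 600),
]


def area_ratios(rows):
    """Return {problem: SIS area / new-algorithm area} for the given rows."""
    return {problem: sis_area / new_area for problem, sis_area, new_area in rows}


if __name__ == "__main__":
    for problem, ratio in area_ratios(ROWS).items():
        print(f"problem {problem}: SIS area is {ratio:.2f}x the new algorithm's area")
```

For problem 9 the ratio is 3156/600 = 5.26, which matches the "fivefold increase in area" mentioned in the Results and Conclusions section.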

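
The aggregation property stated in the Definitions section (several sinks at one chain location behave like one sink whose load is the sum of their loads and whose required time is the earliest of theirs) can be sketched in a few lines. The `Sink` class and the `aggregate` function are hypothetical names chosen for illustration, and polarity is omitted on the assumption that all sinks at one location share it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sink:
    """A sink as described in the Definitions section (polarity omitted here)."""
    load: float
    required_time: float


def aggregate(sinks):
    """Replace the sinks at one chain location by a single aggregate sink.

    Load is the sum of the individual loads; required time is the earliest
    (smallest) of the individual required times.
    """
    if not sinks:
        raise ValueError("cannot aggregate an empty set of sinks")
    return Sink(
        load=sum(s.load for s in sinks),
        required_time=min(s.required_time for s in sinks),
    )


if __name__ == "__main__":
    print(aggregate([Sink(2.0, 1.5), Sink(3.0, 0.9)]))
```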

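
The enumeration of ordered sink assignments described under "Minimal Area under Delay Constraint" can be sketched as follows for a four-inverter chain. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes negative-polarity sinks may attach only at nodes N1 and N3 and positive-polarity sinks only at N2 and N4, and that a location may be left empty. The function name `monotone_assignments` and the `(name, required_time)` tuple layout are hypothetical.

```python
from itertools import product


def monotone_assignments(pos_sinks, neg_sinks):
    """Yield every monotone sink assignment for a four-inverter chain.

    Each sink is a (name, required_time) pair. Sinks of one polarity are
    sorted by required time and split into a prefix (placed at the earlier
    node) and a suffix (placed at the later node), so required times never
    decrease along the chain within a polarity.

    Assumption: negative-polarity sinks use N1/N3, positive ones use N2/N4.
    """
    pos = sorted(pos_sinks, key=lambda s: s[1])
    neg = sorted(neg_sinks, key=lambda s: s[1])
    # (q + 1) split points for negative sinks times (p + 1) for positive sinks.
    for i, j in product(range(len(neg) + 1), range(len(pos) + 1)):
        yield {"N1": neg[:i], "N2": pos[:j], "N3": neg[i:], "N4": pos[j:]}


if __name__ == "__main__":
    positives = [("a", 2.0), ("b", 1.0)]
    negatives = [("c", 0.5)]
    for assignment in monotone_assignments(positives, negatives):
        print(assignment)
```

With p positive and q negative sinks the generator yields (p + 1)(q + 1) assignments, in line with the O(p * q) count given in the text; each yielded assignment would then be passed to an inverter selection step to obtain its area cost.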