64-Bit Prefix Adders: Power-Efficient Topologies and Design Solutions
Abstract: 64-bit adders of various prefix algorithms are designed using a novel dataflow synthesis methodology. Our synthesis methodology offers robust adder solutions typically used for high-performance microprocessor needs. The power-performance tradeoffs are analyzed for a portfolio of popular adder topologies and design styles. In particular, the intrinsically sparser designs in a hierarchical prefix scheme are demonstrated to be preferable choices for both high-performance and low-power adder applications.

Index Terms: Prefix Adders, Circuit Synthesis, Design Automation, Hierarchical Circuit Design

I. INTRODUCTION

Binary addition is a fundamental operation in most digital circuits and is often on the timing critical path of modern microprocessors. As a result, adders have been studied extensively and various addition algorithms have been proposed to improve adder performance [1]. Traditionally, adders have been implemented using a hand-crafted custom design technique to achieve the desired performance target. Because custom design is a labor-intensive process, designers are forced to settle on a specific circuit architecture early in the design cycle, often based on limited information on design constraints at the time. On the other hand, since (1) there is only limited prior understanding of the performance tradeoffs of various adder designs, and (2) the design constraints may change significantly throughout the design cycle, the adder design runs the risk of major rework if it fails to meet the performance or power requirements, due to either an improperly selected circuit architecture or late design changes. Using a semi-custom design flow for adder design could automate the physical design and circuit tuning, but it still requires an early decision on adder architecture and extensive circuit work [2].

The purpose of this study is therefore twofold. First, we'll explore a novel dataflow synthesis methodology to design truly high-performance and power-efficient adders typically used for microprocessors. The ability to do so will allow the high-performance adder IP to be not only portable but also flexible for late changes. Second, we'll evaluate a portfolio of popular radix-2 adder topologies and design styles widely discussed in the literature for their power-performance characteristics. A systematic analysis of their tradeoffs within the same context, in particular by taking into account actual design practices, allows for an unbiased and objective comparison of various adder designs for the benefit of future design choices. The fast turnaround time with synthesized designs has made such a study possible.

The rest of the article is organized as follows. In Section II, we'll briefly review the popular prefix algorithms and discuss common design practices and their implications for high-performance adders. We'll discuss the adder synthesis flow used for our study in Section III. The power-delay tradeoff curves will be analyzed for various adder designs in Section IV, and we'll outline the conclusions for adder design choices in the last section.

II. PREFIX ADDER TOPOLOGIES REVISITED

Parallel adders are identified by the prefix algorithms used for carry propagation [1]. Three algorithms, Kogge-Stone (KS) [3], Sklansky (Sk) [4] and the Knowles family of adders [5], stand out for having the minimum required logic depth, at the expense of high fanout and/or wiring. The Brent-Kung (BK) [6] algorithm differs by aiming for the least area and the minimum fanout and wiring tracks, at the cost of nearly doubling the logic depth compared to the KS and Sk algorithms. Sparse prefix trees with small added logic depth are used by other prefix algorithms, Han-Carlson (HC) [7] and Ladner-Fischer (LF) [8] in particular, to avoid the excessive wiring in KS or the large fanouts in Sk. Figure 1 shows a 16-bit HC prefix tree; it has a sparseness of 2 since every other bit is not involved in the prefix operations until the last stage.

Fig. 1. The 16-bit HC prefix network. The square boxes represent the bit generator/propagator; the black and white nodes stand for the prefix operator and buffer, respectively.

Wire delays become more significant with each generation of VLSI technology. Simulations [9] and logical effort analysis [10] have both indicated that, in the presence of wiring capacitance in wide adders, the KS algorithm is slower than the properly buffered Sk algorithm [9,10], as well as the HC algorithm [10]. Some recent studies have evaluated adders with an energy-delay metric [11-13]. In particular, Patil et al. [13] suggest that the Sk topology is the most energy-efficient. The analyses in [9-13], however, do not cover practical implementations of prefix networks for wide adders, which are
always implemented hierarchically [14-15]. Hierarchy breaks a large prefix network into reusable blocks and makes the custom design more efficient. Figure 2 shows hierarchical (16/4) prefix trees for a 64-bit adder. Group carry signals from four level-1 16-bit prefix blocks are propagated by a level-2 prefix tree. The propagated group carries are then merged with the adder carry-in and the pre-computed local carries from each block to form the individual bit carries.

Fig. 2. A (16/4) hierarchical prefix tree for a 64-bit adder.

The sparseness of a hierarchical prefix scheme rebalances the fanouts and wiring in the prefix trees, generally saving gate count, area and power. Table 1 shows the effect of sparseness on the logic depth of various combinations of algorithm and hierarchical structure. The first row (Flat64) values are for uniform prefix designs. The second row (Hier16/4) applies to the hierarchical prefix scheme above. In KS/Sk designs, dense prefix trees have carry signals at every bit at all levels; introducing hierarchy adds an extra level for the final merge, but it also greatly reduces the wiring (KS) and fanout (Sk) complexities. In the third row (Hier16/4+CS) the extra stage is removed by changing the final merge to a carry-select (CS) scheme [13], at the cost of higher power. Merging the typically late-arriving adder carry-in into the group carry prefix tree uses less power to achieve the same reduction in logic depth, as seen in the last row (Hier16/4(a)).

In intrinsically sparser designs, there is already at least one logic stage to extend the directly generated carries to the other bits (Fig. 1). For these designs, adding hierarchy increases the sparseness and expands the existing merge stage to more bits.

Table 1
Logic Levels for 64-bit Radix-2 Prefix Adders

Output          Sum      Sum      Sum    Cout   Grp Car
Adder Style     KS/Sk    HC/LF    BK     All    All
Flat64          9        10       13     8      -
Hier16/4        10       10       10     8      8
Hier16/4+CS     9        9        9      8      8
Hier16/4(a)     9        9        9      8      7

Hierarchical prefix algorithms not only offer efficiency (in area and, in custom adders, design reuse), but also provide opportunities for performance gains. Given the significant differences between the prefix algorithms considered here, any reliable analysis of energy-delay characteristics of these algorithms should take into account typical design practices for high-performance wide adders. The synthesis methodology to be discussed in the next section enables a systematic and resource-efficient evaluation of all design choices that reflects typical practice.

III. ADDER SYNTHESIS FLOW

For the current work, we used the IBM in-house placement-based synthesis tool PDSRTL [16]. PDSRTL inputs are VHDL, timing constraints, pin locations and physical boundaries; the optimized output netlist is a placed TECH vim. The synthesis flow translates VHDL into a high-level netlist, then does technology-independent logic synthesis, technology mapping into specific CMOS standard cells, and timing-driven placement and optimization. PDSRTL also includes power-reduction processes which selectively reduce gate widths and substitute higher-VT gates to improve area and power efficiency. After PDSRTL, the TECH vim is imported into the Cadence framework for routing.

In prior work, Zimmermann and Tran [17] have implemented a prefix-optimization algorithm in synthesis that provides a one-size-fits-all adder architecture for any given constraints. The algorithm excludes all prefix structures with bounded fanout like KS and HC adders. In this work, we synthesize and compare each of the prefix networks.

To allow fair comparison among different adder topologies and design solutions, we coded the structure of each prefix network in VHDL, and used a VHDL attribute to prevent synthesis from changing the prefix network's topology. PDSRTL has complete freedom, however, to size and place individual cells to achieve minimal delay on timing critical paths and minimal power for all others. PDSRTL also selects device VTs for timing or power savings. In addition, PDSRTL may insert buffers and clone cells as needed; we observed these transformations in our synthesis runs using the Sk adder and hierarchical prefix designs.

We studied five popular prefix topologies (KS, Sk, HC, LF and BK). All were synthesized from both uniform and hierarchical (16/4) prefix trees. We also studied the hybrid designs mixing KS and Sk prefix trees with the carry-select scheme shown in Table 1. All adders were synthesized within the same design environment, subject to the same constraints on input/output capacitance, physical boundaries and pin locations. The area budget for the adder design is intended to be competitive with custom-designed adders without hindering the synthesis optimization, and we have therefore allowed the area utilization ratio to reach as high as 80% in the worst-case scenario, above which synthesis results become suboptimal. All adders are synthesized using standard cells built in 45nm, 0.925V CMOS SOI technology [18].

Typical synthesis runs last from an hour to a few hours depending on the timing constraints. The fast turnaround time is essential for us to provide optimal adder solutions flexible for late design changes typical in any microprocessor project.
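
The hierarchical (16/4) carry computation described in Section II can be illustrated with a short behavioural sketch in Python. This is only an illustration under stated assumptions, not the VHDL coded for this study and not PDSRTL input: the function names (combine, kogge_stone, add_hier_16_4) are hypothetical, and a dense Kogge-Stone tree is chosen arbitrarily for both the level-1 blocks and the level-2 group tree. It models logic function only and says nothing about delay, fanout or power.

```python
def combine(hi, lo):
    """Prefix operator: merge (generate, propagate) of a higher span with a lower span."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def kogge_stone(gp):
    """Dense Kogge-Stone tree: entry i becomes the (g, p) of the span from bit 0 to bit i."""
    n = len(gp)
    cur = list(gp)
    dist = 1
    while dist < n:
        # Each level combines every position with the one 'dist' places below it.
        cur = [combine(cur[i], cur[i - dist]) if i >= dist else cur[i]
               for i in range(n)]
        dist *= 2
    return cur


def add_hier_16_4(a, b, cin=0, width=64, block=16):
    """Add two 64-bit values using four 16-bit level-1 blocks and a level-2 group tree."""
    # Bit generate (x AND y) and propagate (x XOR y) signals.
    gp = []
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        gp.append((x & y, x ^ y))
    # Level 1: an independent prefix tree inside each 16-bit block (local carries).
    local = [kogge_stone(gp[s:s + block]) for s in range(0, width, block)]
    # Level 2: prefix tree over the four block-level (group) signals.
    group = kogge_stone([blk[-1] for blk in local])
    # Carry into each block: group carries merged with the adder carry-in.
    block_cin = [cin]
    for g, p in group:
        block_cin.append(g | (p & cin))
    # Final merge: local carries combined with the carry entering their block.
    total = 0
    for k, blk in enumerate(local):
        c = block_cin[k]
        for j in range(block):
            carry = c if j == 0 else blk[j - 1][0] | (blk[j - 1][1] & c)
            total |= (gp[k * block + j][1] ^ carry) << (k * block + j)
    return total, block_cin[-1]
```

For example, add_hier_16_4(0xFFFF, 1) returns (0x10000, 0): block 0 generates a group carry, which the final merge delivers to bit 16. Swapping kogge_stone for another prefix function with the same interface would change only the internal structure, not the result.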

IV. POWER-PERFORMANCE RESULTS

A. Adder Designs in Uniform Prefix Trees

Figure 3 shows power-performance tradeoffs for synthesized adders with uniform prefix trees in five topologies. Data points were obtained by varying PDSRTL delay constraints and computing the average power dissipation of the synthesized adders using the IBM power simulator CPAM [19]. The adder delay has been normalized to a fanout-of-four (FO4) inverter delay and the power was simulated with a 30% input switching factor.

[Figure 3: power (mW) versus delay (FO4s) for the HC64, SK64, KS64, LF64 and BK64 adders.]
Fig. 3. Power-delay curves for five popular adder topologies implemented in uniform prefix trees.

At low energies, synthesis uses only high-VT devices and minimizes area for power savings. The BK adder has the lowest power in this region because of its low gate count and minimal need for buffering and wiring resources. Overall, the [...] more efficient than the Sk design at moderate to low energies, it lags behind the latter in the high-performance region. As discussed earlier, the LF design is a sparser alternative to the Sk design, trading one extra stage for reduced fanout loads. However, since synthesis already provides effective buffering at high-fanout nodes, the extra logic stage inherent in the LF design becomes noticeable at high performance requirements.

For KS adders, besides the power issue, we have also observed that adder delay becomes less predictable in the neighborhood of 14 FO4s. By its nature, the dense KS design requires the most gates, driving area utilization over 80% at the tightest timing constraints. With limited free space, synthesis returns sub-optimal gate placement and sizing. To verify that the area constraint is responsible for delay fluctuations in the KS design, we synthesized the same adders with a 50% bigger area budget at timing targets of 15 FO4s or less. While the other adders had nearly identical or only marginally better timing, the KS adder had significantly better and more predictable timing with the larger area budget: 12.75 FO4s at 15 mW. Figure 4 shows the power-delay curves for KS designs with and without the additional area budget.

[Figure 4: power-delay curves labeled KS64 and KSbig.]

[...] 12.5 FO4 for all algorithms. This difference can be explained as follows: 1) With hierarchy, the logic depths for group carries and sum outputs are the same for all four adders. 2) The level-2 group prefix trees differ very little if at all, because of the small number of bits involved. In fact, the HC, Sk, and LF adders have the same level-2 prefix structure. 3) The differences between the topologies are mostly in the (noncritical paths in the) 16-bit prefix blocks, and affect minimum power more than delay.

[Figure 5: power (mW) versus delay (FO4s) for the SK16/4, KS16/4, LF16/4, HC16/4, CSSK and CSKS adders.]
Fig. 5. Power-delay curves for 64-bit adders in hierarchical (16/4) prefix implementation.

Because the logic depths become leveled in our hierarchical prefix adder designs, intrinsically sparser designs (e.g., HC and LF) ultimately offer not only lower power but also more competitive performance. They are better able to reduce the internal wiring and fanout pressures without being penalized for latency. The HC adder is already a preferred design choice even in the uniform prefix scheme; our results for hierarchical schemes further support that preference. The LF adder, on the other hand, has poorer performance than the Sk adder with a uniform prefix network, but becomes competitive with hierarchical prefix trees.

In Figure 5, we've also included power-delay curves for two hybrid designs mixing hierarchical Sk/KS prefix trees and the carry-select scheme described in Table 1 (Hier16/4+CS). This scenario has frequently been adopted in custom adder designs for high-performance applications. Our synthesis results do not demonstrate any performance advantage for this scenario, although the extra energy cost is clearly evident because of the excessive logic involved in the conditional sum calculation and carry-(0/1) propagations.

V. CONCLUSIONS

We have applied a unique dataflow synthesis methodology to high-performance adder designs, and have implemented 64-bit adder designs for a portfolio of popular prefix algorithms and design styles. Hierarchical prefix adders offer significant delay and power advantages over uniform prefix adders. In particular, the intrinsically sparse HC and LF adders are the preferred design choices for high energy efficiency and high performance. LF and BK adders provide the best solutions for low-power applications.

In 45nm 0.925V CMOS SOI technology, the synthesized adder designs are capable of performing 64-bit binary addition in 12.5 FO4 delays while dissipating about 11 mW, comparable in performance to a similarly designed custom adder while saving power. The fast turnaround time of the synthesis methodology offers a significant advantage for late adjustments in design, responding to late changes in delay or power constraints, or floorplan reorganizations. Our synthesized adders have been adopted for use in multiple IBM designs now in development.

ACKNOWLEDGMENT

The authors gratefully acknowledge the financial support of DARPA through Contract B554331. The authors appreciate fruitful discussions with David Geiger, Tom Fox, Pong-Fei Lu, Matt Ziegler and many IBM colleagues for this work.

REFERENCES

[1] D. Harris, "A taxonomy of parallel prefix networks," Proc. 37th Asilomar Conf. Signals, Systems and Computers, pp. 2213-2217, Nov. 2003.
[2] P-F. Lu, G. Northrop and K. Chiarot, "A semi-custom design of branch address calculator in the IBM Power4 microprocessor," IEEE VLSI-TSA Intern. Symp. on VLSI-DAT, pp. 327-329, 2005.
[3] P. Kogge and H. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence relations," IEEE Trans. Computers, C-22(2), pp. 786-793, Aug. 1973.
[4] J. Sklansky, "Conditional sum addition logic," IRE Trans. Electronic Computers, vol. EC-9, pp. 226-231, June 1960.
[5] S. Knowles, "A family of adders," Proc. 14th Symp. Computer Arithmetic, Adelaide, Australia, pp. 30-34, April 1999.
[6] R. Brent and H. Kung, "A regular layout for parallel adders," IEEE Trans. Computers, C-31(3), pp. 260-264, March 1982.
[7] T. Han and D. Carlson, "Fast area-efficient VLSI adders," Proc. 8th Symp. Comp. Arith., pp. 49-56, Sept. 1987.
[8] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," J. of ACM, vol. 27, pp. 831-838, Oct. 1980.
[9] Z. Huang and M. D. Ercegovac, "Effect of wire delay on the design of prefix adders in deep-submicron technology," Proc. 34th Asilomar Conf. Signals, Systems & Computers, vol. 2, pp. 1713-1717, 2000.
[10] D. Harris and I. Sutherland, "Logical effort of carry propagate adders," Proc. 38th Asilomar Conf. on Signals, Systems & Computers, pp. 873-878, Nov. 2003.
[11] V. G. Oklobdzija, B. T. Zeydel, H. Q. Dao, S. Mathew and R. Krishnamurthy, "Comparison of high-performance VLSI adders in the energy-delay space," IEEE Trans. on VLSI Systems, vol. 13, pp. 754-758, 2005.
[12] R. Zlatanovici and B. Nikolic, "Power-performance optimal 64-bit carry-lookahead adders," Proc. 29th European Solid State Circuit Conf., pp. 321-324, 2003.
[13] D. Patil, O. Azizi, M. Horowitz, R. Ho and R. Ananthraman, "Robust energy-efficient adder topologies," Proc. 18th Symp. Comp. Arith., pp. 16-28, 2007.
[14] X. Yu, et al., "A 5GHz+ 128-bit binary floating-point adder for the POWER6 processor," Proc. of European Solid-State Circuits Conf., Montreux, Switzerland, pp. 166-169, 2006.
[15] J. Park, H. S. Ngo, J. A. Silberman and S. H. Dhong, "470ps 64bit parallel binary adder," Symp. on VLSI Circuits, pp. 192-193, 2000.
[16] L. Trevillyan, D. Kung, R. Puri, L. N. Reddy, and M. A. Kazda, "An integrated environment for technology closure of deep-submicron IC designs," IEEE Design and Test of Computers, vol. 21, pp. 14-22, 2004.
[17] R. Zimmermann and D. Q. Tran, "Optimized synthesis of sum-of-products," Proc. 37th Asilomar Conf. Signals, Systems & Computers, pp. 1-6, 2003.
[18] S. Narasimha, et al., "High-Performance 45nm SOI Technology with Enhanced Strain, Porous Low-k BEOL, and Immersion Lithography," 2006 IEDM, pp. 1-4, Dec. 2006.
[19] J. S. Neely, H. H. Chen, S. G. Walker, J. Venuto and T. J. Bucelot, "CPAM: A common power analysis methodology for high-performance VLSI design," Proc. IEEE 9th Topical Meetings on Electrical Performance of Electronic Packaging, pp. 303-306, Oct. 2000.