
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 10, OCTOBER 2014 1517

Towards Optimal Performance-Area Trade-Off in Adders by Synthesis of Parallel Prefix Structures
Subhendu Roy, Student Member, IEEE, Mihir Choudhury, Member, IEEE, Ruchir Puri, Fellow, IEEE,
and David Z. Pan, Fellow, IEEE

Abstract—This paper proposes an efficient algorithm to synthesize prefix graph structures that yield adders with the best performance-area trade-off. For designing a parallel prefix adder of a given bit-width, our approach generates prefix graph structures to optimize an objective function such as size of prefix graph subject to constraints like bit-wise output logic level. Given bit-width n and level (L) restriction, our algorithm excels the existing algorithms in minimizing the size of the prefix graph. We also prove its size-optimality when n is a power of two and $L = \log_2 n$. Besides prefix graph size optimization and having the best performance-area trade-off, our approach, unlike existing techniques, can 1) handle more complex constraints such as maximum node fanout or wire-length that impact the performance/area of a design and 2) generate several feasible solutions that minimize the objective function. Generating several size-optimal solutions provides the option to choose adder designs that mitigate constraints such as wire congestion or power consumption that are difficult to model as constraints during logic synthesis. Experimental results demonstrate that our approach improves performance by 3% and area by 9% over even a 64-bit full custom designed adder implemented in an industrial high-performance design.

Index Terms—Bottom-up approach, logic synthesis, parallel prefix adder, performance-area trade-off.

Manuscript received December 8, 2013; revised March 26, 2014 and June 11, 2014; accepted June 13, 2014. Date of current version September 16, 2014. This paper was recommended by Associate Editor J. Cortadella.
S. Roy and D. Z. Pan are with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA (e-mail: [email protected]; [email protected]).
M. Choudhury and R. Puri are with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2014.2341926

I. INTRODUCTION

DATAPATH logic constitutes a significant portion of a general purpose microprocessor and frequently occurs on the timing-critical paths in high-performance designs. Arithmetic components, such as adders, multipliers, and shifters, are the basic building blocks in datapath logic and hence, to a great extent, dictate the performance of the entire chip. Binary addition is one of the most fundamental and widely used arithmetic operations in microprocessors. Today, adders are designed in two ways: either manually through full custom design or in an automated manner using synthesis tools. In a custom adder design methodology, a designer has to manually choose between regular adder structures such as Kogge–Stone [1], Sklansky [2], Brent–Kung [3], Han–Carlson [4], and tune physical design parameters such as placement, gate sizing, and buffer optimization to maximize performance under power constraints for the target technology [5], [6]. Hence, custom adder design methodology is expensive, takes a long time to converge to a satisfactory design, and is inflexible to late design changes.

In contrast, an automated synthesis approach is productive and flexible to late design changes but traditionally has lagged behind in performance as compared to custom designs. Therefore, the prevalent design approach for high-performance datapath logic continues to be custom design. In the past, several algorithms have been proposed to generate parallel prefix adders targeting minimization of the size of the prefix graph (s) under given bit-width (n) and logic level (L) constraints. A prefix graph is said to be zero deficiency if $s + L = 2n - 2$. Snir [7] has proved this theoretical bound for $L \ge 2\log_2 n - 2$ with uniform input profile. In [8], zero-deficiency prefix graphs Z(L) are proposed, where Z(L) has the provable maximum bit-width for a given depth L among all zero-deficiency prefix circuits. The bit-width of the Z(L) circuit is given by $N_Z(L) = F(L + 3) - 1$ (F denotes the Fibonacci function) for $L > 1$. Compared to [7], [8] indeed gives a more general bound for the size of the prefix graphs. For instance, $N_Z(6) = 33$, so for a prefix graph of bit-width 32 and level 6, the minimum achievable size is $s_{\min} = 32 \cdot 2 - 2 - 6 = 56$, which Snir fails to give as $6 < 2 \cdot 5 - 2$.

Ladner and Fischer [9] present a recursive construction of parallel prefix graphs to obtain a trade-off between s and L, but it could not even achieve the bound provided by [7]. Other existing algorithms like a greedy depth-decreasing heuristic [10], dynamic programming based approaches [11], [12], or non-heuristic optimization [13] could achieve this bound for some cases but yield sub-optimal results as logic level constraints are reduced (e.g., to $\log_2 n$), which is more relevant for high performance adders. In [12], an algorithmic approach is proposed to achieve minimal delay at all output bits for uniform/non-uniform input profile, although this paper does not focus on minimizing the size of the prefix graph. Reference [13] presents an algorithm for the generation of parallel prefix structures for arbitrary level constraints to minimize the size, but it fails to get size-optimal solutions for levels closer to $\log_2 n$. Reference [14] proposes logarithmic adder structures with a fan-out of 2, and presents a model to analyze the area-delay product of those structures. However,
0278-0070 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

the key limitation of [14] is that these parallel prefix structures have more than $\log_2 n$ levels leading to a compromise in performance. Reference [15] attempts to generate a family of adder structures for $\log_2 n$ levels, but that does not give the size-optimal solutions. In [16], an exhaustive approach is attempted to explore the optimal arithmetic-circuit architectures through selective factorization, but it is very limited in terms of scalability.

The most recent approach [11], which uses dynamic programming (DP) on a restricted search space to generate a seed prefix graph followed by an area-heuristic to further reduce the size of the seed prefix graph, is the most effective in minimizing the size of the prefix graphs. However, the quality of the area-heuristic solution depends on the selection of seed solution from DP, which is not unique. Furthermore, this algorithm cannot handle fanout/wire-length constraints on nodes in the prefix graph or arrival/required time constraints on individual input/output bits that impact the performance, area, and power consumption of the adder after physical design.

To tackle these issues, this paper proposes an efficient algorithm to generate prefix graphs for synthesizing adders with the best performance-area trade-off. In this approach, prefix graph structures are constructed in bottom-up fashion by exhaustively generating all possible (n + 1)-bit prefix graphs from n-bit prefix graphs. For scalability to large adders up to 128 bits, our approach proposes a novel compact data structure for manipulating prefix graphs, efficient memory management techniques like lazy copy for storing several prefix graph solutions, and search space reduction strategies like level-restriction, dynamic size pruning, and repeatability pruning for targeting prefix graph structures relevant for achieving the best performance-area trade-off. Furthermore, we have described a method to generate size-optimal solutions for any $2^m$-bit adder with level restriction of m. Compared to existing algorithms our approach has the following advantages.
1) It provides a way to generate size-optimum prefix graph structures for a $2^m$-bit adder with level m and theoretically proves its optimality.
2) It is more effective than all existing algorithms in minimizing the size of the prefix graph for given bit-width n and arbitrary logic level, including bitwise input/output logic level constraints.
3) It provides greater opportunity for improving performance of the adder because the algorithm can handle fanout/wire-length constraints on nodes in the prefix graph and arrival/required time constraints on individual input/output bits.
4) It generates many candidate prefix graph structures for a given set of constraints, which can also be evaluated for placement and wiring congestion to yield efficient physical and routing implementation.

The rest of the paper is organized as follows. Section II describes binary addition as a prefix graph problem. Section III presents our algorithm for generating prefix graph structures. Section IV presents the results of this approach with a conclusion in Section V.

II. PRELIMINARIES

Given an ordered n inputs $x_0, x_1, \ldots, x_{n-1}$ (where $x_{n-1}$ is the most significant bit or MSB and $x_0$ is the least significant bit or LSB) and an associative operation $\circ$, prefix computation of n outputs is defined as follows:

$$y_i = x_i \circ x_{i-1} \circ \cdots \circ x_0 \quad \forall i \in [0, n-1] \tag{1}$$

where the i-th output depends on all previous inputs $x_j$ ($j \le i$).

A prefix graph of width n is a directed acyclic graph (with n inputs/outputs) whose nodes correspond to the associative operation $\circ$ in the prefix computation and there exists an edge from node $v_i$ to node $v_j$ if $v_i$ is an operand of $v_j$. Fig. 1 represents a prefix graph for 6 bits. In this example, we can write $y_5$ as

$$y_5 = i_1 \circ y_3 = (x_5 \circ x_4) \circ (i_0 \circ y_1) = (x_5 \circ x_4) \circ ((x_3 \circ x_2) \circ (x_1 \circ x_0)). \tag{2}$$

Fig. 1. Prefix graph representation.

Next, we will explain this prefix graph in the context of binary addition.

The binary addition problem is defined as follows [17]: given an n-bit augend $A = a_{n-1} \ldots a_1 a_0$ and an n-bit addend $B = b_{n-1} \ldots b_1 b_0$, compute the sum $S = s_{n-1} \ldots s_1 s_0$ and carry out $C_{out} = c_{n-1}$, where $s_i = a_i \oplus b_i \oplus c_{i-1}$ and $c_i = a_i b_i + a_i c_{i-1} + b_i c_{i-1}$.

With bitwise (group) generate function g (G) and propagate function p (P), n-bit binary addition can be mapped to a prefix computation problem with three components as follows [18].
1) Preprocessing: Bitwise g, p generation

$$g_i = a_i \cdot b_i \quad \text{and} \quad p_i = a_i \oplus b_i. \tag{3}$$

2) Prefix-Processing: The concept of generate/propagate is extended to multiple bits and $G_{[i:j]}$, $P_{[i:j]}$ ($i \ge j$) are defined as

$$P_{[i:j]} = \begin{cases} p_i & \text{if } i = j \\ P_{[i:k]} \cdot P_{[k-1:j]} & \text{otherwise} \end{cases} \qquad G_{[i:j]} = \begin{cases} g_i & \text{if } i = j \\ G_{[i:k]} + P_{[i:k]} \cdot G_{[k-1:j]} & \text{otherwise.} \end{cases} \tag{4}$$

The computation for (G, P) is expressed in terms of the associative operation $\circ$ as

$$(G, P)_{[i:j]} = (G, P)_{[i:k]} \circ (G, P)_{[k-1:j]} = \left(G_{[i:k]} + P_{[i:k]} \cdot G_{[k-1:j]},\; P_{[i:k]} \cdot P_{[k-1:j]}\right). \tag{5}$$

3) Post-Processing: Sum generation

$$s_i = p_i \oplus c_{i-1} \quad \text{and} \quad c_i = G_{[i:0]}. \tag{6}$$

Among the three components of the binary addition problem, both preprocessing and postprocessing parts are fixed structures. However, $\circ$, being an associative operator, provides the flexibility of grouping the sequence of operations in the prefix processing part and executing them in parallel. So the structure of the prefix graph determines the extent of parallelism.
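The three components in (3)–(6) can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names (`preprocess`, `prefix_op`, `serial_prefix`, `balanced_span`, `prefix_add`) are hypothetical, a carry-in of 0 is assumed, and the serial grouping is only one of the many groupings that associativity permits.

```python
def preprocess(a, b, n):
    """Eq. (3): bitwise generate/propagate pairs (g_i, p_i), LSB first."""
    return [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1)
            for i in range(n)]


def prefix_op(hi, lo):
    """Eq. (5): (G, P)[i:k] o (G, P)[k-1:j]."""
    (g_hi, p_hi), (g_lo, p_lo) = hi, lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def serial_prefix(gp):
    """(G, P)[i:0] for every i, grouping strictly from the LSB upward."""
    out = [gp[0]]
    for i in range(1, len(gp)):
        out.append(prefix_op(gp[i], out[i - 1]))
    return out


def balanced_span(gp, i, j):
    """(G, P)[i:j] using a balanced split point k with i >= k > j."""
    if i == j:
        return gp[i]
    k = (i + j + 1) // 2
    return prefix_op(balanced_span(gp, i, k), balanced_span(gp, k - 1, j))


def prefix_add(a, b, n):
    """Return (sum mod 2**n, carry out) via Eqs. (3)-(6), carry-in assumed 0."""
    gp = preprocess(a, b, n)
    carries = [g for g, _ in serial_prefix(gp)]   # c_i = G[i:0]
    total = 0
    for i in range(n):
        c_prev = carries[i - 1] if i > 0 else 0   # c_{-1} = 0 (assumption)
        total |= (gp[i][1] ^ c_prev) << i         # s_i = p_i xor c_{i-1}
    return total, carries[n - 1]


print(prefix_add(11, 6, 4))  # (1, 1), i.e. 11 + 6 = 17 = 16 + 1
```

Because the operator is associative, `balanced_span(gp, i, 0)` returns the same pair as `serial_prefix(gp)[i]`; only the grouping, and hence the depth of the computation, differs.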
Fig. 2. Prefix graph restriction.

At the technology independent level, size of the prefix graphs (# of prefix nodes) gives the area measure and the logic levels of the nodes estimate roughly the timing. It is important to note that the actual timing depends on other parameters as well like fan-out distribution and size of the prefix graph. Smaller sizes of prefix graph offer better flexibility during post-synthesis optimizations such as gate sizing, buffer insertion etc.
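As an illustration of these two technology-independent measures, the sketch below counts prefix nodes and logic levels for three textbook structures. The representation is an assumption made for this example only, not the paper's data structure: a graph is a list of `(i, j)` steps, each adding a node in column `i` that combines the current top nodes of columns `i` and `j`. The helper names (`measure`, `ripple`, `sklansky`, `kogge_stone`) are hypothetical.

```python
def measure(n, schedule):
    """Return (size, max logic level) of the prefix graph built by `schedule`."""
    level = [0] * n        # level of the current top node in each column
    lsb = list(range(n))   # LSB spanned by the current top node in each column
    size = 0
    for i, j in schedule:
        # The two fan-ins must cover adjacent spans [i:k] and [k-1:j'], as in (5).
        assert lsb[i] - 1 == j, "fan-in spans must be adjacent"
        level[i] = max(level[i], level[j]) + 1
        lsb[i] = lsb[j]
        size += 1
    assert all(x == 0 for x in lsb), "every output must span down to bit 0"
    return size, max(level)


def ripple(n):
    """Serial structure: output i combines input i with output i-1."""
    return [(i, i - 1) for i in range(1, n)]


def sklansky(n):
    """Divide-and-conquer structure; column i joins the top of the block below it."""
    sched, d = [], 1
    while d < n:
        sched += [(i, (i // d) * d - 1) for i in range(n) if i & d]
        d *= 2
    return sched


def kogge_stone(n):
    """Column i joins column i-d at each level; high columns are updated first."""
    sched, d = [], 1
    while d < n:
        sched += [(i, i - d) for i in range(n - 1, d - 1, -1)]
        d *= 2
    return sched


for name, build in [("ripple", ripple), ("sklansky", sklansky),
                    ("kogge_stone", kogge_stone)]:
    print(name, measure(8, build(8)))
# ripple (7, 7)
# sklansky (12, 3)
# kogge_stone (17, 3)
```

For n = 8, the ripple structure satisfies s + L = 2n − 2 = 14, while the two 3-level structures pay for their lower depth with more nodes. Fan-out, which also affects timing, is not tracked in this sketch.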
Equations (3)–(6) represent the Weinberger recurrence equation [19] for carry-propagation. Ling adders [19], [20] have been proposed as an alternative in the past by transforming these equations, which have provided better performance. Since there is a direct mapping between Weinberger’s equations and Ling’s equations [20], one can explore the Ling implementation of any prefix network, such as Sklansky, Kogge–Stone, etc. As another design alternative, sparse tree-adders have also been used in [21] for specific applications; however, this needs conditional sum generators as additional design blocks. In (2) or Fig. 1, we can see that the number of fan-ins for each associative operation $\circ$ is two and thus it is often termed a radix-2 implementation of the prefix network. However, there exist other choices such as radix-3 or radix-4 implementation, but the complexity is very high and not beneficial in static CMOS circuits [22]. In [23] and [24], fast domino adders are implemented using a radix-4 Ling network, but domino logic has been phased out due to the high power consumption. Reference [25] demonstrates that radix-2 implementation is indeed the most energy-efficient. An implementation of a mixed-radix Jackson adder has also been shown to be inefficient in terms of energy/area [22].

III. OUR APPROACH

This section describes a compact data structure for storing and manipulating a prefix graph, efficient memory management strategies for storing several prefix graph solutions, and pruning strategies to scale our approach up to 128 bit adders. We also prove the size-optimality of any $2^m$-bit prefix graph with level m, generated by our approach, by incorporating several additional pruning strategies and ensuring that any pruning does not degrade the optimality of the solution. Any prefix graph solution is said to be size-optimum under certain restrictions, if the size of the prefix graph is minimum with those restrictions.

Due to the associative nature of the prefix operation $\circ$, each output for bit-index i can be constructed by combining the previous input bits 0, 1, ..., i in any way keeping their relative orders intact and the number of possible ways is catalan(i), where $\mathrm{catalan}(i) = \frac{1}{i+1}\binom{2i}{i}$. Let $G_n$ denote the set of all possible prefix graphs with bit-width n. Then the size of $G_n$ grows exponentially with n and is given by $\mathrm{catalan}(n-1) \cdot \mathrm{catalan}(n-2) \cdots \mathrm{catalan}(0)$. For example, $|G_8| = 332972640$ and $|G_{12}| = 2.29 \times 10^{24}$. However, we will be exploring the set of prefix graphs with the following restrictions.
1) One of the fan-in nodes of any prefix node is the most recent node sharing the same MSB with that of the prefix node. For instance, in Fig. 2, x cannot be a fan-in node of z. Alternatively y and c can be combined to form z. So each prefix node (p) in a prefix graph has 2 fan-in nodes. One node is vertically above p having the same MSB as that of p; we define it as the trivial fan-in (tf), and the other node is termed the non-trivial fan-in node (ntf). For instance, a and c are respectively the trivial and non-trivial fan-in nodes of b.
2) The prefix-graph is non-overlapping, i.e., for any prefix node, LSB(tf) − MSB(ntf) = 1. However, the idempotency property can be used to generate correct and overlapping prefix trees [15].

But we impose these restrictions to reduce the search space and at the same time attempt to generate the potential candidate prefix trees which could give the best performance/area trade-off after placement/routing. We denote this set of non-overlapping prefix graphs as PG.

However, the search space is still huge and we require a compact data structure, efficient memory management, and search space reduction techniques to scale this approach.

A. Compact Notation and Data Structure

We represent the prefix graph by a sequence of indices (seq), where each index represents a prefix node and it is the MSB of that node. Fig. 3 illustrates the compact notation, where the sequence is determined in topological order, and in addition, precedence is given to higher significant bits in the sequence of indices. Let SEQ be the set of all sequences representing any prefix graph. Suppose VS is the set of valid sequences in our approach, where the restriction of left-to-right precedence is imposed in addition to topological ordering, inherent in SEQ. For instance, in Fig. 3 (right side), indices {3,1} and {3,2} occur at first and second topological levels respectively. With only topological ordering, 4 sequences are possible: “3132” ($N_1 N_2 N_3 N_4$), “3123” ($N_1 N_2 N_4 N_3$), “1332” ($N_2 N_1 N_3 N_4$), “1323” ($N_2 N_1 N_4 N_3$). Thus all 4 sequences belong to SEQ. But since “3” is given precedence over “1” and “2” at the first and second topological levels respectively, the only valid sequence or the only element of VS here is “3132”. So although the mapping from SEQ to PG is many-to-one, the mapping from VS, a subset of SEQ, to PG is 1 − 1 and bijective as well (Fig. 4). Later, we will formally prove this bijective relationship.

Fig. 3. Compact notation for a prefix graph.

Algorithm 1 presents a procedure “checkValidSequence(seq, n),” which returns “true” if seq ∈ VS representing an n bit

prefix graph.

Algorithm 1 Procedure to Check if seq ∈ VS
 1: Procedure checkValidSequence(seq, n);
 2: for i = 0 to n − 1 do
 3:    bitSpan(i) = i;
 4: end for
 5: for all index ∈ seq from left to right do
 6:    if index > lastIndex and bitSpan(index) − 1 ≠ lastIndex then
 7:       return false;
 8:    end if
 9:    bitSpan(index) = bitSpan(bitSpan(index) − 1);
10: end for
11: for i = 1 to n − 1 do
12:    if bitSpan(i) ≠ 0 then
13:       return false;
14:    end if
15: end for
16: return true;
17: end Procedure

Here bitSpan(i) at any instant of traversing the sequence represents the LSB of the node with index i, having maximum logic level at that instant. So when we start traversing seq, bitSpan(i) is equal to i and bitSpan(i) should be equal to 0 when the entire sequence is traversed. Lines 2-4 initialize bitSpan(i) with i representing the input nodes. Lines 11-14 check whether seq represents a prefix graph by ensuring that the LSB of each output node is 0, whereas Lines 6-8 check the topological left to right ordering, with lastIndex denoting the index visited immediately before the current one. For instance, for the sequence “3123,” when the second “3” is visited, then index = 3 > 2 = lastIndex indicating right-to-left ordering. So the node represented by this “3” should topologically depend on the node represented by “2” and bitSpan(3) − 1 should be equal to “2” to maintain the topological left-to-right ordering, but bitSpan(3) − 1 = 2 − 1 ≠ 2. So “3123” is not a valid sequence. On the contrary, for the example sequence “3132,” when the second 3 is visited, index = 3 > 1 = lastIndex, but bitSpan(3) − 1 = 2 − 1 = 1 = lastIndex. For other indices in the same sequence, index < lastIndex. So the condition for Line 6 in Algorithm 1 is not satisfied for any of the indices and “3132” is determined as a valid sequence.

On the other hand, we can construct a prefix graph by traversing the sequence of indices from left to right in the following way: for each index i in the sequence, we add a node p which is derived from 2 nodes: the most recent node r with index i (or input bit i) and the node just before p in the sequence (or the input bit LSB(r) − 1). For instance, in the sequence “3132” in Fig. 3, the node for first “3” is constructed from input bits 3 and 2, whereas that for second “3” is constructed from the node for first “3” and the node (with index 1) just before it.

Lemma 1: The mapping from VS to PG is 1 − 1, i.e., if $s_1, s_2 \in VS$ represent the same prefix graph in PG, then $s_1 = s_2$.

Proof: First, we will show that if we enumerate the prefix nodes of a prefix graph in a topological order from left to right, then the order of the list of the prefix nodes is fixed. For instance, in Fig. 3 (right side), this fixed order is $N_1 N_2 N_3 N_4$. Once we prove this, the sequence representation is guaranteed to be unique as each index in the sequence corresponds to the MSB of each prefix node. We will prove this by induction. We consider that the order of the prefix nodes is fixed till some node $x_n$ in the list. At this point, we will have a set of topologically dependent prefix-nodes ($S_t$) for which both trivial and non-trivial fan-in are either any node in the list till $x_n$ or any input node. So the next node in the list will be any one node in $S_t$ and this node will be shown unique. Since the trivial fan-in of any prefix node is the most recent node with the same MSB as that of the prefix node, for any two nodes $x_i, x_j \in S_t$, $\mathrm{MSB}(x_i) \ne \mathrm{MSB}(x_j)$, otherwise either $x_i$ would topologically depend on $x_j$ or $x_j$ on $x_i$, which is not possible. This implies that there exists a unique prefix node $x_u \in S_t$, such that $\mathrm{MSB}(x_u)$ is maximum and the next node in the list is $x_u$. Note that for the base case of induction, i.e., when the list is empty, the first node corresponding to the first element in the sequence is the node in the sequence having highest MSB with logic level 1 and thus unique as well.

Corollary 1: There exists a bijective mapping between VS and PG.

Proof: For any prefix graph in PG there exists a sequence representation following topological ordering from left to right. So the mapping is surjective. Also, it follows from Lemma 1 that the mapping from VS to PG is injective. Hence the Corollary is proved.

Fig. 4. Bijective mapping between VS and PG.

Apart from storing the index, we also need to track the LSB, level, and fanout for each node in the prefix graph. We store all this information using a single integer for each node, and represent a prefix graph by a list/sequence of integers. Since we want to explore adders up to 128 bits and provision a carry-in as the 129th bit, we reserve 8 bits ($\lceil \log_2(129) \rceil$) each for index, level, fanout, and LSB. Thus, all information for a node can be stored in a single integer as shown in Fig. 5.

Fig. 5. Bit slicing.

This compact data structure helps in reducing memory usage and runtime (due to faster copy/delete operation for a prefix node) as compared to using a structure to store index, LSB, level, and fanout as individual integers.

B. Exhaustive Bottom-Up Enumeration

We start from a prefix graph of 2 bits (represented by a single index sequence “1”) and construct the prefix graph structures for higher bits in an inductive way, i.e., given all possible prefix graphs ($G_n$) for n bit, we construct all possible prefix graphs ($G_{n+1}$) of n + 1 bit. The process of generating such graphs of n + 1 bit from an element of $G_n$ by inserting n at

appropriate positions is a recursive procedure. Fig. 6 explains this for an element “12” of $G_3$ with the help of a recursion tree. At the beginning of this recursive procedure (RP), we have a sequence “12” (node 1) with an arrow on “1.” The arrow points to the index before which “3” can be inserted. At any stage, there are two options, either insert “3” and call RP, or move the arrow to a suitable position and then call RP. This position is found by iterating the list/sequence in forward direction until searchIndex (= LSB(recentNode(3)) − 1) is found, where recentNode(i) signifies the most recent node with index i in the sequence. The left subtree denotes the first option and the right subtree indicates the second option. So the procedure either inserts “3” at the beginning of “12” and goes to node 2 or it goes to node 7 by moving the arrow to the appropriate position. We can see that searchIndex = LSB(recentNode(3)) − 1 = 3 − 1 = 2 for this case. Similarly, for node 2, the searchIndex has become 2 − 1 = 1, and so this procedure either inserts “3” (in node 3) or shifts the pointer after “1” (in node 5). The traversal is done in preorder and this recursion is continued until LSB(recentNode(3)) becomes “0” or alternatively, a 4 bit prefix graph is constructed. The right subtree of a node is not traversed if a prefix graph for 4 bits has been constructed at the left child of the node. For example, we do not traverse the right subtree of node 3 and node 5.

Fig. 6. Illustrative example.

Algorithm 2 illustrates the steps of this exhaustive enumeration technique. The algorithm preserves the uniqueness of the solutions by inserting the indices at appropriate positions. In the “buildRecursive” procedure, nodeList is an STL list (insert and erase operations are thus O(1) operations), recentNode is passed as a parameter which is used to find searchIndex and to track if a solution has been generated. currIter is the iterator corresponding to ↓ in Fig. 6. The return value of the procedure is true when nodeList is a solution of $G_{n+1}$, thereby indicating that the right subtree of the parent of nodeList does not require traversal.

Algorithm 2 Exhaustive Bottom-Up Enumeration
 1: // Given Gn construct Gn+1 . . .
 2: Procedure buildBottomUp(Gn)
 3: for all g ∈ Gn do
 4:    buildRecursive(g, null, g.begin, n);
 5: end for
 6: end Procedure
 7: Procedure buildRecursive(nodeList, recentNode, currIter, index)
 8: if recentNode ≠ null and LSB(recentNode) = 0 then
 9:    save solution nodeList in Gn+1;
10:    return true;
11: end if
12: searchIndex ← LSB(recentNode) − 1;
13: newIter ← nodeList.insert(currIter, index);
14: newNode ← value at newIter;
15: flag ← buildRecursive(nodeList, newNode, currIter, index);
16: if flag = true then
17:    return false;
18: end if
19: nodeList.erase(newIter);
20: repeat
21:    node ← value at currIter;
22:    currIter ← currIter + 1;
23: until MSB(node) = searchIndex and currIter = nodeList.end
24: buildRecursive(nodeList, recentNode, currIter, index);
25: end Procedure

Theorem 1: The bottom-up enumeration in Algorithm 2 is exhaustive and non-repetitive.

Proof: We construct all possible prefix graphs of bit-width n + 1 from any element of $G_n$, by inserting n at appropriate positions. At any instant, say the arrow is pointed to a node $x_i$ and either we insert n before $x_i$ or we forward the pointer in the sequence for the next possible insertion point, and suppose the next insertion position be after $x_p$, i.e., $x_p$ is the first node in the sequence after $x_i$, such that searchIndex = $\mathrm{MSB}(x_p)$. If we can prove the proposition that inserting n at any other intermediate position does not follow the topological left-to-right ordering, then we are generating all sequences following the topological left-to-right ordering (VS), and since the mapping from VS to any prefix graph of our consideration (PG) is bijective (by Corollary 1), it would be sufficient to infer that Algorithm 2 is exhaustive. Also, this bijective mapping from VS to PG ensures that we are generating non-repetitive prefix graph solutions of $G_{n+1}$.

Suppose, for contradiction, we insert n after $x_q$ which is an intermediate node between $x_i$ and $x_p$, and the inserted node be $x_n$. But $\mathrm{MSB}(x_n) = n > \mathrm{MSB}(x_q)$, so $x_q$ would be at right to $x_n$. Since $x_n$ comes after $x_q$, $x_n$ should be topologically dependent on $x_q$, which means the non-trivial fan-in node of $x_n$ should depend on $x_q$. But $x_n$ is just the next node to $x_q$, which means $x_q$ is the non-trivial fan-in node of $x_n$. So $\mathrm{MSB}(x_q)$ = LSB(recentNode(n)) − 1, which is the searchIndex. As $x_p$ is the first node in the sequence after $x_i$ for which $\mathrm{MSB}(x_p)$ = searchIndex, $x_q = x_p$, which contradicts the choice of $x_q$ as an intermediate node. Hence the bottom-up enumeration in Algorithm 2 is exhaustive and non-repetitive.

C. Efficient Recursion Implementation

The key step of Algorithm 2 is the recursive procedure as explained in Fig. 6. In a preorder traversal of typical recursion tree implementation, when we move from root node to its

left subtree, a copy of the root node is stored to traverse the right subtree at a later stage. In our approach, we copy the sequence only when we get a valid prefix graph, otherwise we keep on modifying the sequence. For example, we do not store the sequences (“312,” “3312”) in Fig. 6, i.e., when we move to the left subtree of a node in the recursion tree, we insert the index and delete it while coming back to the node in the preorder traversal, and store only the leaf nodes. This notion of late copy is motivated by a concept in object-oriented programming, known as lazy copy or copy-on-write [26], which is a combination of deep copy and shallow copy. In lazy copy, when an object is copied initially, a shallow copy (fast) is used and then a deep copy (slow) is performed when it is absolutely necessary (for example, modifying a shared object). Lazy copy helps to significantly reduce run time by replacing list copy and delete operations with list entry insertion and deletion operations at a given position (iterator), which are O(1) operations. For the simple example shown in Fig. 6, an implementation without lazy copy needs five list copy and two list delete operations whereas an implementation with lazy copy only needs three list copy operations and no list delete operations. The benefits of lazy copy increase exponentially with bit-width.

D. Search Space Reduction

As the size of the solution space of all prefix graphs is huge, it is not feasible to generate all possible prefix graphs. Many prefix graphs are also not relevant because they do not have a good performance-area trade-off. We are interested only in generating candidate solutions to optimize performance (prefix graphs with minimum logic levels) and area (prefix graphs with minimum number of prefix nodes). Hence, the following search space reduction techniques are employed to scale this approach; however, the details of these techniques are not shown in Algorithm 2.

1) Level Pruning: The performance of an adder depends directly on the number of logic levels of the prefix graph. Our approach intends to minimize the number of prefix nodes with given bit-width and logic level (L) constraints. In Algorithm 2, we keep track of the levels of each prefix node and solutions are discarded if the level of the inserted node (or index) becomes greater than L.

2) Dynamic Size Pruning: As discussed in Section III-B, we construct the set $G_{n+1}$ from $G_n$. While doing this, we prune the solution space based on size (# of prefix nodes) of elements in $G_n$. Let $s_{\min}$ be the size of the minimum sized prefix graph(s) of $G_n$. Then we prune the solutions (g) for which $\mathrm{size}(g) > s_{\min} + \Delta$. For example, suppose the sizes of the solutions in $G_n$ = [9 10 11] and $\Delta = 2$. To construct $G_{n+1}$, we select the graphs of $G_n$ in increasing order of sizes and build the elements of $G_{n+1}$. Let the graphs with sizes $X_1$ = [12 13 14 15], $X_2$ = [11 14] and $X_3$ = [13 16] be respectively constructed from the graphs of sizes 9, 10, 11

128 bit. But any kind of restriction (like fanout) on the graph structure requires higher $\Delta$ to achieve feasible solutions. In that case, we store a fixed number of solutions of $G_n$ for each size s ($s_{\min} \le s \le s_{\min} + \Delta$), which allows higher $\Delta$ without increasing memory usage too much.

However, pruning the superfluous solutions after constructing the whole set $G_{n+1}$ can cause peak memory overshoot. So we employ the strategy “Delete as early as possible,” i.e., we generate solutions on the basis of the current minimum size $s_{\min}^{current}$. Let us take the same example to illustrate this. In $X_1$, $s_{\min}^{current} = 12$ and so we do not construct the graph with size 15, as 15 > 12 + 2. Similarly, when we get the solution with size 11 in $X_2$, we delete the graph with size 14 from $X_1$ and do not construct the graph with size 14 in $X_2$ and 16 in $X_3$. Indeed, whenever the size of the list/sequence in Algorithm 2 exceeds $s_{\min}^{current}$ by $\Delta + 1$, the flow is returned from RP. Apart from reducing the peak memory usage, this dynamic pruning of solutions helps in improving run time by reducing copy/delete operations.

3) Repeatability Pruning: The sequence (in our notation) denoting a prefix graph can have consecutive indices. We denote the maximum number of consecutive indices in a sequence by R. For instance, “33312” in Fig. 6 has 3 consecutive 3’s in the sequence so R = 3. We have observed that R = 1 does not degrade the solution quality, but significantly reduces the search space at an early stage. For instance, in Fig. 7, “3132” is a better solution than “33312” both in terms of logic level and size. Algorithm 2 is modified to track repeatability and prune solutions with R > 1.

Fig. 7. 3132 is better prefix structure than 33312.

Lemma 2: If R > 1, the non-trivial fan-in node of the prefix node represented by the repetitive index is an input node. For instance, $N_1$, $N_2$, and $N_3$ in Fig. 7 are represented by the index 3 consecutively. Among them, $N_2$ and $N_3$ are the nodes where repetition of the index 3 occurs. By this lemma, the non-trivial fan-in nodes of $N_2$ and $N_3$ would be input nodes. Please note that the non-trivial fan-in node of $N_1$ (represented by the first occurring index) is also an input node in this example, but this is not necessarily always true.

Proof: Let p and x be 2 consecutive prefix nodes in a sequence and they have the same MSB as shown in Fig. 8. Then the trivial fan-in node of x is p and suppose the non-trivial fan-in node of x be y. We need to prove that y is an
in Gn . In this case, the minimum size solution is the solu- input node. We shall prove this by contradiction. Let us con-
tion with size 11 and so the sizes of the solutions stored in sider that y is a prefix node, then the relative order of the
Gn+1 = [[12 13], [11], [13]]. This pruning is done to choose prefix nodes must be p → x → y or y → p → x, since p and
the potential elements of Gn+1 , which can give minimum size x are consecutive. p → x → y is not possible as it violates the
solution for the higher bits. The selection of  is critical to topological ordering and y → p → x violates the left-to-right
reduce the search space and we found empirically that  = 3 ordering (since y must be right to p). So y must be an input
is sufficient to get minimum size solutions for log2 n level till node.
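The dynamic size pruning example of Section III-D (sizes [9 10 11] in G_n, Δ = 2) and the repeatability measure R can be reproduced with a small sketch. This is only an illustration under stated assumptions: graphs are represented by their sizes alone, and the names `prune_dynamic` and `repeatability` are hypothetical, not taken from Algorithm 2.

```python
from itertools import groupby


def prune_dynamic(groups, delta):
    """Sketch of "delete as early as possible" size pruning.

    groups: candidate sizes built from each graph of G_n, with the parent
            graphs visited in increasing order of size (assumed input form).
    Returns the sizes kept for G_{n+1}, grouped by parent graph.
    """
    kept = []
    s_min = float("inf")  # current minimum size seen so far
    for group in groups:
        current = []
        kept.append(current)
        for size in group:
            if size > s_min + delta:
                continue  # too large: this graph is never constructed
            current.append(size)
            if size < s_min:
                s_min = size
                # A smaller solution appeared: delete stored graphs that
                # now exceed the tightened bound.
                for stored in kept:
                    stored[:] = [s for s in stored if s <= s_min + delta]
    return kept


def repeatability(sequence):
    """R: the longest run of identical consecutive indices in a sequence."""
    return max((len(list(run)) for _, run in groupby(sequence)), default=0)


if __name__ == "__main__":
    print(prune_dynamic([[12, 13, 14, 15], [11, 14], [13, 16]], delta=2))
    print(repeatability("33312"), repeatability("3132"))
```

With the sizes from the text, the graph of size 15 is skipped in X_1, the graph of size 14 is deleted from X_1 once size 11 appears in X_2, and the sizes 14 and 16 are never built in X_2 and X_3.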

Fig. 8. Proof of Lemma 2.
Fig. 9. Prefix structure restriction.

4) Prefix Structure Restriction: This is a special restriction in prefix graph structure for 2^m bit adders with m logic levels. For instance, if we need to construct an 8 bit adder with logic level 3, the only possible way to realize output bit 7 using the same notation as (2) is given by

y_7 = ((x_7 ∘ x_6) ∘ (x_5 ∘ x_4)) ∘ ((x_3 ∘ x_2) ∘ (x_1 ∘ x_0)).   (7)

So 2^m − 1 prefix nodes are fixed and must be present in any 2^m bit adder with m level. These fixed prefix nodes form a binary-tree structure as illustrated for 8 bit in Fig. 9. Among these fixed nodes, we define the bottom-most node (or the node with highest topological level) in each bit-column of this binary-tree prefix structure to be the base node for that bit. For instance, b_3 is the base node for bit-index x_3. Please note that we have used the terms bit-width and bit-index interchangeably. As bit-index starts from 0, the prefix graph of bit-width n is the same as that for bit-index n − 1.
Lemma 3: Let lv(b_i) denote the level of the base-node of bit-index i and j = i − 2^{lv(b_i)}. Then ∀j, s.t. j > 0, lv(b_j) > lv(b_i).
Proof: Let a bit-index i be represented as i + 1 = 2^{a_0} + 2^{a_1} + · · · + 2^{a_{k−1}} + 2^{a_k}, where a_0 > a_1 > . . . > a_{k−1} > a_k. Then lv(b_i) = a_k (bit-index starting from 0). For example, lv(b_5) = 1, since 5 + 1 = 6 = 2^2 + 2^1. Therefore, j + 1 = 2^{a_0} + 2^{a_1} + · · · + 2^{a_{k−1}}, which implies that lv(b_j) = a_{k−1} > a_k = lv(b_i).
Next, we will prove several lemmas/theorems which will hold good under this prefix structure restriction and provide a basis of generating size-optimum solutions for 2^m bit prefix graph with level m. Please note that we are not claiming that our approach with the restrictions imposed by these lemmas/theorems will provide all size-optimum solutions. Instead, we will prove theoretically that our approach with each of these restrictions does not hamper the optimality and we will be able to obtain at least one optimum solution. In practice, our approach provides more than one optimum solution (to be discussed in Section III-E).
Any node N1 is said to be above (or below) another node N2 if MSB(N1) = MSB(N2) and level(N1) < (or >) level(N2). For example, node nb2 is above the node b_7 in Fig. 9.
Lemma 4: There exists an optimum solution even when a restriction is imposed in search space by not allowing non-trivial fan-in from the nodes which are above the base nodes. For example, if we do not allow non-trivial fan-in from nb1, nb2, nb3 (Fig. 9) for constructing any prefix graph of bit-width 2^m with level m, we will still get a size-optimum solution. Please refer to Appendix for proof.
Corollary 2: ∀m, there exists an optimum solution when all non-trivial fan-ins from bit-index (2^m − 1) are taken from its base-node, b_{2^m − 1}.
Proof: Since the base-node for any bit index (2^m − 1) is the output node for that bit-index as well, the proof directly follows from Lemma 4.
Theorem 2: Let G^opt_{2^m} be an optimum prefix graph of bit-width 2^m and level m with the imposed restriction mentioned in Lemma 4. Suppose G_x is the prefix graph of bit-width x, embedded in G^opt_{2^m}. Then G_x is an optimum prefix graph of bit-width x and level m under prefix structure restriction, if either of the following conditions is satisfied for x.
1) x = 2^p.
2) x = 2^p + 2^q.
p, q ∈ Z^+ and p, q < m.
Proof: Suppose G_{2^p} is not an optimum prefix graph of bit-width 2^p and level restriction m. By Corollary 2, all non-trivial fan-ins from bit-index 2^p − 1 are from its base-node b_{2^p − 1} (this is the output node for bit-index 2^p − 1 as well), which implies that any prefix node, which is at the right-side of the bit-index 2^p − 1 (or alternatively bit-indices lesser than 2^p − 1), will not be used for constructing higher output bits (i > 2^p − 1). So if G_{2^p} is not optimum, then we should be able to reduce the size of G_{2^p} keeping the rest of the prefix-structure, which is at the left side of bit-index (2^p − 1), intact. But that reduces the size of G^opt_{2^m}, leading to contradiction.
Without loss of generality, we can assume p > q (p = q leads to condition 1) and suppose G_x is not optimum, where x = 2^p + 2^q. Therefore, lv(b_{x−1}) = q and q prefix nodes, in the column corresponding to bit-index x − 1, are fixed under prefix structure restriction. The optimal way to generate the output for bit-index x − 1 is by combining the base nodes b_{2^p − 1} (2^p − 1:0) and b_{x−1} (x − 1:2^p) as shown in Fig. 10, because it adds only 1 node N3 and increases its level to its minimum possible value p + 1 (output bit for bit-index x − 1 cannot be realized in less than p + 1 levels as x − 1 > 2^p). By Lemma 4, the non-trivial fan-in from bit-index x − 1 can only come from b_{2^p − 1} or N3, which signifies that for any prefix node of bit-index i > x − 1, there is no non-trivial fan-in from the bits Y for optimality, where y ∈ Y if 2^p − 1 < y < x − 1. Moreover, G_{2^p} is optimum. Now, if G_x is not optimum, then we should be able to reduce the size of G_x restoring the prefix structure between the bit-ranges 2^m − 1:x and 2^p − 1:0, but that reduces the size of G^opt_{2^m}, leading to contradiction.

Fig. 10. Proof of Theorem 2 (2).

Let us denote the bit-indices 0, 2, . . . as even indices (E) and 1, 3, . . . as odd indices (O). In our approach, we construct the prefix-graphs of higher bits in a bottom-up fashion.
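Lemma 3 and the count of fixed nodes under the prefix structure restriction can be checked numerically. The sketch below assumes, as in the proof of Lemma 3, that lv(b_i) is the exponent of the smallest power of two in the binary expansion of i + 1; the function names are hypothetical and the check is an illustration, not a substitute for the proof.

```python
def base_level(i):
    """lv(b_i): exponent of the lowest set bit of i + 1."""
    v = i + 1
    return (v & -v).bit_length() - 1


def lemma3_holds(max_index):
    """Check lv(b_j) > lv(b_i) for j = i - 2**lv(b_i), whenever j > 0."""
    for i in range(max_index + 1):
        j = i - 2 ** base_level(i)
        if j > 0 and base_level(j) <= base_level(i):
            return False
    return True


def fixed_node_count(m):
    """Fixed prefix nodes of a 2**m bit graph: lv(b_i) nodes in column i."""
    return sum(base_level(i) for i in range(2 ** m))


if __name__ == "__main__":
    print([base_level(i) for i in range(8)])  # levels of b_0 .. b_7
    print(lemma3_holds(1023))
    print(fixed_node_count(3))                # 8 bit adder with level 3
```

For the 8 bit case the per-column levels sum to 7 = 2^3 − 1, matching the seven fixed nodes of the binary tree in (7).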

Lemma 5: Under prefix structure restriction there exists an optimum solution without allowing any non-trivial fan-in from a prefix node corresponding to bit-index i_e ∈ E.
Proof: ∀i_o ∈ O, lv(b_{i_o}) ≥ 1, which means there exists an optimum solution where any input node corresponding to odd indices is not a non-trivial fan-in node (by Lemma 4), implying that it is not essential to have any prefix node with LSB lsb ∈ O to get an optimum solution. But to have a non-trivial fan-in from a prefix node of bit-index i_e ∈ E we need to have at least one prefix node whose LSB lsb = i_e + 1 ∈ O (Fig. 11). Hence the Lemma is proved.

Fig. 11. Proof of Lemma 5.

Theorem 3: There exists an optimum solution under prefix structure restriction when prefix-graph of bit-index i_o + 1 is constructed from a prefix graph (g_{i_o}) of bit-index i_o, by adding minimum number of prefix nodes, where i_o ∈ O.
Proof: It follows from Lemma 5 that there exists an optimum solution where no non-trivial fan-in is taken from any prefix node of bit i_e ∈ E. So addition of minimum number of prefix nodes to construct a prefix-graph of bit-index i_o + 1 from g_{i_o} restores the optimality.
Theorem 4: There exists an optimum solution when search space is restricted by setting R = 1.
Proof: Let R > 1. By Lemma 2, the non-trivial fan-in node for the corresponding node is an input node. Since ∀i_o ∈ O, lv(b_{i_o}) ≥ 1, there exists an optimum solution where any input node corresponding to odd indices is not a non-trivial fan-in node (by Lemma 4). Now it remains to prove that we do not need such non-trivial fan-in input node to be of bit-index i_e ∈ E either. For contradiction, let us consider that input node corresponds to i_e. But this will require a non-trivial fan-in from an input node of i_o = i_e + 1 (Fig. 12), which is not essential to get an optimum solution. So R = 1 will provide an optimum solution.

Fig. 12. Proof of Theorem 4.

E. Method to Generate Size Optimum Solution for 2^m Bit Adder With Level m
Procedure "buildBottomUp" in Algorithm 2 generates G_{n+1} from G_n exhaustively and we call this procedure for bit-indices 2 to 2^m − 1 to generate the solutions for G_{2^m}. We apply certain pruning strategies to this approach, and each pruning strategy is proven not to degrade the optimality of the solution. These strategies are as follows.
1) Enabling prefix-structure restriction, which is a constraint for generating any 2^m bit adder with level m.
2) Not allowing any non-trivial fan-in from any node above base-nodes. (Lemma 4 ensures the optimality in this case).
3) Set Δ = 0 for any bit-index x − 1, such that x = 2^p + 2^q (p, q ∈ Z). We have proved in Theorem 2 that prefix graphs of bit-width x embedded in an optimum prefix graph (with the restriction imposed by Lemma 4) for 2^m bit adder with level m is also optimum for x bit adder with level m under prefix structure restriction. So keeping only the minimum size solutions at each bit-index x − 1 is not going to hamper the optimality of the solution.
4) Greedy construction of prefix graph of even bit-index by adding the minimum prefix node to the prefix graph of its immediate next lower bit-index (Theorem 3).
5) Set R = 1. (Theorem 4 ensures optimality).
With this approach, we are able to generate the size-optimum solutions for 32, 64, and 128 bits (optimum sizes are 74, 167, and 364 respectively). The total number of size-optimum solutions for them are respectively 2, 8, and 768. It is interesting to note that we also get exactly these many size-optimum solutions without using the restriction imposed by Lemma 4, Theorems 2 and 3, rather by setting Δ = 0, 1, 2 for n = 32, 64, 128 respectively and enabling prefix structure restriction (note that without this prefix structure restriction, Δ needs to be 3 to achieve the optimum size for n = 128). This is intuitive as we need higher Δ (i.e., more exploration of search space) to get optimum solutions for higher bits. Increasing Δ beyond that does not reduce the size further, and this reinforces our claim of theoretical size-optimality for 2^m bit adders with level m. The run-time for generating the size-optimum solutions for 128 bit is 5.8 s, whereas the same for 64 bit adder is 0.04 s.
We denote the pruning strategies 1 to 4 as the set of special pruning strategies (S^bin_pruning) which is effective under binary prefix structure restriction and without any other restrictions, such as fan-out. However, we will be using S^bin_pruning for more general cases to be illustrated later. Please note that we have kept the restriction R = 1 outside this set, as we will be using this pruning strategy more extensively and in all situations.

F. Generating Solutions for More General Case
In the earlier section, we have described a method to generate size-optimum solutions for n = 2^m bit adder with level m. We have extended our approach for bit-width n ≠ 2^m and levels other than log_2 n. We impose the pruning strategies S^bin_pruning till 2^{log_2(n) − 1} and then remove that restriction. For example, while we run our algorithm to generate 64 bit prefix graphs with level > 6, we remove the prefix structure restriction after 32 bit. The notion behind this heuristic is that keeping the balanced structure till some point would help in getting minimum-size solutions for higher bits. In addition to this, we set Δ = 3 and R = 1 to scale the approach in general case.
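Pruning strategy 3 of Section III-E amounts to a per-bit-index choice of Δ, which can be sketched as below. This is a hedged illustration: the helper names are hypothetical, the exponents p and q are assumed to be non-negative integers (with p = q allowed, so that powers of two are covered), and the default Δ = 3 is borrowed from the general-case setting rather than prescribed for this situation.

```python
def is_sum_of_two_powers(x):
    """True if x = 2**p + 2**q for non-negative integers p, q (p == q allowed)."""
    return x >= 2 and bin(x).count("1") <= 2


def delta_for_bit_index(i, default_delta=3):
    """Size slack used when building the graphs of bit-index i.

    Returns 0 (keep only minimum-size solutions) when the bit-width
    x = i + 1 has the form 2**p + 2**q, otherwise the default slack.
    """
    x = i + 1
    return 0 if is_sum_of_two_powers(x) else default_delta


if __name__ == "__main__":
    widths = [x for x in range(1, 17) if is_sum_of_two_powers(x)]
    print(widths)
    print(delta_for_bit_index(11), delta_for_bit_index(6))
```

Bit-widths such as 7, 11, or 13 need three or more powers of two, so they keep the default slack in this sketch.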

IV. EXPERIMENTAL RESULTS
We have implemented our approach in C++ and integrated our approach to a placement driven synthesis (PDS) [27] tool in IBM. It has been executed on a linux machine with 72GB RAM and 2.8GHz CPU. First, we present our results at the logic synthesis (technology independent) level. As the dynamic programming based area-heuristic approach presented in [11] has achieved better results compared to the other existing techniques [12], [13], we have implemented this approach as well to compare with our experimental results. Table I presents the comparison of minimum number of prefix nodes for adders with different bit-width (n) with log_2 n logic level constraint for all output bits. The number of prefix nodes for Sklansky adders are also mentioned in Table I for adders of bit-widths which are power of 2. For 128 bit adder, our approach improves Sklansky adder by 18.8% in terms of the size of the prefix graph. Table II compares the result of our algorithm with [11] for levels greater than log_2 n. We can see that we have achieved theoretically possible minimum size solutions for most of the cases, where the bound is known. Prefix graph solutions for 32 bit adders with level 5 and 6 generated by our approach are shown in Fig. 13.
Next, we run our algorithm to generate the zero-deficiency prefix graphs. For example, we can build a zero-deficiency prefix graph with L = 7 till 54 bit and the minimum achievable size is 99. So we ran our algorithm for 54 bit graph with level restriction of 7, and got the minimum size (s_min) as 99 which is the theoretical minimum indeed. With same constraints, the minimum size solutions for [11] is 109 and for [13] it is 104 [8]. Table III presents the result for L = [3, 8] and our approach is able to achieve the theoretically possible minimum prefix graph sizes.
In Tables I–III, the input profile is uniform, i.e., the arrival times of all input bits are assumed to be the same. In Table IV, we have compared the result for non-uniform input profile. The required time of arrival for all output bits are set to 9 and the input arrival levels have been randomly generated between 0–4. Finally, we run our algorithm for 32 bit adders with non-uniform input/output profiles appeared in [13]. In these examples, the input arrival times are correlated, for example late higher words or monotonically increasing inputs, which are more common in practical situations like multiplications etc. Table V compares the result with [11] and [13] for those profiles. We can see that we have obtained comparable/better results than [11] and [13] in all cases.

TABLE I: PREFIX GRAPH SIZE FOR log_2 n LEVEL
TABLE II: PREFIX GRAPH SIZE FOR OTHER THAN log_2 n LEVEL
TABLE III: PREFIX GRAPH SIZE FOR ZERO-DEFICIENCY PREFIX GRAPHS
TABLE IV: PREFIX GRAPH SIZE FOR NON-UNIFORM INPUT PROFILE IN A 32 BIT ADDER
TABLE V: COMPARISON ON ZIMMERMANN'S EXAMPLES

As mentioned earlier, the existing automated synthesis approaches ([11], [12], [13], etc.) are not flexible in restricting parameters like fan-out, which is a critical parameter to optimize post-synthesis design performance. Usually, electrical violations at high-fanout points are mitigated by buffer-insertion and gate-sizing, but at the cost of performance. We study the impact of the parameter maximum fan-out (MFO) by plotting the worst negative slack (WNS) against the size of the prefix graph for 16 bit adders (Fig. 14). We observe that the prefix graphs of higher node count and smaller MFO are better for timing. For high-performance designs, Kogge–Stone [1] is the most effective adder structure due to the special property

that maximum fan-out (MFO) of an n bit adder is less than log_2 n (without any buffer insertion) and the fan-out for prefix nodes at logic level log_2 n − 1 is 2. Table VI shows that, even with a fan-out restriction of 2 for all prefix nodes, the prefix graph generated by our approach has fewer prefix nodes than the prefix graph for a Kogge–Stone adder. Fig. 15 shows such an example for 16 bit. As mentioned in Section III-D2, Δ needs to be set to a higher value in this case. For instance, the parameters used to generate the 64 bit adder solution with a fan-out restriction of 2 are Δ = 20, R = 1, and MFO = 2. However, it should be noted that although our approach scales with fan-out restriction and logic level log_2 n, it does not scale well with fan-out restriction and levels higher than log_2 n for adders of higher bit-width (n > 32).

Fig. 13. 32 bit prefix graphs generated by our approach with level 5 and 6.
Fig. 14. # of prefix nodes versus WNS for 16 bit adder.
Fig. 15. Size of a 16 bit prefix graph with level 4 and fanout 2 generated by our approach is less than that of Kogge–Stone by 7.
TABLE VI: COMPARISON WITH KOGGE–STONE ADDER
TABLE VII: POST PLACEMENT COMPARISON

We run our approach, integrated in PDS tool, on the minimum size solutions of 8, 16, 32, 64 bit adders under tight timing constraints. A cutting-edge technology node (CMOS SOI 22nm) is used for technology mapping. In addition to this, other optimization techniques such as buffer-insertion, gate-sizing etc., which are inherent in the tool are applied followed by placement. However, we have prevented V_th-swapping in the placement tool so that the leakage power becomes proportional to area. We present the various metrics like area, WNS, wire-length, total-negative-slack (TNS) after placement in Table VII for the solution having best WNS. The target delay specified for 8, 16, 32, and 64 bit adders are respectively 35ps, 45ps, 65ps, and 75ps. So we can calculate the critical

path delay by adding the target delay and the absolute value of the WNS. For instance, the critical path delay for 64 bit Kogge–Stone adder is 75 + 84.5 = 159.5ps. Both wirelength and area are unitless. Area is reported as the number of icells and wirelength as the number of tracks. An icell has a constant area based on pitch. Our approach is compared against regular adders like Brent–Kung (BK), Kogge–Stone (KS) adders, adders generated by dynamic programming (DP) [11], and 64 bit full custom adder (CT).

Fig. 16. Area versus worst negative slack plot for 16 and 32 bit adders.
Fig. 17. 64 bit adder after placement.

Fig. 16 represents the plot of area versus WNS for the solutions provided by our approach along with those provided by other methods. We can draw a pareto curve with the solution points obtained using our approach, which gives the option to select the individual points on the pareto curve based on area/power budget. We see that the solution points of the other methods are above and/or to the right of this curve, which indicates that we can always get some solution on the pareto-front, which is better in terms of performance and/or area than each of the other methods. For a 16 bit adder, the total number of pareto-optimal points is 4 and the single point p1 provides better solution than DP, KS, and BK. For a 32 bit adder, the points p1, p2, p3 are better solutions than BK, DP, KS respectively. Fig. 17 compares these metrics for single solution (with best WNS) of 64 bit adder with other approaches. Our approach improves performance by 19% with 2% higher area over a Brent–Kung adder, improves performance and area by 0.4% and 33%, respectively, over a Kogge–Stone adder, improves performance and area by 3% and 6.7%, respectively over Dynamic Programming [11], and improves performance and area by 3.2% and 8.5% over a full custom adder design. Note that the performance improvement was computed based on the actual critical path delay value and not the worst negative slack. Our approach also improves wire-length and TNS over both Kogge–Stone and full custom adder design.
Since most adders today are synthesized in Design Compiler (DC) using Synopsys DesignWare, the adder architectures provided by our approach are also synthesized in DC (Version G-2012.06-SP4) and placed, routed and timed by IC Compiler (ICC) to compare with the behavioral adder implementation (Y = A + B) by DC. To generate high-performance adders, DC produces modified Sklansky adders consisting of alternating AOI21 and OAI21 gates, and employing gate-sizing or buffer insertion to handle the high-fanout nodes. This generally gives delay almost close to Kogge–Stone at much lower area/power and competitive power/performance/area with even custom adders. 32 nm SAED LVT cell-library [28] (available through Synopsys University Program) has been used for technology-mapping. All experimental results for DC/ICC are in "tt1p05v125c" corner, in which the supply voltage is 1.05 V and temperature is 125 °C. The FO4 delay of a unit-sized inverter in this corner is 36 ps and the area of the unit-sized inverter is 1.27 μm^2.

Fig. 18. Delay versus power plot for 64 bit adder.

Fig. 18 shows the delay versus power (total power i.e., leakage + switching + internal power) plot for minimum size solutions of 64 bit adder architectures provided by our approach after synthesis by DC and placed, routed by ICC. For all these runs (including those for Sklansky, Kogge–Stone and behavioral adder synthesis by DC), the target delay is

set to 200 ps, the operating frequency is 1 GHz, activities at the primary inputs are 0.1, and the adders are synthesized by the command "compile_ultra." Please note that the option "-area_high_effort_script" is on by default. We also perform some experiments by: 1) switching on the option "-timing_high_effort_script" which can further optimize at the expense of run time and 2) altering the target delay (180 ps or 220 ps), but observe that the change in delay value remains within a range of 5–10 ps. We can draw the pareto-optimal curve of delay versus power with those solutions and see that the solution provided by Sklansky adder, Kogge–Stone adder and that by behavioral adder implementation of DC are above and/or to the right side of the pareto-front. For instance, the solution p2 in Fig. 18 improves Sklansky adder in all metrics, i.e., delay (1.8%), area (2.4%), and power (2.8%) or solution p1 in Fig. 18 improves Kogge–Stone adder in area by 30.6% and power by 29.6% with 3.8 ps or 1.1% overhead in delay. Compared to DC behavioral adder implementation, our approach (point p1) provides competitive delay (5 ps better) with significant area (26%) and power (18%) reduction. Table VIII compares our approach with other approaches in terms of delay, power, and area. Note that the solution with best delay is considered for this comparison.

TABLE VIII: COMPARISON FOR 64 BIT ADDERS, SYNTHESIZED BY DC AND PLACED/ROUTED BY ICC

It should be stressed that our approach generates several candidate prefix graphs for performance/area trade-off and prefix networks, which would give best performance, are not the same across different technology node and libraries. For instance, we have run our approach in PDS (IBM) with CMOS SOI 22 nm and in Synopsys DesignWare (DC + ICC) with 32 nm SAED library, and the prefix trees which have given the best performance in the two cases differ one from another. Ling transformations [20] can also be applied to the prefix graphs generated in our approach to further optimize the performance. Also, since the solutions for regular adders are located above and/or to the right side of the pareto-front, we believe that the solutions on the pareto-front can be used as alternatives for regular adders for use in custom designs.

V. CONCLUSION
In this paper, a highly efficient parallel prefix graph generation driven high performance adder synthesis technique is presented. The complexity of parallel prefix graph generation problem for adders is exponential in the number of bits. We present efficient pruning strategies and implementation techniques to scale this approach up to 128 bit adders. We have demonstrated a way to generate size-optimum prefix graphs for 2^m bit adders with level m and proved its optimality. The results, both at the technology-independent level and after physical synthesis (post placement) show that this approach significantly improves over existing techniques by yielding better quality of results in terms of both timing and wire length for high performance adders in state of the art microprocessor designs. The proposed approach improves over even the manually designed custom adders, yielding up to 3% better delay and 9% better area. As our approach can generate multiple prefix graph structures for given constraints, it provides a framework for further exploration to identify structures that can account for practical design issues like wire congestion and power consumption.

APPENDIX
Proof (Lemma 4): Let us denote any node by a triplet, viz. bit-range of the node (MSB and LSB) and level. We consider a node M1 (msb_1, lsb_1, level_1) to be no worse than another node M2 (msb_2, lsb_2, level_2) iff msb_1 = msb_2, lsb_1 = lsb_2 (i.e., bit-ranges of M1 and M2 are equal) and level_1 ≤ level_2. We define a restricted set of bit-range (RBR) as any bit-range msb:lsb ∈ RBR, if ∀i, such that msb > i ≥ lsb, LSB(b_i) ≥ lsb. For instance, 7:4 ∈ RBR, since LSB(b_6) = 6 ≥ 4, LSB(b_5) = LSB(b_4) = 4 ≥ 4, whereas 4:2 ∉ RBR, since LSB(b_3) = 0 < 2. It is easy to notice that if there is no non-trivial fan-in from nodes above base-nodes, then there does not exist any node in the prefix graph for which the bit-range is not in RBR, because for any bit-range msb:lsb ∉ RBR, ∃q, such that msb > q ≥ lsb and LSB(b_q) < lsb, which is not possible unless there is a non-trivial fan-in from any node above b_q (black node marked in Fig. 19).

Fig. 19. x_p : x_r ∉ RBR ⟹ non-trivial fan-in from node above b_q.

The structure of the proof is as follows. We will first prove the proposition (by induction) that by not allowing any non-trivial fan-in from the nodes above base-nodes, we can still realize any bit-range br ∈ RBR with same (or less) level restriction and size, compared to allowing non-trivial fan-in from nodes above base-nodes. Once we prove this for any such bit-range, it directly follows that we can get the size-optimum solutions of 2^m bit prefix graph with level m by not allowing any non-trivial fan-in from the nodes above base-nodes, because the bit-ranges of all output bit nodes ∈ RBR.
Let b_x (x, z + 1, r) be a base-node for bit-index x and N1 (x, y + 1, l_1) be any node above b_x, where l_1 < r (Fig. 20). We assume that this proposition holds for bit-ranges with MSB ≤ x and then prove its validity for any bit-range with MSB = x + 1 (by induction). Please note that the proposition holds for x = 1 (bit-range 1:0 can be constructed only by adding input bits for bit-index 0 and 1). The node N1 may be used for constructing any bit-range with MSB x + 1 by taking a non-trivial fan-in from N1. But if we can show that there is always an alternative way by taking non-trivial fan-in from or below b_x (which is no worse than allowing the non-trivial fan-in from N1) to construct the bit-range with MSB x + 1, then we are done. Let us combine the node N1 with the input node for bit-index x + 1 to get N5 (x + 1, y + 1, l_1 + 1). Let N2 (z, u, l_2) be the node for bit z, which is used for realizing any arbitrary bit-range

R EFERENCES
[1] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient solu-
tion of a general class of recurrence equations,” IEEE Trans. Comput.,
vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[2] J. Sklansky, “Conditional sum addition logic,” IRE Trans. Electron.
Comput., vol. EC-9, no. 2, pp. 226–231, Jun. 1960.
[3] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE
Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[4] T. Han and D. Carlson, “Fast area-efficient VLSI adders,” in Proc.
(a) IEEE 8th Symp. Comput. Arith. (ARITH), Como, Italy, May 1987,
pp. 49–56.
[5] C. Zhou, B. M. Fleischer, M. Gschwind, and R. Puri, “64-bit pre-
fix adders: Power-efficient topologies and design solutions,” in Proc.
IEEE Custom Integr. Circuit Conf., San Jose, CA, USA, Sep. 2009,
pp. 179–182.
[6] J. Liu, Y. Zhu, H. Zhu, C. K. Cheng, and J. Lillis, “Optimum prefix
adders in a comprehensive area, timing and power design space,” in
Proc. Asia South Pac. Des. Autom. Conf., Yokohama, Japan, Jan. 2007,
pp. 609–615.
[7] M. Snir, “Depth-size trade-offs for parallel prefix computation,”
J. Algorithms, vol. 7, no. 2, pp. 185–201, Jun. 1986.
[8] C. K. Cheng, H. Zhu, and R. Graham, “Constructing zero-deficiency
parallel prefix adder of minimum depth,” in Proc. Asia South Pacific
(b) Des. Autom. Conf., Jan. 2005, pp. 883–88.
[9] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. ACM,
vol. 27, no. 4, pp. 831–838, Oct. 1980.
[10] J. P. Fishburn, “A depth decreasing heuristic for combinational logic;
or how to convert a ripple-carry adder into a carry-lookahead adder or
anything in-between,” in Proc. Des. Autom. Conf., Orlando, FL, USA,
Jun. 1990, pp. 361–364.
[11] T. Matsunaga and Y. Matsunaga, “Area minimization algorithm for par-
allel prefix adders under bitwise delay constraints,” in Proc. Great Lakes
Symp. VLSI, 2007, pp. 435–440.
[12] J. Liu, S. Zhou, H. Zhu, and C. K. Cheng, “An algorithmic approach
for generic parallel adders,” in Proc. Int. Conf. Comput. Aided Des.,
San Jose, CA, USA, Nov. 2003, pp. 734–740.
[13] R. Zimmermann, “Non-heuristic optimization and synthesis of paral-
(c) lel prefix adders,” in Proc. Int. Workshop Logic Archit. Synth., 1996,
pp. 123–132.
[14] M. Ziegler and M. Stan, “Optimal logarithmic adder structures with a
fanout of two for minimizing the area-delay product,” in Proc. Int. Symp.
Circuit. Syst., Sydney, NSW, Australia, May 2001, pp. 657–660.
[15] S. Knowles, “A family of adders,” in Proc. 15th IEEE Symp. Comput.
Arithmetic, Vail, CO, USA, 2001, pp. 277–284.
[16] A. K. Verma and P. Ienne, “Towards the automatic exploration
of arithmetic-circuit architectures,” in Proc. Des. Autom. Conf.,
San Francisco, CA, USA, 2006, pp. 445–450.
[17] S. Roy, M. Choudhury, R. Puri, and D. Z. Pan, “Towards optimal
performance-area trade-off in adders by synthesis of parallel prefix
structures,” in Proc. 50th ACM/EDAC/IEEE Des. Autom. Conf., Austin, TX,
USA, May/Jun. 2013, pp. 1–8.
[18] D. Harris, “A taxonomy of parallel prefix networks,” in Proc. 37th
Asilomar Conf. Signals Syst. Comput., Nov. 2003, pp. 2213–2217.
[19] B. R. Zeydel, T. T. J. H. Kluter, and V. G. Oklobdzija, “Efficient mapping
of addition recurrence algorithms in CMOS,” in Proc. 17th IEEE Symp.
Comput. Arithmetic, Jun. 2005, pp. 107–113.
[20] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI
Ling adders,” IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231,
Feb. 2005.
[21] S. Mathew, M. Anders, R. K. Krishnamurthy, and S. Borkar, “A 4-GHz
130 nm address generation unit with 32-bit sparse-tree adder
core,” IEEE J. Solid-State Circuits, vol. 38, no. 5, pp. 689–695,
May 2003.
[22] M. Ketter et al., “Implementation of 32-bit Ling and Jackson adders,” in
Proc. 45th Asilomar Conf. Signals Syst. Comput. (ASILOMAR), Pacific
Grove, CA, USA, Nov. 2011, pp. 170–175.
[23] S. Kao, R. Zlatanovici, and B. Nikolic, “A 240ps 64b carry-lookahead
adder in 90nm CMOS,” in Proc. Int. Solid-State Circuits Conf.,
San Francisco, CA, USA, Feb. 2006, pp. 1735–1744.
[24] S. Naffziger, “A subnanosecond 0.5 um 64b adder design,” in Proc. IEEE
Int. Solid-State Circuits Conf., San Francisco, CA, USA, Feb. 1996,
pp. 362–363.
[25] D. Patil, M. Horowitz, R. Ho, and R. Ananthraman, “Robust energy-
efficient adder topologies,” in Proc. IEEE Symp. Comput. Arithmetic,
Montpellier, France, Jun. 2007, pp. 16–28.

Fig. 20. Proof of Lemma 4. (a) Option 1. (b) Option 2. (c) Alternative option.

$x+1:u \in RBR$ with MSB $x+1$. By the induction hypothesis,
$l_2 \ge lv(b_z)$ and $lv(b_z) > lv(b_x) = r$ (by Lemma 3). Therefore,
$l_2 > r$.
Now, there are two options to get $x+1:u$ by using nodes $N_5$ and $N_2$.
In the first option, we combine $N_5$ and some node $N_3(y, z+1, l_3)$ to
generate $N_6(x+1, z+1, l_6)$ and then combine it with $N_2$ to generate
$N_7(x+1, u, l_7)$ [Fig. 20(a)]. Here $l_6 = \max(l_1+2, r+1)$ (since
$x+1-z > 2^r$). Therefore, $l_7 = \max(l_1+3, r+2, l_2+1) = \max(l_1+3, l_2+1)$
(since $l_2 > r$). In the second option [Fig. 20(b)], we combine $N_4$ and
$N_5$ to generate $N_8(x+1, u, l_8)$, where $l_8 \ge \max(l_1+2, l_2+2)$.
But there is always an alternative way to construct the bit-range $x+1:u$:
combine $b_x$ with the input node for bit-index $x+1$, and then combine the
result with $N_2$ [Fig. 20(c)] to generate $N_{10}(x+1, u, l_{10})$, where
$l_{10} = \max(r+2, l_2+1) = l_2+1$. Compared to both option 1 and option 2,
the alternative adds no more nodes and still realizes the same bit-range at
an equal or lower level ($l_{10} < l_8$ and $l_{10} \le l_7$).
Hence the proposition holds for any bit-range $\in RBR$ with
MSB $= x+1$, given that it holds for any bit-range $\in RBR$ with
MSB $\le x$. This proves the lemma.

ACKNOWLEDGMENT

The authors would like to thank R. Chhabra, currently with
Broadcom, for his help in setting up the DC/ICC run.
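The level comparison at the end of the proof of Lemma 4 can be checked by brute force for small values. The Python sketch below is an illustration only and is not part of the authors' implementation. It takes the expressions for $l_6$, $l_7$, $l_8$, and $l_{10}$ as stated in the proof, treats $l_1$, $l_2$, and $r$ as free non-negative integers with $l_2 > r$ (an assumption, since the proof constrains them further), and assumes that combining two nodes yields a level one higher than the larger of the two. All function names are hypothetical.

```python
from itertools import product


def level_option1(l1, l2, r):
    """Level l7 of N7 in option 1 [Fig. 20(a)], using l6 as given in the proof."""
    l6 = max(l1 + 2, r + 1)   # level of N6, taken from the proof text
    return max(l6, l2) + 1    # combining N6 with N2 adds one level


def level_option2_lower_bound(l1, l2):
    """Lower bound on the level l8 of N8 in option 2 [Fig. 20(b)]."""
    return max(l1 + 2, l2 + 2)


def level_alternative(l2, r):
    """Level l10 of N10 in the alternative construction [Fig. 20(c)]."""
    first = r + 1             # b_x (level r) combined with an input node (level 0)
    return max(first, l2) + 1  # then combined with N2


def check_lemma4_levels(max_level=8):
    """Check l10 <= l7 and l10 < l8 for all small (l1, l2, r) with l2 > r.

    Returns the number of (l1, l2, r) triples that were checked.
    """
    checked = 0
    for l1, l2, r in product(range(max_level), repeat=3):
        if l2 <= r:           # the proof establishes l2 > r via Lemma 3
            continue
        l7 = level_option1(l1, l2, r)
        l8_min = level_option2_lower_bound(l1, l2)
        l10 = level_alternative(l2, r)
        assert l7 == max(l1 + 3, l2 + 1)
        assert l10 == l2 + 1
        assert l10 <= l7 and l10 < l8_min
        checked += 1
    return checked


if __name__ == "__main__":
    print(check_lemma4_levels())
```

The check covers only the level arithmetic. The node-count claim in the proof depends on the graph structure and is not modeled here.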
[26] H. Sutter, More Exceptional C++. Addison Wesley, 2002 [Online].
Available: http://www.gotw.ca/publications/mxc++.htm
[27] H. Ren, D. Z. Pan, and D. S. Kung, “Sensitivity guided net weighting for
placement driven synthesis,” in Proc. Int. Symp. Phys. Des., Apr. 2004,
pp. 10–17.
[28] (2014, Mar. 14). [Online]. Available: http://www.synopsys.com/
Community/UniversityProgram/Pages/32-28nm-generic-library.aspx

Subhendu Roy (S’13) received the B.E. degree in
electronics and telecommunication engineering from
Jadavpur University, Kolkata, India, in 2006, and the
M.Tech. degree in electronic systems from the Indian
Institute of Technology, Bombay, Mumbai, India,
in 2009. He is currently pursuing the Ph.D. degree
from the Department of Electrical and Computer
Engineering, University of Texas at Austin, Austin,
TX, USA.
His current research interests include design
automation for logic synthesis, physical design, and
cross-layer reliability. He has 3 years of full-time industry experience at EDA
company, Atrenta, where he was involved in developing tools in the
architectural power domain and RTL domain. He also did internships at IBM T. J.
Watson Research Center in 2012 and Mentor Graphics in 2013 and 2014.
Mr. Roy received the Best Paper Award from ISPD’14.

Mihir Choudhury (S’05–M’12) received the
B.Tech. degree in computer science and engineering
from the Indian Institute of Technology, Bombay,
Mumbai, India, and the M.S. and Ph.D. degrees
in computer engineering from Rice University,
Houston, TX, USA.
He is a Research Staff Member at the IBM
T. J. Watson Research Center, Yorktown Heights,
NY, USA. His current research interests include
advanced logic synthesis algorithms and high-level
synthesis.

Ruchir Puri (F’07) received the bachelor’s degree in
electronics and communication engineering from the
National Institute of Technology, Kurukshetra, India,
in 1988, the master’s degree in electrical engineering
from the Indian Institute of Technology, Kanpur,
Kanpur, India, in 1990, and the Ph.D. degree in
electrical and computer engineering from the University
of Calgary, Calgary, AB, Canada, in 1994.
He is currently an IBM Fellow at IBM
T. J. Watson Research Center, Yorktown Heights,
NY, USA, where he leads high performance design
and methodology solutions for all of IBM’s enterprise server and system chip
designs. He is an Inventor of over 50 U.S. patents (both issued and pending)
and has authored over 120 publications on the automated design of low-power
and high-performance circuits with several Best Paper awards. He is very
passionate about technology among school children and has been evangelizing
fun with electronics and FIRST LEGO LEAGUE Robotics in community
schools.
Dr. Puri is a member of the IBM Academy of Technology and is an IBM
Master Inventor. In addition, he has received the Best of IBM awards in both
2011 and 2012. He is a recipient of Semiconductor Research Corporation
Mehboob Khan outstanding Mentor Award and has been an Adjunct Professor
at the Department of Electrical Engineering, Columbia University, New York,
NY, USA. In 2011, he was honored with the John Von-Neumann Chair at the
Institute of Discrete Mathematics at Bonn University, Bonn, Germany, for his
scientific contributions and their impact on broader society. He has received
numerous accolades including the highest technical position at IBM, the IBM
Fellow, which was awarded for his transformational role in microprocessor
design methodology. He is also an ACM Distinguished Speaker and has been
an IEEE Distinguished Lecturer. He also received the 2014 Asian American
Engineer of the Year Award. He has delivered numerous keynotes and invited
talks at major VLSI Design and Automation conferences, National Science
Foundation and U.S. Department of Defense Research panels and has been
an Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS.

David Z. Pan (S’97–M’00–SM’06–F’14) received
the B.S. degree from Peking University, Beijing,
China, and the M.S. and Ph.D. degrees from the
University of California, Los Angeles (UCLA),
Los Angeles, CA, USA.
From 2000 to 2003, he was a Research Staff
Member at IBM T. J. Watson Research Center. He
is currently a Full Professor and Brasfield Endowed
Faculty Fellow at the Department of Electrical
and Computer Engineering, University of Texas at
Austin, Austin, TX, USA. He has published over
200 papers in refereed journals and conferences, and is the holder of eight
U.S. patents. His current research interests include nanoscale design for
manufacturability and reliability, physical design, vertical integration design and
technology, and design/CAD for emerging technologies.
Prof. Pan has served as a Senior Associate Editor for ACM Transactions
on Design Automation of Electronic Systems, an Associate Editor for the
IEEE TRANSACTIONS ON COMPUTER AIDED DESIGN OF INTEGRATED
CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON VERY LARGE
SCALE INTEGRATION SYSTEMS, the IEEE TRANSACTIONS ON CIRCUITS
AND SYSTEMS—PART I, the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS—PART II, Science China Information Sciences, Journal of
Computer Science and Technology, and the IEEE CAS Society Newsletter.
He has served as the Chair of the IEEE CANDE Committee and the
ACM/SIGDA Physical Design Technical Committee, Program/General Chair
of ISPD, TPC Subcommittee Chair for DAC, ICCAD, ASPDAC, ISLPED,
ICCD, ISCAS, VLSI-DAT, ISQED, and Tutorial Chair for DAC 2014, among
others. He received a number of awards for his research contributions and
professional services, including the SRC 2013 Technical Excellence Award,
DAC Top 10 Author in Fifth Decade, DAC Prolific Author Award, 11 Best
Paper Awards at premier venues (ISPD 2014, ICCAD 2013, ASPDAC 2012,
ISPD 2011, IBM Research 2010 Pat Goldberg Memorial Best Paper Award in
CS/EE/Math, ASPDAC 2010, DATE 2009, ICICDT 2009, SRC Techcon in
1998, 2007, and 2012), Communications of the ACM Research Highlights
in 2014, ACM/SIGDA Outstanding New Faculty Award in 2005, NSF
CAREER Award in 2007, SRC Inventor Recognition Award three times, IBM
Faculty Award four times, UCLA Engineering Distinguished Young Alumnus
Award in 2009, UT Austin RAISE Faculty Excellence Award in 2014, ISPD
Routing Contest Awards in 2007, eASIC Placement Contest Grand Prize in
2009, ICCAD’12 and ICCAD’13 CAD Contest Awards, IBM Research Bravo
Award in 2003, Dimitris Chorafas Foundation Research Award in 2000, and
ACM Recognition of Service Award in 2007 and 2008. From 2008 to 2009,
he was an IEEE CAS Society Distinguished Lecturer.
