Roy 2014
Abstract: This paper proposes an efficient algorithm to synthesize prefix graph structures that yield adders with the best performance-area trade-off. For designing a parallel prefix adder of a given bit-width, our approach generates prefix graph structures to optimize an objective function, such as the size of the prefix graph, subject to constraints like bit-wise output logic level. Given bit-width $n$ and level ($L$) restriction, our algorithm outperforms the existing algorithms in minimizing the size of the prefix graph. We also prove its size-optimality when $n$ is a power of two and $L = \log_2 n$. Besides prefix graph size optimization and having the best performance-area trade-off, our approach, unlike existing techniques, can 1) handle more complex constraints such as maximum node fanout or wire-length that impact the performance/area of a design and 2) generate several feasible solutions that minimize the objective function. Generating several size-optimal solutions provides the option to choose adder designs that mitigate constraints such as wire congestion or power consumption that are difficult to model as constraints during logic synthesis. Experimental results demonstrate that our approach improves performance by 3% and area by 9% over even a 64-bit full custom designed adder implemented in an industrial high-performance design.

Index Terms: Bottom-up approach, logic synthesis, parallel prefix adder, performance-area trade-off.

I. INTRODUCTION

DATAPATH logic constitutes a significant portion of a general purpose microprocessor and frequently occurs on the timing-critical paths in high-performance designs. Arithmetic components, such as adders, multipliers, and shifters, are the basic building blocks in datapath logic and hence, to a great extent, dictate the performance of the entire chip. Binary addition is one of the most fundamental and widely used arithmetic operations in microprocessors. Today, adders are designed in two ways: either manually through full custom design or in an automated manner using synthesis tools. In a custom adder design methodology, a designer has to manually choose between regular adder structures such as Kogge–Stone [1], Sklansky [2], Brent–Kung [3], and Han–Carlson [4], and tune physical design parameters such as placement, gate sizing, and buffer optimization to maximize performance under power constraints for the target technology [5], [6]. Hence, custom adder design methodology is expensive, takes a long time to converge to a satisfactory design, and is inflexible to late design changes.

In contrast, an automated synthesis approach is productive and flexible to late design changes but traditionally has lagged behind in performance as compared to custom designs. Therefore, the prevalent design approach for high-performance datapath logic continues to be custom design. In the past, several algorithms have been proposed to generate parallel prefix adders targeting minimization of the size of the prefix graph ($s$) under given bit-width ($n$) and logic level ($L$) constraints. A prefix graph is said to be zero deficiency if $s + L = 2n - 2$. Snir [7] has proved this theoretical bound for $L \ge 2\log_2 n - 2$ with uniform input profile. In [8], zero-deficiency prefix graphs $Z(L)$ are proposed, where $Z(L)$ has the provable maximum bit-width for a given depth $L$ among all zero-deficiency prefix circuits. The bit-width of the $Z(L)$ circuit is given by $N_Z(L) = F(L + 3) - 1$ ($F$ denotes the Fibonacci function) for $L > 1$. Compared to [7], [8] indeed gives a more general bound for the size of the prefix graphs. For instance, $N_Z(6) = 33$, so for a prefix graph of bit-width 32 and level 6, the minimum achievable size is $s_{\min} = 32 \cdot 2 - 2 - 6 = 56$, which the result of Snir does not cover since $6 < 2 \cdot 5 - 2$.

Ladner and Fischer [9] present a recursive construction of parallel prefix graphs to obtain a trade-off between $s$ and $L$, but it could not even achieve the bound provided by [7]. Other existing algorithms, like a greedy depth-decreasing heuristic [10], dynamic programming based approaches [11], [12], or non-heuristic optimization [13], could achieve this bound for some cases but yield sub-optimal results as logic level constraints are reduced (e.g., to $\log_2 n$), which is more relevant for high performance adders. In [12], an algorithmic approach is proposed to achieve minimal delay at all output bits for uniform/non-uniform input profile, although that work does not focus on minimizing the size of the prefix graph. Reference [13] presents an algorithm for the generation of parallel prefix structures for arbitrary level constraints to minimize the size, but it fails to get size-optimal solutions for levels closer to $\log_2 n$. Reference [14] proposes logarithmic adder structures with a fan-out of 2, and presents a model to analyze the area-delay product of those structures. However,

Manuscript received December 8, 2013; revised March 26, 2014 and June 11, 2014; accepted June 13, 2014. Date of current version September 16, 2014. This paper was recommended by Associate Editor J. Cortadella.
S. Roy and D. Z. Pan are with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA.
M. Choudhury and R. Puri are with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2014.2341926
0278-0070 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
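The zero-deficiency relations quoted in the introduction can be checked numerically. The following is a minimal sketch, not part of the authors' implementation: the function names are hypothetical, and the Fibonacci indexing F(1) = F(2) = 1 is an assumption chosen because it reproduces the value N_Z(6) = 33 given in the text.

```python
from math import log2


def fib(k: int) -> int:
    """Fibonacci number F(k), assuming the convention F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def max_zero_deficiency_width(level: int) -> int:
    """N_Z(L) = F(L + 3) - 1, the formula the text quotes for L > 1."""
    if level <= 1:
        raise ValueError("the formula is quoted only for L > 1")
    return fib(level + 3) - 1


def zero_deficiency_size(n: int, level: int) -> int:
    """Size s of a zero-deficiency prefix graph, from s + L = 2n - 2."""
    return 2 * n - 2 - level


def within_snir_range(n: int, level: int) -> bool:
    """True when L >= 2*log2(n) - 2, the range stated for Snir's bound."""
    return level >= 2 * log2(n) - 2


if __name__ == "__main__":
    n, level = 32, 6
    print(max_zero_deficiency_width(level))  # widest zero-deficiency graph at this depth
    print(zero_deficiency_size(n, level))    # size implied by s + L = 2n - 2
    print(within_snir_range(n, level))       # whether Snir's range covers this case
```

For n = 32 and L = 6 this reproduces the worked example in the text: the width limit is 33, the implied size is 56, and the level falls outside the range L ≥ 2 log2 n − 2.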
1518 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 10, OCTOBER 2014
TABLE I: PREFIX GRAPH SIZE FOR $\log_2 n$ LEVEL
TABLE II: PREFIX GRAPH SIZE FOR OTHER THAN $\log_2 n$ LEVEL
Fig. 13. 32 bit prefix graphs generated by our approach with level 5 and 6.
Fig. 14. Number of prefix nodes versus WNS for 16 bit adder.
Fig. 15. Size of a 16 bit prefix graph with level 4 and fanout 2 generated by our approach is less than that of Kogge–Stone by 7.
TABLE VI: COMPARISON WITH KOGGE–STONE ADDER
TABLE VII: POST PLACEMENT COMPARISON
Fig. 16. Area versus worst negative slack plot for 16 and 32 bit adders.
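Comparisons such as the area versus WNS plot of Fig. 16 rest on two simple computations: converting a worst negative slack into a critical path delay, and extracting the Pareto-optimal points from a set of candidate solutions. The sketch below illustrates both under stated assumptions. The function names are hypothetical, the candidate points are made-up illustration values and not data from the paper, and only the 75 + 84.5 = 159.5 ps example is taken from the text.

```python
from typing import Iterable, List, Tuple

# (area, delay) pair; smaller is better in both coordinates.
Point = Tuple[float, float]


def critical_path_delay(target_delay: float, wns: float) -> float:
    """Critical path delay = target delay + |WNS|, as the text describes."""
    return target_delay + abs(wns)


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Return the points not dominated in both area and delay.

    After sorting by (area, delay), a point is on the front only if its
    delay is strictly smaller than that of every point with smaller or
    equal area seen so far.
    """
    front: List[Point] = []
    best_delay = float("inf")
    for area, delay in sorted(set(points)):
        if delay < best_delay:
            front.append((area, delay))
            best_delay = delay
    return front


if __name__ == "__main__":
    # Example from the text: 64 bit Kogge-Stone, 75 ps target, 84.5 ps |WNS|.
    print(critical_path_delay(75.0, -84.5))

    # Hypothetical (area, delay) candidates, for illustration only.
    candidates = [(100, 160.0), (120, 150.0), (130, 155.0),
                  (90, 170.0), (120, 152.0)]
    print(pareto_front(candidates))
```

Any candidate lying above and/or to the right of the returned points is dominated, which is the sense in which the regular adders are compared against the Pareto curve.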
path delay by adding the target delay and the absolute value of the WNS. For instance, the critical path delay for the 64 bit Kogge–Stone adder is 75 + 84.5 = 159.5 ps. Both wirelength and area are unitless. Area is reported as the number of icells and wirelength as the number of tracks. An icell has a constant area based on pitch. Our approach is compared against regular adders like Brent–Kung (BK) and Kogge–Stone (KS) adders, adders generated by dynamic programming (DP) [11], and a 64 bit full custom adder (CT).

Fig. 16 represents the plot of area versus WNS for the solutions provided by our approach along with those provided by other methods. We can draw a Pareto curve with the solution points obtained using our approach, which gives the option to select the individual points on the Pareto curve based on area/power budget. We see that the solution points of the other methods are above and/or to the right of this curve, which indicates that we can always get some solution on the Pareto front that is better in terms of performance and/or area than each of the other methods. For a 16 bit adder, the total number of Pareto-optimal points is 4 and the single point $p_1$ provides a better solution than DP, KS, and BK. For a 32 bit adder, the points $p_1$, $p_2$, $p_3$ are better solutions than BK, DP, and KS, respectively.

Fig. 17 compares these metrics for a single solution (with best WNS) of the 64 bit adder with other approaches. Our approach improves performance by 19% with 2% higher area over a Brent–Kung adder, improves performance and area by 0.4% and 33%, respectively, over a Kogge–Stone adder, improves performance and area by 3% and 6.7%, respectively, over Dynamic Programming [11], and improves performance and area by 3.2% and 8.5% over a full custom adder design. Note that the performance improvement was computed based on the actual critical path delay value and not the worst negative slack. Our approach also improves wire-length and TNS over both Kogge–Stone and full custom adder design.

Since most adders today are synthesized in Design Compiler (DC) using Synopsys DesignWare, the adder architectures provided by our approach are also synthesized in DC (Version G-2012.06-SP4) and placed, routed, and timed by IC Compiler (ICC) to compare with the behavioral adder implementation (Y = A + B) by DC. To generate high-performance adders, DC produces modified Sklansky adders consisting of alternating AOI21 and OAI21 gates, and employs gate-sizing or buffer insertion to handle the high-fanout nodes. This generally gives a delay close to that of Kogge–Stone at much lower area/power, and competitive power/performance/area with even custom adders. The 32 nm SAED LVT cell-library [28] (available through the Synopsys University Program) has been used for technology-mapping. All experimental results for DC/ICC are in the “tt1p05v125c” corner, in which the supply voltage is 1.05 V and the temperature is 125 °C. The FO4 delay of a unit-sized inverter in this corner is 36 ps and the area of the unit-sized inverter is 1.27 μm².

Fig. 18 shows the delay versus power (total power, i.e., leakage + switching + internal power) plot for minimum size solutions of 64 bit adder architectures provided by our approach after synthesis by DC and placement and routing by ICC. For all these runs (including those for Sklansky, Kogge–Stone, and behavioral adder synthesis by DC), the target delay is set to 200 ps, the operating frequency is 1 GHz, activities at the primary inputs are 0.1, and the adders are synthesized by the command “compile_ultra.” Note that the option “-area_high_effort_script” is on by default. We also perform some experiments by: 1) switching on the option “-timing_high_effort_script”, which can further optimize at the expense of run time and 2) altering the target delay (180 ps or 220 ps), but observe that the change in delay value remains within a range of 5–10 ps. We can draw the Pareto-optimal curve of delay versus power with those solutions and see that the solutions provided by the Sklansky adder, the Kogge–Stone adder, and the behavioral adder implementation of DC are above and/or to the right side of the Pareto front. For instance, the solution $p_2$ in Fig. 18 improves on the Sklansky adder in all metrics, i.e., delay (1.8%), area (2.4%), and power (2.8%), and the solution $p_1$ in Fig. 18 improves on the Kogge–Stone adder in area by 30.6% and power by 29.6% with 3.8 ps or 1.1% overhead in delay. Compared to the DC behavioral adder implementation, our approach (point $p_1$) provides competitive delay (5 ps better) with significant area (26%) and power (18%) reduction. Table VIII compares our approach with other approaches in terms of delay, power, and area. Note that the solution with best delay is considered for this comparison.

TABLE VIII: COMPARISON FOR 64 BIT ADDERS, SYNTHESIZED BY DC AND PLACED/ROUTED BY ICC

It should be stressed that our approach generates several candidate prefix graphs for performance/area trade-off, and the prefix networks that give the best performance are not the same across different technology nodes and libraries. For instance, we have run our approach in PDS (IBM) with CMOS SOI 22 nm and in Synopsys DesignWare (DC + ICC) with the 32 nm SAED library, and the prefix trees which have given the best performance in the two cases differ from one another.

Ling transformations [20] can also be applied to the prefix graphs generated in our approach to further optimize the performance. Also, since the solutions for regular adders are located above and/or to the right side of the Pareto front, we believe that the solutions on the Pareto front can be used as alternatives for regular adders for use in custom designs.

V. CONCLUSION

In this paper, a highly efficient high performance adder synthesis technique driven by parallel prefix graph generation is presented. The complexity of the parallel prefix graph generation problem for adders is exponential in the number of bits. We present efficient pruning strategies and implementation techniques to scale this approach up to 128 bit adders. We have demonstrated a way to generate size-optimum prefix graphs for $2^m$ bit adders with level $m$ and proved its optimality. The results, both at the technology-independent level and after physical synthesis (post placement), show that this approach significantly improves over existing techniques by yielding better quality of results in terms of both timing and wire length for high performance adders in state of the art microprocessor designs. The proposed approach improves over even the manually designed custom adders, yielding up to 3% better delay and 9% better area. As our approach can generate multiple prefix graph structures for given constraints, it provides a framework for further exploration to identify structures that can account for practical design issues like wire congestion and power consumption.

APPENDIX

Proof (Lemma 4): Let us denote any node by a triplet, viz. the bit-range of the node (MSB and LSB) and its level. We consider a node $M_1(\mathit{msb}_1, \mathit{lsb}_1, \mathit{level}_1)$ to be no worse than another node $M_2(\mathit{msb}_2, \mathit{lsb}_2, \mathit{level}_2)$ iff $\mathit{msb}_1 = \mathit{msb}_2$, $\mathit{lsb}_1 = \mathit{lsb}_2$ (i.e., the bit-ranges of $M_1$ and $M_2$ are equal) and $\mathit{level}_1 \le \mathit{level}_2$. We define a restricted set of bit-ranges (RBR) as follows: a bit-range $\mathit{msb}{:}\mathit{lsb} \in \mathrm{RBR}$ if, for all $i$ such that $\mathit{msb} > i \ge \mathit{lsb}$, $\mathrm{LSB}(b_i) \ge \mathit{lsb}$. For instance, $7{:}4 \in \mathrm{RBR}$, since $\mathrm{LSB}(b_6) = 6 \ge 4$ and $\mathrm{LSB}(b_5) = \mathrm{LSB}(b_4) = 4 \ge 4$, whereas $4{:}2 \notin \mathrm{RBR}$, since $\mathrm{LSB}(b_3) = 0 < 2$. It is easy to notice that if there is no non-trivial fan-in from nodes above base-nodes, then there does not exist any node in the prefix graph for which the bit-range is not in RBR, because for any bit-range $\mathit{msb}{:}\mathit{lsb} \notin \mathrm{RBR}$, there exists $q$ such that $\mathit{msb} > q \ge \mathit{lsb}$ and $\mathrm{LSB}(b_q) < \mathit{lsb}$, which is not possible unless there is a non-trivial fan-in from a node above $b_q$ (black node marked in Fig. 19).

The structure of the proof is as follows. We will first prove the proposition (by induction) that by not allowing any non-trivial fan-in from the nodes above base-nodes, we can still realize any bit-range $\mathit{br} \in \mathrm{RBR}$ with the same (or lower) level restriction and size, compared to allowing non-trivial fan-in from nodes above base-nodes. Once we prove this for any such bit-range, it directly follows that we can get the size-optimum solutions of a $2^m$ bit prefix graph with level $m$ by not allowing any non-trivial fan-in from the nodes above base-nodes, because the bit-ranges of all output bit nodes are in RBR.

Let $b_x(x, z + 1, r)$ be a base-node for bit-index $x$ and $N_1(x, y + 1, l_1)$ be any node above $b_x$, where $l_1 < r$ (Fig. 20). We assume that this proposition holds for bit-ranges with MSB $\le x$ and then prove its validity for any bit-range with MSB $= x + 1$ (by induction). Note that the proposition holds for $x = 1$ (bit-range 1:0 can be constructed only by combining the input bits for bit-indices 0 and 1). The node $N_1$ may be used for constructing any bit-range with MSB $x + 1$ by taking a non-trivial fan-in from $N_1$. But if we can show that there is always an alternative way, by taking non-trivial fan-in from or below $b_x$ (which is no worse than allowing the non-trivial fan-in from $N_1$), to construct the bit-range with MSB $x + 1$, then we are done. Let us combine the node $N_1$ with the input node for bit-index $x + 1$ to get $N_5(x + 1, y + 1, l_1 + 1)$. Let $N_2(z, u, l_2)$ be the node for bit $z$, which is used for realizing any arbitrary bit-range
ROY et al.: TOWARDS OPTIMAL PERFORMANCE-AREA TRADE-OFF IN ADDERS BY SYNTHESIS 1529
$x + 1{:}u \in \mathrm{RBR}$ with MSB $x + 1$. By the induction hypothesis, $l_2 \ge lv(b_z)$ and $lv(b_z) > lv(b_x) = r$ (by Lemma 3). Therefore, $l_2 > r$.

Now, there are two options to get $x + 1{:}u$ by using nodes $N_5$ and $N_2$. Firstly, we can combine $N_5$ and some node $N_3(y, z + 1, l_3)$ to generate $N_6(x + 1, z + 1, l_6)$ and then combine with $N_2$ to generate $N_7(x + 1, u, l_7)$ [Fig. 20(a)]. Here $l_6 = \max(l_1 + 2, r + 1)$ (since $x + 1 - z > 2^r$). Therefore, $l_7 = \max(l_1 + 3, r + 2, l_2 + 1) = \max(l_1 + 3, l_2 + 1)$ (since $l_2 > r$). In the second case [Fig. 20(b)], we combine $N_4$ and $N_5$ to generate $N_8(x + 1, u, l_8)$, where $l_8 \ge \max(l_1 + 2, l_2 + 2)$. But we can always have an alternative choice to construct the bit-range $x + 1{:}u$ by combining $b_x$ and the input node for bit-index $x + 1$ and then combining with $N_2$ [Fig. 20(c)] to generate $N_{10}(x + 1, u, l_{10})$, where $l_{10} = \max(r + 2, l_2 + 1) = l_2 + 1$. Compared to both option 1 and option 2, the alternative choice adds no more nodes and still realizes the same bit-range with the same or a lower level ($l_{10} < l_8$ and $l_{10} \le l_7$).

Hence the proposition holds for any bit-range in RBR with MSB $= x + 1$, given that it holds for any bit-range in RBR with MSB $\le x$. This proves the lemma.

Fig. 20. Proof of Lemma 4. (a) Option 1. (b) Option 2. (c) Alternative option.

ACKNOWLEDGMENT

The authors would like to thank R. Chhabra, currently with Broadcom, for his help in setting up the DC/ICC run.

REFERENCES

[1] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient solution of a general class of recurrence equations,” IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[2] J. Sklansky, “Conditional sum addition logic,” IRE Trans. Electron. Comput., vol. EC-9, no. 2, pp. 226–231, Jun. 1960.
[3] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[4] T. Han and D. Carlson, “Fast area-efficient VLSI adders,” in Proc. IEEE 8th Symp. Comput. Arith. (ARITH), Como, Italy, May 1987, pp. 49–56.
[5] C. Zhou, B. M. Fleischer, M. Gschwind, and R. Puri, “64-bit prefix adders: Power-efficient topologies and design solutions,” in Proc. IEEE Custom Integr. Circuit Conf., San Jose, CA, USA, Sep. 2009, pp. 179–182.
[6] J. Liu, Y. Zhu, H. Zhu, C. K. Cheng, and J. Lillis, “Optimum prefix adders in a comprehensive area, timing and power design space,” in Proc. Asia South Pac. Des. Autom. Conf., Yokohama, Japan, Jan. 2007, pp. 609–615.
[7] M. Snir, “Depth-size trade-offs for parallel prefix computation,” J. Algorithms, vol. 7, no. 2, pp. 185–201, Jun. 1986.
[8] C. K. Cheng, H. Zhu, and R. Graham, “Constructing zero-deficiency parallel prefix adder of minimum depth,” in Proc. Asia South Pacific Des. Autom. Conf., Jan. 2005, pp. 883–88.
[9] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. ACM, vol. 27, no. 4, pp. 831–838, Oct. 1980.
[10] J. P. Fishburn, “A depth decreasing heuristic for combinational logic; or how to convert a ripple-carry adder into a carry-lookahead adder or anything in-between,” in Proc. Des. Autom. Conf., Orlando, FL, USA, Jun. 1990, pp. 361–364.
[11] T. Matsunaga and Y. Matsunaga, “Area minimization algorithm for parallel prefix adders under bitwise delay constraints,” in Proc. Great Lakes Symp. VLSI, 2007, pp. 435–440.
[12] J. Liu, S. Zhou, H. Zhu, and C. K. Cheng, “An algorithmic approach for generic parallel adders,” in Proc. Int. Conf. Comput. Aided Des., San Jose, CA, USA, Nov. 2003, pp. 734–740.
[13] R. Zimmermann, “Non-heuristic optimization and synthesis of parallel prefix adders,” in Proc. Int. Workshop Logic Archit. Synth., 1996, pp. 123–132.
[14] M. Ziegler and M. Stan, “Optimal logarithmic adder structures with a fanout of two for minimizing the area-delay product,” in Proc. Int. Symp. Circuit. Syst., Sydney, NSW, Australia, May 2001, pp. 657–660.
[15] S. Knowles, “A family of adders,” in Proc. 15th IEEE Symp. Comput. Arithmetic, Vail, CO, USA, 2001, pp. 277–284.
[16] A. K. Verma and P. Lenne, “Towards the automatic exploration of arithmetic-circuit architectures,” in Proc. Des. Autom. Conf., San Francisco, CA, USA, 2006, pp. 445–450.
[17] S. Roy, M. Choudhury, R. Puri, and D. Z. Pan, “Towards optimal performance-area trade-off in adders by synthesis of parallel prefix structures,” in Proc. 50th ACM/EDAC/IEEE Des. Autom. Conf., Austin, TX, USA, May/Jun. 2013, pp. 1–8.
[18] D. Harris, “A taxonomy of parallel prefix networks,” in Proc. 37th Asilomar Conf. Signals Syst. Comput., Nov. 2003, pp. 2213–2217.
[19] B. R. Zeydel, T. T. J. H. Kluter, and V. G. Oklobdzija, “Efficient mapping of addition recurrence algorithms in CMOS,” in Proc. 17th IEEE Symp. Comput. Arithmetic, Jun. 2005, pp. 107–113.
[20] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI Ling adders,” IEEE Trans. Comput., vol. 54, no. 2, pp. 225–231, Feb. 2005.
[21] S. Mathew, M. Anders, R. K. Krishnamurthy, and S. Borkar, “A 4-GHz 130 nm address generation unit with 32-bit sparse-tree adder core,” IEEE J. Solid-State Circuits, vol. 38, no. 5, pp. 689–695, May 2003.
[22] M. Ketter et al., “Implementation of 32-bit Ling and Jackson adders,” in Proc. 45th Asilomar Conf. Signals Syst. Comput. (ASILOMAR), Pacific Grove, CA, USA, Nov. 2011, pp. 170–175.
[23] S. Kao, R. Zlatanovici, and B. Nikolic, “A 240ps 64b carry-lookahead adder in 90nm CMOS,” in Proc. Int. Solid-State Circuits Conf., San Francisco, CA, USA, Feb. 2006, pp. 1735–1744.
[24] S. Naffziger, “A subnanosecond 0.5 um 64b adder design,” in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, USA, Feb. 1996, pp. 362–363.
[25] D. Patil, M. Horowitz, R. Ho, and R. Ananthraman, “Robust energy-efficient adder topologies,” in Proc. IEEE Symp. Comput. Arithmetic, Montpellier, France, Jun. 2007, pp. 16–28.
[26] H. Sutter, More Exceptional C++. Addison Wesley, 2002 [Online]. Available: http://www.gotw.ca/publications/mxc++.htm
[27] H. Ren, D. Z. Pan, and D. S. Kung, “Sensitivity guided net weighting for placement driven synthesis,” in Proc. Int. Symp. Phys. Des., Apr. 2004, pp. 10–17.
[28] (2014, Mar. 14). [Online]. Available: http://www.synopsys.com/Community/UniversityProgram/Pages/32-28nm-generic-library.aspx

Subhendu Roy (S’13) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2006, and the M.Tech. degree in electronic systems from the Indian Institute of Technology, Bombay, Mumbai, India, in 2009. He is currently pursuing the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA.
His current research interests include design automation for logic synthesis, physical design, and cross-layer reliability. He has 3 years of full-time industry experience at EDA company, Atrenta, where he was involved in developing tools in the architectural power domain and RTL domain. He also did internships at IBM T. J. Watson Research Center in 2012 and Mentor Graphics in 2013 and 2014.
Mr. Roy received the Best Paper Award from ISPD’14.

Mihir Choudhury (S’05–M’12) received the B.Tech. degree in computer science and engineering from the Indian Institute of Technology, Bombay, Mumbai, India, and the M.S. and Ph.D. degrees in computer engineering from Rice University, Houston, TX, USA.
He is a Research Staff Member at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. His current research interests include advanced logic synthesis algorithms and high-level synthesis.

David Z. Pan (S’97–M’00–SM’06–F’14) received the B.S. degree from Peking University, Beijing, China, and the M.S. and Ph.D. degrees from the University of California, Los Angeles (UCLA), Los Angeles, CA, USA.
From 2000 to 2003, he was a Research Staff Member at IBM T. J. Watson Research Center. He is currently a Full Professor and Brasfield Endowed Faculty Fellow at the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA. He has published over 200 papers in refereed journals and conferences, and is the holder of eight U.S. patents. His current research interests include nanoscale design for manufacturability and reliability, physical design, vertical integration design and technology, and design/CAD for emerging technologies.
Prof. Pan has served as a Senior Associate Editor for ACM Transactions on Design Automation of Electronic Systems, an Associate Editor for the IEEE TRANSACTIONS ON COMPUTER AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–PART I, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–PART II, Science China Information Sciences, Journal of Computer Science and Technology, and the IEEE CAS Society Newsletter. He has served as the Chair of the IEEE CANDE Committee and the ACM/SIGDA Physical Design Technical Committee, Program/General Chair of ISPD, TPC Subcommittee Chair for DAC, ICCAD, ASPDAC, ISLPED, ICCD, ISCAS, VLSI-DAT, ISQED, and Tutorial Chair for DAC 2014, among others. He received a number of awards for his research contributions and professional services, including the SRC 2013 Technical Excellence Award, DAC Top 10 Author in Fifth Decade, DAC Prolific Author Award, 11 Best Paper Awards at premier venues (ISPD 2014, ICCAD 2013, ASPDAC 2012, ISPD 2011, IBM Research 2010 Pat Goldberg Memorial Best Paper Award in CS/EE/Math, ASPDAC 2010, DATE 2009, ICICDT 2009, SRC Techcon in 1998, 2007, and 2012), Communications of the ACM Research Highlights in 2014, ACM/SIGDA Outstanding New Faculty Award in 2005, NSF CAREER Award in 2007, SRC Inventor Recognition Award three times, IBM Faculty Award four times, UCLA Engineering Distinguished Young Alumnus Award in 2009, UT Austin RAISE Faculty Excellence Award in 2014, ISPD Routing Contest Awards in 2007, eASIC Placement Contest Grand Prize in 2009, ICCAD’12 and ICCAD’13 CAD Contest Awards, IBM Research Bravo Award in 2003, Dimitris Chorafas Foundation Research Award in 2000, and ACM Recognition of Service Award in 2007 and 2008. From 2008 to 2009, he was an IEEE CAS Society Distinguished Lecturer.

Ruchir Puri (F’07) received the bachelor’s degree in
electronics and communication engineering from the
National Institute of Technology, Kurukshetra, India,
in 1988, the master’s degree in electrical engineer-
ing from the Indian Institute of Technology, Kanpur,
Kanpur, India, in 1990, and the Ph.D. degree in electrical and computer engineering from the University
of Calgary, Calgary, AB, Canada, in 1994.
He is currently an IBM Fellow at IBM
T. J. Watson Research Center, Yorktown Heights,
NY, USA, where he leads high performance design
and methodology solutions for all of IBM’s enterprise server and system chip
designs. He is an Inventor of over 50 U.S. patents (both issued and pending)
and has authored over 120 publications on the automated design of low-power
and high-performance circuits with several Best Paper awards. He is very pas-
sionate about technology among school children and has been evangelizing
fun with electronics and FIRST LEGO LEAGUE Robotics in community
schools.
Dr. Puri is a member of the IBM Academy of Technology and is an IBM
Master Inventor. In addition, he has received the Best of IBM awards in both
2011 and 2012. He is a recipient of Semiconductor Research Corporation
Mehboob Khan outstanding Mentor Award and has been an Adjunct Professor
at the Department of Electrical Engineering, Columbia University, New York,
NY, USA. In 2011, he was honored with the John Von-Neumann Chair at the
Institute of Discrete Mathematics at Bonn University, Bonn, Germany, for his
scientific contributions and their impact on broader society. He has received
numerous accolades including the highest technical position at IBM, the IBM
Fellow, which was awarded for his transformational role in microprocessor
design methodology. He is also an ACM Distinguished Speaker and has been
an IEEE Distinguished Lecturer. He also received the 2014 Asian American
Engineer of the Year Award. He has delivered numerous keynotes and invited
talks at major VLSI Design and Automation conferences, National Science
Foundation and U.S. Department of Defense Research panels and has been
an Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS.