Lee 2010
Chris Y.H. Lee¹, Lo Hai Hiung², Sean W.F. Lee³, Nor Hisham Hamid⁴
1, 2, 4
Department of Electrical and Electronic Engineering
Universiti Teknologi PETRONAS, Bandar Seri Iskandar
31750 Tronoh, Perak Darul Ridzuan, Malaysia
E-mail: 1 [email protected], 2 [email protected], 4 [email protected]
3
Emerald Systems Sdn. Bhd.
737-1-10 Kompleks Sri Sg. Nibong
Jln. Sultan Azlan Shah
11990 Penang, Malaysia
E-mail: [email protected]
Figure 4 – Dot Diagram of an 8x8-bit RA Multiplier

This study encompasses a front-end custom IC design flow
(see Figure 5). Each design was implemented in Verilog
HDL and functionally verified in Mentor Graphics
ModelSim, then synthesized with the LeonardoSpectrum
synthesis tool. The complete HDL methodology is described
in [1]. Each tree design utilizes an AND array for PP
generation and a ripple carry adder for final summation as in
their dot diagram-based designs (no enhancements).

Area Optimized
Size     Array    Wallace  Dadda    RA
4x4      75       81       78       82
16x16    1,744    1,905    1,782    1,802
32x32    7,231    7,598    7,114    7,198

Speed Optimized
Size     Array    Wallace  Dadda    RA
4x4      113      118      115      114
8x8      628      609      676      633

Auto Optimized
Size     Array    Wallace  Dadda    RA
4x4      74       76       75       81


Area Optimized
Size     Array    Wallace  Dadda    RA
4x4      3.26     2.77     2.64     2.77
8x8      9.17     6.08     5.61     6.14
16x16    20.64    11.6     10.97    12.03
32x32    44.33    22.77    22.06    22.63

Speed Optimized
Size     Array    Wallace  Dadda    RA
4x4      3.10     2.40     2.54     2.59

Auto Optimized
Size     Array    Wallace  Dadda    RA
4x4      3.30     3.30     2.87     2.91

Figure 6 – Area-Optimized Delay Comparison

Following that, an interesting finding obtained is the delay of
the RA design for Auto-Optimized synthesis (see Figure 7),
which outperforms both Wallace and Dadda designs. This
speed gain can likely be attributed to its maximal use of
adders in each reduction stage, and which indicates that it is
not necessary to employ non-trivial Wallace and Dadda
schemes to obtain best speed performance.
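The AND-array partial-product (PP) generation and the dot-diagram
column structure mentioned in the methodology can be modelled in a few
lines of Python. This is only a behavioural sketch, not the Verilog
used in the study: the function names are hypothetical, the reduction
tree is replaced by a plain weighted sum, and the Dadda height sequence
(2, 3, 4, 6, 9, ...) is taken from the standard formulation of the
scheme rather than from this paper.

```python
def partial_products(a, b, n):
    """Return the AND-array partial products of two n-bit operands.

    cols[k] holds the bits a_i AND b_j with i + j == k, i.e. one column
    of the dot diagram with weight 2**k.
    """
    cols = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return cols


def column_heights(n):
    """Number of dots in each column of the unreduced n x n dot diagram."""
    return [min(k, 2 * n - 2 - k) + 1 for k in range(2 * n - 1)]


def dadda_stage_heights(n):
    """Target column heights per Dadda reduction stage, largest first."""
    seq = [2]
    while seq[-1] < n:
        seq.append(seq[-1] * 3 // 2)  # next height is floor(1.5 * previous)
    return [d for d in reversed(seq) if d < n]


def multiply(a, b, n):
    """Sum the weighted partial products (stands in for reduction + final adder)."""
    cols = partial_products(a, b, n)
    return sum(sum(bits) << k for k, bits in enumerate(cols))


if __name__ == "__main__":
    n = 8
    print(column_heights(n))       # dots per column before any reduction
    print(dadda_stage_heights(n))  # height targets, one per reduction stage
    print(multiply(200, 123, n) == 200 * 123)
```

Under these assumptions the number of reduction stages grows by two
each time the operand size doubles (2, 4, 6 and 8 stages for 4, 8, 16
and 32 bits), while the tallest dot-diagram column grows in proportion
to the operand size.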
Figure 8 – Speed-Optimized Delay Comparison

This could be best explained by the fact that the design in [3]
was analyzed using an ideal model, while the design in [5]
was synthesized under only a single optimization mode. In
this study, the various synthesis optimization modes
established different logic structures according to the
selected cost function (area/delay), while still embodying
their respective architectures (each HDL design was
implemented in a gate-level structural form). Thus, Speed
Optimization was found to optimize the Wallace tree
structure for minimal delay more effectively than the
Dadda scheme.

To probe even further, a quantitative comparison was made
between the area and delay parameters (see Table 3). The
AD and AD² values tabulated serve to reinforce the findings
presented in the previous section. These design parameters
could assist in selecting the right multiplier design for a
particular high-speed or limited-area application.

As described previously, each synthesis optimization mode
configures relative block locations differently. Therefore,
due to the different structures of every design, each
multiplier can be associated with a particular application
based on its optimal performance among synthesis modes.

Table 3 – A-D Comparison

Area Optimized

Auto Optimized
Size     Array    Wallace  Dadda    RA
A        7193     7692     7241     7289

Conclusion

Findings corroborate the basic features of the Array and tree
designs, namely linear and logarithmic delay increase with
operand size, respectively. Different synthesis modes show
that the Wallace and Dadda designs do not always behave as
they were designed to, but depend largely on how the
gate-level synthesis was performed. In conjunction with that,
best-case performances for each design in different
optimization modes were also analyzed for their 32x32-bit
variants.

From what can be observed, the Wallace design would be the
most suitable for speed-critical applications where area is
not a priority. Next, the Dadda design generally suits
smaller-scale applications, in which it can significantly
outperform the other two designs in terms of speed. On the
other hand, the Reduced Area design performed somewhere in
between the Wallace and Dadda algorithms for Area and Speed
Optimizations. However, it showed very promising delay
values in Auto-Optimized synthesis, making it feasible for
applications that minimize area.

This study relies mostly on gate counts, and thus gate
delays, to gauge performance; no wiring effects were taken
into consideration. In addition, with continuously
decreasing process granularity, prediction models employed
by current synthesis tools struggle to provide estimates
that accurately match real parasitic models. This lack of
accuracy for performance characteristics seems to be more
prevalent in technologies beyond the 0.25-micron CMOS
boundary [6]. On that account, future analysis on tree
multiplier designs could be based on more objective methods
as described in [11] as well as on the array design, which
appears to be more feasible for large operand
multiplications.
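As an illustration of the area-delay comparison discussed in this
paper, the sketch below computes AD (area times delay) and AD² (area
times delay squared) for the four designs. It is a rough sketch under
stated assumptions: the input figures are taken to be the 32x32-bit
rows of the Area-Optimized area and delay tables given earlier, the
area figures are assumed to be gate counts, units are left as
reported, and the names `DESIGNS`, `area_delay_metrics` and `rank_by`
are hypothetical. The results are not the paper's Table 3 values and
cover a single synthesis mode only.

```python
# (area, delay) pairs, assumed to be the 32x32-bit Area-Optimized rows
# of the area and delay tables in this paper.
DESIGNS = {
    "Array": (7231, 44.33),
    "Wallace": (7598, 22.77),
    "Dadda": (7114, 22.06),
    "RA": (7198, 22.63),
}


def area_delay_metrics(area, delay):
    """Return the AD and AD2 figures of merit for one design."""
    return {"AD": area * delay, "AD2": area * delay ** 2}


def rank_by(metric, designs=DESIGNS):
    """Return design names ordered from lowest (best) to highest metric value."""
    scores = {
        name: area_delay_metrics(area, delay)[metric]
        for name, (area, delay) in designs.items()
    }
    return sorted(scores, key=scores.get)


if __name__ == "__main__":
    for name, (area, delay) in DESIGNS.items():
        m = area_delay_metrics(area, delay)
        print(f"{name:8s} AD={m['AD']:.2f} AD2={m['AD2']:.2f}")
    print("AD ranking: ", rank_by("AD"))
    print("AD2 ranking:", rank_by("AD2"))
```

AD² weights delay more heavily than AD, so it is the more relevant
figure when speed matters more than area; a lower value is better for
both.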
References
[1] C.Y.H. Lee. “A Performance Comparison Study on
Multiplier Designs,” B.Eng. Project Report, Universiti
Teknologi PETRONAS, 2009.
[2] A. Habibi and P.A. Wintz. “Fast Multipliers,” IEEE
Trans. on Computers, vol. 19, pp. 153-157, 1970.
[3] W.J. Townsend, E.E. Swartzlander, Jr. and J.A. Abraham.
“A Comparison of Dadda and Wallace Multiplier
Delays,” in SPIE Adv. Signal Proc. Algorithms,
Architectures and Implementations XIII, pp. 552-560,
2003.
[4] S. Shah, A.J. Al-Khalili and D. Al-Khalili. “Comparison
of 32-bit Multipliers for Various Performance
Measures,” in The 12th International Conference on
Microelectronics, pp. 75-80, 2000.
[5] K.C. Bickerstaff and E.E. Swartzlander, Jr. “Analysis of
Column Compression Multipliers,” in 15th IEEE Symp.
on Computer Arithmetic, pp. 33-39, 2001.
[6] J.M. Rabaey, A. Chandrakasan and B. Nikolic. “Design
Synthesis,” in Digital Integrated Circuits, 2nd ed., New
Jersey: Pearson Education Inc., 2003, pp. 397, 435-439.
[7] Mi Lu. “Modular Structure of Large Multiplier,” in
Arithmetic and Logic in Computer Systems, 1st ed., New
Jersey: John Wiley & Sons, Inc., 2004, pp. 120-122.
[8] C.S. Wallace. “A Suggestion for a Fast Multiplier,” IEEE
Trans. on Electronic Computers, vol. EC-13, pp. 14-17,
1964.
[9] L. Dadda. “Some Schemes for Parallel Multipliers,” Alta
Frequenza, vol. 34, pp. 349-356, 1965.
[10] K.A.C. Bickerstaff, M. Schulte and E.E. Swartzlander,
Jr. “Reduced Area Multipliers,” in Intl. Conf. on
Application-Specific Array Processors, pp. 478-489,
1993.
[11] P.C.H. Meier, R.A. Rutenbar and L.R. Carley.
“Exploring Multiplier Architecture and Layout for Low
Power,” in IEEE Custom Integrated Circuits Conf., pp.
513-516, 1996.