Lee 2010

This study compares the performance of various multiplier designs, including Array, Wallace, Dadda, and Reduced Area multipliers, focusing on their area, speed, and optimization modes using Verilog HDL and TSMC 0.35-micron ASIC Design Kit. Results indicate that while the Wallace design is optimal for high-speed applications, the Dadda and Reduced Area designs excel in area optimization, with performance varying significantly based on synthesis optimization choices. The findings highlight the trade-offs between area and delay, emphasizing the importance of selecting the appropriate multiplier design based on specific application requirements.

A Performance Comparison Study on Multiplier Designs

Chris Y.H. Lee (1), Lo Hai Hiung (2), Sean W.F. Lee (3), Nor Hisham Hamid (4)

(1, 2, 4) Department of Electrical and Electronic Engineering
Universiti Teknologi PETRONAS, Bandar Seri Iskandar
31750 Tronoh, Perak Darul Ridzuan, Malaysia
E-mail: (1) [email protected], (2) [email protected], (4) [email protected]

(3) Emerald Systems Sdn. Bhd.
737-1-10 Kompleks Sri Sg. Nibong
Jln. Sultan Azlan Shah
11990 Penang, Malaysia
E-mail: [email protected]

Abstract

This study investigates the relative performances of Array, Wallace, Dadda and Reduced Area multipliers for several synthesis optimization modes. All multiplier designs were modeled in Verilog HDL and synthesized based on the TSMC 0.35-micron ASIC Design Kit standard cell library. Performance data was extracted after logic synthesis in LeonardoSpectrum for Area, Speed and Auto optimization modes. Findings indicate that the Dadda multiplier does not always have a speed advantage over Wallace's design; the outcome depends greatly on the optimization effects in the gate-level synthesized design. Results for comparison of 32x32-bit variants indicate that the Wallace scheme is well suited for high-speed applications, independent of area constraints, while the Dadda and Reduced Area designs deliver best speed when synthesized to minimize area or logic usage.

Keywords:
digital arithmetic, Array multiplier, Wallace multiplier, Dadda multiplier, Reduced Area multiplier, logic synthesis

Introduction

Multipliers are crucial in modern electronic systems that run complex high speed calculations, especially in DSP applications. Accordingly, several efforts have been made to analyze the various multiplier schemes, both to aid designers in further developing multiplier technology and to guide the selection of a suitable algorithm for a particular application.

Previous performance analyses have been conducted on common multiplier designs, particularly for Wallace and Dadda schemes. An early study in [2] was fully based on TTL IC implementation while in [3], a mathematical comparison was done to analyze the delays of both schemes with varying operand sizes. There was also a Field Programmable Gate Array (FPGA) targeted study done in [4] for various hybrid multiplication architectures.

The other, more relevant, research was a CMOS-based performance analysis done in [5]. Results show that the Wallace design requires more area than Dadda, while the worst-case delays for both schemes are about equal. Power estimates for the Dadda design demonstrated that doubling operand size quadruples power dissipation. In addition, effects of parasitics were also demonstrated to penalize delay performance.

Also suggested in [5] was that the reduction matrix and counter structures in these tree designs are mainly responsible for the nearly quadratic increase in power and area with respect to operand size. Other than that, ratios of power and area also increase with operand word length as a result of longer interconnects and greater glitching effects from longer reduction stages.

However, those previous analyses did not consider the effects of different synthesis optimizations. Current synthesis tools provide the flexibility of optimizing the design based on a chosen cost function: area, speed, power or a combination of these parameters [6]. The performance of a particular CMOS multiplier design would depend on the type of synthesis optimization. In some cases, optimal performance may be achieved, while for others, worst-case behavior could result.

Hence, this study examines the delay and area performance parameters for various synthesis optimization modes on the different multiplier designs. Subsequently, tradeoffs between area and delay parameters for each multiplier design are also analyzed for the different modes. Comparisons were done for each design relative to one another, without enhancements to the basic architectures.

Methodology

Array Multiplier Architecture

The array multiplication scheme incorporates a simple add-shift method, comprising AND gates and adders. Modular blocks of square array multipliers, based on [7], were used to create the next larger variant, starting from a 4x4-bit version. For example, four 16x16-bit instances are combined to form the 32x32-bit multiplier (see Figure 1).

Figure 1 – 32x32-bit Modular Array Multiplier

Wallace Multiplier Architecture

Wallace introduced an efficient parallel multiplication algorithm [8], which has a reduction stage delay of the order O(log n). This logarithmic increase in delay with respect to operand size constitutes the major speed gain over array designs (linear delay increase) at the cost of higher structural complexity.

The three key stages of a Wallace tree as summarized in [3] involve:
1. Reducing the number of partial products (PP)
2. Accelerating the formation of PPs
3. Accelerating the summing of PPs

Partial products are reduced based on a recursive algorithm in [5] which defines the height of the matrix in each reduction stage. The height of the matrix in the jth reduction stage, ω_j, is defined by the following recursive equations:

(1)

These three stages along with the corresponding matrix heights for each stage are shown in the dot diagram for an 8x8-bit Wallace multiplier (see Figure 2).

Figure 2 – Dot Diagram of an 8x8-bit Wallace Multiplier

Dadda Multiplier Architecture

The Dadda Tree multiplier [9] has the same general stages as the Wallace Tree. However, unlike the Wallace Tree, Dadda multipliers do not attempt to reduce as many partial products as possible in each layer but rather perform as few reductions as possible [3]. This makes the Dadda multiplier less costly in the reduction stage, but the numbers passed between stages are longer, requiring a larger carry-propagate adder (CPA).

For this design, formation of partial products was done using the same method as in the Wallace Tree multipliers. As for the reduction stage, a series of generalized recursive steps [3] were used to determine the heights of each stage and the number of additions required to achieve each stage height as described in the following:
1. Let d_1 = 2 and d_{j+1} = ⌊1.5 · d_j⌋, where d_j is the height of the matrix for the jth stage. Repeat until the largest jth stage is reached in which the original N height matrix contains at least one column that has more than d_j dots.
2. In the jth stage from the end, place (3, 2) and (2, 2) counters as required to achieve a reduced matrix. Only columns with more than d_j dots, or that will exceed d_j dots once they receive carries from less significant (3, 2) and (2, 2) counters, are reduced.
3. Let j = j – 1 and repeat step 2 until a matrix with a height of two is generated. This should occur when j = 1.

Application of this recursive algorithm produces the dot diagram for the 8x8-bit Dadda multiplier (see Figure 3).

Figure 3 – Dot Diagram of an 8x8-bit Dadda Multiplier

Reduced Area Multiplier Architecture

This multiplier draws upon the same three-stage multiplication process used by the Wallace Tree and Dadda Tree methods. The major difference with this algorithm, however, is that each reduction stage uses the maximum number of (3,2) counters to minimize the number of bits to be routed to the following stage.
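The stage heights that drive the tree reductions can be tabulated with a short script. The Python sketch below is illustrative only and its function names are hypothetical. The Dadda sequence uses the recursion d_1 = 2, d_{j+1} = ⌊1.5 · d_j⌋ from the steps listed earlier. The Wallace recursion ω_0 = N, ω_{j+1} = 2⌊ω_j / 3⌋ + (ω_j mod 3) is an assumption (the commonly quoted form, in which every group of three rows is reduced to two), since equation (1) is not reproduced in this text.

```python
def dadda_heights(n):
    """Return the Dadda stage heights d_1, d_2, ... that are smaller than n.

    d_1 = 2 and d_{j+1} = floor(1.5 * d_j); the sequence stops at the
    largest height that is still below the initial matrix height n.
    """
    heights = []
    d = 2
    while d < n:
        heights.append(d)
        d = (3 * d) // 2  # integer form of floor(1.5 * d)
    return heights


def wallace_heights(n):
    """Return matrix heights after each Wallace stage, starting from n.

    ASSUMPTION: every full group of three rows is reduced to two rows and
    leftover rows pass through, i.e. w_{j+1} = 2 * (w_j // 3) + (w_j % 3).
    """
    heights = [n]
    while heights[-1] > 2:
        w = heights[-1]
        heights.append(2 * (w // 3) + w % 3)
    return heights


if __name__ == "__main__":
    for n in (8, 32):
        print(n, "Dadda targets (final to first):", dadda_heights(n))
        print(n, "Wallace heights (first to final):", wallace_heights(n))
```

Under these assumptions an 8x8-bit multiplier needs four reduction stages in both schemes (Dadda targets 6, 4, 3, 2; Wallace heights 8, 6, 4, 3, 2), and a 32x32-bit multiplier needs eight in both. The schemes differ in how many counters each stage uses, not in the number of stages.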
In the reduction stage, the recursive steps described in [10] were used to determine the heights of each stage and the number of additions required to achieve each stage height as described below. Two rules for each reduction stage define the only conditions where (2,2) counters are used for reduction:
1. To reduce the number of bits in a column to the number of bits specified in the Dadda algorithm for a particular reduction stage.
2. To reduce (exactly) two bits in the rightmost column (least significant) of each stage.

The dot diagram for the 8x8-bit Reduced Area (RA) multiplier depicts the structure of this design (see Figure 4).

Figure 4 – Dot Diagram of an 8x8-bit RA Multiplier

Process Flow

This study encompasses a front-end custom IC design flow (see Figure 5). Each design was implemented in Verilog HDL and functionally verified in Mentor Graphics ModelSim, then synthesized with the LeonardoSpectrum synthesis tool. The complete HDL methodology is described in [1]. Each tree design utilizes an AND array for PP generation and a ripple carry adder for final summation, as in their dot diagram-based designs (no enhancements).

Figure 5 – Front-end Custom IC Design Flow

Results and Discussion

Area Comparison

Generally, logic gate consumption grows steadily with operand size for all designs, roughly quadrupling each time the operand width doubles (see Table 1). Among the various modes, the Area-Optimized setting provides best-case area savings for all multiplier designs in all operand sizes.

For best-case Area-Optimized results, the Array and Dadda designs outperform the other two schemes. Further observation shows that for designs smaller than 32-bit, the Array scheme has an area advantage over the rest. Meanwhile, the Dadda and Reduced Area (RA) schemes may yield more area savings for operand sizes beyond 32-bit in this optimization mode. However, as operand sizes increase, savings in area become less evident: gate savings approach 5% between the 32-bit Dadda (lowest gate count) and Wallace designs.

As for the Speed-Optimized mode, logic gate consumption increased significantly compared to the previous case across all multiplier designs: from a 50% increase for the 32-bit Wallace design up to a 75% increase for the Dadda scheme, relative to the Area-Optimized data. Unlike in the previous optimization mode, the Dadda scheme shows the worst area performance after Speed-Optimized synthesis. In contrast, the Wallace design shows the most area savings in Speed-Optimized mode: about 7 to 10% fewer gates than the Dadda design.

Table 1 – Gate Counts for Various Modes

Area Optimized
Size     Array    Wallace  Dadda    RA
4x4      75       81       78       82
8x8      401      432      405      413
16x16    1,744    1,905    1,782    1,802
32x32    7,231    7,598    7,114    7,198

Speed Optimized
Size     Array    Wallace  Dadda    RA
4x4      113      118      115      114
8x8      628      609      676      633
16x16    2,855    2,732    3,014    2,780
32x32    11,961   11,661   12,503   11,808

Auto Optimized
Size     Array    Wallace  Dadda    RA
4x4      74       76       75       81
8x8      391      419      397      400
16x16    1,738    1,840    1,738    1,739
32x32    7,193    7,692    7,241    7,289
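The percentages quoted in the area comparison can be rechecked directly from Table 1. The following Python sketch is only an illustration: the gate counts are copied from the table, while the dictionary layout and function names are hypothetical choices made for this example.

```python
# Gate counts from Table 1, ordered by operand size 4x4, 8x8, 16x16, 32x32.
SIZES = [4, 8, 16, 32]
GATES = {
    "Area": {
        "Array": [75, 401, 1744, 7231],
        "Wallace": [81, 432, 1905, 7598],
        "Dadda": [78, 405, 1782, 7114],
        "RA": [82, 413, 1802, 7198],
    },
    "Speed": {
        "Array": [113, 628, 2855, 11961],
        "Wallace": [118, 609, 2732, 11661],
        "Dadda": [115, 676, 3014, 12503],
        "RA": [114, 633, 2780, 11808],
    },
}


def pct_increase(new, old):
    """Percentage increase of `new` relative to `old`."""
    return 100.0 * (new - old) / old


def speed_vs_area_increase(design, size):
    """Extra gates (in %) of Speed-Optimized over Area-Optimized synthesis."""
    i = SIZES.index(size)
    return pct_increase(GATES["Speed"][design][i], GATES["Area"][design][i])


def wallace_saving_vs_dadda(size):
    """Gate saving (in %) of Wallace relative to Dadda, Speed-Optimized."""
    i = SIZES.index(size)
    dadda = GATES["Speed"]["Dadda"][i]
    wallace = GATES["Speed"]["Wallace"][i]
    return 100.0 * (dadda - wallace) / dadda


if __name__ == "__main__":
    for design in ("Wallace", "Dadda"):
        print(design, round(speed_vs_area_increase(design, 32), 1))
    for size in (8, 16, 32):
        print(size, round(wallace_saving_vs_dadda(size), 1))
```

For the 32x32-bit designs this gives an increase of roughly 53% for Wallace and 76% for Dadda, and Wallace savings over Dadda between about 7% and 10% for the 8-bit to 32-bit sizes, in line with the figures quoted in the text.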
Next, the values obtained for Auto-Optimized synthesis show close resemblance to the Area-Optimized mode. The Auto-Optimized mode would be expected to spend the least time on synthesis effort, which in turn produces a result mostly similar in structure to the original HDL design. All designs fare comparatively similar to one another except the Wallace design. The simplistic conclusion drawn from this finding was that a non-optimizing synthesis mode would lean towards an area-efficient output design.

Delay Comparison

Overall, an almost linearly increasing delay was expected for increasing Array multiplier size while the other three tree designs show logarithmic delay increments (see Table 2). Best-case delay was expected for Speed-Optimized synthesis, which showed that the Wallace design outperformed all other designs, in all operand sizes. The Dadda and RA designs performed similarly to each other, while being up to 1 ns slower than the Wallace scheme. As expected, delay for the Array design was far behind all tree multipliers due to its linear reduction stage.

Area-Optimized synthesis incurs a delay penalty for all designs tested. The main point of interest here is the sensitivity of the different tree designs to changes in optimization modes: the Wallace design suffers a delay increase of up to 4 ns, compared with up to 2.8 ns for the Dadda and RA designs.

Table 2 – Delay Data for Various Modes (ns)

Area Optimized
Size     Array    Wallace  Dadda    RA
4x4      3.26     2.77     2.64     2.77
8x8      9.17     6.08     5.61     6.14
16x16    20.64    11.60    10.97    12.03
32x32    44.33    22.77    22.06    22.63

Speed Optimized
Size     Array    Wallace  Dadda    RA
4x4      3.10     2.40     2.54     2.59
8x8      8.36     5.19     5.44     5.36
16x16    19.14    9.82     10.47    10.54
32x32    40.88    18.69    19.63    19.74

Auto Optimized
Size     Array    Wallace  Dadda    RA
4x4      3.30     3.30     2.87     2.91
8x8      9.54     5.86     6.15     6.31
16x16    22.55    11.67    11.50    11.80
32x32    48.56    23.19    22.63    22.31

Once again, Auto-Optimized synthesis shows similar performance to the Area-Optimized mode. Differences in delay timings appear to be around 0.5 ns from the Area to Auto-Optimized synthesis. This further reinforces the notion that a 'least-effort' synthesis would produce a result similar to an Area-Optimized result.

Comparison of 32x32-bit Tree Designs

For further analysis on high performance multiplier designs, the 32x32-bit versions of the three tree models were analyzed in detail. The delay characteristic was used as the focal point for this part of the study, along with references to previous studies done on the Wallace and Dadda designs. Trade-offs between delay and area are considered as the designs are compared for each optimization mode.

In terms of speed for the Area and Auto-Optimization modes, Dadda's algorithm did score better than Wallace (see Figures 6 and 7). This corroborates initial studies which suggest that the Dadda architecture minimizes area while having a speed advantage over its Wallace equivalent.

Figure 6 – Area-Optimized Delay Comparison

Following that, an interesting finding is the delay of the RA design for Auto-Optimized synthesis (see Figure 7), which outperforms both Wallace and Dadda designs. This speed gain can likely be attributed to its maximal use of adders in each reduction stage, which indicates that it is not necessary to employ non-trivial Wallace and Dadda schemes to obtain best speed performance.

Figure 7 – Auto-Optimized Delay Comparison

The RA design's speed advantage (about 4% against Wallace) for the Auto-Optimized mode makes it ideal for low logic resource applications. In addition to that, the internal structure of the RA, which makes it more suitable for pipelining [10], may be yet another justification to select it over the other two designs for applications that minimize area.

In the case of delay performance, a mathematical gate delay comparison of Wallace and Dadda multipliers in [3] shows that Dadda's scheme has a speed advantage over Wallace's for all operand sizes, with an average speed gain of 10%. Additionally, the other study on these same designs in CMOS technology [5] showed that Wallace and Dadda delays were about the same, differing by only about 0.1 ns.

Even so, results in those previous studies were not reflected in the findings obtained within this study, where for the best-case Speed-Optimized mode, Wallace's design outperforms its Dadda counterpart by about 5% (see Figure 8).

Figure 8 – Speed-Optimized Delay Comparison

This could be best explained by the fact that the design in [3] was analyzed based on an ideal model, while the design in [5] was synthesized according to only a single optimization mode. As for this study, the various synthesis optimization modes established different logic structures according to the selected cost function (area/delay), while still embodying their respective architectures (each HDL design was implemented in a gate-level structural form). Thus, Speed Optimization was found to more effectively optimize the Wallace tree structure for minimal delay compared to the Dadda scheme.

To probe even further, a quantitative comparison was made between the area and delay parameters (see Table 3). The AD and AD^2 values tabulated serve to reinforce the findings presented in the previous section. The design parameters could assist in determining the right multiplier design selection for a particular high speed or limited area application.

As described previously, each synthesis optimization mode configures relative block locations differently. Therefore, due to the different structures of every design, each multiplier can be associated with a particular application based on its optimal performance among synthesis modes.

Table 3 – A-D Comparison (A: gate count, D: delay in ns, 32x32-bit designs)

Area Optimized
Metric   Array       Wallace     Dadda       RA
A        7,231       7,598       7,114       7,198
D        44.33       22.77       22.06       22.63
AD       3.206E+05   1.730E+05   1.569E+05   1.629E+05
AD^2     1.421E+07   3.939E+06   3.462E+06   3.686E+06

Speed Optimized
Metric   Array       Wallace     Dadda       RA
A        11,961      11,661      12,503      11,808
D        40.88       18.69       19.63       19.74
AD       4.890E+05   2.179E+05   2.454E+05   2.331E+05
AD^2     1.999E+07   4.073E+06   4.818E+06   4.601E+06

Auto Optimized
Metric   Array       Wallace     Dadda       RA
A        7,193       7,692       7,241       7,289
D        48.56       23.19       22.63       22.31
AD       3.493E+05   1.784E+05   1.639E+05   1.626E+05
AD^2     1.696E+07   4.137E+06   3.708E+06   3.628E+06

Conclusion

Findings corroborate the basic features of the Array and tree designs based on linear and logarithmic delay increase with operand size respectively. Different synthesis modes show that the Wallace and Dadda designs do not always behave as they were designed to, but largely rely on how the gate-level synthesis was performed. In conjunction with that, best-case performances for each design in different optimization modes were also analyzed for their 32x32-bit variants.

From what can be observed, the Wallace design would be the most suitable for speed critical applications, where area is not a priority. Next, the Dadda design generally suits smaller scaled applications, in which it can significantly outperform the other two designs in terms of speed. On the other hand, the Reduced Area design performed somewhere in between the Wallace and Dadda algorithms for Area and Speed-Optimizations. However, it showed very promising delay values in Auto-Optimized synthesis, making it feasible for applications that minimize area.

This study relies mostly on gate counts and gate delays to gauge performance; no wiring effects were taken into consideration. In addition to that, with continuously decreasing process granularity, prediction models employed by current synthesis tools struggle to provide accurate estimates of real parasitics. This lack of accuracy for performance characteristics seems to be more prevalent in technologies beyond the 0.25-micron CMOS boundary [6].

On that account, future analysis on tree multiplier designs could be based on more objective methods as described in [11] as well as on the array design, which appears to be more feasible for large operand multiplications.
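The AD and AD^2 figures of Table 3 follow directly from the 32x32-bit gate counts and delays, so they can be recomputed as a cross-check. The Python sketch below is a minimal illustration: the numbers are copied from Table 3, while the data layout and the function names `ad_metrics` and `best_design` are hypothetical.

```python
# 32x32-bit gate counts (A) and delays in ns (D), copied from Table 3.
TABLE3 = {
    "Area": {"Array": (7231, 44.33), "Wallace": (7598, 22.77),
             "Dadda": (7114, 22.06), "RA": (7198, 22.63)},
    "Speed": {"Array": (11961, 40.88), "Wallace": (11661, 18.69),
              "Dadda": (12503, 19.63), "RA": (11808, 19.74)},
    "Auto": {"Array": (7193, 48.56), "Wallace": (7692, 23.19),
             "Dadda": (7241, 22.63), "RA": (7289, 22.31)},
}


def ad_metrics(area, delay):
    """Return the area-delay product AD and the AD^2 product."""
    return area * delay, area * delay ** 2


def best_design(mode, exponent=1):
    """Design with the lowest A * D**exponent for one optimization mode."""
    designs = TABLE3[mode]
    return min(designs,
               key=lambda name: designs[name][0] * designs[name][1] ** exponent)


if __name__ == "__main__":
    for mode, designs in TABLE3.items():
        for name, (area, delay) in designs.items():
            ad, ad2 = ad_metrics(area, delay)
            print(f"{mode:5s} {name:8s} AD={ad:.3E} AD^2={ad2:.3E}")
        print(mode, "lowest AD:", best_design(mode),
              "lowest AD^2:", best_design(mode, exponent=2))
```

With these inputs the lowest AD and AD^2 values belong to Dadda in Area-Optimized mode, Wallace in Speed-Optimized mode and RA in Auto-Optimized mode, which matches the design recommendations given in the conclusion.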

References
[1] C.Y.H. Lee. “A Performance Comparison Study on
Multiplier Designs,” B.Eng. Project Report, Universiti
Teknologi PETRONAS, 2009.
[2] A. Habibi and P.A. Wintz. “Fast Multipliers,” IEEE
Trans. on Computers, vol. 19, pp. 153-157, 1970.
[3] W.J. Townsend, E.E. Swartzlander, Jr. and J.A. Abraham.
“A Comparison of Dadda and Wallace Multiplier
Delays,” in SPIE Adv. Signal Proc. Algorithms,
Architectures and Implementations XIII, pp. 552-560,
2003.
[4] S. Shah, A.J. Al-Khalili and D. Al-Khalili. “Comparison
of 32-bit Multipliers for Various Performance
Measures,” in The 12th International Conference on
Microelectronics, pp. 75-80, 2000.
[5] K.C. Bickerstaff and E.E. Swartzlander, Jr. “Analysis of
Column Compression Multipliers,” in 15th IEEE Symp.
on Computer Arithmetic, pp. 33-39, 2001.
[6] J.M. Rabaey, A. Chandrakasan and B. Nikolic. “Design
Synthesis,” in Digital Integrated Circuits, 2nd ed., New
Jersey: Pearson Education Inc., 2003, pp. 397, 435-439.
[7] Mi Lu. “Modular Structure of Large Multiplier,” in
Arithmetic and Logic in Computer Systems, 1st ed, New
Jersey: John Wiley & Sons, Inc., 2004, pp. 120-122.
[8] C.S. Wallace. “A Suggestion for a Fast Multiplier,” IEEE
Trans. on Electronic Computers, vol. EC-13, pp. 14-17,
1964.
[9] L. Dadda. “Some Schemes for Parallel Multipliers,” Alta
Frequenza, vol. 34, pp. 349-356, 1965.
[10] K.A.C. Bickerstaff, M. Schulte, and E.E. Swartzlander,
Jr., “Reduced Area Multipliers,” Intl. Conf. on
Application-Specific Array Processors, pp. 478-489,
1993.
[11] P.C.H. Meier, R.A. Rutenbar and L.R. Carley,
“Exploring Multiplier Architecture and Layout for Low
Power,” in IEEE Custom Integrated Circuits Conf., pp.
513-516, 1996.
