Approximate Wallace Tree Multiplier
Abstract: In today's sub-nanometer regime, chip/system designers add accuracy as a new constraint when optimizing Latency-Power-Area (LPA) metrics. In this paper, we present a new power- and area-efficient Approximate Wallace Tree Multiplier (AWTM) for error-tolerant applications. We propose a bit-width aware approximate multiplication algorithm for the optimal design of our multiplier. We employ a carry-in prediction method to reduce the critical path, further augmented with hardware-efficient precomputation of the carry-in. We also optimize our multiplier design for latency, power and area using Wallace trees. Accuracy as well as LPA design metrics are used to evaluate our approximate multiplier designs of different bit-widths, i.e. 4 × 4, 8 × 8 and 16 × 16. The simulation results show that we obtain a mean accuracy of 99.85% to 99.965%. A single-cycle implementation of AWTM gives almost 24% reduction in latency. We achieve a significant reduction in power and area, i.e. up to 41.96% and 34.49% respectively, which clearly demonstrates the merits of our proposed AWTM design. Finally, AWTM is used to perform a real-time application on a benchmark image. We obtain up to 39% reduction in power and 30% reduction in area without any loss in image quality.

Index Terms: Approximate multiplier; Bit-width aware multiplication algorithm; Wallace tree; Error-resilient systems

978-1-4799-3946-6/14/$31.00 ©2014 IEEE 263 15th Int'l Symposium on Quality Electronic Design

I. INTRODUCTION

The International Technology Roadmap for Semiconductors (ITRS) [1] has anticipated imprecise/approximate designs, which have become a state-of-the-art demand for the emerging class of killer applications that manifest inherent error resilience, such as multimedia, graphics, and wireless communications. In error-resilient systems, adders and multipliers are used as basic building blocks, and their approximate designs have attracted significant research interest recently. Conventional approaches investigated several mechanisms, such as truncation [2], over-clocking, and voltage over-scaling (VOS) [3], which could not configure accuracy and the Latency-Power-Area (LPA) design metrics effectively. Most of the other design techniques rely on functional approximations, and a wide spectrum of approximate adders such as [4], [5], [6] and [7] have been proposed in the past. However, very few research papers on approximate multipliers are reported in the literature.

Most reported approximate multiplier designs shorten the carry chains so that the error is configurable, but the algorithms they employ target small numbers and give a large magnitude of error as the bit-width of the operands increases. In this paper, we present a new Approximate Wallace Tree Multiplier (AWTM) based on a bit-width aware algorithm. We design it specifically to give good results for large operands. Besides accuracy, the AWTM is also optimized for power and area. For a single-cycle implementation, AWTM gives a significant reduction in latency as well. Our contributions are:
∙ We propose a new power- and area-efficient AWTM based on a bit-width aware multiplication algorithm.
∙ We employ a novel Carry-in Prediction technique which significantly reduces the critical path of our multiplier. We further derive an efficient carry-in precomputation logic to accelerate the carry propagation.
∙ We obtain a very high mean accuracy of 99.965% (mean error of only 0.035%) when the operands are 10 bits or more. However, if there is no lower bound on the size of the operands, the mean accuracy varies from 99.85% to 99.9% (a very small mean error of 0.1% to 0.15%).
∙ We achieve a significant reduction in power and area, i.e. up to 41.96% and 34.49% respectively, for the 16-bit accuracy-configurable AWTM design. For a single-cycle implementation of the 16 × 16 AWTM, we also reduce the latency by around 24%.
∙ Our proposed AWTM, when used for a real-time application on an image, achieved up to 39% reduction in power and up to 30% reduction in area with negligible loss in image quality.

The rest of the paper is organized as follows. In Section II we discuss background and related work reported in the literature. We describe some preliminaries in Section III. An approximate multiplier architecture is explained in Section IV. We propose a bit-width aware approximate multiplication algorithm in Section V. We present the AWTM design based on the proposed methodology and its optimization w.r.t. LPA design metrics in Section VI. The experimental results are given in Section VII. Finally, we conclude the paper in Section VIII.

II. BACKGROUND AND RELATED WORK

Research on approximate arithmetic circuits reported in the literature is mainly on approximate adders. It is worthwhile to study these approximate adders in order to make research contributions on approximate multipliers. Lu [8] proposed a
$k$-bit carry look-ahead adder in which only the previous $k$ bits are considered to estimate the current carry signal. Lu's adder exhibits […] to exploit a given error rate to improve parametric yield. Zhu et al. [4] present an error-tolerant adder, ETA-I. ETA-I divides its inputs into: 1) an accurate part, and 2) an inaccurate part. In the latter, no carry signal is considered at any bit position. Gupta et al. [10] target low power and propose five different versions of the mirror adder by reducing the number of transistors and the internal node capacitance. Verma et al. [6] presented a Variable Latency Speculative Adder (VLSA) which provides approximate/accurate results but incurs considerable delay and a large area overhead. Kahng et al. [7] proposed an accuracy-configurable adder with reduced critical path and error rate.

In contrast with the above work, very few researchers have reported work on approximate multipliers. Sullivan et al. [11] used Truncated Error Correction (TEC) to investigate an iterative approximate multiplier in which some amount of error-correcting circuitry is added for each iteration. This circuitry replicates the effects of multiple pipeline iterations for the most problematic inputs quite inexpensively. Kulkarni et al. [12] proposed a 2 × 2 underdesigned multiplier block and built arbitrarily large power-aware inaccurate multipliers. Kyaw et al. [13] presented an Error Tolerant Multiplication (ETM) algorithm in which the input operands are split into two parts: a multiplication part consisting of the higher-order bits and a non-multiplication part with the remaining lower-order bits. The multiplication begins at the point where the bits split and moves simultaneously in the two opposite directions until all bits are taken care of. The ETM exhibited a significant reduction in delay, power and hardware cost for specific input combinations. Next, we explain the preliminary concepts needed to understand the proposed approximate multiplier.

III. PRELIMINARIES

We make use of a simple recursive multiplication for our approximate multiplier design and use various accuracy design metrics [7], [13] for its evaluation. The recursive multiplication and the accuracy design metrics are described in the following subsections.

A. Recursive Multiplication

A given multiplication can be recursively broken down into several smaller-size multiplications, each of which can be performed in the same clock cycle. Let $A$ be the multiplicand and $X$ be the multiplier, both of $2b$ bits. $A$ and $X$ can also be written as $A = A_H A_L$ and $X = X_H X_L$, where $A_H$, $A_L$, $X_H$, and $X_L$ are of $b$ bits each.

The multiplication $A \times X$, which is $2b \times 2b$, can be recursively carried out as shown in Fig. 1(a). In this multiplication, $A_H X_L$, $A_H X_H$, $A_L X_L$, and $A_L X_H$ are partial products, each of which is a $b \times b$ multiplication. Hence, a $2b \times 2b$ multiplication is divided into four $b \times b$ multiplications followed by additions. Fig. 1(b) is derived from Fig. 1(a) for approximate multiplication.

Fig. 1. (a) Recursive Multiplication (b) Approximate Multiplication
[Figure labels: in (a) the $2b$-bit partial products (e.g. $A_H X_L$, $A_L X_H$) are accurate; in (b) they are accurate to a large extent; the final product is $4b$ bits in both panels.]

B. Accuracy Design Metrics

The accuracy design metrics are defined as follows:
1) Relative Error: Relative Error can be calculated as $(|R_c - R_e| / R_c) \times 100\%$ for $R_c \neq 0$. Here, $R_c$ is the correct result and $R_e$ is the approximate result. We denote accuracy as $ACC_{amp}$, where $ACC_{amp} = 1 - \text{Relative Error}$.
2) Mean Error: Mean error is the average of the relative errors of all the combinations tested in an algorithm.
3) Minimum Acceptable Accuracy (MAA): Minimum Acceptable Accuracy is the minimum level of accuracy that an application can tolerate.
4) Acceptance Probability ($AP$): It is the probability that the accuracy of the approximate arithmetic circuit is higher than the minimum acceptable accuracy. Its value is given by $AP = P(ACC_{amp} > MAA)$.

In the next section, we discuss an approximate multiplier architecture to explain the proposed bit-width aware algorithm.

IV. APPROXIMATE MULTIPLIER ARCHITECTURE

For the multiplier to exhibit high accuracy, the most significant bits (MSBs) of the final $4b$-bit product ($A \times X$) should be accurate to a high extent. Therefore, we make the multiplier $A_H X_H$ a $b \times b$ accurate multiplier and $A_H X_L$, $A_L X_H$, $A_L X_L$ $b \times b$ approximate multipliers. As we shall see, the $b \times b$ approximate multipliers generate their upper $b$ bits accurately to a high extent, which in turn makes the upper $2b$ bits of the final $4b$-bit product achieve high accuracy. This is illustrated in Fig. 1(b).

We have explained the design methodology of these approximate $b \times b$ multipliers in the Carry-in Prediction Logic [14]. We briefly explain this novel technique in the following subsection with the help of an example.

A. The Carry-in Prediction: An Example

Consider the unsigned multiplication of two 16-bit numbers (i.e. $b = 8$):

$A = (\mathrm{AEDB})_{16} = (44763)_{10}$

$X = (\mathrm{B6E7})_{16} = (46823)_{10}$
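The recursive decomposition of Section III-A can be checked on these two operands with a few lines of Python. This is only a sketch of the exact (non-approximate) scheme of Fig. 1(a): the function names `split` and `recursive_multiply` are illustrative and do not come from the paper, and all four partial products are computed exactly, so no carry-in prediction is modelled here.

```python
def split(value, b):
    """Split a 2b-bit value into its upper and lower b-bit halves."""
    mask = (1 << b) - 1
    return (value >> b) & mask, value & mask


def recursive_multiply(a, x, b):
    """Exact 2b x 2b product built from four b x b partial products."""
    a_h, a_l = split(a, b)
    x_h, x_l = split(x, b)

    # The four b x b partial products of Fig. 1(a).
    hh = a_h * x_h
    hl = a_h * x_l
    lh = a_l * x_h
    ll = a_l * x_l

    # Shift each partial product to its weight and add.
    return (hh << (2 * b)) + ((hl + lh) << b) + ll


A = 0xAEDB  # 44763
X = 0xB6E7  # 46823

a_h, a_l = split(A, 8)
x_h, x_l = split(X, 8)
print(f"A_H={a_h:02X} A_L={a_l:02X} X_H={x_h:02X} X_L={x_l:02X}")
print("matches exact product:", recursive_multiply(A, X, 8) == A * X)
```

In the approximate architecture of Section IV, only the `hh` term would stay exact; the other three products would be replaced by approximate $b \times b$ multipliers.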
[Figure residue: $b$-bit operands 11011011 × 11100111 with the critical column marked; panels labeled EF = 00 and EF = 01.]
Table II (mean error and AP per accuracy mode)

Mode   Parameter    For Operands > 1 (in %)   For Operands > 1000 (in %)
1      Mean Error   5.26                      4.59
1      AP           27.34                     28.18
2      Mean Error   3.42                      3.16
2      AP           46.18                     46.04
3      Mean Error   0.46                      0.29
3      AP           91.58                     94.28
4      Mean Error   0.13                      0.035
4      AP           98.44                     99.72
Fig. 5. Wallace Tree for approximate 8 × 8 partial product evaluation
[Figure label: the final 6-bit full adder takes the carry-out of the last stage, not 0, as the carry-in at its LSB.]

[…] vertical critical path as well (of $A_H X_L$, $A_L X_H$ and $A_L X_L$), we use Wallace Tree Reduction [15]. Wallace trees are fast and hardware efficient for multiplication of more than 16 bits. The Wallace tree height also grows as $\log_{3/2}(N/2)$. We call this Wallace tree based design the Approximate Wallace Tree Multiplier (AWTM). An accurate 8 × 8 Wallace multiplication takes a total of 4 stages of reduction (each of which has a delay of 1 full adder) and then uses an 11-bit full adder to compute the final product. We use these 8 × 8 partial products to evaluate a 16 × 16 multiplication.

Fig. 5 shows that the approximate partial product multipliers ($A_H X_L$, $A_L X_H$ and $A_L X_L$), which are 8 × 8, take a total of 3 stages of reduction and further use a 6-bit full adder for the final product evaluation. When compared in terms of critical paths (of stage 1 of the pipelined 16 × 16 multiplier), an accurate 8 × 8 multiplier has a delay of 15 full adders (4 stages and an 11-bit full adder) and its approximate counterpart has a delay of 9 full adders (3 stages and a 6-bit full adder). Theoretically, this leads to an improvement of 40% in the latency of stage 1.

Furthermore, for each of $A_H X_L$, $A_L X_H$ and $A_L X_L$, the number of adders is reduced by around 51.24% (48 full adders and 25 half adders are required for their accurate evaluation, in contrast with 23 full adders and 13 half adders for approximate evaluation). Hence, for the complete pipelined multiplier, the total power reduction is expected to be around 38.42% (because $A_H X_H$ is accurate and the other 3 are inaccurate, i.e. three-fourths of 51.24%). We validate these theoretical estimates by running actual simulations. The experimental results and their analysis are discussed next.

VII. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we present power and area results obtained experimentally. All results have been produced using Cadence RTL Compiler for the 45 nm Nangate Opencell Library. We also generate results on a real-time application by computing the Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (iDCT) of a benchmark image.

A. Accuracy and Acceptance Probability

We simulate our 16 × 16 bit-width aware multiplier using a C program by generating 5000 random numbers to compute accurate and approximate products for all possible combinations for different accuracy modes. For each case, $MAA$ is set to 99%. The program generates results such as $ACC_{amp}$, mean error and Acceptance Probability. The mean error and Acceptance Probability (AP) results are tabulated in Table II for the various modes. Note that for unbounded operand size (operands > 1 in Table II), the bit-width aware algorithm is employed. When the operand size is constrained to be 10 bits or more (operands > 1000), it is found that a simple 16 × 16 approximate multiplier without bit-width awareness produces almost the same results.

Table II shows very high accuracy levels. An Acceptance Probability of more than 98% for a minimum acceptable accuracy of 99% signifies that, for all possible combinations of the random numbers generated, more than 98% of the cases give an accuracy greater than 99%. Also, as explained earlier, for larger numbers (operands > 1000), the accuracy level rises to as high as 99.965%. A 16 × 16 multiplier proposed in [12] generates a mean error of 3.32%. Clearly, our multiplier performs better than this for modes 3 and 4. The error is comparable for mode 2.

We further investigate the relationship between $MAA$ and acceptance probability. Fig. 6 shows a plot of $AP$ vs. $MAA$ for the various modes and for ETM (proposed in [13]). It is evident that our multiplier (when used in modes 3 and 4) easily outperforms ETM as far as accuracy is concerned. The accuracy results for mode 4 of the 16 × 16 AWTM were confirmed by applying 10,000 random test vectors to the RTL netlist. Note that, for the sake of simplicity, we do not make the hardware implementation (HDL code) of AWTM bit-width aware. For such a design, we obtained a mean error of 0.16% and an $AP$ of 98.56% for an $MAA$ of 99%, quite in agreement with Table II. Hence, mode 4 of the 16 × 16 multiplier gives almost the same results (for all operands) with or without bit-width awareness. Next, we present Power and Area results.
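The accuracy evaluation described in this subsection can be sketched in Python using the metric definitions of Section III-B. Several things here are assumptions rather than the authors' setup: the original simulation is a C program that is not shown, `truncated_multiply` is a hypothetical stand-in that simply clears low product bits (it is not the AWTM or its carry-in prediction), and 5000 random operand pairs are drawn instead of all combinations of 5000 random numbers. The printed values therefore say nothing about Table II; only the metric computation is illustrated.

```python
import random


def relative_error(correct, approx):
    """Relative error in percent; only defined for a nonzero correct result."""
    return abs(correct - approx) / correct * 100.0


def truncated_multiply(a, x, dropped_bits=8):
    """Hypothetical stand-in approximate multiplier (NOT the AWTM):
    computes the product and clears its lowest `dropped_bits` bits."""
    return ((a * x) >> dropped_bits) << dropped_bits


def evaluate(approx_mul, pairs, maa=99.0):
    """Return (mean error in %, acceptance probability in %) over operand pairs."""
    errors = [relative_error(a * x, approx_mul(a, x)) for a, x in pairs]
    mean_error = sum(errors) / len(errors)
    # Accuracy is 100% minus the relative error; AP counts cases above MAA.
    accepted = sum(1 for e in errors if 100.0 - e > maa)
    return mean_error, 100.0 * accepted / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Operands > 1, as in the first column of Table II, so products are nonzero.
pairs = [(rng.randint(2, 0xFFFF), rng.randint(2, 0xFFFF)) for _ in range(5000)]

mean_error, ap = evaluate(truncated_multiply, pairs, maa=99.0)
print(f"mean error = {mean_error:.4f}%  AP(MAA=99%) = {ap:.2f}%")
```

Swapping `truncated_multiply` for a model of the multiplier under test, and restricting the operand range (for example to values above 1000), would give the two operand settings reported in Table II.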
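[Fig. 6: Acceptance Probability vs. Minimum Acceptable Accuracy (in %) for AWTM Mode 1, AWTM Mode 2, AWTM Mode 3, AWTM Mode 4 and ETM.]

[Bar chart residue: comparison of the AWTM, ETM, Kulkarni and Truncation multipliers for 4 bit and 8 bit designs; horizontal axis "Multipliers".]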
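The theoretical latency and adder-count estimates quoted for the 8 × 8 partial product multipliers can be reproduced with simple arithmetic. The sketch below rests on two assumptions that the text does not state explicitly: the stage count for the accurate tree follows the usual 3:2 row reduction, and a half adder is weighted as half a full adder, which is the weighting that reproduces the quoted 51.24%. The names `wallace_stages` and `adder_cost` are illustrative only.

```python
def wallace_stages(rows):
    """Number of 3:2 reduction stages needed to bring `rows` operands down to two."""
    stages = 0
    while rows > 2:
        # Each full group of three rows becomes two rows; leftovers pass through.
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages


def adder_cost(full_adders, half_adders, half_weight=0.5):
    """Cost in full-adder equivalents (assumed: one half adder = 0.5 full adder)."""
    return full_adders + half_weight * half_adders


# Critical path in full-adder delays: reduction stages plus final adder width.
accurate_delay = wallace_stages(8) + 11  # 4 stages + 11-bit full adder
approx_delay = 3 + 6                     # 3 stages + 6-bit full adder (Fig. 5)
latency_gain = 100.0 * (accurate_delay - approx_delay) / accurate_delay

# Adder counts quoted for one 8 x 8 partial product multiplier.
accurate_cost = adder_cost(48, 25)
approx_cost = adder_cost(23, 13)
adder_reduction = 100.0 * (1.0 - approx_cost / accurate_cost)

# Three of the four partial product multipliers are approximate.
power_estimate = 0.75 * adder_reduction

print(f"stage-1 latency gain : {latency_gain:.2f}%")
print(f"adder reduction      : {adder_reduction:.2f}%")
print(f"power estimate       : {power_estimate:.2f}%")
```

Under these assumptions the script gives a 40% latency gain and a 51.24% adder reduction; the power estimate comes out at about 38.43%, which agrees with the quoted "around 38.42%" up to rounding.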