Power- and Area-Efficient Approximate Wallace Tree Multiplier for Error-Resilient Systems

Kartikeya Bhardwaj
Electrical & Electronics Engg., BITS Pilani-Goa Campus, Goa – 403 726, India
Email: [email protected]

Pravin S. Mane
Electrical & Electronics Engg., BITS Pilani-Goa Campus, Goa – 403 726, India
Email: [email protected]

Jörg Henkel
Department of Computer Science, Karlsruhe Institute of Technology, Karlsruhe – 76131, Germany
Email: [email protected]

15th Int'l Symposium on Quality Electronic Design. 978-1-4799-3946-6/14/$31.00 ©2014 IEEE

Abstract: Today in sub-nanometer regime, chip/system designers add accuracy as a new constraint to optimize Latency-Power-Area (LPA) metrics. In this paper, we present a new power- and area-efficient Approximate Wallace Tree Multiplier (AWTM) for error-tolerant applications. We propose a bit-width aware approximate multiplication algorithm for optimal design of our multiplier. We employ a carry-in prediction method to reduce the critical path. It is further augmented with hardware-efficient precomputation of the carry-in. We also optimize our multiplier design for latency, power and area using Wallace trees. Accuracy as well as LPA design metrics are used to evaluate our approximate multiplier designs of different bit-widths, i.e. 4 × 4, 8 × 8 and 16 × 16. The simulation results show that we obtain a mean accuracy of 99.85% to 99.965%. Single cycle implementation of AWTM gives almost 24% reduction in latency. We achieve a significant reduction in power and area, i.e. up to 41.96% and 34.49% respectively, which clearly demonstrates the merits of our proposed AWTM design. Finally, AWTM is used to perform a real time application on a benchmark image. We obtain up to 39% reduction in power and 30% reduction in area without any loss in image quality.

Index Terms: Approximate multiplier; Bit-width aware multiplication algorithm; Wallace tree; Error-resilient systems

I. INTRODUCTION

The International Technology Roadmap for Semiconductors (ITRS) [1] has anticipated imprecise/approximate designs that became a state-of-the-art demand for the emerging class of killer applications that manifest inherent error-resilience such as multimedia, graphics, and wireless communications. In error-resilient systems, adders and multipliers are used as basic building blocks and their approximate designs have attracted significant research interest recently. Conventional wisdom investigated several mechanisms such as truncation [2], over-clocking, and voltage over-scaling (VOS) [3] which could not configure accuracy as well as Latency-Power-Area (LPA) design metrics effectively. Most of the other design techniques rely on functional approximations and a wide spectrum of approximate adders like [4], [5], [6] and [7] have been proposed in the past. However, very few research papers are reported on approximate multipliers in the literature.

Most of the approximate multiplier designs reported shorten the carry-chains in which error is configurable and the algorithms employed in the designs are for smaller numbers and give large magnitude of error as the bit-width of operands increases. In this paper, we present a new Approximate Wallace Tree Multiplier (AWTM) based on a bit-width aware algorithm. We design it specifically to give good results for large operands. Besides accuracy, the AWTM is also optimized for power and area. For single cycle implementation, AWTM gives significant reduction in latency as well. Our contributions are:

∙ We propose a new power- and area-efficient AWTM based on a bit-width aware multiplication algorithm.
∙ We employ a novel Carry-in Prediction technique which significantly reduces the critical path of our multiplier. We further derive an efficient carry-in precomputation logic to accelerate the carry propagation.
∙ We obtain a very high mean accuracy of 99.965% (mean error of only 0.035%) when the size of the operands is 10 bits or more. However, if there is no lower bound on the size of operands, the mean accuracy varies from 99.85% to 99.9% (a very small mean error of 0.1% to 0.15%).
∙ We achieve a significant reduction in power and area, i.e. up to 41.96% and 34.49% respectively, for the 16-bit accuracy-configurable AWTM design. For single cycle implementation of 16 × 16 AWTM, we also reduce the latency by around 24%.
∙ Our proposed AWTM, when used for a real time application on an image, achieved up to 39% reduction in power and up to 30% reduction in area with negligible loss in image quality.

The rest of the paper is organized as follows. In Section II we discuss some background and related work reported in the literature. We describe some preliminaries in Section III. An approximate multiplier architecture is explained in Section IV. We propose a bit-width aware approximate multiplication algorithm in Section V. We present the AWTM design based on the proposed methodology and its optimization w.r.t. LPA design metrics in Section VI. The experimental results are given in Section VII. Finally, we conclude the paper in Section VIII.

II. BACKGROUND AND RELATED WORK

Research on approximate arithmetic circuits reported in the literature is mainly on approximate adders. It is worthwhile to study these approximate adders in order to make research contributions on approximate multipliers. Lu [8] proposed a
$k$-bit carry look-ahead adder in which only the previous $k$ bits are considered to estimate the current carry signal. Lu's adder exhibits a low probability of getting the correct sum and increases area overhead. Shin et al. [9] reduce data-path delay and re-design the data-path modules. Their technique cuts the critical path in the carry-chain to exploit a given error rate to improve parametric yield. Zhu et al. [4] present an error-tolerant adder, ETA-I. ETA-I divides inputs into: 1) an accurate part, and 2) an inaccurate part. In the latter, no carry signal is considered at any bit position. Gupta et al. [10] target low power and propose five different versions of the mirror adder by reducing the number of transistors and internal node capacitance. Verma et al. [6] presented a Variable Latency Speculative Adder (VLSA) which provides approximate/accurate results but gives considerable delay and large area overhead. Kahng et al. [7] proposed an accuracy-configurable adder with reduced critical path and error rate.

In contrast with the above work, very few researchers have reported work on approximate multipliers. Sullivan et al. [11] used Truncated Error Correction (TEC) to investigate an iterative approximate multiplier in which some amount of error correcting circuitry is added for each iteration. This circuitry replicates the effects of multiple pipeline iterations for the most problematic inputs quite inexpensively. Kulkarni et al. [12] proposed a 2 × 2 underdesigned multiplier block and built arbitrarily large power-aware inaccurate multipliers. Kyaw et al. [13] presented an Error Tolerant Multiplication (ETM) algorithm in which the input operands are split into two parts: a multiplication part consisting of the higher order bits and a non-multiplication part with the remaining lower order bits. The multiplication begins at the point where the bits split and moves simultaneously towards the two opposite directions till all bits are taken care of. The ETM exhibited a significant reduction in delay, power and hardware cost for specific input combinations. Next, we explain the preliminary concepts needed to understand the proposed approximate multiplier.

III. PRELIMINARIES

We make use of a simple recursive multiplication for our approximate multiplier design and use various accuracy design metrics [7], [13] for its evaluation. The recursive multiplication and the accuracy design metrics are described in the following subsections.

A. Recursive Multiplication

A given multiplication can be recursively broken down into several smaller-size multiplications, each of which can be performed in the same clock cycle. Let $A$ be the multiplicand and $X$ be the multiplier, both of $2b$ bits each. $A$ and $X$ can also be written as $A = A_H A_L$ and $X = X_H X_L$ where $A_H$, $A_L$, $X_H$, and $X_L$ are of $b$ bits each.

The multiplication $A \times X$, which is $2b \times 2b$, can be recursively carried out as shown in Fig. 1(a). In this multiplication, $A_H X_L$, $A_H X_H$, $A_L X_L$, and $A_L X_H$ are partial products, each of which is a $b \times b$ multiplication. Hence, a $2b \times 2b$ multiplication is divided into four $b \times b$ multiplications followed by additions. Fig. 1(b) is derived from Fig. 1(a) for approximate multiplication.

Fig. 1. (a) Recursive Multiplication (b) Approximate Multiplication
[Figure: the partial products $A_H X_H$, $A_H X_L$, $A_L X_H$ and $A_L X_L$ (2b bits each) are aligned and added into the 4b-bit final product. Panel labels: "Accurate Partial Product" and "Accurate to a Large Extent".]

B. Accuracy Design Metrics

The accuracy design metrics are defined as follows:

1) Relative Error: Relative Error can be calculated as $(|R_c - R_e| / R_c) \times 100\%$ for $R_c \neq 0$. Here, $R_c$ is the correct result and $R_e$ is the approximate result. We denote accuracy as $ACC_{amp}$, where $ACC_{amp} = 1 - \text{Relative Error}$ (with the relative error expressed as a fraction).
2) Mean Error: Mean error is the average of the relative errors of all the combinations tested in an algorithm.
3) Minimum Acceptable Accuracy (MAA): Minimum Acceptable Accuracy is the minimum level of accuracy that an application can tolerate.
4) Acceptance Probability ($AP$): It is the probability that the accuracy of the approximate arithmetic circuit is higher than the minimum acceptable accuracy. Its value is given by $AP = P(ACC_{amp} > MAA)$.

In the next section, we discuss an approximate multiplier architecture to explain the proposed bit-width aware algorithm.

IV. APPROXIMATE MULTIPLIER ARCHITECTURE

In order for the multiplier to exhibit high accuracy, the most significant bits (MSBs) of the final $4b$-bit product ($A \times X$) should be accurate to a high extent. Therefore, we make the multiplier $A_H X_H$ a $b \times b$ accurate multiplier and $A_H X_L$, $A_L X_H$, $A_L X_L$ $b \times b$ approximate multipliers. As we shall see, the $b \times b$ approximate multipliers generate the upper $b$ bits accurately to a high extent, which in turn makes the upper $2b$ bits of the final $4b$-bit product achieve high accuracy. The same is illustrated in Fig. 1(b).

We have explained the design methodology of these approximate $b \times b$ multipliers in the Carry-in Prediction Logic [14]. We briefly explain this novel technique in the following subsection with the help of an example.

A. The Carry-in Prediction: An Example

Consider the unsigned multiplication of two 16-bit numbers (i.e. $b = 8$):

$A = (AEDB)_{16} = (44763)_{10}$

$X = (B6E7)_{16} = (46823)_{10}$
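As a quick check of the operands above, the following sketch splits $A$ and $X$ into their $b$-bit halves, recombines the four partial products as in the recursive scheme of Section III-A, and evaluates the relative-error metric of Section III-B. It is only an illustration: the function names (`split`, `recursive_product`, `relative_error_percent`) are assumptions and do not come from the paper, and the approximate value passed in the last line is the result the paper reports for this example when discussing Fig. 2(b).

```python
def split(value: int, b: int) -> tuple[int, int]:
    """Split a 2b-bit value into its (high, low) b-bit halves."""
    mask = (1 << b) - 1
    return value >> b, value & mask


def recursive_product(a: int, x: int, b: int) -> int:
    """Exact 2b x 2b product assembled from four b x b partial products."""
    a_h, a_l = split(a, b)
    x_h, x_l = split(x, b)
    return (
        ((a_h * x_h) << (2 * b))              # A_H X_H, weight 2^(2b)
        + ((a_h * x_l + a_l * x_h) << b)      # A_H X_L + A_L X_H, weight 2^b
        + (a_l * x_l)                         # A_L X_L, weight 1
    )


def relative_error_percent(correct: int, approx: int) -> float:
    """(|Rc - Re| / Rc) * 100, defined only for Rc != 0."""
    if correct == 0:
        raise ValueError("relative error is undefined for Rc = 0")
    return abs(correct - approx) / correct * 100.0


A, X, B = 0xAEDB, 0xB6E7, 8
exact = recursive_product(A, X, B)
assert exact == A * X          # the decomposition itself introduces no error
print(hex(exact))              # 0x7ced799d
print(relative_error_percent(exact, 0x7CEBA7FB))
```

In the approximate design only the three partial products involving a low half are approximated, so replacing those three terms inside `recursive_product` would be the natural place to plug in an approximate $b \times b$ multiplier.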
Now, let us evaluate one approximate product out of $A_H X_L$, $A_L X_H$ and $A_L X_L$ using our algorithm. Say, we want to evaluate $A_L X_L$, i.e. $(DB)_{16} \times (E7)_{16}$. As shown in Fig. 2(a), we divide this multiplication into three independent parts. The first part is the accurate computation of the $b/2$ least significant bits (LSBs). In the second part, $b/2$ bits are simply set to 1's. The third part is again an accurate computation of the remaining elements in the multiplication tree, with an additional carry 'C' arising from the inaccurate part at the least significant position. So, the idea is to precompute 'C' through some mechanism and begin multiplication simultaneously from both the first and the third part. At the same time, we reduce the number of addition operations involved by directly setting the bits in the second part, thus significantly reducing the hardware costs.

Fig. 2. Carry-in Prediction Example for $b = 8$ (i.e. 16 × 16 multiplication): (a) Approximate $A_L X_L$ (b) Accurate $A_L X_L$
[Figure: partial-product trees for 11011011 × 11100111. In (a) the critical column is marked and the predicted carry is C = 1; the result 1100001111111101 consists of an upper part accurate to a certain extent ($b$ bits), an inaccurate part ($b/2$ bits) and an accurate part ($b/2$ bits). In (b) the result 1100010110011101 is completely accurate ($2b$ bits).]

Fig. 2(a) further shows the critical column, defined as the column containing the maximum number of elements in the multiplication tree. Carry-in Prediction logic exploits the fact that if there are two or more 1's in the critical column, then a carry of at least 1 is definitely propagated to the next column. The next subsection discusses this prediction in more detail. In the second part, we set the $b/2$ bits in the inaccurate part to 1's because for such a large $b$ ($\geq 5$), it is very probable that the carry propagated from the critical column is more than 1. Therefore, setting those bits will reduce the error involved, as it is analogous to the difference between 16 (5'b10000) and 15 (5'b01111), i.e. 16 just passes an extra carry.

Fig. 2(b) shows the accurate $A_L X_L$ evaluation. As evident, out of the 8 most significant bits (MSBs), 6 are correct in our approximate $A_L X_L$. Evaluating $A_H X_L$ and $A_L X_H$ in a similar fashion and adding all these as indicated in Fig. 1(b) gives the approximate result $(7CEBA7FB)_{16}$. The correct answer is $(7CED799D)_{16}$. The relative error in this case is merely 0.0056%. Precomputation of the carry-in 'C' is described next in detail.

B. Efficient Carry-in Precomputation

Carry-in Prediction necessitates the precomputation of the carry-in. Since we are dealing with error-resilient systems, we can further simplify and approximate the evaluation of the carry-in so that the reduction in latency is not achieved at the cost of power. We consider the cases of $b = 4$ and $b = 6$ in order to better explain the carry-in precomputation procedure.

The precomputation is made hardware efficient by making minor changes in the K-Maps of the carry-in expressions as shown in Fig. 3. Here $A, B, \ldots, F$ are the elements in the critical column. The original K-Maps are obtained from the statement of Carry-in Prediction Logic, i.e. $C_{in} = 1$ if 2 or more elements of the critical column are 1. Fig. 3 further derives:

∙ for $b = 4$: By making changes in 2 cases out of 16, we can simplify the carry-in expression to

$C_{in} = A \cdot B + C + D$

Similar results can also be derived for $b < 4$.
∙ for $b = 6$: We make changes in 6 cases out of 64 and get

$C_{in} = A + B + C + D + E + F$

This is the same as the OR operation of all the elements present in the critical column. Therefore, in general, we can state that for large $b$ (greater than 4), one should take the OR of all the elements present in the critical column to get $C_{in}$.

Fig. 3. Carry-in Precomputation for (a) $b = 4$, i.e. 8 × 8 Multiplier (b) $b = 6$, i.e. 12 × 12 Multiplier
[Figure: modified K-Maps of the carry-in over the critical-column elements, with AB as rows, CD as columns and, for $b = 6$, one map per value of EF.]

Next, we propose various accuracy configurations and a bit-width aware approximate multiplication algorithm.

V. BIT-WIDTH AWARE APPROXIMATE MULTIPLICATION

In the approximate multiplication, we divide the $b \times b$ accurate multiplier $A_H X_H$ into 4 smaller components, each being a $b/2 \times b/2$ multiplier. This is because, when the accurate $A_H X_H$ is performed in parallel with the approximate $A_H X_L$, $A_L X_H$ and $A_L X_L$, the critical path will still be determined by the accurate multiplier. Therefore, recursively reducing it to smaller multipliers will make the approximate $b \times b$ multipliers the deciding factors of the critical path, as they are more critical than accurate $b/2 \times b/2$ multipliers.

In other words, the stage 1 of the pipelined approximate multiplier effectively consists of 7 multipliers. The designation
of each of these multipliers and their respective arrangement for addition in the second stage is depicted in Fig. 4. Note that this kind of arrangement will not lead to any change in the latency of the second pipeline stage, as we perform the addition of $A_H X_L$ and $A_L X_H$ in parallel with the rest of the smaller multipliers. The latter additions generate a net sum of $A_H X_H$ in almost the same time as the former takes to complete its addition. Now let us see how we can exploit this property in the proposed multiplication to configure its accuracy.

Fig. 4. Latency-Driven Pipelined Approximate Multiplier
[Figure: stage 1 contains seven multipliers: $A_{HH} X_{HH}$, $A_{HH} X_{HL}$, $A_{HL} X_{HH}$ and $A_{HL} X_{HL}$ (each $b/2 \times b/2$, together forming the $2b$-bit $A_H X_H$) and $A_H X_L$, $A_L X_H$ and $A_L X_L$ (each $b \times b$); their outputs are added into the $4b$-bit final product.]

A. Accuracy Configuration Modes

Since now we have 7 multipliers in stage 1, we can vary the accuracy level of the proposed multiplier by varying the number of multipliers that are accurate. In any case, we keep $A_{HH} X_{HH}$ always accurate, so that the accuracy level does not fall below a certain level. Therefore, we obtain an accuracy-configurable multiplier whose accuracy can be adjusted according to the error tolerance of the application. The number of inaccurate multipliers used will directly determine the amount of power saved by the multiplier.

We propose 4 different modes of operation of our approximate multiplier based on accuracy levels. The proposed modes are given in Table I. Here 'A' stands for an accurate multiplier and 'I' stands for an inaccurate multiplier. We explain the bit-width aware algorithm next.

TABLE I
MODES OF OPERATION OF ACCURACY CONFIGURABLE MULTIPLIER

Mode    $A_{HH} X_{HH}$    $A_{HH} X_{HL}$    $A_{HL} X_{HH}$    $A_{HL} X_{HL}$
1       A                  I                  I                  I
2       A                  A                  I                  I
3       A                  A                  A                  I
4       A                  A                  A                  A

B. Proposed Bit-Width Aware Algorithm

We propose a bit-width aware algorithm for generalized $2b \times 2b$ multiplication which is configurable at run-time according to the bit-width of the operands, i.e. if the $b \times b$ multiplication is with smaller operands, say $(1011)_2 \times (1101)_2$, it will configure $b$ at run-time as 4, not as 8 (used previously for $b \times b$). The size of the inaccurate part ($k$) in the approximate partial products will always be equal to $b/2$ bits. Therefore, $k$ will be automatically set to 2 and not to 4. Further, the positions at which the middle $b/2$ bits are set to 1 also change with operand bit-width. This has already been indicated in Fig. 2. This is plausible because at a time, we will use a multiplier of fixed size depending on the application and hence can program it accordingly.

We present our algorithm (see Algorithm 1) of approximate $b \times b$ partial product computation (e.g. $A_H X_L$) in its most general form. This algorithm is coded later as a C program for simulation purposes. It should be noted that the index 0 is the Most Significant Bit in Algorithm 1.

Algorithm 1 Approximate Partial Product Evaluation

procedure APPROXIMATEPRODUCT(p, a, x)
    Product ← p[0, 1, ..., 2b − 1]        /* Say A_H X_L */
    Multiplicand ← a[0, 1, ..., b − 1]
    Multiplier ← x[0, 1, ..., b − 1]
    c ← 0                                 /* Temporary Carry */
    q, r ← 2b − 1; d ← b − (k/2)
    for i ← b − 1, b − (k/2) do           /* Inaccurate Part */
        for j ← b − 1, d do
            [p(q), c] ← add-bits(p(q), a(j) & x(i), c);
            q ← q − 1
        end for
        q, r ← r − 1; d ← d + 1; c ← 0
    end for
    for r ← 2b − (k/2) − 1, 2b − k do
        p(r) ← 1;
    end for
    C_in ← Carry-in-Pre                   /* Switch Case for various b */
    p(r) ← C_in; q, r ← 2b − k − 1; d ← b − k − 1; C_in ← 0
    for i ← b − 1, b − k do               /* Accurate Part */
        for j ← d, 0 do
            [p(q), C_in] ← add-bits(p(q), a(j) & x(i), C_in);
            if j == 0 and C_in == 1 and q ≠ 0 then
                p(q − 1) ← 1
            end if
            q ← q − 1
        end for
        q ← r; d ← d + 1; C_in ← 0
    end for
    for i ← b − k − 1, 0 do
        for j ← b − 1, 0 do
            [p(q), C_in] ← add-bits(p(q), a(j) & x(i), C_in);
            if j == 0 and C_in == 1 and q ≠ 0 then
                p(q − 1) ← 1
            end if
            q ← q − 1
        end for
        q, r ← r − 1; C_in ← 0
    end for
end procedure

VI. AWTM DESIGN AND ITS LPA OPTIMIZATION

In our accuracy-configurable design, we have reduced the horizontal critical path to just half. In order to reduce the
vertical critical path as well (of $A_H X_L$, $A_L X_H$ and $A_L X_L$), we use Wallace Tree Reduction [15]. Wallace trees are fast and hardware efficient for multiplication of more than 16 bits. Wallace tree height also grows as $\log_{3/2}(N/2)$. We call this Wallace tree based design the Approximate Wallace Tree Multiplier (AWTM). An accurate 8 × 8 Wallace multiplication takes a total of 4 stages of reduction (each of which has a delay of 1 full adder) and then uses an 11-bit full adder to compute the final product. We use these 8 × 8 partial products to evaluate a 16 × 16 multiplication.

Fig. 5 shows that the approximate partial product multipliers ($A_H X_L$, $A_L X_H$ and $A_L X_L$), which are 8 × 8, take a total of 3 stages of reduction and further use a 6-bit full adder for the final product evaluation. When compared in terms of critical paths (of stage 1 of the pipelined 16 × 16 multiplier), an accurate 8 × 8 multiplier uses a delay of 15 full adders (4 stages and an 11-bit full adder) and its approximate counterpart uses a delay of 9 full adders (3 stages and a 6-bit full adder). Theoretically, this leads to an improvement of 40% in the latency of stage 1.

Fig. 5. Wallace Tree for approximate 8 × 8 partial product evaluation
[Figure: three reduction stages (Stage 1 to Stage 3) over the partial products $A_i X_j$ with the predicted carry-in and the middle bits set to 1, followed by a 6-bit full adder whose carry-in at the LSB is the carry-out of the last stage, not 0.]

Furthermore, for each of $A_H X_L$, $A_L X_H$ and $A_L X_L$, the number of adders is reduced by around 51.24% (48 full adders and 25 half adders are required for their accurate evaluation, in contrast with 23 full adders and 13 half adders required for approximate evaluation). Hence, for the complete pipelined multiplier, the total power reduction is expected to be around 38.42% (because $A_H X_H$ is accurate and the other 3 are inaccurate, i.e. three-fourths of 51.24%). We validate these theoretical estimates by running actual simulations. The experimental results and their analysis are discussed next.

VII. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we present power and area results obtained experimentally. All results have been produced using Cadence RTL Compiler for the 45 nm Nangate Opencell Library. We also generate results on a real time application by computing the Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (iDCT) of a benchmark image.

A. Accuracy and Acceptance Probability

We simulate our 16 × 16 bit-width aware multiplier using a C program by generating 5000 random numbers to compute accurate and approximate products for all possible combinations for different accuracy modes. For each case, $MAA$ is set as 99%. It generates results like $ACC_{amp}$, mean error and Acceptance Probability. The mean error and Acceptance Probability (AP) results are tabulated in Table II for various modes. Note that for unbounded operand size (operands > 1 in Table II), the bit-width aware algorithm is employed. When the operand size is constrained to be 10 bits or more (operands > 1000), it is found that a simple 16 × 16 approximate multiplier without bit-width awareness produces almost the same results.

TABLE II
RESULTS: MEAN ERROR AND ACCEPTANCE PROBABILITY (AP)

Mode    Parameter     For Operands > 1 (in %)    For Operands > 1000 (in %)
1       Mean Error    5.26                       4.59
        AP            27.34                      28.18
2       Mean Error    3.42                       3.16
        AP            46.18                      46.04
3       Mean Error    0.46                       0.29
        AP            91.58                      94.28
4       Mean Error    0.13                       0.035
        AP            98.44                      99.72

Table II shows spectacular results for accuracy levels. An Acceptance Probability of more than 98% for a minimum acceptable accuracy of 99% signifies that for all possible combinations of the random numbers generated, more than 98% of the cases give an accuracy greater than 99%. Also, as explained earlier, for larger numbers (operands > 1000), the accuracy level goes up to as high as 99.965%. A 16 × 16 multiplier proposed in [12] generates a mean error of 3.32%. Clearly, our multiplier performs better than this for modes 3 and 4. The error is comparable for mode 2.

We further investigate the relationship between $MAA$ and acceptance probability. Fig. 6 shows a plot of $AP$ vs. $MAA$ for various modes and for ETM (proposed in [13]). It is evident that our multiplier (when used in modes 3 and 4) outperforms ETM easily as far as accuracy is concerned. The accuracy results for mode 4 of the 16 × 16 AWTM were confirmed by inputting 10,000 random test vectors in the RTL netlist. Note that for the sake of simplicity, we do not make the hardware implementation (HDL code) of AWTM bit-width aware. For such a design, we obtained a mean error of 0.16% and an $AP$ of 98.56% for an $MAA$ of 99%, quite in agreement with Table II. Hence, mode 4 of the 16 × 16 multiplier gives almost the same results (for all operands) when employed with or without bit-width awareness. Next, we present power and area results.
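The evaluation procedure described in this subsection (random operands, a relative error per operand pair, the mean error, and the AP against a chosen MAA) can be sketched as follows. This is a minimal illustration and not the authors' C program: `truncating_multiplier` is a hypothetical stand-in that simply clears low-order operand bits and is not the AWTM, and sampling 5000 random operand pairs with a fixed seed is an assumption about how the test set is drawn.

```python
import random
from typing import Callable


def truncating_multiplier(a: int, x: int, dropped_bits: int = 4) -> int:
    """Hypothetical stand-in for an approximate multiplier (NOT the AWTM):
    clears the low `dropped_bits` bits of each operand before multiplying."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (x & mask)


def evaluate(multiply: Callable[[int, int], int], samples: int = 5000,
             maa: float = 99.0, min_operand: int = 2,
             bits: int = 16, seed: int = 0) -> tuple[float, float]:
    """Return (mean error in %, acceptance probability in %)."""
    rng = random.Random(seed)
    total_error = 0.0
    accepted = 0
    for _ in range(samples):
        a = rng.randint(min_operand, (1 << bits) - 1)
        x = rng.randint(min_operand, (1 << bits) - 1)
        correct = a * x                      # Rc, non-zero since operands > 1
        error = abs(correct - multiply(a, x)) / correct * 100.0
        total_error += error
        if 100.0 - error > maa:              # ACC_amp > MAA
            accepted += 1
    return total_error / samples, accepted / samples * 100.0


mean_error, ap = evaluate(truncating_multiplier)
print(f"mean error = {mean_error:.3f}%, AP = {ap:.2f}%")
```

Raising `min_operand` to 1001 corresponds to the "operands > 1000" column of Table II, and sweeping `maa` gives the kind of AP versus MAA curve plotted in Fig. 6; the numbers produced by the stand-in are of course unrelated to the AWTM results.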
Fig. 6. Acceptance Probability vs. Minimum Acceptable Accuracy
[Figure: Acceptance Probability (in %) plotted against Minimum Acceptable Accuracy (in %, 90 to 98) for AWTM Modes 1 to 4 and ETM.]

TABLE III
REDUCTION IN AREA AND POWER (HIGHER IS BETTER)

Size       Approximate Multiplier    Area (%)    Power (%)
4 × 4      AWTM (Proposed)           55.76       53.16
           ETM [13]                  53.85       49.60
           Kulkarni [12]             35.75       36.3
           Truncation [11]           43.44       43.08
8 × 8      AWTM (Proposed)           51.93       57.19
           ETM [13]                  50.02       39.25
           Kulkarni [12]             22.03       41.5
           Truncation [11]           47.90       15.30
16 × 16    AWTM (Proposed)           34.49       41.96
           ETM [13]                  30.27       31.49
           Kulkarni [12]             17.89       31.8
           Truncation [11]           15.17       9.69

B. Power and Area Analysis

We obtain power and area results for our 4 × 4, 8 × 8 and 16 × 16 AWTM designs. Table III shows these results along with power and area results for the comparable designs reported in the literature. Here, the power and area reductions of the various approximate multipliers are computed with respect to their accurate counterparts. Fig. 7 and Fig. 8 display the scalability results, from which it is evident that AWTM performs better than all other corresponding multipliers reported in the literature w.r.t. power and area.

Fig. 7. Power Reduction plot for Scalability of approximate multipliers (Higher is better)
[Figure: Power Reduction (in %) for the AWTM, ETM, Kulkarni and Truncation multipliers at 4-bit, 8-bit and 16-bit widths.]

Fig. 8. Area Reduction for various scalable approximate multipliers (Higher is better)
[Figure: Area Reduction (in %) for the AWTM, ETM, Kulkarni and Truncation multipliers at 4-bit, 8-bit and 16-bit widths.]

Since we have not optimized the second stage of the pipeline (i.e. the addition stage), the latency results were the same for both accurate and approximate 16 × 16 pipelined multipliers. As Wallace trees are very fast, the minimum clock period in RTL synthesis was decided by the addition stage itself. Nevertheless, when the latency of the single cycle implementation of AWTM (complete multiplier, not just stage 1) was compared with that of a single cycle accurate multiplier, it was found that AWTM reduces the latency by 23.91%. Similarly, for the 8 × 8 multiplier, we achieved a 32% reduction in latency, which was theoretically predicted to be around 40%.

A net reduction in total power and area of 57.19% and 51.93% respectively is also obtained for the 8 × 8 AWTM. Both of these values were expected to be around 52% theoretically, as mentioned in the previous section. Therefore, the experimental results conform to the theoretical results to a large extent. The area and power reduction results of the 16 × 16 AWTM are given in Table IV for various modes of operation.

TABLE IV
RTL COMPILER RESULTS OF 16 × 16 AWTM

Mode    Area (in %)    Leakage Power (in %)    Dynamic Power (in %)    Total Power (in %)
1       34.49          34.06                   43.4                    41.96
2       31.91          32.52                   41.46                   40.63
3       29.89          30.97                   39.68                   38.86
4       27.90          29.43                   37.90                   37.10

Key Observations: First, as expected, leakage power and area follow almost the same trends (even in the values of percentage reduction). Second, the percentage power reduction goes as high as 41.96%. Therefore, we have larger power savings for applications that can tolerate relatively more error. Finally, for mode 4, the net power saving obtained from RTL Compiler is 37.10%, which was estimated to be around 38% theoretically, thus confirming the validity of these results.

Furthermore, Fig. 9 compares the percentage reduction in power and area against the mean error involved in the 16 × 16 product. The figure clearly shows that with an increase in the error tolerance of the application, power (dynamic as well as leakage) and area savings also increase. In the next subsection, we evaluate our multiplier on a real time application.

C. Real Time Application: DCT and iDCT

We make use of the AWTM to demonstrate its effectiveness in computing the DCT and iDCT of the benchmark image 'Lena'. We
have used this application because it involves multiplication of floating point numbers. Floating point multiplication uses large unsigned multipliers, making it an ideal application area for AWTM. Fig. 10 shows our results on this image. The same image is generated back when the DCT and iDCT operations are performed on it.

Fig. 9. Mean Error vs. Area and Power Reduction
[Figure: Percentage Reduction in Area, Leakage Power, Dynamic Power and Total Power plotted against Mean Error (in %).]

Fig. 10. DCT and iDCT of image using accurate multiplier and AWTM
[Figure panels: (a) Original Image (b) Accurate Multiplier (c) Approximate Mode 4 (d) Approximate Mode 3 (e) Approximate Mode 2 (f) Approximate Mode 1.]

As expected, the results for mode 1 (Fig. 10(f)) and mode 2 (Fig. 10(e)) are not good, as these modes give a mean error of about 3% to 5% (Table II). It means that this application cannot tolerate such a magnitude of error. There are applications which can, and hence modes 1 and 2 can be easily employed there. On the other hand, it is hardly possible to distinguish between the results of the accurate multiplier (Fig. 10(b)) and those of AWTM mode 4 (Fig. 10(c)) and mode 3 (Fig. 10(d)). The multiplier used to produce the image in Fig. 10(c) saves 37% power and 28% area. The one used for Fig. 10(d) reduces power by 38.86% and area by 29.89%. Therefore, using AWTM, we can save up to 39% power and 30% area with negligible loss in image quality.

VIII. CONCLUSION

We proposed a power- and area-efficient Approximate Wallace Tree Multiplier (AWTM) for error-resilient systems. A new bit-width aware approximate multiplication algorithm is also presented. The AWTM design is further empowered with a Carry-in Prediction logic and its efficient precomputation to increase overall throughput. Our power-area efficient AWTM is fast, particularly for operand sizes of 16 bits or more, and optimized w.r.t. LPA design metrics. Single cycle implementation of AWTM showed a 23.91% reduction in latency. We obtained a mean accuracy of 99.85% to 99.965% for 16-bit multiplication of different sized operands. We also achieved a significant reduction in power and area of our multiplier design, up to 41.96% and 34.49% respectively for 16-bit multiplication, which clearly demonstrates the efficiency and effectiveness of AWTM. Finally, we demonstrated that AWTM produced images of almost the same quality as obtained by operations using accurate multipliers but with power and area savings of around 39% and 30% respectively.

REFERENCES

[1] "International technology roadmap for semiconductors," http://www.itrs.net.
[2] E. J. Swartzlander, "Truncated multiplication with approximate rounding," in Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on, vol. 2, Oct. 1999, pp. 1480–1483.
[3] L. N. Chakrapani, K. K. Muntimadugu, L. Avinash, J. George, and K. V. Palem, "Highly energy and performance efficient embedded computing through approximately correct arithmetic: a mathematical foundation and preliminary experimental validation," CASES, pp. 187–196, 2008.
[4] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, "Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[5] N. Zhu, W. L. Goh, G. Wang, and K. S. Yeo, "Enhanced low-power high-speed adder for error-tolerant application," in SoC Design Conference (ISOCC), 2010 International, Nov. 2010, pp. 323–327.
[6] A. Verma, P. Brisk, and P. Ienne, "Variable latency speculative addition: A new paradigm for arithmetic circuit design," in Design, Automation and Test in Europe, 2008. DATE '08, Mar. 2008, pp. 1250–1255.
[7] A. Kahng and S. Kang, "Accuracy-configurable adder for approximate arithmetic designs," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, June 2012, pp. 820–825.
[8] S. L. Lu, "Speeding up processing with approximation circuits," Computer, vol. 37, no. 3, pp. 67–73, Mar. 2004.
[9] D. Shin and S. Gupta, "A re-design technique for datapath modules in error tolerant applications," in Asian Test Symposium, 2008. ATS '08. 17th, Nov. 2008, pp. 431–437.
[10] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 32, no. 1, pp. 124–137, Jan. 2013.
[11] M. B. Sullivan and E. E. Swartzlander, "Truncated error correction for flexible approximate multiplication," in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, 2012, pp. 355–359.
[12] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading accuracy for power with an underdesigned multiplier architecture," in VLSI Design (VLSI Design), 2011 24th International Conference on, 2011, pp. 346–351.
[13] K. Y. Kyaw, W.-L. Goh, and K.-S. Yeo, "Low-power high-speed multiplier for error-tolerant application," in Electron Devices and Solid-State Circuits (EDSSC), 2010 IEEE International Conference of, 2010, pp. 1–4.
[14] K. Bhardwaj and P. S. Mane, "Acma: Accuracy-configurable multiplier architecture for error-resilient system-on-chip," in Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), 2013 8th International Workshop on, July 2013, pp. 1–6.
[15] C. S. Wallace, "A suggestion for a fast multiplier," in Electronic Computers, 1964 IEEE Transactions on, 1964, pp. 14–17.
