Power and Area Efficient Approximate Multipliers
Abstract: Approximate computing can decrease design complexity while increasing performance and power efficiency for error-resilient applications. This brief presents a new design approach for the approximation of multipliers. The partial products of the multiplier are altered to introduce terms with varying probabilities. The logic complexity of the approximation is varied for the accumulation of the altered partial products based on their probability. The proposed approximation is utilized in two variants of 16-bit multipliers. Synthesis results reveal that the two proposed multipliers achieve power savings of 72% and 38%, respectively, compared to an exact multiplier. They have better precision than existing approximate multipliers. Mean relative error figures are as low as 7.6% and 0.02% for the proposed approximate multipliers, which are better than those of previous works. The performance of the proposed multipliers is evaluated with an image processing application, where one of the proposed models achieves the highest peak signal-to-noise ratio.

Index Terms: Approximate computing, error analysis, low error, low power, multipliers.

Manuscript received May 18, 2016; accepted December 19, 2016.
The authors are with the Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, SK S7N 5C5, Canada.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2016.2643639
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

I. INTRODUCTION

In applications such as multimedia signal processing and data mining, which can tolerate error, exact computing units are not always necessary. They can be replaced with their approximate counterparts. Research on approximate computing for error-tolerant applications is on the rise. Adders and multipliers form the key components in these applications. In [1], approximate full adders are proposed at the transistor level and are utilized in digital signal processing applications. The proposed full adders are used in the accumulation of partial products in multipliers.

To reduce the hardware complexity of multipliers, truncation is widely employed in fixed-width multiplier designs. A constant or variable correction term is then added to compensate for the quantization error introduced by the truncated part [2], [3]. Approximation techniques in multipliers focus on the accumulation of partial products, which is crucial in terms of power consumption. A broken array multiplier is implemented in [4], where the least significant bits of the inputs are truncated while forming the partial products to reduce hardware complexity. The multiplier proposed in [4] saves a few adder circuits in partial product accumulation.

In [5], two designs of approximate 4-2 compressors are presented and used in the partial product reduction tree of four variants of an 8 × 8 Dadda multiplier. The major drawback of the compressors proposed in [5] is that they give a nonzero output for zero-valued inputs, which largely affects the mean relative error (MRE), as discussed later. The approximate design proposed in this brief overcomes this drawback, which leads to better precision. In the static segment multiplier (SSM) proposed in [6], m-bit segments are derived from n-bit operands based on the leading 1 bit of the operands. Then, an m × m multiplication is performed instead of an n × n multiplication, where $m < n$. The partial product perforation (PPP) multiplier in [7] omits k successive partial products starting from the jth position, where $j \in [0, n-1]$ and $k \in [1, \min(n-j, n-1)]$, of an n-bit multiplier. In [8], a 2 × 2 approximate multiplier based on modifying an entry in the Karnaugh map is proposed and used as a building block to construct 4 × 4 and 8 × 8 multipliers. In [9], an inaccurate counter design is proposed for use in a power-efficient Wallace tree multiplier. A new approximate adder is presented in [10], which is utilized for the partial product accumulation of the multiplier. For the 16-bit approximate multiplier in [10], a 26% reduction in power is achieved compared to an exact multiplier. Approximation of an 8-bit Wallace tree multiplier due to voltage over-scaling (VOS) is discussed in [11]. Lowering the supply voltage creates paths that fail to meet delay constraints, leading to error.

Previous works on logic complexity reduction focus on the straightforward application of approximate adders and compressors to the partial products. In this brief, the partial products are altered to introduce terms with different probabilities. The probability statistics of the altered partial products are analyzed, followed by systematic approximation. Simplified arithmetic units (half-adder, full-adder, and 4-2 compressor) are proposed for the approximation. The arithmetic units are not only reduced in complexity, but care is also taken that the error value is kept low. While systematic approximation helps in achieving better accuracy, the reduced logic complexity of the approximate arithmetic units consumes less power and area. The proposed multipliers outperform the existing multiplier designs in terms of area, power, and error, and achieve better peak signal-to-noise ratio (PSNR) values in an image processing application.

Error distance (ED) can be defined as the arithmetic distance between a correct output and an approximate output for a given input. In [12], approximate adders are evaluated and the normalized ED (NED) is proposed as a nearly invariant metric that is independent of the size of the approximate circuit. In addition, the traditional error analysis metric, MRE, is computed for the existing and proposed multiplier designs.

The rest of this brief is organized as follows. Section II details the proposed architecture. Section III provides an extensive analysis of the design and error metrics of the proposed and existing approximate multipliers. The proposed multipliers are utilized in an image processing application and the results are provided in Section IV. Section V concludes this brief.

II. PROPOSED ARCHITECTURE

The implementation of a multiplier comprises three steps: generation of the partial products, the partial product reduction tree, and finally, a vector merge addition to produce the final product from the sum and carry rows generated by the reduction tree. The second step consumes more power than the other two. In this brief, approximation is applied in the reduction tree stage.

An 8-bit unsigned¹ multiplier is used for illustration to describe the proposed method for the approximation of multipliers. Consider two 8-bit unsigned input operands $\alpha = \sum_{m=0}^{7} \alpha_m 2^m$ and $\beta = \sum_{n=0}^{7} \beta_n 2^n$. The partial product $a_{m,n} = \alpha_m \cdot \beta_n$ in Fig. 1 is the result of the AND operation between the bits $\alpha_m$ and $\beta_n$.

¹ The proposed approximation technique can also be applied to signed multiplication, including Booth multipliers, except that it is not applied to the sign extension bits.
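As an illustration of the partial product generation step described in Section II, the following minimal Python sketch builds the matrix $a_{m,n} = \alpha_m \cdot \beta_n$ for two unsigned operands and accumulates it exactly. The function names (`partial_products`, `accumulate_exact`, `column_heights`) are hypothetical and chosen only for this example; the sketch models bit-level behavior and is not the hardware description used in the brief.

```python
def partial_products(alpha: int, beta: int, n: int = 8) -> list[list[int]]:
    """Return the n x n matrix a[m][k] = alpha_m AND beta_k of an unsigned multiply."""
    return [[((alpha >> m) & 1) & ((beta >> k) & 1) for k in range(n)]
            for m in range(n)]


def accumulate_exact(pp: list[list[int]]) -> int:
    """Exact accumulation: the bit a[m][k] carries weight 2**(m + k)."""
    return sum(bit << (m + k)
               for m, row in enumerate(pp)
               for k, bit in enumerate(row))


def column_heights(n: int = 8) -> list[int]:
    """Number of partial products in each of the 2n - 1 columns of the matrix."""
    return [min(c, 2 * n - 2 - c) + 1 for c in range(2 * n - 1)]


if __name__ == "__main__":
    pp = partial_products(0xB7, 0x5D)
    # The exact accumulation of all partial products reproduces the product.
    print(accumulate_exact(pp) == 0xB7 * 0x5D)
    # Columns that hold more than three partial products.
    print([c for c, h in enumerate(column_heights()) if h > 3])
```

For n = 8, the columns holding more than three partial products are columns 3 through 11, which is the range in which the brief replaces partial products with altered ones.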
Fig. 1. Transformation of generated partial products into altered partial products.
TABLE I: PROBABILITY STATISTICS OF GENERATE SIGNALS
TABLE II: TRUTH TABLE OF APPROXIMATE HALF ADDER
TABLE III: TRUTH TABLE OF APPROXIMATE FULL ADDER
From a statistical point of view, the partial product $a_{m,n}$ has a probability of 1/4 of being 1. In the columns containing more than three partial products, the partial products $a_{m,n}$ and $a_{n,m}$ are combined to form propagate and generate signals as given in (1), where + denotes logical OR and · denotes logical AND. The resulting propagate and generate signals form the altered partial products $p_{m,n}$ and $g_{m,n}$. From column 3 with weight $2^3$ to column 11 with weight $2^{11}$, the partial products $a_{m,n}$ and $a_{n,m}$ are replaced by the altered partial products $p_{m,n}$ and $g_{m,n}$. The original and transformed partial product matrices are shown in Fig. 1.

$$
\begin{aligned}
p_{m,n} &= a_{m,n} + a_{n,m} \\
g_{m,n} &= a_{m,n} \cdot a_{n,m}
\end{aligned}
\tag{1}
$$

The probability of the altered partial product $g_{m,n}$ being one is 1/16, which is significantly lower than the 1/4 of $a_{m,n}$. The probability of the altered partial product $p_{m,n}$ being one is 1/16 + 3/16 + 3/16 = 7/16, which is higher than that of $g_{m,n}$. These factors are considered while applying approximation to the altered partial product matrix.

A. Approximation of Altered Partial Products $g_{m,n}$

The accumulation of the generate signals is done columnwise. As each element has a probability of 1/16 of being one, the probability of two elements being one in the same column is even lower. For example, in a column with four generate signals, the probability of all elements being zero is $(1 - p_r)^4$, of only one element being one is $4 p_r (1 - p_r)^3$, of two elements being one is $6 p_r^2 (1 - p_r)^2$, of three elements being one is $4 p_r^3 (1 - p_r)$, and of all elements being one is $p_r^4$, where $p_r$ is 1/16. The probability statistics for a number of generate elements m in each column are given in Table I.

Based on Table I, using an OR gate in the accumulation of the columnwise generate elements in the altered partial product matrix provides the exact result in most cases. The probability of error ($P_{err}$) when using an OR gate for the reduction of generate signals in each column is also listed in Table I. As can be seen, the probability of misprediction is very low. As the number of generate signals increases, the error probability increases linearly. However, the value of the error also rises. To prevent this, the maximum number of generate signals to be grouped by an OR gate is kept at four. For a column having m generate signals, $\lceil m/4 \rceil$ OR gates are used.

B. Approximation of Other Partial Products

The accumulation of the other partial products, with probability 1/4 for $a_{m,n}$ and 7/16 for $p_{m,n}$, uses approximate circuits. An approximate half-adder, full-adder, and 4-2 compressor are proposed for their accumulation. Carry and Sum are the two outputs of these approximate circuits. Since Carry has the higher binary weight, an error in the Carry bit contributes more by producing an error difference of two in the output. Approximation is handled in such a way that the absolute difference between the actual output and the approximate output is always maintained as one. Hence, Carry outputs are approximated only for the cases where Sum is approximated.

In adders and compressors, XOR gates tend to contribute to high area and delay. For approximating the half-adder, the XOR gate of Sum is replaced with an OR gate as given in (2). This results in one error in the Sum computation, as seen in the truth table of the approximate half-adder in Table II. A tick mark denotes that the approximate output matches the correct output and a cross mark denotes a mismatch.

$$
\begin{aligned}
Sum &= x_1 + x_2 \\
Carry &= x_1 \cdot x_2
\end{aligned}
\tag{2}
$$

In the approximation of the full-adder, one of the two XOR gates is replaced with an OR gate in the Sum calculation. This results in an error in the last two cases out of eight. Carry is modified as in (3), introducing one error. This provides more simplification while maintaining the difference between the original and approximate values as one. The truth table of the approximate full-adder is given in Table III.

$$
\begin{aligned}
W &= x_1 + x_2 \\
Sum &= W \oplus x_3 \\
Carry &= W \cdot x_3
\end{aligned}
\tag{3}
$$

The two approximate 4-2 compressors in [5] produce a nonzero output even for the cases where all inputs are zero. This results in high ED and a high degree of precision loss, especially in cases of zeros in all bits or in the most significant parts of the reduction tree. The proposed 4-2 compressor overcomes this drawback.

In a 4-2 compressor, three bits are required for the output only when all four inputs are 1, which happens only once out of 16 cases. This property is taken to eliminate one of the three output bits in
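The approximate half-adder of (2) and full-adder of (3) can be checked exhaustively. The following Python sketch, with the hypothetical helper `error_cases`, enumerates every input combination and records where the approximate value $2 \cdot Carry + Sum$ departs from the exact sum of the inputs. It is a behavioral model written for this illustration only.

```python
from itertools import product


def approx_half_adder(x1, x2):
    """Eq. (2): Sum uses OR instead of XOR; Carry is the exact AND."""
    return x1 | x2, x1 & x2  # (Sum, Carry)


def approx_full_adder(x1, x2, x3):
    """Eq. (3): W = x1 OR x2, Sum = W XOR x3, Carry = W AND x3."""
    w = x1 | x2
    return w ^ x3, w & x3  # (Sum, Carry)


def error_cases(adder, n_inputs):
    """Map each mismatching input to (approximate value - exact value)."""
    errors = {}
    for bits in product((0, 1), repeat=n_inputs):
        s, c = adder(*bits)
        diff = (2 * c + s) - sum(bits)
        if diff:
            errors[bits] = diff
    return errors


if __name__ == "__main__":
    print(error_cases(approx_half_adder, 2))
    print(error_cases(approx_full_adder, 3))
```

Under this model the half-adder mismatches only for inputs (1, 1), and the full-adder mismatches only for (1, 1, 0) and (1, 1, 1), with an absolute difference of one in every case, in line with the description of Tables II and III.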
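The probabilities quoted for the altered partial products and for a column of generate signals can be reproduced with exact rational arithmetic. The Python sketch below uses hypothetical function names, and its definition of the OR-gate error probability (two or more generate signals being one in the same group) is an assumption about how $P_{err}$ is defined in Table I, whose contents are not reproduced here.

```python
from fractions import Fraction
from itertools import product
from math import comb


def altered_probabilities():
    """Enumerate the four independent bits behind a[m][n] and a[n][m] (m != n)."""
    count_a = count_p = count_g = 0
    for am, bn, an, bm in product((0, 1), repeat=4):
        a_mn = am & bn          # partial product a_{m,n} = alpha_m AND beta_n
        a_nm = an & bm          # partial product a_{n,m} = alpha_n AND beta_m
        count_a += a_mn
        count_p += a_mn | a_nm  # propagate p_{m,n}
        count_g += a_mn & a_nm  # generate g_{m,n}
    return Fraction(count_a, 16), Fraction(count_p, 16), Fraction(count_g, 16)


def ones_distribution(m, pr=Fraction(1, 16)):
    """P(exactly k of m generate signals are 1), for k = 0..m."""
    return [comb(m, k) * pr**k * (1 - pr)**(m - k) for k in range(m + 1)]


def or_error_probability(m, pr=Fraction(1, 16)):
    """Assumed P_err: an OR gate miscounts whenever two or more inputs are 1."""
    return sum(ones_distribution(m, pr)[2:])


if __name__ == "__main__":
    print(altered_probabilities())
    for m in range(2, 5):
        print(m, float(or_error_probability(m)))
```

The enumeration gives 1/4, 7/16, and 1/16 for $a_{m,n}$, $p_{m,n}$, and $g_{m,n}$, matching the values stated in the text. For a group of four generate signals, the assumed error probability evaluates to 1411/65536, roughly 2.2%.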
TABLE IV: TRUTH TABLE OF APPROXIMATE 4-2 COMPRESSOR
TABLE V: SYNTHESIS RESULTS OF EXACT, EXISTING, AND PROPOSED APPROXIMATE MULTIPLIERS
TABLE VI: ERROR METRICS FOR 16-bit MULTIPLIER
TABLE VII: RANKING OF APPROXIMATE MULTIPLIERS IN TERMS OF DESIGN AND ERROR METRICS
Fig. 4. (a) Input image-1 with Gaussian noise. Geometric mean filtered images and corresponding PSNR and energy savings in μJ using (b) exact multiplier, (c) Multiplier1, (d) Multiplier2, (e) ACM1, (f) ACM2, (g) SSM, (h) PPP, (i) UDM, and (j) VOS.
Fig. 5. (a) Input image-2 with Gaussian noise. Geometric mean filtered images and corresponding PSNR and energy savings in μJ using (b) exact multiplier, (c) Multiplier1, (d) Multiplier2, (e) ACM1, (f) ACM2, (g) SSM, (h) PPP, (i) UDM, and (j) VOS.

Multiplier2 offers 32% area savings and 38% power savings over the exact multiplier. ACM2 provides 22% and 30% area and power savings, respectively. SSM has 19% area and 31% power savings over the accurate multiplier. The perforated multiplier has 6% and 12% area and power savings, respectively. UDM provides 19% and 26% area and power savings, respectively. Multiplier2 has an MRE one order of magnitude lower than ACM2, two orders of magnitude lower than UDM, 73% lower than PPP, and 62% lower than SSM. The NED of Multiplier2 outperforms all approximate multipliers except ACM2, which exhibits 10% lower NED than Multiplier2. Multiplier2 produces a larger ED than ACM2. However, its lower MRE indicates that Multiplier2 has smaller relative error values.

Table VII gives a comprehensive comparison of the approximate multipliers to show the tradeoff between design metrics and error metrics. Multiplier1 delivers the lowest APP; Multiplier2 delivers the lowest MRE value. Overall, Multiplier2 has better PDP, APP, and MRE than ACM2, SSM, the perforated multiplier, and UDM, with lower NED in most cases as well. For applications where high power savings are desired and more error can be tolerated, Multiplier1 can be used. For moderate power savings with better performance, Multiplier2 is suggested.

The MRE distribution of the 16-bit versions of Multiplier1 and Multiplier2 is shown in Fig. 3. All possible outputs, ranging from 0 to $65535^2$, are categorized into 255 intervals. The MRE of Multiplier2 is significantly low at higher product values, as exact units are used in the most significant part of the multiplier.

IV. APPLICATION: IMAGE PROCESSING

The geometric mean filter is widely used in image processing to reduce Gaussian noise [13]. The geometric mean filter is better at preserving edge features than the arithmetic mean filter. Two 16-bit-per-pixel grayscale images with Gaussian noise are considered. A 3 × 3 mean filter is used, where each pixel of the noisy image is replaced with the geometric mean of the 3 × 3 block of neighboring pixels centered around it. The algorithms are coded and implemented in MATLAB. Exact and approximate 16-bit multipliers are used to perform the multiplication between 16-bit pixels. PSNR is used as the figure of merit to assess the quality of the approximate multipliers. PSNR is based on the mean-square error between the resulting image of the exact multiplier and the images generated by the approximate multipliers. The energy required by the exact and approximate multiplication processes while performing geometric mean filtering of the images is found using Synopsys PrimeTime. Further, the exact multiplier is voltage scaled from 1 to 0.85 V (VOS), and its impact on energy consumption and image quality is computed.

The noisy input images and the resultant images after denoising using the exact and approximate multipliers, with their respective PSNRs and energy savings in μJ, are shown in Figs. 4 and 5, respectively. The energy required for the exact multiplication process for image-1 and image-2 is 3.24 and 2.62 μJ, respectively. Although ACM1 has better energy savings than Multiplier1, Multiplier1 has a significantly higher PSNR than ACM1. Multiplier2 shows the best PSNR among all the approximate designs. Multiplier2 also has better energy savings than ACM2, PPP, SSM, UDM, and VOS. The intensity of image-1, being mostly on the lower end of the histogram, causes poor performance of the ACM multipliers. As the switching activity impacts the most significant part of the design in VOS, PSNR values are affected.

V. CONCLUSION

In this brief, to propose efficient approximate multipliers, the partial products of the multiplier are modified using generate and propagate signals. Approximation is applied using a simple OR gate for the altered generate partial products. An approximate half-adder, full-adder, and 4-2 compressor are proposed to reduce the remaining partial products. Two variants of approximate multipliers are proposed, where approximation is applied in all n bits in Multiplier1 and only in the n − 1 least significant part in Multiplier2. Multiplier1 and Multiplier2 achieve a significant reduction in area and power consumption compared with exact designs. With APP savings of 87% and 58% for Multiplier1 and Multiplier2 with respect to exact multipliers, they also outperform existing approximate designs in APP. They are also found to have better precision than existing approximate multiplier designs. The proposed multiplier designs can be used in applications with minimal loss in output quality while saving significant power and area.

REFERENCES

[1] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[2] E. J. King and E. E. Swartzlander, Jr., “Data-dependent truncation scheme for parallel multipliers,” in Proc. 31st Asilomar Conf. Signals, Circuits Syst., Nov. 1998, pp. 1178–1182.
[3] K.-J. Cho, K.-C. Lee, J.-G. Chung, and K. K. Parhi, “Design of low-error fixed-width modified Booth multiplier,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522–531, May 2004.
[4] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[5] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and analysis of approximate compressors for multiplication,” IEEE Trans. Comput., vol. 64, no. 4, pp. 984–994, Apr. 2015.
[6] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, “Energy-efficient approximate multiplication for digital signal processing and classification applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 6, pp. 1180–1184, Jun. 2015.
[7] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, “Design-efficient approximate multiplication circuits through partial product perforation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Oct. 2016.
[8] P. Kulkarni, P. Gupta, and M. D. Ercegovac, “Trading accuracy for power in a multiplier architecture,” J. Low Power Electron., vol. 7, no. 4, pp. 490–501, 2011.
[9] C.-H. Lin and C. Lin, “High accuracy approximate multiplier with error correction,” in Proc. IEEE 31st Int. Conf. Comput. Design, Sep. 2013, pp. 33–38.
[10] C. Liu, J. Han, and F. Lombardi, “A low-power, high-performance approximate multiplier with configurable partial error recovery,” in Proc. Conf. Exhibit. (DATE), 2014, pp. 1–4.
[11] R. Venkatesan, A. Agarwal, K. Roy, and A. Raghunathan, “MACACO: Modeling and analysis of circuits for approximate computing,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Oct. 2011, pp. 667–673.
[12] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of approximate and probabilistic adders,” IEEE Trans. Comput., vol. 63, no. 9, pp. 1760–1771, Sep. 2013.
[13] S. Suman et al., “Image enhancement using geometric mean filter and gamma correction for WCE images,” in Proc. 21st Int. Conf., Neural Inf. Process. (ICONIP), 2014, pp. 276–283.