A Theoretical Framework For Quality Estimation and Optimization of DSP Applications Using Low-Power Approximate Adders

This document presents a framework for estimating the quality of digital signal processing blocks that use approximate adders. The framework models the error of approximate adders as additive noise, called approximation noise. It develops a signal processing approach to describe the power of this noise. The framework then estimates the output quality of DSP blocks like filters and transforms based on this noise model. It also presents an optimization method for choosing adder configurations to optimize quality and efficiency.


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS, VOL. 66, NO. 1, JANUARY 2019

A Theoretical Framework for Quality Estimation and Optimization of DSP Applications Using Low-Power Approximate Adders
Masoud Pashaeifar, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram

Abstract— In this paper, we present a framework for analytically estimating the output quality of common digital signal processing (DSP) blocks that utilize approximate adders. The framework is based on considering the error of approximate adders as an additive noise (approximation noise) that disturbs the output of the DSP block in question. A signal processing theoretical modeling approach for describing the power of the approximation noise, which is the integral of error spectral density over the bandwidth, is developed. The output qualities of DSP blocks, such as finite impulse response filter, discrete cosine transform, and fast Fourier transform, which utilize approximate adders, are thus estimated. The accuracy of the proposed framework is evaluated by comparing mathematical model predictions to simulation results by using the signal-to-noise ratio (SNR) metric. The inaccuracy of the SNRs predicted by the framework was, on average, less than 2.5 dB compared with that obtained from simulations. Therefore, a mathematical optimization approach based on Lagrange Multipliers for optimizing design parameters is also presented. The optimization is realized by choosing a proper configuration of the target block, such as determining the data width of the inexact computation part for each approximate adder in the design.

Index Terms— Approximation noise, analytical quality estimation, approximate computing, optimization, low power approximate adders, digital signal processing.

Manuscript received February 25, 2018; revised June 1, 2018 and July 5, 2018; accepted July 10, 2018. Date of publication July 27, 2018; date of current version December 6, 2018. This paper was recommended by Associate Editor M. Mozaffari Kermani. (Corresponding author: Ali Afzali-Kusha.)
M. Pashaeifar and M. Kamal are with the School of Electrical and Computer Engineering, University of Tehran, Tehran 14399-57131, Iran (e-mail: [email protected]; [email protected]).
A. Afzali-Kusha is with the School of Electrical and Computer Engineering, University of Tehran, Tehran 14399-57131, Iran, and also with the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran 19538-33511, Iran (e-mail: [email protected]).
M. Pedram is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2562 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2018.2856757
1549-8328 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: COLLEGE OF ENGINEERING - Pune. Downloaded on December 10,2023 at 17:47:50 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

MODERN digital systems may require a high volume of computations while having some critical energy and speed constraints. In mobile systems, where the energy stored in the battery is normally the source of power for their operation, energy consumption reduction is a critical design goal [1], [2]. The criticality also applies to systems that harvest energy from the environment. In many applications, such as communications, biomedical, multi-media and Internet-of-Things (IoT), the computations consist of digital processing of the signals [3]–[5]. For a large class of applications where a minimum output quality constraint is tolerable, digital signal processing (DSP) blocks may perform the required processing approximately. These blocks, consisting of arithmetic units, may therefore operate using the approximate computing paradigm. In these applications, the minimum output quality may be subject to a compromise between the quality and energy efficiency/speed. In fact, approximate computing has the feature of sacrificing accuracy for energy or speed (performance) [3]. This paradigm may be invoked in both the software and hardware domains of the processing systems.

In the hardware domain, several approximate components like adders [6]–[17] and multipliers [18], [19] have been introduced. Some prominent examples of approximate adders are: ETA-I [6], AMAs [7], TGAs [8], LOA [10], ETA-II [12], LREA [13], GeAr [14], RAP-CLA [15], and QuAd [16]. The approximate components have been evaluated by using them in DSP blocks like Finite Impulse Response (FIR) filters and Discrete Cosine Transform (DCT) [7], and in multimedia applications like image processing [15]. Approximate computing may also be applied at the algorithmic level in DSP applications, while exact components are employed for implementing datapath operations after algorithmic approximations have been applied [3].

The use of approximate units in computation systems including DSP blocks degrades the output quality, which should be determined for an optimum use of these units. More specifically, one should quantify the impact of approximation error on output quality as a key step for using approximate units. An efficient quantification may be achieved using an analytical model for the units such as approximate adders. In conventional approaches, statistical characterizations of approximate component error have been obtained by conducting exhaustive simulations [8], [9], [15]. In an attempt to manage the complexity, some designers relied on Monte-Carlo (MC) simulations to determine the quality [12]. The limitations of exhaustive and MC simulation include the need for simulating each type of adder (or multiplier) along with its configuration, requiring large runtimes for estimating the output quality of DSP applications, not providing any insights about the mechanisms and causes of the error, and not giving any insight into the effect of the approximation error on the signal characteristic. To overcome these issues, recently, a fully analytical modeling approach for some error
metrics like error probability and mean-error-distance was presented [20], [21]. Although the first issue was addressed in these two works, due to using convolutional calculations, the output quality estimation required large computation times. Using the MC simulation to discover statistical error characteristics of the adders/multipliers, semi-analytical approaches have been presented to address the large runtimes for output quality estimation [22]–[28]. Finally, optimizing the speed and energy characteristics for a given maximum output quality loss is a key objective in designing digital systems based on approximate computing. Since the optimization framework would heavily depend on the quality estimation method, an efficient modeling technique is highly desired to prevent large runtime [26], [27].

To analytically quantify the output quality degradation in DSP blocks as a function of the invoked approximations, the computation error might be described using a signal obtained from a mathematical model. In this paper, we present a framework for using this type of approach for analytically estimating the output error of the DSP blocks. The focus here is on approximate adders. The modeling is then used for the optimization of approximate adders in DSP blocks. The approach is based on modeling the approximation error as an additive noise called approximation noise. For this purpose, first, mathematical models of the output errors of Low Power Approximate adder (LPA adder) structures [6]–[11], as a function of the width of the inexact part of the computation, are obtained. Next, we present a framework for analytical quality estimation of the approximate adders. To assess the efficacy of the proposed framework, the impact of the approximation noise is studied in three DSP blocks: FIR filters, DCT, and Fast Fourier Transform (FFT). Having analytically calculated the output noise of the DSP blocks in terms of the approximation noise, we define an output constraint function to be used for optimizing the design parameters such as energy consumption, delay, and energy-delay-product (EDP). In this work, by employing the Lagrange multipliers optimization method, the hardware implementations of the approximate units for the DSP applications are optimized (considering the given expected output quality). The proposed optimization framework flow is illustrated in Fig. 1. More specifically, the contributions of this work are itemized below:
• Proposing a framework whose steps, including error modeling, quality estimation, and optimization, are performed fully analytically.
• Providing insight on the effect of the error on the output signal by characterizing the approximate error as an undesirable additive signal (approximation noise).
• Accurately estimating the output quality of different DSP blocks in terms of the operating domain (time- vs. frequency-domain), output type (single- vs. multi-output), and complexity (number of addition operations).
• Obtaining the optimum configuration of the DSP blocks to reach minimum energy, delay, and EDP in a fully analytical fashion.

Fig. 1. The proposed optimization framework flow.

The rest of this paper is organized as follows. In Section II, approximate adders and some related works for calculating the impact of the approximation error on the output quality are reviewed. Section III contains the mathematical analysis and modeling of the approximation noise for approximate adders. The output qualities of DSP blocks are calculated based on the approximation, quantization and input noises in Section IV. Section V evaluates the accuracy of the proposed approximation noise model and output error estimation. The optimization flow based on the analytical quality estimation is discussed in Section VI, while the paper is concluded in Section VII.

II. REVIEW OF APPROXIMATE ADDERS AND RELATED WORK

In this section, first, some of the recent approximate adders are reviewed. Based on the structures of the approximate adders, we categorize them into two classes: high-performance (called segmented in this paper) and low-power, the latter using simplified full adder (FA) structures based on the ripple-carry adder (RCA), which has the lowest power. In the segmented approximate adders, the critical path of the structure is truncated to improve the speed. Then, some prior works dealing with the analysis and estimation of the impact of the approximation error on the output quality are addressed.

Fig. 2. The structure of n-bit ETA I [6].

A. Low-Power Approximate Adders

LPA adders (targeting low-power applications) are a kind of approximate adder that consists of two parts, exact and inexact. The most significant bits (MSBs) of the input operands are added by the former while the least significant bits (LSBs) are added by the latter. In [6], an Error-Tolerant Adder type I with RCA structure was presented. The structure of ETA I is shown in Fig. 2. In the exact part of the structure, conventional FAs with the carry input of zero were used. The inexact part included a carry-free addition part (XORs) and a control block. The control block checked the bits of the input operands from the joining point of the exact and inexact parts to the LSB.


At the first point where both of the inputs were '1', the control block set the result bits to '1' from this point to the LSB. In [7], five Approximate Mirror Adder (AMA) structures having a smaller number of transistors compared to that of the conventional adder have been proposed. These designs were based on simplifying the internal structure (removing the transistors) of the mirror adder, leading to smaller area and power consumption as well as higher speed. In the same way, in [8], by simplifying the internal structure of the transmission gate-based conventional FA, two types of Transmission Gate-based Approximate (TGA) FAs were proposed. The truth tables of AMA-I to AMA-V and TGA type I and type II are given in Table I. In addition, approximate FAs called inexact adders (InXAs), implemented by employing pass-transistors, have been presented in [9].

TABLE I
TRUTH TABLE FOR CONVENTIONAL (EXACT) FA, AMAs, AND TGAs [7], [8]

B. Segmented Adders

Segmented adders constitute an important group of approximate adders. The general idea of segmented adders is the truncation of the carry propagation chain by segmenting a full-size adder. The reduction in the carry propagation path reduces the delay and increases the performance accordingly. Moreover, the reduction in the critical path delay can result in a reduction in the energy consumption thanks to simplified architectures. Segmented adders may be categorized into two types: block-based [12], [13] and overlapping structure [14], [15]. The block-based adders consist of non-overlapping sub-adders (smaller bit length adders), where each of them generates the carry for its next sub-adder and does summation using the output carry of its previous sub-adder. In the case of the overlapping structure, the adder consists of overlapping sub-adders of width w, with P bits of overlap (P = w − R) between them. Except for the first one, which generates the w-bit LSB part, each sub-adder generates R bits of the summation (see, e.g., [14]).

C. Quality Estimation and Modeling

To estimate the error, conventionally, appropriate statistical metrics have been employed to determine the influence of the error on the output quality. Metrics such as the mean error ($\mu$), mean error distance (MED), mean relative error distance (MRED), max error distance (MAX-ED), error rate (ER), signal-to-noise ratio (SNR), variance ($\sigma^2$), mean squared error (MSE), and root mean square error (RMSE) have been used for quantifying the error [24], [29]. One of the first works on the error analysis was presented in [22], where the proposed approach employed the interval arithmetic and affine arithmetic for estimating the output quality of an approximate computing application. In the proposed approach, first, the distribution of the approximation error (probability mass function (PMF)) was extracted from Monte-Carlo (MC) simulations and then, the extracted PMF was represented by a modified interval arithmetic (MIA) and a modified affine arithmetic (MAA). Propagating the MIA (PMF) through the add operation needed the convolution operation, requiring high computational effort. This computational effort may be reduced by the approach of [23], while the PMF of the error was still extracted using MC simulations. Generally, the prior works tried to address two main challenges: modeling the error and extracting the statistical metrics of the error analytically without any need for MC simulations, and reducing or eliminating the complexity of propagating the error for estimating the output quality. Additionally, optimizing the design to achieve the maximum energy-saving/performance has been a secondary objective in these works.

1) Analytical Error Modeling: Using the inexact part width and ER, a model was presented in [30] which obtained the bounds of the quality loss. More specifically, the proposed model has been used to obtain upper and lower bounds of PSNR in image processing applications. This model could be employed for the class of LPA adders in image processing applications. In [29], the errors of segmented adders were studied where first, the MED and ER of the adders were derived analytically. Next, the peak signal-to-noise ratio (PSNR), as a quality parameter for image processing, was studied based on the MED. The analytically estimated PSNR was compared to the PSNR extracted from the simulation of an approximate adder model. The results of the comparison showed, on average, a difference of 3.7 dB. FEMTO [31] was a framework for analytically determining the PMF of multiplier errors. The PMF of the FAs was extracted from the truth table of the FA, and the PMF of the multiplier was determined utilizing convolution operations. To relax the complexity, the Z-transform was employed. Fully analytical error modeling approaches for high-performance and low-power adders were proposed in [20] and [21], respectively. In these works, the probability of the occurrence of an error in the considered adders was accurately obtained. In [20], the probability of occurrence of an error was used for determining the error PMF of the segmented adders. In addition, the output qualities of some applications were captured by propagating the error PMF through the corresponding data flow graphs (DFGs). Propagating the PMF needed the convolution operation, requiring high computational effort.

2) Simplifying the Quality Estimation and Optimization: An approach based on look-up tables (LUTs) for the output quality estimation of approximate designs was presented in [24]. The statistical properties of the approximate components were characterized in LUTs. In addition, for the error propagation, the output error of the components was characterized by regression-based techniques, where the provided regression coefficients for each metric were stored in LUTs. The ER, MRED, MAX-ED, SNR, and MSE were the metrics considered in the FIR and multiply-accumulator (MAC) designs. A semi-analytical approach for statistical quality-energy optimization of DFGs was proposed in [25].


First, the dependency of the error and the inputs was determined by a one-time error-free simulation. Then, a simple representation of this dependency was extracted and employed for obtaining the error PMF. Finally, the combination of the quality estimation and energy models was utilized to optimize the approximate DFGs. In this case, the optimization problem was a non-linear non-convex integer problem, which was solved by meta-heuristic methods. Employing the optimization, FFT and inverse DCT (iDCT) were designed obtaining 28× (2×) faster runtime compared to that of a full simulation-based optimization (of the approach of [23]).

In [26], first, the error variances of some LPA adders are determined through MC simulations. Next, by using variance as the error metric, the output quality of the considered application is estimated by a depth-first search on the directed acyclic graphs (DAGs) of the application. Finally, integer linear programming (ILP) is exploited for determining the optimum energy of the application under the predefined quality constraint. The same flow has been employed for optimizing the JPEG compression in [28]. Also, in [27], which was an extension of [26], a heuristic technique based on a mathematical solution was proposed to solve the area/power optimization problem. The error estimation approach in [26] and [27] did not properly support multi-output components.

3) Summary and Conclusion: The recent related works have focused on analytical models to assess the statistical error metrics or on reducing the high computational effort of output quality estimation in DSP applications. Optimization has been of secondary importance and has been addressed by solving linear or non-linear optimization problems. For an efficient approximate design framework (for DSP applications), a fully analytical yet accurate error model, a simple quality estimator, and a mathematical optimizer may be considered as critical requirements. Toward this end, in this work, we concentrate on presenting an analytical model, based on the noise signal, for describing the approximate error and its effect on the output of the approximate adders in the DSP blocks. The knowledge obtained from characterizing the error as a signal clarifies the impact of the approximation error on the outputs when processing digital signals. In addition, we present a compact error model which may be employed to generate the error of the approximate adders in simulations without employing the approximate adder model. The mathematical quality estimation framework provides us with the ability to use an analytical optimization method for minimizing the energy and maximizing the performance of hardware implementations of the DSP blocks for a given application subject to the desired output quality. Hence, in this work, a mathematical optimization based on the Lagrange Multipliers method for characterizing the design parameters is suggested. Table II summarizes the differences between the features of the proposed framework and those of the prior works.

TABLE II
THE SUMMARY OF THE DIFFERENCES BETWEEN FEATURES OF THE PROPOSED FRAMEWORK AND THE PRIOR WORKS

III. ANALYSIS AND MODELING OF APPROXIMATION NOISE

As discussed in Section II, the approximate adders were categorized in two groups. Generally, the segmented adders and LPA adders have different characteristics in terms of two important metrics: error rate (ER) and error distance (ED). The ER of the LPA adders is significantly high, while their ED is limited by their inexact part width. On the other hand, the ER of the segmented adders is low, while their ED is high since the error can occur in any bit position, including the most significant bits. Therefore, modeling of the error as an additive signal should be performed differently for these two types of adders. Thus, this work focuses on error modeling in the case of the LPA adders.

A. Theory of Approximation Noise

The magnitude of the error of an approximate add operation is defined as the difference between the exact and approximate summation results. This leads us to consider the error as an additional signal added to the exact summation result. Therefore, one may express the approximate add operation as

$$x[n] + y[n] = s[n] + e[n], \tag{1}$$

where n is an integer, x[n] and y[n] are the nth samples of the input signals, s[n] is the exact summation result, and e[n] is the approximation error magnitude.

Based on the quantization theory and the experimental study, we modeled the approximation error as white noise, which is called approximation noise. To model the approximation error as white noise, two criteria should be satisfied: 1) the error is independent of the input signals, and 2) the error is a random signal with constant power spectral density.

1) The Quantization Theory: As mentioned in Section II.A, an n-bit LPA adder consists of a k-bit inexact adder for the LSBs and an (n−k)-bit exact adder for the MSBs. This means that the error occurs in the summation result and output carry of the inexact part, where the error depends on the bit patterns of the k least significant bits of the input operands. Based on the quantization error, if $2^n$ is significantly larger than $2^k$, and the probability density distributions of the input operands are smooth (which is valid in the case of DSP applications), the inexact part inputs will become independent from the input operands, having uniform probability density distributions [32]. Therefore, even though the error and the inexact part inputs are dependent, the independence of the inexact part inputs with respect to the input signals (based on the quantization theory) makes the error independent of the input signals.
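The quantization-theory argument above can be checked numerically. The following is a minimal sketch, not part of the original framework: it assumes Gaussian-shaped 16-bit operands (an illustrative choice of a "smooth" distribution), and the function names `pearson` and `lsb_statistics` are hypothetical. It measures how far the k-LSB field is from a uniform distribution and how strongly it correlates with the MSB field.

```python
import random
from collections import Counter


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def lsb_statistics(n_bits=16, k=4, samples=100_000, seed=0):
    """Draw smooth n-bit operands and inspect the k LSBs fed to the inexact part.

    Assumption (not from the paper): operands follow a Gaussian centred in the
    unsigned range whose spread is much larger than 2**k.
    """
    rng = random.Random(seed)
    mid = 1 << (n_bits - 1)
    sigma = (1 << n_bits) / 8
    mask = (1 << k) - 1
    lsbs, msbs = [], []
    while len(lsbs) < samples:
        x = int(round(rng.gauss(mid, sigma)))
        if 0 <= x < (1 << n_bits):      # keep only representable values
            lsbs.append(x & mask)       # bits seen by the inexact part
            msbs.append(x >> k)         # bits seen by the exact part
    counts = Counter(lsbs)
    freqs = [counts[v] / samples for v in range(1 << k)]
    return freqs, pearson(lsbs, msbs)


if __name__ == "__main__":
    freqs, corr = lsb_statistics()
    print("max deviation from uniform:", max(abs(f - 1 / 16) for f in freqs))
    print("correlation between LSB and MSB fields:", corr)
```

Under these assumptions the relative frequencies of the 16 possible LSB patterns should stay close to 1/16 and the correlation close to zero, which is the behaviour the argument relies on. Shrinking `sigma` towards $2^k$ violates the "smooth distribution" condition and should make both figures drift.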


2) The Experimental Proof: If the error is independent from the inputs, the error may not be characterized as a function of the input signals ($e[n] \neq f(x[n], y[n])$). It means that adding two signals with specific frequency characteristics should not show any dependency on the frequency characteristic of the error. To validate our conclusion from the quantization theory, two discrete sinusoidal signals with different frequencies were summed using the approximate adders. The exact and approximate summation results of the two input signals versus frequency (0 to π) are plotted in Fig. 3. The approximate summation was performed by employing AMA-I (approximate mirror adder of reference [7]). As shown in Fig. 3, similar to the quantization noise, the total error spectrum reveals a white noise characteristic, not showing any frequency dependency on those of the input signals. Therefore, the error signal of approximate adders can be considered as a noise signal, which is called approximation noise.

Fig. 3. The output spectrum of the (a) exact and (b) approximate (by employing AMA-I) summation of two sine signals.

To determine the impact of the approximation noise on the signal, the noise power (i.e., the integral of noise spectral density over the bandwidth) should be obtained. The mean square error (MSE) is typically used as the parameter characterizing the noise power in the signal processing area. MSE is also important when calculating the best performance that may be achieved when estimating a parameter of a signal that is corrupted by noise. In general, the mean square error may be expressed as [34]

$$MSE = E\{e^2[n]\} = \frac{1}{N}\sum_{n=0}^{N-1} e^2[n]. \tag{2}$$

To study the impact of approximation noise, one should consider the spectrum of the noise, which is determined by the autocorrelation function, defined as the correlation of a signal with its delayed version. For white noise, which is an uncorrelated signal, the autocorrelation function (denoted by R) is defined as [34]

$$R[m] = E\{e^2[n]\}\,\delta[m], \quad \text{where} \quad \delta[m] = \begin{cases} 1 & m = 0 \\ 0 & m \neq 0. \end{cases} \tag{3}$$

Evidently, the MSE and autocorrelation function are equal when m = 0. The power spectral density (PSD) of e[n] (denoted by S) is the discrete Fourier transform (DFT) of the autocorrelation function, obtained from

$$S(\omega) = \sum_{m=-\infty}^{+\infty} R[m]\, e^{-jm\omega}. \tag{4}$$

In the following subsections, we study the approximation noise based on the above discussion.

B. Approximation Noise of LPA Adders

Using Theorem 1, we analytically define the MSE of the error for this type of adder. The theorem may be employed to present a compact model for evaluating the impact of the approximation error on the output quality of the system. To simplify the problem, it is assumed that the error probabilities of the inexact FAs in an LPA adder are independent of each other. The assumption implies ignoring the impact of the occurrence of the error in the $i$th position on the error probability of the $(i+1)$st bit position through the carry propagation. In addition, the impact of simultaneous occurrences of error at different bit positions is also neglected.

To state and prove the theorem, let us denote the error probability of each FA by $P_e$, which may be determined by employing Table I. In Table I, the probabilities of the operands ($P(A_i = 1)$ and $P(B_i = 1)$) are 1/2, while the probability of the carry ($P(C_i = 1)$) should be determined based on the input and output values (a similar approach is employed in [33]). For example, in the case of AMA-I, two errors are present in the 3rd and 5th rows. By considering the fact that the probability of carry is almost 2/3 (which is obtained in [7]), the error probability is obtained as:

$$P_e = P(A_i, B_i, C_i = \{010, 100\}) = \frac{1}{6}. \tag{5}$$

Theorem 1: If the error probability of each FA is $P_e$ and the width of the inexact part is k, then the MSE of the approximation noise is determined from

$$MSE = P_e \times \frac{2^{2k}}{3}. \tag{6}$$

Proof: Based on the simplifying assumptions, the MSE is obtained by

$$MSE = \sum_{i=0}^{k-1} ED_i^2 \times P(ED_i), \tag{7}$$

where in the case of LPA adders, $ED_i$ and $P(ED_i)$ are $2^i$ and $P_e$, respectively. Using $\sum_{i=0}^{n-1} x^i = \frac{1-x^n}{1-x}$, one can easily show the validity of (6). ∎

As mentioned before, the approximation in LPA adders is applied in the internal structure of the FAs. Hence, in LPA adders, the error magnitude depends on the weights of the bit positions of the employed approximate FAs. Moreover, the ER of LPA adders is high [21], which we take to be 1 here. Therefore, in addition to the analytical model, we suggest a compact model for the error of LPA adders based on the following definition:

Definition 1: The approximate adder may be modeled by an exact adder with injected noise, which we call the compact model. The injected noise may be modeled as white noise, which exists at all times (ER is one) and whose amplitude and frequency are random. Its probability distribution may be modeled as an exponential distribution, whose MSE is determined from Theorem 1.

Fig. 4 shows the estimated (using Theorem 1) and simulated RMSE for LPA adders. Note that instead of MSE, we report


RMSE to scale down the MSE values, which leads to a simpler comparison between reported values. As shown, the simple models presented in Theorem 1 estimate the RMSE with very high accuracy. This result shows that the simplifying assumptions do not lead to a considerable inaccuracy in the proposed estimation modeling.

Fig. 4. The estimated and simulated RMSE for LPA adders.

IV. MATHEMATICAL QUALITY MODELING OF DSP BLOCKS

In this section, the output qualities of some DSP blocks are modeled by employing the proposed approach. The FIR filter, DCT, and FFT are the well-known DSP blocks whose output qualities will be modeled. The considered blocks are different in terms of the operating domain (time vs. frequency domain), output type (single- vs. multi-output), and complexity characteristics. The output qualities of these blocks are defined based on the SNR parameter defined by

$$SNR = 10 \log_{10} \frac{A_s^2}{\hat{n}^2}, \tag{8}$$

where $A_s^2$ and $\hat{n}^2$ are the powers of the signal and noise at the output, respectively.

In this section, the input noise that accompanies the input signal, the quantization noise generated after scaling or multiplying in DSP blocks, and the approximation noise which is added to the signal when approximate components are employed are considered as the noises that impact the output quality. Therefore, the quality of the DSP block may be obtained from

$$SNR = 10 \log_{10} \frac{A_s^2}{\hat{n}_i^2 + \hat{n}_q^2 + \hat{n}_A^2}, \tag{9}$$

where $\hat{n}_i^2$, $\hat{n}_q^2$, and $\hat{n}_A^2$ are the powers of the input, quantization, and approximation noises appearing at the output, respectively. In the following subsections, based on this model, the output qualities of some benchmarks are calculated using some theorems. Theorems 2 and 3 express the output noises of the conventional and linear-phase structures of the FIR filter, respectively. The output noises of the simple and improved butterfly structures of the DCT are obtained by Theorems 4 and 5. Similarly, the output noises of the decimation in time (DIT) and decimation in frequency (DIF) butterfly structures of the FFT are expressed by Theorems 6 and 7.

A. FIR Filter

The FIR filter is a DSP block that is used in several signal processing systems. To design an FIR filter, first, poles and zeros are determined and then the corresponding coefficients are calculated [35]. The quantization of the coefficients changes the places of poles and zeros, which leads to design modification. If only the quantization noise exists, the desired signal to noise ratio (SNR) is determined through the determination of the bit length of the data-path. Also, to avoid underflow/overflow in adders and multipliers, scaling should be applied to intermediate results [35]. Scaling after each multiplication adds quantization noise to the signal.

To implement an FIR filter several structures may be used. In this work, two well-known FIR filter structures are considered. These structures, which are called conventional and linear-phase structures, are depicted in Fig. 5. The numbers of adders are the same in both structures, while the number of multipliers in the linear-phase structure is half of that of the conventional structure. To calculate the noise power at the output, the input noise, the quantization noise added by the multipliers, and the approximation noise of the adders were considered. In this work, our focus was only on the approximate adders, and hence, the approximation noise of the multipliers was not studied. When a model for the multiplier approximation noise is available, one may add this noise to the quantization noise of the multiplier shown at the output of the multipliers in Fig. 5. Note that the MSEs of the input noise and the quantization noise are assumed to be the same, independent of the number of stages or coefficients, in all DSP blocks. The output noise of the conventional structure is obtained from Theorem 2.

Fig. 5. a) The structure of the conventional, and b) linear-phase FIR filter [35].

Theorem 2: If $n_i^2$ is the MSE of the input noise, $n_q^2$ is the MSE of the quantization noise, $n_A^2$ is the MSE of the approximation noise, and $h[i]$ is the $i$th coefficient of an $m$-tap FIR filter, the output noise power of the conventional structure ($\hat{n}_o^2$) will be determined from

$$\hat{n}_o^2 = n_i^2 \times \sum_{i=0}^{m} h[i]^2 + n_q^2 \times (m + 1) + n_A^2 \times m. \tag{10}$$

Proof: When two signals with independent random nature (uncorrelated) are added with each other, the power of the resulting signal becomes the sum of the powers of the two signals. Therefore, if $n^2$ is the power of each input signal of the adder, the power of the result will be $2n^2$. Also, when a random signal is multiplied by a coefficient (e.g., $h[i]$), the resulting output signal power will be $n^2 \times h[i]^2$. Based on the above discussion, to calculate the output noise power, because of the existence of $m + 1$ multipliers and $m$ adders, the noise powers of the multipliers and adders at the output will be $n_q^2 \times (m + 1)$ and $n_A^2 \times m$, respectively. In the case of the input noise, there are some multipliers in the propagation path, making the output noise power due to the input noise $n_i^2 \times \sum_{i=0}^{m} h[i]^2$. ∎
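The noise budget of Theorem 2 can be evaluated numerically. The following is a minimal sketch, assuming the filter is given as a list of its $m + 1$ coefficients; the function names and the numeric values are hypothetical and serve only to exercise (8) and (10).

```python
import math

def fir_conventional_output_noise(h, n_i2, n_q2, n_a2):
    """Output noise power of the conventional FIR structure, following (10).

    h holds the m + 1 coefficients h[0..m]; n_i2, n_q2 and n_a2 are the MSEs
    of the input, quantization and approximation noises.
    """
    m = len(h) - 1                               # number of adders
    input_term = n_i2 * sum(c * c for c in h)    # input noise scaled by the taps
    quant_term = n_q2 * (m + 1)                  # one source per multiplier
    approx_term = n_a2 * m                       # one source per adder
    return input_term + quant_term + approx_term

def snr_db(signal_power, noise_power):
    """SNR in dB as defined in (8)."""
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical values, not taken from the paper.
h = [0.25, 0.5, 0.25]
noise = fir_conventional_output_noise(h, n_i2=1e-6, n_q2=1e-8, n_a2=1e-7)
print(noise, snr_db(1.0, noise))
```

With these made-up inputs the three contributions are summed as uncorrelated powers, which is the assumption the proof relies on.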
Similar to Theorem 2, Theorem 3 is provided for the linear-phase structure.

Theorem 3: If $n_i^2$ is the MSE of the input noise, $n_q^2$ is the MSE of the quantization noise, $n_{A1}^2$ is the MSE of the approximation noise of the first stage adders, $n_{A2}^2$ is the MSE of the approximation noise of the second stage adders, and $h[i]$ is the $i$th coefficient of an $m$-tap FIR filter, then the power of the output noise in the case of the linear-phase structure will be obtained from

$$\hat{n}_o^2 = 2n_i^2 \times \sum_{i=0}^{m/2} h[i]^2 + n_q^2 \times \frac{m}{2} + n_{A1}^2 \times \sum_{i=0}^{m/2} h[i]^2 + n_{A2}^2 \times \frac{m}{2}. \tag{11}$$

It is important to notice that the spectrum of white noise may be changed under a filtering process [36]. The coefficient of $n_i^2$ (i.e., $\sum_{i=0}^{m} h[i]^2$) in (10), and the coefficient of $n_i^2$ and $n_{A1}^2$ (i.e., $\sum_{i=0}^{m/2} h[i]^2$) in (11), show that the filtering function is applied to these noises. It means that the part of the flat spectrum of the white noises ($n_i^2$ and $n_{A1}^2$) which is in the rejection band of the filter will be rejected. Thus, the spectrum of the noise is shaped by the filter, and hence, it cannot be considered as white noise.

B. Discrete Cosine Transform

Discrete Cosine Transform (DCT) is a DSP block commonly used in multi-media systems. DCT has an intrinsic compression of signal power. This characteristic makes DCT suitable for the signal compression process. A generic structure of a DCT is shown in Fig. 6(a). The butterfly, which is used commonly in the DCT implementation, can be implemented in two ways: simple (four multipliers) and improved (three multipliers) structures (see Fig. 6). To calculate the output SNR, the noise of each output should be determined. Unlike FIR filters, since DCT is a transform function from the time to the frequency domain and is not a complete transform function such as FFT, PSD may not be employed for the output quality estimation. Therefore, inverse DCT (iDCT) should be employed to transform the calculated noise power at the output of the DCT to the time domain for calculating the SNR. The power gain of iDCT is one (output power to input power ratio is one), meaning that the calculated noise power at the output of the DCT is usable for calculating SNR. Thus, we determine the output noise of DCT in the time domain without any need to apply iDCT. For this modeling, in the first step, the output noise of the butterfly structure should be determined. Theorems 4 and 5 characterize the output noise powers of the two implementations.

Fig. 6. a) The generic structure of a DCT, b) the simple structure of butterfly with four multipliers and c) the improved structure of butterfly with three multipliers [35].

Theorem 4: If $n_i^2$, $n_q^2$, and $n_A^2$ are powers of the input, quantization, and approximation noises, respectively, and the tuples $(c_i, c_j) \in \{(c_1, c_7), (c_7, c_1), \ldots, (c_5, c_3)\}$ contain the coefficients of the butterfly of the DCT shown in Fig. 6(b), the output noise of the butterfly will be obtained from

$$n_{o1}^2 = n_{o2}^2 = \frac{1}{4} n_i^2 + 2n_q^2 + n_A^2. \tag{12}$$

Proof: To calculate the output noise of the butterfly, in the first step, the input and added noises should be propagated to the output. Thus, the output noise power may be written as

$$n_{o1}^2 = n_{o2}^2 = c_i^2 n_i^2 + c_j^2 n_i^2 + 2n_q^2 + n_A^2. \tag{13}$$

By employing the trigonometric equation $\cos^2(\theta) + \cos^2\left(\frac{\pi}{2} - \theta\right) = 1$, and the fact that $c_i = \frac{\cos(i\pi/16)}{2}$, the proof of the theorem is reached by simplifying (13).

Theorem 5: If $n_i^2$, $n_q^2$, and $n_A^2$ are powers of the input, quantization, and approximation noises, respectively, and the tuples $(c_i, c_j) \in \{(c_1, c_7), (c_7, c_1), \ldots, (c_5, c_3)\}$ contain the coefficients of the butterfly of the DCT shown in Fig. 6(c), the output noise of the butterfly can be determined from

$$n_{o1}^2 = n_{o2}^2 = \frac{1}{4} n_i^2 + 2n_q^2 + \left(1 + c_j^2\right) n_A^2. \tag{14}$$

Proof: Unlike the previous structure, here, the input noise passes through two paths to reach the output. In this case, in which two correlated signals are added, the power is obtained from the amplitude of the noise, which is determined by adding the amplitudes of the signals [26], while in the case of uncorrelated signals, we only need to add the powers of the signals (see the proof of Theorem 2). Thus, the noise power of the output may be expressed as

$$n_{o1}^2 = n_{o2}^2 = c_i^2 n_i^2 + c_j^2 n_i^2 + 2n_q^2 + \left(1 + c_j^2\right) n_A^2. \tag{15}$$

Simplifying (15), one can prove Theorem 5. ∎
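The simplification from (13) to (12) can be checked numerically. The sketch below assumes the coefficient pairs are those with $i + j = 8$ (so that $c_j$ is the sine counterpart of $c_i$); the pair (3, 5) is an assumption filling in the elided tuples, and the function names are hypothetical.

```python
import math

def dct_coeff(i):
    """c_i = cos(i*pi/16) / 2, as used in the proof of Theorem 4."""
    return math.cos(i * math.pi / 16.0) / 2.0

def simple_butterfly_noise(i, j, n_i2, n_q2, n_a2):
    """Right-hand side of (13) for the coefficient pair (c_i, c_j)."""
    ci, cj = dct_coeff(i), dct_coeff(j)
    return ci**2 * n_i2 + cj**2 * n_i2 + 2.0 * n_q2 + n_a2

def improved_butterfly_noise(i, j, n_i2, n_q2, n_a2):
    """Right-hand side of (15) for the coefficient pair (c_i, c_j)."""
    ci, cj = dct_coeff(i), dct_coeff(j)
    return ci**2 * n_i2 + cj**2 * n_i2 + 2.0 * n_q2 + (1.0 + cj**2) * n_a2

# Assumed pairs: indices that sum to 8, so cos(j*pi/16) = sin(i*pi/16).
pairs = [(1, 7), (7, 1), (3, 5), (5, 3)]
for i, j in pairs:
    input_gain = dct_coeff(i)**2 + dct_coeff(j)**2   # expected to be 1/4
    print(i, j, input_gain)
```

For every such pair the input-noise gain collapses to 1/4, which is why the coefficients disappear from the input term of (12) and (14).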


By using Theorems 4 and 5, the DCT output-referred noise can be calculated for each output (e.g., $X[0], \ldots, X[7]$), which are different from each other. To characterize the output quality of the DCT block, the average of the output noise is considered as the output-referred noise because the output signal is looked at as a time domain signal. This may be justified by noting that to have the signal in the time domain, the output of the DCT should go through an iDCT block. Since the powers of the input and output signals of the iDCT are equal, considering the input of the iDCT (output of the DCT) for calculating the output noise power will result in the same value as that of the output of the iDCT.

The output-referred noise in the two considered structures may be calculated by propagating the noise to the outputs and simplifying the result by using trigonometric equations. The output noise for the simple ($n_{o-S}^2$) and improved ($n_{o-I}^2$) structures can be attained from

$$n_{o-S}^2 = n_i^2 + \frac{11}{4} n_q^2 + \frac{1}{2} n_{A1}^2 + n_{A2}^2 + \frac{1}{2} n_{A3}^2 + \frac{1}{8} n_{A4}^2 + \frac{9}{32} n_{A5}^2, \tag{16}$$

$$n_{o-I}^2 = n_i^2 + \frac{11}{4} n_q^2 + \frac{1}{2} n_{A1}^2 + \frac{9}{8} n_{A2}^2 + \frac{1}{2} n_{A3}^2 + \frac{1}{8} n_{A4}^2 + \frac{11 + 2\sqrt{2}}{32} n_{A5}^2, \tag{17}$$

where $n_{A1}^2$ to $n_{A5}^2$ are the approximation noises of the adders used in stages 1 to 5 of the DCT block shown in Fig. 6(a). If the structures of the adders in all the stages are the same, the above equations can be simplified to

$$n_{o-S}^2 = n_i^2 + \frac{11}{4} n_q^2 + \frac{77}{32} n_A^2, \tag{18}$$

$$n_{o-I}^2 = n_i^2 + \frac{11}{4} n_q^2 + \frac{83 + 2\sqrt{2}}{32} n_A^2. \tag{19}$$

C. Fast Fourier Transform

Fast Fourier Transform (FFT) is a frequently-used DSP block utilized to convert a signal from the time domain to the frequency domain. Due to the compound calculations, the FFT is more complex than the FIR and the DCT. An FFT block with $N$ inputs consists of $S = \log_2 N$ stages, where each stage has $N/2$ butterflies [35]. A generic structure of an eight-point FFT is shown in Fig. 7(a). The FFT can be implemented in two ways: decimation in time (DIT) and decimation in frequency (DIF). The internal structures of the butterflies in these two implementations, shown in Fig. 7(b) and Fig. 7(c), are different [35].

Fig. 7. a) The general structure of an eight-point FFT, b) the structure of DIT butterfly and c) the structure of DIF butterfly [35].

Since the FFT block transforms the signal from the time to the frequency domain, the output shows the PSD of the input signal. This inspired us to consider the PSD of the noise to represent the noise power in the quality analysis instead of the MSE. For this, the noise PSD should be determined and propagated to each FFT. The noise power is the integral of the noise PSD, which is obtained by the summation of the PSD in the FFT output. Like the DCT case, first, the noise of the butterfly blocks should be calculated, and then the output-referred noise of the FFT block is calculated. The coefficients of the FFT are complex numbers, so the multiplier should be designed for complex multiplication. The coefficients of the FFT may be obtained by

$$W_k = e^{-j\frac{2\pi k}{N}}, \quad k \in \left\{0, 1, \ldots, \frac{N}{2} - 1\right\}. \tag{20}$$

This may be realized using four multipliers and two adders like the simple DCT butterfly case of Fig. 6(b). As is shown in Fig. 7(b), the coefficient is multiplied by one of the inputs using a complex multiplier. In contrast to the DIT case, the coefficient in the DIF technique is multiplied at the output (Fig. 7(c)). The noises of the DIT and DIF butterflies are determined by Theorems 6 and 7, respectively.

Theorem 6: If $S_i$, $S_q$, and $S_A$ are PSDs of the input, quantization, and approximation noises, respectively, and $W_k = W_r + jW_i$ is the complex coefficient of the butterfly block shown in Fig. 7(b), the noises of the real and imaginary parts of the butterfly outputs are determined from

$$S_{o1,R} = S_{o1,I} = S_{o2,R} = S_{o2,I} = 2S_i + 2S_q + 2S_A. \tag{21}$$

Proof: The structure of the complex multiplier is implemented similar to the simple butterfly in the DCT block and the coefficient is $W = e^{jx} = \cos x + j\sin x$ ($c_i = \sin x$ and $c_j = \cos x$). By employing trigonometric equations, the noise at the imaginary and real parts of the output of the multiplier is defined by

$$S_{mul,R} = S_{mul,I} = S_i + 2S_q + S_A. \tag{22}$$

Now, by propagating the noise (i.e., (22)) to the output of the butterfly, we can prove Theorem 6.

Theorem 7: If $S_i$, $S_q$, and $S_A$ are PSDs of the input, quantization, and approximation noises, respectively, and $W_k = W_r + jW_i$ is the complex coefficient of the butterfly shown in Fig. 7(c), the real and imaginary part noises of the butterfly outputs are obtained from

$$S_{o1,R} = S_{o1,I} = 2S_i + S_A, \qquad S_{o2,R} = S_{o2,I} = 2S_i + 2S_q + 2S_A. \tag{23}$$

Proof: The proof is similar to that of Theorem 6 and is omitted here for the sake of compactness. ∎

Employing Theorems 6 and 7, the output noises in each stage of the FFT are calculated and propagated to the outputs. As mentioned in Theorem 6, the noises in both DIT butterfly outputs are the same. If all the adders in a stage have the same structure, the noise in each stage and consequently, at the


output of the block is homogeneous. Thus, employing the DIT butterfly, the noise PSD at the output of the FFT ($\hat{S}_{DIT,R}$, $\hat{S}_{DIT,I}$) is obtained from

$$\hat{S}_{DIT,R} = \hat{S}_{DIT,I} = 2^s S_i + \sum_{i=1}^{s} 2^i S_q + 2^s S_{A,1} + 2^{s-1} S_{A,2} + \ldots + 2 S_{A,s}, \tag{24}$$

where $s$ is the number of stages in the FFT block, and $S_{A,1}$ to $S_{A,s}$ are the PSDs of the approximation noise of the adders which are utilized in stages 1 to $s$. If all adders in all stages have the same structure, the added noise in all the stages will be the same and the PSD of the output noise of the FFT block may be simplified to

$$\hat{S}_{DIT,R} = \hat{S}_{DIT,I} = 2^s S_i + \left(2^{s+1} - 2\right)\left(S_q + S_A\right). \tag{25}$$

The output noises of the DIF butterfly in both outputs are not the same, making the noise, in general, different in different stages. The PSD of the output noise for the DIF structure ($\hat{S}_{DIF,R}$, $\hat{S}_{DIF,I}$) is determined by

$$\hat{S}_{DIF,R} = \hat{S}_{DIF,I} = 2^s S_i + \sum_{i=0}^{s-1} 2^i S_q + \frac{3}{2} 2^{s-1} S_{A,1} + \frac{3}{2} 2^{s-2} S_{A,2} + \ldots + \frac{3}{2} S_{A,s}, \tag{26}$$

where $s$ is the number of stages in the FFT block, and $S_{A,1}$ to $S_{A,s}$ are the approximation noises of the adders employed in stages 1 to $s$. If all adders in all stages have the same structure, the added noise in all the stages will be the same and the output noise PSD of the FFT block may be simplified to

$$\hat{S}_{DIF,R} = \hat{S}_{DIF,I} = 2^s S_i + \left(2^s - 1\right)\left(S_q + \frac{3}{2} S_A\right). \tag{27}$$

To obtain the power of the noise in the output from the PSD, which is determined in (24) to (27), the autocorrelation function (see (4)) should be determined by [34]

$$R[m] = \frac{1}{2\pi} \int_{-\pi}^{\pi} S\left(e^{j\omega}\right) e^{jm\omega}\, d\omega. \tag{28}$$

As mentioned in Section III, the power of the noise is the same as the autocorrelation function when $m = 0$ ($\hat{n}_o^2 = R[0]$). Furthermore, the right side of (28) is the integral of a signal spectrum that is continuous. In the case of the FFT, the spectrum of the signal is discrete and the output power can be expressed by

$$R[0] = \hat{n}_o^2 = \sum_{j=0}^{N-1} \left(\hat{S}_R[j] + \hat{S}_I[j]\right) = N\left(\hat{S}_R[j] + \hat{S}_I[j]\right), \tag{29}$$

where $N$ is the number of outputs and $\hat{S}_R[j]$ ($\hat{S}_I[j]$) is the real (imaginary) part of the PSD which is determined in the $j$th output by employing (24) to (27).

V. RESULTS AND DISCUSSION

In this section, the efficacy of the proposed error estimation for the considered DSP blocks is assessed. For this purpose, output qualities were studied for three cases: estimating the noise by employing the analytical approach proposed in Section III for each DSP block (denoted by ANL), simulating the DSP blocks by employing exact adders and injecting noises based on the compact representation model proposed in Section III by Definition 1 (symbolized by MOD), and pure simulation of the DSP blocks when the approximate adders were used without using the approximation noise model (indicated by SIM). These cases were implemented in MATLAB, with the SNR (PSD) as the quality metric. The estimation errors and accuracies of the ANL and MOD cases were obtained by comparing to the SIM case. In all cases and for all implementations, 24-bit approximate adders were employed for the addition operation. Because the studies were based on the SNR, the width of the operations did not impact the final results.

A. Low-Pass FIR Filter

In this work, to evaluate the proposed estimation approach, 10-tap (10 add operations) and 40-tap (40 add operations) low-pass FIR filters were designed with the conventional and linear-phase structures by utilizing 24-bit adders. The SNRs for the outputs of the FIR filters in the three studied cases ($SNR_{ANL}$, $SNR_{MOD}$, and $SNR_{SIM}$) implemented by LPA adders were studied for several $k$ values, where the results for the inaccuracy of the analysis and models are depicted in Fig. 8. On average, the inaccuracies of $SNR_{ANL}$ in the conventional (linear-phase) structure of 10-tap and 40-tap FIRs were 2.4dB (2.4dB) and 2.7dB (2.6dB), respectively. The conventional and linear-phase structures were denoted by conv. and L.P., respectively. The maximum inaccuracy of $SNR_{ANL}$ in the conventional (linear-phase) structure belongs to AMA-III (AMA-V), which was 4.9dB (4.7dB).

B. DCT

The DCT is the second DSP block which was studied for evaluating the accuracy of the proposed approach. The DCT was designed using simple (28 add operations) and improved (33 add operations) structures. To perform this study, 16384 (2048 input set) random numbers were injected into the DCT block as an input signal and the SNRs of the outputs were calculated. The levels of inaccuracy of the estimation and modeling of the SNR for six selected approximate adders are reported in Fig. 8.

C. Fast Fourier Transform

To assess the effectiveness of the proposed estimation and modeling approaches, the FFT block was designed with DIT and DIF structures. The number of input samples was the same as that of the FFT points (NFFT). The FFT block was implemented for different FFT points, inexact part widths, and structures. Adders in all stages had the same structure. For evaluating the accuracy level of the proposed approach, the PSD, whose inaccuracy is the same as the error of the SNR (see (29)), was considered.

Accuracies of the analysis and models compared to the simulations for 256-point (6144 add operations) and 1024-point (30720 add operations) FFT implementations in different structures are demonstrated in Fig. 8. It should be mentioned that in the case of the FFT (FIR) results, the results for AMA-II and AMA-IV for 256-point (10-tap) and AMA-III and AMA-V for 1024-point (40-tap) resemble those for all the eight type adders. On average, the inaccuracy of


$PSD_{ANL}$ for all the adders in the DIT (DIF) structure of 256- and 1024-point FFTs were 3.7dB (3dB) and 3.3dB (3.6dB), respectively.

Fig. 8. The inaccuracies of the quality analysis and modeling for a) 10-tap and 40-tap FIR filters, b) DCT, and c) 256- and 1024-point FFT compared to the case of the simulation approach (SIM).

Fig. 9. The average and maximum inaccuracy of estimated SNR for 10-tap FIR, 40-tap FIR filter, DCT, 256-point FFT, and 1024-point FFT.

As the summary of the discussion, the average and maximum inaccuracies of the analytical SNR estimation and compact models are shown in Fig. 9. In the case of the FIR, the average inaccuracy of the 10-tap filter is approximately equal to that of the 40-tap filter, which means that increasing the complexity (the number of add operations) does not disturb the accuracy of the proposed analytical model. Also, the accuracy of the compact model is higher than that of the analytical model. In the case of the DCT, as the results reveal, $SNR_{ANL}$ is estimated more accurately compared to the FIR case (on average 1.5dB), while the maximum errors of those are almost the same (4.7dB). Additionally, Fig. 9 shows that the accuracies of the proposed analytical and modeling approaches in the FFT case were less than those of the other two studied cases. The inaccuracy of the analysis for the FFT case was, on average, about 3.6dB while the maximum inaccuracy was 6.3dB. Although the average accuracy of the proposed modeling approach was similar to that of the proposed analytical approach, the maximum inaccuracy of the model was 10dB, which is 3.7dB more than that of the analysis. For this


benchmark, in most cases, the inaccuracy of the proposed method was almost equal to or less than 3dB (see Fig. 8).

VI. ANALYSIS-BASED OPTIMIZATION

In the previous sections, we introduced a general yet simple analytical model for describing the power of the approximation noise, which estimated the output quality of DSP blocks with good accuracy. Now, in this section, we discuss the use of the analysis-based approach for optimizing the hardware implementations which use the approximate computing paradigm for reducing the delay and energy consumption. Here, we propose an optimization method based on the proposed analytical quality estimation approach. As was shown in Section IV, without loss of generality, the adders of a DSP block could be categorized into $S$ stages (e.g., see Figs. 5-7) where the adders in a stage have a similar configuration. For the optimization, the SNR is used as the metric for determining the output quality. The relation between the approximation noise and SNR was shown in (9), which can be rewritten as

$$\hat{n}_A^2 = A_s^2\, 10^{-\frac{SNR}{10}} - \hat{n}_q^2 - \hat{n}_i^2, \tag{30}$$

where $\hat{n}_q^2$ ($\hat{n}_i^2$) is the power of the quantization (input) noise appearing at the output of the DSP block and $A_s^2$ is the power of the signal. For example, in the case of the linear-phase FIR (see Theorem 3), $\hat{n}_i^2$ and $\hat{n}_q^2$ are $2n_i^2 \times \sum_{i=0}^{m/2} h[i]^2$ and $n_q^2 \times \frac{m}{2}$, respectively. The approximation noise ($\hat{n}_A^2$) obtained by (30) is a function of the width of the inexact part ($k_j$, where $j$ is the stage number) of the LPA adders, given by

$$\hat{n}_A^2 = \sum_{j=1}^{S} a_j n_{A_j}^2 = \sum_{j=1}^{S} a_j \frac{P_e}{3} 2^{2k_j}, \tag{31}$$

where $a_j$ is the coefficient of the approximation noise of the $j$th stage adders propagated to the output. In the case of the linear-phase FIR (see Theorem 3), $a_1$ and $a_2$ are $\sum_{i=0}^{m/2} h[i]^2$ and $\frac{m}{2}$, respectively.

By using (30), the noise budget (maximum $\hat{n}_A^2$ based on the minimum SNR constraint) that could be employed to approximate the low-energy/high-performance design is determined. Using this amount of noise budget guarantees achieving the desired output quality. Also, (31) determines $\hat{n}_A^2$ based on the inexact widths of the approximate adders utilized in the block. Therefore, the constraint can be expressed as

$$\sum_{j=1}^{S} a_j \frac{P_e}{3} 2^{2k_j} \le \hat{n}_A^2 = A_s^2\, 10^{-\frac{SNR}{10}} - \hat{n}_q^2 - \hat{n}_i^2. \tag{32}$$

This leads us to defining the constraint function given in Definition 2 for our optimization problem.

Definition 2: The constraint function ($G()$) is a function that relates the SNR metric to the inexact part of the LPA adders. Thus, we define the constraint function for the LPA adders as

$$G\left(\hat{n}_A^2, k_1, \ldots, k_S\right) = \hat{n}_A^2 - \sum_{j=1}^{S} a_j \frac{P_e}{3} 2^{2k_j}. \tag{33}$$

To complete our optimization formulation, we should also model the energy and delay as functions of the inexact part width of the approximate adders. If the carry propagation delay of the inexact full-adder is smaller than that of the exact full-adder in the LPA adders, the delay function ($D()$) of the DSP block may be obtained from

$$D(k_1, \ldots, k_S) = nD_{EXT} - \min(k_1, \ldots, k_S)\left(D_{EXT} - D_{APX}\right), \tag{34}$$

where $D_{EXT}$ ($D_{APX}$) is the carry propagation delay of the exact (approximate) full-adder and $n$ is the bit length of the adders. The first term on the RHS represents the delay of the exact adder, the term $(D_{EXT} - D_{APX})$ denotes the delay difference of the exact and approximate FAs, and the clock of the system is set by $\min(k_1, \ldots, k_S)$. To be more specific, as the minimum of $k_j$ (the minimum width of the inexact parts) increases, we can apply a higher clock rate, minimizing the delay of the block.

Now, let us assume that the energy saving obtained at the cost of 1-bit approximation in LPA adders is $E_S$, and the total energy of the exact implementation of the DSP block is $E_{EXT}$. Thus, we can express the energy function ($E()$) for the DSP block which has been implemented by LPA adders as

$$E(k_1, \ldots, k_S) = E_{EXT} - \sum_{j=1}^{S} m_j E_S k_j, \tag{35}$$

where $m_j$ is the number of adders that are employed in the $j$th stage. In addition to the energy and delay, the energy-delay-product (EDP) is another parameter which may be considered as the optimization objective function. Based on (34) and (35), the $EDP()$ function may be written as

$$EDP(k_1, \ldots, k_S) = E(k_1, \ldots, k_S) \times D(k_1, \ldots, k_S). \tag{36}$$

Now, the optimization problem is defined by

$$\min\ \Phi(k_1, \ldots, k_S) \quad \text{subject to} \quad G\left(\hat{n}_A^2, k_1, \ldots, k_S\right) \ge 0, \tag{37}$$

where $\Phi(k_1, \ldots, k_S)$ is the objective function (which can be one of the functions from (34) to (36)).

If minimizing the delay is the objective function, the min operation of (34) (i.e., $\min(k_1, \ldots, k_S)$) should be maximized, which results in minimizing (34). According to the constraint function given in (31), increasing each of $k_j$ may lead to some quality reduction and vice versa. Therefore, the minimum delay may be achieved when the design parameters ($k_j$) of all the stages have the same value.

In the case of optimizing the energy and EDP, the Lagrange multipliers method may be employed [37]. The Lagrange multipliers method can find the local maxima and minima of a function subject to equality constraints. By defining the parameter $\lambda$ as the Lagrange multiplier, the Lagrange function ($L()$) for the optimization problem may be expressed as

$$L(k_1, \ldots, k_S, \lambda) = \Phi(k_1, \ldots, k_S) - \lambda G(k_1, \ldots, k_S). \tag{38}$$

To obtain the optimum point, the stationary point, which is the point where the partial derivatives of the Lagrange function are zero, should be found. The partial derivatives ($\nabla L()$) of the Lagrange function are determined by

$$\nabla_{k_1, \ldots, k_S, \lambda} L(k_1, \ldots, k_S, \lambda) = \left(\frac{\partial L}{\partial k_1}, \ldots, \frac{\partial L}{\partial k_S}, \frac{\partial L}{\partial \lambda}\right). \tag{39}$$
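The cost models and the constraint of this section can be written down directly. The following is a minimal sketch of (33) to (36) together with a brute-force reference search over integer widths; all function names and numeric values are hypothetical and chosen only for illustration.

```python
from itertools import product

def delay(k, n, d_ext, d_apx):
    """Delay model (34): n*D_EXT - min(k)*(D_EXT - D_APX)."""
    return n * d_ext - min(k) * (d_ext - d_apx)

def energy(k, m, e_ext, e_s):
    """Energy model (35): E_EXT - sum_j m_j * E_S * k_j."""
    return e_ext - sum(mj * e_s * kj for mj, kj in zip(m, k))

def edp(k, m, n, d_ext, d_apx, e_ext, e_s):
    """Energy-delay product (36)."""
    return energy(k, m, e_ext, e_s) * delay(k, n, d_ext, d_apx)

def constraint(k, a, p_e, noise_budget):
    """Constraint function (33); the widths are feasible when this is >= 0."""
    used = sum(aj * p_e / 3.0 * 2.0 ** (2 * kj) for aj, kj in zip(a, k))
    return noise_budget - used

def exhaustive_min(objective, a, p_e, noise_budget, k_max):
    """Brute-force reference: best integer widths in 0..k_max satisfying (37)."""
    best = None
    for k in product(range(k_max + 1), repeat=len(a)):
        if constraint(k, a, p_e, noise_budget) < 0:
            continue                      # violates the quality constraint
        value = objective(k)
        if best is None or value < best[0]:
            best = (value, k)
    return best

# Hypothetical two-stage block.
a, m = [1.0, 4.0], [4, 2]
best = exhaustive_min(lambda k: energy(k, m, e_ext=100.0, e_s=1.0),
                      a, p_e=1.0, noise_budget=3000.0, k_max=8)
print(best, delay(best[1], n=24, d_ext=1.0, d_apx=0.5))
```

The exhaustive search grows exponentially with the number of stages, which is the motivation for an analytical solution.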


TABLE III
THE OPTIMUM PARAMETERS AND CONDITIONS FOR DELAY AND ENERGY OPTIMIZATION

As mentioned before, two objective functions can be optimized by the method of the Lagrange multipliers. Since the
energy function (i.e., (35)) is a convex function, the equations
that are achieved by applying Lagrange multipliers method,
can be analytically solved. The expressions for the opti-
mum inexact part width for the minimum energy are given
in Table III (see Appendix A for the derivation).
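Table III is not reproduced in this text, so the closed form below is not copied from it. It is a sketch derived here from the stationarity conditions of (35) under the constraint (32) taken as an equality with real-valued widths, which gives $k_j = \frac{1}{2}\log_2\frac{3\hat{n}_A^2 m_j}{a_j P_e m_t}$ with $m_t = \sum_j m_j$; the paper's own expressions may differ, and all names and numbers are hypothetical.

```python
import math

def energy_optimal_widths(a, m, p_e, noise_budget):
    """Real-valued widths from the assumed stationary point of the energy Lagrangian.

    Each stage receives the share m_j / m_t of the noise budget.
    """
    m_t = sum(m)
    return [0.5 * math.log2(3.0 * noise_budget * mj / (aj * p_e * m_t))
            for aj, mj in zip(a, m)]

def approximation_noise(k, a, p_e):
    """Left-hand side of (32) for the widths k."""
    return sum(aj * p_e / 3.0 * 2.0 ** (2 * kj) for aj, kj in zip(a, k))

# Hypothetical two-stage block.
a, m, p_e, budget = [1.0, 4.0], [4, 2], 1.0, 3000.0
k_real = energy_optimal_widths(a, m, p_e, budget)
k_floor = [math.floor(kj) for kj in k_real]   # never exceeds the noise budget
k_round = [round(kj) for kj in k_real]        # may exceed it in general
print(k_real, k_floor, k_round)
print(approximation_noise(k_floor, a, p_e), approximation_noise(k_round, a, p_e))
```

Flooring can only lower every term of the noise sum, so the floored widths stay within the budget, whereas rounding has to be checked against (32) case by case.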
Due to the min operation of the delay function in the EDP expression (36), the EDP target function is not convex. To optimize the EDP, the parameter $k_m$ is assumed to be the minimum width of the inexact part ($k_m = \min(k_1, \ldots, k_S)$). The optimization approach which was applied to the EDP can be found in Appendix B. By employing (39), (51) is extracted, which should be solved numerically under the considered assumption ($k_m \le k_j$ where $1 \le j \le S$, $j \ne m$). For each $k_m$ ($1 \le m \le S$), these equations should be solved for calculating the corresponding EDP. Now, the optimum EDP is the minimum of the calculated EDPs. This optimization approach for EDP may not have a specific answer (it may not be solvable).
The optimum parameters ($k_{j,opt}$ is the optimum inexact part width for the $j$th stage) and the corresponding constraints are shown in Table III for two optimization targets (delay and energy). Note that because the provided delay formula is simple to extract, the details of determining it are not provided for the sake of space. The equations of Table III have positive real answers whereas the parameters should belong to the whole numbers. Therefore, to ensure that the quality condition is met, the floor function may be applied to the equations in Table III. In this case, however, the results may not be optimum. On the other hand, the rounding function can also be applied to these equations, which may result in a violation of the quality condition.

For the evaluation of the proposed optimization approach, we have applied it to the simple structure of DCT for various SNRs. The energy optimization results by applying the floor and round functions, which are compared to the exhaustive search, are shown in Fig. 10, where the full-scale signal amplitude at the output has been considered (reporting SNRs in dBFS). It should be noted that in the optimization problem, the parameter $k_j$ has an upper bound which can be determined from

$$k_{j,UB} = \left\lfloor \frac{\ln\left(\frac{3\hat{n}_A^2}{a_j P_e}\right)}{2 \ln 2} \right\rfloor. \tag{40}$$

Fig. 10. Energy savings for different optimization approaches.

TABLE IV
THE PARAMETERS AND OPTIMUM POINTS ACHIEVED FROM THE OPTIMIZATION METHOD FOR DCT FOR AMA-II

By assuming that the total approximation noise is generated by the $j$th stage while assuming other stages are exact, the upper bound of the $j$th stage ($k_{j,UB}$) can be determined. Note that for decreasing the search time, we used (40) to limit the search space in the exhaustive search. As shown in Fig. 10, the energy savings in the three considered cases are almost the same. On average, the energy saving differences between the exhaustive search and the proposed optimizing


approach, when the floor and round functions were applied, were less than 0.9% and 0.4%, respectively. Table IV shows the parameters and optimum points achieved from the proposed optimization method for the improved structure of DCT when utilizing AMA-II.

VII. CONCLUSION

In this paper, a framework for accurate yet analytical output quality estimation of DSP designs realized using approximate adders was presented. The error of low power approximate adders was studied as an additive noise (approximation noise) which disturbs the signals in the digital signal processors. A mathematical modeling approach for describing the power of the approximation noise was developed, providing accurate yet simple expressions for calculating the noise power and SNR of the output signal in DSP blocks. To evaluate the proposed framework, the output qualities of some DSP blocks including FIR filter, DCT, and FFT, which represented different types (both in time and frequency domains) and complexities (up to more than 30,720 addition operations), were studied. The error of the estimated SNRs using the proposed analytical model was, on average, less than 2.5dB compared to that of the pure simulation, showing its high accuracy in estimating the output quality. Also, an analytical optimization approach based on the Lagrange multipliers was presented. For the proposed optimization approach, the energy, delay, and EDP were minimized subject to the desired quality determined as SNR. The energy saving of the proposed optimization approach differed by less than 1% from that of the exhaustive search method.

APPENDIX B

For optimizing EDP, by putting (34) and (35) in (36), the objective function may be expressed as

$$EDP = nE_{EXT}D_{EXT} - E_{EXT}D_S k_m - nD_{EXT}E_S \sum_{j=1}^{S} m_j k_j + D_S E_S k_m \sum_{j=1}^{S} m_j k_j, \tag{47}$$

where $k_m$ and $D_S$ are $\min(k_1, \ldots, k_S)$ and $(D_{EXT} - D_{APX})$, respectively. Taking the partial derivative of the Lagrange function, one obtains

$$\frac{\partial L}{\partial k_j} = -nD_{EXT}E_S m_j + E_S D_S m_j k_m - 2\lambda a_j \frac{P_e}{3} 2^{2k_j}, \tag{48}$$

$$\frac{\partial L}{\partial k_m} = -nD_{EXT}E_S m_m + 2E_S D_S m_m k_m - 2\lambda a_m \frac{P_e}{3} 2^{2k_m}. \tag{49}$$

According to (43) and using (48) and (49), $\lambda$ can be determined as

$$\lambda = -\frac{E_S\left(D_S k_m - nD_{EXT}\right) \sum_{j=1}^{S} m_j + E_S D_S m_m k_m}{2\hat{n}_A^2}. \tag{50}$$

By replacing $\lambda$ in (49), the solution of $\frac{\partial L}{\partial k_m} = 0$ can be determined as

$$2^{2k_m} = \frac{2D_S m_m k_m - nD_{EXT} m_m}{-D_S\left(m_t + m_m\right) k_m + nD_{EXT} m_t} \cdot \frac{3\hat{n}_A^2}{a_m P_e}, \tag{51}$$
where m t is Sj=1 m j . When n̂ 2A is determined, (51) which fol-
A PPENDIX A
lows the form of e x = − CAx−B
x−D has to be solved numerically.
Using (33) and (35), the Lagrange function for the energy This optimization problem may either be infeasible or have
optimization may be expressed as more than one solution.
⎛ ⎞
S 
S
P
m j E S k j − λ ⎝n̂ 2A − a j 22k j ⎠. (41)
e
L = EE X T − R EFERENCES
3
j =1 j =1
[1] J. Kung, D. Kim, and S. Mukhopadhyay, “On the impact of energy-
Taking the partial derivations of the Lagrange function, one accuracy tradeoff in a digital cellular neural network for image process-
obtains ing,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 34,
no. 7, pp. 1070–1081, Jul. 2015.
∂L Pe
= −m j E S − 2λa j 22k j , (42) [2] T. Moreau, A. Sampson, and L. Ceze, “Approximate computing: Making
∂k j 3 mobile systems more efficient,” IEEE Pervasive Comput., vol. 14, no. 2,
pp. 9–13, Apr. 2015.
∂L  Pe S
[3] A. Madanayake et al., “Low-power VLSI architectures for DCT/DWT:
= −n̂ 2A + a j 22k j . (43) Precision vs approximation for HD video, biomedical, and smart antenna
∂λ 3 applications,” IEEE Circuits Syst. Mag., vol. 15, no. 1, pp. 25–47,
j =1
1st Quart., 2015.
To find the stationary point, the partial derivations should [4] S. P. Kadiyala et al., “Perceptually guided inexact DSP design for power,
be equated to zero. Using (42), we may find area efficient hearing aid,” in Proc. IEEE Biomed. Circuits Syst. Conf.
(BioCAS), Oct. 2015, pp. 1–4.
Pe m j ES [5] F. Samie, L. Bauer, and J. Henkel, “An approximate compressor for
a j 22k j = − . (44) wearable biomedical healthcare monitoring systems,” in Proc. Int. Conf.
3 2λ
Hardw./Softw. Codesign Syst. Synthesis (CODES+ISSS), Oct. 2015,
m j ES
By replacing − 2λ in (43), λ can be determined as pp. 133–142.
S [6] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of
j =1 m j E S
low-power high-speed truncation-error-tolerant adder and its application
λ=− . (45) in digital signal processing,” IEEE Trans. Very Large Scale Integr. (VLSI)
2n̂ 2A Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[7] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power dig-
Now, utilizing (42) and (45), the energy optimized k j is ital signal processing using approximate adders,” IEEE Trans. Comput.-
expressed as Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137,
Jan. 2013.
m j n̂ 2A
ln [8] Z. Yang, J. Han, and F. Lombardi, “Transmission gate-based approx-
a j P3e (m 1 +···+m S ) imate adders for inexact computing,” in Proc. IEEE/ACM Int. Symp.
k j,opt = . (46) Nanoscale Archit. (NANOARCH), Jul. 2015, pp. 145–150.
2 ln 2
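The closed-form solution (46) of Appendix A and the upper bound (40) can be checked numerically. The following Python sketch is a minimal illustration under stated assumptions: the function names are hypothetical, and all numeric values (the stage parameters m_j and a_j, the parameter P_e, and the noise budget) are made up for demonstration rather than taken from the paper.

```python
import math

def noise_power(k, a, p_e):
    """Constraint term of (41): sum_j a_j * (P_e / 3) * 2**(2 * k_j)."""
    return sum(a_j * (p_e / 3.0) * 2.0 ** (2 * k_j) for a_j, k_j in zip(a, k))

def k_opt(m, a, p_e, n2):
    """Real-valued energy-optimal k_j according to (46)."""
    m_total = sum(m)
    return [
        math.log(m_j * n2 / (a_j * (p_e / 3.0) * m_total)) / (2.0 * math.log(2.0))
        for m_j, a_j in zip(m, a)
    ]

def k_upper_bound(a_j, p_e, n2):
    """Upper bound (40): the whole noise budget is spent in stage j."""
    return math.floor(math.log(3.0 * n2 / (a_j * p_e)) / (2.0 * math.log(2.0)))

# Hypothetical example values (assumptions for illustration only).
m = [8, 4, 2]          # stage parameters m_j
a = [1.0, 2.0, 4.0]    # per-stage noise weights a_j
p_e = 0.5              # parameter P_e of the noise model
n2 = 1000.0            # allowed approximation-noise power

k_real = k_opt(m, a, p_e, n2)                  # stationary point of the Lagrangian
k_int = [math.floor(k) for k in k_real]        # integer mapping that respects the budget
k_ub = [k_upper_bound(a_j, p_e, n2) for a_j in a]
```

Under these assumptions, the real-valued solution spends the noise budget exactly, and each floored value stays at or below its upper bound from (40), which is why (40) can safely limit the search space.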
Authorized licensed use limited to: COLLEGE OF ENGINEERING - Pune. Downloaded on December 10,2023 at 17:47:50 UTC from IEEE Xplore. Restrictions apply.
340 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS, VOL. 66, NO. 1, JANUARY 2019

[9] H. A. F. Almurib, T. N. Kumar, and F. Lombardi, “Inexact designs for [33] K. P. Parker and E. J. McCluskey, “Probabilistic treatment of general
approximate low power addition by cell replacement,” in Proc. Design, combinational networks,” IEEE Trans. Comput., vol. C-24, no. 6,
Automat. Test Eur. Conf. Exhib. (DATE), Mar. 2016, pp. 660–665. pp. 668–670, Jun. 1975.
[10] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired [34] A. Papoulis and S. Pillai, Probability, Random Variables, and Stochastic
imprecise computational blocks for efficient VLSI implementation of Processes. New York, NY, USA: McGraw-Hill, 2002.
soft-computing applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, [35] L. Wanhammar, DSP Integrated Circuits. New York, NY, USA:
vol. 57, no. 4, pp. 850–862, Apr. 2010. Academic, 1999.
[11] S. Geetha and P. Amritvalli, “High speed error tolerant adder for [36] M. S. Khairy, A. Khajeh, A. M. Eltawil, and F. J. Kurdahi, “Equi-
multimedia applications,” J. Electron. Test., vol. 33, no. 5, pp. 675–688, noise: A statistical model that combines embedded memory failures and
Oct. 2017. channel noise,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 2,
[12] N. Zhu, W. L. Goh, and K. S. Yeo, “An enhanced low-power high-speed pp. 407–419, Feb. 2014.
adder for error-tolerant application,” in Proc. 12th Int. Symp. Integr. [37] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge,
Circuits, Dec. 2009, pp. 69–72. U.K.: Cambridge Univ. Press, 2004.
[13] J. Hu and W. Qian, “A new approximate adder with low relative error
Masoud Pashaeifar received the B.Sc. degree from the Shahid Bahonar University of Kerman, Kerman, Iran, in 2011, and the M.Sc. degree in electrical engineering, circuits and systems from the University of Tehran, Tehran, Iran, in 2013, where he is currently pursuing the Ph.D. degree in circuits and systems. His current research interests include approximate computing, robust and energy efficient signal-processing, and Internet of Things.

Mehdi Kamal received the B.Sc. degree from the Iran University of Science and Technology, Tehran, Iran, in 2005, the M.Sc. degree from the Sharif University of Technology, Tehran, in 2007, and the Ph.D. degree from the University of Tehran, Tehran, Iran, in 2013, all in computer engineering. He is currently an Assistant Professor with the School of Electrical and Computer Engineering, University of Tehran. His current research interests include reliability in nanoscale design, approximate computing, neuromorphic computing, design for manufacturability, embedded systems design, and low-power design.

Ali Afzali-Kusha received the B.Sc. degree from the Sharif University of Technology, Tehran, Iran, in 1988, the M.Sc. degree from the University of Pittsburgh, Pittsburgh, PA, USA, in 1991, and the Ph.D. degree from the University of Michigan, Ann Arbor, MI, USA, in 1994, all in electrical engineering.
He was a Post-Doctoral Fellow with the University of Michigan from 1994 to 1995. He has been with the University of Tehran, since 1995, where he is currently a Professor with the School of Electrical and Computer Engineering and the Director of the Low-Power High-Performance Nanosystems Laboratory. He was a Research Fellow with the University of Toronto, Toronto, ON, Canada, and the University of Waterloo, Waterloo, ON, Canada, in 1998 and 1999, respectively. His current research interests include low-power high-performance design methodologies from the physical design level to the system level for nanoelectronics era.

Massoud Pedram received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, CA, USA, in 1986, and the M.S. and Ph.D. degrees in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, CA, USA, in 1989 and 1991, respectively. In 1991, he joined the Ming Hsieh Department of Electrical Engineering, University of Southern California (USC), Los Angeles, CA, USA, where he is currently the Stephen and Etta Varra Professor with the USC Viterbi School of Engineering. He was a recipient of the National Science Foundation's Young Investigator Award in 1994, the Presidential Early Career Award for Scientists and Engineers in 1996, two Design Automation Conference Best Paper Awards, the Distinguished Paper Citation from the International Conference on Computer Aided Design, three Best Paper Awards from the International Conference on Computer Design, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS Best Paper Award, and the IEEE Circuits and Systems Society Guillemin-Cauer Award.
