
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 31, NO. 1, JANUARY 2023

AxPPA: Approximate Parallel Prefix Adders


Morgana Macedo Azevedo da Rosa, Student Member, IEEE, Guilherme Paim, Member, IEEE,
Patrícia Ücker Leleu da Costa, Student Member, IEEE, Eduardo Antonio César da Costa, Member, IEEE,
Rafael I. Soares, Member, IEEE, and Sergio Bampi, Senior Member, IEEE

Abstract— Addition units are widely used in many computational kernels of several error-tolerant applications such as machine learning and signal, image, and video processing. Besides their stand-alone use, additions are essential building blocks for other math operations such as subtraction, comparison, multiplication, squaring, and division. The parallel prefix adder (PPA) is among the fastest adders. It represents a parallel prefix graph consisting of carry operator nodes, called prefix operators (POs). The PPAs, in particular, are among the fastest adders because they optimize the parallelization of the carry generation (G) and propagation (P). In this work, we introduce approximate PPAs (AxPPAs) by exploiting approximations in the POs. To evaluate our proposal for approximate POs (AxPOs), we generate the following AxPPAs, consisting of a set of four PPAs: approximate Brent–Kung (AxPPA-BK), approximate Kogge–Stone (AxPPA-KS), Ladner–Fischer (AxPPA-LF), and Sklansky (AxPPA-SK). We compare the four AxPPA architectures with energy-efficient approximate adders (AxAs) [i.e., Copy, error-tolerant adder I (ETAI), lower-part OR adder (LOA), and Truncation (trunc)]. We tested them generically in stand-alone cases and embedded them in two important signal processing application kernels: a sum of squared differences (SSD) video accelerator and a finite impulse response (FIR) filter kernel. The AxPPA-LF provides a new Pareto front in both energy-quality and area-quality results compared to state-of-the-art energy-efficient AxAs.

Index Terms— Approximate adders (AxA), approximate computing (AxC), energy-efficient operators, parallel prefix adders (PPAs).

Manuscript received 11 July 2022; revised 7 October 2022; accepted 22 October 2022. Date of publication 21 November 2022; date of current version 28 December 2022. This work was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)-Finance Code 001. (Corresponding author: Rafael I. Soares.)
Morgana Macedo Azevedo da Rosa and Rafael I. Soares are with the Department of Computer Science, Universidade Federal de Pelotas (UFPel), Pelotas 96010-610, Brazil (e-mail: [email protected]).
Guilherme Paim, Patrícia Ücker Leleu da Costa, and Sergio Bampi are with the Department of Microelectronics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre 90010-150, Brazil (e-mail: [email protected]).
Eduardo Antonio César da Costa is with the Department of Electronics and Computing, Universidade Católica de Pelotas (UCPel), Pelotas 96015-560, Brazil (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2022.3218021.
Digital Object Identifier 10.1109/TVLSI.2022.3218021
1063-8210 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

APPROXIMATE computing (AxC) has emerged as a new design paradigm for increasing efficiency across the computing stack by leveraging the inherent fault resilience of many applications [1]. AxC introduces accuracy (i.e., quality of the results) as a new explicit dimension for trade-offs in design optimizations [2], which can significantly reduce the VLSI circuit area and the energy consumption of computations. Several error-resilient and computationally intensive applications such as digital signal processing [3], [4], image [5], [6] and video processing [7], consumer electronics [3], computer vision, and machine learning [8] require many adder units in the parallel datapath architecture of hardware accelerators. Aside from their ubiquitous presence, adder units are also intrinsic building blocks that compose many other key arithmetic hardware operators [9]. Therefore, approximate adder (AxA) units can also be cooperatively employed in many existing approximate arithmetic units of higher complexity for cross-layer approximations, such as squaring modules [10], [11], multipliers [12], [13], [14], [15], [16], square roots [17], and dividers [18], [19].

Many works have proposed architectures for AxAs [21], [22], [23], [24], [25], [26], [27], [48]. The energy-efficient AxAs approximate the logic of the lower part of the adder, from the least significant bits (LSBs) to the most significant bits (MSBs). Parallel prefix adders (PPAs) are among the fastest yet most area-efficient addition units because they use logarithmic reduction of the carry propagation paths to optimize the delay of the critical paths. Optimizing the circuit synthesis of PPAs is an essential challenge in digital hardware design [28], [29], [30], [31]. This article presents approximate PPAs (AxPPAs) that combine logical approximations from LSB to MSB with fast carry propagation, providing a hybrid solution for both fast and energy-efficient adders. Our idea of combining speed and energy efficiency through PPA approximations shows a new AxA Pareto front of circuit area and energy reduction for the same level of approximation (i.e., the incidence of errors).

The key idea behind our proposal is to approximate the logic of carry propagation (P) and generation (G) of the prefix operators (POs). To demonstrate the proposal of approximate POs (AxPOs), we implement and evaluate four AxPPAs using the architectures of Brent–Kung (AxPPA-BK), Kogge–Stone (AxPPA-KS), Ladner–Fischer (AxPPA-LF), and Sklansky (AxPPA-SK). To prove their applicability, we exercise these proposed AxPPAs in applications that include many adders, i.e., the following hardware accelerators, as complete case studies: 1) a finite impulse response (FIR) filter and 2) a sum of squared differences (SSD) pixel comparison employed for block matching in video processing applications. Notably, the FIR filter and SSD applications are relevant case studies since both circuits include many adders that directly impact the area, delay, and power dissipation. The FIR filter uses adders to accumulate the generation of the inner product of the vector-to-vector multiplication in the convolution step. On the other


Fig. 1. Generic description of the preprocessing, prefix computation, and postprocessing steps of (a) exact PPA and (b) our AxPPA proposal.

hand, the SSD hardware architecture consists of a summation tree that accumulates the computation of partial values. The addition tree in both case studies allows for exploring efficient addition schemes, such as combinations of AxAs.

The novel contributions of this work are as follows: 1) we propose AxPOs to synthesize AxPPAs that are generically applicable to any PPA architecture; 2) we implement and evaluate the error versus energy and circuit area savings for four different AxPPA architectures; 3) we also evaluate our AxPPAs in two case studies (FIR filters and the SSD metric in video processing); and 4) our AxPPA proposals offer a new Pareto front in the trade-off between quality and hardware synthesis results when benchmarked against the most energy-efficient AxAs in the literature.

The article is organized as follows. Section II overviews the related works and qualitatively compares them with our proposal. Section III presents an overview of PPAs. Section IV presents our AxPO proposal and the AxPPA architectures. Section V presents our AxPPA results, first compared to the state of the art for application-independent metrics, and second, presents their evaluation in two case studies. Finally, Section VI summarizes the main results and contributions of the work.

II. RELATED WORK

The development of approximate arithmetic units has attracted significant research interest [21], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [48], driven by their high potential for saving circuit area and energy. Many authors have proposed AxA architectures [21], [22], [23], [24], [25], [26], [27], [48]. Typically, AxAs either shorten the carry-propagation chain to reduce latency/critical path [21], [32], [34], [35], [36], [44], [48] or eliminate carry calculations and other circuit logic altogether to reduce power consumption [21], [32], [33], [48].

The gracefully degrading adder (GDA) [36] and the generic accuracy configurable adder (GeAR) [34] target configurable AxAs with reduced latency by overlapping multiple sub-adders to reduce the carry prediction length. Copy [42], truncation (trunc) [43], error-tolerant adder I (ETAI) [32], and lower-part OR adder (LOA) [48] are among the most energy- and area-efficient AxAs presented in [45]. These AxAs split the operation into an exact part for the MSBs and an inexact part that approximates the remaining LSBs. The LSB parts are approximated by: 1) copying part of one of the inputs (Copy AxA); 2) pinning the outputs to either "0" or "1" logical values (Trunc-0 or Trunc-1 AxAs); 3) a bitwise OR function of the inputs (LOA) [48] for the inexact part; or 4) an XOR with an error correction that sets n − 1 LSBs to "1" if the generate at the nth bit position is true (ETAI) [21].

The accuracy-configurable approximate (ACA) adder [35], the variable latency carry selection adder (VLCSA) [41], and the variable latency speculative adder (VLSA) are AxAs that support both exact and approximate operation with configurable error detection and correction for accuracy configuration. ACA and VLCSA can provide accurate results, but at the cost of significant circuit area overhead for error detection and correction. The VLSA [46], [47] architecture consists of units for error detection, AxA, and error correction. The addition in VLSA is exact; the error correction stage obtains the correct result. Esposito et al. [46], [47] have proposed variable latency speculative PPAs (VLSPPAs) with a technique that reduces hardware overhead while maintaining a low error rate compared to the PPA. The VLSPPAs can be divided into five steps, two more than non-speculative prefix adders: preprocessing, speculative prefix processing, postprocessing, error detection, and error correction. In the speculative prefix processing step, only a subset of the block generation and forwarding signals are computed instead of computing all g[0:W−1] and p[0:W−1], which are required in a non-speculative PPA. The VLSPPAs introduce an error detection network that reduces the error probability compared to the exact PPA. The exact speculative unit has an error detection block to control the error caused by the AxA unit. Esposito et al. [46] have proposed


two VLSPPA architectures, the Han–Carlson (VLSPPA-HC) and the Kogge–Stone (VLSPPA-KS). Compared to the non-speculative PPA, [46], [47] have shown that their VLSPPA-HC, -CI, -SK, -LF, and -BK adders yield significant improvements when the highest speed is required. Otherwise, the overhead of the error detection and correction stages outweighs the benefits of the VLSPPA. Esposito et al. [46], [47] synthesized their design for the UMC 65-nm CMOS library and showed the trade-off between performance and area. In our work, we analyze the VLSA simplified by assuming that the speculative versions for W-bit inputs are approximated in K bits. Therefore, the speculative versions proposed in [46] and [47] for K-bit inputs are used in the speculative versions of the prefix processing stage -BK, -CI, -HC, -KS, -LF, and -SK. For example, the BK specification (proposed in [47]) includes the K-bit version of the BK speculative prefix processing stage in the LSB part and the exact BK on W − K bits in the MSB part.

Although several strategies for optimized PPAs have been described in the literature, no previous work, to our knowledge, has proposed AxPPAs like the ones we develop herein. Table I summarizes the state-of-the-art AxAs and qualitatively highlights the contributions of our work compared to others.

TABLE I
RELATED WORK ON AXAS

III. PPA BACKGROUND

Optimizing the design of PPAs has been extensively studied in [28], [29], [30], [31], [46], [47], [49], and [50] to achieve datapath optimization. PPA architectures can be generally divided into three blocks: preprocessing, prefix computing, and postprocessing [as shown in Fig. 1(a)].

Preprocessing is the first step, producing signals bitwise for the subsequent steps: 1) generate a carry (g) and 2) propagate a carry (p). Preprocessing, hence, encodes the A and B input bits of the operands into g (generate) and p (propagate), as shown in the following Boolean equations:

p_i = A_i ⊕ B_i    (1)
g_i = A_i · B_i.    (2)

A true g signals that the carry-out is true independent of the carry-in value. A true p signals that the carry-in at the ith order propagates to the carry-out. A bitwise AND logic gate executes the g function, and an XOR logic gate performs the p; both are evaluated in parallel for all i-order bits with a single gate delay. The preprocessing circuit area naturally scales proportionally to the input bit width of the adder.

The exact prefix computing step has been extensively explored in the literature to achieve an optimal trade-off between energy, circuit area, and delay [29], [30], [31], [49], [50]. There are many ways to implement prefix computing by balancing resources such as circuit area, logic depth, the total number of logic gates, the maximum fan-out per gate, and the number of connections between the carry propagate and generate cells. The prefix calculation groups the carries according to the configuration of the adders [51].

POs are the key blocks of the prefix computing step and can be described by Boolean equations (3) and (4), where the terms g_i, p_i, g_{i+1}, and p_{i+1} are taken from the preprocessing step. The PO blocks must contain the associative operator, which produces the carry generation and propagation bits. This means that the PO blocks referenced in Boolean equations (3) and (4) are combined in the prefix computing step for calculating the graph structure of the PPA:

P = p_i · p_{i+1}    (3)
G = (g_i · p_{i+1}) + g_{i+1}.    (4)

The graph refers to the prefix computing step and groups the carry generation and propagation nodes. The PO is also referred to as the delta operator, a fundamental carry operator in [51]. In the end, the prefix computing step generates C (carry out) as output, i.e., the carry word.

The postprocessing step generates the final sum by recombining the C [shown in (5)] generated by the prefix computing step with the p of the preprocessing step. The postprocessing function is shown in (6) and implements in parallel a bitwise XOR gate between the C and p signals for all ith-order bits:

C_i = G_i + (P_i · cin_i)    (5)
S_{i+1} = P_{i+1} ⊕ C_i.    (6)

Fig. 2 shows four classic examples of PPAs: 1) Brent–Kung [52]; 2) Kogge–Stone [53]; 3) Sklansky [54]; and 4) Ladner–Fischer [55]. Depending on the prefix computing step, each PPA differs in circuit area, energy, and delay. The Brent–Kung strategy groups the prefix computation for 2-bit groups, uses it to find the prefixes for 4-bit groups, then for 8-bit groups, and continues until the sum tree reaches the desired number of bits. Two cells per logical level limit the fan-out. The approach of this adder allows a regular layout, reducing design and implementation costs, one of the most important criteria for VLSI design. The Kogge–Stone adder leads to a combination of efficiency and fan-out reduction. The tree contains parallel propagation and generation cells. However, when arranging the adder layout in a regular grid, the circuit area increases due to the routing possibilities. The Sklansky adder, also called the divide-and-conquer adder, reduces the delay in computing the intermediate prefixes. However, this comes at the cost of a fan-out that doubles with each stage; these high fan-out values can lead to performance degradation of this summing cell. The Ladner–Fischer adder shows intermediate regularity between the Sklansky and Brent–Kung PPAs, although its architecture is very similar to that proposed by Sklansky [54]. This adder computes the prefixes for the odd positions and uses another stage to obtain the even positions.

Fig. 2. Examples of the prefix computing architectures: (a) Brent–Kung, (b) Kogge–Stone, (c) Sklansky, and (d) Ladner–Fischer. (e) POs.

TABLE II
LOGARITHMIC UNIT DELAY AND GATE AREA OF PPA

Algorithm 1 Pseudo Code for W-Bit PPA [56]
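The three PPA steps — preprocessing per (1)–(2), a prefix network of POs per (3)–(4), and postprocessing per (5)–(6) — can be sketched as a bit-level Python model. This is an illustrative sketch of a Kogge–Stone prefix network; the function name and structure are ours, not the authors' code:

```python
def ppa_kogge_stone(a: int, b: int, width: int = 16) -> int:
    """Bit-level model of an exact Kogge-Stone parallel prefix adder (cin = 0)."""
    # Preprocessing: p_i = A_i XOR B_i, g_i = A_i AND B_i -- Eqs. (1)-(2).
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]
    g = [(a >> i & 1) & (b >> i & 1) for i in range(width)]

    # Prefix computing: log2(W) levels of POs. Each PO combines a higher
    # (G, P) pair with a lower one: G = (G_lo AND P_hi) OR G_hi,
    # P = P_lo AND P_hi -- Eqs. (3)-(4). Iterating i downward reads the
    # previous level's values, as the Kogge-Stone network requires.
    G, P = g[:], p[:]
    d = 1
    while d < width:
        for i in range(width - 1, d - 1, -1):
            G[i] = (G[i - d] & P[i]) | G[i]
            P[i] = P[i - d] & P[i]
        d *= 2

    # Postprocessing: with cin = 0 the carry into bit i is the group
    # generate G[i-1]; S_i = p_i XOR C_{i-1} -- Eqs. (5)-(6).
    s = p[0]
    for i in range(1, width):
        s |= (p[i] ^ G[i - 1]) << i
    s |= G[width - 1] << width  # carry-out becomes the top bit of the result
    return s
```

After the log2(W) levels, each bit position holds the group (G, P) spanning all lower-order bits, so the recovered carries — and therefore the sum — match exact addition.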
Table II shows the comparison between these four PPAs,
where W represents the adder bit-width. Note that delay refers
to the number of stages and circuit area represents the number
of POs present in each PPA [42]. PPA performs the addition
operation in three stages. The first stage is the preprocessing
stage. This stage involves parallel computation of propagating
P and generating G signals for each pair of bits in augend
A and addend B as indicated by Lines 3–6 in Algorithm 1.
The second stage, also known as the prefix tree, computes [shown in Fig. 3(h)]
the carry propagation according to the delay iteration (column
Delay in Table II), where W represents the word size. Each P ≈ pi+1 (7)
iteration uses P and G from the previous iteration to compute
the current ones [56]. The computation in each iteration is G ≈ gi+1 . (8)
parallel and only W − y computations are needed where
1 ≤ y ≤ 2 j −1 and 2 ≤ j ≤ Delay (e.g., for KS Table III shows where the approximations are applied. The
is 2 ≤ j ≤ log2 W ). Lines 7–15 in Algorithm 1 show the cells in red represent the approximation of the POs. As shown
computation in this stage. The last stage is the postprocessing in Table III, the proposed approach implies around 25% of
stage. This stage involves parallel computation of the sum approximation on propagation and 12.5% on carry generation
(S) and the output carry (C) as indicated by Lines 16–19 in calculations.
Algorithm 1. Fig. 2(e) shows the POs of the PPAs that compute AxPPA eliminates the logic gates in the prefix computation
the internal propagate (P) and generate (G) signals for prefix step. Our proposal deletes the PO in the prefix computa-
computing, shown here in capital letters to distinguish from tion step, so there is no PPA prefix computation. It means
preprocessing. In this work, we propose an approximation to that the order of each PO in the PPA prefix computation
PO for trading accuracy by higher circuit efficiency. [see Fig. 2(a)–(d)] depends on the type of PPA.
Table IV shows the delay and gate area of the AxPPA. The
W in AxPPA is the size of the words input. The parameter
IV. A X PPA P ROPOSAL K on AxPPA represents the number of approximated bits.
The prefix computing in a PPA is composed of sets of POs The other bits are exact for W − K . Thus, the K represents
[see Fig. 1(a)]. The AxPPA proposals exploit approximations the portion of the approximate LSBs and the remaining
in the logic of part of its POs [see Fig. 3(c)]. The number MSBs implemented with an exact PPA structure. For example,
of approximate POs can be configurable at design time to an adder with 16 input bits, K = 4 approximates the four least
adjust the desired exactness of the AxPPA. Fig. 1(b) shows essential bits using the AxPOs of Fig. 3(c). It implements the
that our AxPO uses just wires to process the prefix comput- other 12 MSBs using any exact adder architecture. Our work
ing, connecting the preprocessing to the postprocessing. The explores the AxPO proposals in four PPAs, building the four
computation of POs [shown in Fig. 3(f)] is described in (3) following architectures: BK (AxPPA-BK), KS (AxPPA-KS),
and (4), while (7) and (8) describe the computation of AxPOs LF (AxPPA-LF), and SK (AxPPA-SK). For example, to 8 bits,

the PPA-KS uses 17 POs, totaling 51 logic gates in the prefix computing step.

Fig. 3 shows a generic example of a 16-bit binary approximate addition with K = 8 for the values 33222 and 116254 in decimal. With K = 8, the 16-bit AxPPA is divided into an exact part [see Fig. 3(a)–(c)] of 8 bits and an approximate part [see Fig. 3(b)–(d)] of 8 bits. This example shows that the sum of 33222 and 116254 yields an approximate value equal to 217142. Fig. 3(d) shows the approximate sum for the same example as Fig. 3(b) with K = 8 bits. Fig. 3(c)–(d) is divided into preprocessing, approximate prefix computing, and postprocessing. The preprocessing has a critical path of one XOR logic gate [shown in Fig. 3(b) and (c)]. Note that the approximate prefix computing only uses wires [see Fig. 3(h)] to connect the generate and propagate carry values of the preprocessing step to the postprocessing step. The approximate part in Fig. 3(d) generates a 1-bit carry into the exact part. The POs highlighted in dark yellow in Fig. 3(c) describe the calculation of the carry in the PPA provided by the AxPPA. The carry operator in Fig. 3(g) has a critical path equal to two logic gates: one XOR and one AND.

Fig. 3. Steps for the AxPPA-LF example: (a) MSB operation with exact adder, (b) LSB operation with AxPPA adder for K = 8 bits, (c) exact part with LF, (d) structure of AxPPA for K = 8 bits, (e) preprocessing step, (f) POs, (g) Carry, and (h) AxPOs.

TABLE III
TRUTH TABLE OF THE APPROXIMATE PO (AXPO) PROPOSAL

TABLE IV
LOGARITHMIC UNIT DELAY AND GATE AREA OF AXPPA

Fig. 3 details the example showing the steps for the AxPPA [see Fig. 3(a)]. Note that the light gray rectangles represent the preprocessing steps, calculating the propagate and generate terms

for each operand's input bits (a and b). The internal structure is composed of a simple AND gate for generate and an XOR gate for propagate, as seen in Fig. 3(b). The AxPO [see Fig. 3(c)] configures the structure for (7) and (8), eliminating one AND gate in the propagate term (P) and eliminating one OR and one AND logic gate in the generate term (G). The bit values for G and P, along with the structure of the AxPPA tree [see Fig. 3(d)], follow the calculation of the AxPOs highlighted in red in Table III. The XOR operation calculates the final sum (S). This operation occurs between the carries obtained in the parallel prefix computing step and the propagate term calculated in the preprocessing. This holds for the S1–S7 bit results; the S0 and S8 results copy the p0 and g7 values, respectively, as seen in Fig. 3(d).

Algorithm 2 Pseudo Code for W-Bit Input and K-Bit AxPPA

Algorithm 2 represents an AxPPA for inputs of W bits with K bits of approximation. The area of the AxPPA in logic gates is 3W − 1, of which 2W − 1 are XOR logic gates and W are AND logic gates. The critical path is equal to two XOR logic gates for any AxPPA. Also, the AxPPA uses an exact PPA for the W − K MSBs. Thus, K > 0 (line 3) represents the portion of approximate LSBs, and the remaining (W − K) > 0 (line 14) most significant bits (MSBs) are implemented with an exact PPA structure. Therefore, Algorithm 2 is divided into an approximate part (lines 3–13) and an exact part (lines 14–32). We consider Ai and Bi the vector inputs to the AxA in Algorithm 2. The approximate part, from i = 0 to i = K − 1, is summed with the AxPPA (lines 3–13 in Algorithm 2). Lines 4–7 in Algorithm 2 represent the preprocessing step of the AxPPA, while lines 8–11 represent the postprocessing step of the AxPPA. Note that the approximate part (lines 3–13) in Algorithm 2 eliminates the prefix computing step (lines 20–24) common to the exact part. The exact part, from i = K to i = W − 1, is summed with a PPA (lines 14–32 in Algorithm 2). The PPA architecture is divided into preprocessing (lines 15–18), prefix computing (lines 20–26), and postprocessing (lines 28–31) steps. The preprocessing step involves the parallel computation of the propagate P and generate G signals for each pair of input bits. The prefix computing step computes the carry propagation according to the selected PPA delay value. Each iteration uses the P and G from the previous iteration to compute the current ones; when P and G are not computed, the current iteration receives the values from the previous iteration. Lines 20–24 in Algorithm 2 show the computation in this stage. The postprocessing step involves the parallel computation of the sum Si and the output carry ci, as indicated by lines 28–31 in Algorithm 2. To verify this improvement in Algorithm 2, we comprehensively tested our AxPPA proposal employing two applications as case studies: SSD in video processing and an FIR filter in signal processing.

V. SYNTHESIS AND ACCURACY EVALUATION RESULTS

This section shows the main results obtained by applying our AxPPA to two case studies: FIR filters and the SSD metric from video processing. We perform the design space exploration (DSE) in a MATLAB-ModelSim co-simulation process, enabling the designer to carry out several simulations employing vectors with realistic behavior. A golden model compares the proposed AxPPA structures with exact adders in the MATLAB environment. Regarding application-specific integrated circuit (ASIC)-based results, the architectures were described in very high speed integrated circuit hardware description language (VHDL) and synthesized using the Cadence Genus synthesis tool at frequencies of 200 MHz, 22.05 kHz, and 543.47 MHz for the AxPPA, FIR, and SSD, respectively. The syntheses considered the low-power ST 65-nm commercial standard cell library with a 1.25-V supply voltage. For comparisons, we obtained all results at the maximum achievable frequency (with zero slack) to extract the circuit quality of results (QoR) under extreme cases. We used the Cadence Incisive tool to simulate all netlists considering the standard delay format (SDF) file for precise signal propagation delays and temporal glitches. The simulation generates a toggle count format (TCF) file, loaded into the synthesis tools for realistic power extraction. The power estimation methodology uses the Genus synthesis tool in PLE mode to generate the Verilog gate-level netlist and the SDF file. Concerning field-programmable gate array (FPGA)-based results, we used a Xilinx Virtex-7 XC7VX1140TFLG1930 FPGA as the evaluation target device. The proposed architecture was designed in VHDL and mapped on the FPGA using Xilinx ISE 14.7. We executed the software code on an Intel i7-10750H processor with 16 GB of DDR4 memory. The working frequency of the FPGA is 100 MHz.

A. AxPPA Application-Agnostic Results Evaluation

This section evaluates the quality results of the AxPPA and the energy-efficient AxAs from the literature, in both ASIC- and


Fig. 4. MRED for our AxPPA against the literature AxAs: (a) savings of energy versus quality trade-off and (b) savings of circuit area versus quality
trade-off. The baseline employed for the savings calculation is the exact adder automatically selected by the synthesis tool.

FPGA-based. We employ the following well-known metrics


for evaluating the errors: Mean Absolute Error (MAE) and
Mean Relative Error Distance (MRED). The error distance is
defined as the difference between an accurate result (x) and the
approximate one (y), i.e., E = x i −yi . The MAEis the average
of all absolute errors (n), i.e., MAE = (1/n) ni=1 |E|. The
MRED defines the error distance divided by the accurate sum,
and the WCE is the maximum error distance. We acquired
the accuracy results in the co-simulation process between
MATLAB and Modelsim software. We calculate the MAE and
MRED values for 100 000 pseudorandom inputs of 16-bits
with the frequency operation of the 200 MHz.
1) ASIC-Based Results: Fig. 4(a) shows the energy-quality
trade-off and Fig. 4(b) shows the area-quality trade-off for our
AxPPA proposal structure and other AxAs of the literature,
with variations in the approximation step (K = 1–16). The
Pareto fronts in Fig. 4 summarize the approximation-quality trade-off, relating MRED to the energy and area synthesis results. The AxPPA results in circuit area savings of up to 60% compared with the exact one. Since the approximate part of an AxA uses fewer or no logic gates, it requires less circuit area than the exact adder. Furthermore, the AxPPA achieves energy savings of up to 80% compared to the exact one. There are significant variations in the circuit area across the AxAs, due to their differences in logic composition. As a result, all AxA and AxPPA proposals dissipate less power than the exact adder, as seen in Fig. 4. Conversely, the AxAs introduce an error in the result of the sum. Notably, Fig. 4 shows that the energy consumption decreases as the approximation increases. The VLSPPA has the highest energy consumption among the AxAs: AxPPA, COPY, ETAI, and LOA. AxPPA-LF with K = 16 achieves the following reductions in design metrics compared to the exact adder: 1) energy savings of 85.43% and 2) area savings of 60.05%. With K = 1, it achieves: 1) energy savings of 72.63%; 2) area savings of 40.41%; and 3) a small error value of 6.5 × 10^−4 in terms of MRED. AxPPA-LF with K = 12 achieves the following mean reductions in design metrics compared to COPY and Trunc: 1) energy savings of 11.34%; 2) area savings of 2.29%; and 3) an error reduction with MRED = 36. Conversely, since the approximate part of COPY and Trunc has no logic gates, the range K = 13–16 brings little area savings compared with AxPPA. Note that, at K = 16, Trunc has 7.76% area savings compared to AxPPA. Notably, the AxPPA-LF offers the best MRED-quality trade-offs, significantly reducing circuit area and energy consumption.

Fig. 5. Comparison of carry prediction error rates of AxAs with K = 8, 14, and 16.

The carry prediction error rate is the percentage of cycles in which the output value deviates from the correct value. Fig. 5 shows an average error rate about 1.67 times higher for the 8-, 14-, and 16-bit adders from the literature compared to AxPPA. The lack of carry prediction in the COPY, ETAI, LOA, Trunc, and VLSPPA versions is noticeable in carry prediction error rates above 60% for K = 16. Note that the carry input of the Trunc adder is set to "0," so for a 16-bit input adder and K = 16, the error rate equals 100%. The ETAI and LOA include AND-based carry speculation, which reduces the error rate to about 25% compared to Trunc. Note that the VLSPPA-SK behaves similarly to Trunc in terms of carry prediction error rate (see Fig. 5). The AxPPA is extremely error resilient: for K = 8, our carry prediction technique has an error rate of 0.27%; for K = 14, the error rate for AxPPA is 14.13%; and for K = 16, it is 23.91%.
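The trend behind these carry-speculation choices can be illustrated in a few lines. The sketch below contrasts a Trunc-style constant-zero carry prediction with a LOA/ETAI-style AND speculation on the most-significant bits of the truncated part; the rates it measures are for uniform random inputs and only illustrate the trend, not the exact hardware figures reported above.

```python
import random

def true_carry(a, b, k):
    # Carry out of the k least-significant bits in the exact addition.
    mask = (1 << k) - 1
    return ((a & mask) + (b & mask)) >> k

def carry_error_rate(predict, k, bits=16, samples=100_000, seed=0):
    # Percentage of input pairs whose predicted carry into the upper
    # (accurate) part differs from the exact carry.
    rng = random.Random(seed)
    errors = 0
    for _ in range(samples):
        a = rng.getrandbits(bits)
        b = rng.getrandbits(bits)
        if predict(a, b, k) != true_carry(a, b, k):
            errors += 1
    return 100.0 * errors / samples

# Trunc-style: carry into the upper part tied to constant 0.
zero_pred = lambda a, b, k: 0
# LOA/ETAI-style: speculate the carry as the AND of the MSBs of the
# truncated (lower) parts of the two operands.
and_pred = lambda a, b, k: ((a >> (k - 1)) & (b >> (k - 1))) & 1

zero_rate = carry_error_rate(zero_pred, k=8)
and_rate = carry_error_rate(and_pred, k=8)
```

For uniform random inputs, the constant-zero prediction misses roughly half of the time, while the AND speculation misses in roughly a quarter of the cases, consistent with the "about 25%" behavior noted for ETAI and LOA.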

Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:48:03 UTC from IEEE Xplore. Restrictions apply.
24 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 31, NO. 1, JANUARY 2023


Fig. 6. MAE for our AxPPA against the literature AxAs: (a) savings of energy and delay versus quality trade-off and (b) savings of circuit area versus
quality trade-off. The baseline employed for the savings calculation is the exact adder automatically selected by the synthesis tool.

Fig. 7. Case study I results: our AxPPA into an FIR filter accelerator. Average SNR versus the savings in (a) energy and (b) circuit area. The baseline FIR
filter accelerator employs exact adders automatically selected by the synthesis tool.
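The average SNR plotted in Fig. 7 compares the exact and approximate filter outputs. A minimal sketch of such a metric (signal power over approximation-noise power, in dB) could look as follows; this is an illustration with hypothetical stand-in signals, not the authors' evaluation code or the recorded audio of [61].

```python
import math

def snr_db(exact, approx):
    # SNR in dB: power of the exact filtered signal over the power of
    # the approximation-induced noise (exact minus approximate output).
    signal = sum(x * x for x in exact)
    noise = sum((x - y) ** 2 for x, y in zip(exact, approx))
    return float("inf") if noise == 0.0 else 10.0 * math.log10(signal / noise)

# Toy stand-in signals (hypothetical, not the audio samples of [61]):
exact_out = [math.sin(i / 10.0) for i in range(1000)]
approx_out = [x + 1e-4 for x in exact_out]   # small additive approximation noise
quality = snr_db(exact_out, approx_out)
```

A design passes the 60-dB bound imposed in the text when this ratio stays above 60 for the averaged recordings.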

2) FPGA-Based Results: Fig. 6(a) shows the energy-delay-quality trade-off and Fig. 6(b) shows the lookup table (LUT)-quality trade-off for our AxPPA proposal structure and other AxAs of the literature, with variations in the approximation step (K = 1–16). The Pareto fronts in Fig. 6 summarize the approximation-quality trade-off, relating MAE to the energy and area synthesis results. AxPPA results in LUT savings of up to 64% and achieves energy and delay savings of up to 71% and 86%, respectively, compared to the exact version. The FPGA-based results show that the VLSPPA has the highest energy consumption among the AxAs: AxPPA, COPY, ETAI, and LOA. AxPPA-KS with K = 16 achieves the following reductions in design metrics compared to the exact adder: 1) energy savings of 70.85%; 2) LUT savings of 63.41%; and 3) delay savings of 85.91%. Compared to the Trunc adder, the savings are: 1) energy savings of 5.39%; 2) an error reduction of 99.76%; and 3) delay savings of 17.80%. AxPPA-LF with K = 12 achieves the following mean reductions in design metrics compared to COPY: 1) energy savings of 19.56% and 2) an error reduction of 50.11%. Since the approximate part of COPY and Trunc does not contain logic gates, the range K = 13–16 implies only a small LUT saving compared to AxPPA. Note that, at K = 16, Trunc has 7.32% area savings compared to AxPPA. Notably, the AxPPA-LF offers the best MAE-quality trade-offs, significantly reducing LUT, delay, and energy consumption.

B. AxPPA Embedded Into Case Study Applications

This section offers the results of the case studies used to assess the impact of our AxPPA proposals.

1) Case Study I-FIR Filter Accelerator: The FIR accelerator case study is a passband filter with 40 taps, with 0.1 normalized passband and 0.2 normalized stopband, generated by the Remez algorithm (as in [57]) from MATLAB in the transposed form, as described in [58] and [59]. The input is multiplied by a set of constant coefficients, and the constant multiplications are realized in parallel using additions/subtractions and shifting operations. The implemented FIR uses the multiple constant multiplication (MCM) optimization [60], which finds the minimum number of addition/subtraction operations that implement the constant multiplications, described in Verilog with 2's complement representation. The filter coefficients and output samples are 16 and 32 bits, respectively.

We employ a signal-to-noise ratio (SNR) metric to evaluate the DSE for the FIR filter case. The SNR represents the ratio of the precisely filtered signals over the approximation errors. The inputs of the FIR are 661 794 samples of four recorded 16-bit audio signals with a sampling frequency of

MACEDO AZEVEDO DA ROSA et al.: AxPPA: APPROXIMATE PARALLEL PREFIX ADDERS 25

Fig. 8. Case study II results: our AxPPA into an SSD accelerator. NCC versus the savings in (a) energy and (b) circuit area. The baseline SSD accelerator
employs exact adders automatically selected by the synthesis tool.
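The NCC quality metric used in Fig. 8 can be illustrated with the classical zero-mean normalized cross correlation between two equal-sized pixel blocks. The sketch below assumes that standard formulation and is not the authors' exact implementation, which evaluates whole video outputs.

```python
import math

def ncc(block_a, block_b):
    # Zero-mean normalized cross correlation between two equal-sized
    # pixel blocks; 1.0 means identical structure, -1.0 inverted.
    n = len(block_a)
    mean_a = sum(block_a) / n
    mean_b = sum(block_b) / n
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(block_a, block_b))
    den = math.sqrt(
        sum((a - mean_a) ** 2 for a in block_a)
        * sum((b - mean_b) ** 2 for b in block_b)
    )
    # Degenerate constant blocks carry no structure; treat as a match.
    return num / den if den else 1.0
```

A value near 1.0 indicates that the approximate accelerator output closely tracks the exact one; the 0.95–0.99 range discussed in the text corresponds to nearly identical outputs.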

22.05 kHz [61]. Fig. 7(a) shows the energy-quality trade-off and Fig. 7(b) shows the area-quality trade-off for our proposed AxPPA in the FIR architecture filtering the audio signals. We also use AxAs such as the classical truncation of the LSBs (Trunc) and other topologies known from the literature (Copy adder, LOA, ETAI, and VLSPPA) to compare accuracy and energy, with variations in the approximation level (K = 1–8). The SNR information extracted from real signals helps validate the loss of accuracy introduced by our AxPPA proposal in the FIR filter topology. Fig. 7 presents the evaluation of the approximate filter's quality-energy and quality-power trade-offs. The results show that there is no filter design with an average SNR lower than the minimum bound of 60 dB we imposed for the noise generated by the approximate operations. The x-axis of Fig. 7 reports the total energy improvement with respect to the exact design, while the y-axis reports the average SNR over the four recordings; each dot corresponds to a different AxA and is labeled with the corresponding normalized energy. We offer a Pareto front in Fig. 7, where the AxPPA-LF (K = 1–7) is among the optimal solutions, ranging from 68.62% to 78.15% in energy savings. Note that our AxPPA proposal also shows the best trade-off when embedded in the FIR case. The AxPPA-LF stands out among the proposed AxPPAs, presenting circuit area savings of up to 70% over the other AxAs. The AxPPA-LF offers excellent balances between error and area and energy savings.

2) Case Study II-SSD Accelerator: The SSD architectures operating at 543.47 MHz allow the motion estimation block to achieve real-time processing at 1920 × 1080 pixel resolution and 30 frames/s [62], on average, for common test conditions [63]. SSD evaluates pixel-by-pixel intensity differences between two image blocks and calculates the sum of the squared differences between their pixels. The SSD architecture employed as the case study processes eight pixel differences per cycle and is composed of eight squarer units, eight adders, and eight subtractors. The details of the SSD hardware accelerator can be found in [64]. We perform a data-driven energy dissipation extraction employing real-world video sequences. We extract the input pixels from the motion estimation block of the high efficiency video coding (HEVC) reference software (HM) [65]. Eight million input pixels are extracted for each video sequence to stimulate the architectures in a post-synthesis timed simulation. Each input line has 16 pixels (eight pixels from the original block and eight from the reference block). We configured the HM software with the low-delay standard preset and a quantization parameter (QP) of 32. We employed the BasketballDrive (BD), BQTerrace (BT), HoneyBee (HB), and Jockey (JO) 8-bit video sequences with 1080p resolution [63], totaling 400 000 stimuli on the SSD inputs. SystemVerilog describes our testbench and performs the data-driven timed simulation in the Cadence Incisive tool.

Fig. 8(a) shows the energy-quality trade-off and Fig. 8(b) shows the circuit area-quality trade-off for our AxPPA proposals and other AxAs of the literature in the SSD video accelerator, with variations in the approximation level (K = 1–8). We employ a normalized cross correlation (NCC) metric to evaluate the AxPPA in the SSD video accelerator. NCC is, by definition, the inverse Fourier transform of the convolution of the Fourier transforms of two videos or images [66]. The NCC analysis is measured by considering the difference in quality between an accurate SSD video accelerator and each approximate one. The quality analysis step considers the same four sequences applied to the SSD inputs in the synthesis process. Each of the four videos (BD, BT, HB, and JO) has 100 000 input vectors delivered in concatenated form, totaling 400 000 stimuli, given to the 16-input SSD video accelerator. In a co-simulation process between the ModelSim and MATLAB tools, we tested the outputs of the SSD video accelerator with an approximate sum tree against the quality of an exact SSD.

Concerning quality, according to [67], the NCC results for the approximate SSD accelerators can be in the range of 0.95–0.99. The AxPPA proposal is extremely error-resilient. The AxPPA guarantees the quality range from 0.95 to 0.99, supporting up to 6-bit approximation (K = 6). AxPPA-LF has the best energy- and area-quality savings result among the four analyzed AxPPAs. AxPPA-LF with K = 6 has energy savings of 76.31% and area savings of 72.46%, compared with the exact adder automatically selected by the synthesis tool. AxPPA-LF with K = 6 has energy savings of 43.34% and area savings of 11.50%, compared with the AxA Copy with K = 6. The same comparison with the other analyzed AxAs, for


K = 6, leads to energy savings of 94.23% and area savings of 23.35%.

The work [68] proposes SSD accelerators with exact adders using the PPAs Brent–Kung, Kogge–Stone, Ladner–Fischer, and Sklansky, under the same synthesis conditions as our work. The syntheses considered the low-power ST 65-nm commercial standard cell library with a 1.25-V supply voltage and a frequency of 543.47 MHz. The total power savings averaged 316 µW. The proposed approximate SSD accelerators achieve an average total power dissipation of 10 µW. Thus, our proposed approximate accelerators achieve significant power savings and ensure good information quality.

VI. CONCLUSION

This article proposed the AxPPA, a new approximate parallel prefix adder architecture. The developed approximation technique approximates the prefix operators (POs) of the prefix computation step. We evaluated our AxPPA proposals in an application-independent manner and in specific case studies. In the application-independent case studies, our proposal outperforms the energy-efficient AxAs in the literature considering two well-known metrics: MAE and MRED. In the application-specific case studies, we demonstrated the better efficiency of our proposal in terms of circuit area and power-quality trade-off when embedded in an SSD video accelerator and an FIR-filter signal processing accelerator. Our AxPPA proposal shows the best savings in synthesis results in both case studies compared to the consolidated AxAs in the literature. AxPPA achieves excellent quality standards and supports higher approximation levels. In particular, AxPPA-LF offers the best trade-off between approximation and quality, resulting in significant energy savings for both the application-independent (MAE, MRED) and the specific case studies (i.e., FIR and SSD).

REFERENCES

[1] M. Pashaeifar, M. Kamal, A. Kusha, and M. Pedram, "A theoretical framework for quality estimation and optimization of DSP applications using low-power approximate adders," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 327–340, Jan. 2019.
[2] P. Stanley-Marbell et al., "Exploiting errors for efficiency: A survey from circuits to applications," ACM Comput. Surv., vol. 53, no. 3, pp. 1–39, Jun. 2020.
[3] H. B. Seidel, M. M. A. da Rosa, G. Paim, E. A. C. da Costa, S. J. M. Almeida, and S. Bampi, "Approximate pruned and truncated Haar discrete wavelet transform VLSI hardware for energy-efficient ECG signal processing," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 5, pp. 1814–1826, May 2021.
[4] P. Pereira et al., "Energy-quality scalable design space exploration of approximate FFT hardware architectures," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 11, pp. 4524–4534, Nov. 2022.
[5] G. Paim, L. M. G. Rocha, G. M. Santana, L. B. Soares, E. A. C. da Costa, and S. Bampi, "Power-, area-, and compression-efficient eight-point approximate 2-D discrete Tchebichef transform hardware design combining truncation pruning and efficient transposition buffers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 2, pp. 680–693, Feb. 2019.
[6] L. Soares, J. Oliveira, E. Costa, and S. Bampi, "An energy-efficient and approximate accelerator design for real-time Canny edge detection," Circuits Syst. Signal Process., vol. 39, no. 12, pp. 6098–6120, 2020.
[7] G. Paim, H. Amrouch, E. A. C. da Costa, S. Bampi, and J. Henkel, "Bridging the gap between voltage over-scaling and joint hardware accelerator-algorithm closed-loop," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 1, pp. 398–410, Jan. 2022.
[8] Z. G. Tasoulas, G. Zervakis, I. Anagnostopoulos, H. Amrouch, and J. Henkel, "Weight-oriented approximation for energy-efficient neural network inference accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4670–4683, Dec. 2020.
[9] B. Silveira et al., "Power-efficient sum of absolute differences hardware architecture using adder compressors for integer motion estimation design," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 12, pp. 3126–3137, Aug. 2017.
[10] K. M. Reddy, M. H. Vasantha, Y. B. N. Kumar, and D. Dwivedi, "Design of approximate Booth squarer for error-tolerant computing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 5, pp. 1230–1241, May 2020.
[11] M. M. A. da Rosa, G. Paim, R. I. Soares, J. Castro-Godínez, E. A. C. da Costa, and S. Bampi, "AxRSU: Approximate radix-4 squarer unit," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Austin, TX, USA, Jun. 2022, pp. 1–4.
[12] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[13] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi, and F. Lombardi, "Design and evaluation of approximate logarithmic multipliers for low power error-tolerant applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 9, pp. 2856–2868, Sep. 2018.
[14] H. Jiang, L. Liu, P. P. Jonker, D. G. Elliott, F. Lombardi, and J. Han, "A high-performance and energy-efficient FIR adaptive filter using approximate distributed arithmetic circuits," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 313–326, Jan. 2019.
[15] A. G. M. Strollo, E. Napoli, D. De Caro, N. Petra, and G. Di Meo, "Comparison and extension of approximate 4-2 compressors for low-power approximate multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3021–3034, Sep. 2020.
[16] P. da Costa, P. T. L. Pereira, B. A. Abreu, G. Paim, E. da Costa, and S. Bampi, "Improved approximate multipliers for single-precision floating-point hardware design," in Proc. IEEE 13th Latin Amer. Symp. Circuits Syst. (LASCAS), Mar. 2022, pp. 1–4.
[17] N. Arya, M. Pattanaik, and G. K. Sharma, "Energy-efficient logarithmic square rooter for error-resilient applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 11, pp. 1994–1997, Nov. 2021.
[18] G. Paim, P. Marques, E. Costa, S. Almeida, and S. Bampi, "Improved Goldschmidt algorithm for fast and energy-efficient fixed-point divider," in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2017, pp. 482–485.
[19] V. Guidotti, G. Paim, L. M. G. Rocha, E. Costa, S. Almeida, and S. Bampi, "Power-efficient approximate Newton–Raphson integer divider applied to NLMS adaptive filter for high-quality interference cancelling," Circuits, Syst., Signal Process., vol. 39, no. 11, pp. 5729–5757, Nov. 2020.
[20] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[21] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, "Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[22] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[23] J. Lee, H. Seo, H. Seok, and Y. Kim, "A novel approximate adder design using error reduced carry prediction and constant truncation," IEEE Access, vol. 9, pp. 119939–119953, 2021.
[24] K.-L. Tsai, Y.-J. Chang, C.-H. Wang, and C.-T. Chiang, "Accuracy-configurable radix-4 adder with a dynamic output modification scheme," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 8, pp. 3328–3336, Aug. 2021.
[25] N. Zhu, W.-L. Goh, and K.-S. Yeo, "An enhanced low-power high-speed adder for error-tolerant application," in Proc. 12th IEEE Int. Symp. Integr. Circuits (ISIC), Dec. 2009, pp. 69–72.
[26] A. K. Verma, P. Brisk, and P. Ienne, "Variable latency speculative addition: A new paradigm for arithmetic circuit design," in Proc. Design, Autom. Test Eur., Jun. 2008, pp. 1250–1255.


[27] L. Soares, M. da Rosa, C. Machado, E. da Costa, and S. Bampi, "Design methodology to explore hybrid approximate adders for energy-efficient image and video processing accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 6, pp. 2137–2150, Jun. 2019.
[28] V. Pudi, K. Sridharan, and F. Lombardi, "Majority logic formulations for parallel adder designs at reduced delay and circuit complexity," IEEE Trans. Comput., vol. 66, no. 10, pp. 1824–1830, Oct. 2017.
[29] Y. Ma, S. Roy, J. Miao, J. Chen, and B. Yu, "Cross-layer optimization for high speed adders: A Pareto driven machine learning approach," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 12, pp. 2298–2311, Dec. 2019.
[30] T.-D. Ene and J. E. Stine, "A comprehensive exploration of the parallel prefix adder tree space," in Proc. IEEE 39th Int. Conf. Comput. Design (ICCD), Oct. 2021, pp. 125–129.
[31] R. Roy et al., "PrefixRL: Optimization of parallel prefix circuits using deep reinforcement learning," in Proc. 58th ACM/IEEE Design Autom. Conf. (DAC), Dec. 2021, pp. 853–858.
[32] J. Miao, K. He, A. Gerstlauer, and M. Orshansky, "Modeling and synthesis of quality-energy optimal approximate adders," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2012, pp. 728–735.
[33] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, "IMPACT: IMPrecise adders for low-power approximate computing," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 2011, pp. 409–414.
[34] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, "A low latency generic accuracy configurable adder," in Proc. 52nd Annu. Design Autom. Conf., New York, NY, USA, 2015, pp. 1–6, doi: 10.1145/2744769.2744778.
[35] A. B. Kahng and S. Kang, "Accuracy-configurable adder for approximate arithmetic designs," in Proc. 49th Annu. Design Autom. Conf. (DAC), Jun. 2012, pp. 820–825.
[36] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, "On reconfiguration-oriented approximate adder design and its application," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2013, pp. 48–54.
[37] S. Rehman, W. El-Harouni, M. Shafique, A. Kumar, and J. Henkel, "Architectural-space exploration of approximate multipliers," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2016, pp. 1–8.
[38] S.-L. Lu, "Speeding up processing with approximation circuits," Computer, vol. 37, no. 3, pp. 67–73, Mar. 2004.
[39] N. Zhu, W. L. Goh, G. Wang, and K. S. Yeo, "An enhanced low-power high-speed adder for error-tolerant application," in Proc. Int. SoC Design Conf., Nov. 2009, pp. 323–327.
[40] Y. Kim, Y. Zhang, and P. Li, "An energy efficient approximate adder with carry skip for error resilient neuromorphic VLSI systems," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2013, pp. 130–137.
[41] K. Du, P. Varman, and K. Mohanram, "High performance reliable variable latency carry select addition," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2012, pp. 1257–1262.
[42] M. Macedo, L. Soares, B. Silveira, C. M. Diniz, and E. A. C. da Costa, "Exploring the use of parallel prefix adder topologies into approximate adder circuits," in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2017, pp. 298–301.
[43] P. T. L. Pereira, G. Paim, G. Ferreira, E. Costa, S. Almeida, and S. Bampi, "Exploring approximate adders for power-efficient harmonics elimination hardware architectures," in Proc. IEEE 12th Latin Amer. Symp. Circuits Syst. (LASCAS), Feb. 2021, pp. 1–4.
[44] S. Mazahir, O. Hasan, R. Hafiz, M. Shafique, and J. Henkel, "An area-efficient consolidated configurable error correction for approximate hardware accelerators," in Proc. 53rd ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2016, pp. 1–6.
[45] G. Paim, L. M. G. Rocha, H. Amrouch, E. A. C. da Costa, S. Bampi, and J. Henkel, "A cross-layer gate-level-to-application co-simulation for design space exploration of approximate circuits in HEVC video encoders," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 10, pp. 3814–3828, Oct. 2020.
[46] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, "Variable latency speculative Han-Carlson adder," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 5, pp. 1353–1361, May 2015.
[47] D. Esposito, D. De Caro, and A. G. M. Strollo, "Variable latency speculative parallel prefix adders for unsigned and signed operands," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 8, pp. 1200–1209, Aug. 2016.
[48] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[49] S. Roy, M. Choudhury, R. Puri, and D. Z. Pan, "Towards optimal performance-area trade-off in adders by synthesis of parallel prefix structures," in Proc. 50th ACM/EDAC/IEEE Design Autom. Conf. (DAC), May 2013, pp. 1–8.
[50] S. Daphni and K. S. V. Grace, "A review analysis of parallel prefix adders for better performance in VLSI applications," in Proc. IEEE Int. Conf. Circuits Syst. (ICCS), Dec. 2017, pp. 103–106.
[51] N. H. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. London, U.K.: Pearson, 2015.
[52] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[53] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[54] J. Sklansky, "Conditional-sum addition logic," IEEE Trans. Electron. Comput., vol. EC-9, no. 2, pp. 226–231, Jun. 1960.
[55] R. Ladner and M. Fischer, "Parallel prefix computation," J. ACM, vol. 27, pp. 831–838, Oct. 1980.
[56] B. Alhazmi and F. Gebali, "Fast large integer modular addition in GF(p) using novel attribute-based representation," IEEE Access, vol. 7, pp. 58704–58719, 2019.
[57] G. Paim, R. S. Ferreira, L. M. G. Rocha, E. A. C. da Costa, T. G. Alves, and S. Bampi, "A power-predictive environment for fast and power-aware ASIC-based FIR filter design," in Proc. 30th Symp. Integr. Circuits Syst. Design Chip Sands (SBCCI), 2017, pp. 168–173.
[58] L. B. Soares, S. Bampi, and E. Costa, "Approximate adder synthesis for area- and energy-efficient FIR filters in CMOS VLSI," in Proc. IEEE 13th Int. New Circuits Syst. Conf. (NEWCAS), Jun. 2015, pp. 1–4.
[59] G. Paim, L. Soares, J. F. R. Oliveira, E. Costa, and S. Bampi, "A power-efficient imprecise radix-4 multiplier applied to high resolution audio processing," in Proc. IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2016, pp. 261–264.
[60] L. Aksoy, E. Günes, and P. Flores, "Search algorithms for the multiple constant multiplications problem: Exact and approximate," Microprocess. Microsyst., vol. 34, no. 5, pp. 151–162, Aug. 2010.
[61] G. Tzanetakis, G. Essl, and P. Cook. (2001). Automatic Musical Genre Classification of Audio Signals. [Online]. Available: http://ismir2001.ismir.net/pdf/tzanetakis.pdf
[62] I. Seidel, M. Monteiro, J. Güntzel, and L. Agostini, "Squarer exploration for energy-efficient sum of squared differences," in Proc. IEEE 7th Latin Amer. Symp. Circuits Syst. (LASCAS), Feb. 2016, pp. 327–330.
[63] F. Bossen, "Common test conditions and software configurations," document L1100, JCTVC, Jan. 2013.
[64] G. Paim et al., "On the resiliency of NCFET circuits against voltage over-scaling," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 4, pp. 1481–1492, Mar. 2022.
[65] HM. (2017). HEVC Test Model (HM) V. 16.7. [Online]. Available: http://hevc.hhi.fraunhofer.de
[66] K. Briechle and U. D. Hanebeck, "Template matching using fast normalized cross correlation," in Optical Pattern Recognition XII, vol. 4387. SPIE, Mar. 2001, pp. 95–102.
[67] B. Mohamad, S. Yaakob, R. A. A. Raof, A. Nazren, and M. W. Nasrudin, "Template matching using sum of squared difference and normalized cross correlation," in Proc. IEEE Student Conf. Res. Develop. (SCOReD), Dec. 2015, pp. 100–104.
[68] M. M. A. da Rosa, G. Paim, L. M. G. Rocha, E. A. C. da Costa, and S. Bampi, "Exploring efficient adder compressors for power-efficient sum of squared differences design," in Proc. 27th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Nov. 2020, pp. 1–4.

Morgana Macedo Azevedo da Rosa (Student Member, IEEE) received the Diploma degree in computer engineering from the Universidade Católica de Pelotas (UCPel), Pelotas, Brazil, in 2019, with a Dom Antônio Zattera First Place Diploma in the Computer Engineering Course, and the master's degree in electronic engineering and computing from UCPel in 2020. She is currently working toward the Doctoral degree in computer science at the Universidade Federal de Pelotas, CAPES, Pelotas.
She is interested in researching low-power arithmetic operators and dedicated architectures for cryptography and digital signal and image processing.


Guilherme Paim (Member, IEEE) received the B.Eng. degree (Hons.) in electronics engineering from the Universidade Federal de Pelotas (UFPel), Pelotas, Brazil, in 2015, and the Ph.D. degree (summa cum laude) in microelectronics from the Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 2021.
He is currently a Research Team Leader with UFRGS in a postdoctoral fellowship at the Instituto de Engenharia de Sistemas e Computadores-Investigação e Desenvolvimento (INESC-ID), Lisbon, Portugal. His research interests are energy-efficient VLSI design and cross-layer approximate computing.

Patrícia Ücker Leleu da Costa (Student Member, IEEE) received the B.Eng. degree in electronics engineering from the Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul, Brazil, in 2018, and the M.Sc. degree in electronic engineering and computing from the Universidade Católica de Pelotas, Pelotas, in 2020. She is currently working toward the Ph.D. degree at the Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.
Her research interests are low-power VLSI architectures, arithmetic operators, and digital signal processing.

Eduardo Antonio César da Costa (Member, IEEE) received the five-year B.Eng. degree in electrical engineering from the University of Pernambuco, Recife, Brazil, in 1988, the M.Sc. degree in electrical engineering from the Federal University of Paraíba, Campina Grande, Paraíba, Brazil, in 1991, and the Ph.D. degree in computer science from the Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil, in 2002.
Part of his doctoral work was developed at the Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Lisbon, Portugal. He is currently a Full Professor with the Universidade Católica de Pelotas (UCPel), Pelotas, Brazil, where he is the Co-Founder and the Coordinator of the Graduate Program on Electronic Engineering and Computing. His research interests are VLSI architectures and low-power design.

Rafael I. Soares (Member, IEEE) received the B.Eng. degree in computer engineering from the Federal University of Rio Grande (FURG), Rio Grande, Brazil, in 2004, and the master's and Doctorate degrees in computer science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Porto Alegre, Brazil, in 2006 and 2010, respectively.
He held a Doctorate Sandwich at the Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Montpellier, France, from 2007 to 2008. He is currently an Adjunct Professor with the Universidade Federal de Pelotas (UFPel), Pelotas, Brazil. He has experience in computer science, with an emphasis on hardware, working mainly on the following topics: digital systems, field-programmable gate arrays (FPGAs), rapid prototyping of digital systems, dynamic reconfiguration, reconfigurable architectures, non-synchronous circuit design, encryption, side-channel attacks (SCAs), and countermeasures to SCAs.

Sergio Bampi (Senior Member, IEEE) received the Electronics Engineer and B.Sc. degrees in physics from the Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil, in 1979, and the M.S.E.E. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 1982 and 1986, respectively.
He is currently a Full Professor with the Informatics Institute, Universidade Federal do Rio Grande do Sul, which he joined in 1981. He has published more than 440 research papers in conferences and journals in the fields of CMOS analog, digital, and RF circuit design; video coding algorithms; and hardware architectures.
Dr. Bampi served as the Technical Program Chair for the Symposium on Integrated Circuits and Systems Design (SBCCI) in 1997 and 2005, the IEEE Circuits and Systems Society in Latin America (LASCAS) in 2013, the International Workshop on CMOS Variability (VARI) in 2015, and the Sociedade Brasileira de Microeletrônica (SBMICRO) Congress in 1989 and 1995; and served on the TPC committees of the International Conference on Computer-Aided Design (ICCAD), the Design Automation Conference (DAC), SBCCI, the International Congress of Mathematicians (ICM), LASCAS, VLSI-SoC, the International Conference on Circuits and Systems (ICECS), the IEEE Interregional NEWCAS Conference (NEWCAS), and many other international conferences. He was the President of the Brazilian Microelectronics Society, the President of the Fundação de Amparo à Pesquisa do Estado do RS (FAPERGS) Brazilian research funding agency, and the Technical Director of the Centro Nacional de Tecnologia Eletrônica Avançada (CEITEC). He was a Distinguished Lecturer of the IEEE Circuits and Systems Society (CAS) from 2009 to 2010.