FPGA-Based Multiplier With A New Approximate Full Adder For Error-Resilient Applications
Abstract— Electronic devices primarily aim to offer low power consumption, high speed, and a compact area. The performance of very large-scale integration (VLSI) devices is influenced by arithmetic operations, among which multiplication is crucial. Therefore, a high-speed multiplier is essential for developing any signal-processing module. Numerous multipliers have been reviewed in the existing literature, and their speed is largely determined by how partial products (PPs) are accumulated. To enhance the speed of multiplication beyond current methods, an approximate adder-based multiplier is introduced. This approach allows for the simultaneous addition of PPs from two consecutive bits using a novel approximate adder. The proposed multiplier is utilized in a mean filter structure, implemented in ISE Design Suite 14.7 using VHDL, and synthesized on the Xilinx Spartan3-XC3S400 FPGA board. Compared to the literature, the proposed multiplier achieves power and power-delay product (PDP) improvements of 56.09% and 73.02%, respectively. The validity of the proposed multiplier is demonstrated through the mean filter system, where it achieves power savings of 33.33%. Additionally, the proposed multiplier provides more accurate results than other approximate multipliers, with higher values of peak signal-to-noise ratio (PSNR) (30.58%) and structural similarity index metric (SSIM) (22.22%), while power consumption remains low.

Keywords—approximate computing, approximate full adder, multiplier, mean filter.

I. INTRODUCTION

In real-time application systems, the key objectives are speed, power efficiency, and area optimization. Multiplication and addition are crucial components in digital signal processors (DSPs), central processing units (CPUs), and digital filters [1]. Multipliers serve diverse functions, influenced by the specific constraints of each application. Therefore, it is important to analyze the area, power consumption, and delay characteristics of the multipliers utilized in signal processing tasks [2]. In numerous error-tolerant applications, like multimedia, image processing, and machine learning, exact computations are not always required [1]-[4], and approximate computing is an effective approach. Approximate computing reduces power consumption and improves the performance of embedded systems. By allowing some errors in the outputs of a complex circuit, the logic expressions are simplified, which in turn decreases the logic count. Approximate arithmetic cells require fewer logic gates, resulting in lower power consumption at the cost of accuracy.

Recently, researchers have designed various arithmetic circuits [4], such as full adders (FAs), subtractors, compressors, and multipliers. The FA is a crucial component in arithmetic cells. Utilizing FAs and compressors to add partial products (PPs) simplifies the circuit design. However, replacing exact FAs with approximate alternatives can yield significant benefits in circuit performance. In high-speed multipliers, both compressors and FAs facilitate faster accumulation of PPs. In [5], an approximate 4:2 compressor was proposed for adding PPs, offering enhanced error metrics. However, this approach leads to a substantial increase in circuit area. The authors utilized both their proposed exact compressors and approximate compressors to create configurable dual-quality multipliers. However, their approach does not apply to field programmable gate arrays (FPGAs) because current commercial FPGAs lack power gating capabilities. Moreover, their 4:2 compressor still requires significant hardware resources and power, resulting in moderate error rates when implemented on FPGAs.

Based on the discussions above, it is evident that using the presented approximation techniques for developing various types of multipliers leads to improvements in performance, energy efficiency, and area reduction. However, controlling their application is challenging. Therefore, approximation techniques that have been successfully applied in application-specific integrated circuits (ASICs) yield limited advantages when transferred to FPGAs. Xilinx and Intel FPGAs offer fast DSP-based multipliers suitable for low-power digital signal processing applications; however, utilizing these multiplier intellectual properties (IPs) leads to substantial routing delays because they are only accessible at certain locations on the FPGA. Recently, [6] proposed optimizing the multipliers by removing the least significant PP of a 4×2 multiplier to conserve look-up tables (LUTs), subsequently using this component to construct multipliers with larger operand sizes. This strategy, however, provided only marginal gains in area and power efficiency.

To address these challenges, this research focuses on the design and analysis of a new approximate FA, which is applied in a multiplier to obtain an approximate multiplier with various accuracy levels. The introduced approximate multiplier is optimized for effective implementation on FPGAs, achieving low power consumption, minimal area, and low latency, along with minimal accuracy loss. Consequently, this multiplier is well-suited for digital signal-processing applications like image processing. The multiplier is compared with state-of-the-art designs in terms of delay, power consumption, area, and power-delay product (PDP).
This paper is arranged as follows: Section II presents the proposed circuits. Section III analyzes and implements the new approximate designs and presents the error analysis. Section IV discusses the application and performance evaluation, and Section V concludes the paper.

II. PROPOSED CIRCUITS

A. Presented Approximate FA

The FA is a core unit for the performance enhancement of digital systems. Various FA techniques are utilized in intermediate modules to generate sum and carry outputs [7]-[8]. A primary limitation of FAs is their operational speed, which necessitates a focus on minimizing delay. Approximate computing offers a trade-off between accuracy and improvements in circuit parameters. Many applications that require significant computational resources are inherently error resilient, given the limitations of human visual perception or the absence of a definitive correct answer for specific problems. Thus, approximate computing can effectively enhance digital hardware specifications like area, power, and speed for these error-tolerant applications. In this paper, an approximate FA is proposed. As shown in the truth table of Fig. 1, the Sum and Cout outputs of the FA exhibit 4 and 2 errors, respectively. The gate-level structure is just one OR gate, and an 8-bit ripple carry adder (RCA) is considered for the performance evaluation. Here, the Sum is generated with only an OR gate and Cout is equal to the B input. The functions of the FA are:

$Sum = A \lor C_{in}$ (1)

$C_{out} = B$ (2)

The error distance (ED), error rate (ER), and normalized mean error distance (NMED) for the FA are 1 (in magnitude), 0.5 (4 of the 8 input combinations), and 0.166, respectively.

Truth table ("-" marks a row with no error):

A  B  Cin   Cout  Sum   ED
0  0  0     0     0     -
0  0  1     0     1     -
0  1  0     1     0     +1
0  1  1     1     1     +1
1  0  0     0     1     -
1  0  1     0     1     -1
1  1  0     1     1     +1
1  1  1     1     1     -

Gate level: Sum = A OR Cin; Cout = B.

8-bit RCA: eight proposed full adders in a ripple chain, with inputs A0 to A7 and B0 to B7, carries C0 (Cin) to C7, sum outputs S0 to S7, and Cout.

Fig. 1. Truth table, gate level, and 8-bit RCA of the proposed approximate FA.

Therefore, one objective of this FA is to minimize the occurrences of ED=1. Extensive literature indicates that an ED of ≥ 2 can significantly impair the overall accuracy of a circuit, particularly in more complex structures such as multipliers. In the proposed circuit, however, the ED is limited to 1, which leads to a lower NMED and indicates better accuracy. Having fewer gates on the critical paths, and no inverters, reduces static power and delay. In the 8-bit RCA of Fig. 1, the number of approximate bits (NAB) is varied to evaluate the performance of the approximate FAs. NAB1 indicates that an approximate FA is applied solely to the least significant bit (LSB). For NAB2, approximate FAs are used for the two LSBs, and this pattern continues up to NAB8, where they are applied to every bit up to the most significant bit (MSB).

B. Approximate Multiplier

Multiplication is a fundamental arithmetic operation in various applications like finite impulse response (FIR) filters, discrete cosine transform (DCT), fast Fourier transform (FFT), and multimedia processing. To achieve optimal quality and accuracy in the output data of signal processing modules, it is essential to enhance the speed, area, and power efficiency of the arithmetic modules [9]-[11]. Consequently, the multiplication module is crucial for reducing computation delay and improving system speed [12]. Fig. 2 illustrates the architecture of a typical 8×8 multiplier, which performs multiplication in three primary steps: (1) recording and generating PPs, (2) reducing PPs, and (3) accumulating those PPs.

[Figure: the PPs of the 8-bit multiplier (A7, A6, ..., A0) and the 8-bit multiplicand (B7, B6, ..., B0) are reduced by HAs, FAs, proposed approximate FAs, and 4:3, 5:3, 6:3, and 7:3 counters; the last addition is an RCA based on the proposed approximate FA, producing the product bits P0 to P15.]

Fig. 2. The architecture of the proposed 8×8 approximate multiplier.

Multipliers can be categorized based on their partial product reduction (PPR) methods, with common types being linear array multipliers and tree multipliers. In an array multiplier, the multiplication of two binary numbers follows an add-and-shift
method, which involves a regular structure. However, when dealing with a large number of bits, this type of multiplier can incur significant delays and high power consumption due to carry propagation, highlighting the importance of optimizing carry propagation speed (the critical output path) [13]-[14]. Tree multipliers organize PPs in rows or columns, reducing the number of components compared to array types. The accumulation of PPs often limits the multiplication speed. An array multiplier employing a modified FA-based multiplexer was developed to minimize power consumption [15]. A multiplier that incorporates an approximate 4:2 compressor demonstrates significantly reduced delay and area compared to other types of multipliers [16], although it does produce a notable error rate. Given these findings, it is critical to focus on the accumulation process to enhance speed, area, and accuracy.

The initial step of multiplication involves generating PPs through logical AND operations. For an 8×8 multiplier with multiplier A (A0 to A7) and multiplicand B (B0 to B7), the first step involves ANDing the LSB of multiplicand B (B0) with every bit of multiplier A. The proposed approximate multiplier comprises three stages, each utilizing different sizes of adders. The first stage contains two half adders (HAs), two proposed approximate FAs, and several counter structures (4:3, 5:3, 6:3, and three 7:3 counters) [12]. The outputs from this stage are passed into the next stage's adders. The second stage includes three HAs and nine FAs, where the outputs from the first stage are added concurrently, with results sent to a final RCA, which produces the final product of the 8×8 multiplication. By replacing the 4:2 compressor in the proposed multiplier with approximate FAs, fewer gates than traditional structures are required. This reduction contributes to lower power consumption and enhanced speed, area, and accuracy. Thus, in the proposed multiplier, the use of approximate FAs instead of 4:2 compressors is preferred for minimizing critical path delays. By strategically selecting the appropriate combination of counters and FAs, the critical paths can be further optimized, leading to improved overall performance.

III. SIMULATION RESULTS

The proposed approximate circuits and references are described using VHDL and are synthesized and implemented using the Xilinx ISE Design Suite 14.7 on the Xilinx Spartan3-XC3S400 FPGA.

The error metrics MED, MRED, and NMED are defined as:

$MED = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} ED_i$

$MRED = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} \frac{ED_i}{S_i}$

$NMED = \frac{1}{(2^N - 1)^2} \times \frac{\sum_{i=1}^{2^{2N}} ED_i}{2^{2N}}$

Fig. 3. Results of the PDP and NMED for proposed FA and references.

Table I shows the comparison between the presented 8×8 multiplier and previous works in terms of resource utilization and accuracy. The proposed 8×8 multiplier utilizes 71 out of 7168 available 4-input LUTs (0.99%) and 9 out of 7168 flip-flops (0.13%). Compared to [6], the proposed multiplier achieves power and PDP improvements of 56.09% and 73.02%, respectively. As shown in Table I, the proposed multiplier achieves the lowest power, PDP, and MED among the compared designs, while its MRED, NMED, and LUT count remain comparable to those of the other references.

TABLE I. PERFORMANCE ANALYSIS OF THE 8 × 8 MULTIPLIERS.

Design     Power (mW)   Delay (ns)   No. of LUTs   PDP (pJ)   MED      MRED     NMED
M [13]     0.636        5.31         68            3.377      0.0034   0.3676   0.0154
M [14]     0.468        4.84         71            2.265      0.0022   0.0196   0.0034
M [15]     0.744        5.11         101           3.802      0.0011   0.0548   0.0154
M [6]      0.984        8.31         57            8.177      0.0054   0.0029   0.0008
M [16]     0.780        5.37         91            4.189      0.0013   0.0062   0.0020
Proposed   0.432        5.11         71            2.207      0.0010   0.0148   0.0017
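The FA error figures quoted in Section II-A and the MED, MRED, and NMED definitions of this section can be cross-checked with a short exhaustive simulation. The Python sketch below assumes Sum = A OR Cin and Cout = B, as the Fig. 1 truth table suggests, and models an N-bit RCA whose `nab` least-significant stages use the approximate FA. The function names and the choice to normalize the adder NMED by the largest exact sum are illustrative assumptions, not the authors' VHDL.

```python
from itertools import product


def exact_fa(a, b, cin):
    """Conventional full adder; returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def approx_fa(a, b, cin):
    """Approximate FA as read from Fig. 1 (assumed): Sum = A OR Cin, Cout = B."""
    return a | cin, b


def fa_metrics():
    """Error rate, MED and NMED of the approximate FA over its 8 input rows."""
    eds = []
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = approx_fa(a, b, cin)
        eds.append(abs((2 * cout + s) - (a + b + cin)))
    er = sum(1 for ed in eds if ed) / len(eds)
    med = sum(eds) / len(eds)
    return er, med, med / 3  # 3 is the largest exact FA output


def rca(x, y, n_bits=8, nab=0):
    """Ripple-carry adder; the `nab` least-significant stages are approximate."""
    carry, total = 0, 0
    for i in range(n_bits):
        fa = approx_fa if i < nab else exact_fa
        s, carry = fa((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << n_bits)


def error_metrics(n_bits, nab):
    """MED, MRED and NMED of the RCA over all 2^(2N) input pairs."""
    ed_sum, red_sum = 0, 0.0
    for x in range(2 ** n_bits):
        for y in range(2 ** n_bits):
            exact = x + y
            ed = abs(rca(x, y, n_bits, nab) - exact)
            ed_sum += ed
            if exact:  # skip S_i = 0 to avoid dividing by zero
                red_sum += ed / exact
    pairs = 4 ** n_bits
    med = ed_sum / pairs
    max_sum = 2 * (2 ** n_bits - 1)  # adder counterpart of (2^N - 1)^2 (assumption)
    return med, red_sum / pairs, med / max_sum
```

Here `fa_metrics()` returns an error rate of 0.5 and an NMED of about 0.166 for the single FA, and calling `error_metrics(8, nab)` for `nab` from 1 to 8 mirrors the NAB1 to NAB8 sweep described for the 8-bit RCA.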
Fig. 4. Hardware implementation of the mean filter algorithm using the proposed multiplier on an FPGA.
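How a multiplier slots into a mean filter can be illustrated in software. The sketch below is a model built on stated assumptions, not the hardware of Fig. 4: it assumes a 3×3 window, approximates 1/9 by the fixed-point coefficient 57/512 so that each pixel-by-coefficient product fits an 8×8 multiplier, sums the nine products in an adder, and leaves border pixels unchanged. The `mul` hook and the names `mean_filter_3x3` and `psnr` are hypothetical; an approximate multiplier model would be passed in through `mul`.

```python
import math


def mean_filter_3x3(img, mul=lambda a, b: a * b):
    """3x3 mean filter on a 2-D list of 8-bit pixels.

    Each pixel is multiplied by 57 (about 512/9) through `mul`, the nine
    products are summed, and the sum is shifted right by 9 bits.
    Border pixels are copied unchanged (an assumption of this sketch).
    """
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += mul(img[r + dr][c + dc], 57)
            out[r][c] = min(acc >> 9, 255)  # clamp to the 8-bit pixel range
    return out


def psnr(ref, test):
    """Peak signal-to-noise ratio in dB for 8-bit images (inf if identical)."""
    n = len(ref) * len(ref[0])
    mse = sum((p - q) ** 2
              for ref_row, test_row in zip(ref, test)
              for p, q in zip(ref_row, test_row)) / n
    return math.inf if mse == 0 else 10 * math.log10(255 ** 2 / mse)
```

With the default exact `mul`, a uniform image passes through unchanged. Supplying an approximate multiplier model as `mul` and comparing the two outputs with `psnr` gives a software-level view of the accuracy loss.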