Efficient Approximate Parallel Prefix Adder Design
INTRODUCTION
Adders are widely used in digital systems and serve a variety of applications, not only basic
arithmetic operations such as multiplication and decimal addition but also more advanced
applications in cores and accelerators. An adder typically uses the Full-Adder, which is composed
of two Half-Adders, as its basic building block. A Multi-bit Adder is built from multiple
Full-Adders. Various structures for Multi-bit Adders have been proposed, such as the
Ripple Carry Adder (RCA), Carry Lookahead Adder (CLA), Carry Select Adder (CSLA), and
Carry Skip Adder (CSKA). The simplest structure, RCA, connects Full-Adders in a linear
fashion, making it the most area-efficient Multi-bit Adder. However, it suffers from significant
delay because the carry signal must propagate through every stage. To resolve this problem,
structures such as CLA, CSLA, and CSKA were proposed, each with its own method of carry
calculation. To resolve the delay problem more effectively, the Parallel Prefix Adder (PPA)
structure was introduced. A PPA consists of three stages: pre-processing, prefix-processing, and
post-processing. Different PPAs have been proposed based on the configuration of the
prefix-processing stage. PPA offers the advantage of minimal delay compared to traditional serial
adders. However, it requires additional logic for parallel carry calculation, which can make it
worse than serial adders in circuit area and power efficiency. To address these problems,
research has applied Approximate Computing (AxC) to adders to balance accuracy against circuit
area and energy consumption, resulting in the Approximate Adder (AxA). AxC can be applied at
the transistor level, the full-adder level, the multi-bit adder level, and the PPA level. We study
Approximate PPA with AxC applied at the PPA level. A structure with AxC applied to PPA, known
as AxPPA, has been proposed previously. This paper analyzes the limitations of AxPPA and
introduces the Efficient Approximate PPA (EAxPPA).
The second stage, prefix-processing, groups 𝑝 and 𝑔 to generate the final carry signal over
several steps. Prefix-processing is composed of operation blocks called Prefix Operators (POs).
The Boolean equations for the PO are given by (3) and (4). 𝐺 is generated from the previous
node's 𝑔𝑘 and from 𝑝𝑖 and 𝑔𝑖 of the i-th bit. 𝑃 is generated from 𝑝𝑖 of the i-th bit and the
previously calculated 𝑝𝑘. The outputs 𝑃 and 𝐺 of a PO are then connected as inputs to other
POs. Unlike traditional serial adders, the POs in a PPA are connected in parallel over multiple
stages to compute the carry quickly. The last stage, post-processing, combines the 𝑝 from the
pre-processing stage with the individual bit carries 𝐺 calculated in the prefix-processing stage
to generate the final sum 𝑆 of the operands. The Boolean equation for the i-th bit is given by (5).
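Equations (3) to (5) are not reproduced in this text. A form consistent with the description
above is sketched here in the usual prefix-adder notation; the exact notation of the original
equations is an assumption, as is the convention that the carry into bit 0 is zero.

```latex
\begin{align}
G   &= g_i + p_i \cdot g_k  \tag{3} \\
P   &= p_i \cdot p_k        \tag{4} \\
S_i &= p_i \oplus G_{i-1}   \tag{5}
\end{align}
```

Here $+$ is logical OR, $\cdot$ is logical AND, and $G_{i-1}$ is the group generate covering bits
$i-1$ down to $0$, which acts as the carry into bit $i$.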
PPA performance varies with the configuration of the POs, which has led to the proposal of
different PPA types such as Kogge-Stone (KS) PPA, Brent-Kung (BK) PPA, Sklansky (SK) PPA,
and Ladner-Fischer (LF) PPA.
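The three stages can be illustrated with a small bit-level model. The sketch below is not taken
from the original work: it assumes the usual pre-processing definitions (p = a XOR b, g = a AND b),
a carry-in of zero, and a Sklansky arrangement of the POs. The names `prefix_operator` and
`sklansky_add` are hypothetical.

```python
def prefix_operator(upper, lower):
    """Prefix Operator (PO): merge the (G, P) pair of an upper group with a lower group."""
    g_up, p_up = upper
    g_lo, p_lo = lower
    return (g_up | (p_up & g_lo), p_up & p_lo)


def sklansky_add(a, b, width=16):
    """Add two unsigned `width`-bit integers; returns (sum, carry_out)."""
    # Pre-processing: per-bit propagate and generate (assumed p = a XOR b, g = a AND b).
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # Prefix-processing: Sklansky tree, one level per doubling of the group span.
    nodes = list(zip(g, p))
    span = 1
    while span < width:
        for i in range(width):
            if i & span:
                j = (i & ~(span - 1)) - 1  # top bit of the lower half of the block
                nodes[i] = prefix_operator(nodes[i], nodes[j])
        span *= 2

    # Post-processing: sum bit i is p_i XOR the carry into bit i (carry-in assumed 0).
    total = 0
    for i in range(width):
        carry_in = nodes[i - 1][0] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, nodes[width - 1][0]
```

For 16-bit operands the loop runs four prefix levels (spans 1, 2, 4, and 8), compared with the
16 sequential carry steps of a ripple structure.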
1.2. Approximate Technique
The approximate techniques used in the design of Approximate PPA fall into three main
categories:
1) Elimination Technique: gates are removed from the existing operator, and the input and
output of the operator are connected directly with wires.
2) Constant Technique: gates are removed from the existing operator, and a constant value,
typically cte-0 (1'b0) or cte-1 (1'b1), is connected to the output.
3) Simplification Technique: the gates in the existing operator are replaced with other
area-efficient gates (and, nand, or, nor, xor, xnor).
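To make the three techniques concrete, the sketch below applies one possible variant of each to a
single PO and counts how often the output differs from the exact PO. The specific wiring and gate
choices are illustrative assumptions, not the variants evaluated in the original work, and all
function names are hypothetical.

```python
from itertools import product


def po_exact(g_i, p_i, g_k, p_k):
    """Exact PO: G = g_i OR (p_i AND g_k), P = p_i AND p_k."""
    return (g_i | (p_i & g_k), p_i & p_k)


def po_elimination(g_i, p_i, g_k, p_k):
    # Gates removed; the i-th inputs are wired straight to the outputs (assumed wiring).
    return (g_i, p_i)


def po_constant(g_i, p_i, g_k, p_k):
    # Gates removed; both outputs tied to cte-0 (1'b0).
    return (0, 0)


def po_simplification(g_i, p_i, g_k, p_k):
    # The AND-OR network for G is replaced by a single OR gate (hypothetical choice).
    return (g_i | g_k, p_i & p_k)


def mismatches(po):
    """Return (number of differing cases, number of cases) against the exact PO."""
    # g and p of one group are never both 1 when p is defined as a XOR b.
    pairs = [(0, 0), (0, 1), (1, 0)]
    cases = [hi + lo for hi, lo in product(pairs, repeat=2)]
    wrong = sum(po(*case) != po_exact(*case) for case in cases)
    return wrong, len(cases)
```

Such a per-operator count only describes a single PO in isolation. The error seen at the adder
output also depends on where the operator sits in the prefix network.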
1.3. AxPPA Design
Approximate PPA refers to a modified PPA structure in which approximate techniques are applied
to improve on the area and power inefficiency inherent in traditional PPA. One prominent
structure is the previously proposed AxPPA, which introduces modifications to address these
inefficiencies. Figure 1 shows the structure of AxPPA. The key proposal of AxPPA is to divide
the entire PPA into two parts and apply the elimination technique to the POs of the low part to
enhance efficiency.
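The two-part idea can be sketched behaviourally. The model below is one reading of the
elimination technique as defined in Section 1.2 and is not the AxPPA circuit itself: it assumes
that an eliminated PO simply forwards the generate signal of its own bit, so the carry out of a
low-part bit is just g_i. The upper part is modelled with an exact carry chain for brevity. The
name `split_add` and the parameter `low_bits` are hypothetical.

```python
def split_add(a, b, width=16, low_bits=8):
    """Behavioural sketch of a two-part adder; returns a (width + 1)-bit result."""
    carry = 0  # carry into bit 0 (assumed zero)
    total = 0
    for i in range(width):
        p_i = ((a >> i) ^ (b >> i)) & 1
        g_i = ((a >> i) & (b >> i)) & 1
        total |= (p_i ^ carry) << i
        if i < low_bits:
            carry = g_i                  # eliminated PO: carry is only the local generate
        else:
            carry = g_i | (p_i & carry)  # exact carry in the upper part
    return total | (carry << width)
```

With `low_bits=0` the model reduces to exact addition. In this sketch the absolute error stays
below 2^(low_bits + 1), because a carry predicted from g_i alone is never wrongly asserted.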
CHAPTER 2
LITERATURE REVIEW
● H. Jiang et al. introduced approximate PPAs (AxPPA) that optimize delay and energy
consumption for multimedia and machine learning workloads.
● Z. Liu et al. reviewed approximate adders and their trade-offs in speed, power, and
accuracy, offering insights into their application in resource-constrained environments.
Innovative architectural techniques have been proposed to improve the efficiency of PPAs.
Approximate PPAs have been tailored for IoT devices that demand low power and acceptable
error rates.
Error-tolerant designs are crucial for approximate adders to maintain functional correctness.
● A. Gupta and S. Bansal achieved 30% energy savings by simplifying logic paths in PPAs
for low-power devices.
● K. Verma et al. examined power-aware PPAs, focusing on voltage scaling and logic
simplification to minimize energy consumption.
Approximate PPAs have found applications in AI accelerators, where exact precision is not
always required.
Security concerns in approximate computing include fault tolerance and resistance to hardware
attacks.
● H. Li et al. analyzed the impact of faults in PPAs, proposing secure and fault-resilient
designs for mission-critical applications.
● S. Kumar and P. Roy addressed vulnerabilities in approximate computing and introduced
robust designs to mitigate security risks.
● J. Wang et al. integrated carry-lookahead and ripple-carry designs into a hybrid PPA to
balance speed and power.
● F. Gao et al. explored trends in approximate computing, including machine-learning-
guided PPA optimization.
CHAPTER 3
EXISTING WORK
AxPPA has several limitations that need to be considered. First, the approximate technique is
applied only to POs. The effect on performance of applying it to operators other than POs should
not be ignored. Second, results are generated by passing through multiple operators from input
to output. Therefore, selecting and applying an approximate technique based on the error rate of
a single PO is insufficient. Additionally, because only a single technique is provided, there is
no comparative analysis of applying different techniques. In this paper, we address these
limitations and design our proposed Approximate PPA.
The analysis is conducted using three metrics: Mean Absolute Error (MAE), Mean Relative Error
Distance (MRED), and circuit area. The circuit area is calculated as the total area of each
circuit in Table 1, synthesized with Synopsys Design Compiler using the CMOS 28nm cell library
and a 1.1V operating voltage. MAE, MRED, and Z-Score are defined as in (6), (7), and (8).
For a sample size of 𝑛, 𝑥 represents the actual value, and 𝑦 represents the approximate value.
MAE and MRED are used as accuracy evaluation metrics. The Z-Score is the value obtained by
normalizing the original value 𝑥 of each metric, where 𝜇 represents the mean and 𝜎 represents
the standard deviation. The overall metric is the sum of the Z-Scores of all three metrics, and
the lowest overall Z-Score is defined as optimal. The experiment is conducted 10^6 times for
each combination, and the metrics are averaged.
Figure 4. Z-Score by Area-Efficient Combination Bit
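The evaluation procedure can be sketched in a few lines. Equations (6) to (8) are not reproduced
in this text, so the code below uses the standard definitions of MAE, MRED, and Z-Score, which is
an assumption. Using the population standard deviation and skipping samples whose exact value is
zero in MRED are also assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def mae(exact, approx):
    """Mean Absolute Error: average of |x - y| over all samples."""
    return mean(abs(x - y) for x, y in zip(exact, approx))


def mred(exact, approx):
    """Mean Relative Error Distance: average of |x - y| / x (samples with x == 0 skipped)."""
    return mean(abs(x - y) / x for x, y in zip(exact, approx) if x != 0)


def z_scores(values):
    """Z-Score of each value: (x - mu) / sigma. Requires at least two distinct values."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def rank_candidates(candidates):
    """Sort candidate names by the sum of their Z-Scores over (MAE, MRED, area), lowest first."""
    names = list(candidates)
    totals = dict.fromkeys(names, 0.0)
    for metric in range(3):
        column = [candidates[name][metric] for name in names]
        for name, z in zip(names, z_scores(column)):
            totals[name] += z
    return sorted(names, key=totals.get)


# Hypothetical values for illustration only, not results from the experiments.
example = {"A": (1.0, 0.1, 100.0), "B": (2.0, 0.2, 80.0), "C": (3.0, 0.3, 120.0)}
best_first = rank_candidates(example)
```

Summing Z-Scores gives each metric equal weight regardless of its unit, which is why area can be
combined with the two error metrics.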
Table 2 presents the experimental results, based on the overall Z-Score, used to determine the
optimal approximate technique for the design. In these results, the combination [nor, cte-0,
cte-0, cte-0, nor] exhibited the best performance.
The optimal number of bits to which the most efficient approximate technique is applied is
determined through quantitative experiments. The optimal approximate technique combination
derived in Section B is applied to the lower bits, and performance is compared as the number of
approximated bits varies in order to derive the optimal ratio. The experimental results are
shown in Figure 3. The design achieves optimal performance when the number of approximate bits
is set to 10. Subsequently, based on the derived ratio, the part where the approximate technique
is applied is further divided so that the adder consists of three parts, with the most
area-efficient combination applied to the lowest bits. The experimental results are shown in
Figure 4. The results show optimal performance when this combination is applied to 8 bits.
Figure 2 shows the EAxPPA structure proposed in this paper and verified through experiments.
Different approximate techniques are applied for each stage and bit, and these choices are the
result of quantitative analysis through experiments. EAxPPA is designed based on the 16-bit
Sklansky PPA. The upper 6 bits are the same as in the exact PPA, and the approximate technique
is applied to the lower 10 bits. Of these, the upper 2 bits use the optimal approximate
technique combination, and the lower 8 bits use the area-efficient approximate technique
combination.
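The resulting bit partition can be written down as a small lookup. This is only a restatement of
the 6/2/8 split described above. The function name, parameter names, and region labels are
hypothetical, and the per-stage gate choices inside each region are not modelled.

```python
def eaxppa_region(bit, width=16, approx_bits=10, area_bits=8):
    """Return which treatment a bit position receives in the described partition."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    if bit < area_bits:
        return "area-efficient"  # lowest 8 bits
    if bit < approx_bits:
        return "optimal"         # next 2 bits
    return "exact"               # upper 6 bits


regions = [eaxppa_region(i) for i in range(16)]
```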
FA is the fundamental block for adder design. Relaxing numerical exactness provides the freedom
to study imprecise or approximate computation, and this freedom offers a route to low-power and
high-speed designs. The existing work introduces error to substantially reduce power
consumption with little loss in output quality. The Error Tolerant Adder (ETA) achieves large
improvements in power consumption and speed by introducing a restriction on accuracy. ETA II
provides an improvement in power consumption and speed, with reduced accuracy compared to ETA.
ETA II does not eliminate the entire carry propagation path; instead, it divides the carry
propagation path into a number of shorter paths and completes the carry propagation in these
paths simultaneously. As a result, the performance of the adder is significantly improved in
terms of power consumption and speed. The building blocks for implementing bio-inspired systems
do not require fully precise digital logic circuits, which allows inaccurate computation that
reduces logic complexity and cost. Soft additions are generally based on deterministic
approximate logic or probabilistic imprecise arithmetic. The bio-inspired LOA has been designed
based on approximate logic. The LOA is the slowest, but has low power dissipation. Dynamic
segmentation divides the adder into adders of smaller bit width by bit-slicing the data path.
However, the segmentation approach incurs a significant error in meta-function computation
because errors accumulate over multiple cycles. To overcome this issue, an improved dynamic
segmentation with multi-cycle error compensation technique (DSEC) improves the accuracy under a
wide range of over-scaled voltages. Approximate computing techniques in error-tolerant
applications, such as image and video processing, provide a considerable improvement in speed
and power with a trade-off in quality. The accuracy requirements of various applications differ
from each other. Even within the same application, different computations have different
accuracy requirements, which vary over time and with user requirements. Therefore,
accuracy-configurable arithmetic circuits are important. With an Accuracy-Configurable
Approximate adder (ACAA), the accuracy is configured at runtime by changing the circuit
structure, with a trade-off among accuracy, performance, and power. The Almost Correct Adder
(ACA) is the most power-consuming approach, with moderate accuracy [9]. Logic complexity
reduction for adders at the bit level provides better power savings than conventional low-power
design techniques. Logic complexity reduction of a conventional MA cell has been achieved by
reducing the number of transistors. Based on logic complexity reduction, five different
simplified versions of MA have been proposed. The existing AAs (AA1, AA2, AA3, AA4, AA5) and the
proposed AAs (AA6, AA7, AA8, AA9, AA10, AA11, AA12) are discussed in the following subsections.
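As an illustration of the approximate-logic style mentioned above, the sketch below models LOA
(commonly expanded as Lower-part-OR Adder) as it is usually described: the lower bits are
produced by a bitwise OR, and the carry into the exact upper part is the AND of the most
significant lower-part bits. This follows the general description of LOA and is not taken from
this document. The name `loa_add` and the parameter `low_bits` are hypothetical.

```python
def loa_add(a, b, low_bits=8):
    """Behavioural sketch of a Lower-part-OR Adder for unsigned operands."""
    if low_bits == 0:
        return a + b
    low_mask = (1 << low_bits) - 1
    low = (a | b) & low_mask                    # lower part: bitwise OR instead of addition
    carry_in = ((a & b) >> (low_bits - 1)) & 1  # AND of the top lower-part bits
    high = (a >> low_bits) + (b >> low_bits) + carry_in
    return (high << low_bits) | low
```

No carry propagates inside the lower part, which removes that portion of the carry chain at the
cost of errors confined mostly to the low-order bits.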
How It Works
1. Division of Inputs:
2. Approximate Addition:
3. Accurate Addition:
o The results of the accurate and approximate parts are concatenated to form the 16-
bit result.
Example Simulation
A = 15'b101010101010101 (binary)
B = 15'b010101010101010 (binary)
Result:
Handles the lower 4 bits (C[0] to C[3]) using exact logic for accuracy.
These bits are critical for reducing overall error in approximation.
Bits C[8] to C[11] are further approximated by using earlier carry signals
like C[4], C[5], etc., instead of precise dependencies.
4. Sum Calculation:
o Sum[i] = P[i] ^ C[i-1]: Combines the propagate signal and the previous carry to
compute the sum for each bit.
5. CarryOut:
o The final carry-out signal is C[15], which is derived from the approximate carry
propagation.
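The sum and carry-out rules in steps 4 and 5 can be sketched as follows. The carry logic for the
middle and upper bits is not fully specified here, so the model substitutes a hypothetical
scheme: the lowest `exact_bits` carries use the full carry chain, and every higher carry looks
back over at most `window` bit positions. The function names and both parameters are assumptions
made for illustration.

```python
WIDTH = 16


def carries(a, b, exact_bits=4, window=4):
    """Return (P, C), where C[i] is the (possibly approximate) carry out of bit i."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(WIDTH)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(WIDTH)]
    c = []
    for i in range(WIDTH):
        # Exact bits see the whole chain; other bits see only a short window (assumption).
        start = 0 if i < exact_bits else max(0, i - window + 1)
        carry = 0
        for j in range(start, i + 1):
            carry = g[j] | (p[j] & carry)
        c.append(carry)
    return p, c


def approx_add(a, b, **kwargs):
    """Sum[i] = P[i] ^ C[i-1], with the carry into bit 0 taken as 0; CarryOut = C[15]."""
    p, c = carries(a, b, **kwargs)
    total = 0
    for i in range(WIDTH):
        carry_in = c[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, c[WIDTH - 1]


# Operands from the example simulation above.
A = 0b101010101010101
B = 0b010101010101010
example_sum, example_carry = approx_add(A, B)
```

For the example operands no bit position has both inputs set, so no carry is generated anywhere
and this sketch returns the same value as exact addition. Setting `window=WIDTH` makes every
carry exact.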
Design Benefits:
1. Reduced Complexity:
o By introducing approximations in the mid and higher stages, the logic becomes
simpler, reducing the delay and hardware cost.
2. Lower Power Consumption:
o Approximate logic reduces switching activity and the number of gates, which
decreases power usage.
3. Error-Tolerance:
o The design focuses on minimizing error in the lower bits (exact stages), where
inaccuracies have a larger impact on the final result.
Advantages:
1. Low Power Consumption:
o The approximate carry logic reduces the switching activity and the number of
logic gates, leading to significant power savings compared to conventional adders.
2. High-Speed Computation:
o The use of approximations in carry generation for mid-range and higher-order bits
reduces the critical path delay, making the adder suitable for high-speed
applications.
3. Error-Tolerance:
4. Hardware Efficiency:
Limitations:
1. Reduced Accuracy:
o The approximations in the carry generation logic introduce errors, which might
not be acceptable in systems requiring high precision.
2. Application Dependency:
Applications:
1. Digital Signal Processing (DSP):
o Low-power and high-speed addition make it ideal for DSP tasks such as filtering
or image processing.
2. Machine Learning:
o Approximate arithmetic can accelerate training and inference, especially for large-
scale neural networks where minor errors do not significantly affect the results.
3. IoT Devices:
o Its energy-efficient design aligns with the needs of battery-powered IoT devices.
4. Multimedia Systems:
o Image and video processing applications can tolerate minor errors without
significant degradation in quality, making this adder a good fit.