Kahng 2012
• The proposed ACA adder has runtime-configurable accuracy to better enable tradeoff of accuracy in computation versus performance and power.
• We provide quantitative metrics for an approximate arithmetic design. We compare the ACA adder to previous approximate adders based on these metrics.
• We demonstrate the power benefits of the ACA adder over previous approximate and conventional adder designs for accuracy-configurable applications.

The rest of the paper is organized as follows. Section 2 presents the proposed ACA adder design. Section 3 provides experimental results and analysis. Section 4 summarizes and concludes the paper.

2. ACCURACY-CONFIGURABLE ADDER

2.1 Approximate Adder Implementation

Figure 2: Proposed approximate adder (16-bit adder case).

Previous approximate adders [7] [10] [14] have difficulty detecting and correcting errors since they are designed for error-acceptable applications with a target accuracy. However, accurate computations are still required at certain times, according to the application. VLSA [12] can provide accurate results, but has large delay and area overhead for the error detection and correction. The central contribution of our present work is to propose an approximate adder which supports both accurate and inaccurate computation with error-correction and accuracy-configuration capability. Figure 2 shows our proposed approximate circuit for the case of a 16-bit adder. In the adder, the carry chain is cut to reduce critical-path delay, and three sub-adders generate results of partial summations. With the reduced critical-path delay, high performance (by increasing the clock frequency) or low power consumption (by decreasing the operating voltage) is obtained. A middle sub-adder (A_M + B_M) is introduced to increase accuracy. Without the middle sub-adder (as in ETAII [13]), error occurs when the eighth carry bit is high, and for random input patterns the error rate is 50.1%. On the other hand, with the introduction of the middle sub-adder, error rate for random input patterns is reduced to 5.5%. (In the real implementation, all redundant parts (four-LSB output of A_H + B_H and A_M + B_M sub-adders) are optimized only for carry-generation.)

Figure 3: General implementation for the proposed adder. (N: bit width, k: ½ carry-chain depth)

We can generalize the implementation of the proposed approximate adder. Figure 3 shows the general implementation of an N-bit adder with a parameter k, which is the bit-width of the sub-adder result. In the adder, each divided sub-module produces a k-bit result except for the last sub-module, which produces a 2k-bit result. The approximate adder thus consists of the (N/k − 1) sub-modules as described in Equation (1).

$$\mathrm{SUM}[N - ik - 1 : N - (i+1)k] = A[N - ik - 1 : N - (i+2)k] + B[N - ik - 1 : N - (i+2)k], \quad i = 0, \ldots, N/k - 2 \tag{1}$$

In modern adder designs, such as carry-lookahead (CLA), carry-select and Kogge-Stone adders, the path depth and area are asymptotically proportional to $\log_2 N$ and $N \log_2 N$ respectively, where N is the bit-width of the adder [15]. Based on this, we can express delay, area and power consumption of the proposed adder in terms of the parameters N and k. The proposed ACA adder has (N/k − 1) sub-adders, each of which is a 2k-bit adder. Therefore, delay of the critical path can be expressed with Equation (2) and area can be estimated with Equation (3), where $C_{delay}$ and $C_{area}$ are constants for delay and area, respectively.

$$\mathrm{delay} = C_{delay}\,(\log_2 k + 1) \tag{2}$$

$$\mathrm{area} = C_{area}\,(N - 2k)(\log_2 k + 1) \tag{3}$$

$$\mathrm{Power}_{dyn} = C_{power}\,(N - 2k)(\log_2 k + 1)^2 \tag{4}$$

Power consumption of the ACA adder can be roughly estimated as follows. Dynamic power consumption with voltage scaling at a fixed frequency is proportional to capacitance · $V_{dd}^2$, where the capacitance is proportional to the area. Cell delay is proportional to $1/(V_{dd} - V_t)^\beta$, and $V_{dd}^2$ is roughly proportional to 1/(cell delay) if we assume that β is 2. Since (cell delay) × (path depth) is constant at a fixed frequency, $V_{dd}^2$ is proportional to the path depth, which is $\log_2 k + 1$. Consequently, dynamic power with voltage scaling can be expressed using Equation (4), where $C_{power}$ is a constant fixed for given $V_{dd}$ for dynamic power consumption. Static power consumption of the adder can be roughly estimated as proportional to the area in Equation (3).

In our proposed adder design, the output of each sub-adder (except the last sub-adder) is incorrect when a carry input should be propagated to the results. In Figure 2, when the carry[4] (carry bit from A_L + B_L) is ‘1’ and SUM_M[3 : 0] is 1111(2), the output result has an error in SUM[11 : 8]. In the general implementation, the output result will be correct when there are no errors in all (N/k − 1) sub-adders. In the ith sub-adder, errors occur when (1) the LSB part of the result (SUM_i[k − 1 : 0]) has all ‘1’ values (probability $P = 1/2^k$) and (2) the LSB part ([k − 1 : 0]) of the (i + 1)th sub-adder produces a carry bit (probability $P = \frac{1}{4} + \frac{1}{2} \cdot \frac{1}{4} + \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{4} + \ldots$). Therefore, with a random input vector, the probability of having a correct result in the proposed adder is

$$P(N, k) = \left(1 - \frac{1}{2^k} \cdot \frac{2^k - 1}{2^{k+1}}\right)^{N/k - 2} \tag{5}$$

Table 1 shows the estimated results of 16-bit ACA adders with different parameter values k. With smaller k value, the minimum clock period and dynamic power can be reduced, but the pass rate (probability of having a correct result) will be decreased. The estimations come from Equations (2), (3), (4) and (5). In Section 3.3 below, we validate the above estimation with real implementations.

Table 1: Estimated minimum clock cycle, area, dynamic power and pass rate for each k value when N = 16 (normalized to the conventional CLA 16-bit adder).

                    k=2     k=3     k=4     k=5     k=6
min. clock period   0.5     0.65    0.75    0.83    0.89
area                0.87    1.05    1.12    1.15    1.12
dynamic power       0.44    0.68    0.84    0.95    1.00
pass rate           0.554   0.829   0.942   0.982   0.995
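The Table 1 estimates can be checked numerically. The Python sketch below evaluates Equation (2) and Equation (5) directly, normalizing delay to the log2 N depth of a conventional N-bit adder. For area and dynamic power it multiplies the stated sub-adder count, (N/k − 1), by the cost of one 2k-bit sub-adder; that normalization is an assumption about the constants in Equations (3) and (4), chosen because it lands within about 0.01 of the Table 1 entries. All function names are illustrative and do not come from the paper.

```python
import math


def normalized_delay(n, k):
    """Equation (2) relative to the log2(n) depth of a conventional n-bit adder."""
    return (math.log2(k) + 1) / math.log2(n)


def normalized_area(n, k):
    """(n/k - 1) sub-adders of 2k bits each, relative to n*log2(n).

    Assumed normalization, not a formula quoted from the paper.
    """
    sub_adders = n / k - 1
    return sub_adders * 2 * k * (math.log2(k) + 1) / (n * math.log2(n))


def normalized_dynamic_power(n, k):
    """Area times Vdd^2, with Vdd^2 taken as proportional to the path depth."""
    return normalized_area(n, k) * normalized_delay(n, k)


def pass_rate(n, k):
    """Equation (5): probability that no sub-adder misses a propagated carry."""
    p_all_ones = 1 / 2**k                  # LSB part of the sub-adder result is all ones
    p_carry = (2**k - 1) / 2**(k + 1)      # LSB part of the next sub-adder generates a carry
    return (1 - p_all_ones * p_carry) ** (n / k - 2)


if __name__ == "__main__":
    for k in (2, 3, 4, 5, 6):
        print(k,
              round(normalized_delay(16, k), 2),
              round(normalized_area(16, k), 2),
              round(normalized_dynamic_power(16, k), 2),
              round(pass_rate(16, k), 3))
```

Note that the exponent N/k − 2 in `pass_rate` is evaluated as a real number, which is what reproduces the k = 3, 5 and 6 pass rates when N = 16.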
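Equation (1) can also be exercised at the bit level. The following Python sketch models an N-bit ACA adder as (N/k − 1) overlapping 2k-bit windows, keeping the full 2k-bit result of the lowest window and only the upper k bits of every other window. It then estimates the pass rate by Monte Carlo simulation with uniformly random operands. The names `aca_add` and `simulated_pass_rate` are illustrative, the final carry-out is ignored, and k is assumed to divide N; none of this is taken from the authors' Verilog.

```python
import random


def aca_add(a, b, n=16, k=4):
    """Approximate n-bit sum following Equation (1); k must divide n."""
    if n % k != 0 or n < 2 * k:
        raise ValueError("k must divide n and n must be at least 2k")
    mask_k = (1 << k) - 1
    mask_2k = (1 << (2 * k)) - 1
    # Lowest sub-adder: its full 2k-bit result is used.
    result = ((a & mask_2k) + (b & mask_2k)) & mask_2k
    # Remaining sub-adders: add a 2k-bit window starting at bit `lo`
    # with no carry-in, and keep only the upper k bits of the window sum.
    for lo in range(k, n - 2 * k + 1, k):
        window_sum = ((a >> lo) & mask_2k) + ((b >> lo) & mask_2k)
        upper = (window_sum >> k) & mask_k
        result |= upper << (lo + k)
    return result


def simulated_pass_rate(n=16, k=4, trials=100_000, seed=0):
    """Fraction of random operand pairs for which the approximate sum is exact."""
    rng = random.Random(seed)
    mask_n = (1 << n) - 1
    correct = 0
    for _ in range(trials):
        a = rng.getrandbits(n)
        b = rng.getrandbits(n)
        if aca_add(a, b, n, k) == (a + b) & mask_n:
            correct += 1
    return correct / trials
```

For example, `aca_add(0x00FF, 0x0001)` returns `0x0000` rather than `0x0100`: the carry out of bits [3:0] would have to ripple through bits [7:4], which are all ones, so SUM[11:8] is wrong, as in the Figure 2 discussion. With N = 16 and k = 4, `simulated_pass_rate()` should come out near 0.94.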
2.2 Error Detection and Correction for Accurate Computation

As described in Section 2.1, our proposed adder is incorrect when a carry bit is propagated between sub-adders. However, the error can be detected and corrected with a small overhead. We detect an error for each sub-adder by checking the output of the sub-adder and the carry-in signal that comes from the previous sub-adder. Error detection can be implemented with several ‘and’ gates. To correct the error, ‘1’ should be added to the approximate (inaccurate) output, and the error correction can be implemented with an incrementor circuit.

Figure 4: Error detection and correction with the approximate adder.

With these simple error detection and correction circuits, our proposed adder can be implemented to have variable latency like the previous VLSA adder [12], with a small overhead for an error detection and correction (EDC) system. Figure 4 shows an EDC system with our proposed adder. The error detection circuit (‘and’ gates) checks the carry propagation and generates an error signal. The error correction (incrementor) circuit produces an error-free output by adding compensation data, and requires an additional clock cycle. When errors are detected from input patterns, the error signal is activated. The error signal holds the input pattern during the error correction and chooses the error-corrected value (SUMcorrect) as an output. With this approach, our approximate adder can provide accurate results at a higher clock frequency than that of conventional adders (e.g., CLA). According to the estimated results in Table 1, clock period can be reduced by 25% with 6% (= error rate) recovery-cycle overhead (16-bit ACA, k = 4).

2.3 Accuracy Configuration with Pipelined Architecture

When our proposed adder is combined with a pipelined architecture, we can obtain accurate results with the same throughput as a conventional adder. In the pipelined architecture, approximate additions are computed at the first pipeline stage, and error correction can be completed at the second stage. Figure 5 shows the conventional pipelined adder (above) and the approximate adder (below). The pipelined implementation of approximate adder has a structural analogy with the pipelined adder of the 2006 U.S. patent [8] in which partial summations are performed at the first stage and carry bits are added at the later stages. However, the patent is clearly directed to accurate operations, not approximate computations. In addition, we use our approximate adder (Figure 3) in the first stage. In the pipelined approach, there is no improvement of the clock frequency since the achievable clock period is the same as that of the conventional adder. However, power benefits are obtained through configuration of accuracy: in the approximate mode, the error correction stage is power-gated with foot (or, head) switches in Figure 5, and power reduction over the conventional adder design can be achieved. We compare the conventional and approximate pipelined adders in Section 3.

In the proposed adder implementation, to achieve higher performance or lower power consumption, we can reduce the carry chain depth (k) of sub-adders (see Table 1). However, when k is less than N/4, it is impossible to correct all errors and achieve 100% correct results within one clock cycle since the error-correction paths become critical. To achieve correct results in the pipelined implementation, the error-correction stage should be extended to multiple stages. Figure 6 shows the pipelined adder implementation (k = N/8 case), in which four pipeline stages are required to achieve a 100% accurate result. In the pipelined adder, each stage generates a result with different accuracy; the output accuracy increases as the number of pipeline stages increases. According to the accuracy requirement, we can turn off the later stages with a power gating technique, and we can reduce the power consumption further with the accuracy tradeoff.

Since the proposed adder supports both approximate and accurate results, it can be used in applications that require accurate results only under certain conditions. Conventional accurate designs are energy-inefficient in the error-acceptable application context, because they always compute the exact function. Previous approximate designs cannot handle a varying accuracy requirement, and this limits the benefit of the accuracy tradeoff: as noted above, the approximate function must meet the maximum accuracy threshold across all applications. Moreover, if the application requests an exact computation, additional accurate circuits must be added to the previous approximate designs. By contrast, the ACA design efficiently exploits a tradeoff between accuracy and power/performance with its runtime accuracy configurability.

Figure 5: Pipelined adder implementation, showing the conventional adder (above) and approximate adder (below). In approximate operation, the error correction stage is power-gated.

3. EXPERIMENTAL SETUP AND RESULTS

3.1 Experimental Setup

To test approximate designs, we have written each design in Verilog and synthesized it to a TSMC 65GP cell library with Synopsys DesignCompiler [17]. We then perform gate-level simulations using Cadence NC-Sim [18]. In the simulation, gate delay is taken from an SDF (standard delay format) file. For voltage scaling experiments, we prepare Synopsys Liberty (.lib) files for each voltage from 1.00V to 0.60V in 0.01V increments, using Cadence Library Characterizer v9.1 [19]. The prepared libraries are used for SDF file generation and power estimation at each voltage. Each simulation is performed with input patterns for one million cycles. During the simulation, each output value is compared with a reference (correct) value to produce the accuracy metrics. For the input patterns, we use random data, as well as actual data from SPEC 2006 [20] benchmarks. We extract operand data from ADD instructions in the SPEC benchmarks.

3.2 Metric for Approximate Design

To quantify errors in approximate designs, two metrics have been previously proposed [1]. Error rate (ER) is the percentage of cycles in which output value is different from the correct value. Error significance (ES) is the numerical difference between correct and output results; this quantifies the amount of error. In image/video applications, [2] uses the product of ES and ER as a metric of error tolerance. [10] introduces a criterion for acceptability: ES × ER ≤ acceptance threshold, where the acceptance threshold is specified according to the application. For the error significance (ES) metric, [14] considers only amplitude of error. This is useful for many digital signal processing (DSP) systems that process, e.g., sound and image data. However, in communication systems that mainly handle information data, the number of incorrect bits
(Hamming distance) is a more meaningful metric for accuracy; e.g., a (32,28) Reed-Solomon code can correct up to 2-byte errors. This consideration for the ES metric is required when approximate arithmetic is applied to error-tolerant systems with a redundancy technique.

Table 2 shows two accuracy metrics for amplitude data and information data. ACCamp used in [14] quantifies the amplitude of errors, where Rc and Re are the correct and obtained results, respectively. We propose another accuracy metric, ACCinf, which measures error significance as Hamming distance, where Be is the number of error bits and Bw is the bit-width of the data. For example, when the correct (reference) data is 1000_0000(2) and the result data is 1100_0000(2), accuracy with ACCamp and ACCinf will be 1/2 and 7/8, respectively. To evaluate the approximate circuits, we obtain average values of accuracy metrics ACCamp and ACCinf over the entire simulation to consider both ER and ES.

Table 2: Accuracy metrics for error significance (ES).

metric    definition           data type
ACCamp    1 − |Rc − Re|/Rc     amplitude data
ACCinf    1 − Be/Bw            information data

adders: CLA, Lu’s adder [7], ETAI, ETAIIM [14] and the proposed ACA adder (without error correction). In the experiment, the same carry-chain width (8-bit) is selected for the four approximate adders. In the implementation, a register (flip-flop) is inserted in each output port to detect timing errors.

Table 4 shows area, pass rate, accuracy, minimum clock period and EDC overhead for each adder design. According to the results, the ETAI adder has the smallest design area, but has a low pass rate and limited accuracy with respect to the ACCinf metric. Therefore, the ETAI adder is preferred for applications which allow low accuracy in results. The ETAIIM adder shows fairly high accuracy, but does not have speed (clock period) benefit. Lu’s adder shows a smaller error rate and high accuracy with respect to both ACCamp and ACCinf metrics. However, it requires larger area than the other designs. The proposed adder shows similar results for both metrics as Lu’s adder. However, the area of the ACA adder is smaller than that of Lu’s adder, and EDC is possible with small area overhead (28%). With the ACA adder, the minimum clock period can be reduced by 26% compared to the accurate CLA.
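The two accuracy metrics defined in Table 2 are simple enough to sketch in a few lines of Python. The function names below are illustrative, and raising an error when the correct result is zero is an assumption, since the text does not say how that case is handled for ACCamp.

```python
def acc_amp(correct, obtained):
    """ACCamp = 1 - |Rc - Re| / Rc, the amplitude-based metric of Table 2."""
    if correct == 0:
        # Assumption: the metric is treated as undefined when Rc is 0.
        raise ValueError("ACCamp is undefined when the correct result is 0")
    return 1 - abs(correct - obtained) / correct


def acc_inf(correct, obtained, bit_width):
    """ACCinf = 1 - Be / Bw, where Be is the Hamming distance between results."""
    mask = (1 << bit_width) - 1
    error_bits = bin((correct ^ obtained) & mask).count("1")
    return 1 - error_bits / bit_width


def average_accuracy(pairs, bit_width):
    """Average both metrics over (correct, obtained) pairs, skipping Rc = 0 for ACCamp."""
    amp = [acc_amp(rc, re) for rc, re in pairs if rc != 0]
    inf = [acc_inf(rc, re, bit_width) for rc, re in pairs]
    return sum(amp) / len(amp), sum(inf) / len(inf)
```

With the worked example from the text, `acc_amp(0b1000_0000, 0b1100_0000)` gives 0.5 and `acc_inf(0b1000_0000, 0b1100_0000, 8)` gives 0.875: a single flipped bit costs half the amplitude accuracy but only one eighth of the information accuracy.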
Table 3: ACA adder results with different k values.
k 2 3 4 5
ACCamp (maximum) 1.000 0.998 0.997 0.999 0.999
ACCinf (maximum) 1.000 0.999 0.993 0.694 0.996
area overhead for EDC N/A 75% 28% N/A 15%

[Figure: ACCamp under voltage scaling (1.0V~0.6V); plot data and caption not recovered]
reduction with small accuracy penalty. When the required accuracy is 0.970 (ACCamp), the ACA adder shows 37.0%, 36.4% and 15.9% total power reduction over CLA, Lu’s adder and ETAIIM, respectively.

We have tested our approximate adder on a real application: a Gaussian smoothing filter used in [6]. Gaussian smoothing is performed on the input image by convolving with a matrix in the spatial domain. In the convolution, the addition operation is done with approximate 16-bit adders. Other operations, such as multiplication and division, are accurate computations. Figure 8 shows results for various approximate adders when they consume 50% of the power of accurate CLA. From the results, the ACA adder has PSNR of 24.5dB, and this suggests that image processing/filtering applications could employ our proposed adder with significant power savings and only small loss in image quality.

selected as N/4 for a two-stage pipelined implementation. In the table, minimum clock period is measured at a fixed voltage (1.0V), and total power is measured at a fixed frequency (2.5GHz) with voltage scaling. In the ACA adder case, timing and power overheads from power gating cells, output MUXes, and IR drop are included. We can see that area, timing and power of both designs are similar when the ACA adder operates in the accurate mode. Total power of the approximate adder is comparable to that of the conventional adder, even though ACA has additional EDC circuits. This is because ACA has fewer registers between stage-1 and stage-2 than the conventional pipelined adder. (In Figure 5, the conventional adder requires registers for A_H, B_H, SUM_L and carry at the first stage. For a 16-bit adder, 25 registers (8 + 8 + 8 + 1) are required. On the other hand, ACA requires 18 registers (16 for SUMapprox and 2 for error indication).)

[Figure: total power consumption (W) versus ACCamp, with voltage scaling, accurate result and mode change annotated; plot data and caption not recovered]
Table 7: Accuracy (ACCamp, ACCinf) results of 32-bit ACA adder for real benchmarks (SPEC 2006).

                           benchmark
accuracy metric  mode      astar    bzip2    calculix  gcc      h264ref  mcf      sjeng    soplex
ACCamp           mode-1    1.0000   1.0000   1.0000    1.0000   1.0000   1.0000   1.0000   1.0000
ACCamp           mode-2    0.9999   1.0000   0.9999    0.9992   0.9999   0.9997   0.9998   0.9999
ACCamp           mode-3    0.9993   0.9998   0.9972    0.9990   0.9990   0.9997   0.9995   0.9998
ACCamp           mode-4    0.9979   0.9970   0.9958    0.9951   0.9978   0.9991   0.9981   0.9953
ACCinf           mode-1    1.0000   1.0000   1.0000    1.0000   1.0000   1.0000   1.0000   1.0000
ACCinf           mode-2    0.9979   1.0000   0.9978    0.9881   0.9953   0.9819   0.9897   0.9985
ACCinf           mode-3    0.9949   0.9984   0.9967    0.9849   0.9897   0.9809   0.9876   0.9965
ACCinf           mode-4    0.9940   0.9931   0.9910    0.9617   0.9851   0.9596   0.9787   0.9925