A Multiplier-Free Discrete Cosine Transform Architecture Using Approximate Full Adder and Subtractor
Authorized licensed use limited to: K. Ramakrishnan Health and Educational Trust. Downloaded on August 02,2024 at 03:40:47 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Embedded Systems Letters. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/LES.2024.3395900
Fig. 1. Truth table, gate level, transistor level, and layout of the proposed approximate FA.

The FA has a GDI-based function F2, $F_2 = \overline{A} + B$. As an advantage, the Sum is generated by F2 with an inherent inverter. Fewer internal nodes and a fixed input capacitance along critical paths without inverters reduce static power. F2 enables a multiplexer (MUX), which allows the MUX to operate faster. In addition, Cout is generated with OR and AND gates. Cout and Sum are each generated by 4 transistors, so the FA has 8 transistors. The FA's functions are (1)-(2).

$Sum = F_2 C + \overline{F_2} A$   (1)
$C_{out} = AB + AC = (B + C)A$   (2)

B. Proposed approximate subtractor
The block diagram of the new approximate subtractor is illustrated in Fig. 2. The GDI technique is used because of its small area, low complexity, and reduced transistor count, which yield low-power and fast circuits. The subtractor shows errors in the states XYBin = 010 and XYBin = 101; therefore ED = |±1|, and the ER and NMED are 25% and 0.083, respectively.

Fig. 2. Truth table, gate level, transistor level, and layout of the proposed subtractor.

Both errors affect only Diff, so Bout always produces correct outputs. The subtractor uses only two MUXes and one XNOR to realize the Boolean functions (3)-(4).

$B_{out} = \overline{(X \oplus Y)} B_{in} + \overline{X} Y$   (3)
$Diff = \overline{(X \oplus Y)} B_{in} + X \overline{Y}$   (4)

Fig. 3. 8-bit RCA.

Fig. 4. 8-bit subtractor.

III. SIMULATION RESULTS
All circuits are simulated in HSPICE using 32 nm CNTFET technology. The frequency is 500 MHz and the load is 5 fF, while the pitch and the number of tubes are 5 nm and 10, respectively. The chirality vector is (38,0), and VDD is 0.9 V. All parasitic capacitances and resistances are taken into account through post-layout extraction [11]. Circuit performance is evaluated by the power-delay product (PDP), and accuracy is checked by the NMED. The normalized PDP and NMED of the introduced FA and subtractor, compared with the references for a number of approximate bits (NAB) of 1, NAB1, in the RCA, are shown in Fig. 5(a) and Fig. 5(b), respectively.

Fig. 5. Results of the PDP and NMED for (a) FA, (b) subtractor.

The proposed FA and subtractor have the minimum PDP and the best performance in terms of power and NMED.

IV. APPLICATION
DCT compression reduces the required storage space or transmission bandwidth. A multiplier-free DCT is shown in Fig. 6 [10], where the proposed FA and subtractor are implemented. Here, only 24 adders/subtractors are used, and the 8-point propeller approximate DCT significantly improves the PDP and area by lowering the number of adders and removing additional multipliers. The pixels of the input image are given to the first adder stage and
the data is processed by the remaining adders and flip-flops. At the end of the third stage of the flip-flops, the DCT coefficients are ready.

[Diagram: inputs X0 to X7, adder/subtractor stages, D flip-flops clocked by CLK, and <<1 shift blocks.]
Fig. 6. The approximate multiplier-free DCT architecture.

The numbers of additions, multiplications, and bit-shift operations required for the proposed architecture and the references are presented in Table 1. The proposed DCT has the same operation counts as T [6], but it saves 57.14% and 17.24% of the additions, and the associated computation power, compared with the Conventional DCT and Scaled DCT [5], respectively.

Table 1. Computation complexity assessment.
| Transform | Addition | Multiplication | Shift |
|---|---|---|---|
| Conventional DCT | 56 | 64 | 0 |
| Scaled DCT [5] | 29 | 5 | 0 |
| BAS-[8] | 24 | 0 | 0 |
| T [9] | 24 | 0 | 2 |
| T [6] | 24 | 0 | 6 |
| Proposed | 24 | 0 | 6 |

As shown in Fig. 7, for JPEG compression the input image is partitioned into 8×8 blocks, and each block is fed to the approximate DCT. Then, the resulting matrix of DCT coefficients is reduced by removing the high-frequency DCT coefficients. The steps of image compression and reconstruction are shown in Fig. 7. The reconstruction is performed by the inverse DCT (IDCT). The presented approximate DCT circuit only generates the output coefficients; the rest of the process, such as quantization, is carried out by other compression steps in MATLAB. The quality factor (QF) is set to 50.

[Block diagram labels: Source Image; 8×8 Blocks; Applying the 8×8 image matrix block to the DCT Structure; DCT Coefficients; Perform JPEG Compression by Quantizing and Removing Unimportant DCT Coefficients; 255×255 Image; Reconstructed Image Data; Dequantization; DCT Coefficients; Perform IDCT; Reconstructed Image.]
Fig. 7. Block diagram of DCT/IDCT.

In this letter, a computed tomography (CT) scan image is compressed by the JPEG method and a standard quantization matrix. The original image, the compressed image, and the reconstructed image under the DCT are shown in Fig. 8. The quality of the reconstructed images is evaluated using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). The DCT circuit based on the proposed FA has a PSNR of 34.96 dB and an SSIM of 0.76 for QF=50.

Fig. 8. The results of the JPEG method with QF=50, (a) input image, (b) exact DCT compression, (c) exact IDCT reconstruction, (d) presented approximate DCT compression, and (e) approximate IDCT reconstruction.

To consider both the function of the circuit and the image processing accuracy, a figure of merit (FoM) is defined as $FoM = PDP / (PSNR \times SSIM)$. A lower FoM indicates better circuit performance and more accurate image processing. Table 2 provides power, delay, transistor count, PDP, power-delay-area-product (PDAP), PSNR, SSIM, and FoM results of various FAs based on the DCT architecture. Here, the proposed approximate DCT has better power, PDP, and FoM than all other designs, and better PSNR and SSIM than the other approximate designs. The Exact and PPA2 designs have a high number of transistors and occupy a large area. The comparison of PSNR in the context of image compression yields only ≈0.5 dB degradation when compared to an exact DCT.

Table 2. The results of DCT architecture based on different FAs.
| Adder Type | Power (mW) | Delay (ns) | PDP (fJ) | PSNR (dB) | SSIM | FoM |
|---|---|---|---|---|---|---|
| Exact [1] | 3.14 | 29.40 | 92.31 | 35.36 | 0.85 | 3.07 |
| AFA1 [2] | 2.50 | 29.15 | 72.87 | 32.71 | 0.74 | 3.01 |
| AFA2 [4] | 2.32 | 29.88 | 69.32 | 33.75 | 0.73 | 2.81 |
| PPA1 [3a] | 2.25 | 29.30 | 65.92 | 32.80 | 0.75 | 2.67 |
| PPA2 [3b] | 2.44 | 29.40 | 71.73 | 33.85 | 0.73 | 2.90 |
| Proposed | 2.18 | 29.18 | 63.61 | 34.96 | 0.76 | 2.39 |

Table 2 confirms that the exact FA in the DCT can be replaced by an approximate FA to save power and area at the cost of a small decrease in image quality. Table 3 compares the hardware consumption of various DCTs, and the proposed DCT has the best (lowest) PDAP.

Table 3. Comparison of hardware consumption of DCTs.
| DCT Type | Power (mW) | Delay (ns) | PDP (fJ) | Area (µm²) | PDAP |
|---|---|---|---|---|---|
| DCT | 6.84 | 40.32 | 275.7 | 40.4 | 11138 |
| Scaled DCT [5] | 4.42 | 35.22 | 155.6 | 38.9 | 6052.8 |
| BAS-[8] | 4.33 | 29.60 | 128.1 | 37.07 | 4748.6 |
| T [9] | 2.75 | 29.34 | 80.68 | 38.6 | 3114.2 |
| T [6] | 2.90 | 29.27 | 84.88 | 37.05 | 3144.8 |
| Proposed | 2.18 | 29.18 | 63.61 | 35.3 | 2245.4 |

The original and reconstructed images are considered, and the PSNR comparisons are presented in Table 4 and Fig. 9. The standard value of PSNR is between 30 and 40 dB, and a higher value means better results. The PSNRs of the Scaled DCT [13] and Conventional DCT are significantly higher than those of the other recent algorithms, but these transforms require a greater number of arithmetic operations. This research concentrates on low computational complexity algorithms, and Table 2 and Fig. 3 show that the proposed DCT has a better PSNR than BAS-[8], T [9], and T [6].
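The PSNR metric used to judge the reconstructed images can be illustrated with a short sketch. It assumes the common definition for 8-bit images, PSNR = 10·log10(255² / MSE), which the letter does not spell out; the function name `psnr` and the tiny 2×2 test images are hypothetical and serve only as an illustration.

```python
import math

def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows).

    Assumes the usual 8-bit definition: 10 * log10(peak^2 / MSE).
    """
    squared_error = 0.0
    pixel_count = 0
    for row_o, row_r in zip(original, reconstructed):
        for p_o, p_r in zip(row_o, row_r):
            squared_error += (p_o - p_r) ** 2
            pixel_count += 1
    mse = squared_error / pixel_count
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(peak ** 2 / mse)

# Hypothetical 2x2 "images" used only to exercise the function.
original = [[52, 55], [61, 59]]
reconstructed = [[54, 55], [60, 58]]

print(f"PSNR = {psnr(original, reconstructed):.2f} dB")
```

In practice the same function would be applied to the full original and IDCT-reconstructed images; a higher result means the reconstruction is closer to the original.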
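The derived columns of Tables 2 and 3 can be cross-checked from the reported power, delay, area, PSNR, and SSIM values. The sketch below assumes that PDP is the numerical product of the power and delay columns and that PDAP is the product of the PDP and area columns (the tabulated numbers agree with this reading), and it uses the FoM definition FoM = PDP/(PSNR*SSIM) given in the text. All variable and function names are illustrative.

```python
# Rows copied from Table 2: (power, delay, PDP, PSNR, SSIM, FoM)
TABLE2 = {
    "Exact [1]": (3.14, 29.40, 92.31, 35.36, 0.85, 3.07),
    "AFA1 [2]":  (2.50, 29.15, 72.87, 32.71, 0.74, 3.01),
    "AFA2 [4]":  (2.32, 29.88, 69.32, 33.75, 0.73, 2.81),
    "PPA1 [3a]": (2.25, 29.30, 65.92, 32.80, 0.75, 2.67),
    "PPA2 [3b]": (2.44, 29.40, 71.73, 33.85, 0.73, 2.90),
    "Proposed":  (2.18, 29.18, 63.61, 34.96, 0.76, 2.39),
}

# Rows copied from Table 3: (power, delay, PDP, area, PDAP)
TABLE3 = {
    "DCT":            (6.84, 40.32, 275.7, 40.4, 11138.0),
    "Scaled DCT [5]": (4.42, 35.22, 155.6, 38.9, 6052.8),
    "BAS-[8]":        (4.33, 29.60, 128.1, 37.07, 4748.6),
    "T [9]":          (2.75, 29.34, 80.68, 38.6, 3114.2),
    "T [6]":          (2.90, 29.27, 84.88, 37.05, 3144.8),
    "Proposed":       (2.18, 29.18, 63.61, 35.3, 2245.4),
}

def close(computed, reported, rel=0.005):
    """True if the recomputed value is within 0.5% of the reported one."""
    return abs(computed - reported) <= rel * abs(reported)

def fom(pdp, psnr, ssim):
    """Figure of merit as defined in the text; lower is better."""
    return pdp / (psnr * ssim)

# Assumed relations: PDP = power * delay, PDAP = PDP * area.
table2_ok = all(
    close(power * delay, pdp) and close(fom(pdp, psnr, ssim), reported_fom)
    for power, delay, pdp, psnr, ssim, reported_fom in TABLE2.values()
)
table3_ok = all(
    close(power * delay, pdp) and close(pdp * area, pdap)
    for power, delay, pdp, area, pdap in TABLE3.values()
)

best_fom = min(TABLE2, key=lambda name: TABLE2[name][5])
best_pdap = min(TABLE3, key=lambda name: TABLE3[name][4])
print(table2_ok, table3_ok, best_fom, best_pdap)
```

The check uses a relative tolerance because the tables report rounded or truncated values.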
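The approximate subtractor functions (3)-(4) can be checked exhaustively against an exact full subtractor. This is a minimal sketch under two assumptions that the text does not state explicitly: the error distance is measured on the two-bit output value 2·Bout + Diff, and the mean error distance is normalized by the maximum output value 3. With these assumptions the enumeration reproduces the reported error states, ER, and NMED. The function names are illustrative.

```python
from itertools import product

def exact_subtractor(x, y, b_in):
    """Exact 1-bit full subtractor: returns (Bout, Diff)."""
    diff = x ^ y ^ b_in
    b_out = ((x ^ 1) & y) | (((x ^ y) ^ 1) & b_in)
    return b_out, diff

def approx_subtractor(x, y, b_in):
    """Approximate subtractor following Eqs. (3)-(4): returns (Bout, Diff)."""
    xnor = (x ^ y) ^ 1
    b_out = (xnor & b_in) | ((x ^ 1) & y)   # Eq. (3)
    diff = (xnor & b_in) | (x & (y ^ 1))    # Eq. (4)
    return b_out, diff

def output_value(b_out, diff):
    """Assumed weighting of the outputs: Bout is the more significant bit."""
    return 2 * b_out + diff

error_states = []
total_ed = 0
for x, y, b_in in product((0, 1), repeat=3):
    exact = output_value(*exact_subtractor(x, y, b_in))
    approx = output_value(*approx_subtractor(x, y, b_in))
    ed = abs(exact - approx)
    if ed:
        error_states.append(f"{x}{y}{b_in}")
    total_ed += ed

error_rate = len(error_states) / 8      # fraction of erroneous input states
nmed = (total_ed / 8) / 3               # assumed normalization by max value 3

print(error_states, error_rate, round(nmed, 3))
```

Because Eq. (3) equals the exact borrow function, every nonzero error distance in this enumeration comes from Diff alone.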