Approximate DCT and Quantization Techniques For Energy-Constrained Image Sensors
Authorized licensed use limited to: Thapar Institute of Engineering & Technology. Downloaded on December 27,2024 at 05:23:49 UTC from IEEE Xplore. Restrictions apply.
LI et al.: APPROXIMATE DCT AND QUANTIZATION TECHNIQUES FOR ENERGY-CONSTRAINED IMAGE SENSORS 13
Fig. 3. Visual example of the JPEG compression process. (a) Colored (RGB format) image to be compressed. (b) Colored image in YCbCr format. (c1) Y component represents the brightness of the original image. (c2) and (c3) Cb and Cr components represent the strength of blue and red signals of the original image, respectively. (d) DCT [given by (1)] result of the Y component. (e) Quantized result of (d).

Fig. 5. 1D-FDCT: Butterfly diagram for 8-point 1D-DCT. Each colored small block represents a multiplication with a specific cosine factor.

3, the compression is performed in the unit of 8 × 8 pixel blocks; moreover, three channels are required to process the compression of the original image's Y, Cb, and Cr components. Fig. 3 demonstrates the visual example of how an image is compressed.

After the first two stages, the input image is processed into chunks of pixel blocks. One block is represented as an 8 × 8 matrix M. The DCT on M is given by the matrix multiplication

$$D = T M T^{\top} \tag{1}$$

where D is the DCT of M, T is the transformation matrix, and $T^{\top}$ is the transpose of T. The transformation matrix T is defined as

$$t_{ij} = \begin{cases} \dfrac{1}{\sqrt{8}}, & \text{if } i = 0 \\[4pt] \sqrt{\dfrac{2}{8}} \cdot \cos\dfrac{(2j+1)\,i\pi}{16}, & \text{if } i > 0 \end{cases} \tag{2}$$

where $t_{ij}$ is the element in T, and subscripts i and j are from 0 to 7 and represent the row position and the column position, respectively.

In this work, to build an energy-efficient JPEG compression circuit, we implement (1) not by the direct matrix multiplication method but by the FDCT method [17]. Since (1) can be written as

$$T M T^{\top} = \left(T\,(TM)^{\top}\right)^{\top} \tag{3}$$

the computation of D can be treated as 2 rounds of one-dimensional DCT (1D-DCT), as shown in Fig. 4. The 1D-DCT is the multiplication of the transformation matrix T and any given 8 × 1 vector x. The computation of 1D-DCT can be mathematically "accelerated" using the fast architecture. The fast architecture with different types of butterfly units is shown in Fig. 5, which reduces the multiplications of Tx from 64 to 16. On the hardware level, an energy-efficient 1D-FDCT unit is achieved by adopting the multiplier-less approach [18], where four types of multiplications in 1D-FDCT (one scalar multiplication and three 2 × 2 matrix multiplications) are implemented by adders and shifters. Equations (4a), (5a), (6a), and (7a) delineate the precise mathematical expression for the four
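As an illustrative check of (1) to (3), and not a model of the fast butterfly architecture or of the multiplier-less hardware, the following Python sketch builds T from (2) and confirms numerically that the direct product in (1) matches the two rounds of 1D-DCT in (3). The helper names and the test block M are made up for this example.

```python
import math

N = 8  # block size used throughout the section


def dct_matrix(n=N):
    """Build the n x n transformation matrix T of (2)."""
    t = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == 0:
                row.append(1.0 / math.sqrt(n))
            else:
                row.append(math.sqrt(2.0 / n)
                           * math.cos((2 * j + 1) * i * math.pi / (2 * n)))
        t.append(row)
    return t


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2_direct(m, t):
    """Equation (1): D = T M T^T."""
    return matmul(matmul(t, m), transpose(t))


def dct2_two_rounds(m, t):
    """Equation (3): apply the 1D-DCT (left-multiplication by T) twice."""
    first_round = matmul(t, m)                        # 1D-DCT of every column
    second_round = matmul(t, transpose(first_round))  # 1D-DCT of every row
    return transpose(second_round)


T = dct_matrix()
M = [[(7 * r + 13 * c) % 256 for c in range(N)] for r in range(N)]  # made-up block
D_direct = dct2_direct(M, T)
D_rounds = dct2_two_rounds(M, T)
max_diff = max(abs(D_direct[r][c] - D_rounds[r][c])
               for r in range(N) for c in range(N))
print(f"largest difference between (1) and (3): {max_diff:.2e}")
```

The two results agree up to floating-point rounding, which is what allows a single 1D-DCT unit to be reused for both rounds.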
14 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 44, NO. 1, JANUARY 2025
The output of this stage is given by an 8 × 8 matrix C, with C = D ./ Q, where ./ represents the element-wise division operator. (We note that the entry values of $Q_{90}$ are smaller than those of $Q_{50}$, which implies that $Q_{90}$ yields a C with larger entry values than $Q_{50}$ does. That is, more information is retained for the image's reconstruction, and a higher reconstructed image quality can be obtained.)

Finally, the quantized result is compressed into a bit stream using Huffman encoding. The elements of C are concatenated into a 1-D vector based on the zigzag traversal. After concatenating the bits in this traversal as an input bit stream, a binary tree of the most common N-bit groups is created, where a traversal of the encoding tree in the left or right direction maps to a binary bit (0 or 1). This allows an N-bit chunk of zeros (the most common group) to map to only a single bit after Huffman encoding. After encoding, a compressed bit stream is created and sent to the communication channel.

III. APPROXIMATION TECHNIQUES

To date, approximate JPEG compression has been explored only through bit truncation [11], dynamic bit-width reduction in the DCT operation [13], or an approximate adder [12]. However, we observed that the quantization block realized using standard division algorithms consumes high power while occupying a considerable silicon area. For the first time, we explore an approximate quantization block that updates the Q-matrix so that divisions are performed with bit-shift operations, eliminating the need for costly standard division blocks and thereby saving energy and reducing silicon area. Conventional approximation strategies, like loop perforation and precision scaling, are also explored in this work. In addition to the approximate quantization block, we propose a heuristic-based approach to select the optimal configuration between loop perforation and precision scaling for a given quality requirement.

A. Approximate Quantization

A common approach to reducing the power of the Q block is to replace the standard division (A/B) with the multiplication A · (1/B), using techniques like Taylor series expansion to approximate (1/B) [19], [20], or to reduce the operation width in the division block [21]. These methods, however, require one or more multipliers, which demand relatively higher energy.

By observation, the quantization matrix Q can be replaced with an approximated quantization matrix $Q'$ by converting each element of Q to a power of 2 so that the division operation can be implemented via bit shifting. For example, $Q_{50}$ can be approximated as

$$Q'_{50} = \begin{bmatrix}
16 & 8 & 8 & 16 & 16 & 32 & 32 & 32 \\
8 & 8 & 8 & 16 & 16 & 32 & 32 & 32 \\
8 & 8 & 16 & 16 & 32 & 32 & 64 & 32 \\
8 & 16 & 16 & 16 & 32 & 64 & 64 & 32 \\
16 & 16 & 32 & 32 & 64 & 64 & 64 & 64 \\
16 & 32 & 32 & 32 & 64 & 64 & 64 & 64 \\
32 & 64 & 64 & 64 & 64 & 64 & 64 & 64 \\
64 & 64 & 64 & 64 & 64 & 64 & 64 & 64
\end{bmatrix}. \tag{10}$$

Similarly, $Q_{90}$ can be approximated as

$$Q'_{90} = \begin{bmatrix}
2 & 2 & 2 & 2 & 4 & 8 & 8 & 8 \\
2 & 2 & 2 & 4 & 4 & 8 & 8 & 8 \\
2 & 2 & 2 & 4 & 8 & 8 & 8 & 8 \\
2 & 2 & 4 & 4 & 8 & 16 & 16 & 8 \\
4 & 4 & 4 & 8 & 8 & 16 & 16 & 8 \\
4 & 4 & 8 & 8 & 16 & 8 & 16 & 16 \\
8 & 8 & 16 & 16 & 16 & 16 & 16 & 16 \\
8 & 16 & 16 & 16 & 16 & 16 & 16 & 16
\end{bmatrix}. \tag{11}$$

This approach introduces some error due to approximation but reduces the divider power consumption because of the simplicity of bit shifting. Note that the elements of the original matrix are rounded down, rather than up, to the nearest power of 2 to retain higher image quality. The general mathematical behavior of the proposed approximate technique is described as follows. Given a quantization matrix Q, for each element $q_{ij}$ in Q, we first locate $q_{ij}$ using an integer $s_{ij}$ which satisfies

$$2^{s_{ij}} \le q_{ij} < 2^{s_{ij}+1}. \tag{12}$$

The corresponding approximated element in $Q'$, $q'_{ij}$, is then constructed by

$$q'_{ij} = 2^{s_{ij}}. \tag{13}$$

If all elements in Q are considered, (13) can be further extended to a matrix form

$$Q' = 2 \mathbin{.\!\wedge} S \tag{14}$$

where .∧ is an element-wise power operator and S is an 8 × 8 matrix whose element $s_{ij}$ represents the exponent of $q'_{ij}$. Since $1 \le q_{ij} \le 255$, we obtain $0 \le s_{ij} \le 7$ according to (12); therefore, $s_{ij}$ can be represented using only 3 bits. Using this approximate technique, the quantization circuit in the JPEG compression circuit only needs to take 192 bits (3 bits × 64 elements) as the input, while the conventional division-based quantization circuit requires 512 bits (8 bits × 64 elements). At the hardware level, (12) and (13) can be implemented by an 8-to-3 priority encoder [Fig. 7(a)]. For an 8-bit input $q_{ij}$[7:0], an 8-to-3 priority encoder locates the first set bit from the MSB side and outputs that location as a 3-bit signal $s_{ij}$[2:0]. The quantization is then performed by a bit shifter, with $s_{ij}$[2:0] being the shifting amount. Therefore, our proposed architecture realizes the approximated element-wise division operation through an 8-to-3 priority encoder and a barrel shifter, as shown in Fig. 7(b). Note that at the time of decoding, the same quantization matrix $Q'$ must be used for better reconstruction of the image.

B. Precision Scaling

Precision scaling, or bit truncation, alleviates the computational load of image compression by reducing the data width of the input image. Least significant bit (LSB) truncation is realized in this article. The amount of data reduction, the truncation level $B_j$, has to be specified before image compression begins.
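A minimal software model of LSB truncation is sketched below, assuming unsigned 8-bit pixels. The function names are hypothetical, and the model only captures the arithmetic effect of dropping bits; it says nothing about how the hardware narrows or gates its datapath.

```python
def truncate_lsbs(pixel, level, width=8):
    """Drop the `level` least significant bits of an unsigned pixel."""
    if not 0 <= level < width:
        raise ValueError("truncation level must lie in [0, width)")
    if not 0 <= pixel < (1 << width):
        raise ValueError("pixel does not fit in the given width")
    return pixel >> level


def restore_scale(value, level):
    """Shift a truncated value back to the original scale (LSBs become 0)."""
    return value << level


def worst_case_error(level, width=8):
    """Largest absolute error over every possible pixel value."""
    return max(p - restore_scale(truncate_lsbs(p, level, width), level)
               for p in range(1 << width))


pixel = 183                        # 0b10110111, arbitrary example value
narrow = truncate_lsbs(pixel, 1)   # 7-bit value 0b1011011
print(narrow, restore_scale(narrow, 1), worst_case_error(1), worst_case_error(2))
```

In this model a truncation level of B bounds the per-pixel error by 2^B − 1, which is why small levels cost little image quality.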
    M_out = M_l; C_out = C_l;
    return M_out, C_out;
Fig. 7. Hardware implementation of the proposed approximate quantization method. (a) 8-to-3 priority encoder. (b) Proposed element-wise division cell, comprising an 8-to-3 priority encoder and a barrel shifter.
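The priority-encoder-plus-shifter cell can be modeled in a few lines of Python. This is a behavioral sketch only: the function names are hypothetical, the example row is the first row of the standard JPEG luminance table at quality 50 (taken from the JPEG standard, not from this section), and the rounding of negative coefficients is an assumption.

```python
def priority_encode(q):
    """Model of the 8-to-3 priority encoder: position of the leading 1 in q[7:0].

    The returned s satisfies 2**s <= q < 2**(s + 1), i.e. (12).
    """
    if not 1 <= q <= 255:
        raise ValueError("q must be a nonzero 8-bit value")
    return q.bit_length() - 1


def approx_divide(d, q):
    """Approximate d / q by shifting d right by s = priority_encode(q).

    Python's >> floors negative values; the rounding of the real barrel
    shifter is not stated in the text, so this is an assumption.
    """
    return d >> priority_encode(q)


# First row of the standard JPEG luminance table at quality 50.
q50_row = [16, 11, 10, 16, 24, 40, 51, 61]
exponents = [priority_encode(q) for q in q50_row]   # s_ij, 3 bits each
approx_row = [1 << s for s in exponents]            # q'_ij = 2**s_ij, as in (13)
print(exponents)
print(approx_row)
print(approx_divide(100, 11), 100 // 11)
```

The computed `approx_row` reproduces the first row of (10). Because the divisor is rounded down to 8, the shifted result (12) is larger than the exact quotient (9), which matches the observation that the approximated matrix behaves like a higher-quality table.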
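The knob-selection heuristic, whose closing steps are listed immediately below, can be approximated by the following greedy sketch. Only the tail of the algorithm is available here, so everything in this example is an assumption: the function name, the per-level tables of energy saved and quality lost, the quality budget, and the rule of preferring the knob with the larger ratio of energy saved to quality lost. It illustrates the idea of moving along the steepest ratio under a quality bound and is not the authors' Algorithm 3.

```python
def select_knobs(e_t, q_t, e_l, q_l, budget):
    """Greedy choice of loop-perforation level i and truncation level j.

    e_t[k] / q_t[k]: energy saved / quality lost when truncation goes from
    level k-1 to k; e_l and q_l are the same for loop perforation. Index 0
    is the unapproximated design. All tables are hypothetical inputs and
    every quality increment is assumed to be positive.
    """
    i = j = 0
    saved = 0.0
    while True:
        candidates = []
        if j + 1 < len(q_t) and q_t[j + 1] <= budget:
            candidates.append((e_t[j + 1] / q_t[j + 1], "truncation"))
        if i + 1 < len(q_l) and q_l[i + 1] <= budget:
            candidates.append((e_l[i + 1] / q_l[i + 1], "perforation"))
        if not candidates:
            return i, j, saved             # neither knob fits the budget
        _, knob = max(candidates)          # steepest savings per degradation
        if knob == "truncation":
            j += 1
            saved += e_t[j]
            budget -= q_t[j]
        else:
            i += 1
            saved += e_l[i]
            budget -= q_l[i]


# Made-up numbers: energy saved (%) and %SAD lost per step of each knob.
result = select_knobs(e_t=[0, 30, 10], q_t=[0, 1.0, 2.0],
                      e_l=[0, 10, 8, 5], q_l=[0, 0.5, 0.8, 1.5],
                      budget=2.0)
print(result)
```

With these made-up tables and a 2% budget, the sketch takes one truncation step, then one perforation step, and stops because no further step fits the remaining budget.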
        if (Q_t[j + 1] ≤ Q_A) then
            E = E − E_t; Q = Q − Q_t; j = j + 1;
        else if (Q_l[i + 1] ≤ Q_A) then
            E = E − E_l; Q = Q − Q_l; i = i + 1;
    else
        if (Q_l[i + 1] ≤ Q_A) then
            E = E − E_l; Q = Q − Q_l; i = i + 1;
        else if (Q_t[j + 1] ≤ Q_A) then
            E = E − E_t; Q = Q − Q_t; j = j + 1;
    return i, j;

where $E(B_i, L_i)$ and $Q(B_i, L_i)$ represent the relative energy and %SAD degradation under particular $B_i$ and $L_i$, respectively. The convexity of the problem can be justified as follows: First, $Q(B_i, L_i)$ is monotonically increasing in each dimension because higher bit truncation or loop skipping levels result in higher %SAD degradation. Second, since the relative energy and quality degradation are two inversely related variables, minimizing $E(B_i, L_i)$ is equivalent to maximizing $Q(B_i, L_i)$. With these two properties, (16) can be treated as a maximization problem constrained in the first octant of the 3-D space spanned by $B_i$, $L_i$, and $Q(B_i, L_i)$, where the objective function is monotonically increasing in the direction of $B_i$ and $L_i$. As a result, the convexity is assured, and it is feasible to apply a gradient descent algorithm to find the optimum.

The controller implementing this heuristic, realized in software, automatically configures the degree of loop perforation and bit truncation, $L_i$ and $B_j$, respectively, by moving in the direction of the steepest gradient of the ratio of energy savings to quality degradation resulting from the variation in each approximation knob. Fig. 9(b) shows the 3-D Q–E plot for different $L_i$ and $B_j$, along with the relative energy required for processing. For a specified quality degradation bound, our proposed gradient descent algorithm uses the 3-D plot and, following Algorithm 3, selects the $(B_i, L_j)$ combination that results in the lowest-energy configuration, as indicated by the color bar on the right.

IV. SIMULATION METHODOLOGY

Fig. 11 depicts the simulation setup used for testing the functionality of the hardware. Our simulation is conducted over an image dataset from the gallery of the Computer Vision Group of the University of Granada [22], which includes many representative images in the field of image processing, such as Baboon, Boat, Barbara, Pirate, Bridge, and Airplane. The input image is divided into 8 × 8 blocks in MATLAB before being fed to the design under test (DUT). The MATLAB code runs the gradient descent-based heuristic algorithm to estimate the optimal approximation knobs for a particular input quality bound (SAD). The software also decides the degree of precision scaling to be realized and configures the hardware by clock gating the required bits throughout the accelerator. A Verilog test bench is used to convey the appropriate degree of truncation and the preprocessed image (in a text file) for simulation. The accelerator performs the JPEG encoding on the input image and writes the output processed image to a separate text file, which is then reconstructed through inverse quantization and inverse DCT in software. Inverse quantization is an element-wise multiplication, given by

$$R = C \odot Q \tag{17}$$

where $\odot$ is the element-wise multiplication operator, and R is the result of inverse quantization.

In general, the Q used in this step should be the same one as used in the encoding process. However, to give a more comprehensive analysis of the effect of introducing an approximated quantization matrix, we also analyze the case where the approximated matrix $Q'$ is used in encoding while the unmodified matrix Q is still used for inverse quantization. Inverse DCT is given by the transformation

$$N = T^{\top} R\, T \tag{18}$$

where N is the result of inverse DCT and T is given by (2). The reconstructed image is obtained after rounding all entries of N. In the end, performance evaluation and quality assessment are conducted based on the reconstructed results.

To validate the accelerator's functionality, we compare the hardware output against MATLAB's built-in JPEG compression. The JPEG register-transfer level (RTL) design is synthesized using Synopsys Design Compiler and mapped to the TSMC 65-nm standard cell library. The functionality of the extracted netlist is revalidated. All the RTL simulations are performed using the Cadence NC-Verilog simulator. The design area and power values are obtained from the post-synthesis simulation results. Note that results from SPICE simulations (done in Cadence Virtuoso) are utilized as they provide precise energy numbers.

TABLE I. STATISTICS OF FIG. 12: IMAGE QUALITY UNDER DIFFERENT QUANTIZATION SCHEMES
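The software reconstruction path of (17) and (18) can be sketched as follows. This is a pure-Python illustration under stated assumptions: the flat quantization table of 16s and the test block are made up, rounding to the nearest integer in the quantizer is assumed, and the helper names are hypothetical. It does not reproduce the MATLAB flow used in the paper.

```python
import math

SIZE = 8


def dct_matrix(n=SIZE):
    """Transformation matrix T of (2)."""
    return [[(1.0 / math.sqrt(n)) if i == 0 else
             math.sqrt(2.0 / n) * math.cos((2 * j + 1) * i * math.pi / (2 * n))
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(m, t):
    """Forward DCT, (1): D = T M T^T."""
    return matmul(matmul(t, m), transpose(t))


def idct2(r, t):
    """Inverse DCT, (18): N = T^T R T."""
    return matmul(matmul(transpose(t), r), t)


def quantize(d, q):
    """C = D ./ Q, rounded to the nearest integer (rounding mode assumed)."""
    return [[round(d[r][c] / q[r][c]) for c in range(SIZE)] for r in range(SIZE)]


def dequantize(c, q):
    """Inverse quantization, (17): element-wise product of C and Q."""
    return [[c[r][k] * q[r][k] for k in range(SIZE)] for r in range(SIZE)]


def frobenius(a, b):
    return math.sqrt(sum((a[r][c] - b[r][c]) ** 2
                         for r in range(SIZE) for c in range(SIZE)))


T = dct_matrix()
M = [[(7 * r + 13 * c) % 256 for c in range(SIZE)] for r in range(SIZE)]
Q16 = [[16] * SIZE for _ in range(SIZE)]   # hypothetical flat power-of-2 table

D = dct2(M, T)
R = dequantize(quantize(D, Q16), Q16)
N_mat = idct2(R, T)
reconstructed = [[round(v) for v in row] for row in N_mat]
print(f"coefficient error {frobenius(R, D):.3f}, pixel error {frobenius(N_mat, M):.3f}")
```

Because T is orthonormal, the total error introduced in the coefficients by quantization equals the total error seen in the pixels before the final rounding, so the two printed numbers coincide.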
V. RESULTS

In this section, we first discuss the effect of individual approximation techniques on the overall performance of JPEG compression hardware. The performance of the combined approximation strategy that dynamically tunes the configuration of the constituent techniques is also shown. Note that system-level hardware metrics like power and area are not reported in previous related literature, such as [23] (which applies bit truncation and loop perforation in JPEG compression) or [24] (which works on the approximation of DCT hardware). This work, however, provides the power and area numbers for the optimal design, including the DCT, approximate quantization block, and loop-skipping circuitry, from the post-synthesis SPICE simulations. The quality of the images is evaluated in terms of SSIM, PSNR, and SAD as discussed earlier in Section III.

A. Approximate Quantization

1) Case I: Images Reconstructed by the Corresponding Encoding Quantization Matrix (10), (11): The scatter plots in Fig. 12 and the corresponding statistics in Table I together show the change in the SSIM and PSNR of the reconstructed images caused by the use of an approximated matrix. The reconstructed images encoded using bit-shifting-based quantization demonstrate better quality (higher SSIM and higher PSNR) than the reconstructed images encoded using standard division-based quantization for quality levels 50 and 90. This is because every element in the quantization matrix is rounded down to its closest power of 2, resulting in a new quantization matrix whose quality factor is higher than the original's. This result, however, entails a reduction in the compression ratio of the compressed image. Fig. 13 shows the effect of the approximated quantization method on compression ratio, where compression-ratio declines of 20.1% and 13.4% are observed when the approximated $Q_{50}$ and $Q_{90}$ are adopted, respectively. Detailed statistics for Fig. 13 are provided in Table II. Although the compression ratio is reduced, the advantages of using approximated quantization matrices are evident once the hardware-level implementation is considered. Fig. 14 highlights the benefits in power and area of the proposed quantization scheme, which achieves an 85% reduction in area and 94% power savings relative to the conventional division-based quantization block.

Fig. 12. Effect of quantization block approximation on image quality: (a) and (b) SSIM and PSNR comparison for Q50, and (c) and (d) SSIM and PSNR comparison for Q90.

Fig. 13. Effect of quantization block approximation on compression ratio: (a) on Q50 and (b) on Q90.

2) Case II: Images Reconstructed by the Unmodified Quantization Matrix (8), (9): For better reconstructed image quality, one should decode the compressed image with the same quantization matrix used in the encoding stage. However, since the approximate quantization circuit is only applied at the encoding end in this work, we assume that the decoding end may still use the unmodified quantization matrix to reconstruct images. The discussion about using different quantization matrices for encoding and decoding is
thus presented in Fig. 15, which provides the image quality comparison between images reconstructed with the modified (approximated) $Q'$ matrix and with the original (standard) one for $Q_{50}$ and $Q_{90}$. Using the standard matrix $Q_{50}$ to reconstruct the image encoded by the approximated matrix $Q'_{50}$ results in 4.94% SSIM degradation, as shown in Fig. 15(a). However, in the case of $Q_{90}$, using the standard $Q_{90}$ to reconstruct images induces more SSIM degradation than using the standard $Q_{50}$, as shown in Fig. 15(b), where 22.35% SSIM degradation is observed.

This result stems from the fact that the first entry in $Q_{50}$ is originally a power of 2 ($Q_{50}(0,0) = 16$), which is not the case for $Q_{90}$ ($Q_{90}(0,0) = 3$). For the scenario where $Q'_{90}$ is used in encoding while $Q_{90}$ is used in decoding, different divisors for the first entry of the DCT coefficient block (also known as the DC coefficient) are used in the quantization ($Q'_{90}(0,0) = 2$) and inverse quantization ($Q_{90}(0,0) = 3$). This discrepancy corrupts the reconstructed value of the DC coefficient at the step of inverse quantization. Since, for still images, most of the energy is located in the low-frequency area [25], the corrupted DC coefficient then results in severe image degradation after inverse DCT is performed. Note that this problem could be addressed by keeping the first entry in the standard $Q_{90}$ matrix unmodified and designing a multiplier-less divider. For example, the element-wise division for $Q_{90}$'s first entry (quantizing a number by 3) can be approximately implemented as 1/4 + 1/8 − 1/16 + 1/32 (≈ 0.344). Thus, it can be implemented using only addition and subtraction along with bit-shift operations.

Lastly, we present a subjective analysis on three images (Baboon, Pirate, and Boat) selected from the image dataset. Fig. 16 shows the reconstructed images using different quantization methods and levels.

B. Precision Scaling

Fig. 17 depicts the effect of precision scaling on the reconstructed image quality. Detailed statistics of the quality degradation are reported in Table III. With one-bit truncation, the quality of the reconstructed images degrades slightly; only 4.82% SSIM degradation and 7.05% PSNR degradation are observed according to the simulation results over the dataset. The hardware benefit resulting from the bit truncation technique is illustrated in Fig. 18. The JPEG compression circuit with 1-bit truncation consumes around 30% less power and area than the circuit with no data bit-width modification. Fig. 19 shows the subjective analysis on three selected images, Baboon, Pirate, and Boat, under different truncation

TABLE II. STATISTICS OF FIG. 13: COMPRESSION RATIO UNDER DIFFERENT QUANTIZATION SCHEMES

Fig. 15. Comparison of reconstructed image quality between the image decoded by the standard Q and by the modified Q for (a) Q50 and (b) Q90.

Fig. 17. Effect of precision scaling: (a) SSIM versus precision scaling level and (b) PSNR versus precision scaling level.
TABLE III. STATISTICS OF FIG. 17: IMAGE QUALITY UNDER DIFFERENT BIT TRUNCATION LEVELS

TABLE IV. STATISTICS OF FIG. 20: ENERGY SAVED UNDER DIFFERENT LOOP SKIPPING LEVELS
TABLE V. COMPARISON WITH EXISTING MULTIPLIER-LESS DCT DESIGNS

Fig. 21. Effect of different loop-skipping levels: (a) relative energy versus SAD degradation and (b) relative energy versus SSIM degradation.
the area overhead of the quantization blocks and the blue bars represent the nonquantization part.

According to simulation with the Cadence Spectre Simulator at the typical/typical (TT) corner, 25 °C, with a supply of 1 V, the baseline design (using bit truncation level 1 and loop skipping level 2) consumes an average current of 6.75 mA at 100 MHz at the expense of 2% SAD degradation. On the other hand, the proposed architecture consumes an average current of 4.35 mA under the same simulation condition. Therefore, our proposed approach saves 36% energy, as shown in Fig. 23(b). The proposed architecture equivalently dissipates a power of 15 µW in the DCT and quantization stages to sustain a throughput of 480-p colored images at 6 frames/s. This is 10× better than the analog solution [32] (which utilizes passive elements, i.e., switched capacitors, to save power) and 6× better than the current state of the art [33] (which operates at near-threshold to reduce power).

VI. DISCUSSION

This work focuses on accelerating the DCT and quantization steps in JPEG compression. However, another widely used image compression scheme, JPEG2000, merits discussion due to its superior compression efficiency and image quality.

1) JPEG Versus JPEG2000: Although the transform step in JPEG2000, which uses the discrete wavelet transform (DWT), is simpler than the DCT in JPEG, the overall structure of JPEG2000 is more complex [34], [35], [36], [37], [38], [39]. This complexity arises from more refined quantization steps involving floating-point division and advanced coding schemes like embedded block coding with optimized truncation (EBCOT) [40]. Consequently, JPEG2000 is not well suited for applications in edge devices.

2) DCT Versus DWT: The 2D-DCT is more complex than the 2D-DWT for the transformation step. Nonetheless, the multiplier-less approach adopted in this design significantly reduces the hardware implementation cost, making it competitive with the 2D-DWT in terms of hardware resources. The transform block in the proposed design performs the 2D-DCT using only shift, addition, and subtraction operations; no multipliers are used. This approach is highly similar to the implementation of FIR-based 2D-DWT.

3) Possible Future Work for Approximate JPEG2000: The approximate techniques proposed in this article (multiplier-less transformation, approximate quantization, bit truncation, and loop perforation) can be applied not only to JPEG but also to JPEG2000. For instance, 2D-DWT can be implemented

VII. CONCLUSION

This work demonstrates a synthesizable multiplier-less JPEG accelerator equipped with approximations both in software and RTL in the form of a modified quantization block, precision scaling, and loop perforation, trading off the quality of the image for energy and area reduction. With a gradient descent-based heuristic, the accelerator's performance can be tuned to maximize energy savings while meeting the image quality constraints. The proposed architecture with the combined approximation strategies achieves a 36% reduction in energy consumption at the expense of 2% SAD quality degradation in the image, which lies within acceptable limits for any image processing application. Moreover, it consumes 15 µW at the DCT and quantization stages to compress a colored 480-p image at 6 frames/s, which is 10× better than the previous literature.

REFERENCES

[1] C. Zhu, H. Zhang, and Y. Tang, “Lossless image compression algorithm based on long short-term memory neural network,” in Proc. 5th ICCIA, 2020, pp. 82–88.
[2] L. Liang and D. Shujun, “Study on JPEG2000 optimized compression algorithm for remote sensing image,” in Proc. NSWCTC, 2009, pp. 771–775.
[3] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image compression standard,” IEEE Signal Process. Mag., vol. 18, no. 5, pp. 36–58, Sep. 2001.
[4] C. Christopoulos, A. Skodras, and T. Ebrahimi, “The JPEG2000 still image coding system: An overview,” IEEE Trans. Consum. Electron., vol. 46, no. 4, pp. 1103–1127, Nov. 2000.
[5] F. Ebrahimi, M. Chamik, and S. Winkler, “JPEG vs. JPEG 2000: An objective comparison of image encoding quality,” in Proc. 28th Appl. Digit. Image Process., 2004, pp. 300–308.
[6] E. Allen, S. Triantaphillidou, and R. Jacobson, “Image quality comparison between JPEG and JPEG2000. I. Psychophysical investigation,” J. Imag. Sci. Technol., vol. 51, no. 3, pp. 248–258, 2007.
[7] S. Bouguezel, M. Omair Ahmad, and M. N. S. Swamy, “Low-complexity 8 × 8 transform for image compression,” Electron. Lett., vol. 44, no. 21, pp. 1249–1250, 2008.
[8] S. Bouguezel, M. Omair Ahmad, and M. N. S. Swamy, “A fast 8 × 8 transform for image compression,” in Proc. Int. Conf. Microelectron. (ICM), 2009, pp. 74–77.
[9] S. Bouguezel, M. Omair Ahmad, and M. N. S. Swamy, “A low-complexity parametric transform for image compression,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2011, pp. 2145–2148.
[10] N. Brahimi, T. Bouden, T. Brahimi, and L. Boubchir, “A novel and efficient 8-point DCT approximation for image compression,” Multimedia Tools Appl., vol. 79, pp. 7615–7631, Mar. 2020.
[11] F. S. Snigdha, D. Sengupta, J. Hu, and S. S. Sapatnekar, “Optimal design of JPEG hardware under the approximate computing paradigm,” in Proc. 53rd ACM/EDAC/IEEE DAC, 2016, pp. 1–6.
[12] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[13] J. Park, J. H. Choi, and K. Roy, “Dynamic bit-width adaptation in DCT: An approach to trade off image quality and computation energy,” IEEE Trans. Very Large Scale Integr. Syst., vol. 18, no. 5, pp. 787–793, May 2010.
[14] H. A. F. Almurib, T. Nandha Kumar, and F. Lombardi, “Approximate DCT image compression using inexact computing,” IEEE Trans. Comput., vol. 67, no. 2, pp. 149–159, Feb. 2018.
[15] M. Imani, R. Garcia, A. Huang, and T. Rosing, “CADE: Configurable approximate divider for energy efficiency,” in Proc. DATE, 2019, pp. 586–589.
[16] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Trans. Consum. Electron., vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992.
[17] W.-H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[18] B.-I. Kim and S. G. Ziavras, “Low-power multiplierless DCT for image/video coders,” in Proc. IEEE 13th Int. Symp. Consum. Electron., 2009, pp. 133–136.
[19] J. Melchert, S. Behroozi, J. Li, and Y. Kim, “SAADI-EC: A quality-configurable approximate divider for energy efficiency,” IEEE Trans. Very Large Scale Integr. Syst., vol. 27, no. 11, pp. 2680–2692, Nov. 2019.
[20] S. Hashemi, R. I. Bahar, and S. Reda, “A low-power dynamic divider for approximate applications,” in Proc. 53rd ACM/EDAC/IEEE DAC, 2016, pp. 1–6.
[21] S. Vahdat, M. Kamal, A. Afzali-Kusha, M. Pedram, and Z. Navabi, “TruncApp: A truncation-based approximate divider for energy efficient DSP applications,” in Proc. DATE, 2017, pp. 1635–1638.
[22] “University of Granada test images.” Accessed: 31 Jan. 2024. [Online]. Available: https://ccia.ugr.es/cvg/index2.php
[23] S. Barone, M. Traiola, M. Barbareschi, and A. Bosio, “Multi-objective application-driven approximate design method,” IEEE Access, vol. 9, pp. 86975–86993, 2021.
[24] M. Barbareschi, S. Barone, A. Bosio, J. Han, and M. Traiola, “A genetic-algorithm-based approach to the design of DCT hardware accelerators,” ACM J. Emerg. Technol. Comput. Syst., vol. 18, no. 3, pp. 1–25, 2022.
[25] B. Furht, Discrete Cosine Transform (DCT). Boston, MA, USA: Springer, 2008, pp. 186–188.
[26] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. Syst., Man, Cybernet., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973.
[27] X. Wang, K. Chen, C. Wang, and W. Liu, “An energy-efficient approximate DCT design for image processing,” in Proc. IEEE 15th Int. Conf. (ASIC) (ASICON), 2023, pp. 1–4.
[28] S. Skandha Deepsita, K. Divya, and S. Noor Mahammad, “Energy efficient and multiplierless approximate integer DCT implementation for HEVC,” in Proc. IFIP/IEEE 29th Int. Conf. Very Large Scale Integr. (VLSI-SoC), 2021, pp. 1–6.
[29] Y. Xing, Z. Zhang, Y. Qian, Q. Li, and Y. He, “An energy-efficient approximate DCT for wireless capsule endoscopy application,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2018, pp. 1–4.
[30] V. Kaushal, B. Garg, A. Jaiswal, and G. K. Sharma, “Energy aware computation driven approximate DCT architecture for image processing,” in Proc. 28th Int. Conf. VLSI Design, 2015, pp. 357–362.
[31] A. Darji and R. P. Makwana, “High-performance multiplierless DCT architecture for HEVC,” in Proc. 19th Int. Symp. VLSI Design Test, 2015, pp. 1–5.
[38] A. M. Reza, “JPEG2000 hardware implementation procedures and issues,” WSEAS Trans. Circuits Syst., vol. 12, no. 4, pp. 101–117, 2013.
[39] L. Liu, N. Chen, H. Meng, L. Zhang, Z. Wang, and H. Chen, “A VLSI architecture of JPEG2000 encoder,” IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 2032–2040, Nov. 2004.
[40] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Process., vol. 9, pp. 1158–1170, 2000.

Ming-Che Li (Graduate Student Member, IEEE) was born in Yilan, Taiwan, in 1999. He received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2021. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Purdue University, West Lafayette, IN, USA.
His current research interests include hardware security, approximate computing in image and video compression, and stochastic computing.

Archisman Ghosh (Student Member, IEEE) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2017. He is currently pursuing the Ph.D. degree with Purdue University, West Lafayette, IN, USA.
He is currently a Bilsland Dissertation Fellow with Purdue University. His research interests include digital SoC design and hardware security. Prior to his Ph.D., he worked with Samsung Semiconductor India R&D, Bengaluru, India, for two years. He has interned with Intel Labs, Hillsboro, OR, USA.
Mr. Ghosh is one of the recipients of the prestigious IEEE SSCS Pre-Doctoral Achievement Award 2022. He was a recipient of the prestigious ECE Meissner Fellowship from Purdue University in 2019–2020 as an incoming graduate student.

Shreyas Sen (Senior Member, IEEE) received the Ph.D. degree from the School of Electrical and Computer Engineering (ECE), Georgia Institute of Technology (Georgia Tech), Atlanta, GA, USA, in 2011.
He is an Elmore Associate Professor of ECE and BME with Purdue University, West Lafayette, IN, USA, where he serves as the Director of the Center for Internet of Bodies. He has authored/co-authored three book chapters, over 200 journal and conference papers and has 25 patents granted/pending. His current research interests span mixed-signal circuits/systems and electromagnetics for the Internet of Bodies and hardware security.
Dr. Sen is a recipient of the NSF CAREER Award 2020, the AFOSR Young Investigator Award 2016, the NSF CISE CRII Award 2017, the Intel Outstanding Researcher Award 2020, the Google Faculty Research Award 2017, the Purdue CoE Early Career Research Award 2021, the Intel Labs Quality Award 2012 for industry-wide impact on USB-C type, the Intel
[32] K. Gaurav Kumar, G. Barik, B. Chatterjee, S. Bose, S. Maity, and S. Sen, Ph.D. Fellowship 2010, the IEEE Microwave Fellowship 2008, the GSRC
“A 65 nm 2.02 mw 50 mbps direct analog to MJPEG converter for video Margarida Jacome Best Research Award 2007, and nine best paper awards,
sensor nodes using low-noise switched capacitor MAC-Quantizer with including IEEE CICC 2019 and 2021 and IEEE HOST 2017–2020, for four
automatic calibration and sparsity-aware ADC,” in Proc. IEEE Custom consecutive years. He is the inventor of the Electro-Quasistatic Human Body
Integr. Circuits Conf. (CICC), 2023, pp. 1–2. Communication, or Body as a Wire technology, for which, he is the recipient
[33] N. Reynders and W. Dehaene, “27.3 a 210mv 5MHz variation-resilient of the MIT Technology Review top-10 Indian Inventor Worldwide Under 35
near-threshold JPEG encoder in 40nm CMOS,” in Proc. ISSCC, 2014, (MIT TR35 India) Award in 2018 and Georgia Tech 40 Under 40 Award in
pp. 456–457. 2022. To commercialize this invention, he founded Ixana and serves as the
[34] C.-J. Lian, K.-F. Chen, H.-H. Chen, and L.-G. Chen, “Analysis and Chairman and the CTO and led Ixana to awards, such as 2 × CES Innovation
architecture design of block-coding engine for EBCOT in JPEG 2000,” Award 2024, EE Times Silicon 100, and Indiana Startup of the Year Mira
IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 3, pp. 219–230, Award 2023. His work has been covered by 250+ news releases worldwide,
Mar. 2003. invited appearance on TEDx Indianapolis, NASDAQ live Trade Talks at CES
[35] M. D. Adams, “The jpeg-2000 still image compression standard,” 2023, Indian National Television CNBC TV18 Young Turks Program, NPR
document ISO/IEC JTC1/SC29/WG1N2412, Int. Org. Stand., Geneva, subsidiary Lakeshore Public Radio, and the CyberWire podcast. His work was
Switzerland, Dec. 2002. chosen as one of the top-10 papers in the Hardware Security field (TopPicks
[36] L.-G. Chen, C.-J. Lian, K.-F. Chen, and H.-H. Chen, “Analysis and 2019). He serves/has served as an Associate Editor for IEEE S OLID -S TATE
architecture design of JPEG2000,” in Proc. ICME, 2001, pp. 210–213. C IRCUITS L ETTERS, Nature Scientific Reports, Frontiers in Electronics, and
[37] Y. Meng, L. Liu, L. Zhang, and Z. Wang, “Design methodology of low IEEE Design & Test, an Executive Committee Member of IEEE Central
power JPEG2000 codec exploiting dual voltage scaling,” in Proc. 6th Indiana Section, and the Technical Program Committee Member of ISSCC,
Int. Conf. ASIC, 2005, pp. 183–186. CICC, DAC, CCS, IMS, DATE, ISLPED, ICCAD, ITC, and VLSI Design.