IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 44, NO. 1, JANUARY 2025

Approximate DCT and Quantization Techniques for Energy-Constrained Image Sensors

Ming-Che Li, Graduate Student Member, IEEE, Archisman Ghosh, Student Member, IEEE, and Shreyas Sen, Senior Member, IEEE

Abstract—Recent expansions in multimedia devices for many applications, such as surveillance, self-driving cars, and healthcare, gather enormous amounts of real-time images for processing and inference. The images are first compressed using compression schemes, like Joint Photographic Experts Group (JPEG), before processing to reduce storage costs and additional power requirements for transmitting the captured data in this era of emerging ultra-wideband communication and human-body communication. The JPEG algorithm realizes image compression using simplistic matrix manipulations, making it preferable for hardware implementations. Furthermore, due to inherent error resilience and imperceptibility in images, JPEG can be approximated to reduce the required computation/processing power and area. This work demonstrates the first end-to-end approximate computing-based optimization of JPEG hardware using 1) an approximate division realized using bit-shift operators to reduce the complexity of the computationally intensive quantization block; 2) loop perforation; and 3) precision scaling on top of a multiplier-less fast discrete cosine transform (DCT) architecture to achieve an extremely energy-efficient JPEG compression unit that is well suited to power/bandwidth-limited scenarios. Furthermore, a gradient descent-based heuristic composed of two conventional approximation strategies, i.e., precision scaling and loop perforation, is implemented for tuning the degree of approximation to trade off energy consumption against the quality degradation of the decoded image. The entire register-transfer level (RTL) design is coded in Verilog HDL, synthesized using the industry-standard tool, mapped to TSMC 65-nm CMOS technology, and simulated using Cadence Spectre Simulator at 25 °C in the typical/typical (TT) corner. The approximate division approach in the quantization block achieved around 28% reduction in the active design area. The heuristic-based approximation technique combined with accelerator optimization achieves a significant energy reduction of 36% for a minimal image quality degradation of 2% sum of absolute difference (SAD). Simulation results also show that the proposed architecture consumes 15 µW at the DCT and quantization stages to compress a colored 480-p image at 6 frames/s.

Index Terms—Approximate computing, approximate divider, energy efficient, image sensor, in-sensor analytics, loop perforation, precision scaling.

Manuscript received 24 September 2023; revised 26 February 2024 and 15 June 2024; accepted 18 July 2024. Date of publication 26 July 2024; date of current version 26 December 2024. This work was supported by Quasistatics, Inc. under Grant 40003567 (Account F.00127126.02.036). This article was recommended by Associate Editor X. Jiao. (Ming-Che Li and Archisman Ghosh contributed equally to this work.) (Corresponding author: Shreyas Sen.)
The authors are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCAD.2024.3434333

I. INTRODUCTION

Usage of multimedia devices has expanded exponentially in recent years in every application, such as surveillance, self-driving cars, and healthcare. These sensors generate enormous amounts of data in images or videos that require processing and storage. As these sensors operate in an energy-constrained environment, as shown in Fig. 1, the collected images are often compressed at the source using conventional image compression techniques to mitigate the energy expended in transmission, processing, and storage.

Image compression algorithms can be categorized into two classes based on the error they introduce, i.e., lossy and lossless compression. Lossless schemes, like portable network graphics (PNG) and tagged image file format (TIFF), ensure perfect reconstruction of the image at the cost of reduced compression (usually 2:1 [1]). On the other hand, lossy schemes, such as Joint Photographic Experts Group (JPEG), progressive graphics file (PGF), and JPEG2000 [2], provide higher compression while seeking to reconstruct a visually similar image with imperceptible degradation in quality.

Among these lossy compression schemes, JPEG is one of the most common image compression methods utilized in energy-constrained edge devices due to its lightweight nature and broad compatibility. Although JPEG2000 provides better compression efficiency and less image quality degradation at lower bit rates because of its compression algorithm [3], [4], [5], [6], JPEG has gained popularity over JPEG2000 due to its lower hardware requirements, which is a more critical factor for performing image compression in energy-constrained environments.

To date, approximate solutions for JPEG image compression are concentrated on either the algorithm level or the hardware level. Previous works [7], [8], [9], [10] present mathematical approximations of the discrete cosine transform (DCT) matrix in which fractional coefficients are replaced with integers or negative powers of two, but no corresponding hardware implementation is proposed in the literature. On the other hand, hardware-targeted approximation literature, such as [11] (performing bit truncation at the DCT stage), [12], [13], [14] (using approximate adders), or [15] (replacing the quantization with an approximate divider), focuses only on single-stage optimization. These works fail to produce an energy-optimal compression circuit that performs approximate computing on more than one of the JPEG compression stages.
1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Thapar Institute of Engineering & Technology. Downloaded on December 27,2024 at 05:23:49 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. Significance of energy-constrained compression hardware in a power and bandwidth-limited scenario.

In this work, we present an accelerated JPEG computing unit that incorporates an approximate quantization block and two well-known approximation techniques—loop perforation and precision scaling—on top of a multiplier-less fast DCT (FDCT) architecture to create a solution suitable for energy-constrained devices. A lightweight heuristic is also developed to estimate and fine-tune the degree of loop perforation and precision scaling without introducing significant degradation in the output image quality.

In brief, the proposed work has threefold contributions.
1) We propose an approximate low-power area-efficient JPEG compression architecture focused on restructuring the computation-heavy quantization block. To date, no hardware-efficient image compression circuit has been reported that uses bit-shift-based division modules to optimize this block.
2) A gradient descent-based heuristic is proposed that employs the commonly used loop perforation and precision scaling strategies to automatically configure the degree of approximation in the compression engine, reducing the energy required for processing while maintaining the quality of the decoded image within acceptable limits.
3) The reconfigurable architecture with optimized division modules and the implemented heuristics achieves a significant energy reduction of 36% for a minimal image quality degradation of 2% in sum of absolute difference (SAD).

The overall implementation achieves 28% area improvement with respect to the baseline design. It is important to note that our baseline design is already optimized using a multiplier-less DCT architecture. Simulation results show that the proposed architecture consumes only 15 µW while compressing 480-p images at 6 frames/s.

The remainder of this article is organized as follows. The JPEG compression technique is briefly discussed in Section II. Section III dives into the implemented approximation techniques along with the proposed heuristic that dynamically selects the optimum approximation knobs for operation. Simulation methodology is discussed in Section IV. Finally, we discuss the results in Section V, before concluding this article in Section VII.

II. IMAGE COMPRESSION TECHNIQUE

The Joint Photographic Experts Group developed the standard JPEG image compression algorithm [16] for the International Organization for Standardization (ISO). The algorithm can be broken down into five essential stages. To compress one 512 × 512 pixel colored image, the image has to be run through the following stages, as shown in Fig. 2.
1) Stage 1: Convert pixel values from RGB to YCbCr format. (Different from RGB color space, YCbCr is another color space where a colored image is described in terms of brightness (Y) and the chroma strength of blue (Cb) and red (Cr) signals.)
2) Stage 2: Downsample the chrominance component (Cb and Cr) for every 2 × 2 pixel block.
3) Stage 3: DCT on 8 × 8 pixel block.
4) Stage 4: Quantize DCT output matrix via element-wise division by quantization matrix.
5) Stage 5: Perform Huffman encoding on the zigzag traversal of the quantized matrix to keep low-frequency components at the beginning of the serial chain and insignificant high-frequency components at the end.

Fig. 2. Flowchart of standard JPEG compression.

JPEG compression exploits the fact that the human eye is relatively insensitive to 1) chromaticity compared to luminosity and 2) high-frequency spatial changes in an image. Because of property 1), the size of the data used to represent the color intensity of an image can be appropriately reduced without affecting the image's visual quality. This data size reduction is realized by the first two stages of JPEG compression, where an image is first converted into the space of luminance and chrominance, and its chrominance components are then sampled every 2 pixels in the horizontal direction while every luminance pixel is retained. On the other hand, high-frequency spatial changes in an image are suppressed in the DCT, quantization, and Huffman encoding stages. Note that starting from Stage

3, the compression is performed in units of 8 × 8 pixel blocks; moreover, three channels are required to process the compression of the original image's Y, Cb, and Cr components. Fig. 3 demonstrates a visual example of how an image is compressed.

Fig. 3. Visual example of the JPEG compression process. (a) Colored (RGB format) image to be compressed. (b) Colored image in YCbCr format. (c1) Y component represents the brightness of the original image. (c2) and (c3) Cb and Cr components represent the strength of blue and red signals of the original image, respectively. (d) DCT [given by (1)] result of the Y component. (e) Quantized result of (d).

After the first two stages, the input image is processed into chunks of pixel blocks. One block is represented as an 8 × 8 matrix $M$. The DCT on $M$ is given by the matrix multiplication

$$D = T M T^{\top} \tag{1}$$

where $D$ is the DCT of $M$, $T$ is the transformation matrix, and $T^{\top}$ is the transpose of $T$. The transformation matrix $T$ is defined as

$$t_{ij} = \begin{cases} \dfrac{1}{\sqrt{8}}, & \text{if } i = 0 \\[2ex] \sqrt{\dfrac{2}{8}} \cdot \cos\dfrac{(2j+1)i\pi}{16}, & \text{if } i > 0 \end{cases} \tag{2}$$

where $t_{ij}$ is the element in $T$, and subscripts $i$ and $j$ are from 0 to 7 and represent the row position and the column position, respectively.

Fig. 4. Dividing 2-D DCT into 1-D DCT.

Fig. 5. 1D-FDCT: Butterfly diagram for 8-point 1D-DCT. Each colored small block represents a multiplication with a specific cosine factor.

In this work, to build an energy-efficient JPEG compression circuit, we implement (1) not by the direct matrix multiplication method but by the FDCT method [17]. Since (1) can be written as

$$T M T^{\top} = \left(T\,(TM)^{\top}\right)^{\top} \tag{3}$$

the computation of $D$ can be treated as 2 rounds of one-dimensional DCT (1D-DCT), as shown in Fig. 4. The 1D-DCT is the multiplication of the transformation matrix $T$ and any given 8 × 1 vector $x$. The computation of 1D-DCT can be mathematically "accelerated" using the fast architecture. The fast architecture with different types of butterfly units is shown in Fig. 5, which reduces the multiplications of $Tx$ from 64 to 16. On the hardware level, an energy-efficient 1D-FDCT unit is achieved by adopting the multiplier-less approach [18], where four types of multiplications in 1D-FDCT (one scalar multiplication and three 2 × 2 matrix multiplications) are implemented by adders and shifters. Equations (4a), (5a), (6a), and (7a) delineate the precise mathematical expression for the four


multiplications, with $x$ and $y$ representing the inputs and $X$ and $Y$ denoting the multiplication outputs. Equations (4b), (5b), (6b), and (7b) explicitly illustrate how these multiplications are executed through the shift-add method. Fig. 6 portrays the associated hardware-level implementation, comprising solely adders, subtractors, and barrel shifters. Our DCT core does not employ any general multiplier.

Fig. 6. Detailed multiplier-less implementation of (a) scaling operation and (b)–(d) three 2 × 2 matrix multiplications in Fig. 5.

We provide the details of the multiplier-less operation in the following set of equations:

Scalar Multiplication:

$$X = 0.707x \approx \frac{181}{256}x = \frac{5(1 - 16) + 256}{256}x \tag{4a}$$

$$\begin{aligned} X &= b \gg 8 \\ b &= (a - (a \ll 4)) + (x \ll 8) \\ a &= x + (x \ll 2). \end{aligned} \tag{4b}$$

Butterfly Matrix Multiplication (i):

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} 0.9238 & 0.3827 \\ 0.3827 & -0.9238 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \approx \frac{1}{2^{9}} \begin{bmatrix} 473 & 196 \\ 196 & -473 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{5a}$$

$$\begin{aligned} X &= \tfrac{1}{512}(473x + 196y) = d_x \gg 9 \\ Y &= \tfrac{1}{512}(196x - 473y) = d_y \gg 9 \\ d_x &= c_{xy} - (b_x \ll 6) + (y \ll 9) \\ d_y &= (c_{xy} \ll 3) - b_y \\ c_{xy} &= b_x + (a_{xy} \ll 5) \\ b_x &= x - (x \ll 3) + (y \ll 2) \\ b_y &= (a_{xy} \ll 2) + y \\ a_{xy} &= x - (y \ll 1). \end{aligned} \tag{5b}$$

Butterfly Matrix Multiplication (ii):

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} 0.8315 & 0.5556 \\ 0.5556 & -0.8315 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \approx \frac{1}{2^{8}} \begin{bmatrix} 213 & 142 \\ 142 & -213 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{6a}$$

$$\begin{aligned} X &= \tfrac{1}{256}(213x + 142y) = b_x \gg 8 \\ Y &= \tfrac{1}{256}(142x - 213y) = b_y \gg 8 \\ b_x &= -71a_x \\ b_y &= -71a_y \\ a_x &= x - (x \ll 2) - (y \ll 1) \\ a_y &= -(x \ll 1) + y + (y \ll 1). \end{aligned} \tag{6b}$$

Butterfly Matrix Multiplication (iii):

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} 0.9807 & 0.1951 \\ 0.1951 & -0.9807 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \approx \frac{1}{2^{8}} \begin{bmatrix} 251 & 50 \\ 50 & -251 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{7a}$$

$$\begin{aligned} X &= \tfrac{1}{256}(251x + 50y) = c_x \gg 8 \\ Y &= \tfrac{1}{256}(50x - 251y) = c_y \gg 8 \\ c_x &= (x \ll 8) + b_x \\ c_y &= -(y \ll 8) + b_y \\ b_x &= a_x + (a_x \ll 2) \\ b_y &= a_y + (a_y \ll 2) \\ a_x &= -x + ((y + (y \ll 2)) \ll 1) \\ a_y &= ((x + (x \ll 2)) \ll 1) + y. \end{aligned} \tag{7b}$$

After the 8 × 8 pixel block is converted into the frequency domain, the DCT matrix $D$ is next quantized by a designated quantization matrix $Q$. The quantization stage generally performs element-wise division by directly instantiating divider blocks. The user can choose different quantization matrices based on the tradeoff between the image quality and the compression level. The higher the quality level of a $Q$ matrix, the less the image is compressed. In other words, less of the original image's information is discarded, so a higher image quality is obtained when the image is reconstructed. Typically, the quantization matrix with a quality level of 50, $Q_{50}$, is used at this stage to achieve a good decompressed image quality and a decent compression ratio

$$Q_{50} = \begin{bmatrix} 16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\ 12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\ 14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\ 14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\ 18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\ 24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\ 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\ 72 & 92 & 95 & 98 & 112 & 100 & 103 & 99 \end{bmatrix}. \tag{8}$$

If a higher image quality is needed, a quantization matrix with a quality level greater than 50 is required. The required matrix is obtained by multiplying $Q_{50}$ by a scaling factor of (100 − quality level)/50 and then rounding and clipping so that all entry values are integers ranging from 1 to 255. For example, $Q_{90}$, the quantization matrix with a quality level of 90, is given by

$$Q_{90} = \begin{bmatrix} 3 & 2 & 2 & 3 & 5 & 8 & 10 & 12 \\ 2 & 2 & 3 & 4 & 5 & 12 & 12 & 11 \\ 3 & 3 & 3 & 5 & 8 & 11 & 14 & 11 \\ 3 & 3 & 4 & 6 & 10 & 17 & 16 & 12 \\ 4 & 4 & 7 & 11 & 14 & 22 & 21 & 15 \\ 5 & 7 & 11 & 13 & 16 & 12 & 23 & 18 \\ 10 & 13 & 16 & 17 & 21 & 24 & 24 & 21 \\ 14 & 18 & 19 & 20 & 22 & 20 & 20 & 20 \end{bmatrix}. \tag{9}$$
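As a sanity check on the shift-add decompositions, the short Python sketch below evaluates (4b), (5b), and (6b) with integer shifts and compares them against the integer coefficients in (4a), (5a), and (6a). The function names are illustrative and do not come from the paper. The split of the factor 71 into 64 + 8 − 1 is an assumption made here to keep the sketch multiplier-free; the paper does not state how that factor is realized.

```python
def scalar_181(x):
    """(4b): returns b = 181*x built from shifts; the output is X = b >> 8."""
    a = x + (x << 2)                 # a = 5x
    b = (a - (a << 4)) + (x << 8)    # b = 5x - 80x + 256x = 181x
    return b


def butterfly_i(x, y):
    """(5b): returns (dx, dy); the outputs are X = dx >> 9 and Y = dy >> 9."""
    axy = x - (y << 1)
    bx = x - (x << 3) + (y << 2)
    by = (axy << 2) + y
    cxy = bx + (axy << 5)
    dx = cxy - (bx << 6) + (y << 9)
    dy = (cxy << 3) - by
    return dx, dy


def times_71(v):
    """Assumed shift-add form of 71*v (71 = 64 + 8 - 1); not specified in the paper."""
    return (v << 6) + (v << 3) - v


def butterfly_ii(x, y):
    """(6b): returns (bx, by); the outputs are X = bx >> 8 and Y = by >> 8."""
    ax = x - (x << 2) - (y << 1)
    ay = -(x << 1) + y + (y << 1)
    return -times_71(ax), -times_71(ay)


if __name__ == "__main__":
    x, y = 100, -37
    print(scalar_181(x) >> 8)        # integer approximation of 0.707 * x
    print(butterfly_i(x, y))         # numerators of (5a), before the >> 9
    print(butterfly_ii(x, y))        # numerators of (6a), before the >> 8
```

Python's `>>` on negative integers rounds toward negative infinity, which may differ from the rounding behavior of a particular hardware shifter.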

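To make Stages 3 and 4 concrete, the following floating-point reference model builds $T$ from (2), computes the 2-D DCT both directly as in (1) and as two 1-D passes as in (3), and quantizes the result with a quality-scaled matrix. It is a minimal sketch, not the paper's hardware data path: the helper names are hypothetical, and rounding the quotient to the nearest integer is an assumption.

```python
import math

N = 8

# Q50 as given in (8).
Q50 = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def dct_matrix():
    """Transformation matrix T from (2)."""
    return [
        [
            1 / math.sqrt(N) if i == 0
            else math.sqrt(2 / N) * math.cos((2 * j + 1) * i * math.pi / (2 * N))
            for j in range(N)
        ]
        for i in range(N)
    ]


def matmul(a, b):
    """Plain 8 x 8 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)] for i in range(N)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2_direct(m, t):
    """D = T M T^T, as in (1)."""
    return matmul(matmul(t, m), transpose(t))


def dct2_two_pass(m, t):
    """Two rounds of 1-D DCT, as in (3): D = (T (T M)^T)^T."""
    first = matmul(t, m)
    return transpose(matmul(t, transpose(first)))


def scale_q(q50, quality):
    """Matrix for quality > 50: scale by (100 - quality)/50, round, clip to [1, 255]."""
    factor = (100 - quality) / 50
    return [[min(255, max(1, round(q * factor))) for q in row] for row in q50]


def quantize(d, q):
    """C = D ./ Q, rounded to the nearest integer (the rounding mode is an assumption)."""
    return [[round(dv / qv) for dv, qv in zip(dr, qr)] for dr, qr in zip(d, q)]


if __name__ == "__main__":
    t = dct_matrix()
    flat = [[50] * N for _ in range(N)]          # constant block: only the DC term survives
    print(quantize(dct2_two_pass(flat, t), Q50)[0][:4])
    print(scale_q(Q50, 90)[0])
```

For a constant block, all energy lands in the top-left (DC) coefficient, which is why quantization leaves the remaining entries at zero.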

The output of this stage is given by an 8 × 8 matrix $C$, with $C = D\,./\,Q$, where $./$ represents the element-wise division operator. (We note that the entry values of $Q_{90}$ are smaller than those of $Q_{50}$, which implies $Q_{90}$ will result in a $C$ with larger entry values than $Q_{50}$ will. That is, there will be more information for the image's reconstruction, and higher reconstructed image quality can be obtained.)

Finally, the quantized result is compressed into a bit stream using Huffman encoding. The elements of $C$ are concatenated into a 1-D vector based on the zigzag traversal. After concatenating the bits in this traversal as an input bit stream, a binary tree of the most common N-bit groups is created, where a traversal of the encoding tree in the left or right direction maps to a binary bit (0 or 1). This allows an N-bit chunk of zeros (most common) to map to only a single bit after Huffman encoding. After encoding, a compressed bit stream is created and sent to the communication channel.

III. APPROXIMATION TECHNIQUES

To date, the approximate JPEG compression technique has been explored by using only bit truncation [11], dynamic bit width reduction in DCT operation [13], or an approximate adder [12]. However, we observed that the quantization block realized using standard division algorithms consumes high power while occupying a considerable silicon area. For the first time, we explored an approximate quantization block by updating the Q-matrix to enable divisions with bit-shift operations, eliminating the need for high-budget standard division blocks, thereby saving energy and reducing silicon area. Conventional approximation strategies, like loop perforation and precision scaling, are also explored in this work. In addition to the approximate quantization block, we have proposed a heuristic-based approach to select the optimal configuration between loop perforation and precision scaling for a given quality requirement.

A. Approximate Quantization

A common approach to reducing the power of the Q block is to replace the standard division ($A/B$) with the multiplication $A \cdot (1/B)$, using techniques like Taylor series expansion to approximate $(1/B)$ [19], [20], or to reduce the operation width in the division block [21]. These methods, however, require one or more multipliers, which demand relatively higher energy.

By observation, quantization matrix $Q$ can be replaced with an approximated quantization matrix $\tilde{Q}$ by converting each element of the $Q$ matrix to a power of 2 so that the division operation can be implemented via bit shifting. For example, $Q_{50}$ can be approximated as

$$\tilde{Q}_{50} = \begin{bmatrix} 16 & 8 & 8 & 16 & 16 & 32 & 32 & 32 \\ 8 & 8 & 8 & 16 & 16 & 32 & 32 & 32 \\ 8 & 8 & 16 & 16 & 32 & 32 & 64 & 32 \\ 8 & 16 & 16 & 16 & 32 & 64 & 64 & 32 \\ 16 & 16 & 32 & 32 & 64 & 64 & 64 & 64 \\ 16 & 32 & 32 & 32 & 64 & 64 & 64 & 64 \\ 32 & 64 & 64 & 64 & 64 & 64 & 64 & 64 \\ 64 & 64 & 64 & 64 & 64 & 64 & 64 & 64 \end{bmatrix}. \tag{10}$$

Similarly, $Q_{90}$ can be approximated as

$$\tilde{Q}_{90} = \begin{bmatrix} 2 & 2 & 2 & 2 & 4 & 8 & 8 & 8 \\ 2 & 2 & 2 & 4 & 4 & 8 & 8 & 8 \\ 2 & 2 & 2 & 4 & 8 & 8 & 8 & 8 \\ 2 & 2 & 4 & 4 & 8 & 16 & 16 & 8 \\ 4 & 4 & 4 & 8 & 8 & 16 & 16 & 8 \\ 4 & 4 & 8 & 8 & 16 & 8 & 16 & 16 \\ 8 & 8 & 16 & 16 & 16 & 16 & 16 & 16 \\ 8 & 16 & 16 & 16 & 16 & 16 & 16 & 16 \end{bmatrix}. \tag{11}$$

This approach introduces some errors due to approximation but reduces the divider power consumption because of the simplicity of bit shifting. Note that instead of being rounded up, the elements in the original matrix are rounded down to the nearest power of 2 to retain higher image quality. The general mathematical behavior of the proposed approximate technique is described as follows. Given a quantization matrix $Q$, for each element $q_{ij}$ in $Q$, we first locate $q_{ij}$ using an integer $s_{ij}$ which satisfies

$$2^{s_{ij}} \le q_{ij} < 2^{s_{ij}+1}. \tag{12}$$

The corresponding approximated element in $\tilde{Q}$, $\tilde{q}_{ij}$, is then constructed by

$$\tilde{q}_{ij} = 2^{s_{ij}}. \tag{13}$$

If all elements in $Q$ are considered, (13) can be further extended to a matrix form

$$\tilde{Q} = 2\,.^{\wedge}\,S \tag{14}$$

where $.^{\wedge}$ is an element-wise power operator and $S$ is an 8 × 8 matrix with its element $s_{ij}$ representing the exponent part of $q_{ij}$. Since $1 \le q_{ij} \le 255$, we can obtain $0 \le s_{ij} \le 7$ according to (12); therefore, $s_{ij}$ can be represented using only 3 bits. Using this approximate technique, the quantization circuit in the JPEG compression circuit only needs to take 192 bits (3 bits × 64 elements) as the input, while the conventional division-based quantization circuit requires 512 bits (8 bits × 64 elements). At the hardware level, (12) and (13) can be implemented by an 8-to-3 priority encoder [Fig. 7(a)]. For an 8-bit input $q_{ij}[7{:}0]$, an 8-to-3 priority encoder can locate the first set bit from the MSB side and output that location as a 3-bit signal $s_{ij}[2{:}0]$. The quantization is then performed by a bit shifter, with $s_{ij}[2{:}0]$ being the shifting amount. Therefore, our proposed architecture realizes the approximated element-wise division operation through an 8-to-3 priority encoder and a barrel shifter, as shown in Fig. 7(b). Note that at the time of decoding, the same approximated quantization matrix $\tilde{Q}$ must be used for better reconstruction of the image.

B. Precision Scaling

Precision scaling, or bit truncation, alleviates the computational load of image compression by reducing the data width of the input image. Least significant bit (LSB) truncation is realized in this article. The amount of data reduction, the truncation level $B_j$, has to be specified before image compression begins.
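The power-of-two approximation of Section III-A, described by (12) and (13), can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions rather than the paper's RTL: the function names are hypothetical, `int.bit_length` stands in for the 8-to-3 priority encoder, and the arithmetic right shift (which floors negative values) is an assumed rounding behavior.

```python
def priority_encode(q):
    """Model of the 8-to-3 priority encoder: position of the leading 1 of an 8-bit q."""
    if not 1 <= q <= 255:
        raise ValueError("q must be a nonzero 8-bit value")
    return q.bit_length() - 1        # s with 2**s <= q < 2**(s + 1), as in (12)


def shift_matrix(q_matrix):
    """Exponent matrix S: one 3-bit shift amount per entry of Q."""
    return [[priority_encode(q) for q in row] for row in q_matrix]


def approx_matrix(q_matrix):
    """Approximated matrix with entries 2**s, as in (13)."""
    return [[1 << s for s in row] for row in shift_matrix(q_matrix)]


def quantize_by_shift(d_matrix, s_matrix):
    """Approximate D ./ Q with an arithmetic right shift by s (assumed to floor)."""
    return [[d >> s for d, s in zip(d_row, s_row)]
            for d_row, s_row in zip(d_matrix, s_matrix)]


if __name__ == "__main__":
    q_row = [[16, 11, 10, 16, 24, 40, 51, 61]]    # first row of Q50 in (8)
    print(approx_matrix(q_row))                   # [[16, 8, 8, 16, 16, 32, 32, 32]]
    d_row = [[400, -100, 37, 0, 5, -1, 64, 63]]   # made-up integer DCT coefficients
    print(quantize_by_shift(d_row, shift_matrix(q_row)))
```

The printed first row agrees with the first row of (10). Only the 3-bit shift amounts need to be stored, which is the 192-bit versus 512-bit saving noted in the text.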


Algorithm 1: Loop Skipping Logic
Input: the on-trial 8 × 8 block $M_{in}$, the latest-processed block $M_l$ and its JPEG compression result $C_l$, and the error tolerance $\varepsilon$
Output: the new latest-processed block $M_{out}$ and its corresponding result $C_{out}$
for (i = 0; i < 8; i = i + 1) do
    for (j = 0; j < 8; j = j + 1) do
        ceiling = min{$m_{l,ij}$ + $\varepsilon$, 127};
        floor = max{$m_{l,ij}$ − $\varepsilon$, −128};
        if (($m_{in,ij}$ > ceiling) or ($m_{in,ij}$ < floor)) then
            $M_{out}$ = $M_{in}$;
            $C_{out}$ = JpegCompression($M_{in}$);
            return $M_{out}$, $C_{out}$;
$M_{out}$ = $M_l$; $C_{out}$ = $C_l$;
return $M_{out}$, $C_{out}$;

Fig. 7. Hardware implementation of the proposed approximate quantization method. (a) 8-to-3 priority encoder. (b) Proposed element-wise division cell comprises an 8-to-3 priority encoder and a barrel-shifter.
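Algorithm 1 translates almost line for line into software. The sketch below is one possible Python rendering; the function names and the `compress` callback (a stand-in for the JPEG computation core) are hypothetical, while the clipping range of −128 to 127 is taken from the algorithm itself.

```python
def within_tolerance(block_in, block_prev, eps, lo=-128, hi=127):
    """True when every pixel of block_in lies in [floor, ceiling] around block_prev."""
    for row_in, row_prev in zip(block_in, block_prev):
        for m_in, m_prev in zip(row_in, row_prev):
            ceiling = min(m_prev + eps, hi)
            floor = max(m_prev - eps, lo)
            if m_in > ceiling or m_in < floor:
                return False
    return True


def loop_skip(block_in, block_prev, result_prev, eps, compress):
    """Return (latest block, its result, skipped flag), following Algorithm 1."""
    if within_tolerance(block_in, block_prev, eps):
        return block_prev, result_prev, True      # reuse the stored result
    return block_in, compress(block_in), False    # run the compression core


if __name__ == "__main__":
    def fake_compress(block):
        # Placeholder for the JPEG computation core; not part of the paper.
        return "compressed"

    prev = [[10] * 8 for _ in range(8)]
    near = [[12] * 8 for _ in range(8)]
    far = [[12] * 8 for _ in range(8)]
    far[3][4] = 40
    print(loop_skip(near, prev, "previous", 5, fake_compress)[2])   # True
    print(loop_skip(far, prev, "previous", 5, fake_compress)[2])    # False
```

Note that on a skip the stored block is kept as the reference rather than replaced by the incoming one, exactly as the last two lines of Algorithm 1 specify.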

The truncated pixel block, $M_{tr}$, can be described in terms of the original pixel block $M$ and truncation level $B_j$ as

$$M_{tr} = \mathrm{round}\left(\frac{1}{2^{B_j}} M\right). \tag{15}$$

The precision scaling technique allows the DCT and quantization to operate on fewer bits. Note that bit truncation is implemented uniformly throughout the data path of the hardware accelerator. Each data-path component (adders, multipliers, etc.) can be equipped to modify the operational bit-width by introducing bit-wise clock gating in each of them, thereby reducing energy consumption.
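A small integer-only model helps illustrate (15). In the sketch below, `scale_round` implements the rounded scaling of (15) with ties rounded up, which is an assumption because the paper does not specify the tie rule. `truncate_lsb` shows plain LSB dropping for comparison, and `restore` is a hypothetical helper that shifts values back to the original scale so the loss can be inspected.

```python
def truncate_lsb(block, b):
    """Drop the b least significant bits of every pixel (plain LSB truncation)."""
    return [[p >> b for p in row] for row in block]


def scale_round(block, b):
    """(15): round(M / 2**b) using integers only; ties round up (assumed)."""
    if b == 0:
        return [row[:] for row in block]
    half = 1 << (b - 1)
    return [[(p + half) >> b for p in row] for row in block]


def restore(block_tr, b):
    """Shift truncated values back to the original scale for comparison."""
    return [[p << b for p in row] for row in block_tr]


if __name__ == "__main__":
    pixels = [[7, 8, -3, 100]]
    for level in range(0, 5):        # five truncation levels, 0 to 4
        print(level, restore(scale_round(pixels, level), level))
```

Each additional truncation level halves the number of representable values, so the reconstruction error grows roughly in steps of powers of two.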
C. Loop Perforation

Loop perforation, or loop skipping, takes advantage of spatial redundancies in an image. This technique involves bypassing the compression process for the current pixel block if it closely resembles the preceding one. This results in a substantial decrease in the energy expended for computation while preserving an acceptable level of degradation in the encoded image quality.

This work implements the loop skipping function by first asserting the loop skip threshold $L_i = i$ ($i$ is a non-negative integer), representing the error tolerance $\varepsilon$ with $\varepsilon = 5i$. To determine whether a pixel block should be sent into the JPEG compression computation core, the pixel block is compared with the previously processed pixel block by checking if all pixels in the two blocks are within the error tolerance. The exact step-by-step operation is illustrated in Algorithm 1. If a pixel block satisfies the loop skipping criteria, the JPEG compression circuit will disable the computation core and directly output the computation result of the previously processed pixel block.

Fig. 8(a) shows the additional hardware cost incurred to perform loop skipping, where registers for storing the previous block and its corresponding results, a similarity checker, and related selection logic (MUX) are added to the original JPEG core. Fig. 8(b) and (c) provide insight into the data flow and circuit operation under conditions where the similarity of two blocks is detected or not. Fig. 8(b) depicts the situation where no similarity is detected. If two neighboring blocks are not similar enough, the similarity check logic will generate a FALSE signal, writing the 64 pixels of the current block into the previous block register and saving the current block's compressed results for the next cycle's comparison. On the other hand, if the similarity is detected, a TRUE signal will be sent out from the similarity check logic, which disables the JPEG core and outputs the computation results directly from the compressed result registers. This scenario is illustrated in Fig. 8(c).

The overall power savings remain substantial despite the need for additional hardware resources at the circuit level to implement loop perforation, such as the logic determining the similarity of the pixel blocks and registers storing the previously processed pixel block and its corresponding result. This is primarily due to the disabling of the DCT unit, a significantly more power-consuming block. This point will be further discussed in Section V-E.

By increasing the loop skipping levels, higher energy savings are achieved at the expense of degraded image quality. In this work, we have implemented loop skipping in both the software and the hardware, which perform the desired operations and generate the required control signals before the acceleration.

Fig. 8. Hardware realization of loop perforation: (a) additional registers and logic added, (b) data flow when the similarity of neighboring blocks is not detected, and (c) data flow when the similarity of neighboring blocks is detected.

Algorithm 2: To Extract the Q–E Characteristics for Individual Approximation Technique
Input: Set of required image qualities: Q[0 : N − 1]
Output: Set of quality knob configuration: k[0 : N − 1] and energy consumption: E[0 : N − 1] corresponding to Q[0 : N − 1]
for (i = 0; i < N; i = i + 1) do
    m = 0;
    while ((k[m]) ≥ Q[i]) do
        m = m + 1;
    E[i] = E[(k[m − 1])]; k[i] = k[m − 1];
return k, Q, E

Fig. 9. Gradient descent algorithm. (a) Block diagram. (b) 3-D quality versus energy plot.

Fig. 10. (a) and (b) Normalized energy consumption versus SAD degradation bound for (a) bit truncation and (b) loop skipping. (c) and (d) Individual Q–E plots for (c) bit truncation and (d) loop skipping generated using Algorithm 2.

D. Dynamic Selection of Optimum Approximation Technique

To maximize the energy savings from approximate computing, we propose to combine the loop skipping and precision scaling techniques. The combination of both strategies performs better than either standalone scheme.

A gradient descent-based heuristic algorithm determines the optimal approximation degrees of bit truncation and loop skipping using the Quality–Energy (Q–E) plots of individual approximation strategies, as suggested by Fig. 9(a). To quantify the effect of approximation on the quality of the image, SAD is chosen as a performance metric, which is defined as the ratio of the SAD in pixel values between the generated and the reference image to the sum of pixel values in the reference image. Note that %SAD degradation has shown a good correlation to other metrics, such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), and is used for its easy computation in the heuristics and lower overhead. Also, note that the quality of the image is inversely proportional to this metric. However, we use SSIM and PSNR while evaluating the approximation techniques, as those metrics are well accepted in the image processing community.

Algorithm 2 obtains the individual Q–E plots that provide insights into the degradation of the quality of the decoded images for benefits in relative energy for different approximation scenarios. For a particular output image quality bound, the quality knobs $B_0$–$B_4$ and $L_0$–$L_6$, along with the relative energy savings, are found using the online image gallery of the Computer Vision Group of the University of Granada [22]. Fig. 10 shows the extracted plots for loop perforation and precision scaling. Algorithm 3 employs a gradient descent-based optimization search using these extracted plots to provide an overall Quality versus Energy for the combined strategy. We vary the loop perforation categories ($L_i$) and bit truncation levels ($B_j$) to obtain the optimum settings for a particular output quality bound ($Q_A$). In other words, we are searching for the optimal solution $(\hat{B}_i, \hat{L}_i)$ such that

$$\begin{aligned} \text{minimize} \quad & E(B_i, L_i) \\ \text{subject to} \quad & Q(B_i, L_i) \le Q_A \\ & Q(B_i, L_i) \ge 0 \\ & B_i \ge 0,\ L_i \ge 0 \end{aligned} \tag{16}$$
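The %SAD metric and the constrained search in (16) can be illustrated with a short Python sketch. It is a simplified greedy variant written for this illustration, not a transcription of the paper's Algorithm 3. It assumes that the degradations of the two knobs add and that their relative energies multiply, and neither assumption is stated in the paper. The curve values in the usage example are made up, and all function names are hypothetical.

```python
def sad_percent(reference, test):
    """%SAD: sum of absolute pixel differences over the sum of reference pixels."""
    num = sum(abs(r - t) for rr, tr in zip(reference, test) for r, t in zip(rr, tr))
    den = sum(r for rr in reference for r in rr)
    return 100.0 * num / den


def greedy_knobs(q_bound, trunc_curve, skip_curve):
    """Greedy search for (16).

    Each curve is a list of (degradation_percent, relative_energy) pairs indexed
    by knob level, with level 0 equal to (0.0, 1.0).
    Returns (truncation level, skip level, total degradation, relative energy).
    """
    def total(j, i):
        q = trunc_curve[j][0] + skip_curve[i][0]   # assumption: degradations add
        e = trunc_curve[j][1] * skip_curve[i][1]   # assumption: energies multiply
        return q, e

    j = i = 0
    q, e = total(j, i)
    while True:
        best = None
        for nj, ni in ((j + 1, i), (j, i + 1)):
            if nj >= len(trunc_curve) or ni >= len(skip_curve):
                continue
            nq, ne = total(nj, ni)
            if nq > q_bound or ne >= e:
                continue                            # infeasible or no energy gain
            gain = (e - ne) / max(nq - q, 1e-12)    # energy saved per unit degradation
            if best is None or gain > best[0]:
                best = (gain, nj, ni, nq, ne)
        if best is None:
            return j, i, q, e
        _, j, i, q, e = best


if __name__ == "__main__":
    # Made-up Q-E curves, for illustration only.
    trunc = [(0.0, 1.0), (0.5, 0.9), (1.5, 0.8), (4.0, 0.7)]
    skip = [(0.0, 1.0), (0.4, 0.85), (1.2, 0.75), (3.0, 0.7)]
    print(greedy_knobs(2.0, trunc, skip)[:2])       # (1, 2)
```

At each step the search takes the single knob increment that saves the most energy per unit of added degradation and stops when neither increment stays within the quality bound.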


where E(Bi, Li) and Q(Bi, Li) represent the relative energy and %SAD degradation under particular Bi and Li, respectively. The convexity of the problem can be justified as follows: First, Q(Bi, Li) is monotonically increasing in each dimension because higher bit truncation or loop skipping levels result in higher %SAD degradation. Second, since the relative energy and quality degradation are two inversely related variables, minimizing E(Bi, Li) is equivalent to maximizing Q(Bi, Li). With these two properties, (16) can be treated as a maximization problem constrained in the first octant of the 3-D space spanned by Bi, Li, and Q(Bi, Li), where the objective function is monotonically increasing in the direction of Bi and Li. As a result, the convexity is assured, and it is feasible to apply a gradient descent algorithm to find the optimum.

The controller implementing this heuristic, realized in software code, automatically configures the degree of loop perforation and bit truncation, Li and Bj, respectively, by moving in the direction of the steepest gradient of the ratio of energy savings to quality degradation resulting from the variation in each degree of the approximation knobs. Fig. 9(b) shows the 3-D Q–E plot for different Li and Bj, along with the relative energy required for processing. For a specified quality degradation bound, our proposed gradient descent algorithm uses the 3-D plot and, following Algorithm 3, selects the (Bi, Lj) combination that results in the lowest energy configuration, as indicated by the color bar on the right.

Algorithm 3: Gradient Descent Determines the Optimal Approximation Degrees for a Given Quality Bound
Input: Output quality bound: QA,
       Quality vs. Energy (Q-E) curves for loop perforation and precision scaling, (Ql-El) and (Qt-Et), respectively
Output: Optimal approximation knob settings (i, j) according to QA
Initialize: i = j = 0, Q = 1, E = 1
while (Q ≥ QA) do
    ΔEl = El[i] − El[i + 1]; ΔQl = Q − Ql[i + 1];
    ΔEt = Et[j] − Et[j + 1]; ΔQt = Q − Qt[j + 1];
    if (ΔQl/ΔEl ≥ ΔQt/ΔEt) then
        if (Qt[j + 1] ≤ QA) then
            E = E − ΔEt; Q = Q − ΔQt; j = j + 1;
        else if (Ql[i + 1] ≤ QA) then
            E = E − ΔEl; Q = Q − ΔQl; i = i + 1;
    else
        if (Ql[i + 1] ≤ QA) then
            E = E − ΔEl; Q = Q − ΔQl; i = i + 1;
        else if (Qt[j + 1] ≤ QA) then
            E = E − ΔEt; Q = Q − ΔQt; j = j + 1;
return i, j;

IV. SIMULATION METHODOLOGY
Fig. 11 depicts the simulation setup used for testing the functionality of the hardware. Our simulation is conducted over an image dataset from the gallery of the Computer Vision Group of the University of Granada [22], where many representative images in the field of image processing, such as Baboon, Boat, Barbara, Pirate, Bridge, and Airplane, are included. The input image is divided into 8 × 8 blocks in MATLAB before being fed to the design under test (DUT). The MATLAB code runs the gradient descent-based heuristic algorithm to estimate the optimal approximation knobs for a particular input quality bound (SAD). The software also decides the degree of precision scaling to be realized and configures the hardware by clock gating the required bits throughout the accelerator. A Verilog test bench is used to convey the appropriate degree of truncation and the preprocessed image (in a text file) for simulation. The accelerator performs the JPEG encoding on the input image and writes the output processed image in a separate text file, which is then reconstructed through inverse quantization and inverse DCT in the software. Inverse quantization is an operation of element-wise multiplication, which is given by

$$ R = C \odot Q \tag{17} $$

where $\odot$ is the element-wise multiplication operator, and R is the result of inverse quantization.

Fig. 11. Simulation setup.

In general, the Q used in this step should be the same one as used in the encoding process. However, to give a more comprehensive analysis of the effect of introducing an approximated quantization matrix, we also analyze the case where the approximated matrix is used in encoding while the unmodified Q is still used for inverse quantization. Inverse DCT is given by the transformation of

$$ N = T^{T} R\, T \tag{18} $$

where N is the result of inverse DCT and T is given by (2). The reconstructed image can be obtained after rounding all entries of N. In the end, performance evaluation and quality assessment are conducted based on the reconstructed results.

To validate the accelerator's functionality, we use in-built MATLAB-based JPEG compression and compare it with hardware output. The JPEG register-transfer level (RTL) is synthesized using Synopsys Design Compiler, mapped to TSMC 65nm standard cell library. The functionality of the extracted netlist is revalidated. All the RTL simulations are performed using the Cadence NC-Verilog simulator. The design area and power values are provided from the post-synthesis simulation results. Note that results from SPICE simulations (done in Cadence Virtuoso) are utilized as they provide precise energy numbers.

TABLE I
STATISTICS OF FIG. 12: IMAGE QUALITY UNDER DIFFERENT QUANTIZATION SCHEMES
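The software reconstruction path of (17) and (18) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the MATLAB flow used in this work: it assumes that T in (2) is the standard orthonormal 8-point DCT-II matrix, it assumes the forward transform D = T M T^T, it omits any level shift, and the helper names (dct_matrix, encode, dequantize, inverse_dct) are hypothetical. The small example also mimics the situation where the encoder and the decoder use different DC entries (2 and 3, the two Q90(0,0) values discussed for the approximated and standard matrices).

```python
import math

N = 8  # JPEG block size


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix (assumed here to correspond to T in (2))."""
    t = []
    for i in range(n):
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        t.append([scale * math.cos((2 * j + 1) * i * math.pi / (2 * n))
                  for j in range(n)])
    return t


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def encode(block, q, t):
    """Assumed forward path: D = T M T^T, then C = round(D / Q) element-wise."""
    d = matmul(matmul(t, block), transpose(t))
    return [[round(d[i][j] / q[i][j]) for j in range(N)] for i in range(N)]


def dequantize(c, q):
    """Eq. (17): R = C (element-wise product) Q."""
    return [[c[i][j] * q[i][j] for j in range(N)] for i in range(N)]


def inverse_dct(r, t):
    """Eq. (18): N = T^T R T, followed by rounding every entry."""
    n = matmul(matmul(transpose(t), r), t)
    return [[round(v) for v in row] for row in n]


# Illustrative data: a flat block and quantization matrices that differ
# only in the DC entry (2 for encoding, 3 for decoding).
t = dct_matrix()
flat = [[100] * N for _ in range(N)]
q_enc = [[1] * N for _ in range(N)]
q_enc[0][0] = 2
q_dec = [[1] * N for _ in range(N)]
q_dec[0][0] = 3

c = encode(flat, q_enc, t)
matched = inverse_dct(dequantize(c, q_enc), t)      # same matrix on both ends
mismatched = inverse_dct(dequantize(c, q_dec), t)   # DC scaled by 3/2
print(matched[0][0], mismatched[0][0])
```

With matching matrices the flat block is recovered exactly, whereas the mismatched DC entry scales the block mean by 3/2, which illustrates why a DC mismatch is visually costly.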

V. RESULTS
In this section, we first discuss the effect of individual approximation techniques on the overall performance of JPEG compression hardware. The performance of the combined approximation strategy that dynamically tunes the configuration of the constituent techniques is also shown. Note that system-level hardware performances like power and area are not reported in previous related literature, such as [23] (which applies bit-truncation and loop perforation in JPEG compression) or [24] (which works on the approximation of DCT hardware). However, this work provides the power and area numbers for the optimal design, including the DCT, approximate quantization block, and loop-skipping circuitry from the post-synthesis SPICE simulations. The quality of the images is evaluated in terms of SSIM, PSNR, and SAD as discussed earlier in Section III.

Fig. 12. Effect of quantization block approximation on image quality: (a) and (b) SSIM and PSNR comparison for Q50, and (c) and (d) SSIM and PSNR comparison for Q90.
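Two of the quality metrics used in this section can be sketched as follows. This is a minimal illustration, assuming 8-bit grayscale images stored as lists of rows; the function names sad_percent and psnr_db are hypothetical. %SAD follows the definition given in Section III (sum of absolute pixel differences divided by the sum of reference pixel values), and PSNR uses its standard definition. SSIM is omitted because it requires windowed local statistics.

```python
import math


def sad_percent(reference, generated):
    """%SAD: sum of absolute pixel differences over the sum of reference pixels."""
    diff = sum(abs(r - g)
               for ref_row, gen_row in zip(reference, generated)
               for r, g in zip(ref_row, gen_row))
    total = sum(sum(row) for row in reference)
    return 100.0 * diff / total


def psnr_db(reference, generated, peak=255):
    """Standard PSNR in dB for images with the given peak value."""
    count = sum(len(row) for row in reference)
    mse = sum((r - g) ** 2
              for ref_row, gen_row in zip(reference, generated)
              for r, g in zip(ref_row, gen_row)) / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Illustrative 2 x 2 images (not data from this work).
ref = [[100, 100], [100, 100]]
gen = [[98, 102], [100, 100]]
print(sad_percent(ref, gen), psnr_db(ref, gen))
```

A lower %SAD and a higher PSNR both indicate a reconstruction closer to the reference, which is why the image quality is inversely related to the %SAD metric.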

A. Approximate Quantization
1) Case I: Images Reconstructed by the Corresponding Encoding Quantization Matrix (10), (11): The scatter plots from Fig. 12 and the corresponding statistics in Table I together show the change in the SSIM and PSNR of the reconstructed images because of the use of an approximation matrix. It is observed that the reconstructed images encoded using bit-shifting-based quantization demonstrate better quality (higher SSIM and higher PSNR) than the reconstructed images encoded using standard division-based quantization for quality levels 50 and 90. This is because every element in the quantization matrix is rounded down to the nearest power of 2, resulting in a new quantization matrix whose quality factor is higher than the original's. This result, however, entails a reduction in the compression ratio of the compressed image. Fig. 13 shows the effect of the approximated quantization method on compression ratio, where compression ratio declines of 20.1% and 13.4% are observed when approximated Q50 and Q90 are adopted, respectively. Detailed statistics of Fig. 13 are provided in Table II. Although there are reductions in compression ratio, the advantages of using approximated quantization matrices are evident once the hardware-level implementation is considered. Fig. 14 highlights the benefits in power and area of the proposed quantization scheme, which achieves 85% reduction in area and 94% power savings over the conventional division-based quantization block.

Fig. 13. Effect of quantization block approximation on compression ratio: (a) on Q50 and (b) on Q90.

2) Case II: Images Reconstructed by the Unmodified Quantization Matrix (8), (9): One should decode the compressed image with the same quantization matrix used in the encoding stage for better image quality when reconstructing an image. However, since the approximate quantization circuit is only applied in the encoding end in this work, we assume that the decoding end may still use the unmodified quantization matrix to reconstruct images. The discussion about using different quantization matrices for encoding and decoding is


thus presented in Fig. 15, which provides the image quality comparison between images reconstructed with the modified (approximated) Q matrix and with the original (standard) one for Q50 and Q90. Using the standard matrix Q50 to reconstruct the image encoded by the approximated matrix Q50 results in 4.94% SSIM degradation, as shown in Fig. 15(a). However, in the case of Q90, using the standard Q90 to reconstruct images induces more SSIM degradation than using the standard Q50, as shown in Fig. 15(b), where 22.35% SSIM degradation is presented.

TABLE II
STATISTICS OF FIG. 13: COMPRESSION RATIO UNDER DIFFERENT QUANTIZATION SCHEMES

Fig. 14. Hardware-level area and power improvement based on approximate quantization: bit-shifting-based quantization versus division-based quantization.

Fig. 15. Comparison of reconstructed image quality between the image decoded by the standard Q and by the modified Q for (a) Q50 and (b) Q90.

This result stems from the fact that the first entry in Q50 is originally a power of 2 (Q50(0,0) = 16), which is not the case for Q90 (Q90(0,0) = 3). For the scenario where the approximated Q90 is used in encoding while the standard Q90 is used in decoding, different divisors for the first entry of the DCT coefficient block (also known as the DC coefficient) are used in the quantization (approximated Q90(0,0) = 2) and inverse quantization (standard Q90(0,0) = 3). This discrepancy corrupts the reconstructed value of the DC coefficient at the step of inverse quantization. Since for still images, most of the energy is located in the low-frequency area [25], the corrupted DC coefficient will then result in severe image degradation after inverse DCT is performed. Note that this problem could be addressed by keeping the first entry in the standard Q90 matrix unmodified and designing a multiplier-less divider. For example, the element-wise division for the first entry of Q90 (quantizing a number by 3) can be approximately implemented as 1/4 + 1/8 − 1/16 + 1/32 (= 0.34375, close to 1/3). Thus, this can be implemented just by addition and subtraction along with bit-shift operations.

Lastly, we present subjective analysis on three images (Baboon, Pirate, and Boat) selected from the image dataset. Fig. 16 shows the reconstructed images using different quantization methods and levels.

Fig. 16. Reconstructed images using approximate division block versus standard division block with two different quality levels 50 and 90.

B. Precision Scaling
Fig. 17. Effect of precision scaling: (a) SSIM versus precision scaling level and (b) PSNR versus precision scaling level.

Fig. 17 depicts the effect of precision scaling on the reconstructed image quality. Detailed statistics of the quality degradation are reported in Table III. With one-bit truncation, the quality of the reconstructed images degrades slightly; only 4.82% SSIM degradation and 7.05% PSNR degradation are observed according to the simulation results over the dataset. The hardware benefit resulting from the bit truncation technique is illustrated in Fig. 18. The JPEG compression circuit with 1-bit truncation consumes around 30% less power and area than the circuit with no data bit width modification. Fig. 19 shows the subjective analysis on three selected images, Baboon, Pirate, and Boat, under different truncation


levels (0–4). The quality of the reconstructed images degrades significantly under bit truncation levels 3 and 4. It can be seen from the images in the last two rows of Fig. 19 that they are heavily blurred compared to the original ones.

TABLE III
STATISTICS OF FIG. 17: IMAGE QUALITY UNDER DIFFERENT BIT TRUNCATION LEVELS

Fig. 18. Hardware-level area and power improvement based on different precision scaling levels.

Fig. 19. Reconstructed images for different precision scaling (bit truncation) levels.

C. Loop Perforation
Fig. 20(a) and Table IV present the simulation results of how much energy is saved based on a particular loop-skipping level, and Fig. 20(b) shows the correlation coefficient between the energy saved and the image homogeneity under different loop-skipping levels. (Homogeneity is one of the Haralick textural features developed in [26], which is an indicator to describe pixel discrepancy in an image. Generally, the higher the homogeneity is, the more similar the neighboring pixels in an image are.) For any loop-skipping level beyond L0, a correlation coefficient larger than or close to 0.7 is observed, which implies the energy saved and the homogeneity are positively and highly correlated. Therefore, we argue that the proposed loop-skipping technique is an effective approximation scheme for signals exhibiting high spatial locality, like images, video, etc.

Fig. 20. Effect of different loop-skipping levels on (a) energy saved and (b) correlation coefficient between the energy saved and homogeneity.

TABLE IV
STATISTICS OF FIG. 20: ENERGY SAVED UNDER DIFFERENT LOOP SKIPPING LEVELS

The tradeoff of the energy saved by loop skipping versus the quality degradation of reconstructed images is outlined in Fig. 21. It plots the relation between relative energy and image quality degradation for (a) SAD and (b) SSIM. The result shows that 30% energy can be saved with roughly 2% SAD and 10% SSIM degradation at loop-skipping level L3.

In the end, subjective analysis of loop perforation is shown in Fig. 22, which displays the reconstructed images of Baboon, Pirate, and Boat for different loop skipping levels.

D. DCT
Several multiplier-less DCT architectures have been proposed in recent years. For example, Wang et al. [27]


proposed an efficient 1D-DCT structure that approximated the coefficients of the DCT transform matrix T as powers of 2 and exploited adjacent pixel correlation to reduce hardware complexity. Similarly, the idea of approximating the coefficients of T as powers of 2 was applied in [28] to realize an approximate integer DCT scheme for HEVC. Other work, such as [29], utilized approximate adders to build an energy-efficient DCT. Kaushal et al. [30] proposed a 25-coefficient DCT architecture by exploiting pixels' correlation and relative significance of DCT coefficients.

Fig. 21. Effect of different loop-skipping levels: (a) relative energy versus SAD degradation and (b) relative energy versus SSIM degradation.

Fig. 22. Effect of loop skipping: reconstructed images for different loop skipping levels.

We compare the hardware simulation results of this work's DCT scheme, which uses the techniques of canonic signed digit (CSD) and common subexpression elimination (CSE) in [18], with other existing 8-point multiplier-less DCT works in Table V. The 1D-DCT structure utilized in this article is described in Verilog and synthesized by Synopsys Design Compiler with TSMC 65nm process. The circuit's power performance is measured by Cadence Spectre Simulator, where 256 8-by-1 column vectors of random pixels are sent as inputs to the circuit, and the average power consumption is measured under 1-V supply and 100-MHz clock. The final result shows that our 1D-DCT structure consumes 16.4 pJ, which is better than most of the existing works.

TABLE V
COMPARISON WITH EXISTING MULTIPLIER-LESS DCT DESIGNS

E. Results From Final Architecture: Quality, Area, and Energy
The final JPEG compression circuit is synthesized using the Synopsys Design Compiler tool and mapped to TSMC 65nm library. The synthesis result shows that the final architecture (with the proposed bit-shifting-based quantization block and the loop skipping function) occupies a cell area of 109 470 µm², which is 28% less than the baseline (with conventional division-based quantization block and without the loop skipping function) design (151 617 µm²).

The area reduction comes mainly from improving the quantization step through the bit-shift-based division, even though, at the same time, bit-truncation and loop skipping induce some area overhead. The baseline quantization blocks, comprising synthesis tool-generated dividers, consume 47% (71 285 µm²) of the entire area; however, the optimized quantization takes only about 10% (11 013 µm²) of the total area. This area saving, 71 285 − 11 013 = 60 272 µm², outweighs the additional area resulting from the logic and registers used for bit-truncation and loop skipping, which takes (109 470 − 11 013) − (151 617 − 71 285) = 98 457 − 80 332 = 18 125 µm². Our proposed methods combined, therefore, give an area saving of 42 147 µm². Fig. 23(a) shows the area comparison of the baseline and proposed design, where the red bars represent


the area overhead of the quantization blocks and the blue bars represent the nonquantization part.

Fig. 23. Comparison between the baseline and proposed design. (a) Area breakdown. (b) Power consumption.

According to the simulation of Cadence Spectre Simulator, at typical/typical (TT) corner, 25 °C, with a supply of 1 V, the baseline design (using bit truncation level 1 and loop skipping level 2) consumes an average current of 6.75 mA at 100 MHz at the expense of 2% SAD degradation. On the other hand, the proposed architecture consumes an average current of 4.35 mA under the same simulation condition. Therefore, 36% energy is saved from our proposed approach, as shown in Fig. 23(b). The proposed architecture equivalently dissipates a power of 15 µW in the DCT and quantization stages to generate a throughput of 480-p colored image @ 6 frames/s. This is 10× better than the analog solution [32] (which utilizes passive elements, i.e., switched capacitors, to save power) and 6× better than current state-of-the-art [33] (which operates at near-threshold to reduce power).

VI. DISCUSSION
This work focuses on accelerating the DCT and quantization steps in JPEG compression. However, another widely used image compression scheme, JPEG2000, merits discussion due to its superior compression efficiency and image quality.

1) JPEG Versus JPEG2000: Although the transform step in JPEG2000, which uses discrete wavelet transform (DWT), is simpler than the DCT in JPEG, the overall structure of JPEG2000 is more complex [34], [35], [36], [37], [38], [39]. This complexity arises from more refined quantization steps involving floating-point division and advanced coding schemes like embedded block coding with optimized truncation (EBCOT) [40]. Consequently, JPEG2000 is not well suited for applications in edge devices.

2) DCT Versus DWT: The 2D-DCT is more complex than the 2D-DWT for the transformation step. Nonetheless, the multiplier-less approach adopted in this design significantly reduces the hardware implementation cost, making it competitive with the 2D-DWT in terms of hardware resources. The transform block in the proposed design performs the 2D-DCT using only shift, addition, and subtraction operations; no multipliers are used. This approach is highly similar to the implementation of FIR-based 2D-DWT.

3) Possible Future Work for Approximate JPEG2000: The approximate techniques proposed in this article (multiplier-less transformation, approximate quantization, bit truncation, and loop perforation) can be applied not only to JPEG but also to JPEG2000. For instance, 2D-DWT can be implemented using only shift, addition, and subtraction operations, as it is essentially an FIR filter. Quantization in JPEG2000 can also be implemented through shifting logic. Additionally, since JPEG2000 compresses images in small tiles, the concept of loop skipping can be utilized. If two neighboring tiles are sufficiently similar, the compression unit can be disabled, and the results from the last processed tile can be reused. Our future work can include detailed approximated hardware analysis and implementation and related encoded/decoded image analysis for JPEG2000.

VII. CONCLUSION
This work demonstrates a synthesizable multiplier-less JPEG accelerator equipped with approximations both in software and RTL in the form of modified quantization block, precision scaling, and loop perforation, trading off the quality of the image with energy and area reduction. With a gradient descent-based heuristic, the accelerator's performance can be tuned to maximize energy savings while meeting the image quality constraints. The proposed architecture with the combined approximation strategies achieves 36% reduction in energy consumption at the expense of 2% SAD quality degradation in the image, which lies within acceptable limits for any image processing applications. Moreover, it consumes 15 µW at the DCT and quantization stages to compress a colored 480-p image at 6 frames/s, which is 10× better than the previous literature.

REFERENCES
[1] C. Zhu, H. Zhang, and Y. Tang, “Lossless image compression algorithm based on long short-term memory neural network,” in Proc. 5th ICCIA, 2020, pp. 82–88.
[2] L. Liang and D. Shujun, “Study on JPEG2000 optimized compression algorithm for remote sensing image,” in Proc. NSWCTC, 2009, pp. 771–775.
[3] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image compression standard,” IEEE Signal Process. Mag., vol. 18, no. 5, pp. 36–58, Sep. 2001.
[4] C. Christopoulos, A. Skodras, and T. Ebrahimi, “The JPEG2000 still image coding system: An overview,” IEEE Trans. Consum. Electron., vol. 46, no. 4, pp. 1103–1127, Nov. 2000.
[5] F. Ebrahimi, M. Chamik, and S. Winkler, “JPEG vs. JPEG 2000: An objective comparison of image encoding quality,” in Proc. 28th Appl. Digit. Image Process., 2004, pp. 300–308.
[6] E. Allen, S. Triantaphillidou, and R. Jacobson, “Image quality comparison between JPEG and JPEG2000. I. Psychophysical investigation,” J. Imag. Sci. Technol., vol. 51, no. 3, pp. 248–258, 2007.
[7] S. Bouguezel, M. Omair Ahmad, and M. N. S. Swamy, “Low-complexity 8 × 8 transform for image compression,” Electron. Lett., vol. 44, no. 21, pp. 1249–1250, 2008.
[8] S. Bouguezel, M. Omair Ahmad, and M. N. S. Swamy, “A fast 8 × 8 transform for image compression,” in Proc. Int. Conf. Microelectron. (ICM), 2009, pp. 74–77.
[9] S. Bouguezel, M. Omair Ahmad, and M. N. S. Swamy, “A low-complexity parametric transform for image compression,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2011, pp. 2145–2148.
[10] N. Brahimi, T. Bouden, T. Brahimi, and L. Boubchir, “A novel and efficient 8-point DCT approximation for image compression,” Multimedia Tools Appl., vol. 79, pp. 7615–7631, Mar. 2020.
[11] F. S. Snigdha, D. Sengupta, J. Hu, and S. S. Sapatnekar, “Optimal design of JPEG hardware under the approximate computing paradigm,” in Proc. 53rd ACM/EDAC/IEEE DAC, 2016, pp. 1–6.
[12] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.


[13] J. Park, J. H. Choi, and K. Roy, “Dynamic bit-width adaptation in DCT: An approach to trade off image quality and computation energy,” IEEE Trans. Very Large Scale Integr. Syst., vol. 18, no. 5, pp. 787–793, May 2010.
[14] H. A. F. Almurib, T. Nandha Kumar, and F. Lombardi, “Approximate DCT image compression using inexact computing,” IEEE Trans. Comput., vol. 67, no. 2, pp. 149–159, Feb. 2018.
[15] M. Imani, R. Garcia, A. Huang, and T. Rosing, “CADE: Configurable approximate divider for energy efficiency,” in Proc. DATE, 2019, pp. 586–589.
[16] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Trans. Consum. Electron., vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992.
[17] W.-H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[18] B.-I. Kim and S. G. Ziavras, “Low-power multiplierless DCT for image/video coders,” in Proc. IEEE 13th Int. Symp. Consum. Electron., 2009, pp. 133–136.
[19] J. Melchert, S. Behroozi, J. Li, and Y. Kim, “SAADI-EC: A quality-configurable approximate divider for energy efficiency,” IEEE Trans. Very Large Scale Integr. Syst., vol. 27, no. 11, pp. 2680–2692, Nov. 2019.
[20] S. Hashemi, R. I. Bahar, and S. Reda, “A low-power dynamic divider for approximate applications,” in Proc. 53rd ACM/EDAC/IEEE DAC, 2016, pp. 1–6.
[21] S. Vahdat, M. Kamal, A. Afzali-Kusha, M. Pedram, and Z. Navabi, “TruncApp: A truncation-based approximate divider for energy efficient DSP applications,” in Proc. DATE, 2017, pp. 1635–1638.
[22] “University of Granada test images.” Accessed: 31 Jan. 2024. [Online]. Available: https://ccia.ugr.es/cvg/index2.php
[23] S. Barone, M. Traiola, M. Barbareschi, and A. Bosio, “Multi-objective application-driven approximate design method,” IEEE Access, vol. 9, pp. 86975–86993, 2021.
[24] M. Barbareschi, S. Barone, A. Bosio, J. Han, and M. Traiola, “A genetic-algorithm-based approach to the design of DCT hardware accelerators,” ACM J. Emerg. Technol. Comput. Syst., vol. 18, no. 3, pp. 1–25, 2022.
[25] B. Furht, Discrete Cosine Transform (DCT). Boston, MA, USA: Springer, 2008, pp. 186–188.
[26] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. Syst., Man, Cybernet., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973.
[27] X. Wang, K. Chen, C. Wang, and W. Liu, “An energy-efficient approximate DCT design for image processing,” in Proc. IEEE 15th Int. Conf. (ASIC) (ASICON), 2023, pp. 1–4.
[28] S. Skandha Deepsita, K. Divya, and S. Noor Mahammad, “Energy efficient and multiplierless approximate integer DCT implementation for HEVC,” in Proc. IFIP/IEEE 29th Int. Conf. Very Large Scale Integr. (VLSI-SoC), 2021, pp. 1–6.
[29] Y. Xing, Z. Zhang, Y. Qian, Q. Li, and Y. He, “An energy-efficient approximate DCT for wireless capsule endoscopy application,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2018, pp. 1–4.
[30] V. Kaushal, B. Garg, A. Jaiswal, and G. K. Sharma, “Energy aware computation driven approximate DCT architecture for image processing,” in Proc. 28th Int. Conf. VLSI Design, 2015, pp. 357–362.
[31] A. Darji and R. P. Makwana, “High-performance multiplierless DCT architecture for HEVC,” in Proc. 19th Int. Symp. VLSI Design Test, 2015, pp. 1–5.
[32] K. Gaurav Kumar, G. Barik, B. Chatterjee, S. Bose, S. Maity, and S. Sen, “A 65 nm 2.02 mw 50 mbps direct analog to MJPEG converter for video sensor nodes using low-noise switched capacitor MAC-Quantizer with automatic calibration and sparsity-aware ADC,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2023, pp. 1–2.
[33] N. Reynders and W. Dehaene, “27.3 a 210mv 5MHz variation-resilient near-threshold JPEG encoder in 40nm CMOS,” in Proc. ISSCC, 2014, pp. 456–457.
[34] C.-J. Lian, K.-F. Chen, H.-H. Chen, and L.-G. Chen, “Analysis and architecture design of block-coding engine for EBCOT in JPEG 2000,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 3, pp. 219–230, Mar. 2003.
[35] M. D. Adams, “The jpeg-2000 still image compression standard,” document ISO/IEC JTC1/SC29/WG1N2412, Int. Org. Stand., Geneva, Switzerland, Dec. 2002.
[36] L.-G. Chen, C.-J. Lian, K.-F. Chen, and H.-H. Chen, “Analysis and architecture design of JPEG2000,” in Proc. ICME, 2001, pp. 210–213.
[37] Y. Meng, L. Liu, L. Zhang, and Z. Wang, “Design methodology of low power JPEG2000 codec exploiting dual voltage scaling,” in Proc. 6th Int. Conf. ASIC, 2005, pp. 183–186.
[38] A. M. Reza, “JPEG2000 hardware implementation procedures and issues,” WSEAS Trans. Circuits Syst., vol. 12, no. 4, pp. 101–117, 2013.
[39] L. Liu, N. Chen, H. Meng, L. Zhang, Z. Wang, and H. Chen, “A VLSI architecture of JPEG2000 encoder,” IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 2032–2040, Nov. 2004.
[40] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Process., vol. 9, pp. 1158–1170, 2000.

Ming-Che Li (Graduate Student Member, IEEE) was born in Yilan, Taiwan, in 1999. He received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2021. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Purdue University, West Lafayette, IN, USA.
His current research interests include hardware security, approximate computing in image and video compression, and stochastic computing.

Archisman Ghosh (Student Member, IEEE) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2017. He is currently pursuing the Ph.D. degree with Purdue University, West Lafayette, IN, USA.
He is currently a Bilsland Dissertation Fellow with Purdue University. His research interests include digital SoC design and hardware security. Prior to his Ph.D., he worked with Samsung Semiconductor India R&D, Bengaluru, India, for two years. He has interned with Intel Labs, Hillsboro, OR, USA.
Mr. Ghosh is one of the recipients of the prestigious IEEE SSCS Pre-Doctoral Achievement Award 2022. He was a recipient of the prestigious ECE Meissner Fellowship from Purdue University in 2019–2020 as an incoming graduate student.

Shreyas Sen (Senior Member, IEEE) received the Ph.D. degree from the School of Electrical and Computer Engineering (ECE), Georgia Institute of Technology (Georgia Tech), Atlanta, GA, USA, in 2011.
He is an Elmore Associate Professor of ECE and BME with Purdue University, West Lafayette, IN, USA, where he serves as the Director of the Center for Internet of Bodies. He has authored/co-authored three book chapters, over 200 journal and conference papers and has 25 patents granted/pending. His current research interests span mixed-signal circuits/systems and electromagnetics for the Internet of Bodies and hardware security.
Dr. Sen is a recipient of the NSF CAREER Award 2020, the AFOSR Young Investigator Award 2016, the NSF CISE CRII Award 2017, the Intel Outstanding Researcher Award 2020, the Google Faculty Research Award 2017, the Purdue CoE Early Career Research Award 2021, the Intel Labs Quality Award 2012 for industry-wide impact on USB-C type, the Intel Ph.D. Fellowship 2010, the IEEE Microwave Fellowship 2008, the GSRC Margarida Jacome Best Research Award 2007, and nine best paper awards, including IEEE CICC 2019 and 2021 and IEEE HOST 2017–2020, for four consecutive years. He is the inventor of the Electro-Quasistatic Human Body Communication, or Body as a Wire technology, for which he is the recipient of the MIT Technology Review top-10 Indian Inventor Worldwide Under 35 (MIT TR35 India) Award in 2018 and Georgia Tech 40 Under 40 Award in 2022. To commercialize this invention, he founded Ixana and serves as the Chairman and the CTO and led Ixana to awards, such as 2 × CES Innovation Award 2024, EE Times Silicon 100, and Indiana Startup of the Year Mira Award 2023. His work has been covered by 250+ news releases worldwide, invited appearance on TEDx Indianapolis, NASDAQ live Trade Talks at CES 2023, Indian National Television CNBC TV18 Young Turks Program, NPR subsidiary Lakeshore Public Radio, and the CyberWire podcast. His work was chosen as one of the top-10 papers in the Hardware Security field (TopPicks 2019). He serves/has served as an Associate Editor for IEEE SOLID-STATE CIRCUITS LETTERS, Nature Scientific Reports, Frontiers in Electronics, and IEEE Design & Test, an Executive Committee Member of IEEE Central Indiana Section, and the Technical Program Committee Member of ISSCC, CICC, DAC, CCS, IMS, DATE, ISLPED, ICCAD, ITC, and VLSI Design.
