IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 3, MARCH 2003, p. 209

A High-Performance JPEG2000 Architecture


Kishore Andra, Chaitali Chakrabarti, and Tinku Acharya, Senior Member, IEEE

Abstract: JPEG2000 is an upcoming compression standard for still images that has a feature set well tuned for diverse data dissemination. These features are possible due to adaptation of the discrete wavelet transform, intra-subband bit-plane coding, and binary arithmetic coding in the standard. In this paper, we propose a system-level architecture capable of encoding and decoding the JPEG2000 core algorithm that has been defined in Part I of the standard. The key components include dedicated architectures for wavelet, bit plane, and arithmetic coders and memory interfacing between the coders. The system architecture has been implemented in VHDL and its performance evaluated for a set of images. The estimated area of the architecture, in 0.18-μm technology, is 3-mm square and the estimated frequency of operation is 200 MHz.

Index Terms: Binary arithmetic coding, bit-plane coding, JPEG2000, system architecture, wavelet transform.

Manuscript received August 7, 2001; revised September 30, 2002. This paper was recommended by Associate Editor A. Tabatabai.
The authors are with the Department of Electrical Engineering, Telecommunications Research Center, Arizona State University, Tempe, AZ 85287-5706 USA (e-mail: [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TCSVT.2003.809834
1051-8215/03$17.00 © 2003 IEEE

I. INTRODUCTION

THE DIFFERENCES in the computing power, bandwidth and memory of wireless and wired devices, as well as emergence of diverse imaging application requirements, have made resolution scalability and quality scalability essential in today’s still image compression standards. Although these properties can be attained with present JPEG, they cannot be achieved in a single bit stream [1]. To overcome these drawbacks, the upcoming still-image compression standard JPEG2000 has been designed [2]. Error resilience, manipulation of images in compressed domain, acceptable performance even at very low bit rates (< 0.1 bpp), region-of-interest coding, lossy and lossless performance using same coder, noniterative rate control, etc., are some of the other important features of the JPEG2000 standard. All these features are possible due to adaptation of the discrete wavelet transform (DWT) and intra-subband entropy coding along the bit planes using a combination of a bit plane coder (BPC) and binary arithmetic coder (BAC) in the core algorithm.

All three core blocks namely, the DWT, BPC, and BAC blocks are computationally, as well as memory, intensive. The DWT algorithm is a typical “DSP algorithm” with a small set of arithmetic operations performed continuously with symmetrical data access (read) and generation (write) pattern. These properties make it amenable for implementation using DSP and media processors or even dedicated hardware. In contrast to DWT, both BPC and BAC are control intensive (i.e., contain substantial branching conditions) with few arithmetic operations and are performed bit-plane wise. As a result, they cannot be efficiently implemented on DSP or media processors. Even though inherent parallelization is present in the entropy-coding algorithm, the parallel paths are complex, data dependent, and are defined only at run time. Further, since the JPEG2000 kernel will be a part of digital cameras, scanners, printers, wireless devices with multimedia capabilities, etc., it is important that the kernel be area, time, and power efficient.

Thus, we conclude that while DWT can be implemented by DSP or media processors, specialized implementations are needed for the BPC and BAC coders. Recently, Analog Devices has introduced a JPEG2000 co-processor [7], further supporting the hardware implementation paradigm.

The core algorithm in JPEG2000 has been defined in Part I of the standard and any JPEG 2000 system has to minimally comply with the Part I specification. In this paper, we propose an integrated architecture to implement the encoding and decoding for the JPEG2000 Part I coder. The architecture primarily consists of three modules: 1) the DWT module; 2) the BPC module; and 3) the BAC module. The modules interface with each other via memory and buffers. The DWT module is capable of performing (5,3) filter in the lossless mode and (9,7) filter in the lossy mode on an 8-bit input data. Three pairs of BPC and BAC modules are used to reduce the time required for entropy coding. The architecture has been implemented in VHDL and its performance has been evaluated. The estimated area of the architecture, in 0.18-μm technology, is 3-mm square and the estimated frequency of operation is 200 MHz.

The rest of the paper is organized as follows. In Section II, we describe the JPEG2000 Part I coder in brief. The proposed system-level architecture is discussed in Section III. The DWT, BPC, and BAC algorithms and proposed architectures are presented in Sections IV–VI, respectively. The performance of the architecture is discussed in Section VII and the paper is concluded in Section VIII.

II. JPEG2000 BASICS

The encoder proposed for the JPEG2000 Part I standard is explained using the block diagram in Fig. 1. During encoding, an image is split into rectangular structures called tiles. The tiles are coded separately as if they are different images. The encoding steps are summarized below. For more details, please refer to [2]. Note that decoding is symmetric to encoding and can be achieved by performing the encoding steps in the reverse order.

Wavelet Transform: In the first step, the DWT is applied on the tile to decompose it into a number of wavelet subbands. Recently, a new methodology called lifting [8], [9] has been proposed to perform the DWT. Lifting enables the DWT to be computed using a series of banded matrix multiplications. In Part I
of the JPEG 2000 standard, lifting-based implementation of the (5,3) filter is prescribed for lossless encoding and that of the (9,7) filter for lossy encoding.

Quantization: The wavelet coefficients in each subband are scalar quantized if lossy compression is required. In JPEG 2000, uniform scalar quantization with deadzone at the origin is applied to the subband samples for lossy compression. The quantization step size is determined by the dynamic range of the samples in a subband. It can vary from one subband to another based on the visual models, similar to the specification of Q-table in baseline JPEG.

BPC: The quantized subbands are divided into code blocks. The code blocks are entropy coded along the bit planes using a combination of embedded BPC and BAC.

In JPEG2000, the embedded block coding with optimized truncation (EBCOT) algorithm [10] has been adopted to implement the BPC. This algorithm exploits the symmetries and redundancies within and across the bit planes. It generates the input to the BAC block based on statistics (state information bits that are maintained across the bit planes) of the data coded previously.

BAC: The BPC outputs are entropy coded using BAC to generate the code stream. The MQ coder, which is a derivative of the Q-coder [11], [12], has been proposed to implement the BAC. The algorithm is multiplication free. Predetermined probability values are supplied by the standard and are stored in a look up table. The adaptation state machine is also supplied by the standard.

File formatting and layer formation: For each of the code blocks, distortion for a fixed number of bit rates and code size is calculated by a suitable rate control mechanism. The final bit stream is formed based on the available bit rate by means of “layers” which contain incremental contribution from each code block. So even though neither the required resolution nor the required rate is known while encoding, the best possible image of required resolution is generated for a given bit rate.

Fig. 1. Block diagram of the JPEG2000 encoder.

III. PROPOSED SYSTEMS ARCHITECTURE FOR JPEG2000

Here, we propose a systems architecture capable of performing the coding process described in the previous section. The input to the architecture is an image tile and the outputs are three code streams (one for each subband). The division of the image into tiles and formation of the layers at the end of coding process are handled by software. The block diagram of the proposed architecture is shown in Fig. 2.

Fig. 2. Proposed architecture for JPEG2000 encoder.

The architecture primarily consists of a DWT module, three pairs of BPC and BAC modules and three data formatters (DF). It also consists of: 1) three subband memory (SM) blocks between the DWT coder and three BPC coders to store the code blocks formed from the subband data and 2) three CXD buffers between the BPC and BAC modules to store the context and symbol pairs generated by the BPC module. A global controller is present to control the interactions between all these blocks.

The data flow of the architecture is as follows. DWT is applied on the image tile to generate the three high-frequency subbands (HL, LH, HH) and one low-frequency subband (LL) at each level. The LL subband data is used by the DWT module to compute the next level of decomposition while the other three subbands are entropy coded. The subband data is quantized (if required) and broken up into rectangular structures called code blocks. The code blocks are then entropy coded, independently.
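The quantize-then-partition step of this data flow can be illustrated with a minimal software sketch. It assumes the usual deadzone rule q = sign(c) · floor(|c| / step), square code blocks, and floating-point coefficients; the function names `deadzone_quantize` and `split_into_code_blocks` and the sample values are hypothetical and are not taken from the proposed hardware.

```python
def deadzone_quantize(coeff, step):
    """Uniform scalar quantizer with a deadzone at the origin.

    Assumed rule: sign(coeff) * floor(|coeff| / step).
    """
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) // step)


def split_into_code_blocks(subband, block_size):
    """Partition a 2-D subband (list of rows) into square code blocks.

    Blocks are returned in raster order; edge blocks may be smaller.
    """
    rows, cols = len(subband), len(subband[0])
    blocks = []
    for r0 in range(0, rows, block_size):
        for c0 in range(0, cols, block_size):
            block = [row[c0:c0 + block_size]
                     for row in subband[r0:r0 + block_size]]
            blocks.append(block)
    return blocks


# Hypothetical 4 x 4 subband, quantized with step 2.0 and split into 2 x 2 blocks.
subband = [[-7.9, 3.2, 0.4, 12.0],
           [5.5, -0.9, 8.1, -16.3],
           [1.0, 2.0, -3.0, 4.0],
           [-4.0, 9.9, 0.0, 7.5]]
step = 2.0
quantized = [[deadzone_quantize(c, step) for c in row] for row in subband]
blocks = split_into_code_blocks(quantized, 2)
```

Because each code block is an independent list of rows, the blocks can be handed to separate entropy coders, which is the property the architecture relies on when it uses several BPC and BAC pairs.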
Code blocks are written into the SM blocks. Each BPC reads the data from the corresponding SM and writes the context-data pairs into the corresponding CXD buffer. BAC reads from the CXD buffer and generates the code stream for each code block. At the last level, the LL subband is entropy coded using the HL entropy coder pair. The code stream generated is supplied to the bit-stream formation (BSF) tool to form the final bit stream based on the resolution and quality needed. This process is controlled by a rate controller. The proposed architecture does not handle the rate controlling or the BSF tool; a host processor or an ASIC has to perform this function.

The entropy coding of JPEG2000 takes an inordinately long time. For instance, to entropy code a code block, with one bit position being coded in each cycle, cycles are required. This is because the internal precision is 16 bits for lossless performance [4] and BPC performs the coding in three passes [10]. On top of this, the BAC requires at least two table lookups and two additions per bit [2]. The entropy coder still requires a few million cycles even if the bypass mode, proposed in [2] to speed up the entropy coding, is used. In contrast, the DWT requires only about 300 000 cycles to code a 128 × 128 block to five levels.

Fortunately, the time required for encoding can be reduced if multiple hardware modules are provided since the code blocks are entropy coded independently. For instance, for the case where the DWT coder and the entropy coder work in sync (i.e., while the DWT coder operates on level , the entropy coder operates on coefficients of level ), at the most hardware modules are needed at the first level. This is because in the code block structure that we have considered, each subband is split into four code blocks at the first level and the whole subband forms a code block in the rest of the levels. In such a scheme, during entropy coding of levels 2, 3, and 4, three out of 13 modules are needed. Also, each hardware module costs 6000 gates plus memory interface. So the choice of the number of hardware modules is clearly a balance between the time constraint and the area constraint. In our design, we chose three hardware modules: one for the HH subband, one for the LH subband, and one for the HL and LL subbands. This makes the memory interface between the DWT coder and the entropy coder easier to handle and at the same time reduces the computation time by a factor of three.

The decoder architecture is similar to the encoder architecture with data flow in the opposite direction. The BSF tool is replaced by a code stream formation tool. The CXD buffer is replaced by a single register to hold the context. This is because the bit plane decoding cannot proceed before the data is obtained from the binary arithmetic decoder.

Next, we briefly describe the different architectural components.

A. DWT Module
The architecture performs lifting-based DWT/IDWT for the (5,3) and (9,7) filters. The transform is computed in column-row fashion one level at a time (i.e., with no interleaving between the levels). Symmetric extension is used at the boundaries. The key components are: 1) a data path with two adders, one shifter, one multiplier; 2) a memory block of size equal to the tile size (with a read port and a write port); and 3) a controller (counter, signal generator, address generator). The architecture generates an output from a lifting step every cycle. Details of the architecture are given in Section IV.

B. Data Formatter
Data Formatter (DF) carries out the conversion between two’s complement data that is generated by the DWT module and the sign magnitude data that is required by the BPC module. Further, DF also determines the most significant bit plane (i.e., the first bit plane which contains a “1”) of each code block. The BPC starts coding from the significant bit plane. Quantization, if needed, can be performed by DF. As mentioned earlier, scalar quantization is prescribed in the standard and this can be handled with a multiplier if the quantization step sizes are known for each subband.

In the decoder, DF performs the conversion from sign-magnitude form to two’s complement form. The significant bit-plane value for each code block is supplied by the encoder. The bit planes from the 15th bit plane to the significant plane are filled with zeros by the DF. Inverse quantization, if needed, is performed by DF.

C. SM
The data formatters write sign magnitude data to the SM blocks. The bit-plane coders read the data bits and sign bit from the SM blocks along the strips. A novel memory structure that can handle word-in–bit-out format combined with the strip structure required for the BPC has been designed.

The SM structure is shown in Fig. 3. Each row of the SM contains four words, where the four words are obtained from four consecutive rows along a column. Each word is 16-bits wide and so each row is 64 bits wide. The corresponding bits of each word are grouped together as shown in Fig. 3. If the maximum number of rows and columns of a code block are R and C, respectively, then memory structure would have (R/4) × C rows with 64 bits per row. It should be noted that all the elements in a four-row strip are stored in consecutive rows.

D. BPC Module
The architecture to carry out the BPC is based on the EBCOT algorithm. The encoder architecture consists of: 1) combinational logic blocks that transform the state information into input to the BAC module; 2) three memory blocks to hold the state information bits; 3) five registers (of various sizes and functionality) to hold the state and magnitude bits; and 4) a 24-state controller to control all the blocks. The decoder is very similar to the encoder architecture. Details of the architecture are given in Section V.

E. CXD Buffer
The CXD buffer is a FIFO with a read and a write port as shown in Fig. 4. Each entry contains a context data bit pair (6 bits). The length of the buffer needs to be as large as possible to account for speed difference between the BPC and BAC coders. A buffer with 128 entries has been used; the number of
entries was determined with experimentation. The global controller uses two pointers namely, a BPC pointer and a BAC pointer, to keep track of the FIFO stack. The pointers are reset whenever the BAC module is initialized or reinitialized. If the buffer is not large enough, the BPC module ends up following the BAC module with the buffer behaving like a register.

Fig. 3. SM structure to hold 32 rows and 32 columns.

Fig. 4. CXD buffer structure.

F. BAC Module
The architecture to implement the BAC module is based on the MQ coder. The architecture consists of: 1) a 16-bit adder; 2) registers (various sizes and functionality); 3) a logic block that helps in the adaptation process; 4) two memory blocks to perform the table look-up operations; and 5) a controller. The architectural components of the encoder and decoder are the same though the corresponding controllers are completely different. Architectural details are described in Section VI.

IV. LIFTING-BASED DWT

In JPEG2000, the DWT is implemented using a lifting-based scheme. The lifting-based scheme breaks up the high pass and low pass filters into a sequence of upper and lower triangular matrices, and converts the filter implementation into banded matrix multiplications. Such a scheme has several advantages, including “in-place” computation of the DWT, integer-to-integer wavelet transforms (which are useful for lossless coding), symmetric forward and inverse transform etc.

To be JPEG 2000 (Part I) compliant, the DWT module should be able to support (5,3) filter in lossless mode and the (9,7) filter in lossy mode. We propose an architecture which is capable of performing the above filters using the lifting scheme. The architecture supports both forward and inverse DWT. The architecture is very simple and consists of a processor (two adders, one shifter and a four-level pipelined multiplier), a memory block, and a controller.

A. Precision Analysis
The first step in the design of the architecture is to determine the number of bits required for satisfactory lossy and lossless performance in the fixed point implementation. The study was conducted in [4] on three gray-scale images (baboon, barbara, and fish), each of size 512 × 512, for five levels of decomposition. The results were validated with 15 gray-scale images from the USC-SIPI database [13]: 5.2.08–10, 7.1.01–04, 7.1.06–10, boat, elaine, ruler, and gray21 from the Miscellaneous directory. From the study, we concluded that 10 bits are required to represent the coefficients and 14 (16) bits are required to represent the signals for lossy (lossless) performance. A rounding operation (all the numbers are rounded toward ) is employed in the product terms and the internal precision is maintained with 14 (16) bits for lossy (lossless) performance. Based on this precision analysis, the size of the data path units is chosen to be 16-bits wide.

B. Proposed Architecture for Lifting-Based DWT
The proposed architecture performs the DWT in column-row fashion one level at a time. We chose this method over a recursive pyramid algorithm (RPA) [6] based method that generates coefficients of multiple levels in an interleaved fashion since the entropy coder works on coefficients level by level and use of RPA-based method would result in an unnecessary increase in the latency of the system.

The architecture performs one lifting step (i.e., calculating high pass terms from the low pass terms or vice versa) in each iteration. So, the (5,3) filter requires four iterations (two lifting steps along each dimension) while the (9,7) requires nine iterations (four lifting steps in each dimension and one modified scaling step).
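As an illustration of the two lifting steps per dimension for the (5,3) filter, the following minimal sketch runs a 1-D reversible transform in place on a list of integers. It assumes the standard reversible (5,3) predict and update equations of JPEG2000 Part I and whole-sample symmetric extension at the boundaries; the function names are hypothetical, and the code is a software model only, not the data path proposed in this paper. Applying it along columns and then rows gives the four iterations mentioned above.

```python
def _mirror(i, n):
    """Index under whole-sample symmetric extension of a length-n signal."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def forward_53(x):
    """1-D reversible (5,3) lifting on integers (requires len(x) >= 2).

    Returns (low_pass, high_pass). Both steps overwrite samples in place.
    """
    n = len(x)
    y = list(x)
    # Lifting step 1 (predict): odd samples become high-pass terms.
    for i in range(1, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    # Lifting step 2 (update): even samples become low-pass terms.
    for i in range(0, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    return y[0::2], y[1::2]


def inverse_53(low, high):
    """Undo forward_53 by running the two lifting steps in reverse order."""
    n = len(low) + len(high)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    for i in range(0, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    for i in range(1, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    return y


samples = [10, 12, 14, 200, 13, 11, 9, 7]
low, high = forward_53(samples)
assert inverse_53(low, high) == samples  # integer-to-integer, lossless
```

The multiplications by 1/2 and 1/4 appear here as right shifts, and each step needs only two additions, which mirrors why an adder-and-shifter data path is sufficient for the (5,3) filter.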
The architecture shown in Fig. 5 consists of a processor (two adders, one multiplier, one shifter), a memory and controller blocks. The processor reads in the data from the memory block and writes back into it after the transform computation. The controller generates the input/output signals for both the processor and the memory modules. The data flow remains the same for both DWT and IDWT.

Fig. 5. Proposed architecture for the lifting-based DWT.

In [3], we had presented an architecture with four processors which generated coefficients from two subbands in each cycle. The four-processor architecture was not used here because the entropy coder following the DWT module would not be able to handle such high data rates.

1) Processor: All the lifting steps for DWT and IDWT are of the form

The multiplication factors for the (5,3) filter are multiples of two, so multiplication can be replaced with a shift operation. To perform the above general structure, a processor with two adders, a shifter, and a multiplier is required (see Fig. 6). The registers between the units and at the input are not shown.

Fig. 6. Processor structure for DWT computation.

Based on the precision analysis, the adder and shifter are chosen to be 16-bits wide. The multiplier performs a signed 16 × 10 multiplication. A rounding operation is performed on the product so that multiplier output is 16-bits wide. The shifter is capable of shifting 1 or 2 bits, right or left in a single cycle. Further, we assume that the adder has a unit delay and that the multiplier is pipelined to four levels, with each stage of pipe having a delay equal to adder delay.

2) Memory Block: To support the proposed processor architecture with in-place style of memory accesses, a memory block of size N × N (where N is the size of the tile row/column) is required. The memory has to support two read accesses and a write access per cycle and so we chose a dual port memory with two accesses per cycle on the read port.

While encoding, the three SM blocks have to be written from the DWT memory block. So, three read accesses are required per cycle. While decoding, the three SM blocks write into DWT memory block, so three write accesses are required per cycle. However, data is not generated or consumed by the BPC modules simultaneously. So, the global controller can perform the operations in a staggered manner thereby limiting the access requirement to two read accesses and one write access per cycle.

3) Controller: The controller consists of three blocks: a counter, a signal generator, and an address generator.
• Counter: keeps track of the number of elements in a row, number of rows, and number of levels processed. It also keeps track of the total number of elements and total number of rows at each level that needs to be processed.
• Signal generator: generates the control signals for the processor, the address generator, and the memory block using state machine with six states. The states are changed in a sequential order based on the counter input, the latency of the data path units, and the specific filter being used.
• Address generator: address generation is performed such that in-place computation (i.e., the old values are overwritten with the updated values instead of using new memory locations) is achieved. The generator logic is simple as the size of the subbands decrease/increase while performing DWT/IDWT by a factor of two in each dimension. The address generation is achieved with two adders and a shifter.

C. Timing
If , , and are the delays of the adder, shifter, and multiplier, respectively, then the latency for each iteration is for the (5,3) filter and for the (9,7) filter. So, the total time to finish a iteration (assuming a block) is latency . Recall that the (5,3) filter requires 4 iterations and the (9,7) filter requires eight iterations and a modified scaling step. So, the total time required to calculate one level of transform on a block is for the (5,3) filter and for the (9,7) filter.

V. BIT-PLANE CODING

In this section, we briefly describe the embedded block coding with optimized truncation (EBCOT) algorithm followed by the proposed architecture.

A. EBCOT Algorithm
The EBCOT algorithm is summarized here for the sake of completeness. Each bit plane is coded in three passes: significance pass (SP), magnitude refinement pass (MRP), and clean up pass (CP). In each pass, only a part of the bit plane is coded and each bit position is coded only once by one of the three passes. The BPC works on strips of four elements along the rows. The code block scan is carried from left to right. Two modes of coding, namely “regular” and “vertical causal” (VC), are possible [2]. The proposed architecture assumes VC mode, although the regular mode can easily be supported at the expense of extra memory.

The BPC requires four-state information bits and 1 magnitude bit for each bit position. The state information bits determine
in which pass each bit is coded and are used in the generation of context and data bits. The four-state information bits are as follows:
1) Significance bit: This bit is set whenever the magnitude bit of the corresponding subband coefficient is “1” for the first time.
2) Visited once bit: This bit is set when the bit is coded in a pass.
3) Magnitude refinement coded bit: This bit is set the first time the magnitude refinement primitive (explained below) is used.
4) Sign bit: This bit is 0 for positive numbers and 1 for negative numbers and is obtained from the sign-magnitude representation of the subband values.

All the state bits except for the visited once bits are maintained across all the bit planes. The visited once bits are reset at the end of each bit plane. It should be noted that and for the neighbors that are outside the strip are assumed to be zero. All the three passes make use of one or more of the following four primitives: zero coding (ZC), sign coding (SC), magnitude refinement coding (MRC), and run length coding (RLC). All the primitives use context which is a binary representation of the neighboring pixels. Context for the data in bit position X is formed from the eight neighboring values in the matrix as shown in Table I.

TABLE I
NEIGHBORHOOD USED FOR CODING THE INPUT BIT AT POSITION X

1) Primitives
• ZC: uses nine (contexts 0–8) out of possible 19 contexts. The data is the magnitude of the bit position X.
• SC: uses five contexts (contexts 9–13) and is a two-step process. In the first step, the significance and sign bits of the horizontal and vertical neighbors are used to form the horizontal and vertical “contributions” and a “XOR” bit [2]. In the second step, context is formed from the two contributions and data is formed by exclusive OR operation of the sign bit and the XOR bit.
• MRC: uses three contexts (contexts 14–16). The contexts are formed based on whether it is the first time the magnitude refinement is being used on a certain position and its eight immediate neighbors. The data is the magnitude bit.
• RLC: uses the remaining two contexts (contexts 17–18). It is invoked only at the beginning of a strip if the significance bit of all the eight neighbors is 0 for all the bits in a strip. If none of the bits in the strip become significant, context 17 with data 0 is used. On the other hand, if any bit does become significant, context 17 with data 1 is used. This is followed by MSB and LSB of zero index (ZI) (00–11) of the bit position which contains the “1” bit. Context 18 is used for ZI bits.

2) Coding passes: As mentioned earlier, each bit plane is coded in three passes. The first bit plane is coded just with the CP. In the SP, all the bits that are not yet significant and have at least one of the immediate eight neighbors significant are coded using ZC primitive. If the bit becomes significant, the SC primitive is used and the significance bit of the bit being coded is set to 1. When ZC is applied, the corresponding visited once bit is set. In the MRP, all the bits that are already significant and have not been visited in the current bit plane are coded using MR primitive. The corresponding magnitude refinement coded bit is set to 1. In CP, if the first element in the strip is neither significant nor visited, the RLC condition is checked. If the RLC condition (mentioned in the RLC primitive) is satisfied, the RLC primitive is used. If one of the bits in the strip becomes significant, then SC is used and the significance bit is set for that bit. This is followed by application of ZC and SC for the rest of the bits in the strip. If the RLC condition is not satisfied, then ZC and SC are used for all the elements that are neither significant nor visited.

B. BPC Encoder Architecture
The block diagram of the proposed architecture for the EBCOT encoder is shown in Fig. 7. The architecture consists of the following key building blocks: 1) three combinational logic blocks to determine the contexts for ZC, MRC, and SC (the contexts for RLC are hard coded); 2) five shift registers of varying sizes and functionality to store the state variables and the magnitude bits; and 3) three memory blocks (for the significance, visited once, and magnitude refinement coded state bits), each of size 32 × 4. The magnitude and sign bits are obtained from the SMs. In addition to these key building blocks, there is a multiplexer (MUX) to select the right context for the bit to be coded from the various contexts based on the coding pass, a counter to keep track of the number of strips processed and also the coding pass being used, and finally a controller. The functionality of these building blocks is described below.

1) Combinational logic blocks
The tables provided in [2] to form the context for each of the primitives can be expressed in terms of simple logic operations. These logic operations are mapped into gates and are placed in the combinational logic blocks. For more details, please refer to [14].
• ZC context block: The input to this block is the significance bits of the eight neighbors of the bit being coded and magnitude of the bit position. The output is the ZC context and data pair.
• SC context block: Inputs to this block are the significance and sign bits of the horizontal and vertical neighbors, from the significance and sign registers, respectively. The output is the SC context and the sign data bit.
• Magnitude refinement coding context block: The two inputs to this block are the magnitude refinement coded bit and the nhood0 bit (which indicates if the eight neighbors are all zeros) from the significance register. The output is context and data pair for MRC.
• RLC contexts: The four possible contexts are: RLC condition satisfied and strip is all 0’s, RLC condition satisfied and the strip contains at least one “1” bit. The latter case is followed by context 18 and two bits of the Zero Index (supplied by a register) of the bit position that is “1.” These contexts and data pairs are hard coded.
• Registers: There are five registers of varying sizes to store the state variables (see [14]). All the registers are capable
ANDRA et al.: A HIGH-PERFORMANCE JPEG2000 ARCHITECTURE 215

Fig. 7. Proposed architecture for EBCOT encoder.

of 1 bit left shift. For initialization and RLC, the register and register are capable of 5-bit and 4-bit left shift, respectively.
The , and registers have an “update” position, where a “1” is written to set the corresponding state variable, when required. Data from the (corresponding) memory is written into four least-significant-bit positions in the registers. But data can be read from different positions of the registers and written into memory. The registers are read and written at the end of coding of each strip.
• Memory blocks: Three memory blocks each of size 32 × 4 are used to store the state variables. The subband MEMs supply the sign and magnitude bits. The DWT module writes into the subband MEMs. The other three memories are written by the corresponding internal registers and they have a single read and write port, as shown in Fig. 7.
• Context and data mux: The multiplexer chooses the context from the outputs of ZC context block, SC context block, MR context block, or the hard coded RLC contexts (17–18). The data bit is chosen from the , sign data, , hard coded RLC data bits (0,1), or the ZI (MSB, LSB) bits. The mux is controlled with a 3-bit word. Based on the pass being performed, the controller generates the control word.
• Counter: It keeps track of the element in the strip being coded, the number of strips coded in each pass, the pass being processed, and the bit plane being processed. This information is required for the state machine.
• Controller: The state machine consists of 24 states. The state machine can be divided into five phases: initialization phase, ZC and SC phase, MRC phase, RLC phase, and termination phase. In initialization phase, the registers are reset and and registers are initialized as required. Based on the pass, one of the primitives is performed. The ZC and SC phase are performed during SP and during CP when some conditions are satisfied. The context generated by ZC context block is used in this phase. The MRC phase is performed during MRP. The context generated by MRC context block is utilized in this phase. The RLC phase is invoked during CP. If the RLC condition is satisfied and strip contains all zeros, then strip is not coded further. The contexts for RLC are hard coded. Finally, termination phase is invoked at the end of coding a strip. Using the counter information, the next coding step is determined. Please refer to [14] for a detailed discussion of the state diagrams.

C. BPC Decoder Architecture
The architecture for the decoder remains almost the same as the encoder except for small changes to and memories and registers. For instance, data from and MEMs is written out. Also, while in the encoder, both the bits of zero index are known before RLC is started, in the decoder, the bits are obtained one at a time. The resulting state machine is slightly different, although the number of states required still remains the same.

Fig. 8. Interpretation of the parameters in the MQ coder.

VI. BINARY ARITHMETIC CODING

A. MQ Coder Basics
The basic principle of an arithmetic coder is to recursively subdivide the 0–1 interval based on the conditional probability of the input symbols. The MQ coder uses the convention shown in Fig. 8: the current interval is $A$, the starting point of the interval is $C$ (which also holds the code string), and the probability of LPS occurring next is “$q$.” So to code a MPS, the code string has to be changed by adding the sub-interval of the LPS. Nothing needs to be done to code a LPS. From Fig. 8, it can be

Fig. 9. Block diagram of the MQ coder.

observed that the interval and the starting point have to be changed for the MPS and LPS cases as follows:

$$A = A - A \cdot q, \qquad C = C + A \cdot q \qquad \text{(for the MPS case)}$$
$$A = A \cdot q \qquad \text{(for the LPS case)}$$

By making sure that $A$ is close to unity, the $A \cdot q$ value is approximated to a value “$Q_e$.” This simplifies the above equations to $A = A - Q_e$; $C = C + Q_e$ for the MPS case and $A = Q_e$ for the LPS case.
The statistics required to determine the $Q_e$ value, given a context and symbol (generated by the BPC module), are maintained with the help of two lookup tables. The index of the first lookup table is the context, and each entry in the table is the MPS for that context and the index to the second table. The second table contains the pre-computed $Q_e$ values, provided by the standard.
It can be seen from the MQ coder algorithm [2] that to code a symbol, a minimum of two table lookups and two additions are required. This shows that the MQ coder is inherently slow. To speed up the bit-plane coding, by-pass mode is proposed in the JPEG2000 standard. In this mode, starting with the fifth bit plane, BAC coding is bypassed for symbols generated in significance and magnitude refinement passes. We have implemented the by-pass mode in the proposed architecture.

B. BAC Encoder Architecture
The proposed encoder architecture, shown in Fig. 9, consists of: 1) a 16-bit adder to perform the arithmetic operations and comparison; 2) a combinational logic block (part of the update logic “UL”) to update the $Q_e$-index and MPS sense; 3) a counter which is used to keep track of the number of code bits generated; 4) two memories to store the “Info table” and the “$Q_e$ table”; and 5) eight registers: $A$ (to hold the interval), $C$ (code string), $B$ (the last byte generated), $Q_e$-index, $Q_e$, $CX$ (context), MPS and symbol (data). All the units are controlled by a controller which also generates the read/write signals for the memories. The data transfer between the adder and the registers is carried out using two 16-bit data buses.
• Adder: The adder is used to calculate $A - Q_e$, $C + Q_e$, and also to perform the comparison operation $A < Q_e$. Since the $C$ register is 32 bits wide and the $Q_e$ register is 16 bits wide, a 32-bit adder is required to compute $C + Q_e$. Since experimental results showed that only 30% of the time a carry is propagated to the 16 MSBs of the $C$ register, the 32-bit addition is handled in two steps using a 16-bit adder.
• Update Logic (UL): The UL consists of a combinational logic block and the Info table. To generate the new $Q_e$-index and MPS, the present $Q_e$-index, MPS, and Symbol are supplied to the logic block. If a renormalization process is performed, the new information is written into the Info table. The new data is generated based on the state machine [2].
• Counter: The Counter is initialized to 12 (to account for the spacer bits) at the beginning of coding [2], [12]. Whenever the count becomes zero, the data present in $B$ is written out and a new byte of data is written to $B$ from $C$. Then, based on whether bit stuffing is required or not, the counter value is set to 7 or 8.
• Registers:
$A$ register: It is 16 bits wide and capable of a 1-bit left shift. The MSB is supplied to the controller. This bit helps in determining if renormalization has to be performed. This register can be written by the adder and the $Q_e$ register.
$C$ register: It is 32 bits wide and is capable of a 1-bit left shift. The 28th bit is used by the controller to verify if a carry is available for the byte in the $B$ register. It can be written by the adder. Various arrangements are required to reset parts of the $C$ register during reading the code byte and during the “flush” procedure (used to terminate the coding) [2]. Also, it should be noted that only 16 bits at a time are accessed.
$B$ register: It is 8 bits wide and can be written by the adder or the $C$ register. An All1 detector is built into the register to indicate to the controller if bit stuffing is required. The code stream is written to the external memory as required.
Other registers: All the other registers, $Q_e$ (16 bits), $Q_e$-index (6 bits), Context (5 bits), MPS (1 bit), and Symbol (1 bit), are just simple registers with no extra functionality. They do not generate any inputs to the controller.
• Controller: It generates control signals for all the registers and the memories. It is driven by a state machine which has 36 states. The by-pass mode is supported by the controller.
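The simplified interval update from the MQ coder basics and the two-step addition described for the adder can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the VHDL implementation: the function names `code_symbol` and `add32_two_step` are hypothetical, the conditional exchange, byte-out, and bit stuffing of the full MQ coder [2] are omitted, and the value 0x5601 is used merely as an example $Q_e$ entry.

```python
A_MIN = 0x8000  # renormalization threshold: MSB of the 16-bit A register


def code_symbol(a, c, qe, is_mps):
    """Apply the simplified update, then renormalize.

    Assumes 0 < qe < a. Returns the new (a, c) and the number of
    1-bit left shifts performed. c is kept as an unbounded Python int
    because byte-out is omitted in this sketch.
    """
    if is_mps:
        a -= qe       # A = A - Qe
        c += qe       # C = C + Qe
    else:
        a = qe        # A = Qe; the code string is left unchanged
    shifts = 0
    while a < A_MIN:  # MSB of A is 0, so shift A and C left together
        a <<= 1
        c <<= 1
        shifts += 1
    return a, c, shifts


def add32_two_step(c, qe):
    """Compute (C + Qe) mod 2**32 using 16-bit additions only.

    Returns the sum and a flag telling whether the second step
    (carry into the 16 MSBs) was needed.
    """
    low = (c & 0xFFFF) + (qe & 0xFFFF)  # step 1: add into the 16 LSBs
    carry = low >> 16
    high = (c >> 16) & 0xFFFF
    if carry:                            # step 2 only when a carry propagates
        high = (high + 1) & 0xFFFF
    return (high << 16) | (low & 0xFFFF), bool(carry)


if __name__ == "__main__":
    a, c, n = code_symbol(0x8000, 0, 0x5601, is_mps=True)
    print(hex(a), hex(c), n)
    print(add32_two_step(0x0001FFFF, 0x0001))
```

The shift count returned by `code_symbol` is what a code-bit counter would be decremented by; the flag returned by `add32_two_step` marks the cases in which the second 16-bit addition is actually needed.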

C. BAC Decoder Architecture
The proposed architecture is very similar to the encoder architecture with the following exceptions. The code byte is written into the $B$ register instead of reading from it. The symbol register is not required; instead, MPS or an inverted version of it (LPS) is written out as required. The counter needs to count down from eight, unlike the encoder counter, which has to count down from 12 initially. All the arithmetic operations are performed on the 16 MSBs of the $C$ register. To undo the bit stuffing, an incrementing function is required. The controller state machine consists of 28 states.

VII. PERFORMANCE OF THE PROPOSED SYSTEMS ARCHITECTURE
We have conducted the performance analysis of the proposed architecture with four images (baboon, barbara, fish and elaine) of size 512 × 512. The input to the architecture is a 128 × 128 image tile. DWT is carried out to five levels. The maximum size of the code block is fixed at 32 × 32. After the first level of encoding, each of the three subbands is of size 64 × 64. Each subband is split into four code blocks, each of size 32 × 32. For the rest of the levels, the whole subband is treated as a code block since the subband is of size 32 × 32 or smaller.
The number of cycles (in millions) required to encode and decode the images with and without bypass mode for the (9,7) filter is given in Table II. It can be seen that the speed up with bypass mode is around 15% for encoding and 25% for decoding. Note that the number of cycles required for decoding is significantly higher than that required by encoding. This is expected since during encoding, the BPC and BAC coders work independently most of the time due to the CXD buffer, while during decoding, the coders cannot work independently. Similar results have been obtained for the (5,3) filter.

TABLE II
CYCLES (IN MILLIONS) REQUIRED TO ENCODE AND DECODE WITH AND WITHOUT BY PASS MODE

The system architecture has been implemented in VHDL. We have synthesized the data path units in the DWT coder, BPC encoder and decoder, and BAC encoder and decoder. The preliminary gate counts (in two-input NAND gate equivalents) of the modules and the memory required by each module are given in Table III. The estimated area of the architecture, assuming the control is 20% of data path area in case of DWT, in 0.18-μm technology is 3-mm square and the estimated operation frequency is 200 MHz.

TABLE III
HARDWARE REQUIREMENT OF THE PROPOSED ARCHITECTURE

VIII. CONCLUSION
In this paper, we have proposed a systems architecture to perform the new JPEG2000 part I standard for compression and decompression of images. The architecture consists of modules to implement the DWT, BPC, and BAC algorithms and interfacing memory structures. The BPC and BAC modules are implemented by three sets of computation engines. Such a structure was necessary to compensate for the high computational requirements of these two modules. The system level architecture has been implemented in VHDL.
To the best of our knowledge, the only other JPEG2000 architecture is the JPEG co-processor, ADV-JP2000, by Analog Devices [7]. There are several differences between the two architectures, some of which have been listed in Table IV. It is stated that the bypass mode has no effect on the coding speed for ADV. This is quite surprising since in our implementation bypass mode speeds up encoding by around 15% and decoding by 25%.

TABLE IV
DIFFERENCES BETWEEN ADV JP2000 AND THE PROPOSED ARCHITECTURE

REFERENCES
[1] J. L. Mitchell and W. B. Pennebaker, JPEG Still Image Data Compression Standard. New York: Van Nostrand, 1993.
[2] JPEG2000 Final Committee Draft (FCD). JPEG2000 Committee Drafts. [Online]. Available: http://www.jpeg.org/CDs15444.htm.
[3] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting based forward and inverse wavelet transform,” IEEE Trans. Signal Processing, vol. 50, pp. 966–977, Apr. 2002.

[4] K. Andra, C. Chakrabarti, and T. Acharya, “An efficient implementation of a set of lifting based wavelet filters,” in Proc. ICASSP 2001, pp. 1101–1104.
[5] W. Jiang and A. Ortega, “Lifting factorization-based discrete wavelet transform architecture design,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 651–657, May 2001.
[6] C. Chakrabarti and M. Vishwanath, “Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers,” IEEE Trans. Signal Processing, vol. 43, pp. 759–771, Mar. 1995.
[7] Analog products: ADV-JP2000 [Online]. Available: http://products.analog.com/products/info.asp?product=ADV%2DJP2000.
[8] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting schemes,” J. Fourier Anal. Applic., vol. 4, pp. 247–269, 1998.
[9] W. Sweldens, “The lifting scheme: A new philosophy in biorthogonal wavelet constructions,” in Proc. SPIE, vol. 2569, 1995, pp. 68–79.
[10] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Processing, vol. 9, pp. 1158–1170, July 2000.
[11] G. L. Langdon, Jr. and J. Rissanen, “Compression of black-white images with arithmetic coding,” IEEE Trans. Commun., vol. COM-29, pp. 858–867, June 1981.
[12] J. L. Mitchell and W. B. Pennebaker, “Software implementations of the Q-coder,” IBM J. Res. Develop., vol. 32, no. 6, pp. 753–774, Nov. 1988.
[13] USC-SIPI image database [Online]. Available: http://sipi.usc.edu/services/database/Database.html.
[14] K. Andra, T. Acharya, and C. Chakrabarti, “Efficient VLSI implementation of bit plane coder of JPEG2000,” in Proc. SPIE Int. Conf. Applications of Digital Image Processing XXIV, vol. 4472, pp. 246–257. [Online]. Available: http://enws155.eas.asu.edu:8001/papers.html.

Kishore Andra received the B.Tech. degree in electrical and electronics engineering from J. N. T. University, Anantapur, India, in 1994, the M.S. degree from the Indian Institute of Technology, Madras, India, and the Ph.D. degree from Arizona State University, Tempe, both in electrical engineering, in 1997 and 2001, respectively.
Currently, he is with Maxim Integrated Products, Sunnyvale, CA, working on the design of low-power high-performance mixed-signal ICs.

Chaitali Chakrabarti received the B.Tech. degree in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1984, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland at College Park in 1986 and 1990, respectively.
Since August 1990, she has been with the Department of Electrical Engineering, Arizona State University, Tempe, where she is currently an Associate Professor. Her research interests are in the areas of low-power systems design, including memory optimization, high-level synthesis and compilation, and VLSI architectures and algorithms for signal processing, image processing, and communications.
Dr. Chakrabarti is currently an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING and the Journal of VLSI Signal Processing Systems. She has served on the program committees of ICASSP, ISCAS, SIPS, ISLPED, and DAC. She is a member of the Center of Low Power Electronics (jointly funded by the National Science Foundation, the state of Arizona, and the member companies) and the Telecommunications Research Center. She received the Research Initiation Award from the National Science Foundation in 1993, a Best Teacher Award from the College of Engineering and Applied Sciences, ASU, in 1994, and the Outstanding Educator Award from the IEEE Phoenix section in 2001.

Tinku Acharya (M’96–SM’01) received the B.Sc. (Hons.) degree in physics in 1983 and the B.Tech and M.Tech degrees in computer science from the University of Calcutta, Calcutta, India, in 1983, 1986, and 1989, respectively, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, in 1994.
Since 1997, he has been an Adjunct Professor in the Department of Electrical Engineering, Arizona State University, Tempe. He has been with Elution Technologies, Phoenix, AZ, a start-up company, since June 2002. Previously, he was a Principal Engineer in the Intel Architecture Group with Intel Corporation. Before joining Intel Corporation in 1996, he was a consulting Engineer at AT&T Bell Laboratories (1995–1996), a faculty member at the Institute of Systems Research, University of Maryland at College Park (1994–1995), and held visiting faculty positions at the Indian Institute of Technology (IIT), Kharagpur during 1998–2001. He contributed to over 60 technical papers published in international journals, conferences, and book chapters. He holds 37 U.S. patents and has more than 80 patents pending. His current areas of interest include VLSI Architectures and Algorithms, Electronic and Digital Image Processing, Data/Image/Video Compression and Media processing algorithms in general.
Dr. Acharya was awarded the “Most Prolific Inventor” by Intel Worldwide in 1999 and “Most Prolific Inventor” by Intel Arizona for the past five consecutive years for his significant contribution in intellectual property generation in different areas of development in Intel Corporation. He also served in the U.S. National Body of the JPEG2000 committee of the International Standard Organization (ISO) as the primary member of Intel Corporation. He is a Senior Member of the SPIE Optical Society.
