
674 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 25, NO. 4, APRIL 2015

A Low-Complexity Embedded Compression Codec Design With Rate Control for High-Definition Video

Yin-Tsung Hwang, Member, IEEE, Ming-Wei Lyu, and Cheng-Chen Lin

Abstract— A hardwired design of an embedded compression engine targeting the reduction of full high-definition (HD) video transmission bandwidth over wireless networks is developed. It adopts an intra-coding framework and supports both lossless and rate-controlled near lossless compression options. The lossless compression algorithm is based on a simplified Context-Based, Adaptive, Lossless Image Coding (CALIC) scheme featuring pixelwise gradient-adjusted prediction and an error-feedback mechanism. To reduce the implementation complexity, an adaptive Golomb–Rice coding scheme in conjunction with a context modeling technique is used in lieu of an adaptive arithmetic coder. With the measures of prediction adjustment, the near lossless compression option can be implemented on top of the lossless compression engine with minimized overhead. An efficient bit-rate control scheme is also developed and can support rate- or distortion-constrained controls. For full HD (previously encoded) and nonfull HD test sequences, the lossless compression ratio of the proposed scheme, on average, is 21% and 46% better, respectively, than the Joint Photographic Experts Group-Lossless Standard and the Fast, Efficient Lossless Image Compression System (FELICS) schemes. The near lossless compression option can offer an additional 6%–20% bit-rate reduction while keeping the Peak Signal-to-Noise Ratio value at 50 dB or higher. The codec is further optimized complexity-wise to facilitate a high-throughput chip implementation. It features a five-stage pipelined architecture and two parallel computing kernels to enhance the throughput. Fabricated using the Taiwan Semiconductor Manufacturing Company 90-nm complementary metal–oxide–semiconductor technology, the design can operate at 200 MHz and supports a 64 frames/s processing rate for full HD videos.

Index Terms— Embedded compression, high-definition (HD) video, lossless video compression, rate control, very-large-scale integration (VLSI) design.

I. INTRODUCTION

CONVENTIONAL lossy video compression schemes aim at largely reducing the data size of the source video for efficient storage. These schemes, employing complex techniques such as motion estimation and arithmetic coding, can provide high compression efficiency, but are lossy and bear high computing complexity as well. Embedded image/video compression, on the contrary, targets specifically reducing the data access bandwidth during the transmission process and resorts to less complicated measures to minimize the overhead. Embedded compression has been found useful in various emerging applications. Consider an HD video transmitted over a wireless network to a remote display device for playback; the required communication bandwidth would be tremendous if raw pixel data were used. To cope with the bandwidth constraint, it is necessary to perform compression on the fly while the transmission is proceeding. The compression should preferably be lossless to avoid losing any data authenticity. In case of a strict network bandwidth constraint, near lossless compression is acceptable provided that the visual distortion is barely noticeable. The essential features of embedded compression can be summarized as: 1) preservation of image/video quality (lossless or near lossless compression); 2) low computing effort and memory usage; 3) sufficiently high compression efficiency and precise bit-rate control for the required bandwidth reduction; and 4) high-throughput, real-time operation.

Previous lossless video codecs can be classified into three categories. The first category employs complex prediction schemes to pursue compression efficiency. These are mostly implemented in software and do not support real-time compression. The second category uses intra-frame coding only and emphasizes low-complexity implementations. Many of them have been successfully implemented in hardware for high-throughput operation, too. The third category is built on top of a lossy coder and aims at providing a lossless option.

The basic framework of a lossless video codec consists of a prediction module followed by an entropy coder. Codecs in the first category exploit the data correlation in both spatial and temporal domains to improve the prediction accuracy. In [1], a zero-motion prediction preprocessing is applied in the temporal domain, followed by a line-based entropy coder in the spatial domain. In [2], the scheme exploits the spectral, temporal, and spatial redundancies in a backward-adaptive way. In [3], the colocated pixel in the previous frame is included to form a 3-D predictor. In [4], temporal and spatial predictions are both performed but selected adaptively. For these codecs, the measures of applying multidimensional prediction along temporal and spatial domains do provide some marginal benefits in compression efficiency. The incurred computing and memory overheads are nonetheless significant. Tools such as HuffYUV [5], FF Video Codec 1 [6], YUVsoft's Lossless Video Codec [7], Lagarith [8], and

Manuscript received November 22, 2013; revised April 8, 2014, June 25, 2014, and August 15, 2014; accepted September 2, 2014. Date of publication September 8, 2014; date of current version April 2, 2015. This work was supported in part by the Ministry of Science and Technology under Project 99-2221-E-005-106-MY3 and Project 102-2220-E-005-012. This paper was recommended by Associate Editor M. Tagliasacchi.
Y.-T. Hwang is with the Department of Electrical Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]).
M.-W. Lyu is with Andes Technology, Hsinchu 300, Taiwan.
C.-C. Lin is with Largan Precision Company, Ltd., Taichung 408, Taiwan.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2014.2355691

1051-8215 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Portsmouth. Downloaded on December 26,2020 at 14:04:03 UTC from IEEE Xplore. Restrictions apply.

the Moscow State University codec [9] also belong to the first category and are available in software only. They are equipped with different coding options to support either maximum compression efficiency or faster compression. However, no bit-rate control options are provided in these tools. A complete performance comparison of these software tools is presented in [10].

For the second category of lossless codecs, only intra-frame coding is applied. This averts the expensive search operations along the temporal domain and eliminates the extra data buffer for the previous frame. The Fast, Efficient Lossless Image Compression System (FELICS) [11] is one of the early works in this category. It adopts a primitive linear prediction (LP) scheme and employs an adjusted binary coder. Its compression efficiency is compromised because of its algorithmic simplicity. To enhance the prediction accuracy, the median edge detector (MED) is used in the Joint Photographic Experts Group-Lossless Standard (JPEG-LS) [12] and gradient-adjusted prediction (GAP) is adopted in Context-Based, Adaptive, Lossless Image Coding (CALIC) [13]; both are nonlinear gradient-based schemes. Both schemes also employ context-based error models to compensate the bias of prediction residuals. This measure is effective in improving the prediction accuracy but requires extra memory to store the model information. The need to update the models also poses challenges for high-throughput designs. In entropy coding, a simple Golomb–Rice coder is used in JPEG-LS, whereas a more complicated adaptive arithmetic coder is used in CALIC. In general, the CALIC scheme has better performance than the JPEG-LS scheme [12]. In contrast to the spatial-domain approach, JPEG2000 [14] applies an integer wavelet transform, and the frequency coefficients are coded by a context-based entropy model called embedded block coding with optimal truncation (EBCOT). The frequency-domain approach is considered less efficient in lossless compression, because unquantized high-frequency components usually lead to an excessive coding bit rate. Instead of pixel-by-pixel coding, a sequence of pixels is processed altogether in [15]. The difference sequence, consisting of the Differential Pulse Code Modulation (DPCM) coefficients of adjacent pixels, is coded using a differences codes map (DCM), where each code word is Huffman coded. It is claimed to be a most lightweight lossless codec, and its compression efficiency is similar to that of FELICS.

For codecs in the third category, a lossy compression is performed first and produces a base-layer coding. The distortions are further coded in a lossless way as an enhancement layer. In [16], the base-layer codec is H.264/AVC and the distortions are encoded by a quantization-parameter-adaptive Rice coder. In [17], High Efficiency Video Coding (HEVC) is used as the base-layer codec, and Golomb-based binarization and context-based arithmetic coding are used for the enhancement layer. Reference [18] also targets HEVC. It uses techniques such as improved-level coding and binarization table selection to enhance the performance. In [19], a prediction residual is further predicted with those of surrounding pixels by using either an inter-frame or an intra-frame scheme. In case of inter-frame prediction, a spiral search in the previous residual frame is needed. Because these schemes must work with a lossy base codec, the overall algorithm complexity (base codec plus the extension) is usually too high to be considered in embedded compression.

As for the rate-control issue, conventional video codecs such as H.264 employ sophisticated rate-distortion optimization (RDO) schemes, which determine the best prediction mode and the quantization factors to minimize a cost function. Such schemes are too complicated to be employed in embedded compression. Among low-complexity bit-rate control schemes, the scheme in [20] is content adaptive; that is, more distortion is allowed in areas with more complicated local texture. The basic measure is adjusting the quantization in DPCM coding. It focuses more on the perceptual quality than on the Peak Signal-to-Noise Ratio (PSNR) value of the image. The scheme in [21] uses a smart combination of variable-length and fixed-length coding for the rate control. It can guarantee the compression ratio (CR) but not the PSNR quality. So far, an effective and low-complexity bit-rate control scheme supporting both bit-rate and PSNR controls is yet to be developed.

In terms of hardware implementations, various hardwired designs for lossless image compression have been investigated. These include the chip designs for FELICS [22], JPEG-LS [23], [24], discrete wavelet transform (DWT)–Set Partitioning Into Hierarchical Trees (SPIHT) [25], simplified CALIC (S-CALIC) [26], hierarchical average and copy prediction (a scheme using hierarchical prediction plus a significant-bit truncation technique in coding) [27], and the associated geometric-based probability model (AGPM) [28] (an enhancement of the design in [22] with rate control). Hardwired solutions excel in speed and power performance compared with software solutions carried out on programmable processors.

Existing compression standard technologies often lie at the two extremes of the performance spectrum and cannot simultaneously meet the basic requirements of embedded compression, such as compression efficiency, implementation complexity, and bit-rate control. This calls for a new codec design. The proposed design supports both lossless and near lossless compression and features an optimal tradeoff between implementation complexity and compression efficiency. It can outperform JPEG-LS in compression efficiency and requires a much lower implementation complexity than full-fledged codecs such as H.264. It is also equipped with precise bit-rate control schemes, which are not supported in JPEG-LS and other lossless coding standards. The main contributions of this paper include: 1) a low-complexity yet efficient lossless video codec; 2) a prediction adjustment-based near lossless compression technique and precise bit-rate control schemes; 3) extensive performance evaluations of compression schemes; 4) a novel pipelined and high-throughput codec architecture; and 5) a hardwired solution with chip implementation. The remainder of this paper is organized as follows. In Section II, the CALIC scheme, as the kernel of the proposed lossless codec, is reviewed. The proposed lossless compression scheme is elaborated in Section III. In Section IV, the bit-rate control mechanism for near lossless compression is developed. Section V describes the performance evaluation results of


different embedded compression schemes. In Section VI, the hardware design of the proposed scheme and the chip implementation results are presented. Comparison with existing chip designs for embedded compression is also discussed.

II. REVIEW OF THE CALIC LOSSLESS COMPRESSION SCHEME

The CALIC scheme developed in [13] relies solely on intra-frame prediction and is equipped with smooth-region detection and context-based prediction bias cancellation. It features high compression efficiency and moderate implementation complexity. It can outperform rival schemes such as FELICS and JPEG-LS, which lack such features [26]. It is thus chosen as the basic framework of the proposed lossless video codec. Fig. 1 shows the block diagram of the CALIC scheme. Given an image I, prior to the pixel value prediction, a decision on the smoothness of the surrounding area (binary mode test) is made first. If the pixel is indeed located in a smooth area, a binary mode is assumed and a ternary entropy coder is employed to code the pixel value. Otherwise, the codec enters a continuous tone mode and a GAP scheme is used to calculate an initial prediction of the pixel value. This prediction is further compensated through an error-feedback mechanism to reduce the prediction error. The compensation is based on the history of past predictions, which are classified by a context modeling technique to improve the accuracy of compensation. After prediction compensation, the prediction residuals are coded by an adaptive arithmetic coding (AAC) module. The details of the scheme related to the proposed work are elaborated in the following.

Fig. 1. Block diagram of CALIC system.

A. Binary-Mode Coding

Fig. 2 shows the context window used in CALIC. It consists of the current pixel (denoted as Ip) and its seven neighboring pixels, where the subscript of the notation (n for north, e for east, and w for west) indicates the relative location. If the context window contains at most two distinct pixel values, the codec enters a binary mode directly and no pixel prediction value is calculated. The two distinct pixel values in the context window are considered two prediction alternatives, and a ternary entropy coder is used to encode the value. Let x1 be the pixel value of Iw and x2 be the second possible pixel value in the context window. Referring to (1), the ternary entropy coder uses three symbols 0, 1, and 2 (the bits enclosed in parentheses are the actual coding). It takes no more than two bits to encode the pixel if its value matches either x1 or x2:

    T = "0" (1)   if I = x1
        "1" (01)  if I = x2
        "2" (00)  otherwise.                                  (1)

Fig. 2. Context window of the CALIC scheme.

B. Gradient-Adjusted Prediction

If the binary-mode test fails or the ternary entropy coder returns a symbol 2, a continuous tone mode is assumed and GAP applies. GAP is based on two gradient values defined as

    dh = |Iw − Iww| + |In − Inw| + |In − Ine|
    dv = |Iw − Inw| + |In − Inn| + |Ine − Inne|.              (2)

If the horizontal gradient dh is larger than the vertical gradient dv, the pixel right above the current pixel, that is, the north pixel, is weighted more in prediction. Otherwise, the pixel to the left of the current pixel is favored in prediction. This leads to the prediction formula shown in (3):

    if (dv − dh > 80)        Ip[i, j] = Iw
    else if (dv − dh < −80)  Ip[i, j] = In
    else { Ip[i, j] = (Iw + In)/2 + (Ine − Inw)/4
           if (dv − dh > 32)        Ip[i, j] = (Ip[i, j] + Iw)/2
           else if (dv − dh > 8)    Ip[i, j] = (3·Ip[i, j] + Iw)/4
           else if (dv − dh < −32)  Ip[i, j] = (Ip[i, j] + In)/2
           else if (dv − dh < −8)   Ip[i, j] = (3·Ip[i, j] + In)/4 }   (3)

C. Context Modeling and Error Feedback

The prediction obtained by the GAP scheme is further compensated by an error-feedback value to correct the prediction bias statistically. Context modeling is a mechanism to profile and classify past prediction errors. Pixels with similar texture contexts are classified to the same prediction error model, and a total of 576 models are used. Each model logs the prediction errors constantly and takes their average as an error-feedback value. The error-feedback value can be obtained through table lookup and is combined with the GAP value to form the final prediction value. The models are classified, and addressed as well, by a 10-bit index C = (β, δ). β is 8 bits wide and derived from an eight-tuple texture context C consisting of six pixel values plus two gradient values

    C = {In, Iw, Inw, Ine, Inn, Iww, 2In − Inn, 2Iw − Iww}    (4)


where C is compared elementwise with the GAP value of the current pixel and quantized to an 8-bit value β:

    βi = 0 if xi ≥ Ip
         1 if xi < Ip,    for i = 0 to 7.                     (5)

δ is 2 bits wide and obtained by quantizing the value Δ = dv − dh + |ew| to four levels, where ew is the prediction error of the west pixel. Finally, an AAC entropy coder is employed to encode the prediction residuals. Preprocessing measures, such as conditional probability estimation and coding histogram sharpening, are applied before coding.
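To make the indexing concrete, the following Python sketch (our own illustration, not code from the paper; the bit-packing order within β is an assumption the text does not fix) forms the 8-bit texture index of (4)–(5):

```python
def context_index_beta(In, Iw, Inw, Ine, Inn, Iww, pred):
    """Form the 8-bit texture index beta of eqs. (4)-(5).

    The eight-tuple context of (4) holds six neighboring pixels plus
    the two gradient terms 2*In - Inn and 2*Iw - Iww. Each element is
    compared against the GAP prediction `pred` and quantized to one
    bit per eq. (5); the eight bits are packed into an integer index.
    """
    ctx = (In, Iw, Inw, Ine, Inn, Iww, 2 * In - Inn, 2 * Iw - Iww)
    beta = 0
    for i, x in enumerate(ctx):
        bit = 0 if x >= pred else 1   # quantization rule of eq. (5)
        beta |= bit << i              # packing order is our choice
    return beta
```

Together with the 2-bit δ, such an index selects one of the 576 prediction error models described above.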

D. Challenges of High-Throughput Operations

Although the CALIC scheme has laid a basic framework for the embedded compression engine, some of its features are adverse to the pursuit of high-throughput operations. First of all, the GAP prediction and the update of the prediction error models form a tightly coupled loop. This basically rules out any chance of concurrent computation and leads to a throughput bottleneck in the decoding process. Second, the AAC scheme, in spite of its moderate performance edge over alternative schemes, is much more complicated to implement; note that low computing complexity is essential to embedded compression. In addition, the conditional probability update required in the AAC decoder poses a similar throughput challenge. A simpler entropy coder can be adopted if the performance loss is mild.
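As a compact reference for the prediction path reviewed in this section, the GAP predictor of (2)–(3) can be sketched in Python (our own illustration, not the paper's hardware; integer floor division stands in for shift-based arithmetic, and the exact rounding convention is an assumption):

```python
def gap_predict(In, Iw, Ine, Inw, Inn, Iww, Inne):
    """Gradient-adjusted prediction (GAP), following eqs. (2)-(3)."""
    # Horizontal and vertical gradient estimates, eq. (2).
    dh = abs(Iw - Iww) + abs(In - Inw) + abs(In - Ine)
    dv = abs(Iw - Inw) + abs(In - Inn) + abs(Ine - Inne)
    d = dv - dh
    if d > 80:        # strong vertical activity: predict from the west pixel
        return Iw
    if d < -80:       # strong horizontal activity: predict from the north pixel
        return In
    # Base prediction, then blend toward west/north per the thresholds of (3).
    pred = (Iw + In) // 2 + (Ine - Inw) // 4
    if d > 32:
        pred = (pred + Iw) // 2
    elif d > 8:
        pred = (3 * pred + Iw) // 4
    elif d < -32:
        pred = (pred + In) // 2
    elif d < -8:
        pred = (3 * pred + In) // 4
    return pred
```

In a flat region all branches agree with the neighborhood value, while a large gradient imbalance snaps the prediction to the west or north neighbor, which is exactly the edge-adaptive behavior described above.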

III. LOW-COMPLEXITY AND HIGH-THROUGHPUT LOSSLESS CODEC DESIGN

Adapting from the CALIC scheme, the proposed scheme differs from CALIC in three aspects. First, in the continuous tone mode, an adaptive Golomb–Rice coder is used in lieu of an m-ary arithmetic coder. The Golomb–Rice coding scheme uses an implicit value k to define the bit length of each coded symbol. If the value k can closely match the size of the prediction residuals, the scheme's coding efficiency can be comparable to that of AAC, but the implementation complexity is much lower. Second, a bit-rate control extension based on prediction adjustment is supported. The original CALIC did not address the bit-rate control issue. A simple and effective measure, which changes neither the prediction nor the coding schemes, is developed for the proposed codec. Third, to mitigate the throughput bottleneck of the CALIC scheme, a two-stage prediction is adopted and the computations independent of the error modeling are removed from the recursive computing loop. This shortens the critical path delay to facilitate a higher throughput implementation. The compression efficiency is not affected, because the prediction result is identical to that obtained using a one-step calculation. In addition, to enhance the compression efficiency, a color space transform is employed. Fig. 3 shows the block diagrams of the proposed lossless video/image codec. The modules colored in gray are distinct from the original CALIC scheme. The recursive computing loops in the codec are also illustrated. In the following, the function of each module will be elaborated. The bit-rate control scheme based on a prediction adjustment technique will be discussed in Section IV.

Fig. 3. Block diagram of the proposed lossless codec. (a) Encoder. (b) Decoder.

A. Color Space Transform

Color space transform is a common practice used to decorrelate the information of different color components. An invertible color space transform, which converts the R, G, B signals to Y, Db, Dr, is employed. The transform equations are

    Y = ⌊(R + 2G + B)/4⌋,  Dr = R − G,  Db = B − G            (6)
    G = Y − ⌊(Db + Dr)/4⌋,  R = Dr + G,  B = Db + G.          (7)

Fig. 4 shows the color components of a 720 × 480 image from a skiing sequence before and after the transform. In each 3-D subgraph, the x and y axes indicate the coordinates of a pixel and the z axis shows the magnitude of the pixel value. Because of the high data correlation between the R and G components, the magnitudes of the transformed Dr component are much smaller than those of the Y component. To evaluate the effectiveness of the color space transform, the river sequence from the DISCOVERY channel is used as the test bench video. Entropy is adopted as the evaluation index so that the performance is independent of the choice of prediction and coding schemes. Fig. 5 shows the simulation results. The entropies of R, G, B, and Y are similar in magnitude. The entropies of Db and Dr are much smaller


in most frames. Preliminary coding results also indicate 2–4 bits per pixel of savings in these two components compared with the other components.

Fig. 4. Color space transform. (a) G component. (b) R component. (c) Y component. (d) Dr component.

Fig. 5. Entropies variation before and after the color space transform.

B. Two-Stage GAP

As shown in Fig. 3, for the encoder, an inner loop is formed within the error-modeling module. If the next pixel's prediction hits the same context model, it needs to wait for the completion of the model update. Such data dependence confines the maximum coding throughput rate. The main performance bottleneck, however, exists in the decoding process due to an even larger computing loop consisting of the GAP and error-modeling modules. A less critical computing loop is formed with the adaptive Golomb–Rice decoding and error-modeling modules. To mitigate the data dependence problem, two remedial measures are introduced. The first is shortening the computations within a loop by separating the nonrecursive part from the recursive part. This leads to a two-stage prediction. The second is alleviating the data dependence by computing relaxation, which will be addressed in Section III-C.

Recall that the GAP formula shown in (3) requires lengthy comparisons and updates. To reduce its computing latency, common terms are extracted and different weighted versions are generated in parallel via an adder and shifter network. In the decoding process, the pixel value of Iw is available last. Therefore, the computations requiring this pixel value should be separated from the computations involving only other pixels. The computation of the prediction scheme is thus divided into two stages as shown in Fig. 6. Stage 1, termed forward prediction, involves pixels obtained earlier than Iw. Stage 2, termed backward prediction, requires the immediate decoding result of Iw. Only stage two is part of the recursive loop. A quantizer converts the gradient difference dv − dh into a 3-bit code used as the selection control. Table I shows the selection logic and the computations performed in each stage. The weight (a fractional number) of each input is derived from (3).

Fig. 6. Two-stage GAP computing scheme.

TABLE I. Selection Logic and Computations of the GAP Module

C. Error Modeling and Prediction Error-Feedback Scheme

The prediction derived from the GAP scheme is further compensated by a value (error feedback) to correct the bias of past predictions. The value is obtained through a table lookup using the context index C = (β, δ) to address one of the 576 prediction error models kept in a lookup table (LUT). The LUT is then updated by logging the compensated prediction error. Each model keeps track of its prediction history, and the average of the past prediction errors is regarded as a prediction error bias used to refine the GAP value. In the original CALIC scheme, the moving average of the latest N prediction errors is kept. To save memory, a simple first-order recursive update scheme is adopted instead. The prediction error bias ẽ(t) is


calculated as

    ẽ(t) = ẽ(t − 1) · (1 − λ) + λ · e                         (8)

where λ is a forgetting factor, set to 2^−n to eliminate the need for multiplication. The implicit k value used in Golomb–Rice coding is likewise updated and stored in the memory. After compensating the prediction value with the prediction error bias, the resultant prediction error becomes

    e = I − Î − ẽ(t − 1)                                      (9)

where I and Î are the current pixel value and its GAP value, respectively. Experiments were conducted to determine the parameter n and the word length of ẽ(t). The evaluation is based on the final CR, and the results are shown in Fig. 7. The parameter n and the word length are chosen as 5 and 13 bits, respectively. Because the prediction error models are updated per pixel prediction, if two successive pixels are related to the same model, the coding of the second pixel must be deferred until the update due to the first pixel is completed. The throughput rate thus suffers. Computing relaxation is a technique that uses the delayed rather than the most recently updated LUT entry in the computations. Because the prediction error bias ẽ(t) and the k value both vary slowly, such relaxation can hardly impact the prediction accuracy.

Fig. 7. Simulations on the update scheme of prediction error modeling. (a) Word length of ẽ(t). (b) Parameter n.

D. Adaptive Golomb–Rice Coding

Since Golomb–Rice coding works for positive values only, a remapping process as shown in the following is applied first:

    ē = 2·e       if e ≥ 0
        −2·e + 1  if e < 0.                                   (10)

A Golomb–Rice code word consists of a prefix p and a remainder r. A k value is selected first. Given a prediction residual ē, p and r can be calculated as

    p = ⌊ē / 2^k⌋,  r = mod(ē, 2^k).                          (11)

The prefix consists of p 0s ended with a 1. If ē is less than 2^k, the prefix is a single bit 1. The remainder r contains k bits. The length of the code word is thus p + k + 1 bits. In adaptive Golomb–Rice coding, the k value is not a constant. The coding efficiency depends on the selection of the k value. To achieve this, each prediction error model also tracks the k values (k̃(t)) of its past predictions and uses the average as the implicit k value for coding. Because compensated prediction errors are usually small, the k value is confined between 1 and 3. Table II shows the experimental results on the compression ratio using fixed and adaptive k values. The adaptive approach improves the CR significantly.

TABLE II. Compression Ratio Versus k Value

Fig. 8. Rate control FSM for binary mode.

IV. BIT-RATE CONTROL OPTION FOR NEAR LOSSLESS COMPRESSION

In bit-rate control, we focus on curbing the excessive bit rate occasionally introduced by certain frames with more complicated textures. This is to ensure near (or visually) lossless video quality. In addition, the implementation complexity should be minimized. The RDO schemes adopted in H.264 or HEVC are thus considered inappropriate in this application. A threshold-driven bit-rate control scheme based on the concept of prediction adjustment is proposed. The adjustment aims at reducing the number of bits needed to code a symbol and is performed dynamically, subject to a threshold on either the PSNR value or the CR. The major benefit of this approach is that neither the prediction nor the coding schemes need to be changed.

A. Prediction Adjustment in Binary Mode

Recall the ternary entropy coding defined in (1). If the current pixel value matches one of the two possible pixel values in the context window, at most two bits are required in coding. If we adjust the prediction value deliberately to get a match, this averts the codec from entering the continuous tone mode, and the bit saving is significant. The incurred distortion due to prediction adjustment, however, should be justified by evaluating the PSNR value. If the resultant PSNR value can still meet the near lossless requirement (50 dB or above), the adjustment is accepted. Fig. 8 shows the finite state machine (FSM) of the rate-control scheme for binary-mode coding. Eb,max is a distortion limit. Referring to (1), two pixel value differences, δP1 = |I − x1| and δP2 = |I − x2|, are calculated. If the distortion of setting the current pixel value I to x1 does not exceed Eb,max, the pixel is coded as symbol 0. If δP1 > Eb,max and δP2 ≤ Eb,max, the pixel is


coded as symbol 1. No prediction adjustment is performed
otherwise. The distortion limit E_b,max is updated dynamically
subject to the bit-rate budget and the current PSNR value. The
PSNR value is calculated as

PSNR = 10 log10(Imax² / MSE)  (12)

where MSE is the mean square of the prediction errors. The
MSE value is reset when a new frame starts. Fig. 9 shows the
experimental coding results versus different E_b,max values
using an image from the test bench video sequence skiing. The
horizontal axis indicates the value of E_b,max and the vertical
axis indicates the number of each symbol's appearances in
binary-mode coding. With the relaxation of E_b,max, more
pixels can be coded as symbol 0, and the total number of
pixels coded in the binary mode increases as well.

Fig. 9. Experimental results of binary-mode prediction adjustment.

B. Prediction Adjustment in Continuous Tone Mode

Because the binary mode accounts for only a small
percentage in prediction, the prediction adjustment is equally
applied to the continuous tone mode, where GAP residuals
are coded via Golomb–Rice coding. In Golomb–Rice coding,
the prediction residual is represented as e′ = p · 2^k + r and
requires p + 1 + k bits in coding. Because the value k is derived
adaptively from the error modeling, it is not suitable for the
adjustment approach. The values p and r, however, depend on the
magnitude of the prediction error. To reduce the number of bits
required for e′, an adjusted prediction error e″, which lowers
the prefix p by 1, is used. If we use the largest value that can be
represented with a prefix equal to p − 1, the distortion can be
minimized. That is

e″ = (p − 1) · 2^k + (2^k − 1)   if e′ is odd
e″ = (p − 1) · 2^k + (2^k − 2)   if e′ is even.  (13)

Note that the parity of the prediction error (in Golomb–Rice
coding) should be preserved, as it indicates the polarity of the
original prediction error. The distortion (e′ − e″) introduced by
this adjustment is thus

δẽ = ⌊r/2⌋ · 2 + 2.  (14)

After undoing the remapping, the actual distortion is only one
half of the coded value δẽ, i.e., δP = ⌊r/2⌋ + 1. Similarly, if the
prefix length is reduced by m bits, the incurred distortion is

δẽ = 2^k · (m − 1) + ⌊r/2⌋ · 2 + 2.  (15)

The actual distortion is

δP = 2^(k−1) · (m − 1) + ⌊r/2⌋ + 1.  (16)

The prediction adjustment can be controlled by setting a
distortion limit E_g,max. The maximum number of bits saved,
m, subject to the constraint δP ≤ E_g,max can be calculated as

m_max = min(p, ⌊(E_g,max − ⌊r/2⌋ − 1) / 2^(k−1)⌋ + 1).  (17)

Fig. 10. Pixels that might be affected by the prediction adjustment.

C. Code Length Overhead Checking Mechanism

Note that the two aforementioned prediction adjustment
measures are not mutually independent. As shown in Fig. 10,
the prediction adjustment in the continuous tone mode may
affect the binary-mode decision of other pixels in the context
window. In other words, a pixel with a binary-mode context
could be coded in the continuous tone mode if any prediction
adjustment is applied to its context. Because the binary-mode
coding is the most efficient one, the bit-rate penalty of losing
the chance of binary-mode coding is large. Similarly, the
prediction adjustment in the binary-mode coding can equally
affect the continuous tone mode coding. The impact, however,
is small. In most cases, only the remainder value r will be
changed, which does not affect the code length. We will thus
focus on preventing the prediction adjustments performed in
the continuous tone mode that negatively impact the binary-mode
coding. The remedy is a checking mechanism that discards
a prediction adjustment if its code length saving is
smaller than the estimated code length overhead in the binary
mode. As shown in Fig. 10, the code length overhead in the
binary mode (ϕ) is estimated as

ϕ = Σ_{i=1..5} ωi · bi · T  (18)

where bi is binary (0 or 1) and indicates whether the
i-th forward pixel can be coded in the binary mode. ωi (= 1
for i = 1, 2, and 2/3 for i = 3–5) is the weighting factor.
For forward pixels 1 and 2, the context window information is
fully available and the binary-mode checking result is definite.
For forward pixels 3, 4, and 5, only four out of the six pixel
values in their context windows are available. The weighting
factor is thus set as two thirds. Pixel 6 is excluded from the
estimation because only one pixel information in its context

HWANG et al.: LOW-COMPLEXITY EMBEDDED COMPRESSION CODEC DESIGN 681

TABLE III
EXPERIMENTAL RESULTS OF PREDICTION ADJUSTMENT SCHEMES

window is available. T is a constant that corresponds to the average
code length difference between the continuous tone mode and
the binary mode. According to the simulation results, T is
set as 1.75. Table III shows how the code length overhead
checking mechanism can effectively improve the PSNR value.
The simulations are based on a subset of the skiing sequence
and only the luminance (Y) results are given. RC1 stands for
a rate control scheme applying the two prediction adjustment
measures independently. RC2 indicates a rate control scheme
applying the checking mechanism. It can be seen that both
the bits per pixel and the PSNR values are improved if the
checking mechanism is applied. The statistics of the three
coding symbols in the binary mode are also listed. The bit-rate
control schemes do increase the total number of pixels coded
in the binary mode. The RC2 scheme has 14% more pixels
coded as symbol 0 than the RC1 scheme. In the meantime, the
number of pixels coded as symbol 2 under the RC2 scheme is
2% lower than that under the RC1 scheme. The effectiveness
of the checking mechanism is verified. It is also observed that
the number of pixels coded as symbol 2 increases in both rate
control schemes when compared with the number obtained
with no rate control.

Fig. 11. Frame-based bit-rate control algorithm.

TABLE IV
CHARACTERISTICS OF TEST BENCH VIDEO SEQUENCES
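Since (18) is just a weighted count scaled by the constant T, it can be sketched in a few lines. The helper below is a hypothetical illustration, not the authors' hardware implementation:

```python
T = 1.75  # average code length gap between continuous tone and binary modes

def binary_mode_overhead(b):
    """Estimate phi per (18): b = (b1, ..., b5), where bi = 1 if forward
    pixel i can still be coded in binary mode; weights are 1 for pixels
    1-2 (full context known) and 2/3 for pixels 3-5 (partial context)."""
    weights = (1.0, 1.0, 2.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0)
    return sum(w * bi * T for w, bi in zip(weights, b))

# All five forward pixels binary-codable: (1 + 1 + 3 * 2/3) * 1.75 = 7
print(binary_mode_overhead((1, 1, 1, 1, 1)))
```

A continuous-tone-mode adjustment is then discarded whenever its code length saving falls below this estimate.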

D. Bit-Rate Control Scheme


The proposed bit-rate control scheme provides two control
options. Option 1 maximizes the PSNR value subject to a
CR constraint. Option 2 maximizes CR while observing a
PSNR threshold. The flow chart of the scheme is shown in
Fig. 11. The two distortion limits, E_g,max and E_b,max, are reset
to 0 at the beginning of a new frame. The control index for
option 1 is the mean square error (MSE) rather than the PSNR
value. This simplifies the computation, and a direct mapping
between PSNR and MSE values exists. The control index
for option 2 is the CR. To simplify the complexity, both
indexes are updated per line. The current MSE (CMSE in
option 1), or the current compression ratio (CCR in option 2),
is first compared with the target value (TMSE and TCR for
option 1 and option 2, respectively). The initial CMSE and
CCR values used for the comparison in the first line are the
results from the previous frame. If the current control index
value is inferior to the target one, both distortion limits are
increased by one and all the pixel coding in the current line
is subject to prediction adjustments. Otherwise, the two
distortion limits are decremented by one and no prediction
adjustments are performed in this line. The CCR and the
CMSE values are updated at the end of the line, and the
statistics will be reset for a new frame.

V. EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION

A. Compression Efficiency

The performance evaluation in this section consists of
three experiments: the lossless compression efficiency, the
algorithm complexity, and the effectiveness of the bit-rate
control scheme. For lossless compression efficiency, three
existing lossless compression schemes (FELICS, JPEG-LS,
and CALIC) plus two state-of-the-art video codecs (H.264/AVC
FRExt and H.265/HEVC RExt) are included. The H.264
and H.265 codecs are configured to perform intra-mode coding
only, and the quantization process is disabled. The test bench
includes two sets of video sequences as described in Table IV.
There are five sequences in set 1. The first three, skiing
(140 frames), mountain (160 frames), and surf (300 frames),
are videos in a raw RGB format and contain dynamic scenes.
These three video sequences represent undistorted contents with
high frequency details preserved. The other two sequences
are decoded from encoded sources. The source encoder is


TABLE V
LOSSLESS VIDEO COMPRESSION RESULTS IN CR

MPEG-2 MP@HL with a HD resolution (1920 × 1080) at
30 frames/s and using 4:2:0 sampling. The original bit rate
is 80 Mb/s and each sequence has 300 frames. This set of
video sequences features the major video contents available in
digital media; that is, either some form of lossy compression
was performed in producing the full HD content or the
frame resolution is less than full HD. The six sequences
in set 2 are from the common test conditions (CTCs) [29].
These sequences contain raw data directly captured from the
camera sensor and preserve all the high frequency details,
including camera noise. They are designed to thoroughly
test the compression performance of new codecs such as
HEVC. Note that all sequences use 4:4:4 color sampling and
the bit depth is 8. The CR (the upper entry) and the software
execution time (the lower entry) of each scheme are shown
in Table V. All codecs perform better on set #1. The CALIC
scheme has the best compression performance with an average
CR equal to 4.85. This suggests that, in lossless compression,
the pixel-based prediction used in CALIC is more
efficient than the blockwise prediction used in HEVC/H.264.
HEVC and H.264 are ranked in the second and the third
places, respectively. HEVC, although equipped with many new
intra-coding techniques [30], shows only mild advantages over
H.264 when the quantization procedure is switched off. The
compression performance of the proposed scheme is inferior
to H.264/HEVC but outperforms JPEG-LS (3.34 versus 2.76
in CR). This can be attributed to three factors: 1) the adopted
GAP prediction is more accurate than the MED prediction
used in JPEG-LS; 2) the number of error models used in the
proposed scheme is larger than that in JPEG-LS; and 3) the
proposed scheme supports more flexible smooth region coding.
The FELICS scheme, bearing the lowest computing complexity
among all, falls behind with no surprise. In summary,
for set #1 sequences, the lossless compression ratio of the
proposed scheme is, on average, 21% and 46% better than
the JPEG-LS and FELICS schemes, respectively.

For set #2 sequences, the CALIC scheme still takes the lead.
It performs particularly well on the CrowdRun sequence, which
shows the most complex scene. H.264 and HEVC perform
comparably in this test set. The performance gap between
the proposed scheme and the CALIC scheme is widened. The
performance edge of the proposed scheme against JPEG-LS
also becomes marginal. The performance discrepancy of the
proposed scheme in the two video sets can be explained as
follows. First, because the content complexity in set #2 is much
higher, the effectiveness of the entropy coder dominates the
compression efficiency. Theoretically, joint entropy coding of
a vector always performs better than marginal entropy coding
of the individual components of the vector. The adaptive
arithmetic coder used in CALIC belongs to the former case,
whereas the adaptive Golomb–Rice coding belongs to the latter
case. Second, in the face of extremely complex textures, the
prediction accuracy advantage of the GAP scheme against the
MED scheme is undermined. The FELICS scheme remains in
the last place in this test set. The ranking of these schemes is
consistent across both video sets. Although the compression efficiency


of the proposed scheme degrades in set #2, it can be argued


that these video sequences are primarily for HD source codec
development, and are not the typical video contents transmitted
over wireless networks. For set #1 alone, the proposed scheme
is the most competitive one with balanced performance and
complexity.
The execution time of each compression scheme is
measured on a PC platform equipped with a 2.5 GHz,
Intel core 2 Quad CPU processor working at 667 MHz. It can
be regarded as an indicator of the algorithm complexity but
not as a precise measurement because of the difference in code
optimization efforts. The execution times of H.264 and HEVC
are much longer (about two orders of magnitude larger) than
others. The CALIC scheme is ranked in the third place.
Its execution time is about two times that of the proposed
scheme. The extra time overhead comes mainly from its
adaptive arithmetic coder. The average execution time of
JPEG-LS is roughly one half that of the proposed scheme. This
is due to the complexity associated with enhanced prediction
and coding techniques in our scheme. The FELICS scheme
shows the smallest execution time at the expense of a much
inferior compression performance. Only CALIC, JPEG-LS,
and the proposed schemes are feasible solutions to embedded
compressions. Consider a full HD video signal (1920 × 1080
at 30 frames/s) transmitted over an 802.11n compliant 4 × 4
WiFi network, the minimum CR requirement is 2.488 (the raw
data rate, 1920 × 1080 × 30 × 24 = 1493 Mb/s, divided by
the maximum network data bandwidth, 600 Mb/s). In practical
applications, this number should be much larger to cope
with the variations in video complexity and communication
traffic. The compression ratio of JPEG-LS (average 2.76 for
set #1 video sequences) is apparently insufficient. Because
the communication bandwidth is a hard limit, a moderate
increase in computing complexity for improved compression
efficiency is justified. The proposed scheme is thus considered
a better candidate (in terms of implementation complexity
and compression efficiency) than CALIC and JPEG-LS for
embedded compression.

Fig. 12. Curves of PSNR versus CR under rate control. (a) Option 1 with a CR threshold. (b) Option 2 with a PSNR threshold.

TABLE VI
ACCURACY OF THE PROPOSED BIT-RATE CONTROL SCHEME
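The link budget quoted above is straightforward to reproduce (the 600 Mb/s figure is the nominal 4 × 4 802.11n maximum cited in the text):

```python
# Raw full HD RGB data rate versus the nominal 802.11n (4x4) bandwidth
raw_bps = 1920 * 1080 * 30 * 24   # width x height x frame rate x bits/pixel
link_bps = 600e6                  # maximum network data bandwidth, 600 Mb/s
min_cr = raw_bps / link_bps       # minimum embedded compression ratio
print(round(raw_bps / 1e6), round(min_cr, 3))  # prints: 1493 2.488
```

As the text notes, a practical target CR must sit well above this floor to absorb fluctuations in scene complexity and channel throughput.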

B. Accuracy of Bit-Rate Control Scheme


For the performance evaluation of the proposed bit-rate
control scheme, the curves of PSNR versus CR under two
control options are shown in Fig. 12. Three video sequences
from set #1 and one video sequence from set #2 are used as the
benchmarks. For the curves in part (a), seven CR values are set
as the threshold to optimize the PSNR values. For the curves
in part (b), seven PSNR values ranging from ∞ to 50 dB are set
as the targets. The two families of curves are similar. The
PSNR value drops rapidly in low CR regions. When compared
with the lossless compression, the near lossless compression
option can offer an additional 6%–20% bit-rate reduction while
keeping the PSNR value 50 dB or higher. The bit-rate control
accuracy of the proposed scheme is high. In option 1, the
difference between the target and the obtained value is almost
zero, because the scheme has direct control over the CR.
In option 2, the PSNR control is achieved indirectly through
the MSE estimation. From the result shown in Table VI, the
deviation is less than 0.5 dB.

The proposed bit-rate control scheme is also compared with
two other near lossless bit-rate control schemes presented
in [20] and [21]. Both schemes resort to prediction residual
quantization to achieve the bit-rate control. Only CR-based
control is supported. This corresponds to option 1 in our
case. For the purpose of comparison, they are implemented
under the framework of the proposed codec to replace the
prediction adjustment module. The obtained CRs are compared
with those obtained by the proposed scheme. Table VII
summarizes the comparison results. Three video sequences
are simulated. For each sequence, two target CRs are set.
Although all three schemes meet the targets in all settings,
the proposed scheme shows the most precise control. The
deviation is less than 0.1%. The deviations in schemes [20] and


Fig. 13. Block diagram of the proposed video compression codec.

TABLE VII
PERFORMANCE COMPARISON OF DIFFERENT NEAR LOSSLESS RATE CONTROL SCHEMES

scheme [21] are greater than 1%. The difference seems to be
subtle; however, in near lossless compression, a slight deviation
(increase) in CR causes a significant drop in PSNR value.
In Table VII, the PSNR values obtained by each scheme
under a target CR are also listed. It can be seen that the proposed
scheme is, on average, 4.6 and 7.9 dB better than the other
two control schemes in PSNR values. In terms of the bit-rate
control scheme complexity, it consumes roughly 10% of
the total execution time in the proposed scheme. This number
is slightly higher than those of schemes [20] and [21]. The
extra complexity comes from the aforementioned code length
overhead checking mechanism. In hardware implementation,
this logic performs in parallel with the coding process and
does not affect the throughput rate.

VI. HARDWARE ARCHITECTURE AND CHIP IMPLEMENTATION

To demonstrate the hardware implementation efficiency of
the proposed compression scheme, a high-throughput codec
design targeting the transmission of full HD video signals over
the wireless network is developed.

A. System Architecture

The design features a highly pipelined architecture (up to
five stages in the encoder design) and employs two computing
kernels working in parallel to enhance the throughput.
Design optimization techniques include register retiming and
computing relaxation. The design can operate up to 200 MHz
and process 2 pixels (per color component) concurrently in
every clock cycle. This leads to an equivalent throughput of
133.3M pixels/s. Fig. 13 shows the block diagram of the
proposed codec design. Besides the encoder and the decoder,
a bus interface compliant with the Advanced Microcontroller
Bus Architecture (AMBA) 2.0 bus protocol is also included
so that the design can be readily attached to an Advanced
RISC Machine (ARM) processor and serve as an embedded
acceleration engine.

B. Encoder Design

The encoder design is five-stage pipelined, and the boundaries
of the pipeline stages are optimized using a register
retiming technique, which adjusts the register locations to
shorten the critical path delay. In stage 1, the color space
transform is performed and the context window register is
updated using the newly arrived pixel values from the line
buffer. In stage 2, either the binary-mode coding or the forward
(nonrecursive) part of the GAP is performed. As elaborated
in Section III-B, moving the nonrecursive part of the GAP
computation to a separate pipeline stage can shorten the
critical path delay formed by the GAP and the error-feedback
modules. In stage 3, the backward part of the GAP and the
context formation/quantization are calculated. The prediction
adjustment of the bit-rate control, the error compensation, and
the LUT entry update are performed concurrently in stage 4.
Finally, either the ternary entropy coding or the adaptive
Golomb–Rice coding is carried out in the last stage. One pixel


can be processed per clock cycle by this computing pipeline.


To support concurrent table lookup and entry update in one
clock cycle, a dual port memory is employed. In addition, to
double the processing throughput, each video frame is divided
horizontally to two subframes and coded independently by
two copies of the computing pipeline. Because prediction
accuracy of boundary pixels is inferior to that of interior pixels,
the parallel subframe approach does impact the compression
efficiency. However, for 720 × 480 sized video sequences, the
loss in CR is roughly 0.4% only. For 1920 × 1080 sized video
sequences, the number is less than 0.2%. The loss is thus
negligible.
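The subframe partitioning described above amounts to cutting each frame into top and bottom halves. A minimal sketch, with the helper name assumed and Python lists standing in for pixel rows:

```python
def split_subframes(frame):
    """Divide a frame (a list of pixel rows) horizontally into two
    subframes that the two computing pipelines code independently."""
    half = len(frame) // 2
    return frame[:half], frame[half:]

# A full HD frame splits into two 540-line subframes.
top, bottom = split_subframes([[0] * 1920 for _ in range(1080)])
print(len(top), len(bottom))  # prints: 540 540
```

Because each half keeps its own context modeling, only the pixels along the new boundary lose prediction accuracy, which is why the CR penalty shrinks as the frame height grows.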
Fig. 14. Chip design of the proposed compression scheme.

TABLE VIII
COMPARISONS WITH EXISTING CHIP DESIGNS FOR EMBEDDED COMPRESSION

C. Decoder Design

The decoder design contains four pipeline stages. In stage 1,
the received bit stream is parsed and decoded in either one of
the coding modes. A pointer is used to indicate the starting
point of the undecoded bit stream in the input line buffer.
The size of the decoding window (16-bit wide in our design)
needs to be greater than the maximum bit-width of a symbol.
The forward GAP is performed in stage 2. Stage 3 is the
most critical stage in the pipeline. The backward GAP, the
context formation/quantization, and the LUT entry update
are performed in the same stage. This implies two memory
accesses (a read followed by a write) in one clock cycle, which
prolongs the clock cycle period significantly. To mitigate
the problem, the write operation is deliberately postponed to
the next pipeline stage. In case the new update is needed
in the Golomb–Rice decoding, a data forwarding scheme is
applied. It holds a copy of the new update in a register and
passes it directly to the Golomb–Rice decoder without going
through the prediction error LUT memory. The color space
transform is performed in the last pipeline stage. Similar to
the encoder design, two decoding pipelines work in parallel
and 2 pixels can be decoded per clock cycle. Note that the
bit rate control scheme affects the encoder design only. The
memory hierarchy used in this design consists of three levels.
Shift register is considered at level 1 of the hierarchy and
provides concurrent accesses to multiple pixel data within the
context window. It is updated every clock cycle. Level 2 of the
memory hierarchy is a line buffer holding the working set data
on-chip. The working set indicates the three adjacent lines that
cover the entire context window. We adopt an incremental line
buffer update scheme to fetch data from off-chip memory. This
reduces the size of the line buffer from three lines to two lines
plus a small segment equal in size to the length of a burst-mode
DRAM access. Level 3 of the memory hierarchy is an off-chip
memory, which serves as the video frame buffer.

D. Chip Implementation

The proposed design is implemented in a Taiwan Semiconductor
Manufacturing Company (TSMC) 90-nm process
technology, and the chip layout of the design is shown in
Fig. 14. The color space transform module is removed from
the design in the real chip implementation. This is mainly due
to the memory usage concern. If the color transform is performed
on-chip, all three signal components, Y, Db, and Dr, need to be
stored on-chip as well. This would triple the size of the on-chip
RAM. The design has a die size of 1.61 mm² and 99 IO pads.
The pin count is high, as it supports the AMBA bus interface.
The maximum working frequency is 200 MHz. This suggests
a full HD video processing rate of 64 frames/s. The logical
gate count is 123.5 k, which includes the contribution from
the AMBA bus interface. The logic gate count for the codec
alone is 87.2 k. The total memory is 6.7 kB, which also includes
the frame buffers associated with the AMBA bus interface.
Roughly 2.5 kB is dedicated to the interface. The power
consumption of the core logic is 73.6 mW when operating
at 200 MHz. The codec itself consumes 47.2 mW.

E. Design Comparisons

The comparisons with existing designs are compiled into
Table VIII. Design [25] is an embedded compression engine
adopting discrete wavelet transform and SPIHT coding.
Design [23] implements a JPEG-LS codec. Design [22] realizes
a simple FELICS compression scheme and emphasizes


its low complexity and high-throughput features. Design [28]
is an enhancement of design [22] using a context table-free
modeling (AGPM) in prediction and supporting the rate
control option. Design [27] uses a hierarchical prediction
scheme (hierarchical average and copy prediction, HACP)
plus a significant bit truncation coding scheme. The lossless
compression kernel of the proposed design is a simplified
CALIC scheme with bit-rate control extensions. Comparison
items shown in Table VIII include the prediction and the
coding schemes, the implementation process technology, the
maximum working frequency, the throughput, the logical gate
count, the size of on-chip memory, and the CR. Although a
composite performance index (including power, gate count,
throughput, etc.) can be used for quantitative comparisons, it
covers only the merits of architecture design and overlooks
qualitative factors such as the compression efficiency and the
ability of precise bit-rate control. Therefore, we will examine
these designs on a per-index basis. In terms of the compression
efficiency, because the test benches used in different designs
are not the same, the numbers are provided for reference
purposes only. The CRs of the designs in [25], [27], and [28]
are the numbers reported in each paper. Those of the designs
in [22] and [23] are based on simulations using our test
video sequences as shown in Table IV.

The proposed design is the only one supporting near lossless
compression with bit-rate control. The bit-rate control mechanism
adopted for the design in [28] is actually performed
in a source encoder (H.264) by adjusting the quantization
parameter. The embedded compression scheme itself has no
control over the bit rate. In regard to the throughput, the
design in [27] exhibits the highest throughput. The design
in [28] is ranked in the second place, trailed by the proposed
design in the third place. For the designs in [27] and [28],
this is in fact a tradeoff between the speed and the prediction
accuracy. The hierarchical prediction scheme adopted for the
design in [27] facilitates parallel prediction of multiple pixels,
but the prediction reference can be up to 4 pixels apart. It
can be verified that the CRs of the designs in [27] and [28] are
significantly lower than those of the other designs in comparison.
All other schemes use context windows in prediction. The
inherent data dependence prevents them from parallel predictions.
In terms of the circuit complexity, the design in [22] has the
lowest logical gate count due to its algorithm simplicity. Two
major factors contribute to the logical gate count of the proposed
design. First, to enhance the throughput, two compression
kernels are employed. Second, the bit-rate control logic
complicates the design as well. The on-chip memory size
is largely affected by the specs of the target video format. That
explains the somewhat higher memory usage of the proposed
design targeting a full HD video format.

Regarding the power consumption index, the design in [25]
consumes the lowest power of all the designs. Its primary goal
is reducing the off-chip communication bandwidth while minimizing
the power consumption overhead. The compression
efficiency is thus less emphasized. The proposed design, on
the other hand, focuses more on the compression efficiency,
throughput, and bit-rate control for wireless networks. The line
buffer and the bit-rate control logic also contribute a significant
portion of the power consumption. Apparently, none of these
designs is considered a well-rounded solution. The proposed
one, leading in compression efficiency and throughput and
supporting bit-rate control, is considered most suitable for
wireless multimedia networks.

VII. CONCLUSION

In this paper, a low-complexity lossless/near lossless video
codec design is presented. It aims at serving as an embedded
compression engine to reduce the data transmission bandwidth
of high-resolution video signals over wireless networks. The
design features a favorable tradeoff between complexity and
efficiency. The complexity is two orders of magnitude lower
than full-fledged video codecs and is only about one half that
of the CALIC scheme. For the video test set consisting of
previously encoded full HD and non-full-HD sequences, the
compression efficiency of the proposed scheme is significantly
better than other low-complexity competing techniques such
as JPEG-LS and FELICS. These video sequences represent
typical video contents transmitted over wireless networks.
For video test sequences that contain raw data directly captured
from the camera sensor, for example, the CTC sequences, the
compression efficiency of the proposed scheme degrades. The
performance of the proposed codec varies with the texture
complexity of the video. The performance stands out when the
HD video comes from a moderately encoded source. The
performance is compromised (due to a simplified entropy coder
to reduce the hardware complexity) if the HD video is raw and
not coded. On the other hand, if the HD video originates from
a heavily encoded source, the performance edge against other
simpler schemes such as the JPEG-LS scheme is reduced. This
is because the efficiency of the MED prediction used in
JPEG-LS is close to that of the GAP prediction used in the
proposed scheme when heavy quantization leads to smoother
textures.

An efficient bit-rate control scheme is also developed in this
paper. A prediction adjustment approach is devised and acts
like a postprocessing step to the prediction module. The two
bit-rate control options can either optimize the CR subject to
a PSNR constraint or vice versa. The near lossless compression
options can offer an additional 6%–18% bit-rate reduction while
keeping the PSNR value 50 dB or higher. The compressed
video quality of the proposed scheme is also 4–8 dB better than
the rival schemes can achieve. The entire codec is efficiently
implemented in a high-throughput chip design using a
TSMC 90-nm process technology. The design can support a
throughput of 133.3M pixels/s, which corresponds to a 64 full
HD frames/s processing rate and meets the target application
well.

ACKNOWLEDGMENT

The authors would like to thank W. H. Peng and
C. C. Chen at National Chiao Tung University and
Y. M. Huang at National Chung Hsing University, Taiwan, for
their help in conducting the simulations of video compression.
The authors would also like to thank the National Chip
Implementation Center (CIC), Taiwan, for their technical
support in chip design and fabrication.


Yin-Tsung Hwang (M'93) received the B.S. and M.S. degrees in electronic engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1983 and 1985, respectively, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI, USA, in 1993.

He became a faculty member of the Department of Electronic Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan. In 2004, he joined the Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan, where he is currently a Professor. His research interests include very-large-scale integration (VLSI) designs for wireless baseband processing, video/image signal processing, and low-power VLSI circuit designs.

Ming-Wei Lyu received the B.S. degree from the Department of Electronic Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan, in 2009, and the M.S. degree from the Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan, in 2011.

He has been with Andes Technology, Hsinchu, Taiwan, since his graduation, where he is currently an Engineer involved in IP verification, FPGA prototyping, and embedded processor platform design.

Cheng-Chen Lin was born in Taiwan in 1981. He received the B.S. and M.S. degrees from the Department of Electronic Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan, in 2003 and 2005, respectively, and the Ph.D. degree from the Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan, in 2011.

He is a Senior Engineer with Largan Precision Company, Ltd., Taichung, where he is in charge of the modulation transfer function development for optical systems. His research interests include video/image coding, very-large-scale integration designs for video compression, and optical lens system design.
