Unit 6: Multimedia Data Compression
6.1 Lossless Compression Algorithms: Introduction
As a start, suppose we want to encode the call numbers of the 120 million or so items in the
Library of Congress (a mere 20 million, if we consider just books). Why don’t we just transmit
each item as a 27-bit number, giving each item a unique binary code (since 2^27 > 120,000,000)?
The main problem is that this “great idea” requires too many bits. In fact, there exist
many coding techniques that will effectively reduce the total number of bits needed to represent
the above information. The process involved is generally referred to as compression.
We had a first look at compression schemes aimed at audio. There, we had to first consider
the complexity of transforming analog signals to digital ones, whereas here we shall assume
that we at least start with digital signals. For example, even though we know an image is captured
using analog signals, the file produced by a digital camera is indeed digital. The more general
problem of coding (compressing) a set of any symbols, not just byte values, say, has been studied
for a long time.
Getting back to our Library of Congress problem, it is well known that certain parts of call
numbers appear more frequently than others, so it would be more economical to assign fewer bits
as their codes. This is known as variable-length coding (VLC)—the more frequently appearing
symbols are coded with fewer bits per symbol, and vice versa. As a result, fewer bits are usually
needed to represent the whole collection.
In this chapter we study the basics of information theory and several popular lossless
compression techniques. Figure 7.1 depicts a general data compression scheme, in which
compression is performed by an encoder and decompression is performed by a decoder.
We call the output of the encoder codes or codewords. The intermediate medium could either be
data storage or a communication/computer network. If the compression and decompression
processes induce no information loss, the compression scheme is lossless; otherwise, it is lossy.
The next several chapters deal with lossy compression algorithms as they are commonly used for
image, video, and audio compression. Here, we concentrate on lossless compression.
If the total number of bits required to represent the data before compression is B0 and the total
number of bits required to represent the data after compression is B1, then we define the
compression ratio as
compression ratio = B0 / B1. (7.1)
In general, we would desire any codec (encoder/decoder scheme) to have a compression ratio
much larger than 1.0. The higher the compression ratio, the better the lossless compression
scheme, as long as it is computationally feasible.
6.2 Run-Length Coding
Run-length coding (RLC) is one of the simplest forms of data compression. The basic idea is that if the information
source we wish to compress has the property that symbols tend to form continuous groups,
then instead of coding each symbol in the group individually, we can code one such symbol and the
length of the group. As an example, consider a bilevel image (one with only 1-bit black and
white pixels) with monotone regions, like a fax image.
This information source can be efficiently coded using run-length coding. In fact, since there are
only two symbols, we do not even need to code any symbol at the start of each run. Instead, we
can assume that the starting run is always of a particular color (either black or white) and simply
code the length of each run.
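To make this concrete, the following is a minimal sketch of one-dimensional run-length coding for a single row of a bilevel image. It assumes, as described above, that every row starts with a white run (of length zero if the row actually begins with black); the function names are illustrative, not from any standard.

```python
def rle_encode_row(row):
    """Encode one row of a bilevel image (a list of 0/1 pixels) as run lengths.
    By convention the first run is white (0); if the row starts with black,
    the first run length is simply 0."""
    runs = []
    current_color, run_length = 0, 0   # assumed starting color: white
    for pixel in row:
        if pixel == current_color:
            run_length += 1
        else:
            runs.append(run_length)    # close the current run
            current_color = 1 - current_color
            run_length = 1
    runs.append(run_length)            # close the final run
    return runs


def rle_decode_row(runs):
    """Rebuild the row of pixels from its run lengths (white run first)."""
    row, color = [], 0
    for length in runs:
        row.extend([color] * length)
        color = 1 - color
    return row


# Example: a mostly white row with one black run of length 3.
row = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
assert rle_encode_row(row) == [4, 3, 3]
assert rle_decode_row([4, 3, 3]) == row
```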
The above description is the one-dimensional run-length coding algorithm. A two-dimensional
variant of it is generally used to code bilevel images. This algorithm uses the coded run
information in the previous row of the image to code the run in the current row. A full
description of this algorithm can be found in the literature.
6.3.1 Shannon–Fano Algorithm
The Shannon–Fano algorithm sorts the symbols according to their frequency counts and then recursively divides them into two parts, each with approximately the same total count, appending a 0 to the codewords of one part and a 1 to the other, until each part contains only one symbol. The table summarizes the result for the word HELLO, showing each symbol, its frequency count, information content log2(1/pi), resulting codeword, and the number of bits needed to encode it. The total number of bits used is shown at the bottom.
It should be pointed out that the outcome of the Shannon–Fano algorithm is not necessarily
unique. For instance, at the first division in the above example, it would be equally valid to
divide into the two parts L, H:(3) and E, O:(2). This would result in the coding in Fig.
The table shows that the codewords are different now. Also, these two sets of codewords may behave
differently when errors are present. Coincidentally, the total number of bits required to encode
the word HELLO remains at 10. The Shannon–Fano algorithm delivers satisfactory coding
results for data compression, but it was soon outperformed and overtaken by the Huffman coding
method.
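As a hedged illustration of the top-down Shannon–Fano procedure discussed above, the sketch below splits the sorted symbol list so that the two parts have total counts as nearly equal as possible; this splitting rule is one common choice (which is why the outcome is not unique), and the function name is our own.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs, sorted by count in descending order.
    Returns a dict mapping each symbol to its codeword (a string of '0'/'1')."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # Choose the split point that makes the two parts' total counts most nearly equal.
    total = sum(c for _, c in symbols)
    running, best_i, best_diff = 0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_diff, best_i = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code        # first part: append 0
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code        # second part: append 1
    return codes


freqs = [("L", 2), ("H", 1), ("E", 1), ("O", 1)]        # the word HELLO
codes = shannon_fano(freqs)
print(codes, sum(c * len(codes[s]) for s, c in freqs))  # 10 bits in total
```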
6.3.2 Huffman Coding
First presented by Huffman in a 1952 paper, this method attracted an overwhelming amount of
research and has been adopted in many important and/or commercial applications, such as fax
machines, JPEG, and MPEG.
In contradistinction to Shannon–Fano, which is top-down, the encoding steps of the Huffman
algorithm are described in the following bottom-up manner. Let us use the same example word,
HELLO. A similar binary coding tree will be used as above, in which the left branches are coded
0 and right branches 1. A simple list data structure is also used.
1. Initialization: put all symbols on a list sorted according to their frequency counts.
2. Repeat until the list has only one symbol left:
(a) From the list, pick the two symbols with the lowest frequency counts. Form a Huffman subtree that has these two symbols as child nodes and create a parent node for them.
(b) Assign the sum of the children's frequency counts to the parent and insert it into the list, such that the order is maintained.
(c) Delete the children from the list.
3. Assign a codeword for each leaf based on the path from the root.
In Fig., new symbols P1, P2, P3 are created to refer to the parent nodes in the Huffman coding
tree. The contents in the list are illustrated below:
After initialization: L H E O
After iteration (a): L P1 H
After iteration (b): L P2
After iteration (c): P3
For this simple example, the Huffman algorithm apparently generated the same coding result as
one of the Shannon–Fano results shown in Fig., although the results are usually better. The
average number of bits used to code each character is also 2, i.e., (1 + 1 + 2 + 3 + 3)/5 = 2.
As another simple example, consider a text string containing a set of characters and their
frequency counts as follows: A:(15), B:(7), C:(6), D:(6) and E:(5). It is easy to show that the
Shannon–Fano algorithm needs a total of 89 bits to encode this string, whereas the Huffman
algorithm needs only 87.
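As a cross-check on the 87-bit figure, here is a minimal Huffman coder based on Python's heapq module; it only computes codeword lengths, which is all that is needed for the bit count, and the tie-breaking counter is an arbitrary implementation detail.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """freqs: dict symbol -> frequency count. Returns dict symbol -> codeword length."""
    tiebreak = count()
    # Each heap entry: (total frequency, tiebreaker, {symbol: depth within this subtree}).
    heap = [(f, next(tiebreak), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)            # two lowest-frequency nodes
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}   # every leaf one level deeper
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, total_bits)    # 87 bits, versus 89 bits for Shannon-Fano
```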
As shown above, if correct probabilities (“prior statistics”) are available and accurate, the
Huffman coding method produces good compression results. Decoding for the Huffman coding
is trivial as long as the statistics and/or coding tree are sent before the data to be compressed (in
the file header, say).
Unique prefix property: no Huffman code is a prefix of any other Huffman code. For instance, the code 0 assigned to L is not a prefix of the code 10
for H or 110 for E or 111 for O; nor is the code 10 for H a prefix of the code 110 for E or 111 for
O. It turns out that the unique prefix property is guaranteed by the above Huffman algorithm,
since it always places all input symbols at the leaf nodes of the Huffman tree.
The Huffman code is one of the prefix codes for which the unique prefix property holds. The
code generated by the Shannon–Fano algorithm is another such example. This property is
essential and also makes for an efficient decoder, since it precludes any ambiguity in decoding.
In the above example, if a bit 0 is received, the decoder can immediately produce a symbol L
without waiting for any more bits to be transmitted.
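To see why the prefix property makes decoding immediate, here is a hedged sketch of a bit-by-bit decoder for any prefix code given as a symbol-to-codeword table; the table used is the HELLO code from the example above.

```python
def prefix_decode(bits, code_table):
    """Decode a string of '0'/'1' characters using a prefix code.
    Because no codeword is a prefix of another, a symbol can be emitted
    as soon as the accumulated bits match some codeword."""
    inverse = {code: sym for sym, code in code_table.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # unambiguous, thanks to the prefix property
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(decoded)


hello_code = {"L": "0", "H": "10", "E": "110", "O": "111"}
encoded = "".join(hello_code[c] for c in "HELLO")
print(prefix_decode(encoded, hello_code))   # HELLO
```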
6.4 Arithmetic Coding
Various modern versions of arithmetic coding have been developed for newer multimedia
standards: for example, Fast Binary Arithmetic Coding in JBIG, JBIG2, and JPEG-2000, and
Context-Adaptive Binary Arithmetic Coding (CABAC) in H.264 and H.265.
Normally (in its non-extended mode), Huffman coding assigns each symbol a codeword that has
an integral bit length. As stated earlier, log2(1/pi) indicates the amount of information contained
in the information source si , which corresponds to the number of bits needed to represent it. For
example, when a particular symbol si has a large probability (close to 1.0), log2 (1/pi ) will be
close to 0, and even assigning only one bit to represent that symbol will be very costly if we have
to transmit that one bit many times.
Although it is possible to group symbols into metasymbols for codeword assignment (as in
extended Huffman coding) to overcome the limitation of integral number of bits per symbol, the
increase in the resultant symbol table required by the Huffman encoder and decoder would be
formidable.
Arithmetic coding can treat the whole message as one unit and achieve a fractional number of bits
for each input symbol. In practice, the input data is usually broken up into chunks to avoid error
propagation. In our presentation below, we will start with a simplistic approach and include a
terminator symbol. Then we will introduce some improved methods for practical
implementations.
It is apparent that, finally, we have
range = PC × PA × PE × PE × P$ = 0.2 × 0.2 × 0.3 × 0.3 × 0.1 = 0.00036.
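The interval-narrowing that produces this range can be sketched as follows. The full probability table is not reproduced in the extract above; the cumulative intervals below are an assumption consistent with PC = PA = 0.2, PE = 0.3, P$ = 0.1, and they do reproduce the quoted values low = 0.33184 and high = 0.3322 for the sequence CAEE$.

```python
# symbol -> (cumulative low, cumulative high); an assumed table consistent
# with the probabilities quoted in the text above.
INTERVALS = {
    "A": (0.00, 0.20), "B": (0.20, 0.30), "C": (0.30, 0.50),
    "D": (0.50, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.90),
    "$": (0.90, 1.00),
}


def arithmetic_encode_interval(message):
    """Return (low, high) after narrowing [0, 1) once per symbol."""
    low, high = 0.0, 1.0
    for symbol in message:
        p_low, p_high = INTERVALS[symbol]
        span = high - low
        high = low + span * p_high     # shrink the interval to this symbol's slice
        low = low + span * p_low
    return low, high


low, high = arithmetic_encode_interval("CAEE$")
print(low, high, high - low)           # approximately 0.33184, 0.3322, 0.00036
```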
The final step in encoding calls for generation of a number that falls within the range [low, high).
This number is referred to as a tag, i.e., a unique identifier for the interval that represents the
sequence of symbols. Although it is trivial to pick such a number in decimal, such as 0.33184,
0.33185, or 0.332 in the above example, it is less obvious how to do it with a binary fractional
number. The following algorithm will ensure that the shortest binary codeword is found if low
and high are the two ends of the range and low < high.
For the above example, low = 0.33184, high = 0.3322. If we assign 1 to the first binary fraction
bit, it would be 0.1 in binary, and its decimal value(code) = value(0.1) = 0.5 > high. Hence, we
assign 0 to the first bit. Since value(0.0) = 0 < low, the while loop continues. Assigning 1 to the
second bit makes a binary code 0.01 and value(0.01)=0.25, which is less than high, so it is
accepted. Since it is still true that value(0.01) < low, the iteration continues. Eventually, the
binary codeword generated is 0.01010101, which is 2^-2 + 2^-4 + 2^-6 + 2^-8 = 0.33203125. It can
be proven [2] that k = ⌈log2(1/∏i Pi)⌉ is the upper bound on the codeword length; namely, in the worst case, the shortest
codeword in arithmetic coding will require k bits to encode a sequence of symbols.
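The bit-by-bit tag generation just described can be sketched as follows; it is a direct transcription of the rule "tentatively set the next bit to 1, reset it to 0 if the value would reach high, and stop once the value is at least low", not an optimized implementation.

```python
def binary_fraction_tag(low, high):
    """Generate the bits of the shortest binary fraction lying in [low, high)."""
    assert 0.0 <= low < high <= 1.0
    bits, value = [], 0.0
    while value < low:
        k = len(bits) + 1
        candidate = value + 2.0 ** (-k)   # try a 1 in the k-th fraction bit
        if candidate < high:              # still below high: keep the 1
            bits.append(1)
            value = candidate
        else:                             # a 1 here would overshoot: keep the bit 0
            bits.append(0)
    return bits, value


bits, value = binary_fraction_tag(0.33184, 0.3322)
print(bits, value)   # [0, 1, 0, 1, 0, 1, 0, 1] and 0.33203125, as in the text
```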
Many coding schemes can be used for the binarization; we will introduce one of them, the Exp-Golomb code. Fast Binary Arithmetic Coding (the Q-coder and MQ-coder) was developed for
multimedia standards such as JBIG, JBIG2, and JPEG-2000. The more advanced version,
Context-Adaptive Binary Arithmetic Coding (CABAC), is used in H.264 (the M-coder) and H.265.
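For reference, the zeroth-order Exp-Golomb code used for binarization (e.g., in H.264) is easy to sketch: write n + 1 in binary and prefix it with one fewer zeros than it has bits. Signed mappings and higher-order variants are omitted here.

```python
def exp_golomb_encode(n):
    """Zeroth-order Exp-Golomb codeword for an unsigned integer n >= 0."""
    assert n >= 0
    binary = bin(n + 1)[2:]                      # binary representation of n + 1
    return "0" * (len(binary) - 1) + binary      # prefix of (length - 1) zeros


for n in range(6):
    print(n, exp_golomb_encode(n))
# 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101, 5 -> 00110
```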
As in the Adaptive Huffman Coding, as long as the encoder and decoder are synchronized (i.e.,
using the same update rules), the adaptive process will work flawlessly. Nevertheless, Adaptive
Arithmetic Coding has a major advantage over Adaptive Huffman Coding: there is now no need
to keep a (potentially) large and dynamic symbol table and constantly update the Adaptive
Huffman tree. Below we outline the procedures for Adaptive Arithmetic Coding, and give an
example to illustrate how it also works for Binary Arithmetic Coding.
6.5 Lossless Image Compression
One of the most commonly used compression techniques in multimedia data compression is
differential coding. The basis of data reduction in differential coding is the redundancy in
consecutive symbols in a data stream. Recall that we considered lossless differential coding earlier,
when we examined how audio can be dealt with via subtraction from predicted values. Audio is
a signal indexed by one-dimensional time. Here we consider how to apply the lessons learned
from audio to the context of digital image signals that are indexed by two spatial dimensions (x, y).
Because of the continuity of the physical world, the gray-level intensities (or color) of
background and foreground objects in images tend to change relatively slowly across the image
frame. Since we were dealing with signals in the time domain for audio, practitioners generally
refer to images as signals in the spatial domain.
The generally slowly changing nature of imagery spatially produces a high likelihood that
neighboring pixels will have similar intensity values. Given an original image
I(x, y), using a simple difference operator we can define a difference image d(x, y)
as follows:
d(x, y) = I(x, y) − I(x − 1, y).
This is a simple approximation of a partial differential operator ∂/∂x applied to an image defined
in terms of integer values of x and y. Another approach is to use the discrete version of the 2D
Laplacian operator to define a difference image d(x, y) as
d(x, y) = 4 I(x, y) − I(x, y − 1) − I(x − 1, y) − I(x + 1, y) − I(x, y + 1).
In both cases, the difference image will have a histogram as in Fig. d, derived from the d(x, y)
partial derivative image in Fig. b for the original image I in Fig. a. Notice that the histogram for
the unsubtracted I itself is much broader, as in Fig. c. It can be shown that image I has a larger
entropy than image d, since it has a more even distribution in its intensity values. Consequently,
Huffman coding or another variable-length coding scheme will produce shorter bit-length
codewords for the difference image. Compression will work better on a difference image.
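To see the effect numerically, the sketch below compares the first-order entropy of a smooth test image with that of its horizontal difference image; numpy is assumed, and the synthetic gradient image is only a stand-in for a real photograph.

```python
import numpy as np


def entropy(values):
    """First-order entropy (bits per symbol) of an array of integer values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def difference_image(img):
    """d(x, y) = I(x, y) - I(x - 1, y); the first column keeps its own value."""
    d = img.astype(np.int16).copy()
    d[:, 1:] = img[:, 1:].astype(np.int16) - img[:, :-1].astype(np.int16)
    return d


# A smooth synthetic 8-bit image (a gentle gradient plus mild noise) as a stand-in.
rng = np.random.default_rng(0)
ramp = np.linspace(0, 200, 256)
img = (ramp[None, :] + ramp[:, None] / 4
       + rng.normal(0, 2, (256, 256))).clip(0, 255).astype(np.uint8)

print("entropy of I:", entropy(img))                      # broad histogram, higher entropy
print("entropy of d:", entropy(difference_image(img)))    # peaked near 0, lower entropy
```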
Lossless JPEG is invoked when the user selects a 100% quality factor in an image tool.
Essentially, lossless JPEG is included in the JPEG compression standard simply for
completeness.
The following predictive method is applied on the unprocessed original image (or each color
band of the original color image). It essentially involves two steps: forming a differential
prediction and encoding.
1. A predictor combines the values of up to three neighboring pixels as the predicted value for
the current pixel, indicated by X in Fig. The predictor can use any one of the seven schemes
listed in Table 7.6. If predictor P1 is used, the neighboring intensity value A will be adopted as
the predicted intensity of the current pixel; if predictor P4 is used, the current pixel value is
derived from the three neighboring pixels as A + B − C; and so on.
2. The encoder compares the prediction with the actual pixel value at position X and encodes the
difference using one of the lossless compression techniques we have discussed, such as the
Huffman coding scheme. Since prediction must be based on previously encoded neighbors, the
very first pixel in the image I (0, 0) will have to simply use its own value. The pixels in the first
row always use predictor P1, and those in the first column always use P2. Lossless JPEG usually
yields a relatively low compression ratio, which renders it impractical for most multimedia
applications. An empirical comparison using some 20 images indicates that the compression
ratio for lossless JPEG with any one of the seven predictors ranges from 1.0 to 3.0, with an
average of around 2.0. Predictors 4–7, which consider neighboring pixels in both the horizontal and
vertical dimensions, offer slightly better compression (approximately 0.2–0.5 higher) than
predictors 1–3.
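The seven predictors listed in Table 7.6 are simple functions of the three neighbors A (left), B (above), and C (above-left); these are the standard lossless JPEG predictors, and the sketch below states them directly, using truncating integer division as a simplification.

```python
def lossless_jpeg_predict(A, B, C, predictor):
    """Standard lossless JPEG predictors P1..P7 for the current pixel X,
    given its neighbors A (left), B (above), and C (above-left).
    Truncating integer division is used here for simplicity."""
    if predictor == 1:
        return A                      # P1
    if predictor == 2:
        return B                      # P2
    if predictor == 3:
        return C                      # P3
    if predictor == 4:
        return A + B - C              # P4
    if predictor == 5:
        return A + (B - C) // 2       # P5
    if predictor == 6:
        return B + (A - C) // 2       # P6
    if predictor == 7:
        return (A + B) // 2           # P7
    raise ValueError("predictor must be in 1..7")


# The encoder entropy-codes the residual X - prediction, e.g. with Huffman coding.
A, B, C, X = 100, 104, 98, 103
for p in range(1, 8):
    print(p, X - lossless_jpeg_predict(A, B, C, p))
```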
Table shows a comparison of the compression ratio for several lossless compression techniques
using test images Lena, football, F-18, and flowers. These standard images used for many
purposes in imaging work are shown on the textbook website for this chapter.
This chapter has been devoted to the discussion of lossless compression algorithms. It should be
apparent that their compression ratio is generally limited (with a maximum at about 2–3).
However, many of the multimedia applications we will address in the next several chapters
require a much higher compression ratio. This is accomplished by lossy compression schemes.
6.6 Lossy Compression Algorithms
In this chapter, we consider lossy compression methods. Since information loss implies some
tradeoff between error and bitrate, we first consider measures of distortion, e.g., squared error.
Different quantizers are introduced, each of which has a different distortion behavior.
A discussion of transform coding leads into an introduction to the Discrete Cosine Transform
used in JPEG compression and the Karhunen–Loève transform. Another transform scheme,
wavelet-based coding, is then set out.
The mathematical foundation for the development of many lossy data compression algorithms is
the study of stochastic processes. The compression ratio for image data using lossless
compression techniques (e.g., Huffman Coding, Arithmetic Coding, LZW) is low when the
image histogram is relatively flat.
For image compression in multimedia applications, where a higher compression ratio is required,
lossy methods are usually adopted. In lossy compression, the compressed image is usually not
the same as the original image but is meant to form a close approximation to the original image
perceptually. To quantitatively describe how close the approximation is to the original data,
some form of distortion measure is required.
6.6.1 Distortion Measures
A distortion measure is a mathematical quantity that specifies how close an approximation is to
its original, using some distortion criteria. When looking at compressed data, it is natural to think
of the distortion in terms of the numerical difference between the original data and the
reconstructed data.
However, when the data to be compressed is an image, such a measure may not yield the
intended result. For example, if the reconstructed image is the same as original image except that
it is shifted to the right by one vertical scan line, an average human observer would have a hard
time distinguishing it from the original and would therefore conclude that the distortion is small.
However, when the calculation is carried out numerically, we find a large distortion, because of
the large changes in individual pixels of the reconstructed image. The problem is that we need
a measure of perceptual distortion, not a more naive numerical approach. However, the study of
perceptual distortions is beyond the scope of this book. Of the many numerical distortion
measures that have been defined, we present the three most commonly used in image
compression.
If we are interested in the average pixel difference, the mean square error (MSE) σ² is often
used. It is defined as
σ² = (1/N) Σn (xn − yn)²,
where xn, yn, and N are the input data sequence, reconstructed data sequence, and length of the
data sequence, respectively. If we are interested in the size of the error relative to the signal, we
can measure the signal-to-noise ratio (SNR) by taking the ratio of the average square of the
original data sequence to the mean square error (MSE). In decibel units (dB), it is defined as
SNR = 10 log10(σx² / σd²),
where σx² is the average square value of the original data sequence and σd² is the MSE.
Another commonly used measure of distortion is the peak signal-to-noise ratio (PSNR), which
measures the size of the error relative to the peak value of the signal xpeak. It is given by
PSNR = 10 log10(xpeak² / σd²).
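A hedged sketch of these three measures in code, assuming numpy and an 8-bit peak value of 255 by default:

```python
import numpy as np


def mse(x, y):
    """Mean square error between the original x and the reconstruction y."""
    x, y = np.asarray(x, dtype=np.float64), np.asarray(y, dtype=np.float64)
    return float(np.mean((x - y) ** 2))


def snr_db(x, y):
    """SNR in dB: 10 log10 of (average square of x) / MSE."""
    x = np.asarray(x, dtype=np.float64)
    return 10.0 * np.log10(np.mean(x ** 2) / mse(x, y))


def psnr_db(x, y, peak=255.0):
    """PSNR in dB: 10 log10 of peak^2 / MSE."""
    return 10.0 * np.log10(peak ** 2 / mse(x, y))


# Tiny example: an 8-bit "signal" and a slightly distorted reconstruction.
x = np.array([100, 120, 130, 90, 200], dtype=np.float64)
y = x + np.array([1, -2, 0, 1, -1], dtype=np.float64)
print(mse(x, y), snr_db(x, y), psnr_db(x, y))
```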
Intuitively, for a given source and a given distortion measure, if D is a tolerable amount of
distortion, R(D) specifies the lowest rate at which the source data can be encoded while keeping
the distortion bounded above by D. It is easy to see that when D = 0, we have a lossless
compression of the source. The rate-distortion function
is meant to describe a fundamental limit for the performance of a coding algorithm and so can be
used to evaluate the performance of different algorithms. Figure shows a typical rate-distortion
function. Notice that the minimum possible rate at D = 0, no loss, is the entropy of the source
data. The distortion corresponding to a rate R(D) ≡ 0 is the maximum amount of distortion
incurred when “nothing” is coded.
Finding a closed-form analytic description of the rate-distortion function for a given source is
difficult, if not impossible.
6.7 Video Compression Standards: H.261 and H.263
In this lesson, we are going to study the first two of these standards, i.e., H.261 and H.263. These
standards mostly use the same concepts as those followed in MPEG, so instead of repeating that
information, only the novelties and the special features of these standards will be covered.
The H.261 standard was designed for video conferencing over ISDN lines at bit-rates of p × 64 Kbits/sec, where p can take integer values from 1 to 30. For this reason, the standard is also referred to as the p × 64 Kbits/sec standard.
In addition to forming a basis for the MPEG-1 and MPEG-2 standards, the H.261 standard
offers two important features:
a) Maximum coding delay of 150 msec. It has been observed that delays exceeding 150 msec do
not provide direct visual feedback in bi-directional video conferencing.
b) Support for two picture formats, described below:
i) Common Intermediate Format (CIF), having 352 x 288 pixels for the luminance channel (Y)
and 176 x 144 pixels for each of the two chrominance channels U and V. Four temporal rates,
viz. 30, 15, 10, or 7.5 frames/sec, are supported. CIF images are used when p ≥ 6, that is, for video
conferencing applications.
ii) Quarter Common Intermediate Format (QCIF), having 176 x 144 pixels for Y and 88 x
72 pixels each for U and V. QCIF images are normally used for low bit-rate applications like
videophones (typically p = 1). The same four temporal rates are supported by QCIF images also.
The H.261 standard supports two types of frames:
• I-frames: These are coded independently of other frames, using intra-frame coding only.
• P-frames: These are coded using a previous frame as a reference for prediction.
The compressed H.261 bitstream is organized as a hierarchy of layers:
• Picture layer that includes a picture start code and picture header information (such as the temporal reference and picture type), followed by GOB data.
• GOB layer that includes a GOB start code, the group number, a group quantization value,
followed by macroblock (MB) data.
• MB layer that includes the macroblock address (MBA), macroblock type (MTYPE: intra/inter),
quantizer (MQUANT), motion vector data (MVD), and the coded block pattern (CBP), followed by
the encoded blocks.
• Block layer that includes the zig-zag scanned (run, level) pairs of coefficients, terminated by
an end-of-block (EOB) code.
It is possible that encoding of some GOBs may have to be skipped and the GOBs considered for
encoding must therefore have a group number, as indicated. A common quantization value may
be used for the entire GOB by specifying the group quantizer value. However, specifying the
MQUANT in the macroblock overrides the group quantization value.
We now explain the major elements of the hierarchical data structure.
Fig 26.3 Composition of a GOB. Each box corresponds to a macroblock and the number
corresponds to the macroblock number. Data for each GOB consists of a GOB header followed
by data for macroblocks. Each GOB header is transmitted once between picture start codes in the
CIF or QCIF sequence.
6.7.1.4 Macroblock layer:
As already shown in fig., each GOB consists of 33 macroblocks. Each macroblock relates to 16 x
16 pixels of Y and the corresponding 8 x 8 pixels of each of U and V, as shown in fig.
Since a block is defined as a spatial array of 8 x 8 pixels, each macroblock therefore consists of six
blocks: four from Y and one each from U and V. Each macroblock has a header that includes
the following information:
• Macroblock address (MBA) – A variable-length codeword that indicates the position of the macroblock within the GOB, coded relative to the previously transmitted macroblock.
• Macroblock type (MTYPE) – It is also a variable-length codeword that indicates the prediction
mode employed and which data elements are present. The H.261 standard supports the following
prediction modes:
o intra modes are adopted for those macroblocks whose content changes significantly between two successive frames.
o inter modes employ DCT of the inter-frame prediction error.
o inter + MC modes employ DCT of the motion-compensated prediction error.
o inter + MC + fil modes also employ filtering of the predicted macroblock.
• Motion vector data (MVD) – It is also a variable-length codeword (VLC) for the horizontal
component of the motion vector, followed by a variable-length codeword for the vertical
component. MVD is obtained from the macroblock's motion vector by subtracting the vector of the
preceding macroblock.
• Coded block pattern (CBP) – CBP gives a pattern number that signifies which of the blocks
within the macroblock have at least one significant transform coefficient. The pattern number
is given by
CBP = 32 P1 + 16 P2 + 8 P3 + 4 P4 + 2 P5 + P6,
where Pn = 1 if any coefficient is present for block n, else 0. The block numberings are as per
fig. 2.
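A small sketch of how an encoder might form the CBP value from per-block coded flags, following the weighting above (blocks 1–4 are the Y blocks, block 5 is U, block 6 is V); this is illustrative rather than taken from a reference encoder.

```python
def coded_block_pattern(coded_flags):
    """coded_flags: six booleans P1..P6, one per 8x8 block (four Y, then U, then V).
    Returns CBP = 32*P1 + 16*P2 + 8*P3 + 4*P4 + 2*P5 + P6."""
    assert len(coded_flags) == 6
    weights = [32, 16, 8, 4, 2, 1]
    return sum(w for w, flag in zip(weights, coded_flags) if flag)


# Example: only the first Y block and the U block carry nonzero coefficients.
print(coded_block_pattern([True, False, False, False, True, False]))  # 32 + 2 = 34
```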
Transform coefficients (TCOEFF) are always present for intra macroblocks. For inter-coded macroblocks,
transform coefficients may or may not be present within a block, and their status is given by the
CBP field in the macroblock layer. TCOEFF encodes the (RUN, LEVEL) combinations using
variable-length codes, where RUN indicates the run of zero coefficients in the zig-zag scanned block
DCT array.
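The sketch below shows how a block of quantized DCT coefficients can be zig-zag scanned and turned into (RUN, LEVEL) pairs terminated by an EOB marker; the zig-zag order generation and the plain 'EOB' string are illustrative simplifications (in H.261 the pairs and the EOB are then mapped to variable-length codes).

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of an n x n zig-zag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def run_level_pairs(block):
    """block: n x n list of lists of quantized DCT coefficients.
    Returns a list of (RUN, LEVEL) pairs plus a final 'EOB' marker."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        coeff = block[r][c]
        if coeff == 0:
            run += 1                   # count zeros preceding the next nonzero coefficient
        else:
            pairs.append((run, coeff))
            run = 0
    pairs.append("EOB")                # trailing zeros are implied by the EOB marker
    return pairs


# A sparse example block: DC value 12 and two nonzero AC coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, -3, 5
print(run_level_pairs(block))          # [(0, 12), (0, -3), (1, 5), 'EOB']
```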
The H.261 standard’s coding algorithm achieved several major performance breakthroughs, but
at the lower extreme of its bit-rate, i.e., at 64 Kbit/sec, serious blocking artifacts produced
annoying effects.
This was tackled by reducing the frame rate, for example from the usual rate of 30 frames/sec
down to 10 frames/sec, by considering only one out of every three frames and dropping the
remaining two. However, a reduced frame rate lowers the temporal resolution, which is not very
desirable for rapidly changing scenes. A reduced frame rate also causes high end-to-end delays,
which is likewise undesirable.
Hence, there was a need to design a coding standard that would provide better performance than
the H.261 standard at lower bit-rate. With this requirement evolved the H.263 standard, whose
targeted application was POTS video conferencing.
During the development of H.263, the target bit-rate was determined by the maximum bitrate
achievable at the general switched telephone network (GSTN), which was 28.8 Kbits/sec at that
time. At these bit-rates, it was necessary to keep the overhead information at a minimum, and
several other requirements were imposed during the H.263 standardization.
Based on all these requirements an efficient coding scheme was designed. Although it was
optimized for 28.8 Kbits/sec, even at higher bit rates up to 600 Kbits/sec, H.263 outperformed
the H.261 standard.
The CIF, 4CIF, and 16CIF picture formats are optional for encoders as well as decoders. It is
mandatory for decoders to support both the sub-QCIF and QCIF picture formats. However, for
encoders, only one of these two formats (sub-QCIF or QCIF) is mandatory. In all these formats,
Y, U, and V are sampled in 4:2:0 format.
In fig., the integer pixel positions are indicated by the “+” symbol. A half-pixel-wide grid
is formed in which one out of every four pixels in the grid coincides with an integer grid position. The
remaining three out of four pixels are generated through interpolation. The interpolated pixels,
indicated by the “O” symbol, lie at the integer as well as the half-pixel positions. Together, the
integer and the half-pixel positions create a picture that has twice the spatial resolution of
the original picture in both the horizontal and the vertical directions. In fig 26.5, the
interpolated pixels marked as 'a', 'b', 'c', and 'd' are given by
a = A, b = (A + B + 1)/2, c = (A + C + 1)/2, d = (A + B + C + D + 2)/4,
where A, B, C, and D are the pixel intensities at the integer pixel positions and the divisions are
integer divisions (truncation).
When the motion estimation is carried out on this higher-resolution interpolated image, a
motion vector of 1 unit in this resolution corresponds to 0.5 unit with respect to the original
resolution. This is the basic principle of half-pixel motion estimation.
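A minimal sketch of the half-pixel interpolation formulas quoted above; the function and argument layout (A with B to its right, C below it, and D diagonally below-right) are our own choices.

```python
def half_pixel_values(A, B, C, D):
    """Half-pixel interpolation around the integer pixel A, where B is its right
    neighbor, C the pixel below, and D the pixel diagonally below-right.
    Integer (truncating) division is used, as in the formulas above."""
    a = A                               # the integer position itself
    b = (A + B + 1) // 2                # horizontal half-pixel position
    c = (A + C + 1) // 2                # vertical half-pixel position
    d = (A + B + C + D + 2) // 4        # central half-pixel position
    return a, b, c, d


print(half_pixel_values(100, 104, 96, 90))   # (100, 102, 98, 98)
```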
The advanced prediction mode offers two main features: (a) it allows four motion vectors per
macroblock, one for each of its four 8 x 8 luminance blocks, instead of a single vector for the
entire macroblock, and (b) it uses overlapped block motion compensation (OBMC), which results in overall smoothing
of the image and removal of blocking artifacts. OBMC involves using the motion vectors of
neighboring blocks to reconstruct a block. Each pixel P(i, j) of an 8 x 8 luminance block is a weighted
sum of three prediction values, as shown below:
P(i, j) = ( p(i + v0, j + u0) · H0(i, j) + p(i + v1, j + u1) · H1(i, j) + p(i + v2, j + u2) · H2(i, j) + 4 ) / 8,
where (uk, vk) is the motion vector of the current block (k = 0), of the block either above or below (k
= 1), or of the block either to the left or right of the current block (k = 2). Here p(i, j) is the reference
(previous) frame, and {Hk(i, j): k = 0, 1, 2} are 8 x 8 weighting matrices defined by the standard:
H0 gives the current block's own vector the largest weight near the block centre, while H1 and H2
give the neighboring blocks' vectors more weight near the corresponding block edges.
In the advanced prediction mode, motion vectors are allowed to cross the picture boundaries, just
like the unrestricted motion vector mode.
The P-picture within the PB-frame is predicted from the previously decoded P-picture, and the B-picture
is bi-directionally predicted from the previous P-picture as well as the P-picture currently
being decoded. Information from the P-picture and the B-picture within the PB-frame is
interleaved at the macroblock level.