Data Compression Unit III
Arithmetic Coding
Like Huffman coding, this too is a Variable Length Coding (VLC) scheme
requiring a priori knowledge of the symbol probabilities. The basic principles of
arithmetic coding are as follows:
a) Unlike Huffman coding, which assigns variable-length codes to a fixed group of symbols (usually of length one), arithmetic coding assigns variable-length codes to a variable group of symbols.
b) The code word itself defines a real number within the half-open interval [0, 1); as more symbols are added, the interval is divided into smaller and smaller sub-intervals, based on the probabilities of the added symbols.
Step-1: Consider the range of real numbers in [0, 1). Subdivide this range into a number of sub-ranges equal to the total number of symbols in the source alphabet. Each sub-range spans a real value equal to the probability of the corresponding source symbol.
Step-2: Consider a source message and take its first symbol. Find the sub-range to which this source symbol belongs.
Step-3: Subdivide this sub-range into new sub-ranges in the same proportions as the symbol probabilities.
Step-4: Parse the next symbol in the given source message and determine the next-level sub-range to which it belongs.
Step-5: Repeat Step-3 and Step-4 until all the symbols in the source message are parsed. The message may be encoded using any real value in the last sub-range so formed. The final message symbol is reserved as a special end-of-message indicator.
Encoding a Sequence of Symbols using Arithmetic Coding
Consider a source alphabet with the following symbol frequencies and probabilities:

Symbol   Frequency   Probability   Sub-Interval
a        4           0.4           [0.0; 0.4)
b        2           0.2           [0.4; 0.6)
c        2           0.2           [0.6; 0.8)
d        1           0.1           [0.8; 0.9)
e        1           0.1           [0.9; 1.0)
Sum      10          1.0           [0.0; 1.0)
After encoding a first symbol 'e', the interval [0.9; 1.0) is subdivided in the same proportions:

Symbol   Frequency   Probability   Sub-Interval
a        4           0.4           [0.90; 0.94)
b        2           0.2           [0.94; 0.96)
c        2           0.2           [0.96; 0.98)
d        1           0.1           [0.98; 0.99)
e        1           0.1           [0.99; 1.00)
After encoding a second 'e', the interval [0.99; 1.00) is subdivided once more:

Symbol   Frequency   Probability   Sub-Interval
a        4           0.4           [0.990; 0.994)
b        2           0.2           [0.994; 0.996)
c        2           0.2           [0.996; 0.998)
d        1           0.1           [0.998; 0.999)
e        1           0.1           [0.999; 1.000)
Accordingly, any rational number in the range 0.999 to < 1.000 represents the string "eee".
Assignment of Intervals to Codes
The symbols are identified by any number that falls into the interval concerned. For example, all values greater than or equal to 0.0 and less than 0.4 will be interpreted as 'a'.
Symbol Sub-Interval
a [0.0; 0.4)
b [0.4; 0.6)
c [0.6; 0.8)
d [0.8; 0.9)
e [0.9; 1.0)
All intervals have a closed lower endpoint (indicated by '[') and an open upper endpoint (indicated by ')'). The lower endpoint always belongs to the interval, while the upper endpoint lies outside it; i.e., the value 1 can never be encoded in this context.
Sub-Intervals
After the first symbol has been encoded, any further symbol no longer relates to the basic interval [0; 1). The sub-interval identified by the first symbol has to be used for the next step. In the case of an 'a' the new interval would be [0; 0.4), for a 'b' it would be [0.4; 0.6).
Assuming a constant probability distribution and the first symbol 'a', the new sub-intervals would be:
(Figure: division of the interval [0; 0.4) after encoding 'a'.)
On the basis of the decimal number system, the character strings "aa" and "ab" could be encoded with one decimal position, "ac" and "ad" with two positions.
The code word for a string such as "aaaa", whose interval is [0.0000; 0.0256), could be any number greater than or equal to 0.0000 and less than 0.0256. The best choice would be a number requiring as few significant digits as possible, in this case 0.
Interval Scheme for "abcd"
Using the same parameter set, "abcd" will be encoded according to the following scheme:

a    -> [0.0; 0.4)
ab   -> [0.16; 0.24)
abc  -> [0.208; 0.224)
abcd -> [0.2208; 0.2224)
The string "abcd" is represented by an abritary number within the interval
[0,2208; 0,2224). In a binary system 10 digits or bit are sufficient to encode
such a number:
0.0011,1000,11 (binary)
0.221,679,687,5 (decimal)
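As a quick check of this code word: 0.0011100011 in binary equals 1/8 + 1/16 + 1/32 + 1/512 + 1/1024 = 0.2216796875, which indeed lies inside [0.2208; 0.2224), so these ten bits identify the string "abcd" unambiguously.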
Decoding an Arithmetic-Coded Message
Step-1: Identify the message bit stream. Convert it to a real decimal number and determine its position within the sub-intervals identified at the beginning of the encoding process. The corresponding symbol is the first one in the message.
Step-2: Consider the expanded sub-interval of the previously decoded symbol and map the real number within it to determine the next sub-interval and obtain the next decoded symbol. Repeat this step until the end-of-message indicator is parsed.
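As an illustration, these decoding steps can be sketched in Python as follows. For simplicity the sketch assumes the number of symbols is known instead of using an end-of-message symbol, and it reuses the probability table of the example above:

def arithmetic_decode(value, intervals, n_symbols):
    """Decode n_symbols from a real number in [0, 1) by repeatedly locating
    the sub-interval containing the value and expanding it back to [0, 1)."""
    message = []
    for _ in range(n_symbols):
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:
                message.append(sym)
                value = (value - s_low) / (s_high - s_low)  # expand the chosen sub-interval
                break
    return "".join(message)

intervals = {"a": (0.0, 0.4), "b": (0.4, 0.6), "c": (0.6, 0.8),
             "d": (0.8, 0.9), "e": (0.9, 1.0)}

print(arithmetic_decode(0.2216796875, intervals, 4))  # -> "abcd"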
Disadvantages of Arithmetic Coding
The high computational requirements were the most important disadvantage, especially on older computer systems. Additionally, the patent situation had a crucial influence on decisions about implementing arithmetic coding.
JBIG
JBIG is short for the 'Joint Bi-level Image Experts Group'. This is a group of
experts nominated by national standards bodies and major companies to work to
produce standards for bi-level image coding. The 'joint' refers to its status as a
committee working on both ISO/IEC and ITU-T standards. The official title of the committee is ISO/IEC JTC1 SC29 Working Group 1, which is responsible for both the JPEG and JBIG standards.
JBIG has developed ISO/IEC 11544 (ITU-T T.82) for the lossless compression of a
bi-level image. It can also be used for coding greyscale and colour images with
limited numbers of bits per pixel. It can be regarded as a form of facsimile
encoding, similar to Group 3 or Group 4 fax, offering between 20 and 80%
improvement in compression over these methods (about 20 to one over the
original uncompressed digital bit map).
JBIG2 is the successor (ISO/IEC 14492 | ITU-T T.88) format for bi-level
(black/white) image compression that offers significant advantages over other
compression formats:
- large increases in compression performance (typically 3-5 times smaller than Group 4/MMR, 2-4 times smaller than JBIG1)
- special compression methods for text, halftones, and other binary image content
- lossy and lossless compression
- multi-page document compression
- flexible format, designed for easy embedding in other image file formats, such as TIFF
- high-performance decompression: using some coding modes, images can be decompressed at over 250 million pixels/second in software
JBIG Basics
Bi-level images contain only two colors and are stored using a single bit per
pixel. Black-and-white images, such as the pages of a book, are the most
common type of bi-level images. However, any two colors may be represented
by the 1 (foreground color) or 0 (background color) state of a single bit.
Typical bi-level compression algorithms encode only a single scan line at a time
using a run-length encoding technique. Such algorithms are referred to as 1D
encoding methods. 2D encoding methods encode runs of pixels by describing the
differences between the pixel values in the current and the previous scan lines.
JBIG encodes redundant image data by comparing a pixel in a scan line with a
set of pixels already scanned by the encoder. These additional pixels are called
a template, and they form a simple map of the pattern of pixels that surround
the pixel that is being encoded. The values of these pixels are used to identify
redundant patterns in the image data. These patterns are then compressed
using an adaptive arithmetic compression coder.
The adaptive nature of templates allows the color of the pixel values being
encoded to be predicted with a high degree of success. For gray-scale images
with halftoning, compression ratios are increased by as much as 80 percent over
non-adaptive methods.
Although designed primarily as a method for compressing bi-level image data,
JBIG is capable of compressing color or gray-scale images with a depth of up to
255 bits per pixel. Such multi-bit pixel images are compressed by bitplane rather
than by pixel. For example, an 8-bit image compressed using JBIG would be
encoded into eight separate bitplanes.
This type of encoding may be used as an alternative to lossless JPEG. JBIG has
been found to produce better compression results than lossless JPEG (using the
Q-coder) on images with two to five bits per pixel and to produce identical
results on image data with pixels six to eight bits in depth.
It is recommended that each bitplane be preprocessed with a gray-coding
algorithm to normalize the changes between adjacent byte values in the image
data. This process increases the efficiency of the JBIG encoder.
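As a small illustration of why Gray coding helps: adjacent pixel values such as 127 and 128 differ in every binary bitplane, but in only one Gray-coded bitplane, so the bitplanes handed to the JBIG coder change far less often. A minimal sketch (the 8-bit depth is just an assumption for the example):

def gray_code_planes(pixel, depth=8):
    """Convert a pixel value to its binary-reflected Gray code and split it
    into 'depth' bitplanes (most significant plane first)."""
    gray = pixel ^ (pixel >> 1)                      # binary-reflected Gray code
    return [(gray >> b) & 1 for b in range(depth - 1, -1, -1)]

print(gray_code_planes(127))  # [0, 1, 0, 0, 0, 0, 0, 0]
print(gray_code_planes(128))  # [1, 1, 0, 0, 0, 0, 0, 0]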
JBIG images may be encoded sequentially or progressively. Sequentially
encoded images are stored in a single layer at full resolution and without other
lower resolution images being stored in the same data stream. This sequential
JBIG image is equivalent in function and application to a G4 image. Such an
image is decoded in a single pass and has at least as good a compression ratio
as G4.
Progressively encoded images start with the highest resolution image and end
with the lowest. The high-resolution image is stored in a separate layer and is
then used to produce a lower resolution image, also stored in its own layer. Each
layer after the first layer is called a resolution doubling. An image with three
layers is said to have two doublings.
There is no imposed limit to the number of doublings that may be encoded. For
example, a 1200-dpi image may be encoded as one layer (1200 dpi), three
layers (1200, 600, and 300 dpi), or five layers (1200, 600, 300, 150, and 75 dpi).
The lowest resolution is determined by whatever is considered useful. Even a 10-
dpi image, though not legible, is still useful as an icon.
Progressive decoding is the opposite process, with the lowest resolution image
being decoded first, followed by increased resolutions of the image until the full
resolution is achieved. This technique has the advantage of allowing data to
appear immediately on the output device. Only data up to the appropriate
resolution of the output device need be decoded and sent.
Both sequential and progressive JBIG encoding are completely compatible.
Images compressed using sequential encoding are readable by progressive JBIG
decoders. Sequential JBIG decoders are only capable of reading the first, lowest-
resolution layer within a progressively-encoded JBIG image.
Many applications that utilize JBIG may only have use for sequential encoding
and decoding, especially those used for facsimile transmission. It is therefore
possible to implement a simplified JBIG algorithm that encodes only the first
layer in a JBIG data stream. Such encoders produce a valid JBIG-encoded data
stream that is readable by all JBIG decoders.
Progressive encoding does not add much more data to a JBIG data stream than
does sequential encoding, but it does have greater memory requirements.
Because a lower resolution image is encoded from data of the next higher
resolution image (and vice versa when decoding), a frame buffer must be used
to store image data that is being used as a reference.
JBIG2
Ideally, a JBIG2 encoder will segment the input page into regions of text, regions
of halftone images, and regions of other data. Regions that are neither text nor
halftones are typically compressed using a context-dependent arithmetic coding algorithm called the MQ coder. Textual regions are compressed
as follows: the foreground pixels in the regions are grouped into symbols. A
dictionary of symbols is then created and encoded, typically also using context-
dependent arithmetic coding, and the regions are encoded by describing which
symbols appear where. Typically, a symbol will correspond to a character of text,
but this is not required by the compression method. For lossy compression the
difference between similar symbols (e.g., slightly different impressions of the
same letter) can be neglected; for lossless compression, this difference is taken
into account by compressing one similar symbol using another as a template.
Halftone images may be compressed by reconstructing the gray scale image
used to generate the halftone and then sending this image together with a
dictionary of halftone patterns.
Introduced as an ITU standard in 1993, JBIG (also called JBIG1) never achieved
the acceptance that TIFF G4 enjoyed even though it provided a 20-30% reduction
in file size over TIFF G4. Such a reduction rate never generated sufficient
enthusiasm among the digital imaging community to justify broad- based
industry support. Consequently, JBIG was mostly used for bitonal image
compression on a very limited range of (mostly Japanese) MFP devices and
digital copiers.
In contrast, the digital media industry has readily received the JBIG2 standard.
Almost from the time of its introduction, JBIG2 was supported for bitonal
compression in the JPEG 2000 Part 6 specifications, and as a compression filter in
Adobe PDF. It quickly became the format of choice for a number of document-heavy organizations, including legal, media, financial, scanning and banking firms.
One advantage held by both JBIG and JBIG2 over TIFF G3 and G4 is the JBIG
formats' ability to use arithmetic coding instead of Huffman coding. Again, the
key difference is the higher compression ratio arithmetic coding can bring to the
JBIG standard. Arithmetic coding allows for data to be represented by a fraction
of a bit. In comparison, Huffman coding requires whole bits to represent runs in
the image, resulting in a lower compression ratio for the TIFF formats.
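To see what "a fraction of a bit" means in practice (a worked illustration): a symbol occurring with probability 0.9 ideally needs only -log2(0.9) ≈ 0.15 bits, and an arithmetic coder can approach this, whereas any Huffman-style code must assign it a code word of at least one whole bit.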
Progressive transmission is also commonly used to decrease the waiting time before the image starts to appear after transmission, and is used for image transmission on the World Wide Web.
Prediction by Partial Matching (PPM)
PPM was invented by John Cleary and Ian Witten in 1984.
The basic algorithm initially attempts to use the largest context. The size of the largest context is predetermined.
Suppose we are encoding a sequence X1, X2, ..., Xn and we are trying to estimate the probability of the next symbol, say X_{j+k}, given the previous k symbols. If some symbols have never occurred before in this context of k symbols, then we have the zero-frequency problem mentioned in the last lecture: we do not know how to estimate the probabilities of these symbols. The idea of the PPM method is to avoid the zero-frequency problem by switching to a lower-order context. The encoder tells the decoder that the symbol X_{j+k} that comes next has never occurred in the present context, and that they should (both) switch to the context of the previous k − 1 symbols, instead of the k-symbol context. Here are some details:
- Rather than using one Markov model of order k, use several Markov models, namely models of order 0, 1, ..., kmax, where kmax is the largest order considered. (In fact, an order −1 model is used as well; see below.) As the encoder encodes the sequence, it computes counts for each of these models. This allows it to compute the conditional probability functions p(next symbol | previous kmax symbols), p(next symbol | previous kmax − 1 symbols), ..., p(next symbol | previous symbol), p(next symbol).
- Add one more symbol, A_{N+1} = ε (for "escape"), to the alphabet. This escape symbol indicates that the encoder is switching from the k-th order model to the (k − 1)-th order model.
- To choose the probabilities of the next symbol, use the highest order k such that the previous k symbols followed by the next symbol have occurred at least once in the past. That is, the encoder keeps escaping until it finds such an order.
- What if a symbol has never occurred before in any context, i.e., what if it is the first time the symbol occurs? In this case, the encoder uses a "k = −1" order model to encode the symbol. (Calling the order "−1" is meaningless, of course; all it means is that we have decremented the order below k = 0, i.e., we escaped from k = 0.)
For example, when encoding the string "abracadabra" (with a maximum order of kmax = 2), the symbols that are encoded are:
ε, ε, ε, a, ε, ε, ε, b, ε, ε, ε, r, ε, ε, a, ε, ε, ε, c, ε, ε, a, ε, ε, ε, d, ε, ε, a, ε, b, r, a
You may wonder how this could achieve compression. The number of symbols (including the ε's) is far greater than the original number of symbols, so one might think that this would require more bits. Notice, however, that many of these escapes occur with probability 1, and so no bits need to be sent for them.
begin
    while (not last character) do
    begin
        readSymbol()
        shorten context
        while (context not found and context length not -1) do
        begin
            output(escape sequence)
            shorten context
        end
        output(character)
        while (context length not -1) do
        begin
            increase count of character (create node if nonexistent)
            shorten context
        end
    end
end
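As an illustration, the following Python sketch mimics the escape mechanism described above (the arithmetic coding of the probabilities and the usual exclusion rules are omitted, so it is not a complete PPM coder). With kmax = 2 it reproduces the escape/symbol sequence listed above for "abracadabra":

from collections import defaultdict

def ppm_symbol_stream(text, kmax=2):
    """Return the items a PPM encoder would hand to its arithmetic coder:
    an 'ESC' for every context order that has to be abandoned, followed by
    the symbol itself (reaching order -1 means the raw-symbol model)."""
    # counts[k][context][symbol] = frequency of symbol after that order-k context
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(kmax + 1)]
    out = []
    for i, sym in enumerate(text):
        k = kmax
        while k >= 0:
            if k <= i and counts[k][text[i - k:i]].get(sym, 0) > 0:
                break                      # symbol already seen in this context
            out.append('ESC')              # escape to the (k - 1)-order model
            k -= 1
        out.append(sym)
        # update the counts of every model order that has enough history
        for k in range(min(kmax, i) + 1):
            counts[k][text[i - k:i]][sym] += 1
    return out

print(ppm_symbol_stream("abracadabra", kmax=2))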
Dictionary Method
The compression techniques we have seen so far replace individual symbols with variable-length codewords. In dictionary compression, variable-length substrings are replaced by short, possibly even fixed-length, codewords. Compression is achieved by replacing long strings with shorter codewords. The general scheme is as follows: maintain a dictionary of phrases, parse the text into phrases that occur in the dictionary, and encode each phrase by a reference to its dictionary entry.
In 1977 and 1978, Abraham Lempel and Jacob Ziv published two adaptive dictionary compression algorithms that soon came to dominate practical text compression. Numerous variants have been published and implemented, and they are still the most commonly used algorithms in general-purpose compression tools.
The common feature of the two algorithms and all their variants is that the
dictionary consists of substrings of the already processed part of the text. This
means that the dictionary adapts to the text.
The two algorithms are known as LZ77 and LZ78, and most related methods can be categorized as a variant of one or the other. The primary difference between them is how the phrases are encoded.
LZ77
Jacob Ziv and Abraham Lempel presented their dictionary-based scheme for lossless data compression in 1977. The LZ77 compression algorithm is the most widely used compression algorithm; programs like PKZIP are based on it, along with a few other algorithms. LZ77 exploits the fact that words and phrases within a text file are likely to be repeated. When there is repetition, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched. It is a very simple technique that requires no prior knowledge of the source and makes no assumptions about the characteristics of the source.

In the LZ77 approach, the dictionary is simply a portion of the previously encoded sequence. The encoder examines the input sequence through a sliding window which consists of two parts: a search buffer and a look-ahead buffer. The search buffer contains a portion of the recently encoded sequence, and the look-ahead buffer contains the next portion of the sequence to be encoded. The algorithm searches the sliding window for the longest match with the beginning of the look-ahead buffer and outputs a pointer to that match. It is possible that there is no match at all, so the output cannot consist of pointers alone. In LZ77 the sequence is therefore encoded in the form of a triple (o, l, c), where 'o' stands for the offset to the match, 'l' represents the length of the match, and 'c' denotes the next symbol to be encoded. In the absence of a match, a null pointer (both the offset and the match length equal to 0) is generated together with the first symbol in the look-ahead buffer, i.e. (0, 0, "character"). The values of the offset and the length must be limited to some maximum constants, and the compression performance of LZ77 depends mainly on these values.
while (look-ahead buffer is not empty)
{
    get a pointer (position, length) to the longest match;
    if (length > 0)
    {
        output(position, length, next symbol);
        shift the window by (length + 1) positions;
    }
    else
    {
        output(0, 0, first symbol in the look-ahead buffer);
        shift the window by 1 position;
    }
}
In the original LZ77 algorithm, Lempel and Ziv proposed that all strings be encoded as a length and an offset, even strings for which no match is found. In LZ77 the search buffer is typically thousands of bytes long, while the look-ahead buffer is tens of bytes long. LZ77 does not need an external dictionary, which would otherwise cause problems when decompressing on another machine.
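As an illustration, here is a small self-contained Python sketch of this triple-based scheme. The buffer sizes and the brute-force match search are simplifications chosen for the example; real implementations index the search buffer, e.g. with hash chains:

def lz77_encode(data, search_size=4096, lookahead_size=16):
    """Encode a string as (offset, length, next_symbol) triples.
    offset = length = 0 means 'no match'; the triple then only carries a literal."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # brute-force search of the window for the longest match
        for j in range(max(0, i - search_size), i):
            length = 0
            while (length < lookahead_size - 1 and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the string from the triples (copies may overlap the output)."""
    out = []
    for off, length, sym in triples:
        for _ in range(length):
            out.append(out[-off])
        out.append(sym)
    return "".join(out)

coded = lz77_encode("abracadabra abracadabra")
print(coded)
print(lz77_decode(coded))  # -> "abracadabra abracadabra"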
LZ78
In LZ78 the dictionary of phrases is built explicitly. The encoder maintains a current phrase w (initially empty) and extends it with the next input symbol K as long as the extended phrase wK is already in the dictionary; otherwise it outputs the pair (index(w), K), adds wK to the dictionary, and starts a new phrase:

w := NIL;
while (there is input)
{
    K := next symbol from input;
    if (wK exists in the dictionary)
    {
        w := wK;
    }
    else
    {
        output(index(w), K);
        add wK to the dictionary;
        w := NIL;
    }
}
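For concreteness, a direct Python transcription of this pseudocode might look as follows. Treating index 0 as the empty phrase and flushing a pending phrase at the end of the input are implementation choices assumed here:

def lz78_encode(data):
    """Encode a string as (index, symbol) pairs; index 0 is the empty phrase w = NIL."""
    dictionary = {"": 0}          # phrase -> index
    w, out = "", []
    for K in data:
        if w + K in dictionary:
            w = w + K             # keep extending the current phrase
        else:
            out.append((dictionary[w], K))
            dictionary[w + K] = len(dictionary)
            w = ""
    if w:                         # input ended inside a known phrase
        out.append((dictionary[w], ""))
    return out

print(lz78_encode("abababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]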
LZW
LZW is a simple optimization of LZ78, used, e.g., in the Unix tool compress. The dictionary is initialized with all the symbols of the alphabet, D = {Z1, ..., Zσ}, so every single symbol is already a phrase and no explicit symbol needs to be transmitted with each index. The phrase Tj is encoded with an index requiring ⌈log(σ + j − 1)⌉ bits.
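A minimal Python sketch of this optimization (the alphabet used here is just an assumption for the example, and index-width management is omitted): because every single symbol is already in the dictionary, the output is a plain sequence of indices with no accompanying symbols.

def lzw_encode(data, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """LZW: pre-load the dictionary with the alphabet and output only indices."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w, out = "", []
    for K in data:
        if w + K in dictionary:
            w = w + K
        else:
            out.append(dictionary[w])            # emit index of the longest known phrase
            dictionary[w + K] = len(dictionary)  # add the extended phrase
            w = K
    if w:
        out.append(dictionary[w])
    return out

print(lzw_encode("abababab"))  # [0, 1, 27, 29, 1]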
Exercise:
Besides error control protocols, all current high-speed modems also support data
compression protocols. That means the sending modem will compress the data
on-the-fly and the receiving modem will decompress the data to its original form.
There are two standards for data compression protocols, MNP-5 and CCITT
V.42bis. Some modems also use proprietary data compression protocols.
Also note that although V.42 includes MNP-4, V.42bis does not include MNP-5.
However, virtually all high-speed modems that support CCITT V.42bis also
incorporate MNP-5.
The maximum compression ratio that a MNP-5 modem can achieve is 2:1. That is
to say, a 9600 bps MNP-5 modem can transfer data up to 19200 bps. The
maximum compression ratio for a V.42bis modem is 4:1. That is why all those
V.32 modem manufacturers claim that their modems provide throughput up to
38400 bps.
Don't be fooled by the claim. It is extremely rare, if ever, that you will be able to
transfer files at 38400 bps. In fact, V.42bis and MNP-5 are not very useful when
you are downloading files from online services. Why?
How well the modem compression works depends on what kind of files are being
transferred. In general, you will be able to achieve twice the speed for
transferring a standard text file (like the one you are reading right now).
Decreasing the data by 50% means that you can double the throughput on the line, so that a 9600 bps modem can effectively transmit at 19200 bps.
V.42bis and MNP-5 modems cannot compress a file which is already compressed by software. In the case of MNP-5, the modem will even try to compress a precompressed file and actually expand it, thus slowing down the file transfer! Here are the test results obtained by downloading the three compressed files using (1) MNP-4 without data compression, (2) MNP-5, (3) V.42 without data compression, and (4) V.42bis.
Most PC files are in the ZIP format. Macintosh files are typically in the .SIT
(Stuffit) or .CPT (Compact Pro) format. Amiga files are usually in the ZOO, ARC or
LZH format. Note that GIF files are also in a compressed format.
There are several reasons why compression software programs (such as PKZIP or
Stuffit) are superior to MNP-5 or V.42bis.
2. Compression software programs are more versatile. Most of them allow
you to group several files in a compressed file archive to ensure that all
the related files get transferred at the same time.
3. Software compression is more efficient than on-the-fly modem
compression. In the case of a small file, this may not make much
difference. But the difference can be significant when you are transferring
large files.
Hayes BBS does not provide a compressed version for the file the-wave.txt.
Using PKZIP (for PC) and Stuffit (for Macintosh), we obtain the following results:
Here is another example. Spider Island Software BBS (714-730-5785) has a test file
called One-Minute Max. It is a Macintosh TIFF file (file size 206,432 bytes).
According to Spider Island Software, the file can be downloaded in 56 seconds
(with an effective throughput of 3745 cps) with a V.32/V.42bis modem.
The result may seem impressive at first. However, the file can be compressed to
6065 bytes (with Compact Pro) or 7385 bytes (with Stuffit). Assuming a transfer
speed of 1000 cps, it would only take 6-8 seconds to transfer. Again, it is seven
to nine times faster than downloading the file with V.42bis.
To get the most from a modem with data compression, you'll want to send data
from your PC to the modem as quickly as possible. If the modem is idle and
waiting for the computer to send data, you are not getting the maximum
performance from the modem.
For example, you have a V.32/V.42bis modem and you want to send a text file to
a remote system which also has a V.32/V.42bis modem. Let's assume the
modem is able to send the file at 20000 bps using V.42bis. If your computer is
sending data to your modem at 9600 bps, your modem will have to stop and wait
to receive data from your computer.
To get the maximum performance, you want to set the computer to send data to
the modem at 38400 bps (the maximum a V.32/V.42bis modem can achieve).
Since the modem can only send the file to the other modem at 20000 bps, it will
never have to wait.
CALIC
CALIC stands for Context Adaptive Lossless Image Compression, a lossless image
compression technique. CALIC obtains higher lossless compression of
continuous-tone images than other techniques reported in the literature. This
high coding efficiency is accomplished with relatively low time and space
complexities. It's designed to achieve higher compression ratios than other
lossless methods, particularly for continuous-tone images. CALIC works by using
context modeling and a non-linear predictor that adapts to varying image
statistics.
Lossless Compression:
CALIC ensures that the original image data is preserved, meaning no information
is lost during compression or decompression.
Context Modeling:
CALIC uses the values of neighbouring, already-coded pixels as a context to model the local image statistics and to refine the prediction of the current pixel.
Adaptive Predictor:
The predictor in CALIC is designed to adapt to the statistical variations within the image. This allows it to handle different types of image content (e.g., smooth areas, edges) more effectively; a sketch of the predictor as commonly described in the literature is given after this list.
Side Information:
CALIC can also utilize side information (unaltered portions of the original image)
to further improve compression efficiency.
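CALIC's non-linear predictor is usually described in the literature as the gradient-adjusted predictor (GAP) of Wu and Memon. The sketch below follows that common textbook presentation; the neighbour labels (N, W, NE, NW, NN, WW, NNE denote already-coded pixels around the current one) and the 80/32/8 thresholds are the values usually quoted, so treat this as an illustration rather than a normative description:

def gap_predict(N, W, NE, NW, NN, WW, NNE):
    """Gradient-adjusted prediction of the current pixel from its causal neighbours."""
    d_h = abs(W - WW) + abs(N - NW) + abs(N - NE)    # local horizontal gradient
    d_v = abs(W - NW) + abs(N - NN) + abs(NE - NNE)  # local vertical gradient
    if d_v - d_h > 80:            # sharp horizontal edge: predict from the west pixel
        return W
    if d_h - d_v > 80:            # sharp vertical edge: predict from the north pixel
        return N
    pred = (W + N) / 2 + (NE - NW) / 4
    if d_v - d_h > 32:            # horizontal edge
        pred = (pred + W) / 2
    elif d_v - d_h > 8:           # weak horizontal edge
        pred = (3 * pred + W) / 4
    elif d_h - d_v > 32:          # vertical edge
        pred = (pred + N) / 2
    elif d_h - d_v > 8:           # weak vertical edge
        pred = (3 * pred + N) / 4
    return pred

# Example: a smooth neighbourhood around value 100 predicts roughly 100.
print(gap_predict(N=100, W=101, NE=99, NW=100, NN=100, WW=102, NNE=99))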
Benefits of CALIC:
Adaptability:
The adaptive nature of CALIC allows it to handle various types of image content effectively.
Low Complexity:
Despite its effectiveness, CALIC has relatively low time and space complexities.