Data Compression (Pt2)
Why Data Compression?
Make optimal use of limited storage space.
When sending data over a communication line: less time to transmit and less storage required at the receiving host.
7.1 Introduction
If the compression and decompression processes induce no
information loss, then the compression scheme is lossless;
otherwise, it is lossy.
Compression ratio:

$$\text{compression ratio} = \frac{B_0}{B_1} \qquad (7.1)$$

where $B_0$ is the number of bits before compression and $B_1$ is the number of bits after compression.
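As a quick worked example (with hypothetical figures, not taken from the text): if a 1,000-bit message is stored in 250 bits after compression, then $B_0 = 1000$, $B_1 = 250$, and the compression ratio is $1000/250 = 4$, often written 4:1.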
7.1 Introduction
Compression: the process of coding that will effectively
reduce the total number of bits needed to represent certain
information.
Figure 7.1 depicts a general data compression scheme, in which
compression is performed by an encoder and decompression
is performed by a decoder.
Data Compression Methods
Data compression is about storing and sending
a smaller number of bits.
There are two major categories of data compression methods: lossless and lossy.
7.2 Basics of Information Theory
The entropy η of an information source with alphabet S = {s_1, s_2, . . . , s_n} is

$$\eta = H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} \qquad (7.2)$$

$$= -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (7.3)$$

where p_i is the probability that symbol s_i will occur in S.

$\log_2 \frac{1}{p_i}$ indicates the amount of information (self-information, as defined by Shannon) contained in s_i, which corresponds to the number of bits needed to encode s_i.
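To make Eq. (7.3) concrete, here is a minimal Python sketch (not from the text) that computes the entropy of a source directly from symbol counts; the word HELLO, which reappears in Sect. 7.4.1, serves as hypothetical sample data.

```python
from collections import Counter
from math import log2

def entropy(message: str) -> float:
    """Entropy per symbol: H(S) = -sum(p_i * log2(p_i)), as in Eq. (7.3)."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# For "HELLO": p(L) = 2/5 and p(H) = p(E) = p(O) = 1/5, so
# H(S) = 0.4*log2(2.5) + 3 * 0.2*log2(5) ≈ 1.92 bits per symbol.
print(round(entropy("HELLO"), 3))   # 1.922
```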
Data Compression: Entropy
Entropy is the measure of information content in a message.
Messages with higher entropy carry more information than messages with lower entropy.
The average information content of an entire message is the sum, over all n symbols, of each symbol's self-information weighted by its probability, i.e., the entropy of Eq. (7.2).
7.3 Run-Length Coding
• RLC is one of the simplest forms of data compression.
The basic idea is that if the information source has the property that symbols tend to form continuous groups (runs), then each such symbol and the length of its run can be coded together.
Consider a screen containing plain black text on a solid white
background.
There will be many long runs of white pixels in the blank space, and
many short runs of black pixels within the text. Let us take a
hypothetical single scan line, with B representing a black pixel and W
representing white:
WWWWWBWWWWBBBWWWWWWBWWW
If we apply the run-length encoding (RLE) data compression
algorithm to the above hypothetical scan line, we get the following:
5W1B4W3B6W1B3W
The run-length code represents the original 23 characters in only 14.
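The short Python sketch below (not from the text) is one straightforward way to implement this encoder; itertools.groupby collects each run, and the sketch reproduces the 5W1B4W3B6W1B3W result for the hypothetical scan line.

```python
import re
from itertools import groupby

def rle_encode(line: str) -> str:
    """Replace each run of identical symbols by its length followed by the symbol."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(line))

def rle_decode(code: str) -> str:
    """Inverse operation: expand each (length, symbol) pair back into a run."""
    return "".join(symbol * int(count) for count, symbol in re.findall(r"(\d+)(\D)", code))

scan_line = "WWWWWBWWWWBBBWWWWWWBWWW"
encoded = rle_encode(scan_line)              # '5W1B4W3B6W1B3W'
assert rle_decode(encoded) == scan_line      # lossless: decoding restores the input
print(encoded, f"({len(scan_line)} characters -> {len(encoded)})")
```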
7.4 Variable-Length Coding
Variable-length coding (VLC) is one of the best-known entropy coding methods.
7.4.1 Shannon–Fano Algorithm
To illustrate the algorithm, let us suppose the symbols to be
coded are the characters in the word HELLO.
The frequency count of the symbols is
Symbol H E L O
Count 1 1 2 1
The encoding steps of the Shannon–Fano algorithm can be
presented in the following top-down manner:
1. Sort the symbols according to the frequency count of their
occurrences.
2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts
contain only one symbol.
7.4.1 Shannon–Fano Algorithm
A natural way of implementing the above procedure is to
build a binary tree.
As a convention, let us assign bit 0 to its left branches and 1
to the right branches.
Initially, the symbols are sorted as LHEO.
As Fig. 7.3 shows, the first division yields two parts: L with a
count of 2, denoted as L:(2); and H, E and O with a total
count of 3, denoted as H, E, O:(3).
The second division yields H:(1) and E, O:(2).
The last division is E:(1) and O:(1).
7.4.1 Shannon–Fano Algorithm
[Fig. 7.3: coding tree for HELLO built by the Shannon–Fano algorithm, with 0 on left branches and 1 on right branches]
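A compact recursive Python sketch of this top-down procedure (not from the text) is shown below; ties in the split can be broken in more than one way, and the tie-breaking chosen here reproduces the divisions described above for HELLO.

```python
def shannon_fano(symbols, prefix=""):
    """symbols: list of (symbol, count) pairs, sorted by descending count."""
    if len(symbols) == 1:
        return {symbols[0][0]: prefix or "0"}
    total, left = sum(c for _, c in symbols), 0
    best_split, best_diff = 1, total
    # Pick the split point that makes the two parts' counts as equal as possible.
    for i, (_, count) in enumerate(symbols[:-1], start=1):
        left += count
        if abs(2 * left - total) < best_diff:
            best_split, best_diff = i, abs(2 * left - total)
    codes = {}
    codes.update(shannon_fano(symbols[:best_split], prefix + "0"))  # left branch: 0
    codes.update(shannon_fano(symbols[best_split:], prefix + "1"))  # right branch: 1
    return codes

# Symbol counts for HELLO, sorted as L, H, E, O.
print(shannon_fano([("L", 2), ("H", 1), ("E", 1), ("O", 1)]))
# {'L': '0', 'H': '10', 'E': '110', 'O': '111'}
```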
7.4.1 Shannon–Fano Algorithm
The Shannon–Fano algorithm delivers satisfactory coding results
for data compression, but it was soon outperformed and
overtaken by the Huffman coding method.
The Huffman algorithm requires prior statistical knowledge about
the information source, and such information is often not
available.
This is particularly true in multimedia applications, where future
data is unknown before its arrival, as for example in live (or
streaming) audio and video.
Even when the statistics are available, the transmission of the
symbol table could represent heavy overhead.
The solution is to use adaptive Huffman coding compression
algorithms, in which statistics are gathered and updated
dynamically as the data stream arrives.
7.4.2 Huffman Coding Algorithm
Huffman coding is based on the frequency
of occurrence of a data item.
The principle is to use fewer bits to encode the data items that occur more frequently.
Huffman Coding Algorithm
Symbol Count
A 5
B 7
C 10
D 15
E 20
F 45
Step 1: Sort the list by frequency (descending)
Symbol Count
F 45
E 20
D 15
C 10
B 7
A 5
Huffman Coding Algorithm
Step 2: Make the two lowest elements into leaves, creating a parent node whose frequency is the sum of the two elements' frequencies.
[Tree fragment: new parent node AB (12) with children A (5) and B (7)]
Step 3: The two elements are removed from the list and the new
parent node with frequency 12 is inserted into the list. The list sorted by
frequency is
Symbol Count
F 45
E 20
D 15
AB 12
C 10
Huffman Coding Algorithm
Step 4: Then, repeat the loop, combining the two lowest elements.
The two lowest elements, C (10) and AB (12), are combined into a new parent node ABC (22):
[Tree fragment: ABC (22) with children C (10) and AB (12); AB (12) with children A (5) and B (7)]
The list sorted by frequency is now
Symbol Count
F 45
ABC 22
E 20
D 15
Step 5: Repeat until there is only one element left in the list.
D (15) and E (20) are combined into DE (35); then the two lowest elements, ABC (22) and DE (35), are combined into ABCDE (57). The list is now
Symbol Count
ABCDE 57
F 45
Finally, F (45) and ABCDE (57) are combined into the root node:
Symbol Count
ABCDEF 102
Huffman Coding Algorithm
[Figure: final Huffman tree. The root ABCDEF (102) has children ABCDE (57) and F (45); ABCDE (57) splits into DE (35) and ABC (22); ABC (22) splits into C (10) and AB (12); DE (35) splits into D (15) and E (20); AB (12) splits into A (5) and B (7). Left branches are labeled 0 and right branches 1, so the path from the root to each leaf gives that symbol's codeword: the most frequent symbol F gets a 1-bit code, while A and B get 4-bit codes.]
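A heap-based Python sketch (not from the text) of the same bottom-up procedure: it repeatedly merges the two lowest-count nodes, exactly as in Steps 2–5, and then reads the codes off the final tree. The particular 0/1 labels can differ from the figure depending on tie-breaking, but the code lengths are the same.

```python
import heapq

def huffman_codes(freqs):
    """freqs: dict symbol -> count. Returns dict symbol -> bit string."""
    # Heap entries are [count, tie_breaker, tree]; a tree is either a symbol
    # (leaf) or a (left, right) pair of subtrees (internal node).
    heap = [[count, i, symbol] for i, (symbol, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)                 # lowest count
        hi = heapq.heappop(heap)                 # second-lowest count
        heapq.heappush(heap, [lo[0] + hi[0], tick, (lo[2], hi[2])])
        tick += 1

    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):              # internal node
            walk(tree[0], prefix + "0")          # left branch: 0
            walk(tree[1], prefix + "1")          # right branch: 1
        else:                                    # leaf: a symbol
            codes[tree] = prefix or "0"
    walk(heap[0][2])
    return codes

print(huffman_codes({"A": 5, "B": 7, "C": 10, "D": 15, "E": 20, "F": 45}))
# {'F': '0', 'C': '100', 'A': '1010', 'B': '1011', 'D': '110', 'E': '111'}
# Same code lengths as in the tree above (F: 1 bit; C, D, E: 3 bits; A, B: 4 bits).
```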
7.5 Lempel-Ziv-Welch (LZW)
The Lempel-Ziv-Welch (LZW) algorithm employs an
adaptive, dictionary-based compression technique.
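To make "adaptive, dictionary-based" concrete before the details, here is a minimal Python sketch (not from the text) of an LZW encoder: the dictionary starts with single characters and grows as new strings are seen, so no symbol table needs to be transmitted in advance. The input string is a hypothetical example.

```python
def lzw_encode(message: str) -> list:
    """LZW encoding: emit dictionary indices while growing the dictionary adaptively."""
    # Start with single-character entries (here: just the characters that occur).
    dictionary = {ch: i for i, ch in enumerate(sorted(set(message)))}
    current, output = "", []
    for ch in message:
        if current + ch in dictionary:
            current += ch                                # keep extending the match
        else:
            output.append(dictionary[current])           # emit code for the longest match
            dictionary[current + ch] = len(dictionary)   # add the new string to the dictionary
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABABBABCABABBA"))
# [0, 1, 3, 4, 1, 2, 3, 5, 0] -- 14 input symbols encoded as 9 dictionary indices
```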