Data Compression (Pt2)

Data compression techniques aim to reduce the size of data files or streams for efficient storage or transmission by removing redundant information. There are two main categories of compression methods: lossless techniques, which allow perfect reconstruction of the original data, and lossy techniques, which discard some information to achieve higher compression ratios at the cost of reconstruction quality. The lossless compression algorithms discussed in this document include run-length encoding, Shannon-Fano coding, and Huffman coding, which assigns variable-length codes to symbols based on their frequency of occurrence.


Multimedia Data Compression

1
Why Data Compression?
 Make optimal use of limited storage space

 Save time and help to optimize resources


 If compression and decompression are done in I/O processor, less time is
required to move data to or from storage subsystem, freeing I/O bus for
other work

 In sending data over a communication line: less time to transmit and less
storage required at the host
7.1 Introduction
 If the compression and decompression processes induce no
information loss, then the compression scheme is lossless;
otherwise, it is lossy.
 Compression ratio:

   compression ratio = B0 / B1    (7.1)

 B0 – number of bits before compression
 B1 – number of bits after compression
 In general, we would desire any codec (encoder/decoder
scheme) to have a compression ratio much larger than 1.0.
 The higher the compression ratio, the better the lossless
compression scheme, as long as it is computationally feasible.
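
As a quick illustration of Eq. (7.1), the short Python sketch below computes the compression ratio from the bit counts before and after compression; the file sizes in the example are made up for illustration only.

def compression_ratio(bits_before: int, bits_after: int) -> float:
    """Eq. (7.1): B0 / B1, the ratio of original size to compressed size."""
    return bits_before / bits_after

# Hypothetical example: a 1,000,000-bit file compressed down to 250,000 bits.
print(compression_ratio(1_000_000, 250_000))  # 4.0, i.e. 4:1 compression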

3
7.1 Introduction
 Compression: the process of coding that will effectively
reduce the total number of bits needed to represent certain
information.
 Figure 7.1 depicts a general data compression scheme, in which
compression is performed by an encoder and decompression
is performed by a decoder.

Fig. 7.1: A General Data Compression Scheme.

4
Data Compression Methods
 Data compression is about storing and sending
a smaller number of bits.
 There are two major categories of methods to
compress data: lossless and lossy methods.
7.2 Basics of Information Theory
 The entropy η of an information source with alphabet
S = {s1, s2, . . . , sn} is:

   η = H(S) = \sum_{i=1}^{n} p_i \log_2 (1/p_i)    (7.2)

            = - \sum_{i=1}^{n} p_i \log_2 p_i      (7.3)

 pi – probability that symbol si will occur in S.
 log2(1/pi) – indicates the amount of information
(self-information, as defined by Shannon) contained in
si, which corresponds to the number of bits needed
to encode si.
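
The entropy formula (7.2)/(7.3) can be written in a few lines of Python; this is a minimal sketch assuming the symbol probabilities are already known (or estimated from frequency counts).

import math

def entropy(probabilities):
    """H(S) = sum over all symbols of p_i * log2(1/p_i), skipping p_i = 0."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Example: the HELLO source used in Section 7.4.1 (L: 0.4, H/E/O: 0.2 each).
print(round(entropy([0.4, 0.2, 0.2, 0.2]), 2))  # 1.92 bits per symbol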

Li & Drew 6
Data Compression- Entropy
 Entropy is the measure of information content
in a message.
 Messages with higher entropy carry more information than
messages with lower entropy.
 The average information content (entropy) of
the entire message is the probability-weighted sum of
the self-information of all n symbols in the alphabet.
7.3 Run-Length Coding
 RLC (run-length coding) is one of the simplest forms of data compression.
 The basic idea is that if the information source has the property that
symbols tend to form continuous groups (runs), then each symbol and the
length of its run can be coded.
 Consider a screen containing plain black text on a solid white
background.
 There will be many long runs of white pixels in the blank space, and
many short runs of black pixels within the text. Let us take a
hypothetical single scan line, with B representing a black pixel and W
representing white:
WWWWWBWWWWBBBWWWWWWBWWW
 If we apply the run-length encoding (RLE) data compression
algorithm to the above hypothetical scan line, we get the following:
5W1B4W3B6W1B3W
 The run-length code represents the original 23 characters in only 14.
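
The encoding step can be sketched in a few lines of Python; this simple version reproduces the scan-line example above and writes each run as <length><symbol>, matching the notation used on this slide.

from itertools import groupby

def rle_encode(data: str) -> str:
    """Replace each run of identical symbols with <run length><symbol>."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(data))

scan_line = "WWWWWBWWWWBBBWWWWWWBWWW"
print(rle_encode(scan_line))  # 5W1B4W3B6W1B3W (23 characters reduced to 14)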

8
7.4 Variable-Length Coding
 Variable-length coding (VLC) is one of the best-known
entropy coding methods.

 Here, we will study the Shannon–Fano algorithm, Huffman
coding, and adaptive Huffman coding.

9
7.4.1 Shannon–Fano Algorithm
 To illustrate the algorithm, let us suppose the symbols to be
coded are the characters in the word HELLO.
 The frequency count of the symbols is
Symbol H E L O
Count 1 1 2 1
 The encoding steps of the Shannon–Fano algorithm can be
presented in the following top-down manner:
 1. Sort the symbols according to the frequency count of their
occurrences.
 2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts
contain only one symbol.

10
7.4.1 Shannon–Fano Algorithm
 A natural way of implementing the above procedure is to
build a binary tree.
 As a convention, let us assign bit 0 to the left branches and 1
to the right branches.
 Initially, the symbols are sorted as LHEO.
 As Fig. 7.3 shows, the first division yields two parts: L with a
count of 2, denoted as L:(2); and H, E and O with a total
count of 3, denoted as H, E, O:(3).
 The second division yields H:(1) and E, O:(2).
 The last division is E:(1) and O:(1).
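
The top-down splitting just described can be sketched as a short recursive routine; this is only an illustrative implementation (the way ties in the split point are broken can lead to different, equally valid codes).

def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs sorted by descending count.
    Returns a dict mapping each symbol to its binary code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(count for _, count in symbols)
    # Choose the split point where the running count is closest to half the total.
    running, split, best_diff = 0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code   # first (left) part gets 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code   # second (right) part gets 1
    return codes

# HELLO example, symbols sorted as L, H, E, O:
print(shannon_fano([("L", 2), ("H", 1), ("E", 1), ("O", 1)]))
# {'L': '0', 'H': '10', 'E': '110', 'O': '111'}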

11
7.4.1 Shannon–Fano Algorithm

Fig. 7.3: Coding Tree for HELLO by Shannon-Fano.

12

Table 7.1: Result of Performing Shannon-Fano on HELLO

Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         0      2
H        1       2.32         10     2
E        1       2.32         110    3
O        1       2.32         111    3
TOTAL # of bits: 10

Average number of bits required to represent each symbol (entropy):

   η = H(S) = \sum_{i=1}^{n} p_i \log_2 (1/p_i)
            = 0.4(1.32) + 0.2(2.32) + 0.2(2.32) + 0.2(2.32)
            = 0.528 + 0.464 + 0.464 + 0.464
            = 1.92 bits per symbol
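
For comparison, the code in Table 7.1 spends 10 bits on the 5 symbols of HELLO, i.e. 2 bits per symbol on average, slightly above the 1.92-bit entropy lower bound. A quick check of both figures (probabilities and code lengths taken from the table):

import math

probs = {"L": 2/5, "H": 1/5, "E": 1/5, "O": 1/5}
code_lengths = {"L": 1, "H": 2, "E": 3, "O": 3}

entropy = sum(p * math.log2(1 / p) for p in probs.values())
avg_len = sum(probs[s] * code_lengths[s] for s in probs)
print(round(entropy, 2), avg_len)  # 1.92 2.0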

Li & Drew 13
7.4.1 Shannon–Fano Algorithm
 The Shannon–Fano algorithm delivers satisfactory coding results
for data compression, but it was soon outperformed and
overtaken by the Huffman coding method.
 The Huffman algorithm requires prior statistical knowledge about
the information source, and such information is often not
available.
 This is particularly true in multimedia applications, where future
data is unknown before its arrival, as for example in live (or
streaming) audio and video.
 Even when the statistics are available, the transmission of the
symbol table could represent heavy overhead.
 The solution is to use adaptive Huffman coding compression
algorithms, in which statistics are gathered and updated
dynamically as the data stream arrives.
14
7.4.2 Huffman Coding Algorithm
 Huffman coding is based on the frequency
of occurrence of a data item.
 The principle is to use fewer bits to
encode the symbols that occur more
frequently.

15
Huffman Coding Algorithm
Symbol Count
A 5
B 7
C 10
D 15
E 20
F 45
Step 1: Sort the list by frequency (descending)

Symbol Count
F 45
E 20
D 15
C 10
B 7
A 5

16
Huffman Coding Algorithm
Step 2: Make the 2 lowest elements into leaves, creating a parent node
with a frequency that is the sum of the lower element frequencies.
AB (12)

A (5) B (7)

Step 3: The two elements are removed from the list and the new
parent node with frequency 12 is inserted into the list. The list sorted by
frequency is
Symbol Count
F 45
E 20
D 15
AB 12
C 10

17
Huffman Coding Algorithm
Step 4: Then, repeat the loop, combining the two lowest elements.

ABC (22)

C (10)   AB (12)

         A (5)   B (7)

Symbol Count
F 45
ABC 22
E 20
D 15
Step 5: Repeat until there is only one element left in the list.

ABCDE (57)

DE (35)            ABC (22)

D (15)   E (20)    C (10)   AB (12)

                            A (5)   B (7)

Symbol Count
F 45
ABCDE 57


18
Huffman Coding Algorithm
ABCDEF (102)

F (45)   ABCDE (57)

         DE (35)            ABC (22)

         D (15)   E (20)    C (10)   AB (12)

                                     A (5)   B (7)

Symbol Count
ABCDEF 102

19
Huffman Coding Algorithm
The same tree, with bit 0 assigned to each left branch and bit 1 to each right
branch, yields the codewords F = 0, D = 100, E = 101, C = 110, A = 1110,
B = 1111 listed in the table below.

Symbol   Count   pi      1/pi    log2(1/pi)   Code   # of bits used
A        5       0.049   20.41   4.35         1110   20
B        7       0.069   14.49   3.86         1111   28
C        10      0.098   10.20   3.35         110    30
D        15      0.147   6.80    2.76         100    45
E        20      0.196   5.10    2.35         101    60
F        45      0.441   2.27    1.18         0      45
TOTAL    102                                         228

20
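
The bottom-up merging carried out in Steps 1-5 can be sketched with a priority queue (heap), as below. Note that different tie-breaking choices can produce a different, but equally optimal, set of codewords than the one in the table; the total of 228 bits is the same.

import heapq

def huffman_codes(freqs):
    """freqs: dict mapping symbol -> count. Returns dict mapping symbol -> code."""
    # Heap entries are (count, tie_breaker, node); a node is either a symbol
    # string (leaf) or a [left, right] pair of nodes (internal node).
    heap = [(count, i, sym) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # the two lowest-frequency nodes
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, counter, [left, right]))
        counter += 1
    codes = {}
    def assign(node, code):
        if isinstance(node, list):       # internal node: recurse into children
            assign(node[0], code + "0")  # 0 for the left branch
            assign(node[1], code + "1")  # 1 for the right branch
        else:
            codes[node] = code or "0"    # single-symbol alphabet edge case
    assign(heap[0][2], "")
    return codes

print(huffman_codes({"A": 5, "B": 7, "C": 10, "D": 15, "E": 20, "F": 45}))
# One optimal assignment: code lengths 4, 4, 3, 3, 3, 1 for A, B, C, D, E, F
# (5*4 + 7*4 + 10*3 + 15*3 + 20*3 + 45*1 = 228 bits in total)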
Huffman Coding Algorithm
Symbol   Count   pi      1/pi    log2(1/pi)   Code   # of bits used
A        5       0.049   20.41   4.35         1110   20
B        7       0.069   14.49   3.86         1111   28
C        10      0.098   10.20   3.35         110    30
D        15      0.147   6.80    2.76         100    45
E        20      0.196   5.10    2.35         101    60
F        45      0.441   2.27    1.18         0      45
TOTAL    102                                         228

Average number of bits required to represent each symbol (entropy):

   η = H(S) = \sum_{i=1}^{n} p_i \log_2 (1/p_i)
            = 0.049(4.35) + 0.069(3.86) + 0.098(3.35) + 0.147(2.76) + 0.196(2.35) + 0.441(1.18)
            = 0.2132 + 0.2663 + 0.3283 + 0.4057 + 0.4606 + 0.5204
            ≈ 2.19 bits per symbol
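
A quick numerical check: computed from the exact probabilities (rather than the rounded table values), the entropy is about 2.2 bits per symbol, while the Huffman code above spends 228/102 ≈ 2.24 bits per symbol, slightly above that lower bound.

import math

counts = {"A": 5, "B": 7, "C": 10, "D": 15, "E": 20, "F": 45}
code_lengths = {"A": 4, "B": 4, "C": 3, "D": 3, "E": 3, "F": 1}

total = sum(counts.values())                      # 102 symbols
entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
avg_code_len = sum(counts[s] * code_lengths[s] for s in counts) / total  # 228/102
print(round(entropy, 2), round(avg_code_len, 2))  # 2.2 2.24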

21
7.5 Lempel-Ziv-Welch (LZW)
 The Lempel-Ziv-Welch (LZW) algorithm employs an
adaptive, dictionary-based compression technique.

 Unlike variable-length coding, in which the lengths of the
codewords are different, LZW uses fixed-length
codewords to represent variable-length strings of
symbols/characters that commonly occur together, such
as words in English text.
 LZW is used in many applications, such as UNIX
compress, GIF for images, WinZip, and others.
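
A compact sketch of the LZW encoding loop is shown below. For simplicity the initial dictionary here contains only the distinct characters of the input string (a real implementation would normally start from a fixed 256-entry byte dictionary and emit fixed-width codes); the input string is just an illustrative example.

def lzw_encode(text: str):
    """Return the list of dictionary indices emitted for `text`."""
    # Initialise the dictionary with every distinct character of the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch                        # keep extending the match
        else:
            output.append(dictionary[current])   # emit code for the longest match
            dictionary[current + ch] = len(dictionary)   # learn the new string
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABABBABCABABBA"))
# [0, 1, 3, 4, 1, 2, 3, 5, 0] -- 9 codes emitted for 14 input characters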

22
