Chapter 7
Data Compression
Introduction
• Data compression is the process of encoding
data using a representation that reduces the
overall size of the data.
• Definition: Data compression is the process of
encoding information using fewer bits so that:
– it takes less memory (storage), or
– it takes less bandwidth during transmission.
Examples:
• RLE (Run Length Encoding)
• Dictionary Based Coding
• Arithmetic Coding
Introduction…
Lossy data compression:
• The original content of the data is lost to a
certain degree when compressed.
• The part of the data that is less important is
discarded/lost.
• The loss factor determines whether there is a
loss of quality between the original image and
the image after it has been compressed and
played back (decompressed).
• The more compression, the more likely that
quality will be affected.
• Even if the quality difference is not noticeable,
the decompressed data is not identical to the original.
Information Theory
• Information theory is defined to be the study
of efficient coding and its consequences.
• It is the field of study concerned with the
storage and transmission of data.
• It is concerned with source coding and
channel coding.
• Source coding: involves compression.
• Channel coding: how to transmit data, how to
overcome noise, etc.
• Data compression may be viewed as a branch
of information theory in which the primary
objective is to minimize the amount of data to
be stored or transmitted.
Compression Algorithms
• Compression methods use mathematical algorithms to
reduce data by eliminating, grouping and/or averaging
similar data found in the signal.
• There are various compression methods, including Motion
JPEG; however, only MPEG-1 and MPEG-2 are internationally
recognized standards for the compression of moving pictures (video).
Algorithm                Variation                 Used in
Arithmetic               none
LZ78 (Lempel-Ziv 1978)   LZW (Lempel-Ziv-Welch)    GIF, v.42bis, compress
LZ77 (Lempel-Ziv 1977)   LZFG                      ZIP, ARJ, LHA
Variable Length Encoding
1. Shannon-Fano Coding
• A variable-length coding based on the frequency of occurrence
of each character.
HOW DOES IT WORK?
The steps of the algorithm are as follows:
1. Create a list of probabilities or frequency counts for the given
set of symbols so that the relative frequency of occurrence of
each symbol is known.
2. Sort the list of symbols in decreasing order of probability, the
most probable ones to the left and least probable to the right.
3. Split the list into two parts, with the total probability of both
the parts being as close to each other as possible.
4. Assign the value 0 to the left part and 1 to the right part.
5. Repeat steps 3 and 4 for each part until all the symbols
are split into individual subgroups.
Variable Length…
Example: Suppose the following source with associated probabilities:
S = {A, B, C, D, E}
P = {0.35, 0.17, 0.17, 0.16, 0.15}
Message to be encoded = "ABCDE"
• Solution:
• Let P(x) be the probability of occurrence of symbol x.
1. The symbols are already in decreasing order of probability: A, B, C, D, E.
2. The split that makes the two parts closest in total probability is:
P(A) + P(B) = 0.35 + 0.17 = 0.52 and
P(C) + P(D) + P(E) = 0.17 + 0.16 + 0.15 = 0.48
3. Assign 0 to the left part {A, B} and 1 to the right part {C, D, E},
then repeat the splitting within each part.
• Result
Symbol A B C D E
Code 00 01 10 110 111
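As an illustration, the splitting procedure can be sketched in a few lines of Python (a minimal sketch; the function name shannon_fano and its structure are our own). It reproduces the code table above:

def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs. Returns dict symbol -> code."""
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        # Find the split point where the two halves are closest in probability.
        running, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs((total - running) - running)
            if diff < best_diff:
                best_i, best_diff = i, diff
        left, right = group[:best_i], group[best_i:]
        for s, _ in left:
            codes[s] += "0"          # 0 for the left part
        for s, _ in right:
            codes[s] += "1"          # 1 for the right part
        split(left)
        split(right)

    # Sort in decreasing order of probability, then split recursively.
    split(sorted(symbols, key=lambda x: x[1], reverse=True))
    return codes

print(shannon_fano([("A", 0.35), ("B", 0.17), ("C", 0.17), ("D", 0.16), ("E", 0.15)]))
# -> {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}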
Dictionary Encoding
• Dictionary coding uses groups of symbols, words, and phrases
with corresponding abbreviations.
• It transmits the index of the symbol/word instead of the word
itself.
• There are different variations of dictionary based coding:
– LZ77 (published in 1977)
– LZ78 (published in 1978)
– LZSS
– LZW (Lempel-Ziv-Welch)
LZW Compression
• LZW compression has its roots in the work of Jacob Ziv and
Abraham Lempel.
• In 1977, they published a paper on sliding-window compression,
and followed it with another paper in 1978 on dictionary based
compression.
• These algorithms were named LZ77 and LZ78, respectively.
Dictionary Encoding…
The Concept
• Many files, especially text files, have certain strings that
repeat very often, for example " the ".
• With the spaces, the string takes 5 bytes, or 40 bits to encode.
• But what if we were to add the whole string to the list of
characters?
• Then every time we came across " the ", we could send the
code instead of 32,116,104,101,32.
• This would take fewer bits.
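The five byte values can be checked directly, for example with this small illustrative Python snippet:

print([ord(c) for c in " the "])   # [32, 116, 104, 101, 32]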
Encoding
• Create a dictionary of the letters found in the message.

message = aababacbaacbaadaaa

Encoder Dictionary (initial)
Index   Entry
1       a
2       b
3       c
4       d
Encoder Dictionary (entries 1–4 as above)
Input (s+c)   Output   Index   Entry
aa            1        5       aa
ab            1        6       ab
ba            2        7       ba
aba           6        8       aba
ac            1        9       ac
cb            3        10      cb
baa           7        11      baa
acb           9        12      acb
baad          11       13      baad
da            4        14      da
aaa           5        15      aaa
a             1        –       –

Encoded output: 1 1 2 6 1 3 7 9 11 4 5 1
A second example uses the initial dictionary 1 = a, 2 = b, 3 = w; the encoder's output for it is the code sequence

message (encoded) = 31221461

which is decoded in the example further below.
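A minimal LZW encoder sketch in Python (illustrative; the function lzw_encode and its parameters are our own). The first call reproduces the trace above; the second call uses the message that the decoding example below recovers:

def lzw_encode(message, alphabet):
    # Initial dictionary: one entry per source letter, indexed from 1.
    dictionary = {ch: i for i, ch in enumerate(alphabet, start=1)}
    next_index = len(alphabet) + 1
    s, output = "", []
    for c in message:
        if s + c in dictionary:              # grow the current string while it is known
            s = s + c
        else:
            output.append(dictionary[s])     # emit the code of the longest known prefix
            dictionary[s + c] = next_index   # add the new string to the dictionary
            next_index += 1
            s = c
    output.append(dictionary[s])             # emit the code of the final string
    return output

print(lzw_encode("aababacbaacbaadaaa", "abcd"))  # [1, 1, 2, 6, 1, 3, 7, 9, 11, 4, 5, 1]
print(lzw_encode("wabbawabba", "abw"))           # [3, 1, 2, 2, 1, 4, 6, 1] -> 31221461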
Dictionary Encoding…
Decompression algorithm:

LZWDecoding()
  enter all the source letters into the dictionary;
  read priorCodeword and output the symbol corresponding to it;
  while codewords are still left
    read codeword;
    priorString = string(priorCodeword);
    if codeword is in the dictionary
      enter priorString + firstSymbol(string(codeword)) into the dictionary;
      output string(codeword);
    else
      enter priorString + firstSymbol(priorString) into the dictionary;
      output priorString + firstSymbol(priorString);
    priorCodeword = codeword;
  end loop
Dictionary Encoding…
Example:
• Let us decode the message 31221461.
• We start with the following dictionary.

Decoder Dictionary (initial)
Index   Entry
1       a
2       b
3       w

Message = 31221461

Input   Output   Index   Entry
3       w        –       –
1       a        4       wa
2       b        5       ab
2       b        6       bb
1       a        7       ba
4       wa       8       aw
6       bb       9       wab
1       a        10      bba

Decoded message: w a b b a wa bb a = wabbawabba
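The decoding procedure, including the special case where a codeword is not yet in the dictionary, can be sketched in Python as follows (an illustrative sketch; lzw_decode is our own name):

def lzw_decode(codes, alphabet):
    # Initial dictionary: index -> source letter.
    dictionary = {i: ch for i, ch in enumerate(alphabet, start=1)}
    next_index = len(alphabet) + 1
    prior = dictionary[codes[0]]             # output the symbol of the first codeword
    output = [prior]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                # codeword not yet in the dictionary
            entry = prior + prior[0]
        dictionary[next_index] = prior + entry[0]   # priorString + first symbol of current string
        next_index += 1
        output.append(entry)
        prior = entry
    return "".join(output)

print(lzw_decode([3, 1, 2, 2, 1, 4, 6, 1], "abw"))  # wabbawabba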
Huffman Compression
• The idea is to assign variable-length codes to
input characters; the lengths of the assigned
codes are based on the frequencies of the
corresponding characters.
• Huffman coding has the following properties:
– Codes for more probable characters are
shorter than ones for less probable
characters.
– Each code can be uniquely decoded
• To accomplish this, Huffman coding creates
what is called a Huffman tree, which is a
binary tree.
Huffman Compression…
• First, count the number of times each character appears,
and assign this as a weight/probability to each character,
or node.
• Add all the nodes to a list.
• Then, repeat these steps until there is only one node left:
– Find the two nodes with the lowest weights.
– Create a parent node for these two nodes.
– Give this parent node a weight of the sum of the two
nodes.
– Remove the two nodes from the list, and add the
parent node.
• This way, the nodes with the highest weight will be near
the top of the tree, and have shorter codes.
Huffman Compression…
Algorithm to create the tree
Assume the source alphabet S = {X1, X2, X3, …, Xn} and
associated probabilities P = {P1, P2, P3, …, Pn}
Huffman()
  for each letter create a tree with a single root node, and order all trees according to their probabilities;
  while more than one tree is left
    take the two trees with the lowest probabilities and merge them under a new parent whose probability is their sum;
    put the new tree back into the ordered list;
  end while
• Notice that since all the characters are at the leaves of the tree, there is
never a chance that one code will be the prefix of another one (e.g., a
situation such as a = 01 and b = 011 cannot occur).
• Hence, this unique prefix property assures that each code can be
uniquely decoded.
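A minimal Python sketch of the tree-building procedure (illustrative; it uses a priority queue via heapq, and the names are our own). Tie-breaking may yield code words that differ from those derived below, but the code lengths and total encoded size are the same:

import heapq

def huffman_codes(frequencies):
    # Heap entries are (weight, tie_breaker, tree); a tree is a symbol or a (left, right) pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)                      # two lowest-weight trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))   # merge under a new parent
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")      # going left is a 0
            walk(tree[1], prefix + "1")      # going right is a 1
        else:
            codes[tree] = prefix             # a code is completed at a leaf
    walk(heap[0][2], "")
    return codes

text = "Eerie eyes seen near lake."
freqs = {}
for ch in text:
    freqs[ch] = freqs.get(ch, 0) + 1
print(huffman_codes(freqs))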
Building a Tree
Scan the original text:
Eerie eyes seen near lake.
• What is the frequency of each character in the text?

Char:  E  i  y  l  k  .  r  s  n  a  sp  e
Freq:  1  1  1  1  1  1  2  2  2  2  4   8
Building a Tree…
• Repeatedly remove the two lowest-weight nodes, create a parent node whose weight is their sum, and put the parent back in the list:
1. E(1) + i(1) → 2
2. y(1) + l(1) → 2
3. k(1) + .(1) → 2
4. r(2) + s(2) → 4
5. n(2) + a(2) → 4
6. (E i)(2) + (y l)(2) → 4
7. (k .)(2) + sp(4) → 6
8. (r s)(4) + (n a)(4) → 8
9. (E i y l)(4) + (k . sp)(6) → 10
10. e(8) + (r s n a)(8) → 16
11. (10) + (16) → 26, the root of the finished Huffman tree
Encoding the File
Traverse Tree for Codes
• Perform a traversal of the tree to obtain the new code words.
• Going left is a 0; going right is a 1.
• A code word is only completed when a leaf node is reached.
Encoding the File
Traverse Tree for Codes

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
Encoding the File
• Rescan the text and encode the file using the new code words.

Eerie eyes seen near lake.

000010110000011001110001010110101111011010111001111101011111100011001111110100100101

(using the code table above: E = 0000, i = 0001, y = 0010, l = 0011, k = 0100,
. = 0101, space = 011, e = 10, r = 1100, s = 1101, n = 1110, a = 1111)
Encoding the File
Results
• Have we made things any better?
• 84 bits to encode the text.
• ASCII would take 8 × 26 = 208 bits.
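A short Python check of this result using the code table above (illustrative):

codes = {"E": "0000", "i": "0001", "y": "0010", "l": "0011", "k": "0100",
         ".": "0101", " ": "011", "e": "10", "r": "1100", "s": "1101",
         "n": "1110", "a": "1111"}
text = "Eerie eyes seen near lake."
encoded = "".join(codes[ch] for ch in text)
print(len(encoded))     # 84 bits with the Huffman code words
print(8 * len(text))    # 208 bits in plain 8-bit ASCII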