
Chapter 7

Data Compression
Introduction
• Data compression is the process of encoding
data using a representation that reduces the
overall size of the data.
• Definition: Data compression is the process of
encoding information using fewer bits so that:
– it takes less storage space (memory), or
– it requires less bandwidth during transmission.

• Two types of compression:


– Lossy data compression
– Lossless data compression
Introduction…
Lossless Data Compression:
• In lossless data compression, the original
content of the data is not lost or changed when
it is compressed (encoded).
• With lossless compression, every single bit of
data that was originally in the file remains
after the file is uncompressed.

Examples:
• RLE (Run Length Encoding)
• Dictionary Based Coding
• Arithmetic Coding
Introduction…
Lossy data compression:
• The original content of the data is lost to a
certain degree when compressed.
• Part of the data that is not very important is
discarded/lost.
• The loss factor determines whether there is a
loss of quality between the original image and
the image after it has been compressed and
played back (decompressed).
• The more compression, the more likely it is that
quality will be affected.
• Even if the quality difference is not noticeable,
the discarded data cannot be recovered.
Information Theory
• Information theory is defined to be the study
of efficient coding and its consequences.
• It is the field of study concerned about the
storage and transmission of data.
• It is concerned with source coding and
channel coding.
• Source coding: involves compression.
• Channel coding: how to transmit data, how to
overcome noise, etc.
• Data compression may be viewed as a branch
of information theory in which the primary
objective is to minimize the amount of data to
be stored or transmitted.
Compression Algorithms
• Compression methods use mathematical algorithms to
reduce data by eliminating, grouping and/or averaging
similar data found in the signal.
• There are various compression methods, including Motion
JPEG; however, only MPEG-1 and MPEG-2 are internationally recognized
standards for the compression of moving pictures (video).

• A simple characterization of data compression is that it


involves transforming a string of characters in some
representation (such as ASCII) into a new string which
contains the same information but whose length is as small
as possible.
• Data compression has important application in the areas of
data transmission and data storage.
Compression Algorithms…
• The proliferation of computer communication networks is
resulting in massive transfer of data over communication links.
• Compressing data to be stored or transmitted reduces storage
and/or communication costs.
• When the amount of data to be transmitted is reduced, the
effect is that of increasing the capacity of the communication
channel.

• Lossless compression is a method of reducing the size of


computer files without losing any information.
• That means when you compress a file, it will take up less space,
but when you decompress it, it will still have the exact same
information.
• The idea is to get rid of any redundancy in the information; this is
exactly the approach used in ZIP and GIF files.
• This differs from lossy compression, such as in JPEG files, which
loses some information that isn't very noticeable.
Compression Algorithms…
• You can use lossless compression whenever
space is a concern, but the information must
be the same.
• An example is when sending text files over a
modem or the Internet.
• If the files are smaller, they will get there
faster.
• However, the files received at the destination
must be identical to the ones that were sent.
• There are several popular algorithms for
lossless compression.
• There are also variations of most of them, as
summarized in the table below.
Compression Algorithms…

Family                   Variations                    Used in
Run-Length               none
Huffman                  Huffman, Adaptive Huffman,    MNP5, COMPACT, SQ
                         Shannon-Fano
Arithmetic               none
LZ78 (Lempel-Ziv 1978)   LZW (Lempel-Ziv-Welch)        GIF, v.42bis, compress
LZ77 (Lempel-Ziv 1977)   LZFG                          ZIP, ARJ, LHA
Variable Length Encoding

1. Shannon-Fano Coding
• A variable-length coding based on the frequency of occurrence
of each character.
HOW DOES IT WORK?
The steps of the algorithm are as follows:
1. Create a list of probabilities or frequency counts for the given
set of symbols so that the relative frequency of occurrence of
each symbol is known.
2. Sort the list of symbols in decreasing order of probability, the
most probable ones to the left and least probable to the right.
3. Split the list into two parts, with the total probability of both
the parts being as close to each other as possible.
4. Assign the value 0 to the left part and 1 to the right part.
5. Repeat the steps 3 and 4 for each part, until all the symbols
are split into individual subgroups.
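The recursive splitting described in these steps can be sketched in a few lines of Python. This is a minimal illustration only; the function name shannon_fano and its structure are ours, not part of the chapter:

    def shannon_fano(symbols):
        """symbols: list of (symbol, probability) pairs, sorted in decreasing
        probability. Returns a dict mapping each symbol to its code string."""
        codes = {s: "" for s, _ in symbols}

        def split(group):
            if len(group) < 2:
                return
            total = sum(p for _, p in group)
            # Find the split point that makes the two parts as equal as possible.
            running, best_i, best_diff = 0.0, 1, float("inf")
            for i in range(1, len(group)):
                running += group[i - 1][1]
                diff = abs(running - (total - running))
                if diff < best_diff:
                    best_i, best_diff = i, diff
            left, right = group[:best_i], group[best_i:]
            for s, _ in left:
                codes[s] += "0"      # 0 for the left part
            for s, _ in right:
                codes[s] += "1"      # 1 for the right part
            split(left)
            split(right)

        split(symbols)
        return codes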
Variable Length…
Example: Suppose the following source and with related probabilities
S = {A,B,C,D,E}
P = {0.35,0.17,0.17,0.16,0.15}
Message to be encoded = ”ABCDE”

• The probability is already arranged in non-increasing order.


• First we divide the symbols into {A,B} and {C,D,E}. Why?
• This gives the smallest difference between the total probabilities of
the two groups.
S1={A,B} P={0.35,0.17} = 0.52
S2={C,D,E} P={0.17,0.16,0.15} = 0.48
• The difference is only 0.52-0.48=0.04. This is the smallest possible
difference we can obtain when splitting the list.
Attach 0 to S1 and 1 to S2.
• Subdivide S1 into sub groups.
S11={A} attach 0 to this
S12={B} attach 1 to this
• Again subdivide S2 into subgroups considering the probability
again.
S21={C} P={0.17} = 0.17
S22={D,E} P={0.16,0.15} = 0.31
• Attach 0 to S21 and 1 to S22. Since S22 has more than one letter
in it, we have to subdivide it.
S221={D} attach 0
S222={E} attach 1

The message is transmitted using the following code (by


traversing the tree)
A=00 B=01
C=10 D=110
E=111
Instead of transmitting ABCDE, we transmit 000110110111.
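Running the sketch from the earlier slide on this example reproduces the same code table (the exact codes can depend on how ties in the split are broken):

    symbols = [("A", 0.35), ("B", 0.17), ("C", 0.17), ("D", 0.16), ("E", 0.15)]
    codes = shannon_fano(symbols)
    # Expected: {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
    encoded = "".join(codes[s] for s in "ABCDE")
    print(encoded)   # 000110110111  (12 bits instead of 5 * 8 = 40 ASCII bits)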
Example: Construct Shannon-Fano codes for the symbols
A, B, C, D, E with probabilities
P(A) = 0.22, P(B) = 0.28, P(C) = 0.15, P(D) = 0.30, P(E) = 0.05,
using the Shannon-Fano lossless compression technique.

• Solution:
• Let P(x) be the probability of occurrence of symbol x.
1. Upon arranging the symbols in decreasing
order of probability and splitting:
P(D) + P(B) = 0.30 + 0.28 = 0.58 and
P(A) + P(C) + P(E) = 0.22 + 0.15 + 0.05 = 0.42

{D, B} and {A, C, E}


• and assign them the values 0 and 1 respectively.
2. Now, in the {D, B} group,
P(D) = 0.30 and P(B) = 0.28,
so divide {D, B} into {D} and {B} and assign 0 to D
and 1 to B.
3. In the {A, C, E} group,
P(A) = 0.22 and P(C) + P(E) = 0.20.
So the group is divided into
{A} and {C, E},
and they are assigned the values 0 and 1 respectively.
4. In {C, E} group
P(C) = 0.15 and P(E) = 0.05
So divide them into {C} and {E} and assign 0 to {C}
and 1 to {E}
Note: The splitting is now stopped as each symbol is
separated now.
The Shannon-Fano codes for the set of symbols are:
D = 00, B = 01, A = 10, C = 110, E = 111
Variable Length…
Decoding
• For decoding, the decoder is supplied with a binary
tree which it uses to decode the bit stream that is
the compressed data.
• The decoder traverses the tree once for each
compressed character.
• This is a simple and therefore fast operation for a
computer to execute.
Variable Length…
To decode 000110110111
• 0-left, 0-left => A
• 0-left, 1-right => B
• 1-right, 0-left => C
• 1-right, 1-right, 0-left => D
• 1-right, 1-right, 1-right => E
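Decoding a prefix code does not require rebuilding the tree explicitly; walking the code table bit by bit gives the same result. A small illustrative sketch of our own, not from the chapter:

    def decode_prefix(bits, codes):
        """bits: string of '0'/'1'; codes: dict mapping symbol -> code string."""
        reverse = {code: sym for sym, code in codes.items()}
        out, current = [], ""
        for b in bits:
            current += b
            if current in reverse:     # a complete codeword has been read
                out.append(reverse[current])
                current = ""           # start again from the "root"
        return "".join(out)

    codes = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
    print(decode_prefix("000110110111", codes))   # ABCDE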
• Example: compress the following
message “X1X2X3X4X5” given the
following table of probability
symbol   probability
X1       0.5
X2       0.25
X3       0.125
X4       0.0625
X5       0.0625

symbol   count   code   # of bits
X1       1       0      1
X2       1       10     2
X3       1       110    3
X4       1       1110   4
X5       1       1111   4
Variable Length…
• Exercise: compress the following message “ABCDE”
given the following table of frequency
Symbol A B C D E
Count 15 7 6 6 5

Probabilities 0.38 0.18 0.15 0.15 0.14

• Result
Symbol A B C D E
Code 00 01 10 110 111
Dictionary Encoding
• Dictionary coding uses groups of symbols, words, and phrases
with corresponding abbreviations (indexes).
• It transmits the index of the symbol/word instead of the word
itself.
• There are different variations of dictionary based coding:
– LZ77 (published in 1977)
– LZ78 (published in 1978)
– LZSS
– LZW (Lempel-Ziv-Welch)

LZW Compression
• LZW compression has its roots in the work of Jacob Ziv and
Abraham Lempel.
• In 1977, they published a paper on sliding-window compression,
and followed it with another paper in 1978 on dictionary based
compression.
• These algorithms were named LZ77 and LZ78, respectively.
Dictionary Encoding…
The Concept
• Many files, especially text files, have certain strings that
repeat very often, for example " the ".
• With the spaces, the string takes 5 bytes, or 40 bits to encode.
• But what if we were to add the whole string to the list of
characters?
• Then every time we came across " the ", we could send the
code instead of 32,116,104,101,32.
• This would take fewer bits.

• This is exactly the approach that LZW compression takes. It
starts with a dictionary of all the single characters, with indexes
0-255.
• It then starts to expand the dictionary as information gets sent
through.
• Then, redundant strings will be coded, and compression has
occurred.
Dictionary Encoding…
set w = NIL
loop
    read a character k
    if wk exists in the dictionary
        w = wk
    else
        output the code for w
        add wk to the dictionary
        w = k
endloop
The program reads one character at a time. If the work string plus that
character (wk) is in the dictionary, it adds the character to the current work
string and waits for the next one. This occurs on the first character as well.
If the extended work string is not in the dictionary (such as when the second
character comes along), it sends over the wire (or writes to a file) the code
assigned to the work string without the new character, adds the extended work
string to the dictionary, and then starts a new work string consisting only of
the new character.
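A runnable version of this pseudocode in Python might look as follows. This is a sketch using a small explicit alphabet rather than the full 0-255 byte dictionary; the function name lzw_compress is ours:

    def lzw_compress(message, alphabet):
        """alphabet: the initial single characters; returns a list of codes."""
        dictionary = {ch: i for i, ch in enumerate(alphabet, start=1)}
        next_index = len(dictionary) + 1
        w = ""
        output = []
        for k in message:
            wk = w + k
            if wk in dictionary:
                w = wk                         # keep growing the work string
            else:
                output.append(dictionary[w])   # emit code for longest known string
                dictionary[wk] = next_index    # add the new string to the dictionary
                next_index += 1
                w = k                          # restart the work string
        if w:
            output.append(dictionary[w])       # flush whatever is left
        return output

    print(lzw_compress("aababacbaacbaadaaa", "abcd"))
    # [1, 1, 2, 6, 1, 3, 7, 9, 11, 4, 5, 1]
    print(lzw_compress("wabbawabba", "abw"))
    # [3, 1, 2, 2, 1, 4, 6, 1]

These two calls correspond to the worked examples on the following slides.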
Dictionary Encoding…
Example:
• Encode the message aababacbaacbaadaaa using the
above algorithm

Encoding
• Create dictionary of letters found in the message
Encoder Dictionary
Input Output Index Entry
1 a
2 b
3 c
4 d
message = aababacbaacbaadaaa

Encoder                  Dictionary
Input (s+c)   Output     Index   Entry
                         1       a
                         2       b
                         3       c
                         4       d
aa            1          5       aa
ab            1          6       ab
ba            2          7       ba
aba           6          8       aba
ac            1          9       ac
cb            3          10      cb
baa           7          11      baa
acb           9          12      acb
baad          11         13      baad
da            4          14      da
aaa           5          15      aaa
a             1

Coded message = 1 1 2 6 1 3 7 9 11 4 5 1


Example 2
• Encode the message wabbawabba
using the Lempel-Ziv-Welch algorithm.
Initial dictionary:
Index   Entry
1       a
2       b
3       w

Coded message = 3 1 2 2 1 4 6 1
Dictionary Encoding…
Decompression algorithm:
LZWDecoding()
    enter all the source letters into the dictionary;
    read priorCodeword and output the symbol corresponding to it;
    while codewords are still left
        read codeword;
        priorString = string(priorCodeword);
        if codeword is in the dictionary
            enter in the dictionary priorString + firstSymbol(string(codeword));
            output string(codeword);
        else
            enter in the dictionary priorString + firstSymbol(priorString);
            output priorString + firstSymbol(priorString);
        priorCodeword = codeword;
    end loop
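The same logic in runnable Python (again a sketch with a small explicit alphabet; the else-branch handles codewords the decoder has not entered into its dictionary yet):

    def lzw_decompress(codes, alphabet):
        """codes: list of integer codewords; alphabet: the initial characters."""
        dictionary = {i: ch for i, ch in enumerate(alphabet, start=1)}
        next_index = len(dictionary) + 1
        prior = dictionary[codes[0]]
        output = [prior]
        for code in codes[1:]:
            if code in dictionary:
                entry = dictionary[code]
            else:
                # Codeword created by the encoder in the very step it was used:
                entry = prior + prior[0]
            output.append(entry)
            dictionary[next_index] = prior + entry[0]   # priorString + first symbol
            next_index += 1
            prior = entry
        return "".join(output)

    print(lzw_decompress([3, 1, 2, 2, 1, 4, 6, 1], "abw"))   # wabbawabba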
Dictionary Encoding…
Example:
• Let us decode the message 3 1 2 2 1 4 6 1.
• We will start with the following table.

Decoder          Dictionary
Input   Output   Index   Entry
                 1       a
                 2       b
                 3       w
Message = 31221461
Decoder          Dictionary
Input   Output   Index   Entry
                 1       a
                 2       b
                 3       w
3       w
1       a        4       wa
2       b        5       ab
2       b        6       bb
1       a        7       ba
4       wa       8       aw
6       bb       9       wab
1       a        10      bba
Huffman Compression
• The idea is to assign variable-length codes to
input characters, lengths of the assigned
codes are based on the frequencies of
corresponding characters.
• Huffman coding has the following properties:
– Codes for more probable characters are
shorter than ones for less probable
characters.
– Each code can be uniquely decoded
• To accomplish this, Huffman coding creates
what is called a Huffman tree, which is a
binary tree.
Huffman Compression…
• First count the number of times each character appears,
and assign this as a weight/probability to each character,
or node.
• Add all the nodes to a list.
• Then, repeat these steps until there is only one node left:
– Find the two nodes with the lowest weights.
– Create a parent node for these two nodes.
– Give this parent node a weight of the sum of the two
nodes.
– Remove the two nodes from the list, and add the
parent node.
• This way, the nodes with the highest weight will be near
the top of the tree, and have shorter codes.
Huffman Compression…
Algorithm to create the tree
Assume the source alphabet S = {X1, X2, X3, …,Xn} and
Associated Probabilities P = {P1, P2, P3,…, Pn}

Huffman()
    for each letter, create a tree with a single root node and order all trees
        according to the probability of occurrence of the letter;
    while more than one tree is left
        take the two trees t1 and t2 with the lowest probabilities p1 and p2 and
        create a tree with probability p1+p2 in its root and with t1 and t2 as
        its subtrees;
    associate 0 with each left branch and 1 with each right branch;
    create a unique codeword for each letter by traversing the tree from the
        root to the leaf corresponding to that letter and putting together all
        the 0s and 1s encountered;
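A compact Python sketch of this procedure, using a heap as the priority queue. The name build_huffman_codes and the tuple representation of trees are ours; the exact 0/1 assignment may differ from the tree drawn in the example, but the code lengths are the same:

    import heapq
    from itertools import count

    def build_huffman_codes(freqs):
        """freqs: dict symbol -> probability (or count). Returns symbol -> code."""
        tiebreak = count()   # keeps heap comparisons well-defined for equal weights
        heap = [(p, next(tiebreak), sym) for sym, p in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)      # two lowest-probability trees
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))
        _, _, root = heap[0]

        codes = {}
        def walk(node, code):
            if isinstance(node, tuple):          # internal node: (left, right)
                walk(node[0], code + "0")
                walk(node[1], code + "1")
            else:                                # leaf: the symbol itself
                codes[node] = code or "0"
        walk(root, "")
        return codes

    print(build_huffman_codes({"A": 0.15, "B": 0.16, "C": 0.17, "D": 0.17, "E": 0.35}))
    # E gets a 1-bit code; A, B, C and D get 3-bit codes.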
Example: Suppose the following source and related
probability
S={A,B,C,D,E}
P={0.15,0.16,0.17,0.17,0.35}
Message=”abcde”
Huffman Compression…
• To read the codes from a Huffman tree, start from the root and add a 0
every time you go left to a child, and add a 1 every time you go right.
• So in this example, the code for the character b is 001 and the code for
e is 1.
• As you can see, e (the most probable symbol) has a shorter code than b.

• Notice that since all the characters are at the leaves of the tree, there is
never a chance that one code will be the prefix of another one (e.g. a
situation such as a being 01 and b being 011 cannot occur).
• Hence, this unique prefix property assures that each code can be
uniquely decoded.

• The code for each letter is:


a=000 b=001
c=010 d=011
e=1

• The original message will be encoded to:


abcde=0000010100111
Building a Tree
Scan the original text
• Consider the following short text:

Eerie eyes seen near lake.

• Count up the occurrences of all


characters in the text
Building a Tree
Scan the original text

Eerie eyes seen near lake.


• What characters are present?

E e r i space
y s n a l k .
Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What is the frequency of each character in the
text?

Char    Freq.     Char    Freq.     Char    Freq.
E       1         y       1         k       1
e       8         s       2         .       1
r       2         n       2
i       1         a       2
space   4         l       1
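Counting the frequencies is a one-liner in Python with collections.Counter, shown here only to make the numbers above easy to reproduce:

    from collections import Counter

    text = "Eerie eyes seen near lake."
    freqs = Counter(text)
    print(freqs)
    # Counter({'e': 8, ' ': 4, 'r': 2, 's': 2, 'n': 2, 'a': 2,
    #          'E': 1, 'i': 1, 'y': 1, 'l': 1, 'k': 1, '.': 1})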
Building a Tree

• Create binary tree nodes with character and


frequency of each character
• Place nodes in a priority queue
– The lower the occurrence, the higher the priority
in the queue

The queue after inserting all nodes

E i y l k . r s n a sp e
1 1 1 1 1 1 2 2 2 2 4 8
Building a Tree
(The following slides show the tree being built step by step. At each step the
two lowest-weight nodes are removed from the queue, joined under a new parent
whose weight is their sum, and the parent is put back into the queue:
E+i -> 2, y+l -> 2, k+. -> 2, r+s -> 4, n+a -> 4, (E i)+(y l) -> 4,
(k .)+space -> 6, (r s)+(n a) -> 8, then 4+6 -> 10, 8+e -> 16, and
finally 10+16 -> 26. The process ends with a single tree whose root has
weight 26.)
Encoding the File
Traverse Tree for Codes
• Perform a traversal of the tree to obtain the new code words.
• Going left is a 0, going right is a 1.
• A code word is only completed when a leaf node is reached.
Encoding the File
Traverse Tree for Codes

Char     Code
E        0000
i        0001
y        0010
l        0011
k        0100
.        0101
space    011
e        10
r        1100
s        1101
n        1110
a        1111
Encoding the File
• Rescan the text and encode the file using the new code words:

Eerie eyes seen near lake.

000010110000011001110001010110101111011010
111001111101011111100011001111110100100101
Encoding the File
Results
• Have we made things any better?
• The Huffman code needs 84 bits to encode the text (the sum of
frequency x code length over all characters).
• ASCII would take 8 * 26 = 208 bits.
• If a modified fixed-length code of 4 bits per character were used
(the 12 distinct characters fit in 4 bits), the total would be
4 * 26 = 104 bits.
• So the Huffman code is the smallest of the three.
Huffman Compression….
• To decode a message coded by Huffman coding,
a conversion table has to be known by the
receiver.
• Using this table, a tree can be constructed with
the same paths as the tree used for coding.
• Leaves store letters instead of probabilities, for
efficiency.
• The decoder can then use the Huffman tree to
decode the bit string by following the paths it
indicates and emitting a character every time a
leaf is reached.
Huffman Compression…
• How can the encoder let the decoder know which
particular coding tree has been used?
• Two ways:
– Both agree on a particular Huffman tree in
advance and use it for every message.
– The encoder constructs the Huffman tree afresh
every time a new message is sent and sends
the conversion table along with the message.
This is more versatile, but has the additional
overhead of sending the conversion table; for
large data, however, the saving still outweighs it.
The Algorithm
Move left if you get a 0.
Move right if you get a 1.
If you reach a leaf node, output the letter stored there.
Go back to the root and continue with the remaining bits.
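This walk can be written directly over a tree of nested tuples with symbols at the leaves (an illustrative sketch of our own, not the chapter's code; the tree literal below matches the a-e example from the earlier slides):

    def huffman_decode(bits, root):
        """bits: string of '0'/'1'; root: nested (left, right) tuples with
        symbols at the leaves."""
        out = []
        node = root
        for b in bits:
            node = node[0] if b == "0" else node[1]   # left on 0, right on 1
            if not isinstance(node, tuple):           # reached a leaf
                out.append(node)
                node = root                           # start again from the root
        return "".join(out)

    tree = ((("a", "b"), ("c", "d")), "e")            # a=000 b=001 c=010 d=011 e=1
    print(huffman_decode("0000010100111", tree))      # abcde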
Reading Assignment : Arithmetic Coding
