
UNIT 1

BASICS OF DATA COMPRESSION


Multimedia:
It is the use of computers to present and combine text, graphics, audio, video and
animation in digital format

Basic Elements of Multimedia


Text, Images, Audio, Video and Animation

Compression:
It is the method of reducing the size of a digital file so that it occupies less space.
Types of Compression:

There are two types of compression: lossless and lossy.

Lossless compression algorithms reduce the size of files without losing any information in
the file, which means that we can reconstruct the original data from the compressed file.

Lossy compression algorithms reduce the size of files by discarding the less important
information in a file, which can significantly reduce file size but also affect file quality.

Difference between Lossy and Lossless Compression

1. Lossy compression eliminates the data which is not noticeable, while lossless compression does not eliminate any data.

2. In lossy compression, a file cannot be restored or rebuilt in its original form, while in lossless compression a file can be restored to its original form.

3. In lossy compression the data's quality is compromised, but lossless compression does not compromise the data's quality.

4. Lossy compression reduces the size of the data to a much greater extent, while lossless compression reduces it only slightly.

5. Algorithms used in lossy compression are transform coding, Discrete Cosine Transform, Discrete Wavelet Transform, fractal compression, etc.; algorithms used in lossless compression are Run Length Encoding, Lempel-Ziv-Welch, Huffman coding, arithmetic encoding, etc.

6. Lossy compression is used for images, audio and video; lossless compression is used for text, images and sound.

7. Lossy compression has more data-holding capacity; lossless compression has less data-holding capacity than the lossy technique.

8. Lossy compression is also termed irreversible compression; lossless compression is also termed reversible compression.

Advantages and Disadvantages of Lossy and Lossless Compression:

Lossy
Pros: small file sizes, ideal for web use; lots of tools, plugins and software support it.
Cons: quality degrades at higher rates of compression.

Lossless
Pros: no loss in quality; only a slight decrease in file size.
Cons: compressed files are larger than lossy files.

Methods used for Lossy and Lossless compression

Algorithms used in Lossy compression are:

Transform coding,

Discrete Cosine Transform,

Discrete Wavelet Transform,

fractal compression etc.

Algorithms used in Lossless compression are:

Run Length Encoding,

Lempel-Ziv-Welch,

Huffman Coding,

Arithmetic encoding etc

Run Length Encoding

Run-length encoding (RLE) is a form of lossless compression. RLE is a simple method of compressing data by specifying the number of times a character or pixel colour repeats, followed by the value of the character or pixel. The aim is to reduce the number of bits used to represent a set of data.
1. Encode AAAAAABBBBBCCCCCCCCDDEEEEEFFF using RLE

Output:

6A 5B 8C 2D 5E 3F

2. Encode
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWW
WWWWWWWWWBWWWWWWWWWWWWWW
Output
12W 1B 12W 3B 24W 1B 14W
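The RLE idea above can be sketched in a few lines of Python; the function name rle_encode and the count-then-symbol output format are illustrative choices, not part of any standard.

def rle_encode(data: str) -> str:
    """Run-length encode a string as count+symbol pairs, e.g. 'AAAB' -> '3A 1B'."""
    if not data:
        return ""
    pairs = []
    run_char, run_len = data[0], 1
    for ch in data[1:]:
        if ch == run_char:
            run_len += 1                          # extend the current run
        else:
            pairs.append(f"{run_len}{run_char}")  # close the finished run
            run_char, run_len = ch, 1
    pairs.append(f"{run_len}{run_char}")          # close the last run
    return " ".join(pairs)

print(rle_encode("AAAAAABBBBBCCCCCCCCDDEEEEEFFF"))  # 6A 5B 8C 2D 5E 3F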

DICTIONARY-BASED ALGORITHMS

The compression techniques we have seen so far replace individual symbols with variable-length codewords. In dictionary compression, variable-length substrings are replaced by short, possibly even fixed-length, codewords. Compression is achieved by replacing long strings with shorter codewords.

The general scheme is as follows:

• The dictionary D is a collection of strings. For completeness, the dictionary includes all single symbols.

• The text T is parsed into a sequence of phrases: T = T1 T2 . . . Tz.

• The text is encoded by replacing each phrase Ti with a code that acts as a pointer to the
dictionary.

LEMPEL ZIV

 Lossless compression
 Dictionary based compression
 For short inputs, the encoded string can be longer than the input stream (as in Example 1 below)

Example 1

Let the input be AABABBBABAABABBBABBABB, with A = 0 and B = 1.

POSITION  SEQUENCE  NUMERICAL REPRESENTATION  CODE
1         A         ΦA                        0000
2         AB        1B                        0011
3         ABB       2B                        0101
4         B         ΦB                        0001
5         ABA       2A                        0100
6         ABAB      5B                        1011
7         BB        4B                        1001
8         ABBA      3A                        0110
9         BB        7                         0111

In the numerical representation, the number denotes the dictionary position of the longest previously seen phrase and Φ denotes "no previous phrase". In the 4-bit code, the first 3 bits give that position and the last bit gives the new letter (A = 0, B = 1); the final phrase BB exactly matches dictionary entry 7, so only its position 7 = 0111 is sent.

Example 2: 01000101110010100101
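Below is a minimal Python sketch of the LZ78-style parsing used in Example 1; it can also be used to check Example 2. Function and variable names are illustrative. Each output pair corresponds to the numerical-representation column (index 0 plays the role of Φ), and a trailing phrase that exactly matches a dictionary entry is emitted with an empty second component, like the final 7 above.

def lz78_encode(text: str):
    """Parse text into (dictionary index, new symbol) pairs, LZ78 style."""
    dictionary = {}          # phrase -> index (1-based); index 0 means the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                                    # keep extending the match
        else:
            pairs.append((dictionary.get(phrase, 0), ch))   # longest known phrase + new symbol
            dictionary[phrase + ch] = len(dictionary) + 1   # add the new phrase
            phrase = ""
    if phrase:                                              # input ended inside a known phrase
        pairs.append((dictionary[phrase], ""))
    return pairs

print(lz78_encode("AABABBBABAABABBBABBABB"))
# [(0, 'A'), (1, 'B'), (2, 'B'), (0, 'B'), (2, 'A'), (5, 'B'), (4, 'B'), (3, 'A'), (7, '')]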

LEMPEL ZIV WELCH

 Common technique
 Used in GIF, and optionally in TIFF and PDF
 Simple
 High throughput in hardware
 Widely used in Unix file compression (the compress utility)
 Can effectively double the capacity of storage or transmission hardware
 Scans a file for data patterns that appear more than once
 Repeated patterns are stored in a dictionary

Example 1: Encode "TOBEORNOTTOBEORTOBEORNOT#"

There are 26 symbols in the plaintext alphabet (the capital letters A through Z). # is used to
represent a stop code

The initial dictionary, then, consists of the following entries:
Symbol Decimal Symbol Decimal
# 0 N 14
A 1 O 15
B 2 P 16
C 3 Q 17
D 4 R 18
E 5 S 19
F 6 T 20
G 7 U 21
H 8 V 22
I 9 W 23
J 10 X 24
K 11 Y 25
L 12 Z 26
M 13

OUTPUT:

Extended Dictionary   CODE   Description
27 TO    20   CODE OF FIRST LETTER T
28 OB    15   CODE OF FIRST LETTER O (OVERLAPPING WITH THE SECOND LETTER OF THE PREVIOUS PAIR TO)
29 BE    2    CODE OF FIRST LETTER B
30 EO    5    CODE OF FIRST LETTER E
31 OR    15   CODE OF FIRST LETTER O
32 RN    18   CODE OF FIRST LETTER R
33 NO    14   CODE OF FIRST LETTER N
34 OT    15   CODE OF FIRST LETTER O
35 TT    20   CODE OF FIRST LETTER T
36 TOB   27   CODE OF FIRST TWO LETTERS TO
37 BEO   29   CODE OF FIRST TWO LETTERS BE
38 ORT   31   CODE OF FIRST TWO LETTERS OR
39 TOBE  36   CODE OF FIRST THREE LETTERS TOB
40 EOR   30   CODE OF FIRST TWO LETTERS EO
41 RNO   32   CODE OF FIRST TWO LETTERS RN
42 OT#   34   CODE OF FIRST TWO LETTERS OT

Finally, the remaining stop symbol # is emitted with its own code 0.

Thus the code is 20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0
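A minimal Python sketch of the LZW encoder used in this example is given below; the function name lzw_encode is an illustrative choice, and the initial dictionary (# = 0, A to Z = 1 to 26) is the one defined above.

def lzw_encode(text, initial_dict):
    """LZW encoding: emit the code of the longest already-known string,
    then add 'known string + next symbol' to the dictionary."""
    dictionary = dict(initial_dict)          # string -> code
    next_code = max(dictionary.values()) + 1
    w = ""
    output = []
    for ch in text:
        if w + ch in dictionary:
            w += ch                          # longest match keeps growing
        else:
            output.append(dictionary[w])
            dictionary[w + ch] = next_code   # extend the dictionary
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])         # flush the final match
    return output

# Initial dictionary of the example: '#' = 0, 'A'..'Z' = 1..26
init = {"#": 0}
init.update({chr(ord("A") + i): i + 1 for i in range(26)})
print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", init))
# [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]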

Example 2:

Encode wabba_wabba_wabba_wabba_w

Initial given dictionary

Index Entry
1 _
2 a
3 b
4 o
5 w
Solution:

Extended Dictionary   CODE   Description
6 wa 5 CODE OF FIRST LETTER w
7 ab 2 CODE OF FIRST LETTER a
8 bb 3 CODE OF FIRST LETTER b
9 ba 3 CODE OF FIRST LETTER b
10 a_ 2 CODE OF FIRST LETTER a
11 _w 1 CODE OF FIRST LETTER _
12 wab 6 CODE OF FIRST TWO LETTERS wa
13 bba 8 CODE OF FIRST TWO LETTERS bb
14 a_w 10 CODE OF FIRST TWO LETTERS a_
15 wabb 12 CODE OF FIRST THREE LETTERS wab
16 ba_ 9 CODE OF FIRST TWO LETTERS ba
17 _wa 11 CODE OF FIRST TWO LETTERS _w
18 abb 7 CODE OF FIRST TWO LETTERS ab
19 ba_w 16 CODE OF FIRST THREE LETTERS ba_
Thus the code is 5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5 (the final 5 encodes the trailing w)
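For completeness, here is a matching Python sketch of the LZW decoder; it rebuilds the dictionary from the code stream alone and reproduces the text of Example 2. Names are illustrative, and the else branch handles the special case where a code refers to the entry the encoder has only just created.

def lzw_decode(codes, initial_dict):
    """LZW decoding: rebuild the dictionary on the fly from the code stream."""
    dictionary = dict(initial_dict)     # code -> string
    next_code = max(dictionary) + 1
    prev = dictionary[codes[0]]
    result = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                           # code not in the dictionary yet (KwKwK case)
            entry = prev + prev[0]
        result.append(entry)
        dictionary[next_code] = prev + entry[0]   # same entry the encoder added
        next_code += 1
        prev = entry
    return "".join(result)

init = {1: "_", 2: "a", 3: "b", 4: "o", 5: "w"}
print(lzw_decode([5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5], init))
# wabba_wabba_wabba_wabba_w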

Example 3:

Encode geekific-geeficf

Index (ASCII)   Entry
45              -
99              c
101             e
102             f
103             g
105             i
107             k

The full initial dictionary contains all 256 single-character ASCII symbols (codes 0 to 255), so new entries start at index 256; only the symbols used in this example are shown above.

Extended Dictionary   CODE   Description
256 ge    103   CODE OF FIRST LETTER g
257 ee    101   CODE OF FIRST LETTER e
258 ek    101   CODE OF FIRST LETTER e
259 ki    107   CODE OF FIRST LETTER k
260 if    105   CODE OF FIRST LETTER i
261 fi    102   CODE OF FIRST LETTER f
262 ic    105   CODE OF FIRST LETTER i
263 c-    99    CODE OF FIRST LETTER c
264 -g    45    CODE OF FIRST LETTER -
265 gee   256   CODE OF FIRST TWO LETTERS ge
266 ef    101   CODE OF FIRST LETTER e
267 fic   261   CODE OF FIRST TWO LETTERS fi
268 cf    99    CODE OF FIRST LETTER c

Thus the codeword is 103, 101, 101, 107, 105, 102, 105, 99, 45, 256, 101, 261, 99, 102 (the final 102 encodes the trailing f)


HUFFMAN CODING

Huffman coding is an algorithm for compressing data with the aim of reducing its size
without losing any of the details. This algorithm was developed by David Huffman.
Huffman coding is typically useful for the case where data that we want to compress has
frequently occurring characters in it.
It is a lossless compression.
It is a variable length code.
Variable length is based on the frequencies of the characters.
Variable length codes are assigned to input characters.
They are prefix codes.
Huffman Coding step-by-step working, or creation of the Huffman tree, is as follows (a code sketch of the same procedure is given after the list):
 Step-1: Calculate the frequency of each character.
 Step-2: Sort all the characters on the basis of their frequency in ascending order.
 Step-3: Mark each unique character as a leaf node.
 Step-4: Create a new internal node.
 Step-5: Set the frequency of the new node to the sum of the frequencies of the two nodes with the lowest frequencies.
 Step-6: Make the first of these nodes the left child and the other node the right child of the newly created node.
 Step-7: Repeat steps 2 to 6 until only one node (the root) remains.
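A minimal Python sketch of this procedure is shown below, using a heap to keep the nodes ordered by frequency. The exact codes it prints can differ from the worked examples when frequencies tie, but the total encoded length is the same. The function name huffman_codes is an illustrative choice.

import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code table following the steps above.
    A heap item is (frequency, tie_breaker, tree); a tree is either a
    single character (leaf) or a [left, right] pair (internal node)."""
    freq = Counter(text)                                   # Step 1: frequencies
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)                                    # Step 2: order by frequency
    tie = len(heap)
    while len(heap) > 1:                                   # Steps 4 to 7: merge two smallest
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, [left, right]))
        tie += 1
    codes = {}
    def walk(node, code):                                  # assign 0 to left, 1 to right edges
        if isinstance(node, list):
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code or "0"                      # single-symbol corner case
    walk(heap[0][2], "")
    return codes

print(huffman_codes("go go gophers"))   # the string from Example 2 below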
Example:
Suppose a data file has five characters with frequencies 12, 13, 15, 7 and 9 (56 characters in total; the original frequency table is not reproduced here). If Huffman coding is used, calculate:
 Huffman Code of each character
 Average code length
 Length of Huffman encoded data

Solution:
Initially, create the Huffman tree by repeatedly combining the two nodes with the smallest frequencies (Steps 2 to 5; the intermediate tree diagrams are not reproduced here).
The resulting tree is the Huffman tree.
Now, assign weights to all the edges: assign "0" to all left edges and "1" to all right edges (the final labelled tree is not reproduced).

Huffman Code of each character:


A: 00
B: 10
C: 110
D: 01
E: 111
Average code length per character = ∑(frequencyi x code lengthi)/ ∑ frequencyi
= {(12 x 2) + (13 x 2)+ (15 x 2)+ ( 7 x 3)+ (9 x 3)} / (12 + 13 + 15 + 7+ 9)
= (24+ 26+ 30 + 21 + 27)/56
= 128/56
≈ 2.29
Average code length per character ≈ 2.29 bits
Total number of bits in the Huffman encoded message
= sum of (frequency x code length) = 24 + 26 + 30 + 21 + 27
= 128 bits
(equivalently, 56 characters x 2.29 bits per character ≈ 128 bits)

Example :2

Encode "go go gophers" using Huffman coding.

SYMBOL   FREQUENCY
g        3
o        3
p        1
e        1
h        1
r        1
s        1
space    2

Step 1: Arrange the symbols in ascending order of frequency.

Steps 2 to 8: Repeatedly add the two smallest frequencies, reinsert the sum, and arrange in ascending order again, building the tree upwards (the intermediate tree diagrams are not reproduced here). In the final tree, assign the value 0 to every left edge and 1 to every right edge.

Step 9: For every symbol, read off its code by tracing the edge values from the top of the tree down to the symbol's leaf.

Symbol   Frequency of the string   Code   Size
g        3                         00     3 x 2 = 6
o        3                         01     3 x 2 = 6
s        1                         100    1 x 3 = 3
e        1                         1100   1 x 4 = 4
h        1                         1101   1 x 4 = 4
p        1                         1110   1 x 4 = 4
r        1                         1111   1 x 4 = 4
space    2                         101    2 x 3 = 6
Totals:  8 x 4 = 32 bits           13 bits       37 bits

Before compression 13 x 8 = 104 bits

Total size after compression = 32+13+37 = 82 bits
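The sizes above can be checked with a short Python sketch; the code table is the one from the worked example, and the 32-bit and 13-bit overhead figures are simply taken from the notes as given.

# Code table from the worked example above (space written as " ")
codes = {"g": "00", "o": "01", "s": "100", "e": "1100",
         "h": "1101", "p": "1110", "r": "1111", " ": "101"}
message = "go go gophers"

encoded_bits = sum(len(codes[ch]) for ch in message)   # length of the encoded message
original_bits = len(message) * 8                       # 13 characters x 8 bits

print(original_bits)                  # 104 bits before compression
print(encoded_bits)                   # 37 bits, as in the Size column
print(32 + 13 + encoded_bits)         # 82 bits total, using the notes' 32 + 13 bit overhead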

Example 3:

Find the Huffman codeword of the given text "AAAAAAAAAABBBBBCCCSS" by using a static Huffman tree. Calculate the entropy and derive the average number of bits per character for the codeword.

The probabilities are A = 10/20, B = 5/20, C = 3/20 and S = 2/20, and the Huffman code lengths are 1, 2, 3 and 3 bits respectively.

Avg L = Σ Li x P(i)
= (10/20 x 1) + (5/20 x 2) + (3/20 x 3) + (2/20 x 3) = (10 + 10 + 9 + 6)/20 = 35/20 = 1.75 bits per character

Entropy H = -Σ Pi log2 Pi = -(0.5 log2 0.5 + 0.25 log2 0.25 + 0.15 log2 0.15 + 0.1 log2 0.1) ≈ 1.74 bits per character

Space required to store the original message = 20 x 8 = 160 bits
Space required to store the encoded message = 35 bits

Example4:

Encode BCAADDDCCACACAC using Huffman coding.

HUFFMAN CODING
Entropy is the average level of information, surprise, or uncertainty inherent in the variable's possible outcomes.

H[X] = -Σ Pi log2 Pi

Using common (base-10) logarithms: H[X] = -(1/0.3010) Σ Pi log10 Pi

Average length L = Σ Pi x Ni, where Ni is the codeword length of symbol i

Efficiency η = H/L

Redundancy = 1 - η

Procedure
Example: Encode using Huffman coding P0=0.4, P1=0.2, P2=0.1, P3=0.2 and P4=0.1

Step 1: Arrange the probabilities in Descending order

MSG PROB

P0 0.4
P1 0.2
P3 0.2
P2 0.1
P4 0.1
Step2: Combine ( add ) last 2 probabilities and arrange in descending order

Step3: Combine ( add ) last 2 probabilities and arrange in descending order

Repeat until all the probabilities are combined.


Step 4: Assign values 0 and 1 to all elements as shown

Step 5: Obtain codes by tracing

P0=00

P1=01

P2=010

P3=11

P4=110

SYMBOL PROB CODE INVERSE CODE CODE LENGTH


P0 0.4 00 00 2
P1 0.2 01 10 2
P2 0.1 010 010 3
P3 0.2 11 11 2
P4 0.1 110 011 3

Entropy = H[X] = -(1/0.3010) Σ Pi log10 Pi

= -(1/0.3010) [ (0.4 x log 0.4) + (0.2 x log 0.2) + (0.1 x log 0.1) + (0.2 x log 0.2) + (0.1 x log 0.1) ]

= 2.122

Average Length L= Σ P I x Ni

=(0.4*2)+(0.2*2)+(0.1*3)+(0.2*2)+(0.1*3)

=2.2

Efficiency = η = H/L = 2.122/2.2 = 0.9645 (96.45%)

Redundancy = 1 - η = 0.0355
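A short Python sketch that reproduces these figures (entropy, average length, efficiency and redundancy) for the probabilities and code lengths of this example:

import math

probs   = [0.4, 0.2, 0.1, 0.2, 0.1]   # P0..P4
lengths = [2, 2, 3, 2, 3]             # codeword lengths from the table above

H = -sum(p * math.log2(p) for p in probs)          # entropy, bits per symbol
L = sum(p * n for p, n in zip(probs, lengths))     # average codeword length
efficiency = H / L
redundancy = 1 - efficiency

print(round(H, 3), round(L, 2))       # 2.122 2.2
print(round(efficiency, 4))           # 0.9645
print(round(redundancy, 4))           # 0.0355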

---------------------------------------------------------------------------------------------------------------------
Example3: x1=0.4,x2=0.19,x3=0.16,x4=x5=0.15
Example 4: x1=0.3,x2=0.25,x3=0.2,x4=0.12,x5=0.08 and x6=0.05

Advantages:

 Efficient
 Shorter codes
 Higher compression ratio
 Does not need any special markers to separate different codes
 Easy to integrate with existing system
 Lossless
 Easy implementation in hardware and software

Disadvantages:

 Frequency should be known in advance


 Tree is complex
 Difficult to understand and debug
 Time consuming
 Less effective when most symbols are unique (occur only rarely)

---------------------------------------------------------------------------------------------------------------------

ARITHMETIC CODING
Arithmetic coding is a type of entropy encoding utilized in lossless data compression.
Arithmetic coding can be used in most applications of data compression. Its main
usefulness is in obtaining compression in conjunction with an adaptive model, or when the
probability of one event is close to 1.

Difference between Huffman coding and Arithmetic coding:

Arithmetic coding Huffman coding


Doesn’t need probability Distribution Need probability Distribution
No need to keep and share codeword table Need to store code word table
Decompression speed is low Decompression speed is high
compression speed is low compression speed is high
compression ratio is high compression ratio is low

PROCEDURE

Example:
Find the tag Sm for the message "babc" with probabilities a = 0.2, b = 0.5 and c = 0.3, using arithmetic coding.

symbol probability Cumulative probability


a 0.2 0.2
b 0.5 0.7
c 0.3 1.0

Step 1: Divide the interval [0, 1) according to the cumulative probabilities: a = [0, 0.2), b = [0.2, 0.7), c = [0.7, 1.0). The first symbol is "b", so the working interval becomes [0.2, 0.7).

Step 2: For b

For “b” Calculate Range and mark in the diagram


Lower Limit =0.2

Upper Limit =0.7

Difference =d =0.7-0.2 =0.5

Range = lower Limit +(difference x probability of the symbol)

Range of a=0.2+(0.5x0.2)=0.2+0.10=0.3

Range of b= 0.3+(0.5x0.5)=0.3+0.25=0.55

Range of c=0.55+(0.5x0.3)=0.55+0.15=0.7

Step 3: For a

For “a” Calculate Range and mark in the diagram

Lower Limit =0.2

Upper Limit =0.3

Difference =d =0.3-0.2 =0.1

Range = lower Limit +(difference x probability of the symbol)


Range of a=0.2+(0.1x0.2)=0.2+0.02=0.22

Range of b= 0.22+(0.1x0.5)=0.22+0.05=0.27

Range of c=0.27+(0.1x0.3)=0.27+0.03=0.3

Step 4: For "b" calculate the ranges and mark them in the diagram

Lower Limit =0.22

Upper Limit =0.27

Difference =d =0.27-0.22 =0.05

Range = lower Limit +(difference x probability of the symbol)

Range of a=0.22+(0.05x0.2)=0.22+0.01=0.23

Range of b= 0.23+(0.05x0.5)=0.23+0.025=0.255

Range of c=0.255+(0.05x0.3)=0.255+0.015=0.27

For the code word

The code word lies in the final interval, the range of the last symbol c: Lower Limit = 0.255 and Upper Limit = 0.27

Tag = (0.255 + 0.27)/2 = 0.2625
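The interval narrowing above can be sketched in Python as follows; the function name arithmetic_tag is illustrative, and the sketch covers only the encoder side for short messages (a practical coder would use the scaled integer arithmetic shown later in this unit).

def arithmetic_tag(message, probs):
    """Narrow [low, high) for each symbol and return the midpoint tag."""
    # cumulative probability ranges, e.g. a: [0, 0.2), b: [0.2, 0.7), c: [0.7, 1.0)
    ranges, start = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (start, start + p)
        start += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        lo_frac, hi_frac = ranges[sym]
        low, high = low + width * lo_frac, low + width * hi_frac
    return low, high, (low + high) / 2

low, high, tag = arithmetic_tag("babc", {"a": 0.2, "b": 0.5, "c": 0.3})
print(low, high, tag)   # 0.255 0.27 0.2625 (floating point may show tiny rounding)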
---------------------------------------------------------------------------------------------------------------------

CONTEXT BASED COMPRESSION/PREDICTIVE CODING

Compresses based on previous symbol.

Encoding is done in prediction mode.

If the distribution or transformation is based on the previously seen sequence (the history), there is no need to transfer additional information.

Using the history to determine the sequence in a predictive manner is called predictive coding.

The popular algorithm used is Prediction by Partial Matching (PPM).

In PPM it is necessary only to store the context (the surrounding elements) of the sequence being encoded.

At first, since there are no previously encoded elements, the initial symbols have to be encoded separately.

In PPM, the escape symbol <ESC> is used to denote that the letter to be encoded is not present in the current context.

PPM Basic Algorithm

The basic algorithm initially attempts to use the largest context. The size of the largest
context is predetermined.
If the symbol to be encoded has not previously been present in this context, an escape
symbol is encoded and the algorithm attempts to use the next smaller context.
If the symbol has not occurred in this context either, the size of the context is further
reduced.
This process continues until either we obtain a context that has previously been present
with this symbol, or we arrive at the conclusion that the symbol has not been encountered
previously in any context.
In this case, we use a probability of 1/M to encode the symbol, where M is the size of the
source alphabet.
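The context-fallback rule (largest context first, emit <ESC> and shrink on failure) can be illustrated with the small Python sketch below. It only reports which order finally predicts the symbol and how many escapes would be emitted; it is not a full PPM coder, and the function name is an illustrative choice.

def ppm_context_search(history: str, symbol: str, max_order: int = 2):
    """Return (order used, number of escapes) when encoding `symbol`
    after `history`, following the largest-context-first rule."""
    escapes = 0
    for order in range(max_order, 0, -1):              # largest context first
        if len(history) < order:
            continue
        context = history[-order:]
        # has `symbol` ever followed this context in the history?
        followers = [history[i + order]
                     for i in range(len(history) - order)
                     if history[i:i + order] == context]
        if symbol in followers:
            return order, escapes                      # found: encode in this context
        escapes += 1                                   # not found: emit <ESC>, shrink context
    if symbol in history:                              # order-0 context
        return 0, escapes
    return -1, escapes + 1                             # order -1: encode with probability 1/M

print(ppm_context_search("this is", " ", max_order=2))   # (2, 0): the context "is" predicts the space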

Example,
Letter to be coded - letter a of the word probability

First attempt - See if the string proba has previously occurred— that is, if the letter a had
previously occurred in the context of prob.
If not, we would send an escape symbol

Second attempt – See if the letter a had previously occurred in the context of rob.
If the string roba had not occurred previously, we would send an escape symbol.

Third attempt – See if the letter a had previously occurred in the context of ob.
If the string oba had not occurred previously, we would send an escape symbol.

Fourth attempt – See if the letter a had previously occurred in the context of b.
If the string ba had not occurred previously, it is concluded that a has occurred for the first time, and it is encoded separately (with probability 1/M).

Example:
Encode the sequence
“this is the “

1. Assume that we have already encoded the initial 7 symbols "this is" (including the space).
2. Assume that the longest context length is two, that is, the orders range from -1 to 2 (-1, 0, 1, 2).
3. Count array for the -1 order context of "this is the ":
Letter count Cumulative count
t 1 1
h 1 2
i 1 3
s 1 4
space 1 5
e 1 6
Total count 6
Since the special character (the space) should be written last, interchange e and the space:

Letter count Cumulative count


t 1 1
h 1 2
i 1 3
s 1 4
e 1 5
space 1 6
Total count 6

4.Count array for 0 order context: “this is”


Letter count Cumulative count Remarks
t 1 1
h 1 2
i 2 4 i appears twice
s 2 6 s appears twice
space 1 7
<ESC> 1 8
Total count 8

5. Count array for 1-order contexts: "this is"

It gives how many times a letter appears immediately after the context.

I-order context   Letter   Count   Cumulative count   Remarks
t                 h        1       1                  t appears once before h in "this is"
                  <ESC>    1       2
                  Total count: 2
h                 i        1       1                  h appears once before i in "this is"
                  <ESC>    1       2
                  Total count: 2
i                 s        2       2                  i appears twice before s in "this is"
                  <ESC>    1       3
                  Total count: 3
s                 space    1       1                  s appears once before a space in "this is"
                  <ESC>    1       2
                  Total count: 2
space             i        1       1                  a space appears once before i in "this is"
                  <ESC>    1       2
                  Total count: 2
6. Count array for 2-order contexts: "this is"

II-order context   Letter   Count   Cumulative count   Remarks
th                 i        1       1                  th appears once before i in "this is"
                   <ESC>    1       2
                   Total count: 2
hi                 s        1       1                  hi appears once before s in "this is"
                   <ESC>    1       2
                   Total count: 2
is                 space    1       1                  is appears once before a space in "this is"
                   <ESC>    1       2
                   Total count: 2
s space            i        1       1                  "s space" appears once before i in "this is"
                   <ESC>    1       2
                   Total count: 2
space i            s        1       1                  "space i" appears once before s in "this is"
                   <ESC>    1       2
                   Total count: 2

7. We assume that the word length for arithmetic coding is six. Thus, l = 000000 and u = 111111.

8. As "this is" has already been encoded, the next letter to be encoded is the space in "this is the ".

In the II-order table, the context preceding this space is "is"; in that context the cumulative count for the space is 1 and the total count is 2.

The update formulas for the limits are

Lower limit: ln = ln-1 + floor[ (un-1 - ln-1 + 1) x Cum_Count(xn - 1) / Total_Count ]

Upper limit: un = ln-1 + floor[ (un-1 - ln-1 + 1) x Cum_Count(xn) / Total_Count ] - 1

un-1 = decimal value of 111111 = 63

ln-1 = 0, Cum_Count below the space = 0, Cum_Count of the space = 1, Total_Count = 2

Substituting, ln = 0 + floor[ (63 - 0 + 1) x 0 / 2 ] = 0 = 000000

and un = 0 + floor[ (63 - 0 + 1) x 1 / 2 ] - 1 = 31 = 011111
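These integer update equations can be checked with a small Python sketch (names are illustrative; WORD_BITS = 6 matches the six-bit word assumed above).

WORD_BITS = 6
FULL = (1 << WORD_BITS) - 1          # 111111 = 63

def update_limits(l_prev, u_prev, cum_low, cum_high, total):
    """One step of the integer arithmetic-coder update used above."""
    width = u_prev - l_prev + 1
    l_new = l_prev + (width * cum_low) // total
    u_new = l_prev + (width * cum_high) // total - 1
    return l_new, u_new

# Encoding the space in the context "is": cumulative count below the space = 0,
# cumulative count of the space = 1, total count = 2
l, u = update_limits(0, FULL, 0, 1, 2)
print(format(l, "06b"), format(u, "06b"))   # 000000 011111  (l = 0, u = 31)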
