Unit 1 Data Compression
Compression:
Compression is the process of reducing the size of a digital file so that it occupies less space.
Types of Compression:
Lossless compression algorithms reduce the size of files without losing any information in
the file, which means that we can reconstruct the original data from the compressed file.
Lossy compression algorithms reduce the size of files by discarding the less important
information in a file, which can significantly reduce file size but also affect file quality.
Lossy vs Lossless Compression:
1. Lossy compression is the method which eliminates the data which is not noticeable, while lossless compression does not eliminate any data.
2. In lossy compression, a file cannot be restored or rebuilt in its original form, while in lossless compression a file can be restored in its original form.
3. Lossy compression reduces the size of data far more than lossless compression does.
Pros of Lossy: small file sizes, ideal for web use; lots of tools, plugins and software support it.
Pros of Lossless: no loss in quality; slight decreases in file sizes.
Common algorithms: Transform coding (lossy); Lempel-Ziv-Welch and Huffman coding (lossless).
RUN-LENGTH ENCODING
1. Encode
AAAAAABBBBBCCCCCCCCDDEEEEEFFF
Output:
6A 5B 8C 2D 5E 3F
2. Encode
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Output
12W 1B 12W 3B 24W 1B 14W
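The run-length encoding shown above can be sketched in Python (the function name rle_encode is illustrative; the notes only give the input/output pairs):

```python
def rle_encode(text):
    """Run-length encode: replace each run of a repeated symbol with
    its length followed by the symbol, e.g. 'AAAB' -> '3A 1B'."""
    runs = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                      # extend the current run
        runs.append(f"{j - i}{text[i]}")
        i = j
    return " ".join(runs)

print(rle_encode("W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14))
# -> 12W 1B 12W 3B 24W 1B 14W
```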
DICTIONARY-BASED ALGORITHMS
The compression techniques we have seen so far replace individual symbols with variable-length codewords. In dictionary compression, variable-length substrings are replaced by short, possibly even fixed-length codewords. Compression is achieved by replacing long strings with shorter codewords.
• The text is encoded by replacing each phrase Ti with a code that acts as a pointer to the
dictionary.
LEMPEL ZIV
Lossless compression
Dictionary based compression
For very short inputs, the encoded string can be longer than the input stream
Example 1
Encode AABABBBABAABABBBABBABB with A = 0, B = 1.

POSITION      1      2      3      4      5      6      7      8      9
SEQUENCE      A      AB     ABB    B      ABA    ABAB   BB     ABBA   BB
NUMERICAL     ΦA     1B     2B     ΦB     2A     5B     4B     3A     7
CODE          0000   0011   0101   0001   0100   1011   1001   0110   0111

(In the numerical representation, the number denotes the dictionary position of the longest previously seen prefix, and Φ denotes the empty prefix. In each 4-bit code, the first 3 bits give that position and the last bit the letter. The final phrase BB is already in the dictionary at position 7, so only the position is sent, as the 4-bit number 0111.)
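The LZ78 parsing of Example 1 can be reproduced with a short sketch (lz78_encode is an illustrative name; position 0 plays the role of Φ):

```python
def lz78_encode(text):
    """LZ78: parse the input into phrases 'longest previously seen
    prefix + one new symbol'; emit (dictionary position of the
    prefix, new symbol). Position 0 means the empty prefix (Φ)."""
    dictionary = {}          # phrase -> 1-based dictionary position
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch     # keep extending the match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:               # input ended inside a known phrase
        output.append((dictionary[phrase],))
    return output

print(lz78_encode("AABABBBABAABABBBABBABB"))
# -> [(0,'A'), (1,'B'), (2,'B'), (0,'B'), (2,'A'), (5,'B'), (4,'B'), (3,'A'), (7,)]
```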
Example 2: 01000101110010100101
LEMPEL-ZIV-WELCH (LZW)
Common technique
Used in GIF, optionally in TIFF and PDF
Simple
High throughput in hardware
Widely used in Unix file compression
Can roughly double the effective capacity of storage hardware
Scans a file for data patterns that appear more than once
Patterns are stored in a dictionary
Example 1: Encode “TOBEORNOTTOBEORTOBEORNOT#”
There are 26 symbols in the plaintext alphabet (the capital letters A through Z, numbered 1 to 26). # is used to represent a stop code.
OUTPUT:

Extended Dictionary    CODE    Description
27: TO                 20      CODE OF FIRST LETTER T
28: OB                 15      CODE OF FIRST LETTER O
29: BE                 2       CODE OF FIRST LETTER B
30: EO                 5       CODE OF FIRST LETTER E
31: OR                 15      CODE OF FIRST LETTER O
32: RN                 18      CODE OF FIRST LETTER R
33: NO                 14      CODE OF FIRST LETTER N
34: OT                 15      CODE OF FIRST LETTER O
35: TT                 20      CODE OF FIRST LETTER T
36: TOB                27      CODE OF FIRST TWO LETTERS TO
37: BEO                29      CODE OF FIRST TWO LETTERS BE
38: ORT                31      CODE OF FIRST TWO LETTERS OR
39: TOBE               36      CODE OF FIRST THREE LETTERS TOB
40: EOR                30      CODE OF FIRST TWO LETTERS EO
41: RNO                32      CODE OF FIRST TWO LETTERS RN
42: OT#                34      CODE OF FIRST TWO LETTERS OT
Thus the code is 20,15,2,5,15,18,14,15,20,27,29,31,36,30,32,34,#
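A minimal LZW encoder sketch reproduces the table above. It assumes A to Z are numbered 1 to 26 and assigns the stop code # the value 0 (the notes only call # a stop code, so 0 is an assumption):

```python
def lzw_encode(text, alphabet):
    """LZW: the dictionary is seeded with the single-symbol alphabet.
    Each step outputs the code of the longest known prefix and adds
    that prefix extended by one more symbol as a new dictionary entry."""
    codes = dict(alphabet)               # string -> code
    next_code = max(codes.values()) + 1  # new entries start here (27)
    output = []
    current = ""
    for ch in text:
        if current + ch in codes:
            current += ch                # keep extending the match
        else:
            output.append(codes[current])
            codes[current + ch] = next_code
            next_code += 1
            current = ch
    if current:
        output.append(codes[current])
    return output

# A-Z numbered 1..26 as in the example; '#' assumed to be code 0.
alphabet = {chr(ord('A') + i): i + 1 for i in range(26)}
alphabet['#'] = 0
print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", alphabet))
# -> [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]
```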
Example 2:
Encode wabba_wabba_wabba_wabba_w
Index Entry
1 _
2 a
3 b
4 o
5 w
Solution:
Extended Dictionary    CODE    Description
6: wa                  5       CODE OF FIRST LETTER w
7: ab                  2       CODE OF FIRST LETTER a
8: bb                  3       CODE OF FIRST LETTER b
9: ba                  3       CODE OF FIRST LETTER b
10: a_                 2       CODE OF FIRST LETTER a
11: _w                 1       CODE OF FIRST LETTER _
12: wab                6       CODE OF FIRST TWO LETTERS wa
13: bba                8       CODE OF FIRST TWO LETTERS bb
14: a_w                10      CODE OF FIRST TWO LETTERS a_
15: wabb               12      CODE OF FIRST THREE LETTERS wab
16: ba_                9       CODE OF FIRST TWO LETTERS ba
17: _wa                11      CODE OF FIRST TWO LETTERS _w
18: abb                7       CODE OF FIRST TWO LETTERS ab
19: ba_w               16      CODE OF FIRST THREE LETTERS ba_
Thus the code is 5,2,3,3,2,1,6,8,10,12,9,11,7,16,5 (the final w is sent as its own code 5)
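Decoding reverses the process: the decoder rebuilds the same dictionary from the codes alone. A sketch (lzw_decode is an illustrative name):

```python
def lzw_decode(codes, alphabet):
    """Invert LZW: rebuild the dictionary while reading codes. The
    decoder is always one entry 'behind' the encoder, so a code not
    yet in the dictionary must be previous + previous[0]."""
    entries = {code: sym for sym, code in alphabet.items()}
    next_code = max(entries) + 1
    previous = entries[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in entries:
            current = entries[code]
        else:                       # the special KwKwK case
            current = previous + previous[0]
        result.append(current)
        entries[next_code] = previous + current[0]
        next_code += 1
        previous = current
    return "".join(result)

alphabet = {'_': 1, 'a': 2, 'b': 3, 'o': 4, 'w': 5}
print(lzw_decode([5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5], alphabet))
# -> wabba_wabba_wabba_wabba_w
```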
Example 3:
Encode geekific-geeficf
Index Entry
45 -
99 c
101 e
102 f
103 g
105 i
107 k
The initial dictionary holds codes 0 to 255 (the ASCII alphabet, which gives the index values above), so new entries start at 256.
Extended Dictionary    CODE    Description
256: ge                103     CODE OF FIRST LETTER g
257: ee                101     CODE OF FIRST LETTER e
258: ek                101     CODE OF FIRST LETTER e
259: ki                107     CODE OF FIRST LETTER k
260: if                105     CODE OF FIRST LETTER i
261: fi                102     CODE OF FIRST LETTER f
262: ic                105     CODE OF FIRST LETTER i
263: c-                99      CODE OF FIRST LETTER c
264: -g                45      CODE OF FIRST LETTER -
265: gee               256     CODE OF FIRST TWO LETTERS ge
266: ef                101     CODE OF FIRST LETTER e
267: fic               261     CODE OF FIRST TWO LETTERS fi
268: cf                99      CODE OF FIRST LETTER c
Thus the code is 103,101,101,107,105,102,105,99,45,256,101,261,99,102
HUFFMAN CODING
Huffman coding is an algorithm for compressing data with the aim of reducing its size without losing any of the details. This algorithm was developed by David Huffman. Huffman coding is typically useful when the data we want to compress has frequently occurring characters in it.
It is a lossless compression.
It is a variable-length code.
The code lengths are based on the frequencies of the characters.
Variable-length codes are assigned to input characters.
They are prefix codes.
Huffman Coding step-by-step working, i.e. creation of the Huffman Tree, is as follows:
Step-1: Calculate the frequency of each character.
Step-2: Sort all the characters in ascending order of frequency.
Step-3: Mark each unique character as a leaf node.
Step-4: Create a new internal node from the two nodes with the lowest frequencies.
Step-5: Set the frequency of the new node to the sum of the two nodes' frequencies.
Step-6: Mark the first node as the left child and the other node as the right child of the newly created node.
Step-7: Repeat steps 2 to 6 until only one node (the root) remains.
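The steps above can be sketched with Python's heapq module. Exact codes depend on how ties between equal frequencies are broken, but the total encoded length always equals the Huffman optimum:

```python
import heapq

def huffman_codes(freq):
    """Build a Huffman tree from {symbol: frequency} and return
    {symbol: code}. Repeatedly merge the two lowest-frequency nodes,
    labelling left edges '0' and right edges '1'."""
    # heap items: (frequency, tiebreaker, tree); a leaf is just a symbol
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):
            walk(node[0], code + "0")        # left edge gets 0
            walk(node[1], code + "1")        # right edge gets 1
        else:
            codes[node] = code or "0"
    walk(heap[0][2], "")
    return codes

# frequencies from Example 2 below ("go go gophers"-style data)
freq = {'g': 3, 'o': 3, ' ': 2, 'p': 1, 'e': 1, 'h': 1, 'r': 1, 's': 1}
codes = huffman_codes(freq)
print(codes)
```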
Example:
Suppose a data file has the following characters and the frequencies. If huffman coding is
used, calculate:
Huffman Code of each character
Average code length
Length of Huffman encoded data
Solution:
Initially, create the Huffman Tree by repeatedly merging the two lowest-frequency nodes (Steps 2 to 5: intermediate tree diagrams omitted).
The above tree is a Huffman Tree.
Now, assign weight to all the nodes.
Assign “0” to all left edges and “1” to all right edges.
The tree will become
Example 2:
SYMBOL    FREQUENCY
g         3
o         3
p         1
e         1
h         1
r         1
s         1
space     2
Step 2: add the 2 lowest frequencies and arrange in ascending order again, as shown in the tree diagram.
Steps 3 to 7: repeat, each time adding the 2 lowest frequencies and re-arranging in ascending order (tree diagrams omitted).
Step 8: add the 2 lowest frequencies to complete the tree, and assign 0 to each left edge and 1 to each right edge.
Step 9: for every symbol, read off its code from the tree, top to bottom.
Symbol    Frequency of a String    Code    Size
g         3                        00      3×2 = 6
o         3                        01      3×2 = 6
s         1                        100     1×3 = 3
e         1                        1100    1×4 = 4
h         1                        1101    1×4 = 4
p         1                        1110    1×4 = 4
r         1                        1111    1×4 = 4
space     2                        101     2×3 = 6
Total     13 characters                    37 bits
Example 3:
Example 4:
(Both are worked out as tree diagrams, omitted here.)
HUFFMAN CODING: ENTROPY AND EFFICIENCY
Entropy is the average level of information, surprise, or uncertainty inherent to the variable’s possible outcomes:
H[X] = − Σ Pi log2 Pi
Average length L = Σ Pi × Ni (where Ni is the code length of message i)
Efficiency η = H/L
Redundancy = 1 − η
Procedure
Example: Encode using Huffman coding P0=0.4, P1=0.2, P2=0.1, P3=0.2 and P4=0.1
MSG PROB
P0 0.4
P1 0.2
P3 0.2
P2 0.1
P4 0.1
Step 2: Combine (add) the last 2 probabilities and arrange in descending order again; repeat until two probabilities remain, then assign code bits back down the tree.
The resulting codes (one valid assignment, with lengths 2, 2, 2, 3, 3):
P0 = 00
P1 = 01
P3 = 10
P2 = 110
P4 = 111
H = −(1/0.3010)[(0.4×log 0.4)+(0.2×log 0.2)+(0.1×log 0.1)+(0.2×log 0.2)+(0.1×log 0.1)]
  = 2.122 bits/message
Average Length L = Σ Pi × Ni
  = (0.4×2)+(0.2×2)+(0.1×3)+(0.2×2)+(0.1×3)
  = 2.2 bits/message
Efficiency η = H/L = 2.122/2.2 = 0.9645 = 96.45%
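These quantities can be checked numerically; the probabilities and code lengths below are taken from the example above:

```python
from math import log2

def entropy(probs):
    """H = -sum(p_i * log2(p_i)), in bits per symbol."""
    return -sum(p * log2(p) for p in probs)

probs   = [0.4, 0.2, 0.2, 0.1, 0.1]   # P0, P1, P3, P2, P4
lengths = [2, 2, 2, 3, 3]             # Huffman code lengths

H = entropy(probs)                    # entropy
L = sum(p * n for p, n in zip(probs, lengths))   # average length
efficiency = H / L
print(H, L, efficiency)               # H ~ 2.122, L = 2.2, efficiency ~ 96.45%
```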
---------------------------------------------------------------------------------------------------------------------
Example3: x1=0.4,x2=0.19,x3=0.16,x4=x5=0.15
Example 4: x1=0.3,x2=0.25,x3=0.2,x4=0.12,x5=0.08 and x6=0.05
Advantages:
Efficient
Shorter codes for more frequent symbols
Higher compression ratio
Does not need any special markers to separate different codes
Easy to integrate with existing systems
Lossless
Easy to implement in hardware and software
Disadvantages:
---------------------------------------------------------------------------------------------------------------------
ARITHMETIC CODING
Arithmetic coding is a type of entropy encoding utilized in lossless data compression.
Arithmetic coding can be used in most applications of data compression. Its main
usefulness is in obtaining compression in conjunction with an adaptive model, or when the
probability of one event is close to 1.
PROCEDURE
Example:
Find Sm (the tag) for the message “babc” with probabilities a = 0.2, b = 0.5 and c = 0.3, using arithmetic coding.
Step 1: Assign the initial intervals: a = [0, 0.2), b = [0.2, 0.7), c = [0.7, 1.0).
Step 2: For b, subdivide the interval [0.2, 0.7):
Range of a = 0.2+(0.5×0.2) = 0.2+0.10 = 0.3
Range of b = 0.3+(0.5×0.5) = 0.3+0.25 = 0.55
Range of c = 0.55+(0.5×0.3) = 0.55+0.15 = 0.7
Step 3: For a, subdivide the interval [0.2, 0.3):
Range of a = 0.2+(0.1×0.2) = 0.2+0.02 = 0.22
Range of b = 0.22+(0.1×0.5) = 0.22+0.05 = 0.27
Range of c = 0.27+(0.1×0.3) = 0.27+0.03 = 0.3
Step 4: For b, subdivide the interval [0.22, 0.27):
Range of a = 0.22+(0.05×0.2) = 0.22+0.01 = 0.23
Range of b = 0.23+(0.05×0.5) = 0.23+0.025 = 0.255
Range of c = 0.255+(0.05×0.3) = 0.255+0.015 = 0.27
Step 5: For c, the final interval is [0.255, 0.27).
The tag is Sm = (0.255+0.27)/2 = 0.2625
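The interval narrowing above can be reproduced with a short sketch (arithmetic_intervals is an illustrative name; no bit-level renormalization is attempted):

```python
def arithmetic_intervals(message, model):
    """Narrow [low, high) one symbol at a time: each symbol selects its
    sub-interval of the current range, proportional to its probability."""
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        cum = 0.0
        for s, p in model:            # model lists (symbol, probability)
            if s == sym:
                high = low + width * (cum + p)   # upper end of sub-interval
                low = low + width * cum          # lower end of sub-interval
                break
            cum += p
    return low, high

model = [('a', 0.2), ('b', 0.5), ('c', 0.3)]
low, high = arithmetic_intervals("babc", model)
tag = (low + high) / 2
print(low, high, tag)   # low ~ 0.255, high ~ 0.27, tag ~ 0.2625
```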
---------------------------------------------------------------------------------------------------------------------
PREDICTION WITH PARTIAL MATCH (PPM)
Using the history to determine the next symbol in a predictive manner is called predictive coding.
In PPM escape symbol <ESC> is used to denote that the letter to be encoded isn’t present in
the context.
The basic algorithm initially attempts to use the largest context. The size of the largest
context is predetermined.
If the symbol to be encoded has not previously been present in this context, an escape
symbol is encoded and the algorithm attempts to use the next smaller context.
If the symbol has not occurred in this context either, the size of the context is further
reduced.
This process continues until either we obtain a context that has previously been present
with this symbol, or we arrive at the conclusion that the symbol has not been encountered
previously in any context.
In this case, we use a probability of 1/M to encode the symbol, where M is the size of the
source alphabet.
Example:
Letter to be coded - the letter a of the word probability.
First attempt - see if the string proba has previously occurred, that is, if the letter a has previously occurred in the context of prob.
If not, we send an escape symbol.
Second attempt - see if the letter a has previously occurred in the context of rob.
If the string roba has not occurred previously, we send an escape symbol.
Third attempt - see if the letter a has previously occurred in the context of ob.
If the string oba has not occurred previously, we send an escape symbol.
Fourth attempt - see if the letter a has previously occurred in the context of b.
If the string ba has not occurred previously, it is concluded that a has occurred for the first time, and it is encoded separately (with probability 1/M).
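The fall-back search described above can be sketched as follows. The function names are illustrative, and the sketch only reports which context would code the symbol; it does not perform the actual arithmetic coding:

```python
from collections import defaultdict, Counter

def build_contexts(history, max_order):
    """Count, for every context of length 1..max_order, which letters
    followed it in the history seen so far."""
    contexts = defaultdict(Counter)
    for order in range(1, max_order + 1):
        for i in range(len(history) - order):
            contexts[history[i:i + order]][history[i + order]] += 1
    return contexts

def ppm_lookup(contexts, history, symbol, max_order):
    """Try the longest context first; emit <ESC> and shorten the
    context until the symbol has been seen, or fall back to the
    order -1 model with probability 1/M."""
    events = []
    for order in range(max_order, 0, -1):
        ctx = history[-order:]
        if symbol in contexts.get(ctx, {}):
            events.append(f"code '{symbol}' in context '{ctx}'")
            return events
        events.append(f"<ESC> from context '{ctx}'")
    events.append(f"code '{symbol}' with probability 1/M")
    return events

history = "this is"
contexts = build_contexts(history, 2)
print(ppm_lookup(contexts, history, " ", 2))
# -> ["code ' ' in context 'is'"]
```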
Example:
Encode the sequence
“this is the “
1. Assume that we have already encoded the initial 7 symbols “this is” (including the space).
2. Assume that the longest context length is two, that is, the orders used are -1, 0, 1 and 2.
3. Count array for the -1 order context of “this is the ” (every letter of the alphabet is counted once, i.e. equally likely):
Letter count Cumulative count
t 1 1
h 1 2
i 1 3
s 1 4
space 1 5
e 1 6
Total count 6
Since the special symbol (space) should be listed last, interchange e and space.
4. Count array for the 1 order context of “this is” (it gives how many times each letter appears after the context):

I order context t:
letter     count   cum count   Remarks
h          1       1           h appears once after t in "this is"
<ESC>      1       2
Total count: 2

I order context h:
letter     count   cum count   Remarks
i          1       1           i appears once after h in "this is"
<ESC>      1       2
Total count: 2

I order context i:
letter     count   cum count   Remarks
s          2       2           s appears twice after i in "this is"
<ESC>      1       3
Total count: 3

I order context s:
letter     count   cum count   Remarks
space      1       1           space appears once after s in "this is"
<ESC>      1       2
Total count: 2

I order context space:
letter     count   cum count   Remarks
i          1       1           i appears once after space in "this is"
<ESC>      1       2
Total count: 2
5. Count array for the 2 order context of “this is”:

II order context th:
letter     count   cum count   Remarks
i          1       1           i appears once after th in "this is"
<ESC>      1       2
Total count: 2

II order context hi:
letter     count   cum count   Remarks
s          1       1           s appears once after hi in "this is"
<ESC>      1       2
Total count: 2

II order context is:
letter     count   cum count   Remarks
space      1       1           space appears once after is in "this is"
<ESC>      1       2
Total count: 2

II order context "s space":
letter     count   cum count   Remarks
i          1       1           i appears once after "s " in "this is"
<ESC>      1       2
Total count: 2

II order context "space i":
letter     count   cum count   Remarks
s          1       1           s appears once after " i" in "this is"
<ESC>      1       2
Total count: 2
6. We assume that the word length for arithmetic coding is six. Thus l = 000000 and u = 111111.
7. As “this is” has already been encoded, the next letter to be encoded is the space in “this is the”. Its 2 order context is “is” (the last two letters encoded), in which the cumulative count of space is 1 and the total count is 2.
un-1 = decimal value of (111111) = 63
Substituting, ln = 0+{[(63-0+1)×0]/2} = 0 = 000000
and un = 0+{[(63-0+1)×1]/2}-1 = 31 = 011111