
Data Compression - Unit 3

The document discusses data compression techniques, focusing on arithmetic coding and binary coding methods. It explains how arithmetic coding represents messages as a single floating-point number between 0 and 1, and provides examples for encoding and decoding sequences. Additionally, it covers dictionary techniques and bi-level image compression, highlighting static and adaptive dictionary methods.

Uploaded by

Dharmendra Patel
Copyright: © All Rights Reserved

DATA COMPRESSION
Jitendra Kumar
Assistant Professor - CSE
UNIT-3

TOPIC:
Arithmetic Coding or Binary Code
• Arithmetic coding is a lossless data compression technique.
• It is used to reduce the average length of the codewords.
• It represents the entire message as a single floating-point number in the
interval [0, 1).
• It avoids the problem of assigning a codeword to each individual symbol by
assigning one code to the entire file.
• The method starts with the interval [0, 1), where 0 is included and 1 is excluded.
• It reads the input file symbol by symbol and uses the probability of each symbol
to narrow the interval.

Example:
Define the probability range of each symbol.
Symbols: {A: 0.4, B: 0.3, C: 0.2, D: 0.1}
Answer:
A → [0.0, 0.4)
B → [0.4, 0.7)
C → [0.7, 0.9)
D → [0.9, 1.0)
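
The ranges above are the running sums of the probabilities. A minimal Python sketch of that step follows; the function name `symbol_ranges` is an illustrative choice, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

def symbol_ranges(probabilities):
    """Map each symbol to its half-open sub-interval [low, high) of [0, 1)."""
    ranges = {}
    low = Fraction(0)
    for symbol, p in probabilities.items():
        high = low + Fraction(p)  # the range width equals the symbol probability
        ranges[symbol] = (low, high)
        low = high                # the next symbol starts where this one ends
    return ranges

probs = {"A": "0.4", "B": "0.3", "C": "0.2", "D": "0.1"}
for symbol, (low, high) in symbol_ranges(probs).items():
    print(f"{symbol} -> [{float(low)}, {float(high)})")
```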

TOPIC:
Generating a Binary Code:

Example:
Question: Consider 4 letters, A = {1,2,3,4}, where P(1) = 0.5, P(2) = 0.25, P(3) = 0.125, P(4) =
0.125. Find the binary code for each letter.
Answer:
Compute the cumulative probability Fx for every letter; the last value must equal 1.
F(0) = 0
F(1) = 0+0.5 = 0.5
F(2) = 0+0.5+0.25 = 0.75
F(3) = 0+0.5+0.25+0.125 = 0.875
F(4) = 0+0.5+0.25+0.125+0.125 = 1.0
Now,
Number of bits required for symbol i: ⌈log2(1/P(i))⌉ + 1
Tx̅(i) = Fx(i-1) + ½ P(i)

Where,
Fx(i-1) is the cumulative probability of the symbols before symbol i
Tx̅(i) is the midpoint of the interval of symbol i; its binary expansion, cut to the required number of bits, is the code

Symbol   Fx      Tx̅       Binary   Number of bits   Final code
1        0.5     0.25     01       2                01
2        0.75    0.625    101      3                101
3        0.875   0.8125   1101     4                1101
4        1.0     0.9375   1111     4                1111

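For the 1st iteration, i = 1:
Tx̅ = Fx(i-1) + ½ P(i)
= Fx(1-1) + ½ P(1)
= Fx(0) + ½ (0.5)
= 0 + ½ (0.5)
= 0.25
Repeat this for every symbol:
Tx̅(2) = 0.5 + ½ (0.25) = 0.625
Tx̅(3) = 0.75 + ½ (0.125) = 0.8125
Tx̅(4) = 0.875 + ½ (0.125) = 0.9375

Now convert 0.25, 0.625, 0.8125 and 0.9375 into binary (multiply by 2 repeatedly; the integer part of each product is the next bit):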

0.25 x 2 = 0.50 (bit 0)
0.50 x 2 = 1.0 (bit 1)
So, the binary for 0.25 is 01

0.625 x 2 = 1.250 (bit 1)
0.250 x 2 = 0.50 (bit 0)
0.50 x 2 = 1.0 (bit 1)
So, the binary for 0.625 is 101

Likewise, 0.8125 gives 1101 and 0.9375 gives 1111.
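
The repeated doubling can be sketched in a few lines of Python. The function name `fraction_to_binary` and the `max_bits` cut-off are assumptions made for this illustration; the cut-off only matters for fractions whose binary expansion does not terminate.

```python
def fraction_to_binary(value, max_bits=16):
    """Return the binary digits of a fraction in [0, 1) by repeated doubling."""
    bits = []
    while value > 0 and len(bits) < max_bits:
        value *= 2
        if value >= 1:       # integer part is 1: emit 1 and keep the fraction
            bits.append("1")
            value -= 1
        else:                # integer part is 0: emit 0
            bits.append("0")
    return "".join(bits)

for tx in (0.25, 0.625, 0.8125, 0.9375):
    print(tx, "->", fraction_to_binary(tx))
```

All four values are sums of powers of 1/2, so the loop stops on its own after 2, 3, 4 and 4 bits.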


Now find the number of bits required for each symbol: ⌈log2(1/p(n))⌉+1

When n =1,
= ⌈log2(1/p(1))⌉+1
= ⌈log2(1/0.5)⌉+1
= ⌈log2(2)⌉+1
= 1+1 = 2

When n =2,
= ⌈log2(1/p(2))⌉+1
= ⌈log2(1/0.25)⌉+1
= ⌈log2(4)⌉+1
= 2+1 = 3

When n =3,
= ⌈log2(1/p(3))⌉+1
= ⌈log2(1/0.125)⌉+1
= ⌈log2(8)⌉+1
= 3+1 = 4

When n =4,
= ⌈log2(1/p(4))⌉+1
= ⌈log2(1/0.125)⌉+1
= ⌈log2(8)⌉+1
= 3+1 = 4
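
Putting the three steps together (cumulative probability, midpoint, bit count), the whole code table can be reproduced with the sketch below. The name `binary_codes` is hypothetical, and plain floats are used on the assumption that the probabilities are powers of 1/2 as in this example; other probabilities may need exact arithmetic.

```python
import math

def binary_codes(probabilities):
    """For each symbol return (midpoint Tx, number of bits, codeword)."""
    table = {}
    cumulative = 0.0  # F(i-1): total probability of the symbols seen so far
    for symbol, p in probabilities.items():
        midpoint = cumulative + p / 2
        bit_count = math.ceil(math.log2(1 / p)) + 1
        # Keep only the first bit_count bits of the binary expansion of the midpoint.
        codeword = format(math.floor(midpoint * 2 ** bit_count), f"0{bit_count}b")
        table[symbol] = (midpoint, bit_count, codeword)
        cumulative += p
    return table

for symbol, row in binary_codes({1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}).items():
    print(symbol, row)
```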

CODING A SEQUENCE (Arithmetic coding) (ENCODING) (METHOD-1):


Example:
Question: Consider 3 letters, S = {1,2,3}, where P(1) = 0.5, P(2) = 0.3, P(3) = 0.2. Encode
‘1332’, which may also be written as {a1, a3, a3, a2}.
Answer:
• To encode the sequence we update the lower and the upper limit of the interval once per symbol.
• Formulas:

l_n = l_{n-1} + (u_{n-1} – l_{n-1}) · Fx(x_n – 1)

u_n = l_{n-1} + (u_{n-1} – l_{n-1}) · Fx(x_n)

Where
l is the “lower limit”
u is the “upper limit”
n is the iteration number, i.e. the position of the current symbol in the sequence
x_n is the n-th symbol of the sequence that we want to encode, and Fx is the cumulative probability
Perform the iterations until we reach the end of the given sequence, i.e.
n = 4.
The initial values are l_0 = 0 and u_0 = 1; the rest are found using the formulas.
Now,
First find Fx.
Fx is the cumulative sum of the probabilities: F(k) = P(1) + ... + P(k). Compute it
for every letter; the last value must equal 1.0.
F(0) = 0 (initial stage)
F(1) = 0 + 0.5 = 0.5
F(2) = 0 + 0.5 + 0.3 = 0.8
F(3) = 0 + 0.5 + 0.3 + 0.2 = 1.0
Now find the lower limit and the upper limit:

l_n = l_{n-1} + (u_{n-1} – l_{n-1}) · Fx(x_n – 1)

u_n = l_{n-1} + (u_{n-1} – l_{n-1}) · Fx(x_n)

The sequence is 1332, so x_1 = 1, x_2 = 3, x_3 = 3 and x_4 = 2.

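1st iteration, n = 1, x_1 = 1, with l_0 = 0 and u_0 = 1:
l_1 = l_0 + (u_0 – l_0) · Fx(0) = 0 + (1)(0) = 0
u_1 = l_0 + (u_0 – l_0) · Fx(1) = 0 + (1)(0.5) = 0.5

2nd iteration, n = 2, x_2 = 3:
l_2 = 0 + (0.5 – 0)(0.8) = 0.4
u_2 = 0 + (0.5 – 0)(1.0) = 0.5

3rd iteration, n = 3, x_3 = 3:
l_3 = 0.4 + (0.5 – 0.4)(0.8) = 0.48
u_3 = 0.4 + (0.5 – 0.4)(1.0) = 0.5

4th iteration, n = 4, x_4 = 2:
l_4 = 0.48 + (0.5 – 0.48)(0.5) = 0.49
u_4 = 0.48 + (0.5 – 0.48)(0.8) = 0.496

Once l_4 and u_4 are found, the tag is the midpoint of the final interval:
Tag = (l_4 + u_4) / 2 = (0.49 + 0.496) / 2 = 0.493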
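
The interval updates for ‘1332’ can be checked with a short Python sketch. It is only an illustration under stated assumptions: `encode_interval` is a made-up name, symbols are the integers 1..m as in the question, and exact fractions replace floating point. The last line computes the midpoint (l4 + u4)/2 of the final interval, which the decoding example calls the tag value.

```python
from fractions import Fraction

def encode_interval(sequence, probabilities):
    """Return the final (lower, upper) limits for a sequence of symbols 1..m."""
    cdf = [Fraction(0)]                   # cdf[k] = F(k), with F(0) = 0
    for p in probabilities:
        cdf.append(cdf[-1] + Fraction(p))
    lower, upper = Fraction(0), Fraction(1)
    for x in sequence:
        width = upper - lower
        # Both new limits are computed from the previous lower limit.
        lower, upper = lower + width * cdf[x - 1], lower + width * cdf[x]
    return lower, upper

lower, upper = encode_interval([1, 3, 3, 2], ["0.5", "0.3", "0.2"])
tag = (lower + upper) / 2
print(float(lower), float(upper), float(tag))  # 0.49 0.496 0.493
```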
CODING A SEQUENCE (Arithmetic coding) (DECODING or DECIPHERING OF METHOD 1):
Example:
Question: Decipher the tag value 0.8835, given P(1) = 0.6, P(2) = 0.3, P(3) = 0.1.
Answer:
We use the same formulas as for encoding:
l_n = l_{n-1} + (u_{n-1} – l_{n-1}) · Fx(x_n – 1)
u_n = l_{n-1} + (u_{n-1} – l_{n-1}) · Fx(x_n)

where l_0 = 0, u_0 = 1 and the alphabet has 3 letters. First find Fx:

F(0) = 0 (As, this is the initial stage)


F(1) = 0 + 0.6 = 0.6
F(2) = 0 + 0.6 + 0.3 = 0.9
F(3) = 0 + 0.6 + 0.3 + 0.1 = 1.0

Now perform all iterations.

1st iteration, n = 1, with l_0 = 0 and u_0 = 1:

l_1 = l_0 + (u_0 – l_0) · Fx(x_1 – 1) = Fx(x_1 – 1)
u_1 = l_0 + (u_0 – l_0) · Fx(x_1) = Fx(x_1)

We do not yet know x_n (the n-th symbol of the original message), so in every
iteration we try x_n = 1, 2, 3 in turn until the tag value lies in the range
[l_n, u_n).

Let x_1 = 1:
l_1 = Fx(0) = 0
u_1 = Fx(1) = 0.6
The range is [0, 0.6), but the tag value 0.8835 is greater than 0.6, so it lies
outside this range. We try the next value.

Let x_1 = 2:
l_1 = Fx(1) = 0.6
u_1 = Fx(2) = 0.9
The tag value 0.8835 lies in [0.6, 0.9), so x_1 = 2. Repeat the same procedure for the remaining symbols:
2nd iteration: x_2 = 3 gives [0.87, 0.9), which contains the tag.
3rd iteration: x_3 = 1 gives [0.87, 0.888), which contains the tag.
4th iteration: x_4 = 2 gives [0.8808, 0.8862), which contains the tag.

So the final result is x_1 = 2, x_2 = 3, x_3 = 1, x_4 = 2.
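
The trial-and-error search can be sketched as follows. This is a hedged illustration, not a definitive decoder: `decode_tag` is a hypothetical name, and the message length 4 is an assumption taken from the four symbols in the result above, since a tag alone does not say where the message ends.

```python
from fractions import Fraction

def decode_tag(tag, probabilities, length):
    """Recover `length` symbols (numbered 1..m) from an arithmetic-coding tag."""
    cdf = [Fraction(0)]                   # cdf[k] = F(k), with F(0) = 0
    for p in probabilities:
        cdf.append(cdf[-1] + Fraction(p))
    tag = Fraction(tag)
    lower, upper = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(length):
        width = upper - lower
        for x in range(1, len(cdf)):      # try x = 1, 2, ... in turn
            new_lower = lower + width * cdf[x - 1]
            new_upper = lower + width * cdf[x]
            if new_lower <= tag < new_upper:
                decoded.append(x)
                lower, upper = new_lower, new_upper
                break
    return decoded

print(decode_tag("0.8835", ["0.6", "0.3", "0.1"], 4))  # [2, 3, 1, 2]
```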


METHOD-2 (ENCODING) or GENERATING A TAG:
Example:
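Question: Consider 3 letters, S = {a,b,c}, where P(a) = 0.2, P(b) = 0.5, P(c) = 0.3. Encode
‘bac’.
Answer:

Diagram: interval boundaries at each stage (each column subdivides the interval of the symbol just encoded)

Boundary            Start    After b    After a
Upper end of c      1.0      0.7        0.3
Between b and c     0.7      0.55       0.27
Between a and b     0.2      0.3        0.22
Lower end of a      0        0.2        0.2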

1. The message is “bac”, so first we expand b, whose interval is [0.2, 0.7) (see the
diagram above).
2. Divide the interval of b in proportion to the given probabilities (written as
percentages):
Total difference: 0.7 - 0.2 = 0.5
P(a) = 0.2 x 100 = 20%
So, 20% of 0.5 = 0.5 x 20/100 = 0.1
The new upper boundary of a = 0.2 + 0.1 = 0.3

P(b) = 0.5 x 100 = 50%
So, 50% of 0.5 = 0.5 x 50/100 = 0.25
The new upper boundary of b = 0.3 + 0.25 = 0.55

P(c) = 0.3 x 100 = 30%
So, 30% of 0.5 = 0.5 x 30/100 = 0.15
The new upper boundary of c = 0.55 + 0.15 = 0.7
3. In the sequence bac, b is now done. Do the same for a, whose interval is [0.2, 0.3):
Total difference: 0.3 - 0.2 = 0.1
P(a) = 0.2 x 100 = 20%
So, 20% of 0.1 = 0.1 x 20/100 = 0.02
The new upper boundary of a = 0.2 + 0.02 = 0.22

P(b) = 0.5 x 100 = 50%
So, 50% of 0.1 = 0.1 x 50/100 = 0.05
The new upper boundary of b = 0.22 + 0.05 = 0.27

P(c) = 0.3 x 100 = 30%
So, 30% of 0.1 = 0.1 x 30/100 = 0.03
The new upper boundary of c = 0.27 + 0.03 = 0.3
4. The last symbol is c, whose interval is [0.27, 0.3). This is the final interval, so there is no need to subdivide it.

Tag = (lower + upper) / 2 = (0.27 + 0.3) / 2 = 0.285
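
The subdivision steps for ‘bac’ can be reproduced with the sketch below. The helper names `subdivide` and `generate_tag` are illustrative assumptions, and exact fractions are used so the boundaries come out as 0.22, 0.27 and 0.3 without rounding noise.

```python
from fractions import Fraction

def subdivide(low, high, probabilities):
    """Split [low, high) among the symbols in proportion to their probabilities."""
    parts = {}
    start = low
    for symbol, p in probabilities.items():
        end = start + (high - low) * Fraction(p)
        parts[symbol] = (start, end)
        start = end
    return parts

def generate_tag(message, probabilities):
    """Narrow [0, 1) once per symbol and return the midpoint of the final interval."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        low, high = subdivide(low, high, probabilities)[symbol]
    return (low + high) / 2

probs = {"a": "0.2", "b": "0.5", "c": "0.3"}
print(float(generate_tag("bac", probs)))  # 0.285
```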

Bi-Level Image Compression (pending):


Bi-level image compression is a type of data compression technique used specifically
for binary (black-and-white) images, where each pixel is either black or white (1-bit
per pixel). It is widely used in applications like scanned documents, fax transmission,
and digital signatures

DICTIONARY TECHNIQUES:
• Dictionary techniques use the structure of the given data (its repeated patterns) to compress it.
• Suppose we are given abcdeab. We find the frequently occurring patterns and
store them in a dictionary, each with an index or
code.
For example:
Dictionary entry    Code
ab                  00
• We use this technique when the patterns are short (ab, abb, ...) and occur
frequently, i.e. there is only a small gap between one occurrence of a pattern
and the next.
Example: abbcdabdweab
• Types of dictionary: Static and Adaptive

Static: Digram code
Adaptive: LZ77 / LZ78 / LZW

• Static: The dictionary is fixed before compression, so we must have complete
knowledge of the source in advance; changes made to the source afterwards cannot
be reflected in the compressed data.
• Dynamic (Adaptive): The dictionary is updated while compressing, so changes in
the source are reflected in the compressed data as well.

Static - Digram Code:
Example:
Question:
A = {a,b,c,d,r}
Dictionary:

Code   Entry
000    a
001    b
010    c
011    d
100    r
101    ab
110    ac
111    ad
Encode: abracadabra
Solution:
The message has 11 letters from a 5-letter alphabet. A fixed-length code needs
⌈log2(5)⌉ = 3 bits per letter, so before compression we need 11 x 3 = 33 bits.
Let us see how the static technique (digram code) compresses it.

Step 1:
Read the first 2 letters, ab, and look them up in the dictionary.
ab = 101
Step 2:
Take the next 2 letters, ra. This pair is not in the dictionary, so we
encode only 1 letter and continue from the letter after it.
r = 100

Continue these steps until we reach the end:
ac = 110
ad = 111
ab = 101
r = 100
a = 000

Encoded output: 101100110111101100000 (7 codewords x 3 bits = 21 bits)
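
The pair-first lookup used in these steps can be sketched as below, with the dictionary copied from the question. The function name `digram_encode` is an assumption made for illustration.

```python
DICTIONARY = {
    "a": "000", "b": "001", "c": "010", "d": "011",
    "r": "100", "ab": "101", "ac": "110", "ad": "111",
}

def digram_encode(text, dictionary):
    """Encode two letters at a time when the pair is in the dictionary, else one."""
    codes = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in dictionary:
            codes.append(dictionary[pair])
            i += 2
        else:
            codes.append(dictionary[text[i]])
            i += 1
    return codes

codes = digram_encode("abracadabra", DICTIONARY)
print(" ".join(codes))  # 101 100 110 111 101 100 000
```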

Adaptive – LZ77 / LZ1 / Sliding window - Encoding:

• LZ77 is a lossless data compression algorithm.
• It is the foundation for many modern compression formats, including ZIP and PNG.
• LZ77 uses a sliding window technique to replace repeated occurrences of data
with references to previous occurrences.

How LZ77 Works:

1. Sliding Window: The algorithm maintains a search buffer (which stores the
previously seen text, i.e. the part that is already encoded) and a lookahead buffer (which
contains the upcoming text to be processed).
2. Match Detection: It searches the search buffer for the longest match with the
start of the lookahead buffer.
3. Encoding: Instead of storing duplicate data, it stores a
(distance, length, next character) triplet:
• Distance (offset): How far back from the start of the lookahead buffer the match begins.
• Length: How long the match is.
• Next character: The character that follows the match in the lookahead buffer.
4. The triplet is also written <o,l,c>, where o is the offset, l is the length and
c is the codeword of the next character.

Example:
Question:
Given sequence: cabracadabrarrarrad
Window size: 13 and lookahead buffer: 6
Answer:
Given values:
LA buffer = 6
So the search buffer = 13 - 6 = 7

In the windows below, the part before “|” is the search buffer (at most 7 symbols), the part in
brackets is the lookahead buffer (at most 6 symbols), and the rest has not been read yet.

Step 1:
The LA buffer holds 6 symbols, so it starts with the first 6 symbols of the sequence and
the search buffer is empty.
| [c a b r a c] adabrarrarrad

The first element of the LA buffer is c. Its triplet <o,l,c> is found as follows:
Offset: c is not present in the search buffer, so o = 0.
Length: nothing matches, so l = 0.
Triplet: <0,0,c(c)>
Updated window (next element is a):
c | [a b r a c a] dabrarrarrad

Step 2:
The next element is a. It is not present in the search buffer, so the encoded value is
a = <0,0,c(a)>
Updated window (next element is b):
c a | [b r a c a d] abrarrarrad

Step 3:
The next element is b. It is not present in the search buffer, so the encoded value is
b = <0,0,c(b)>
Updated window (next element is r):
c a b | [r a c a d a] brarrarrad

Step 4:
The next element is r. It is not present in the search buffer, so the encoded value is
r = <0,0,c(r)>
Updated window (next element is a):
c a b r | [a c a d a b] rarrarrad

Step 5:
The next element is a. It is present in the search buffer, so the encoded value is
a = <3,1,c(c)>
How?
Offset: counting back from the start of the LA buffer, a is found 3 positions back.
Length: only 1 element matches: the search buffer continues with b (ab) but the LA buffer
continues with c (ac), so the length is 1.
Codeword: a is matched, so we take the next element, which is c.
Final encoded value: <3,1,c(c)>
Updated window (next element is a):
c a b r a c | [a d a b r a] rrarrad

Step 6:
The next element is a. It is present in the search buffer, so the encoded value is
a = <2,1,c(d)>
How?
Offset: a occurs twice in the search buffer (offsets 2 and 5). Both give a match of
length 1, so we take the smaller offset, i.e. 2.
Length: only 1 element matches, so it is 1.
Codeword: a is matched, so we take the next element, which is d.
Final encoded value: <2,1,c(d)>
Updated window (next element is a; the first c has slid out of the search buffer):
a b r a c a d | [a b r a r r] arrad

Step 7:
The next element is a. It is present in the search buffer, so the encoded value is
a = <7,4,c(r)>
How?
Offset: a occurs 3 times in the search buffer (offsets 2, 4 and 7). We take the one that
gives the longest match, i.e. 7.
Length: 4 elements match, i.e. abra is present in the search buffer as well as
in the LA buffer, so the value is 4.
Codeword: abra is matched, so we take the next element, which is r.
Final encoded value: <7,4,c(r)>
Updated window (next element is r):
a d a b r a r | [r a r r a d]

Step 8:
The next element is r. It is present in the search buffer, so the encoded value is
r = <3,5,c(d)>
How?
Offset: r occurs twice in the search buffer (offsets 1 and 3). We take the one that
gives the longest match, i.e. 3.
Length: 5 elements match, i.e. rarra. The match starts in the search buffer and
runs on into the LA buffer, so the value is 5.
Codeword: rarra is matched, so we take the next element, which is d.
Final encoded value: <3,5,c(d)>

So the final answer is the list of all triplets:
<0,0,c(c)> <0,0,c(a)> <0,0,c(b)> <0,0,c(r)> <3,1,c(c)> <2,1,c(d)> <7,4,c(r)> <3,5,c(d)>
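
A compact Python sketch of the search described in the steps above is given here. It is one possible reading of the procedure, with assumptions stated: `lz77_encode` is a made-up name, the buffers are sized 7 and 6 as in the question, ties are resolved in favour of the smallest offset, and one lookahead symbol is always kept back as the explicit next character.

```python
def lz77_encode(text, search_size=7, lookahead_size=6):
    """Return a list of (offset, length, next_char) triplets."""
    triplets = []
    pos = 0
    while pos < len(text):
        best_offset, best_length = 0, 0
        # Keep one symbol of the lookahead buffer for the explicit next character.
        max_length = min(lookahead_size - 1, len(text) - pos - 1)
        for offset in range(1, min(search_size, pos) + 1):
            length = 0
            # A match may start in the search buffer and run on into the lookahead buffer.
            while length < max_length and text[pos - offset + length] == text[pos + length]:
                length += 1
            if length > best_length:  # strict '>' keeps the smallest offset on ties
                best_offset, best_length = offset, length
        triplets.append((best_offset, best_length, text[pos + best_length]))
        pos += best_length + 1
    return triplets

for triplet in lz77_encode("cabracadabrarrarrad"):
    print(triplet)
```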


Decoding LZ77:
Question:
W = 13, LA buffer = 6, search buffer = 7. Decode the triplets below:
< 0,0, c(c) >
< 0,0, c(a) >
< 0,0, c(b) >
< 0,0, c(r) >
< 3,1, c(c) >
< 2,1, c(d) >
< 7,4, c(r) >
< 3,5, c(d) >

Answer:
For each triplet, go back “offset” symbols in the text decoded so far, copy “length”
symbols, and then append the next character.

Triplet            Symbols added    Decoded text so far
1. <0,0,c(c)>      c                c
2. <0,0,c(a)>      a                c a
3. <0,0,c(b)>      b                c a b
4. <0,0,c(r)>      r                c a b r
5. <3,1,c(c)>      a c              c a b r a c
6. <2,1,c(d)>      a d              c a b r a c a d
7. <7,4,c(r)>      a b r a r        c a b r a c a d a b r a r
8. <3,5,c(d)>      r a r r a d      c a b r a c a d a b r a r r a r r a d

So the decoded value is: cabracadabrarrarrad
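
Decoding needs no search at all, only copying. The sketch below (the name `lz77_decode` is an illustrative assumption) copies one symbol at a time, which is what makes the last triplet work: its length 5 is larger than its offset 3, so the copy reads symbols it has just written.

```python
def lz77_decode(triplets):
    """Rebuild the text from (offset, length, next_char) triplets."""
    out = []
    for offset, length, next_char in triplets:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])  # symbol by symbol, so overlapping copies work
        out.append(next_char)
    return "".join(out)

triplets = [
    (0, 0, "c"), (0, 0, "a"), (0, 0, "b"), (0, 0, "r"),
    (3, 1, "c"), (2, 1, "d"), (7, 4, "r"), (3, 5, "d"),
]
print(lz77_decode(triplets))  # cabracadabrarrarrad
```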

Adaptive – LZ78 / LZ2 - Encoding:

• LZ78 builds an explicit dictionary of phrases instead of using a sliding window, so we do
not use triplets here. We use doublets <i,c>, where i is the index of the longest matching
dictionary entry (0 if there is none) and c is the codeword of the next character.
Example:
Encode the string by LZ78/LZ2
‘ABCDABCABCDAABCABCE’
Answer:

Encoder o/p    Index   Entry   Note
<0, C(A)>      1       A       index 0, as A occurs for the 1st time
<0, C(B)>      2       B       index 0, as B occurs for the 1st time
<0, C(C)>      3       C       index 0, as C occurs for the 1st time
<0, C(D)>      4       D       index 0, as D occurs for the 1st time
Next is A, which is already present with index 1, so we take A along with the next
character, B. AB is not in the dictionary yet, so we encode it with the index of A
and the new character B:
<1, C(B)>      5       AB
<3, C(A)>      6       CA
<2, C(C)>      7       BC
<4, C(A)>      8       DA
Next is A, which is already present; AB is also already present (index 5), so we extend
the phrase to ABC, which is new. We encode it with the index of AB and the new
character C:
<5, C(C)>      9       ABC
<9, C(E)>      10      ABCE
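
The dictionary growth in this example can be reproduced with the sketch below. The name `lz78_encode` is an assumption, and so is the handling of an input that ends in the middle of a known phrase (the sketch emits the index with an empty character), a case the example above does not reach.

```python
def lz78_encode(text):
    """Return (doublets, dictionary); doublets are (index, next_char) pairs."""
    dictionary = {}   # phrase -> index, indices start at 1
    doublets = []
    phrase = ""
    for char in text:
        if phrase + char in dictionary:
            phrase += char            # keep extending while the phrase is known
        else:
            doublets.append((dictionary.get(phrase, 0), char))
            dictionary[phrase + char] = len(dictionary) + 1
            phrase = ""
    if phrase:                        # assumption: flush a trailing known phrase
        doublets.append((dictionary[phrase], ""))
    return doublets, dictionary

doublets, dictionary = lz78_encode("ABCDABCABCDAABCABCE")
for doublet in doublets:
    print(doublet)
```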
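Adaptive – LZ78 / LZ2 - Decoding:
The doublets are given; we have to rebuild the index table and the message. Each entry is
the entry at the given index (empty for index 0) followed by the new character.
Answer:
Doublet (given)   Index   Entry
⟨0, C(A)⟩         1       A
⟨0, C(B)⟩         2       B
⟨0, C(C)⟩         3       C
⟨0, C(D)⟩         4       D
⟨1, C(B)⟩         5       AB
⟨3, C(A)⟩         6       CA
⟨2, C(C)⟩         7       BC
⟨4, C(A)⟩         8       DA
⟨5, C(C)⟩         9       ABC
⟨9, C(E)⟩         10      ABCE
Decoded message (all entries in order): ABCDABCABCDAABCABCE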
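
Rebuilding the table from the doublets takes only a few lines. In this sketch (the name `lz78_decode` is illustrative) index 0 is stored as the empty phrase, so every doublet is handled the same way.

```python
def lz78_decode(doublets):
    """Return (message, entries) rebuilt from (index, next_char) doublets."""
    entries = [""]  # entries[0] is the empty phrase used by index 0
    for index, char in doublets:
        entries.append(entries[index] + char)
    return "".join(entries), entries[1:]

doublets = [
    (0, "A"), (0, "B"), (0, "C"), (0, "D"), (1, "B"),
    (3, "A"), (2, "C"), (4, "A"), (5, "C"), (9, "E"),
]
message, entries = lz78_decode(doublets)
print(message)  # ABCDABCABCDAABCABCE
```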
Adaptive – LZW - Encoding:
• LZW (Lempel-Ziv-Welch) is a widely used variant of LZ78.
• It never sends an explicit character: the dictionary starts with every letter of the
alphabet, and the output consists only of dictionary indices.
Question:
The initial dictionary is:
Index Entry
1 a
2 b
3 c
And the sequence is: ababbabcababba
Answer:
Encoded output   Index   Entry
–                1       a
–                2       b
–                3       c
1                4       ab
2                5       ba
4                6       abb
5                7       bab
2                8       bc
3                9       ca
4                10      aba
6                11      abba
1                –       –       (end of the sequence: the remaining a is output)
Process:
• Find the longest string at the current position that is already in the dictionary and
output its index. Add that string plus the next character as a new entry; the next
match starts from that next character. For example, for the new entry ab we output
the index of a (1) and start the next match from b.

Encoded Output is: 1 2 4 5 2 3 4 6 1
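
The process can be sketched in Python as follows. The function name `lzw_encode` and its `alphabet` parameter are assumptions for illustration; indices start at 1 to match the table in this example.

```python
def lzw_encode(text, alphabet):
    """Return (indices, dictionary) for text over the given initial alphabet."""
    dictionary = {symbol: i for i, symbol in enumerate(alphabet, start=1)}
    output = []
    phrase = ""
    for char in text:
        if phrase + char in dictionary:
            phrase += char                          # keep extending the match
        else:
            output.append(dictionary[phrase])       # emit the longest known phrase
            dictionary[phrase + char] = len(dictionary) + 1
            phrase = char                           # restart from the unmatched character
    if phrase:
        output.append(dictionary[phrase])           # emit what is left at the end
    return output, dictionary

codes, dictionary = lzw_encode("ababbabcababba", "abc")
print(codes)  # [1, 2, 4, 5, 2, 3, 4, 6, 1]
```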


Adaptive – LZW - Decoding:

Question: Encoded message: 1 2 4 5 2 3 4 6 1. Decode the given message, with the initial
values:
Index Entry
1 a
2 b
3 c

LZW Decoding Table

Each received index is decoded with the current dictionary. The new entry is the previously
decoded string plus the first character of the current one; until the next index arrives it
is only a partial entry.

Received   Decoded   Completed entry   Partial entry
1          a         –                 4: a–
2          b         4: ab             5: b–
4          ab        5: ba             6: ab–
5          ba        6: abb            7: ba–
2          b         7: bab            8: b–
3          c         8: bc             9: c–
4          ab        9: ca             10: ab–
6          abb       10: aba           11: abb–
1          a         11: abba          –

Decoded Message is: ababbabcababba
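
A minimal decoder sketch is shown below, applied to the index sequence 1 2 4 5 2 3 4 6 1 that the encoding example produced. The name `lzw_decode` is an assumption. The `else` branch covers the standard special case where a received index refers to the entry that is still being built; this particular sequence never triggers it.

```python
def lzw_decode(codes, alphabet):
    """Return (message, entries) for a list of LZW indices."""
    entries = {i: symbol for i, symbol in enumerate(alphabet, start=1)}
    previous = entries[codes[0]]
    decoded = [previous]
    for code in codes[1:]:
        if code in entries:
            current = entries[code]
        else:
            # The index points at the entry under construction.
            current = previous + previous[0]
        # Complete the pending entry: previous string + first character of the current one.
        entries[len(entries) + 1] = previous + current[0]
        decoded.append(current)
        previous = current
    return "".join(decoded), entries

message, entries = lzw_decode([1, 2, 4, 5, 2, 3, 4, 6, 1], "abc")
print(message)  # ababbabcababba
```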
