Data Compression - Unit 3
COMPRESSION
Jitendra kumar
Assistant professor - CSE
UNIT-3
TOPIC:
Arithmetic Coding or Binary Code
• Arithmetic coding is a lossless data compression technique.
• It is used to decrease the average length of the codewords.
• Arithmetic coding represents the entire message as a single fractional number
that lies between 0 and 1.
• It avoids assigning a codeword to each individual symbol by assigning one code to
the entire file.
• The method starts with the interval [0,1), where 0 is included and 1 is excluded.
• It reads the input file symbol by symbol and uses the probability of each symbol
to narrow the interval.
Example:
TOPIC:
Generating a Binary Code:
Example:
Question: Consider an alphabet of 4 letters, A = {1,2,3,4}, where P(1) = 0.5, P(2) = 0.25,
P(3) = 0.125, P(4) = 0.125. Find the binary code for each letter.
Answer:
Compute the cumulative probability Fx of each letter, continuing until it reaches 1, i.e.
until all the letters have been covered.
F(0) = 0
F(1) = 0+0.5 = 0.5
F(2) = 0+0.5+0.25 = 0.75
F(3) = 0+0.5+0.25+0.125 = 0.875
F(4) = 0+0.5+0.25+0.125+0.125 = 1.0
Now,
Number of Bits required: ⌈log2(1/p(n))⌉+1
Tx(i) = Fx(i-1) + ½ P(i)
Where,
Fx(i-1) is the “cumulative probability up to the previous symbol”
Tx(i) is the intermediate (midpoint) value whose binary representation, cut to the
required number of bits, gives the code for letter i
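For letter 1, Tx(1) = 0 + ½ x 0.5 = 0.25. Convert it to binary by doubling repeatedly and
taking the integer part each time:
0.25 x 2 = 0.50 (bit 0)
0.50 x 2 = 1.0 (bit 1)
So, the binary for 0.25 is .01
For letter 2, Tx(2) = 0.5 + ½ x 0.25 = 0.625:
0.625 x 2 = 1.250 (bit 1)
0.250 x 2 = 0.50 (bit 0)
0.50 x 2 = 1.0 (bit 1)
So, the binary for 0.625 is .101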
When n =1,
= ⌈log2(1/p(1))⌉+1
= ⌈log2(1/0.5)⌉+1
= ⌈log2(2)⌉+1
= 1+1 = 2
When n =2,
= ⌈log2(1/p(2))⌉+1
= ⌈log2(1/0.25)⌉+1
= ⌈log2(4)⌉+1
= 2+1 = 3
When n =3,
= ⌈log2(1/p(3))⌉+1
= ⌈log2(1/0.125)⌉+1
= ⌈log2(8)⌉+1
= 3+1 = 4
When n =4,
= ⌈log2(1/p(4))⌉+1
= ⌈log2(1/0.125)⌉+1
= ⌈log2(8)⌉+1
= 3+1 = 4
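The calculation above can be sketched in Python. This is only an illustrative sketch: the function name binary_code is hypothetical, and it assumes the code for each letter is the binary expansion of Tx cut to ⌈log2(1/p)⌉+1 bits, as the formulas in these notes suggest.

```python
import math

def binary_code(probabilities):
    """Return {letter: bit string} from the midpoint values Tx (hypothetical helper)."""
    codes = {}
    cumulative = 0.0  # Fx of the previous letter, F(0) = 0
    for letter, p in probabilities.items():
        midpoint = cumulative + p / 2              # Tx = Fx(i-1) + 1/2 P(i)
        length = math.ceil(math.log2(1 / p)) + 1   # number of bits required
        bits = ""
        value = midpoint
        for _ in range(length):                    # repeated doubling
            value *= 2
            bit = int(value)                       # integer part is the next bit
            bits += str(bit)
            value -= bit
        codes[letter] = bits
        cumulative += p
    return codes

print(binary_code({1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}))
```

For the probabilities of this example, letters 1 and 2 come out as 01 and 101, matching the two conversions worked by hand above.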
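To generate the tag for a sequence, the limits of the interval are updated for each symbol:
l_n = l_(n-1) + (u_(n-1) – l_(n-1)) . Fx(x_n – 1)
u_n = l_(n-1) + (u_(n-1) – l_(n-1)) . Fx(x_n)
Where
l is “lower limit”
u is “upper limit”
n is “position in the sequence”, running up to the length of the sequence
x is “sequence that we want to encode” and Fx is the cumulative probability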
Now iterate until the end of the given sequence is reached, i.e. n = 4.
The initial values are l0 = 0 and u0 = 1; the remaining values are found using the formulas.
Now,
Finding Fx first
Fx is the cumulative sum of the probabilities. Compute it letter by letter until the last
value reaches 1.0, i.e. until every letter has been covered.
Like:
F(0) = 0 (As, this is the initial stage)
F(1) = 0 + 0.5 = 0.5
F(2) = 0 + 0.5 + 0.3 = 0.8
F(3) = 0 + 0.5 + 0.3 + 0.2 = 1.0
Now find the lower limit and the upper limit:
The sequence is Xn = 1332, so the first iteration uses 1, the second uses 3, and so
on.
Continue in this way up to l4 and u4, then find the tag value, which is the midpoint of
the final interval.
Tag value = (l4 + u4) / 2
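The iterations can be sketched in Python as follows. The function name encode_tag is hypothetical, and the probabilities P(1) = 0.5, P(2) = 0.3, P(3) = 0.2 are read off the Fx values listed above. Fractions are used so that the limits are exact.

```python
from fractions import Fraction

def encode_tag(sequence, probabilities):
    """Return (lower limit, upper limit, midpoint) for the sequence (hypothetical helper)."""
    # F[k] is the cumulative probability Fx(k), with F[0] = 0
    F = [Fraction(0)]
    for p in probabilities:
        F.append(F[-1] + p)
    low, high = Fraction(0), Fraction(1)   # l0 = 0, u0 = 1
    for x in sequence:
        width = high - low
        # both new limits are computed from the previous lower limit
        low, high = low + width * F[x - 1], low + width * F[x]
    return low, high, (low + high) / 2

probs = [Fraction("0.5"), Fraction("0.3"), Fraction("0.2")]
l4, u4, tag = encode_tag([1, 3, 3, 2], probs)
print(float(l4), float(u4), float(tag))
```

Under these assumptions the sequence 1332 ends with l4 = 0.49 and u4 = 0.496, so the midpoint is 0.493.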
CODING A SEQUENCE (Arithmetic coding) (DECODING or DECIPHERING OF METHOD 1):
Example:
Question: Decipher the tag value 0.8835, P(1) = 0.6, P(2) = 0.3, P(3) = 0.1.
Answer:
As we already know the formula,
l_n = l_(n-1) + (u_(n-1) – l_(n-1)) . Fx(x_n – 1)
u_n = l_(n-1) + (u_(n-1) – l_(n-1)) . Fx(x_n)
In the first iteration we do not know the value of Xn (the digit of the original message),
so we try candidate values until the tag value lies within the resulting range. In each
iteration, start with Xn = 1 and move to the next value if the tag is outside the range.
Let Xn = 1,
Now put in the l1 and u1,
l1 = Fx(Xn-1) = Fx(0) = 0
u1 = Fx(Xn) = Fx(1) = 0.6
So we get the range 0 – 0.6, but the tag value 0.8835 is greater than both limits, so it
does not lie in this range. We therefore try the next value,
Let Xn = 2,
l1 = Fx(Xn-1) = Fx(1) = 0.6
u1 = Fx(Xn) = Fx(2) = 0.9
The tag value 0.8835 lies between 0.6 and 0.9, so the first value of Xn is 2. Proceed in
the same way to find the remaining values.
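The trial-and-error search can be sketched in Python. The function name decode_tag is hypothetical, and the number of symbols to decode (length) is an assumption, since the question does not state how long the original message is.

```python
from fractions import Fraction

def decode_tag(tag, probabilities, length):
    """Recover `length` symbols from a tag value (hypothetical helper)."""
    # F[k] is the cumulative probability Fx(k), with F[0] = 0
    F = [Fraction(0)]
    for p in probabilities:
        F.append(F[-1] + p)
    low, high = Fraction(0), Fraction(1)
    symbols = []
    for _ in range(length):
        width = high - low
        for x in range(1, len(F)):          # try Xn = 1, 2, 3, ...
            new_low = low + width * F[x - 1]
            new_high = low + width * F[x]
            if new_low <= tag < new_high:   # the tag lies in this range
                symbols.append(x)
                low, high = new_low, new_high
                break
    return symbols

probs = [Fraction("0.6"), Fraction("0.3"), Fraction("0.1")]
print(decode_tag(Fraction("0.8835"), probs, 4))
```

If the message is assumed to have 4 symbols, this gives 2, 3, 1, 2, and the first symbol agrees with the value found by hand above.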
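[Diagram: the interval is divided among a, b and c in three successive stages. The boundary
between b and c is at 0.7 in the first stage, 0.55 in the second and 0.27 in the third.]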
1. The given message is “bac”, so first we expand the interval of b, which runs from 0.2
to 0.7 (see the diagram above).
2. After taking the interval of b, divide it again in proportion to the given
probabilities (convert each probability into %)
i.e.,
Total difference: 0.7-0.2 = 0.5
P(a) = 0.2x100 = 20%
So, 20% of 0.5 = 0.5x 20/100 = 0.1
This means the new upper limit of a = 0.2 + 0.1 = 0.3
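The same subdivision can be sketched in Python. The function name subdivide is hypothetical, and the probabilities P(a) = 0.2, P(b) = 0.5, P(c) = 0.3 are an assumption inferred from the boundaries 0.2 and 0.7 used above; they are not stated explicitly in the notes.

```python
from fractions import Fraction

def subdivide(low, high, probabilities):
    """Split [low, high) among the symbols in proportion to their probabilities."""
    width = high - low
    intervals = {}
    start = low
    for symbol, p in probabilities.items():
        end = start + width * p
        intervals[symbol] = (start, end)
        start = end
    return intervals

# Assumed probabilities, inferred from the interval boundaries in the example
probabilities = {"a": Fraction("0.2"), "b": Fraction("0.5"), "c": Fraction("0.3")}
low, high = Fraction(0), Fraction(1)
for symbol in "bac":
    low, high = subdivide(low, high, probabilities)[symbol]
    print(symbol, float(low), float(high))
```

With these values the interval after b is 0.2 to 0.7, after a it is 0.2 to 0.3, and after c it is 0.27 to 0.3, which agrees with the boundaries 0.55 and 0.27 shown in the diagram.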
DICTIONARY TECHNIQUES:
• Dictionary techniques use the structure of the given data to compress it.
• Suppose we are given abcdeab. We find the frequently occurring pattern and write it
down in a dictionary with an index or
code.
Like
Dictionary code
ab 00
• We use this technique when the patterns are short (ab, abb…..) and occur frequently,
i.e. when there is only a small gap between one occurrence of a pattern and the next
occurrence of the same pattern.
Abbcdabdweab
• Types of Dictionary: Static and Adaptive
Static Dictionary - Digram Coding:
Example:
Question:
A = {a,b,c,d,r}
Dictionary:
Code Entry
000 a
001 b
010 c
011 d
100 r
101 ab
110 ac
111 ad
Encode: abracadabra
Solution:
The message to encode has 11 letters. Allowing 4 bits per letter, it needs 11x4 = 44 bits
before compression. (Strictly, a fixed-length code for the 5-letter alphabet needs only
⌈log2(5)⌉ = 3 bits per letter, i.e. 11x3 = 33 bits.)
Let us see how the message can be compressed using the static technique, i.e. digram coding.
Step 1:
abracadabra
Start by looking up the first 2 letters, ab, in the given dictionary.
ab = 101
Step 2:
abracadabra
Now take the next 2 letters, i.e. ra. This pair is not in the dictionary, so we
encode only 1 letter:
r = 100
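Continuing the same two steps to the end of the message can be sketched in Python. The function name digram_encode is hypothetical; it simply tries a 2-letter entry first and falls back to a single letter, as in Steps 1 and 2.

```python
def digram_encode(message, dictionary):
    """Encode with a static digram dictionary (hypothetical helper)."""
    codes = []
    i = 0
    while i < len(message):
        pair = message[i:i + 2]
        if len(pair) == 2 and pair in dictionary:
            codes.append(dictionary[pair])        # 2-letter entry found
            i += 2
        else:
            codes.append(dictionary[message[i]])  # fall back to 1 letter
            i += 1
    return codes

dictionary = {
    "a": "000", "b": "001", "c": "010", "d": "011", "r": "100",
    "ab": "101", "ac": "110", "ad": "111",
}
codes = digram_encode("abracadabra", dictionary)
print(" ".join(codes), "->", 3 * len(codes), "bits")
```

For abracadabra this gives the seven codewords 101 100 110 111 101 100 000, i.e. 21 bits.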
Example:
Question:
Given sequence : cabracadabrarrarrad
Window size: 13 and Lookahead buffer = 6
Answer:
Given values:
Window size = 13
LA (lookahead) buffer = 6
This means the search buffer = 13-6 = 7
Step 1:
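The LA buffer holds 6 elements, so at the start only the first 6 elements of the sequence
are in the sliding window and the search buffer is empty. The first element C therefore
has no match and is encoded as
C = <0,0,c(C)>
Window at the start:
Search buffer: (empty)   LA buffer: C A B R A C   Remaining: adabrarrarrad
After updating (next element is A):
Search buffer: C   LA buffer: A B R A C A   Remaining: dabrarrarrad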
Step 2:
Next element is A. A is not present in the search buffer, so the encoded value is
A = <0,0,c(A)>
After updating (next element is B):
Search buffer: C A   LA buffer: B R A C A D   Remaining: abrarrarrad
Step 3:
Next element is B. B is not present in the search buffer, so the encoded value is
B = <0,0,c(B)>
After updating (next element is R):
Search buffer: C A B   LA buffer: R A C A D A   Remaining: brarrarrad
Step 4:
Next element is R. R is not present in the search buffer, so the encoded value is
R = <0,0,c(R)>
After updating (next element is A):
Search buffer: C A B R   LA buffer: A C A D A B   Remaining: rarrarrad
Step 5:
Next element is A. A is present in the search buffer, so the encoded value is
A = <3,1,c(C)>
How?
Offset: counting back from the LA buffer into the search buffer, A is found at position 3
Length: only 1 element matches. In the search buffer A is followed by B, but in the LA
buffer it is followed by C, so the length is 1
Codeword: A is matched, so we take the next element, which is C
So the final encoded value is <3,1,c(C)>
Window for this step:
Search buffer: C A B R   LA buffer: A C A D A B   Remaining: rarrarrad
Step 6:
Next element is A. A is present in the search buffer, so the encoded value is
A = <2,1,c(D)>
How?
Offset: A occurs 2 times in the search buffer and both occurrences give the same match
length, so we take the smaller offset, i.e. 2
Length: only 1 element matches, so it is 1
Codeword: A is matched, so we take the next element, which is D
So the final encoded value is <2,1,c(D)>
Step 7:
Next element is A. A is present in the search buffer, so the encoded value is
A = <7, 4, c(R)>
How?
Offset: A occurs 3 times in the search buffer. We choose the occurrence that gives the
maximum match length, i.e. offset 7
Length: 4 elements match in total, i.e. ABRA is present in the search buffer as well as
in the LA buffer, so the value is 4
Codeword: ABRA is matched, so we take the next element, which is R
So the final encoded value is <7,4,c(R)>
Step 8:
Next element is R. R is present in the search buffer, so the encoded value is
R = <3, 5, c(D)>
How?
Offset: R occurs 2 times in the search buffer. We choose the occurrence that gives the
maximum match length, i.e. offset 3
Length: 5 elements match in total, i.e. RARRA. The match starts in the search buffer and
continues into the LA buffer, so the value is 5
Codeword: RARRA is matched, so we take the next element, which is D
So the final encoded value is <3,5,c(D)>
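The eight steps above can be reproduced with a short Python sketch. The function name lz77_encode is hypothetical. The sketch assumes that ties are broken by the smaller offset, that a match may run on into the LA buffer, and that a match is at most LA size minus 1 so that one element is left for the codeword.

```python
def lz77_encode(text, search_size=7, lookahead_size=6):
    """Return a list of <offset, length, next element> triples (hypothetical helper)."""
    triples = []
    i = 0
    n = len(text)
    while i < n:
        best_offset, best_length = 0, 0
        # leave room for the explicit next element of the triple
        max_length = min(lookahead_size - 1, n - i - 1)
        # offsets are tried from small to large, so ties keep the smaller offset
        for offset in range(1, min(search_size, i) + 1):
            length = 0
            while (length < max_length
                   and text[i - offset + length] == text[i + length]):
                length += 1   # the match may continue into the LA buffer
            if length > best_length:
                best_offset, best_length = offset, length
        triples.append((best_offset, best_length, text[i + best_length]))
        i += best_length + 1
    return triples

for triple in lz77_encode("cabracadabrarrarrad"):
    print(triple)
```

Under these assumptions the sketch produces the same triples as the walkthrough: four triples of the form <0,0,c(.)> for C, A, B and R, followed by <3,1,c(C)>, <2,1,c(D)>, <7,4,c(R)> and <3,5,c(D)>.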
Answer:
1. < 0,0, c(c)>
C
Output   Index   Entry
–        1       a
–        2       b
–        3       c
1        4       ab
2        5       ba
4        6       abb
5        7       bab
2        8       bc
3        9       ca
4        10      aba
6        11      abba
Note on the entry ab: in this case, we take the index of the first character and
start with the last character for the next encoding process.
Process:
• When a new pair is formed, write the index of its first part (the part already in
the dictionary), add the pair to the dictionary as a new entry, and start the next
pair with the last character of this pair.
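The process above can be sketched in Python. The table appears to follow the LZW scheme, so the sketch uses it. The function name lzw_encode is hypothetical, and the input string ababbabcababba is an assumption reconstructed from the entries of the table, since the notes do not show the original message.

```python
def lzw_encode(message, initial_entries):
    """Return (output indices, final dictionary) (hypothetical helper)."""
    # initial dictionary: 1 a, 2 b, 3 c, ...
    dictionary = {entry: i for i, entry in enumerate(initial_entries, start=1)}
    output = []
    current = ""
    for ch in message:
        if current + ch in dictionary:
            current += ch                        # keep extending the match
        else:
            output.append(dictionary[current])   # index of the matched part
            dictionary[current + ch] = len(dictionary) + 1   # new entry
            current = ch                         # restart from the last character
    if current:
        output.append(dictionary[current])
    return output, dictionary

output, dictionary = lzw_encode("ababbabcababba", "abc")
print(output)
```

For this reconstructed input the output indices are 1 2 4 5 2 3 4 6 followed by a final 1 for the last character, and the dictionary grows to the same 11 entries as the table.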
Question: The encoded message is 124533461. Decode the given message using the initial
dictionary:
Index Entry
1 a
2 b
3 c
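A decoder for this kind of question can be sketched in Python, assuming the LZW scheme and assuming that each digit of the encoded message is one index. The function name lzw_decode is hypothetical.

```python
def lzw_decode(codes, initial_entries):
    """Rebuild the message from a list of indices (hypothetical helper)."""
    # initial dictionary: 1 a, 2 b, 3 c, ...
    dictionary = {i: entry for i, entry in enumerate(initial_entries, start=1)}
    previous = dictionary[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # the index refers to the entry that is being built right now
            entry = previous + previous[0]
        # new entry: previous output plus first character of the current one
        dictionary[len(dictionary) + 1] = previous + entry[0]
        pieces.append(entry)
        previous = entry
    return "".join(pieces)

codes = [int(digit) for digit in "124533461"]
print(lzw_decode(codes, "abc"))
```

Under these assumptions the message 124533461 decodes to ababbaccababba.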