Data Compression - Unit 2
C = 10 (4 × 2 = 8 bits)
B = 110 (3 × 3 = 9 bits)
A = 111 (3 × 3 = 9 bits)
This means that if the message size is greater, transmission takes a longer time, and vice versa.
• 3 bits = 2³ = 8 combinations
• 8 bits = 2⁸ = 256 combinations
• 1 byte = 8 bits
• 1 byte can hold a number between 0 and 255
• Decoding process (total transmitted size):
Encoded bits + (number of unique alphabets × 8) + total of all frequencies
Example:
27 + 4×8 + 15 = 74 bits
Questions:
1. Build a Huffman Tree:
Given the following characters and their frequencies, construct a Huffman tree and
determine the Huffman codes for each character:
A - 5, B - 9, C - 12, D - 13, E - 16, F - 45
2. Decode a Huffman Encoded String:
Given the Huffman tree you built in Exercise 1, decode the following binary sequence:
1100110011111010 (Using the above question generated code)
3. Find the Huffman Code for a Given String:
Consider the string "MISSISSIPPI". Compute the frequency of each character and
generate the Huffman encoding.
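The tree-building in questions 1–3 can be sketched in Python. This is a minimal illustration, not course code: the `huffman_codes` helper and its heap-of-dicts representation are my own, and tie-breaks may give different (equally optimal) codes.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} map."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two smallest frequencies
        f2, _, c2 = heapq.heappop(heap)
        # Prefix 0 to the left subtree's codes and 1 to the right subtree's.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = Counter("MISSISSIPPI")  # {'M': 1, 'I': 4, 'S': 4, 'P': 2}
codes = huffman_codes(freqs)
bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes, bits)  # total encoded length: 21 bits
```

For MISSISSIPPI the code lengths come out as 1, 2, 3, 3 for the four symbols, so the encoded message needs 4×1 + 4×2 + 2×3 + 1×3 = 21 bits regardless of tie-breaks.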
More questions:
1. ABBCDBCCDAABBEEEBEAB (20 characters; 8 bits × 20 = 160 bits uncompressed)
For 1 character, we require 8 bits.
2. A:15, B:6, C:7, D:12, E:25, F:4, G:6, H:10, I:15
Example: ABCDE – 5 different characters available; they can fit into 3 bits. (How do we find these 3 bits? By computing 2ⁿ, where n represents the number of bits. So to fit 5 characters we need 3 bits, since 2³ = 8 ≥ 5.)
Now we get 5, 8, 7; again arrange them as a min-heap (we have to keep re-arranging the numbers to maintain the heap after each merge):
5, 7, 8
• Create the chart, write the codewords, and calculate the required bits.
• Now perform the encoding process.
Example:
Letter  Prob  Codeword
A1      0.2
A2      0.4
A3      0.2
A4      0.1
A5      0.1
Steps:
Questions:
Huffman Coding:
a) Sometimes, we can obtain more than one Huffman code due to different tie breaks during
the Huffman code construction. Construct two Huffman codes, H and H', for the following
data:
Symbol A B C D E
Here are the AKTU-style Huffman coding questions along with their answers:
Numerical/Problem-Solving Questions
6. Construct a Huffman tree for the given symbols and probabilities.
Symbol A B C D E F
Probability 0.05 0.1 0.15 0.2 0.25 0.25
Solution:
• Build the binary tree using the steps we learnt in class.
Final Huffman Codes (It may Vary):
• A → 000
• B → 001
• C → 010
• D → 011
• E → 10
• F → 11
8. Two Huffman codes H1 and H2 have the same average length, but H1 has higher variance.
Which one is better for transmission and why?
Answer:
H2 is better because lower variance means that the codeword lengths are more uniform,
leading to less fluctuation in transmission times and buffering. High variance can cause
jitter and inefficient resource allocation in real-time applications.
9. If all symbols have equal probabilities, what type of encoding is optimal? Is Huffman
coding still beneficial?
Answer:
When all symbols have equal probabilities, fixed-length encoding (e.g., plain ASCII) is
optimal. Huffman coding would assign codes of (nearly) the same length in this case,
making it no more efficient than fixed-length encoding.
Summary Table
Question                          Key Takeaway
What is Huffman coding?           Lossless compression algorithm that assigns shorter codes to frequent symbols.
Steps to build a Huffman tree?    Merge smallest probabilities, assign 0/1 recursively.
Prefix-free codes?                Yes, Huffman codes are always prefix-free.
Time complexity?                  O(n log n) using priority queues.
Variable-length encoding?         Huffman codes adapt to symbol frequency, unlike fixed-length codes.
High variance vs. low variance?   Low variance is preferable for smoother transmission.
Symbol A B C D E
Frequency 5 10 15 20 50
Questions:
Find Probability, Huffman code, Code- length
Avg length:
E[L] = ∑(Pᵢ × Lᵢ)  (∑ means the sum of a set of terms; Pᵢ is the probability of symbol i and Lᵢ its code length)
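As a sketch of this formula applied to the frequency table above: the code assignment below is one valid Huffman result for those frequencies (tie-breaks may produce other, equally optimal codes with the same average length).

```python
# Frequencies from the table above; probabilities are freq / 100.
freqs = {"A": 5, "B": 10, "C": 15, "D": 20, "E": 50}
total = sum(freqs.values())
probs = {s: f / total for s, f in freqs.items()}

# One possible Huffman assignment (an assumption for illustration;
# any optimal code has lengths 4, 4, 3, 2, 1 for A, B, C, D, E).
codes = {"E": "0", "D": "10", "C": "110", "B": "1110", "A": "1111"}

# E[L] = sum(P_i * L_i)
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
print(avg_len)  # ≈ 1.95 bits per symbol
```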
TOPIC:
ADAPTIVE HUFFMAN CODING:
1. It is used when the probabilities of the symbols are not known in advance.
2. It is used for real-time data compression.
3. Adaptive Huffman coding updates the tree dynamically as symbols are processed.
4. Examples include audio and video streaming and real-time communication.
5. Adaptive Huffman coding includes 3 phases: updating the tree, encoding, and
decoding.
Update process:
- NYT (Not yet transmitted)
- Encoding is based on the tree which we are going to create in the updating process.
Steps:
1. At first there will be no node, so by default we start with only the NYT node.
2. The default weight of NYT is 0.
3. NYT always remains the leftmost leaf of the tree, even when the tree starts
growing in the next steps.
4. The initial node number of NYT is 51.
5. The node number is found as:
Node number = (2*n) - 1
// here n is the total number of alphabets, i.e. 26
So,
Node number = (2*26) - 1
= 51
6. Whenever a symbol is encountered for the first time, its weight will be 1
(this is the value we write inside the box).
7. If the same symbol occurs again and again, the weight of that symbol keeps
increasing.
8. Nodes are either internal or external:
External nodes: the leaves on the outer side (the symbols and NYT).
Internal nodes: the nodes on the inner side (in short, those having children).
Encoding process:
Message: a a r d v a r k
Algorithm:
Read the input symbol. Is this its first occurrence?
(Yes) Send the NYT code followed by the symbol's fixed code.
(No) Send the code for that symbol.
NYT code / send code: reach from the root node to NYT (or to the symbol's node) and read off the binary path, e.g. 0101.
1. Fixed Code:
Two parameters, e and r, chosen so that the alphabet size m satisfies m = 2ᵉ + r with 0 <= r < 2ᵉ (for m = 26: e = 4, r = 10).
- If 1 <= k <= 2r, the symbol is represented by the (e+1)-bit binary code of k-1.
- Otherwise, it is represented by the e-bit binary code of k-r-1.
Example 1:
B ( 1 <= 2 <= 20 ), [value of k is 2, as the position of B in the alphabet series is 2]
* The condition is true, so we need e+1 = 5 bits in total to represent the symbol B.
* Now we represent the binary code for the symbol B using 5 bits; this is case 1, i.e. k-1 (value of k is 2).
Now,
k-1 = 2-1 = 1.
So the binary code for B is 00001 (1 in 5 bits).
Example 2:
Z (value of k becomes 26)
▪ The second condition applies:
26 > 20
Now,
k-r-1 (k = 26, r = 10)
26-10-1 = 16-1 = 15
(we have to represent 15 in e bits, i.e. 4 bits, by converting 15 into
binary code)
To convert 15 into binary, perform the usual decimal-to-binary
conversion by repeatedly dividing by 2.
So the binary code for Z is 1111.
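The fixed-code rule above can be sketched as a small helper (the function name and the defaults e = 4, r = 10 for the 26-letter alphabet are my own illustration):

```python
def fixed_code(k, e=4, r=10):
    """Fixed code for the k-th letter of an m-letter alphabet, m = 2**e + r."""
    if 1 <= k <= 2 * r:
        return format(k - 1, f"0{e + 1}b")  # (e+1)-bit binary code of k-1
    return format(k - r - 1, f"0{e}b")      # e-bit binary code of k-r-1

print(fixed_code(2))   # B -> 00001
print(fixed_code(26))  # Z -> 1111
```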
DECODING:
Is it the NYT code?
NO: decode the element directly from the tree.
YES: read 'e' bits after NYT (the value of e is 4 and the value of r is 10); call their value 'p'.
Is the bit value 'p' < r?
YES: read one more bit and update p (p = 2*p + next bit).
NO: add 'r' to 'p' (add r to that element which we got).
Then add 1 to p to get the symbol's position in the alphabet.
YES
Step 2: ‘0000010100010000011000101110110001010’
• The next value is 1.
• According to the binary tree diagram, we move from the parent node, which has
value 1, to 'a'; the bit we use to reach 'a' is also 1.
• This means the node is not NYT.
• We simply decode that element with the symbol given by the binary tree.
• So for the second value we also get 'a'.
Step 3: ‘0000010100010000011000101110110001010’
• The next value is 0.
• So 0 leads to the NYT node, as we reach NYT from the parent node using 0 according to
the diagram.
• Read 'e' bits after NYT, i.e. take the next 4 bits: 1000.
How to convert?
Step 4: ‘0000010100010000011000101110110001010’
• Using the next two 0's we reach NYT; after the two 0's there is no further
child node, which is why we do not take the 3rd 0.
• Now we read e bits after NYT (i.e. after the two 0's).
• 0001 is 1, which is less than 10 (i.e. 1 < 10), so we read one more bit.
• The bits become 00011, i.e. 3.
• According to the flow chart we add +1 to 3, i.e. 4 = d.
• Now update your tree.
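The decoding flow chart can be sketched the same way (the `decode_fixed` helper is my own illustration; it returns the alphabet position and the number of bits consumed):

```python
def decode_fixed(bits, e=4, r=10):
    """Decode one fixed code from the front of `bits`.

    Returns (alphabet position starting at 1, number of bits consumed)."""
    p = int(bits[:e], 2)          # read e bits
    if p < r:
        p = 2 * p + int(bits[e])  # p < r: read one more bit
        used = e + 1
    else:
        p = p + r                 # p >= r: add r
        used = e
    return p + 1, used            # +1 gives the position in the alphabet

print(decode_fixed("00011"))  # (4, 5): 4th letter, i.e. 'd'
print(decode_fixed("1111"))   # (26, 4): 26th letter, i.e. 'z'
```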
TOPIC
Golomb Code:
• Denoted by Gm(n)
• Used for lossless data compression
• Golomb coding is a lossless data compression method that is particularly efficient for
encoding sequences of integers with a geometric distribution (i.e., when smaller
numbers occur more frequently than larger ones).
Steps:
Step 1: Compute the quotient q = ⌊n/m⌋ and write it in unary code.
Step 2: Compute the remainder r = n mod m and write it in (truncated) binary.
Step 3: Concatenate the results of step 1 and step 2.
Example:
Question: Design the Golomb code for 9 with divisor 4
Answer:
n = 9, m = 4, G4(9) = ?
Let's start:
Step 1:
Quotient:
Unary code of q = ⌊n/m⌋
q = ⌊9/4⌋ (we take only the integer part after dividing, not the decimal part), so
q = 2 (now we write q in unary: q 1's followed by one 0)
q = 110
Remainder:
r = 9 mod 4 (i.e. we take the remainder part) = 1
Step 2:
k = ⌈log₂ m⌉
k = ⌈log₂ 4⌉ = 2
• c = 2ᵏ - m
c = 2² - 4 = 0
r + c = 1 + 0 = 1
k = 2
So we represent 1 in 2 bits, i.e. 01.
Step 3:
• Now we simply concatenate the results of step 1 and step 2.
• Step 1 = 110, step 2 = 01
• 11001 is the result.
So the Golomb code G4(9) = 11001.
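The three steps can be combined into one sketch (the `golomb` helper is my own illustration; the r < c branch implements the short truncated-binary codes for divisors that are not powers of two, a case the worked example with m = 4 never hits):

```python
import math

def golomb(n, m):
    """Golomb code G_m(n): unary quotient + truncated binary remainder."""
    q, r = divmod(n, m)
    unary = "1" * q + "0"               # step 1: q ones followed by a zero
    k = math.ceil(math.log2(m))
    c = 2 ** k - m
    if r < c:                           # step 2: short (k-1)-bit remainder codes
        rem = format(r, f"0{k - 1}b") if k > 1 else ""
    else:
        rem = format(r + c, f"0{k}b")   # k-bit codes, offset by c
    return unary + rem                  # step 3: concatenate

print(golomb(9, 4))  # 11001
print(golomb(7, 5))  # 1010
```

Running it over n = 0..15 with m = 5 answers the question below.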
More questions:
1. Find the Golomb code for n = 0, 1, 2, 3, ..., 15, where m = 5
Tunstall Code
• Tunstall Code is a type of variable-to-fixed-length coding used in data compression.
• Tunstall coding replaces variable-length sequences of input symbols with fixed-
length codewords.
N + k(N-1) <= 2ⁿ
This inequality gives k, the number of iterations used to build the Tunstall code.
• n → number of bits per Tunstall codeword
• N → size of the source alphabet (number of letters)
• k → number of iterations
Example:
Letters Probability
a₁ 0.7
a₂ 0.2
a₃ 0.1
Given N = 3, we want to generate a 3-bit Tunstall code.
N + k(N-1) <= 2ⁿ
3 + k(3-1) <= 2³
2k <= 5
k <= 2.5
Since k is a natural number, we take k = 2.
1st Iteration:
Find the letter with the highest probability, which is a₁, and remove it from the base
table. Then extend it with each of the 3 letters, multiplying the probabilities:
Letters Probability
a₂ 0.2
a₃ 0.1
a₁a₁ 0.7×0.7=0.49
a₁a₂ 0.7×0.2=0.14
a₁a₃ 0.7×0.1=0.07
2nd Iteration: This is our last iteration (k = 2); after it the entries fit exactly into 3-bit codewords.
- Remove the entry with the highest probability (a₁a₁) and extend it with the letters
given in the question, i.e. a₁a₁ with a₁, a₂, a₃.
Letters Codeword
a₂ 000 [0]
a₃ 001 [1]
a₁a₂ 010 [2]
a₁a₃ 011 [3]
a₁ a₁ a₁ 100 [4]
a₁ a₁a₂ 101 [5]
a₁ a₁a₃ 110 [6]
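The two iterations above can be sketched as follows (the `tunstall` helper is my own illustration; the letters a, b, c stand for a₁, a₂, a₃):

```python
def tunstall(probs, n):
    """Build a Tunstall dictionary for n-bit codewords.

    probs: {letter: probability}. Returns {sequence: codeword}."""
    N = len(probs)
    k = (2 ** n - N) // (N - 1)  # number of iterations: N + k(N-1) <= 2^n
    entries = dict(probs)
    for _ in range(k):
        best = max(entries, key=entries.get)  # highest-probability entry
        p = entries.pop(best)                 # remove it from the table
        for letter, pl in probs.items():      # extend it with every letter
            entries[best + letter] = p * pl
    # Assign fixed-length n-bit codewords to the final entries.
    return {seq: format(i, f"0{n}b") for i, seq in enumerate(sorted(entries))}

code = tunstall({"a": 0.7, "b": 0.2, "c": 0.1}, 3)
print(code)  # 7 entries: b, c, ab, ac, aaa, aab, aac
```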
→ Find the highest probability from Table-1 & remove the entry & concatenate with others.
Letters Probability
B 0.3
C 0.1
AA 0.6 × 0.6 = 0.36
AB 0.6 × 0.3 = 0.18
AC 0.6 × 0.1 = 0.06
→ Now again, find the highest probability & perform the same:
Worked table (predictive coding: ŷᵢ is the previous sample used as the prediction, dᵢ = yᵢ − ŷᵢ is the residual, and xᵢ the mapped non-negative value):
yᵢ 32 33 35 39 37 38 39 40 40 40 40 39 40 40 41 40
ŷᵢ  0 32 33 35 39 37 38 39 40 40 40 40 39 40 40 41
dᵢ 32  1  2  4 -2  1  1  1  0  0  0 -1  1  0  1 -1
Tᵢ  0  9  8  6  2  4  3  2  1  1  1  1  2  1  1  0
xᵢ 32  2  4  8  3  2  2  2  0  0  0  1  2  0  2  1
2. Given a set of symbols with their respective probabilities, construct the Huffman code
and calculate the average code length.
Answer:
Example: Given symbols A, B, C, D with frequencies (0.4, 0.3, 0.2, 0.1), construct Huffman
codes.
Symbol Probability Huffman Code
A 0.4 0
B 0.3 10
C 0.2 110
D 0.1 111
Average Code Length:
L = (0.4×1) + (0.3×2) + (0.2×3) + (0.1×3) = 1.9 bits per symbol
3. Explain the concept of Extended Huffman Codes. How do they differ from standard
Huffman Codes?
Answer:
• Extended Huffman Codes use blocks of symbols instead of single symbols for
encoding.
• If the original Huffman coding doesn't give sufficient compression, combining multiple
symbols into a single unit helps in further reducing the code length.
• Difference:
o Standard Huffman coding assigns a code to each individual character.
o Extended Huffman coding assigns a code to groups of characters (pairs,
triplets, etc.).
4. Discuss Adaptive Huffman Coding. How does it adjust to changing data characteristics?
Answer:
• Adaptive Huffman Coding dynamically updates the Huffman tree as new symbols are
encountered.
• Unlike static Huffman coding (where frequencies are predetermined), adaptive
coding does not require a frequency table beforehand.
5. What are Rice Codes and Golomb Codes? Provide examples of each.
8. Determine whether the following code set is uniquely decodable: {0, 01, 10, 110}.
Answer:
• If no codeword is a prefix of another, the code is a prefix code and hence uniquely
decodable.
• Here, 0 is a prefix of 01, so the code is not a prefix code; the Sardinas–Patterson
test (the dangling suffix 1 leads to the suffix 0, which is itself a codeword)
confirms it is not uniquely decodable.
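The prefix check can be sketched as a one-liner (the helper name is mine; note it tests the prefix condition, which is sufficient but not necessary for unique decodability):

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (instantaneous code)."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_free({"0", "01", "10", "110"}))   # False: 0 is a prefix of 01
print(is_prefix_free({"0", "10", "110", "111"}))  # True
```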
10. Explain the concept of coding redundancy and its impact on compression efficiency.
Answer:
• Coding redundancy occurs when more bits than necessary are used to represent
data.
• Impact:
o Reduces compression efficiency.
o Huffman coding reduces redundancy by assigning shorter codes to frequent
symbols.
NUMERICAL QUESTIONS:
8. Tunstall Coding
Construct a Tunstall Code for the following source probabilities using n = 3 bits per codeword:
Symbol Probability
A 0.5
B 0.3
C 0.2
Find the generated codewords.
9. Entropy Calculation
For a source generating symbols X, Y, and Z with probabilities 0.5, 0.3, and 0.2, calculate:
1. Entropy of the source.
2. Minimum average bits required per symbol.
3. Efficiency of Huffman coding if the average code length obtained is 1.6 bits per
symbol.
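The three calculations can be sketched directly, assuming the stated average length of 1.6 bits per symbol:

```python
import math

probs = [0.5, 0.3, 0.2]

# 1. Entropy H = -sum(p * log2(p))
H = -sum(p * math.log2(p) for p in probs)

# 2. By the source coding theorem, H is the minimum average bits per symbol.
# 3. Efficiency = H / average code length
efficiency = H / 1.6

print(round(H, 4), round(efficiency, 4))  # ≈ 1.4855 bits/symbol, ≈ 92.8%
```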