Unit – II:
The Huffman coding algorithm: Minimum variance Huffman codes, Adaptive Huffman
coding: Update procedure, Encoding procedure, Decoding procedure. Golomb codes,
Rice codes, Tunstall codes, Applications of Huffman coding: Lossless image
compression, Text compression, Audio Compression.
Huffman Coding
• This technique was developed by David Huffman. Huffman coding is a lossless data
compression algorithm. The idea is to assign variable-length codes to input characters; the
lengths of the assigned codes are based on the frequencies of the corresponding characters.
• The variable-length codes assigned to input characters are prefix codes, meaning the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not a prefix
of the code assigned to any other character.
• This is how Huffman Coding makes sure that there is no ambiguity when decoding the
generated bitstream.
• Huffman Coding is also used as a component in many different compression algorithms. It is
used as a component in lossless compressions such as zip, gzip, and png, and even as part of
lossy compression algorithms like mp3 and jpeg.
• The codes generated using this technique or procedure are called Huffman codes. These codes
are prefix codes and are optimum for a given model (set of probabilities).
• The Huffman procedure is based on two observations regarding optimum prefix codes.
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence)
will have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have the same length in
which only the last bit differs.
• If symbols that occur more often had codewords that were longer than the codewords for symbols
that occurred less often, the average number of bits per symbol would be larger than if the
conditions were reversed.
• Therefore, a code that assigns longer codewords to symbols that occur more frequently cannot be
optimum.
• The second observation requires that the codewords corresponding to the two lowest-probability symbols differ
only in the last bit. That is, if a1 and a2 are the two least probable symbols in an alphabet and the
codeword for a1 is m∗0, then the codeword for a2 would be m∗1. Here m is a string of 1s and 0s, and ∗
denotes concatenation.
There are two major parts in Huffman coding: 1. Build a Huffman tree from the input
characters. 2. Traverse the Huffman tree and assign codes to the characters.
Properties of Huffman coding
1. Prefix-Free Code (Prefix Codes):
• No code in the Huffman tree is a prefix of another code.
• Ensures unique decodability without ambiguity.
2. Optimality:
• In an optimum code, symbols that occur more frequently (have a higher probability of occurrence)
will have shorter codewords than symbols that occur less frequently.
• In an optimum code, the two symbols that occur least frequently will have the same length in which
only the last bit differs.
3. Variable-Length Encoding: Frequently occurring symbols get shorter codes, while less frequent
symbols get longer codes. This reduces the overall size of encoded data.
4. Binary Tree Representation: The codes are derived from a binary tree, where each left edge is
assigned ‘0’ and each right edge is assigned ‘1’.The tree structure ensures efficient encoding and unique
decoding.
5. Lossless Compression: Huffman coding does not lose any information during compression. The
original data can be perfectly reconstructed from the encoded data.
How Huffman Coding Works
1. Frequency Analysis: Count the frequency of each character in the input data.
2. Build a Priority Queue: Create a min-heap (priority queue) where each node represents
a character and its frequency.
3. Construct a Huffman Tree:
• Take the two nodes with the smallest frequencies.
• Merge them into a new node with a frequency equal to the sum of the two.
• Repeat until only one node (the root of the tree) remains.
4. Generate Codes:
• Assign 0 to the left edge and 1 to the right edge of the tree.
• Traverse the tree to assign unique binary codes to each character.
5. Encode the Data: Replace each character with its corresponding binary code, as sketched below.
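A minimal Python sketch of these five steps, using heapq as the min-heap. The frequency table below is the one assumed in the bits-saved example later in this unit (the symbol names a–f are illustrative):

import heapq

def huffman_codes(freq):
    """Build Huffman codes from a {symbol: frequency} table."""
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest frequency
        f2, _, right = heapq.heappop(heap)   # second-lowest frequency
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node
            walk(tree[0], prefix + "0")      # left edge -> 0
            walk(tree[1], prefix + "1")      # right edge -> 1
        else:                                # leaf: record the accumulated code
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freq = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
codes = huffman_codes(freq)
print(codes)
print("Huffman bits:", sum(freq[s] * len(codes[s]) for s in freq))   # 224
print("Fixed 8-bit encoding:", 8 * sum(freq.values()))               # 800

Tie-breaking during the merges can change individual codewords, but any Huffman tree for these frequencies encodes the 100 characters in 224 bits.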
Example for designing Huffman Code
• Consider a five-letter alphabet with probabilities 0.4, 0.2, 0.2, 0.1, and 0.1, for which the Huffman
procedure assigns codewords of lengths 1, 2, 3, 4, and 4 (the entropy of this source is 2.122 bits/symbol).
• The average length for this code is:
l = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2 bits/symbol
• A measure of the efficiency of this code is its redundancy—the difference between the entropy and
the average length. In this case, the redundancy is 0.078 bits/symbol.
Points to Remember:
1. Entropy: Entropy gives the minimum number of bits required to code a message, representing the
theoretical lower limit of how efficiently you can encode information from a given source without losing the
information.
• No coding scheme can encode information using fewer bits than the entropy of the source.
2. Huffman coding: A compression technique that assigns variable-length codes to symbols based on their
frequency, aiming to get as close as possible to the theoretical entropy.
3. Average code length: The average number of bits used to represent a symbol in a Huffman code.
4. Redundancy: The difference between the entropy of a data source and the average length of a Huffman
code generated for that source is called "redundancy". Measures the extra or repetitive information that can
be removed without losing essential data. Determines how much a source can be compressed.
Key Relationship
Higher entropy → Lower redundancy → Less compression possible
Lower entropy → Higher redundancy → More compression possible
Example of Redundancy:
• Consider a source that includes letters from the alphabet A = {a1, a2, a3} with the probability
model P(a1) = 0.8, P(a2) = 0.02, and P(a3) = 0.18. The entropy for this source is 0.816
bits/symbol. A Huffman code for this source is shown in Table below:
Letter Codeword
a1 0
a2 11
a3 10
• The average length for this code is 1.2 bits/symbol. The difference between the average code
length and the entropy, or the redundancy, for this code is 0.384 bits/symbol, which is 47% of the
entropy. This means that to code this sequence we would need 47% more bits than the minimum
required.
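As a quick check of these figures, a short Python sketch that recomputes the entropy, average length, and redundancy from the probability model and the codeword lengths in the table above:

from math import log2

p = {"a1": 0.8, "a2": 0.02, "a3": 0.18}     # probability model
code_len = {"a1": 1, "a2": 2, "a3": 2}      # lengths of the codewords 0, 11, 10

entropy = -sum(pi * log2(pi) for pi in p.values())
avg_len = sum(p[s] * code_len[s] for s in p)
redundancy = avg_len - entropy

print(f"entropy    = {entropy:.3f} bits/symbol")     # ~0.816
print(f"avg length = {avg_len:.3f} bits/symbol")     # 1.200
print(f"redundancy = {redundancy:.3f} bits/symbol")  # ~0.384 (~47% of entropy)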
• Finding the number of bits without using Huffman coding: total number of characters = sum of the
frequencies = 100; size of 1 character = 1 byte = 8 bits; total number of bits = 8 × 100 = 800.
• Using Huffman encoding, the total number of bits needed is: 5×4 + 9×4 + 12×3 +
13×3 + 16×3 + 45×1 = 224. Bits saved = 800 − 224 = 576.
Minimum Variance Huffman Codes
• Minimum Variance Huffman Code is a variation of Huffman coding that aims to reduce the
variance in codeword lengths.
• In standard Huffman coding, the difference between the longest and shortest codeword lengths
can be significant, leading to inefficiencies in certain applications. The minimum variance
Huffman code (MVHC) minimizes this difference, making the code lengths more balanced.
• More Balanced Encoding: Reduces the difference between the shortest and longest
codewords.
• Efficient Memory Access: Helps in applications where uniform codeword lengths improve
processing efficiency.
• Better Decoding Performance: Reduces decoding time by avoiding long codewords.
• By performing the sorting procedure in a slightly different manner, we could have found a
different Huffman code.
Key points about minimum variance Huffman coding:
Merging strategy:
• To achieve minimum variance, the two lowest-probability entries are still merged first during the Huffman
tree construction, but whenever there is a tie the combined symbol is placed as high as possible in the
sorted list, which leads to more balanced codeword lengths.
Benefit:
• A minimum variance Huffman code has the same (optimal) average codeword length as any other
Huffman code for the source, but because the codeword lengths are more balanced, the rate at which
bits are generated is more uniform. This reduces buffering requirements over a fixed-rate channel and
avoids assigning extremely long codewords to the rarest symbols.
Implementation:
• The core algorithm for generating a minimum variance Huffman code is similar to a standard
Huffman algorithm, but with a modified merging strategy that focuses on minimizing the variance
of the resulting codeword lengths.
How it works:
• Like in regular Huffman coding, the first step is to determine the probability of each symbol
appearing in the data.
• Prioritize placement during merging: as in regular Huffman coding, always combine the two entries
with the lowest probabilities; when a tie occurs, place the newly created combined symbol as high as
possible in the sorted list (above the individual symbols that have the same probability).
• This delays the combined symbols from being merged again, which keeps the resulting codeword
lengths close together.
Design a minimum variance Huffman code for a source that puts out letters from an alphabet
A = {a1, a2, a3, a4, a5, a6} with P(a1) = P(a2) = 0.2, P(a3) = 0.25, P(a4) = 0.05, P(a5) = 0.15, P(a6) = 0.15.
Find the entropy of the source, the average length of the code, and the efficiency. Also comment on the
difference between a Huffman code and a minimum variance Huffman code.
The probabilities for each character are arranged in descending order, and by using
minimum variance Huffman coding we obtain the following Huffman tree.
Note: If we solve this by creating a binary tree that labels the left branch 0 and the right
branch 1, the codewords will change from the ones shown in the table in such a way that
every 0 becomes a 1 and every 1 becomes a 0 in each codeword, but the codeword
lengths will remain the same.
LETTER CODEWORD
𝑎1 00
𝑎2 111
𝑎3 10
𝑎4 010
𝑎5 110
𝑎6 011
Coding Efficiency (%)=(Entropy / Average Length of Code) * 100
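As a worked check for the question above, a short Python sketch using the probabilities from the problem statement and the codeword lengths from the table; the exact figures assume that particular code assignment:

from math import log2

p = {"a1": 0.2, "a2": 0.2, "a3": 0.25, "a4": 0.05, "a5": 0.15, "a6": 0.15}
length = {"a1": 2, "a2": 3, "a3": 2, "a4": 3, "a5": 3, "a6": 3}  # from 00, 111, 10, 010, 110, 011

entropy = -sum(pi * log2(pi) for pi in p.values())   # ~2.466 bits/symbol
avg_len = sum(p[s] * length[s] for s in p)           # 2.55 bits/symbol
efficiency = entropy / avg_len * 100                 # ~96.7 %

print(f"H = {entropy:.3f}, L = {avg_len:.2f}, efficiency = {efficiency:.1f}%")

Both the standard and the minimum variance Huffman codes give the same average length for this source; what changes is how evenly the codeword lengths are spread around that average.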
Ques. Create a binary tree using minimum variance Huffman coding for the same table that
was used for Huffman coding.
• McMillan states that if we have a uniquely decodable code with K codewords of lengths l_i,
for i = 1 to K, then the following inequality holds:
∑_(i=1..K) 2^(−l_i) ≤ 1
• Examining the code generated in the earlier Huffman design example, the lengths of the codewords
are 1, 2, 3, 4, 4. Substituting these values into the left-hand side of the inequality:
2^(−1) + 2^(−2) + 2^(−3) + 2^(−4) + 2^(−4) = 0.5 + 0.25 + 0.125 + 0.0625 + 0.0625 = 1 ≤ 1
so the code satisfies the Kraft–McMillan inequality.
1) Using Regular Huffman Coding
• The average length for this code is 1.2 bits/symbol. The difference between the average code
length and the entropy, or the redundancy, for this code is 0.384 bits/symbol, which is 47% of the
entropy.
2) Using Extended Huffman Coding Method
• For the source described in the previous example, instead of generating a codeword for every
symbol, we will generate a codeword for every two symbols.
• If we look at the source sequence two at a time, the number of possible symbol pairs, or size of the
extended alphabet, is 3^2= 9. The extended alphabet, probability model, and Huffman code for this
example are shown in Table.
Letters Probability Codeword
a1a1 0.64 1
a1a3 0.144 00
a3a1 0.144 011
a3a3 0.0324 0100
a1a2 0.016 01010
a2a1 0.016 010111
a2a3 0.0036 0101100
a3a2 0.0036 01011011
a2a2 0.0004 01011010
The average codeword length for this extended code is 1.7228 bits/symbol. However, each symbol
in the extended alphabet corresponds to two symbols from the original alphabet, so the rate in
terms of the original alphabet is 1.7228/2 = 0.8614 bits/symbol, which is much closer to the
entropy of 0.816 bits/symbol than the 1.2 bits/symbol obtained with the single-letter code.
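A quick check of this rate, taking the probabilities and codeword lengths directly from the table above (a short Python sketch):

# (probability, codeword length) pairs for the nine symbol pairs in the table
pairs = [(0.64, 1), (0.144, 2), (0.144, 3), (0.0324, 4), (0.016, 5),
         (0.016, 6), (0.0036, 7), (0.0036, 8), (0.0004, 8)]

avg_extended = sum(p * n for p, n in pairs)   # ~1.7228 bits per pair of symbols
print(avg_extended, avg_extended / 2)         # ~0.8614 bits per original symbol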
Adaptive Huffman Coding: Decoding Procedure
• The flowchart for the decoding procedure is shown in Figure. As we read in the received
binary string, we traverse the tree in a manner identical to that used in the encoding
procedure.
• Once a leaf is encountered, the symbol corresponding to that leaf is decoded. If the leaf is the
NYT node, then we check the next e bits to see if the resulting number is less than r.
• If it is less than r, we read in another bit to complete the code for the symbol. The index for
the symbol is obtained by adding one to the decimal number corresponding to the e- or e+1-
bit binary string.
• Once the symbol has been decoded, the tree is updated and the next received bit is used to
start another traversal down the tree.
Example : Decoding procedure
The binary string generated by the encoding procedure is: 0000010100010000011
• Initially, the decoder tree consists only of the NYT node. Therefore, the first symbol to be
decoded must be obtained from the NYT list. We read in the first 4 bits, 0000, as the value
of e is four. The 4 bits 0000 correspond to the decimal value of 0.
• As this is less than the value of r, which is 10, we read in one more bit for the entire code
of 00000. Adding one to the decimal value corresponding to this binary string, we get the
index of the received symbol as 1. This is the index for a; therefore, the first letter is
decoded as a. The tree is now updated as shown in Figure.
• The next bit in the string is 1. This traces a path from the root node to the external node
corresponding to a. We decode the symbol a and update the tree. In this case, the update
consists only of incrementing the weight of the external node corresponding to a.
• The next bit is a 0, which traces a path from the root to the NYT node. The next 4 bits, 1000,
correspond to the decimal number 8, which is less than 10, so we read in one more bit to get the 5-bit
word 10001. The decimal equivalent of this 5-bit word plus one is 18, which is the index for r. We
decode the symbol r and then update the tree.
• The next 2 bits, 00, again trace a path to the NYT node. We read the next 4 bits, 0001. Since this
corresponds to the decimal number 1, which is less than 10, we read another bit to get the 5-bit word
00011. To get the index of the received symbol in the NYT list, we add one to the decimal value of this
5-bit word. The value of the index is 4, which corresponds to the symbol d.
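In this example e = 4 and r = 10, which corresponds to a 26-letter alphabet (26 = 2^e + r with 0 ≤ r < 2^e), the letters being indexed 1 through 26. A minimal sketch of the index-decoding rule applied at the NYT node (the function name is illustrative):

def decode_nyt_index(bits, pos, e=4, r=10):
    """Decode the alphabet index of a symbol announced via the NYT node.

    bits is a string of '0'/'1'; pos points just past the path to the NYT node.
    Returns (index, new_pos) with a 1-based index (1 = 'a', 18 = 'r', ...).
    """
    value = int(bits[pos:pos + e], 2)            # read e bits first
    if value < r:
        value = int(bits[pos:pos + e + 1], 2)    # small values use e+1 bits in total
        return value + 1, pos + e + 1
    return value + r + 1, pos + e                # large values use only e bits

# First symbol of the example string: 0000 -> 0 < 10, so read 00000 -> index 1 = 'a'
print(decode_nyt_index("0000010100010000011", 0))   # (1, 5)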
GOLOMB CODES
• A Golomb code is a lossless data compression technique that efficiently encodes nonnegative
integers. It is particularly useful when the data to be encoded are integers for which the
larger an integer is, the lower its probability of occurrence.
• It works by splitting an integer into a quotient (coded with a unary code) and a remainder
(coded with a fixed-length code), making it ideal for compressing data where large values
are less common, such as prediction residuals in lossless image compression (for example,
JPEG-LS).
• Golomb codes represent integers using two parts: a quotient, the result of integer division of
the number by a divisor m, and a remainder, what is left after dividing the number by m.
• The number n is encoded as a combination of the unary-coded quotient and the binary-coded
remainder.
• The key parameter in a Golomb code is “m" which determines how the integer is split
into quotient and remainder.
• The Golomb code is actually a family of codes parameterized by an integer m > 0. In the
Golomb code with parameter m, we represent an integer n > 0 using two numbers q and r,
where q = ⌊n/m⌋ is the quotient (integer division of n by m) and r = n − qm is the remainder.
• The final Golomb code for the number n is the concatenation of the unary-encoded
quotient q and the binary-encoded (fixed-length) remainder r.
Assume we have m = 5 and we want to encode n = 12:
•Step 1: Divide 12 by 5. We get a quotient of 2 and a remainder of 2.
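The remaining steps encode the quotient in unary and the remainder in binary. A minimal sketch under one common set of conventions (quotient as q ones followed by a zero, remainder in truncated binary; other texts flip the 0s and 1s):

from math import ceil, log2

def golomb_encode(n, m):
    """Golomb-encode a nonnegative integer n with parameter m.

    Convention assumed here: the quotient is written in unary as q ones
    followed by a zero, and the remainder in truncated binary.
    """
    q, r = divmod(n, m)                  # Step 1: quotient and remainder
    code = "1" * q + "0"                 # Step 2: unary code for the quotient
    b = ceil(log2(m))                    # Step 3: truncated binary for the remainder
    if r < (1 << b) - m:                 # small remainders take b-1 bits
        code += format(r, f"0{b - 1}b")
    else:                                # large remainders take b bits, with an offset
        code += format(r + (1 << b) - m, f"0{b}b")
    return code

print(golomb_encode(12, 5))   # quotient 2, remainder 2 -> "110" + "10" = "11010"

With m = 5, ⌈log2 5⌉ = 3 and 2^3 − 5 = 3, so remainders 0–2 take two bits and remainders 3–4 take three bits; when m is a power of two, every remainder simply takes log2 m bits.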
RICE CODES
• The Rice code was originally developed by Robert F. Rice (he called it the Rice machine) and
later extended by Pen-Shu Yeh and Warner Miller.
• The Rice code can be viewed as an adaptive Golomb code. In the Rice code, a sequence of
nonnegative integers (which might have been obtained from the preprocessing model) is
divided into blocks of size J.
• Each block is then coded using one of several options, most of which are a form of Golomb
codes.
• Each block is encoded with each of these options, and the option resulting in the least number
of coded bits is selected.
• The particular option used is indicated by an identifier attached to the code for each block. The
easiest way to understand the Rice code is to examine one of its implementations.
• The implementation of the Rice code described here is the one contained in the recommendation
for lossless data compression from the Consultative Committee for Space Data Systems (CCSDS).
CCSDS Recommendation of Rice Codes for Lossless
Compression
• The Rice algorithm consists of a preprocessor (the modeling step) and a binary coder (coding
step). The preprocessor removes correlation from the input and generates a sequence of non-
negative integers.
• This sequence has the property that smaller values are more probable than larger values. The
binary coder generates a bitstream to represent the integer sequence. The binary coder is our
main focus at this point.
• The preprocessor functions as follows: given a sequence {y_i}, for each y_i we generate a
prediction ŷ_i. A simple way to generate a prediction would be to take the previous value of the
sequence as the prediction of the current value:
ŷ_i = y_(i−1)
• We then generate a sequence whose elements are the differences between y_i and its predicted
value ŷ_i:
d_i = y_i − ŷ_i
• Let y_max and y_min be the largest and smallest values that the sequence {y_i} takes on.
• It is reasonable to assume that the value of ŷ_i will be confined to the range [y_min, y_max]. Define
T = min(y_max − ŷ_i, ŷ_i − y_min)
• The sequence {d_i} can be converted into a sequence of nonnegative integers {x_i} using the following
mapping:
x_i = 2·d_i, if 0 ≤ d_i ≤ T
x_i = 2·|d_i| − 1, if −T ≤ d_i < 0
x_i = T + |d_i|, otherwise
• The value of x_i will be small whenever the magnitude of d_i is small. Therefore, the value of x_i will
be small with higher probability.
• The sequence {x_i} is then divided into segments, with each segment being further divided into blocks
of size J. It is recommended by CCSDS that J have a default value of 16 samples.
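A minimal Python sketch of this preprocessing step, assuming the simple previous-sample prediction and the mapping above:

def preprocess(y):
    """Map a sample sequence to small nonnegative integers (CCSDS-style sketch)."""
    y_min, y_max = min(y), max(y)
    x = []
    for i in range(1, len(y)):
        pred = y[i - 1]                      # prediction: previous sample
        d = y[i] - pred                      # prediction residual d_i
        t = min(y_max - pred, pred - y_min)  # symmetric range around the prediction
        if 0 <= d <= t:
            x.append(2 * d)                  # nonnegative residuals -> even values
        elif -t <= d < 0:
            x.append(2 * abs(d) - 1)         # negative residuals -> odd values
        else:
            x.append(t + abs(d))             # residuals outside the symmetric range
    return x                                 # blocks of J (default 16) of these are coded

print(preprocess([100, 102, 101, 101, 103]))   # [2, 1, 0, 3]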
• Each block is then coded using one of the following options. The coded block is transmitted
along with an identifier that indicates which particular option was used.
1) Fundamental sequence: This is a unary code. A number ‘n’ is represented by a sequence of
n 0s followed by a 1 (or a sequence of ‘n’ 1s followed by a 0).
2) Split sample options:
▪ These options consist of a set of codes indexed by a parameter m.
▪ The code for a k-bit number 'n' using the mth split sample option consists of the 'm' least
significant bits of 'n' followed by a unary code representing the (k − m) most significant
bits.
▪ For example, suppose we wanted to encode the 8-bit number 23 using the third split sample
option.
▪ The 8-bit representation of 23 is 00010111. The three least significant bits are 111. The
remaining bits (00010) correspond to the number 2, which has a unary code 001. Therefore,
the code for 23 using the third split sample option is 111001.
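A short sketch of the split sample option as described above (with the unary code written as zeros followed by a one):

def split_sample(n, k, m):
    """Encode a k-bit number n using the m-th split sample option."""
    lsb = format(n, f"0{k}b")[-m:]       # the m least significant bits of n
    msb_value = n >> m                   # the remaining k-m most significant bits, as a number
    return lsb + "0" * msb_value + "1"   # unary code: that many zeros followed by a one

print(split_sample(23, 8, 3))   # "111" + "001" = "111001"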
Tunstall Codes:
• Most of the coding methods encode letters from the source alphabet using codewords with
varying numbers of bits: codewords with fewer bits for letters that occur more frequently
and codewords with more bits for letters that occur less frequently.
• Tunstall coding builds a fixed-length codebook from the source symbols, unlike Huffman
coding, which typically generates variable-length codes.
• In Tunstall coding, variable-length sequences of source symbols are encoded into fixed-length codewords.
• The codebook used for this is built from the source's statistics (symbol probabilities).
• The key idea is to replace each such sequence of source symbols with a codeword from the codebook.
• The main advantage of a Tunstall code is that errors in codewords do not propagate, unlike
other variable-length codes, such as Huffman codes, in which an error in one codeword will
cause a series of errors to occur.
Implementation of Tunstall Codes
• Suppose we want an n-bit Tunstall code for a source alphabet of size N.
• We start with the N letters of the source alphabet in the codebook.
• Remove the entry in the codebook that has the highest probability and add the N strings obtained by
concatenating this letter with every letter in the alphabet (including itself).
• This will increase the size of the codebook from N to N + (N −1).
• The probabilities of the new entries will be the product of the probabilities of the letters concatenated to
form the new entry.
• Now look through the N + (N −1) entries in the codebook and find the entry that has the highest
probability, keeping in mind that the entry with the highest probability may be a concatenation of
symbols.
• Each time we perform this operation we increase the size of the codebook by N −1. Therefore, this
operation can be performed K times, where,
N + K(N − 1) ≤ 2^n
Ques: Design a 3-bit Tunstall code for a memoryless source with the alphabet
A = {A, B, C} and P(A) = 0.6, P(B) = 0.3, P(C) = 0.1.
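A sketch of the construction for this question: with n = 3 the codebook can hold 2^3 = 8 entries, and with N = 3 the expansion step can be run K = 2 times, leaving 7 strings. The function below follows the procedure described above:

def tunstall(probs, n_bits):
    """Build an n-bit Tunstall codebook for a memoryless source."""
    entries = dict(probs)                            # start with the single letters
    # Keep expanding while one more step stays within 2^n codebook entries.
    while len(entries) + len(probs) - 1 <= 2 ** n_bits:
        best = max(entries, key=entries.get)         # most probable entry so far
        p_best = entries.pop(best)                   # remove it ...
        for sym, p in probs.items():                 # ... and concatenate with every letter
            entries[best + sym] = p_best * p
    # Assign fixed-length binary codewords to the surviving strings.
    return {s: format(i, f"0{n_bits}b") for i, s in enumerate(sorted(entries))}

codebook = tunstall({"A": 0.6, "B": 0.3, "C": 0.1}, 3)
for string, code in codebook.items():
    print(string, code)
# Seven strings remain: AAA, AAB, AAC, AB, AC, B, C -> each gets a 3-bit codeword

The source sequence is then parsed into these strings, and each parsed string is replaced by its 3-bit codeword.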
Applications of Huffman Coding
2. Text Compression
• Huffman coding plays a crucial role in text compression by reducing the amount of
space required to store or transmit data. It is a lossless compression algorithm that
works by assigning shorter codes to more frequent characters and longer codes to
less frequent ones.
• In text, we have a discrete alphabet that, in a given class, has relatively stationary
probabilities. For example, the probability model for a particular novel will not
differ significantly from the probability model for another novel.
• Huffman coding is used in databases for compressing large amounts of textual data.
When storing large text records (like customer information, product descriptions, or
logs), Huffman coding helps save storage space while keeping data retrieval intact.
3. Audio Compression
• Huffman coding is widely used in audio compression because it is an efficient way to represent
data by reducing the amount of space required to store or transmit it.
• Huffman coding uses variable-length codes to represent data. In the context of audio
compression, this means that more frequent audio data (like certain sound frequencies or
amplitudes) are represented with shorter codes, and less frequent data gets longer codes. This
allows the overall size of the audio file to be reduced because the most common values take up
less space.
• Audio files often contain redundant information, such as repeated patterns or sounds that appear
frequently. Huffman coding identifies these redundancies and replaces them with shorter codes,
effectively reducing the overall size of the file without sacrificing quality.
• Audio codecs (like MP3, FLAC, and AAC) often incorporate Huffman coding as part of their
compression pipeline. It is used in various steps of encoding audio, such as coding the quantized
frequency-domain data or the bitstream that represents the audio. The Huffman stage itself is
lossless, so it makes the bitstream more compact without introducing any loss beyond what the
earlier (lossy) quantization step in codecs such as MP3 or AAC has already discarded.