
UNIT 2

Unit – II:

The Huffman coding algorithm: Minimum variance Huffman codes, Adaptive Huffman
coding: Update procedure, Encoding procedure, Decoding procedure. Golomb codes,
Rice codes, Tunstall codes, Applications of Huffman coding: Lossless image
compression, Text compression, Audio compression.
Huffman Coding
• This technique was developed by David Huffman. Huffman coding is a lossless data
compression algorithm. The idea is to assign variable-length codes to input characters; the lengths
of the assigned codes are based on the frequencies of the corresponding characters.
• The variable-length codes assigned to input characters are prefix codes, meaning the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not a prefix
of the code assigned to any other character.
• This is how Huffman Coding makes sure that there is no ambiguity when decoding the
generated bitstream.
• Huffman Coding is also used as a component in many different compression algorithms. It is
used as a component in lossless compressions such as zip, gzip, and png, and even as part of
lossy compression algorithms like mp3 and jpeg.
• The codes generated using this technique or procedure are called Huffman codes. These codes
are prefix codes and are optimum for a given model (set of probabilities).
• The Huffman procedure is based on two observations regarding optimum prefix codes.
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence)
will have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have the same length in
which only the last bit differs.
• If symbols that occur more often had codewords that were longer than the codewords for symbols
that occurred less often, the average number of bits per symbol would be larger than if the
conditions were reversed.
• Therefore, a code that assigns longer codewords to symbols that occur more frequently cannot be
optimum.
• The second requirement is that the codewords corresponding to the two lowest-probability symbols differ
only in the last bit. That is, if a1 and a2 are the two least probable symbols in an alphabet and the
codeword for a1 is m∗0, then the codeword for a2 is m∗1. Here m is a string of 1s and 0s, and ∗
denotes concatenation.
There are mainly two major parts in Huffman Coding: 1. Build a Huffman Tree from input
characters. 2. Traverse the Huffman Tree and assign codes to characters.
Properties of Huffman coding
1. Prefix-Free Code (Prefix Codes):
• No code in the Huffman tree is a prefix of another code.
• Ensures unique decodability without ambiguity.
2. Optimality:
• In an optimum code, symbols that occur more frequently (have a higher probability of occurrence)
will have shorter codewords than symbols that occur less frequently.
• In an optimum code, the two symbols that occur least frequently will have the same length in which
only the last bit differs.
3. Variable-Length Encoding: Frequently occurring symbols get shorter codes, while less frequent
symbols get longer codes. This reduces the overall size of encoded data.
4. Binary Tree Representation: The codes are derived from a binary tree, where each left edge is
assigned ‘0’ and each right edge is assigned ‘1’. The tree structure ensures efficient encoding and unique
decoding.
5. Lossless Compression: Huffman coding does not lose any information during compression. The
original data can be perfectly reconstructed from the encoded data.
How Huffman Coding Works
1.Frequency Analysis: Count the frequency of each character in the input data.
2.Build a Priority Queue: Create a min-heap (priority queue) where each node represents
a character and its frequency.
3.Construct a Huffman Tree:
•Take the two nodes with the smallest frequencies.
•Merge them into a new node with a frequency equal to the sum of the two.
•Repeat until only one node (the root of the tree) remains.
4.Generate Codes:
•Assign 0 to the left edge and 1 to the right edge of the tree.
•Traverse the tree to assign unique binary codes to each character.
5.Encode the Data: Replace each character with its corresponding binary code.
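The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original material; the function name huffman_codes and the tuple-based tree representation are choices made here.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for the characters of `text` and return a
    dict mapping each character to its binary codeword."""
    freq = Counter(text)                                  # 1. frequency analysis
    # 2. priority queue; each entry is (frequency, tie-breaker, subtree),
    #    where a subtree is either a character or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                                  # 3. build the tree
        f1, _, left = heapq.heappop(heap)                 # two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def assign(node, prefix):                             # 4. generate codes
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")                 # left edge -> 0
            assign(node[1], prefix + "1")                 # right edge -> 1
        else:
            codes[node] = prefix or "0"
    assign(heap[0][2], "")
    return codes

message = "this is an example of huffman coding"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)            # 5. encode the data
```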
Example for designing Huffman Code
• The average length for this code (a five-letter source with probabilities 0.4, 0.2, 0.2, 0.1, and 0.1 and
codeword lengths 1, 2, 3, 4, and 4) is:

l̄ = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2 bits/symbol
• A measure of the efficiency of this code is its redundancy, the difference between the average length and
the entropy. In this case the entropy is 2.122 bits/symbol, so the redundancy is 2.2 − 2.122 = 0.078 bits/symbol.
Points to Remember:
1. Entropy: Entropy gives the minimum number of bits required to code a message, representing the
theoretical lower limit of how efficiently you can encode information from a given source without losing the
information.

• No coding scheme can encode information using fewer bits than the entropy of the source.

2. Huffman coding: A compression technique that assigns variable-length codes to symbols based on their
frequency, aiming to get as close as possible to the theoretical entropy.
3. Average code length: The average number of bits used to represent a symbol in a Huffman code.

4. Redundancy: The difference between the average length of a Huffman code and the entropy of the data
source it was generated for is called "redundancy". It measures the extra or repetitive information that can
be removed without losing essential data, and it determines how much a source can be compressed.

Key Relationship
Higher entropy → Lower redundancy → Less compression possible
Lower entropy → Higher redundancy → More compression possible
Example of Redundancy:

• Consider a source that includes letters from the alphabet A = {a1, a2, a3} with the probability
model P(a1) = 0.8, P(a2) = 0.02, and P(a3) = 0.18. The entropy for this source is 0.816
bits/symbol. A Huffman code for this source is shown in Table below:
Letter Codeword
a1 0
a2 11
a3 10

• The average length for this code is 1.2 bits/symbol. The difference between the average code
length and the entropy, or the redundancy, for this code is 0.384 bits/symbol, which is 47% of the
entropy. This means that to code this sequence we would need 47% more bits than the minimum
required.
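These numbers are easy to verify directly. The short Python check below uses the probabilities and codeword lengths from the table above:

```python
from math import log2

probs = {"a1": 0.8, "a2": 0.02, "a3": 0.18}
lengths = {"a1": 1, "a2": 2, "a3": 2}        # codeword lengths from the table

entropy = -sum(p * log2(p) for p in probs.values())   # ~0.816 bits/symbol
avg_len = sum(probs[s] * lengths[s] for s in probs)   # 1.2 bits/symbol
redundancy = avg_len - entropy                        # ~0.384 bits/symbol
print(entropy, avg_len, redundancy, redundancy / entropy)   # ratio ~0.47
```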
• As another example, consider a source of 100 characters in which six distinct characters occur with
frequencies 5, 9, 12, 13, 16, and 45. Without Huffman coding, each character needs 1 byte = 8 bits, so the
total number of bits = 8×100 = 800.
• Using Huffman encoding, the total number of bits needed is 5×4 + 9×4 + 12×3 +
13×3 + 16×3 + 45×1 = 224. Bits saved = 800 − 224 = 576.
Minimum Variance Huffman Codes

• Minimum Variance Huffman Code is a variation of Huffman coding that aims to reduce the
variance in codeword lengths.
• In standard Huffman coding, the difference between the longest and shortest codeword lengths
can be significant, leading to inefficiencies in certain applications. The minimum variance
Huffman code (MVHC) minimizes this difference, making the code lengths more balanced.
• More Balanced Encoding: Reduces the difference between the shortest and longest
codewords.
• Efficient Memory Access: Helps in applications where uniform codeword lengths improve
processing efficiency.
• Better Decoding Performance: Reduces decoding time by avoiding long codewords.
• By performing the sorting procedure in a slightly different manner, we could have found a
different Huffman code.
Key points about minimum variance Huffman coding:
Merging strategy:
• The two least probable symbols are still combined at each step, exactly as in standard Huffman coding.
To achieve minimum variance, however, the combined symbol is placed as high as possible in the sorted
list (above existing entries with the same probability), which pushes the merged node higher in the tree
and leads to more balanced codeword lengths.
Benefit:
• Every Huffman code for a given source has the same average length, so minimum variance Huffman
coding does not change the compression ratio. What it reduces is the variance of the codeword lengths,
which matters, for example, when the encoded stream is sent over a fixed-rate channel: a smaller variance
means smaller buffer requirements and a more even output rate.
Implementation:

• The core algorithm for generating a minimum variance Huffman code is similar to a standard
Huffman algorithm, but with a modified merging strategy that focuses on minimizing the variance
of the resulting codeword lengths.
How it works:

Calculate symbol probabilities:

Like in regular Huffman coding, the first step is to determine the probability of each symbol
appearing in the data.

Combine the two least probable entries:

When merging symbols to create internal nodes in the Huffman tree, always combine the two entries
with the lowest probabilities, exactly as in standard Huffman coding.

Place the merged symbol higher:

When re-sorting the list, position the newly created combined symbol as high as possible among the
entries with the same probability, so that it ends up higher in the tree than the individual
symbols it was merged from.
Design a minimum variance Huffman code for a source that puts out letters from an alphabet
A = {a1, a2, a3, a4, a5, a6} with P(a1) = P(a2) = 0.2, P(a3) = 0.25,
P(a4) = 0.05, P(a5) = 0.15, P(a6) = 0.15.
Find the entropy of the source, the average length of the code, and the efficiency. Also comment on the
difference between a Huffman code and a minimum variance Huffman code.

The probabilities for each character are arranged in descending order and, by using
minimum variance Huffman coding, we obtain the following Huffman tree.
Note: If we solve this by creating a binary tree
considering the left branch as 0 and the right branch as 1, the
codewords will change from the ones shown in the previous
table in such a way that every 0 becomes 1 and every 1
becomes 0 in each codeword. The codeword
lengths, however, remain the same.

LETTER CODEWORD
𝑎1 00
𝑎2 111
𝑎3 10
𝑎4 010
𝑎5 110
𝑎6 011
Coding Efficiency (%)=(Entropy / Average Length of Code) * 100
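For the six-letter source in this question, the requested quantities can be computed directly from the codeword lengths in the table above (a small Python check; the lengths are read off the table, not derived here):

```python
from math import log2

P = {"a1": 0.2, "a2": 0.2, "a3": 0.25, "a4": 0.05, "a5": 0.15, "a6": 0.15}
L = {"a1": 2, "a2": 3, "a3": 2, "a4": 3, "a5": 3, "a6": 3}   # from the table

H = -sum(p * log2(p) for p in P.values())     # entropy ~2.466 bits/symbol
avg = sum(P[s] * L[s] for s in P)             # average length = 2.55 bits/symbol
print(H, avg, 100 * H / avg)                  # efficiency ~96.7 %
```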
Ques. Create a binary tree using minimum variance Huffman coding for the same table
used for Huffman coding given below:

Follow the same process as done in Huffman coding, but while sorting the letters give priority to the
two least probable letters which are merged: place the merged entry as high as possible among the
already existing letters while maintaining the descending order.
The average length of the code is:

l̄ = 0.4×2 + 0.2×2 + 0.2×2 + 0.1×3 + 0.1×3 = 2.2 bits/symbol


Optimality of Huffman Coding
Length of Huffman Codes
• The length of any code will depend on a number of things, including the size of the alphabet and the
probabilities of individual letters.
• The optimal code for a source, or the Huffman code for the source, has an average code length l̄
bounded below by the entropy and bounded above by the entropy plus 1 bit. In other words,

H ≤ l̄ < H + 1, where H is the entropy of the source.
• McMillan's result states that if we have a uniquely decodable code with K codewords of lengths l_i,
for i = 1 to K, then the following inequality holds:

∑ 2^(−l_i) ≤ 1, where the sum runs over i = 1 to K.

Examining the code generated in the table above, the lengths of the codewords
are 1, 2, 3, 4, 4. Substituting these values into the left-hand side of the inequality:

2^(−1) + 2^(−2) + 2^(−3) + 2^(−4) + 2^(−4) = 0.5 + 0.25 + 0.125 + 0.0625 + 0.0625 = 1 ≤ 1,

which satisfies the Kraft-McMillan inequality.


Therefore, we can write the redundancy as the difference between the average length and the entropy H of the
source:

Redundancy = l̄ − H
Extended Huffman Coding
• While regular Huffman coding assigns unique codes to individual symbols in a data set, extended
Huffman coding improves compression efficiency by grouping several symbols together to create new,
larger symbols, essentially encoding sequences of the original symbols instead of each one individually.
This can give better compression when the probability distribution is highly skewed, and it is
achieved by creating a new, "extended" alphabet made up of these grouped symbols.
• Extended Huffman coding is an enhancement of standard Huffman coding where multiple source symbols
are grouped together before encoding. This approach often improves efficiency by reducing the average
codeword length relative to entropy
• Extended Huffman can be particularly useful when the probability distribution of symbols is very uneven,
where grouping symbols together can better represent their combined probability and lead to shorter
codewords.
• Standard Huffman coding sometimes results in an average codeword length significantly larger than the
entropy. By grouping symbols into blocks, we get a better match with the entropy.
• Extended Huffman coding takes groups of symbols together, essentially creating a larger alphabet, to
generate codes, potentially achieving better compression by exploiting dependencies between symbols
within a sequence, rather than treating each symbol independently; this approach can lead to more efficient
coding when there are patterns or correlations in the data
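As a sketch of the idea, the extended alphabet and its probability model can be generated as follows, assuming the source symbols are independent so that pair probabilities are products of the individual probabilities; an ordinary Huffman code is then built over the pairs (illustrative Python, not from the source text):

```python
from itertools import product

P = {"a1": 0.8, "a2": 0.02, "a3": 0.18}      # original alphabet and probabilities

# Extended alphabet of all ordered pairs; each pair gets the product probability.
P2 = {x + y: P[x] * P[y] for x, y in product(P, repeat=2)}
for pair, prob in sorted(P2.items(), key=lambda kv: -kv[1]):
    print(pair, prob)
# A standard Huffman code built on P2 then encodes two original symbols per codeword.
```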
Ques: Consider a source that puts out letters from the alphabet A = {a1, a2, a3} with the
probability model P(a1) = 0.8, P(a2) = 0.02, and P(a3) = 0.18. The entropy for this source
is 0.816 bits/symbol. Find the average code length and redundancy using both Huffman and
Extended Huffman Coding (Take 2 letters together).

1) Using Huffman Coding Method:

• The average length for this code is 1.2 bits/symbol. The difference between the average code
length and the entropy, or the redundancy, for this code is 0.384 bits/symbol, which is 47% of the
entropy.
2) Using Extended Huffman Coding Method

• For the source described in the previous example, instead of generating a codeword for every
symbol, we will generate a codeword for every two symbols.
• If we look at the source sequence two at a time, the number of possible symbol pairs, or size of the
extended alphabet, is 3^2= 9. The extended alphabet, probability model, and Huffman code for this
example are shown in Table.
Letters Probability Codeword
a1a1 0.64 1
a1a3 0.144 00
a3a1 0.144 011
a3a3 0.0324 0100
a1a2 0.016 01010
a2a1 0.016 010111
a2a3 0.0036 0101100
a3a2 0.0036 01011011
a2a2 0.0004 01011010
The average codeword length for this extended code is 1.7228 bits/symbol. However, each symbol in the
extended alphabet corresponds to two symbols from the original alphabet. Therefore, in terms of the
original alphabet, the average codeword length is 1.7228/2 = 0.8614 bits/symbol. The redundancy is
about 0.045 bits/symbol, which is only about 5.5% of the entropy.
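The quoted figures follow directly from the table; a quick check, using the probabilities and codeword lengths listed above:

```python
pairs = {                                # (probability, codeword length) from the table
    "a1a1": (0.64, 1), "a1a3": (0.144, 2), "a3a1": (0.144, 3),
    "a3a3": (0.0324, 4), "a1a2": (0.016, 5), "a2a1": (0.016, 6),
    "a2a3": (0.0036, 7), "a3a2": (0.0036, 8), "a2a2": (0.0004, 8),
}
avg_pair = sum(p * l for p, l in pairs.values())   # 1.7228 bits per pair
avg_symbol = avg_pair / 2                          # 0.8614 bits per original symbol
print(avg_pair, avg_symbol, avg_symbol - 0.816)    # redundancy ~0.045 bits/symbol
```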
Adaptive Huffman Coding
• Standard Huffman coding requires knowledge of the probabilities of the source sequence. If this
knowledge is not available, Huffman coding becomes a two-pass procedure: the statistics are collected in the
first pass, and the source is encoded in the second pass. Adaptive Huffman coding converts this into a
one-pass procedure.
• Faller and Gallager independently developed adaptive algorithms to construct the Huffman code based on the
statistics of the symbols already encountered. These were later improved by Knuth, so the method is also
referred to as the FGK algorithm. Another implementation was suggested by Vitter, and that is the one
discussed here.
• Adaptive Huffman Coding is a dynamic variant of Huffman coding that updates the Huffman tree as data is
processed. Unlike static Huffman coding, which requires a prior frequency analysis of symbols,
Adaptive Huffman coding modifies the tree structure during encoding and decoding procedure.
• Adaptive Huffman Coding maintains and updates a Huffman tree while processing a stream of data. The tree
is adjusted dynamically as symbols are read, ensuring that frequently occurring symbols get shorter codes
over time.
• In the adaptive Huffman coding procedure, neither transmitter nor receiver knows anything about the
statistics of the source sequence at the start of transmission. It can provide better compression results. It can
be used to build a code as the symbols are being transmitted.
Key points:
• It involves three procedures: a) Encoding b) Updating c) Decoding
• Update procedure is called after each symbol is encoded or decoded to update the Huffman tree. The
codeword for a symbol can be obtained by traversing the tree from the root to the leaf corresponding to
the symbol, where 0 corresponds to a left branch and 1 corresponds to a right branch.
• External Nodes: The Huffman code can be described in terms of a binary tree. The squares denote the
external nodes or leaves that corresponds to the symbols in the source alphabet.
• Internal Nodes: The internal nodes are represented by oval that further has branches or child nodes.
• Weight: The weight of each node is simply the number of times the symbol corresponding to the leaf
node has been encountered or number of times that node has been traversed in algorithm. The weight of
each node, is written as a number inside the node. The weight of each internal node is the sum of the
weights of its offspring.
• Node Number: It refers to unique identifier assigned to each node in the Huffman tree, typically ordered
from top to bottom and right to left (decreasing order). When updating the tree due to new symbol
occurrences, the node numbering helps to preserve the order of nodes, making it easier to identify which
nodes need to be adjusted.
• Nodes are numbered based on their position in the tree, with lower-level nodes getting smaller
numbers and nodes on the same level being numbered from right to left in descending order.
• The node number depends on the size of the alphabet, and the maximum node number is given by (2N−1),
where N denotes the alphabet size (e.g., N = 26 for the 26-letter English alphabet).
• Sibling Property: The nodes y_{2j−1} and y_{2j} are offspring of the same parent node, or siblings, for
1 ≤ j < n, and the node number of the parent node is greater than the node numbers of y_{2j−1} and y_{2j}.
• NYT Node: A special "Not Yet Transmitted" (NYT) node is used to handle new symbols
encountered during encoding that are not yet present in the tree.
➢Initially, the tree at both the encoder and decoder consists of a single node, the not
yet transmitted (NYT) node. The number of this node is the maximum node number, that is
2m − 1, where m is the size of the alphabet. Its weight is zero.
➢it's a special node within the Huffman tree that represents any symbol which hasn't been encountered
yet during the encoding process, allowing the code to adapt to new symbols as they appear in the data
stream without requiring prior knowledge of the symbol frequencies.
➢Block: The set of nodes with the same weight makes up a block. It is used in Update Procedure for
updating the Huffman tree.
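A possible in-memory representation of such a tree node is sketched below; the field names are illustrative choices, not prescribed by the algorithm:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of the adaptive Huffman tree."""
    weight: int = 0                   # number of times this symbol/subtree was seen
    number: int = 0                   # node number; the root keeps the largest (2N-1)
    symbol: Optional[str] = None      # set only for external (leaf) nodes
    parent: Optional["Node"] = None
    left: Optional["Node"] = None     # branch labelled 0
    right: Optional["Node"] = None    # branch labelled 1

# Initially the tree is just the NYT node with weight 0 and number 2N-1 (51 for N = 26).
nyt = root = Node(weight=0, number=51)
```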
UPDATE PROCEDURE
• The update procedure in Adaptive Huffman Coding ensures that the Huffman tree dynamically adjusts as new
symbols are encountered. This allows encoding and decoding to be done in a single pass without prior
knowledge of symbol frequencies.
• The update procedure requires that the nodes be in a fixed order. This ordering is preserved by numbering the
nodes.
• The largest node number is given to the root of the tree, and the smallest number is assigned to the NYT node.
• The numbers from the NYT node to the root of the tree are assigned in increasing order from left to right, and
from lower level to upper level.
• The function of the update procedure is to preserve the sibling property. In order that the update procedures at
the transmitter and receiver both operate with the same information, the tree at the transmitter is updated after
each symbol is encoded, and the tree at the receiver is updated after each symbol is decoded.
• If the symbol to be encoded or decoded has occurred for the first time, a new external node is assigned to the
symbol and a new NYT node is appended to the tree. Both the new external node and the new NYT node are
offsprings of the old NYT node.
• We increment the weight of the new external node by one. As the old NYT node is the parent of the new
external node, we increment its weight by one and then go on to update all the other nodes until we reach the
root of the tree.
If the symbol to be encoded or decoded already has an external node in the tree, we start the update at
that node. Before incrementing the weight of a node, we check whether it has the largest node number in
its block (the set of nodes with the same weight). If it does not, it is swapped with the node that has the
largest node number in the block, unless that node is its parent. The weight is then incremented, and the
procedure moves to the parent node, repeating these steps until the root is reached.

• Handling New Symbols (NYT Mechanism)


1.If the symbol is new:
1. Replace the NYT node with two child nodes:
1. A new NYT node (for future symbols).
2. A leaf node for the new symbol with frequency 1.
2. Assign appropriate binary codes to the nodes.
2.Update the tree after adding the new symbol (follow the update
steps shown in flowchart).
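The weight-increment-and-swap loop described above can be sketched as follows. This is a simplified illustration: highest_numbered_in_block and swap are hypothetical helpers that find the largest-numbered node with a given weight and exchange two subtrees together with their node numbers.

```python
def update(tree, node):
    """Walk from the just-changed node up to the root, restoring the
    sibling property after one symbol has been encoded or decoded."""
    while node is not None:
        # Node with the largest number among nodes of the same weight (its block).
        leader = tree.highest_numbered_in_block(node.weight)
        if leader is not node and leader is not node.parent:
            tree.swap(node, leader)       # exchange positions and node numbers
        node.weight += 1                  # then increment the weight
        node = node.parent                # and continue towards the root
```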
Example : Update procedure
• Assume we are encoding the message [a a r d v], where our alphabet consists of the 26
lowercase letters of the English alphabet.
• The updating process is shown in Figure 3.7. We begin with only the NYT node. The total
number of nodes in this tree will be 2×26−1 = 51, so we start numbering backwards from
51 with the number of the root node being 51. The first letter to be transmitted is a.
• As a does not yet exist in the tree, then add a to the tree. The NYT node gives birth to a
new NYT node and a terminal node corresponding to a.
• The weight of the terminal node will be higher than the NYT node, so we assign the
number 49 to the NYT node and 50 to the terminal node corresponding to the letter a.
• The second letter to be transmitted is also a. The node corresponding to a has the highest
number in its block (if we do not consider its parent), so we do not need to swap nodes. The next letter
to be transmitted is r. This letter does not have a corresponding node in the tree, so the
NYT node gives birth to a new NYT node and an external node corresponding to r.
Encoding Procedure
• Initially, the tree at both the encoder and decoder consists of a single node, the NYT node.
• As transmission progresses, nodes corresponding to symbols transmitted will be added to the tree,
and the tree is reconfigured using an update procedure. Before the beginning of transmission, a fixed
code for each symbol is agreed upon between transmitter and receiver. A simple (short) code is as
follows:
• If the source has an alphabet a1, a2, a3, …, am of size m, then pick e and r such that:
m = 2^e + r and 0 ≤ r < 2^e.
Two important encoding rules:
a. The letter a_k is encoded as the (e+1)-bit binary representation of k−1, if 1 ≤ k ≤ 2r;
b. else, a_k is encoded as the e-bit binary representation of k−r−1.
• For example, suppose m = 26, then e = 4, and r = 10. The symbol 𝑎1 is encoded as 00000, the symbol
𝑎2 is encoded as 00001, and the symbol 𝑎22 is encoded as 1011.
• When a symbol is encountered for the first time, the code for the NYT node is transmitted, followed
by the fixed code for the symbol. A node for the symbol is then created, and the symbol is taken out
of the NYT list.
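This fixed code is easy to compute. The sketch below (the function name is chosen here for illustration) reproduces the codes quoted above for a1, a2, and a22:

```python
def fixed_code(k, m=26):
    """Fixed (initial) code for the k-th letter of an m-letter alphabet (1-based),
    using m = 2**e + r with 0 <= r < 2**e."""
    e = m.bit_length() - 1            # largest e with 2**e <= m
    r = m - (1 << e)
    if 1 <= k <= 2 * r:
        return format(k - 1, "0{}b".format(e + 1))   # (e+1)-bit code of k-1
    return format(k - r - 1, "0{}b".format(e))       # e-bit code of k-r-1

print(fixed_code(1), fixed_code(2), fixed_code(22))  # 00000 00001 1011
```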
Example: Encode: aardva

• As the size of the alphabet is m = 26, we get e = 4 and r = 10 using the equation m = 2^e + r.


Step 1: Start and read symbol ‘a’
• As this is the first occurrence, check the index value of the symbol, which is k = 1. Now check the encoding
rules: as 1 ≤ k ≤ 2r, we encode the (e+1)-bit binary representation of k−1.
(e+1) gives the width, or length, of the codeword used to encode the letter ‘a’:
e+1 = 5, so the 5-digit binary representation of k−1 = 0 is 00000.
• The Huffman tree is then updated as shown in the figure. The NYT node gives birth to an
external node corresponding to the element a and a new NYT node. As ‘a’ has occurred once,
the external node corresponding to ‘a’ has a weight of one. The weight of the NYT node is zero.
The internal node also has a weight of one, as its weight is the sum of the weights of its
offspring.
Step 2: Read Next Symbol ‘a’
• The next symbol is again a. As we have an external node corresponding to symbol a, we
simply traverse the tree from the root node to the external node corresponding to a in order to
find the codeword.
• This traversal consists of a single right branch. Therefore, the Huffman code for the symbol a
is 1.
• Again, call Update procedure to update the Huffman tree After the code for a has been
transmitted, the weight of the external node corresponding to a is incremented, as is the weight
of its parent.
Step 3: Read Next Symbol ‘r’
• As this is the first appearance of this symbol, we send the code for the NYT node followed by the
index value of ‘r’ in binary form using the encoding rule.
• If we traverse the tree from the root to the NYT node, we get a code of 0 for the NYT node. The
letter r is the 18th letter of the alphabet;
To encode, check the rule: 1 ≤ k ≤ 2r is satisfied, so the code will be the binary representation of
k−1, which is 17, using e+1 bits, i.e. 5 bits.
Therefore, the binary code for r is 10001. The code for the symbol r becomes the NYT code
appended with the code for r, i.e. 010001.
• The tree is again updated as shown in the figure,
Step 4: Read Next Symbol ‘d’
Using the same procedure for d, (first appearance of symbol) the code for the NYT
node, which is now 00, is sent, followed by the index for d, resulting in the
codeword 0000011. Then update the Huffman tree using update procedure.
Step 5: Read Next Symbol ‘v’
The next symbol v is the 22nd symbol in the alphabet. As this is greater than 20, we
send the code for the NYT node followed by the 4-bit binary representation of 22−10−1
= 11. The code for the NYT node at this stage is 000, and the 4-bit binary representation
of 11 is 1011; therefore, v is encoded as 0001011.
Step 6: Read Next Symbol ‘a’
The next symbol is a, for which the code is now 0, and the encoding proceeds in the same way for any
remaining letters.
Decoding Procedure for Adaptive Huffman Coding

• The flowchart for the decoding procedure is shown in Figure. As we read in the received
binary string, we traverse the tree in a manner identical to that used in the encoding
procedure.
• Once a leaf is encountered, the symbol corresponding to that leaf is decoded. If the leaf is the
NYT node, then we check the next e bits to see if the resulting number is less than r.
• If it is less than r, we read in another bit to complete the code for the symbol, and the index for the
symbol is obtained by adding one to the decimal value of the resulting (e+1)-bit string. Otherwise, the
index is obtained by adding r+1 to the decimal value of the e-bit string.
• Once the symbol has been decoded, the tree is updated and the next received bit is used to
start another traversal down the tree.
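The part of the decoder that handles the fixed code after the NYT node is reached can be sketched as below (a simplified illustration; bits is assumed to be a string of '0'/'1' characters, indices are 1-based, and e = 4, r = 10 for a 26-letter alphabet):

```python
def decode_fixed(bits, pos, e=4, r=10):
    """Decode the fixed code that follows an NYT codeword.
    Returns (symbol_index, position of the next unread bit)."""
    value = int(bits[pos:pos + e], 2)            # read the first e bits
    if value < r:                                # short range: read one more bit
        value = int(bits[pos:pos + e + 1], 2)
        return value + 1, pos + e + 1            # index = (e+1)-bit value + 1
    return value + r + 1, pos + e                # index = e-bit value + r + 1

print(decode_fixed("000001", 0))                 # (1, 5): the letter a
```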
Example : Decoding procedure
The binary string generated by the encoding procedure is: 0000010100010000011

• Initially, the decoder tree consists only of the NYT node. Therefore, the first symbol to be
decoded must be obtained from the NYT list. We read in the first 4 bits, 0000, as the value
of e is four. The 4 bits 0000 correspond to the decimal value of 0.
• As this is less than the value of r, which is 10, we read in one more bit for the entire code
of 00000. Adding one to the decimal value corresponding to this binary string, we get the
index of the received symbol as 1. This is the index for a; therefore, the first letter is
decoded as a. The tree is now updated as shown in Figure.
• The next bit in the string is 1. This traces a path from the root node to the external node
corresponding to a. We decode the symbol a and update the tree. In this case, the update
consists only of incrementing the weight of the external node corresponding to a.
• The next bit is a 0, which traces a path from the root to the NYT node. The next 4 bits, 1000,
correspond to the decimal number 8, which is less than 10, so we read in one more bit to get the 5-bit
word 10001. The decimal equivalent of this 5-bit word plus one is 18, which is the index for r. We
decode the symbol r and then update the tree.

• The next 2 bits, 00, again trace a path to the NYT node. We read the next 4 bits, 0001. Since this
corresponds to the decimal number 1, which is less than 10, we read another bit to get the 5-bit word
00011. To get the index of the received symbol in the NYT list, we add one to the decimal value of this
5-bit word. The value of the index is 4, which corresponds to the symbol d.
GOLOMB CODES
• A Golomb code is a lossless data compression technique that efficiently encodes positive
integers; it is particularly useful when the data to be encoded are integers for which the
larger an integer is, the lower its probability of occurrence.
• It works by splitting an integer into a quotient (coded with a unary code) and a remainder
(coded with a fixed-length code), making it ideal for compressing data where large values
are less common, like image prediction residuals in image compression algorithms like
JPEG.
• Golomb codes represent integers using two parts: Quotient: This is the result of dividing
the number by a divisor m & Remainder: This is the remainder when dividing the number
by m.
• The number n is encoded as a combination of binary quotient and remainder.
• The key parameter in a Golomb code is “m" which determines how the integer is split
into quotient and remainder.
• The Golomb code is actually a family of codes parameterized by an integer m > 0. In the
Golomb code with parameter m, we represent an integer n > 0 using two numbers q and r,
where q is the quotient (the integer division of n by m) and r is the remainder.

(A) Encode the quotient (q) using “unary encoding”:


• Unary encoding represents a number q as a sequence of q ones followed by a zero. For
example:
q = 0 is encoded as 0.
q = 1 is encoded as 10.
q = 2 is encoded as 110.
q = 3 is encoded as 1110.
(B) Encode the remainder (r) to form the fixed code using the following binary encoding rules:
i) If m is a power of two, we use the log2(m)-bit binary representation of r.
ii) Else, if m is not a power of two, then
a) use the ⌊log2 m⌋-bit binary representation of r for the
first 2^⌈log2 m⌉ − m values,
b) and use the ⌈log2 m⌉-bit binary representation of r + 2^⌈log2 m⌉ − m
for the rest of the values.

C) The final Golomb code for the number n is the concatenation of: The unary-
encoded quotient q. The binary-encoded remainder r (fixed code).
Assume we have m = 5 and we want to encode n = 12:
•Step 1: Divide 12 by 5. We get a quotient of 2 and a remainder of 2.

•Step 2: Encode the quotient 2 in unary: 110.

•Step 3: Encode the remainder 2 in binary using ⌊log₂(5)⌋ = 2 bits (since r = 2 < 2³ − 5 = 3): 10.

•Step 4: Combine the unary and binary encodings: 110 (quotient) + 10 (remainder) = 11010.

So, the Golomb code for 12 with m = 5 is 11010.
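A compact sketch of the full encoding rule, including the case where m is not a power of two, is given below (golomb_encode is a name chosen here for illustration):

```python
def golomb_encode(n, m):
    """Golomb code for an integer n >= 0 with parameter m > 0, following the
    quotient/remainder rules described above."""
    q, r = divmod(n, m)
    unary = "1" * q + "0"                 # unary part: q ones followed by a zero
    b = (m - 1).bit_length()              # b = ceil(log2(m))
    cutoff = (1 << b) - m                 # first 2**b - m remainders use b-1 bits
    if r < cutoff:                        # (cutoff is 0 when m is a power of two)
        return unary + format(r, "0{}b".format(b - 1))
    return unary + format(r + cutoff, "0{}b".format(b))

print(golomb_encode(12, 5))               # '11010', matching the example above
```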


Rice Codes

• The Rice code was originally developed by Robert F. Rice (he called it the Rice machine) and
later extended by Pen-Shu Yeh and Warner Miller.
• The Rice code can be viewed as an adaptive Golomb code. In the Rice code, a sequence of
nonnegative integers (which might have been obtained from the preprocessing model) is
divided into blocks of size J.
• Each block is then coded using one of several options, most of which are a form of Golomb
codes.
• Each block is encoded with each of these options, and the option resulting in the least number
of coded bits is selected.
• The particular option used is indicated by an identifier attached to the code for each block. The
easiest way to understand the Rice code is to examine one of its implementations.
• The implementation of the Rice code examined here is the one used in the recommendation for lossless
compression from the Consultative Committee for Space Data Systems (CCSDS).
CCSDS Recommendation of Rice Codes for Lossless
Compression
• The Rice algorithm consists of a preprocessor (the modeling step) and a binary coder (coding
step). The preprocessor removes correlation from the input and generates a sequence of non-
negative integers.
• This sequence has the property that smaller values are more probable than larger values. The
binary coder generates a bitstream to represent the integer sequence. The binary coder is our
main focus at this point.
• The preprocessor functions as follows: Given a sequence {y_i}, for each y_i we generate a
prediction ŷ_i. A simple way to generate a prediction would be to take the previous value of the
sequence as the prediction of the current value:
ŷ_i = y_{i−1}
• We then generate a sequence whose elements are the differences between y_i and its predicted
value ŷ_i:
d_i = y_i − ŷ_i
• Let y_max and y_min be the largest and smallest values that the sequence {y_i} takes on.
• It is reasonable to assume that the value of ŷ_i will be confined to the range [y_min, y_max]. Define
T_i = min(y_max − ŷ_i, ŷ_i − y_min).
• The sequence d_i can be converted into a sequence of nonnegative integers x_i using the following
mapping:
x_i = 2·d_i         if 0 ≤ d_i ≤ T_i
x_i = 2·|d_i| − 1   if −T_i ≤ d_i < 0
x_i = T_i + |d_i|   otherwise
• The value of 𝑥𝑖 will be small whenever the magnitude of 𝑑𝑖 is small. Therefore, the value of 𝑥𝑖 will
be small with higher probability.
• The sequence x_i is then divided into segments, with each segment being further divided into blocks
of size J. It is recommended by CCSDS that J have a default value of 16.
• Each block is then coded using one of the following options. The coded block is transmitted
along with an identifier that indicates which particular option was used.
1) Fundamental sequence: This is a unary code. A number ‘n’ is represented by a sequence of
n 0s followed by a 1 (or a sequence of ‘n’ 1s followed by a 0).
2) Split sample options:
▪ These options consist of a set of codes indexed by a parameter m.
▪ The code for a k-bit number ‘n’ using the mth split sample option consists of the ‘m’ least
significant bits of ‘n’ followed by a unary code representing the (k − m) most significant
bits.
▪ For example, suppose we wanted to encode the 8-bit number 23 using the third split sample
option.
▪ The 8-bit representation of 23 is 00010111. The three least significant bits are 111. The
remaining bits (00010) correspond to the number 2, which has the unary code 001. Therefore,
the code for 23 using the third split sample option is 111001.
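A small illustrative sketch of this option (the unary part uses the zeros-then-one form of the fundamental sequence):

```python
def split_sample(n, m, k=8):
    """m-th split sample option for a k-bit number n: the m least significant
    bits of n, followed by a unary code for the remaining most significant bits."""
    lsb = format(n, "0{}b".format(k))[-m:]     # m least significant bits
    msb_value = n >> m                         # value of the remaining k-m bits
    return lsb + "0" * msb_value + "1"         # unary: msb_value zeros, then a one

print(split_sample(23, 3))                     # '111001', as in the example above
```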
Tunstall Codes:
• Most of the coding methods encode letters from the source alphabet using codewords with
varying numbers of bits: codewords with fewer bits for letters that occur more frequently
and codewords with more bits for letters that occur less frequently.
• Tunstall coding builds a fixed-length codebook from the source symbols, unlike Huffman
coding, which typically generates variable-length codes.
• In Tunstall coding, variable-length strings of source symbols are encoded into fixed-length codewords.
• The codebook used for this is built from the source's statistics (symbol frequencies).
• The key idea is to replace each string of source symbols with a codeword from the codebook.
• The main advantage of a Tunstall code is that errors in codewords do not propagate, unlike
other variable-length codes, such as Huffman codes, in which an error in one codeword will
cause a series of errors to occur.
Implementation of Tunstall Codes
• Suppose we want an n-bit Tunstall code for a source alphabet of size N.
• We start with the N letters of the source alphabet in the codebook.
• Remove the entry in the codebook that has the highest probability and add the N strings obtained by
concatenating this letter with every letter in the alphabet (including itself).
• This will increase the size of the codebook from N to N + (N −1).
• The probabilities of the new entries will be the product of the probabilities of the letters concatenated to
form the new entry.
• Now look through the N + (N −1) entries in the codebook and find the entry that has the highest
probability, keeping in mind that the entry with the highest probability may be a concatenation of
symbols.
• Each time we perform this operation we increase the size of the codebook by N − 1. Therefore, this
operation can be performed K times, where

N + K(N − 1) ≤ 2^n
Ques: design a 3-bit Tunstall code for a memoryless source with the following
alphabet:= {A B C} P(A )= 0.6 P(B) = 0.3 P(C) = 0.1

Step 1: Arrange the letters in descending order of their probabilities.


Remove the letter with the highest probability, i.e. ‘A’, and make the pairs AA, AB, AC (leaving
the other letters as they are). The number of iterations K is calculated using N + K(N − 1) ≤ 2^n;
with N = 3 and n = 3 bits, K will be 2.
Step 2: Again remove the letter with highest probability from the codebook which is AA and make the
pairs with A, B, C leaving the remaining letters in the codebook unaltered.
• Step 3 Assign the codewords (binary codes) to the letters in the codebook obtained
in step 2 by numbering them in increasing order from (0,1,2,3…). The final 3-bit
Tunstall code is shown in Table below.
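A short sketch of the codebook construction is given below; assigning codewords to entries in order of decreasing probability is a choice made here, since any fixed numbering works:

```python
def tunstall(probs, n_bits):
    """Build an n-bit Tunstall codebook for the given symbol probabilities."""
    book = dict(probs)                                  # start with the single letters
    while len(book) + len(probs) - 1 <= 2 ** n_bits:    # room for one more expansion?
        best = max(book, key=book.get)                  # highest-probability entry
        p = book.pop(best)
        for sym, q in probs.items():                    # replace it by its extensions
            book[best + sym] = p * q
    entries = sorted(book, key=book.get, reverse=True)
    return {s: format(i, "0{}b".format(n_bits)) for i, s in enumerate(entries)}

print(tunstall({"A": 0.6, "B": 0.3, "C": 0.1}, 3))      # 7 entries, codes 000..110
```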
Applications of Huffman Coding

1) Lossless Image Compression


• Huffman coding is a widely used algorithm in lossless data compression, and it has a significant
application in lossless image compression. In the context of image compression, Huffman coding is
typically used to represent the image's pixel values or certain image data more efficiently, reducing the
overall size of the image without losing any information.
• A simple application of Huffman coding to image compression would be to generate a Huffman code for
the set of values that any pixel may take. For monochrome images, this set usually consists of integers
from 0 to 255. Examples of such images are contained in the accompanying data sets.
• Huffman coding efficiently reduces the size of an image by assigning shorter codes to more frequent pixel values,
resulting in fewer bits for the same information.
• The original (uncompressed) image representation uses 8 bits/pixel. The image consists of 256 rows of
256 pixels, so the uncompressed representation uses 65,536 bytes. The compression ratio is simply the
ratio of the number of bytes in the uncompressed representation to the number of bytes in the
compressed representation. The number of bytes in the compressed representation includes the number
of bytes needed to store the Huffman code.
2. Text Compression

• Huffman coding plays a crucial role in text compression by reducing the amount of
space required to store or transmit data. It is a lossless compression algorithm that
works by assigning shorter codes to more frequent characters and longer codes to
less frequent ones.
• In text, we have a discrete alphabet that, in a given class, has relatively stationary
probabilities. For example, the probability model for a particular novel will not
differ significantly from the probability model for another novel.
• Huffman coding is used in databases for compressing large amounts of textual data.
When storing large text records (like customer information, product descriptions, or
logs), Huffman coding helps save storage space while keeping data retrieval intact.
3. Audio Compression
• Huffman coding is widely used in audio compression because it is an efficient way to represent
data by reducing the amount of space required to store or transmit it.
• Huffman coding uses variable-length codes to represent data. In the context of audio
compression, this means that more frequent audio data (like certain sound frequencies or
amplitudes) are represented with shorter codes, and less frequent data gets longer codes. This
allows the overall size of the audio file to be reduced because the most common values take up
less space.
• Audio files often contain redundant information, such as repeated patterns or sounds that appear
frequently. Huffman coding identifies these redundancies and replaces them with shorter codes,
effectively reducing the overall size of the file without sacrificing quality.
• Audio codecs (like MP3, FLAC, and AAC) often incorporate Huffman coding as a part of their
compression pipeline. It’s used in various steps of encoding audio, such as coding the quantized
frequency domain data or the bitstream that represents the audio. This ensures that the audio file
is compact, without any loss in quality or accuracy of the original sound.
