Huffman Coding
The idea behind Huffman coding is to compress the storage of
data using variable-length codes. Our standard model of storing
data uses fixed-length codes: for example, each character in a
text file is stored using 8 bits. There are certain advantages to
this system. When reading a file, we know to ALWAYS read 8 bits
at a time to read a single character. But as you might imagine,
this coding scheme is inefficient, because some characters are
used far more frequently than others. Say the character 'e' is
used 10 times more frequently than the character 'q'. It would
then be advantageous to use a 7-bit code for 'e' and a 9-bit
code for 'q' instead, because that could shorten our overall
message length.
Huffman coding finds the optimal way to take advantage of
varying character frequencies in a particular file. On average,
using Huffman coding on standard files can shrink them
anywhere from 10% to 30%, depending on the character
distribution. (The more skewed the distribution, the better
Huffman coding will do.)
The idea behind the coding is to give less frequent characters
longer codes and more frequent characters shorter ones. Also,
the coding is constructed in such a way that no codeword is a
prefix of any other codeword. This prefix-free property is
crucial for easily deciphering the code: as soon as the bits
read so far match a codeword, we know which character was
encoded, with no ambiguity. (If 'a' were coded 0 and 'b' were
coded 01, then after reading a 0 we could not tell whether we
had a complete 'a' or the start of a 'b'.)
Building a Huffman Tree
The easiest way to see how this algorithm works is to work
through an example. Let's assume that after scanning a file we
find the following character frequencies:
Character Frequency
'a' 12
'b' 2
'c' 7
'd' 13
'e' 14
'f' 85
Now, create a single-node binary tree for each character that
also stores the frequency with which the character occurs.
The algorithm is as follows: find the two binary trees in the list
whose roots store the minimum frequencies. Join these two
trees under a newly created common node that stores NO
character but stores the sum of the frequencies of all the
nodes connected below it. So our picture looks as follows:
     9      12 'a'    13 'd'    14 'e'    85 'f'
    / \
2 'b'  7 'c'
Now, repeat this process until only one tree is left:
        21
       /  \
      9    12 'a'    13 'd'    14 'e'    85 'f'
     / \
 2 'b'  7 'c'
        21             27
       /  \           /  \
      9    12 'a'  13 'd'  14 'e'    85 'f'
     / \
 2 'b'  7 'c'
             48
           /    \
         21      27
        /  \    /  \
       9  12 'a'  13 'd'  14 'e'    85 'f'
      / \
  2 'b'  7 'c'
              133
             /   \
           48     85 'f'
          /  \
        21    27
       /  \   / \
      9  12 'a'  13 'd'  14 'e'
     / \
 2 'b'  7 'c'
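Each merge step is exactly "pop the two minimum-frequency trees,
push their merged parent," which is what a min-heap (priority
queue) provides. Here is a minimal Python sketch of the
construction; the Node class and function names are our own
illustration, not from any particular library:

import heapq
import itertools

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq = freq   # frequency stored at this node
        self.char = char   # None for internal (merged) nodes
        self.left = left
        self.right = right

def build_huffman_tree(freqs):
    # freqs: dict mapping character -> frequency
    counter = itertools.count()  # tie-breaker so heapq never compares Nodes
    heap = [(f, next(counter), Node(f, c)) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # the two minimum-frequency trees
        f2, _, t2 = heapq.heappop(heap)
        merged = Node(f1 + f2, left=t1, right=t2)   # stores NO character
        heapq.heappush(heap, (merged.freq, next(counter), merged))
    return heap[0][2]   # the single remaining tree

root = build_huffman_tree({'a': 12, 'b': 2, 'c': 7, 'd': 13, 'e': 14, 'f': 85})
print(root.freq)   # 133, the total number of characters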
Once the tree is built, each leaf node corresponds to a letter
with a code. To determine the code for a particular letter, walk
the path from the root to the leaf in question. For each step to
the left, append a 0 to the code, and for each step to the right,
append a 1. Thus, for the tree above we get the following codes:
Letter Code
'a' 001
'b' 0000
'c' 0001
'd' 010
'e' 011
'f' 1
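Continuing the sketch above (reusing Node and root), the codes
can be read off the tree recursively, and a bit string can be
decoded by walking the tree from the root until a leaf is
reached. With these frequencies there are no ties, so this
sketch reproduces the table exactly; under different tie-breaking
the bit patterns could differ while the code lengths stay valid.

def assign_codes(node, prefix="", table=None):
    # left step appends '0', right step appends '1'
    if table is None:
        table = {}
    if node.char is not None:              # leaf: a character lives here
        table[node.char] = prefix or "0"   # degenerate one-character file
    else:
        assign_codes(node.left, prefix + "0", table)
        assign_codes(node.right, prefix + "1", table)
    return table

def decode(bits, root):
    # characters sit only at leaves, so no codeword is a prefix of another
    out, node = [], root
    for b in bits:
        node = node.left if b == "0" else node.right
        if node.char is not None:          # hit a leaf: one character decoded
            out.append(node.char)
            node = root
    return "".join(out)

codes = assign_codes(root)
print(codes)   # {'b': '0000', 'c': '0001', 'a': '001', 'd': '010', 'e': '011', 'f': '1'}
encoded = "".join(codes[ch] for ch in "fade")
print(encoded)                 # 1001010011
print(decode(encoded, root))   # fade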
Why are we guaranteed that one code is NOT the prefix of
another?
Find a set of valid Huffman codes for a file with the given
character frequencies:
Character Frequency
'a' 15
'b' 7
'c' 5
'd' 23
'e' 17
'f' 19
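(To check your answer, you can re-run the tree-building sketch
above with these frequencies and read off the resulting codes.)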
Calculating Bits Saved
All we need to do for this calculation is figure out how many
bits are originally used to store the data and subtract from that
how many bits are used to store the data using the Huffman
code.
In the first example given, since we have six characters, let's
assume each is stored with a three-bit code (the smallest fixed
length that can distinguish six characters). Since there are 133
characters in total, the total number of bits used is 3*133 = 399.
Now, using the Huffman codes and the character frequencies, we
can calculate the new total number of bits used:
Letter Code Frequency Total Bits
'a' 001 12 36
'b' 0000 2 8
'c' 0001 7 28
'd' 010 13 39
'e' 011 14 42
'f' 1 85 85
Total 238
Thus, we saved 399 - 238 = 161 bits, or about 40% of the storage
space. Of course, there is a small detail we haven't taken into
account here. What is that?
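As a quick check, the arithmetic above is easy to reproduce in a
few lines of Python (using the code table from the earlier sketch
and the same 3-bit fixed-length baseline):

freqs = {'a': 12, 'b': 2, 'c': 7, 'd': 13, 'e': 14, 'f': 85}
codes = {'a': '001', 'b': '0000', 'c': '0001', 'd': '010', 'e': '011', 'f': '1'}

total_chars = sum(freqs.values())                            # 133
fixed_bits = 3 * total_chars                                 # 399
huffman_bits = sum(freqs[c] * len(codes[c]) for c in freqs)  # 238
print(fixed_bits - huffman_bits)                             # 161 bits saved
print(f"{(fixed_bits - huffman_bits) / fixed_bits:.1%}")     # 40.4%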
Huffman Coding is an Optimal Prefix Code
Of all prefix codes for a file, Huffman coding produces an
optimal one. In all of our examples from class on Monday, we
found that Huffman coding saved us a fair percentage of
storage space. But we can show more: no other prefix code can
do better than Huffman coding.
First, we will show the following:
Let x and y be the two characters with the least frequencies in
a file with character set C. Then there exists an optimal prefix
code for C in which the codewords for x and y have the same
length and differ only in the last bit.
Here is how we will prove this: assume that a tree T stores an
optimal prefix code, and let a and b be the characters stored at
two sibling nodes at the maximum depth of the tree. We will show
that we can create a tree T' with x and y as siblings at the
maximum depth such that the number of bits used for the coding
with T' is the same as with T. (Let f(a) denote the frequency of
the character a. Without loss of generality, assume f(x) ≤ f(y)
and f(a) ≤ f(b). Since x and y have the least frequencies, it
also follows that f(x) ≤ f(a) and f(y) ≤ f(b). Let h be the
height of the tree T, which is the depth of a and b. Let x have
a depth of dx in T and y have a depth of dy in T.)
Create T' as follows: swap the nodes storing a and x, and then
swap the nodes storing b and y. Now the depth of x and y in T'
is h, the depth of a is dx, and the depth of b is dy.
Now, let's compare the number of bits used for the coding by
tree T with the number used by tree T'. (Note: since all other
codes remain unchanged, we only need to analyze the total
number of bits it takes to code a, b, x, and y.)
# bits for tree T (for a, b, x, and y) = h·f(a) + h·f(b) + dx·f(x) + dy·f(y)
# bits for tree T' (for a, b, x, and y) = dx·f(a) + dy·f(b) + h·f(x) + h·f(y)

Difference =
h·f(a) + h·f(b) + dx·f(x) + dy·f(y) - (dx·f(a) + dy·f(b) + h·f(x) + h·f(y)) =
h·f(a) + h·f(b) + dx·f(x) + dy·f(y) - dx·f(a) - dy·f(b) - h·f(x) - h·f(y) =
h(f(a) - f(x)) + h(f(b) - f(y)) + dx(f(x) - f(a)) + dy(f(y) - f(b)) =
h(f(a) - f(x)) + h(f(b) - f(y)) - dx(f(a) - f(x)) - dy(f(b) - f(y)) =
(h - dx)(f(a) - f(x)) + (h - dy)(f(b) - f(y))
Notice that each of the two terms above is a product of
non-negative factors, since h ≥ dx, h ≥ dy, f(a) ≥ f(x), and
f(b) ≥ f(y); so the difference is at least 0, meaning T' uses no
more bits than T. But T was assumed to store an optimal prefix
code, so T' cannot use fewer bits than T either. Thus the
difference must be exactly 0, and T', in which x and y (the two
characters with lowest frequency) are siblings at the maximum
depth of the coding tree, is also optimal.
In layman's terms: give me what you think is an optimal coding
tree, and I can create a new one from it, using no more bits, in
which the two lowest-frequency characters sit as siblings at the
bottom of the tree.
To complete the proof, you'll notice that by construction,
Huffman coding ALWAYS makes sure that the nodes with the
lowest frequencies are at the bottom of the coding tree, all the
way through the construction. (You can't find any pair of
nodes for which this isn't true.) Technically, to carry out the
proof, you'd use induction, but we'll skip that for now...
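For those who want a concrete check of the algebra in the
exchange argument, here is a small randomized test (our own
illustration) that the direct difference matches the factored
form and is never negative under the stated constraints:

import random

for _ in range(10000):
    h = random.randint(2, 20)                             # height of T
    dx, dy = random.randint(1, h), random.randint(1, h)   # depths of x and y
    fx, fy = random.randint(1, 50), random.randint(1, 50)
    fa = fx + random.randint(0, 50)                       # f(a) >= f(x)
    fb = fy + random.randint(0, 50)                       # f(b) >= f(y)
    bits_T  = h*fa + h*fb + dx*fx + dy*fy
    bits_T2 = dx*fa + dy*fb + h*fx + h*fy
    factored = (h - dx)*(fa - fx) + (h - dy)*(fb - fy)
    assert bits_T - bits_T2 == factored >= 0
print("exchange-argument identity verified on 10000 random cases")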