Project Huffman Code (Final)
Course Instructor:
Ms. Fauzia Ehsan
Page 1 of 20
Contribution by group members

Muhammad Omar Hassan:
- Huffman tree visualizer and encode-table builder applications
- Huffman code and robust testing of the Huffman tree
- Optimization of code
- Project report

Muhammad Asad Fayyaz:
- Huffman code and functions
- Research on different data structures for the project
- PowerPoint presentation
- Project report structure
PROJECT PROPOSAL: HUFFMAN CODING IMPLEMENTATION
1. Executive Summary:
Our project aims to implement and optimize Huffman coding, a widely used algorithm
for lossless data compression. Huffman coding efficiently represents text data by
assigning variable-length codes to different characters, resulting in reduced storage
requirements. The project's primary objective is to create a robust and well-optimized
implementation of Huffman coding, explore potential improvements, and analyze the
impact on compression efficiency.
3. Objectives:
DATA STRUCTURES
We’ll use:
1. Linked Lists:
Linked lists will be used as priority queues.
2. Running time complexity:
The implementation will involve building a Huffman tree, where the running time complexity
will be influenced by the frequency analysis of characters in the input text.
3. Queues:
The “heapq” module will be used to create a priority queue, which is a form of a queue, to
efficiently merge nodes while building the Huffman tree.
4. Binary Heaps:
The priority queue used for building the Huffman tree is implemented as a binary heap.
5. Greedy Algorithms:
Huffman coding is a greedy algorithm where characters are assigned codes based on their
frequencies, optimizing for shorter codes for more frequent characters.
Heaps:
Application: The heap variable is used as a priority queue to efficiently select nodes
with the lowest frequencies during the construction of the Huffman tree.
How Applied: The heap is implemented as a Python list, which acts as a binary heap.
The heapq module provides functions to transform the list into a heap, pop the smallest
element efficiently, and push new elements onto the heap.
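These three heapq operations can be sketched in isolation (the numbers here are arbitrary illustrations, not project data):

```python
import heapq

heap = [5, 1, 4, 2]
heapq.heapify(heap)              # transform the list into a min heap in-place
smallest = heapq.heappop(heap)   # pop the smallest element
heapq.heappush(heap, 3)          # push a new element onto the heap
print(smallest, heap[0])         # → 1 2
```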
Pointers:
Application: The Node class represents nodes in the Huffman tree and uses pointers
(left and right) to link to child nodes.
How Applied: The left and right attributes of the Node class are used to connect nodes
in a binary tree structure, allowing for efficient traversal during encoding and decoding.
Alternative Options: While a binary tree is suitable for Huffman coding, different tree
structures could be considered based on specific requirements. For instance, a ternary
tree might be explored for different scenarios.
Stacks and Queues:
Application: The heap serves as a priority queue, leveraging the concepts of both
stacks and queues.
How Applied: The heapq module is used to maintain a priority queue (min heap) during
the construction of the Huffman tree.
Sorting:
Application: Sorting is implicitly used when building the Huffman tree by repeatedly
selecting and merging nodes with the lowest frequencies.
How Applied: The heapq module efficiently maintains a min heap, ensuring that the
smallest elements are easily accessible for merging.
Binary Trees:
Application: The entire project revolves around constructing and traversing a binary
tree, i.e., the Huffman tree.
How Applied: The Node class and the construction of the Huffman tree demonstrate
fundamental concepts of binary trees, and the prefix-free structure of the codes is utilized
during tree traversal.
Alternative Options: Other types of trees could be considered for different scenarios,
but a binary tree is a natural fit for Huffman coding.
Graphs:
Application: The Huffman tree can be viewed as a binary tree, a type of graph. Graph
search operations are implicit in tree traversal during decoding.
How Applied: Depth-first search is employed during decoding, where bits determine
whether to move left or right in the tree.
Application: These concepts are not directly applied in the project as the core algorithm
is based on constructing a Huffman tree using a greedy approach.
How Applied: The construction of the Huffman tree is inherently greedy, always merging
the nodes with the lowest frequencies first.
Alternative Options: The greedy approach is a natural fit for Huffman coding. Alternative
methods might involve dynamic programming or other optimization techniques, but they
would likely deviate significantly from Huffman coding principles.
In summary, our project effectively applies a variety of data structures and algorithms, including arrays,
linked lists, heaps, trees, and graph traversal, to implement the Huffman coding algorithm efficiently. The
choices made align with the specific requirements of Huffman coding, emphasizing simplicity and
performance.
There are other modules and data structures that can be used for implementing Huffman coding.
However, the choice of heapq is justified by the nature of Huffman coding and the efficiency of heap-
based operations.
Queues can be used to implement the priority queue required for building the Huffman tree.
However, in Python, the queue.PriorityQueue class is not as efficient as the heapq module for
small priority queues, which are common in Huffman coding.
The heapq module provides a simple and efficient implementation of a priority queue using a
binary heap.
While the PriorityQueue class from the queue module is a built-in priority queue in Python, it
adds thread-safety locking overhead and, like heapq, requires its elements to be comparable.
In the case of Huffman coding, custom objects (Node instances) need to be compared based on
their frequencies.
The heapq module supports custom ordering through the __lt__ method defined in the Node
class, providing more flexibility.
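As a minimal sketch of this comparison hook (a stripped-down Node, not the project's full class):

```python
import heapq

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq

    def __lt__(self, other):
        # heapq orders elements with <, so nodes compare by frequency
        return self.freq < other.freq

heap = [Node('d', 4), Node('a', 1), Node('c', 3)]
heapq.heapify(heap)
print(heap[0].char)  # → a  (the lowest-frequency node surfaces first)
```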
Instead of using the heapq module, a custom implementation of a binary heap could be used.
However, the heapq module is part of the Python standard library and is optimized for
performance, making it a convenient and efficient choice.
Writing a custom binary heap implementation might not provide significant advantages in terms of
performance for this specific use case.
The PriorityQueue class from the queue module can be used, but it may not be as efficient as
the heapq module for small priority queues.
Huffman coding involves building a priority queue of nodes based on their frequencies.
The heapq module provides a binary heap implementation, which is well-suited for maintaining a
priority queue efficiently.
It allows for in-place heap operations, making it memory-efficient for small priority queues.
The simplicity and performance of the heapq module make it a suitable choice for Huffman
coding, where building and manipulating the priority queue is a critical part of the algorithm.
Huffman coding is a variable-length prefix coding algorithm that is used for lossless data compression. It was
developed by David A. Huffman, a computer scientist, while he was a Ph.D. student at MIT (Massachusetts Institute
of Technology). Huffman coding is particularly known for its efficiency in compressing data and is widely used in
various applications, including file compression algorithms like ZIP and in network communication protocols.
The Huffman coding algorithm works by assigning variable-length codes to different input characters. The length of
each code is determined by the frequency of the corresponding character in the input data. Characters with higher
frequencies are assigned shorter codes, while characters with lower frequencies are assigned longer codes. This
approach ensures that more frequently occurring characters are represented by shorter bit sequences, leading to
overall compression of the data.
David A. Huffman introduced this coding technique in his 1952 paper titled "A Method for the Construction of
Minimum-Redundancy Codes." Huffman's work on this algorithm was part of his research in information theory
and coding theory. The Huffman coding algorithm has since become a fundamental concept in the field of data
compression and is widely taught in computer science courses.
Huffman coding is closely related to data structures and algorithms. In fact, it's often used as an example in
computer science courses to illustrate fundamental concepts in these areas. Here are some key points highlighting
the relationship between Huffman coding, data structures, and algorithms:
1. Binary Trees: Huffman coding involves the construction of a binary tree, known as the Huffman tree. The
nodes of this tree represent characters, and the tree is constructed in a way that minimizes the total
length of the encoded data. The tree structure plays a crucial role in achieving efficient compression.
2. Priority Queues (Min Heap): In the construction of the Huffman tree, a priority queue is often used to
efficiently select and merge nodes with the lowest frequencies. The priority queue ensures that the nodes
with the lowest frequencies are processed first.
3. Greedy Algorithm: The construction of the Huffman tree is based on a greedy algorithm. At each step, the
algorithm makes the locally optimal choice by selecting and merging the two nodes with the lowest
frequencies. This greedy approach leads to a globally optimal solution, resulting in an optimal prefix code
for data compression.
4. Recursion: The process of constructing the Huffman tree involves recursive steps. The algorithm
recursively combines nodes until a single root node is created, representing the entire set of characters.
5. Time Complexity: The time complexity of constructing a Huffman tree is typically O(n log n), where n is
the number of distinct characters. This complexity arises from the repeated insertion and removal of
nodes in the priority queue.
6. Space Complexity: The space complexity of the Huffman coding algorithm is determined by the storage
needed for the priority queue and the Huffman tree. It is O(n), where n is the number of distinct
characters.
In summary, Huffman coding provides a practical application of data structures and algorithms. The algorithmic
concepts of binary trees, priority queues, and greedy algorithms are fundamental to understanding and
implementing Huffman coding for efficient data compression.
GREEDY ALGORITHM
In Huffman coding, the greedy algorithm is used to construct an optimal prefix-free binary tree called the Huffman
tree. The key idea is to build the tree in a way that minimizes the total length of the encoded data. The algorithm
makes locally optimal choices at each step, leading to a globally optimal solution. Here's how the greedy algorithm
works in Huffman coding:
1. Frequency Count:
Begin with a set of characters to be encoded and their frequencies (the number of times each
character appears in the data).
2. Node Creation:
Create a leaf node for each character and assign it a weight equal to its frequency.
3. Build the Priority Queue:
Place all the leaf nodes into a priority queue (min heap) based on their weights (frequencies). The
node with the lowest frequency has the highest priority.
4. Greedy Merge:
Remove the two nodes with the lowest frequencies from the priority queue.
Create a new internal node with a weight equal to the sum of the frequencies of the two
nodes.
Insert the new internal node back into the priority queue.
5. Tree Construction:
Continue the process until only one node remains in the priority queue. This node becomes the
root of the Huffman tree.
6. Encoding:
Traverse the tree to determine the binary codes for each character. Assign '0' for a left branch
and '1' for a right branch.
7. Result:
The resulting Huffman tree is used to create a set of variable-length codes, where shorter codes
are assigned to more frequent characters. This achieves efficient compression, as frequently
occurring characters have shorter codes.
The greedy nature of the algorithm lies in the choice of merging the two nodes with the lowest frequencies at each
step. Although the algorithm makes locally optimal choices, the resulting Huffman tree is globally optimal for
minimizing the total encoded length of the data. The use of a priority queue ensures that the nodes with the
lowest frequencies are processed first, contributing to the algorithm's efficiency.
import heapq
from collections import Counter

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        # Lets heapq compare nodes by frequency
        return self.freq < other.freq
Each node has a character (char), frequency (freq), and two child nodes (left and right).
The __lt__ method is defined for comparison, allowing nodes to be compared based on their frequencies.
def build_huffman_tree(text):
    frequency = Counter(text)
    heap = [Node(char, freq) for char, freq in frequency.items()]
    heapq.heapify(heap)

    def print_heap():
        print("\nHeap (Priority Queue) After Merging Nodes:")
        for node in heap:
            print(f"{node.char}:{node.freq}")

    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        # Internal nodes carry the concatenated characters as a display label
        merged_node = Node(left.char + right.char, left.freq + right.freq)
        merged_node.left = left
        merged_node.right = right
        print_heap()
        heapq.heappush(heap, merged_node)
        print_heap()

    return heap[0]
1. Initialize the Heap:
The function begins by counting the frequency of each character in the input text using
Counter. It then creates initial nodes for each character and forms a heap using
heapq.heapify.
2. Merge Nodes Until One Node Is Left:
The while loop continues until there is only one node left in the heap. In each iteration, it
pops the two nodes with the lowest frequencies (left and right) from the heap.
3. Heap State Printing:
The print_heap function is called before and after each merge to show the state of the heap
(priority queue).
4. Final Result:
The final result is the root of the Huffman tree, which is the only node left in the heap.
5. Example Output:
The function prints information about the heap and merged nodes during the process.
build_code_table:
Builds the Huffman code table (mapping characters to their binary codes). It recursively
traverses the Huffman tree, assigns binary codes based on the traversal path, and returns
code_table.
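The body of build_code_table is not reproduced in this report; a sketch consistent with the description above (the _Node helper and the tiny tree are illustrative stand-ins) could be:

```python
def build_code_table(node, code="", code_table=None):
    # Recursively walk the Huffman tree: append '0' going left, '1' going right
    if code_table is None:
        code_table = {}
    if node.left is None and node.right is None:
        # Leaf reached: the accumulated path is this character's code
        code_table[node.char] = code or "0"
    else:
        build_code_table(node.left, code + "0", code_table)
        build_code_table(node.right, code + "1", code_table)
    return code_table

class _Node:
    # Minimal stand-in for the project's Node class
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

# Tiny two-leaf tree: 'a' on the left, 'b' on the right
root = _Node(None, 3, _Node('a', 1), _Node('b', 2))
print(build_code_table(root))  # → {'a': '0', 'b': '1'}
```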
Encodes the input text using the Huffman code table.
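The encoding step can be sketched as a short lookup over the code table (the function name and the sample table are assumptions for illustration):

```python
def encode_text(text, code_table):
    # Replace each character with its Huffman code and concatenate
    return "".join(code_table[char] for char in text)

print(encode_text("abba", {'a': '0', 'b': '1'}))  # → 0110
```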
Iterates through each bit in the encoded text and traverses the tree accordingly. When a leaf
node is reached, it adds the corresponding character to the decoded text and finally returns
decoded_text.
Recursively traverses the tree and prints each node's character and frequency.
def main():
    text = input("Enter a text message: ")
    huffman_tree = build_huffman_tree(text)
    code_table = build_code_table(huffman_tree)
    print("\nHuffman Tree:")
    print_tree(huffman_tree)
    print("\nCode Table:")
    for char, code in code_table.items():
        print(f"{char}: {code}")

if __name__ == "__main__":
    main()
Takes user input, builds the Huffman tree, constructs the code table, encodes and decodes the text, and
prints the results.
This implementation demonstrates the core components of Huffman coding, including tree construction, code
table generation, and encoding/decoding processes.
EXAMPLE
Input : abbcccdddd
1. Initialize the Heap:
Count the frequency of each character: {'a': 1, 'b': 2, 'c': 3, 'd': 4}.
Create initial nodes for each character and form a heap: [Node(a:1), Node(b:2), Node(c:3), Node(d:4)].
Iteration 1: Merge nodes 'a' and 'b'. New heap: [Node(c:3), Node(d:4), Node(ab:3)].
Iteration 2: Merge nodes 'c' and 'ab'. New heap: [Node(d:4), Node(abc:6)].
Iteration 3: Merge nodes 'd' and 'abc'. New heap: [Node(abcd:10)].
The final result is the root of the Huffman tree, which is the only node left in the heap:
Node(abcd:10).
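The merge trace above can be reproduced with plain (frequency, label) tuples standing in for Node objects; note that tuple comparison breaks frequency ties by label, so the concatenated labels may come out in a slightly different order than shown in this report:

```python
import heapq

heap = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]  # frequencies of "abbcccdddd"
heapq.heapify(heap)
while len(heap) > 1:
    f1, s1 = heapq.heappop(heap)              # lowest-frequency node
    f2, s2 = heapq.heappop(heap)              # second-lowest node
    heapq.heappush(heap, (f1 + f2, s1 + s2))  # merged internal node
print(heap[0])  # the root holds the total frequency, 10
```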
We have also developed two new applications to provide a clearer illustration of the Huffman tree and the table.
While the original code effectively displays both the tables and the encoded text, it can be challenging to visualize
the structure and comprehend the code, its functionality, and the implementation of the functions.
So for the input "abbcccdddd", the Huffman tree produced by the greedy algorithm looks like the following:
3. Build Code Table:
After obtaining the Huffman tree root, you can use the build_code_table function to generate a code
table. The code table associates each character with its corresponding binary Huffman code based on the
tree.
This function recursively traverses the Huffman tree and assigns binary codes to each character. Starting
from the root, it moves left for '0' and right for '1' until a leaf node is reached. The code for each character
is the path from the root to that character.
Applying this to the Huffman tree root (Node(abcd:10)), you would get a code table like:
d: 0
c: 10
a: 110
b: 111
For improved visualization, we once again utilized the Huffman Tree Visualizer application that we created. It helps
illustrate how build_encode_table traverses the Huffman tree to assign 1s and 0s.
4. Encode the Input:
Using the generated code table, you can encode the original input "abbcccdddd" into its Huffman-
encoded form.
The encoded result for "abbcccdddd" with the given code table would be: "1101111111010100000".
Using the second application we built, which displays a table for the string input, its encoded text, and the total
number of bits encoded, we obtained the following output:
We can see that the encoded length is 19 bits, roughly a 4× reduction compared to ASCII encoding. Without
Huffman coding, the message "abbcccdddd" would have required far more bits: exactly
10 (input length) × 8 (ASCII bits) = 80 bits, because ordinary message transfer uses ASCII encoding, which
allocates 8 bits to each character.
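This saving can be double-checked with quick arithmetic, using the code lengths implied by the tree above (d: 1 bit, c: 2 bits, a and b: 3 bits each):

```python
ascii_bits = len("abbcccdddd") * 8            # 10 characters × 8 bits each
huffman_bits = 1 * 3 + 2 * 3 + 3 * 2 + 4 * 1  # a, b, c, d code lengths weighted by frequency
print(ascii_bits, huffman_bits)               # → 80 19
```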
ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for
electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other
devices.
You can also decode the Huffman-encoded string back to the original input using the Huffman tree.
1. Initialization:
The function starts by initializing an empty string decoded_text to store the characters of the
decoded message.
2. Decoding Loop:
The function then walks the Huffman tree bit by bit, starting from the root. For each bit:
If the bit is "0", the function moves to the left child of the current node
(current_node.left).
If the bit is "1", the function moves to the right child of the current node
(current_node.right).
3. Leaf Check:
After moving to the left or right child, the function checks whether the current node is a leaf
node (i.e., it has no children). Since internal nodes also carry concatenated character labels,
this is done with the condition if current_node.left is None and current_node.right is None:.
If the current node is a leaf node, it means a character has been found, and it is appended to the
decoded_text.
After appending the character, the current_node is reset to the root of the Huffman tree to start
decoding the next sequence of bits.
4. Completion:
The process continues until all bits in the encoded_text have been processed.
The final decoded_text is returned, representing the original message that was encoded.
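The decoding loop described above can be sketched as follows (the two-leaf tree, the _Node stand-in, and the function name are illustrative assumptions, not the project's exact code):

```python
class _Node:
    # Minimal stand-in for the project's Node class
    def __init__(self, char=None, left=None, right=None):
        self.char, self.left, self.right = char, left, right

def decode_text(encoded_text, root):
    decoded_text = ""
    current_node = root
    for bit in encoded_text:
        # Move left for '0', right for '1'
        current_node = current_node.left if bit == "0" else current_node.right
        if current_node.left is None and current_node.right is None:
            # Leaf reached: emit its character and restart from the root
            decoded_text += current_node.char
            current_node = root
    return decoded_text

root = _Node(left=_Node('a'), right=_Node('b'))  # codes: a = '0', b = '1'
print(decode_text("0110", root))  # → abba
```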
Applying this to the encoded text "1101111111010100000" and the Huffman tree, you would get the
original input "abbcccdddd".
CONCLUSION
In conclusion, the implemented Huffman coding algorithm serves as an effective and intuitive method for text
compression. The project successfully achieves its primary objectives of constructing a Huffman tree, generating a
corresponding code table, and employing these structures to encode and decode text messages. The compression
is achieved by assigning shorter codes to more frequent characters, thereby minimizing the overall bit
representation of the input text.
The clarity of the code facilitates a comprehensive understanding of the Huffman coding process, from the
creation of the priority queue-based heap to the recursive traversal of the Huffman tree for code generation. The
accompanying visualizations of the Huffman tree and code table contribute to the project's educational value.
Several opportunities for improvement have been identified, including the incorporation of error handling for edge
cases, optimization of code for larger inputs, documentation enhancements, thorough testing, a more user-friendly
interface, and the potential extension for file input and output operations. Addressing these aspects would
enhance the robustness, efficiency, and user experience of the project.
Overall, the project demonstrates a solid foundation in algorithmic concepts, particularly in the realm of lossless
data compression. As further enhancements are implemented, the project has the potential to serve as a valuable
educational resource and practical tool for understanding and applying Huffman coding principles.