
25th December, 2023

CS-250: DATA STRUCTURES & ALGORITHMS

Project Report: Huffman Coding Implementation

Course Instructor:
Ms. Fauzia Ehsan

Project Report Submitted by:

Muhammad Omar Hassan (365784)


Muhammad Asad Fayyaz (369438)

School of Natural Sciences

Contribution by group members

Muhammad Omar Hassan:
- Huffman tree visualizer and encode-table builder applications
- Huffman code and robust testing of the Huffman tree
- Optimization of code
- Project report

Muhammad Asad Fayyaz:
- Huffman code and functions
- Research on different data structures for the project
- PowerPoint presentation
- Project report structure

CONTENTS

Project Proposal: Huffman Coding Implementation
Data Structures
Application in our project
Why we used the concept of heaps
    1) Queue module (queue.Queue)
    2) PriorityQueue class (queue.PriorityQueue)
    3) Binary Heap (Custom Implementation)
    4) PriorityQueue from queue.PriorityQueue
    Justification for using heapq in Huffman coding
Topic: Huffman Coding
    What is Huffman Coding?
Huffman coding and DSA
Greedy Algorithm
Project Code and Explanation
Example
Conclusion

PROJECT PROPOSAL: HUFFMAN CODING IMPLEMENTATION

Efficient Text Compression using Huffman Coding

1. Executive Summary:

Our project aims to implement and optimize Huffman coding, a widely used algorithm
for lossless data compression. Huffman coding efficiently represents text data by
assigning variable-length codes to different characters, resulting in reduced storage
requirements. The project's primary objective is to create a robust and well-optimized
implementation of Huffman coding, explore potential improvements, and analyze the
impact on compression efficiency.

2. Problem Statement:

The rapid growth of digital data necessitates efficient compression algorithms. Huffman coding addresses this need, but implementing it optimally poses engineering challenges. Our project seeks to address the complex problem of creating a high-performance Huffman coding implementation, optimizing its runtime, and exploring strategies for further enhancement.

3. Objectives:

 Implementation of Huffman Coding: To develop a functional Huffman coding algorithm capable of encoding and decoding text messages.
4. Achievable Targets:
 Baseline Implementation:
Successful implementation of a basic Huffman coding algorithm.
Functional encoding and decoding processes for text messages.

DATA STRUCTURES

We’ll use:

1. Linked Lists:
Linked lists will be used as priority queues.

2. Running time complexity:
The implementation will involve building a Huffman tree, where the running time complexity will be influenced by the frequency analysis of characters in the input text. In general, it is O(n log n).

3. Queues:
The “heapq” module will be used to create a priority queue, which is a form of a queue, to
efficiently merge nodes while building the Huffman tree.

4. Sorting Algorithms & Recursion:


The “heapq” module will be used for merging nodes, which internally uses heap-based
algorithms. The construction of the Huffman tree will involve recursion.

5. Trees & Binary Search Tree Operations:


The Huffman tree is a binary tree where each character corresponds to a leaf node.

6. Binary Heaps:
The priority queue used for building the Huffman tree is implemented as a binary heap.

7. Graphs and Search Operations:


While Huffman coding itself is not directly related to graphs, the underlying construction of the
Huffman tree can be viewed as a form of tree-based graph structure.

8. Greedy Algorithms:
Huffman coding is a greedy algorithm where characters are assigned codes based on their frequencies, optimizing for shorter codes for more frequent characters.

APPLICATION IN OUR PROJECT

1. Array / Linked Lists:

 Application: The heap variable is used as a priority queue to efficiently select nodes
with the lowest frequencies during the construction of the Huffman tree.

 How Applied: The heap is implemented as a Python list, which acts as a binary heap.
The heapq module provides functions to transform the list into a heap, pop the smallest
element efficiently, and push new elements onto the heap.

 Alternative Options: A custom implementation of a binary heap or a linked list could also serve as the priority queue, but the built-in heapq module is a practical choice for simplicity and efficiency.

2. Singly / Doubly / Circular Linked List:

 Application: The Node class represents nodes in the Huffman tree and uses pointers
(left and right) to link to child nodes.

 How Applied: The left and right attributes of the Node class are used to connect nodes
in a binary tree structure, allowing for efficient traversal during encoding and decoding.

 Alternative Options: While a binary tree is suitable for Huffman coding, different tree
structures could be considered based on specific requirements. For instance, a ternary
tree might be explored for different scenarios.

3. Stacks and Queues:

 Application: The heap serves as a priority queue, leveraging the concepts of both
stacks and queues.

 How Applied: The heapq module is used to maintain a priority queue (min heap) during
the construction of the Huffman tree.

 Alternative Options: Alternative priority queue implementations, such as using queue.PriorityQueue in Python, could be considered.

4. Sorting Algorithms & Recursion:

 Application: Sorting is implicitly used when building the Huffman tree by repeatedly
selecting and merging nodes with the lowest frequencies.

 How Applied: The heapq module efficiently maintains a min heap, ensuring that the
smallest elements are easily accessible for merging.

 Alternative Options: While heap-based selection is common for Huffman coding, alternative sorting algorithms like quicksort or mergesort could be explored.

5. Trees & Binary Search Tree Operations:

 Application: The entire project revolves around constructing and traversing a binary
tree, i.e., the Huffman tree.

 How Applied: The Node class and the construction of the Huffman tree demonstrate fundamental concepts of binary trees; tree traversal is used both when assigning codes and when decoding.

 Alternative Options: Other types of trees could be considered for different scenarios,
but a binary tree is a natural fit for Huffman coding.

6. Graphs and Search Operations:

 Application: The Huffman tree can be viewed as a binary tree, a type of graph. Graph
search operations are implicit in tree traversal during decoding.

 How Applied: A depth-first traversal assigns codes while building the code table, and decoding walks the tree from the root, with each bit determining whether to move left or right.

 Alternative Options: Different traversal strategies or algorithms could be considered, but depth-first search is suitable for the hierarchical nature of Huffman trees.

7. Topological Sort, Spanning Trees, Shortest Paths, Greedy Algorithms:

 Application: These concepts are not directly applied in the project as the core algorithm
is based on constructing a Huffman tree using a greedy approach.

 How Applied: The construction of the Huffman tree is inherently greedy, always merging
the nodes with the lowest frequencies first.

 Alternative Options: The greedy approach is a natural fit for Huffman coding. Alternative
methods might involve dynamic programming or other optimization techniques, but they
would likely deviate significantly from Huffman coding principles.

In summary, our project effectively applies a variety of data structures and algorithms, including arrays,
linked lists, heaps, trees, and graph traversal, to implement the Huffman coding algorithm efficiently. The
choices made align with the specific requirements of Huffman coding, emphasizing simplicity and
performance.

WHY WE USED THE CONCEPT OF HEAPS

There are other modules and data structures that can be used for implementing Huffman coding.
However, the choice of heapq is justified by the nature of Huffman coding and the efficiency of heap-
based operations.

1)QUEUE MODULE (QUEUE.QUEUE):

 A plain FIFO queue (queue.Queue) returns items in arrival order, so on its own it cannot always hand back the lowest-frequency node that building the Huffman tree requires.
 Python's thread-safe queue classes also carry locking overhead that a single-threaded program does not need, which makes them slower than heapq for the small priority queues common in Huffman coding.
 The heapq module provides a simple and efficient implementation of a priority queue using a binary heap.

2)PRIORITYQUEUE CLASS (QUEUE.PRIORITYQUEUE):

 While the PriorityQueue class is a built-in priority queue in Python, it wraps a heap behind thread-safety locks and, like heapq, requires its elements to be comparable.
 In the case of Huffman coding, custom objects (Node instances) need to be compared based on their frequencies.
 Defining the __lt__ method in the Node class lets heapq compare nodes directly by frequency, keeping the implementation simple and flexible.

3)BINARY HEAP (CUSTOM IMPLEMENTATION):

 Instead of using the heapq module, a custom implementation of a binary heap could be used.
 However, the heapq module is part of the Python standard library and is optimized for
performance, making it a convenient and efficient choice.
 Writing a custom binary heap implementation might not provide significant advantages in terms of
performance for this specific use case.

4)PRIORITYQUEUE FROM QUEUE.PRIORITYQUEUE:

 The PriorityQueue class from the queue module can be used, but, as noted above, its locking overhead makes it slower than the heapq module for the small priority queues used here.

JUSTIFICATION FOR USING HEAPQ IN HUFFMAN CODING:

 Huffman coding involves building a priority queue of nodes based on their frequencies.

 The heapq module provides a binary heap implementation, which is well-suited for maintaining a
priority queue efficiently.
 It allows for in-place heap operations, making it memory-efficient for small priority queues.
 The simplicity and performance of the heapq module make it a suitable choice for Huffman
coding, where building and manipulating the priority queue is a critical part of the algorithm.
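As a small illustration of why heapq fits this workload (a sketch, not part of the project code), the snippet below keeps a priority queue of (frequency, symbol) pairs and always pops the least frequent symbol first:

import heapq

# Build a min-heap of (frequency, symbol) pairs; heapq orders entries by the
# first tuple element, so the least frequent symbol sits at the front.
freqs = [(4, 'd'), (1, 'a'), (3, 'c'), (2, 'b')]
heapq.heapify(freqs)                      # O(n) in-place heap construction

while freqs:
    freq, symbol = heapq.heappop(freqs)   # O(log n) removal of the minimum
    print(symbol, freq)                   # prints: a 1, b 2, c 3, d 4

The same pattern, with Node objects in place of plain tuples, is exactly what the project code in the following sections relies on.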

TOPIC: HUFFMAN CODING

WHAT IS HUFFMAN CODING?

Huffman coding is a variable-length prefix coding algorithm that is used for lossless data compression. It was
developed by David A. Huffman, a computer scientist, while he was a Ph.D. student at MIT (Massachusetts Institute
of Technology). Huffman coding is particularly known for its efficiency in compressing data and is widely used in
various applications, including file compression algorithms like ZIP and in network communication protocols.

The Huffman coding algorithm works by assigning variable-length codes to different input characters. The length of
each code is determined by the frequency of the corresponding character in the input data. Characters with higher
frequencies are assigned shorter codes, while characters with lower frequencies are assigned longer codes. This
approach ensures that more frequently occurring characters are represented by shorter bit sequences, leading to
overall compression of the data.

David A. Huffman introduced this coding technique in his 1952 paper titled "A Method for the Construction of
Minimum-Redundancy Codes." Huffman's work on this algorithm was part of his research in information theory
and coding theory. The Huffman coding algorithm has since become a fundamental concept in the field of data
compression and is widely taught in computer science courses.

HUFFMAN CODING AND DSA

Huffman coding is closely related to data structures and algorithms. In fact, it's often used as an example in
computer science courses to illustrate fundamental concepts in these areas. Here are some key points highlighting
the relationship between Huffman coding, data structures, and algorithms:

1. Binary Trees: Huffman coding involves the construction of a binary tree, known as the Huffman tree. The
nodes of this tree represent characters, and the tree is constructed in a way that minimizes the total
length of the encoded data. The tree structure plays a crucial role in achieving efficient compression.

2. Priority Queues (Min Heap): In the construction of the Huffman tree, a priority queue is often used to
efficiently select and merge nodes with the lowest frequencies. The priority queue ensures that the nodes
with the lowest frequencies are processed first.

3. Greedy Algorithm: The construction of the Huffman tree is based on a greedy algorithm. At each step, the
algorithm makes the locally optimal choice by selecting and merging the two nodes with the lowest
frequencies. This greedy approach leads to a globally optimal solution, resulting in an optimal prefix code
for data compression.

4. Recursion: The process of constructing the Huffman tree involves recursive steps. The algorithm
recursively combines nodes until a single root node is created, representing the entire set of characters.

5. Time Complexity: The time complexity of constructing a Huffman tree is typically O(n log n), where n is
the number of distinct characters. This complexity arises from the repeated insertion and removal of
nodes in the priority queue.

6. Space Complexity: The space complexity of the Huffman coding algorithm is determined by the storage
needed for the priority queue and the Huffman tree. It is O(n), where n is the number of distinct
characters.

In summary, Huffman coding provides a practical application of data structures and algorithms. The algorithmic
concepts of binary trees, priority queues, and greedy algorithms are fundamental to understanding and
implementing Huffman coding for efficient data compression.

GREEDY ALGORITHM

In Huffman coding, the greedy algorithm is used to construct an optimal prefix-free binary tree called the Huffman
tree. The key idea is to build the tree in a way that minimizes the total length of the encoded data. The algorithm
makes locally optimal choices at each step, leading to a globally optimal solution. Here's how the greedy algorithm
works in Huffman coding:

1. Frequency Count:

 Begin with a set of characters to be encoded and their frequencies (the number of times each
character appears in the data).

2. Node Creation:

 Create a leaf node for each character and assign it a weight equal to its frequency.

3. Priority Queue (Min Heap):

 Place all the leaf nodes into a priority queue (min heap) based on their weights (frequencies). The
node with the lowest frequency has the highest priority.

4. Greedy Merge:

 While there is more than one node in the priority queue:

 Remove the two nodes with the lowest frequencies from the priority queue.

 Create a new internal node with a weight equal to the sum of the frequencies of the two
nodes.

 Insert the new internal node back into the priority queue.

5. Tree Construction:

 Continue the process until only one node remains in the priority queue. This node becomes the
root of the Huffman tree.

6. Encoding:

 Traverse the tree to determine the binary codes for each character. Assign '0' for a left branch
and '1' for a right branch.

7. Result:

 The resulting Huffman tree is used to create a set of variable-length codes, where shorter codes
are assigned to more frequent characters. This achieves efficient compression, as frequently
occurring characters have shorter codes.

The greedy nature of the algorithm lies in the choice of merging the two nodes with the lowest frequencies at each
step. Although the algorithm makes locally optimal choices, the resulting Huffman tree is globally optimal for
minimizing the total encoded length of the data. The use of a priority queue ensures that the nodes with the
lowest frequencies are processed first, contributing to the algorithm's efficiency.
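To make steps 3 to 5 concrete, here is a minimal, standalone sketch of the greedy merge loop. It uses nested tuples instead of the project's Node class, and the helper name greedy_merge and the tie-breaking counter are illustrative choices rather than part of the project code:

import heapq
from collections import Counter

def greedy_merge(text):
    # Sketch of steps 3-5: repeatedly merge the two least frequent subtrees.
    # Each heap entry is (frequency, tie_breaker, tree); a tree is either a
    # single character or a (left, right) pair of subtrees.
    counter = Counter(text)
    heap = [(freq, i, char) for i, (char, freq) in enumerate(counter.items())]
    heapq.heapify(heap)
    tie = len(heap)

    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # second least frequent subtree
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))
        tie += 1                             # keeps entries comparable on frequency ties

    return heap[0][2]                        # root of the merged tree

print(greedy_merge("abbcccdddd"))            # e.g. ('d', ('c', ('a', 'b')))

Reading the nested tuples from the root outward gives the same shape as the Huffman tree built later: the most frequent character ends up closest to the root.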

PROJECT CODE AND EXPLANATION

import heapq
from collections import Counter

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

 Node represents a node in the Huffman tree.

 Each node has a character (char), frequency (freq), and two child nodes (left and right).

 The __lt__ method is defined for comparison, allowing nodes to be compared based on their frequencies.
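A quick check (not part of the project code) of why __lt__ matters: heapq orders items with the < operator, so Node objects can live on a heap only because __lt__ compares their frequencies. Assuming the Node class above has been defined:

import heapq

nodes = [Node('c', 3), Node('a', 1), Node('b', 2)]
heapq.heapify(nodes)                 # ordering relies on Node.__lt__

smallest = heapq.heappop(nodes)
print(smallest.char, smallest.freq)  # a 1 -- the least frequent node comes out first

Without __lt__, the heap operations above would raise a TypeError, because Python would not know how to compare two Node instances.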

def build_huffman_tree(text):
    frequency = Counter(text)
    heap = [Node(char, freq) for char, freq in frequency.items()]
    heapq.heapify(heap)

    def print_heap():
        print("\nHeap (Priority Queue) After Merging Nodes:")
        for node in heap:
            print(f"{node.char}:{node.freq}")

    print_heap()

    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)

        merged_node = Node(None, left.freq + right.freq)
        merged_node.left = left
        merged_node.right = right

        print(f"Merged Nodes: {left.char}:{left.freq} and {right.char}:{right.freq} into {merged_node.char}:{merged_node.freq}")

        heapq.heappush(heap, merged_node)
        print_heap()

    return heap[0]

1. Initialize the Heap:

 The function begins by counting the frequency of each character in the input text using Counter.

 It then creates initial nodes for each character and forms a heap using heapq.heapify.

2. Merge Nodes Until One Node Left:

 The while loop continues until there is only one node left in the heap.

 In each iteration, it pops two nodes with the lowest frequencies (left and right) from the heap.

 It creates a new node (merged_node) representing the merged nodes.

3. Print Heap Information:

 The print_heap function is called before and after each merge to show the state of the heap
(priority queue).

 This helps visualize how nodes are being merged.

4. Return Huffman Tree Root:

 The final result is the root of the Huffman tree, which is the only node left in the heap.

5. Example Output:

 The function prints information about the heap and merged nodes during the process.
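A short usage sketch (assuming the functions in this report live in one module): the root returned by build_huffman_tree carries the total frequency, which equals the length of the input text.

root = build_huffman_tree("abbcccdddd")   # also prints the heap after each merge
print(root.freq)    # 10 -- the sum of all character frequencies, i.e. len(text)
print(root.char)    # None -- merged (internal) nodes carry no character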

def build_code_table(node, code="", code_table=None):
    if code_table is None:
        code_table = {}

    if node is not None:
        if node.char is not None:
            code_table[node.char] = code
        build_code_table(node.left, code + "0", code_table)
        build_code_table(node.right, code + "1", code_table)

    return code_table

 Builds the Huffman code table (mapping characters to their binary codes).

 Recursively traverses the Huffman tree and assigns binary codes based on the traversal path.

 Returns the code table.
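As a sanity check (a sketch, not part of the project code), the table produced by build_code_table can be verified to be prefix-free, which is what makes unambiguous decoding possible. The exact 0/1 assignment shown in the comment depends on how ties are broken in the heap:

root = build_huffman_tree("abbcccdddd")
table = build_code_table(root)
print(table)        # e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'}

codes = list(table.values())
prefix_free = all(not b.startswith(a)
                  for a in codes for b in codes if a != b)
print(prefix_free)  # True -- no code is a prefix of another code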

def huffman_encode(text, code_table):
    encoded_text = "".join(code_table[char] for char in text)
    return encoded_text

 Encodes the input text using the Huffman code table.

 Converts each character in the text to its corresponding Huffman code.

 Returns the encoded text.

def huffman_decode(encoded_text, tree):
    decoded_text = ""
    current_node = tree

    for bit in encoded_text:
        if bit == "0":
            current_node = current_node.left
        else:
            current_node = current_node.right

        if current_node.char is not None:
            decoded_text += current_node.char
            current_node = tree

    return decoded_text

 Decodes the Huffman-encoded text using the Huffman tree.

 Iterates through each bit in the encoded text and traverses the tree accordingly.

 When a leaf node is reached, adds the corresponding character to the decoded text.

 Returns the decoded text.
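A round-trip usage sketch tying the pieces together (again assuming the functions above are in one module):

text = "abbcccdddd"
tree = build_huffman_tree(text)
table = build_code_table(tree)

encoded = huffman_encode(text, table)
decoded = huffman_decode(encoded, tree)

print(len(encoded))     # 19 -- bits needed for this input
print(decoded == text)  # True -- decoding restores the original message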

def print_tree(node, level=0, prefix="Root: "):
    if node is not None:
        print(" " * (level * 4) + prefix + f"{node.char}:{node.freq}")
        if node.left or node.right:
            print_tree(node.left, level + 1, "L--- ")
            print_tree(node.right, level + 1, "R--- ")

 Prints the Huffman tree in a human-readable format.

 Recursively traverses the tree and prints each node's character and frequency.

 Indents nodes based on their level in the tree.
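For the sample input used in the Example section ("abbcccdddd"), print_tree produces output along the following lines; the exact left/right placement can vary with how equal frequencies are ordered in the heap:

Root: None:10
    L--- d:4
    R--- None:6
        L--- c:3
        R--- None:3
            L--- a:1
            R--- b:2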

def main():
    text = input("Enter a text message: ")

    huffman_tree = build_huffman_tree(text)
    code_table = build_code_table(huffman_tree)

    encoded_text = huffman_encode(text, code_table)

    decoded_text = huffman_decode(encoded_text, huffman_tree)

    print("\nHuffman Tree:")
    print_tree(huffman_tree)

    print("\nCode Table:")
    for char, code in code_table.items():
        print(f"{char}: {code}")

    print("\nEncoded Text:", encoded_text)
    print("Number of bits in encoded text:", len(encoded_text))
    print("Number of characters in input message:", len(text))

    print("\nDecoded Text:", decoded_text)

if __name__ == "__main__":
    main()

 The main function orchestrates the entire Huffman coding process.

 Takes user input, builds the Huffman tree, constructs the code table, encodes and decodes the text, and
prints the results.

This implementation demonstrates the core components of Huffman coding, including tree construction, code
table generation, and encoding/decoding processes.

EXAMPLE

Input: abbcccdddd
1. Initialize the Heap:

 Count the frequency of each character: {'a': 1, 'b': 2, 'c': 3, 'd': 4}.

 Create initial nodes for each character and form a heap: [Node(a:1), Node(b:2), Node(c:3), Node(d:4)].

2. Merge Nodes Until One Node Left:

 Iteration 1: Merge nodes 'a' and 'b'. New heap: [Node(c:3), Node(d:4), Node(ab:3)]

 Iteration 2: Merge nodes 'c' and 'ab'. New heap: [Node(d:4), Node(abc:6)].

 Iteration 3: Merge nodes 'd' and 'abc'. New heap: [Node(abcd:10)].

 Return Huffman Tree Root:

 The final result is the root of the Huffman tree, which is the only node left in the heap:
Node(abcd:10).

We have also developed two new applications to provide a clearer illustration of the Huffman tree and the table.
While the original code effectively displays both the tables and the encoded text, it can be challenging to visualize
the structure and comprehend the code, its functionality, and the implementation of the functions.

So, for the input "abbcccdddd", the greedy algorithm produces a Huffman tree with d:4 directly under the root on one side and the abc:6 subtree on the other, where abc:6 splits into c:3 and the merged ab:3 pair (a:1 and b:2); the visualizer renders this structure graphically.

3. Build Code Table:

 After obtaining the Huffman tree root, you can use the build_code_table function to generate a code
table. The code table associates each character with its corresponding binary Huffman code based on the
tree.

 This function recursively traverses the Huffman tree and assigns binary codes to each character. Starting
from the root, it moves left for '0' and right for '1' until a leaf node is reached. The code for each character
is the path from the root to that character.

Applying this to the Huffman tree root (Node(abcd:10)), you would get a code table along the lines of d: 0, c: 10, a: 110, b: 111 (the exact 0/1 assignment depends on how left and right branches are chosen).

For improved visualization, we once again utilized the Huffman Tree Visualizer application that we created. It helps illustrate how build_code_table traverses the Huffman tree to assign 1s and 0s.

4. Encode the Input:

 Using the generated code table, you can encode the original input "abbcccdddd" into its Huffman-encoded form.

 With the code table above, the encoded result for "abbcccdddd" is "1101111111010100000"; the exact bit pattern depends on which branches are labelled 0 and 1, but the total length is always 19 bits for this input.

Using the second application we built, which displays a table with the input string, its encoded text, and the total number of encoded bits, we obtained the output summarized below.

We can see that the encoded length is 19 bits, roughly a 4x reduction compared to ASCII encoding. Without Huffman coding, the message "abbcccdddd" would have been far larger: exactly

10 (input length) × 8 (ASCII bits per character) = 80 bits,

because plain text transfer normally uses ASCII encoding, which allocates 8 bits to every character.
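The comparison can be reproduced in a few lines (a sketch assuming the functions defined earlier in this report):

text = "abbcccdddd"
tree = build_huffman_tree(text)
encoded = huffman_encode(text, build_code_table(tree))

ascii_bits = len(text) * 8             # 10 characters x 8 bits = 80 bits
huffman_bits = len(encoded)            # 19 bits for this input
print(ascii_bits, huffman_bits, round(ascii_bits / huffman_bits, 1))   # 80 19 4.2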

ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices.

5. Decode the Input:

 You can also decode the Huffman-encoded string back to the original input using the Huffman tree.

1. Initialization:

 The function starts by initializing an empty string decoded_text to store the characters of the
decoded message.

 It also initializes a variable current_node to the root of the Huffman tree.

2. Decoding Loop:

 The function iterates through each bit in the encoded_text.

 For each bit:

 If the bit is "0", the function moves to the left child of the current node
(current_node.left).

 If the bit is "1", the function moves to the right child of the current node
(current_node.right).

3. Checking for Leaf Node:

 After moving to the left or right child, the function checks if the current node is a leaf node (i.e., it
has a non-None character). This is done using the condition if current_node.char is not None:.

 If the current node is a leaf node, it means a character has been found, and it is appended to the
decoded_text.

 After appending the character, the current_node is reset to the root of the Huffman tree to start
decoding the next sequence of bits.

4. Completion:

 The process continues until all bits in the encoded_text have been processed.

 The final decoded_text is returned, representing the original message that was encoded.

 Applying this to the encoded text "1101111111010100000" and the Huffman tree, you would get back the original input "abbcccdddd".
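The decoding result can also be double-checked without the tree by matching bit prefixes directly against the code table. This is an alternative check rather than the project's decoding method, and the table below is the hypothetical assignment used in this example (actual codes may differ with tie-breaking):

table = {'a': '110', 'b': '111', 'c': '10', 'd': '0'}   # hypothetical example codes
encoded = "1101111111010100000"
reverse = {code: char for char, code in table.items()}

decoded, buffer = "", ""
for bit in encoded:
    buffer += bit                  # accumulate bits until they match a full code
    if buffer in reverse:          # prefix-free codes make this unambiguous
        decoded += reverse[buffer]
        buffer = ""

print(decoded)   # abbcccdddd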

CONCLUSION

In conclusion, the implemented Huffman coding algorithm serves as an effective and intuitive method for text
compression. The project successfully achieves its primary objectives of constructing a Huffman tree, generating a
corresponding code table, and employing these structures to encode and decode text messages. The compression
is achieved by assigning shorter codes to more frequent characters, thereby minimizing the overall bit
representation of the input text.

The clarity of the code facilitates a comprehensive understanding of the Huffman coding process, from the
creation of the priority queue-based heap to the recursive traversal of the Huffman tree for code generation. The
accompanying visualizations of the Huffman tree and code table contribute to the project's educational value.

Several opportunities for improvement have been identified, including the incorporation of error handling for edge
cases, optimization of code for larger inputs, documentation enhancements, thorough testing, a more user-friendly

interface, and the potential extension for file input and output operations. Addressing these aspects would
enhance the robustness, efficiency, and user experience of the project.

Overall, the project demonstrates a solid foundation in algorithmic concepts, particularly in the realm of lossless
data compression. As further enhancements are implemented, the project has the potential to serve as a valuable
educational resource and practical tool for understanding and applying Huffman coding principles.

