Huffman Project Report
Submitted By:
Raza Umar
ID: g200905090
Algorithm Description
The algorithm developed for Huffman encoding takes as input a string of data symbols to be encoded, along with a vector containing the respective symbol probabilities. It calls two recursive functions to generate the Huffman dictionary and reports the average codeword length of the dictionary as output.
The main idea of the algorithm is to use cell structures in MATLAB to build the Huffman tree while keeping track of child and parent nodes. Once the tree has been built, the codeword corresponding to each input data symbol (which acts as a leaf node in the Huffman tree) can be found by simply traversing the tree from the root until that leaf node is encountered. The general structure contains cells corresponding to the input data symbol, its probability, and its original position in the string of symbols passed to the algorithm. Two additional cells are included in the structure to hold information about the child nodes and the codeword of the current node. One instance of this structure is made for each data symbol; the M (= number of input data symbols) instances are filled with the known information and sorted in ascending order of probability. This results in M leaf nodes corresponding to the M data symbols, arranged in ascending order of probability.
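As a concrete illustration, the per-symbol structure and the initial sort can be sketched as follows. This is a Python sketch for illustration only; the report's implementation uses MATLAB cell structures, and the field names here are assumptions.

```python
# Hypothetical Python sketch of the node structure described above;
# the actual implementation uses MATLAB cell structures.

symbols = ['a', 'b', 'c', 'd']   # example input data symbols
probs = [0.4, 0.3, 0.2, 0.1]     # example respective probabilities

# One instance per symbol: data symbol, probability, original position,
# child-node list, and the (initially empty) codeword.
nodes = [{'symbol': s, 'prob': p, 'order': i, 'children': [], 'code': []}
         for i, (s, p) in enumerate(zip(symbols, probs))]

# Sort the M leaf nodes in ascending order of probability.
nodes.sort(key=lambda n: n['prob'])
```

After the sort, the first node in the list is the least probable symbol, matching the leaf-node arrangement described above.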
The Huffman tree is generated by passing this structure (with M nodes) to a recursive function “gen_h_tree”. This function combines the top two nodes (the two nodes with the least probability) into one parent node. The parent node records the two combined nodes as its child nodes, and its probability is the sum of the probabilities of the child nodes. The two child nodes are then removed from the Huffman tree and, depending on its probability, the parent node is inserted into the Huffman tree such that all (M-1) nodes remain in ascending order of probability. Note that replacing two child nodes with one parent node reduces the number of nodes by 1. The function is called recursively until the Huffman tree consists of only one final node with probability 1.
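The combining step can be sketched as follows. This is an iterative Python sketch under the assumed dict-based node representation; the report's “gen_h_tree” performs the same reduction recursively in MATLAB.

```python
# Illustrative Python sketch of the node-combining step described above;
# the report's gen_h_tree does this recursively in MATLAB.
import bisect

# Example leaf nodes, already in ascending order of probability.
nodes = [{'prob': p, 'children': []} for p in (0.1, 0.2, 0.3, 0.4)]

while len(nodes) > 1:
    # Take the top two nodes (those with the least probability).
    low, high = nodes.pop(0), nodes.pop(0)
    # The parent records the two combined nodes as children;
    # its probability is the sum of the child probabilities.
    parent = {'prob': low['prob'] + high['prob'], 'children': [low, high]}
    # Insert the parent so the remaining nodes stay in ascending order.
    keys = [n['prob'] for n in nodes]
    nodes.insert(bisect.bisect(keys, parent['prob']), parent)

root = nodes[0]   # the one final node, with probability 1
```

Each pass through the loop removes two nodes and inserts one, so the node count decreases by 1 until only the root remains.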
The Huffman dictionary is then generated by traversing this tree recursively down to the leaf nodes. Essentially, the Huffman dictionary is another structure containing cells corresponding to the input data symbol, its probability, its codeword, the length of its codeword, and its original position in the input string of data symbols. Since the Huffman tree is a binary tree, each parent node contains information about its two child nodes. The child node with the lower probability is assigned bit 1, while the child node with the higher probability is assigned bit 0. Along each path, these bits are concatenated into a vector that ultimately becomes the codeword of the node with no children, i.e. the leaf node.
Each time a leaf node is encountered, the weighted length of its codeword is accumulated in a variable “avglen”, initialized to 0. Once all leaf nodes have been assigned their codewords, this variable holds the average codeword length of the dictionary. The codeword dictionary is then arranged according to the desired output format, e.g. in the same order as the input data symbols (original order) or in ascending/descending order of code length. The algorithm then outputs each data symbol along with its respective codeword from the codeword dictionary.
Algorithm Flowchart
Start
    ↓
Generate h_tree
    ↓
Does the structure have only 1 node?
    no → repeat Generate h_tree
    yes ↓
Generate h_dict:
    for i = 1:2
        h_tree.child{i}.code = [h_tree.code 2-i]
        call Generate h_dict with h_tree.child{i} as input
    end for
    ↓
Display Output:
    • Sort h_dict instances according to the original order of the input data symbols
    • Output the symbols and their respective codewords
    • Output the avglen of the codeword dictionary