Huffman Coding On Matlab
[email protected], 900128T311
Objective:
The objective of this assignment is to estimate the probability distribution of the letters in a
text and to construct an optimal source code based on that estimate. The text is then
compressed, and the average codeword length is compared with the entropy of the
estimated distribution.
Method:
Theory shows that a Huffman code is an optimal source code for a given symbol distribution.
The provided text in the file LifeOnMars.txt was therefore encoded with a Huffman
code.
The problem was entirely solved in MATLAB; the code is attached in the appendix section.
The outline of the methodology is given below:
1.
2.
3.
4.
5.
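The Huffman construction used in the method can be sketched on a small example. The following is a minimal illustration, not the appendix code: the four-symbol distribution and all variable names (`nodeProb`, `nodeSyms`, `codes`, `avgLen`) are assumptions made for this example. It repeatedly merges the two least probable nodes, prepends one bit to every symbol below each merged branch, and accumulates the average length as the sum of the merged node probabilities (path length lemma).

```matlab
% Toy distribution (hypothetical values, not the LifeOnMars.txt estimate).
p = [0.4 0.3 0.2 0.1];
n = numel(p);

codes    = repmat({''}, 1, n);   % codeword per symbol, built from the last bit backwards
nodeProb = p;                    % probability of each active node
nodeSyms = num2cell(1:n);        % symbol indices contained in each active node
avgLen   = 0;                    % accumulates the internal node probabilities

while numel(nodeProb) > 1
    % Sort so that the two least probable nodes are at the end.
    [~, order] = sort(nodeProb, 'descend');
    nodeProb = nodeProb(order);
    nodeSyms = nodeSyms(order);

    a = nodeSyms{end-1};         % this branch gets bit 0
    b = nodeSyms{end};           % this branch gets bit 1
    for s = a, codes{s} = ['0' codes{s}]; end
    for s = b, codes{s} = ['1' codes{s}]; end

    % Merge the two nodes into one internal node.
    merged   = nodeProb(end-1) + nodeProb(end);
    avgLen   = avgLen + merged;  % path length lemma
    nodeProb = [nodeProb(1:end-2), merged];
    nodeSyms = [nodeSyms(1:end-2), {[a b]}];
end

codeLen = cellfun(@numel, codes);    % codeword length of each symbol
for k = 1:n
    fprintf('symbol %d: p = %.2f, code = %s\n', k, p(k), codes{k});
end
fprintf('average length = %.2f bits/symbol\n', avgLen);
```

For this toy distribution the codeword lengths come out as 1, 2, 3 and 3 bits, so the average length is 1.9 bits/symbol. The exact bit patterns depend on how ties are broken during sorting.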
Assignment-2, EITN45
Results:
The distribution of the letters, along with the assigned codes, is given below:

Letter        Probability  Occurrence  Code       Code length  #Zeros  #Ones
[New line]    0.0343       43          01100      5            3       2
[Space]       0.1733       217         000        3            3       0
[Apostrophe]  0.0104       13          111011     6            1       5
'a'           0.0575       72          1000       4            3       1
'b'           0.0144       18          011011     6            2       4
'c'           0.0136       17          111010     6            2       4
'd'           0.0208       26          001010     6            4       2
'e'           0.0942       118         110        3            1       2
'f'           0.0192       24          001101     6            3       3
'g'           0.0200       25          001011     6            3       3
'h'           0.0527       66          1010       4            2       2
'i'           0.0543       68          1001       4            2       2
'k'           0.0152       19          011010     6            3       3
'l'           0.0351       44          00111      5            2       3
'm'           0.0256       32          11100      5            2       3
'n'           0.0519       65          1011       4            1       3
'o'           0.0727       91          0100       4            3       1
'p'           0.0032       4           001100010  9            6       3
'r'           0.0447       56          1111       4            0       4
's'           0.0599       75          0111       4            1       3
't'           0.0671       84          0101       4            2       2
'u'           0.0216       27          001000     6            5       1
'v'           0.0064       8           00110000   8            6       2
'w'           0.0216       27          001001     6            4       2
'y'           0.0096       12          0011001    7            4       3
'z'           0.0008       1           001100011  9            5       4

Probability of ones: 0.4554
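The tabulated values can be cross-checked with a few lines of MATLAB. The sketch below copies the occurrence counts, code lengths and ones-per-codeword from the table in row order. The variable names are assumptions for this example, and the 8 bits per uncoded character follows the convention used in the appendix.

```matlab
% Columns copied from the results table, in row order (26 symbols).
occ = [43 217 13 72 18 17 26 118 24 25 66 68 19 44 32 65 91 4 56 75 84 27 8 27 12 1];
len = [5 3 6 4 6 6 6 3 6 6 4 4 6 5 5 4 4 9 4 4 4 6 8 6 7 9];
ones_per_code = [2 0 5 1 4 4 2 2 3 3 2 2 3 3 3 3 1 3 4 3 2 1 2 2 3 4];

N          = sum(occ);                  % number of characters in the text
coded_bits = sum(occ .* len);           % length of the encoded text in bits
p_ones     = sum(occ .* ones_per_code) / coded_bits;  % share of ones in the coded text
ratio      = 8 * N / coded_bits;        % assumes 8 bits per uncoded character
kraft      = sum(2 .^ (-len));          % Kraft sum; 1 means the code tree is full

fprintf('characters: %d, coded bits: %d\n', N, coded_bits);
fprintf('probability of ones: %.4f\n', p_ones);
fprintf('compression ratio: %.2f\n', ratio);
fprintf('Kraft sum: %.4f\n', kraft);
```

With these inputs the probability of ones reproduces the tabulated 0.4554, and the Kraft sum equals 1, as expected for a complete Huffman tree. The compression ratio comes out at about 1.90 under the 8-bit assumption.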
Conclusion:
From the results we can see that the average codeword length is very close to the entropy of
the estimated distribution. It satisfies the optimal code condition H(X) ≤ L < H(X) + 1.
The compression ratio is satisfactory, and the proportions of zeros and ones in the coded
sequence are close to 0.5 (the probability of ones is 0.4554).
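As a rough cross-check of the optimal code condition, the quantities can be recomputed from the occurrence counts and code lengths in the results table (1252 characters, 5277 coded bits). The rounded figures below are a recomputation under that assumption, not values quoted in the report.

```latex
\begin{align*}
L &= \sum_i p_i\,\ell_i = \frac{5277}{1252} \approx 4.215 \ \text{bits/symbol},\\
H(X) &= -\sum_i p_i \log_2 p_i \approx 4.18 \ \text{bits/symbol},\\
H(X) &\le L < H(X) + 1 .
\end{align*}
```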
References:
[1] Information Theory Engineering, Stefan Höst, Lund University.
APPENDIX
The main code to find the probabilities and the Huffman codes
clc;
clear;
%read the whole text file as a character array
fid = fopen('LifeOnMars.txt');
Txt = fscanf(fid,'%c');
fclose(fid);
%count the occurrences of each letter a-z
j=1;
for i='a':'z'
    result(j)=length(find(Txt==i));
    j=j+1;
end
data='a':'z';
%count apostrophe, space and newline as well
result(27)=length(find(Txt==char(39)));
data(27)=char(39);
result(28)=length(find(Txt==char(32)));
data(28)=char(32);
result(29)=length(find(Txt==char(10)));
data(29)=char(10);
%Newline and space create unexpected problems in the coding part (strcat
%strips trailing whitespace from char inputs), so placeholders are used
data(29)=char(64);  %newline changed to @
data(28)=char(255); %space changed to char(255)
prob=result./sum(result);
%%---------------------------------------------------------------------
%Needed to get the number of characters
input=struct('data',[],'prob',[],'flag',[]);
for i=1:length(data)
    input(i).data =data(i);
    input(i).prob =prob(i);
    input(i).flag=0;
end
input([input.prob] == 0) = [];
[B,order] = sort([input(:).prob],'descend');
input = input(order);
huff=input;
temp=struct('data',[],'prob',[],'flag',[]);
%start from an empty struct array so that no blank first entry ends up in
%the code table
tree=struct('data',{},'code',{});
avg_code_length=0;
while length(huff)>1
    dump=0;
    [B,order] = sort([huff(:).prob],'descend');
    huff = huff(order);
    %extracting the least probable symbols from huffman tree to add them
    %together
    temp.data=strcat(huff(length(huff)).data,huff(length(huff)-1).data);
    temp.prob=huff(length(huff)).prob+huff(length(huff)-1).prob;
    temp.flag=1; %flag 1 denotes an internal node
    %path length lemma: avg length = sum of internal node probabilities
    avg_code_length=avg_code_length+temp.prob;
    %if neither node is an internal node, both become new leaves
    %(the condition and dump=1 are reconstructed; those lines are missing
    %from the listing)
    if (huff(length(huff)-1).flag==0)&&(huff(length(huff)).flag==0)
        dump=1;
        tree(length(tree)+1).data=huff(length(huff)-1).data;
        tree(length(tree)).code=0;
        tree(length(tree)+1).data=huff(length(huff)).data;
        tree(length(tree)).code=1;
    end
    %if the 1st node is not internal but 2nd node is internal
    if (huff(length(huff)-1).flag==0)&&(dump==0)
        tree(length(tree)+1).data=huff(length(huff)-1).data;
        tree(length(tree)).code=0;
    %if the 1st node is an internal node, append 0 to every symbol below it
    %(loop reconstructed by symmetry with the 2nd-node branch; it is
    %missing from the listing)
    elseif huff(length(huff)-1).flag==1
        for j=1:length(huff(length(huff)-1).data)
            a=huff(length(huff)-1).data(j);
            for i=1:length(tree)
                if a==tree(i).data
                    tree(i).code(length(tree(i).code)+1)=0;
                end
            end
        end
    end
    %if the 1st node is internal but 2nd node is not internal
    if (huff(length(huff)).flag==0)&&(dump==0)
        tree(length(tree)+1).data=huff(length(huff)).data;
        tree(length(tree)).code=1;
    %if the 2nd node is an internal node, append 1 to every symbol below it
    elseif huff(length(huff)).flag==1
        for j=1:length(huff(length(huff)).data)
            a=huff(length(huff)).data(j);
            for i=1:length(tree)
                if a==tree(i).data
                    tree(i).code(length(tree(i).code)+1)=1;
                end
            end
        end
    end
    %removing already considered nodes from original tree and adding the new
    %node
    huff(length(huff)) = [];
    huff(length(huff)) = [];
    huff(length(huff)+1) = temp;
end
%bits were appended from leaf to root, so reverse each codeword
for i=1:length(tree)
    dict(i).data=tree(i).data;
    dict(i).code=fliplr(tree(i).code);
end
%summarizing data in a structured way
summary=struct('data',[],'probability',[],'occurrence',[],'code',[],...
    'codelength',[],'zeros',[],'ones',[]);
for i=1:length(dict)
    summary(i).data=dict(i).data;
    summary(i).code=dict(i).code;
    summary(i).codelength=length(dict(i).code);
    for j=1:length(input)
        if summary(i).data==input(j).data
            summary(i).probability=input(j).prob;
        end
    end
    summary(i).occurrence=summary(i).probability*sum(result);
    %map the placeholders back to newline and space
    if summary(i).data=='@'
        summary(i).data=char(10);
    end
    if summary(i).data==char(255)
        summary(i).data=char(32);
    end
    summary(i).zeros=sum(summary(i).code(:)==0);
    summary(i).ones=sum(summary(i).code(:)==1);
end
[B,order] = sort([summary(:).data],'ascend');
summary = summary(order);
%Entropy of the coded data
Entropy_coded=Entropy(reshape(prob,[],1));
%Length of the encoded text &
%Distribution of Zero's and One's in the coded sequence
Length_coded_text=0;
Number_zeros=0;
Number_ones=0;
for i=1:length(summary)
    Length_coded_text=Length_coded_text+summary(i).occurrence*summary(i).codelength;
    Number_zeros=Number_zeros+summary(i).occurrence*summary(i).zeros;
    Number_ones=Number_ones+summary(i).occurrence*summary(i).ones;
end
Length_uncoded_text=sum(result)*8; %8 bits per character in the uncoded text
compression_ratio=Length_uncoded_text/Length_coded_text;
Dist_zeros=Number_zeros/(Number_zeros+Number_ones);
        b(isnan(b))=0;
        H(i)=-1*(a+b);
    end
%--------------------------------------------------------------------------------------
%This block deals with the scalar input
%--------------------------------------------------------------------------------------
else
    a=P*log2(P);
    a(isnan(a))=0;
    b=(1-P)*log2(1-P);
    b(isnan(b))=0;
    H=-1*(a+b);
end
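The isnan guards in the listing above deal with the term 0*log2(0), which MATLAB evaluates to NaN although it counts as 0 in the entropy sum. For a whole probability vector, the same convention can be sketched in one line. The helper name `entropy_bits` is hypothetical, and this is not the Entropy.m used for the report.

```matlab
% Hypothetical helper (not the report's Entropy.m): entropy in bits of a
% probability vector p. Zero-probability entries are skipped, which
% implements the convention 0*log2(0) = 0.
entropy_bits = @(p) -sum(p(p > 0) .* log2(p(p > 0)));

H_uniform = entropy_bits([0.25 0.25 0.25 0.25]);  % four equally likely symbols: 2 bits
H_binary  = entropy_bits([0.5 0.5]);              % fair binary source: 1 bit
H_certain = entropy_bits([1 0]);                  % deterministic source: 0 bits

fprintf('uniform: %.2f, binary: %.2f, certain: %.2f\n', H_uniform, H_binary, H_certain);
```

Applied to the estimated letter probabilities, such a helper would give the entropy that the average codeword length is compared against.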