Entropy & Run Length Coding
Contents
• What is entropy coding?
• Huffman encoding
• Huffman encoding example
• Arithmetic coding
• Encoding algorithm for arithmetic coding
• Decoding algorithm for arithmetic decoding
• Run length encoding
• Questions & answers
• References
What is Entropy Coding?
Entropy coding is a lossless compression scheme.
Two common entropy coding schemes are Huffman encoding and arithmetic coding.
Huffman Encoding
For each encoding unit (letter, symbol, or any character), associate a frequency.
You can use either a percentage or a probability of occurrence for the encoding unit.
Create a binary tree whose children are the encoding units with the smallest frequencies/probabilities.
The frequency of the root is the sum of the frequencies/probabilities of the leaves.
Repeat this procedure until all the encoding units are covered by a single binary tree.
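The procedure above can be sketched in Python; this is an illustrative implementation, not part of the slides. A min-heap keeps the two smallest frequencies at hand. Note that ties (equal frequencies) may be broken differently than in the worked example that follows, giving different but equally optimal codes.

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} map."""
    # Heap entries: (frequency, tiebreaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        # The new parent's frequency is the sum of its children's
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):         # internal node: recurse
            walk(tree[0], path + '0')       # 0 on the left branch
            walk(tree[1], path + '1')       # 1 on the right branch
        else:
            codes[tree] = path or '0'       # leaf: record the code
    walk(heap[0][2], '')
    return codes

codes = huffman_codes({'A': 40, 'B': 20, 'C': 10, 'D': 10, 'R': 20})
```

With the example frequencies below, the weighted encoded length (sum of frequency times code length) comes out the same as the slides' tree, even if the individual bit patterns differ.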
Example, step I
Assume that relative frequencies are:
A: 40
B: 20
C: 10
D: 10
R: 20
(I chose simpler numbers than the real frequencies)
The smallest numbers are 10 and 10 (C and D), so connect those
Example, step II
C and D have already been used, and the new node
above them (call it C+D) has value 20
The smallest values are B, C+D, and R, all of which
have value 20
Connect any two of these
Example, step III
The smallest value is R (20), while A and B+C+D both have value 40
Connect R to either of the others
Example, step IV
Connect the final two nodes
Example, step V
Assign 0 to left branches, 1 to right branches
Each encoding is a path from the root
A=0
B = 100
C = 1010
D = 1011
R = 11
Each path terminates at a leaf.
Do you see why encoded strings are decodable?
Unique prefix property
A=0
B = 100
C = 1010
D = 1011
R = 11
No bit string is a prefix of any other bit string
For example, if we added E=01, then A (0) would be a
prefix of E
Similarly, if we added F=10, then it would be a prefix of
three other encodings (B=100, C=1010, and D=1011)
The unique prefix property holds because, in a binary tree,
a leaf is not on a path to any other node
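Because of the unique prefix property, a decoder can scan the bits left to right and emit a symbol as soon as the accumulated bits match a code. A minimal Python sketch (not from the slides), using the example's codes:

```python
def prefix_decode(bits, codes):
    """Decode a bit string using prefix-free codes {symbol: bit pattern}."""
    inverse = {v: k for k, v in codes.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in inverse:            # first match is final: no code is
            out.append(inverse[cur])  # a prefix of any other code
            cur = ''
    return ''.join(out)

codes = {'A': '0', 'B': '100', 'C': '1010', 'D': '1011', 'R': '11'}
encoded = ''.join(codes[c] for c in 'ABRACADABRA')
print(prefix_decode(encoded, codes))  # ABRACADABRA
```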
Data Compression
Huffman encoding is a simple example of data
compression: representing data in fewer bits than it
would otherwise need
A more sophisticated method is GIF (Graphics
Interchange Format) compression, for .gif files
Another is JPEG (Joint Photographic Experts Group), for
.jpg files
Unlike the others, JPEG is lossy: it loses information.
That is generally OK for photographs (if you don't compress
them too much), because decompression adds "fake"
data very similar to the original.
Arithmetic Coding
Huffman coding has been proven optimal among methods that assign each symbol a whole number of bits.
Yet, since Huffman codes have to be an integral number of bits long, while the entropy value of a symbol is almost always a fractional number of bits, the theoretically smallest compressed message cannot be achieved.
Arithmetic Coding (cont.)
For example, if a statistical model assigns a 90% probability to a given character, the optimal code size would be about 0.15 bits.
The Huffman coding system would probably assign a 1-bit code to the symbol, which is more than six times longer than necessary.
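The 0.15-bit figure is the information content of the symbol, computed as a quick check (an illustrative snippet, not from the slides):

```python
import math

def info_bits(p):
    """Information content, in bits, of a symbol with probability p."""
    return -math.log2(p)

print(info_bits(0.9))  # about 0.152 bits for a 90%-probable symbol
print(info_bits(0.5))  # exactly 1 bit: here Huffman's 1-bit code is optimal
```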
Arithmetic Coding (cont.)
Arithmetic coding bypasses the idea of
replacing an input symbol with a specific code.
It replaces a stream of input symbols with a
single floating point output number.
Character    Probability   Range
^(space)     1/10          [0.0, 0.1)
A            1/10          [0.1, 0.2)
B            1/10          [0.2, 0.3)
E            1/10          [0.3, 0.4)
G            1/10          [0.4, 0.5)
I            1/10          [0.5, 0.6)
L            2/10          [0.6, 0.8)
S            1/10          [0.8, 0.9)
T            1/10          [0.9, 1.0)
(These are the character frequencies of the example message "BILL GATES".)
Arithmetic Coding (cont.)
To encode the first character B properly, the final coded message has to be a number greater than or equal to 0.2 and less than 0.3.
range = 1.0 − 0.0 = 1.0
high = 0.0 + 1.0 × 0.3 = 0.3
low = 0.0 + 1.0 × 0.2 = 0.2
After the first character is encoded, the low end of the range is changed from 0.00 to 0.20 and the high end from 1.00 to 0.30.
Arithmetic Coding (cont.)
The next character to be encoded, the letter I, owns the range 0.50 to 0.60, applied within the new subrange of 0.20 to 0.30.
So the new encoded number will fall somewhere in the 50th to 60th percentile of the currently established range.
Thus, this number is further restricted to the interval 0.25 to 0.26.
Arithmetic Coding (cont.)
Note that any number between 0.25 and 0.26 is a legal encoding of 'BI'. Thus, a number that is best suited for binary representation is selected.
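The narrowing procedure can be sketched in Python (an illustrative snippet, not from the slides). Exact rational arithmetic via `fractions.Fraction` avoids floating-point rounding; the character ranges are the ones from the table above.

```python
from fractions import Fraction as F

# (low, high) range for each character, as in the probability table
RANGES = {' ': (F(0, 10), F(1, 10)), 'A': (F(1, 10), F(2, 10)),
          'B': (F(2, 10), F(3, 10)), 'E': (F(3, 10), F(4, 10)),
          'G': (F(4, 10), F(5, 10)), 'I': (F(5, 10), F(6, 10)),
          'L': (F(6, 10), F(8, 10)), 'S': (F(8, 10), F(9, 10)),
          'T': (F(9, 10), F(10, 10))}

def arith_encode(message):
    """Narrow [0, 1) once per symbol; any number in the final range
    encodes the whole message."""
    low, high = F(0), F(1)
    for c in message:
        width = high - low
        high = low + width * RANGES[c][1]   # scale the symbol's range
        low = low + width * RANGES[c][0]    # into the current interval
    return low, high

print(arith_encode('BI'))   # the interval [0.25, 0.26)
```

Encoding the full message 'BILL GATES' yields the interval whose low end is 0.2572167752, matching the figure below.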
[Figure: successive narrowing of the interval as 'BILL GATES' is encoded. Each column subdivides the current range among the characters in proportion to their probabilities; the interval narrows from [0.0, 1.0) to [0.2, 0.3) after B, [0.25, 0.26) after I, [0.256, 0.258) after the first L, and finally to [0.2572167752, 0.2572167756) after the last character, S.]
Arithmetic Coding (Decoding)
Decoding is the inverse process.
Since 0.2572167752 falls between 0.2 and 0.3, the first character must be 'B'.
Remove the effect of 'B' from 0.2572167752 by first subtracting the low value of B, 0.2, giving 0.0572167752.
Then divide by the width of the range of 'B', 0.1. This gives a value of 0.572167752.
Decoding (cont.)
Then calculate where that value lands, which is in the range of the next letter, 'I'.
The process repeats until the value reaches 0 or the known length of the message is reached.
Arithmetic Decoding Algorithm
r = input_number
repeat
    find c such that r falls in its range
    output(c)
    r = r − low_range(c)
    r = r ÷ (high_range(c) − low_range(c))
until EOF or the length of the message is reached
r              c          low   high   range
0.2572167752   B          0.2   0.3    0.1
0.572167752    I          0.5   0.6    0.1
0.72167752     L          0.6   0.8    0.2
0.6083876      L          0.6   0.8    0.2
0.041938       ^(space)   0.0   0.1    0.1
0.41938        G          0.4   0.5    0.1
0.1938         A          0.1   0.2    0.1
0.938          T          0.9   1.0    0.1
0.38           E          0.3   0.4    0.1
0.8            S          0.8   0.9    0.1
0.0
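The decoding table can be reproduced with a short Python sketch (illustrative, not from the slides; the character ranges are restated so the snippet is self-contained, and exact fractions avoid rounding drift across the ten divisions):

```python
from fractions import Fraction as F

# (low, high) range for each character, as in the probability table
RANGES = {' ': (F(0, 10), F(1, 10)), 'A': (F(1, 10), F(2, 10)),
          'B': (F(2, 10), F(3, 10)), 'E': (F(3, 10), F(4, 10)),
          'G': (F(4, 10), F(5, 10)), 'I': (F(5, 10), F(6, 10)),
          'L': (F(6, 10), F(8, 10)), 'S': (F(8, 10), F(9, 10)),
          'T': (F(9, 10), F(10, 10))}

def arith_decode(r, length):
    """Invert the encoding: find the symbol whose range contains r,
    emit it, then rescale r back to [0, 1) and repeat."""
    out = []
    for _ in range(length):
        for c, (low, high) in RANGES.items():
            if low <= r < high:
                out.append(c)
                r = (r - low) / (high - low)   # remove the symbol's effect
                break
    return ''.join(out)

print(arith_decode(F('0.2572167752'), 10))  # BILL GATES
```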
Arithmetic Coding Summary
In summary, the encoding process is simply one of
narrowing the range of possible numbers with every
new symbol.
The new range is proportional to the predefined
probability attached to that symbol.
Decoding is the inverse procedure, in which the range
is expanded in proportion to the probability of each
symbol as it is extracted.
Arithmetic Coding (cont.)
The coding rate theoretically approaches the high-order entropy of the source.
Arithmetic coding is not as popular as Huffman coding because multiplication and division are needed.
Run Length Encoder/Decoder
What is RLE?
A compression technique
Represents data using (value, run length) pairs
A run length is defined as the number of consecutive equal values
e.g. 1110011111 → RLE → 1 3 0 2 1 5
(alternating values and run lengths: three 1s, two 0s, five 1s)
Advantages of RLE
Useful for compressing data that contains repeated values,
e.g. the output from a filter, where many consecutive values are 0
Very simple compared with other compression techniques
Reversible (lossless) compression: decompression is just as easy
Applications
Run length encoding is used in I-frame compression in video.
RLE Effectiveness
Compression effectiveness depends on the input
Must have consecutive runs of values in order to maximize compression
Best case: all values the same; any length can be represented using just two values
Worst case: no repeating values; the compressed data is twice the length of the original!
Should only be used in situations where we know for sure that the data has repeating values
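The same (value, run length) scheme can be sketched in Python (an illustrative round-trip, separate from the MATLAB project code below):

```python
def rle_encode(seq):
    """Flatten a sequence into alternating values and run lengths."""
    out, run = [], 1
    for prev, cur in zip(seq, seq[1:]):
        if cur == prev:
            run += 1             # extend the current run
        else:
            out += [prev, run]   # run ended: emit (value, run length)
            run = 1
    if seq:
        out += [seq[-1], run]    # emit the final run
    return out

def rle_decode(pairs):
    """Expand alternating (value, run length) pairs back to a sequence."""
    out = []
    for value, run in zip(pairs[::2], pairs[1::2]):
        out += [value] * run
    return out

print(rle_encode([1, 0, 0, 0, 0, 2, 2, 2, 1, 1, 3]))
```

Note the worst case: an input with no repeats, such as [1, 2, 3], encodes to [1, 1, 2, 1, 3, 1], twice the original length.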
Encoder – Algorithm
Start on the first element of the input
Examine the next value
If it is the same as the previous value
    Keep a counter of consecutive values
    Keep examining the next value until a different value or the end of the input is reached, then output the value followed by the counter. Repeat
If it is not the same as the previous value
    Output the previous value followed by 1 (the run length). Repeat
Encoder – Matlab Code
% Run Length Encoder
% EE113D Project
function encoded = RLE_encode(input)
my_size = size(input);
len = my_size(2);
run_length = 1;
encoded = [];
for i = 2:len
    if input(i) == input(i-1)
        run_length = run_length + 1;
    else
        % Run ended: output the value followed by its run length
        encoded = [encoded input(i-1) run_length];
        run_length = 1;
    end
end
if len > 1
    % Add the last value and its run length to the output
    encoded = [encoded input(len) run_length];
else
    % Special case if the input is of length 1
    encoded = [input(1) 1];
end
Encoder – Matlab Results
>> RLE_encode([1 0 0 0 0 2 2 2 1 1 3])
ans =
     1     1     0     4     2     3     1     2     3     1
>> RLE_encode([0 1 2 3 4 5 6 7 8 9])
ans =
     0     1     1     1     2     1     3     1     4     1     5     1     6     1     7     1     8     1     9     1
Encoder
Input comes from a separate .asm file, in the form of a vector,
e.g. 'array .word 4,5,5,2,7,3,6,9,9,10,10,10,10,10,10,0,0'
Output is declared as data memory space; examine memory to get the output
It is originally declared to be all -1s
Immediate Problem
Output size is not known until run-time (it depends on the input pattern as well as the input size)
Cannot initialize a variable-size array
Encoder
Solution
Limit user input to a preset length (16)
Initialize the output to the worst case (double the input length: 32)
Initialize the output to all -1s (we're only handling positive numbers and 0 as inputs)
Output ends when -1 first appears or when the output length equals the worst case
Decoder – Matlab Code
% Run Length Decoder
% EE113D Project
% The input to this function should be the output from Run Length Encoder,
% which means it assumes an even number of elements in the input. The first
% element is a value followed by its run count. Thus all odd elements in
% the input are assumed to be values and all even elements run counts.
%
function decoded = RLE_decode(encoded)
my_size = size(encoded);
length = my_size(2);
index = 1;
decoded = [];
% iterate through the input
while (index <= length)
% get value which is followed by the run count
value = encoded(index);
run_length = encoded(index + 1);
for i=1:run_length
% loop adding 'value' to output 'run_length' times
decoded = [decoded value];
end
% put index at next value element (odd element)
index = index + 2;
end
Decoder – Matlab Results
>> RLE_decode([0 12])
ans =
0 0 0 0 0 0 0 0 0 0 0 0