Notes 7 2013 - Arithmetic Coding
Motivation:
Huffman coding generates a code whose rate is within
$p_{\max} + 0.086$ of the entropy, where $p_{\max}$ is the
probability of the most frequently occurring symbol.
Small $p_{\max}$ → small deviation from the entropy.
Large $p_{\max}$ → Huffman codes become inefficient
compared with the entropy.
Example 1
Huffman code for the 3-letter alphabet {a1, a2, a3}
Letter   Probability   Code
a1       0.95          0
a2       0.02          11
a3       0.03          10
Entropy: 0.335 bits/symbol
Rate: 1.05 bits/symbol
Redundancy: 0.715 bits/symbol = 213% of the entropy.
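The three figures of Example 1 can be rechecked numerically. The sketch below is only an illustration; the variable names are chosen here and are not part of the notes.

```python
from math import log2

# Probabilities and Huffman codewords taken from the table of Example 1.
probs = {"a1": 0.95, "a2": 0.02, "a3": 0.03}
code = {"a1": "0", "a2": "11", "a3": "10"}

# Entropy of the source in bits/symbol.
entropy = -sum(p * log2(p) for p in probs.values())
# Rate = expected codeword length in bits/symbol.
rate = sum(probs[s] * len(code[s]) for s in probs)
# Redundancy = rate in excess of the entropy.
redundancy = rate - entropy

print(f"entropy    = {entropy:.3f} bits/symbol")
print(f"rate       = {rate:.3f} bits/symbol")
print(f"redundancy = {redundancy:.3f} bits/symbol "
      f"({100 * redundancy / entropy:.0f}% of the entropy)")
```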
Possible solution: block 2 symbols together and
generate the extended code
Letter   Probability   Huffman code
a1a1     0.9025        0
a1a2     0.0190        111
a1a3     0.0285        100
a2a1     0.0190        1101
a2a2     0.0004        110011
a2a3     0.0006        110001
a3a1     0.0285        101
a3a2     0.0006        110010
a3a3     0.0009        110000
Entropy: 0.335 bits/symbol of original alphabet
Rate: 0.611 bits/symbol of original alphabet
Redundancy: 0.276 bits/symbol of original alphabet (about 82%
of entropy)
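The rate of the extended code can be reproduced by building a Huffman code over all pairs of letters. This is a minimal sketch, assuming the two letters in a block are independent (as the product probabilities in the table imply); `huffman_lengths` is a helper written for this illustration, and its tie-breaking may produce different codewords than the table while giving the same average length.

```python
import heapq
from itertools import product
from math import log2

def huffman_lengths(probs):
    """Return {symbol: codeword length} for a binary Huffman code."""
    # Heap entries: (probability, unique tie-breaker, symbols below this node).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node gains one bit.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

single = {"a1": 0.95, "a2": 0.02, "a3": 0.03}
# Extended alphabet: all ordered pairs, probabilities multiply.
pairs = {x + y: single[x] * single[y] for x, y in product(single, repeat=2)}

lengths = huffman_lengths(pairs)
rate_per_symbol = sum(pairs[s] * lengths[s] for s in pairs) / 2
entropy = -sum(p * log2(p) for p in single.values())

print(f"rate       = {rate_per_symbol:.3f} bits/symbol of original alphabet")
print(f"redundancy = {rate_per_symbol - entropy:.3f} bits/symbol")
```

Replacing `repeat=2` with a larger block size shows how quickly the extended alphabet grows.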
Redundancy drops to acceptable values if we block about
8 symbols together. Alphabet size: $3^8 = 6561$.
Impractical (space, time to encode and decode).
In Huffman coding with extended alphabet of m
symbols/block we need codewords for each sequence
of m symbols.
The main idea
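- Map the letters to numbers with a random variable $X(a_i) = i$, so that $P(X = i) = P(a_i)$, $i = 1, 2, \dots, m$
- Cumulative distribution function of X: $F_X(i) = \sum_{k=1}^{i} P(X = k)$, with
$F_X(0) = 0$, $F_X(m) = 1$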
Generating a tag
Partition the [0, 1) interval into subintervals defined by
the cdf of the source.
While there are more symbols to be encoded do
- get the next symbol
- restrict the tag to the subinterval corresponding to
the new symbol
- partition the new subinterval of the tag proportionally,
based on the cdf
The tag for the sequence is any number in the final
subinterval.
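The loop above can be sketched in a few lines. The function name `final_interval` is hypothetical, and the probabilities 0.7, 0.1, 0.2 are the values implied by the subintervals [0.0, 0.7), [0.7, 0.8), [0.8, 1.0) used in Example 2. Exact fractions are used so that no rounding error enters the illustration.

```python
from fractions import Fraction

def final_interval(sequence, probs):
    """Narrow [0, 1) symbol by symbol and return the final (low, high)."""
    # cdf[s] = (F(s-1), F(s)): the slice of [0, 1) owned by symbol s.
    cdf, total = {}, Fraction(0)
    for symbol, p in probs.items():
        cdf[symbol] = (total, total + p)
        total += p

    low, high = Fraction(0), Fraction(1)
    for symbol in sequence:
        width = high - low
        f_lo, f_hi = cdf[symbol]
        # Restrict to the subinterval of the new symbol; the next pass
        # partitions this subinterval in the same proportions.
        low, high = low + width * f_lo, low + width * f_hi
    return low, high

probs = {"a1": Fraction(7, 10), "a2": Fraction(1, 10), "a3": Fraction(2, 10)}
low, high = final_interval(["a1", "a2", "a3"], probs)
print(float(low), float(high))  # 0.546 0.56
```

Any number in the returned interval, for example 0.553, can serve as the tag.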
Example 2 (continued)
Code the sequence $a_1 a_2 a_3$.
[Figure: successive restriction of the interval [0, 1); the final subinterval is [0.546, 0.56).]
The decoder can sequentially recover the data vector from
its tag.
Suppose the given tag value is 0.553.
The decoder knows that the interval $I_1$ is either
[0.0, 0.7) or [0.7, 0.8) or [0.8, 1.0). The first subinterval
corresponds to the symbol $a_1$, the second to
$a_2$, and the third to $a_3$.
Since the number 0.553 lies in the interval [0.0, 0.7), the
decoder determines that $I_1 = [0.0, 0.7)$ and that the first
data sample is $x_1 = a_1$.
Now the decoder knows that $I_2$ is either
[0.0, 0.49) or [0.49, 0.56) or [0.56, 0.7). Since 0.553 lies
in the interval [0.49, 0.56), the decoder determines that
$I_2 = [0.49, 0.56)$ and that the second data sample is
$x_2 = a_2$.
The decoder now knows that $I_3$ is either
[0.49, 0.539) or [0.539, 0.546) or [0.546, 0.56). Since
0.553 lies in the interval [0.546, 0.56), corresponding to
the symbol $a_3$, the decoder concludes that the last data
sample is $x_3 = a_3$.
Tag generation for single-letter sequences
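- Alphabet $A = \{a_1, a_2, \dots, a_m\}$
- Random variable $X(a_i) = i$, $a_i \in A$, $i = 1, 2, \dots, m$
- Define a tag for $a_i$, denoted by $T_X(a_i)$, to be
$T_X(a_i) = \sum_{k=1}^{i-1} P(X = k) + \frac{1}{2} P(X = i) = F_X(i-1) + \frac{1}{2} P(X = i)$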
Example 3
$A = \{a_1, a_2, \dots, a_6\}$
$P(X = i) = \frac{1}{6}$, $i = 1, 2, \dots, 6$
Letter   P(X = i)   F_X(i)   Tag
a1       1/6        1/6      0.0833
a2       1/6        2/6      0.25
a3       1/6        3/6      0.4166
a4       1/6        4/6      0.5833
a5       1/6        5/6      0.75
a6       1/6        1        0.9166
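The tags of Example 3 follow from placing each tag at the midpoint of the letter's slice of the cdf, that is, $F_X(i-1) + \frac{1}{2} P(X = i)$. The helper name `single_letter_tags` below is an assumption made for this sketch.

```python
from fractions import Fraction

def single_letter_tags(probs):
    """Tag of letter i = F(i-1) + P(i)/2, for i = 1..m."""
    tags, cumulative = [], Fraction(0)
    for p in probs:
        tags.append(cumulative + p / 2)  # midpoint of [F(i-1), F(i))
        cumulative += p                  # cumulative is now F(i)
    return tags

# Six equally likely letters, as in Example 3.
die = [Fraction(1, 6)] * 6
tags = single_letter_tags(die)
for i, t in enumerate(tags, start=1):
    print(f"a{i}: {t} = {float(t):.4f}")
```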
Tag generation for multi-letter sequences
Impose an order "<" on the sequences
(lexicographic ordering is often used).
Define the tag for the n-letter sequence $\mathbf{x}$, denoted by
$T_X^{(n)}(\mathbf{x})$, as
$T_X^{(n)}(\mathbf{x}) = \sum_{\mathbf{y} < \mathbf{x}} P(\mathbf{y}) + \frac{1}{2} P(\mathbf{x})$
Example 3 (continued)
$T_X^{(2)}(a_1 a_3) = P(a_1 a_1) + P(a_1 a_2) + \frac{1}{2} P(a_1 a_3) = \frac{1}{36} + \frac{1}{36} + \frac{1}{72} = \frac{5}{72}$
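The definition can be checked by brute force: enumerate every sequence of the same length, add up the probabilities of those that precede the target in lexicographic order, and add half the probability of the target. This sketch assumes an i.i.d. source whose letters are represented by their indices 1..m; the function names are illustrative only. The enumeration grows exponentially with the sequence length.

```python
from fractions import Fraction
from itertools import product
from math import prod

def sequence_probability(seq, probs):
    """P(seq) for an i.i.d. source; letters are the indices 1..m."""
    return prod((probs[i - 1] for i in seq), start=Fraction(1))

def tag_by_enumeration(x, probs):
    """Sum P(y) over all y < x of the same length, plus P(x)/2."""
    x = tuple(x)
    m = len(probs)
    # Python compares tuples lexicographically, which supplies the order "<".
    below = sum(
        (sequence_probability(y, probs)
         for y in product(range(1, m + 1), repeat=len(x)) if y < x),
        Fraction(0),
    )
    return below + sequence_probability(x, probs) / 2

die = [Fraction(1, 6)] * 6
tag = tag_by_enumeration((1, 3), die)
print(tag)  # 5/72
```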
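The bounds of the interval can be computed recursively:
$l^{(k)} = l^{(k-1)} + (u^{(k-1)} - l^{(k-1)}) F_X(x_k - 1)$
$u^{(k)} = l^{(k-1)} + (u^{(k-1)} - l^{(k-1)}) F_X(x_k)$
- $T_X^{(n)}(\mathbf{x}) = \frac{u^{(n)} + l^{(n)}}{2}$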
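As a worked check, the recursion can be applied to the sequence $a_1 a_3$ of Example 3, where $F_X(i) = i/6$. The lower-bound update used here, $l^{(k)} = l^{(k-1)} + (u^{(k-1)} - l^{(k-1)}) F_X(x_k - 1)$, is the standard counterpart of the upper-bound update and is stated as an assumption. The starting values are $l^{(0)} = 0$ and $u^{(0)} = 1$.

```latex
\begin{align*}
l^{(1)} &= 0 + (1 - 0)\,F_X(0) = 0, &
u^{(1)} &= 0 + (1 - 0)\,F_X(1) = \tfrac{1}{6},\\
l^{(2)} &= 0 + \tfrac{1}{6}\,F_X(2) = \tfrac{2}{36}, &
u^{(2)} &= 0 + \tfrac{1}{6}\,F_X(3) = \tfrac{3}{36},\\
T_X^{(2)}(a_1 a_3) &= \frac{u^{(2)} + l^{(2)}}{2} = \frac{5}{72}.
\end{align*}
```

The result agrees with the tag obtained by summing the probabilities of all sequences that precede $a_1 a_3$, but no enumeration of other sequences is needed.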
Deciphering a tag
Mimic the encoder
- Initialize $l^{(0)} = 0$ and $u^{(0)} = 1$
- $k = 1$
- Repeat until the whole sequence has been decoded
  - $t = \dfrac{\text{tag} - l^{(k-1)}}{u^{(k-1)} - l^{(k-1)}}$
  - Find the value of $x_k$ such that $F_X(x_k - 1) \le t < F_X(x_k)$
  - Update $l^{(k)}$ and $u^{(k)}$
  - $k := k + 1$
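The decoding loop can be sketched as follows. Two assumptions are made for this illustration: the decoder is told the sequence length `n`, and letters are represented by their indices 1..m. The function name `decode` is hypothetical. The test data are the tag 0.553 and the probabilities 0.7, 0.1, 0.2 implied by Example 2.

```python
from fractions import Fraction

def decode(tag, probs, n):
    """Recover n letter indices from a tag by mimicking the encoder."""
    # cdf[i] = F_X(i), with cdf[0] = 0.
    cdf = [Fraction(0)]
    for p in probs:
        cdf.append(cdf[-1] + p)

    low, high = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(n):
        # Rescale the tag to its position inside the current interval.
        t = (tag - low) / (high - low)
        # Find x_k with F_X(x_k - 1) <= t < F_X(x_k).
        xk = next(i for i in range(1, len(cdf)) if cdf[i - 1] <= t < cdf[i])
        decoded.append(xk)
        # Update the bounds exactly as the encoder would.
        width = high - low
        low, high = low + width * cdf[xk - 1], low + width * cdf[xk]
    return decoded

probs = [Fraction(7, 10), Fraction(1, 10), Fraction(2, 10)]
print(decode(Fraction(553, 1000), probs, 3))  # [1, 2, 3]
```

In practice the length is either transmitted separately or replaced by an end-of-sequence symbol; the notes do not say which convention is intended.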
Example 6
$I_1 = [2/5, 1]$.
$I_2 = [2/5, 16/25]$.
Codeword length: $\lceil \log_2 (3125/108) \rceil + 1 = 6$.
Codeword: 100100.
No compression!
Practical implementation
Problems that must be resolved
- The values $l^{(n)}$ and $u^{(n)}$ come closer and closer
together as n gets larger.
In a system with finite precision the two values are
bound to converge.
$E_1: [0, 0.5) \to [0, 1)$, $E_1(x) = 2x$
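A common remedy is to rescale the interval whenever it lies entirely in one half of [0, 1), emitting one bit per rescaling. The sketch below uses the $E_1$ mapping given above together with a companion mapping for the upper half, $E_2(x) = 2(x - 0.5)$, which is an assumption here because it does not appear in these notes. The function name `rescale` is also hypothetical, and the case of an interval that straddles 0.5 indefinitely is not handled.

```python
from fractions import Fraction

def rescale(low, high):
    """Emit bits while [low, high) lies wholly in one half of [0, 1)."""
    if not low < high:
        raise ValueError("interval must have positive width")
    half = Fraction(1, 2)
    bits = []
    while True:
        if high <= half:
            # E1: [0, 0.5) -> [0, 1), x -> 2x; emit 0.
            bits.append("0")
            low, high = 2 * low, 2 * high
        elif low >= half:
            # E2 (assumed): [0.5, 1) -> [0, 1), x -> 2(x - 0.5); emit 1.
            bits.append("1")
            low, high = 2 * (low - half), 2 * (high - half)
        else:
            break  # interval straddles 0.5: no bit can be decided yet
    return "".join(bits), low, high

# Final interval [0.546, 0.56) of the sequence a1 a2 a3 (probabilities 0.7, 0.1, 0.2).
bits, low, high = rescale(Fraction(273, 500), Fraction(14, 25))
print(bits, float(low), float(high))  # 10001 0.472 0.92
```

Each doubling widens the interval, so the bounds no longer collapse onto each other as more symbols are encoded.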
Example 5 (continuation)
The average length for this code is
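$H(X) \le l_A < H(X) + \frac{2}{m}$
where m is the number of symbols in the coded sequence.
For an extended Huffman code over blocks of m symbols:
$H(X) \le l_H < H(X) + \frac{1}{m}$
(but the size of the alphabet is $k^m$)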