Lec 5 Data Compression Part3
Fan Cheng
Shanghai Jiao Tong
University
http://www.cs.sjtu.edu.cn/~chengfan/
[email protected]
Spring, 2023
Outline
Kraft inequality
Optimal codes
Huffman coding
Shannon-Fano-Elias coding
Generation of discrete distribution
Universal source coding
Random Variable Generation
We are given a sequence of fair coin tosses Z1, Z2, ... and we wish to generate a random variable X with probability mass function p(x).
Let the random variable T denote the number of coin flips used by the algorithm.
(Lemma). For a complete binary tree, assign to each leaf at depth k the probability 2^{-k}, and let Y be a random variable with this leaf distribution. Then the entropy of Y equals the expected depth of the tree:
H(Y) = E[T]
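A minimal Python sketch of this setup (my own illustration, not from the slides; the tree shape and symbol names are assumed): walk a complete binary tree with fair coin flips, record the number of flips T, and compare the empirical E[T] with H(Y) for a dyadic leaf distribution.

import random
from math import log2

def generate(tree):
    """Descend the tree with fair bits until a leaf (a symbol) is reached."""
    flips, node = 0, tree
    while isinstance(node, tuple):      # internal node: (left subtree, right subtree)
        node = node[random.getrandbits(1)]
        flips += 1
    return node, flips                  # (symbol, number of flips T)

tree = ('a', ('b', 'c'))                # leaves at depths 1, 2, 2, so p = (1/2, 1/4, 1/4)
n = 100_000
avg_T = sum(generate(tree)[1] for _ in range(n)) / n
H = -sum(p * log2(p) for p in (0.5, 0.25, 0.25))
print(avg_T, H)                         # both ≈ 1.5, i.e. E[T] = H(Y)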
Random Variable Generation
(Theorem). For any algorithm generating X, the expected number of fair bits used is greater than or equal to the entropy H(X); that is,
E[T] ≥ H(X)
Random Variable Generation
(Theorem). Let the random variable X have a dyadic distribution. The optimal algorithm to generate X from fair coin flips requires an expected number of coin tosses precisely equal to the entropy:
E[T] = H(X)
For the constructive part, we use the Huffman code tree for X as the tree with which to generate the random variable. Each outcome x will correspond to a leaf.
For a dyadic distribution, the Huffman code is the same as the Shannon code and achieves the entropy bound.
For any x, the depth of the leaf in the code tree corresponding to x is the length of the corresponding codeword, which is log(1/p(x)). Hence, when this code tree is used to generate X, the leaf x will have probability 2^{-log(1/p(x))} = p(x).
The expected number of coin flips is the expected depth of the tree, which is equal to the entropy (because the distribution is dyadic). Hence, for a dyadic distribution, the optimal generating algorithm achieves E[T] = H(X).
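A small numerical check of the constructive argument (my own sketch; the dyadic pmf below is an assumed example): the leaf for x sits at depth log(1/p(x)), so the expected depth of the code tree, and hence E[T], equals H(X).

from math import log2

p = [1/2, 1/4, 1/8, 1/8]                           # a dyadic distribution
depths = [log2(1/px) for px in p]                  # codeword length = leaf depth
expected_depth = sum(px * d for px, d in zip(p, depths))
entropy = -sum(px * log2(px) for px in p)
print(expected_depth, entropy)                     # both equal 1.75, so E[T] = H(X)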
Random Variable Generation
What if the distribution is not dyadic? In this case we cannot use the same idea, since the code tree for the Huffman code would generate a dyadic distribution on the leaves, not the distribution with which we started.
Since all the leaves of the tree have probabilities of the form 2^{-k}, it follows that we should split any probability p(x) that is not of this form into atoms of this form, i.e., expand p(x) in binary. We can then allot these atoms to leaves on the tree.
Figure: a tree that generates such a distribution, with each dyadic atom of p(x) assigned to a leaf at the corresponding depth.
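A rough Python sketch of the splitting idea (my own construction, not the slides' figure; the (2/3, 1/3) pmf is an assumed example): each fair bit halves a dyadic interval inside [0, 1), and we stop as soon as that interval fits inside one symbol's slice, which corresponds to reaching a leaf assigned to one of that symbol's dyadic atoms.

import random
from fractions import Fraction

def generate(pmf):
    """Generate a symbol from pmf (a list of (symbol, probability) pairs)
    using fair bits; return (symbol, number of flips T)."""
    slices, acc = [], Fraction(0)
    for symbol, prob in pmf:                  # partition [0, 1) into one slice per symbol
        slices.append((symbol, acc, acc + prob))
        acc += prob

    low, width, flips = Fraction(0), Fraction(1), 0
    while True:
        width /= 2                            # one more fair bit: halve the dyadic interval
        if random.getrandbits(1):
            low += width
        flips += 1
        for symbol, lo, hi in slices:
            if lo <= low and low + width <= hi:   # interval is one of the symbol's atoms
                return symbol, flips

pmf = [('a', Fraction(2, 3)), ('b', Fraction(1, 3))]     # 2/3 = 0.1010..., 1/3 = 0.0101... in binary
samples = [generate(pmf) for _ in range(50_000)]
print(sum(s == 'a' for s, _ in samples) / len(samples))  # ≈ 2/3
print(sum(t for _, t in samples) / len(samples))         # ≈ 2 here, within 2 bits of H(X) ≈ 0.918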
Lempel-Ziv Coding: Sliding Window
Example: 0101010101010101011010101010101101, window size W = 7
Find the maximum repeated substring inside the window (a search sketch follows the two examples).
Example: 0101010101010101011010101010101101, window size W = 6
Find the maximum repeated substring inside the window.
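A small search sketch for the step described above (my own code; the position pos below is only illustrative, since the highlighted positions of the slide are not reproduced here): find the longest prefix of the upcoming symbols that also starts within the last W symbols.

def longest_match(text, pos, W):
    """Return (offset, length) of the longest match for text[pos:] whose
    start lies in the window text[max(0, pos - W):pos]."""
    best_off, best_len = 0, 0
    for i in range(max(0, pos - W), pos):
        length = 0
        while pos + length < len(text) and text[i + length] == text[pos + length]:
            length += 1                  # the match may run past pos (overlap is allowed)
        if length > best_len:
            best_off, best_len = pos - i, length
    return best_off, best_len

s = "0101010101010101011010101010101101"
print(longest_match(s, 19, 7))           # (7, 6): copy 6 symbols starting 7 positions back
print(longest_match(s, 19, 6))           # (5, 4): with the smaller window, only 4 symbols match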
Example: sliding-window parse of ABBABBABBBAABABA. Each step shows the phrases parsed so far, then the unparsed remainder after "|" (a parsing sketch follows the steps):
ABBABBABBBAABABA
A | BBABBABBBAABABA
A, B | BABBABBBAABABA
A, B, B | ABBABBBAABABA
A, B, B, ABBABB | BAABABA
A, B, B, ABBABB, BA | ABABA
A, B, B, ABBABB, BA, A, BA | BA
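A sketch that reproduces the parse above (my own code; the window size W = 4 is an inference from the steps shown, not a value stated here): each phrase is the longest string whose match starts within the previous W symbols, or a single symbol if there is no match.

def sliding_window_parse(s, W):
    """Greedy sliding-window Lempel-Ziv parse."""
    phrases, pos = [], 0
    while pos < len(s):
        best = 1                                      # at least one symbol per phrase
        for i in range(max(0, pos - W), pos):         # candidate match starts in the window
            length = 0
            while pos + length < len(s) and s[i + length] == s[pos + length]:
                length += 1
            best = max(best, length)
        phrases.append(s[pos:pos + best])
        pos += best
    return phrases

print(sliding_window_parse("ABBABBABBBAABABA", W=4))
# ['A', 'B', 'B', 'ABBABB', 'BA', 'A', 'BA', 'BA']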
Tree-structured Lempel-Ziv parse of ABBABBABBBAABABAA: each new phrase is the shortest string not parsed before.
A, B, BA | BBABBBAABABAA
A, B, BA, BB | ABBBAABABAA
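A companion sketch for the tree-structured parse above (my own code): each new phrase is the shortest string that has not yet appeared as a phrase, i.e. a previously seen phrase extended by one symbol.

def lz78_parse(s):
    """Tree-structured (LZ78) parse into distinct phrases; the final phrase
    may repeat an earlier one if the string ends mid-extension."""
    phrases, seen, pos = [], set(), 0
    while pos < len(s):
        length = 1
        while s[pos:pos + length] in seen and pos + length < len(s):
            length += 1                   # extend until the candidate phrase is new
        phrase = s[pos:pos + length]
        phrases.append(phrase)
        seen.add(phrase)
        pos += length
    return phrases

print(lz78_parse("ABBABBABBBAABABAA"))
# ['A', 'B', 'BA', 'BB', 'AB', 'BBA', 'ABA', 'BAA']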