Shannon entropy
1. Quantifying information
Computers have popularized the notion of the bit, a unit of information that takes two values, 0 or 1. We introduce the information size $H_0(A)$ of a set $A$ as the number of bits necessary to encode each element of $A$ separately, i.e.
\[ H_0(A) = \log_2 |A|. \tag{6.1} \]
This quantity has a unit, the bit. If we have two sets $A$ and $B$, then
\[ H_0(A \times B) = H_0(A) + H_0(B). \tag{6.2} \]
This justifies the logarithm. The information size of a set is not necessarily an integer. If we need to encode the elements of $A$, the number of necessary bits is $\lceil H_0(A) \rceil$ rather than $H_0(A)$; but this is irrelevant for the theory.
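As a quick numerical illustration (not part of the original notes), the following Python snippet evaluates (6.1) for two hypothetical sets and checks the additivity (6.2); the set sizes are chosen arbitrarily.

    import math

    def information_size(n_elements: int) -> float:
        """Information size H0 of a set with n_elements elements, in bits."""
        return math.log2(n_elements)

    # Hypothetical example: A = 26 Latin letters, B = 10 decimal digits.
    H0_A = information_size(26)
    H0_B = information_size(10)
    H0_AxB = information_size(26 * 10)          # |A x B| = |A| |B|

    print(math.isclose(H0_AxB, H0_A + H0_B))    # True: H0(A x B) = H0(A) + H0(B)
    print(math.ceil(H0_A))                      # bits actually needed for A: 5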
The ideas above are natural, and anybody might have invented the concept of information size. The next notion, the information gain, is also intuitive, but it took a genius to define and quantify it.
Suppose you need to uncover a certain English word of five letters. You manage to obtain one letter, namely an e. This is useful information, but the letter e is common in English, so it provides little information. If, on the other hand, the letter that you discover is a j (the least common letter in English), the search has been narrowed further and you have obtained more information. The information gain quantifies this heuristic.
We need to introduce a relevant formalism. Let $A = (A, p)$ be a discrete probability space. That is, $A = \{a_1, \dots, a_n\}$ is a finite set, and each element $a_i$ has probability $p_i$. (The $\sigma$-algebra is the set of all subsets of $A$.) The information gain $G(B|A)$ measures the gain obtained by the knowledge that the outcome belongs to the set $B \subset A$. We denote $p(B) = \sum_{i \in B} p_i$.
Definition 6.1. The information gain is
\[ G(B|A) = \log_2 \frac{1}{p(B)} = -\log_2 p(B). \]
The information gain is positive, and it satisfies the following additivity property. Let $B \subset C \subset A$. The gain for knowing that the outcome is in $C$ is $G(C|A) = -\log_2 p(C)$. The gain for knowing that it is in $B$, after knowing that it is in $C$, is
\[ G(B|C) = -\log_2 p(B|C) = -\log_2 \frac{p(B)}{p(C)}. \tag{6.3} \]
It follows that $G(B|A) = G(C|A) + G(B|C)$, as it should be.
The unit for the information gain is the bit. We gain 1 bit if $p(B) = \frac{1}{2}$.
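As a minimal sketch (with a made-up four-point probability space), the following Python lines compute the information gain of Definition 6.1 and check the additivity $G(B|A) = G(C|A) + G(B|C)$ together with (6.3).

    import math

    # Hypothetical probability space A = {a1, a2, a3, a4}.
    p = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}

    def prob(subset):
        """p(B): sum of the probabilities of the elements of B."""
        return sum(p[a] for a in subset)

    def gain(B):
        """Information gain G(B|A) = -log2 p(B), in bits."""
        return -math.log2(prob(B))

    B = {"a3"}                                    # B subset of C subset of A
    C = {"a3", "a4"}

    G_B_A = gain(B)                               # -log2(0.125) = 3 bits
    G_C_A = gain(C)                               # -log2(0.25)  = 2 bits
    G_B_C = -math.log2(prob(B) / prob(C))         # conditional gain, eq. (6.3)

    print(math.isclose(G_B_A, G_C_A + G_B_C))     # True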
2. Shannon entropy
It is named after Shannon, although its origin goes back to Pauli and von Neumann.
Definition 6.2. The Shannon entropy of $A$ is
\[ H(A) = -\sum_{i=1}^n p_i \log_2 p_i. \]
Definition 6.3. The relative entropy of the probability $p$ with respect to the probability $q$ is
\[ H(p|q) = \sum_{i=1}^n p_i \log_2 \frac{p_i}{q_i}. \]
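For concreteness, here is a short Python sketch (with made-up distributions $p$ and $q$) that evaluates Definitions 6.2 and 6.3; terms with $p_i = 0$ are conventionally omitted.

    import math

    def shannon_entropy(p):
        """H(A) = -sum_i p_i log2 p_i (terms with p_i = 0 contribute nothing)."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def relative_entropy(p, q):
        """H(p|q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 whenever p_i > 0."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]     # hypothetical distribution
    q = [0.25, 0.25, 0.25, 0.25]      # uniform reference distribution

    print(shannon_entropy(p))         # 1.75 bits
    print(relative_entropy(p, q))     # 0.25 bits; nonnegative, as used later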
4. Shannon theorem
A basic problem in information theory deals with encoding large quantities of information. We start with a finite set $A$, which may be the 26 letters of the Latin alphabet, the 128 ASCII symbols, or a larger set of words. We consider a file that contains $N$ symbols from $A$, with $N$ large. How many bits are required so that the file can be encoded without loss of information? The answer is given by the information size, $H_0(A^N) = N H_0(A)$.
The question becomes more interesting, and the answer more surprising, if we allow an error $\delta$. We now seek to encode only the files that fall in a set $B \subset A^N$ such that $p(B) > 1-\delta$; if a file turns out to be in $A^N \setminus B$, the information is lost. The information size is now given by $H_\delta(A^N)$, where, for a general probability space $A$,
\[ H_\delta(A) = \inf_{\substack{B \subset A \\ p(B) > 1-\delta}} \log_2 |B|. \tag{6.9} \]
The theorem says that if we allow for a tiny error, and if our message is long (how long depends on the error), the number of required bits is roughly $N H(A)$: for every fixed $0 < \delta < 1$,
\[ \lim_{N\to\infty} \frac{1}{N} H_\delta(A^N) = H(A). \]
Notice that the limit in the theorem is a true limit, not a $\liminf$ or a $\limsup$. Thus the Shannon entropy gives the optimal compression rate, which can be approached but not improved.
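The statement can be explored numerically for a small alphabet. The sketch below (an illustration added here, with an arbitrary biased two-letter alphabet) estimates $H_\delta(A^N)/N$ by keeping the most probable sequences until their total weight exceeds $1-\delta$, and compares the result with $H(A)$; convergence is slow for such small $N$.

    import math
    from itertools import product

    p = {"a": 0.9, "b": 0.1}                            # hypothetical biased alphabet
    H = -sum(pi * math.log2(pi) for pi in p.values())   # H(A) ~ 0.469 bits

    def H_delta(N, delta):
        """log2 of the smallest number of N-symbol files of total probability > 1 - delta."""
        probs = sorted((math.prod(p[s] for s in w) for w in product(p, repeat=N)),
                       reverse=True)
        mass, count = 0.0, 0
        for q in probs:
            mass += q
            count += 1
            if mass > 1 - delta:
                break
        return math.log2(count)

    for N in (5, 10, 15, 20):
        # the ratio decreases slowly toward H(A) as N grows
        print(N, H_delta(N, delta=0.01) / N, H)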
2 Ch.-É. Pfister, Thermodynamical aspects of classical lattice systems, in: In and out of equilibrium: Physics with a probability flavor, Progr. Probab. 51, Birkhäuser (2002).
Proof. It is based on the (weak) law of large numbers. Consider the random variable $-\log_2 p(a)$. The law of large numbers states that, for any $\varepsilon > 0$,
\[ \lim_{N\to\infty} \mathrm{Prob}\Bigl\{ (a_1,\dots,a_N) : \Bigl| \underbrace{-\tfrac{1}{N} \sum_{i=1}^N \log_2 p(a_i)}_{-\frac{1}{N}\log_2 p(a_1,\dots,a_N)} - \underbrace{E\bigl(-\log_2 p(a)\bigr)}_{H(A)} \Bigr| > \varepsilon \Bigr\} = 0. \tag{6.10} \]
There exists therefore a set $A_{N,\varepsilon} \subset A^N$ such that $\lim_N p(A_{N,\varepsilon}) = 1$, and such that any $(a_1,\dots,a_N) \in A_{N,\varepsilon}$ satisfies
\[ 2^{-N(H(A)+\varepsilon)} \le p(a_1,\dots,a_N) \le 2^{-N(H(A)-\varepsilon)}, \]
so that $|A_{N,\varepsilon}| \le 2^{N(H(A)+\varepsilon)}$. For any $\delta > 0$, we can choose $N$ large enough so that $p(A_{N,\varepsilon}) > 1-\delta$. Then
\[ H_\delta(A^N) \le \log_2 |A_{N,\varepsilon}| \le N\bigl(H(A)+\varepsilon\bigr). \]
It follows that
\[ \limsup_{N\to\infty} \frac{1}{N} H_\delta(A^N) \le H(A). \tag{6.14} \]
For the lower bound, let $B_{N,\delta}$ be the minimizer for $H_\delta$; that is, $p(B_{N,\delta}) > 1-\delta$ and
\[ H_\delta(A^N) = \log_2 |B_{N,\delta}| \ge \log_2 \bigl|B_{N,\delta} \cap A_{N,\varepsilon}\bigr|. \tag{6.15} \]
We need a lower bound for the latter term.
\[ 1-\delta \le \underbrace{p\bigl(B_{N,\delta} \cap A_{N,\varepsilon}\bigr)}_{\le |B_{N,\delta} \cap A_{N,\varepsilon}|\, 2^{-N(H(A)-\varepsilon)}} + \underbrace{p\bigl(B_{N,\delta} \cap A_{N,\varepsilon}^{\mathrm c}\bigr)}_{\le \delta \text{ if } N \text{ large}}. \tag{6.16} \]
Then
\[ \bigl|B_{N,\delta} \cap A_{N,\varepsilon}\bigr| \ge (1-2\delta)\, 2^{N(H(A)-\varepsilon)}. \tag{6.17} \]
We obtain
\[ \frac{1}{N} H_\delta(A^N) \ge \frac{1}{N} \log_2(1-2\delta) + H(A) - \varepsilon. \tag{6.18} \]
This gives the desired bound for the lim inf, and Shannon’s theorem follows.
It is instructive to quantify the $\varepsilon$'s and $\delta$'s of the proof. Invoking the central limit theorem instead of the law of large numbers, we get that
\[ p\bigl(A_{N,\varepsilon}^{\mathrm c}\bigr) \approx \frac{2}{\sqrt{2\pi}} \int_{\sqrt{N}\varepsilon/\sigma}^{\infty} e^{-\frac{1}{2}t^2}\, dt \approx e^{-N\varepsilon^2/2\sigma^2} \approx \delta \]
($\sigma^2$ is the variance of the random variable $-\log_2 p(a)$). This shows that $\varepsilon \approx N^{-1/2}$ and $\delta \approx e^{-N}$. It is surprising that $\delta$ can be so tiny, and yet makes such a difference!
Any fixed-length, uniquely decodable code is also a prefix code. Prefix codes can be represented by trees; see Fig. 6.1.
[Figure 6.1. Tree representations for the prefix codes $c^{(3)}$ and $c^{(4)}$. The codewords are $c^{(3)}\colon a_1 \mapsto 00$, $a_2 \mapsto 01$, $a_3 \mapsto 10$, $a_4 \mapsto 11$, and $c^{(4)}\colon a_1 \mapsto 0$, $a_2 \mapsto 10$, $a_3 \mapsto 110$, $a_4 \mapsto 111$.]
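To make the prefix property concrete, the following Python sketch (an illustration, not from the notes) decodes a bit string with the code $c^{(4)}$ of Fig. 6.1 by reading bits left to right; since no codeword is a prefix of another, the decoder never needs to backtrack.

    # The code c(4) of Fig. 6.1.
    c4 = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
    decode_table = {w: a for a, w in c4.items()}

    def decode(bits: str):
        """Greedy left-to-right decoding; correct because c4 is a prefix code."""
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in decode_table:       # a complete codeword has been read
                out.append(decode_table[buf])
                buf = ""
        if buf:
            raise ValueError("bit string does not end on a codeword boundary")
        return out

    message = ["a2", "a1", "a4", "a3"]
    bits = "".join(c4[a] for a in message)    # "10" + "0" + "111" + "110"
    print(decode(bits) == message)            # True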
We set $\ell_{\min} = \min_{a \in A} \ell(a)$, and similarly for $\ell_{\max}$. Since the code is one-to-one, the number $\#\{\cdot\}$ above is no more than $2^L$. Then
\[ \Bigl( \sum_{a \in A} 2^{-\ell(a)} \Bigr)^{\!N} \le \sum_{L=N\ell_{\min}}^{N\ell_{\max}} 2^{-L}\, 2^{L} = N(\ell_{\max}-\ell_{\min}) + 1. \tag{6.21} \]
Since the right side grows linearly in $N$, the left side cannot grow exponentially with $N$; hence the sum $\sum_{a \in A} 2^{-\ell(a)}$ must be less than or equal to 1.
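As a quick numerical check (in the same illustrative setting as above), the codeword lengths of $c^{(3)}$ and $c^{(4)}$ from Fig. 6.1 indeed satisfy the Kraft inequality.

    # Codeword lengths of the prefix codes c(3) and c(4) of Fig. 6.1.
    lengths = {"c3": [2, 2, 2, 2], "c4": [1, 2, 3, 3]}

    for name, ls in lengths.items():
        kraft_sum = sum(2 ** (-l) for l in ls)
        print(name, kraft_sum, kraft_sum <= 1)   # both sums equal 1, so Kraft holds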
The second claim can be proved, e.g., by giving an explicit construction of the prefix code, given lengths that satisfy the Kraft inequality. It is left as an exercise.
3 The word “algorithm” derives from Abu Ja’far Muhammad ibn Musa Al-Khwarizmi, who was born in Baghdad around 780 and who died around 850.
Proof. For the lower bound, consider a code $c$ with lengths $\ell_i = \ell(c(a_i))$. Define $q_i = \frac{2^{-\ell_i}}{z}$, with $z = \sum_j 2^{-\ell_j}$. We have
\[ L(A,c) = \sum_{i=1}^n p_i \ell_i = -\sum_{i=1}^n p_i \log_2 q_i - \log_2 z \ge -\sum_{i=1}^n p_i \log_2 p_i = H(A). \tag{6.22} \]
The inequality holds because of the positivity of the relative entropy, Proposition 6.3, and because $z \le 1$ (the Kraft inequality).
For the upper bound, define $\ell_i = \lceil -\log_2 p_i \rceil$ (the smallest integer greater than or equal to $-\log_2 p_i$). Then
\[ \sum_{i=1}^n 2^{-\ell_i} \le \sum_{i=1}^n p_i = 1. \tag{6.23} \]
This shows that the Kraft inequality is verified, so there exists a prefix code $c$ with these lengths. The expected length is easily estimated:
\[ L(A,c) = \sum_{i=1}^n p_i \lceil -\log_2 p_i \rceil \le \sum_{i=1}^n p_i \bigl(-\log_2 p_i + 1\bigr) = H(A) + 1. \tag{6.24} \]
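The upper-bound construction can be spelled out numerically. The sketch below (with a made-up distribution) takes $\ell_i = \lceil -\log_2 p_i \rceil$, checks the Kraft inequality as in (6.23), and verifies $H(A) \le L(A,c) \le H(A) + 1$ as in (6.22) and (6.24).

    import math

    p = [0.4, 0.3, 0.2, 0.1]                            # hypothetical distribution
    lengths = [math.ceil(-math.log2(pi)) for pi in p]   # l_i = ceil(-log2 p_i)

    kraft = sum(2 ** (-l) for l in lengths)             # <= sum_i p_i = 1, eq. (6.23)
    H = -sum(pi * math.log2(pi) for pi in p)            # Shannon entropy H(A)
    L = sum(pi * li for pi, li in zip(p, lengths))      # expected codeword length

    print(lengths, kraft)          # [2, 2, 3, 4], Kraft sum 0.6875 <= 1
    print(H, L, H <= L <= H + 1)   # H ~ 1.846, L = 2.4, True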
Exercise 6.3. Recall the definitions of the codes $c^{(1)}$, $c^{(2)}$, $c^{(3)}$, and $c^{(4)}$. Explain why $c^{(1)}$ is undecodable; $c^{(2)}$ is uniquely decodable but not prefix; $c^{(3)}$ and $c^{(4)}$ are prefix codes.
Exercise 6.4. Given lengths $\{\ell(a)\}$ satisfying the Kraft inequality, show the existence of a corresponding prefix code.