Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
H(X) = -\sum_{x \in X} p(x) \log p(x)
The entropy of X can also be interpreted as the expected value of the random variable \log(1/p(X)), where X is drawn according to the mass function p(x). Thus
H(X) = E_p[\log(1/p(X))]
Properties of H
1. H(X) \ge 0
2. H_b(X) = (\log_b a) H_a(X)
3. (Conditioning reduces entropy) For any two random variables X and Y, we have H(X|Y) \le H(X), with equality iff X and Y are independent.
4. H(X_1, X_2, ..., X_n) \le \sum_{i=1}^n H(X_i), with equality if and only if the X_i are independent.
5. H(X) \le \log|X|, with equality if and only if X is distributed uniformly over X.
6. H(p) is concave in p.
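As a quick illustration, a minimal Python sketch (the helper function and example distribution are my own, not from the book) that computes H(X) in bits and checks properties 1 and 5 above:

import math

def entropy(p):
    """Entropy in bits of a probability mass function given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Properties 1 and 5: H(X) >= 0, and H(X) <= log|X| with equality for the uniform pmf.
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))              # 1.75 bits
print(entropy([0.25] * 4))     # 2.0 bits = log2(4), the maximum for |X| = 4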
A function is convex if it always lies below any
chord.
Definition: A function f(x) is said to be convex over an interval (a,b) if for every x_1, x_2 \in (a,b) and 0 \le \lambda \le 1,
f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)
A function is said to be strictly convex if equality holds only for \lambda = 0 or \lambda = 1.
Relative entropy is a measure of the distance
between two distributions; it is a measure of the
inefficiency of assuming that the distribution is q
when the true distribution is p. The relative entropy
or Kullback-Leibler distance is not a true distance
between distributions since it is not symmetric and
does not satisfy the triangle inequality.
Definition: The relative entropy D(p||q) of the
probability mass function p with respect to the
probability mass function q is defined by:
D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
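A minimal Python sketch of the same quantity (function name and example distributions are my own); it also shows that D is not symmetric and vanishes only when p = q:

import math

def kl_divergence(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.75, 0.25]
print(kl_divergence(p, q))   # ~0.2075 bits
print(kl_divergence(q, p))   # ~0.1887 bits: D is not symmetric
print(kl_divergence(p, p))   # 0.0: zero iff the distributions are equal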
Mutual Information is a measure of the amount of
information that one random variable contains about
another random variable; it is a reduction in the
uncertainty of one random variable due to the
knowledge of another.
Definition: The mutual information between two
random variables X and Y is defined as:
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
The mutual information of a random variable with itself
is the entropy of the random variable. This is the
reason that entropy is sometimes referred to as self-
information.
Alternative expressions:
H(X) = E_p[\log(1/p(X))]
H(X,Y) = E_p[\log(1/p(X,Y))]
H(X|Y) = E_p[\log(1/p(X|Y))]
I(X;Y) = E_p[\log \frac{p(X,Y)}{p(X) p(Y)}]
D(p||q) = E_p[\log \frac{p(X)}{q(X)}]
Properties of D and I
I(X;Y) = H(X) - H(X|Y)
       = H(Y) - H(Y|X)
       = H(X) + H(Y) - H(X,Y)
D(p||q) \ge 0, with equality if and only if p(x) = q(x) for all x \in X
I(X;Y) = D(p(x,y) || p(x)q(y)) \ge 0, with equality iff p(x,y) = p(x)p(y) (i.e. X and Y are independent)
If |X| = m, and u is the uniform distribution over X, then D(p||u) = \log m - H(p)
D(p||q) is convex in the pair (p,q)
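A minimal Python sketch (with a made-up 2x2 joint distribution) checking two of the identities above, I(X;Y) = H(X) + H(Y) - H(X,Y) and I(X;Y) = D(p(x,y) || p(x)q(y)):

import math

def H(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
p_x = [sum(row) for row in p_xy]              # marginal of X
p_y = [sum(col) for col in zip(*p_xy)]        # marginal of Y

# I(X;Y) = H(X) + H(Y) - H(X,Y)
joint = [v for row in p_xy for v in row]
I = H(p_x) + H(p_y) - H(joint)

# I(X;Y) = D(p(x,y) || p(x)p(y))
D = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
        for i in range(2) for j in range(2))

print(I, D)   # the two values agree (~0.278 bits)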
Chain Rules
Entropy:
H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1)
Mutual Information:
I(X_1, X_2, ..., X_n ; Y) = \sum_{i=1}^n I(X_i ; Y | X_1, X_2, ..., X_{i-1})
Relative entropy:
D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))
Jensen's inequality. Jensen's inequality is one of the most widely used inequalities in mathematics and one that underlies many of the basic results in information theory. If f is a convex function, then
E[f(X)] \ge f(E[X])
Log sum inequality. For n positive numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,
\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \ge (\sum_{i=1}^n a_i) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}
with equality iff a_i / b_i = constant.
Data processing inequality. The data-processing inequality can be used to show that no clever manipulation of the data can improve the inferences that can be made from the data. If X \to Y \to Z forms a Markov chain,
I(X;Y) \ge I(X;Z)
Sufficient Statistic. A statistic T(X) is called sufficient for \theta if it contains all the information in X about \theta.
T(X) is sufficient relative to \theta iff I(\theta;X) = I(\theta;T(X)) for all distributions on \theta.
A statistic T(X) is a minimal sufficient statistic relative to {f_\theta(x)} if it is a function of every other sufficient statistic U.
Fano's inequality. Relates the probability of error in guessing the random variable X to its conditional entropy H(X|Y). Let P_e = Pr{\hat{X}(Y) \ne X}. Then
H(P_e) + P_e \log|X| \ge H(X|Y)
Inequality. If X and X' are independent and identically distributed, then
Pr(X = X') \ge 2^{-H(X)}
Chapter 3. Asymptotic Equipartition Property.
AEP: Almost all events are almost equally surprising. Specifically, if X_1, X_2, ... are i.i.d. ~ p(x), then
-\frac{1}{n} \log p(X_1, X_2, ..., X_n) \to H(X)
in probability.
Definition: The sequence X_1, X_2, ... converges to a random variable X:
In probability if for every \epsilon > 0, Pr{|X_n - X| > \epsilon} \to 0
In mean square if E(X_n - X)^2 \to 0
With probability 1 if Pr{\lim_{n \to \infty} X_n = X} = 1
Definition: The typical set A_\epsilon^{(n)} is the set of sequences x_1, x_2, ..., x_n satisfying
2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, ..., x_n) \le 2^{-n(H(X)-\epsilon)}
Properties of the typical set.
If (x_1, x_2, ..., x_n) \in A_\epsilon^{(n)}, then p(x_1, x_2, ..., x_n) = 2^{-n(H \pm \epsilon)}
Pr{A_\epsilon^{(n)}} > 1 - \epsilon for n sufficiently large
|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}, where |A| denotes the number of elements in set A
Thus the typical set has probability nearly 1, all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly 2^{nH}.
Definition: a_n \doteq b_n means that
\frac{1}{n} \log \frac{a_n}{b_n} \to 0 as n \to \infty
Smallest probable set. Let X_1, X_2, ..., X_n be i.i.d. ~ p(x), and for \delta < 1/2, let B_\delta^{(n)} \subseteq X^n be the smallest subset such that
Pr{B_\delta^{(n)}} \ge 1 - \delta
Then
|B_\delta^{(n)}| \doteq 2^{nH}
The typical set has essentially the same number of elements as the smallest set, to first order in the exponent.
Note that for a binary random variable with Pr(0)=0.1,
Pr(1)=0.9, the typical sequences will have proportions
close to 1:9 but this does not include the most likely
single sequence of straight 1's.
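A minimal Python sketch of this example (block length, epsilon and source parameters are my own choices). Because p(x^n) depends only on the number of 1s, the typical set can be enumerated by that count:

import math
from math import comb

p1, n, eps = 0.9, 100, 0.1
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))       # ~0.469 bits

def per_symbol_surprise(k):
    """-(1/n) log2 p(x^n) for a sequence containing k ones."""
    return -(k * math.log2(p1) + (n - k) * math.log2(1 - p1)) / n

typical_counts = [k for k in range(n + 1)
                  if abs(per_symbol_surprise(k) - H) <= eps]

size = sum(comb(n, k) for k in typical_counts)
prob = sum(comb(n, k) * p1**k * (1 - p1)**(n - k) for k in typical_counts)

print(typical_counts)                # counts of 1s near 90, i.e. proportions close to 1:9
print(size, 2.0 ** (n * (H + eps)))  # |A| stays below 2^{n(H+eps)}, far smaller than 2^n
print(prob)                          # ~0.76 at this n; tends to 1 as n grows
print(n in typical_counts)           # False: the single most likely sequence (all 1s) is not typical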
Chapter 4 Entropy Rates of a Stochastic
Process.
A stochastic process is said to be stationary if the
joint distribution of any subset of the sequence of
random variables is invariant with respect to shifts
in the time index.
A Markov process is a stochastic process in
which each random variable depends only on the
one preceding it and is conditionally independent
of all the other preceding random variables.
A Markov chain is said to be time invariant if the
conditional probability does not depend on n the
position in the chain.
A stationary distribution \mu on the states of a Markov chain is one that is unchanged from time n to time n+1:
\mu_j = \sum_i \mu_i P_{ij}
Entropy Rate. Two definitions of entropy rate for a
stochastic process are:
H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, ..., X_n)
H'(X) = \lim_{n \to \infty} H(X_n | X_{n-1}, X_{n-2}, ..., X_1)
The first is the per symbol entropy of the n random
variables, and the second is the conditional
entropy of the last random variable given the past.
For a stationary stochastic process:
H(X) = H'(X)
Entropy rate of a stationary Markov chain
H(X) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}
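A minimal Python sketch (a two-state chain of my own choosing) that finds the stationary distribution and evaluates the entropy rate formula above:

import math

P = [[0.9, 0.1],        # transition matrix P[i][j] = Pr(next = j | current = i)
     [0.4, 0.6]]

# Stationary distribution of a two-state chain: mu = (P10, P01) / (P01 + P10)
p01, p10 = P[0][1], P[1][0]
mu = [p10 / (p01 + p10), p01 / (p01 + p10)]       # [0.8, 0.2]

H_rate = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
              for i in range(2) for j in range(2) if P[i][j] > 0)
print(H_rate)   # ~0.57 bits per symbol, a weighted average of the row entropies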
Second law of thermodynamics. For a Markov
chain:
1. Relative entropy D(\mu_n || \mu'_n) decreases with time
2. Relative entropy D(\mu_n || \mu) between a distribution and the stationary distribution decreases with time
3. Entropy H(Xn) increases if the stationary
distribution is uniform.
4. The conditional entropy H(Xn|X1) increases
with time for a stationary Markov chain.
5. The conditional entropy H(X0|Xn) of the initial
condition X0 increases for any Markov chain.
Functions of a Markov chain. If X_1, X_2, ..., X_n form a stationary Markov chain and Y_i = \phi(X_i), then (with H(Y) denoting the entropy rate of {Y_i}):
H(Y_n | Y_{n-1}, ..., Y_1, X_1) \le H(Y) \le H(Y_n | Y_{n-1}, ..., Y_1)
\lim_{n \to \infty} H(Y_n | Y_{n-1}, ..., Y_1, X_1) = H(Y) = \lim_{n \to \infty} H(Y_n | Y_{n-1}, ..., Y_1)
Chapter 5 Data Compression.
A code is nonsingular if every element of the range of X maps onto a different string in D*. However, a delimiter is then required to transmit a sequence of codewords.
The extension C* of a code C is the mapping from finite-length strings of X to finite-length strings of D obtained by concatenating the codewords of the individual symbols.
A code is called uniquely decodable if its
extension is non-singular (ie every encoded string
in a uniquely decodable code has only one
possible source string producing it)
A code is called a prefix code or an
instantaneous code if no codeword is a prefix of
any other codeword.
Kraft inequality. The set of codeword lengths
possible for instantaneous codes is limited by:
For instantaneous codes:
\sum_i D^{-l_i} \le 1
A probability distribution is called D-adic if each of
the probabilities is equal to D^{-n} for some n.
McMillan inequality. The set of codeword lengths
possible for uniquely decodable codes is limited
by:
For uniquely decodable codes:
\sum_i D^{-l_i} \le 1
Entropy bound on data compression:
L = \sum_i p_i l_i \ge H_D(X)
Shannon code:
l_i = \lceil \log_D \frac{1}{p_i} \rceil
H_D(X) \le L \le H_D(X) + 1
where \lceil x \rceil is the smallest integer \ge x.
Huffman code: The Huffman code is constructed by recursively combining the two symbols with the lowest probability into a single source symbol, until only one symbol remains, then assigning codewords to the symbols.
Codeword  X  Probability
01        1  0.25  0.3   0.45  0.55  1
10        2  0.25  0.25  0.3   0.45
11        3  0.2   0.25  0.25
000       4  0.15  0.2
001       5  0.15
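A minimal Python sketch of the construction using a heap, run on the five-symbol example above; this is the generic textbook procedure, not code from the book, and the particular codewords it produces may differ from the table while having the same lengths:

import heapq

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    heap = [[p, i, {s: ""}] for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)     # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, [p1 + p2, count, merged])
        count += 1
    return heap[0][2]

probs = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
code = huffman(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code)   # a prefix code with lengths 2,2,2,3,3
print(L)      # expected length 2.3 bits, within 1 bit of H(X) ~ 2.285 bits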
Optimal code length:
L^* = \min_{\sum_i D^{-l_i} \le 1} \sum_i p_i l_i
H_D(X) \le L^* \le H_D(X) + 1
Huffman coding:
is optimal
is equivalent to 20 questions
can code weighted codewords
can be used for slice codes (alphabetic codes)
can be sub-optimal if the codeword lengths \lceil \log \frac{1}{p_i} \rceil are used instead (Shannon coding)
is similar to Fano coding, which divides the symbols into groups of nearly equal combined probability
Wrong code: If X ~ p(x) and l(x) = \lceil \log \frac{1}{q(x)} \rceil, then L = \sum_x p(x) l(x) satisfies
H(p) + D(p||q) \le L \le H(p) + D(p||q) + 1
Shannon-Fano-Elias coding assigns codewords to the midpoint of each discrete step in the cumulative distribution function. Unfortunately, direct S-F-E coding of blocks requires calculations that grow exponentially with block length and arithmetic precision that grows linearly. Arithmetic coding is an extension which resolves these issues.
Generating random variables: The number of fair coin flips required to generate a random variable X drawn according to a specified probability mass function is approximately equal to the entropy.
Stochastic processes: the minimum expected codeword length per symbol L_n for a block of n symbols satisfies
\frac{H(X_1, X_2, ..., X_n)}{n} \le L_n \le \frac{H(X_1, X_2, ..., X_n)}{n} + \frac{1}{n}
For stationary processes, L_n \to H(X), the entropy rate.
Competitive optimality. The Shannon code l(x) = \lceil \log \frac{1}{p(x)} \rceil versus any other code l'(x):
Pr(l(X) \ge l'(X) + c) \le \frac{1}{2^{c-1}}
Chapter 6 Gambling and Data Compression.
Doubling Rate: The doubling rate is the rate at which wealth grows when outcome k has probability p_k, the fraction of wealth bet on outcome k is b_k, and the wealth factor when outcome k occurs is b_k o_k (odds of o_k-for-1):
W(b,p) = E(\log S(X)) = \sum_{k=1}^m p_k \log(b_k o_k)
Optimal doubling rate: W*(p) = \max_b W(b,p)
Proportional gambling is log optimal: the optimal doubling rate
W*(p) = \max_b W(b,p) = \sum_i p_i \log o_i - H(p)
is achieved by betting in proportion to the outcome probabilities, b* = p.
Growth rate. Wealth grows as S_n \doteq 2^{n W*(p)}
The doubling rate is equal to the difference between the distance of the bookie's estimate from the true distribution and the distance of the gambler's estimate from the true distribution:
W(b,p) = D(p||r) - D(p||b)
Conservation law. For uniform odds, H(p) + W*(p)
= log m
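A minimal Python sketch (a three-horse race of my own invention) evaluating W(b,p), showing that betting b = p beats a uniform bet, and checking the conservation law for uniform odds:

import math

p = [0.5, 0.3, 0.2]          # true win probabilities
o = [3.0, 3.0, 3.0]          # uniform 3-for-1 odds (m = 3 horses)

def W(b, p, o):
    """Doubling rate: expected log wealth growth per race, in bits."""
    return sum(pk * math.log2(bk * ok) for pk, bk, ok in zip(p, b, o))

H = -sum(pk * math.log2(pk) for pk in p)
W_star = W(p, p, o)                      # proportional betting b = p
print(W_star)                            # ~0.10 bits per race
print(W([1/3, 1/3, 1/3], p, o))          # 0.0: a worse (uniform) betting scheme
print(H + W_star, math.log2(3))          # conservation law: H(p) + W*(p) = log m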
Side Information. In a horse race X, the increase \Delta W in doubling rate due to side information Y is
\Delta W = I(X;Y)
Any sequence on which a gambler makes a large
amount of money is also a sequence that can be
compressed by a large factor.
The entropy of English is approximately 1.3 bits per letter.
Chapter 7 Channel Capacity.
Channel capacity: The logarithm of the number of
distinguishable inputs is given by:
C = \max_{p(x)} I(X;Y)
During compression, we remove all the redundancy in
the data to form the most compressed version
possible, whereas during data transmission, we add
redundancy in a controlled fashion to combat errors in
the channel.
Examples
Binary symmetric channel: C = 1 - H(p)
Binary erasure channel: C = 1 - \alpha, where \alpha is the erasure probability
Symmetric channel: C = \log|Y| - H(row of transition matrix)
Properties of C
0 \le C \le \min{\log|X|, \log|Y|}
I(X;Y) is a continuous concave function of p(x)
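A minimal Python sketch (my own check, not the book's): compute C = max_{p(x)} I(X;Y) for a binary symmetric channel by a crude grid search over the input distribution and compare with the closed form C = 1 - H(p):

import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(t, f):
    """I(X;Y) when Pr(X=1)=t and the channel flips each bit with probability f."""
    py1 = t * (1 - f) + (1 - t) * f          # output marginal Pr(Y=1)
    return H2(py1) - H2(f)                   # I = H(Y) - H(Y|X), and H(Y|X) = H(f)

f = 0.11
C_numeric = max(mutual_information(t / 1000, f) for t in range(1001))
print(C_numeric, 1 - H2(f))                  # both ~0.50 bits: the maximum is at the uniform input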
Joint Typicality. We decode a channel output Y^n as the i-th index if the codeword X^n(i) is jointly typical with the received signal Y^n.
The set A_\epsilon^{(n)} of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x,y) is given by
A_\epsilon^{(n)} = { (x^n, y^n) \in X^n \times Y^n :
  |-\frac{1}{n} \log p(x^n) - H(X)| < \epsilon,
  |-\frac{1}{n} \log p(y^n) - H(Y)| < \epsilon,
  |-\frac{1}{n} \log p(x^n, y^n) - H(X,Y)| < \epsilon }
where p(x^n, y^n) = \prod_{i=1}^n p(x_i, y_i)
Joint AEP. Let (X^n, Y^n) be sequences of length n drawn i.i.d. according to p(x^n, y^n) = \prod_{i=1}^n p(x_i, y_i). Then:
Pr((X^n, Y^n) \in A_\epsilon^{(n)}) \to 1 as n \to \infty
|A_\epsilon^{(n)}| \le 2^{n(H(X,Y)+\epsilon)}
If (\tilde{X}^n, \tilde{Y}^n) ~ p(x^n) p(y^n) (i.e. independent with the same marginals), then
Pr((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) \le 2^{-n(I(X;Y) - 3\epsilon)}
Channel coding theorem. All rates below capacity C are achievable, and all rates above capacity are not; that is, for all rates R < C, there exists a sequence of (2^{nR}, n) codes with probability of error \lambda^{(n)} \to 0. Conversely, for rates R > C, \lambda^{(n)} is bounded away from 0.
A Hamming code is an example of a parity check
code where one or more parity bits are added to
the end of the transmitted sequence which depend
on various subsets of the information bits.
Hamming codes can be visualised as a Venn
diagram where each region of overlap represents
one information bit and each region belonging to
just one set is a parity bit. The location of an error can be inferred from which parity checks fail.
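A minimal Python sketch of a (7,4) Hamming code; this is the standard construction in which the syndrome of a single-bit error spells out the (1-based) position of the flipped bit, not code taken from the book:

# Parity-check matrix H: column i (1..7) is the binary expansion of i.
H = [[(i >> b) & 1 for i in range(1, 8)] for b in range(3)]

def syndrome(word):
    return [sum(H[b][i] * word[i] for i in range(7)) % 2 for b in range(3)]

def encode(data):
    """data: 4 information bits, placed in positions 3,5,6,7 (1-based)."""
    word = [0] * 7
    for pos, bit in zip([3, 5, 6, 7], data):
        word[pos - 1] = bit
    for p in (1, 2, 4):                          # choose parity bits so the syndrome is zero
        word[p - 1] = sum(H[p.bit_length() - 1][i] * word[i]
                          for i in range(7) if i != p - 1) % 2
    return word

def correct(word):
    s = syndrome(word)
    pos = s[0] + 2 * s[1] + 4 * s[2]             # syndrome read as a number
    if pos:
        word = word[:]
        word[pos - 1] ^= 1                       # flip the indicated bit
    return word

code = encode([1, 0, 1, 1])
noisy = code[:]
noisy[4] ^= 1                                    # introduce one bit error
print(code, noisy, correct(noisy) == code)       # True: the error is corrected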
Feedback capacity. Feedback does not increase capacity for discrete memoryless channels (i.e., C_FB = C).
Source-channel theorem. A
stochastic process with entropy rate
H cannot be sent reliably over a discrete
memoryless channel if H > C. Conversely, if the
process satisfies the AEP, the source can be
transmitted reliably if H < C.
The result that a two-stage process is as good as any one-stage process seems so obvious that it may be appropriate to point out that it is not always true. There are examples of multi-user channels where the decomposition breaks down.
We also consider two simple situations where the
theorem appears to be misleading. A simple
example is that of sending English text over an
erasure channel. We can look for the most efficient
binary representation of the text and send it over
the channel. But the errors will be very difficult to
decode. If, however, we send the English text
directly over the channel, we can lose up to about
half the letters and yet be able to make sense out
of the message. Similarly, the human ear has
some unusual properties that enable it to
distinguish speech under very high noise levels if
the noise is white. In such cases, it may be
appropriate to send the uncompressed speech
over the noisy channel rather than the compressed
version. Apparently, the redundancy in the source
is suited to the channel. pg 219
The data compression theorem is a consequence of the AEP, which shows that there exists a "small" subset (of size 2^{nH}) of all possible source sequences that contain most of the probability and that we can therefore represent the source with a small probability of error using H bits per symbol.
The data transmission theorem is based on the joint AEP; it uses the fact that for long block lengths, the output sequence of the channel is very likely to be jointly typical with the input codeword, while any other codeword is jointly typical with probability 2^{-nI}. Hence, we can use about 2^{nI} codewords and still have negligible probability of error. The source-channel separation theorem
shows that we can design the source code and the
channel code separately and combine the results to
achieve optimal performance. pg 222
Chapter 8 Differential Entropy
Differential entropy is the entropy of a continuous random variable. Differential entropy can be negative, but 2^{h(X)}, the volume of the effective support set, is always non-negative.
h(X) = h(f) = -\int_S f(x) \log f(x) dx
if it exists.
AEP for continuous random variables.
f(X_1, X_2, ..., X_n) \doteq 2^{-n h(X)}
Properties of the typical set A_\epsilon^{(n)} parallel those for the discrete random variable.
Vol(A_\epsilon^{(n)}) \doteq 2^{n h(X)}
The entropy of an n-bit quantisation of a continuous random variable X is, on average, the number of bits required to describe X to n-bit accuracy:
H([X]_{2^{-n}}) \approx h(X) + n
The entropy of the Normal distribution with mean \mu and variance \sigma^2:
h(N(\mu, \sigma^2)) = \frac{1}{2} \log 2\pi e \sigma^2 bits
The entropy of a multivariate Normal distribution with mean \mu and covariance matrix K:
h(N_n(\mu, K)) = \frac{1}{2} \log (2\pi e)^n |K| bits
The relative entropy (or Kullback-Leibler distance)
between two densities is defined by:
D(f||g) = \int f \log \frac{f}{g} \ge 0
Chain rule for differential entropy:
h(X_1, X_2, ..., X_n) = \sum_{i=1}^n h(X_i | X_1, X_2, ..., X_{i-1})
h(X|Y) \le h(X) with equality iff X and Y are independent.
h(aX) = h(X) + \log|a|
Mutual Information:
I(X;Y) = \int f(x,y) \log \frac{f(x,y)}{f(x) f(y)} \ge 0
The multivariate normal distribution maximises the entropy over all distributions with the same covariance:
\max_{E X X^t = K} h(X) = \frac{1}{2} \log (2\pi e)^n |K|
Given side information Y and estimator \hat{X}(Y), it follows that:
E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e} e^{2 h(X|Y)}
2^{n H(X)} is the effective alphabet size for a discrete random variable.
2^{n h(X)} is the effective support set size for a continuous random variable.
2^C is the effective alphabet size of a channel of capacity C.
Chapter 9 Gaussian Channel.
The most important continuous alphabet channel is
the Gaussian channel. It is modelled as a time
discrete channel with output Yi at time i, where Yi is
the sum of the input Xi and the noise Zi. The noise
is drawn i.i.d. from a Gaussian distribution. The
most common limitation on the input is an energy
or power constraint.
The information capacity of the Gaussian channel is:
C = \max_{f(x): E X^2 \le P} I(X;Y)
Maximum entropy.
\max_{E X^2 = \alpha} h(X) = \frac{1}{2} \log 2\pi e \alpha
Gaussian Channel.
Y_i = X_i + Z_i ;  Z_i ~ N(0, N) ;
Power constraint \frac{1}{n} \sum_{i=1}^n x_i^2 \le P ; and
C = \frac{1}{2} \log(1 + \frac{P}{N}) bits per transmission
Constructing (2^{nC}, n) codes with a low probability of error is analogous to sphere packing. The received vector is normally distributed around the codeword with the noise variance, and is contained within a sphere of radius \sqrt{n(N+\epsilon)} with high probability.
Band-limited additive white Gaussian noise channel.
Bandwidth W ;
two-sided power spectral density N_0/2 ;
signal power P ; and
C = W \log(1 + \frac{P}{N_0 W}) bits per second
One of the most famous formulas in information
theory.
Water-filling (k parallel Gaussian channels)
Considers the case where the noise on the parallel
channels is independent. The objective is to
distribute the total power among the channels so
as to maximise the capacity.
Y_j = X_j + Z_j , j = 1, 2, ..., k ;
Z_j ~ N(0, N_j) ;
\sum_{j=1}^k E X_j^2 \le P ; and
C = \sum_{i=1}^k \frac{1}{2} \log(1 + \frac{(v - N_i)^+}{N_i})
where v is chosen so that \sum_i (v - N_i)^+ = P
As the signal power is increased from zero, we allot
the power to the channels with the lowest noise.
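A minimal Python sketch of the water-filling computation (noise levels and power budget are my own example values): the water level v is found by bisection so that \sum_i (v - N_i)^+ = P:

import math

N = [1.0, 4.0, 9.0]        # noise variances of the parallel channels
P = 6.0                    # total power budget

def used_power(v):
    return sum(max(v - Ni, 0.0) for Ni in N)

lo, hi = min(N), min(N) + P            # used_power(lo) = 0, used_power(hi) >= P
for _ in range(100):                   # bisection on the water level v
    v = (lo + hi) / 2
    lo, hi = (v, hi) if used_power(v) < P else (lo, v)

powers = [max(v - Ni, 0.0) for Ni in N]
C = sum(0.5 * math.log2(1 + pi / Ni) for pi, Ni in zip(powers, N))
print(v, powers)    # water level 5.5; the noisiest channel (N = 9) gets no power
print(C)            # ~1.46 bits per transmission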
Additive non-white Gaussian noise channel.
Considers the case where the noise on the channels
is dependent (parallel channels or channels with
memory). Let KZ be the covariance matrix of the noise,
and KX be the input covariance matrix.
Y_i = X_i + Z_i ;  Z^n ~ N(0, K_Z) ; and
C = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \log(1 + \frac{(v - \lambda_i)^+}{\lambda_i})
where \lambda_1, \lambda_2, ..., \lambda_n are the eigenvalues of K_Z and v is chosen so that \sum_i (v - \lambda_i)^+ = nP
In this case, the above water-filling argument
translates to water-filling in the spectral domain. For
channels in which the noise forms a stationary
stochastic process, the input signal should be chosen
to be a Gaussian process with a spectrum that is large
at frequencies where the noise spectrum is small.
As in the discrete case, feedback does not increase
capacity for memoryless Gaussian channels.
However, for channels with memory, where the noise
is correlated from time instant to time instant,
feedback does increase capacity.
Capacity without feedback:
C_n = \max_{tr(K_X) \le nP} \frac{1}{2n} \log \frac{|K_X + K_Z|}{|K_Z|}
Feedback bounds:
C_{n,FB} \le C_n + \frac{1}{2}
C_{n,FB} \le 2 C_n
bits per transmission.
Chapter 10 Rate Distortion Theory.
A continuous random source requires infinite precision to represent it exactly, so we cannot represent it exactly with a finite-rate code. The question is then to find the best possible representation for any given data rate. Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate? Interestingly, joint descriptions are more efficient than individual descriptions, even for independent random variables.
Quantization. The Lloyd algorithm constructs a good quantization by starting with a set of reconstruction points, finding the optimal set of reconstruction regions (nearest neighbours with respect to the distortion measure), then finding the optimal reconstruction points for these regions, and iterating.
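A minimal Python sketch of the Lloyd iteration for a 1-D scalar quantizer under squared error, run on a Gaussian sample (the sample size, initial points and seed are my own choices):

import random
random.seed(0)

samples = [random.gauss(0.0, 1.0) for _ in range(20000)]
points = [-2.0, -0.5, 0.5, 2.0]                 # initial reconstruction points

for _ in range(50):
    # 1. Optimal regions for these points: nearest-neighbour assignment.
    cells = [[] for _ in points]
    for x in samples:
        j = min(range(len(points)), key=lambda i: (x - points[i]) ** 2)
        cells[j].append(x)
    # 2. Optimal points for these regions: the centroid of each cell.
    points = [sum(c) / len(c) if c else p for c, p in zip(cells, points)]

distortion = sum(min((x - p) ** 2 for p in points) for x in samples) / len(samples)
print(points)       # roughly +/-0.45 and +/-1.5 for a 4-level Gaussian quantizer
print(distortion)   # ~0.12, close to the known optimum ~0.1175 for 2 bits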
Rate distortion. The information rate distortion function for a source X ~ p(x) and distortion measure d(x, \hat{x}) is:
R(D) = \min_{p(\hat{x}|x): \sum_{(x,\hat{x})} p(x) p(\hat{x}|x) d(x,\hat{x}) \le D} I(X; \hat{X})
where the minimisation is over all conditional distributions p(\hat{x}|x) for which the joint distribution p(x, \hat{x}) = p(x) p(\hat{x}|x) satisfies the expected distortion constraint.
Rate distortion theorem. If R > R(D), there exists a sequence of codes \hat{X}^n(X^n) with the number of codewords |\hat{X}^n(\cdot)| \le 2^{nR} and with E d(X^n, \hat{X}^n(X^n)) \to D. If R < R(D), no such codes exist.
Bernoulli source. For a Bernoulli source with Hamming distortion,
R(D) = H(p) - H(D)
Gaussian Source. For a Gaussian source with squared-error distortion,
R(D) = \frac{1}{2} \log \frac{\sigma^2}{D}
Each bit of description reduces the expected distortion by a factor of 4. With a 1-bit description, the best expected squared error is \sigma^2/4. With a 1-bit quantization of a N(0, \sigma^2) random variable using two regions corresponding to the positive and negative real lines and reproduction points at the centroids, the expected distortion is \frac{\pi - 2}{\pi} \sigma^2 = 0.3633 \sigma^2.
Hence, we can achieve a lower distortion by
considering several distortion problems in
succession (long block lengths) than can be achieved
by considering them separately
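A minimal Python sketch checking these numbers by simulation: the best 1-bit scalar quantizer of N(0, sigma^2) versus the sigma^2/4 promised by R(D) at one bit (which is achievable only by coding long blocks jointly):

import math, random
random.seed(0)

sigma = 1.0
n = 200000
centroid = math.sqrt(2 / math.pi) * sigma       # E[X | X > 0] for a Gaussian

mse = 0.0
for _ in range(n):
    x = random.gauss(0.0, sigma)
    xhat = centroid if x >= 0 else -centroid    # 1-bit scalar quantizer
    mse += (x - xhat) ** 2
mse /= n

print(mse)                                      # ~0.363 sigma^2
print((math.pi - 2) / math.pi * sigma**2)       # the exact value, 0.3634...
print(sigma**2 / 4)                             # 0.25: the rate distortion limit at 1 bit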
Source-channel separation. A source with rate
distortion R(D) can be sent over a channel of capacity
C and recovered with distortion D if and only if
R(D)<C.
Multivariate Gaussian Source. The rate distortion
function for a multivariate normal vector with
Euclidean mean-squared-error distortion is given by
reverse water-filling on the eigenvalues. We choose a
constant \lambda and only describe those random variables with variances greater than \lambda.
We can transform a good code for channel
transmission into a good code for rate distortion. The
essential idea is to fill the space of source sequences:
In channel transmission, we want to find the largest
set of code words that have a large minimum distance
between codewords, whereas in rate distortion, we
wish to find the smallest set of codewords that covers
the entire space.
Chapter 11. Information Theory and Statistics
The method of types is a powerful technique in large deviation theory; it considers sequences that have the same empirical distribution.
Definition: The type P_x (or empirical probability distribution) of a sequence x_1, x_2, ..., x_n is the relative proportion of occurrences of each symbol of X (i.e., P_x(a) = N(a|x)/n for all a \in X, where N(a|x) is the number of times the symbol a occurs in the sequence x \in X^n).
Basic identities
|P_n| \le (n+1)^{|X|}
Q^n(x) = 2^{-n(D(P_x||Q) + H(P_x))}
|T(P)| \doteq 2^{n H(P)}
Q^n(T(P)) \doteq 2^{-n D(P||Q)}
These equations state that there is only a
polynomial number of types and that there are an
exponential number of sequences of each type.
There is an exact formula for the probability of any
sequence of type P under distribution Q and an
approximate formula for the probability of a type
class.
The crucial point is that it follows that at least one
type has exponentially many sequences in its type
class. In fact, the largest type class has essentially
the same number of elements as the entire set of
sequences, to the first order in the exponent.
Since the probability of each type class depends exponentially on the relative entropy distance between the type P and the distribution Q, type classes that are far away from the true distribution have exponentially smaller probability.
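A minimal Python sketch (binary alphabet, my own numbers) comparing |T(P)| with 2^{nH(P)} and Q^n(T(P)) with 2^{-nD(P||Q)}; the exponents agree up to the polynomial, order (log n)/n, correction:

import math
from math import comb

n, k = 200, 60
P = [1 - k / n, k / n]                 # the type (empirical distribution)
Q = [0.5, 0.5]                         # the true distribution

H_P = -sum(p * math.log2(p) for p in P if p > 0)
D_PQ = sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

size = comb(n, k)                                   # |T(P)|
prob = comb(n, k) * Q[1]**k * Q[0]**(n - k)         # Q^n(T(P))

print(math.log2(size) / n, H_P)          # ~0.861 vs 0.881: equal to first order in the exponent
print(-math.log2(prob) / n, D_PQ)        # ~0.139 vs 0.119: equal to first order in the exponent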
Universal Data Compression
Huffman coding compresses an i.i.d. source with a known distribution p(x) to its entropy limit H(X). However, if the code is designed for some incorrect distribution q(x), a penalty of D(p||q) is incurred. Surprisingly, there is a universal code of rate R that suffices to describe every i.i.d. source with entropy H(X) < R.
P_e^{(n)} \le 2^{-n D(P_R^* || Q)} for all Q
where D(P_R^* || Q) = \min_{P: H(P) \ge R} D(P||Q)
However, universal source codes need a longer block length to obtain the same performance as a code designed specifically for the probability distribution. We also pay a penalty in the increased complexity of the encoder and decoder.
Large deviations (Sanov's theorem)
Estimates the probability of a set of non-typical types.
The probability that a sample will show a large
deviation from the expected outcome is exponentially
small. (We can estimate the exponent using the
central limit theorem but this is a poor approximation
for more than a few standard deviations).
Q^n(E) = Q^n(E \cap P_n) \le (n+1)^{|X|} 2^{-n D(P^*||Q)}
D(P^*||Q) = \min_{P \in E} D(P||Q)
If E is the closure of its interior, then
Q^n(E) \doteq 2^{-n D(P^*||Q)}
L1 bound on relative entropy.
Convergence in relative entropy implies convergence
in the L1 norm.
D(P_1 || P_2) \ge \frac{1}{2 \ln 2} ||P_1 - P_2||_1^2
Pythagorean theorem.
Many of the intuitive properties of distance are not
valid for D(P||Q) which behaves like the square of the
Euclidean distance.
If E is a convex set of types, distribution Q \notin E, and P^* achieves
D(P^*||Q) = \min_{P \in E} D(P||Q), then we have
D(P||Q) \ge D(P||P^*) + D(P^*||Q)
for all P \in E.
Conditional Limit Theorem.
The conditional limit theorem implies that there is a very high probability that the type observed is close to P^*, the distribution in E closest to Q in relative entropy. This is established by considering the empirical distribution of the sequence of outcomes given that the type is in a particular set of distributions E. The probability of E is essentially determined by D(P^*||Q), the distance of the closest element of E to Q; the conditional type is essentially P^*; and the probability of other types that are far away from P^* is negligible.
If X_1, X_2, ..., X_n are i.i.d. ~ Q, then
Pr(X_1 = a | P_{X^n} \in E) \to P^*(a) in probability,
where P^* minimizes D(P||Q) over P \in E.
In particular,
Pr(X_1 = a | \frac{1}{n} \sum_{i=1}^n X_i \ge \alpha) \to \frac{Q(a) e^{\lambda a}}{\sum_x Q(x) e^{\lambda x}}
where \lambda is chosen so that the limiting (tilted) distribution has mean \alpha.
Neyman-Pearson lemma. The optimal test between two densities P_1 and P_2 has a decision region of the form:
accept P = P_1 if \frac{P_1(x_1, x_2, ..., x_n)}{P_2(x_1, x_2, ..., x_n)} > T
Chernoff-Stein lemma.
The Chernoff-Stein lemma considers hypothesis testing in the case where one of the probabilities of error is held fixed and the other is made as small as possible. The other probability of error is exponentially small, with an exponential rate equal to the relative entropy between the two distributions. The best achievable error exponent \beta_n^\epsilon is
\beta_n^\epsilon = \min_{A_n \subseteq X^n, \alpha_n < \epsilon} \beta_n
\lim_{n \to \infty} \frac{1}{n} \log \beta_n^\epsilon = -D(P_1 || P_2)
Chernoff information.
The Bayesian approach assigns prior probabilities
to both hypotheses and minimises the overall
probability of error given by the weighted sum of
the individual probabilities of error. The best
achievable exponent for a Bayesian probability of
error is:
D^* = D(P_{\lambda^*} || P_1) = D(P_{\lambda^*} || P_2), where
P_\lambda(x) = \frac{P_1^\lambda(x) P_2^{1-\lambda}(x)}{\sum_{a \in X} P_1^\lambda(a) P_2^{1-\lambda}(a)}
with \lambda = \lambda^* chosen so that
D(P_\lambda || P_1) = D(P_\lambda || P_2)
Fisher information.
A standard problem in statistical estimation is to determine a parameter \theta of a distribution from a sample of data drawn from that distribution. An estimator T is meant to approximate the value of the parameter. The bias is the expected value of the error in the estimator. However, a bias of 0 does not guarantee that the error is low with high probability; a loss function of the error is required, most commonly the expected square of the error.
J(\theta) = E_\theta [\frac{\partial}{\partial \theta} \ln f(x; \theta)]^2
The Fisher information is a measure of the amount of information about \theta that is present in the data. It gives a lower bound on the error in estimating \theta from the data.
Cramer-Rao inequality.
The Cramer-Rao inequality is a lower bound on the variance of all unbiased estimators. For an unbiased estimator T of \theta,
E_\theta (T(X) - \theta)^2 = var(T) \ge \frac{1}{J(\theta)}
Fisher information is defined with respect to a family
of parametric distributions, unlike entropy, which is
defined for all distributions. Entropy is related to the
volume of the typical set and Fisher information to the
surface area of the typical set.
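A minimal Python sketch (my own example): for Bernoulli(theta) the Fisher information of one observation is J(theta) = 1/(theta(1-theta)), and the sample mean is an unbiased estimator that attains the Cramer-Rao bound 1/(n J(theta)); the simulation below checks this:

import random
random.seed(0)

theta, n, trials = 0.3, 100, 5000
J = 1.0 / (theta * (1 - theta))

estimates = []
for _ in range(trials):
    xs = [1 if random.random() < theta else 0 for _ in range(n)]
    estimates.append(sum(xs) / n)               # the sample mean estimator

mean = sum(estimates) / trials
var = sum((e - mean) ** 2 for e in estimates) / trials
print(var, 1.0 / (n * J))     # both ~0.0021 = theta(1-theta)/n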
Chapter 12 Maximum Entropy
The temperature of a gas corresponds to the
average kinetic energy of the molecules in the gas.
What can we say about the distribution of velocities
in the gas at a given temperature? We know from
physics that this distribution is the maximum
entropy distribution under the temperature
constraint, otherwise known as the Maxwell-
Boltzmann distribution. The maximum entropy
distribution corresponds to the macro state (as
indexed by the empirical distribution) that has the
most micro states (the individual gas velocities).
Implicit in the use of maximum entropy methods in
physics is a sort of AEP which says that all
microstates are equally probable.
Maximum Entropy Distribution. Let f be a probability density on support set S satisfying the moment constraints
\int_S f(x) r_i(x) dx = \alpha_i for 1 \le i \le m
Let
f^*(x) = f_\lambda(x) = e^{\lambda_0 + \sum_{i=1}^m \lambda_i r_i(x)}, x \in S
and let \lambda_0, ..., \lambda_m be chosen so that f^* satisfies the constraints. Then f^* uniquely maximises h(f) over all f satisfying these constraints.
Example: dice, no constraints. Let S = {1,2,3,4,5,6}. The distribution that maximises the entropy is the uniform distribution, p(x) = 1/6 for x \in S.
Example used by Boltzmann: dice with E X = \sum_i i p_i = \alpha. That is, if n dice are thrown and they sum to n\alpha, what proportion of the dice show each face?
Example: S = [0, \infty) and E X = \mu. The maximum entropy density is the exponential; this is the distribution of the height of molecules in the atmosphere.
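A minimal Python sketch of the dice example (alpha = 4.5 is my own choice): the maximum entropy pmf on {1,...,6} with mean alpha has the exponential form p_i proportional to e^{lambda i}, and lambda can be found by bisection on the mean:

import math

alpha = 4.5
faces = range(1, 7)

def mean_for(lam):
    w = [math.exp(lam * i) for i in faces]
    Z = sum(w)
    return sum(i * wi for i, wi in zip(faces, w)) / Z

lo, hi = -10.0, 10.0                   # mean_for is increasing in lambda
for _ in range(100):
    lam = (lo + hi) / 2
    lo, hi = (lam, hi) if mean_for(lam) < alpha else (lo, lam)

w = [math.exp(lam * i) for i in faces]
p = [wi / sum(w) for wi in w]
print(p)                                        # probabilities grow geometrically with the face value
print(sum(i * pi for i, pi in zip(faces, p)))   # mean constraint met: 4.5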
Maximum entropy spectral density estimation.
Burg assumed a process was stationary and
Gaussian and found that the process that
maximises the entropy subject to the correlation
constraints is an autoregressive Gaussian process
of the appropriate order.
The entropy rate of a stochastic process subject to
autocorrelation constraints R0, R1,..., Rp is
maximised by the p-th order zero-mean Gauss-Markov process satisfying these constraints. The maximum entropy rate is
maximum entropy rate is
h
*
=
1
2
log( 2e)
K
p
K
p1
k=1
p
a
k
e
ik
2
Chapter 13 Universal Source Coding
If the probability distribution underlying the source is unknown, then we cannot apply the methods of Chapter 5 directly (e.g. Huffman coding) unless two passes over the data are made. However, there are online algorithms that adapt to the statistics of the incoming data as they compress, and these do well for any distribution within a class of distributions. For an individual sequence (e.g. a text or a piece of music) there is no underlying probability distribution; we compare our performance to that achievable by optimal codeword assignments with respect to Bernoulli distributions or k-th order Markov processes. The ultimate compression of an individual sequence is its Kolmogorov complexity.
Ideal codeword length, if the distribution is known:
l^*(x) = \log \frac{1}{p(x)}
Average description length, as a basis for comparison:
E_p l^*(x) = H(p)
Using an estimated probability distribution \hat{p}(x) increases the word length by an amount equal to the relative entropy between the estimate and the actual distribution. If
l(x) = \log \frac{1}{\hat{p}(x)}, then
E_p l(x) = H(p) + D(p||\hat{p})
Average redundancy of using universal coding:
R_p = E_p l(X) - H(p)
Minimax redundancy. For X ~ p_\theta(x),
D^* = \min_l \max_{p_\theta} R_{p_\theta} = \min_q \max_\theta D(p_\theta || q)
This minimax redundancy is achieved by a distribution q that is at the 'center' of the information ball containing the distributions p_\theta, that is, the distribution q whose maximum distance from any of the distributions p_\theta is minimised.
Minimax theorem. D^* = C, where C is the capacity of the channel {\theta, p_\theta(x), X}.
This is a channel with the rows of the transition matrix equal to the different p_\theta's, the possible distributions of the source. The minimax redundancy is equal to the capacity of this channel, and the corresponding optimal coding distribution is the output distribution of this channel induced by
the capacity-achieving input distribution.
Bernoulli sequences. For X^n ~ Bernoulli(\theta), the redundancy is
D_n^* = \min_q \max_\theta D(p_\theta(x^n) || q(x^n)) \approx \frac{1}{2} \log n + o(\log n)
That is, the cost of describing the sequence is about \frac{1}{2} \log n bits above the optimal cost with the Shannon code for a Bernoulli distribution corresponding to k/n.
Arithmetic coding. The objective of the arithmetic
coding algorithm is to represent a sequence of
random variables by a subinterval in [0,1]. As the
algorithm observes more input symbols the length of
the subinterval corresponding to the input sequence
decreases. As the top and bottom ends of the interval
get closer they begin to agree in the first few bits and
they can be output. The process continues on the
remaining subinterval until the whole sequence is
output. The procedure achieves an average block
length within 2 bits of the entropy for any block-length.
nH bits of F(x^n) reveal approximately n bits of x^n.
Lempel-Ziv coding. The key idea of the Lempel-Ziv
algorithm is to parse the string into phrases and to
replace phrases by pointers to where the same string
has occurred in the past. The differences between the algorithms are based on differences in the set of possible match locations (and match lengths) the algorithm allows.
Lempel-Ziv coding (recurrence time coding). Let R_n(X^n) be the last time in the past that we have seen a block of n symbols X^n. Then
\frac{1}{n} \log R_n \to H(X),
and encoding by describing the recurrence time is asymptotically optimal. Used in zip and gzip implementations.
Lempel-Ziv coding (sequence parsing). If a sequence is parsed into the shortest phrases not seen before (e.g. 011011101 is parsed to 0,1,10,11,101) and l(x^n) is the description length of the parsed sequence, then
\limsup \frac{1}{n} l(X^n) \le H(X) with probability 1
for every stationary ergodic process {X_i}. Used in Unix compress, modems and the GIF format.
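A minimal Python sketch of the parsing just described (LZ78-style; the function name is my own): each new phrase is an earlier phrase extended by one symbol, so it can be encoded as a (phrase index, symbol) pair:

def lz78_parse(s):
    dictionary = {"": 0}          # phrase -> index
    phrases, pairs = [], []
    w = ""
    for ch in s:
        if w + ch in dictionary:
            w += ch               # keep extending until the phrase is new
        else:
            pairs.append((dictionary[w], ch))
            dictionary[w + ch] = len(dictionary)
            phrases.append(w + ch)
            w = ""
    if w:                         # a possibly incomplete final phrase
        pairs.append((dictionary[w], ""))
        phrases.append(w)
    return phrases, pairs

phrases, pairs = lz78_parse("011011101")
print(phrases)    # ['0', '1', '10', '11', '101']
print(pairs)      # each phrase expressed as (index of earlier phrase, new symbol)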
Chapter 14 Kolmogorov Complexity
Definition: The Kolmogorov complexity K(x) of a
string x is:
K(x) = \min_{p: U(p) = x} l(p)
K(x | l(x)) = \min_{p: U(p, l(x)) = x} l(p)
Kolmogorov complexity is the minimum length over
all programs that print x and halt.
Universality of Kolmogorov complexity. There exists a universal computer U such that for any other computer A,
K_U(x) \le K_A(x) + c_A
for any string x, where the constant c_A does not depend on x. If U and A are universal,
|K_U(x) - K_A(x)| \le c for all x.
Upper bound on Kolmogorov complexity.
K(x | l(x)) \le l(x) + c
K(x) \le K(x | l(x)) + 2 \log l(x) + c
Kolmogorov complexity and entropy. If X_1, X_2, ..., X_n are i.i.d. integer-valued random variables with entropy H, there exists a constant c such that for all n,
H \le \frac{1}{n} E K(X^n | n) \le H + |X| \frac{\log n}{n} + \frac{c}{n}
Lower bound on Kolmogorov complexity. There are no more than 2^k strings x with complexity K(x) < k. If X_1, X_2, ..., X_n are drawn according to a Bernoulli(1/2) process, then
Pr(K(X_1, X_2, ..., X_n | n) \le n - k) \le 2^{-k}
Definition. A sequence x is said to be incompressible if
\frac{K(x_1, x_2, ..., x_n | n)}{n} \to 1
Strong law of large numbers for incompressible sequences:
\frac{K(x_1, x_2, ..., x_n)}{n} \to 1 \Rightarrow \frac{1}{n} \sum_{i=1}^n x_i \to \frac{1}{2}
Definition. The universal probability of a string x is
P_U(x) = \sum_{p: U(p) = x} 2^{-l(p)} = Pr(U(p) = x)
This is the probability that a randomly drawn
program will print out the string x.
Most sequences of length n have complexity close
to n. Shorter programs are much more probable
than longer ones. That is, there are not enough
programs to go around.
The halting problem and the non-computability of Kolmogorov complexity. Both of these are related to Godel's incompleteness theorem, and all three are based on self-referential ideas like "This sentence is false."
Universality of P_U(x). For every computer A,
P_U(x) \ge c_A P_A(x) for every string x \in {0,1}^*
where the constant c_A depends only on U and A.
Definition.
\Omega = \sum_{p: U(p) halts} 2^{-l(p)} = Pr(U(p) halts)
is the probability that the computer halts when the input p to the computer is a binary string drawn according to a Bernoulli(1/2) process.
Properties of \Omega
\Omega is not computable (the halting problem).
\Omega is a philosopher's stone. Knowing \Omega to an accuracy of n bits will enable us to decide the truth of any provable or finitely refutable mathematical theorem that can be written in less than n bits.
\Omega is algorithmically random (incompressible).
Universal Gambling. Is based on P_U(x) and does asymptotically as well as a scheme that uses knowledge of the true distribution.
Equivalence of K(x) and \log \frac{1}{P_U(x)}. There exists a constant c, independent of x, such that
|\log \frac{1}{P_U(x)} - K(x)| \le c
for all strings x. Thus the universal probability of a string x is essentially determined by its Kolmogorov complexity.
Notice that the ideal Shannon code length assignment l(x) = \log \frac{1}{p(x)} achieves an average description length H(X), while in Kolmogorov complexity theory the ideal description length \log \frac{1}{P_U(x)} is almost equal to K(x). Thus \log \frac{1}{p(x)} is the natural notion of descriptive complexity of x in algorithmic as well as probabilistic settings.
Definition. The Kolmogorov structure function K_k(x^n | n) of a binary string x^n \in {0,1}^n is defined as
K_k(x^n | n) = \min_{p: l(p) \le k, U(p,n) = S, x^n \in S} \log|S|
Definition. Let k^* be the least k such that
K_{k^*}(x^n | n) + k^* = K(x^n | n)
Let S** be the corresponding set and let p** be the
program that prints out the indicator function of S**.
Then p** is the Kolmogorov minimal sufficient
statistic for x.
Chapter 15 Network Information Theory
I have not read this chapter; it did not appear relevant to my interests (and it was also long).
Chapter 16 Information Theory and Portfolio
Theory
I have not read this chapter.
Chapter 17 Inequalities in Information Theory
This chapter is a brutal summary of the key aspects of
the previous chapters with a similarly brutal
introduction of some additional results.
There may be important results here for anyone attempting to construct new proofs, particularly in relation to Fisher information.
Further Reading
pg 508: A non-technical introduction to the various measures of complexity can be found in a thought-provoking book by Pagels, H. The Dreams of Reason: The Computer and the Rise of the Sciences of Complexity. Simon and Schuster, New York, 1988.
pg 171: A non-technical introduction to the estimation of various information sources, including English, is Lucky, R. W. (1989) Silicon Dreams: Information, Man and Machine. St Martin's Press, New York.
John A Brown www.nhoj.info 2007