Notes

1. This document introduces notation and terminology used in lectures on the applications of information theory concepts in statistics. 2. It defines entropy, divergence, joint entropy, conditional entropy, and mutual information. Variable-length codes and block codes are also defined. 3. Binary prefix codes are discussed, and it is shown that they satisfy the Kraft inequality, ensuring unique decodability. The Shannon-Fano code is constructed as a prefix code that achieves the Kraft bound with equality.


0 Notation and Terminology.

This course will be concerned with the applications of information theory concepts in statistics. Much of the course will be based on lectures given by Imre Csiszár at Maryland in 1989. Some recent results about dependent processes will also be given. It is assumed that the reader is familiar with basic information theory ideas as presented, for example, in the initial chapters of the Csiszár-Körner book, and with basic statistical concepts as presented, for example, in the book by Cox and Hinkley. Notation and terminology that will be used in these lectures will be introduced in this section.

The symbol A = {a_1, a_2, ..., a_{|A|}} will denote a finite set of cardinality |A| and x_m^n will denote the sequence x_m, x_{m+1}, ..., x_n, where each x_i ∈ A. The set of all n-length sequences x_1^n will be denoted by A^n, the set of all infinite sequences x = x_1^∞, with x_i ∈ A, i ≥ 1, will be denoted by A^∞, and the set of all finite sequences drawn from A will be denoted by A^*. If u and v are finite length sequences then their concatenation is denoted by uv, and u^k = u^{k-1}u, k > 1.

The entropy H(P) of a probability distribution P = (P(a)) on A is defined by the formula

H(P) = -\sum_{a \in A} P(a) \log P(a),

where here, as elsewhere in these lectures, base two logarithms are used. Random variable notation is often used in this context, that is, H(X) denotes the entropy of the distribution P of the random variable X. If P and Q are two distributions on A then their divergence or cross-entropy is defined by

D(P\|Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}.

If P is the joint distribution of two random variables (X, Y) then their joint entropy is defined by

H(X, Y) = -\sum_{(a,b)} P(a, b) \log P(a, b),

while the conditional entropy H(X|Y) and mutual information I(X ∧ Y) are defined, respectively, by

H(X|Y) = H(X, Y) - H(Y),
I(X ∧ Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).
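To make these definitions concrete, here is a minimal Python sketch (the joint distribution is invented purely for illustration) that computes the entropy, divergence, conditional entropy, and mutual information directly from the formulas above.

```python
from math import log2

def H(P):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(p * log2(p) for p in P.values() if p > 0)

def D(P, Q):
    """Divergence D(P||Q); assumes Q(a) > 0 whenever P(a) > 0."""
    return sum(p * log2(p / Q[a]) for a, p in P.items() if p > 0)

# Invented joint distribution of (X, Y) on {0,1} x {0,1}.
PXY = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}
PX = {x: sum(p for (a, b), p in PXY.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (a, b), p in PXY.items() if b == y) for y in (0, 1)}

H_X_given_Y = H(PXY) - H(PY)               # H(X|Y) = H(X,Y) - H(Y)
I = H(PX) + H(PY) - H(PXY)                 # I(X ^ Y) = H(X) + H(Y) - H(X,Y)
assert abs(I - (H(PX) - H_X_given_Y)) < 1e-12
print(round(H(PX), 4), round(H_X_given_Y, 4), round(I, 4), round(D(PX, PY), 4))
```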

Two types of codes will be of interest. A block code is a mapping C: A^n → B^m, while a variable-length code is a mapping C: A^n → B^*. The length function L: A^n → {1, 2, ...} for a variable-length code is defined by the formula

C(x_1^n) = b_1 b_2 \cdots b_{L(x_1^n)}.

Thus, in particular, a block code is just a variable-length code whose length function is constant.

A block code C is invertible (or faithful) if it is one-to-one. A variable-length code is uniquely decodable if for any two distinct sequences, u(1), u(2), ..., u(m) and v(1), v(2), ..., v(k), where u(i), v(j) ∈ A^n, ∀i, j, the concatenations of the images, C(u(1))C(u(2))···C(u(m)) and C(v(1))C(v(2))···C(v(k)), are not equal. A condition that guarantees unique decodability is the prefix condition. A variable-length code C satisfies the prefix condition if

C(v) = C(u)w, u, v ∈ A^n, w ∈ B^* ⇒ w = Λ, u = v,

where Λ denotes the empty string.

In most cases of interest to us, the image alphabet will be binary, that is, B = {0, 1}. It is easy to see that the length function for a binary prefix code must satisfy the so-called Kraft inequality

\sum_{x_1^n} 2^{-L(x_1^n)} ≤ 1.

It can in fact be shown that a uniquely decodable binary code also satisfies the Kraft inequality, and that if L is a positive integer-valued function on A^n for which the Kraft inequality holds then there is a binary prefix code C whose length function is L. (Thus, in particular, for any uniquely decodable code C with length function L there is a prefix code C̃ whose length function is also L.) The reason for the connection between the Kraft inequality and prefix codes is the connection between the Kraft inequality and binary trees, a connection that we now sketch.

A (binary) tree is a directed graph (V, E), along with a distinguished vertex r ∈ V, called the root, such that the following properties hold.

1. The outdegree of each vertex is at most 2.

2. The indegree of the root is 0. The indegree of all other vertices is exactly 1.

3. Given any v ∈ V − r there is a directed path from r to v.

It is easy to see from the above that there is only one path from r to any v ≠ r; the length of this path is called the depth d(v) of v. A vertex is called an outer node if its outdegree is 0; otherwise it is an inner node. Let O denote the set of outer nodes. It is easy to see that the edges of the tree can be labeled by 0's and 1's so that for any vertex v whose outdegree is 2, the two edges leading out of v have different labels. Such a labeling assigns a binary sequence of length d(v) to each outer node v such that distinct outer nodes are assigned distinct sequences. The labeling is therefore just a binary code on the set of outer nodes. Furthermore, the code is a prefix code, due to the simple fact that an outer node is not an inner node! It is clear that

\sum_{v \in O} 2^{-d(v)} ≤ 1.

In summary, binary trees lead to binary prefix codes on their outer nodes for which the Kraft inequality holds.
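As a quick illustration of the correspondence between prefix codes and the Kraft inequality, the following sketch (the codeword set is an arbitrary example that could be read off the outer nodes of a small binary tree) checks the prefix condition and evaluates the Kraft sum.

```python
def is_prefix_code(codewords):
    """Check the prefix condition for a set of binary codewords (strings over {0,1})."""
    words = sorted(codewords)
    # After sorting, a word that is a prefix of another is immediately followed by an extension.
    return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

def kraft_sum(codewords):
    """Left-hand side of the Kraft inequality: sum of 2^{-L(w)} over the codewords."""
    return sum(2.0 ** (-len(w)) for w in codewords)

# Codewords of the outer nodes of a small complete binary tree (illustrative).
code = ["0", "10", "110", "111"]
assert is_prefix_code(code)
assert kraft_sum(code) <= 1.0     # here the tree is complete, so the sum is exactly 1
print(kraft_sum(code))
```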

Now suppose L is a positive integer-valued function defined on a set A such that \sum_a 2^{-L(a)} ≤ 1. Our goal is to show that there is a prefix code C whose length function is L. Without loss of generality it can be assumed that A is labeled so that L(a_i) ≤ L(a_{i+1}), i < |A|. The code C is defined by setting C(a_i) = w(i) ∈ B^*, where w(1) is a block of 0's of length L(a_1), and w(i), i > 1, is the first L(a_i) bits in the binary expansion of \sum_{j<i} 2^{-L(a_j)}. It is left to the reader to show that this defines a prefix code. The code is known as the Shannon-Fano code, or simply the Shannon code. The following theorem summarizes this coding construction in a form that will be used later.

Theorem 1 Let P be a probability distribution on A and define L(a) = ⌈− log P(a)⌉, a ∈ A, where ⌈·⌉ denotes the least integer function. There is a binary prefix code for which the expected length satisfies E(L) = \sum_a L(a)P(a) ≤ H(P) + 1.
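The construction above is easy to carry out numerically. The following sketch (with an invented three-symbol distribution) builds the Shannon-Fano codewords exactly as described, and checks both the Kraft inequality and the bound of Theorem 1.

```python
import math

def shannon_fano_code(P):
    """Shannon(-Fano) code for a distribution P (dict: symbol -> probability).

    Lengths are L(a) = ceil(-log2 P(a)); the codeword w(i) consists of the first L(a_i)
    bits of the binary expansion of sum_{j<i} 2^{-L(a_j)}, with symbols ordered so that
    the lengths are nondecreasing."""
    symbols = sorted(P, key=lambda a: math.ceil(-math.log2(P[a])))
    lengths = {a: math.ceil(-math.log2(P[a])) for a in symbols}
    code, cum = {}, 0.0
    for a in symbols:
        frac, bits = cum, []
        for _ in range(lengths[a]):        # first L(a) bits of the binary expansion of cum
            frac *= 2
            bit = int(frac)
            bits.append(str(bit))
            frac -= bit
        code[a] = "".join(bits)
        cum += 2.0 ** (-lengths[a])
    return code, lengths

P = {"a": 0.5, "b": 0.3, "c": 0.2}                       # illustrative distribution
code, L = shannon_fano_code(P)
H = -sum(p * math.log2(p) for p in P.values())
EL = sum(P[a] * L[a] for a in P)
assert sum(2.0 ** (-L[a]) for a in P) <= 1.0 + 1e-12     # Kraft inequality
assert H <= EL <= H + 1 + 1e-12                          # Theorem 1
print(code, L, round(H, 3), EL)
```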
We shall also make use of a prefix code defined on the integers, a code that is essentially due to Elias. Let b(n) be the usual binary representation of the integer n ≥ 0, and let ℓ(n) denote the length of b(n), so that ℓ(n) = ⌈log_2(n + 1)⌉. Let 0^k denote a sequence of 0's of length k. The code is defined by

C(n) = 0^{ℓ(ℓ(n))} b(ℓ(n)) b(n).

For example b(12) = 1100, so b(ℓ(12)) = b(4) = 100 and ℓ(ℓ(12)) = 3. Thus C(12) = 0001001100. The decoding is as follows. The initial block 000 of 0's has length 3. This tells us to look in the next 3 places, where we see 100, the binary representation of 4, which in turn tells us to look in the next 4 places where we see 1100, the binary representation of 12. The code C is a prefix code; the codeword length is ℓ(n) + 2ℓ(ℓ(n)), which, for large n, is approximately equal to

\log_2 n + 2 \log_2 \log_2 n.
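The Elias code is equally easy to implement; the short sketch below (function names are of course just illustrative) encodes and decodes exactly as in the worked example for n = 12.

```python
def elias_encode(n: int) -> str:
    """Encode n >= 1 with the prefix code C(n) = 0^{l(l(n))} b(l(n)) b(n) described above."""
    b = lambda m: format(m, "b")       # usual binary representation
    l = lambda m: len(b(m))            # its length
    return "0" * l(l(n)) + b(l(n)) + b(n)

def elias_decode(s: str) -> int:
    """Invert elias_encode on a single codeword."""
    k = 0
    while s[k] == "0":                 # the run of leading zeros has length l(l(n))
        k += 1
    ln = int(s[k:2 * k], 2)            # next k bits are b(l(n))
    return int(s[2 * k:2 * k + ln], 2) # next l(n) bits are b(n)

assert elias_encode(12) == "0001001100"        # the worked example in the text
for n in range(1, 200):
    assert elias_decode(elias_encode(n)) == n
```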

1 Large Deviations.

One important application of information theory is to the theory of large deviations. A key to this application is the theory of types. The n-type of a sequence x_1^n ∈ A^n is just another name for its empirical distribution P̂ = P̂_{x_1^n}, that is, the distribution defined by

\hat P(a) = \frac{|\{i: x_i = a\}|}{n}, \quad a \in A.

Two sequences x_1^n and y_1^n are said to be equivalent if they have the same type; the equivalence classes will be called type classes. The type class of x_1^n will be denoted by T_P^n, where P = P̂_{x_1^n}. The proof of the following lemma is left to the student.

Lemma 1 The number of possible types is

\binom{n + |A| - 1}{|A| - 1}.

Theorem 2 For any type P

\binom{n + |A| - 1}{|A| - 1}^{-1} 2^{nH(P)} ≤ |T_P^n| ≤ 2^{nH(P)}.

Proof. Fix the type P and define P^n(x_1^n) = \prod_i P(x_i). A simple calculation shows that if x_1^n has type P then P^n(x_1^n) = 2^{-nH(P)}. Since P^n is a probability distribution on A^n we must have P^n(T_P^n) ≤ 1. This gives the desired upper bound since P^n(T_P^n) = |T_P^n| 2^{-nH(P)}.

The lower bound can be obtained as follows. Let A = {a_1, a_2, ..., a_t}, where t = |A|. By definition of types we can write P(a_i) = k_i/n, i = 1, 2, ..., t, with k_1 + k_2 + ... + k_t = n, where k_i is the number of times a_i appears in x_1^n for any fixed x_1^n ∈ T_P^n. Thus we have

|T_P^n| = \frac{n!}{k_1! k_2! \cdots k_t!},

so that

n^n = (k_1 + ... + k_t)^n = \sum \frac{n!}{j_1! \cdots j_t!} k_1^{j_1} \cdots k_t^{j_t},

where the sum is over all t-tuples (j_1, ..., j_t) of nonnegative integers such that j_1 + ... + j_t = n. The number of terms is \binom{n + |A| - 1}{|A| - 1}, by Lemma 1, and the largest term is

\frac{n!}{k_1! k_2! \cdots k_t!} k_1^{k_1} k_2^{k_2} \cdots k_t^{k_t},

for if j_r > k_r, j_s < k_s then decreasing j_r by 1 and increasing j_s by 1 multiplies the term by

\frac{j_r k_s}{k_r (1 + j_s)} ≥ \frac{j_r}{k_r} ≥ 1.

This yields the lower bound.

The following corollary will be useful in a later section.

Corollary 1 The minimum number ℓ_min^{(n)} of bits needed to encode sequences x_1^n of known type P, with codewords of a fixed length, satisfies

nH(P) - \log \binom{n + |A| - 1}{|A| - 1} ≤ ℓ_min^{(n)} ≤ ⌈nH(P)⌉.

In particular, (1/n) ℓ_min^{(n)} → H(P) as n → ∞.
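The bounds of Theorem 2 can be checked directly for small n. The following sketch (the type is an arbitrary example) computes |T_P^n| exactly as a multinomial coefficient and compares it with the two bounds.

```python
from math import comb, factorial, log2, prod

def type_class_size(counts):
    """|T_P^n| for the type with symbol counts (k_1, ..., k_t): the multinomial coefficient."""
    n = sum(counts)
    return factorial(n) // prod(factorial(k) for k in counts)

def theorem2_bounds(counts):
    """Lower and upper bounds of Theorem 2 for the same type."""
    n, t = sum(counts), len(counts)
    H = -sum((k / n) * log2(k / n) for k in counts if k > 0)   # entropy of the type
    num_types = comb(n + t - 1, t - 1)                         # Lemma 1
    return 2 ** (n * H) / num_types, 2 ** (n * H)

counts = (6, 3, 1)                    # an illustrative type with n = 10 over a 3-letter alphabet
size = type_class_size(counts)
lo, hi = theorem2_bounds(counts)
assert lo <= size <= hi * (1 + 1e-9)
print(size, round(lo, 2), round(hi, 2))
```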
Our next result connects the theory of types with general probability theory.

Theorem 3 For any distribution P on A and any n-type Q

\binom{n + |A| - 1}{|A| - 1}^{-1} 2^{-nD(Q\|P)} ≤ P^n(T_Q^n) ≤ 2^{-nD(Q\|P)},

where P^n is the product measure defined by P on A^n.

Proof. If x_1^n has type Q then the number of times x_i = a is just nQ(a), and hence

P^n(x_1^n) = \prod_a P(a)^{nQ(a)} = 2^{n \sum_a Q(a) \log P(a)}.

Thus Theorem 2 yields the desired upper bound

P^n(T_Q^n) = |T_Q^n| \, 2^{n \sum_a Q(a) \log P(a)} ≤ 2^{-n \sum_a Q(a) \log \frac{Q(a)}{P(a)}} = 2^{-nD(Q\|P)}.

A similar argument establishes the lower bound.

Let X_1, X_2, ... be independent random variables taking values in A with common distribution P and let P̂_n be the n-type of the random sequence X_1, ..., X_n. The law of large numbers tells us that P̂_n → P with probability 1 as n → ∞. The next result is useful for estimating the (exponentially small) probability that P̂_n belongs to some set Π of distributions that does not contain the true distribution P. We use the notation D(Π‖P) = inf_{Q∈Π} D(Q‖P).

Theorem 4 (Sanov's Theorem.) Let Π be a set of distributions on A whose closure is equal to the closure of its interior. Then

-\frac{1}{n} \log P(\hat P_n \in \Pi) \to D(\Pi\|P).

Proof. Let P_n be the set of possible n-types and let Π_n = Π ∩ P_n. Theorem 3 implies that

P(\hat P_n \in \Pi_n) = P^n\bigl(\cup_{Q \in \Pi_n} T_Q^n\bigr)

is upper bounded by

\binom{n + |A| - 1}{|A| - 1} 2^{-nD(\Pi_n\|P)}

and lower bounded by

\binom{n + |A| - 1}{|A| - 1}^{-1} 2^{-nD(\Pi_n\|P)}.

Since D(Q‖P) is continuous in Q, the hypothesis on Π implies that D(Π_n‖P) is arbitrarily close to D(Π‖P) if n is large. Hence the theorem follows.

Example 1 Let f be a given function on A and set Π = {Q: \sum_a Q(a)f(a) > α}, where α < max_a f(a). The set Π is open and hence satisfies the hypothesis of Sanov's theorem. Note that P̂_n ∈ Π is equivalent to (1/n) \sum_i f(x_i) > α, since \sum_a P̂_n(a)f(a) = (1/n) \sum_i f(x_i). Thus we obtain the classical large deviations result

-\frac{1}{n} \log P^n\Bigl( \frac{1}{n} \sum_{i=1}^n f(x_i) > α \Bigr) \to D(\Pi\|P).

In this case, D(Π‖P) = D(cl(Π)‖P) = min D(Q‖P), where the minimum is over all Q for which \sum_a Q(a)f(a) ≥ α. In particular, for any α > \sum_a P(a)f(a) we have D(Π‖P) > 0, so that the probability of (1/n) \sum_1^n f(X_i) > α goes to 0 exponentially fast.
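Before the analytic evaluation of the exponent that follows, here is a numerical sketch (P, f, and α are invented illustrative values) that computes D(Π‖P) in the max-over-t form derived below, and checks that the maximizing tilted distribution has mean α and divergence equal to the exponent.

```python
from math import log2

def sanov_exponent(P, f, alpha, t_hi=200.0, iters=200):
    """D(Pi||P) for Pi = {Q : sum_a Q(a) f(a) > alpha}, computed (in bits) as
    max_{t >= 0} [ t*alpha - log2 sum_a P(a) 2^{t f(a)} ], the form derived in the text.
    P and f are dicts over the alphabet; assumes alpha < max f."""
    g = lambda t: t * alpha - log2(sum(P[a] * 2 ** (t * f[a]) for a in P))
    lo, hi = 0.0, t_hi
    for _ in range(iters):                       # ternary search: g is concave in t
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    t = (lo + hi) / 2
    c = 1.0 / sum(P[a] * 2 ** (t * f[a]) for a in P)
    Q = {a: c * P[a] * 2 ** (t * f[a]) for a in P}   # the tilted distribution Q*
    return g(t), Q

P = {0: 1/3, 1: 1/3, 2: 1/3}            # uniform source (illustrative)
f = {0: 0.0, 1: 1.0, 2: 2.0}
exponent, Q = sanov_exponent(P, f, 1.5)
mean_Q = sum(Q[a] * f[a] for a in P)
D = sum(Q[a] * log2(Q[a] / P[a]) for a in P)
print(round(exponent, 4), round(mean_Q, 4), round(D, 4))   # mean_Q ~ alpha, D ~ exponent
```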
It is instructive to see how to calculate the expo- Since D(Q∗ kP̃ ) > 0 for P̃ 6= Q∗ , it follows that
nent D(ΠkP ) for the preceding example. Consider X
the exponential family of distributions P̃ of the form log c + tα = − log P (a)2tf (a) + tα
P̃ (a) = cP (a)2tf (a) , where c = ( a P (a)2tf (a) )−1 . a
P
P
Clearly a P̃ (a)f (a) is a continuous function of the pa- attains its maximum at t = t∗ . This means that the
rameter t and this function tends to max f (a) as t → ∞. “large deviations exponent”
(Check!) As t = 0 gives P̃ = P , it follows by the as-
sumption n
" ( )#
1 1X
lim − log P n f (Xi ) > α)
X n→ n n i=1
P (a)f (a) < α < max f (a)
a
a
can be represented also as
that there an element of the exponential family, with
t > 0, such that P̃ (a)f (a) = α. Denote this P̃ by Q∗ ,
P " #
X
tf (a)
max − log P (a)2 + tα .
so that, t≥0
a
∗ f (a)
Q∗ (a) = c∗ P (a)2t , t∗ > 0, Q∗ (a)f (a) = α.
X
This latter form is the one usually found in text-
a
books. Note that the restriction t ≥ 0 is not needed
P
We claim that when α > a P (a)f (a), because, as just seen, the un-
constrained maximum is attained at t∗ > 0. However,
D(ΠkP ) = D(Q∗ kP ) = log c∗ + t∗ α. (1) the restriction to t ≥ 0 takes care also of the case when
α ≤ a P (a)f (a), when the exponent is equal to 0.
P
To show that D(ΠkP ) = D(Q∗ kP ) it suffices to show
that D(QkP ) > D(Q∗ kP ) for every Q ∈ Π, i. e., for
P
every Q for which a Q(a)f (a) > α. A direct calculation 2 I-projections.
gives
The I-projection of a distribution Q onto a closed,
Q∗ (a) convex subset Π of distributions on A is the P ∗ ∈ Π
D(Q∗ kP ) = Q∗ (a) log
X
=
a P (a) such that
X
Q (a) [log c∗ + t∗ f (a)] = log c∗ + t∗ α

(2) D(P ∗ kQ) = min D(P kQ).
P ∈Π
a
In the sequel we suppose that Q(a) > 0 for all a ∈ A.
and The function D(P kQ) is then continuous and strictly
X Q∗ (a) convex in P , so that P ∗ exists and is unique.
Q(a) log = The support of the distribution P is the set S(P ) =
a P (a)
{a: P (a) > 0}. Since Π is convex, among the supports of
Q(a) [log c∗ + t∗ f (a)] > log c∗ + t∗ α.
X
elements of Π there is one whose support contains all the
a
others; this will be called the support of Π and denoted
Hence by S(Π).

D(QkP ) − D(Q∗ kP ) > Theorem 5 S(P ∗ ) = S(Π) and D(P kQ) ≥ D(P kP ∗ ) +
Q∗ (a) D(P ∗ kQ) for all P ∈ Π.
= D(QkQ∗ ) > 0.
X
D(QkP ) − Q(a) log
a P (a)
Proof. Of course, if the asserted inequality holds for some
This completes the proof of (??). P ∗ ∈ Π and all P ∈ Π then P ∗ must be the I-projection
of Q onto Π.
Remark 1 Replacing P in (??) by any P̃ of the expo- For arbitrary P ∈ Π, by the convexity of Π we have
nential family, i. e., P̃ (a) = cP (a)2tf (a) , we get that Pt = (1 − t)P ∗ + tP ∈ Π, for 0 ≤ t ≤ 1, hence for each
t ∈ (0, 1),
D(Q∗ kP̃ ) =
c∗ 1 d
log + (t∗ − t)α = log c∗ + t∗ α − (log c + tα). 0≤ [D(Pt kQ) − D(P ∗ kQ)] = D(Pt kQ) |t=t̃ ,
c t dt

for some t̃ ∈ (0, t). But preceding proof is equal to 0, for all P ∈ L. This gives
the desired identity. Also we can equivalently write
d Pt (a)
(P (a) − P ∗ (a)) log
X
D(Pt kQ) = , P ∗ (a)
 
− D(P ∗ kQ) = 0, P ∈ L.
X
dt a Q(a) P (a) log (4)
a Q(a)
and this converges (as t ↓ 0) to −∞ if P ∗ (a) = 0 for Now, by the definition of L, the distributions P ∈ L,
some a ∈ S(P ), and otherwise to regarded as |A|-dimensional vectors, are in the orthog-
onal complement of the subspace F spanned by the k
P ∗ (a)
(P (a) − P ∗ (a)) log
X
. (3) vectors, {fi (·)−αi : 1 ≤ i ≤ k}. If S(L) = A then the dis-
a Q(a) tributions P ∈ L also span the orthogonal complement
of F, from Lemma ??, below , and hence the identity
It follows that the first contingency is ruled out, proving
(??) implies that the vector
that S(P ∗ ) ⊃ S(P ), and also that the quantity (??) is
nonnegative, proving the claimed inequality. P ∗ (·)
log − D(P ∗ kQ)
Q(·)
Now we examine some situations in which the inequal-
must be in F. This proves that P ∗ ∈ EQ .
ity of Theorem ?? is actually an equality. For any given
Finally, if P̃ ∈ L ∩ EQ then it is easily checked that
functions f1 , f2 , . . . , fk on A and corresponding numbers
the identity (??) holds for P̃ in place of P ∗ . This implies
α1 , α2 , . . . , αk , the set
that P̃ satisfies the Pythagorean identity in the role of
L = {P :
X
P (a)fi (a) = αi , 1 ≤ i ≤ k}, P ∗ , and this, in turn, implies that P̃ = P ∗ .
a The proof of the theorem is finished, once the following
linear algebra result is established.
will be called a linear family of probability distributions.
For any given functions f1 , f2 , . . . , fk on A, the set E of Lemma 2 Suppose V is a the subspace of Rn such that
all P such that there is a strictly positive vector p ∈ V ⊥ , the orthogonal
complement of V . Then V ⊥ is spanned by the probabil-
k
X ity vectors that belong to it.
P (a) = cQ(a) exp( θi fi (a)), for some θ1 , . . . , θk ,
1 Proof. Choose a basis for V ⊥ of the form {p, q1 , . . . , q` }
and determine ti ∈ (0, 1), 1 ≤ i ≤ ` such that pi =
will be called an exponential family of probability distri- (1 − ti )p + ti qi is a nonnegative vector. The vectors
butions; here Q is any given distribution and {p, p1 , . . . , p` } are easily seen to be a basis for V ⊥ ; each
!−1 can be then be rescaled to obtain a basis for V ⊥ that
k
c = c(θ1 , . . . , θk ) =
X
Q(a) exp(
X
θi fi (a)) . consists of probability vectors. This completes the proof
a 1 of the lemma.

We will assume that S(Q) = A; then S(P ) = A for all If S(L) 6= A then no element of the exponential family
P ∈ E. Note that Q ∈ E. The family E depends on Q, of E = EQ can belong to L, but since E is not a closed
course, but only in a weak manner, for any element set in general, some element of the closure, cl(E) may
of E could play the role of Q. If necessary to emphasize be in L. Indeed, if there is a P̃ ∈ L ∩ cl(E) then the
this dependence on Q we shall write E = EQ . Pythagorean identity still holds for P̃ , and this implies
that P̃ = P ∗ . A sequence of elements converging to P ∗
Theorem 6 The I-projection P ∗ of Q onto a linear fam- can always be generated by the “generalized iterative
ily L satisfies scaling” algorithm, which will be discussed at the end of
this section. Hence we always have L ∩ cl(E) = {P ∗ }.
D(P kQ) = D(P kP ∗ ) + D(P ∗ kQ), ∀P ∈ L.
Suppose now that L1 , . . . , Lm are given linear families
Further, if S(L) = A then L ∩ EQ = {P ∗ }. and generate a sequence of distributions Pn as follows:
Set P0 = Q (any given distribution with S(Q) = A), let
Proof. By the preceding theorem, S(P ∗ ) = S(L). P1 be the I-projection of P0 onto L1 , P2 the I-projection
Hence for every P ∈ L there is some t < 0 such that of P1 onto L2 , and so on, where for n > m we mean by
Pt = (1 − t)P ∗ + tP ∈ L. Therefore, we must have Ln that Li for which i ≡ n (mod m); i. e., L1 , . . . , Lm
(d/dt)D(Pt kQ)|t=0 = 0, that is, the quantity (??) in the is repeated cyclically.
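Before stating the convergence theorem for this cyclic procedure, here is a minimal numerical sketch of a single I-projection onto one linear constraint, using the exponential-family form of Theorem 6; the distribution Q, constraint function f, and target mean are invented, and the Pythagorean identity is checked numerically.

```python
from math import exp, log

def i_projection_single_constraint(Q, f, alpha, lo=-50.0, hi=50.0, iters=200):
    """I-projection of Q onto L = {P : sum_a P(a) f(a) = alpha}.
    By Theorem 6 the projection has the exponential form P*(a) = c Q(a) exp(theta f(a));
    theta is found by bisection, since the tilted mean is increasing in theta.
    Assumes min f < alpha < max f."""
    def tilt(theta):
        w = {a: Q[a] * exp(theta * f[a]) for a in Q}
        z = sum(w.values())
        return {a: w[a] / z for a in Q}
    for _ in range(iters):
        mid = (lo + hi) / 2
        mean = sum(p * f[a] for a, p in tilt(mid).items())
        lo, hi = (mid, hi) if mean < alpha else (lo, mid)
    return tilt((lo + hi) / 2)

def D(P, Q):  # divergence in bits
    return sum(p * log(p / Q[a], 2) for a, p in P.items() if p > 0)

Q = {0: 1/3, 1: 1/3, 2: 1/3}             # prior guess (illustrative)
f = {0: 0.0, 1: 1.0, 2: 2.0}
Pstar = i_projection_single_constraint(Q, f, 1.4)
P = {0: 0.1, 1: 0.4, 2: 0.5}             # some other member of L (its mean is also 1.4)
assert abs(D(P, Q) - (D(P, Pstar) + D(Pstar, Q))) < 1e-6   # Pythagorean identity
```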

Theorem 7 If ∩m 6 ∅ then Pn → P ∗ , the
i=1 Li = L = is called the B-lumping of P .
I-projection of Q onto L. Fix nonnegative constants αi , i ≤ k, whose sum is 1,
and let L = {P : P B (i) = αi , ∀i}. The I-projection of any
Proof. By the preceding theorem, we have for every
Q onto L is obtained simply by “scaling”:
P ∈ L (even for P ∈ Ln ) that
αi
D(P kPn−1 ) = D(P kPn ) + D(Pn kPn−1 ), n = 1, 2, . . . P ∗ (a) = ci Q(a), a ∈ Bi , where ci = . (6)
QB (i)
Adding these equations for 1 ≤ n ≤ N we get that
This follows from the fact that lumping does not increase
N
X divergence, that is,
D(P kQ) = D(P kP0 ) = D(P kPN ) + D(Pn kPn−1 ).
n=1 D(P kQ) ≥ D(P B kQB ).
By compactness there exists a subsequence PNk → P 0 ,
say, and then from the preceding inequality we get for The condition that P ∈ L is equivalent to the condition
Nk → ∞ that that P (Bi ) = αi , ∀i. If P ∗ (a) = αi Q(a)/Q(Bi ), a ∈ Bi
then

0
X
D(P kQ) = D(P kP ) + D(Pn kPn−1 ) (5) P ∗ (a) X X αi Q(a) αi
P ∗ (a) log
X
n=1 = log
a Q(a) i a∈Bi
Q(Bi ) Q(Bi )
Since this series is convergent we have D(Pn kPn−1 ) → 0, B
= D(αkQ ), α = (α1 , . . . , αk ).
and hence also |Pn − Pn−1 | → 0, where |Pn − Pn−1 |
a (|Pn (a) −
P
denotes the usual variational distance
Thus, if P ∈ L then
Pn−1 (a)|. This implies that together with PNk → P 0
we also have D(P kQ) ≥ D(P B kQB ) = D(αkQB ),
PNk +1 → P 0 , PNk +2 → P 0 , . . . , PNk +m → P 0 .
which establishes (??).
Since by the periodic construction, among the m consec- Now, if L1 , L2 , . . . , Lm are all of the preceding form,
utive elements, PNk , PNk +1 , . . . , PNk +m−1 there is one in then the iterated sequence of I-projections P1 , P2 , . . . ,
each Li , i = 1, 2, . . . , m, it follows that P 0 ∈ ∩Li = L. in Theorem ?? can all be obtained by iterative scal-
Since P 0 ∈ L it may be substituted for P in (??) to ing, and the theorem gives that the so obtained sequence
yield converges to the I-projection of Q onto the intersection

L = ∩m i=1 Li . In particular, as we shall see in a later
D(P 0 kQ) =
X
D(Pn kPn−1 ).
section, iterative scaling can be used to evaluate the I-
i=1
projections that are needed in the analysis of contingency
With this, in turn, (??) becomes
tables.
D(P kQ) = D(P kP 0 ) + D(P 0 kQ),

which proves that P 0 equals the I-projection of Q onto 3 f-divergence and contingency tables.
L. Finally, as P 0 was the limit of an arbitrary convergent
subsequence of the sequence Pn , our result means that Let f (t) be a convex function defined for t > 0 with
every convergent subsequence of Pn has the same limit f (1) = 0. The f -divergence of a distribution P from Q
P ∗ . Using compactness again, this proves that Pn → P ∗ is defined by
and completes the proof of the theorem. X 
P (x)

Df (P kQ) = Q(x)f .
Now we discuss iterative scaling, a method for evalu- a Q(x)
ating I-projections that is useful in the analysis of con-
tingency tables, a subject to be discussed in the next sec- Here we take 0f ( 00 ) = 0, f (0) = limt→0 f (t), 0f ( a0 ) =
tion. Let B = {B1 , B2 , . . . , Bk } be a partition of A and limt→0 tf ( at ) = a limu→∞ f (u)
u .
let P be a distribution on A. The distribution defined Some examples include the following.
on {1, 2, . . . , k} by the formula
(1) f (t) = t log t ⇒ Df (P kQ) = D(P kQ).
P B (i) =
X
P (a),
a∈Bi (2) f (t) = − log t ⇒ Df (P kQ) = D(QkP ).

(3) f (t) = (t − 1)2 Remark 2 The same proof works even if Q is not fixed,
X (P (a) − Q(a))2 provided that no Q(a) can become arbitrarily small.
⇒ Df (P kQ) = .
a Q(a) However, the theorem (the “asymptotic equivalence” of
√ f -divergences subject to the differentiability hypotheses)
(4) f (t) = 1 − t does not remain true if Q is not fixed and the probabili-
Xq
⇒ Df (P kQ) = 1 − P (a)Q(a). ties of Q(a) are not bounded away from 0.
a

(5) f (t) = |t − 1| ⇒ Df (P kQ) = |P − Q|. Corollary 2 If f satisfies the hypotheses of the theo-
rem and P̂ is the empirical distribution (i. e., type) of a
2
The expression Df (P kQ) = a (P (a)−Q(a)) sample of size n drawn independently from the distribu-
P
Q(a) will be de-
2
noted by χ (P, Q). The analogue of the log-sum inequal- tion Q, then (2/f 00 (1))nDf (P̂ kQ) has an asymptotic χ2
ity is distribution, with |A| − 1 degrees of freedom, as n → ∞.

ai a
X     X X
bi f ≥ bf , a= ai , b = bi .
i
bi b The χ2 distribution with k degrees of freedom is de-
fined as the distribution of the sum of squares of k inde-
Using this, many of the properties of the information pendent random variables having the standard normal
divergence D(P kQ) extend to general f -divergences, in distribution. By this corollary, both (2/ log e)nD(P̂ kQ)
particular and (2/ log e)nD(QkP̂ ) are asymptotically χ2 with |A| −
Lemma 3 Df (P kQ) ≥ 0 and if f is strictly convex at 1 degrees of freedom.
t = 1 then Df (P kQ) = 0 only when P = Q. Further,
One property that distinguishes information divergence
Df (P kQ) is a convex function of the pair (P, Q), and the
among f -divergences is transitivity of projections, as
partitioning property, Df (P kQ) ≥ Df (P B kQB ) holds for
summarized in the following lemma. It can, in fact, be
any partition B of A.
shown that the only f -divergence for which either of the
A basic theorem about f -divergences is the following two properties of the lemma holds is the informational
approximation property. divergence.
Theorem 8 If f is twice differentiable at t = 1 and Lemma 4 Let P ∗ be the I-projection of Q onto a linear
f 00 (1) > 0 then for any Q with S(Q) = A and P “ close” family L. Then
to Q we have
(i) For any convex subfamily L0 ⊂ L the I-projections
f 00 (1) 2 of Q and of P ∗ onto L0 are the same.
Df (P kQ) ∼ χ (P, Q)
2
(ii) For any “translate” L0 of L, the I-projections of Q
(Formally, Df (P kQ)/χ2 (P, Q) → f 00 (1)/2 as χ2 (P, Q) → and of P ∗ onto L0 are the same, provided S(P ∗ ) =
0.) A.
Proof. Since f (1) = 0, Taylor’s expansion gives
Proof. By the Pythagorean identity
0 f 00 (1)
f (t) = f (1)(t − 1) + (t − 1)2 + (t)(t − 1)2 ,
2 D(P kQ) = D(P kP ∗ ) + D(P ∗ kQ), P ∈ L.
where (t) → 0 as t → 1. Hence
It follows that on any subset of L the minimum of
P (a) D(P kQ) and of D(P kP ∗ ) are achieved by the same P .
 
Q(a)f =
Q(a) This establishes (i).
f 00 (1) (P (a) − Q(a))2 L0 is called a translate of L if it is defined in terms of
f 0 (1)(P (a) − Q(a)) + the same functions fi , but possibly different αi . Hence,
2 Q(a)
2 the exponential family corresponding to L0 is the same
P (a) (P (a) − Q(a))
 
+ . as it is for L. Since S(P ∗ ) = A, we know that P ∗ be-
Q(a) Q(a)
longs to this exponential family. But every element of
Summing over a ∈ A then establishes the theorem. the exponential family has the same I-projection onto
L0 , which establishes (ii).
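The f-divergences listed above, and the approximation of Theorem 8, are easy to check numerically; in the sketch below the distributions and the size of the perturbation are arbitrary illustrative choices.

```python
from math import log2, sqrt, e

def f_divergence(P, Q, f):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)), assuming Q(a) > 0 for all a."""
    return sum(Q[a] * f(P[a] / Q[a]) for a in Q)

fs = {
    "KL":          lambda t: t * log2(t) if t > 0 else 0.0,   # f''(1) = log2(e)
    "reverse KL":  lambda t: -log2(t),
    "chi-square":  lambda t: (t - 1) ** 2,
    "Hellinger":   lambda t: 1 - sqrt(t),
    "variational": lambda t: abs(t - 1),
}

Q = {0: 0.5, 1: 0.3, 2: 0.2}
eps = 1e-3
P = {0: 0.5 + eps, 1: 0.3 - eps, 2: 0.2}      # a small perturbation of Q
for name, f in fs.items():
    print(name, f_divergence(P, Q, f))

# Theorem 8: for twice-differentiable f, D_f(P||Q) ~ (f''(1)/2) chi^2(P,Q) as P -> Q.
chi2 = f_divergence(P, Q, fs["chi-square"])
kl = f_divergence(P, Q, fs["KL"])
assert abs(kl / chi2 - log2(e) / 2) < 1e-2
```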

The marginals of a contingency table are obtained
Table 1: A 2-dimensional contingency table.
by restricting attention to those features i that be-
x(0, 0) x(0, 1)· · · x(0, r2 ) x(0·) long to some given set γ ⊂ {1, 2, . . . , d}. Formally,
x(1, 0) x(1, 1)· · · x(1, r2 ) x(1·) for γ = (i1 , . . . , ik ) we denote by ω(γ) the γ-projection
.. .. .. .. .. of ω = (j1 , . . . , jd ), that is, ω(γ) = (ji1 , ji2 , . . . , jik ).
. . . . . The γ-marginal of the contingency table is given by the
x(r1 , 0) x(r1 , 1) · · · x(r1 , r2 ) x(r1 ·) marginal counts
x(·, 0) x(·, 1) · · · x(·, r2 ) n
x(ω 0 )
X
x(ω(γ)) =
ω 0 :ω 0 (γ)=ω(γ)

Now we apply some of these ideas to the analysis of or the corresponding empirical distribution p̂(ω(γ)) =
contingency tables. A 2-dimensional contingency table x(ω(γ))/n. In general the γ-marginal of any distribution
is indicated in Table 1. The sample data have two P (ω): ω ∈ Ω is defined as the distribution Pγ defined by
features, with categories 0, . . . , r1 for the first feature the marginal probabilities
and 0, . . . , r2 for the second feature. The cell counts
P (ω 0 ).
X
Pγ (ω(γ)) =
x(j1 , j2 ), 0 ≤ j1 ≤ r1 , 0 ≤ j2 ≤ r2 ω 0 :ω 0 (γ)=ω(γ)

are nonnegative integers; thus in the sample there were In general a d-dimensional contingency table has d
x(j1 , j2 ) members that had category j1 for the first fea- one-dimensional marginals, d(d − 1)/2 two-dimensional
ture and j2 for the second. The table has two marginals marginals, etc., corresponding to the subsets of
with marginal counts {1, . . . , d} of one, two, etc., elements.
For contingency tables the most important linear fam-
r2 r1
x(j1 ·) =
X
x(j1 , j2 ), x(·j2 ) =
X
x(j1 , j2 ). ilies of distributions are those defined by fixing certain
j2 =0 j1 =0 γ-marginals, for a family Γ of sets γ ⊂ {1, . . . , d}. Thus,
denoting the fixed marginals by P̄γ , γ ∈ Γ, we consider
The sum of all the counts is
X X XX L = {P : Pγ = P̄γ , γ ∈ Γ}.
n= x(j1 ·) = x(·j2 ) = x(j1 , j2 ).
j1 j2 j1 j2 The exponential family (through any given Q) that cor-
responds to this linear family L consists of all distribu-
The term contingency table comes from this exam- In particular, if L is given by fixing the one-dimensional
ple, the cell counts being arranged in a table, with Y
the marginal counts appearing at the margins. Other P (ω) = cQ(ω) aγ (ω(γ)). (7)
forms are also commonly used, e. g., the marginal γ∈Γ
empirical probabilities are indicated by replacing x(j1 ·)
In particular, if L is given by fixing the one-dimensional
by p̂(j1 ·) = x(j1 ·)/n and x(·j2 ) by p̂(·j2 ) = x(·j2 )/n,
marginals (i. e., Γ consists of the one point subsets
and/or the counts are replaced by the relative counts,
of {1, . . . , d} then the corresponding exponential family
p̂(j1 , j2 ) = x(j1 , j2 )/n.
consists of the distributions of the form
In the general case the sample has d features of in-
terest, with the ith feature having categories 0, 1, . . . , ri . P (i1 , . . . , id ) = cQ(i1 , . . . , id )a1 (i1 ) · · · ad (id )
The d-tuples ω = (j1 , . . . , jd ) are called cells; the corre-
sponding cell count x(ω) is the number of members of The family of all distributions of the form (??) is called
the sample such that, for each i, the ith feature is in the log-linear family with interactions γ ∈ Γ. In most
the ji th category. The collection of possible cells will be applications, Q is chosen as the uniform distributions;
denoted by Ω. The empirical distribution is defined by often the name “log-linear family” is restricted to this
P
p̂(ω) = x(ω)/n, where n = ω x(ω) is the sample size. case. Then (??) gives that the log of P (ω) is equal to a
By a d-dimensional contingency table we mean either the sum of terms, each representing an “interaction” γ ∈ Γ,
aggregate of the cell counts x(ω), or the empirical distri- for it depends on ω = (j1 , . . . , jd ) only through ω(γ) =
bution p̂, or sometimes any distribution P on Ω (mainly (ji1 , . . . , jik ), where γ = (i1 , . . . , ik ).
when considered as a model for the “true distribution” A log-linear family is also called a log-linear model.
from which the sample came.) It should be noted that the representation (??) is not

unique, because it corresponds to a representation in and the degrees of freedom for D(P ∗ kQ) are determined
terms of linearly dependent functions. A common way from the condition that the total degrees of freedom is
of achieving uniqueness is to postulate aγ (ω(γ)) = 1 |Ω| − 1.
whenever at least one component of ω(γ) is equal to 0.
In this manner a unique representation of the form (??) The proof of Theorem ?? is omitted. Us-
is obtained, provided that with every γ ∈ Γ also the sub- ing this theorem, the null-hypothesis is rejected if
sets of γ are in Γ. Log-linear models of this form are also (2n/ log e)D(P̂ kP ∗ ) exceeds the threshold found in the
called hierarchical models. table of the χ2 distribution for the selected level of sig-
nificance.
Remark 3 The way we introduced log-linear models
shows that restricting to the hierarchical ones is more Now we look at the problem of outliers. A lack of fit
a notational than a real restriction. Indeed, if some (i. e., D(P̂ kP ∗ ) “large”) may be due not to the inad-
γ-marginal is fixed then so are the γ 0 -marginals for all equacy of the model tested, but to outliers. A cell ω0
γ 0 ⊂ γ. is considered to be an outlier in the following case: Let
L be the linear family determined by the γ-marginals
In some cases of interest it is desirable to summa- of the empirical distribution P̂ , (γ ∈ Γ) and let L0
rize the information content of a contingency table by be the subfamily of L consisting of those P ∈ L that
its γ-marginals, γ ∈ Γ. In such cases it is natural to satisfy P (ω0 ) = P̂ (ω0 ). Let P ∗∗ be the I-projection
consider the linear family L consisting of those distri- of P ∗ onto L0 . Ideally, we should consider ω0 as an
butions whose γ-marginals equal those of the empirical outlier if D(P ∗∗ kP ∗ ) is “large”, for if D(P ∗∗ kP ∗ ) is
distribution, P̂ . If a prior guess Q is available, then we close to D(P̂ kP ∗ ) then D(P̂ kP ∗∗ ) will be small by the
accept the I-projection P ∗ of Q onto L as an estimate Pythagorean identity. Now by the partitioning inequal-
of the true distribution. By previous results, this P ∗ ity:
equals the intersection of the log-linear family (??), or
its closure, with the linear family L. Also, P ∗ equals the D(P ∗∗ kP ∗ ) ≥
maximum likelihood estimate of the true distribution if P̂ (ω0 )   P̂ (ω0 )
P̂ (ω0 ) log ∗
+ 1 − P̂ (ω 0 log ∗ ,
it is assumed to belong to (??). P (ω0 ) P (ω0 )
Again, by previous results, an asymptotically optimal
test of the null-hypothesis that the true distribution be- and we declare ω0 as an outlier if the right-hand side
longs to the log-linear family E with interactions γ ∈ Γ of this inequality is “large”, that is, after scaling by
consists in accepting the null-hypothesis if (2n/ log e), it exceeds the critical value of χ2 with one
degree of freedom.
D(P̂ kP ∗ ) = min D(P̂ kP ) If the above method produces only a few outliers, say
p∈E
ω0 , ω1 , . . . , ω` , we consider the subset L̃ of L consisting of
is “small.” Unfortunately the numerical bounds ob- those P ∈ L that satisfy P (ωj ) = P̂ (ωj ) for j = 0, . . . , `.
tained in our asymptotic calculation are too crude for If the I-projection of P ∗ onto L̃ is already “close” to P̂ ,
most applications. Better bounds can be obtained from we accept the model and attribute the original lack of fit
the following theorem (still asymptotic, but typically to the outliers. Then the “outlier” cell counts x(ωj ), j =
good for substantially smaller sample sizes than our ex- 0 . . . , ` are deemed unreliable and they may be adjusted
ponential error bounds.) to nP ∗ (ωj ), j = 0 . . . , `.
Similar techniques are applicable in the case when
Theorem 9 If the true distribution Q is in E then the some cell counts are missing.
terms on the right-hand side of the Pythagorean identity

D(P̂ kQ) = D(P̂ kP ∗ ) + D(P ∗ kQ) 4 An iterative algorithm.


are asymptotically independent and (after scaling) have In this section an iterative algorithm to find the
χ2 distributions with appropriate degrees of freedom. minimum divergence between two convex sets of dis-
tributions is presented. In this discussion the nota-
Remark 4 The scaling is by 2n/ log e as in Corol- tion x∗ = arg minx∈X f (x) is used to denote a member
lary ??. The degrees of freedom for D(P̂ kP ∗ ) equals x∗ ∈ X at which the function f achieves its minimum, if
the number of (independent) constraints determining L, such a minimum exists, otherwise arg min is undefined.
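Before developing the general alternating-minimization theorem of this section, here is a small numerical sketch of iterative scaling (iterative proportional fitting) in its simplest contingency-table form: cyclically rescaling a two-way table to prescribed row and column marginals, which by the cyclic-projection result of Section 2 converges to the I-projection of the starting table onto the set with both marginals. The starting table and target marginals are invented.

```python
def scale_to_marginals(Q, row_m, col_m, sweeps=200):
    """Iterative proportional fitting: starting from a strictly positive table Q(i,j),
    alternately rescale rows and columns so that the row marginals match row_m and the
    column marginals match col_m. Each half-step is an I-projection onto one marginal
    constraint (the 'lumping' formula (6)), so the iterates converge to the I-projection
    of Q onto the set of tables with both marginals."""
    P = [row[:] for row in Q]
    for _ in range(sweeps):
        for i, row in enumerate(P):                       # match row marginals
            s = sum(row)
            P[i] = [x * row_m[i] / s for x in row]
        for j in range(len(P[0])):                        # match column marginals
            s = sum(P[i][j] for i in range(len(P)))
            for i in range(len(P)):
                P[i][j] *= col_m[j] / s
    return P

Q = [[0.25, 0.25],
     [0.25, 0.25]]                                        # prior guess (illustrative)
P = scale_to_marginals(Q, row_m=[0.7, 0.3], col_m=[0.6, 0.4])
print([[round(x, 4) for x in row] for row in P])
print([round(sum(row), 4) for row in P],
      [round(sum(r[j] for r in P), 4) for j in range(2)])
```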

In the following lemma the sets P and Q and function The inequality (??) implies the desired basic limit re-
D(P kQ) are completely arbitrary. In later applications sult
D(P kQ) will be the divergence and P and Q will be lim D(Pn kQn ) = inf D(P kQ).
n→∞ P ∈P,Q∈Q
convex sets of distributions on a finite set A.
Indeed, if this were false it would mean that there exist
Theorem 10 Let D(P kQ) be an arbitrary real-valued P ∈ P, Q ∈ Q and  > 0 such that
function defined for P ∈ P, Q ∈ Q such that P ∗ =
lim D(Pn kQn ) = lim D(Pn+1 kQn ) > D(P kQ) + .
P ∗ (Q) = arg minP D(P kQ) exists for all Q ∈ Q and n→∞ n→∞
Q∗ = Q∗ (P ) = arg minQ D(P kQ) exists for all P ∈ P. Then (??) would give that δ(P kPn+1 ) ≤ δ(P kPn ) −
Suppose further that there is a nonnegative function , n = 1, 2, . . . which contradicts the assumption that
δ(P kP 0 ) defined on P × P with the following “three- δ is nonnegative.
points property,” Suppose assumptions (i)-(iii) hold. Pick a sub-
sequence Pnk → P ∗ , as k → ∞ and let
δ(P kP ∗ (Q)) + D(P ∗ (Q)kQ) Q∗ = arg minQ∈Q D(P ∗ kQ). Our basic limit re-
≤ D(P kQ), ∀P ∈ P, Q ∈ Q, sult and assumption (i) imply that (P ∗ , Q∗ ) achieves
minP,Q D(P kQ). But it is easy to see that (??) im-
as well as the following “four-points property,” plies that if (P, Q) achieves minP minQ D(P kQ) then
δ(P kPn+1 ) ≤ δ(P kPn ) for every n. Thus δ(P ∗ kPn ) must
D(P 0 kQ0 ) + δ(P 0 kP )
be nondecreasing, and, by assumption (iii), its limit must
≥ D(P 0 kQ∗ (P )), ∀P, P 0 ∈ P, Q0 ∈ Q. be 0. Using assumption (iii) once more, we conclude that
Pn → P ∗ . The final inequality in the statement of the
Let Q0 be an arbitrary member of Q and recursively theorem then follows from (??) by replacing (P, Q) by
define (P ∗ , Q∗ ). This completes the proof of the theorem.
Pn = arg min D(P kQn−1 ), Now we wish to apply the theorem to the case when
P ∈P
Qn = arg min D(Pn kQ). (8) D(P kQ) is the divergence and P and Q are convex, com-
Q∈Q pact sets of nonnegative measures on A. No assumption
that the measures are probability distributions is made
Then
at this point; hence, in particular, D(P kQ) may have
P (a) ≥
P P
lim D(Pn kQn ) = inf D(P kQ). negative values. Of course, if Q(a) then
n→∞ P ∈P,Q∈Q D(P kQ) ≥ 0. Furthermore, the quantity
X P (a)

If, in addition, (i) minQ∈Q D(P kQ) is continuous in P , δ(P kQ) = P (a) log − (P (a) − Q(a)) log e ,
(ii) P is compact, and (iii) δ(P kPn ) → 0 iff Pn → P , then a Q(a)
for the iteration (??) Pn will converge to some P ∗ , such
is always nonnegative and vanishes iff P = Q. This δ sat-
that if Q∗ = arg minQ∈Q D(P ∗ kQ) then D(P ∗ kQ∗ ) =
isfies assumption (iii) of the theorem as well as the three-
minP ∈P,Q∈Q D(P kQ) and, moreover, δ(P ∗ kPn ) ↓ 0 and
points and four-points properties. We verify the four-
points property and leave the verification of the other
D(Pn kQn ) − D(P ∗ kQ∗ ) ≤ δ(P ∗ kPn−1 ) − δ(P ∗ kPn ).
properties to the reader. Let Q∗ = arg minQ∈Q , let Q0 be
an arbitrary member of Q, and set Qt = (1−t)Q∗ +tQ0 ∈
Q, 0 ≤ t ≤ 1. Then
Proof. We have, by the three-points property, 1
0≤ [D(P kQt ) − D(P kQ∗ )] =
t
δ(P kPn+1 ) + D(Pn+1 kQn ) ≤ D(P kQn ),
d
D(P kQt ) t=t̃ , 0 < t̃ ≤ t.
and, by the four-points property dt
With t → 0 it follows that
D(P kQn ) ≤ D(P kQ) + δ(P kPn ), X (Q∗ (a) − Q0 (a)) log e
0 ≤ lim P (a)
t̃→0 a (1 − t̃)Q∗ (a) + t̃Q0 (a)
for all P ∈ P, Q ∈ Q. Hence
X Q∗ (a) − Q0 (a)
= P (a) log e. (10)
δ(P kPn+1 ) ≤ D(P kQ) − D(Pn+1 kQn ) + δ(P kPn ) (9) a Q∗ (a)

If we then combine this with the fact that log t ≥ (1 − Such (P ∗ , Q∗ ) can be achieved using Theorem ??. In-
1/t) log e) we obtain deed, to Qn−1 ∈ Q we can find Pn ∈ P minimizing
D(P kQn−1 ) for P ∈ P merely by letting
P 0 (a)Q∗ (a)
P 0 (a) log 0
X 
− P (a) − P (a) log e ≥ 0,
a Q0 (a)P (a) P̃ (T a)
Pn (a) = Qn−1 (a) ,
QTn−1 (T a)
which is just a rewritten version of the four-points prop-
erty. for by definition P T = P̃ , if P ∈ P. The alternate step,
Remark 5 Suppose we are given a convex family F of finding Qn ∈ Q minimizing D(Pn kQ) is ‘easily” found,
random variables defined on a finite probability space by assumption.
(Ω, P ) and let X ∗ be a member of the family for which
E(log X) is maximal. Then, letting X and X ∗ play the Now we apply the preceding discussion to a mixture
role of Q0 and Q∗ , respectively, the inequality (??) gives distribution problem. Let Q̃ be the set of all Q̃ of the
form Q̃(b) = ki=1 ci µi (b), where ci ≥ 0, ci = 1, and
P P
that
µi (b) are arbitrary nonnegative measures.
X∗ − X X∗
   
E ≥ 0, i. e., E ≥ 1, ∀X ∈ F. Goal: Find (c∗1 , . . . , c∗k ) achieving minQ̃ D(P̃ kQ̃), for a
X X
given P̃ .
The finiteness assumption is not really needed here, for
all that is needed is that max E(log X) is attained. This Solution. Let A be the set of all pairs (i, b), 1 ≤ i ≤
is known as Cover’s inequality. k, b ∈ B, and let T (i, b) = b. Define P and Q as above
and apply the iteration scheme. Thus
The result of Theorem ?? can be applied to the prob- k
lem of minimizing divergence from a set of distributions
X
P = {P : P (i, b) = P̃ (b)},
that is the image of a “nice” set in some other space. Let i=1
T : A 7→ B be a given mapping and for any P on A write Q = {Q: Q(i, b) = ci µi (b)}.
P T for its image on B, that is, P T (b) = a:T a=b P (a).
P
Start with an arbitrary (c01 , . . . , c0k ) with positive com-
Problem 1. Given a set Q̃ = {QT : Q
∈ Q} of distri- ponents that sum to 1; this defines Q0 (i, b) = c0i µi (b).
butions on B for some set Q of distributions on A, If Qn−1 (i, b) = cn−1 µi (b) is already defined let Pn be
i
minimize D(P̃ kQ̃), subject to Q̃ ∈ Q̃ for some given determined as above, that is,
P̃ on B. Here it is assumed that to any P ∈ P, a
Q ∈ Q minimizing D(P kQ) can “easily” be found. P̃ (b)
Pn (i, b) = Qn−1 (i, b)
Q̃n−1 (b)
Problem 2. The same but with the role of P and Q
interchanged. P̃ (b)
= cn−1
i µi (b) P n−1 .
j cj µj (b)
The first problem is relevant for maximum likelihood
estimation based on partially observed data, when es- The next step is to find Qn ∈ Q minimizing
timation from the full data would be “easy.” The two D(Pn kQ). To do this put Pn (i) = b Pn (i, b), Pn (b|i) =
P
problems can be solved in similar ways; we concentrate Pn (i, b)/Pn (i) and use the relation Q(i, b) = ci µi (b) to
on the first one. write
Let P be the set of all P on A such that P T = P̃ . Here
k X
P̃ and the elements of Q̃ are not necessarily probability X Pn (i, b)
D(Pn kQ) = Pn (i, b) log
Q(i, b)
P P
distributions; indeed, either P̃ (b) or Q̃(b) maybe i=1 b
less than, equal to, or greater than 1. Nevertheless the
partitioning inequality gives D(P kQ) ≥ D(P T kQT ) with in the form
equality iff X 
Pn (i) Pn (b|i)

P (a) P T (T a) D(Pn kQ) = Pn (i)Pn (b|i) log + log .
= T , ∀a ∈ A. i,b
ci µi (b)
Q(a) Q (T a)
(11)
Hence P ∗ ∈ P, Q∗ ∈ Q achieve minP,Q D(P kQ) iff Q̃∗ = Note that i Pn (i) = b P̃n (b), and hence D(Pn kQ) is
P P

Q∗T achieves minQ̃ D(P̃ kQ̃).


P
minimized if in (??) we set ci = Pn (i)/ b P̃ (b) (using

the fact that Pn (b|i) is a probability distribution for fixed 5 Redundancy.
i.) Thus the recursion for cni will be
P  This and the next two sections are concerned with
b
PP̃ (b)µ i (b) measuring the performance of codes. The symbol Cn will
cn−1 µj (b)
cni = cn−1 
 j j

, denote a binary prefix n-code with length function L =
i P
b P̃ (b) L(Cn , n). The (pointwise) redundancy R = RP (Cn , n) of
 
the code Cn relative to a distribution P on An is defined
and by our general theorem, cni → c∗i achieving by
1
minQ̃ D(P̃ kQ̃). R(xn1 ) = L(xn1 ) − log .
P (xn1 )
Remark 6 The finiteness of B is not essential for the The expected redundancy is
convergence of this iteration. In particular, using the
X X 1
remark with Cover’s inequality, Remark ??, for positive R̄ = E(R) = L(xn1 )P (xn1 ) − P (xn1 ) log .
valued random variables X1 , . . . , Xk , the weights c∗i max- xn xn
P (xn1 )
P 1 1
imizing E(log i ci Xi ) can be found by the same itera-
tion, i. e., The Shannon code determined by the length function
! L(xn1 ) = d− log P (xn1 )e produces essentially zero redun-
Xi dancy, and, is almost the optimal code for P in that
cni = cn−1
i E .
P n−1
it produces expected coding length within 1 bit of the
j cj Xj
minimal expected coding length. Thus, in general, re-
This is Cover’s portfolio algorithm. dundancy gives an approximate measure of the cost in
using the code Cn on P -sequences, rather than the opti-
Remark 7 The “decomposition of mixtures” algorithm mal code.
can be used also if the individual µi ’s depend on some Note that the expected redundancy E(R) = E(L) −
parameter to be estimated, i. e., when H(P ) is always nonnegative, but the pointwise redun-
( ) dancy R(xn1 ) can take negative values. We will show
that for random processes the pointwise redundancy is
X
Q̃ = Q̃: Q̃(b) = ci µ(b|θi ) .
i essentially nonnegative. A random process is an infi-
nite sequence X1 , X2 , . . . of A-valued random variables
Then, from (??), θin is chosen to minimize the divergence defined on probability space (Ω, P ∗ ). The Kolmogorov
X Pn (y|i) representation of a process produces the measure P on
Pn (b|i) log . the space A∞ of infinite sequences drawn from A, which
b
µi (y|θ)
is defined by requiring that the value of P on cylinder
Unfortunately, the general theorem is not applicable to sets [an1 ] = {x ∈ A∞ : xn1 = an1 } be given by the formula
this case, because Q̃ and Q are not convex. Indeed, the
P ([an1 ]) = Prob (Xi = ai , : 1 ≤ i ≤ n) .
iteration may get stuck at a local minimum and fail to
find the global one. If P is the Kolmogorov measure determined by a process
we shall write P n for the measure on An determined
by P n (an1 ) = P ([an1 ]). (In cases where n is clear from
the context we write P in place of P n .) Note that a
process defines a sequence of distributions P n , where P n
is defined on An . The key difference between the concept
of process and the general concept of sequences {Pn } of
distributions is that P n+1 is required to be related to P n
by the (Kolmogorov consistency) formula

P n+1 (xn+1
X
P n (xn1 ) = 1 ).
xn+1

In the remainder of this section, P will denote the


Kolmogorov measure of a random process {Xn } and

Cn will denote a binary prefix n-code, for n = 1, 2, . . .. and it is sufficient to show that
The word code will mean either the sequence {Cn } or
P∞ P
2 −L(xn1 ) ≤ 1.
n=1 xn
1 ∈Bn (c)
one member Cn of this sequence; the context will make If the code is a strong prefix code then
clear which possibility is being used. Our first result ex- ∞
n
2−L(x1 ) ≤ 1,
X X
presses the idea that for random processes the pointwise
redundancy is essentially nonnegative, in that it is very n=1 xn
1 ∈Bn (c)
unlikely to asymptotically take large negative values.
and hence we are done. If the code is a Shannon code
Theorem 11 Let {cn } be a sequence of positive num- for a process Q then
bers satisfying 2−cn < ∞. Then R(xn1 ) ≥ −cn , even-
P
n
2−L(x1 ) ≤
X X
tually almost surely. Q(xn1 ) = Q(B̃n (c)),
xn
1 ∈Bn (c) xn
1 ∈Bn (c)
Proof. Let
n
where B̃n (c) is the union of the [xn1 ] for which xn1 ∈ Bn (c).
−c
An (c) = {xn1 : R(xn1 ) < −c} = {xn1 : 2L(x1 ) P (xn1 ) < 2 }. Since these sets are disjoint, the sum of their Q-measures
cannot exceed 1 and we again reach the desired result
Then n
that ∞ 2−L(x1 ) ≤ 1. This completes the
P P
X n=1 xn1 ∈Bn (c)
P (An (c)) = P (xn1 ) proof of the corollary.
xn
1 ∈An (c)
n If the sequences to be encoded are sample paths from
< 2−c 2−L(x1 ) ≤ 2−c ,
X
some known random process P then we cannot do sig-
xn
1 ∈An (c) nificantly better (in the sense of minimizing expected
where we used the Kraft inequality. Hence redundancy) than we can by using the Shannon code,

which produces expected redundancy of at most 1. In
X
Prob (R(X1n ) < −cn ) = many typical situations, however, the process P is un-
n=1 known, although it may be known to belong to some
∞ ∞ parametric family. In such cases it is difficult to design
2−cn < ∞.
X X
P (An (cn )) ≤ codes for which the redundancy stays bounded. The fol-
n=1 n=1 lowing result shows that if the code is a Shannon code for
The theorem now follows from the Borel-Cantelli princi- some Q then the redundancy will indeed be unbounded,
ple. unless Q is already very nearly the same as P .
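To see the redundancy definition, and the effect of coding with a mismatched model, numerically, the following sketch (with invented Bernoulli parameters) codes samples from P with the Shannon code built for a different i.i.d. model Q; the per-symbol redundancy approaches D(P‖Q), consistent with the unboundedness asserted below when Q is singular with respect to P.

```python
import math, random

def shannon_length(q, x):
    """Length of the Shannon codeword for the block x under an i.i.d. model q."""
    return math.ceil(-sum(math.log2(q[s]) for s in x))

def redundancy_rate(p, q, n, seed=0):
    """Sample x_1^n i.i.d. from p, code it with the Shannon code for q, and return
    R(x_1^n)/n, where R(x_1^n) = L(x_1^n) - log(1/P(x_1^n))."""
    rng = random.Random(seed)
    x = rng.choices(list(p), weights=list(p.values()), k=n)
    L = shannon_length(q, x)
    log_inv_P = -sum(math.log2(p[s]) for s in x)
    return (L - log_inv_P) / n

p = {0: 0.8, 1: 0.2}          # true source (illustrative)
q = {0: 0.5, 1: 0.5}          # mismatched coding distribution
D = sum(pa * math.log2(pa / q[a]) for a, pa in p.items())
for n in (100, 1000, 10000):
    print(n, round(redundancy_rate(p, q, n), 4), "vs D(P||Q) =", round(D, 4))
```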

A sharper lower bound can be obtained for the case Theorem 12 If Q is singular with respect to P then
when there is a process Q such that each Cn is a Shannon the P -redundancy of the Shannon code with respect to
code for Qn , or in the case when the sequence of codes Q goes to infinity with probability 1.
{Cn } satisfies the strong prefix property, that is, for m 6= Proof. The redundancy equals log(P (xn1 )/(Q(xn1 )), up to
n the code word for xm 1 is not a prefix of the code word 1 bit, hence it suffices to show that Zn = Q(xn1 )/P (xn1 )
for x1 unless m ≤ n and xm
n n
1 is a prefix of x1 . We state goes to 0, with probability 1. Towards this end, let Fn
this as a corollary as its proof is a modification of the be the smallest σ-algebra for which the sequences xn1 are
preceding proof. measurable, that is, the σ-algebra generated by the cylin-
Corollary 3 For the Shannon code with respect to a der sets [xn1 ], xn1 ∈ An . Then {Zn } is a martingale with
process Q, or for a strongly prefix code, the pointwise respect to the increasing sequence {Fn } and therefore
redundancy R(xn1 ) is bounded below by a random vari- converges almost surely to some random variable Z. It
able and E(inf n R(xn1 )) > − log e. suffices to show that Z = 0.
Since Q is assumed to be singular with respect to P
Proof. Let there is a measurable set à ⊂ A∞ such that P (Ã) =
1, Q(Ã) = 0. Let µ be the measure defined by
Bn (c) = {xn1 : R(xn1 ) < −c, R(xk1 ) ≥ −c, k < n}. Z
As in the proof of the theorem, µ(B) = P (B) + Q(B) + Z dP.
B
n
P (Bn (c)) < 2−c 2−L(x1 ) ,
X
Since ∪n Fn generates the entire σ-algebra, for every  >
xn
1 ∈Bn (c)
0 there exists Ãm ∈ Fm , for sufficiently large m, such

that the symmetric difference between à and Ãm has The entropy theorem implies that
µ-measure less than . In particular,
1 P (xn1 )
log → D∞ (P kU ) < , P ∈ N , (13)
U (xn1 )
Z
n
P (Ãm ) > 1 − , Q(Ãm ) < , Z dP > E(Z) − .
Ãm
for P -almost all infinite sequences (the exceptional set
But for n ≥ m the martingale property gives may depend on U .) This means that the set of all pairs
(x, U ), where x ∈ A∞ , for which (??) does not hold, has
R
Ãm Zn dP = Q(Ãm ), and therefore Fatou’s lemma gives
Z Z P × ν-measure 0; this in turn implies that for P -almost
Z dP ≤ lim inf Zn dP = Q(Ãm ) < . all x, the set of U 0 s not satisfying (??) has ν-measure 0
Ãm n→∞ Ãm (in both cases, by Fubini’s theorem.)
It follows that E(Z) < 2 and hence that E(Z) = 0. Thus, for P -almost all x the integrand in (??) goes
Since Z ≥ 0, we must have Z = 0, with probability 1, as to infinity for ν-almost all P ∈ N∞ . It follows by Fa-
claimed. This completes the proof of the theorem. tou’s lemma that the integral itself goes to +∞, which
completes the proof of the theorem.
Good codes are those for which the P -redundancy
grows slowly as n → ∞. The following theorem gives An important class of examples of codes that satisfy
a condition that guarantees the existence of such codes, the hypotheses of the preceding theorem are obtained
under some restrictions about the process P . In this and as follows. Let Γ be a given (countable) list of stationary
later results, the limiting divergence-rate for processes is ergodic distributions, and let each U ∈ Γ be assigned a
defined by “description length” L(U ), subject to the Kraft inequal-
ity, U ∈Γ 2−L(U ) ≤ 1. Then xn1 can be encoded by a
P
1 prefix code of length
D∞ (P kQ) = lim D(P n kQn ),
n→∞ n
1
 
provided this limit exists. The limit is known to exist min L(U ) + log ;
U ∈Γ U (xn1 )
for stationary P if Q is i.i.d. or finite-order Markov, but
not necessarily otherwise. namely, choose U ∈ Γ achieving this minimum, encode
xn1 by the Shannon code with respect to U , and add a
Theorem 13 Suppose P is stationary ergodic and let preamble of length L(U ) to identify U (here, the 1 bit
Q
R
be a mixture of stationary ergodic distributions, Q = error from dropping the upper integer part symbol is
U ν(dU ), such that for every  > 0 the set of all finite- disregarded.) Let us call this the code generated by the
order Markov measures U with D∞ (P kU ) <  has pos- list Γ.
itive ν-measure. Then for the Shannon code with re-
spect to Q, the redundancy satisfies R(xn1 )/n → 0, al- Theorem 14 If to any  > 0 there is some finite-order
most surely. Markov code in the list Γ with D∞ (P kU ) < , then
the redundancy of the code generated by Γ satisfies
Proof. We have to prove that for every  > 0, R(xn1 )/n → 0, almost surely.
log(P (xn1 )/Q(xn1 )) < n, eventually almost surely, or,
Proof. Set Q = U ∈Γ 2−2L(U ) U ; then Q satisfies the
P
equivalently,
hypotheses of Theorem ??, hence
P (xn1 ) < 2n Q(xn1 ), eventually a.s. 1 P (xn1 )
log → 0, a.s. (14)
Let N be the set of finite-order Markov measures U for n U (xn1 )
which D∞ (P kU ) <  and note that Now we want to show that R(xn1 )/n → 0, a.s., where R
Z Z is the redundancy of the code defined by the list Γ. To-
Q(xn1 ) = U (xn1 )ν(dU ) ≥ U (xn1 )ν(dU ), wards this end, note that the condition U ∈Γ 2−L(U ) ≤ 1
P
N
implies that
so that
Q(xn1 ) ≤ max 2−L(U ) U (xn1 ),
2n Q(xn1 ) 2n U (xn1 ) U ∈Γ
Z
≥ ν(dU ) ≥
P (xn1 ) N P (xn1 ) from which it follows that
P (xn
 
1) 1 1
Z  
n −log U (xn )
2 1 ν(dU ). (12) log ≥ min L(U ) + log ,
N Q(xn1 ) U ∈Γ U (xn1 )

which implies that R(xn1 ) ≤ log(P (xn1 )/Q(xn1 )). This, The preceeding inequality implies that P̂n = P and com-
combined with (??) implies our desired result that pletes the proof of the theorem.
R(xn1 )/n → 0, a.s. This completes the proof of the the-
orem. Now let us be given a finite or countable list of para-
metric families of (stationary, ergodic) processes {Pθ : θ ∈
The following principle, called the minimum descrip- Θγ , where γ ∈ Γ, and to each family on the list, i. e.,
tion length (MDL) principle has been suggested by Ris- to each γ ∈ Γ suppose there is assigned a codeword of
sanen. length L(γ) describing this family, such that the Kraft
inequality holds. Further, let on each parameter set Θγ
Principle. The statistical information in data
be given a “prior” νγ , i. e., νγ is a probability measure
is best extracted when a possibly short descrip-
on Θγ . We also assume that the mixture distributions
tion of the data is found. The distribution in-
ferred from the data is the one that leads to
Z
Qγ = Pθ νγ (dθ), γ ∈ Γ
the shortest description, taking into account Θγ
that the inferred distribution itself must be de-
scribed. are mutually singular. (In particular, these mean that
the families {Pθ : θ ∈ Θγ are essentially disjoint.)
Let Γ be a given finite or countably infinite list of
stationary ergodic processes on the space A∞ . Let to Theorem 16 There exists subsets Θ̃γ ⊂ Θγ of full mea-
each U ∈ Γ a codeword of length L(U ) be assigned as sure 1, such that if P ∈ Θ̃γ ∗ , for some γ ∗ ∈ Γ, then
a description of U ; these lengths must satisfy the Kraft " #
inequality. Then, given a sample xn1 , the MDL estimate 1
min L(γ) + log
P̂n of the unknown distribution P is P̂n = U , where U γ∈Γ Qγ (xn1 )
achieves minU ∈Γ [L(U ) − log U (xn1 )].
is attained for γ = γ ∗ , eventually almost surely.
Theorem 15 If P ∈ Γ then P̂n = P , eventually almost
surely. Proof. In other words, the family containing the true
distribution will be found with probability 1, unless P is
Proof. Let Q =
P −L(U ) U,
U ∈Γ−{P } 2 and note that in a subset of this family having νγ -measure 0.
Exactly as in the proof of the preceeding theorem (re-
Q(xn1 ) ≥ max 2−L(U ) U (xn1 ), placing U by Qγ and L(U ) by L(γ)) we obtain that for
U ∈Γ−{P }
sufficiently large n,
that is, " #
1
1

1

min L(γ) + log
log ≤ min L(U ) + log . (15) γ∈Γ Qγ (xn1 )
n
Q(x1 ) U ∈Γ−{P } U (xn1 )
will be attained for γ = γ ∗ , with Qγ ∗ -probability 1. Let
Now, Q is singular with respect to P , since each sta-
F be the set of all x ∈ A∞ for which this “almost sure”
tionary, ergodic U 6= P is singular with respect to the
statement is true, so that Qγ ∗ (F c ) = 0. Since by defini-
stationary, ergodic process P , hence by Theorem ?? the
tion
redundancy of the Shannon code with respect to Q goes
Z
Qγ ∗ (F c ) = Pθ (F c )νγ ∗ (dθ),
to +∞, that is, Θγ ∗

1 1 it follows that νγ ∗ ({θ: Pθ (F c ) > 0}) = 0 and we can take


log − log → ∞, a.s.
Q(xn1 ) P (xn1 )
Θ̃γ ∗ = Θγ ∗ − {θ: Pθ (F c ) > 0}.
Using the bound (??) we therefore have
This completes the proof of the theorem.
1 1
 
min L(U ) + log n − log → ∞, a.s.,
U ∈Γ−{P } U (x1 ) P (xn1 ) Remark 8 The hypotheses of Theorem ?? are fulfilled,
in particular, when the parameter sets Θγ are subsets of
hence, for sufficiently large n Euclidean spaces of different dimensions and νγ is abso-

1

1 lutely continuous with respect to the Lebesgue measure
min L(U ) + log > log + L(P ). for the corresponding dimension.
U ∈Γ−{P } U (xn1 ) P (xn1 )

6 Redundancy bounds. or, equivalently,

Some techniques for obtaining bounds on redundancy n


Y n(xj |xj−1
1 ) + 1 + αxj
Q(xn1 ) = . (17)
for i.i.d processes will be discussed in this section. Con- j−1+
P
j=1
αi + k
sider the i.i.d. process with alphabet A = {1, . . . , k}
with distribution P . We then have where n(xj |xj−1
1 ) is the number of occurences of the sym-
k
Y bol xj in the “past” xj−1
1 .
P (xn1 ) = P (i)ni ,
i=1 Theorem 17 If Q is defined by (??) with αi = −1/2, ∀i,
the redundancy always satisfies
where ni is the number of times i occurs in xn1 . This
probability is maximum if P (i) = ni /n, hence the maxi- Γ(n + k2 )Γ( 12 )
R(xn1 ) ≤ log ≤
mum likelihood estimate is given by Γ(n + 12 )Γ( k2 )
k 
k−1 Γ(k/2)
Y ni ni
 ≤ log n − log + n
PML (xn1 ) = . 2 Γ(1/2)
i=1
n
where n → 0 as n → ∞.
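Formula (17) with α_i = −1/2 gives a simple sequential rule for computing Q(x_1^n); the sketch below (alphabet size and sample sequence are arbitrary illustrative choices) evaluates it and checks the bound of Theorem 17 numerically against the maximum-likelihood bound (16).

```python
from math import lgamma, log2, log

def kt_log2_prob(x, k):
    """log2 Q(x_1^n) for the Dirichlet(-1/2) mixture, computed sequentially from (17):
    Q(x_j | x_1^{j-1}) = (n(x_j | x_1^{j-1}) + 1/2) / (j - 1 + k/2), alphabet {0,...,k-1}."""
    counts = [0] * k
    lp = 0.0
    for j, sym in enumerate(x, start=1):
        lp += log2((counts[sym] + 0.5) / (j - 1 + k / 2))
        counts[sym] += 1
    return lp

def theorem17_bound(n, k):
    """log2 of Gamma(n + k/2) Gamma(1/2) / (Gamma(n + 1/2) Gamma(k/2))."""
    return (lgamma(n + k / 2) + lgamma(0.5) - lgamma(n + 0.5) - lgamma(k / 2)) / log(2)

k, x = 3, [0, 0, 1, 2, 0, 1, 0, 0, 2, 1, 0, 0]
n = len(x)
counts = [x.count(a) for a in range(k)]
log2_PML = sum(c * log2(c / n) for c in counts if c > 0)    # maximum-likelihood probability
R = log2_PML - kt_log2_prob(x, k)                           # redundancy bound (16) w.r.t. P_ML
assert R <= theorem17_bound(n, k) + 1e-9
print(round(R, 4), round(theorem17_bound(n, k), 4))
```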
When encoding with respect to an auxiliary distribution
Q, the redundancy satisfies (disregarding at most 1 bit) Proof. The second inequality is a simple consequence of
the following simple bound Stirling’s formula for the Γ-function, so it is enough to
prove the first inequality.
P (xn1 ) PML (xn1 ) For αi ≡ −1/2 we have
R(xn1 ) = log ≤ log . (16)
Q(xn1 ) Q(xn1 ) k
Γ( k2 ) Y Γ(ni + 12 )
Q(xn1 ) = =
Let us take for Q the mixture distribution Q(xn1 ) = Γ(n + k2 ) i=1 Γ( 12 )
U (xn1 )ν(p) dp, with a Dirichlet prior having density
R
Qk h i
i=1 (ni − 12 )(ni − 32 ) · · · 12
P  = (18)
Γ k
i=1 αi
k
+k Y (n − 1 + k2 )(n − 2 + k2 ) · · · k2
ν(p) = Qk pαi i , p = (p1 , . . . , pk ).
i=1 Γ (αi + 1) i=1 Note that, in particular, if xn1 consists of identical sym-
bols, say, xi ≡ a, then
For α1 = . . . = αk = −1/2 we will get a sharp upper
bound on the redundancy (??), a bound not depending Γ( k2 )Γ(n + 12 )
on the true distribution P nor xn1 . Before we state and Q(xn1 ) = ;
Γ(n + k2 )Γ( 12 )
derive this bound we obtain a representation for Q that
will be useful in constructing the Shannon code for Q.. hence to prove Theorem ?? it is enough to show that
For a Dirichlet prior with arbitrary αi > −1, ∀i, we R(xn1 ) ≤ log(1/Q(xn1 )). The simple upper bound (??)
have then tells us that it is enough to show that
Z
k 
Q(xn1 ) U (xn1 )ν(p) dp = ni ni Q(xn1 )

= Y
PML (xn1 ) ≤ ≤ ,
P
k

i=1
n Q(x̃n1 )
k Γ i=1 αi +k
Z Y
= pni i +αi dp · Qk where x̃i ≡ a. The identity (??) can then be used to see
i=1 i=1 Γ (αi + 1)
P
k
 that it is enough to prove that
Γ k
i=1 αi +k Y Γ (ni + αi + 1)
= . Qk h i
Γ (n +
P
αi + k) i+1 Γ (αi + 1) k 
Y ni ni

i=1 (ni − 12 )(ni − 32 ) · · · 12
≤ ,
i=1
n (n − 12 )(n − 32 ) · · · 12
Using the functional equation Γ(x + 1) = xΓ(x) we see
that Q(xn1 ) is given by the ratio which can be converted to
k  Qk
ni ni − 1) · · · (ni + 1)]
i=1 [2ni (2ni
Qk 
i=1 [(ni + αi )(ni − 1 + αi ) . . . (1 + αi ]
Y
≤ (19)
(n − 1 +
P
αi + k)(n − 2 +
P P
αi + k) . . . ( αi + k) i=1
n 2n(2n − 1) · · · (n + 1)

16
since

    (n − 1/2)(n − 3/2) · · · (1/2) = (1/n!) · n(n − 1/2)(n − 1)(n − 3/2) · · · 1 · (1/2) = (2n)!/(2^{2n} n!) = [2n(2n − 1) · · · (n + 1)] / 2^{2n}.

At last we have arrived at the assertion we shall prove, namely, (19). This will be proved if we show that it is possible to assign to each ℓ = 1, . . . , n, in a one-to-one manner, a pair (i, j), 1 ≤ i ≤ k, 1 ≤ j ≤ n_i, such that

    n_i/n ≤ (n_i + j)/(n + ℓ).    (20)

Now, for any given ℓ and i, (20) holds iff j ≥ n_i ℓ/n. Hence the number of those 1 ≤ j ≤ n_i that satisfy (20) is greater than n_i − n_i ℓ/n, and the total number of pairs (i, j), 1 ≤ i ≤ k, 1 ≤ j ≤ n_i, satisfying (20) is greater than

    Σ_{i=1}^k (n_i − n_i ℓ/n) = n − ℓ.

It follows that if we assign to ℓ = n any (i, j) satisfying (20) (i.e., i may be chosen arbitrarily and j = n_i), and then recursively assign to each ℓ = n − 1, n − 2, etc., a pair (i, j) satisfying (20) that was not assigned previously, we never get stuck; at each step there will be at least one “free” pair (i, j) (because the total number of pairs (i, j) satisfying (20) is greater than n − ℓ, the number of pairs already assigned). This completes the proof of the theorem. □
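As a numerical sanity check (an illustration, not part of the notes), the following sketch compares the pointwise redundancy log [P_ML(x_1^n)/Q(x_1^n)] with the bound log [Γ(n + k/2)Γ(1/2)/(Γ(n + 1/2)Γ(k/2))] of Theorem 17 on a few random samples; the alphabet size and sample length are arbitrary choices:

    import math, random

    def log2_ml(counts, n):
        return sum(c * math.log2(c / n) for c in counts if c > 0)

    def log2_kt(counts, n, k):
        """log2 of the Dirichlet(-1/2) mixture probability, via the closed form (18)."""
        val = math.lgamma(k / 2) - math.lgamma(n + k / 2)
        for c in counts:
            val += math.lgamma(c + 0.5) - math.lgamma(0.5)
        return val / math.log(2)

    def theorem17_bound(n, k):
        return (math.lgamma(n + k / 2) + math.lgamma(0.5)
                - math.lgamma(n + 0.5) - math.lgamma(k / 2)) / math.log(2)

    random.seed(1)
    k, n = 4, 500
    for _ in range(5):
        x = [random.randrange(k) for _ in range(n)]
        counts = [x.count(a) for a in range(k)]
        redundancy = log2_ml(counts, n) - log2_kt(counts, n, k)
        assert redundancy <= theorem17_bound(n, k) + 1e-9
    print("bound", theorem17_bound(n, k), "holds on all samples tried")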
Our next goal is to show that the result of the preceding theorem is “best possible,” even if we don’t insist on a uniformly small redundancy (i.e., on a bound valid for every x_1^n), but want only the average redundancy E(R) to be small.

Consider any prefix code. Without loss of generality (for the purpose of bounding the redundancy) we may assume that it satisfies the Kraft inequality with the equality sign, and therefore that it is a Shannon code with respect to some Q (not necessarily of mixture type). Then

    E(R(X_1^n)) = E[log P(X_1^n)/Q(X_1^n)] = D(P^n‖Q).

Since P is unknown, we want to select Q in such a way that no matter what P is the average redundancy will be small, that is, we want Q to minimize

    sup_P E_P(R_P(X_1^n)) = sup_P D(P^n‖Q).

Suppose we choose P at random with prior distribution ν; then the observation of x_1^n provides information about the unknown P, measured by the mutual information

    I(ν) = H(Q_ν) − ∫ H(P^n) ν(dP) = H(Q_ν) − n ∫ H(P) ν(dP),

where Q_ν is the mixture distribution, Q_ν = ∫ P^n(x_1^n) ν(dP). Even though this mutual information appears to be unrelated to the previous average redundancy, the remarkable fact is that

    inf_Q sup_P D(P^n‖Q) = sup_ν I(ν).

Indeed, the following lemma holds in general.

Lemma 5 Consider any noisy channel with input alphabet U = {1, . . . , ℓ} and output alphabet V = {1, . . . , m}, given by the probability distributions P_i on V governing the output if the input is i, i = 1, . . . , ℓ. For any input distribution π, let Q_π denote the output distribution and let

    I(π) = Σ_{i,j} π(i) P_i(j) log [P_i(j)/Q_π(j)] = Σ_i π(i) D(P_i‖Q_π)

be the mutual information between input and output. Then

    max_π I(π) = min_Q max_{1≤i≤ℓ} D(P_i‖Q).

Proof. The left side is known as the channel capacity. The lemma states that it equals the “radius” of the smallest “divergence ball” that contains all the P_i’s. To establish this relation first note that for any distribution Q on V,

    I(π) = Σ_{i=1}^ℓ Σ_{j=1}^m π(i) P_i(j) [log P_i(j)/Q(j) + log Q(j)/Q_π(j)] = Σ_{i=1}^ℓ π(i) D(P_i‖Q) − D(Q_π‖Q).

This identity shows that for any fixed π

    min_Q Σ_{i=1}^ℓ π(i) D(P_i‖Q) = I(π),

and hence

    max_π I(π) = max_π min_Q Σ_{i=1}^ℓ π(i) D(P_i‖Q).
The minimax theorem asserts that if f(x, y) is a continuous function of two variables ranging over convex, compact sets, which is concave in x and convex in y, then

    max_x min_y f(x, y) = min_y max_x f(x, y).

In our case the theorem can be applied and we get

    max_π I(π) = min_Q max_π Σ_{i=1}^ℓ π(i) D(P_i‖Q).

Since the inner maximum is clearly equal to the maximum of D(P_i‖Q) over i, the proof of the lemma is completed. □
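A small numerical illustration of Lemma 5 (not part of the notes): for a toy three-input channel, the sketch below computes max_π I(π) by the standard Blahut–Arimoto iteration and compares it with max_i D(P_i‖Q_{π*}), the divergence radius evaluated at the capacity-achieving output distribution; the channel matrix is an arbitrary example:

    import math

    def D(p, q):
        """Kullback-Leibler divergence in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def blahut_arimoto(P, iters=500):
        """Capacity max_pi I(pi) of the channel whose rows are the P_i (Blahut-Arimoto)."""
        L = len(P)
        pi = [1.0 / L] * L
        for _ in range(iters):
            q = [sum(pi[i] * P[i][j] for i in range(L)) for j in range(len(P[0]))]
            w = [pi[i] * 2 ** D(P[i], q) for i in range(L)]
            s = sum(w)
            pi = [wi / s for wi in w]
        q = [sum(pi[i] * P[i][j] for i in range(L)) for j in range(len(P[0]))]
        return sum(pi[i] * D(P[i], q) for i in range(L)), q

    # Toy channel: output distributions P_1, P_2, P_3 on a 3-letter output alphabet.
    P = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
    capacity, q_star = blahut_arimoto(P)
    radius = max(D(Pi, q_star) for Pi in P)
    print(capacity, radius)   # the two values coincide up to iteration accuracy, as Lemma 5 asserts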
The lemma is valid also in the case when the input alphabet X is infinite, provided the maximum with respect to π is replaced by “supremum.” We omit the proof of this more general case, even though this is what we need for lower bounding average redundancy. Indeed, using this result, we can state

Theorem 18 For any prefix code, the supremum over P of the expected redundancy is lower bounded by I(ν) = H(Q_ν) − n ∫ H(P) ν(dP), where ν is an arbitrarily chosen prior distribution.

Of course, the best bound is the supremum of I(ν), which is the channel capacity obtained when the set of all possible distributions P on A is considered as the input alphabet, A^n is the output alphabet, and the distribution on A^n corresponding to the input P is P^n.

7 Rissanen’s theorem.

Now we would like to establish the most general known lower bound on the redundancy of a prefix code, a result due to Rissanen.

Theorem 19 Let {P_θ}_{θ∈Θ} be any family of random processes, not necessarily i.i.d., possibly not even stationary, where Θ ⊂ R^k. Suppose that for each n ≥ n_0 there exists an estimator θ̂_n(x_1^n) with

    E_θ ‖θ̂_n − θ‖² ≤ c(θ)/n,    ∀θ ∈ Θ.    (21)

Then, for every ε > 0 there is a constant K > 0 such that for n ≥ n_0 and for every probability density or mass function g we have

    E_θ log [P_θ(x_1^n)/g(x_1^n)] ≥ (k/2) log n − K,    (22)

except possibly for a set of parameters θ of Lebesgue measure less than ε.

Proof. Suppose the set Θ_1 of those θ’s for which (22) doesn’t hold has Lebesgue measure at least ε. We will show that this supposition leads to a contradiction if K is suitably chosen. By Theorem 18 of the preceding section, it suffices to show that for some distribution ν on Θ_1 we have I(ν) > (k log n)/2 − K, where I(ν) is the mutual information of the channel having input alphabet Θ_1 and transition probabilities P_θ. (In Theorem 18, P_θ was i.i.d., but the proof is clearly valid in general.)

Now let c be so large that the subset Θ_2 of Θ_1 on which (21) holds with c(θ) ≤ c has Lebesgue measure at least ε/2. Let ν be the uniform distribution on Θ_2, let Z be a distribution chosen at random according to ν, and let Ẑ be an estimator of Z such that E‖Z − Ẑ‖² ≤ c/n. Then I(ν) ≥ I(Z ∧ Ẑ), by the data processing theorem.

Note that

    I(Z ∧ Ẑ) = H(Z) − H(Z|Ẑ) = H(Z) − H(Z − Ẑ|Ẑ) ≥ log(ε/2) − H(Z − Ẑ).

But the entropy of a k-dimensional random variable Y subject to E(‖Y‖²) ≤ α is maximized if its distribution is Gaussian with independent components of variance σ² = α/k, and this maximum entropy equals (k/2) log(2πeσ²). Applying this fact with Y = Z − Ẑ, α = c/n, σ² = c/(kn), it follows that

    H(Z − Ẑ) ≤ (k/2) log(2πe c/(nk)) = −(k/2) log n + B,

where B depends only on c and k. From this it follows that

    I(ν) ≥ I(Z ∧ Ẑ) ≥ (k/2) log n − B + log(ε/2),

which proves Rissanen’s theorem. □

Corollary 4 If for some subfamily {P_θ}_{θ∈Θ_0} of a family of sources satisfying (21) there exist universal codes whose average redundancy grows slower than (k/2) log n, i.e.,

    lim_{n→∞} [E_θ R_θ(X_1^n) − (k/2) log n] = −∞,    θ ∈ Θ_0,

then, necessarily, Θ_0 has Lebesgue measure 0.

Proof. This is immediate, because, without restricting generality, any code can be supposed to be a Shannon code with respect to some distribution g. □

The hypotheses of Rissanen’s theorem are satisfied, in particular, if {P_θ} is the family of all i.i.d. distributions
on a finite alphabet A = {1, . . . , k}. Then θ may be identified with the vector of the probabilities (P_1, . . . , P_k), and since these form a (k − 1)-dimensional subspace we get that (22) holds with k replaced by k − 1, thus proving that the universal codes constructed in the preceding section have asymptotically optimal redundancy.

Our results extend beyond the i.i.d. case; in particular they extend to the Markov case. A Markov chain with transition matrix P(j|i), 1 ≤ i, j ≤ k, is given by the joint distributions

    Prob(X_t = i_t, 0 ≤ t ≤ n) = P(i_0) ∏_{t=1}^n P(i_t | i_{t−1}).

We will suppose that the initial state i_0 is fixed, so that we can rewrite these probabilities in the form

    Prob(X_t = i_t, 0 ≤ t ≤ n) = ∏_{i=1}^k ∏_{j=1}^k P(j|i)^{n(i,j)},    (23)

where n(i, j) is the number of times the pair i, j occurs in adjacent places in x_0^n. Further, let n(i) = Σ_j n(i, j) denote the number of occurrences of i in the block x_0^{n−1}, and note that the probability in (23) is maximized for P̂(j|i) = n(i, j)/n(i), that is,

    P_ML(x_0^n) = ∏_{i=1}^k ∏_{j=1}^k P̂(j|i)^{n(i,j)}.

By analogy with the i.i.d. case we introduce the mixture distribution

    Q(x_1^n) = ∏_{i=1}^k ∫ ∏_{j=1}^k P(j|i)^{n(i,j)} ν(P(·|i)) dP,

where ν is the Dirichlet prior with α_i ≡ −1/2. Thus Q(x_1^n) is given by the product

    ∏_{i=1}^k [∏_{j=1}^k Γ(n(i,j) + 1/2)/Γ(1/2)] · [Γ(k/2)/Γ(n(i) + k/2)],

which is, in turn, equal to the product

    ∏_{i=1}^k [∏_{j=1}^k (n(i,j) − 1/2)(n(i,j) − 3/2) · · · (1/2)] / [(n(i) − 1 + k/2)(n(i) − 2 + k/2) · · · (k/2)].

The redundancy of the code based on the above auxiliary distribution can be bounded, using the corresponding i.i.d. result. It follows that

    R(x_1^n) ≤ Σ_{i=1}^k [(k − 1)/2] log n(i) + constant ≤ [k(k − 1)/2] log n + constant.
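The following sketch (an illustration, not part of the notes) evaluates this Markov mixture probability sequentially, running one Dirichlet(−1/2) estimator per state: the conditional probability of symbol j after state i is (n(i, j) + 1/2)/(n(i) + k/2), with the counts taken over the past only (this is the conditional formula recorded in Remark 9 below); the sample sequence and initial state are arbitrary choices:

    from math import log2
    from collections import defaultdict

    def markov_kt_log2prob(x, k, initial_state=0):
        """log2 of the per-state Dirichlet(-1/2) mixture Q for a first-order chain.

        x: symbols from {0, ..., k-1}; the initial state is fixed, as in (23)."""
        pair = defaultdict(int)   # n(i, j): transition counts seen so far
        out = defaultdict(int)    # n(i): visits to state i so far
        logq = 0.0
        prev = initial_state
        for sym in x:
            p = (pair[(prev, sym)] + 0.5) / (out[prev] + k / 2)   # conditional Q(sym | past)
            logq += log2(p)
            pair[(prev, sym)] += 1
            out[prev] += 1
            prev = sym
        return logq

    x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0]
    print(-markov_kt_log2prob(x, k=2))   # ideal code length (bits) of the Markov mixture code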
Again, this result is asymptotically best possible (up to the constant term). Indeed, on account of Rissanen’s theorem, even the average redundancy cannot be made significantly smaller than [k(k − 1)/2] log n on a set of positive Lebesgue measure in the parameter space needed to describe Markov chain probabilities.

Rissanen has provided an interesting application of his theorem to a special class of processes, which we will call the chains with finite context. A process has finite context if there is a positive integer m and a function f: A^m → S (where S is some finite set) such that

    Prob(X_t = i_t, 0 ≤ t ≤ n) = P(i_0) ∏_{t=1}^n P(i_t | f(i_{t−m}, . . . , i_{t−1})),

where it is assumed here that i_{−m+1}, . . . , i_0 is fixed. The elements of S are called “contexts” or “states” and P(i|ℓ) is interpreted as the “probability of the symbol i in the context ℓ.” Of course, any source that is Markov of order m has finite memory, and conversely; the context idea emphasizes that the probability of occurrence of the next symbol may depend on something much simpler than the entire past of length m, namely |S| may be much smaller than |A^m| = k^m, and it would be nice to take advantage of this fact in coding.

To obtain optimal bounds for processes with finite context we need make only a few changes in our preceding discussion. Let us fix S and f and let n(i, ℓ), i ∈ A, ℓ ∈ S, denote the number of pairs (i, ℓ) that occur among the pairs (i_t, s_{t−1}), where s_{t−1} = f(i_{t−m}, . . . , i_{t−1}), for t ≤ n. We then have

    P(x_1^n) = ∏_{ℓ∈S} ∏_{i=1}^k P(i|ℓ)^{n(i,ℓ)},

and the maximum likelihood probabilities

    P_ML(x_1^n) = ∏_{ℓ∈S} ∏_{i=1}^k P̂(i|ℓ)^{n(i,ℓ)},    P̂(i|ℓ) = n(i, ℓ)/n(ℓ),  n(ℓ) = Σ_i n(i, ℓ).

Again, as in the i.i.d. case, an asymptotically optimal universal code is the one based on the auxiliary mixture distribution (with the Dirichlet prior), as follows,

    Q(x_1^n) = ∏_{ℓ∈S} ∫ ∏_{i=1}^k P(i|ℓ)^{n(i,ℓ)} ν(P(·|ℓ)) dP,
which is equal to

    ∏_{ℓ∈S} [∏_{i=1}^k (n(i,ℓ) − 1/2)(n(i,ℓ) − 3/2) · · · (1/2)] / [(n(ℓ) − 1 + k/2)(n(ℓ) − 2 + k/2) · · · (k/2)],

and this code has redundancy

    R(x_1^n) ≤ Σ_{ℓ∈S} [(k − 1)/2] log n(ℓ) + constant ≤ |S| [(k − 1)/2] log n + constant.

Furthermore, Rissanen’s theorem implies that even the average redundancy cannot be substantially smaller than the above bound, for any universal code, except possibly for a vanishingly small set of parameter values, i.e., matrices P(i|ℓ).

Remark 9 Before leaving this topic of redundancy bounds let us mention an aspect of our discussion which has some practical value in designing codes. In the preceding section we derived the following formula (see (17)), valid for the i.i.d. case,

    Q(x_1^n) = ∏_{j=1}^n [n(x_j | x_1^{j−1}) + 1 + α_{x_j}] / [j − 1 + Σ α_i + k],

where n(x_j | x_1^{j−1}) is the number of occurrences of the symbol x_j in the “past” x_1^{j−1}. This formula suggests the conditional probabilities

    Q(x_j | x_1^{j−1}) = [n(x_j | x_1^{j−1}) + 1 + α_{x_j}] / [j − 1 + Σ α_i + k].

The latter formula can be used as the specification of the conditional probabilities used in arithmetic coding, a (practical) sequential procedure that yields the same asymptotics as the Shannon coding procedure.

Likewise, the Markov discussion in this section leads to the conditional formula

    Q(i_k | i_1, . . . , i_{k−1}) = [n_{k−1}(i, j) + 1/2] / [n_{k−1}(i) + k/2],    if i_{k−1} = i, i_k = j,

where n_{k−1}(i, j) is the number of consecutive (i, j)’s in the sequence i_0^{k−1} and n_{k−1}(i) = Σ_j n_{k−1}(i, j). These conditional probabilities are easily evaluated, because only simple updating is needed to go from k − 1 to k; arithmetic coding can then be performed.

The corresponding finite context formula is

    Q(i_k | i_1, . . . , i_{k−1}) = [n_{k−1}(i, ℓ) + 1/2] / [n_{k−1}(ℓ) + k/2],    if s_{k−1} = ℓ, i_k = i.

These can then be used to do arithmetic coding in the finite context case; such coding will also yield the same asymptotics as the Shannon code. Rissanen’s theorem implies that even the average redundancy cannot be substantially smaller than the bound above, for any universal code, except possibly for a vanishingly small set of parameters (i.e., matrices P(i|ℓ)).

8 Additions.

8.1 The scaling formula.

The scaling formula

    P*(a) = c_i Q(a),  a ∈ B_i,  where c_i = α_i / Q^B(i),    (24)

(see (??)) can be proved as follows. First, lumping does not increase divergence, that is,

    D(P‖Q) ≥ D(P^B‖Q^B).

The condition that P ∈ L is equivalent to the condition that P(B_i) = α_i, ∀i. If P*(a) = α_i Q(a)/Q(B_i), a ∈ B_i, then

    Σ_a P*(a) log [P*(a)/Q(a)] = Σ_i Σ_{a∈B_i} [α_i Q(a)/Q(B_i)] log [α_i/Q(B_i)] = D(α‖Q^B),    α = (α_1, . . . , α_k).

Thus, if P ∈ L then

    D(P‖Q) ≥ D(P^B‖Q^B) = D(α‖Q^B),

which establishes (24).
asymptotics as the Shannon coding procedure. 8.2 Pearson’s χ2 .
Likewise, the Markov discussion in this section leads
The chi-square function was defined on page ??. In
to the conditional formula
the case when P = P̂ , the empirical distribution the
Q(ik |i1 , . . . , ik−1 ) = formula can be rewritten as follows.
nk−1 (i, j) + 1/2 X (P̂ (a) − Q(a))2
, if ik−1 = i, ik = j,
nk−1 (i) + k/2 χ2 (P̂ , Q) =
a Q(a)
where nk−1 (i, j) is the number of consecutive (i, j)’s in
1 X (nP̂ (a) − nQ(a))2
the sequence ik−1
P
0 and nk−1 (i) = j nk−1 (i, j). These =
n nQ(a)
conditional probabilities are easily evaluated, because
1 2
only simple updating is needed to go from k − 1 to k; = χ ,
arithmetic coding can then be performed. n k−1
The corresponding finite context formula is where χ2k−1 =
P (nP̂ (a)−nQ(a))2
is Pearson’s classical chi-
nQ(a)
Q(ik |i1 , . . . , ik−1 ) = square function. Here nP̂ (a) gives the observed count,
nk−1 (i, `) + 1/2 while nQ(a) gives the expected count of the number of
, if sk−1 = `, ik = j.
nk−1 (`) + k/2 appearances of a.

20
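A quick numerical illustration (not part of the notes) of the identity n·χ²(P̂, Q) = χ²_{k−1} and of the bound D(P‖Q) ≤ χ²(P, Q) log e, which appears below as Problem 7 of Homework 1; the counts and the hypothesized Q are arbitrary choices:

    import math

    def chi2(P, Q):
        return sum((P[a] - Q[a]) ** 2 / Q[a] for a in Q)

    def divergence(P, Q):
        return sum(P[a] * math.log2(P[a] / Q[a]) for a in Q if P[a] > 0)

    counts = {"a": 18, "b": 30, "c": 12}                       # hypothetical observed counts
    n = sum(counts.values())
    P_hat = {a: c / n for a, c in counts.items()}              # empirical distribution
    Q = {"a": 0.25, "b": 0.5, "c": 0.25}                       # hypothesized distribution

    pearson = sum((counts[a] - n * Q[a]) ** 2 / (n * Q[a]) for a in Q)
    print(abs(n * chi2(P_hat, Q) - pearson) < 1e-9)            # True: n * chi2(P_hat, Q) equals Pearson's statistic
    print(divergence(P_hat, Q) <= chi2(P_hat, Q) * math.log2(math.e))  # True: D <= chi2 * log e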
8.3 Maximum entropy and Likelihood. Extensions of these results to the Markov and hidden
Markov cases can also be obtained.
There is an important case when divergence mini-
Let c = c(xn1 ) be the number of commas in the LZ
mization corresponds to maximum likelihood, namely,
parsing of xn1 . The final block, which may be empty, is
the case when the linear family contains the empirical
coded by telling the first prior word that this block pre-
distribution. Suppose we are given the corresponding
fixes. Let ULZ (xn1 ) be the length of the resulting code.
linear and exponential families,
Each word, except the final word, can be encoded with
L = {P :
X
P (a)fi (a) = αi , 1 ≤ i ≤ k} at most dlog ce bits to give the location of the prior oc-
a curence of all but its final symbol and dlog |A|e to encode
k
X this final symbol. Thus we have the upper bound
E = {P : P (a) = c(θ)Q(a) exp( θi fi (a))}.
1 ULZ (xn1 ) ≤ (c + 1)dlog ce + cdlog |A|e. (25)

Theorem 20 If Q(a) > 0, ∀a and if the empirical dis- The next step in upper bounding the redundancy is to
tribution P̂ belongs to L then the maximum likelihood obtain a lower bound on − log P (xn1 ), stated here as the
estimate in E is the I-projection P ∗ of any member of following lemma.
E onto L. Furthermore, the minimum value of D(P̂ kP ) Lemma 6 There is a positive number δ such that if P
for P ∈ E is attained at P ∗ . is an i.i.d. process then

Proof. Let P ∗ = D(LkQ), so that L ∩ E = {P ∗ }. If log(n/c)


− log P (xn1 ) ≥ c log c − cδ + .
P ∈ E we can write n/c
X
P = cQ(a) exp( θi fi (a)),
Proof. Let W = W (xn1 ) be the first c words in the LZ
P ∗ = c∗ Q(a) exp( θi∗ fi (a)).
X
parsing of xn1 , let WL = WL (X1n ) be the subset of W
consisting of the words of length L, and let c(L) be the
P ∗ (a)fi (a) = αi we have
P
Since
cardinality of WL . We then have
0 ≤ D(P ∗ kP ) = log c∗ + θi∗ αi − (log c +
X X
θi αi ), LY
max Y
P (xn1 ) ≤ P (w),
so that L=1 w∈WL

so that
log c∗ + θi∗ αi = max(log c +
X X
θi αi ).
P ∈E LX
max X
− log P (xn1 ) ≥ − log P (w)
If P̂ ∈ L, however, then D(P̂ kP ) − = D(P̂ kP ∗ ) L=1 w∈WL
D(P ∗ kP ), since P̂ (a)fi (a) = αi . This proves that the
P
LX
max
minimum value of D(P̂ kP ) for P ∈ E is attained at
X 1
= − c(L) log P (w)
P ∗ . Furthermore, P (xn1 ) = n P̂ (a) log P (a), so that if
P
L=1 w∈WL
c(L)
P ∈ E and P̂ ∈ L then LX
max X P (w)
≥ − c(L) log
P ∗ (xn1 ) P ∗ (a) c(L)
= D(P ∗ kP ) ≥ 0,
X
log =n P̂ (a) log L=1 w∈WL
P (xn1 ) P (a) LX
max
≥ c(L) log c(L)
so that P ∗ is indeed the MLE in E. L=1
The argument can be applied to any member of E in
place of the given Q, since they all describe the same where the first inequality comes from Jensen’s in-
exponential family. equality, and the final inequality uses the fact that
w∈WL P (w) ≤ 1, which holds because the words in WL
P

are distinct and have fixed length L.


8.4 Redundancy for the LZ algorithm. P
To obtain a suitable bound on c(L) log c(L) set
An upper bound on the redundancy of the form
O(log log n/ log n) for the Lempel-Ziv (LZ) algorithm on 1 LX
max
L̄ = c(L) log c(L),
the class of i.i.d. processes will now be established. c L=1

so that 8.6 Cutting off the memory.
c(L) 1
−c
P P
c(L) log c(L) = c log c(L) Let P be a stationary finite-alphabet process. The
c(L) 2−L/L̄ k-step Markoviztion of P is the k-step Markov process
−c −c
P
= c log c(L)
(a) P (k) defined by the transition probabilities
≥ −c + c log c − c log L max −L/L̄
P
1 2
−1/L̄
≥ 2
−c + c log c − c log 1−2−1/L̄ P (xk+1
1 )
P (xk+1 |xk1 ) = .
(b) P (xk1 )
≥ −c + c log c − c log(21/L̄ − 1)
(c)
≥ −c + c log c − c log( lnL̄2 ) The following general result shows that the conditions
≥ c log c − cδ − c log nc . stated in Theorems 13 and 14 often hold. For example,
the set of all Markov types of all orders is a countable
where Jensen’s inequality was used in (a) and the finite set for which the conditions of Theorem 14 hold for every
sum was replaced by the infinite sum (of a geometric ergodic process P .
series) to go to (b). The Taylor expansion of 2x was used
to obtain (c), while the final line used δ = 1 − log(ln 2) Theorem 21 D∞ (P kP (k) ) → 0 as k → ∞.
and the fact that L̄ ≤ n/c. This completes the proof of
Lemma ??. Proof. We have
Taking the difference between the upper bound on the
code length, (??), and the lower bound of Lemma ??, n
P (xk1 ) X P (xi+1 |xi1 )
then dividing by n, produces the redundancy bound log = log
P (k) (xn1 ) i=k+1 P (xi+1 |xii−k+1 )
1 c log n/c
RLZ (xn1 ) ≤ K + , (26)
n n n/c so that taking expectations yields
where K is a constant. !
P (xk )
To complete the argument a simple bound for c/n will EP log (k) 1 n =
be needed, a bound that follows from the fact that the P (x1 )
n
largest value of c is obtained when all short blocks occur. P (xi+1 |xi1 )
P (xi+1
X X
It is enough to consider the case when all blocks of length 1 ) log . (27)
i=k+1 xi+1
P (xi+1 |xii−k+1 )
up to t occur, so that 1

t t
X X To see what this is we use the formula
c= |A|i ∼ |A|t , n = i|A|i ∼ t|A|t ,
1 1 X P (x|y, z)
I(X ∧ Y |Z) = P (x, y, z) log ,
which gives the (asymptotic) bound c/n = O(1/ log n). P (x|z)
Since log x/x is decreasing in x for x > e, the desired
result, with X = Xi+1 , Y = X1i , Z = Xi−k+1
i ; the sum (??)
1 log log n
 
RLZ (xn1 ) = O , then takes the form
n log n
n
follows easily from the bound (??). X
I(Xi+1 ∧ X1i |Xi−k+1
i
)=
i=k+1
8.5 Minimization for general measures. n
X
0 0
I(X1 ∧ X−i+1 |X−k+1 )
The minimization result claimed in the paragraph fol- i=k+1
lowing statement (10) on page 10 follows from a general
result about nonnegative measures, a result that is a sim- where stationarity was used to obtain the final form.
ple consequence of the log-sum inequality. Suppose P is Now we pass to the limit in n, using the martingale the-
an arbitrary nonnegative measure, suppose Q is a prob- orem to obtain
ability distribution, and set Q∗ (a) = P (a)/ P (b). The
P

log-sum inequality then gives D∞ (P kP (k) ) = I(X1 ∧ X−∞


0 0
|X−k+1 ),
X  X 
D(P kQ) ≥ P (a) log P (a) = D(P kQ∗ ).
which goes to 0 as k → ∞, establishing the theorem.

8.7 Arithmetic coding. P̂n denote the empirical distribution, and Π the set of
all binary distributions Q with Q(0) ≥ 0.2. We want to
An interesting idea due originally to Elias and later
estimate tha probability that P̂n ∈ Π. Sanov’s theorem
adapted in a useful form by Rissanen leads to sequen-
gives
tial coding procedure known as arithmetic coding. Fix
an integer n. An arithmetic code first assigns to each Prob(P̂n ∈ Π) ≈ exp(−nD(ΠkP )), P = (0.1, 0.9).
xn1 a nonempty subinterval J(xn1 ) = [`(xn1 ), r(xn1 )) of the
unit interval [0, 1) such that the set {J(xn1 )} is a parti- Now
tion of the interval into disjoint subintervals. To obtain D(ΠkP ) = min D(QkP )
sequential codes it is required that Q∈Π
0.2 0.8
J(xn1 ) = ∪s J(xn1 s), (28) = 0.2 log + 0.8 log ≈ 0.066,
0.1 0.9
and to avoid trivialities it is required that J(xn1 ) shrinks and exp(−nD(ΠkP )) ≈ 0.01. This number is suspect
to a single point as n → ∞. The code is then defined because the large deviations approximation is reliable
by setting C(xn1 ) = z1m if the endpoints of J(xn1 ) have for “very small” probabilities (however, since Π is con-
binary expansions .z1 z2 . . . zm that agree in their first m vex, this number certainly gives an upper bound.) In
places but no further, that is, our case, approximating the binomial distribution by the
normal will be preferable, and its result is smaller by a
`(xn1 ) = .z1 z2 . . . zm 0 . . . , r(xn1 ) = .z1 z2 . . . zm 1 . . . .
factor of 10.
Since the intervals are disjoint this is a prefix code. Fur-
thermore, if Qn (xn1 ) = r(xn1 ) − `(xn1 ) then Qn is a proba- Example 3 The null-hypothesis P1 is to be tested on
bility distribution; the condition (??) then implies that the basis of an iid sample xn1 , and the probability of first
there is a process Q whose n-th order probabilities are for “very small” probabilities (however, since Π is con-
n
given by Qn . Note also that 2−L(x1 )−1 ≥ Q(xn1 ) so that o(1))). Prove that the test with critical region equal to
L(xn1 ) ≤ − log Q(xn1 )+1 and hence the Qn -expected code {xn1 : P̂xn1 6∈ Π}, with Π = Πγ = {Q: D(QkP1 ) ≤ γ}, is
length is no more than H(Qn ) + 1. The code operates asymptotically optimal against any alternative P2 with
sequentially in that the code word assigned to xn+1 is an D(P2 kP1 ) > γ, in the sense that for no test meeting the
1
extension of the word assigned to xn1 . condition on the first kind error can the probability of
In general, a process Q can be specified by giving its second kind error go to 0 with a larger exponent.
sequence of conditional probabilities Q(xn |xn1 ). These Solution. The meaning of this problem is the follow-
probabilities can then be used to specify subintervals ing. Two distributions P1 and P2 are given such that
of the unit interval in a sequential manner. Thus, we D(P2 kP1 ) > γ. Based on a sample path xn1 drawn from
first partition [0, 1) into subintervals labeled J(x1 ), x1 ∈ P1n or P2n , a decision is to be made as to which process it
A, then for each x1 partition J(x1 ) into subintervals comes from. To make this decision, the set An of possi-
J(x21 ), x2 ∈ A. Proceeding in this manner the process Q ble sample paths is partitioned into two disjoint sets H1n
defines a nested sequence of partitions {J(xn1 ): xn1 ∈ An } and H2n , and the decision rule is to choose Pi if xn1 ∈ Hin .
that satisfy the compatibility condition (??), hence de- It is enough to specify the region H2n , which is called the
fine an arithmetic code. The mixture processes intro- critical region of the test, for we can take H1n to be its
duced in Sections 6 and 7, thus lead to useful sequential complement. The first part of the problem is to show
codes, as noted in Remark 9 of the notes. that if H2n = {xn1 : P̂xn1 6∈ Π} then the probability of a
first kind error, namely P1 (H2n ), satifies
9 Further examples. P1 (H2n ) ≤ exp(−n(γ − o(1))) (29)
Example 2 A lot contains n = 100 defective items. For this partition there will be a largest number δ > 0
Each item is tested but the test may fail with probability such that the probability of a second kind error, P2 (H1n ),
p = 0.1. Use the techniques of this course to estimate satisfies
the probability that 20 or more defective items remain P2 (H1n ) ≤ exp(−n(δ − o(1))) (30)
undetected.
The second goal is to show that if {H1n , H2n } is any se-
Solution. Let Xi denote the outcome of testing the i-th quence of partitions for which (??) holds then P2 (H1n )
defective item, Xi = 1, if detected, 0, otherwise. Let cannot go to zero at a faster rate than (??).

Let us first show that for H2n = {xn1 : P̂xn1 6∈ Π}, the (i) For Γ = {{1}, {2}, {3}}, P ∗ is the product of the em-
two inequalities (??) and (??) both hold. Let P ∗ be the pirical marginal distributions, that is, P ∗ (i, j, k) =
xi·· x·j· x··k
I-projection of P2 onto Π and let δ = D(P ∗ kP2 ), which ∗
n n n . This P clearly does not fit the sample.
is necessarily positive. If P̂xn1 6∈ Π then D(P̂xn1 kP1 ) > γ,
(ii) For Γ = {{1, 2}, {1, 3}}, P ∗ is of the form
so that (??) holds. Sanov’s theorem asserts that (??)
P ∗ (i, j, k) = a(i, j)b(i, k), that is, under this
holds.
model the second and third features are condition-
To establish the second goal we first note that
ally independent given the first feature. Hence
D(Πγ kP1 ) is continuous in γ, hence given  > 0 we can
P ∗ (i, j, k) is obtained by multiplying the empiri-
choose n so large that there is an n-type Q such that
cal marginal (1/n)xi·· by the conditional distribu-
D(QkP1 ) < γ − , and D(QkP2 ) < δ + . tions evaluated from the {1, 2} and {1, 3} empirical
marginals xij· /xi·· and xi·k /xi·· . Thus P ∗ (i, j, k) =
If the critical region Cn = H2n of some test contains (xij· xi·k )/(nxi·· ). For x∗ijk = nP ∗ (i, j, k) we get
at least half of TQn then P1 (Cn ) is lower bounded by x∗111 = 7.8, x∗112 = 10.2, x∗121 = 5.2, x∗122 =
P1 (TQn )/2 which is turn lower bounded by 6.8, x∗211 = 10, x∗212 = 2, x∗221 = 15, x∗222 = 3,
!−1 and
1 n + |X| − 1 xijk 1.1
exp[−nD(QkP1 )] nD(P̂ kP ∗ ) =
X
2 |X| − 1 xijk log ∗ = ln 2
i,j,k
xijk 2
≥ exp(−n(γ − /2)),
The degress of freedom, that is, the dimensional-
for large enough n. Therefore if we require that (??) ity of L determined by the marginals, is 2, and
holds then, if n is large enough, at least half of TQn is not Prob(χ2 ≥ 1.1) ≈ 0.58, hence the model fits well.
in Cn and hence the second kind error P2 (Cnc ) is lower Example 5 A finite-valued random variable Y is -
bounded by independent from a finite-valued X if
!−1 X X
1 n + |X| − 1 P (X = x) |P (Y = y|X = x) − P (Y = y)| < .
exp[−nD(QkP2 )] x y
2 |X| − 1
≥ exp(−n(δ + /2), Show that I(X ∧ Y ) ≤ (2 /2) log e implies -
independence.
for large enough n. Since  is arbitrary this proves that
the second kind error cannot go to 0 with a larger expo- Solution. Exercise 17, page 58, of the Csiszár-Körner
nent than δ. book gives the bound 2D(P kQ) ≥ log e|P − Q|2 , where
| · | denotes variational distance. Since I(X ∧ Y ) =
Example 4 Let a contingency table with 3 features, D(PX,Y kPX × PY ) the condition I(X ∧ Y ) ≤ (2 /2) log e
each with two categories, be given by the cell counts implies that D(PX,Y kPX × PY ) ≤ , which is the condi-
tion for -independence.
x111 = 8, x112 = 10, x121 = 5, x122 = 7
x211 = 11, x212 = 1, x221 = 14, x222 = 4 Example 6 Consider an exponential family defined by
Pθ (x) = c(θ) exp[ ki=1 θi fi (x)], where c(θ) =
P
densities
Consider the log-linear models corresponding to (i) Γ = ( exp[ ki=1 θi fi (x)]dx)−1 and θ = θ1k . Let θM L be the
R P
{{1}, {2}, {3}} and (ii) Γ = {{1, 2}, {1, 3}}, and deter- value maximizing Pθ for a given x0 (which we suppose
mine the maximum likelihood estimate P ∗ for both mod- exists.) Show that − log PθM L (x0 ) = H(PθM L ).
els. Does either model fit the data?
Solution. Since
Z
Solution. Here n = 60 and the one-dimensional marginal H(Pθ ) = − Pθ (x) log Pθ (x) dx
counts are Z
x1·· = x2·· = x·1· = x·2· = 30, x··1 = 38, x··2 = 22. = − log c(θ)Pθ (x) dx
Z k
The two-dimensional marginal counts needed in part (ii) − (log e)
X
θi fi (x)Pθ (x) dx
are i=1
k
x11· = 18, x12· = 12, , x21· = 12, x22· = 18 X
x1·1 = 13, x1·2 = 17, , x2·1 = 25, x2·2 = 5. = − log c(θ) − (log e) θi Eθ fi ,
i=1

it suffices to show that for θ = θML we have Eθ fi = provided this family has a member satisfying the con-
fi (x0 ), i = 1, . . . , k. But this immediately follows by straints. Comparing the family with the 3-dimensional,
setting the derivatives (∂/∂θi ) log Pθ (x0 ) equal to 0. For mean 0, Gaussian densities, that is, those of the form,
this last step it is necessary to assume that θML is an
(det A)1/2 ) 1
 
interior point of the set of those θ0 s for which Pθ is de- p(u) = exp − uAuT ,
(2π)3/2 2
fined, that is, that the integral in the definition of c(θ)
is finite. where A is symmetric and positive definite, we see that
our exponential family is a subfamily of these Gaussians,
Example 7 Let P n and Qn be n-dimensional distribu- with
tions on An such that n−1 D(P n kQn ) → 0. Show that if
 
−2θ0 −θ1 0
for some sets Bn ⊂ An we have Qn (Bn ) < exp(−n) for  −θ1 −2θ0 −θ1  .
 
some  > 0 that does not depend on n then P n (Bn ) → 0. 0 −θ1 −2θ0
Is it also true that P n (Bn ) < exp(−n) implies that
Qn (Bn ) → 0? Computing the covariance matrix Σ = A−1 , the given
moment constraints result in two equations for the un-
Solution. For an arbitrary set A, knowns θ1 and θ2 . The solution of these equations is
straightforward, but tedious. It is clear, however, from
D(P n kQn ) the form of A, that the first and last elements of the main
P n (A) P n (Ac ) diagonal of Σ = A−1 will be equal and its middle element
≥ P n (A) log + P n
(A c
n ) log will be different from these (unless θ1 = 0, which occurs
Qn (A) Qn (Ac )
if b = 0, when the maximum entropy distribution is iid.)
≥ P (A) log P (A) + P (An ) log P n (Acn )
n n n c
The remaining assertion of the problem also follows from
1
+P n (A) log n the form of A without any further calculations.
Q (A)
1
≥ −1 + P n (A) log n . 10 Summary of Process Concepts.
Q (A)

If here Qn (A) ≤ exp(−n) then it follows that A number of process concepts will be used in the discus-
D(P n kQn ) ≥ −1 + nP n (A), that is, sion of redundancy. These concepts and the results to
be used are summarized here.
1 1 1
 
n
P (A) ≤ D(P n kQn ) + . A (stochastic) process is a sequence {Xn } of random
 n n variables defined on a probability space, say (X, Σ, µ).
We shall assume that all the random variables have val-
Example 8 Let X, Y, Z be real-valued random vari- ues in a fixed finite set A, called the alphabet. For each
ables with unknown joint density p(x, y, z) for which n a process defines a probability measure on An , called
E(X 2 ) + E(Y 2 ) + E(Z 2 ) = a and E(XY ) + E(Y Z) = b, the n-fold joint distribution, by the formula
where a and b are known. Show that the joint den-
sity achieving maximum entropy subject to these con- Pn (xn1 ) = Prob (Xi = xi , 1 ≤ i ≤ n) .
straints is Gaussian with mean 0. Indicate how its co-
variance matrix could be determined (the actual compu- The sequence of measures {Pn } is not completely arbi-
tation is not required) and show that for this maximum trary, for the consistency conditions,
entropy joint distribution E(X 2 ) = E(Z 2 ) 6= E(Y 2 ) and
Pn+1 (xn+1
X
Pn (xn1 ) = 1 ) (31)
E(XY ) = E(Y Z) 6= E(XZ). xn+1

Solution. Let f1 (x, y, z) = x2 + y 2 + z 2R and f2 (x, y, z) = must hold.


xy + yz. Then the entropy H(p) = − p(u) log p(u) du, The space (X, Σ, µ) on which the process is defined
where u = (x, y, z), du = dxdydz, has to be max- is not important; all that matters is the sequence of
imized subject to the constraints inf f1 (u)p(u) du = joint distributions, {Pn }. In fact, two processes are said
to be equivalent if they have the same joint distribu-
R
a, f2 (u)p(u) du = b. The maximizing density will
be in the exponential family tions; we are free to choose any convenient space and
sequence of functions, as long as the joint distributions
pθ (u) = c(θ) exp[θ1 f1 (u) + θ2 f2 (u), θ = (θ1 , θ2 ), is not changed. The Kolmogorov model takes the space

to be the set A∞ of infinite sequences drawn from A, and For ergodic processes we also have that the entropy of the
the functions to be the coordinate functions, defined by empirical distribution, H(P̂k ), converges almost surely
X̂n (x) = xn , x ∈ A∞ . The measure is the (unique) to the theoretical entropy, Hk . Furthermore, if we define
Borel measure P defined by the requirement that if transition probabilities by the formula
[an1 ] = {x: xi = ai , 1 ≤ i ≤ n} P̂k (ak1 )
P̂k (ak |a1k−1 ) = P ,
is the cylinder set defined by an1 , then P ([an1 ]) = Pn (an1 ). ak P̂k (ak1 )
In summary, the concept of process, that is, a sequence
of measures {Pn } that satisfy the consistency conditions, the entropy of the resulting Markov chain will converge
(??), is formally equivalent to the concept of Borel prob- almost surely, as sample path length n → ∞, to the con-
ability measure P on the sequence space A∞ . We usually ditional entropy, H(Xk |X1k−1 ), which, in turn, converges
take the latter as our definition of process; thus, when we as k → ∞ to the entropy-rate H(P ).
say process we shall mean a Borel probability measure A stationary process P is always a mixture of ergodic
P on the sequence space A∞ . We shall use the notation processes, that is, there is a probability space (Y, Σ, ν)
P (an1 ) for P ([an1 ]), as well as sample path terminology. and a family Uy , y ∈ Y, of ergodic processes such that for
A sample path is a member of A∞ , while a finite sample each an1 the function Uy (xn1 ) is Σ-measurable and such
path is a member of some An . that Z
P (an1 ) = Uy (an1 )ν(dy).
A process P is stationary if it is invariant under the
shift T , which is the transformation on A∞ defined by The process {Xn } is finite-state (hidden Markov) if
the formula (T x)n = xn+1 , x ∈ A∞ , n = 1, 2, . . .. Thus there is a finite alphabet process {Sn } such that the pro-
P is stationary if and only if P = P ◦ T −1 . cess Yn = (Xn , Sn ) is a Markov chain. Csiszár has shown
A stationary process is ergodic if almost every sam- (unpublished) that if Q is finite-state then for any sta-
ple path is “typical” for the process. The concept of tionary, ergodic process P the limiting divergence-rate
“typical” is defined as follows. The relative frequency
1X P (xn1 )
of occurence of ak1 in the sequence xn1 is the distribution D∞ (P kQ) = lim P (xn1 ) log
n n an Q(xn1 )
P̂k = P̂k (·|xn1 ) on Ak defined by 1

|{i ∈ [0, n − k]: xi+k


= i+1 ak1 }| exists, and, furthermore, (1/n) log P (xn1 )/Q(xn1 ) con-
P̂k (ak1 |xn1 ) = .
n−k+1 verges, for P -almost all x, to the limit D∞ (P kQ).
The measure P̂k is also called the empirical distribution
of overlapping k-blocks in the sample path xn1 . The se- 11 Homework # 1.
quence x is said to be typical for the process P if for all
k and all ak1 , the following holds Due Date: Oktober 7-én.
P (ak1 ) = lim P̂k (ak1 |xn1 ). 1. Find the Shannon-Fano code for the distribution
n→∞
The set of sequences that are typical for P will be de- P = (0.4, 0.35, 0.1, 0.1, 0.05). Determine the average
noted by T (P ). A stationary process P is ergodic if length and compare it with the entropy H(P ). Can
its set of typical sequences has measure 1, that is, if you improve this code by shortening some words,
P (T (P )) = 1. without losing the prefix property? Do you get an
The entropy (or entropy-rate) of a stationary process optimal code in this way?
P is defined by H(P ) = limn Hn /n where the n-th order 2. Determine whether there exist binary prefix codes
entropy Hn = Hn (P ) is defined by with the following codeword lengths and give such
X
Hn = − P (an1 ) log P (an1 ). a code if the answer is yes.
an
1 (a) 2,3,3,3,4,4,4,4,4,5,5,5 (b) 2,2,3,3,4,4,4,5,6,6
The entropy theorem (also known as the Shannon- 3. Determine which of the following bit sequences can
McMillan-Breiman Theorem) asserts that if P is an er- be a code of some sequence of integers, using the
godic process of entropy H then prefix code given in the notes.
1 1 1 (a) 0011110000001100010001110101010100011101
log = − log P (xn1 ) = H, a. s.
n P (xn1 ) n (b) 00001001110010000000100111100001011100010010000

4. Let B ⊂ An and let P = (1/|B|) 12 Homework # 2.
P
x∈B Px , be the
average type of the sequences in B.
1. For k simple hypotheses P1 , . . . , Pk , and a clas-
(a) Prove that |B| ≤ exp[nH(P )]. sification rule consisting of the partition A =
(A1 , . . . , Ak ) of X n such that Pi is accepted when
(b) Prove that the sample belongs to Ai , there are k error prob-
abilities, ei = Pin (Aci ), i = 1, . . . , k. Give a nec-
k
!
X n essary and sufficient condition for the existence of
≤ 2nh(k/n) , k ≤ n/2
i=0
i classification rules such that all k error probabil-
ities go to 0 with exponential rate at least some
where h(p) = −p log p − (1 − p) log(1 − p). γ > 0, as the sample size n goes to infinity, that is,
ei ≤ exp(−n(γ + o(1)), i = 1, . . . , k.
(c) Given a function f on a finite set X, show
that for every α that is a possible value of 2. Let E1 ⊂ E2 be exponential families of the form
Pn
i=1 f (xi )/n, we have   
 k1
X
( n
) " # E1 = Q: Q(x) = Q0 (x)c(θ) exp  θi fi (x)
n 1X

x1 : f (xi ) = α ≤ exp n max H(X)
 
i=1
n i=1
E(f (X))=α   
 Xk2 
E2 = Q: Q(x) = Q0 (x)c(θ) exp  θi fi (x) ,
5. Prove that H(Y |X) is a concave function of the joint
 
i=1
distribution of (X, Y ), that is, if PXY = αPX1 Y1 +
(1 − α)PX2 Y2 then H(Y |X) ≥ αH(Y1 |X1 ) + (1 − where k2 > k1 . Given a sample with empirical
α)H(Y2 |X2 ). distribution P̂ , let Pi∗ ∈ Ei , be the maximum
likelihood estimate for the model Ei i = 1, 2.
6. Let P1 and P2 be probability distributions on the Prove that P2∗ is the I-projection of P1∗ onto L =
{P : ki=1 P (x)fi (x) = ki=1
P 2 P 2
finite set X such that D(P2 kP1 ) > γ. Prove that the P̂ (x)fi (x)}.
I-projection P ∗ of P2 on Π = {Q: D(QkP1 ) ≤ γ} 3. In a telephone network serving r cities, the incom-
is of the form P ∗ = cP1θ P21−θ , where c > 0 and ing and outgoing calls were counted in each city on a
0 < θ < 1 are determined by the requirements that given day. From these numbers, xin (k) and xout (k),
P ∗
P (x) = 1 and D(P ∗ kP1 ) = γ. (Hint: first show k = 1, . . . , r, the number of calls x(i, j) from city i
that P ∗ is also the I-projection of P2 on the linear to city j are inferred by the method of maximum en-
family L = {Q: Q(x) log PP21 (x)(x)
= δ − γ}, where
P
tropy, setting x∗ (i, j) = np∗ (i, j); here n is the total

δ = D(P kP2 ).) number of calls and P ∗ = {p∗ (i, j)} is the maximum
entropy distribution among those P = {p(i, j)} that
7. Prove that D(P kQ) ≤ χ2 (P, Q) log e. satisfy the marginal constraints
r r
8. Let X1 , X2 , . . . be an iid sequence of X-valued ran- X 1 X 1
p(k, j) = xout (k), p(i, k) = xin (k),
dom variables with entropy H, and let Ĥn be the n n
j=1 i=1
empirical entropy of X1n , that is, the entropy of the
empirical distribution P̂n . Prove that for k = 1, . . . , r, and, in addition, p(k, k) = 0, k =
! 1, . . . , r (local calls were not counted.) Specify the
1 n + |X| − 1 exponential family for which this P ∗ is the maxi-
H − log ≤ E(Ĥn ) ≤ H.
n |X| − 1 mum likelihood estimate, and suggest an iterative
algorithm for determining P ∗ .
9. Given two strictly positive finite distributions P1 4. Suppose that for a 5 × 5 array of random variables
and P2 on X, determine γ such that there is exactly Xij , each taking values in a finite set X, the joint
one P ∗ with D(P ∗ kP1 ) = D(P ∗ kP2 ) = γ. Show distributions of “neighboring pairs” (Xij , Xi(j+1) )
that and (Xij , X(i+1)j ) are known, where addition mod-
P1θ (x)P21−θ (x).
X
γ = − log min ulo 5 is used. Based on this information, the
0≤θ≤1
x joint distribution of the whole array is estimated by

maximizing joint entropy subject to the constraints where the final line indicates the optimal (Huffman)
given. Interpret the maximum entropy joint distri- code. Expected code length and entropy are
bution as an I-projection, and suggest a convergent X X
iteration for computing it. L= pi `i = 2.55, H = − pi log pi = 1.94

5. For binary sequences of length n let Q(xn1 ) denote while the optimal L is 2.15. Note that the last code-
the uniform mixture of the iid probabilities P (xn1 ) = word can be shortened by deleting its final two bits,
pn0 (1 − p)n−n0 , where n0 denotes the number of ze- but the obtained code is still not optimal.
roes in xn1 . Find an explicit formula for Q and de- 2. The Kraft inequality shows that there is no prefix
termine the asymptotic behavior of the maximum code with length set (a), but there is one for (b).
redundancy log PM L (xn1 )/Q(xn1 ) as n → ∞, for se-
quences of two kinds: (i) n0 ∼ αn for 0 < α < 1, 3. The first bit sequence cannot be decoded, but it can
and (ii) n0 constant. Suggest a mixture distribution if one more bit is added at the end.
that is more appropriate for the latter case.
4. Let N (a, B) denote the number of occurrences of
6. Determine the code-length for the sequence a in all the sequences xn1 ∈ B, so that P (a) =
000110001000111000010000110010000 N (a, B)/n|B|. Let X1 , . . . , Xn be random variables
using the universal coding method discussed in defined by Prob(X1n = xn1 ) = 1/|B|, xn1 ∈ B and
class, supposing first that the sequence is iid, then 0, otherwise. But Pi (a) = Prob(Xi = a) satisfies
P
second that it is Markov. i Pi (a) = N (a, B)/|B| = nP (a). Thus,
X
7. Let {Pθ }θ∈Θ be an arbitrary family of distributions log |B| = H(X1n ) ≤ H(Pi ) ≤
for a random process with finite alphabet A, and let i
!
rn denote the smallest positive integer r for which 1X
≤ nH Pi = nH(P ).
there exists a prefix code with codeword lengths n i
L(xn1 ) such that the redundancy satisfies the uni-
form bound This establishes part (a).

L(xn1 ) + logPθ (xn1 ) ≤ r, ∀xn1 , θ. For part (b) apply part (a) to the set B of binary
sequences of length n that contain no more than k
Show that rn equals log Sn , up to 1 bit, where Sn = zeroes. For part (c), let P be the average type of
supθ Pθ (xn1 ).
P
xn
1 the sequences xn1 that belong to
8. For an iid sequence of random variables with values n
( )
1X
in a finite set X let Ĥn = Ĥn,xn1 denote the empiri- B= xn1 : f (xi ) = α .
n i=1
cal entropy of the sequence xn1 , that is, Ĥn = H(P̂ )
where P̂ is the type of xn1 . Show that with proba- ¿From (a) we have |B| ≤ exp[nH(P )]. But xn1 ∈ B
P
bility 1 means that its type Pxn1 satisfies a Pxn1 (a)f (a) =
|X| − 1 α; which therefore also holds for the average type
nĤn ≤ − log P (xn1 ) ≤ nĤn + log n + Z, Hence H(P ) ≤
P
P , that is, a P (a)f (a) = α.
n
maxE(f (X))=α H(X). A lower bound on |B| cannot
where P is the true distribution and Z is a random
be given without additional assumptions.
variable, depending on n, such that E(Z) < ∞.
5. It suffices to show that
13 Solutions: Homework #1. PXY (x, y) log PXY (x,y)
PX (x) ≤ αPX1 Y1 (x, y) log
PX1 Y1 (x,y)
PX1 (x)
PX2 Y2 (x,y)
1. The i-th codeword, c(i), is the first `i = d− log pi e +(1 − α)PX2 Y2 (x, y) log
P PX2 (x) .
binary digits of ai = j<i pj .
This follows from
i 1 2 3 4 5
a1 a2 a1 + a2
ai 0 0.4 0.75 0.85 0.95 a1 log + a2 log ≥ (a1 + a2 ) log ,
b1 b2 b1 + b2
`i 2 2 4 4 5
c(i) 00 01 1100 1101 11110 with a1 = αPX1 Y1 , a2 = (1 − α)PX2 Y2 (x, y), b1 =
opt(i) 00 01 10 110 111 αPX1 (x), b2 = (1 − α)PX2 (x).

6. Suppose the Pi are everywhere positive. We prove The result now follows
! from the fact that H̄(P̂n ) ≤
that the I-projection of P2 on n + |X| − 1
log .
|X| − 1
P1 (x)
 X 
L = Q: Q(x) log =δ−γ
P2 (x) 9. From an earlier problem, the I-projection of P2 on
is that same P ∗ as the I-projection of P2 on Π = Π = {Q: D(QkP1 ) ≤ γ} is of the form
{Q: D(QkP1 ) ≤ γ}, where δ = D(P ∗ kP2 ). Indeed, " #−1
P ∗ ∈ L; furthermore, for every Q ∈ L P ∗ (x) = cP1θ (x)P21−θ (x), c = P1θ (x)P21−θ (x)
X
.
x
X P1 (x)
D(QkP2 ) − D(QkP1 ) = Q(x) log = δ − γ.
P2 (x) Therefore,

Here if D(QkP2 ) were less than δ, also D(QkP1 ) cP21−θ (x)


D(P ∗ kP1 ) = P ∗ (x) log
X
would be less than γ, which would imply that Q ∈
P11−θ (x)
Π, contradicting the hypothesis that the P ∗ with
D(P ∗ kP2 ) = δ is the I-projection of P2 on Π. P2 (x)
P ∗ (x) log
X
= log c + (1 − θ)
x P1 (x)
Now, the exponential family corresponding to L is
cP1θ (x)
D(P ∗ kP2 ) = P ∗ (x) log
X
P1 (x)
  
Pθ : Pθ (x) = c(θ)P2 (x) exp θ log P2θ (x)
P2 (x) P2 (x)
P ∗ (x) log
X
n o = log c − θ .
= Pθ : Pθ (x) = c(θ)P1θ (x)P21−θ (x) x P1 (x)

Since P1 and P2 belong to this family, and L By assumption, D(P ∗ kP1 ) = D(P ∗ kP2 ) = γ, hence
speparates P1 and P2 , there must be some θ∗ with it follows that
Pθ∗ ∈ L. Then, by the general theorem, P ∗ = Pθ∗
is the I-projection of P2 on L and therefore also on P2 (x)
P ∗ (x) log
X
=0 (32)
Π. x P1 (x)

7. Using the inequality ln x ≤ x − 1 we have But this means ex-


actly that (d/dθ) x P1θ (x)P21−θ (x) = 0, hence the
P
pi pi
X X  
pi ln ≤ pi −1 θ for which (??) holds actually minimizes the con-
qi qi
vex function x P1θ (x)P21−θ (x). Since we also have
P
i i
X p2 X (pi − qi )2 γ = log c = − log x P1θ (x)P21−θ (x), we must have
P
i
= −1=
i
qi i
qi the desired result

P1θ (x)P21−θ (x).


X
Multiplying both sides by log e produces the claimed γ = − log min
0≤θ≤1
inequality. x

8. Since Ĥn = H(P̂n ) and E P̂n = P , we have E Ĥn ≤


H(E P̂n ) = H(P ), by concavity. Here H(P̂n ) de- 14 Solutions: Homework.2.
notes the entropy of P̂n as a distribution. On the
1. The necessary and sufficient condition is that the
other hand, P̂n is also a random variable; H̄(P̂n ) will
“divergence balls” Πi = {Q: D(QkPi ) < γ} have to
denote the entropy of this random variable. Then
be disjoint, which follows from previous homework
we can write
problems.
nH(P ) = H(X1n ) = H(X1n |P̂n ) + H̄(P̂n )
X 2. The assertion follows from the fact that the ML
= Pr(P̂n = Q) log |TQ | + H̄(P̂n ) estimate from an exponential family equals the I-
Q
X projection of any element of the exponential fam-
≤ Pr(P̂n = Q)nH(Q) + H̄(P̂n ) ily onto the corresponding linear family (containing
Q the empirical distribution), and from the transitiv-
= nĤn + H̄(P̂n ) ity property of I-projections.


3. In general, maximizing H(P ) is the same as min- If n0 ∼ αn then Stirling’s formula, k! ∼ k k e−k 2πk,
imizing D(P kQ0 ) where Q0 is the uniform distri- gives
bution. In our case we must have P (i, i) = 0, for n0 !n1 ! nn0 nn1 q
∼ 0 n1 2πnα(1 − α)
every i; therefore, maximizing H(P ) is equivalent n! n
to minimizing D(P kQ0 ) where Q(i, i) = 0, ∀i, and which implies that
Q(i, j) = constant, i 6= j. This minimization can
PM L (xn1 ) nn0 0 nn1 1 1
 
be performed by iteratively adjusting the marginals
log = log /Q(xn1 ) ∼ log n+ const.
(iterative scaling). Q(xn1 ) nn 2
The exponential family will consist of all distribu-
tions of the form P (i, j) = cQ(i, j)a(i)b(j) and the If, on the other hand, n0 is a constant, then
maximum entropy distribution will be the ML es- 1 n0 !n1 ! n0 !
timate for this family. In this case the exponential Q(xn1 ) = =
n + 1 n! (n + 1)n · · · (n − n0 + 1)
family through Q0 , the uniform distribution, is not
appropriate because it does not intersect the set of and
feasible distributions, all of which have their diago-
nal elements equal to 0. PM L (xn1 )
=
Q(xn1 )
(1)
4. For each pair (i, j), 1 ≤ i ≤ 5, 1 ≤ j ≤ n, let Lij nn0 (n − n0 )n−n0
= 0 · · (n + 1)n · · · (n − n0 + 1)
denote the set of all joint distributions on X 25 whose n0 ! nn
two-dimensional marginal representing the joint dis- nn0 0 n0
tribution of Xij and Xi(j+1) equals the given one. = · (1 − )n−n0 ×
n0 ! n
(2)
1 n0 − 1
 
Similarly, let Lij , 1 ≤ i ≤ n, 1 ≤ j ≤ 5, be defined
× (1 − ) · · · (1 − ) (n + 1),
by the given joint distribution of Xij and X(i+1)j . n n
Let L be the intersection of all these linear linear
families and let P0 be the uniform distribution on so that in this case,
X 25 . The required maximum entropy joint distri- PM L (xn1 )
bution will be the I-projection of P0 on L. It can be log ∼ log n + const.
Q(xn1 )
computed by iterative scaling, performing cyclically
(1) (2)
I-projections on the sets Lij and Lij (by adjust- A better choice for Q is the mixture with respect
ing the corresponding two-dimensional marginals.) to the Dirichlet prior with α1 = α2 = −1/2, i. e.,
(1) (2) p P L (xn1)
Since L is the intersection of 40 sets Lij and Lij , with ν(p) = 1/(π p(1 − p)), for then log M Q(xn
1)
one cycle of the iteration will consist of 40 consecu- will be asymptotically (1/2) log n + constant, no
tive scalings. matter what is xn1 .
5. From Section 6, 6. (i) From formula (17) in the lecture notes with k = 2
P  we have the auxiliary distribution
k k
Γ i=1 αi +k Y
ν(p) = Qk pαi i , (n0 − 12 )(n0 − 32 ) · · · 12 · (n1 − 12 )(n1 − 32 ) · · · 12
i=1 Γ (αi + 1) i=1 Q(xn1 ) =
n!
is a density (for every α1 , . . . , αk greater than -1), With n = 32, n0 = 22, n1 = 10, we obtain L(xn1 ) =
hence its integral over the probability simplex is 1. d− log Q(xn1 )e = 32.
Applying this with k = 2 and with n0 and n1 =
n − n0 in the role of α1 and α2 , it follows that In the Markov case, the formula on page 16 of the
notes yields
Z
Q(xn1 ) = pn0 (1 − p)n1 dp Q(xn1 ) =
Γ(n0 + 1)Γ(n1 + 1) n0 !n1 ! (n(0, 0) − 1/2) . . . (1/2) · (n(0, 1) − 1/2) . . . (1/2)
= =
Γ(n + 2) (n + 1)! n0 !
1 n0 !n1 ! (n(1, 0) − 1/2) . . . (1/2) · (n(1, 1) − 1/2) . . . (1/2)
= . × .
n + 1 n! n1 !

In our case, setting the unspecified initial state equal On the other hand, the (pointwise) redundancy of
to 0, we have noo = 16, n01 = 6, n10 = 6, n11 = the Shannon code with respect to Q, though it might
4 and n0 = 22, n1 = 10. (Note that the present be negative for some xn1 , is lower bounded by a ran-
n0 and n1 equal those in part (i) only because the dom variable that has finite expectation, (Corollary
initial state has been set equal to the last bit of the 3). Thus − log Q(xn1 ) ≥ − log P ∗ (xn1 ) − Y , where
sequence xn1 ; otherwise there would be a difference E(Y ) is finite. Putting this together with the pre-
of 1.) Substituting these values we obtain L(xn1 ) = ceeding inequality yields the bound
d− log Q(xn1 )e = 33.
1 k−1
Remark. The perhaps surprising result is that for log ≤ nĤn + log n + const. + Y,
P ∗ (xn1 ) 2
the given sequence neither method leads to compres-
sion. This is so in spite of the fact that the first or- which completes the proof.
der empirical entropy Ĥn is clearly less than 1, and Remark. It follows in a similar manner that if
(2)
the second order empirical entropy Ĥn is clearly X1 , X2 , . . . , is an m-th order Markov chain (with ar-
smaller than Ĥn . The reason is that the true code- bitrarily specified states at times 0, −1, . . . , −m + 1)
(2)
length is not Ĥn (or Ĥn , respectively), rather, an then for the m-th order emprical (conditional) en-
additional term (1/2) log n (or log n, respectively), tropy Ĥnm we have
has to be added which stands for the description of
the ML distribution. nĤnm ≤ − log P ∗ (xn1 )
|A|m (|A| − 1)
7. Let r > 0 be any number such that for some prefix ≤ nĤnm + log n + const. + Z,
code with word length function L(xn1 ) we have 2
where Z is a random variable not depending on n,
L(xn1 ) + log Pθ (xn1 ) ≤ r, θ ∈ Θ, xn1 ∈ An . whose expectation is finite. If X1 , X2 , . . . is Markov
of order `, then it is also Markov of order m > ` and
Thus log Pθ (xn1 ) ≤ −L(xn1 ) + r, so that
hence
n
sup Pθ (xn1 ) ≤ 2−L(x1 ) 2r .
θ∈Θ nĤn` ≤ − log P ∗ (xn1 )
|A|m (|A| − 1)
Summing over xn1 and using the Kraft inequality ≤ nĤnm + log n + const. + Zm ,
2
then gives Sn ≤ 2r , so that rn ≥ log Sn .
so that,
On the other hand, for the Shannon code with
respect to the auxiliary distribution Q(xn1 ) = |A|m (|A| − 1) 1
Sn−1 supθ∈Θ Pθ (xn1 ), we have Ĥn` − Ĥnm ≤ log n + const. + Zm .
2n n
Pθ (xn1 ) This allows us to check if the Markov chain is of
L(xn1 ) + log Pθ (xn1 ) ≤ log + 1 ≤ log Sn + 1,
Q(xn1 ) order ` < m. Here Zm can be positive or negative,
but since it has finite expected value, we can use the
which proves that rn ≤ log Sn + 1. Markov inequality to get bounds on the probability
that Zm >  > 0 and use this in the above.
8. The first inequality is trivial because nĤn =
− log PM L (xn1 ). Consider the mixture distribution
Q with respect to the Dirichlet prior with αi ≡ 14.1 Corrections.
−1/2. Theorem 17 gives Line n+ is the n-th line from the top and line n− is
the n-th line from the bottom.
PM L (xn1 ) k−1
log ≤ log n + const.
Q(xn1 ) 2 1. Page 2, column 2, line 20+: Change js+1 to 1 + js .

Combining this with nĤn = − log PM L (xn1 ) then 2. Page 3, column 1, line 19+: Change P (P̂n ∈ Π) to
yields P (P̂n ∈ Πn ).

k−1
P
1 3. Page 3, column 1, line 8-: Change (1/n) a f (a) >
log ≤ nĤn + log n + const.
Q(xn1 )
P
2 α to (1/n) i f (xi ) > α.

4. Page 4, column 1, line 12+: The minimum should
be over P ∈ Π, not P ∗ ∈ Π.

5. Page 7, column 2, line 18+: γ = (j1 , . . . , jd ) should


be ω = (j1 , . . . , jd )

6. Page 8, column 1, line 17-: log P̂ (ω0 )/P (ω0 ) should


be log 1− P̂ (ω0 )
1−P (ω0 ) .

7. Page 11, column 1, lines 2- and 11-: Replace


n n
2−L(x1 ) by ∞ 2−L(x1 ) .
P P P
xn
1 ∈Bn (c) n=1 xn
1 ∈Bn (c)

8. Page 11, column 2, line 16+: Replace Zn =


P (xn1 )/Q(xn1 ) by Zn = Q(xn1 )/P (xn1 ).

9. Page 11, column 2, line 16-: Replace P (Ã) by


P (Ãm ) and Q(Ã) by Q(Ãm ).

10. Page 12, column 1, formula (11): In the integral


exponent the logarithm should be multiplied by 1/n.

11. Page 12, column 1, formula (12): Replace P ∈ by


U ∈.

12. Page 12, column 1, line 19-: Replace P ∈ N∞ by


U ∈ N .

13. Page 12, column 1, line 3-: Replace “code” by “pro-


cess U ”.

14. Page 14, column 1, formula (14): Replace i + 1 by


i = 1 in the product.

15. Page 15, column 2, line 14-: Replace log 2e by log 2 .
