
EE 623: Information Theory

Junmo Kim
December 3, 2009
Junmo Kim EE 623: Information Theory
Lecture 1
Junmo Kim EE 623: Information Theory
Applications of Information Theory

Communication

Data compression (lossless)

Image / speech compression (lossy)

Coding

Cryptography

Statistics, Probability

Physics
Junmo Kim EE 623: Information Theory
Communication Theory

Broadcast channel : 1 sender N receiver

Multiple Access Channel : N sender 1 receiver


Junmo Kim EE 623: Information Theory
Notation

A : finite set (the alphabet)

e.g. A = {H, T}, A = {A, B, C}

Ω : set of objects (sample space)

|A| : cardinality of A

Definition

Chance variable : a mapping from Ω to A.
We say that X : Ω → A is a chance variable taking values in A.

Let's consider a mapping g : A → R. Then g(X) is a random
variable, and E[g(X)] is well defined.
Junmo Kim EE 623: Information Theory
PMF

PMF of X : p_X(x) = Pr(X = x)   (C & T notation: p(x))

For g : A → R,  E[g(X)] = Σ_x p_X(x) g(x).

p_X(X) is a random variable. Why?

p_X(·) is a function.
Junmo Kim EE 623: Information Theory
Entropy of a chance variable
A : finite
X : chance variable with PMF p_X(x) = Pr(X = x).
Definition
The entropy of a chance variable X is defined by
H(X) = Σ_{x∈A} p_X(x) log( 1 / p_X(x) ) = E[-log p_X(X)]
Junmo Kim EE 623: Information Theory
Entropy of a chance variable
Example
X takes on {H, T}, each with probability 1/2.
H(X) = (1/2) log 2 + (1/2) log 2 = log 2 = 1 bit  (= ln 2 nats)
bits : base-2 logarithm
nats : base-e logarithm
Note: We define 0 log 0 = 0, consistent with lim_{x→0} x log x = 0.
(e.g. 0.0001 log_10 0.0001 = -0.0004 ≈ 0)
Similarly 0 log(1/0) = 0.
Junmo Kim EE 623: Information Theory
Entropy of a chance variable
Example
Let
X = a with probability 1/2,
    b with probability 1/4,
    c with probability 1/8,
    d with probability 1/8.
The entropy of X is
H(X) = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/8) log 8 = 7/4 bits.
Junmo Kim EE 623: Information Theory
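To make the numbers above concrete, here is a small Python sketch (not part of the original slides; the function name and example PMFs are mine) that computes H(X) in bits for an arbitrary PMF, using the 0 log 0 = 0 convention.

    import math

    def entropy(pmf, base=2):
        # Entropy of a PMF given as a list of probabilities, with 0 log 0 = 0.
        return -sum(p * math.log(p, base) for p in pmf if p > 0)

    print(entropy([0.5, 0.5]))               # 1.0 bit (fair coin)
    print(entropy([1/2, 1/4, 1/8, 1/8]))     # 1.75 bits (the 7/4-bit example)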
Entropy of a chance variable

Suppose we wish to determine the value of X with the minimum
number of binary (yes/no) questions.

An efficient first question is "Is X = a?". This splits the
probability in half.

If the answer to the first question is no, then the second
question can be "Is X = b?".

The third question can be "Is X = c?".

What is the resulting expected number of binary questions required?

With probability 1/2, X = a : 1 = log 2 question is required.
With probability 1/4, X = b : 2 = log 4 questions are required.
With probability 1/8, X = c : 3 = log 8 questions are required.
With probability 1/8, X = d : 3 = log 8 questions are required.

The resulting expected number of binary questions required is
(1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/8) log 8 = 7/4 = H(X).
Junmo Kim EE 623: Information Theory
Properties of the Entropy
1. If X takes some value with probability 1 then H(X) = 0.
   pf: By definition & 0 log(1/0) = 0.
2. H(X) ≥ 0.
   pf: x log(1/x) is non-negative for 0 ≤ x ≤ 1.
3. If H(X) = 0, then X must be deterministic.
   pf: x log(1/x) ≥ 0 for 0 ≤ x ≤ 1, and x log(1/x) = 0 only for x = 0 or 1.
   Each term of the entropy Σ_{x∈A} p_X(x) log(1/p_X(x)) is
   non-negative, and positive for 0 < p_X(x) < 1.
4. The entropy is determined by the PMF.
   e.g. {H, T} with probabilities 1/3, 2/3 and
   {Rain, Shine} with probabilities 1/3, 2/3 have the same entropy.
Junmo Kim EE 623: Information Theory
Properties of the Entropy
5. Of all chance variables taking values in A, the one with the highest
entropy is the uniformly distributed one,
P_X(x) = 1/|A|, with entropy log |A|.

The proof comes from the non-negativity of relative entropy.

Thus we have
0 ≤ H(X) ≤ log |A|
first equality iff X is deterministic
second equality iff X is uniform
Junmo Kim EE 623: Information Theory
Lecture 2
Junmo Kim EE 623: Information Theory
Entropy of a Chance Variable
A : finite
X : chance variable with PMF p_X(x) = Pr(X = x).
Definition
The entropy of a chance variable X is defined by
H(X) = Σ_{x∈A} p_X(x) log( 1 / p_X(x) ) = E[-log p_X(X)]
Junmo Kim EE 623: Information Theory
Properties of the Entropy
1. If X takes some value with probability 1 then H(X) = 0.
   pf: By definition & 0 log(1/0) = 0.
2. H(X) ≥ 0.
   pf: x log(1/x) is non-negative for 0 ≤ x ≤ 1.
3. If H(X) = 0, then X must be deterministic.
   pf: x log(1/x) ≥ 0 for 0 ≤ x ≤ 1, and x log(1/x) = 0 only for x = 0 or 1.
   Each term of the entropy Σ_{x∈A} p_X(x) log(1/p_X(x)) is
   non-negative, and positive for 0 < p_X(x) < 1.
4. The entropy is determined by the PMF.
   e.g. {H, T} with probabilities 1/3, 2/3 and
   {Rain, Shine} with probabilities 1/3, 2/3 have the same entropy.
Junmo Kim EE 623: Information Theory
Properties of the Entropy
5. Of all chance variables taking values in A, the one with the highest
entropy is the uniformly distributed one,
P_X(x) = 1/|A|, with entropy log |A|.

The proof comes from the non-negativity of relative entropy.

Thus we have
0 ≤ H(X) ≤ log |A|
first equality iff X is deterministic
second equality iff X is uniform
Junmo Kim EE 623: Information Theory
Relative Entropy
Definition
Given two PMFs p(·), q(·) on A, the relative entropy between
p(·) and q(·) is defined as
D(p||q) = Σ_{x∈A} p(x) log( p(x) / q(x) )

Claim: For all p, q, D(p||q) ≥ 0.
Equality holds iff p = q, i.e. p(x) = q(x) for all x ∈ A.

If q is uniform,
D(p||q) = Σ p(x) log( p(x) / (1/|A|) ) = log |A| - H(X) ≥ 0.
Thus we have H(X) ≤ log |A|.
Junmo Kim EE 623: Information Theory
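A small Python sketch of the relative-entropy computation (illustrative only; the function name and example PMFs are mine), checking the "q uniform" identity D(p||uniform) = log|A| - H(p) on the earlier 7/4-bit example.

    import math

    def kl_divergence(p, q, base=2):
        # D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0.
        return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

    p = [1/2, 1/4, 1/8, 1/8]
    uniform = [1/4] * 4
    print(kl_divergence(p, uniform))   # log|A| - H(p) = 2 - 1.75 = 0.25 bits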
Convex and Concave functions
Definition

A function f : R → R is said to be convex over an interval (a, b)
if for every x_1, x_2 ∈ (a, b) and λ ∈ [0, 1],
f(λ x_1 + (1-λ) x_2) ≤ λ f(x_1) + (1-λ) f(x_2).

A function f is said to be strictly convex if equality holds only
if λ = 0 or λ = 1.

A function f is concave if -f is convex.
Junmo Kim EE 623: Information Theory
Convex and Concave functions
Theorem
If f is twice differentiable, f is convex iff f''(x) ≥ 0 for all x.

If f''(x) > 0 for all x, f is strictly convex.
If f''(x) < 0 for all x, f is strictly concave.

Example
f(x) = ln x for x > 0
f''(x) = -1/x^2 < 0 : strictly concave
Junmo Kim EE 623: Information Theory
Jensen's Inequality
Theorem
If f is concave then for any random variable X,
f(E[X]) ≥ E[f(X)].
If f is strictly concave,
f(E[X]) = E[f(X)]  ⟺  X is deterministic.
Junmo Kim EE 623: Information Theory
Jensen's Inequality
Proof.
Expand f around a point x_0; since f is concave, f''(ξ) ≤ 0, so
f(x) = f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(ξ)(x - x_0)^2
     ≤ f(x_0) + f'(x_0)(x - x_0).
Thus for a random variable X, we have
f(X) ≤ f(x_0) + f'(x_0)(X - x_0),
E[f(X)] ≤ f(x_0) + f'(x_0) E[X - x_0].
By taking x_0 = E[X], we have
E[f(X)] ≤ f(E[X]).
Junmo Kim EE 623: Information Theory
Non-negativity of Relative Entropies
Theorem
For all p, q, D(p||q) ≥ 0, where equality holds iff p = q, i.e.
p(x) = q(x) for all x ∈ A.
Junmo Kim EE 623: Information Theory
Non-negativity of Relative Entropies
Proof.
D(p||q) = Σ_x p(x) log( p(x)/q(x) ) = E_p[ log( p(X)/q(X) ) ]

-D(p||q) = E_p[ log( q(X)/p(X) ) ]
         ≤ log E_p[ q(X)/p(X) ]        (Jensen, since log is concave)
         = log( Σ_x p(x) q(x)/p(x) ) = log 1 = 0

Equality holds iff q(X)/p(X) is deterministic, i.e. p = q.
Junmo Kim EE 623: Information Theory
Non-negativity of Relative Entropies
Junmo Kim EE 623: Information Theory
Joint Entropy
Let's consider a pair of chance variables (X, Y). We can view
(X, Y) as a chance variable taking values in
A × B = {(x, y) | x ∈ A, y ∈ B} with PMF p_{X,Y}(x, y). Thus
H(X, Y) is defined as
H(X, Y) = H((X, Y))
        = -Σ_{x∈A} Σ_{y∈B} p_{X,Y}(x, y) log p_{X,Y}(x, y)
Junmo Kim EE 623: Information Theory
Conditional Entropy
The conditional entropy H(X|Y) is defined as
H(X|Y) = Σ_y p_Y(y) H(X|Y = y),
where H(X|Y = y) = -Σ_{x∈A} p_{X|Y}(x|y) log p_{X|Y}(x|y).

Example
A = {H, T}, B = {0, 1}.
Prob[Y = 0] = Prob[Y = 1] = 1/2.
Y = 0 ⇒ X = H
Y = 1 ⇒ X = H or T, each with probability 1/2.
H(X|Y = 0) = 0 bit, H(X|Y = 1) = 1 bit
H(X|Y) = (1/2) H(X|Y = 0) + (1/2) H(X|Y = 1) = 1/2 bit
H(X) = H_b(1/4) ≈ 0.811 bit
Junmo Kim EE 623: Information Theory
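A quick numerical check of the example above (an illustrative sketch, not from the slides; the joint PMF is the one just described): compute H(X) and H(X|Y) = H(X,Y) - H(Y) directly from the joint PMF.

    import math

    def H(pmf):
        # Entropy in bits of an iterable of probabilities.
        return -sum(p * math.log2(p) for p in pmf if p > 0)

    # Y uniform on {0,1}; given Y=0, X=H; given Y=1, X is a fair coin.
    p_xy = {('H', 0): 1/2, ('H', 1): 1/4, ('T', 1): 1/4}
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p

    print(H(p_x.values()))                         # ≈ 0.811 = H_b(1/4)
    print(H(p_xy.values()) - H(p_y.values()))      # H(X|Y) = 0.5 by the chain rule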
Conditional Entropy
H(X|Y) = Σ_y p_Y(y) H(X|Y = y)
       = -Σ_y p_Y(y) Σ_{x∈A} p_{X|Y}(x|y) log p_{X|Y}(x|y)
       = -Σ_{x∈A} Σ_y p_{X,Y}(x, y) log p_{X|Y}(x|y)
       = E[-log p_{X|Y}(X|Y)]
Junmo Kim EE 623: Information Theory
Chain Rule
Theorem
H(X, Y) = H(X) + H(Y|X)
Proof.
log p_{X,Y}(X, Y) = log p_X(X) + log p_{Y|X}(Y|X)
Take the expectation of both sides (w.r.t. the joint distribution p_{X,Y}):
-E[log p_{X,Y}(X, Y)] = -E[log p_X(X)] - E[log p_{Y|X}(Y|X)]
This proves H(X, Y) = H(X) + H(Y|X).
Junmo Kim EE 623: Information Theory
Conditional Entropy
If X and Y are independent, H(X|Y) = H(X).
Proof.
It comes from p(x|Y = y) = p(x).
Junmo Kim EE 623: Information Theory
Mutual Information
Definition
Consider two random variables X and Y with joint PMF p_{X,Y}(x, y).
The mutual information I(X; Y) is defined as
I(X; Y) = H(X) - H(X|Y)
Claim: I(X; Y) = I(Y; X)
Proof.
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
⇒ H(X) - H(X|Y) = H(Y) - H(Y|X)
Junmo Kim EE 623: Information Theory
Mutual Information
Claim:
Let X, Y have joint PMF p_{X,Y}(x, y). Let p_X(x) and p_Y(y) be the
marginals (p_X(x) = Σ_y p_{X,Y}(x, y)). Then
I(X; Y) = D(p_{X,Y} || p_X p_Y)
Proof.
D(p_{X,Y} || p_X p_Y) = Σ_{x,y} p_{X,Y}(x, y) log( p_{X,Y}(x, y) / (p_X(x) p_Y(y)) )
                      = Σ_{x,y} p_{X,Y}(x, y) log( p_{Y|X}(y|x) / p_Y(y) )
                      = -H(Y|X) + H(Y)
Junmo Kim EE 623: Information Theory
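The identity I(X;Y) = D(p_XY || p_X p_Y) gives a direct way to compute mutual information from a joint PMF. A short Python sketch (illustrative; function name and example are mine), reusing the conditional-entropy example, where I(X;Y) = H(X) - H(X|Y) ≈ 0.811 - 0.5.

    import math

    def mutual_information(p_xy):
        # I(X;Y) in bits from a dict {(x, y): prob}, computed as D(p_XY || p_X p_Y).
        p_x, p_y = {}, {}
        for (x, y), p in p_xy.items():
            p_x[x] = p_x.get(x, 0) + p
            p_y[y] = p_y.get(y, 0) + p
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in p_xy.items() if p > 0)

    p_xy = {('H', 0): 1/2, ('H', 1): 1/4, ('T', 1): 1/4}
    print(mutual_information(p_xy))   # ≈ 0.311 bits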
Conditional Mutual Information
Definition
Let's consider three random variables X, Y, Z with joint PMF
p_{X,Y,Z}(x, y, z). The conditional mutual information of X and Y
given Z is defined by
I(X; Y|Z) = H(X|Z) - H(X|Y, Z)
          = E_{p(x,y,z)}[ log( p_{X,Y|Z}(X, Y|Z) / (p_{X|Z}(X|Z) p_{Y|Z}(Y|Z)) ) ]
Junmo Kim EE 623: Information Theory
Conditional Mutual Information
Claim:
I(X; Y|Z) = Σ_z I(X; Y|Z = z) p_Z(z)
Proof.
Σ_z I(X; Y|Z = z) p_Z(z)
  = Σ_z p_Z(z) Σ_{x,y} p_{X,Y|Z}(x, y|z) log( p_{X,Y|Z}(x, y|z) / (p_{X|Z}(x|z) p_{Y|Z}(y|z)) )
  = Σ_{x,y,z} p_{X,Y,Z}(x, y, z) log( p_{X,Y|Z}(x, y|z) / (p_{X|Z}(x|z) p_{Y|Z}(y|z)) )
  = E_{p(x,y,z)}[ log( p(X, Y|Z) / (p(X|Z) p(Y|Z)) ) ]
Junmo Kim EE 623: Information Theory
Non-negativity of Mutual Information
Claim:
I(X; Y) ≥ 0,
with equality iff X and Y are independent.
Proof.
It comes from I(X; Y) = D(p_{X,Y} || p_X p_Y) ≥ 0. Equality holds iff
p_{X,Y} = p_X p_Y, i.e. X and Y are independent.

It follows that H(X) ≥ H(X|Y).

Caution: H(X|Y = y) can be larger than H(X).
Junmo Kim EE 623: Information Theory
Example
Junmo Kim EE 623: Information Theory
Chain Rule for Entropy
Theorem
H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_1, ..., X_{n-1})
Proof.
For two chance variables X_1 and X_2 we have the following chain rule:
H(X_1, X_2) = H(X_1) + H(X_2|X_1)
As (X_1, ..., X_{n-1}) can be viewed as one big chance variable, applying
the above chain rule we have
H(X_1, ..., X_{n-1}, X_n) = H(X_1, ..., X_{n-1}) + H(X_n|X_1, ..., X_{n-1}).
This and induction prove the chain rule.
Junmo Kim EE 623: Information Theory
Chain Rule for Entropy
H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_1, ..., X_{n-1})
In shorthand notation,
H(X_1^n) = Σ_{i=1}^n H(X_i | X_1^{i-1}),
where X_i^j = (X_i, X_{i+1}, ..., X_j).
We often omit the subscript 1, e.g. X_1^j can be written as X^j:
H(X^n) = Σ_{i=1}^n H(X_i | X^{i-1})
Junmo Kim EE 623: Information Theory
Lecture 3
Junmo Kim EE 623: Information Theory
Entropy of Binary Random Variable
Junmo Kim EE 623: Information Theory
Entropy is Concave

Let p_1, ..., p_{|A|} be probability masses satisfying
0 ≤ p_i ≤ 1, Σ_i p_i = 1.

Let q_1, ..., q_{|A|} be probability masses satisfying
0 ≤ q_i ≤ 1, Σ_i q_i = 1.

For any λ ∈ (0, 1), let r_i = λ p_i + λ̄ q_i, where λ̄ = 1 - λ.

Then Σ_i r_i = 1 and 0 ≤ r_i ≤ 1. Thus r_1, ..., r_{|A|} is
another valid set of probability masses.

The claim is
H(r_1, ..., r_{|A|}) ≥ λ H(p_1, ..., p_{|A|}) + λ̄ H(q_1, ..., q_{|A|})
Junmo Kim EE 623: Information Theory
Probability Simplex

p = (p_1, ..., p_{|A|}) lies in |A|-dimensional space.

With the constraints 0 ≤ p_i ≤ 1, Σ_i p_i = 1, the space of
PMFs {(p_1, ..., p_{|A|}) : 0 ≤ p_i ≤ 1, Σ_i p_i = 1} is a subset of
R^{|A|} and is called the probability simplex.
Junmo Kim EE 623: Information Theory
Entropy is Concave
Consider two PMFs and their convex combination:
p^(1) = (p^(1)_1, ..., p^(1)_{|A|})
p^(2) = (p^(2)_1, ..., p^(2)_{|A|})
λ p^(1) + λ̄ p^(2) = (λ p^(1)_1 + λ̄ p^(2)_1, ..., λ p^(1)_{|A|} + λ̄ p^(2)_{|A|})
Theorem
H(λ p^(1) + λ̄ p^(2)) ≥ λ H(p^(1)) + λ̄ H(p^(2))
Junmo Kim EE 623: Information Theory
Entropy is Concave
Proof.
Let Z take on the value 1 with probability λ and the value 2 with
probability λ̄.
X_1 is distributed according to p^(1).
X_2 is distributed according to p^(2).
Now consider X_Z.
H(X_Z|Z) = Pr(Z = 1) H(X_Z|Z = 1) + Pr(Z = 2) H(X_Z|Z = 2)
         = λ H(p^(1)) + λ̄ H(p^(2))
Pr(X_Z = x) = λ p^(1)(x) + λ̄ p^(2)(x)
Thus H(X_Z) = H(λ p^(1) + λ̄ p^(2)). The result follows from
H(X_Z) ≥ H(X_Z|Z).
Junmo Kim EE 623: Information Theory
The Horse Race

Assume that m horses run in a race.
Horse i wins with probability p_i.
If horse i wins, the payoff is o_i (or o(i)).
Let X be a chance variable corresponding to the winning horse.
Given p = (p_1, ..., p_m), o = (o_1, ..., o_m):
How do I distribute my wealth?
Assume I bet all my money.
Choose b = (b_1, ..., b_m) : b_i is the fraction of my wealth bet on
horse i. We use b_i and b(i) interchangeably.
Σ b_i = 1 (bet all my money), 0 ≤ b_i ≤ 1.
Junmo Kim EE 623: Information Theory
Question

How should I choose b to do best?
At the end of the race, I have b(X) o(X) (per unit of initial wealth),
which is a random variable.
Maybe maximize E[b(X) o(X)] = Σ_i p_i b_i o_i = Σ_i b_i (p_i o_i)?
Then put all your money on the horse i* with the highest p_i o_i.
If I have N races, where the race outcomes X_1, X_2, ..., X_N are
i.i.d. ~ p(x), the gambler's wealth is
S_N = Π_{k=1}^N b(X_k) o(X_k)
(1/N) log S_N = (1/N) Σ_{k=1}^N log( b(X_k) o(X_k) ) → E[log b(X) o(X)] as N → ∞.
At the end of N races (N >> 1), S_N ≈ 2^{N E[log b(X) o(X)]},
in the sense that | (1/N) log S_N - E[log b(X) o(X)] | → 0 as N → ∞.
Junmo Kim EE 623: Information Theory

I suggest choosing b to maximize
E[log b(X) o(X)]   (the doubling rate)
W(b, p) = E[log b(X) o(X)]

W(b, p) = Σ p_i log( o_i b_i )
        = Σ p_i log o_i   (no control over this term)
          + Σ p_i log b_i

Choose b_i to maximize Σ p_i log b_i.
No attention needs to be paid to o_i.
Junmo Kim EE 623: Information Theory
Maximization of the Doubling Rate
We would like to maximize Σ p_i log b_i subject to 0 ≤ b_i ≤ 1 and
Σ b_i = 1. Writing the functional with a Lagrange multiplier, we have
J(b) = Σ p_i log b_i + λ Σ b_i
∂J(b)/∂b_i = p_i / b_i + λ = 0   for i = 1, ..., m
⇒ p_i = -λ b_i
Since Σ b_i = 1, Σ p_i = 1, and Σ p_i = -λ Σ b_i, we have λ = -1.
Thus p_i = b_i is a stationary point of the function J(b).
We now verify that this proportional gambling is optimal.
Junmo Kim EE 623: Information Theory
Maximization of the Doubling Rate
Theorem
The doubling rate W(b, p) = E[log b(X) o(X)] is maximized by
choosing b = p.
Proof.
Let b be arbitrary. Compare b* = p with b.
Σ p_i log b*_i - Σ p_i log b_i = Σ p_i log p_i - Σ p_i log b_i
                               = Σ p_i log( p_i / b_i )
                               = D(p||b) ≥ 0,
where equality holds iff p_i = b_i. This proves that proportional
betting is optimal.
Note: This strategy assumes I bet all my money.
Junmo Kim EE 623: Information Theory
Example: Uniform Fair Odds
If one uses b* = p,
W(b*, p) = Σ p_i log( p_i o_i )
Example: o_i = m (uniform fair odds). If b_i = 1/m, it is guaranteed
that we get the money back, as b(X) o(X) = 1.
W(b*, p) = Σ p_i log p_i + Σ p_i log m
         = log m - H(p)
S_N ≈ 2^{N (log m - H(p))}
Entropy is a measure of uncertainty. The lower the entropy, the
more money you can make betting on X.
Junmo Kim EE 623: Information Theory
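The doubling-rate formula above is easy to check by simulation. A small Python sketch (an assumed setup of my own, not from the slides): m = 4 horses with uniform fair odds o_i = m, proportional betting b = p, and the empirical growth rate compared against log m - H(p).

    import math, random

    p = [0.5, 0.25, 0.125, 0.125]     # win probabilities
    m = len(p)
    o = [m] * m                        # uniform fair odds o_i = m
    b = p[:]                           # proportional betting b* = p

    N = 100_000
    log_wealth = 0.0
    for _ in range(N):
        x = random.choices(range(m), weights=p)[0]   # winning horse
        log_wealth += math.log2(b[x] * o[x])

    print(log_wealth / N)                                    # empirical (1/N) log S_N
    print(math.log2(m) + sum(q * math.log2(q) for q in p))   # log m - H(p) = 0.25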
Fair, Superfair, Subfair Cases
So far we assumed that you bet all your money. What if you don't
have to bet all your money?

Fair odds: Σ 1/o_i = 1
In this case, there is no loss of optimality in assuming that you
must gamble all your money:
If b_i = 1/o_i, the outcome is deterministically 1.
Suppose gambler A retains some of his wealth as cash.
Gambler B can get the same outcome by distributing that cash over
the horses as b_i = 1/o_i.
Hence proportional betting is still optimal.
Junmo Kim EE 623: Information Theory
Fair, Superfair, Subfair Cases
What if you don't have to bet all your money?

Superfair case: Σ 1/o_i < 1
In this case, the odds are even better than fair odds, so we
will bet all the money.
Strategy 1: proportional betting maximizes the doubling rate.
Strategy 2: We can form a Dutch book (a set of bets which
guarantees a profit, regardless of the outcome of the gamble).
We choose b_i = c_i + d_i so that Σ b_i = 1:
Choose c_i = 1/o_i, to make sure I get the money back (o_i c_i = 1).
Choose d_i to be any non-negative numbers with Σ d_i = 1 - Σ 1/o_i > 0.
By this, a profit is guaranteed.
In general, a Dutch book, though risk-free, does not optimize
the doubling rate.
Junmo Kim EE 623: Information Theory
Fair, Superfair, Subfair Cases
What if you don't have to bet all your money?

Subfair case: Σ 1/o_i > 1
In this case, don't put all your money into the race.
Proportional gambling is no longer log-optimal. (Cover &
Thomas, Problem 6.2)
There isn't an easy closed-form solution, but a water-filling
solution exists.
Junmo Kim EE 623: Information Theory
Side Information

X : winning horse
Y : side information
p_{X,Y}(x, y)
Betting strategy b(x|y)
The goal is to maximize
E[log( b(X|Y) o(X) )] = Σ_y p_Y(y) Σ_x p_{X|Y}(x|y) log( b(x|y) o(x) ),
which can be maximized separately for each y.
b*(x|y) = p_{X|Y}(x|y) is optimal.
Junmo Kim EE 623: Information Theory
Side Information

Difference in W:
ΔW = W(b*(x|y), p_{X,Y}) - W(b*, p_X)
   = Σ_{x,y} p_{X,Y}(x, y) log( b*(x|y) o(x) ) - Σ_{x,y} p_{X,Y}(x, y) log( b*(x) o(x) )
   = Σ_{x,y} p_{X,Y}(x, y) log( b*(x|y) / b*(x) )
   = Σ_{x,y} p_{X,Y}(x, y) log( p(x|y) / p(x) )
   = I(X; Y)
Junmo Kim EE 623: Information Theory
Doubling Rate and Relative Entropy

Consider the fair-odds case: let r_i = 1/o_i.
Then Σ r_i = 1, so r can be interpreted as a PMF.
Suppose we do not know the true PMF p and we use a
sub-optimal b_i, which is the gambler's estimate of p_i.
The doubling rate is
Σ_x p_X(x) log( o(x) b(x) ) = Σ_x p_X(x) log( (b(x)/p_X(x)) · (p_X(x)/r(x)) )
                            = D(p||r) - D(p||b).
The relative entropy can be interpreted as a distance.
The gambler can make money only if his estimate b of p is
better (closer) than r.
Junmo Kim EE 623: Information Theory
Lecture 4
Junmo Kim EE 623: Information Theory
Review

Betting: proportional betting b*_i = p_i (i.e. b*(x) = p(x)),
assuming we bet all the money.
When side information is available, b*(x|y) = p_{X|Y}(x|y), and
ΔW = I(X; Y).
Fair odds: Σ 1/o_i = 1
Super-fair odds: Σ 1/o_i < 1
Sub-fair odds: Σ 1/o_i > 1
Junmo Kim EE 623: Information Theory
Review

Uniform fair odds: o_i = |A|, b_i = p_i.
After one race, the wealth is S_1 = b(X) o(X) = p_X(X) |A|.
After two races, the wealth is S_2 = p_X(X_1) p_X(X_2) |A|^2.
After n races, the wealth is S_n = Π_{i=1}^n p_X(X_i) |A|^n.
Log wealth:
(1/n) log S_n = (1/n) Σ_{i=1}^n log p_X(X_i)   (a random variable)   + log |A|
              → E[log p_X(X)] + log |A|
              = -H(X) + log |A|
Junmo Kim EE 623: Information Theory
Dependent Races

X_i ∈ A : X_1, X_2, ...
For the first race, bet b^(1) = p_{X_1}(·).
If x_1 wins, bet b^(2) = p_{X_2|X_1=x_1}(·).
b^(3) = p_{X_3|X_1,X_2}(·), and so on.
With uniform fair odds, how much money do I have?
S_1 = p_{X_1}(X_1) |A|
S_2 = p_{X_1}(X_1) |A| · p_{X_2|X_1}(X_2|X_1) |A| = p_{X_1,X_2}(X_1, X_2) |A|^2
S_n = p_{X_1,...,X_n}(X_1, ..., X_n) |A|^n
(1/n) log S_n = log |A| - (1/n) log( 1 / p_{X_1,...,X_n}(X_1, ..., X_n) )
The exponent in the growth rate is
(1/n) E[log S_n] = log |A| - (1/n) E[ log( 1 / p_{X_1,...,X_n}(X_1, ..., X_n) ) ]
                 = log |A| - (1/n) H(X_1, ..., X_n).
The limit of (1/n) H(X_1, ..., X_n) is called the entropy rate.
Junmo Kim EE 623: Information Theory
Entropy Rate
Definition
The entropy rate of a stochastic process {X_i} is defined by
H(A) = lim_{n→∞} (1/n) H(X_1, ..., X_n),
when the limit exists.
Notation: H(A) (book) or H({X_k}) (with the stochastic process as
the argument).
When does the limit exist?
We will prove that the limit exists whenever {X_k} is stationary.
Junmo Kim EE 623: Information Theory
Stochastic Processes, Stationarity

A stochastic process is a collection of chance variables indexed
by the natural numbers:
X_1, ..., X_n, ...
Characterization:
P_{X_1,...,X_n}(x_1, ..., x_n) for all n ≥ 1
The process is stationary if,
for all n, k ≥ 1 and for all α_1, ..., α_n ∈ A,
p_{X_1,...,X_n}(α_1, ..., α_n) = p_{X_{1+k},...,X_{n+k}}(α_1, ..., α_n).
Junmo Kim EE 623: Information Theory
Entropy Rate of a Stationary Stochastic Processes
Theorem
If {X_k} is stationary then the limit
lim_{n→∞} (1/n) H(X_1, ..., X_n) exists.

e.g.) {X_k} i.i.d. according to p_X(x):
H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_1, ..., X_{n-1})
                 = n H(X_1)
Junmo Kim EE 623: Information Theory
Entropy Rate of a Stationary Stochastic Processes
Lemma
If {X_k} is stationary then the limit
lim_{n→∞} H(X_n|X_1, ..., X_{n-1}) exists.
Proof.
Claim: H(X_n|X_1, ..., X_{n-1}) is monotonically non-increasing in n.
Let's compare H(X_{n+1}|X_1, ..., X_n) and H(X_n|X_1, ..., X_{n-1}):
H(X_{n+1}|X_1, ..., X_n) ≤ H(X_{n+1}|X_2, ..., X_n)   (conditioning reduces entropy)
Because {X_k} is stationary, (X_1, ..., X_n) and (X_2, ..., X_{n+1}) have
the same distribution, so
H(X_{n+1}|X_2, ..., X_n) = H(X_n|X_1, ..., X_{n-1}).
Thus H(X_n|X_1^{n-1}) is non-increasing.
Since H ≥ 0, the limit exists:
a non-increasing sequence of non-negative numbers has a limit.
Junmo Kim EE 623: Information Theory
Entropy Rate of a Stationary Stochastic Processes
Lemma
If {X_k} is stationary then
lim_{n→∞} (1/n) H(X_1, ..., X_n) = lim_{n→∞} H(X_n|X_1^{n-1}).
Proof.
(1/n) H(X_1, ..., X_n)   (call it b_n)   = (1/n) Σ_{i=1}^n H(X_i|X_1^{i-1})   (call each term a_i)
So b_n = (1/n) Σ a_i, and if a_i → α then b_n → α (Cesàro mean).
(Comment: The reverse is not true. 1, 2, 1, 2, 1, 2, ... has no
limit but its running average converges to 1.5.)
Junmo Kim EE 623: Information Theory
Entropy Rate: Example
Example
{X_k} are independent, with
X_i = H with probability 1/i,  T with probability 1 - 1/i.
This process is not stationary.
Does lim_{n→∞} H(X_n|X^{n-1}) exist?
Yes. H(X_n|X^{n-1}) = H(X_n) = H_b(1/n, (n-1)/n) → 0 as n → ∞.
Does (1/n) H(X_1, ..., X_n) → 0?
Yes: a_n → 0 implies (1/n) Σ a_i → 0.
Junmo Kim EE 623: Information Theory
Example: Markov Process

A stochastic process X_1, X_2, ... is a Markov process if
p_{X_n|X_1,...,X_{n-1}}(x_n|x_1, ..., x_{n-1}) = p_{X_n|X_{n-1}}(x_n|x_{n-1}).
It is a time-invariant (homogeneous) Markov process if
p_{X_n|X_{n-1}}(x|x') does not depend on n.
Example: X_n is the location of a random walk,
X_{n+1} = X_n + 1 with probability 1/2,  X_n - 1 with probability 1/2.
If {X_k} is Markov and stationary, it is time-invariant (homogeneous).
If {X_k} is homogeneous and Markov, it need not be stationary,
e.g. a random walk starting at zero (X_0 = 0).
Does H(X_n|X^{n-1}) converge whenever {X_k} is Markov?
H(X_n|X^{n-1}) = H(X_n|X_{n-1})   (Markov)
No. It depends on whether H(X_n|X_{n-1}) converges.
Junmo Kim EE 623: Information Theory
Lecture 5
Junmo Kim EE 623: Information Theory
Review

Dependent horse races with uniform fair odds: X_i ∈ A, X_1, X_2, ...
First race: b^(1) = p_{X_1}(·), so S_1 = p_{X_1}(X_1) |A|.
If x_1 wins: b^(2) = p_{X_2|X_1=x_1}(·), so
S_2 = p_{X_1}(X_1) |A| p_{X_2|X_1}(X_2|X_1) |A| = p_{X_1,X_2}(X_1, X_2) |A|^2.
S_n = p_{X_1,...,X_n}(X_1, ..., X_n) |A|^n
S_n = 2^{n [ log|A| + (1/n) log p_{X_1,...,X_n}(X_1, ..., X_n) ]}
H({X_k}) = H(A) = lim_{n→∞} (1/n) H(X_1, ..., X_n)   (if the limit exists)
If lim_{n→∞} H(X_n|X^{n-1}) exists, then lim_{n→∞} (1/n) H(X_1, ..., X_n)
exists and they are equal.
If {X_k} is stationary, then lim_{n→∞} H(X_n|X^{n-1}) exists.
Junmo Kim EE 623: Information Theory
Markov Chains

Markov chains over finite sets: X_1, X_2, ... ∈ A, |A| < ∞.
Pr(X_k = x_k | X_{k-1} = x_{k-1}, ..., X_1 = x_1) = Pr(X_k = x_k | X_{k-1} = x_{k-1})
Homogeneous if p_{X_k|X_{k-1}}(x'|x) doesn't depend on k.
For a stationary Markov chain:
lim_{n→∞} (1/n) H(X_1, ..., X_n) = lim_{n→∞} H(X_n|X^{n-1})   (stationary)
                                 = lim_{n→∞} H(X_n|X_{n-1})    (Markov)
                                 = H(X_2|X_1)                  (stationary)
H({X_k}) = Σ_{x∈A} p_{X_1}(x) H(X_2|X_1 = x)
         = Σ_{x∈A} p_{X_1}(x) Σ_{x'∈A} p(x'|x) log( 1 / p(x'|x) )
Junmo Kim EE 623: Information Theory
Stationary Markov Chains

Suppose p(j|i) is given for a time-invariant Markov chain.
Which distribution on X_1 will result in a stationary process?
Fact: For a time-invariant Markov chain, the process {X_k} is
stationary iff p_{X_1}(·) = p_{X_2}(·), i.e.
Σ_i π(i) p(j|i) = π(j).
In matrix form, we have
[π(1) ... π(|A|)] = [π(1) ... π(|A|)] P,
where P = [p(j|i)] is the probability transition matrix.
A solution π(·) of the above equation is called a stationary
distribution.
Junmo Kim EE 623: Information Theory
Entropy Rate of Markov Chains

Facts: For an irreducible, aperiodic, time-invariant Markov chain:
1. There is a unique stationary distribution π.
2. Pr(X_n = i) converges to the stationary distribution regardless
   of p_{X_1}(·).
For such processes H(X_n|X_{n-1}) may depend on n.
However, lim_{n→∞} H(X_n|X_{n-1}) exists and is equal to
Σ_{x∈A} π_X(x) Σ_{x'∈A} p(x'|x) log( 1 / p(x'|x) ).
Irreducible: if it is possible to go with positive probability
from any state of the Markov chain to any other state in a
finite number of steps, the Markov chain is irreducible.
Aperiodic: if the greatest common divisor of the lengths of
different paths from a state to itself is 1, the Markov chain is
called aperiodic.
Junmo Kim EE 623: Information Theory
Entropy Rate of Markov Chains

To find H({X_k}) from p(j|i):
1. Solve π P = π to find the stationary distribution π_X(x).
2. Plug π_X(x) into
   Σ_{x∈A} π_X(x) Σ_{x'∈A} p(x'|x) log( 1 / p(x'|x) ).
Junmo Kim EE 623: Information Theory
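The two-step recipe above translates directly into code. A Python sketch (illustrative; the two-state transition matrix below is an assumed example, not from the slides): find the stationary distribution as a left eigenvector, then average the row entropies.

    import numpy as np

    def entropy_rate(P):
        # Entropy rate (bits) of a homogeneous Markov chain with P[i, j] = p(j|i).
        # Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
        w, v = np.linalg.eig(P.T)
        pi = np.real(v[:, np.argmin(np.abs(w - 1))])
        pi = pi / pi.sum()
        with np.errstate(divide='ignore'):
            logP = np.where(P > 0, np.log2(P), 0.0)
        return -np.sum(pi[:, None] * P * logP)     # Σ_i π(i) H(row i)

    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])      # assumed two-state chain
    print(entropy_rate(P))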
Entropy Rate of Markov Chains: Example

Let's consider a time-invariant Markov chain with transition
probability p(j|i) = π(j).
How should X_1 be distributed to make this stationary?
In this singular example the distribution of X_2 is just π(·),
irrespective of the distribution of X_1.
For stationarity, X_1 and X_2 must be identically distributed.
Hence, X_1 ~ π(·).
Alternatively, by brute force:
Σ_i π(i) p(j|i) = Σ_i π(i) π(j) = π(j).
If p_{X_1}(·) = π(·), the entropy rate is
H(X_2|X_1) = Σ_i π(i) H(X_2|X_1 = i) = H(π(·)).
Irrespective of the distribution of X_1, lim_{n→∞} H(X_n|X_{n-1})
exists, so the entropy rate is as before.
Junmo Kim EE 623: Information Theory
Example: Random Walk on a Weighted Graph

W_ij ≥ 0, W_ij = W_ji (edge weights)
p(j|i) = W_ij / Σ_j W_ij
What is the stationary distribution?
The stationary distribution is
π_i = ( Σ_j W_ij ) / W,
where W is chosen to normalize the distribution:
W = Σ_i Σ_j W_ij.
Junmo Kim EE 623: Information Theory
Example: Random Walk on a Weighted Graph

Need to check Σ_i π_i p(j|i) = π_j.
Indeed,
Σ_i π_i p(j|i) = Σ_i ( (Σ_k W_ik) / W ) · ( W_ij / Σ_m W_im )
              = Σ_i W_ij / W
              = π_j,
because W_ij = W_ji.
Junmo Kim EE 623: Information Theory
Asymptotic Equipartition Property(AEP)

For chance variables X_1, ..., X_n,
E[ (1/n) log( 1 / p_{X_1,...,X_n}(X_1, ..., X_n) ) ] = (1/n) H(X_1, ..., X_n).
Example: X_1, ..., X_n i.i.d. according to P_X(x):
(1/n) log( 1 / p_{X_1,...,X_n}(X_1, ..., X_n) ) = -(1/n) Σ_{i=1}^n log p_X(X_i)
(an average of i.i.d. random variables)
→ E[-log p_X(X)] = H(X).
For i.i.d. chance variables,
Pr( | -(1/n) log P_{X_1,...,X_n}(X_1, ..., X_n) - H(X) | > ε ) → 0  for every ε > 0.
Definition: The stochastic process {X_k} satisfies the AEP if
Pr( | -(1/n) log P_{X_1,...,X_n}(X_1, ..., X_n)   (the normalized log-likelihood)
      - H({X_k}) | > ε ) → 0.
Junmo Kim EE 623: Information Theory
Typical Sequences

A_ε^(n) is the set of all length-n sequences α_1, ..., α_n such that
| -(1/n) log p_{X_1,...,X_n}(α_1, ..., α_n) - H({X_k}) | < ε.
α_1, ..., α_n is typical if (α_1, ..., α_n) ∈ A_ε^(n), i.e.
| -(1/n) log p_{X_1,...,X_n}(α_1, ..., α_n) - H({X_k}) | < ε.
Junmo Kim EE 623: Information Theory
Example

X_1, ..., X_n i.i.d. Ber(1/2) on {H, T}:
H(X) = log 2 = 1 bit
-(1/n) log P_{X_1,...,X_n}(X_1, ..., X_n) = -(1/n) log 2^{-n} = 1 :
all sequences are typical.

Bernoulli(p) with p = 0.11: H with probability 0.89, T with
probability 0.11.
H(X) ≈ 1/2 bit
The most likely sequence is HHHHHH..., but
-(1/n) log p_{X_1,...,X_n}(H, ..., H) = -log_2 0.89 ≈ 0.17 :
not typical?
Junmo Kim EE 623: Information Theory
Lecture 6
Junmo Kim EE 623: Information Theory
Review

A stationary Markov chain has entropy rate H(X_2|X_1).
H(X_2|X_1) is given by the stationary distribution and p(j|i).
Random walks on graphs
AEP
Today:
AEP
Source coding
Junmo Kim EE 623: Information Theory
AEP

Assume {X_k} has entropy rate H({X_k}). The typical set A_ε^(n)
is defined as
A_ε^(n) = { (α_1, ..., α_n) : | -(1/n) log p_{X_1,...,X_n}(α_1, ..., α_n) - H({X_k}) | < ε }
        = { (α_1, ..., α_n) : 2^{-n(H+ε)} ≤ p_{X_1,...,X_n}(α_1, ..., α_n) ≤ 2^{-n(H-ε)} }.
Typical sequences are almost equally likely.
We say that {X_k} satisfies the AEP if
∀ ε > 0,  Pr( (X_1, ..., X_n) ∈ A_ε^(n) ) → 1 as n → ∞,
where Pr( (X_1, ..., X_n) ∈ A_ε^(n) ) = Σ_{x ∈ A_ε^(n)} p_X(x).
Junmo Kim EE 623: Information Theory
IID Process Satises AEP
Theorem
If {X_k} is i.i.d., it satisfies the AEP, i.e. ∀ ε > 0,
Pr( (X_1, ..., X_n) ∈ A_ε^(n) ) → 1 as n → ∞.
Proof.
Pr( | -(1/n) log p_{X_1,...,X_n}(X_1, ..., X_n) - H(X) | < ε ) → 1 as n → ∞,
which comes from the weak law of large numbers.
Junmo Kim EE 623: Information Theory
Example

X_i i.i.d. Ber(1): Pr(X_i = H) = 1.
H({X_k}) = 0
A_ε^(n) = {(H, ..., H)}

Ber(1/2):
H({X_k}) = 1 bit
A_ε^(n) = {H, T}^n, i.e. all sequences are in A_ε^(n).

Ber(p), p = 0.11: Pr(H) = 0.11, Pr(T) = 0.89.
H({X_k}) ≈ 1/2 bit
Out of the 2^n sequences, most of the probability is carried by
sequences of likelihood ≈ 2^{-nH} = 2^{-n/2}.
Junmo Kim EE 623: Information Theory
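A Monte Carlo check of the i.i.d. AEP (an illustrative sketch; the parameters p, n, eps, trials are my choices): draw Bernoulli(0.11) sequences and estimate the probability that the normalized log-likelihood falls within ε of H.

    import math, random

    p, n, eps, trials = 0.11, 10_000, 0.05, 500
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # ≈ 0.50 bits

    def is_typical(seq):
        loglik = sum(math.log2(p) if s else math.log2(1 - p) for s in seq)
        return abs(-loglik / n - H) < eps

    hits = sum(is_typical([random.random() < p for _ in range(n)]) for _ in range(trials))
    print(hits / trials)   # close to 1 for large n: Pr(A_eps^(n)) -> 1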
Example Not Satisfying AEP

Consider the following stochastic process:
If U = 1 then X_1, ..., X_n are i.i.d. Ber(1/2).
If U = 2 then X_1, ..., X_n are i.i.d. Ber(1).
Pr(U = 1) = Pr(U = 2) = 1/2.
What is H({X_k})?
H(X_1, ..., X_n|U) ≤ H(X_1, ..., X_n) ≤ H(X_1, ..., X_n, U) = H(X_1, ..., X_n|U) + H(U),
with H(U) ≤ 1 bit.
Since H(X_1, ..., X_n|U) = (1/2) n H(1/2) + (1/2) · 0 = n/2,
lim_{n→∞} (1/n) H(X_1, ..., X_n) = lim_{n→∞} (1/n) H(X_1, ..., X_n|U) = 1/2.
Does this process satisfy the AEP? No.
For instance, take ε = 0.0001:
p_X(H, H, H, ..., H) ≥ 1/2 >> 2^{-n H({X_k})} = 2^{-n/2},
so the sequence H, ..., H is not typical, yet
Pr( (A_ε^(n))^C ) ≥ P_X(H, ..., H) ≥ 1/2.
Junmo Kim EE 623: Information Theory
Source Coding

Describe a source outcome using bits.
Let {X_k} be any source with finite alphabet, |A| < ∞.
Describe x_1, ..., x_n using bits.
Assume the source satisfies the AEP.
Given ε > 0:
If x_1, ..., x_n is not in A_ε^(n),
describe it by brute force with ⌈n log |A|⌉ bits.
If it is typical,
give the index of the sequence in A_ε^(n);
use ⌈log |A_ε^(n)|⌉ bits to describe the typical sequence.
Junmo Kim EE 623: Information Theory
Expected Length of the Source Code

Algorithm:
Look at x_1, ..., x_n.
If atypical, transmit 0 followed by ⌈n log |A|⌉ bits.
If typical, transmit 1 followed by ⌈log |A_ε^(n)|⌉ bits.
This is a fixed-to-variable code.
What is the expected length?
≤ Pr(A_ε^(n)) (1 + ⌈log |A_ε^(n)|⌉) + (1 - Pr(A_ε^(n))) (1 + ⌈n log |A|⌉)
As Pr(A_ε^(n)) > 1 - ε, the log |A_ε^(n)| term is dominant.
Junmo Kim EE 623: Information Theory
Source Coding: Size of Typical Set
Claim:
|A_ε^(n)| ≤ 2^{n(H+ε)}
Proof.
1 ≥ Pr(A_ε^(n)) = Σ_{(x_1,...,x_n) ∈ A_ε^(n)} p_X(x)
              ≥ Σ_{(x_1,...,x_n) ∈ A_ε^(n)} 2^{-n(H+ε)}
              = |A_ε^(n)| 2^{-n(H+ε)}
Junmo Kim EE 623: Information Theory
Expected Length of the Source Code

Expected length
≤ Pr(A_ε^(n)) (1 + ⌈log |A_ε^(n)|⌉) + (1 - Pr(A_ε^(n))) (1 + ⌈n log |A|⌉)
≤ Pr(A_ε^(n)) (2 + log 2^{n(H(X)+ε)}) + (1 - Pr(A_ε^(n))) (2 + n log |A|)
Normalized expected length:
(1/n) (expected length) ≤ H(X) + ε'
Junmo Kim EE 623: Information Theory
Source Coding Theorem
Theorem
If a source of entropy rate H satisfies the AEP, then for all ε > 0
and all sufficiently large n, we can find an n-to-variable code of
normalized expected length less than H + ε.
Issues:
We need to know the source distribution.
Computational complexity.
Junmo Kim EE 623: Information Theory
Source Coding Techniques

Given a chance variable X taking values in A, |A| < ∞, a code is
a mapping
C : A → {0, 1}*,
{0, 1}* = {∅, 0, 1, 00, 01, 10, 11, ...}.
A code is non-singular if the mapping is one-to-one, i.e.
x ≠ x' ⇒ c(x) ≠ c(x').
The extension of a code, C* : A* → {0, 1}*, is defined as
C* : (x_1, ..., x_n) ↦ c(x_1) c(x_2) ... c(x_n).
A code is uniquely decodable if its extension is non-singular.
Example: A ↦ 0, B ↦ 1, C ↦ 00, D ↦ 01
Is this singular? Is this uniquely decodable?
AA ↦ 00 and C ↦ 00, so it is not uniquely decodable.
Junmo Kim EE 623: Information Theory
Prefix Free Code

A code is prefix free if no codeword is a prefix of another.

Claim: Any prefix-free code is uniquely decodable.

Junmo Kim EE 623: Information Theory
Kraft's Inequality

Let l(x) denote the length of the string to which x is mapped
by the code C. If C is uniquely decodable then
Σ_{x∈A} 2^{-l(x)} ≤ 1.

Moreover, if the integers l_1, ..., l_{|A|} satisfy Σ 2^{-l_i} ≤ 1, then
there exists a uniquely decodable code with these lengths. In
fact, there exists a prefix-free code with these lengths.
Junmo Kim EE 623: Information Theory
Lecture 7
Junmo Kim EE 623: Information Theory
Source Coding Techniques

Given a chance variable X taking values in A, |A| < ∞, a code is
a mapping
C : A → {0, 1}*,
{0, 1}* = {∅, 0, 1, 00, 01, 10, 11, ...}.
A code is non-singular if the mapping is one-to-one, i.e.
x ≠ x' ⇒ c(x) ≠ c(x').
The extension of a code, C* : A* → {0, 1}*, is defined as
C* : (x_1, ..., x_n) ↦ c(x_1) c(x_2) ... c(x_n).
A code is uniquely decodable if its extension is non-singular.
Example: A ↦ 0, B ↦ 1, C ↦ 00, D ↦ 01
Is this singular? Is this uniquely decodable?
AA ↦ 00 and C ↦ 00, so it is not uniquely decodable.
Junmo Kim EE 623: Information Theory
Prefix Free Code

A code is prefix free if no codeword is a prefix of another.

Claim: Any prefix-free code is uniquely decodable.

Junmo Kim EE 623: Information Theory
Prefix Free Code is Uniquely Decodable

Given a binary string representing a finite source sequence:
Start reading the binary string until it forms a codeword,
and decode the first symbol.
Read the subsequent bits until they form a codeword, and
decode the second symbol.
Continue until there are no more binary symbols to process.
This reconstruction is the only one possible:
If there were an alternative reconstruction with a different
first symbol, the binary description of the first symbol we have
reconstructed would have to be a prefix of the description in the
alternative reconstruction (or vice versa).
Because the code is prefix free, all reconstructions must agree
on the first symbol.
The argument then extends to all symbols.
Junmo Kim EE 623: Information Theory
Prefix Free Codes and Binary Trees

There is a one-to-one correspondence (bijection) between prefix-free
codes and binary trees.
Consider a binary tree with |A| leaves, each of which has a
distinct label from A.
The leaves are nodes with no children, whereas the internal
nodes have one or two children.
To every leaf, there corresponds a unique path from the root
to that leaf node.
Such a path can be described by a sequence of left/right moves,
thus by a sequence of 0s and 1s.
This tree corresponds to a code: given a symbol x ∈ A, map
x to the binary string that represents the path from the root
node to the leaf node labeled x.
This code is prefix-free: if C(x) were a prefix of C(x') for x ≠ x',
the leaf labeled x' would be a descendant of the node labeled x,
so x would not be a leaf.
Junmo Kim EE 623: Information Theory
Prefix Free Codes and Binary Trees

Every binary tree with |A| leaves whose leaves are labeled
distinctly by the elements of A corresponds to a prefix-free code.
Reverse: to every prefix-free code, there corresponds a binary
tree with |A| leaves whose leaves are labeled distinctly by the
elements of A:
Start with a full binary tree and label by x the node you reach
by following the path corresponding to C(x).
Do this for every x ∈ A.
Trim all the descendants of labeled nodes, making them into
leaves, and trim all edges (and their descendants) that lead to
nodes that have no labeled descendants.
Junmo Kim EE 623: Information Theory
Kraft's Inequality

Let A = {1, ..., m}.
If C is a uniquely decodable code for describing X and i is
described using a codeword of length l_i, then
Σ 2^{-l_i} ≤ 1.
Given positive integers l_1, ..., l_m, if Σ 2^{-l_i} ≤ 1, then there
exists a uniquely decodable code with these lengths. In fact,
there exists a prefix-free code for X with lengths l_i.
Junmo Kim EE 623: Information Theory
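Both directions of Kraft's inequality can be exercised in a few lines of Python (an illustrative sketch, not from the slides; the converse construction assigns the lexicographically first free node at each depth, as described in the slides that follow).

    def kraft_sum(lengths):
        # Left-hand side of Kraft's inequality, Σ 2^{-l_i}.
        return sum(2.0 ** -l for l in lengths)

    def prefix_code(lengths):
        # Build a prefix-free code with the given lengths (assumes Kraft holds),
        # assigning the first free node at each depth; codewords are returned
        # in order of non-decreasing length.
        assert kraft_sum(lengths) <= 1 + 1e-12
        code, value, prev_len = [], 0, 0
        for l in sorted(lengths):
            value <<= (l - prev_len)          # descend to depth l
            code.append(format(value, '0{}b'.format(l)))
            value += 1                        # next free node at this depth
            prev_len = l
        return code

    print(kraft_sum([1, 2, 3, 3]))     # 1.0
    print(prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']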
Kraft's Inequality
We first show a weaker statement, namely: if C is prefix-free then
Σ 2^{-l_i} ≤ 1.
The number of leaves in the phantom full tree of depth l_max is 2^{l_max}.
A codeword of length l_i rules out 2^{l_max - l_i} phantom leaves.
Codeword i and codeword j rule out disjoint sets of phantom
leaves for i ≠ j:
Suppose there were a common phantom leaf ruled out by both.
Then there is a unique path from that phantom leaf to the root,
and both codeword i and codeword j would lie on the same path.
Hence one of codeword i and codeword j would be an ancestor of
the other. This violates the prefix-free condition.
Junmo Kim EE 623: Information Theory
Kraft's Inequality
The total number of phantom leaves ruled out is Σ 2^{l_max - l_i}.
We can't rule out more than 2^{l_max} leaves:
Σ 2^{l_max - l_i} ≤ 2^{l_max}  ⇒  Σ 2^{-l_i} ≤ 1.
Junmo Kim EE 623: Information Theory
Kraft's Inequality
Alternative proof:
Consider a full tree of height l_max. This tree satisfies Kraft
with equality:
Σ 2^{-l_i} = Σ 2^{-l_max} = 1.
If we throw out two sibling leaves (making their parent a leaf),
Σ 2^{-l_i} remains the same:
we subtracted 2 · 2^{-l_max} and added 2^{-(l_max - 1)}.
If we increase some l_i by adding a single child node, Σ 2^{-l_i}
only decreases.
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Converse
Given positive integers l_1, ..., l_m, if Σ 2^{-l_i} ≤ 1, then we can
construct a prefix-free code with lengths l_i as follows.
Order the lengths l_1 ≤ l_2 ≤ ... ≤ l_m.
Label the first node (lexicographically) of depth l_1 as
codeword 1, and remove its descendants from the tree, as
they can't be codewords.
Then label the first remaining node of depth l_2 as codeword 2
and remove its descendants, etc. Proceeding this way, we
construct a prefix-free code with the specified lengths l_1, l_2, ..., l_m.
How can we know that there is a remaining node of depth l_2,
and so on?
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Converse
Assume we have succeeded in assigning codewords of lengths l_1, ..., l_i.
Consider the full tree of depth l_{i+1}, which has 2^{l_{i+1}} leaves.
If i < m, the number of removed nodes of depth l_{i+1} is less than
the total number of depth-l_{i+1} nodes, as we have
Σ_{j=1}^i 2^{l_{i+1} - l_j} < 2^{l_{i+1}}.
Thus there is a remaining node of depth l_{i+1}.
If i = m, we are done.
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Stronger Statement
If C is a uniquely decodable code for describing A, and i is
described using a codeword of length l_i, then
Σ 2^{-l_i} ≤ 1.
Look at describing x = (x_1, ..., x_n) (n-tuples from the source):
l(x) = Σ_{i=1}^n l(x_i)
( Σ_{x∈A} 2^{-l(x)} )^n = Σ_{x∈A^n} 2^{-l(x)} = Σ_{m=1}^{n l_max} a(m) 2^{-m} ≤ n l_max
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Stronger Statement
( Σ_{x∈A} 2^{-l(x)} )^n = ( Σ_{x_1∈A} 2^{-l(x_1)} )( Σ_{x_2∈A} 2^{-l(x_2)} ) ... ( Σ_{x_n∈A} 2^{-l(x_n)} )
  = Σ_{x_1∈A} Σ_{x_2∈A} ... Σ_{x_n∈A} 2^{-l(x_1)} 2^{-l(x_2)} ... 2^{-l(x_n)}
  = Σ_{(x_1,x_2,...,x_n)∈A^n} 2^{-(l(x_1)+l(x_2)+...+l(x_n))}
  = Σ_{x∈A^n} 2^{-l(x)}
  = Σ_{m=1}^{n l_max} a(m) 2^{-m}
  ≤ n l_max,
where a(m) is the number of n-tuples x that have description length m,
i.e. l(x) = m. We have a(m) ≤ 2^m because the code is uniquely decodable.
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Stronger Statement
Thus we have
Σ_{x∈A} 2^{-l(x)} ≤ (n l_max)^{1/n}.
As n is arbitrary, we can take n to infinity:
Σ_{x∈A} 2^{-l(x)} ≤ lim_{n→∞} (n l_max)^{1/n} = 1,
since lim_{n→∞} (log n + log l_max)/n = 0.
Junmo Kim EE 623: Information Theory
Lecture 8
Junmo Kim EE 623: Information Theory
Today

Minimum Expected Length

Wrong Probability

Huffman codes
Junmo Kim EE 623: Information Theory
Criterion for Short Description

X takes values in A according to p_X(x).
Expected length:
L = Σ p_X(x) l(x)
Do I agree to optimize this quantity? Yes, by the law of large
numbers: over many descriptions the total length concentrates
around its expectation.
With this criterion, minimize Σ_{x∈A} p_X(x) l(x) over all lengths
l(x) that a uniquely decodable code could have.
Using Kraft, we can formulate the optimization problem as:
minimize Σ p_X(x) l(x)
subject to l(x) being integers satisfying Σ 2^{-l(x)} ≤ 1.
L* = min_{l(x) integer, Σ 2^{-l(x)} ≤ 1} Σ p_X(x) l(x)
Junmo Kim EE 623: Information Theory
Minimum Expected Length
L* = min_{l(x) integer, Σ 2^{-l(x)} ≤ 1} Σ p_X(x) l(x)
Theorem
H(X) ≤ L* < H(X) + 1
Junmo Kim EE 623: Information Theory
Minimum Expected Length

We first prove L* ≥ H(X). Compare three quantities:
1. L* = min_{l(x) integer, Σ 2^{-l(x)} ≤ 1} Σ p_X(x) l(x)
2. L̄ = min_{Σ 2^{-l(x)} ≤ 1} Σ p_X(x) l(x)   (integer constraint dropped)
3. L̄ = min_{Σ 2^{-l(x)} = 1} Σ p_X(x) l(x)
1 & 2: If we ignore the integer constraint, the minimum value
can only decrease, so L̄ ≤ L*.
2 & 3: We can replace the constraint Σ 2^{-l(x)} ≤ 1 by
Σ 2^{-l(x)} = 1, as the minimum occurs only when Σ 2^{-l(x)} = 1:
if Σ 2^{-l(x)} < 1, we can further decrease some l(x), making
Σ p_X(x) l(x) smaller.
We will consider the third quantity and show that L̄ = H(X).
Junmo Kim EE 623: Information Theory
Minimum Expected Length

To find L̄ = min_{Σ 2^{-l(x)} = 1} Σ p_X(x) l(x), use a Lagrange
multiplier:
J = Σ p_X(x) l(x) + λ ( Σ 2^{-l(x)} - 1 ).
We have |A| + 1 variables:
J = Σ_{i=1}^{|A|} p_i l_i + λ ( Σ 2^{-l_i} - 1 )
∂J/∂l_i = p_i - λ (ln 2) 2^{-l_i} = 0   ⇒   2^{-l_i} = p_i / (λ ln 2)
1 = Σ 2^{-l_i} = Σ p_i / (λ ln 2) = 1 / (λ ln 2)
⇒ 2^{-l_i} = p_i,   l_i = log_2(1/p_i)
Junmo Kim EE 623: Information Theory
Minimum Expected Length

Hypothetical solution:
2^{-l_i} = p_i,   l_i = log_2(1/p_i)
Σ p_i l_i = Σ p_i log_2(1/p_i) = H(X)
Verify it: for arbitrary l_i such that Σ 2^{-l_i} = 1,
Σ p_i l_i - H(X) = Σ p_i l_i + Σ p_i log p_i
                 = Σ p_i log p_i - Σ p_i log_2 2^{-l_i}
                 = Σ p_i log( p_i / 2^{-l_i} )
                 ≥ 0   (a relative entropy is non-negative).
Hence L̄ = min_{Σ 2^{-l(x)} = 1} Σ p_X(x) l(x) = H(X).
Junmo Kim EE 623: Information Theory
Minimum Expected Length

We now prove L* < H(X) + 1.
Look at l_i = ⌈log_2(1/p_i)⌉ ≥ log_2(1/p_i).
These ⌈l_i⌉ satisfy Kraft:
Σ 2^{-⌈log_2(1/p_i)⌉} ≤ Σ 2^{-log_2(1/p_i)} = Σ p_i = 1,
so there exists a prefix-free code with lengths ⌈l_i⌉.
Since L* is the minimum expected length, we have
L* ≤ Σ p_i ⌈log_2(1/p_i)⌉
   < Σ p_i ( log_2(1/p_i) + 1 )
   = H(X) + 1.
Junmo Kim EE 623: Information Theory
n-to-variable Code


l_i = ⌈log_2(1/p_i)⌉ can be ridiculous:
Pr(H) = 0.999 ⇒ l_H = 1
Pr(T) = 0.001 ⇒ l_T = 10
We can do better with an n-to-variable code on
(x_1, ..., x_n) ∈ A^n.
Using the previous theorem, we have
H(X_1, ..., X_n) ≤ L* < H(X_1, ..., X_n) + 1.
Dividing by n gives
(1/n) H(X_1, ..., X_n) ≤ (1/n) L* < (1/n) H(X_1, ..., X_n) + 1/n,
where (1/n) H(X_1, ..., X_n) → H({X_k}) and 1/n → 0.
Junmo Kim EE 623: Information Theory
Wrong Probability

True PMF: P_X(x), x ∈ A.
I think the PMF is Q(x).
I design a code with lengths ⌈log(1/Q(x))⌉.
The expected description length is
Σ_{x∈A} p_X(x) l(x) = Σ p_X(x) ⌈log(1/Q(x))⌉
                    < Σ p_X(x) ( log(1/Q(x)) + 1 )
                    = Σ p_X(x) log( p_X(x)/Q(x) ) - Σ p_X(x) log p_X(x) + 1
                    = D(p_X||Q) + H(X) + 1.
D(p_X||Q) is the price to pay for the mismatch of the
probability.
Junmo Kim EE 623: Information Theory
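The mismatch penalty is easy to see numerically. A Python sketch (illustrative; the PMFs are my example): code the 7/4-bit source with lengths ⌈log 1/Q(x)⌉ for a wrong (uniform) Q and compare against H(p) + D(p||Q) + 1.

    import math

    def expected_length(p, q):
        # Expected length when coding source p with lengths ceil(log2 1/q(x)).
        return sum(px * math.ceil(math.log2(1 / qx)) for px, qx in zip(p, q))

    def H(p):
        return -sum(x * math.log2(x) for x in p if x > 0)

    def D(p, q):
        return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

    p = [1/2, 1/4, 1/8, 1/8]      # true PMF
    q = [1/4, 1/4, 1/4, 1/4]      # assumed (wrong) PMF
    print(expected_length(p, q))               # 2.0
    print(H(p), D(p, q), H(p) + D(p, q) + 1)   # 1.75, 0.25, 3.0 (upper bound)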
Huffman's Procedure

We want to solve the integer optimization problem
min_{l(x) integer, Σ 2^{-l(x)} ≤ 1} Σ p_X(x) l(x).
Huffman developed an algorithm for generating an optimal
single-symbol-to-variable code in 1950, as a term paper in Bob Fano's
information theory class at MIT.
This optimal coding problem had eluded many people,
including Shannon and Fano, for several years.
Examples:
p_1 = 0.6, p_2 = 0.4 : with two letters, the optimal codeword
lengths are 1 and 1.
p_1 = 0.6, p_2 = 0.3, p_3 = 0.1 : with three letters, the optimal
lengths are 1, 2, 2. The least likely letters have length 2.
Junmo Kim EE 623: Information Theory
Huffman's Procedure

Assume p_1 ≥ p_2 ≥ p_3 ≥ ... ≥ p_m.
Lemma
Optimal codes have the property that if p_i > p_j, then l_i ≤ l_j.
Proof.
Assume to the contrary that a code has p_i > p_j and l_i > l_j. Then
L = p_i l_i + p_j l_j + other terms.
If we interchange l_i and l_j, L is decreased, as we have
L_new = p_i l_j + p_j l_i + other terms,
L - L_new = p_i (l_i - l_j) - p_j (l_i - l_j) = (p_i - p_j)(l_i - l_j) > 0.
Thus L_new < L, which contradicts the optimality of the code.
Junmo Kim EE 623: Information Theory
Huffman's Procedure

Assume p_1 ≥ p_2 ≥ p_3 ≥ ... ≥ p_m.
Claim:
There is no loss of generality in looking only at codes where
c(m-1) and c(m) are siblings.
Proof.
Let l_max be the depth of an optimal tree. Then there are at least
two codewords at depth l_max, and they can be taken to be siblings:
if a codeword at depth l_max did not have a sibling, that codeword
could be reduced in length. If the two sibling codewords are not the
least likely codewords, then by swapping codewords we obtain another
optimal code with p_{m-1}, p_m at depth l_max.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Junmo Kim EE 623: Information Theory
Huffman's Procedure

We have the following recursion:
L*(p_1, ..., p_m) = L*(p_1, ..., p_{m-2}, p_{m-1} + p_m) + p_{m-1} + p_m
It suffices to look only at codes/trees where c(m-1) and c(m)
are siblings. For any such tree with m leaves, we can construct a
reduced tree with m-1 leaves as follows: the leaves c(m-1)
and c(m) are removed, converting their parent node into a leaf
with probability p_{m-1} + p_m. The new tree represents a code for a
new chance variable with PMF (p_1, ..., p_{m-2}, p_{m-1} + p_m).
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Now compare the expected lengths of the original code and the
new code:
Σ_{i=1}^m p_i l_i = Σ_{i=1}^{m-2} p_i l_i + p_{m-1} l_max + p_m l_max
                 = Σ_{i=1}^{m-2} p_i l_i + (p_{m-1} + p_m)(l_max - 1) + p_{m-1} + p_m
                 = (expected length of the reduced code/tree) + p_{m-1} + p_m
Hence if we minimize the expected length of the reduced code (tree),
with minimum value L*(p_1, ..., p_{m-2}, p_{m-1} + p_m), we also
minimize the expected length of the original code, with minimum
value L*(p_1, ..., p_{m-2}, p_{m-1} + p_m) + p_{m-1} + p_m, which
proves the recursion formula.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Example: (p_1, p_2, p_3, p_4, p_5) = (0.4, 0.2, 0.15, 0.15, 0.1)
Junmo Kim EE 623: Information Theory
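A minimal Python sketch of the merge rule described above (illustrative only; it tracks codeword lengths rather than building the tree), run on the example PMF of this slide.

    import heapq

    def huffman_lengths(pmf):
        # Optimal codeword lengths via Huffman's rule: repeatedly merge
        # the two least likely (super-)symbols.
        heap = [(p, [i]) for i, p in enumerate(pmf)]
        heapq.heapify(heap)
        lengths = [0] * len(pmf)
        while len(heap) > 1:
            p1, ids1 = heapq.heappop(heap)   # two least likely groups
            p2, ids2 = heapq.heappop(heap)
            for i in ids1 + ids2:            # each merge adds one bit to their codewords
                lengths[i] += 1
            heapq.heappush(heap, (p1 + p2, ids1 + ids2))
        return lengths

    pmf = [0.4, 0.2, 0.15, 0.15, 0.1]
    L = huffman_lengths(pmf)
    print(L)                                      # [1, 3, 3, 3, 3]
    print(sum(p * l for p, l in zip(pmf, L)))     # expected length 2.2 bits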
Huffman's Procedure
Example:
Junmo Kim EE 623: Information Theory
Lecture 9
Junmo Kim EE 623: Information Theory
Review

D(P||Q) : relative entropy
H(P), H(X), H_P(X) : entropy
I(X; Y) = D(p_{X,Y} || p_X p_Y)
A convex combination of two PMFs gives a valid PMF:
(λ p_1 + λ̄ p_2)(x) = λ p_1(x) + λ̄ p_2(x),   λ + λ̄ = 1.
H_p(X) is concave in p_X:
H(λ p_1 + λ̄ p_2) ≥ λ H(p_1) + λ̄ H(p_2)
Junmo Kim EE 623: Information Theory
Review: Jensen's Inequality

If f''(x) < 0 for all x, f is strictly concave.
f(x) = ln x for x > 0:
f''(x) = -1/x^2 < 0 : strictly concave.
Theorem
If f is concave then for any random variable X,
f(E[X]) ≥ E[f(X)].
If f is strictly concave,
f(E[X]) = E[f(X)]  ⟺  X is deterministic.
Junmo Kim EE 623: Information Theory
Convexity of Relative Entropy

Claim: D(P||Q) is convex in the pair (P, Q).
Given (P_1, Q_1) and (P_2, Q_2),
D(λ P_1 + λ̄ P_2 || λ Q_1 + λ̄ Q_2) ≤ λ D(P_1||Q_1) + λ̄ D(P_2||Q_2).
Lemma: Log-sum inequality
If a_1, ..., a_n ≥ 0 and b_1, ..., b_n ≥ 0, we have
( Σ a_i ) log( Σ a_i / Σ b_i ) ≤ Σ_i a_i log( a_i / b_i ),
with equality iff a_i = c b_i for some constant c, for all i.
How to remember it? The log-sum inequality leads to D(a||b) ≥ 0.
Proof: t log t is strictly convex + Jensen's inequality.
Junmo Kim EE 623: Information Theory
Log-Sum Inequality
Junmo Kim EE 623: Information Theory
Log-Sum Inequality and Relative Entropy
Junmo Kim EE 623: Information Theory
Convexity of Relative Entropy
( Σ a_i ) log( Σ a_i / Σ b_i ) ≤ Σ_i a_i log( a_i / b_i )
Proof.
D(λ P_1 + λ̄ P_2 || λ Q_1 + λ̄ Q_2)
  = Σ_x ( λ P_1(x) + λ̄ P_2(x) ) log( (λ P_1(x) + λ̄ P_2(x)) / (λ Q_1(x) + λ̄ Q_2(x)) )
  ≤ Σ_x [ λ P_1(x) log( λ P_1(x) / (λ Q_1(x)) ) + λ̄ P_2(x) log( λ̄ P_2(x) / (λ̄ Q_2(x)) ) ]
    (log-sum inequality applied term by term, with a_1 = λ P_1(x), a_2 = λ̄ P_2(x), etc.)
  = λ D(P_1||Q_1) + λ̄ D(P_2||Q_2)
Junmo Kim EE 623: Information Theory
Recall: Entropy is Concave
Consider two PMFs and their convex combination:
p^(1) = (p^(1)_1, ..., p^(1)_{|A|})
p^(2) = (p^(2)_1, ..., p^(2)_{|A|})
λ p^(1) + λ̄ p^(2) = (λ p^(1)_1 + λ̄ p^(2)_1, ..., λ p^(1)_{|A|} + λ̄ p^(2)_{|A|})
Theorem
H(λ p^(1) + λ̄ p^(2)) ≥ λ H(p^(1)) + λ̄ H(p^(2))
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information

Notations:
I(X; Y) is determined by p_{X,Y}(x, y), which in turn is given by
p_X(x) p_{Y|X}(y|x).
We can describe {p_X(x) : x ∈ A} by a row vector P_X, whose
x-th element is p_X(x).
Let's denote the x-th element of P_X by P_X(x), so that
P_X(x) = p_X(x). Then we can define H(P_X) for the row vector P_X as
H(P_X) = -Σ_x P_X(x) log P_X(x),
which is -Σ_x p_X(x) log p_X(x) = H(X).
We can describe {p_{Y|X}(y|x) : x ∈ A, y ∈ B} by a matrix W,
whose (x, y) element is p_{Y|X}(y|x).
Let's denote the (x, y) element of the matrix W by W(y|x), i.e.
W(y|x) = p_{Y|X}(y|x). Then we have
p_{X,Y}(x, y) = P_X(x) W(y|x).
Similarly, let's describe {p_Y(y) : y ∈ B} by a row vector P_Y.
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Theorem
I(X; Y) = I(P_X, W) is concave in P_X for fixed W and convex in
W for fixed P_X.
Proof, Part 1: I is concave in P_X for fixed W.
I(X; Y) = H(Y) - H(Y|X)
        = H(Y)   (concave in P_X)
          - Σ P_X(x) H(Y|X = x)   (each H(Y|X = x) is fixed, so this term is linear in P_X)
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Let's show that H(Y) is concave in P_X. First note that
P_Y = P_X W :
P_X W is a row vector whose y-th element is
Σ_x P_X(x) W(y|x) = Σ_x p_{X,Y}(x, y) = p_Y(y).
Thus H(Y) = H(P_Y) = H(P_X W).
Since entropy is concave, for P^(1)_X, P^(2)_X we have
H( (λ P^(1)_X + λ̄ P^(2)_X) W ) = H( λ P^(1)_X W + λ̄ P^(2)_X W )
                               ≥ λ H(P^(1)_X W) + λ̄ H(P^(2)_X W).
Hence H(Y) = H(P_X W) is concave in P_X.
Therefore I(P_X, W) = H(P_X W) - Σ P_X(x) H(Y|X = x) is
concave in P_X for fixed W.
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Part 2: For fixed P_X, I(X; Y) is convex in W.
Proof: As I(X; Y) = D( p_{X,Y}(x, y) || p_X(x) p_Y(y) ), we need to show
that for two transition matrices W^(1) and W^(2),
D( P_X(x) (λ W^(1) + λ̄ W^(2))(y|x) || P_X(x) (P_X (λ W^(1) + λ̄ W^(2)))(y) )
  ≤ λ D( P_X(x) W^(1)(y|x) || P_X(x) (P_X W^(1))(y) )
    + λ̄ D( P_X(x) W^(2)(y|x) || P_X(x) (P_X W^(2))(y) ).
The above inequality comes from the convexity of D(p||q):
D(λ P_1 + λ̄ P_2 || λ Q_1 + λ̄ Q_2) ≤ λ D(P_1||Q_1) + λ̄ D(P_2||Q_2),
where P_1 = P_X(x) W^(1)(y|x), Q_1 = P_X(x) (P_X W^(1))(y),
P_2 = P_X(x) W^(2)(y|x), and Q_2 = P_X(x) (P_X W^(2))(y).
Thus I is convex in W for fixed P_X.
Junmo Kim EE 623: Information Theory
Channels
Definition
We define a discrete channel to be a system consisting of an input
alphabet A, an output alphabet B, and a transition
probability p(y|x) (or transition matrix W(y|x)).
For an input sequence x ∈ A^n, the output y is a random
n-tuple from B^n.
To specify a channel, we need transition probabilities (for all
input sequences and output sequences):
{ W^(n)(y|x) : x ∈ A^n, y ∈ B^n },  n = 1, 2, ...
If the channel is memoryless and used without feedback, we have
W^(n)(y|x) = Π_{k=1}^n W(y_k|x_k)
for some stochastic matrix W.
Junmo Kim EE 623: Information Theory
Binary Symmetric Channel

A = B = {0, 1}
C = max_{P_X} I(X; Y)   (mutual information computed under the joint PMF P_X(x) W(y|x))
Compute C for a BSC(p):
I(X; Y) = H(Y) - H(Y|X)
        = H(Y) - Σ_x P(X = x) H(W(·|X = x))
        = H(Y) - H_b(p)
        ≤ log 2 - H_b(p),
where H_b(p) = p log(1/p) + (1-p) log(1/(1-p)).
If p_X(0) = p_X(1) = 1/2, then H(Y) = log 2, and equality holds.
Thus C = log 2 - H_b(p), and it is achieved by P_X uniform on
{0, 1}.
Junmo Kim EE 623: Information Theory
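A numerical check of the BSC capacity (an illustrative sketch; the helper function and the value p = 0.11 are my choices): evaluate I(X;Y) at the uniform input and compare with 1 - H_b(p).

    import numpy as np

    def mutual_information(p_x, W):
        # I(X;Y) in bits for input distribution p_x and channel matrix W[x, y] = W(y|x).
        p_xy = p_x[:, None] * W
        p_y = p_xy.sum(axis=0)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] *
                            np.log2(p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask])))

    p = 0.11
    W = np.array([[1 - p, p], [p, 1 - p]])    # BSC(p)
    Hb = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    print(mutual_information(np.array([0.5, 0.5]), W))   # = 1 - H_b(p)
    print(1 - Hb)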
Binary Symmetric Channel

If p = 0, the channel is noiseless and C = 1 bit/channel use.
If p = 1/2, C = 0.
In this case Y is independent of X: tell your friend to toss a coin.
If P_X is uniform and fixed, I(X; Y) = 1 - H_b(p) is
convex in W, i.e. convex in p.
For a fixed p, I is concave in P_X.
E.g. suppose p = 0; then I(X; Y) = H(Y) - H(Y|X) = H(Y),
which is concave.
Junmo Kim EE 623: Information Theory
Lecture 10
Junmo Kim EE 623: Information Theory
Today:

Examples of computing C

Kuhn-Tucker condition

Geometry of channel capacity


Junmo Kim EE 623: Information Theory
Binary Symmetric Channel

A = B = {0, 1}
C = max_{P_X} I(X; Y)   (mutual information computed under the joint PMF P_X(x) W(y|x))
Compute C for a BSC(p):
I(X; Y) = H(Y) - H(Y|X)
        = H(Y) - Σ_x P(X = x) H(W(·|X = x))
        = H(Y) - H_b(p)
        ≤ log 2 - H_b(p),
where H_b(p) = p log(1/p) + (1-p) log(1/(1-p)).
If p_X(0) = p_X(1) = 1/2, then H(Y) = log 2, and equality holds.
Thus C = log 2 - H_b(p), and it is achieved by P_X uniform.
Junmo Kim EE 623: Information Theory
Binary Symmetric Channel

If p = 0, the channel is noiseless and C = 1 bit/channel use.
If p = 1/2, C = 0.
In this case Y is independent of X: tell your friend to toss a coin.
If P_X is uniform and fixed, I(X; Y) = 1 - H_b(p) is
convex in W, i.e. convex in p.
For a fixed p, I is concave in P_X.
E.g. suppose p = 0; then I(X; Y) = H(Y) - H(Y|X) = H(Y),
which is concave.
Junmo Kim EE 623: Information Theory
Weakly Symmetric Channel
1. The rows of W(y|x) are permutations of each other.
2. All the columns have the same sum.
Claim: For a weakly symmetric channel,
C = log |B| - H(of a row).
I = H(Y) - H(Y|X)   (H(Y|X = x) does not depend on x and equals H(of a row))
  ≤ log |B| - H(of a row)
Note: For a uniform input distribution, Y is uniformly distributed
(by the 2nd property). Thus the equality can be achieved.
Junmo Kim EE 623: Information Theory
Erasure Channel

Consider the binary erasure channel: each input bit is received
correctly with probability 1 - α and erased (output e) with
probability α.
Maximizing I(X; Y) over p_X(x) is an |A|-dimensional
optimization problem.
Is this channel weakly symmetric?
Our approach: guess & verify.
Junmo Kim EE 623: Information Theory
Erasure Channel
I(X; Y) = H(Y) - H(Y|X),  where H(Y|X) = H_b(α).
Guess the P_X that maximizes H(Y). Will it be [1/2  1/2]?
Note that P^(1)_X = [β  1-β] and P^(2)_X = [1-β  β] give
the same H(Y).
As H(Y) is concave in P_X,
H([β  1-β] W) = (1/2) H([β  1-β] W) + (1/2) H([1-β  β] W)
              ≤ H( ( (1/2)[β  1-β] + (1/2)[1-β  β] ) W )
              = H([1/2  1/2] W).
Thus H(Y) is maximized when P_X = [1/2  1/2].
Junmo Kim EE 623: Information Theory
Erasure Channel
C = H( (1/2)(1-α), α, (1/2)(1-α) ) - H_b(α)
  = -2 · (1/2)(1-α) log( (1/2)(1-α) ) - α log α - H_b(α)
  = -(1-α) log( (1/2)(1-α) ) - α log α - H_b(α)
  = 1 - α bits
Junmo Kim EE 623: Information Theory
Computing the Channel Capacity
C = max_{p_X(x)} I(X; Y)
Maximize a concave function over the simplex. (For fixed p_{Y|X}
(the channel), I is concave in p_X.)
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex

Let f(α) be concave on the simplex
{ α : 0 ≤ α_k ≤ 1, Σ_{k=1}^m α_k = 1 }.
Assume the partial derivatives ∂f/∂α_k are defined and continuous
over the simplex, with the possible exception that
lim_{α_k → 0} ∂f/∂α_k may be +∞.
A necessary and sufficient condition for α to achieve the
maximum of f(α) is
∂f/∂α_k = λ,  ∀ k : α_k > 0
∂f/∂α_k ≤ λ,  ∀ k : α_k = 0
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Necessary condition: if f is maximized at α, the two conditions are
satisfied.
Proof:
Suppose α maximizes f(α).
If we perturb α by increasing α_k and decreasing α_{k'} (provided that
α_{k'} > 0) by δ > 0, f(α) cannot increase. This requires that
∂f/∂α_k - ∂f/∂α_{k'} ≤ 0.
Similarly, if we increase α_{k'} and decrease α_k (provided that
α_k > 0) by δ > 0, f(α) cannot increase:
∂f/∂α_{k'} - ∂f/∂α_k ≤ 0.
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Thus we have
∂f/∂α_k = ∂f/∂α_{k'}  for all k, k' such that α_k > 0, α_{k'} > 0,
which implies the first condition:
∂f/∂α_k = λ,  ∀ k : α_k > 0.
If α_k = 0, we cannot decrease α_k, so we only need
∂f/∂α_k - λ ≤ 0,
which implies the second condition:
∂f/∂α_k ≤ λ,  ∀ k : α_k = 0.
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Sufficient condition: if the two conditions are satisfied, f is
maximized.
Proof:
The two conditions guarantee that α is a local maximum.
Suppose the two conditions are satisfied at α.
Then if we perturb α to α + δ(ε_1, ε_2, ..., ε_m) along an
arbitrary feasible direction (ε_1, ε_2, ..., ε_m) such that Σ ε_k = 0
and ε_k ≥ 0 if α_k = 0, f does not increase, as follows:
Δf ≈ δ Σ_k (∂f/∂α_k) ε_k
   = δ [ Σ_{k: α_k>0} (∂f/∂α_k) ε_k + Σ_{k: α_k=0} (∂f/∂α_k) ε_k ]
   ≤ δ λ [ Σ_{k: α_k>0} ε_k + Σ_{k: α_k=0} ε_k ]   (since ∂f/∂α_k ≤ λ and ε_k ≥ 0 when α_k = 0)
   = δ λ Σ_k ε_k = 0.
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Now the question is whether it is a global maximum.
Suppose, to the contrary, that there is a better point α* such that
f(α*) > f(α).
The concavity of f indicates that f is above the straight line
connecting the two points:
f( (1-θ)α + θα* ) ≥ (1-θ) f(α) + θ f(α*) > f(α)   for all 0 < θ < 1.
This means that perturbing α in the direction of α* increases f(α),
which contradicts the fact that f does not increase under a local
perturbation.
Junmo Kim EE 623: Information Theory
Computing the Channel Capacity

Apply the theorem with f being the mutual information:
∂I/∂P_X(x) = λ,  ∀ x : P_X(x) > 0
∂I/∂P_X(x) ≤ λ,  ∀ x : P_X(x) = 0
One can show that
∂I/∂P_X(x) = D( W(·|x) || (P_X W)(·) ) - 1.
For the optimal input distribution P_X, I(X; Y) = C, and we have
(with λ' = λ + 1)
D( W(·|x) || (P_X W)(·) ) = λ',  ∀ x : P_X(x) > 0     (*)
D( W(·|x) || (P_X W)(·) ) ≤ λ',  ∀ x : P_X(x) = 0
Junmo Kim EE 623: Information Theory
Computing the Channel Capacity

Multiply (*) by P_X(x) and sum over all x such that P_X(x) > 0:
Σ_x P_X(x) Σ_y W(y|x) log( (W(y|x) P_X(x)) / ((P_X W)(y) P_X(x)) )
  = Σ_{x,y} P_{X,Y}(x, y) log( P_{X,Y}(x, y) / (P_X(x) P_Y(y)) )
  = I(X; Y) = C   (since P_X is the optimal distribution).
Thus we have C = λ'.
Junmo Kim EE 623: Information Theory
Computing the Channel Capacity
Theorem
1. If for some input distribution P_X(x) we have
     D(W(·|x) ‖ (P_X W)(·)) = λ   ∀x : P_X(x) > 0
     D(W(·|x) ‖ (P_X W)(·)) ≤ λ   ∀x : P_X(x) = 0
   then P_X(x) achieves capacity C and C = λ.
2. If P_X(x) achieves the capacity, then the above holds with λ = C.
Usage: guess & check
Junmo Kim EE 623: Information Theory
Kuhn-Tucker Condition
P_X achieves capacity if and only if
  D(W(·|x) ‖ (P_X W)(·)) = C   ∀x : P_X(x) > 0
  D(W(·|x) ‖ (P_X W)(·)) ≤ C   ∀x : P_X(x) = 0
How to compute C:
1. Symmetry
2. Concavity argument (e.g. erasure channel, P_X = [1/2 1/2])
3. Guess & verify using the Kuhn-Tucker condition
Junmo Kim EE 623: Information Theory
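As a complement to the guess-and-check recipe, here is a small Python sketch (my own, not from the notes) that takes a candidate input distribution and a channel matrix W and checks the Kuhn-Tucker condition numerically: the divergences D(W(·|x) ‖ P_X W) must be equal on the support of P_X and no larger off it. The binary symmetric channel with crossover 0.1 is an assumed test case.

```python
import numpy as np

def kuhn_tucker_check(W, px, tol=1e-9):
    """Check the Kuhn-Tucker condition for a candidate input distribution.

    W[x, y] is the channel transition matrix, px the candidate P_X.
    Returns (C, ok): the common divergence value on the support of px and
    whether D(W(.|x) || P_X W) <= C also holds off the support.
    """
    py = px @ W                                   # output distribution (P_X W)(y)
    # D(W(.|x) || P_X W) for every input x, in bits
    div = np.array([sum(W[x, y] * np.log2(W[x, y] / py[y])
                        for y in range(W.shape[1]) if W[x, y] > 0)
                    for x in range(W.shape[0])])
    support = px > 0
    C = div[support].mean()
    ok = (np.all(np.abs(div[support] - C) < tol) and
          np.all(div[~support] <= C + tol))
    return C, ok

# Binary symmetric channel with crossover 0.1: guess the uniform input.
eps = 0.1
W = np.array([[1 - eps, eps], [eps, 1 - eps]])
print(kuhn_tucker_check(W, np.array([0.5, 0.5])))  # C = 1 - H_b(0.1) ~ 0.531, True
```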
Lecture 11
Junmo Kim EE 623: Information Theory
Review: Kuhn-Tucker Condition
Theorem
1. If for some input distribution P
X
(x) we have
D(W([x)|(P
X
W)()) = x, P
X
(x) > 0
x, P
X
(x) = 0
then P
X
(x) achieves capacity C and C = .
2. If P
X
(x) achieves the capacity then the following holds with
= C.
D(W([x)|(P
X
W)()) = x, P
X
(x) > 0
x, P
X
(x) = 0
Usage: guess & check
Junmo Kim EE 623: Information Theory
Review: Kuhn-Tucker Condition
P
X
achieves capacity if and only if
D(W([x)|(P
X
W)()) = C x, P
X
(x) > 0
C x, P
X
(x) = 0
How to compute C
1. Symmetry
2. Concavity argument (e.g. erasure channel [
1
2
1
2
] )
3. Guess & verify using Kuhn-Tucker condition
Junmo Kim EE 623: Information Theory
Example: Kuhn-Tucker Condition (Problem 7.13)
(The channel is a BSC with erasures: input 0 or 1 is received correctly with probability 1 − α − ε, erased with probability α, and flipped with probability ε.)

Guess P_X = [1/2 1/2] and verify that it satisfies the Kuhn-Tucker conditions:
  D(W(·|x = 0) ‖ (P_X W)(·)) = D(W(·|x = 1) ‖ (P_X W)(·)).

P_X W = [ (1−α)/2   α   (1−α)/2 ]

D(W(·|x = 0) ‖ (P_X W)(·)) = D( [1−α−ε  α  ε] ‖ [ (1−α)/2  α  (1−α)/2 ] )

D(W(·|x = 1) ‖ (P_X W)(·)) = D( [ε  α  1−α−ε] ‖ [ (1−α)/2  α  (1−α)/2 ] )

These are equal by symmetry, and
  C = D(W(·|x = 0) ‖ (P_X W)(·)) = (1−α) + H(α, 1−α) − H(1−α−ε, α, ε).
Junmo Kim EE 623: Information Theory
Today

Dene an achievable rate

Prove a converse
1. Fanos inequality
2. Data processing inequality
Junmo Kim EE 623: Information Theory
Notations and Definitions

W: message
W takes values in M (assume W is uniformly distributed), M = {1, 2, ..., M}, M = |M|.
Encoder: f : M → X^n
Received signal: Y^n ∈ Y^n
Decoder: φ : Y^n → M ∪ {0}
n: block length
For a DMC with transition probability W(·|·), if the input is x_1, ..., x_n, the output is y_1, ..., y_n with probability Π_{k=1}^n W(y_k|x_k).
Probability of error:
  λ_i = Σ_{y : φ(y) ≠ i} W(y|f(i)),   where W(y|f(i)) = Π_{k=1}^n W(y_k|x_k(i))
  and x_k(i) is the kth component of f(i).
Junmo Kim EE 623: Information Theory
Notations and Definitions

Maximal probability of error: λ^(n) = max_i λ_i
Arithmetic average probability of error: P_e^(n) = (1/M) Σ_{i=1}^M λ_i

Definition
A rate R is achievable if for any ε > 0 and all sufficiently large block lengths n there exists a code (f, φ) of rate R with λ^(n) < ε.
The rate of a code is (log M)/n.
Junmo Kim EE 623: Information Theory
Converse
Theorem
If R is achievable over the DMC W, then R ≤ max_{P_X} I(P_X; W).
Junmo Kim EE 623: Information Theory
Markov Chains
X → Y → Z if p_{Z|X,Y}(z|x, y) = p_{Z|Y}(z|y).
The following are all equivalent:
1. X → Y → Z
2. Z → Y → X
3. X and Z are independent given Y:  p(x, z|y) = p(x|y) p(z|y)

Equivalence of 1 & 3:
  p(x, y, z) = p(y) p(x, z|y) = p(y) p(x|y) p(z|y) = p(x, y) p(z|y),
while in general p(x, y, z) = p(x, y) p(z|x, y); comparing the two gives p(z|x, y) = p(z|y).
Junmo Kim EE 623: Information Theory
Venn Diagram
(Theorem 2.2.1) H(X, Y) = H(X) + H(Y|X)
(Corollary) H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
(Definition) I(X; Y|Z) = H(X|Z) − H(X|Y, Z)
(Theorem 2.5.2) I(X, Y; Z) = I(X; Z) + I(Y; Z|X)
Junmo Kim EE 623: Information Theory
Data Processing Inequality
Theorem
If X → Y → Z, we have I(X; Z) ≤ I(Y; Z). (Also I(X; Y) ≥ I(X; Z).)
Proof.
I(X, Y; Z) = I(X; Z) + I(Y; Z|X)
           = I(Y; Z) + I(X; Z|Y),   and I(X; Z|Y) = 0 by the Markov property.
Hence I(Y; Z) = I(X; Z) + I(Y; Z|X) ≥ I(X; Z).
Junmo Kim EE 623: Information Theory
Fano's Inequality

We want to guess the value of a random variable X ∈ X based on an observation Y, where X and Y are correlated.
Estimate of X:  X̂ = g(Y).
Probability of error:  P_e = Pr(X̂ ≠ X).
We expect: the higher H(X|Y), the higher the probability of error.
Fano's inequality is
  H(X|Y) ≤ H_b(P_e) + P_e log(|X| − 1).
Fano's inequality gives a lower bound on the probability of error in terms of the conditional entropy H(X|Y).
Equivalently, it gives an upper bound on H(X|Y) in terms of the probability of error. (Used in the converse.)
Deterministic case: H(X|Y) = 0 iff X is a function of Y.
Junmo Kim EE 623: Information Theory
Fano's Inequality
H(X|Y) ≤ H(X|X̂) ≤ Pr(X ≠ X̂) log(|X| − 1) + H_b(Pr(X ≠ X̂))
Proof: Define
  E = 1 if X ≠ X̂,  E = 0 if X = X̂.
Expanding H(X, E|X̂) in two ways:
  H(X, E|X̂) = H(X|X̂) + H(E|X, X̂),   and H(E|X, X̂) = 0
  H(X, E|X̂) = H(E|X̂) + H(X|E, X̂)
Hence
  H(X|X̂) = H(E|X̂) + H(X|E, X̂)
          ≤ H(E) + H(X|E, X̂)
          = H_b(Pr(X ≠ X̂)) + H(X|E, X̂)
Junmo Kim EE 623: Information Theory
Fano's Inequality
H(X|E, X̂) = H(X|X̂, E = 0) Pr(E = 0) + H(X|X̂, E = 1) Pr(E = 1)
           ≤ 0 + log(|X| − 1) Pr(X ≠ X̂),
since given E = 0 we have X = X̂, and given E = 1, X can take at most |X| − 1 values.
This gives us a bounding technique.
H(X|Y) ≤ H(X|X̂) comes from the data processing inequality I(X; Y) ≥ I(X; X̂).
Junmo Kim EE 623: Information Theory
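A quick numeric sanity check of the bound (my own toy joint PMF, not from the notes): compute H(X|Y), the error probability of the MAP estimator X̂ = g(Y), and verify H(X|Y) ≤ H_b(P_e) + P_e log(|X| − 1).

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# A small joint PMF p(x, y) with |X| = 3, |Y| = 2 (illustrative numbers).
pxy = np.array([[0.30, 0.10],
                [0.10, 0.20],
                [0.05, 0.25]])
py = pxy.sum(axis=0)
H_X_given_Y = sum(py[y] * H(pxy[:, y] / py[y]) for y in range(2))

# MAP estimator: guess the most likely x for each y.
Pe = 1.0 - sum(pxy[:, y].max() for y in range(2))
bound = H(np.array([Pe, 1 - Pe])) + Pe * np.log2(3 - 1)
print(H_X_given_Y, bound, H_X_given_Y <= bound)   # ~1.37 <= ~1.44: bound holds
```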
DMC and Mutual Information
Lemma
Let X take values in X^n according to some law P_X(x) and let Y be distributed according to p_{Y|X}(y|x) = Π_{k=1}^n W(y_k|x_k) for some W(·|·). Then I(X; Y) ≤ nC, where C = max_{P_X} I(P_X; W).
Proof:
I(X; Y) = H(Y) − H(Y|X)
        = H(Y) − Σ_i H(Y_i | X, Y^{i−1})
        = H(Y) − Σ_i H(Y_i | X_i)          (memoryless)
        ≤ Σ_i ( H(Y_i) − H(Y_i | X_i) )
        = Σ_i I(X_i; Y_i) ≤ nC
Junmo Kim EE 623: Information Theory
Proof of Converse
Note that we have the following Markov chain:
  W → X = f(W) → Y^n → Ŵ = φ(Y).
H(W) = nR   (since H(W) = log|M| and R = log|M| / n)

nR = H(W)
   = H(W|Ŵ) + I(W; Ŵ)
   ≤ H(W|Ŵ) + I(f(W); Ŵ)
   ≤ H(W|Ŵ) + I(X; Y)
   ≤ H(W|Ŵ) + nC
   ≤ Pr(W ≠ Ŵ) log|M| + H_b(Pr(W ≠ Ŵ)) + nC
   = P_e^(n) nR + H_b(P_e^(n)) + nC
Junmo Kim EE 623: Information Theory
Proof of Converse
where
  Pr(W ≠ Ŵ) = Σ_i Pr(W = i) Pr(W ≠ Ŵ | W = i) = Σ_i (1/M) λ_i = P_e^(n).
For all n, we have
  R ≤ P_e^(n) R + (1/n) H_b(P_e^(n)) + C
    ≤ P_e^(n) R + 1/n + C.
Let n → ∞: R ≤ C if P_e^(n) → 0.
Junmo Kim EE 623: Information Theory
Lecture 12
Junmo Kim EE 623: Information Theory
Fanos Inequality

We want to guess the value of a random variable X A


based on observation Y, where X and Y are correlated.

Estimate of X :

X = g(Y).

Probability of error P
e
= Pr (

X ,= X).

We expect: the higher H(X[Y), the higher the probability of


error.

Fanos inequality is
H(X[Y) H
b
(P
e
) + P
e
log([A[ 1)

Fanos inequality gives a lower bound on the probability of


error in terms of the conditional entropy H(X[Y).

It gives an upper bound on H(X[Y) in terms of the


probability of error. (used in converse)

Deterministic case : H(X[Y) = 0 i X is a function of Y.


Junmo Kim EE 623: Information Theory
Fanos Inequality
H(X[Y) H(X[

X) Pr (X ,=

X) log([A[ 1) + H
b
(Pr (X ,=

X))
Proof: Dene
E =
_
1 X ,=

X
0 X =

X
H(X, E[

X) = H(X[

X) + H(E[X,

X)
. .
0
H(X, E[

X) = H(E[

X) + H(X[E,

X)
H(X[

X) = H(E[

X) + H(X[E,

X)
H(E) + H(X[E,

X)
= H
b
(Pr (X ,=

X)) + H(X[E,

X)
Junmo Kim EE 623: Information Theory
Fanos Inequality
H(X[E,

X) = H(X[

X, E = 0)
. .
0
Pr (E = 0) + H(X[

X, E = 1)Pr (E = 1)
log([A[ 1)Pr (X ,=

X)
This gives us a bounding technique.
H(X[Y) H(X[

X) comes from the data processing inequality


I (X; Y) I (X;

X).
Junmo Kim EE 623: Information Theory
Notations and Denitions

Message W takes values in / (Assume W is uniformly


distributed.)
/= 1, 2, ..., M, M = [/[

n : block length

An encoder f : /A
n
(f (m) = x
n
(m)) yields codewords
x
n
(1), x
n
(2), . . . , x
n
(M). The set of codewords is called the
codebook, which is denoted by ( .

Decoder :
n
/ 0

e.g. ML decoder :

W = (y
n
) = arg max
m
W(y
n
[x
n
(m)).
Junmo Kim EE 623: Information Theory
Notations and Denitions

Probability of error

i
= Pr ((Y
n
) ,= i [X
n
= x
n
(i )) =

y:(y),=i
W(y[f (i ))
. .

n
k=1
W(y
k
[x
k
(i ))

x
k
(i ) : kth component of f (i ) = x
n
(i ).

Maximal probability of error

(n)
= max
i

Arithmetic average probability of error


P
(n)
e
=
1
M
M

i =1

i
Junmo Kim EE 623: Information Theory
Channel Coding Theorem

R =
log M
n
: rate in bits / channel use.

W Unif 1, . . . , 2
nR

I
P
X
(x)W(y[x)
(X; Y) = I (P
X
; W)

C = max
P
X
I (P
X
; W)
Denition
R is achievable, if > 0, n
0
s.t. n n
0
, encoder of rate R &
block length n and a decoder with maximal probability of error

(n)
< .
Theorem
If R < C then R is achievable.
Theorem
Converse: If R is achievable, then R C.
Junmo Kim EE 623: Information Theory
DMC and Mutual Information
Lemma
Let X take value in A
n
according to some law P
X
(x) and let Y be
distributed according to p
Y[X
=

n
k=1
W(y
k
[x
k
) for some W([)
then I (X; Y) nC where C = max
P
X
I (P
X
; W).
Proof:
I (X; Y) = H(Y) H(Y[X)
= H(Y)

H(Y
i
[X, Y
i 1
)
= H(Y)

H(Y
i
[X
i
) memoryless

(H(Y
i
) H(Y
i
[X
i
))
=

I (X
i
; Y
i
) nC
Junmo Kim EE 623: Information Theory
Proof of Converse
Note that we have the following Markov chain
W X = f (W) Y
n


W = (Y)
H(W) = nR H(W) = log [/[, R =
log [/[
n
= H(W[

W) + I (W;

W)
H(W[

W) + I (f (W);

W)
H(W[

W) + I (X; Y)
H(W[

W) + nC
Pr (W ,= W) log [/[ + H
b
(Pr (W ,=

W)) + nC
= P
(n)
e
nR + H
b
(P
(n)
e
) + nC
Junmo Kim EE 623: Information Theory
Proof of Converse
where
Pr (W ,=

W) =

i
Pr (W = i )Pr (W ,=

W[W = i )
=

i
1
M

i
= P
(n)
e
For all n, we have
R P
(n)
e
R +
1
n
H
b
(P
(n)
e
) + C
P
(n)
e
R +
1
n
+ C
Let n . R C if P
(n)
e
0.
Junmo Kim EE 623: Information Theory
Channel Coding Theorem

R =
log M
n
: rate in bits / channel use.

W Unif 1, . . . , 2
nR

I
P
X
(x)W(y[x)
(X; Y) = I (P
X
; W)

C = max
P
X
I (P
X
; W)
Denition
R is achievable, if > 0, n
0
s.t. n n
0
, encoder of rate R &
block length n and a decoder with maximal probability of error

(n)
< .
Theorem
If R < C then R is achievable.
Theorem
Converse: If R is achievable, then R C.
Junmo Kim EE 623: Information Theory
Joint Typicality
Given some joint distribution p_{X,Y}(x, y) on X × Y, some ε > 0 and a natural number n, define A_ε^(n)(p_{X,Y}) as the set of pairs (x, y) such that
  | −(1/n) log p_X(x) − H_{p_X}(X) | < ε,
  | −(1/n) log p_Y(y) − H_{p_Y}(Y) | < ε,
  | −(1/n) log p_{X,Y}(x, y) − H_{p_{X,Y}}(X, Y) | < ε,
where p_{X,Y} is a PMF on X × Y and p_{X,Y}(x, y) = Π_i p_{X,Y}(x_i, y_i).

  |A_ε^(n)(p_{X,Y})| ≤ 2^{n ( H_{p_{X,Y}}(X,Y) + ε )}
Junmo Kim EE 623: Information Theory
Joint Typicality
Lemma
Suppose X̃, Ỹ are drawn independently according to the law p_X(x) p_Y(y), i.e. (X̃_k, Ỹ_k) i.i.d. ~ p_X(x) p_Y(y). Then
  Pr[ (X̃, Ỹ) ∈ A_ε^(n)(p_{X,Y}) ] ≤ 2^{−n ( I_{p_{X,Y}}(X;Y) − 3ε )}.
Proof:
Pr[ (X̃, Ỹ) ∈ A_ε^(n)(p_{X,Y}) ] = Σ_{(x,y) ∈ A_ε^(n)} p_X(x) p_Y(y)
  ≤ Σ_{(x,y) ∈ A_ε^(n)} 2^{−n (H_{p_X}(X) − ε)} 2^{−n (H_{p_Y}(Y) − ε)}
  = |A_ε^(n)| 2^{−n (H_{p_X}(X) − ε)} 2^{−n (H_{p_Y}(Y) − ε)}
  ≤ 2^{n (H_{p_{X,Y}}(X,Y) + ε)} 2^{−n (H_{p_X}(X) − ε)} 2^{−n (H_{p_Y}(Y) − ε)}
  = 2^{−n ( I_{p_{X,Y}}(X;Y) − 3ε )}
Junmo Kim EE 623: Information Theory
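The exponent in the lemma can be checked exactly for a small alphabet by summing over joint types (the probability itself is far too small to estimate by naive Monte Carlo). The sketch below is my own; the joint PMF, n and ε are illustrative choices. It enumerates all joint types of pair sequences of length n, keeps the typical ones, and compares the resulting probability under independent sampling with the bound 2^{−n(I − 3ε)}.

```python
import numpy as np
from itertools import product
from math import comb

# Joint pmf p_{X,Y} on {0,1}x{0,1} (illustrative numbers, not from the notes).
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
px, py = p.sum(axis=1), p.sum(axis=0)
Hx = -(px * np.log2(px)).sum()
Hy = -(py * np.log2(py)).sum()
Hxy = -(p * np.log2(p)).sum()
I = Hx + Hy - Hxy

n, eps = 60, 0.05
prob = 0.0
# A pair sequence is a sequence over the 4 symbols (x,y); typicality depends
# only on its joint type (n00, n01, n10, n11), so we sum over joint types.
for n00, n01, n10 in product(range(n + 1), repeat=3):
    n11 = n - n00 - n01 - n10
    if n11 < 0:
        continue
    cnt = np.array([[n00, n01], [n10, n11]])
    nx, ny = cnt.sum(axis=1), cnt.sum(axis=0)
    ex = -(nx * np.log2(px)).sum() / n          # -(1/n) log2 p_X(x) for this type
    ey = -(ny * np.log2(py)).sum() / n
    exy = -(cnt * np.log2(p)).sum() / n
    if abs(ex - Hx) < eps and abs(ey - Hy) < eps and abs(exy - Hxy) < eps:
        # number of pair sequences with this joint type ...
        n_seq = comb(n, n00) * comb(n - n00, n01) * comb(n - n00 - n01, n10)
        # ... times the probability of each one when X~ and Y~ are independent
        prob += n_seq * float(np.prod(px ** nx) * np.prod(py ** ny))

print(prob, 2.0 ** (-n * (I - 3 * eps)))   # prob sits below the bound 2^{-n(I-3eps)}
```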
Joint Typicality
Lemma
Suppose

X,

Y are drawn independently according to the law
p
X
(x)p
Y
(y), i.e. (

X
k
,

Y
k
)
i .i .d.
p
X
(x)p
Y
(y). Then
Pr [(

X,

Y) A
(n)

(p
X,Y
)] 2
n(I
p
X,Y
(X;Y)3)
Lemma
Suppose I draw (X
k
, Y
k
)
i .i .d.
p
X,Y
(x, y). Then
Pr ((X, Y) A
(n)

(p
X,Y
)) 1 as n .
Proof: Law of large numbers.
Junmo Kim EE 623: Information Theory
Joint Typicality Decoder
  φ(y; p_{X,Y}, ε, n, C) = m
if (x(m), y) ∈ A_ε^(n)(p_{X,Y}) and for no other m' ≠ m is (x(m'), y) ∈ A_ε^(n)(p_{X,Y});
otherwise φ(y; p_{X,Y}, ε, n, C) = 0.
Junmo Kim EE 623: Information Theory
Proof Sketch
1. Fix some p
X
2. Fix some > 0, n
3. Generate a random codebook (
IID p
X
.
4. Reveal ( to encoder and receiver.
5. Design a joint typicality decoder
(; p
X
(x)W(y[x), , n, ()
6. Encodr m x(m) (according to codebook)
7. Each codebook ( gives P
(n)
e
(().
8. Analyze E[P
(n)
e
(()]. Average over (.
9. Will show that if R < I (P
X
; W) then E[P
(n)
e
(()] 0 as
n .
10. By random coding argument, there exists deterministic
sequence (
n
s.t. P
(n)
e
((
n
) 0
11. Trick to get
(n)
to go to zero.
Junmo Kim EE 623: Information Theory
Proof Sketch
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

p
X
, I (P
X
; W) is achievable. C is achievable

Step 1-6 : dene random codebook

Observation : E[
17
] = E[
5
]
Claim 1: E
(
[
m
] does not depend on m ( symmetry), which
implies
E[P
(n)
e
((, ( ))] = E[
1
]
Assume then W = 1 and compute E[
1
].
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

Step 8:
E[
1
] =

(
Pr (()
1
((, ) = Pr (Error [W = 1)
Error only if (X(1), Y) is not in A
(n)

or m ,= 1 s.t. (X(m), Y) A
(n)

.
Pr [(X(1), Y) is not in A
(n)

(, n, P
X
(x)W(y[x))]

n 0

Let E
i
be the event that (X(i ), Y) A
(n)

(, n, P
X
(x)W(y[x))
E[
1
] Pr (E
C
1

2
nR
i =2
E
i
)
Pr (E
C
1
)
. .
0
+
2
nR

i =2
Pr (E
i
)
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

Probability of a wrong codeword X(2) being jointly typical


with Y is small
Pr (E
2
) = Pr ((X(2), Y) A
(n)

)
2
n(I
P
X
(x)W(y|x)
(X;Y)3)

X(2) has distribution P


X
and Y has distribution (P
X
W)(y),
and they are independent.

Thus

2
nR
i =2
Pr (E
i
) (2
nR
1)2
n(I (X;Y)3)
2
n(I (X;Y)3R)
.

If R < I (X; Y) 3, (2
nR
1)2
n(I (X;Y)3)
0 as n .
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

Step 11: Trick to get


(n)
to go to zero.

Let (
n
satisfy P
(n)
e
((
n
, ()) < .

If we sort the probability of error


i
as follows,

m(1)

m(2)

Best half of the codewords have error probability


i
less than
2.

m(1)

m(2)

m(M/2)
2
Otherwise,

M
i =1

m(i )

M
i =M/2+1

m(i )

M
2
2 and
P
(n)
e
((
n
, ()) =
1
M

M
i =1

m(i )
.

We can take that half of my codewords so that the maximal


probability error is less than 2.

(n)
2.

The new rate is


log M/2
n
=
log M
n

1
n
.
Junmo Kim EE 623: Information Theory
Lecture 13
Junmo Kim EE 623: Information Theory
today

Channel coding theorem

Source-channel separation

Feedback communication
Junmo Kim EE 623: Information Theory
Channel Coding Theorem

R =
log M
n
: rate in bits / channel use.

W Unif 1, . . . , 2
nR

I
P
X
(x)W(y[x)
(X; Y) = I (P
X
; W)

C = max
P
X
I (P
X
; W)
Denition
R is achievable, if > 0, n
0
s.t. n n
0
, encoder of rate R &
block length n and a decoder with maximal probability of error

(n)
< .
Theorem
If R < C then R is achievable.
Theorem
Converse: If R is achievable, then R C.
Junmo Kim EE 623: Information Theory
Joint Typicality
Given some joint distribution p
X,Y
(x, y) on A and given some
> 0 and a natural number n, dene A
(n)

(p
X,Y
) as
A
(n)

(p
X,Y
) = (x, y) : [
1
n
log p
X
(x) H
p
X
(X)[ <
[
1
n
log p
Y
(y) H
p
Y
(Y)[ <
[
1
n
log p
X,Y
(x, y) H
p
X,Y
(X, Y)[ <
where p
X,Y
is a PMF on A and p
X,Y
(x, y) =

p
X,Y
(x
i
, y
i
).

[A
(n)

(p
X,Y
)[ 2
n(H
p
X,Y
(X,Y)+)
Junmo Kim EE 623: Information Theory
Joint Typicality
Lemma
Suppose

X,

Y are drawn independently according to the law
p
X
(x)p
Y
(y), i.e. (

X
k
,

Y
k
)
i .i .d.
p
X
(x)p
Y
(y). Then
Pr [(

X,

Y) A
(n)

(p
X,Y
)] 2
n(I
p
X,Y
(X;Y)3)
Lemma
Suppose I draw (X
k
, Y
k
)
i .i .d.
p
X,Y
(x, y). Then
Pr ((X, Y) A
(n)

(p
X,Y
)) 1 as n .
Proof: Law of large numbers.
Junmo Kim EE 623: Information Theory
Joint Typicality

Roughly, X̃ takes one of 2^{nH(X)} values with high probability.
Independently, Ỹ takes one of 2^{nH(Y)} values with high probability.
So (X̃, Ỹ) takes one of 2^{nH(X)} · 2^{nH(Y)} values with high probability.
But only about 2^{nH(X,Y)} pairs are jointly typical.
Hence the probability that (X̃, Ỹ) is jointly typical is about
  2^{nH(X,Y)} / ( 2^{nH(X)} 2^{nH(Y)} ) = 2^{−n I(X;Y)}.
Junmo Kim EE 623: Information Theory
Joint Typicality Decoder
(y; p
X,Y
, , n, () = m
if (x(m), y) A
(n)

(p
X,Y
)&
for no other m
t
,= m, (x(m
t
), y) A
(n)

(p
X,Y
)
otherwise (y; p
X,Y
, , n, () = 0
Junmo Kim EE 623: Information Theory
Proof Sketch
1. Fix some p
X
2. Fix some > 0, n
3. Generate a random codebook (
IID p
X
.
4. Reveal ( to encoder and receiver.
5. Design a joint typicality decoder
(; p
X
(x)W(y[x), , n, ()
6. Encodr m x(m) (according to codebook)
7. Each codebook ( gives P
(n)
e
(().
8. Analyze E[P
(n)
e
(()]. Average over (.
9. Will show that if R < I (P
X
; W) then E[P
(n)
e
(()] 0 as
n .
10. By random coding argument, there exists deterministic
sequence (
n
s.t. P
(n)
e
((
n
) 0
11. Trick to get
(n)
to go to zero.
Junmo Kim EE 623: Information Theory
Proof Sketch
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

p
X
, I (P
X
; W) is achievable. C is achievable

Step 1-6 : dene random codebook

Observation :
17
,=
5
but E
(
[
17
] = E
(
[
5
]
Claim 1: E
(
[
m
] does not depend on m ( symmetry), which
implies E[P
(n)
e
((, ( ))] = E[
1
M

M
i =1

m
((, )] =
1
M

M
i =1
E[
m
((, )] = E[
1
]
Assume then W = 1 and compute E[
1
].
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
E[P
(n)
e
((, ( ))] = E[
1
M
M

i =1

m
((, )] = E[
1
]
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

Step 8:
E[
1
] =

(
Pr (()
1
((, ) = Pr (Error [W = 1)
Error only if (X(1), Y) is not in A
(n)

or m ,= 1 s.t. (X(m), Y) A
(n)

.
Pr [(X(1), Y) is not in A
(n)

(, n, P
X
(x)W(y[x))]

n 0

Let E
i
be the event that (X(i ), Y) A
(n)

(, n, P
X
(x)W(y[x))
E[
1
] Pr (E
C
1

2
nR
i =2
E
i
)
Pr (E
C
1
)
. .
0
+
2
nR

i =2
Pr (E
i
)
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

Probability of a wrong codeword X(2) being jointly typical


with Y is small
Pr (E
2
) = Pr ((X(2), Y) A
(n)

)
2
n(I
P
X
(x)W(y|x)
(X;Y)3)

X(2) has distribution P


X
and Y has distribution (P
X
W)(y),
and they are independent.

Thus

2
nR
i =2
Pr (E
i
) (2
nR
1)2
n(I (X;Y)3)
2
n(I (X;Y)3R)
.

If R < I (X; Y) 3, (2
nR
1)2
n(I (X;Y)3)
0 as n .
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
Junmo Kim EE 623: Information Theory
Proof Sketch
1. Fix some p
X
2. Fix some > 0, n
3. Generate a random codebook (
IID p
X
.
4. Reveal ( to encoder and receiver.
5. Design a joint typicality decoder
(; p
X
(x)W(y[x), , n, ()
6. Encodr m x(m) (according to codebook)
7. Each codebook ( gives P
(n)
e
(().
8. Analyze E[P
(n)
e
(()]. Average over (.
9. Will show that if R < I (P
X
; W) then E[P
(n)
e
(()] 0 as
n .
10. By random coding argument, there exists deterministic
sequence (
n
s.t. P
(n)
e
((
n
) 0
11. Trick to get
(n)
to go to zero.
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem

Step 11: Trick to get


(n)
to go to zero.

Let (
n
satisfy P
(n)
e
((
n
, ()) < .

If we sort the probability of error


i
as follows,

m(1)

m(2)

m(M)

Best half of the codewords have error probability


i
less than
2.

m(1)

m(2)

m(M/2)
2
Otherwise,

M
i =1

m(i )

M
i =M/2+1

m(i )

M
2
2 and
P
(n)
e
((
n
, ()) =
1
M

M
i =1

m(i )
.

We can take that half of my codewords so that the maximal


probability error is less than 2.

(n)
2.

The new rate is


log M/2
n
=
log M
n

1
n
.
Junmo Kim EE 623: Information Theory
Source-Channel Separation

Let V
k
be a source of entropy rate H(V
k
).

Let W(y[x) be some DMC of capacity C.

Which is better ? Joint source-channel coding vs. separate


coding.

The theorems says that separate coding is as good as joint


coding.
Junmo Kim EE 623: Information Theory
Source-Channel Separation
Theorem
If H(V
k
) < C, then for any > 0, suciently large n, there
exists a mapping F : 1
n
A
n
and a mapping :
n
1
n
s.t.
Pr ((Y
1
, . . . , Y
n
) ,= (V
1
, . . . , V
n
)) < .

v
1
,...,v
n
P
V
(v)

y
1
,...,y
n
:(y),=v
W(y[F(v)) <
Junmo Kim EE 623: Information Theory
Converse
Theorem
If H(V
k
) > C for a stationary process V
k
, for any sequences
F
n
: 1
n
A
n
,
n
:
n
1
n
, lim
n
Pr (
n
(Y) ,= V)) > 0.
Proof.
H(V
k
)
1
n
H(V
1
, . . . , V
n
)
=
1
n
H(V
1
, . . . , V
n
[

V
1
, . . . ,

V
n
) +
1
n
I (V
1
, . . . , V
n
;

V
1
, . . . ,

V
n
)

1
n
[1 + P
(n)
e
n log [1[ +
1
n
I (X
1
, . . . , X
n
; Y
1
, . . . , Y
n
)]

1
n
+ P
(n)
e
log [1[ + C C as n 0
Junmo Kim EE 623: Information Theory
Source-Channel Separation

If a source produces a symbol every


1

s
sec:
s
source symbol
sec
,

and has entropy rate H(V


k
)
bit
source symbol
, the entropy in
bits
sec
is H
s
.

Channel has
c
ch use
sec
and capacity C
bits
ch use
, its capacity in
bits
sec
is C
c
.

The condition for reliable transmission of the source is


H
s
< C
c
.
Junmo Kim EE 623: Information Theory
Feedback Communication

Encoder is a sequence of mapping f


i
: J
i 1
A

C
FB
is the feedback capacity, obviously C
FB
C ( you can
ignore feedback if you like )

In fact, C
FB
= C.
Junmo Kim EE 623: Information Theory
Feedback Communication
When there is feedback, the following lemma is no longer true.
Lemma
Let X take value in A
n
according to some law P
X
(x) and let Y be
distributed according to p
Y[X
=

n
k=1
W(y
k
[x
k
) for some W([)
then I (X; Y) nC where C = max
P
X
I (P
X
; W).
Proof:
I (X; Y) = H(Y) H(Y[X)
= H(Y)

H(Y
i
[X, Y
i 1
)
= H(Y)

H(Y
i
[X
i
) memoryless

(H(Y
i
) H(Y
i
[X
i
))
=

I (X
i
; Y
i
) nC
Junmo Kim EE 623: Information Theory
Feedback Communication
nR = H(W)
= H(W[

W) + I (W;

W)
1 + P
(n)
e
nR + I (W;

W)
1 + P
(n)
e
nR + I (W; Y)
= 1 + P
(n)
e
nR + H(Y
n
)

H(Y
i
[W, Y
i 1
)
= 1 + P
(n)
e
nR + H(Y
n
)

H(Y
i
[W, Y
i 1
, X
i
)
(X
i
is function of (W, Y
i 1
))
= 1 + P
(n)
e
nR + H(Y
n
)

H(Y
i
[X
i
) (memoryless ch)
1 + P
(n)
e
nR +

(H(Y
i
) H(Y
i
[X
i
))
1 + P
(n)
e
nR + nC
Junmo Kim EE 623: Information Theory
Feedback Communication
Example: Binary erasure channel

C_FB = C = 1 − α.

If there is feedback, channel coding can be simpler:
Send k bits using n channel uses.
If ? is received, retransmit the bit.
An error occurs iff #? > n − k.
The probability of error is
  Pr(#? > n − k) = Pr( #?/n > 1 − k/n ) = Pr( #?/n > 1 − R ).
Junmo Kim EE 623: Information Theory
Feedback Communication
Example: Binary erasure channel

Suppose R < C = 1 − α.
Let δ > 0 be small enough so that R + δ < 1 − α, i.e. 1 − R > α + δ.
Given some ε' > 0 (very small), choose n large enough so that
  Pr( #?/n > α + δ ) < ε'.
Then
  Pr(error) = Pr( #?/n > 1 − R )
            ≤ Pr( #?/n > α + δ )   (∵ 1 − R > α + δ)
            < ε'.
Junmo Kim EE 623: Information Theory
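A tiny simulation of this retransmission scheme (my own sketch; alpha, n and R are assumed values with R < 1 − alpha) confirms that the block error probability Pr(#?/n > 1 − R) is essentially zero once n is moderately large.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, R = 0.3, 2000, 0.6          # R < 1 - alpha = 0.7
k = int(n * R)

def one_block():
    """Feedback scheme over a BEC: retransmit each bit until it gets through.

    The block fails iff the number of erasures among the n channel uses
    exceeds n - k, i.e. (#?)/n > 1 - R.
    """
    erasures = rng.random(n) < alpha
    return erasures.sum() > n - k      # True = block decoding failure

trials = 2000
p_err = np.mean([one_block() for _ in range(trials)])
print(p_err)   # ~0, since (#?)/n concentrates around alpha = 0.3 < 1 - R = 0.4
```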
Lecture 14
Junmo Kim EE 623: Information Theory
Announcement

Midterm exam at 10:35 am on Thursday Oct. 22 in room 201


and 202.

Student id : 20070000 20093500 @ room 201

Student id : 20093501 @ room 202

No class on Tuesday Oct. 20.

PS 7 due on Tuesday Oct. 27.


Junmo Kim EE 623: Information Theory
Source-Channel Separation

Let V
k
be a source of entropy rate H(V
k
).

Let W(y[x) be some DMC of capacity C.

Which is better ? Joint source-channel coding vs. separate


coding.

For the two-stage source and channel coding, reliable


communication i H < C.

Can we do better by combining source and channel coding ?


(i.e., can we achieve reliable communication when C < H ?)

The theorems says that separate coding is as good as joint


coding.
Junmo Kim EE 623: Information Theory
Source-Channel Separation
Theorem
If H(V
k
) < C, then for any > 0, suciently large n, there
exists a mapping F : 1
n
A
n
and a mapping :
n
1
n
s.t.
Pr ((Y
1
, . . . , Y
n
) ,= (V
1
, . . . , V
n
)) < .

v
1
,...,v
n
P
V
(v)

y
1
,...,y
n
:(y),=v
W(y[F(v)) <
Junmo Kim EE 623: Information Theory
Converse
Theorem:
If H(V
k
) > C for a stationary process V
k
, for any sequences
F
n
: 1
n
A
n
,
n
:
n
1
n
, lim
n
Pr (
n
(Y) ,= V)) > 0.
H(V
k
)
1
n
H(V
1
, . . . , V
n
)
( a
i
b
n
=
1
n

a
i
, lim
n
b
n
b
m
, m)
=
1
n
H(V
1
, . . . , V
n
[

V
1
, . . . ,

V
n
) +
1
n
I (V
1
, . . . , V
n
;

V
1
, . . . ,

V
n
)

1
n
[1 + P
(n)
e
n log [1[ +
1
n
I (X
1
, . . . , X
n
; Y
1
, . . . , Y
n
)]

1
n
+ P
(n)
e
log [1[ + C C as n 0
Junmo Kim EE 623: Information Theory
Source-Channel Separation

If a source produces a symbol every


1

s
sec:
s
source symbol
sec
,

and has entropy rate H(V


k
)
bit
source symbol
, the entropy in
bits
sec
is H
s
.

Channel has
c
ch use
sec
and capacity C
bits
ch use
, its capacity in
bits
sec
is C
c
.

The condition for reliable transmission of the source is


H
s
< C
c
.
Junmo Kim EE 623: Information Theory
Feedback Communication

Encoder is a sequence of mapping f


i
: J
i 1
A

C
FB
is the feedback capacity, obviously C
FB
C ( you can
ignore feedback if you like )

In fact, C
FB
= C.
Junmo Kim EE 623: Information Theory
Feedback Communication
When there is feedback, the following lemma is no longer true.
Lemma
Let X take value in A
n
according to some law P
X
(x) and let Y be
distributed according to p
Y[X
=

n
k=1
W(y
k
[x
k
) for some W([)
then I (X; Y) nC where C = max
P
X
I (P
X
; W).
Proof:
I (X; Y) = H(Y) H(Y[X)
= H(Y)

H(Y
i
[X, Y
i 1
)
= H(Y)

H(Y
i
[X
i
) memoryless

(H(Y
i
) H(Y
i
[X
i
))
=

I (X
i
; Y
i
) nC
Junmo Kim EE 623: Information Theory
Feedback Communication
nR = H(W)
= H(W[

W) + I (W;

W)
1 + P
(n)
e
nR + I (W;

W)
1 + P
(n)
e
nR + I (W; Y)
= 1 + P
(n)
e
nR + H(Y
n
)

H(Y
i
[W, Y
i 1
)
= 1 + P
(n)
e
nR + H(Y
n
)

H(Y
i
[W, Y
i 1
, X
i
)
(X
i
is function of (W, Y
i 1
))
= 1 + P
(n)
e
nR + H(Y
n
)

H(Y
i
[X
i
) (memoryless ch)
1 + P
(n)
e
nR +

(H(Y
i
) H(Y
i
[X
i
))
1 + P
(n)
e
nR + nC
Junmo Kim EE 623: Information Theory
Feedback Communication
Example: Binary erasure channel

C
FB
= C = 1 .

If there is feedback, channel coding can be simpler.

Send k bits using n channel uses.

If ? is received, retransmit the bit.

An error occurs if #? > n k.

Probability of error is given by


Pr (#? > n k) = Pr (
#?
n
> 1
k
n
)
= Pr (
#?
n
> 1 R)
Junmo Kim EE 623: Information Theory
Feedback Communication
Example: Binary erasure channel

Suppose R < C = 1 .

Let > 0 be small enough so that R + < 1 .

Given some
t
> 0 (very small), choose n large enough so that
Pr (
#?
n
> + ) <
t

Then
Pr (error ) = Pr (
#?
n
> 1 R)
Pr (
#?
n
> + ) ( 1 R > + )
<
t
Junmo Kim EE 623: Information Theory
Differential Entropy

Def: A random variable X is said to be continuous if F(x) = Pr(X ≤ x) is continuous.
Probability density function: f(x) = F'(x).
Support of X: S = {x : f(x) > 0}.
Def: The differential entropy of a random variable X with density f(·) is
  h(X) = − ∫_S f(x) log f(x) dx = h(f).
Junmo Kim EE 623: Information Theory
Examples

X ~ Unif[0, a]:
  h(f) = − ∫_0^a (1/a) log(1/a) dx = log a

X ~ N(0, σ²):
  h(f) = − ∫ f(x) ln [ (1/√(2πσ²)) e^{−x²/(2σ²)} ] dx
       = ln √(2πσ²) + E[X²]/(2σ²)
       = (1/2) ln 2πeσ²   nats

Attention: h(f) can be negative.
  lim_{a→0} h(Unif[0, a]) = −∞
  h(X) = −∞ when X is discrete.
Junmo Kim EE 623: Information Theory
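These closed forms are easy to confirm numerically by estimating h(X) = E[−log f(X)] from samples (my own sketch; the values a = 4 and σ = 0.2 are arbitrary, the latter chosen so the Gaussian differential entropy comes out negative, as noted above).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Uniform[0, a]: h = log2(a) bits (so h < 0 whenever a < 1)
a = 4.0
print(np.log2(a))                                    # 2 bits

# Gaussian N(0, sigma^2): h = 0.5*log2(2*pi*e*sigma^2) bits, estimated as -E[log2 f(X)]
sigma = 0.2
x = rng.normal(0.0, sigma, N)
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(np.mean(-np.log2(f)), 0.5 * np.log2(2 * np.pi * np.e * sigma**2))   # both ~ -0.27
```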
Properties of Dierential Entropy

h(X + c) = h(X).

h(aX) ,= h(X), h(aX) = h(X) + log [a[


Pf: Let Y = aX.
f
Y
(y) =
1
[a[
f
X
(
y
a
)
h(Y) = h(X) + log [a[
Junmo Kim EE 623: Information Theory
Joint Dierential Entropy

X
1
, . . . , X
n
are continuous random variables with joint pdf
f (x
1
, . . . , x
n
).
h(X
1
, . . . , X
n
) =
_
f (x
1
, . . . , x
n
) log f (x
1
, . . . , x
n
)dx
1
dx
n

Conditional Dierential Entropy

X, Y have joint pdf f (x, y)


h(X[Y) =
_
f (x, y) log f (x[y)dxdy
h(X[Y) = h(X, Y) h(Y)

Example: X N(, K)
h(X) =
1
2
ln(2e)
n
[K[

Properties: 1) h(AX) = h(X) + log [A[

2) h(X
1
, . . . , X
n
) =

n
i =1
h(X
i
[X
1
, . . . , X
i 1
)
Junmo Kim EE 623: Information Theory
Typical Set

Theorem: X
1
, . . . , X
n
are IID with density f

1
n
log f (X
1
, . . . , X
n
) E[log f (X)] = h(X)

Typical set A
(n)

A
(n)

= (x
1
, . . . , x
n
) S
n
: [
1
n
log f (x
1
, . . . , x
n
)h(X)[

Properties:
1. Pr (A
(n)

) > 1 n big enough


2. Vol (A
(n)

) 2
n(h(X)+)
for all n
3. Vol (A
(n)

) (1 )2
n(h(X))
n big enough
Junmo Kim EE 623: Information Theory
Typical Set
Proof:

1) > 0
Pr ([
1
n
log f (x
1
, . . . , x
n
) h(X)[ ) 1 as n
n
0
s.t. n n
0
,
Pr ([
1
n
log f (x
1
, . . . , x
n
) h(X)[ ) > 1 .

2)
1 =
_
S
n
f (x
1
, . . . , x
n
)dx
1
dx
n

_
A
(n)

f (X)dx

_
A
(n)

2
n(h(X)+)
dx
= 2
n(h(X)+)
Vol (A
(n)

)
Vol (A
(n)

) 2
n(h(X)+)
Junmo Kim EE 623: Information Theory
Typical Set

3)
1 < Pr (A
(n)

)
=
_
A
(n)

f (x)dx
2
n(h(X))
Vol (A
(n)

Vol (A
(n)

) (2
h(X)
)
n
Junmo Kim EE 623: Information Theory
Def: Relative Entropy

f , g 2 densities
D(f |g) =
_
S
f (x) log
f (x)
g(x)
dx

S is support of f ().

If f (x) = 0, g(x) ,= 0, we have 0 log 0

If f (x) ,= 0, g(x) = 0, ?

D(f |g) is nite only if Support of g() support of f ()


Junmo Kim EE 623: Information Theory
Mutual Information

X, Y are random variables with joint pdf f (x, y)


I (X; Y) =
_
f (x, y) log
f (x, y)
f (x)f (y)
dxdy
= D(f (x, y)|f (x)f (y))
= h(X) h(X[Y)
= h(Y) h(Y[X)
Junmo Kim EE 623: Information Theory
Inequalities

Theorem: D(f |g) 0.

Pf:
D(f |g) =
_
f (x) log
g(x)
f (x)
dx
= E
f
[log
g(X)
f (X)
]
log E
f
[
g(X)
f (X)
]
= log
_
f (x)
g(x)
f (x)
dx = log 1 = 0
where equality holds i g(x) = f (x) almost everywhere

As I (X; Y) = D(f (x, y)|f (x)f (y)) 0, we have


h(X[Y) h(X), with equality i X and Y are independent.

h(X
1
, . . . , X
n
) =

h(X
i
[X
i 1
)

h(X
i
), with equality i
X
1
, . . . , X
n
independent.
Junmo Kim EE 623: Information Theory
Inequalities

Thm : For a random vector X with E[X] = 0, E[XX


T
] = K,
h(X)
1
2
log(2e)
n
[K[ = h(N(, K))
Proof:
N(, K)
f (x) any density satisfying the conditions. We will show that
D(f |) = h() h(f ) 0.
D(f |) =
_
f (x) log
f (x)
(x)
dx
=
_
f (x) log f (x)dx
_
f (x) log (x)dx
Junmo Kim EE 623: Information Theory
Inequalities
_
f (x) log (x) =
_
f (x) log
1
_
(2)
n
[K[
1
2
e

1
2
(x)
T
K
1
(x)
dx
=
1
2
log(2)
n
[K[
n
2
=
1
2
ln(2e)
n
[K[
= h()
E
f
[
1
2
x
T
K
1
X] =
1
2
E
f
[

i ,j
X
i
(K
1
)
ij
X
j
]
=
1
2

i ,j
(K
1
)
ij
E
f
[X
i
X
j
]
=
1
2

j
(K
1
)
ij
K
ji
=
n
2
h() h(f )
Junmo Kim EE 623: Information Theory
Lecture 15
Junmo Kim EE 623: Information Theory
Review

Def: Dierential entropy of random variable X with density


f () is
h(X) =
_
S
f (x) log f (x)dx
= h(f )

X N(0,
2
)
h(f ) =
_
f (x) log
e
1

2
2
e

x
2
2
2
dx
= log
e

2
2
+
E[X
2
]
2
2
=
1
2
ln 2e
2
nats
Junmo Kim EE 623: Information Theory
Review

Thm : For a random vector X with E[X] = 0, E[XX


T
] = K,
h(X)
1
2
log(2e)
n
[K[ = h(N(, K))

In particular,
h(X)
1
2
log(2eE[X
2
])
Junmo Kim EE 623: Information Theory
Gaussian Channel
Y = x + Z
Z N(0, N)

If N = 0, C = .

If N = 1, C = . (without limit on x)

Average power constraint:


1
n

x
2
i
(w) P.
Junmo Kim EE 623: Information Theory
Gaussian Channel
We will show that the following quantity is the capacity:
max
E[X
2
]P
I (X; Y)
I (X; Y) = h(Y) h(Y[X)
= h(Y) h(X + Z[X)
= h(Y) h(Z)

1
2
log 2eE[Y
2
]
1
2
log 2eN

1
2
log
2e
2e
P + N
N
E[Y
2
] = E[X
2
] +E[Z
2
] +2E[XZ](E[XZ] = 0 X Z, E[Z] = 0)
For X s.t. E[X
2
] P, I (X; Y)
1
2
log(1 +
P
N
). This is achievable
if X N(0, P) and thus max
E[X
2
]P
I (X; Y) =
1
2
log(1 +
P
N
)
Junmo Kim EE 623: Information Theory
Gaussian Channel: Achievable Rate
Denition
We say that R is achievable if > 0, n
0
, s.t. n > n
0
, a rate-R
block length n codebook ( = x(1), , x(2
nR
)
n
and a
decoder :
n
1, . . . , 2
nR
s.t. the maximum probability of
error < and
1
n

x
2
i
(m) P, m 1, . . . , 2
nR
.
C supremum of achievable rate.
Junmo Kim EE 623: Information Theory
Gaussian Channel: Capacity
Theorem
The capacity of the power-limited Gaussian channel is
C =
1
2
log(1 +
P
N
)
Junmo Kim EE 623: Information Theory
Review
1. Fix some p
X
2. Fix some > 0, n
3. Generate a random codebook (
IID p
X
.
4. Reveal ( to encoder and receiver.
5. Design a joint typicality decoder
(; p
X
(x)W(y[x), , n, ()
6. Encodr m x(m) (according to codebook)
7. Each codebook ( gives P
(n)
e
(().
8. Analyze E[P
(n)
e
(()]. Average over (.
9. Will show that if R < I (P
X
; W) then E[P
(n)
e
(()] 0 as
n .
10. By random coding argument, there exists deterministic
sequence (
n
s.t. P
(n)
e
((
n
) 0
11. Trick to get
(n)
to go to zero.
Junmo Kim EE 623: Information Theory
Direct Part
1. Generate a codebook at random
1.1 Codewords are chosen independently
1.2 The components of the codewords are chosen IID from
N(0, P ).
2. Reveal the codebook to Tx/Rx
3. Decoder
3.1 Joint typicality: If there is one and only one codeword X
n
(w)
that is jointly typical with the received vector, declare

W = w.
Otherwise, declare an error.
3.2 Declare an error if the unique codeword that is typical with y
violates the average power constraint
Junmo Kim EE 623: Information Theory
Direct Part: Error Analysis
Assume W = 1.
E
0
the event X(1) violates the power constraint
E
i
the event (X(i ), Y) A
(n)

Pr (Error [W = 1) Pr (E
0
E
C
1

2
nR
i =2
E
i
)
Pr (E
0
) + Pr (E
C
1
) +
2
nR

i =2
Pr (E
i
)

Pr (E
0
) 0. (
1
n

X
2
i
(1) E[X
2
] = P )

Pr (E
C
1
) 0

Pr (E
i
) 2
n(I (X;Y)3)
, I (X; Y) =
1
2
log(1 +
P
N
)
Junmo Kim EE 623: Information Theory
Direct Part
Finally, deleting the worst half of the codewords, we obtain a code
with low maximal probability of error.
Also the selected codewords satisfy the power constraint.
(Otherwise, maximal probability of error is 1.)
Junmo Kim EE 623: Information Theory
Converse
Let ((, ) be a codebook of rate R, block length n and average
probability of error P
(n)
e
.
nR = H(W)
= H(W[

W) + I (W;

W)
1 + nRp
(n)
e
+ I (W;

W)
Junmo Kim EE 623: Information Theory
Converse
I (W;

W) I (X
n
(W); Y
n
)
= h(Y
n
) h(Y
n
[X
n
(W))
= h(Y
n
) h(X
n
(W) + Z
n
[X
n
(W))
= h(Y
n
) h(Z
n
)
=

(h(Y
i
[Y
i 1
) h(Z
i
))( Z
n
iid)

(h(Y
i
) h(Z
i
))
Junmo Kim EE 623: Information Theory
Converse
I (W;

W)

(h(Y
i
) h(Z
i
))

1
2
log 2eE[Y
2
i
]
1
2
log 2eN
= n
1
n

i
1
2
log(1 +
E[X
2
i
(W)]
N
)
n
1
2
log(1 +
1
n

E[X
2
i
(W)]
N
)
n
1
2
log(1 +
P
N
)
where
1
n

E[X
2
i
(W)] = E[
1
n

X
2
i
(W)] P
Junmo Kim EE 623: Information Theory
Lecture 16
Junmo Kim EE 623: Information Theory
Review
Gaussian channel
Y = X + Z, Z N(0, N)
1
n
n

k=1
x
k
(m)
2
P, m 1, . . . , 2
nR

1
2
log(1 +
P
N
) is achievable.
Junmo Kim EE 623: Information Theory
Band Limited Channel

Y(t) = X(t) h(t) + N


W
(t)

N
W
(t)

Gaussian process

Stationary

E[N
W
(t)N
W
(t + )] =
N
0
2
()

Y(t) = Y
LPF
(t) + Y
HPF
(t)
where Y
LPF
(t) = Y(t) h(t)

X(t) and Y
HPF
(t) are independent given Y
LPF
. Y
LPF
is
sucient statistics.

Limit yourself to X(t) band limited W Hz.

By sampling theorem, limit detection to be based on


Y
k
= Y
LPF
(
k
2W
)
also specify X(t) (now band limited) by X
k
= X(
k
2W
).
Y
k
= X
k
+ Z
k
where Z
k
= N
W
(t)[
LPF
(t =
k
2W
) = N
W,LPF
(
k
2W
)
Junmo Kim EE 623: Information Theory
Band Limited Channel

Consider the noise process Z


k
.

Z
k
= N
W,LPF
(
k
2W
)

Power spectral density of N


W,LPF
(t) = N
W
h(t) is
S
N
W,LPF
(f ) = S
N
(f )[H(f )[
2
=
N
0
2
[H(f )[
2

Autocovariance function: K
N
W,LPF
,N
W,LPF
() =
cov(N
W,LPF
(t), N
W,LPF
(t +)) = E[N
W,LPF
(t)N
W,LPF
(t +)]
K
N
W,LPF
,N
W,LPF
() =
_
S
N
W,LPF
(f )e
i 2f
df

E[Z
k
Z
l
] = E[N
W,LPF
(
k
2W
)N
W,LPF
(
l
2W
)] = K
N
W,LPF
,N
W,LPF
(
kl
2W
)

When k = l , E[Z
k
Z
l
] =

When k ,= l , E[Z
k
Z
l
] =

Thus, Z
k
is IID N(0, N
0
W).
Junmo Kim EE 623: Information Theory
Band Limited Channel

(Time) Average power constraint

lim
T
1
2T
_
T
T
X(t)
2
dt < P

= lim
n
1
2n

n
k=n
X
2
k
n =
T
1
2W
1
2n
1
2W
n

n
X
2
k
1
2W

1
2T
_
T
T
X(t)
2
dt

Thus C =
1
2
log(1 +
P
N
0
W
) bits / sample

We can send 2W samples / sec.


C = W log(1 +
P
N
0
W
) bits /sec
Junmo Kim EE 623: Information Theory
Unlimited Bandwidth

If there is no bandwidth constraint, i.e. W → ∞,
  lim_{W→∞} W ln(1 + P/(N_0 W)) = P/N_0  nats/sec = (P/N_0) log e  bits/sec
  (∵ ln(1 + x) ≈ x for small x)
  lim_{W→∞} W log_2(1 + P/(N_0 W)) = lim_{W→∞} W ln(1 + P/(N_0 W)) / ln 2 = P/(N_0 ln 2)

Reliable communication requires R < P/(N_0 ln 2). With E_b = P/R this gives
  E_b > N_0 / log_2 e,   i.e.   E_b/N_0 > −1.6 dB.
Junmo Kim EE 623: Information Theory
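A short numeric sketch (mine; P and N_0 are assumed values) showing the band-limited capacity W log2(1 + P/(N_0 W)) saturating at the infinite-bandwidth limit (P/N_0) log2 e as W grows.

```python
import numpy as np

P, N0 = 1.0, 1e-2       # received power (W) and one-sided noise PSD (W/Hz); my numbers
for W in [1e2, 1e3, 1e4, 1e5, 1e6]:
    C = W * np.log2(1 + P / (N0 * W))      # bits per second
    print(W, C)
print(P / N0 * np.log2(np.e))              # infinite-bandwidth limit (P/N0) log2 e ~ 144.3
```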
Parallel Gaussian Channels
Consider L independent Gaussian channels in parallel.

Y
(l )
= X
(l )
+ Z
(l )

Z
(l )
N(0, N
l
)

Input power constraint : P


1
, P
2
, . . . , P
L
,

L
l =1
P
l
P.

C = max I (X
(1)
, X
(2)
, . . . , X
(L)
; Y
(1)
, Y
(2)
, . . . , Y
(L)
) where
the maximum is over all input distribution
f
X
(1)
,X
(2)
,...,X
(L)
(, . . . , ) satisfying the power constraint

L
l =1
E[(X
(l )
)
2
] P.

This problems reduces to : maximize

l
1
2
log(1 +
P
l
N
l
)
subject to

P
l
P.
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels
I (X
(1)
, X
(2)
, . . . , X
(L)
; Y
(1)
, Y
(2)
, . . . , Y
(L)
)
= h(Y
(1)
, Y
(2)
, . . . , Y
(L)
) h(Y
(1)
, . . . , Y
(L)
[X
(1)
, . . . , X
(L)
)
= h(Y
(1)
, Y
(2)
, . . . , Y
(L)
) h(Z
(1)
, . . . , Z
(L)
[X
(1)
, . . . , X
(L)
)
= h(Y
(1)
, Y
(2)
, . . . , Y
(L)
) h(Z
(1)
, . . . , Z
(L)
)
= h(Y
(1)
, Y
(2)
, . . . , Y
(L)
)

l
h(Z
(l )
)

l
(h(Y
(l )
) h(Z
(l )
))

l
1
2
log(1 +
P
l
N
l
)
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels
maximize f (P
1
, P
2
, . . . , P
L
) =

l
1
2
log(1 +
P
l
N
l
)
subject to

P
l
P.

f (P
1
, P
2
, . . . , P
L
) is a concave function.

P
l
0 for l = 1, . . . , L, and

P
l
= P.

The Kuhn-Tucker condition :


f
P
l
= if P
l
> 0
f
P
l
if P
l
= 0
moving P from channel l to channel l
t
Lose

1
2
log(1+
P
l
N
l
)
P
l
P
Gain

1
2
log(1+
P
l

N
l

)
P
l

P
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels

The Kuhn-Tucker condition :


f
P
l
= if P
l
> 0
f
P
l
if P
l
= 0


1
2
log(1+
P
l
N
l
)
P
l
=
1
2
1
N
l
1+
P
l
N
l
=
1
2
1
P
l
+N
l
1
2
1
P
l
+ N
l
= if P
l
> 0 P
l
+ N
l
= if P
l
> 0
1
2
1
P
l
+ N
l
if P
l
= 0 P
l
+ N
l
if P
l
= 0

Optimum P
l
= ( N
l
)
+
where x
+
=
_
x x > 0
0 x 0

is determined such that

( P
l
)
+
= P.
Junmo Kim EE 623: Information Theory
Water-Filling for Parallel Gaussian Channels

Optimum P_l = (ν − N_l)^+,  where x^+ = x if x > 0 and x^+ = 0 otherwise.

ν is determined such that Σ_l (ν − N_l)^+ = P.
Junmo Kim EE 623: Information Theory
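The water level ν can be found by a simple bisection. Below is a small Python sketch (my own) that computes the water-filling allocation and the resulting capacity Σ_l (1/2) log2(1 + P_l/N_l) for assumed noise levels N = [1, 2, 6] and total power P = 4.

```python
import numpy as np

def water_filling(N, P, iters=60):
    """Allocate total power P over parallel Gaussian channels with noise levels N.

    Returns P_l = (nu - N_l)^+ with nu chosen by bisection so that sum P_l = P,
    together with the resulting capacity in bits per (vector) channel use.
    """
    N = np.asarray(N, dtype=float)
    lo, hi = N.min(), N.max() + P            # the water level nu lies in this interval
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - N, 0.0).sum() > P:
            hi = nu
        else:
            lo = nu
    Pl = np.maximum(nu - N, 0.0)
    C = 0.5 * np.log2(1 + Pl / N).sum()
    return Pl, C

Pl, C = water_filling(N=[1.0, 2.0, 6.0], P=4.0)
print(Pl, Pl.sum(), C)   # [2.5, 1.5, 0]: the weakest channel (N=6) gets no power here
```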
Lecture 17
Junmo Kim EE 623: Information Theory
Method of Types

H(X): source coding

I (X; Y): channel capacity

D(P|Q): large deviation theory, hypothesis testing

Large deviation theory: probability of rare event (Sanovs


theorem)
e.g. Q = (
1
2
,
1
2
) H,T
Pr (# of H exceeds 75%) 2
nD([
3
4
,
1
4
][
1
2
,
1
2
])

Hypothesis testing (Steins lemma)


Observe X
1
, . . . , X
n
, where X
i
IID.
H
1
: X
i
IID P
1
H
2
: X
i
IID P
2
How fast can the error probability Pr (

H = H
1
[H
2
) go to zero
subject to Pr (

H = H
2
[H
1
) 0 ?
Pr (

H = H
1
[H
2
) 2
nD(P
1
P
2
)
Junmo Kim EE 623: Information Theory
Method of Types

Type (empirical distribution) of a sequence x is denoted by


P
x
(x).

Let x A
n
, [A[ < .
P
x
(x) =
1
n
n

k=1
I x
k
= x
where I statement =
_
1 if statement is true
0 otherwise

e.g. A = a, b, c, n = 5
x = aabcb
P
x
(a) =
2
5
, P
x
(b) =
2
5
, P
x
(c) =
1
5
Junmo Kim EE 623: Information Theory
Method of Types

T
n
(A) = set of all distributions on A with denominator n.

T
n
(A) = T(x) : n(x) Z
where T(A) is probability simplex, the set of all PMFs on A.

T = (p
1
, p
2
, . . . , p
|X|
) : p
i
0,

p
i
= 1

T
n
= (
n
1
n
,
n
2
n
, . . . ,
n
|X|
n
) :

n
i
= n, n
i
0, n Z
Junmo Kim EE 623: Information Theory
Method of Types

Suppose Q(x) is some PMF. What is the probability that a


sequence X
1
, . . . , X
n
generated i.i.d. according to Q(x) will
have type P ?

Type class of P: T(P) = x A


n
: P
x
(x) = P(x), x A
e.g. = aabcb, aabbc, bcbaa, . . .

Q
n
(T(P)) : Q
n
is n-fold pdf.
Junmo Kim EE 623: Information Theory
Notations

T : probability simplex

T
n
: set of types with denominator n.

P
x
: type

T(P): type class


Junmo Kim EE 623: Information Theory
Number of Types
Lemma
[T
n
[ (n + 1)
[.[
The number of types is at most polynomial in n.
Proof.
Symbol x can show up anywhere 0, . . . , n times.
e.g. A = H, T
[T
n
[ = n + 1.
A type (class) is determined by the number of H, which can be
between 0 and n. [T
n
[ = n + 1 (n + 1)
[.[
= (n + 1)
2
.
In this case T
n
= (
0
n
,
n
n
), (
1
n
,
n1
n
), . . . , (
n
n
,
0
n
).
In general, every element of T
n
is in the form of (
n
1
n
,
n
2
n
, . . . ,
n
|X|
n
)
which is determined by the numerators, n
1
, . . . , n
[.[
. There are [A[
numerators, and each numerator can take on only n + 1 values
0, 1, . . . , n. Therefore, T
n
can have at most (n + 1)
[.[
types.
Junmo Kim EE 623: Information Theory
Probability of a Sequence and Its Type

X_1, ..., X_n are i.i.d. according to Q.
Let x ∈ X^n be some sequence of type P_x.
What is the probability that (X_1, ..., X_n) = x, i.e. what is Q^n(x)?

e.g. Q(H) = 1/3, Q(T) = 2/3:
  Pr[(X_1, ..., X_10) = (HHHTTTTTTT)] = (1/3)^3 (2/3)^7

Lemma
  Q^n(x) = 2^{−n ( H(P_x) + D(P_x ‖ Q) )}
The probability of a sequence depends only on its type.
Junmo Kim EE 623: Information Theory
Probability of a Sequence and Its Type
Lemma
  Q^n(x) = 2^{−n ( H(P_x) + D(P_x ‖ Q) )}
Proof.
  Q^n(x) = Π_{x∈X} Q(x)^{N(x)}           (N(x) = Σ_k I{x_k = x})
         = Π_{x∈X} Q(x)^{n P_x(x)}       (P_x(x) = N(x)/n)
         = 2^{ n Σ_x P_x(x) log Q(x) }
         = 2^{ −n ( Σ_x P_x(x) log [P_x(x)/Q(x)] + Σ_x P_x(x) log [1/P_x(x)] ) }
         = 2^{ −n ( D(P_x ‖ Q) + H(P_x) ) }
Junmo Kim EE 623: Information Theory
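The identity is easy to check in code. The sketch below (mine) computes Q^n(x) for the HHHTTTTTTT example both directly and via 2^{−n(H(P_x)+D(P_x‖Q))}; the two numbers agree.

```python
import numpy as np
from collections import Counter

def type_of(x, alphabet):
    """Empirical distribution (type) of the sequence x over the given alphabet."""
    n = len(x)
    c = Counter(x)
    return np.array([c[a] / n for a in alphabet])

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def D(p, q):
    m = p > 0
    return (p[m] * np.log2(p[m] / q[m])).sum()

alphabet = ['H', 'T']
Q = np.array([1/3, 2/3])
x = list('HHHTTTTTTT')                    # the sequence from the slide
n = len(x)
Px = type_of(x, alphabet)                 # (3/10, 7/10)

direct = (1/3) ** 3 * (2/3) ** 7          # Q^n(x) computed directly
via_type = 2.0 ** (-n * (H(Px) + D(Px, Q)))
print(direct, via_type)                   # identical
```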
Size of a Type Class T(P)

What is [T(P)[ ? Total number of permutations.

[T(P)[ 2
nH(P)
Proof.
Draw X
1
, . . . , X
n
i.i.d. according to P.
1 Pr (X
n
T(P))
= P
n
(T(P))
=

xT(P)
P
n
(x)
=

xT(P)
2
nH(P)
( D(P
x
|P) = 0)
= [T(P)[2
nH(P)
Thus [T(P)[ 2
nH(P)
.
Junmo Kim EE 623: Information Theory
Size of a Type Class T(P)
Theorem
1
(n + 1)
[.[
2
nH(P)
[T(P)[ 2
nH(P)
Proof: We have P
n
(T(P)) P
n
(T(

P)) for all other



P T
n
. (see
the textbook) Hence,
1 = P
n
(A
n
)
=

P1
n
P
n
(T(

P))
[T
n
[ max

P1
n
P
n
(T(

P))
= [T
n
[P
n
(T(P))
(n + 1)
[.[
P
n
(T(P))
Junmo Kim EE 623: Information Theory
Size of a Type Class T(P)
Thus we have P
n
(T(P))
1
(n+1)
|X|
.
x T(P) P
n
(x) = 2
nH(P)
.
P
n
(T(P)) = [T(P)[2
nH(P)

1
(n+1)
|X|
Therefore, [T(P)[
2
nH(P)
(n+1)
|X|
.
Junmo Kim EE 623: Information Theory
Probability of Type Class
Theorem
1
(n + 1)
[.[
2
D(P|Q)
Q
n
(T(P)) 2
nD(P|Q)
Q
n
(T(P)) can be viewed as probability of rare event.
D(P|Q) is about how fast the probability decay as n grows.
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
Theorem
Let E ⊆ P. Suppose X_1, ..., X_n are i.i.d. according to Q. Then
  Pr( P_{X_1,...,X_n} ∈ E ) ≤ (n + 1)^{|X|} 2^{−n D(P* ‖ Q)},
where P_{X_1,...,X_n} is the empirical type and P* = arg min_{P∈E} D(P ‖ Q).

e.g. Q = (1/2, 1/2) on {H, T}, E = { P ∈ P : P(H) > 3/4 }:
  Pr( P_{X_1,...,X_n} ∈ E ) = Pr( # of H exceeds 75% ).
Junmo Kim EE 623: Information Theory
Sanovs Theorem (Large Deviations)
Pr ( P
X
1
,...,X
n
. .
emprical type
E) (n + 1)
[.[
2
nD(P

|Q)
where P

= arg min
PE
D(P|Q).
Proof.
Pr (P
X
E) =

PE1
n
Q
n
(T(P))

PE1
n
2
nD(P|Q)

PE1
n
2
nD(P

|Q)
(n + 1)
[.[
2
nD(P

|Q)
Junmo Kim EE 623: Information Theory
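For the coin example the exact probability is a binomial tail, so the exponent predicted by Sanov's theorem can be checked directly. The sketch below is my own; taking log2 of the large integer sum keeps the computation exact even when the probability itself would underflow a float.

```python
import numpy as np
from math import comb, log2

def D(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return (p[m] * np.log2(p[m] / q[m])).sum()

# Q = (1/2, 1/2); rare event: fraction of heads >= 3/4 (the example from the slide).
Dstar = D([0.75, 0.25], [0.5, 0.5])                 # ~0.1887 bits
for n in [20, 100, 500, 2000]:
    k0 = -(-3 * n // 4)                             # ceil(3n/4)
    log2_tail = log2(sum(comb(n, k) for k in range(k0, n + 1))) - n   # exact log2 Pr
    print(n, -log2_tail / n, Dstar)    # empirical exponent -> D([3/4,1/4]||[1/2,1/2])
```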
Lecture 18
Junmo Kim EE 623: Information Theory
Review

[T
n
[ (n + 1)
[.[

Q
n
(x) = 2
n(H(P
x
)+D(P
x
|Q))

[T(P)[ =
n!

xX
(nP(x))!
: number of permutations
1
(n + 1)
[.[
2
nH(P)
[T(P)[ 2
nH(P)
Junmo Kim EE 623: Information Theory
Probability of Type Class
Theorem
1
(n + 1)
[.[
2
nD(P|Q)
Q
n
(T(P)) 2
nD(P|Q)
Q
n
(T(P)) can be viewed as probability of rare event.
D(P|Q) is about how fast the probability decay as n grows.
Proof.
If x T(P), Q
n
(x) = 2
n(H(P
x
)+D(P
x
|Q))
= 2
n(H(P)+D(P|Q))
.
Q
n
(T(P)) =

xT(P)
Q
n
(x) = [T(P)[2
n(H(P)+D(P|Q))
As
1
(n+1)
|X|
2
nH(P)
[T(P)[ 2
nH(P)
, we have
1
(n+1)
|X|
2
D(P|Q)
Q
n
(T(P)) 2
nD(P|Q)
.
Junmo Kim EE 623: Information Theory
Sanovs Theorem (Large Deviations)
Theorem
Let E T. Suppose X
1
, . . . , X
n
are i.i.d. according to Q.
Pr ( P
X
1
,...,X
n
. .
emprical type
E) (n + 1)
[.[
2
nD(P

|Q)
where P

= arg min
PE
D(P|Q).
e.g. Q = (
1
2
,
1
2
) H,T, E = P T : P(H)
3
4

Pr (P
X
1
,...,X
n
E) = Pr (# of H 75%)
Junmo Kim EE 623: Information Theory
Sanovs Theorem (Large Deviations)
Pr ( P
X
1
,...,X
n
. .
emprical type
E) (n + 1)
[.[
2
nD(P

|Q)
where P

= arg min
PE
D(P|Q).
Proof.
Pr (P
X
E) =

PE1
n
Q
n
(T(P))

PE1
n
2
nD(P|Q)

PE1
n
2
nD(P

|Q)
(n + 1)
[.[
2
nD(P

|Q)
Junmo Kim EE 623: Information Theory
Sanovs Theorem (Large Deviations)
If E is the closure of its interior then
lim
n

1
n
log Pr (P
X
E) = D(P

|Q).
Proof.
In this case, for all large n, we can nd a distribution in E T
n
(nonempty) that is close to P

. We can then nd a sequence of


distributions P
n
such that P
n
E T
n
and
D(P
n
|Q) D(P

|Q).
Pr (P
X
E) =

PE1
n
Q
n
(T(P))
Q
n
(T(P
n
))

1
(n + 1)
[.[
2
nD(P
n
|Q)
.
Combining the lower bound and upper bound the limit is
D(P

|Q). Junmo Kim EE 623: Information Theory


Conditional Sanovs Theorem

Suppose E is closed and convex.

Let P

achieve inf D(P|Q).


[Q
n
(X
1
= a[P
X
1
,...,X
n
E) P

(a)[ 0 as n .
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem
Lemma
For a closed convex set E T, Q not in E,
P

= arg min
PE
D(P|Q),
D(P|Q) D(P|P

) + D(P

|Q)

D(P|Q) behaves like square of distance.

The lemma is useful for showing that if D(P|Q) is very close


to D(P

|Q) then D(P|P

) is very small.
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem
Lemma
(Pinsker): In natural logs
|P Q|
1

_
2D(P|Q)

L
1
norm: Given P, Q T
|P Q|
1
=

x.
[P(x) Q(x)[

Let A = x : P(x) Q(x).


|P Q|
1
=

xA
(P(x) Q(x)) +

xA
C
(Q(x) P(x))
= P(A) Q(A) + (1 Q(A)) (1 P(A))
= 2(P(A) Q(A))

In fact,
|PQ|
1
2
= max
B.
(P(B) Q(B))
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem

Let D = D(P

|Q) = min
PE
D(P|Q).

Let S
t
= P : D(P|Q) t

Consider S
D+
and S
D+2
.

Pr (P
X
1
,...,X
n
E S
C
D+2
[P
X
1
,...,X
n
E) is very small.

Pr (P
X
1
,...,X
n
E S
C
D+2
) (n + 1)
|X|
2
n(D+2)
(Sanovs
theorem)

Pr (P
X
1
,...,X
n
E)
1
(n+1)
|X|
2
n(D+)
(Sanovs theorem)

Therefore
Pr (P
X
1
,...,X
n
E S
C
D+2
[P
X
1
,...,X
n
E)
=
Pr (P
X
1
,...,X
n
E S
C
D+2
)
Pr (P
X
1
,...,X
n
E)

(n + 1)
|X|
2
n(D+2)
1
(n+1)
|X|
2
n(D+)
(n + 1)
2|X|
2
n
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem

Let A = S
D+2
E. For all P A, D(P|Q) D + 2.

By the Pythagorean theorem, if P A,


D(P|P

) + D(P

|Q) D(P|Q) D + 2

We have D(P|P

) 2.

P
X
1
,...,X
n
A implies that D(P
X
1
,...,X
n
|P

) 2.

Since Pr P
X
1
,...,X
n
A[P
X
1
,...,X
n
E 1, we have
Pr (D(P
X
1
,...,X
n
|P

) 2[P
X
1
,...,X
n
E) 1.

Since
[P
X
1
,...,X
n
(a)P

(a)[ |P
X
1
,...,X
n
P

|
1

_
2D(P
X
1
,...,X
n
|Q)
Pr ([P
X
1
,...,X
n
(a) P

(a)[ [P
X
1
,...,X
n
E) 0.
Pr (X
1
= a[P
X
1
,...,X
n
E) P

(a) in probability, a A.
Junmo Kim EE 623: Information Theory
Lecture 19
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem

Suppose E is closed and convex.

Let P

E achieve inf
PE
D(P|Q), where Q is not in E.
[Q
n
(X
1
= a[P
X
1
,...,X
n
E) P

(a)[ 0 as n .
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem
Lemma
For a closed convex set E T, Q not in E,
P

= arg min
PE
D(P|Q),
D(P|Q) D(P|P

) + D(P

|Q)

D(P|Q) behaves like square of distance.

The lemma is useful for showing that if D(P|Q) is very close


to D(P

|Q) then D(P|P

) is very small.
Lemma
(Pinsker): In natural logs
|P Q|
1

_
2D(P|Q)
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem

Let D = D(P

|Q) = min
PE
D(P|Q).

Let S
t
= P : D(P|Q) t

Consider S
D+
and S
D+2
.

Pr (P
X
1
,...,X
n
E S
C
D+2
[P
X
1
,...,X
n
E) is very small.

Pr (P
X
1
,...,X
n
E S
C
D+2
) (n + 1)
|X|
2
n(D+2)
(Sanovs
theorem)

Pr (P
X
1
,...,X
n
E)
1
(n+1)
|X|
2
n(D+)
(Sanovs theorem)

Therefore
Pr (P
X
1
,...,X
n
E S
C
D+2
[P
X
1
,...,X
n
E)
=
Pr (P
X
1
,...,X
n
E S
C
D+2
)
Pr (P
X
1
,...,X
n
E)

(n + 1)
|X|
2
n(D+2)
1
(n+1)
|X|
2
n(D+)
(n + 1)
2|X|
2
n
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem

Let A = S
D+2
E. For all P A, D(P|Q) D + 2.

By the Pythagorean theorem, if P A,


D(P|P

) + D(P

|Q) D(P|Q) D + 2

We have D(P|P

) 2.

P
X
1
,...,X
n
A implies that D(P
X
1
,...,X
n
|P

) 2.

Since Pr P
X
1
,...,X
n
A[P
X
1
,...,X
n
E 1, we have
Pr (D(P
X
1
,...,X
n
|P

) 2[P
X
1
,...,X
n
E) 1.

Since [P
X
1
,...,X
n
(a) P

(a)[ |P
X
1
,...,X
n
P

|
1

_
2D(P
X
1
,...,X
n
|P

)
Pr ([P
X
1
,...,X
n
(a) P

(a)[ [P
X
1
,...,X
n
E) 0.
Pr (X
1
= a[P
X
1
,...,X
n
E) P

(a) in probability, a A.
Junmo Kim EE 623: Information Theory
Conditional Sanovs Theorem
Pr ([P
X
1
,...,X
n
(a) P

(a)[ [P
X
1
,...,X
n
E) 0.
Pr (X
1
= a[P
X
1
,...,X
n
E) P

(a) in probability, a A.
Junmo Kim EE 623: Information Theory
Hypothesis Testing

Observe X
1
, . . . , X
n
, where X
i
IID.

Hypothesis testing:
H
1
: X
i
IID P
1
H
2
: X
i
IID P
2
Declare H
1
if X
1
, . . . , X
n
A
n
A
n
.
Declare H
2
if X
1
, . . . , X
n
A
C
n
.

Error probabilities
Type I (False alarm):
n
= P
n
1
(A
C
n
) = Pr (H
2
[H
1
)
Type II (Miss detection) :
n
= P
n
2
(A
n
) = Pr (H
1
[H
2
)

Find a good set A


n
to minimize
n
subject to a constraint
on
n

n
= min
A
n
.
n
,P
n
1
(A
C
n
)

n
P
n
2
(A
n
)
Junmo Kim EE 623: Information Theory
Hypothesis Testing

Find a good set A


n
to minimize
n
subject to a constraint
on
n

n
= min
A
n
.
n
,P
n
1
(A
C
n
)

n
P
n
2
(A
n
)

Given an observation X A
n

Declare H
1
if
P
n
1
(X)
P
n
2
(X)
> T.
i.e. A
n
= x :
P
n
1
(x)
P
n
2
(x)
> T

T is chosen to meet P
n
1
(A
C
n
)

See Neyman-Pearson lemma.


Junmo Kim EE 623: Information Theory
Hypothesis Testing

Suppose α_n → 0.
How fast can β_n go to zero?
  β_n ≈ 2^{−n D(P_1 ‖ P_2)},   where P_2 is the true distribution.

There exists a sequence of acceptance regions A_n such that α_n → 0 and
  lim_n (1/n) log β_n ≤ −D(P_1 ‖ P_2) + δ.   (Achievability)

For any decision regions, if α_n → 0,
  lim_n (1/n) log β_n ≥ −D(P_1 ‖ P_2) − δ.   (Converse)
Junmo Kim EE 623: Information Theory
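This tradeoff can be computed exactly for Bernoulli sources because the normalized log-likelihood ratio depends only on the number of ones. The sketch below is my own; P_1 = Bern(0.6), P_2 = Bern(0.3) and the slack delta = 0.03 are assumed values. It builds the acceptance region A_n = {x : |(1/n) log(P_1^n(x)/P_2^n(x)) − D(P_1‖P_2)| < delta}, and shows that alpha_n shrinks while −(1/n) log2 beta_n stays close to (a bit below) D(P_1‖P_2) ≈ 0.277 bits.

```python
import numpy as np
from math import lgamma, log, exp

def log_binom_pmf(n, k, p):
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def logsumexp(v):
    m = max(v)
    return m + log(sum(exp(x - m) for x in v))

p1, p2, delta = 0.6, 0.3, 0.03
D12 = p1 * np.log2(p1 / p2) + (1 - p1) * np.log2((1 - p1) / (1 - p2))   # ~0.277 bits

for n in [500, 2000, 8000]:
    ks = np.arange(n + 1)
    # normalized log-likelihood ratio (bits) of any sequence with k ones
    llr = (ks / n) * np.log2(p1 / p2) + (1 - ks / n) * np.log2((1 - p1) / (1 - p2))
    A = np.abs(llr - D12) < delta                                       # accept H1
    alpha = exp(logsumexp([log_binom_pmf(n, k, p1) for k in ks[~A]]))   # P1^n(A^c)
    log_beta = logsumexp([log_binom_pmf(n, k, p2) for k in ks[A]])      # log P2^n(A)
    print(n, alpha, -log_beta / log(2) / n, float(D12))
```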
Achievability

A
n
= x : 2
n(D(P
1
|P
2
))
<
P
n
1
(x)
P
n
2
(x)
< 2
n(D(P
1
|P
2
)+)
.

> 0 but arbitrary

P
n
1
(x) =

n
k=1
P
1
(x
k
)

P
n
2
(x) =

n
k=1
P
2
(x
k
)

1
n
log
P
n
1
(x)
P
n
2
(x)
=
1
n

log
P
1
(x
k
)
P
2
(x
k
)

x A
n
if
1
n

log
P
1
(x
k
)
P
2
(x
k
)
(D(P
1
|P
2
) , D(P
1
|P
2
) + )

Claim:
n
0
Because of L.L.N.,
1
n

log
P
1
(X
k
)
P
2
(X
k
)
E
P
1
[log
P
1
(X)
P
2
(X)
] = D(P
1
|P
2
).
Junmo Kim EE 623: Information Theory
Achievability

A
n
= x : 2
n(D(P
1
|P
2
))
<
P
n
1
(x)
P
n
2
(x)
< 2
n(D(P
1
|P
2
)+)
.

Claim: For this scheme


n
.
= 2
nD(P
1
|P
2
)
(exponentially equal)
Proof:

n
=

xA
n
P
n
2
(x)

xA
n
P
n
1
(x)2
n(D(P
1
|P
2
))
= 2
n(D(P
1
|P
2
))

xA
n
P
n
1
(x)
= 2
n(D(P
1
|P
2
))
(1
n
)
1
n
log
n
(D(P
1
|P
2
) ) +
log(1
n
)
n
lim
0
lim
n
1
n
log min
P
n
1
(A
C
n
)<
P
n
2
(A
n
) D(P
1
|P
2
) +
Junmo Kim EE 623: Information Theory
Converse
Lemma
Let B
n
A
n
be any set of sequences x
1
, x
2
, . . . , x
n
such that

B
n
< . Then
B
n
= P
n
2
(B
n
) > (1 2)2
n(D(P
1
|P
2
)+)
, which
implies
1
n
log
B
n
(D(P
1
|P
2
) + ).
Proof:

Since P
n
1
(A
n
) 1 and P
n
1
(B
n
) 1, we have
P
n
1
(A
n
B
n
) 1.

More precisely, P
n
1
(A
n
) > 1 and P
n
2
(B
n
) > 1 ,
P
n
1
(A
n
B
n
) > 1 2.
P
n
1
((A
n
B
n
)
C
) = P
n
1
(A
C
n
B
C
n
) P
n
1
(A
C
n
) + P
n
2
(B
C
n
) < 2
P
n
1
(A
n
B
n
) > 1 2
Junmo Kim EE 623: Information Theory
Converse
Lemma
Let B
n
A
n
be any set of sequences x
1
, x
2
, . . . , x
n
such that

B
n
< . Then
B
n
= P
n
2
(B
n
) > (1 2)2
n(D(P
1
|P
2
)+)
, which
implies
1
n
log
B
n
(D(P
1
|P
2
) + ).

P
n
1
(A
n
B
n
) > 1 2.Thus,
P
n
2
(B
n
) P
n
2
(A
n
B
n
)
=

x
n
A
n
B
n
P
n
2
(x
n
)

x
n
A
n
B
n
P
n
1
(x
n
)2
n(D(P
1
|P
2
)+)
= 2
n(D(P
1
|P
2
)+)

x
n
A
n
B
n
P
n
1
(x
n
)
= 2
n(D(P
1
|P
2
)+)
P
n
1
(A
n
B
n
)
> 2
n(D(P
1
|P
2
)+)
(1 2)
Junmo Kim EE 623: Information Theory
Examples

P
1
= (
1
4
,
1
4
,
1
4
,
1
4
) and P
2
= (0,
1
3
,
1
3
,
1
3
).
D(P
1
|P
2
) = = 0.

P
1
= (0,
1
3
,
1
3
,
1
3
) and P
2
= (
1
4
,
1
4
,
1
4
,
1
4
).
D(P
1
|P
2
) = log
4
3
= 0.
Junmo Kim EE 623: Information Theory
Lecture 20
Junmo Kim EE 623: Information Theory
Hypothesis Testing

Suppose
n
0.

How fast can


n
go to zero ?

n
2
nD(P
1
|P
2
)
where P
2
is the true distribution.

There exists a sequence of A


n
such that
n
0 and
1
n
lim
n

n
D(P
1
|P
2
) + . (Achievability)

For any decision regions, if


n
0,
1
n
lim
n

n
D(P
1
|P
2
) . (Converse)
Junmo Kim EE 623: Information Theory
Examples

P
1
= (
1
4
,
1
4
,
1
4
,
1
4
) and P
2
= (0,
1
3
,
1
3
,
1
3
).
D(P
1
|P
2
) = = 0.

P
1
= (0,
1
3
,
1
3
,
1
3
) and P
2
= (
1
4
,
1
4
,
1
4
,
1
4
).
D(P
1
|P
2
) = log
4
3
= 0.
Junmo Kim EE 623: Information Theory
Rate Distortion Theory

Let A (not nite) be the source alphabet.

Let

A be the reconstruction alphabet.

encoder f
n
: A
n
1, 2, . . . , 2
nR

reconstruction g
n
: 1, 2, . . . , 2
nR


A
n

Scalar quantization vs vector quantization (joint quantization)

If the source has memory (i.e. correlated), vector quantization


is obviously better.

Surprisingly, even if source is IID, vector quantization is better.


Junmo Kim EE 623: Information Theory
Distortion Function

d : A

A [0, )
e.g. A =

X = , d(x, x) = (x x)
2
(squared-error distortion)

We extend d to sequence,
d((x
1
, . . . , x
n
), ( x
1
, . . . , x
n
)) =
1
n
n

k=1
d(x
k
, x
k
)
Junmo Kim EE 623: Information Theory
Achievable Rate and Distortion
Denition
(R, D) is achievable if for any > 0, n
0
, n > n
0
, f
n
, g
n
such
that
Pr (d((X
1
, . . . , X
n
), g
n
(f
n
(X
1
, . . . , X
n
))) < D + ) > 1
Junmo Kim EE 623: Information Theory
Conditions for Optimal f
n

Let D = E[d(X, g
n
(f
n
(X)))].


i
= x A
n
: f
n
(x) = i

For now, let y


i
= g
n
(i ).
Lemma: Suppose that the function g
n
is given (xed), i.e., y
i
is
given. Then the optimal f
n
is given by
f
n
(x) = arg min
1i 2
nR
d(x, g
n
(i ))
Junmo Kim EE 623: Information Theory
Conditions for Optimal g
n
Lemma
If f
n
is given (i.e.
i
)
g
n
(i ) = arg min
y
E[d(X, y)[X
i
]
Proof.
D = E[d(X, g
n
(f
n
(X)))]
=
2
nR

i =1
Pr (X
i
)E[d(X, g
n
(i )[X
i
]
D is minimized by minimizing E[d(X, g
n
(i )[X
i
] for each i .
Junmo Kim EE 623: Information Theory
Lloyd Algorithm
Lemma
Suppose X = X̂ = R, d(x, x̂) = (x − x̂)², and let the regions R_i be fixed. Then
  arg min_{y ∈ R^n} E[ d(X, y) | X ∈ R_i ] = E[ X | X ∈ R_i ].
Junmo Kim EE 623: Information Theory
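Alternating the two optimality conditions (nearest-codeword encoding for a fixed codebook, centroid reconstruction for fixed regions) gives the Lloyd algorithm. Below is a small scalar (n = 1) Python sketch for a N(0,1) source with squared error; it is my own illustration, with M = 4 levels and a quantile initialization as arbitrary choices. It converges to reconstruction points near ±0.45 and ±1.51 with distortion ≈ 0.1175, the known optimal 2-bit scalar quantizer.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 100_000)              # samples of the N(0,1) source
M = 4                                          # 2-bit scalar quantizer
y = np.quantile(x, (np.arange(M) + 0.5) / M)   # initial reconstruction points

for _ in range(50):
    # Step 1 (optimal encoder for fixed y): nearest-neighbor regions R_i
    idx = np.argmin((x[:, None] - y[None, :]) ** 2, axis=1)
    # Step 2 (optimal decoder for fixed regions): centroid E[X | X in R_i]
    y = np.array([x[idx == i].mean() for i in range(M)])

idx = np.argmin((x[:, None] - y[None, :]) ** 2, axis=1)
D = np.mean((x - y[idx]) ** 2)
print(np.sort(y), D)      # ~[-1.51, -0.45, 0.45, 1.51], D ~ 0.1175
```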
Rate Distortion Function

Given distortion function d : A



X .

Given source distribution p


X
().

Rate distortion function for an i.i.d. source X with p


X
and
d(x, x) is dened as
R(D) = min
p
X,

X
(x, x):

x
p
X,

X
(x, x)=p
X
(x),E
p
X,

X
[d(X,

X)]D
I (X;

X)
= min
p

X|X
:E
p
X
p

X|X
[d(X,

X)]D
I (X;

X)

The two conditions for p


X,

X
1.

x
p
X,

X
(x, x) = p
X
(x)
2. E
p
X,

X
[d(X,

X)] D
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source

A =

X = 0, 1

d(x, x) = Hamming distance = I x ,= x


p
X
(x) =
_
p x = 1
1 p x = 0
where p
1
2

D = 0 R = H(p).

D = p R = 0
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
I (X;

X) = H(X) H(X[

X)
= H
b
(p) H(X[

X)
= H
b
(p) H(X

X[

X)
H
b
(p) H(X

X)
= H
b
(p) H
b
(Pr (X ,=

X))
H
b
(p) H
b
(D)
as p
X,

X
satises E[d(X,

X)] D and D
1
2
, we have
Pr (X ,=

X) = E[d(X,

X)] D
1
2
.
Note that equality holds only if the error X

X and the estimate

X are independent.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source

A =

A = 0, 1

d(x, x) = Hamming distance = I x ,= x


p
X
(x) =
_
p x = 1
1 p x = 0
where p
1
2
R(D) =
_
H
b
(p) H
b
(D) D < p
0 o.w.

R(D) is achieved by the following p


X,

X
.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source

A =

A =

d(x, x) = (x x)
2

X N(0,
2
)

For any f
X,

X
satisfying the conditions (1), (2)
I (X;

X) = h(X) h(X[

X)
=
1
2
log 2e
2
h(X

X[

X)

1
2
log 2e
2
h(X

X)

1
2
log 2e
2

1
2
log 2eE[(X

X)
2
]

1
2
log 2e
2

1
2
log 2eD
=
1
2
log

2
D
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source

The equality is achieved by the following joint distribution


f
X,

X
if D
2

If D >
2
, we choose

X = 0 with probability 1, achieving
R(D) = 0.

The rate distortion function for a N(0,


2
) source with
squared-error distance is
R(D) =
_
1
2
log

2
D
0 D
2
0 D >
2
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source

R(D) = (1/2) log( σ² / D )  for D ≤ σ², equivalently
  D(R) = σ² 2^{−2R}.

Each bit of description reduces the expected distortion by a factor of 4.
Junmo Kim EE 623: Information Theory
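The two closed-form rate-distortion functions derived above are summarized in the short sketch below (mine; the parameter values are arbitrary): R(D) = (1/2) log2(σ²/D) for the Gaussian source with squared error, and R(D) = H_b(p) − H_b(D) for the Bernoulli(p) source with Hamming distortion.

```python
import numpy as np

def R_gauss(D, sigma2=1.0):
    """Rate-distortion function of a N(0, sigma2) source, squared error (bits)."""
    return np.where(D < sigma2, 0.5 * np.log2(sigma2 / np.maximum(D, 1e-300)), 0.0)

def Hb(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def R_binary(D, p=0.2):
    """Rate-distortion function of a Bernoulli(p) source, Hamming distortion (bits)."""
    return np.where(D < p, Hb(p) - Hb(np.minimum(D, p)), 0.0)

print(R_gauss(np.array([1.0, 0.25, 0.0625])))   # 0, 1, 2 bits: each extra bit quarters D
print(R_binary(np.array([0.0, 0.05, 0.2])))     # Hb(0.2)=0.722, 0.722-Hb(0.05), 0
```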
Convexity of Rate Distortion Function
Lemma
As a function of D, R(D) is non-increasing and convex.
Proof.
Let p

1
( x[x) achieves R(D
1
)
p

2
( x[x) achieves R(D
2
).
Now p

1
+ p

2
satises the condition E[d(X,

X)] D
1
+ D
2
.
Thus,
R(D
1
+ D
2
) I
p(x)(p

1
+ p

2
)
(X;

X)
I
p(x)p

1
(X;

X) + I
p(x)p

2
(X;

X)
R(D
1
) + R(D
2
)
Junmo Kim EE 623: Information Theory
Lecture 21
Junmo Kim EE 623: Information Theory
Rate Distortion Function

Given distortion function d : A



X .

Given source distribution p

X
().

Rate distortion function for an i.i.d. source X with p

X
and
d(x, x) is dened as
R(D) = min
p
X,

X
(x, x):

x
p
X,

X
(x, x)=p

X
(x),E
p
X,

X
[d(X,

X)]D
I (X;

X)
= min
p

X|X
:E
p
X
p

X|X
[d(X,

X)]D
I
p

X
p

X|X
(X;

X)

The two conditions for p


X,

X
1.

x
p
X,

X
(x, x) = p

X
(x)
2. E
p
X,

X
[d(X,

X)] D
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source

A =

X = 0, 1

d(x, x) = Hamming distance = I x ,= x


p

X
(x) =
_
p x = 1
1 p x = 0
where p
1
2

D = 0 R(D) = H(p) (proof is in the next pages).

D p R(D) = 0.
R(D) = min
p
X,

X
(x, x):

x
p
X,

X
(x, x)=p

X
(x),E
p
X,

X
[d(X,

X)]D
I (X;

X)
If we choose p
X,

X
(x, x) such that

X = 0, p
X,

X
(x, x) satises
the two conditions and I (X;

X) = 0
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
If D < p
I (X;

X) = H(X) H(X[

X)
= H
b
(p) H(X[

X)
= H
b
(p) H(X

X[

X)
H
b
(p) H(X

X)
= H
b
(p) H
b
(Pr (X ,=

X))
H
b
(p) H
b
(D)
as p
X,

X
satises E[d(X,

X)] D and D
1
2
, we have
Pr (X ,=

X) = E[d(X,

X)] D
1
2
.
Note that equality holds i 1) the error X

X and the estimate

X
are independent and 2) Pr (X ,=

X) = D.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
Claim: X Ber (p), p
1
2
R(D) =
_
H
b
(p) H
b
(D) D < p
0 o.w.

For D p, R(D) = 0 by

X = 0.

For D < p, I (X;



X) = H
b
(p) H
b
(D) i 1) the error X

X
and the estimate

X are independent and 2) Pr (X ,=

X) = D.

As X =

X (X

X), 1) & 2) determine p
X[

X
.

We will compute p

X
so that

x
p
X,

X
(x, x) = p

X
(x).

R(D) is achieved by the following p


X,

X
.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
- A = Â = ℝ
- d(x, x̂) = (x − x̂)²
- X ~ N(0, σ²)
- For any f_{X,X̂} satisfying the conditions (1), (2):

I(X; X̂) = h(X) − h(X | X̂)
         = (1/2) log 2πeσ² − h(X − X̂ | X̂)
         ≥ (1/2) log 2πeσ² − h(X − X̂)
         ≥ (1/2) log 2πeσ² − (1/2) log 2πe E[(X − X̂)²]
         ≥ (1/2) log 2πeσ² − (1/2) log 2πe D
         = (1/2) log(σ²/D)
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
- The equality is achieved by the following joint distribution f*_{X,X̂} if D ≤ σ².
- If D > σ², we choose X̂ = 0 with probability 1, achieving R(D) = 0.
- The rate distortion function for a N(0, σ²) source with squared-error distortion is
  R(D) = (1/2) log(σ²/D) for 0 ≤ D ≤ σ², and R(D) = 0 for D > σ².
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
- Inverting R(D) = (1/2) log(σ²/D) gives the distortion-rate function D(R) = σ² 2^{−2R}.
- Each bit of description reduces the expected distortion by a factor of 4.
Junmo Kim EE 623: Information Theory
Convexity of Rate Distortion Function
Lemma
As a function of D, R(D) is non-increasing and convex.
Proof.
Let p*_1(x̂|x) achieve R(D_1) and p*_2(x̂|x) achieve R(D_2).
For λ ∈ [0, 1], the mixture λ p*_1 + (1 − λ) p*_2 satisfies the condition E[d(X, X̂)] ≤ λ D_1 + (1 − λ) D_2.
Thus, using the convexity of mutual information in the conditional distribution p(x̂|x) for fixed p(x),

R(λ D_1 + (1 − λ) D_2) ≤ I_{p(x)(λ p*_1 + (1−λ) p*_2)}(X; X̂)
                       ≤ λ I_{p(x) p*_1}(X; X̂) + (1 − λ) I_{p(x) p*_2}(X; X̂)
                       = λ R(D_1) + (1 − λ) R(D_2)

(R(D) is non-increasing because the constraint set only grows as D increases.)
Junmo Kim EE 623: Information Theory
Converse
- Suppose f_n, g_n give rise to distortion E[d(X^n, g_n(f_n(X^n)))] ≤ D and they are of rate R. We will show that R ≥ R(D).

Proof: X^n → W = f_n(X^n) → X̂^n = g_n(W)

nR ≥ H(f_n(X^n))
   ≥ H(f_n(X^n)) − H(f_n(X^n) | X^n)
   = I(X^n; f_n(X^n))
   ≥ I(X^n; X̂^n)
   = H(X^n) − H(X^n | X̂^n)
   = Σ_{k=1}^n H(X_k) − Σ_{k=1}^n H(X_k | X^{k−1}, X̂^n)
   ≥ Σ_{k=1}^n H(X_k) − Σ_{k=1}^n H(X_k | X̂_k) = Σ_{k=1}^n I(X_k; X̂_k)
Junmo Kim EE 623: Information Theory
Converse
- R(D) is monotonically non-increasing and convex.

nR ≥ Σ_{k=1}^n I(X_k; X̂_k)
   ≥ Σ_{k=1}^n R(E[d(X_k, X̂_k)])
   = n · (1/n) Σ_{k=1}^n R(E[d(X_k, X̂_k)])
   ≥ n R( (1/n) Σ_{k=1}^n E[d(X_k, X̂_k)] )     (convexity)
   = n R(E[d(X^n, X̂^n)])
   ≥ n R(D)     (R is non-increasing and E[d(X^n, X̂^n)] ≤ D)
Junmo Kim EE 623: Information Theory
Direct Part (Achievability)
Definition
(R, D) is achievable if for any ε > 0 and any δ > 0 there exists n_0 such that for all n > n_0 there exist f_n, g_n with
Pr(d((X_1, . . . , X_n), g_n(f_n(X_1, . . . , X_n))) < D + δ) > 1 − ε

- Section 10.5: If R > R(D) and d(x, x̂) < ∞, then for any ε > 0 there exists n_0 such that for all n > n_0 there exist f_n, g_n with E[d(X^n, g_n(f_n(X^n)))] < D + ε.
- Section 10.6 (stronger result): If R > R(D) and d(x, x̂) < ∞, then (R, D) is achievable.
Junmo Kim EE 623: Information Theory
Direct Part (Sec. 10.6)
- Assume d(x, x̂) < d_max < ∞.
- Fix some distribution p_{X,X̂} satisfying E_{p_{X,X̂}}[d(X, X̂)] ≤ D.
- Let R > I_{p_{X,X̂}}(X; X̂) + ε_1.
- Compute the marginal p_X̂(x̂) and generate an IID codebook: the codewords are independent, each with IID components drawn from p_X̂(x̂).
Junmo Kim EE 623: Information Theory
Direct Part: Strong Typicality
Definition
A sequence x_1, . . . , x_n is strongly typical with respect to the distribution p_X(x) (denoted by x^n ∈ A_ε^(n)) if
1) |(1/n) Σ_k I{x_k = x} − p_X(x)| < ε / |A| for all x ∈ A with p_X(x) > 0, and
2) Σ_k I{x_k = x} = 0 for all x ∈ A with p_X(x) = 0.

Definition
A pair of sequences x_1, . . . , x_n and y_1, . . . , y_n is strongly jointly typical with respect to the distribution p_{X,Y}(x, y) if
1) |(1/n) Σ_k I{x_k = x and y_k = y} − p_{X,Y}(x, y)| < ε / (|A||B|) for all (x, y) ∈ A × B with p_{X,Y}(x, y) > 0, and
2) Σ_k I{x_k = x and y_k = y} = 0 for all (x, y) ∈ A × B with p_{X,Y}(x, y) = 0,
where A and B denote the alphabets of X and Y.
Junmo Kim EE 623: Information Theory
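The definitions above translate directly into code. The following sketch (ours; names are illustrative) checks the two conditions of strong typicality for a sequence over a finite alphabet.

```python
from collections import Counter

def is_strongly_typical(seq, pmf, eps):
    """Check the two conditions of strong typicality for a sequence over a finite alphabet.

    pmf is a dict {symbol: probability}; eps is the epsilon in the definition.
    """
    n = len(seq)
    counts = Counter(seq)
    alphabet_size = len(pmf)
    for a, p in pmf.items():
        freq = counts.get(a, 0) / n
        if p > 0 and abs(freq - p) >= eps / alphabet_size:
            return False          # condition 1) violated
        if p == 0 and counts.get(a, 0) > 0:
            return False          # condition 2) violated
    # symbols outside the support of pmf also violate condition 2)
    if any(a not in pmf for a in counts):
        return False
    return True

if __name__ == "__main__":
    import random
    random.seed(0)
    pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
    seq = random.choices(list(pmf), weights=pmf.values(), k=10000)
    print(is_strongly_typical(seq, pmf, eps=0.05))   # True with high probability
```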
Direct Part: Strong Typicality
- Strong joint typicality implies strong typicality of each component sequence: if |(1/n) Σ_k I{x_k = x and y_k = y} − p_{X,Y}(x, y)| < ε / (|A||B|) for all (x, y), then |(1/n) Σ_k I{x_k = x} − p_X(x)| < ε / |A| for all x ∈ A, since

|(1/n) Σ_k I{x_k = x} − p_X(x)|
  = |Σ_y [ (1/n) Σ_k I{x_k = x and y_k = y} − p_{X,Y}(x, y) ]|
  ≤ Σ_y |(1/n) Σ_k I{x_k = x and y_k = y} − p_{X,Y}(x, y)|
  ≤ |B| · ε / (|A||B|)
  = ε / |A|
Junmo Kim EE 623: Information Theory
Direct Part: Strong Typicality
Lemma
Let X_i be IID according to p_X(x). Then Pr((X_1, . . . , X_n) ∈ A_ε^(n)) → 1 as n → ∞.

Lemma
Suppose x^n ∈ A_ε^(n)(p_X). If X̂_i are IID according to p_X̂(x̂), then
Pr((x^n, X̂^n) ∈ A_ε^(n)(p_{X,X̂})) ≥ 2^{−n(I_{p_{X,X̂}}(X; X̂) + ε_1)}
See problem 10.16 for proof.
Junmo Kim EE 623: Information Theory
Direct Part
- Encoding: Given X^n, index it by w if there exists a w s.t. (X^n, X̂^n(w)) ∈ A_ε^(n), the strongly jointly typical set. If there is more than one such w, send the first in lexicographic order. If there is no codeword X̂^n(w) which is jointly typical with X^n, set W = 1.
- Decoding: Let the reproduced sequence be X̂^n(w).
Junmo Kim EE 623: Information Theory
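A small simulation can illustrate this random-coding scheme for the Bernoulli source. The sketch below (our own; it uses minimum-Hamming-distortion encoding instead of the joint-typicality encoder above, which can only lower the distortion) draws an IID codebook from p_X̂ and measures the average distortion; for rates above R(D) it comes out close to D, approaching D as the block length grows.

```python
import numpy as np

def simulate_rd_code(p=0.3, D=0.1, n=24, R=0.625, trials=200, seed=0):
    """Monte Carlo sketch of random rate-distortion coding for a Bernoulli(p) source.

    Codewords are IID Ber(r) with r = (p - D)/(1 - 2D); encoding picks the
    codeword with minimum Hamming distortion (instead of joint typicality).
    """
    rng = np.random.default_rng(seed)
    r = (p - D) / (1 - 2 * D)
    num_codewords = int(2 ** (n * R))
    codebook = (rng.random((num_codewords, n)) < r).astype(np.uint8)
    total = 0.0
    for _ in range(trials):
        x = (rng.random(n) < p).astype(np.uint8)
        distortions = np.count_nonzero(codebook != x, axis=1)   # Hamming distances
        total += distortions.min() / n
    return total / trials

if __name__ == "__main__":
    avg = simulate_rd_code()
    print(f"average distortion ~ {avg:.3f} (target D = 0.1, rate 0.625 > R(D) ~ 0.412)")
```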
Direct Part: Distortion
Lemma
Suppose (x^n, x̂^n) ∈ A_ε^(n)(p_{X,X̂}). Then d(x^n, x̂^n) ≤ E_{p_{X,X̂}}[d(X, X̂)] + δ, where δ → 0 as ε → 0.

Writing N(x, x̂) for the number of indices k with (x_k, x̂_k) = (x, x̂),

|d(x^n, x̂^n) − E_{p_{X,X̂}}[d(X, X̂)]|
  = |(1/n) Σ_{x∈A} Σ_{x̂∈Â} N(x, x̂) d(x, x̂) − Σ_{x∈A} Σ_{x̂∈Â} p_{X,X̂}(x, x̂) d(x, x̂)|
  = |Σ_{x∈A} Σ_{x̂∈Â} (N(x, x̂)/n − p_{X,X̂}(x, x̂)) d(x, x̂)|
  ≤ Σ_{x∈A} Σ_{x̂∈Â} |N(x, x̂)/n − p_{X,X̂}(x, x̂)| d(x, x̂)
  ≤ Σ_{x∈A} Σ_{x̂∈Â} (ε / (|A||Â|)) d_max
  = ε d_max = δ
Junmo Kim EE 623: Information Theory
Direct Part: Error Probability
- An error occurs if X^n is not strongly typical or if there is no codeword X̂^n(w) which is jointly typical with X^n.

Pr[error] ≤ ε/2 + Σ_{x^n ∈ A_ε^(n)} p_{X^n}(x^n) [1 − Pr((x^n, X̂^n) ∈ A_ε^(n))]^{2^{nR}}

As Pr((x^n, X̂^n) ∈ A_ε^(n)) ≥ 2^{−n(I(X;X̂) + ε_1)} and (1 − x)^n ≤ e^{−nx}, we have

[1 − Pr((x^n, X̂^n) ∈ A_ε^(n))]^{2^{nR}} ≤ [1 − 2^{−n(I(X;X̂) + ε_1)}]^{2^{nR}} ≤ e^{−2^{nR} 2^{−n(I(X;X̂) + ε_1)}}

As R > I(X; X̂) + ε_1, the bound goes to 0 as n → ∞. Therefore,

Pr(d((X_1, . . . , X_n), g_n(f_n(X_1, . . . , X_n))) < D + δ) > 1 − ε
Junmo Kim EE 623: Information Theory
Lecture 22
Junmo Kim EE 623: Information Theory
Lecture 23
Junmo Kim EE 623: Information Theory
Announcement
- Final exam from 10:00 am to 2:00 pm on Thursday Dec. 17 in room 201 and 202.
- Student id 20070000 – 20093500 @ room 201
- Student id 20093501 – @ room 202
Junmo Kim EE 623: Information Theory
Exam Topics:
- Basics on entropy, description length, joint entropy, and entropy rates.
- Finding capacity of a DMC. Channel coding theorem. Proving a direct part and a converse.
- Channel capacity with input constraint (e.g. Gaussian channel).
- Water-filling.
- Sanov's theorem. Conditional Sanov's theorem.
- Rate distortion function.
Junmo Kim EE 623: Information Theory
Convex and Concave functions
Theorem
If f is twice differentiable, f is convex iff f''(x) ≥ 0 for all x.
- If f''(x) > 0 for all x, f is strictly convex.
- If f''(x) < 0 for all x, f is strictly concave.

Example
- f(x) = ln x for x > 0
- f''(x) = −1/x² < 0 : strictly concave
Junmo Kim EE 623: Information Theory
Jensen's Inequality
Theorem
If f is concave, then for any random variable X,
f(E[X]) ≥ E[f(X)]
If f is strictly concave,
f(E[X]) = E[f(X)] ⇒ X is deterministic.
Junmo Kim EE 623: Information Theory
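A quick numerical illustration of Jensen's inequality for the strictly concave function ln (a sketch of ours, not from the notes):

```python
import math
import random

random.seed(1)
# X uniform on {1, 2, ..., 10}
samples = [random.randint(1, 10) for _ in range(100000)]
lhs = math.log(sum(samples) / len(samples))             # ln(E[X]) (empirical)
rhs = sum(math.log(x) for x in samples) / len(samples)  # E[ln X] (empirical)
print(f"ln(E[X]) = {lhs:.4f} >= E[ln X] = {rhs:.4f}")    # strict, since X is not deterministic
```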
Chain Rule for Entropy
H(X_1, X_2, . . . , X_n) = H(X_1) + H(X_2|X_1) + · · · + H(X_n|X_1, . . . , X_{n−1})
In shorthand notation,
H(X_1^n) = Σ_{i=1}^n H(X_i | X_1^{i−1})
- X_i^j = (X_i, X_{i+1}, . . . , X_j).
- We often omit the subscript 1, e.g. X_1^j can be written as X^j.
H(X^n) = Σ_{i=1}^n H(X_i | X^{i−1})
Junmo Kim EE 623: Information Theory
Entropy Rate
Definition
The entropy rate of a stochastic process {X_i} is defined by
H(A) = lim_{n→∞} (1/n) H(X_1, . . . , X_n)
when the limit exists.
- Notations: H(A) (book) or H({X_k}) (stochastic process as an argument)
- When does the limit exist?
- We will prove that the limit exists whenever {X_k} is stationary.
Junmo Kim EE 623: Information Theory
Channel Coding Theorem
- R = (log M)/n : rate in bits / channel use.
- W ~ Unif{1, . . . , 2^{nR}}
- I_{P_X(x)W(y|x)}(X; Y) = I(P_X; W)
- C = max_{P_X} I(P_X; W)

Definition
R is achievable if ∀ε > 0 there exists n_0 s.t. for all n ≥ n_0 there exist an encoder of rate R & block length n and a decoder with maximal probability of error λ^(n) < ε.

Theorem
If R < C then R is achievable.

Theorem
Converse: If R is achievable, then R ≤ C.
Junmo Kim EE 623: Information Theory
DMC and Mutual Information
Lemma
Let X take value in A^n according to some law P_X(x) and let Y be distributed according to p_{Y|X} = ∏_{k=1}^n W(y_k|x_k) for some W(·|·). Then I(X; Y) ≤ nC, where C = max_{P_X} I(P_X; W).

Proof:
I(X; Y) = H(Y) − H(Y|X)
        = H(Y) − Σ H(Y_i | X, Y^{i−1})
        = H(Y) − Σ H(Y_i | X_i)   (memoryless)
        ≤ Σ (H(Y_i) − H(Y_i | X_i))
        = Σ I(X_i; Y_i) ≤ nC
Junmo Kim EE 623: Information Theory
Proof of Converse
Note that we have the following Markov chain:
W → X = f(W) → Y^n → Ŵ (the decoder output)
H(W) = nR   (since H(W) = log|𝒲| and R = log|𝒲| / n)

nR = H(W) = H(W|Ŵ) + I(W; Ŵ)
   ≤ H(W|Ŵ) + I(f(W); Ŵ)
   ≤ H(W|Ŵ) + I(X; Y)
   ≤ H(W|Ŵ) + nC
   ≤ Pr(W ≠ Ŵ) log|𝒲| + H_b(Pr(W ≠ Ŵ)) + nC   (Fano's inequality)
   = P_e^(n) nR + H_b(P_e^(n)) + nC
Junmo Kim EE 623: Information Theory
Proof Sketch
1. Fix some p_X.
2. Fix some ε > 0 and n.
3. Generate a random codebook 𝒞, IID ~ p_X.
4. Reveal 𝒞 to encoder and receiver.
5. Design a joint typicality decoder (based on p_X(x)W(y|x), ε, n, 𝒞).
6. Encoder: m → x(m) (according to the codebook).
7. Each codebook 𝒞 gives P_e^(n)(𝒞).
8. Analyze E[P_e^(n)(𝒞)], averaging over 𝒞.
9. Will show that if R < I(P_X; W) then E[P_e^(n)(𝒞)] → 0 as n → ∞.
10. By the random coding argument, there exists a deterministic sequence 𝒞_n s.t. P_e^(n)(𝒞_n) → 0.
11. Trick to get λ^(n) to go to zero.
Junmo Kim EE 623: Information Theory
Differential Entropy
- Def: A random variable X is said to be continuous if F(x) = Pr(X ≤ x) is continuous.
- Probability density function: f(x) = F'(x).
- Support of X : S = {x | f(x) > 0}.
- Def: The differential entropy of a random variable X with density f(·) is
  h(X) = −∫_S f(x) log f(x) dx = h(f)
Junmo Kim EE 623: Information Theory
Examples
- X ~ Unif[0, a]:
  h(f) = −∫_0^a (1/a) log(1/a) dx = log a
- X ~ N(0, σ²):
  h(f) = −∫ f(x) ln( (1/√(2πσ²)) e^{−x²/(2σ²)} ) dx
       = ln √(2πσ²) + E[X²]/(2σ²)
       = (1/2) ln 2πeσ² nats
- Attention: h(f) can be negative.
- lim_{a→0} h(f) = −∞ (for the Unif[0, a] example).
- h(X) = −∞ when X is discrete.
Junmo Kim EE 623: Information Theory
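Both closed-form examples are easy to check numerically. The sketch below (ours) compares log a and (1/2) ln 2πeσ² with the differential entropies reported by scipy.stats (which are in nats); note that the uniform value is negative when a < 1.

```python
import math
from scipy.stats import norm, uniform

a, sigma = 0.5, 2.0

h_unif_formula = math.log(a)                                          # nats
h_gauss_formula = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)   # nats

print(f"Unif[0,{a}]: formula {h_unif_formula:.4f}, scipy {float(uniform(loc=0, scale=a).entropy()):.4f}")
print(f"N(0,{sigma}**2): formula {h_gauss_formula:.4f}, scipy {float(norm(scale=sigma).entropy()):.4f}")
# with a = 0.5 the uniform differential entropy is negative (about -0.693 nats)
```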
Gaussian Channel
Y = x + Z
Z ~ N(0, N)
- If N = 0, C = ∞. (Why? 1) we can send a real number; 2) I(X; Y) = ∞.)
- If N = 1, C = ∞ (without a limit on x).
- Average power constraint: (1/n) Σ_i x_i²(w) ≤ P.
Junmo Kim EE 623: Information Theory
Gaussian Channel
We will show that the following quantity is the capacity:
max_{E[X²] ≤ P} I(X; Y)
(True for a general channel with input constraint: problem 8.4)
Junmo Kim EE 623: Information Theory
Gaussian Channel: Achievable Rate
Definition
We say that R is achievable if ∀ε > 0 there exists n_0 s.t. for all n > n_0 there exist a rate-R block length n codebook 𝒞 = {x(1), . . . , x(2^{nR})} ⊂ ℝ^n and a decoder ℝ^n → {1, . . . , 2^{nR}} s.t. the maximum probability of error < ε and
(1/n) Σ_i x_i²(m) ≤ P, ∀m ∈ {1, . . . , 2^{nR}}.
C ≜ supremum of achievable rates.
Junmo Kim EE 623: Information Theory
Direct Part
1. Generate a codebook at random
   1.1 Codewords are chosen independently.
   1.2 The components of the codewords are chosen IID from N(0, P − ε).
2. Reveal the codebook to Tx/Rx.
3. Decoder
   3.1 Joint typicality: If there is one and only one codeword X^n(w) that is jointly typical with the received vector, declare Ŵ = w. Otherwise, declare an error.
   3.2 Declare an error if the unique codeword that is typical with y violates the average power constraint.
Junmo Kim EE 623: Information Theory
Direct Part: Error Analysis
Assume W = 1.
E_0 : the event that X(1) violates the power constraint
E_i : the event that (X(i), Y) ∈ A_ε^(n)

Pr(Error | W = 1) ≤ Pr(E_0 ∪ E_1^c ∪ ⋃_{i=2}^{2^{nR}} E_i)
                  ≤ Pr(E_0) + Pr(E_1^c) + Σ_{i=2}^{2^{nR}} Pr(E_i)

- Pr(E_0) → 0   (since (1/n) Σ_i X_i²(1) → E[X²] = P − ε < P)
- Pr(E_1^c) → 0
- Pr(E_i) ≤ 2^{−n(I(X;Y) − 3ε)}, where I(X; Y) = (1/2) log(1 + P/N)
Junmo Kim EE 623: Information Theory
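For reference, the mutual information value in the last bullet is the AWGN capacity; the sketch below (ours) simply evaluates C = (1/2) log2(1 + P/N) for a few signal-to-noise ratios.

```python
import math

def gaussian_capacity(P, N):
    """Capacity of the AWGN channel in bits per channel use."""
    return 0.5 * math.log2(1 + P / N)

if __name__ == "__main__":
    for snr in [0.5, 1.0, 10.0, 100.0]:
        print(f"P/N = {snr:6.1f}: C = {gaussian_capacity(snr, 1.0):.4f} bits/use")
```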
Parallel Gaussian Channels
Consider L independent Gaussian channels in parallel.
- Y^(l) = X^(l) + Z^(l)
- Z^(l) ~ N(0, N_l)
- Input power constraint: P_1, P_2, . . . , P_L with Σ_{l=1}^L P_l ≤ P.
- C = max I(X^(1), X^(2), . . . , X^(L); Y^(1), Y^(2), . . . , Y^(L)), where the maximum is over all input distributions f_{X^(1),X^(2),...,X^(L)}(·, . . . , ·) satisfying the power constraint Σ_{l=1}^L E[(X^(l))²] ≤ P.
- This problem reduces to: maximize Σ_l (1/2) log(1 + P_l/N_l) subject to Σ_l P_l ≤ P.
Junmo Kim EE 623: Information Theory
Water-Filling for Parallel Gaussian Channels
- Optimum: P_l = (ν − N_l)^+, where x^+ = x if x > 0 and 0 if x ≤ 0.
- The water level ν is determined so that Σ_l (ν − N_l)^+ = P.
Junmo Kim EE 623: Information Theory
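The water level ν can be found numerically, e.g. by bisection on the total power (the allocated power is non-decreasing in ν). The following sketch (our own; variable names are illustrative) computes the water-filling allocation and the resulting capacity.

```python
import math

def water_filling(noise_levels, total_power, tol=1e-9):
    """Allocate power P_l = max(nu - N_l, 0) with sum P_l = total_power, via bisection on nu."""
    lo, hi = 0.0, max(noise_levels) + total_power
    while hi - lo > tol:
        nu = (lo + hi) / 2
        used = sum(max(nu - N, 0.0) for N in noise_levels)
        if used > total_power:
            hi = nu
        else:
            lo = nu
    nu = (lo + hi) / 2
    powers = [max(nu - N, 0.0) for N in noise_levels]
    capacity = sum(0.5 * math.log2(1 + P / N) for P, N in zip(powers, noise_levels))
    return nu, powers, capacity

if __name__ == "__main__":
    nu, powers, C = water_filling([1.0, 2.0, 4.0], total_power=3.0)
    print(f"water level nu = {nu:.4f}")
    print("powers:", [round(P, 4) for P in powers])
    print(f"capacity = {C:.4f} bits per parallel channel use")
```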
Sanov's Theorem (Large Deviations)
Theorem
Let E ⊆ 𝒫, the set of probability distributions on A. Suppose X_1, . . . , X_n are i.i.d. according to Q. Then
Pr(P_{X_1,...,X_n} ∈ E) ≤ (n + 1)^{|A|} 2^{−nD(P*||Q)}
where P_{X_1,...,X_n} is the empirical type and P* = arg min_{P∈E} D(P||Q).

In the exam, Pr(P_{X_1,...,X_n} ∈ E) ≈ 2^{−nD(P*||Q)} is enough. E.g.
Q = (1/2, 1/2) on {H, T}, E = {P ∈ 𝒫 : P(H) ≥ 3/4}:
Pr(P_{X_1,...,X_n} ∈ E) = Pr(# of H ≥ 75%)
Junmo Kim EE 623: Information Theory
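For the fair-coin example above, the exponent D(P*||Q) with P* = (3/4, 1/4) can be computed directly; a small sketch (ours):

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) in bits for pmfs given as lists, with the convention 0 log 0 = 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P_star = [0.75, 0.25]   # member of E = {P : P(H) >= 3/4} closest to Q in relative entropy
Q = [0.5, 0.5]
exponent = kl_divergence(P_star, Q)
print(f"D(P*||Q) = {exponent:.4f} bits")
for n in [10, 100, 1000]:
    print(f"n = {n:5d}: Pr(at least 75% heads) is roughly 2^(-n D) = {2 ** (-n * exponent):.3e}")
```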
Conditional Sanov's Theorem
- Suppose E is closed and convex.
- Let P* ∈ E achieve inf_{P∈E} D(P||Q), where Q is not in E. Then
  |Q^n(X_1 = a | P_{X_1,...,X_n} ∈ E) − P*(a)| → 0 as n → ∞.
- We can use Pr(X_1 = a | E) to compute other probabilities.
Junmo Kim EE 623: Information Theory
Rate Distortion Function
- Given a distortion function d : A × Â → ℝ₊.
- Given a source distribution p_X(·).
- The rate distortion function for an i.i.d. source X with p_X and d(x, x̂) is defined as

R(D) = min I(X; X̂), the minimum being over all p_{X,X̂}(x, x̂) with Σ_x̂ p_{X,X̂}(x, x̂) = p_X(x) and E_{p_{X,X̂}}[d(X, X̂)] ≤ D
     = min_{p_{X̂|X} : E_{p_X p_{X̂|X}}[d(X,X̂)] ≤ D} I_{p_X p_{X̂|X}}(X; X̂)

- The two conditions for p_{X,X̂}:
  1. Σ_x̂ p_{X,X̂}(x, x̂) = p_X(x)
  2. E_{p_{X,X̂}}[d(X, X̂)] ≤ D
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
- The equality is achieved by the following joint distribution f*_{X,X̂} if D ≤ σ².
- If D > σ², we choose X̂ = 0 with probability 1, achieving R(D) = 0.
- The rate distortion function for a N(0, σ²) source with squared-error distortion is
  R(D) = (1/2) log(σ²/D) for 0 ≤ D ≤ σ², and R(D) = 0 for D > σ².
Junmo Kim EE 623: Information Theory
Convexity of Rate Distortion Function
Lemma
As a function of D, R(D) is non-increasing and convex.
Proof.
Let p*_1(x̂|x) achieve R(D_1) and p*_2(x̂|x) achieve R(D_2).
For λ ∈ [0, 1], the mixture λ p*_1 + (1 − λ) p*_2 satisfies the condition E[d(X, X̂)] ≤ λ D_1 + (1 − λ) D_2.
Thus, using the convexity of mutual information in the conditional distribution p(x̂|x) for fixed p(x),

R(λ D_1 + (1 − λ) D_2) ≤ I_{p(x)(λ p*_1 + (1−λ) p*_2)}(X; X̂)
                       ≤ λ I_{p(x) p*_1}(X; X̂) + (1 − λ) I_{p(x) p*_2}(X; X̂)
                       = λ R(D_1) + (1 − λ) R(D_2)

(R(D) is non-increasing because the constraint set only grows as D increases.)
Junmo Kim EE 623: Information Theory