Lecture Note PDF
Junmo Kim
December 3, 2009
Junmo Kim EE 623: Information Theory
Lecture 1
Junmo Kim EE 623: Information Theory
Applications of Information Theory
Communication
Coding
Cryptography
Statistics, Probability
Physics
Junmo Kim EE 623: Information Theory
Communication Theory
$\mathcal{A}$ : set of objects (alphabet), e.g. $\mathcal{A} = \{H, T\}$, $\mathcal{A} = \{A, B, C\}$
$|\mathcal{A}|$ : cardinality of $\mathcal{A}$
Definition
PMF of $X$ : $p_X(x) = \Pr(X = x)$ (C & T : $p(x)$)
For $g : \mathcal{A} \to \mathbb{R}$, $E[g(X)] = \sum_x p_X(x)\,g(x)$.
$p_X(X)$ is a random variable. Why? $p_X(\cdot)$ is a function.
Junmo Kim EE 623: Information Theory
Entropy of a chance variable
$\mathcal{A}$ : finite
$X$ : chance variable with PMF $p_X(x) = \Pr(X = x)$.
Definition
The entropy of a chance variable $X$ is defined by
$$H(X) = \sum_{x\in\mathcal{A}} p_X(x)\log\frac{1}{p_X(x)} = -E[\log p_X(X)]$$
Junmo Kim EE 623: Information Theory
Entropy of a chance variable
Example
$X$ takes on $H$, $T$ each with probability $\frac{1}{2}$.
$$H(X) = \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2 = \log 2 = 1 \text{ bit } (= \ln 2 \text{ nats})$$
bits : base 2
nats : base e
Note: We define $0\log 0 = 0$, since $\lim_{x\to 0} x\log x = 0$ (e.g. $0.0001\log_{10}0.0001 = -0.0004 \approx 0$). Similarly $0\log\frac{1}{0} = 0$.
Junmo Kim EE 623: Information Theory
Entropy of a chance variable
Example
Let
$$X = \begin{cases} a & \text{with probability } 1/2,\\ b & \text{with probability } 1/4,\\ c & \text{with probability } 1/8,\\ d & \text{with probability } 1/8.\end{cases}$$
The entropy of $X$ is
$$H(X) = \tfrac{1}{2}\log 2 + \tfrac{1}{4}\log 4 + \tfrac{1}{8}\log 8 + \tfrac{1}{8}\log 8 = \tfrac{7}{4}\text{ bits}.$$
Junmo Kim EE 623: Information Theory
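The two entropy examples above are easy to verify numerically. The following short Python sketch (illustrative, not part of the original slides; the helper name `entropy` is ours) evaluates $H(X) = \sum_x p_X(x)\log_2\frac{1}{p_X(x)}$ for both PMFs.

from math import log2

def entropy(pmf):
    # H(X) = sum_x p(x) log2(1/p(x)), with the convention 0 log 0 = 0
    return sum(p * log2(1.0 / p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))                 # 1.0 bit (fair coin)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits = 7/4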
Entropy of a chance variable
With probability $\frac{1}{2}$, $X = a$ : $1 = \log 2$ question is required.
With probability $\frac{1}{4}$, $X = b$ : $2 = \log 4$ questions are required.
With probability $\frac{1}{8}$, $X = c$ : $3 = \log 8$ questions are required.
With probability $\frac{1}{8}$, $X = d$ : $3 = \log 8$ questions are required.
Junmo Kim EE 623: Information Theory
Properties of the Entropy
1. If $X$ takes some value with probability 1 then $H(X) = 0$.
2. $H(X) \ge 0$.
pf: $x\log\frac{1}{x}$ is non-negative.
3. If $H(X) = 0$, then $X$ must be deterministic.
pf: $x\log\frac{1}{x} \ge 0$ for $0 \le x \le 1$ and $x\log\frac{1}{x} = 0$ only for $x = 0$ or $1$. Each term of the entropy $\sum_{x\in\mathcal{A}} p_X(x)\log\frac{1}{p_X(x)}$ is non-negative, and positive for $0 < p_X(x) < 1$.
4. The entropy is determined by the PMF.
Thus we have $0 \le H(X) \le \log|\mathcal{A}|$.
Relative entropy: $D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$.
Claim: For all $p, q$, $D(p\|q) \ge 0$.
If $q$ is uniform,
$$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{1/|\mathcal{A}|} = \log|\mathcal{A}| - H(X) \ge 0.$$
Thus we have $H(X) \le \log|\mathcal{A}|$.
Junmo Kim EE 623: Information Theory
Convex and Concave functions
Denition
If $f''(x) > 0$ for all $x$, $f$ is strictly convex.
If $f''(x) < 0$ for all $x$, $f$ is strictly concave.
Example
$f''(x) = -\frac{1}{x^2} < 0$ : strictly concave (e.g. $f(x) = \ln x$)
Junmo Kim EE 623: Information Theory
Jensen's Inequality
Theorem
If $f$ is concave then for any random variable $X$,
$$f(E[X]) \ge E[f(X)].$$
If $f$ is strictly concave, $f(E[X]) = E[f(X)]$ iff $X$ is deterministic.
Junmo Kim EE 623: Information Theory
Jensen's Inequality
Proof.
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2}f''(\xi)(x - x_0)^2 \le f(x_0) + f'(x_0)(x - x_0)$$
Thus for a random variable $X$, we have
$$f(X) \le f(x_0) + f'(x_0)(X - x_0),\qquad E[f(X)] \le f(x_0) + f'(x_0)E[X - x_0]$$
By taking $x_0 = E[X]$, we have $E[f(X)] \le f(E[X])$.
Junmo Kim EE 623: Information Theory
Non-negativity of Relative Entropies
Theorem
For all $p, q$, $D(p\|q) \ge 0$, where equality holds iff $p = q$, i.e. $p(x) = q(x)$ for all $x \in \mathcal{A}$.
Junmo Kim EE 623: Information Theory
Non-negativity of Relative Entropies
Proof.
$$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = E_p\!\left[\log\frac{p(X)}{q(X)}\right]$$
$$-D(p\|q) = E_p\!\left[\log\frac{q(X)}{p(X)}\right] \le \log E_p\!\left[\frac{q(X)}{p(X)}\right] = \log\Big(\sum_x p(x)\frac{q(x)}{p(x)}\Big) = \log 1 = 0$$
Equality holds iff $\frac{q(X)}{p(X)}$ is deterministic, i.e. $p = q$.
Junmo Kim EE 623: Information Theory
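As a quick numerical illustration of the theorem just proved, the Python sketch below (illustrative; the function names are ours) checks that $D(p\|q) \ge 0$ with equality when $p = q$, and that for uniform $q$ it reduces to $\log|\mathcal{A}| - H(p)$.

from math import log2

def kl_divergence(p, q):
    # D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))             # > 0
print(kl_divergence(p, p))             # = 0, equality iff p = q
print(log2(3) - kl_divergence(p, q))   # = H(p), since D(p||uniform) = log|A| - H(p)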
Non-negativity of Relative Entropies
Junmo Kim EE 623: Information Theory
Joint Entropy
Let's consider a pair of chance variables $(X, Y)$. We can view $(X, Y)$ as a chance variable taking values in $\mathcal{A}\times\mathcal{Y} = \{(x, y) : x\in\mathcal{A},\, y\in\mathcal{Y}\}$ with PMF $p_{X,Y}(x, y)$. Thus $H(X, Y)$ is defined as
$$H(X, Y) = H((X, Y)) = -\sum_{x\in\mathcal{A}}\sum_{y\in\mathcal{Y}} p_{X,Y}(x, y)\log p_{X,Y}(x, y)$$
Junmo Kim EE 623: Information Theory
Conditional Entropy
The conditional entropy $H(X|Y)$ is defined as
$$H(X|Y) = \sum_y p_Y(y)\,H(X|Y = y)$$
where $H(X|Y = y) = -\sum_{x\in\mathcal{A}} p_{X|Y}(x|y)\log p_{X|Y}(x|y)$.
Example
$\mathcal{A} = \{H, T\}$, $\mathcal{Y} = \{0, 1\}$.
$\Pr[Y = 0] = \Pr[Y = 1] = \frac{1}{2}$.
$Y = 0 \Rightarrow X = H$
$Y = 1 \Rightarrow X = H$ or $T$ with probability $\frac{1}{2}$, $\frac{1}{2}$.
$H(X|Y = 0) = 0$ bit, $H(X|Y = 1) = 1$ bit
$$H(X|Y) = \tfrac{1}{2}H(X|Y = 0) + \tfrac{1}{2}H(X|Y = 1) = \tfrac{1}{2}\text{ bit}$$
$$H(X) = H_b(\tfrac{1}{4}) \approx 0.811$$
Junmo Kim EE 623: Information Theory
Conditional Entropy
$$H(X|Y) = \sum_y p_Y(y)H(X|Y = y) = -\sum_y p_Y(y)\sum_{x\in\mathcal{A}} p_{X|Y}(x|y)\log p_{X|Y}(x|y) = -\sum_{x\in\mathcal{A}}\sum_y p_{X,Y}(x, y)\log p_{X|Y}(x|y) = -E[\log p_{X|Y}(X|Y)]$$
Junmo Kim EE 623: Information Theory
Chain Rule
Theorem
$$H(X, Y) = H(X) + H(Y|X)$$
Proof.
$$-\log p_{X,Y}(X, Y) = -\log p_X(X) - \log p_{Y|X}(Y|X)$$
Take expectations of both sides (w.r.t. the joint distribution $p_{X,Y}$):
$$-E[\log p_{X,Y}(X, Y)] = -E[\log p_X(X)] - E[\log p_{Y|X}(Y|X)]$$
This proves $H(X, Y) = H(X) + H(Y|X)$.
Junmo Kim EE 623: Information Theory
Conditional Entropy
If $X$ and $Y$ are independent, $H(X|Y) = H(X)$.
Proof.
It comes from $p(x|Y = y) = p(x)$.
Junmo Kim EE 623: Information Theory
Mutual Information
Definition
Consider two random variables $X$ and $Y$ with joint PMF $p_{X,Y}(x, y)$. The mutual information $I(X; Y)$ is defined as
$$I(X; Y) = H(X) - H(X|Y)$$
Claim: $I(X; Y) = I(Y; X)$
Proof.
$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \;\Rightarrow\; H(X) - H(X|Y) = H(Y) - H(Y|X)$$
Junmo Kim EE 623: Information Theory
Mutual Information
Claim:
Let $X, Y$ have joint PMF $p_{X,Y}(x, y)$. Let $p_X(x)$ and $p_Y(y)$ be the marginals ($p_X(x) = \sum_y p_{X,Y}(x, y)$). Then
$$I(X; Y) = D(p_{X,Y}\,\|\,p_X p_Y)$$
Proof.
$$D(p_{X,Y}\,\|\,p_X p_Y) = \sum_{x,y} p_{X,Y}(x, y)\log\frac{p_{X,Y}(x, y)}{p_X(x)p_Y(y)} = \sum_{x,y} p_{X,Y}(x, y)\log\frac{p_{Y|X}(y|x)}{p_Y(y)} = -H(Y|X) + H(Y)$$
Junmo Kim EE 623: Information Theory
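The identities $H(X,Y) = H(Y) + H(X|Y)$, $I(X;Y) = H(X) - H(X|Y)$ and $I(X;Y) = D(p_{X,Y}\|p_X p_Y)$ can be checked on the small coin example from the conditional-entropy slide. A Python sketch (illustrative; it just hard-codes that joint PMF):

from math import log2

p_xy = {('H', 0): 0.5, ('H', 1): 0.25, ('T', 1): 0.25}   # joint PMF
p_x = {'H': 0.75, 'T': 0.25}
p_y = {0: 0.5, 1: 0.5}

H = lambda dist: sum(p * log2(1 / p) for p in dist.values() if p > 0)
H_xy, H_x, H_y = H(p_xy), H(p_x), H(p_y)
H_x_given_y = H_xy - H_y                 # chain rule: H(X,Y) = H(Y) + H(X|Y)
I_xy = H_x - H_x_given_y                 # mutual information
D = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(H_x, H_x_given_y, I_xy, D)         # H(X)~0.811, H(X|Y)=0.5, I = D ~ 0.311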
Conditional Mutual Information
Definition
Let's consider three random variables $X, Y, Z$ with joint PMF $p_{X,Y,Z}(x, y, z)$. The conditional mutual information of $X$ and $Y$ given $Z$ is defined by
$$I(X; Y|Z) = H(X|Z) - H(X|Y, Z) = E_{p(x,y,z)}\!\left[\log\frac{p_{X,Y|Z}(X, Y|Z)}{p_{X|Z}(X|Z)\,p_{Y|Z}(Y|Z)}\right]$$
Junmo Kim EE 623: Information Theory
Conditional Mutual Information
Claim:
$$I(X; Y|Z) = \sum_z I(X; Y|Z = z)\,p_Z(z)$$
Proof.
$$\sum_z I(X; Y|Z = z)\,p_Z(z) = \sum_z p_Z(z)\sum_{x,y} p_{X,Y|Z}(x, y|z)\log\frac{p_{X,Y|Z}(x, y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)} = \sum_{x,y,z} p_{X,Y,Z}(x, y, z)\log\frac{p_{X,Y|Z}(x, y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)} = E_{p(x,y,z)}\!\left[\log\frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right]$$
Junmo Kim EE 623: Information Theory
Non-negativity of Mutual Information
Claim:
$I(X; Y) \ge 0$, with equality iff $X$ and $Y$ are independent.
Proof.
It comes from $I(X; Y) = D(p_{X,Y}\,\|\,p_X p_Y) \ge 0$. Equality holds iff $p_{X,Y} = p_X p_Y$, i.e. $X$ and $Y$ are independent.
Chain rule, with the notation $X_i^j = (X_i, X_{i+1}, \ldots, X_j)$:
$$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i\,|\,X_1^{i-1}) = \sum_{i=1}^n H(X_i\,|\,X^{i-1})$$
Junmo Kim EE 623: Information Theory
Lecture 3
Junmo Kim EE 623: Information Theory
Entropy of Binary Random Variable
Junmo Kim EE 623: Information Theory
Entropy is Concave
Let $p_1, \ldots, p_{|\mathcal{A}|}$ be probability masses satisfying $0 \le p_i \le 1$, $\sum_i p_i = 1$.
Let $q_1, \ldots, q_{|\mathcal{A}|}$ be probability masses satisfying $0 \le q_i \le 1$, $\sum_i q_i = 1$.
Let $r_i = \lambda p_i + (1-\lambda)q_i$ for some $0 \le \lambda \le 1$. Then $\sum_i r_i = 1$ and $0 \le r_i \le 1$, so $r_1, \ldots, r_{|\mathcal{A}|}$ can be another set of probability masses.
The claim is
$$H(r_1, \ldots, r_{|\mathcal{A}|}) \ge \lambda H(p_1, \ldots, p_{|\mathcal{A}|}) + (1-\lambda)H(q_1, \ldots, q_{|\mathcal{A}|})$$
Junmo Kim EE 623: Information Theory
Probability Simplex
$p = (p_1, \ldots, p_{|\mathcal{A}|})$ lies in $|\mathcal{A}|$-dimensional space. Since $\sum_i p_i = 1$, the space of PMFs $\{(p_1, \ldots, p_{|\mathcal{A}|}) : 0 \le p_i \le 1,\ \sum_i p_i = 1\}$ is a subset of $\mathbb{R}^{|\mathcal{A}|}$ and is called the probability simplex.
Junmo Kim EE 623: Information Theory
Entropy is Concave
Consider two PMFs and their convex combination
$$p^{(1)} = (p^{(1)}_1, \ldots, p^{(1)}_{|\mathcal{A}|}),\qquad p^{(2)} = (p^{(2)}_1, \ldots, p^{(2)}_{|\mathcal{A}|})$$
$$\lambda p^{(1)} + \bar\lambda p^{(2)} = (\lambda p^{(1)}_1 + \bar\lambda p^{(2)}_1,\ \lambda p^{(1)}_2 + \bar\lambda p^{(2)}_2,\ \ldots,\ \lambda p^{(1)}_{|\mathcal{A}|} + \bar\lambda p^{(2)}_{|\mathcal{A}|}),\qquad \bar\lambda = 1-\lambda$$
Theorem
$$H(\lambda p^{(1)} + \bar\lambda p^{(2)}) \ge \lambda H(p^{(1)}) + \bar\lambda H(p^{(2)})$$
Junmo Kim EE 623: Information Theory
Entropy is Concave
Proof.
$X_1$ is distributed according to $p^{(1)}$; $X_2$ is distributed according to $p^{(2)}$. Let $Z\in\{1,2\}$ with $\Pr(Z=1)=\lambda$, $\Pr(Z=2)=\bar\lambda$, independent of $(X_1, X_2)$, and consider $X_Z$.
$$H(X_Z|Z) = \Pr(Z = 1)\underbrace{H(X_Z|Z = 1)}_{H(p^{(1)})} + \Pr(Z = 2)\underbrace{H(X_Z|Z = 2)}_{H(p^{(2)})} = \lambda H(p^{(1)}) + \bar\lambda H(p^{(2)})$$
$$\Pr(X_Z = x) = \lambda p^{(1)}(x) + \bar\lambda p^{(2)}(x)$$
Thus $H(X_Z) = H(\lambda p^{(1)} + \bar\lambda p^{(2)})$. The result follows from $H(X_Z) \ge H(X_Z|Z)$.
Junmo Kim EE 623: Information Theory
The Horse Race
Given $p = (p_1, \ldots, p_m)$, $o = (o_1, \ldots, o_m)$,
choose $b = (b_1, \ldots, b_m)$ : $b_i$ is the ratio of my wealth bet on horse $i$. We use $b_i$ and $b(i)$ interchangeably.
$\sum_i b_i = 1$ (bet all my money), $0 \le b_i \le 1$.
Junmo Kim EE 623: Information Theory
Question
The expected one-race wealth is $\sum_i p_i b_i o_i = \sum_i b_i p_i o_i$; to maximize it, put all your money on the horse $i^*$ of highest $p_i o_i$.
After $N$ races the wealth is $S_N = \prod_{k=1}^N b(X_k)\,o(X_k)$, and
$$\frac{1}{N}\log S_N = \frac{1}{N}\sum_{k=1}^N \log(b(X_k)o(X_k)) \to E[\log b(X)o(X)] \text{ as } N\to\infty$$
At the end of $N$ races ($N \gg 1$), $S_N \approx 2^{N\,E[\log b(X)o(X)]}$ in the sense that $\big|\frac{1}{N}\log S_N - E[\log b(X)o(X)]\big| \to 0$ as $N\to\infty$.
Junmo Kim EE 623: Information Theory
$$W(b, p) = \sum_i p_i\log o_i b_i = \underbrace{\sum_i p_i\log o_i}_{\text{no control}} + \sum_i p_i\log b_i$$
Choose $b_i$ to maximize $\sum_i p_i\log b_i$. No attention to $o_i$.
Junmo Kim EE 623: Information Theory
Maximization of the Doubling Rate
We would like to maximize $\sum_i p_i\log b_i$ subject to $0 \le b_i \le 1$, $\sum_i b_i = 1$. Writing the functional with a Lagrange multiplier, we have
$$J(b) = \sum_i p_i\log b_i + \lambda\sum_i b_i$$
$$\frac{\partial J(b)}{\partial b_i} = \frac{p_i}{b_i} + \lambda = 0 \quad\text{for } i = 1, \ldots, m \;\Rightarrow\; p_i = -\lambda b_i$$
Since $\sum_i b_i = 1$, $\sum_i p_i = 1$, and $\sum_i p_i = -\lambda\sum_i b_i$, we have $\lambda = -1$.
Thus $p_i = b_i$ is a stationary point of the function $J(b)$.
We now verify that this proportional gambling is optimal.
Junmo Kim EE 623: Information Theory
Maximization of the Doubling Rate
Theorem
The doubling rate $W(b, p) = E[\log b(X)o(X)]$ is maximized by choosing $b = p$.
Proof.
Let $b$ be arbitrary. Compare $b^* = p$ with $b$:
$$\sum_i p_i\log b^*_i - \sum_i p_i\log b_i = \sum_i p_i\log p_i - \sum_i p_i\log b_i = \sum_i p_i\log\frac{p_i}{b_i} = D(p\|b) \ge 0$$
where equality holds iff $p_i = b_i$. This proves that proportional betting is optimal.
Note: This strategy assumes I bet all my money.
Junmo Kim EE 623: Information Theory
Example: Uniform Fair Odds
If one uses $b^* = p$,
$$W(b^*, p) = \sum_i p_i\log(p_i o_i)$$
Example: $o_i = m$ (uniform fair odds). If $b_i = \frac{1}{m}$, it is guaranteed that we get the money back, as $b(X)o(X) = 1$.
$$W(b^*, p) = \sum_i p_i\log p_i + \sum_i p_i\log m = \log m - H(p)$$
$$S_N \approx 2^{N(\log m - H(p))}$$
Entropy is a measure of uncertainty. The lower the entropy, the more money you can make betting on $X$.
Junmo Kim EE 623: Information Theory
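A short simulation can illustrate this result: with uniform fair odds and proportional betting $b = p$, the empirical $\frac{1}{N}\log_2 S_N$ should approach $\log m - H(p)$. Python sketch (illustrative; the specific PMF is just an example):

import random
from math import log2

p = [0.5, 0.25, 0.125, 0.125]           # win probabilities
m = len(p)
odds = [m] * m                           # uniform fair odds
b = p                                    # proportional betting b* = p
H = sum(pi * log2(1 / pi) for pi in p)   # 1.75 bits

N = 100_000
log_wealth = 0.0
for _ in range(N):
    winner = random.choices(range(m), weights=p)[0]
    log_wealth += log2(b[winner] * odds[winner])

print(log_wealth / N)    # empirical doubling rate
print(log2(m) - H)       # theoretical W(b*, p) = log m - H(p) = 0.25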
Fair, Superfair, Subfair Cases
So far we assumed that you put in all the money. What if you don't have to put in all your money?
Fair odds: $\sum_i\frac{1}{o_i} = 1$
If $b_i = \frac{1}{o_i}$, the outcome is deterministically 1.
Guy B can get the same outcome by distributing the cash over the horses, $b_i = \frac{1}{o_i}$.
Superfair case: $\sum_i\frac{1}{o_i} < 1$
In this case the odds are even better than fair odds, so we will put in all the money.
We will choose $b_i = c_i + d_i$ so that $\sum_i b_i = 1$.
Choose $c_i = \frac{1}{o_i}$, to make sure I get the money back ($o_i c_i = 1$).
Choose $d_i$ to be any non-negative numbers so that $\sum_i d_i = 1 - \sum_i\frac{1}{o_i} > 0$.
Subfair case: $\sum_i\frac{1}{o_i} > 1$
In this case, don't put all your money into the race.
Side information:
$X$: winning horse, $Y$: side information, with joint PMF $p_{X,Y}(x, y)$.
$$W = \sum_y p_Y(y)\underbrace{\sum_x p_{X|Y}(x|y)\log(b(x|y)o(x))}_{\text{maximize separately}}$$
$b^*(x|y) = p_{X|Y}(x|y)$ is optimal.
Junmo Kim EE 623: Information Theory
Side Information
Difference in $W$:
$$\Delta W = W(b^*(x|y), p_{X,Y}) - W(b^*, p_X) = \sum_{x,y} p_{X,Y}(x, y)\log(b(x|y)o(x)) - \sum_{x,y} p_{X,Y}(x, y)\log(b(x)o(x))$$
$$= \sum_{x,y} p_{X,Y}(x, y)\log\frac{b(x|y)}{b(x)} = \sum_{x,y} p_{X,Y}(x, y)\log\frac{p(x|y)}{p(x)} = I(X; Y)$$
Junmo Kim EE 623: Information Theory
Doubling Rate and Relative Entropy
Let $r_i = \frac{1}{o_i}$ (fair odds: $\sum_i\frac{1}{o_i} = 1$). Then $\sum_i r_i = 1$, so $r$ can be interpreted as a PMF.
$$\sum_x p_X(x)\log(o(x)b(x)) = \sum_x p_X(x)\log\Big(\frac{b(x)}{p_X(x)}\cdot\frac{p_X(x)}{r(x)}\Big) = D(p\|r) - D(p\|b)$$
$b^*_i = p_i$, i.e. $b^*(x) = p(x)$ (assuming we bet all the money); $b^*(x|y) = p_{X|Y}(x|y)$ and $\Delta W = I(X; Y)$.
Fair odds: $\sum_i\frac{1}{o_i} = 1$; super-fair odds: $\sum_i\frac{1}{o_i} < 1$; sub-fair odds: $\sum_i\frac{1}{o_i} > 1$.
Junmo Kim EE 623: Information Theory
Review
$b^*_i = p_i$ with uniform fair odds $o_i = |\mathcal{A}|$:
$$S_n = \prod_{i=1}^n p_X(X_i)\,|\mathcal{A}|^n$$
Log wealth is
$$\frac{1}{n}\log S_n = \underbrace{\frac{1}{n}\sum_{i=1}^n\log p_X(X_i)}_{\text{random variable}} + \log|\mathcal{A}| \to E[\log p_X(X)] + \log|\mathcal{A}| = -H(X) + \log|\mathcal{A}|$$
Junmo Kim EE 623: Information Theory
Dependent Races
$X_i\in\mathcal{A}$ : $X_1, X_2, \ldots$
At the first race, bet $b^{(1)}(\cdot) = p_{X_1}(\cdot)$.
If $x_1$ wins, bet $b^{(2)}(\cdot) = p_{X_2|X_1=x_1}(\cdot)$; then $b^{(3)}(\cdot) = p_{X_3|X_1,X_2}(\cdot)$, and so on.
$$S_1 = p_{X_1}(X_1)\,|\mathcal{A}|$$
$$S_2 = p_{X_1}(X_1)\,|\mathcal{A}|\;p_{X_2|X_1}(X_2|X_1)\,|\mathcal{A}| = p_{X_1,X_2}(X_1, X_2)\,|\mathcal{A}|^2$$
$$S_n = p_{X_1,\ldots,X_n}(X_1, \ldots, X_n)\,|\mathcal{A}|^n$$
$$\frac{1}{n}\log S_n = \log|\mathcal{A}| - \frac{1}{n}\log\frac{1}{p_{X_1,\ldots,X_n}(X_1, \ldots, X_n)}$$
The limit of $\frac{1}{n}H(X_1, \ldots, X_n)$ is called the entropy rate.
Junmo Kim EE 623: Information Theory
Entropy Rate
Definition
The entropy rate of a stochastic process $\{X_i\}$ is defined by
$$H(\{X_k\}) = \lim_{n\to\infty}\frac{1}{n}H(X_1, \ldots, X_n)$$
when the limit exists.
Characterization: $P_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ for all $n \ge 1$.
Stationary process: for all $n, k \ge 1$ and for all $\alpha_1, \ldots, \alpha_n\in\mathcal{A}$,
$$p_{X_1,\ldots,X_n}(\alpha_1, \ldots, \alpha_n) = p_{X_{1+k},\ldots,X_{n+k}}(\alpha_1, \ldots, \alpha_n)$$
Junmo Kim EE 623: Information Theory
Entropy Rate of a Stationary Stochastic Processes
Theorem
If $\{X_k\}$ is stationary then the limit $\lim_{n\to\infty}\frac{1}{n}H(X_1, \ldots, X_n)$ exists.
e.g.) $\{X_k\}$ i.i.d. according to $p_X(x)$:
$$H(X_1, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, \ldots, X_{n-1}) = nH(X_1)$$
Junmo Kim EE 623: Information Theory
Entropy Rate of a Stationary Stochastic Processes
Lemma
If $\{X_k\}$ is stationary then the limit $\lim_{n\to\infty}H(X_n|X_1, \ldots, X_{n-1})$ exists.
Proof.
Claim: $H(X_n|X_1, \ldots, X_{n-1})$ is monotonically non-increasing.
$$H(X_{n+1}|X_1, \ldots, X_n) \le H(X_{n+1}|X_2, \ldots, X_n)\quad(\text{conditioning})$$
Because $\{X_k\}$ is stationary, $(X_1, \ldots, X_n)$ and $(X_2, \ldots, X_{n+1})$ have the same distribution, so
$$H(X_{n+1}|X_2, \ldots, X_n) = H(X_n|X_1, \ldots, X_{n-1})$$
Thus $H(X_n|X_1^{n-1})$ is non-increasing (and bounded below by 0), so the limit exists.
Write $H(X_1, \ldots, X_n) = \sum_{i=1}^n\underbrace{H(X_i|X_1^{i-1})}_{a_i}$ and let $b_n = \frac{1}{n}\sum_{i=1}^n a_i$. If $a_i \to a$ then $b_n \to a$.
(Comment: the reverse is not true. $1, 2, 1, 2, 1, 2, \ldots$ has no limit but the average converges to 1.5.)
Junmo Kim EE 623: Information Theory
Entropy Rate: Example
Example
$\{X_k\}$ are independent,
$$X_i = \begin{cases} H & \text{with probability } \frac{1}{i}\\ T & \text{with probability } 1-\frac{1}{i}\end{cases}$$
Does $\lim_{n\to\infty}H(X_n|X^{n-1})$ exist?
Yes. $H(X_n|X^{n-1}) = H(X_n) = H_b(\frac{1}{n}, \frac{n-1}{n}) \to 0$ as $n\to\infty$.
Does $\frac{1}{n}H(X_1, \ldots, X_n) \to 0$?
Yes. $a_n\to 0 \Rightarrow \frac{1}{n}\sum a_i \to 0$.
Junmo Kim EE 623: Information Theory
Example: Markov Process
A stochastic process $X_1, X_2, \ldots$ is a Markov process if
$$p_{X_n|X_1,\ldots,X_{n-1}}(x_n|x_1, \ldots, x_{n-1}) = p_{X_n|X_{n-1}}(x_n|x_{n-1})$$
Example: $X_n$ is the location of a random walk,
$$X_{n+1} = \begin{cases} X_n + 1 & \text{with probability } \frac{1}{2}\\ X_n - 1 & \text{with probability } \frac{1}{2}\end{cases}$$
If $\{X_k\}$ is Markov and stationary, it is time-invariant (homogeneous).
If $\{X_k\}$ is homogeneous and Markov, it need not be stationary.
Does $H(X_n|X^{n-1})$ converge whenever $\{X_k\}$ is Markov? Note $H(X_n|X^{n-1}) = H(X_n|X_{n-1})$ ($\because$ Markov).
Review (dependent races): at the first race bet $b^{(1)}(\cdot) = p_{X_1}(\cdot)$, so $S_1 = p_{X_1}(X_1)|\mathcal{A}|$; if $x_1$ wins bet $b^{(2)}(\cdot) = p_{X_2|X_1=x_1}(\cdot)$, so $S_2 = p_{X_1}(X_1)|\mathcal{A}|\,p_{X_2|X_1}(X_2|X_1)|\mathcal{A}| = p_{X_1,X_2}(X_1, X_2)|\mathcal{A}|^2$; in general
$$S_n = p_{X_1,\ldots,X_n}(X_1, \ldots, X_n)\,|\mathcal{A}|^n = 2^{\,n\left[\log|\mathcal{A}| + \frac{1}{n}\log p_{X_1,\ldots,X_n}(X_1,\ldots,X_n)\right]}$$
$H(\{X_k\}) = \lim_{n\to\infty}\frac{1}{n}H(X_1, \ldots, X_n)$ (if the limit exists).
If $\lim_{n\to\infty}H(X_n|X^{n-1})$ exists, then $\lim_{n\to\infty}\frac{1}{n}H(X_1, \ldots, X_n)$ exists and they are equal.
If $\{X_k\}$ is stationary, then $\lim_{n\to\infty}H(X_n|X^{n-1})$ exists.
Junmo Kim EE 623: Information Theory
Markov Chains
Homogeneous if $p_{X_k|X_{k-1}}(x'|x)$ doesn't depend on $k$.
$$H(X_2|X_1) = \sum_{x\in\mathcal{A}} p_{X_1}(x)\,H(X_2|X_1 = x) = \sum_{x\in\mathcal{A}} p_{X_1}(x)\sum_{x'\in\mathcal{A}} p(x'|x)\log\frac{1}{p(x'|x)}$$
Junmo Kim EE 623: Information Theory
Stationary Markov Chains
Stationary distribution $\pi$: $\sum_i\pi(i)p(j|i) = \pi(j)$.
However, $\lim_{n\to\infty}H(X_n|X^{n-1})$ exists and is equal to
$$\sum_{x\in\mathcal{A}}\pi_X(x)\sum_{x'\in\mathcal{A}} p(x'|x)\log\frac{1}{p(x'|x)}$$
To find $H(\{X_k\})$ from $p(j|i)$: find the stationary distribution and plug $\pi_X(x)$ into $\sum_{x\in\mathcal{A}}\pi_X(x)\sum_{x'\in\mathcal{A}} p(x'|x)\log\frac{1}{p(x'|x)}$.
Junmo Kim EE 623: Information Theory
Entropy Rate of Markov Chains: Example
How should $X_1$ be distributed to make this stationary?
For stationarity, $X_1$ and $X_2$ must be identically distributed. Hence $X_1\sim\pi(\cdot)$ with $\sum_i\pi(i)p(j|i) = \pi(j)$.
If $p_{X_1}(\cdot) = \pi(\cdot)$, the entropy rate is
$$H(X_2|X_1) = \sum_i\pi(i)\,H(X_2|X_1 = i) = \sum_i\pi(i)\,H(p(\cdot|i)).$$
Random walk on a weighted graph, $W_{ij} > 0$:
$$p(j|i) = \frac{W_{ij}}{\sum_j W_{ij}},\qquad \pi_i = \frac{\sum_j W_{ij}}{W},\quad W = \sum_{i,j}W_{ij}$$
Junmo Kim EE 623: Information Theory
Example: Random Walk on a Weighted Graph
Need to check $\sum_i\pi_i\,p(j|i) = \pi_j$. Indeed
$$\sum_i\pi_i\,p(j|i) = \sum_i\frac{\sum_k W_{ik}}{W}\cdot\frac{W_{ij}}{\sum_m W_{im}} = \sum_i\frac{W_{ij}}{W} = \pi_j$$
because $W_{ij} = W_{ji}$.
Junmo Kim EE 623: Information Theory
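A small Python sketch (illustrative; the two-state transition matrix is an arbitrary example) that finds the stationary distribution of a homogeneous Markov chain and plugs it into the entropy-rate formula $\sum_x\pi(x)\sum_{x'}p(x'|x)\log\frac{1}{p(x'|x)}$:

from math import log2

P = [[0.9, 0.1],   # p(j | i=0)
     [0.4, 0.6]]   # p(j | i=1)

# stationary distribution of a 2-state chain: pi(0) = p(0|1) / (p(1|0) + p(0|1))
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

rate = sum(pi[i] * sum(P[i][j] * log2(1 / P[i][j]) for j in range(2) if P[i][j] > 0)
           for i in range(2))
print(pi)     # [0.8, 0.2]
print(rate)   # entropy rate H(X_2 | X_1) in bits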
Asymptotic Equipartition Property(AEP)
Example: $X_1, \ldots, X_n$ are i.i.d. according to $P_X(x)$:
$$\frac{1}{n}\log\frac{1}{p_{X_1,\ldots,X_n}(X_1, \ldots, X_n)} = \underbrace{-\frac{1}{n}\sum_{i=1}^n\log p_X(X_i)}_{\text{i.i.d. r.v.}} \to E[-\log p_X(X)] = H(X)$$
Typical set $A_\epsilon^{(n)}$: $(\alpha_1, \ldots, \alpha_n)$ is typical if $(\alpha_1, \ldots, \alpha_n)\in A_\epsilon^{(n)}$, i.e.,
$$\Big|-\frac{1}{n}\log p_{X_1,\ldots,X_n}(\alpha_1, \ldots, \alpha_n) - H(\{X_k\})\Big| < \epsilon$$
Junmo Kim EE 623: Information Theory
Example
$X_1, \ldots, X_n$ i.i.d. $\sim$ Ber$(\frac{1}{2})$ on $\{H, T\}$:
$$\underbrace{-\frac{1}{n}\log P_{X_1,\ldots,X_n}(X_1, \ldots, X_n)}_{-\frac{1}{n}\log 2^{-n}} = 1\ :\ \text{all sequences are typical.}$$
Now suppose $H(X) = \frac{1}{2}$ bit (e.g. $\Pr(H) = 0.89$):
$$-\frac{1}{n}\log p_{X_1,\ldots,X_n}(H, \ldots, H) = \log\frac{1}{0.89} \approx 0.17,$$
so is the all-$H$ sequence not typical?
Junmo Kim EE 623: Information Theory
Lecture 6
Junmo Kim EE 623: Information Theory
Review
H(X
2
[X
1
) is given by stationary distribution and p(j [i ).
Walks on graph
AEP
Today
AEP
Source Coding
Junmo Kim EE 623: Information Theory
AEP
Assume $\{X_k\}$ has entropy rate $H(\{X_k\})$. The typical set $A_\epsilon^{(n)}$ is defined as
$$A_\epsilon^{(n)} = \Big\{(\alpha_1, \ldots, \alpha_n) : \Big|-\tfrac{1}{n}\log p_{X_1,\ldots,X_n}(\alpha_1, \ldots, \alpha_n) - H(\{X_k\})\Big| < \epsilon\Big\}$$
$$A_\epsilon^{(n)} = \Big\{(\alpha_1, \ldots, \alpha_n) : 2^{-n(H+\epsilon)} \le p_{X_1,\ldots,X_n}(\alpha_1, \ldots, \alpha_n) \le 2^{-n(H-\epsilon)}\Big\}$$
We say that $\{X_k\}$ satisfies the AEP if
$$\forall\epsilon>0,\quad \Pr((X_1, \ldots, X_n)\in A_\epsilon^{(n)}) \to 1 \text{ as } n\to\infty$$
where $\Pr((X_1, \ldots, X_n)\in A_\epsilon^{(n)}) = \sum_{x\in A_\epsilon^{(n)}} p_X(x)$.
Junmo Kim EE 623: Information Theory
IID Process Satises AEP
Theorem
If $\{X_k\}$ is IID, it satisfies the AEP, i.e. $\forall\epsilon>0$, $\Pr((X_1, \ldots, X_n)\in A_\epsilon^{(n)}) \to 1$ as $n\to\infty$.
Proof.
$$\Pr\Big(\Big|-\tfrac{1}{n}\log p_{X_1,\ldots,X_n}(X_1, \ldots, X_n) - H(X)\Big| < \epsilon\Big) \to 1 \text{ as } n\to\infty,$$
which comes from the weak law of large numbers.
Junmo Kim EE 623: Information Theory
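The weak-law argument can be seen empirically: for an IID Ber$(p)$ source, the fraction of sampled sequences falling in $A_\epsilon^{(n)}$ approaches 1 as $n$ grows. Python sketch (illustrative; parameters are arbitrary):

import random
from math import log2

p, eps, trials = 0.2, 0.05, 2000
H = p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))   # ~0.722 bits

def is_typical(x):
    # -(1/n) log2 p(x) = (1/n) sum_k log2 1/p(x_k) for an IID source
    n = len(x)
    neg_log_prob = sum(log2(1 / p) if xi == 1 else log2(1 / (1 - p)) for xi in x)
    return abs(neg_log_prob / n - H) < eps

for n in (50, 200, 1000):
    hits = sum(is_typical([1 if random.random() < p else 0 for _ in range(n)])
               for _ in range(trials))
    print(n, hits / trials)   # approaches 1 as n grows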
Example
$X_i$ IID Ber$(1)$, i.e. $\Pr(X_i = H) = 1$:
$H(\{X_k\}) = 0$, $A_\epsilon^{(n)} = \{(H, \ldots, H)\}$.
Ber$(\frac{1}{2})$:
$H(\{X_k\}) = 1$ bit, $A_\epsilon^{(n)} = \{H, T\}^n$, i.e. all sequences are in $A_\epsilon^{(n)}$.
$H(\{X_k\}) = \frac{1}{2}$ bit:
Out of $2^n$ sequences, most of the probability is on sequences of likelihood $\approx 2^{-nH} = 2^{-\frac{n}{2}}$.
Junmo Kim EE 623: Information Theory
Example Not Satisfying AEP
If $U = 1$ then $X_1, \ldots, X_n$ i.i.d. Ber$(\frac{1}{2})$.
If $U = 2$ then $X_1, \ldots, X_n$ i.i.d. Ber$(1)$.
$\Pr(U = 1) = \Pr(U = 2) = \frac{1}{2}$.
What is $H(\{X_k\})$?
$$H(X_1, \ldots, X_n|U) \le H(X_1, \ldots, X_n) \le \underbrace{H(X_1, \ldots, X_n|U) + \overbrace{H(U)}^{\le 1}}_{H(X_1,\ldots,X_n,U)}$$
$$\lim_{n\to\infty}\frac{1}{n}H(X_1, \ldots, X_n) = \lim_{n\to\infty}\frac{1}{n}\underbrace{H(X_1, \ldots, X_n|U)}_{\frac{1}{2}nH(\frac{1}{2})} = \frac{1}{2}$$
But
$$p_X(H, H, H, \ldots, H) > \frac{1}{2} \gg 2^{-nH(\{X_k\})},$$
so the all-$H$ sequence is not typical, and
$$\Pr\big((A_\epsilon^{(n)})^C\big) \ge P_X(H, \ldots, H) > \frac{1}{2}.$$
Junmo Kim EE 623: Information Theory
Source Coding
Let $\{X_k\}$ be any source with finite alphabet, $|\mathcal{A}| < \infty$. Describe $x_1, \ldots, x_n$ using bits. Given $\epsilon > 0$:
If $x_1, \ldots, x_n$ is not in $A_\epsilon^{(n)}$, describe it with a flag bit plus $\lceil n\log|\mathcal{A}|\rceil$ bits.
If it is typical, use a flag bit plus $\lceil\log|A_\epsilon^{(n)}|\rceil$ bits (its index within $A_\epsilon^{(n)}$).
Algorithm: look at $x_1, \ldots, x_n$ and check whether it is typical.
Expected description length
$$\le \Pr(A_\epsilon^{(n)})\,(1 + \log|A_\epsilon^{(n)}|) + (1 - \Pr(A_\epsilon^{(n)}))\,(1 + n\log|\mathcal{A}|)$$
As $\Pr(A_\epsilon^{(n)}) > 1-\epsilon$, $\log|A_\epsilon^{(n)}|$ is the dominant term.
Junmo Kim EE 623: Information Theory
Source Coding: Size of Typical Set
Claim:
$$|A_\epsilon^{(n)}| \le 2^{n(H+\epsilon)}$$
Proof.
$$1 \ge \Pr(A_\epsilon^{(n)}) = \sum_{(x_1,\ldots,x_n)\in A_\epsilon^{(n)}} p_X(x) \ge \sum_{(x_1,\ldots,x_n)\in A_\epsilon^{(n)}} 2^{-n(H+\epsilon)} = |A_\epsilon^{(n)}|\,2^{-n(H+\epsilon)}$$
Junmo Kim EE 623: Information Theory
Expected Length of the Source Code
Expected length
$$\le \Pr(A_\epsilon^{(n)})\,\big(2 + \log 2^{n(H(X)+\epsilon)}\big) + \big(1 - \Pr(A_\epsilon^{(n)})\big)\,\big(2 + n\log|\mathcal{A}|\big)$$
Issues
Computational complexity
Junmo Kim EE 623: Information Theory
Source Coding Techniques
A code $c : \mathcal{A}\to\{0,1\}^*$; its extension $C^* : \mathcal{A}^n\to\{0,1\}^*$ is defined as
$$C^* : (x_1, \ldots, x_n)\mapsto c(x_1)c(x_2)\cdots c(x_n)$$
Example: $A\to 0$, $B\to 1$, $C\to 00$, $D\to 01$.
Is this singular? The extension is not uniquely decodable: $AA\to 00$ and $C\to 00$.
Junmo Kim EE 623: Information Theory
Prex Free Code
Kraft inequality: $\sum_{x\in\mathcal{A}} 2^{-l(x)} \le 1$.
If $\sum_i 2^{-l_i} \le 1$, then there exists a uniquely decodable code of these lengths. In fact, there exists a prefix-free code of these lengths.
Junmo Kim EE 623: Information Theory
Lecture 7
Junmo Kim EE 623: Information Theory
Source Coding Techniques
A code $c : \mathcal{A}\to\{0,1\}^*$; its extension $C^* : \mathcal{A}^n\to\{0,1\}^*$ is defined as
$$C^* : (x_1, \ldots, x_n)\mapsto c(x_1)c(x_2)\cdots c(x_n)$$
Example: $A\to 0$, $B\to 1$, $C\to 00$, $D\to 01$.
Is this singular? The extension is not uniquely decodable: $AA\to 00$ and $C\to 00$.
Junmo Kim EE 623: Information Theory
Prex Free Code
Prefix-free: $C(x)$ is not a prefix of $C(x')$ for $x\ne x'$; each codeword is a leaf labeled $x$.
Start with a full binary tree and label by $x$ the node you reach by following the path corresponding to $C(x)$.
Let $\mathcal{A} = \{1, \ldots, m\}$. Kraft: $\sum_i 2^{-l_i} \le 1$.
If $\sum_i 2^{-l_i} \le 1$, then there exists a uniquely decodable code of these lengths. In fact, there exists a prefix-free code for $X$ with lengths $l_i$.
Junmo Kim EE 623: Information Theory
Kraft's Inequality
We first show a weaker statement, namely, if $C$ is prefix-free then $\sum_i 2^{-l_i} \le 1$.
Extend the tree to depth $l_{\max} = \max_i l_i$. Each leaf at depth $l_{\max}$ has a unique path to the root, so it descends from at most one codeword. A codeword of length $l_i$ thus rules out $2^{l_{\max}-l_i}$ leaves, and we cannot rule out more than $2^{l_{\max}}$ leaves:
$$\sum_i 2^{l_{\max}-l_i} \le 2^{l_{\max}} \;\Rightarrow\; \sum_i 2^{-l_i} \le 1$$
Junmo Kim EE 623: Information Theory
Kraft's Inequality
Alternative proof: start with the full binary tree of depth $l_{\max}$, where $\sum_i 2^{-l_i} = \sum_{\text{leaves}} 2^{-l_{\max}} = 1$.
Replacing two sibling leaves at depth $l_{\max}$ by their parent keeps $\sum_i 2^{-l_i}$ the same: we subtracted $2\cdot 2^{-l_{\max}}$ and added $2^{-(l_{\max}-1)}$.
If we increase some $l_i$ by adding a single child node, $\sum_i 2^{-l_i}$ only decreases.
Hence any prefix-free code satisfies $\sum_i 2^{-l_i} \le 1$.
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Converse
Given positive integers $l_1 \le l_2 \le \cdots \le l_m$, if $\sum_i 2^{-l_i} \le 1$, then we can construct a prefix-free code with lengths $l_i$ as follows.
Suppose we have assigned codewords of lengths $l_1, \ldots, l_i$. If $i < m$, the number of removed nodes of depth $l_{i+1}$ (descendants of the already assigned codewords) is less than the total number of depth-$l_{i+1}$ nodes, as we have
$$\sum_{j=1}^{i} 2^{l_{i+1}-l_j} < 2^{l_{i+1}}$$
Thus we have a remaining node of depth $l_{i+1}$, which becomes the next codeword.
If $i = m$ we are done.
Junmo Kim EE 623: Information Theory
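The construction in the converse can be coded directly: given lengths satisfying Kraft, assign codewords in order of increasing length, always taking the smallest available node. Python sketch (illustrative; the greedy integer bookkeeping below is one possible implementation):

def kraft_code(lengths):
    # build a prefix-free code with the given lengths (requires sum 2^-l_i <= 1)
    assert sum(2.0 ** -l for l in lengths) <= 1.0 + 1e-12, "Kraft inequality violated"
    code = []
    next_val, cur_len = 0, 0     # next free node at the current depth, as an integer
    for l in sorted(lengths):
        next_val <<= (l - cur_len)          # descend to depth l in the binary tree
        cur_len = l
        code.append(format(next_val, f'0{l}b'))
        next_val += 1                       # skip past this codeword's subtree
    return code

print(kraft_code([1, 2, 3, 3]))   # ['0', '10', '110', '111'], a prefix-free code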
Kraft's Inequality: Stronger Statement
If $C$ is a uniquely decodable code for describing $\mathcal{A}$ and letter $i$ is described using a codeword of length $l_i$, then $\sum_i 2^{-l_i} \le 1$.
Look at describing $x = (x_1, \ldots, x_n)$ ($n$-tuples from the source), with $l(x) = \sum_{i=1}^n l(x_i)$:
$$\Big(\sum_{x\in\mathcal{A}} 2^{-l(x)}\Big)^n = \sum_{x\in\mathcal{A}^n} 2^{-l(x)} = \sum_{m=1}^{nl_{\max}} a(m)\,2^{-m} \le n\,l_{\max}$$
where $a(m)$ is the number of $n$-tuples described by a string of length $m$; unique decodability gives $a(m) \le 2^m$.
Junmo Kim EE 623: Information Theory
Kraft's Inequality: Stronger Statement
$$\Big(\sum_{x\in\mathcal{A}} 2^{-l(x)}\Big)^n = \Big(\sum_{x_1\in\mathcal{A}} 2^{-l(x_1)}\Big)\Big(\sum_{x_2\in\mathcal{A}} 2^{-l(x_2)}\Big)\cdots\Big(\sum_{x_n\in\mathcal{A}} 2^{-l(x_n)}\Big) = \sum_{x_1\in\mathcal{A}}\sum_{x_2\in\mathcal{A}}\cdots\sum_{x_n\in\mathcal{A}} 2^{-(l(x_1)+l(x_2)+\cdots+l(x_n))}$$
$$= \sum_{(x_1,x_2,\ldots,x_n)\in\mathcal{A}^n} 2^{-l(x)} = \sum_{m=1}^{nl_{\max}} a(m)\,2^{-m} \le n\,l_{\max}$$
Thus we have
$$\sum_{x\in\mathcal{A}} 2^{-l(x)} \le \sqrt[n]{n\,l_{\max}},\qquad \sum_{x\in\mathcal{A}} 2^{-l(x)} \le \lim_{n\to\infty}\sqrt[n]{n\,l_{\max}} = 1$$
since $\lim_{n\to\infty}\frac{\log n + \log l_{\max}}{n} = 0$.
Junmo Kim EE 623: Information Theory
Lecture 8
Junmo Kim EE 623: Information Theory
Today
Wrong probability
Huffman codes
Junmo Kim EE 623: Information Theory
Criterion for Short Description
Expected length
$$L = \sum_x p_X(x)\,l(x)$$
Minimize $\sum_{x\in\mathcal{A}} p_X(x)l(x)$ over all lengths $l(x)$ that a uniquely decodable code could have, i.e. minimize $\sum_x p_X(x)l(x)$ subject to $l(x)$ being integers satisfying $\sum_x 2^{-l(x)} \le 1$:
$$L^* = \min_{\substack{l(x)\ \text{integer}\\ \sum 2^{-l(x)}\le 1}}\ \sum_x p_X(x)\,l(x)$$
Junmo Kim EE 623: Information Theory
Minimum Expected Length
$$L^* = \min_{\substack{l(x)\ \text{integer}\\ \sum 2^{-l(x)}\le 1}}\ \sum_x p_X(x)\,l(x)$$
Theorem
$$H(X) \le L^* < H(X) + 1$$
Junmo Kim EE 623: Information Theory
Minimum Expected Length
We first prove $L^* \ge H(X)$:
$$L^* = \min_{\substack{l(x)\ \text{integer}\\ \sum 2^{-l(x)}\le 1}}\sum_x p_X(x)l(x) \;\ge\; L := \min_{\sum 2^{-l(x)}\le 1}\sum_x p_X(x)l(x) \;=\; \min_{\sum 2^{-l(x)}=1}\sum_x p_X(x)l(x)$$
We may replace $\sum 2^{-l(x)}\le 1$ by $\sum 2^{-l(x)} = 1$, as the minimum occurs only when $\sum 2^{-l(x)} = 1$: if $\sum 2^{-l(x)} < 1$, we can further decrease some $l(x)$, making $\sum p_X(x)l(x)$ smaller.
To find $L = \min_{\sum 2^{-l(x)}=1}\sum_x p_X(x)l(x)$, use Lagrange multipliers:
$$J = \sum_x p_X(x)l(x) + \lambda\Big(\sum_x 2^{-l(x)} - 1\Big) = \sum_i p_i l_i + \lambda\Big(\sum_i 2^{-l_i} - 1\Big)$$
$$\frac{\partial J}{\partial l_i} = p_i - \lambda(\ln 2)\,2^{-l_i} = 0 \;\Rightarrow\; 2^{-l_i} = \frac{p_i}{\lambda\ln 2}$$
$$1 = \sum_i 2^{-l_i} = \sum_i\frac{p_i}{\lambda\ln 2} = \frac{1}{\lambda\ln 2} \;\Rightarrow\; 2^{-l_i} = p_i,\quad l_i = \log_2\frac{1}{p_i}$$
Junmo Kim EE 623: Information Theory
Minimum Expected Length
Hypothetical solution: $2^{-l_i} = p_i$, $l_i = \log_2\frac{1}{p_i}$, giving
$$\sum_i p_i l_i = \sum_i p_i\log_2\frac{1}{p_i} = H(X)$$
For any lengths with $\sum_i 2^{-l_i} = 1$,
$$\sum_i p_i l_i - H(X) = \sum_i p_i l_i + \sum_i p_i\log p_i = \sum_i p_i\log p_i - \sum_i p_i\log_2 2^{-l_i} = \sum_i p_i\log\frac{p_i}{2^{-l_i}} \ge 0$$
since relative entropy is non-negative. Hence
$$L = \min_{\sum 2^{-l(x)}=1}\sum_x p_X(x)l(x) = H(X)$$
Junmo Kim EE 623: Information Theory
Minimum Expected Length
We now prove $L^* < H(X) + 1$.
Look at $l_i = \lceil\log_2\frac{1}{p_i}\rceil \ge \log_2\frac{1}{p_i}$. These $l_i$ satisfy Kraft:
$$\sum_i 2^{-\lceil\log_2\frac{1}{p_i}\rceil} \le \sum_i 2^{-\log_2\frac{1}{p_i}} = \sum_i p_i = 1,$$
so there is a prefix-free code with lengths $l_i$. Since $L^* \le \sum_i p_i l_i$, we have
$$L^* \le \sum_i p_i\Big\lceil\log_2\frac{1}{p_i}\Big\rceil < \sum_i p_i\Big(\log_2\frac{1}{p_i} + 1\Big) = H(X) + 1$$
Junmo Kim EE 623: Information Theory
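The bound $H(X) \le L^* < H(X)+1$ via the lengths $l_i = \lceil\log_2\frac{1}{p_i}\rceil$ is easy to check on a concrete PMF. Python sketch (illustrative; the PMF is arbitrary):

from math import log2, ceil

p = [0.4, 0.3, 0.2, 0.1]
lengths = [ceil(log2(1 / pi)) for pi in p]
H = sum(pi * log2(1 / pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(lengths, sum(2.0 ** -l for l in lengths))  # Kraft sum <= 1
print(H, L)                                      # H <= L < H + 1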
n-to-variable Code
$l_i = \lceil\log_2\frac{1}{p_i}\rceil$ can be ridiculous:
$\Pr(H) = 0.999 \Rightarrow l_H = 1$; $\Pr(T) = 0.001 \Rightarrow l_T = 10$.
Instead, encode blocks $(X_1, \ldots, X_n)$ at once: $H(X_1, \ldots, X_n) \le L^* < H(X_1, \ldots, X_n) + 1$.
Dividing by $n$ gives
$$\underbrace{\tfrac{1}{n}H(X_1, \ldots, X_n)}_{\to H(\{X_k\})} \le \tfrac{1}{n}L^* < \underbrace{\tfrac{1}{n}H(X_1, \ldots, X_n)}_{\to H(\{X_k\})} + \underbrace{\tfrac{1}{n}}_{\to 0}$$
Junmo Kim EE 623: Information Theory
Wrong Probability
True PMF $P_X(x)$, $x\in\mathcal{A}$, but I think it is $Q(x)$ and use lengths $l(x) = \lceil\log\frac{1}{Q(x)}\rceil$:
$$\sum_{x\in\mathcal{A}} p_X(x)l(x) = \sum_x p_X(x)\Big\lceil\log\frac{1}{Q(x)}\Big\rceil < \sum_x p_X(x)\Big(\log\frac{1}{Q(x)} + 1\Big) = \sum_x p_X(x)\log\frac{p_X(x)}{Q(x)} - \sum_x p_X(x)\log p_X(x) + 1 = D(p_X\|Q) + H(X) + 1$$
$D(p_X\|Q)$ is the price to pay for the mismatch of the probability.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
$$L^* = \min_{\substack{l(x)\ \text{integer}\\ \sum_x 2^{-l(x)}\le 1}}\ \sum_x p_X(x)\,l(x)$$
Examples:
$p_1 = 0.6,\ p_2 = 0.4$ : with two letters, the optimal codeword lengths are 1 and 1.
$p_1 = 0.6,\ p_2 = 0.3,\ p_3 = 0.1$ : with three letters, the optimal lengths are 1, 2, 2. The least likely letters have length 2.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Assume $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_m$.
Lemma
Optimal codes have the property that if $p_i > p_j$, then $l_i \le l_j$.
Proof.
Assume to the contrary that a code has $p_i > p_j$ and $l_i > l_j$. Then
$$L = p_i l_i + p_j l_j + \text{other terms}$$
If we interchange $l_i$ and $l_j$, $L$ is decreased, as we have
$$L_{\text{new}} = p_i l_j + p_j l_i + \text{other terms}$$
$$L - L_{\text{new}} = p_i(l_i - l_j) - p_j(l_i - l_j) = (p_i - p_j)(l_i - l_j) > 0$$
Thus $L_{\text{new}} < L$, which contradicts the optimality of the code.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Assume $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_m$.
Claim:
There is no loss of generality in looking only at codes where $c(m-1)$ and $c(m)$ are siblings.
Proof.
Let $l_{\max}$ be the depth of an optimal tree. Then there are at least two codewords at depth $l_{\max}$ and they are siblings: if a codeword at depth $l_{\max}$ did not have a sibling, that codeword could be reduced in length. If the two sibling codewords are not the least likely codewords, by swapping we obtain another optimal code with $p_{m-1}, p_m$ at depth $l_{\max}$.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Junmo Kim EE 623: Information Theory
Huffman's Procedure
$$L^*(p_1, \ldots, p_m) = L^*(p_1, \ldots, p_{m-2}, p_{m-1} + p_m) + p_{m-1} + p_m$$
It suffices to look only at codes/trees where $c(m-1)$ and $c(m)$ are siblings. For any such tree with $m$ leaves, we can construct a reduced tree with $m-1$ leaves as follows: the leaves $c(m-1)$ and $c(m)$ are removed, converting their parent node into a leaf with probability $p_{m-1} + p_m$. The new tree represents a code for a new chance variable with PMF $(p_1, \ldots, p_{m-2}, p_{m-1} + p_m)$.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Now compare the expected lengths of the original code and the new code:
$$\sum_{i=1}^m p_i l_i = \sum_{i=1}^{m-2} p_i l_i + p_{m-1}l_{\max} + p_m l_{\max} = \sum_{i=1}^{m-2} p_i l_i + (p_{m-1} + p_m)(l_{\max} - 1) + p_{m-1} + p_m$$
$$= \text{Expected length of the reduced code (tree)} + p_{m-1} + p_m$$
Hence if we minimize the expected length of the reduced code (tree), with minimum value $L^*(p_1, \ldots, p_{m-2}, p_{m-1} + p_m)$, we also minimize the expected length of the original code, with minimum value $L^*(p_1, \ldots, p_{m-2}, p_{m-1} + p_m) + p_{m-1} + p_m$, which proves the recursion formula.
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Example: $(p_1, p_2, p_3, p_4, p_5) = (0.4, 0.2, 0.15, 0.15, 0.1)$
Junmo Kim EE 623: Information Theory
Huffman's Procedure
Example:
Junmo Kim EE 623: Information Theory
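Huffman's procedure is exactly the repeated merge of the two least likely letters described above; a standard way to implement it is with a min-heap. Python sketch (illustrative, not the code used in class):

import heapq

def huffman(pmf):
    # pmf: dict symbol -> probability; returns dict symbol -> codeword
    heap = [(p, i, {sym: ''}) for i, (sym, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)    # two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

p = {'a': 0.4, 'b': 0.2, 'c': 0.15, 'd': 0.15, 'e': 0.1}
code = huffman(p)
print(code)
print(sum(p[s] * len(code[s]) for s in p))   # expected length, 2.2 bits for this PMF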
Lecture 9
Junmo Kim EE 623: Information Theory
Review
$H(P)$, $H(X)$, $H_P(X)$ : entropy
$I(X; Y) = D(p_{X,Y}\,\|\,p_X p_Y)$
$\lambda + \bar\lambda = 1$
$H_p(X)$ is concave in $p_X$:
$$H(\lambda p_1 + \bar\lambda p_2) \ge \lambda H(p_1) + \bar\lambda H(p_2)$$
Junmo Kim EE 623: Information Theory
Review: Jensen's Inequality
If $f''(x) < 0$ for all $x$, $f$ is strictly concave.
$f''(x) = -\frac{1}{x^2} < 0$ : strictly concave
Theorem
If $f$ is concave then for any random variable $X$, $f(E[X]) \ge E[f(X)]$.
If $f$ is strictly concave, $f(E[X]) = E[f(X)]$ iff $X$ is deterministic.
Junmo Kim EE 623: Information Theory
Convexity of Relative Entropy
Log-sum inequality:
$$\Big(\sum_i a_i\Big)\log\frac{\sum_i a_i}{\sum_i b_i} \le \sum_i a_i\log\frac{a_i}{b_i}$$
with equality iff $a_i = c\,b_i$ for some constant $c$, for all $i$.
Proof (of convexity of $D(\cdot\|\cdot)$).
$$D(\lambda P_1 + \bar\lambda P_2\,\|\,\lambda Q_1 + \bar\lambda Q_2) = \sum_x(\lambda P_1(x) + \bar\lambda P_2(x))\log\frac{\lambda P_1(x) + \bar\lambda P_2(x)}{\lambda Q_1(x) + \bar\lambda Q_2(x)}$$
$$\le \sum_x\Big[\lambda P_1(x)\log\frac{\lambda P_1(x)}{\lambda Q_1(x)} + \bar\lambda P_2(x)\log\frac{\bar\lambda P_2(x)}{\bar\lambda Q_2(x)}\Big] = \lambda D(P_1\|Q_1) + \bar\lambda D(P_2\|Q_2)$$
(applying the log-sum inequality, for each $x$, to $(a_1, a_2) = (\lambda P_1(x), \bar\lambda P_2(x))$ and $(b_1, b_2) = (\lambda Q_1(x), \bar\lambda Q_2(x))$).
Junmo Kim EE 623: Information Theory
Recall: Entropy is Concave
Consider two PMFs and their convex combination
$$p^{(1)} = (p^{(1)}_1, \ldots, p^{(1)}_{|\mathcal{A}|}),\qquad p^{(2)} = (p^{(2)}_1, \ldots, p^{(2)}_{|\mathcal{A}|})$$
$$\lambda p^{(1)} + \bar\lambda p^{(2)} = (\lambda p^{(1)}_1 + \bar\lambda p^{(2)}_1,\ \ldots,\ \lambda p^{(1)}_{|\mathcal{A}|} + \bar\lambda p^{(2)}_{|\mathcal{A}|})$$
Theorem
$$H(\lambda p^{(1)} + \bar\lambda p^{(2)}) \ge \lambda H(p^{(1)}) + \bar\lambda H(p^{(2)})$$
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Notations:
$I(X; Y)$ is determined by $p_{X,Y}(x, y)$, which in turn is given by $p_X(x)\,p_{Y|X}(y|x)$.
We can describe $\{p_X(x) : x\in\mathcal{A}\}$ by a row vector $P_X$, whose $x$th element is $p_X(x)$.
$H(P_X) = -\sum_x P_X(x)\log P_X(x)$, which is $-\sum_x p_X(x)\log p_X(x) = H(X)$.
We can describe $\{p_{Y|X}(y|x) : x\in\mathcal{A}, y\in\mathcal{Y}\}$ by a matrix $W$, whose $(x, y)$ element is $p_{Y|X}(y|x)$.
$$H(Y|X) = \sum_x P_X(x)\underbrace{H(Y|X = x)}_{\text{fixed}}\ :\ \text{linear in } P_X$$
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Let's show that $H(Y)$ is concave in $P_X$. First note that $P_Y = P_X W$:
$P_X W$ is a row vector whose $y$th element is
$$\sum_x P_X(x)W(y|x) = \sum_x p_{X,Y}(x, y) = p_Y(y)$$
Thus $H(Y) = H(P_Y) = H(P_X W)$.
Since entropy is concave, for $P_X^{(1)}, P_X^{(2)}$ we have
$$H\big((\lambda P_X^{(1)} + \bar\lambda P_X^{(2)})W\big) = H\big(\lambda P_X^{(1)}W + \bar\lambda P_X^{(2)}W\big) \ge \lambda H(P_X^{(1)}W) + \bar\lambda H(P_X^{(2)}W)$$
Hence $H(Y) = H(P_X W)$ is concave in $P_X$.
Therefore, $I(P_X, W) = H(P_X W) - \sum_x P_X(x)H(Y|X = x)$ is concave in $P_X$ for fixed $W$.
Junmo Kim EE 623: Information Theory
Convex and Concave Properties of Mutual Information
Part 2: For fixed $P_X$, $I(X; Y)$ is convex in $W$.
Proof: As $I(X; Y) = D(p_{X,Y}(x, y)\,\|\,p_X(x)p_Y(y))$, we need to show that for two transition matrices $W^{(1)}$ and $W^{(2)}$,
$$D\big(P_X(x)(\lambda W^{(1)} + \bar\lambda W^{(2)})(y|x)\,\big\|\,P_X(x)\,(P_X(\lambda W^{(1)} + \bar\lambda W^{(2)}))(y)\big)$$
$$\le \lambda\,D\big(P_X(x)W^{(1)}(y|x)\,\big\|\,P_X(x)(P_X W^{(1)})(y)\big) + \bar\lambda\,D\big(P_X(x)W^{(2)}(y|x)\,\big\|\,P_X(x)(P_X W^{(2)})(y)\big)$$
The above inequality comes from convexity of $D(p\|q)$:
$$D(\lambda P_1 + \bar\lambda P_2\,\|\,\lambda Q_1 + \bar\lambda Q_2) \le \lambda D(P_1\|Q_1) + \bar\lambda D(P_2\|Q_2)$$
where $P_1 = P_X(x)W^{(1)}(y|x)$, $Q_1 = P_X(x)(P_X W^{(1)})(y)$, $P_2 = P_X(x)W^{(2)}(y|x)$, and $Q_2 = P_X(x)(P_X W^{(2)})(y)$.
Thus $I$ is convex in $W$ for fixed $P_X$.
Junmo Kim EE 623: Information Theory
Channels
Definition
We define a discrete channel to be a system consisting of an input alphabet $\mathcal{A}$, an output alphabet $\mathcal{Y}$, and a transition probability $p(y|x)$ (or transition matrix $W(y|x)$).
A discrete memoryless channel (DMC):
$$p_{Y_1,\ldots,Y_n|X_1,\ldots,X_n}(y_1, \ldots, y_n|x_1, \ldots, x_n) = \prod_{k=1}^n W(y_k|x_k)\quad\text{for all } n\ge 1,$$
for some stochastic matrix $W$.
Junmo Kim EE 623: Information Theory
Binary Symmetric Channel
$\mathcal{A} = \mathcal{Y} = \{0, 1\}$
$$C = \max_{P_X} I_{P_X(x)W(y|x)}(X; Y)$$
If $p = 1/2$, $C = 0$.
If $P_X$ is uniform, $I(X; Y) = C = 1 - H_b(p)$; for fixed $P_X$, $I$ is convex in $W$, i.e. convex in $p$.
Examples of computing $C$; Kuhn-Tucker condition.
Guess the $P_X$ that maximizes $H(Y)$. Will it be $[\tfrac{1}{2}\ \tfrac{1}{2}]$?
Note that $P_X^{(1)} = [\,\alpha\ \ 1-\alpha\,]$ and $P_X^{(2)} = [\,1-\alpha\ \ \alpha\,]$ give the same $H(Y)$.
As $H(Y)$ is concave in $P_X$,
$$H([\,\alpha\ 1-\alpha\,]W) = \tfrac{1}{2}H([\,\alpha\ 1-\alpha\,]W) + \tfrac{1}{2}H([\,1-\alpha\ \alpha\,]W) \le H\big((\tfrac{1}{2}[\,\alpha\ 1-\alpha\,] + \tfrac{1}{2}[\,1-\alpha\ \alpha\,])W\big) = H([\tfrac{1}{2}\ \tfrac{1}{2}]W)$$
Maximizing a concave function $f(\lambda_1, \ldots, \lambda_m)$ over the simplex $\sum_k\lambda_k = 1$, $\lambda_k\ge 0$:
Assume the $\frac{\partial f}{\partial\lambda_k}$ are defined and continuous over the simplex, with the possible exception that $\lim_{\lambda_k\to 0}\frac{\partial f}{\partial\lambda_k}$ may be $+\infty$. Then $f$ is maximized at $\lambda$ iff for some $\mu$,
$$\frac{\partial f}{\partial\lambda_k} = \mu\ \ \forall k : \lambda_k > 0,\qquad \frac{\partial f}{\partial\lambda_k} \le \mu\ \ \forall k : \lambda_k = 0$$
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Necessary condition: if $f$ is maximized, the two conditions are satisfied.
Proof:
Suppose $\lambda$ maximizes $f(\lambda)$.
If we perturb $\lambda$ by increasing $\lambda_k$ and decreasing $\lambda_{k'}$ (provided that $\lambda_{k'} > 0$) by $\delta > 0$, $f(\lambda)$ cannot increase. This requires that
$$\frac{\partial f}{\partial\lambda_k} - \frac{\partial f}{\partial\lambda_{k'}} \le 0$$
Similarly, if we increase $\lambda_{k'}$ and decrease $\lambda_k$ (provided that $\lambda_k > 0$) by $\delta > 0$, $f(\lambda)$ cannot increase:
$$\frac{\partial f}{\partial\lambda_{k'}} - \frac{\partial f}{\partial\lambda_k} \le 0$$
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Thus we have
$$\frac{\partial f}{\partial\lambda_k} = \frac{\partial f}{\partial\lambda_{k'}}\quad\text{for all } k, k'\text{ such that }\lambda_k > 0,\ \lambda_{k'} > 0,$$
which implies the first condition:
$$\frac{\partial f}{\partial\lambda_k} = \mu,\quad\forall k : \lambda_k > 0$$
If $\lambda_k = 0$, we cannot decrease $\lambda_k$, so we only need
$$\frac{\partial f}{\partial\lambda_k} - \mu \le 0,$$
which implies the second condition:
$$\frac{\partial f}{\partial\lambda_k} \le \mu,\quad\forall k : \lambda_k = 0$$
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Sufficient condition: if the two conditions are satisfied, $f$ is maximized.
Proof:
If we perturb $\lambda$ to $\lambda + \delta(\epsilon_1, \epsilon_2, \ldots, \epsilon_m)$ along an arbitrary feasible direction $(\epsilon_1, \epsilon_2, \ldots, \epsilon_m)$ such that $\sum_k\epsilon_k = 0$ and $\epsilon_k \ge 0$ if $\lambda_k = 0$, then $f$ does not increase:
$$\delta f = \delta\sum_k\frac{\partial f}{\partial\lambda_k}\epsilon_k = \delta\Big(\sum_{k:\lambda_k>0}\frac{\partial f}{\partial\lambda_k}\epsilon_k + \sum_{k:\lambda_k=0}\frac{\partial f}{\partial\lambda_k}\epsilon_k\Big) \le \delta\mu\Big(\sum_{k:\lambda_k>0}\epsilon_k + \sum_{k:\lambda_k=0}\epsilon_k\Big)\quad\Big(\because\ \tfrac{\partial f}{\partial\lambda_k}\le\mu\ \&\ \epsilon_k\ge 0\Big) = \delta\mu\sum_k\epsilon_k = 0$$
Junmo Kim EE 623: Information Theory
Maximizing a Concave Function over Simplex
Now the question is whether it is a global maximum.
Suppose, on the contrary, there is a better point $\lambda^*$ such that $f(\lambda^*) > f(\lambda)$.
The concavity of $f$ indicates that $f$ is above the straight line connecting the two points:
$$f((1-\theta)\lambda + \theta\lambda^*) \ge (1-\theta)f(\lambda) + \theta f(\lambda^*) > f(\lambda)\quad\text{for } 0 < \theta \le 1,$$
so moving slightly from $\lambda$ toward $\lambda^*$ increases $f(\lambda)$. This contradicts the fact that $f$ does not increase under a local perturbation.
Junmo Kim EE 623: Information Theory
Computing the Channel Capacity
Multiply (*) by $P_X(x)$ and sum over all $x$ s.t. $P_X(x) > 0$:
$$\sum_x P_X(x)\sum_y W(y|x)\log\frac{W(y|x)}{(P_X W)(y)}\cdot\frac{P_X(x)}{P_X(x)} = \sum_{x,y} P_{X,Y}(x, y)\log\frac{P_{X,Y}(x, y)}{P_X(x)P_Y(y)} = I(X; Y) = C\quad(\because\ P_X\text{ is the optimal distribution})$$
Thus we have $C = \mu$.
Junmo Kim EE 623: Information Theory
Computing the Channel Capacity
Theorem
1. If for some input distribution $P_X(x)$ we have
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) = \mu\quad\forall x,\ P_X(x) > 0$$
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) \le \mu\quad\forall x,\ P_X(x) = 0$$
then $P_X(x)$ achieves capacity $C$ and $C = \mu$.
2. If $P_X(x)$ achieves the capacity then the above holds with $\mu = C$.
Usage: guess & check
Junmo Kim EE 623: Information Theory
Kuhn-Tucker Condition
$P_X$ achieves capacity if and only if
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) = C\quad\forall x,\ P_X(x) > 0$$
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) \le C\quad\forall x,\ P_X(x) = 0$$
How to compute $C$:
1. Symmetry
2. Concavity argument (e.g. erasure channel, $P_X = [\tfrac{1}{2}\ \tfrac{1}{2}]$)
3. Guess & verify using the Kuhn-Tucker condition
Junmo Kim EE 623: Information Theory
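For a channel without obvious symmetry, $C = \max_{P_X} I(P_X; W)$ can also be found numerically, and the Kuhn-Tucker quantities $D(W(\cdot|x)\|(P_X W)(\cdot))$ used as a check. Python sketch for a binary-input, binary-output channel (illustrative; the transition matrix is arbitrary, and a simple sweep over $P_X(0)$ stands in for a proper optimizer such as Blahut-Arimoto):

from math import log2

W = [[0.9, 0.1],    # W(y | x=0)
     [0.2, 0.8]]    # W(y | x=1)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_info(px0):
    px = [px0, 1 - px0]
    py = [sum(px[x] * W[x][y] for x in range(2)) for y in range(2)]
    # I(X;Y) = sum_x P_X(x) D(W(.|x) || P_X W)
    return sum(px[x] * kl(W[x], py) for x in range(2))

best = max(mutual_info(a / 10000) for a in range(1, 10000))
print(best)   # numerical value of C; at the maximizer, D(W(.|x)||P_X W) = C for used inputs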
Lecture 11
Junmo Kim EE 623: Information Theory
Review: Kuhn-Tucker Condition
Theorem
1. If for some input distribution $P_X(x)$ we have
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) = \mu\quad\forall x,\ P_X(x) > 0$$
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) \le \mu\quad\forall x,\ P_X(x) = 0$$
then $P_X(x)$ achieves capacity $C$ and $C = \mu$.
2. If $P_X(x)$ achieves the capacity then the above holds with $\mu = C$.
Usage: guess & check
Usage: guess & check
Junmo Kim EE 623: Information Theory
Review: Kuhn-Tucker Condition
$P_X$ achieves capacity if and only if
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) = C\quad\forall x,\ P_X(x) > 0$$
$$D(W(\cdot|x)\,\|\,(P_X W)(\cdot)) \le C\quad\forall x,\ P_X(x) = 0$$
How to compute $C$:
1. Symmetry
2. Concavity argument (e.g. erasure channel, $P_X = [\tfrac{1}{2}\ \tfrac{1}{2}]$)
3. Guess & verify using the Kuhn-Tucker condition
Junmo Kim EE 623: Information Theory
Example: Kuhn-Tucker Condition (Problem 7.13)
Guess $P_X = [\tfrac{1}{2}\ \tfrac{1}{2}]$ and verify that it satisfies the Kuhn-Tucker conditions:
$$D(W(\cdot|x=0)\,\|\,(P_X W)(\cdot)) = D(W(\cdot|x=1)\,\|\,(P_X W)(\cdot)).$$
With error probability $\epsilon$ and erasure probability $\alpha$,
$$P_X W = [\,\tfrac{1}{2}(1-\alpha)\ \ \alpha\ \ \tfrac{1}{2}(1-\alpha)\,]$$
$$D(W(\cdot|x=0)\,\|\,(P_X W)(\cdot)) = D([\,1-\alpha-\epsilon\ \ \alpha\ \ \epsilon\,]\,\|\,[\,\tfrac{1}{2}(1-\alpha)\ \ \alpha\ \ \tfrac{1}{2}(1-\alpha)\,])$$
$$D(W(\cdot|x=1)\,\|\,(P_X W)(\cdot)) = D([\,\epsilon\ \ \alpha\ \ 1-\alpha-\epsilon\,]\,\|\,[\,\tfrac{1}{2}(1-\alpha)\ \ \alpha\ \ \tfrac{1}{2}(1-\alpha)\,])$$
These are equal, so
$$C = D(W(\cdot|x=0)\,\|\,(P_X W)(\cdot)) = (1-\alpha) + H(\alpha, 1-\alpha) - H(1-\alpha-\epsilon, \alpha, \epsilon).$$
Junmo Kim EE 623: Information Theory
Today
Prove a converse
1. Fano's inequality
2. Data processing inequality
Junmo Kim EE 623: Information Theory
Notations and Denitions
$W$: message, $W\in\mathcal{M} = \{1, \ldots, M\}$
An encoder $f : \mathcal{M}\to\mathcal{A}^n$
Received signal: $Y\in\mathcal{Y}^n$
Decoder: $\phi : \mathcal{Y}^n\to\mathcal{M}\cup\{0\}$
$n$ : block length
$$p_{Y_1,\ldots,Y_n|X_1,\ldots,X_n}(y|x) = \prod_{k=1}^n W(y_k|x_k)$$
Probability of error
$$\lambda_i = \sum_{y:\phi(y)\ne i}\underbrace{W(y|f(i))}_{\prod_{k=1}^n W(y_k|x_k(i))}$$
$x_k(i)$ : $k$th component of $f(i)$.
Junmo Kim EE 623: Information Theory
Notations and Denitions
$$\lambda^{(n)} = \max_i\lambda_i,\qquad P_e^{(n)} = \frac{1}{M}\sum_{i=1}^M\lambda_i$$
Definition
A rate $R$ is achievable if for any $\epsilon > 0$ and all sufficiently large block lengths $n$ there exists a code $(f, \phi)$ of rate $R$ with $\lambda^{(n)} < \epsilon$.
The rate of a code is $\frac{\log M}{n}$.
Junmo Kim EE 623: Information Theory
Converse
Theorem
If $R$ is achievable over the DMC $W$, then $R \le \max_{P_X} I(P_X; W)$.
Junmo Kim EE 623: Information Theory
Markov Chains
$X\to Y\to Z$ if $p_{Z|X,Y}(z|x, y) = p_{Z|Y}(z|y)$.
The following are all equivalent:
1. $X\to Y\to Z$
2. $Z\to Y\to X$
3. $X$ and $Z$ are independent given $Y$: $p(x, z|y) = p(x|y)p(z|y)$
Equivalence of 1 & 3:
$$p(x, y, z) = p(y)p(x, z|y) = p(y)p(x|y)p(z|y) = p(x, y)p(z|y),\qquad p(x, y, z) = p(x, y)p(z|x, y)$$
Junmo Kim EE 623: Information Theory
Venn Diagram
Fano's Inequality
Estimate of $X$: $\hat X = g(Y)$.
Probability of error $P_e = \Pr(\hat X\ne X)$.
Fano's inequality is
$$H(X|Y) \le H_b(P_e) + P_e\log(|\mathcal{A}| - 1)$$
$$H(X|\hat X) \le \Pr(X\ne\hat X)\log(|\mathcal{A}| - 1) + H_b(\Pr(X\ne\hat X))$$
Proof: Define
$$E = \begin{cases} 1 & X\ne\hat X\\ 0 & X = \hat X\end{cases}$$
$$H(X, E|\hat X) = H(X|\hat X) + \underbrace{H(E|X, \hat X)}_{0} = H(E|\hat X) + H(X|E, \hat X)$$
$$H(X|\hat X) = H(E|\hat X) + H(X|E, \hat X) \le H(E) + H(X|E, \hat X) = H_b(\Pr(X\ne\hat X)) + H(X|E, \hat X)$$
Junmo Kim EE 623: Information Theory
Fano's Inequality
$$H(X|E, \hat X) = \underbrace{H(X|\hat X, E = 0)}_{0}\Pr(E = 0) + H(X|\hat X, E = 1)\Pr(E = 1) \le \log(|\mathcal{A}| - 1)\Pr(X\ne\hat X)$$
This gives us a bounding technique: $H(X|Y) \le H(X|\hat X)$.
Junmo Kim EE 623: Information Theory
DMC and Mutual Information
Lemma
Let $X$ take values in $\mathcal{A}^n$ according to some law $P_X(x)$ and let $Y$ be distributed according to $p_{Y|X} = \prod_{k=1}^n W(y_k|x_k)$ for some $W(\cdot|\cdot)$. Then $I(X; Y) \le nC$ where $C = \max_{P_X} I(P_X; W)$.
Proof:
$$I(X; Y) = H(Y) - H(Y|X) = H(Y) - \sum_i H(Y_i|X, Y^{i-1}) = H(Y) - \sum_i H(Y_i|X_i)\ (\text{memoryless}) \le \sum_i\big(H(Y_i) - H(Y_i|X_i)\big) = \sum_i I(X_i; Y_i) \le nC$$
Junmo Kim EE 623: Information Theory
Proof of Converse
Note that we have the following Markov chain:
$$W \to X = f(W) \to Y^n \to \hat W = \phi(Y)$$
$H(W) = nR$ ($H(W) = \log|\mathcal{M}|$, $R = \frac{\log|\mathcal{M}|}{n}$):
$$nR = H(W) = H(W|\hat W) + I(W; \hat W) \le H(W|\hat W) + I(f(W); \hat W) \le H(W|\hat W) + I(X; Y) \le H(W|\hat W) + nC$$
$$\le \Pr(W\ne\hat W)\log|\mathcal{M}| + H_b(\Pr(W\ne\hat W)) + nC = P_e^{(n)}nR + H_b(P_e^{(n)}) + nC$$
Junmo Kim EE 623: Information Theory
Proof of Converse
where
$$\Pr(W\ne\hat W) = \sum_i\Pr(W = i)\Pr(W\ne\hat W|W = i) = \sum_i\frac{1}{M}\lambda_i = P_e^{(n)}$$
For all $n$, we have
$$R \le P_e^{(n)}R + \frac{1}{n}H_b(P_e^{(n)}) + C \le P_e^{(n)}R + \frac{1}{n} + C$$
Let $n\to\infty$: $R \le C$ if $P_e^{(n)}\to 0$.
Junmo Kim EE 623: Information Theory
Lecture 12
Junmo Kim EE 623: Information Theory
Fano's Inequality
Estimate of $X$: $\hat X = g(Y)$.
Probability of error $P_e = \Pr(\hat X\ne X)$.
Fano's inequality is
$$H(X|Y) \le H_b(P_e) + P_e\log(|\mathcal{A}| - 1)$$
$$H(X|\hat X) \le \Pr(X\ne\hat X)\log(|\mathcal{A}| - 1) + H_b(\Pr(X\ne\hat X))$$
Proof: Define
$$E = \begin{cases} 1 & X\ne\hat X\\ 0 & X = \hat X\end{cases}$$
$$H(X, E|\hat X) = H(X|\hat X) + \underbrace{H(E|X, \hat X)}_{0} = H(E|\hat X) + H(X|E, \hat X)$$
$$H(X|\hat X) = H(E|\hat X) + H(X|E, \hat X) \le H(E) + H(X|E, \hat X) = H_b(\Pr(X\ne\hat X)) + H(X|E, \hat X)$$
Junmo Kim EE 623: Information Theory
Fano's Inequality
$$H(X|E, \hat X) = \underbrace{H(X|\hat X, E = 0)}_{0}\Pr(E = 0) + H(X|\hat X, E = 1)\Pr(E = 1) \le \log(|\mathcal{A}| - 1)\Pr(X\ne\hat X)$$
This gives us a bounding technique: $H(X|Y) \le H(X|\hat X)$.
Junmo Kim EE 623: Information Theory
Notations and Definitions
$n$ : block length
An encoder $f : \mathcal{M}\to\mathcal{A}^n$ ($f(m) = x^n(m)$) yields codewords $x^n(1), x^n(2), \ldots, x^n(M)$. The set of codewords is called the codebook, denoted by $\mathcal{C}$.
Decoder: $\phi : \mathcal{Y}^n\to\mathcal{M}\cup\{0\}$
e.g. ML decoder: $\hat W = \phi(y^n) = \arg\max_m W(y^n|x^n(m))$.
Junmo Kim EE 623: Information Theory
Notations and Definitions
Probability of error
$$\lambda_i = \Pr(\phi(Y^n)\ne i\,|\,X^n = x^n(i)) = \sum_{y:\phi(y)\ne i}\underbrace{W(y|f(i))}_{\prod_{k=1}^n W(y_k|x_k(i))}$$
$x_k(i)$ : $k$th component of $f(i) = x^n(i)$.
$$\lambda^{(n)} = \max_i\lambda_i,\qquad P_e^{(n)} = \frac{1}{M}\sum_{i=1}^M\lambda_i$$
Junmo Kim EE 623: Information Theory
Channel Coding Theorem
$R = \frac{\log M}{n}$ : rate in bits / channel use.
$W\sim\text{Unif}\{1, \ldots, 2^{nR}\}$
$$I_{P_X(x)W(y|x)}(X; Y) = I(P_X; W),\qquad C = \max_{P_X} I(P_X; W)$$
Definition
$R$ is achievable if $\forall\epsilon > 0$, $\exists n_0$ s.t. $\forall n\ge n_0$ there exist an encoder of rate $R$ and block length $n$ and a decoder with maximal probability of error $\lambda^{(n)} < \epsilon$.
Theorem
If $R < C$ then $R$ is achievable.
Theorem
Converse: If $R$ is achievable, then $R \le C$.
Junmo Kim EE 623: Information Theory
DMC and Mutual Information
Lemma
Let $X$ take values in $\mathcal{A}^n$ according to some law $P_X(x)$ and let $Y$ be distributed according to $p_{Y|X} = \prod_{k=1}^n W(y_k|x_k)$ for some $W(\cdot|\cdot)$. Then $I(X; Y) \le nC$ where $C = \max_{P_X} I(P_X; W)$.
Proof:
$$I(X; Y) = H(Y) - H(Y|X) = H(Y) - \sum_i H(Y_i|X, Y^{i-1}) = H(Y) - \sum_i H(Y_i|X_i)\ (\text{memoryless}) \le \sum_i\big(H(Y_i) - H(Y_i|X_i)\big) = \sum_i I(X_i; Y_i) \le nC$$
Junmo Kim EE 623: Information Theory
Proof of Converse
Note that we have the following Markov chain:
$$W \to X = f(W) \to Y^n \to \hat W = \phi(Y)$$
$H(W) = nR$ ($H(W) = \log|\mathcal{M}|$, $R = \frac{\log|\mathcal{M}|}{n}$):
$$nR = H(W) = H(W|\hat W) + I(W; \hat W) \le H(W|\hat W) + I(f(W); \hat W) \le H(W|\hat W) + I(X; Y) \le H(W|\hat W) + nC$$
$$\le \Pr(W\ne\hat W)\log|\mathcal{M}| + H_b(\Pr(W\ne\hat W)) + nC = P_e^{(n)}nR + H_b(P_e^{(n)}) + nC$$
Junmo Kim EE 623: Information Theory
Proof of Converse
where
$$\Pr(W\ne\hat W) = \sum_i\Pr(W = i)\Pr(W\ne\hat W|W = i) = \sum_i\frac{1}{M}\lambda_i = P_e^{(n)}$$
For all $n$, we have
$$R \le P_e^{(n)}R + \frac{1}{n}H_b(P_e^{(n)}) + C \le P_e^{(n)}R + \frac{1}{n} + C$$
Let $n\to\infty$: $R \le C$ if $P_e^{(n)}\to 0$.
Junmo Kim EE 623: Information Theory
Channel Coding Theorem
$R = \frac{\log M}{n}$ : rate in bits / channel use.
$W\sim\text{Unif}\{1, \ldots, 2^{nR}\}$
$$I_{P_X(x)W(y|x)}(X; Y) = I(P_X; W),\qquad C = \max_{P_X} I(P_X; W)$$
Definition
$R$ is achievable if $\forall\epsilon > 0$, $\exists n_0$ s.t. $\forall n\ge n_0$ there exist an encoder of rate $R$ and block length $n$ and a decoder with maximal probability of error $\lambda^{(n)} < \epsilon$.
Theorem
If $R < C$ then $R$ is achievable.
Theorem
Converse: If $R$ is achievable, then $R \le C$.
Junmo Kim EE 623: Information Theory
Joint Typicality
Given some joint distribution $p_{X,Y}(x, y)$ on $\mathcal{A}\times\mathcal{Y}$ and given some $\epsilon > 0$ and a natural number $n$, define $A_\epsilon^{(n)}(p_{X,Y})$ as
$$A_\epsilon^{(n)}(p_{X,Y}) = \Big\{(x, y) : \big|-\tfrac{1}{n}\log p_X(x) - H_{p_X}(X)\big| < \epsilon,\ \big|-\tfrac{1}{n}\log p_Y(y) - H_{p_Y}(Y)\big| < \epsilon,\ \big|-\tfrac{1}{n}\log p_{X,Y}(x, y) - H_{p_{X,Y}}(X, Y)\big| < \epsilon\Big\}$$
where $p_{X,Y}$ is a PMF on $\mathcal{A}\times\mathcal{Y}$ and $p_{X,Y}(x, y) = \prod_i p_{X,Y}(x_i, y_i)$.
$$|A_\epsilon^{(n)}(p_{X,Y})| \le 2^{n(H_{p_{X,Y}}(X,Y)+\epsilon)}$$
Junmo Kim EE 623: Information Theory
Joint Typicality
Lemma
Suppose $\tilde X, \tilde Y$ are drawn independently according to the law $p_X(x)p_Y(y)$, i.e. $(\tilde X_k, \tilde Y_k)\overset{i.i.d.}{\sim} p_X(x)p_Y(y)$. Then
$$\Pr[(\tilde X, \tilde Y)\in A_\epsilon^{(n)}(p_{X,Y})] \le 2^{-n(I_{p_{X,Y}}(X;Y)-3\epsilon)}$$
Proof:
$$\Pr[(\tilde X, \tilde Y)\in A_\epsilon^{(n)}(p_{X,Y})] = \sum_{(x,y)\in A_\epsilon^{(n)}} p_X(x)p_Y(y) \le \sum_{(x,y)\in A_\epsilon^{(n)}} 2^{-n(H_{p_X}(X)-\epsilon)}\,2^{-n(H_{p_Y}(Y)-\epsilon)}$$
$$= |A_\epsilon^{(n)}|\,2^{-n(H_{p_X}(X)-\epsilon)}\,2^{-n(H_{p_Y}(Y)-\epsilon)} \le 2^{n(H_{p_{X,Y}}(X,Y)+\epsilon)}\,2^{-n(H_{p_X}(X)-\epsilon)}\,2^{-n(H_{p_Y}(Y)-\epsilon)} = 2^{-n(I_{p_{X,Y}}(X;Y)-3\epsilon)}$$
Junmo Kim EE 623: Information Theory
Joint Typicality
Lemma
Suppose $\tilde X, \tilde Y$ are drawn independently according to the law $p_X(x)p_Y(y)$, i.e. $(\tilde X_k, \tilde Y_k)\overset{i.i.d.}{\sim} p_X(x)p_Y(y)$. Then
$$\Pr[(\tilde X, \tilde Y)\in A_\epsilon^{(n)}(p_{X,Y})] \le 2^{-n(I_{p_{X,Y}}(X;Y)-3\epsilon)}$$
Lemma
Suppose we draw $(X_k, Y_k)\overset{i.i.d.}{\sim} p_{X,Y}(x, y)$. Then $\Pr((X, Y)\in A_\epsilon^{(n)}(p_{X,Y}))\to 1$ as $n\to\infty$.
Proof: Law of large numbers.
Junmo Kim EE 623: Information Theory
Joint Typicality Decoder
$\phi(y; p_{X,Y}, \epsilon, n, \mathcal{C}) = m$ if $(x(m), y)\in A_\epsilon^{(n)}(p_{X,Y})$ and for no other $m'\ne m$, $(x(m'), y)\in A_\epsilon^{(n)}(p_{X,Y})$;
otherwise $\phi(y; p_{X,Y}, \epsilon, n, \mathcal{C}) = 0$.
Junmo Kim EE 623: Information Theory
Proof Sketch
1. Fix some $p_X$.
2. Fix some $\epsilon > 0$ and $n$.
3. Generate a random codebook $\mathcal{C}$, IID $\sim p_X$.
4. Reveal $\mathcal{C}$ to encoder and receiver.
5. Design a joint typicality decoder $\phi(\cdot\,; p_X(x)W(y|x), \epsilon, n, \mathcal{C})$.
6. Encoder: $m\mapsto x(m)$ (according to the codebook).
7. Each codebook $\mathcal{C}$ gives $P_e^{(n)}(\mathcal{C})$.
8. Analyze $E[P_e^{(n)}(\mathcal{C})]$, the average over $\mathcal{C}$.
9. We will show that if $R < I(P_X; W)$ then $E[P_e^{(n)}(\mathcal{C})]\to 0$ as $n\to\infty$.
10. By the random coding argument, there exists a deterministic sequence $\mathcal{C}_n$ s.t. $P_e^{(n)}(\mathcal{C}_n)\to 0$.
11. Trick to get $\lambda^{(n)}$ to go to zero.
Junmo Kim EE 623: Information Theory
Proof Sketch
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
If this holds for every $p_X$, then $I(P_X; W)$ is achievable for every $P_X$, hence $C$ is achievable.
Observation: $E[\lambda_{17}] = E[\lambda_5]$.
Claim 1: $E_{\mathcal{C}}[\lambda_m]$ does not depend on $m$ ($\because$ symmetry), which implies
$$E[P_e^{(n)}(\mathcal{C}, \phi)] = E[\lambda_1]$$
Assume then $W = 1$ and compute $E[\lambda_1]$.
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
Step 8:
$$E[\lambda_1] = \sum_{\mathcal{C}}\Pr(\mathcal{C})\,\lambda_1(\mathcal{C}, \phi) = \Pr(\text{Error}\,|\,W = 1)$$
Error only if $(X(1), Y)$ is not in $A_\epsilon^{(n)}$ or $\exists m\ne 1$ s.t. $(X(m), Y)\in A_\epsilon^{(n)}$.
$$\Pr[(X(1), Y)\notin A_\epsilon^{(n)}(\epsilon, n, P_X(x)W(y|x))]\xrightarrow{n\to\infty} 0$$
Let $E_i$ be the event that $(X(i), Y)\in A_\epsilon^{(n)}(\epsilon, n, P_X(x)W(y|x))$.
$$E[\lambda_1] \le \Pr\Big(E_1^C\cup\bigcup_{i=2}^{2^{nR}}E_i\Big) \le \underbrace{\Pr(E_1^C)}_{\to 0} + \sum_{i=2}^{2^{nR}}\Pr(E_i)$$
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
$$\Pr(E_i) \le 2^{-n(I_{P_X(x)W(y|x)}(X;Y)-3\epsilon)}\quad(i\ne 1)$$
Thus
$$\sum_{i=2}^{2^{nR}}\Pr(E_i) \le (2^{nR}-1)\,2^{-n(I(X;Y)-3\epsilon)} \le 2^{-n(I(X;Y)-3\epsilon-R)}.$$
If $R < I(X; Y) - 3\epsilon$, then $(2^{nR}-1)\,2^{-n(I(X;Y)-3\epsilon)}\to 0$ as $n\to\infty$.
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
Let $\mathcal{C}_n$ satisfy $P_e^{(n)}(\mathcal{C}_n, \phi) < \epsilon$.
Order the codewords so that $\lambda_{m(1)} \le \lambda_{m(2)} \le \cdots \le \lambda_{m(M)}$ and keep $m(1), m(2), \ldots, m(M/2)$; each kept codeword has $\lambda_{m(i)} \le 2\epsilon$.
Otherwise,
$$\sum_{i=1}^M\lambda_{m(i)} \ge \sum_{i=M/2+1}^M\lambda_{m(i)} > \frac{M}{2}\cdot 2\epsilon\quad\text{and}\quad P_e^{(n)}(\mathcal{C}_n, \phi) = \frac{1}{M}\sum_{i=1}^M\lambda_{m(i)} > \epsilon.$$
Thus $\lambda^{(n)} \le 2\epsilon$.
Source-channel separation
Feedback communication
Junmo Kim EE 623: Information Theory
Junmo Kim EE 623: Information Theory
Joint Typicality
Roughly, $\tilde X$ takes one of $2^{nH(X)}$ values with high probability. Independently, $\tilde Y$ takes one of $2^{nH(Y)}$ values with high probability, so $(\tilde X, \tilde Y)$ takes one of $2^{nH(X)}\cdot 2^{nH(Y)}$ values with high probability. But only $2^{nH(X,Y)}$ pairs are jointly typical. The probability that $(\tilde X, \tilde Y)$ is jointly typical is about
$$\frac{2^{nH(X,Y)}}{2^{nH(X)}\,2^{nH(Y)}} = 2^{-nI(X;Y)}$$
Junmo Kim EE 623: Information Theory
Proof Sketch
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
If this holds for every $p_X$, then $I(P_X; W)$ is achievable for every $P_X$, hence $C$ is achievable.
Observation: $\lambda_{17}\ne\lambda_5$ in general, but $E_{\mathcal{C}}[\lambda_{17}] = E_{\mathcal{C}}[\lambda_5]$.
Claim 1: $E_{\mathcal{C}}[\lambda_m]$ does not depend on $m$ ($\because$ symmetry), which implies
$$E[P_e^{(n)}(\mathcal{C}, \phi)] = E\Big[\frac{1}{M}\sum_{m=1}^M\lambda_m(\mathcal{C}, \phi)\Big] = \frac{1}{M}\sum_{m=1}^M E[\lambda_m(\mathcal{C}, \phi)] = E[\lambda_1]$$
Assume then $W = 1$ and compute $E[\lambda_1]$.
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
$$E[P_e^{(n)}(\mathcal{C}, \phi)] = E\Big[\frac{1}{M}\sum_{m=1}^M\lambda_m(\mathcal{C}, \phi)\Big] = E[\lambda_1]$$
Junmo Kim EE 623: Information Theory
Proof: Channel Coding Theorem
Let $\mathcal{C}_n$ satisfy $P_e^{(n)}(\mathcal{C}_n, \phi) < \epsilon$.
Order the codewords so that $\lambda_{m(1)} \le \lambda_{m(2)} \le \cdots \le \lambda_{m(M)}$ and keep $m(1), m(2), \ldots, m(M/2)$; each kept codeword has $\lambda_{m(i)} \le 2\epsilon$.
Otherwise,
$$\sum_{i=1}^M\lambda_{m(i)} \ge \sum_{i=M/2+1}^M\lambda_{m(i)} > \frac{M}{2}\cdot 2\epsilon\quad\text{and}\quad P_e^{(n)}(\mathcal{C}_n, \phi) = \frac{1}{M}\sum_{i=1}^M\lambda_{m(i)} > \epsilon.$$
Thus $\lambda^{(n)} \le 2\epsilon$.
Source over a channel: let $\{V_k\}$ be a source of entropy rate $H(\{V_k\})$. We want
$$\sum_{v_1,\ldots,v_n}P_V(v)\sum_{y_1,\ldots,y_n:\phi(y)\ne v}W(y|F(v)) < \epsilon.$$
Junmo Kim EE 623: Information Theory
Converse
Theorem
If $H(\{V_k\}) > C$ for a stationary process $\{V_k\}$, then for any sequences $F_n : \mathcal{V}^n\to\mathcal{A}^n$, $\phi_n : \mathcal{Y}^n\to\mathcal{V}^n$, $\lim_{n\to\infty}\Pr(\phi_n(Y)\ne V) > 0$.
Proof.
$$H(\{V_k\}) \le \frac{1}{n}H(V_1, \ldots, V_n) = \frac{1}{n}H(V_1, \ldots, V_n|\hat V_1, \ldots, \hat V_n) + \frac{1}{n}I(V_1, \ldots, V_n; \hat V_1, \ldots, \hat V_n)$$
$$\le \frac{1}{n}\big[1 + P_e^{(n)}n\log|\mathcal{V}|\big] + \frac{1}{n}I(X_1, \ldots, X_n; Y_1, \ldots, Y_n) \le \frac{1}{n} + P_e^{(n)}\log|\mathcal{V}| + C \to C\quad\text{as } n\to\infty\text{ if } P_e^{(n)}\to 0,$$
contradicting $H(\{V_k\}) > C$.
Junmo Kim EE 623: Information Theory
Source-Channel Separation
A source produces $\rho_s$ source symbols/sec; the channel allows $\rho_c$ channel uses/sec and has capacity $C$ bits/channel use, so its capacity in bits/sec is $C\rho_c$.
$C_{FB}$ is the feedback capacity; obviously $C_{FB} \ge C$ (you can ignore the feedback if you like).
In fact, $C_{FB} = C$.
Junmo Kim EE 623: Information Theory
Feedback Communication
When there is feedback, the following lemma is no longer true.
Lemma
Let $X$ take values in $\mathcal{A}^n$ according to some law $P_X(x)$ and let $Y$ be distributed according to $p_{Y|X} = \prod_{k=1}^n W(y_k|x_k)$ for some $W(\cdot|\cdot)$. Then $I(X; Y) \le nC$ where $C = \max_{P_X} I(P_X; W)$.
Proof:
$$I(X; Y) = H(Y) - H(Y|X) = H(Y) - \sum_i H(Y_i|X, Y^{i-1}) = H(Y) - \sum_i H(Y_i|X_i)\ (\text{memoryless}) \le \sum_i\big(H(Y_i) - H(Y_i|X_i)\big) = \sum_i I(X_i; Y_i) \le nC$$
Junmo Kim EE 623: Information Theory
Feedback Communication
$$nR = H(W) = H(W|\hat W) + I(W; \hat W) \le 1 + P_e^{(n)}nR + I(W; \hat W) \le 1 + P_e^{(n)}nR + I(W; Y)$$
$$= 1 + P_e^{(n)}nR + H(Y^n) - \sum_i H(Y_i|W, Y^{i-1}) = 1 + P_e^{(n)}nR + H(Y^n) - \sum_i H(Y_i|W, Y^{i-1}, X_i)\quad(X_i\text{ is a function of }(W, Y^{i-1}))$$
$$= 1 + P_e^{(n)}nR + H(Y^n) - \sum_i H(Y_i|X_i)\ (\text{memoryless ch}) \le 1 + P_e^{(n)}nR + \sum_i\big(H(Y_i) - H(Y_i|X_i)\big) \le 1 + P_e^{(n)}nR + nC$$
Junmo Kim EE 623: Information Theory
Feedback Communication
Example: Binary erasure channel, $C_{FB} = C = 1 - \alpha$.
Suppose $R < C = 1 - \alpha$.
Given some $\epsilon' > 0$ (very small), choose $n$ large enough so that
$$\Pr\Big(\frac{\#?}{n} > \alpha + \epsilon\Big) < \epsilon'$$
Then
$$\Pr(\text{error}) = \Pr\Big(\frac{\#?}{n} > 1 - R\Big) \le \Pr\Big(\frac{\#?}{n} > \alpha + \epsilon\Big)\quad(\because\ 1 - R > \alpha + \epsilon)\ <\ \epsilon'$$
Junmo Kim EE 623: Information Theory
Lecture 14
Junmo Kim EE 623: Information Theory
Announcement
Let V
k
be a source of entropy rate H(V
k
).
v
1
,...,v
n
P
V
(v)
y
1
,...,y
n
:(y),=v
W(y[F(v)) <
Junmo Kim EE 623: Information Theory
Converse
Theorem:
If H(V
k
) > C for a stationary process V
k
, for any sequences
F
n
: 1
n
A
n
,
n
:
n
1
n
, lim
n
Pr (
n
(Y) ,= V)) > 0.
H(V
k
)
1
n
H(V
1
, . . . , V
n
)
( a
i
b
n
=
1
n
a
i
, lim
n
b
n
b
m
, m)
=
1
n
H(V
1
, . . . , V
n
[
V
1
, . . . ,
V
n
) +
1
n
I (V
1
, . . . , V
n
;
V
1
, . . . ,
V
n
)
1
n
[1 + P
(n)
e
n log [1[ +
1
n
I (X
1
, . . . , X
n
; Y
1
, . . . , Y
n
)]
1
n
+ P
(n)
e
log [1[ + C C as n 0
Junmo Kim EE 623: Information Theory
Source-Channel Separation
s
sec:
s
source symbol
sec
,
Channel has
c
ch use
sec
and capacity C
bits
ch use
, its capacity in
bits
sec
is C
c
.
C
FB
is the feedback capacity, obviously C
FB
C ( you can
ignore feedback if you like )
In fact, C
FB
= C.
Junmo Kim EE 623: Information Theory
Feedback Communication
When there is feedback, the following lemma is no longer true.
Lemma
Let X take value in A
n
according to some law P
X
(x) and let Y be
distributed according to p
Y[X
=
n
k=1
W(y
k
[x
k
) for some W([)
then I (X; Y) nC where C = max
P
X
I (P
X
; W).
Proof:
I (X; Y) = H(Y) H(Y[X)
= H(Y)
H(Y
i
[X, Y
i 1
)
= H(Y)
H(Y
i
[X
i
) memoryless
(H(Y
i
) H(Y
i
[X
i
))
=
I (X
i
; Y
i
) nC
Junmo Kim EE 623: Information Theory
Feedback Communication
nR = H(W)
= H(W[
W) + I (W;
W)
1 + P
(n)
e
nR + I (W;
W)
1 + P
(n)
e
nR + I (W; Y)
= 1 + P
(n)
e
nR + H(Y
n
)
H(Y
i
[W, Y
i 1
)
= 1 + P
(n)
e
nR + H(Y
n
)
H(Y
i
[W, Y
i 1
, X
i
)
(X
i
is function of (W, Y
i 1
))
= 1 + P
(n)
e
nR + H(Y
n
)
H(Y
i
[X
i
) (memoryless ch)
1 + P
(n)
e
nR +
(H(Y
i
) H(Y
i
[X
i
))
1 + P
(n)
e
nR + nC
Junmo Kim EE 623: Information Theory
Feedback Communication
Example: Binary erasure channel
C
FB
= C = 1 .
Suppose R < C = 1 .
Given some
t
> 0 (very small), choose n large enough so that
Pr (
#?
n
> + ) <
t
Then
Pr (error ) = Pr (
#?
n
> 1 R)
Pr (
#?
n
> + ) ( 1 R > + )
<
t
Junmo Kim EE 623: Information Theory
Differential Entropy
$X\sim\text{Unif}[0, a]$:
$$h(f) = -\int_0^a\frac{1}{a}\log\frac{1}{a}\,dx = \log a$$
$X\sim N(0, \sigma^2)$:
$$h(f) = -\int f(x)\log_e\Big(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{x^2}{2\sigma^2}}\Big)dx = \log_e\sqrt{2\pi\sigma^2} + \frac{E[X^2]}{2\sigma^2} = \frac{1}{2}\ln 2\pi e\sigma^2\ \text{nats}$$
$h(X + c) = h(X)$.
$X_1, \ldots, X_n$ are continuous random variables with joint pdf $f(x_1, \ldots, x_n)$:
$$h(X_1, \ldots, X_n) = -\int f(x_1, \ldots, x_n)\log f(x_1, \ldots, x_n)\,dx_1\cdots dx_n$$
Example: $X\sim N(\mu, K)$:
$$h(X) = \frac{1}{2}\ln\big((2\pi e)^n|K|\big)$$
2) Chain rule: $h(X_1, \ldots, X_n) = \sum_{i=1}^n h(X_i|X_1, \ldots, X_{i-1})$
Junmo Kim EE 623: Information Theory
Typical Set
Theorem: $X_1, \ldots, X_n$ IID with density $f$:
$$-\frac{1}{n}\log f(X_1, \ldots, X_n) \to E[-\log f(X)] = h(X)$$
Typical set $A_\epsilon^{(n)}$:
$$A_\epsilon^{(n)} = \Big\{(x_1, \ldots, x_n)\in S^n : \Big|-\frac{1}{n}\log f(x_1, \ldots, x_n) - h(X)\Big| \le \epsilon\Big\}$$
Properties:
1. $\Pr(A_\epsilon^{(n)}) > 1-\epsilon$ for $n$ big enough
2. $\text{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}$ for all $n$
3. $\text{Vol}(A_\epsilon^{(n)}) \ge (1-\epsilon)\,2^{n(h(X)-\epsilon)}$ for $n$ big enough
Junmo Kim EE 623: Information Theory
Typical Set
Proof:
1) $\forall\epsilon > 0$, $\Pr(|-\frac{1}{n}\log f(x_1, \ldots, x_n) - h(X)| \le \epsilon)\to 1$ as $n\to\infty$, so $\exists n_0$ s.t. $\forall n\ge n_0$, $\Pr(|-\frac{1}{n}\log f(x_1, \ldots, x_n) - h(X)| \le \epsilon) > 1-\epsilon$.
2)
$$1 = \int_{S^n} f(x_1, \ldots, x_n)\,dx_1\cdots dx_n \ge \int_{A_\epsilon^{(n)}} f(x)\,dx \ge \int_{A_\epsilon^{(n)}} 2^{-n(h(X)+\epsilon)}\,dx = 2^{-n(h(X)+\epsilon)}\,\text{Vol}(A_\epsilon^{(n)})$$
$$\Rightarrow\ \text{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}$$
3)
$$1-\epsilon < \Pr(A_\epsilon^{(n)}) = \int_{A_\epsilon^{(n)}} f(x)\,dx \le 2^{-n(h(X)-\epsilon)}\,\text{Vol}(A_\epsilon^{(n)}) \;\Rightarrow\; \text{Vol}(A_\epsilon^{(n)}) \ge (1-\epsilon)\big(2^{h(X)-\epsilon}\big)^n$$
Junmo Kim EE 623: Information Theory
Def: Relative Entropy
$f, g$ : two densities.
$$D(f\|g) = \int_S f(x)\log\frac{f(x)}{g(x)}\,dx$$
$S$ is the support of $f(\cdot)$. If $f(x)\ne 0$ and $g(x) = 0$, $D(f\|g) = \infty$.
Claim: $D(f\|g) \ge 0$.
Pf:
$$-D(f\|g) = \int f(x)\log\frac{g(x)}{f(x)}\,dx = E_f\Big[\log\frac{g(X)}{f(X)}\Big] \le \log E_f\Big[\frac{g(X)}{f(X)}\Big] = \log\int f(x)\frac{g(x)}{f(x)}\,dx \le \log 1 = 0$$
where equality holds iff $g(x) = f(x)$ almost everywhere.
$$h(X_1, \ldots, X_n) = \sum_i h(X_i|X^{i-1}) \le \sum_i h(X_i),\quad\text{with equality iff } X_1, \ldots, X_n\text{ independent.}$$
Junmo Kim EE 623: Information Theory
Inequalities
Among densities $f$ with zero mean and covariance $K$, the Gaussian $\phi\sim N(0, K)$ maximizes differential entropy:
$$-\int f(x)\ln\phi(x)\,dx = \frac{1}{2}\ln\big((2\pi)^n|K|\big) + E_f\Big[\frac{1}{2}x^TK^{-1}x\Big] = \frac{1}{2}\ln\big((2\pi)^n|K|\big) + \frac{n}{2} = \frac{1}{2}\ln\big((2\pi e)^n|K|\big) = h(\phi)$$
since
$$E_f\Big[\frac{1}{2}X^TK^{-1}X\Big] = \frac{1}{2}E_f\Big[\sum_{i,j}X_i(K^{-1})_{ij}X_j\Big] = \frac{1}{2}\sum_{i,j}(K^{-1})_{ij}E_f[X_iX_j] = \frac{1}{2}\sum_j(K^{-1}K)_{jj} = \frac{n}{2}$$
Then $0 \le D(f\|\phi) = -h(f) - \int f\ln\phi = -h(f) + h(\phi)$, so $h(\phi) \ge h(f)$.
Junmo Kim EE 623: Information Theory
Lecture 15
Junmo Kim EE 623: Information Theory
Review
$X\sim N(0, \sigma^2)$:
$$h(f) = -\int f(x)\log_e\Big(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{x^2}{2\sigma^2}}\Big)dx = \log_e\sqrt{2\pi\sigma^2} + \frac{E[X^2]}{2\sigma^2} = \frac{1}{2}\ln 2\pi e\sigma^2\ \text{nats}$$
Junmo Kim EE 623: Information Theory
Review
In particular,
$$h(X) \le \frac{1}{2}\log\big(2\pi e\,E[X^2]\big)$$
Junmo Kim EE 623: Information Theory
Gaussian Channel
$Y = x + Z$, $Z\sim N(0, N)$
If $N = 0$, $C = \infty$.
If $N = 1$, $C = \infty$ (without a limit on $x$).
Power constraint: $\frac{1}{n}\sum_i x_i^2(w) \le P$.
Junmo Kim EE 623: Information Theory
Gaussian Channel
We will show that the following quantity is the capacity:
$$\max_{E[X^2]\le P} I(X; Y)$$
$$I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(X + Z|X) = h(Y) - h(Z) \le \frac{1}{2}\log 2\pi e E[Y^2] - \frac{1}{2}\log 2\pi e N \le \frac{1}{2}\log\frac{2\pi e(P + N)}{2\pi e N}$$
$$E[Y^2] = E[X^2] + E[Z^2] + 2E[XZ] \le P + N\quad(E[XZ] = 0\ \because\ X\perp Z,\ E[Z] = 0)$$
For $X$ s.t. $E[X^2]\le P$, $I(X; Y) \le \frac{1}{2}\log(1 + \frac{P}{N})$. This is achievable if $X\sim N(0, P)$ and thus $\max_{E[X^2]\le P} I(X; Y) = \frac{1}{2}\log(1 + \frac{P}{N})$.
Junmo Kim EE 623: Information Theory
Gaussian Channel: Achievable Rate
Definition
We say that $R$ is achievable if $\forall\epsilon > 0$, $\exists n_0$, s.t. $\forall n > n_0$, there exist a rate-$R$, block length $n$ codebook $\mathcal{C} = \{x(1), \ldots, x(2^{nR})\}\subset\mathbb{R}^n$ and a decoder $\phi : \mathbb{R}^n\to\{1, \ldots, 2^{nR}\}$ s.t. the maximum probability of error $< \epsilon$ and $\frac{1}{n}\sum_i x_i^2(m) \le P$, $\forall m\in\{1, \ldots, 2^{nR}\}$.
$C \triangleq$ supremum of achievable rates.
Junmo Kim EE 623: Information Theory
Gaussian Channel: Capacity
Theorem
The capacity of the power-limited Gaussian channel is
$$C = \frac{1}{2}\log\Big(1 + \frac{P}{N}\Big)$$
Junmo Kim EE 623: Information Theory
Review
1. Fix some $p_X$.
2. Fix some $\epsilon > 0$ and $n$.
3. Generate a random codebook $\mathcal{C}$, IID $\sim p_X$.
4. Reveal $\mathcal{C}$ to encoder and receiver.
5. Design a joint typicality decoder $\phi(\cdot\,; p_X(x)W(y|x), \epsilon, n, \mathcal{C})$.
6. Encoder: $m\mapsto x(m)$ (according to the codebook).
7. Each codebook $\mathcal{C}$ gives $P_e^{(n)}(\mathcal{C})$.
8. Analyze $E[P_e^{(n)}(\mathcal{C})]$, the average over $\mathcal{C}$.
9. We will show that if $R < I(P_X; W)$ then $E[P_e^{(n)}(\mathcal{C})]\to 0$ as $n\to\infty$.
10. By the random coding argument, there exists a deterministic sequence $\mathcal{C}_n$ s.t. $P_e^{(n)}(\mathcal{C}_n)\to 0$.
11. Trick to get $\lambda^{(n)}$ to go to zero.
Junmo Kim EE 623: Information Theory
Direct Part
1. Generate a codebook at random:
1.1 Codewords are chosen independently.
1.2 The components of the codewords are chosen IID from $N(0, P-\epsilon)$.
2. Reveal the codebook to Tx/Rx.
3. Decoder:
3.1 Joint typicality: if there is one and only one codeword $X^n(w)$ that is jointly typical with the received vector, declare $\hat W = w$. Otherwise, declare an error.
3.2 Declare an error if the unique codeword that is typical with $y$ violates the average power constraint.
Junmo Kim EE 623: Information Theory
Direct Part: Error Analysis
Assume $W = 1$.
$E_0$ : the event that $X(1)$ violates the power constraint.
$E_i$ : the event that $(X(i), Y)\in A_\epsilon^{(n)}$.
$$\Pr(\text{Error}\,|\,W = 1) \le \Pr\Big(E_0\cup E_1^C\cup\bigcup_{i=2}^{2^{nR}}E_i\Big) \le \Pr(E_0) + \Pr(E_1^C) + \sum_{i=2}^{2^{nR}}\Pr(E_i)$$
$\Pr(E_0)\to 0$ ($\because\ \frac{1}{n}\sum_i X_i^2(1)\to E[X^2] = P-\epsilon < P$).
$\Pr(E_1^C)\to 0$.
$\Pr(E_i) \le 2^{-n(I(X;Y)-3\epsilon)}$, with $I(X; Y) = \frac{1}{2}\log(1 + \frac{P}{N})$.
Junmo Kim EE 623: Information Theory
Direct Part
Finally, deleting the worst half of the codewords, we obtain a code
with low maximal probability of error.
Also the selected codewords satisfy the power constraint.
(Otherwise, maximal probability of error is 1.)
Junmo Kim EE 623: Information Theory
Converse
Let $(\mathcal{C}, \phi)$ be a codebook of rate $R$, block length $n$ and average probability of error $P_e^{(n)}$.
$$nR = H(W) = H(W|\hat W) + I(W; \hat W) \le 1 + nRP_e^{(n)} + I(W; \hat W)$$
Junmo Kim EE 623: Information Theory
Converse
$$I(W; \hat W) \le I(X^n(W); Y^n) = h(Y^n) - h(Y^n|X^n(W)) = h(Y^n) - h(X^n(W) + Z^n|X^n(W)) = h(Y^n) - h(Z^n)$$
$$= \sum_i\big(h(Y_i|Y^{i-1}) - h(Z_i)\big)\ (\because\ Z^n\text{ iid}) \le \sum_i\big(h(Y_i) - h(Z_i)\big)$$
Junmo Kim EE 623: Information Theory
Converse
$$I(W; \hat W) \le \sum_i\big(h(Y_i) - h(Z_i)\big) \le \sum_i\Big(\frac{1}{2}\log 2\pi e E[Y_i^2] - \frac{1}{2}\log 2\pi e N\Big) = n\cdot\frac{1}{n}\sum_i\frac{1}{2}\log\Big(1 + \frac{E[X_i^2(W)]}{N}\Big)$$
$$\le n\cdot\frac{1}{2}\log\Big(1 + \frac{\frac{1}{n}\sum_i E[X_i^2(W)]}{N}\Big) \le n\cdot\frac{1}{2}\log\Big(1 + \frac{P}{N}\Big)$$
where $\frac{1}{n}\sum_i E[X_i^2(W)] = E[\frac{1}{n}\sum_i X_i^2(W)] \le P$ and the middle inequality uses the concavity of $\log$ (Jensen).
Junmo Kim EE 623: Information Theory
Lecture 16
Junmo Kim EE 623: Information Theory
Review
Gaussian channel: $Y = X + Z$, $Z\sim N(0, N)$,
$$\frac{1}{n}\sum_{k=1}^n x_k(m)^2 \le P,\quad\forall m\in\{1, \ldots, 2^{nR}\}$$
$\frac{1}{2}\log(1 + \frac{P}{N})$ is achievable.
Junmo Kim EE 623: Information Theory
Band Limited Channel
$N_W(t)$ : white Gaussian noise process, stationary, $E[N_W(t)N_W(t+\tau)] = \frac{N_0}{2}\delta(\tau)$.
$Y(t) = Y_{LPF}(t) + Y_{HPF}(t)$, where $Y_{LPF}(t) = Y(t)*h(t)$.
$X(t)$ and $Y_{HPF}(t)$ are independent given $Y_{LPF}$; $Y_{LPF}$ is a sufficient statistic.
Sample the filtered noise: $Z_k = N_{W,LPF}(\frac{k}{2W})$.
Autocovariance function: $K_{N_{W,LPF},N_{W,LPF}}(\tau) = \text{cov}(N_{W,LPF}(t), N_{W,LPF}(t+\tau)) = E[N_{W,LPF}(t)N_{W,LPF}(t+\tau)]$
$$K_{N_{W,LPF},N_{W,LPF}}(\tau) = \int S_{N_{W,LPF}}(f)\,e^{i2\pi f\tau}\,df$$
$$E[Z_kZ_l] = E\Big[N_{W,LPF}\Big(\tfrac{k}{2W}\Big)N_{W,LPF}\Big(\tfrac{l}{2W}\Big)\Big] = K_{N_{W,LPF},N_{W,LPF}}\Big(\tfrac{k-l}{2W}\Big)$$
When $k = l$, $E[Z_kZ_l] = N_0W$; when $k\ne l$, $E[Z_kZ_l] = 0$.
Thus $\{Z_k\}$ is IID $N(0, N_0W)$.
Junmo Kim EE 623: Information Theory
Band Limited Channel
Power constraint in continuous time:
\[ \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} X(t)^2\,dt \le P. \]
Sampling at rate $2W$, with $n = T\big/\tfrac{1}{2W} = 2WT$ samples in $[0,T]$, the sampling theorem gives $\int_{-T}^{T}X(t)^2\,dt = \frac{1}{2W}\sum_{k=-n}^{n}X_k^2$, hence
\[ \frac{1}{2T}\int_{-T}^{T} X(t)^2\,dt = \frac{1}{2n}\sum_{k=-n}^{n} X_k^2 \le P, \]
i.e. the per-sample power is at most $P$.
Thus $C = \frac{1}{2}\log\bigl(1 + \frac{P}{N_0 W}\bigr)$ bits / sample.
Parallel Gaussian channels: consider $L$ independent Gaussian channels in parallel,
\[ Y^{(l)} = X^{(l)} + Z^{(l)}, \qquad Z^{(l)} \sim N(0, N_l), \qquad \sum_{l=1}^{L} P_l \le P. \]
$C = \max I(X^{(1)},\ldots,X^{(L)}; Y^{(1)},\ldots,Y^{(L)})$, where the maximum is over all input distributions $f_{X^{(1)},\ldots,X^{(L)}}(\cdot,\ldots,\cdot)$ satisfying the power constraint $\sum_{l=1}^{L} E[(X^{(l)})^2] \le P$.
This leads to: maximize $\sum_l \frac{1}{2}\log\bigl(1+\frac{P_l}{N_l}\bigr)$ subject to $\sum_l P_l \le P$.
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels
\begin{align*}
I(X^{(1)},\ldots,X^{(L)}; Y^{(1)},\ldots,Y^{(L)}) &= h(Y^{(1)},\ldots,Y^{(L)}) - h(Y^{(1)},\ldots,Y^{(L)}|X^{(1)},\ldots,X^{(L)}) \\
&= h(Y^{(1)},\ldots,Y^{(L)}) - h(Z^{(1)},\ldots,Z^{(L)}|X^{(1)},\ldots,X^{(L)}) \\
&= h(Y^{(1)},\ldots,Y^{(L)}) - h(Z^{(1)},\ldots,Z^{(L)}) \\
&= h(Y^{(1)},\ldots,Y^{(L)}) - \sum_l h(Z^{(l)}) \\
&\le \sum_l \bigl(h(Y^{(l)}) - h(Z^{(l)})\bigr) \le \sum_l \tfrac{1}{2}\log\Bigl(1+\frac{P_l}{N_l}\Bigr)
\end{align*}
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels
Maximize $f(P_1, P_2, \ldots, P_L) = \sum_l \frac{1}{2}\log\bigl(1+\frac{P_l}{N_l}\bigr)$ subject to $\sum_l P_l \le P$.
$f(P_1, \ldots, P_L)$ is a concave function.
Constraints: $P_l \ge 0$ for $l = 1,\ldots,L$, and $\sum_l P_l = P$.
[Figure: the gain $\frac{1}{2}\log(1+\frac{P_l}{N_l})$ plotted as a function of $P_l$.]
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels
\[ \frac{\partial}{\partial P_l}\,\frac{1}{2}\log\Bigl(1+\frac{P_l}{N_l}\Bigr) = \frac{1}{2}\cdot\frac{1/N_l}{1+P_l/N_l} = \frac{1}{2}\cdot\frac{1}{P_l+N_l} \]
KKT conditions (with Lagrange multiplier $\lambda$, equivalently a water level $\nu$):
\[ \frac{1}{2}\cdot\frac{1}{P_l+N_l} = \lambda \ \Leftrightarrow\ P_l + N_l = \nu \quad \text{if } P_l > 0, \]
\[ \frac{1}{2}\cdot\frac{1}{P_l+N_l} \le \lambda \ \Leftrightarrow\ N_l \ge \nu \quad \text{if } P_l = 0. \]
Optimum: $P_l = (\nu - N_l)^+$, where $x^+ = \begin{cases} x & x > 0 \\ 0 & x \le 0\end{cases}$, and $\nu$ is chosen so that $\sum_l (\nu - N_l)^+ = P$.
Junmo Kim EE 623: Information Theory
Water-Filling for Parallel Gaussian Channels
Optimum: $P_l = (\nu - N_l)^+$, where
\[ x^+ = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases} \]
and the water level $\nu$ is chosen so that $\sum_l (\nu - N_l)^+ = P$.
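A minimal water-filling sketch (my own illustration, not from the notes): given noise levels $N_l$ and total power $P$, it finds the water level $\nu$ by bisection so that $\sum_l(\nu-N_l)^+ = P$, then evaluates the resulting capacity.

```python
import math

def water_filling(noise, P, tol=1e-9):
    """Return powers P_l = (nu - N_l)^+ with sum(P_l) = P, found by bisection on nu."""
    def allocated(nu):
        return sum(max(nu - N, 0.0) for N in noise)
    lo, hi = min(noise), max(noise) + P   # nu lies in this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if allocated(mid) > P:
            hi = mid
        else:
            lo = mid
    nu = 0.5 * (lo + hi)
    return [max(nu - N, 0.0) for N in noise], nu

noise = [1.0, 2.0, 4.0]                   # example noise levels (arbitrary)
powers, nu = water_filling(noise, P=3.0)
C = sum(0.5 * math.log2(1 + p / n) for p, n in zip(powers, noise))
print("powers:", powers, " water level:", nu, " capacity:", C)
```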
Junmo Kim EE 623: Information Theory
Lecture 17
Junmo Kim EE 623: Information Theory
Method of Types
Motivating question (hypothesis testing): how fast can $\Pr(\hat{H} = H_1 \mid H_2)$ go to zero, subject to $\Pr(\hat{H} = H_2 \mid H_1) \to 0$? Answer (to be shown):
\[ \Pr(\hat{H} = H_1 \mid H_2) \approx 2^{-nD(P_1\|P_2)}. \]
Junmo Kim EE 623: Information Theory
Method of Types
Let $x \in \mathcal{X}^n$, $|\mathcal{X}| < \infty$. The type (empirical distribution) of $x$ is
\[ P_x(x) = \frac{1}{n}\sum_{k=1}^{n} I\{x_k = x\}, \qquad I\{\text{statement}\} = \begin{cases} 1 & \text{if statement is true} \\ 0 & \text{otherwise.} \end{cases} \]
e.g. $\mathcal{X} = \{a,b,c\}$, $n = 5$, $x = aabcb$: $P_x(a) = \frac{2}{5}$, $P_x(b) = \frac{2}{5}$, $P_x(c) = \frac{1}{5}$.
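A small sketch (added for illustration) that computes the type of the example sequence $x = aabcb$ exactly as in the definition above.

```python
from collections import Counter
from fractions import Fraction

def type_of(seq, alphabet):
    """Empirical distribution P_x(a) = (1/n) * #{k : x_k = a}."""
    n = len(seq)
    counts = Counter(seq)
    return {a: Fraction(counts.get(a, 0), n) for a in alphabet}

print(type_of("aabcb", "abc"))   # {'a': 2/5, 'b': 2/5, 'c': 1/5}
```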
Junmo Kim EE 623: Information Theory
Method of Types
$\mathcal{P}_n(\mathcal{X})$ = set of all distributions on $\mathcal{X}$ with denominator $n$:
\[ \mathcal{P}_n(\mathcal{X}) = \{P \in \mathcal{P}(\mathcal{X}) : nP(x) \in \mathbb{Z}\ \forall x\}, \]
where $\mathcal{P}(\mathcal{X})$ is the probability simplex, the set of all PMFs on $\mathcal{X}$:
\[ \mathcal{P} = \Bigl\{(p_1, p_2, \ldots, p_{|\mathcal{X}|}) : p_i \ge 0,\ \sum_i p_i = 1\Bigr\}, \qquad \mathcal{P}_n = \Bigl\{\Bigl(\frac{n_1}{n}, \frac{n_2}{n}, \ldots, \frac{n_{|\mathcal{X}|}}{n}\Bigr) : \sum_i n_i = n,\ n_i \ge 0,\ n_i \in \mathbb{Z}\Bigr\}. \]
Junmo Kim EE 623: Information Theory
Method of Types
$Q^n(T(P))$: the probability of the type class $T(P)$ under $Q^n$, the $n$-fold product distribution.
Junmo Kim EE 623: Information Theory
Notations
$\mathcal{P}$: probability simplex.
$\mathcal{P}_n$: set of types with denominator $n$.
$P_x$: type of the sequence $x$.
$X_1, \ldots, X_n$ are i.i.d. according to $Q$. Let $x \in \mathcal{X}^n$ be some sequence of type $P_x$.
e.g. $Q(H) = \frac{1}{3}$, $Q(T) = \frac{2}{3}$:
\[ \Pr[(X_1,\ldots,X_n) = (HHHTTTTTTT)] = \Bigl(\frac{1}{3}\Bigr)^3\Bigl(\frac{2}{3}\Bigr)^7. \]
Lemma
\[ Q^n(x) = 2^{-n(H(P_x)+D(P_x\|Q))} \]
The probability of a sequence depends only on its type.
Junmo Kim EE 623: Information Theory
Probability of a Sequence and Its Type
Lemma
\[ Q^n(x) = 2^{-n(H(P_x)+D(P_x\|Q))} \]
Proof.
\begin{align*}
Q^n(x) &= \prod_{x\in\mathcal{X}} Q(x)^{N(x)} \qquad \Bigl(N(x) = \sum_k I\{x_k = x\}\Bigr) \\
&= \prod_{x\in\mathcal{X}} Q(x)^{nP_x(x)} \qquad \Bigl(P_x(x) = \frac{N(x)}{n}\Bigr) \\
&= 2^{\,n\sum_x P_x(x)\log Q(x)} \\
&= 2^{-n\bigl[\sum_x P_x(x)\log\frac{P_x(x)}{Q(x)} + \sum_x P_x(x)\log\frac{1}{P_x(x)}\bigr]} \\
&= 2^{-n(D(P_x\|Q)+H(P_x))}
\end{align*}
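To see the lemma in action, here is a hedged numerical check (my own choice of $Q$, using the example sequence from the previous slide): it compares the direct product $\prod_k Q(x_k)$ with $2^{-n(H(P_x)+D(P_x\|Q))}$.

```python
import math
from collections import Counter

def entropy(P):
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def kl(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

Q = {"H": 1/3, "T": 2/3}            # example distribution
x = "HHHTTTTTTT"                     # example sequence
n = len(x)
Px = {a: c / n for a, c in Counter(x).items()}

direct  = math.prod(Q[a] for a in x)              # Q^n(x) computed directly
formula = 2 ** (-n * (entropy(Px) + kl(Px, Q)))   # via the type lemma
print(direct, formula)                             # agree up to floating-point rounding
```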
Junmo Kim EE 623: Information Theory
Size of a Type Class T(P)
\[ |T(P)| \le 2^{nH(P)} \]
Proof.
Draw $X_1, \ldots, X_n$ i.i.d. according to $P$.
\begin{align*}
1 \ge \Pr(X^n \in T(P)) = P^n(T(P)) &= \sum_{x\in T(P)} P^n(x) = \sum_{x\in T(P)} 2^{-nH(P)} \qquad (\text{since } D(P_x\|P) = 0 \text{ for } x \in T(P)) \\
&= |T(P)|\,2^{-nH(P)}
\end{align*}
Thus $|T(P)| \le 2^{nH(P)}$.
Junmo Kim EE 623: Information Theory
Size of a Type Class T(P)
Theorem
\[ \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{nH(P)} \le |T(P)| \le 2^{nH(P)} \]
Proof: We have $P^n(T(P)) \ge P^n(T(\hat{P}))$ for all $\hat{P} \in \mathcal{P}_n$ (the type class of $P$ is the most probable type class under $P^n$). Hence
\[ 1 = \sum_{\hat{P}\in\mathcal{P}_n} P^n(T(\hat{P})) \le |\mathcal{P}_n| \max_{\hat{P}\in\mathcal{P}_n} P^n(T(\hat{P})) = |\mathcal{P}_n|\,P^n(T(P)) \le (n+1)^{|\mathcal{X}|}\,P^n(T(P)). \]
Junmo Kim EE 623: Information Theory
Size of a Type Class T(P)
Thus we have $P^n(T(P)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}$.
For $x \in T(P)$, $P^n(x) = 2^{-nH(P)}$, so
\[ P^n(T(P)) = |T(P)|\,2^{-nH(P)} \ge \frac{1}{(n+1)^{|\mathcal{X}|}}. \]
Therefore $|T(P)| \ge \dfrac{2^{nH(P)}}{(n+1)^{|\mathcal{X}|}}$.
Junmo Kim EE 623: Information Theory
Probability of Type Class
Theorem
\[ \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)} \]
$Q^n(T(P))$ can be viewed as the probability of a rare event; $D(P\|Q)$ determines how fast this probability decays as $n$ grows.
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
Theorem
Let $E \subseteq \mathcal{P}$. Suppose $X_1, \ldots, X_n$ are i.i.d. according to $Q$. Then
\[ \Pr(\underbrace{P_{X_1,\ldots,X_n}}_{\text{empirical type}} \in E) \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}, \qquad P^* = \arg\min_{P\in E} D(P\|Q). \]
e.g. $Q = (\tfrac{1}{2},\tfrac{1}{2})$ on $\{H,T\}$, $E = \{P \in \mathcal{P} : P(H) > \tfrac{3}{4}\}$:
\[ \Pr(P_{X_1,\ldots,X_n} \in E) = \Pr(\#\text{ of }H\text{ exceeds }75\%). \]
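For this coin example the minimizer over $E$ is $P^* = (\frac{3}{4},\frac{1}{4})$ (the boundary point closest to $Q$ in relative entropy). The added sketch below computes the exponent $D(P^*\|Q)$ and, for one value of $n$, compares $2^{-nD(P^*\|Q)}$ with the exact binomial tail probability; the choice $n = 200$ is arbitrary.

```python
import math
from math import comb

def kl(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

Q = (0.5, 0.5)
P_star = (0.75, 0.25)                     # arg min over E = {P : P(H) >= 3/4}
D = kl(P_star, Q)
print("Sanov exponent D(P*||Q) =", D)     # about 0.189 bits

n = 200
exact = sum(comb(n, k) for k in range(math.ceil(0.75 * n), n + 1)) / 2 ** n
print("exact tail:", exact, "  2^{-nD}:", 2 ** (-n * D))
```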
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
\[ \Pr(\underbrace{P_{X_1,\ldots,X_n}}_{\text{empirical type}} \in E) \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}, \qquad P^* = \arg\min_{P\in E} D(P\|Q). \]
Proof.
\[ \Pr(P_X \in E) = \sum_{P\in E\cap\mathcal{P}_n} Q^n(T(P)) \le \sum_{P\in E\cap\mathcal{P}_n} 2^{-nD(P\|Q)} \le \sum_{P\in E\cap\mathcal{P}_n} 2^{-nD(P^*\|Q)} \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}. \]
Junmo Kim EE 623: Information Theory
Lecture 18
Junmo Kim EE 623: Information Theory
Review
$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$
$Q^n(x) = 2^{-n(H(P_x)+D(P_x\|Q))}$
$|T(P)| = \dfrac{n!}{\prod_{x\in\mathcal{X}} (nP(x))!}$ : number of permutations
$\dfrac{1}{(n+1)^{|\mathcal{X}|}}\,2^{nH(P)} \le |T(P)| \le 2^{nH(P)}$
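As a sanity check on these bounds (an added example with an arbitrary small type), the sketch below computes $|T(P)| = \frac{n!}{\prod_x (nP(x))!}$ exactly and compares it with $\frac{2^{nH(P)}}{(n+1)^{|\mathcal{X}|}}$ and $2^{nH(P)}$.

```python
import math

def type_class_size(counts):
    """|T(P)| = n! / prod (n P(x))!  for a type given by integer counts."""
    n = sum(counts)
    size = math.factorial(n)
    for c in counts:
        size //= math.factorial(c)
    return size

counts = [6, 3, 1]                       # type (6/10, 3/10, 1/10) on a ternary alphabet
n = sum(counts)
H = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
lower = 2 ** (n * H) / (n + 1) ** len(counts)
upper = 2 ** (n * H)
print(lower, type_class_size(counts), upper)    # lower <= |T(P)| <= upper
```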
Junmo Kim EE 623: Information Theory
Probability of Type Class
Theorem
\[ \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)} \]
$Q^n(T(P))$ can be viewed as the probability of a rare event; $D(P\|Q)$ determines how fast this probability decays as $n$ grows.
Proof.
If $x \in T(P)$, then $Q^n(x) = 2^{-n(H(P_x)+D(P_x\|Q))} = 2^{-n(H(P)+D(P\|Q))}$, so
\[ Q^n(T(P)) = \sum_{x\in T(P)} Q^n(x) = |T(P)|\,2^{-n(H(P)+D(P\|Q))}. \]
As $\frac{1}{(n+1)^{|\mathcal{X}|}}2^{nH(P)} \le |T(P)| \le 2^{nH(P)}$, we have
\[ \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)}. \]
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
Theorem
Let $E \subseteq \mathcal{P}$. Suppose $X_1, \ldots, X_n$ are i.i.d. according to $Q$. Then
\[ \Pr(\underbrace{P_{X_1,\ldots,X_n}}_{\text{empirical type}} \in E) \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}, \qquad P^* = \arg\min_{P\in E} D(P\|Q). \]
e.g. $Q = (\tfrac{1}{2},\tfrac{1}{2})$ on $\{H,T\}$, $E = \{P \in \mathcal{P} : P(H) \ge \tfrac{3}{4}\}$:
\[ \Pr(P_{X_1,\ldots,X_n} \in E) = \Pr(\#\text{ of }H \ge 75\%). \]
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
\[ \Pr(\underbrace{P_{X_1,\ldots,X_n}}_{\text{empirical type}} \in E) \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}, \qquad P^* = \arg\min_{P\in E} D(P\|Q). \]
Proof.
\[ \Pr(P_X \in E) = \sum_{P\in E\cap\mathcal{P}_n} Q^n(T(P)) \le \sum_{P\in E\cap\mathcal{P}_n} 2^{-nD(P\|Q)} \le \sum_{P\in E\cap\mathcal{P}_n} 2^{-nD(P^*\|Q)} \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}. \]
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
If $E$ is the closure of its interior, then
\[ \lim_{n\to\infty}\frac{1}{n}\log\Pr(P_X \in E) = -D(P^*\|Q). \]
Proof.
In this case, for all large $n$ the set $E\cap\mathcal{P}_n$ is nonempty, and we can find a distribution $P_n \in E\cap\mathcal{P}_n$ close to $P^*$, with $D(P_n\|Q) \to D(P^*\|Q)$. Then
\[ \Pr(P_X \in E) = \sum_{P\in E\cap\mathcal{P}_n} Q^n(T(P)) \ge Q^n(T(P_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{-nD(P_n\|Q)}. \]
Combining this lower bound with the upper bound, the limit is $-D(P^*\|Q)$.
Next, the conditional limit theorem: with $P^*$ as above, $|\Pr(X_1 = a \mid P_{X_1,\ldots,X_n} \in E) - P^*(a)| \to 0$ as $n \to \infty$.
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Lemma
For a closed convex set $E \subseteq \mathcal{P}$ and $Q$ not in $E$, with $P^* = \arg\min_{P\in E} D(P\|Q)$,
\[ D(P\|Q) \ge D(P\|P^*) + D(P^*\|Q) \qquad \text{for all } P \in E. \]
Hence if $D(P\|Q)$ is close to $D(P^*\|Q)$, then $D(P\|P^*)$ is very small.
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Lemma
(Pinsker) In natural logs,
\[ \|P - Q\|_1 \le \sqrt{2D(P\|Q)}. \]
$L_1$ norm: given $P, Q \in \mathcal{P}$, with $A = \{x : P(x) \ge Q(x)\}$,
\begin{align*}
\|P-Q\|_1 = \sum_{x\in\mathcal{X}} |P(x)-Q(x)| &= \sum_{x\in A}(P(x)-Q(x)) + \sum_{x\in A^C}(Q(x)-P(x)) \\
&= P(A) - Q(A) + (1-Q(A)) - (1-P(A)) = 2(P(A)-Q(A)).
\end{align*}
In fact, $\dfrac{\|P-Q\|_1}{2} = \max_{B\subseteq\mathcal{X}} (P(B) - Q(B))$.
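A quick numerical check of Pinsker's inequality on random PMFs (added illustration; the lemma is stated in natural logs, so $D$ is computed with $\ln$ here).

```python
import math, random

def l1(P, Q):
    return sum(abs(p - q) for p, q in zip(P, Q))

def kl_nats(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

random.seed(0)
for _ in range(5):
    raw_p = [random.random() for _ in range(4)]
    raw_q = [random.random() for _ in range(4)]
    P = [v / sum(raw_p) for v in raw_p]
    Q = [v / sum(raw_q) for v in raw_q]
    assert l1(P, Q) <= math.sqrt(2 * kl_nats(P, Q)) + 1e-12
    print(l1(P, Q), math.sqrt(2 * kl_nats(P, Q)))
```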
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Let $D^* = D(P^*\|Q) = \min_{P\in E} D(P\|Q)$.
Let $S_t = \{P : D(P\|Q) \le t\}$, and consider $S_{D^*+\delta}$ and $S_{D^*+2\delta}$.
Claim: $\Pr(P_{X_1,\ldots,X_n} \in E \cap S^C_{D^*+2\delta} \mid P_{X_1,\ldots,X_n} \in E)$ is very small.
By Sanov's theorem, $\Pr(P_{X_1,\ldots,X_n} \in E \cap S^C_{D^*+2\delta}) \le (n+1)^{|\mathcal{X}|}\,2^{-n(D^*+2\delta)}$, and by the lower bound, $\Pr(P_{X_1,\ldots,X_n} \in E) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{-n(D^*+\delta)}$.
Therefore
\[ \Pr(P_{X_1,\ldots,X_n} \in E\cap S^C_{D^*+2\delta} \mid P_{X_1,\ldots,X_n} \in E) = \frac{\Pr(P_{X_1,\ldots,X_n} \in E\cap S^C_{D^*+2\delta})}{\Pr(P_{X_1,\ldots,X_n} \in E)} \le \frac{(n+1)^{|\mathcal{X}|}\,2^{-n(D^*+2\delta)}}{(n+1)^{-|\mathcal{X}|}\,2^{-n(D^*+\delta)}} = (n+1)^{2|\mathcal{X}|}\,2^{-n\delta}. \]
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Let $A = S_{D^*+2\delta} \cap E$. For all $P \in A$, $D(P\|Q) \le D^* + 2\delta$, so
\[ D(P\|P^*) + D(P^*\|Q) \le D(P\|Q) \le D^* + 2\delta, \]
and hence $D(P\|P^*) \le 2\delta$.
$P_{X_1,\ldots,X_n} \in A$ implies that $D(P_{X_1,\ldots,X_n}\|P^*) \le 2\delta$.
Since $\Pr(P_{X_1,\ldots,X_n} \in A \mid P_{X_1,\ldots,X_n} \in E) \to 1$, we have
\[ \Pr\bigl(D(P_{X_1,\ldots,X_n}\|P^*) \le 2\delta \mid P_{X_1,\ldots,X_n} \in E\bigr) \to 1. \]
Since $|P_{X_1,\ldots,X_n}(a) - P^*(a)| \le \|P_{X_1,\ldots,X_n} - P^*\|_1 \le \sqrt{2D(P_{X_1,\ldots,X_n}\|P^*)}$,
\[ \Pr\bigl(|P_{X_1,\ldots,X_n}(a) - P^*(a)| \ge \epsilon \mid P_{X_1,\ldots,X_n} \in E\bigr) \to 0, \]
so $\Pr(X_1 = a \mid P_{X_1,\ldots,X_n} \in E) \to P^*(a)$ in probability, $\forall a \in \mathcal{X}$.
Junmo Kim EE 623: Information Theory
Lecture 19
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Let $P^* \in E$ achieve $\inf_{P\in E} D(P\|Q)$, where $Q$ is not in $E$. Then
\[ |Q^n(X_1 = a \mid P_{X_1,\ldots,X_n} \in E) - P^*(a)| \to 0 \quad \text{as } n \to \infty. \]
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Lemma
For a closed convex set $E \subseteq \mathcal{P}$ and $Q$ not in $E$, with $P^* = \arg\min_{P\in E} D(P\|Q)$,
\[ D(P\|Q) \ge D(P\|P^*) + D(P^*\|Q) \qquad \text{for all } P \in E, \]
so if $D(P\|Q)$ is close to $D(P^*\|Q)$, then $D(P\|P^*)$ is very small.
Lemma
(Pinsker) In natural logs, $\|P - Q\|_1 \le \sqrt{2D(P\|Q)}$.
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Let $D^* = D(P^*\|Q) = \min_{P\in E} D(P\|Q)$.
Let $S_t = \{P : D(P\|Q) \le t\}$, and consider $S_{D^*+\delta}$ and $S_{D^*+2\delta}$.
Claim: $\Pr(P_{X_1,\ldots,X_n} \in E \cap S^C_{D^*+2\delta} \mid P_{X_1,\ldots,X_n} \in E)$ is very small.
By Sanov's theorem, $\Pr(P_{X_1,\ldots,X_n} \in E \cap S^C_{D^*+2\delta}) \le (n+1)^{|\mathcal{X}|}\,2^{-n(D^*+2\delta)}$, and by the lower bound, $\Pr(P_{X_1,\ldots,X_n} \in E) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\,2^{-n(D^*+\delta)}$.
Therefore
\[ \Pr(P_{X_1,\ldots,X_n} \in E\cap S^C_{D^*+2\delta} \mid P_{X_1,\ldots,X_n} \in E) = \frac{\Pr(P_{X_1,\ldots,X_n} \in E\cap S^C_{D^*+2\delta})}{\Pr(P_{X_1,\ldots,X_n} \in E)} \le \frac{(n+1)^{|\mathcal{X}|}\,2^{-n(D^*+2\delta)}}{(n+1)^{-|\mathcal{X}|}\,2^{-n(D^*+\delta)}} = (n+1)^{2|\mathcal{X}|}\,2^{-n\delta}. \]
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Let $A = S_{D^*+2\delta} \cap E$. For all $P \in A$, $D(P\|Q) \le D^* + 2\delta$, so
\[ D(P\|P^*) + D(P^*\|Q) \le D(P\|Q) \le D^* + 2\delta, \]
and hence $D(P\|P^*) \le 2\delta$.
$P_{X_1,\ldots,X_n} \in A$ implies that $D(P_{X_1,\ldots,X_n}\|P^*) \le 2\delta$.
Since $\Pr(P_{X_1,\ldots,X_n} \in A \mid P_{X_1,\ldots,X_n} \in E) \to 1$, we have
\[ \Pr\bigl(D(P_{X_1,\ldots,X_n}\|P^*) \le 2\delta \mid P_{X_1,\ldots,X_n} \in E\bigr) \to 1. \]
Since $|P_{X_1,\ldots,X_n}(a) - P^*(a)| \le \|P_{X_1,\ldots,X_n} - P^*\|_1 \le \sqrt{2D(P_{X_1,\ldots,X_n}\|P^*)}$,
\[ \Pr\bigl(|P_{X_1,\ldots,X_n}(a) - P^*(a)| \ge \epsilon \mid P_{X_1,\ldots,X_n} \in E\bigr) \to 0, \]
so $\Pr(X_1 = a \mid P_{X_1,\ldots,X_n} \in E) \to P^*(a)$ in probability, $\forall a \in \mathcal{X}$.
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
\[ \Pr\bigl(|P_{X_1,\ldots,X_n}(a) - P^*(a)| \ge \epsilon \mid P_{X_1,\ldots,X_n} \in E\bigr) \to 0. \]
\[ \Pr(X_1 = a \mid P_{X_1,\ldots,X_n} \in E) \to P^*(a) \ \text{in probability}, \quad \forall a \in \mathcal{X}. \]
Junmo Kim EE 623: Information Theory
Hypothesis Testing
Observe $X_1, \ldots, X_n$, where the $X_i$ are IID.
Hypothesis testing:
$H_1$: $X_i$ IID $\sim P_1$
$H_2$: $X_i$ IID $\sim P_2$
Declare $H_1$ if $(X_1,\ldots,X_n) \in A_n$, where $A_n \subseteq \mathcal{X}^n$; declare $H_2$ if $(X_1,\ldots,X_n) \in A_n^C$.
Error probabilities:
Type I (false alarm): $\alpha_n = P_1^n(A_n^C) = \Pr(\hat{H}=H_2 \mid H_1)$
Type II (missed detection): $\beta_n = P_2^n(A_n) = \Pr(\hat{H}=H_1 \mid H_2)$
\[ \beta_n^\epsilon = \min_{A_n \subseteq \mathcal{X}^n:\ P_1^n(A_n^C) < \epsilon} P_2^n(A_n) \]
Junmo Kim EE 623: Information Theory
Hypothesis Testing
\[ \beta_n^\epsilon = \min_{A_n \subseteq \mathcal{X}^n:\ P_1^n(A_n^C) < \epsilon} P_2^n(A_n) \]
Given an observation $X \in \mathcal{X}^n$, declare $H_1$ if $\dfrac{P_1^n(X)}{P_2^n(X)} > T$, i.e.
\[ A_n = \Bigl\{x : \frac{P_1^n(x)}{P_2^n(x)} > T\Bigr\}, \]
where $T$ is chosen to meet $P_1^n(A_n^C) \le \epsilon$.
Suppose $\alpha_n \to 0$. Then
\[ \beta_n \approx 2^{-nD(P_1\|P_2)}, \]
where $P_2$ is the true distribution (i.e. under $H_2$).
Take
\[ A_n = \Bigl\{x : 2^{n(D(P_1\|P_2)-\delta)} < \frac{P_1^n(x)}{P_2^n(x)} < 2^{n(D(P_1\|P_2)+\delta)}\Bigr\}. \]
With $P_1^n(x) = \prod_{k=1}^{n} P_1(x_k)$ and $P_2^n(x) = \prod_{k=1}^{n} P_2(x_k)$,
\[ \frac{1}{n}\log\frac{P_1^n(x)}{P_2^n(x)} = \frac{1}{n}\sum_k \log\frac{P_1(x_k)}{P_2(x_k)}, \]
so $x \in A_n$ iff $\frac{1}{n}\sum_k \log\frac{P_1(x_k)}{P_2(x_k)} \in (D(P_1\|P_2)-\delta,\ D(P_1\|P_2)+\delta)$.
Claim: $\alpha_n \to 0$. Indeed, by the L.L.N. under $H_1$,
\[ \frac{1}{n}\sum_k \log\frac{P_1(X_k)}{P_2(X_k)} \to E_{P_1}\Bigl[\log\frac{P_1(X)}{P_2(X)}\Bigr] = D(P_1\|P_2). \]
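The sketch below (an added illustration with arbitrary example distributions $P_1$, $P_2$) simulates the normalized log-likelihood ratio under $H_1$ and checks that it concentrates around $D(P_1\|P_2)$, which is the key step above.

```python
import math, random

P1 = [0.5, 0.3, 0.2]      # example distributions (my choice, not from the notes)
P2 = [0.2, 0.3, 0.5]
D = sum(p * math.log2(p / q) for p, q in zip(P1, P2) if p > 0)

def normalized_llr(samples):
    """(1/n) * sum_k log2( P1(x_k) / P2(x_k) )."""
    return sum(math.log2(P1[x] / P2[x]) for x in samples) / len(samples)

random.seed(1)
n = 5000
samples_H1 = random.choices(range(3), weights=P1, k=n)
print("D(P1||P2)              =", D)
print("(1/n) log LLR under H1 =", normalized_llr(samples_H1))  # close to D by the LLN
```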
Junmo Kim EE 623: Information Theory
Achievability
\[ A_n = \Bigl\{x : 2^{n(D(P_1\|P_2)-\delta)} < \frac{P_1^n(x)}{P_2^n(x)} < 2^{n(D(P_1\|P_2)+\delta)}\Bigr\} \]
\begin{align*}
\beta_n = \sum_{x\in A_n} P_2^n(x) &\le \sum_{x\in A_n} P_1^n(x)\,2^{-n(D(P_1\|P_2)-\delta)} \\
&= 2^{-n(D(P_1\|P_2)-\delta)} \sum_{x\in A_n} P_1^n(x) = 2^{-n(D(P_1\|P_2)-\delta)}(1-\alpha_n)
\end{align*}
\[ \frac{1}{n}\log\beta_n \le -(D(P_1\|P_2)-\delta) + \frac{\log(1-\alpha_n)}{n} \]
\[ \lim_{\delta\to 0}\lim_{n\to\infty}\frac{1}{n}\log \min_{A_n:\,P_1^n(A_n^C)<\epsilon} P_2^n(A_n) \le -D(P_1\|P_2). \]
Junmo Kim EE 623: Information Theory
Converse
Lemma
Let $B_n \subseteq \mathcal{X}^n$ be any acceptance region (a set of sequences $x_1, x_2, \ldots, x_n$) such that $\alpha_{B_n} = P_1^n(B_n^C) < \epsilon$. Then
\[ \beta_{B_n} = P_2^n(B_n) > (1-2\epsilon)\,2^{-n(D(P_1\|P_2)+\delta)}, \]
which implies $\frac{1}{n}\log\beta_{B_n} > -(D(P_1\|P_2)+\delta) + \frac{\log(1-2\epsilon)}{n}$.
Proof:
Since $P_1^n(A_n) \to 1$ and $P_1^n(B_n) \to 1$, we have $P_1^n(A_n\cap B_n) \to 1$.
More precisely, if $P_1^n(A_n) > 1-\epsilon$ and $P_1^n(B_n) > 1-\epsilon$, then $P_1^n(A_n\cap B_n) > 1-2\epsilon$:
\[ P_1^n((A_n\cap B_n)^C) = P_1^n(A_n^C \cup B_n^C) \le P_1^n(A_n^C) + P_1^n(B_n^C) < 2\epsilon \ \Rightarrow\ P_1^n(A_n\cap B_n) > 1-2\epsilon. \]
Junmo Kim EE 623: Information Theory
Converse
Lemma (restated)
Let $B_n \subseteq \mathcal{X}^n$ be any acceptance region with $P_1^n(B_n^C) < \epsilon$. Then $\beta_{B_n} = P_2^n(B_n) > (1-2\epsilon)\,2^{-n(D(P_1\|P_2)+\delta)}$, which implies $\frac{1}{n}\log\beta_{B_n} > -(D(P_1\|P_2)+\delta) + \frac{\log(1-2\epsilon)}{n}$.
We showed $P_1^n(A_n\cap B_n) > 1-2\epsilon$. Thus,
\begin{align*}
P_2^n(B_n) \ge P_2^n(A_n\cap B_n) = \sum_{x^n\in A_n\cap B_n} P_2^n(x^n) &\ge \sum_{x^n\in A_n\cap B_n} P_1^n(x^n)\,2^{-n(D(P_1\|P_2)+\delta)} \\
&= 2^{-n(D(P_1\|P_2)+\delta)}\,P_1^n(A_n\cap B_n) > 2^{-n(D(P_1\|P_2)+\delta)}(1-2\epsilon).
\end{align*}
Junmo Kim EE 623: Information Theory
Examples
$P_1 = (\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$ and $P_2 = (0,\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3})$: $D(P_1\|P_2) = \infty$.
$P_1 = (0,\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3})$ and $P_2 = (\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$: $D(P_1\|P_2) = \log\tfrac{4}{3} \approx 0.415$ bits.
The error exponent therefore depends drastically on which distribution plays the role of the true one.
Junmo Kim EE 623: Information Theory
Lecture 20
Junmo Kim EE 623: Information Theory
Hypothesis Testing
Suppose $\alpha_n \to 0$. Then $\beta_n \approx 2^{-nD(P_1\|P_2)}$, where $P_2$ is the true distribution.
$P_1 = (\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$, $P_2 = (0,\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3})$: $D(P_1\|P_2) = \infty$.
$P_1 = (0,\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3})$, $P_2 = (\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$: $D(P_1\|P_2) = \log\tfrac{4}{3} \approx 0.415$ bits.
Junmo Kim EE 623: Information Theory
Rate Distortion Theory
Let $\hat{\mathcal{X}}$ be the reconstruction alphabet.
Encoder $f_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\}$.
Reconstruction $g_n : \{1, 2, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}^n$.
Distortion measure $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$,
e.g. $\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$, $d(x,\hat{x}) = (x-\hat{x})^2$ (squared-error distortion).
We extend $d$ to sequences:
\[ d((x_1,\ldots,x_n),(\hat{x}_1,\ldots,\hat{x}_n)) = \frac{1}{n}\sum_{k=1}^{n} d(x_k, \hat{x}_k). \]
Junmo Kim EE 623: Information Theory
Achievable Rate and Distortion
Definition
$(R, D)$ is achievable if for any $\epsilon > 0$, $\exists n_0$ such that $\forall n > n_0$, $\exists f_n, g_n$ with
\[ \Pr\bigl(d((X_1,\ldots,X_n), g_n(f_n(X_1,\ldots,X_n))) < D + \epsilon\bigr) > 1 - \epsilon. \]
Junmo Kim EE 623: Information Theory
Conditions for Optimal $f_n$
Let $D = E[d(X, g_n(f_n(X)))]$.
Let $B_i = \{x \in \mathcal{X}^n : f_n(x) = i\}$. Then
\[ D = \sum_{i=1}^{2^{nR}} \Pr(X \in B_i)\,E[d(X, g_n(i)) \mid X \in B_i], \]
and $D$ is minimized by minimizing $E[d(X, g_n(i)) \mid X \in B_i]$ for each $i$.
Junmo Kim EE 623: Information Theory
Lloyd Algorithm
Lemma
Suppose $\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$, $d(x,\hat{x}) = (x-\hat{x})^2$, and let $B_i$ be fixed. Then
\[ \arg\min_{y} E[d(X, y) \mid X \in B_i] = E[X \mid X \in B_i]. \]
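A minimal one-dimensional Lloyd iteration sketch under squared-error distortion, added for illustration: it alternates nearest-codepoint partitioning with the centroid condition $E[X \mid X \in B_i]$ from the lemma, here applied to an empirical Gaussian sample with an arbitrary initial codebook.

```python
import random

def lloyd(samples, codebook, iters=50):
    """Alternate nearest-neighbor partitioning and centroid updates (1-D, squared error)."""
    codebook = list(codebook)
    for _ in range(iters):
        # Partition: assign each sample to its nearest reconstruction point.
        cells = [[] for _ in codebook]
        for x in samples:
            i = min(range(len(codebook)), key=lambda j: (x - codebook[j]) ** 2)
            cells[i].append(x)
        # Centroid condition: replace each codepoint by the conditional mean of its cell.
        codebook = [sum(c) / len(c) if c else y for c, y in zip(cells, codebook)]
    return codebook

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10000)]
print(lloyd(data, codebook=[-2.0, -0.5, 0.5, 2.0]))
```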
Junmo Kim EE 623: Information Theory
Rate Distortion Function
\[ R(D) = \min_{\substack{p_{X,\hat{X}}(x,\hat{x}):\ \sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x),\\ E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D}} I(X;\hat{X}) \;=\; \min_{\substack{p_{\hat{X}|X}:\ E_{p_X p_{\hat{X}|X}}[d(X,\hat{X})] \le D}} I(X;\hat{X}) \]
Conditions on $p_{X,\hat{X}}$:
1. $\sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x)$
2. $E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D$
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
$\mathcal{X} = \hat{\mathcal{X}} = \{0, 1\}$, Hamming distortion $d(x,\hat{x}) = I\{x \ne \hat{x}\}$.
$D = 0 \Rightarrow R = H_b(p)$; $D = p \Rightarrow R = 0$.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
\begin{align*}
I(X;\hat{X}) &= H(X) - H(X|\hat{X}) = H_b(p) - H(X|\hat{X}) = H_b(p) - H(X\oplus\hat{X}\mid\hat{X}) \\
&\ge H_b(p) - H(X\oplus\hat{X}) = H_b(p) - H_b(\Pr(X\ne\hat{X})) \ge H_b(p) - H_b(D),
\end{align*}
since $p_{X,\hat{X}}$ satisfies $E[d(X,\hat{X})] \le D$ and $D \le \frac{1}{2}$, so that $\Pr(X\ne\hat{X}) = E[d(X,\hat{X})] \le D \le \frac{1}{2}$ and $H_b$ is nondecreasing on $[0,\frac{1}{2}]$.
Note that equality holds only if the error $X\oplus\hat{X}$ and the estimate $\hat{X}$ are independent.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
$\mathcal{X} = \hat{\mathcal{X}} = \{0,1\}$.
[Figure: test channel from $\hat{X}$ to $X$ achieving the bound.]
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
$\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$, $d(x,\hat{x}) = (x-\hat{x})^2$, $X \sim N(0,\sigma^2)$.
For any $f_{X,\hat{X}}$ satisfying the conditions (1), (2):
\begin{align*}
I(X;\hat{X}) &= h(X) - h(X|\hat{X}) = \tfrac{1}{2}\log 2\pi e\sigma^2 - h(X-\hat{X}\mid\hat{X}) \\
&\ge \tfrac{1}{2}\log 2\pi e\sigma^2 - h(X-\hat{X}) \ge \tfrac{1}{2}\log 2\pi e\sigma^2 - \tfrac{1}{2}\log 2\pi e E[(X-\hat{X})^2] \\
&\ge \tfrac{1}{2}\log 2\pi e\sigma^2 - \tfrac{1}{2}\log 2\pi e D = \frac{1}{2}\log\frac{\sigma^2}{D}.
\end{align*}
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
Hence $R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ if $D \le \sigma^2$.
If $D > \sigma^2$, we choose $\hat{X} = 0$ with probability 1, achieving $R(D) = 0$.
Inverting, $D(R) = \sigma^2\,2^{-2R}$.
Convexity of $R(D)$: suppose $p_1(\hat{x}|x)$ achieves $R(D_1)$ and $p_2(\hat{x}|x)$ achieves $R(D_2)$. For $\lambda\in[0,1]$, the mixture $\lambda p_1 + (1-\lambda)p_2$ satisfies the condition $E[d(X,\hat{X})] \le \lambda D_1 + (1-\lambda)D_2$. Thus,
\begin{align*}
R(\lambda D_1 + (1-\lambda)D_2) &\le I_{p(x)(\lambda p_1+(1-\lambda)p_2)}(X;\hat{X}) \\
&\le \lambda I_{p(x)p_1}(X;\hat{X}) + (1-\lambda)I_{p(x)p_2}(X;\hat{X}) = \lambda R(D_1) + (1-\lambda)R(D_2),
\end{align*}
using the convexity of $I(X;\hat{X})$ in $p_{\hat{X}|X}$ for fixed $p_X$.
Junmo Kim EE 623: Information Theory
Lecture 21
Junmo Kim EE 623: Information Theory
Rate Distortion Function
Given a source with PMF $p_X(\cdot)$, a reconstruction alphabet $\hat{\mathcal{X}}$, and a distortion measure $d(x,\hat{x})$, the rate distortion function is defined as
\[ R(D) = \min_{\substack{p_{X,\hat{X}}(x,\hat{x}):\ \sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x),\\ E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D}} I(X;\hat{X}) \;=\; \min_{\substack{p_{\hat{X}|X}:\ E_{p_X p_{\hat{X}|X}}[d(X,\hat{X})] \le D}} I_{p_X p_{\hat{X}|X}}(X;\hat{X}) \]
Conditions on $p_{X,\hat{X}}$:
1. $\sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x)$
2. $E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D$
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
$\mathcal{X} = \hat{\mathcal{X}} = \{0, 1\}$,
\[ p_X(x) = \begin{cases} p & x = 1 \\ 1-p & x = 0 \end{cases} \qquad \text{where } p \le \tfrac{1}{2}. \]
$D \ge p \Rightarrow R(D) = 0$:
\[ R(D) = \min_{\substack{p_{X,\hat{X}}(x,\hat{x}):\ \sum_{\hat{x}} p_{X,\hat{X}} = p_X,\ E_{p_{X,\hat{X}}}[d(X,\hat{X})]\le D}} I(X;\hat{X}) \]
If we choose $p_{X,\hat{X}}(x,\hat{x})$ such that $\hat{X} = 0$ with probability 1, then $p_{X,\hat{X}}$ satisfies the two conditions (the distortion is $\Pr(X=1) = p \le D$) and $I(X;\hat{X}) = 0$.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
If $D < p$:
\begin{align*}
I(X;\hat{X}) &= H(X) - H(X|\hat{X}) = H_b(p) - H(X|\hat{X}) = H_b(p) - H(X\oplus\hat{X}\mid\hat{X}) \\
&\ge H_b(p) - H(X\oplus\hat{X}) = H_b(p) - H_b(\Pr(X\ne\hat{X})) \ge H_b(p) - H_b(D),
\end{align*}
since $p_{X,\hat{X}}$ satisfies $E[d(X,\hat{X})] \le D$ and $D \le \frac{1}{2}$, so that $\Pr(X\ne\hat{X}) = E[d(X,\hat{X})] \le D \le \frac{1}{2}$.
Note that equality holds iff 1) the error $X\oplus\hat{X}$ and the estimate $\hat{X}$ are independent, and 2) $\Pr(X\ne\hat{X}) = D$.
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Binary Source
Claim: for $X \sim \mathrm{Ber}(p)$, $p \le \frac{1}{2}$,
\[ R(D) = \begin{cases} H_b(p) - H_b(D) & D < p \\ 0 & \text{otherwise.} \end{cases} \]
For $D \ge p$, $R(D) = 0$ is attained by $\hat{X} = 0$.
For $D < p$: as $X = \hat{X} \oplus (X\oplus\hat{X})$, conditions 1) and 2) determine $p_{X|\hat{X}}$ (a binary symmetric test channel with crossover probability $D$).
We then compute $p_{\hat{X}}$ so that $\sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x)$, which makes the bound $H_b(p) - H_b(D)$ achievable.
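An added sketch of the binary rate–distortion formula just established: $R(D) = H_b(p) - H_b(D)$ for $D < p$, and 0 otherwise; the value $p = 0.3$ is an arbitrary example.

```python
import math

def Hb(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R_binary(D, p):
    """Rate-distortion function of a Bernoulli(p) source, p <= 1/2, Hamming distortion."""
    return Hb(p) - Hb(D) if D < p else 0.0

p = 0.3
for D in [0.0, 0.05, 0.1, 0.2, 0.3, 0.4]:
    print(f"D={D:4.2f}  R(D)={R_binary(D, p):.4f} bits")
```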
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
$\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$, $d(x,\hat{x}) = (x-\hat{x})^2$, $X \sim N(0,\sigma^2)$.
For any $f_{X,\hat{X}}$ satisfying the conditions (1), (2):
\begin{align*}
I(X;\hat{X}) &= h(X) - h(X|\hat{X}) = \tfrac{1}{2}\log 2\pi e\sigma^2 - h(X-\hat{X}\mid\hat{X}) \\
&\ge \tfrac{1}{2}\log 2\pi e\sigma^2 - h(X-\hat{X}) \ge \tfrac{1}{2}\log 2\pi e\sigma^2 - \tfrac{1}{2}\log 2\pi e E[(X-\hat{X})^2] \\
&\ge \tfrac{1}{2}\log 2\pi e\sigma^2 - \tfrac{1}{2}\log 2\pi e D = \frac{1}{2}\log\frac{\sigma^2}{D}.
\end{align*}
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
Hence $R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ if $D \le \sigma^2$.
If $D > \sigma^2$, we choose $\hat{X} = 0$ with probability 1, achieving $R(D) = 0$.
Inverting, $D(R) = \sigma^2\,2^{-2R}$.
Convexity of $R(D)$: suppose $p_1(\hat{x}|x)$ achieves $R(D_1)$ and $p_2(\hat{x}|x)$ achieves $R(D_2)$. For $\lambda\in[0,1]$, the mixture $\lambda p_1 + (1-\lambda)p_2$ satisfies the condition $E[d(X,\hat{X})] \le \lambda D_1 + (1-\lambda)D_2$. Thus,
\begin{align*}
R(\lambda D_1 + (1-\lambda)D_2) &\le I_{p(x)(\lambda p_1+(1-\lambda)p_2)}(X;\hat{X}) \\
&\le \lambda I_{p(x)p_1}(X;\hat{X}) + (1-\lambda)I_{p(x)p_2}(X;\hat{X}) = \lambda R(D_1) + (1-\lambda)R(D_2).
\end{align*}
Junmo Kim EE 623: Information Theory
Converse
Suppose $f_n, g_n$ give rise to distortion $E[d(X^n, g_n(f_n(X^n)))] \le D$ and are of rate $R$. We will show that $R \ge R(D)$.
Proof: $X^n \to W = f_n(X^n) \to \hat{X}^n = g_n(W)$.
\begin{align*}
nR &\ge H(f_n(X^n)) \ge H(f_n(X^n)) - H(f_n(X^n)|X^n) = I(X^n; f_n(X^n)) \ge I(X^n; \hat{X}^n) \\
&= H(X^n) - H(X^n|\hat{X}^n) = \sum_{k=1}^{n} H(X_k) - \sum_{k=1}^{n} H(X_k|X^{k-1},\hat{X}^n) \\
&\ge \sum_{k=1}^{n} H(X_k) - \sum_{k=1}^{n} H(X_k|\hat{X}_k) = \sum_{k=1}^{n} I(X_k;\hat{X}_k)
\end{align*}
(using that the source is memoryless, so $H(X^n) = \sum_k H(X_k)$, and that conditioning reduces entropy).
Junmo Kim EE 623: Information Theory
Converse
\begin{align*}
\sum_{k=1}^{n} I(X_k;\hat{X}_k) &\ge \sum_{k=1}^{n} R(E[d(X_k,\hat{X}_k)]) = n\cdot\frac{1}{n}\sum_{k=1}^{n} R(E[d(X_k,\hat{X}_k)]) \\
&\ge n\,R\Bigl(\frac{1}{n}\sum_{k=1}^{n} E[d(X_k,\hat{X}_k)]\Bigr) \qquad (\text{convexity of } R(D)) \\
&= n\,R(E[d(X^n,\hat{X}^n)]) \ge n\,R(D) \qquad (R(D) \text{ is nonincreasing})
\end{align*}
Junmo Kim EE 623: Information Theory
Direct Part (Achievability)
Definition
$(R, D)$ is achievable if for any $\epsilon > 0$ and any $\delta > 0$, $\exists n_0$, $\forall n > n_0$, $\exists f_n, g_n$ such that
\[ \Pr\bigl(d((X_1,\ldots,X_n), g_n(f_n(X_1,\ldots,X_n))) < D + \delta\bigr) > 1 - \epsilon. \]
Section 10.5 (C\&T): if $R > R(D)$ and $d(x,\hat{x}) < \infty$, then for any $\epsilon > 0$, $\exists n_0$, $\forall n > n_0$, $\exists f_n, g_n$ such that $E[d(X^n, g_n(f_n(X^n)))] < D + \epsilon$.
Fix $p_{\hat{X}|X}$ satisfying $E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D$, and let $R > I_{p_{X,\hat{X}}}(X;\hat{X}) + \delta_1$.
Compute $p_{\hat{X}}(\hat{x})$ and generate an IID codebook: codewords are independent, with IID components $\sim p_{\hat{X}}(\hat{x})$.
Junmo Kim EE 623: Information Theory
Direct Part: Strong Typicality
Definition
A sequence $x_1,\ldots,x_n$ is strongly typical with respect to the distribution $p_X(x)$ (denoted $x^n \in A_\epsilon^{*(n)}$) if
1) $\bigl|\frac{1}{n}\sum_k I\{x_k = x\} - p_X(x)\bigr| < \frac{\epsilon}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$ with $p_X(x) > 0$, and
2) for all $x \in \mathcal{X}$ with $p_X(x) = 0$, $\sum_k I\{x_k = x\} = 0$.
Definition
A pair of sequences $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ is strongly jointly typical with respect to the distribution $p_{X,Y}(x,y)$ if
1) $\bigl|\frac{1}{n}\sum_k I\{x_k = x,\ y_k = y\} - p_{X,Y}(x,y)\bigr| < \frac{\epsilon}{|\mathcal{X}||\mathcal{Y}|}$ for all $(x,y) \in \mathcal{X}\times\mathcal{Y}$ with $p_{X,Y}(x,y) > 0$, and
2) for all $(x,y) \in \mathcal{X}\times\mathcal{Y}$ with $p_{X,Y}(x,y) = 0$, $\sum_k I\{x_k = x,\ y_k = y\} = 0$.
Junmo Kim EE 623: Information Theory
Direct Part: Strong Typicality
Marginal consistency: $\bigl|\frac{1}{n}\sum_k I\{x_k=x,\ y_k=y\} - p_{X,Y}(x,y)\bigr| < \frac{\epsilon}{|\mathcal{X}||\mathcal{Y}|}$ for all $(x,y)$ implies $\bigl|\frac{1}{n}\sum_k I\{x_k=x\} - p_X(x)\bigr| < \frac{\epsilon}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$, since
\[ \Bigl|\sum_y \Bigl[\frac{1}{n}\sum_k I\{x_k=x,\ y_k=y\} - p_{X,Y}(x,y)\Bigr]\Bigr| \le \sum_y \Bigl|\frac{1}{n}\sum_k I\{x_k=x,\ y_k=y\} - p_{X,Y}(x,y)\Bigr| \le |\mathcal{Y}|\cdot\frac{\epsilon}{|\mathcal{X}||\mathcal{Y}|} = \frac{\epsilon}{|\mathcal{X}|}. \]
Junmo Kim EE 623: Information Theory
Direct Part: Strong Typicality
Lemma
If the $X_i$ are IID according to $p_X(x)$, then $\Pr((X_1,\ldots,X_n) \in A_\epsilon^{*(n)}) \to 1$ as $n \to \infty$.
Lemma
If $x^n \in A_\epsilon^{*(n)}(p_X)$ and the $\hat{X}_i$ are IID according to $p_{\hat{X}}(\hat{x})$, then
\[ \Pr\bigl((x^n, \hat{X}^n) \in A_\epsilon^{*(n)}(p_{X,\hat{X}})\bigr) \ge 2^{-n(I_{p_{X,\hat{X}}}(X;\hat{X})+\delta_1)}. \]
See problem 10.16 for the proof.
Junmo Kim EE 623: Information Theory
Direct Part
Encoding: given $X^n$, index it by $w$ if there exists a $w$ s.t. $(X^n, \hat{X}^n(w)) \in A_\epsilon^{*(n)}(p_{X,\hat{X}})$. For such a jointly typical pair,
\[ d(x^n,\hat{x}^n) \le E_{p_{X,\hat{X}}}[d(X,\hat{X})] + \epsilon', \quad \text{where } \epsilon' \to 0 \text{ as } \epsilon \to 0: \]
\begin{align*}
\bigl|d(x^n,\hat{x}^n) - E_{p_{X,\hat{X}}}[d(X,\hat{X})]\bigr| &= \Bigl|\frac{1}{n}\sum_{x\in\mathcal{X}}\sum_{\hat{x}\in\hat{\mathcal{X}}} N(x,\hat{x})\,d(x,\hat{x}) - \sum_{x\in\mathcal{X}}\sum_{\hat{x}\in\hat{\mathcal{X}}} p_{X,\hat{X}}(x,\hat{x})\,d(x,\hat{x})\Bigr| \\
&= \Bigl|\sum_{x}\sum_{\hat{x}} \Bigl(\frac{N(x,\hat{x})}{n} - p_{X,\hat{X}}(x,\hat{x})\Bigr)d(x,\hat{x})\Bigr| \\
&\le \sum_{x}\sum_{\hat{x}} \Bigl|\frac{N(x,\hat{x})}{n} - p_{X,\hat{X}}(x,\hat{x})\Bigr| d(x,\hat{x}) \le \sum_{x}\sum_{\hat{x}} \frac{\epsilon}{|\mathcal{X}||\hat{\mathcal{X}}|}\,d_{\max} = \epsilon\,d_{\max} \triangleq \epsilon'.
\end{align*}
Junmo Kim EE 623: Information Theory
Direct Part: Error Probability
An error occurs if $X^n$ is not strongly typical or there is no codeword $\hat{X}^n(w)$ which is jointly typical with $X^n$.
\[ \Pr[\text{error}] \le \epsilon/2 + \sum_{x^n\in A_\epsilon^{*(n)}} p_{X^n}(x^n)\bigl[1 - \Pr((x^n,\hat{X}^n)\in A_\epsilon^{*(n)})\bigr]^{2^{nR}} \]
As $\Pr((x^n,\hat{X}^n)\in A_\epsilon^{*(n)}) \ge 2^{-n(I(X;\hat{X})+\delta_1)}$ and $(1-x)^n \le e^{-nx}$, we have
\[ \bigl[1 - \Pr((x^n,\hat{X}^n)\in A_\epsilon^{*(n)})\bigr]^{2^{nR}} \le \bigl[1 - 2^{-n(I(X;\hat{X})+\delta_1)}\bigr]^{2^{nR}} \le \exp\bigl(-2^{nR}\,2^{-n(I(X;\hat{X})+\delta_1)}\bigr). \]
As $R > I(X;\hat{X}) + \delta_1$, the right-hand side goes to 0 as $n \to \infty$. Therefore,
\[ \Pr\bigl(d((X_1,\ldots,X_n), g_n(f_n(X_1,\ldots,X_n))) < D + \delta\bigr) > 1 - \epsilon. \]
Junmo Kim EE 623: Information Theory
Lecture 22
Junmo Kim EE 623: Information Theory
Lecture 23
Junmo Kim EE 623: Information Theory
Announcement
Water filling.
If $f''(x) > 0$ for all $x$, $f$ is strictly convex; if $f''(x) < 0$ for all $x$, $f$ is strictly concave.
Example
$f(x) = \ln x$: $f''(x) = -\frac{1}{x^2} < 0$, so $f$ is strictly concave.
Junmo Kim EE 623: Information Theory
Jensen's Inequality
Theorem
If $f$ is concave, then for any random variable $X$, $f(E[X]) \ge E[f(X)]$.
If $f$ is strictly concave, $f(E[X]) = E[f(X)] \Leftrightarrow X$ is deterministic.
Junmo Kim EE 623: Information Theory
Chain Rule for Entropy
\[ H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1,\ldots,X_{n-1}) \]
In shorthand notation,
\[ H(X_1^n) = \sum_{i=1}^{n} H(X_i|X_1^{i-1}) = \sum_{i=1}^{n} H(X_i|X^{i-1}), \qquad X_i^j \triangleq (X_i, X_{i+1}, \ldots, X_j). \]
Junmo Kim EE 623: Information Theory
Entropy Rate
Definition
The entropy rate of a stochastic process $\{X_i\}$ is defined by
\[ H(\mathcal{X}) = \lim_{n\to\infty}\frac{1}{n}H(X_1,\ldots,X_n), \]
when the limit exists.
Channel coding review: $R = \frac{\log M}{n}$, the rate in bits / channel use; $W \sim \mathrm{Unif}\{1,\ldots,2^{nR}\}$; $I_{P_X(x)W(y|x)}(X;Y) = I(P_X; W)$; $C = \max_{P_X} I(P_X; W)$.
Definition
$R$ is achievable if $\forall \epsilon > 0$, $\exists n_0$ s.t. $\forall n \ge n_0$, there exist an encoder of rate $R$ and block length $n$ and a decoder with maximal probability of error $\lambda^{(n)} < \epsilon$.
Theorem
If $R < C$ then $R$ is achievable.
Theorem
Converse: if $R$ is achievable, then $R \le C$.
Junmo Kim EE 623: Information Theory
DMC and Mutual Information
Lemma
Let $X$ take values in $\mathcal{X}^n$ according to some law $P_X(x)$ and let $Y$ be distributed according to $p_{Y|X} = \prod_{k=1}^{n} W(y_k|x_k)$ for some $W(\cdot|\cdot)$. Then $I(X;Y) \le nC$, where $C = \max_{P_X} I(P_X; W)$.
Proof:
\begin{align*}
I(X;Y) = H(Y) - H(Y|X) &= H(Y) - \sum_i H(Y_i|X, Y^{i-1}) \\
&= H(Y) - \sum_i H(Y_i|X_i) \qquad (\text{memoryless}) \\
&\le \sum_i \bigl(H(Y_i) - H(Y_i|X_i)\bigr) = \sum_i I(X_i; Y_i) \le nC
\end{align*}
Junmo Kim EE 623: Information Theory
Proof of Converse
Note that we have the following Markov chain: $W \to X = f(W) \to Y^n \to \hat{W} = \phi(Y)$.
$H(W) = nR$ (since $H(W) = \log|\mathcal{W}|$ and $R = \frac{\log|\mathcal{W}|}{n}$).
\begin{align*}
nR = H(W) &= H(W|\hat{W}) + I(W;\hat{W}) \\
&\le H(W|\hat{W}) + I(f(W);\hat{W}) \le H(W|\hat{W}) + I(X;Y) \le H(W|\hat{W}) + nC \\
&\le \Pr(W\ne\hat{W})\log|\mathcal{W}| + H_b(\Pr(W\ne\hat{W})) + nC \qquad (\text{Fano's inequality}) \\
&= P_e^{(n)} nR + H_b(P_e^{(n)}) + nC
\end{align*}
Junmo Kim EE 623: Information Theory
Proof Sketch
1. Fix some $p_X$.
2. Fix some $\epsilon > 0$ and $n$.
3. Generate a random codebook $\mathcal{C}$ IID $\sim p_X$.
4. Reveal $\mathcal{C}$ to encoder and receiver.
5. Design a joint typicality decoder $\phi(\cdot\,;\, p_X(x)W(y|x), \epsilon, n, \mathcal{C})$.
6. Encoder: $m \mapsto x(m)$ (according to the codebook).
7. Each codebook $\mathcal{C}$ gives $P_e^{(n)}(\mathcal{C})$.
8. Analyze $E[P_e^{(n)}(\mathcal{C})]$, averaging over $\mathcal{C}$.
9. Will show that if $R < I(P_X; W)$ then $E[P_e^{(n)}(\mathcal{C})] \to 0$ as $n \to \infty$.
10. By the random coding argument, there exists a deterministic sequence $\mathcal{C}_n$ s.t. $P_e^{(n)}(\mathcal{C}_n) \to 0$.
11. Trick to get $\lambda^{(n)}$ (the maximal probability of error) to go to zero.
Junmo Kim EE 623: Information Theory
Differential Entropy
$X \sim \mathrm{Unif}[0,a]$:
\[ h(f) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,dx = \log a. \]
$X \sim N(0,\sigma^2)$:
\[ h(f) = -\int f(x)\ln\Bigl(\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{x^2}{2\sigma^2}}\Bigr)dx = \ln\sqrt{2\pi\sigma^2} + \frac{E[X^2]}{2\sigma^2} = \frac{1}{2}\ln 2\pi e\sigma^2 \ \text{nats}. \]
Gaussian channel: if $N = 1$ and there is no limit on $x$, then $C = \infty$; hence we impose the power constraint $\frac{1}{n}\sum_i x_i^2(w) \le P$.
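A hedged Monte Carlo check (added here, with an arbitrary choice of $\sigma$ and $a$) that the sample average of $-\ln f(X)$ matches the closed forms above.

```python
import math, random

random.seed(0)
sigma = 2.0
N = 200_000
# h(X) = E[-ln f(X)] for X ~ N(0, sigma^2); each term is ln sqrt(2*pi*sigma^2) + X^2/(2*sigma^2).
samples = [random.gauss(0.0, sigma) for _ in range(N)]
h_mc = sum(0.5 * math.log(2 * math.pi * sigma ** 2) + x ** 2 / (2 * sigma ** 2)
           for x in samples) / N
h_formula = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(h_mc, h_formula)      # both around 2.112 nats for sigma = 2

# For X ~ Unif[0, a], -ln f(X) = ln a is constant, so h(X) = ln a exactly.
a = 3.0
print(math.log(a))
```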
Junmo Kim EE 623: Information Theory
Gaussian Channel
We will show that the following quantity is the capacity:
\[ \max_{E[X^2]\le P} I(X;Y) \]
(True for a general channel with an input constraint: problem 8.4.)
Junmo Kim EE 623: Information Theory
Gaussian Channel: Achievable Rate
Definition
We say that $R$ is achievable if $\forall \epsilon > 0$, $\exists n_0$ s.t. $\forall n > n_0$, there exist a rate-$R$, block-length-$n$ codebook $\mathcal{C} = \{x(1), \ldots, x(2^{nR})\} \subset \mathbb{R}^n$ and a decoder $\phi: \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$ s.t. the maximum probability of error is $< \epsilon$ and
\[ \frac{1}{n}\sum_i x_i^2(m) \le P, \quad \forall m \in \{1, \ldots, 2^{nR}\}. \]
$C \triangleq$ supremum of achievable rates.
Junmo Kim EE 623: Information Theory
Direct Part
1. Generate a codebook at random:
  1.1 Codewords are chosen independently.
  1.2 The components of the codewords are chosen IID from $N(0, P - \epsilon)$.
2. Reveal the codebook to Tx/Rx.
3. Decoder:
  3.1 Joint typicality: if there is one and only one codeword $X^n(w)$ that is jointly typical with the received vector, declare $\hat{W} = w$. Otherwise, declare an error.
  3.2 Declare an error if the unique codeword that is typical with $y$ violates the average power constraint.
Junmo Kim EE 623: Information Theory
Direct Part: Error Analysis
Assume $W = 1$.
$E_0 \triangleq$ the event that $X(1)$ violates the power constraint.
$E_i \triangleq$ the event $(X(i), Y) \in A_\epsilon^{(n)}$.
\[ \Pr(\text{Error}\mid W=1) \le \Pr\Bigl(E_0 \cup E_1^C \cup \bigcup_{i=2}^{2^{nR}} E_i\Bigr) \le \Pr(E_0) + \Pr(E_1^C) + \sum_{i=2}^{2^{nR}} \Pr(E_i) \]
$\Pr(E_0) \to 0$ (since $\frac{1}{n}\sum_i X_i^2(1) \to E[X^2] = P - \epsilon < P$).
$\Pr(E_1^C) \to 0$.
$\Pr(E_i) \le 2^{-n(I(X;Y)-3\epsilon)}$, where $I(X;Y) = \frac{1}{2}\log\bigl(1+\frac{P}{N}\bigr)$.
Junmo Kim EE 623: Information Theory
Parallel Gaussian Channels
Consider $L$ independent Gaussian channels in parallel:
\[ Y^{(l)} = X^{(l)} + Z^{(l)}, \qquad Z^{(l)} \sim N(0, N_l), \qquad \sum_{l=1}^{L} P_l \le P. \]
$C = \max I(X^{(1)},\ldots,X^{(L)}; Y^{(1)},\ldots,Y^{(L)})$, where the maximum is over all input distributions $f_{X^{(1)},\ldots,X^{(L)}}(\cdot,\ldots,\cdot)$ satisfying the power constraint $\sum_{l=1}^{L} E[(X^{(l)})^2] \le P$.
Equivalently: maximize $\sum_l \frac{1}{2}\log\bigl(1+\frac{P_l}{N_l}\bigr)$ subject to $\sum_l P_l \le P$.
Junmo Kim EE 623: Information Theory
Water-Filling for Parallel Gaussian Channels
Optimum: $P_l = (\nu - N_l)^+$, where
\[ x^+ = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases} \]
and the water level $\nu$ is chosen so that $\sum_l (\nu - N_l)^+ = P$.
Junmo Kim EE 623: Information Theory
Sanov's Theorem (Large Deviations)
Theorem
Let $E \subseteq \mathcal{P}$. Suppose $X_1,\ldots,X_n$ are i.i.d. according to $Q$. Then
\[ \Pr(\underbrace{P_{X_1,\ldots,X_n}}_{\text{empirical type}} \in E) \le (n+1)^{|\mathcal{X}|}\,2^{-nD(P^*\|Q)}, \qquad P^* = \arg\min_{P\in E} D(P\|Q). \]
In the exam, $\Pr(P_{X_1,\ldots,X_n} \in E) \le 2^{-nD(P^*\|Q)}$ (ignoring the polynomial factor) is enough.
e.g. $Q = (\tfrac{1}{2},\tfrac{1}{2})$ on $\{H,T\}$, $E = \{P\in\mathcal{P} : P(H) \ge \tfrac{3}{4}\}$:
\[ \Pr(P_{X_1,\ldots,X_n}\in E) = \Pr(\#\text{ of }H \ge 75\%). \]
Junmo Kim EE 623: Information Theory
Conditional Sanov's Theorem
Let $P^* \in E$ achieve $\inf_{P\in E} D(P\|Q)$, where $Q$ is not in $E$. Then
\[ |Q^n(X_1 = a \mid P_{X_1,\ldots,X_n} \in E) - P^*(a)| \to 0 \quad \text{as } n \to \infty. \]
We can use $\Pr(X_1 = a \mid E)$ to compute other probabilities.
Junmo Kim EE 623: Information Theory
Rate Distortion Function
Given a source with PMF $p_X(\cdot)$, a reconstruction alphabet $\hat{\mathcal{X}}$, and a distortion measure $d(x,\hat{x})$, the rate distortion function is defined as
\[ R(D) = \min_{\substack{p_{X,\hat{X}}(x,\hat{x}):\ \sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x),\\ E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D}} I(X;\hat{X}) \;=\; \min_{\substack{p_{\hat{X}|X}:\ E_{p_X p_{\hat{X}|X}}[d(X,\hat{X})] \le D}} I_{p_X p_{\hat{X}|X}}(X;\hat{X}) \]
Conditions on $p_{X,\hat{X}}$:
1. $\sum_{\hat{x}} p_{X,\hat{X}}(x,\hat{x}) = p_X(x)$
2. $E_{p_{X,\hat{X}}}[d(X,\hat{X})] \le D$
Junmo Kim EE 623: Information Theory
Rate Distortion Function : Gaussian Source
$R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ if $D \le \sigma^2$. If $D > \sigma^2$, we choose $\hat{X} = 0$ with probability 1, achieving $R(D) = 0$.
Convexity of $R(D)$: suppose $p_1(\hat{x}|x)$ achieves $R(D_1)$ and $p_2(\hat{x}|x)$ achieves $R(D_2)$. For $\lambda\in[0,1]$, the mixture $\lambda p_1 + (1-\lambda)p_2$ satisfies the condition $E[d(X,\hat{X})] \le \lambda D_1 + (1-\lambda)D_2$. Thus,
\begin{align*}
R(\lambda D_1 + (1-\lambda)D_2) &\le I_{p(x)(\lambda p_1+(1-\lambda)p_2)}(X;\hat{X}) \\
&\le \lambda I_{p(x)p_1}(X;\hat{X}) + (1-\lambda)I_{p(x)p_2}(X;\hat{X}) = \lambda R(D_1) + (1-\lambda)R(D_2).
\end{align*}
Junmo Kim EE 623: Information Theory