INGI 2348: Information Theory and Coding
Part 2: Channel Coding
Memoryless channel: the transition probability of a sequence factorizes as
$$P_N(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{N} P(y_i|x_i)$$
Example: Binary Symmetric channel
A = B = {0, 1}
P(0|0) = P(1|1) = 0.9
P(0|1) = P(1|0) = 0.1
P(00|00) = (0.9)(0.9) = 0.81 . . .
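As a small illustration, here is a minimal Python sketch (the helper names are ours, not part of the course material) of this product rule for the binary symmetric channel above:

```python
# Minimal sketch: sequence transition probability of a memoryless binary
# symmetric channel with crossover probability p = 0.1.
def bsc_letter_prob(y, x, p=0.1):
    """P(y|x) for one channel use: 1 - p if y == x, p otherwise."""
    return 1 - p if y == x else p

def bsc_sequence_prob(y_seq, x_seq, p=0.1):
    """P_N(y|x) = product over i of P(y_i|x_i) (memoryless channel)."""
    prob = 1.0
    for y, x in zip(y_seq, x_seq):
        prob *= bsc_letter_prob(y, x, p)
    return prob

print(bsc_sequence_prob("00", "00"))   # 0.9 * 0.9 = 0.81
print(bsc_sequence_prob("01", "00"))   # 0.9 * 0.1 = 0.09
```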
1.1.2. Mutual Information
Probabilities of the input X: $Q(a_k)$, $1 \le k \le K$
Joint probabilities on X, Y: $P(a_k, b_j) = Q(a_k)\,P(b_j|a_k)$
Marginal probabilities on the output Y: $P(b_j) = \sum_{k=1}^{K} Q(a_k)\,P(b_j|a_k)$
Mutual Information
$$I(X; Y) = \sum_{k=1}^{K}\sum_{j=1}^{J} P(a_k, b_j)\,\log\frac{P(b_j|a_k)}{P(b_j)}
= E_{X,Y}\!\left[\log\frac{P(b_j|a_k)}{P(b_j)}\right]
= \sum_{k=1}^{K}\sum_{j=1}^{J} P(a_k, b_j)\,\log\frac{P(a_k, b_j)}{P(a_k)\,P(b_j)}$$
$$H(X|Y) = H(X) - I(X; Y)$$
I(X;Y) is the information obtained on average about X by observing Y
The uncertainty about X is decreased on average by I(X;Y) when Y is observed
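A minimal numerical sketch (the function name and layout are ours, not from the slides) that evaluates this double sum for a finite channel given as a transition matrix:

```python
import numpy as np

def mutual_information(Q, P, base=2.0):
    """I(X;Y) = sum_k sum_j P(a_k,b_j) log( P(b_j|a_k) / P(b_j) ).

    Q[k] = Q(a_k), P[k, j] = P(b_j | a_k)."""
    Q = np.asarray(Q, dtype=float)
    P = np.asarray(P, dtype=float)
    joint = Q[:, None] * P                      # P(a_k, b_j) = Q(a_k) P(b_j|a_k)
    py = joint.sum(axis=0)                      # marginal P(b_j)
    mask = joint > 0                            # zero-probability terms contribute 0
    py_b = np.broadcast_to(py, P.shape)
    return float(np.sum(joint[mask] * np.log(P[mask] / py_b[mask])) / np.log(base))

# Binary symmetric channel with p = 0.1 and uniform input:
print(mutual_information([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]]))   # ~0.531 bits
```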
Capacity
For a given channel characterized by its transition probabilities $P(b_j|a_k)$
Definition: Capacity
$$C = \max_{Q(a_k)} I(X; Y)$$
Maximum of mutual information, optimized over the input distribution
If I (X; Y) = C, the distribution Q is said to realize the capacity
Interpretation:
For the transmission of a sequence $\mathbf{x} = (x_1, \ldots, x_N) \to \mathbf{y} = (y_1, \ldots, y_N)$
The coding theorem will show that the capacity is the maximum amount of information that can be transmitted per letter
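For a binary-input channel, this maximization can be illustrated by a brute-force search over Q = (q, 1 - q); the sketch below is ours (assumed helper names, not the efficient algorithm discussed later in the chapter):

```python
import numpy as np

def mi(Q, P):
    """I(X;Y) in bits for input distribution Q and transition matrix P[k, j]."""
    joint = np.asarray(Q, float)[:, None] * np.asarray(P, float)
    py = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] *
                        np.log2(np.asarray(P, float)[mask] /
                                np.broadcast_to(py, joint.shape)[mask])))

def capacity_binary_input(P, grid=10001):
    """Brute-force C = max over Q = (q, 1-q) of I(X;Y), for a 2-input channel."""
    qs = np.linspace(0.0, 1.0, grid)
    vals = [mi([q, 1.0 - q], P) for q in qs]
    i = int(np.argmax(vals))
    return qs[i], vals[i]

# BSC with p = 0.1: the maximum is at q = 0.5, C = 1 - H(0.1) ~ 0.531 bit/use.
print(capacity_binary_input([[0.9, 0.1], [0.1, 0.9]]))
```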
Transmission of sequences
$X^N$, $Y^N$: random sequences of N inputs and N outputs
$X_i$, $Y_i$: random variables for the $i$-th letter in the sequence
In general, one should consider the mutual information of the whole sequence, $I(X^N; Y^N)$
Theorem 1.1: Transmission on memoryless channel
If the channel is memoryless:
(1) Then
$$I(X^N; Y^N) \le \sum_{i=1}^{N} I(X_i; Y_i)$$
(2) If the inputs are independent,
$$I(X^N; Y^N) = \sum_{i=1}^{N} I(X_i; Y_i)$$
(3) If the inputs are independent and their distribution realizes the capacity,
$$I(X^N; Y^N) = N\,C$$
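A quick numerical check of item (3) (our own construction, with assumed values p = 0.1 and N = 2): for the BSC with independent uniform inputs, the mutual information of the length-2 extension equals 2C.

```python
import itertools
import numpy as np

def mi_from_joint(J):
    """Mutual information in bits from a joint probability matrix J[a, b]."""
    pa, pb = J.sum(axis=1, keepdims=True), J.sum(axis=0, keepdims=True)
    mask = J > 0
    return float(np.sum(J[mask] * np.log2((J / (pa * pb))[mask])))

p, N = 0.1, 2
seqs = list(itertools.product((0, 1), repeat=N))
J = np.zeros((len(seqs), len(seqs)))            # joint P(x^N, y^N)
for ix, x in enumerate(seqs):
    for iy, y in enumerate(seqs):
        trans = np.prod([1 - p if yi == xi else p for xi, yi in zip(x, y)])
        J[ix, iy] = (0.5 ** N) * trans          # independent uniform inputs

C = 1 + p * np.log2(p) + (1 - p) * np.log2(1 - p)   # BSC capacity, 1 - H(p)
print(mi_from_joint(J), N * C)                       # both ~1.062 bits
```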
Proof of Theorem 1.1
$$I(X^N; Y^N) = H(Y^N) - H(Y^N|X^N)$$
Memoryless channel:
$$H(Y^N|X^N) = -\sum_{\mathbf{x},\mathbf{y}} Q_N(\mathbf{x})\,P_N(\mathbf{y}|\mathbf{x}) \sum_{i=1}^{N} \log P(y_i|x_i) = \sum_{i=1}^{N} H(Y_i|X_i)$$
Joint entropy: $H(U, V) \le H(U) + H(V)$, with equality iff U and V are independent, hence
$$H(Y^N) \le \sum_{i=1}^{N} H(Y_i)$$
$$I(X^N; Y^N) \le \sum_{i=1}^{N} \big(H(Y_i) - H(Y_i|X_i)\big) = \sum_{i=1}^{N} I(X_i; Y_i)$$
Proof of Theorem 1.1
If the inputs are independent: $Q_N(\mathbf{x}) = \prod_{i=1}^{N} Q_i(x_i)$
Then the outputs are independent (memoryless channel) and
$$H(Y^N) = \sum_{i=1}^{N} H(Y_i)$$
$$I(X^N; Y^N) = \sum_{i=1}^{N} I(X_i; Y_i)$$
If in addition the distributions $Q_i$ realize the capacity, $I(X_i; Y_i) = C$ for all i, hence $I(X^N; Y^N) = N\,C$
1.2.1 Convexity of I(X;Y)
Definition: A function $f: \mathbb{R}^K \to \mathbb{R}$ is ∩-convex if, for all $\alpha, \beta \in \mathbb{R}^K$ and $0 \le \lambda \le 1$,
$$\lambda f(\alpha) + (1 - \lambda) f(\beta) \le f\big(\lambda\alpha + (1 - \lambda)\beta\big)$$
Let Z be an auxiliary binary random variable with $P_Z(0) = \lambda$, $P_Z(1) = 1 - \lambda$
X is assumed to be generated with $Q_0$ if $z = 0$ and with $Q_1$ if $z = 1$:
$$P_{X|Z}(x|0) = Q_0(x) \qquad P_{X|Z}(x|1) = Q_1(x)$$
The probability distribution of X is then
$$P_X(x) = \sum_{z=0}^{1} P_Z(z)\,P_{X|Z}(x|z) = \lambda Q_0(x) + (1 - \lambda) Q_1(x) = Q(x)$$
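A small numerical illustration (our own sketch, with assumed distributions Q0 and Q1) that I(X;Y) behaves as an ∩-convex function of the input distribution for a fixed channel:

```python
import numpy as np

def mi(Q, P):
    """I(X;Y) in bits for input distribution Q and transition matrix P[k, j]."""
    joint = np.asarray(Q, float)[:, None] * np.asarray(P, float)
    py = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] *
                        np.log2(np.asarray(P, float)[mask] /
                                np.broadcast_to(py, joint.shape)[mask])))

P = np.array([[0.9, 0.1], [0.1, 0.9]])                 # BSC, p = 0.1
Q0, Q1 = np.array([0.2, 0.8]), np.array([0.7, 0.3])
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    lhs = mi(lam * Q0 + (1 - lam) * Q1, P)             # I at the mixture
    rhs = lam * mi(Q0, P) + (1 - lam) * mi(Q1, P)      # mixture of the I values
    print(f"lambda={lam:.2f}: I(mixture)={lhs:.4f} >= {rhs:.4f}")
```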
Convexity of Mutual Information (Proof)
Considering the cascade of the two channels $Z \to X$, $X \to Y$:
$$I(X; Y|Z) \le I(X; Y)$$
Developing:
$$I(X; Y|Z) = \sum_{x,y,z} P_Z(z)\,P(x,y|z)\,\log\frac{P(y|x,z)}{P(y|z)}
\qquad
I_z(X; Y) = \sum_{x,y} Q_z(x)\,P(y|x)\,\log\frac{P(y|x)}{\sum_{x'} Q_z(x')\,P(y|x')}$$
$$P(x,y|z) = P(x|z)\,P(y|x,z) = Q_z(x)\,P(y|x,z)$$
$$P(y|x,z) = P(y|x) \quad \text{[cascade]}$$
$$P(y|z) = \sum_{x} Q_z(x)\,P(y|x,z)$$
$$I(X; Y|Z) = \lambda\, I_0(X; Y) + (1 - \lambda)\, I_1(X; Y)$$
Hence $\lambda I_0(X;Y) + (1-\lambda) I_1(X;Y) \le I(X;Y)$: the mutual information is ∩-convex in the input distribution Q
Characterization of the capacity
Maximization of I(X;Y) over the parameters $\theta_k = Q(a_k)$
Convex set of possible probability distributions:
$$S = \{\theta \in \mathbb{R}^K : \theta_k \ge 0,\ \theta_1 + \ldots + \theta_K = 1\}$$
Maximize a ∩-convex function on a convex set
- There exist efficient algorithms (see the sketch below)
- What about an analytical expression?
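One such efficient algorithm is the Blahut-Arimoto iteration; the sketch below is ours (not taken from the course), alternating the input-distribution update with the standard lower and upper bounds on C until they meet:

```python
import numpy as np

def blahut_arimoto(P, tol=1e-9, max_iter=10000):
    """Capacity (in bits) and optimal input Q for a channel P[k, j] = P(b_j|a_k)."""
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    Q = np.full(K, 1.0 / K)                       # start from the uniform input
    for _ in range(max_iter):
        q_y = Q @ P                               # current output distribution
        # D[k] = exp( sum_j P(b_j|a_k) log( P(b_j|a_k) / q_y[j] ) )
        logratio = np.where(P > 0, np.log(np.where(P > 0, P, 1.0) / q_y), 0.0)
        D = np.exp(np.sum(P * logratio, axis=1))
        I_low, I_up = np.log(np.sum(Q * D)), np.log(np.max(D))   # bounds on C (nats)
        Q = Q * D / np.sum(Q * D)                 # Blahut-Arimoto update of Q
        if I_up - I_low < tol:
            break
    return I_low / np.log(2), Q

# BSC with p = 0.1: C = 1 - H(0.1) ~ 0.531 bit, optimal Q is uniform.
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))
```

The same sketch applies unchanged to the non-symmetric channels of the later examples.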
Convex optimization
Based on Lagrange multipliers for the current problem
Theorem 1.3: Convex optimization
For a ∩-convex function $f: S \to \mathbb{R}$ on a convex set S, also assumed continuously differentiable on S, the vector $\theta$ maximizes f on S iff there exists a real number $\lambda$ such that
$$\frac{\partial f(\theta)}{\partial \theta_k} = \lambda \quad \text{for all } k \text{ such that } \theta_k > 0$$
$$\frac{\partial f(\theta)}{\partial \theta_k} \le \lambda \quad \text{for all } k \text{ such that } \theta_k = 0$$
Characterization of capacity
Sometimes an analytical solution can be proven
Theorem 1.4
Let the functions $g_k: S \to \mathbb{R}$ be defined as
$$g_k(\theta) = \sum_{j=1}^{J} P(b_j|a_k)\,\log\frac{P(b_j|a_k)}{\sum_{t=1}^{K} \theta_t\,P(b_j|a_t)}
\qquad \text{where } \theta_t = Q(a_t)$$
The distribution Q realizes the capacity iff there exists a real number C such that
$$g_k(\theta) = C \quad \text{for all } k \text{ such that } Q(a_k) > 0$$
$$g_k(\theta) \le C \quad \text{for all } k \text{ such that } Q(a_k) = 0$$
The number C is unique and is the capacity of the channel
Property: $I(X; Y) = \sum_{k=1}^{K} \theta_k\, g_k(\theta)$
Proven by using Theorem 1.3 with $f(\theta) = I(X; Y)$
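A small numerical check of this characterization (our own sketch, using the BSC of the earlier example as an assumed test case): at the uniform input, g_0(θ) and g_1(θ) coincide and equal the BSC capacity.

```python
import numpy as np

def g(P, theta, k):
    """g_k(theta) = sum_j P(b_j|a_k) log2( P(b_j|a_k) / sum_t theta_t P(b_j|a_t) )."""
    P = np.asarray(P, dtype=float)
    qy = np.asarray(theta, dtype=float) @ P      # sum_t theta_t P(b_j|a_t)
    row = P[k]
    mask = row > 0
    return float(np.sum(row[mask] * np.log2(row[mask] / qy[mask])))

P_bsc = [[0.9, 0.1], [0.1, 0.9]]
theta = [0.5, 0.5]
print(g(P_bsc, theta, 0), g(P_bsc, theta, 1))    # both ~0.531 = C of the BSC
```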
1.2.3 Symmetric Channel
A channel is characterized by its transition probability matrix P
$$P = \begin{pmatrix}
P(b_1|a_1) & P(b_2|a_1) & \cdots & P(b_J|a_1) \\
P(b_1|a_2) & P(b_2|a_2) & \cdots & P(b_J|a_2) \\
\vdots & \vdots & \ddots & \vdots \\
P(b_1|a_K) & P(b_2|a_K) & \cdots & P(b_J|a_K)
\end{pmatrix}$$
The channel is symmetric if the outputs can be partitioned into blocks such that, within each corresponding block of columns of P, every row is a permutation of every other row and every column is a permutation of every other column.
Example 1: Binary symmetric channel
$$P = \begin{pmatrix} q & p \\ p & q \end{pmatrix}, \qquad p + q = 1$$
Simple partition: one block
Examples of Symmetric channels
Example 2: Binary symmetric channel with erasure
$$P = \begin{pmatrix} q & p & r \\ p & q & r \end{pmatrix}, \qquad p + q + r = 1$$
(rows: inputs 0 and 1; columns: outputs 0, 1 and the erasure symbol)
The third output symbol corresponds to an erasure of the input symbol
Example 3: 3-state symmetric channel with zero diagonal
$$P = \begin{pmatrix} 0 & p & q \\ q & 0 & p \\ p & q & 0 \end{pmatrix}, \qquad p + q = 1$$
(inputs and outputs 0, 1, 2)
Capacity of symmetric channels
Theorem 1.5: Optimal distribution for symmetric channels
For a symmetric channel, the uniform distribution $Q(a_k) = 1/K$ for $k = 1, \ldots, K$ realizes the capacity.
Proof: Use Theorem 1.4 with $\theta = (1/K, \ldots, 1/K)$: show that $g_k(\theta)$ is constant.
Consider the partition $T_1 \cup \ldots \cup T_m = \{1, \ldots, J\}$ corresponding to the symmetric partition $P = [P_1, P_2, \ldots, P_m]$
$$g_k(\theta) = \sum_{r=1}^{m} g_{k,r}(\theta)
\qquad \text{with} \qquad
g_{k,r}(\theta) = \sum_{j \in T_r} P(b_j|a_k)\,\log\frac{P(b_j|a_k)}{P(b_j)}$$
It is sufficient to show that, for each r, $g_{k,r}(\theta)$ is independent of k
Capacity of symmetric channels
For $j \in T_r$ (with fixed r), $P(b_j)$ does not depend on j:
$$P(b_j) = \sum_{l} Q(a_l)\,P(b_j|a_l) = \frac{1}{K}\sum_{l} P(b_j|a_l) = c_r$$
because the columns of $P_r$ are equivalent to each other
The sum over $j \in T_r$ of the terms $P(b_j|a_k)\,\log P(b_j|a_k)$ does not depend on k because all the rows of $P_r$ are equivalent. Similarly for the sum of the terms $P(b_j|a_k)\,\log P(b_j)$, because $P(b_j) = c_r$.
In conclusion: $g_{k,r}(\theta)$ does not depend on k
The capacity is given by (using any value of k)
$$C = \sum_{j=1}^{J} P(b_j|a_k)\,\log\frac{K\,P(b_j|a_k)}{\sum_{l=1}^{K} P(b_j|a_l)}$$
Example 0: Perfect (error-free) channel. For an input alphabet of K letters, $P(b_j|a_k) = \delta_{j,k}$ and
$$C = \sum_{j=1}^{K} \delta_{j,k}\,\log\big(K\,\delta_{j,k}\big) = \log K$$
Obvious since I(X;Y) = H(X)
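The general formula above is easy to evaluate numerically; the sketch below is ours (with assumed numbers p = 0.05, q = 0.85, r = 0.10) and checks it against the closed form for the binary symmetric channel with erasure given on the next slide:

```python
import numpy as np

def symmetric_capacity(P, k=0):
    """C = sum_j P(b_j|a_k) log2( K P(b_j|a_k) / sum_l P(b_j|a_l) ), any row k."""
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    col_sum = P.sum(axis=0)
    row = P[k]
    mask = row > 0
    return float(np.sum(row[mask] * np.log2(K * row[mask] / col_sum[mask])))

p, q, r = 0.05, 0.85, 0.10
P = [[q, p, r],
     [p, q, r]]                                   # Example 2, capacities in bits
h = lambda x: -x * np.log2(x) - (1 - x) * np.log2(1 - x)
print(symmetric_capacity(P))                      # ~0.621 bit
print((p + q) * (1 - h(p / (p + q))))             # closed form, same value
```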
Capacity of symmetric channels: examples
Ex 1 Binary Symmetric channel
$$C = q\,\log 2q + p\,\log 2p = \log 2 + q\,\log q + p\,\log p = \log 2 - H(p)$$
where H is the binary entropy function $H(p) = -p\,\log p - (1-p)\,\log(1-p)$
Ex 2 Binary symmetric channel with erasure
$$C = q\,\log\frac{2q}{p+q} + p\,\log\frac{2p}{p+q} + r\,\log\frac{2r}{2r}
= (p+q)\left[\log 2 + \frac{q}{p+q}\log\frac{q}{p+q} + \frac{p}{p+q}\log\frac{p}{p+q}\right]
= (p+q)\left[\log 2 - H\!\left(\frac{p}{p+q}\right)\right]$$
Limit cases:
(1) r = 0: binary symmetric channel
(2) p = 0: $C = (1 - r)\,\log 2$ (binary erasure channel)
Capacity of symmetric channels: examples
Ex 3 3-state symmetric channel with zero diagonal
$$C = q\,\log 3q + p\,\log 3p = \log 3 + q\,\log q + p\,\log p = \log 3 - H(p)$$
Examples of NON-symmetric channels
Ex 4 Z-Channel
$$P = \begin{pmatrix} 1 & 0 \\ p & q \end{pmatrix}$$
(rows: inputs 1 and 0; columns: outputs 1 and 0; $p + q = 1$)
Input distribution: $Q(1) = 1 - R$, $Q(0) = R$
$$I(X; Y) = R\,g_0(R, 1-R) + (1-R)\,g_1(R, 1-R)$$
$$g_0 = p\,\log\frac{p}{1 - Rq} + q\,\log\frac{q}{Rq}
\qquad
g_1 = \log\frac{1}{1 - Rq}$$
The distribution $(R, 1-R)$ that realizes the capacity must satisfy $g_0 = g_1\ (= C)$:
$$R = \frac{1}{q(1 + \beta)} \quad \text{with } \beta = e^{H(q)/q}
\qquad\qquad
C = g_1 = \log\left(1 + \frac{1}{\beta}\right)$$
For small values of p:
$$C = \log 2 - \tfrac{1}{2} H(p) + O\big((p \log p)^2\big)$$
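These closed-form expressions are easy to verify numerically; the sketch below is ours (with an assumed value p = 0.1) and compares them against a brute-force maximization of I(X;Y) over R:

```python
import numpy as np

p = 0.1
q = 1.0 - p
H = lambda x: -x * np.log(x) - (1 - x) * np.log(1 - x)    # binary entropy, nats

beta = np.exp(H(q) / q)
R_opt = 1.0 / (q * (1.0 + beta))
C_closed = np.log(1.0 + 1.0 / beta)                       # capacity, nats

def I_z(R):
    """I(X;Y) of the Z-channel for Q(0) = R, Q(1) = 1 - R (outputs: P(0) = Rq)."""
    g0 = p * np.log(p / (1 - R * q)) + q * np.log(q / (R * q))
    g1 = np.log(1.0 / (1 - R * q))
    return R * g0 + (1 - R) * g1

C_brute = max(I_z(r) for r in np.linspace(1e-6, 1 - 1e-6, 100001))
print(R_opt, C_closed, C_brute)     # R ~ 0.456, C ~ 0.529 nat (~0.763 bit)
```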
Examples of NON-symmetric channels
Ex 5 W-Channel
$$P = \begin{pmatrix} 1 & 0 \\ p & q \\ 0 & 1 \end{pmatrix}$$
(rows: inputs 1, 0, −1; columns: outputs 1 and 0; $p + q = 1$)
$$I(X; Y) = H(Y) - H(Y|X) \qquad H(Y) \le \log 2$$
The distribution $Q(-1) = Q(1) = 1/2$, $Q(0) = 0$ gives $I(X;Y) = \log 2$
(the channel degenerates into a binary symmetric channel without error)
$$C = \log 2, \quad \text{independently of } p \text{ and } q$$
In general: no analytical expression for non-symmetric channels
1.3.1 Coding Theorem: Introduction
One objective of transmission: Guarantee a low error probability
(=probability that the letters sent by the source are incorrectly decoded at the
receiver)
Coding theorem: If the entropy of the source is lower than the capacity, there exists
a coding/decoding scheme that provides an error probability as low as desired
Negative result: If the entropy of the source is larger than the capacity, it is not possible to make the error probability arbitrarily small
Model
Objective: Decoded sequence v reproduces source sequence u as correctly as possible
Error probability on letter l:
$$\epsilon_l = \sum_{\mathbf{u},\mathbf{v}\,:\,u_l \ne v_l} P(\mathbf{u}, \mathbf{v})$$
Average error probability:
$$\bar{\epsilon} = \frac{1}{L}\sum_{l=1}^{L} \epsilon_l$$
Coding Theorem: Introduction
Source assumed memoryless: $H(U^L) = L\,H(U)$
$\tau_s$: time between source letters; $\tau_c$: time between channel letters
Negative result: If
$$\frac{H(U)}{\tau_s} > \frac{C}{\tau_c},$$
then $\bar{\epsilon} \ge \alpha$ for some positive constant $\alpha$, independent of L
The source entropy and the channel capacity are here expressed in bits/sec
1.3.2 Preliminary results
(1) If some uncertainty remains, $H(U^L|V^L) > 0$, the average error probability cannot be zero
For one given letter, L = 1 (Lemma 1.7)
For the whole sequence, L > 1 (Lemma 1.8)
(2) Relation between $H(U^L|V^L)$, the source entropy H(U) and the capacity C (Lemma 1.9)
Lemma 1.7: Error probability as a function of the conditional entropy
Let U and V be two random variables on an alphabet D of size S. The error probability defined as $\epsilon = \sum_{u \ne v} P(u,v)$ satisfies
$$\epsilon\,\log(S-1) + H(\epsilon) \ge H(U|V) = H(U) - I(U;V)$$
Lower bound on the error probability as a function of the conditional entropy (= remaining uncertainty on U after observing V)
The average uncertainty is at most the uncertainty on the letter u when there is an error, plus the uncertainty on the occurrence of an error
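A quick numerical illustration of the lemma (our own sketch, on a randomly drawn joint distribution with an assumed alphabet size S = 4):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4
P_uv = rng.random((S, S))
P_uv /= P_uv.sum()                       # random joint distribution P(u, v)

eps = P_uv.sum() - np.trace(P_uv)        # error probability: sum over u != v
P_v = P_uv.sum(axis=0)                   # marginal of V (columns index v)
H_U_given_V = -np.sum(P_uv * np.log2(P_uv / P_v))        # H(U|V) in bits

H2 = lambda x: -x * np.log2(x) - (1 - x) * np.log2(1 - x)
lhs = eps * np.log2(S - 1) + H2(eps)
print(lhs, ">=", H_U_given_V)            # the inequality of Lemma 1.7 holds
```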
Proof of Lemma 1.7
Compare P(u|v) with the symmetric distribution of parameter $\epsilon$:
$$P_\epsilon(u|v) = \begin{cases} \dfrac{\epsilon}{S-1} & \text{if } u \ne v \quad \text{(error)} \\[4pt] 1 - \epsilon & \text{if } u = v \quad \text{(no error)} \end{cases}$$
Using $\epsilon = \sum_{u \ne v} P(u,v)$, compute
$$H(U|V) - \epsilon\,\log(S-1) - H(\epsilon) = \sum_{u,v} P(u,v)\,\log\frac{P_\epsilon(u|v)}{P(u|v)}$$
Use $\log z \le z - 1$ (for $\log_e$):
$$\sum_{u,v} P(u,v)\,\log\frac{P_\epsilon(u|v)}{P(u|v)}
\le \sum_{u \ne v} P(u,v)\left[\frac{\epsilon}{(S-1)\,P(u|v)} - 1\right] + \sum_{u = v} P(u,v)\left[\frac{1-\epsilon}{P(u|v)} - 1\right]$$
$$= \frac{\epsilon}{S-1}\sum_{u \ne v} P(v) - \sum_{u \ne v} P(u,v) + (1-\epsilon)\sum_{u = v} P(v) - \sum_{u = v} P(u,v)$$
$$= \epsilon - \epsilon + (1-\epsilon) - (1-\epsilon) = 0$$
(using $\sum_{u \ne v} P(v) = S-1$ and $\sum_{u=v} P(v) = 1$), hence $H(U|V) \le \epsilon\,\log(S-1) + H(\epsilon)$
Error probability for sequence
Similar result for sequence of L letters
Lemma 1.8
For sequences of L letters, the average error probability satises
log(S 1) +H( )
1
L
H(U
L
|V
L
)
Proof:
Inequality on joint entropy
H(U
L
|V
L
)
L
l =1
H(U
l
|V
l
)
Then apply previous lemma to each pair U
l
, V
l
and use -convexity of H
Processing chain and information
Consider the chain $U^L \to X^N \to Y^N \to V^L$
Each element in the chain only depends on the previous one (no side information)
When x is known, y is (conditionally) independent of u: P(y|x, u) = P(y|x)
Similarly P(v|y, x, u) = P(v|y)
Lemma 1.9: Theorem of processing chain
If the source sequence of length L is transmitted to the receiver through N uses of the channel:
$$I(U^L; V^L) \le I(X^N; Y^N)$$
Proof: $I(U^L; Y^N | X^N) = 0$: u brings no information on y if x is known. It follows that
$$I(U^L; Y^N) \le I(X^N; Y^N)$$
Similarly
$$I(U^L; V^L) \le I(U^L; Y^N)$$
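As a concrete illustration of the lemma (our own construction, not from the course): a length-3 repetition code over the BSC with p = 0.1 and majority decoding, where I(U;V) is strictly smaller than I(X^N;Y^N):

```python
import itertools
import numpy as np

def mi_from_joint(J):
    """Mutual information in bits from a joint probability matrix J[a, b]."""
    pa, pb = J.sum(axis=1, keepdims=True), J.sum(axis=0, keepdims=True)
    mask = J > 0
    return float(np.sum(J[mask] * np.log2((J / (pa * pb))[mask])))

p = 0.1
# Joint of (U, Y^3): U uniform, encoder X^3 = (U, U, U), memoryless BSC.
J_uy = np.zeros((2, 8))
for u in (0, 1):
    for y in itertools.product((0, 1), repeat=3):
        J_uy[u, y[0] * 4 + y[1] * 2 + y[2]] = \
            0.5 * np.prod([1 - p if yi == u else p for yi in y])

# Here I(X^3;Y^3) = I(U;Y^3), since X^3 is a deterministic invertible map of U.
I_xy = mi_from_joint(J_uy)

# Joint of (U, V) with V = majority vote of Y^3 (the decoder).
J_uv = np.zeros((2, 2))
for u in (0, 1):
    for j in range(8):
        v = 1 if bin(j).count("1") >= 2 else 0
        J_uv[u, v] += J_uy[u, j]

print(mi_from_joint(J_uv), "<=", I_xy)   # ~0.816 <= ~0.863
```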
1.3.3 Main result
We now relate L and N by introducing the time intervals $\tau_s$ and $\tau_c$
Number of channel uses: $N = \lfloor L\,\tau_s / \tau_c \rfloor$
For all L, the average error probability (per letter) satisfies
$$\bar{\epsilon}\,\log(S-1) + H(\bar{\epsilon}) \ge H(U) - \frac{\tau_s}{\tau_c}\,C$$
Proof:
$$\bar{\epsilon}\,\log(S-1) + H(\bar{\epsilon}) \ge \frac{1}{L}\,H(U^L|V^L) = \frac{1}{L}\,H(U^L) - \frac{1}{L}\,I(U^L;V^L) = H(U) - \frac{1}{L}\,I(U^L;V^L)$$
$$\ge H(U) - \frac{1}{L}\,I(X^N;Y^N) \ge H(U) - \frac{N}{L}\,C \ge H(U) - \frac{\tau_s}{\tau_c}\,C$$
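A worked numerical instance of the bound (our own assumed numbers, not from the slides): a binary source with H(U) = 1 bit per letter, one source letter per channel use (τ_s = τ_c), sent over the BSC with p = 0.1:

```python
import numpy as np

H2 = lambda x: -x * np.log2(x) - (1 - x) * np.log2(1 - x)

# Right-hand side H(U) - (tau_s/tau_c) C with H(U) = 1 bit, tau_s = tau_c,
# and C = 1 - H(0.1) for the BSC: the bound reads H(eps_bar) >= H(0.1),
# since the log(S-1) term vanishes for S = 2.
rhs = 1.0 - (1.0 - H2(0.1))                       # ~0.469 bit

eps_grid = np.linspace(1e-6, 0.5, 1_000_000)
eps_min = eps_grid[np.argmax(H2(eps_grid) >= rhs)]
print(rhs, eps_min)    # eps_bar cannot be driven below ~0.1, whatever the code
```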
Comments on result
Lower bound on average error probability
For any combination of source and channel coding
If entropy (in bits/sec) is higher than capacity, then some uncertainty must remain
and the error probability cannot approach zero
Only applies to the average probability, not to the individual probabilities $\epsilon_l$
It remains to prove the positive part