Week 4 - Channel Capacity (Chapter 7) and Differential Entropy (Chapter 8)
Codes and Probability of Error
Probability of Error¹
Conditional probability of error: λ_i = P[g(Y^n) ≠ i | X^n = x^n(i)]
Maximal probability of error: λ^(n) = max_{i=1,2,...,M} λ_i
Average probability of error: P_e^(n) = (1/M) Σ_i λ_i
¹ Superscripts indicate the length n of the code in bits.
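As a quick illustration of these definitions, the sketch below computes λ_i, λ^(n) and P_e^(n) by exhaustive enumeration. The binary symmetric channel with crossover probability p, the (M = 2, n = 3) repetition codebook and the majority-vote decoder g are assumptions chosen for the example, not taken from the slide.

```python
import itertools
from math import prod

# Assumed example: (M=2, n=3) repetition code over a BSC with crossover probability p.
p = 0.1
n, M = 3, 2
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}            # codewords x^n(i)
decode = lambda y: 1 if sum(y) <= n // 2 else 2    # decoder g(y^n): majority vote

# lambda_i = P[g(Y^n) != i | X^n = x^n(i)], summing over all output sequences y^n
lam = {}
for i, x in codebook.items():
    err = 0.0
    for y in itertools.product((0, 1), repeat=n):
        p_y = prod(p if yi != xi else 1 - p for yi, xi in zip(y, x))  # P[y^n | x^n(i)]
        if decode(y) != i:
            err += p_y
    lam[i] = err

max_err = max(lam.values())         # maximal probability of error  lambda^(n)
avg_err = sum(lam.values()) / M     # average probability of error  P_e^(n)
print(lam, max_err, avg_err)
```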
Rates and Achievability
Rate of a Code
The rate of an (M, n) code is defined as R = (log₂ M)/n, and hence an (M, n) code is often presented as a (2^{nR}, n) code¹.
Achievability
A rate R is said to be "achievable" if ∃ a sequence of (⌈2^{nR}⌉, n) codes such that λ^(n) → 0 as n → ∞.
Capacity of a Channel
The capacity of a channel is the supremum² of all achievable rates.
¹ Ideally M = ⌈2^{nR}⌉, but to simplify notation it is written as 2^{nR}.
² The supremum is the least upper bound (lub) of a set that is bounded above. For example, on (2, 3) the supremum (lub) is 3 but the maximum does not exist.
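To make the achievability definition concrete, here is a small sketch under assumptions of my own (a BSC(p) and the family of (2, n) repetition codes with majority decoding, none of which appear on the slide): the repetition codes have λ^(n) → 0, but their rate R = (log₂ 2)/n also tends to 0, so they only show that rate 0 is achievable; the capacity is the supremum over all achievable rates.

```python
from math import comb, log2

# Rate of an (M, n) code: R = log2(M) / n
def rate(M, n):
    return log2(M) / n

# Assumed example: (2, n) repetition codes over a BSC(p) with majority decoding.
# Their error probability vanishes, but so does their rate.
p = 0.1
for n in (1, 5, 25, 125):                       # odd n avoids ties in the majority vote
    # error iff more than half of the n bits are flipped
    err = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    print(f"n={n:4d}  R={rate(2, n):.4f}  lambda^(n)={err:.2e}")
```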
Channel Coding Theorem
Theorem
For a discrete memoryless channel, all rates below capacity C are achievable ⇒ for every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0. Conversely, any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have R ≤ C.
Basic Ideas
We have seen how the AEP makes source coding possible at the input.
Channel coding deals with both the input (source) and the output (receiver).
The proof is based on the joint AEP, which is the extension of the AEP to two random sequences.
Given Y^n, about 2^{nH(X|Y)} sequences X^n are conditionally typical, so the probability that a randomly chosen codeword is jointly typical with Y^n is 2^{nH(X|Y)} / 2^{nH(X)} = 2^{-nI(X;Y)}.
Hence, jointly typical codewords occur only about once in every 2^{nI(X;Y)} codewords (see the numerical sketch below).
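The counting argument above can be spot-checked numerically. The sketch below assumes X ∼ Bernoulli(1/2) sent through a binary symmetric channel with crossover p (my choice, not stated on the slide), so that H(X) = 1 bit, H(X|Y) = H_b(p) and I(X;Y) = 1 − H_b(p). It counts the x^n sequences conditionally typical with a fixed y^n and compares the resulting fraction with 2^{−nI(X;Y)}; the two agree to first order in the exponent.

```python
from math import comb, log2

# Assumed setup: X ~ Bernoulli(1/2) through a BSC(p).
p, n, eps = 0.1, 200, 0.05
Hb = -p * log2(p) - (1 - p) * log2(1 - p)   # H(X|Y) = H_b(p)
I = 1.0 - Hb                                # I(X;Y) = H(X) - H(X|Y) = 1 - H_b(p)

# For a fixed y^n, an x^n at Hamming distance d is conditionally typical iff
# |-(1/n) log2 p(x^n | y^n) - H(X|Y)| <= eps, and there are C(n, d) such sequences.
count = 0
for d in range(n + 1):
    cond_info = -(d / n) * log2(p) - (1 - d / n) * log2(1 - p)
    if abs(cond_info - Hb) <= eps:
        count += comb(n, d)

# log2 of the fraction of x^n sequences jointly typical with y^n vs. -n I(X;Y):
# the exponents agree to first order.
print(log2(count) - n, -n * I)
```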
Joint AEP and Jointly Typical Sequences
Proof Ideas
Non-constructive proofs are far from trivial. Shannon used the following ideas to prove that such a code is achievable:
Joint Typical Decoding: To use the AEP, he considered a receiver that declares that index Ŵ was sent if ∃ exactly one Ŵ such that (X^n(Ŵ), Y^n) ∈ A_ε^(n). If any other W′ ≠ Ŵ is also jointly typical, or if no such index exists, an error is declared.
The main idea was to use a random codebook (recall M = 2^{nR}):
C = [ x_1(1)       x_2(1)       ...  x_n(1)
        ⋮            ⋮                  ⋮
      x_1(2^{nR})  x_2(2^{nR})  ...  x_n(2^{nR}) ]
where P(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^{n} p(x_i(w)), implying that each x_i(w) ∼ p(x) is i.i.d.
Using such random codebooks and the joint AEP, he showed that the average error over all codebooks,
P(E) = Σ_C P(C) (1/M) Σ_{w=1}^{M} λ_w(C) ≤ 2ε,
which implies ∃ a codebook C* with error probability ≤ 2ε¹. (A small simulation in this spirit appears below.)
¹ Why? Arrive at a contradiction if this is not true! In essence, it is a form of the pigeonhole principle (PHP)!
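A small random-coding experiment in the spirit of this argument. Everything here is an assumption of mine (a BSC(p), block length n, rate R < C, and minimum-Hamming-distance decoding used in place of joint-typicality decoding for simplicity): it estimates the error probability averaged over random codebooks and then notes that some codebook must do at least as well as that average.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed parameters: BSC(p) with C = 1 - H_b(0.05) ~ 0.71 bits, and R < C.
p, n, R = 0.05, 24, 0.25
M = int(round(2 ** (n * R)))                        # M = 2^{nR} codewords
num_codebooks, trials = 20, 500

avg_err = []
for _ in range(num_codebooks):
    cb = rng.integers(0, 2, size=(M, n))            # random codebook: x_i(w) i.i.d. ~ Bern(1/2)
    errors = 0
    for _ in range(trials):
        w = rng.integers(0, M)                      # message W drawn uniformly
        y = cb[w] ^ (rng.random(n) < p).astype(int) # channel output Y^n
        w_hat = np.argmin(np.sum(cb != y, axis=1))  # decode to the nearest codeword
        errors += int(w_hat != w)
    avg_err.append(errors / trials)

# Average over codebooks vs. the best codebook: some C* does at least as well as the mean.
print(np.mean(avg_err), min(avg_err))
```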
Proof Ideas: Contd
The idea for the existence follows from the AEP. Since the source V satisfies the AEP, there is a typical set A_ε^(n) with at most 2^{n(H(V)+ε)} sequences, and these are the only sequences we encode; the remaining sequences result in an error. By the AEP, the typical sequences require about n(H(V) + ε) bits to index. So we can send the index to the receiver with probability of error less than ε if H(V) + ε = R < C. The receiver can decode the index if it can construct the typical set and enumerate these indices. The error probability is then
P(V^n ≠ V̂^n) ≤ P(V^n ∉ A_ε^(n)) + P(g(Y^n) ≠ V^n | V^n ∈ A_ε^(n)) ≤ ε + ε = 2ε.
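A numerical sketch of the typical-set encoding step, assuming a Bernoulli(q) source (the source is not specified on the slide): it measures the size of A_ε^(n) against 2^{n(H(V)+ε)} and estimates P(V^n ∈ A_ε^(n)).

```python
from math import comb, log2

# Assumed source: V_i i.i.d. Bernoulli(q).
q, n, eps = 0.2, 500, 0.05
H = -q * log2(q) - (1 - q) * log2(1 - q)     # H(V) in bits

# A sequence with k ones has -(1/n) log2 p(v^n) = -(k/n) log2 q - (1 - k/n) log2(1 - q).
size, prob = 0, 0.0
for k in range(n + 1):
    info = -(k / n) * log2(q) - (1 - k / n) * log2(1 - q)
    if abs(info - H) <= eps:                 # all C(n, k) such sequences are typical
        size += comb(n, k)
        prob += comb(n, k) * q**k * (1 - q)**(n - k)

# |A_eps^(n)| <= 2^{n(H+eps)}, and P(V^n in A_eps^(n)) -> 1 as n grows,
# so about n(H(V)+eps) bits suffice to index the sequences we actually encode.
print(log2(size), n * (H + eps), prob)
```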
Differential Entropy
Differential Entropy (Contd)
Hence, we have:
∫ f(x) (x − µ)^T K^{-1} (x − µ) dx = ∫ f(x) Σ_{i,j} (x_i − µ_i) K^{-1}_{ij} (x_j − µ_j) dx
= E[ Σ_{i,j} (X_i − µ_i) K^{-1}_{ij} (X_j − µ_j) ] = Σ_{i,j} E[ (X_i − µ_i) K^{-1}_{ij} (X_j − µ_j) ]
(a)= Σ_j Σ_i K_{ji} K^{-1}_{ij} (b)= Σ_j (K K^{-1})_{jj} = Σ_j I_{jj} = n,
where (a) uses E[(X_i − µ_i)(X_j − µ_j)] = K_{ij} = K_{ji} and (b) is just matrix multiplication.
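The identity E[(X − µ)^T K^{-1} (X − µ)] = n is easy to verify by simulation; the covariance K and mean µ below are arbitrary choices for the check, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Verify E[(X - mu)^T K^{-1} (X - mu)] = n for an arbitrary positive-definite K.
n = 4
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                 # a positive-definite covariance matrix
mu = rng.standard_normal(n)

X = rng.multivariate_normal(mu, K, size=200_000)
d = X - mu
quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(K), d)   # (x - mu)^T K^{-1} (x - mu) per sample
print(quad.mean(), n)                       # the sample mean should be close to n
```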
For differential entropy, the typical set is defined analogously:
A_ε^(n) = {x ∈ S^n : | −(1/n) log f(x) − h(X)| ≤ ε}
However, the typical set is no longer a discrete set: its cardinality is replaced by the volume of the set, and the analogous properties follow.
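A quick check of this continuous AEP, assuming X_i i.i.d. N(0, σ²) so that h(X) = (1/2) log₂(2πeσ²) bits (the distribution and parameters are my choice): the empirical −(1/n) log₂ f(X^n) concentrates around h(X), so most samples fall in A_ε^(n).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed source: X_i i.i.d. N(0, sigma^2), h(X) = 0.5 * log2(2*pi*e*sigma^2) bits.
sigma2, n, trials, eps = 2.0, 1000, 5000, 0.05
h = 0.5 * np.log2(2 * np.pi * np.e * sigma2)

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
# log2 f(x_i) for the Gaussian density, so that f(x^n) = prod_i f(x_i)
log_f = -0.5 * np.log2(2 * np.pi * sigma2) - x**2 / (2 * sigma2 * np.log(2))
emp = -log_f.mean(axis=1)                   # -(1/n) log2 f(x^n), one value per sample

# h(X), its empirical counterpart, and the fraction of samples inside A_eps^(n)
print(h, emp.mean(), np.mean(np.abs(emp - h) <= eps))
```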
Hadamard's Inequality¹
If X ∼ N(0, K) is a multivariate normal random vector, then:
|K| ≤ Π_{i=1}^{n} K_{ii}
¹ Can also be proved using eigendecomposition and the AM-GM inequality. See wiki.
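A numerical spot-check of Hadamard's inequality; K = AA^T with random A is just a convenient way to generate positive semi-definite matrices and is not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spot-check |K| <= prod_i K_ii on random positive semi-definite matrices.
for _ in range(5):
    n = int(rng.integers(2, 6))
    A = rng.standard_normal((n, n))
    K = A @ A.T                                  # positive semi-definite
    lhs, rhs = np.linalg.det(K), np.prod(np.diag(K))
    print(f"n={n}  |K|={lhs:.4f}  prod K_ii={rhs:.4f}  holds={lhs <= rhs + 1e-9}")
```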
Maximum Entropy and Other Properties
Consider two distributions, g(x) and φ(x) ∼ N(0, K), having the same covariance, i.e.,
E_g[X_i X_j] = ∫ g(x) x_i x_j dx = E_φ[X_i X_j] = ∫ φ(x) x_i x_j dx = K_{ij}.
Now we use the KL distance between them:
D(g || φ) = E_g[ log (g(x)/φ(x)) ] = −h(g) − ∫ g(x) log φ(x) dx
Now, φ(x) being normal with zero mean,
log φ(x) = −(1/2) log((2π)^n |K|) − (1/2) x^T K^{-1} x
∫ g(x) log φ(x) dx = ∫ g(x) [ −(1/2) log((2π)^n |K|) − (1/2) x^T K^{-1} x ] dx (a)= −h(φ)
(a) holds because g(x) is a pdf (for the 1st term) and g has the same covariance matrix as φ (for the 2nd term). By Gibbs' Inequality: