
Week 4 - Channel Capacity (Chapter 7) and Differential Entropy (Chapter 8)

1 / 16
Codes and Probability of Error

(M, n) code for a channel (X^n, p(y^n|x^n), Y^n)


It consists of:
An index set {1, 2, 3, . . . , M}
An encoding function X^n : {1, 2, . . . , M} → X^n, yielding codewords x^n(1), x^n(2), . . . , x^n(M). The set of codewords is called the codebook.
A decoding function g : Y^n → {1, 2, . . . , M}, a deterministic rule that assigns a guess to each possible received vector.

Probability of Error¹
Conditional probability of error: λ_i = P[g(Y^n) ≠ i | x^n(i) sent]
Maximum probability of error: λ^(n) = max_{i=1,2,...,M} λ_i
Average probability of error: P_e^(n) = (1/M) Σ_i λ_i

¹ Superscripts indicate the block length n of the code.
2 / 16
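To make the definitions above concrete, here is a small numerical sketch (not part of the original slides; the channel and code are my own illustrative choices): a (M, n) = (2, 3) repetition codebook over a binary symmetric channel with assumed crossover probability p = 0.1, with the conditional error probabilities λ_i, the maximum λ^(n) and the average P_e^(n) computed exactly by enumerating all received vectors.

    # Illustrative sketch (assumed example): exact error probabilities of a
    # (M, n) = (2, 3) repetition code over a BSC with crossover probability p.
    from itertools import product

    p = 0.1                                   # assumed crossover probability
    codebook = {1: (0, 0, 0), 2: (1, 1, 1)}   # encoding function: index -> codeword x^n(i)

    def decode(y):
        # decoding function g: majority vote over the received vector y^n
        return 2 if sum(y) >= 2 else 1

    def channel_prob(x, y):
        # P(y^n | x^n) for a memoryless BSC: product of per-symbol probabilities
        prob = 1.0
        for xi, yi in zip(x, y):
            prob *= p if xi != yi else (1 - p)
        return prob

    # conditional probability of error: lambda_i = P[g(Y^n) != i | x^n(i) sent]
    lam = {}
    for i, x in codebook.items():
        lam[i] = sum(channel_prob(x, y)
                     for y in product((0, 1), repeat=3) if decode(y) != i)

    max_pe = max(lam.values())                # maximum probability of error
    avg_pe = sum(lam.values()) / len(lam)     # average probability of error
    print(lam, max_pe, avg_pe)                # both equal 3*p^2*(1-p) + p^3 = 0.028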
Rates and Achievability

Rate of a Code
The rate of an (M, n) code is defined as R = (log2 M)/n, and hence an (M, n) code is often written as a (2^{nR}, n) code¹.

Achievability
A rate R is said to be "achievable" if there exists a sequence of (⌈2^{nR}⌉, n) codes such that λ^(n) → 0 as n → ∞.

Capacity of a Channel
The capacity of a channel is the supremum (maximum)² of all achievable rates.

¹ Ideally M = ⌈2^{nR}⌉, but to simplify notation it is written as 2^{nR}.
² The supremum is the least upper bound (lub) of a set that is bounded above. For example, for the open interval (2, 3) the supremum (lub) is 3, but the maximum does not exist.
3 / 16
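As a quick arithmetic check of the rate definition (the specific (M, n) pairs below are my own examples, not from the slides):

    # Sketch: rate R = log2(M)/n of an (M, n) code, and the ceil(2^{nR}) convention.
    import math

    def rate(M, n):
        return math.log2(M) / n

    print(rate(2, 3))       # (2, 3) repetition code: R = 1/3 bit per channel use
    print(rate(16, 7))      # a hypothetical (16, 7) code: R = 4/7

    # Going the other way: for a target rate R and block length n, the number of
    # messages is M = ceil(2^{nR}).
    R, n = 0.5, 10
    M = math.ceil(2 ** (n * R))
    print(M, rate(M, n))    # M = 32, and the achieved rate is exactly 0.5 here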
Channel Coding Theorem
Theorem
For a discrete memoryless channel, all rates below capacity C are achievable ⇒ for every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0. Conversely, any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have R ≤ C.

Basic Ideas
We have seen how the AEP makes source coding possible at the input. Channel coding deals with both the input (source) and the output (receiver).
The proof of this is based on the joint AEP, which is the extension of the AEP to two random sequences.
Given Y^n, about 2^{nH(X|Y)} input sequences are conditionally typical with it, so the probability that an independently chosen codeword is jointly typical with Y^n is about 2^{nH(X|Y)}/2^{nH(X)} = 2^{-nI(X;Y)}.
Hence, a jointly typical (confusable) codeword occurs only about once in every 2^{nI(X;Y)} codewords, so roughly 2^{nI(X;Y)} codewords can be used.
4 / 16
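For a DMC, the capacity appearing in the theorem can be computed as C = max over p(x) of I(X;Y). The sketch below (my own example, assuming a binary symmetric channel with crossover probability 0.1) finds this maximum by a brute-force search over Bernoulli input distributions and compares it with the closed form 1 − H(p).

    # Sketch: C = max_{p(x)} I(X;Y) for an assumed BSC, found by grid search.
    import numpy as np

    def H2(q):
        # binary entropy in bits, with the 0 log 0 = 0 convention
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

    def mutual_information(px1, p):
        # I(X;Y) = H(Y) - H(Y|X) for a BSC(p) with P(X = 1) = px1
        py1 = px1 * (1 - p) + (1 - px1) * p
        return H2(py1) - H2(p)

    p = 0.1                                     # assumed crossover probability
    grid = np.linspace(0, 1, 1001)
    C = max(mutual_information(px1, p) for px1 in grid)
    print(C, 1 - H2(p))                         # both ~0.531 bits per channel use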
Joint AEP and Jointly Typical Sequences

Similar to the typical set, we can define jointly typical sequences:


Jointly Typical Sequences
A set A_ϵ^(n) of jointly typical sequences {(x^n, y^n)} w.r.t. p(x, y) is the set of n-sequences with empirical entropies ϵ-close to the true entropies:

A_ϵ^(n) = {(x^n, y^n) ∈ X^n × Y^n : |−(1/n) log p(x^n) − H(X)| < ϵ, |−(1/n) log p(y^n) − H(Y)| < ϵ, |−(1/n) log p(x^n, y^n) − H(X, Y)| < ϵ}

where p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i)


P[(X^n, Y^n) ∈ A_ϵ^(n)] → 1 as n → ∞
|A_ϵ^(n)| ≤ 2^{n(H(X,Y)+ϵ)}
If (X̃^n, Ỹ^n) ∼ p(x^n)p(y^n), then P[(X̃^n, Ỹ^n) ∈ A_ϵ^(n)] ≤ 2^{−n(I(X;Y)−3ϵ)}, and for sufficiently large n, P[(X̃^n, Ỹ^n) ∈ A_ϵ^(n)] ≥ (1 − ϵ) 2^{−n(I(X;Y)+3ϵ)}
The proofs of these are similar to that of the AEP for X^n.

5 / 16
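The following Monte Carlo sketch (with an assumed toy joint pmf of my choosing) illustrates the last two properties: pairs drawn jointly from p(x, y) are jointly typical with probability approaching 1, while independently drawn pairs (X̃^n, Ỹ^n) ∼ p(x^n)p(y^n) land in A_ϵ^(n) with probability bounded by roughly 2^{−nI(X;Y)}.

    # Sketch: joint typicality of dependent vs. independently drawn sequences.
    import numpy as np

    rng = np.random.default_rng(0)
    pxy = np.array([[0.4, 0.1],                 # assumed toy joint pmf on {0,1}^2
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()          # entropy in bits

    HX, HY, HXY = H(px), H(py), H(pxy.ravel())
    I = HX + HY - HXY                           # I(X;Y), about 0.278 bits here
    n, eps, trials = 1000, 0.05, 2000

    def is_typical(x, y):
        lx = -np.log2(px[x]).mean()             # empirical -(1/n) log p(x^n)
        ly = -np.log2(py[y]).mean()
        lxy = -np.log2(pxy[x, y]).mean()
        return abs(lx - HX) < eps and abs(ly - HY) < eps and abs(lxy - HXY) < eps

    joint = indep = 0
    for _ in range(trials):
        idx = rng.choice(4, size=n, p=pxy.ravel())
        joint += is_typical(idx // 2, idx % 2)  # (X^n, Y^n) drawn jointly from p(x, y)
        xt = rng.choice(2, size=n, p=px)        # X~^n ~ p(x), independent of
        yt = rng.choice(2, size=n, p=py)        # Y~^n ~ p(y)
        indep += is_typical(xt, yt)

    print(joint / trials)                             # close to 1 for large n
    print(indep / trials, 2 ** (-n * (I - 3 * eps)))  # both exponentially small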
Proof Ideas
Non-constructive proofs such as this one are not trivial. Shannon used the following ideas to prove that such a code is achievable:
Joint typical decoding: to use the AEP, the receiver declares that index Ŵ was sent if there is exactly one Ŵ such that (X^n(Ŵ), Y^n) ∈ A_ϵ^(n). If there is any other W′ ≠ Ŵ with this property, or no such index exists, an error is declared.
The main idea was to use a random codebook (recall M = 2^{nR}):

C = [ x_1(1)        x_2(1)        . . .  x_n(1)
        ⋮             ⋮             ⋱      ⋮
      x_1(2^{nR})   x_2(2^{nR})   . . .  x_n(2^{nR}) ]

where P(C) = ∏_{w=1}^{2^{nR}} ∏_{i=1}^{n} p(x_i(w)), implying that each x_i(w) ∼ p(x) is i.i.d.
Using such random codebooks and the AEP, he showed that the average error probability over codebooks, P(E) = Σ_C P(C) · (1/M) Σ_{w=1}^{M} λ_w(C), is at most 2ϵ, which implies that there exists a codebook C* with (average) error probability ≤ 2ϵ¹.
¹ Why? Arrive at a contradiction if this is not true! In essence, it is a form of the pigeonhole principle (PHP)!
6 / 16
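A minimal sketch of the random-codebook construction (block length, rate and input distribution are assumed values for illustration): every entry x_i(w) of the 2^{nR} × n codeword matrix C is drawn i.i.d. from p(x), so the ensemble probability of a particular codebook factorizes exactly as on the slide.

    # Sketch: drawing a random codebook C with i.i.d. entries x_i(w) ~ p(x).
    import numpy as np

    rng = np.random.default_rng(1)
    n, R = 10, 0.5                       # assumed block length and rate
    M = int(np.ceil(2 ** (n * R)))       # number of codewords, M = ceil(2^{nR})
    p_x = np.array([0.5, 0.5])           # assumed input distribution on {0, 1}

    # codebook matrix: row w holds the codeword x^n(w) = (x_1(w), ..., x_n(w))
    C = rng.choice(len(p_x), size=(M, n), p=p_x)
    print(C.shape)                       # (32, 10)

    # log-probability of this particular codebook under the random-coding ensemble
    log2_PC = np.log2(p_x[C]).sum()      # log2 P(C) = sum_w sum_i log2 p(x_i(w))
    print(log2_PC)                       # -320 here, i.e. P(C) = 2^{-320}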
Proof Ideas: Contd

For the codebook C*, we should have:

P[E|C*] = (1/M) Σ_i λ_i(C*) ≤ 2ϵ

From C*, again use the pigeonhole principle in the following way: sort the indices i in ascending order of λ_i and consider the first M/2 codewords and the remaining ones. If any codeword in the first half had λ_i > 4ϵ (the critical case being the last, i.e. (M/2)-th, one), then by the sorting at least (M/2) + 1 codewords would have λ_i > 4ϵ, their sum would exceed 2Mϵ, and the average over M would exceed 2ϵ, a contradiction. Hence, for the first half we must have λ_i ≤ 4ϵ ∀i, and thus λ^(n) ≤ 4ϵ.
Thus, we are left with M/2 = 2^{nR−1} = 2^{n(R−1/n)} codewords for which λ^(n) ≤ 4ϵ, and hence the rate is R′ = R − 1/n. As n → ∞, this loss is negligible compared to R.
∴ The proof relies on applications of randomization, the AEP and the PHP.
7 / 16
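A toy numerical check of the expurgation step (the λ_i values below are synthetic, generated only for illustration): if the average of the λ_i is 2ϵ, then after sorting, every codeword in the better half has λ_i ≤ 4ϵ, so throwing away the worse half yields maximum error at most 4ϵ.

    # Sketch: the "throw away the worse half" step, with synthetic lambda_i values.
    import numpy as np

    rng = np.random.default_rng(2)
    eps, M = 0.01, 64
    lam = rng.exponential(scale=2 * eps, size=M)   # synthetic per-codeword errors
    lam *= (2 * eps) / lam.mean()                  # force the average to exactly 2*eps

    lam_sorted = np.sort(lam)                      # ascending order of lambda_i
    best_half = lam_sorted[: M // 2]               # keep the M/2 best codewords

    print(lam.mean())                              # 2*eps
    print(best_half.max())                         # <= 4*eps by the pigeonhole argument
    assert best_half.max() <= 4 * eps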
Source-Channel Separation Theorem
Theorem
If V_1, V_2, . . . , V_n is a finite-alphabet stochastic process that satisfies the AEP and H(V) < C, there exists a source-channel code with probability of error P[V̂^n ≠ V^n] → 0. Conversely, for any stationary stochastic process, if H(V) > C, the probability of error is bounded away from zero, and it is not possible to send the process over the channel with arbitrarily low probability of error.

The idea for the existence follows from the AEP. As the finite-alphabet source satisfies the AEP, there is a typical set A_ϵ^(n) which has at most 2^{n(H(V)+ϵ)} sequences, and these are the only sequences we encode; all the remaining sequences result in an error. By the AEP, the typical sequences require n(H(V) + ϵ) bits to index. So, we can send the index to the receiver with probability of error less than ϵ if H(V) + ϵ = R < C. The receiver can decode the index if it can construct the typical set and enumerate these indices. The error probability will then be
P(V^n ≠ V̂^n) ≤ P(V^n ∉ A_ϵ^(n)) + P(g(Y^n) ≠ V^n | V^n ∈ A_ϵ^(n)) ≤ ϵ + ϵ = 2ϵ
8 / 16
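A tiny sketch of the separation criterion with assumed numbers: compare the entropy rate H(V) of an i.i.d. source with the capacity C of a BSC. With one channel use per source symbol, reliable transmission is possible iff H(V) < C; the standard extension with k channel uses per source symbol relaxes this to H(V) < kC.

    # Sketch: separation criterion H(V) vs C for an assumed source and channel.
    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    H_V = H([0.7, 0.2, 0.1])          # i.i.d. ternary source, ~1.157 bits/symbol
    C = 1 - H([0.02, 0.98])           # BSC with assumed p = 0.02, ~0.859 bits/use

    print(H_V, C, H_V < C)            # False: one channel use per symbol is not enough
    print(H_V < 1.5 * C)              # True: 1.5 channel uses per symbol suffice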
Differential Entropy

We intend to extend our treatment to continuous random variables.


To do this, consider a random variable X with pdf f(x), quantized with step size ∆ as:

X^∆ = x_i if i∆ ≤ X < (i + 1)∆, with p_i = f(x_i)∆

The entropy of the quantized version is:

H(X^∆) = − Σ_i p_i log p_i = − Σ_i f(x_i)∆ log(f(x_i)∆) = − Σ_i f(x_i)∆ log f(x_i) − Σ_i f(x_i)∆ log ∆

When ∆ → 0, the first term tends to h(X) = − ∫ f(x) log f(x) dx.
For the second term, since log ∆ does not depend on i, we can take it out of the sum to our advantage; in the limiting case, Σ_i f(x_i)∆ = ∫ f(x) dx = 1.
Hence H(X^∆) = h(X) − log ∆.

9 / 16
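A numerical sketch of H(X^∆) ≈ h(X) − log ∆, using a standard Gaussian as the example (my choice; the slide's derivation is for a general pdf). The bin probabilities p_i are computed from the Gaussian CDF, so H(X^∆) is exact up to truncation of the far tails.

    # Sketch: H(X^Delta) vs h(X) - log2(Delta) for X ~ N(0, 1).
    import numpy as np
    from scipy.stats import norm

    def quantized_entropy(delta, lim=12.0):
        # bin edges i*delta covering essentially all of the probability mass
        edges = np.arange(-lim, lim + delta, delta)
        p = np.diff(norm.cdf(edges))       # p_i = P(i*delta <= X < (i+1)*delta)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    h_X = 0.5 * np.log2(2 * np.pi * np.e)  # differential entropy of N(0,1), ~2.047 bits

    for delta in (0.5, 0.1, 0.01):
        print(delta, quantized_entropy(delta), h_X - np.log2(delta))
        # the last two numbers agree more and more closely as delta shrinks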
Differential Entropy (Contd)

There are some points that are to be clarified²:


When ∆ → 0, due to the imposed continuity assumptions, the convention 0 log 0 = 0 makes the second term 0. Hence, for ∆ → 0, H(X^∆) = h(X).
For small but nonzero ∆, we have H(X^∆) ≈ h(X) − log ∆.
For X ∼ U[0, 1], an n-bit quantization implies ∆ = 2^{−n} and H(X^∆) = h(X) + n. Because h(X) = 0 when X ∼ U[0, 1], we have H(X^∆) = n, i.e., n bits are required to obtain an n-bit quantized version of X. You should not compare this with a uniform distribution over, say, n values, as that is a different distribution. Rather, X^∆ is a random variable with 2^n equally likely possibilities in [0, 1] and hence requires n bits to represent.
If X ∼ U[0, 1/8], the first 3 bits after the (binary) point can be set to 0 and we need only n − 3 bits (equivalently, (1/8)/2^{n−3} = 2^{−n}) for an n-bit accurate representation of X.
² I was not able to explain this well in the class. Here, I have tried!
10 / 16
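A quick check of the U[0, 1/8] example in the same spirit (the values of n below are my own): with step ∆ = 2^{−n} there are 2^{n−3} equally likely bins, so H(X^∆) = n − 3 bits.

    # Sketch: n-bit quantization of X ~ U[0, 1/8] needs n - 3 bits.
    import math

    def quantized_entropy_uniform(n):
        delta = 2.0 ** (-n)
        n_bins = round((1 / 8) / delta)      # 2^{n-3} equally likely bins
        p = 1.0 / n_bins
        return -n_bins * p * math.log2(p)    # = log2(n_bins) = n - 3

    for n in (4, 8, 16):
        print(n, quantized_entropy_uniform(n))   # 1, 5 and 13 bits respectively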
Covariance matrices
Let X = [X_1 X_2]^T represent a vector random variable, where each X_i is a real-valued random variable. Their means can also be represented as a vector µ_X = [µ_{X_1} µ_{X_2}]^T. We can define the covariance matrix as:

K = E[(X − µ_X)(X − µ_X)^T]
  = [ E[(X_1 − µ_{X_1})^2]                 E[(X_1 − µ_{X_1})(X_2 − µ_{X_2})]
      E[(X_2 − µ_{X_2})(X_1 − µ_{X_1})]    E[(X_2 − µ_{X_2})^2]              ]
Clearly, the matrix is symmetric.
The diagonal elements are always non-negative.
Is this matrix invertible?
Check whether the matrix is positive semi-definite: for any y ≠ 0 (taking X to be zero-mean here, so that K = E[XX^T]), we have y^T K y = E[y^T X X^T y] = E[(X^T y)^2] ≥ 0.
This is 0 only when X^T y = 0, an event of probability 0 for a non-degenerate X.
This implies that K is positive definite in all practical cases; hence all its eigenvalues are positive, so the determinant (the product of the eigenvalues) is nonzero and the matrix is non-singular and hence invertible.
11 / 16
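A sketch that checks the listed properties on an empirical covariance matrix (the two correlated variables below are my own toy construction): symmetry, non-negative diagonal, positive eigenvalues, and a nonzero determinant, hence invertibility.

    # Sketch: empirical covariance matrix of a toy 2-dimensional random vector.
    import numpy as np

    rng = np.random.default_rng(3)
    x1 = rng.normal(size=100_000)
    x2 = 0.6 * x1 + 0.8 * rng.normal(size=100_000)   # correlated with x1
    X = np.stack([x1, x2])                           # shape (2, N)

    mu = X.mean(axis=1, keepdims=True)
    K = (X - mu) @ (X - mu).T / X.shape[1]           # K = E[(X - mu)(X - mu)^T]

    print(K)                                         # roughly [[1, 0.6], [0.6, 1]]
    print(np.allclose(K, K.T))                       # symmetric
    print(np.all(np.diag(K) >= 0))                   # non-negative diagonal
    print(np.linalg.eigvalsh(K))                     # all positive: positive definite
    print(np.linalg.det(K))                          # nonzero, so K is invertible
    print(np.linalg.inv(K))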
Multi-variate Gaussian Distribution and Entropy
A multi-variate Gaussian random variable X ∼ N(µ, K) is a vector of n random variables characterized by its mean vector µ and covariance matrix K, with pdf defined as:

f(x) = 1/((2π)^{n/2} |K|^{1/2}) · exp(−½ (x − µ)^T K^{−1} (x − µ))

Let us compute the entropy of this:

ln f(x) = −½ ln((2π)^n |K|) − ½ (x − µ)^T K^{−1} (x − µ)
−∫ f(x) ln f(x) dx = ½ ln((2π)^n |K|) + ½ ∫ f(x) (x − µ)^T K^{−1} (x − µ) dx

The first, constant term is due to ∫ f(x) dx = 1.
We now evaluate the second term. For 2 random variables (writing K^{−1}_{ij} for the (i, j) entry of K^{−1}):

[ x_1 − µ_1   x_2 − µ_2 ] [ K^{−1}_{11}  K^{−1}_{12} ] [ x_1 − µ_1 ]
                          [ K^{−1}_{21}  K^{−1}_{22} ] [ x_2 − µ_2 ]
= [ x_1 − µ_1   x_2 − µ_2 ] [ K^{−1}_{11}(x_1 − µ_1) + K^{−1}_{12}(x_2 − µ_2) ]
                            [ K^{−1}_{21}(x_1 − µ_1) + K^{−1}_{22}(x_2 − µ_2) ]
= Σ_{i∈{1,2}} (x_i − µ_i) [ K^{−1}_{i1}(x_1 − µ_1) + K^{−1}_{i2}(x_2 − µ_2) ]

12 / 16
Entropy of Multi-variate Gaussian
(x − µ)^T K^{−1} (x − µ) = Σ_{i∈{1,2}} (x_i − µ_i) [ K^{−1}_{i1}(x_1 − µ_1) + K^{−1}_{i2}(x_2 − µ_2) ]
= Σ_{i∈{1,2}} (x_i − µ_i) Σ_{j∈{1,2}} K^{−1}_{ij} (x_j − µ_j) = Σ_{i,j} (x_i − µ_i) K^{−1}_{ij} (x_j − µ_j)   [and similarly ∀ n ≥ 2]

Hence, we have:

∫ f(x) (x − µ)^T K^{−1} (x − µ) dx = ∫ f(x) Σ_{i,j} (x_i − µ_i) K^{−1}_{ij} (x_j − µ_j) dx
= E[ Σ_{i,j} (X_i − µ_i) K^{−1}_{ij} (X_j − µ_j) ] = Σ_{i,j} E[(X_i − µ_i)(X_j − µ_j)] K^{−1}_{ij}
=(a) Σ_j Σ_i K_{ji} K^{−1}_{ij} =(b) Σ_j (K K^{−1})_{jj} = Σ_j I_{jj} = n

(a) is because K is symmetric. (b) is by the definition of the matrix product: if A = [a_ij] and B = [b_ij], then C = AB = [c_ij] where c_ij = Σ_k a_ik b_kj.
13 / 16
AEP

Finally, we should have h(X) = n/2 + ½ ln((2π)^n |K|) = ½ ln((2πe)^n |K|) nats.

Note that if we change the base of the log in the definition to 2, we should have h(X) = n/(2 ln 2) + ½ log2((2π)^n |K|) = ½ [log2 e^n + log2((2π)^n |K|)] = ½ log2((2πe)^n |K|) bits.
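A sketch that verifies the entropy formula above and the quadratic-form identity E[(X − µ)^T K^{−1} (X − µ)] = n from the previous slide, for an assumed 2 × 2 covariance matrix: the Monte Carlo average of the quadratic form is ≈ n, and ½ log2((2πe)^n |K|) matches the differential entropy reported by scipy's multivariate normal (converted from nats to bits).

    # Sketch: E[(X-mu)^T K^{-1} (X-mu)] = n and h(X) = 0.5*log2((2*pi*e)^n |K|).
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(4)
    mu = np.array([1.0, -2.0])                  # assumed mean vector
    K = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                  # assumed covariance matrix
    n = len(mu)

    X = rng.multivariate_normal(mu, K, size=200_000)
    d = X - mu
    quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(K), d)     # quadratic form per sample
    print(quad.mean())                          # ~ n = 2

    h_bits = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))
    h_scipy = multivariate_normal(mu, K).entropy() / np.log(2)  # nats -> bits
    print(h_bits, h_scipy)                      # agree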

AEP for Differential Entropy


The AEP follows from the WLLN for continuous RVs (change the sums to integrals and the pmfs to pdfs in the proof).

A_ϵ^(n) = {x^n ∈ S^n : |−(1/n) log f(x^n) − h(X)| ≤ ϵ},  where f(x^n) = ∏_{i=1}^{n} f(x_i)

However, the typical set is no longer a discrete set. The cardinality is replaced by the volume of the set, and the other properties follow:

Vol(A_ϵ^(n)) ≤ 2^{n(h(X)+ϵ)}

As n → ∞, P[A_ϵ^(n)] → 1 and Vol(A_ϵ^(n)) ≥ (1 − ϵ) 2^{n(h(X)−ϵ)}
14 / 16
Hadamard’s Inequality and Maximum Entropy

Hadamard's Inequality¹
If X ∼ N(0, K) is a multi-variate normal random variable, then:

|K| ≤ ∏_{i=1}^{n} K_ii

The proof simply follows from entropy inequalities:

h(X) = h(X_1, X_2, · · · , X_n) = ½ ln((2πe)^n |K|)
     = Σ_i h(X_i | X_{i−1}, X_{i−2}, . . . , X_1) ≤ Σ_i h(X_i) = ½ Σ_i ln(2πe K_ii)

Comparing the first and last expressions and cancelling the common (2πe)^n factor gives |K| ≤ ∏_i K_ii.

Maximum Entropy with same covariance matrix


Let the random vector X ∈ R^n have zero mean and covariance K = E[XX^T] (i.e., K_ij = E[X_i X_j], 1 ≤ i, j ≤ n). Then h(X) ≤ ½ log((2πe)^n |K|), with equality iff X ∼ N(0, K).

¹ Can also be proved using an eigen-decomposition and the AM-GM inequality. See wiki.
15 / 16
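A quick numerical check of Hadamard's inequality on randomly generated positive-definite matrices (my own construction, K = AA^T with Gaussian A):

    # Sketch: Hadamard's inequality det(K) <= product of the diagonal entries.
    import numpy as np

    rng = np.random.default_rng(5)
    for _ in range(5):
        A = rng.normal(size=(4, 4))
        K = A @ A.T + 1e-6 * np.eye(4)      # random symmetric positive-definite matrix
        det_K = np.linalg.det(K)
        prod_diag = np.prod(np.diag(K))
        print(det_K <= prod_diag, det_K, prod_diag)   # True in every trial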
Maximum Entropy and Other properties
Consider 2 distributions, g(x) and ϕ(x) ∼ N(0, K), where they have the same covariance, i.e.,
E_g[X_i X_j] = ∫ g(x) x_i x_j dx = E_ϕ[X_i X_j] = ∫ ϕ(x) x_i x_j dx = K_ij.
Now, we use the KL distance between them:

D(g||ϕ) = E_g[log(g(x)/ϕ(x))] = −h(g) − ∫ g(x) log ϕ(x) dx

Now, ϕ(x) being Normal with 0 mean,

log ϕ(x) = −½ log((2π)^n |K|) − ½ x^T K^{−1} x
∫ g(x) log ϕ(x) dx = ∫ g(x) [−½ log((2π)^n |K|) − ½ x^T K^{−1} x] dx =(a) −h(ϕ)

(a) holds because g(x) is a pdf (for the first term) and g has the same covariance matrix as ϕ (for the second term). By Gibbs' inequality:

D(g||ϕ) = −h(g) + h(ϕ) ≥ 0 ⇒ h(g) ≤ h(ϕ)

Hence, we have established that for a zero-mean random process with a known covariance structure, the maximum entropy is achieved by an equivalent zero-mean Normal/Gaussian random process with the same covariance.
16 / 16
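To close, a one-dimensional sketch of the maximum-entropy statement (the comparison densities are my own choices): among zero-mean densities with a fixed variance σ², the Gaussian attains ½ log2(2πeσ²), while the uniform and Laplace densities with the same variance have strictly smaller differential entropy.

    # Sketch: differential entropies (bits) of zero-mean, unit-variance densities.
    import numpy as np

    sigma2 = 1.0
    h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)   # ~2.047 bits

    # uniform on [-a, a] with variance a^2/3 = sigma2  =>  a = sqrt(3*sigma2)
    a = np.sqrt(3 * sigma2)
    h_uniform = np.log2(2 * a)                           # log2 of the support length

    # Laplace with variance 2b^2 = sigma2  =>  b = sqrt(sigma2/2); h = log2(2*b*e)
    b = np.sqrt(sigma2 / 2)
    h_laplace = np.log2(2 * b * np.e)

    print(h_gauss, h_uniform, h_laplace)   # the Gaussian is the largest, as claimed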
