
EE/Stats 376A: Information Theory Winter 2017

Lecture 14 — February 28
Lecturer: David Tse Scribe: Sagnik M, Vivek B

14.1 Outline
• Gaussian channel and capacity
• Information measures for continuous random variables

14.2 Recap
So far, we have focused only on communication channels with a discrete alphabet. For
instance, the binary erasure channel (BEC) is a good model for links and routes in a network
where packets of data are either transferred correctly or lost entirely. The binary symmetric
channel (BSC), on the other hand, is quite an oversimplification of the errors in a physical
channel and is therefore not a very realistic model.

14.3 Gaussian Channel


In reality, a communication channel is analog, motivating the study of continuous-alphabet
communication channels. The physical layer of a communication channel has additive noise
arising from a variety of sources. By the central limit theorem, the cumulative effect of a large
number of small random effects is approximately normal, so modelling the channel noise as
a Gaussian random variable is natural and valid in a large number of situations.
Mathematically, the Gaussian channel is a continuous-alphabet, time-discrete channel
where the output $Y_i$ is the sum of the input $X_i$ and independent noise
$Z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$:
$$Y_i = X_i + Z_i.$$
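As a concrete illustration (not part of the original notes), here is a minimal simulation sketch of one use of the channel; the codeword values and the noise variance sigma2 below are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                                                 # noise variance (illustrative)

X = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0, -1.0])   # an arbitrary input codeword
Z = rng.normal(0.0, np.sqrt(sigma2), size=X.shape)           # i.i.d. noise, Z_i ~ N(0, sigma^2)
Y = X + Z                                                    # channel output, Y_i = X_i + Z_i
print(Y)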

14.3.1 Power Constraint


If we analyze the Gaussian channel like a discrete alphabet channel, the natural approach
would be to try to find its capacity. However, if the input is unconstrained, we can choose an
infinite subset of inputs arbitrarily far apart, so that they are distinguishable at the output
with arbitrarily small probability of error. Such a scheme would then have infinite capacity.
Here we will impose an average power constraint on the input $X$: any codeword
$(X_1, \cdots, X_n)$ transmitted over the channel must satisfy
$$\mathbb{E}\left[\sum_{i=1}^{n} X_i^2\right] \leq nP. \tag{14.1}$$
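As a quick sanity check on constraint (14.1), here is a sketch with illustrative values of n and P: drawing the $X_i$ i.i.d. from $\mathcal{N}(0, P)$ gives $\mathbb{E}[\sum_i X_i^2] = nP$, so such a random codeword meets the constraint on average.

import numpy as np

rng = np.random.default_rng(1)
n, P = 100_000, 4.0                          # blocklength and power budget (illustrative)

X = rng.normal(0.0, np.sqrt(P), size=n)      # codeword with X_i ~ N(0, P)
print(np.sum(X**2) / n)                      # empirical power per symbol, close to P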


14.3.2 Capacity
Theorem 1. In a Gaussian channel with average power constrained by $P$ and noise distributed as $\mathcal{N}(0, \sigma^2)$, the channel capacity $C$ is given by $\frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right)$.

Theorem 1 is a cornerstone result of information theory, and we will devote this section
to analyzing the sphere-packing structure of the Gaussian channel and providing an intuitive
explanation of the theorem above. In the next lecture we will prove the theorem rigorously.
The sphere-packing argument for the Gaussian channel is characterized by the noise and
the output spheres. In particular, the ratio of the volume of the output sphere to the volume
of an 'average' noise sphere gives us an upper bound on the codebook size, $2^{nR}$.

Noise Spheres: Consider any input codeword $X$ of length $n$ and the received vector $Y$,
where $Y_i = X_i + Z_i$, $Z_i \sim \mathcal{N}(0, \sigma^2)$. Then, the radius of the noise sphere is the distance
between the input vector $X$ and the received vector $Y$, equal to
$\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} = \sqrt{\sum_{i=1}^{n} Z_i^2}$.
Now, $\mathbb{E}[Z_i^2] = \mathrm{Var}(Z_i) + \mathbb{E}[Z_i]^2 = \sigma^2$, and since the $Z_i$'s are i.i.d., by the Weak Law of Large
Numbers $\sum_{i=1}^{n} Z_i^2 \approx n\sigma^2$. Thus the 'average' radius of a noise sphere is approximately $\sqrt{n\sigma^2}$.
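This concentration claim is easy to check numerically; the sketch below (with illustrative n and sigma2) verifies that the sum of squares lands near $n\sigma^2$, so the noise vector has length about $\sqrt{n\sigma^2}$.

import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 10_000, 2.0                          # illustrative values

Z = rng.normal(0.0, np.sqrt(sigma2), size=n)     # i.i.d. noise samples
print(np.sum(Z**2), n * sigma2)                  # sum of squares concentrates near n*sigma^2
print(np.linalg.norm(Z), np.sqrt(n * sigma2))    # noise-sphere radius is about sqrt(n*sigma^2)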

Output Sphere: Because of the power constraint $P$, the input sphere has radius at most
$\sqrt{nP}$. The noise expands the input sphere into the output sphere by adding energy $n\sigma^2$: since the
output vectors have energy no greater than $nP + n\sigma^2$, they lie in a sphere of radius $\sqrt{n(P + \sigma^2)}$.

The volume of an $n$-dimensional sphere is given by $C_n R^n$, where $R$ is its radius and $C_n$ is
a constant depending only on $n$. Therefore, we have
$$\text{Volume of Output Sphere} = C_n \left(n(P + \sigma^2)\right)^{n/2},$$
$$\text{Volume of Noise Sphere} = C_n \left(n\sigma^2\right)^{n/2}.$$
Therefore, the number of non-intersecting noise spheres in the output sphere is at most
$$\frac{C_n \left(n(P + \sigma^2)\right)^{n/2}}{C_n \left(n\sigma^2\right)^{n/2}} = \left(1 + \frac{P}{\sigma^2}\right)^{n/2}.$$
For decoding with a low probability of error,
$$2^{nR} \leq \left(1 + \frac{P}{\sigma^2}\right)^{n/2} \implies R \leq \frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right).$$
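To get a feel for the numbers, here is a short sketch (my addition, with arbitrary SNR values) evaluating the sphere-packing rate bound $\frac{1}{2}\log_2(1 + P/\sigma^2)$, interpreted in bits per channel use.

import numpy as np

for snr in [0.1, 1.0, 10.0, 100.0]:                  # snr = P / sigma^2 (illustrative values)
    R_max = 0.5 * np.log2(1.0 + snr)                 # sphere-packing bound on the rate, in bits
    print(f"P/sigma^2 = {snr:6.1f}  ->  R <= {R_max:.3f} bits per channel use")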

Similar to the case of a discrete alphabet, we expect the expression $\frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right)$ to be the
solution of a mutual information maximization problem with power constraint (14.1):
$$\max_{f_X} \; I(X; Y) \quad \text{s.t.} \quad \mathbb{E}[X^2] \leq P, \tag{14.2}$$
where $I(X; Y)$ is some notion of mutual information between continuous random variables
$X$ and $Y$. A rigorous proof of the result necessitates the introduction of the corresponding
information measures (mutual information, KL-divergence, entropy, etc.) for continuous
random variables, which will be further used to solve the optimization problem (14.2) in the
next lecture.
Definition 1. The quantity $\frac{P}{\sigma^2}$ is the signal-to-noise ratio.

14.4 Information Measures for Continuous RVs


14.4.1 Mutual Information
For discrete random variables $X, Y \sim p$, the mutual information $I(X; Y)$ equals $\mathbb{E}\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right]$.
The probability mass function for discrete RVs is akin to the density function for continuous
ones. This motivates the following definition.
Definition 2. Mutual information for continuous random variables $X, Y \sim f$ is given by
$$I(X; Y) \triangleq \mathbb{E}\left[\log \frac{f(X,Y)}{f(X)f(Y)}\right].$$

[Figure 14.1: A typical flow-chart for a communication channel: input X → Digital To Analog → Physical Channel → Analog To Digital → output Y.]

Now, we will show that Definition 2 is sensible by proving that it is the limit of the mutual
information of discretized versions of $X, Y$. Since $X$ and $Y$ are the limits of arbitrarily fine
discretizations, this will show that the definition is consistent with our previous definitions.
For $\Delta > 0$, define
$$X^\Delta = i\Delta \quad \text{if } i\Delta \leq X < (i+1)\Delta,$$
$$Y^\Delta = i\Delta \quad \text{if } i\Delta \leq Y < (i+1)\Delta.$$
Now, for small $\Delta$, we have $p(X^\Delta) \approx \Delta f(X)$, $p(Y^\Delta) \approx \Delta f(Y)$, and $p(X^\Delta, Y^\Delta) \approx \Delta^2 f(X, Y)$,
since $f$ is the probability density function. Then,


$$I(X^\Delta, Y^\Delta) = \mathbb{E}\left[\log \frac{p(X^\Delta, Y^\Delta)}{p(X^\Delta)p(Y^\Delta)}\right] \approx \mathbb{E}\left[\log \frac{f(X,Y)\,\Delta^2}{(f(X)\Delta)(f(Y)\Delta)}\right] = \mathbb{E}\left[\log \frac{f(X,Y)}{f(X)f(Y)}\right] = I(X; Y)$$
$$\implies \lim_{\Delta \to 0} I(X^\Delta, Y^\Delta) = I(X; Y).$$

From the above equations we see that as $\Delta$ becomes arbitrarily small, $I(X^\Delta, Y^\Delta)$ approximates
$I(X; Y)$ to arbitrary precision. Therefore, the definition of mutual information for
continuous random variables is consistent with that for discrete ones.
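As a numerical illustration of Definition 2 (not from the notes), the sketch below estimates $I(X;Y)$ by Monte Carlo for a jointly Gaussian pair with correlation rho, a case whose closed form $-\frac{1}{2}\log(1-\rho^2)$ (in nats) is known, so the estimate can be compared against it. The correlation and sample size are arbitrary choices.

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
rho, n = 0.8, 200_000                          # correlation and sample size (illustrative)
cov = [[1.0, rho], [rho, 1.0]]

XY = rng.multivariate_normal([0.0, 0.0], cov, size=n)
X, Y = XY[:, 0], XY[:, 1]

# Monte Carlo average of log f(X,Y) - log f(X) - log f(Y), i.e. Definition 2, in nats.
log_ratio = (multivariate_normal([0.0, 0.0], cov).logpdf(XY)
             - norm.logpdf(X) - norm.logpdf(Y))
print(np.mean(log_ratio))                      # estimated I(X; Y)
print(-0.5 * np.log(1.0 - rho**2))             # closed form, about 0.511 nats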


14.4.2 Differential Entropy


The definition of mutual information immediately gives us
$$I(X; Y) = \mathbb{E}\left[\log \frac{f(X,Y)}{f(X)f(Y)}\right] = \mathbb{E}\left[\log \frac{f(Y|X)}{f(Y)}\right] = \mathbb{E}\left[\log \frac{1}{f(Y)}\right] - \mathbb{E}\left[\log \frac{1}{f(Y|X)}\right].$$
Akin to the discrete case, we would like to define the entropy of $Y$ to be $\mathbb{E}[-\log f(Y)]$ and the
conditional entropy of $Y$ given $X$ to be $\mathbb{E}[-\log f(Y|X)]$. However, the quantities $f(Y)$ and
$f(Y|X)$ are not dimensionless. Furthermore, the entropy of a continuous random variable
in its purest sense doesn’t exist, since the number of random bits necessary to specify a real
number to an arbitrary precision is infinite. It would however be very convenient to have
a measure for continuous random variables, similar to entropy. This prompts the following
definitions.
Definition 3. For a continuous random variable $Y \sim f$, the differential entropy is defined as
$$h(Y) \triangleq \mathbb{E}\left[\log \frac{1}{f(Y)}\right]$$
Definition 4. For continuous random variables $X, Y \sim f$, the differential conditional entropy of $Y$ conditioned on $X$ is defined as
$$h(Y|X) \triangleq \mathbb{E}\left[\log \frac{1}{f(Y|X)}\right]$$

Now, of course there are similarities and dissimilarities between entropy and differential
entropy. We explore some of these in detail:
1. H(X) ≥ 0 for any discrete random variable X. However, h(X) need not be non-
negative for every continuous random variable. This is because a probability mass
function is always at most 1 but a density function can be arbitrarily large.
2. $H(X)$ is label-invariant. However, $h(X)$ need not be label-invariant. The simplest
change of labels, say $Y = aX$ for a nonzero scalar $a$, proves this. Indeed, the density function
$f_Y$ is given by $f_Y(y) = \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right)$, and so
$$h(Y) = \mathbb{E}\left[\log \frac{1}{f_Y(Y)}\right] = \mathbb{E}\left[\log \frac{|a|}{f_X(Y/a)}\right] = \log|a| + \mathbb{E}\left[\log \frac{1}{f_X(X)}\right] = h(X) + \log|a|.$$
A quick numerical check of this scaling identity is sketched below.
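Here is a small sketch (my addition) checking $h(aX) = h(X) + \log|a|$ by simulation for $X \sim \mathcal{N}(0,1)$ and an arbitrary scalar $a = 3$, estimating both differential entropies as sample averages of $-\log f$ (in nats).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
a, n = 3.0, 500_000                            # scale factor and sample size (illustrative)

X = rng.standard_normal(n)
Y = a * X                                      # Y = aX, so Y ~ N(0, a^2)

h_X = np.mean(-norm.logpdf(X))                 # Monte Carlo estimate of h(X) = E[-log f_X(X)]
h_Y = np.mean(-norm.logpdf(Y, scale=a))        # Monte Carlo estimate of h(Y) = E[-log f_Y(Y)]
print(h_Y - h_X, np.log(a))                    # both are about log|a| ~ 1.099 nats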

14.4.3 Differential Entropies of common distributions


We finish these lecture notes by computing the differential entropies of random variables
with some common distributions.
1. Uniform Distribution. Suppose $X \sim \mathrm{Unif}([0, a])$. Then, $f(x) = \frac{1}{a}$ for every $x \in [0, a]$,
and $\log \frac{1}{f(x)} = \log a$. Then,
$$h(X) = \mathbb{E}\left[\log \frac{1}{f(X)}\right] = \mathbb{E}\left[\log \frac{1}{1/a}\right] = \log a.$$
(A numerical check of this and the next example is sketched after item 2 below.)


2. Normal Distribution. Suppose $X \sim \mathcal{N}(0, 1)$. Then, $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ for every $x$, and
$\log_2 \frac{1}{f(x)} = \log_2 \sqrt{2\pi} + \frac{x^2}{2}\log_2 e$. The differential entropy of $X$ is
$$h(X) = \mathbb{E}\left[\log_2 \frac{1}{f(X)}\right] = \mathbb{E}\left[\log_2 \sqrt{2\pi} + \frac{X^2}{2}\log_2 e\right]$$
$$\implies h(X) = \frac{1}{2}\log_2 2\pi + \frac{\log_2 e}{2}\left(\mathrm{Var}(X) + \mathbb{E}[X]^2\right) = \frac{1}{2}\log_2 2\pi e.$$
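As a numerical aside (not from the notes), the sketch below checks both examples: for the uniform case it evaluates $\log_2 a$ for a few illustrative interval lengths, showing that $h(X)$ is negative once $a < 1$ (point 1 of Section 14.4.2), and for the standard normal it estimates $h(X)$ by a Monte Carlo average of $-\log_2 f(X)$, which should be close to $\frac{1}{2}\log_2 2\pi e \approx 2.05$ bits.

import numpy as np
from scipy.stats import norm

# Uniform([0, a]): h(X) = log2(a) bits, negative for a < 1.
for a in [8.0, 1.0, 0.5]:                      # illustrative interval lengths
    print(f"Unif([0,{a}]): h = {np.log2(a):+.3f} bits")

# Standard normal: estimate h(X) = E[-log2 f(X)] by Monte Carlo.
rng = np.random.default_rng(5)
X = rng.standard_normal(500_000)
h_mc = np.mean(-norm.logpdf(X)) / np.log(2)    # convert nats to bits
print(h_mc, 0.5 * np.log2(2 * np.pi * np.e))   # both about 2.047 bits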
