
digital data transmission — 2023/2024

noise and mutual information

1 introduction
We saw in Figure 1.2 of the lecture notes about modulations that we may perfectly recover the symbols X1, . . . , Xn from a signal X(t) by means of the inner products or using the rather more convenient matched-filter implementations. Since the bit-to-symbol mapper is a one-to-one mapping between a sequence of k bits and a sequence of n symbols, the transmitted bits D1, . . . , Dk may also be perfectly recovered from the symbols X1, . . . , Xn, hence giving error-free communication at a rate of R = k/n bits per channel use. In practice, however, we will have Y(t) ≠ X(t) in a noisy communication channel. A new signal processing block, called the detector, is required to estimate the transmitted bits from the received symbols, as shown in Figure 1.1 below.

Figure 1.1: Demodulation and detection process: the received signal Y(t) is passed through the demodulator to obtain the symbols Y1, . . . , Yn, from which the detector produces the bit estimates D̂1, . . . , D̂k.

Over the past decades, and through the sister concepts of entropy and channel capacity, which respectively characterize the best solutions to the dual problems of source and channel coding, information and communication theory have successfully guided engineers and computer scientists in the design and implementation of ever more efficient and reliable communication systems. In this part of the course, we focus on the channel coding and decoding parts and characterize the condition for reliable communication. Here reliable means that we are able to ensure a vanishingly small probability of erroneous message detection at the receiving end of the communication link, a promise which itself arguably ranks among the most surprising and unexpected scientific achievements of the past century.

For a detector g(·) expressed as the (deterministic or random) function (D̂1, . . . , D̂k) = g(Y1, . . . , Yn), the reliability of a digital communication system is measured in terms of the error probability, given by

Pe = Pr[ g(Y1, . . . , Yn) ≠ (D1, . . . , Dk) ].    (1.1)

That is, Pe is the probability of wrongly estimating the sequence of information bits, rather than the probability of wrongly estimating the actual transmitted (coded) bits or symbols. The difference between the two stems from the fact that we are explicitly considering channel coding. This will become clear in the following sections.

2 information rates
An important result from information theory, which we will make no attempt to prove in this course, is that, for a given channel W and input distribution Q, there exist codes of rate R bits per channel use whose error probability in (1.1) satisfies Pe → 0 as n → ∞ if and only if the rate is below the mutual information, that is R < I(Q; W), where I(Q; W) is the mutual information between the distribution Q of the channel input and the channel law W. Maximizing over all input distributions yields the celebrated channel capacity, C = maxQ I(Q; W).

2.1 Transmission of a repetition code over a bit-flipping channel

We start by sending 1 bit over a channel that can flip the bit value with some probability p. In this case, the alphabet
of X and Y is X = {0, 1} = Y and the channel transition probability W(y|x) is given by
W(y|x) = p if y ≠ x,  and  W(y|x) = 1 − p if y = x.    (2.1)

If we send one bit, that is n = 1 channel use, and let X1 = B1, the probability of detection error Pe is given by

Pe = Pr[ B̂1 ≠ B1 ] = P(0)W(1|0) + P(1)W(0|1) = (P(0) + P(1)) p = p,    (2.2)

where P(b) = Pr[B1 = b] and we made the reasonable assumption that B̂1 = Y1. The result does not depend on the values of P(0) and P(1).

We can reduce this probability by introducing redundancy and sending the bit twice. As before, we have B1 , but
now the transmitted sequence is (X1 , X2 ) = (B1 , B1 ). The set of transmitted sequences is {00, 11}, while the set
of received sequences is {00, 01, 10, 11}. At the decoder, we guess the transmitted bit according to a majority rule,
breaking ties with a random decision. Let B1 = 0, that is, a bit of value 0 is transmitted. Then (X1, X2) = (0, 0), and the possible outcomes (Y1, Y2), the corresponding decisions B̂1, the error probability Pe(y1, y2|B1 = 0) for each outcome, and its average over the outcomes, Pe(B1 = 0), are summarized in the following table:

(y1, y2)   Probability ∏_{i=1}^{2} W(yi|0)   Decision B̂1      Pe(y1, y2|B1 = 0)
00         (1 − p)²                           0                 0
01         p(1 − p)                           0/1 (random)      1/2
10         p(1 − p)                           0/1 (random)      1/2
11         p²                                 1                 1
                                               Pe(B1 = 0) = p(1 − p) + p² = p

As before Pe = Pe (B1 = 0) = Pe (B1 = 1) and this probability does not depend on P(B1 = 0), so not much seems
to have been gained in terms of error probability. What happens if we repeat the bit B1 = 0 three times? Taking
into account that X1 = X2 = X3 = 0, the outcomes and probabilities are summarized again in the following table:

(y1, y2, y3)   Probability ∏_{i=1}^{3} W(yi|0)   Decision B̂1   Pe(y1, y2, y3|B1 = 0)
000            (1 − p)³                           0              0
001            p(1 − p)²                          0              0
010            p(1 − p)²                          0              0
011            p²(1 − p)                          1              1
100            p(1 − p)²                          0              0
101            p²(1 − p)                          1              1
110            p²(1 − p)                          1              1
111            p³                                 1              1
                                                  Pe(B1 = 0) = 3p²(1 − p) + p³ = p²(3 − 2p)

We can verify that Pe = p²(3 − 2p) < p as long as p < 1/2.


Problem 2.1. What happens if p = 1/2? Can the decoder be simplified in this case? And if p > 1/2? Can the decoder be improved in this case? Can you justify the reason for this improvement in terms of Maximum Likelihood decoding?

If we repeat the bit n times, the error probability can be evaluated as
Pe = (1/2) C(n, n/2) p^(n/2) (1 − p)^(n/2) + Σ_{k=n/2+1}^{n} C(n, k) p^k (1 − p)^(n−k)    (n even),
Pe = Σ_{k=⌈n/2⌉}^{n} C(n, k) p^k (1 − p)^(n−k)    (n odd),

where C(n, k) = n!/(k!(n − k)!) denotes the binomial coefficient.

Now, not only is this error probability lower than p as long as p < 1/2, but it tends to 0 as n → ∞. However, we are using n channel bits to send a single information bit, so in some sense each channel bit carries only 1/n bits of information. This quantity, the number of information bits carried in each channel bit, is known generally as the rate and is denoted by R. This repetition code therefore achieves a vanishing error probability, Pe → 0 as n → ∞, but R → 0 as well :-(
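As a quick numerical illustration of this trade-off, the short sketch below evaluates the error-probability expression above for a few repetition lengths n; the function name repetition_error_probability and the chosen values of p and n are our own and not part of the notes.

from math import comb, ceil

def repetition_error_probability(n: int, p: float) -> float:
    """Error probability of the n-fold repetition code over a bit-flipping channel
    with crossover probability p, under majority decoding with random tie-breaking
    (the expression derived above)."""
    if n % 2 == 0:
        tie = 0.5 * comb(n, n // 2) * p ** (n // 2) * (1 - p) ** (n // 2)
        tail = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))
        return tie + tail
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(ceil(n / 2), n + 1))

p = 0.11
for n in (1, 3, 5, 11, 101):
    print(f"n = {n:3d}  rate R = 1/n = {1 / n:.3f}  Pe = {repetition_error_probability(n, p):.3e}")

The error probability indeed vanishes as n grows, but only because the rate R = 1/n vanishes with it.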

One key finding of information theory is that it is possible to have a vanishing error probability while the rate remains
strictly positive. This leads to the concept of channel capacity, namely the largest R for which there exist codes with
vanishing error probability. In the information theory course, we saw that the capacity for the bit-flipping channel
is 1 − H2 (p), where H2 (p) = −p log2 p − (1 − p) log2 (1 − p) is the binary entropy function. For p = 0.11,
the capacity is about 0.5 bits, so in every transmitted bit we may send approximately half an information bit. There
are many different channel models, each with a possibly different capacity. For instance, the channel might erase
the bit with probability p, with erasures detected at the receiver. Or the input might be a real or complex number,
maybe constrained in its average value or its average energy, and sent over a channel that adds some noise, e. g. white
Gaussian noise. We study these models in detail in the following sections.
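As a sanity check of the value quoted above, here is a minimal computation of 1 − H2(p) at p = 0.11 (our own snippet, not part of the notes):

from math import log2

def h2(p: float) -> float:
    """Binary entropy function H2(p) in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(1 - h2(0.11))  # ≈ 0.500 bits per channel use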

3 mutual information for discrete and continuous channels


In general, a discrete-time memoryless channel is determined by its input x, its output y, and the transition probability W(y|x). For a discrete channel with transition probability W(y|x) and a given input distribution Q(x), the mutual information I(Q; W), measured in bits per channel use, is given by

I(Q; W) = Σ_{x∈X, y∈Y} Q(x)W(y|x) log2 [ W(y|x) / Σ_{x̄∈X} Q(x̄)W(y|x̄) ].    (3.1)


Sometimes, we may write the mutual information in terms of the output distribution P(y) = Σ_{x̄∈X} Q(x̄)W(y|x̄) as

I(Q; W) = Σ_{x∈X, y∈Y} Q(x)W(y|x) log2 [ W(y|x) / P(y) ].    (3.2)
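A direct numerical evaluation of (3.1) is often handy for checking hand calculations. The following is a minimal sketch under our own naming and example choices (the function mutual_information and the p = 0.11 bit-flipping example are not part of the notes); with a uniform input it reproduces the capacity value of about 0.5 bits quoted in Section 2.

import numpy as np

def mutual_information(Q: np.ndarray, W: np.ndarray) -> float:
    """I(Q; W) in bits per channel use for a discrete memoryless channel.

    Q is a length-|X| vector of input probabilities and W is a |X| x |Y| matrix
    with entries W[x, y] = W(y|x), so that the expression matches Eq. (3.1)."""
    P = Q @ W                               # output distribution P(y)
    joint = Q[:, None] * W                  # joint distribution Q(x) W(y|x)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(W / P[None, :])
    return float(np.nansum(terms))          # terms with Q(x) W(y|x) = 0 contribute 0

# Bit-flipping channel with p = 0.11 and uniform input: I(Q; W) ≈ 0.50 bits
p = 0.11
W_bsc = np.array([[1 - p, p], [p, 1 - p]])
print(mutual_information(np.array([0.5, 0.5]), W_bsc))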
Problem 3.1. The mutual information in (3.2) can be written as the expectation E[ı(X, Y)] of some function ı(x, y) of the pair (x, y), viewed as a random variable with some joint distribution R(x, y). Write down the form of this function ı(x, y) and the expression of the joint distribution R(x, y).

Problem 3.2. In analogy to the entropy H(P) of a distribution P(y) over an alphabet Y, defined with the well-known formula H(P) = −Σ_{y∈Y} P(y) log2 P(y), for a given x we write the entropy H(Wx) of the channel distribution as H(Wx) = −Σ_{y∈Y} W(y|x) log2 W(y|x). Since this entropy actually depends on x, we may compute its average value under the input distribution Q(x), that is EQ[H(Wx)] = Σ_{x∈X} Q(x)H(Wx). Show that the mutual information in (3.2) can be written as I(Q; W) = H(P) − EQ[H(Wx)].

For continuous alphabets, e.g. R or C, the definitions we have made for the mutual information still hold, with summations replaced by definite integrals and probability distributions by density functions as needed. For instance, if both x and y are continuous-valued complex numbers, we would have

I(Q; W) = ∫∫_{x∈C, y∈C} Q(x)W(y|x) log2 [ W(y|x) / ∫_{x̄∈C} Q(x̄)W(y|x̄) dx̄ ] dy dx.    (3.3)

If the input or output is limited in value to some subset of the real or complex numbers, the corresponding integration is then done on that subset only, as the density function is zero outside it.

In information theory one proves the existence of codes of rate R, made of codewords of n symbols drawn from a probability distribution Q(x), whose error probability Pe can be made arbitrarily small as n increases, as long as the rate R is smaller than the mutual information I(Q; W) of the given distribution Q. This is known as the channel coding theorem. It stands to reason to consider the largest mutual information over all possible probability distributions Q(x), namely the channel capacity given by

C(W) = max_Q I(Q; W).    (3.4)

While a codeword has length n and the error probability obviously depends on n, the mutual information is a quantity that does not depend on the blocklength. The performance of the best possible codes is determined from the one-symbol input and channel transition distributions Q(x) and W(y|x), respectively.
Problem 3.3. Consider the binary bit-flipping channel with transition probability W(y|x) = p if y ≠ x and 1 − p otherwise. For an input distribution Q with Q(0) = q and Q(1) = 1 − q, prove that the mutual information is given by I(Q; W) = H2(q(1 − p) + (1 − q)p) − H2(p), where H2(ε) = −ε log2 ε − (1 − ε) log2(1 − ε) is the binary entropy function, and that the channel capacity is given by 1 − H2(p).
Problem 3.4. Let us consider a bit-erasing channel with input alphabet X = {0, 1}, output alphabet Y = {0, 1, ∗}, and transition probability W(∗|x) = p and W(x|x) = 1 − p for both values of x. For an input distribution with Q(0) = q, prove that the mutual information is given by I(Q; W) = (1 − p)H2(q) and that the channel capacity is given by 1 − p.
Problem 3.5. Consider the Z channel, which transmits a zero unchanged and flips a one with probability p:


W(0|0) = 1,   W(1|0) = 0,   W(0|1) = p,   W(1|1) = 1 − p.    (3.5)

For an input distribution with Q(0) = q, the mutual information I(Q; W) is given by

I(Q; W) = q log2 [ 1 / (q + (1 − q)p) ] + (1 − q)p log2 [ p / (q + (1 − q)p) ] + (1 − q)(1 − p) log2 [ 1 / (1 − q) ],    (3.6)

where the three terms correspond to the pairs (x = 0, y = 0), (x = 1, y = 0) and (x = 1, y = 1), respectively.

Expanding the logarithm of fractions into differences of logarithms and doing some algebraic manipulations yields

I(Q; W) = −(1 − (1 − q)(1 − p)) log2(1 − (1 − q)(1 − p)) + (1 − q)p log2 p − (1 − q)(1 − p) log2(1 − q)    (3.7)
        = −(1 − (1 − q)(1 − p)) log2(1 − (1 − q)(1 − p)) + (1 − q)p log2 p − (1 − q)(1 − p) log2(1 − q)
          + (1 − q)(1 − p) log2(1 − p) − (1 − q)(1 − p) log2(1 − p)    (3.8)
        = −(1 − (1 − q)(1 − p)) log2(1 − (1 − q)(1 − p)) − (1 − q)(1 − p) log2((1 − q)(1 − p))
          + (1 − q)(1 − p) log2(1 − p) + (1 − q)p log2 p    (3.9)
        = H2((1 − q)(1 − p)) − (1 − q)H2(p),    (3.10)

where in (3.7) we rewrote q + (1 − q)p as 1 − (1 − p)(1 − q), in (3.8) we added and subtracted a common term
(1 − q)(1 − p) log2 (1 − p), in (3.9) we recombined the factors in a convenient form, and in (3.10) we identified this
convenient form as a linear combination of two binary entropies.

We now optimize the value of q for given p. Taking the derivative with respect to q and setting it to zero, we successively obtain

log2 [ (1 − (1 − q)(1 − p)) / ((1 − q)(1 − p)) ] = H2(p)/(1 − p)    (3.11)
(1 − q)(1 − p) = 1 / (1 + 2^(H2(p)/(1−p)))    (3.12)
q* = 1 − 1 / [ (1 − p)(1 + 2^(H2(p)/(1−p))) ].    (3.13)

Interestingly, we have lim_{p→0} q* = 1/2 and lim_{p→1} q* = 1 − 1/e, so even in the limit where the output y = 1 is very rarely reached and the capacity vanishes, it pays to use both inputs.
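The closed form (3.13) is easy to check numerically. The sketch below, with our own function and variable names, maximizes I(Q; W) = H2((1 − q)(1 − p)) − (1 − q)H2(p) over a fine grid of q and compares the maximizer with q* from (3.13); the limiting values 1/2 and 1 − 1/e show up at the extremes of p.

import numpy as np

def h2(x):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    m = (x > 0) & (x < 1)
    out[m] = -x[m] * np.log2(x[m]) - (1 - x[m]) * np.log2(1 - x[m])
    return out

def z_channel_mi(q, p):
    """Mutual information (3.10) of the Z channel for Q(0) = q."""
    q = np.asarray(q, dtype=float)
    return h2((1 - q) * (1 - p)) - (1 - q) * h2(np.full_like(q, p))

def q_star(p):
    """Optimal input probability Q(0) from Eq. (3.13)."""
    return 1 - 1 / ((1 - p) * (1 + 2 ** (h2(np.array([p]))[0] / (1 - p))))

qs = np.linspace(1e-6, 1 - 1e-6, 100_001)
for p in (0.01, 0.3, 0.7, 0.99):
    q_grid = qs[np.argmax(z_channel_mi(qs, p))]      # brute-force maximizer
    print(f"p = {p:4.2f}  grid argmax = {q_grid:.4f}  q* from (3.13) = {q_star(p):.4f}")
print("limits:", 0.5, 1 - 1 / np.e)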

4 capacity of the additive white gaussian noise (awgn) channel


4.1 Discrete-time AWGN channel

We now move to the discrete-time Gaussian-noise channel. We saw earlier that, for a perfect channel, we may
perfectly recover the symbols X1 , . . . , Xn from a signal X(t) by means of the inner products. The additive white
Gaussian noise (AWGN) channel is a channel whose output Y(t) is related to the input X(t) as

Y(t) = X(t) + Z(t), (4.1)

where Z(t) is additive zero-mean white Gaussian noise. For our purposes, a Gaussian noise Z(t) is a random process such that, for a given orthonormal set of waveforms {ϕ1, . . . , ϕn} of duration T and bandwidth B, the coefficients of Z(t) in the signal space, given by the inner products Zi = ⟨Z, ϕi⟩, are i.i.d. complex-valued Gaussian random variables with zero mean and variance σ² = BTN0, where N0 is called the power spectral density of the white noise. As a result, the joint probability distribution of Z1, . . . , Zn is given by


G(z1, . . . , zn) = ∏_{i=1}^{n} G(zi),    (4.2)

where G(z) is the probability density function of a complex Gaussian random variable given by

G(z) = (1/(πσ²)) e^(−|z|²/σ²).    (4.3)
Remember that |z|² denotes the squared modulus of the complex number z, i.e. |z|² = ℜ(z)² + ℑ(z)². We often use
the notation Z ∼ CN (0, σ2 ) to represent the fact that Z is a complex Gaussian random variable with zero-mean and
total variance σ2 , that is, a complex random variable whose real and imaginary parts are independent real-valued
Gaussian with zero mean and variance σ2 /2. As a result, after representing X(t), Y(t) and Z(t) in an orthonormal
basis {ϕ1 , . . . , ϕn } using the respective coefficients (X1 , . . . , Xn ), (Y1 , . . . , Yn ) and (Z1 , . . . , Zn ), we obtain the
discrete-time AWGN channel given by
Yi = Xi + Zi,   i = 1, . . . , n.    (4.4)

Since W(y|x) = G(y − x), such a channel is memoryless, with joint conditional probability
W^n(y1, . . . , yn | x1, . . . , xn) = ∏_{i=1}^{n} W(yi|xi) = ∏_{i=1}^{n} (1/(πσ²)) e^(−|yi−xi|²/σ²) = (1/(πσ²)^n) e^(−(1/σ²) Σ_{i=1}^{n} |yi−xi|²).    (4.5)
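As a minimal illustration of this channel model, the snippet below (our own, with an arbitrary QPSK-like constellation and noise variance) samples Yi = Xi + Zi with Zi ∼ CN(0, σ²) and checks that the real and imaginary noise components each have variance σ²/2, as stated above.

import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5                                          # assumed noise variance
n = 100_000
# QPSK-like symbols with unit average energy (an arbitrary example constellation)
x = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2), size=n)
z = np.sqrt(sigma2 / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
y = x + z                                             # discrete-time AWGN channel (4.4)
print(np.var(z.real), np.var(z.imag), np.mean(np.abs(z) ** 2))   # ≈ σ²/2, σ²/2, σ²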

To assess the quality of the AWGN channel we define the average signal-to-noise ratio (SNR) as

SNR = Es/σ²,    (4.6)

where σ² is the white Gaussian noise variance and Es is the average symbol energy of the constellation, given by

Es = Σ_{x∈X} Q(x)|x|².    (4.7)

Furthermore, we assume that there is a constraint on the symbols' energy such that

Σ_x Q(x)|x|² ≤ Es,    (4.8)

where the sum is understood as an integral when the input distribution Q is a density over C.

At this point, the input distribution is arbitrary and need not have its support limited to a specific constellation X. Indeed, information theory provides a tool to compare the information rate capabilities of various constellations and allows the system designer to pick the most efficient and/or convenient one for a given scenario.

For additive channels, Eq. (3.1) may be decomposed in a more convenient form. First, using that log(a · b) = log a + log b, we rewrite the integral form of Eq. (3.3) as the sum of two terms, I1 and I2, that is

I(Q; W) = ∫∫_{x∈C, y∈C} Q(x)W(y|x) log2 W(y|x) dy dx − ∫∫_{x∈C, y∈C} Q(x)W(y|x) log2 ( ∫_{x̄∈C} Q(x̄)W(y|x̄) dx̄ ) dy dx,    (4.9)

where I1 denotes the first double integral and I2 denotes the second term, including its leading minus sign.

Second, as the channel transition probability coincides with the noise density W(y = x + z|x) = W(z) and this
depends only on the noise z for given x, we can change the integration variables to (x, z) in the first integral:
I1 = ∫∫_{x∈C, y∈C} Q(x)W(y|x) log2 W(y|x) dy dx    (4.10)
   = ∫∫_{x∈C, z∈C} Q(x)W(z) log2 W(z) dz dx    (4.11)
   = ∫_{z∈C} W(z) log2 W(z) dz    (4.12)
   = −h(W),    (4.13)

where we noted that the integration over x is 1 and we defined the differential entropy of a distribution R(z) as

h(R) = −∫_z R(z) log2 R(z) dz.    (4.14)

Similarly, after defining a distribution P(y) over the channel output given by

P(y) = ∫_{x̄∈C} Q(x̄)W(y|x̄) dx̄,    (4.15)

with differential entropy h(P), we note that I2 is actually the differential entropy of P, I2 = h(P), and we may
rewrite Eq. (4.9) for additive channels where the noise is independent of the input as the sum of these two terms,

I(Q; W) = h(P) − h(W). (4.16)

Now, we study the two terms in Eq. (4.16) separately. The second one, h(W), is easy to evaluate as

h(W) = −∫_{z∈C} W(z) log2 [ (1/(πσ²)) e^(−|z|²/σ²) ] dz    (4.17)
     = ∫_{z∈C} W(z) log2(πσ²) dz + ∫_{z∈C} W(z) (|z|²/σ²) log2(e) dz.    (4.18)
Now, carrying out the integrals and combining the results, we obtain

h(W) = log2(πeσ²).    (4.19)
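A quick Monte Carlo sanity check of (4.19), with our own choice of σ²: the sample average of −log2 G(Z) over draws Z ∼ CN(0, σ²) should approach log2(πeσ²).

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0                                          # assumed noise variance
z = np.sqrt(sigma2 / 2) * (rng.standard_normal(1_000_000) + 1j * rng.standard_normal(1_000_000))
g = np.exp(-np.abs(z) ** 2 / sigma2) / (np.pi * sigma2)      # CN(0, σ²) density (4.3) at the samples
print(-np.mean(np.log2(g)), np.log2(np.pi * np.e * sigma2))  # Monte Carlo h(W) vs closed form (4.19)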

Concerning the differential entropy h(P), we choose for Q(x) the density of a zero-mean complex Gaussian random
variable with variance Es . As input x and noise z are independent, the density P(y) of the output y = x + z is that
of a zero-mean complex Gaussian random variable with variance Es + σ2 , and its differential entropy is given by

h(P) = log2 (πe(Es + σ2 )). (4.20)

Therefore, the mutual information is given by


I(Q; W) = log2(1 + Es/σ²).    (4.21)
It turns out that this choice of input density attains the largest possible mutual information when a cost constraint of the form ∫_{x∈C} Q(x)|x|² dx = Es, for some given positive-valued energy Es, is imposed. Therefore, the expression in (4.21) is the channel capacity of the discrete-time Gaussian channel.
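The decomposition I(Q; W) = h(P) − h(W), together with (4.19), also gives a practical recipe for estimating the mutual information achieved by a specific constellation over the AWGN channel and comparing it with the Gaussian-input capacity (4.21). The sketch below is our own illustration of that recipe, with an assumed QPSK constellation and SNR values that are not part of the notes; it estimates h(P) by Monte Carlo and subtracts h(W) = log2(πeσ²).

import numpy as np

rng = np.random.default_rng(0)

def awgn_constellation_mi(constellation, snr_db, n_samples=200_000):
    """Monte Carlo estimate of I(Q; W) = h(P) - h(W), in bits per channel use,
    for equiprobable constellation points over the complex AWGN channel."""
    x = np.asarray(constellation, dtype=complex)
    es = np.mean(np.abs(x) ** 2)                      # average symbol energy (4.7)
    sigma2 = es / 10 ** (snr_db / 10)                 # noise variance from SNR = Es / σ²
    xs = rng.choice(x, size=n_samples)                # equiprobable channel inputs
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal(n_samples) + 1j * rng.standard_normal(n_samples))
    y = xs + z                                        # channel outputs (4.4)
    # Output density P(y) = (1/M) sum_j G(y - x_j), with G the CN(0, σ²) density (4.3)
    g = np.exp(-np.abs(y[:, None] - x[None, :]) ** 2 / sigma2) / (np.pi * sigma2)
    h_p = -np.mean(np.log2(g.mean(axis=1)))           # Monte Carlo estimate of h(P)
    h_w = np.log2(np.pi * np.e * sigma2)              # noise entropy (4.19)
    return h_p - h_w

qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
for snr_db in (0, 5, 10, 20):
    gauss_cap = np.log2(1 + 10 ** (snr_db / 10))      # Gaussian-input capacity (4.21)
    print(f"SNR = {snr_db:2d} dB   QPSK MI ≈ {awgn_constellation_mi(qpsk, snr_db):.3f}   "
          f"log2(1 + SNR) = {gauss_cap:.3f}")

At low SNR the constellation estimate is close to log2(1 + SNR), while at high SNR it saturates at 2 bits per channel use, the entropy of the four-point constellation, anticipating the discussion of modulation choices in the next section.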

4.2 Bandlimited AWGN channel

So far, the rate has been measured in bits per channel use. If we measure instead the rate in bits per second, we
have to multiply the rate by the number of channel uses per second. From our study of the signal space and the
various orthonormal expansions used to represent time- and band-limited signals, we know that the number of
(complex-valued) channel uses is approximately BT , where B is the available bandwidth and T is the duration of
the transmission. Therefore, the number of channel uses per second is simply B and the rate in bits per second is
R (in bits/sec) = B log2(1 + E/σ²),    (4.22)

where we have explicitly written the noise variance σ². The theory of stochastic processes has a result that expresses σ² as σ² = BTN0, where N0 is a quantity known as the power spectral density of the white noise. If we now define the transmitted power as P = E/T, we have

R (in bits/sec) = B log2(1 + P/(BN0)).    (4.23)

It is interesting to plot the rate in (4.23) as a function of B for fixed P/N0 to verify that the rate tends to P/(N0 ln 2) as B → ∞. For small enough B, increases in B translate into large increases of the rate; however, this effect saturates as B becomes very large.

Figure 4.1: Capacity (in bits/sec) of the bandlimited AWGN channel as a function of the bandwidth B (in Hz), for P/N0 = 1, 3 and 5, illustrating the saturation as B → ∞.

More generally, two distinct regimes are present in Eq. (4.23). When the signal-to-noise ratio P/(BN0) is very small, a Taylor expansion of the capacity gives

C ≃ B log2(e) · P/(BN0) = log2(e) · P/N0,    (4.24)

and the capacity measured in bits per second is proportional to the power. As essentially all modulations attain the same mutual information at low signal-to-noise ratio, simple BPSK is a good modulation choice in this regime.

At the other extreme, when the signal-to-noise ratio P/(BN0) is large, the same kind of approximation gives

C ≃ B log2(P/(BN0)),    (4.25)

and the capacity grows only logarithmically with the signal-to-noise ratio. In this case, ever larger constellation cardinalities are required to approach Shannon's capacity for large P/N0. Interestingly, since N0 ≃ 10⁻²¹ W/Hz at room temperature and transmitted powers are on the order of watts, actual values of P/N0, even taking into account propagation losses, are compatible with enormous data transmission capabilities, much higher than those in present-day systems.
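To see the two regimes and the saturation of (4.23) concretely, here is a short sketch (our own, with an arbitrary value of P/N0 matching one of the curves in Figure 4.1) that evaluates the rate for increasing bandwidths and compares it with the limit P/(N0 ln 2):

import numpy as np

p_over_n0 = 5.0                                       # assumed value of P/N0 (cf. Figure 4.1)
for b in (0.1, 1.0, 10.0, 100.0, 1_000.0, 10_000.0):  # bandwidth B in Hz
    rate = b * np.log2(1 + p_over_n0 / b)             # Eq. (4.23), in bits/sec
    print(f"B = {b:8.1f} Hz   R = {rate:8.4f} bits/sec")
print("limit as B -> infinity, P/(N0 ln 2) =", p_over_n0 / np.log(2))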
