Noisy Channel Theorem


1 The Noisy Coding Theorem

1.1 The Idea of the Noisy Coding Theorem


We have a discrete memoryless source, with M words and entropy H.
We have a binary symmetric channel with crossover probability p. The
channel has capacity C, which we have yet to define.
Can we reliably send streams of source words through this channel? We are
allowed to code the source words as bitstrings of some fixed length before sending
them through the channel. We can use whatever error-detecting and error-correcting
capabilities we have built into the code.
We set an acceptable failure rate, ε. If we want 99.5% reliability, we set
ε = 0.5%. We are asking that at least 1 - ε = 99.5% of the words that are sent
can be correctly read (after using the code to detect and correct errors). We
will think of setting ε to be small, near 0.
Is there an error-correcting and detecting code that can do that?
Let's imagine a code where codewords are bitstrings of length n, where we
are prepared to accept a very large n in order to make a fancy code.
There's a tradeoff that comes with increasing the length of the codewords.
When you send a longer codeword, there are probably going to be more bit
errors. In fact, the number of bit errors is on average np for each word
sent.
When you code with longer bit strings, there are more possible codewords.
We only need a fixed number |W| = M of codewords, and there are 2^n bit
strings that could be codewords. Thus we can spread out the codewords
more and more, so there is a greater Hamming distance between any two
of them. Thus we can detect and correct more and more errors, up to half
the minimum Hamming distance between codewords.
So we have to ask: As n, the length of the codewords, increases, does the
Hamming distance between codewords grow faster than the probable number
of bit errors? If yes, then there exists a good code. If no, then there is no such
code. When a good code does exist (in this sense), no promises are made about
how easy encoding and decoding might be.
To prove a theorem addressing these questions, we have to (1) figure out
how far apart codewords can be and (2) estimate error probabilities.
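
To make the tradeoff concrete, here is a minimal Python sketch (with made-up numbers: a codeword length n = 100, crossover probability p = 0.01, and a code assumed to correct up to t = 5 errors per codeword; none of these values come from the text) estimating the probability that a codeword suffers more bit errors than the code can correct. On a memoryless binary symmetric channel the number of bit errors in n transmitted bits is Binomial(n, p).

    from math import comb

    def prob_too_many_errors(n, p, t):
        # Probability that a length-n codeword suffers more than t bit errors
        # on a binary symmetric channel with crossover probability p.
        # Bit errors are independent, so the error count is Binomial(n, p).
        return 1.0 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

    # Hypothetical numbers for illustration only.
    print(prob_too_many_errors(100, 0.01, 5))   # roughly 0.0006, comfortably below ε = 0.005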
There's another consideration relating to both the source and the channel.
The higher the entropy H of the source, the more information we want to
send, and the more difficult it becomes.
The lower the cross-over probability p of the channel, the fewer the errors,
making it easier to detect and correct them.
If the source entropy is higher, we might need a lower channel cross-over
probability.

Shannon figured this all out. His theorem asserts that if the source entropy
is less than the channel capacity, then ε-reliable communication is possible for
every ε > 0. But if the source entropy is greater than the channel capacity, then
there is a positive lower limit on the error rate that can be achieved.

1.2 Mutual Information


Consider a probability distribution on a cross product X × Y, where X =
{x_1, ..., x_M} and Y = {y_1, ..., y_N}. We have a set of joint probabilities
P(x_i & y_j) = P(i, j) that sum to 1 and that can conveniently be displayed in a
table of joint probabilities. We are thinking of X as source symbols (e.g. 0 and
1) and Y as received symbols (perhaps also 0 and 1), but the joint probabilities
are NOT transition probabilities (and cannot be, because the row probabilities
do not add to 1). From this table we can produce marginal probabilities for
each x_i separately, which we think of as source probabilities for a memoryless
source X. The marginal probabilities for the y_j give the probabilities for what
is received after transmission through the channel.
Here's a running example.

Joint Probabilities
sent \ received     y1      y2      Marginal X (source prob.)
x1                  0.1     0.3     0.4
x2                  0.4     0.2     0.6
Marginal Y          0.5     0.5

From the joint probabilities we can compute transition (conditional) prob-
abilities, P(y_j | x_i) = P(j | i). Simply normalize each row by dividing by its
marginal probability. Thus

P(y_j | x_i) = P(x_i & y_j) / P(x_i).

In our example we have

Transition Probabilities
sent \ received     y1      y2      Row sum
x1                  0.25    0.75    1
x2                  0.67    0.33    1
Summarizing: From a joint probability distribution on X × Y we found
both source probabilities for X and transition probabilities from X to Y. The
process is reversible, which is very important for the sequel. If you begin with
just source and transition probabilities, then you can easily construct a full
table of joint probabilities. Just multiply each row of the transition table by its
corresponding source probability. The joint probability table is constructed from
information about both the source (the source probabilities) and the channel
(the transition probabilities). From the joint probabilities you can find marginal
probabilities for the received variables. For the same channel, different sources
lead to different joint probabilities.
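
The bookkeeping described above is easy to mechanize. The following Python sketch, using the running example, recovers the source and transition probabilities from the joint table and then rebuilds the joint table from them (the variable names are mine, not from the text).

    # Rows are sent symbols x1, x2; columns are received symbols y1, y2.
    joint = [[0.1, 0.3],
             [0.4, 0.2]]

    source = [sum(row) for row in joint]               # marginal P(x_i): [0.4, 0.6]
    received = [sum(col) for col in zip(*joint)]       # marginal P(y_j): [0.5, 0.5]
    transition = [[p_xy / p_x for p_xy in row]         # P(y_j | x_i): normalize each row
                  for row, p_x in zip(joint, source)]

    # The reverse direction: multiply each transition row by its source probability.
    joint_again = [[t * p_x for t in row] for row, p_x in zip(transition, source)]

    print(transition)    # [[0.25, 0.75], [0.667, 0.333]], approximately
    print(joint_again)   # recovers the joint table above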
Given either a joint probability model on X × Y OR (equivalently) a source
S and a channel Ch, we define the mutual information

I(x_i, y_j) = log_2( P(y_j | x_i) / P(y_j) ).

The mutual information is the base-2 logarithm of the ratio of a conditional
probability and a marginal probability for the same quantity y_j. In terms of
sources and channels, it is the log of the ratio of a transition probability to y_j
(from x_i) and the probability that y_j will be received when a random word is
transmitted from the source.
In our example we have

Mutual Information I(x_i, y_j)
sent \ received     y1                            y2
x1                  log_2(0.25/0.5) = -1.000      log_2(0.75/0.5) = 0.585
x2                  log_2(0.67/0.5) = 0.415       log_2(0.33/0.5) = -0.585

When y_j is more likely to occur after x_i than it is overall, the mutual informa-
tion is positive. The occurrence of x_i indicates an increased probability of y_j.
When y_j is less likely to occur after x_i than it is overall, the mutual information
is negative. The occurrence of x_i indicates a decreased probability of y_j. If the
mutual information is zero, I(x_i, y_j) = 0, then P(y_j | x_i) = P(y_j), so knowing
that x_i occurred does not enable us to revise our estimate of the probability of y_j.
Mutual information is symmetric in the two variables: I(x_i, y_j) = I(y_j, x_i).
Indeed, since

P(y_j | x_i) P(x_i) = P(x_i & y_j) = P(x_i | y_j) P(y_j),

we have

P(y_j | x_i) / P(y_j) = P(x_i | y_j) / P(x_i).
Finally, we can define the Average Mutual Information of a joint distri-
bution on X × Y (equivalently, of a channel specified by transition probabilities
P(y_j | x_i) together with a source specified by its word probabilities P(x_i)) to be
the expected value of the mutual information:

I(X, Y) = Σ_{i=1}^{M} Σ_{j=1}^{N} P(x_i & y_j) I(x_i, y_j).

In our example, the average mutual information is

I(X, Y) = (0.1)(-1.000) + (0.3)(0.585) + (0.4)(0.415) + (0.2)(-0.585)
        = 0.1245.
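
For readers who want to check the arithmetic, here is a short Python sketch computing the element-level and average mutual information directly from the joint table of the running example. It uses the equivalent form I(x_i, y_j) = log_2( P(x_i & y_j) / (P(x_i) P(y_j)) ), which follows from substituting P(y_j | x_i) = P(x_i & y_j) / P(x_i) into the definition.

    from math import log2

    joint = [[0.1, 0.3],     # rows: x1, x2; columns: y1, y2
             [0.4, 0.2]]
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]

    # I(x_i, y_j) = log_2( P(y_j | x_i) / P(y_j) ) = log_2( P(x_i & y_j) / (P(x_i) P(y_j)) )
    I = [[log2(joint[i][j] / (p_x[i] * p_y[j])) for j in range(2)] for i in range(2)]

    avg = sum(joint[i][j] * I[i][j] for i in range(2) for j in range(2))
    print(I)     # [[-1.0, 0.585], [0.415, -0.585]], approximately
    print(avg)   # approximately 0.1245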

1.3 Capacity of a Binary Symmetric Channel
Main Example.
Our goal is to compute the channel capacity of a binary symmetric channel,
Ch(ε), with cross-over probability ε.
We work with a source, S(p), with words {0, 1} and probabilities P(0) = p
and P(1) = q = 1 - p.

Joint Probabilities
Binary symmetric channel Ch(ε), source (p, q)

sent \ received     0                 1                 Marginal X (source prob.)
0                   p(1-ε)            pε                p
1                   qε                q(1-ε)            q
Marginal Y          p(1-ε) + qε       pε + q(1-ε)

Mutual Information
Binary symmetric channel Ch(ε), source (p, q)

sent \ received     0                                   1
0                   log_2( (1-ε) / (p(1-ε) + qε) )      log_2( ε / (pε + q(1-ε)) )
1                   log_2( ε / (p(1-ε) + qε) )          log_2( (1-ε) / (pε + q(1-ε)) )

The average mutual information for the source and binary symmetric chan-
nel, obtained from the joint probabilities and the element-level mutual informa-
tion, is

I(ε, p) = p(1-ε) log_2( (1-ε) / (p(1-ε) + qε) ) + pε log_2( ε / (pε + q(1-ε)) )
        + qε log_2( ε / (p(1-ε) + qε) ) + q(1-ε) log_2( (1-ε) / (pε + q(1-ε)) )
        = (1-ε) log_2(1-ε) + ε log_2(ε) - (p(1-ε) + qε) log_2( p(1-ε) + qε )
          - (pε + q(1-ε)) log_2( pε + q(1-ε) ).

Notice that for a single channel, the average mutual information depends on
the source probabilities p = P(0) and q = P(1), so it is not a property of the
channel alone. Some sources give greater average mutual information and others
smaller average mutual information for the same cross-over probability ε. If
among all sources we choose one for which the channel has the greatest average
mutual information, that maximal value is a property of the channel alone.
Definition. The Capacity of a discrete memoryless channel is the average
mutual information computed with the source that gives the greatest such
average.
To find the capacity of the binary symmetric channel, we have to maximize
I(ε, p) with respect to p, on the domain 0 ≤ p ≤ 1. It's a calculus problem.
Differentiating with respect to p (remembering that q = 1 - p) and equating
the result to zero shows that the maximum average mutual information occurs at
p = q = 1/2. Hence

Channel capacity = (1-ε) log_2(1-ε) + ε log_2(ε) - (1/2) log_2(1/2) - (1/2) log_2(1/2)
                 = 1 + ε log_2(ε) + (1-ε) log_2(1-ε).
The capacity of the channel depends on the cross-over probability ε. If
the cross-over error rate is 50%, then the channel has no capacity to transmit
information reliably. The lower the cross-over error rate, the more efficiently
information can be sent. If ε = 0, so that there are no errors at all, then it takes
just 1 bit through the channel to transmit 1 bit of information.

[Figure: the channel capacity 1 + ε log_2(ε) + (1-ε) log_2(1-ε) plotted as a function of the cross-over probability ε for 0 ≤ ε ≤ 1; the capacity is 1 at ε = 0 and ε = 1 and falls to 0 at ε = 0.5.]
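
The maximization can also be checked numerically. The following Python sketch implements the expression for I(ε, p) derived above and the closed-form capacity, then confirms with a simple grid search (instead of calculus) that the source p = q = 1/2 maximizes the average mutual information; the value ε = 0.1 is an arbitrary illustration, not taken from the text.

    from math import log2

    def xlog2(x):
        # Convention: 0 * log2(0) = 0.
        return x * log2(x) if x > 0 else 0.0

    def avg_mutual_info(eps, p):
        # I(eps, p) for the binary symmetric channel, from the expression above.
        q = 1 - p
        a = p * (1 - eps) + q * eps       # probability of receiving 0
        b = p * eps + q * (1 - eps)       # probability of receiving 1
        return xlog2(1 - eps) + xlog2(eps) - xlog2(a) - xlog2(b)

    def capacity(eps):
        # Closed form: 1 + eps*log2(eps) + (1 - eps)*log2(1 - eps).
        return 1 + xlog2(eps) + xlog2(1 - eps)

    eps = 0.1   # illustrative cross-over probability
    best_p = max((k / 1000 for k in range(1001)), key=lambda p: avg_mutual_info(eps, p))
    print(best_p)                        # 0.5
    print(avg_mutual_info(eps, 0.5))     # approximately 0.531
    print(capacity(eps))                 # the same value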

1.4 Channel coding theorem


We aim to define the capacity of a discrete memoryless channel, a mea-
sure of how much information we can reliably send through the channel (assum-
ing we are smart about the way we code the data before transmitting). There
is some intuition that comes from the definition, but ultimately the definition
of channel capacity is justified by Shannon's great coding theorem.

The hope is to take a word stream from the source, encode it in some fancy
way so that, after it is sent through the noisy channel, we can detect and correct
as many transmission errors as possible and then decode, in the end producing
a message that is virtually error free, nearly identical to the word stream
produced by the source. Can we do this?
Theorem (Shannon's noisy coding theorem, aka channel coding theorem).
Given a memoryless source and a discrete memoryless channel:
1. If the source entropy is less than the channel capacity, then the error
probability can be reduced to any desired level by using a sufficiently
complex encoder and decoder. There exist codes that can do the job.

2. If the source entropy is greater than the channel capacity, arbitrarily small
error probability cannot be achieved. There is a limit to how effective a
code can be.
In the cases where the noisy coding theorem asserts existence of good codes,
the proof of the theorem gives no indication of how to create these codes. The
theorem is a pure existence theorem. Moreover, the theorem does not claim
that the codes with these minimal error rates are easy to implement. There
has to be some good underlying structure to the code in order for encoding and
decoding to be efficient.
