Notes - BSC
Lecture 6
Instructor: Madhu Sudan Scribe: Xingchi Yan
1 Overview
1.1 Outline for today
Channel Coding (Or Error Correction)
• Definitions: Rate, Capacity
• Coding Theorem for Binary Symmetric Channel (BSC)
• Coding Theorem for general channels
• Converse
2 Binary Symmetric Channel
The binary symmetric channel BSC(p) takes an input bit X and outputs
Y = X       w.p. 1 − p
Y = 1 − X   w.p. p
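As an illustration (not part of the lecture), here is a minimal Python sketch of passing bits through BSC(p); the function name and parameters are just for this example:

    import random

    def bsc(x_bits, p):
        # Flip each transmitted bit independently with probability p.
        return [b ^ (1 if random.random() < p else 0) for b in x_bits]

    # Send 20 zero bits through BSC(0.1) and count how many got flipped.
    y = bsc([0] * 20, p=0.1)
    print(y, "flips:", sum(y))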
In order to use such a channel, Shannon’s idea was to encode the information before you send it and decode it afterwards. Suppose you have a message m that you want to send: you encode it to get some sequence of symbols X^n, the channel produces Y^n, and then you want to decode.
m → [Encode] → X^n → [Channel] → Y^n → [Decode] → m̂
What we would really want to understand is the capacity of the channel. In this case the capacity of the
channel would be
Definition 2.
Capacity(BSC(p)) = sup_R lim_{ε→0} lim_{n→∞} { Rate R of (E_n, D_n) }

such that

Pr_{m, y | X = E(m)} [ D(y^n) ≠ m ] ≤ ε
So what is the best rate you can get? You are allowed to take n as large as you like, but you have to make sure that the error goes to zero as n goes to infinity. This is the general quantity we want to understand for any channel.
Remark The capacity of the binary symmetric channel is Capacity(BSC(p)) = 1 − h(p), where h(p) = p log(1/p) + (1 − p) log(1/(1 − p)) is the entropy of a Bernoulli(p) random variable.
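For concreteness (my own addition, not in the notes), a few values of h(p) and the claimed capacity 1 − h(p) can be computed directly:

    from math import log2

    def h(p):
        # Binary entropy: h(p) = p log(1/p) + (1 - p) log(1/(1 - p)), with h(0) = h(1) = 0.
        if p in (0.0, 1.0):
            return 0.0
        return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

    for p in (0.0, 0.11, 0.25, 0.5):
        print(f"p = {p:.2f}   h(p) = {h(p):.3f}   1 - h(p) = {1 - h(p):.3f}")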
This is a somewhat striking theorem. Why do we get 1 − h(p)? Roughly, the idea is the following. The channel adds to X a sequence η of Bernoulli random variables distributed according to Bern(p)^n. If, after decoding, you are able to reproduce the original message m (and hence X), then you can easily use this to determine η as well. So the channel is delivering, for free, n Bernoulli(p) bits, which on their own would require nh(p) uses of the channel. What remains is 1 − h(p) per use, and that is what we get to use to convey the message.
X → X ⊕ η → m,   where η ∼ Bern(p)^n
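The observation that a correct decoding also reveals η can be made concrete; a minimal sketch (illustrative only), assuming the decoder recovers m and hence X:

    import random

    n, p = 16, 0.2
    x = [random.randint(0, 1) for _ in range(n)]               # codeword bits X
    eta = [1 if random.random() < p else 0 for _ in range(n)]  # noise, eta ~ Bern(p)^n
    y = [xi ^ ei for xi, ei in zip(x, eta)]                    # received word Y = X xor eta

    # If decoding recovers m, and hence X, the noise comes for free: eta = X xor Y.
    eta_recovered = [xi ^ yi for xi, yi in zip(x, y)]
    assert eta_recovered == eta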
If you draw the capacity of the channel, 1 − h(p), as a function of p: when p = 0 you get capacity 1, which means that for every use of the channel you can send 1 bit through, which makes sense. When p = 1/2 you get capacity 0, which also makes sense: whether you send 0 or 1, the receiver receives an unbiased random bit, so there is no correlation between what is sent and what is received.
3 General Channel
It turns out that when you talk about general channels, you get an even nicer connection. We would like to find the rate and capacity for a general channel. First we would like to specify the encoder for a general channel. For today, a general channel means a memoryless one: we take some arbitrary channel which acts independently on every single symbol being transmitted.
3.1 Introduction
The input X is an element of some universe Ω_x. The output Y is an element of some universe Ω_y; the two universes need not be related. We want to think about stochastic channels, which makes a lot of sense: the channel is given by a collection of conditional distributions, one for each input.
Example 3. One very simple example of this is called the erasure channel. The following describes the Binary Erasure Channel (BEC), which produces an output of 0, 1, or ?:

0 → 0 w.p. 1 − p,   0 → ? w.p. p
1 → 1 w.p. 1 − p,   1 → ? w.p. p
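A small Python sketch of the BEC (the function name and the use of the character '?' as the erasure symbol are just illustrative):

    import random

    def bec(x_bits, p):
        # Each bit is erased (replaced by '?') independently with probability p
        # and delivered unchanged otherwise.
        return ['?' if random.random() < p else b for b in x_bits]

    print(bec([0, 1, 1, 0, 1, 0, 0, 1], p=0.3))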
Example 4. People in information theory may also think about something called the noisy typewriter. The channel input is either unchanged with probability 1/2 or transformed into the next letter with probability 1/2 in the output.
Exercise 5. Take a binary erasure channel with parameter p and a binary erasure channel with parameter q and try to find a reasonable relationship between p and q.
Remark The matrix P_{y|x}(α, β), with rows indexed by Ω_x and columns indexed by Ω_y, specifies the channel. For BSC(p) this matrix is

1 − p    p
  p    1 − p
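As a sanity check (not in the notes), this matrix can be written down and verified to have rows that are conditional distributions; a minimal Python sketch:

    import numpy as np

    p = 0.1
    # Rows are indexed by the input (0 or 1), columns by the output (0 or 1).
    P_y_given_x = np.array([[1 - p, p],
                            [p, 1 - p]])
    assert np.allclose(P_y_given_x.sum(axis=1), 1.0)  # each row is a conditional distribution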
Here we will get a remarkable theorem once again. It turns out that the capacity of the channel is given in terms of the joint distribution on (X, Y) induced by these conditional distributions.
Theorem 8.

Capacity(P_{y|x}) = sup_{P_x} I(X; Y)
Remark Once the distribution of X is specified, we get a joint distribution on (X, Y). The mutual information between X and Y, maximized over the input distribution, is the capacity of this channel. This completely characterizes every memoryless channel of communication: given Y, we want to figure out X, and mutual information is the right characterization.
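As a numerical illustration of Theorem 8 (my own check, not part of the lecture), one can maximize I(X; Y) over input distributions P_x for BSC(p) on a grid and compare with 1 − h(p):

    import numpy as np

    def mutual_information(px, P):
        # P[a, b] = P_{y|x}(b | a);  I(X;Y) = sum_{a,b} px[a] P[a,b] log2(P[a,b] / py[b]).
        py = px @ P
        total = 0.0
        for a in range(len(px)):
            for b in range(P.shape[1]):
                if px[a] > 0 and P[a, b] > 0:
                    total += px[a] * P[a, b] * np.log2(P[a, b] / py[b])
        return total

    p = 0.1
    P = np.array([[1 - p, p], [p, 1 - p]])
    best = max(mutual_information(np.array([q, 1 - q]), P) for q in np.linspace(0, 1, 1001))
    print(best)   # close to 1 - h(0.1), attained near the uniform input distribution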
Let's prove half of the statement first, for the BSC; then the corresponding half for general channels; and then the other half (the converse). Here is what we're going to do with the encoder: pick n large enough, let k = (1 − h(p) − ε)n, and let E_n : {0, 1}^k → {0, 1}^n be completely random. The decoding function is maximum likelihood: given some received sequence, we look at the m which maximizes the probability of y^n conditioned on x^n = E_n(m), that is,

D_n(β^n) = argmax_m { P_{y^n|x^n}(β^n, E_n(m)) }

where β^n ∈ Ω_y^n.
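Here is a small, purely illustrative Python sketch of this scheme (random encoder plus maximum-likelihood decoding over the BSC); it is only feasible for tiny k and n, and the names are mine:

    import random

    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))

    k, n, p = 3, 12, 0.1
    # Random encoder: each of the 2^k messages gets an independent uniformly random codeword.
    code = {m: [random.randint(0, 1) for _ in range(n)] for m in range(2 ** k)}

    m = random.randrange(2 ** k)
    y = [b ^ (1 if random.random() < p else 0) for b in code[m]]   # send E(m) through BSC(p)

    # For the BSC with p < 1/2, maximum-likelihood decoding is minimum Hamming distance.
    m_hat = min(code, key=lambda mm: hamming(code[mm], y))
    print("sent", m, "decoded", m_hat)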
This decoder can be used for any channel, for instance Markov channels. The following theorem is not hard to prove. We have
Theorem 9.

Pr_{E_n, m, y|E_n(m)} [D_n(y) ≠ m] ≤ ε

Remark

Pr_{E_n, m, BSC_p} [D_n(BSC_p(E_n(m))) ≠ m] ≤ ε
The decoder can fail in one of two ways: E1, the event that the received word y differs from the transmitted codeword E_n(m) in more than (p + ε)n coordinates, and E2, the event that some other codeword E_n(m'), m' ≠ m, lands within distance (p + ε)n of y. Pr[E1] is exponentially small, so we are left with bounding the probability of E2, which is simple to calculate.
Lemma 10.

∀ m' ≠ m,   Pr[E(m') ∈ Ball of radius (p + ε)n around y] = Volume(Ball) / 2^n ≈ 2^{(h(p)+ε)n} / 2^n
Remark

Ball_r(y) = {z ∈ {0,1}^n : z and y differ in ≤ r coordinates}

The volume of this ball, or the size of this set, is

|Ball_r(y)| = Σ_{i=0}^{r} (n choose i) ≈ 2^{h(r/n)n}
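A quick numerical check (mine, not from the notes) of the approximation |Ball_r(y)| ≈ 2^{h(r/n)n}:

    from math import comb, log2

    def h(q):
        return 0.0 if q in (0.0, 1.0) else q * log2(1 / q) + (1 - q) * log2(1 / (1 - q))

    n, r = 200, 30
    vol = sum(comb(n, i) for i in range(r + 1))
    # Compare the exponent of the exact volume with h(r/n); they agree up to o(1) terms.
    print(log2(vol) / n, h(r / n))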
That was just for one single message; taking a union bound over all messages,

Pr[∃ m', s.t. E2] ≤ 2^k · 2^{(h(p)+ε)n} / 2^n
This is where we see the quantity that we want: the bound goes to 0 as long as k/n + h(p) + ε < 1, which is why we need one minus the entropy.
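As a quick arithmetic sanity check (with made-up illustrative numbers), the exponent in the union bound is negative whenever k/n + h(p) + ε < 1, so the bound vanishes as n grows:

    from math import log2

    def h(q):
        return q * log2(1 / q) + (1 - q) * log2(1 / (1 - q))

    p, eps, rate = 0.1, 0.01, 0.5          # rate plays the role of k/n
    exponent = rate + h(p) + eps - 1       # the union bound is roughly 2^(exponent * n)
    print(exponent)                        # negative, so the bound goes to 0 as n grows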
Exercise 11. Show Pr[E1 ] ≤ exp(−n)
That proves part of what we want to prove today. It says that the capacity of the binary symmetric channel satisfies

Capacity(BSC(p)) ≥ lim_{ε→0} {1 − h(p) − ε} = 1 − h(p)
Now let's see if we can show that the capacity of a general channel is at least lim_{ε→0} { sup_{P_x} I(X; Y) − ε }.
We are going to pick n large enough and k large enough. We fix some distribution P_x. Now we choose the encoding by letting E_n(m)_i ∼ P_x, i.i.d. over all (m, i), m ∈ {0,1}^k and i ∈ [n]. The decoding function is still the same maximum likelihood. Now the question is: what do the analogues of the errors of type 1 and type 2 look like? There is no notion of the number of errors anymore.
So what we're going to do instead is to start talking about typical sequences. Let's recall the asymptotic equipartition principle (AEP):
Lemma 12. If Z_1, ..., Z_n with Z_i ∼ P_z i.i.d., then there exists S ⊆ Ω_z^n such that, for every (r_1, ..., r_n) ∈ S,

1/|S|^{1+ε} ≤ Pr[Z_1 ... Z_n = r_1 ... r_n] ≤ 1/|S|^{1−ε}

and

Pr[(Z_1, ..., Z_n) ∉ S] ≤ ε

Here S = “typical set for P_z”.
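A small simulation (my own illustration) of the AEP for Z_i ∼ Bern(p): the per-symbol log-probability of a sample concentrates around h(p), so the typical set has size roughly 2^{h(p)n}.

    import random
    from math import log2

    p, n = 0.3, 1000
    hp = p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

    for _ in range(5):
        z = [1 if random.random() < p else 0 for _ in range(n)]
        logprob = sum(log2(p) if zi else log2(1 - p) for zi in z)
        # The per-symbol log-probability concentrates around -h(p), so typical
        # strings each have probability about 2^(-h(p) n).
        print(-logprob / n, "vs h(p) =", hp)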
We are going to look at E(m') for m' ≠ m and analyze the probability that E(m') is typical and that (E(m'), y) is jointly typical. Let's fix m' and ask the question. The event that E(m') is typical happens with very high probability. We are more interested in the probability that (E(m'), y) is jointly typical. This is the crucial question: for m' ≠ m the pair (E(m'), y) is actually distributed according to P_x × P_y, since the codeword E(m') is independent of y.
Here is a lemma, stated without proof, from which the claim follows immediately.

Lemma 13. Let Z^n ∼ P^n and let S_Q be the typical set for another distribution Q on the same universe. Then Pr[Z^n ∈ S_Q] ≈ 2^{−D(Q||P)n}.
This is another fundamental reason to understand the divergence between two distributions, and it can be applied to very simple things. In the current case we are looking at D(P_{xy} || P_x × P_y) = I(X; Y). When you combine these two facts and apply them here, it turns out that the probability that any particular wrong message is jointly typical with y is approximately 2^{−I(X;Y)n}. Taking a union bound over the 2^k messages, as before, this tells us that the achievable rate is at least the mutual information for this input distribution. Next lecture we will try to prove the upper bound (the converse).
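To make the identity D(P_{xy} || P_x × P_y) = I(X; Y) concrete, here is a small numerical check (illustrative; the input distribution and channel are arbitrary choices of mine):

    import numpy as np

    px = np.array([0.3, 0.7])                    # an arbitrary input distribution
    P = np.array([[0.8, 0.2], [0.2, 0.8]])       # BSC(0.2) as the conditional P_{y|x}
    pxy = px[:, None] * P                        # joint distribution P_{xy}
    py = pxy.sum(axis=0)                         # marginal P_y

    # D(P_xy || P_x x P_y), straight from the definition of divergence.
    div = sum(pxy[a, b] * np.log2(pxy[a, b] / (px[a] * py[b]))
              for a in range(2) for b in range(2))

    # I(X;Y) = H(Y) - H(Y|X), computed from entropies.
    def H(q):
        return -sum(qi * np.log2(qi) for qi in q if qi > 0)

    mi = H(py) - sum(px[a] * H(P[a]) for a in range(2))
    print(div, mi)   # the two quantities coincide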