Capacity of The Binary Erasure Channel

1 Communication
This note is about the breakthrough work of Claude Shannon in the 1940s. We begin with
Shannon’s famous block diagram, Figure 1.
Suppose that you want to send a message over a noisy channel. The basic steps of a digital communication system are:

1. Source coding: compress the message so that it is represented using as few bits as possible.

2. Channel coding: add structured redundancy to the compressed message so that it can be recovered despite the noise introduced by the channel.

On the other end of the channel, the receiver reverses all of the above steps. Shannon showed that we can design the source coding in step 1 and the channel coding in step 2 separately and still
transfer information over the communication channel at the optimal rate. For the rest of the
note, we will focus on the channel encoding and decoding for the special case of the binary
erasure channel.
Figure 2: The BEC and the BSC.
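To make the channel model concrete: in the BEC with erasure probability p, each transmitted bit is independently erased with probability p and delivered unchanged otherwise. A minimal Python sketch of this channel (the function name bec and the use of None to mark erased bits are our own illustrative choices):

import random

def bec(bits, p):
    # Binary erasure channel: each bit is independently erased (replaced
    # by None) with probability p, and delivered unchanged otherwise.
    return [None if random.random() < p else b for b in bits]

# Example: send 10 bits through a BEC with erasure probability 0.5.
print(bec([0, 1, 1, 0, 1, 0, 0, 1, 1, 0], p=0.5))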
For block length n, let f_n denote the encoding function and g_n the decoding function, and define the (maximal) probability of error as
\[
P_e(n) := \max_{m} P\bigl(g_n(Y_1, \ldots, Y_n) \neq m \mid \text{message } m \text{ is sent}\bigr).
\]
Take time to parse this definition. We are considering the maximum probability that the decoding function, when applied to the output of the channel, fails to return the original intended message, where the maximum is taken over all choices of the input message.
We say that the rate R is achievable for the channel if for each positive integer n there exist encoding and decoding functions (f_n, g_n) which encode messages of length L(n) := ⌈nR⌉ into codewords of length n, such that P_e(n) → 0 as n → ∞ (asymptotically error-free). The largest achievable rate of the channel is called the capacity of the channel.
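In symbols, the capacity can be written as
\[
C := \sup\{R \ge 0 : R \text{ is achievable}\}.
\]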
The main goal is to show the following:
Theorem 1. The capacity of the BEC with erasure probability p is 1 − p.
Proof. First, we will show that we can do no better than rate 1 − p. Indeed, even with
feedback (the receiver notifies the transmitter about exactly which bits were erased by the
channel), the best that the transmitter can do is to resend the bits which were erased. Since
the channel erases a fraction p of the input bits, the reliable rate of communication is 1 − p
bits per channel use.
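Spelled out: over n channel uses, only about n(1 − p) of the transmitted bits arrive unerased, so even with feedback at most n(1 − p) message bits can be conveyed, i.e.
\[
R \le \frac{n(1 - p)}{n} = 1 - p.
\]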
Next, we will show that we can achieve a rate of R := 1 − p − ε for any ε > 0. Shannon’s insight was to leverage the strong law of large numbers (SLLN) to achieve capacity. How do we generate a good codebook? Flip n · 2^{L(n)} fair coins independently, and fill in an n × 2^{L(n)} codebook accordingly (thus each of the 2^{L(n)} possible messages is associated with a codeword of length n).
Figure 3: We fill in an n × 2^{L(n)} table. The columns represent the codewords c_1, . . . , c_{2^{L(n)}} (one codeword per possible message) and the rows represent the individual bits of the codewords.
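A minimal Python sketch of this random codebook construction (the function name make_codebook and the list-of-lists representation are our own illustrative choices):

import math
import random

def make_codebook(n, R):
    # Random codebook: 2**L codewords, L = ceil(n * R), each codeword a
    # list of n independent fair coin flips (one codeword per column of
    # the table in Figure 3).
    L = math.ceil(n * R)
    return [[random.randint(0, 1) for _ in range(n)] for _ in range(2 ** L)]

# Example: block length 20 at rate 0.4 gives 2**8 = 256 codewords of 20 bits each.
codebook = make_codebook(n=20, R=0.4)
print(len(codebook), len(codebook[0]))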
Since the channel is a BEC, approximately a fraction p of the transmitted bits will be erased (by the SLLN). Suppose that the first codeword is sent. The receiver then gets the first codeword with a fraction p of its bits erased. Assume WLOG that the first ⌊n(1 − p)⌋ symbols came through (this is fine because the encoder does not know which bits will be erased, so it does not affect the coding). The receiver now looks at the codebook truncated to the first ⌊n(1 − p)⌋ rows and
sees if there is a unique codeword matching the bits that were received. The decoding rule is
that the decoder looks for a unique match in the codebook, and if a unique match does not
exist, an error is declared. Thus, the probability of error is the probability that there exist ≥ 2
entries in the truncated codebook which match the received bits. If the truncated codewords
are denoted c_1, . . . , c_{2^{L(n)}}, then consider codeword c_2: each of its ⌊n(1 − p)⌋ surviving bits independently matches the corresponding bit of c_1 with probability 1/2, so P(c_1 = c_2) = 2^{−⌊n(1−p)⌋}. Hence, by the union bound,
\[
P(\text{error}) = P\Bigl(\bigcup_{i=2}^{2^{L(n)}} \{c_1 = c_i\}\Bigr) \le \sum_{i=2}^{2^{L(n)}} 2^{-\lfloor n(1-p) \rfloor} \le 2^{L(n)} \cdot 2^{-\lfloor n(1-p) \rfloor} \sim 2^{-n(1-p-R)}.
\]
We now examine the exponent: since R = 1 − p − ε < 1 − p, the exponent −n(1 − p − R) = −nε tends to −∞ as n → ∞, so the probability of error goes to 0 exponentially fast.
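To see the whole scheme in action, here is a small simulation sketch that reuses the hypothetical helpers bec and make_codebook from above and decodes by searching for a unique match on whichever positions actually survive:

def decode(received, codebook):
    # Unique-match decoder: return the index of the unique codeword that
    # agrees with the received word on every unerased (non-None) position,
    # or None if the match is not unique (an error is declared).
    matches = [i for i, c in enumerate(codebook)
               if all(r is None or r == b for r, b in zip(received, c))]
    return matches[0] if len(matches) == 1 else None

# Monte Carlo estimate of the error probability at a small block length.
# (The exponential decay only kicks in for large n, which this naive
# exhaustive-search decoder cannot handle -- the drawback discussed below.)
n, p, eps = 30, 0.5, 0.1
R = 1 - p - eps
codebook = make_codebook(n, R)  # hypothetical helper defined above
errors = sum(decode(bec(codebook[0], p), codebook) != 0 for _ in range(200))
print("estimated error probability:", errors / 200)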
We now have a scheme that achieves capacity, but what is the drawback? Decoding
in this manner requires exhaustive search over a massive codebook, so it is practically useless.
Thus, one needs implementable and fast codes to achieve capacity in practice.
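For a sense of scale (with illustrative numbers): at rate R = 0.4 and block length n = 1000, the random codebook already contains 2^{400} codewords, so exhaustive search over it is hopeless.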
The general result is stated in terms of the mutual information of random variables,
which is defined as I(X; Y ) := H(X) + H(Y ) − H(X, Y ). Let X denote the source alphabet
of a channel, and let Y denote the corresponding output alphabet. Let X be the input to the
channel (one transmission), and let Y be the output of the channel. Finally, let P denote the
set of probability distributions on the input alphabet X. The channel capacity is
\[
C := \max_{p_X \in P} I(X; Y). \qquad (1)
\]
In words, we are looking for the largest possible mutual information between the input and
output random variables, where the maximization is taken over all possible input distributions.
This new definition does not conflict with our earlier definition of the capacity of the channel,
because of the following famous result:
Theorem 2 (Channel Coding Theorem). Any rate below the channel capacity C (as defined
in (1)) is achievable. Conversely, any sequence of codes with P_e(n) → 0 as n → ∞ has a rate
R ≤ C. Thus, the two definitions of the channel capacity which we have given agree.
The general result is more difficult to prove than the special case of the BEC, but the
BEC example already carries most of the intuition.
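For the BEC itself, the maximization in (1) can be carried out directly and recovers Theorem 1. As a brief check, write q for the probability that the input bit X equals 1 and H(·) for the binary entropy function (symbols used only in this computation), and note that the definition of mutual information gives I(X; Y ) = H(Y ) − H(Y | X) since H(X, Y ) = H(X) + H(Y | X). Then
\[
H(Y \mid X) = H(p), \qquad H(Y) = H(p) + (1 - p)H(q),
\]
so
\[
I(X; Y) = (1 - p)H(q) \le 1 - p,
\]
with equality when q = 1/2. Maximizing over input distributions therefore gives C = 1 − p, in agreement with Theorem 1.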