EE 376A: Information Theory: Lecture Notes
January 6, 2016
Contents
1 Introduction
1.1 Lossless Compression
1.2 Channel Coding
1.3 Lossy Compression
4 Lossless Compression
4.1 Uniquely decodable codes and prefix codes
4.2 Prefix code for dyadic distributions
4.3 Shannon codes
4.4 Average codelength bound for uniquely decodable codes
4.5 Huffman Coding
4.6 Optimality of Huffman codes
5.5 Joint Asymptotic Equipartition Property (AEP)
5.5.1 Set of Jointly Typical Sequences
5.6 Direct Theorem
5.7 Fano's Inequality
5.8 Converse Theorem
5.9 Some Notes on the Direct and Converse Theorems
5.9.1 Communication with Feedback: X_i(J, Y^{i-1})
5.9.2 Practical Schemes
5.9.3 P_e vs. P_max
6 Method of Types
6.1 Method of Types
6.1.1 Recap on Types
6.2 A Version of Sanov's Theorem
Chapter 1
Introduction
Information theory is the science of operations on data, such as compression, storage, and communication. It is among the few disciplines fortunate to have a precise date of birth: 1948, with the publication of Claude E. Shannon's paper entitled A Mathematical Theory of Communication. Shannon's information theory had a profound impact on our understanding of the concepts of communication.
In this introductory chapter, we will look at a few representative examples that give a flavour of the problems which can be addressed using information theory. Note, however, that communication theory is just one of the numerous fields whose understanding shifted dramatically due to information theory.
1.1 Lossless Compression
Consider a memoryless source emitting symbols U from the alphabet U = {a, b, c} with probabilities
P(U = a) = 0.7
P(U = b) = P(U = c) = 0.15
Our task is to encode the source sequence into binary bits (1s and 0s). How should we do so?
The naive way is to use two bits to represent each symbol, since there are three possible
symbols. For example, we can use 00 to represent a, 01 to represent b and 10 to represent c. This
scheme has an expected codeword length of 2 bits per source symbol. Can we do better? One
natural improvement is to try to use fewer bits to represent symbols that appear more often. For
example, we can use the single bit 0 to represent a since a is the most common symbol, and 10 to
represent b and 11 to represent c since they are less common. Note that this code satisfies the prefix
condition, meaning no codeword is the prefix of another codeword, which allows us to decode a
message consisting of stream of bits without any ambiguity. Thus, if we see the encoded sequence,
001101001101011, we can quickly decode it as follows:
0 | 0 | 11 | 0 | 10 | 0 | 11 | 0 | 10 | 11
a | a | c  | a | b  | a | c  | a | b  | c
If we use this encoding scheme, then L, the expected number of bits we use per source symbol, is
L = 0.7 × 1 + 0.15 × 2 + 0.15 × 2 = 1.3 bits per source symbol.
This is a significant improvement over our first encoding scheme. But can we do even better? A
possible improvement is to encode two values at a time instead of encoding each value individually.
For example, the following table shows all the possibilities we can get if we look at 2 values, and
their respective probabilities (listed in order of most to least likely pairs). A possible prefix coding
scheme is also given.
Note that this scheme satisfies the two important properties: 1) the prefix condition and 2)
more common source symbol pairs have shorter codewords. If we use the above encoding scheme,
then the expected number of bits used per source symbol is
It can be proven that if we are to encode 2 values at a time, the above encoding scheme achieves
the lowest average number of bits per value (*wink* Huffman encoding *wink*).
Generalizing the above idea, we can consider a family of encoding schemes indexed by an integer
k. Given an integer k, we can encode k values at a time with a scheme that satisfies the prefix
condition and assigns shorter codewords to more common symbols. Under some optimal encoding
scheme, it seems reasonable that the expected number of bits per value will decrease as k increases.
We may ask: what is the best we can do? Is there a lower bound on L? Shannon proved that for any such source, the best we can do is H(U), which is called the entropy of the source. By definition, the source entropy is
H(U) ≜ Σ_{u∈U} p(u) log₂ (1/p(u))    (1.1)
Theorem 2. ∀ε > 0, ∃ family of schemes, such that the average codeword length, L ≤ H(U ) + ε.
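As a quick numerical illustration (not part of the original notes), the short Python snippet below computes H(U) for the three-symbol source above and compares it with the expected lengths of the naive 2-bit code and the prefix code {a: 0, b: 10, c: 11}; the achieved 1.3 bits/symbol sits between H(U) ≈ 1.18 and 2.

    import math

    # Source from the example above: P(a) = 0.7, P(b) = P(c) = 0.15.
    p = {"a": 0.7, "b": 0.15, "c": 0.15}

    # Entropy H(U) = sum_u p(u) log2(1/p(u)).
    H = sum(prob * math.log2(1.0 / prob) for prob in p.values())

    # Expected lengths of the naive 2-bit code and of the prefix code {a: 0, b: 10, c: 11}.
    naive = {"a": "00", "b": "01", "c": "10"}
    prefix = {"a": "0", "b": "10", "c": "11"}
    L_naive = sum(p[u] * len(cw) for u, cw in naive.items())
    L_prefix = sum(p[u] * len(cw) for u, cw in prefix.items())

    print(f"H(U)   = {H:.4f} bits/symbol")      # about 1.181
    print(f"naive  = {L_naive:.2f} bits/symbol")  # 2.00
    print(f"prefix = {L_prefix:.2f} bits/symbol") # 1.30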
1.2 Channel Coding
Suppose now that we wish to transmit a stream of bits U₁, U₂, . . . over a noisy channel. The channel takes inputs Xᵢ ∈ {0, 1} and produces outputs
Yᵢ = Xᵢ ⊕ Wᵢ,   Wᵢ ∼ Ber(q),
where ⊕ is the XOR operator.
We want to know how accurate we can be when transmitting the bits. The simplest approach is to let Xᵢ = Uᵢ, and to decode the received bit by assuming Yᵢ = Xᵢ. Let p_e be the probability of error per source bit. Then in this case, p_e = q < 1/2.
Can we decrease p_e? One approach is repetition coding, i.e., send each bit k times for some k, and then decode the received bit as the value that appears most often among the k received symbols. For example, if k = 3, then p_e is simply the probability that the channel flipped 2 or more of the 3 bits, which is
p_e = 3q²(1 − q) + q³.
However, we need to send 3 times as many bits. To quantify this, we introduce the notion of bit rate, denoted by R: the number of source bits communicated per use of the channel. For this scheme, our bit rate is 1/3, whereas our bit rate in the previous example was 1.
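To see the rate/reliability trade-off of repetition coding concretely, here is a small simulation sketch (illustrative only, not from the notes; the channel parameter q = 0.1 is an arbitrary choice):

    import random

    def repetition_error_rate(q, k, trials=200_000, seed=0):
        """Empirical bit error rate of k-fold repetition over a BSC(q) with majority decoding (k odd)."""
        rng = random.Random(seed)
        errors = 0
        for _ in range(trials):
            bit = rng.randint(0, 1)
            received = [bit ^ (rng.random() < q) for _ in range(k)]  # each copy flipped w.p. q
            decoded = 1 if sum(received) > k // 2 else 0
            errors += (decoded != bit)
        return errors / trials

    q = 0.1
    print("k=1:", repetition_error_rate(q, 1))       # ~ q = 0.1, rate 1
    print("k=3:", repetition_error_rate(q, 3))       # ~ 0.028, rate 1/3
    print("analytic k=3:", 3 * q**2 * (1 - q) + q**3)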
Generalizing the above example, we see that as we increase k, our error rate pe will tend to 0,
but our bit rate R (which is 1/k) tends to 0 as well. Is there some scheme that has a significant
positive bit rate and yet allows us to get reliable communication (error rate tends to 0)? Again,
Shannon provides the answer.
Shannon showed that there is a constant C, depending only on the channel, such that rates below C permit reliable communication while rates above C do not. The largest such C is known as the channel capacity of the channel: the largest bit rate that still allows for reliable communication. This was a very significant and startling revelation for the world of communication, as it was previously thought that vanishing error probability is not achievable at any non-zero bit rate.
As examples, we will consider the channel capacity of the binary symmetric channel and the
additive white gaussian noise channel.
For the binary symmetric channel above (crossover probability q), any rate
R < 1 − h₂(q)    (1.3)
is achievable, whereas any rate
R > 1 − h₂(q)    (1.4)
is unachievable. Here h₂ denotes the binary entropy function, h₂(q) = q log₂(1/q) + (1 − q) log₂(1/(1 − q)).
Yi = Xi + Ni , Ni ∼ N (0, σ 2 )
The rate of transmission is the ratio N/n (the ratio of the number of source bits to the number of uses of the channel). We want to develop a scheme so that we can reliably reconstruct
Ui from the given Yi . One way, if we have no usage power constraint, is to make Xi a large positive
value if Ui = 1 and Xi a large negative value if Ui = 0. In this manner, the noise from Ni will be
trivial relative to the signal magnitude, and will not impact reconstruction too much. However,
suppose there is an additional constraint on the average power of the transmitted signal, such that
we require
(1/n) Σ_{i=1}^{n} Xᵢ² ≤ p,
for a given value p. In fact, we will see that
Theorem 4. If the rate of transmission is less than ½ log₂(1 + p/σ²), then there exists a family of schemes that communicate reliably. And if the rate of transmission is greater than ½ log₂(1 + p/σ²), then there is no family of schemes which communicates reliably.
The ratio p/σ² is referred to as the signal-to-noise ratio (SNR).
1.3 Lossy Compression
Consider a source producing i.i.d. Gaussian values U₁, U₂, . . . ∼ N(0, σ²), and suppose we describe each value with a single bit Bᵢ indicating its sign. After receiving the bits, let V₁, V₂, . . . be the reconstructed values. The distortion of the scheme is defined as
D ≜ E[(Uᵢ − Vᵢ)²]    (1.5)
The optimal estimation rule for minimum mean squared error is the conditional expectation. Therefore, to minimize distortion, we should reconstruct via Vᵢ = E[Uᵢ | Bᵢ]. This results in
D = E[(Uᵢ − Vᵢ)²]
 = E[Var(Uᵢ | Bᵢ)]
 = 0.5 · Var(Uᵢ | Bᵢ = 1) + 0.5 · Var(Uᵢ | Bᵢ = 0)   (because U is symmetric)
 = Var(Uᵢ | Bᵢ = 1)
 = E[Uᵢ² | Bᵢ = 1] − (E[Uᵢ | Bᵢ = 1])²
 = σ²(1 − 2/π)
 ≈ 0.363 σ².
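A quick Monte Carlo sketch (added for illustration, not from the notes) confirming D = σ²(1 − 2/π) for this one-bit sign quantizer; the constant c = σ√(2/π) is the conditional mean E[U | U ≥ 0]:

    import math
    import random

    rng = random.Random(1)
    sigma = 2.0
    n = 500_000
    c = sigma * math.sqrt(2.0 / math.pi)   # E[U | U >= 0] for U ~ N(0, sigma^2)

    mse = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, sigma)
        v = c if u >= 0 else -c            # reconstruction from the sign bit
        mse += (u - v) ** 2
    mse /= n

    print("empirical D      :", mse)
    print("sigma^2 (1 - 2/pi):", sigma**2 * (1 - 2 / math.pi))   # ~ 0.363 * sigma^2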
Theorem 5. Consider a Gaussian memoryless source with mean µ and variance σ 2 . ∀ε > 0, ∃
family of schemes such that D ≤ σ 2 /4 + ε. Moreover, ∀ families of schemes, D ≥ σ 2 /4.
These few examples illustrate the usefulness of information theory in the field of communications. In the next few chapters, we will build the mathematical foundations of the theory, which will make these tools much more convenient to use later on.
In this chapter, we will introduce certain key measures of information that play crucial roles in theoretical and operational characterizations throughout the course. These include the entropy, the mutual information, and the relative entropy. We will also derive some key properties of these information measures.
Notation
A quick summary of the notation:
1. Random objects: U, X, Y, V
2. Alphabets: U, X, Y, V (calligraphic)
3. Specific values: u, x, y, v
A discrete random variable (object) U has p.m.f. P_U(u) ≜ P(U = u). Often, we'll just write p(u). Similarly: p(x, y) for P_{X,Y}(x, y) and p(y|x) for P_{Y|X}(y|x), etc.
2.1 Entropy
Before we define entropy, let us take a look at the "surprise function", which will give us more intuition into the definition of entropy. For an outcome u with probability p(u), the surprise is s(u) ≜ log (1/p(u)): the less likely the outcome, the larger the surprise.
Definition 7. Entropy: Let U be a discrete R.V. taking values in U. The entropy of U is defined by:
H(U) ≜ Σ_{u∈U} p(u) log (1/p(u)) = E[s(U)]    (2.1)
The entropy represents the expected surprise of a distribution. Intuitively, the greater the expected surprise, i.e. the entropy of the distribution, the harder the source is to represent.
Note: The entropy H(U) is not a random variable. In fact it is not a function of the object U, but rather a functional (or property) of the underlying distribution P_U(u), u ∈ U. An analogy is E[U], which is also a number (the mean) determined by the distribution.
Properties of Entropy
Although almost everyone has encountered Jensen's inequality in a calculus class, we take a brief look at it in the form most useful for information theory. Jensen's Inequality: Let Q denote a convex function and X any random variable. Then
E[Q(X)] ≥ Q(E[X]).
For example, taking Q(x) = eˣ gives E[e^X] ≥ e^{E[X]}.
Consider the quantity H_q(U) ≜ Σ_{u∈U} p(u) log (1/q(u)) = E[log (1/q(U))], where U ∼ p. Note that this is again an expected surprise, but instead of the surprise associated with p, it is the surprise associated with a PMF q: U is distributed according to the PMF p, but is incorrectly assumed to have the PMF q. The following result stipulates that we will (on average) be more surprised if we had the wrong distribution in mind. This makes intuitive sense! Mathematically,
H(U ) ≤ Hq (U ), (2.11)
with equality iff q = p.
Proof:
H(U) − H_q(U) = E[log (1/p(U))] − E[log (1/q(U))]    (2.12)
             = E[log (q(U)/p(U))]    (2.13)
By Jensen's inequality, E[log (q(U)/p(U))] ≤ log E[q(U)/p(U)], so
H(U) − H_q(U) ≤ log E[q(U)/p(U)]    (2.14)
             = log Σ_{u∈U} p(u) · q(u)/p(u)    (2.15)
             = log Σ_{u∈U} q(u)    (2.16)
             = log 1    (2.17)
             = 0    (2.18)
Note that property 3 is equivalent to saying that the relative entropy is always greater than
or equal to 0, with equality iff q = p (convince yourself).
4. If X1 , X2 , . . . , Xn are independent random variables, then
H(X₁, X₂, . . . , Xₙ) = Σ_{i=1}^{n} H(Xᵢ)    (2.20)
Proof:
H(X₁, X₂, . . . , Xₙ) = E[log (1/p(X₁, X₂, . . . , Xₙ))]    (2.21)
 = E[−log p(X₁, X₂, . . . , Xₙ)]    (2.22)
 = E[−log p(X₁) p(X₂) · · · p(Xₙ)]    (2.23)
 = E[−Σ_{i=1}^{n} log p(Xᵢ)]    (2.24)
 = Σ_{i=1}^{n} E[−log p(Xᵢ)]    (2.25)
 = Σ_{i=1}^{n} H(Xᵢ).    (2.26)
Therefore, the entropy of independent random variables is the sum of the individual entropies.
This is also intuitive, since the uncertainty (or surprise) associated with each random variable
is independent.
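Before moving on, here is a short numerical check (added for illustration, not part of the notes) of the two properties above: the average surprise under a mismatched PMF q exceeds H(U) by exactly the relative entropy D(p‖q), and entropy is additive over independent variables.

    import math

    def H(p):
        """Entropy (in bits) of a pmf given as a list of probabilities."""
        return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

    p = [0.5, 0.25, 0.25]
    q = [0.7, 0.2, 0.1]

    # Mismatch: H_q(U) - H(U) equals the relative entropy D(p||q).
    H_q = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))
    D_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
    print(H(p), H_q, D_pq, abs((H_q - H(p)) - D_pq) < 1e-12)

    # Additivity: for independent X1, X2 the joint entropy is the sum of the entropies.
    p1, p2 = [0.5, 0.5], [0.9, 0.1]
    joint = [a * b for a in p1 for b in p2]   # product pmf of independent variables
    print(abs(H(joint) - (H(p1) + H(p2))) < 1e-12)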
H(X, Y) ≜ E[log (1/P(X, Y))]    (2.31)
        = E[log (1/(P(X) P(Y|X)))]    (2.32)
The last step follows from the non-negativity of relative entropy. Equality holds iff Px,y ≡
Px × Py , i.e. X and Y are independent.
3. Sub-additivity of entropy: H(X, Y) ≤ H(X) + H(Y), with equality iff X ⊥ Y (follows from the property that conditioning does not increase entropy).
In this chapter, we will try to understand how the distribution of n-length sequences generated by memoryless sources behaves as we increase n. We will observe that a small fraction of all the possible n-length sequences carries probability almost equal to 1. This makes the compression of n-length sequences easier, as we can then concentrate on this set.
We begin by introducing some important notation:
• For a set S, |S| denotes its cardinality (the number of elements contained in the set). For example, if U = {1, 2, . . . , M}, then |U| = M.
• u^n = (u₁, . . . , uₙ) denotes an n-tuple of symbols.
A sequence u^n is called ε-typical if |−(1/n) log p(u^n) − H(U)| ≤ ε, or equivalently,
2^{−n(H(U)+ε)} ≤ p(u^n) ≤ 2^{−n(H(U)−ε)}.
Let A_ε^{(n)} denote the set of all ε-typical sequences, called the typical set.
So a length-n typical sequence assumes a probability approximately equal to 2^{−nH(U)}. Note that this applies to memoryless sources, which will be the focus of this course¹.
Theorem 13 (AEP). ∀ε > 0, P(U^n ∈ A_ε^{(n)}) → 1 as n → ∞.
¹For a different definition of typicality, see e.g. [1]. For treatment of non-memoryless sources, see e.g. [2], [3].
Proof This is a direct application of the Law of Large Numbers (LLN).
P(U^n ∈ A_ε^{(n)}) = P( |−(1/n) log p(U^n) − H(U)| ≤ ε )
 = P( |−(1/n) log Π_{i=1}^{n} p(Uᵢ) − H(U)| ≤ ε )
 = P( |(1/n) Σ_{i=1}^{n} [−log p(Uᵢ)] − H(U)| ≤ ε )
 → 1 as n → ∞,
where the last step is due to the Law of Large Numbers (LLN), in which − log p(Ui )’s are i.i.d. and
hence their arithmetic average converges to their expectation H(U ).
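The concentration promised by the AEP is easy to see empirically. The following sketch (illustrative, not from the notes; it reuses the three-symbol source from Chapter 1) simply prints −(1/n) log₂ p(U^n) for growing n:

    import math
    import random

    alphabet = ["a", "b", "c"]
    probs = [0.7, 0.15, 0.15]
    H = sum(p * math.log2(1 / p) for p in probs)

    rng = random.Random(0)
    for n in [10, 100, 1000, 10000]:
        u = rng.choices(alphabet, weights=probs, k=n)
        neg_log_prob = -sum(math.log2(probs[alphabet.index(s)]) for s in u)
        print(f"n={n:6d}  -(1/n) log2 p(U^n) = {neg_log_prob / n:.4f}   H(U) = {H:.4f}")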
This theorem tells us that with very high probability, we will generate a typical sequence. But how large is the typical set A_ε^{(n)}?
Theorem 14. ∀ε > 0 and sufficiently large n,
(1 − ε) 2^{n(H(U)−ε)} ≤ |A_ε^{(n)}| ≤ 2^{n(H(U)+ε)}.
which gives the upper bound. For the lower bound, by the AEP theorem, for any ε > 0, there exists sufficiently large n such that
1 − ε ≤ P(U^n ∈ A_ε^{(n)}) = Σ_{u^n ∈ A_ε^{(n)}} p(u^n) ≤ Σ_{u^n ∈ A_ε^{(n)}} 2^{−n(H(U)−ε)} = |A_ε^{(n)}| 2^{−n(H(U)−ε)}.
The intuition is that since all typical sequences assume a probability of about 2^{−nH(U)} and their total probability is almost 1, the size of the typical set has to be approximately 2^{nH(U)}. Although A_ε^{(n)} grows exponentially with n, notice that it is a relatively small set compared to U^n. For some ε > 0, we have
|A_ε^{(n)}| / |U^n| ≤ 2^{n(H(U)+ε)} / 2^{n log |U|} = 2^{−n(log |U| − H(U) − ε)} → 0 as n → ∞,
given that H(U ) < log |U| (with strict inequality!), i.e., the fraction that the typical set takes up
in the set of all sequences vanishes exponentially. Note that H(U ) = log |U| only if the source is
uniformly distributed, in which case all the possible sequences are typical.
To summarize: u^n ∈ A_ε^{(n)} ⇔ p(u^n) ≈ 2^{−nH(U)}, and within the set U^n of all sequences, the typical set has size |A_ε^{(n)}| ≈ 2^{nH(U)}.
In the context of lossless compression of the source U , the AEP tells us that we may only focus
on the typical set, and we would need about nH(U ) bits, or H(U ) bits per symbol, for a good
representation of the typical sequences.
We know that the set A_ε^{(n)} has probability approaching 1 as n increases. However, is it the smallest such set? The next theorem gives a definitive answer to this question.
Theorem 15. For all δ > 0 and all sequences of sets B^{(n)} ⊆ U^n such that |B^{(n)}| ≤ 2^{n[H(U)−δ]},
lim_{n→∞} P(U^n ∈ B^{(n)}) = 0    (3.1)
We can justify the theorem in the following way. As n increases, |B^{(n)} ∩ A_ε^{(n)}| ≲ 2^{−nδ} |A_ε^{(n)}|. Since every typical sequence has probability ≈ 2^{−nH(U)}, which is (roughly) the same for every typical sequence, the probability captured by B^{(n)} is roughly a fraction 2^{−nδ} of the total, so P(U^n ∈ B^{(n)}) → 0.
We will next look at a simple application of the AEP for the compression of symbols generated
by a discrete memoryless source.
Definition 16 (Achievable rate). R is an achievable rate if for all ε > 0, there exists a scheme (n, m, compressor, decompressor) whose rate m/n ≤ R and whose probability of error satisfies P_e < ε.
We are interested in the question: What is the lowest achievable rate? Theorems 17 and 18 tell
us the answer.
Proof Fix R > H(U ) and ε > 0. Set δ = R − H(U ) > 0 and note that for all n sufficiently large,
by Theorem 13,
P(U^n ∉ A_δ^{(n)}) < ε,    (3.3)
and by Theorem 14,
|A_δ^{(n)}| ≤ 2^{n[H(U)+δ]} = 2^{nR}.    (3.4)
Consider a scheme that enumerates the sequences in A_δ^{(n)}. That is, the compressor outputs a binary representation of the index of U^n if U^n ∈ A_δ^{(n)}; otherwise, it outputs (0, 0, . . . , 0). The decompressor maps this binary representation back to the corresponding sequence in A_δ^{(n)}. For this scheme, the probability of error is bounded by
P_e ≤ P(U^n ∉ A_δ^{(n)}) < ε    (3.5)
P_e ≥ P(U^n ∉ B^{(n)}) → 1, as n → ∞    (3.7)
Hence, increasing n cannot make the probability of error arbitrarily small. Furthermore, there is
clearly a nonzero probability of error for any finite n, so R is not achievable. Conceptually, if the
rate is too small, it can’t represent a large enough set.
[1] A. El Gamal and Y.-H. Kim, Network Information Theory, Cambridge Univ. Press, UK, 2012.
[2] R. M. Gray, Entropy and Information Theory, Springer-Verlag, New York, 1990.
Chapter 4
Lossless Compression
Let us start with a simple code. Let l(u) represent the length of a binary codeword representing u, u ∈ U. We can then write l̄ = E[l(U)] = Σ_{u∈U} p(u) l(u), where l̄ is the expected length of a codeword.
Example 19. Let U = {a, b, c, d} and let us try to come up with a simple code for this alphabet.
This code satisfies the prefix condition since no codeword is the prefix for another codeword. It
also looks like the expected code length is equal to the entropy. Is the entropy the limit for variable
length coding? Can we do better? Let us try a better code.
Figure 4.2: Better code with regard to l(u)
Definition 20. A code is uniquely decodable (UD) if every sequence of source symbols is mapped to a distinct binary codeword sequence.
Definition 21.
Prefix Condition: When no codeword is the prefix of any other.
Prefix Code: A code satisfying the prefix condition.
Codes that satisfy the prefix condition are decodable on the fly. Codes that do not satisfy the
prefix condition can also be uniquely decodable, but they are less useful.
1 = Σ_{u∈U} p(u) = Σ_{u∈U} 2^{−n_u}
  = Σ_{n=1}^{n_max} (# of symbols u with n_u = n) · 2^{−n}
⇒ 2^{n_max} = Σ_{n=1}^{n_max} (# of symbols u with n_u = n) · 2^{n_max−n}
  = Σ_{n=1}^{n_max−1} (# of symbols u with n_u = n) · 2^{n_max−n} + (# of symbols u with n_u = n_max)    (4.1)
Now with this lemma in hand, we can prove our claim that we can find a UD code for a dyadic
distribution. Consider the following procedure:
• Choose 2 symbols with nu = nmax and merge them into one symbol with (twice the) proba-
bility 2−nmax +1
Also note that the symbol with p(u) = 2−nu has distance nu from root. This means that the
induced prefix code satisfies l(u) = nu = − log p(u)
Here, we take the ceiling of −log p(u) as the length of the codeword for the source symbol u. Because the ceiling of a number is always within 1 of the number itself, the expected code length l̄ = E[⌈−log p(U)⌉] is within 1 of H(U).
Let us consider the "PMF" p*(u) = 2^{−n_u*}, where n_u* = ⌈−log p(u)⌉. This "PMF" is a dyadic distribution because all probabilities are powers of 2. We put PMF in quotes because for a non-dyadic source, Σ_{u∈U} p*(u) is less than 1, and so p* is not a true PMF. Indeed,
Σ_{u∈U} p*(u) = Σ_{u∈U} 2^{−n_u*} = Σ_{u∈U} 2^{−⌈−log p(u)⌉} ≤ Σ_{u∈U} 2^{log p(u)} = Σ_{u∈U} p(u) = 1.
We can thus complete p* to a dyadic distribution over an enlarged alphabet U* ⊇ U and construct a prefix code for all u ∈ U* using the binary tree principle above, keeping only the codewords of the source symbols u in the original alphabet U. The length of each such codeword satisfies
l(u) = n_u* = ⌈−log p(u)⌉ < −log p(u) + 1.
The expected code length for a Shannon code can be bounded as follows:
l̄ = Σ_{u∈U} p(u) ⌈log (1/p(u))⌉ ≤ Σ_{u∈U} p(u) (log (1/p(u)) + 1) = H(U) + 1.
Therefore, the expected code length is always less than or equal to the entropy plus 1. This result could be good or bad depending on how large H(U) is to start with. If the extra "1" is too much, we can instead construct a Shannon code for blocks of symbols u^n = (u₁, u₂, . . . , uₙ), where the uᵢ come from a memoryless source. Then
l̄ₙ ≤ H(U^n) + 1,   or   (1/n) l̄ₙ ≤ (1/n) H(U^n) + 1/n = H(U) + 1/n.
Now we can make the per-symbol expected length arbitrarily close to the entropy. In the end, there is a trade-off between ideal code length and memory, since the code map is essentially a lookup table. If n gets too large, the exponential increase in lookup table size becomes a problem.
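As an illustration (not from the notes), the snippet below computes the Shannon-code lengths ⌈log(1/p(u))⌉ for a hypothetical four-symbol source and checks both the bound H(U) ≤ l̄ < H(U) + 1 and the Kraft sum discussed next:

    import math

    def shannon_code_lengths(p):
        """Codeword lengths l(u) = ceil(log2(1/p(u))) of a Shannon code."""
        return {u: math.ceil(math.log2(1.0 / pu)) for u, pu in p.items()}

    p = {"a": 0.7, "b": 0.15, "c": 0.1, "d": 0.05}   # hypothetical source
    lengths = shannon_code_lengths(p)

    H = sum(pu * math.log2(1 / pu) for pu in p.values())
    avg_len = sum(p[u] * l for u, l in lengths.items())
    kraft = sum(2.0 ** -l for l in lengths.values())

    print("lengths   :", lengths)
    print("H(U)      :", round(H, 4))
    print("avg length:", round(avg_len, 4))   # satisfies H(U) <= avg < H(U) + 1
    print("Kraft sum :", kraft)               # <= 1, so a prefix code with these lengths exists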
Kraft-McMillan Inequality: the length function ℓ(·) of any uniquely decodable code satisfies
Σ_{u∈U} 2^{−ℓ(u)} ≤ 1.    (4.3)
Conversely, any integer-valued function satisfying (4.3) is the length function of some UD code.
To see the “conversely" statement, note that we know how to generate a UD code (in fact, a
prefix code) with length function satisfying (4.3), using Huffman Codes. Here, we prove the first
claim of the Kraft-McMillan Inequality.
(Σ_{u∈U} 2^{−ℓ(u)})^k = Σ_{(u₁,...,u_k)} 2^{−[ℓ(u₁)+···+ℓ(u_k)]}
 = Σ_{u^k} 2^{−ℓ(u^k)}
 = Σ_{i=1}^{k·ℓ_max} |{u^k : ℓ(u^k) = i}| · 2^{−i}
 ≤ Σ_{i=1}^{k·ℓ_max} 2^{i} · 2^{−i}
 = k · ℓ_max
Note that the inequality in the second-to-last line holds because the code is one-to-one on sequences, so there can be at most 2^i sequences u^k whose codewords have length i. Finally, we can see the theorem through the following inequality:
Σ_{u∈U} 2^{−ℓ(u)} ≤ (k · ℓ_max)^{1/k} → 1  as k → ∞.
Now, we can prove the important theorem relating UD codes to the entropy: for any uniquely decodable code, the expected length satisfies l̄ ≥ H(U).
Proof
Thus, D(p‖q) can be thought of as the "cost of mismatch" in designing a code for a distribution q when the actual distribution is p.
1. Find 2 symbols with the smallest probability and then merge them to create a new “node"
and treat it as a new symbol.
2. Then merge the next 2 symbols with the smallest probability to create a new “node"
3. Repeat steps 1 and 2 until there is only 1 symbol left. At this point, we created a binary tree.
The paths traversed from the root to the leaves are the prefix codes.
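Here is a compact sketch of the merge procedure above (an illustrative implementation, not the notes' own code), using a heap to repeatedly pull out the two least likely nodes; ties between equal probabilities are broken arbitrarily, which does not affect the optimal expected length.

    import heapq
    import math

    def huffman_code(p):
        """Build a binary Huffman code for pmf p (dict symbol -> probability)."""
        # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
        heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # two least likely "nodes"
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, count, merged))
            count += 1
        return heap[0][2]

    p = {"a": 0.7, "b": 0.15, "c": 0.1, "d": 0.05}   # hypothetical source
    code = huffman_code(p)
    H = sum(prob * math.log2(1 / prob) for prob in p.values())
    L = sum(p[s] * len(w) for s, w in code.items())
    print(code)   # codeword lengths here: a -> 1, b -> 2, c -> 3, d -> 3 bits
    print("H(U) =", round(H, 3), " expected length =", round(L, 3))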
We will next try to understand why Huffman codes are the optimal prefix codes.
and c_{r−1}(u_{r−1}), ℓ_{r−1}(u_{r−1}) are the codeword and length of u_{r−1} ∈ U_{r−1}, respectively. Again, l̄_{r−1} = E[ℓ_{r−1}(U_{r−1})].
"Splitting a prefix code c_{r−1}": creating a prefix code for U by
c(i) = c_{r−1}(i),   1 ≤ i ≤ r − 2
c(r − 1) = c_{r−1}(r − 1) 0
c(r) = c_{r−1}(r − 1) 1
We will use the following lemma to justify the optimality of Huffman Codes. Intuitively, we
will show that if we start with an optimal code on r − 1 symbols, splitting gives us an optimal code
over r symbols. We can use an inductive argument, starting with a binary object to prove that
Huffman Codes are optimal for alphabets with any number of symbols.
Lemma 28. Let copt,r−1 be an optimal prefix code for Ur−1 . Let c be the code obtained from copt,r−1
by splitting. Then c is an optimal prefix code for U .
1. ℓ(1) ≤ ℓ(2) ≤ . . . ≤ ℓ(r). Otherwise, we could rearrange the codewords to satisfy this property, and the result would be at least as good due to the ordering we have assumed on the probabilities (p(1) ≥ p(2) ≥ . . . ≥ p(r)).
Chapter 5
X^n −→ [ noisy channel P_{Y^n|X^n} ] −→ Y^n
Our goal, then, is to find an encoder e and a decoder d such that we can take an input of m
bits (called B m ), encode it using e to get an encoding of length n denoted X n , pass the encoding
through the channel to get Y n , and then decode Y n using d to recover an estimate of the original
string of bits, denoted B̂ m . Pictorially, that is:
(B₁, B₂, . . . , B_m) = B^m −→ encoder (e) −→ X^n −→ [ noisy channel P_{Y^n|X^n} ] −→ Y^n −→ decoder (d) −→ B̂^m = (B̂₁, B̂₂, . . . , B̂_m)
Ideally, n will be small relative to m (we will not have to send a lot of symbols through the noisy
channel), and the probability that the received message B̂ m does not match the original message
B m will be low. We will now make rigorous these intuitively good properties.
Definition 29. We define a scheme to be a pair of encoder and decoder, denoted (e, d).
Note that the definition of a scheme does not include the noisy channel itself. We must take the
channel as it is given to us: we cannot modify it, but we can choose what symbols X n to transmit
through it.
Definition 30. We define the rate of a scheme, denoted R, to be the number of bits communicated
per use of the channel. This is equal to m/n in the notation of the diagram above.
Definition 31. We define the probability of error for a scheme, denoted Pe , to be the probability
that the output of the decoder does not exactly match the input of the encoder. That is,
P_e = P(B^m ≠ B̂^m).
Definition 32. For a given channel, we say that a rate R is achievable if there exists a sequence
of schemes (e1 , d1 ), (e2 , d2 ), . . ., such that:
1. For all n = 1, 2, . . ., scheme (en , dn ) has rate at least R, and
2. lim_{n→∞} P_e^{(n)} = 0,
where P_e^{(n)} denotes the probability of error of the nth scheme.
X ∼ Px −→ PY |X −→ random Y
With this single letter channel, we now examine I(X; Y ). What distribution of X will maximize
I(X; Y ) over all possible channel inputs?
Define C^{(I)} ≜ max_{P_X} I(X; Y). The channel coding theorem states:
Direct Theorem: If R < C^{(I)}, then the rate R is achievable.
Converse Theorem: If R > C^{(I)}, then R is not achievable.
The direct part and the converse part of the proof are given at the end of this chapter.
Example: the binary symmetric channel (BSC). Let X = Y = {0, 1}.
⇐⇒ channel matrix:
        0        1
  0   1 − p      p
  1     p      1 − p
⇐⇒ bipartite graph: input 0 goes to output 0 with probability 1 − p and to output 1 with probability p; input 1 goes to output 1 with probability 1 − p and to output 0 with probability p.
⇐⇒ Y = X ⊕₂ Z, where Z ∼ Ber(p) is independent of X.
Since I(X; Y) = H(Y) − H(Y|X) = H(Y) − h₂(p) ≤ 1 − h₂(p), to achieve equality we need H(Y) = 1, i.e. Y is Bernoulli(1/2). Taking X ∼ Ber(1/2) produces this desired Y and therefore gives I(X; Y) = 1 − h₂(p)
⇒ C = 1 − h₂(p).
For the binary erasure channel with erasure probability α (each input is received correctly with probability 1 − α and replaced by an erasure symbol e with probability α), the maximal mutual information is
I(X; Y) = 1 − α  ⇒  C = 1 − α,
so (1 − α)n bits of information can be communicated reliably over n channel uses.
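A small numerical sketch (added for illustration, not from the notes) of the BSC capacity computation: it evaluates I(X; Y) = h₂(π(1 − p) + (1 − π)p) − h₂(p) over a grid of input biases π and confirms that the maximum, attained at π = 1/2, equals 1 − h₂(p).

    import math

    def h2(x):
        """Binary entropy in bits."""
        return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

    def bsc_mutual_information(pi, p):
        """I(X;Y) for a BSC(p) with input X ~ Ber(pi): I = H(Y) - H(Y|X)."""
        py1 = pi * (1 - p) + (1 - pi) * p
        return h2(py1) - h2(p)

    p = 0.1
    best = max((bsc_mutual_information(pi / 1000, p), pi / 1000) for pi in range(1001))
    print("max_pi I(X;Y) =", round(best[0], 4), "at pi =", best[1])
    print("1 - h2(p)     =", round(1 - h2(p), 4))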
5.2.2 Recap
J ⇐⇒ B₁, B₂, . . . , B_m i.i.d. ∼ Ber(1/2) −→ Encoder/Transmitter −→ X^n −→ Noisy Memoryless Channel −→ Y^n −→ Decoder/Receiver −→ Ĵ ⇐⇒ B̂₁, B̂₂, . . . , B̂_m
• Rate = m/n bits per channel use
• P_e = P(Ĵ ≠ J)
• R is achievable if ∀ε > 0, ∃ a scheme (m, n, encoder, decoder) with m/n ≥ R and P_e < ε.
Note: The Channel Coding Theorem is equally valid for analog signals, e.g., the AWGN chan-
nel. However, we must extend our definition of the various information measures such as entropy,
mutual information, etc.
Next we extend the information measures to continuous random variables, and analyze the AWGN channel.
2. For X, Y ∼ f_{X,Y}, the joint differential entropy is h(X, Y) ≜ E[log (1/f_{X,Y}(X, Y))].
Note: Unlike discrete entropy H(X), differential entropy can be positive or negative, and this is not the only way in which they differ. Roughly, the more concentrated (e.g. the more strongly correlated, jointly) the random variables are, the more negative the differential entropy can become.
H(X_Δ) = Σᵢ Pᵢ log (1/Pᵢ)
 ≈ Σᵢ f(iΔ) · Δ · log (1/(Δ f(iΔ)))
 = log (1/Δ) + Σᵢ f(iΔ) log (1/f(iΔ)) Δ
so that
H(X_Δ) − log (1/Δ) = Σᵢ f(iΔ) log (1/f(iΔ)) Δ  −→(Δ→0)  ∫ f(x) log (1/f(x)) dx = h(X).
Proof of Claim:
f_G(x) = (1/√(2πσ²)) e^{−x²/(2σ²)},   so   −log f_G(X) = log √(2πσ²) + X²/(2σ² ln 2).
0 ≤ D(f_X ‖ f_G) = E[log (f_X(X)/f_G(X))]
 = −h(X) + E[log (1/f_G(X))]
 = −h(X) + E[log √(2πσ²) + X²/(2σ² ln 2)]
 ≤ −h(X) + E[log √(2πσ²) + G²/(2σ² ln 2)]   (using E[X²] ≤ σ² = E[G²])
 = −h(X) + E[log (1/f_G(G))]
 = −h(X) + h(G)
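For a concrete check of the claim (added here for illustration), the snippet below compares the closed-form differential entropies, in bits, of a Gaussian, a Laplace, and a uniform density, all scaled to the same variance σ²; the Gaussian is the largest.

    import math

    sigma2 = 1.0

    h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma2)
    # Uniform on [-a, a] has variance a^2/3, so a = sqrt(3*sigma^2); h = log2(2a).
    h_uniform = math.log2(2 * math.sqrt(3 * sigma2))
    # Laplace with scale b has variance 2b^2, so b = sqrt(sigma^2/2); h = log2(2*e*b).
    h_laplace = math.log2(2 * math.e * math.sqrt(sigma2 / 2))

    print("Gaussian:", round(h_gauss, 4))    # ~ 2.047
    print("Laplace :", round(h_laplace, 4))  # ~ 1.943
    print("Uniform :", round(h_uniform, 4))  # ~ 1.792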
Channel: Xᵢ −→ ⊕ −→ Yᵢ (additive noise).
• R is achievable with power P if ∀ε > 0, ∃ a scheme restricted to power P with rate m/n ≥ R and probability of error P_e < ε.
Note: We could instead have considered the relaxed restriction E[(1/n) Σ_{i=1}^{n} Xᵢ²] ≤ P. However, it turns out that even with the relaxation, you cannot perform any better in terms of the fundamental limit.
Single-letter picture:  X (with E[X²] ≤ P) −→ ⊕ −→ Y.
See Figure 7.2 for the geometric interpretation of this problem. We want the high-probability output balls not to intersect, so that we can uniquely identify the input sequence associated with any given output sequence.
# messages ≤ Vol(n-dim ball of radius √(n(P + σ²))) / Vol(n-dim ball of radius √(nσ²))
This inequality is due to inefficiencies in the packing ratio; equality corresponds to perfect packing, i.e. no dead zones. So,
# of messages ≤ K_n (√(n(P + σ²)))^n / (K_n (√(nσ²))^n) = (1 + P/σ²)^{n/2}
⇒ rate = (log # of messages)/n ≤ (1/2) log (1 + P/σ²).
The achievability of (near) equality indicates that in high dimension we can pack the balls very effectively.
1. lim_{n→∞} P((X^n, Y^n) ∈ A_ε^{(n)}(X, Y)) = 1    (5.5)
By the AEP, we have that X^n is typical, Y^n is typical, and (X^n, Y^n) is jointly typical, all with probability tending to 1.
2. ∀ε > 0, ∃ n₀ ∈ N such that ∀n > n₀,
(1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε^{(n)}(X, Y)| ≤ 2^{n(H(X,Y)+ε)}    (5.6)
Theorem 44. If (X̃^n, Ỹ^n) are formed by i.i.d. (X̃ᵢ, Ỹᵢ) ∼ (X̃, Ỹ), where P_{X̃,Ỹ} = P_X · P_Y, then ∀ε > 0, ∃ n₀ ∈ N such that ∀n > n₀,
(1 − ε) 2^{−n(I(X;Y)+3ε)} ≤ P((X̃^n, Ỹ^n) ∈ A_ε^{(n)}(X, Y)) ≤ 2^{−n(I(X;Y)−3ε)}    (5.7)
Intuition:
|A_ε^{(n)}(X̃, Ỹ)| ≈ 2^{nH(X̃,Ỹ)}    (5.8)
 = 2^{n(H(X)+H(Y))}    (5.9)
 = 2^{nH(X)} · 2^{nH(Y)}    (5.10)
 ≈ |A_ε^{(n)}(X)| · |A_ε^{(n)}(Y)|    (5.11)
Note that (X̃^n, Ỹ^n) is distributed (approximately uniformly) within a set of size |A_ε^{(n)}(X)| · |A_ε^{(n)}(Y)|, so
P((X̃^n, Ỹ^n) ∈ A_ε^{(n)}(X, Y)) ≈ |A_ε^{(n)}(X, Y)| / (|A_ε^{(n)}(X)| · |A_ε^{(n)}(Y)|)    (5.12)
 ≈ 2^{nH(X,Y)} / (2^{nH(X)} · 2^{nH(Y)})    (5.13)
 = 2^{−nI(X;Y)}    (5.14)
Decoder: Ĵ(·) : Y^n → {1, 2, . . . , M}
Theorem 45. (Direct Theorem) If R < maxPX I(X; Y ), then R is achievable. Equivalently, if
∃ PX s.t. R < I(X; Y ), then R is achievable.
Proof
Fix PX and a rate R < I(X; Y ). Choose = (I(X; Y )−R)/4. This means that R < I(X; Y )−3.
Generate codebook Cn of size M = d2nR e.
X n (k) are i.i.d. with distribution PX , ∀ k = 1, 2, . . . , M . Then
(
ˆ n) = j if (X n (j), Y n ) ∈ An (X, Y ) and (X n (k), Y n ) ∈
/ An (X, Y ), ∀j 6= k
J(Y (5.21)
error otherwise
This follows because, by symmetry, P(Ĵ ≠ J | J = i) = P(Ĵ ≠ J | J = j) for all i, j, and P(J = i) = 1/M for all i.
Since R < I(X; Y) − 3ε, the expression tends to zero as n tends to infinity.
A weaker version of Fano’s Inequality uses the facts that h(Pe ) ≤ 1 and log(|X | − 1) ≤ log(|X |):
H(X|Y ) ≤ 1 + Pe log(|X |) (5.35)
or equivalently,
P_e ≥ (H(X|Y) − 1) / log |X|    (5.36)
(Figure: the binary erasure channel. Input 0 is received as 0 with probability 1 − α and as the erasure symbol e with probability α; input 1 is received as 1 with probability 1 − α and as e with probability α.)
1. LDPC Codes: "Low Density Parity Check Codes", Gallager 1963 Thesis [3].
A more stringent notion of reliability is the maximal probability of error P_max, which is defined as P_max ≜ max_{1≤j≤M} P(Ĵ ≠ j | J = j).
It turns out that our results, i.e., direct and converse theorems, are still valid for this more stringent
notion of reliability. The converse theorem is clear. If arbitrarily small Pe cannot be achieved,
arbitrarily small Pmax cannot be achieved either, therefore the converse theorem holds for Pmax .
We now show that the result of the direct proof holds for vanishing Pmax . Note that with application
of the Markov inequality, we have:
|{1 ≤ j ≤ M : P(Ĵ ≠ j | J = j) ≤ 2P_e}| ≥ M/2    (5.48)
Given C_n with |C_n| = M and error probability P_e, there therefore exists C_n′ with |C_n′| = M/2 and P_max ≤ 2P_e: one constructs C_n′ by extracting the better half of C_n. The rate of C_n′ is
Rate of C_n′ ≥ log(M/2)/n = (log M)/n − 1/n.    (5.49)
This implies that if there exist schemes of rate ≥ R with P_e → 0, then for any ε > 0 there exist schemes of rate ≥ R − ε with P_max → 0.
[1] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge Uni-
versity Press, UK, 2003.
[2] S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 6,
pp. 721–741, 1984.
[3] R. G. Gallager, Low-Density Parity-Check Codes, Cambridge, MA: MIT Press, 1963.
[4] E. Arikan, “Channel polarization: A method for constructing capacity achieving codes for
symmetric binary-input memoryless channels,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp.
3051–3073, July 2009.
Chapter 6
Method of Types
For additional material on the Method of Types, we refer the reader to [1] (Section 11.1).
Definition 49. The empirical distribution or type of x^n is the vector (P_{x^n}(1), P_{x^n}(2), . . . , P_{x^n}(r)) of relative frequencies P_{x^n}(a) = N(a|x^n)/n, where N(a|x^n) = Σ_{i=1}^{n} 1{xᵢ = a}.
Definition 50. Pn denotes the collection of all empirical distributions of sequences of length n.
For a binary alphabet, for example,
P_n = { (0, 1), (1/n, (n−1)/n), (2/n, (n−2)/n), . . . , (1, 0) }.
Definition 52. If P ∈ P_n (i.e. its probabilities are integer multiples of 1/n), the type class or type of P is T(P) = {x^n : P_{x^n} = P}. The type class of x^n is T_{x^n} = T(P_{x^n}) = {x̃^n : P_{x̃^n} = P_{x^n}}.
For example, for a sequence x^n of length 5 containing one symbol three times and two other symbols once each,
|T_{x^n}| = (5 choose 3, 1, 1) = 5!/(3! 1! 1!) = 20.
Theorem 54. |P_n| ≤ (n + 1)^{r−1}
Proof Type of xn is determined by (N (1|xn ) , N (2|xn ) , . . . , N (r|xn )). Each component can
assume no more than n + 1 values (0 ≤ N (i|xn ) ≤ n) (and the last component is dictated by the
others).
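The counting statements above are easy to verify by brute force for small parameters. The following sketch (illustrative, not from the notes; n = 5 and r = 3 are arbitrary choices) enumerates all sequences, groups them by type, and checks |P_n| against the (n + 1)^{r−1} bound and each type class size against the multinomial coefficient:

    from itertools import product
    from math import factorial

    n, r = 5, 3
    alphabet = list(range(r))

    type_classes = {}
    for xn in product(alphabet, repeat=n):
        counts = tuple(xn.count(a) for a in alphabet)   # (N(0|x^n), ..., N(r-1|x^n))
        type_classes[counts] = type_classes.get(counts, 0) + 1

    print("|P_n| =", len(type_classes), "  bound (n+1)^(r-1) =", (n + 1) ** (r - 1))
    # Each type class size equals the multinomial coefficient n!/prod(N(a)!).
    for counts, size in sorted(type_classes.items())[:3]:
        multinomial = factorial(n)
        for c in counts:
            multinomial //= factorial(c)
        print(counts, "|T(P)| =", size, "= multinomial", multinomial)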
Notation: for a pmf Q on X, Q^n(x^n) ≜ Π_{i=1}^{n} Q(xᵢ) denotes the probability of x^n under i.i.d. sampling from Q. The probability of a sequence depends on the sequence only through its type:
Q^n(x^n) = 2^{−n[H(P_{x^n}) + D(P_{x^n}‖Q)]}.
Proof
Q^n(x^n) = Π_{i=1}^{n} Q(xᵢ)
 = 2^{Σ_{i=1}^{n} log Q(xᵢ)}
 = 2^{Σ_{a∈X} N(a|x^n) log Q(a)}
 = 2^{n Σ_{a∈X} (N(a|x^n)/n) log Q(a)}
 = 2^{−n Σ_{a∈X} (N(a|x^n)/n) log (1/Q(a))}
 = 2^{−n Σ_{a∈X} P_{x^n}(a) log (1/Q(a))}
 = 2^{−n [ Σ_{a∈X} P_{x^n}(a) log (1/P_{x^n}(a)) + Σ_{a∈X} P_{x^n}(a) log (P_{x^n}(a)/Q(a)) ]}
 = 2^{−n[H(P_{x^n}) + D(P_{x^n}‖Q)]}.
Theorem 57. ∀P ∈ P_n,
(1/(n + 1)^{r−1}) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}.
Note: |T(P)| = (n choose nP(1), nP(2), . . . , nP(r)) = n! / Π_{a∈X} (nP(a))!.
Proof
P^n(T(P)) / P^n(T(Q)) = [ |T(P)| Π_{a∈X} P(a)^{nP(a)} ] / [ |T(Q)| Π_{a∈X} P(a)^{nQ(a)} ]
 = [ (n choose nP(1), . . . , nP(r)) / (n choose nQ(1), . . . , nQ(r)) ] · Π_{a∈X} P(a)^{n[P(a)−Q(a)]}
 = Π_{a∈X} [ (nQ(a))! / (nP(a))! ] · P(a)^{n[P(a)−Q(a)]}
Note: m!/n! ≥ n^{m−n}.
If m > n, then m!/n! = m(m − 1) · · · (n + 1) ≥ n^{m−n}.
If n > m, then m!/n! = 1/(n(n − 1) · · · (m + 1)) ≥ (1/n)^{n−m} = n^{m−n}.
Therefore,
Π_{a∈X} [ (nQ(a))! / (nP(a))! ] · P(a)^{n[P(a)−Q(a)]} ≥ Π_{a∈X} (nP(a))^{n[Q(a)−P(a)]} P(a)^{n[P(a)−Q(a)]}
 = Π_{a∈X} n^{n[Q(a)−P(a)]}
 = n^{n Σ_{a∈X} [Q(a)−P(a)]}
 = n⁰ = 1,
so P maximizes P^n(T(Q)) over all types Q ∈ P_n. Hence
1 = Σ_{Q∈P_n} P^n(T(Q)) ≤ |P_n| max_{Q∈P_n} P^n(T(Q))
 = |P_n| P^n(T(P))
 = |P_n| |T(P)| 2^{−n[H(P)+D(P‖P)]}
 ≤ (n + 1)^{r−1} |T(P)| 2^{−nH(P)},
which gives the lower bound on |T(P)|.
Note: We will write αₙ ≐ βₙ ("equality to first order in the exponent")
 ⇐⇒ (1/n) log (αₙ/βₙ) −→ 0 as n → ∞
 ⇐⇒ |(1/n) log αₙ − (1/n) log βₙ| −→ 0 as n → ∞.
E.g., αₙ ≐ 2^{nJ}  ⇐⇒  αₙ = 2^{n(J+εₙ)} where εₙ → 0 as n → ∞.
6.1.1 Recap on Types
T(P) = {x^n : P_{x^n} = P},   |P_n| ≤ (n + 1)^{|X|}.
6.2 A Version of Sanov's Theorem
Theorem. Let X₁, . . . , Xₙ be i.i.d. ∼ Q and let f : X → R. Then
(1/(n + 1)^{|X|}) 2^{−n min D(P‖Q)} ≤ P( (1/n) Σ_{i=1}^{n} f(Xᵢ) ≥ α ) ≤ (n + 1)^{|X|} 2^{−n min D(P‖Q)},
where the min is over the set {P : P ∈ P_n, ⟨P, f⟩ ≥ α}.
Proof First observe that we can write
(1/n) Σ_{i=1}^{n} f(xᵢ) = (1/n) Σ_{a∈X} N(a|x^n) f(a) = Σ_{a∈X} P_{x^n}(a) f(a) = ⟨P_{x^n}, f⟩,
where ⟨·, ·⟩ is the Euclidean inner product, ⟨a, b⟩ := Σᵢ aᵢ bᵢ, applied to vectors indexed by X. Then, by the Law of Large Numbers,
(1/n) Σ_{i=1}^{n} f(Xᵢ) ≈ E_{X∼Q} f(X) = Σ_{a∈X} Q(a) f(a) = ⟨Q, f⟩.
We can proceed to find the desired upper bound:
P( (1/n) Σ_{i=1}^{n} f(Xᵢ) ≥ α ) = P( ⟨P_{X^n}, f⟩ ≥ α )
 = Q^n( ∪_{P∈P_n, ⟨P,f⟩≥α} T(P) )
 = Σ_{P∈P_n, ⟨P,f⟩≥α} Q^n(T(P))
 ≤ |P_n| · max Q^n(T(P))
 ≤ (n + 1)^{|X|} max 2^{−nD(P‖Q)}
 = (n + 1)^{|X|} 2^{−n min D(P‖Q)},
where the max and min are over {P ∈ P_n : ⟨P, f⟩ ≥ α}.
Now we solve for the lower bound:
P( (1/n) Σ_{i=1}^{n} f(Xᵢ) ≥ α ) ≥ max Q^n(T(P))
 ≥ max (1/(n + 1)^{|X|}) 2^{−nD(P‖Q)}
 = (1/(n + 1)^{|X|}) 2^{−n min D(P‖Q)},
with the max and min again over {P ∈ P_n : ⟨P, f⟩ ≥ α}.
Example 60. Take Xᵢ ∼ Ber(1/2). Then:
P(fraction of ones in X₁X₂ · · · Xₙ ≥ α) = P( (1/n) Σ_{i=1}^{n} Xᵢ ≥ α ) ≐ 2^{−nD*(α)},
and since
D(Ber(α) ‖ Ber(1/2)) = α log (α/(1/2)) + (1 − α) log ((1 − α)/(1/2)) = 1 − h(α),
the exponent is D*(α) = 1 − h(α) for α ≥ 1/2.
Interestingly, this exponent becomes infinite for α > 1, which makes sense: the sample mean of random variables taking values at most 1 can never exceed 1, so the event has probability zero. Furthermore, for α ≤ 1/2 the exponent is zero, since Ber(1/2) itself already satisfies the constraint ⟨P, f⟩ ≥ α; in that region there is no "cost of mismatch" and the probability does not decay exponentially.
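A quick numerical sketch (not in the notes) of this example: computing the exact binomial tail P((1/n) Σ Xᵢ ≥ α) for Ber(1/2) variables shows the empirical exponent −(1/n) log₂ P approaching 1 − h(α) as n grows (slowly, because of polynomial prefactors).

    import math

    def h(x):
        """Binary entropy in bits."""
        return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

    def exponent(n, alpha):
        """-(1/n) log2 P((1/n) sum X_i >= alpha) for X_i i.i.d. Ber(1/2), computed exactly."""
        total = sum(math.comb(n, k) for k in range(math.ceil(alpha * n), n + 1))
        return (n - math.log2(total)) / n

    alpha = 0.7
    for n in [10, 100, 1000]:
        print(f"n={n:5d}   exponent = {exponent(n, alpha):.4f}")
    print("Sanov exponent 1 - h(alpha) =", round(1 - h(alpha), 4))   # ~ 0.1187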
[1] Cover, Thomas M., and Joy A. Thomas. Elements of information theory. John Wiley & Sons,
2012.
Chapter 7
Notation:
We always denote an individual sequence of symbols by lowercase letters; for example, x^n is an individual sequence with no probability distribution assigned to it. We use capital letters for random variables, e.g. X^n i.i.d. according to some distribution. Throughout, X will denote the set of possible values a symbol can take.
Definition 61. For any sequence x^n, the empirical distribution is the probability distribution derived for the letters of the alphabet from the frequency of appearance of each letter in the sequence. More precisely:
p_{x^n}(a) = (1/n) Σ_{i=1}^{n} 1(xᵢ = a),   ∀a ∈ X    (7.1)
(Figure: the typical set as a small subset of all sequences of length n.)
Definition 63. The strongly δ-typical set, T_δ(P), is the set of all strongly δ-typical sequences. That is,
T_δ(P) = {x^n : |p_{x^n}(a) − P(a)| ≤ δ P(a), ∀a ∈ X}.
Note that this "strongly typical" set is different from the "(weakly) typical" set defined in previous chapters. This new notion is stronger, but as we will see it retains all the desirable properties of typicality, and more. The following example illustrates the difference between the strong notion and the weak notion defined earlier:
Example 65. Suppose the alphabet is X = {a, b, c} with probabilities p(a) = 0.1, p(b) = 0.8 and p(c) = 0.1. Now consider two strings of length 1000:
x^{1000}_{strong} = (100 a's, 800 b's, 100 c's)
x^{1000}_{weak} = (200 a's, 800 b's)
These two sequences have the same probability, so they are treated identically by the weak notion of typicality A_ε^{(1000)} (for some ε). But it is not hard to see that x^{1000}_{strong} is δ-strongly typical for small δ, while x^{1000}_{weak} is not, since it contains no c's at all.
We will show in the homework that the important properties of typical sets are preserved under the new strong notion. Specifically, we will show the following results (among others):
1. ∀δ > 0, there exists ε = δ H(P) such that T_δ(P) ⊆ A_ε(P) (i.e., strong typical sets are inside weak typical sets).
4. P(X^n ∈ T_δ(X)) → 1 as n → ∞.
Definition 67. A pair of sequences (xn , y n ) is said to be δ - jointly typical with respect to a pmf
PXY on X × Y if:
|Pxn ,yn (x, y) − P (x, y)| ≤ δP (x, y),
where Pxn ,yn (x, y) is the empirical distribution.
T_δ(X, Y) = {(x^n, y^n) : |P_{x^n,y^n}(x, y) − P(x, y)| ≤ δ P(x, y), ∀(x, y) ∈ X × Y}
If we look carefully, nothing is really very new. We just require that empirical distribution of
pair of sequences be δ - close to the pmf PXY .
Each jointly typical pair has probability p(x^n, y^n) ≈ 2^{−nH(X,Y)}.
In Figure 7.2 we have depicted the sequences of length n from alphabet X on the x axis and
sequences from alphabet Y on the y axis.
Figure 7.2: A useful diagram depicting typical sets, from El Gamal and Kim (Chapter 2). On the x^n axis sits T_ε^{(n)}(X) with |·| ≐ 2^{nH(X)}, and on the y^n axis T_ε^{(n)}(Y) with |·| ≐ 2^{nH(Y)}; inside their product lies the jointly typical set T_ε^{(n)}(X, Y) with |·| ≐ 2^{nH(X,Y)}, and within a column or row the conditionally typical sets T_ε^{(n)}(Y|x^n) with |·| ≐ 2^{nH(Y|X)} and T_ε^{(n)}(X|y^n) with |·| ≐ 2^{nH(X|Y)}.
Any point in this table corresponds to a pair of sequences, and we have marked the jointly typical pairs with dots. It is easy to see that if a pair is jointly typical then both of its sequences are typical as well. We will also see in the next section that the number of dots in the column corresponding to a typical x^n sequence is approximately 2^{nH(Y|X)} (a similar statement holds for rows). We can quickly check that this is consistent by a counting argument: the number of typical x^n sequences is 2^{nH(X)}, and in each column there are 2^{nH(Y|X)} jointly typical pairs, so in total there are 2^{nH(X)} · 2^{nH(Y|X)} = 2^{n(H(X)+H(Y|X))} = 2^{nH(X,Y)} jointly typical pairs, which is consistent with the fact that the total number of jointly typical pairs equals 2^{nH(X,Y)}.
Say we want to send one of M messages over a channel. We encode each m ∈ {1, . . . , M } into a
codeword X n (m). We then send the codeword over the channel, obtaining Y n . Finally, we use a
decoding rule M̂ (Y n ) which yields M̂ = m with high probability.
The conditional typicality lemma (proved in the homework) characterizes the behavior of this channel for large n: if we choose a typical input, then the output is essentially chosen uniformly at random from T_δ(Y|x^n). More precisely, for all δ′ < δ, x^n ∈ T_{δ′}(X) implies that
P(Y^n ∈ T_δ(Y|x^n)) = P((x^n, Y^n) ∈ T_δ(X, Y)) → 1.
Since each conditionally typical output set has size about 2^{nH(Y|X)} while there are about 2^{nH(Y)} typical outputs in total, the number of messages we can hope to distinguish reliably is at most
2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y)−H(Y|X))} = 2^{nI(X;Y)} ≤ 2^{nC},
where C is the channel capacity. Note that this argument does not give a construction that lets
us attain this upper bound on the communication rate. The magic of the direct part of Shannon’s
channel coding theorem is that random coding lets us attain this upper bound.
(Figure: within the set of all length-n output sequences, the typical set contains the conditionally typical sets T_δ(Y|x^n(1)) and T_δ(Y|x^n(2)) associated with two codewords x^n(1) and x^n(2).)
where ε̃(δ) → 0 as δ → 0.
Intuitive argument for the joint typicality lemma. The joint typicality lemma asserts that the probability that two independently drawn sequences x^n and y^n are jointly typical is roughly 2^{−nI(X;Y)}. Observe that there are roughly 2^{nH(X)} typical x^n sequences and 2^{nH(Y)} typical y^n sequences, while the total number of jointly typical pairs is 2^{nH(X,Y)}. Thus, the probability that two randomly chosen sequences are jointly typical is
≈ 2^{nH(X,Y)} / (2^{nH(X)} × 2^{nH(Y)}) = 2^{−nI(X;Y)}    (7.4)
Consider a source sequence U^N = (U₁, . . . , U_N), where Uᵢ ∼ U, i.i.d. A lossy compression scheme consists of:
• An encoder, i.e., a mapping from U^N to J ∈ {1, 2, . . . , M} (log M bits are used to encode a symbol sequence, where a symbol sequence is U^N and a symbol is Uᵢ).
2. Expected distortion (figure of merit) = E[d(U^N, V^N)] = E[(1/N) Σ_{i=1}^{N} d(Uᵢ, Vᵢ)] (we always specify distortion on a per-symbol basis, and then average the per-symbol distortions to arrive at d(U^N, V^N)).
There is a trade-off between rate and distortion per symbol; rate-distortion theory deals with this trade-off.
Definition 70. (R, D) is achievable if ∀ε > 0 ∃ a scheme (N, M, encoder, decoder) such that (log M)/N ≤ R + ε and E[d(U^N, V^N)] ≤ D + ε.
Definition 71. R(D) ≜ inf{R′ : (R′, D) is achievable}
Definition 72. R^{(I)}(D) ≜ min_{E[d(U,V)]≤D} I(U; V)
Theorem 73. R(D) = R^{(I)}(D).
Proof
R(D) = R^{(I)}(D)  ⇔  { Direct Part: R(D) ≤ R^{(I)}(D);  Converse Part: R(D) ≥ R^{(I)}(D) }
The proof of the direct part and the converse part are given below.
Note that R(D) is something we can’t solve for (solution space is too large!), but R(I) (D) is
something we can solve for (solution space is reasonable).
Sketch of proof (of the convexity of R(D)): We consider a "time-sharing" scheme for encoding N symbols. We encode the first αN symbols using a "good" scheme for distortion D = D₀ and encode the last (1 − α)N symbols using a "good" scheme for D = D₁. Overall, the number of bits in the compressed message is NαR(D₀) + N(1 − α)R(D₁), so the rate is αR(D₀) + (1 − α)R(D₁). Further, the expected distortion is the α-weighted average of the distortions of the two schemes, i.e. αD₀ + (1 − α)D₁. We have therefore constructed a scheme which achieves distortion αD₀ + (1 − α)D₁ with rate αR(D₀) + (1 − α)R(D₁), and the optimal scheme can only do better. That is, R(αD₀ + (1 − α)D₁) ≤ αR(D₀) + (1 − α)R(D₁), as desired.
8.2 Examples
Example 75. Consider U ∼ Ber(p), p ≤ 1/2, and Hamming distortion. That is,
d(u, v) = 0 if u = v,   d(u, v) = 1 if u ≠ v.
Claim:
R(D) = h₂(p) − h₂(D)  for 0 ≤ D ≤ p,   and   R(D) = 0  for D > p.
Proof: We will not be overly pedantic by worrying about small factors in the proof.
Note we can achieve distortion p without sending any information by setting V = 0. Therefore,
for D > p, R(D) = 0, as claimed. For the remainder of the proof, therefore, we assume D ≤ p ≤ 21 .
In the second line we have used the fact that H(U|V) = H(U ⊕ V | V), because there is a one-to-one mapping (U, V) ↔ (U ⊕ V, V). In the third line, we have used that conditioning reduces entropy, so H(U ⊕ V | V) ≤ H(U ⊕ V). Finally, in the last line we have used that h₂ is increasing on [0, 1/2] and that P(U ≠ V) ≤ D ≤ p ≤ 1/2. This establishes that R(D) ≥ h₂(p) − h₂(D).
Now we must show that equality can be achieved. The first and second inequalities above demonstrate that we get equality if and only if
1. U ⊕ V is independent of V, and
2. U ⊕ V ∼ Ber(D).
This is arranged by the test channel U = V ⊕ Z, where V ∼ Ber(q) and Z ∼ Ber(D) are independent. The parameter q is chosen so that
p = P(U = 1) = P(V = 1)P(Z = 0) + P(V = 0)P(Z = 1) = q(1 − D) + (1 − q)D.
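For illustration (not part of the notes), the snippet below evaluates R(D) = h₂(p) − h₂(D) for a Ber(0.3) source and the test-channel parameter q = (p − D)/(1 − 2D) obtained by solving the displayed equation for q:

    import math

    def h2(x):
        return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

    def rate_distortion_binary(p, D):
        """R(D) for a Ber(p) source (p <= 1/2) under Hamming distortion."""
        return h2(p) - h2(D) if D < p else 0.0

    p = 0.3
    for D in [0.0, 0.05, 0.1, 0.2, 0.3]:
        q = (p - D) / (1 - 2 * D) if D < p else 0.0   # from p = q(1-D) + (1-q)D
        print(f"D={D:.2f}  R(D)={rate_distortion_binary(p, D):.4f}  q={q:.4f}")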
For the Gaussian source with squared-error distortion, the corresponding conditions are: 1. U − V is independent of V, and 2. U − V ∼ N(0, D). This is arranged by the test channel U = V + Z, where Z ∼ N(0, D) is independent of V and U ∼ N(0, σ²) (so that V ∼ N(0, σ² − D)).
• For the ⇐ part, consider some R″ = I(U; V) + ε. By assumption, (R, D) is achievable for any R > I(U; V), implying that (R″, D) is achievable, and therefore R(D) ≤ R″ = I(U; V) + ε; since ε is arbitrary, R(D) ≤ I(U; V).
Hence we can prove this equivalent statement instead of R(D) ≤ R^{(I)}(D); that is, we show that (R, D) is achievable for fixed U, V such that E[d(U, V)] ≤ D and fixed R > I(U; V).
for sufficiently large n and some ε(δ) > 0, where lim_{δ→0} ε(δ) = 0.
Proof Take M = ⌊2^{nR}⌋. Denote by C_n = {V^n(1), V^n(2), . . . , V^n(M)} the random codebook generated by Vᵢ's i.i.d. ∼ V and independent of U. Let d(u^n, C_n) = min_{V^n ∈ C_n} d(u^n, V^n).
P(d(u^n, C_n) > D(1 + δ)) = P(d(u^n, V^n(i)) > D(1 + δ) for i = 1, 2, . . . , M)   (definition of d(u^n, C_n))
 = P(d(u^n, V^n(1)) > D(1 + δ))^M   (Vᵢ i.i.d. ∼ V)
 ≤ P(d(u^n, V^n(1)) > E[d(U, V)](1 + δ))^M   (assumption E[d(U, V)] ≤ D)
 ≤ P((u^n, V^n(1)) ∉ T_δ(U, V))^M   (inverse-negative of Lemma 78)
 = [1 − P((u^n, V^n(1)) ∈ T_δ(U, V))]^M
 ≤ [1 − 2^{−n(I(U;V)+ε(δ))}]^M   (Lemma 77, with u^n ∈ T_{δ′}(U) and large n)
 ≤ exp(−M · 2^{−n(I(U;V)+ε(δ))})   (1 − x ≤ e^{−x})
So far, we have an upper bound on P(d(u^n, C_n) > D(1 + δ)) for any u^n ∈ T_{δ′}(U) and sufficiently large n:
P(d(u^n, C_n) > D(1 + δ)) ≤ exp(−M · 2^{−n(I(U;V)+ε(δ))})    (8.1)
Then, for Uᵢ i.i.d. ∼ U,
P(d(U^n, C_n) > D(1 + δ)) = Σ_{u^n ∈ T_{δ′}(U)} P(d(u^n, C_n) > D(1 + δ), U^n = u^n) + Σ_{u^n ∉ T_{δ′}(U)} P(d(u^n, C_n) > D(1 + δ), U^n = u^n)
 ≤ Σ_{u^n ∈ T_{δ′}(U)} P(d(u^n, C_n) > D(1 + δ)) P(U^n = u^n) + P(U^n ∉ T_{δ′}(U))   (U^n independent of C_n)
 ≤ exp(−M · 2^{−n(I(U;V)+ε(δ))}) + P(U^n ∉ T_{δ′}(U))   (upper bound in Eq. 8.1)
Further, let d(C_n) = E(d(U^n, C_n) | C_n) be the average distortion of the random codebook C_n; thus d(c_n) = E(d(U^n, c_n) | C_n = c_n) = E(d(U^n, c_n)) (since C_n is independent of U^n) is the average distortion of a fixed codebook c_n. The bound above implies that
E[d(C_n)] < D + 2δ D_max   for sufficiently large n,
which further implies the existence of c_n, a realization of C_n, satisfying d(c_n) ≤ E[d(C_n)] < D + 2δ D_max.
Taking arbitrarily small δ and sufficiently large n, we can get the average distortion d(c_n) arbitrarily close to D, while the size of the codebook is M = ⌊2^{nR}⌋, i.e. the rate is at most R.
log M ≥ H(V^N)
 ≥ H(V^N) − H(V^N | U^N)
 = I(U^N; V^N)
 = H(U^N) − H(U^N | V^N)
 = Σ_{i=1}^{N} [H(Uᵢ) − H(Uᵢ | U^{i−1}, V^N)]   (by the chain rule)
 ≥ Σ_{i=1}^{N} [H(Uᵢ) − H(Uᵢ | Vᵢ)]   (conditioning reduces entropy)
 = Σ_{i=1}^{N} I(Uᵢ; Vᵢ)
 ≥ Σ_{i=1}^{N} R^{(I)}(E[d(Uᵢ, Vᵢ)])   (by definition of R^{(I)}(D))
 = N · (1/N) Σ_{i=1}^{N} R^{(I)}(E[d(Uᵢ, Vᵢ)])   (average of R^{(I)}(·) over all i)
 ≥ N R^{(I)}( (1/N) Σ_{i=1}^{N} E[d(Uᵢ, Vᵢ)] )   (by the convexity of R^{(I)}(D))
 ≥ N R^{(I)}(D)   (R^{(I)}(D) is nonincreasing)
Hence
rate = (log M)/N ≥ R^{(I)}(D).
How large does a codebook have to be so that every source sequence in the typical set has a reconstruction with which it is jointly typical? Let T(U | V^N(i)) be the set of source sequences jointly typical with the reconstruction sequence V^N(i). To cover every source sequence, we need a codebook at least as large as the typical set of the input divided by the number of source sequences one reconstruction can cover:
size of the codebook ≥ |T(U)| / |T(U | V^N(i))| ≈ 2^{NH(U)} / 2^{NH(U|V)} = 2^{NI(U;V)}.
This is shown in Fig. 8.1.
The communication problem has a similar setup: to achieve reliable communication, the number of messages must satisfy
# messages ≤ |T(Y)| / |T(Y | X^n(i))| = 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{nI(X;Y)}.
We now combine the two problems. The overall system is
Uᵢ i.i.d. ∼ U:   U^N = (U₁, . . . , U_N) −→ Transmitter −→ X^n −→ Memoryless Channel P_{Y|X} −→ Y^n −→ Receiver −→ V^N = (V₁, . . . , V_N)
With this channel description, the goal is to communicate U^N = (U₁, U₂, . . . , U_N) through the memoryless channel given by P_{Y|X} with small expected distortion, measured by E[d(U^N, V^N)]. In other words, the goal is to find the best possible distortion given some rate and some noise during transmission. Note that the Uᵢ are not necessarily bits.
d(U^N, V^N) = (1/N) Σ_{i=1}^{N} d(Uᵢ, Vᵢ).
Definition 79. A rate-distortion pair (ρ, D) is achievable if ∀ε > 0, ∃ a scheme with N/n ≥ ρ − ε and E[d(U^N, V^N)] ≤ D + ε.
Note: under any such scheme, E[d(U^N, V^N)] ≤ D, and U^N → X^n → Y^n → V^N forms a Markov chain. Therefore,
nC ≥ I(X^n; Y^n)   (proven in the channel coding converse theorem)
I(X^n; Y^n) ≥ I(U^N; V^N)   (data processing inequality)
I(U^N; V^N) ≥ N R(D)   (proven in the converse of the rate distortion theorem)
so (N/n) · R(D) = Rate · R(D) ≤ C. Thus, if (ρ, D) is achievable, then ρ R(D) ≤ C.
U^N −→ Good Distortion Compressor −→ N·R(D) bits −→ Reliable Channel Encoder −→ X^n −→ Memoryless Channel −→ Y^n −→ Reliable Channel Decoder −→ N·R(D) bits −→ Good Distortion Decompressor −→ V^N
All these pieces work together to ensure that distortion and channel noise are handled properly. It is guaranteed that E[d(U^N, V^N)] ≈ D provided that n · C ≥ N · R(D), i.e. C ≥ (N/n) · R(D) = rate · R(D). Thus, if rate · R(D) ≤ C, then (ρ, D) is achievable.
(Figure: the boundary ρ = C/R(D) in the (D, ρ) plane; pairs below the curve, e.g. ρ = C/R(D₀) at D = D₀, are achievable, while pairs above it are not achievable. The distortion incurred without sending anything is D_max.)
We can achieve points on the curve above by first thinking about representing the data as bits in
an efficient manner (compression) with rate R(D) and then transmitting these bits losslessly across
the channel with rate C. Note that the distortion without sending anything over the channel is
Dmax .
For a Ber(p) source with Hamming distortion sent over a BSC(q), the boundary is
ρ = (1 − h₂(q)) / (h₂(p) − h₂(D)).
(Figure: this curve as a function of D; it starts at (1 − h₂(q))/h₂(p) at D = 0 and grows without bound as D → p. The region below the curve is achievable, the region above is not.)
Note that the communication problem corresponds to D = 0.
In particular, if p = 1/2, then if we want distortion ≤ D, the maximum rate we can transmit
at is:
ρ = (1 − h₂(q)) / (1 − h₂(D)).
(Figure: this curve as a function of D; it equals 1 − h₂(q) at D = 0 and equals 1 at D = q. The region below the curve is achievable, the region above is not.)
For a Gaussian source U ∼ N(0, σ²) with squared-error distortion sent over an AWGN channel Y = X + Z, with Z ∼ N(0, 1) and power constraint E[X²] ≤ P, the boundary is
ρ = log(1 + P) / log(σ²/D).
(Figure: this curve as a function of D; it equals 1 at D = σ²/(P + 1), and the achievable region lies below it for D up to σ².)
Consider the following scheme at rate = 1 (N = n):
transmit: Xᵢ = √(P/σ²) Uᵢ
receive: Yᵢ = Xᵢ + Zᵢ = √(P/σ²) Uᵢ + Zᵢ
reconstruction: Vᵢ = E[Uᵢ | Yᵢ]
The distortion is squared error, so we know that reconstruction using the conditional expectation is optimal; thus, we take Vᵢ = E[Uᵢ | Yᵢ].
The expected distortion is then
D = E[(Uᵢ − Vᵢ)²] = Var(Uᵢ | Yᵢ) = (σ²/P) · Var(Xᵢ | Yᵢ) =⁽ᵃ⁾ (σ²/P) · (P · 1)/(P + 1) = σ²/(P + 1),
where (a) follows from the fact that for X ∼ N(0, σ₁²) independent of Y ∼ N(0, σ₂²):
Var(X | X + Y) = σ₁² σ₂² / (σ₁² + σ₂²).
Now, at rate = 1, the optimal D satisfies
log(1 + P) / log(σ²/D) = 1   ⇒   1 + P = σ²/D,   i.e.   D = σ²/(1 + P).
So, in the specific case of rate = 1, we see that the simple scheme above is optimal, just as the simple scheme for the binary source and channel was also optimal when rate = 1.
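A short simulation sketch (added for illustration, not from the notes; σ² = 4 and P = 3 are arbitrary choices) of the rate-1 analog scheme above, confirming that its distortion matches σ²/(1 + P):

    import math
    import random

    rng = random.Random(0)
    sigma2, P, n = 4.0, 3.0, 200_000
    a = math.sqrt(P / sigma2)               # transmit scaling X = a * U, so E[X^2] = P

    mse = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, math.sqrt(sigma2))
        y = a * u + rng.gauss(0.0, 1.0)     # AWGN channel with unit noise variance
        v = (a * sigma2 / (a * a * sigma2 + 1.0)) * y   # MMSE estimate E[U|Y] for jointly Gaussian (U, Y)
        mse += (u - v) ** 2

    print("empirical distortion:", mse / n)
    print("sigma^2 / (1 + P)   :", sigma2 / (1 + P))   # = 1.0 here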