Alajaji Chen2018 Book AnIntroductionToSingle UserInf
Alajaji Chen2018 Book AnIntroductionToSingle UserInf
Alajaji Chen2018 Book AnIntroductionToSingle UserInf
Fady Alajaji
Po-Ning Chen
An Introduction
to Single-User
Information
Theory
Springer Undergraduate Texts in Mathematics
and Technology
Series editor
H. Holden, Norwegian University of Science and Technology,
Trondheim, Norway
Editorial Board
Lisa Goldberg, University of California, Berkeley, CA, USA
Armin Iske, University of Hamburg, Germany
Palle E. T. Jorgensen, The University of Iowa, Iowa City, IA, USA
Springer Undergraduate Texts in Mathematics and Technology (SUMAT)
publishes textbooks aimed primarily at the undergraduate. Each text is designed
principally for students who are considering careers either in the mathematical
sciences or in technology-based areas such as engineering, finance, information
technology and computer science, bioscience and medicine, optimization or
industry. Texts aim to be accessible introductions to a wide range of core
mathematical disciplines and their practical, real-world applications; and are
fashioned both for course use and for independent study.
An Introduction
to Single-User Information
Theory
123
Fady Alajaji Po-Ning Chen
Department of Mathematics Department of Electrical
and Statistics and Computer Engineering
Queen’s University National Chiao Tung University
Kingston, ON Hsinchu
Canada Taiwan, Republic of China
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
part of Springer Nature
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
v
vi Preface
We are very much indebted to all readers, including many students, who pro-
vided valuable feedback. Special thanks are devoted to Yunghsiang S. Han,
Yu-Chih Huang, Tamás Linder, Stefan M. Moser, and Vincent Y. F. Tan; their
insightful and incisive comments greatly benefited the manuscript. We also thank
all anonymous reviewers for their constructive and detailed criticism. Finally, we
sincerely thank all our mentors and colleagues who immeasurably and positively
impacted our understanding of and fondness for the field of information theory,
including Lorne L. Campbell, Imre Csiszár, Lee D. Davisson, Nariman Farvardin,
Thomas E. Fuja, Te Sun Han, Tamás Linder, Prakash Narayan, Adrian
Papamarcou, Nam Phamdo, Mikael Skoglund, and Sergio Verdú.
Thanks are given to our families for their full support during the period of writing
this textbook.
ix
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Communication System Model . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Information Measures for Discrete Systems . . . . . . . . . . . . . . . . . . . 5
2.1 Entropy, Joint Entropy, and Conditional Entropy . . . . . . . . . . . . . 5
2.1.1 Self-information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Properties of Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Joint Entropy and Conditional Entropy . . . . . . . . . . . . . . . 12
2.1.5 Properties of Joint Entropy and Conditional Entropy . . . . . 14
2.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Properties of Mutual Information . . . . . . . . . . . . . . . . . . . 16
2.2.2 Conditional Mutual Information . . . . . . . . . . . . . . . . . . . . 17
2.3 Properties of Entropy and Mutual Information for Multiple
Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Data Processing Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Fano’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Divergence and Variational Distance . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Convexity/Concavity of Information Measures . . . . . . . . . . . . . . . 37
2.8 Fundamentals of Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . 40
2.9 Rényi’s Information Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Lossless Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1 Principles of Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Block Codes for Asymptotically Lossless Compression . . . . . . . . 57
3.2.1 Block Codes for Discrete Memoryless Sources . . . . . . . . . 57
3.2.2 Block Codes for Stationary Ergodic Sources . . . . . . . . . . . 66
3.2.3 Redundancy for Lossless Block Data Compression . . . . . . 75
xi
xii Contents
xv
xvi List of Figures
xvii
Chapter 1
Introduction
1.1 Overview
Since its inception, the main role of information theory has been to provide the
engineering and scientific communities with a mathematical framework for the the-
ory of communication by establishing the fundamental limits on the performance of
various communication systems. The birth of information theory was initiated with
the publication of the groundbreaking works [340, 346] of Claude Elwood Shannon
(1916–2001) who asserted that it is possible to send information-bearing signals at a
fixed positive rate through a noisy communication channel with an arbitrarily small
probability of error as long as the transmission rate is below a certain fixed quan-
tity that depends on the channel statistical characteristics; he “named” this quantity
channel capacity. He further proclaimed that random (stochastic) sources, represent-
ing data, speech or image signals, can be compressed distortion-free at a minimal
rate given by the source’s intrinsic amount of information, which he called source
entropy1 and defined in terms of the source statistics. He went on proving that if a
source has an entropy that is less than the capacity of a communication channel, then
the source can be reliably transmitted (with asymptotically vanishing probability of
error) over the channel. He further generalized these “coding theorems” from the
lossless (distortionless) to the lossy context where the source can be compressed
and reproduced (possibly after channel transmission) within a tolerable distortion
threshold [345].
1 Shannonborrowed the term “entropy” from statistical mechanics since his quantity admits the
same expression as Boltzmann’s entropy [55].
2 See [359] for accessing most of Shannon’s works, including his master’s thesis [337, 338] which
made a breakthrough connection between electrical switching circuits and Boolean algebra and
played a catalyst role in the digital revolution, his dissertation on an algebraic framework for
population genetics [339], and his seminal paper on information-theoretic cryptography [342]. Refer
also to [362] for a recent (nontechnical) biography on Shannon and [146] for a broad discourse on
the history of information and on the information age.
1.2 Communication System Model 3
Transmitter Part
Focus of Physical
this text Channel
• Modulator: It transforms the channel encoder output into a waveform suitable for
transmission over the physical channel. This is typically accomplished by varying
the parameters of a sinusoidal signal in proportion with the data provided by the
channel encoder output.
• Physical Channel: It consists of the noisy (or unreliable) medium that the trans-
mitted waveform traverses. It is usually modeled via a sequence of conditional
(or transition) probability distributions of receiving an output given that a specific
input was sent.
• Receiver Part: It consists of the demodulator, the channel decoder, and the source
decoder where the reverse operations are performed. The destination represents
the sink where the source estimate provided by the source decoder is reproduced.
In this text, we will model the concatenation of the modulator, physical chan-
nel, and demodulator via a discrete-time3 channel with a given sequence of condi-
tional probability distributions. Given a source and a discrete channel, our objectives
will include determining the fundamental limits of how well we can construct a
(source/channel) coding scheme so that:
• the smallest number of source encoder symbols can represent each source symbol
distortion-free or within a prescribed distortion level D, where D > 0 and the
channel is noiseless;
3 Except for a brief interlude with the continuous-time (waveform) Gaussian channel in Chap. 5, we
• the largest rate of information can be transmitted over a noisy channel between
the channel encoder input and the channel decoder output with an arbitrarily small
probability of decoding error;
• we can guarantee that the source is transmitted over a noisy channel and reproduced
at the destination within distortion D, where D > 0.
We refer the reader to Appendix A for the necessary background on suprema and
limits; in particular, Observation A.5 (resp. Observation A.11) provides a pertinent
connection between the supremum (resp., infimum) of a set and the proof of a typ-
ical channel coding (resp., source coding) theorem in information theory. Finally,
Appendix B provides an overview of basic concepts from probability theory and the
theory of random processes that are used in the text. The appendix also contains
a brief discussion of convexity, Jensen’s inequality and the Lagrange multipliers
constrained optimization technique.
Chapter 2
Information Measures for Discrete
Systems
2.1.1 Self-information
1 More specifically, Shannon introduced the entropy, conditional entropy, and mutual information
measures [340], while divergence is due to Kullback and Leibler [236, 237].
2 By discrete alphabets, one usually means finite or countably infinite alphabets. We however focus
mostly on finite-alphabet systems, although the presented information measures allow for countable
alphabets (when they exist).
© Springer Nature Singapore Pte Ltd. 2018 5
F. Alajaji and P.-N. Chen, An Introduction to Single-User Information Theory,
Springer Undergraduate Texts in Mathematics and Technology,
https://doi.org/10.1007/978-981-10-8001-2_2
6 2 Information Measures for Discrete Systems
event E is, the more information is gained when one learns it has occurred. In
other words, I (pE ) is a decreasing function of pE .
2. I (pE ) should be continuous in pE .
Intuitively, one should expect that a small change in pE corresponds to a small
change in the amount of information carried by E.
3. If E1 and E2 are independent events, then I(E1 ∩ E2 ) = I(E1 ) + I(E2 ), or
equivalently, I (pE1 × pE2 ) = I (pE1 ) + I (pE2 ).
This property declares that when events E1 and E2 are independent of each other
(i.e., when they do not affect each other probabilistically), the amount of infor-
mation one gains by learning that both events have jointly occurred should be
equal to the sum of the amounts of information of each individual event.
Next, we show that the only function that satisfies Properties 1–3 above is the
logarithmic function.
Theorem 2.1 The only function defined over p ∈ [0, 1] and satisfying
1. I (p) is monotonically decreasing in p;
2. I (p) is a continuous function of p for 0 ≤ p ≤ 1;
3. I (p1 × p2 ) = I (p1 ) + I (p2 );
is I (p) = −c · logb (p), where c is a positive constant and the base b of the logarithm
is any number larger than one.
Now let n be a fixed positive integer greater than 1. Conditions 1 and 3 respectively
imply
1 1
n < m =⇒ I <I (2.1.1)
n m
and
1 1 1
I =I +I , (2.1.2)
mn m n
Now for any positive integer r, there exists a nonnegative integer k such that
nk ≤ 2r < nk+1 .
By (2.1.1), we obtain
1 1 1
I ≤I <I ,
nk 2r nk+1
k I (1/2) k +1
≤ ≤ .
r I (1/n) r
k logb (2) k +1
logb nk ≤ logb 2r ≤ logb nk+1 ⇐⇒ ≤ ≤ .
r logb (n) r
Therefore,
logb (2) I (1/2) 1
log (n) − I (1/n) < r .
b
Since n is fixed, and r can be made arbitrarily large, we can let r → ∞ to get
1
I = c · logb (n),
n
where c = I (1/2)/ logb (2) > 0. This completes the proof of the claim.
Step 2: Claim. I (p) = −c · logb (p) for positive rational number p, where c > 0
is a constant.
Step 3: For any p ∈ [0, 1], it follows by continuity and the density of the rationals
in the reals that
The constant c above is by convention normalized to c = 1. Furthermore, the base
b of the logarithm determines the type of units used in measuring information. When
b = 2, the amount of information is expressed in bits (i.e., binary digits). When
b = e – i.e., the natural logarithm (ln) is used – information is measured in nats (i.e.,
natural units or digits). For example, if the event E concerns a Heads outcome from
the toss of a fair coin, then its self-information is I(E) = − log2 (1/2) = 1 bit or
− ln(1/2) = 0.693 nats.
More generally, under base b > 1, information is in b-ary units or digits. For
the sake of simplicity, we will use the base-2 logarithm throughout unless otherwise
specified. Note that one can easily convert information units from bits to b-ary units
by dividing the former by log2 (b).
2.1.2 Entropy
where I(x) := − log2 PX (x) is the self-information of the elementary event {X = x}.
When computing the entropy, we adopt the convention
0 · log2 0 = 0,
Example 2.3 Let X be a binary (valued) random variable with alphabet X = {0, 1}
and pmf given by PX (1) = p and PX (0) = 1 − p, where 0 ≤ p ≤ 1 is fixed. Then
H (X ) = −p · log2 p − (1 − p) · log2 (1 − p). This entropy is conveniently called the
binary entropy function and is usually denoted by hb (p): it is illustrated in Fig. 2.1.
As shown in the figure, hb (p) is maximized for a uniform distribution (i.e., p = 1/2).
The units for H (X ) above are in bits as base-2 logarithm is used. Setting
HD (X ) := − PX (x) · logD PX (x)
x∈X
yields the entropy in D-ary units, where D > 1. Note that we abbreviate H2 (X ) as
H (X ) throughout the book since bits are common measure units for a coding system,
and hence
H (X )
HD (X ) = .
log2 D
0
0 0.5 1
p
10 2 Information Measures for Discrete Systems
Thus
H (X )
He (X ) = = (ln 2) · H (X )
log2 (e)
gives the entropy in nats, where e is the base of the natural logarithm.
When developing or proving the basic properties of entropy (and other information
measures), we will often use the following fundamental inequality for the loga-
rithm(its proof is left as an exercise).
Lemma 2.4 (Fundamental inequality (FI)) For any x > 0 and D > 1, we have that
Setting y = 1/x and using FI above directly yield that for any y > 0, we also have
that
1
logD (y) ≥ logD (e) 1 − ,
y
also with equality iff y = 1. In the above the base-D logarithm was used. Specifically,
for a logarithm with base-2, the above inequalities become
1
log2 (e) 1 − ≤ log2 (x) ≤ log2 (e) · (x − 1),
x
Proof 0 ≤ PX (x) ≤ 1 implies that log2 [1/PX (x)] ≥ 0 for every x ∈ X . Hence,
1
H (X ) = PX (x) log2 ≥ 0,
x∈X
PX (x)
Lemma 2.6 (Upper bound on entropy) If a random variable X takes values from a
finite set X , then
H (X ) ≤ log2 |X |,
where4 |X | denotes the size of the set X . Equality holds iff X is equiprobable or
uniformly distributed over X (i.e., PX (x) = |X1 | for all x ∈ X ).
Proof
log2 |X | − H (X ) = log2 |X | · PX (x) − − PX (x) log2 PX (x)
x∈X x∈X
= PX (x) · log2 |X | + PX (x) log2 PX (x)
x∈X x∈X
= PX (x) log2 [|X | · PX (x)]
x∈X
1
≥ PX (x) · log2 (e) 1 −
x∈X
|X | · PX (x)
1
= log2 (e) PX (x) −
x∈X
|X |
= log2 (e) · (1 − 1) = 0,
where the inequality follows from the FI Lemma, with equality iff (∀ x ∈ X ),
|X | · PX (x) = 1, which means PX (·) is a uniform distribution on X .
Intuitively, H (X ) tells us how random X is. Indeed, X is deterministic (not random
at all) iff H (X ) = 0. If X is uniform (equiprobable), H (X ) is maximized and is
equal to log2 |X |.
Lemma 2.7 (Log-sum inequality) For nonnegative numbers, a1 , a2 , . . ., an and
b1 , b2 , . . ., bn ,
n
n
n
ai ai
ai logD ≥ ai logD i=1
n , (2.1.4)
i=1
bi i=1 i=1 bi
which is a constant that does not depend on i. (By convention, 0 · logD (0) = 0,
0 · logD (0/0) = 0 and a · logD (a/0) = ∞ if a > 0. Again, this can be justified by
“continuity.”)
that log |X | is also known as Hartley’s function or entropy; Hartley was the first to suggest
4 Note
n n
Proof Let a := i=1 ai and b := i=1 bi . Then
⎡ ⎤
⎢ n
n ⎥
n
ai a ⎢ ai a ai a⎥
ai logD − a logD = a ⎢ ⎥
i
⎢ logD − logD
i=1
bi b ⎣ i=1 a bi i=1
a b⎥
⎦
=1
n
ai ai b
=a logD
i=1
a bi a
n
ai bi a
≥ a logD (e) 1−
i=1
a ai b
n
ai bi n
= a logD (e) −
i=1
a i=1
b
= a logD (e) (1 − 1) = 0,
where the inequality follows from the FI Lemma, with equality holding iff abii ab = 1
for all i; i.e., abii = ab ∀i.
We also provide another proof using Jensen’s inequality (cf. Theorem B.18 in
Appendix B). Without loss of generality, assume that ai > 0 and bi > 0 for every i.
Jensen’s inequality states that
n
n
αi f (ti ) ≥ f αi ti
i=1 i=1
and ni=1 αi = 1; equality holds iff ti is a
for any strictly convex function f (·), αi ≥ 0,
constant for all i. Hence by setting αi = bi / nj=1 bj , ti = ai /bi , and f (t) = t·logD (t),
we obtain the desired result.
Given a pair of random variables (X , Y ) with a joint pmf PX ,Y (·, ·) defined5 on X ×Y,
the self-information of the (two-dimensional) elementary event {X = x, Y = y} is
defined by
I(x, y) := − log2 PX ,Y (x, y).
5 Note that PXY (·, ·) is another common notation for the joint distribution PX ,Y (·, ·).
2.1 Entropy, Joint Entropy, and Conditional Entropy 13
where H (Y |X = x) := − y∈Y PY |X (y|x) log2 PY |X (y|x).
The relationship between joint entropy and conditional entropy is exhibited by
the fact that the entropy of a pair of random variables is the entropy of one plus the
conditional entropy of the other.
Theorem 2.10 (Chain rule for entropy)
H (X , Y ) = H (X ) + H (Y |X ). (2.1.6)
Proof Since
PX ,Y (x, y) = PX (x)PY |X (y|x),
H (X , Y ) = E[− log2 PX ,Y (X , Y )]
= E[− log2 PX (X )] + E[− log2 PY |X (Y |X )]
= H (X ) + H (Y |X ).
14 2 Information Measures for Discrete Systems
H (X , Y ) = H (X ) + H (Y |X ) = H (Y ) + H (X |Y ) = H (Y , X ),
The above quantity is exactly equal to the mutual information which will be intro-
duced in the next section.
The conditional entropy can be thought of in terms of a channel whose input
is the random variable X and whose output is the random variable Y . H (X |Y ) is
then called the equivocation6 and corresponds to the uncertainty in the channel
input from the receiver’s point of view. For example, suppose that the set of possible
outcomes of random vector (X , Y ) is {(0, 0), (0, 1), (1, 0), (1, 1)}, where none of the
elements has zero probability mass. When the receiver Y receives 1, he still cannot
determine exactly what the sender X observes (it could be either 1 or 0); therefore, the
uncertainty, from the receiver’s view point, depends on the probabilities PX |Y (0|1)
and PX |Y (1|1).
Similarly, H (Y |X ), which is called prevarication,7 is the uncertainty in the channel
output from the transmitter’s point of view. In other words, the sender knows exactly
what he sends, but is uncertain on what the receiver will finally obtain.
A case that is of specific interest is when H (X |Y ) = 0. By its definition,
H (X |Y ) = 0 if X becomes deterministic after observing Y . In such a case, the
uncertainty of X after giving Y is completely zero.
The next corollary can be proved similarly to Theorem 2.10.
with equality holding iff X and Y are independent. In other words, “conditioning”
reduces entropy.
6 Equivocation is an ambiguous statement one uses deliberately in order to deceive or avoid speaking
the truth.
7 Prevarication is the deliberate act of deviating from the truth (it is a synonym of “equivocation”).
2.1 Entropy, Joint Entropy, and Conditional Entropy 15
Proof
PX |Y (x|y)
H (X ) − H (X |Y ) = PX ,Y (x, y) · log2
(x,y)∈X ×Y
PX (x)
PX |Y (x|y)PY (y)
= PX ,Y (x, y) · log2
(x,y)∈X ×Y
PX (x)PY (y)
PX ,Y (x, y)
= PX ,Y (x, y) · log2
(x,y)∈X ×Y
PX (x)PY (y)
⎛ ⎞
(x,y)∈X ×Y PX ,Y (x, y)
≥⎝ PX ,Y (x, y)⎠ log2
(x,y)∈X ×Y (x,y)∈X ×Y PX (x)PY (y)
= 0,
where the inequality follows from the log-sum inequality, with equality holding iff
PX ,Y (x, y)
= constant ∀ (x, y) ∈ X × Y.
PX (x)PY (y)
Since probability must sum to 1, the above constant equals 1, which is exactly the
case of X being independent of Y .
H (X , Y ) = H (X ) + H (Y |X ) ≤ H (X ) + H (Y ). (2.1.8)
The above lemma tells us that equality holds for (2.1.8) only when X is independent
of Y .
A result similar to (2.1.8) also applies to the conditional entropy.
PX1 ,X2 |Y1 ,Y2 (x1 , x2 |y1 , y2 ) = PX1 |Y1 (x1 |y1 )PX2 |Y2 (x2 |y2 )
Proof Using the chain rule for conditional entropy and the fact that conditioning
reduces entropy, we can write
For (2.1.9), equality holds iff X1 and X2 are conditionally independent given (Y1 , Y2 ):
PX1 ,X2 |Y1 ,Y2 (x1 , x2 |y1 , y2 ) = PX1 |Y1 ,Y2 (x1 |y1 , y2 )PX2 |Y1 ,Y2 (x2 |y1 , y2 ). For (2.1.10), equal-
ity holds iff X1 is conditionally independent of Y2 given Y1 (i.e., PX1 |Y1 ,Y2 (x1 |y1 , y2 ) =
PX1 |Y1 (x1 |y1 )), and X2 is conditionally independent of Y1 given Y2 (i.e., PX2 |Y1 ,Y2 (x2 |y1 ,
y2 ) = PX2 |Y2 (x2 |y2 )). Hence, the desired equality condition of the lemma is
obtained.
For two random variables X and Y , the mutual information between X and Y is the
reduction in the uncertainty of Y due to the knowledge of X (or vice versa). A dual
definition of mutual information states that it is the average amount of information
that Y has (or contains) about X or X has (or contains) about Y .
We can think of the mutual information between X and Y in terms of a channel
whose input is X and whose output is Y . Thereby the reduction of the uncertainty is
by definition the total uncertainty of X (i.e., H (X )) minus the uncertainty of X after
observing Y (i.e., H (X |Y )). Mathematically, it is
It can be easily verified from (2.1.7) that mutual information is symmetric; i.e.,
I (X ; Y ) = I (Y ; X ).
H(X, Y )
Lemma 2.16 (Chain rule for mutual information) Defining the joint mutual infor-
mation between X and the pair (Y , Z) as in (2.2.1) by
I (X ; Y , Z) := H (X ) − H (X |Y , Z),
we have
I (X ; Y , Z) = I (X ; Y ) + I (X ; Z|Y ) = I (X ; Z) + I (X ; Y |Z).
18 2 Information Measures for Discrete Systems
I (X ; Y , Z) = H (X ) − H (X |Y , Z)
= H (X ) − H (X |Y ) + H (X |Y ) − H (X |Y , Z)
= I (X ; Y ) + I (X ; Z|Y ).
The above lemma can be read as follows: the information that (Y , Z) has about X
is equal to the information that Y has about X plus the information that Z has about
X when Y is already known.
where H (Xi |Xi−1 , . . . , X1 ) := H (X1 ) for i = 1. (The above chain rule can also be
written as:
n
H (X n ) = H (Xi |X i−1 ),
i=1
Once again, applying (2.1.6) to the first term of the right-hand side of (2.3.1), we
have
n
H (X1 , X2 , . . . , Xn |Y ) = H (Xi |Xi−1 , . . . , X1 , Y ).
i=1
n
I (X1 , X2 , . . . , Xn ; Y ) = I (Xi ; Y |Xi−1 , . . . , X1 ),
i=1
n
H (X1 , X2 , . . . , Xn ) ≤ H (Xi ).
i=1
n
H (X1 , X2 , . . . , Xn ) = H (Xi |Xi−1 , . . . , X1 )
i=1
n
≤ H (Xi ).
i=1
Equality holds iff each conditional entropy is equal to its associated entropy, that iff
Xi is independent of (Xi−1 , . . . , X1 ) for all i.
n
I (X1 , . . . , Xn ; Y1 , . . . , Yn ) ≤ I (Xi ; Yi ),
i=1
n
H (Y1 , . . . , Yn ) ≤ H (Yi ).
i=1
Hence,
I (X n ; Y n ) = H (Y n ) − H (Y n |X n )
n n
≤ H (Yi ) − H (Yi |Xi )
i=1 i=1
n
= I (Xi ; Yi ),
i=1
with equality holding iff {Yi }ni=1 are independent, which holds iff {Xi }ni=1 are inde-
pendent.
Recalling that the Markov chain relationship X → Y → Z means that X and Z are
conditional independent given Y (cf. Appendix B), we have the following result.
2.4 Data Processing Inequality 21
Lemma 2.22 (Data processing inequality) (This is also called the data processing
lemma.) If X → Y → Z, then
I (X ; Y ) ≥ I (X ; Z).
I (X ; Z) + I (X ; Y |Z) = I (X ; Y , Z) (2.4.1)
= I (X ; Y ) + I (X ; Z|Y )
= I (X ; Y ). (2.4.2)
The data processing inequality means that the mutual information will not increase
after processing. This result is somewhat counterintuitive since given two random
variables X and Y , we might believe that applying a well-designed processing scheme
to Y , which can be generally represented by a mapping g(Y ), could possibly increase
the mutual information. However, for any g(·), X → Y → g(Y ) forms a Markov
chain which implies that data processing cannot increase mutual information. A
communication context for the data processing lemma is depicted in Fig. 2.3, and
summarized in the next corollary.
Corollary 2.23 For jointly distributed random variables X and Y and any function
g(·), we have X → Y → g(Y ) and
I (X ; Y ) ≥ I (X ; g(Y )).
We also note that if Z obtains all the information about X through Y , then knowing
Z will not help increase the mutual information between X and Y ; this is formalized
in the following.
I (X ; Y |Z) ≤ I (X ; Y ).
I(U; V ) ≤ I(X; Y )
U X Y V
Source Encoder Channel Decoder
I (X ; Y |Z) = H (X |Z) − H (X |Y , Z)
= H (X |Z)
= PZ (0)H (X |z = 0) + PZ (1)H (X |z = 1) + PZ (2)H (X |z = 2)
= 0 + 0.5 + 0
= 0.5 bits,
I (Xi ; Xl ) ≤ I (Xj ; Xk ).
Fano’s inequality [113, 114] is a useful tool widely employed in information theory to
prove converse results for coding theorems (as we will see in the following chapters).
Lemma 2.26 (Fano’s inequality) Let X and Y be two random variables, correlated
in general, with alphabets X and Y, respectively, where X is finite but Y can be
countably infinite. Let X̂ := g(Y ) be an estimate of X from observing Y , where
g : Y → X is a given estimation function. Define the probability of error as
Pe := Pr[X̂ = X ].
H (X |Y )
log2 (|X |)
log2 (|X | − 1)
0 (|X | − 1)/|X | 1
Pe
Observation 2.27
• Note that when Pe = 0, we obtain that H (X |Y ) = 0 (see (2.5.1)) as intuition
suggests, since if Pe = 0, then X̂ = g(Y ) = X (with probability 1) and thus
H (X |Y ) = H (g(Y )|Y ) = 0.
• Fano’s inequality yields upper and lower bounds on Pe in terms of H (X |Y ). This is
illustrated in Fig. 2.4, where we plot the region for the pairs (Pe , H (X |Y )) that are
permissible under Fano’s inequality. In the figure, the boundary of the permissible
(dashed) region is given by the function
Furthermore, when
0 < H (X |Y ) ≤ log2 (|X | − 1),
24 2 Information Measures for Discrete Systems
Thus for all nonzero values of H (X |Y ), we obtain a lower bound (of the same
form above) on Pe ; the bound implies that if H (X |Y ) is bounded away from zero,
Pe is also bounded away from zero.
• A weaker but simpler version of Fano’s inequality can be directly obtained from
(2.5.1) by noting that hb (Pe ) ≤ 1:
H (X |Y ) − 1
Pe ≥ (for |X | > 2)
log2 (|X | − 1)
1, if g(Y )
= X
E := .
0, if g(Y ) = X
H (E, X |Y ) = H (X |Y ) + H (E|X , Y )
= H (E|Y ) + H (X |E, Y ).
since X = g(Y ) for E = 0, and given E = 1, we can upper bound the conditional
entropy by the logarithm of the number of remaining outcomes, i.e., (|X | − 1).
Combining these results completes the proof.
Fano’s inequality cannot be improved in the sense that the lower bound, H (X |Y ),
can be achieved for some specific cases. Any bound that can be achieved in some cases
is often referred to as sharp.9 From the proof of the above lemma, we can observe
9 Definition. A bound is said to be sharp if the bound is achievable for some specific cases. A bound
Example 2.28 Suppose that X and Y are two independent random variables which
are both uniformly distributed on the alphabet {0, 1, 2}. Let the estimating function
be given by g(y) = y. Then
2
2
Pe = Pr[g(Y )
= X ] = Pr[Y
= X ] = 1 − PX (x)PY (x) = .
x=0
3
I (X ; Y ) ≥ I (X ; X̂ ),
H (X |Y ) ≤ H (X |X̂ ).
Thus, if we show that H (X |X̂ ) is no larger than the right-hand side of (2.5.1), the
proof of (2.5.1) is complete.
Noting that
Pe = PX ,X̂ (x, x̂)
x∈X x̂∈X :x̂
=x
and
1 − Pe = PX ,X̂ (x, x̂) = PX ,X̂ (x, x),
x∈X x̂∈X :x̂=x x∈X
26 2 Information Measures for Discrete Systems
we obtain that
where the inequality follows by applying the FI Lemma to each logarithm term in
(2.5.3).
Definition 2.29 (Divergence) Given two discrete random variables X and X̂ defined
over a common alphabet X , the divergence or the Kullback–Leibler divergence or dis-
2.6 Divergence and Variational Distance 27
tance10 (other names are relative entropy and discrimination) is denoted by D(X X̂ )
or D(PX PX̂ ) and defined by11
PX (X ) PX (x)
D(X X̂ ) = D(PX PX̂ ) := EX log2 = PX (x) log2 .
PX̂ (X ) x∈X
PX̂ (x)
In other words, the divergence D(PX PX̂ ) is the expectation (with respect to
PX ) of the log-likelihood ratio log2 [PX /PX̂ ] of distribution PX against distribution
PX̂ . D(X X̂ ) can be viewed as a measure of “distance” or “dissimilarity” between
distributions PX and PX̂ . D(X X̂ ) is also called relative entropy since it can be
regarded as a measure of the inefficiency of mistakenly assuming that the distribution
of a source is PX̂ when the true distribution is PX . For example, if we know the true
distribution PX of a source, then we can construct a lossless data compression code
with average codeword length achieving entropy H (X ) (this will be studied in the
next chapter). If, however, we mistakenly thought that the “true” distribution is PX̂
and employ the “best” code corresponding to PX̂ , then the resultant average codeword
length becomes
[−PX (x) · log2 PX̂ (x)].
x∈X
As a result, the relative difference between the resultant average codeword length and
H (X ) is the relative entropy D(X X̂ ). Hence, divergence is a measure of the system
cost (e.g., storage consumed) paid due to mis-classifying the system statistics.
Note that when computing divergence, we follow the convention that
0 p
0 · log2 = 0 and p · log2 = ∞ for p > 0.
p 0
We next present some properties of the divergence and discuss its relation with
entropy and mutual information.
D(X X̂ ) ≥ 0,
with equality iff PX (x) = PX̂ (x) for all x ∈ X (i.e., the two distributions are equal).
10 As noted in Footnote 1, this measure was originally introduced by Kullback and Leibler [236,
237].
11 In order to be consistent with the units (in bits) adopted for entropy and mutual information, we
will also use the base-2 logarithm for divergence unless otherwise specified.
28 2 Information Measures for Discrete Systems
Proof
PX (x)
D(X X̂ ) = PX (x) log2
x∈X
PX̂ (x)
PX (x)
≥ PX (x) log2 x∈X
x∈X x∈X PX̂ (x)
= 0,
where the second step follows from the log-sum inequality with equality holding iff
for every x ∈ X ,
PX (x) PX (a)
= a∈X = 1,
PX̂ (x) b∈X PX̂ (b)
I (X ; Y ) = D(PX ,Y PX × PY ),
where PX ,Y (·, ·) is the joint distribution of the random variables X and Y and PX (·)
and PY (·) are the respective marginals.
Proof The observation follows directly from the definitions of divergence and mutual
information.
!
k
X = Ui .
i=1
Let us briefly discuss the relation between the processing of information and its
refinement. Processing of information can be modeled as a (many-to-one) mapping,
and refinement is actually the reverse operation. Recall that the data processing
lemma shows that mutual information can never increase due to processing. Hence,
if one wishes to increase mutual information, he should “anti-process” (or refine) the
involved statistics.
2.6 Divergence and Variational Distance 29
From Lemma 2.31, the mutual information can be viewed as the divergence of
a joint distribution against the product distribution of the marginals. It is therefore
reasonable to expect that a similar effect due to processing (or a reverse effect due
to refinement) should also apply to divergence. This is shown in the next lemma.
Lemma 2.33 (Refinement cannot decrease divergence) Let PX and PX̂ be the refine-
ments (k-refinements) of PU and PÛ respectively. Then
PU (i)
= PU (i) log2 , (2.6.1)
PÛ (i)
k
PX (x)
D(PX PX̂ ) = PX (x) log2
i=1 x∈Ui
PX̂ (x)
k
PU (i)
≥ PU (i) log2
i=1
PÛ (i)
= D(PU PÛ ),
other words, D(PX PX̂ )
= D(PX̂ PX ) in general. (It also does not satisfy the trian-
gle inequality.) Thus, divergence is not a true distance or metric. Another measure
which is a true distance, called variational distance, is sometimes used instead.
" #
Proof We first show that PX −PX̂ = 2· x∈X :PX (x)>PX̂ (x) PX (x)−PX̂ (x) . Setting
A := {x ∈ X : PX (x) > PX̂ (x)}, we have
PX − PX̂ = PX (x) − P (x)
X̂
x∈X
= PX (x) − P (x) + PX (x) − P (x)
X̂ X̂
x∈A x∈Ac
" # " #
= PX (x) − PX̂ (x) + PX̂ (x) − PX (x)
x∈A x∈Ac
" # " # " #
= PX (x) − PX̂ (x) + PX̂ Ac − PX Ac
x∈A
" #
= PX (x) − PX̂ (x) + PX (A) − PX̂ (A)
x∈A
" # " #
= PX (x) − PX̂ (x) + PX (x) − PX̂ (x)
x∈A x∈A
" #
= 2· PX (x) − PX̂ (x) ,
x∈A
log2 (e)
D(X X̂ ) ≥ · PX − PX̂ 2 .
2
This result is referred to as Pinsker’s inequality.
Proof
1. With A := {x ∈ X : PX (x) > PX̂ (x)}, we have from the previous lemma that
1, if X ∈ A,
U=
0, if X ∈ Ac ,
and
1, if X̂ ∈ A,
Û =
0, if X̂ ∈ Ac .
32 2 Information Measures for Discrete Systems
For ease of notations, let p = PU (1) and q = PÛ (1). Then proving the above
inequality is equivalent to showing that
p 1−p
p · ln + (1 − p) · ln ≥ 2(p − q)2 .
q 1−q
Define
p 1−p
f (p, q) := p · ln + (1 − p) · ln − 2(p − q)2 ,
q 1−q
f (p, q) ≥ 0 for q ≥ p,
Observation 2.38 The above lemma tells us that for a sequence of distributions
{(PXn , PX̂n )}n≥1 , when D(PXn PX̂n ) goes to zero as n goes to infinity, PXn − PX̂n
goes to zero as well. But the converse does not necessarily hold. For a quick coun-
terexample, let
1
PXn (0) = 1 − PXn (1) = > 0
n
2.6 Divergence and Variational Distance 33
and
PX̂n (0) = 1 − PX̂n (1) = 0.
In this case,
D(PXn PX̂n ) → ∞
2
= → 0.
n
We however can upper bound D(PX PX̂ ) by the variational distance between PX and
PX̂ when D(PX PX̂ ) < ∞.
log2 (e)
D(PX PX̂ ) ≤ · PX − PX̂ .
min min{PX (x), PX̂ (x)}
{x:PX (x)>0}
Proof Without loss of generality, we assume that PX (x) > 0 for all x ∈ X . Since
D(PX PX̂ ) < ∞, we have that PX (x) > 0 implies that PX̂ (x) > 0. Let
Hence,
PX (x)
D(PX PX̂ ) = log2 (e) PX (x) · ln
x∈X
PX̂ (x)
log2 (e)
≤ PX (x) · |PX (x) − PX̂ (x)|
t x∈X
log2 (e)
≤ |PX (x) − PX̂ (x)|
t x∈X
log2 (e)
= · PX − PX̂ .
t
The next lemma discusses the effect of side information on divergence. As stated in
Lemma 2.12, side information usually reduces entropy; it, however, increases diver-
gence. One interpretation of these results is that side information is useful. Regarding
entropy, side information provides us more information, so uncertainty decreases.
As for divergence, it is the measure or index of how easy one can differentiate the
source from two candidate distributions. The larger the divergence, the easier one can
tell these two distributions apart and make the right guess. At an extreme case, when
divergence is zero, one can never tell which distribution is the right one, since both
produce the same source. So, when we obtain more information (side information),
we should be able to make a better decision on the source statistics, which implies
that the divergence should be larger.
Definition 2.40 (Conditional divergence) Given three discrete random variables, X ,
X̂ , and Z, where X and X̂ have a common alphabet X , we define the conditional
divergence between X and X̂ given Z by
PX |Z (x|z)
D(X X̂ |Z) = D(PX |Z PX̂ |Z |PZ ) := PZ (z) PX |Z (x|z) log
z∈Z x∈X
PX̂ |Z (x|z)
PX |Z (x|z)
= PX ,Z (x, z) log .
z∈Z x∈X
PX̂ |Z (x|z)
In other words, it is the conditional divergence between PX |Z and PX̂ |Z given PZ and
it is nothing but the expected value with respect to PX ,Z of the log-likelihood ratio
P
log PX |Z .
X̂ |Z
Similarly, the conditional divergence between PX |Z and PX̂ given PZ is defined as
PX |Z (x|z)
D(PX |Z PX̂ |PZ ) := PZ (z) PX |Z (x|z) log .
z∈Z x∈X
PX̂ (z)
2.6 Divergence and Variational Distance 35
Proof The proof follows directly from the definition of conditional mutual informa-
tion (2.2.2) and the above definition of conditional divergence.
Lemma 2.42 (Chain rule for divergence)
Let PX n and QX n be two joint distributions on X n . We have that
D(PX1 ,X2 QX1 ,X2 ) = D(PX1 QX1 ) + D(PX2 |X1 QX2 |X1 |PX1 ),
n
D(PX n QX n ) = D(PXi |X i−1 QXi |X i−1 |PX i−1 ),
i=1
where D(PXi |X i−1 QXi |X i−1 |PX i−1 ) := D(PX1 QX1 ) for i = 1.
Proof The proof readily follows from the above divergence definitions.
Lemma 2.43 (Conditioning never decreases divergence) For three discrete random
variables, X , X̂ , and Z, where X and X̂ have a common alphabet X , we have that
Proof
PX̂ |Z (x|z)PX (x)
≥ PX ,Z (x, z) · log2 (e) 1 − (by the FI Lemma)
z∈Z x∈X
PX |Z (x|z)PX̂ (x)
PX (x)
= log2 (e) 1 − PZ (z)PX̂ |Z (x|z)
P (x) z∈Z
x∈X X̂
PX (x)
= log2 (e) 1 − P (x)
P (x) X̂
x∈X X̂
= log2 (e) 1 − PX (x) = 0,
x∈X
PX (x) PX |Z (x|z)
= .
PX̂ (x) PX̂ |Z (x|z)
where Z and Ẑ also have a common alphabet. In other words, side information
is helpful for divergence only when it provides information on the similarity or
difference of the two distributions. In the above case, Z only provides information
about X , and Ẑ provides information about X̂ ; so the divergence certainly cannot be
expected to increase. The next lemma shows that if the pair (Z, Ẑ) is independent
component-wise of the pair (X , X̂ ), then the side information of (Z, Ẑ) does not help
in improving the divergence of X against X̂ .
PY |X (y|x)
I (PX , PY |X ) := PY |X (y|x)PX (x) log2 ,
x∈X y∈Y a∈X PY |X (y|a)PX (a)
PY (y)PYλ |X (y|x)
≥ λ log2 (e) PX (x)PY |X (y|x) 1 −
x∈X y∈Y
PY |X (y|x)PYλ (y)
P$Y (y)PYλ |X (y|x)
+λ̄ log2 (e) PX (x)P$Y |X (y|x) 1 −
x∈X y∈Y
P$Y |X (y|x)PYλ (y)
= 0,
2.7 Convexity/Concavity of Information Measures 39
where the inequality follows from the FI Lemma, with equality holding iff
PY (y) P$ Y (y)
(∀ x ∈ X , y ∈ Y) = .
PY |X (y|x) Y |X (y|x)
P$
3. For ease of notation, let PXλ (x) := λPX (x) + (1 − λ)PX$ (x).
by the nonnegativity of the divergence, with equality holding iff PX (x) = PX$ (x)
for all x. Similarly, by letting PX̂λ (x) := λPX̂ (x) + (1 − λ)PX$ (x), we obtain
where the inequality follows from the FI Lemma, with equality holding iff
PX$ (x) = PX̂ (x) for all x.
Finally, by the log-sum inequality, for each x ∈ X , we have
• H0 : PX n
• H1 : PX̂ n
Based on one sequence of observations xn , one has to decide which of the hypotheses
is true. This is denoted by a decision mapping φ(·), where
0, if distribution of X n is classified to be PX n ;
φ(xn ) =
1, if distribution of X n is classified to be PX̂ n .
Accordingly, the possible observed sequences are divided into two groups:
Hence, depending on the true distribution, there are two types of error probabilities:
" #
Type I error : αn = αn (φ) := PX n {xn ∈ X n : φ(xn ) = 1}
" #
Type II error : βn = βn (φ) := PX̂ n {xn ∈ X n : φ(xn ) = 0} .
The choice of the decision mapping is dependent on the optimization criterion. Two
of the most frequently used ones in information theory are
1. Bayesian hypothesis testing.
Here, φ(·) is chosen so that the Bayesian cost
π0 αn + π1 βn
is minimized, where π0 and π1 are the prior probabilities for the null and alternative
hypotheses, respectively. The mathematical expression for Bayesian testing is
The set {φ} considered in the minimization operation could have two different
ranges: range over deterministic rules, and range over randomization rules. The main
42 2 Information Measures for Discrete Systems
difference between a randomization rule and a deterministic rule is that the former
allows the mapping φ(xn ) to be random on {0, 1} for some xn , while the latter only
accepts deterministic assignments to {0, 1} for all xn . For example, a randomization
rule for specific observations x̃n can be
%
0, with probability 0.2,
φ(x̃ ) =
n
1, with probability 0.8.
The Neyman–Pearson lemma shows the well-known fact that the likelihood ratio
test is always the optimal test [281].
and
βn∗ := PX̂ n {An (τ )} .
Then for type I error αn and type II error βn associated with another choice of
acceptance region for the null hypothesis, we have
αn ≤ αn∗ =⇒ βn ≥ βn∗ .
Proof Let B be a choice of acceptance region for the null hypothesis. Then
αn + τ βn = PX n (xn ) + τ PX̂ n (xn )
xn ∈Bc xn ∈B
= PX n (x ) + τ 1 −
n
PX̂ n (x )
n
xn ∈Bc xn ∈Bc
=τ+ PX n (xn ) − τ PX̂ n (xn ) . (2.8.1)
xn ∈Bc
αn + τ βn ≥ αn∗ + τ βn∗ ,
1
lim − log2 βn∗ (ε) = D(PX PX̂ ),
n→∞ n
for any ε ∈ (0, 1), where βn∗ (ε) = minαn ≤ε βn , and αn and βn are the type I and type
II errors, respectively.
Proof Forward Part: In this part, we prove that there exists an acceptance region for
the null hypothesis such that
1
lim inf − log2 βn (ε) ≥ D(PX PX̂ ).
n→∞ n
Step 1: Divergence typical set. For any δ > 0, define the divergence typical set
as &
1 PX n (xn )
An (δ) := xn ∈ X n : log2 − D(PX P )
X̂
< δ .
n PX̂ n (xn )
PX n (An (δ)) → 1 as n → ∞.
44 2 Information Measures for Discrete Systems
Hence,
αn = PX n (Acn (δ)) < ε
Hence,
1 1
− log2 βn (ε) ≥ D(PX PX̂ ) − δ + log2 (1 − αn ),
n n
which implies that
1
lim inf − log2 βn (ε) ≥ D(PX PX̂ ) − δ.
n→∞ n
The above inequality is true for any δ > 0; therefore,
1
lim inf − log2 βn (ε) ≥ D(PX PX̂ ).
n→∞ n
Converse Part: We next prove that for any acceptance region Bn for the null hypoth-
esis satisfying the type I error constraint, i.e.,
αn (Bn ) = PX n (Bnc ) ≤ ε,
1
lim sup − log2 βn (Bn ) ≤ D(PX PX̂ ).
n→∞ n
2.8 Fundamentals of Hypothesis Testing 45
We have
Hence,
1 1 " #
− log2 βn (Bn ) ≤ D(PX PX̂ ) + δ + log2 1 − ε − PX n Acn (δ) ,
n n
" #
which, upon noting that limn→∞ PX n Acn (δ) = 0 (by the weak law of large num-
bers), implies that
1
lim sup − log2 βn (Bn ) ≤ D(PX PX̂ ) + δ.
n→∞ n
1
lim sup − log2 βn (Bn ) ≤ D(PX PX̂ ).
n→∞ n
Definition 2.50 (Rényi’s entropy) Given a parameter α > 0 with α
= 1, and given
a discrete random variable X with alphabet X and distribution PX , its Rényi entropy
of order α is given by
1
Hα (X ) = log PX (x)α . (2.9.1)
1−α x∈X
46 2 Information Measures for Discrete Systems
As in case of the Shannon entropy, the base of the logarithm determines the units;
if the base is D, Rényi’s entropy is in D-ary units. Other notations for Hα (X ) are
H (X ; α), Hα (PX ), and H (PX ; α).
Definition 2.51 (Rényi’s divergence) Given a parameter 0 < α < 1, and two dis-
crete random variables X and X̂ with common alphabet X and distribution PX and
PX̂ , respectively, then the Rényi divergence of order α between X and X̂ is given by
1 ) *
α
Dα (X X̂ ) = log PX (x)PX̂ (x) .
1−α
(2.9.2)
α−1 x∈X
This definition can be extended to α > 1 if PX̂ (x) > 0 for all x ∈ X . Other notations
for Dα (X X̂ ) are D(X X̂ ; α), Dα (PX PX̂ ) and D(PX PX̂ ; α).
As in the case of Shannon’s information measures, the base of the logarithm
indicates the units of the measure and can be changed from 2 to an arbitrary b > 1.
In the next lemma, whose proof is left as an exercise, we note that in the limit of
α tending to 1, Shannon’s entropy and divergence can be recovered from Rényi’s
entropy and divergence, respectively.
Lemma 2.52 When α → 1, we have the following:
lim Hα (X ) = H (X ) (2.9.3)
α→1
and
lim Dα (X X̂ ) = D(X X̂ ). (2.9.4)
α→1
Problems
H (X , f (X )) = H (X ) + H (f (X )|X ) = H (f (X )) + H (X |f (X )).
Hint: For (a), create example for I (X ; Y |Z) = 0 and I (X ; Y ) > 0. For (b),
create example for I (X ; Y ) = 0 and I (X ; Y |Z) > 0.
H (X ), H (Y ), H (X |Y ), H (Y |X ), H (X , Y ) and I (X ; Y ),
and indicate the quantities (in bits) for each area of the Venn diagram.
48 2 Information Measures for Discrete Systems
10. Maximal discrete entropy. Prove that, of all probability mass functions for a non-
negative integer-valued random variable with mean μ, the geometric distribution,
given by z
1 μ
PZ (z) = , for z = 0, 1, 2, . . . ,
1+μ 1+μ
H (X ) ≥ H (U ).
15. Provide examples for the following inequalities (see Definition 2.40 for the
definition of conditional divergence).
(a) D(PX |Z PX̂ |Ẑ |PZ ) > D(PX PX̂ ).
(b) D(PX |Z PX̂ |Ẑ |PZ ) < D(PX PX̂ ).
2.9 Rényi’s Information Measures 49
p 1−p
D(pq) := p log2 + (1 − p) log2
q 1−q
satisfies
(p − q)2
D(pq) ≤ log2 (e)
q(1 − q)
m
1 m
1
pi log ≤ pi log + log α.
i=1
pi i=1
qi
19. Let X and Y be jointly distributed discrete random variables. Show that
I (X ; Y ) ≥ I (f (X ); g(Y )),
and
QY (y) = PY |X (y|x)QX (x)
x∈X
m
qiλ pi1−λ ≤ 1,
i=1
for all i. This inequality is known as Hölder’s inequality. In the special case
of λ = 1/2, the bound is referred to as the Cauchy–Schwarz inequality.
2.9 Rényi’s Information Measures 51
Prove this inequality using (b), and show that equality holds iff for some
constant c,
1 1
pi aiλ = cpi bi1−λ
for all i.
Note: We refer the reader to [135, 176] for a variety of other useful inequal-
ities.
23. Inequality of arithmetic and geometric means:
(a) Show that
n
n
ai ln xi ≤ ln ai xi ,
i=1 i=1
n
x1a1 x2a2 . . . xnan ≤ ai xi .
i=1
24. Consider two distributions P(·) and Q(·) on the alphabet X = {a1 , . . . , ak } such
that Q(ai ) > 0 for all i = 1, . . . , k. Show that
k
(P(ai ))2
≥ 1.
i=1
Q(ai )
25. Let X be a discrete random variable with alphabet X and distribution PX . Let
f : X → R be a real-valued function, and let α be an arbitrary real number.
(a) Show that
−αf (x)
H (X ) ≤ αE[f (X ))] + log2 2 ,
x∈X
with equality iff PX (x) = A1 2−αf (x) for x ∈ X , where A := x∈X 2−αf (x) .
52 2 Information Measures for Discrete Systems
(b) Show that for a positive integer-valued random variable N (such that E[N ] >
1 without loss of generality), the following holds:
X̂ m := (g1 (Y ), g2 (Y ), . . . , gm (Y ))
where
u := PX̂ m (x̂m ).
x∈X x̂m ∈X m :x̂i =x for some i
Hint: Show that H (X |Y ) ≤ H (X |X̂ m ) and that H (X |X̂ m ) is less than the
right-hand side of the above inequality.
(b) Use (2.10.1) to deduce the following weaker version of Fano’s inequality
for list decoding (see [5], [216], [313, Appendix 3.E]):
27. Fano’s inequality for ternary partitioning of the observation space: In Problem
26, Pe(m) and u can actually be expressed as
Pe(m) = PX ,X̂ m (x, x̂m ) = PX ,Y (x, y)
x∈X x̂m ∈
/ Ux x∈X y∈Y
/ x
2.9 Rényi’s Information Measures 53
and
u= PX̂ m (x̂m ) = PY (y),
x∈X x̂m ∈ Ux x∈X y∈ Yx
respectively, where
' (
Ux := x̂m ∈ X m : x̂i = x for some i
and
Yx := {y ∈ Y : gi (y) = x for some i} .
Thus, given x ∈ X , Yx and Yxc form a binary partition on the observation space Y.
and
s := PY (y), t := PY (y), v := PY (y).
x∈X y∈Sx x∈X y∈Tx x∈X y∈Vx
where
1 1 1
H (p, q, r) = p log2 + q log2 + r log2 .
p q r
log2 (e) 2
I (X ; Y ) <
2
is a sufficient condition for Y to be
-independent from X , where I (X ; Y ) is the
mutual information (in bits) between X and Y .
29. Rényi’s entropy: Given a fixed positive integer n > 1, consider an n-ary valued
random variable X with alphabet X = {1, 2, . . . , n} and distribution described
by the probabilities pi := Pr[X = i], where pi > 0 for each i = 1, . . . , n. Given
α > 0 and α
= 1, the Rényi entropy of X (see Definition 2.9.1) is given by
n
1
α
Hα (X ) := log2 pi .
1−α i=1
and that
n
pir < 1 if r > 1.
i=1
Hint: Show that the function f (r) = ni=1 pir is decreasing in r, where r > 0.
(b) Show that
0 ≤ Hα (X ) ≤ log2 n.
Hint: Use (a) for the lower bound, and use Jensen’s inequality (with the
1
convex function f (y) = y 1−α , for y > 0) for the upper bound.
30. Rényi’s entropy and divergence: Consider two discrete random variables X and
X̂ with common alphabet X and distribution PX and PX̂ , respectively.
(a) Prove Lemma 2.52.
(b) Find a distribution Q on X in terms of α and PX such that the following
holds:
1
Hα (X ) = H (X ) + D(PX Q).
1−α
Chapter 3
Lossless Data Compression
PX (x = outcome A ) = 0.5;
PX (x = outcome B ) = 0.25;
PX (x = outcomeC ) = 0.25.
Suppose that a binary codebook is designed for this source, in which outcome A ,
outcome B , and outcomeC are, respectively, encoded as 0, 10, and 11. Then, the
average codeword length (in bits per source outcome) is
There are usually no constraints on the basic structure of a code. In the case where
the codeword length for each source outcome can be different, the code is called a
variable-length code. When the codeword lengths of all source outcomes are equal,
the code is referred to as a fixed-length code. It is obvious that the minimum average
codeword length among all variable-length codes is no greater than that among
all fixed-length codes, since the latter is a subclass of the former. We will see in
this chapter that the smallest achievable average code rate for variable-length and
fixed-length codes coincide for sources with good probabilistic characteristics, such
as stationarity and ergodicity. But for more general sources with memory, the two
quantities are different (e.g., see [172]).
For fixed-length codes, the sequence of adjacent codewords is concatenated
together for storage or transmission purposes, and some punctuation mechanism—
such as marking the beginning of each codeword or delineating internal sub-
blocks for synchronization between encoder and decoder—is normally considered
an implicit part of the codewords. Due to constraints on space or processing capa-
bility, the sequence of source symbols may be too long for the encoder to deal with
all at once; therefore, segmentation before encoding is often necessary. For example,
suppose that we need to encode using a binary code the grades of a class with 100
students. There are three grade levels: A, B, and C. By observing that there are 3100
possible grade combinations for 100 students, a straightforward code design requires
to encode these combinations (by enumerating them). Now suppose that the encoder
facility can only process 16 bits at a time. Then, the above code design becomes
infeasible and segmentation is unavoidable. Under such constraint, we may encode
grades of 10 students at a time, which requires
As a consequence, for a class of 100 students, the code requires 160 bits in total.
In the above example, the letters in the grade set {A, B, C} and the letters from the
code alphabet {0, 1} are often called source symbols and code symbols, respectively.
When the code alphabet is binary (as in the previous two examples), the code sym-
bols are referred to as code bits or simply bits (as already used). A tuple (or grouped
sequence) of source symbols is called a sourceword, and the resulting encoded tuple
consisting of code symbols is called a codeword. (In the above example, each source-
word consists of 10 source symbols (student grades) and each codeword consists of
16 bits.)
Note that, during the encoding process, the sourceword lengths do not have to be
equal. In this text, however, we only consider the case where the sourcewords have
a fixed length throughout the encoding process (except for the Lempel–Ziv code
briefly discussed at the end of this chapter), but we will allow the codewords to have
3.1 Principles of Data Compression 57
fixed or variable lengths as defined earlier.1 The block diagram of a source coding
system is depicted in Fig. 3.1.
When adding segmentation mechanisms to fixed-length codes, the codes can be
loosely divided into two groups. The first consists of block codes in which the encod-
ing (or decoding) of the next segment of source symbols is independent of the previ-
ous segments. If the encoding/decoding of the next segment, somehow, retains and
uses some knowledge of earlier segments, the code is called a fixed-length tree code.
As we will not investigate such codes in this text, we can use “block codes” and
“fixed-length codes” as synonyms.
In this chapter, we first consider data compression for block codes in Sect. 3.2.
Data compression for variable-length codes is then addressed in Sect. 3.3.
n
PX n (x1 , x2 , . . . , xn ) = PX (xi ).
i=1
1 In other
words, our fixed-length codes are actually “fixed-to-fixed length codes” and our variable-
length codes are “fixed-to-variable length codes” since, in both cases, a fixed number of source
symbols is mapped onto codewords with fixed and variable lengths, respectively.
58 3 Lossless Data Compression
Definition 3.2 An (n, M) block code with blocklength n and size M (which can
be a function of n in general,2 i.e., M = Mn ) for a discrete source {X n }∞ n=1 is a
set ∼Cn = {c1 , c2 , . . . , c M } ⊆ X n consisting of M reproduction (or reconstruction)
words, where each reproduction word is a sourceword (an n-tuple of source symbols).
To simplify the exposition, we make an abuse of notation by writing ∼Cn = (n, M)
to mean that ∼Cn is a block code with blocklength n and size M.
f : X n → {0, 1}k
is a retrieving operation that produces the reproduction words. Since the codewords
are binary-valued, such a block code is called a binary code. More generally, a
D-ary block code (where D > 1 is an integer) would use an encoding function
f : X n → {0, 1, . . . , D − 1}k where each codeword f (x n ) contains k D-ary code
symbols.
Furthermore, since the behavior of block codes is investigated
forsufficiently
large n and M (tending to infinity), it is legitimate to replace log2 M by log2 M
for the case of binary codes. With this convention, the data compression rate or code
rate is
k 1
= log2 M(in bits per source symbol).
n n
Similarly, for D-ary codes, the rate is
k 1
= log D M (in D-ary code symbols per source symbol).
n n
For computational convenience, nats (under the natural logarithm) can be used
instead of bits or D-ary code symbols; in this case, the code rate becomes
1
log M (in nats per source symbol).
n
2 Inthe literature, both (n, M) and (M, n) have been used to denote a block code with blocklength
n and size M. For example, [415, p. 149] adopts the former one, while [83, p. 193] uses the latter.
We use the (n, M) notation since M = Mn is a function of n in general.
3.2 Block Codes for Asymptotically Lossless Compression 59
(x1 , x2 , . . . , xn ) → cm ∈ {c1 , c2 , . . . , c M }.
This procedure will be repeated for each consecutive block of length n, i.e.,
1
− log2 PX n (X 1 , . . . , X n ) → H (X ) in probability.
n
Proof This theorem follows by first observing that for an i.i.d. sequence {X n }∞
n=1 ,
1
n
1
− log2 PX n (X 1 , . . . , X n ) = − log2 PX (X i )
n n i=1
∞
and that the sequence {− log2 PX (X i )}i=1 is i.i.d., and then applying the weak law
of large numbers on the latter sequence.
The AEP indeed constitutes an “information theoretic” analog of the weak law
∞
of large numbers as it states that if {− log2 PX (X i )}i=1 is an i.i.d. sequence, then for
any δ > 0,
n
1
Pr − log2 PX (X i ) − H (X ) ≤ δ → 1 as n → ∞.
n
i=1
As a consequence of the AEP, all the probability mass will be ultimately placed on
the weakly δ-typical set, which is defined as
3 When one uses an encoder–decoder pair ( f, g) to describe the block code, the code’s operation
can be expressed as cm = g( f (x n )).
4 This theorem, which is also called the entropy stability property, is due to Shannon [340],
is a constant6 independent of n.
To prove Property 3, we have from Property 1 that
1≥ PX n (x n ) ≥ 2−n(H (X )+δ) = |Fn (δ)|2−n(H (X )+δ) ,
x n ∈F n (δ) x n ∈F n (δ)
σ 2X
1−δ <1− ≤ PX n (x n ) ≤ 2−n(H (X )−δ) = |Fn (δ)|2−n(H (X )−δ) ,
nδ 2
x n ∈F n (δ) x n ∈F n (δ)
for n ≥ σ 2X /δ 3 .
Note that for any n > 0, a block code ∼Cn = (n, M) is said to be uniquely decodable
or completely lossless if its set of reproduction words is trivially equal to the set of
all source n-tuples: {c1 , c2 , . . . , c M } = X n . In this case, if we are binary-indexing
the reproduction words using an encoding–decoding pair ( f, g), every sourceword
x n will be assigned to a distinct binary codeword f (x n ) of length k = log2 M and
all the binary k-tuples are the image under f of some sourceword. In other words, f
is a bijective (injective and surjective) map and hence invertible with the decoding
map g = f −1 and M = |X |n = 2k . Thus, the code rate is (1/n) log2 M = log2 |X |
bits/source symbol.
Now the question becomes: can we achieve a better (i.e., smaller) compression
rate? The answer is affirmative: we can achieve a compression rate equal to the
source entropy H (X ) (in bits), which can be significantly smaller than log2 |X | when
this source is strongly nonuniformly distributed, if we give up unique decodability
(for every n) and allow n to be sufficiently large to asymptotically achieve lossless
reconstruction by having an arbitrarily small (but positive) probability of decoding
error
Pe (∼Cn ) := PX n {x n ∈ X n : g( f (x n )) = x n }.
6 Inthe proof, we assume that the variance σ 2X = Var[− log2 PX (X )] < ∞. This holds since the
source alphabet is finite:
Var[− log2 PX (X )] ≤ E[(log2 PX (X ))2 ] = PX (x)(log2 PX (x))2
x∈X
4 4
≤ [log2 (e)]2 = 2 [log2 (e)]2 × |X | < ∞.
e2 e
x∈X
62 3 Lossless Data Compression
Thus, block codes herein can perform data compression that is asymptotically
lossless with respect to blocklength; this contrasts with variable-length codes which
can be completely lossless (uniquely decodable) for every finite blocklength.
We now can formally state and prove Shannon’s asymptotically lossless source
coding theorem for block codes. The theorem will be stated for general D-ary block
codes, representing the source entropy H D (X ) in D-ary code symbol/source sym-
bol as the smallest (infimum) possible compression rate for asymptotically lossless
D-ary block codes. Without loss of generality, the theorem will be proved for the
case of D = 2. The idea behind the proof of the forward (achievability) part is
basically to binary-index the source sequence in the weakly δ-typical set Fn (δ) to
a binary codeword (starting from index one with corresponding k-tuple codeword
0 · · · 01); and to encode all sourcewords outside Fn (δ) to a default all-zero binary
codeword, which certainly cannot be reproduced distortionless due to its many-to-
one-mapping property. The resultant code rate is (1/n)log2 (|Fn (δ)| + 1) bits per
source symbol. As revealed in the Shannon–McMillan–Breiman AEP theorem and
its consequence, almost all the probability mass will be on Fn (δ) as n is sufficiently
large, and hence, the probability of non-reconstructable source sequences can be
made arbitrarily small. A simple example for the above coding scheme is illustrated
in Table 3.1. The converse part of the proof will establish (by expressing the proba-
bility of correct decoding in terms of the δ-typical set and also using the consequence
of the AEP) that for any sequence of D-ary codes with rate strictly below the source
entropy, their probability of error cannot asymptotically vanish (is bounded away
from zero). Actually, a stronger result is proven: it is shown that their probability of
error not only does not asymptotically vanish, it actually ultimately grows to 1 (this
is why we call this part a “strong” converse).
Theorem 3.6 (Shannon’s source coding theorem) Given integer D > 1, consider
a discrete memoryless source {X n }∞
n=1 with entropy H D (X ). Then the following hold.
• Forward part (achievability): For any 0 < ε < 1, there exists 0 < δ < ε and a
sequence of D-ary block codes {∼Cn = (n, Mn )}∞
n=1 with
1
lim sup log D Mn ≤ H D (X ) + δ (3.2.1)
n→∞ n
satisfying
Pe (∼Cn ) < ε (3.2.2)
for all sufficiently large n, where Pe (∼Cn ) denotes the probability of error (or
decoding error) for block code ∼Cn .7
7 Note that (3.2.2) is equivalent to lim supn→∞ Pe (∼Cn ) ≤ ε. Since ε can be made arbitrarily small,
the forward part actually indicates the existence of a sequence of D-ary block codes {∼Cn }∞ n=1
satisfying (3.2.1) such that lim supn→∞ Pe (∼Cn ) = 0. Based on this, the converse should be that any
sequence of D-ary block codes satisfying (3.2.3) satisfies lim supn→∞ Pe (∼Cn ) > 0. However, the
so-called strong converse actually gives a stronger consequence: lim supn→∞ Pe (∼Cn ) = 1 (as can
be made arbitrarily small).
3.2 Block Codes for Asymptotically Lossless Compression 63
Table 3.1 An example of the δ-typical set with n = 2 and δ = 0.4, where F2 (0.4) = {AB, AC,
BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC),
110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following
each binary codeword indicates those sourcewords that are encoded to this codeword. The source
distribution is PX (A) = 0.4, PX (B) = 0.3, PX (C) = 0.2, and PX (D) = 0.1
1 2
Source − log2 PX (xi ) − H (X ) Codeword Reconstructed
2
i=1 source sequence
AA 0.525 bits ∈
/ F2 (0.4) 000 Ambiguous
AB 0.317 bits ∈ F2 (0.4) 001 AB
AC 0.025 bits ∈ F2 (0.4) 010 AC
AD 0.475 bits ∈
/ F2 (0.4) 000 Ambiguous
BA 0.317 bits ∈ F2 (0.4) 011 BA
BB 0.109 bits ∈ F2 (0.4) 100 BB
BC 0.183 bits ∈ F2 (0.4) 101 BC
BD 0.683 bits ∈
/ F2 (0.4) 000 Ambiguous
CA 0.025 bits ∈ F2 (0.4) 110 CA
CB 0.183 bits ∈ F2 (0.4) 111 CB
CC 0.475 bits ∈
/ F2 (0.4) 000 Ambiguous
CD 0.975 bits ∈
/ F2 (0.4) 000 Ambiguous
DA 0.475 bits ∈
/ F2 (0.4) 000 Ambiguous
DB 0.683 bits ∈
/ F2 (0.4) 000 Ambiguous
DC 0.975 bits ∈
/ F2 (0.4) 000 Ambiguous
DD 1.475 bits ∈
/ F2 (0.4) 000 Ambiguous
• Strong converse part: For any 0 < ε < 1, any sequence of D-ary block codes
{∼Cn = (n, Mn )}∞
n=1 with
1
lim sup log D Mn < H D (X ) (3.2.3)
n→∞ n
satisfies
Pe (∼Cn ) > 1 − ε
Mn = |Fn (δ/2)| + 1 ≤ 2n(H (X )+δ/2) + 1 < 2 · 2n(H (X )+δ/2) < 2n(H (X )+δ)
for n > 2/δ. Hence, a sequence of ∼Cn = (n, Mn ) block code satisfying (3.2.1) is
established. It remains to show that the error probability for this sequence of (n, Mn )
block code can be made smaller than ε for all sufficiently large n.
By the Shannon–McMillan–Breiman AEP theorem,
δ
PX n (Fnc (δ/2)) < for all sufficiently large n.
2
Consequently, for those n satisfying the above inequality, and being bigger than 2/δ,
(For the last step, the reader can refer to Table 3.1 to confirm that only the “ambigu-
ous” sequences outside the typical set contribute to the probability of error.)
Strong Converse Part: Fix any sequence of block codes {∼Cn }∞
n=1 with
1
lim sup log2 |∼Cn | < H (X ).
n→∞ n
Let Sn be the set of source symbols that can be correctly decoded through the ∼Cn -
coding system. (A quick example is depicted in Fig. 3.2.) Then |Sn | = |∼Cn |. By
choosing δ small enough with ε/2 > δ > 0, and by definition of the limsup operation,
we have
1 1
(∃ N0 )(∀ n > N0 ) log2 |Sn | = log2 |∼Cn | < H (X ) − 2δ,
n n
which implies
|Sn | < 2n(H (X )−2δ) for n > N0 .
Source Symbols
Sn
Reproduction words
Fig. 3.2 Possible code ∼Cn and its corresponding Sn . The solid box indicates the decoding mapping
from ∼Cn back to Sn
1 − Pe (∼Cn ) = PX n (x n )
x n ∈Sn
= PX n (x n ) + PX n (x n )
x n ∈S n ∩Fn (δ)
c x n ∈S n ∩Fn (δ)
8 Note that it is clear from the statement and proof of the forward part of Theorem 3.6 that the source
entropy can be achieved as an asymptotic compression rate as long as (1/n) log D Mn approaches it
from above with increasing n. Furthermore, the asymptotic compression rate is defined as the limsup
of (1/n) log D Mn in order to guarantee reliable compression for n sufficiently large (analogously,
in channel coding, the asymptotic transmission rate is defined via the liminf of (1/n) log D Mn to
ensure reliable communication for all sufficiently large n, see Chap. 4).
66 3 Lossless Data Compression
n→∞ n→∞
Pe −→ 1 Pe −→ 0
for all block codes for the best data compression block code
HD (X) R̄
Fig. 3.3 Asymptotic compression rate R̄ versus source entropy H D (X ) and behavior of the prob-
ability of block decoding error as blocklength n goes to infinity for a discrete memoryless source
which there exists a sequence of D-ary block codes with asymptotically vanishing
(as the blocklength goes to infinity) probability of decoding error. Indeed to prove
that H D (X ) is such infimum, we decomposed the above theorem in two parts as per
the properties of the infimum; see Observation A.11.
whose size is prohibitively small and whose probability mass is asymptotically large.
Thus, if we can find such typical-like set for a source with memory, the source coding
theorem for block codes can be extended for this source. Indeed, with appropriate
modifications, the Shannon–McMillan–Breiman theorem can be generalized for the
class of stationary ergodic sources and hence a block source coding theorem for this
class can be established; this is considered in the next section. The block source
coding theorem for general (e.g., nonstationary non-ergodic) sources in terms of a
generalized “spectral” entropy measure is studied in [73, 172, 175] (see also the end
of the next section for a brief description).
In practice, a stochastic source used to model data often exhibits memory or statistical
dependence among its random variables; its joint distribution is hence not a product
of its marginal distributions. In this section, we consider the asymptotic lossless data
compression theorem for the class of stationary ergodic sources.9
Before proceeding to generalize the block source coding theorem, we need to first
generalize the “entropy” measure for a sequence of dependent random variables X n
(which certainly should be backward compatible to the discrete memoryless cases).
A straightforward generalization is to examine the limit of the normalized block
entropy of a source sequence, resulting in the concept of entropy rate.
9 The definitions of stationarity and ergodicity can be found in Sect. B.3 of Appendix B.
3.2 Block Codes for Asymptotically Lossless Compression 67
Next, we will show that the entropy rate exists for stationary sources (here, we
do not need ergodicity for the existence of entropy rate).
H (X n |X n−1 , . . . , X 1 )
is nonincreasing in n and also bounded from below by zero. Hence, by Lemma A.20,
the limit
lim H (X n |X n−1 , . . . , X 1 )
n→∞
exists.
Proof We have
where (3.2.4) follows since conditioning never increases entropy, and (3.2.5) holds
because of the stationarity assumption. Finally, recall that each conditional entropy
H (X n |X n−1 , . . . , X 1 ) is nonnegative.
n
Lemma 3.10 (Cesaro-mean theorem) If an → a as n → ∞ and bn = n1 i=1 ai ,
then bn → a as n → ∞.
Proof an → a implies that for any ε > 0, there exists an N such that for all n > N ,
|an − a| < ε. Then
n
1
|bn − a| = (ai − a)
n
i=1
1
n
≤ |ai − a|
n i=1
68 3 Lossless Data Compression
1 1
N n
= |ai − a| + |ai − a|
n i=1 n i=N +1
1
N
n−N
≤ |ai − a| + ε.
n i=1 n
Hence, limn→∞ |bn − a| ≤ ε. Since ε can be made arbitrarily small, the lemma
holds.
Theorem 3.11 The entropy rate of a stationary source {X n }∞ n=1 always exists and
is equal to
H (X ) = lim H (X n |X n−1 , . . . , X 1 ).
n→∞
1
n
1
H (X ) =
n
H (X i |X i−1 , . . . , X 1 ) (chain rule for entropy)
n n i=1
1
H (X ) = lim H (X n ) = H (X ).
n→∞ n
1
H (X ) = lim H (X n ) = lim H (X n |X n−1 , . . . , X 1 ) = H (X 2 |X 1 ),
n→∞ n n→∞
where
H (X 2 |X 1 ) = − π(x1 )PX 2 |X 1 (x2 |x1 ) log PX 2 |X 1 (x2 |x1 ),
x1 ∈X x2 ∈X
and π(·) is a stationary distribution for the Markov source (note that π(·) is unique if
the Markov source is irreducible11 ). For example, for the stationary binary Markov
10 If a Markov source is mentioned without specifying its order, it is understood that it is a first-order
Markov source; see Appendix B for a brief overview on Markov sources and their properties.
11 See Sect. B.3 of Appendix B for the definition of irreducibility for Markov sources.
3.2 Block Codes for Asymptotically Lossless Compression 69
β α
H (X ) = h b (α) + h b (β),
α+β α+β
1
lim D(PX n PX̂ n )
n→∞ n
provided the limit exists.12 The divergence rate is not guaranteed to exist in general;
in [350], two examples of non-Markovian ergodic sources are given for which the
divergence rate does not exist. However, if the source { X̂ i } is time-invariant Markov
and {X i } is stationary, then the divergence rate exists and is given in terms of the
entropy rate of {X i } and another quantity depending on the (second-order) statistics
of {X i } and { X̂ i } as follows [157, p. 40]:
1 1
lim D(PX n PX̂ n ) = − lim H (X n ) − PX 1 X 2 (x1 , x2 ) log2 PX̂ 2 | X̂ 1 (x2 |x1 ).
n→∞ n n→∞ n
x1 ∈Xx2 ∈X
(3.2.6)
Furthermore, if both {X i } and { X̂ i } are time-invariant irreducible Markov sources,
then their divergence rate exists and admits the following expression [312, Theo-
rem 1]:
1 PX 2 |X 1 (x2 |x1 )
lim D(PX n PX̂ n ) = π X (x1 )PX 2 |X 1 (x2 |x1 ) log2 ,
n→∞ n
x ∈X x ∈X
PX̂ 2 | X̂ 1 (x2 |x1 )
1 2
where π X (·) is the stationary distribution of {X i }. The above result can also be
generalized using the theory of nonnegative matrices and Perron–Frobenius theory
for {X i } and { X̂ i } being arbitrary (not necessarily irreducible, stationary, etc.) time-
invariant Markov chains; see the explicit computable expression in [312, Theorem 2].
A direct consequence of the later result is a formula for the entropy rate of an arbitrary
not necessarily stationary time-invariant Markov source [312, Corollary 2].13
Finally, note that all the above results also hold with the proper modifications if
the Markov chains are replaced with kth-order Markov chains (for any integer k > 1)
[312].
1 a.s.
− log2 PX n (X 1 , . . . , X n ) −→ H (X ).
n
Since the AEP theorem (law of large numbers) is valid for stationary ergodic
sources, all consequences of the AEP will follow, including Shannon’s lossless source
coding theorem.
Theorem 3.15 (Shannon’s source coding theorem for stationary ergodic sources)
Given integer D > 1, let {X n }∞
n=1 be a stationary ergodic source with entropy rate
(in base D)
1
H D (X ) := lim H D (X n ).
n→∞ n
1
lim sup log D Mn < H D (X ) + δ,
n→∞ n
Pe (∼Cn ) < ε
1
lim sup log D Mn < H D (X )
n→∞ n
satisfies
Pe (∼Cn ) > 1 − ε
{X i } and { X̂ i } exist and admit closed-form expressions [311] (see also the earlier work in [279],
where the results hold for more restricted classes of Markov sources).
3.2 Block Codes for Asymptotically Lossless Compression 71
A discrete memoryless (i.i.d.) source is stationary and ergodic (so Theorem 3.6
is clearly a special case of Theorem 3.15). In general, it is hard to check whether a
stationary process is ergodic or not. It is known though that if a stationary process is
a mixture of two or more stationary ergodic processes, i.e., its n-fold distribution can
be written as the mean (with respect to some distribution) of the n-fold distributions
of stationary ergodic processes, then it is not ergodic.14
For example, let P and Q be two distributions on a finite alphabet X such that
the process {X n }∞ ∞
n=1 is i.i.d. with distribution P and the process {Yn }n=1 is i.i.d. with
distribution Q. Flip a biased coin (with Heads probability equal to θ, 0 < θ < 1)
once and let
X n , if Heads,
Zn =
Yn , if Tails,
PZ n (a n ) = θ PX n (a n ) + (1 − θ)PY n (a n ) (3.2.7)
In this model, a red ball in the urn can represent an infected person in the population
and a black ball can represent a healthy person. Since the number of balls of the color
just drawn increases (while the number of balls of the opposite color is unchanged),
14 The converse is also true; i.e., if a stationary process cannot be represented as a mixture of
the likelihood that a ball of the same color as the ball just drawn will be picked in
the next draw increases. Hence, the occurrence of an “unfavorable” event (say an
infection) increases the probability of future unfavorable events (the same applies for
favorable events) and as a result the model provides a basic template for characterizing
contagious phenomena.
For any n ≥ 1, the n-fold distribution of the binary process {Z n }∞
n=1 can be derived
in closed form as follows:
ρ(ρ + δ) · · · (ρ + (d − 1)δ)σ(σ + δ) · · · (σ + (n − d − 1)δ)
Pr[Z n = a n ] =
(1 + δ)(1 + 2δ) · · · (1 + (n − 1)δ)
( 1δ ) ( ρδ + d) ( σδ + n − d)
= (3.2.8)
( ρδ ) ( σδ ) ( 1δ + n)
n−1
( αβ + n)
(α + jβ) = β n
j=0
( αβ )
which is obtained using the fact that (x + 1) = x (x). We remark from expression
(3.2.8) for the joint distribution that the process {Z n } is exchangeable15 and is thus
stationary. Furthermore, it can be shown [120, 306] that the process sample average
1
n
(Z 1 + Z 2 + · · · + Z n ) converges almost surely as n → ∞ to a random variable
Z , whose distribution is given by the beta distribution with parameters ρ/δ = R/
and σ/δ = B/. This directly implies that the process {Z n }∞ n=1 is not ergodic since
its sample average does not converge to a constant. It is also shown in [12] that the
entropy rate of {Z n }∞n=1 is given by
1
H (Z) = E Z [h b (Z )] = h b (z) f Z (z)dz
0
15 A process {Z ∞
n }n=1 is called exchangeable (or symmetrically dependent) if for every finite positive
integer n, the random variables Z 1 , Z 2 , . . . , Z n have the property that their joint distribution is
invariant with respect to all permutations of the indices 1, 2, . . . , n (e.g., see [120]). The notion of
exchangeability is originally due to de Finetti [90]. It directly follows from the definition that an
exchangeable process is stationary.
3.2 Block Codes for Asymptotically Lossless Compression 73
( 1δ )
( ρδ )( σδ )
z ρ/δ−1 (1 − z)σ/δ−1 , if 0 < z < 1,
f Z (z) =
0, otherwise,
is the beta probability density function with parameters ρ/δ and σ/δ. Note that
Theorem 3.15 does not hold for the contagion source {Z n } since it is not ergodic.
Finally, letting 0 ≤ Rn ≤ 1 denote the proportion of red balls in the urns after the
nth draw, we can write
R + (Z 1 + Z 2 + · · · + Z n )
Rn = (3.2.9)
T + n
Rn−1 (T + (n − 1)) + Z n
= . (3.2.10)
T + n
almost surely, and thus {Rn } is a martingale (e.g., [120, 162]). Since {Rn } is bounded,
we obtain by the martingale convergence theorem that Rn converges almost surely
to some limiting random variable. But from (3.2.9), we note that the asymptotic
behavior of Rn is identical to that of n1 (Z 1 + Z 2 + · · · + Z n ). Thus, Rn also converges
almost surely to the above beta-distributed random variable Z .
In [12], a binary additive noise channel, whose noise is the above Polya contagion
process {Z n }∞
n=1 , is investigated as a model for a non-ergodic communication channel
with memory. Polya’s urn scheme has also been applied and generalized in a wide
range of contexts, including genetics [210], evolution and epidemiology [257, 289],
image segmentation [35], and network epidemics [182] (see also the survey in [289]).
Example 3.17 (Finite-memory Polya contagion process [12]) The above Polya
model has “infinite” memory in the sense that the very first ball drawn from the
urn has an identical effect (that does not vanish as the number of draws grows with-
out bound) as the 999 999th ball drawn from the urn on the outcome of the millionth
draw. In the context of modeling contagious phenomena, this is not reasonable as one
would assume that the effects of an infection dissipate in time. We herein consider a
more realistic urn model with finite memory [12].
Consider again an urn originally containing T = R + B balls, of which R are red
and B are black. At the nth draw, n = 1, 2, . . ., a ball is selected at random from the
urn and replaced with 1 + balls of the same color just drawn ( > 0). Then, M
draws later, i.e., after the (n + M)th draw, balls of the color picked at the nth draw
are retrieved from the urn.
74 3 Lossless Data Compression
Note that in this model, the total number of balls in the urn is constant (T + M)
after an initialization period of M draws. Also, in this scheme, the effect of any draw
is limited to M draws in the future. The process {Z n }∞ n=1 again corresponds to the
outcome of the draws:
1, if the nth ball drawn is red
Zn =
0, if the nth ball drawn is black.
R + (z n−1 + · · · + z n−M )
Pr Z n = 1|Z n−1 = z n−1 , . . . , Z 1 = z 1 =
T + M
are derived in closed form in terms of R/T , /T , and M. Furthermore, it is shown
that {Z n }∞
n=1 is irreducible, and hence ergodic. Thus, Theorem 3.15 applies for this
finite-memory Polya contagion process. In [420], a generalized version of this process
is introduced via a ball sampling mechanism involving a large urn and a finite queue.
Shannon’s block source coding theorem establishes that the smallest data com-
pression rate for achieving arbitrarily small error probability for stationary ergodic
sources is given by the entropy rate. Thus, one can define the source redundancy as
the reduction in coding rate one can achieve via asymptotically lossless block source
coding versus just using uniquely decodable (completely lossless for any value of
the sourceword blocklength n) block source coding. In light of the fact that the for-
mer approach yields a source coding rate equal to the entropy rate while the latter
approach provides a rate of log2 |X |, we therefore define the total block source
coding redundancy ρt (in bits/source symbol) for a stationary ergodic source
{X n }∞
n=1 as
ρt := log2 |X | − H (X ).
Hence, ρt represents the amount of “useless” (or superfluous) statistical source infor-
mation one can eliminate via binary16 block source coding.
If the source is i.i.d. and uniformly distributed, then its entropy rate is equal
to log2 |X | and as a result its redundancy is ρt = 0. This means that the source is
incompressible, as expected, since in this case every sourceword x n will belong to
the δ-typical set Fn (δ) for every n > 0 and δ > 0 (i.e., Fn (δ) = X n ), and hence there
are no superfluous sourcewords that can be dispensed of via source coding. If the
source has memory or has a nonuniform marginal distribution, then its redundancy
is strictly positive and can be classified into two parts:
ρm := H (X 1 ) − H (X ).
ρt = ρd + ρm .
16 Since we are measuring ρ in code bits/source symbol, all logarithms in its expression are in base
t
2, and hence this redundancy can be eliminated via asymptotically lossless binary block codes (one
can also change the units to D-ary code symbol/source symbol using base-D logarithms for the
case of D-ary block codes).
76 3 Lossless Data Compression
Source ρd ρm ρt
i.i.d. uniform 0 0 0
i.i.d. nonuniform log2 |X | − H (X 1 ) 0 ρd
First-order symmetric
0 H (X 1 ) − H (X 2 |X 1 ) ρm
Markova
First-order non
log2 |X | − H (X 1 ) H (X 1 ) − H (X 2 |X 1 ) ρd + ρm
symmetric Markov
a A first-order Markov process is symmetric if for any x and x̂ ,
1 1
{a : a = PX 2 |X 1 (y|x1 ) for some y} = {a : a = PX 2 |X 1 (y|x̂1 ) for some y}.
Definition 3.19 Consider a discrete source {X n }∞ n=1 with finite alphabet X along
with a D-ary code alphabet B = {0, 1, . . . , D − 1}, where D > 1 is an integer. Fix
integer n ≥ 1; then a D-ary nth-order variable-length code (VLC) is a function
f : X n → B∗
C = f (X n ) = { f (x n ) ∈ B ∗ : x n ∈ X n }.
or equivalently,
3.3 Variable-Length Codes for Lossless Data Compression 77
implies that
m = m and x nj = y mj for j = 1, . . . , m.
Note that a non-singular VLC is not necessarily uniquely decodable. For example,
consider a binary (first-order) code for the source with alphabet
X = {A, B, C, D, E, F}
given by
code of A = 0,
code of B = 1,
code of C = 00,
code of D = 01,
code of E = 10,
code of F = 11.
The above code is clearly non-singular; it is, however, not uniquely decodable
because the codeword sequence, 010, can be reconstructed as AB A, D A, or AE (i.e.,
( f (A), f (B), f (A)) = ( f (D), f (A)) = ( f (A), f (E)) even though (A, B, A),
(D, A), and (A, E) are all non-equal).
One important objective is to find out how “efficiently” we can represent a given
discrete source via a uniquely decodable nth-order VLC and provide a construction
technique that (at least asymptotically, as n → ∞) attains the optimal “efficiency.”
In other words, we want to determine what is the smallest possible average code rate
(or equivalently, the smallest average codeword length) that an nth-order uniquely
decodable VLC can have when (losslessly) representing a given source, and we want
to give an explicit code construction that can attain this smallest possible rate (at
least asymptotically in the sourceword length n).
f : X n → {0, 1, . . . , D − 1}∗
and its average code rate (in D-ary code symbols/source symbol) is given by
78 3 Lossless Data Compression
1
R n := = PX n (x n )(cx n ).
n n x n ∈X n
The following theorem provides a strong condition that a uniquely decodable code
must satisfy.17
Theorem 3.21 (Kraft inequality for uniquely decodable codes) Let C be a uniquely
decodable D-ary nth-order VLC for a discrete source {X n }∞ n=1 with alphabet X . Let
the M = |X |n codewords of C have lengths 1 , 2 , . . . , M , respectively. Then, the
following inequality must hold:
M
D −m ≤ 1.
m=1
c1 c2 c3 . . . c N .
Consider ⎛ ⎞
⎝ ··· D −[(c1 )+(c2 )+···+(cN )] ⎠ .
c1 ∈C c2 ∈C c N ∈C
(Note that |C| = M.) On the other hand, all the code sequences with length
contribute equally to the sum of the identity, which is D −i . Let Ai denote the number
of N -codeword sequences that have length i. Then, the above identity can be rewritten
as M N
LN
−m
D = Ai D −i ,
m=1 i=1
where
L := max (c).
c∈C
The proof is completed by noting that the above inequality holds for every N , and
the upper bound (L N )1/N goes to 1 as N goes to infinity.
The Kraft inequality is a very useful tool, especially for showing that the fun-
damental lower bound of the average rate of uniquely decodable VLCs for discrete
memoryless sources is given by the source entropy.
Theorem 3.22 The average rate of every uniquely decodable D-ary nth-order VLC
for a discrete memoryless source {X n }∞
n=1 is lower bounded by the source entropy
H D (X ) (measured in D-ary code symbols/source symbol).
Proof Consider a uniquely decodable D-ary nth-order VLC code for the source
{X n }∞
n=1
f : X n → {0, 1, . . . , D − 1}∗
and let (cx n ) denote the length of the codeword cx n = f (x n ) for sourceword x n .
Then
1 1
R n − H D (X ) = PX n (x n )(cx n ) − H D (X n )
n x n ∈X n n
1
= PX n (x )(cx n ) −
n
−PX n (x ) log D PX n (x )
n n
n x n ∈X n x n ∈X n
1 PX n (x n )
= PX n (x n ) log D −(c n )
n x n ∈X n D x
1 x n ∈X n PX n (x )
n
≥ PX n (x ) log D
n
−(cx n )
n x n ∈X n x n ∈X n D
(log-sum inequality)
80 3 Lossless Data Compression
1
−(cx n )
= − log D
n x n ∈X n
≥ 0,
where the last inequality follows from the Kraft inequality for uniquely decodable
codes and the fact that the logarithm is a strictly increasing function.
By examining the above proof, we observe that
R n = H D (X ) iff PX n (x n ) = D −l(cx n ) ;
i.e., the source symbol probabilities are (negative) integer powers of D. Such a source
is called D-adic [83]. In this case, the code is called absolutely optimal as it achieves
the source entropy lower bound (it is thus optimal in terms of yielding a minimal
average code rate for any given n).
Furthermore, we know from the above theorem that the average code rate is no
smaller than the source entropy. Indeed, a lossless data compression code, whose
average code rate achieves entropy, should be optimal (note that if a code’s average
rate is below entropy, then the Kraft inequality is violated and the code is no longer
uniquely decodable). In summary, we have
• Uniquely decodability =⇒ the Kraft inequality holds.
• Uniquely decodability =⇒ average code rate of VLCs for memoryless sources
is lower bounded by the source entropy.
Exercise 3.23
1. Find a non-singular and also non-uniquely decodable code that violates the Kraft
inequality. (Hint: The answer is already provided in this section.)
2. Find a non-singular and also non-uniquely decodable code that beats the entropy
lower bound.
A prefix code18 is a VLC which is self-punctuated in the sense that there is no need
to append extra symbols for differentiating adjacent codewords. A more precise
definition follows:
Definition 3.24 (Prefix code) A VLC is called a prefix code or an instantaneous
code if no codeword is a prefix of any other codeword.
A prefix code is also named an instantaneous code because the codeword sequence
can be decoded instantaneously (it is immediately recognizable) without the refer-
ence to future codewords in the same sequence. Note that a uniquely decodable
Prefix
codes
10
(1)
110
(11)
1110
(111)
1111
code is not necessarily prefix-free and may not be decoded instantaneously. The
relationship between different codes encountered thus far is depicted in Fig. 3.4.
A D-ary prefix code can be represented graphically as an initial segment of a
D-ary tree. An example of a tree representation for a binary (D = 2) prefix code is
shown in Fig. 3.5.
Theorem 3.25 (Kraft inequality for prefix codes) There exists a D-ary nth-order
prefix code for a discrete source {X n }∞n=1 with alphabet X iff the codewords of length
m , m = 1, . . . , M, satisfy the Kraft inequality, where M = |X |n .
Proof Without loss of generality, we provide the proof for the case of D = 2 (binary
codes).
82 3 Lossless Data Compression
A tree has originally 2max nodes on level max . Each codeword of length m obstructs
2max −m nodes on level max . In other words, when any node is chosen as a codeword,
all its children will be excluded from being codewords (as for a prefix code, no
codeword can be a prefix of any other code). There are exactly 2max −m excluded
nodes on level max of the tree. Note that no two codewords obstruct the same nodes
on level max . Hence, the number of totally obstructed codewords on level max should
be less than 2max , i.e.,
M
2max −m ≤ 2max ,
m=1
M
2−m ≤ 1.
m=1
(This part can also be proven by stating the fact that a prefix code is a uniquely
decodable code. The objective of adding this proof is to illustrate the characteristics
of a tree-like prefix code.)
Converse part: Kraft inequality implies the existence of a prefix code.
Suppose that 1 , 2 , . . . , M satisfy the Kraft inequality. We will show that there
exists a binary tree with M selected nodes where the ith node resides on level i .
Let n i be the number of nodes (among the M nodes) residing on level i (namely,
n i is the number of codewords with length i or n i = |{m : m = i}|), and let
max := max m .
1≤m≤M
The above inequality can be rewritten in a form that is more suitable for this proof
as follows:
n 1 2−1 ≤ 1
n 1 2−1 + n 2 2−2 ≤ 1
..
.
n 1 2−1 + n 2 2−2 + · · · + n max 2−max ≤ 1.
3.3 Variable-Length Codes for Lossless Data Compression 83
Hence,
n1 ≤ 2
n 2 ≤ 22 − n 1 21
..
.
n max ≤ 2max − n 1 2max −1 − · · · − n max −1 21 ,
which can be interpreted in terms of a tree model as follows: the first inequality
says that the number of codewords of length 1 is less than the available number of
nodes on the first level, which is 2. The second inequality says that the number of
codewords of length 2 is less than the total number of nodes on the second level,
which is 22 , minus the number of nodes obstructed by the first-level nodes already
occupied by codewords. The succeeding inequalities demonstrate the availability of
a sufficient number of nodes at each level after the nodes blocked by shorter length
codewords have been removed. Because this is true at every codeword length up to
the maximum codeword length, the assertion of the theorem is proved.
Theorems 3.21 and 3.25 unveil the following relation between a variable-length
uniquely decodable code and a prefix code.
Corollary 3.26 A uniquely decodable D-ary nth-order code can always be replaced
by a D-ary nth-order prefix code with the same average codeword length (and hence
the same average code rate).
The following theorem interprets the relationship between the average code rate
of a prefix code and the source entropy.
Theorem 3.27 Consider a discrete memoryless source {X n }∞
n=1 .
1. For any D-ary nth-order prefix code for the source, the average code rate is no
less than the source entropy H D (X ).
2. There must exist a D-ary nth-order prefix code for the source whose average
code rate is no greater than H D (X ) + n1 , namely,
1 1
R n := PX n (x n )(cx n ) ≤ H D (X ) + , (3.3.1)
n x n ∈X n n
where cx n is the codeword for sourceword x n , and (cx n ) is the length of codeword
cx n .
Proof A prefix code is uniquely decodable, and hence it directly follows from
Theorem 3.22 that its average code rate is no less than the source entropy.
To prove the second part, we can design a prefix code satisfying both (3.3.1) and
the Kraft inequality, which immediately implies the existence of the desired code by
Theorem 3.25. Choose the codeword length for sourceword x n as
Then
D −(cx n ) ≤ PX n (x n ).
which is exactly the Kraft inequality. On the other hand, (3.3.2) implies
(cx n ) ≤ − log D PX n (x n ) + 1,
−0.8 · log2 0.8 − 0.1 · log2 0.1 − 0.1 · log2 0.1 = 0.92 bits.
as the source is memoryless. Then, an optimal binary prefix codes for the source is
given by
c(A A) = 0
c(AB) = 100
c(AC) = 101
c(B A) = 110
c(B B) = 111100
c(BC) = 111101
c(C A) = 1110
c(C B) = 111110
c(CC) = 111111.
f : X n → {0, 1, . . . , D − 1}∗
R n < H D (X ) + ε
f : X n → {0, 1, . . . , D − 1}∗
86 3 Lossless Data Compression
1
H D (X ) := lim H D (X n ),
n→∞ n
measured in D-ary units. The proof is very similar to the proofs of Theorems 3.22
and 3.27 with slight modifications (such as using the fact that n1 H D (X n ) is nonin-
creasing with n for stationary sources).
Observation 3.30 (Rényi’s entropy and lossless data compression) In the lossless
variable-length source coding theorem, we have chosen the criterion of minimizing
the average codeword length. Implicit in the use of average codeword length as a
performance criterion is the assumption that the cost of compression varies linearly
with codeword length. This is not always the case as in some applications, where the
processing cost of decoding may be elevated and buffer overflows caused by long
codewords can cause problems, an exponential cost/penalty function for codeword
lengths can be more appropriate than a linear cost function [54, 67, 206]. Naturally,
one would desire to choose a generalized function with exponential costs such that
the familiar linear cost function (given by the average codeword length) is a special
limiting case.
Indeed in [67], given a D-ary nth-order VLC C
f : X n → {0, 1, . . . , D − 1}∗
1
L n (t) ≤ Hα (X ) + ε
n
for n sufficiently large.
• Conversely, it is not possible to find a uniquely decodable code whose average
code rate of order t is less than Hα (X ).
Noting (by Lemma 2.52) that the Rényi entropy of order α reduces to the Shannon
entropy (in D-ary units) as α → 1, the above theorem reduces to Theorem 3.28 as
α → 1 (or equivalently, as t → 0). Finally, in [309, Sect. 4.4], [310, 311], the above
source coding theorem is extended for time-invariant Markov sources in terms of the
Rényi entropy rate, limn→∞ n1 Hα (X n ) with α = (1 + t)−1 , which exists and can be
calculated in closed form for such sources.
f : X → {0, 1}∗ ,
where optimality is in the sense that the code’s average codeword length (or equiv-
alently, its average code rate) is minimized over the class of all binary uniquely
decodable codes for the source. Note that finding optimal nth-order codes with n > 1
follows directly by considering X n as a new source with expanded alphabet (i.e., by
mapping n source symbols at a time).
By Corollary 3.26, we remark that in our search for optimal uniquely decodable
codes, we can restrict our attention to the (smaller) class of optimal prefix codes.
We thus proceed by observing the following necessary conditions of optimality for
binary prefix codes.
Lemma 3.32 Let C be an optimal binary prefix code with codeword lengths i , i =
1, . . . , M, for a source with alphabet X = {a1 , . . . , a M } and symbol probabilities
p1 , . . . , p M . We assume, without loss of generality, that
p1 ≥ p2 ≥ p3 ≥ · · · ≥ p M ,
and that any group of source symbols with identical probability is listed in order of
increasing codeword length (i.e., if pi = pi+1 = · · · = pi+s , then i ≤ i+1 ≤ · · · ≤
i+s ). Then the following properties hold.
1. Higher probability source symbols have shorter codewords: pi > p j implies
i ≤ j , for i, j = 1, . . . , M.
2. The two least probable source symbols have codewords of equal length:
M−1 = M .
3. Among the codewords of length M , two of the codewords are identical except
in the last digit.
Proof
(1) If pi > p j and i > j , then it is possible to construct a better code C by inter-
changing (“swapping”) codewords i and j of C, since
(C ) − (C) = pi j + p j i − ( pi i + p j j )
= ( pi − p j )( j − i )
< 0.
Hence, code C is better than code C, contradicting the fact that C is optimal.
(2) We first know that M−1 ≤ M , since
• If p M−1 > p M , then M−1 ≤ M by result 1 above.
• If p M−1 = p M , then M−1 ≤ M by our assumption about the ordering of
codewords for source symbols with identical probability.
Now, if M−1 < M , we may delete the last digit of codeword M, and the deletion
cannot result in another codeword since C is a prefix code. Thus, the deletion
3.3 Variable-Length Codes for Lossless Data Compression 89
forms a new prefix code with a better average codeword length than C, contra-
dicting the fact that C is optimal. Hence, we must have that M−1 = M .
(3) Among the codewords of length M , if no two codewords agree in all digits
except the last, then we may delete the last digit in all such codewords to obtain
a better codeword.
The above observation suggests that if we can construct an optimal code for
the entire source except for its two least likely symbols, then we can construct an
optimal overall code. Indeed, the following lemma due to Huffman [195] follows
from Lemma 3.32.
Lemma 3.33 (Huffman) Consider a source with alphabet X = {a1 , . . . , a M } and
symbol probabilities p1 , . . . , p M such that p1 ≥ p2 ≥ · · · ≥ p M . Consider the
reduced source alphabet Y = {a1 , . . . , a M−2 , a M−1,M } obtained from X , where the
first M − 2 symbols of Y are identical to those in X and symbol a M−1,M has proba-
bility p M−1 + p M and is obtained by combining the two least likely source symbols
a M−1 and a M of X . Suppose that C , given by f : Y → {0, 1}∗ , is an optimal prefix
code for the reduced source Y. We now construct a prefix code C, f : X → {0, 1}∗ ,
for the original source X as follows:
• The codewords for symbols a1 , a2 , . . . , a M−2 are exactly the same as the corre-
sponding codewords in C :
• The codewords associated with symbols a M−1 and a M are formed by appending
a “0” and a “1”, respectively, to the codeword f (a M−1,M ) associated with the
letter a M−1,M in C :
X = {1, 2, 3, 4, 5, 6}
90 3 Lossless Data Compression
(00) 00 00 00 0
0.25 0.25 0.25 0.25 0.5 1.0
(01) 01 01 01
0.25 0.25 0.25 0.25
(10) 10 10 1 1
0.25 0.25 0.25 0.5 0.5
(110) 110 11
0.1 0.1 0.25
(1110) 111
0.1 0.15
(1111)
0.05
and symbol probabilities 0.25, 0.25, 0.25, 0.1, 0.1, and 0.05, respectively. By follow-
ing the Huffman encoding procedure as shown in Fig. 3.6, we obtain the Huffman
code as
00, 01, 10, 110, 1110, 1111.
Observation 3.35
• Huffman codes are not unique for a given source distribution; e.g., by inverting all
the code bits of a Huffman code, one gets another Huffman code, or by resolving
ties in different ways in the Huffman algorithm, one also obtains different Huffman
codes (but all of these codes have the same minimal R n ).
• One can obtain optimal codes that are not Huffman codes; e.g., by interchanging
two codewords of the same length of a Huffman code, one can get another non-
Huffman (but optimal) code. Furthermore, one can construct an optimal suffix code
(i.e., a code in which no codeword can be a suffix of another codeword) from a
Huffman code (which is a prefix code) by reversing the Huffman codewords.
• Binary Huffman codes always satisfy the Kraft inequality with equality (their code
tree is “saturated”); e.g., see [87, p. 72].
• Any nth-order binary Huffman code f : X n → {0, 1}∗ for a stationary source
{X n }∞
n=1 with finite alphabet X satisfies
1 1 1
H (X ) ≤ H (X n ) ≤ R n < H (X n ) + .
n n n
3.3 Variable-Length Codes for Lossless Data Compression 91
|X | = 1 (modulo D − 1).
and
1
F̄(x) := PX (a) + PX (x).
a<x
2
F̄(x) = .c1 c2 . . . ck . . . ,
and take the first k (fractional) bits as the codeword of source symbol x, i.e.,
(c1 , c2 , . . . , ck ),
F(x) ≥ .c1 . . . ck .
1 1
k
≤ PX (x)
2 2
PX (x)
= PX (a) + − PX (a)
a<x
2 a≤x−1
Hence,
1 1 1
In addition,
F(x) ≥ .c1 c2 . . . ck .
Average codeword length:
3.3 Variable-Length Codes for Lossless Data Compression 93
1
¯ = PX (x) log2 +1
x∈X
PX (x)
1
< PX (x) log2 +2
x∈X
PX (x)
= H (X ) + 2 bits.
In Sect. 3.3.3, we assume that the source distribution is known. Thus, we can use either
Huffman codes or Shannon–Fano–Elias codes to compress the source. What if the
source distribution is not a known priori? Is it still possible to establish a completely
lossless data compression code which is universally good (or asymptotically optimal)
for all sources of interest? The answer is affirmative. Examples of such universal
codes are adaptive Huffman codes [136], arithmetic codes [242, 243, 322] (which
are based on the Shannon–Fano–Elias code), and Lempel–Ziv codes [404, 430, 431],
which are efficiently employed in various forms in many multimedia compression
packages and standards. We herein give a brief and basic description of adaptive
Huffman and Lempel–Ziv codes.
(A) Adaptive Huffman Codes
A straightforward universal coding scheme is to use the empirical distribution (or
relative frequencies) as the true distribution, and then apply the optimal Huffman code
according to the empirical distribution. If the source is i.i.d., the relative frequencies
will converge to its true marginal probability. Therefore, such universal codes should
be good for all i.i.d. sources. However, in order to get an accurate estimation of the
true distribution, one must observe a sufficiently long source sequence under which
the coder will suffer a long delay. This can be improved using adaptive universal
Huffman codes [136].
The working procedure of an adaptive Huffman code is as follows. Start with an
initial guess of the source distribution (based on the assumption that the source is
DMS). As a new source symbol arrives, encode the data in terms of the Huffman
coding scheme according to the current estimated distribution, and then update the
estimated distribution and the Huffman codebook according to the newly arrived
source symbol.
To be specific, let the source alphabet be X := {a1 , . . . , a M }. Define
Then, the (current) relative frequency of ai is N (ai |x n )/n. Let cn (ai ) denote the
Huffman codeword of source symbol ai with respect to the distribution
94 3 Lossless Data Compression
N (a1 |x n ) N (a2 |x n ) N (a M |x n )
, ,..., .
n n n
Now suppose that xn+1 = a j . The codeword cn (a j ) is set as output, and the relative
frequency for each source outcome becomes
N (a j |x n+1 ) n · (N (a j |x n )/n) + 1
=
n+1 n+1
and
N (ai |x n+1 ) n · (N (ai |x n )/n)
= for i = j.
n+1 n+1
n PX̂(n) (a j ) + 1
PX̂(n+1) (a j ) =
n+1
and n
PX̂(n+1) (ai ) = P (n) (ai ) for i = j,
n + 1 X̂
where PX̂(n+1) represents the estimate of the true distribution PX at time (n + 1).
Note that in the adaptive Huffman coding scheme, the encoder and decoder need
not be redesigned at every time, but only when a sufficient change in the estimated
distribution occurs such that the so-called sibling property is violated.
Definition 3.37 (Sibling property) A binary prefix code is said to have the sibling
property if its code tree satisfies
1. every node in the code tree (except for the root node) has a sibling (i.e., the code
tree is saturated), and
2. the node can be listed in nondecreasing order of probabilities with each node
being adjacent to its sibling.
The next observation indicates the fact that the Huffman code is the only prefix
code satisfying the sibling property.
Observation 3.38 A binary prefix code is a Huffman code iff it satisfies the sibling
property.
An example for a code tree satisfying the sibling property is shown in Fig. 3.7.
The first requirement is satisfied since the tree is saturated. The second requirement
can be checked by the node list in Fig. 3.7.
If the next observation (say at time n = 17) is a3 , then its codeword 100 is set as
output (using the Huffman code corresponding to PX̂(16) ). The estimated distribution
is updated as follows:
3.3 Variable-Length Codes for Lossless Data Compression 95
a1 (00, 3/8)
b0 (5/8)
a2 (01, 1/4)
8/8
a3 (100, 1/8)
b10 (1/4)
a4 (101, 1/8)
b1 (3/8)
a5 (110, 1/16)
b11 (1/8)
a6 (111, 1/16)
5 3 3 1
b0 ≥ b1 ≥ a1 ≥ a2
8 8 8 4
sibling pair sibling pair
1 1 1 1 1 1
≥ b10 ≥ b11 ≥ a3 ≥ a4 ≥ a5 ≥ a6
4 8 8 8 16 16
sibling pair sibling pair sibling pair
(16)
Fig. 3.7 Example of the sibling property based on the code tree from P . The arguments inside
X̂
the parenthesis following a j respectively indicate the codeword and the probability associated with
a j . Here, “b” is used to denote the internal nodes of the tree with the assigned (partial) code as its
subscript. The number in the parenthesis following b is the probability sum of all its children
16 × (3/8) 6 16 × (1/4) 4
PX̂(17) (a1 ) = = , PX̂(17) (a2 ) = =
17 17 17 17
16 × (1/8) + 1 3 16 × (1/8) 2
PX̂(17) (a3 ) = = , PX̂(17) (a4 ) = =
17 17 17 17
16 × [1/(16)] 1 16 × [1/(16)] 1
PX̂(17) (a5 ) = = , PX̂(17) (a6 ) = = .
17 17 17 17
The sibling property is then violated (cf. Fig. 3.8). Hence, codebook needs to be
updated according to the new estimated distribution, and the observation at n = 18
shall be encoded using the new codebook in Fig. 3.9. Details about adaptive Huffman
codes can be found in [136].
a1 (00, 6/17)
b0 (10/17)
a2 (01, 4/17)
17/17
a3 (100, 3/17)
b10 (5/17)
a4 (101, 2/17)
b1 (7/17)
a5 (110, 1/17)
b11 (2/17)
a6 (111, 1/17)
10 7 6 5
b0 ≥ b1 ≥ a1 ≥ b10
17 17 17 17
sibling pair
4 3 2 2 1 1
≥ a2 ≥ a3 ≥ a4 ≥ b11 ≥ a5 ≥ a6
17 17 17 17 17 17
sibling pair sibling pair
Fig. 3.8 (Continuation of Fig. 3.7) Example of violation of the sibling property after observing a
new symbol a3 at n = 17. Note that node a1 is not adjacent to its sibling a2
a1 (10, 6/17)
a2 (00, 4/17)
b0 (7/17)
17/17
b1 (10/17)
a3 (01, 3/17)
a4 (110, 2/17)
b11 (4/17)
a5 (1110, 1/17)
b111 (2/17)
a6 (1111, 1/17)
10 7 6 4
b1 ≥ b0 ≥ a1 ≥ b11
17 17 17 17
sibling pair sibling pair
4 3 2 2 1 1
≥ a2 ≥ a3 ≥ a4 ≥ b111 ≥ a5 ≥ a6
17 17 17 17 17 17
sibling pair sibling pair sibling pair
Fig. 3.9 (Continuation of Fig. 3.8) Updated Huffman code. The sibling property holds now for the
new code
3.3 Variable-Length Codes for Lossless Data Compression 97
version of the original Lempel–Ziv technique [404]). These codes, unlike Huffman
and Shannon–Fano–Elias codes, map variable-length sourcewords (as opposed to
fixed-length codewords) onto codewords.
Suppose the source alphabet is binary. Then, the Lempel–Ziv encoder can be
described as follows.
Encoder:
1. Parse the input sequence into strings that have never appeared before. For exam-
ple, if the input sequence is 1011010100010 . . ., the algorithm first grabs the
first letter 1 and finds that it has never appeared before. So 1 is the first string.
Then, the algorithm scoops the second letter 0 and also determines that it has not
appeared before, and hence, put it to be the next string. The algorithm moves on
to the next letter 1 and finds that this string has appeared. Hence, it hits another
letter 1 and yields a new string 11, and so on. Under this procedure, the source
sequence is parsed into the strings
2. Let L be the number of distinct strings of the parsed source. Then, we need
log2 L + 1 bits to index these strings (starting from one). In the above example,
the indices are
The codeword of each string is then the index of its prefix concatenated with the
last bit in its source string. For example, the codeword of source string 010 will
be the index of 01, i.e., 100, concatenated with the last bit of the source string,
i.e., 0. Through this procedure, encoding the above-parsed strings with L = 3
yields the codeword sequence
or equivalently,
0001000000110101100001000010.
Note that the conventional Lempel–Ziv encoder requires two passes: the first pass
to decide L, and the second pass to generate the codewords. The algorithm, however,
can be modified so that it requires only one pass over the entire source string. Also,
note that the above algorithm uses an equal number of bits (log2 L + 1) to all the
location indices, which can also be relaxed by proper modification.
Theorem 3.39 The above algorithm asymptotically achieves the entropy rate of any
stationary ergodic source (with unknown statistics).
98 3 Lossless Data Compression
Problems
(a) Show that A is indeed a typical set F100 (0.2) defined using the base-2 log-
arithm.
(b) Find the minimum codeword blocklength in bits for the block coding
scheme.
(c) Find the probability for sourcewords not in A.
(d) Use Chebyshev’s inequality to bound the probability of observing a source-
word outside A. Compare this bound with the actual probability computed
in part (c).
Hint: Let X i represent the binary random digit at instance i, and let Sn =
X 1 + · · · + X n . Note that Pr[S100 ≥ 4] is equal to
1
Pr S100 − 0.005 ≥ 0.035 .
100
2. Weak Converse to the Fixed-Length Source Coding Theorem: Recall (see Obser-
vation 3.3) that an (n, M) fixed-length source code for a discrete memory-
less source (DMS) {X n }∞ n=1 with finite alphabet X consists of an encoder
f : X n → {1, 2, . . . , M}, and a decoder g : {1, 2, . . . , M} → X n . The rate of
the code is
1
Rn := log2 M bits/source symbol,
n
and its probability of decoding error is
Pe = Pr[X n = X̂ n ],
where X̂ n = g( f (X n )).
(a) Show that any fixed-length source code (n, M) for a DMS satisfies
H (X ) − Rn 1
Pe ≥ − ,
log2 |X | n log2 |X |
3.3 Variable-Length Codes for Lossless Data Compression 99
(a) 1
n
H (X n ) ≤ n−1
1
H (X n−1 )
(b) 1
n
H (X n ) ≥ H (X n |X n−1 ).
Hint: Use the chain rule for entropy and the fact that
for every i.
4. Randomized random walk: An ant walks randomly on a line of integers. At
time instance i, it may move forward with probability 1 − Z i−1 , or it may move
∞
backward with probability Z i−1 , where {Z i }i=0 are identically distributed ran-
dom variables with finite alphabet Z ⊂ [0, 1]. Let X i be the number on which
the ant stands at time instance i, and let X 0 = 0 (with probability one).
(a) Show that
X n = X n−1 ⊕ X n−2 ⊕ Z n , n = 1, 2, . . .
7. We know the fact that the average code rate of all nth-order uniquely decodable
codes for a DMS must be no less than the source entropy. But this is not nec-
essarily true for non-singular codes. Give an example of a non-singular code in
which the average code rate is less than the source entropy.
8. Under what condition does the average code rate of a uniquely decodable binary
first-order variable-length code for a DMS equal the source entropy?
Hint: See the discussion after Theorem 3.22.
9. Binary Markov Source: Consider the binary homogeneous Markov source:
{X n }∞
n=1 , X n ∈ X = {0, 1}, with
ρ
, if i = 0 and j = 1,
Pr{X n+1 = j|X n = i} = 1+δ
ρ+δ
1+δ
, if i = 1 and j = 1,
where n ≥ 1, 0 ≤ ρ ≤ 1 and δ ≥ 0.
(a) Find the initial state distribution (Pr{X 1 = 0}, Pr{X 1 = 1}) required to make
the source {X n }∞n=1 stationary.
Assume in the next questions that the source is stationary.
(b) Find the entropy rate of {X n }∞ n=1 in terms of ρ and δ.
(c) For δ = 1 and ρ = 1/2, compute the source redundancies ρd , ρm , and ρt .
(d) Suppose that ρ = 1. Is {X n }∞ n=1 irreducible? What is the value of the entropy
rate in this case?
(e) For δ = 0, show that {X n }∞ n=1 is a discrete memoryless source and compute
its entropy rate in terms of ρ.
(f) If ρ = 1/2 and δ = 3/2, design first-, second-, and third-order binary Huff-
man codes for this source. Determine in each case the average code rate and
compare it to the entropy rate.
10. Polya contagion process of memory two: Consider the finite-memory Polya con-
tagion source presented in Example 3.17 with M = 2.
(a) Find the transition distribution of this binary Markov process and determine
its stationary distribution in terms of the source parameters.
(b) Find the source entropy rate.
11. Suppose random variables Z 1 and Z 2 are independent from each other and have
the same distribution as Z with
⎧
⎪ Pr[Z = e1 ] = 0.4,
⎪
⎨
Pr[Z = e2 ] = 0.3,
⎪ Pr[Z = e3 ] = 0.2,
⎪
⎩
Pr[Z = e4 ] = 0.1.
f (Z 1 , Z 2 ) := ( f (Z 1 ), f (Z 2 )) = (U1 , U2 , . . . , Uk ),
1 1 1
0.4 log2 + 0.3 log2 + 0.2 log2
0.4 0.3 0.2
1
+0.1 log2 = 1.84644 bits/letter?
0.1
Justify your answer.
(d) Now if we apply the Huffman code in (a) sequentially to the i.i.d. sequence
Z 1 , Z 2 , Z 3 , . . . with the same marginal distribution as Z , and yield the output
U1 , U2 , U3 , . . ., can U1 , U2 , U3 , . . . be further compressed?
If your answer to this question is NO, prove the i.i.d. uniformity of
U1 , U2 , U3 , . . .. If your answer to this question is YES, then explain why
the optimal Huffman code does not give an i.i.d. uniform output.
Hint: Examine whether the average code rate can achieve the source entropy.
12. In the second part of Theorem 3.27, it is shown that there exists a D-ary prefix
code with
1 1
R̄n = PX (x)(cx ) ≤ H D (X ) + ,
n x∈X n
where cx is the codeword for the source symbol x and (cx ) is the length of
codeword cx . Show that the upper bound can be improved to
1
R̄n < H D (X ) + .
n
(c) Prove that the average code rate of the second-order (two-letter) binary
Huffman code cannot be equal to H (X ) + 1/2 bits?
Hint: Use the new bound in Problem 12.
14. Decide whether each of the following statements is true or false. Prove the
validity of those that are true and give counterexamples or arguments based on
known facts to disprove those that are false.
(a) Every Huffman code for a discrete memoryless source (DMS) has a corre-
sponding suffix code with the same average code rate.
(b) Consider a DMS {X n }∞ n=1 with alphabet X = {a1 , a2 , a3 , a4 , a5 , a6 } and
probability distribution
1 1 1 1 1 1
[ p1 , p2 , p3 , p4 , p5 , p6 ] = , , , , , ,
4 4 4 8 16 16
1−
PX,Y (0, 0) = PX,Y (1, 1) =
2
and
PX,Y (0, 1) = PX,Y (1, 0) = ,
2
where 0 < < 1.
1
(a) Find the limit of the random variable [PX n (X n )] 2n as n → ∞.
(b) Find the limit of the random variable
1 PX n ,Y n (X n , Y n )
log2
n PX n (X n )PY n (Y n )
as n → ∞.
3.3 Variable-Length Codes for Lossless Data Compression 103
∞
17. Consider a discrete memoryless source {X i }i=1 with alphabet X and distribution
p X . Let C = f (X ) be a uniquely decodable binary code
f : X → {0, 1}∗
that maps single source letters onto binary strings such that its average code rate
R C satisfies
R C = H (X ) bits/source symbol.
f : X n → {0, 1}∗ .
Provide a construction for the map f such that the code C is also absolutely
optimal.
18. Consider two random variables X and Y with values in finite sets X and Y,
respectively. Let l X , l Y , and l X Y denote the average codeword lengths of the
optimal (first-order) prefix codes
f : X → {0, 1}∗ ,
g : Y → {0, 1}∗
and
h : X × Y → {0, 1}∗ ,
20. Divergence rate: Prove the expression in (3.2.6) for the divergence rate between
a stationary source {X i } and a time-invariant Markov source { X̂ i }, with both
sources having a common finite alphabet X . Generalize the result if the source
{ X̂ i } is a time-invariant kth-order Markov chain.
21. Prove Observation 3.29.
22. Prove Lemma 3.33.
Chapter 4
Data Transmission and Channel Capacity
PY |X 2 (y = 0|x 2 = 00) = 1
PY |X 2 (y = 0|x 2 = 01) = 1
PY |X 2 (y = 1|x 2 = 10) = 1
PY |X 2 (y = 1|x 2 = 11) = 1,
Fig. 4.1 A data transmission system, where W represents the message for transmission, X n denotes
the codeword corresponding to message W , Y n represents the received word due to channel input
X n , and Ŵ denotes the reconstructed message from Y n
00 1 0
1
01
10 1 1
1
11
and a binary message (either event A or event B) is required to be transmitted from
the sender to the receiver. Then the data transmission code with (codeword 00 for
event A, codeword 10 for event B) obviously induces less ambiguity at the receiver
than the code with (codeword 00 for event A, codeword 01 for event B).
In short, the objective in designing a data transmission (or channel) code is to
transform a noisy channel into a reliable medium for sending messages and recov-
ering them at the receiver with minimal loss. To achieve this goal, the designer of
a data transmission code needs to take advantage of the common parts between the
sender and the receiver sites that are least affected by the channel noise. We will see
that these common parts are probabilistically captured by the mutual information
between the channel input and the channel output.
As illustrated in the previous example, if a “least-noise-affected” subset of the
channel input words is appropriately selected as the set of codewords, the messages
intended to be transmitted can be reliably sent to the receiver with arbitrarily small
error. One then raises the question:
What is the maximum amount of information (per channel use) that can be reliably
transmitted over a given noisy channel?
In the above example, we can transmit a binary message error-free, and hence, the
amount of information that can be reliably transmitted is at least 1 bit per channel
use (or channel symbol). It can be expected that the amount of information that can
be reliably transmitted for a highly noisy channel should be less than that for a less
noisy channel. But such a comparison requires a good measure of the “noisiness” of
channels.
From an information-theoretic viewpoint, “channel capacity” provides a good
measure of the noisiness of a channel; it represents the maximal amount of infor-
mational messages (per channel use) that can be transmitted via a data transmission
code over the channel and recovered with arbitrarily small probability of error at the
receiver. In addition to its dependence on the channel transition distribution, channel
4.1 Principles of Data Transmission 107
capacity also depends on the coding constraint imposed on the channel input, such
as “only block (fixed-length) codes are allowed.” In this chapter, we will study chan-
nel capacity for block codes (namely, only block transmission code can be used).1
Throughout the chapter, the noisy channel is assumed to be memoryless (as defined
in the next section).
A discrete channel with finite input alphabet X and finite output alphabet Y is characterized by a sequence of n-dimensional transition distributions

{P_{Y^n|X^n}(y^n|x^n)}_{n=1}^∞

such that ∑_{y^n ∈ Y^n} P_{Y^n|X^n}(y^n|x^n) = 1 for every x^n ∈ X^n, where x^n = (x_1, . . . , x_n) ∈ X^n and y^n = (y_1, . . . , y_n) ∈ Y^n. We assume that the above sequence of n-dimensional distributions is consistent, i.e.,

P_{Y^i|X^i}(y^i|x^i) = [∑_{x_{i+1}∈X} ∑_{y_{i+1}∈Y} P_{X^{i+1}}(x^{i+1}) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})] / [∑_{x_{i+1}∈X} P_{X^{i+1}}(x^{i+1})]
 = ∑_{x_{i+1}∈X} ∑_{y_{i+1}∈Y} P_{X_{i+1}|X^i}(x_{i+1}|x^i) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})

for every i = 1, 2, . . .. The channel is a discrete memoryless channel (DMC) if its transition distributions satisfy

P_{Y^n|X^n}(y^n|x^n) = ∏_{i=1}^{n} P_{Y|X}(y_i|x_i)    (4.2.1)

for every n = 1, 2, . . ., x^n ∈ X^n and y^n ∈ Y^n, where P_{Y|X} is the channel's (single-letter) transition distribution.
1 See[397] for recent results regarding channel capacity when no coding constraints are applied to
the channel input (so that variable-length codes can be employed).
Observation 4.3 We note that the DMC’s condition (4.2.1) is actually equivalent
to the following two sets of conditions [29]:
P_{Y_n|X^n,Y^{n−1}}(y_n | x^n, y^{n−1}) = P_{Y|X}(y_n | x_n)  ∀ n = 1, 2, . . . , x^n, y^n;    (4.2.2a)
P_{Y^{n−1}|X^n}(y^{n−1} | x^n) = P_{Y^{n−1}|X^{n−1}}(y^{n−1} | x^{n−1})  ∀ n = 2, 3, . . . , x^n, y^{n−1}.    (4.2.2b)

P_{Y_n|X^n,Y^{n−1}}(y_n | x^n, y^{n−1}) = P_{Y|X}(y_n | x_n)  ∀ n = 1, 2, . . . , x^n, y^n;    (4.2.3a)
P_{X_n|X^{n−1},Y^{n−1}}(x_n | x^{n−1}, y^{n−1}) = P_{X_n|X^{n−1}}(x_n | x^{n−1})  ∀ n = 1, 2, . . . , x^n, y^{n−1}.    (4.2.3b)
Condition (4.2.2a) [also (4.2.3a)] implies that the current output Yn only depends
on the current input X n but not on past inputs X n−1 and outputs Y n−1 . Condition
(4.2.2b) indicates that the past outputs Y n−1 do not depend on the current input X n .
These two conditions together give

P_{Y^n|X^n}(y^n | x^n) = P_{Y^{n−1}|X^n}(y^{n−1} | x^n) P_{Y_n|X^n,Y^{n−1}}(y_n | x^n, y^{n−1})
 = P_{Y^{n−1}|X^{n−1}}(y^{n−1} | x^{n−1}) P_{Y|X}(y_n | x_n);

hence, (4.2.1) holds recursively for n = 1, 2, . . .. The converse [i.e., (4.2.1) implies
both (4.2.2a) and (4.2.2b)] is a direct consequence of
P_{Y_n|X^n,Y^{n−1}}(y_n | x^n, y^{n−1}) = P_{Y^n|X^n}(y^n | x^n) / ∑_{y_n∈Y} P_{Y^n|X^n}(y^n | x^n)

and

P_{Y^{n−1}|X^n}(y^{n−1} | x^n) = ∑_{y_n∈Y} P_{Y^n|X^n}(y^n | x^n).
Similarly, (4.2.3b) states that the current input X n is independent of past outputs
Y n−1 , which together with (4.2.3a) implies again
P_{Y^n|X^n}(y^n | x^n)
 = P_{X^n,Y^n}(x^n, y^n) / P_{X^n}(x^n)
 = [P_{X^{n−1},Y^{n−1}}(x^{n−1}, y^{n−1}) P_{X_n|X^{n−1},Y^{n−1}}(x_n | x^{n−1}, y^{n−1}) P_{Y_n|X^n,Y^{n−1}}(y_n | x^n, y^{n−1})] / [P_{X^{n−1}}(x^{n−1}) P_{X_n|X^{n−1}}(x_n | x^{n−1})]
 = P_{Y^{n−1}|X^{n−1}}(y^{n−1} | x^{n−1}) P_{Y|X}(y_n | x_n),
hence, recursively yielding (4.2.1). The converse for (4.2.3b)—i.e., (4.2.1) implying
(4.2.3b)—can be analogously proved by noting that
P_{X_n|X^{n−1},Y^{n−1}}(x_n | x^{n−1}, y^{n−1}) = [P_{X^n}(x^n) ∑_{y_n∈Y} P_{Y^n|X^n}(y^n | x^n)] / [P_{X^{n−1}}(x^{n−1}) P_{Y^{n−1}|X^{n−1}}(y^{n−1} | x^{n−1})].
Note that the above definition of DMC in (4.2.1) prohibits the use of channel feed-
back, as feedback allows the current channel input to be a function of past chan-
nel outputs (therefore, conditions (4.2.2b) and (4.2.3b) cannot hold with feedback).
Instead, a causality condition generalizing (4.2.2a) (e.g., see Problem 4.28 or [415,
Definition 7.4]) will be needed for a channel with feedback.
Examples of DMCs:
1. Identity (noiseless) channels: An identity channel has equal size input and output
alphabets (|X | = |Y|) and channel transition probability satisfying
P_{Y|X}(y|x) = { 1  if y = x
                0  if y ≠ x }
2. Binary symmetric channels: The binary symmetric channel (BSC) has binary input and output alphabets X = Y = {0, 1}, and each input bit is received correctly with probability 1 − ε and flipped with crossover probability ε; it can be graphically represented via a transition diagram as shown in Fig. 4.2.
If we set ε = 0, then the BSC reduces to the binary identity (noiseless)
channel. The channel is called “symmetric” since PY |X (1|0) = PY |X (0|1); i.e.,
[Fig. 4.2: transition diagram of the BSC; each input bit is reproduced at the output with probability 1 − ε and flipped with probability ε.]
it has the same probability for flipping an input bit into a 0 or a 1. DMCs with
various symmetry properties will be discussed in detail later in this chapter.
Despite its simplicity, the BSC is rich enough to capture most of the complex-
ity of coding problems over more general channels. For example, it can exactly
model the behavior of practical channels with additive memoryless Gaussian
noise used in conjunction with binary symmetric modulation and hard-decision
demodulation (e.g., see [407, p. 240]). It is also worth pointing out that the BSC
can be explicitly represented via a binary modulo-2 additive noise channel whose
output at time i is the modulo-2 sum of its input and noise variables:
Yi = X i ⊕ Z i for i = 1, 2, . . . , (4.2.5)
where ⊕ denotes addition modulo-2, Yi, Xi, and Zi are the channel output, input,
and noise, respectively, at time i, and the alphabets X = Y = Z = {0, 1} are all
binary. It is assumed in (4.2.5) that Xi and Zj are independent of each other for
any i, j = 1, 2, . . . , and that the noise process is a Bernoulli(ε) process—i.e., a
binary i.i.d. process with Pr[Z = 1] = ε.
3. Binary erasure channels: In the BSC, some input bits are received perfectly and
others are received corrupted (flipped) at the channel output. In some channels,
however, some input bits are lost during transmission instead of being received
corrupted (for example, packets in data networks may get dropped or blocked
due to congestion or bandwidth constraints). In this case, the receiver knows the
exact location of these bits in the received bitstream or codeword, but not their
actual value. Such bits are then declared as “erased” during transmission and are
called “erasures.” This gives rise to the so-called binary erasure channel (BEC)
as illustrated in Fig. 4.3, with input alphabet X = {0, 1} and output alphabet
Y = {0, E, 1}, where E represents an erasure (we may assume that E is a real
number strictly greater than one), and channel transition matrix given by
[Fig. 4.3: transition diagram of the BEC; each input bit is received correctly with probability 1 − α and erased (mapped to E) with probability α.]
Q = [p_{x,y}] = [ p_{0,0}   p_{0,E}   p_{0,1}
                  p_{1,0}   p_{1,E}   p_{1,1} ]

  = [ P_{Y|X}(0|0)   P_{Y|X}(E|0)   P_{Y|X}(1|0)
      P_{Y|X}(0|1)   P_{Y|X}(E|1)   P_{Y|X}(1|1) ]

  = [ 1−α   α   0
      0     α   1−α ],    (4.2.6)
Like the BSC, the BEC can be described via a functional noise-erasure representation: its output at time i is

Y_i = X_i · 1{Z_i ≠ E} + E · 1{Z_i = E},    (4.2.7)

where

1{Z_i = E} := { 1  if Z_i = E
                0  if Z_i ≠ E }
is the indicator function of the set {Z i = E}, Yi , X i , and Z i are the channel output,
input, and erasure, respectively, at time i and the alphabets are X = {0, 1},
Z = {0, E} and Y = {0, 1, E}. Indeed, when the erasure variable Z i = E,
Yi = E and an erasure occurs in the channel; also, when Z i = 0, Yi = X i and
the input is received perfectly. In the BEC functional representation in (4.2.7),
it is assumed that X i and Z j are independent of each other for any i, j and that
the erasure process {Z i } is i.i.d. with Pr[Z = E] = α.
4. Binary channels with errors and erasures: One can combine the BSC with the
BEC to obtain a binary channel with both errors and erasures, as shown in Fig. 4.4.
We will call such a channel the binary symmetric erasure channel (BSEC). In this
case, the channel’s transition matrix is given by
[Fig. 4.4: transition diagram of the BSEC; each input bit is received correctly with probability 1 − ε − α, flipped with probability ε, and erased with probability α.]
Q = [p_{x,y}] = [ p_{0,0}   p_{0,E}   p_{0,1}
                  p_{1,0}   p_{1,E}   p_{1,1} ]
  = [ 1−ε−α   α   ε
      ε       α   1−ε−α ],    (4.2.8)
where ε, α ∈ [0, 1] are the channel’s crossover and erasure probabilities, respec-
tively, with ε + α ≤ 1. Clearly, setting α = 0 reduces the BSEC to the BSC,
and setting ε = 0 reduces the BSEC to the BEC. Analogously to the BSC and
the BEC, the BSEC admits an explicit expression in terms of a noise-erasure
process:
Y_i = { X_i ⊕ Z_i   if Z_i ≠ E
        E            if Z_i = E }
    = (X_i ⊕ Z_i) · 1{Z_i ≠ E} + E · 1{Z_i = E},    (4.2.9)

where the noise-erasure process {Z_i} is i.i.d. with alphabet Z = {0, 1, E}, independent of the input process, and with Pr[Z = 0] = 1 − ε − α, Pr[Z = 1] = ε and Pr[Z = E] = α (a small simulation sketch based on these channel models is given right after this list of examples).
5. q-ary symmetric channels: Given an integer q ≥ 2, the q-ary symmetric channel with symbol error rate ε has input and output alphabets X = Y = {0, 1, . . . , q − 1} and channel transition matrix

Q = [p_{x,y}] = [ p_{0,0}      p_{0,1}      · · ·   p_{0,q−1}
                  p_{1,0}      p_{1,1}      · · ·   p_{1,q−1}
                  ...          ...          ...     ...
                  p_{q−1,0}    p_{q−1,1}    · · ·   p_{q−1,q−1} ]

  = [ 1 − ε        ε/(q−1)     · · ·   ε/(q−1)
      ε/(q−1)      1 − ε       · · ·   ε/(q−1)
      ...          ...         ...     ...
      ε/(q−1)      ε/(q−1)     · · ·   1 − ε ],    (4.2.11)

so that each input symbol is received correctly with probability 1 − ε and is changed into each of the other q − 1 symbols with probability ε/(q − 1). Like the BSC, this channel can be expressed as a modulo-q additive noise channel with Y_i = X_i ⊕_q Z_i, where ⊕_q denotes addition modulo q and the noise process {Z_i} is i.i.d. with

Pr[Z = 0] = 1 − ε   and   Pr[Z = a] = ε/(q − 1)   ∀ a ∈ {1, . . . , q − 1}.
It is also assumed that the input and noise processes are independent of each
other.
6. q-ary erasure channels: Given an integer q ≥ 2, one can also consider a nonbi-
nary extension of the BEC, yielding the so-called q-ary erasure channel. Specifi-
cally, this channel has input and output alphabets given by X = {0, 1, . . . , q −1}
and Y = {0, 1, . . . , q − 1, E}, respectively, where E denotes an erasure, and
channel transition distribution given by
P_{Y|X}(y|x) = { 1 − α   if y = x, x ∈ X
                 α       if y = E, x ∈ X
                 0       otherwise,    (4.2.12)

where α ∈ [0, 1] is the channel's erasure probability.
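To make the above channel models concrete, the following Python sketch (our own illustration; the function names, the NumPy usage, and the parameter values are choices made here and not part of the text) assembles the transition matrices of the channels just listed and empirically checks the BSEC noise-erasure representation (4.2.9) against the matrix (4.2.8):

import numpy as np

def bsc(eps):
    # binary symmetric channel
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

def bec(alpha):
    # binary erasure channel; outputs ordered (0, E, 1) as in (4.2.6)
    return np.array([[1 - alpha, alpha, 0.0],
                     [0.0, alpha, 1 - alpha]])

def bsec(eps, alpha):
    # binary symmetric erasure channel; outputs ordered (0, E, 1) as in (4.2.8)
    return np.array([[1 - eps - alpha, alpha, eps],
                     [eps, alpha, 1 - eps - alpha]])

def q_sc(q, eps):
    # q-ary symmetric channel, cf. (4.2.11)
    return np.full((q, q), eps / (q - 1)) + (1 - eps - eps / (q - 1)) * np.eye(q)

def q_ec(q, alpha):
    # q-ary erasure channel, cf. (4.2.12); last output column is the erasure E
    return np.hstack([(1 - alpha) * np.eye(q), np.full((q, 1), alpha)])

# Every transition matrix must be stochastic (each row sums to one).
for Q in (bsc(0.1), bec(0.2), bsec(0.1, 0.2), q_sc(4, 0.3), q_ec(4, 0.2)):
    assert np.allclose(Q.sum(axis=1), 1.0)

# Empirical check of the BSEC representation (4.2.9) with eps = 0.1, alpha = 0.2:
rng = np.random.default_rng(0)
eps, alpha, n = 0.1, 0.2, 200_000
x = rng.integers(0, 2, size=n)
z = rng.choice([0, 1, 2], size=n, p=[1 - eps - alpha, eps, alpha])  # 2 encodes E
y = np.where(z == 2, 2, x ^ z)        # (X xor Z) if Z is not E, else E
freq = [np.mean(y[x == 0] == s) for s in (0, 2, 1)]   # outputs (0, E, 1) given input 0
print(np.round(freq, 3))              # approximately [0.7, 0.2, 0.1], the first row of (4.2.8)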
Definition 4.4 (Fixed-length data transmission code) Given positive integers n and
M (where M = Mn ), and a discrete channel with input alphabet X and output
alphabet Y, a fixed-length data transmission code (or block code) for this channel
with blocklength n and rate (1/n) log2 M message bits per channel symbol (or channel
use) is denoted by ∼Cn = (n, M) and consists of:
1. M information messages intended for transmission.
2. An encoding function
f : {1, 2, . . . , M} → X^n, yielding the codewords f(1), . . . , f(M).
3. A decoding function g : Y^n → {1, 2, . . . , M}.
The set {1, 2, . . . , M} is called the message set and we assume that a message W
follows a uniform distribution over the set of messages: Pr[W = w] = 1/M for all
w ∈ {1, 2, . . . , M}. A block diagram for the channel code is given at the beginning of
this chapter; see Fig. 4.1. As depicted in the diagram, to convey message W over the
channel, the encoder sends its corresponding codeword X n = f (W ) at the channel
input. Finally, Y n is received at the channel output (according to the memoryless
channel distribution PY n |X n ) and the decoder yields Ŵ = g(Y n ) as the message
estimate.
Definition 4.5 (Average probability of error) The average probability of error for a
channel block code ∼Cn = (n, M) with encoder f(·) and decoder g(·) used over
a channel with transition distribution P_{Y^n|X^n} is defined as

P_e(∼Cn) := (1/M) ∑_{w=1}^{M} λ_w(∼Cn),
where

λ_w(∼Cn) := Pr[g(Y^n) ≠ w | X^n = f(w)] = ∑_{y^n ∈ Y^n : g(y^n) ≠ w} P_{Y^n|X^n}(y^n | f(w))

is the code's conditional probability of decoding error given that message w is sent
over the channel.
Note that, since we have assumed that the message W is drawn uniformly from
the set of messages, we have that

P_e(∼Cn) = Pr[Ŵ ≠ W].

Clearly, P_e(∼Cn) ≤ λ(∼Cn), where λ(∼Cn) := max_{1≤w≤M} λ_w(∼Cn) denotes the code's maximal
probability of error; so one would expect that P_e(∼Cn) behaves differently than λ(∼Cn).
However, it can be shown that from a code ∼Cn = (n, M) with arbitrarily small
P_e(∼Cn), one can construct (by throwing away from ∼Cn half of its codewords with
largest conditional probability of error) a code ∼Cn′ = (n, M/2) with arbitrarily small
λ(∼Cn′) at essentially the same code rate as n grows to infinity (e.g., see [83, p. 204],
[415, p. 163]).3 Hence, for simplicity, we will only use P_e(∼Cn) as our criterion when
evaluating the “goodness” or reliability4 of channel block codes; but one must keep
in mind that our results hold under λ(∼Cn) as well, in particular the channel coding
theorem below.
Our target is to find a good channel block code (or to show the existence of a good
channel block code). From the perspective of the (weak) law of large numbers, a
good choice is to draw the code’s codewords based on the jointly typical set between
the input and the output of the channel, since all the probability mass is ultimately
placed on the jointly typical set. The decoding failure then occurs only when the
channel input–output pair does not lie in the jointly typical set, which implies that
the probability of decoding error is ultimately small. We next define the jointly typical
set.
3 Note that this fact holds for single-user channels with known transition distributions (as given in
Definition 4.1) that remain constant throughout the transmission of a codeword. It does not however
hold for single-user channels whose statistical descriptions may vary in an unknown manner from
symbol to symbol during a codeword transmission; such channels, which include the class of
“arbitrarily varying channels” (see [87, Chap. 2, Sect. 6]), will not be considered in this textbook.
4 We interchangeably use the terms “goodness” or “reliability” for a block code to mean that its probability of decoding error can be made arbitrarily small as the blocklength grows.
Definition 4.7 (Jointly typical set) The set F_n(δ) of jointly δ-typical n-tuple
pairs (x^n, y^n) with respect to the memoryless distribution

P_{X^n,Y^n}(x^n, y^n) = ∏_{i=1}^{n} P_{X,Y}(x_i, y_i)

is the set of all pairs (x^n, y^n) ∈ X^n × Y^n satisfying

| −(1/n) log2 P_{X^n}(x^n) − H(X) | < δ,
| −(1/n) log2 P_{Y^n}(y^n) − H(Y) | < δ,
and
| −(1/n) log2 P_{X^n,Y^n}(x^n, y^n) − H(X, Y) | < δ.
With the above definition, we directly obtain the joint AEP theorem.
−(1/n) log2 P_{X^n}(X_1, X_2, . . . , X_n) → H(X)  in probability,

−(1/n) log2 P_{Y^n}(Y_1, Y_2, . . . , Y_n) → H(Y)  in probability,

and

−(1/n) log2 P_{X^n,Y^n}((X_1, Y_1), . . . , (X_n, Y_n)) → H(X, Y)  in probability

as n → ∞.
Proof By the weak law of large numbers, we have the desired result.
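These convergence statements can also be observed empirically. The following Python sketch (our own illustration; the joint distribution below is an arbitrary choice) samples i.i.d. pairs and shows that the three normalized log-likelihoods concentrate around H(X), H(Y), and H(X, Y):

import numpy as np

# An arbitrary joint distribution P_{X,Y} on {0,1} x {0,1}.
P = np.array([[0.4, 0.1],
              [0.2, 0.3]])
Px, Py = P.sum(axis=1), P.sum(axis=0)
H = lambda p: -np.sum(p * np.log2(p))
print("H(X), H(Y), H(X,Y):", H(Px), H(Py), H(P.flatten()))

rng = np.random.default_rng(1)
n = 50_000
idx = rng.choice(4, size=n, p=P.flatten())      # sample n i.i.d. pairs (X_i, Y_i)
x, y = idx // 2, idx % 2
print("-(1/n) log2 P_{X^n}    :", -np.mean(np.log2(Px[x])))
print("-(1/n) log2 P_{Y^n}    :", -np.mean(np.log2(Py[y])))
print("-(1/n) log2 P_{X^n,Y^n}:", -np.mean(np.log2(P.flatten()[idx])))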
A rate R is said to be achievable for the channel if there exists a sequence of data transmission block codes {∼Cn = (n, Mn)}_{n=1}^∞ such that

lim inf_{n→∞} (1/n) log2 Mn ≥ R   and   lim_{n→∞} P_e(∼Cn) = 0.

The channel's operational capacity, Cop, is the supremum of all achievable rates:

Cop := sup{R ≥ 0 : R is achievable}.
We herein arrive at the main result of this chapter, Shannon’s channel coding
theorem for DMCs. It states that for a DMC, its operational capacity Cop is actually
equal to a quantity C, conveniently termed as channel capacity (or information
capacity) and defined as the maximum of the channel’s mutual information over the
set of its input distributions (see below). In other words, the quantity C is indeed the
supremum of all achievable channel code rates, and this is shown in two parts in the
theorem in light of the properties of the supremum; see Observation A.5. As a result,
for a given DMC, its quantity C, which can be calculated by solely using the channel’s
transition matrix Q, constitutes the largest rate at which one can reliably transmit
information via a block code over this channel. Thus, it is possible to communicate
reliably over an inherently noisy DMC at a fixed rate (without decreasing it) as long
as this rate is below C and the code’s blocklength is allowed to be large.
Theorem 4.11 (Shannon’s channel coding theorem) Consider a DMC with finite
input alphabet X , finite output alphabet Y and transition distribution probability
PY |X (y|x), x ∈ X and y ∈ Y. Define the channel capacity5

C := max_{P_X} I(X; Y) = max_{P_X} I(P_X, P_{Y|X}),

where the maximum is taken over all input distributions PX. Then, the following hold.

5 First note that the mutual information I(X; Y) is actually a function of the input statistics PX and the channel statistics PY|X; it is hence also denoted by I(PX, PY|X).
• Forward part (achievability): For any 0 < ε < 1, there exist γ > 0 and a sequence
of data transmission block codes {∼Cn = (n, Mn)}_{n=1}^∞ with

lim inf_{n→∞} (1/n) log2 Mn ≥ C − γ
and
Pe (∼Cn ) < ε for sufficiently large n,
where Pe (∼Cn ) denotes the (average) probability of error for block code ∼Cn .
• Converse part: For any 0 < ε < 1, any sequence of data transmission block codes
{∼Cn = (n, Mn)}_{n=1}^∞ with

lim inf_{n→∞} (1/n) log2 Mn > C
satisfies
Pe (∼Cn ) > (1 − ε)μ for sufficiently large n, (4.3.1)
where

μ = 1 − C / (lim inf_{n→∞} (1/n) log2 Mn) > 0,
i.e., the codes’ probability of error is bounded away from zero for all n sufficiently
large.6
Proof of the forward part: It suffices to prove the existence of a good block code
sequence (satisfying the rate condition, i.e., lim inf n→∞ (1/n) log2 Mn ≥ C − γ for
some γ > 0) whose average error probability is ultimately less than ε. Since the
forward part holds trivially when C = 0 by setting Mn = 1, we assume in the sequel
that C > 0.
We will use Shannon’s original random coding proof technique in which the
good block code sequence is not deterministically constructed; instead, its existence
is implicitly proven by showing that for a class (ensemble) of block code sequences
{∼Cn }∞
n=1 and a code-selecting distribution Pr[∼ Cn ] over these block code sequences,
the expectation value of the average error probability, evaluated under the code-
selecting distribution on these block code sequences, can be made smaller than ε for
n sufficiently large:
6 Note that (4.3.1) actually implies that lim inf_{n→∞} Pe(∼Cn) ≥ lim_{ε↓0} (1 − ε)μ = μ, where the error
probability lower bound has nothing to do with ε. Here, we state the converse of Theorem 4.11 in
a form in parallel to the converse statements in Theorems 3.6 and 3.15.
E_{∼Cn}[Pe(∼Cn)] = ∑_{∼Cn} Pr[∼Cn] Pe(∼Cn) → 0  as n → ∞.
Hence, there must exist at least one desired good code sequence {∼Cn*}_{n=1}^∞
among them (with Pe(∼Cn*) → 0 as n → ∞).
Fix ε ∈ (0, 1) and some γ in (0, min{4ε, C}). Observe that there exists N0 such
that for n > N0 , we can choose an integer Mn with
C − γ/2 ≥ (1/n) log2 Mn > C − γ.    (4.3.2)
(Since we are only concerned with the case of “sufficiently large n,” it suffices to
consider only those n’s satisfying n > N0 , and ignore those n’s for n ≤ N0 .)
Define δ := γ/8. Let PX̂ be a probability distribution that achieves the channel
capacity:

C := max_{P_X} I(P_X, P_{Y|X}) = I(P_X̂, P_{Y|X}).

Denote by P_{Ŷ^n} the channel output distribution due to the channel input product distribution P_{X̂^n} (with P_{X̂^n}(x^n) = ∏_{i=1}^{n} P_X̂(x_i)), i.e.,

P_{Ŷ^n}(y^n) = ∑_{x^n ∈ X^n} P_{X̂^n,Ŷ^n}(x^n, y^n)
where

P_{X̂^n,Ŷ^n}(x^n, y^n) := P_{X̂^n}(x^n) P_{Y^n|X^n}(y^n | x^n)

for all x^n ∈ X^n and y^n ∈ Y^n. Note that since P_{X̂^n}(x^n) = ∏_{i=1}^{n} P_X̂(x_i) and the
channel is memoryless, the resulting joint input–output process {(X̂_i, Ŷ_i)}_{i=1}^∞ is also
memoryless with

P_{X̂^n,Ŷ^n}(x^n, y^n) = ∏_{i=1}^{n} P_{X̂,Ŷ}(x_i, y_i)

and

P_{X̂,Ŷ}(x, y) = P_X̂(x) P_{Y|X}(y|x)   for x ∈ X, y ∈ Y.
7 Here, the channel inputs are selected with replacement. That means it is possible and acceptable
that all the selected Mn channel inputs are identical.
The Mn codewords c_1, . . . , c_{Mn} of the random codebook ∼Cn are selected independently (with replacement7) from X^n according to the distribution P_{X̂^n}; the encoder and decoder of ∼Cn are then given by

f_n(m) = c_m   for 1 ≤ m ≤ Mn,
and
g_n(y^n) = { m,   if c_m is the only codeword in ∼Cn satisfying (c_m, y^n) ∈ F_n(δ);
             any one in {1, 2, . . . , Mn},   otherwise,
where Fn (δ) is defined in Definition 4.7 with respect to distribution PX̂ n ,Ŷ n . (We
evidently assume that the codebook ∼Cn and the channel distribution PY |X are
known at both the encoder and the decoder.) Hence, the code ∼Cn operates as
follows. A message W is chosen according to the uniform distribution from the
set of messages. The encoder f n then transmits the W th codeword cW in ∼Cn over
the channel. Then, Y n is received at the channel output and the decoder guesses
the sent message via Ŵ = gn (Y n ).
Note that there is a total of |X|^{n Mn} possible randomly generated codebooks ∼Cn and
the probability of selecting each codebook is given by

Pr[∼Cn] = ∏_{m=1}^{Mn} P_{X̂^n}(c_m).
Given that message m is sent (i.e., codeword c_m is transmitted), the conditional probability of decoding error can be upper bounded as

λ_m(∼Cn) ≤ ∑_{y^n ∈ Y^n : (c_m, y^n) ∉ F_n(δ)} P_{Y^n|X^n}(y^n | c_m)
 + ∑_{m′=1, m′≠m}^{Mn} ∑_{y^n ∈ Y^n : (c_{m′}, y^n) ∈ F_n(δ)} P_{Y^n|X^n}(y^n | c_m),    (4.3.3)
where the first term in (4.3.3) considers the case that the received channel output y^n
is not jointly δ-typical with c_m (and hence, the decoding rule g_n(·) would possibly
result in a wrong guess), and the second term in (4.3.3) reflects the situation when
y^n is jointly δ-typical not only with the transmitted codeword c_m but also with
another codeword c_{m′} (which may cause a decoding error).
By taking expectation in (4.3.3) with respect to the mth codeword-selecting
distribution PX̂ n (cm ), we obtain
∑_{c_m ∈ X^n} P_{X̂^n}(c_m) λ_m(∼Cn)
 ≤ ∑_{c_m ∈ X^n} ∑_{y^n ∉ F_n(δ|c_m)} P_{X̂^n}(c_m) P_{Y^n|X^n}(y^n | c_m)
  + ∑_{m′=1, m′≠m}^{Mn} ∑_{c_m ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n}(c_m) P_{Y^n|X^n}(y^n | c_m)
 = P_{X̂^n,Ŷ^n}(F_n^c(δ)) + ∑_{m′=1, m′≠m}^{Mn} ∑_{c_m ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n),    (4.3.4)

where

F_n(δ|x^n) := { y^n ∈ Y^n : (x^n, y^n) ∈ F_n(δ) }.
We now average Pe(∼Cn) over the ensemble of all codebooks ∼Cn generated at random according to Pr[∼Cn]
and show that it asymptotically vanishes as n grows without bound. We obtain the
following series of inequalities:
E_{∼Cn}[Pe(∼Cn)] = ∑_{∼Cn} Pr[∼Cn] Pe(∼Cn)
 = ∑_{c_1 ∈ X^n} · · · ∑_{c_{Mn} ∈ X^n} P_{X̂^n}(c_1) · · · P_{X̂^n}(c_{Mn}) [ (1/Mn) ∑_{m=1}^{Mn} λ_m(∼Cn) ]
 = (1/Mn) ∑_{m=1}^{Mn} ∑_{c_1 ∈ X^n} · · · ∑_{c_{m−1} ∈ X^n} ∑_{c_{m+1} ∈ X^n} · · · ∑_{c_{Mn} ∈ X^n} P_{X̂^n}(c_1) · · · P_{X̂^n}(c_{m−1}) P_{X̂^n}(c_{m+1}) · · · P_{X̂^n}(c_{Mn}) [ ∑_{c_m ∈ X^n} P_{X̂^n}(c_m) λ_m(∼Cn) ]
 ≤ P_{X̂^n,Ŷ^n}(F_n^c(δ))
  + (1/Mn) ∑_{m=1}^{Mn} ∑_{c_1 ∈ X^n} · · · ∑_{c_{m−1} ∈ X^n} ∑_{c_{m+1} ∈ X^n} · · · ∑_{c_{Mn} ∈ X^n} P_{X̂^n}(c_1) · · · P_{X̂^n}(c_{m−1}) P_{X̂^n}(c_{m+1}) · · · P_{X̂^n}(c_{Mn})
    × ∑_{m′=1, m′≠m}^{Mn} ∑_{c_m ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n),    (4.3.5)
where (4.3.5) follows from (4.3.4), and the last step holds since P_{X̂^n,Ŷ^n}(F_n^c(δ)) is
a constant independent of c_1, . . ., c_{Mn} and m. Observe that for n > N_0,
∑_{m′=1, m′≠m}^{Mn} [ ∑_{c_1 ∈ X^n} · · · ∑_{c_{m−1} ∈ X^n} ∑_{c_{m+1} ∈ X^n} · · · ∑_{c_{Mn} ∈ X^n} P_{X̂^n}(c_1) · · · P_{X̂^n}(c_{m−1}) P_{X̂^n}(c_{m+1}) · · · P_{X̂^n}(c_{Mn}) × ∑_{c_m ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n) ]
 = ∑_{m′=1, m′≠m}^{Mn} [ ∑_{c_{m′} ∈ X^n} ∑_{c_m ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n}(c_{m′}) P_{X̂^n,Ŷ^n}(c_m, y^n) ]
 = ∑_{m′=1, m′≠m}^{Mn} [ ∑_{c_{m′} ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n}(c_{m′}) ( ∑_{c_m ∈ X^n} P_{X̂^n,Ŷ^n}(c_m, y^n) ) ]
 = ∑_{m′=1, m′≠m}^{Mn} [ ∑_{c_{m′} ∈ X^n} ∑_{y^n ∈ F_n(δ|c_{m′})} P_{X̂^n}(c_{m′}) P_{Ŷ^n}(y^n) ]
 = ∑_{m′=1, m′≠m}^{Mn} [ ∑_{(c_{m′}, y^n) ∈ F_n(δ)} P_{X̂^n}(c_{m′}) P_{Ŷ^n}(y^n) ]
 ≤ ∑_{m′=1, m′≠m}^{Mn} |F_n(δ)| 2^{−n(H(X̂)−δ)} 2^{−n(H(Ŷ)−δ)}
 ≤ ∑_{m′=1, m′≠m}^{Mn} 2^{n(H(X̂,Ŷ)+δ)} 2^{−n(H(X̂)−δ)} 2^{−n(H(Ŷ)−δ)}
 = (Mn − 1) 2^{−n(I(X̂;Ŷ)−3δ)}
 ≤ 2^{n(C−4δ)} 2^{−n(C−3δ)} = 2^{−nδ},
where the first inequality follows from the definition of the jointly typical set Fn (δ),
the second inequality holds by the Shannon–McMillan–Breiman theorem for pairs
(Theorem 4.9), the last inequality follows since C = I ( X̂ ; Ŷ ) by definition of X̂
and Ŷ , and since (1/n) log2 Mn ≤ C − (γ/2) = C − 4δ. Consequently,
E_{∼Cn}[Pe(∼Cn)] ≤ P_{X̂^n,Ŷ^n}(F_n^c(δ)) + 2^{−nδ},

which for sufficiently large n (and n > N_0), can be made smaller than 2δ = γ/4 <
ε by the Shannon–McMillan–Breiman theorem for pairs.
Before proving the converse part of the channel coding theorem, let us recall Fano’s
inequality in a channel coding context. Consider an (n, Mn ) channel block code ∼Cn
with encoding and decoding functions given by
f n : {1, 2, . . . , Mn } → X n
and
gn : Y n → {1, 2, . . . , Mn },
respectively. Let message W , which is uniformly distributed over the set of messages
{1, 2, . . . , Mn }, be sent via codeword X n (W ) = f n (W ) over the DMC, and let Y n
be received at the channel output. At the receiver, the decoder estimates the sent
message via Ŵ = gn (Y n ) and the probability of estimation error is given by the
code’s average error probability:
Pr[W ≠ Ŵ] = Pe(∼Cn).

Fano's inequality then yields

H(W|Y^n) ≤ h_b(Pe(∼Cn)) + Pe(∼Cn) log2(Mn − 1) < 1 + Pe(∼Cn) log2 Mn,

and since W → X^n → Y^n form a Markov chain, the data processing inequality gives

I(W; Y^n) ≤ I(X^n; Y^n).    (4.3.7)
I(X^n; Y^n) ≤ max_{P_{X^n}} I(X^n; Y^n)
 ≤ max_{P_{X^n}} ∑_{i=1}^{n} I(X_i; Y_i)   (by Theorem 2.21)
 ≤ ∑_{i=1}^{n} max_{P_{X^n}} I(X_i; Y_i)
 = ∑_{i=1}^{n} max_{P_{X_i}} I(X_i; Y_i)
 = nC.    (4.3.8)
Combining the above with Fano's inequality and noting that H(W) = log2 Mn (as W is uniformly distributed), we get

log2 Mn = H(W) = H(W|Y^n) + I(W; Y^n) < 1 + Pe(∼Cn) log2 Mn + nC,

which implies that

Pe(∼Cn) > 1 − C/((1/n) log2 Mn) − 1/log2 Mn = 1 − (C + 1/n)/((1/n) log2 Mn).

Since lim inf_{n→∞} (1/n) log2 Mn > C, there exists an integer N such that for all n ≥ N,

(1/n) log2 Mn ≥ (C + 1/n)/(1 − (1 − ε)μ),    (4.3.9)
Fig. 4.5 Asymptotic channel coding rate R versus channel capacity C and behavior of the proba-
bility of error as blocklength n goes to infinity for a DMC
because, otherwise, (4.3.9) would be violated for infinitely many n, implying a con-
tradiction that
lim inf_{n→∞} (1/n) log2 Mn ≤ lim inf_{n→∞} (C + 1/n)/(1 − (1 − ε)μ) = C/(1 − (1 − ε)μ).
Hence, for n ≥ N,

Pe(∼Cn) > 1 − (C + 1/n)/((1/n) log2 Mn) ≥ 1 − [1 − (1 − ε)μ] = (1 − ε)μ > 0;
Observation 4.12 The results of the above channel coding theorem, which proves
that Cop = C, are illustrated in Fig. 4.5,8 where R = lim inf_{n→∞} (1/n) log2 Mn (measured
in message bits/channel use) is usually called the asymptotic coding rate of
channel block codes. As indicated in the figure, the asymptotic rate of any good block
code for the DMC must be smaller than or equal to the channel capacity C.9 Con-
versely, any block code with (asymptotic) rate greater than C, will have its probability
of error bounded away from zero.
Observation 4.13 (Zero error codes) In the converse part of Theorem 4.11, we
showed that
lim inf_{n→∞} Pe(∼Cn) = 0  =⇒  lim inf_{n→∞} (1/n) log2 Mn ≤ C.    (4.3.10)
8 Note that Theorem 4.11 actually implies that limn→∞ Pe = 0 for R < Cop = C and that
lim inf n→∞ Pe > 0 for R > Cop = C; these properties, however, might not hold for more general
channels than the DMC. For general channels, three partitions instead of two may result, i.e.,
R < Cop , Cop < R < C̄op and R > C̄op , which, respectively, correspond to lim supn→∞ Pe = 0
for the best block code, lim supn→∞ Pe > 0 but lim inf n→∞ Pe = 0 for the best block code, and
lim inf n→∞ Pe > 0 for all channel codes, where C̄op is called the channel’s optimistic operational
capacity [394, 396]. Since C̄op = Cop = C for DMCs, the three regions are reduced to two. A
formula for C̄op in terms of a generalized (spectral) mutual information rate is established in [75].
9 It can be seen from the theorem that C can be achieved as an asymptotic transmission rate as long
We next briefly examine the situation when we require that all (n, Mn ) codes
∼Cn are to be used with exactly no errors for any value of the blocklength n; i.e.,
Pe (∼Cn ) = 0 for every n. In this case, we readily obtain that H (W |Y n ) = 0, which
in turn implies (by invoking the data processing inequality) that for any n,
log2 Mn = H(W) = H(W|Y^n) + I(W; Y^n)
 = I(W; Y^n)
 ≤ I(X^n; Y^n)
 ≤ nC.

Thus,

Pe(∼Cn) = 0 ∀ n  =⇒  lim sup_{n→∞} (1/n) log2 Mn ≤ C.
Shannon’s channel coding theorem, established in 1948 [340], provides the ulti-
mate limit for reliable communication over a noisy channel. However, it does not
provide an explicit efficient construction for good codes since searching for a good
code from the ensemble of randomly generated codes is prohibitively complex, as
its size grows double exponentially with blocklength (see Step 1 of the proof of
the forward part). It thus spurred the entire area of coding theory, which flourished
over the last several decades with the aim of constructing powerful error-correcting
codes operating close to the capacity limit. Particular advances were made for the
class of linear codes (also known as group codes) whose rich10 yet elegantly sim-
ple algebraic structures made them amenable for efficient practically implementable
encoding and decoding. Examples of such codes include Hamming, Golay, Bose–
Chaudhuri–Hocquenghem (BCH), Reed–Muller, Reed–Solomon and convolutional
codes. In 1993, the so-called Turbo codes were introduced by Berrou et al. [44,
45] and shown experimentally to perform close to the channel capacity limit for the
class of memoryless channels. Similar near-capacity achieving linear codes were
later established with the rediscovery of Gallager’s low-density parity-check codes
(LDPC) [133, 134, 251, 252]. A more recent breakthrough was the invention of
polar codes by Arikan in 2007, when he provided a deterministic construction of
codes that can provably achieve channel capacity [22, 23]; see the next section for
a brief illustrative example on polar codes for the BEC. Many of the above codes
are used with increased sophistication in today’s ubiquitous communication, infor-
mation and multimedia technologies. For detailed studies on channel coding theory,
see the following texts [50, 52, 208, 248, 254, 321, 407].
10 Indeed, there exist linear codes that can achieve the capacity of memoryless channels with additive
noise (e.g., see [87, p. 114]). Such channels include the BSC and the q-ary symmetric channel.
4.4 Example of Polar Codes for the BEC
As noted above, polar coding is a new channel coding method proposed by Arikan
[22, 23], which can provably achieve the capacity of any binary-input memoryless
channel Q whose capacity is realized by a uniform input distribution (e.g., quasi-
symmetric channels). The proof technique and code construction, which have low
encoding and decoding complexity, are purely based on information-theoretic con-
cepts. For simplicity, we focus solely on a channel Q given by the BEC with erasure
probability ε, which we denote as BEC(ε) for short.
The main idea behind polar codes is channel “polarization,” which transforms
many independent uses of BEC(ε), n uses to be precise (where n is the coding
blocklength),11 into extremal “polarized” channels; i.e., channels which are either
perfect (noiseless) or completely noisy. It is shown that as n → ∞, the fraction of
unpolarized channels converges to 0 and the fraction of perfect channels converges
to I (X ; Y ) = 1 − ε under a uniform input, which is the capacity of the BEC. A polar
code can then be naturally obtained by sending information bits directly through
those perfect channels and sending known bits (usually called frozen bits) through
the completely noisy channels.
We start with the simplest case of n = 2. The channel transformation depicted
in Fig. 4.6a is usually called the basic transformation. In this figure, we have two
independent uses of BEC(ε), namely, (X 1 , Y1 ) and (X 2 , Y2 ), where every bit has ε
chance of being erased. In other words, under uniformly distributed X 1 and X 2 , we
have
I (Q) := I (X 1 ; Y1 ) = I (X 2 ; Y2 ) = 1 − ε.
Now consider the following linear modulo-2 operation shown in Fig. 4.6:
X 1 = U1 ⊕ U2 ,
X 2 = U2 ,
where U1 and U2 are independent uniformly distributed message bits. From these two channel uses, one forms the two synthesized channels

Q− : U1 → (Y1, Y2),
Q+ : U2 → (Y1, Y2, U1),
11 Recall that in channel coding, a codeword of length n is typically sent by using the channel
n consecutive times (i.e., in series). But in polar coding, an equivalent method is applied, which
consists of using n identical and independent copies of the channel in parallel, with each channel
being utilized only once.
[Fig. 4.6: the basic polarization transformation. (a) Two independent uses of BEC(ε), with X1 = U1 ⊕ U2 and X2 = U2; the synthesized channel Q− seen by U1 has erasure probability 1 − (1 − ε)^2, and the synthesized channel Q+ seen by U2 has erasure probability ε^2. (b) The analogous transformation for two independent BECs with erasure probabilities ε1 and ε2, yielding erasure probabilities 1 − (1 − ε1)(1 − ε2) and ε1·ε2.]
respectively (the names of these channels will be justified shortly). Note that correctly
receiving Y1 = X 1 alone is not enough for us to determine U1 , since U2 is a uniform
random variable that is independent of U1 . One really needs to have both Y1 = X 1 and
Y2 = X 2 for correctly decoding U1 . This observation implies that Q− is a BEC with
erasure probability12 ε− := 1 − (1 − ε)^2. Also, note that given U1, either Y1 = X1
or Y2 = X2 is sufficient to determine U2. This implies that Q+ is a BEC with erasure
probability ε+ := ε^2.
Overall, we have

I(Q−) + I(Q+) = (1 − ε)^2 + (1 − ε^2) = 2(1 − ε) = 2 I(Q)    (4.4.1)

and
12 More precisely, channel Q− has the same behavior as a BEC, and it can be exactly converted to a
BEC after relabeling its output pair (y1, y2) as an equivalent three-valued symbol y_{1,2} as follows:

y_{1,2} = { 0  if (y1, y2) ∈ {(0, 0), (1, 1)},
            E  if (y1, y2) ∈ {(0, E), (1, E), (E, E), (E, 0), (E, 1)},
            1  if (y1, y2) ∈ {(0, 1), (1, 0)}.
(1 − ε)^2 = I(Q−)
 ≤ I(Q) = 1 − ε
 ≤ I(Q+) = 1 − ε^2,    (4.4.2)
with equality iff ε(1 − ε) = 0 (i.e., ε = 0 or ε = 1). Equation (4.4.1) shows that
the basic transformation does not incur any loss in mutual information. Furthermore,
(4.4.2) indeed confirms that Q+ and Q− are, respectively, better and worse than Q.13
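For a quick numerical sanity check of (4.4.1) and (4.4.2), the following few lines of Python (our own illustration; the value ε = 0.3 is arbitrary) verify both relations:

import math

eps = 0.3
I_minus = (1 - eps) ** 2      # I(Q-) = 1 - (1 - (1 - eps)^2)
I_plus = 1 - eps ** 2         # I(Q+) = 1 - eps^2
I_q = 1 - eps                 # I(Q)
print(math.isclose(I_minus + I_plus, 2 * I_q))   # (4.4.1): no mutual information is lost
print(I_minus <= I_q <= I_plus)                  # (4.4.2): Q- is worse, Q+ is better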
So far, we have talked about how to use the basic transformation to generate a
better channel Q+ and a worse channel Q− from two independent uses of Q =
BEC(ε). Now, let us consider the case of n = 4 and suppose we perform the basic
transformation twice to send (i.i.d. uniform) message bits (U1 , U2 , U3 , U4 ), yielding
Q− : V1 → (Y1 , Y2 ), where X 1 = V1 ⊕ V2 ,
Q+ : V2 → (Y1 , Y2 , V1 ), where X 2 = V2 ,
Q− : V3 → (Y3 , Y4 ), where X 3 = V3 ⊕ V4 ,
Q+ : V4 → (Y3 , Y4 , V3 ), where X 4 = V4 ,
13 The same reasoning can be applied to form the basic transformation for two independent but
not identically distributed BECs as shown in Fig. 4.6b, where Q+ and Q− become BEC(ε1 ε2 )
and BEC(1 − (1 − ε1 )(1 − ε2 )), respectively. This extension may be useful when combining n
independent uses of a channel in a multistage manner (in particular, when the two channels to be
combined may become non-identically distributed after the second stage). In Example 4.14, only
identically distributed BECs will be combined at each stage, which is a typical design for polar
coding.
largest mutual informations. The other n − k positions are stuffed with frozen bits;
this encoding process is precisely channel combining. The decoder successively
decodes Ui , i ∈ {1, . . . , n}, based on (Y1 , . . . , Yn ) and the previously decoded Û j ,
j ∈ {1, . . . , i − 1}. This decoder is called a successive cancellation decoder and
mimics the behavior of channel splitting in the process of channel polarization.
Example 4.14 Consider a BEC with erasure probability ε = 0.5 and let n = 8. The
channel polarization process for this example is shown in Fig. 4.7. Note that since
the mutual information of a BEC(ε) under a uniform input is simply 1 − ε, one can
equivalently keep track of the erasure probabilities, as shown in parentheses
in Fig. 4.7. Now, suppose we would like to construct an (8, 4) polar code; we then pick the
four positions with the largest mutual informations (i.e., the smallest erasure probabilities).
That is, we pick (U4, U6, U7, U8) to send uncoded information bits, and the other positions are
frozen.
As an example of the computation of the erasure probabilities, 0.5625 for T2 is
obtained from 0.75 × 0.75, which are the numbers above V1 and V3, while combining
the two second-stage channels that each have erasure probability 0.9375 produces
1 − (1 − 0.9375)(1 − 0.9375) ≈ 0.9961, which is the number above U1.
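The erasure-probability bookkeeping of Example 4.14 can be reproduced with a few lines of Python (our own sketch; we assume that the indexing of the synthesized channels below matches the U1, . . . , U8 labeling of Fig. 4.7):

def polarize(eps, stages):
    # Track the erasure probabilities of the synthesized channels of a BEC(eps)
    # through `stages` levels of the basic transformation: a channel with
    # erasure probability e splits into a worse channel with 1 - (1 - e)^2
    # and a better channel with e^2.
    probs = [eps]
    for _ in range(stages):
        nxt = []
        for e in probs:
            nxt.append(1 - (1 - e) ** 2)   # "minus" (worse) channel
            nxt.append(e ** 2)             # "plus" (better) channel
        probs = nxt
    return probs

probs = polarize(0.5, 3)                   # n = 8 synthesized channels for BEC(0.5)
for i, e in enumerate(probs, start=1):
    print(f"U{i}: erasure probability = {e:.4f}")   # U1 gives 0.9961, U8 gives 0.0039

# The positions with the smallest erasure probabilities carry the information bits.
k = 4
best = sorted(range(1, 9), key=lambda i: probs[i - 1])[:k]
print("information positions:", sorted(best))       # expected: [4, 6, 7, 8]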
Ever since their invention by Arikan [22, 23], polar codes have generated exten-
sive interest; see [25, 26, 226, 228, 329, 371, 372] and the references therein and
thereafter. A key reason for their prevalence is that they form the first coding scheme
that has an explicit low-complexity construction structure while being capable of
achieving channel capacity as code length approaches infinity. More importantly,
polar codes do not exhibit the error floor behavior, which Turbo and (to a lesser
extent) LDPC codes are prone to. In practice, since one cannot have infinitely many
stages of polarization, there will always exist unpolarized channels. The develop-
ment of effective construction and decoding methods for polar codes with practical
blocklengths is an active area of research. Due to their attractive properties, polar
codes were adopted in 2016 by the 3rd Generation Partnership Project (3GPP) as
error-correcting codes for the control channel of the 5th generation (5G) mobile
communication standard [99].
We conclude by noting that the notion of polarization is not unique to channel cod-
ing; it can also be applied to source coding and other information-theoretic problems
including secrecy and multiuser systems (e.g., cf. [24, 148, 226, 227, 256]).
4.5 Calculating Channel Capacity

Given a DMC with finite input alphabet X , finite output alphabet Y and channel
transition matrix Q = [ px,y ] of size |X | × |Y|, where px,y := PY |X (y|x), for x ∈ X
and y ∈ Y, we would like to calculate
C := max_{P_X} I(X; Y),
where the maximization (which is well-defined) is carried over the set of input dis-
tributions PX , and I (X ; Y ) is the mutual information between the channel’s input
and output.
Note that C can be determined numerically via nonlinear optimization techniques
—such as the iterative algorithms developed by Arimoto [27] and Blahut [49, 51],
see also [88] and [415, Chap. 9]. In general, there are no closed-form (single-letter)
analytical expressions for C. However, for many “simplified” channels, it is possible
to analytically determine C under some “symmetry” properties of their channel
transition matrix.
Definition 4.15 A DMC with finite input alphabet X , finite output alphabet Y and
channel transition matrix Q = [ px,y ] of size |X | × |Y| is said to be symmetric if the
rows of Q are permutations of each other and the columns of Q are permutations
of each other. The channel is said to be weakly symmetric if the rows of Q are
permutations of each other and all the column sums in Q are equal.
is weakly symmetric (but not symmetric). Noting that all above channels involve
square transition matrices, we emphasize that Q can be rectangular while satisfying
the symmetry or weak-symmetry properties. For example, the DMC with |X | = 2,
|Y| = 4 and
Q = [ (1−ε)/2   ε/2   (1−ε)/2   ε/2
      ε/2   (1−ε)/2   ε/2   (1−ε)/2 ]    (4.5.2)
is symmetric (where ε ∈ [0, 1]), while the DMC with |X | = 2, |Y| = 3 and
Q = [ 1/3   1/6   1/2
      1/3   1/2   1/6 ]
is weakly symmetric.
Lemma 4.16 The capacity of a weakly symmetric channel Q is achieved by a uniform input distribution and is given by

C = log2 |Y| − H(q1, q2, . . . , q_|Y|),    (4.5.3)

where (q1, q2, . . . , q_|Y|) denotes any row of Q and

H(q1, q2, . . . , q_|Y|) := − ∑_{i=1}^{|Y|} q_i log2 q_i.
Proof The mutual information between the channel’s input and output is given by
I(X; Y) = H(Y) − H(Y|X)
 = H(Y) − ∑_{x∈X} P_X(x) H(Y|X = x),

where H(Y|X = x) = − ∑_{y∈Y} P_{Y|X}(y|x) log2 P_{Y|X}(y|x) = − ∑_{y∈Y} p_{x,y} log2 p_{x,y}.
Noting that every row of Q is a permutation of every other row, we obtain that
H (Y |X = x) is independent of x and can be written as
H (Y |X = x) = H (q1 , q2 , . . . , q|Y| ),
This implies
I (X ; Y ) = H (Y ) − H (q1 , q2 , . . . , q|Y| )
≤ log2 |Y| − H (q1 , q2 , . . . , q|Y| ),
with equality achieved iff Y is uniformly distributed over Y. We next show that
choosing a uniform input distribution, P_X(x) = 1/|X| ∀ x ∈ X, yields a uniform
output distribution, hence maximizing mutual information. Indeed, under a uniform
input distribution, we obtain that for any y ∈ Y,

P_Y(y) = ∑_{x∈X} P_X(x) P_{Y|X}(y|x) = (1/|X|) ∑_{x∈X} p_{x,y} = A/|X|,

where A := ∑_{x∈X} p_{x,y} is a constant given by the sum of the entries in any column
of Q,
since by the weak-symmetry property all column sums in Q are identical. Note
that ∑_{y∈Y} P_Y(y) = 1 yields that

∑_{y∈Y} A/|X| = 1

and hence

A = |X|/|Y|.    (4.5.4)

Accordingly,

P_Y(y) = A/|X| = (|X|/|Y|) · (1/|X|) = 1/|Y|

for any y ∈ Y; thus the uniform input distribution induces a uniform output distribution and achieves channel capacity as given by (4.5.3).
Observation 4.17 Note that if the weakly symmetric channel has a square (i.e.,
with |X | = |Y|) transition matrix Q, then Q is a doubly stochastic matrix; i.e., both
its row sums and its column sums are equal to 1. Note, however, that having a square
transition matrix does not necessarily make a weakly symmetric channel symmetric;
e.g., see (4.5.1).
Example 4.18 (Capacity of the BSC) Since the BSC with crossover probability (or
bit error rate) ε is symmetric, we directly obtain from Lemma 4.16 that its capacity
is achieved by a uniform input distribution and is given by

C = log2 2 − H(1 − ε, ε) = 1 − h_b(ε),

where h_b(·) denotes the binary entropy function.
Example 4.19 (Capacity of the q-ary symmetric channel) Similarly, the q-ary sym-
metric channel with symbol error rate ε described in (4.2.11) is symmetric; hence,
by Lemma 4.16, its capacity is given by
C = log2 q − H(1 − ε, ε/(q−1), . . . , ε/(q−1))
 = log2 q + ε log2 (ε/(q−1)) + (1 − ε) log2(1 − ε).
Note that when q = 2, the channel capacity is equal to that of the BSC, as expected.
Furthermore, when ε = 0, the channel reduces to the identity (noiseless) q-ary
channel and its capacity is given by C = log2 q.
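The closed-form expression of Lemma 4.16 is straightforward to evaluate. The sketch below (our own; the helper names and parameter values are arbitrary) computes C = log2 |Y| − H(row) and reproduces the BSC and q-ary symmetric channel capacities of Examples 4.18 and 4.19:

from math import log2

def row_entropy(row):
    return -sum(p * log2(p) for p in row if p > 0)

def weakly_symmetric_capacity(Q):
    # C = log2 |Y| - H(any row of Q), valid for weakly symmetric channels.
    return log2(len(Q[0])) - row_entropy(Q[0])

eps = 0.2
print(weakly_symmetric_capacity([[1 - eps, eps], [eps, 1 - eps]]))  # BSC: 1 - h_b(0.2), about 0.278

q = 4
row = [1 - eps] + [eps / (q - 1)] * (q - 1)
Q = [row[i:] + row[:i] for i in range(q)]        # cyclic shifts give a q-ary symmetric channel
print(weakly_symmetric_capacity(Q))              # these two printed values agree
print(log2(q) + eps * log2(eps / (q - 1)) + (1 - eps) * log2(1 - eps))   # Example 4.19 formula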
We next note that one can further weaken the weak-symmetry property and define
a class of “quasi-symmetric” channels for which the uniform input distribution still
achieves capacity and yields a simple closed-form formula for capacity.
Definition 4.20 A DMC with finite input alphabet X , finite output alphabet Y and
channel transition matrix Q = [ px,y ] of size |X |×|Y| is said to be quasi-symmetric14
if Q can be partitioned along its columns into m weakly symmetric sub-matrices
Q1 , Q2 , . . . , Qm for some integer m ≥ 1, where each Qi sub-matrix has size |X | ×
|Yi| for i = 1, 2, . . . , m with Y1 ∪ · · · ∪ Ym = Y and Yi ∩ Yj = ∅ ∀ i ≠ j,
i, j = 1, 2, . . . , m.
Hence, quasi-symmetry is our weakest symmetry notion, since a weakly symmet-
ric channel is clearly quasi-symmetric (just set m = 1 in the above definition); we
thus have symmetry =⇒ weak-symmetry =⇒ quasi-symmetry.
Lemma 4.21 The capacity of a quasi-symmetric channel Q as defined above is
achieved by a uniform input distribution and is given by
C = ∑_{i=1}^{m} a_i C_i,    (4.5.6)

where

a_i := ∑_{y∈Y_i} p_{x,y} = sum of any row in Q_i,   i = 1, . . . , m,

and

C_i = log2 |Y_i| − H(any row in the matrix (1/a_i) Q_i),   i = 1, . . . , m

is the capacity of the ith weakly symmetric “sub-channel” whose transition matrix is
obtained by multiplying each entry of Q_i by 1/a_i (this normalization renders sub-matrix
Q_i into a stochastic matrix and hence a channel transition matrix).
Proof We first observe that for each i = 1, . . . , m, ai is independent of the input
value x, since sub-matrix i is weakly symmetric (so any row in Qi is a permutation
of any other row), and hence, ai is the sum of any row in Qi .
For each i = 1, . . . , m, define

P_{Y_i|X}(y|x) := { p_{x,y}/a_i   if y ∈ Y_i and x ∈ X;
                    0             otherwise,
14 This notion of “quasi-symmetry” is slightly more general than Gallager’s notion [135, p. 94], as
Each P_{Y_i|X} so defined is the transition distribution of a weakly symmetric sub-channel with output alphabet Y_i. The mutual information then decomposes as

I(X; Y) = ∑_{i=1}^{m} ∑_{y∈Y_i} ∑_{x∈X} P_X(x) p_{x,y} log2 [ p_{x,y} / ∑_{x′∈X} P_X(x′) p_{x′,y} ]
 = ∑_{i=1}^{m} a_i ∑_{y∈Y_i} ∑_{x∈X} P_X(x) (p_{x,y}/a_i) log2 [ (p_{x,y}/a_i) / ∑_{x′∈X} P_X(x′) (p_{x′,y}/a_i) ]
 = ∑_{i=1}^{m} a_i ∑_{y∈Y_i} ∑_{x∈X} P_X(x) P_{Y_i|X}(y|x) log2 [ P_{Y_i|X}(y|x) / ∑_{x′∈X} P_X(x′) P_{Y_i|X}(y|x′) ]
 = ∑_{i=1}^{m} a_i I(X; Y_i).
C = max_{P_X} I(X; Y)
 = max_{P_X} ∑_{i=1}^{m} a_i I(X; Y_i)
 = ∑_{i=1}^{m} a_i max_{P_X} I(X; Y_i)   (as the same uniform P_X maximizes each I(X; Y_i))
 = ∑_{i=1}^{m} a_i C_i.
Example 4.22 (Capacity of the BEC) The BEC with erasure probability α as given in
(4.2.6) is quasi-symmetric (but neither weakly symmetric nor symmetric). Indeed, its
transition matrix Q can be partitioned along its columns into two symmetric (hence
weakly symmetric) sub-matrices
Q1 = [ 1−α   0
       0     1−α ]

and

Q2 = [ α
       α ].
Thus, applying the capacity formula for quasi-symmetric channels of Lemma 4.21
yields that the capacity of the BEC is given by
C = a1 C 1 + a2 C 2 ,
where a1 = 1 − α, a2 = α,
C1 = log2 2 − H((1−α)/(1−α), 0/(1−α)) = 1 − H(1, 0) = 1 − 0 = 1,

and

C2 = log2 1 − H(α/α) = 0 − 0 = 0.
Therefore, the BEC capacity is given by

C = (1 − α) · 1 + α · 0 = 1 − α.
Example 4.23 (Capacity of the BSEC) Similarly, the BSEC with crossover probabil-
ity ε and erasure probability α as described in (4.2.8) is quasi-symmetric; its transition
matrix can be partitioned along its columns into two symmetric sub-matrices
Q1 = [ 1−ε−α   ε
       ε       1−ε−α ]

and

Q2 = [ α
       α ].

Here a1 = (1 − ε − α) + ε = 1 − α and a2 = α, with

C1 = log2 2 − H((1−ε−α)/(1−α), ε/(1−α)) = 1 − h_b((1−ε−α)/(1−α))

and

C2 = log2 1 − H(α/α) = 0.
We thus obtain that

C = (1 − α) [ 1 − h_b((1−ε−α)/(1−α)) ] + (α)(0)
 = (1 − α) [ 1 − h_b((1−ε−α)/(1−α)) ].    (4.5.8)
As already noted, the BSEC is a combination of the BSC with bit error rate ε and
the BEC with erasure probability α. Indeed, setting α = 0 in (4.5.8) yields that
C = 1 − h b (1 − ε) = 1 − h b (ε) which is the BSC capacity. Furthermore, setting
ε = 0 results in C = 1 − α, the BEC capacity.
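Lemma 4.21 also lends itself to direct computation. The following sketch (our own; the output partition must be supplied by hand and the parameter values are arbitrary) evaluates the sum of the a_i C_i terms for a quasi-symmetric channel and reproduces the BEC and BSEC capacities of Examples 4.22 and 4.23:

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def quasi_symmetric_capacity(Q, partition):
    # Capacity of a quasi-symmetric channel via Lemma 4.21: sum over the
    # weakly symmetric sub-matrices of a_i * (log2 |Y_i| - H(normalized row)).
    # `partition` lists the output-column indices of each sub-matrix.
    C = 0.0
    for cols in partition:
        sub_row = [Q[0][j] for j in cols]      # any row works by weak symmetry
        a = sum(sub_row)                       # row sum a_i of the sub-matrix
        C += a * (log2(len(cols)) - H([p / a for p in sub_row]))
    return C

alpha = 0.25
bec = [[1 - alpha, alpha, 0.0], [0.0, alpha, 1 - alpha]]   # outputs ordered (0, E, 1)
print(quasi_symmetric_capacity(bec, [[0, 2], [1]]))        # expected: 1 - alpha = 0.75

eps = 0.1
bsec = [[1 - eps - alpha, alpha, eps], [eps, alpha, 1 - eps - alpha]]
h_b = lambda p: H([p, 1 - p])
print(quasi_symmetric_capacity(bsec, [[0, 2], [1]]))       # these two values agree
print((1 - alpha) * (1 - h_b((1 - eps - alpha) / (1 - alpha))))   # formula (4.5.8)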
When the channel does not satisfy any symmetry property, the following necessary
and sufficient Karush–Kuhn–Tucker (KKT) conditions (e.g., cf. Appendix B.8, [135,
pp. 87–91] or [46, 56]) for calculating channel capacity can be quite useful.
Definition 4.24 (Mutual information for a specific input symbol) The mutual infor-
mation for a specific input symbol is defined as
I(x; Y) := ∑_{y∈Y} P_{Y|X}(y|x) log2 [ P_{Y|X}(y|x) / P_Y(y) ].
Lemma 4.25 (KKT conditions for channel capacity) For a given DMC, an input
distribution PX achieves its channel capacity iff there exists a constant C such that
I(x; Y) = C   ∀ x ∈ X with P_X(x) > 0;
I(x; Y) ≤ C   ∀ x ∈ X with P_X(x) = 0.    (4.5.9)
Furthermore, the constant C is the channel capacity (justifying the choice of nota-
tion).
Proof The forward (if) part holds directly; hence, we only prove the converse (only-
if) part. Without loss of generality, we assume that PX (x) < 1 for all x ∈ X , since
PX (x) = 1 for some x implies that I (X ; Y ) = 0. The problem of calculating the
channel capacity is to maximize
I(X; Y) = ∑_{x∈X} ∑_{y∈Y} P_X(x) P_{Y|X}(y|x) log2 [ P_{Y|X}(y|x) / ∑_{x′∈X} P_X(x′) P_{Y|X}(y|x′) ],    (4.5.10)
for a given channel distribution P_{Y|X}, subject to the constraint

∑_{x∈X} P_X(x) = 1.    (4.5.11)

By using the Lagrange multipliers method (e.g., see Appendix B.8 or [46]), maximizing (4.5.10) subject to (4.5.11) is equivalent to maximizing:
f(P_X) := ∑_{x∈X} ∑_{y∈Y} P_X(x) P_{Y|X}(y|x) log2 [ P_{Y|X}(y|x) / ∑_{x′∈X} P_X(x′) P_{Y|X}(y|x′) ] + λ ( ∑_{x∈X} P_X(x) − 1 ).
We then take the derivative of the above quantity with respect to P_X(x′) and obtain
that15

∂f(P_X)/∂P_X(x′) = I(x′; Y) − log2(e) + λ.

By the KKT conditions for constrained optimization (see Appendix B.8), an input distribution P_X maximizes f(P_X) iff

I(x′; Y) − log2(e) + λ = 0   ∀ x′ ∈ X with P_X(x′) > 0;
I(x′; Y) − log2(e) + λ ≤ 0   ∀ x′ ∈ X with P_X(x′) = 0,

for some λ. With the above result, setting C = −λ + log2(e) yields (4.5.9). Finally,
multiplying both sides of each equation in (4.5.9) by PX (x) and summing over x
yields that max PX I (X ; Y ) on the left and the constant C on the right, thus proving
that the constant C is indeed the channel’s capacity.
Example 4.26 (Quasi-symmetric channels) For a quasi-symmetric channel, one can
directly verify that the uniform input distribution satisfies the KKT conditions of
Lemma 4.25 and yields that the channel capacity is given by (4.5.6); this is left as an
exercise. As we already saw, the BSC, the q-ary symmetric channel, the BEC and
the BSEC are all quasi-symmetric.
Example 4.27 Consider a DMC with a ternary input alphabet X = {0, 1, 2}, binary
output alphabet Y = {0, 1} and the following transition matrix:
Q = [ 1     0
      1/2   1/2
      0     1 ].
This channel is not quasi-symmetric. However, one may guess that the capacity of
this channel is achieved by the input distribution (PX(0), PX(1), PX(2)) = (1/2, 0, 1/2)
since the input x = 1 has an equal conditional probability of being received as 0
or 1 at the output. Under this input distribution, we obtain that I (x = 0; Y ) =
I (x = 2; Y ) = 1 and that I (x = 1; Y ) = 0. Thus, the KKT conditions of (4.5.9)
are satisfied; hence confirming that the above input distribution achieves channel
capacity and that channel capacity is equal to 1 bit.
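The KKT conditions in Example 4.27 are easy to verify numerically. The short sketch below (our own) computes I(x; Y) of Definition 4.24 for each input symbol under the guessed input distribution (1/2, 0, 1/2):

from math import log2

Q = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
p_x = [0.5, 0.0, 0.5]
p_y = [sum(p_x[x] * Q[x][y] for x in range(3)) for y in range(2)]   # output distribution

def I_x(x):
    # Mutual information for the specific input symbol x (Definition 4.24).
    return sum(Q[x][y] * log2(Q[x][y] / p_y[y]) for y in range(2) if Q[x][y] > 0)

for x in range(3):
    print(x, I_x(x))
# I(0;Y) = I(2;Y) = 1 for the inputs with positive mass and I(1;Y) = 0 <= 1,
# so the KKT conditions (4.5.9) hold and the capacity equals 1 bit.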
Observation 4.28 (Capacity achieved by a uniform input distribution) We close
this section by noting that there is a class of DMCs that is larger than that of quasi-
symmetric channels for which the uniform input distribution achieves capacity. It
concerns the class of so-called “T -symmetric” channels [319, Sect. 5, Definition 1]
for which
T(x) := I(x; Y) − log2 |X| = ∑_{y∈Y} P_{Y|X}(y|x) log2 [ P_{Y|X}(y|x) / ∑_{x′∈X} P_{Y|X}(y|x′) ]

(with I(x; Y) evaluated under a uniform input distribution) is independent of x; the KKT conditions of Lemma 4.25 are then satisfied by the uniform input distribution.
Hence, the capacity of a T-symmetric channel is achieved by the uniform input distribution. See [319, Fig. 2]
for (infinitely many) other examples of T -symmetric channels. However, unlike
4.5 Calculating Channel Capacity 141
[Figure: block diagram of the source-channel coding system: Source → Encoder → Channel → Decoder → Sink, with X^n denoting the channel input and Y^n the channel output.]
f (sc) : V m → X n
and
g (sc) : Y n → V m .
The code’s operation is illustrated in Fig. 4.10. The source m-tuple V m is encoded via
the source-channel encoding function f (sc) , yielding the codeword X n = f (sc) (V m )
as the channel input. The channel output Y n , which is dependent on V m only via X n
(i.e., we have the Markov chain V m → X n → Y n ), is decoded via g (sc) to obtain the
source tuple estimate V̂ m = g (sc) (Y n ).
An error is made by the decoder if V^m ≠ V̂^m, and the code's error probability is
given by
17 The minimal achievable compression rate of such sources is given by the entropy rate, see Theo-
rem 3.15.
18 Note that n = n_m; that is, the channel blocklength n is in general a function of the source
blocklength m. Similarly, f^(sc) = f_m^(sc) and g^(sc) = g_m^(sc); i.e., the encoding and decoding functions
are implicitly dependent on m.
Pe(∼Cm,n) := Pr[V^m ≠ V̂^m]
 = ∑_{v^m ∈ V^m} ∑_{y^n ∈ Y^n : g^(sc)(y^n) ≠ v^m} P_{V^m}(v^m) P_{Y^n|X^n}(y^n | f^(sc)(v^m)).
• Converse part: For any 0 < ε < 1, if

H(V) > C,

then any sequence of rate-one source-channel block codes {∼Cm,m}_{m=1}^∞ satisfies

Pe(∼Cm,m) > (1 − ε)μ   for sufficiently large m,    (4.6.1)
where μ = H D (V) − C D with D = |V|, and H D (V) and C D are entropy rate
and channel capacity measured in D-ary digits, i.e., the codes’ error probability
is bounded away from zero and it is not possible to transmit the source over
the channel via rate-one source-channel block codes with arbitrarily low error
probability.20
Proof of the forward part: Without loss of generality, we assume throughout this
proof that both the source entropy rate H (V) and the channel capacity C are measured
in nats (i.e., they are both expressed using the natural logarithm).
We will show the existence of the desired rate-one source-channel codes ∼Cm,m
via a separate (tandem or two-stage) source and channel coding scheme as the one
depicted in Fig. 4.8.
Let γ := C − H(V) > 0. Now, given any 0 < ε < 1, by the lossless source coding
theorem for stationary ergodic sources (Theorem 3.15), there exists a sequence of
source codes of blocklength m and size Mm with encoder
f s : V m → {1, 2, . . . , Mm }
and decoder
gs : {1, 2, . . . , Mm } → V m
such that
(1/m) log Mm < H(V) + γ/2    (4.6.2)

and

Pr[ gs(fs(V^m)) ≠ V^m ] < ε/2

for all sufficiently large m.21 Furthermore, since H(V) + γ/2 = C − γ/2 < C, the channel coding theorem (Theorem 4.11) guarantees the existence of a sequence of channel block codes {(m, M̄m)}_{m=1}^∞ with encoder
f c : {1, 2, . . . , M̄m } → X m
20 Note that (4.6.1) actually implies that lim inf_{m→∞} Pe(∼Cm,m) ≥ lim_{ε↓0} (1 − ε)μ = μ, where the
error probability lower bound has nothing to do with ε. Here, we state the converse of Theorem
4.30 in a form in parallel to the converse statements in Theorems 3.6, 3.15 and 4.11.
21 Theorem 3.15 indicates that for any 0 < ε′ := min{ε/2, γ/(2 log(2))} < 1, there exists δ with
0 < δ < ε′ and a sequence of binary block codes {∼Cm = (m, Mm)}_{m=1}^∞ with

lim sup_{m→∞} (1/m) log2 Mm < H2(V) + δ,    (4.6.3)

and probability of decoding error satisfying Pe(∼Cm) < ε′ (≤ ε/2) for sufficiently large m, where
H2(V) is the entropy rate measured in bits. Here, (4.6.3) implies that (1/m) log2 Mm < H2(V) + δ for
sufficiently large m. Hence,

(1/m) log Mm < H(V) + δ log(2) < H(V) + ε′ log(2) ≤ H(V) + γ/2

for sufficiently large m.
and decoder
gc : Y m → {1, 2, . . . , M̄m }
such that22
(1/m) log M̄m > C − γ/2 = H(V) + γ/2 > (1/m) log Mm    (4.6.5)

and

λ := max_{w∈{1,...,M̄m}} Pr[ gc(Y^m) ≠ w | X^m = fc(w) ] < ε/2
for sufficiently large m. We next construct the desired rate-one source-channel code ∼Cm,m by concatenating the two codes: its encoder is

f^(sc)(v^m) := f_c(f_s(v^m))   ∀ v^m ∈ V^m,

and its decoder g^(sc) : Y^m → V^m is given by

g^(sc)(y^m) = { g_s(g_c(y^m)),   if g_c(y^m) ∈ {1, 2, . . . , Mm};
                arbitrary,        otherwise, }   ∀ y^m ∈ Y^m.
22 Theorem 4.11 and its proof of the forward part indicate that for any 0 < ε′ := min{ε/4, γ/(16 log(2))} < 1,
there exist 0 < γ′ < min{4ε′, C2} = min{ε, γ/(4 log(2)), C2} and a sequence of data transmission block
codes {∼Cm′ = (m, M̄m′)}_{m=1}^∞ satisfying

C2 − γ′ < (1/m) log2 M̄m′ ≤ C2 − γ′/2    (4.6.4)

and

Pe(∼Cm′) < ε′   for sufficiently large m,

provided that C2 > 0, where C2 is the channel capacity measured in bits.
Observation 4.6 indicates that by throwing away from ∼Cm′ half of its codewords with largest
conditional probability of error, a new code ∼Cm = (m, M̄m) = (m, M̄m′/2) is obtained, which
satisfies λ(∼Cm) ≤ 2Pe(∼Cm′) < 2ε′ ≤ ε/2.
Equation (4.6.4) then implies that for m > 1/γ′ sufficiently large,

(1/m) log M̄m = (1/m) log M̄m′ − (1/m) log(2) > C − γ′ log(2) − (1/m) log(2) > C − 2γ′ log(2) > C − γ/2.
The error probability of ∼Cm,m can then be bounded as follows:

Pe(∼Cm,m) = Pr[g^(sc)(Y^m) ≠ V^m]
 = Pr[g^(sc)(Y^m) ≠ V^m, g_c(Y^m) = f_s(V^m)]
  + Pr[g^(sc)(Y^m) ≠ V^m, g_c(Y^m) ≠ f_s(V^m)]
 ≤ Pr[g_s(f_s(V^m)) ≠ V^m] + Pr[g_c(Y^m) ≠ f_s(V^m)]
 = Pr[g_s(f_s(V^m)) ≠ V^m]
  + ∑_{w∈{1,2,...,Mm}} Pr[f_s(V^m) = w] Pr[g_c(Y^m) ≠ w | f_s(V^m) = w]
 = Pr[g_s(f_s(V^m)) ≠ V^m]
  + ∑_{w∈{1,2,...,Mm}} Pr[X^m = f_c(w)] Pr[g_c(Y^m) ≠ w | X^m = f_c(w)]
 ≤ Pr[g_s(f_s(V^m)) ≠ V^m] + λ
 < ε/2 + ε/2 = ε
for m sufficiently large. Thus, the source can be reliably sent over the channel via
rate-one block source-channel codes as long as H (V) < C.
Proof of the converse part: For simplicity, we assume in this proof that H (V) and
C are measured in bits.
For any m-to-m source-channel code ∼Cm,m , we can write
H(V) ≤ (1/m) H(V^m)    (4.6.6)
 = (1/m) H(V^m | V̂^m) + (1/m) I(V^m; V̂^m)
 ≤ (1/m) [ Pe(∼Cm,m) log2(|V|^m) + 1 ] + (1/m) I(V^m; V̂^m)    (4.6.7)
 ≤ Pe(∼Cm,m) log2 |V| + 1/m + (1/m) I(X^m; Y^m)    (4.6.8)
 ≤ Pe(∼Cm,m) log2 |V| + 1/m + C,    (4.6.9)
where
• Equation (4.6.6) is due to the fact that (1/m)H (V m ) is nonincreasing in m and con-
verges to H (V) as m → ∞ since the source is stationary (see Observation 3.12),
• Equation (4.6.7) follows from Fano's inequality,
• Equation (4.6.8) follows from the data processing inequality applied to the Markov chain V^m → X^m → Y^m → V̂^m, and
• Equation (4.6.9) holds since I(X^m; Y^m) ≤ mC for a DMC [cf. (4.3.8)].
Observation 4.31 We make the following remarks regarding the above joint source-
channel coding theorem:
• In general, it is not known whether the source can be (asymptotically) reliably
transmitted over the DMC when
H (V) = C
even if the source is a DMS. This is because separate source and channel coding
is used to prove the forward part of the theorem, with the source
coding rate approaching the source entropy rate from above [cf. (4.6.2)] while the
channel coding rate approaches channel capacity from below [cf. (4.6.5)].
• The above theorem directly holds for DMSs since any DMS is stationary
and ergodic.
• We can expand the forward part of the theorem above by replacing the requirement
that the source be stationary ergodic with the more general condition that the source
be information stable.23 Note that time-invariant irreducible Markov sources (that
are not necessarily stationary) are information stable.
The above lossless joint source-channel coding theorem can be readily generalized
for m-to-n source-channel codes—i.e., codes with rate not necessarily equal to one—
as follows (its proof, which is similar to the previous theorem, is left as an exercise).
Theorem 4.32 (Lossless joint source-channel coding theorem for general rate block
codes) Consider a discrete source {Vi}_{i=1}^∞ with finite alphabet V and entropy rate
H(V) and a DMC with input alphabet X , output alphabet Y and capacity C, where
both H(V) and C are measured in the same units. Then, the following holds:
• Forward part (achievability): For any 0 < ε < 1 and given that the source
is stationary ergodic, there exists a sequence of m-to-n_m source-channel codes
{∼Cm,n_m}_{m=1}^∞ such that
23 See
[75, 96, 303, 394] for a definition of information stable sources, whose property is slightly
more general than the Generalized AEP property given in Theorem 3.14.
Pe(∼Cm,n_m) < ε   for sufficiently large m, if

lim sup_{m→∞} m/n_m < C/H(V).
• Converse part: For any 0 < ε < 1 and given that the source is stationary, any
sequence of m-to-n_m source-channel codes {∼Cm,n_m}_{m=1}^∞ with

lim inf_{m→∞} m/n_m > C/H(V)
satisfies
Pe(∼Cm,n_m) > (1 − ε)μ   for sufficiently large m,
for some positive constant μ that depends on lim inf m→∞ (m/n m ), H (V) and C,
i.e., the codes’ error probability is bounded away from zero and it is not possible
to transmit the source over the channel via m-to-n m source-channel block codes
with arbitrarily low error probability.
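As a numerical illustration of Theorem 4.32 (our own sketch; the chosen source and channel parameters are arbitrary), the ratio C/H(V) is the threshold for the asymptotic source-to-channel rate m/n_m:

from math import log2

def h_b(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

H_V = h_b(0.25)       # entropy rate of a binary DMS with Pr[V = 1] = 0.25, in bits/source symbol
C = 1 - h_b(0.1)      # capacity of a BSC(0.1), in bits/channel use
threshold = C / H_V
print(round(H_V, 3), round(C, 3), round(threshold, 3))   # about 0.811, 0.531, 0.654

for rate in (0.5, 1.0):    # source symbols per channel use (m / n_m)
    print(rate, "reliably transmissible" if rate < threshold else "not transmissible")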
combat noise. For example, in a noiseless telegraph channel one could save about
50% in time by proper encoding of the messages. This is not done and most of the
redundancy of English remains in the channel symbols. This has the advantage,
however, of allowing considerable noise in the channel. A sizable fraction of the
letters can be received incorrectly and still reconstructed by the context. In fact this
is probably not a bad approximation to the ideal in many cases . . .
We make the following observations regarding the merits of joint versus separate
source-channel coding:
• Under finite coding blocklengths and/or complexity, many studies have demon-
strated that joint source-channel coding can provide better performance than sepa-
rate coding (e.g., see [13, 14, 37, 100, 127, 200, 247, 410, 427] and the references
therein).
• Even in the infinite blocklength regime where separate coding is optimal in terms
of reliable transmissibility, it can be shown that for a large class of systems, joint
source-channel coding can achieve an error exponent24 that is as large as double
the error exponent resulting from separate coding [422–424]. This indicates that
one can realize via joint source-channel coding the same performance as separate
coding, while reducing the coding delay by half (this result translates into notable
power savings of more than 2 dB when sending binary sources over channels
with Gaussian noise, fading and output quantization [422]). These findings provide
an information-theoretic rationale for adopting joint source-channel coding over
separate coding.
• Finally, it is important to point out that, with the exception of certain network
topologies [173, 383, 425] where separation is optimal, the separation theorem
does not in general hold for multiuser (multiterminal) systems (cf., [81, 83, 106,
174]),
and thus, in such systems, it is more beneficial to perform joint source-channel
coding.
The study of joint source-channel coding dates back to as early as the 1960s. Over
the years, many works have introduced joint source-channel coding techniques and
illustrated (analytically or numerically) their benefits (in terms of both performance
improvement and increased robustness to variations in channel noise) over separate
coding for given source and channel conditions and fixed complexity and/or delay
constraints. In joint source-channel coding systems, the designs of the source and
channel codes are either well coordinated or combined into a single step. Examples of
24 The error exponent or reliability function of a coding system is the largest rate of exponential
decay of its decoding error probability as the coding blocklength grows without bound [51, 87,
95, 107, 114, 135, 177, 178, 205, 347, 348]. Roughly speaking, the error exponent is a number
E with the property that the decoding error probability of a good code is approximately e−n E for
large coding blocklength n. In addition to revealing the fundamental trade-off between the error
probability of optimal codes and their blocklength for a given coding rate and providing insight on
the behavior of optimal codes, such a function provides a powerful tool for proving the achievability
part of coding theorems (e.g., [135]), for comparing the performance of competing coding schemes
(e.g., weighing joint against separate coding [422]) and for communications system design [194].
(both constructive and theoretical) previous lossless and lossy joint source-channel
coding investigations for single-user25 systems include the following:
(a) Fundamental limits: joint source-channel coding theorems and the separation
principle [21, 34, 75, 96, 103, 135, 161, 164, 172, 187, 231, 271, 273, 351,
365, 373, 386, 394, 399], and joint source-channel coding exponents [69, 70,
84, 85, 135, 220, 422–424].
(b) Channel-optimized source codes (i.e., source codes that are robust against chan-
nel noise) [15, 32, 33, 39, 102, 115–117, 121, 126, 131, 143, 155, 167, 218,
238–240, 247, 272, 293, 295, 296, 354–356, 369, 375, 392, 419].
(c) Source-optimized channel codes (i.e., channel codes that exploit the source’s
redundancy) [14, 19, 62, 91, 93, 100, 118, 122, 127, 139, 169, 198, 234, 263,
331, 336, 410, 427, 428], uncoded source-channel matching with joint decoding
[13, 92, 140, 230, 285, 294, 334, 335, 366, 406] and source-matched channel
signaling [109, 229, 276, 368].
(d) Jointly coordinated source and channel codes [61, 101, 124, 132, 149, 150,
152, 166, 168, 171, 183, 184, 189, 190, 204, 217, 241, 268, 275, 282, 283,
286, 288, 332, 381, 402, 416, 417].
(e) Hybrid digital-analog source-channel coding and analog mapping [8, 57, 64,
71, 77, 79, 112, 130, 138, 147, 185, 193, 219, 221, 232, 244, 245, 274, 314,
320, 324, 335, 341, 357, 358, 367, 382, 391, 401, 405, 409, 429].
The above references, while numerous, are not exhaustive, as the field of joint source-channel coding has been quite active, particularly over recent decades.
Problems
25 We underscore that, even though not listed here, the literature on joint source-channel coding for multiuser systems is also quite extensive and ongoing.
5. Consider a DMC with input X and output Y . Assume that the input alphabet is
X = {1, 2}, the output alphabet is Y = {0, 1, 2, 3}, and the transition probability
is given by
P_{Y|X}(y|x) = { 1 − 2ε,  if x = y;   ε,  if |x − y| = 1;   0,  otherwise },
[Figure: Z-channel transition diagram with input and output alphabet {0, 1} and transition probabilities labeled β and 1 − β.]
I (X ; Y ) ≥ H (Y ) − H (Z ).
(c) Show that the capacity of the Z -channel is no smaller than that of a BSC
with crossover probability 1 − β (i.e., a binary modulo-2 additive noise
channel with {Z i } as its noise process):
C ≥ 1 − h b (β)
9. A DMC has identical input and output alphabets given by {0, 1, 2, 3, 4}. Let X
be the channel input, and Y be the channel output. Suppose that
P_{Y|X}(i|i) = 1/2   ∀ i ∈ {0, 1, 2, 3, 4}.
(a) Find the channel transition matrix that maximizes H (Y |X ).
(b) Using the channel transition matrix obtained in (a), evaluate the channel
capacity.
10. Binary channel: Consider a binary memoryless channel with the following prob-
ability transition matrix:
Q = ⎡ 1 − α     α   ⎤
    ⎣   β     1 − β ⎦ ,
I (X ; Y ) = I (X, Z ; Y )
= h b (α) + αI (X 1 ; Y1 ) + (1 − α)I (X 2 ; Y2 ),
where α_n, X_n, Y_n, and N_n all take values from {0, 1}, and “⊕” represents the modulo-2 addition operation. Assume that the attenuation {α_n}_{n=1}^∞, channel input {X_n}_{n=1}^∞ and noise {N_n}_{n=1}^∞ processes are independent of each other. Also, {α_n}_{n=1}^∞ and {N_n}_{n=1}^∞ are i.i.d. with
Pr[α_n = 1] = Pr[α_n = 0] = 1/2
and
Pr[N_n = 1] = 1 − Pr[N_n = 0] = ε ∈ (0, 1/2).
(a) Show that the channel is a DMC and derive its transition probability matrix
⎡ P_{Y_j|X_j}(0|0)   P_{Y_j|X_j}(1|0) ⎤
⎣ P_{Y_j|X_j}(0|1)   P_{Y_j|X_j}(1|1) ⎦ .
Let
x̂ = g(y)
h_b(P_e) + 2P_e ≥ H(X|Y),
where h_b(p) = p log_2(1/p) + (1 − p) log_2(1/(1 − p)) is the binary entropy function. The curve for
h_b(P_e) + 2P_e = H(X|Y)
[Figure: plot of H(X|Y) (in bits) versus P_e, showing the curve h_b(P_e) + 2P_e = H(X|Y) together with the points A, B, C, and D referred to below.]
(a) Point A on the above figure shows that if H (X |Y ) = 0, zero estimation error,
namely, Pe = 0, can be achieved. In this case, characterize the distribution
PX |Y . Also, give an estimator g(·) that achieves Pe = 0. Hint: Think what
kind of relation between X and Y can render H (X |Y ) = 0.
(b) Point B on the above figure indicates that when H (X |Y ) = log2 (5), the
estimation error can only be equal to 0.8. In this case, characterize the
distributions PX |Y and PX . Prove that at H (X |Y ) = log2 (5), all estimators
yield Pe = 0.8.
Hint: Think what kind of relation between X and Y can result in H (X |Y ) =
log2 (5).
(c) Point C on the above figure hints that when H(X|Y) = 2, the estimation error can be as large as 1. Give an estimator g(·) that leads to P_e = 1, if P_{X|Y}(x|y) = 1/4 for x ≠ y, and P_{X|Y}(x|y) = 0 for x = y.
(d) Similarly, point D on the above figure hints that when H(X|Y) = 0, the estimation error can be as large as 1. Give an estimator g(·) that leads to P_e = 1 at H(X|Y) = 0.
22. Decide whether the following statement is true or false. Consider a discrete
memoryless channel with input alphabet X , output alphabet Y and transition
distribution PY |X (y|x) := Pr{Y = y|X = x}. Let PX 1 (·) and PX 2 (·) be two
possible input distributions, and P_{Y_1}(·) and P_{Y_2}(·) be the corresponding output distributions; i.e., ∀y ∈ Y, P_{Y_i}(y) = Σ_{x∈X} P_{Y|X}(y|x) P_{X_i}(x), i = 1, 2. Then,
and Q 2 is described by
Q_2 = [p_2(z|y)] = ⎡ 1 − ε     ε/2      ε/2  ⎤
                   ⎢  ε/2     1 − ε     ε/2  ⎥
                   ⎣  ε/2      ε/2     1 − ε ⎦ ,   0 ≤ ε ≤ 1,
25. Let X be a binary random variable with alphabet X = {0, 1}. Let Z denote
another random variable that is independent of X and taking values in Z =
{0, 1, 2, 3} such that Pr[Z = 0] = Pr[Z = 1] = Pr[Z = 2] = ε, where 0 < ε ≤ 1/3. Consider a DMC with input X, noise Z, and output Y described
by the equation
Y = 3X + (−1)^X Z,
Y = X ⊕q Z ,
and
B = {r + 1, r + 2, . . . , q − 1},
called focused error control codes, was developed in [129] to provide a certain
level of protection against the common errors of the channel while guaran-
teeing another lower level of protection against uncommon errors; hence the
levels of protection are determined based not only on the numbers of errors
but on the kind of errors as well (unlike traditional channel codes). The per-
formance of these codes was assessed in [10].
27. Effect of memory on capacity: This problem illustrates the adage “memory
increases (operational) capacity.” Given an integer q ≥ 2, consider a q-ary
additive noise channel described by
Yi = X i ⊕q Z i , i = 1, 2 . . . ,
where ⊕q denotes addition modulo-q and Yi , X i and Z i are the channel output,
input and noise at time instant i, all with identical alphabet Y = X = Z =
{0, 1, . . . , q − 1}. We assume that the input and noise processes are independent of each other and that the noise process {Z_i}_{i=1}^∞ is stationary ergodic. It can be
shown via an extended version of Theorem 4.11 that the operational capacity of
this channel with memory is given by [96, 191]:
C_op = lim_{n→∞} max_{p(x^n)} (1/n) I(X^n; Y^n).
Cop ≥ C̃.
Note: The adage “memory increases (operational) capacity” does not hold for
arbitrary channels. It is only valid for well-behaved channels with memory [97],
such as the above additive noise channel with stationary ergodic noise or more
generally for information stable26 channels [96, 191, 303] whose capacity is
given by27
C_op = liminf_{n→∞} max_{p(x^n)} (1/n) I(X^n; Y^n).     (4.7.1)
26 Loosely speaking, a channel is information stable if the input process which maximizes the
channel’s block mutual information yields a joint input–output process that behaves ergodically
and satisfies the joint AEP (see [75, 96, 191, 303, 394] for a precise definition).
27 Note that a formula for the capacity of more general (not necessarily information stable) channels with memory does exist in terms of a generalized (spectral) mutual information rate; see [172, 396].
However, one can find counterexamples to this adage, such as in [3] regarding
non-ergodic “averaged” channels [4, 199]. Examples of such averaged channels
include additive noise channels with stationary but non-ergodic noise, in par-
ticular, the Polya contagion channel [12] whose noise process is described in
Example 3.16.
28. Feedback capacity. Consider a (not necessarily memoryless) discrete channel
with input alphabet X , output alphabet Y and n-fold transition distributions
PY n |X n , n = 1, 2, . . .. The channel is to be used with feedback as shown in the
figure below.
More specifically, there is a noiseless feedback link from the channel output to
the transmitter with one time unit of delay. As a result, at each time instance i,
the channel input X i is a function of both the message W and all past channel
outputs Y i−1 = (Y1 , . . . , Yi−1 ). More formally, an (n, Mn ) feedback channel
code consists of a sequence of encoding functions
f_i : {1, 2, . . . , M_n} × Y^{i−1} → X ,   i = 1, 2, . . . , n,
and a decoding function
g : Y^n → {1, 2, . . . , M_n}.
liminf_{n→∞} (1/n) log_2 M_n ≥ R   and   lim_{n→∞} P_e = 0.
The feedback operational capacity C_op,FB of the channel is defined as the supremum of all achievable rates with feedback.
Comparing this definition of feedback operational capacity with the one when
no feedback exists given in Definition 4.10 and studied in Theorem 4.11, we
readily observe that, in general,
Cop,F B ≥ Cop
since non-feedback codes belong to the class of feedback codes. This inequality
is intuitively not surprising as in the presence of feedback, the transmitter can use
the previously received output symbols to better understand the channel behavior
and hence send codewords that are more robust to channel noise, potentially
increasing the rate at which information can be transferred reliably over the
channel.
(a) Show that for DMCs, feedback does not increase operational capacity: C_op,FB = C_op.
Note that for a DMC with feedback, property (4.2.1) does not hold since
current inputs depend on past outputs. However, by the memoryless nature of the channel, we assume the following causality Markov chain
condition:
(W, X i−1 , Y i−1 ) → X i → Yi (4.7.3)
for the channel, which is a simplified version of (4.7.2), see also [415,
Definition 7.4]. Condition (4.7.3) can be seen as a generalized definition of
a DMC used with or without feedback coding.
(b) Consider the q-ary channel of Problem 4.27 with stationary ergodic additive
noise. Assume that the noise process is independent of the message W .28
Show that although this channel has memory, feedback does not increase its
operational capacity:
C_op,FB = C_op = log_2 q − lim_{n→∞} (1/n) H(Z^n).
We point out that the classical Gilbert–Elliott burst noise channel [108, 145,
277] is a special instance of this channel.
28 This intrinsically natural assumption, which is equivalent to requiring that the channel input and
noise processes are independent of each other when no feedback is present, ensures that (4.7.2)
holds for this channel.
Note: Result (a) is due to Shannon [343]. Even though feedback does not help
increase capacity for a DMC, it can have several benefits such as simplifying the coding scheme and speeding up the rate at which the error probability of good
codes decays to zero (e.g., see [87, 284]). Result (b), which was shown in [9]
for arbitrary additive noise processes with memory, stems from the fact that the
channel has a symmetry property in the sense that a uniform input maximizes the
mutual information between channel input and output tuples. Similar results for
channels with memory satisfying various symmetry properties have appeared
in [11, 292, 333, 361]. However, it can be shown that for channels with memory
and asymmetric structures,
Cop,F B > Cop ,
see, for example, [415, Problem 7.12] and [16] where average input costs are
imposed.
We further point out that for information stable channels, the feedback opera-
tional capacity is given by [215, 291, 378]
C_op,FB = liminf_{n→∞} max_{P_{X^n‖Y^{n−1}}} (1/n) I(X^n → Y^n),     (4.7.4)
where
P_{X^n‖Y^{n−1}}(x^n‖y^{n−1}) := ∏_{i=1}^{n} P_{X_i|X^{i−1},Y^{i−1}}(x_i|x^{i−1}, y^{i−1})
and the directed information is given by
I(X^n → Y^n) := Σ_{i=1}^{n} I(Y_i; X_i|Y^{i−1}).
Equivalently,
C_op,FB = liminf_{n→∞} max_{f^n} (1/n) I(W; Y^n),     (4.7.5)
29 For arbitrary channels with memory, a generalized expression for C_op,FB is established in [376, 378] in terms of a generalized (spectral) directed information rate.
30 This result was actually shown in [72] for general channels with memory in terms of a generalized
FX (x) := Pr[X ≤ x]
for x ∈ R, the set of real numbers. The distribution of X is called absolutely contin-
uous (with respect to the Lebesgue measure) if a probability density function (pdf)
f_X(·) exists such that
F_X(x) = ∫_{−∞}^{x} f_X(t) dt,
where f_X(t) ≥ 0 ∀t and ∫_{−∞}^{+∞} f_X(t) dt = 1. If F_X(·) is differentiable everywhere, then the pdf f_X(·) exists and is given by the derivative of F_X(·): f_X(t) = dF_X(t)/dt.
The support of a random variable X with pdf f X (·) is denoted by S X and can be
conveniently given as
S X = {x ∈ R : f X (x) > 0}.
Recall that the definition of entropy for a discrete random variable X representing a
DMS is
H(X) := −Σ_{x∈X} P_X(x) log_2 P_X(x)   (in bits).
As already seen in Shannon’s source coding theorem, this quantity is the minimum
average code rate achievable for the lossless compression of the DMS. But if the
random variable takes on values in a continuum, the minimum number of bits per
symbol needed to losslessly describe it must be infinite. This is illustrated in the
following example, where we take a discrete approximation (quantization) of a ran-
dom variable uniformly distributed on the unit interval and study the entropy of the
quantized random variable as the quantization becomes finer and finer.
q_m(X) = i/m,   if (i − 1)/m ≤ X < i/m,
Since the entropy H(q_m(X)) of the quantized version of X is a lower bound to the entropy of X (as q_m(X) is a function of X) and satisfies in the limit lim_{m→∞} H(q_m(X)) = lim_{m→∞} log_2 m = ∞, the entropy of X must itself be infinite.
The above example indicates that to compress a continuous source without incur-
ring any loss or distortion indeed requires an infinite number of bits. Thus when
studying continuous sources, the entropy measure is limited in its effectiveness and the introduction of a new measure is necessary. Such a new measure is indeed obtained below via the notion of differential entropy.
Lemma 5.2 Consider a real-valued random variable X with support [a, b) and pdf f_X such that −f_X log_2 f_X is integrable (where −∫_a^b f_X(x) log_2 f_X(x) dx is finite). Then a uniform quantization of X with an n-bit accuracy (i.e., with a quantization step-size of Δ = 2^{−n}) yields an entropy approximately equal to −∫_a^b f_X(x) log_2 f_X(x) dx + n bits for n sufficiently large. In other words,
lim_{n→∞} [H(q_n(X)) − n] = −∫_a^b f_X(x) log_2 f_X(x) dx,
Proof
Step 1: Mean value theorem. Let Δ = 2^{−n} be the quantization step-size, and let
t_i := { a + iΔ,  i = 0, 1, . . . , j − 1;   b,  i = j },
where j = (b − a)2^n. From the mean value theorem (e.g., cf. [262]), we can choose x_i ∈ [t_{i−1}, t_i] for 1 ≤ i ≤ j such that
p_i := ∫_{t_{i−1}}^{t_i} f_X(x) dx = f_X(x_i)(t_i − t_{i−1}) = Δ · f_X(x_i).
Define the Riemann-type sum
h^{(n)}(X) := −Σ_{i=1}^{j} [f_X(x_i) log_2 f_X(x_i)] 2^{−n},
which converges to −∫_a^b f_X(x) log_2 f_X(x) dx as n → ∞ by the integrability assumption.
Therefore, given any ε > 0, there exists N such that for all n > N ,
| −∫_a^b f_X(x) log_2 f_X(x) dx − h^{(n)}(X) | < ε.
Next, observe that
H(q_n(X)) = −Σ_{i=1}^{j} p_i log_2 p_i
          = −Σ_{i=1}^{j} (Δ f_X(x_i)) log_2 (Δ f_X(x_i))
          = −Σ_{i=1}^{j} (f_X(x_i) 2^{−n}) log_2 (f_X(x_i) 2^{−n}),
H(q_n(X)) − h^{(n)}(X) = −Σ_{i=1}^{j} [f_X(x_i) 2^{−n}] log_2(2^{−n})
                       = n Σ_{i=1}^{j} ∫_{t_{i−1}}^{t_i} f_X(x) dx
                       = n ∫_a^b f_X(x) dx = n,
yielding that
lim_{n→∞} [H(q_n(X)) − n] = −∫_a^b f_X(x) log_2 f_X(x) dx.
More generally, the following result due to Rényi [316] can be shown for (absolutely
continuous) random variables with arbitrary support.
Theorem 5.3 [316, Theorem 1] For any real-valued random variable with pdf f_X, if −Σ_i p_i log_2 p_i is finite, where the (possibly countably many) p_i's are the
In light of the above results, we can define the following information measure [340]:
Definition 5.4 (Differential entropy) The differential entropy (in bits) of a continu-
ous random variable X with pdf f X and support S X is defined as
h(X) := −∫_{S_X} f_X(x) log_2 f_X(x) dx = E[−log_2 f_X(X)],
Example 5.5 A continuous random variable X with support S X = [0, 1) and pdf
f X (x) = 2x for x ∈ S X has differential entropy equal to
∫_0^1 −2x · log_2(2x) dx = [ x²(log_2 e − 2 log_2(2x)) / 2 ]_0^1
                         = 1/(2 ln 2) − log_2(2) = −0.278652 bits.
We herein illustrate Lemma 5.2 by uniformly quantizing X to an n-bit accuracy and
computing the entropy H (qn (X )) and H (qn (X )) − n for increasing values of n,
where qn (X ) is the quantized version of X .
We have that qn (X ) is given by
q_n(X) = i/2^n,   if (i − 1)/2^n ≤ X < i/2^n,
for 1 ≤ i ≤ 2^n. Hence,
Pr[ q_n(X) = i/2^n ] = (2i − 1)/2^{2n},
which yields
H(q_n(X)) = −Σ_{i=1}^{2^n} (2i − 1)/2^{2n} · log_2 [ (2i − 1)/2^{2n} ]
          = −(1/2^{2n}) Σ_{i=1}^{2^n} (2i − 1) log_2(2i − 1) + 2 log_2(2^n).
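The figures reported in Table 5.1 can be checked numerically. The following short Python sketch (an added illustration, not part of the original text) evaluates H(q_n(X)) from the closed-form probabilities Pr[q_n(X) = i/2^n] = (2i − 1)/2^{2n} and confirms that H(q_n(X)) − n approaches h(X) ≈ −0.2787 bits; the function name is of course hypothetical.

import numpy as np

def quantized_entropy(n):
    """Entropy (in bits) of q_n(X) for the pdf f_X(x) = 2x on [0, 1)."""
    i = np.arange(1, 2**n + 1)
    p = (2.0 * i - 1.0) / 4.0**n          # Pr[q_n(X) = i / 2^n]
    return -np.sum(p * np.log2(p))

for n in range(1, 10):
    H = quantized_entropy(n)
    print(f"n = {n}:  H(q_n(X)) = {H:.6f} bits,  H(q_n(X)) - n = {H - n:.6f} bits")
# The last column converges to h(X) = 1/(2 ln 2) - 1 ≈ -0.278652 bits, matching Table 5.1.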
Example 5.6 Let us determine the minimum average number of bits required to
describe the uniform quantization with 3-digit accuracy of the decay time (in years)
of a radium atom assuming that the half-life of the radium (i.e., the median of the
decay time) is 80 years and that its pdf is given by f_X(x) = λe^{−λx}, where x > 0.
Since the median of the decay time is 80, we obtain
∫_0^{80} λ e^{−λx} dx = 0.5,
Table 5.1 Quantized random variable qn (X ) under an n-bit accuracy: H (qn (X )) and H (qn (X ))−n
versus n
n H (qn (X )) H (qn (X )) − n
1 0.811278 bits −0.188722 bits
2 1.748999 bits −0.251000 bits
3 2.729560 bits −0.270440 bits
4 3.723726 bits −0.276275 bits
5 4.722023 bits −0.277977 bits
6 5.721537 bits −0.278463 bits
7 6.721399 bits −0.278600 bits
8 7.721361 bits −0.278638 bits
9 8.721351 bits −0.278648 bits
We close this section by computing the differential entropy for two common real-
valued random variables: the uniformly distributed random variable and the Gaussian
distributed random variable.
Example 5.7 (Differential entropy of a uniformly distributed random variable) Let
X be a continuous random variable that is uniformly distributed over the interval
(a, b), where b > a; i.e., its pdf is given by
f_X(x) = { 1/(b − a),  if x ∈ (a, b);   0,  otherwise.
Its differential entropy is then h(X) = −∫_a^b (1/(b − a)) log_2 (1/(b − a)) dx = log_2(b − a) bits.
Note that if (b − a) < 1 in the above example, then h(X ) is negative, unlike
entropy. The above example indicates that although differential entropy has a form
analogous to entropy (in the sense that summation and pmf for entropy are replaced
by integration and pdf, respectively, for differential entropy), differential entropy
does not retain all the properties of entropy (one such operational difference was
already highlighted in the previous lemma and theorem).3
Example 5.8 (Differential entropy of a Gaussian random variable) Let X ∼
N (μ, σ 2 ); i.e., X is a Gaussian (or normal) random variable with finite mean μ,
variance Var(X ) = σ 2 > 0 and pdf
f_X(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)},   x ∈ R.
Its differential entropy is given by
h(X) = ∫_R f_X(x) [ (1/2) log_2(2πσ²) + ((x − μ)²/(2σ²)) log_2 e ] dx
     = (1/2) log_2(2πσ²) + (log_2 e/(2σ²)) E[(X − μ)²]
     = (1/2) log_2(2πσ²) + (1/2) log_2 e
     = (1/2) log_2(2πeσ²) bits.     (5.1.1)
Note that for a Gaussian random variable, its differential entropy is only a function of its variance σ² (it is independent from its mean μ). This is similar to the differential entropy of the uniformly distributed random variable of Example 5.7, which depends only on the length of its support interval.
3 By contrast, entropy and differential entropy are sometimes called discrete entropy and continuous
entropy, respectively.
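As a sanity check of (5.1.1), one can estimate E[−log_2 f_X(X)] by Monte Carlo sampling and compare it with the closed form (1/2) log_2(2πeσ²). The short Python sketch below is an added illustration (not from the original text); the particular values of μ and σ are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0                      # arbitrary mean and standard deviation
x = rng.normal(mu, sigma, size=1_000_000)

# Monte Carlo estimate of h(X) = E[-log2 f_X(X)]
log_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2) * np.log2(np.e)
h_mc = -np.mean(log_pdf)

h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # formula (5.1.1)
print(h_mc, h_closed)   # the two values agree to within Monte Carlo error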
f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x),
is well defined for all (x, y) ∈ S X,Y , where f X is the marginal pdf of X . Then, the
conditional differential entropy of Y given X is defined as
h(Y|X) := −∫_{S_{X,Y}} f_{X,Y}(x, y) log_2 f_{Y|X}(y|x) dx dy = E[−log_2 f_{Y|X}(Y|X)],
D(X‖Y) := ∫_{S_X} f_X(x) log_2 [ f_X(x)/f_Y(x) ] dx = E[ log_2 ( f_X(X)/f_Y(X) ) ]
when the integral exists. The definition carries over similarly in the multivariate
case: for X n = (X 1 , X 2 , . . . , X n ) and Y n = (Y1 , Y2 , . . . , Yn ) two random vectors
with joint pdfs f X n and f Y n , respectively, and supports satisfying S X n ⊆ SY n ⊆ Rn ,
the divergence between X n and Y n is defined as
D(X^n‖Y^n) := ∫_{S_{X^n}} f_{X^n}(x_1, x_2, . . . , x_n) log_2 [ f_{X^n}(x_1, x_2, . . . , x_n) / f_{Y^n}(x_1, x_2, . . . , x_n) ] dx_1 dx_2 · · · dx_n,
assuming the integral exists. Similarly, the mutual information between two jointly distributed continuous random variables X and Y with joint pdf f_{X,Y} is given by
I(X; Y) := ∫_{S_{X,Y}} f_{X,Y}(x, y) log_2 [ f_{X,Y}(x, y) / ( f_X(x) f_Y(y) ) ] dx dy,
assuming the integral exists, where f_X and f_Y are the marginal pdfs of X and Y, respectively.
Observation 5.13 For two jointly distributed continuous random variables X and
Y with joint pdf f_{X,Y}, support S_{X,Y} ⊆ R² and joint differential entropy
h(X, Y) = −∫_{S_{X,Y}} f_{X,Y}(x, y) log_2 f_{X,Y}(x, y) dx dy,
then as in Lemma 5.2 and the ensuing discussion, one can write
H (qn (X ), qm (Y )) ≈ h(X, Y ) + n + m
for n and m sufficiently large, where qk (Z ) denotes the (uniformly) quantized version
of random variable Z with a k-bit accuracy.
On the other hand, for the above continuous X and Y, I(q_n(X); q_m(Y)) ≈ I(X; Y) for n and m sufficiently large.
Thus, mutual information and divergence can be considered as the true tools of
information theory, as they retain the same operational characteristics and properties
for both discrete and continuous probability spaces (as well as general spaces where they can be defined in terms of Radon–Nikodym derivatives; e.g., cf. [196]).5
The following lemma illustrates that for continuous systems, I(·; ·) and D(·‖·)
keep the same properties already encountered for discrete systems, while differential
entropy (as already seen with its possibility of being negative) satisfies some different
properties from entropy. The proof is left as an exercise.
Lemma 5.14 The following properties hold for the information measures of contin-
uous systems.
1. Nonnegativity of divergence: Let X and Y be two continuous random variables
with marginal pdfs f X and f Y , respectively, such that their supports satisfy
S X ⊆ SY ⊆ R. Then
D(f_X‖f_Y) ≥ 0,
with equality iff f X (x) = f Y (x) for all x ∈ S X except in a set of f X -measure
zero (i.e., X = Y almost surely).
2. Nonnegativity of mutual information: For any two continuous jointly dis-
tributed random variables X and Y ,
I (X ; Y ) ≥ 0,
h(X_1, X_2, . . . , X_n) = Σ_{i=1}^{n} h(X_i|X_1, X_2, . . . , X_{i−1}),
5 This justifies using identical notations for both I(·; ·) and D(·‖·) as opposed to the discerning notations of H(·) for entropy and h(·) for differential entropy.
I(X_1, X_2, . . . , X_n; Y) = Σ_{i=1}^{n} I(X_i; Y|X_{i−1}, . . . , X_1),
I(X; Y) ≥ I(X; Z).
7. Independence bound for differential entropy:
h(X_1, X_2, . . . , X_n) ≤ Σ_{i=1}^{n} h(X_i),
with equality iff all the X_i's are independent from each other.
8. Invariance of differential entropy under translation: For continuous random variables X and Y with joint pdf f_{X,Y} and well-defined conditional pdf f_{X|Y},
h(X + c) = h(X) for any constant c ∈ R,
and
h(X + Y|Y) = h(X|Y).
The results also generalize in the multivariate case: for two continuous random
vectors X n = (X 1 , X 2 , . . . , X n ) and Y n = (Y1 , Y2 , . . . , Yn ) with joint pdf f X n ,Y n
and well-defined conditional pdf f X n |Y n ,
h(X^n + c^n) = h(X^n) for any constant vector c^n ∈ R^n, and
h(X^n + Y^n|Y^n) = h(X^n|Y^n),
10. Joint differential entropy under linear mapping: Consider the random (col-
umn) vector X = (X 1 , X 2 , . . . , X n )T with joint pdf f X n , where T denotes trans-
position, and let Y = (Y1 , Y2 , . . . , Yn )T be a random (column) vector obtained
from the linear transformation Y = AX , where A is an invertible (non-singular)
n × n real-valued matrix. Then
h(Y) = h(Y_1, Y_2, . . . , Y_n)
     = h(X_1, . . . , X_n) + ∫_{R^n} f_{X^n}(x_1, . . . , x_n) log_2 |det(A)| dx_1 · · · dx_n
     = h(X_1, . . . , X_n) + log_2 |det(A)|,
Observation 5.15 Property 9 of the above Lemma indicates that for a continuous random variable X, h(X) ≠ h(aX) in general (except when |a| = 1) and hence differential entropy is not in general invariant under invertible maps. This is in
contrast to entropy, which is always invariant under invertible maps: given a discrete
random variable X with alphabet X ,
H(f(X)) = H(X),   I(X; Y) = I(g(X); h(Y)),   and   D(X‖Y) = D(g(X)‖g(Y))
for all invertible maps f, g, and h properly defined on the alphabets/supports of the concerned random variables. This reinforces the notion that mutual information and divergence constitute the true tools of information theory.
Definition 5.16 (Multivariate Gaussian) A continuous random vector X = (X 1 ,
X 2 , . . . , X n )T is called a size-n (multivariate) Gaussian random vector with a finite
mean vector μ := (μ1 , μ2 , . . . , μn )T , where μi := E[X i ] < ∞ for i = 1, 2, . . . , n,
and an n × n invertible (real-valued) covariance matrix
K_X = [K_{i,j}] := E[(X − μ)(X − μ)^T]
    = ⎡ Cov(X_1, X_1)   Cov(X_1, X_2)   · · ·   Cov(X_1, X_n) ⎤
      ⎢ Cov(X_2, X_1)   Cov(X_2, X_2)   · · ·   Cov(X_2, X_n) ⎥
      ⎢      ...             ...        · · ·        ...      ⎥
      ⎣ Cov(X_n, X_1)   Cov(X_n, X_2)   · · ·   Cov(X_n, X_n) ⎦ ,
6 Note that the diagonal components of K_X yield the variances of the different random variables: K_{i,i} = Cov(X_i, X_i) = Var(X_i) = σ²_{X_i}, i = 1, . . . , n.
7 An n × n real-valued symmetric matrix K is positive-semidefinite (e.g., cf. [128]) if for every
real-valued vector x = (x1 , x2 , . . . , xn )T ,
x^T K x = (x_1, . . . , x_n) K (x_1, . . . , x_n)^T ≥ 0.
Furthermore, the matrix is positive-definite if x^T K x > 0 for all real-valued vectors x ≠ 0, where 0 is the all-zero vector of size n (i.e., with equality x^T K x = 0 holding only when x_i = 0 for i = 1, 2, . . . , n).
is a size-m Gaussian random vector with mean vector A_{mn} μ and covariance matrix A_{mn} K_X A_{mn}^T.
More generally, any affine transformation of a Gaussian random vector yields
another Gaussian random vector: if X ∼ Nn (μ, K X ) and Y = Amn X + bm ,
where Amn is a m × n real-valued matrix and bm is a size-m real-valued vector,
then
Y ∼ N_m(A_{mn} μ + b_m, A_{mn} K_X A_{mn}^T).
Theorem 5.18 If X = (X_1, X_2, . . . , X_n)^T ∼ N_n(μ, K_X) is a Gaussian random vector, then its joint differential entropy is given by
h(X) = h(X_1, X_2, . . . , X_n) = (1/2) log_2 [ (2πe)^n det(K_X) ].     (5.2.1)
In particular, in the univariate case of n = 1, (5.2.1) reduces to (5.1.1).
Proof Without loss of generality, we assume that X has a zero-mean vector, since its differential entropy is invariant under translation by Property 8 of Lemma 5.14; so we take μ = 0.
Since the covariance matrix K_X is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square (n × n) orthogonal matrix A (i.e., satisfying A^T = A^{−1}) such that AK_X A^T is a diagonal matrix whose entries are given by the eigenvalues of K_X (A is constructed using the eigenvectors of K_X; e.g., see [128]). As a result, the linear transformation Y = AX ∼ N_n(0, AK_X A^T) has independent Gaussian components, and
h(Y) = h(Y_1, Y_2, . . . , Y_n)
     = h(Y_1) + h(Y_2) + · · · + h(Y_n)     (5.2.2)
     = Σ_{i=1}^{n} (1/2) log_2 [2πe Var(Y_i)]     (5.2.3)
     = (n/2) log_2(2πe) + (1/2) log_2 [ ∏_{i=1}^{n} Var(Y_i) ]
     = (n/2) log_2(2πe) + (1/2) log_2 det(K_Y)     (5.2.4)
     = (n/2) log_2(2πe) + (1/2) log_2 det(K_X)     (5.2.5)
     = (1/2) log_2 [ (2πe)^n det(K_X) ],     (5.2.6)
where (5.2.2) follows by the independence of the random variables Y1 , . . . , Yn (e.g.,
see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds since
the matrix KY is diagonal and hence its determinant is given by the product of its
diagonal entries, and (5.2.5) holds since
det(K_Y) = det(AK_X A^T) = det(A) det(K_X) det(A^T) = det(A)² det(K_X) = det(K_X),
where the last equality holds since (det(A))² = 1, as the matrix A is orthogonal (A^T = A^{−1} ⟹ det(A) = det(A^T) = 1/det(A); thus, det(A)² = 1).
Now invoking Property 10 of Lemma 5.14 and noting that |det(A)| = 1 yield that
h(X_1, X_2, . . . , X_n) = (1/2) log_2 [ (2πe)^n det(K_X) ],
hence completing the proof.
An alternate (but rather mechanical) proof to the one presented above consists
of directly evaluating the joint differential entropy of X by integrating − f X n (x n )
log2 f X n (x n ) over Rn ; it is left as an exercise.
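The identity (5.2.1) and the diagonalization argument above can also be checked numerically: diagonalize a covariance matrix K_X, sum the scalar entropies (1/2) log_2(2πe λ_i) of the decorrelated components, and compare with (1/2) log_2[(2πe)^n det(K_X)]. The Python sketch below is an added illustration (the 2 × 2 matrix is an arbitrary example, not from the original text).

import numpy as np

K = np.array([[2.0, 0.8],
              [0.8, 1.0]])                 # an arbitrary positive-definite covariance matrix
n = K.shape[0]

eigvals = np.linalg.eigvalsh(K)            # eigenvalues of K (variances of the decorrelated components)
h_sum = np.sum(0.5 * np.log2(2 * np.pi * np.e * eigvals))        # sum of scalar Gaussian entropies
h_det = 0.5 * np.log2((2 * np.pi * np.e)**n * np.linalg.det(K))  # formula (5.2.1)

print(h_sum, h_det)                        # identical up to floating-point rounding, since det(K) is the product of its eigenvalues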
Hadamard's inequality: For any real-valued n × n positive-definite matrix K,
det(K) ≤ ∏_{i=1}^{n} K_{i,i},
with equality iff K is a diagonal matrix, where K_{i,i} are the diagonal entries of K.
Proof Since every positive-definite matrix is a covariance matrix (e.g., see [162]),
let X = (X_1, X_2, . . . , X_n)^T ∼ N_n(0, K) be a jointly Gaussian random vector with zero-mean vector and covariance matrix K. Then
(1/2) log_2 [ (2πe)^n det(K) ] = h(X_1, X_2, . . . , X_n)     (5.2.7)
     ≤ Σ_{i=1}^{n} h(X_i)     (5.2.8)
     = Σ_{i=1}^{n} (1/2) log_2 [2πe Var(X_i)]     (5.2.9)
     = (1/2) log_2 [ (2πe)^n ∏_{i=1}^{n} K_{i,i} ],     (5.2.10)
where (5.2.7) follows from Theorem 5.18, (5.2.8) follows from Property 7 of
Lemma 5.14 and (5.2.9)–(5.2.10) hold using (5.1.1) along with the fact that each
random variable X i ∼ N (0, K i,i ) is Gaussian with zero mean and variance
Var(X i ) = K i,i for i = 1, 2, . . . , n (as the marginals of a multivariate Gaussian
are also Gaussian e.g., cf. [162]).
Finally, from (5.2.10), we directly obtain that
det(K) ≤ ∏_{i=1}^{n} K_{i,i},
with equality iff the jointly Gaussian random variables X 1 , X 2 , . . ., X n are inde-
pendent from each other, or equivalently iff the covariance matrix K is diagonal.
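A quick numerical illustration of Hadamard's inequality, added here as a sketch (the random matrix below is an arbitrary example, not from the original text): generate a positive-definite matrix and compare its determinant with the product of its diagonal entries.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
K = A @ A.T + 0.1 * np.eye(4)              # positive-definite covariance-like matrix

print(np.linalg.det(K), np.prod(np.diag(K)))   # det(K) <= product of diagonal entries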
The next theorem states that among all real-valued size-n random vectors (of support
Rn ) with identical mean vector and covariance matrix, the Gaussian random vector
has the largest differential entropy.
Theorem 5.20 (Maximal differential entropy for real-valued random vectors) Let
X = (X 1 , X 2 , . . . , X n )T be a real-valued random vector with a joint pdf of support
S X n = Rn , mean vector μ, covariance matrix K X and finite joint differential entropy
h(X 1 , X 2 , . . . , X n ). Then
h(X_1, X_2, . . . , X_n) ≤ (1/2) log_2 [ (2πe)^n det(K_X) ],     (5.2.11)
with equality iff X is Gaussian; i.e., X ∼ N_n(μ, K_X).
Proof We will present the proof in two parts: the scalar or univariate case, and the
multivariate case.
(i) Scalar case (n = 1): For a real-valued random variable with support S X = R,
mean μ and variance σ 2 , let us show that
h(X) ≤ (1/2) log_2 (2πeσ²),     (5.2.12)
0 ≤ D(X‖Y)
  = ∫_R f_X(x) log_2 [ f_X(x) / ( (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} ) ] dx
  = −h(X) + ∫_R f_X(x) [ log_2 √(2πσ²) + ((x − μ)²/(2σ²)) log_2 e ] dx
Z = AX
f_X(x) = (1/(2λ)) e^{−|x−μ|/λ}   for x ∈ R
maximizes differential entropy.
and
lim_{n→∞} (1/n) D(X^n ‖ X̂^n) = (1/(4π)) ∫_{−π}^{π} [ φ_X(λ)/φ_{X̂}(λ) − 1 − ln( φ_X(λ)/φ_{X̂}(λ) ) ] dλ,     (5.2.19)
respectively. Here, φ X (·) and φ X̂ (·) denote the power spectral densities of the
zero-mean stationary Gaussian processes {X i } and { X̂ i }, respectively. Recall that
for a stationary zero-mean process {Z i }, its power spectral density φ Z (·) is the
(discrete-time) Fourier transform of its covariance function K_Z(τ) := E[Z_{n+τ} Z_n] − E[Z_{n+τ}]E[Z_n] = E[Z_{n+τ} Z_n], n, τ = 1, 2, . . .; more precisely,
φ_Z(λ) = Σ_{τ=−∞}^{∞} K_Z(τ) e^{−jτλ},   −π ≤ λ ≤ π,
where j = √−1 is the imaginary unit number. Note that (5.2.18) and (5.2.19) hold
under mild integrability and boundedness conditions; see [196, Sect. 2.4] for the
details.
The AEP theorem and its consequence for discrete memoryless (i.i.d.) sources reveal
to us that the number of elements in the typical set is approximately 2n H (X ) , where
H (X ) is the source entropy, and that the typical set carries almost all the probability
mass asymptotically (see Theorems 3.4 and 3.5). An extension of this result from
discrete to continuous memoryless sources by just counting the number of elements
in a continuous (typical) set defined via a law of large numbers argument is not
possible, since the total number of elements in a continuous set is infinite. However,
when considering the volume of that continuous typical set (which is a natural analog
to the size of a discrete set), such an extension, with differential entropy playing a
similar role as entropy, becomes straightforward.
Theorem 5.23 (AEP for continuous memoryless sources) Let {X_i}_{i=1}^∞ be a continuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random variables) with pdf f_X(·) and differential entropy h(X). Then
−(1/n) log_2 f_X(X_1, . . . , X_n) → E[−log_2 f_X(X)] = h(X)   in probability.
Proof The proof is an immediate result of the law of large numbers (e.g., see Theo-
rem 3.4).
Definition 5.24 (Typical set) For δ > 0 and any n given, define the typical set for
the above continuous source as
F_n(δ) := { x^n ∈ R^n : | −(1/n) log_2 f_X(x_1, . . . , x_n) − h(X) | < δ }.
Theorem 5.26 (Consequence of the AEP for continuous memoryless sources) For a continuous memoryless source {X_i}_{i=1}^∞ with differential entropy h(X), the following hold.
hold.
1. For n sufficiently large, PX n {Fn (δ)} > 1 − δ.
2. Vol(F_n(δ)) ≤ 2^{n(h(X)+δ)} for all n.
3. Vol(F_n(δ)) ≥ (1 − δ) 2^{n(h(X)−δ)} for n sufficiently large.
Proof The proof is quite analogous to the corresponding theorem for discrete mem-
oryless sources (Theorem 3.5) and is left as an exercise.
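Theorem 5.23 can be visualized by simulation: draw blocks X^n from a continuous memoryless source and watch −(1/n) log_2 f_X(X_1, . . . , X_n) concentrate around h(X). The Python sketch below is an added illustration (not from the original text), using a zero-mean unit-variance Gaussian source as an arbitrary choice.

import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of N(0, 1)

for n in (10, 100, 1000, 10000):
    x = rng.normal(0.0, sigma, size=n)
    # -(1/n) log2 f_X(x_1, ..., x_n) for the i.i.d. Gaussian block
    norm_log = 0.5 * np.log2(2 * np.pi * sigma**2) + np.mean(x**2) / (2 * sigma**2) * np.log2(np.e)
    print(n, norm_log, h_X)   # the sample value approaches h(X) ≈ 2.047 bits as n grows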
We next study the fundamental limits for error-free communication over the discrete-
time memoryless Gaussian channel, which is the most important continuous-alphabet
channel and is widely used to model real-world wired and wireless channels. We first
state the definition of discrete-time continuous-alphabet memoryless channels.
f_{Y^n|X^n}(y^n|x^n) = ∏_{i=1}^{n} f_{Y|X}(y_i|x_i)     (5.4.1)
where t (·) is a given nonnegative real-valued function describing the cost for trans-
mitting an input symbol, and P is a given positive number representing the maximal
average amount of available resources per input symbol.
Definition 5.28 The capacity (or capacity-cost function) of a discrete-time continuous memoryless channel with input average cost constraint (t, P) is denoted by C(P) and defined as
C(P) := sup_{F_X : E[t(X)] ≤ P} I(X; Y).     (5.4.3)
and
E[t (X i )] ≤ Pi , (5.4.5)
where X i denotes the input with distribution FX i and Yi is the corresponding channel
output for i = 1, 2. Now, for 0 ≤ λ ≤ 1, let X λ be a random variable with distribution
FX λ := λFX 1 + (1 − λ)FX 2 . Then by (5.4.5)
Furthermore,
where the first inequality holds by (5.4.6), the second inequality follows from the
concavity of the mutual information with respect to its first argument (cf. Lemma
2.46), and the third inequality follows from (5.4.4). Letting
→ 0 yields that
The most commonly used cost function is the power cost function, t (x) = x 2 ,
resulting in the average power constraint P for each transmitted input n-tuple:
(1/n) Σ_{i=1}^{n} x_i² ≤ P.     (5.4.7)
Throughout this chapter, we will adopt this average power constraint on the channel
input.
We herein focus on the discrete-time memoryless Gaussian channel8 with aver-
age input power constraint P and establish an operational meaning for the channel
capacity C(P) as the largest coding rate for achieving reliable communication over
the channel. The channel is described by the following additive noise equation:
Yi = X i + Z i , for i = 1, 2, . . . , (5.4.8)
where Yi , X i , and Z i are the channel output, input and noise at time i. The input and
noise processes are assumed to be independent from each other and the noise source {Z_i}_{i=1}^∞ is i.i.d. Gaussian with each Z_i having mean zero and variance σ², i.e., Z_i ∼ N(0, σ²). Since the noise process is i.i.d., we directly get that the channel satisfies (5.4.1) and is hence memoryless, where the channel transition pdf is explicitly given in terms of the noise pdf as follows:
8 This channel is also commonly referred to as the discrete-time additive white Gaussian noise (AWGN) channel.
f_{Y|X}(y|x) = f_Z(y − x) = (1/√(2πσ²)) e^{−(y−x)²/(2σ²)}.
As mentioned above, we impose the average power constraint (5.4.7) on the channel
input.
Observation 5.30 The memoryless Gaussian channel is a good approximating
model for many practical channels such as radio, satellite, and telephone line chan-
nels. The additive noise is usually due to a multitude of causes, whose cumulative
effect can be approximated via the Gaussian distribution. This is justified by the central limit theorem, which states that for an i.i.d. process {U_i}_{i=1}^∞ with mean μ and variance σ², (1/√n) Σ_{i=1}^{n} (U_i − μ) converges in distribution as n → ∞ to a Gaussian distributed random variable with mean zero and variance σ² (see Appendix B).9
Before proving the channel coding theorem for the above memoryless Gaussian
channel with input power constraint P, we first show that its capacity C(P) as defined
in (5.4.3) with t (x) = x 2 admits a simple expression in terms of P and the channel
noise variance σ 2 . Indeed, we can write the channel mutual information I (X ; Y )
between its input and output as follows:
I(X; Y) = h(Y) − h(Y|X)
        = h(Y) − h(X + Z|X)     (5.4.9)
        = h(Y) − h(Z|X)     (5.4.10)
        = h(Y) − h(Z)     (5.4.11)
        = h(Y) − (1/2) log_2(2πeσ²),     (5.4.12)
where (5.4.9) follows from (5.4.8), (5.4.10) holds since differential entropy is invari-
ant under translation (see Property 8 of Lemma 5.14), (5.4.11) follows from the
independence of X and Z , and (5.4.12) holds since Z ∼ N (0, σ 2 ) is Gaussian (see
(5.1.1)). Now since Y = X + Z, we have that
E[Y²] = E[X²] + 2E[X]E[Z] + E[Z²] = E[X²] + σ² ≤ P + σ²,
since the input in (5.4.3) is constrained to satisfy E[X²] ≤ P. Thus, the variance of Y satisfies Var(Y) ≤ E[Y²] ≤ P + σ², and
h(Y) ≤ (1/2) log_2(2πe Var(Y)) ≤ (1/2) log_2 [2πe(P + σ²)],
where the first inequality follows by Theorem 5.20 since Y is real-valued (with
support R). Noting that equality holds in the first inequality above iff Y is Gaussian
and in the second inequality iff Var(Y ) = P + σ 2 , we obtain that choosing the input
X as X ∼ N (0, P) yields Y ∼ N (0, P + σ 2 ) and hence maximizes I (X ; Y ) over
9 The reader is referred to [209] for an information theoretic treatment of the central limit theorem.
all inputs satisfying E[X 2 ] ≤ P. Thus, the capacity of the discrete-time memoryless
Gaussian channel with input average power constraint P and noise variance (or
power) σ 2 is given by
C(P) = (1/2) log_2 [2πe(P + σ²)] − (1/2) log_2(2πeσ²)
     = (1/2) log_2 ( 1 + P/σ² ).     (5.4.13)
Note that P/σ² is called the channel's signal-to-noise ratio (SNR) and is usually measured in decibels (dB).10
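For concreteness, (5.4.13) is easily evaluated as a function of the SNR. The short Python sketch below is an added illustration (not from the original text); the function name and the sample SNR values in dB are arbitrary.

import numpy as np

def awgn_capacity(snr_db):
    """Capacity (bits/channel use) of the discrete-time memoryless Gaussian channel, Eq. (5.4.13)."""
    snr = 10.0**(snr_db / 10.0)            # convert dB to the linear ratio P / sigma^2
    return 0.5 * np.log2(1.0 + snr)

for snr_db in (-10, 0, 10, 20, 30):
    print(f"SNR = {snr_db:>3} dB  ->  C = {awgn_capacity(snr_db):.4f} bits/channel use")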
where
λ_w(∼C_n) := Pr[Ŵ ≠ W | W = w]
           = Pr[g(Y^n) ≠ w | X^n = f(w)]
           = ∫_{y^n ∈ R^n : g(y^n) ≠ w} f_{Y^n|X^n}(y^n | f(w)) dy^n
is the code’s conditional probability of decoding error given that message w is sent
over the channel. Here
f_{Y^n|X^n}(y^n|x^n) = ∏_{i=1}^{n} f_{Y|X}(y_i|x_i)
(1/n) log_2 M_n > C(P) − γ
and with each codeword c^n = (c_1, c_2, . . . , c_n) satisfying the power constraint
(1/n) Σ_{i=1}^{n} c_i² ≤ P,     (5.4.14)
such that the probability of error Pe (∼Cn ) < ε for sufficiently large n.
• Converse part: If for any sequence of data transmission block codes {∼C_n = (n, M_n)}_{n=1}^∞ whose codewords satisfy (5.4.14), we have that
liminf_{n→∞} (1/n) log_2 M_n > C(P),
then the codes’ probability of error Pe (∼Cn ) is bounded away from zero for all n
sufficiently large.
Proof of the forward part: The theorem holds trivially when C(P) = 0 because we
can choose Mn = 1 for every n and have Pe (∼Cn ) = 0. Hence, we assume without
loss of generality C(P) > 0.
Step 0:
Take a positive γ satisfying γ < min{2ε, C(P)}. Pick ξ > 0 small enough such that
2[C(P) − C(P − ξ)] < γ, where the existence of such ξ is assured by the strictly
increasing property of C(P). Hence, we have C(P − ξ) − γ/2 > C(P) − γ > 0.
Choose M_n to satisfy
C(P − ξ) − γ/2 > (1/n) log_2 M_n > C(P) − γ,
for which the choice should exist for all sufficiently large n. Take δ = γ/8. Let
FX be the distribution that achieves C(P − ξ), where C(P) is given by (5.4.13).
In this case, FX is the Gaussian distribution with mean zero and variance P − ξ
and admits a pdf f X . Hence, E[X 2 ] ≤ P − ξ and I (X ; Y ) = C(P − ξ).
Step 1: Random coding with average power constraint.
c_m = (c_{m1}, . . . , c_{mn})
satisfies
lim_{n→∞} (1/n) Σ_{i=1}^{n} c_{mi}² = E[X²] ≤ P − ξ
for m = 1, 2, . . . , M_n.
For Mn selected codewords {c1 , . . . , c Mn }, replace the codewords that violate the
power constraint (i.e., (5.4.14)) by an all-zero (default) codeword 0. Define the
encoder as
f n (m) = cm for 1 ≤ m ≤ Mn .
By taking expectation with respect to the mth codeword-selecting distribution f X n (cm ), we obtain
E[λ_m] = ∫_{c_m ∈ X^n} f_{X^n}(c_m) λ_m(∼C_n) dc_m
       = ∫_{c_m ∈ X^n ∩ E_0} f_{X^n}(c_m) λ_m(∼C_n) dc_m + ∫_{c_m ∈ X^n ∩ E_0^c} f_{X^n}(c_m) λ_m(∼C_n) dc_m
       ≤ ∫_{c_m ∈ E_0} f_{X^n}(c_m) dc_m + ∫_{c_m ∈ X^n} f_{X^n}(c_m) λ_m(∼C_n) dc_m
       ≤ P_{X^n}(E_0) + ∫_{c_m ∈ X^n} f_{X^n}(c_m) [ ∫_{y^n ∉ F_n(δ|c_m)} f_{Y^n|X^n}(y^n|c_m) dy^n ] dc_m
         + Σ_{m′=1, m′≠m}^{M_n} ∫_{c_m ∈ X^n} f_{X^n}(c_m) [ ∫_{y^n ∈ F_n(δ|c_{m′})} f_{Y^n|X^n}(y^n|c_m) dy^n ] dc_m.
E[λ_m] ≤ P_{X^n}(E_0) + P_{X^n,Y^n}(F_n^c(δ))
         + Σ_{m′=1, m′≠m}^{M_n} ∫_{c_m ∈ X^n} ∫_{y^n ∈ F_n(δ|c_{m′})} f_{X^n,Y^n}(c_m, y^n) dy^n dc_m,     (5.4.15)
where
F_n(δ|x^n) := { y^n ∈ Y^n : (x^n, y^n) ∈ F_n(δ) }.
Note that the additional term PX n (E0 ) in (5.4.15) is to cope with the errors due to
all-zero codeword replacement, which will be less than δ for all sufficiently large
n by the law of large numbers. Finally, by carrying out a similar procedure as in
the proof of the channel coding theorem for discrete channels (cf. page 123), we
obtain
E[P_e(∼C_n)] ≤ P_{X^n}(E_0) + P_{X^n,Y^n}(F_n^c(δ)) + M_n · 2^{n(h(X,Y)+δ)} 2^{−n(h(X)−δ)} 2^{−n(h(Y)−δ)}
             ≤ P_{X^n}(E_0) + P_{X^n,Y^n}(F_n^c(δ)) + 2^{n(C(P−ξ)−4δ)} · 2^{−n(I(X;Y)−3δ)}
             = P_{X^n}(E_0) + P_{X^n,Y^n}(F_n^c(δ)) + 2^{−nδ}.
Accordingly, we can make the average probability of error, E[Pe (C n )], less than
3δ = 3γ/8 < 3ε/4 < ε for all sufficiently large n.
Proof of the converse part: Consider an (n, Mn ) block data transmission code
satisfying the power constraint (5.4.14) with encoding function
f n : {1, 2, . . . , Mn } → X n
I(X^n; Y^n) ≤ sup_{F_{X^n} : (1/n) Σ_{i=1}^{n} E[X_i²] ≤ P} I(X^n; Y^n)
  ≤ sup_{F_{X^n} : (1/n) Σ_{i=1}^{n} E[X_i²] ≤ P} Σ_{j=1}^{n} I(X_j; Y_j)     (by Theorem 2.21)
  = sup_{(P_1, P_2, . . . , P_n) : (1/n) Σ_{i=1}^{n} P_i = P}  sup_{F_{X^n} : (∀ i) E[X_i²] ≤ P_i} Σ_{j=1}^{n} I(X_j; Y_j)
  ≤ sup_{(P_1, P_2, . . . , P_n) : (1/n) Σ_{i=1}^{n} P_i = P} Σ_{j=1}^{n} sup_{F_{X^n} : (∀ i) E[X_i²] ≤ P_i} I(X_j; Y_j)
  = sup_{(P_1, P_2, . . . , P_n) : (1/n) Σ_{i=1}^{n} P_i = P} Σ_{j=1}^{n} sup_{F_{X_j} : E[X_j²] ≤ P_j} I(X_j; Y_j)
  = sup_{(P_1, P_2, . . . , P_n) : (1/n) Σ_{i=1}^{n} P_i = P} Σ_{j=1}^{n} C(P_j)
  = sup_{(P_1, P_2, . . . , P_n) : (1/n) Σ_{i=1}^{n} P_i = P} (1/n) Σ_{j=1}^{n} n C(P_j)
  ≤ sup_{(P_1, P_2, . . . , P_n) : (1/n) Σ_{i=1}^{n} P_i = P} n C( (1/n) Σ_{j=1}^{n} P_j )     (by concavity of C(P))
  = n C(P).
log_2 M_n = H(W)
          = H(W|Y^n) + I(W; Y^n)
          ≤ H(W|Y^n) + I(X^n; Y^n)
          ≤ h_b(P_e(∼C_n)) + P_e(∼C_n) · log_2(|W| − 1) + nC(P)     (by Fano's inequality)
          ≤ 1 + P_e(∼C_n) · log_2(M_n − 1) + nC(P)     (by the fact that (∀ t ∈ [0, 1]) h_b(t) ≤ 1)
          < 1 + P_e(∼C_n) · log_2 M_n + nC(P),
which implies that
P_e(∼C_n) > 1 − C(P)/[(1/n) log_2 M_n] − 1/log_2 M_n.
So if lim inf n→∞ (1/n) log2 Mn > C(P), then there exist δ > 0 and an integer N
such that for n ≥ N ,
(1/n) log_2 M_n > C(P) + δ.
Hence, for all sufficiently large n,
P_e(∼C_n) ≥ 1 − C(P)/(C(P) + δ) − 1/[n(C(P) + δ)] ≥ δ/[2(C(P) + δ)].
Proof Let f Y |X and f Yg |X g denote the transition pdfs of the additive noise channel and
the Gaussian channel, respectively, where both channels satisfy input average power
constraint P. Let Z and Z g respectively denote their zero-mean noise variables of
identical variance σ 2 .
Writing the mutual information in terms of the channel’s transition pdf and input
distribution as in Lemma 2.46, then for any Gaussian input with pdf f X g with corre-
sponding outputs Y and Yg when applied to channels f Y |X and f Yg |X g , respectively,
we have that
I(f_{X_g}, f_{Y|X}) − I(f_{X_g}, f_{Y_g|X_g})
 = ∫_X ∫_Y f_{X_g}(x) f_Z(y − x) log_2 [ f_Z(y − x)/f_Y(y) ] dy dx
   − ∫_X ∫_Y f_{X_g}(x) f_{Z_g}(y − x) log_2 [ f_{Z_g}(y − x)/f_{Y_g}(y) ] dy dx
 = ∫_X ∫_Y f_{X_g}(x) f_Z(y − x) log_2 [ f_Z(y − x)/f_Y(y) ] dy dx
   − ∫_X ∫_Y f_{X_g}(x) f_Z(y − x) log_2 [ f_{Z_g}(y − x)/f_{Y_g}(y) ] dy dx
 = ∫_X ∫_Y f_{X_g}(x) f_Z(y − x) log_2 [ f_Z(y − x) f_{Y_g}(y) / ( f_{Z_g}(y − x) f_Y(y) ) ] dy dx
 ≥ ∫_X ∫_Y f_{X_g}(x) f_Z(y − x) (log_2 e) [ 1 − f_{Z_g}(y − x) f_Y(y) / ( f_Z(y − x) f_{Y_g}(y) ) ] dy dx
 = (log_2 e) [ 1 − ∫_Y ( f_Y(y)/f_{Y_g}(y) ) ( ∫_X f_{X_g}(x) f_{Z_g}(y − x) dx ) dy ]
 = 0,
f_Y(y)/f_{Y_g}(y) = f_Z(y − x)/f_{Z_g}(y − x)
C(P) := sup_{F_X : E[t(X)] ≤ P} I(X; Y)
is the largest rate for which there exist block codes for the channel satisfying (5.4.2)
which are reliably good (i.e., with asymptotically vanishing error probability).
The proof is quite similar to that of Theorem 5.32, except that some modifications
are needed in the forward part as for a general (non-Gaussian) channel, the input
distribution FX used to construct the random code may not admit a pdf (e.g., cf.
[135, Chap. 7], [415, Theorem 11.14]).
Observation 5.35 (Capacity of memoryless fading channels) We briefly examine
the capacity of the memoryless fading channel, which is widely used to model wire-
less communications channels [151, 307, 387]. The channel is described by the
following multiplicative and additive noise equation:
Yi = Ai X i + Z i , for i = 1, 2, . . . , (5.4.16)
where Yi , X i , Z i , and Ai are the channel output, input, additive noise, and amplitude
fading coefficient (or gain) at time i. It is assumed that the fading process {Ai } and the
noise process {Z i } are each i.i.d. and that they are independent of each other and of the
input process. As in the case of the memoryless Gaussian (AWGN) channel, the input
power constraint is given by P and the noise {Z i } is Gaussian with Z i ∼ N (0, σ 2 ).
The fading coefficients Ai are typically Rayleigh or Rician distributed [151]. In both
cases, we assume that E[Ai2 ] = 1 so that the channel SNR is unchanged as P/σ 2 .
Note that setting A_i = 1 for all i in (5.4.16) reduces the channel to the AWGN channel in (5.4.8). We next examine the effect of the random fading coefficient
I (X ; A, Y ) = I (X ; A) + I (X ; Y |A) = I (X ; Y |A),
C_DSI(P) = E_A [ (1/2) log_2 ( 1 + A²P/σ² ) ],     (5.4.17)
where the expectation is taken with respect to the fading distribution. Note that
the capacity-achieving distribution here is also Gaussian with mean zero and
variance P and is independent of the fading coefficient.
At this point, it is natural to compare the capacity in (5.4.17) with that of the
AWGN channel in (5.4.13). In light of the concavity of the logarithm and using
Jensen’s inequality (in Theorem B.18), we readily obtain that
C_DSI(P) = E_A [ (1/2) log_2 ( 1 + A²P/σ² ) ]
         ≤ (1/2) log_2 ( 1 + E[A²]P/σ² )
         = (1/2) log_2 ( 1 + P/σ² ) := C_G(P),     (5.4.18)
which is the capacity of the AWGN channel with identical SNR, and where the last step follows since E[A²] = 1. Thus, we conclude that fading degrades capacity as C_DSI(P) ≤ C_G(P); a numerical check of this comparison is sketched after this observation.
2. Capacity of the fading channel with full side information: We next assume that
both the receiver and the transmitter have knowledge of the fading coefficients;
this is the case of the fading channel with full side information (FSI). This
assumption applies to situations where there exists a reliable and fast feed-
back channel in the reverse direction where the decoder can communicate its
knowledge of the fading process to the encoder. In this case, the transmitter can
adaptively adjust its input power according to the value of the fading coefficient.
It can be shown (e.g., see [387]) using Lagrange multipliers that the capacity in
this case is given by
C_FSI(P) = sup_{p(·) : E_A[p(A)] = P} E_A [ (1/2) log_2 ( 1 + A² p(A)/σ² ) ]
         = E_A [ (1/2) log_2 ( 1 + A² p*(A)/σ² ) ],     (5.4.19)
where
p*(a) = max{ 0, 1/λ − σ²/a² },
and λ > 0 is chosen so that the power constraint E_A[p*(A)] = P is satisfied.
Finally, we note that real-world wireless channels are often not memoryless;
they exhibit statistical temporal memory in their fading process [80] and as a result
signals traversing the channels are distorted in a bursty fashion. We refer the reader
to [12, 108, 125, 135, 145, 211, 277, 298–301, 325, 334, 389, 420, 421] and the
references therein for models of channels with memory and for finite-state Markov
channel models which characterize the behavior of time-correlated fading channels
in various settings.
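The Jensen-inequality comparison in (5.4.18) can be verified numerically: averaging (1/2) log_2(1 + A²P/σ²) over fading with E[A²] = 1 falls below the AWGN capacity at the same SNR. The Python sketch below is an added illustration (not from the original text); the Rayleigh fading model and the 10 dB SNR are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(3)
snr = 10.0                                  # P / sigma^2 (here 10, i.e., 10 dB)

# Rayleigh fading amplitudes normalized so that E[A^2] = 1
A = rng.rayleigh(scale=1.0 / np.sqrt(2.0), size=1_000_000)

C_DSI = np.mean(0.5 * np.log2(1.0 + A**2 * snr))   # Eq. (5.4.17), decoder side information only
C_G = 0.5 * np.log2(1.0 + snr)                     # AWGN capacity (5.4.13) at the same SNR

print(C_DSI, C_G)   # C_DSI < C_G, as implied by Jensen's inequality in (5.4.18)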
channel be apportioned given a fixed overall power budget? The answer to this question lies in the so-called water-filling or water-pouring principle.
C(P) = Σ_{i=1}^{k} (1/2) log_2 ( 1 + P_i/σ_i² ),
where
P_i = max{0, θ − σ_i²}
and θ is chosen to satisfy Σ_{i=1}^{k} P_i = P. This capacity is achieved by a tuple of independent Gaussian inputs (X_1, X_2, . . . , X_k), where X_i ∼ N(0, P_i) is the input to channel i, for i = 1, 2, . . . , k.
Proof By definition,
C(P) = sup_{F_{X^k} : Σ_{i=1}^{k} E[X_i²] ≤ P} I(X^k; Y^k).
Since the noise random variables Z 1 , . . . , Z k are independent from each other,
I(X^k; Y^k) = h(Y^k) − h(Y^k|X^k)
            = h(Y^k) − h(Z^k + X^k|X^k)
            = h(Y^k) − h(Z^k|X^k)
            = h(Y^k) − h(Z^k)
            = h(Y^k) − Σ_{i=1}^{k} h(Z_i)
            ≤ Σ_{i=1}^{k} h(Y_i) − Σ_{i=1}^{k} h(Z_i)
            ≤ Σ_{i=1}^{k} (1/2) log_2 ( 1 + P_i/σ_i² ),
where the first inequality follows from the chain rule for differential entropy and the
fact that conditioning cannot increase differential entropy, and the second inequality
holds since output Yi of channel i due to input X i with E[X i2 ] = Pi has its differential
entropy maximized if it is Gaussian distributed with zero-mean and variance Pi +σi2 .
Equalities hold above if all the X_i inputs are independent of each other with each input X_i ∼ N(0, P_i) such that Σ_{i=1}^{k} P_i = P.
Thus, the problem is reduced to finding the power allotment that maximizes the overall capacity subject to the constraint Σ_{i=1}^{k} P_i = P with P_i ≥ 0. By using the Lagrange multipliers technique and verifying the KKT conditions (see Example B.21 in Appendix B.8), the maximizer (P_1, . . . , P_k) of
max [ Σ_{i=1}^{k} (1/2) log_2 ( 1 + P_i/σ_i² ) + Σ_{i=1}^{k} λ_i P_i − ν ( Σ_{i=1}^{k} P_i − P ) ]
can be found by taking the derivative of the above equation (with respect to Pi ) and
setting it to zero, which yields
λ_i = { −(1/(2 ln 2)) · 1/(P_i + σ_i²) + ν = 0,   if P_i > 0;
        −(1/(2 ln 2)) · 1/(P_i + σ_i²) + ν ≥ 0,   if P_i = 0.
Hence,
P_i = θ − σ_i²,   if P_i > 0;
P_i ≥ θ − σ_i²,   if P_i = 0
(equivalently, P_i = max{0, θ − σ_i²}), where θ := log_2 e/(2ν) is chosen to satisfy Σ_{i=1}^{k} P_i = P.
We illustrate the above result in Fig. 5.1 and elucidate why the Pi power allotments
form a water-filling (or water-pouring) scheme. In the figure, we have a vessel where
the height of each of the solid bins represents the noise power of each channel (while
the width is set to unity so that the area of each bin yields the noise power of the
corresponding Gaussian channel). We can thus visualize the system as a vessel with
an uneven bottom where the optimal input signal allocation Pi to each channel is
realized by pouring an amount P units of water into the vessel (with the resulting
overall area of filled water equal to P). Since the vessel has an uneven bottom, water
is unevenly distributed among the bins: noisier channels are allotted less signal power
(note that in this example, channel 3, whose noise power is largest, is given no input
power at all and is hence not used).
[Figure 5.1: a vessel with an uneven bottom formed by the noise powers σ_1², σ_2², σ_3², σ_4²; the poured water of total amount P = P_1 + P_2 + P_4 rises to the level θ, with channel 3 receiving no power.]
Fig. 5.1 The water-pouring scheme for uncorrelated parallel Gaussian channels. The horizontal dashed line, which indicates the level where the water rises to, indicates the value of θ for which Σ_{i=1}^{k} P_i = P
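The water-filling allotment P_i = max{0, θ − σ_i²} is straightforward to compute by searching for the water level θ that exhausts the power budget. The Python sketch below is an added illustration (not from the original text); the function name, the simple bisection on θ, and the noise powers are arbitrary choices. It reproduces the behaviour of Fig. 5.1, where the noisiest channel may receive no power.

import numpy as np

def water_filling(noise_powers, P, iters=100):
    """Return the water level theta and allotments P_i = max(0, theta - sigma_i^2)."""
    sigma2 = np.asarray(noise_powers, dtype=float)
    lo, hi = 0.0, np.max(sigma2) + P        # the water level lies in this interval
    for _ in range(iters):                  # bisection on theta
        theta = 0.5 * (lo + hi)
        if np.sum(np.maximum(0.0, theta - sigma2)) > P:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    return theta, np.maximum(0.0, theta - sigma2)

sigma2 = [1.0, 0.5, 4.0, 0.8]               # noise powers of 4 parallel Gaussian channels
theta, Pi = water_filling(sigma2, P=2.0)
C = np.sum(0.5 * np.log2(1.0 + Pi / np.asarray(sigma2)))
print(theta, Pi, C)                          # the noisiest channel (noise power 4.0) gets zero power here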
the answer is different from the water-filling principle. By characterizing the relation-
ship between mutual information and minimum mean square error (MMSE) [165],
the optimal power allocation for parallel AWGN channels with inputs constrained
to be discrete is established in [250], resulting in a new graphical power allocation
interpretation called the mercury/water-filling principle: mercury of proper amounts [250, Eq. (43)] must be individually poured into each channel bin before water of amount P = Σ_{i=1}^{k} P_i is added to the vessel. It is thus named because mercury is
heavier than water and does not dissolve in it; so it can play the role of pre-adjuster
of bin heights. This line of inquiry concludes with the observation that when the
total transmission power P is small, the strategy that maximizes capacity follows
approximately the equal SNR principle; i.e., a larger power should be allotted to a
noisier channel to optimize capacity.
Furthermore, it was found in [400] that when the channel’s additive noise is
no longer Gaussian, the mercury adjustment fails to interpret the optimal power
allocation scheme. For additive Gaussian noise with arbitrary discrete inputs, the pre-
adjustment before the water pouring step is always upward; hence, the mercury-filling
scheme is used to increase bin heights. However, since the pre-adjustment of bin
heights can generally be in both upward and downward directions for channels with
non-Gaussian noise, the use of the name mercury/water filling becomes inappropriate
(see [400, Example 1] for quaternary-input additive Laplacian noise channels). In this
case, the graphical interpretation of the optimal power allocation scheme is simply
named two-phase water-filling principle [400].
We end this observation by emphasizing that a vital measure for practical digital
communication systems is the effective transmission rate subject to an acceptably
small decoding error rate (e.g., an overall bit error probability ≤ 10−5 ). Instead,
researchers typically adopt channel capacity as a design criterion in order to make
the analysis tractable and obtain a simple reference scheme for practical systems.
Σ_{i=1}^{k} E[X_i²] = tr(K_X) ≤ P,
where tr(·) denotes the trace of the k × k matrix K X . Since in each channel, the input
and noise variables are independent from each other, we have
I (X k ; Y k ) = h(Y k ) − h(Y k |X k )
= h(Y k ) − h(Z k + X k |X k )
= h(Y k ) − h(Z k |X k )
= h(Y k ) − h(Z k ).
Since h(Z k ) is not determined by the input, determining the system’s capacity reduces
to maximizing h(Y k ) over all possible inputs (X 1 , . . . , X k ) satisfying the power
constraint.
Now observe that the covariance matrix of Y k is equal to KY = K X + K Z , which
implies by Theorem 5.20 that the differential entropy of Y k is upper bounded by
h(Y^k) ≤ (1/2) log_2 [ (2πe)^k det(K_X + K_Z) ],
with equality iff Y^k is Gaussian. It remains to find out whether we can find inputs
(X 1 , . . . , X k ) satisfying the power constraint which achieve the above upper bound
and maximize it.
As in the proof of Theorem 5.18, we can orthogonally diagonalize K_Z as
K_Z = AΛA^T,
where AA^T = I_k (and thus det(A)² = 1), I_k is the k × k identity matrix, and Λ is a diagonal matrix with positive diagonal components consisting of the eigenvalues of K_Z (as K_Z is positive-definite). Then
det(B + Λ) ≤ ∏_{i=1}^{k} (B_{ii} + λ_i),
where λ_i is the entry of the matrix Λ located at the ith row and ith column, which is exactly the ith eigenvalue of K_Z. Thus, the maximum value of det(B + Λ) under tr(B) ≤ P is realized by a diagonal matrix B (to achieve equality in Hadamard's inequality) with
Σ_{i=1}^{k} B_{ii} = P.
Finally, as in the proof of Theorem 5.36, we obtain a water-filling allotment for the
optimal diagonal elements of B:
Bii = max{0, θ − λi },
k
where θ is chosen to satisfy i=1 Bii = P. We summarize this result in the next
theorem.
C(P) = Σ_{i=1}^{k} (1/2) log_2 ( 1 + P_i/λ_i ),
where
P_i = max{0, θ − λ_i},
and θ is chosen to satisfy Σ_{i=1}^{k} P_i = P. This capacity is achieved by a tuple of zero-mean Gaussian inputs (X_1, X_2, . . . , X_k) with covariance matrix K_X having the same eigenvectors as K_Z, where the ith eigenvalue of K_X is P_i, for i = 1, 2, . . . , k.
We close this section by briefly examining the capacity of two important systems
used in wireless communications.
13 For a multivariate Gaussian vector Z, its pdf has a slightly different form when it is complex-valued. Thus, in parallel to Theorem 5.18, the joint differential entropy for a complex-valued Gaussian Z is equal to
h(Z) = h(Z_1, Z_2, . . . , Z_N) = log_2 [ (πe)^N det(K_Z) ],
where the multiplicative factors 1/2 and 2 in the differential entropy formula in Theorem 5.18 are
removed. Accordingly, the multiplicative factor 1/2 in the capacity formula for real-valued AWGN
channels is no longer necessary when a complex-valued AWGN channel is considered (e.g., see
(5.6.2) and (5.6.3)).
14 This assumption can be made valid as long as a whitening (i.e., decorrelation) matrix W of Z_i exists. One can thus multiply the received vector Y_i with W to yield the desired equivalent channel model with I_N as the noise covariance matrix (see [153, Eq. (1)] and the ensuing description).
(as in the FSI case in Observation 5.35). Then, we can follow a similar approach to the one carried out earlier in this section and obtain
det(KY ) = det(H K X H† + I N ),
If, however, only the decoder has perfect knowledge of H_i while the transmitter only knows its distribution (as in the DSI scenario in Observation 5.35), then
It directly follows from (5.6.2) and (5.6.3) above (and the property of the maximum)
that in general,
C_DSI(P) ≤ C_FSI(P).
A key finding emanating from the analysis of MIMO channels is that, by virtue of their spatial (multi-antenna) diversity, such channels can provide significant capacity gains vis-à-vis the traditional single-antenna (with M = N = 1) channel. For example, it
can be shown that when the receiver knows the channel fading coefficients perfectly
with the latter governed by a Rayleigh distribution, then MIMO channel capacity
scales linearly in the minimum of the number of receive and transmit antennas at
high channel SNR values, and thus it can be significantly larger than in the single-
antenna case [380, 387]. Detailed studies about MIMO systems, including their
capacity benefits under various conditions and configurations, can be found in [151,
153, 387] and the references therein. MIMO technology has become an essential
component of mobile communication standards, such as IEEE 802.11 Wi-Fi, 4th
generation (4G) Worldwide Interoperability for Microwave Access (WiMax), 4G
Long Term Evolution (LTE) and others; see for example [110].
selective fading with different frequency components of the signal affected by differ-
ent fading due to multipath propagation effects (which occur when the signal arrives
at the receiver via several paths). It has been shown that such fading channels are well
handled by multi-carrier modulation schemes such as orthogonal frequency division
multiplexing (OFDM) which deftly exploits the channels’ frequency diversity to
provide resilience against the deleterious consequences of fading and interference.
OFDM transforms a single-user frequency selective channel into k parallel nar-
rowband fading channels, where k is the number of OFDM subcarriers. It can be
modeled as a memoryless multivariate channel:
Y i = Hi X i + Z i , for i = 1, 2, . . . ,
C(P) = Σ_{ℓ=1}^{k} log_2 ( 1 + |h_ℓ|² P_ℓ/σ_ℓ² ) = Σ_{ℓ=1}^{k} log_2 ( 1 + P_ℓ/(σ_ℓ²/|h_ℓ|²) ),
where σ_ℓ² is the variance of the ℓth component of Z_i and h_ℓ is the ℓth diagonal entry of H_i. Thus the overall system capacity, C(P), optimized subject to the power constraint Σ_{ℓ=1}^{k} P_ℓ ≤ P, can be obtained via the water-filling principle of Sect. 5.5:
k
P∗
= log2 1 + ,
=1
σ /|h |2
2
where
P∗ = max{0, θ − σ2 /|h |2 },
and the parameter θ is chosen to satisfy k=1 P = P.
Like MIMO, OFDM has been adopted by many communication standards, includ-
ing Digital Video Broadcasting (DVB-S/T), Digital Subscriber Line (DSL), Wi-Fi,
WiMax, and LTE. The reader is referred to wireless communication textbooks such
as [151, 255, 387] for a thorough examination of OFDM systems.
206 5 Differential Entropy and Gaussian Channels
1 P + σ2 1 P + σ2
log2 ≥ C(P) ≥ log2 . (5.7.1)
2 Ze 2 σ2
Proof The lower bound in (5.7.1) is already proved in Theorem 5.33. The upper
bound follows from
I (X ; Y ) = h(Y ) − h(Z )
1 1
≤ log2 [2πe(P + σ 2 )] − log2 [2πeZe ].
2 2
1 2h(Z )
Ze = 2 = Var(Z ),
2πe
as expected.
Whenever two independent Gaussian random variables, Z 1 and Z 2 , are added, the
power (variance) of the sum is equal to the sum of the powers (variances) of Z 1 and
Z 2 . This relationship can then be written as
or equivalently
Var(Z 1 + Z 2 ) = Var(Z 1 ) + Var(Z 2 ).
However, when two independent random variables are non-Gaussian, the relationship
becomes
5.7 Non-Gaussian Discrete-Time Memoryless Channels 207
or equivalently
Ze (Z 1 + Z 2 ) ≥ Ze (Z 1 ) + Ze (Z 2 ). (5.7.3)
Inequality (5.7.2) (or equivalently 5.7.3), whose proof can be found in [83, Sect. 17.8]
or [51, Theorem 7.10.4], is called the entropy power inequality. It reveals that the
sum of two independent random variables may introduce more entropy power than
the sum of each individual entropy power, except in the Gaussian case.
1 P + σ2 1 P + σ2
log2 = log2 + D(Z Z G ),
2 Ze 2 σ2
Z(t)
X(t) Y (t)
+ H(f )
Waveform channel
Fig. 5.2 Band-limited waveform channel with additive white Gaussian noise
where “∗” represents the convolution operation (recall that the convolution between
∞
two signals a(t) and b(t) is defined as a(t) ∗ b(t) = −∞ a(τ )b(t − τ )dτ ). Here,
X (t) is the channel input waveform with average power constraint
T /2
1
lim E[X 2 (t)]dt ≤ P (5.8.1)
T →∞ T −T /2
and bandwidth W cycles per second or Hertz (Hz); i.e., its spectrum or Fourier
+∞
transform X ( f ) := F[X (t)] = −∞ X (t)e− j2π f t dt = 0 for all frequencies
√
| f | > W , where j = −1 is the imaginary unit number. Z (t) is the noise wave-
form of a zero-mean stationary white Gaussian process with power spectral density
N0 /2; i.e., its power spectral density PSD Z ( f ), which is the Fourier transform of the
process covariance (equivalently, correlation) function K Z (τ ) := E[Z (s)Z (s + τ )],
s, τ ∈ R, is given by
+∞
N0
PSD Z ( f ) = F[K Z (t)] = K Z (t)e− j2π f t dt = ∀ f.
−∞ 2
Finally, h(t) is the impulse response of an ideal bandpass filter with cutoff fre-
quencies at ±W Hz:
1 if − W ≤ f ≤ W,
H ( f ) = F[(h(t)] =
0 otherwise.
Recall that one can recover h(t) by taking the inverse Fourier transform of H ( f );
this yields
+∞
h(t) = F −1 [H ( f )] = H ( f )e j2π f t d f = 2W sinc(2W t),
−∞
where
5.8 Capacity of the Band-Limited White Gaussian Channel 209
sin(πt)
sinc(t) :=
πt
is the sinc function and is defined to equal 1 at t = 0 by continuity.
Note that we can write the channel output as
where Z̃ (t) := Z (t) ∗ h(t) is the filtered noise waveform. The input X (t) is not
affected by the ideal unit-gain bandpass filter since it has an identical bandwidth as
h(t). Note also that the power spectral density of the filtered noise is given by
N0
if − W ≤ f ≤ W,
PSD Z̃ ( f ) = PSD Z ( f )|H ( f )| =
2 2
0 otherwise.
Taking the inverse Fourier transform of PSD Z̃ ( f ) yields the covariance function of
the filtered noise process:
To determine the capacity (in bits per second) of this continuous-time band-
limited white Gaussian channel with parameters, P, W , and N0 , we convert it to
an “equivalent” discrete-time channel with power constraint P by using the well-
known sampling theorem (due to Nyquist, Kotelnikov and Shannon), which states
that sampling a band-limited signal with bandwidth W at a rate of 1/(2W ) is sufficient
to reconstruct the signal from its samples. Since X (t), Z̃ (t), and Y (t) are all band-
limited to [−W, W ], we can thus represent these signals by their samples taken 2W 1
seconds apart and model the channel by a discrete-time channel described by:
Yn = X n + Z̃ n , n = 0, ±1, ±2, . . . ,
where X n := X ( 2W n
) are the input samples and Z̃ n = Z ( 2W
n
) and Yn = Y ( 2W
n
) are
the random samples of the noise Z̃ (t) and output Y (t) signals, respectively.
Since Z̃ (t) is a filtered version of Z (t), which is a zero-mean stationary Gaussian
process, we obtain that Z̃ (t) is also zero-mean, stationary and Gaussian. This directly
implies that the samples Z̃ n , n = 1, 2, . . . , are zero-mean Gaussian identically
distributed random variables. Now an examination of the expression of K Z̃ (τ ) in
(5.8.2) reveals that
K Z̃ (τ ) = 0
for τ = 2Wn
, n = 1, 2, . . . , since sinc(t) = 0 for all nonzero integer values of t.
Hence, the random variables Z̃ n , n = 1, 2, . . . , are uncorrelated and hence inde-
pendent (since they are Gaussian) and their variance is given by E[ Z̃ n2 ] = K Z̃ (0) =
210 5 Differential Entropy and Gaussian Channels
1
Given that we are using the channel (with inputs X n ) every 2W seconds, we obtain
that the capacity in bits/second of the band-limited white Gaussian channel is given
by
P
C(P) = W log2 1 + bits/second, (5.8.3)
N0 W
P
10 log10 = 40 dB,
N0 W
then from (5.8.3), we calculate that the capacity of the telephone line channel (when
modeled via the band-limited white Gaussian channel) is given by
Example 5.45 (Infinite bandwidth white Gaussian channel) As the channel band-
width W grows without bound, we obtain from (5.8.3) that
P
lim C(P) = log2 e bits/second,
W →∞ N0
which indicates that in the infinite-bandwidth regime, capacity grows linearly with
power.
Observation 5.46 (Band-limited colored Gaussian channel) If the above band-
limited channel has a stationary colored (nonwhite) additive Gaussian noise, then it
can be shown (e.g., see [135]) that the capacity of this channel becomes
1 W
θ
C(P) = max 0, log2 d f,
2 −W PSD Z ( f )
h(X ) = h(X + c)
1 (ln x−μ)2
f X (x) = √ e− 2σ2 , x > 0.
σx 2π
(d) The source X = a X 1 +bX 2 , where a and b are nonzero constants and X 1 and
X 2 are independent Gaussian random variables such that X 1 ∼ N (μ1 , σ12 )
and X 2 ∼ N (μ2 , σ22 ).
3. Generalized Gaussian: Let X be a generalized Gaussian random variable with
mean zero, variance σ 2 and pdf given by
αη −ηα |x|α
f X (x) = e , x ∈ R,
2( α1 )
and
(b) Show that when α = 2 and α = 1, h(X ) reduces to the differential entropy
of the Gaussian and Laplacian distributions, respectively.
4. Prove that, of all pdfs with support [0, 1], the uniform density function has the
largest differential entropy.
5. Of all pdfs with continuous support [0, K ], where K > 1 is finite, which pdf
has the largest differential entropy?
Hint: If f X is the pdf that maximizes differential entropy among all pdfs with
support [0, K ], then E[log f X (X )] = E[log f X (Y )] for any random variable Y
of support [0, K ].
6. Show that the exponential distribution has the largest differential entropy among
all pdfs with mean μ and support [0, ∞). (Recall that the pdf of the exponential
distribution with mean μ is given by f X (x) = μ1 exp(− μx ) for x ≥ 0.)
7. Show that among all continuous random variables X admitting a pdf with support
R and finite differential entropy and satisfying E[X ] = 0 and E[|X |] = λ, where
λ > 0 is a fixed parameter, the Laplacian random variable with pdf
1 − |x|
f X (x) = e λ for x ∈ R
2λ
maximizes differential entropy.
8. Find the mutual information between the dependent Gaussian zero-mean random
variables X and Y with covariance matrix
2
σ ρσ 2
,
ρσ 2 σ 2
9. A variant of the fundamental inequality for the logarithm: For any x > 0 and
y > 0, show that &y'
y ln ≥ y − x,
x
with equality iff x = y.
10. Nonnegativity of divergence: Let X and Y be two continuous random variables
with pdfs f X and f Y , respectively, such that their supports satisfy S X ⊆ SY ⊆ R.
Use Problem 4.9 to show that
D( f X f Y ) ≥ 0,
D( f g S ) ≥ D( f S g S ),
1
I (Z 1 + Z 2 ; Z 1 + Z 2 + Z 3 ) ≥ log2 (3) (in bits).
2
Hint: Use the entropy power inequality in (5.7.2).
15. An alternative form of the entropy power inequality: Show that the entropy power
inequality in (5.7.2) can be written as
h(Z 1 + Z 2 ) ≥ h(Y1 + Y2 ),
5.8 Capacity of the Band-Limited White Gaussian Channel 215
where Z 1 and Z 2 are two independent continuous random variables, and Y1 and
Y2 are two independent Gaussian random variables such that
h(Y1 ) = h(Z 1 )
and
h(Y2 ) = h(Z 2 ).
16. A relation between differential entropy and estimation error: Consider a contin-
uous random variable Z with a finite variance and admitting a pdf. It is desired
to estimate Z by observing a correlated random variable Y (assume that the joint
pdf of Z and Y and their conditional pdfs are well-defined). Let Ẑ = Ẑ (Y ) be
such estimate of Z .
(a) Show that the mean square estimation error satisfies
22h(Z |Y )
E[(Z − Ẑ (Y ))2 ] ≥ .
2πe
(a) By applying a power constraint on the input, i.e., E[X 2 ] ≤ P, where P > 0,
show the channel capacity Cn (P) of this channel is given by
216 5 Differential Entropy and Gaussian Channels
1 1 P
Cn (P) = (n − 1) log (1 − ρ) + log (1 − ρ) + n + ρ .
2 2 σ2
σ 2
(b) Prove that if P > n−1 , then Cn (P) < nC1 (P).
Hint: It suffices to prove that exp{2Cn (P)} < exp{2nC1 (P)}. Use also the
following identity regarding the determinant of an n×n matrix with identical
diagonal entries and identical non-diagonal entries:
⎡ ⎤
a b b ··· b
⎢b a b · · · b⎥
⎢ ⎥
det ⎢ . .. . . .. ⎥ = (a − b)n−1 (a + (n − 1)b).
⎣ .. . . .⎦
b b ··· b a
19. Consider a continuous-alphabet channel with a vectored output for a scalar input
as follows.
X→ Channel → Y1 , Y2
sian, where K Y1 ,Y2 is the covariance matrix of (Y1 , Y2 ), derive Ctwo (S) for
the two-output channel under the power constraint E[X 2 ] ≤ S.
Hint: I (X ; Y1 , Y2 ) = h(Y1 , Y2 ) − h(N1 , N2 ) = h(Y1 , Y2 ) − h(N1 ) − h(N2 ).
5.8 Capacity of the Band-Limited White Gaussian Channel 217
Y = X + Z,
6.1 Preliminaries
6.1.1 Motivation
In a number of situations, one may need to compress a source to a rate less than the
source entropy, which as we saw in Chap. 3 is the minimum lossless data compression
rate. In this case, some sort of data loss is inevitable and the resultant code is referred
to as a lossy data compression code. The following are examples for requiring the
use of lossy data compression.
Example 6.2 (Extracting useful information) In some scenarios, the source informa-
tion may not be operationally useful in its entirety. A quick example is the hypothesis
testing problem where the system designer is only concerned with knowing the like-
lihood ratio of the null hypothesis distribution against the alternative hypothesis
distribution (see Chap. 2). Therefore, any two distinct source letters which produce
the same likelihood ratio are not encoded into distinct codewords and the resultant
code is lossy.
Output with
Source with unmanageable
rH > C Channel with error
capacity C
or distortion will be introduced at the destination (beyond the control of the sys-
tem designer). A more viable approach would be to reduce the source’s information
content via a lossy compression step so that the entropy H of the resulting source
satisfies r H < C (this can, for example, be achieved by grouping the symbols
of the original source and thus reducing its alphabet size). By Theorem 4.32, the
compressed source can then be reliably sent at rate r over the channel. With this
approach, error is only incurred (under the control of the system designer) in the
lossy compression stage (cf. Fig. 6.1).
Note that another solution that avoids the use of lossy compression would be to
reduce the source–channel transmission rate in the system from r to r source sym-
bol/channel symbol such that r H < C holds; in this case, again by Theorem 4.32,
lossless reproduction of the source is guaranteed at the destination, albeit at the price
of slowing the system.
→ R+ ,
ρ: Z × Z
From the above definition, the distortion measure ρ(z, ẑ) can be viewed as the
cost of representing the source symbol z ∈ Z by a reproduction symbol ẑ ∈ Z. It is
then expected to choose a certain number of (typical) reproduction letters in Z that
represent the source letters with the least cost.
When Z = Z, the selection of typical reproduction letters is similar to par-
titioning the source alphabet into several groups, and then choosing one ele-
ment in each group to represent all group members. For example, suppose that
Z = Z = {1, 2, 3, 4} and that, due to some constraints, we need to reduce the
number of outcomes to 2, and we require that the resulting expected cost cannot be
larger than 0.5. Assume that the source is uniformly distributed and that the distortion
measure is given by the following matrix:
⎡ ⎤
0 1 2 2
⎢1 0 2 2⎥
[ρ(i, j)] := ⎢
⎣2
⎥.
2 0 1⎦
2 2 1 0
We see that the two groups in Z which cost least in terms of expected distortion
E[ρ(Z ,
Z )] should be {1, 2} and {3, 4}. We may choose, respectively, 1 and 3 as
the typical elements for these two groups (cf. Fig. 6.2). The expected cost of such
selection is
1 1 1 1 1
ρ(1, 1) + ρ(2, 1) + ρ(3, 3) + ρ(4, 3) = .
4 4 4 4 2
= {1, 2, 3, E}| = 4,
|Z = {1, 2, 3}| = 3, |Z
where E can be regarded as an erasure symbol, and the distortion measure is defined
by
1 2 1 2
=⇒
3 4 3 4
In this case, assume again that the source is uniformly distributed and that to represent
source letters by distinct letters in {1, 2, 3} will yield four times the cost incurred
when representing them by E. Therefore, if only 2 outcomes are allowed, and the
expected distortion cannot be greater than 1/3, then employing typical elements 1
and E to represent groups {1} and {2, 3}, respectively, is an optimal choice. The
resultant entropy is reduced from log2 (3) bits to [log2 (3) − 2/3] bits. It needs to be
> |Z| + 1 is usually not advantageous.
pointed out that having |Z|
Example 6.5 (Hamming distortion measure) Let the source and reproduction alpha-
Then, the Hamming distortion measure is given
bets be identical, i.e., Z = Z.
by
0, if z = ẑ;
ρ(z, ẑ) :=
1, if z = ẑ.
E[ρ(Z ,
Z )] = Pr(Z =
Z ).
= R, the
Example 6.6 (Absolute error distortion measure) Assuming that Z = Z
absolute error distortion measure is given by
= R,
Example 6.7 (Squared error distortion measure) Again assuming that Z = Z
the squared error distortion measure is given by
The squared error distortion measure is perhaps the most popular distortion measure
used for continuous alphabets.
Note that all above distortion measures belong to the class of so-called difference
distortion measures, which have the form ρ(z, ẑ) = d(x − x̂) for some nonnegative
function d(·, ·). The squared error distortion measure has the advantages of simplicity
and having a closed-form solution for most cases of interest, such as when using least
squares prediction. Yet, this measure is not ideal for practical situations involving
data operated by human observers (such as image and speech data) as it is inadequate
6.1 Preliminaries 223
in measuring perceptual quality. For example, two speech waveforms in which one is
a marginally time-shifted version of the other may have large square error distortion;
however, they sound quite similar to the human ear.
The above definition for distortion measures can be viewed as a single-letter
distortion measure since they consider only one random variable Z which draws a
single letter. For sources modeled as a sequence of random variables {Z n }, some
extension needs to be made. A straightforward extension is the additive distortion
measure.
Definition 6.8 (Additive distortion measure between vectors) The additive distor-
tion measure ρn between vectors z n and ẑ n of size n (or n-sequences or n-tuples) is
defined by
n
ρn (z n , ẑ n ) = ρ(z i , ẑ i ).
i=1
ρn (z n , ẑ n ) = max ρ(z i , ẑ i ).
1≤i≤n
After defining the distortion measures for source sequences, a natural question to
ask is whether to reproduce source sequence z n by sequence ẑ n of the same length is a
must or not. To be more precise, can we use z̃ k to represent z n for k = n? The answer
is certainly yes if a distortion measure for z n and z̃ k is defined. A quick example will
be that the source is a ternary sequence of length n, while the (fixed-length) data
compression result is a set of binary indices of length k, which is taken as small as
possible subject to some given constraints. Hence, k is not necessarily equal to n.
One of the problems for taking k = n is that the distortion measure for sequences can
no longer be defined based on per-letter distortions, and hence a per-letter formula
for the best lossy data compression rate cannot be rendered.
In order to alleviate the aforementioned (k = n) problem, we claim that for
most cases of interest, it is reasonable to assume k = n. This is because one can
actually implement lossy data compression from Z n to Z k in two steps. The first
step corresponds to a lossy compression mapping h n : Z n → Z n , and the second
n
step performs indexing h n (Z ) into Z : k
n
hn : Z n → Z
for which the prespecified distortion and rate constraints are satisfied.
Step 2: Derive the (asymptotically) lossless data compression block code for source
h n (Z n ). When n is sufficiently large, the existence of such code with blocklength
224 6 Lossy Data Compression and Transmission
k 1
k > H (h n (Z n )) equivalently, R = > H (h n (Z n )
n n
n
Z → Z → {0, 1}
n k
Step 1
Similar to the lossless source coding theorem, the objective is to find the theoretical
limit of the compression rate for lossy source codes. Before introducing the main
theorem, we first formally define lossy data compression codes, the rate–distortion
region, and the (operational) rate–distortion function.
with a codebook size (i.e., the image h(Z n )) equal to |h(Z n )| = M = Mn and an
average (expected) distortion no larger than distortion threshold D ≥ 0:
1
E ρn (Z n , h(Z n )) ≤ D.
n
The compression rate of the code is defined as (1/n) log2 M bits/source symbol, as
log2 M bits can be used to represent a sourceword of length n. Indeed, an equivalent
description of the above compression code can be made via an encoder–decoder pair
( f, g), where
f : Z n → {1, 2, . . . , M}
n
g: {1, 2, . . . , M} → Z
6.2 Fixed-Length Lossy Data Compression Codes 225
1
lim sup log Mn < R.
n→∞ n
Proof It is enough to show that the set of all achievable rate–distortion pairs is convex
(since the closure of a convex set is convex). Also, assume without loss of generality
that 0 < λ < 1.
We will prove convexity of the set of all achievable rate–distortion pairs using a
time-sharing argument, which basically states that if we can use an (n, M1 , D1 ) code
∼C1 to achieve (R1 , D1 ) and an (n, M2 , D2 ) code ∼C2 to achieve (R2 , D2 ), then for any
rational number 0 < λ < 1, we can use ∼C1 for a fraction λ of the time and use ∼C2
for a fraction 1 − λ of the time to achieve (Rλ , Dλ ), where Rλ = λR1 + (1 − λ)R2
and Dλ = λD1 + (1 − λ)D2 ; hence, the result holds for any real number 0 < λ < 1
by the density of the rational numbers in R and the continuity of Rλ and Dλ in λ.
Let r and s be positive integers and let λ = r +s r
; then 0 < λ < 1. Now assume
that the pairs (R1 , D1 ) and (R2 , D2 ) are achievable. Then, there exist a sequence
of (n, M1 , D1 ) codes ∼C1 and a sequence of (n, M2 , D2 ) codes ∼C2 such that for n
sufficiently large,
1
log2 M1 ≤ R1
n
and
1
log2 M2 ≤ R2 .
n
where
z (r +s)n = (z 1n , . . . , zrn , zrn+1 , . . . , zrn+s )
226 6 Lossy Data Compression and Transmission
and h 1 and h 2 are the compression functions of ∼C1 and ∼C2 , respectively. In other
words, each reconstruction vector h(z (r +s)n ) of code ∼C is a concatenation of r recon-
struction vectors of code ∼C1 and s reconstruction vectors of code ∼C2 .
The average (or expected) distortion under the additive distortion measure ρn and
the rate of code ∼C are given by
ρ(r +s)n (z (r +s)n , h(z (r +s)n ))
E
(r + s)n
1 ρn (z 1n , h 1 (z 1n )) ρn (zrn , h 1 (zrn ))
= E + ··· + E
r +s n n
n n
ρn (zr +1 , h 2 (zr +1 )) ρn (zr +s , h 2 (zrn+s ))
n
+E + ··· + E
n n
1
≤ (r D1 + s D2 )
r +s
= λD1 + (1 − λ)D2 = Dλ
and
1 1
log2 M = log2 (M1r × M2s )
(r + s)n (r + s)n
r 1 s 1
= log2 M1 + log2 M2
(r + s) n (r + s) n
≤ λR1 + (1 − λ)R2 = Rλ ,
Observation 6.15 (Monotonicity and convexity of R(D)) Note that, under an addi-
tive distortion measure ρn , the rate–distortion function R(D) is nonincreasing and
convex in D (the proof is left as an exercise).
The basic idea for identifying good data compression reproduction words from
the set of sourcewords emanating from a DMS is to draw them from the so-called
distortion typical set. This set is defined analogously to the jointly typical set studied
in channel coding (cf. Definition 4.7).
Definition 6.16 (Distortion typical set) The distortion δ-typical set with respect to
the memoryless (product) distribution PZ , n n
Z on Z × Z (i.e., when pairs of n-tuples
(z , ẑ ) are drawn i.i.d. from Z × Z
n n according to PZ ,
Z ) and a bounded additive
distortion measure ρn (·, ·) is defined by
Dn (δ) := (z n , ẑ n ) ∈ Z n × Z n :
1
− log PZ n (z n ) − H (Z ) < δ,
n 2
1
− log P
2 Z n (ẑ ) − H ( Z ) < δ,
n
n
1
− log PZ n ,
Z n (z , ẑ ) − H (Z , Z ) < δ,
n n
n 2
1
and ρn (z n , ẑ n ) − E[ρ(Z , Z )] < δ .
n
Note that this is the definition of the jointly typical set with an additional con-
straint on the normalized distortion on sequences of length n being close to the
expected value. Since the additive distortion measure between two joint i.i.d. ran-
dom sequences is actually the sum of the i.i.d. random variables ρ(Z i ,
Z i ), i.e.,
n
ρn (Z n ,
Zn) = ρ(Z i ,
Z i ),
i=1
then the (weak) law of large numbers holds for the distortion typical set. Therefore,
an AEP-like theorem can be derived for distortion typical set.
Theorem 6.17 If (Z 1 ,
Z 1 ), (Z 2 ,
Z 2 ), . . ., (Z n ,
Z n ), . . . are i.i.d., and ρn (·, ·) is a
bounded additive distortion measure, then as n → ∞,
1
− log2 PZ n (Z 1 , Z 2 , . . . , Z n ) → H (Z ) in probability,
n
1
Z n ( Z 1 , Z 2 , . . . , Z n ) → H ( Z ) in probability,
− log2 P
n
1
Z n ((Z 1 , Z 1 ), . . . , (Z n , Z n )) → H (Z , Z ) in probability,
− log2 PZ n ,
n
228 6 Lossy Data Compression and Transmission
and
1
ρn (Z n ,
Z n ) → E[ρ(Z ,
Z )] in probability.
n
Proof Functions of i.i.d. random variables are also i.i.d. random variables. Thus by
the weak law of large numbers, we have the desired result.
It needs to be pointed out that without the bounded property assumption on ρ, the
normalized sum of an i.i.d. sequence does not necessarily converge in probability to
a finite mean, hence the need for requiring that ρ be bounded.
Theorem 6.18 (AEP for distortion measure) Given a DMS {(Z n , Z n )} with generic
joint distribution PZ ,
Z and any δ > 0, the distortion δ-typical set satisfies
Proof The first result follows directly from Theorem 6.17 and the definition of the
distortion typical set Dn (δ). The second result can be proved as follows:
Z n (z , ẑ )
n n
PZ n ,
Z n |Z n (ẑ |z ) =
n n
P
PZ n (z n )
Z n (z , ẑ )
n n
n PZ n ,
= P
Z n (ẑ )
PZ n (z n )P
Z n (ẑ )
n
2−n[H (Z , Z )−δ]
≤ P
Z n (ẑ )
n
2−n[H (Z )+δ] 2−n[H (
Z )+δ]
n n[I (Z ;
Z )+3δ]
= P
Z n (ẑ )2 ,
Before presenting the lossy data compression theorem, we need the following
inequality.
Proof Let g y (t) := (1 − yt)n . It can be shown by taking the second derivative of
g y (t) with respect to t that this function is strictly convex for t ∈ [0, 1]. Hence, using
∨ to denote disjunction, we have for any x ∈ [0, 1] that
6.3 Rate–Distortion Theorem 229
(1 − x y)n = g y (1 − x) · 0 + x · 1
≤ (1 − x) · g y (0) + x · g y (1)
with equality holding iff (x = 0) ∨ (x = 1) ∨ (y = 0)
= (1 − x) + x · (1 − y)n
n
≤ (1 − x) + x · e−y
with equality holding iff (x = 0) ∨ (y = 0)
≤ (1 − x) + e−ny
with equality holding iff (x = 1).
From the above derivation, we know that equality holds in (6.3.2) iff
n
ρn (z n , ẑ n ) = ρ(z i , ẑ i ) and ρmax := max ρ(z, ẑ) < ∞,
(z,ẑ)∈Z×Z
i=1
where ρ(·, ·) is a given single-letter distortion measure. Then, the source’s rate–
distortion function satisfies the following expression:
R(D) = min I (Z ;
Z ).
Z |Z :E[ρ(Z , Z )]≤D
P
Proof Define
R (I ) (D) := min I (Z ;
Z ); (6.3.3)
Z |Z :E[ρ(Z , Z )]≤D
P
1. Achievability Part (i.e., R(D + ε) ≤ R (I ) (D) + 4ε for arbitrarily small ε > 0):
We need to show that for any ε > 0, there exist 0 < γ < 4ε and a sequence of
lossy data compression codes {(n, Mn , D + ε)}∞ n=1 with
1
lim sup log2 Mn ≤ R (I ) (D) + γ < R (I ) (D) + 4ε.
n→∞ n
230 6 Lossy Data Compression and Transmission
R (I ) (D) = min I (Z ;
Z ) = I (Z ;
Z ).
Z |Z :E[ρ(Z , Z )]≤D
P
Then
E[ρ(Z ,
Z )] ≤ D.
Choose Mn to satisfy
1 1
R (I ) (D) + γ ≤ log2 Mn ≤ R (I ) (D) + γ
2 n
for some γ in (0, 4ε), for which the choice should exist for all sufficiently large
n > N0 for some N0 . Define
γ ε
δ := min , .
8 1 + 2ρmax
n according to
Step 2: Random coding. Independently select Mn words from Z
n
Z n (z̃ ) = Z (z̃ i ),
n
P P
i=1
Z (z̃) =
P PZ (z)P
Z |Z (z̃|z).
z∈Z
∼Cn = {c1 , c2 , . . . , c Mn },
1
ρn (z n , h n (z n )) ≤ E[ρ(Z ,
Z )] + δ ≤ D + δ.
n
Let
:= PZ n (J c (∼Cn )).
Then, the expected probability of source n-tuples not belonging to J (∼Cn ), aver-
aged over all randomly generated codebooks, is given by
⎛ ⎞
E[] = Cn ) ⎝
Z n (∼
P PZ n (z n )⎠
∼Cn z n ∈J
/ (∼Cn )
⎛ ⎞
= PZ n (z n ) ⎝ Cn )⎠ .
Z n (∼
P
z n ∈Z n ∼Cn :z n ∈J
/ (∼Cn )
Then ⎛ ⎞ Mn
P Cn ) = ⎝1 −
Z n (∼ P n n n ⎠
Z n (z̃ )K (z , z̃ ) .
∼Cn :z n ∈J
/ (∼Cn ) n
z̃ n ∈Z
E[] = PZ n (z n ) ⎝1 − n n ⎠
Z n (z̃ )K (z , z̃ )
P n
z n ∈Z n n
z̃ n ∈Z
⎛ ⎞ Mn
−n(I (Z ;
≤ PZ n (z n ) ⎝1 − Z n |Z n (z̃ |z )2
P n n Z )+3δ)
K (z n , z̃ n )⎠
z n ∈Z n n
z̃ n ∈Z
(by 6.3.1)
⎛ ⎞ Mn
= PZ n (z n ) ⎝1 − 2−n(I (Z ; Z )+3δ) n n ⎠
Z n |Z n (z̃ |z )K (z , z̃ )
P n n
z n ∈Z n n
z̃ n ∈Z
≤ PZ n (z ) 1 −
n
Z n |Z n (z̃ |z )K (z , z̃ )
P n n n n
z n ∈Z n n
z̃ n ∈Z
( )
+ exp −Mn · 2−n(I (Z ; Z )+3δ) (from 6.3.2)
≤ PZ n (z n ) 1 − PZ n |Z n (z̃ |z )K (z , z̃ )
n n n n
z n ∈Z n n
z̃ n ∈Z
( )
n(R (I ) (D)+γ/2) −n(I (Z ;
Z )+3δ)
+ exp −2 ·2 ,
≤ δ + δ = 2δ
( * +)
for all n > N := max N0 , N1 , 1δ log2 log min{δ,1} 1
.
∗
Since E[] E [PZ n (J (∼Cn ))] ≤ 2δ, there must exist a codebook ∼Cn such that
= c
∗
PZ n J (∼Cn ) is no greater than 2δ for n sufficiently large.
c
Step 5: Calculation of distortion. The distortion of the optimal codebook ∼C∗n (from
the previous step) satisfies for n > N :
1
1
E[ρn (Z n , h n (Z n ))] = PZ n (z n ) ρn (z n , h n (z n ))
n n
z n ∈J (∼C∗n )
1
+ PZ n (z n ) ρn (z n , h n (z n ))
n
/ (∼C∗n )
z n ∈J
≤ PZ n (z n )(D + δ) + PZ n (z n )ρmax
z n ∈J (∼C∗n ) / (∼C∗n )
z n ∈J
6.3 Rate–Distortion Theorem 233
≤ (D + δ) + 2δ · ρmax
≤ D + δ(1 + 2ρmax )
≤ D + ε.
2. Converse Part (i.e., R(D + ε) ≥ R (I ) (D) for arbitrarily small ε > 0 and any D ∈
{D ≥ 0 : R (I ) (D) > 0}): We need to show that for any sequence of {(n, Mn , Dn )}∞
n=1
code with
1
lim sup log2 Mn < R (I ) (D),
n→∞ n
Z λ ) ≤ λ · I (Z ;
I (Z ; Z 1 ) + (1 − λ) · I (Z ;
Z 2 ),
Z λ |Z (ẑ|z) := λP
P Z 1 |Z (ẑ|z) + (1 − λ)P
Z 2 |Z (ẑ|z).
E[ρ(Z ,
Z λ )] = PZ (z) Z λ |Z (ẑ|z)ρ(z, ẑ)
P
z∈Z
ẑ∈Z
& '
= PZ (z) λP
Z 1 |Z (ẑ|z) + (1 − λ)P
Z 2 |Z (ẑ|z) ρ(z, ẑ)
z∈Z,ẑ∈Z
= λD1 + (1 − λ)D2 ,
we have
R (I ) (λD1 + (1 − λ)D2 ) ≤ I (Z ;
Zλ)
≤ λI (Z ;
Z 1 ) + (1 − λ)I (Z ;
Z2)
= λR (I ) (D1 ) + (1 − λ)R (I ) (D2 ).
which is finite by the boundedness of the distortion measure. Thus, since R (I ) (D)
is nonincreasing and convex, it directly follows that it is strictly decreasing and
continuous over {D ≥ 0 : R (I ) (D) > 0}.
Step 4: Main proof.
log2 Mn ≥ H (h n (Z n ))
= H (h n (Z n )) − H (h n (Z n )|Z n ), since H (h n (Z n )|Z n ) = 0;
= I (Z n ; h n (Z n ))
= H (Z n ) − H (Z n |h n (Z n ))
n
n
= H (Z i ) − H (Z i |h n (Z n ), Z 1 , . . . , Z i−1 )
i=1 i=1
by the independence of Z n ,
and the chain rule for conditional entropy;
n
n
≥ H (Z i ) − H (Z i |
Zi )
i=1 i=1
where
Z i is the ith component of h n (Z n );
n
= I (Z i ;
Zi )
i=1
n
≥ R (I ) (Di ), where Di := E[ρ(Z i ,
Z i )];
i=1
n
1 (I )
=n R (Di )
i=1
n
, n
1
≥ n R (I ) Di , by convexity of R (I ) (D);
i=1
n
(I ) 1
= nR E[ρn (Z n , h n (Z n ))] ,
n
where the last step follows since the distortion measure is additive. Finally,
lim supn→∞ (1/n) log2 Mn < R (I ) (D) implies the existence of N and γ > 0
such that (1/n) log Mn < R (I ) (D) − γ for all n > N . Therefore, for n > N ,
6.3 Rate–Distortion Theorem 235
1
R (I ) E[ρn (Z n , h n (Z n ))] < R (I ) (D) − γ,
n
which, together with the fact that R (I ) (D) is strictly decreasing, implies that
1
E[ρn (Z n , h n (Z n ))] > D + ε
n
As in the case of block source coding in Chap. 3 (compare Theorem 3.6 with
Theorem 3.15), the above rate–distortion theorem can be extended for the case of
stationary ergodic sources (e.g., see [42, 135]).
n
ρn (z n , ẑ n ) = ρ(z i , ẑ i ) and ρmax := max ρ(z, ẑ) < ∞,
(z,ẑ)∈Z×Z
i=1
where ρ(·, ·) is a given single-letter distortion measure. Then, the source’s rate–
distortion function is given by
(I )
R(D) = R (D),
where
(I )
R (D) := lim Rn(I ) (D) (6.3.5)
n→∞
1
Rn(I ) (D) := min I (Z n ;
Zn) (6.3.6)
Z n |Z n : n
P 1
E[ρn (Z n ,
Z n )]≤D n
Hence,
(I )
R (D) ≤ Rn(I ) (D) (6.3.7)
where
1
μn := H (Z n ) − H (Z)
n
where ⎛ - ⎞
1⎝ (1 − q)2 ⎠,
Dc = 1− 1−
2 q
1A twin result to the above Wyner–Ziv lower bound, which consists of an upper bound on the
capacity-cost function of channels with stationary additive noise, is shown in [16, Corollary 1].
This result, which is expressed in terms of the nth-order capacity-cost function and the amount
of memory in the channel noise, illustrates the “natural duality” between the information rate–
distortion and capacity-cost functions originally pointed out by Shannon [345].
6.3 Rate–Distortion Theorem 237
q = P{Z n = 1|Z n−1 = 0} = P{Z n = 0|Z n−1 = 1} > 1/2 is the source’s
transition probability and h b ( p) = − p log2 ( p) − (1 − p) log2 (1 − p) is the binary
(I )
entropy function. Determining R (D) for D > Dc is still an open problem,
(I )
although R (D) can be estimated in this region via lower and upper bounds.
(I )
Indeed, the right-hand side of (6.3.9) still serves as a lower bound on R (D) for
(I )
D > Dc [156]. Another lower bound on R (D) is the above Wyner–Ziv bound
(I )
(6.3.8), while (6.3.7) gives an upper bound. Various bounds on R (D) are studied
in [43] and calculated (in particular, see [43, Fig. 1]).
2 For example, the boundedness assumption in the theorems can be replaced with assuming that there
such that E[ρ(Z , ẑ 0 )] < ∞ [42, Theorems 7.2.4 and 7.2.5].
exists a reproduction symbol ẑ 0 ∈ Z
This assumption can accommodate the squared error distortion measure and a source with finite
second moment (including continuous-alphabet sources such as Gaussian sources); see also [135,
Theorem 9.6.2 and p. 479].
238 6 Lossy Data Compression and Transmission
In light of Theorem 6.20 and the discussion at the end of the previous section, we
know that for a wide class of memoryless sources
R(D) = R (I ) (D)
as given in (6.3.3).
We first note that, like channel capacity, R(D) cannot in general be explicitly
determined in a closed-form expression, and thus optimization-based algorithms can
be used for its efficient numerical computation [27, 49, 51, 88]. In the following,
we consider simple examples involving the Hamming and squared error distortion
measures where bounds or exact expressions for R(D) can be obtained.
n
ρn (z n , ẑ n ) = z i ⊕ ẑ i ,
i=1
where ⊕ denotes modulo two addition. In such case, ρ(z n , ẑ n ) is exactly the number of
bit errors or changes after compression. Therefore, the distortion bound D becomes
a bound on the average probability of bit error. Specifically, among n compressed
bits, it is expected to have E[ρ(Z n ,
Z n )] bit errors; hence, the expected value of bit
n n
error rate is (1/n)E[ρ(Z , Z )]. The rate–distortion function for binary sources and
Hamming additive distortion measure is given by the next theorem.
Theorem 6.23 Fix a binary DMS {Z n }∞n=1 with marginal distribution PZ (0) = 1 −
PZ (1) = p, where 0 < p < 1. Then the source’s rate–distortion function under the
Hamming additive distortion measure is given by
h b ( p) − h b (D) if 0 ≤ D < min{ p, 1 − p};
R(D) =
0 if D ≥ min{ p, 1 − p},
H (Z |
Z ) = H (Z ⊕
Z |
Z ).
E[ρ(Z ,
Z )] ≤ D implies that Pr{Z ⊕
Z = 1} ≤ D.
I (Z ;
Z ) = H (Z ) − H (Z |
Z)
= h b ( p) − H (Z ⊕
Z |
Z)
≥ h b ( p) − H (Z ⊕ Z ) (conditioning never increases entropy)
≥ h b ( p) − h b (D),
where the last inequality follows since the binary entropy function h b (x) is increasing
for x ≤ 1/2, and Pr{Z ⊕ Z = 1} ≤ D. Since the above derivation is true for any
PZ |Z , we have
R(D) ≥ h b ( p) − h b (D).
PZ (0) PZ (0)
Z (0) + P
1 = P Z (1) = Z |Z (0|0) +
P PZ |Z (1|0)
Z (0|0)
PZ | Z (0|1)
PZ |
p p
= P (0|0) + (1 − P Z |Z (0|0))
1 − D Z |Z D
and
PZ (1) PZ (1)
1 = P
Z (0) + P
Z (1) = Z |Z (0|1) +
P Z |Z (1|1)
P
Z (1|0)
PZ | Z (1|1)
PZ |
1− p 1− p
= (1 − P Z |Z (1|1)) + P (1|1),
D 1 − D Z |Z
which yield
1− D D
Z |Z (0|0) =
P 1−
1 − 2D p
and
240 6 Lossy Data Compression and Transmission
1− D D
Z |Z (1|1) =
P 1− .
1 − 2D 1− p
The above theorem can be extended for nonbinary (finite alphabet) memoryless
sources resulting in a more complicated (but exact) expression for R(D), see [207].
We instead present a simple lower bound on R(D) for a nonbinary DMS under the
Hamming distortion measure; this bound is a special case of the so-called Shannon
lower bound on the rate–distortion function of a DMS [345] (see also [42, 158], [83,
Problem 10.6]).
Theorem 6.24 Fix a DMS {Z n }∞ n=1 with distribution PZ . Then, the source’s rate–
distortion function under the Hamming additive distortion measure and Z = Z
satisfies
H (Z ) is the source entropy and h b (·) is the binary entropy function. Furthermore,
equality holds in the above bound for D ≤ (|Z| − 1) minz∈Z PZ (z).
Proof The proof is left as an exercise. (Hint: use Fano’s inequality and examine the
equality condition.)
Thus, the condition for equality in (6.4.1) always holds and Theorem 6.24 reduces
to Theorem 6.23.
• If the source is uniformly distributed (i.e., PZ (z) = 1/|Z| for all z ∈ Z), then
|Z| − 1
(|Z| − 1) min PZ (z) = = Dmax .
z∈Z |Z|
So for any f
Z |Z satisfying the distortion constraint,
R(D) ≤ I ( f Z , f
Z |Z ).
For 0 < D ≤ σ 2 , choose a dummy Gaussian random variable W with zero mean
and variance a D, where a = 1 − D/σ 2 , and is independent of Z . Let
Z = aZ + W.
Then
E[(Z −
Z )2 ] = E[(1 − a)2 Z 2 ] + E[W 2 ] = (1 − a)2 σ 2 + a D = D
R(D) ≤ I (Z ;
Z)
= h(
Z ) − h(
Z |Z )
= h(
Z ) − h(W + a Z |Z )
= h(
Z ) − h(W |Z ) (by Lemma 5.14)
= h(
Z ) − h(W ) (by the independence of W and Z )
1
= h(
Z ) − log2 (2πe(a D))
2
1 1
≤ log2 (2πe(σ 2 − D)) − log2 (2πe(a D)) (by Theorem 5.20)
2 2
1 σ2
= log2 .
2 D
The achievability of this upper bound by a Gaussian source (with zero mean
and variance σ 2 ) can be proved by showing that under the Gaussian source,
(1/2) log2 (σ 2 /D) is a lower bound to R(D) for 0 < D ≤ σ 2 . Indeed, when the
source is Gaussian and for any f 2
Z |Z such that E[(Z − Z ) ] ≤ D, we have
I (Z ;
Z ) = h(Z ) − h(Z |
Z)
1
= log2 (2πeσ 2 ) − h(Z −
Z |
Z)
2
1
≥ log2 (2πeσ 2 ) − h(Z −
Z) (by Lemma 5.14)
2
1 1
≥ log2 (2πeσ 2 ) − log2 2πe Var[(Z − Z )] (by Theorem 5.20)
2 2
1 1
≥ log2 (2πeσ ) − log2 2πe E[(Z −
2
Z )2 ]
2 2
1 1
≥ log2 (2πeσ ) − log2 (2πeD)
2
2 2
1 σ2
= log2 .
2 D
Theorem 6.27 (Shannon lower bound on the rate–distortion function: squared error
distortion) Consider a continuous memoryless source {Z i } with a pdf of support R
and finite differential entropy under the additive squared error distortion measure.
Then, its rate–distortion function satisfies
1
R(D) ≥ h(Z ) − log2 (2πeD).
2
6.4 Calculation of the Rate–Distortion Function 243
Proof The proof, which follows similar steps as in the achievability of the upper
bound in the proof of the previous theorem, is left as an exercise.
The above two theorems yield that for any continuous memoryless source {Z i }
with zero mean and variance σ 2 , its rate–distortion function under the mean square
error distortion measure satisfies
where
1
R S H (D) := h(Z ) − log2 (2πeD)
2
and
1 σ2
RG (D) := log2 ,
2 D
and with equality holding when the source is Gaussian. Thus, the difference between
the upper and lower bounds on R(D) in (6.4.3) is
1
RG (D) − R S H (D) = −h(Z ) + log2 (2πeσ 2 )
2
= D(Z Z G ) (6.4.4)
Laplacian source is more similar to the Gaussian source than the uniformly distributed
source and hence its rate–distortion function is closer to Gaussian’s rate–distortion
function RG (D) than that of a uniform source. Finally in light of (6.4.4), the bounds
on R(D) in (6.4.3) can be expressed in terms of the Gaussian rate–distortion function
RG (D) and the non-Gaussianness D(Z Z G ), as follows:
Note that (6.4.5) is nothing but the dual of (5.7.4) and is an illustration of the duality
between the rate–distortion and capacity-cost functions.
previous section or [42, Theorems 7.2.4 and 7.2.5]). Note that a zero-mean stationary
Gaussian source {X i } is ergodic if its covariance function K X (τ ) → 0 as τ → ∞.
For such sources, the rate–distortion function R(D) can be determined paramet-
rically; see [42, Theorem 4.5.3]. Furthermore, if the sources are also Markov, then
R(D) admits an explicit analytical expression for small values of D. More specifi-
cally, consider a zero-mean unit-variance stationary Gauss–Markov source {X i } with
covariance function K X (τ ) = a τ , where 0 < a < 1 is the correlation coefficient.
Then,
1 1 − a2 1−a
R(D) = log2 for D ≤ .
2 D 1+a
with equality holding when the source is Laplacian with mean zero and variance 2λ2
1 − |z|
(parameter λ); i.e., its pdf is given by f Z (z) = 2λ e λ , z ∈ R.
6.4 Calculation of the Rate–Distortion Function 245
Proof Since
R(D) = min I (Z ;
Z ),
Z |Z :E[|Z − Z |]≤D
f
R(D) ≤ I (Z ;
Z ) = I ( fZ , f
Z |Z ).
and
D 2
E[|Z −
Z |] = E 1− Z + sgn(Z )|W | − Z
λ
D D
= E 2− Z − sgn(Z )|W |
λ λ
D D
= 2− E[|Z |] − E[|W |]
λ λ
D D D
= 2− λ− 1− D
λ λ λ
= D,
R(D) ≤ I (Z ;
Z)
= h( Z ) − h(
Z |Z )
(a)
= h(
Z ) − h(sgn(Z )|W | |Z )
= h(
Z ) − h(|W | |Z ) − log2 |sgn(Z )|
(b)
= h(
Z ) − h(|W |)
(c)
= h(
Z ) − h(W )
(d)
= h(
Z ) − log2 [2e(1 − D/λ)D]
(e)
≤ log2 [2e(λ − D)] − log2 [2e(1 − D/λ)D]
λ
= log2 ,
D
where (a) follows from the expression of Z and the fact that differential entropy is
invariant under translations (Lemma 5.14), (b) holds since Z and W are independent
of each other, (c) follows from the fact that W is Laplacian and that the Laplacian
is symmetric, and (d) holds since the differential entropy of a zero-mean Laplacian
random variable Z with E[|Z |] = λ is given by
I (Z ;
Z ) = h(Z ) − h(Z |
Z)
= log2 (2eλ) − h(Z −
Z |
Z)
≥ log2 (2eλ) − h(Z −
Z) (by Lemma 5.14)
≥ log2 (2eλ) − log2 (2eD)
λ
= log2 ,
D
where the last inequality follows since
6.4 Calculation of the Rate–Distortion Function 247
Z ) ≤ log2 2eE[|Z −
h(Z − Z |] ≤ log2 (2eD)
where the supremum is taken over all random variables X with a pdf satisfying
E[d(X )] ≤ D. It can be readily seen that this bound encompasses those in Theo-
rems 6.27 and 6.30 as special cases.3
By combining the rate–distortion theorem with the channel coding theorem, the
optimality of separation between lossy source coding and channel coding can be
established and Shannon’s lossy joint source–channel coding theorem (also known as
the lossy information-transmission theorem) can be shown for the communication of
a source over a noisy channel and its reconstruction within a distortion threshold at the
receiver. These results can be viewed as the “lossy” counterparts of the lossless joint
source–channel coding theorem and the separation principle discussed in Sect. 4.6.
Definition 6.32 (Lossy source–channel block code) Given a discrete-time source
∞
{Z i }i=1 and a discrete-time channel
with alphabet Z and reproduction alphabet Z
with input and output alphabets X and Y, respectively, an m-to-n lossy source–
channel block code with rate mn source symbol/channel symbol is a pair of mappings
( f (sc) , g (sc) ), where4
f (sc) : Z m → X n and m .
g (sc) : Y n → Z
Encoder Xn Yn Decoder
Zm ∈ Zm Channel Ẑ m ∈ Ẑ m
f (sc) g (sc)
The code’s operation is illustrated in Fig. 6.3. The source m-tuple Z m is encoded via
the encoding function f (sc) , yielding the codeword X n = f (sc) (Z m ) as the channel
input. The channel output Y n , which is dependent on Z m only via X n (i.e., we have
the Markov chain Z m → X n → Y n ), is decoded via g (sc) to obtain the source tuple
estimate Ẑ m = g (sc) (Y n ). /m
Given an additive distortion measure ρm = i=1 ρ(z i , ẑ i ), where ρ is a distor-
tion function on Z × Z, we say that the m-to-n lossy source–channel block code
( f (sc) , g (sc) ) satisfies the average distortion fidelity criterion D, where D ≥ 0, if
1
E[ρm (Z m ,
Z m )] ≤ D.
m
• Converse part: On the other hand, for any sequence of m-to-n m lossy source–
channel codes ( f (sc) , g (sc) ) satisfying the average distortion fidelity criterion D,
we have
m
· R(D) ≤ C.
nm
5 Note that Z and Z can also be continuous alphabets with an unbounded distortion function.
In this case, the theorem still holds under appropriate conditions (e.g., [42, Problem 7.5], [135,
Theorem 9.6.3]) that can accommodate, for example, the important class of Gaussian sources under
the squared error distortion function (e.g., [135, p. 479]).
6 The channel can have either finite or continuous alphabets. For example, it can be the memoryless
Gaussian (i.e., AWGN) channel with input power P; in this case, C = C(P).
6.5 Lossy Joint Source–Channel Coding Theorem 249
Proof The proof uses both the channel coding theorem (i.e., Theorem 4.11) and the
rate–distortion theorem (i.e., Theorem 6.21) and follows similar arguments as the
proof of the lossless joint source–channel coding theorem presented in Sect. 4.6. We
leave the proof as an exercise.
Observation 6.34 (Lossy joint source–channel coding theorem with signaling rates)
The above theorem also admits another form when the source and channel are
described in terms of “signaling rates” (e.g., [51]). More specifically, let Ts and
Tc represent the durations (in seconds) per source letter and per channel input sym-
bol, respectively.7 In this case, TTcs represents the source–channel transmission rate
measured in source symbols per channel use (or input symbol). Thus, again assum-
ing that both R(D) and C are measured in the same units, the theorem becomes as
follows:
• The source can be reproduced at the output of the channel with distortion less
than D (i.e., there exist lossy source–channel codes asymptotically satisfying the
average distortion fidelity criterion D) if
Tc
· R(D) < C.
Ts
• Conversely, for any lossy source–channel codes satisfying the average distortion
fidelity criterion D, we have
Tc
· R(D) ≤ C.
Ts
We close this chapter by applying Theorem 6.33 to a few useful examples of com-
munication systems. Specifically, we obtain a bound on the end-to-end distortion of
any communication system using the fact that if a source with rate–distortion func-
tion R(D) can be transmitted over a channel with capacity C via a source–channel
block code of rate Rsc > 0 (in source symbols/channel use) and reproduced at the
destination with an average distortion no larger than D, then we must have that
Rsc · R(D) ≤ C
or equivalently,
1
R(D) ≤ C. (6.6.1)
Rsc
7 In
other words, the source emits symbols at a rate of 1/Ts source symbols per second and the
channel accepts inputs at a rate of 1/Tc channel symbols per second.
250 6 Lossy Data Compression and Transmission
Solving for the smallest D, say D SL , satisfying (6.6.1) with equality8 yields a lower
bound, called the Shannon limit,9 on the distortion of all realizable lossy source–
channel codes for the system with rate Rsc .
In the following examples, we calculate the Shannon limit for some source–
channel configurations. The Shannon limit is not necessarily achievable in general,
although this is the case in the first two examples.
Example 6.35 (Shannon limit for a binary uniform DMS over a BSC)10 Let
Z = Ẑ = {0, 1} and consider a binary uniformly distributed DMS {Z i } (i.e., a
Bernoulli(1/2) source) using the additive Hamming distortion measure. Note that in
this case, E[ρ(Z ,
Z )] = P(Z = Z ) := Pb ; in other words, the expected distortion
is nothing but the source’s bit error probability Pb . We desire to transmit the source
over a BSC with crossover probability
< 1/2.
From Theorem 6.23, we know that for 0 ≤ D ≤ 1/2, the source’s rate–distortion
function is given by
R(D) = 1 − h b (D),
where h b (·) is the binary entropy function. Also from (4.5.5), the channel’s capacity
is given by
C = 1 − h b (
).
The Shannon limit D SL for this system with source–channel transmission rate Rsc
is determined by solving (6.6.1) with equality:
1
1 − h b (D SL ) = [1 − h b (
)].
Rsc
is the inverse of the binary entropy function on the interval [0, 1/2]. Thus, D SL given
in (6.6.2) gives a lower bound on the bit error probability Pb of any rate-Rsc source–
channel code used for this system. In particular, if Rsc = 1 source symbol/channel
8 If the strict inequality R(D) < R1sc C always holds, then in this case, the Shannon limit is
& '
D S L = Dmin := E minẑ∈Ẑ ρ(Z , ẑ) .
9 Other similar quantities used in the literature are the optimal performance theoretically achievable
(OPTA) [42] and the limit of the minimum transmission ratio (LMTR) [87].
10 This example appears in various sources including [205, Sect. 11.8], [87, Problem 2.2.16], and
D SL =
. (6.6.4)
where 0 ∞
1 t2
Q(x) = √ e− 2 dt
2π x
Thus, in light of (6.6.2), the Shannon limit for sending a uniform binary source
over an AWGN channel used with antipodal modulation and hard-decision decoding
satisfies the following in terms of the SNR per source bit γb :
* *1 ++
Rsc (1 − h b (D SL )) = 1 − h b Q 2Rsc γb (6.6.7)
for D SL ≤ 1/2. In Table 6.1, we use (6.6.7) to present the optimal (minimal) values
of γb (in dB) for a given target value of D SL and a given source–channel code rate
Rsc < 1. The table indicates, for example, that if we desire to achieve an end-to-end
bit error probability of no larger than 10−5 at a rate of 1/2, then the system’s SNR per
source bit can be no smaller than 1.772 dB. The Shannon limit values can similarly
be computed for rates Rsc > 1.
11 Source–channel systems with rate Rsc = 1 are typically referred to as systems with matched
source and channel bandwidths (or signaling rates). Also, when Rsc < 1 (resp., > 1), the system is
said to have bandwidth compression (resp., bandwidth expansion); e.g., cf. [274, 314, 358].
252 6 Lossy Data Compression and Transmission
Table 6.1 Shannon limit values γb = E b /N0 (dB) for sending a binary uniform source over a
BPSK-modulated AWGN used with hard-decision decoding
Rate Rsc DS L = 0 D S L = 10−5 D S L = 10−4 D S L = 10−3 D S L = 10−3
1/3 1.212 1.210 1.202 1.150 0.077
1/2 1.775 1.772 1.763 1.703 1.258
2/3 2.516 2.513 2.503 2.423 1.882
4/5 3.369 3.367 3.354 3.250 2.547
Optimality of uncoded communication: Note that for Rsc = 1, the Shannon limit
in (6.6.4) can surprisingly be achieved by a simple uncoded scheme12 : just directly
transmit the source over the channel (i.e., set the blocklength m = 1 and the channel
input X i = Z i for any time instant i = 1, 2, . . .) and declare the channel output as
the reproduced source symbol (i.e., set Z i = Yi for any i).13
In this case, the expected distortion (i.e., bit error probability) of this uncoded
rate-one source–channel scheme is indeed given as follows:
E[ρ(Z ,
Z )] = Pb
= P(X = Y )
= P(Y = X |X = 1)(1/2) + P(Y = X |X = 0)(1/2)
=
= D SL .
We conclude that this rate-one uncoded source–channel scheme achieves the Shan-
non limit and is hence optimal. Furthermore, this scheme, which has no encod-
ing/decoding delay and no complexity, is clearly more desirable than using a separate
source–channel coding scheme,14 which would impose large encoding and decoding
delays and would demand significant computational/storage resources.
Note that for rates Rsc = 1 and/or nonuniform sources, the uncoded scheme is not
optimal and hence more complicated joint or separate source–channel codes would
be required to yield a bit error probability arbitrarily close to (but strictly larger than)
the Shannon limit D SL . Finally, we refer the reader to [140], where necessary and
sufficient conditions are established for source–channel pairs under which uncoded
schemes are optimal.
Observation 6.36 The following two systems are extensions of the system consid-
ered in the above example.
Still the separate coding scheme will consist of a near-capacity achieving channel code.
6.6 Shannon Limit of Communication Systems 253
• Binary nonuniform DMS over a BSC: The system is identical to that of Exam-
ple 6.35 with the exception that the binary DMS is nonuniformly distributed with
P(Z = 0) = p. Using the expression of R(D) from Theorem 6.23, it can be
readily shown that this system’s Shannon limit is given by
1 − h b (
)
D SL = h −1
b h b ( p) − (6.6.8)
Rsc
for D SL ≤ min{ p, 1 − p}, where h −1 b (·) is the inverse of the binary entropy
function on the interval [0, 1/2] defined in (6.6.3). Setting p = 1/2 in (6.6.8)
directly results in the Shannon limit given in (6.6.2), as expected.
• Nonbinary uniform DMS over a nonbinary symmetric channel: Given integer
q ≥ 2, consider a q-ary uniformly distributed DMS with identical alphabet and
reproduction alphabet Z = Z = {0, 1, . . . , q − 1} using the additive Hamming
distortion measure and the q-ary symmetric DMC (with q-ary input and output
alphabets and symbol error rate
) described in (4.2.11). Thus using the expressions
for the source’s rate–distortion function in (6.4.2) and the channel’s capacity in
Example 4.19, we obtain that the Shannon limit of the system using rate-Rsc
source–channel codes satisfies
1 & '
log2 (q) − D SL log2 (q − 1) − h b (D SL ) = log2 (q) −
log2 (q − 1) − h b (
)
Rsc
(6.6.9)
for D SL ≤ q−1
q
. Setting q = 2 renders the source a Bernoulli(1/2) source and the
channel a BSC with crossover probability
, thus reducing (6.6.9) to (6.6.2).
Example 6.37 (Shannon limit for a memoryless Gaussian source over an AWGN
channel [147]) Let Z = Z = R and consider a memoryless Gaussian source {Z i } of
mean zero and variance σ 2 and the squared error distortion function. The objective
is to transmit the source over an AWGN channel with input power constraint P and
noise variance σ 2N and recover it with distortion fidelity no larger than D, for a given
threshold D > 0.
By Theorem 6.26, the source’s rate–distortion function is given by
1 σ2
R(D) = log2 for 0 < D < σ2 .
2 D
Furthermore, the capacity (or capacity-cost function) of the AWGN channel is given
in (5.4.13) as
1 P
C(P) = log2 1 + 2 .
2 σN
The Shannon limit D SL for this system with rate Rsc is obtained by solving
C(P)
R(D SL ) =
Rsc
254 6 Lossy Data Compression and Transmission
or equivalently,
1 σ2 1 P
log2 = log2 1 + 2
2 D SL 2Rsc σN
σ2
D SL = * + 1 . (6.6.10)
P Rsc
1 + σ2
N
In particular, for a system with rate Rsc = 1, the Shannon limit in (6.6.10) becomes
σ 2 σ 2N
D SL = . (6.6.11)
P + σ 2N
and is sent over the AWGN channel. At the receiver, the corresponding channel
output Yi = X i + Ni , where Ni is the additive Gaussian noise (which is independent
of Z i ), is decoded via a scalar (MMSE) detector to yield the following reconstructed
source symbol Zi √
(sc) Pσ 2
Z i = g (Yi ) = Yi .
P + σ 2N
σ 2 σ 2N
=
P + σ 2N
= D SL ,
Example 6.38 (Shannon limit for a memoryless Gaussian source over a fading chan-
nel) Consider the same system as the above example except that now the channel
is a memoryless fading channel as described in Observation 5.35 with input power
constraint P and noise variance σ 2N . We determine the Shannon limit of this system
with rate Rsc for two cases: (1) the fading coefficients are known at the receiver, and
(2) the fading coefficients are known at both the receiver and the transmitter.
1. Shannon limit with decoder side information (DSI): Using (5.4.17) for the chan-
nel capacity with DSI, we obtain that the Shannon limit with DSI is given by
(DS I ) σ2
D SL = ⎧
⎡
1 ⎬
⎤⎫ (6.6.12)
⎨ Rsc
⎣ A2 P
E log2 1+ 2 ⎦
⎩ A σN ⎭
2
(DS I )
for 0 < D SL < σ 2 . Making the fading process deterministic by setting A = 1
(almost surely) reduces (6.6.12) to (6.6.10), as expected.
2. Shannon limit with full side information (FSI): Similarly, using (5.4.19) for the
fading channel capacity with FSI, we obtain the following Shannon limit:
(F S I ) σ2
D SL = ⎧ ⎡
1 ⎬
⎤⎫ (6.6.13)
⎨ 2 ∗ Rsc
E A ⎣log2 1+ A p2 (A) ⎦
⎩ σN ⎭
2
(F S I )
for 0 < D SL < σ 2 , where p ∗ (·) in (6.6.13) is given by
∗ 1 σ2
p (a) = max 0, − 2
λ a
Example 6.39 (Shannon limit for a binary uniform DMS over a binary-input AWGN
channel) Consider the same binary uniform source as in Example 6.35 under the
Hamming distortion measure to be sent via a source–channel code over a binary-input
AWGN channel used with antipodal (BPSK) signaling of power P and noise variance
σ 2N = N0 /2. Again, here the expected distortion is nothing but the source’s bit error
probability Pb . The source’s rate–distortion function is given by Theorem 6.23 as
presented in Example 6.35.
However, the channel
√ capacity
√ C(P) of the AWGN whose input takes on two
possible values + P or − P, whose output is real-valued (unquantized), and
whose noise variance is σ 2N = N20 , is given by evaluating the mutual information
√
√ the channel input and output under the input distribution PX (+ P) =
between
PX (− P) = 1/2 (e.g., see [63]):
256 6 Lossy Data Compression and Transmission
0 - ,
∞
P 1 −y 2 /2 P P
C(P) = 2 log2 (e) − √ e log2 cosh +y dy
σN 2π −∞ σN
2
σ 2N
0 ∞ , -
Rsc E b 1 −y 2 /2 Rsc E b Rsc E b
= log2 (e) − √ e log2 cosh +y dy
N0 /2 2π −∞ N0 /2 N0 /2
0 ∞ 1
1
e−y /2 log2 [cosh(2Rsc γb + y 2Rsc γb )]dy,
2
= 2Rsc γb log2 (e) − √
2π −∞
where P = Rsc E b is the channel signal power, E b is the average energy per source
bit, Rsc is the rate in source bit/channel use of the system’s source–channel code,
and γb = E b /N0 is the SNR per source bit. The system’s Shannon limit satisfies
&
Rsc (1 − h b (D SL )) = 2Rsc γb log2 (e)
0 ∞
1 1
e−y /2 log2 [cosh(2Rsc γb + y 2Rsc γb )]dy ,
2
−√
2π −∞
or equivalently,
0 ∞ 1
1
e−y /2
2
h b (D SL ) = 1 − 2γb log2 (e) + √ log2 [cosh(2Rsc γb + y 2Rsc γb )]dy
Rsc 2π −∞
(6.6.14)
for D SL ≤ 1/2. In Fig. 6.4, we use (6.6.14) to plot the Shannon limit versus γb (in
dB) for codes with rates 1/2 and 1/3. We also provide in Table 6.2 the optimal values
of γb for target values of D SL and Rsc .
Shannon Limit
1
Rsc = 1/2
Rsc = 1/3
−1
10
10−2
DSL
10−3
10−4
10−5
10−6
−6 −5 −4 −3 −2 −1 −.496 .186 1
γb (dB)
Fig. 6.4 The Shannon limit for sending a binary uniform source over a BPSK-modulated AWGN
channel with unquantized output; rates Rsc = 1/2 and 1/3
6.6 Shannon Limit of Communication Systems 257
Table 6.2 Shannon limit values γb = E b /N0 (dB) for sending a binary uniform source over a
BPSK-modulated AWGN with unquantized output
Rate Rsc DS L = 0 D S L = 10−5 D S L = 10−4 D S L = 10−3 D S L = 10−3
1/3 −0.496 −0.496 −0.504 −0.559 −0.960
1/2 0.186 0.186 0.177 0.111 −0.357
2/3 1.060 1.057 1.047 0.963 0.382
4/5 2.040 2.038 2.023 1.909 1.152
The Shannon limits calculated above are pertinent due to the invention of near-
capacity achieving channel codes, such as turbo [44, 45] or LDPC [133, 134, 251,
252] codes. For example, the rate-1/2 turbo coding system proposed in [44, 45] can
approach a bit error rate of 10−5 at γb = 0.9 dB, which is only 0.714 dB away from
the Shannon limit of 0.186 dB. This implies that a near-optimal channel code has
been constructed, since in principle, no codes can perform better than the Shannon
limit. Source–channel turbo codes for sending nonuniform memoryless and Markov
binary sources over the BPSK-modulated AWGN channel are studied in [426–428].
Example 6.40 (Shannon limit for a binary uniform DMS over a binary-input Rayleigh
fading channel) Consider the same system as the one in the above example, except
that the channel is a unit-power BPSK-modulated Rayleigh fading channel (with
unquantized output). The channel is described by (5.4.16), where the input can take
on one of the two values, −1 or +1 (i.e., its input power is P = 1 = Rsc E b ), the
noise variance is σ 2N = N0 /2, and the fading distribution is Rayleigh:
f A (a) = 2ae−a ,
2
a > 0.
Assume also that the receiver knows the fading amplitude (i.e., the case of decoder
side information). Then, the channel capacity is given by evaluating I (X ; Y |A) under
the uniform input distribution PX (−1) = PX (+1) = 1/2, yielding the following
expression in terms of the SNR per source bit γb = E b /N0 :
2 0 +∞ 0 +∞
Rsc γb
f A (a) e−Rsc γb (y+a) log2 1 + e4Rsc γb ya dy da.
2
C DS I (γb ) = 1−
π 0 −∞
Now, setting Rsc R(D SL ) = C DS I (γb ) implies that the Shannon limit satisfies
2 0 +∞ 0 +∞
1 γb
h b (D SL ) = 1 − + f A (a)
Rsc Rsc π 0 −∞
× e−Rsc γb (y+a) log2 1 + e4Rsc γb ya dy da
2
(6.6.15)
for D SL ≤ 1/2. In Table 6.3, we present some Shannon limit values calculated from
(6.6.15).
258 6 Lossy Data Compression and Transmission
Table 6.3 Shannon limit values γb = E b /N0 (dB) for sending a binary uniform source over a
BPSK-modulated Rayleigh fading channel with decoder side information
Rate Rsc DS L = 0 D S L = 10−5 D S L = 10−4 D S L = 10−3 D S L = 10−3
1/3 0.489 0.487 0.479 0.412 −0.066
1/2 1.830 1.829 1.817 1.729 1.107
2/3 3.667 3.664 3.647 3.516 2.627
4/5 5.936 5.932 5.904 5.690 4.331
Example 6.41 (Shannon limit for a binary uniform DMS over an AWGN channel)
As in the above example, we consider a memoryless binary uniform source but we
assume that the channel is an AWGN channel (with real inputs and outputs) with
power constraint P and noise variance σ 2N = N0 /2. Recalling that the channel
capacity is given by
1 P
C(P) = log2 1 + 2
2 σN
1
= log2 (1 + 2Rsc γb ) ,
2
we obtain that the system’s Shannon limit satisfies
1
h b (D SL ) = 1 − log2 (1 + 2Rsc γb )
2Rsc
for D SL ≤ 1/2. In Fig. 6.5, we plot the above Shannon limit versus γb for systems
with Rsc = 1/2 and 1/3.
Other examples of determining the Shannon limit for sending sources with mem-
ory over memoryless channels, such as discrete Markov sources under the Hamming
distortion function15 or Gauss–Markov sources under the squared error distortion
measure (e.g., see [98]) can be similarly considered. Finally, we refer the reader to
the end of Sect. 4.6 for a discussion of relevant works on lossy joint source–channel
coding.
Problems
15 For example, if the Markov source is binary symmetric, then its rate–distortion function is given
by (6.3.9) for D ≤ Dc and the Shannon limit for sending this source over say a BSC or an AWGN
channel can be calculated. If the distortion region D > Dc is of interest, then (6.3.8) or the right
side of (6.3.9) can be used as lower bounds on R(D); in this case, a lower bound on the Shannon
limit can be obtained.
6.6 Shannon Limit of Communication Systems 259
Shannon Limit
1
Rsc = 1/2
Rsc = 1/3
−1
10
10−2
DSL
10−3
10−4
10−5
10−6
−6 −5 −4 −3 −2 −1 −.55 −.001 1
γb (dB)
Fig. 6.5 The Shannon limits for sending a binary uniform source over a continuous-input AWGN
channel; rates Rsc = 1/2 and 1/3
⎧
⎨ 0 if ẑ = z
ρ(z, ẑ) = 1 if z = 1 and ẑ = 0
⎩
∞ if z = 0 and ẑ = 1.
(a) Determine the source’s rate–distortion function R(D) (in your calculations,
use the convention that 0 · ∞ = 0).
(b) Specialize R(D) to the case of p = 1/2 (uniform source).
3. Binary uniform source with erasure and infinite distortion: Consider a uniformly
distributed DMS {Z i } with alphabet Z = {0, 1} and reproduction alphabet Z =
{0, 1, E}, where E represents an erasure. Let the source’s distortion function be
given as follows:
⎧
⎨ 0 if ẑ = z
ρ(z, ẑ) = 1 if ẑ = E
⎩
∞ otherwise.
R(D) = R(D − c̄)
/
for D ≥ c̄, where c̄ = z∈Z PZ (z)cz .
Note: This result was originally shown by Pinkston [302].
7. Scaled distortion: Consider a DMS {Z i } with alphabet Z, reproduction alphabet
Ẑ, distortion function ρ(·, ·), and rate–distortion function R(D). Let ρ̂(·, ·) be a
new distortion function obtained by scaling ρ(·, ·) via a positive constant a:
where R(D) is the rate–distortion function of source { =
Z i } with alphabet Z
Z \ {z 1 } and distribution
PZ (z)
PZ̃ (z) = , z ∈ Z,
1 − PZ (z 1 )
and with the same reproduction alphabet Z and distortion function ρ(·, ·).
Note: This result first appeared in [302].
9. Consider a DMS {Z i } with quaternary source and reproduction alphabets Z =
Ẑ = {0, 1, 2, 3}, probability distribution vector
6.6 Shannon Limit of Communication Systems 261
for fixed 0 < p < 1, and distortion measure given by the following matrix:
⎡ ⎤
0 ∞ 1 ∞
⎢∞ 0 1 ∞⎥
[ρ(z, ẑ)] = [ρz ẑ ] = ⎢
⎣ 0 0
⎥.
0 0 ⎦
∞∞ 1 0
measure. Consider also the q-ary symmetric DMC described in (4.2.11) with
q-ary input and output alphabets and symbol error rate
≤ q−1 q
.
Determine whether or not an uncoded source–channel transmission scheme of
rate Rsc = 1 source symbol/channel use (i.e., a source–channel code whose
encoder and decoder are both given by the identity function) is optimal for this
communication system.
18. Shannon limit of the erasure source–channel system: Given integer q ≥ 2, con-
sider the q-ary uniform DMS together with the distortion measure of Problem 6.5
above and the q-ary erasure channel described in (4.2.12), see also Problem 4.13.
(a) Find the system’s Shannon limit under a transmission rate of Rsc source
symbol/channel use.
(b) Describe an uncoded source–channel transmission scheme for the system
with rate Rsc = 1 and assess its optimality.
19. Shannon limit for a Laplacian source over an AWGN channel: Determine the
Shannon limit under the absolute error distortion criterion and a transmission
rate of Rsc source symbols/channel use for a communication system consisting
of a memoryless zero-mean Laplacian source with parameter λ and an AWGN
channel with input power constraint P and noise variance σ 2N .
20. Shannon limit for a nonuniform DMS over different channels: Find the Shannon
limit under the Hamming distortion criterion for each of the systems of Exam-
ples 6.39–6.41, where the source is a binary nonuniform DMS with PZ (0) = p,
where 0 ≤ p ≤ 1.
Appendix A
Overview on Suprema and Limits
We herein review basic results on suprema and limits which are useful for the devel-
opment of information theoretic coding theorems; they can be found in standard real
analysis texts (e.g., see [262, 398]).
Definition A.1 (Upper bound of a set) A real number u is called an upper bound of
a non-empty subset A of R if every element of A is less than or equal to u; we say
that A is bounded above. Symbolically, the definition becomes
the set (0, 1) can be used for (ii). In both examples, the supremum is equal to 1;
however, in the former case, the supremum belongs to the set, while in the latter case
it does not. When a set contains its supremum, we call the supremum the maximum
of the set.
Definition A.3 (Maximum) If sup A ∈ A, then sup A is also called the maximum of
A and is denoted by max A. However, if sup A ∈
/ A, then we say that the maximum
of A does not exist.
and
(∀ a ∈ A) a ≤ α, (A.1.2)
where (A.1.1) and (A.1.2) are called the achievability (or forward) part and the
converse part, respectively, of the theorem. Specifically, (A.1.2) states that α is an
upper bound of A, and (A.1.1) states that no number less than α can be an upper
bound for A.
From the above property, in order to obtain α = max A, one needs to show that
α satisfies both
(∀ a ∈ A) a ≤ α and α ∈ A.
Appendix A: Overview on Suprema and Limits 265
The concepts of infimum and minimum are dual to those of supremum and maximum.
Definition A.7 (Lower bound of a set) A real number is called a lower bound of a
non-empty subset A in R if every element of A is greater than or equal to ; we say
that A is bounded below. Symbolically, the definition becomes
Definition A.9 (Minimum) If inf A ∈ A, then inf A is also called the minimum of
A and is denoted by min A. However, if inf A ∈
/ A, we say that the minimum of A
does not exist.
and
(∀ a ∈ A) a ≥ α. (A.2.2)
266 Appendix A: Overview on Suprema and Limits
Here, (A.2.1) is called the achievability or forward part of the coding theorem; it
specifies that no number greater than α can be a lower bound for A. Also, (A.2.2) is
called the converse part of the theorem; it states that α is a lower bound of A.
Lemma A.15 (Monotone property) Suppose that A and B are non-empty subsets of
R such that A ⊂ B. Then
1. sup A ≤ sup B.
2. inf A ≥ inf B.
Lemma A.16 (Supremum for set operations) Define the “addition” of two sets A
and B as
Property 1 does not hold for the “product” of two sets, where the “product” of
sets A and B is defined as
and
sup{x ∈ R : f (x) ≤ ε} = inf{x ∈ R : f (x) > ε}.
and
sup{x ∈ R : f (x) ≥ ε} = inf{x ∈ R : f (x) < ε}.
f : N → R.
{a1 , a2 , a3 , . . . , an , . . .} or {an }∞
n=1 .
One important question that arises with a sequence is what happens when n gets
large. To be precise, we want to know that when n is large enough, whether or not
every an is close to some fixed number L (which is the limit of an ).
268 Appendix A: Overview on Suprema and Limits
f (x)
f (x)
Note that in the above definition, −∞ and ∞ cannot be a legitimate limit for any
sequence. In fact, if (∀ L)(∃ N ) such that (∀ n > N ) an > L, then we say that an
Appendix A: Overview on Suprema and Limits 269
As stated above, the limit of a sequence may not exist. For example, an = (−1)n .
Then, an will be close to either −1 or 1 for n large. Hence, more generalized defini-
tions that can describe the general limiting behavior of a sequence is required.
Some also use the notations lim and lim to denote limsup and liminf, respectively.
Note that the limit supremum and the limit infimum of a sequence are always
defined in R ∪ {−∞, ∞}, since the sequences supk≥n ak = sup{ak : k ≥ n} and
inf k≥n ak = inf{ak : k ≥ n} are monotone in n (cf. Lemma A.20). An immediate
result follows from the definitions of limsup and liminf.
Some properties regarding the limsup and liminf of sequences (which are parallel
to Properties A.4 and A.10) are listed below.
3. If | lim supm→∞ am | < ∞, then (∀ ε > 0 and integer K )(∃ N > K ) such that
a N > lim supm→∞ am − ε. (Note that this holds only for one N , which is larger
than K .)
The last two items in Properties A.23 and A.24 can be stated using the terminology
of sufficiently large and infinitely often, which is often adopted in information theory.
Definition A.25 (Sufficiently large) We say that a property holds for a sequence
{an }∞
n=1 almost always or for all sufficiently large n if the property holds for every
n > N for some N .
Definition A.26 (Infinitely often) We say that a property holds for a sequence {an }∞
n=1
infinitely often or for infinitely many n if for every K , the property holds for one
(specific) N with N > K .
and
an > lim sup am − ε for infinitely many n.
m→∞
Similarly, Properties 2 and 3 of Property A.24 becomes: if | lim inf m→∞ am | < ∞,
then (∀ ε > 0)
an < lim inf am + ε for infinitely many n
m→∞
and
an > lim inf am − ε for all sufficiently large n.
m→∞
Lemma A.27
1. lim inf n→∞ an ≤ lim supn→∞ an .
2. If an ≤ bn for all sufficiently large n, then
≤ lim sup(an + bn )
n→∞
≤ lim sup an + lim sup bn .
n→∞ n→∞
and
lim sup(an + bn ) = lim an + lim sup bn .
n→∞ n→∞ n→∞
Finally, one can also interpret the limit supremum and limit infimum in terms
of the concept of clustering points. A clustering point is a point that a sequence
{an }∞
n=1 approaches (i.e., belonging to a ball with arbitrarily small radius and that
point as center) infinitely many times. For example, if an = sin(nπ/2), then
{an }∞
n=1 = {1, 0, −1, 0, 1, 0, −1, 0, . . .}. Hence, there are three clustering points
in this sequence, which are −1, 0 and 1. Then, the limit supremum of the sequence
is nothing but its largest clustering point, and its limit infimum is exactly its smallest
clustering point. Specifically, lim supn→∞ an = 1 and lim inf n→∞ an = −1. This
approach can sometimes be useful to determine the limsup and liminf quantities.
A.5 Equivalence
We close this appendix by providing some equivalent statements that are often used
to simplify proofs. For example, instead of directly showing that quantity x is less
than or equal to quantity y, one can take an arbitrary constant ε > 0 and prove that
x < y + ε. Since y + ε is a larger quantity than y, in some cases it might be easier to
show x < y + ε than proving x ≤ y. By the next theorem, any proof that concludes
that “x < y + ε for all ε > 0” immediately gives the desired result of x ≤ y.
Theorem A.28 For any x, y and a in R,
1. x < y + ε for all ε > 0 iff x ≤ y;
2. x < y − ε for some ε > 0 iff x < y;
3. x > y − ε for all ε > 0 iff x ≥ y;
4. x > y + ε for some ε > 0 iff x > y;
5. |a| < ε for all ε > 0 iff a = 0.
Appendix B
Overview in Probability and Random Processes
It directly follows that the empty set ∅ is also an element of F (as c = ∅) and
that F is closed under countable intersection since
∞
∞ c
Ai =
c
Ai .
i=1 i=1
The largest σ-field of subsets of a given set is the collection of all subsets of
(i.e., its powerset), while the smallest σ-field is given by {, ∅}. Also, if A is a proper
(strict) non-empty subset of , then the smallest σ-field containing A is given by
{, ∅, A, Ac }.
Definition B.2 (Probability space) A probability space is a triple (, F, P), where
is a given set called sample space containing all possible outcomes (usually
observed from an experiment), F is a σ-field of subsets of and P is a proba-
bility measure P : F → [0, 1] on the σ-field satisfying the following:
1. 0 ≤ P(A) ≤ 1 for all A ∈ F.
2. P() = 1.
3. Countable additivity: If A1 , A2 , . . . is a sequence of disjoint sets (i.e., Ai ∩ A j = ∅
for all i = j) in F, then
∞ ∞
P Ak = P(Ak ).
k=1 k=1
It directly follows from Properties 1–3 of the above definition that P(∅) = 0. Usually,
the σ-field F is called the event space and its elements (which are subsets of
satisfying the properties of Definition B.1) are called events.
Note that the quantities PX (B), B ∈ B(R), fully characterize the random variable
X as they determine the probabilities of all events that concern X .
1 One may question why bother defining random variables based on some abstract probability
space. One may continue that “a random variable X can simply be defined based on its probability
distribution,” which is indeed true (cf. Observation B.3). A perhaps easier way to understand the
abstract definition of a random variable is that the underlying probability space (, F , P) on which
it is defined is what truly occurs internally, but it is possibly non-observable. In order to infer which
of the non-observable ω occurs, an experiment is performed resulting in an observable x that is a
function of ω. Such experiment yields the random variable X whose probability is defined over the
probability space (, F , P).
Appendix B: Overview in Probability and Random Processes 275
where B(Rn ) is the Borel σ-field of Rn ; i.e., the smallest σ-field of subsets of Rn
containing all open sets in Rn .
The joint cdf FX n of X n is the function from Rn to [0, 1] given by
FX n (x n ) = PX ((−∞, xi ), i = 1, . . . , n) = P (ω ∈ : X i (w) ≤ xi , i = 1, . . . , n)
for x n = (x1 , . . . , xn ) ∈ Rn .
A random process (or random source) is a collection of random variables that
arise from the same probability space. It can be mathematically represented by the
collection
{X t , t ∈ I },
where X t denotes the tth random variable in the process, and the index t runs over
an index set I which is arbitrary. The index set I can be uncountably infinite (e.g.,
I = R), in which case we are dealing with a continuous-time process. We will,
however, exclude such a case in this appendix for the sake of simplicity.
In this text, we focus mostly on discrete-time sources; i.e., sources with the count-
able index set I = {1, 2, . . .}. Each such source is denoted by
X := {X n }∞
n=1 = {X 1 , X 2 , . . .},
as an infinite sequence of random variables, where all the random variables take on
values from a common generic alphabet X ⊆ R. The elements in X are usually
276 Appendix B: Overview in Probability and Random Processes
called letters (or symbols or values). When X is a finite set, the letters of X can be
conveniently expressed via the elements of any appropriately chosen finite set (i.e.,
the letters of X need not be real numbers).2
The source X is completely characterized by the sequence of joint cdf’s {FX n }∞
n=1 .
When the alphabet X is finite, the source can be equivalently described by the
sequence of joint probability mass functions (pmf’s):
PX n (a n ) = Pr[X 1 = a1 , X 2 = a2 , . . . , X n = an ]
TE ⊆ E,
where
yields
2 More formally, the definition of a random variable X can be generalized by allowing it to take
values that are not real numbers: a random variable over the probability space (, F , P) is a function
X : → X satisfying the property that for every F ∈ F X ,
X −1 (F) := {w ∈ : X (w) ∈ F} ∈ F ,
where the alphabet X is a general set and F X is a σ-field of subsets of X [159, 349]. Note that this
definition allows X to be an arbitrary set (including being an arbitrary finite set). Furthermore, if
we set X = R, then we revert to the earlier (standard) definition of a random variable.
Appendix B: Overview in Probability and Random Processes 277
Thus, if an element say (1, 0, 0, 1, 0, 0, . . .) is in a T-invariant set E, then all its left-
shift counterparts (i.e., (0, 0, 1, 0, 0, 1 . . .) and (0, 1, 0, 0, 1, 0, . . .)) should be con-
tained in E. As a result, for a T-invariant set E, an element and all its left-shift coun-
terparts are either all in E or all outside E, but cannot be partially inside E. Hence, a
“T-invariant group” such as one containing (1, 0, 0, 1, 0, 0, . . .), (0, 0, 1, 0, 0, 1 . . .)
and (0, 1, 0, 0, 1, 0, . . .) should be treated as an indecomposable group in T-invariant
sets.
Although we are in particular interested in these “T-invariant indecomposable
groups” (especially when defining an ergodic random process), it is possible that
some single “transient” element, such as (0, 0, 1, 1, . . .) in (B.3.1), is included in
a T-invariant set, and will be excluded after applying left-shift operation T. This,
however, can be resolved by introducing the inverse operation T−1 . Note that T is
a many-to-one mapping, so its inverse operation does not exist in general. Similar
to taking the closure of an open set, the definition adopted below [349, p. 3] allows
us to “enlarge” the T-invariant set such that all right-shift counterparts of the single
“transient” element are included
T−1 E := x ∈ X ∞ : Tx ∈ E .
T−1 E = E, (B.3.2)
then4
TE = T(T−1 E) = E,
· · · = T−2 E = T−1 E = E = TE = T2 E = · · · .
The sets that satisfy (B.3.2) are sometimes referred to as ergodic sets because as
time goes by (the left-shift operator T can be regarded as a shift to a future time), the
set always stays in the state that it has been before. A quick example of an ergodic
set for X = {0, 1} is one that consists of all binary sequences that contain finitely
many 0’s.5
We now classify several useful statistical properties of random process X =
{X 1 , X 2 , . . .}.
for all xl ∈ X , l = 1, . . . , n; we also say that these random variables are mutually
independent. Furthermore, the notion of identical distribution means that
Pr[X i = x] = Pr[X 1 = x]
for any x ∈ X and i = 1, 2, . . .; i.e., all the source’s random variables are governed
by the same marginal distribution.
• Stationary process The process X is said to be stationary (or strictly stationary)
if the probability of every sequence or event is unchanged by a left (time) shift,
or equivalently, if any j = 1, 2, . . ., the joint distribution of (X 1 , X 2 , . . . , X n )
satisfies
Pr[X 1 = x1 , X 2 = x2 , . . . , X n = xn ]
= Pr[X j+1 = x1 , X j+2 = x2 , . . . , X j+n = xn ]
for all xl ∈ X , l = 1, . . . , n.
5 As the textbook only deals with one-sided random processes, the discussion on T-
invariance only focuses on sets of one-sided sequences. When a two-sided random process
. . . , X −2 , X −1 , X 0 , X 1 , X 2 , . . . is considered, the left-shift operation T of a two-sided sequence
actually has a unique inverse. Hence, TE ⊆ E implies TE = E. Also, TE = E iff T−1 E = E.
Ergodicity for two-sided sequences can therefore be directly defined using TE = E.
Appendix B: Overview in Probability and Random Processes 279
Observe that the definition has nothing to do with stationarity. It simply states
that events that are unaffected by time-shifting (both left- and right-shifting) must
have probability either zero or one.
Ergodicity implies that all convergent sample averages6 converge to a con-
stant (but not necessarily to the ensemble average or statistical expectation), and
stationarity assures that the time average converges to a random variable; hence,
it is reasonable to expect that they jointly imply the ultimate time average equals
the ensemble average. This is validated by the well-known ergodic theorem by
Birkhoff and Khinchin.
1
n
lim f (X k ) = Y with probability 1.
n→∞ n
k=1
1
n
lim f (X k ) = E[ f (X 1 )] with probability 1.
n→∞ n
k=1
∞
Example B.5 Consider the process {X i }i=1 consisting of a family of i.i.d. binary
random variables (obviously, it is stationary and ergodic). Define the function f (·)
by f (0) = 0 and f (1) = 1. Hence,7
f (X 1 ) + f (X 2 ) + · · · + f (X n ) X1 + X2 + · · · + Xn
lim = lim
n→∞ n n→∞ n
= PX (1).
As seen in the above example, one of the important consequences that the
pointwise ergodic theorem indicates is that the time average can ultimately replace
the statistical average, which is a useful result. Hence, with stationarity and ergod-
icity, one, who observes
6 Two alternative names for sample average are time average and Cesàro mean. In this book, these
X 130 = 154326543334225632425644234443
from the experiment of rolling a dice, can draw the conclusion that the true distri-
bution of rolling the dice can be well approximated by
1 6 7
Pr{X i = 1} ≈ Pr{X i = 2} ≈ Pr{X i = 3} ≈
30 30 30
9 4 3
Pr{X i = 4} ≈ Pr{X i = 5} ≈ Pr{X i = 6} ≈
30 30 30
Such result is also known by the law of large numbers. The relation between ergod-
icity and the law of large numbers will be further explored in Sect. B.5.
• Markov chain for three random variables: Three random variables X , Y , and Z
are said to form a Markov chain if
n−1
Each xn−k := (xn−k , xn−k+1 , . . . , xn−1 ) ∈ X k is called the state of the Markov
chain at time n.
X1 → X2 → · · · → Xn
for n > 2. The same property applies to any finite number of random variables
from the source ordered in terms of increasing time indices.
We next summarize important concepts and facts about Markov sources (e.g.,
see [137, 162]).
– A kth order Markov chain is irreducible if with some probability, we can go
from any state in X k to another state in a finite number of steps, i.e., for all
x k , y k ∈ X k there exists an integer j ≥ 1 such that
k+ j−1
Pr X j = x k X 1k = y k > 0.
where gcd denotes the greatest common divisor; in other words, if the Markov
chain starts in state x, then the chain cannot return to state x at any time that
282 Appendix B: Overview in Probability and Random Processes
is not a multiple of d(x). If Pr{X n+1 = x|X 1 = x} = 0 for all n, we say that
state x has an infinite period and write d(x) = ∞. We also say that state x is
aperiodic if d(x) = 1 and periodic if d(x) > 1. Furthermore, the first-order
Markov chain is called aperiodic if all its states are aperiodic. In other words,
the first-order Markov chain is aperiodic if
– In an irreducible first-order Markov chain, all states have the same period. Hence,
if one state in such a chain is aperiodic, then the entire Markov chain is aperiodic.
– A distribution π(·) on X is said to be a stationary distribution for a homogeneous
first-order Markov chain, if for every y ∈ X ,
π(y) = π(x) Pr{X 2 = y|X 1 = x}.
x∈X
for all states x and y in X . If the initial state distribution is equal to a sta-
tionary distribution, then the homogeneous first-order Markov chain becomes a
stationary process.
– A finite-alphabet stationary Markov source is an ergodic process (and hence sat-
isfies the pointwise ergodic theorem) iff it is irreducible; see [30, p. 371]
and [349, Prop. I.2.9].
The general relations among i.i.d. sources, Markov sources, stationary sources,
and ergodic sources are depicted in Fig. B.1
i.i.d. Stationary
Ergodic
{X n }∞
n=1 is said to converge to X pointwise on if
{X n }∞
n=1 is said to converge to X with probability 1, if
a.s.
Almost sure convergence is denoted by X n −→ X ; note that it is nothing but a
probabilistic version of pointwise convergence.
3. Convergencein probability.
{X n }∞
n=1 is said to converge to X in probability, if for any ε > 0,
p
This mode of convergence is denoted by X n −→ X .
8 Although such mode of convergence is not used in probability theory, we introduce it herein to
contrast it with the almost sure convergence mode (see Example B.7).
284 Appendix B: Overview in Probability and Random Processes
4. Convergence in r th mean.
{X n }∞
n=1 is said to converge to X in r th mean, if
lim E[|X − X n |r ] = 0.
n→∞
Lr
This is denoted by X n −→ X .
5. Convergence in distribution.
{X n }∞
n=1 is said to converge to X in distribution, if
d
We denote this notion of convergence by X n −→ X .
Example B.7 Consider a probability space (, 2 , P), where = {0, 1, 2, 3}, 2
is the power set of and P(0) = P(1) = P(2) = 1/3 and P(3) = 0. Define a
random variable as ω
X n (ω) = .
n
Then
1 2 1
Pr{X n = 0} = Pr X n = = Pr X n = = .
n n 3
It is clear that for every ω in , X n (ω) converges to X (ω), where X (ω) = 0 for
every ω ∈ ; so
p.w.
X n −→ X.
Now let X̃ (ω) = 0 for ω = 0, 1, 2 and X̃ (ω) = 1 for ω = 3. Then, both of the
following statements are true:
a.s. a.s.
X n −→ X and X n −→ X̃ ,
since
3
Pr lim X n = X̃ = P(ω) · 1 lim X n (ω) = X̃ (ω) = 1,
n→∞ n→∞
ω=0
Appendix B: Overview in Probability and Random Processes 285
where 1{·} represents the set indicator function. However, X n does not converge to
X̃ pointwise because
lim X n (3) = X̃ (3).
n→∞
In other words, pointwise convergence requires “equality” even for samples without
probability mass; however, these samples are ignored under almost sure convergence.
For ease of understanding, the relations of the five modes of convergence can be
depicted as follows. As usual, a double arrow denotes implication.
p.w.
Xn −→ X
d
Xn −→ X
There are some other relations among these five convergence modes that are also
depicted in the above graph (via the dotted line); they are stated below.
a.s. L1
X n −→ X, (∀ n)Y ≤ X n ≤ X n+1 , and E[|Y |] < ∞ =⇒ X n −→ X
=⇒ E[X n ] → E[X ].
a.s. L1
X n −→ X, (∀ n)|X n | ≤ Y, and E[|Y |] < ∞ =⇒ X n −→ X
=⇒ E[X n ] → E[X ].
286 Appendix B: Overview in Probability and Random Processes
L1
The implication of X n −→ X to E[X n ] → E[X ] can be easily seen from
X1 + · · · + Xn
n
converges to μ in probability, while the strong law asserts that this convergence takes
place with probability 1.
The following two inequalities will be useful in the discussion of this subject.
Lemma B.11 (Markov’s inequality) For any integer k > 0, real number α > 0,
and any random variable X ,
1
Pr[|X | ≥ α] ≤ E[|X |k ].
αk
Proof Let FX (·) be the cdf of random variable X . Then,
∞
E[|X |k ] = |x|k d FX (x)
−∞
≥ |x|k d FX (x)
{x∈R : |x|≥α}
≥ αk d FX (x)
{x∈R : |x|≥α}
= αk d FX (x)
{x∈R : |x|≥α}
= αk Pr[|X | ≥ α].
Appendix B: Overview in Probability and Random Processes 287
namely,
Pr[X = 0] + Pr[|X | = α] = 1.
In the proof of Markov’s inequality, we use the general representation for inte-
gration with respect to a (cumulative) distribution function FX (·), i.e.,
·d FX (x), (B.5.1)
X
Lemma B.12 (Chebyshev’s inequality) For any random variable X with variance
Var[X ] and real number α > 0,
1
Pr[|X − E[X ]| ≥ α] ≤ Var[X ].
α2
Proof By Markov’s inequality with k = 2, we have
1
Pr[|X − E[X ]| ≥ α] ≤ E[|X − E[X ]|2 ].
α2
Equality holds iff
Pr[|X − E[X ]| = 0] + Pr |X − E[X ]| = α = 1,
In the proofs of the above two lemmas, we also provide the condition under which
equality holds. These conditions indicate that equality usually cannot be fulfilled.
Hence in most cases, the two inequalities are strict.
288 Appendix B: Overview in Probability and Random Processes
Theorem B.13 (Weak law of large numbers) Let {X n }∞n=1 be a sequence of uncor-
related random variables with common mean E[X i ] = μ. If the variables also have
common variance, or more generally,
1
n
X1 + · · · + Xn L2
lim Var[X i ] = 0, equivalently, −→ ¯
n→∞ n 2 n
i=1
Note that the right-hand side of the above Chebyshev’s inequality is just the second
moment of the difference between the n-sample average and the mean μ. Thus, the
L2 p
variance constraint is equivalent to the statement that X n −→ μ implies X n −→ μ.
Theorem B.14 (Kolmogorov’s strong law of large numbers) Let {X n }∞ n=1 be a
sequence of independent random variables with common mean E[X n ] = μ. If either
1. X n ’s are identically distributed; or
2. X n ’s are square-integrable9 with variances satisfying
∞
Var[X i ]
< ∞,
i=1
i2
then
X 1 + · · · + X n a.s.
−→ μ.
n
Note that the above i.i.d. assumption does not exclude the possibility of μ = ∞
(or μ = −∞), in which case the sample average converges to ∞ (or −∞) with
probability 1. Also note that there are cases of sequences of independent random
variables for which the weak law applies, but the strong law does not. This is due to
the fact that
n
Var[X i ] 1
n
≥ Var[X i ].
i=1
i2 n 2 i=1
The final remark is that Kolmogorov’s strong law of large number can be extended
to a function of a sequence of independent random variables:
But such extension cannot be applied to the weak law of large numbers, since g(Yi )
and g(Y j ) can be correlated even if Yi and Y j are not.
After the introduction of Kolmogorov’s strong law of large numbers, one may find
that the pointwise ergodic theorem (Theorem B.4) actually indicates a similar result.
In fact, the pointwise ergodic theorem can be viewed as another version of the strong
law of large numbers, which states that for stationary and ergodic processes, time
averages converge with probability 1 to the ensemble expectation.
The notion of ergodicity is often misinterpreted, since the definition is not very
intuitive. Some texts may provide a definition that a stationary process satisfying
the ergodic theorem is also ergodic.10 However, the ergodic theorem is indeed a
consequence of the original mathematical definition of ergodicity in terms of the
shift-invariant property (see Sect. B.3 and the discussion in [160, pp. 174–175]).
Let us try to clarify the notion of ergodicity by the following remarks:
• The concept of ergodicity does not require stationarity. In other words, a nonsta-
tionary process can be ergodic.
• Many perfectly good models of physical processes are not ergodic, yet they obey
some form of law of large numbers. In other words, non-ergodic processes can be
perfectly good and useful models.
1
n
a.s.
f (X i+1 , . . . , X i+k ) −→ E[ f (X 1 , . . . , X k )].
n
i=1
As a result of this definition, a stationary ergodic source is the most general dependent random
process for which the strong law of large numbers holds. This definition somehow implies that if
a process is not stationary ergodic, then the strong law of large numbers is violated (or the time
average does not converge with probability 1 to its ensemble expectation). But this is not true. One
can weaken the conditions of stationarity and ergodicity from its original mathematical definitions to
asymptotic stationarity and ergodicity, and still make the strong law of large numbers hold. (Cf. the
last remark in this section and also Fig. B.2.)
290 Appendix B: Overview in Probability and Random Processes
Ergodicity
defined through
Ergodicity ergodic theorem
defined through i.e., stationarity and
shift-invariance time average
property converging to
sample average
(law of large numbers)
Fig. B.2 Relation of ergodic random processes, respectively, defined through time-shift invariance
and ergodic theorem
where Pr(Y = 1/4) = Pr(Y = 3/4) = 1/2, which contradicts the ergodic theo-
rem.
Appendix B: Overview in Probability and Random Processes 291
From the above example, the pointwise ergodic theorem can actually be
made useful in such a stationary but non-ergodic case, since an “apparent” station-
ary ergodic process (either {An }∞ ∞
n=1 or {Bn }n=1 ) is actually being observed when
measuring the relative frequency (3/4 or 1/4). This renders a surprising funda-
mental result for random processes—the ergodic decomposition theorem: under
fairly general assumptions, any (not necessarily ergodic) stationary process is in
fact a mixture of stationary ergodic processes, and hence one always observes a
stationary ergodic outcome (e.g., see [159, 349]). As in the above example, one
always observe either A1 , A2 , A3 , . . . or B1 , B2 , B3 , . . ., depending on the value
of U , for which both sequences are stationary ergodic (i.e., the time-stationary
observation X n satisfies X n = U · An + (1 − U ) · Bn ).
• The previous remark implies that ergodicity is not required for the strong law of
large numbers to be useful. The next question is whether or not stationarity is
required. Again, the answer is negative. In fact, the main concern of the law of
large numbers is the convergence of sample averages to its ensemble expectation.
It should be reasonable to expect that random processes could exhibit transient
behaviors that violate the stationarity definition, with their sample average still
converging. One can then introduce the notion of asymptotically mean stationary
to achieve the law of large numbers [159]. For example, a finite-alphabet time-
invariant (but not necessarily stationary) irreducible Markov chain satisfies the law
of large numbers. Thus, the stationarity and/or ergodicity properties of a process
can be weakened with the process still admitting laws of large numbers (i.e., time
averages and relative frequencies have desired and well-defined limits).
1
n
d
√ (X i − μ) −→ Z ∼ N (0, σ 2 ),
n i=1
Jensen’s inequality provides a useful bound for the expectation of convex (or concave)
functions.
292 Appendix B: Overview in Probability and Random Processes
E[ f (X )] ≤ f (E[X ]).
vector a T ∈ Rm and affine parameter b ∈ R if among all hyperplanes of the same slope vector
a T , it is the largest one satisfying a T x + b ≤ f (x) for every x ∈ O. A support hyperplane may
not necessarily be made to pass through the desired point (x , f (x )). Here, since we only consider
convex functions, the validity of the support hyperplane passing (x , f (x )) is therefore guaranteed.
Note that when x is one-dimensional (i.e., m = 1), a support hyperplane is simply referred to as a
support line.
Appendix B: Overview in Probability and Random Processes 293
support line
y = ax + b
the point (x , f (x )) and lying entirely below the graph of f (see Fig. B.3 for an
illustration of a support line for a convex function over R).
Thus,
(∀x ∈ X ) a T x + b ≤ f (x).
a T E[X ] + b ≤ E[ f (X )],
f (E[X ]) ≤ E[ f (X )].
13 Since maximization of f (·) is equivalent to minimization of − f (·), it suffices to discuss the KKT
conditions for the minimization problem defined in (B.8.1).
294 Appendix B: Overview in Probability and Random Processes
where
We are however interested in when the above inequality becomes equality (i.e., when
the so-called strong duality holds) because if there exist nonnegative λ̃ and ν̃ that
equate (B.8.3), then
m
≤ f (x ∗ ) + λ̃i gi (x ∗ ) + ν̃ j h j (x ∗ )
i=1 j=1
≤ f (x ∗ ), (B.8.4)
#m
14 Equating (B.8.4) implies i=1 λ̃i gi (x ∗ ) = 0. It can then be easily verified from λ̃i gi (x ∗ ) ≤ 0
∗
for every 1 ≤ i ≤ m that λ̃i gi (x ) = 0 for 1 ≤ i ≤ m.
Appendix B: Overview in Probability and Random Processes 295
they found that the strong duality holds iff the KKT conditions are satisfied [56,
p. 258].
From λm+k ≥ 0 and λm+k xk = 0, we can obtain the well-known relation below.
⎧∂f #m # ∂h
⎨ ∂xk (x) + i=1 ∂gi
λi ∂x k
(x) + j=1 ν j ∂xkj (x) + ν+1 = 0 if xk > 0
λm+k = #m #
⎩∂f ∂gi ∂h
∂xk
(x) + i=1 λi ∂x k
(x) + j=1 ν j ∂xkj (x) + ν+1 ≥ 0 if xk = 0.
The above relation is the most seen form of the KKT conditions when it is used in
problems in information theory.
#n
Example B.20 Suppose for nonnegative {qi, j }1≤i≤n,1≤ j≤n with j=1 qi, j = 1,
296 Appendix B: Overview in Probability and Random Processes
⎧ n
⎪
⎪ n
qi, j
⎪
⎪ f (x) = − xi qi, j log #n
⎪
⎨ i=1 j=1 i =1 x i qi , j
⎪
⎪ gi (x) = −xi ≤ 0 i = 1, . . . , n
⎪
⎪
⎪
⎩ #n
h(x) = i=1 xi − 1 = 0
By this, the input distributions that achieve the channel capacities of some channels
such as BSC and BEC can be identified.
The next example shows the analogy of determining the channel capacity to the
problem of optimal power allocation.
Example B.21 (Water-filling) Suppose with σi2 > 0 for 1 ≤ i ≤ n and P > 0,
⎧ #n
⎪
⎪ f (x) = − log 1 + xi
⎪
⎪ i=1 σi2
⎨
gi (x) = −xi ≤ 0 i = 1, . . . , n
⎪
⎪
⎪
⎪ #n
⎩
h(x) = i=1 xi − P = 0
This then gives the water-filling solution for the power allocation over parallel
continuous-input AWGN channels.
References
16. F. Alajaji, N. Whalen, The capacity-cost function of discrete additive noise channels with and
without feedback. IEEE Trans. Inf. Theory 46(3), 1131–1140 (2000)
17. F. Alajaji, P.-N. Chen, Z. Rached, Csiszár’s cutoff rates for the general hypothesis testing
problem. IEEE Trans. Inf. Theory 50(4), 663–678 (2004)
18. V.R. Algazi, R.M. Lerner, Binary detection in white non-Gaussian noise, Technical Report
DS-2138, M.I.T. Lincoln Lab, Lexington, MA (1964)
19. S.A. Al-Semari, F. Alajaji, T. Fuja, Sequence MAP decoding of trellis codes for Gaussian and
Rayleigh channels. IEEE Trans. Veh. Technol. 48(4), 1130–1140 (1999)
20. E. Arikan, An inequality on guessing and its application to sequential decoding. IEEE Trans.
Inf. Theory 42(1), 99–105 (1996)
21. E. Arikan, N. Merhav, Joint source-channel coding and guessing with application to sequential
decoding. IEEE Trans. Inf. Theory 44, 1756–1769 (1998)
22. E. Arikan, A performance comparison of polar codes and Reed-Muller codes. IEEE Commun.
Lett. 12(6), 447–449 (2008)
23. E. Arikan, Channel polarization: a method for constructing capacity-achieving codes for
symmetric binary-input memoryless channels. IEEE Trans. Inf. Theory 55(7), 3051–3073
(2009)
24. E. Arikan, Source polarization, in Proceedings of International Symposium on Information
Theory and Applications, July 2010, pp. 899–903
25. E. Arikan, I.E. Telatar, On the rate of channel polarization, in Proceedings of IEEE Interna-
tional Symposium on Information Theory, Seoul, Korea, June–July 2009, pp. 1493–1495
26. E. Arikan, N. ul Hassan, M. Lentmaier, G. Montorsi, J. Sayir, Challenges and some new
directions in channel coding. J. Commun. Netw. 17(4), 328–338 (2015)
27. S. Arimoto, An algorithm for computing the capacity of arbitrary discrete memoryless channel.
IEEE Trans. Inf. Theory 18(1), 14–20 (1972)
28. S. Arimoto, Information measures and capacity of order α for discrete memoryless channels,
Topics in Information Theory, Proceedings of Colloquium Mathematical Society Janos Bolyai,
Keszthely, Hungary, 1977, pp. 41–52
29. R.B. Ash, Information Theory (Interscience, New York, 1965)
30. R.B. Ash, C.A. Doléans-Dade, Probability and Measure Theory (Academic Press, MA, 2000)
31. S. Asoodeh, M. Diaz, F. Alajaji, T. Linder, Information extraction under privacy constraints.
Information 7(1), 1–37 (2016)
32. E. Ayanoǧlu, R. Gray, The design of joint source and channel trellis waveform coders. IEEE
Trans. Inf. Theory 33, 855–865 (1987)
33. J. Bakus, A.K. Khandani, Quantizer design for channel codes with soft-output decoding. IEEE
Trans. Veh. Technol. 54(2), 495–507 (2005)
34. V.B. Balakirsky, Joint source-channel coding with variable length codes. Probl. Inf. Transm.
1(37), 10–23 (2001)
35. A. Banerjee, P. Burlina, F. Alajaji, Image segmentation and labeling using the Polya urn
model. IEEE Trans. Image Process. 8(9), 1243–1253 (1999)
36. M.B. Bassat, J. Raviv, Rényi’s entropy and the probability of error. IEEE Trans. Inf. Theory
24(3), 324–330 (1978)
37. F. Behnamfar, F. Alajaji, T. Linder, MAP decoding for multi-antenna systems with non-
uniform sources: exact pairwise error probability and applications. IEEE Trans. Commun.
57(1), 242–254 (2009)
38. H. Behroozi, F. Alajaji, T. Linder, On the optimal performance in asymmetric Gaussian
wireless sensor networks with fading. IEEE Trans. Signal Process. 58(4), 2436–2441 (2010)
39. S. Ben-Jamaa, C. Weidmann, M. Kieffer, Analytical tools for optimizing the error correction
performance of arithmetic codes. IEEE Trans. Commun. 56(9), 1458–1468 (2008)
40. C.H. Bennett, G. Brassard, Quantum cryptography: public key and coin tossing, in Proceed-
ings of International Conference on Computer Systems and Signal Processing, Bangalore,
India, Dec 1984, pp. 175–179
41. C.H. Bennett, G. Brassard, C. Crepeau, U.M. Maurer, Generalized privacy amplification.
IEEE Trans. Inf. Theory 41(6), 1915–1923 (1995)
References 301
42. T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression (Prentice-
Hall, New Jersey, 1971)
43. T. Berger, Explicit bounds to R(D) for a binary symmetric Markov source. IEEE Trans. Inf.
Theory 23(1), 52–59 (1977)
44. C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon limit error-correcting coding and
decoding: Turbo-codes(1), in Proceedings of IEEE International Conference on Communi-
cations, Geneva, Switzerland, May 1993, pp. 1064–1070
45. C. Berrou, A. Glavieux, Near optimum error correcting coding and decoding: turbo-codes.
IEEE Trans. Commun. 44(10), 1261–1271 (1996)
46. D.P. Bertsekas, with A. Nedić, A.E. Ozdagler, Convex Analysis and Optimization (Athena
Scientific, Belmont, MA, 2003)
47. P. Billingsley, Probability and Measure, 2nd edn. (Wiley, New York, 1995)
48. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)
49. R.E. Blahut, Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf.
Theory 18(4), 460–473 (1972)
50. R.E. Blahut, Theory and Practice of Error Control Codes (Addison-Wesley, MA, 1983)
51. R.E. Blahut, Principles and Practice of Information Theory (Addison Wesley, MA, 1988)
52. R.E. Blahut, Algebraic Codes for Data Transmission (Cambridge University Press, Cam-
bridge, 2003)
53. M. Bloch, J. Barros, Physical-Layer Security: From Information Theory to Security Engi-
neering (Cambridge University Press, Cambridge, 2011)
54. A.C. Blumer, R.J. McEliece, The Rényi redundancy of generalized Huffman codes. IEEE
Trans. Inf. Theory 34(5), 1242–1249 (1988)
55. L. Boltzmann, Uber die beziehung zwischen dem hauptsatze der mechanischen warmetheorie
und der wahrscheinlicjkeitsrechnung respective den satzen uber das warmegleichgewicht.
Wiener Berichte 76, 373–435 (1877)
56. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge,
UK, 2003)
57. G. Brante, R. Souza, J. Garcia-Frias, Spatial diversity using analog joint source channel coding
in wireless channels. IEEE Trans. Commun. 61(1), 301–311 (2013)
58. L. Breiman, The individual ergodic theorems of information theory. Ann. Math. Stat. 28,
809–811 (1957). (with acorrection made in vol. 31, pp. 809–810, 1960)
59. D.R. Brooks, E.O. Wiley, Evolution as Entropy: Toward a Unified Theory of Biology (Uni-
versity of Chicago Press, Chicago, 1988)
60. N. Brunel, J.P. Nadal, Mutual information, Fisher information, and population coding. Neural
Comput. 10(7), 1731–1757 (1998)
61. O. Bursalioglu, G. Caire, D. Divsalar, Joint source-channel coding for deep-space image
transmission using rateless codes. IEEE Trans. Commun. 61(8), 3448–3461 (2013)
62. V. Buttigieg, P.G. Farrell, Variable-length error-correcting codes. IEE Proc. Commun. 147(4),
211–215 (2000)
63. S.A. Butman, R.J. McEliece, The ultimate limits of binary coding for a wideband Gaussian
channel, DSN Progress Report 42–22, Jet Propulsion Lab, Pasadena, CA, Aug 1974, pp.
78–80
64. G. Caire, K. Narayanan, On the distortion SNR exponent of hybrid digital-analog space-time
coding. IEEE Trans. Inf. Theory 53, 2867–2878 (2007)
65. F.P. Calmon, Information-Theoretic Metrics for Security and Privacy, Ph.D. thesis, MIT, Sept
2015
66. F.P. Calmon, A. Makhdoumi, M. Médard, Fundamental limits of perfect privacy, in Proceed-
ings of IEEE International Symposium on Information Theory, Hong Kong, pp. 1796–1800,
June 2015
67. L.L. Campbell, A coding theorem and Rényi’s entropy. Inf. Control 8, 423–429 (1965)
68. L.L. Campbell, A block coding theorem and Rényi’s entropy. Int. J. Math. Stat. Sci. 6, 41–47
(1997)
302 References
95. R.L. Dobrushin, Asymptotic bounds of the probability of error for the transmission of mes-
sages over a memoryless channel with a symmetric transition probability matrix (in Russian).
Teor. Veroyatnost. i Primenen 7(3), 283–311 (1962)
96. R.L. Dobrushin, General formulation of Shannon’s basic theorems of information theory,
AMS Translations, vol. 33, AMS, Providence, RI (1963), pp. 323–438
97. R.L. Dobrushin, M.S. Pinsker, Memory increases transmission capacity. Probl. Inf. Transm.
5(1), 94–95 (1969)
98. S.J. Dolinar, F. Pollara, The theoretical limits of source and channel coding, TDA Progress
Report 42-102, Jet Propulsion Lab, Pasadena, CA, Aug 1990, pp. 62–72
99. Draft report of 3GPP TSG RAN WG1 #87 v0.2.0, The 3rd Generation Partnership Project
(3GPP),Reno, Nevada, USA, Nov. 2016
100. P. Duhamel, M. Kieffer, Joint Source-Channel Decoding: A Cross- Layer Perspective with
Applications in Video Broadcasting over Mobile and Wireless Networks (Academic Press,
2010)
101. S. Dumitrescu, X. Wu, On the complexity of joint source-channel decoding of Markov
sequences over memoryless channels. IEEE Trans. Commun. 56(6), 877–885 (2008)
102. S. Dumitrescu, Y. Wan, Bit-error resilient index assignment for multiple description scalar
quantizers. IEEE Trans. Inf. Theory 61(5), 2748–2763 (2015)
103. J.G. Dunham, R.M. Gray, Joint source and noisy channel trellis encoding. IEEE Trans. Inf.
Theory 27, 516–519 (1981)
104. R. Durrett, Probability: Theory and Examples (Cambridge University Press, Cambridge,
2015)
105. A.K. Ekert, Quantum cryptography based on Bell’s theorem. Phys. Rev. Lett. 67(6), 661–663
(1991)
106. A. El Gamal, Y.-H. Kim, Network Information Theory (Cambridge University Press, Cam-
bridge, 2011)
107. P. Elias, Coding for noisy channels, IRE Convention Record, Part 4, pp. 37–46 (1955)
108. E.O. Elliott, Estimates of error rates for codes on burst-noise channel. Bell Syst. Tech. J. 42,
1977–1997 (1963)
109. S. Emami, S.L. Miller, Nonsymmetric sources and optimum signal selection. IEEE Trans.
Commun. 44(4), 440–447 (1996)
110. M. Ergen, Mobile Broadband: Including WiMAX and LTE (Springer, Berlin, 2009)
111. F. Escolano, P. Suau, B. Bonev, Information Theory in Computer Vision and Pattern Recog-
nition (Springer, Berlin, 2009)
112. I. Esnaola, A.M. Tulino, J. Garcia-Frias, Linear analog coding of correlated multivariate
Gaussian sources. IEEE Trans. Commun. 61(8), 3438–3447 (2013)
113. R.M. Fano, Class notes for “Transmission of Information,” Course 6.574, MIT, 1952
114. R.M. Fano, Transmission of Information: A Statistical Theory of Communication (Wiley, New
York, 1961)
115. B. Farbre, K. Zeger, Quantizers with uniform decoders and channel-optimized encoders. IEEE
Trans. Inf. Theory 52(2), 640–661 (2006)
116. N. Farvardin, A study of vector quantization for noisy channels. IEEE Trans. Inf. Theory
36(4), 799–809 (1990)
117. N. Farvardin, V. Vaishampayan, On the performance and complexity of channel-optimized
vector quantizers. IEEE Trans. Inf. Theory 37(1), 155–159 (1991)
118. T. Fazel, T. Fuja, Robust transmission of MELP-compressed speech: an illustrative example
of joint source-channel decoding. IEEE Trans. Commun. 51(6), 973–982 (2003)
119. W. Feller, An Introduction to Probability Theory and its Applications, vol. I, 3rd edn. (Wiley,
New York, 1970)
120. W. Feller, An Introduction to Probability Theory and its Applications, vol. II, 2nd edn. (Wiley,
New York, 1971)
121. T. Fine, Properties of an optimum digital system and applications. IEEE Trans. Inf. Theory
10, 443–457 (1964)
304 References
122. T. Fingscheidt, T. Hindelang, R.V. Cox, N. Seshadri, Joint source-channel (de)coding for
mobile communications. IEEE Trans. Commun. 50, 200–212 (2002)
123. F. Fleuret, Fast binary feature selection with conditional mutual information. J. Mach. Learn.
Res. 5, 1531–1555 (2004)
124. M. Fossorier, Z. Xiong, K. Zeger, Progressive source coding for a power constrained Gaussian
channel. IEEE Trans. Commun. 49(8), 1301–1306 (2001)
125. B.D. Fritchman, A binary channel characterization using partitioned Markov chains. IEEE
Trans. Inf. Theory 13(2), 221–227 (1967)
126. M. Fresia, G. Caire, A linear encoding approach to index assignment in lossy source-channel
coding. IEEE Trans. Inf. Theory 56(3), 1322–1344 (2010)
127. M. Fresia, F. Perez-Cruz, H.V. Poor, S. Verdú, Joint source-channel coding. IEEE Signal
Process. Mag. 27(6), 104–113 (2010)
128. S.H. Friedberg, A.J. Insel, L.E. Spence, Linear Algebra, 4th edn. (Prentice Hall, 2002)
129. T.E. Fuja, C. Heegard, Focused codes for channels with skewed errors. IEEE Trans. Inf.
Theory 36(9), 773–783 (1990)
130. A. Fuldseth, T.A. Ramstad, Bandwidth compression for continuous amplitude channels based
onvector approximation to a continuous subset of the source signal space, in Proceedings IEEE
International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, Apr
1997, pp. 3093–3096
131. S. Gadkari, K. Rose, Robust vector quantizer design by noisy channel relaxation. IEEE Trans.
Commun. 47(8), 1113–1116 (1999)
132. S. Gadkari, K. Rose, Unequally protected multistage vector quantization for time-varying
CDMA channels. IEEE Trans. Commun. 49(6), 1045–1054 (2001)
133. R.G. Gallager, Low-density parity-check codes. IRE Trans. Inf. Theory 28(1), 8–21 (1962)
134. R.G. Gallager, Low-Density Parity-Check Codes (MIT Press, 1963)
135. R.G. Gallager, Information Theory and Reliable Communication (Wiley, New York, 1968)
136. R.G. Gallager, Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24(6), 668–674
(1978)
137. R.G. Gallager, Discrete Stochastic Processes (Kluwer Academic, Boston, 1996)
138. Y. Gao, E. Tuncel, New hybrid digital/analog schemes for transmission of a Gaussian source
over a Gaussian channel. IEEE Trans. Inf. Theory 56(12), 6014–6019 (2010)
139. J. Garcia-Frias, J.D. Villasenor, Joint Turbo decoding and estimation of hidden Markov
sources. IEEE J. Sel. Areas Commun. 19, 1671–1679 (2001)
140. M. Gastpar, B. Rimoldi, M. Vetterli, To code, or not to code: lossy source-channel communi-
cation revisited. IEEE Trans. Inf. Theory 49, 1147–1158 (2003)
141. M. Gastpar, Uncoded transmission is exactly optimal for a simple Gaussian sensor network.
IEEE Trans. Inf. Theory 54(11), 5247–5251 (2008)
142. A. Gersho, R.M. Gray, Vector Quantization and Signal Compression (Kluwer Academic
Press/Springer, 1992)
143. J.D. Gibson, T.R. Fisher, Alphabet-constrained data compression. IEEE Trans. Inf. Theory
28, 443–457 (1982)
144. M. Gil, F. Alajaji, T. Linder, Rényi divergence measures for commonly used univariate con-
tinuous distributions. Inf. Sci. 249, 124–131 (2013)
145. E.N. Gilbert, Capacity of a burst-noise channel. Bell Syst. Tech. J. 39, 1253–1266 (1960)
146. J. Gleick, The Information: A History, a Theory and a Flood (Pantheon Books, New York,
2011)
147. T. Goblick Jr., Theoretical limitations on the transmission of data from analog sources. IEEE
Trans. Inf. Theory 11(4), 558–567 (1965)
148. N. Goela, E. Abbe, M. Gastpar, Polar codes for broadcast channels. IEEE Trans. Inf. Theory
61(2), 758–782 (2015)
149. N. Görtz, On the iterative approximation of optimal joint source-channel decoding. IEEE J.
Sel. Areas Commun. 19(9), 1662–1670 (2001)
150. N. Görtz, Joint Source-Channel Coding of Discrete-Time Signals with Continuous Amplitudes
(Imperial College Press, London, UK, 2007)
References 305
180. R.V.L. Hartley, Transmission of information. Bell Syst. Tech. J. 7, 535 (1928)
181. B. Hayes, C. Wilson, A maximum entropy model of phonotactics and phonotactic learning.
Linguist. Inq. 39(3), 379–440 (2008)
182. M. Hayhoe, F. Alajaji, B. Gharesifard, A Polya urn-based model for epidemics on networks,
in Proceedings of American Control Conference, Seattle, May 2017, pp. 358–363
183. A. Hedayat, A. Nosratinia, Performance analysis and design criteria for finite-alphabet
source/channel codes. IEEE Trans. Commun. 52(11), 1872–1879 (2004)
184. S. Heinen, P. Vary, Source-optimized channel coding for digital transmission channels. IEEE
Trans. Commun. 53(4), 592–600 (2005)
185. F. Hekland, P.A. Floor, T.A. Ramstad, Shannon-Kotelnikov mappings in joint source-channel
coding. IEEE Trans. Commun. 57(1), 94–105 (2009)
186. M. Hellman, J. Raviv, Probability of error, equivocation and the Chernoff bound. IEEE Trans.
Inf. Theory 16(4), 368–372 (1970)
187. M.E. Hellman, Convolutional source encoding. IEEE Trans. Inf. Theory 21, 651–656 (1975)
188. M. Hirvensalo, Quantum Computing (Springer, Berlin, 2013)
189. B. Hochwald, K. Zeger, Tradeoff between source and channel coding. IEEE Trans. Inf. Theory
43, 1412–1424 (1997)
190. T. Holliday, A. Goldsmith, H.V. Poor, Joint source and channel coding for MIMO systems:
is it better to be robust or quick? IEEE Trans. Inf. Theory 54(4), 1393–1405 (2008)
191. G.D. Hu, On Shannon theorem and its converse for sequence of communication schemes in
the case of abstract random variables, Transactions of 3rd Prague Conference on Informa-
tion Theory, Statistical Decision Functions, Random Processes (Czechoslovak Academy of
Sciences, Prague, 1964), pp. 285–333
192. T.C. Hu, D.J. Kleitman, J.K. Tamaki, Binary trees optimum under various criteria. SIAM J.
Appl. Math. 37(2), 246–256 (1979)
193. Y. Hu, J. Garcia-Frias, M. Lamarca, Analog joint source-channel coding using non-linear
curves and MMSE decoding. IEEE Trans. Commun. 59(11), 3016–3026 (2011)
194. J. Huang, S. Meyn, M. Medard, Error exponents for channel coding with application to signal
constellation design. IEEE J. Sel. Areas Commun. 24(8), 1647–1661 (2006)
195. D.A. Huffman, A method for the construction of minimum redundancy codes. Proc. IRE 40,
1098–1101 (1952)
196. S. Ihara, Information Theory for Continuous Systems (World-Scientific, Singapore, 1993)
197. I. Issa, S. Kamath, A.B. Wagner, An operational measure of information leakage, in Proceed-
ings of the Conference on Information Sciences and Systems, Princeton University, Mar 2016,
pp. 234–239
198. H. Jafarkhani, N. Farvardin, Design of channel-optimized vector quantizers in the presence
of channel mismatch. IEEE Trans. Commun. 48(1), 118–124 (2000)
199. K. Jacobs, Almost periodic channels, Colloquium on Combinatorial Methods in Probability
Theory, Aarhus, 1962, pp. 118–126
200. X. Jaspar, C. Guillemot, L. Vandendorpe, Joint source-channel turbo techniques for discrete-
valued sources: from theory to practice. Proc. IEEE 95, 1345–1361 (2007)
201. E.T. Jaynes, Information theory and statistical mechanics. Phys. Rev. 106(4), 620–630 (1957)
202. E.T. Jaynes, Information theory and statistical mechanics II. Phys. Rev. 108(2), 171–190
(1957)
203. E.T. Jaynes, On the rationale of maximum-entropy methods. Proc. IEEE 70(9), 939–952
(1982)
204. M. Jeanne, J.-C. Carlach, P. Siohan, Joint source-channel decoding of variable-length codes
for convolutional codes and Turbo codes. IEEE Trans. Commun. 53(1), 10–15 (2005)
205. F. Jelinek, Probabilistic Information Theory (McGraw Hill, 1968)
206. F. Jelinek, Buffer overflow in variable length coding of fixed rate sources. IEEE Trans. Inf.
Theory 14, 490–501 (1968)
207. V.D. Jerohin,
-entropy of discrete random objects. Teor. Veroyatnost. i Primenen 3, 103–107
(1958)
208. R. Johanesson, K. Zigangirov, Fundamentals of Convolutional Coding (IEEE, 1999)
References 307
209. O. Johnson, Information Theory and the Central Limit Theorem (Imperial College Press,
London, 2004)
210. N.L. Johnson, S. Kotz, Urn Models and Their Application: An Approach to Modern Discrete
Probability Theory (Wiley, New York, 1977)
211. L.N. Kanal, A.R.K. Sastry, Models for channels with memory and their applications to error
control. Proc. IEEE 66(7), 724–744 (1978)
212. W. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints,
M.Sc. Dissertation, Department of Mathematics, University of Chicago, Chicago, Illinois,
1939
213. A. Khisti, G. Wornell, Secure transmission with multiple antennas I: The MISOME wiretap
channel. IEEE Trans. Inf. Theory 56(7), 3088–3104 (2010)
214. A. Khisti, G. Wornell, Secure transmission with multiple antennas II: the MIMOME wiretap
channel. IEEE Trans. Inf. Theory 56(11), 5515–5532 (2010)
215. Y.H. Kim, A coding theorem for a class of stationary channels with feedback. IEEE Trans.
Inf. Theory 54(4), 1488–1499 (2008)
216. Y.H. Kim, A. Sutivong, T.M. Cover, State amplification. IEEE Trans. Inf. Theory 54(5),
1850–1859 (2008)
217. J. Kliewer, R. Thobaben, Iterative joint source-channel decoding of variable-length codes
using residual source redundancy. IEEE Trans. Wireless Commun. 4(3), 919–929 (2005)
218. P. Knagenhjelm, E. Agrell, The Hadamard transform—a tool for index assignment. IEEE
Trans. Inf. Theory 42(4), 1139–1151 (1996)
219. Y. Kochman, R. Zamir, Analog matching of colored sources to colored channels. IEEE Trans.
Inf. Theory 57(6), 3180–3195 (2011)
220. Y. Kochman, G. Wornell, On uncoded transmission and blocklength, in Proceedings IEEE
Information Theory Workshop, Sept 2012, pp. 15–19
221. E. Koken, E. Tuncel, On robustness of hybrid digital/analog source-channel coding with
bandwidth mismatch. IEEE Trans. Inf. Theory 61(9), 4968–4983 (2015)
222. A.N. Kolmogorov, On the Shannon theory of information transmission in the case of contin-
uous signals. IRE Trans. Inf. Theory 2(4), 102–108 (1956)
223. A.N. Kolmogorov, A new metric invariant of transient dynamical systems and automorphisms in Lebesgue spaces. Dokl. Akad. Nauk SSSR 119, 861–864 (1958)
224. A.N. Kolmogorov, S.V. Fomin, Introductory Real Analysis (Dover Publications, New York,
1970)
225. L.H. Koopmans, Asymptotic rate of discrimination for Markov processes. Ann. Math. Stat.
31, 982–994 (1960)
226. S.B. Korada, Polar Codes for Channel and Source Coding, Ph.D. Dissertation, EPFL, Lau-
sanne, Switzerland, 2009
227. S.B. Korada, R.L. Urbanke, Polar codes are optimal for lossy source coding. IEEE Trans. Inf.
Theory 56(4), 1751–1768 (2010)
228. S.B. Korada, E. Şaşoğlu, R. Urbanke, Polar codes: characterization of exponent, bounds, and
constructions. IEEE Trans. Inf. Theory 56(12), 6253–6264 (2010)
229. I. Korn, J.P. Fonseka, S. Xing, Optimal binary communication with nonequal probabilities.
IEEE Trans. Commun. 51(9), 1435–1438 (2003)
230. V.N. Koshelev, Direct sequential encoding and decoding for discrete sources. IEEE Trans.
Inf. Theory 19, 340–343 (1973)
231. V. Kostina, S. Verdú, Lossy joint source-channel coding in the finite blocklength regime. IEEE
Trans. Inf. Theory 59(5), 2545–2575 (2013)
232. V.A. Kotelnikov, The Theory of Optimum Noise Immunity (McGraw-Hill, New York, 1959)
233. G. Kramer, Directed Information for Channels with Feedback, Ph.D. Dissertation, ETH Series in Information Processing, vol. 11 (Hartung-Gorre Verlag, Konstanz, Switzerland, 1998)
234. J. Kroll, N. Phamdo, Analysis and design of trellis codes optimized for a binary symmetric
Markov source with MAP detection. IEEE Trans. Inf. Theory 44(7), 2977–2987 (1998)
235. H.W. Kuhn, A.W. Tucker, Nonlinear programming, in Proceedings of 2nd Berkeley Sympo-
sium, Berkeley, University of California Press, 1951, pp. 481–492
236. S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86
(1951)
237. S. Kullback, Information Theory and Statistics (Wiley, New York, 1959)
238. H. Kumazawa, M. Kasahara, T. Namekawa, A construction of vector quantizers for noisy
channels. Electron. Eng. Jpn. 67–B(4), 39–47 (1984)
239. A. Kurtenbach, P. Wintz, Quantizing for noisy channels. IEEE Trans. Commun. Technol. 17,
291–302 (1969)
240. F. Lahouti, A.K. Khandani, Efficient source decoding over memoryless noisy channels using
higher order Markov models. IEEE Trans. Inf. Theory 50(9), 2103–2118 (2004)
241. J.N. Laneman, E. Martinian, G. Wornell, J.G. Apostolopoulos, Source-channel diversity for
parallel channels. IEEE Trans. Inf. Theory 51(10), 3518–3539 (2005)
242. G.G. Langdon, An introduction to arithmetic coding. IBM J. Res. Dev. 28, 135–149 (1984)
243. G.G. Langdon, J. Rissanen, A simple general binary source code. IEEE Trans. Inf. Theory
28(5), 800–803 (1982)
244. K.H. Lee, D. Petersen, Optimal linear coding for vector channels. IEEE Trans. Commun.
24(12), 1283–1290 (1976)
245. J.M. Lervik, A. Grovlen, T.A. Ramstad, Robust digital signal compression and modulation
exploiting the advantages of analog communications, in Proceedings of IEEE GLOBECOM,
Nov 1995, pp. 1044–1048
246. F. Liese, I. Vajda, Convex Statistical Distances (Teubner, Leipzig, 1987)
247. J. Lim, D.L. Neuhoff, Joint and tandem source-channel coding with complexity and delay
constraints. IEEE Trans. Commun. 51(5), 757–766 (2003)
248. S. Lin, D.J. Costello, Error Control Coding: Fundamentals and Applications, 2nd edn. (Pren-
tice Hall, Upper Saddle River, NJ, 2004)
249. T. Linder, R. Zamir, On the asymptotic tightness of the Shannon lower bound. IEEE Trans.
Inf. Theory 40(6), 2026–2031 (1994)
250. A. Lozano, A.M. Tulino, S. Verdú, Optimum power allocation for parallel Gaussian channels
with arbitrary input distributions. IEEE Trans. Inf. Theory 52(7), 3033–3051 (2006)
251. D.J.C. MacKay, R.M. Neal, Near Shannon limit performance of low density parity check
codes. Electron. Lett. 33(6), 457–458 (1997)
252. D.J.C. MacKay, Good error correcting codes based on very sparse matrices. IEEE Trans. Inf.
Theory 45(2), 399–431 (1999)
253. D.J.C. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge Uni-
versity Press, Cambridge, 2003)
254. F.J. MacWilliams, N.J.A. Sloane, The Theory of Error Correcting Codes (North-Holland Pub.
Co., 1978)
255. U. Madhow, Fundamentals of Digital Communication (Cambridge University Press, Cam-
bridge, 2008)
256. H. Mahdavifar, A. Vardy, Achieving the secrecy capacity of wiretap channels using polar
codes. IEEE Trans. Inf. Theory 57(10), 6428–6443 (2011)
257. H.M. Mahmoud, Polya Urn Models (Chapman and Hall/CRC, 2008)
258. S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation.
IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
259. C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT
Press, Cambridge, MA, 1999)
260. W. Mao, Modern Cryptography: Theory and Practice (Prentice Hall Professional Technical
Reference, 2003)
261. H. Marko, The bidirectional communication theory—a generalization of information theory.
IEEE Trans. Commun. 21(12), 1335–1351 (1973)
262. J.E. Marsden, M.J. Hoffman, Elementary Classical Analysis (W.H. Freeman & Company,
1993)
263. J.L. Massey, Joint source and channel coding, in Communications and Random Process
Theory, ed. by J.K. Skwirzynski (Sijthoff and Noordhoff, The Netherlands, 1978), pp. 279–293
264. J.L. Massey, Cryptography—a selective survey, in Digital Communications, ed. by E. Biglieri,
G. Prati (Elsevier, 1986), pp. 3–21
265. J. Massey, Causality, feedback, and directed information, in Proceedings of International
Symposium on Information Theory and Applications, 1990, pp. 303–305
266. R.J. McEliece, The Theory of Information and Coding, 2nd edn. (Cambridge University Press,
Cambridge, 2002)
267. B. McMillan, The basic theorems of information theory. Ann. Math. Stat. 24, 196–219 (1953)
268. A. Méhes, K. Zeger, Performance of quantizers on noisy channels using structured families
of codes. IEEE Trans. Inf. Theory 46(7), 2468–2476 (2000)
269. N. Merhav, Shannon’s secrecy system with informed receivers and its application to systematic
coding for wiretapped channels. IEEE Trans. Inf. Theory 54(6), 2723–2734 (2008)
270. N. Merhav, E. Arikan, The Shannon cipher system with a guessing wiretapper. IEEE Trans.
Inf. Theory 45(6), 1860–1866 (1999)
271. N. Merhav, S. Shamai, On joint source-channel coding for the Wyner-Ziv source and the
Gel’fand-Pinsker channel. IEEE Trans. Inf. Theory 49(11), 2844–2855 (2003)
272. D. Miller, K. Rose, Combined source-channel vector quantization using deterministic anneal-
ing. IEEE Trans. Commun. 42, 347–356 (1994)
273. U. Mittal, N. Phamdo, Duality theorems for joint source-channel coding. IEEE Trans. Inf.
Theory 46(4), 1263–1275 (2000)
274. U. Mittal, N. Phamdo, Hybrid digital-analog (HDA) joint source-channel codes for broad-
casting and robust communications. IEEE Trans. Inf. Theory 48(5), 1082–1102 (2002)
275. J.W. Modestino, D.G. Daut, Combined source-channel coding of images. IEEE Trans. Com-
mun. 27, 1644–1659 (1979)
276. B. Moore, G. Takahara, F. Alajaji, Pairwise optimization of modulation constellations for
non-uniform sources. IEEE Can. J. Electr. Comput. Eng. 34(4), 167–177 (2009)
277. M. Mushkin, I. Bar-David, Capacity and coding for the Gilbert-Elliott channel. IEEE Trans.
Inf. Theory 35(6), 1277–1290 (1989)
278. T. Nakano, A.M. Eckford, T. Haraguchi, Molecular Communication (Cambridge University
Press, Cambridge, 2013)
279. T. Nemetz, On the α-divergence rate for Markov-dependent hypotheses. Probl. Control Inf.
Theory 3(2), 147–155 (1974)
280. T. Nemetz, Information Type Measures and Their Applications to Finite Decision-Problems,
Carleton Mathematical Lecture Notes, no. 17, May 1977
281. J. Neyman, E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses.
Philos. Trans. R. Soc. Lond. A 231, 289–337 (1933)
282. H. Nguyen, P. Duhamel, Iterative joint source-channel decoding of VLC exploiting source
semantics over realistic radio-mobile channels. IEEE Trans. Commun. 57(6), 1701–1711
(2009)
283. A. Nosratinia, J. Lu, B. Aazhang, Source-channel rate allocation for progressive transmission
of images. IEEE Trans. Commun. 51(2), 186–196 (2003)
284. J.M. Ooi, Coding for Channels with Feedback (Springer, Berlin, 1998)
285. E. Ordentlich, T. Weissman, On the optimality of symbol-by-symbol filtering and denoising.
IEEE Trans. Inf. Theory 52(1), 19–40 (2006)
286. X. Pan, A. Banihashemi, A. Cuhadar, Progressive transmission of images over fading channels
using rate-compatible LDPC codes. IEEE Trans. Image Process. 15(12), 3627–3635 (2006)
287. L. Paninski, Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253
(2003)
288. M. Park, D. Miller, Joint source-channel decoding for variable-length encoded data by exact
and approximate MAP source estimation. IEEE Trans. Commun. 48(1), 1–6 (2000)
289. R. Pemantle, A survey of random processes with reinforcement. Probab. Surv. 4, 1–79 (2007)
290. W.B. Pennebaker, J.L. Mitchell, JPEG: Still Image Data Compression Standard (Kluwer
Academic Press/Springer, 1992)
291. H. Permuter, T. Weissman, A.J. Goldsmith, Finite state channels with time-invariant deter-
ministic feedback. IEEE Trans. Inf. Theory 55, 644–662 (2009)
292. H. Permuter, H. Asnani, T. Weissman, Capacity of a POST channel with and without feedback.
IEEE Trans. Inf. Theory 60(10), 6041–6057 (2014)
293. N. Phamdo, N. Farvardin, T. Moriya, A unified approach to tree-structured and multistage
vector quantization for noisy channels. IEEE Trans. Inf. Theory 39(3), 835–850 (1993)
294. N. Phamdo, N. Farvardin, Optimal detection of discrete Markov sources over discrete mem-
oryless channels—applications to combined source-channel coding. IEEE Trans. Inf. Theory
40(1), 186–193 (1994)
295. N. Phamdo, F. Alajaji, N. Farvardin, Quantization of memoryless and Gauss-Markov sources
over binary Markov channels. IEEE Trans. Commun. 45(6), 668–675 (1997)
296. N. Phamdo, F. Alajaji, Soft-decision demodulation design for COVQ over white, colored, and
ISI Gaussian channels. IEEE Trans. Commun. 48(9), 1499–1506 (2000)
297. J.R. Pierce, An Introduction to Information Theory: Symbols, Signals and Noise, 2nd edn.
(Dover Publications Inc., New York, 1980)
298. C. Pimentel, I.F. Blake, Modeling burst channels using partitioned Fritchman’s Markov mod-
els. IEEE Trans. Veh. Technol. 47(3), 885–899 (1998)
299. C. Pimentel, T.H. Falk, L. Lisbôa, Finite-state Markov modeling of correlated Rician-fading
channels. IEEE Trans. Veh. Technol. 53(5), 1491–1501 (2004)
300. C. Pimentel, F. Alajaji, Packet-based modeling of Reed-Solomon block coded correlated
fading channels via a Markov finite queue model. IEEE Trans. Veh. Technol. 58(7), 3124–
3136 (2009)
301. C. Pimentel, F. Alajaji, P. Melo, A discrete queue-based model for capturing memory and soft-
decision information in correlated fading channels. IEEE Trans. Commun. 60(5), 1702–1711
(2012)
302. J.T. Pinkston, An application of rate-distortion theory to a converse to the coding theorem.
IEEE Trans. Inf. Theory 15(1), 66–71 (1969)
303. M.S. Pinsker, Information and Information Stability of Random Variables and Processes
(Holden-Day, San Francisco, 1964)
304. G. Polya, F. Eggenberger, Über die Statistik Verketteter Vorgänge. Z. Angew. Math. Mech. 3,
279–289 (1923)
305. G. Polya, F. Eggenberger, Sur l’Interpretation de Certaines Courbes de Fréquences. Comptes
Rendus C.R. 187, 870–872 (1928)
306. G. Polya, Sur Quelques Points de la Théorie des Probabilités. Ann. Inst. H. Poincaré 1,
117–161 (1931)
307. J.G. Proakis, Digital Communications (McGraw-Hill, 1983)
308. L. Pronzato, H.P. Wynn, A.A. Zhigljavsky, Using Rényi entropies to measure uncertainty in
search problems. Lect. Appl. Math. 33, 253–268 (1997)
309. Z. Rached, Information Measures for Sources with Memory and their Application to Hypoth-
esis Testing and Source Coding, Doctoral dissertation, Queen’s University, 2002
310. Z. Rached, F. Alajaji, L.L. Campbell, Rényi’s entropy rate for discrete Markov sources, in
Proceedings on Conference of Information Sciences and Systems, Baltimore, Mar 1999
311. Z. Rached, F. Alajaji, L.L. Campbell, Rényi’s divergence and entropy rates for finite alphabet
Markov sources. IEEE Trans. Inf. Theory 47(4), 1553–1561 (2001)
312. Z. Rached, F. Alajaji, L.L. Campbell, The Kullback-Leibler divergence rate between Markov
sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)
313. M. Raginsky, I. Sason, Concentration of measure inequalities in information theory, communications, and coding. Found. Trends Commun. Inf. Theory 10(1–2), 1–246 (Now Publishers, Oct 2013)
314. T.A. Ramstad, Shannon mappings for robust communication. Telektronikk 98(1), 114–128
(2002)
315. R.C. Reininger, J.D. Gibson, Distributions of the two-dimensional DCT coefficients for
images. IEEE Trans. Commun. 31(6), 835–839 (1983)
316. A. Rényi, On the dimension and entropy of probability distributions. Acta Math. Acad. Sci.
Hung. 10, 193–215 (1959)
317. A. Rényi, On measures of entropy and information, in Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics Probability, vol. 1 (University of California Press,
Berkeley, 1961), pp. 547–561
318. A. Rényi, On the foundations of information theory. Rev. Inst. Int. Stat. 33, 1–14 (1965)
319. M. Rezaeian, A. Grant, Computation of total capacity for discrete memoryless multiple-access
channels. IEEE Trans. Inf. Theory 50(11), 2779–2784 (2004)
320. Z. Reznic, M. Feder, R. Zamir, Distortion bounds for broadcasting with bandwidth expansion.
IEEE Trans. Inf. Theory 52(8), 3778–3788 (2006)
321. T.J. Richardson, R.L. Urbanke, Modern Coding Theory (Cambridge University Press, Cam-
bridge, 2008)
322. J. Rissanen, Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20, 198–203
(1976)
323. H.L. Royden, Real Analysis, 3rd edn. (Macmillan Publishing Company, New York, 1988)
324. M. Rüngeler, J. Bunte, P. Vary, Design and evaluation of hybrid digital-analog transmission
outperforming purely digital concepts. IEEE Trans. Commun. 62(11), 3983–3996 (2014)
325. P. Sadeghi, R.A. Kennedy, P.B. Rapajic, R. Shams, Finite-state Markov modeling of fading
channels. IEEE Signal Process. Mag. 25(5), 57–80 (2008)
326. D. Salomon, Data Compression: The Complete Reference, 3rd edn. (Springer, Berlin, 2004)
327. L. Sankar, S.R. Rajagopalan, H.V. Poor, Utility-privacy tradeoffs in databases: an information-
theoretic approach. IEEE Trans. Inf. Forensic Secur. 8(6), 838–852 (2013)
328. L. Sankar, S.R. Rajagopalan, S. Mohajer, H.V. Poor, Smart meter privacy: a theoretical frame-
work. IEEE Trans. Smart Grid 4(2), 837–846 (2013)
329. E. Şaşoğlu, Polarization and polar codes. Found. Trends Commun. Inf. Theory 8(4), 259–381
(2011)
330. K. Sayood, Introduction to Data Compression, 4th edn. (Morgan Kaufmann, 2012)
331. K. Sayood, J.C. Borkenhagen, Use of residual redundancy in the design of joint source/channel
coders. IEEE Trans. Commun. 39, 838–846 (1991)
332. L. Schmalen, M. Adrat, T. Clevorn, P. Vary, EXIT chart based system design for iterative
source-channel decoding with fixed-length codes. IEEE Trans. Commun. 59(9), 2406–2413
(2011)
333. N. Sen, F. Alajaji, S. Yüksel, Feedback capacity of a class of symmetric finite-state Markov
channels. IEEE Trans. Inf. Theory 57, 4110–4122 (2011)
334. S. Shahidi, F. Alajaji, T. Linder, MAP detection and robust lossy coding over soft-decision
correlated fading channels. IEEE Trans. Veh. Technol. 62(7), 3175–3187 (2013)
335. S. Shamai, S. Verdú, R. Zamir, Systematic lossy source/channel coding. IEEE Trans. Inf.
Theory 44, 564–579 (1998)
336. G.I. Shamir, K. Xie, Universal source controlled channel decoding with nonsystematic quick-
look-in Turbo codes. IEEE Trans. Commun. 57(4), 960–971 (2009)
337. C.E. Shannon, A symbolic analysis of relay and switching circuits. Trans. Am. Inst. Electr.
Eng. 57(12), 713–723 (1938)
338. C.E. Shannon, A Symbolic Analysis of Relay and Switching Circuits, M.Sc. Thesis, Department
of Electrical Engineering, MIT, 1940
339. C.E. Shannon, An Algebra for Theoretical Genetics, Ph.D. Dissertation, Department of Math-
ematics, MIT, 1940
340. C.E. Shannon, A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423
and 623–656 (1948)
341. C.E. Shannon, Communication in the presence of noise. Proc. IRE 37, 10–21 (1949)
342. C.E. Shannon, Communication theory of secrecy systems. Bell Syst. Tech. J. 28, 656–715
(1949)
343. C.E. Shannon, The zero-error capacity of a noisy channel. IRE Trans. Inf. Theory 2, 8–19
(1956)
344. C.E. Shannon, Certain results in coding theory for noisy channels. Inf. Control 1(1), 6–25
(1957)
345. C.E. Shannon, Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv.
Rec. 4, 142–163 (1959)
346. C.E. Shannon, W. Weaver, The Mathematical Theory of Communication (University of
Illinois Press, Urbana, IL, 1949)
347. C.E. Shannon, R.G. Gallager, E.R. Berlekamp, Lower bounds to error probability for coding
in discrete memoryless channels I. Inf. Control 10(1), 65–103 (1967)
348. C.E. Shannon, R.G. Gallager, E.R. Berlekamp, Lower bounds to error probability for coding
in discrete memoryless channels II. Inf. Control 10(2), 523–552 (1967)
349. P.C. Shields, The Ergodic Theory of Discrete Sample Paths (American Mathematical Society,
1991)
350. P.C. Shields, Two divergence-rate counterexamples. J. Theor. Probab. 6, 521–545 (1993)
351. Y. Shkel, V.Y.F. Tan, S. Draper, Unequal message protection: asymptotic and non-asymptotic
tradeoffs. IEEE Trans. Inf. Theory 61(10), 5396–5416 (2015)
352. R. Sibson, Information radius. Z. Wahrscheinlichkeitstheorie Verw. Geb. 14, 149–161 (1969)
353. C.A. Sims, Rational inattention and monetary economics, in Handbook of Monetary Eco-
nomics, vol. 3 (2010), pp. 155–181
354. M. Skoglund, Soft decoding for vector quantization over noisy channels with memory. IEEE
Trans. Inf. Theory 45(4), 1293–1307 (1999)
355. M. Skoglund, On channel-constrained vector quantization and index assignment for discrete
memoryless channels. IEEE Trans. Inf. Theory 45(7), 2615–2622 (1999)
356. M. Skoglund, P. Hedelin, Hadamard-based soft decoding for vector quantization over noisy
channels. IEEE Trans. Inf. Theory 45(2), 515–532 (1999)
357. M. Skoglund, N. Phamdo, F. Alajaji, Design and performance of VQ-based hybrid digital-
analog joint source-channel codes. IEEE Trans. Inf. Theory 48(3), 708–720 (2002)
358. M. Skoglund, N. Phamdo, F. Alajaji, Hybrid digital-analog source-channel coding for band-
width compression/expansion. IEEE Trans. Inf. Theory 52(8), 3757–3763 (2006)
359. N.J.A. Sloane, A.D. Wyner (eds.), Claude Elwood Shannon: Collected Papers (IEEE Press,
New York, 1993)
360. K. Song, Rényi information, loglikelihood and an intrinsic distribution measure. J. Stat. Plan.
Inference 93(1–2), 51–69 (2001)
361. L. Song, F. Alajaji, T. Linder, On the capacity of burst noise-erasure channels with and without
feedback, in Proceedings of IEEE International Symposium on Information Theory, Aachen,
Germany, June 2017, pp. 206–210
362. J. Soni, R. Goodman, A Mind at Play: How Claude Shannon Invented the Information Age
(Simon & Schuster, 2017)
363. J.F. Sowa, Conceptual Structures: Information Processing in Mind and Machine (Addison-
Wesley, Reading, MA, 1983)
364. Y. Steinberg, S. Verdú, Simulation of random processes and rate-distortion theory. IEEE Trans.
Inf. Theory 42(1), 63–86 (1996)
365. Y. Steinberg, N. Merhav, On hierarchical joint source-channel coding with degraded side
information. IEEE Trans. Inf. Theory 52(3), 886–903 (2006)
366. K.P. Subbalakshmi, J. Vaisey, On the joint source-channel decoding of variable-length encoded
sources: the additive-Markov case. IEEE Trans. Commun. 51(9), 1420–1425 (2003)
367. M. Taherzadeh, A.K. Khandani, Single-sample robust joint source-channel coding: achieving
asymptotically optimum scaling of SDR versus SNR. IEEE Trans. Inf. Theory 58(3), 1565–
1577 (2012)
368. G. Takahara, F. Alajaji, N.C. Beaulieu, H. Kuai, Constellation mappings for two-dimensional
signaling of nonuniform sources. IEEE Trans. Commun. 51(3), 400–408 (2003)
369. Y. Takashima, M. Wada, H. Murakami, Reversible variable length codes. IEEE Trans. Com-
mun. 43, 158–162 (1995)
370. C. Tan, N.C. Beaulieu, On first-order Markov modeling for the Rayleigh fading channel. IEEE
Trans. Commun. 48(12), 2032–2040 (2000)
371. I. Tal, A. Vardy, How to construct polar codes. IEEE Trans. Inf. Theory 59(10), 6562–6582
(2013)
372. I. Tal, A. Vardy, List decoding of polar codes. IEEE Trans. Inf. Theory 61(5), 2213–2226
(2015)
373. V.Y.F. Tan, S. Watanabe, M. Hayashi, Moderate deviations for joint source-channel coding of
systems with Markovian memory, in Proceedings IEEE Symposium on Information Theory,
Honolulu, HI, June 2014, pp. 1687–1691
374. A. Tang, D. Jackson, J. Hobbs, W. Chen, J.L. Smith, H. Patel, A. Prieto, D. Petrusca, M.I.
Grivich, A. Sher, P. Hottowy, W. Davrowski, A.M. Litke, J.M. Beggs, A maximum entropy
model applied to spatial and temporal correlations from cortical networks in vitro. J. Neurosci.
28(2), 505–518 (2008)
375. N. Tanabe, N. Farvardin, Subband image coding using entropy-coded quantization over noisy
channels. IEEE J. Sel. Areas Commun. 10(5), 926–943 (1992)
376. S. Tatikonda, Control Under Communication Constraints, Ph.D. Dissertation, MIT, 2000
377. S. Tatikonda, S. Mitter, Control under communication constraints. IEEE Trans. Autom. Con-
trol 49(7), 1056–1068 (2004)
378. S. Tatikonda, S. Mitter, The capacity of channels with feedback. IEEE Trans. Inf. Theory 55,
323–349 (2009)
379. H. Theil, Economics and Information Theory (North-Holland, Amsterdam, 1967)
380. I.E. Telatar, Capacity of multi-antenna Gaussian channels. Eur. Trans. Telecommun. 10(6),
585–596 (1999)
381. R. Thobaben, J. Kliewer, An efficient variable-length code construction for iterative source-
channel decoding. IEEE Trans. Commun. 57(7), 2005–2013 (2009)
382. C. Tian, S. Shamai, A unified coding scheme for hybrid transmission of Gaussian source over
Gaussian channel, in Proceedings International Symposium on Information Theory, Toronto,
Canada, July 2008, pp. 1548–1552
383. C. Tian, J. Chen, S.N. Diggavi, S. Shamai, Optimality and approximate optimality of source-
channel separation in networks. IEEE Trans. Inf. Theory 60(2), 904–918 (2014)
384. N. Tishby, F.C. Pereira, W. Bialek, The information bottleneck method, in Proceedings of
37th Annual Allerton Conference on Communication, Control, and Computing (1999), pp.
368–377
385. N. Tishby, N. Zaslavsky, Deep learning and the information bottleneck principle, in Proceed-
ings IEEE Information Theory Workshop, Apr 2015, pp. 1–5
386. S. Tridenski, R. Zamir, A. Ingber, The Ziv-Zakai-Rényi bound for joint source-channel coding.
IEEE Trans. Inf. Theory 61(8), 4293–4315 (2015)
387. D.N.C. Tse, P. Viswanath, Fundamentals of Wireless Communications (Cambridge University
Press, Cambridge, UK, 2005)
388. A. Tulino, S. Verdú, Monotonic decrease of the non-Gaussianness of the sum of independent
random variables: a simple proof. IEEE Trans. Inf. Theory 52(9), 4295–4297 (2006)
389. W. Turin, R. van Nobelen, Hidden Markov modeling of flat fading channels. IEEE J. Sel.
Areas Commun. 16, 1809–1817 (1998)
390. R.E. Ulanowicz, Information theory in ecology. Comput. Chem. 25(4), 393–399 (2001)
391. V. Vaishampayan, S.I.R. Costa, Curves on a sphere, shift-map dynamics, and error control
for continuous alphabet sources. IEEE Trans. Inf. Theory 49(7), 1658–1672 (2003)
392. V.A. Vaishampayan, N. Farvardin, Joint design of block source codes and modulation signal
sets. IEEE Trans. Inf. Theory 38, 1230–1248 (1992)
393. I. Vajda, Theory of Statistical Inference and Information (Kluwer, Dordrecht, 1989)
394. S. Vembu, S. Verdú, Y. Steinberg, The source-channel separation theorem revisited. IEEE
Trans. Inf. Theory 41, 44–54 (1995)
395. S. Verdú, α-mutual information, in Proceedings of Workshop Information Theory and Appli-
cations, San Diego, 2015
396. S. Verdú, T.S. Han, A general formula for channel capacity. IEEE Trans. Inf. Theory 40(4),
1147–1157 (1994)
397. S. Verdú, S. Shamai, Variable-rate channel capacity. IEEE Trans. Inf. Theory 56(6), 2651–
2667 (2010)
398. W.R. Wade, An Introduction to Analysis (Prentice Hall, Upper Saddle River, NJ, 1995)
399. D. Wang, A. Ingber, Y. Kochman, A strong converse for joint source-channel coding, in
Proceedings International Symposium on Information Theory, Cambridge, MA, 2012, pp.
2117–2121
400. S.-W. Wang, P.-N. Chen, C.-H. Wang, Optimal power allocation for (N, K)-limited access
channels. IEEE Trans. Inf. Theory 58(6), 3725–3750 (2012)
401. Y. Wang, F. Alajaji, T. Linder, Hybrid digital-analog coding with bandwidth compression for
Gaussian source-channel pairs. IEEE Trans. Commun. 57(4), 997–1012 (2009)
402. T. Wang, W. Zhang, R.G. Maunder, L. Hanzo, Near-capacity joint source and channel coding
of symbol values from an infinite source set using Elias Gamma error correction codes. IEEE
Trans. Commun. 62(1), 280–292 (2014)
403. W. Wang, L. Ying, J. Zhang, On the relation between identifiability, differential privacy, and
mutual-information privacy. IEEE Trans. Inf. Theory 62(9), 5018–5029 (2016)
404. T.A. Welch, A technique for high-performance data compression. Computer 17(6), 8–19
(1984)
405. N. Wernersson, M. Skoglund, T. Ramstad, Polynomial based analog source channel codes.
IEEE Trans. Commun. 57(9), 2600–2606 (2009)
406. T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, M.J. Weinberger, Universal discrete denois-
ing: known channel. IEEE Trans. Inf. Theory 51, 5–28 (2005)
407. S. Wicker, Error Control Systems for Digital Communication and Storage (Prentice Hall,
Upper Saddle River, NJ, 1995)
408. M.M. Wilde, Quantum Information Theory, 2nd edn. (Cambridge University Press, Cam-
bridge, 2017)
409. M.P. Wilson, K.R. Narayanan, G. Caire, Joint source-channel coding with side information
using hybrid digital analog codes. IEEE Trans. Inf. Theory 56(10), 4922–4940 (2010)
410. T.-Y. Wu, P.-N. Chen, F. Alajaji, Y.S. Han, On the design of variable-length error-correcting
codes. IEEE Trans. Commun. 61(9), 3553–3565 (2013)
411. A.D. Wyner, The capacity of the band-limited Gaussian channel. Bell Syst. Tech. J. 45, 359–
371 (1966)
412. A.D. Wyner, J. Ziv, Bounds on the rate-distortion function for stationary sources with memory.
IEEE Trans. Inf. Theory 17(5), 508–513 (1971)
413. A.D. Wyner, The wire-tap channel. Bell Syst. Tech. J. 54, 1355–1387 (1975)
414. H. Yamamoto, A source coding problem for sources with additional outputs to keep secret
from the receiver or wiretappers. IEEE Trans. Inf. Theory 29(6), 918–923 (1983)
415. R.W. Yeung, Information Theory and Network Coding (Springer, New York, 2008)
416. S. Yong, Y. Yang, A.D. Liveris, V. Stankovic, Z. Xiong, Near-capacity dirty-paper code design:
a source-channel coding approach. IEEE Trans. Inf. Theory 55(7), 3013–3031 (2009)
417. X. Yu, H. Wang, E.-H. Yang, Design and analysis of optimal noisy channel quantization with
random index assignment. IEEE Trans. Inf. Theory 56(11), 5796–5804 (2010)
418. S. Yüksel, T. Başar, Stochastic Networked Control Systems: Stabilization and Optimization
under Information Constraints (Springer, Berlin, 2013)
419. K.A. Zeger, A. Gersho, Pseudo-Gray coding. IEEE Trans. Commun. 38(12), 2147–2158
(1990)
420. L. Zhong, F. Alajaji, G. Takahara, A binary communication channel with memory based on
a finite queue. IEEE Trans. Inf. Theory 53, 2815–2840 (2007)
421. L. Zhong, F. Alajaji, G. Takahara, A model for correlated Rician fading channels based on a
finite queue. IEEE Trans. Veh. Technol. 57(1), 79–89 (2008)
422. Y. Zhong, F. Alajaji, L.L. Campbell, On the joint source-channel coding error exponent for
discrete memoryless systems. IEEE Trans. Inf. Theory 52(4), 1450–1468 (2006)
423. Y. Zhong, F. Alajaji, L.L. Campbell, On the joint source-channel coding error exponent of
discrete communication systems with Markovian memory. IEEE Trans. Inf. Theory 53(12),
4457–4472 (2007)
424. Y. Zhong, F. Alajaji, L.L. Campbell, Joint source-channel coding excess distortion exponent
for some memoryless continuous-alphabet systems. IEEE Trans. Inf. Theory 55(3), 1296–
1319 (2009)
425. Y. Zhong, F. Alajaji, L.L. Campbell, Error exponents for asymmetric two-user discrete mem-
oryless source-channel coding systems. IEEE Trans. Inf. Theory 55(4), 1487–1518 (2009)
426. G.-C. Zhu, F. Alajaji, Turbo codes for non-uniform memoryless sources over noisy channels.
IEEE Commun. Lett. 6(2), 64–66 (2002)
427. G.-C. Zhu, F. Alajaji, J. Bajcsy, P. Mitran, Transmission of non-uniform memoryless sources
via non-systematic Turbo codes. IEEE Trans. Commun. 52(8), 1344–1354 (2004)
428. G.-C. Zhu, F. Alajaji, Joint source-channel Turbo coding for binary Markov sources. IEEE
Trans. Wireless Commun. 5(5), 1065–1075 (2006)
429. J. Ziv, The behavior of analog communication systems. IEEE Trans. Inf. Theory 16(5), 587–
594 (1970)
430. J. Ziv, A. Lempel, A universal algorithm for sequential data compression. IEEE Trans. Inf.
Theory 23(3), 337–343 (1977)
431. J. Ziv, A. Lempel, Compression of individual sequences via variable-rate coding. IEEE Trans.
Inf. Theory 24(5), 530–536 (1978)
Index
B
Band-limited, 207, 209–211
Bandpass filter, 208, 209

C
Capacity
  BEC, 136
H
Hadamard’s inequality, 180, 202
Hamming codes, 126
Hamming distortion, 222, 238, 240, 250, 253, 255, 258, 262
Hölder’s inequality, 50

J
Jensen’s inequality, 12, 50, 54, 196, 292
Joint distribution, 12, 28, 35, 47, 49, 66, 72, 214, 228, 278
Joint entropy, 5, 12–14, 116
Joint source-channel coding, 141, 247, 248, 252, 254, 261
Joint source-channel coding theorem
  lossless general rate block codes, 147
  lossless rate-one block codes, 143
  lossy, 248
Jointly typical set, 116

K
Karush-Kuhn-Tucker (KKT) conditions, 138, 295
Kraft inequality, 78, 80, 81
Kullback-Leibler distance, divergence, 26, 69, 172

L
Lagrange multipliers, 138, 197, 199, 293
Laplacian source, 182, 243, 244
Law of large numbers
  strong law, 288, 289
  weak law, 287
L2-distance, 117
Lempel-Ziv codes, 95
Likelihood ratio, 27, 42, 43
Limit infimum, see liminf under sequence
Limit supremum, see limsup under sequence
List decoding, 52
Log-likelihood ratio, 27, 34, 43
Log-sum inequality, 11
Lossy information-transmission theorem, 248
Lossy joint source-channel coding theorem, 248
Low-Density Parity-Check (LDPC) codes, 126, 130

M
Machine learning, 2, 40
Markov chain, 281
  aperiodic, 281
  homogeneous, 281
  irreducible, 69, 71, 74, 100, 147, 281, 282, 291
  stationary distribution, 69, 74, 100, 282
  time-invariant, 69, 104, 147, 281
Markov source or process, 68, 74, 76, 100, 280
  stationary ergodic, 282
Markov’s inequality, 286
Martingale, 73
Matrix
  channel transition, 108, 110, 132, 134, 152
  covariance, 177
  doubly stochastic, 134
  positive-definite, 177, 180
  positive-semidefinite, 177
  transition probability, 71, 157, 158
Maximal probability of error, 115, 144
Maximum, 264
Maximum a Posteriori (MAP), 155
Maximum differential entropy, 180, 182
Maximum entropy principle, 40
Maximum Likelihood (ML), 154
Memoryless
  channel, 107, 184, 248
  source, 8, 57, 183, 226, 278
MIMO channels, 203
Minimum, 265, 266
Modes of convergences
  almost surely or with probability one, 283
  in distribution, 284
  in mean, 284
  in probability, 283
  pointwise, 283
  uniqueness of convergence limit, 285
Molecular communication, 40
Monotone convergence theorem, 285
Monotone sequence
  convergence, 269
Multivariate Gaussian, 177
Mutual information, 16
  bound for memoryless channel, 19
  chain rule, 17, 19
  conditional, 17, 21
  continuous random variables, 173, 175, 214
  convexity and concavity, 37
  for specific input symbol, 138
  properties, 16, 174
  Venn diagram, 17

N
Nats, 8, 58, 213, 214
Network epidemics, 2, 73
Neyman-Pearson lemma, 42
Noise, 73, 106, 110, 112, 113, 126, 148, 149, 151, 154, 158, 159, 161, 186, 189, 194, 195, 197, 200, 201, 206, 207, 209, 211, 213, 236, 241, 251, 253–255, 262, 297
Non-Gaussianness, 207, 243