
Springer Undergraduate Texts

in Mathematics and Technology

Fady Alajaji 
Po-Ning Chen

An Introduction
to Single-User
Information
Theory
Springer Undergraduate Texts in Mathematics
and Technology

Series editor
H. Holden, Norwegian University of Science and Technology,
Trondheim, Norway

Editorial Board
Lisa Goldberg, University of California, Berkeley, CA, USA
Armin Iske, University of Hamburg, Germany
Palle E. T. Jorgensen, The University of Iowa, Iowa City, IA, USA
Springer Undergraduate Texts in Mathematics and Technology (SUMAT)
publishes textbooks aimed primarily at the undergraduate. Each text is designed
principally for students who are considering careers either in the mathematical
sciences or in technology-based areas such as engineering, finance, information
technology and computer science, bioscience and medicine, optimization or
industry. Texts aim to be accessible introductions to a wide range of core
mathematical disciplines and their practical, real-world applications; and are
fashioned both for course use and for independent study.

More information about this series at http://www.springer.com/series/7438


Fady Alajaji Po-Ning Chen

An Introduction
to Single-User Information
Theory

Fady Alajaji
Department of Mathematics and Statistics
Queen's University
Kingston, ON, Canada

Po-Ning Chen
Department of Electrical and Computer Engineering
National Chiao Tung University
Hsinchu, Taiwan, Republic of China

ISSN 1867-5506 ISSN 1867-5514 (electronic)


Springer Undergraduate Texts in Mathematics and Technology
ISBN 978-981-10-8000-5 ISBN 978-981-10-8001-2 (eBook)
https://doi.org/10.1007/978-981-10-8001-2

Library of Congress Control Number: 2018934890

Mathematics Subject Classification: 94-XX, 60-XX, 68-XX, 62-XX

© Springer Nature Singapore Pte Ltd. 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
part of Springer Nature
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

The reliable transmission and processing of information-bearing signals over a noisy communication channel is at the heart of what we call communication.
Information theory—founded by Claude E. Shannon in 1948 [340]—provides a
mathematical framework for the theory of communication. It describes the funda-
mental limits to how efficiently one can encode information and still be able to
recover it at the destination either with negligible loss or within a prescribed dis-
tortion threshold.
The purpose of this textbook is to present a concise, yet mathematically rigorous,
introduction to the main pillars of information theory. It thus naturally focuses on
the foundational concepts and indispensable results of the subject for single-user
systems, where a single data source or message needs to be reliably processed and
communicated over a noiseless or noisy point-to-point channel. The book consists
of five meticulously drafted core chapters (with accompanying problems),
emphasizing the key topics of information measures, lossless and lossy data
compression, channel coding, and joint source–channel coding. Two appendices
covering necessary and supplementary material in real analysis and in probability
and stochastic processes are included and a comprehensive instructor’s solutions
manual is separately available. The book is well suited for a single-term first course
on information theory, ranging from 10 to 15 weeks, offered to senior undergrad-
uate and entry-level graduate students in mathematics, statistics, engineering, and
computing and information sciences.
The textbook grew out of lecture notes we developed while teaching information
theory over the last 20 years to students in applied mathematics, statistics, and
engineering at our home universities. Over the years teaching the subject, we
realized that standard textbooks, some of them undeniably outstanding, tend to
cover a large amount of material (including advanced topics), which can be over-
whelming and inaccessible to debutant students. They also do not always provide
all the necessary technical details in the proofs of the main results. We hope that this
book fills these needs by virtue of being succinct and mathematically precise and
that it helps beginners acquire a profound understanding of the fundamental ele-
ments of the subject.


The textbook aims at providing a coherent introduction to the primary principles of single-user information theory. All the main Shannon coding theorems are
proved in full detail (without skipping important steps) using a consistent approach
based on the law of large numbers, or equivalently, the asymptotic equipartition
property (AEP). A brief description of the topics of each chapter follows.
• Chapter 2: Information measures for discrete systems and their properties:
self-information, entropy, mutual information and divergence, data processing
theorem, Fano’s inequality, Pinsker’s inequality, simple hypothesis testing, the
Neyman–Pearson lemma, the Chernoff–Stein lemma, and Rényi’s information
measures.
• Chapter 3: Fundamentals of lossless source coding (data compression): discrete
memoryless sources, fixed-length (block) codes for asymptotically lossless
compression, AEP, fixed-length source coding theorems for memoryless and
stationary ergodic sources, entropy rate and redundancy, variable-length codes
for lossless compression, variable-length source coding theorems for memory-
less and stationary sources, prefix codes, Kraft inequality, Huffman codes,
Shannon–Fano–Elias codes, and Lempel–Ziv codes.
• Chapter 4: Fundamentals of channel coding: discrete memoryless channels,
block codes for data transmission, channel capacity, coding theorem for discrete
memoryless channels, example of polar codes, calculation of channel capacity,
channels with symmetric structures, lossless joint source–channel coding, and
Shannon’s separation principle.
• Chapter 5: Information measures for continuous alphabet systems and Gaussian
channels: differential entropy, mutual information and divergence, AEP for
continuous memoryless sources, capacity and channel coding theorem of
discrete-time memoryless Gaussian channels, capacity of uncorrelated parallel
Gaussian channels and the water-filling principle, capacity of correlated
Gaussian channels, non-Gaussian discrete-time memoryless channels, and
capacity of band-limited (continuous-time) white Gaussian channels.
• Chapter 6: Fundamentals of lossy source coding and joint source–channel
coding: distortion measures, rate–distortion theorem for memoryless sources,
rate–distortion theorem for stationary ergodic sources, rate–distortion function
and its properties, rate–distortion function for memoryless Gaussian and
Laplacian sources, lossy joint source–channel coding theorem, and Shannon
limit of communication systems.
• Appendix A: Overview on suprema and limits.
• Appendix B: Overview in probability and random processes. Random variables
and stochastic processes, statistical properties of random processes, Markov
chains, convergence of sequences of random variables, ergodicity and laws of
large numbers, central limit theorem, concavity and convexity, Jensen’s
inequality, Lagrange multipliers, and the Karush–Kuhn–Tucker (KKT) condi-
tions for constrained optimization problems.

We are very much indebted to all readers, including many students, who pro-
vided valuable feedback. Special thanks are devoted to Yunghsiang S. Han,
Yu-Chih Huang, Tamás Linder, Stefan M. Moser, and Vincent Y. F. Tan; their
insightful and incisive comments greatly benefited the manuscript. We also thank
all anonymous reviewers for their constructive and detailed criticism. Finally, we
sincerely thank all our mentors and colleagues who immeasurably and positively
impacted our understanding of and fondness for the field of information theory,
including Lorne L. Campbell, Imre Csiszár, Lee D. Davisson, Nariman Farvardin,
Thomas E. Fuja, Te Sun Han, Tamás Linder, Prakash Narayan, Adrian
Papamarcou, Nam Phamdo, Mikael Skoglund, and Sergio Verdú.

Remarks to the reader: In the text, all assumptions, claims, conjectures, corollaries, definitions, examples, exercises, lemmas, observations, properties,
and theorems are numbered under the same counter. For example, a lemma
that immediately follows Theorem 2.1 is numbered as Lemma 2.2, instead of
Lemma 2.1. Readers are welcome to submit comments to [email protected] or to
[email protected].

Kingston, ON, Canada Fady Alajaji


Hsinchu, Taiwan, Republic of China Po-Ning Chen
2018
Acknowledgements

Thanks are given to our families for their full support during the period of writing
this textbook.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Communication System Model . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Information Measures for Discrete Systems . . . . . . . . . . . . . . . . . . . 5
2.1 Entropy, Joint Entropy, and Conditional Entropy . . . . . . . . . . . . . 5
2.1.1 Self-information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Properties of Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Joint Entropy and Conditional Entropy . . . . . . . . . . . . . . . 12
2.1.5 Properties of Joint Entropy and Conditional Entropy . . . . . 14
2.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Properties of Mutual Information . . . . . . . . . . . . . . . . . . . 16
2.2.2 Conditional Mutual Information . . . . . . . . . . . . . . . . . . . . 17
2.3 Properties of Entropy and Mutual Information for Multiple
Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Data Processing Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Fano’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Divergence and Variational Distance . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Convexity/Concavity of Information Measures . . . . . . . . . . . . . . . 37
2.8 Fundamentals of Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . 40
2.9 Rényi’s Information Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Lossless Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1 Principles of Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Block Codes for Asymptotically Lossless Compression . . . . . . . . 57
3.2.1 Block Codes for Discrete Memoryless Sources . . . . . . . . . 57
3.2.2 Block Codes for Stationary Ergodic Sources . . . . . . . . . . . 66
3.2.3 Redundancy for Lossless Block Data Compression . . . . . . 75


3.3 Variable-Length Codes for Lossless Data Compression . . . . . . . . . 76


3.3.1 Non-singular Codes and Uniquely Decodable Codes . . . . . 76
3.3.2 Prefix or Instantaneous Codes . . . . . . . . . . . . . . . . . . . . . . 80
3.3.3 Examples of Binary Prefix Codes . . . . . . . . . . . . . . . . . . . 87
3.3.4 Examples of Universal Lossless Variable-Length Codes . . . 93
4 Data Transmission and Channel Capacity . . . . . . . . . . . . . . . . . . . . 105
4.1 Principles of Data Transmission . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Discrete Memoryless Channels . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3 Block Codes for Data Transmission Over DMCs . . . . . . . . . . . . . 114
4.4 Example of Polar Codes for the BEC . . . . . . . . . . . . . . . . . . . . . 127
4.5 Calculating Channel Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.5.1 Symmetric, Weakly Symmetric, and Quasi-symmetric
Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.5.2 Karush–Kuhn–Tucker Conditions for Channel Capacity . . . 138
4.6 Lossless Joint Source-Channel Coding and Shannon’s
Separation Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5 Differential Entropy and Gaussian Channels . . . . . . . . . . . . . . . . . . 165
5.1 Differential Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.2 Joint and Conditional Differential Entropies, Divergence,
and Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.3 AEP for Continuous Memoryless Sources . . . . . . . . . . . . . . . . . . 183
5.4 Capacity and Channel Coding Theorem for the Discrete-Time
Memoryless Gaussian Channel . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.5 Capacity of Uncorrelated Parallel Gaussian Channels: The
Water-Filling Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.6 Capacity of Correlated Parallel Gaussian Channels . . . . . . . . . . . . 201
5.7 Non-Gaussian Discrete-Time Memoryless Channels . . . . . . . . . . . 206
5.8 Capacity of the Band-Limited White Gaussian Channel . . . . . . . . 207
6 Lossy Data Compression and Transmission . . . . . . . . . . . . . . . . . . . 219
6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.1.2 Distortion Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.1.3 Frequently Used Distortion Measures . . . . . . . . . . . . . . . . 222
6.2 Fixed-Length Lossy Data Compression Codes . . . . . . . . . . . . . . . 224
6.3 Rate–Distortion Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.4 Calculation of the Rate–Distortion Function . . . . . . . . . . . . . . . . . 238
6.4.1 Rate–Distortion Function for Discrete Memoryless
Sources Under the Hamming Distortion Measure . . . . . . . . 238
6.4.2 Rate–Distortion Function for Continuous Memoryless
Sources Under the Squared Error Distortion Measure . . . . 241

6.4.3 Rate–Distortion Function for Continuous Memoryless
Sources Under the Absolute Error Distortion Measure . . . . 244
6.5 Lossy Joint Source–Channel Coding Theorem . . . . . . . . . . . . . . . 247
6.6 Shannon Limit of Communication Systems . . . . . . . . . . . . . . . . . 249
Appendix A: Overview on Suprema and Limits . . . . . . . . . . . . . . . . . . . . 263
Appendix B: Overview in Probability and Random Processes . . . . . . . . 273
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
List of Figures

Fig. 1.1 Block diagram of a general communication system . . . . . . . . . . 3


Fig. 2.1 Binary entropy function hb(p) . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Fig. 2.2 Relation between entropy and mutual information . . . . . . . . . 17
Fig. 2.3 Communication context of the data processing lemma . . . . . . 21
Fig. 2.4 Permissible (Pe, H(X|Y)) region due to Fano's inequality . . . . 23
Fig. 3.1 Block diagram of a data compression system . . . . . . . . . . . . . . . 57
Fig. 3.2 Possible code Cn and its corresponding S n . The solid box
indicates the decoding mapping from Cn back to S n . . . . . . . .. 65
Fig. 3.3 Asymptotic compression rate R versus source entropy HD(X)
and behavior of the probability of block decoding error as
blocklength n goes to infinity for a discrete
memoryless source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 66
Fig. 3.4 Classification of variable-length codes . . . . . . . . . . . . . . . . . . .. 81
Fig. 3.5 Tree structure of a binary prefix code. The codewords are those
residing on the leaves, which in this case are 00, 01, 10, 110,
1110, and 1111 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 81
Fig. 3.6 Example of the Huffman encoding . . . . . . . . . . . . . . . . . . . . . .. 90
Fig. 3.7 Example of the sibling property based on the code tree from
PX̂^(16). The arguments inside the parentheses following aj
respectively indicate the codeword and the probability
associated with aj. Here, "b" is used to denote the internal
nodes of the tree with the assigned (partial) code as its
subscript. The number in the parentheses following b is the
probability sum of all its children . . . . . . . . . . . . . . . . . . . . . .. 95
Fig. 3.8 (Continuation of Fig. 3.7) Example of violation of the sibling
property after observing a new symbol a3 at n = 17. Note that
node a1 is not adjacent to its sibling a2 . . . . . . . . . . . . . . . . . .. 96
Fig. 3.9 (Continuation of Fig. 3.8) Updated Huffman code. The sibling
property holds now for the new code . . . . . . . . . . . . . . . . . . . .. 96


Fig. 4.1 A data transmission system, where W represents the message
for transmission, X^n denotes the codeword corresponding to
message W, Y^n represents the received word due to
channel input X^n, and Ŵ denotes the reconstructed message
from Y^n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Fig. 4.2 Binary symmetric channel (BSC) . . . . . . . . . . . . . . . . . . . . . . . . 110
Fig. 4.3 Binary erasure channel (BEC) . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Fig. 4.4 Binary symmetric erasure channel (BSEC) . . . . . . . . . . . . . . . . . 112
Fig. 4.5 Asymptotic channel coding rate R versus channel capacity
C and behavior of the probability of error as blocklength
n goes to infinity for a DMC . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Fig. 4.6 Basic transformation with n = 2 . . . . . . . . . . . . . . . . . . . . . . . 128
Fig. 4.7 Channel polarization for Q = BEC(0.5) with n = 8 . . . . . . . . . . 131
Fig. 4.8 A separate (tandem) source-channel coding scheme . . . . . . . . . . 142
Fig. 4.9 A joint source-channel coding scheme . . . . . . . . . . . . . . . . . . . . 142
Fig. 4.10 An m-to-n block source-channel coding system . . . . . . . . . . . . . 143
Fig. 5.1 The water-pouring scheme for uncorrelated parallel Gaussian
channels. The horizontal dashed line, which indicates the level
where the water rises to, indicates the water level for which
∑_{i=1}^{k} Pi = P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Fig. 5.2 Band-limited waveform channel with additive white Gaussian
noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Fig. 5.3 Water-pouring for the band-limited colored
Gaussian channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
Fig. 6.1 Example for the application of lossy data compression. . . . . . . . 220
Fig. 6.2 “Grouping” as a form of lossy data compression . . . . . . . . . . . . 221
Fig. 6.3 An m-to-n block lossy source–channel coding system . . . . . . . . 248
Fig. 6.4 The Shannon limit for sending a binary uniform source
over a BPSK-modulated AWGN channel with unquantized
output; rates Rsc = 1/2 and 1/3 . . . . . . . . . . . . . . . . . . . . . . . . . 256
Fig. 6.5 The Shannon limits for sending a binary uniform source
over a continuous-input AWGN channel; rates
Rsc = 1/2 and 1/3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Fig. A.1 Illustration of Lemma A.17. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Fig. B.1 General relations of random processes . . . . . . . . . . . . . . . . . . . . 283
Fig. B.2 Relation of ergodic random processes, respectively, defined
through time-shift invariance and ergodic theorem . . . . . . . . . . . 290
Fig. B.3 The support line y = ax + b of the convex function f(x) . . . . . . 293
List of Tables

Table 3.1 An example of the δ-typical set with n = 2 and δ = 0.4,
where F2(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The
codeword set is {001(AB), 010(AC), 011(BA), 100(BB),
101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD,
DA, DB, DC, DD)}, where the parentheses following each
binary codeword indicate those sourcewords that are
encoded to this codeword. The source distribution is
PX(A) = 0.4, PX(B) = 0.3, PX(C) = 0.2, and
PX(D) = 0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 63
Table 5.1 Quantized random variable qn(X) under an n-bit accuracy:
H(qn(X)) and H(qn(X)) − n versus n . . . . . . . . . . . . . . . . . . . . 170
Table 6.1 Shannon limit values γb = Eb/N0 (dB) for sending a binary
uniform source over a BPSK-modulated AWGN channel used
with hard-decision decoding . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Table 6.2 Shannon limit values γb = Eb/N0 (dB) for sending a binary
uniform source over a BPSK-modulated AWGN channel with
unquantized output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Table 6.3 Shannon limit values γb = Eb/N0 (dB) for sending a binary
uniform source over a BPSK-modulated Rayleigh fading
channel with decoder side information . . . . . . . . . . . . . . . . . . . 258

Chapter 1
Introduction

1.1 Overview

Since its inception, the main role of information theory has been to provide the
engineering and scientific communities with a mathematical framework for the the-
ory of communication by establishing the fundamental limits on the performance of
various communication systems. The birth of information theory was initiated with
the publication of the groundbreaking works [340, 346] of Claude Elwood Shannon
(1916–2001) who asserted that it is possible to send information-bearing signals at a
fixed positive rate through a noisy communication channel with an arbitrarily small
probability of error as long as the transmission rate is below a certain fixed quan-
tity that depends on the channel statistical characteristics; he “named” this quantity
channel capacity. He further proclaimed that random (stochastic) sources, represent-
ing data, speech or image signals, can be compressed distortion-free at a minimal
rate given by the source’s intrinsic amount of information, which he called source
entropy1 and defined in terms of the source statistics. He went on proving that if a
source has an entropy that is less than the capacity of a communication channel, then
the source can be reliably transmitted (with asymptotically vanishing probability of
error) over the channel. He further generalized these “coding theorems” from the
lossless (distortionless) to the lossy context where the source can be compressed
and reproduced (possibly after channel transmission) within a tolerable distortion
threshold [345].

1 Shannon borrowed the term "entropy" from statistical mechanics since his quantity admits the
same expression as Boltzmann's entropy [55].


Inspired and guided by the pioneering ideas of Shannon,2 information theorists gradually expanded their interests beyond communication theory, and investigated
fundamental questions in several other related fields. Among them we cite:
• statistical physics (thermodynamics, quantum information theory);
• computing and information sciences (distributed processing, compression, algo-
rithmic complexity, resolvability);
• probability theory (large deviations, limit theorems, Markov decision processes);
• statistics (hypothesis testing, multiuser detection, Fisher information, estimation);
• stochastic control (control under communication constraints, stochastic optimiza-
tion);
• economics (game theory, team decision theory, gambling theory, investment the-
ory);
• mathematical biology (biological information theory, bioinformatics);
• information hiding, data security and privacy;
• data networks (network epidemics, self-similarity, traffic regulation theory);
• machine learning (deep neural networks, data analytics).
In this textbook, we focus our attention on the study of the basic theory of commu-
nication for single-user (point-to-point) systems for which information theory was
originally conceived.

1.2 Communication System Model

A simple block diagram of a general communication system is depicted in Fig. 1.1.


Let us briefly describe the role of each block in the figure.
• Source: The source, which usually represents data or multimedia signals, is mod-
eled as a random process (an introduction to random processes is provided in
Appendix B). It can be discrete (finite or countable alphabet) or continuous (un-
countable alphabet) in value and in time.
• Source Encoder: Its role is to represent the source in a compact fashion by re-
moving its unnecessary or redundant content (i.e., by compressing it).
• Channel Encoder: Its role is to enable the reliable reproduction of the source
encoder output after its transmission through a noisy communication channel.
This is achieved by adding redundancy (using usually an algebraic structure) to
the source encoder output.

2 See [359] for accessing most of Shannon’s works, including his master’s thesis [337, 338] which
made a breakthrough connection between electrical switching circuits and Boolean algebra and
played a catalyst role in the digital revolution, his dissertation on an algebraic framework for
population genetics [339], and his seminal paper on information-theoretic cryptography [342]. Refer
also to [362] for a recent (nontechnical) biography on Shannon and [146] for a broad discourse on
the history of information and on the information age.

Fig. 1.1 Block diagram of a general communication system (transmitter part: Source, Source Encoder, Channel Encoder, Modulator; receiver part: Demodulator, Channel Decoder, Source Decoder, Destination; the modulator, physical channel, and demodulator together form the discrete channel, and the encoding/decoding blocks are the focus of this text)

• Modulator: It transforms the channel encoder output into a waveform suitable for
transmission over the physical channel. This is typically accomplished by varying
the parameters of a sinusoidal signal in proportion with the data provided by the
channel encoder output.
• Physical Channel: It consists of the noisy (or unreliable) medium that the trans-
mitted waveform traverses. It is usually modeled via a sequence of conditional
(or transition) probability distributions of receiving an output given that a specific
input was sent.
• Receiver Part: It consists of the demodulator, the channel decoder, and the source
decoder where the reverse operations are performed. The destination represents
the sink where the source estimate provided by the source decoder is reproduced.
In this text, we will model the concatenation of the modulator, physical chan-
nel, and demodulator via a discrete-time3 channel with a given sequence of condi-
tional probability distributions. Given a source and a discrete channel, our objectives
will include determining the fundamental limits of how well we can construct a
(source/channel) coding scheme so that:
• the smallest number of source encoder symbols can represent each source symbol
distortion-free or within a prescribed distortion level D, where D > 0 and the
channel is noiseless;

3 Except for a brief interlude with the continuous-time (waveform) Gaussian channel in Chap. 5, we

will consider discrete-time communication systems throughout the text.



• the largest rate of information can be transmitted over a noisy channel between
the channel encoder input and the channel decoder output with an arbitrarily small
probability of decoding error;
• we can guarantee that the source is transmitted over a noisy channel and reproduced
at the destination within distortion D, where D > 0.
We refer the reader to Appendix A for the necessary background on suprema and
limits; in particular, Observation A.5 (resp. Observation A.11) provides a pertinent
connection between the supremum (resp., infimum) of a set and the proof of a typ-
ical channel coding (resp., source coding) theorem in information theory. Finally,
Appendix B provides an overview of basic concepts from probability theory and the
theory of random processes that are used in the text. The appendix also contains
a brief discussion of convexity, Jensen’s inequality and the Lagrange multipliers
constrained optimization technique.
Chapter 2
Information Measures for Discrete
Systems

In this chapter, we define Shannon's information measures1 for discrete-time discrete-alphabet2 systems from a probabilistic standpoint and develop their properties. Elucidating the operational significance of probabilistically defined information measures
vis-a-vis the fundamental limits of coding constitutes a main objective of this book;
this will be seen in the subsequent chapters.

2.1 Entropy, Joint Entropy, and Conditional Entropy

2.1.1 Self-information

Let E be an event belonging to a given event space and having probability pE := Pr(E), where 0 ≤ pE ≤ 1. Let I(E) – called the self-information of event E [114, 135]
– represent the amount of information one gains when learning that E has occurred
(or equivalently, the amount of uncertainty one had about E prior to learning that
it has happened). A natural question to ask is “what properties should I(E) have?”
Although the answer to this question may vary from person to person, here are some
properties that I(E) is reasonably expected to have.

1. I(E) should be a decreasing function of pE .


In other words, this property states that I(E) = I (pE ), where I (·) is a real-valued
function defined over [0, 1]. Furthermore, one would expect that the less likely

1 More specifically, Shannon introduced the entropy, conditional entropy, and mutual information
measures [340], while divergence is due to Kullback and Leibler [236, 237].
2 By discrete alphabets, one usually means finite or countably infinite alphabets. We however focus

mostly on finite-alphabet systems, although the presented information measures allow for countable
alphabets (when they exist).

event E is, the more information is gained when one learns it has occurred. In
other words, I (pE ) is a decreasing function of pE .
2. I (pE ) should be continuous in pE .
Intuitively, one should expect that a small change in pE corresponds to a small
change in the amount of information carried by E.
3. If E1 and E2 are independent events, then I(E1 ∩ E2 ) = I(E1 ) + I(E2 ), or
equivalently, I (pE1 × pE2 ) = I (pE1 ) + I (pE2 ).
This property declares that when events E1 and E2 are independent of each other
(i.e., when they do not affect each other probabilistically), the amount of infor-
mation one gains by learning that both events have jointly occurred should be
equal to the sum of the amounts of information of each individual event.
Next, we show that the only function that satisfies Properties 1–3 above is the
logarithmic function.
Theorem 2.1 The only function defined over p ∈ [0, 1] and satisfying
1. I (p) is monotonically decreasing in p;
2. I (p) is a continuous function of p for 0 ≤ p ≤ 1;
3. I (p1 × p2 ) = I (p1 ) + I (p2 );
is I (p) = −c · logb (p), where c is a positive constant and the base b of the logarithm
is any number larger than one.

Proof Step 1: Claim. For n = 1, 2, 3, . . .,

    I(1/n) = −c · logb(1/n),

where c > 0 is a constant.


Proof: First note that for n = 1, Condition 3 directly shows the claim, since it
yields that I (1) = I (1) + I (1). Thus I (1) = 0 = −c logb (1).

Now let n be a fixed positive integer greater than 1. Conditions 1 and 3 respectively imply

    n < m  =⇒  I(1/n) < I(1/m)                                      (2.1.1)

and

    I(1/(mn)) = I(1/m) + I(1/n),                                    (2.1.2)

where n, m = 1, 2, 3, . . .. Now using (2.1.2), we can show by induction (on k) that

    I(1/n^k) = k · I(1/n)                                           (2.1.3)

for all nonnegative integers k.



Now for any positive integer r, there exists a nonnegative integer k such that

    n^k ≤ 2^r < n^(k+1).

By (2.1.1), we obtain

    I(1/n^k) ≤ I(1/2^r) < I(1/n^(k+1)),

which together with (2.1.3), yields

    k · I(1/n) ≤ r · I(1/2) < (k + 1) · I(1/n).

Hence, since I(1/n) > I(1) = 0,

    k/r ≤ I(1/2)/I(1/n) ≤ (k + 1)/r.

On the other hand, by the monotonicity of the logarithm, we obtain

    logb(n^k) ≤ logb(2^r) ≤ logb(n^(k+1))  ⇐⇒  k/r ≤ logb(2)/logb(n) ≤ (k + 1)/r.

Therefore,

    | logb(2)/logb(n) − I(1/2)/I(1/n) | < 1/r.

Since n is fixed, and r can be made arbitrarily large, we can let r → ∞ to get

    I(1/n) = c · logb(n),

where c = I(1/2)/logb(2) > 0. This completes the proof of the claim.
Step 2: Claim. I(p) = −c · logb(p) for any positive rational number p, where c > 0 is a constant.

Proof: A positive rational number p can be represented by a ratio of two integers, i.e., p = r/s, where r and s are both positive integers. Then Condition 3 yields that

    I(1/s) = I((r/s) · (1/r)) = I(r/s) + I(1/r),

which, from Step 1, implies that

    I(p) = I(r/s) = I(1/s) − I(1/r) = c · logb(s) − c · logb(r) = −c · logb(p).

Step 3: For any p ∈ [0, 1], it follows by continuity and the density of the rationals in the reals that

    I(p) = lim_{a↑p, a rational} I(a) = lim_{b↓p, b rational} I(b) = −c · logb(p).                □


The constant c above is by convention normalized to c = 1. Furthermore, the base
b of the logarithm determines the type of units used in measuring information. When
b = 2, the amount of information is expressed in bits (i.e., binary digits). When
b = e – i.e., the natural logarithm (ln) is used – information is measured in nats (i.e.,
natural units or digits). For example, if the event E concerns a Heads outcome from
the toss of a fair coin, then its self-information is I(E) = − log2 (1/2) = 1 bit or
− ln(1/2) = 0.693 nats.
More generally, under base b > 1, information is in b-ary units or digits. For
the sake of simplicity, we will use the base-2 logarithm throughout unless otherwise
specified. Note that one can easily convert information units from bits to b-ary units
by dividing the former by log2 (b).
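As a quick numerical aside, the coin-toss example and the unit conversion can be checked with a minimal Python sketch (the probability value is the one from the example above):

    import math

    def self_information(p, base=2):
        # I(E) = -log_base(p) for an event E of probability p
        assert 0 < p <= 1
        return -math.log(p, base)

    p_heads = 0.5
    print(self_information(p_heads, base=2))        # 1.0 bit
    print(self_information(p_heads, base=math.e))   # about 0.693 nats
    # bits-to-ternary conversion: divide the value in bits by log2(3)
    print(self_information(p_heads, base=2) / math.log2(3))
    print(self_information(p_heads, base=3))        # matches the converted value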

2.1.2 Entropy

Let X be a discrete random variable taking values in a finite alphabet X under a probability distribution or probability mass function (pmf) PX(x) := P[X = x] for
all x ∈ X . Note that X generically represents a memoryless source, which is a
discrete-time random process {Xn }∞n=1 with independent and identically distributed
(i.i.d.) random variables (cf. Appendix B).3
Definition 2.2 (Entropy) The entropy of a discrete random variable X with pmf
PX (·) is denoted by H (X ) or H (PX ) and defined by

    H(X) := − ∑_{x∈X} PX(x) · log2 PX(x)    (bits).

Thus, H(X) represents the statistical average (mean) amount of information one gains when learning that one of its |X| outcomes has occurred, where |X| denotes
the size of alphabet X . Indeed, we directly note from the definition that

3 We will interchangeably use the notations {Xn}∞n=1 and {Xn} to denote discrete-time random processes.

H (X ) = E[− log2 PX (X )] = E[I(X )],

where I(x) := − log2 PX (x) is the self-information of the elementary event {X = x}.
When computing the entropy, we adopt the convention

0 · log2 0 = 0,

which can be justified by a continuity argument since x log2 x → 0 as x → 0. Also note that H(X) only depends on the probability distribution of X and is not affected
by the symbols that represent the outcomes. For example when tossing a fair coin, we
can denote Heads by 2 (instead of 1) and Tail by 100 (instead of 0), and the entropy of
the random variable representing the outcome would remain equal to log2 (2) = 1 bit.

Example 2.3 Let X be a binary (valued) random variable with alphabet X = {0, 1}
and pmf given by PX (1) = p and PX (0) = 1 − p, where 0 ≤ p ≤ 1 is fixed. Then
H (X ) = −p · log2 p − (1 − p) · log2 (1 − p). This entropy is conveniently called the
binary entropy function and is usually denoted by hb (p): it is illustrated in Fig. 2.1.
As shown in the figure, hb (p) is maximized for a uniform distribution (i.e., p = 1/2).

The units for H (X ) above are in bits as base-2 logarithm is used. Setting

    HD(X) := − ∑_{x∈X} PX(x) · logD PX(x)

yields the entropy in D-ary units, where D > 1. Note that we abbreviate H2 (X ) as
H (X ) throughout the book since bits are common measure units for a coding system,
and hence

    HD(X) = H(X) / log2 D.

Fig. 2.1 Binary entropy function hb(p)

Thus

    He(X) = H(X) / log2(e) = (ln 2) · H(X)

gives the entropy in nats, where e is the base of the natural logarithm.
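The entropy definition and the change of units HD(X) = H(X)/log2(D) translate directly into a minimal Python sketch (the pmf below is an arbitrary illustrative choice):

    import math

    def entropy(pmf, base=2):
        # H(X) = -sum_x P(x) * log_base P(x), with the convention 0 * log 0 = 0
        return -sum(p * math.log(p, base) for p in pmf if p > 0)

    pmf = [0.4, 0.3, 0.2, 0.1]                 # arbitrary example pmf
    H_bits = entropy(pmf, base=2)
    print(H_bits)                              # entropy in bits
    print(H_bits / math.log2(math.e))          # converted to nats ...
    print(entropy(pmf, base=math.e))           # ... equals the base-e entropy
    print(entropy([0.5, 0.5]))                 # hb(0.5) = 1 bit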

2.1.3 Properties of Entropy

When developing or proving the basic properties of entropy (and other information
measures), we will often use the following fundamental inequality for the logarithm (its proof is left as an exercise).

Lemma 2.4 (Fundamental inequality (FI)) For any x > 0 and D > 1, we have that

logD (x) ≤ logD (e) · (x − 1),

with equality if and only if (iff) x = 1.

Setting y = 1/x and using FI above directly yield that for any y > 0, we also have that

    logD(y) ≥ logD(e) · (1 − 1/y),

also with equality iff y = 1. In the above the base-D logarithm was used. Specifically,
for a logarithm with base-2, the above inequalities become

    log2(e) · (1 − 1/x) ≤ log2(x) ≤ log2(e) · (x − 1),

with equality iff x = 1.

Lemma 2.5 (Nonnegativity) We have H(X) ≥ 0. Equality holds iff X is deterministic (when X is deterministic, the uncertainty of X is obviously zero).

Proof 0 ≤ PX(x) ≤ 1 implies that log2[1/PX(x)] ≥ 0 for every x ∈ X. Hence,

    H(X) = ∑_{x∈X} PX(x) · log2[1/PX(x)] ≥ 0,

with equality holding iff PX(x) = 1 for some x ∈ X.  □



Lemma 2.6 (Upper bound on entropy) If a random variable X takes values from a
finite set X , then
H (X ) ≤ log2 |X |,

where4 |X| denotes the size of the set X. Equality holds iff X is equiprobable or uniformly distributed over X (i.e., PX(x) = 1/|X| for all x ∈ X).
Proof

    log2 |X| − H(X) = log2 |X| · ∑_{x∈X} PX(x) − ( − ∑_{x∈X} PX(x) log2 PX(x) )
                    = ∑_{x∈X} PX(x) · log2 |X| + ∑_{x∈X} PX(x) log2 PX(x)
                    = ∑_{x∈X} PX(x) log2 [ |X| · PX(x) ]
                    ≥ ∑_{x∈X} PX(x) · log2(e) · ( 1 − 1/(|X| · PX(x)) )
                    = log2(e) · ∑_{x∈X} ( PX(x) − 1/|X| )
                    = log2(e) · (1 − 1) = 0,

where the inequality follows from the FI Lemma, with equality iff (∀ x ∈ X) |X| · PX(x) = 1, which means PX(·) is a uniform distribution on X.  □
Intuitively, H (X ) tells us how random X is. Indeed, X is deterministic (not random
at all) iff H (X ) = 0. If X is uniform (equiprobable), H (X ) is maximized and is
equal to log2 |X |.
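The two extreme cases of Lemmas 2.5 and 2.6 are easy to check numerically; the short sketch below uses arbitrarily chosen pmfs on an alphabet of size 4:

    import math

    def entropy_bits(pmf):
        return -sum(p * math.log2(p) for p in pmf if p > 0)

    print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits: deterministic X
    print(entropy_bits([0.7, 0.1, 0.1, 0.1]))      # about 1.36 bits: skewed pmf
    print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 = log2(4): uniform maximizes H(X)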
Lemma 2.7 (Log-sum inequality) For nonnegative numbers a1, a2, . . ., an and b1, b2, . . ., bn,

    ∑_{i=1}^{n} ai logD(ai/bi) ≥ ( ∑_{i=1}^{n} ai ) logD [ ( ∑_{i=1}^{n} ai ) / ( ∑_{i=1}^{n} bi ) ],      (2.1.4)

with equality holding iff for all i = 1, . . . , n,

    ai/bi = ( ∑_{j=1}^{n} aj ) / ( ∑_{j=1}^{n} bj ),

which is a constant that does not depend on i. (By convention, 0 · logD(0) = 0, 0 · logD(0/0) = 0 and a · logD(a/0) = ∞ if a > 0. Again, this can be justified by "continuity.")

4 Note that log |X| is also known as Hartley's function or entropy; Hartley was the first to suggest measuring information regardless of its content [180].



Proof Let a := ∑_{i=1}^{n} ai and b := ∑_{i=1}^{n} bi. Then

    ∑_{i=1}^{n} ai logD(ai/bi) − a logD(a/b)
        = a [ ∑_{i=1}^{n} (ai/a) logD(ai/bi) − ( ∑_{i=1}^{n} ai/a ) logD(a/b) ]     (where ∑_{i=1}^{n} ai/a = 1)
        = a ∑_{i=1}^{n} (ai/a) logD( (ai/a) · (b/bi) )
        ≥ a ∑_{i=1}^{n} (ai/a) · logD(e) · ( 1 − (bi/ai) · (a/b) )
        = a · logD(e) · ( ∑_{i=1}^{n} ai/a − ∑_{i=1}^{n} bi/b )
        = a · logD(e) · (1 − 1) = 0,

where the inequality follows from the FI Lemma, with equality holding iff (ai/bi) · (b/a) = 1 for all i; i.e., ai/bi = a/b for all i.

We also provide another proof using Jensen's inequality (cf. Theorem B.18 in Appendix B). Without loss of generality, assume that ai > 0 and bi > 0 for every i. Jensen's inequality states that

    ∑_{i=1}^{n} αi f(ti) ≥ f( ∑_{i=1}^{n} αi ti )

for any strictly convex function f(·), αi ≥ 0, and ∑_{i=1}^{n} αi = 1; equality holds iff ti is a constant for all i. Hence by setting αi = bi / ∑_{j=1}^{n} bj, ti = ai/bi, and f(t) = t · logD(t), we obtain the desired result.  □
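A quick numerical sanity check of the log-sum inequality (2.1.4), sketched here with arbitrarily chosen positive numbers ai and bi, shows the strict and the equality cases:

    import math

    def log_sum_lhs(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))

    def log_sum_rhs(a, b):
        return sum(a) * math.log2(sum(a) / sum(b))

    a, b = [1.0, 2.0, 3.0], [2.0, 1.0, 4.0]
    print(log_sum_lhs(a, b), ">=", log_sum_rhs(a, b))  # strict inequality here

    a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]            # ai/bi constant (= 1/2)
    print(log_sum_lhs(a, b), "==", log_sum_rhs(a, b))  # equality case: both -6.0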

2.1.4 Joint Entropy and Conditional Entropy

Given a pair of random variables (X , Y ) with a joint pmf PX ,Y (·, ·) defined5 on X ×Y,
the self-information of the (two-dimensional) elementary event {X = x, Y = y} is
defined by
I(x, y) := − log2 PX ,Y (x, y).

This leads us to the definition of joint entropy.

5 Note that PXY (·, ·) is another common notation for the joint distribution PX ,Y (·, ·).

Definition 2.8 (Joint entropy) The joint entropy H(X, Y) of random variables (X, Y) is defined by

    H(X, Y) := − ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 PX,Y(x, y)
             = E[− log2 PX,Y(X, Y)].

The conditional entropy can be similarly defined as follows.


Definition 2.9 (Conditional entropy) Given two jointly distributed random vari-
ables X and Y , the conditional entropy H (Y |X ) of Y given X is defined by
    H(Y|X) := ∑_{x∈X} PX(x) ( − ∑_{y∈Y} PY|X(y|x) · log2 PY|X(y|x) ),      (2.1.5)

where PY |X (·|·) is the conditional pmf of Y given X .


Equation (2.1.5) can be written in three different but equivalent forms:

    H(Y|X) = − ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 PY|X(y|x)
           = E[− log2 PY|X(Y|X)]
           = ∑_{x∈X} PX(x) · H(Y|X = x),

where H(Y|X = x) := − ∑_{y∈Y} PY|X(y|x) log2 PY|X(y|x).
The relationship between joint entropy and conditional entropy is exhibited by
the fact that the entropy of a pair of random variables is the entropy of one plus the
conditional entropy of the other.
Theorem 2.10 (Chain rule for entropy)

H (X , Y ) = H (X ) + H (Y |X ). (2.1.6)

Proof Since
PX ,Y (x, y) = PX (x)PY |X (y|x),

we directly obtain that

H (X , Y ) = E[− log2 PX ,Y (X , Y )]
= E[− log2 PX (X )] + E[− log2 PY |X (Y |X )]
= H (X ) + H (Y |X ).
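The chain rule (2.1.6) can be verified numerically from any joint pmf; the sketch below uses an arbitrary illustrative joint pmf on {0, 1} × {0, 1} and confirms that H(X, Y) = H(X) + H(Y|X):

    import math

    # arbitrary illustrative joint pmf P_{X,Y}(x, y) on {0, 1} x {0, 1}
    P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

    def H(pmf_values):
        return -sum(p * math.log2(p) for p in pmf_values if p > 0)

    H_XY = H(P.values())
    PX = {x: sum(p for (xx, _), p in P.items() if xx == x) for x in (0, 1)}
    # H(Y|X) = sum_x P_X(x) * H(Y | X = x)
    H_Y_given_X = sum(PX[x] * H([P[(x, y)] / PX[x] for y in (0, 1)]) for x in (0, 1))
    print(H_XY, H(PX.values()) + H_Y_given_X)   # both about 1.846 bits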



By its definition, joint entropy is commutative; i.e., H (X , Y ) = H (Y , X ). Hence,

H (X , Y ) = H (X ) + H (Y |X ) = H (Y ) + H (X |Y ) = H (Y , X ),

which implies that


H (X ) − H (X |Y ) = H (Y ) − H (Y |X ). (2.1.7)

The above quantity is exactly equal to the mutual information which will be intro-
duced in the next section.
The conditional entropy can be thought of in terms of a channel whose input
is the random variable X and whose output is the random variable Y . H (X |Y ) is
then called the equivocation6 and corresponds to the uncertainty in the channel
input from the receiver’s point of view. For example, suppose that the set of possible
outcomes of random vector (X , Y ) is {(0, 0), (0, 1), (1, 0), (1, 1)}, where none of the
elements has zero probability mass. When the receiver Y receives 1, he still cannot
determine exactly what the sender X observes (it could be either 1 or 0); therefore, the
uncertainty, from the receiver’s view point, depends on the probabilities PX |Y (0|1)
and PX |Y (1|1).
Similarly, H (Y |X ), which is called prevarication,7 is the uncertainty in the channel
output from the transmitter’s point of view. In other words, the sender knows exactly
what he sends, but is uncertain on what the receiver will finally obtain.
A case that is of specific interest is when H (X |Y ) = 0. By its definition,
H (X |Y ) = 0 if X becomes deterministic after observing Y . In such a case, the
uncertainty of X given Y is completely zero.
The next corollary can be proved similarly to Theorem 2.10.

Corollary 2.11 (Chain rule for conditional entropy)

H (X , Y |Z) = H (X |Z) + H (Y |X , Z).

2.1.5 Properties of Joint Entropy and Conditional Entropy

Lemma 2.12 (Conditioning never increases entropy) Side information Y decreases the uncertainty about X:
H (X |Y ) ≤ H (X ),

with equality holding iff X and Y are independent. In other words, “conditioning”
reduces entropy.

6 Equivocation is an ambiguous statement one uses deliberately in order to deceive or avoid speaking

the truth.
7 Prevarication is the deliberate act of deviating from the truth (it is a synonym of “equivocation”).

Proof

    H(X) − H(X|Y) = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 [ PX|Y(x|y) / PX(x) ]
                  = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 [ PX|Y(x|y) PY(y) / (PX(x) PY(y)) ]
                  = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 [ PX,Y(x, y) / (PX(x) PY(y)) ]
                  ≥ ( ∑_{(x,y)∈X×Y} PX,Y(x, y) ) · log2 [ ( ∑_{(x,y)∈X×Y} PX,Y(x, y) ) / ( ∑_{(x,y)∈X×Y} PX(x) PY(y) ) ]
                  = 0,

where the inequality follows from the log-sum inequality, with equality holding iff

    PX,Y(x, y) / (PX(x) PY(y)) = constant   ∀ (x, y) ∈ X × Y.

Since probabilities must sum to 1, the above constant equals 1, which is exactly the case of X being independent of Y.  □

Lemma 2.13 Entropy is additive for independent random variables; i.e.,

H (X , Y ) = H (X ) + H (Y ) for independent X and Y .

Proof By the previous lemma, independence of X and Y implies H (Y |X ) = H (Y ).


Hence
H (X , Y ) = H (X ) + H (Y |X ) = H (X ) + H (Y ).

Since conditioning never increases entropy, it follows that

H (X , Y ) = H (X ) + H (Y |X ) ≤ H (X ) + H (Y ). (2.1.8)

The above lemma tells us that equality holds for (2.1.8) only when X is independent
of Y .
A result similar to (2.1.8) also applies to the conditional entropy.

Lemma 2.14 Conditional entropy is lower additive; i.e.,

H (X1 , X2 |Y1 , Y2 ) ≤ H (X1 |Y1 ) + H (X2 |Y2 ).



Equality holds iff

PX1 ,X2 |Y1 ,Y2 (x1 , x2 |y1 , y2 ) = PX1 |Y1 (x1 |y1 )PX2 |Y2 (x2 |y2 )

for all x1 , x2 , y1 and y2 .

Proof Using the chain rule for conditional entropy and the fact that conditioning
reduces entropy, we can write

H (X1 , X2 |Y1 , Y2 ) = H (X1 |Y1 , Y2 ) + H (X2 |X1 , Y1 , Y2 )


≤ H (X1 |Y1 , Y2 ) + H (X2 |Y1 , Y2 ), (2.1.9)
≤ H (X1 |Y1 ) + H (X2 |Y2 ). (2.1.10)

For (2.1.9), equality holds iff X1 and X2 are conditionally independent given (Y1 , Y2 ):
PX1 ,X2 |Y1 ,Y2 (x1 , x2 |y1 , y2 ) = PX1 |Y1 ,Y2 (x1 |y1 , y2 )PX2 |Y1 ,Y2 (x2 |y1 , y2 ). For (2.1.10), equal-
ity holds iff X1 is conditionally independent of Y2 given Y1 (i.e., PX1 |Y1 ,Y2 (x1 |y1 , y2 ) =
PX1 |Y1 (x1 |y1 )), and X2 is conditionally independent of Y1 given Y2 (i.e., PX2 |Y1 ,Y2 (x2 |y1 ,
y2 ) = PX2 |Y2 (x2 |y2 )). Hence, the desired equality condition of the lemma is
obtained. 

2.2 Mutual Information

For two random variables X and Y , the mutual information between X and Y is the
reduction in the uncertainty of Y due to the knowledge of X (or vice versa). A dual
definition of mutual information states that it is the average amount of information
that Y has (or contains) about X or X has (or contains) about Y .
We can think of the mutual information between X and Y in terms of a channel
whose input is X and whose output is Y . Thereby the reduction of the uncertainty is
by definition the total uncertainty of X (i.e., H (X )) minus the uncertainty of X after
observing Y (i.e., H (X |Y )). Mathematically, it is

mutual information = I (X ; Y ) := H (X ) − H (X |Y ). (2.2.1)

It can be easily verified from (2.1.7) that mutual information is symmetric; i.e.,
I (X ; Y ) = I (Y ; X ).

2.2.1 Properties of Mutual Information


Lemma 2.15 1. I(X; Y) = ∑_{x∈X} ∑_{y∈Y} PX,Y(x, y) log2 [ PX,Y(x, y) / (PX(x) PY(y)) ].
2. I (X ; Y ) = I (Y ; X ) = H (Y ) − H (Y |X ).
3. I (X ; Y ) = H (X ) + H (Y ) − H (X , Y ).

Fig. 2.2 Relation between entropy and mutual information (Venn-diagram labels: H(X, Y); H(X); H(X|Y); I(X; Y); H(Y|X); H(Y))

4. I (X ; Y ) ≤ H (X ) with equality holding iff X is a function of Y (i.e., X = f (Y )


for some function f (·)).
5. I (X ; Y ) ≥ 0 with equality holding iff X and Y are independent.
6. I (X ; Y ) ≤ min{log2 |X |, log2 |Y|}.

Proof Properties 1, 2, 3, and 4 follow immediately from the definition. Property 5


is a direct consequence of Lemma 2.12. Property 6 holds iff I (X ; Y ) ≤ log2 |X |
and I (X ; Y ) ≤ log2 |Y|. To show the first inequality, we write I (X ; Y ) = H (X ) −
H (X |Y ), use the fact that H (X |Y ) is nonnegative and apply Lemma 2.6. A similar
proof can be used to show that I (X ; Y ) ≤ log2 |Y|. 

The relationships between H (X ), H (Y ), H (X , Y ), H (X |Y ), H (Y |X ), and I (X ; Y )


can be illustrated by the Venn diagram in Fig. 2.2.
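As a concrete check of Lemma 2.15, the following sketch (using the same kind of arbitrary illustrative joint pmf as before) computes I(X; Y) in three equivalent ways; all three values coincide:

    import math

    P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # illustrative joint pmf
    PX = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}
    PY = {y: P[(0, y)] + P[(1, y)] for y in (0, 1)}

    def H(vals):
        return -sum(p * math.log2(p) for p in vals if p > 0)

    H_X_given_Y = sum(PY[y] * H([P[(x, y)] / PY[y] for x in (0, 1)]) for y in (0, 1))
    I1 = H(PX.values()) - H_X_given_Y                          # H(X) - H(X|Y)
    I2 = H(PX.values()) + H(PY.values()) - H(P.values())       # H(X) + H(Y) - H(X,Y)
    I3 = sum(p * math.log2(p / (PX[x] * PY[y])) for (x, y), p in P.items())  # Property 1
    print(I1, I2, I3)                                          # three equal values, about 0.12 bits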

2.2.2 Conditional Mutual Information

The conditional mutual information, denoted by I(X; Y|Z), is defined as the common uncertainty between X and Y under the knowledge of Z:

I (X ; Y |Z) := H (X |Z) − H (X |Y , Z). (2.2.2)

Lemma 2.16 (Chain rule for mutual information) Defining the joint mutual infor-
mation between X and the pair (Y , Z) as in (2.2.1) by

I (X ; Y , Z) := H (X ) − H (X |Y , Z),

we have

I (X ; Y , Z) = I (X ; Y ) + I (X ; Z|Y ) = I (X ; Z) + I (X ; Y |Z).

Proof Without loss of generality, we only prove the first equality:

I (X ; Y , Z) = H (X ) − H (X |Y , Z)
= H (X ) − H (X |Y ) + H (X |Y ) − H (X |Y , Z)
= I (X ; Y ) + I (X ; Z|Y ).

The above lemma can be read as follows: the information that (Y , Z) has about X
is equal to the information that Y has about X plus the information that Z has about
X when Y is already known.

2.3 Properties of Entropy and Mutual Information for Multiple Random Variables

Theorem 2.17 (Chain rule for entropy) Let X1, X2, . . ., Xn be drawn according to PX^n(x^n) := PX1,...,Xn(x1, . . . , xn), where we use the common superscript notation to denote an n-tuple: X^n := (X1, . . . , Xn) and x^n := (x1, . . . , xn). Then

    H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, . . . , X1),

where H(Xi | Xi−1, . . . , X1) := H(X1) for i = 1. (The above chain rule can also be written as

    H(X^n) = ∑_{i=1}^{n} H(Xi | X^{i−1}),

where X^{i} := (X1, . . . , Xi).)

Proof From (2.1.6),

H (X1 , X2 , . . . , Xn ) = H (X1 , X2 , . . . , Xn−1 ) + H (Xn |Xn−1 , . . . , X1 ). (2.3.1)

Once again, applying (2.1.6) to the first term of the right-hand side of (2.3.1), we
have

H (X1 , X2 , . . . , Xn−1 ) = H (X1 , X2 , . . . , Xn−2 ) + H (Xn−1 |Xn−2 , . . . , X1 ).

The desired result can then be obtained by repeatedly applying (2.1.6). 



Theorem 2.18 (Chain rule for conditional entropy)

    H(X1, X2, . . . , Xn | Y) = ∑_{i=1}^{n} H(Xi | Xi−1, . . . , X1, Y).

Proof The theorem can be proved similarly to Theorem 2.17. 


If X n = (X1 , . . . , Xn ) and Y m = (Y1 , . . . , Ym ) are jointly distributed random
vectors (of not necessarily equal lengths), then their joint mutual information is
given by

I (X1 , . . . , Xn ; Y1 , . . . , Ym ) := H (X1 , . . . , Xn ) − H (X1 , . . . , Xn |Y1 , . . . , Ym ).

Theorem 2.19 (Chain rule for mutual information)

    I(X1, X2, . . . , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1),

where I(Xi; Y | Xi−1, . . . , X1) := I(X1; Y) for i = 1.


Proof This can be proved by first expressing mutual information in terms of entropy
and conditional entropy, and then applying the chain rules for entropy and conditional
entropy. 
Theorem 2.20 (Independence bound on entropy)

    H(X1, X2, . . . , Xn) ≤ ∑_{i=1}^{n} H(Xi).

Equality holds iff all the Xi's are independent of each other.8


Proof By applying the chain rule for entropy,

    H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, . . . , X1)
                          ≤ ∑_{i=1}^{n} H(Xi).

Equality holds iff each conditional entropy is equal to its associated entropy, that is, iff Xi is independent of (Xi−1, . . . , X1) for all i.  □

8 This condition is equivalent to requiring that Xi be independent of (Xi−1, . . . , X1) for all i. The equivalence can be directly proved using the chain rule for joint probabilities, i.e., PX^n(x^n) = ∏_{i=1}^{n} PXi|X^{i−1}(xi | x1^{i−1}); it is left as an exercise.

Theorem 2.21 (Bound on mutual information) If {(Xi, Yi)}_{i=1}^{n} is a process satisfying the conditional independence assumption PY^n|X^n = ∏_{i=1}^{n} PYi|Xi, then

    I(X1, . . . , Xn; Y1, . . . , Yn) ≤ ∑_{i=1}^{n} I(Xi; Yi),

with equality holding iff {Xi}_{i=1}^{n} are independent.

Proof From the independence bound on entropy, we have

    H(Y1, . . . , Yn) ≤ ∑_{i=1}^{n} H(Yi).

By the conditional independence assumption, we have

    H(Y1, . . . , Yn | X1, . . . , Xn) = E[ − log2 PY^n|X^n(Y^n | X^n) ]
                                      = E[ − log2 ∏_{i=1}^{n} PYi|Xi(Yi | Xi) ]
                                      = ∑_{i=1}^{n} H(Yi | Xi).

Hence,

    I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
                ≤ ∑_{i=1}^{n} H(Yi) − ∑_{i=1}^{n} H(Yi | Xi)
                = ∑_{i=1}^{n} I(Xi; Yi),

with equality holding iff {Yi}_{i=1}^{n} are independent, which holds iff {Xi}_{i=1}^{n} are independent.  □

2.4 Data Processing Inequality

Recalling that the Markov chain relationship X → Y → Z means that X and Z are
conditional independent given Y (cf. Appendix B), we have the following result.

Lemma 2.22 (Data processing inequality) (This is also called the data processing
lemma.) If X → Y → Z, then

I (X ; Y ) ≥ I (X ; Z).

Proof Since X → Y → Z, we directly have that I(X; Z|Y) = 0. By the chain rule for mutual information,

I (X ; Z) + I (X ; Y |Z) = I (X ; Y , Z) (2.4.1)
= I (X ; Y ) + I (X ; Z|Y )
= I (X ; Y ). (2.4.2)

Since I(X; Y|Z) ≥ 0, we obtain that I(X; Y) ≥ I(X; Z) with equality holding iff I(X; Y|Z) = 0.  □

The data processing inequality means that the mutual information will not increase
after processing. This result is somewhat counterintuitive since given two random
variables X and Y , we might believe that applying a well-designed processing scheme
to Y , which can be generally represented by a mapping g(Y ), could possibly increase
the mutual information. However, for any g(·), X → Y → g(Y ) forms a Markov
chain which implies that data processing cannot increase mutual information. A
communication context for the data processing lemma is depicted in Fig. 2.3, and
summarized in the next corollary.

Corollary 2.23 For jointly distributed random variables X and Y and any function
g(·), we have X → Y → g(Y ) and

I (X ; Y ) ≥ I (X ; g(Y )).

We also note that if Z obtains all the information about X through Y , then knowing
Z will not help increase the mutual information between X and Y ; this is formalized
in the following.

Corollary 2.24 If X → Y → Z, then

I (X ; Y |Z) ≤ I (X ; Y ).

Fig. 2.3 Communication context of the data processing lemma: a source output U is encoded into X, the channel output Y is decoded into V, and I(U; V) ≤ I(X; Y). "By processing, we can only reduce (mutual) information, but the processed information may be in a more useful form!"



Proof The proof directly follows from (2.4.1) and (2.4.2). 

It is worth pointing out that it is possible that I(X; Y|Z) > I(X; Y) when X, Y, and Z do not form a Markov chain. For example, let X and Y be independent
equiprobable binary zero-one random variables, and let Z = X + Y . Then,

I (X ; Y |Z) = H (X |Z) − H (X |Y , Z)
= H (X |Z)
= PZ (0)H (X |z = 0) + PZ (1)H (X |z = 1) + PZ (2)H (X |z = 2)
= 0 + 0.5 + 0
= 0.5 bits,

which is clearly larger than I (X ; Y ) = 0.
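As a quick sanity check on this counterexample, the short sketch below (illustrative only) computes I(X;Y|Z) by brute force for independent equiprobable bits X, Y with Z = X + Y.

import math
from collections import defaultdict

# Joint pmf of (X, Y, Z) with X, Y independent equiprobable bits and Z = X + Y.
p_xyz = {(x, y, x + y): 0.25 for x in (0, 1) for y in (0, 1)}

def cond_mutual_information(pxyz):
    """I(X;Y|Z) in bits via sum of p(x,y,z)*log2[p(x,y,z)p(z)/(p(x,z)p(y,z))]."""
    pz, pxz, pyz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in pxyz.items():
        pz[z] += p
        pxz[(x, z)] += p
        pyz[(y, z)] += p
    return sum(p * math.log2(p * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in pxyz.items() if p > 0)

print(cond_mutual_information(p_xyz))  # 0.5 bits, while I(X;Y) = 0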


Finally, we observe that we can extend the data processing inequality for a
sequence of random variables forming a Markov chain (see (B.3.5) in Appendix B
for the definition) as follows.

Corollary 2.25 If X1 → X2 → · · · → Xn , then for any i, j, k, l such that 1 ≤ i ≤ j ≤ k ≤ l ≤ n, we have that

I (Xi ; Xl ) ≤ I (Xj ; Xk ).

2.5 Fano’s Inequality

Fano’s inequality [113, 114] is a useful tool widely employed in information theory to
prove converse results for coding theorems (as we will see in the following chapters).

Lemma 2.26 (Fano’s inequality) Let X and Y be two random variables, correlated
in general, with alphabets X and Y, respectively, where X is finite but Y can be
countably infinite. Let X̂ := g(Y ) be an estimate of X from observing Y , where
g : Y → X is a given estimation function. Define the probability of error as

Pe := Pr[X̂ ≠ X].

Then the following inequality holds

H (X |Y ) ≤ hb (Pe ) + Pe · log2 (|X | − 1), (2.5.1)

where hb (x) := −x log2 x − (1 − x) log2 (1 − x) for 0 ≤ x ≤ 1 is the binary entropy


function (see Example 2.3).

Fig. 2.4 Permissible (Pe, H(X|Y)) region due to Fano's inequality: H(X|Y) (vertical axis, with reference levels log2(|X|−1) and log2(|X|)) is plotted against Pe (horizontal axis, from 0 to 1); the boundary curve attains its maximum log2(|X|) at Pe = (|X|−1)/|X|.

Observation 2.27
• Note that when Pe = 0, we obtain that H (X |Y ) = 0 (see (2.5.1)) as intuition
suggests, since if Pe = 0, then X̂ = g(Y ) = X (with probability 1) and thus
H (X |Y ) = H (g(Y )|Y ) = 0.
• Fano’s inequality yields upper and lower bounds on Pe in terms of H (X |Y ). This is
illustrated in Fig. 2.4, where we plot the region for the pairs (Pe , H (X |Y )) that are
permissible under Fano’s inequality. In the figure, the boundary of the permissible
(dashed) region is given by the function

f (Pe ) := hb (Pe ) + Pe · log2 (|X | − 1),

the right-hand side of (2.5.1). We obtain that when

log2 (|X | − 1) < H (X |Y ) ≤ log2 (|X |),

Pe can be upper and lower bounded as follows:

0 < inf{a : f (a) ≥ H (X |Y )} ≤ Pe ≤ sup{a : f (a) ≥ H (X |Y )} < 1.

Furthermore, when
0 < H (X |Y ) ≤ log2 (|X | − 1),

only the lower bound holds:

Pe ≥ inf{a : f (a) ≥ H (X |Y )} > 0.

Thus for all nonzero values of H (X |Y ), we obtain a lower bound (of the same
form above) on Pe ; the bound implies that if H (X |Y ) is bounded away from zero,
Pe is also bounded away from zero.
• A weaker but simpler version of Fano’s inequality can be directly obtained from
(2.5.1) by noting that hb (Pe ) ≤ 1:

H (X |Y ) ≤ 1 + Pe log2 (|X | − 1), (2.5.2)

which in turn yields that

    Pe ≥ (H(X|Y) − 1) / log2(|X| − 1)     (for |X| > 2),

which is weaker than the above lower bound on Pe .
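As an illustration of how Fano's inequality is used to lower bound the error probability, the following sketch (the values of H(X|Y) and |X| are arbitrary assumptions) evaluates f(Pe) = hb(Pe) + Pe·log2(|X|−1) on a fine grid, numerically recovers inf{a : f(a) ≥ H(X|Y)}, and compares it with the weaker bound (H(X|Y) − 1)/log2(|X|−1).

import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_lower_bound(h_cond, alphabet_size, grid=100000):
    """Smallest a with hb(a) + a*log2(|X|-1) >= h_cond (numerical inversion)."""
    f = lambda a: hb(a) + a * math.log2(alphabet_size - 1)
    for i in range(grid + 1):
        a = i / grid
        if f(a) >= h_cond:
            return a
    return 1.0

h_cond, m = 1.2, 4                              # assumed: H(X|Y) = 1.2 bits, |X| = 4
print(fano_lower_bound(h_cond, m))              # lower bound on Pe from (2.5.1)
print((h_cond - 1) / math.log2(m - 1))          # weaker bound from (2.5.2)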


Proof of Lemma 2.26 Define a new random variable,

    E := 1, if g(Y) ≠ X;   E := 0, if g(Y) = X.

Then using the chain rule for conditional entropy, we obtain

H (E, X |Y ) = H (X |Y ) + H (E|X , Y )
= H (E|Y ) + H (X |E, Y ).

Observe that E is a function of X and Y ; hence, H (E|X , Y ) = 0. Since condi-


tioning never increases entropy, H (E|Y ) ≤ H (E) = hb (Pe ). The remaining term,
H (X |E, Y ), can be bounded as follows:

H (X |E, Y ) = Pr[E = 0]H (X |Y , E = 0) + Pr[E = 1]H (X |Y , E = 1)


≤ (1 − Pe ) · 0 + Pe · log2 (|X | − 1),

since X = g(Y ) for E = 0, and given E = 1, we can upper bound the conditional
entropy by the logarithm of the number of remaining outcomes, i.e., (|X | − 1).
Combining these results completes the proof. 
Fano’s inequality cannot be improved in the sense that the lower bound, H (X |Y ),
can be achieved for some specific cases. Any bound that can be achieved in some cases
is often referred to as sharp.9 From the proof of the above lemma, we can observe

9 Definition. A bound is said to be sharp if the bound is achievable for some specific cases. A bound

is said to be tight if the bound is achievable for all cases.



that equality holds in Fano’s inequality, if H (E|Y ) = H (E) and H (X |Y , E = 1) =


log2 (|X | − 1). The former is equivalent to E being independent of Y , and the latter
holds iff PX |Y (·|y) is uniformly distributed over the set X \ {g(y)}. We can therefore
create an example in which equality holds in Fano’s inequality.

Example 2.28 Suppose that X and Y are two independent random variables which
are both uniformly distributed on the alphabet {0, 1, 2}. Let the estimating function
be given by g(y) = y. Then


    Pe = Pr[g(Y) ≠ X] = Pr[Y ≠ X] = 1 − ∑_{x=0}^{2} P_X(x) P_Y(x) = 2/3.

In this case, equality is achieved in Fano’s inequality, i.e.,


 
    hb(2/3) + (2/3) · log2(3 − 1) = H(X|Y) = H(X) = log2 3.

To conclude this section, we present an alternative proof for Fano’s inequality to


illustrate the use of the data processing inequality and the FI Lemma.
Alternative Proof of Fano’s inequality: Noting that X → Y → X̂ form a Markov
chain, we directly obtain via the data processing inequality that

I (X ; Y ) ≥ I (X ; X̂ ),

which implies that

H (X |Y ) ≤ H (X |X̂ ).

Thus, if we show that H (X |X̂ ) is no larger than the right-hand side of (2.5.1), the
proof of (2.5.1) is complete.
Noting that

    Pe = ∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_{X,X̂}(x, x̂)

and

    1 − Pe = ∑_{x∈X} ∑_{x̂∈X: x̂=x} P_{X,X̂}(x, x̂) = ∑_{x∈X} P_{X,X̂}(x, x),

we obtain that

H(X|X̂) − hb(Pe) − Pe log2(|X| − 1)

    = ∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_{X,X̂}(x, x̂) log2 [1 / P_{X|X̂}(x|x̂)] + ∑_{x∈X} P_{X,X̂}(x, x) log2 [1 / P_{X|X̂}(x|x)]
      − [∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_{X,X̂}(x, x̂)] log2 [(|X| − 1) / Pe] + [∑_{x∈X} P_{X,X̂}(x, x)] log2 (1 − Pe)

    = ∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_{X,X̂}(x, x̂) log2 [Pe / (P_{X|X̂}(x|x̂)(|X| − 1))]
      + ∑_{x∈X} P_{X,X̂}(x, x) log2 [(1 − Pe) / P_{X|X̂}(x|x)]                              (2.5.3)

    ≤ log2(e) ∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_{X,X̂}(x, x̂) [Pe / (P_{X|X̂}(x|x̂)(|X| − 1)) − 1]
      + log2(e) ∑_{x∈X} P_{X,X̂}(x, x) [(1 − Pe) / P_{X|X̂}(x|x) − 1]

    = log2(e) [ (Pe / (|X| − 1)) ∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_X̂(x̂) − ∑_{x∈X} ∑_{x̂∈X: x̂≠x} P_{X,X̂}(x, x̂) ]
      + log2(e) [ (1 − Pe) ∑_{x∈X} P_X̂(x) − ∑_{x∈X} P_{X,X̂}(x, x) ]

    = log2(e) [ (Pe / (|X| − 1)) (|X| − 1) − Pe ] + log2(e) [ (1 − Pe) − (1 − Pe) ]

    = 0,

where the inequality follows by applying the FI Lemma to each logarithm term in
(2.5.3). 

2.6 Divergence and Variational Distance

In addition to the probabilistically defined entropy and mutual information, another


measure that is frequently considered in information theory is divergence. In this
section, we define this measure and study its statistical properties.

Definition 2.29 (Divergence) Given two discrete random variables X and X̂ defined
over a common alphabet X , the divergence or the Kullback–Leibler divergence or dis-

tance10 (other names are relative entropy and discrimination) is denoted by D(X X̂ )
or D(P_X‖P_X̂) and defined by11

    D(X‖X̂) = D(P_X‖P_X̂) := E_X[ log2 (P_X(X) / P_X̂(X)) ] = ∑_{x∈X} P_X(x) log2 [P_X(x) / P_X̂(x)].

In other words, the divergence D(PX PX̂ ) is the expectation (with respect to
PX ) of the log-likelihood ratio log2 [PX /PX̂ ] of distribution PX against distribution
PX̂ . D(X X̂ ) can be viewed as a measure of “distance” or “dissimilarity” between
distributions PX and PX̂ . D(X X̂ ) is also called relative entropy since it can be
regarded as a measure of the inefficiency of mistakenly assuming that the distribution
of a source is PX̂ when the true distribution is PX . For example, if we know the true
distribution PX of a source, then we can construct a lossless data compression code
with average codeword length achieving entropy H (X ) (this will be studied in the
next chapter). If, however, we mistakenly thought that the “true” distribution is PX̂
and employ the “best” code corresponding to PX̂ , then the resultant average codeword
length becomes

    ∑_{x∈X} [−P_X(x) · log2 P_X̂(x)].

As a result, the difference between the resultant average codeword length and
H(X) is the relative entropy D(X‖X̂). Hence, divergence is a measure of the system
cost (e.g., storage consumed) paid due to mis-classifying the system statistics.
Note that when computing divergence, we follow the convention that

    0 · log2(0/p) = 0   and   p · log2(p/0) = ∞   for p > 0.
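The divergence is straightforward to compute numerically; the sketch below (with a made-up pair of distributions) implements the definition together with the above conventions for zero probabilities.

import math

def kl_divergence(p, q):
    """D(p || q) in bits for pmfs given as dicts over a common alphabet,
    using the conventions 0*log2(0/q) = 0 and p*log2(p/0) = +infinity."""
    d = 0.0
    for x, px in p.items():
        if px == 0:
            continue                      # 0 * log2(0/q) = 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf               # p * log2(p/0) = +infinity
        d += px * math.log2(px / qx)
    return d

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}      # illustrative distributions
q = {'a': 0.25, 'b': 0.25, 'c': 0.5}
print(kl_divergence(p, q), kl_divergence(q, p))  # generally not equal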

We next present some properties of the divergence and discuss its relation with
entropy and mutual information.

Lemma 2.30 (Nonnegativity of divergence)

D(X X̂ ) ≥ 0,

with equality iff PX (x) = PX̂ (x) for all x ∈ X (i.e., the two distributions are equal).

10 As noted in Footnote 1, this measure was originally introduced by Kullback and Leibler [236,
237].
11 In order to be consistent with the units (in bits) adopted for entropy and mutual information, we

will also use the base-2 logarithm for divergence unless otherwise specified.

Proof

    D(X‖X̂) = ∑_{x∈X} P_X(x) log2 [P_X(x) / P_X̂(x)]
            ≥ [∑_{x∈X} P_X(x)] log2 [∑_{x∈X} P_X(x) / ∑_{x∈X} P_X̂(x)]
            = 0,

where the second step follows from the log-sum inequality with equality holding iff
for every x ∈ X,

    P_X(x) / P_X̂(x) = ∑_{a∈X} P_X(a) / ∑_{b∈X} P_X̂(b) = 1,

or equivalently P_X(x) = P_X̂(x) for all x ∈ X. □

Lemma 2.31 (Mutual information and divergence)

I (X ; Y ) = D(PX ,Y PX × PY ),

where PX ,Y (·, ·) is the joint distribution of the random variables X and Y and PX (·)
and PY (·) are the respective marginals.

Proof The observation follows directly from the definitions of divergence and mutual
information. 

Definition 2.32 (Refinement of distribution) Given the distribution PX on X , divide


X into k mutually disjoint sets, U1, U2, . . . , Uk, satisfying

    X = ∪_{i=1}^{k} U_i.

Define a new distribution PU on U = {1, 2, . . . , k} as

    P_U(i) = ∑_{x∈U_i} P_X(x).

Then PX is called a refinement (or more specifically, a k-refinement) of PU .

Let us briefly discuss the relation between the processing of information and its
refinement. Processing of information can be modeled as a (many-to-one) mapping,
and refinement is actually the reverse operation. Recall that the data processing
lemma shows that mutual information can never increase due to processing. Hence,
if one wishes to increase mutual information, one should “anti-process” (or refine) the
involved statistics.

From Lemma 2.31, the mutual information can be viewed as the divergence of
a joint distribution against the product distribution of the marginals. It is therefore
reasonable to expect that a similar effect due to processing (or a reverse effect due
to refinement) should also apply to divergence. This is shown in the next lemma.
Lemma 2.33 (Refinement cannot decrease divergence) Let PX and PX̂ be the refine-
ments (k-refinements) of PU and PÛ respectively. Then

D(PX PX̂ ) ≥ D(PU PÛ ).

Proof By the log-sum inequality, we obtain that for any i ∈ {1, 2, . . . , k}

    ∑_{x∈U_i} P_X(x) log2 [P_X(x) / P_X̂(x)] ≥ [∑_{x∈U_i} P_X(x)] log2 [∑_{x∈U_i} P_X(x) / ∑_{x∈U_i} P_X̂(x)]
                                             = P_U(i) log2 [P_U(i) / P_Û(i)],                  (2.6.1)

with equality iff

    P_X(x) / P_X̂(x) = P_U(i) / P_Û(i)

for all x ∈ U_i. Hence,

    D(P_X‖P_X̂) = ∑_{i=1}^{k} ∑_{x∈U_i} P_X(x) log2 [P_X(x) / P_X̂(x)]
               ≥ ∑_{i=1}^{k} P_U(i) log2 [P_U(i) / P_Û(i)]
               = D(P_U‖P_Û),

with equality iff

    P_X(x) / P_X̂(x) = P_U(i) / P_Û(i)

for all i and x ∈ U_i. □
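The lemma is easy to probe numerically. The sketch below (with an arbitrary pair of distributions and an arbitrary partition) computes D(P_X‖P_X̂) and the divergence of the induced coarse-grained distributions, confirming that coarse-graining (the reverse of refinement) cannot increase divergence.

import math

def kl_bits(p, q):
    """D(p||q) in bits for two lists of probabilities on the same alphabet."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions on {0,1,2,3} and a partition U1 = {0,1}, U2 = {2,3}.
p_x  = [0.4, 0.1, 0.3, 0.2]
p_xh = [0.2, 0.3, 0.1, 0.4]
partition = [[0, 1], [2, 3]]

p_u  = [sum(p_x[i]  for i in cell) for cell in partition]
p_uh = [sum(p_xh[i] for i in cell) for cell in partition]

print(kl_bits(p_x, p_xh))   # D(P_X || P_Xhat)
print(kl_bits(p_u, p_uh))   # D(P_U || P_Uhat) <= D(P_X || P_Xhat)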


Observation 2.34 One drawback of adopting the divergence as a measure between
two distributions is that it does not meet the symmetry requirement of a true dis-
tance,12 since interchanging its two arguments may yield different quantities. In

12 Given a non-empty set A, the function d : A × A → [0, ∞) is called a distance or metric if it


satisfies the following properties.
1. Nonnegativity: d(a, b) ≥ 0 for every a, b ∈ A with equality holding iff a = b.
2. Symmetry: d(a, b) = d(b, a) for every a, b ∈ A.
3. Triangle inequality: d(a, b) + d(b, c) ≥ d(a, c) for every a, b, c ∈ A.

other words, D(PX PX̂ ) = D(PX̂ PX ) in general. (It also does not satisfy the trian-
gle inequality.) Thus, divergence is not a true distance or metric. Another measure
which is a true distance, called variational distance, is sometimes used instead.

Definition 2.35 (Variational distance) The variational distance (also known as


the L1 -distance) between two distributions PX and PX̂ with common alphabet X is
defined by

    ‖P_X − P_X̂‖ := ∑_{x∈X} |P_X(x) − P_X̂(x)|.

Lemma 2.36 The variational distance satisfies

    ‖P_X − P_X̂‖ = 2 · sup_{E⊂X} |P_X(E) − P_X̂(E)| = 2 · ∑_{x∈X: P_X(x)>P_X̂(x)} [P_X(x) − P_X̂(x)].

" #
Proof We first show that PX −PX̂  = 2· x∈X :PX (x)>PX̂ (x) PX (x)−PX̂ (x) . Setting
A := {x ∈ X : PX (x) > PX̂ (x)}, we have
 
PX − PX̂  = PX (x) − P (x)

x∈X
   
= PX (x) − P (x) + PX (x) − P (x)
X̂ X̂
x∈A x∈Ac
" # " #
= PX (x) − PX̂ (x) + PX̂ (x) − PX (x)
x∈A x∈Ac
" # " # " #
= PX (x) − PX̂ (x) + PX̂ Ac − PX Ac
x∈A
" #
= PX (x) − PX̂ (x) + PX (A) − PX̂ (A)
x∈A
" # " #
= PX (x) − PX̂ (x) + PX (x) − PX̂ (x)
x∈A x∈A
" #
= 2· PX (x) − PX̂ (x) ,
x∈A

where Ac denotes the complement set of A.  


We next prove that PX − PX̂  = 2 · supE⊂X PX (E) − PX̂ (E) by showing that
each quantity is greater than or equal to the other. For any set E ⊂ X , we can write




PX − PX̂  = |PX (x) − PX̂ (x)|
x∈X
 
= |PX (x) − PX̂ (x)| + |PX (x) − PX̂ (x)|
x∈E x∈E c
   
      

≥ PX (x) − PX̂ (x)  +  P (x) − PX̂ (x) 
   c X 
x∈E x∈E
= |PX (E) − PX̂ (E)| + |PX (E c ) − PX̂ (E c )|
= |PX (E) − PX̂ (E)| + |PX̂ (E) − PX (E)|
= 2 · |PX (E) − PX̂ (E)|.
 
Thus PX − PX̂  ≥ 2 · supE⊂X PX (E) − PX̂ (E). Conversely, we have that
 
2 · sup PX (E) − PX̂ (E) ≥ 2 · |PX (A) − PX̂ (A)|
E⊂X
= |PX (A) − PX̂ (A)| + |PX̂ (Ac ) − PX (Ac )|
   
      

= PX (x) − PX̂ (x)  +  P (x) − PX (x) 
   c X̂ 
x∈A x∈A
 
= |PX (x) − PX̂ (x)| + |PX (x) − PX̂ (x)|
x∈A x∈Ac
= PX − PX̂ .
 
Therefore, PX − PX̂  = 2 · supE⊂X PX (E) − PX̂ (E) . 
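Lemma 2.36 can be checked by brute force on a small alphabet; the sketch below (distributions chosen arbitrarily) compares the L1 sum, twice the maximum over all events, and twice the sum over the symbols where P_X exceeds P_X̂.

from itertools import chain, combinations

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}       # illustrative distributions
q = {'a': 0.2, 'b': 0.3, 'c': 0.5}

l1 = sum(abs(p[x] - q[x]) for x in p)

events = chain.from_iterable(combinations(p, r) for r in range(len(p) + 1))
two_sup = 2 * max(abs(sum(p[x] for x in e) - sum(q[x] for x in e)) for e in events)

two_pos = 2 * sum(p[x] - q[x] for x in p if p[x] > q[x])

print(l1, two_sup, two_pos)              # all three coincide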
Lemma 2.37 (Variational distance vs divergence: Pinsker’s inequality)

    D(X‖X̂) ≥ (log2(e) / 2) · ‖P_X − P_X̂‖².
This result is referred to as Pinsker’s inequality.
Proof
1. With A := {x ∈ X : PX (x) > PX̂ (x)}, we have from the previous lemma that

PX − PX̂  = 2[PX (A) − PX̂ (A)].

2. Define two random variables U and Û as

1, if X ∈ A,
U=
0, if X ∈ Ac ,

and
1, if X̂ ∈ A,
Û =
0, if X̂ ∈ Ac .

Then PX and PX̂ are refinements (2-refinements) of PU and PÛ , respectively.


From Lemma 2.33, we obtain that

D(PX PX̂ ) ≥ D(PU PÛ ).

3. The proof is complete if we show that

D(PU PÛ ) ≥ 2 log2 (e)[PX (A) − PX̂ (A)]2


= 2 log2 (e)[PU (1) − PÛ (1)]2 .

For ease of notations, let p = PU (1) and q = PÛ (1). Then proving the above
inequality is equivalent to showing that

    p · ln(p/q) + (1 − p) · ln((1 − p)/(1 − q)) ≥ 2(p − q)².

Define
    f(p, q) := p · ln(p/q) + (1 − p) · ln((1 − p)/(1 − q)) − 2(p − q)²,

and observe that


 
    df(p, q)/dq = (p − q) [4 − 1/(q(1 − q))] ≤ 0   for q ≤ p.

Thus, f (p, q) is non-increasing in q for q ≤ p. Also note that f (p, q) = 0 for


q = p. Therefore,
f (p, q) ≥ 0 for q ≤ p.

The proof is completed by noting that

f (p, q) ≥ 0 for q ≥ p,

since f (1 − p, 1 − q) = f (p, q).
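A quick numerical check of Pinsker's inequality (sketch only; the random distributions below are generated on the fly and carry no special meaning):

import math, random

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_pmf(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
for _ in range(1000):
    p, q = random_pmf(5), random_pmf(5)
    lhs = kl_bits(p, q)
    rhs = (math.log2(math.e) / 2) * sum(abs(a - b) for a, b in zip(p, q)) ** 2
    assert lhs >= rhs - 1e-12   # Pinsker's inequality holds
print("Pinsker's inequality verified on 1000 random pairs")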




Observation 2.38 The above lemma tells us that for a sequence of distributions
{(PXn , PX̂n )}n≥1 , when D(PXn PX̂n ) goes to zero as n goes to infinity, PXn − PX̂n 
goes to zero as well. But the converse does not necessarily hold. For a quick coun-
terexample, let
    P_{X_n}(0) = 1 − P_{X_n}(1) = 1/n > 0

and
PX̂n (0) = 1 − PX̂n (1) = 0.

In this case,
D(PXn PX̂n ) → ∞

since by convention, (1/n) · log2((1/n)/0) = ∞. However,

    ‖P_{X_n} − P_{X̂_n}‖ = 2 [ P_{X_n}({x : P_{X_n}(x) > P_{X̂_n}(x)}) − P_{X̂_n}({x : P_{X_n}(x) > P_{X̂_n}(x)}) ] = 2/n → 0.
We however can upper bound D(PX PX̂ ) by the variational distance between PX and
PX̂ when D(PX PX̂ ) < ∞.

Lemma 2.39 If D(PX PX̂ ) < ∞, then

    D(P_X‖P_X̂) ≤ [ log2(e) / min_{x∈X: P_X(x)>0} min{P_X(x), P_X̂(x)} ] · ‖P_X − P_X̂‖.

Proof Without loss of generality, we assume that PX (x) > 0 for all x ∈ X . Since
D(PX PX̂ ) < ∞, we have that PX (x) > 0 implies that PX̂ (x) > 0. Let

    t := min_{x∈X: P_X(x)>0} min{P_X(x), P_X̂(x)}.

Then for all x ∈ X,

    ln [P_X(x) / P_X̂(x)] ≤ | ln [P_X(x) / P_X̂(x)] |
                         ≤ [ max_{min{P_X(x),P_X̂(x)} ≤ s ≤ max{P_X(x),P_X̂(x)}} |d ln(s)/ds| ] · |P_X(x) − P_X̂(x)|
                         = [ 1 / min{P_X(x), P_X̂(x)} ] · |P_X(x) − P_X̂(x)|
                         ≤ (1/t) · |P_X(x) − P_X̂(x)|.

Hence,

    D(P_X‖P_X̂) = log2(e) ∑_{x∈X} P_X(x) · ln [P_X(x) / P_X̂(x)]
               ≤ [log2(e) / t] ∑_{x∈X} P_X(x) · |P_X(x) − P_X̂(x)|
               ≤ [log2(e) / t] ∑_{x∈X} |P_X(x) − P_X̂(x)|
               = [log2(e) / t] · ‖P_X − P_X̂‖.  □

The next lemma discusses the effect of side information on divergence. As stated in
Lemma 2.12, side information usually reduces entropy; it, however, increases diver-
gence. One interpretation of these results is that side information is useful. Regarding
entropy, side information provides us more information, so uncertainty decreases.
As for divergence, it measures how easily one can differentiate between two candidate
distributions for the source. The larger the divergence, the easier it is to tell the two
distributions apart and make the right guess. In the extreme case when the
divergence is zero, one can never tell which distribution is the right one, since both
produce the same source. So, when we obtain more information (side information),
we should be able to make a better decision on the source statistics, which implies
that the divergence should be larger.
Definition 2.40 (Conditional divergence) Given three discrete random variables, X ,
X̂ , and Z, where X and X̂ have a common alphabet X , we define the conditional
divergence between X and X̂ given Z by

    D(X‖X̂|Z) = D(P_{X|Z}‖P_{X̂|Z}|P_Z) := ∑_{z∈Z} P_Z(z) ∑_{x∈X} P_{X|Z}(x|z) log [P_{X|Z}(x|z) / P_{X̂|Z}(x|z)]
                                        = ∑_{z∈Z} ∑_{x∈X} P_{X,Z}(x, z) log [P_{X|Z}(x|z) / P_{X̂|Z}(x|z)].

In other words, it is the conditional divergence between P_{X|Z} and P_{X̂|Z} given P_Z, and it is nothing but the expected value with respect to P_{X,Z} of the log-likelihood ratio log [P_{X|Z} / P_{X̂|Z}].
Similarly, the conditional divergence between P_{X|Z} and P_X̂ given P_Z is defined as

    D(P_{X|Z}‖P_X̂|P_Z) := ∑_{z∈Z} P_Z(z) ∑_{x∈X} P_{X|Z}(x|z) log [P_{X|Z}(x|z) / P_X̂(x)].

Lemma 2.41 (Conditional mutual information and conditional divergence) Given


three discrete random variables X , Y and Z with alphabets X , Y and Z, respectively,
and joint distribution PX ,Y ,Z , we have

    I(X;Y|Z) = D(P_{X,Y|Z}‖P_{X|Z}P_{Y|Z}|P_Z)
             = ∑_{x∈X} ∑_{y∈Y} ∑_{z∈Z} P_{X,Y,Z}(x, y, z) log2 [ P_{X,Y|Z}(x, y|z) / (P_{X|Z}(x|z)P_{Y|Z}(y|z)) ],

where PX ,Y |Z is the conditional joint distribution of X and Y given Z, and PX |Z and


PY |Z are the conditional distributions of X and Y , respectively, given Z.

Proof The proof follows directly from the definition of conditional mutual informa-
tion (2.2.2) and the above definition of conditional divergence. 
Lemma 2.42 (Chain rule for divergence)
Let PX n and QX n be two joint distributions on X n . We have that

D(PX1 ,X2 QX1 ,X2 ) = D(PX1 QX1 ) + D(PX2 |X1 QX2 |X1 |PX1 ),

and more generally,

    D(P_{X^n}‖Q_{X^n}) = ∑_{i=1}^{n} D(P_{X_i|X^{i−1}}‖Q_{X_i|X^{i−1}}|P_{X^{i−1}}),

where D(PXi |X i−1 QXi |X i−1 |PX i−1 ) := D(PX1 QX1 ) for i = 1.
Proof The proof readily follows from the above divergence definitions. 
Lemma 2.43 (Conditioning never decreases divergence) For three discrete random
variables, X , X̂ , and Z, where X and X̂ have a common alphabet X , we have that

D(PX |Z PX̂ |Z |PZ ) ≥ D(PX PX̂ ).

Proof

D(PX |Z PX̂ |Z |PZ ) − D(PX PX̂ )


 PX |Z (x|z)  PX (x)
= PX ,Z (x, z) · log2 − PX (x) · log2
z∈Z x∈X
PX̂ |Z (x|z) x∈X PX̂ (x)


 PX |Z (x|z)   PX (x)
= PX ,Z (x, z) · log2 − PX ,Z (x, z) · log2
z∈Z x∈X
PX̂ |Z (x|z) x∈X z∈Z PX̂ (x)
 PX |Z (x|z)PX̂ (x)
= PX ,Z (x, z) · log2
z∈Z x∈X
PX̂ |Z (x|z)PX (x)

  
PX̂ |Z (x|z)PX (x)
≥ PX ,Z (x, z) · log2 (e) 1 − (by the FI Lemma)
z∈Z x∈X
PX |Z (x|z)PX̂ (x)


 PX (x) 
= log2 (e) 1 − PZ (z)PX̂ |Z (x|z)
P (x) z∈Z
x∈X X̂


 PX (x)
= log2 (e) 1 − P (x)
P (x) X̂
x∈X X̂



= log2 (e) 1 − PX (x) = 0,
x∈X

with equality holding iff for all x and z,

PX (x) PX |Z (x|z)
= .
PX̂ (x) PX̂ |Z (x|z)

Note that it is not necessary that

D(PX |Z PX̂ |Ẑ |PZ ) ≥ D(PX PX̂ ),

where Z and Ẑ also have a common alphabet. In other words, side information
is helpful for divergence only when it provides information on the similarity or
difference of the two distributions. In the above case, Z only provides information
about X , and Ẑ provides information about X̂ ; so the divergence certainly cannot be
expected to increase. The next lemma shows that if the pair (Z, Ẑ) is independent
component-wise of the pair (X , X̂ ), then the side information of (Z, Ẑ) does not help
in improving the divergence of X against X̂ .

Lemma 2.44 (Independent side information does not change divergence) If X is


independent of Z and X̂ is independent of Ẑ, where X and Z share a common
alphabet with X̂ and Ẑ, respectively, then

D(PX |Z PX̂ |Ẑ |PZ ) = D(PX PX̂ ).

Proof This can be easily justified from the divergence definitions. 

Corollary 2.45 (Additivity of divergence under independence) If X is independent


of Z and X̂ is independent of Ẑ, where X and Z share a common alphabet with X̂
and Ẑ, respectively, then

D(PX ,Z PX̂ ,Ẑ ) = D(PX PX̂ ) + D(PZ PẐ ).



2.7 Convexity/Concavity of Information Measures

We next address the convexity/concavity properties of information measures with


respect to the distributions on which they are defined. Such properties will be useful
when optimizing the information measures over distribution spaces.
Lemma 2.46
1. H(P_X) is a concave function of P_X, namely

       H(λP_X + (1 − λ)P_X̃) ≥ λH(P_X) + (1 − λ)H(P_X̃)

   for all λ ∈ [0, 1].
2. Noting that I(X;Y) can be rewritten as I(P_X, P_{Y|X}), where

       I(P_X, P_{Y|X}) := ∑_{x∈X} ∑_{y∈Y} P_{Y|X}(y|x)P_X(x) log2 [ P_{Y|X}(y|x) / ∑_{a∈X} P_{Y|X}(y|a)P_X(a) ],

   then I(X;Y) is a concave function of P_X (for fixed P_{Y|X}), and a convex function of P_{Y|X} (for fixed P_X).
3. D(P_X‖P_X̂) is convex with respect to both the first argument P_X and the second argument P_X̂. It is also convex in the pair (P_X, P_X̂); i.e., if (P_X, P_X̂) and (Q_X, Q_X̂) are two pairs of probability mass functions, then

       D(λP_X + (1 − λ)Q_X ‖ λP_X̂ + (1 − λ)Q_X̂) ≤ λ · D(P_X‖P_X̂) + (1 − λ) · D(Q_X‖Q_X̂),      (2.7.1)

   for all λ ∈ [0, 1].


Proof 1. The proof uses the log-sum inequality:
 
H (λPX + (1 − λ)PX$ ) − λH (PX ) + (1 − λ)H (PX$ )
 PX (x)
=λ PX (x) log2
x∈X
λP X (x) + (1 − λ)PX$ (x)
 PX$ (x)
+ (1 − λ) PX$ (x) log2
x∈X
λPX (x) + (1 − λ)PX$ (x)



x∈X PX (x)
≥λ PX (x) log2
x∈X x∈X [λP X (x) + (1 − λ)PX$ (x)]


 $ (x)
x∈X PX
+ (1 − λ) PX$ (x) log2
x∈X x∈X [λP X (x) + (1 − λ)PX$ (x)]
= 0,

with equality holding iff PX (x) = PX$ (x) for all x.


2. We first show the concavity of I (PX , PY |X ) with respect to PX . Let λ̄ = 1 − λ.

I (λPX + λ̄PX$ , PY |X ) − λI (PX , PY |X ) −


λ̄I (PX$ , PY |X )
PX (x)PY |X (y|x)
 x∈X
=λ PX (x)PY |X (y|x) log2 
y∈Y x∈X [λPX (x) + λ̄PX$ (x)]PY |X (y|x)]
x∈X 
PX$ (x)PY |X (y|x)
 x∈X
+ λ̄ PX$ (x)PY |X (y|x) log2 
y∈Y x∈X [λPX (x) + λ̄PX$ (x)]PY |X (y|x)]
x∈X
≥ 0, (by the log-sum inequality)

with equality holding iff


 
PX (x)PY |X (y|x) = PX$ (x)PY |X (y|x)
x∈X x∈X

for all y ∈ Y. We now turn to the convexity of I (PX , PY |X ) with respect to


PY |X . For ease of notation, let PYλ (y) := λPY (y) + λ̄P$
Y (y), and PYλ |X (y|x) :=
λPY |X (y|x) + λ̄P$Y |X (y|x). Then

λI (PX , PY |X ) + λ̄I (PX , P$


Y |X ) − I (PX , λPY |X + λ̄P$
Y |X )
 PY |X (y|x)
=λ PX (x)PY |X (y|x) log2
x∈X y∈Y
PY (y)
 Y |X (y|x)
P$
+λ̄ PX (x)P$
Y |X (y|x) log2
x∈X y∈Y
P$ Y (y)
 PYλ |X (y|x)
− PX (x)PYλ |X (y|x) log2
x∈X y∈Y
PYλ (y)
 PY |X (y|x)PYλ (y)
=λ PX (x)PY |X (y|x) log2
x∈X y∈Y
PY (y)PYλ |X (y|x)
 Y |X (y|x)PYλ (y)
P$
+λ̄ PX (x)P$
Y |X (y|x) log2
x∈X y∈Y
PY (y)PYλ |X (y|x)
$

  
PY (y)PYλ |X (y|x)
≥ λ log2 (e) PX (x)PY |X (y|x) 1 −
x∈X y∈Y
PY |X (y|x)PYλ (y)
  
P$Y (y)PYλ |X (y|x)
+λ̄ log2 (e) PX (x)P$Y |X (y|x) 1 −
x∈X y∈Y
P$Y |X (y|x)PYλ (y)

= 0,

where the inequality follows from the FI Lemma, with equality holding iff

PY (y) P$ Y (y)
(∀ x ∈ X , y ∈ Y) = .
PY |X (y|x) Y |X (y|x)
P$

3. For ease of notation, let PXλ (x) := λPX (x) + (1 − λ)PX$ (x).

λD(PX PX̂ ) + (1 − λ)D(PX$ PX̂ ) − D(PXλ PX̂ )


 PX (x)  P$ (x)
=λ PX (x) log2 + (1 − λ) PX$ (x) log2 X
x∈X
PXλ (x) x∈X
PXλ (x)

= λD(PX PXλ ) + (1 − λ)D(PX$ PXλ )


≥0

by the nonnegativity of the divergence, with equality holding iff PX (x) = PX$ (x)
for all x. Similarly, by letting PX̂λ (x) := λPX̂ (x) + (1 − λ)PX$ (x), we obtain

λD(PX PX̂ ) + (1 − λ)D(PX PX$ ) − D(PX PX̂λ )


 PX̂ (x)  PX̂ (x)
=λ PX (x) log2 λ + (1 − λ) PX (x) log2 λ
x∈X
P X̂ (x) x∈X
PX$ (x)



λ  PX̂ (x) (1 − λ)  PX$ (x)
≥ PX (x) 1 − + PX (x) 1 −
ln 2 x∈X PX̂λ (x) ln 2 x∈X PX̂λ (x)


 λP (x) + (1 − λ)PX$ (x)
= log2 (e) 1 − PX (x) X̂
x∈X
PX̂λ (x)
= 0,

where the inequality follows from the FI Lemma, with equality holding iff
PX$ (x) = PX̂ (x) for all x.
Finally, by the log-sum inequality, for each x ∈ X , we have

" # λPX (x) + (1 − λ)QX (x)


λPX (x) + (1 − λ)QX (x) log2
λPX̂ (x) + (1 − λ)QX̂ (x)
λPX (x) (1 − λ)QX (x)
≤ λPX (x) log2 + (1 − λ)QX (x) log2 .
λPX̂ (x) (1 − λ)QX̂ (x)

Summing over x, we yield (2.7.1).


Note that the last result (convexity of D(PX PX̂ ) in the pair (PX , PX̂ )) actually
implies the first two results: just set PX̂ = QX̂ to show convexity in the first
argument PX , and set PX = QX to show convexity in the second argument
PX̂ . 

Observation 2.47 (Applications of information measures) In addition to playing


a critical role in communications and information theory, the above information
measures, as well as their extensions (such as Rényi’s information measures, see
Sect. 2.9) and their counterparts for continuous-alphabet systems (see Chap. 5) have
been applied in many domains. Recall that entropy measures statistical uncertainty,
divergence measures statistical dissimilarity, and mutual information quantifies sta-
tistical dependence or information transfer in stochastic systems.
One example where entropy has been extensively employed is in the so-called
maximum entropy principle methods. This principle, originally espoused by
Jaynes [201–203] who saw a close connection between statistical mechanics and
information theory, states that given past observations, the probability distribution
that best characterizes current statistical behavior is the one with the largest entropy.
In other words, given prior data constraints (expressed in the form of moments or
averages), the best representative distribution should be, beyond satisfying the con-
straints, the least informative or as unbiased as possible.
Indeed, entropy together with divergence and mutual information have been used
as powerful tools in a wide range of fields, including image processing, computer
vision, pattern recognition and machine learning [48, 89, 111, 123, 163, 253, 384,
385], cryptography and data privacy [6, 7, 31, 41, 53, 65, 66, 94, 197, 213, 214,
260, 264, 269, 270, 327, 328, 342, 403, 413, 414], quantum information theory,
quantum cryptography and computing [40, 105, 188, 408], biology and molecular
communication [2, 59, 179, 278, 390], stochastic control under communication con-
straints [376, 377, 418], neuroscience [60, 287, 374], natural language processing
and linguistics [181, 259, 297, 363], and economics [82, 353, 379].

2.8 Fundamentals of Hypothesis Testing

One of the fundamental problems in statistics is to decide between two alternative


explanations for the observed data. For example, when gambling, one may wish to
test whether the game is fair or not. Similarly, a sequence of observations on the
market may reveal information on whether or not a new product is successful. These
are examples of the simplest form of the hypothesis testing problem, which is usually
named simple hypothesis testing.
Hypothesis testing has also close connections with information theory. For exam-
ple, we will see that the divergence plays a key role in the asymptotic error analysis
of Neyman–Pearson hypothesis testing (see Lemma 2.49).
The simple hypothesis testing problem can be formulated as follows:

Problem Let X1 , . . . , Xn be a sequence of observations which is drawn according to


either a “null hypothesis” distribution PX n or an “alternative hypothesis” distribution
PX̂ n . The hypotheses are usually denoted by

• H0 : PX n
• H1 : PX̂ n

Based on one sequence of observations xn , one has to decide which of the hypotheses
is true. This is denoted by a decision mapping φ(·), where

    φ(x^n) = 0, if the distribution of X^n is classified to be P_{X^n};
    φ(x^n) = 1, if the distribution of X^n is classified to be P_{X̂^n}.

Accordingly, the possible observed sequences are divided into two groups:

Acceptance region for H0 : {xn ∈ X n : φ(xn ) = 0}


Acceptance region for H1 : {xn ∈ X n : φ(xn ) = 1}.

Hence, depending on the true distribution, there are two types of error probabilities:

" #
Type I error : αn = αn (φ) := PX n {xn ∈ X n : φ(xn ) = 1}
" #
Type II error : βn = βn (φ) := PX̂ n {xn ∈ X n : φ(xn ) = 0} .

The choice of the decision mapping is dependent on the optimization criterion. Two
of the most frequently used ones in information theory are
1. Bayesian hypothesis testing.
Here, φ(·) is chosen so that the Bayesian cost

π0 αn + π1 βn

is minimized, where π0 and π1 are the prior probabilities for the null and alternative
hypotheses, respectively. The mathematical expression for Bayesian testing is

min [π0 αn (φ) + π1 βn (φ)] .


{φ}

2. Neyman–Pearson hypothesis testing subject to a fixed test level.


Here, φ(·) is chosen so that the type II error βn is minimized subject to a constant
bound on the type I error; i.e.,
αn ≤ ε,

where ε > 0 is fixed. The mathematical expression for Neyman–Pearson testing


is
min βn (φ).
{φ:αn (φ)≤ε}

The set {φ} considered in the minimization operation could have two different
ranges: range over deterministic rules, and range over randomization rules. The main

difference between a randomization rule and a deterministic rule is that the former
allows the mapping φ(xn ) to be random on {0, 1} for some xn , while the latter only
accepts deterministic assignments to {0, 1} for all xn . For example, a randomization
rule for specific observations x̃^n can be

    φ(x̃^n) = 0 with probability 0.2,   and   φ(x̃^n) = 1 with probability 0.8.

The Neyman–Pearson lemma shows the well-known fact that the likelihood ratio
test is always the optimal test [281].
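To make the likelihood ratio test concrete, here is a small Monte Carlo sketch (the Bernoulli parameters, threshold, and sample sizes are arbitrary choices for illustration): it tests H0: Bernoulli(0.5) against H1: Bernoulli(0.7) by thresholding the log-likelihood ratio and estimates the two error probabilities.

import math, random

p0, p1, n, tau = 0.5, 0.7, 50, 0.0     # assumed parameters (illustration only)

def log_likelihood_ratio(xs):
    """log2 [P_{X^n}(x^n) / P_{Xhat^n}(x^n)] for i.i.d. Bernoulli hypotheses."""
    k = sum(xs)
    return k * math.log2(p0 / p1) + (n - k) * math.log2((1 - p0) / (1 - p1))

def simulate(p, trials=20000):
    """Fraction of trials in which the test accepts H0 (log-ratio > tau)."""
    random.seed(0)
    accept = 0
    for _ in range(trials):
        xs = [1 if random.random() < p else 0 for _ in range(n)]
        accept += log_likelihood_ratio(xs) > tau
    return accept / trials

alpha = 1 - simulate(p0)   # type I error: reject H0 when H0 is true
beta = simulate(p1)        # type II error: accept H0 when H1 is true
print(alpha, beta)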

Lemma 2.48 (Neyman–Pearson Lemma) For a simple hypothesis testing problem,


define an acceptance region for the null hypothesis through the likelihood ratio as

    A_n(τ) := { x^n ∈ X^n : P_{X^n}(x^n) / P_{X̂^n}(x^n) > τ },

and let

    α_n* := P_{X^n}{A_n^c(τ)}   and   β_n* := P_{X̂^n}{A_n(τ)}.
βn∗ := PX̂ n {An (τ )} .

Then for type I error αn and type II error βn associated with another choice of
acceptance region for the null hypothesis, we have

αn ≤ αn∗ =⇒ βn ≥ βn∗ .

Proof Let B be a choice of acceptance region for the null hypothesis. Then

    α_n + τβ_n = ∑_{x^n∈B^c} P_{X^n}(x^n) + τ ∑_{x^n∈B} P_{X̂^n}(x^n)
               = ∑_{x^n∈B^c} P_{X^n}(x^n) + τ [1 − ∑_{x^n∈B^c} P_{X̂^n}(x^n)]
               = τ + ∑_{x^n∈B^c} [P_{X^n}(x^n) − τ P_{X̂^n}(x^n)].                      (2.8.1)

Observe that (2.8.1) is minimized by choosing B = An (τ ). Hence,

αn + τ βn ≥ αn∗ + τ βn∗ ,

which immediately implies the desired result. 



The Neyman–Pearson lemma indicates that no other choices of acceptance regions


can simultaneously improve both type I and type II errors of the likelihood ratio
test. Indeed, from (2.8.1), it is clear that for any αn and βn , one can always find a
likelihood ratio test that performs as good. Therefore, the likelihood ratio test is an
optimal test. The statistical properties of the likelihood ratio thus become essential in
hypothesis testing. Note that, when the observations are i.i.d. under both hypotheses,
the divergence, which is the statistical expectation of the log-likelihood ratio, plays
an important role in hypothesis testing (for non-memoryless observations, one is then
concerned with the divergence rate, an extended notion of divergence for systems
with memory which will be defined in the following chapter) as the exponent of
the best type II error. More specifically, we have the following result, known as the
Chernoff–Stein lemma [78].

Lemma 2.49 (Chernoff–Stein lemma) For a sequence of i.i.d. observations X n


which is possibly drawn from either the null hypothesis distribution PX n or the alter-
native hypothesis distribution PX̂ n , the best type II error satisfies

    lim_{n→∞} − (1/n) log2 β_n*(ε) = D(P_X‖P_X̂),
for any ε ∈ (0, 1), where βn∗ (ε) = minαn ≤ε βn , and αn and βn are the type I and type
II errors, respectively.

Proof Forward Part: In this part, we prove that there exists an acceptance region for
the null hypothesis such that

    lim inf_{n→∞} − (1/n) log2 β_n(ε) ≥ D(P_X‖P_X̂).
Step 1: Divergence typical set. For any δ > 0, define the divergence typical set
as

    A_n(δ) := { x^n ∈ X^n : | (1/n) log2 [P_{X^n}(x^n) / P_{X̂^n}(x^n)] − D(P_X‖P_X̂) | < δ }.

Note that any sequence xn in this set satisfies

    P_{X̂^n}(x^n) ≤ P_{X^n}(x^n) 2^{−n(D(P_X‖P_X̂)−δ)}.

Step 2: Computation of type I error. The observations being i.i.d., we have by


the weak law of large numbers that

PX n (An (δ)) → 1 as n → ∞.

Hence,
αn = PX n (Acn (δ)) < ε

for sufficiently large n.


Step 3: Computation of type II error.

    β_n(ε) = P_{X̂^n}(A_n(δ))
           = ∑_{x^n∈A_n(δ)} P_{X̂^n}(x^n)
           ≤ ∑_{x^n∈A_n(δ)} P_{X^n}(x^n) 2^{−n(D(P_X‖P_X̂)−δ)}
           = 2^{−n(D(P_X‖P_X̂)−δ)} ∑_{x^n∈A_n(δ)} P_{X^n}(x^n)
           = 2^{−n(D(P_X‖P_X̂)−δ)} (1 − α_n).

Hence,
    − (1/n) log2 β_n(ε) ≥ D(P_X‖P_X̂) − δ + (1/n) log2 (1 − α_n),
which implies that

    lim inf_{n→∞} − (1/n) log2 β_n(ε) ≥ D(P_X‖P_X̂) − δ.
The above inequality is true for any δ > 0; therefore,

    lim inf_{n→∞} − (1/n) log2 β_n(ε) ≥ D(P_X‖P_X̂).
Converse Part: We next prove that for any acceptance region Bn for the null hypoth-
esis satisfying the type I error constraint, i.e.,

αn (Bn ) = PX n (Bnc ) ≤ ε,

its type II error βn (Bn ) satisfies

    lim sup_{n→∞} − (1/n) log2 β_n(B_n) ≤ D(P_X‖P_X̂).

We have

    β_n(B_n) = P_{X̂^n}(B_n) ≥ P_{X̂^n}(B_n ∩ A_n(δ))
             ≥ ∑_{x^n∈B_n∩A_n(δ)} P_{X̂^n}(x^n)
             ≥ ∑_{x^n∈B_n∩A_n(δ)} P_{X^n}(x^n) 2^{−n(D(P_X‖P_X̂)+δ)}
             = 2^{−n(D(P_X‖P_X̂)+δ)} P_{X^n}(B_n ∩ A_n(δ))
             ≥ 2^{−n(D(P_X‖P_X̂)+δ)} [1 − P_{X^n}(B_n^c) − P_{X^n}(A_n^c(δ))]
             = 2^{−n(D(P_X‖P_X̂)+δ)} [1 − α_n(B_n) − P_{X^n}(A_n^c(δ))]
             ≥ 2^{−n(D(P_X‖P_X̂)+δ)} [1 − ε − P_{X^n}(A_n^c(δ))].

Hence,

    − (1/n) log2 β_n(B_n) ≤ D(P_X‖P_X̂) + δ − (1/n) log2 [1 − ε − P_{X^n}(A_n^c(δ))],
" #
which, upon noting that limn→∞ PX n Acn (δ) = 0 (by the weak law of large num-
bers), implies that

    lim sup_{n→∞} − (1/n) log2 β_n(B_n) ≤ D(P_X‖P_X̂) + δ.

The above inequality is true for any δ > 0; therefore,

    lim sup_{n→∞} − (1/n) log2 β_n(B_n) ≤ D(P_X‖P_X̂).  □
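The exponential decay of the optimal type II error can be observed numerically. The sketch below (Bernoulli hypotheses and δ chosen arbitrarily) uses the divergence-typical-set test from the forward part of the proof; for i.i.d. Bernoulli observations the log-likelihood ratio depends only on the number of ones, so α_n and β_n can be computed exactly from binomial probabilities, and −(1/n) log2 β_n settles near D(P_X‖P_X̂).

import math
from math import comb

p, q, delta = 0.5, 0.8, 0.05     # H0: Bernoulli(p), H1: Bernoulli(q); illustrative values
D = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def errors(n):
    """Exact (alpha_n, beta_n) for the divergence-typical-set test A_n(delta)."""
    alpha, beta = 0.0, 0.0
    for k in range(n + 1):            # k = number of ones in x^n
        llr = (k * math.log2(p / q) + (n - k) * math.log2((1 - p) / (1 - q))) / n
        prob_h0 = comb(n, k) * p**k * (1 - p)**(n - k)
        prob_h1 = comb(n, k) * q**k * (1 - q)**(n - k)
        if abs(llr - D) < delta:      # x^n in A_n(delta): accept H0
            beta += prob_h1           # type II error contribution
        else:                         # reject H0
            alpha += prob_h0          # type I error contribution
    return alpha, beta

for n in (50, 200, 600):
    alpha, beta = errors(n)
    print(n, alpha, -math.log2(beta) / n, D)  # exponent settles near D (within delta)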

2.9 Rényi’s Information Measures

We close this chapter by briefly introducing generalized information measures due


to Rényi [317], which subsume Shannon’s measures as limiting cases.

Definition 2.50 (Rényi's entropy) Given a parameter α > 0 with α ≠ 1, and given
a discrete random variable X with alphabet X and distribution PX , its Rényi entropy
of order α is given by


    Hα(X) = (1/(1 − α)) log ∑_{x∈X} P_X(x)^α.      (2.9.1)

As in case of the Shannon entropy, the base of the logarithm determines the units;
if the base is D, Rényi’s entropy is in D-ary units. Other notations for Hα (X ) are
H (X ; α), Hα (PX ), and H (PX ; α).
Definition 2.51 (Rényi’s divergence) Given a parameter 0 < α < 1, and two dis-
crete random variables X and X̂ with common alphabet X and distribution PX and
PX̂ , respectively, then the Rényi divergence of order α between X and X̂ is given by


    Dα(X‖X̂) = (1/(α − 1)) log ∑_{x∈X} P_X(x)^α P_X̂(x)^{1−α}.      (2.9.2)

This definition can be extended to α > 1 if PX̂ (x) > 0 for all x ∈ X . Other notations
for Dα (X X̂ ) are D(X X̂ ; α), Dα (PX PX̂ ) and D(PX PX̂ ; α).
As in the case of Shannon’s information measures, the base of the logarithm
indicates the units of the measure and can be changed from 2 to an arbitrary b > 1.
In the next lemma, whose proof is left as an exercise, we note that in the limit of
α tending to 1, Shannon’s entropy and divergence can be recovered from Rényi’s
entropy and divergence, respectively.
Lemma 2.52 When α → 1, we have the following:

    lim_{α→1} Hα(X) = H(X)      (2.9.3)

and

    lim_{α→1} Dα(X‖X̂) = D(X‖X̂).      (2.9.4)
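The limiting behavior in Lemma 2.52 is easy to see numerically; the sketch below (with an arbitrary distribution) evaluates Hα(X) for several values of α, including values approaching 1, and compares them with the Shannon entropy H(X).

import math

def shannon_entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """H_alpha(X) in bits, for alpha > 0 and alpha != 1."""
    return math.log2(sum(pi**alpha for pi in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]            # illustrative distribution
for alpha in (0.5, 0.9, 0.99, 0.999, 1.001, 1.01, 2.0):
    print(alpha, renyi_entropy(p, alpha))
print("Shannon entropy:", shannon_entropy(p))   # the alpha -> 1 limit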

Observation 2.53 (Operational meaning of Rényi’s information measures) Rényi’s


entropy has been shown to have an operational characterization for many problems,
including lossless variable-length source coding under an exponential cost constraint
[54, 67, 68, 310] (see also Observation 3.30 in Chap. 3), buffer overflow in source
coding [206], fixed-length source coding [76, 86] and others areas [1, 20, 36, 308,
318]. Furthermore, Rényi’s divergence has played a prominent role in hypothesis
testing questions [17, 86, 186, 225, 279, 280].
Observation 2.54 (α-mutual information) While Rényi did not propose a mutual
information of order α that generalizes Shannon’s mutual information, there are at
least three different possible definitions of such measure due to Sibson [352], Arimoto
[28] and Csiszár [86], respectively. We refer the reader to [86, 395] for discussions
on the properties and merits of these different measures.
Observation 2.55 (Information measures for continuous distributions) Note that
the above information measures defined for discrete distributions can similarly be
defined for continuous distributions admitting densities with the usual straightfor-
ward modifications (with densities replacing pmf’s and integrals replacing summa-
tions). See Chap. 5 for a study of Shannon’s differential entropy and divergence for

continuous distributions (also for continuous distributions, closed-form expressions


for Shannon’s differential entropy and Rényi’s entropy can be found in [360] and
expressions for Rényi’s divergence are derived in [144, 246]).

Problems

1. Prove the FI Lemma.


2. Show that the two conditions in Footnote 8 are equivalent.
3. For a finite-alphabet random variable X , show that H (X ) ≤ log2 |X | using the
log-sum inequality.
4. Given a pair of random variables (X , Y ), is H (X |Y ) = H (Y |X )? If not, when
do we have equality?
5. Given a discrete random variable X with alphabet X ⊂ {1, 2, . . .}, what is the
relationship between H (X ) and H (Y ) when Y is defined as follows.
(a) Y = log2 (X ).
(b) Y = X 2 .
6. Show that the entropy of a function f of X is less than or equal to the entropy of X .

Hint: By the chain rule for entropy,

H (X , f (X )) = H (X ) + H (f (X )|X ) = H (f (X )) + H (X |f (X )).

7. Show that H (Y |X ) = 0 iff Y is a function of X .


8. Give examples of:
(a) I (X ; Y |Z) < I (X ; Y ).
(b) I (X ; Y |Z) > I (X ; Y ).

Hint: For (a), create example for I (X ; Y |Z) = 0 and I (X ; Y ) > 0. For (b),
create example for I (X ; Y ) = 0 and I (X ; Y |Z) > 0.

9. Let the joint distribution of X and Y be:


             Y = 0    Y = 1
    X = 0     1/4       0
    X = 1     1/2      1/4

Draw the Venn diagram for

H (X ), H (Y ), H (X |Y ), H (Y |X ), H (X , Y ) and I (X ; Y ),

and indicate the quantities (in bits) for each area of the Venn diagram.

10. Maximal discrete entropy. Prove that, of all probability mass functions for a non-
negative integer-valued random variable with mean μ, the geometric distribution,
given by

        P_Z(z) = (1/(1 + μ)) (μ/(1 + μ))^z,   for z = 0, 1, 2, . . . ,

has the largest entropy.

Hint: Let X be a nonnegative integer-valued random variable with mean μ. Show


that H (X ) − H (Z) = −D(PX PZ ) ≤ 0, with equality iff PX = PZ .
11. Inequalities: Which of the following inequalities are ≥, =, ≤ ? Label each with
≥, =, or ≤ and justify your answer.
(a) H (X |Z) − H (X |Y ) versus I (X ; Y |Z).
(b) H (X |g(Y )) versus H (X |Y ), where g is a function.
(c) H (X2 |X1 ) versus (1/2)H (X1 , X2 ), where X1 and X2 are identically dis-
tributed on a common alphabet X (i.e., PX1 (a) = PX2 (a) for all a ∈ X ).
12. Entropy of invertible functions: Given random variables X and Z with finite
alphabets X and Z, respectively, define a new random variable Y as Y = fX (Z)
with alphabet Y := {fx (z) : x ∈ X and z ∈ Z}, where for each x ∈ X , the
function fx : Z → Y is invertible.
(a) Prove that H (Y |X ) = H (Z|X ).
(b) Show that H (Y ) ≥ H (Z) when X and Z are independent.
(c) Verify via a counterexample that H (Y |Z) = H (X |Z) in general.
(d) Under what conditions does H (Y |Z) = H (X |Z)?
13. Let X1 → X2 → X3 → · · · → Xn form a Markov chain (see (B.3.5) in
Appendix B). Show that:
(a) I (X1 ; X2 , . . . , Xn ) = I (X1 ; X2 ).
(b) For any n, H (X1 |Xn−1 ) ≤ H (X1 |Xn ).
14. Refinement cannot decrease entropy: Given integer m ≥ 1 and a random variable
X
+with distribution PX , let {U1 , . . . , Um } be a partition on the alphabet X of X ; i.e.,
m
i=1 Ui = X and Uj ∩ Uk = ∅ for all j = k. Now let U denote a random variable
with alphabet {1, . . . , m} and distribution PU (i) = PX (Ui ) for 1 ≤ i ≤ m. In this
case, X is called a refinement (or an m-refinement) of U . Show that

H (X ) ≥ H (U ).

15. Provide examples for the following inequalities (see Definition 2.40 for the
definition of conditional divergence).
(a) D(PX |Z PX̂ |Ẑ |PZ ) > D(PX PX̂ ).
(b) D(PX |Z PX̂ |Ẑ |PZ ) < D(PX PX̂ ).

16. Prove that the binary divergence defined by

        D(p‖q) := p log2 (p/q) + (1 − p) log2 ((1 − p)/(1 − q))

    satisfies

        D(p‖q) ≤ log2(e) (p − q)² / (q(1 − q))

for 0 < p < 1 and 0 < q < 1.


Hint: Use the FI Lemma.
17. Let {p1, p2, . . . , pm} be a set of positive real numbers with ∑_{i=1}^{m} pi = 1. If {q1, q2, . . . , qm} is any other set of positive real numbers with ∑_{i=1}^{m} qi = α, where α > 0 is a constant, show that

        ∑_{i=1}^{m} pi log (1/pi) ≤ ∑_{i=1}^{m} pi log (1/qi) + log α.

Give a necessary and sufficient condition for equality.


18. An alternative form for mutual information: Given jointly distributed random
variables X and Y with alphabets X and Y, respectively, and with joint distribu-
tion PX ,Y = PY |X PX , show that the mutual information I (X ; Y ) can be written
as

I (X ; Y ) = D(PY |X PY |PX )


= D(PY |X QY |PX ) − D(PY QY )

for any distribution QY on Y, where PY is the marginal distribution of PX ,Y on


Y and
  PY |X (y|x)
D(PY |X QY |PX ) = PX (x) PY |X (y|x) log
x∈X y∈Y
QY (y)

is the conditional divergence between PY |X and QY given PX ; see Definition 2.40.

19. Let X and Y be jointly distributed discrete random variables. Show that

I (X ; Y ) ≥ I (f (X ); g(Y )),

where f (·) and g(·) are given functions.


20. Data processing inequality for the divergence: Given a conditional distribution
PY |X defined on the alphabets X and Y, let PX and QX be two distributions on
X and let PY and QY be two corresponding distributions on Y defined by

PY (y) = PY |X (y|x)PX (x)
x∈X

and 
QY (y) = PY |X (y|x)QX (x)
x∈X

for all y ∈ Y. Show that

D(PX QX ) ≥ D(PY QY ).

21. An application of Jensen’s inequality: Let X and X  be two discrete indepen-


dent random variables with common alphabet X and distributions PX and PX  ,
respectively.
(a) Use Jensen’s inequality to show that

2EX [log2 PX  (X )] ≤ EX [PX  (X )].

(b) Show that



Pr[X = X  ] ≥ 2−H (X )−D(X X ) .

(c) If X = {0, 1}, X  is uniformly distributed and PX (0) = p = 1 − PX (1),


evaluate the tightness of the bound in (b) as a function of p.
22. Hölder’s inequality:
(a) Given two probability vectors (p1 , p2 , . . . , pm ) and (q1 , q2 , . . . , qm ) on the
set {1, 2, . . . , m}, apply Jensen’s inequality to show that for any 0 < λ < 1,


m
qiλ pi1−λ ≤ 1,
i=1

and give a necessary and sufficient condition for equality.


(b) Given positive real numbers ai and bi i = 1, 2, . . . , m, show via an appro-
priate use of the bound in (a) that for any 0 < λ < 1,
 λ  m 1−λ

m 
m
1  1
ai bi ≤ aiλ 1−λ
bi ,
i=1 i=1 i=1

with equality iff for some constant c,


1 1
aiλ = cbi1−λ

for all i. This inequality is known as Hölder’s inequality. In the special case
of λ = 1/2, the bound is referred to as the Cauchy–Schwarz inequality.

(c) Another form of Hölder’s inequality is as follows:


 λ  1−λ

m 
m
1 
m 1
pi ai bi ≤ λ
pi ai pi bi 1−λ
,
i=1 i=1 i=1

where (p1 , p2 , . . . , pm ) is a probability vector as in (a).

Prove this inequality using (b), and show that equality holds iff for some
constant c,
1 1
pi aiλ = cpi bi1−λ

for all i.

Note: We refer the reader to [135, 176] for a variety of other useful inequal-
ities.
23. Inequality of arithmetic and geometric means:
(a) Show that
n

n 
ai ln xi ≤ ln ai xi ,
i=1 i=1

where x1 , x2 , . . . , xn are arbitrary


positive numbers, and a1 , a2 , . . . , an are
positive numbers such that ni=1 ai = 1.
(b) Deduce from the above the inequality of the arithmetic and geometric means:


n
x1a1 x2a2 . . . xnan ≤ ai xi .
i=1

24. Consider two distributions P(·) and Q(·) on the alphabet X = {a1 , . . . , ak } such
that Q(ai ) > 0 for all i = 1, . . . , k. Show that


k
(P(ai ))2
≥ 1.
i=1
Q(ai )

25. Let X be a discrete random variable with alphabet X and distribution PX . Let
f : X → R be a real-valued function, and let α be an arbitrary real number.
(a) Show that


−αf (x)
H (X ) ≤ αE[f (X ))] + log2 2 ,
x∈X


with equality iff PX (x) = A1 2−αf (x) for x ∈ X , where A := x∈X 2−αf (x) .

(b) Show that for a positive integer-valued random variable N (such that E[N ] >
1 without loss of generality), the following holds:

H (N ) ≤ log2 (E[N ]) + log2 e.


E[N ]
Hint: First use part (a) with f (N ) = N and α = log2 E[N ]−1
.
26. Fano’s inequality for list decoding: Let X and Y be two random variables with
alphabets X and Y, respectively, where X is finite and Y can be countably
infinite. Given a fixed integer m ≥ 1, define

X̂ m := (g1 (Y ), g2 (Y ), . . . , gm (Y ))

as the list of estimates of X obtained by observing Y , where gi : Y → X is


a given estimation function for i = 1, 2, . . . , m. Define the probability of list
decoding error as
) *
Pe(m) := Pr X̂1 = X , X̂2 = X , . . . , X̂m = X .

(a) Show that

H (X |Y ) ≤ hb (Pe(m) ) + Pe(m) log2 (|X | − u) + (1 − Pe(m) ) log2 (u) , (2.10.1)

where  
u := PX̂ m (x̂m ).
x∈X x̂m ∈X m :x̂i =x for some i

Note: When m = 1, we obtain that u = 1 and the right-hand side of (2.10.1)


reduces to the original Fano inequality (cf. the right-hand side of (2.5.1)).

Hint: Show that H (X |Y ) ≤ H (X |X̂ m ) and that H (X |X̂ m ) is less than the
right-hand side of the above inequality.
(b) Use (2.10.1) to deduce the following weaker version of Fano’s inequality
for list decoding (see [5], [216], [313, Appendix 3.E]):

H (X |Y ) ≤ hb (Pe(m) ) + Pe(m) log2 |X | + (1 − Pe(m) ) log2 m. (2.10.2)

27. Fano’s inequality for ternary partitioning of the observation space: In Problem
26, Pe(m) and u can actually be expressed as
  
Pe(m) = PX ,X̂ m (x, x̂m ) = PX ,Y (x, y)
x∈X x̂m ∈
/ Ux x∈X y∈Y
/ x

and   
u= PX̂ m (x̂m ) = PY (y),
x∈X x̂m ∈ Ux x∈X y∈ Yx

respectively, where
' (
Ux := x̂m ∈ X m : x̂i = x for some i

and
Yx := {y ∈ Y : gi (y) = x for some i} .

Thus, given x ∈ X , Yx and Yxc form a binary partition on the observation space Y.

Now consider again random variables X and Y with alphabets X and Y,


respectively, where X is finite and Y can be countably infinite, and assume that
for each x ∈ X , we are given a ternary partition {Sx , Tx , Vx } on the observation
space Y, where the sets Sx , Tx and Vx are mutually disjoint and their union equals
Y. Define
  
p := PX ,Y (x, y), q := PX ,Y (x, y), r := PX ,Y (x, y)
x∈X y∈Sx x∈X y∈Tx x∈X y∈Vx

and
  
s := PY (y), t := PY (y), v := PY (y).
x∈X y∈Sx x∈X y∈Tx x∈X y∈Vx

Note that p + q + r = 1 and s + t + v = |X |. Show that

H (X |Y ) ≤ H (p, q, r) + p log2 (s) + q log2 (t) + r log2 (v), (2.10.3)

where
1 1 1
H (p, q, r) = p log2 + q log2 + r log2 .
p q r

Note: When Vx = ∅ for all x ∈ X , we obtain that Sx = Txc for all x, r = v = 0


and p = 1 − q; as a result, inequality (2.10.3) acquires a similar expression as
(2.10.1) with p standing for the probability of error.
28. ε-Independence: Let X and Y be two jointly distributed random variables with finite respective alphabets X and Y and joint pmf P_{X,Y} defined on X × Y. Given a fixed ε > 0, random variable Y is said to be ε-independent from random variable X if

        ∑_{x∈X} P_X(x) ∑_{y∈Y} |P_{Y|X}(y|x) − P_Y(y)| < ε,

    where P_X and P_Y are the marginal pmf's of X and Y, respectively, and P_{Y|X} is the conditional pmf of Y given X. Show that

        I(X;Y) < (log2(e)/2) ε²

    is a sufficient condition for Y to be ε-independent from X, where I(X;Y) is the mutual information (in bits) between X and Y.
29. Rényi’s entropy: Given a fixed positive integer n > 1, consider an n-ary valued
random variable X with alphabet X = {1, 2, . . . , n} and distribution described
by the probabilities pi := Pr[X = i], where pi > 0 for each i = 1, . . . , n. Given
α > 0 and α ≠ 1, the Rényi entropy of X (see (2.9.1)) is given by

        Hα(X) := (1/(1 − α)) log2 ∑_{i=1}^{n} pi^α.

(a) Show that

        ∑_{i=1}^{n} pi^r > 1   if r < 1,

    and that

        ∑_{i=1}^{n} pi^r < 1   if r > 1.


Hint: Show that the function f(r) = ∑_{i=1}^{n} pi^r is decreasing in r, where r > 0.
(b) Show that
0 ≤ Hα (X ) ≤ log2 n.

Hint: Use (a) for the lower bound, and use Jensen's inequality (with the convex function f(y) = y^{1/(1−α)}, for y > 0) for the upper bound.
30. Rényi’s entropy and divergence: Consider two discrete random variables X and
X̂ with common alphabet X and distribution PX and PX̂ , respectively.
(a) Prove Lemma 2.52.
(b) Find a distribution Q on X in terms of α and PX such that the following
holds:
        Hα(X) = H(X) + (1/(1 − α)) D(P_X‖Q).
Chapter 3
Lossless Data Compression

3.1 Principles of Data Compression

As mentioned in Chap. 1, data compression describes methods of representing a


source by a code whose average codeword length (or code rate) is acceptably small.
The representation can be lossless (or asymptotically lossless) where the recon-
structed source is identical (or identical with vanishing error probability) to the orig-
inal source; or lossy where the reconstructed source is allowed to deviate from the
original source, usually within an acceptable threshold. We herein focus on lossless
data compression.
Since a memoryless source is modeled as a random variable, the averaged code-
word length of a codebook is calculated based on the probability distribution of that
random variable. For example, consider a ternary memoryless source X with three
possible outcomes and probability distribution given by

PX (x = outcome A ) = 0.5;
PX (x = outcome B ) = 0.25;
PX (x = outcomeC ) = 0.25.

Suppose that a binary codebook is designed for this source, in which outcome A ,
outcome B , and outcomeC are, respectively, encoded as 0, 10, and 11. Then, the
average codeword length (in bits per source outcome) is

length(0) · PX (outcomeA ) + length(10) · PX (outcomeB )


+ length(11) · PX (outcomeC )
= 0.5 + 2 × 0.25 + 2 × 0.25
= 1.5 bits.
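A two-line computation confirms this and compares the average length with the source entropy (sketch; the code lengths are those of the codebook above):

import math

pmf = {'A': 0.5, 'B': 0.25, 'C': 0.25}
code = {'A': '0', 'B': '10', 'C': '11'}

avg_len = sum(pmf[s] * len(code[s]) for s in pmf)
entropy = -sum(p * math.log2(p) for p in pmf.values())
print(avg_len, entropy)   # both equal 1.5 bits for this source and code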


There are usually no constraints on the basic structure of a code. In the case where
the codeword length for each source outcome can be different, the code is called a
variable-length code. When the codeword lengths of all source outcomes are equal,
the code is referred to as a fixed-length code. It is obvious that the minimum average
codeword length among all variable-length codes is no greater than that among
all fixed-length codes, since the latter is a subclass of the former. We will see in
this chapter that the smallest achievable average code rate for variable-length and
fixed-length codes coincide for sources with good probabilistic characteristics, such
as stationarity and ergodicity. But for more general sources with memory, the two
quantities are different (e.g., see [172]).
For fixed-length codes, the sequence of adjacent codewords is concatenated
together for storage or transmission purposes, and some punctuation mechanism—
such as marking the beginning of each codeword or delineating internal sub-
blocks for synchronization between encoder and decoder—is normally considered
an implicit part of the codewords. Due to constraints on space or processing capa-
bility, the sequence of source symbols may be too long for the encoder to deal with
all at once; therefore, segmentation before encoding is often necessary. For example,
suppose that we need to encode using a binary code the grades of a class with 100
students. There are three grade levels: A, B, and C. By observing that there are 3100
possible grade combinations for 100 students, a straightforward code design requires

    ⌈log2(3^100)⌉ = 159 bits

to encode these combinations (by enumerating them). Now suppose that the encoder
facility can only process 16 bits at a time. Then, the above code design becomes
infeasible and segmentation is unavoidable. Under such constraint, we may encode
grades of 10 students at a time, which requires

    ⌈log2(3^10)⌉ = 16 bits.

As a consequence, for a class of 100 students, the code requires 160 bits in total.
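The arithmetic behind this example is easy to reproduce (a minimal sketch):

import math

def fixed_length_bits(alphabet_size, block_length):
    """Bits needed to enumerate all sourcewords of the given block length."""
    return math.ceil(block_length * math.log2(alphabet_size))

print(fixed_length_bits(3, 100))        # 159 bits: one shot, no segmentation
print(10 * fixed_length_bits(3, 10))    # 160 bits: ten segments of 10 grades each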
In the above example, the letters in the grade set {A, B, C} and the letters from the
code alphabet {0, 1} are often called source symbols and code symbols, respectively.
When the code alphabet is binary (as in the previous two examples), the code sym-
bols are referred to as code bits or simply bits (as already used). A tuple (or grouped
sequence) of source symbols is called a sourceword, and the resulting encoded tuple
consisting of code symbols is called a codeword. (In the above example, each source-
word consists of 10 source symbols (student grades) and each codeword consists of
16 bits.)
Note that, during the encoding process, the sourceword lengths do not have to be
equal. In this text, however, we only consider the case where the sourcewords have
a fixed length throughout the encoding process (except for the Lempel–Ziv code
briefly discussed at the end of this chapter), but we will allow the codewords to have

sourcewords codewords sourcewords


Source Source Source
encoder decoder

Fig. 3.1 Block diagram of a data compression system

fixed or variable lengths as defined earlier.1 The block diagram of a source coding
system is depicted in Fig. 3.1.
When adding segmentation mechanisms to fixed-length codes, the codes can be
loosely divided into two groups. The first consists of block codes in which the encod-
ing (or decoding) of the next segment of source symbols is independent of the previ-
ous segments. If the encoding/decoding of the next segment, somehow, retains and
uses some knowledge of earlier segments, the code is called a fixed-length tree code.
As we will not investigate such codes in this text, we can use “block codes” and
“fixed-length codes” as synonyms.
In this chapter, we first consider data compression for block codes in Sect. 3.2.
Data compression for variable-length codes is then addressed in Sect. 3.3.

3.2 Block Codes for Asymptotically Lossless Compression

3.2.1 Block Codes for Discrete Memoryless Sources

We first focus on the study of asymptotically lossless data compression of dis-


crete memoryless sources via block (fixed-length) codes. Such sources were already
defined in the previous chapter (see Sect. 2.1.2) and in Appendix B; but we never-
theless recall their definition.

Definition 3.1 (Discrete memoryless source) A discrete memoryless source (DMS)


{X n }∞
n=1 consists of a sequence of i.i.d. random variables, X 1 , X 2 , X 3 , . . ., all tak-
ing values in a common finite alphabet X . In particular, if PX (·) is the common
distribution or probability mass function (pmf) of the X i ’s, then


n
PX n (x1 , x2 , . . . , xn ) = PX (xi ).
i=1

1 In other
words, our fixed-length codes are actually “fixed-to-fixed length codes” and our variable-
length codes are “fixed-to-variable length codes” since, in both cases, a fixed number of source
symbols is mapped onto codewords with fixed and variable lengths, respectively.

Definition 3.2 An (n, M) block code with blocklength n and size M (which can
be a function of n in general,2 i.e., M = Mn ) for a discrete source {X n }∞ n=1 is a
set ∼Cn = {c1 , c2 , . . . , c M } ⊆ X n consisting of M reproduction (or reconstruction)
words, where each reproduction word is a sourceword (an n-tuple of source symbols).
To simplify the exposition, we make an abuse of notation by writing ∼Cn = (n, M)
to mean that ∼Cn is a block code with blocklength n and size M.

Observation 3.3 One can binary-index (or enumerate) the reproduction words in
∼Cn = {c1, c2, . . . , cM} using k := ⌈log2 M⌉ bits. As such k-bit words in {0, 1}^k are
usually stored for retrieval at a later date, the (n, M) block code can be represented
by an encoder–decoder pair of functions ( f, g), where the encoding function

f : X n → {0, 1}k

maps each sourceword x n to a k-bit word f (x n ) which we call a codeword. Then,


the decoding function
g : {0, 1}k → {c1 , c2 , . . . , c M }

is a retrieving operation that produces the reproduction words. Since the codewords
are binary-valued, such a block code is called a binary code. More generally, a
D-ary block code (where D > 1 is an integer) would use an encoding function
f : X n → {0, 1, . . . , D − 1}k where each codeword f (x n ) contains k D-ary code
symbols.
Furthermore, since the behavior of block codes is investigated for sufficiently
large n and M (tending to infinity), it is legitimate to replace ⌈log_2 M⌉ by log_2 M
for the case of binary codes. With this convention, the data compression rate or code
rate is

k/n = (1/n) log_2 M (in bits per source symbol).

Similarly, for D-ary codes, the rate is

k/n = (1/n) log_D M (in D-ary code symbols per source symbol).

For computational convenience, nats (under the natural logarithm) can be used
instead of bits or D-ary code symbols; in this case, the code rate becomes

(1/n) log M (in nats per source symbol).

² In the literature, both (n, M) and (M, n) have been used to denote a block code with blocklength
n and size M. For example, [415, p. 149] adopts the former one, while [83, p. 193] uses the latter.
We use the (n, M) notation since M = Mn is a function of n in general.
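To make the binary-indexing description of Observation 3.3 concrete, here is a small Python sketch (our added illustration, not part of the original text; the alphabet {A, B, C, D} and blocklength n = 2 are arbitrary choices). It enumerates a set of reproduction words, builds an encoder–decoder pair (f, g), and reports the code rate k/n.

import math

# Illustrative sketch of Observation 3.3: binary-index the M reproduction words of an
# (n, M) block code with k = ceil(log2 M) bits.
def make_block_code(reproduction_words):
    M = len(reproduction_words)
    k = math.ceil(math.log2(M))
    index = {c: format(i, '0{}b'.format(k)) for i, c in enumerate(reproduction_words)}

    def f(xn):                              # encoder: sourceword -> k-bit codeword
        return index[xn]

    def g(bits):                            # decoder: k-bit codeword -> reproduction word
        return reproduction_words[int(bits, 2)]

    return f, g, k

# Example (hypothetical): n = 2 and all pairs over {A,B,C,D} as reproduction words,
# i.e., a completely lossless (uniquely decodable) block code.
n = 2
words = [a + b for a in 'ABCD' for b in 'ABCD']
f, g, k = make_block_code(words)
print(k / n, "bits per source symbol")      # rate = ceil(log2 16)/2 = 2 = log2 |X|
print(g(f('BC')))                           # 'BC' is recovered exactly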

The block code’s operation can be symbolically represented as³

(x_1, x_2, . . . , x_n) → c_m ∈ {c_1, c_2, . . . , c_M}.

This procedure will be repeated for each consecutive block of length n, i.e.,

· · · (x_{3n}, . . . , x_{31})(x_{2n}, . . . , x_{21})(x_{1n}, . . . , x_{11}) → · · · |c_{m_3}|c_{m_2}|c_{m_1},

where “|” reflects the necessity of a “punctuation mechanism” or “synchronization
mechanism” for consecutive source block coders.
The next theorem provides a key tool for proving Shannon’s source coding
theorem.
Theorem 3.4 (Shannon–McMillan–Breiman) (Asymptotic equipartition property
or AEP⁴) If {X_n}_{n=1}^∞ is a DMS with entropy H(X), then

−(1/n) log_2 P_{X^n}(X_1, . . . , X_n) → H(X) in probability.

In other words, for any δ > 0,

lim_{n→∞} Pr[ | −(1/n) log_2 P_{X^n}(X_1, . . . , X_n) − H(X) | > δ ] = 0.

Proof This theorem follows by first observing that for an i.i.d. sequence {X_n}_{n=1}^∞,

−(1/n) log_2 P_{X^n}(X_1, . . . , X_n) = −(1/n) ∑_{i=1}^{n} log_2 P_X(X_i)

and that the sequence {− log_2 P_X(X_i)}_{i=1}^∞ is i.i.d., and then applying the weak law
of large numbers to the latter sequence. □
The AEP indeed constitutes an “information theoretic” analog of the weak law
of large numbers as it states that if {− log_2 P_X(X_i)}_{i=1}^∞ is an i.i.d. sequence, then for
any δ > 0,

Pr[ | −(1/n) ∑_{i=1}^{n} log_2 P_X(X_i) − H(X) | ≤ δ ] → 1 as n → ∞.

As a consequence of the AEP, all the probability mass will be ultimately placed on
the weakly δ-typical set, which is defined as

3 When one uses an encoder–decoder pair ( f, g) to describe the block code, the code’s operation
can be expressed as cm = g( f (x n )).
4 This theorem, which is also called the entropy stability property, is due to Shannon [340],

McMillan [267], and Breiman [58].


F_n(δ) := { x^n ∈ X^n : | −(1/n) log_2 P_{X^n}(x^n) − H(X) | ≤ δ }
        = { x^n ∈ X^n : | −(1/n) ∑_{i=1}^{n} log_2 P_X(x_i) − H(X) | ≤ δ }.

Note that since the source is memoryless, for any x^n ∈ F_n(δ), the quantity
−(1/n) log_2 P_{X^n}(x^n), the normalized self-information of x^n, is equal to
(1/n) ∑_{i=1}^{n} [− log_2 P_X(x_i)], which is the empirical (arithmetic) average self-information
or “apparent” entropy of the source. Thus, a sourceword x^n is δ-typical if it yields an
apparent source entropy within δ of the “true” source entropy H(X). Note that the
sourcewords in F_n(δ) are nearly equiprobable or equally surprising (cf. Property 1 of
Theorem 3.5); this justifies naming Theorem 3.4 the AEP.
Theorem 3.5 (Consequence of the AEP) Given a DMS {X n }∞ n=1 with entropy H (X )
and any δ greater than zero, the weakly δ-typical set Fn (δ) satisfies the following.
1. If x^n ∈ F_n(δ), then

   2^{−n(H(X)+δ)} ≤ P_{X^n}(x^n) ≤ 2^{−n(H(X)−δ)}.

2. P_{X^n}(F_n^c(δ)) < δ for sufficiently large n, where the superscript “c” denotes the
complementary set operation.
3. |F_n(δ)| > (1 − δ) 2^{n(H(X)−δ)} for sufficiently large n, and |F_n(δ)| ≤ 2^{n(H(X)+δ)}
for every n, where |F_n(δ)| denotes the number of elements in F_n(δ).
Note: The above theorem also holds if we define the typical set using the base-D
logarithm log D for any D > 1 instead of the base-2 logarithm; in this case, one just
needs to appropriately change the base of the exponential terms in the above theorem
(by replacing 2x terms with D x terms) and also substitute H (X ) with H D (X ).
Proof Property 1 is an immediate consequence of the definition of F_n(δ). Property
2 is a direct consequence of the AEP, since the AEP states that for a fixed δ > 0,
lim_{n→∞} P_{X^n}(F_n(δ)) = 1; i.e., ∀ ε > 0, there exists n_0 = n_0(ε) such that for all
n ≥ n_0,

P_{X^n}(F_n(δ)) > 1 − ε.

In particular, setting ε = δ yields the result. We nevertheless provide a direct proof
of Property 2 as we give an explicit expression for n_0: observe that by Chebyshev’s
inequality,⁵

P_{X^n}(F_n^c(δ)) = P_{X^n}( { x^n ∈ X^n : | −(1/n) log_2 P_{X^n}(x^n) − H(X) | > δ } )
                 ≤ σ_X^2 / (n δ^2)
                 < δ,

5 Chebyshev’s inequality as well as its proof can be found on p. 287 in Appendix B.



for n > σ_X^2/δ^3, where the variance

σ_X^2 := Var[− log_2 P_X(X)] = ∑_{x∈X} P_X(x) (log_2 P_X(x))^2 − (H(X))^2

is a constant⁶ independent of n.
To prove Property 3, we have from Property 1 that

1 ≥ ∑_{x^n∈F_n(δ)} P_{X^n}(x^n) ≥ ∑_{x^n∈F_n(δ)} 2^{−n(H(X)+δ)} = |F_n(δ)| 2^{−n(H(X)+δ)},

and, using Properties 2 and 1, we have that

1 − δ < 1 − σ_X^2/(nδ^2) ≤ ∑_{x^n∈F_n(δ)} P_{X^n}(x^n) ≤ ∑_{x^n∈F_n(δ)} 2^{−n(H(X)−δ)} = |F_n(δ)| 2^{−n(H(X)−δ)},

for n ≥ σ_X^2/δ^3. □
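For readers who wish to check Theorem 3.5 numerically, the following Python sketch (an added illustration; the four-letter distribution matches the one used in Table 3.1 below) enumerates the weakly δ-typical set by brute force. Property 1 holds by construction of the membership test, while Properties 2 and 3 can be observed from the printed probabilities and set sizes as n grows.

import itertools, math

# Sketch: brute-force enumeration of F_n(delta) for a small DMS.
pX = {'A': 0.4, 'B': 0.3, 'C': 0.2, 'D': 0.1}
H = -sum(p * math.log2(p) for p in pX.values())           # source entropy H(X) in bits

def typical_set(n, delta):
    size, prob = 0, 0.0
    for xn in itertools.product(pX, repeat=n):
        logp = sum(math.log2(pX[x]) for x in xn)          # log2 P_{X^n}(x^n)
        if abs(-logp / n - H) <= delta:                   # Property 1 is this very test
            size += 1
            prob += 2.0 ** logp
    return size, prob

for n in (2, 6, 10):
    size, prob = typical_set(n, 0.2)
    # Property 2: P_{X^n}(F_n(delta)) approaches 1; Property 3: |F_n(delta)| <= 2^{n(H+delta)}
    print(n, round(prob, 3), size, round(2 ** (n * (H + 0.2))))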
Note that for any n > 0, a block code ∼Cn = (n, M) is said to be uniquely decodable
or completely lossless if its set of reproduction words is trivially equal to the set of
all source n-tuples: {c1 , c2 , . . . , c M } = X n . In this case, if we are binary-indexing
the reproduction words using an encoding–decoding pair ( f, g), every sourceword
x n will be assigned to a distinct binary codeword f (x n ) of length k = log2 M and
all the binary k-tuples are the image under f of some sourceword. In other words, f
is a bijective (injective and surjective) map and hence invertible with the decoding
map g = f −1 and M = |X |n = 2k . Thus, the code rate is (1/n) log2 M = log2 |X |
bits/source symbol.
Now the question becomes: can we achieve a better (i.e., smaller) compression
rate? The answer is affirmative: we can achieve a compression rate equal to the
source entropy H (X ) (in bits), which can be significantly smaller than log2 |X | when
this source is strongly nonuniformly distributed, if we give up unique decodability
(for every n) and allow n to be sufficiently large to asymptotically achieve lossless
reconstruction by having an arbitrarily small (but positive) probability of decoding
error
Pe(∼Cn) := P_{X^n}{ x^n ∈ X^n : g(f(x^n)) ≠ x^n }.

⁶ In the proof, we assume that the variance σ_X^2 = Var[− log_2 P_X(X)] < ∞. This holds since the
source alphabet is finite:

Var[− log_2 P_X(X)] ≤ E[(log_2 P_X(X))^2] = ∑_{x∈X} P_X(x)(log_2 P_X(x))^2
                   ≤ ∑_{x∈X} (4/e^2)[log_2(e)]^2 = (4/e^2)[log_2(e)]^2 × |X| < ∞.

Thus, block codes herein can perform data compression that is asymptotically
lossless with respect to blocklength; this contrasts with variable-length codes which
can be completely lossless (uniquely decodable) for every finite blocklength.
We now can formally state and prove Shannon’s asymptotically lossless source
coding theorem for block codes. The theorem will be stated for general D-ary block
codes, representing the source entropy H D (X ) in D-ary code symbol/source sym-
bol as the smallest (infimum) possible compression rate for asymptotically lossless
D-ary block codes. Without loss of generality, the theorem will be proved for the
case of D = 2. The idea behind the proof of the forward (achievability) part is
basically to binary-index the source sequences in the weakly δ-typical set F_n(δ) to
binary codewords (starting from index one with corresponding k-tuple codeword
0 · · · 01), and to encode all sourcewords outside F_n(δ) to a default all-zero binary
codeword, which certainly cannot be reproduced without distortion due to the
many-to-one nature of this mapping. The resultant code rate is (1/n)⌈log_2(|F_n(δ)| + 1)⌉ bits per
source symbol. As revealed in the Shannon–McMillan–Breiman AEP theorem and
its consequence, almost all the probability mass will be on Fn (δ) as n is sufficiently
large, and hence, the probability of non-reconstructable source sequences can be
made arbitrarily small. A simple example for the above coding scheme is illustrated
in Table 3.1. The converse part of the proof will establish (by expressing the proba-
bility of correct decoding in terms of the δ-typical set and also using the consequence
of the AEP) that for any sequence of D-ary codes with rate strictly below the source
entropy, their probability of error cannot asymptotically vanish (is bounded away
from zero). Actually, a stronger result is proven: it is shown that their probability of
error not only does not asymptotically vanish, it actually ultimately grows to 1 (this
is why we call this part a “strong” converse).
Theorem 3.6 (Shannon’s source coding theorem) Given integer D > 1, consider
a discrete memoryless source {X_n}_{n=1}^∞ with entropy H_D(X). Then the following hold.
• Forward part (achievability): For any 0 < ε < 1, there exists 0 < δ < ε and a
sequence of D-ary block codes {∼Cn = (n, M_n)}_{n=1}^∞ with

lim sup_{n→∞} (1/n) log_D M_n ≤ H_D(X) + δ    (3.2.1)

satisfying

Pe(∼Cn) < ε    (3.2.2)

for all sufficiently large n, where Pe(∼Cn) denotes the probability of error (or
decoding error) for block code ∼Cn.⁷

7 Note that (3.2.2) is equivalent to lim supn→∞ Pe (∼Cn ) ≤ ε. Since ε can be made arbitrarily small,
the forward part actually indicates the existence of a sequence of D-ary block codes {∼Cn }∞ n=1
satisfying (3.2.1) such that lim supn→∞ Pe (∼Cn ) = 0. Based on this, the converse should be that any
sequence of D-ary block codes satisfying (3.2.3) satisfies lim supn→∞ Pe (∼Cn ) > 0. However, the
so-called strong converse actually gives a stronger consequence: lim sup_{n→∞} Pe(∼Cn) = 1 (as ε can
be made arbitrarily small).

Table 3.1 An example of the δ-typical set with n = 2 and δ = 0.4, where F_2(0.4) = {AB, AC,
BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC),
110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following
each binary codeword indicates those sourcewords that are encoded to this codeword. The source
distribution is P_X(A) = 0.4, P_X(B) = 0.3, P_X(C) = 0.2, and P_X(D) = 0.1

Source | |−(1/2) ∑_{i=1}^{2} log_2 P_X(x_i) − H(X)| | Codeword | Reconstructed source sequence
AA     | 0.525 bits  ∉ F_2(0.4) | 000 | Ambiguous
AB     | 0.317 bits  ∈ F_2(0.4) | 001 | AB
AC     | 0.025 bits  ∈ F_2(0.4) | 010 | AC
AD     | 0.475 bits  ∉ F_2(0.4) | 000 | Ambiguous
BA     | 0.317 bits  ∈ F_2(0.4) | 011 | BA
BB     | 0.109 bits  ∈ F_2(0.4) | 100 | BB
BC     | 0.183 bits  ∈ F_2(0.4) | 101 | BC
BD     | 0.683 bits  ∉ F_2(0.4) | 000 | Ambiguous
CA     | 0.025 bits  ∈ F_2(0.4) | 110 | CA
CB     | 0.183 bits  ∈ F_2(0.4) | 111 | CB
CC     | 0.475 bits  ∉ F_2(0.4) | 000 | Ambiguous
CD     | 0.975 bits  ∉ F_2(0.4) | 000 | Ambiguous
DA     | 0.475 bits  ∉ F_2(0.4) | 000 | Ambiguous
DB     | 0.683 bits  ∉ F_2(0.4) | 000 | Ambiguous
DC     | 0.975 bits  ∉ F_2(0.4) | 000 | Ambiguous
DD     | 1.475 bits  ∉ F_2(0.4) | 000 | Ambiguous

• Strong converse part: For any 0 < ε < 1, any sequence of D-ary block codes
{∼Cn = (n, M_n)}_{n=1}^∞ with

lim sup_{n→∞} (1/n) log_D M_n < H_D(X)    (3.2.3)

satisfies

Pe(∼Cn) > 1 − ε

for all n sufficiently large.


Proof Forward Part: Without loss of generality, we will prove the result for the case
of binary codes (i.e., D = 2). Also, recall that subscript D in H D (X ) will be dropped
(i.e., omitted) specifically when D = 2.
Given 0 < ε < 1, fix δ such that 0 < δ < ε and choose n > 2/δ. Now construct
a binary block code ∼Cn by simply mapping the δ/2-typical sourcewords x^n onto
distinct non-all-zero binary codewords of length k := ⌈log_2 M_n⌉ bits. In other words,
binary-index (cf. Observation 3.3) the sourcewords in F_n(δ/2) with the following
encoding map:

x^n → binary index of x^n,  if x^n ∈ F_n(δ/2),
x^n → all-zero codeword,    if x^n ∉ F_n(δ/2).

Then by the Shannon–McMillan–Breiman AEP theorem, we obtain that

M_n = |F_n(δ/2)| + 1 ≤ 2^{n(H(X)+δ/2)} + 1 < 2 · 2^{n(H(X)+δ/2)} < 2^{n(H(X)+δ)}

for n > 2/δ. Hence, a sequence of block codes ∼Cn = (n, M_n) satisfying (3.2.1) is
established. It remains to show that the error probability for this sequence of (n, M_n)
block codes can be made smaller than ε for all sufficiently large n.
By the Shannon–McMillan–Breiman AEP theorem,

P_{X^n}(F_n^c(δ/2)) < δ/2 for all sufficiently large n.
Consequently, for those n satisfying the above inequality, and being bigger than 2/δ,

Pe (∼Cn ) ≤ PX n (Fnc (δ/2)) < δ ≤ ε.

(For the last step, the reader can refer to Table 3.1 to confirm that only the “ambigu-
ous” sequences outside the typical set contribute to the probability of error.)
Strong Converse Part: Fix any sequence of block codes {∼Cn }∞
n=1 with

lim sup_{n→∞} (1/n) log_2 |∼Cn| < H(X).

Let S_n be the set of sourcewords that can be correctly decoded through the ∼Cn-coding
system. (A quick example is depicted in Fig. 3.2.) Then |S_n| = |∼Cn|. By
choosing δ small enough with ε/2 > δ > 0, and by definition of the limsup operation,
we have

(∃ N_0)(∀ n > N_0)  (1/n) log_2 |S_n| = (1/n) log_2 |∼Cn| < H(X) − 2δ,

which implies

|S_n| < 2^{n(H(X)−2δ)} for n > N_0.

Furthermore, from Property 2 of the consequence of the AEP, we obtain that

(∃ N1 )(∀ n > N1 ) PX n (Fnc (δ)) < δ.

Consequently, for n > N := max{N_0, N_1, log_2(2/ε)/δ}, the probability of correct
block decoding satisfies

[Fig. 3.2 Possible code ∼Cn and its corresponding S_n: source symbols (with the subset S_n) mapped to reproduction words; the solid box indicates the decoding mapping from ∼Cn back to S_n]


1 − Pe(∼Cn) = ∑_{x^n∈S_n} P_{X^n}(x^n)
           = ∑_{x^n∈S_n∩F_n^c(δ)} P_{X^n}(x^n) + ∑_{x^n∈S_n∩F_n(δ)} P_{X^n}(x^n)
           ≤ P_{X^n}(F_n^c(δ)) + |S_n ∩ F_n(δ)| · max_{x^n∈F_n(δ)} P_{X^n}(x^n)
           < δ + |S_n| · max_{x^n∈F_n(δ)} P_{X^n}(x^n)
           < ε/2 + 2^{n(H(X)−2δ)} · 2^{−n(H(X)−δ)}
           = ε/2 + 2^{−nδ}
           < ε,

which is equivalent to Pe (∼Cn ) > 1 − ε for n > N . 
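The forward-part construction can be programmed directly. The sketch below (our illustration, reusing the source of Table 3.1 with δ = 0.4) binary-indexes the typical sourcewords starting from index one, sends everything else to the all-zero codeword, and reports the code rate (1/n)⌈log_2 M_n⌉ together with the probability of decoding error; for n = 2 it reproduces the codewords of Table 3.1 and the implied error probability Pe = 0.39.

import itertools, math

pX = {'A': 0.4, 'B': 0.3, 'C': 0.2, 'D': 0.1}
H = -sum(p * math.log2(p) for p in pX.values())            # H(X), about 1.846 bits

def build_typical_code(n, delta):
    words = list(itertools.product(pX, repeat=n))
    prob = {xn: math.prod(pX[x] for x in xn) for xn in words}
    typical = [xn for xn in words if abs(-math.log2(prob[xn]) / n - H) <= delta]
    k = math.ceil(math.log2(len(typical) + 1))             # +1 for the all-zero codeword
    enc = {xn: format(i + 1, '0%db' % k) for i, xn in enumerate(typical)}
    dec = {c: xn for xn, c in enc.items()}
    f = lambda xn: enc.get(xn, '0' * k)                    # encoder
    g = lambda c: dec.get(c, 'ambiguous')                  # decoder
    Pe = sum(prob[xn] for xn in words if xn not in enc)    # non-typical mass = error prob.
    return f, g, k / n, Pe

f, g, rate, Pe = build_typical_code(2, 0.4)
print(f(('A', 'B')), g('001'), rate, round(Pe, 2))         # 001 ('A', 'B') 1.5 0.39

for n in (2, 4, 8):
    _, _, rate, Pe = build_typical_code(n, 0.4)
    print(n, rate, round(Pe, 3))                           # Pe decreases toward 0 as n grows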


Observation 3.7 The results of the above theorem are illustrated in Fig. 3.3, where
R̄ = lim supn→∞ (1/n) log D Mn is usually called the asymptotic code rate of block
codes for compressing the source. It is clear from the figure that the (asymptotic) rate
of any block code with arbitrarily small decoding error probability must be greater
than or equal to the source entropy.8 Conversely, the probability of decoding error
for any block code of rate smaller than entropy ultimately approaches 1 (and hence is
bounded away from zero). Thus for a DMS, the source entropy H D (X ) is the infimum
of all “achievable” source (block) coding rates; i.e., it is the infimum of all rates for

8 Note that it is clear from the statement and proof of the forward part of Theorem 3.6 that the source

entropy can be achieved as an asymptotic compression rate as long as (1/n) log D Mn approaches it
from above with increasing n. Furthermore, the asymptotic compression rate is defined as the limsup
of (1/n) log D Mn in order to guarantee reliable compression for n sufficiently large (analogously,
in channel coding, the asymptotic transmission rate is defined via the liminf of (1/n) log D Mn to
ensure reliable communication for all sufficiently large n, see Chap. 4).

[Fig. 3.3 Asymptotic compression rate R̄ versus source entropy H_D(X) and behavior of the probability of block decoding error as blocklength n goes to infinity for a discrete memoryless source: for R̄ < H_D(X), Pe → 1 for all block codes; for R̄ > H_D(X), Pe → 0 for the best data compression block code]

which there exists a sequence of D-ary block codes with asymptotically vanishing
(as the blocklength goes to infinity) probability of decoding error. Indeed to prove
that H_D(X) is indeed this infimum, we decomposed the above theorem into two parts as per
the properties of the infimum; see Observation A.11.

For a source with (statistical) memory, the Shannon–McMillan–Breiman theorem


cannot be directly applied in its original form, and thereby Shannon’s source coding
theorem appears restricted to only memoryless sources. However, by exploring the
concept behind these theorems, one can find that the key for the validity of Shannon’s
source coding theorem is actually the existence of a set A_n = {x_1^n, x_2^n, . . . , x_M^n} with
M ≈ D^{n H_D(X)} and P_{X^n}(A_n^c) → 0, namely, the existence of a “typical-like” set A_n
whose size is prohibitively small and whose probability mass is asymptotically large.
Thus, if we can find such a typical-like set for a source with memory, the source coding
theorem for block codes can be extended for this source. Indeed, with appropriate
modifications, the Shannon–McMillan–Breiman theorem can be generalized for the
class of stationary ergodic sources and hence a block source coding theorem for this
class can be established; this is considered in the next section. The block source
coding theorem for general (e.g., nonstationary non-ergodic) sources in terms of a
generalized “spectral” entropy measure is studied in [73, 172, 175] (see also the end
of the next section for a brief description).

3.2.2 Block Codes for Stationary Ergodic Sources

In practice, a stochastic source used to model data often exhibits memory or statistical
dependence among its random variables; its joint distribution is hence not a product
of its marginal distributions. In this section, we consider the asymptotic lossless data
compression theorem for the class of stationary ergodic sources.9
Before proceeding to generalize the block source coding theorem, we need to first
generalize the “entropy” measure for a sequence of dependent random variables X n
(which certainly should be backward compatible to the discrete memoryless cases).
A straightforward generalization is to examine the limit of the normalized block
entropy of a source sequence, resulting in the concept of entropy rate.

9 The definitions of stationarity and ergodicity can be found in Sect. B.3 of Appendix B.

Definition 3.8 (Entropy rate) The entropy rate for a source {X_n}_{n=1}^∞ is denoted by
H(X) and defined by

H(X) := lim_{n→∞} (1/n) H(X^n)

provided the limit exists, where X^n = (X_1, . . . , X_n).

Next, we will show that the entropy rate exists for stationary sources (here, we
do not need ergodicity for the existence of entropy rate).

Lemma 3.9 For a stationary source {X_n}_{n=1}^∞, the conditional entropy

H(X_n | X_{n−1}, . . . , X_1)

is nonincreasing in n and also bounded from below by zero. Hence, by Lemma A.20,
the limit

lim_{n→∞} H(X_n | X_{n−1}, . . . , X_1)

exists.

Proof We have

H (X n |X n−1 , . . . , X 1 ) ≤ H (X n |X n−1 , . . . , X 2 ) (3.2.4)


= H (X n , . . . , X 2 ) − H (X n−1 , . . . , X 2 )
= H (X n−1 , . . . , X 1 ) − H (X n−2 , . . . , X 1 ) (3.2.5)
= H (X n−1 |X n−2 , . . . , X 1 ),

where (3.2.4) follows since conditioning never increases entropy, and (3.2.5) holds
because of the stationarity assumption. Finally, recall that each conditional entropy
H (X n |X n−1 , . . . , X 1 ) is nonnegative. 
Lemma 3.10 (Cesàro-mean theorem) If a_n → a as n → ∞ and b_n = (1/n) ∑_{i=1}^{n} a_i,
then b_n → a as n → ∞.

Proof a_n → a implies that for any ε > 0, there exists an N such that for all n > N,
|a_n − a| < ε. Then

|b_n − a| = | (1/n) ∑_{i=1}^{n} (a_i − a) |
          ≤ (1/n) ∑_{i=1}^{n} |a_i − a|
          = (1/n) ∑_{i=1}^{N} |a_i − a| + (1/n) ∑_{i=N+1}^{n} |a_i − a|
          ≤ (1/n) ∑_{i=1}^{N} |a_i − a| + ((n−N)/n) ε.

Hence, lim_{n→∞} |b_n − a| ≤ ε. Since ε can be made arbitrarily small, the lemma
holds. □
Theorem 3.11 The entropy rate of a stationary source {X_n}_{n=1}^∞ always exists and
is equal to

H(X) = lim_{n→∞} H(X_n | X_{n−1}, . . . , X_1).

Proof The result directly follows by writing

(1/n) H(X^n) = (1/n) ∑_{i=1}^{n} H(X_i | X_{i−1}, . . . , X_1)   (chain rule for entropy)

and applying the Cesàro-mean theorem. □


Observation 3.12 It can also be shown that for a stationary source, (1/n)H(X^n)
is nonincreasing in n and (1/n)H(X^n) ≥ H(X_n | X_{n−1}, . . . , X_1) for all n ≥ 1. (The
proof is left as an exercise; see Problem 3.)
It is obvious that when {X_n}_{n=1}^∞ is a discrete memoryless source, H(X^n) =
n · H(X) for every n. Hence, the entropy rate reduces to the source entropy:

H(X) = lim_{n→∞} (1/n) H(X^n) = H(X).

For a stationary Markov source (of order one),¹⁰

H(X) = lim_{n→∞} (1/n) H(X^n) = lim_{n→∞} H(X_n | X_{n−1}, . . . , X_1) = H(X_2 | X_1),

where

H(X_2 | X_1) = − ∑_{x_1∈X} ∑_{x_2∈X} π(x_1) P_{X_2|X_1}(x_2|x_1) log P_{X_2|X_1}(x_2|x_1),

and π(·) is a stationary distribution for the Markov source (note that π(·) is unique if
the Markov source is irreducible11 ). For example, for the stationary binary Markov

10 If a Markov source is mentioned without specifying its order, it is understood that it is a first-order

Markov source; see Appendix B for a brief overview on Markov sources and their properties.
11 See Sect. B.3 of Appendix B for the definition of irreducibility for Markov sources.

source with transition probabilities P_{X_2|X_1}(0|1) = α and P_{X_2|X_1}(1|0) = β, where
0 < α, β < 1, we have

H(X) = (β/(α+β)) h_b(α) + (α/(α+β)) h_b(β),

where h_b(α) := −α log_2 α − (1 − α) log_2(1 − α) is the binary entropy function.
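As a quick numerical illustration of this formula (added here; the values α = 0.2 and β = 0.3 are arbitrary), the entropy rate of the stationary binary Markov source can be computed directly from the stationary distribution:

import math

def hb(p):                          # binary entropy function h_b(p) in bits
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def markov_entropy_rate(alpha, beta):
    """Entropy rate of the stationary binary Markov source with
    P(0|1) = alpha and P(1|0) = beta (0 < alpha, beta < 1)."""
    pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)   # stationary distribution
    return pi1 * hb(alpha) + pi0 * hb(beta)                    # = H(X_2 | X_1)

print(markov_entropy_rate(0.2, 0.3))   # equals beta/(a+b)*hb(a) + a/(a+b)*hb(b)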


Observation 3.13 (Divergence rate for sources with memory) We briefly note that
analogously to the notion of entropy rate, the divergence rate (or Kullback–Leibler
divergence rate) can also be defined for sources with memory. More specifically,
∞ ∞
given two discrete sources {X i }i=1 and { X̂ i }i=1 defined on a common finite alpha-
bet X , with respective sequences of n-fold distributions {PX n } and {PX̂ n }, then the
∞ ∞
divergence rate between {X i }i=1 and { X̂ i }i=1 is defined by

lim_{n→∞} (1/n) D(P_{X^n} ∥ P_{X̂^n})

provided the limit exists.12 The divergence rate is not guaranteed to exist in general;
in [350], two examples of non-Markovian ergodic sources are given for which the
divergence rate does not exist. However, if the source { X̂ i } is time-invariant Markov
and {X i } is stationary, then the divergence rate exists and is given in terms of the
entropy rate of {X i } and another quantity depending on the (second-order) statistics
of {X i } and { X̂ i } as follows [157, p. 40]:

lim_{n→∞} (1/n) D(P_{X^n} ∥ P_{X̂^n}) = − lim_{n→∞} (1/n) H(X^n) − ∑_{x_1∈X} ∑_{x_2∈X} P_{X_1 X_2}(x_1, x_2) log_2 P_{X̂_2|X̂_1}(x_2|x_1).
(3.2.6)
Furthermore, if both {X i } and { X̂ i } are time-invariant irreducible Markov sources,
then their divergence rate exists and admits the following expression [312, Theo-
rem 1]:

lim_{n→∞} (1/n) D(P_{X^n} ∥ P_{X̂^n}) = ∑_{x_1∈X} ∑_{x_2∈X} π_X(x_1) P_{X_2|X_1}(x_2|x_1) log_2 [ P_{X_2|X_1}(x_2|x_1) / P_{X̂_2|X̂_1}(x_2|x_1) ],

where π X (·) is the stationary distribution of {X i }. The above result can also be
generalized using the theory of nonnegative matrices and Perron–Frobenius theory
for {X i } and { X̂ i } being arbitrary (not necessarily irreducible, stationary, etc.) time-
invariant Markov chains; see the explicit computable expression in [312, Theorem 2].
A direct consequence of the later result is a formula for the entropy rate of an arbitrary
not necessarily stationary time-invariant Markov source [312, Corollary 2].13

¹² Another notation for the divergence rate is lim_{n→∞} (1/n) D(X^n ∥ X̂^n).


¹³ More generally, the Rényi entropy rate of order α, lim_{n→∞} (1/n) H_α(X^n), as well as the Rényi
divergence rate of order α, lim_{n→∞} (1/n) D_α(P_{X^n} ∥ P_{X̂^n}), for arbitrary time-invariant Markov sources

Finally, note that all the above results also hold with the proper modifications if
the Markov chains are replaced with kth-order Markov chains (for any integer k > 1)
[312].

Theorem 3.14 (Generalized AEP or Shannon–McMillan–Breiman Theorem [58,
83, 267, 340]) If {X_n}_{n=1}^∞ is a stationary ergodic source, then

−(1/n) log_2 P_{X^n}(X_1, . . . , X_n) → H(X)  almost surely (a.s.).
Since the AEP theorem (law of large numbers) is valid for stationary ergodic
sources, all consequences of the AEP will follow, including Shannon’s lossless source
coding theorem.

Theorem 3.15 (Shannon’s source coding theorem for stationary ergodic sources)
Given integer D > 1, let {X_n}_{n=1}^∞ be a stationary ergodic source with entropy rate
(in base D)

H_D(X) := lim_{n→∞} (1/n) H_D(X^n).

Then the following hold.

• Forward part (achievability): For any 0 < ε < 1, there exists a δ with 0 < δ < ε
and a sequence of D-ary block codes {∼Cn = (n, M_n)}_{n=1}^∞ with

lim sup_{n→∞} (1/n) log_D M_n < H_D(X) + δ,

and probability of decoding error satisfying

Pe(∼Cn) < ε

for all sufficiently large n.

• Strong converse part: For any 0 < ε < 1, any sequence of D-ary block codes
{∼Cn = (n, M_n)}_{n=1}^∞ with

lim sup_{n→∞} (1/n) log_D M_n < H_D(X)

satisfies

Pe(∼Cn) > 1 − ε

for all n sufficiently large.

{X i } and { X̂ i } exist and admit closed-form expressions [311] (see also the earlier work in [279],
where the results hold for more restricted classes of Markov sources).

A discrete memoryless (i.i.d.) source is stationary and ergodic (so Theorem 3.6
is clearly a special case of Theorem 3.15). In general, it is hard to check whether a
stationary process is ergodic or not. It is known though that if a stationary process is
a mixture of two or more stationary ergodic processes, i.e., its n-fold distribution can
be written as the mean (with respect to some distribution) of the n-fold distributions
of stationary ergodic processes, then it is not ergodic.14
For example, let P and Q be two distributions on a finite alphabet X such that
the process {X n }∞ ∞
n=1 is i.i.d. with distribution P and the process {Yn }n=1 is i.i.d. with
distribution Q. Flip a biased coin (with Heads probability equal to θ, 0 < θ < 1)
once and let
X n , if Heads,
Zn =
Yn , if Tails,

for n = 1, 2, . . .. Then, the resulting process {Z_n}_{n=1}^∞ has its n-fold distribution as a
mixture of the n-fold distributions of {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞:

P_{Z^n}(a^n) = θ P_{X^n}(a^n) + (1 − θ) P_{Y^n}(a^n)    (3.2.7)

for all a^n ∈ X^n, n = 1, 2, . . .. Then, the process {Z_n}_{n=1}^∞ is stationary but not ergodic.
A specific case for which ergodicity can be easily verified (other than the case
of i.i.d. sources) is the case of stationary Markov sources. Specifically, if a (finite
alphabet) stationary Markov source is irreducible, then it is ergodic (e.g., see [30,
p. 371] and [349, Prop. I.2.9]), and hence a generalized AEP holds for this source.
Note that irreducibility can be verified in terms of the source’s transition probability
matrix.
The following are two examples of stationary processes based on Polya’s conta-
gion urn scheme [304], one of which is non-ergodic and the other ergodic.
Example 3.16 (Polya contagion process [304]) We consider a binary process with
memory {Z n }∞n=1 , which is obtained by the following Polya contagion urn sampling
mechanism [304–306] (see also [119, 120]).
An urn initially contains T balls, of which R are red and B are black (T = R + B).
Successive draws are made from the urn, where after each draw 1 + Δ balls of
the same color just drawn are returned to the urn (Δ > 0). The process {Z_n}_{n=1}^∞ is
generated according to the outcome of the draws:

1, if the nth ball drawn is red,
Zn =
0, if the nth ball drawn is black.

In this model, a red ball in the urn can represent an infected person in the population
and a black ball can represent a healthy person. Since the number of balls of the color
just drawn increases (while the number of balls of the opposite color is unchanged),

14 The converse is also true; i.e., if a stationary process cannot be represented as a mixture of

stationary ergodic processes, then it is ergodic.



the likelihood that a ball of the same color as the ball just drawn will be picked in
the next draw increases. Hence, the occurrence of an “unfavorable” event (say an
infection) increases the probability of future unfavorable events (the same applies for
favorable events) and as a result the model provides a basic template for characterizing
contagious phenomena.
For any n ≥ 1, the n-fold distribution of the binary process {Z n }∞
n=1 can be derived
in closed form as follows:
Pr[Z^n = a^n] = [ρ(ρ + δ) · · · (ρ + (d − 1)δ) σ(σ + δ) · · · (σ + (n − d − 1)δ)] / [(1 + δ)(1 + 2δ) · · · (1 + (n − 1)δ)]
             = [Γ(1/δ) Γ(ρ/δ + d) Γ(σ/δ + n − d)] / [Γ(ρ/δ) Γ(σ/δ) Γ(1/δ + n)]    (3.2.8)

for all a^n = (a_1, a_2, . . . , a_n) ∈ {0, 1}^n, where d = a_1 + a_2 + · · · + a_n, ρ := R/T,
σ := 1 − ρ = B/T, δ := Δ/T, and Γ(·) is the gamma function given by

Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt  for x > 0.

To obtain the last equation in (3.2.8), we use the identity

∏_{j=0}^{n−1} (α + jβ) = β^n Γ(α/β + n) / Γ(α/β),

which is obtained using the fact that Γ(x + 1) = x Γ(x). We remark from expression
(3.2.8) for the joint distribution that the process {Z_n} is exchangeable¹⁵ and is thus
stationary. Furthermore, it can be shown [120, 306] that the process sample average
(1/n)(Z_1 + Z_2 + · · · + Z_n) converges almost surely as n → ∞ to a random variable
Z, whose distribution is given by the beta distribution with parameters ρ/δ = R/Δ
and σ/δ = B/Δ. This directly implies that the process {Z_n}_{n=1}^∞ is not ergodic since
its sample average does not converge to a constant. It is also shown in [12] that the
entropy rate of {Z_n}_{n=1}^∞ is given by

H(Z) = E_Z[h_b(Z)] = ∫_0^1 h_b(z) f_Z(z) dz

where h b (·) is the binary entropy function and

15 A process {Z ∞
n }n=1 is called exchangeable (or symmetrically dependent) if for every finite positive
integer n, the random variables Z 1 , Z 2 , . . . , Z n have the property that their joint distribution is
invariant with respect to all permutations of the indices 1, 2, . . . , n (e.g., see [120]). The notion of
exchangeability is originally due to de Finetti [90]. It directly follows from the definition that an
exchangeable process is stationary.
f_Z(z) = [Γ(1/δ) / (Γ(ρ/δ) Γ(σ/δ))] z^{ρ/δ−1} (1 − z)^{σ/δ−1},  if 0 < z < 1,
f_Z(z) = 0,  otherwise,

is the beta probability density function with parameters ρ/δ and σ/δ. Note that
Theorem 3.15 does not hold for the contagion source {Z n } since it is not ergodic.
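The non-ergodicity of the contagion process is easy to observe by simulation. The following sketch (our addition; R = B = Δ = 1 are illustrative parameters) runs several independent copies of the urn and prints each copy's long-run fraction of red draws: the fractions settle down, but to different random limits (Beta(R/Δ, B/Δ)-distributed) rather than to a common constant.

import random

def polya_sample_average(R, B, delta, n, seed):
    """Simulate n Polya draws and return the fraction of red draws (illustrative sketch)."""
    rng = random.Random(seed)
    red, total, ones = float(R), float(R + B), 0
    for _ in range(n):
        z = 1 if rng.random() < red / total else 0   # draw: red with prob. red/total
        ones += z
        red += delta * z                             # 1 + delta balls of that color returned
        total += delta
    return ones / n

R, B, delta = 1, 1, 1                                # illustrative parameters
print([round(polya_sample_average(R, B, delta, 20000, s), 3) for s in range(5)])
# Different runs converge to different limits, so the sample average is not a constant:
# the process is not ergodic.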
Finally, letting 0 ≤ R_n ≤ 1 denote the proportion of red balls in the urn after the
nth draw, we can write

R_n = [R + Δ(Z_1 + Z_2 + · · · + Z_n)] / (T + nΔ)    (3.2.9)
    = [R_{n−1}(T + (n − 1)Δ) + Z_n Δ] / (T + nΔ).    (3.2.10)

Now using (3.2.10), we have

E[R_n | R_{n−1}, . . . , R_1] = E[R_n | R_{n−1}]
                            = [R_{n−1}(T + (n − 1)Δ) + Δ]/(T + nΔ) · R_{n−1}
                              + [R_{n−1}(T + (n − 1)Δ)]/(T + nΔ) · (1 − R_{n−1})
                            = R_{n−1}

almost surely, and thus {Rn } is a martingale (e.g., [120, 162]). Since {Rn } is bounded,
we obtain by the martingale convergence theorem that Rn converges almost surely
to some limiting random variable. But from (3.2.9), we note that the asymptotic
behavior of R_n is identical to that of (1/n)(Z_1 + Z_2 + · · · + Z_n). Thus, R_n also converges
almost surely to the above beta-distributed random variable Z .
In [12], a binary additive noise channel, whose noise is the above Polya contagion
process {Z n }∞
n=1 , is investigated as a model for a non-ergodic communication channel
with memory. Polya’s urn scheme has also been applied and generalized in a wide
range of contexts, including genetics [210], evolution and epidemiology [257, 289],
image segmentation [35], and network epidemics [182] (see also the survey in [289]).

Example 3.17 (Finite-memory Polya contagion process [12]) The above Polya
model has “infinite” memory in the sense that the very first ball drawn from the
urn has an identical effect (that does not vanish as the number of draws grows with-
out bound) as the 999 999th ball drawn from the urn on the outcome of the millionth
draw. In the context of modeling contagious phenomena, this is not reasonable as one
would assume that the effects of an infection dissipate in time. We herein consider a
more realistic urn model with finite memory [12].
Consider again an urn originally containing T = R + B balls, of which R are red
and B are black. At the nth draw, n = 1, 2, . . ., a ball is selected at random from the
urn and replaced with 1 + Δ balls of the same color just drawn (Δ > 0). Then, M
draws later, i.e., after the (n + M)th draw, Δ balls of the color picked at the nth draw
are retrieved from the urn.

Note that in this model, the total number of balls in the urn is constant (T + MΔ)
after an initialization period of M draws. Also, in this scheme, the effect of any draw
is limited to M draws in the future. The process {Z n }∞ n=1 again corresponds to the
outcome of the draws:

1, if the nth ball drawn is red
Zn =
0, if the nth ball drawn is black.

We have that for n ≥ M + 1,

Pr[Z_n = 1 | Z_{n−1} = z_{n−1}, . . . , Z_1 = z_1] = [R + Δ(z_{n−1} + · · · + z_{n−M})] / (T + MΔ)
                                                = Pr[Z_n = 1 | Z_{n−1} = z_{n−1}, . . . , Z_{n−M} = z_{n−M}]
for any z_i ∈ {0, 1}, i = 1, . . . , n. Thus, {Z_n}_{n=1}^∞ is a Markov process of memory
order M. It is also shown in [12] that {Z_n} is stationary and its stationary distribution
as well as its n-fold distribution and its entropy rate, which is given by

H (Z) = H (Z M+1 |Z M , Z M−1 , . . . , Z 1 ),

are derived in closed form in terms of R/T, Δ/T, and M. Furthermore, it is shown
that {Z n }∞
n=1 is irreducible, and hence ergodic. Thus, Theorem 3.15 applies for this
finite-memory Polya contagion process. In [420], a generalized version of this process
is introduced via a ball sampling mechanism involving a large urn and a finite queue.

Observation 3.18 In complicated situations such as when the source is nonstation-


ary (with time-varying statistics) and/or non-ergodic (such as the non-ergodic pro-
cesses in (3.2.7) or in Example 3.16), the source entropy rate H (X ) (if the limit exists;
otherwise, one can look at the lim inf/lim sup of (1/n)H (X n )) has no longer an oper-
ational meaning as the smallest possible block compression rate. This causes the need
to establish new entropy measures that appropriately characterize the operational lim-
its of an arbitrary stochastic system with memory. This is achieved in [175] where Han
and Verdú introduce the spectral notions of inf/sup-entropy rates and illustrate the key
role these entropy measures play in proving a general lossless block source coding
theorem. More specifically, they demonstrate that for an arbitrary (not necessarily
stationary and ergodic) finite-alphabet source X := {X n = (X 1 , X 2 , . . . , X n )}∞
n=1 ,
the expression for the minimum achievable (block) source coding rate is given by
the sup-entropy rate H̄(X), defined by

H̄(X) := inf{ β ∈ ℝ : lim sup_{n→∞} Pr[ −(1/n) log P_{X^n}(X^n) > β ] = 0 }.

More details are provided in [73, 172, 175].



3.2.3 Redundancy for Lossless Block Data Compression

Shannon’s block source coding theorem establishes that the smallest data com-
pression rate for achieving arbitrarily small error probability for stationary ergodic
sources is given by the entropy rate. Thus, one can define the source redundancy as
the reduction in coding rate one can achieve via asymptotically lossless block source
coding versus just using uniquely decodable (completely lossless for any value of
the sourceword blocklength n) block source coding. In light of the fact that the for-
mer approach yields a source coding rate equal to the entropy rate while the latter
approach provides a rate of log2 |X |, we therefore define the total block source
coding redundancy ρt (in bits/source symbol) for a stationary ergodic source
{X n }∞
n=1 as
ρt := log2 |X | − H (X ).

Hence, ρt represents the amount of “useless” (or superfluous) statistical source infor-
mation one can eliminate via binary16 block source coding.
If the source is i.i.d. and uniformly distributed, then its entropy rate is equal
to log2 |X | and as a result its redundancy is ρt = 0. This means that the source is
incompressible, as expected, since in this case every sourceword x n will belong to
the δ-typical set Fn (δ) for every n > 0 and δ > 0 (i.e., Fn (δ) = X n ), and hence there
are no superfluous sourcewords that can be dispensed of via source coding. If the
source has memory or has a nonuniform marginal distribution, then its redundancy
is strictly positive and can be classified into two parts:

• Source redundancy due to the nonuniformity of the source marginal distribution


ρd :
ρd := log2 |X | − H (X 1 ).

• Source redundancy due to the source memory ρm :

ρm := H (X 1 ) − H (X ).

As a result, the source total redundancy ρt can be decomposed in two parts:

ρt = ρd + ρm .

We can summarize the redundancy of some typical stationary ergodic sources in


the following table.

¹⁶ Since we are measuring ρ_t in code bits/source symbol, all logarithms in its expression are in base
2, and hence this redundancy can be eliminated via asymptotically lossless binary block codes (one
can also change the units to D-ary code symbol/source symbol using base-D logarithms for the
case of D-ary block codes).

Source                           | ρd                  | ρm                   | ρt
i.i.d. uniform                   | 0                   | 0                    | 0
i.i.d. nonuniform                | log_2|X| − H(X_1)   | 0                    | ρd
First-order symmetric Markov^a   | 0                   | H(X_1) − H(X_2|X_1)  | ρm
First-order nonsymmetric Markov  | log_2|X| − H(X_1)   | H(X_1) − H(X_2|X_1)  | ρd + ρm

^a A first-order Markov process is symmetric if for any x_1 and x̂_1,
{a : a = P_{X_2|X_1}(y|x_1) for some y} = {a : a = P_{X_2|X_1}(y|x̂_1) for some y}.
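As a small computational illustration (added here; the transition probabilities are arbitrary), the redundancies of a stationary binary first-order Markov source with parameters α = P_{X_2|X_1}(0|1) and β = P_{X_2|X_1}(1|0) can be evaluated as follows.

import math

def hb(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def markov_redundancies(alpha, beta):
    """rho_d, rho_m, rho_t for the stationary binary Markov source with
    P(0|1) = alpha, P(1|0) = beta (sketch; alphabet size |X| = 2)."""
    pi1 = beta / (alpha + beta)                       # stationary P(X_1 = 1)
    H1 = hb(pi1)                                      # H(X_1)
    Hrate = (1 - pi1) * hb(beta) + pi1 * hb(alpha)    # entropy rate H(X_2|X_1)
    rho_d = 1.0 - H1                                  # log2|X| - H(X_1), with |X| = 2
    rho_m = H1 - Hrate
    return rho_d, rho_m, rho_d + rho_m

print(markov_redundancies(0.2, 0.3))                  # arbitrary example values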

3.3 Variable-Length Codes for Lossless Data Compression

3.3.1 Non-singular Codes and Uniquely Decodable Codes

We next study variable-length (completely) lossless data compression codes.

Definition 3.19 Consider a discrete source {X n }∞ n=1 with finite alphabet X along
with a D-ary code alphabet B = {0, 1, . . . , D − 1}, where D > 1 is an integer. Fix
integer n ≥ 1; then a D-ary nth-order variable-length code (VLC) is a function

f : X n → B∗

mapping (fixed-length) sourcewords of length n to D-ary codewords in B ∗ of variable


lengths, where B ∗ denotes the set of all finite-length strings from B (i.e., c ∈ B ∗ ⇐⇒
∃ integer l ≥ 1 such that c ∈ Bl ).
The codebook C of a VLC is the set of all codewords:

C = f (X n ) = { f (x n ) ∈ B ∗ : x n ∈ X n }.

A variable-length lossless data compression code is a code in which the source


symbols can be completely reconstructed without distortion. In order to achieve this
goal, the source symbols have to be encoded unambiguously in the sense that any
two different source symbols (with positive probabilities) are represented by different
codewords. Codes satisfying this property are called non-singular codes. In practice,
however, the encoder often needs to encode a sequence of source symbols, which
results in a concatenated sequence of codewords. If any concatenation of codewords
can also be unambiguously reconstructed without punctuation, then the code is said
to be uniquely decodable. In other words, a VLC is uniquely decodable if all finite
sequences of sourcewords (x n ∈ X n ) are mapped onto distinct strings of codewords:
for any m and m′, (x_1^n, x_2^n, . . . , x_m^n) ≠ (y_1^n, y_2^n, . . . , y_{m′}^n) implies that

(f(x_1^n), f(x_2^n), . . . , f(x_m^n)) ≠ (f(y_1^n), f(y_2^n), . . . , f(y_{m′}^n)),

or equivalently,

(f(x_1^n), f(x_2^n), . . . , f(x_m^n)) = (f(y_1^n), f(y_2^n), . . . , f(y_{m′}^n))

implies that

m = m′ and x_j^n = y_j^n for j = 1, . . . , m.

Note that a non-singular VLC is not necessarily uniquely decodable. For example,
consider a binary (first-order) code for the source with alphabet

X = {A, B, C, D, E, F}

given by

code of A = 0,
code of B = 1,
code of C = 00,
code of D = 01,
code of E = 10,
code of F = 11.

The above code is clearly non-singular; it is, however, not uniquely decodable
because the codeword sequence, 010, can be reconstructed as AB A, D A, or AE (i.e.,
( f (A), f (B), f (A)) = ( f (D), f (A)) = ( f (A), f (E)) even though (A, B, A),
(D, A), and (A, E) are all non-equal).
One important objective is to find out how “efficiently” we can represent a given
discrete source via a uniquely decodable nth-order VLC and provide a construction
technique that (at least asymptotically, as n → ∞) attains the optimal “efficiency.”
In other words, we want to determine what is the smallest possible average code rate
(or equivalently, the smallest average codeword length) that an nth-order uniquely
decodable VLC can have when (losslessly) representing a given source, and we want
to give an explicit code construction that can attain this smallest possible rate (at
least asymptotically in the sourceword length n).

Definition 3.20 Let C be a D-ary nth-order VLC

f : X^n → {0, 1, . . . , D − 1}^*

for a discrete source {X_n}_{n=1}^∞ with alphabet X and distribution P_{X^n}(x^n), x^n ∈ X^n.
Setting ℓ(c_{x^n}) as the length of the codeword c_{x^n} = f(x^n) associated with sourceword
x^n, then the average codeword length for C is given by

ℓ̄ := ∑_{x^n∈X^n} P_{X^n}(x^n) ℓ(c_{x^n})

and its average code rate (in D-ary code symbols/source symbol) is given by

R̄_n := ℓ̄/n = (1/n) ∑_{x^n∈X^n} P_{X^n}(x^n) ℓ(c_{x^n}).

The following theorem provides a strong condition that a uniquely decodable code
must satisfy.17
Theorem 3.21 (Kraft inequality for uniquely decodable codes) Let C be a uniquely
decodable D-ary nth-order VLC for a discrete source {X_n}_{n=1}^∞ with alphabet X. Let
the M = |X|^n codewords of C have lengths ℓ_1, ℓ_2, . . . , ℓ_M, respectively. Then, the
following inequality must hold:

∑_{m=1}^{M} D^{−ℓ_m} ≤ 1.

Proof Suppose that we use the codebook C to encode N sourcewords (x_k^n ∈ X^n,
k = 1, . . . , N) arriving in a sequence; this yields a concatenated codeword sequence

c_1 c_2 c_3 . . . c_N.

Let the lengths of the codewords be respectively denoted by

ℓ(c_1), ℓ(c_2), . . . , ℓ(c_N).

Consider

∑_{c_1∈C} ∑_{c_2∈C} · · · ∑_{c_N∈C} D^{−[ℓ(c_1)+ℓ(c_2)+···+ℓ(c_N)]}.

It is obvious that the above expression is equal to

( ∑_{c∈C} D^{−ℓ(c)} )^N = ( ∑_{m=1}^{M} D^{−ℓ_m} )^N.

(Note that |C| = M.) On the other hand, all the code sequences with length

i = ℓ(c_1) + ℓ(c_2) + · · · + ℓ(c_N)

contribute equally to the sum of the identity, which is D^{−i}. Let A_i denote the number
of N-codeword sequences that have length i. Then, the above identity can be rewritten
as

( ∑_{m=1}^{M} D^{−ℓ_m} )^N = ∑_{i=1}^{LN} A_i D^{−i},

17 This theorem is also attributed to McMillan.



where

L := max_{c∈C} ℓ(c).

Since C is by assumption a uniquely decodable code, the codeword sequence must
be unambiguously decodable. Observe that a code sequence with length i has at most
D^i unambiguous combinations. Therefore, A_i ≤ D^i, and

( ∑_{m=1}^{M} D^{−ℓ_m} )^N = ∑_{i=1}^{LN} A_i D^{−i} ≤ ∑_{i=1}^{LN} D^i D^{−i} = LN,

which implies that

∑_{m=1}^{M} D^{−ℓ_m} ≤ (LN)^{1/N}.

The proof is completed by noting that the above inequality holds for every N, and
the upper bound (LN)^{1/N} goes to 1 as N goes to infinity. □
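The Kraft sum itself is trivial to evaluate; the following added snippet (the length lists are hypothetical examples) checks whether a given list of codeword lengths is compatible with unique decodability.

def kraft_sum(lengths, D=2):
    """Return sum_m D^(-l_m); a uniquely decodable D-ary code must have this <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))    # 1.0  -> the Kraft inequality is satisfied
print(kraft_sum([1, 1, 2, 2]))    # 1.5  -> no uniquely decodable binary code has these lengths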
The Kraft inequality is a very useful tool, especially for showing that the fun-
damental lower bound of the average rate of uniquely decodable VLCs for discrete
memoryless sources is given by the source entropy.

Theorem 3.22 The average rate of every uniquely decodable D-ary nth-order VLC
for a discrete memoryless source {X n }∞
n=1 is lower bounded by the source entropy
H D (X ) (measured in D-ary code symbols/source symbol).

Proof Consider a uniquely decodable D-ary nth-order VLC code for the source
{X_n}_{n=1}^∞,

f : X^n → {0, 1, . . . , D − 1}^*,

and let ℓ(c_{x^n}) denote the length of the codeword c_{x^n} = f(x^n) for sourceword x^n.
Then

R̄_n − H_D(X) = (1/n) ∑_{x^n∈X^n} P_{X^n}(x^n) ℓ(c_{x^n}) − (1/n) H_D(X^n)
            = (1/n) [ ∑_{x^n∈X^n} P_{X^n}(x^n) ℓ(c_{x^n}) − ∑_{x^n∈X^n} ( −P_{X^n}(x^n) log_D P_{X^n}(x^n) ) ]
            = (1/n) ∑_{x^n∈X^n} P_{X^n}(x^n) log_D [ P_{X^n}(x^n) / D^{−ℓ(c_{x^n})} ]
            ≥ (1/n) ( ∑_{x^n∈X^n} P_{X^n}(x^n) ) log_D [ ∑_{x^n∈X^n} P_{X^n}(x^n) / ∑_{x^n∈X^n} D^{−ℓ(c_{x^n})} ]   (log-sum inequality)
            = −(1/n) log_D ( ∑_{x^n∈X^n} D^{−ℓ(c_{x^n})} )
            ≥ 0,

where the last inequality follows from the Kraft inequality for uniquely decodable
codes and the fact that the logarithm is a strictly increasing function. 
By examining the above proof, we observe that

R̄_n = H_D(X) iff P_{X^n}(x^n) = D^{−ℓ(c_{x^n})};

i.e., the source symbol probabilities are (negative) integer powers of D. Such a source
is called D-adic [83]. In this case, the code is called absolutely optimal as it achieves
the source entropy lower bound (it is thus optimal in terms of yielding a minimal
average code rate for any given n).
Furthermore, we know from the above theorem that the average code rate is no
smaller than the source entropy. Indeed, a lossless data compression code, whose
average code rate achieves entropy, should be optimal (note that if a code’s average
rate is below entropy, then the Kraft inequality is violated and the code is no longer
uniquely decodable). In summary, we have
• Unique decodability =⇒ the Kraft inequality holds.
• Unique decodability =⇒ the average code rate of VLCs for memoryless sources
is lower bounded by the source entropy.
Exercise 3.23
1. Find a non-singular and also non-uniquely decodable code that violates the Kraft
inequality. (Hint: The answer is already provided in this section.)
2. Find a non-singular and also non-uniquely decodable code that beats the entropy
lower bound.

3.3.2 Prefix or Instantaneous Codes

A prefix code18 is a VLC which is self-punctuated in the sense that there is no need
to append extra symbols for differentiating adjacent codewords. A more precise
definition follows:
Definition 3.24 (Prefix code) A VLC is called a prefix code or an instantaneous
code if no codeword is a prefix of any other codeword.
A prefix code is also named an instantaneous code because the codeword sequence
can be decoded instantaneously (it is immediately recognizable) without the refer-
ence to future codewords in the same sequence. Note that a uniquely decodable

18 Another name for prefix codes is prefix-free codes.



code is not necessarily prefix-free and may not be decoded instantaneously. The
relationship between different codes encountered thus far is depicted in Fig. 3.4.

[Fig. 3.4 Classification of variable-length codes: prefix codes ⊂ uniquely decodable codes ⊂ non-singular codes]

[Fig. 3.5 Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110, and 1111]
A D-ary prefix code can be represented graphically as an initial segment of a
D-ary tree. An example of a tree representation for a binary (D = 2) prefix code is
shown in Fig. 3.5.
Theorem 3.25 (Kraft inequality for prefix codes) There exists a D-ary nth-order
prefix code for a discrete source {X_n}_{n=1}^∞ with alphabet X iff the codewords of length
ℓ_m, m = 1, . . . , M, satisfy the Kraft inequality, where M = |X|^n.
Proof Without loss of generality, we provide the proof for the case of D = 2 (binary
codes).

Forward part: Prefix codes satisfy the Kraft inequality.


The codewords of a prefix code can always be put on the leaves of a tree. Pick
up a length

ℓ_max := max_{1≤m≤M} ℓ_m.

A tree has originally 2^{ℓ_max} nodes on level ℓ_max. Each codeword of length ℓ_m obstructs
2^{ℓ_max−ℓ_m} nodes on level ℓ_max. In other words, when any node is chosen as a codeword,
all its children will be excluded from being codewords (as for a prefix code, no
codeword can be a prefix of any other codeword). There are exactly 2^{ℓ_max−ℓ_m} excluded
nodes on level ℓ_max of the tree. Note that no two codewords obstruct the same nodes
on level ℓ_max. Hence, the number of totally obstructed nodes on level ℓ_max should
be no greater than 2^{ℓ_max}, i.e.,

∑_{m=1}^{M} 2^{ℓ_max−ℓ_m} ≤ 2^{ℓ_max},

which immediately implies the Kraft inequality:

∑_{m=1}^{M} 2^{−ℓ_m} ≤ 1.

(This part can also be proven by stating the fact that a prefix code is a uniquely
decodable code. The objective of adding this proof is to illustrate the characteristics
of a tree-like prefix code.)
Converse part: Kraft inequality implies the existence of a prefix code.
Suppose that ℓ_1, ℓ_2, . . . , ℓ_M satisfy the Kraft inequality. We will show that there
exists a binary tree with M selected nodes where the ith node resides on level ℓ_i.
Let n_i be the number of nodes (among the M nodes) residing on level i (namely,
n_i is the number of codewords with length i, or n_i = |{m : ℓ_m = i}|), and let

ℓ_max := max_{1≤m≤M} ℓ_m.

Then, from the Kraft inequality, we have

n_1 2^{−1} + n_2 2^{−2} + · · · + n_{ℓ_max} 2^{−ℓ_max} ≤ 1.

The above inequality can be rewritten in a form that is more suitable for this proof
as follows:

n_1 2^{−1} ≤ 1
n_1 2^{−1} + n_2 2^{−2} ≤ 1
⋮
n_1 2^{−1} + n_2 2^{−2} + · · · + n_{ℓ_max} 2^{−ℓ_max} ≤ 1.

Hence,

n_1 ≤ 2
n_2 ≤ 2^2 − n_1 2^1
⋮
n_{ℓ_max} ≤ 2^{ℓ_max} − n_1 2^{ℓ_max−1} − · · · − n_{ℓ_max−1} 2^1,

which can be interpreted in terms of a tree model as follows: the first inequality
says that the number of codewords of length 1 is less than the available number of
nodes on the first level, which is 2. The second inequality says that the number of
codewords of length 2 is less than the total number of nodes on the second level,
which is 22 , minus the number of nodes obstructed by the first-level nodes already
occupied by codewords. The succeeding inequalities demonstrate the availability of
a sufficient number of nodes at each level after the nodes blocked by shorter length
codewords have been removed. Because this is true at every codeword length up to
the maximum codeword length, the assertion of the theorem is proved. 
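The converse argument is constructive and translates directly into code. The sketch below (our addition) implements the level-by-level assignment from the proof: lengths are processed in nondecreasing order and the ith codeword is the binary expansion of a running counter padded to ℓ_i bits, which yields a prefix code whenever the Kraft inequality holds.

def prefix_code_from_lengths(lengths):
    """Construct a binary prefix code with the given codeword lengths
    (possible iff sum 2^-l <= 1); codewords are returned in the input order."""
    assert sum(2.0 ** (-l) for l in lengths) <= 1.0 + 1e-12, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        value <<= (lengths[i] - prev_len)            # move down to the current tree level
        codewords[i] = format(value, '0%db' % lengths[i])
        value += 1                                   # next free node on this level
        prev_len = lengths[i]
    return codewords

print(prefix_code_from_lengths([2, 2, 2, 3, 4, 4]))
# ['00', '01', '10', '110', '1110', '1111'] -- the codeword tree of Fig. 3.5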
Theorems 3.21 and 3.25 unveil the following relation between a variable-length
uniquely decodable code and a prefix code.
Corollary 3.26 A uniquely decodable D-ary nth-order code can always be replaced
by a D-ary nth-order prefix code with the same average codeword length (and hence
the same average code rate).
The following theorem interprets the relationship between the average code rate
of a prefix code and the source entropy.
Theorem 3.27 Consider a discrete memoryless source {X_n}_{n=1}^∞.

1. For any D-ary nth-order prefix code for the source, the average code rate is no
less than the source entropy H_D(X).
2. There must exist a D-ary nth-order prefix code for the source whose average
code rate is no greater than H_D(X) + 1/n, namely,

R̄_n := (1/n) ∑_{x^n∈X^n} P_{X^n}(x^n) ℓ(c_{x^n}) ≤ H_D(X) + 1/n,    (3.3.1)

where c_{x^n} is the codeword for sourceword x^n, and ℓ(c_{x^n}) is the length of codeword
c_{x^n}.
Proof A prefix code is uniquely decodable, and hence it directly follows from
Theorem 3.22 that its average code rate is no less than the source entropy.
To prove the second part, we can design a prefix code satisfying both (3.3.1) and
the Kraft inequality, which immediately implies the existence of the desired code by
Theorem 3.25. Choose the codeword length for sourceword x n as

ℓ(c_{x^n}) = ⌊− log_D P_{X^n}(x^n)⌋ + 1.    (3.3.2)



Then

D^{−ℓ(c_{x^n})} ≤ P_{X^n}(x^n).

Summing both sides over all source symbols, we obtain

∑_{x^n∈X^n} D^{−ℓ(c_{x^n})} ≤ 1,

which is exactly the Kraft inequality. On the other hand, (3.3.2) implies

ℓ(c_{x^n}) ≤ − log_D P_{X^n}(x^n) + 1,

which in turn implies


 

∑_{x^n∈X^n} P_{X^n}(x^n) ℓ(c_{x^n}) ≤ − ∑_{x^n∈X^n} P_{X^n}(x^n) log_D P_{X^n}(x^n) + ∑_{x^n∈X^n} P_{X^n}(x^n)
                                 = H_D(X^n) + 1 = n H_D(X) + 1,

where the last equality holds since the source is memoryless. 


We note that nth-order prefix codes (which encode sourcewords of length n) for
memoryless sources can yield an average code rate arbitrarily close to the source
entropy when allowing n to grow without bound. For example, a memoryless source
with alphabet
{A, B, C}

and probability distribution

PX (A) = 0.8, PX (B) = PX (C) = 0.1

has an entropy equal to

−0.8 · log2 0.8 − 0.1 · log2 0.1 − 0.1 · log2 0.1 = 0.92 bits.

One optimal binary first-order or single-letter encoding (with n = 1) prefix code


for this source is given by c(A) = 0, c(B) = 10 and c(C) = 11, where c(·) is the
encoding function. Then, the resultant average code rate for this code is

0.8 × 1 + 0.2 × 2 = 1.2 bits ≥ 0.92 bits.

Now if we consider a second-order (with n = 2) prefix code by encoding two


consecutive source symbols at a time, the new source alphabet becomes

{A A, AB, AC, B A, B B, BC, C A, C B, CC},



and the resultant probability distribution is calculated by

(∀ x1 , x2 ∈ {A, B, C}) PX 2 (x1 , x2 ) = PX (x1 )PX (x2 )

as the source is memoryless. Then, an optimal binary prefix code for the source is
given by

c(A A) = 0
c(AB) = 100
c(AC) = 101
c(B A) = 110
c(B B) = 111100
c(BC) = 111101
c(C A) = 1110
c(C B) = 111110
c(CC) = 111111.

The average code rate of this code now becomes

[0.64 × (1 × 1) + 0.08 × (3 × 3 + 4 × 1) + 0.01 × (6 × 4)] / 2 = 0.96 bits,

which is closer to the source entropy of 0.92 bits. As n increases, the average code
rate will be brought closer to the source entropy.
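This computation is easy to verify mechanically. The following added sketch recomputes the average code rate of the above second-order code, compares it with the source entropy, and checks the prefix-free property.

import math

pX = {'A': 0.8, 'B': 0.1, 'C': 0.1}
code = {'AA': '0', 'AB': '100', 'AC': '101', 'BA': '110', 'BB': '111100',
        'BC': '111101', 'CA': '1110', 'CB': '111110', 'CC': '111111'}

H = -sum(p * math.log2(p) for p in pX.values())                  # about 0.92 bits
avg_len = sum(pX[a] * pX[b] * len(code[a + b]) for a in pX for b in pX)
print(round(H, 4), avg_len / 2)                                  # 0.9219, 0.96 bits/source symbol

# Prefix check: no codeword is a prefix of another
cw = list(code.values())
print(all(not c2.startswith(c1) for c1 in cw for c2 in cw if c1 != c2))   # True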
From Theorems 3.22 and 3.27, we obtain Shannon’s lossless variable-length
source coding theorem for discrete memoryless sources.

Theorem 3.28 (Shannon’s lossless variable-length source coding theorem: DMS)


Fix integer D > 1 and consider a DMS {X n }∞ n=1 with distribution PX and entropy
H D (X ) (measured in D-ary units). Then the following hold.
• Forward part (achievability): For any ε > 0, there exists a D-ary nth-order prefix
(hence uniquely decodable) code

f : X n → {0, 1, . . . , D − 1}∗

for the source with an average code rate R n satisfying

R n < H D (X ) + ε

for n sufficiently large.


• Converse part: Every uniquely decodable code

f : X n → {0, 1, . . . , D − 1}∗

for the source has an average code rate R n ≥ H D (X ).


Thus, for a discrete memoryless source, its entropy H D (X ) (measured in
D-ary units) represents the smallest variable-length lossless compression rate for
n sufficiently large.
Proof The forward part follows directly from Theorem 3.27 by choosing n
large enough such that 1/n < ε, and the converse part is already given by
Theorem 3.22. 
Observation 3.29 (Shannon’s lossless variable-length coding theorem: stationary
sources) Theorem 3.28 actually also holds for the class of stationary sources by
replacing the source entropy H D (X ) with the source entropy rate

H_D(X) := lim_{n→∞} (1/n) H_D(X^n),
measured in D-ary units. The proof is very similar to the proofs of Theorems 3.22
and 3.27 with slight modifications (such as using the fact that (1/n) H_D(X^n) is
nonincreasing with n for stationary sources).

Observation 3.30 (Rényi’s entropy and lossless data compression) In the lossless
variable-length source coding theorem, we have chosen the criterion of minimizing
the average codeword length. Implicit in the use of average codeword length as a
performance criterion is the assumption that the cost of compression varies linearly
with codeword length. This is not always the case as in some applications, where the
processing cost of decoding may be elevated and buffer overflows caused by long
codewords can cause problems, an exponential cost/penalty function for codeword
lengths can be more appropriate than a linear cost function [54, 67, 206]. Naturally,
one would desire to choose a generalized function with exponential costs such that
the familiar linear cost function (given by the average codeword length) is a special
limiting case.
Indeed in [67], given a D-ary nth-order VLC C

f : X n → {0, 1, . . . , D − 1}∗

for a discrete source {X_n}_{n=1}^∞ with alphabet X and distribution P_{X^n}, Campbell
considers the following exponential cost function, called the average codeword length
of order t:

L_n(t) := (1/t) log_D [ ∑_{x^n∈X^n} P_{X^n}(x^n) D^{t·ℓ(c_{x^n})} ],

where t is a chosen positive constant, c_{x^n} = f(x^n) is the codeword associated with
sourceword x^n, and ℓ(·) is the length of c_{x^n}. Similarly, (1/n) L_n(t) denotes the average
code rate of order t. The criterion for optimality now becomes that an nth-order
code is said to be optimal if its cost L_n(t) is the smallest among all possible uniquely
decodable codes.

In the limiting case when t → 0,

L_n(t) → Σ_{x^n ∈ X^n} P_{X^n}(x^n) ℓ(c_{x^n}),

and we recover the average codeword length, as desired. Also, when t → ∞,
L_n(t) → max_{x^n ∈ X^n} ℓ(c_{x^n}), which is the maximum codeword length over all codewords
in C. Note that for a fixed t > 0, minimizing L_n(t) is equivalent to minimizing
Σ_{x^n ∈ X^n} P_{X^n}(x^n) D^{t·ℓ(c_{x^n})}. In this sum, the weight for codeword c_{x^n} is D^{t·ℓ(c_{x^n})}, and
hence smaller codeword lengths are favored.
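To make the role of the parameter t concrete, the following is a minimal numerical sketch in Python (using a hypothetical first-order binary prefix code and source distribution that are not taken from the text) of the cost above for n = 1: as t → 0 it approaches the ordinary average codeword length, while for large t it approaches the maximum codeword length.

```python
import math

def campbell_length(probs, lengths, t, D=2):
    """Average codeword length of order t for a first-order (n = 1) code:
    L(t) = (1/t) * log_D( sum_x P(x) * D**(t * l(c_x)) )."""
    s = sum(p * D ** (t * l) for p, l in zip(probs, lengths))
    return math.log(s, D) / t

# Hypothetical binary prefix code: symbol probabilities and codeword lengths.
probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

avg_len = sum(p * l for p, l in zip(probs, lengths))
print(avg_len)                                    # 1.75 bits (ordinary linear cost)
print(campbell_length(probs, lengths, t=1e-6))    # ~1.75: recovers the average length as t -> 0
print(campbell_length(probs, lengths, t=50.0))    # ~3:    approaches the maximum codeword length
```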
For this source coding setup with an exponential cost function, Campbell estab-
lished in [67] an operational characterization for Rényi’s entropy by proving the
following lossless variable-length coding theorem for memoryless sources.
Theorem 3.31 (Lossless source coding theorem under exponential costs) Consider
a DMS {X_n} with Rényi entropy in D-ary units and of order α given by

H_α(X) = (1/(1 − α)) log_D ( Σ_{x ∈ X} P_X(x)^α ).

Fixing t > 0 and setting α = 1/(1 + t), the following hold.
• For any ε > 0, there exists a D-ary nth-order uniquely decodable code f : X^n →
{0, 1, . . . , D − 1}^* for the source with an average code rate of order t satisfying

(1/n) L_n(t) ≤ H_α(X) + ε

for n sufficiently large.
• Conversely, it is not possible to find a uniquely decodable code whose average
code rate of order t is less than H_α(X).
Noting (by Lemma 2.52) that the Rényi entropy of order α reduces to the Shannon
entropy (in D-ary units) as α → 1, the above theorem reduces to Theorem 3.28 as
α → 1 (or equivalently, as t → 0). Finally, in [309, Sect. 4.4], [310, 311], the above
source coding theorem is extended to time-invariant Markov sources in terms of the
Rényi entropy rate, lim_{n→∞} (1/n) H_α(X^n) with α = (1 + t)^{−1}, which exists and can be
calculated in closed form for such sources.

3.3.3 Examples of Binary Prefix Codes

(A) Huffman Codes: Optimal Variable-Length Codes


Given a discrete source with alphabet X , we next construct an optimal binary first-
order (single-letter) uniquely decodable variable-length code

f : X → {0, 1}∗ ,

where optimality is in the sense that the code’s average codeword length (or equiv-
alently, its average code rate) is minimized over the class of all binary uniquely
decodable codes for the source. Note that finding optimal nth-order codes with n > 1
follows directly by considering X n as a new source with expanded alphabet (i.e., by
mapping n source symbols at a time).
By Corollary 3.26, we remark that in our search for optimal uniquely decodable
codes, we can restrict our attention to the (smaller) class of optimal prefix codes.
We thus proceed by observing the following necessary conditions of optimality for
binary prefix codes.

Lemma 3.32 Let C be an optimal binary prefix code with codeword lengths ℓ_i, i =
1, . . . , M, for a source with alphabet X = {a_1, . . . , a_M} and symbol probabilities
p_1, . . . , p_M. We assume, without loss of generality, that

p_1 ≥ p_2 ≥ p_3 ≥ · · · ≥ p_M,

and that any group of source symbols with identical probability is listed in order of
increasing codeword length (i.e., if p_i = p_{i+1} = · · · = p_{i+s}, then ℓ_i ≤ ℓ_{i+1} ≤ · · · ≤
ℓ_{i+s}). Then the following properties hold.
1. Higher probability source symbols have shorter codewords: p_i > p_j implies
ℓ_i ≤ ℓ_j, for i, j = 1, . . . , M.
2. The two least probable source symbols have codewords of equal length:
ℓ_{M−1} = ℓ_M.
3. Among the codewords of length ℓ_M, two of the codewords are identical except
in the last digit.

Proof
(1) If p_i > p_j and ℓ_i > ℓ_j, then it is possible to construct a better code C′ by inter-
changing (“swapping”) codewords i and j of C, since

ℓ̄(C′) − ℓ̄(C) = p_i ℓ_j + p_j ℓ_i − (p_i ℓ_i + p_j ℓ_j) = (p_i − p_j)(ℓ_j − ℓ_i) < 0.

Hence, code C′ is better than code C, contradicting the fact that C is optimal.
(2) We first know that ℓ_{M−1} ≤ ℓ_M, since
• if p_{M−1} > p_M, then ℓ_{M−1} ≤ ℓ_M by result 1 above;
• if p_{M−1} = p_M, then ℓ_{M−1} ≤ ℓ_M by our assumption about the ordering of
codewords for source symbols with identical probability.
Now, if ℓ_{M−1} < ℓ_M, we may delete the last digit of codeword M, and the deletion
cannot result in another codeword since C is a prefix code. Thus, the deletion
forms a new prefix code with a better average codeword length than C, contra-
dicting the fact that C is optimal. Hence, we must have that ℓ_{M−1} = ℓ_M.
(3) Among the codewords of length ℓ_M, if no two codewords agree in all digits
except the last, then we may delete the last digit in all such codewords to obtain
a better code.
The above observation suggests that if we can construct an optimal code for
the entire source except for its two least likely symbols, then we can construct an
optimal overall code. Indeed, the following lemma due to Huffman [195] follows
from Lemma 3.32.
Lemma 3.33 (Huffman) Consider a source with alphabet X = {a_1, . . . , a_M} and
symbol probabilities p_1, . . . , p_M such that p_1 ≥ p_2 ≥ · · · ≥ p_M. Consider the
reduced source alphabet Y = {a_1, . . . , a_{M−2}, a_{M−1,M}} obtained from X, where the
first M − 2 symbols of Y are identical to those in X and symbol a_{M−1,M} has proba-
bility p_{M−1} + p_M and is obtained by combining the two least likely source symbols
a_{M−1} and a_M of X. Suppose that C′, given by f′ : Y → {0, 1}^*, is an optimal prefix
code for the reduced source Y. We now construct a prefix code C, f : X → {0, 1}^*,
for the original source X as follows:
• The codewords for symbols a_1, a_2, . . . , a_{M−2} are exactly the same as the corre-
sponding codewords in C′:

f(a_1) = f′(a_1), f(a_2) = f′(a_2), . . . , f(a_{M−2}) = f′(a_{M−2}).

• The codewords associated with symbols a_{M−1} and a_M are formed by appending
a “0” and a “1”, respectively, to the codeword f′(a_{M−1,M}) associated with the
letter a_{M−1,M} in C′:

f(a_{M−1}) = [f′(a_{M−1,M}) 0] and f(a_M) = [f′(a_{M−1,M}) 1].

Then, code C is optimal for the original source X.


Hence, the problem of finding the optimal code for a source of alphabet size
M is reduced to the problem of finding an optimal code for the reduced source of
alphabet size M − 1. In turn, we can reduce the problem to that of size M − 2 and so
on. Indeed, the above lemma yields a recursive algorithm for constructing optimal
binary prefix codes.
Huffman encoding algorithm: Repeatedly apply the above lemma until one is left
with a reduced source with two symbols. An optimal binary prefix code for this
source consists of the codewords 0 and 1. Then proceed backward, constructing (as
outlined in the above lemma) optimal codes for each reduced source until one arrives
at the original source.
Example 3.34 Consider a source with alphabet

X = {1, 2, 3, 4, 5, 6}

Fig. 3.6 Example of the Huffman encoding

and symbol probabilities 0.25, 0.25, 0.25, 0.1, 0.1, and 0.05, respectively. By follow-
ing the Huffman encoding procedure as shown in Fig. 3.6, we obtain the Huffman
code as
00, 01, 10, 110, 1110, 1111.
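As an illustration, the following is a minimal Python sketch of the Huffman procedure based on Lemma 3.33 (using the standard library’s heapq; this is a sketch, not the text’s construction verbatim). The exact codewords depend on how ties are resolved, but the resulting average code rate is 2.4 bits/symbol for the probabilities of Example 3.34, matching the code shown above.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a binary Huffman code; returns a dict mapping symbol -> codeword string."""
    tiebreak = count()  # avoids comparing dicts when two node probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # two least likely nodes
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}   # prepend branch bits
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.1, 5: 0.1, 6: 0.05}
code = huffman_code(probs)
avg = sum(probs[s] * len(c) for s, c in code.items())
print(code)   # an optimal prefix code; exact codewords depend on tie-breaking
print(avg)    # 2.4 bits/symbol, as for the code of Example 3.34
```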

Observation 3.35

• Huffman codes are not unique for a given source distribution; e.g., by inverting all
the code bits of a Huffman code, one gets another Huffman code, or by resolving
ties in different ways in the Huffman algorithm, one also obtains different Huffman
codes (but all of these codes have the same minimal R n ).
• One can obtain optimal codes that are not Huffman codes; e.g., by interchanging
two codewords of the same length of a Huffman code, one can get another non-
Huffman (but optimal) code. Furthermore, one can construct an optimal suffix code
(i.e., a code in which no codeword can be a suffix of another codeword) from a
Huffman code (which is a prefix code) by reversing the Huffman codewords.
• Binary Huffman codes always satisfy the Kraft inequality with equality (their code
tree is “saturated”); e.g., see [87, p. 72].
• Any nth-order binary Huffman code f : X^n → {0, 1}^* for a stationary source
{X_n}_{n=1}^∞ with finite alphabet X satisfies

H(X) ≤ (1/n) H(X^n) ≤ R̄_n < (1/n) H(X^n) + 1/n.

Thus, as n increases to infinity, R̄_n → H(X); but while the encoding–decoding
delay increases only linearly with n, the storage complexity grows exponentially
with n.
• Note that nonbinary (i.e., for D > 2) Huffman codes can also be constructed in
a mostly similar way as for the case of binary Huffman codes by designing a
D-ary tree and iteratively applying Lemma 3.33, where now the D least likely
source symbols are combined at each stage. The only difference from the case of
binary Huffman codes is that we have to ensure that we are ultimately left with D
symbols at the last stage of the algorithm to guarantee the code’s optimality. This
is remedied by expanding the original source alphabet X by adding “dummy”
symbols (each with zero probability) so that the alphabet size of the expanded
source X′ is the smallest positive integer greater than or equal to |X| with

|X′| ≡ 1 (modulo D − 1).

For example, if |X| = 6 and D = 3 (ternary codes), we obtain that |X′| = 7,
meaning that we need to enlarge the original source X by adding one dummy
(zero probability) source symbol.
We thus obtain that the necessary conditions for optimality of Lemma 3.32 also
hold for D-ary prefix codes when replacing X with the expanded source X′ and
replacing “two” with “D” in the statement of the lemma. The resulting D-ary
Huffman code will be an optimal code for the original source X (e.g., see [135,
Chap. 3] and [266, Chap. 11]).
• Generalized Huffman codes under exponential costs: When the lossless compres-
sion problem allows for exponential costs, as discussed in Observation 3.30 and
formalized in Theorem 3.31, a straightforward generalization of Huffman’s algo-
rithm, which minimizes the average code rate of order t, (1/n) L_n(t), can be obtained
[192, Theorem 1]. More specifically, while in Huffman’s algorithm each new
node (for a combined or equivalent symbol) is assigned weight p_i + p_j, where
p_i and p_j are the lowest weights (probabilities) among the available nodes, in the
generalized algorithm each new node is assigned weight 2^t (p_i + p_j). With this
simple modification, one can directly construct such generalized Huffman codes
(e.g., see [310] for examples of codes designed for Markov sources); a small sketch
of the modified combining step is given after this list.
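Here is a minimal Python sketch of that modified combining step (binary case, D = 2), computing only the codeword lengths and using a hypothetical four-symbol source that is not taken from the text. For t = 0 the construction reduces to the ordinary Huffman algorithm, while a larger t produces a more balanced tree (shorter maximum codeword length).

```python
import heapq
from itertools import count

def generalized_huffman_lengths(probs, t):
    """Codeword lengths from the generalized (exponential-cost) Huffman combine step:
    the two least-weight nodes are merged into a node of weight 2**t * (w_i + w_j)."""
    uid = count()
    # each heap entry: (weight, tiebreak, list of symbols in the subtree)
    heap = [(w, next(uid), [s]) for s, w in probs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depth[s] += 1          # every merge adds one bit to these codewords
        heapq.heappush(heap, (2 ** t * (w1 + w2), next(uid), s1 + s2))
    return depth

probs = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.1}   # hypothetical source
print(generalized_huffman_lengths(probs, t=0.0))   # ordinary Huffman lengths: a:1, b:2, c:3, d:3
print(generalized_huffman_lengths(probs, t=2.0))   # more balanced: all lengths equal to 2
```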
(B) Shannon–Fano–Elias Code
Assume X = {1, . . . , M} and P_X(x) > 0 for all x ∈ X. Define

F(x) := Σ_{a ≤ x} P_X(a)

and

F̄(x) := Σ_{a < x} P_X(a) + (1/2) P_X(x).

Encoder: For any x ∈ X, express F̄(x) in binary fractional form, say

F̄(x) = .c_1 c_2 . . . c_k . . . ,

and take the first k (fractional) bits as the codeword of source symbol x, i.e.,

(c_1, c_2, . . . , c_k),

where k := ⌈log_2(1/P_X(x))⌉ + 1.


Decoder: Given codeword (c1 , . . . , ck ), compute the cumulative sum of F(·) starting
from the smallest element in {1, 2, . . . , M} until the first x satisfying

F(x) ≥ .c1 . . . ck .

Then, x should be the original source symbol.
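The following is a minimal Python sketch of the encoder and decoder just described, using a hypothetical dyadic distribution (not taken from the text) so that the binary expansions involved are exact.

```python
import math

def sfe_encode(pmf, x):
    """Shannon-Fano-Elias codeword for symbol x, with pmf over symbols 1..M."""
    F_bar = sum(pmf[a] for a in pmf if a < x) + pmf[x] / 2.0
    k = math.ceil(math.log2(1.0 / pmf[x])) + 1
    bits = ""
    for _ in range(k):                # take the first k fractional bits of F_bar
        F_bar *= 2
        bit = int(F_bar)
        bits += str(bit)
        F_bar -= bit
    return bits

def sfe_decode(pmf, bits):
    """Return the first x satisfying F(x) >= .c1...ck."""
    value = sum(int(b) * 2.0 ** (-(i + 1)) for i, b in enumerate(bits))
    F = 0.0
    for x in sorted(pmf):
        F += pmf[x]
        if F >= value:
            return x

pmf = {1: 0.25, 2: 0.5, 3: 0.125, 4: 0.125}   # hypothetical distribution
for x in pmf:
    cw = sfe_encode(pmf, x)
    assert sfe_decode(pmf, cw) == x
    print(x, cw)
```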


Proof of decodability: For any number a ∈ [0, 1], let [a]_k denote the operation that
chops the binary representation of a after k bits (i.e., removing the (k + 1)th bit, the
(k + 2)th bit, etc.). Then

F̄(x) − [F̄(x)]_k < 1/2^k.

Since k = ⌈log_2(1/P_X(x))⌉ + 1,

1/2^k ≤ (1/2) P_X(x)
      = ( Σ_{a < x} P_X(a) + P_X(x)/2 ) − Σ_{a ≤ x−1} P_X(a)
      = F̄(x) − F(x − 1).

Hence,

F(x − 1) = F(x − 1) + 1/2^k − 1/2^k ≤ F̄(x) − 1/2^k < [F̄(x)]_k.

In addition,

F(x) > F̄(x) ≥ [F̄(x)]_k.

Consequently, x is the first element satisfying

F(x) ≥ .c_1 c_2 . . . c_k.


Average codeword length:

ℓ̄ = Σ_{x ∈ X} P_X(x) ( ⌈log_2(1/P_X(x))⌉ + 1 )
  < Σ_{x ∈ X} P_X(x) ( log_2(1/P_X(x)) + 2 )
  = H(X) + 2 bits.

Observation 3.36 The Shannon–Fano–Elias code is a prefix code.

3.3.4 Examples of Universal Lossless Variable-Length Codes

In Sect. 3.3.3, we assumed that the source distribution is known, so that we can use either
Huffman codes or Shannon–Fano–Elias codes to compress the source. What if the
source distribution is not known a priori? Is it still possible to establish a completely
lossless data compression code which is universally good (or asymptotically optimal)
for all sources of interest? The answer is affirmative. Examples of such universal
codes are adaptive Huffman codes [136], arithmetic codes [242, 243, 322] (which
are based on the Shannon–Fano–Elias code), and Lempel–Ziv codes [404, 430, 431],
which are efficiently employed in various forms in many multimedia compression
packages and standards. We herein give a brief and basic description of adaptive
Huffman and Lempel–Ziv codes.
(A) Adaptive Huffman Codes
A straightforward universal coding scheme is to use the empirical distribution (or
relative frequencies) as the true distribution, and then apply the optimal Huffman code
according to the empirical distribution. If the source is i.i.d., the relative frequencies
will converge to the true marginal probabilities. Therefore, such universal codes should
be good for all i.i.d. sources. However, in order to get an accurate estimation of the
true distribution, one must observe a sufficiently long source sequence under which
the coder will suffer a long delay. This can be improved using adaptive universal
Huffman codes [136].
The working procedure of an adaptive Huffman code is as follows. Start with an
initial guess of the source distribution (based on the assumption that the source is
DMS). As a new source symbol arrives, encode the data in terms of the Huffman
coding scheme according to the current estimated distribution, and then update the
estimated distribution and the Huffman codebook according to the newly arrived
source symbol.
To be specific, let the source alphabet be X := {a_1, . . . , a_M}. Define

N(a_i | x^n) := number of occurrences of a_i in x_1, x_2, . . . , x_n.

Then, the (current) relative frequency of a_i is N(a_i | x^n)/n. Let c_n(a_i) denote the
Huffman codeword of source symbol a_i with respect to the distribution

( N(a_1 | x^n)/n , N(a_2 | x^n)/n , . . . , N(a_M | x^n)/n ).

Now suppose that x_{n+1} = a_j. The codeword c_n(a_j) is set as output, and the relative
frequency for each source outcome becomes

N(a_j | x^{n+1}) / (n + 1) = [ n · (N(a_j | x^n)/n) + 1 ] / (n + 1)

and

N(a_i | x^{n+1}) / (n + 1) = [ n · (N(a_i | x^n)/n) ] / (n + 1)   for i ≠ j.

This observation results in the following distribution update policy:

P_{X̂}^{(n+1)}(a_j) = [ n P_{X̂}^{(n)}(a_j) + 1 ] / (n + 1)

and

P_{X̂}^{(n+1)}(a_i) = [ n / (n + 1) ] P_{X̂}^{(n)}(a_i)   for i ≠ j,

where P_{X̂}^{(n+1)} represents the estimate of the true distribution P_X at time (n + 1).
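A minimal Python sketch of this update policy is given below (the maintenance of the Huffman codebook itself is omitted); the numbers anticipate the example built around Fig. 3.7 further down, where the estimate at n = 16 is updated after observing a_3.

```python
def update_estimate(p_hat, n, new_symbol):
    """One step of the distribution update used by adaptive Huffman coding:
    P^(n+1)(a_j) = (n*P^(n)(a_j) + 1)/(n+1), and P^(n+1)(a_i) = n/(n+1)*P^(n)(a_i) for i != j."""
    updated = {a: n * p / (n + 1) for a, p in p_hat.items()}
    updated[new_symbol] += 1.0 / (n + 1)
    return updated

# Estimate at n = 16 (as in Fig. 3.7), then observe a3:
p16 = {"a1": 3/8, "a2": 1/4, "a3": 1/8, "a4": 1/8, "a5": 1/16, "a6": 1/16}
p17 = update_estimate(p16, 16, "a3")
print(p17)   # a1: 6/17, a2: 4/17, a3: 3/17, a4: 2/17, a5: 1/17, a6: 1/17
```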
Note that in the adaptive Huffman coding scheme, the encoder and decoder need
not be redesigned at every time, but only when a sufficient change in the estimated
distribution occurs such that the so-called sibling property is violated.

Definition 3.37 (Sibling property) A binary prefix code is said to have the sibling
property if its code tree satisfies
1. every node in the code tree (except for the root node) has a sibling (i.e., the code
tree is saturated), and
2. the nodes can be listed in nondecreasing order of probabilities with each node
being adjacent to its sibling.

The next observation indicates that the Huffman code is the only prefix
code satisfying the sibling property.

Observation 3.38 A binary prefix code is a Huffman code iff it satisfies the sibling
property.

An example for a code tree satisfying the sibling property is shown in Fig. 3.7.
The first requirement is satisfied since the tree is saturated. The second requirement
can be checked by the node list in Fig. 3.7.
If the next observation (say at time n = 17) is a3 , then its codeword 100 is set as
output (using the Huffman code corresponding to PX̂(16) ). The estimated distribution
is updated as follows:

Fig. 3.7 Example of the sibling property based on the code tree from P_{X̂}^{(16)}. The arguments inside
the parentheses following a_j respectively indicate the codeword and the probability associated with
a_j. Here, “b” is used to denote the internal nodes of the tree with the assigned (partial) code as its
subscript. The number in the parentheses following b is the probability sum of all its children

P_{X̂}^{(17)}(a_1) = (16 × (3/8)) / 17 = 6/17,    P_{X̂}^{(17)}(a_2) = (16 × (1/4)) / 17 = 4/17,
P_{X̂}^{(17)}(a_3) = (16 × (1/8) + 1) / 17 = 3/17,    P_{X̂}^{(17)}(a_4) = (16 × (1/8)) / 17 = 2/17,
P_{X̂}^{(17)}(a_5) = (16 × (1/16)) / 17 = 1/17,    P_{X̂}^{(17)}(a_6) = (16 × (1/16)) / 17 = 1/17.
The sibling property is then violated (cf. Fig. 3.8). Hence, the codebook needs to be
updated according to the new estimated distribution, and the observation at n = 18
will be encoded using the new codebook in Fig. 3.9. Details about adaptive Huffman
codes can be found in [136].

(B) Lempel–Ziv Codes

We now introduce a well-known and high-performing universal coding scheme,


which is named after its inventors, Lempel and Ziv [430, 431] (it is also known as
the Lempel–Ziv–Welch compression algorithm after Welch developed an efficient

Fig. 3.8 (Continuation of Fig. 3.7) Example of violation of the sibling property after observing a
new symbol a_3 at n = 17. Note that node a_1 is not adjacent to its sibling a_2

Fig. 3.9 (Continuation of Fig. 3.8) Updated Huffman code. The sibling property holds now for the
new code

version of the original Lempel–Ziv technique [404]). These codes, unlike Huffman
and Shannon–Fano–Elias codes, map variable-length sourcewords (as opposed to
fixed-length sourcewords) onto codewords.
Suppose the source alphabet is binary. Then, the Lempel–Ziv encoder can be
described as follows.
Encoder:
1. Parse the input sequence into strings that have never appeared before. For exam-
ple, if the input sequence is 1011010100010 . . ., the algorithm first grabs the
first letter 1 and finds that it has never appeared before. So 1 is the first string.
Then, the algorithm scoops the second letter 0 and also determines that it has not
appeared before, and hence, put it to be the next string. The algorithm moves on
to the next letter 1 and finds that this string has appeared. Hence, it hits another
letter 1 and yields a new string 11, and so on. Under this procedure, the source
sequence is parsed into the strings

1, 0, 11, 01, 010, 00, 10.

2. Let L be the number of distinct strings of the parsed source. Then, we need
⌊log_2 L⌋ + 1 bits to index these strings (starting from one). In the above example,
the indices are

parsed source : 1    0    11   01   010  00   10
index         : 001  010  011  100  101  110  111

The codeword of each string is then the index of its prefix concatenated with the
last bit in its source string. For example, the codeword of source string 010 will
be the index of 01, i.e., 100, concatenated with the last bit of the source string,
i.e., 0. Through this procedure, encoding the above-parsed strings (here with
⌊log_2 L⌋ + 1 = 3 index bits) yields the codeword sequence

(000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0)

or equivalently,

0001000000110101100001000010.

Note that the conventional Lempel–Ziv encoder requires two passes: the first pass
to decide L, and the second pass to generate the codewords. The algorithm, however,
can be modified so that it requires only one pass over the entire source string. Also,
note that the above algorithm uses an equal number of bits (⌊log_2 L⌋ + 1) for all the
location indices, which can also be relaxed by proper modification.
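The following is a minimal Python sketch of the above parsing and encoding procedure (two passes, fixed ⌊log₂ L⌋ + 1-bit index fields); leftover source bits that do not complete a new string are simply dropped in this sketch. Running it on the example sequence reproduces the codeword stream shown above.

```python
def lz_encode(bits):
    """Parse a binary string into never-seen-before strings and emit, for each,
    the index of its prefix followed by its last bit."""
    phrases = {}              # parsed string -> index (1-based)
    output = []               # list of (prefix_index, last_bit)
    current = ""
    for b in bits:
        if current + b in phrases:
            current += b      # keep extending until the string is new
        else:
            prefix_index = phrases.get(current, 0)     # 0 denotes the empty prefix
            output.append((prefix_index, b))
            phrases[current + b] = len(phrases) + 1
            current = ""
    L = len(phrases)
    index_bits = max(1, L.bit_length())   # floor(log2(L)) + 1 bits per index
    return "".join(format(i, f"0{index_bits}b") + b for i, b in output)

print(lz_encode("1011010100010"))
# -> 0001000000110101100001000010, the codeword sequence of the example above
```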

Decoder: The decoding is straightforward from the encoding procedure.

Theorem 3.39 The above algorithm asymptotically achieves the entropy rate of any
stationary ergodic source (with unknown statistics).

Proof Refer to [83, Sect. 13.5]. 

Problems

1. A binary discrete memoryless source {X_n}_{n=1}^∞ has distribution P_X(1) = 0.005. A
binary codeword is provided for every sequence of 100 source digits containing
three or fewer ones. In other words, the set of sourcewords of length 100 that
are encoded to distinct block codewords is

A := {x^100 ∈ {0, 1}^100 : number of 1s in x^100 ≤ 3}.

(a) Show that A is indeed a typical set F100 (0.2) defined using the base-2 log-
arithm.
(b) Find the minimum codeword blocklength in bits for the block coding
scheme.
(c) Find the probability for sourcewords not in A.
(d) Use Chebyshev’s inequality to bound the probability of observing a source-
word outside A. Compare this bound with the actual probability computed
in part (c).
Hint: Let X_i represent the binary random digit at instance i, and let S_n =
X_1 + · · · + X_n. Note that Pr[S_100 ≥ 4] is equal to

Pr[ | (1/100) S_100 − 0.005 | ≥ 0.035 ].

2. Weak Converse to the Fixed-Length Source Coding Theorem: Recall (see Obser-
vation 3.3) that an (n, M) fixed-length source code for a discrete memory-
less source (DMS) {X n }∞ n=1 with finite alphabet X consists of an encoder
f : X n → {1, 2, . . . , M}, and a decoder g : {1, 2, . . . , M} → X n . The rate of
the code is
1
Rn := log2 M bits/source symbol,
n
and its probability of decoding error is

Pe = Pr[X n = X̂ n ],

where X̂ n = g( f (X n )).
(a) Show that any fixed-length source code (n, M) for a DMS satisfies

P_e ≥ (H(X) − R_n) / log_2 |X| − 1 / (n log_2 |X|),

where H (X ) is the source entropy.


Hint: Show that log2 M ≥ I (X n ; X̂ n ), and use Fano’s inequality.
(b) Deduce the (weak) converse to the fixed-length source coding theorem for
DMSs by proving that for any (n, M) source code with lim supn→∞ Rn <
H (X ), its Pe is bounded away from zero for n sufficiently large.
3. For a stationary source {X_n}_{n=1}^∞, show that for any integer n > 1,
(a) (1/n) H(X^n) ≤ (1/(n − 1)) H(X^{n−1});
(b) (1/n) H(X^n) ≥ H(X_n | X^{n−1}).
Hint: Use the chain rule for entropy and the fact that

H (X i |X i−1 , . . . , X 1 ) = H (X n |X n−1 , . . . , X n−i+1 )

for every i.
4. Randomized random walk: An ant walks randomly on a line of integers. At
time instance i, it may move forward with probability 1 − Z_{i−1}, or it may move
backward with probability Z_{i−1}, where {Z_i}_{i=0}^∞ are identically distributed ran-
dom variables with finite alphabet Z ⊂ [0, 1]. Let X_i be the number on which
dom variables with finite alphabet Z ⊂ [0, 1]. Let X i be the number on which
the ant stands at time instance i, and let X 0 = 0 (with probability one).
(a) Show that

H (X 1 , X 2 , . . . , X n |Z 0 , Z 1 , . . . , Z n−1 ) = n E[h b (Z )],

where h b (·) is the binary entropy function.


(b) If Pr[Z_0 = 0] = Pr[Z_0 = 1] = 1/2, determine H(X_1, X_2, . . . , X_n).
(c) Find the entropy rate of the process {X n }∞
n=1 in (b).

5. A source with binary alphabet X = {0, 1} emits a sequence of random variables
{X_n}_{n=1}^∞. Let {Z_n}_{n=1}^∞ be a binary independent and identically distributed (i.i.d.)
sequence of random variables such that Pr{Z_n = 1} = Pr{Z_n = 0}. We assume
that {X_n}_{n=1}^∞ is generated according to the equation

X n = X n−1 ⊕ X n−2 ⊕ Z n , n = 1, 2, . . .

where ⊕ denotes addition modulo-2, and X_0 = X_{−1} = 0. Find the entropy rate
of {X_n}_{n=1}^∞.
6. For each of the following codes, either prove unique decodability or give an
ambiguous concatenated sequence of codewords:
(a) {1, 0, 00}.
(b) {1, 01, 00}.
(c) {1, 10, 00}.
(d) {1, 10, 01}.
(e) {0, 01}.
(f) {00, 01, 10, 11}.

7. We know the fact that the average code rate of all nth-order uniquely decodable
codes for a DMS must be no less than the source entropy. But this is not nec-
essarily true for non-singular codes. Give an example of a non-singular code in
which the average code rate is less than the source entropy.
8. Under what condition does the average code rate of a uniquely decodable binary
first-order variable-length code for a DMS equal the source entropy?
Hint: See the discussion after Theorem 3.22.
9. Binary Markov Source: Consider the binary homogeneous Markov source
{X_n}_{n=1}^∞, X_n ∈ X = {0, 1}, with

Pr{X_{n+1} = j | X_n = i} = { ρ/(1 + δ)        if i = 0 and j = 1,
                            { (ρ + δ)/(1 + δ)  if i = 1 and j = 1,

where n ≥ 1, 0 ≤ ρ ≤ 1 and δ ≥ 0.
(a) Find the initial state distribution (Pr{X 1 = 0}, Pr{X 1 = 1}) required to make
the source {X n }∞n=1 stationary.
Assume in the next questions that the source is stationary.
(b) Find the entropy rate of {X n }∞ n=1 in terms of ρ and δ.
(c) For δ = 1 and ρ = 1/2, compute the source redundancies ρd , ρm , and ρt .
(d) Suppose that ρ = 1. Is {X n }∞ n=1 irreducible? What is the value of the entropy
rate in this case?
(e) For δ = 0, show that {X n }∞ n=1 is a discrete memoryless source and compute
its entropy rate in terms of ρ.
(f) If ρ = 1/2 and δ = 3/2, design first-, second-, and third-order binary Huff-
man codes for this source. Determine in each case the average code rate and
compare it to the entropy rate.
10. Polya contagion process of memory two: Consider the finite-memory Polya con-
tagion source presented in Example 3.17 with M = 2.
(a) Find the transition distribution of this binary Markov process and determine
its stationary distribution in terms of the source parameters.
(b) Find the source entropy rate.
11. Suppose random variables Z_1 and Z_2 are independent from each other and have
the same distribution as Z with

Pr[Z = e_1] = 0.4,  Pr[Z = e_2] = 0.3,  Pr[Z = e_3] = 0.2,  Pr[Z = e_4] = 0.1.

(a) Design a first-order binary Huffman code f : {e1 , e2 , e3 , e4 } → {0, 1}∗


for Z .

(b) Applying the Huffman code in (a) to Z 1 and Z 2 and concatenating f (Z 1 )


with f (Z 2 ) yields an overall codeword for the pair (Z 1 , Z 2 ) given by

f (Z 1 , Z 2 ) := ( f (Z 1 ), f (Z 2 )) = (U1 , U2 , . . . , Uk ),

where k ranges from 2 to 6, depending on the outcomes of Z 1 and Z 2 . Are


U1 and U2 independent? Justify your answer.
Hint: Examine Pr[U2 = 0|U1 = u 1 ] for different values of u 1 .
(c) Is the average code rate equal to the entropy given by

0.4 log_2(1/0.4) + 0.3 log_2(1/0.3) + 0.2 log_2(1/0.2) + 0.1 log_2(1/0.1) = 1.84644 bits/letter?
Justify your answer.
(d) Now if we apply the Huffman code in (a) sequentially to the i.i.d. sequence
Z 1 , Z 2 , Z 3 , . . . with the same marginal distribution as Z , and yield the output
U1 , U2 , U3 , . . ., can U1 , U2 , U3 , . . . be further compressed?
If your answer to this question is NO, prove the i.i.d. uniformity of
U1 , U2 , U3 , . . .. If your answer to this question is YES, then explain why
the optimal Huffman code does not give an i.i.d. uniform output.
Hint: Examine whether the average code rate can achieve the source entropy.

12. In the second part of Theorem 3.27, it is shown that there exists a D-ary prefix
code with

R̄_n = (1/n) Σ_{x ∈ X^n} P_{X^n}(x) ℓ(c_x) ≤ H_D(X) + 1/n,

where c_x is the codeword for the sourceword x and ℓ(c_x) is the length of
codeword c_x. Show that the upper bound can be improved to

R̄_n < H_D(X) + 1/n.

Hint: Replace ℓ(c_x) = ⌊− log_D P_{X^n}(x)⌋ + 1 by a new assignment.


13. Let X_1, X_2, X_3, . . . be i.i.d. random variables with common infinite alphabet
X = {x_1, x_2, x_3, . . .}, and assume that P_X(x_i) > 0 for every i.
(a) Prove that the average code rate of the first-order (single-letter) binary
Huffman code is equal to H(X) iff P_X(x_i) is equal to 2^{−n_i} for every i,
where {n_i}_{i≥1} is a sequence of positive integers.
Hint: The if-part can be proved by the new bound in Problem 12, and the
only-if-part can be proved by modifying the proof of Theorem 3.22.
(b) What is the sufficient and necessary condition under which the average code
rate of the first-order (single-letter) ternary Huffman code equals H_3(X)?

(c) Prove that the average code rate of the second-order (two-letter) binary
Huffman code cannot be equal to H(X) + 1/2 bits.
Hint: Use the new bound in Problem 12.
14. Decide whether each of the following statements is true or false. Prove the
validity of those that are true and give counterexamples or arguments based on
known facts to disprove those that are false.
(a) Every Huffman code for a discrete memoryless source (DMS) has a corre-
sponding suffix code with the same average code rate.
(b) Consider a DMS {X_n}_{n=1}^∞ with alphabet X = {a_1, a_2, a_3, a_4, a_5, a_6} and
probability distribution

[p_1, p_2, p_3, p_4, p_5, p_6] = [1/4, 1/4, 1/4, 1/8, 1/16, 1/16],

where p_i := Pr{X = a_i}, i = 1, . . . , 6. The Shannon–Fano–Elias code
f : X → {0, 1}^* for the source is optimal.

15. Consider a discrete memoryless source {X_i}_{i=1}^∞ with alphabet X = {a, b, c} and
distribution P[X = a] = 1/2 and P[X = b] = P[X = c] = 1/4.
(a) Design an optimal first-order binary prefix code for this source (i.e., for
n = 1).
(b) Design an optimal second-order binary prefix code for this source (i.e., for
n = 2).
(c) Compare the codes in terms of both performance and complexity. Which
code would you recommend? Justify your answer.

16. Let {(X_i, Y_i)}_{i=1}^∞ be a two-dimensional DMS with alphabet X × Y = {0, 1} ×
{0, 1} and common distribution P_{X,Y} given by

P_{X,Y}(0, 0) = P_{X,Y}(1, 1) = (1 − ε)/2

and

P_{X,Y}(0, 1) = P_{X,Y}(1, 0) = ε/2,

where 0 < ε < 1.
(a) Find the limit of the random variable [P_{X^n}(X^n)]^{1/(2n)} as n → ∞.
(b) Find the limit of the random variable

(1/n) log_2 [ P_{X^n,Y^n}(X^n, Y^n) / ( P_{X^n}(X^n) P_{Y^n}(Y^n) ) ]

as n → ∞.


17. Consider a discrete memoryless source {X_i}_{i=1}^∞ with alphabet X and distribution
P_X. Let C = f(X) be a uniquely decodable binary code

f : X → {0, 1}∗

that maps single source letters onto binary strings such that its average code rate
R C satisfies
R C = H (X ) bits/source symbol.

In other words, C is absolutely optimal.


Now consider a second binary code C′ = f′(X^n) for the source that maps source
n-tuples onto binary strings:

f′ : X^n → {0, 1}^*.

Provide a construction for the map f′ such that the code C′ is also absolutely
optimal.
18. Consider two random variables X and Y with values in finite sets X and Y,
respectively. Let l X , l Y , and l X Y denote the average codeword lengths of the
optimal (first-order) prefix codes

f : X → {0, 1}∗ ,

g : Y → {0, 1}∗

and
h : X × Y → {0, 1}∗ ,

respectively; i.e., l X = E[l( f (X ))], l Y = E[l(g(Y ))], and l X Y = E[l(h(X, Y ))].


Prove that
(a) l X + l Y − l X Y < I (X ; Y ) + 2.
(b) l X Y ≤ l X + l Y .
19. Entropy rate: Consider a stationary source {X_n}_{n=1}^∞ with finite alphabet X and
entropy rate H(X).
(a) Show that the normalized conditional entropy (1/n) H(X_{n+1}^{2n} | X^n) is nonincreasing
in n, where X^n = (X_1, . . . , X_n) and X_{n+1}^{2n} = (X_{n+1}, . . . , X_{2n}).
(b) Show that

H(X_{2n} | X^{2n−1}) ≤ (1/n) H(X_{n+1}^{2n} | X^n) ≤ (1/n) H(X^n).

(c) Find the limits of (1/n) H(X_{n+1}^{2n} | X^n) and (1/n) I(X_{n+1}^{2n} ; X^n) as n → ∞.
(d) Given that the source {X_n} is stationary Markov, compare (1/n) H(X_{n+1}^{2n} | X^n) to
H(X).

20. Divergence rate: Prove the expression in (3.2.6) for the divergence rate between
a stationary source {X i } and a time-invariant Markov source { X̂ i }, with both
sources having a common finite alphabet X . Generalize the result if the source
{ X̂ i } is a time-invariant kth-order Markov chain.
21. Prove Observation 3.29.
22. Prove Lemma 3.33.
Chapter 4
Data Transmission and Channel Capacity

4.1 Principles of Data Transmission

A noisy communication channel is an input–output medium in which the output


is not completely or deterministically specified by the input. The channel is indeed
stochastically modeled, where given channel input x, the channel output y is governed
by a transition (conditional) probability distribution denoted by PY |X (y|x). Since two
different inputs may give rise to the same output, the receiver, upon receipt of an
output, needs to guess the most probable sent input. In general, words of length n
are sent and received over the channel; in this case, the channel is characterized by a
sequence of n-dimensional transition distributions PY n |X n (y n |x n ), for n = 1, 2, . . ..
A block diagram depicting a data transmission or channel coding system with no
(output) feedback is given in Fig. 4.1.
The designer of a data transmission (or channel) code needs to carefully select
codewords from the set of channel input words (of a given length) so that a minimal
ambiguity is obtained at the channel receiver. For example, suppose that a channel
has binary input and output alphabets and that its transition probability distribution
induces the following conditional probability on its output symbols given that input
words of length 2 are sent:

P_{Y|X^2}(y = 0 | x^2 = 00) = 1
P_{Y|X^2}(y = 0 | x^2 = 01) = 1
P_{Y|X^2}(y = 1 | x^2 = 10) = 1
P_{Y|X^2}(y = 1 | x^2 = 11) = 1,

which can be graphically depicted as


Fig. 4.1 A data transmission system, where W represents the message for transmission, X^n denotes
the codeword corresponding to message W, Y^n represents the received word due to channel input
X^n, and Ŵ denotes the reconstructed message from Y^n

[Diagram: inputs 00 and 01 are mapped to output 0 with probability 1, and inputs 10 and 11 are mapped to output 1 with probability 1.]
and a binary message (either event A or event B) is required to be transmitted from
the sender to the receiver. Then the data transmission code with (codeword 00 for
event A, codeword 10 for event B) obviously induces less ambiguity at the receiver
than the code with (codeword 00 for event A, codeword 01 for event B).
In short, the objective in designing a data transmission (or channel) code is to
transform a noisy channel into a reliable medium for sending messages and recov-
ering them at the receiver with minimal loss. To achieve this goal, the designer of
a data transmission code needs to take advantage of the common parts between the
sender and the receiver sites that are least affected by the channel noise. We will see
that these common parts are probabilistically captured by the mutual information
between the channel input and the channel output.
As illustrated in the previous example, if a “least-noise-affected” subset of the
channel input words is appropriately selected as the set of codewords, the messages
intended to be transmitted can be reliably sent to the receiver with arbitrarily small
error. One then raises the question:
What is the maximum amount of information (per channel use) that can be reliably
transmitted over a given noisy channel?

In the above example, we can transmit a binary message error-free, and hence, the
amount of information that can be reliably transmitted is at least 1 bit per channel
use (or channel symbol). It can be expected that the amount of information that can
be reliably transmitted for a highly noisy channel should be less than that for a less
noisy channel. But such a comparison requires a good measure of the “noisiness” of
channels.
From an information-theoretic viewpoint, “channel capacity” provides a good
measure of the noisiness of a channel; it represents the maximal amount of infor-
mational messages (per channel use) that can be transmitted via a data transmission
code over the channel and recovered with arbitrarily small probability of error at the
receiver. In addition to its dependence on the channel transition distribution, channel

capacity also depends on the coding constraint imposed on the channel input, such
as “only block (fixed-length) codes are allowed.” In this chapter, we will study chan-
nel capacity for block codes (namely, only block transmission code can be used).1
Throughout the chapter, the noisy channel is assumed to be memoryless (as defined
in the next section).

4.2 Discrete Memoryless Channels

Definition 4.1 (Discrete channel) A discrete communication channel is character-


ized by
• A finite input alphabet X .
• A finite output alphabet Y.
• A sequence of n-dimensional transition distributions

{P_{Y^n|X^n}(y^n|x^n)}_{n=1}^∞

such that Σ_{y^n ∈ Y^n} P_{Y^n|X^n}(y^n|x^n) = 1 for every x^n ∈ X^n, where x^n = (x_1, . . . , x_n)
∈ X^n and y^n = (y_1, . . . , y_n) ∈ Y^n. We assume that the above sequence of
n-dimensional distributions is consistent, i.e.,

P_{Y^i|X^i}(y^i|x^i) = [ Σ_{x_{i+1}∈X} Σ_{y_{i+1}∈Y} P_{X^{i+1}}(x^{i+1}) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1}) ] / [ Σ_{x_{i+1}∈X} P_{X^{i+1}}(x^{i+1}) ]
               = Σ_{x_{i+1}∈X} Σ_{y_{i+1}∈Y} P_{X_{i+1}|X^i}(x_{i+1}|x^i) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})

for every x^i, y^i, P_{X_{i+1}|X^i} and i = 1, 2, . . ..

In general, real-world communications channels exhibit statistical memory in the


sense that current channel outputs statistically depend on past outputs as well as past,
current, and (possibly) future inputs. However, for the sake of simplicity, we restrict
our attention in this chapter to the class of memoryless channels (see Problem 4.27
for a brief discussion of channels with memory).

Definition 4.2 (Discrete memoryless channel) A discrete memoryless channel
(DMC) is a channel whose sequence of transition distributions P_{Y^n|X^n} satisfies

P_{Y^n|X^n}(y^n|x^n) = Π_{i=1}^n P_{Y|X}(y_i|x_i)     (4.2.1)

1 See [397] for recent results regarding channel capacity when no coding constraints are applied to
the channel input (so that variable-length codes can be employed).

for every n = 1, 2, . . . , x n ∈ X n and y n ∈ Y n . In other words, a DMC is fully


described by the channel’s transition distribution matrix Q := [ px,y ] of size |X |×|Y|,
where
px,y := PY |X (y|x)

for x ∈ X, y ∈ Y. Furthermore, the matrix Q is stochastic; i.e., the sum of the entries
in each of its rows is equal to 1, since Σ_{y∈Y} p_{x,y} = 1 for all x ∈ X.
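As a minimal illustration (in Python, with a hypothetical binary-input, ternary-output channel that is not taken from the text), a DMC can be stored as its row-stochastic matrix Q, and the memoryless property (4.2.1) then gives the n-dimensional transition probabilities as products of single-letter entries.

```python
from math import prod

def is_stochastic(Q, tol=1e-12):
    """Each row of the transition matrix must be nonnegative and sum to 1."""
    return all(abs(sum(row) - 1.0) < tol and min(row) >= 0 for row in Q)

def p_yn_given_xn(Q, xn, yn):
    """Memoryless channel: P(y^n | x^n) = prod_i Q[x_i][y_i], cf. (4.2.1)."""
    return prod(Q[x][y] for x, y in zip(xn, yn))

# Hypothetical DMC (rows indexed by x, columns by y).
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.2, 0.7]]
print(is_stochastic(Q))                        # True
print(p_yn_given_xn(Q, [0, 1, 1], [0, 2, 1]))  # 0.7 * 0.7 * 0.2 = 0.098
```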

Observation 4.3 We note that the DMC’s condition (4.2.1) is actually equivalent
to the following two sets of conditions [29]:

P_{Y_n|X^n,Y^{n−1}}(y_n|x^n, y^{n−1}) = P_{Y|X}(y_n|x_n)   ∀ n = 1, 2, . . . , x^n, y^n;       (4.2.2a)
P_{Y^{n−1}|X^n}(y^{n−1}|x^n) = P_{Y^{n−1}|X^{n−1}}(y^{n−1}|x^{n−1})   ∀ n = 2, 3, . . . , x^n, y^{n−1}.   (4.2.2b)

P_{Y_n|X^n,Y^{n−1}}(y_n|x^n, y^{n−1}) = P_{Y|X}(y_n|x_n)   ∀ n = 1, 2, . . . , x^n, y^n;       (4.2.3a)
P_{X_n|X^{n−1},Y^{n−1}}(x_n|x^{n−1}, y^{n−1}) = P_{X_n|X^{n−1}}(x_n|x^{n−1})   ∀ n = 1, 2, . . . , x^n, y^{n−1}.   (4.2.3b)

Condition (4.2.2a) [also (4.2.3a)] implies that the current output Y_n only depends
on the current input X_n but not on past inputs X^{n−1} and outputs Y^{n−1}. Condition
(4.2.2b) indicates that the past outputs Y^{n−1} do not depend on the current input X_n.
These two conditions together give

P_{Y^n|X^n}(y^n|x^n) = P_{Y^{n−1}|X^n}(y^{n−1}|x^n) P_{Y_n|X^n,Y^{n−1}}(y_n|x^n, y^{n−1})
                  = P_{Y^{n−1}|X^{n−1}}(y^{n−1}|x^{n−1}) P_{Y|X}(y_n|x_n);

hence, (4.2.1) holds recursively for n = 1, 2, . . .. The converse [i.e., (4.2.1) implies
both (4.2.2a) and (4.2.2b)] is a direct consequence of

P_{Y_n|X^n,Y^{n−1}}(y_n|x^n, y^{n−1}) = P_{Y^n|X^n}(y^n|x^n) / Σ_{y_n ∈ Y} P_{Y^n|X^n}(y^n|x^n)

and

P_{Y^{n−1}|X^n}(y^{n−1}|x^n) = Σ_{y_n ∈ Y} P_{Y^n|X^n}(y^n|x^n).

Similarly, (4.2.3b) states that the current input X_n is independent of past outputs
Y^{n−1}, which together with (4.2.3a) implies again

P_{Y^n|X^n}(y^n|x^n)
  = P_{X^n,Y^n}(x^n, y^n) / P_{X^n}(x^n)
  = [ P_{X^{n−1},Y^{n−1}}(x^{n−1}, y^{n−1}) P_{X_n|X^{n−1},Y^{n−1}}(x_n|x^{n−1}, y^{n−1}) P_{Y_n|X^n,Y^{n−1}}(y_n|x^n, y^{n−1}) ] / [ P_{X^{n−1}}(x^{n−1}) P_{X_n|X^{n−1}}(x_n|x^{n−1}) ]
  = P_{Y^{n−1}|X^{n−1}}(y^{n−1}|x^{n−1}) P_{Y|X}(y_n|x_n),

hence, recursively yielding (4.2.1). The converse for (4.2.3b)—i.e., (4.2.1) implying
(4.2.3b)—can be analogously proved by noting that

P_{X_n|X^{n−1},Y^{n−1}}(x_n|x^{n−1}, y^{n−1}) = [ P_{X^n}(x^n) Σ_{y_n ∈ Y} P_{Y^n|X^n}(y^n|x^n) ] / [ P_{X^{n−1}}(x^{n−1}) P_{Y^{n−1}|X^{n−1}}(y^{n−1}|x^{n−1}) ].

Note that the above definition of DMC in (4.2.1) prohibits the use of channel feed-
back, as feedback allows the current channel input to be a function of past chan-
nel outputs (therefore, conditions (4.2.2b) and (4.2.3b) cannot hold with feedback).
Instead, a causality condition generalizing (4.2.2a) (e.g., see Problem 4.28 or [415,
Definition 7.4]) will be needed for a channel with feedback.

Examples of DMCs:

1. Identity (noiseless) channels: An identity channel has equal size input and output
alphabets (|X | = |Y|) and channel transition probability satisfying

P_{Y|X}(y|x) = { 1 if y = x,
               { 0 if y ≠ x.

This is a noiseless or perfect channel as the channel input is received error-free


at the channel output.
2. Binary symmetric channels: A binary symmetric channel (BSC) is a channel
with binary input and output alphabets such that each input has a (conditional)
probability given by ε for being received inverted at the output, where ε ∈ [0, 1]
is called the channel’s crossover probability or bit error rate. The channel’s
transition distribution matrix is given by

Q = [p_{x,y}] = [ p_{0,0}  p_{0,1} ] = [ P_{Y|X}(0|0)  P_{Y|X}(1|0) ] = [ 1−ε    ε  ]
                [ p_{1,0}  p_{1,1} ]   [ P_{Y|X}(0|1)  P_{Y|X}(1|1) ]   [  ε    1−ε ]     (4.2.4)

and can be graphically represented via a transition diagram as shown in Fig. 4.2.
If we set ε = 0, then the BSC reduces to the binary identity (noiseless)
channel. The channel is called “symmetric” since PY |X (1|0) = PY |X (0|1); i.e.,

Fig. 4.2 Binary symmetric channel (BSC)

it has the same probability for flipping an input bit into a 0 or a 1. A detailed
discussion of DMCs with various symmetry properties will be discussed later in
this chapter.
Despite its simplicity, the BSC is rich enough to capture most of the complex-
ity of coding problems over more general channels. For example, it can exactly
model the behavior of practical channels with additive memoryless Gaussian
noise used in conjunction with binary symmetric modulation and hard-decision
demodulation (e.g., see [407, p. 240]). It is also worth pointing out that the BSC
can be explicitly represented via a binary modulo-2 additive noise channel whose
output at time i is the modulo-2 sum of its input and noise variables:

Yi = X i ⊕ Z i for i = 1, 2, . . . , (4.2.5)

where ⊕ denotes addition modulo-2, Yi , X i , and Z i are the channel output, input,
and noise, respectively, at time i, the alphabets X = Y = Z = {0, 1} are all
binary. It is assumed in (4.2.5) that X i and Z j are independent of each other for
any i, j = 1, 2, . . . , and that the noise process is a Bernoulli(ε) process, i.e., a
binary i.i.d. process with Pr[Z = 1] = ε (a short simulation sketch of this
representation, together with that of the BEC below, is given after this list of examples).
3. Binary erasure channels: In the BSC, some input bits are received perfectly and
others are received corrupted (flipped) at the channel output. In some channels,
however, some input bits are lost during transmission instead of being received
corrupted (for example, packets in data networks may get dropped or blocked
due to congestion or bandwidth constraints). In this case, the receiver knows the
exact location of these bits in the received bitstream or codeword, but not their
actual value. Such bits are then declared as “erased” during transmission and are
called “erasures.” This gives rise to the so-called binary erasure channel (BEC)
as illustrated in Fig. 4.3, with input alphabet X = {0, 1} and output alphabet
Y = {0, E, 1}, where E represents an erasure (we may assume that E is a real
number strictly greater than one), and channel transition matrix given by

Fig. 4.3 Binary erasure channel (BEC)


Q = [p_{x,y}] = [ p_{0,0}  p_{0,E}  p_{0,1} ] = [ P_{Y|X}(0|0)  P_{Y|X}(E|0)  P_{Y|X}(1|0) ] = [ 1−α   α    0  ]
                [ p_{1,0}  p_{1,E}  p_{1,1} ]   [ P_{Y|X}(0|1)  P_{Y|X}(E|1)  P_{Y|X}(1|1) ]   [  0    α   1−α ]     (4.2.6)

where 0 ≤ α ≤ 1 is called the channel’s erasure probability. We also observe
that, like the BSC, the BEC can be explicitly expressed as follows:

Y_i = X_i · 1{Z_i ≠ E} + E · 1{Z_i = E}   for i = 1, 2, . . . ,     (4.2.7)

where

1{Z_i = E} := { 1 if Z_i = E,
              { 0 if Z_i ≠ E

is the indicator function of the set {Z_i = E}, and Y_i, X_i, and Z_i are the channel output,
input, and erasure variable, respectively, at time i, with alphabets X = {0, 1},
Z = {0, E} and Y = {0, 1, E}. Indeed, when the erasure variable Z_i = E,
Y_i = E and an erasure occurs in the channel; also, when Z_i = 0, Y_i = X_i and
the input is received perfectly. In the BEC functional representation in (4.2.7),
it is assumed that X_i and Z_j are independent of each other for any i, j and that
the erasure process {Z_i} is i.i.d. with Pr[Z = E] = α.
4. Binary channels with errors and erasures: One can combine the BSC with the
BEC to obtain a binary channel with both errors and erasures, as shown in Fig. 4.4.
We will call such channel the binary symmetric erasure channel (BSEC). In this
case, the channel’s transition matrix is given by

Fig. 4.4 Binary symmetric erasure channel (BSEC)


Q = [p_{x,y}] = [ p_{0,0}  p_{0,E}  p_{0,1} ] = [ 1−ε−α   α     ε   ]
                [ p_{1,0}  p_{1,E}  p_{1,1} ]   [   ε     α   1−ε−α ]     (4.2.8)

where ε, α ∈ [0, 1] are the channel’s crossover and erasure probabilities, respec-
tively, with ε + α ≤ 1. Clearly, setting α = 0 reduces the BSEC to the BSC,
and setting ε = 0 reduces the BSEC to the BEC. Analogously to the BSC and
the BEC, the BSEC admits an explicit expression in terms of a noise-erasure
process:

Y_i = { X_i ⊕ Z_i   if Z_i ≠ E
      { E           if Z_i = E
    = (X_i ⊕ Z_i) · 1{Z_i ≠ E} + E · 1{Z_i = E}     (4.2.9)

for i = 1, 2, . . ., where ⊕ is modulo-2 addition,2 1{·} is the indicator function, Yi ,


X i , and Z i are the channel output, input, and noise-erasure variable, respectively,
at time i and the alphabets are X = {0, 1} and Y = Z = {0, 1, E}. Indeed,
when the noise-erasure variable Z i = E, Yi = E and an erasure occurs in the
channel; when Z i = 0, Yi = X i and the input is received perfectly; finally
when Z i = 1, Yi = X i ⊕ 1 and the input bit is received in error. In the BSEC
functional characterization (4.2.9), it is assumed that X i and Z j are independent
of each other for any i, j and that the noise-erasure process {Z i } is i.i.d. with
Pr[Z = E] = α and Pr[Z = 1] = ε.
More generally, the channel need not have a symmetric property in the
sense of having identical transition distributions when input bits 0 or 1 are sent.
For example, the channel’s transition matrix can be given by

2 Strictly speaking, note that X ⊕ Z is not defined in (4.2.9) when Z = E. However, as 1{Z ≠ E} = 0
when Z = E, this is remedied by using the convention that an undefined quantity multiplied by
zero is equal to zero.

Q = [p_{x,y}] = [ p_{0,0}  p_{0,E}  p_{0,1} ] = [ 1−ε−α    α      ε     ]
                [ p_{1,0}  p_{1,E}  p_{1,1} ]   [   ε′     α′   1−ε′−α′ ]     (4.2.10)

where in general the probabilities ε′ ≠ ε and α′ ≠ α. We call such a channel
an asymmetric channel with errors and erasures (this model might be useful
to represent practical channels using asymmetric or nonuniform modulation
constellations).
5. q-ary symmetric channels: Given an integer q ≥ 2, the q-ary symmetric channel
is a nonbinary extension of the BSC; it has alphabets X = Y = {0, 1, . . . , q −1}
of size q and channel transition matrix given by

Q = [p_{x,y}] =
    [ p_{0,0}      p_{0,1}      · · ·   p_{0,q−1}   ]
    [ p_{1,0}      p_{1,1}      · · ·   p_{1,q−1}   ]
    [   ...          ...        · · ·     ...       ]
    [ p_{q−1,0}    p_{q−1,1}    · · ·   p_{q−1,q−1} ]

  =
    [ 1−ε        ε/(q−1)    · · ·   ε/(q−1) ]
    [ ε/(q−1)    1−ε        · · ·   ε/(q−1) ]
    [   ...        ...      · · ·     ...   ]
    [ ε/(q−1)    ε/(q−1)    · · ·   1−ε     ]     (4.2.11)

where 0 ≤ ε ≤ 1 is the channel’s symbol error rate (or probability). When


q = 2, the channel reduces to the BSC with bit error rate ε, as expected.
As the BSC, the q-ary symmetric channel can be expressed as a modulo-q
additive noise channel with common input, output and noise alphabets X = Y =
Z = {0, 1, . . . , q − 1} and whose output Yi at time i is given by Yi = X i ⊕q Z i ,
for i = 1, 2, . . . , where ⊕q denotes addition modulo-q, and X i and Z i are
the channel’s input and noise variables, respectively, at time i. Here, the noise
process {Z n }∞
n=1 is assumed to be an i.i.d. process with distribution

Pr[Z = 0] = 1 − ε   and   Pr[Z = a] = ε/(q − 1)   ∀ a ∈ {1, . . . , q − 1}.

It is also assumed that the input and noise processes are independent of each
other.
6. q-ary erasure channels: Given an integer q ≥ 2, one can also consider a nonbi-
nary extension of the BEC, yielding the so-called q-ary erasure channel. Specifi-
cally, this channel has input and output alphabets given by X = {0, 1, . . . , q −1}
and Y = {0, 1, . . . , q − 1, E}, respectively, where E denotes an erasure, and
channel transition distribution given by


P_{Y|X}(y|x) = { 1 − α   if y = x, x ∈ X,
               { α       if y = E, x ∈ X,       (4.2.12)
               { 0       if y ≠ x and y ≠ E, x ∈ X,

where 0 ≤ α ≤ 1 is the erasure probability. As expected, setting q = 2 reduces


the channel to the BEC.
Note that the same functional representation (4.2.7) also holds for this channel,
where {Z_i}_{i=1}^∞ is an i.i.d. (input-independent) erasure process with alphabet
{0, E}. Finally, a nonbinary extension of the BSEC can be similarly obtained.
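To tie these models together, here is a minimal Python simulation sketch of the functional representations (4.2.5) of the BSC and (4.2.7) of the BEC, with input-independent i.i.d. noise/erasure processes; the parameter values are illustrative only.

```python
import random

def bsc(x_bits, eps):
    """BSC via the modulo-2 additive noise representation (4.2.5): Y_i = X_i xor Z_i."""
    return [x ^ (1 if random.random() < eps else 0) for x in x_bits]

def bec(x_bits, alpha):
    """BEC via the erasure representation (4.2.7): Y_i = X_i unless Z_i = E."""
    return [("E" if random.random() < alpha else x) for x in x_bits]

random.seed(0)
x = [random.randint(0, 1) for _ in range(20)]   # an arbitrary input word
print(bsc(x, eps=0.1))                          # some bits flipped with probability 0.1
print(bec(x, alpha=0.25))                       # some bits erased with probability 0.25
```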

4.3 Block Codes for Data Transmission Over DMCs

Definition 4.4 (Fixed-length data transmission code) Given positive integers n and
M (where M = Mn ), and a discrete channel with input alphabet X and output
alphabet Y, a fixed-length data transmission code (or block code) for this channel
with blocklength n and rate (1/n) log_2 M message bits per channel symbol (or channel
use) is denoted by ∼Cn = (n, M) and consists of:
1. M information messages intended for transmission.
2. An encoding function
f : {1, 2, . . . , M} → X n

yielding codewords f (1), f (2), . . . , f (M) ∈ X n , each of length n. The set of


these M codewords is called the codebook and we also usually write ∼Cn =
{ f (1), f (2), . . . , f (M)} to list the codewords.
3. A decoding function g : Y n → {1, 2, . . . , M}.

The set {1, 2, . . . , M} is called the message set and we assume that a message W
follows a uniform distribution over the set of messages: Pr[W = w] = 1/M for all
w ∈ {1, 2, . . . , M}. A block diagram for the channel code is given at the beginning of
this chapter; see Fig. 4.1. As depicted in the diagram, to convey message W over the
channel, the encoder sends its corresponding codeword X n = f (W ) at the channel
input. Finally, Y n is received at the channel output (according to the memoryless
channel distribution PY n |X n ) and the decoder yields Ŵ = g(Y n ) as the message
estimate.

Definition 4.5 (Average probability of error) The average probability of error for a
channel block code ∼Cn = (n, M) with encoder f(·) and decoder g(·) used over
a channel with transition distribution P_{Y^n|X^n} is defined as

P_e(∼Cn) := (1/M) Σ_{w=1}^M λ_w(∼Cn),

where

λ_w(∼Cn) := Pr[Ŵ ≠ W | W = w] = Pr[g(Y^n) ≠ w | X^n = f(w)]
          = Σ_{y^n ∈ Y^n : g(y^n) ≠ w} P_{Y^n|X^n}(y^n | f(w))

is the code’s conditional probability of decoding error given that message w is sent
over the channel.
Note that, since we have assumed that the message W is drawn uniformly from
the set of messages, we have that

P_e(∼Cn) = Pr[Ŵ ≠ W].

Observation 4.6 (Maximal probability of error) Another more conservative error
criterion is the so-called maximal probability of error

λ(∼Cn) := max_{w ∈ {1,2,...,M}} λ_w(∼Cn).

Clearly, P_e(∼Cn) ≤ λ(∼Cn); so one would expect that P_e(∼Cn) behaves differently than
λ(∼Cn). However, it can be shown that from a code ∼Cn = (n, M) with arbitrarily small
P_e(∼Cn), one can construct (by throwing away from ∼Cn half of its codewords with
largest conditional probability of error) a code ∼Cn′ = (n, M/2) with arbitrarily small
λ(∼Cn′) at essentially the same code rate as n grows to infinity (e.g., see [83, p. 204],
[415, p. 163]).3 Hence, for simplicity, we will only use P_e(∼Cn) as our criterion when
evaluating the “goodness” or reliability4 of channel block codes; but one must keep
in mind that our results hold under λ(∼Cn) as well, in particular the channel coding
theorem below.
Our target is to find a good channel block code (or to show the existence of a good
channel block code). From the perspective of the (weak) law of large numbers, a
good choice is to draw the code’s codewords based on the jointly typical set between
the input and the output of the channel, since all the probability mass is ultimately
placed on the jointly typical set. The decoding failure then occurs only when the
channel input–output pair does not lie in the jointly typical set, which implies that
the probability of decoding error is ultimately small. We next define the jointly typical
set.

3 Note that this fact holds for single-user channels with known transition distributions (as given in
Definition 4.1) that remain constant throughout the transmission of a codeword. It does not however
hold for single-user channels whose statistical descriptions may vary in an unknown manner from
symbol to symbol during a codeword transmission; such channels, which include the class of
“arbitrarily varying channels” (see [87, Chap. 2, Sect. 6]), will not be considered in this textbook.
4 We interchangeably use the terms “goodness” or “reliability” for a block code to mean that its

(average) probability of error asymptotically vanishes with increasing blocklength.



Definition 4.7 (Jointly typical set) The set F_n(δ) of jointly δ-typical n-tuple
pairs (x^n, y^n) with respect to the memoryless distribution P_{X^n,Y^n}(x^n, y^n) =
Π_{i=1}^n P_{X,Y}(x_i, y_i) is defined by

F_n(δ) := { (x^n, y^n) ∈ X^n × Y^n :
    | −(1/n) log_2 P_{X^n}(x^n) − H(X) | < δ,   | −(1/n) log_2 P_{Y^n}(y^n) − H(Y) | < δ,
    and | −(1/n) log_2 P_{X^n,Y^n}(x^n, y^n) − H(X, Y) | < δ }.

In short, a pair (x^n, y^n) generated by independently drawing n times under P_{X,Y} is
jointly δ-typical if its joint and marginal empirical entropies are, respectively, δ-close
to the true joint and marginal entropies.

With the above definition, we directly obtain the joint AEP theorem.

Theorem 4.8 (Joint AEP) If (X_1, Y_1), (X_2, Y_2), . . ., (X_n, Y_n), . . . are i.i.d., i.e.,
{(X_i, Y_i)}_{i=1}^∞ is a dependent pair of DMSs, then

−(1/n) log_2 P_{X^n}(X_1, X_2, . . . , X_n) → H(X)   in probability,
−(1/n) log_2 P_{Y^n}(Y_1, Y_2, . . . , Y_n) → H(Y)   in probability,

and

−(1/n) log_2 P_{X^n,Y^n}((X_1, Y_1), . . . , (X_n, Y_n)) → H(X, Y)   in probability

as n → ∞.

Proof By the weak law of large numbers, we have the desired result.

Theorem 4.9 (Shannon–McMillan–Breiman theorem for pairs) Given a dependent
pair of DMSs with joint entropy H(X, Y) and any δ greater than zero, we can choose
n big enough so that the jointly δ-typical set satisfies:
1. P_{X^n,Y^n}(F_n^c(δ)) < δ for sufficiently large n.
2. The number of elements in F_n(δ) is at least (1 − δ) 2^{n(H(X,Y)−δ)} for sufficiently
large n, and at most 2^{n(H(X,Y)+δ)} for every n.
3. If (x^n, y^n) ∈ F_n(δ), its probability of occurrence satisfies

2^{−n(H(X,Y)+δ)} < P_{X^n,Y^n}(x^n, y^n) < 2^{−n(H(X,Y)−δ)}.



Proof The proof is quite similar to that of the Shannon–McMillan–Breiman theorem


for a single memoryless source presented in the previous chapter; we hence leave it
as an exercise.
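As a numerical sanity check of property 1 of Theorem 4.9 (and of the joint AEP), the following Python sketch estimates P_{X^n,Y^n}(F_n(δ)) by Monte Carlo for a hypothetical joint distribution P_{X,Y} (not taken from the text); the estimate approaches 1 as n grows.

```python
import math, random

def joint_typical_prob(pxy, n, delta, trials=500):
    """Monte Carlo estimate of P[(X^n, Y^n) in F_n(delta)] for an i.i.d. pair ~ pxy."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    Hx = -sum(p * math.log2(p) for p in px.values())
    Hy = -sum(p * math.log2(p) for p in py.values())
    Hxy = -sum(p * math.log2(p) for p in pxy.values())
    pairs, weights = list(pxy), list(pxy.values())
    hits = 0
    for _ in range(trials):
        sample = random.choices(pairs, weights=weights, k=n)
        ex = -sum(math.log2(px[x]) for x, _ in sample) / n     # empirical entropies
        ey = -sum(math.log2(py[y]) for _, y in sample) / n
        exy = -sum(math.log2(pxy[xy]) for xy in sample) / n
        if abs(ex - Hx) < delta and abs(ey - Hy) < delta and abs(exy - Hxy) < delta:
            hits += 1
    return hits / trials

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # hypothetical P_{X,Y}
for n in (10, 100, 1000):
    print(n, joint_typical_prob(pxy, n, delta=0.1))   # tends to 1 as n grows
```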

We next introduce the notion of operational capacity for a channel as the largest
coding transmission rate that can be conveyed reliably (i.e., with asymptotically
decaying error probability) over the channel when the coding blocklength is allowed
to grow without bound.
Definition 4.10 (Operational capacity) A rate R is said to be achievable for a dis-
crete channel if there exists a sequence of (n, M_n) channel codes ∼Cn with

lim inf_{n→∞} (1/n) log_2 M_n ≥ R   and   lim_{n→∞} P_e(∼Cn) = 0.

The channel’s operational capacity, C_op, is the supremum of all achievable rates:

C_op = sup{R : R is achievable}.

We herein arrive at the main result of this chapter, Shannon’s channel coding
theorem for DMCs. It states that for a DMC, its operational capacity Cop is actually
equal to a quantity C, conveniently termed as channel capacity (or information
capacity) and defined as the maximum of the channel’s mutual information over the
set of its input distributions (see below). In other words, the quantity C is indeed the
supremum of all achievable channel code rates, and this is shown in two parts in the
theorem in light of the properties of the supremum; see Observation A.5. As a result,
for a given DMC, its quantity C, which can be calculated by solely using the channel’s
transition matrix Q, constitutes the largest rate at which one can reliably transmit
information via a block code over this channel. Thus, it is possible to communicate
reliably over an inherently noisy DMC at a fixed rate (without decreasing it) as long
as this rate is below C and the code’s blocklength is allowed to be large.
Theorem 4.11 (Shannon's channel coding theorem) Consider a DMC with finite input alphabet X, finite output alphabet Y and transition probability distribution P_{Y|X}(y|x), x ∈ X and y ∈ Y. Define the channel capacity5

5 First note that the mutual information I(X; Y) is actually a function of the input statistics P_X and the channel statistics P_{Y|X}. Hence, we may write it as

$$
I(P_X, P_{Y|X}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)\, P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')\, P_{Y|X}(y|x')}.
$$

Such an expression is more suitable for calculating the channel capacity.
Note also that the channel capacity C is well-defined since, for a fixed P_{Y|X}, I(P_X, P_{Y|X}) is concave and continuous in P_X (with respect to both the variational distance and the Euclidean distance (i.e., L_2-distance) [415, Chap. 2]), and since the set of all input distributions P_X is a compact (closed and bounded) subset of R^{|X|} due to the finiteness of X. Hence, there exists a P_X that achieves the supremum of the mutual information, i.e., the maximum is attainable.

$$
C := \max_{P_X} I(X; Y) = \max_{P_X} I(P_X, P_{Y|X}),
$$

where the maximum is taken over all input distributions P_X. Then, the following hold.
• Forward part (achievability): For any 0 < ε < 1, there exist γ > 0 and a sequence of data transmission block codes {∼Cn = (n, M_n)}_{n=1}^∞ with

$$
\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n \ge C - \gamma
$$

and

$$
P_e(\mathcal{C}_n) < \varepsilon \quad \text{for sufficiently large } n,
$$

where P_e(∼Cn) denotes the (average) probability of error for block code ∼Cn.
• Converse part: For any 0 < ε < 1, any sequence of data transmission block codes {∼Cn = (n, M_n)}_{n=1}^∞ with

$$
\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n > C
$$

satisfies

$$
P_e(\mathcal{C}_n) > (1-\varepsilon)\mu \quad \text{for sufficiently large } n, \qquad (4.3.1)
$$

where

$$
\mu = 1 - \frac{C}{\liminf_{n\to\infty}\frac{1}{n}\log_2 M_n} > 0,
$$

i.e., the codes' probability of error is bounded away from zero for all n sufficiently large.6

Proof of the forward part: It suffices to prove the existence of a good block code
sequence (satisfying the rate condition, i.e., lim inf n→∞ (1/n) log2 Mn ≥ C − γ for
some γ > 0) whose average error probability is ultimately less than ε. Since the
forward part holds trivially when C = 0 by setting Mn = 1, we assume in the sequel
that C > 0.
We will use Shannon's original random coding proof technique in which the good block code sequence is not deterministically constructed; instead, its existence is implicitly proven by showing that for a class (ensemble) of block code sequences {∼Cn}_{n=1}^∞ and a code-selecting distribution Pr[∼Cn] over these block code sequences, the expected value of the average error probability, evaluated under the code-selecting distribution on these block code sequences, can be made smaller than ε for n sufficiently large:

6 Note that (4.3.1) actually implies that lim inf_{n→∞} P_e(∼Cn) ≥ lim_{ε↓0}(1 − ε)μ = μ, where the error probability lower bound has nothing to do with ε. Here, we state the converse of Theorem 4.11 in a form parallel to the converse statements in Theorems 3.6 and 3.15.

$$
E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\, P_e(\mathcal{C}_n) \to 0 \quad \text{as } n\to\infty.
$$

Hence, there must exist at least one desired good code sequence {∼Cn*}_{n=1}^∞ among them (with P_e(∼Cn*) → 0 as n → ∞).
Fix ε ∈ (0, 1) and some γ in (0, min{4ε, C}). Observe that there exists N_0 such that for n > N_0, we can choose an integer M_n with

$$
C - \frac{\gamma}{2} \ge \frac{1}{n}\log_2 M_n > C - \gamma. \qquad (4.3.2)
$$

(Since we are only concerned with the case of "sufficiently large n," it suffices to consider only those n's satisfying n > N_0, and ignore those n's for n ≤ N_0.)
Define δ := γ/8. Let P_{X̂} be a probability distribution that achieves the channel capacity:

$$
C := \max_{P_X} I(P_X, P_{Y|X}) = I(P_{\hat X}, P_{Y|X}).
$$

Denote by P_{Ŷ^n} the channel output distribution due to the channel input product distribution P_{X̂^n} (with P_{X̂^n}(x^n) = ∏_{i=1}^n P_{X̂}(x_i)), i.e.,

$$
P_{\hat Y^n}(y^n) = \sum_{x^n\in\mathcal{X}^n} P_{\hat X^n, \hat Y^n}(x^n, y^n)
$$

where

$$
P_{\hat X^n, \hat Y^n}(x^n, y^n) := P_{\hat X^n}(x^n)\, P_{Y^n|X^n}(y^n|x^n)
$$

for all x^n ∈ X^n and y^n ∈ Y^n. Note that since P_{X̂^n}(x^n) = ∏_{i=1}^n P_{X̂}(x_i) and the channel is memoryless, the resulting joint input–output process {(X̂_i, Ŷ_i)}_{i=1}^∞ is also memoryless with

$$
P_{\hat X^n, \hat Y^n}(x^n, y^n) = \prod_{i=1}^n P_{\hat X, \hat Y}(x_i, y_i)
$$

and

$$
P_{\hat X, \hat Y}(x, y) = P_{\hat X}(x)\, P_{Y|X}(y|x) \quad \text{for } x\in\mathcal{X},\ y\in\mathcal{Y}.
$$

We next present the proof in three steps.


Step 1: Code construction.
For any blocklength n, independently select Mn channel inputs with replacement7
from X n according to the distribution PX̂ n (x n ). For the selected Mn channel inputs
yielding codebook ∼Cn := {c1 , c2 , . . . , c Mn }, define the encoder f n (·) and decoder
gn (·), respectively, as follows:

7 Here, the channel inputs are selected with replacement. That means it is possible and acceptable
that all the selected Mn channel inputs are identical.

f_n(m) = c_m for 1 ≤ m ≤ M_n,

and

$$
g_n(y^n) = \begin{cases} m, & \text{if } c_m \text{ is the only codeword in } \mathcal{C}_n \text{ satisfying } (c_m, y^n)\in\mathcal{F}_n(\delta);\\[1mm] \text{any one in } \{1, 2, \ldots, M_n\}, & \text{otherwise}, \end{cases}
$$

where F_n(δ) is defined in Definition 4.7 with respect to distribution P_{X̂^n,Ŷ^n}. (We evidently assume that the codebook ∼Cn and the channel distribution P_{Y|X} are known at both the encoder and the decoder.) Hence, the code ∼Cn operates as follows. A message W is chosen according to the uniform distribution from the set of messages. The encoder f_n then transmits the Wth codeword c_W in ∼Cn over the channel. Then, Y^n is received at the channel output and the decoder guesses the sent message via Ŵ = g_n(Y^n).
Note that there are a total of |X|^{n M_n} possible randomly generated codebooks ∼Cn, and the probability of selecting each codebook is given by

$$
\Pr[\mathcal{C}_n] = \prod_{m=1}^{M_n} P_{\hat X^n}(c_m).
$$
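
To make the random coding construction and the joint-typicality decoding rule concrete, here is a small illustrative sketch (ours, not part of the text) for a BSC with a uniform code-selecting input distribution; it reuses the is_jointly_typical function sketched after Definition 4.7, and all names and parameter values are illustrative assumptions.

```python
# Hypothetical sketch of Step 1: draw a random codebook and decode by joint typicality.
import random

def random_codebook(Mn, n, px):
    """Select Mn codewords i.i.d. (with replacement) according to the product of px."""
    symbols, probs = zip(*px.items())
    return [random.choices(symbols, weights=probs, k=n) for _ in range(Mn)]

def typicality_decode(yn, codebook, pxy, delta):
    """Output m if c_m is the only codeword jointly typical with yn; else an arbitrary message."""
    hits = [m for m, cm in enumerate(codebook)
            if is_jointly_typical(cm, list(yn), pxy, delta)]
    return hits[0] if len(hits) == 1 else 0

# Example: BSC with crossover probability 0.1 and uniform (capacity-achieving) input.
eps = 0.1
px = {0: 0.5, 1: 0.5}
pxy = {(x, y): px[x] * ((1 - eps) if x == y else eps) for x in (0, 1) for y in (0, 1)}
codebook = random_codebook(Mn=8, n=200, px=px)
xn = codebook[3]                                         # transmit message W = 3
yn = [b ^ (random.random() < eps) for b in xn]           # pass the codeword through the BSC
print(typicality_decode(yn, codebook, pxy, delta=0.15))  # usually (not always) prints 3
```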

Step 2: Conditional error probability.


For each (randomly generated) data transmission code ∼Cn , the conditional prob-
ability of error given that message m was sent, λm (∼Cn ), can be upper bounded
by

$$
\lambda_m(\mathcal{C}_n) \le \sum_{y^n\in\mathcal{Y}^n:\,(c_m, y^n)\notin\mathcal{F}_n(\delta)} P_{Y^n|X^n}(y^n|c_m) + \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{y^n\in\mathcal{Y}^n:\,(c_{m'}, y^n)\in\mathcal{F}_n(\delta)} P_{Y^n|X^n}(y^n|c_m), \qquad (4.3.3)
$$

where the first term in (4.3.3) considers the case that the received channel output y^n is not jointly δ-typical with c_m (and hence, the decoding rule g_n(·) would possibly result in a wrong guess), and the second term in (4.3.3) reflects the situation when y^n is jointly δ-typical with not only the transmitted codeword c_m but also with another codeword c_{m'} (which may cause a decoding error).
By taking expectation in (4.3.3) with respect to the mth codeword-selecting
distribution PX̂ n (cm ), we obtain
  
$$
\sum_{c_m\in\mathcal{X}^n} P_{\hat X^n}(c_m)\,\lambda_m(\mathcal{C}_n) \le \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\notin\mathcal{F}_n(\delta|c_m)} P_{\hat X^n}(c_m)\, P_{Y^n|X^n}(y^n|c_m) + \sum_{c_m\in\mathcal{X}^n}\ \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n}(c_m)\, P_{Y^n|X^n}(y^n|c_m)
$$
$$
= P_{\hat X^n,\hat Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr) + \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n,\hat Y^n}(c_m, y^n), \qquad (4.3.4)
$$

where

$$
\mathcal{F}_n(\delta|x^n) := \bigl\{ y^n\in\mathcal{Y}^n : (x^n, y^n)\in\mathcal{F}_n(\delta) \bigr\}.
$$
Step 3: Average error probability.


We now can analyze the expectation of the average error probability

E ∼Cn [Pe (∼Cn )]

over the ensemble of all codebooks ∼Cn generated at random according to Pr[∼Cn ]
and show that it asymptotically vanishes as n grows without bound. We obtain the
following series of inequalities:

$$
E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\, P_e(\mathcal{C}_n)
$$
$$
= \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat X^n}(c_1)\cdots P_{\hat X^n}(c_{M_n}) \left[\frac{1}{M_n}\sum_{m=1}^{M_n}\lambda_m(\mathcal{C}_n)\right]
$$
$$
= \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat X^n}(c_1)\cdots P_{\hat X^n}(c_{m-1})\, P_{\hat X^n}(c_{m+1})\cdots P_{\hat X^n}(c_{M_n}) \times \left(\sum_{c_m\in\mathcal{X}^n} P_{\hat X^n}(c_m)\,\lambda_m(\mathcal{C}_n)\right)
$$
$$
\le \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat X^n}(c_1)\cdots P_{\hat X^n}(c_{m-1})\, P_{\hat X^n}(c_{m+1})\cdots P_{\hat X^n}(c_{M_n}) \times P_{\hat X^n,\hat Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr)
$$
$$
\quad + \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat X^n}(c_1)\cdots P_{\hat X^n}(c_{m-1})\, P_{\hat X^n}(c_{m+1})\cdots P_{\hat X^n}(c_{M_n}) \times \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\sum_{c_m\in\mathcal{X}^n}\sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n,\hat Y^n}(c_m, y^n) \qquad (4.3.5)
$$
$$
= P_{\hat X^n,\hat Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr) + \frac{1}{M_n}\sum_{m=1}^{M_n}\left\{\sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\left[\sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat X^n}(c_1)\cdots P_{\hat X^n}(c_{m-1})\, P_{\hat X^n}(c_{m+1})\cdots P_{\hat X^n}(c_{M_n}) \times \sum_{c_m\in\mathcal{X}^n}\sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n,\hat Y^n}(c_m, y^n)\right]\right\},
$$

where (4.3.5) follows from (4.3.4), and the last step holds since P_{X̂^n,Ŷ^n}(F_n^c(δ)) is a constant independent of c_1, ..., c_{M_n} and m. Observe that for n > N_0,


$$
\sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\left[\sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat X^n}(c_1)\cdots P_{\hat X^n}(c_{m-1})\, P_{\hat X^n}(c_{m+1})\cdots P_{\hat X^n}(c_{M_n}) \times \sum_{c_m\in\mathcal{X}^n}\sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n,\hat Y^n}(c_m, y^n)\right]
$$
$$
= \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\left[\sum_{c_m\in\mathcal{X}^n}\sum_{c_{m'}\in\mathcal{X}^n}\sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n}(c_{m'})\, P_{\hat X^n,\hat Y^n}(c_m, y^n)\right]
$$
$$
= \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\left[\sum_{c_{m'}\in\mathcal{X}^n}\sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n}(c_{m'})\left(\sum_{c_m\in\mathcal{X}^n} P_{\hat X^n,\hat Y^n}(c_m, y^n)\right)\right]
$$
$$
= \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\left[\sum_{c_{m'}\in\mathcal{X}^n}\sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\hat X^n}(c_{m'})\, P_{\hat Y^n}(y^n)\right]
$$
$$
= \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\left[\sum_{(c_{m'}, y^n)\in\mathcal{F}_n(\delta)} P_{\hat X^n}(c_{m'})\, P_{\hat Y^n}(y^n)\right]
$$
$$
\le \sum_{\substack{m'=1\\ m'\ne m}}^{M_n} |\mathcal{F}_n(\delta)|\, 2^{-n(H(\hat X)-\delta)}\, 2^{-n(H(\hat Y)-\delta)}
$$
$$
\le \sum_{\substack{m'=1\\ m'\ne m}}^{M_n} 2^{n(H(\hat X,\hat Y)+\delta)}\, 2^{-n(H(\hat X)-\delta)}\, 2^{-n(H(\hat Y)-\delta)}
$$
$$
= (M_n - 1)\, 2^{n(H(\hat X,\hat Y)+\delta)}\, 2^{-n(H(\hat X)-\delta)}\, 2^{-n(H(\hat Y)-\delta)}
$$
$$
< M_n \cdot 2^{n(H(\hat X,\hat Y)+\delta)}\, 2^{-n(H(\hat X)-\delta)}\, 2^{-n(H(\hat Y)-\delta)}
$$
$$
\le 2^{n(C-4\delta)} \cdot 2^{-n(I(\hat X;\hat Y)-3\delta)} = 2^{-n\delta},
$$

where the first inequality follows from the definition of the jointly typical set Fn (δ),
the second inequality holds by the Shannon–McMillan–Breiman theorem for pairs
(Theorem 4.9), the last inequality follows since C = I ( X̂ ; Ŷ ) by definition of X̂
and Ŷ , and since (1/n) log2 Mn ≤ C − (γ/2) = C − 4δ. Consequently,
 
$$
E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] \le P_{\hat X^n,\hat Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr) + 2^{-n\delta},
$$

which for sufficiently large n (and n > N0 ), can be made smaller than 2δ = γ/4 <
ε by the Shannon–McMillan–Breiman theorem for pairs.

Before proving the converse part of the channel coding theorem, let us recall Fano’s
inequality in a channel coding context. Consider an (n, Mn ) channel block code ∼Cn
with encoding and decoding functions given by

f n : {1, 2, . . . , Mn } → X n

and
gn : Y n → {1, 2, . . . , Mn },

respectively. Let message W , which is uniformly distributed over the set of messages
{1, 2, . . . , Mn }, be sent via codeword X n (W ) = f n (W ) over the DMC, and let Y n
be received at the channel output. At the receiver, the decoder estimates the sent
message via Ŵ = gn (Y n ) and the probability of estimation error is given by the
code’s average error probability:

Pr[W ≠ Ŵ] = P_e(∼Cn)

since W is uniformly distributed. Then, Fano’s inequality (2.5.2) yields



H (W |Y n ) ≤ 1 + Pe (∼Cn ) log2 (Mn − 1)


< 1 + Pe (∼Cn ) log2 Mn . (4.3.6)

We next proceed with the proof of the converse part.


Proof of the converse part: For any (n, Mn ) block channel code ∼Cn as described
above, we have that W → X n → Y n form a Markov chain; we thus obtain by the
data processing inequality that

I (W ; Y n ) ≤ I (X n ; Y n ). (4.3.7)

We can also upper bound I (X n ; Y n ) in terms of the channel capacity C as follows:

$$
I(X^n; Y^n) \le \max_{P_{X^n}} I(X^n; Y^n) \le \max_{P_{X^n}} \sum_{i=1}^n I(X_i; Y_i) \quad \text{(by Theorem 2.21)}
$$
$$
\le \sum_{i=1}^n \max_{P_{X^n}} I(X_i; Y_i) = \sum_{i=1}^n \max_{P_{X_i}} I(X_i; Y_i) = nC. \qquad (4.3.8)
$$

Consequently, code ∼Cn satisfies the following:

$$
\begin{aligned}
\log_2 M_n &= H(W) \quad \text{(since } W \text{ is uniformly distributed)}\\
&= H(W|Y^n) + I(W; Y^n)\\
&\le H(W|Y^n) + I(X^n; Y^n) \quad \text{(by (4.3.7))}\\
&\le H(W|Y^n) + nC \quad \text{(by (4.3.8))}\\
&< 1 + P_e(\mathcal{C}_n)\log_2 M_n + nC. \quad \text{(by (4.3.6))}
\end{aligned}
$$

This implies that

$$
P_e(\mathcal{C}_n) > 1 - \frac{C}{(1/n)\log_2 M_n} - \frac{1}{\log_2 M_n} = 1 - \frac{C + 1/n}{(1/n)\log_2 M_n}.
$$

So if lim inf_{n→∞} (1/n) log_2 M_n = C/(1 − μ), then for any 0 < ε < 1, there exists an integer N such that for n ≥ N,

$$
\frac{1}{n}\log_2 M_n \ge \frac{C + 1/n}{1 - (1-\varepsilon)\mu}, \qquad (4.3.9)
$$

[Figure: rate axis R with a threshold at the channel capacity C; for R < C, lim sup_{n→∞} P_e = 0 for the best channel block code, while for R > C, lim sup_{n→∞} P_e > 0 for all channel block codes.]

Fig. 4.5 Asymptotic channel coding rate R versus channel capacity C and behavior of the probability of error as blocklength n goes to infinity for a DMC

because, otherwise, (4.3.9) would be violated for infinitely many n, implying a con-
tradiction that
$$
\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n \le \liminf_{n\to\infty} \frac{C + 1/n}{1 - (1-\varepsilon)\mu} = \frac{C}{1 - (1-\varepsilon)\mu}.
$$

Hence, for n ≥ N,

$$
P_e(\mathcal{C}_n) > 1 - \frac{C + 1/n}{\dfrac{C + 1/n}{1 - (1-\varepsilon)\mu}} = 1 - \bigl[1 - (1-\varepsilon)\mu\bigr] = (1-\varepsilon)\mu > 0;
$$

i.e., Pe (∼Cn ) is bounded away from zero for n sufficiently large.


Observation 4.12 The results of the above channel coding theorem, which proves
that Cop = C, are illustrated in Fig. 4.5,8 where R = lim inf_{n→∞} (1/n) log_2 M_n (measured in message bits/channel use) is usually called the asymptotic coding rate of
channel block codes. As indicated in the figure, the asymptotic rate of any good block
code for the DMC must be smaller than or equal to the channel capacity C.9 Con-
versely, any block code with (asymptotic) rate greater than C, will have its probability
of error bounded away from zero.

Observation 4.13 (Zero error codes) In the converse part of Theorem 4.11, we
showed that
$$
\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 0 \;\Longrightarrow\; \liminf_{n\to\infty} \frac{1}{n}\log_2 M_n \le C. \qquad (4.3.10)
$$

8 Note that Theorem 4.11 actually implies that limn→∞ Pe = 0 for R < Cop = C and that
lim inf n→∞ Pe > 0 for R > Cop = C; these properties, however, might not hold for more general
channels than the DMC. For general channels, three partitions instead of two may result, i.e.,
R < Cop , Cop < R < C̄op and R > C̄op , which, respectively, correspond to lim supn→∞ Pe = 0
for the best block code, lim supn→∞ Pe > 0 but lim inf n→∞ Pe = 0 for the best block code, and
lim inf n→∞ Pe > 0 for all channel codes, where C̄op is called the channel’s optimistic operational
capacity [394, 396]. Since C̄op = Cop = C for DMCs, the three regions are reduced to two. A
formula for C̄op in terms of a generalized (spectral) mutual information rate is established in [75].
9 It can be seen from the theorem that C can be achieved as an asymptotic transmission rate as long

as (1/n) log2 Mn approaches C from below with increasing n (see (4.3.2)).



We next briefly examine the situation when we require that all (n, Mn ) codes
∼Cn are to be used with exactly no errors for any value of the blocklength n; i.e.,
Pe (∼Cn ) = 0 for every n. In this case, we readily obtain that H (W |Y n ) = 0, which
in turn implies (by invoking the data processing inequality) that for any n,

log2 Mn = H (W |Y n ) + I (W ; Y n )
= I (W ; Y n )
≤ I (X n ; Y n )
≤ nC.

Thus, we have proved that

$$
P_e(\mathcal{C}_n) = 0\ \forall n \;\Longrightarrow\; \limsup_{n\to\infty} \frac{1}{n}\log_2 M_n \le C,
$$

which is a stronger result than (4.3.10).

Shannon’s channel coding theorem, established in 1948 [340], provides the ulti-
mate limit for reliable communication over a noisy channel. However, it does not
provide an explicit efficient construction for good codes since searching for a good
code from the ensemble of randomly generated codes is prohibitively complex, as
its size grows double exponentially with blocklength (see Step 1 of the proof of
the forward part). It thus spurred the entire area of coding theory, which flourished
over the last several decades with the aim of constructing powerful error-correcting
codes operating close to the capacity limit. Particular advances were made for the
class of linear codes (also known as group codes) whose rich10 yet elegantly sim-
ple algebraic structures made them amenable for efficient practically implementable
encoding and decoding. Examples of such codes include Hamming, Golay, Bose–
Chaudhuri–Hocquenghem (BCH), Reed–Muller, Reed–Solomon and convolutional
codes. In 1993, the so-called Turbo codes were introduced by Berrou et al. [44,
45] and shown experimentally to perform close to the channel capacity limit for the
class of memoryless channels. Similar near-capacity achieving linear codes were
later established with the rediscovery of Gallager’s low-density parity-check codes
(LDPC) [133, 134, 251, 252]. A more recent breakthrough was the invention of
polar codes by Arikan in 2007, when he provided a deterministic construction of
codes that can provably achieve channel capacity [22, 23]; see the next section for
a brief illustrative example on polar codes for the BEC. Many of the above codes
are used with increased sophistication in today’s ubiquitous communication, infor-
mation and multimedia technologies. For detailed studies on channel coding theory,
see the following texts [50, 52, 208, 248, 254, 321, 407].

10 Indeed, there exist linear codes that can achieve the capacity of memoryless channels with additive

noise (e.g., see [87, p. 114]). Such channels include the BSC and the q-ary symmetric channel.

4.4 Example of Polar Codes for the BEC

As noted above, polar coding is a new channel coding method proposed by Arikan
[22, 23], which can provably achieve the capacity of any binary-input memoryless
channel Q whose capacity is realized by a uniform input distribution (e.g., quasi-
symmetric channels). The proof technique and code construction, which has low
encoding and decoding complexity, are purely based on information-theoretic con-
cepts. For simplicity, we focus solely on a channel Q given by the BEC with erasure
probability ε, which we denote as BEC(ε) for short.
The main idea behind polar codes is channel “polarization,” which transforms
many independent uses of BEC(ε), n uses to be precise (where n is the coding
blocklength),11 into extremal “polarized” channels; i.e., channels which are either
perfect (noiseless) or completely noisy. It is shown that as n → ∞, the number of
unpolarized channels converges to 0 and the fraction of perfect channels converges
to I (X ; Y ) = 1 − ε under a uniform input, which is the capacity of the BEC. A polar
code can then be naturally obtained by sending information bits directly through
those perfect channels and sending known bits (usually called frozen bits) through
the completely noisy channels.
We start with the simplest case of n = 2. The channel transformation depicted
in Fig. 4.6a is usually called the basic transformation. In this figure, we have two
independent uses of BEC(ε), namely, (X 1 , Y1 ) and (X 2 , Y2 ), where every bit has ε
chance of being erased. In other words, under uniformly distributed X 1 and X 2 , we
have
I (Q) := I (X 1 ; Y1 ) = I (X 2 ; Y2 ) = 1 − ε.

Now consider the following linear modulo-2 operation shown in Fig. 4.6:

X 1 = U1 ⊕ U2 ,
X 2 = U2 ,

where U1 and U2 represent uniformly distributed independent message bits. The


decoder performs successive cancellation decoding as follows. It first decodes U1
from the received (Y1 , Y2 ), and then decodes U2 based on (Y1 , Y2 ) and the previously
decoded U1 (assuming the decoding is done correctly). This will create two new
channels; namely, the “worse” channel Q− and the “better” channel Q+ given by

Q− : U1 → (Y1 , Y2 ),
Q+ : U2 → (Y1 , Y2 , U1 ),

11 Recall that in channel coding, a codeword of length n is typically sent by using the channel
n consecutive times (i.e., in series). But in polar coding, an equivalent method is applied, which
consists of using n identical and independent copies of the channel in parallel, with each channel
being utilized only once.

[Fig. 4.6 Basic transformation with n = 2. (a) Transformation for two independent uses of BEC(ε): via X1 = U1 ⊕ U2 and X2 = U2, the input U1 sees erasure probability 1 − (1 − ε)² and the input U2 sees ε². (b) Transformation for independent BEC(ε1) and BEC(ε2): the corresponding erasure probabilities are 1 − (1 − ε1)(1 − ε2) and ε1ε2.]

respectively (the names of these channels will be justified shortly). Note that correctly
receiving Y1 = X 1 alone is not enough for us to determine U1 , since U2 is a uniform
random variable that is independent of U1 . One really needs to have both Y1 = X 1 and
Y2 = X 2 for correctly decoding U1 . This observation implies that Q− is a BEC with
erasure probability12 ε− := 1 − (1 − ε)2 . Also, note that given U1 , either Y1 = X 1
or Y2 = X 2 is sufficient to determine U2 . This implies that Q+ is a BEC with erasure
probability ε+ := ε2 .
Overall, we have

$$
I(Q^+) + I(Q^-) = I(U_2; Y_1, Y_2, U_1) + I(U_1; Y_1, Y_2) = (1 - \varepsilon^2) + \bigl[1 - \bigl(1 - (1-\varepsilon)^2\bigr)\bigr] = 2(1-\varepsilon) = 2 I(Q), \qquad (4.4.1)
$$

and

12 More precisely, channel Q− has the same behavior as a BEC, and it can be exactly converted to a
BEC after relabeling its output pair (y1 , y2 ) as an equivalent three-valued symbol y1,2 as follows:


$$
y_{1,2} = \begin{cases} 0 & \text{if } (y_1, y_2) \in \{(0,0), (1,1)\},\\ E & \text{if } (y_1, y_2) \in \{(0,E), (1,E), (E,E), (E,0), (E,1)\},\\ 1 & \text{if } (y_1, y_2) \in \{(0,1), (1,0)\}. \end{cases}
$$

A similar conversion can be applied to channel Q+ .



$$
(1-\varepsilon)^2 = I(Q^-) \le I(Q) = 1 - \varepsilon \le I(Q^+) = 1 - \varepsilon^2, \qquad (4.4.2)
$$

with equality iff ε(1 − ε) = 0 (i.e., ε = 0 or ε = 1). Equation (4.4.1) shows that
the basic transformation does not incur any loss in mutual information. Furthermore,
(4.4.2) indeed confirms that Q+ and Q− are, respectively, better and worse than Q.13
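
The conservation identity (4.4.1) and the ordering (4.4.2) are easy to verify numerically; the following quick sketch (ours, not from the text) does so for a few values of ε.

```python
# Quick numerical check of (4.4.1) and (4.4.2) for the basic transformation.
def basic_transform(eps):
    """Erasure probabilities (eps_minus, eps_plus) of Q- and Q+ built from two BEC(eps)."""
    return 1 - (1 - eps) ** 2, eps ** 2

for eps in (0.1, 0.5, 0.9):
    e_minus, e_plus = basic_transform(eps)
    I_minus, I, I_plus = 1 - e_minus, 1 - eps, 1 - e_plus
    assert abs((I_minus + I_plus) - 2 * I) < 1e-12   # (4.4.1): mutual information is preserved
    assert I_minus <= I <= I_plus                    # (4.4.2): Q- is worse, Q+ is better
```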
So far, we have talked about how to use the basic transformation to generate a
better channel Q+ and a worse channel Q− from two independent uses of Q =
BEC(ε). Now, let us consider the case of n = 4 and suppose we perform the basic
transformation twice to send (i.i.d. uniform) message bits (U1 , U2 , U3 , U4 ), yielding

Q− : V1 → (Y1 , Y2 ), where X 1 = V1 ⊕ V2 ,
Q+ : V2 → (Y1 , Y2 , V1 ), where X 2 = V2 ,
Q− : V3 → (Y3 , Y4 ), where X 3 = V3 ⊕ V4 ,
Q+ : V4 → (Y3 , Y4 , V3 ), where X 4 = V4 ,

where V1 = U1 ⊕U2 , V3 = U2 , V2 = U3 ⊕U4 and V4 = U4 . Since both Q− channels


have the same erasure probability ε− = 1 − (1 − ε)2 , and since both Q+ channels
have the same erasure probability ε+ = ε2 , we can now take two Q− channels and
perform the basic transformation again to generate two new channels: Q−− : U1 →
(Y1 , Y2 , Y3 , Y4 ) with erasure probability ε−− := 1 − (1 − ε− )2 and Q−+ : U2 →
(Y1 , Y2 , Y3 , Y4 , U1 ) with erasure probability ε−+ := (ε− )2 . Similarly, we can use two
Q+ channels to form Q+− : U3 → (Y1 , Y2 , Y3 , Y4 , U1 , U2 ) with erasure probability
ε+− := 1 − (1 − ε+ )2 and Q++ : U4 → (Y1 , Y2 , Y3 , Y4 , U1 , U3 , U2 ) with erasure
probability ε++ := (ε+ )2 .
The key attribute of this technique is that we do not have to stop here; in fact,
we can keep exploiting this property until all the channels eventually become either
very good (i.e., perfect) or very bad (i.e., completely noisy). In polar coding termi-
nology, the process of using multiple basic transformations to get X 1 , . . . , X n from
U1 , . . . , Un (where the Ui ’s are i.i.d. uniform message random variables) is called
channel “combining” and that of using Y1 , . . . , Yn and U1 , . . . , Ui−1 to obtain Ui for
i ∈ {1, . . . , n} is called channel “splitting.” Altogether, the phenomenon is called
channel “polarization.”
For constructing a polar code with blocklength n = 2m and 2k codewords (i.e.,
with each binary message word having length k), one can perform m stages of
channel polarization and transmit uncoded k message bits via the k positions with

13 The same reasoning can be applied to form the basic transformation for two independent but

not identically distributed BECs as shown in Fig. 4.6b, where Q+ and Q− become BEC(ε1 ε2 )
and BEC(1 − (1 − ε1 )(1 − ε2 )), respectively. This extension may be useful when combining n
independent uses of a channel in a multistage manner (in particular, when the two channels to be
combined may become non-identically distributed after the second stage). In Example 4.14, only
identically distributed BECs will be combined at each stage, which is a typical design for polar
coding.

largest mutual informations. The other n − k positions are stuffed with frozen bits;
this encoding process is precisely channel combining. The decoder successively
decodes Ui , i ∈ {1, . . . , n}, based on (Y1 , . . . , Yn ) and the previously decoded Û j ,
j ∈ {1, . . . , i − 1}. This decoder is called a successive cancellation decoder and
mimics the behavior of channel splitting in the process of channel polarization.

Example 4.14 Consider a BEC with erasure probability ε = 0.5 and let n = 8. The
channel polarization process for this example is shown in Fig. 4.7. Note that since
the mutual information of a BEC(ε) under a uniform input is simply 1 − ε; one can
equivalently keep tracking the erasure probabilities as we have shown in parentheses
in Fig. 4.7. Now, suppose we would like to construct an (8, 4) polar code; we then pick the four positions with the largest mutual informations (i.e., smallest erasure probabilities). That is, we pick (U4, U6, U7, U8) to send uncoded bits, and the other positions are frozen.
As an example of the computation of the erasure probabilities, 0.5625 for T2 is obtained from 0.75 × 0.75, which are the numbers above V1 and V3, and combining the two channels with erasure probability 0.9375 (T1 and T5) produces 1 − (1 − 0.9375)(1 − 0.9375) ≈ 0.9961, which is the number above U1.
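
The erasure probabilities of Fig. 4.7 can be generated by simply iterating the two update rules ε ↦ 1 − (1 − ε)² and ε ↦ ε². The sketch below (ours, not from the text) does this; note that the ordering of the output list follows the recursion rather than the U_i labelling of the figure, but the multiset of values is the same.

```python
# Sketch: track the erasure probabilities of the 2^m synthesized channels for BEC(eps).
def polarize(eps, stages):
    probs = [eps]
    for _ in range(stages):
        probs = [p for e in probs for p in (1 - (1 - e) ** 2, e ** 2)]
    return probs

probs = polarize(0.5, 3)                 # n = 8 synthesized channels
print([round(p, 4) for p in probs])
# -> [0.9961, 0.8789, 0.8086, 0.3164, 0.6836, 0.1914, 0.1211, 0.0039]
# For an (8, 4) polar code, information bits go on the four channels with the
# smallest erasure probabilities (0.0039, 0.1211, 0.1914, 0.3164), matching the
# choice (U4, U6, U7, U8) of Example 4.14; the remaining positions are frozen.
```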

Ever since their invention by Arikan [22, 23], polar codes have generated exten-
sive interest; see [25, 26, 226, 228, 329, 371, 372] and the references therein and
thereafter. A key reason for their prevalence is that they form the first coding scheme
that has an explicit low-complexity construction structure while being capable of
achieving channel capacity as code length approaches infinity. More importantly,
polar codes do not exhibit the error floor behavior, which Turbo and (to a lesser
extent) LDPC codes are prone to. In practice, since one cannot have infinitely many
stages of polarization, there will always exist unpolarized channels. The develop-
ment of effective construction and decoding methods for polar codes with practical
blocklengths is an active area of research. Due to their attractive properties, polar
codes were adopted in 2016 by the 3rd Generation Partnership Project (3GPP) as
error-correcting codes for the control channel of the 5th generation (5G) mobile
communication standard [99].
We conclude by noting that the notion of polarization is not unique to channel cod-
ing; it can also be applied to source coding and other information-theoretic problems
including secrecy and multiuser systems (e.g., cf. [24, 148, 226, 227, 256]).

4.5 Calculating Channel Capacity

Given a DMC with finite input alphabet X , finite output alphabet Y and channel
transition matrix Q = [ px,y ] of size |X | × |Y|, where px,y := PY |X (y|x), for x ∈ X
and y ∈ Y, we would like to calculate

$$
C := \max_{P_X} I(X; Y)
$$

[Figure: channel polarization tree for eight uses of BEC(0.5), with the erasure probability of each node shown in parentheses. U1: 0.9961, U5: 0.6836, U3: 0.8086, U7: 0.1211, U2: 0.8789, U6: 0.1914, U4: 0.3164, U8: 0.0039; T1, T5: 0.9375, T2, T6: 0.5625, T3, T7: 0.4375, T4, T8: 0.0625; V1, V3, V5, V7: 0.75, V2, V4, V6, V8: 0.25; X1, ..., X8 each see BEC(0.5).]

Fig. 4.7 Channel polarization for Q = BEC(0.5) with n = 8

where the maximization (which is well-defined) is carried over the set of input dis-
tributions PX , and I (X ; Y ) is the mutual information between the channel’s input
and output.
Note that C can be determined numerically via nonlinear optimization techniques, such as the iterative algorithms developed by Arimoto [27] and Blahut [49, 51]; see also [88] and [415, Chap. 9]. In general, there are no closed-form (single-letter)
analytical expressions for C. However, for many “simplified” channels, it is possible
to analytically determine C under some “symmetry” properties of their channel
transition matrix.
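
For concreteness, the following is a minimal sketch (ours, not the reference implementations of [27, 49, 51]) of the alternating-maximization (Arimoto–Blahut) iteration; the function name, tolerance and test channels are illustrative assumptions.

```python
# Hypothetical sketch of the Arimoto-Blahut iteration for a DMC with
# transition matrix Q[x][y] = P_{Y|X}(y|x); returns a capacity estimate in bits.
import numpy as np

def blahut_arimoto(Q, tol=1e-9, max_iter=10000):
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])     # start from the uniform input
    for _ in range(max_iter):
        q = p @ Q                                  # output distribution P_Y
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(Q > 0, np.log(Q / q), 0.0)
        c = np.exp(np.sum(Q * log_ratio, axis=1))  # c(x) = exp D(P_{Y|X=x} || P_Y)
        lower, upper = np.log(p @ c), np.log(c.max())   # bounds on C (in nats)
        p = p * c / (p @ c)                        # multiplicative update of P_X
        if upper - lower < tol:
            break
    return lower / np.log(2)

# Sanity checks against closed forms derived later in this section:
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))             # BSC(0.1): 1 - h_b(0.1)
print(blahut_arimoto([[0.8, 0.0, 0.2], [0.0, 0.8, 0.2]]))   # BEC(0.2): 1 - 0.2 = 0.8
```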

4.5.1 Symmetric, Weakly Symmetric, and Quasi-symmetric Channels

Definition 4.15 A DMC with finite input alphabet X , finite output alphabet Y and
channel transition matrix Q = [ px,y ] of size |X | × |Y| is said to be symmetric if the
rows of Q are permutations of each other and the columns of Q are permutations
of each other. The channel is said to be weakly symmetric if the rows of Q are
permutations of each other and all the column sums in Q are equal.

It directly follows from the definition that symmetry implies weak-symmetry.


Examples of symmetric DMCs include the BSC, the q-ary symmetric channel and
the following ternary channel with X = Y = {0, 1, 2} and transition matrix:
$$
Q = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) & P_{Y|X}(2|0)\\ P_{Y|X}(0|1) & P_{Y|X}(1|1) & P_{Y|X}(2|1)\\ P_{Y|X}(0|2) & P_{Y|X}(1|2) & P_{Y|X}(2|2) \end{bmatrix} = \begin{bmatrix} 0.4 & 0.1 & 0.5\\ 0.5 & 0.4 & 0.1\\ 0.1 & 0.5 & 0.4 \end{bmatrix}.
$$

The following DMC with |X | = |Y| = 4 and


$$
Q = \begin{bmatrix} 0.5 & 0.25 & 0.25 & 0\\ 0.5 & 0.25 & 0.25 & 0\\ 0 & 0.25 & 0.25 & 0.5\\ 0 & 0.25 & 0.25 & 0.5 \end{bmatrix} \qquad (4.5.1)
$$

is weakly symmetric (but not symmetric). Noting that all above channels involve
square transition matrices, we emphasize that Q can be rectangular while satisfying
the symmetry or weak-symmetry properties. For example, the DMC with |X | = 2,
|Y| = 4 and
$$
Q = \begin{bmatrix} \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2}\\[1mm] \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} \end{bmatrix} \qquad (4.5.2)
$$

is symmetric (where ε ∈ [0, 1]), while the DMC with |X | = 2, |Y| = 3 and
$$
Q = \begin{bmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{2}\\[1mm] \frac{1}{3} & \frac{1}{2} & \frac{1}{6} \end{bmatrix}
$$

is weakly symmetric.

Lemma 4.16 The capacity of a weakly symmetric channel Q is achieved by a uni-


form input distribution and is given by

C = log2 |Y| − H (q1 , q2 , . . . , q|Y| ) (4.5.3)

where (q1 , q2 , . . . , q|Y| ) denotes any row of Q and



$$
H(q_1, q_2, \ldots, q_{|\mathcal{Y}|}) := -\sum_{i=1}^{|\mathcal{Y}|} q_i \log_2 q_i
$$

is the row entropy.

Proof The mutual information between the channel’s input and output is given by

$$
I(X; Y) = H(Y) - H(Y|X) = H(Y) - \sum_{x\in\mathcal{X}} P_X(x)\, H(Y|X = x)
$$

where H(Y|X = x) = −Σ_{y∈Y} P_{Y|X}(y|x) log_2 P_{Y|X}(y|x) = −Σ_{y∈Y} p_{x,y} log_2 p_{x,y}.
Noting that every row of Q is a permutation of every other row, we obtain that
H (Y |X = x) is independent of x and can be written as

H (Y |X = x) = H (q1 , q2 , . . . , q|Y| ),

where (q1 , q2 , . . . , q|Y| ) is any row of Q. Thus



$$
H(Y|X) = \sum_{x\in\mathcal{X}} P_X(x)\, H(q_1, q_2, \ldots, q_{|\mathcal{Y}|}) = H(q_1, q_2, \ldots, q_{|\mathcal{Y}|}) \sum_{x\in\mathcal{X}} P_X(x) = H(q_1, q_2, \ldots, q_{|\mathcal{Y}|}).
$$

This implies

I (X ; Y ) = H (Y ) − H (q1 , q2 , . . . , q|Y| )
≤ log2 |Y| − H (q1 , q2 , . . . , q|Y| ),

with equality achieved iff Y is uniformly distributed over Y. We next show that choosing a uniform input distribution, P_X(x) = 1/|X| ∀x ∈ X, yields a uniform output distribution, hence maximizing mutual information. Indeed, under a uniform input distribution, we obtain that for any y ∈ Y,

$$
P_Y(y) = \sum_{x\in\mathcal{X}} P_X(x)\, P_{Y|X}(y|x) = \frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}} p_{x,y} = \frac{A}{|\mathcal{X}|}
$$

where A := Σ_{x∈X} p_{x,y} is a constant given by the sum of the entries in any column of Q, since by the weak-symmetry property all column sums in Q are identical. Note that Σ_{y∈Y} P_Y(y) = 1 yields that

$$
\sum_{y\in\mathcal{Y}} \frac{A}{|\mathcal{X}|} = 1
$$

and hence

$$
A = \frac{|\mathcal{X}|}{|\mathcal{Y}|}. \qquad (4.5.4)
$$

Accordingly,

$$
P_Y(y) = \frac{A}{|\mathcal{X}|} = \frac{|\mathcal{X}|}{|\mathcal{Y}|}\cdot\frac{1}{|\mathcal{X}|} = \frac{1}{|\mathcal{Y}|}
$$

for any y ∈ Y; thus the uniform input distribution induces a uniform output distri-
bution and achieves channel capacity as given by (4.5.3).

Observation 4.17 Note that if the weakly symmetric channel has a square (i.e.,
with |X | = |Y|) transition matrix Q, then Q is a doubly stochastic matrix; i.e., both
its row sums and its column sums are equal to 1. Note, however, that having a square
transition matrix does not necessarily make a weakly symmetric channel symmetric;
e.g., see (4.5.1).

Example 4.18 (Capacity of the BSC) Since the BSC with crossover probability (or
bit error rate) ε is symmetric, we directly obtain from Lemma 4.16 that its capacity
is achieved by a uniform input distribution and is given by

C = log2 (2) − H (1 − ε, ε) = 1 − h b (ε), (4.5.5)

where h b (·) is the binary entropy function.

Example 4.19 (Capacity of the q-ary symmetric channel) Similarly, the q-ary sym-
metric channel with symbol error rate ε described in (4.2.11) is symmetric; hence,
by Lemma 4.16, its capacity is given by

$$
C = \log_2 q - H\!\left(1-\varepsilon, \frac{\varepsilon}{q-1}, \ldots, \frac{\varepsilon}{q-1}\right) = \log_2 q + \varepsilon\log_2\frac{\varepsilon}{q-1} + (1-\varepsilon)\log_2(1-\varepsilon).
$$

Note that when q = 2, the channel capacity is equal to that of the BSC, as expected.
Furthermore, when ε = 0, the channel reduces to the identity (noiseless) q-ary
channel and its capacity is given by C = log2 q.
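
The closed form (4.5.3) is straightforward to evaluate; the short sketch below (ours, with illustrative parameter values) checks it against the BSC and q-ary symmetric channel expressions of Examples 4.18 and 4.19.

```python
# Sketch: capacity of a weakly symmetric channel via C = log2|Y| - H(any row of Q).
import math

def entropy(row):
    return -sum(p * math.log2(p) for p in row if p > 0)

def weakly_symmetric_capacity(Q):
    """Valid when Q is weakly symmetric (Lemma 4.16)."""
    return math.log2(len(Q[0])) - entropy(Q[0])

eps, q = 0.2, 4
bsc = [[1 - eps, eps], [eps, 1 - eps]]
qsc = [[1 - eps if x == y else eps / (q - 1) for y in range(q)] for x in range(q)]
assert abs(weakly_symmetric_capacity(bsc) - (1 - entropy([eps, 1 - eps]))) < 1e-12      # (4.5.5)
assert abs(weakly_symmetric_capacity(qsc) - (math.log2(q) + eps * math.log2(eps / (q - 1))
                                             + (1 - eps) * math.log2(1 - eps))) < 1e-12
```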

We next note that one can further weaken the weak-symmetry property and define
a class of “quasi-symmetric” channels for which the uniform input distribution still
achieves capacity and yields a simple closed-form formula for capacity.

Definition 4.20 A DMC with finite input alphabet X , finite output alphabet Y and
channel transition matrix Q = [ px,y ] of size |X |×|Y| is said to be quasi-symmetric14
if Q can be partitioned along its columns into m weakly symmetric sub-matrices
Q1 , Q2 , . . . , Qm for some integer m ≥ 1, where each Qi sub-matrix has size |X | ×
|Y_i| for i = 1, 2, ..., m with Y_1 ∪ ··· ∪ Y_m = Y and Y_i ∩ Y_j = ∅ ∀ i ≠ j, i, j = 1, 2, ..., m.
Hence, quasi-symmetry is our weakest symmetry notion, since a weakly symmet-
ric channel is clearly quasi-symmetric (just set m = 1 in the above definition); we
thus have symmetry =⇒ weak-symmetry =⇒ quasi-symmetry.
Lemma 4.21 The capacity of a quasi-symmetric channel Q as defined above is
achieved by a uniform input distribution and is given by


$$
C = \sum_{i=1}^m a_i C_i, \qquad (4.5.6)
$$

where

$$
a_i := \sum_{y\in\mathcal{Y}_i} p_{x,y} = \text{sum of any row in } Q_i, \quad i = 1, \ldots, m,
$$

and

$$
C_i = \log_2 |\mathcal{Y}_i| - H\bigl(\text{any row in the matrix } \tfrac{1}{a_i} Q_i\bigr), \quad i = 1, \ldots, m
$$

is the capacity of the ith weakly symmetric "sub-channel" whose transition matrix is obtained by multiplying each entry of Q_i by 1/a_i (this normalization renders sub-matrix Q_i into a stochastic matrix and hence a channel transition matrix).
Proof We first observe that for each i = 1, . . . , m, ai is independent of the input
value x, since sub-matrix i is weakly symmetric (so any row in Qi is a permutation
of any other row), and hence, ai is the sum of any row in Qi .
For each i = 1, . . . , m, define
$$
P_{Y_i|X}(y|x) := \begin{cases} \dfrac{p_{x,y}}{a_i} & \text{if } y\in\mathcal{Y}_i \text{ and } x\in\mathcal{X};\\[1mm] 0 & \text{otherwise}, \end{cases}
$$

where Y_i is a random variable taking values in Y_i. It can be easily verified that P_{Y_i|X}(y|x) is a legitimate conditional distribution. Thus, [P_{Y_i|X}(y|x)] = (1/a_i)Q_i is the transition matrix of the weakly symmetric "sub-channel" i with input alphabet X and output alphabet Y_i. Let I(X; Y_i) denote its mutual information. Since each such sub-channel i is weakly symmetric, we know that its capacity C_i is given by

$$
C_i = \max_{P_X} I(X; Y_i) = \log_2 |\mathcal{Y}_i| - H\bigl(\text{any row in the matrix } \tfrac{1}{a_i} Q_i\bigr),
$$

14 This notion of “quasi-symmetry” is slightly more general than Gallager’s notion [135, p. 94], as

we herein allow each sub-matrix to be weakly symmetric (instead of symmetric as in [135]).



where the maximum is achieved by a uniform input distribution.


Now, the mutual information between the input and the output of our original
quasi-symmetric channel Q can be written as
$$
I(X; Y) = \sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_X(x)\, p_{x,y} \log_2 \frac{p_{x,y}}{\sum_{x'\in\mathcal{X}} P_X(x')\, p_{x',y}}
$$
$$
= \sum_{i=1}^m a_i \sum_{y\in\mathcal{Y}_i}\sum_{x\in\mathcal{X}} P_X(x)\,\frac{p_{x,y}}{a_i}\, \log_2 \frac{p_{x,y}/a_i}{\sum_{x'\in\mathcal{X}} P_X(x')\, p_{x',y}/a_i}
$$
$$
= \sum_{i=1}^m a_i \sum_{y\in\mathcal{Y}_i}\sum_{x\in\mathcal{X}} P_X(x)\, P_{Y_i|X}(y|x) \log_2 \frac{P_{Y_i|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')\, P_{Y_i|X}(y|x')}
$$
$$
= \sum_{i=1}^m a_i\, I(X; Y_i).
$$

Therefore, the capacity of channel Q is

$$
C = \max_{P_X} I(X; Y) = \max_{P_X} \sum_{i=1}^m a_i I(X; Y_i) = \sum_{i=1}^m a_i \max_{P_X} I(X; Y_i) \quad \text{(as the same uniform } P_X \text{ maximizes each } I(X; Y_i))
$$
$$
= \sum_{i=1}^m a_i C_i.
$$

Example 4.22 (Capacity of the BEC) The BEC with erasure probability α as given in
(4.2.6) is quasi-symmetric (but neither weakly symmetric nor symmetric). Indeed, its
transition matrix Q can be partitioned along its columns into two symmetric (hence
weakly symmetric) sub-matrices

$$
Q_1 = \begin{bmatrix} 1-\alpha & 0\\ 0 & 1-\alpha \end{bmatrix} \quad \text{and} \quad Q_2 = \begin{bmatrix} \alpha\\ \alpha \end{bmatrix}.
$$

Thus, applying the capacity formula for quasi-symmetric channels of Lemma 4.21
yields that the capacity of the BEC is given by

C = a1 C 1 + a2 C 2 ,

where a1 = 1 − α, a2 = α,
$$
C_1 = \log_2(2) - H\!\left(\frac{1-\alpha}{1-\alpha}, \frac{0}{1-\alpha}\right) = 1 - H(1, 0) = 1 - 0 = 1,
$$

and

$$
C_2 = \log_2(1) - H\!\left(\frac{\alpha}{\alpha}\right) = 0 - 0 = 0.
$$
Therefore, the BEC capacity is given by

C = (1 − α)(1) + (α)(0) = 1 − α. (4.5.7)

Example 4.23 (Capacity of the BSEC) Similarly, the BSEC with crossover probabil-
ity ε and erasure probability α as described in (4.2.8) is quasi-symmetric; its transition
matrix can be partitioned along its columns into two symmetric sub-matrices

$$
Q_1 = \begin{bmatrix} 1-\varepsilon-\alpha & \varepsilon\\ \varepsilon & 1-\varepsilon-\alpha \end{bmatrix} \quad \text{and} \quad Q_2 = \begin{bmatrix} \alpha\\ \alpha \end{bmatrix}.
$$

Hence, by Lemma 4.21, the channel capacity is given by C = a1 C1 + a2 C2 where


a_1 = 1 − α, a_2 = α,

$$
C_1 = \log_2(2) - H\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}, \frac{\varepsilon}{1-\alpha}\right) = 1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right),
$$

and

$$
C_2 = \log_2(1) - H\!\left(\frac{\alpha}{\alpha}\right) = 0.
$$

We thus obtain that

$$
C = (1-\alpha)\left[1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right)\right] + (\alpha)(0) = (1-\alpha)\left[1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right)\right]. \qquad (4.5.8)
$$

As already noted, the BSEC is a combination of the BSC with bit error rate ε and
the BEC with erasure probability α. Indeed, setting α = 0 in (4.5.8) yields that
C = 1 − h b (1 − ε) = 1 − h b (ε) which is the BSC capacity. Furthermore, setting
ε = 0 results in C = 1 − α, the BEC capacity.
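
Formula (4.5.6) is also easy to apply programmatically once the column partition into weakly symmetric sub-matrices is specified. The sketch below (ours) reuses the helper functions from the previous sketch and checks the BEC and BSEC capacities (4.5.7) and (4.5.8) for illustrative parameter values; the partition is supplied by hand.

```python
# Sketch: capacity of a quasi-symmetric channel via Lemma 4.21, given a column partition.
def quasi_symmetric_capacity(Q, partition):
    """partition: list of column-index lists, one per weakly symmetric sub-matrix Q_i."""
    C = 0.0
    for cols in partition:
        sub = [[row[y] for y in cols] for row in Q]
        a = sum(sub[0])                           # a_i = sum of any row of Q_i
        C += a * weakly_symmetric_capacity([[p / a for p in row] for row in sub])
    return C

eps, alpha = 0.1, 0.2
bec = [[1 - alpha, 0, alpha], [0, 1 - alpha, alpha]]
bsec = [[1 - eps - alpha, eps, alpha], [eps, 1 - eps - alpha, alpha]]
print(quasi_symmetric_capacity(bec, [[0, 1], [2]]))    # -> 0.8 = 1 - alpha, cf. (4.5.7)
print(quasi_symmetric_capacity(bsec, [[0, 1], [2]]))   # -> (1 - alpha)(1 - h_b(0.875)), cf. (4.5.8)
```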

4.5.2 Karush–Kuhn–Tucker Conditions for Channel Capacity

When the channel does not satisfy any symmetry property, the following necessary
and sufficient Karush–Kuhn–Tucker (KKT) conditions (e.g., cf. Appendix B.8, [135,
pp. 87–91] or [46, 56]) for calculating channel capacity can be quite useful.
Definition 4.24 (Mutual information for a specific input symbol) The mutual infor-
mation for a specific input symbol is defined as
$$
I(x; Y) := \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{P_Y(y)}.
$$

From the above definition, the mutual information becomes


$$
I(X; Y) = \sum_{x\in\mathcal{X}} P_X(x) \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{P_Y(y)} = \sum_{x\in\mathcal{X}} P_X(x)\, I(x; Y).
$$

Lemma 4.25 (KKT conditions for channel capacity) For a given DMC, an input
distribution PX achieves its channel capacity iff there exists a constant C such that

$$
I(x; Y) = C \quad \forall x\in\mathcal{X} \text{ with } P_X(x) > 0; \qquad I(x; Y) \le C \quad \forall x\in\mathcal{X} \text{ with } P_X(x) = 0. \qquad (4.5.9)
$$

Furthermore, the constant C is the channel capacity (justifying the choice of nota-
tion).
Proof The forward (if) part holds directly; hence, we only prove the converse (only-
if) part. Without loss of generality, we assume that PX (x) < 1 for all x ∈ X , since
PX (x) = 1 for some x implies that I (X ; Y ) = 0. The problem of calculating the
channel capacity is to maximize
$$
I(X; Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)\, P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')\, P_{Y|X}(y|x')}, \qquad (4.5.10)
$$

subject to the condition

$$
\sum_{x\in\mathcal{X}} P_X(x) = 1 \qquad (4.5.11)
$$

for a given channel distribution P_{Y|X}. By using the Lagrange multipliers method (e.g., see Appendix B.8 or [46]), maximizing (4.5.10) subject to (4.5.11) is equivalent to maximizing:
 
$$
f(P_X) := \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)\, P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')\, P_{Y|X}(y|x')} + \lambda\left(\sum_{x\in\mathcal{X}} P_X(x) - 1\right).
$$

We then take the derivative of the above quantity with respect to P_X(x'), and obtain that15

$$
\frac{\partial f(P_X)}{\partial P_X(x')} = I(x'; Y) - \log_2(e) + \lambda.
$$

By Property 2 of Lemma 2.46, I (X ; Y ) = I (PX , PY |X ) is a concave function in


PX (for a fixed PY |X ). Therefore, the maximum of I (PX , PY |X ) occurs for a zero
derivative when PX (x) does not lie on the boundary, namely, 1 > PX (x) > 0. For
those PX (x) lying on the boundary, i.e., PX (x) = 0, the maximum occurs iff a
displacement from the boundary to the interior decreases the quantity, which implies
a nonpositive derivative, namely,

I (x; Y ) ≤ −λ + log2 (e), for those x with PX (x) = 0.

To summarize, if an input distribution PX achieves the channel capacity, then



I (x  ; Y ) = −λ + log2 (e), for PX (x  ) > 0;
I (x  ; Y ) ≤ −λ + log2 (e), for PX (x  ) = 0

15 The details for taking the derivative are as follows:

$$
\frac{\partial}{\partial P_X(x')}\left\{\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x)\log_2 P_{Y|X}(y|x) - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x)\log_2\left[\sum_{x''\in\mathcal{X}} P_X(x'') P_{Y|X}(y|x'')\right] + \lambda\left(\sum_{x\in\mathcal{X}} P_X(x) - 1\right)\right\}
$$
$$
= \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x')\log_2 P_{Y|X}(y|x') - \left(\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x')\log_2\left[\sum_{x''\in\mathcal{X}} P_X(x'') P_{Y|X}(y|x'')\right] + \log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x)\frac{P_{Y|X}(y|x')}{\sum_{x''\in\mathcal{X}} P_X(x'') P_{Y|X}(y|x'')}\right) + \lambda
$$
$$
= I(x'; Y) - \log_2(e)\sum_{y\in\mathcal{Y}}\left[\sum_{x\in\mathcal{X}} P_X(x) P_{Y|X}(y|x)\right]\frac{P_{Y|X}(y|x')}{\sum_{x''\in\mathcal{X}} P_X(x'') P_{Y|X}(y|x'')} + \lambda
$$
$$
= I(x'; Y) - \log_2(e)\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x') + \lambda
$$
$$
= I(x'; Y) - \log_2(e) + \lambda.
$$

for some λ. With the above result, setting C = −λ + log2 (e) yields (4.5.9). Finally,
multiplying both sides of each equation in (4.5.9) by P_X(x) and summing over x yields max_{P_X} I(X; Y) on the left and the constant C on the right, thus proving that the constant C is indeed the channel's capacity.

Example 4.26 (Quasi-symmetric channels) For a quasi-symmetric channel, one can
directly verify that the uniform input distribution satisfies the KKT conditions of
Lemma 4.25 and yields that the channel capacity is given by (4.5.6); this is left as an
exercise. As we already saw, the BSC, the q-ary symmetric channel, the BEC and
the BSEC are all quasi-symmetric.
Example 4.27 Consider a DMC with a ternary input alphabet X = {0, 1, 2}, binary
output alphabet Y = {0, 1} and the following transition matrix:
$$
Q = \begin{bmatrix} 1 & 0\\ \tfrac{1}{2} & \tfrac{1}{2}\\ 0 & 1 \end{bmatrix}.
$$

This channel is not quasi-symmetric. However, one may guess that the capacity of
this channel is achieved by the input distribution (P_X(0), P_X(1), P_X(2)) = (1/2, 0, 1/2)
since the input x = 1 has an equal conditional probability of being received as 0
or 1 at the output. Under this input distribution, we obtain that I (x = 0; Y ) =
I (x = 2; Y ) = 1 and that I (x = 1; Y ) = 0. Thus, the KKT conditions of (4.5.9)
are satisfied; hence confirming that the above input distribution achieves channel
capacity and that channel capacity is equal to 1 bit.
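
The verification claimed in Example 4.27 amounts to evaluating I(x; Y) of Definition 4.24 under the candidate input distribution; the following quick sketch (ours, not from the text) does exactly that.

```python
# Sketch: check the KKT conditions (4.5.9) for the channel of Example 4.27.
import math

def I_x(Q, px, x):
    """Mutual information I(x; Y) for a specific input symbol x (Definition 4.24)."""
    py = [sum(px[a] * Q[a][y] for a in range(len(Q))) for y in range(len(Q[0]))]
    return sum(Q[x][y] * math.log2(Q[x][y] / py[y])
               for y in range(len(Q[0])) if Q[x][y] > 0)

Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
px = [0.5, 0.0, 0.5]
print([round(I_x(Q, px, x), 6) for x in range(3)])   # -> [1.0, 0.0, 1.0]
# I(x;Y) = 1 for the two inputs with px > 0 and I(x;Y) = 0 <= 1 for x = 1,
# so (4.5.9) holds with C = 1 bit.
```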
Observation 4.28 (Capacity achieved by a uniform input distribution) We close
this section by noting that there is a class of DMCs that is larger than that of quasi-
symmetric channels for which the uniform input distribution achieves capacity. It
concerns the class of so-called “T -symmetric” channels [319, Sect. 5, Definition 1]
for which
$$
T(x) := I(x; Y) - \log_2|\mathcal{X}| = \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_{Y|X}(y|x')}
$$

is a constant function of x (i.e., independent of x), where I (x; Y ) is the mutual


information for input x under a uniform input distribution. Indeed, the T -symmetry
condition is equivalent to the property of having the uniform input distribution achieve
capacity. This directly follows from the KKT conditions of Lemma 4.25. An example
of a T -symmetric channel that is not quasi-symmetric is the binary-input ternary-
output channel with the following transition matrix:
$$
Q = \begin{bmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\[1mm] \frac{1}{6} & \frac{1}{6} & \frac{2}{3} \end{bmatrix}.
$$

Hence, its capacity is achieved by the uniform input distribution. See [319, Fig. 2]
for (infinitely many) other examples of T -symmetric channels. However, unlike

quasi-symmetric channels, T -symmetric channels do not admit in general a simple


closed-form expression for their capacity [such as the one given in (4.5.6)].
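
T-symmetry can be checked directly from the transition matrix; the brief sketch below (ours, not from the text) confirms that T(x) is constant for the binary-input ternary-output channel above.

```python
# Sketch: verify T-symmetry of the channel above by computing T(x) for each input.
import math

def T(Q, x):
    col_sums = [sum(row[y] for row in Q) for y in range(len(Q[0]))]
    return sum(Q[x][y] * math.log2(Q[x][y] / col_sums[y])
               for y in range(len(Q[0])) if Q[x][y] > 0)

Q = [[1/3, 1/3, 1/3], [1/6, 1/6, 2/3]]
print([round(T(Q, x), 6) for x in range(2)])   # both entries are equal (about -0.918296)
```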

4.6 Lossless Joint Source-Channel Coding and Shannon's Separation Principle

We next establish Shannon’s lossless joint source-channel coding theorem16 which


provides explicit (and directly verifiable) conditions for any communication system
in terms of its source and channel information-theoretic quantities under which the
source can be reliably transmitted (i.e., with asymptotically vanishing error proba-
bility). More specifically, this theorem consists of two parts: (i) a forward part which
reveals that if the minimal achievable compression (or source coding) rate of a source
is strictly smaller than the capacity of a channel, then the source can be reliably sent
over the channel via rate-one source-channel block codes; (ii) a converse part which
states that if the source’s minimal achievable compression rate is strictly larger than
the channel capacity, then the source cannot be reliably sent over the channel via
rate-one source-channel block codes. The theorem (under minor modifications) has
also a more general version in terms of reliable transmissibility of the source over
the channel via source-channel block codes of arbitrary rate (not necessarily equal
to one).
This key theorem is usually referred to as Shannon’s source-channel separation
theorem or principle; this renaming is explained in the following. First, the theo-
rem’s necessary and sufficient conditions for reliable transmissibility are a function
of entirely "separable" or "disentangled" information quantities, the source's minimal compression rate and the channel's capacity, with no quantity that depends on both the source and the channel; this can be seen as a "functional separa-
tion” property or condition. Second, the proof of the forward part, which (as we
will see) consists of properly combining Shannon’s source coding (Theorems 3.6
or 3.15) and channel coding (Theorem 4.11) theorems, shows that reliable transmis-
sibility can be realized by separating (or decomposing) the source-channel cod-
ing function into two distinct and independently conceived source and channel
coding operations applied in tandem, where the source code depends only on the
source statistics and, similarly, the channel code is a sole function of the chan-
nel statistics. In other words, we have “operational separation” in that a separate
(tandem or two-stage) source and channel coding scheme as depicted in Fig. 4.8
is as good (in terms of asymptotic reliable transmissibility) as the more general
joint source-channel coding scheme shown in Fig. 4.9 in which the coding oper-
ation can include a combined (one-stage) code designed with respect to both the
source and the channel or jointly coordinated source and channel codes. Now, gath-
ering the above two facts with the theorem’s converse part—with the exception
of the unresolved case where the source’s minimal achievable compression rate is

16 This theorem is sometimes referred to as the lossless information transmission theorem.



[Fig. 4.8 A separate (tandem) source-channel coding scheme: Source → Source Encoder → Channel Encoder → (X^n) Channel (Y^n) → Channel Decoder → Source Decoder → Sink.]

[Fig. 4.9 A joint source-channel coding scheme: Source → Encoder → (X^n) Channel (Y^n) → Decoder → Sink.]

exactly equal to the channel capacity—implies that either reliable transmissibility of


the source over the channel is achievable via separate source and channel coding
(under the transmissibility condition) or it is not at all achievable, hence justifying
calling the theorem by the separation principle.
We will prove the theorem by assuming that the source is stationary ergodic17
in the forward part and just stationary in the converse part and that the channel is a
DMC; note that the theorem can be extended to more general sources and channels
with memory (see [75, 96, 394]).

Definition 4.29 (Source-channel block code) Given a discrete source {V_i}_{i=1}^∞ with finite alphabet V and a discrete channel {P_{Y^n|X^n}}_{n=1}^∞ with finite input and output alphabets X and Y, respectively, an m-to-n source-channel block code ∼C_{m,n} with rate m/n source symbol/channel symbol is a pair of mappings (f^{(sc)}, g^{(sc)}), where18

f (sc) : V m → X n

and
g (sc) : Y n → V m .

The code’s operation is illustrated in Fig. 4.10. The source m-tuple V m is encoded via
the source-channel encoding function f (sc) , yielding the codeword X n = f (sc) (V m )
as the channel input. The channel output Y n , which is dependent on V m only via X n
(i.e., we have the Markov chain V m → X n → Y n ), is decoded via g (sc) to obtain the
source tuple estimate V̂ m = g (sc) (Y n ).
An error is made by the decoder if V^m ≠ V̂^m, and the code's error probability is given by

17 The minimal achievable compression rate of such sources is given by the entropy rate, see Theo-

rem 3.15.
18 Note that n = n_m; that is, the channel blocklength n is in general a function of the source blocklength m. Similarly, f^{(sc)} = f_m^{(sc)} and g^{(sc)} = g_m^{(sc)}; i.e., the encoding and decoding functions are implicitly dependent on m.

[Fig. 4.10 An m-to-n block source-channel coding system: V^m → Encoder f^{(sc)} → X^n → Channel P_{Y^n|X^n} → Y^n → Decoder g^{(sc)} → V̂^m.]

$$
P_e(\mathcal{C}_{m,n}) := \Pr[V^m \ne \hat V^m] = \sum_{v^m\in\mathcal{V}^m}\ \sum_{y^n\in\mathcal{Y}^n:\, g^{(sc)}(y^n)\ne v^m} P_{V^m}(v^m)\, P_{Y^n|X^n}\bigl(y^n \,\big|\, f^{(sc)}(v^m)\bigr)
$$

where PV m and PY n |X n are the source and channel distributions, respectively.


We next prove Shannon’s lossless joint source-channel coding theorem when
source m-tuples are transmitted via m-tuple codewords or m uses of the channel (i.e.,
when n = m or for rate-one source-channel block codes). The source is assumed to
have memory (as indicated below), while the channel is memoryless.
Theorem 4.30 (Lossless joint source-channel coding theorem for rate-one block

codes) Consider a discrete source {V_i}_{i=1}^∞ with finite alphabet V and entropy rate19
H (V) and a DMC with input alphabet X , output alphabet Y and capacity C, where
both H (V) and C are measured in the same units (i.e., they both use the same base
of the logarithm). Then, the following hold:
• Forward part (achievability): For any 0 < ε < 1 and given that the source is stationary ergodic, if

H(V) < C,

then there exists a sequence of rate-one source-channel codes {∼C_{m,m}}_{m=1}^∞ such that

P_e(∼C_{m,m}) < ε for sufficiently large m,

where P_e(∼C_{m,m}) is the error probability of the source-channel code ∼C_{m,m}.


• Converse part: For any 0 < ε < 1 and given that the source is stationary, if

H(V) > C,

then any sequence of rate-one source-channel codes {∼C_{m,m}}_{m=1}^∞ satisfies

P_e(∼C_{m,m}) > (1 − ε)μ for sufficiently large m,   (4.6.1)

where μ = H_D(V) − C_D with D = |V|, and H_D(V) and C_D are the entropy rate and channel capacity measured in D-ary digits, i.e., the codes' error probability is bounded away from zero and it is not possible to transmit the source over

19 We assume the source entropy rate exists as specified below.



the channel via rate-one source-channel block codes with arbitrarily low error
probability.20
Proof of the forward part: Without loss of generality, we assume throughout this
proof that both the source entropy rate H (V) and the channel capacity C are measured
in nats (i.e., they are both expressed using the natural logarithm).
We will show the existence of the desired rate-one source-channel codes ∼Cm,m
via a separate (tandem or two-stage) source and channel coding scheme as the one
depicted in Fig. 4.8.
Let γ := C − H(V) > 0. Now, given any 0 < ε < 1, by the lossless source coding
theorem for stationary ergodic sources (Theorem 3.15), there exists a sequence of
source codes of blocklength m and size Mm with encoder

f s : V m → {1, 2, . . . , Mm }

and decoder
gs : {1, 2, . . . , Mm } → V m

such that
$$
\frac{1}{m}\log M_m < H(\mathcal{V}) + \gamma/2 \qquad (4.6.2)
$$

and

$$
\Pr\bigl[g_s(f_s(V^m)) \ne V^m\bigr] < \varepsilon/2
$$

for m sufficiently large.21


Furthermore, by the channel coding theorem under the maximal probability of
error criterion (see Observation 4.6 and Theorem 4.11), there exists a sequence of
channel codes of blocklength m and size M̄m with encoder

f c : {1, 2, . . . , M̄m } → X m

20 Note that (4.6.1) actually implies that lim inf_{m→∞} P_e(∼C_{m,m}) ≥ lim_{ε↓0}(1 − ε)μ = μ, where the error probability lower bound has nothing to do with ε. Here, we state the converse of Theorem 4.30 in a form parallel to the converse statements in Theorems 3.6, 3.15 and 4.11.
21 Theorem 3.15 indicates that for any 0 < ε′ := min{ε/2, γ/(2 log(2))} < 1, there exists δ with 0 < δ < ε′ and a sequence of binary block codes {∼C_m = (m, M_m)}_{m=1}^∞ with

$$
\limsup_{m\to\infty} \frac{1}{m}\log_2 M_m < H_2(\mathcal{V}) + \delta, \qquad (4.6.3)
$$

and probability of decoding error satisfying P_e(∼C_m) < ε′ (≤ ε/2) for sufficiently large m, where H_2(V) is the entropy rate measured in bits. Here, (4.6.3) implies that (1/m) log_2 M_m < H_2(V) + δ for sufficiently large m. Hence,

$$
\frac{1}{m}\log M_m < H(\mathcal{V}) + \delta\log(2) < H(\mathcal{V}) + \varepsilon'\log(2) \le H(\mathcal{V}) + \gamma/2
$$

for sufficiently large m.

and decoder
gc : Y m → {1, 2, . . . , M̄m }

such that22
$$
\frac{1}{m}\log \bar M_m > C - \gamma/2 = H(\mathcal{V}) + \gamma/2 > \frac{1}{m}\log M_m \qquad (4.6.5)
$$

and

$$
\lambda := \max_{w\in\{1,\ldots,\bar M_m\}} \Pr\bigl[g_c(Y^m) \ne w \,\big|\, X^m = f_c(w)\bigr] < \varepsilon/2
$$

for m sufficiently large.


Now we form our source-channel code by concatenating in tandem the above
source and channel codes. Specifically, the m-to-m source-channel code ∼Cm,m has
the following encoder–decoder pair ( f (sc) , g (sc) ):

f (sc) : V m → X m with f (sc) (v m ) = f c ( f s (v m )) ∀v m ∈ V m

and
g (sc) : Y m → V m

with

$$
g^{(sc)}(y^m) = \begin{cases} g_s(g_c(y^m)), & \text{if } g_c(y^m)\in\{1, 2, \ldots, M_m\}\\ \text{arbitrary}, & \text{otherwise} \end{cases} \qquad \forall\, y^m\in\mathcal{Y}^m.
$$

The above construction is possible since {1, 2, . . . , Mm } is a subset of {1, 2, . . ., M̄m }.


The source-channel code’s probability of error can be analyzed by considering the

22 Theorem 4.11 and its proof of the forward part indicate that for any 0 < ε′ := min{ε/4, γ/(16 log(2))} < 1, there exist 0 < γ′ < min{4ε′, C_2} = min{ε, γ/(4 log(2)), C_2} and a sequence of data transmission block codes {∼C_m = (m, M̄_m)}_{m=1}^∞ satisfying

$$
C_2 - \gamma' < \frac{1}{m}\log_2 \bar M_m \le C_2 - \frac{\gamma'}{2} \qquad (4.6.4)
$$

and

P_e(∼C_m) < ε′ for sufficiently large m,

provided that C_2 > 0, where C_2 is the channel capacity measured in bits.
Observation 4.6 indicates that by throwing away from ∼C_m half of its codewords with largest conditional probability of error, a new code ∼C′_m = (m, M̄′_m) = (m, M̄_m/2) is obtained, which satisfies λ(∼C′_m) ≤ 2 P_e(∼C_m) < 2ε′ ≤ ε/2.
Equation (4.6.4) then implies that for m > 1/γ′ sufficiently large,

$$
\frac{1}{m}\log \bar M'_m = \frac{1}{m}\log \bar M_m - \frac{1}{m}\log(2) > C - \gamma'\log(2) - \frac{1}{m}\log(2) > C - 2\gamma'\log(2) > C - \gamma/2.
$$

cases of whether or not a channel decoding error occurs as follows:

$$
\begin{aligned}
P_e(\mathcal{C}_{m,m}) &= \Pr[g^{(sc)}(Y^m) \ne V^m]\\
&= \Pr[g^{(sc)}(Y^m) \ne V^m,\, g_c(Y^m) = f_s(V^m)] + \Pr[g^{(sc)}(Y^m) \ne V^m,\, g_c(Y^m) \ne f_s(V^m)]\\
&= \Pr[g_s(g_c(Y^m)) \ne V^m,\, g_c(Y^m) = f_s(V^m)] + \Pr[g^{(sc)}(Y^m) \ne V^m,\, g_c(Y^m) \ne f_s(V^m)]\\
&\le \Pr[g_s(f_s(V^m)) \ne V^m] + \Pr[g_c(Y^m) \ne f_s(V^m)]\\
&= \Pr[g_s(f_s(V^m)) \ne V^m] + \sum_{w\in\{1,2,\ldots,M_m\}} \Pr[f_s(V^m) = w]\, \Pr[g_c(Y^m) \ne w \,|\, f_s(V^m) = w]\\
&= \Pr[g_s(f_s(V^m)) \ne V^m] + \sum_{w\in\{1,2,\ldots,M_m\}} \Pr[X^m = f_c(w)]\, \Pr[g_c(Y^m) \ne w \,|\, X^m = f_c(w)]\\
&\le \Pr[g_s(f_s(V^m)) \ne V^m] + \lambda\\
&< \varepsilon/2 + \varepsilon/2 = \varepsilon
\end{aligned}
$$

for m sufficiently large. Thus, the source can be reliably sent over the channel via
rate-one block source-channel codes as long as H (V) < C.

Proof of the converse part: For simplicity, we assume in this proof that H (V) and
C are measured in bits.
For any m-to-m source-channel code ∼Cm,m , we can write

$$
\begin{aligned}
H(\mathcal{V}) &\le \frac{1}{m} H(V^m) \qquad (4.6.6)\\
&= \frac{1}{m} H(V^m|\hat V^m) + \frac{1}{m} I(V^m; \hat V^m)\\
&\le \frac{1}{m}\bigl[P_e(\mathcal{C}_{m,m})\log_2(|\mathcal{V}|^m) + 1\bigr] + \frac{1}{m} I(V^m; \hat V^m) \qquad (4.6.7)\\
&\le P_e(\mathcal{C}_{m,m})\log_2|\mathcal{V}| + \frac{1}{m} + \frac{1}{m} I(X^m; Y^m) \qquad (4.6.8)\\
&\le P_e(\mathcal{C}_{m,m})\log_2|\mathcal{V}| + \frac{1}{m} + C, \qquad (4.6.9)
\end{aligned}
$$
where
• Equation (4.6.6) is due to the fact that (1/m)H (V m ) is nonincreasing in m and con-
verges to H (V) as m → ∞ since the source is stationary (see Observation 3.12),
• Equation (4.6.7) follows from Fano’s inequality,

H (V m |V̂ m ) ≤ Pe (∼Cm,m ) log2 (|V|m ) + h b (Pe (∼Cm,m )) ≤ Pe (∼Cm,m ) log2 (|V|m ) + 1,



• Equation (4.6.8) is due to the data processing inequality since V m → X m →


Y m → V̂ m form a Markov chain, and
• Equation (4.6.9) holds by (4.3.8) since the channel is a DMC.
Note that in the above derivation, the information measures are all measured in bits.
This implies that for m ≥ log D (2)/(εμ),

$$
P_e(\mathcal{C}_{m,m}) \ge \frac{H(\mathcal{V}) - C}{\log_2(|\mathcal{V}|)} - \frac{1}{m\log_2(|\mathcal{V}|)} = H_D(\mathcal{V}) - C_D - \frac{\log_D(2)}{m} \ge (1-\varepsilon)\mu.
$$



Observation 4.31 We make the following remarks regarding the above joint source-
channel coding theorem:
• In general, it is not known whether the source can be (asymptotically) reliably
transmitted over the DMC when

H (V) = C

even if the source is a DMS. This is because separate source and channel codes
are used to prove the forward part of the theorem, with the source coding rate
approaching the source entropy rate from above [cf. (4.6.2)] while the channel
coding rate approaches channel capacity from below [cf. (4.6.5)].
• The above theorem directly holds for DMSs since any DMS is stationary
and ergodic.
• We can extend the forward part of the above theorem by replacing the requirement
that the source be stationary ergodic with the more general condition that the source
be information stable.23 Note that time-invariant irreducible Markov sources (that
are not necessarily stationary) are information stable.
The above lossless joint source-channel coding theorem can be readily generalized
for m-to-n source-channel codes—i.e., codes with rate not necessarily equal to one—
as follows (its proof, which is similar to the previous theorem, is left as an exercise).
Theorem 4.32 (Lossless joint source-channel coding theorem for general rate block

codes) Consider a discrete source {Vi}∞i=1 with finite alphabet V and entropy rate
H (V) and a DMC with input alphabet X , output alphabet Y and capacity C, where
both H (V) and C are measured in the same units. Then, the following holds:
• Forward part (achievability): For any 0 < ε < 1 and given that the source
is stationary ergodic, there exists a sequence of m-to-n_m source-channel codes
{∼Cm,n_m}∞m=1 such that

    Pe(∼Cm,n_m) < ε for sufficiently large m

23 See
[75, 96, 303, 394] for a definition of information stable sources, whose property is slightly
more general than the Generalized AEP property given in Theorem 3.14.

if

    lim sup_{m→∞} m/n_m < C/H(V).

• Converse part: For any 0 < ε < 1 and given that the source is stationary, any
sequence of m-to-n_m source-channel codes {∼Cm,n_m}∞m=1 with

    lim inf_{m→∞} m/n_m > C/H(V)

satisfies

    Pe(∼Cm,n_m) > (1 − ε)μ for sufficiently large m,

for some positive constant μ that depends on lim inf m→∞ (m/n m ), H (V) and C,
i.e., the codes’ error probability is bounded away from zero and it is not possible
to transmit the source over the channel via m-to-n m source-channel block codes
with arbitrarily low error probability.
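To make the threshold in Theorem 4.32 concrete, the following minimal Python sketch compares candidate source-channel rates m/n_m against the critical ratio C/H(V), assuming an illustrative Bernoulli(0.1) DMS and a BSC with crossover probability 0.05 (both parameter choices are hypothetical):

    # Minimal numerical sketch of the transmissibility condition in Theorem 4.32,
    # assuming a Bernoulli(p) DMS and a BSC(eps); the parameters are illustrative.
    from math import log2

    def hb(p: float) -> float:
        """Binary entropy function in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    p, eps = 0.1, 0.05        # hypothetical source and channel parameters
    H = hb(p)                 # entropy rate H(V) of the DMS (bits/source symbol)
    C = 1 - hb(eps)           # BSC capacity (bits/channel use)
    critical = C / H          # critical value of m/n_m (source symbols per channel use)

    for rate in (1.0, 1.5, 0.99 * critical, 1.01 * critical):
        if rate < critical:
            verdict = "achievable (forward part)"
        elif rate > critical:
            verdict = "error probability bounded away from zero (converse part)"
        else:
            verdict = "boundary case (not settled by the theorem)"
        print(f"m/n = {rate:.3f} vs C/H(V) = {critical:.3f} -> {verdict}")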

Discussion: separate versus joint source-channel coding


Shannon’s separation principle has provided the linchpin for most modern commu-
nication systems where source coding and channel coding schemes are separately
constructed (with the source (respectively, channel) code designed by only taking
into account the source (respectively, channel) characteristics) and applied in tandem
without the risk of sacrificing optimality in terms of reliable transmissibility under
unlimited coding delay and complexity. This result is the raison d’être for separately
studying the practices of source coding or data compression (e.g., see [42, 142, 158,
290, 326, 330]) and channel coding (e.g., see [208, 248, 254, 321, 407]). Further-
more, by disentangling the source and channel coding operations, separate coding
offers appealing properties such as system modularity and flexibility. For example,
if one needs to send different sources over the same channel, using the separate cod-
ing approach, one only needs to modify the source code while keeping the channel
code unchanged (analogously, if a single source is to be communicated over different
channels, one only has to adapt the channel code).
However, in practical implementations, the extremely long coding blocklengths needed
to approach this optimality come at a price in delay and complexity (particularly when
delay and complexity constraints are quite stringent, such as in wireless communication systems).
To begin, note that joint source-channel coding might be expected to offer improve-
ments for the combination of a source with substantial redundancy and a channel
with significant noise, since, for such a system, separate coding would involve source
coding to remove redundancy followed by channel coding to insert redundancy. It is
a natural conjecture that this is not the most efficient approach even if the blocklength
is allowed to grow without bound. Indeed, Shannon [340] made this point as follows:
. . . However, any redundancy in the source will usually help if it is utilized at the
receiving point. In particular, if the source already has a certain redundancy and no
attempt is made to eliminate it in matching to the channel, this redundancy will help

combat noise. For example, in a noiseless telegraph channel one could save about
50% in time by proper encoding of the messages. This is not done and most of the
redundancy of English remains in the channel symbols. This has the advantage,
however, of allowing considerable noise in the channel. A sizable fraction of the
letters can be received incorrectly and still reconstructed by the context. In fact this
is probably not a bad approximation to the ideal in many cases . . .

We make the following observations regarding the merits of joint versus separate
source-channel coding:
• Under finite coding blocklengths and/or complexity, many studies have demon-
strated that joint source-channel coding can provide better performance than sepa-
rate coding (e.g., see [13, 14, 37, 100, 127, 200, 247, 410, 427] and the references
therein).
• Even in the infinite blocklength regime where separate coding is optimal in terms
of reliable transmissibility, it can be shown that for a large class of systems, joint
source-channel coding can achieve an error exponent24 that is as large as double
the error exponent resulting from separate coding [422–424]. This indicates that
one can realize via joint source-channel coding the same performance as separate
coding, while reducing the coding delay by half (this result translates into notable
power savings of more than 2 dB when sending binary sources over channels
with Gaussian noise, fading and output quantization [422]). These findings provide
an information-theoretic rationale for adopting joint source-channel coding over
separate coding.
• Finally, it is important to point out that, with the exception of certain network
topologies [173, 383, 425] where separation is optimal, the separation theorem
does not in general hold for multiuser (multiterminal) systems (cf. [81, 83, 106, 174]),
and thus, in such systems, it is more beneficial to perform joint source-channel
coding.
The study of joint source-channel coding dates back to as early as the 1960s. Over
the years, many works have introduced joint source-channel coding techniques and
illustrated (analytically or numerically) their benefits (in terms of both performance
improvement and increased robustness to variations in channel noise) over separate
coding for given source and channel conditions and fixed complexity and/or delay
constraints. In joint source-channel coding systems, the designs of the source and
channel codes are either well coordinated or combined into a single step. Examples of

24 The error exponent or reliability function of a coding system is the largest rate of exponential

decay of its decoding error probability as the coding blocklength grows without bound [51, 87,
95, 107, 114, 135, 177, 178, 205, 347, 348]. Roughly speaking, the error exponent is a number
E with the property that the decoding error probability of a good code is approximately e−n E for
large coding blocklength n. In addition to revealing the fundamental trade-off between the error
probability of optimal codes and their blocklength for a given coding rate and providing insight on
the behavior of optimal codes, such a function provides a powerful tool for proving the achievability
part of coding theorems (e.g., [135]), for comparing the performance of competing coding schemes
(e.g., weighing joint against separate coding [422]) and for communications system design [194].

(both constructive and theoretical) previous lossless and lossy joint source-channel
coding investigations for single-user25 systems include the following:
(a) Fundamental limits: joint source-channel coding theorems and the separation
principle [21, 34, 75, 96, 103, 135, 161, 164, 172, 187, 231, 271, 273, 351,
365, 373, 386, 394, 399], and joint source-channel coding exponents [69, 70,
84, 85, 135, 220, 422–424].
(b) Channel-optimized source codes (i.e., source codes that are robust against chan-
nel noise) [15, 32, 33, 39, 102, 115–117, 121, 126, 131, 143, 155, 167, 218,
238–240, 247, 272, 293, 295, 296, 354–356, 369, 375, 392, 419].
(c) Source-optimized channel codes (i.e., channel codes that exploit the source’s
redundancy) [14, 19, 62, 91, 93, 100, 118, 122, 127, 139, 169, 198, 234, 263,
331, 336, 410, 427, 428], uncoded source-channel matching with joint decoding
[13, 92, 140, 230, 285, 294, 334, 335, 366, 406] and source-matched channel
signaling [109, 229, 276, 368].
(d) Jointly coordinated source and channel codes [61, 101, 124, 132, 149, 150,
152, 166, 168, 171, 183, 184, 189, 190, 204, 217, 241, 268, 275, 282, 283,
286, 288, 332, 381, 402, 416, 417].
(e) Hybrid digital-analog source-channel coding and analog mapping [8, 57, 64,
71, 77, 79, 112, 130, 138, 147, 185, 193, 219, 221, 232, 244, 245, 274, 314,
320, 324, 335, 341, 357, 358, 367, 382, 391, 401, 405, 409, 429].
The above references, while numerous, are not exhaustive, as the field of joint
source-channel coding has remained quite active, particularly over the last few decades.

Problems

1. Prove the Shannon–McMillan–Breiman theorem for pairs (Theorem 4.9).


2. The proof of Shannon’s channel coding theorem is based on the random coding
technique. What is the codeword-selecting distribution of the random codebook?
What is the decoding rule in the proof?
3. Show that processing the output of a DMC (via a given function) does not strictly
increase its capacity.
4. Consider the system shown in the block diagram below. Can the channel capacity
between channel input X and channel output Z be strictly larger than the channel
capacity between channel input X and channel output Y ? Which lemma or
theorem is your answer based on?
    [Block diagram: channel input X → DMC PY|X → output Y → deterministic post-processing mapping g(·) → output Z = g(Y).]

25 We underscore that, even though not listed here, the literature on joint source-channel coding for
multiuser systems is also quite extensive and ongoing.

5. Consider a DMC with input X and output Y . Assume that the input alphabet is
X = {1, 2}, the output alphabet is Y = {0, 1, 2, 3}, and the transition probability
is given by

    PY|X(y|x) = 1 − 2ε,  if x = y;
                ε,       if |x − y| = 1;
                0,       otherwise,

where 0 < ε < 1/2.

(a) Determine the channel probability transition matrix Q := [PY|X(y|x)].
(b) Compute the capacity of this channel. What is the maximizing input distri-
bution that achieves capacity?
6. Binary-input additive discrete-noise channel: Find the capacity of a DMC whose
output Y is given by Y = X + Z , where X and Z are the channel input and
noise, respectively. Assume that the noise is independent of the input and that
it has alphabet Z = {−b, 0, b} such that PZ (−b) = PZ (0) = PZ (b) = 1/3,
where b > 0 is a fixed real number. Also assume that the input alphabet is given
by X = {−a, a} for some given real number a > 0. Discuss the dependence of
the channel capacity on the values of a and b.
7. The Z-channel: Find the capacity of the DMC called the Z-channel and described
   by the following transition diagram (where 0 ≤ β ≤ 1/2).

    [Transition diagram: input 0 → output 0 with probability 1; input 1 → output 0 with probability β and output 1 with probability 1 − β.]

8. Functional representation of the Z -channel: Consider the DMC of Problem 4.7


above.
(a) Give a functional representation of the channel by explicitly expressing, at
any time instant i, the channel output Yi in terms of the input X i and a binary
noise random variable Z i , which is independent of the input and is generated
from a memoryless process {Z i } with Pr[Z i = 0] = β.
(b) Show that the channel’s input–output mutual information satisfies

I (X ; Y ) ≥ H (Y ) − H (Z ).

(c) Show that the capacity of the Z -channel is no smaller than that of a BSC
with crossover probability 1 − β (i.e., a binary modulo-2 additive noise
channel with {Z i } as its noise process):

C ≥ 1 − h b (β)

where h b (·) is the binary entropy function.



9. A DMC has identical input and output alphabets given by {0, 1, 2, 3, 4}. Let X
be the channel input, and Y be the channel output. Suppose that

    PY|X(i|i) = 1/2   ∀ i ∈ {0, 1, 2, 3, 4}.
(a) Find the channel transition matrix that maximizes H (Y |X ).
(b) Using the channel transition matrix obtained in (a), evaluate the channel
capacity.
10. Binary channel: Consider a binary memoryless channel with the following prob-
ability transition matrix:
    Q = [ 1 − α    α
          β        1 − β ],

where α > 0, β > 0 and α + β < 1.


(a) Determine the capacity C of this channel in terms of α and β.
(b) What does the expression of C reduce to if α = β?
11. Find the capacity of the asymmetric binary channel with errors and erasures
described in (4.2.10). Verify that the channel capacity reduces to that of the
BSEC when its two crossover probabilities are both set to ε and its two erasure probabilities are both set to α.
12. Find the capacity of the binary-input quaternary-output DMC given in (4.5.2).
For what values of ε is capacity maximized, and for what values of ε is capacity
minimized?
13. Nonbinary erasure channel: Find the capacity of the q-ary erasure channel
described in (4.2.12) and compare the result with the capacity of the BEC.

14. Product of two channels: Consider two DMCs

(X1 , PY1 |X 1 , Y1 ) and (X2 , PY2 |X 2 , Y2 )

with capacity C1 and C2 , respectively. A new channel (X1 × X2 , PY1 |X 1 ×


PY2 |X 2 , Y1 × Y2 ) is formed in which x1 ∈ X1 and x2 ∈ X2 are simultane-
ously sent, resulting in Y1 , Y2 . Find the capacity of this channel, which was first
introduced by Shannon [344].
15. The sum channel: This channel, originally due to Shannon [344], operates by
signaling over two DMCs with disjoint input and output alphabets, as described
below.
(a) Let (X1 , PY1 |X 1 , Y1 ) be a DMC with finite input alphabet X1 , finite output
alphabet Y1 , transition distribution PY1 |X 1 (y|x) and capacity C1 . Similarly,
let (X2 , PY2 |X 2 , Y2 ) be another DMC with capacity C2 . Assume that X1 ∩
X2 = ∅ and that Y1 ∩ Y2 = ∅.
Now let (X , PY |X , Y) be the sum of these two channels where X = X1 ∪X2 ,
Y = Y1 ∪ Y2 and
    PY|X(y|x) = PY1|X1(y|x)  if x ∈ X1, y ∈ Y1;
                PY2|X2(y|x)  if x ∈ X2, y ∈ Y2;
                0            otherwise.

Show that the capacity of the sum channel is given by

    Csum = log2(2^C1 + 2^C2) bits/channel use.

Hint: Introduce a Bernoulli random variable Z with Pr[Z = 1] = α such


that Z = 1 if X ∈ X1 (when the first channel is used), and Z = 2 if X ∈ X2
(when the second channel is used). Then show that

I (X ; Y ) = I (X, Z ; Y )
= h b (α) + αI (X 1 ; Y1 ) + (1 − α)I (X 2 ; Y2 ),

where h b (·) is the binary entropy function, and I (X i ; Yi ) is the mutual


information for channel PYi |X i (y|x), i = 1, 2. Then maximize (jointly) over
the input distribution and α.
(b) Compute Csum above if the first channel is a BSC with crossover probability
0.11, and the second channel is a BEC with erasure probability 0.5.
16. Prove that the quasi-symmetric channel satisfies the KKT conditions of
Lemma 4.25 and yields the channel capacity given by (4.5.6).
17. Let the channel transition probability PY|X of a DMC be defined by the following
    figure, where 0 < ε < 0.5.

    [Transition diagram: inputs and outputs {0, 1, 2, 3}; each input x is received as x with probability 1 − ε and as one other output symbol with probability ε.]

(a) Is the channel weakly symmetric? Is the channel symmetric?


(b) Determine the channel capacity of this channel. Also, indicate the input
distribution that achieves the channel capacity.

18. Let the relation between the channel input {Xn}∞n=1 and channel output {Yn}∞n=1
    be given by

    Yn = (αn × Xn) ⊕ Nn   for each n,

    where αn, Xn, Yn, and Nn all take values from {0, 1}, and “⊕” represents
    the modulo-2 addition operation. Assume that the attenuation {αn}∞n=1, channel
    input {Xn}∞n=1 and noise {Nn}∞n=1 processes are independent of each other.
    Also, {αn}∞n=1 and {Nn}∞n=1 are i.i.d. with

    Pr[αn = 1] = Pr[αn = 0] = 1/2

    and

    Pr[Nn = 1] = 1 − Pr[Nn = 0] = ε ∈ (0, 1/2).

(a) Show that the channel is a DMC and derive its transition probability matrix

    [ PYj|Xj(0|0)   PYj|Xj(1|0)
      PYj|Xj(0|1)   PYj|Xj(1|1) ].

(b) Determine the channel capacity C.


Hint: Use the KKT conditions for channel capacity (Lemma 4.25).
(c) Suppose that α^n is known and consists of k 1’s. Find the maximum I(X^n; Y^n)
    for the same channel with known α^n.
    Hint: For known α^n, {(Xj, Yj)}^n_{j=1} are independent. Recall that I(X^n; Y^n) ≤ Σ^n_{j=1} I(Xj; Yj).
(d) Some researchers attempt to derive the capacity of the channel in (b) in terms
of the following steps:
• Derive the maximum mutual information between channel input X n and
output Y n for a given αn [namely, the solution in (c)].
• Calculate the expected value of the maximum mutual information
obtained from the previous step according to the statistics of αn .
• Then, the capacity of the channel is equal to this “expected value” divided
by n.
Does this “expected capacity” C̄ coincide with that in (b)?
19. Maximum likelihood vs. minimum Hamming distance decoding: Given a channel
with finite input and output alphabets X and Y, respectively, and given codebook
C = {c1 , . . . , c M } of size M and blocklength n with ci = (ci,1 , . . . , ci,n ) ∈ X n ,
if an n-tuple y n ∈ Y n is received at the channel output, then under maximum
likelihood (ML) decoding, y n is decoded into the codeword c∗ ∈ C that maxi-
mizes P(Y n = y n |X n = c) among all codewords c ∈ C. It can be shown that ML
decoding minimizes the probability of decoding error when the channel input
n-tuple is chosen uniformly among all the codewords.
Recall that the Hamming distance d H (x n , y n ) between two n-tuples x n and y n
taking values in X n is defined as the number of positions where x n and y n differ.
For a BSC using a binary codebook C = {c1 , . . . , c M } ⊆ {0, 1}n of size M and
blocklength n, if an n-tuple y n is received at the channel output, then

under minimum Hamming distance decoding, y n is decoded into the codeword


c ∈ C that minimizes d H (c, y n ) among all codewords c ∈ C.
Prove that minimum Hamming distance decoding is equivalent to ML decoding
for the BSC if its crossover probability ε satisfies ε ≤ 1/2.
Note: Note that ML decoding is not necessarily optimal if the codewords are not
selected via a uniform probability distribution. In that more general case, which
often occurs in practical systems (e.g., see [14, 169]), the optimal decoding rule
is the so-called maximum a posteriori (MAP) rule, which selects the codeword
c∗ ∈ C that maximizes P(X n = c|Y n = y n ) among all codewords c ∈ C. It can
readily be seen using Bayes’ rule that MAP decoding reduces to ML decoding
when the codewords are governed by a uniform distribution.
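The equivalence asked for in this problem can also be checked numerically before proving it. The Python sketch below brute-forces both decoding rules over all possible received words for a small, arbitrarily chosen binary codebook and crossover probability; it is only an illustration, not a proof:

    # Illustrative check (not a proof) that ML decoding and minimum Hamming
    # distance decoding agree over a BSC with crossover probability eps <= 1/2.
    # The codebook and eps below are arbitrary example choices.
    from itertools import product

    eps = 0.1
    codebook = [(0, 0, 0, 0, 0), (0, 1, 1, 0, 1), (1, 0, 1, 1, 0), (1, 1, 0, 1, 1)]
    n = len(codebook[0])

    def hamming(c, y):
        return sum(ci != yi for ci, yi in zip(c, y))

    def likelihood(c, y):
        # P(Y^n = y | X^n = c) for a BSC: eps^d * (1 - eps)^(n - d), d = Hamming distance
        d = hamming(c, y)
        return (eps ** d) * ((1 - eps) ** (n - d))

    agree = all(
        max(codebook, key=lambda c: likelihood(c, y)) == min(codebook, key=lambda c: hamming(c, y))
        for y in product((0, 1), repeat=n)
    )
    print("ML and minimum-distance decoding agree on every received word:", agree)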
20. Suppose that blocklength n = 2 and code size M = 2. Assume each code bit is
either 0 or 1.
(a) What is the number of all possible codebook designs? (Note: This number
includes those lousy code designs, such as {00, 00}.)
(b) Suppose that one randomly draws one of these possible code designs accord-
ing to a uniform distribution and applies the selected code to BSC with
crossover probability ε. Then what is the expected error probability, if
the decoder simply selects the codeword whose Hamming distance to the
received vector is the smallest? (When both codewords have the same Ham-
ming distance to the received vector, the decoder chooses one of them at
random as the transmitted codeword.)
(c) Explain why the error in (b) does not vanish as ε ↓ 0.
Hint: The error probability of a random (n, M) code is lower bounded by that of a
random (n, 2) code for M ≥ 2.
21. Fano’s inequality: Assume that the alphabets for random variables X and Y are
both given by
X = Y = {1, 2, 3, 4, 5}.

Let
x̂ = g(y)

be an estimate of x from observing y. Define the probability of estimation error


as
Pe = Pr{g(Y) ≠ X}.

Then, Fano’s inequality gives a lower bound for Pe as

h b (Pe ) + 2Pe ≥ H (X |Y ),

where hb(p) = p log2(1/p) + (1 − p) log2(1/(1 − p)) is the binary entropy function. The
curve for
h b (Pe ) + 2Pe = H (X |Y )

in terms of H (X |Y ) versus Pe is plotted in the figure below.


    [Figure: the curve hb(Pe) + 2Pe = H(X|Y) for |X| = 5, plotted as H(X|Y) (in bits) versus Pe, with marked points A = (Pe = 0, H(X|Y) = 0), B = (Pe = 0.8, H(X|Y) = log2 5), C = (Pe = 1, H(X|Y) = 2), and D = (Pe = 1, H(X|Y) = 0).]

(a) Point A on the above figure shows that if H (X |Y ) = 0, zero estimation error,
namely, Pe = 0, can be achieved. In this case, characterize the distribution
PX |Y . Also, give an estimator g(·) that achieves Pe = 0. Hint: Think what
kind of relation between X and Y can render H (X |Y ) = 0.
(b) Point B on the above figure indicates that when H (X |Y ) = log2 (5), the
estimation error can only be equal to 0.8. In this case, characterize the
distributions PX |Y and PX . Prove that at H (X |Y ) = log2 (5), all estimators
yield Pe = 0.8.
Hint: Think what kind of relation between X and Y can result in H (X |Y ) =
log2 (5).
(c) Point C on the above figure hints that when H(X|Y) = 2, the estimation
    error can be as large as 1. Give an estimator g(·) that leads to Pe = 1, if
    PX|Y(x|y) = 1/4 for x ≠ y, and PX|Y(x|y) = 0 for x = y.
(d) Similarly, point D on the above figure hints that when H (X |Y ) = 0, the
estimation error can be as large as 1. Give an estimator g(·) that leads to
Pe = 1 at H (X |Y ) = 0.

22. Decide whether the following statement is true or false. Consider a discrete
memoryless channel with input alphabet X , output alphabet Y and transition
distribution PY |X (y|x) := Pr{Y = y|X = x}. Let PX 1 (·) and PX 2 (·) be two
possible input distributions, and PY1(·) and PY2(·) be the corresponding output
distributions; i.e., ∀ y ∈ Y, PYi(y) = Σ_{x∈X} PY|X(y|x) PXi(x), i = 1, 2. Then,

    D(PX1 || PX2) ≥ D(PY1 || PY2).



23. Consider a system consisting of two (parallel) discrete memoryless channels


with transition probability matrices Q 1 = [ p1 (y1 |x)] and Q 2 = [ p2 (y2 |x)].
These channels have a common input alphabet X and output alphabets Y1 and
Y2 , respectively, where Y1 and Y2 are disjoint. Let X denote the common input
to the two channels, and let Y1 and Y2 be the corresponding outputs in channels
Q 1 and Q 2 , respectively. Now let Y be an overall output of the system which
switches between Y1 and Y2 according to the values of a binary random variable
Z ∈ {1, 2} as follows:
    Y = Y1  if Z = 1;
        Y2  if Z = 2;

where Z is independent of the input X and has distribution P(Z = 1) = λ.


(a) Express the system’s mutual information I (X ; Y ) in terms of λ, I (X ; Y1 )
and I (X ; Y2 ).
(b) Find an upper bound on the system’s capacity C = max PX I (X ; Y ) in terms
of λ, C1 and C2 , where Ci is the capacity of channel Q i , i = 1, 2.
(c) If both C1 and C2 can be achieved by the same input distribution, show that
the upper bound on C in (b) is exact.
24. Cascade channel: Consider a channel with input alphabet X , output alphabet
Y and transition probability matrix Q 1 = [ p1 (y|x)]. Consider another channel
with input alphabet Y, output alphabet Z and transition probability matrix Q 2 =
[ p2 (z|y)]. Let C1 denote the capacity of channel Q 1 , and let C2 denote the
capacity of channel Q 2 .
Define a new cascade channel with input alphabet X , output alphabet Z and
transition probability matrix Q = [ p(z|x)] obtained by feeding the output of
channel Q 1 into the input of channel Q 2 . Let C denote the capacity of channel
Q.

(a) Show that p(z|x) = Σ_{y∈Y} p2(z|y) p1(y|x) and that C ≤ min{C1, C2}.
(b) If X = {0, 1}, Y = Z = {a, b, c}, Q 1 is described by

    Q1 = [p1(y|x)] = [ 1 − α   α   0
                       0       α   1 − α ],   0 ≤ α ≤ 1,

and Q2 is described by

    Q2 = [p2(z|y)] = [ 1 − ε   ε/2     ε/2
                       ε/2     1 − ε   ε/2
                       ε/2     ε/2     1 − ε ],   0 ≤ ε ≤ 1,

find the capacities C1 and C2 .


(c) Given the channels Q1 and Q2 described in part (b), find C in terms of α
    and ε.
(d) Compute the value of C obtained in part (c) if ε = 2/3. Explain qualitatively.

25. Let X be a binary random variable with alphabet X = {0, 1}. Let Z denote
another random variable that is independent of X and taking values in Z =
{0, 1, 2, 3} such that Pr[Z = 0] = Pr[Z = 1] = Pr[Z = 2] = ε, where
0 < ε ≤ 1/3. Consider a DMC with input X, noise Z, and output Y described
by the equation
Y = 3X + (−1)^X Z,

where X and Z are as defined above.


(a) Determine the channel transition probability matrix Q = [ p(y|x)].
(b) Compute the capacity C of this channel in terms of ε. What is the maximizing
    input distribution that achieves capacity?
(c) For what value of ε is the noise entropy H(Z) maximized? What is the value
    of C for this choice of ε? Comment on the result.
26. A channel with skewed errors: This problem presents an additive noise channel
in which the nonzero noise values can be partitioned into two distinct sets: the
set of “common” errors A and the set of “uncommon” errors B.
Consider a DMC with identical input and output alphabets given by X = Y =
{0, 1, . . . , q − 1} where q is a fixed positive integer. The channel is a modulo-q
additive noise channel, whose output Y is given by

Y = X ⊕q Z ,

where ⊕q denotes addition modulo-q, X is the channel input, and Z is the


channel noise which is independent of X and has alphabet Z = X = Y. Given
a partition of Z via sets
A = {1, 2, . . . , r },

and
B = {r + 1, r + 2, . . . , q − 1},

for a fixed integer 0 < r < q − 1, the distribution of Z is described as follows:


• P(Z = 0) = ε.
• P(Z ∈ A | Z ≠ 0) = γ.
• P(Z = i) is constant for all i ∈ A.
• P(Z = j) is constant for all j ∈ B.
(a) Determine P(Z = z) for all z ∈ Z.
(b) Find the capacity of the channel in terms of q, r, ε and γ.
(c) Find the values of ε and γ that minimize capacity and determine the corre-
sponding minimal capacity. Interpret the results qualitatively.
Note: This channel was introduced in [129] to model nonbinary data trans-
mission and storage channels in which some types of errors (designated as
“common errors”) occur much more frequently than others. A family of codes,

called focused error control codes, was developed in [129] to provide a certain
level of protection against the common errors of the channel while guaran-
teeing another lower level of protection against uncommon errors; hence the
levels of protection are determined based not only on the numbers of errors
but on the kind of errors as well (unlike traditional channel codes). The per-
formance of these codes was assessed in [10].
27. Effect of memory on capacity: This problem illustrates the adage “memory
increases (operational) capacity.” Given an integer q ≥ 2, consider a q-ary
additive noise channel described by

Yi = X i ⊕q Z i , i = 1, 2 . . . ,

where ⊕q denotes addition modulo-q and Yi , X i and Z i are the channel output,
input and noise at time instant i, all with identical alphabet Y = X = Z =
{0, 1, . . . , q − 1}. We assume that the input and noise processes are independent

of each other and that the noise process {Zi}∞i=1 is stationary ergodic. It can be
shown via an extended version of Theorem 4.11 that the operational capacity of
this channel with memory is given by [96, 191]:

    Cop = lim_{n→∞} max_{p(x^n)} (1/n) I(X^n; Y^n).

Now consider an “equivalent” memoryless channel in the sense that it has a



memoryless additive noise {Z̃i}∞i=1 with identical marginal distribution as the noise
{Zi}∞i=1: PZ̃i(z) = PZi(z) for all i and z ∈ Z. Letting C̃ denote the (operational)
capacity of the equivalent memoryless channel, show that

Cop ≥ C̃.

Note: The adage “memory increases (operational) capacity” does not hold for
arbitrary channels. It is only valid for well-behaved channels with memory [97],
such as the above additive noise channel with stationary ergodic noise or more
generally for information stable26 channels [96, 191, 303] whose capacity is
given by27
    Cop = lim inf_{n→∞} max_{p(x^n)} (1/n) I(X^n; Y^n).                    (4.7.1)

26 Loosely speaking, a channel is information stable if the input process which maximizes the
channel’s block mutual information yields a joint input–output process that behaves ergodically
and satisfies the joint AEP (see [75, 96, 191, 303, 394] for a precise definition).
27 Note that a formula of the capacity of more general (not necessarily information stable) channels

with memory does exist in terms of a generalized (spectral) mutual information rate, see [172, 396].

However, one can find counterexamples to this adage, such as in [3] regarding
non-ergodic “averaged” channels [4, 199]. Examples of such averaged channels
include additive noise channels with stationary but non-ergodic noise, in par-
ticular, the Polya contagion channel [12] whose noise process is described in
Example 3.16.
28. Feedback capacity. Consider a (not necessarily memoryless) discrete channel
with input alphabet X , output alphabet Y and n-fold transition distributions
PY n |X n , n = 1, 2, . . .. The channel is to be used with feedback as shown in the
figure below.

    [Block diagram: message W → encoder fi → Xi → channel PY^n|X^n → Yi → decoder g → Ŵ, with a noiseless feedback link from the channel output back to the encoder. Caption: Channel coding system with feedback.]

More specifically, there is a noiseless feedback link from the channel output to
the transmitter with one time unit of delay. As a result, at each time instance i,
the channel input X i is a function of both the message W and all past channel
outputs Y i−1 = (Y1 , . . . , Yi−1 ). More formally, an (n, Mn ) feedback channel
code consists of a sequence of encoding functions

f i : {1, 2, . . . , Mn } × Y i−1 → X

for i = 1, . . . , n and a decoding function

g : Y n → {1, 2, . . . , Mn }.

To send message W , assumed to be uniformly distributed over the message set


{1, 2, . . . , Mn}, the transmitter sends the channel codeword X^n = (X1, . . . , Xn),
where X i = f i (W, Y i−1 ), i = 1, . . . , n (for i = 1, X 1 = f 1 (W )), which is
received as Y n at the channel output. The decoder then provides the message
estimate via Ŵ = g(Y n ) and the resulting average error probability is Pe =
Pr[Ŵ ≠ W].
Causality channel condition: By the nature in which the channel is operated with
or without feedback, we assume that a causal condition holds in the form of the
following Markov chain property :

W → (X i , Y i−1 ) → Yi , for i = 1, 2, . . . , (4.7.2)

where Y i−1 = ∅ for i = 1.


We say that a rate R is achievable with feedback if there exists a sequence of
(n, Mn ) feedback channel codes with

    lim inf_{n→∞} (1/n) log2 Mn ≥ R   and   lim_{n→∞} Pe = 0.

The feedback operational capacity Cop,F B of the channel is defined as the supre-
mum of all achievable rates with feedback:

Cop,F B = sup{R : R is achievable with feedback}.

Comparing this definition of feedback operational capacity with the one when
no feedback exists given in Definition 4.10 and studied in Theorem 4.11, we
readily observe that, in general,

Cop,F B ≥ Cop

since non-feedback codes belong to the class of feedback codes. This inequality
is intuitively not surprising as in the presence of feedback, the transmitter can use
the previously received output symbols to better understand the channel behavior
and hence send codewords that are more robust to channel noise, potentially
increasing the rate at which information can be transferred reliably over the
channel.
(a) Show that for DMCs, feedback does not increase operational capacity:

    Cop,FB = Cop = max_{PX} I(X; Y).

Note that for a DMC with feedback, property (4.2.1) does not hold since
current inputs depend on past outputs. However, by the memoryless nature
of the channel, we assume the following causality Markov chain
condition:
(W, X i−1 , Y i−1 ) → X i → Yi (4.7.3)

for the channel, which is a simplified version of (4.7.2), see also [415,
Definition 7.4]. Condition (4.7.3) can be seen as a generalized definition of
a DMC used with or without feedback coding.
(b) Consider the q-ary channel of Problem 4.27 with stationary ergodic additive
noise. Assume that the noise process is independent of the message W .28
Show that although this channel has memory, feedback does not increase its
operational capacity:

    Cop,FB = Cop = log2 q − lim_{n→∞} (1/n) H(Z^n).
We point out that the classical Gilbert–Elliott burst noise channel [108, 145,
277] is a special instance of this channel.

28 This intrinsically natural assumption, which is equivalent to requiring that the channel input and

noise processes are independent of each other when no feedback is present, ensures that (4.7.2)
holds for this channel.

Note: Result (a) is due to Shannon [343]. Even though feedback does not help
increase capacity for a DMC, it can have several benefits, such as simplifying
the coding scheme and speeding up the rate at which the error probability of good
codes decays to zero (e.g., see [87, 284]). Result (b), which was shown in [9]
for arbitrary additive noise processes with memory, stems from the fact that the
channel has a symmetry property in the sense that a uniform input maximizes the
mutual information between channel input and output tuples. Similar results for
channels with memory satisfying various symmetry properties have appeared
in [11, 292, 333, 361]. However, it can be shown that for channels with memory
and asymmetric structures,
Cop,F B > Cop ,

see, for example, [415, Problem 7.12] and [16] where average input costs are
imposed.
We further point out that for information stable channels, the feedback opera-
tional capacity is given by [215, 291, 378]

    Cop,FB = lim inf_{n→∞} max_{P_{X^n‖Y^{n−1}}} (1/n) I(X^n → Y^n),       (4.7.4)

where for x n ∈ X n and y n−1 ∈ Y n−1 ,


    P_{X^n‖Y^{n−1}}(x^n‖y^{n−1}) := ∏_{i=1}^{n} P_{Xi|X^{i−1}Y^{i−1}}(xi | x^{i−1}, y^{i−1})

is a causal conditional probability that represents feedback strategies and


    I(X^n → Y^n) := Σ_{i=1}^{n} I(Yi; X^i | Y^{i−1})

is a causal version of the mutual information between tuples, known as directed


information [233, 261, 265]. It can be verified that (4.7.4) reduces to (4.7.1)
in the absence of feedback.29 Finally, an alternative (albeit more complicated)
expression to (4.7.4), that uses the standard mutual information, is given by [72]

    Cop,FB = lim inf_{n→∞} max_{f^n} (1/n) I(W; Y^n),                      (4.7.5)

where W is uniformly distributed over {1, 2 . . . , Mn } and the maximization


is taken over all feedback encoding functions f n = ( f 1 , f 2 , . . . , f n ), where

29 For arbitrary channels with memory, a generalized expression for Cop,FB is established in [376,
378] in terms of a generalized (spectral) directed information rate.

f i : {1, 2 . . . , Mn } × Y i−1 → X for i = 1, 2, . . . , n (note that the optimization


over f n in (4.7.5) necessitates optimizing over Mn ).30
29. Suppose you wish to encode a binary DMS with PX (0) = 3/4 using a rate-1
source-channel block code for transmission over a BEC with erasure probability
α. For what values of α, can the source be recovered reliably (i.e., with arbitrarily
low error probability) at the receiver?
30. Consider the binary Polya contagion Markov source of memory two treated
in Problem 3.10; see also Example 3.17 with M = 2. We are interested in
sending this source over the BSC with crossover probability ε using rate-Rsc
block source-channel codes.
(a) Write down the sufficient condition for reliable transmissibility of the source
over the BSC via rate-Rsc source-channel codes in terms of ε, Rsc and the
source parameters ρ := R/T and δ := Δ/T.
(b) If ρ = δ = 1/2 and ε = 1/4, determine the permissible range of rates Rsc
for reliably communicating the source over the channel.
31. Consider a DMC with input alphabet X = {0, 1, 2, 3, 4}, output alphabet Y =
{0, 1, 2, 3, 4, 5} and the following transition matrix
    Q = [ 1 − 2α   α        α        0       0       0
          α        α        1 − 2α   0       0       0
          0        0        0        1 − β   β/2     β/2
          0        0        0        β/2     1 − β   β/2
          0        0        0        β/2     β/2     1 − β ],

where 0 < α < 1/2 and 0 < β < 1.


(a) Determine the capacity C of this channel in terms of α and β. What is the
maximizing input distribution that achieves capacity?
(b) Find the values of α and β that will yield the smallest possible value of C.

(c) Show that any (not necessarily memoryless) binary source {Ui}∞i=1 with
arbitrary distribution can be sent without any loss via a rate-one source-
channel code over the channel with the parameters α and β obtained in
part (b).

30 This result was actually shown in [72] for general channels with memory in terms of a generalized

(spectral) mutual information rate.


Chapter 5
Differential Entropy and Gaussian
Channels

We have so far examined information measures and their operational characteriza-


tion for discrete-time discrete-alphabet systems. In this chapter, we turn our focus
to continuous-alphabet (real-valued) systems. Except for a brief interlude with the
continuous-time (waveform) Gaussian channel, we consider discrete-time systems,
as treated throughout the book.
We first recall that a real-valued (continuous) random variable X is described by
its cumulative distribution function (cdf)

FX (x) := Pr[X ≤ x]

for x ∈ R, the set of real numbers. The distribution of X is called absolutely contin-
uous (with respect to the Lebesgue measure) if a probability density function (pdf)
fX(·) exists such that

    FX(x) = ∫_{−∞}^{x} fX(t) dt,

where fX(t) ≥ 0 ∀ t and ∫_{−∞}^{+∞} fX(t) dt = 1. If FX(·) is differentiable everywhere,
then the pdf fX(·) exists and is given by the derivative of FX(·): fX(t) = dFX(t)/dt.
The support of a random variable X with pdf f X (·) is denoted by S X and can be
conveniently given as
S X = {x ∈ R : f X (x) > 0}.

We will deal with random variables that admit a pdf.1

1 A rigorous (measure-theoretic) study for general continuous systems, initiated by Kolmogorov


[222], can be found in [196, 303].


5.1 Differential Entropy

Recall that the definition of entropy for a discrete random variable X representing a
DMS is

    H(X) := − Σ_{x∈X} PX(x) log2 PX(x)   (in bits).

As already seen in Shannon’s source coding theorem, this quantity is the minimum
average code rate achievable for the lossless compression of the DMS. But if the
random variable takes on values in a continuum, the minimum number of bits per
symbol needed to losslessly describe it must be infinite. This is illustrated in the
following example, where we take a discrete approximation (quantization) of a ran-
dom variable uniformly distributed on the unit interval and study the entropy of the
quantized random variable as the quantization becomes finer and finer.

Example 5.1 Consider a real-valued random variable X that is uniformly distributed


on the unit interval, i.e., with pdf given by

    fX(x) = 1 if x ∈ [0, 1);  0 otherwise.

Given a positive integer m, we can discretize X by uniformly quantizing it into m
levels by partitioning the support of X into equal-length segments of size Δ = 1/m
(Δ is called the quantization step-size) such that

    qm(X) = i/m,   if (i − 1)/m ≤ X < i/m,

for 1 ≤ i ≤ m. Then, the entropy of the quantized random variable qm(X) is given by

    H(qm(X)) = − Σ_{i=1}^{m} (1/m) log2(1/m) = log2 m   (in bits).

Since the entropy H (qm (X )) of the quantized version of X is a lower bound to the
entropy of X (as qm (X ) is a function of X ) and satisfies in the limit

    lim_{m→∞} H(qm(X)) = lim_{m→∞} log2 m = ∞,

we obtain that the entropy of X is infinite.

The above example indicates that to compress a continuous source without incur-
ring any loss or distortion indeed requires an infinite number of bits. Thus when
studying continuous sources, the entropy measure is limited in its effectiveness and
the introduction of a new measure is necessary. Such a new measure is indeed obtained

upon close examination of the entropy of a uniformly quantized real-valued random


variable minus the quantization accuracy as the accuracy increases without bound.

Lemma 5.2 Consider a real-valued random variable X with support [a, b) and
pdf fX such that −fX log2 fX is integrable2 (where −∫_a^b fX(x) log2 fX(x) dx is
finite). Then a uniform quantization of X with an n-bit accuracy (i.e., with a
quantization step-size of Δ = 2^−n) yields an entropy approximately equal to
−∫_a^b fX(x) log2 fX(x) dx + n bits for n sufficiently large. In other words,

    lim_{n→∞} [H(qn(X)) − n] = −∫_a^b fX(x) log2 fX(x) dx,

where qn(X) is the uniformly quantized version of X with quantization step-size
Δ = 2^−n.

 = 2−n .

Proof
Step 1: Mean value theorem. Let Δ = 2^−n be the quantization step-size, and let

    ti := a + iΔ,  i = 0, 1, . . . , j − 1;   tj := b,

where j = (b − a)2^n. From the mean value theorem (e.g., cf. [262]), we can
choose xi ∈ [ti−1, ti] for 1 ≤ i ≤ j such that

    pi := ∫_{ti−1}^{ti} fX(x) dx = fX(xi)(ti − ti−1) = Δ · fX(xi).

Step 2: Definition of h^(n)(X). Let

    h^(n)(X) := − Σ_{i=1}^{j} [fX(xi) log2 fX(xi)] 2^−n.

Since −fX(x) log2 fX(x) is integrable,

    h^(n)(X) → −∫_a^b fX(x) log2 fX(x) dx   as n → ∞.

Therefore, given any ε > 0, there exists N such that for all n > N,

    | −∫_a^b fX(x) log2 fX(x) dx − h^(n)(X) | < ε.

2 By integrability, we mean the usual Riemann integrability (e.g., see [323]).



Step 3: Computation of H(qn(X)). The entropy of the (uniformly) quantized
version of X, qn(X), is given by

    H(qn(X)) = − Σ_{i=1}^{j} pi log2 pi
             = − Σ_{i=1}^{j} (Δ fX(xi)) log2 (Δ fX(xi))
             = − Σ_{i=1}^{j} (fX(xi) 2^−n) log2 (fX(xi) 2^−n),

where the pi ’s are the probabilities of the different values of qn (X ).


Step 4: H (qn (X )) − h (n) (X ).
From Steps 2 and 3,


j
H (qn (X )) − h (n) (X ) = − [ f X (xi )2−n ] log2 (2−n )
i=1
j 
 ti
=n f X (x)d x
i=1 ti−1
 b
=n f X (x)d x = n.
a

Hence, we have that for n > N ,


 b

− f X (x) log2 f X (x)d x + n − ε < H (qn (X ))


a
= h (n) (X ) + n
 b

< − f X (x) log2 f X (x)d x + n + ε,


a

yielding that
 b
lim [H (qn (X )) − n] = − f X (x) log2 f X (x)d x.
n→∞ a

More generally, the following result due to Rényi [316] can be shown for (absolutely
continuous) random variables with arbitrary support.

Theorem 5.3 [316, Theorem 1] For any real-valued random variable with pdf
fX, if − Σ_i pi log2 pi is finite, where the (possibly countably many) pi’s are the
probabilities of the different values of the uniformly quantized qn(X) over support
SX, then

    lim_{n→∞} [H(qn(X)) − n] = −∫_{SX} fX(x) log2 fX(x) dx

provided the integral on the right-hand side exists.

In light of the above results, we can define the following information measure [340]:

Definition 5.4 (Differential entropy) The differential entropy (in bits) of a continu-
ous random variable X with pdf f X and support S X is defined as

    h(X) := −∫_{SX} fX(x) · log2 fX(x) dx = E[− log2 fX(X)],

when the integral exists.

Thus, the differential entropy h(X ) of a real-valued random variable X has an


operational meaning in the following sense. Since H (qn (X )) is the minimum average
number of bits needed to losslessly describe qn(X), we thus obtain that approximately
h(X) + n bits are needed to describe X when uniformly quantizing it with an n-bit
accuracy. Therefore, we may conclude that the larger h(X ) is, the larger is the average
number of bits required to describe a uniformly quantized X within a fixed accuracy.

Example 5.5 A continuous random variable X with support S X = [0, 1) and pdf
f X (x) = 2x for x ∈ S X has differential entropy equal to
    ∫_0^1 −2x · log2(2x) dx = [ x^2 (log2 e − 2 log2(2x)) / 2 ]_0^1
                            = 1/(2 ln 2) − log2(2) = −0.278652 bits.
We herein illustrate Lemma 5.2 by uniformly quantizing X to an n-bit accuracy and
computing the entropy H (qn (X )) and H (qn (X )) − n for increasing values of n,
where qn (X ) is the quantized version of X .
We have that qn(X) is given by

    qn(X) = i/2^n,   if (i − 1)/2^n ≤ X < i/2^n,

for 1 ≤ i ≤ 2^n. Hence,

    Pr[qn(X) = i/2^n] = (2i − 1)/2^{2n},

which yields

    H(qn(X)) = − Σ_{i=1}^{2^n} [(2i − 1)/2^{2n}] log2 [(2i − 1)/2^{2n}]
             = − (1/2^{2n}) Σ_{i=1}^{2^n} (2i − 1) log2(2i − 1) + 2 log2(2^n).

As shown in Table 5.1, we indeed observe that as n increases, H (qn (X )) tends to


infinity while H (qn (X )) − n converges to h(X ) = −0.278652 bits.
Thus, a continuous random variable X contains an infinite amount of information;
but we can measure the information contained in its n-bit quantized version qn (X )
as: H (qn (X )) ≈ h(X ) + n (for n large enough).

Example 5.6 Let us determine the minimum average number of bits required to
describe the uniform quantization with 3-digit accuracy of the decay time (in years)
of a radium atom assuming that the half-life of the radium (i.e., the median of the
decay time) is 80 years and that its pdf is given by f X (x) = λe−λx , where x > 0.
Since the median of the decay time is 80, we obtain
    ∫_0^{80} λ e^{−λx} dx = 0.5,

which implies that λ = 0.00866. Also, 3-digit accuracy is approximately equivalent


to log2 999 = 9.96 ≈ 10 bits accuracy. Therefore, by Theorem 5.3, the number of
bits required to describe the quantized decay time is approximately
    h(X) + 10 = log2(e/λ) + 10 = 18.29 bits.
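A minimal Python sketch verifying the numbers in this example (λ from the 80-year median and the resulting description length), using the exponential differential entropy log2(e/λ) as given above:

    # Numerical check of Example 5.6: lambda from the 80-year median of the decay
    # time, and the approximate number of bits for a 10-bit-accuracy description.
    from math import log, log2, e

    half_life = 80.0
    lam = log(2) / half_life            # solves: integral_0^80 lam*exp(-lam*x) dx = 0.5
    h_X = log2(e / lam)                 # differential entropy of the exponential pdf, in bits
    bits_accuracy = 10                  # 3-digit accuracy ~ log2(999) ~ 10 bits
    print(f"lambda         = {lam:.5f}")                         # ~ 0.00866
    print(f"h(X) + 10 bits = {h_X + bits_accuracy:.2f} bits")    # ~ 18.29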

Table 5.1 Quantized random variable qn (X ) under an n-bit accuracy: H (qn (X )) and H (qn (X ))−n
versus n
n H (qn (X )) H (qn (X )) − n
1 0.811278 bits −0.188722 bits
2 1.748999 bits −0.251000 bits
3 2.729560 bits −0.270440 bits
4 3.723726 bits −0.276275 bits
5 4.722023 bits −0.277977 bits
6 5.721537 bits −0.278463 bits
7 6.721399 bits −0.278600 bits
8 7.721361 bits −0.278638 bits
9 8.721351 bits −0.278648 bits
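The entries of Table 5.1 follow directly from the closed-form expression for H(qn(X)) derived above; the short Python sketch below reproduces them and exhibits the convergence of H(qn(X)) − n to h(X):

    # Reproduces Table 5.1 for f_X(x) = 2x on [0, 1), using p_i = (2i - 1)/2^(2n).
    from math import log, log2

    def H_qn(n: int) -> float:
        """Entropy (in bits) of the uniformly quantized X with n-bit accuracy."""
        probs = [(2 * i - 1) / 2 ** (2 * n) for i in range(1, 2 ** n + 1)]
        return -sum(p * log2(p) for p in probs)

    h_exact = 1 / (2 * log(2)) - 1      # h(X) = 1/(2 ln 2) - log2(2) = -0.278652 bits
    print(f"closed-form h(X) = {h_exact:.6f} bits")
    for n in range(1, 10):
        H = H_qn(n)
        print(f"n = {n}:  H(q_n(X)) = {H:.6f} bits,  H(q_n(X)) - n = {H - n:.6f} bits")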

We close this section by computing the differential entropy for two common real-
valued random variables: the uniformly distributed random variable and the Gaussian
distributed random variable.
Example 5.7 (Differential entropy of a uniformly distributed random variable) Let
X be a continuous random variable that is uniformly distributed over the interval
(a, b), where b > a; i.e., its pdf is given by

    fX(x) = 1/(b − a) if x ∈ (a, b);  0 otherwise.

So its differential entropy is given by

    h(X) = −∫_a^b [1/(b − a)] log2 [1/(b − a)] dx = log2(b − a) bits.

Note that if (b − a) < 1 in the above example, then h(X ) is negative, unlike
entropy. The above example indicates that although differential entropy has a form
analogous to entropy (in the sense that summation and pmf for entropy are replaced
by integration and pdf, respectively, for differential entropy), differential entropy
does not retain all the properties of entropy (one such operational difference was
already highlighted in the previous lemma and theorem).3
Example 5.8 (Differential entropy of a Gaussian random variable) Let X ∼
N (μ, σ 2 ); i.e., X is a Gaussian (or normal) random variable with finite mean μ,
variance Var(X ) = σ 2 > 0 and pdf

    fX(x) = (1/√(2πσ^2)) e^{−(x−μ)^2/(2σ^2)}

for x ∈ R. Then, its differential entropy is given by

    h(X) = ∫_R fX(x) [ (1/2) log2(2πσ^2) + ((x − μ)^2/(2σ^2)) log2 e ] dx
         = (1/2) log2(2πσ^2) + (log2 e/(2σ^2)) E[(X − μ)^2]
         = (1/2) log2(2πσ^2) + (1/2) log2 e
         = (1/2) log2(2πeσ^2) bits.                                        (5.1.1)
Note that for a Gaussian random variable, its differential entropy is only a function of
its variance σ^2 (it is independent of its mean μ). This is similar to the differential

3 By contrast, entropy and differential entropy are sometimes called discrete entropy and continuous

entropy, respectively.

entropy of a uniform random variable, which only depends on the difference (b − a) but


not the mean (a + b)/2.
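As a numerical sanity check on (5.1.1), one can estimate h(X) = E[−log2 fX(X)] by Monte Carlo and compare it with the closed form; the minimal sketch below does so for two different means (the variance and sample size are arbitrary choices), illustrating that the value depends only on σ^2:

    # Monte Carlo estimate of h(X) = E[-log2 f_X(X)] for X ~ N(mu, sigma^2),
    # compared with the closed form (1/2) log2(2*pi*e*sigma^2); parameters are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000_000
    sigma = 2.0
    closed_form = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)

    for mu in (0.0, 10.0):              # the estimate should not depend on the mean
        x = rng.normal(mu, sigma, N)
        pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
        estimate = np.mean(-np.log2(pdf))
        print(f"mu = {mu:5.1f}:  MC estimate {estimate:.4f} bits  vs  closed form {closed_form:.4f} bits")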

5.2 Joint and Conditional Differential Entropies,


Divergence, and Mutual Information

Definition 5.9 (Joint differential entropy) If X n = (X 1 , X 2 , . . . , X n ) is a continuous


random vector of size n (i.e., a vector of n continuous random variables) with joint
pdf f X n and support S X n ⊆ Rn , then its joint differential entropy is defined as

    h(X^n) := −∫_{S_{X^n}} fX^n(x1, x2, . . . , xn) log2 fX^n(x1, x2, . . . , xn) dx1 dx2 · · · dxn
            = E[− log2 fX^n(X^n)]

when the n-dimensional integral exists.


Definition 5.10 (Conditional differential entropy) Let X and Y be two jointly dis-
tributed continuous random variables with joint pdf4 f X,Y and support S X,Y ⊆ R2
such that the conditional pdf of Y given X , given by

    fY|X(y|x) = fX,Y(x, y) / fX(x),

is well defined for all (x, y) ∈ S X,Y , where f X is the marginal pdf of X . Then, the
conditional differential entropy of Y given X is defined as

    h(Y|X) := −∫_{S_{X,Y}} fX,Y(x, y) log2 fY|X(y|x) dx dy = E[− log2 fY|X(Y|X)],

when the integral exists.


Note that as in the case of (discrete) entropy, the chain rule holds for differential
entropy:
h(X, Y ) = h(X ) + h(Y |X ) = h(Y ) + h(X |Y ).

Definition 5.11 (Divergence or relative entropy) Let X and Y be two continuous


random variables with marginal pdfs f X and f Y , respectively, such that their supports
satisfy S X ⊆ SY ⊆ R. Then, the divergence (or relative entropy or Kullback–Leibler
distance) between X and Y is written as D(X Y ) or D( f X  f Y ) and defined by


f X (x) f X (X )
D(X Y ) := f X (x) log2 dx = E
SX f Y (x) f Y (X )

4 Note that the joint pdf f X,Y is also commonly written as f X Y .



when the integral exists. The definition carries over similarly in the multivariate
case: for X n = (X 1 , X 2 , . . . , X n ) and Y n = (Y1 , Y2 , . . . , Yn ) two random vectors
with joint pdfs f X n and f Y n , respectively, and supports satisfying S X n ⊆ SY n ⊆ Rn ,
the divergence between X n and Y n is defined as

    D(X^n‖Y^n) := ∫_{S_{X^n}} fX^n(x1, . . . , xn) log2 [fX^n(x1, . . . , xn)/fY^n(x1, . . . , xn)] dx1 dx2 · · · dxn

when the integral exists.


Definition 5.12 (Mutual information) Let X and Y be two jointly distributed contin-
uous random variables with joint pdf f X,Y and support S X Y ⊆ R2 . Then, the mutual
information between X and Y is defined by

    I(X; Y) := D(fX,Y ‖ fX fY) = ∫_{S_{X,Y}} fX,Y(x, y) log2 [fX,Y(x, y)/(fX(x) fY(y))] dx dy,

assuming the integral exists, where f X and f Y are the marginal pdfs of X and Y ,
respectively.
Observation 5.13 For two jointly distributed continuous random variables X and
Y with joint pdf f X,Y , support S X Y ⊆ R2 and joint differential entropy

    h(X, Y) = −∫_{S_{X,Y}} fX,Y(x, y) log2 fX,Y(x, y) dx dy,

then as in Lemma 5.2 and the ensuing discussion, one can write

H (qn (X ), qm (Y )) ≈ h(X, Y ) + n + m

for n and m sufficiently large, where qk (Z ) denotes the (uniformly) quantized version
of random variable Z with a k-bit accuracy.
On the other hand, for the above continuous X and Y ,

I (qn (X ); qm (Y )) = H (qn (X )) + H (qm (Y )) − H (qn (X ), qm (Y ))


≈ [h(X ) + n] + [h(Y ) + m] − [h(X, Y ) + n + m]
= h(X ) + h(Y ) − h(X, Y )

    = ∫_{S_{X,Y}} fX,Y(x, y) log2 [fX,Y(x, y)/(fX(x) fY(y))] dx dy

for n and m sufficiently large; in other words,

    lim_{n,m→∞} I(qn(X); qm(Y)) = h(X) + h(Y) − h(X, Y).

Furthermore, it can be shown that



    lim_{n→∞} D(qn(X)‖qn(Y)) = ∫_{SX} fX(x) log2 [fX(x)/fY(x)] dx.

Thus, mutual information and divergence can be considered as the true tools of
information theory, as they retain the same operational characteristics and properties
for both discrete and continuous probability spaces (as well as general spaces where
they can be defined in terms of Radon–Nikodym derivatives (e.g., cf. [196]).5
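The convergence of I(qn(X); qn(Y)) to h(X) + h(Y) − h(X, Y) can also be illustrated numerically. The sketch below uses jointly Gaussian samples with correlation ρ = 0.5 (an arbitrary choice), for which h(X) + h(Y) − h(X, Y) = −(1/2) log2(1 − ρ^2); the quantized mutual information is estimated from a two-dimensional histogram and therefore carries some Monte Carlo and binning bias:

    # Estimates I(q_n(X); q_n(Y)) for jointly Gaussian (X, Y) with correlation rho
    # and compares it with h(X) + h(Y) - h(X, Y) = -(1/2) log2(1 - rho^2).
    import numpy as np

    rng = np.random.default_rng(1)
    rho, N = 0.5, 1_000_000
    x = rng.standard_normal(N)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(N)
    limit = -0.5 * np.log2(1 - rho ** 2)

    for n_bits in (1, 2, 3):
        delta = 2.0 ** (-n_bits)                       # quantization step-size
        edges_x = np.arange(x.min(), x.max() + delta, delta)
        edges_y = np.arange(y.min(), y.max() + delta, delta)
        counts, _, _ = np.histogram2d(x, y, bins=[edges_x, edges_y])
        pxy = counts / counts.sum()                    # joint pmf of the quantized pair
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        mi = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
        print(f"{n_bits}-bit quantization: I estimate = {mi:.4f} bits  (limit {limit:.4f} bits)")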

The following lemma illustrates that for continuous systems, I(·; ·) and D(·‖·)
keep the same properties already encountered for discrete systems, while differential
entropy (as already seen with its possibility of being negative) satisfies some different
properties from entropy. The proof is left as an exercise.

Lemma 5.14 The following properties hold for the information measures of contin-
uous systems.
1. Nonnegativity of divergence: Let X and Y be two continuous random variables
with marginal pdfs f X and f Y , respectively, such that their supports satisfy
S X ⊆ SY ⊆ R. Then
D(fX‖fY) ≥ 0,

with equality iff f X (x) = f Y (x) for all x ∈ S X except in a set of f X -measure
zero (i.e., X = Y almost surely).
2. Nonnegativity of mutual information: For any two continuous jointly dis-
tributed random variables X and Y ,

I (X ; Y ) ≥ 0,

with equality iff X and Y are independent.


3. Conditioning never increases differential entropy: For any two continuous
random variables X and Y with joint pdf f X,Y and well-defined conditional pdf
f X |Y ,
h(X |Y ) ≤ h(X ),

with equality iff X and Y are independent.


4. Chain rule for differential entropy: For a continuous random vector X n =
(X 1 , X 2 , . . . , X n ),


    h(X1, X2, . . . , Xn) = Σ_{i=1}^{n} h(Xi | X1, X2, . . . , Xi−1),

where h(X i |X 1 , X 2 , . . . , X i−1 ) := h(X 1 ) for i = 1.

5 This justifies using identical notations for both I(·; ·) and D(·‖·) as opposed to the discerning
notations of H (·) for entropy and h(·) for differential entropy.

5. Chain rule for mutual information: For continuous random vector X n =


(X 1 , X 2 , . . . , X n ) and random variable Y with joint pdf f X n ,Y and well-defined
conditional pdfs f X i ,Y |X i−1 , f X i |X i−1 and f Y |X i−1 for i = 1, . . . , n, we have that


    I(X1, X2, . . . , Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1),

where I (X i ; Y |X i−1 , . . . , X 1 ) := I (X 1 ; Y ) for i = 1.


6. Data processing inequality: For continuous random variables X , Y , and Z
such that X → Y → Z, i.e., X and Z are conditionally independent given Y,

I (X ; Y ) ≥ I (X ; Z ).

7. Independence bound for differential entropy: For a continuous random vector


X n = (X 1 , X 2 , . . . , X n ),

    h(X^n) ≤ Σ_{i=1}^{n} h(Xi),

with equality iff all the X i ’s are independent from each other.
8. Invariance of differential entropy under translation: For continuous random
variables X and Y with joint pdf f X,Y and well-defined conditional pdf f X |Y ,

h(X + c) = h(X ) for any constant c ∈ R,

and
h(X + Y |Y ) = h(X |Y ).

The results also generalize in the multivariate case: for two continuous random
vectors X n = (X 1 , X 2 , . . . , X n ) and Y n = (Y1 , Y2 , . . . , Yn ) with joint pdf f X n ,Y n
and well-defined conditional pdf f X n |Y n ,

h(X n + cn ) = h(X n )

for any constant n-tuple cn = (c1 , c2 , . . . , cn ) ∈ Rn , and

h(X n + Y n |Y n ) = h(X n |Y n ),

where the addition of two n-tuples is performed component-wise.


9. Differential entropy under scaling: For any continuous random variable X
and any nonzero real constant a,

h(a X ) = h(X ) + log2 |a|.



10. Joint differential entropy under linear mapping: Consider the random (col-
umn) vector X = (X 1 , X 2 , . . . , X n )T with joint pdf f X n , where T denotes trans-
position, and let Y = (Y1 , Y2 , . . . , Yn )T be a random (column) vector obtained
from the linear transformation Y = AX , where A is an invertible (non-singular)
n × n real-valued matrix. Then

h(Y ) = h(Y1 , Y2 , . . . , Yn ) = h(X 1 , X 2 , . . . , X n ) + log2 |det(A)|,

where det(A) is the determinant of the square matrix A.


11. Joint differential entropy under nonlinear mapping: Consider the ran-
dom (column) vector X = (X 1 , X 2 , . . . , X n )T with joint pdf f X n , and let
Y = (Y1 , Y2 , . . . , Yn )T be a random (column) vector obtained from the non-
linear transformation

Y = g(X ) := (g1 (X 1 ), g2 (X 2 ), . . . , gn (X n ))T ,

where each gi : R → R is a differentiable function, i = 1, 2, . . . , n. Then

h(Y ) = h(Y1 , Y2 , . . . , Yn )

= h(X 1 , . . . , X n ) + f X n (x1 , . . . , xn ) log2 |det(J)| d x1 · · · d xn ,
Rn

where J is the n × n Jacobian matrix given by


    J := [ ∂g1/∂x1   ∂g1/∂x2   · · ·   ∂g1/∂xn
           ∂g2/∂x1   ∂g2/∂x2   · · ·   ∂g2/∂xn
           ...       ...       · · ·   ...
           ∂gn/∂x1   ∂gn/∂x2   · · ·   ∂gn/∂xn ].

Observation 5.15 Property 9 of the above Lemma indicates that for a continuous
random variable X, h(X) ≠ h(aX) (except for the trivial case of a = 1) and
hence differential entropy is not in general invariant under invertible maps. This is in
contrast to entropy, which is always invariant under invertible maps: given a discrete
random variable X with alphabet X ,

H ( f (X )) = H (X )

for all invertible maps f : X → Y, where Y is a discrete set; in particular H (a X ) =


H (X ) for all nonzero reals a.
On the other hand, for both discrete and continuous systems, mutual information
and divergence are invariant under invertible maps:

I (X ; Y ) = I (g(X ); Y ) = I (g(X ); h(Y ))



and

   D(X ‖ Y) = D(g(X) ‖ g(Y))

for all invertible maps g and h properly defined on the alphabet/support of the con-
cerned random variables. This reinforces the notion that mutual information and
divergence constitute the true tools of information theory.
Definition 5.16 (Multivariate Gaussian) A continuous random vector X = (X 1 ,
X 2 , . . . , X n )T is called a size-n (multivariate) Gaussian random vector with a finite
mean vector μ := (μ1 , μ2 , . . . , μn )T , where μi := E[X i ] < ∞ for i = 1, 2, . . . , n,
and an n × n invertible (real-valued) covariance matrix

K X = [K i, j ]
:= E[(X − μ)(X − μ)T ]
⎡ ⎤
Cov(X 1 , X 1 ) Cov(X 1 , X 2 ) · · · Cov(X 1 , X n )
⎢Cov(X 2 , X 1 ) Cov(X 2 , X 2 ) · · · Cov(X 2 , X n )⎥
⎢ ⎥
= ⎢ .. .. .. ⎥,
⎣ . . ··· . ⎦
Cov(X n , X 1 ) Cov(X n , X 2 ) · · · Cov(X n , X n )

where K i, j = Cov(X i , X j ) := E[(X i − μi )(X j − μ j )] is the covariance6 between


X i and X j for i, j = 1, 2, . . . , n, if its joint pdf is given by the multivariate Gaussian
pdf

   f_{X^n}(x_1, x_2, \ldots, x_n) = \frac{1}{(\sqrt{2\pi})^n \sqrt{\det(K_X)}}\, e^{-\frac{1}{2}(x-\mu)^T K_X^{-1}(x-\mu)}

for any (x1 , x2 , . . . , xn ) ∈ Rn , where x = (x1 , x2 , . . . , xn )T . As in the scalar case


(i.e., for n = 1), we write X ∼ Nn (μ, K X ) to denote that X is a size-n Gaussian
random vector with mean vector μ and covariance matrix K X .
Observation 5.17 In light of the above definition, we make the following remarks.
1. Note that a covariance matrix K is always symmetric (i.e., K T = K) and positive-
semidefinite.7 But as we require K X to be invertible in the definition of the multi-
variate Gaussian distribution above, we will hereafter assume that the covariance

6 Note that the diagonal components of K X yield the variance of the different random variables:
K i,i = Cov(X i , X i ) = Var(X i ) = σ 2X i , i = 1, . . . , n.
7 An n × n real-valued symmetric matrix K is positive-semidefinite (e.g., cf. [128]) if for every
real-valued vector x = (x_1, x_2, \ldots, x_n)^T,

   x^T K x = (x_1, \ldots, x_n) K (x_1, \ldots, x_n)^T \ge 0.

Furthermore, the matrix is positive-definite if x^T K x > 0 for all real-valued vectors x \ne 0, where 0 is
the all-zero vector of size n.

matrix of Gaussian random vectors is positive-definite (which is equivalent to


having all the eigenvalues of K X positive), thus rendering the matrix invertible.
2. If a random vector X = (X 1 , X 2 , . . . , X n )T has a diagonal covariance matrix
K_X (i.e., all the off-diagonal components of K_X are zero: K_{i,j} = 0 for all i ≠ j,
i, j = 1, \ldots, n), then all its component random variables are uncorrelated but
not necessarily independent. However, if X is Gaussian and has a diagonal
covariance matrix, then all its component random variables are independent
from each other.
3. Any linear transformation of a Gaussian random vector yields another Gaussian
random vector. Specifically, if X ∼ Nn (μ, K X ) is a size-n Gaussian random
vector with mean vector μ and covariance matrix K X , and if Y = Amn X , where
Amn is a given m × n real-valued matrix, then

   Y ∼ N_m(A_{mn} μ, A_{mn} K_X A_{mn}^T)

is a size-m Gaussian random vector with mean vector A_{mn} μ and covariance
matrix A_{mn} K_X A_{mn}^T.
More generally, any affine transformation of a Gaussian random vector yields
another Gaussian random vector: if X ∼ N_n(μ, K_X) and Y = A_{mn} X + b_m,
where A_{mn} is an m × n real-valued matrix and b_m is a size-m real-valued vector,
then

   Y ∼ N_m(A_{mn} μ + b_m, A_{mn} K_X A_{mn}^T).

Theorem 5.18 (Joint differential entropy of the multivariate Gaussian) If X ∼


Nn (μ, K X ) is a Gaussian random vector with mean vector μ and (positive-definite)
covariance matrix K X , then its joint differential entropy is given by

   h(X) = h(X_1, X_2, \ldots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right].   (5.2.1)
In particular, in the univariate case of n = 1, (5.2.1) reduces to (5.1.1).

Proof Without loss of generality, we assume that X has a zero-mean vector since its
differential entropy is invariant under translation by Property 8 of Lemma 5.14:

h(X ) = h(X − μ);

so we assume that μ = 0.
Since the covariance matrix K_X is a real-valued symmetric matrix, it is
orthogonally diagonalizable; i.e., there exists a square (n × n) orthogonal matrix A
(i.e., satisfying A^T = A^{-1}) such that A K_X A^T is a diagonal matrix whose entries
are given by the eigenvalues of K_X (A is constructed using the eigenvectors of K_X;
e.g., see [128]). As a result, the linear transformation Y = AX ∼ N_n(0, A K_X A^T)

is a Gaussian vector with the diagonal covariance matrix KY = AK X AT and has


therefore independent components (as noted in Observation 5.17). Thus

   h(Y) = h(Y_1, Y_2, \ldots, Y_n)
        = h(Y_1) + h(Y_2) + \cdots + h(Y_n)                                        (5.2.2)
        = \sum_{i=1}^{n} \frac{1}{2}\log_2[2\pi e\,\mathrm{Var}(Y_i)]               (5.2.3)
        = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\prod_{i=1}^{n}\mathrm{Var}(Y_i)
        = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\det(K_Y)                    (5.2.4)
        = \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\det(K_X)                  (5.2.5)
        = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],                       (5.2.6)
where (5.2.2) follows by the independence of the random variables Y1 , . . . , Yn (e.g.,
see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds since
the matrix KY is diagonal and hence its determinant is given by the product of its
diagonal entries, and (5.2.5) holds since
   
   \det(K_Y) = \det(A K_X A^T)
             = \det(A)\det(K_X)\det(A^T)
             = \det(A)^2 \det(K_X)
             = \det(K_X),

where the last equality holds since (det(A))2 = 1, as the matrix A is orthogonal
(AT = A−1 =⇒ det(A) = det(AT ) = 1/[det(A)]; thus, det(A)2 = 1).
Now invoking Property 10 of Lemma 5.14 and noting that |det(A)| = 1 yield that

   h(Y_1, Y_2, \ldots, Y_n) = h(X_1, X_2, \ldots, X_n) + \underbrace{\log_2|\det(A)|}_{=0} = h(X_1, X_2, \ldots, X_n).

We therefore obtain using (5.2.6) that

   h(X_1, X_2, \ldots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],
hence completing the proof.
An alternate (but rather mechanical) proof to the one presented above consists
of directly evaluating the joint differential entropy of X by integrating − f X n (x n )
log2 f X n (x n ) over Rn ; it is left as an exercise.
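As a complement to the analytical proof, the following minimal numerical sketch (added here for illustration and not part of the text; it assumes NumPy and SciPy are available and uses an arbitrary randomly generated covariance matrix) compares the closed-form expression (5.2.1) with a Monte Carlo estimate of E[−log2 f_X(X)].

```python
# Numerical check of Theorem 5.18: h(X) = (1/2) log2[(2*pi*e)^n det(K_X)] for a
# multivariate Gaussian, versus a Monte Carlo estimate of E[-log2 f_X(X)].
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)          # an arbitrary positive-definite covariance matrix
mu = np.array([1.0, -2.0, 0.5])

closed_form = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

mvn = multivariate_normal(mean=mu, cov=K)
samples = mvn.rvs(size=100_000, random_state=2)
mc_estimate = -np.mean(mvn.logpdf(samples)) / np.log(2)   # convert nats to bits

print(f"closed form (5.2.1): {closed_form:.4f} bits")
print(f"Monte Carlo estimate: {mc_estimate:.4f} bits")
```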


Corollary 5.19 (Hadamard's inequality) For any real-valued n × n positive-definite
matrix K = [K_{i,j}]_{i,j=1,\ldots,n},

   \det(K) \le \prod_{i=1}^{n} K_{i,i},

with equality iff K is a diagonal matrix, where the K_{i,i} are the diagonal entries of K.

Proof Since every positive-definite matrix  is a covariance matrix (e.g., see [162]),
let X = (X 1 , X 2 , . . . , X n )T ∼ Nn 0, K be a jointly Gaussian random vector with
zero-mean vector and covariance matrix K. Then
   \frac{1}{2}\log_2\left[(2\pi e)^n \det(K)\right] = h(X_1, X_2, \ldots, X_n)                       (5.2.7)
                                                    \le \sum_{i=1}^{n} h(X_i)                         (5.2.8)
                                                    = \sum_{i=1}^{n} \frac{1}{2}\log_2[2\pi e\,\mathrm{Var}(X_i)]   (5.2.9)
                                                    = \frac{1}{2}\log_2\left[(2\pi e)^n \prod_{i=1}^{n} K_{i,i}\right],   (5.2.10)

where (5.2.7) follows from Theorem 5.18, (5.2.8) follows from Property 7 of
Lemma 5.14 and (5.2.9)–(5.2.10) hold using (5.1.1) along with the fact that each
random variable X i ∼ N (0, K i,i ) is Gaussian with zero mean and variance
Var(X i ) = K i,i for i = 1, 2, . . . , n (as the marginals of a multivariate Gaussian
are also Gaussian e.g., cf. [162]).
Finally, from (5.2.10), we directly obtain that
   \det(K) \le \prod_{i=1}^{n} K_{i,i},

with equality iff the jointly Gaussian random variables X 1 , X 2 , . . ., X n are inde-
pendent from each other, or equivalently iff the covariance matrix K is diagonal.
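A quick numerical illustration of Hadamard's inequality follows (an added sketch, not part of the text; the matrix below is an arbitrary randomly generated positive-definite example and NumPy is assumed).

```python
# Hadamard's inequality: det(K) <= prod_i K_ii for positive-definite K,
# with equality when K is diagonal.
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T + 0.1 * np.eye(n)                  # positive-definite

print(np.linalg.det(K), "<=", np.prod(np.diag(K)))   # strict inequality (off-diagonals nonzero)

K_diag = np.diag(np.diag(K))                   # keep only the diagonal entries
print(np.isclose(np.linalg.det(K_diag), np.prod(np.diag(K_diag))))   # equality case: True
```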

The next theorem states that among all real-valued size-n random vectors (of support
Rn ) with identical mean vector and covariance matrix, the Gaussian random vector
has the largest differential entropy.

Theorem 5.20 (Maximal differential entropy for real-valued random vectors) Let
X = (X 1 , X 2 , . . . , X n )T be a real-valued random vector with a joint pdf of support
S X n = Rn , mean vector μ, covariance matrix K X and finite joint differential entropy
h(X 1 , X 2 , . . . , X n ). Then

   h(X_1, X_2, \ldots, X_n) \le \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],   (5.2.11)

with equality iff X is Gaussian; i.e., X ∼ N_n(μ, K_X).

Proof We will present the proof in two parts: the scalar or univariate case, and the
multivariate case.
(i) Scalar case (n = 1): For a real-valued random variable with support S X = R,
mean μ and variance σ 2 , let us show that

   h(X) \le \frac{1}{2}\log_2(2\pi e\sigma^2),   (5.2.12)

with equality iff X ∼ N (μ, σ 2 ).


For a Gaussian random variable Y ∼ N (μ, σ 2 ), using the nonnegativity of diver-
gence, we can write

   0 \le D(X \| Y)
     = \int_{\mathbb{R}} f_X(x) \log_2 \frac{f_X(x)}{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}} \, dx
     = -h(X) + \int_{\mathbb{R}} f_X(x) \left[ \log_2\left(\sqrt{2\pi}\,\sigma\right) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e \right] dx
     = -h(X) + \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2} \underbrace{\int_{\mathbb{R}} (x-\mu)^2 f_X(x)\, dx}_{=\sigma^2}
     = -h(X) + \frac{1}{2}\log_2(2\pi e\sigma^2).
Thus

   h(X) \le \frac{1}{2}\log_2(2\pi e\sigma^2),

with equality iff X = Y (almost surely); i.e., X ∼ N (μ, σ 2 ).


(ii). Multivariate case (n > 1): As in the proof of Theorem 5.18, we can use an
orthogonal square matrix A (i.e., satisfying AT = A−1 and hence |det(A)| = 1) such
that AK X AT is diagonal. Therefore, the random vector generated by the linear map

Z = AX

will have a covariance matrix given by K Z = AK X AT and hence have uncorrelated


(but not necessarily independent) components. Thus

   h(X) = h(Z) - \underbrace{\log_2|\det(A)|}_{=0}                                   (5.2.13)
        = h(Z_1, Z_2, \ldots, Z_n)
        \le \sum_{i=1}^{n} h(Z_i)                                                     (5.2.14)
        \le \sum_{i=1}^{n} \frac{1}{2}\log_2[2\pi e\,\mathrm{Var}(Z_i)]               (5.2.15)
        = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\prod_{i=1}^{n}\mathrm{Var}(Z_i)
        = \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\det(K_Z)                    (5.2.16)
        = \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\det(K_X)                    (5.2.17)
        = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],
where (5.2.13) holds by Property 10 of Lemma 5.14 and since |det(A)| = 1, (5.2.14)
follows from Property 7 of Lemma 5.14, (5.2.15) follows from (5.2.12) (the scalar
case above), (5.2.16) holds since K_Z is diagonal, and (5.2.17) follows from the fact
that \det(K_Z) = \det(K_X) (as A is orthogonal). Finally, equality is achieved in both
(5.2.14) and (5.2.15) iff the random variables Z_1, Z_2, \ldots, Z_n are Gaussian and
independent from each other, or equivalently iff X ∼ N_n(μ, K_X).

Observation 5.21 (Examples of maximal differential entropy under various con-


straints) The following three results can also be shown (the proof is left as an
exercise):
1. Among all continuous random variables admitting a pdf with support the interval
(a, b), where b > a are real numbers, the uniformly distributed random variable
maximizes differential entropy.
2. Among all continuous random variables admitting a pdf with support the inter-
val [0, ∞), finite mean μ, and finite differential entropy, the exponentially dis-
tributed random variable with parameter (or rate parameter) λ = 1/μ maximizes
differential entropy.
3. Among all continuous random variables admitting a pdf with support R, finite
mean μ, and finite differential entropy and satisfying E[|X − μ|] = λ, where
λ > 0 is a fixed finite parameter, the Laplacian random variable with mean μ,
variance 2λ2 and pdf

   f_X(x) = \frac{1}{2\lambda}\, e^{-\frac{|x-\mu|}{\lambda}} \quad \text{for } x \in \mathbb{R}

maximizes differential entropy.

A systematic approach to finding distributions that maximize differential entropy


subject to various support and moments constraints can be found in [83, 415].
Observation 5.22 (Information rates for stationary Gaussian sources) We close this
section by noting that for stationary zero-mean Gaussian processes {X i } and { X̂ i }, the
differential entropy rate, \lim_{n\to\infty} \frac{1}{n} h(X^n), the divergence rate, \lim_{n\to\infty} \frac{1}{n} D(X^n \| \hat{X}^n),
as well as their Rényi counterparts all exist and admit analytical expressions in
terms of the source power spectral densities [154, 196, 223, 393], [144, Table 4]. In
particular, the differential entropy rate of {X_i} and the divergence rate between {X_i}
and {\hat{X}_i} are given (in nats) by

   \lim_{n\to\infty} \frac{1}{n} h(X^n) = \frac{1}{2}\ln(2\pi e) + \frac{1}{4\pi}\int_{-\pi}^{\pi} \ln \phi_X(\lambda)\, d\lambda,   (5.2.18)

and

   \lim_{n\to\infty} \frac{1}{n} D(X^n \| \hat{X}^n) = \frac{1}{4\pi}\int_{-\pi}^{\pi} \left[ \frac{\phi_X(\lambda)}{\phi_{\hat{X}}(\lambda)} - 1 - \ln\frac{\phi_X(\lambda)}{\phi_{\hat{X}}(\lambda)} \right] d\lambda,   (5.2.19)

respectively. Here, φ X (·) and φ X̂ (·) denote the power spectral densities of the
zero-mean stationary Gaussian processes {X i } and { X̂ i }, respectively. Recall that
for a stationary zero-mean process {Z i }, its power spectral density φ Z (·) is the
(discrete-time) Fourier transform of its covariance function K Z (τ ) := E[Z n+τ Z n ] −
E[Z n+τ ]E[Z n ] = E[Z n+τ Z n ], n, τ = 1, 2, . . .; more precisely,


   \phi_Z(\lambda) = \sum_{\tau=-\infty}^{\infty} K_Z(\tau)\, e^{-j\tau\lambda}, \quad -\pi \le \lambda \le \pi,


where j = \sqrt{-1} is the imaginary unit. Note that (5.2.18) and (5.2.19) hold
under mild integrability and boundedness conditions; see [196, Sect. 2.4] for the
details.
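As a hedged illustration of (5.2.18) (added here, not part of the text), the sketch below evaluates the spectral integral numerically for a first-order autoregressive Gaussian process X_n = ρ X_{n-1} + W_n with W_n ∼ N(0, σ_W²), whose power spectral density is φ_X(λ) = σ_W²/|1 − ρe^{−jλ}|², and compares it with the known closed form ½ ln(2πe σ_W²). NumPy and SciPy are assumed, and the values of ρ and σ_W² are arbitrary.

```python
# Differential entropy rate of a stationary AR(1) Gaussian process via (5.2.18),
# compared with the innovation-entropy closed form (1/2) ln(2*pi*e*sw2).
import numpy as np
from scipy.integrate import quad

rho, sw2 = 0.8, 2.0

def phi_X(lam):
    # power spectral density of the AR(1) process: sw2 / |1 - rho*e^{-j*lam}|^2
    return sw2 / (1.0 - 2.0 * rho * np.cos(lam) + rho**2)

integral, _ = quad(lambda lam: np.log(phi_X(lam)), -np.pi, np.pi)
rate_spectral = 0.5 * np.log(2 * np.pi * np.e) + integral / (4 * np.pi)   # nats, per (5.2.18)
rate_innovation = 0.5 * np.log(2 * np.pi * np.e * sw2)                     # known closed form

print(f"entropy rate via (5.2.18): {rate_spectral:.6f} nats")
print(f"innovation-entropy form:   {rate_innovation:.6f} nats")
```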

5.3 AEP for Continuous Memoryless Sources

The AEP theorem and its consequence for discrete memoryless (i.i.d.) sources reveal
to us that the number of elements in the typical set is approximately 2n H (X ) , where
H (X ) is the source entropy, and that the typical set carries almost all the probability
mass asymptotically (see Theorems 3.4 and 3.5). An extension of this result from
discrete to continuous memoryless sources by just counting the number of elements
in a continuous (typical) set defined via a law of large numbers argument is not
possible, since the total number of elements in a continuous set is infinite. However,
when considering the volume of that continuous typical set (which is a natural analog
to the size of a discrete set), such an extension, with differential entropy playing a
similar role as entropy, becomes straightforward.


Theorem 5.23 (AEP for continuous memoryless sources) Let {X i }i=1 be a con-
tinuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random
variables) with pdf f X (·) and differential entropy h(X ). Then

1
− log f X (X 1 , . . . , X n ) → E[− log2 f X (X )] = h(X ) in probability.
n
Proof The proof is an immediate result of the law of large numbers (e.g., see Theo-
rem 3.4).

Definition 5.24 (Typical set) For δ > 0 and any n given, define the typical set for
the above continuous source as

   F_n(\delta) := \left\{ x^n \in \mathbb{R}^n : \left| -\frac{1}{n}\log_2 f_X(x_1, \ldots, x_n) - h(X) \right| < \delta \right\}.

Definition 5.25 (Volume) The volume of a set A ⊂ Rn is defined as



   \mathrm{Vol}(A) := \int_A dx_1 \cdots dx_n.

Theorem 5.26 (Consequence of the AEP for continuous memoryless sources) For a

continuous memoryless source {X_i}_{i=1}^{\infty} with differential entropy h(X), the following
hold.
1. For n sufficiently large, PX n {Fn (δ)} > 1 − δ.
2. Vol(Fn (δ)) ≤ 2n(h(X )+δ) for all n.
3. Vol(Fn (δ)) ≥ (1 − δ)2n(h(X )−δ) for n sufficiently large.

Proof The proof is quite analogous to the corresponding theorem for discrete mem-
oryless sources (Theorem 3.5) and is left as an exercise.
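The following small simulation (an added sketch, not part of the text; NumPy is assumed and the parameter values are arbitrary) illustrates part 1 of Theorem 5.26 for an i.i.d. Gaussian source: the empirical probability of the typical set F_n(δ) approaches 1 as n grows.

```python
# Empirical probability of the typical set F_n(delta) for an i.i.d. N(0, sigma2) source.
import numpy as np

rng = np.random.default_rng(4)
sigma2, delta = 1.0, 0.1
h = 0.5 * np.log2(2 * np.pi * np.e * sigma2)        # differential entropy h(X) in bits

for n in (10, 100, 1000):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(5_000, n))
    # -(1/n) log2 f_{X^n}(x^n) for the i.i.d. Gaussian pdf
    neg_log = 0.5 * np.log2(2 * np.pi * sigma2) + (x**2).mean(axis=1) * np.log2(np.e) / (2 * sigma2)
    typical = np.abs(neg_log - h) < delta
    print(f"n = {n:4d}:  P(F_n(delta)) ~ {typical.mean():.3f}")
```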

5.4 Capacity and Channel Coding Theorem for the Discrete-Time Memoryless Gaussian Channel

We next study the fundamental limits for error-free communication over the discrete-
time memoryless Gaussian channel, which is the most important continuous-alphabet
channel and is widely used to model real-world wired and wireless channels. We first
state the definition of discrete-time continuous-alphabet memoryless channels.

Definition 5.27 (Discrete-time continuous memoryless channels) Consider a


discrete-time channel with continuous input and output alphabets given by X ⊆ R
and Y ⊆ R, respectively, and described by a sequence of n-dimensional tran-
sition (conditional) pdfs {f_{Y^n|X^n}(y^n|x^n)}_{n=1}^{\infty} that govern the reception of y^n =

(y1 , y2 , . . . , yn ) ∈ Y n at the channel output when x n = (x1 , x2 , . . . , xn ) ∈ X n


is sent as the channel input.
The channel (without feedback) is said to be memoryless with a given (marginal)
transition pdf f Y |X if its sequence of transition pdfs f Y n |X n satisfies

   f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i)   (5.4.1)

for every n = 1, 2, . . . , x n ∈ X n and y n ∈ Y n .


In practice, the real-valued input to a continuous channel satisfies a certain con-
straint or limitation on its amplitude or power; otherwise, one would have a realisti-
cally implausible situation where the input can take on any value from the uncount-
ably infinite set of real numbers. We will thus impose an average cost constraint
(t, P) on any input n-tuple x n = (x1 , x2 , . . . , xn ) transmitted over the channel by
requiring that
   \frac{1}{n}\sum_{i=1}^{n} t(x_i) \le P,   (5.4.2)

where t (·) is a given nonnegative real-valued function describing the cost for trans-
mitting an input symbol, and P is a given positive number representing the maximal
average amount of available resources per input symbol.
Definition 5.28 The capacity (or capacity-cost function) of a discrete-time contin-
uous memoryless channel with input average cost constraint (t, P) is denoted by
C(P) and defined as

   C(P) := \sup_{F_X : E[t(X)] \le P} I(X; Y) \quad \text{(in bits/channel use)},   (5.4.3)

where the supremum is over all input distributions FX .


Lemma 5.29 (Concavity of capacity) If C(P) as defined in (5.4.3) is finite for any
P > 0, then it is concave, continuous, and strictly increasing in P.
Proof Fix P_1 > 0 and P_2 > 0. Since C(P) is finite for any P > 0, by the third
property in Property A.4, there exist two input distributions F_{X_1} and F_{X_2} such
that for all ε > 0,

   I(X_i; Y_i) \ge C(P_i) - ε   (5.4.4)

and
E[t (X i )] ≤ Pi , (5.4.5)

where X i denotes the input with distribution FX i and Yi is the corresponding channel
output for i = 1, 2. Now, for 0 ≤ λ ≤ 1, let X λ be a random variable with distribution
FX λ := λFX 1 + (1 − λ)FX 2 . Then by (5.4.5)

E X λ [t (X )] = λE X 1 [t (X )] + (1 − λ)E X 2 [t (X )] ≤ λP1 + (1 − λ)P2 . (5.4.6)

Furthermore,

   C(\lambda P_1 + (1-\lambda)P_2) = \sup_{F_X : E[t(X)] \le \lambda P_1 + (1-\lambda)P_2} I(F_X, f_{Y|X})
                                   \ge I(F_{X_\lambda}, f_{Y|X})
                                   \ge \lambda I(F_{X_1}, f_{Y|X}) + (1-\lambda) I(F_{X_2}, f_{Y|X})
                                   = \lambda I(X_1; Y_1) + (1-\lambda) I(X_2; Y_2)
                                   \ge \lambda\left[C(P_1) - \varepsilon\right] + (1-\lambda)\left[C(P_2) - \varepsilon\right],

where the first inequality holds by (5.4.6), the second inequality follows from the
concavity of the mutual information with respect to its first argument (cf. Lemma
2.46), and the third inequality follows from (5.4.4). Letting ε → 0 yields that

C(λP1 + (1 − λ)P2 ) ≥ λC(P1 ) + (1 − λ)C(P2 )

and hence C(P) is concave in P.


Finally, it can directly be seen by definition that C(·) is nondecreasing, which,
together with its concavity, imply that it is continuous and strictly increasing.

The most commonly used cost function is the power cost function, t (x) = x 2 ,
resulting in the average power constraint P for each transmitted input n-tuple:

   \frac{1}{n}\sum_{i=1}^{n} x_i^2 \le P.   (5.4.7)

Throughout this chapter, we will adopt this average power constraint on the channel
input.
We herein focus on the discrete-time memoryless Gaussian channel8 with aver-
age input power constraint P and establish an operational meaning for the channel
capacity C(P) as the largest coding rate for achieving reliable communication over
the channel. The channel is described by the following additive noise equation:

Yi = X i + Z i , for i = 1, 2, . . . , (5.4.8)

where Yi , X i , and Z i are the channel output, input and noise at time i. The input and
noise processes are assumed to be independent from each other and the noise source

{Z_i}_{i=1}^{\infty} is i.i.d. Gaussian with each Z_i having mean zero and variance σ^2, Z_i ∼
N(0, σ^2). Since the noise process is i.i.d., we directly get that the channel satisfies

(5.4.1) and is hence memoryless, where the channel transition pdf is explicitly given
in terms of the noise pdf as follows:

8 This
channel is also commonly referred to as the discrete-time additive white Gaussian noise
(AWGN) channel.

   f_{Y|X}(y|x) = f_Z(y - x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-x)^2}{2\sigma^2}}.

As mentioned above, we impose the average power constraint (5.4.7) on the channel
input.
Observation 5.30 The memoryless Gaussian channel is a good approximating
model for many practical channels such as radio, satellite, and telephone line chan-
nels. The additive noise is usually due to a multitude of causes, whose cumulative
effect can be approximated via the Gaussian distribution. This is justified by the
central limit theorem, which states that for an i.i.d. process {U_i}_{i=1}^{\infty} with mean μ and
variance σ^2, \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(U_i - μ) converges in distribution as n → ∞ to a Gaussian
distributed random variable with mean zero and variance σ^2 (see Appendix B).9
Before proving the channel coding theorem for the above memoryless Gaussian
channel with input power constraint P, we first show that its capacity C(P) as defined
in (5.4.3) with t (x) = x 2 admits a simple expression in terms of P and the channel
noise variance σ 2 . Indeed, we can write the channel mutual information I (X ; Y )
between its input and output as follows:

   I(X; Y) = h(Y) - h(Y|X)
           = h(Y) - h(X + Z|X)                              (5.4.9)
           = h(Y) - h(Z|X)                                  (5.4.10)
           = h(Y) - h(Z)                                    (5.4.11)
           = h(Y) - \frac{1}{2}\log_2(2\pi e\sigma^2),      (5.4.12)
where (5.4.9) follows from (5.4.8), (5.4.10) holds since differential entropy is invari-
ant under translation (see Property 8 of Lemma 5.14), (5.4.11) follows from the
independence of X and Z , and (5.4.12) holds since Z ∼ N (0, σ 2 ) is Gaussian (see
(5.1.1)). Now since Y = X + Z , we have that

E[Y 2 ] = E[X 2 ] + E[Z 2 ] + 2E[X ]E[Z ] = E[X 2 ] + σ 2 + 2E[X ](0) ≤ P + σ 2

since the input in (5.4.3) is constrained to satisfy E[X 2 ] ≤ P. Thus, the variance of
Y satisfies Var(Y ) ≤ E[Y 2 ] ≤ P + σ 2 , and

   h(Y) \le \frac{1}{2}\log_2(2\pi e\,\mathrm{Var}(Y)) \le \frac{1}{2}\log_2\left(2\pi e(P + \sigma^2)\right),
where the first inequality follows by Theorem 5.20 since Y is real-valued (with
support R). Noting that equality holds in the first inequality above iff Y is Gaussian
and in the second inequality iff Var(Y ) = P + σ 2 , we obtain that choosing the input
X as X ∼ N (0, P) yields Y ∼ N (0, P + σ 2 ) and hence maximizes I (X ; Y ) over

9 The reader is referred to [209] for an information theoretic treatment of the central limit theorem.
188 5 Differential Entropy and Gaussian Channels

all inputs satisfying E[X 2 ] ≤ P. Thus, the capacity of the discrete-time memoryless
Gaussian channel with input average power constraint P and noise variance (or
power) σ 2 is given by

   C(P) = \frac{1}{2}\log_2\left(2\pi e(P + \sigma^2)\right) - \frac{1}{2}\log_2\left(2\pi e\sigma^2\right)
        = \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right).   (5.4.13)

Note that P/σ^2 is called the channel's signal-to-noise ratio (SNR) and is usually mea-
sured in decibels (dB).10
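The sketch below (added for illustration, not part of the text) simply evaluates (5.4.13) at a few arbitrary SNR values, converting from dB to linear scale.

```python
# Evaluate the AWGN capacity formula C = 0.5*log2(1 + P/sigma^2) at several SNRs.
import numpy as np

def awgn_capacity(snr_linear):
    """Capacity in bits per channel use for a given linear-scale SNR = P/sigma^2."""
    return 0.5 * np.log2(1.0 + snr_linear)

for snr_db in (-10.0, 0.0, 10.0, 20.0, 30.0):
    snr = 10.0 ** (snr_db / 10.0)          # convert dB to linear scale
    print(f"SNR = {snr_db:5.1f} dB  ->  C = {awgn_capacity(snr):.4f} bits/channel use")
```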

Definition 5.31 Given positive integers n and M, and a discrete-time memoryless


Gaussian channel with input average power constraint P, a fixed-length data trans-
mission code (or block code) ∼Cn = (n, M) for this channel with blocklength n
and rate (or code rate) (1/n) log_2 M message bits per channel symbol (or channel use)
consists of:
1. M information messages intended for transmission.
2. An encoding function
f : {1, 2, . . . , M} → Rn

yielding real-valued codewords c1 = f (1), c2 = f (2), . . . , c M = f (M), where


each codeword c_m = (c_{m1}, \ldots, c_{mn}) is of length n and satisfies the power con-
straint P:

   \frac{1}{n}\sum_{i=1}^{n} c_{mi}^2 \le P,

for m = 1, 2, . . . , M. The set of these M codewords is called the codebook and


we usually write ∼Cn = {c1 , c2 , . . . , c M } to list the codewords.
3. A decoding function g : Rn → {1, 2, . . . , M}.

As in Chap. 4, we assume that a message W follows a uniform distribution over


the set of messages: Pr[W = w] = 1/M for all w ∈ {1, 2, \ldots, M}. Similarly, to
convey message W over the channel, the encoder sends its corresponding codeword
X n = f (W ) ∈ ∼Cn at the channel input. Finally, Y n is received at the channel output
and the decoder yields Ŵ = g(Y n ) as the message estimate. Also, the average
probability of error for this block code used over the memoryless Gaussian channel
is defined as
   P_e(∼Cn) := \frac{1}{M}\sum_{w=1}^{M} \lambda_w(∼Cn),

where

10 More specifically, SNR|dB := 10 log10 SNR in dB.



   \lambda_w(∼Cn) := \Pr[\hat{W} \ne W \mid W = w]
                   = \Pr[g(Y^n) \ne w \mid X^n = f(w)]
                   = \int_{y^n \in \mathbb{R}^n :\, g(y^n) \ne w} f_{Y^n|X^n}(y^n \mid f(w))\, dy^n

is the code’s conditional probability of decoding error given that message w is sent
over the channel. Here
   f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i)

as the channel is memoryless, where f Y |X is the channel’s transition pdf.


We next prove that for a memoryless Gaussian channel with input average power
constraint P, its capacity C(P) is the channel’s operational capacity; i.e., it is the
supremum of all rates for which there exists a sequence of data transmission block
codes satisfying the power constraint and having a probability of error that vanishes
with increasing blocklength.
Theorem 5.32 (Shannon’s coding theorem for the memoryless Gaussian channel)
Consider a discrete-time memoryless Gaussian channel with input average power
constraint P, channel noise variance σ 2 and capacity C(P) as given by (5.4.13).
• Forward part (achievability): For any ε ∈ (0, 1), there exist 0 < γ < 2ε and a
sequence of data transmission block codes {∼Cn = (n, M_n)}_{n=1}^{\infty} satisfying

   \frac{1}{n}\log_2 M_n > C(P) - γ

with each codeword c = (c1 , c2 , . . . , cn ) in ∼Cn satisfying

   \frac{1}{n}\sum_{i=1}^{n} c_i^2 \le P   (5.4.14)

such that the probability of error Pe (∼Cn ) < ε for sufficiently large n.
• Converse part: If for any sequence of data transmission block codes {∼Cn =
(n, M_n)}_{n=1}^{\infty} whose codewords satisfy (5.4.14), we have that
n=1 whose codewords satisfy (5.4.14), we have that

   \liminf_{n\to\infty} \frac{1}{n}\log_2 M_n > C(P),

then the codes’ probability of error Pe (∼Cn ) is bounded away from zero for all n
sufficiently large.
Proof of the forward part: The theorem holds trivially when C(P) = 0 because we
can choose Mn = 1 for every n and have Pe (∼Cn ) = 0. Hence, we assume without
loss of generality C(P) > 0.

Step 0:

Take a positive γ satisfying γ < min{2ε, C(P)}. Pick ξ > 0 small enough such that
2[C(P) − C(P − ξ)] < γ, where the existence of such ξ is assured by the strictly
increasing property of C(P). Hence, we have C(P − ξ) − γ/2 > C(P) − γ > 0.
Choose Mn to satisfy

   C(P - ξ) - \frac{γ}{2} > \frac{1}{n}\log_2 M_n > C(P) - γ,

for which the choice should exist for all sufficiently large n. Take δ = γ/8. Let
FX be the distribution that achieves C(P − ξ), where C(P) is given by (5.4.13).
In this case, FX is the Gaussian distribution with mean zero and variance P − ξ
and admits a pdf f X . Hence, E[X 2 ] ≤ P − ξ and I (X ; Y ) = C(P − ξ).
Step 1: Random coding with average power constraint.

Randomly draw Mn codewords according to pdf f X n with


   f_{X^n}(x^n) = \prod_{i=1}^{n} f_X(x_i).

By the law of large numbers, each randomly selected codeword

cm = (cm1 , . . . , cmn )

satisfies
   \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} c_{mi}^2 = E[X^2] \le P - ξ

for m = 1, 2, . . . , Mn .

Step 2: Code construction.

For Mn selected codewords {c1 , . . . , c Mn }, replace the codewords that violate the
power constraint (i.e., (5.4.14)) by an all-zero (default) codeword 0. Define the
encoder as
f n (m) = cm for 1 ≤ m ≤ Mn .

Given a received output sequence y^n, the decoder g_n(\cdot) is given by

   g_n(y^n) = \begin{cases} m, & \text{if } (c_m, y^n) \in F_n(\delta) \text{ and } (\forall\, m' \ne m)\ (c_{m'}, y^n) \notin F_n(\delta), \\ \text{arbitrary}, & \text{otherwise}, \end{cases}

where the set

   F_n(\delta) := \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big| -\frac{1}{n}\log_2 f_{X^n Y^n}(x^n, y^n) - h(X, Y) \Big| < \delta,
                  \Big| -\frac{1}{n}\log_2 f_{X^n}(x^n) - h(X) \Big| < \delta,
                  \text{ and } \Big| -\frac{1}{n}\log_2 f_{Y^n}(y^n) - h(Y) \Big| < \delta \Big\}

is generated by f_{X^n Y^n}(x^n, y^n) = \prod_{i=1}^{n} f_{X,Y}(x_i, y_i), where f_{X^n Y^n}(x^n, y^n) is the
joint input–output pdf realized when the memoryless Gaussian channel (with n-fold
transition pdf f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i)) is driven by input X^n with
pdf f_{X^n}(x^n) = \prod_{i=1}^{n} f_X(x_i) (where f_X achieves C(P - ξ)).
Step 3: Conditional probability of error.

Let λm denote the conditional error probability given codeword m is transmitted.


Define

   E_0 := \Big\{ x^n \in \mathcal{X}^n : \frac{1}{n}\sum_{i=1}^{n} x_i^2 > P \Big\}.

Then by following a similar argument as for (4.3.4),11 we get:

11 In this proof, specifically, (4.3.3) becomes

   \lambda_m(∼Cn) \le \int_{y^n \notin F_n(\delta|c_m)} f_{Y^n|X^n}(y^n|c_m)\, dy^n
                     + \sum_{\substack{m'=1 \\ m' \ne m}}^{M_n} \int_{y^n \in F_n(\delta|c_{m'})} f_{Y^n|X^n}(y^n|c_m)\, dy^n.

By taking expectation with respect to the mth codeword-selecting distribution f_{X^n}(c_m), we obtain

   E[\lambda_m] = \int_{c_m \in \mathcal{X}^n} f_{X^n}(c_m)\, \lambda_m(∼Cn)\, dc_m
                = \int_{c_m \in \mathcal{X}^n \cap E_0} f_{X^n}(c_m)\, \lambda_m(∼Cn)\, dc_m + \int_{c_m \in \mathcal{X}^n \cap E_0^c} f_{X^n}(c_m)\, \lambda_m(∼Cn)\, dc_m
                \le \int_{c_m \in E_0} f_{X^n}(c_m)\, dc_m + \int_{c_m \in \mathcal{X}^n} f_{X^n}(c_m)\, \lambda_m(∼Cn)\, dc_m
                \le P_{X^n}(E_0) + \int_{c_m \in \mathcal{X}^n} f_{X^n}(c_m) \int_{y^n \notin F_n(\delta|c_m)} f_{Y^n|X^n}(y^n|c_m)\, dy^n\, dc_m
                  + \sum_{\substack{m'=1 \\ m' \ne m}}^{M_n} \int_{c_m \in \mathcal{X}^n} f_{X^n}(c_m) \int_{y^n \in F_n(\delta|c_{m'})} f_{Y^n|X^n}(y^n|c_m)\, dy^n\, dc_m.
 
   E[\lambda_m] \le P_{X^n}(E_0) + P_{X^n,Y^n}\big(F_n^c(\delta)\big)
                  + \sum_{\substack{m'=1 \\ m' \ne m}}^{M_n} \int_{c_m \in \mathcal{X}^n} \int_{y^n \in F_n(\delta|c_{m'})} f_{X^n,Y^n}(c_m, y^n)\, dy^n\, dc_m,   (5.4.15)

where

   F_n(\delta|x^n) := \{ y^n \in \mathcal{Y}^n : (x^n, y^n) \in F_n(\delta) \}.

Note that the additional term PX n (E0 ) in (5.4.15) is to cope with the errors due to
all-zero codeword replacement, which will be less than δ for all sufficiently large
n by the law of large numbers. Finally, by carrying out a similar procedure as in
the proof of the channel coding theorem for discrete channels (cf. page 123), we
obtain
 
E[Pe (C n )] ≤ PX n (E0 ) + PX n ,Y n Fnc (δ)
+Mn · 2n(h(X,Y )+δ) 2−n(h(X )−δ) 2−n(h(Y )−δ)
 
≤ PX n (E0 ) + PX n ,Y n Fnc (δ) + 2n(C(P−ξ)−4δ) · 2−n(I (X ;Y )−3δ)
 
= PX n (E0 ) + PX n ,Y n Fnc (δ) + 2−nδ .

Accordingly, we can make the average probability of error, E[Pe (C n )], less than
3δ = 3γ/8 < 3ε/4 < ε for all sufficiently large n.

Proof of the converse part: Consider an (n, Mn ) block data transmission code
satisfying the power constraint (5.4.14) with encoding function

f n : {1, 2, . . . , Mn } → X n

and decoding function


gn : Y n → {1, 2, . . . , Mn }.

Since the message W is uniformly distributed over {1, 2, . . . , Mn }, we have H (W ) =


log2 Mn . Since W → X n = f n (W ) → Y n forms a Markov chain (as Y n only depends
on X n ), we obtain by the data processing lemma that I (W ; Y n ) ≤ I (X n ; Y n ). We
can also bound I (X n ; Y n ) by C(P) as follows:

   I(X^n; Y^n) \le \sup_{F_{X^n} : (1/n)\sum_{i=1}^{n} E[X_i^2] \le P} I(X^n; Y^n)
               \le \sup_{F_{X^n} : (1/n)\sum_{i=1}^{n} E[X_i^2] \le P} \sum_{j=1}^{n} I(X_j; Y_j) \quad \text{(by Theorem 2.21)}
               = \sup_{(P_1, \ldots, P_n) : (1/n)\sum_{i=1}^{n} P_i = P}\ \sup_{F_{X^n} : (\forall i)\, E[X_i^2] \le P_i} \sum_{j=1}^{n} I(X_j; Y_j)
               \le \sup_{(P_1, \ldots, P_n) : (1/n)\sum_{i=1}^{n} P_i = P} \sum_{j=1}^{n}\ \sup_{F_{X^n} : (\forall i)\, E[X_i^2] \le P_i} I(X_j; Y_j)
               = \sup_{(P_1, \ldots, P_n) : (1/n)\sum_{i=1}^{n} P_i = P} \sum_{j=1}^{n}\ \sup_{F_{X_j} : E[X_j^2] \le P_j} I(X_j; Y_j)
               = \sup_{(P_1, \ldots, P_n) : (1/n)\sum_{i=1}^{n} P_i = P} \sum_{j=1}^{n} C(P_j)
               = \sup_{(P_1, \ldots, P_n) : (1/n)\sum_{i=1}^{n} P_i = P} n \sum_{j=1}^{n} \frac{1}{n} C(P_j)
               \le \sup_{(P_1, \ldots, P_n) : (1/n)\sum_{i=1}^{n} P_i = P} n\, C\!\left(\frac{1}{n}\sum_{j=1}^{n} P_j\right) \quad \text{(by concavity of } C(P)\text{)}
               = n\, C(P).

Consequently, recalling that Pe (∼Cn ) is the average error probability incurred by


guessing W from observing Y n via the decoding function gn : Y n → {1, 2, . . . , Mn },
we get

   \log_2 M_n = H(W)
              = H(W|Y^n) + I(W; Y^n)
              \le H(W|Y^n) + I(X^n; Y^n)
              \le h_b(P_e(∼Cn)) + P_e(∼Cn) \cdot \log_2(|\mathcal{W}| - 1) + nC(P) \quad \text{(by Fano's inequality)}
              \le 1 + P_e(∼Cn) \cdot \log_2(M_n - 1) + nC(P) \quad \text{(since } h_b(t) \le 1 \text{ for all } t \in [0, 1]\text{)}
              < 1 + P_e(∼Cn) \cdot \log_2 M_n + nC(P),

which implies that

   P_e(∼Cn) > 1 - \frac{C(P)}{(1/n)\log_2 M_n} - \frac{1}{\log_2 M_n}.

So if lim inf n→∞ (1/n) log2 Mn > C(P), then there exist δ > 0 and an integer N
such that for n ≥ N ,
   \frac{1}{n}\log_2 M_n > C(P) + \delta.

Hence, for n \ge N_0 := \max\{N, 2/\delta\},

   P_e(∼Cn) \ge 1 - \frac{C(P)}{C(P) + \delta} - \frac{1}{n(C(P) + \delta)} \ge \frac{\delta}{2(C(P) + \delta)}.


We next show that among all power-constrained continuous memoryless channels


with additive noise admitting a pdf, choosing a Gaussian distributed noise yields the
smallest channel capacity. In other words, the memoryless Gaussian model results in
the most pessimistic (smallest) capacity within the class of additive noise continuous
memoryless channels.

Theorem 5.33 (Gaussian noise minimizes capacity of additive noise channels)


Every discrete-time continuous memoryless channel with additive noise (admitting
a pdf) of mean zero and variance σ 2 and input average power constraint P has its
capacity C(P) lower bounded by the capacity of the memoryless Gaussian channel
with identical input constraint and noise variance:
 
   C(P) \ge \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right).

Proof Let f Y |X and f Yg |X g denote the transition pdfs of the additive noise channel and
the Gaussian channel, respectively, where both channels satisfy input average power
constraint P. Let Z and Z g respectively denote their zero-mean noise variables of
identical variance σ 2 .
Writing the mutual information in terms of the channel’s transition pdf and input
distribution as in Lemma 2.46, then for any Gaussian input with pdf f X g with corre-
sponding outputs Y and Yg when applied to channels f Y |X and f Yg |X g , respectively,
we have that

   I(f_{X_g}, f_{Y|X}) - I(f_{X_g}, f_{Y_g|X_g})
     = \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_Z(y-x) \log_2 \frac{f_Z(y-x)}{f_Y(y)}\, dy\, dx
       - \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_{Z_g}(y-x) \log_2 \frac{f_{Z_g}(y-x)}{f_{Y_g}(y)}\, dy\, dx
     = \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_Z(y-x) \log_2 \frac{f_Z(y-x)}{f_Y(y)}\, dy\, dx
       - \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_Z(y-x) \log_2 \frac{f_{Z_g}(y-x)}{f_{Y_g}(y)}\, dy\, dx
     = \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_Z(y-x) \log_2 \frac{f_Z(y-x)\, f_{Y_g}(y)}{f_{Z_g}(y-x)\, f_Y(y)}\, dy\, dx
     \ge \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_Z(y-x) (\log_2 e) \left[ 1 - \frac{f_{Z_g}(y-x)\, f_Y(y)}{f_Z(y-x)\, f_{Y_g}(y)} \right] dy\, dx
     = (\log_2 e) \left[ 1 - \int_{\mathcal{Y}} \frac{f_Y(y)}{f_{Y_g}(y)} \int_{\mathcal{X}} f_{X_g}(x) f_{Z_g}(y-x)\, dx\, dy \right]
     = 0,

with equality holding in the inequality iff



   \frac{f_Y(y)}{f_{Y_g}(y)} = \frac{f_Z(y-x)}{f_{Z_g}(y-x)}

for all x and y. Therefore,


 
   \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right) = \sup_{F_X : E[X^2] \le P} I(F_X, f_{Y_g|X_g})
                                                        = I(f^*_{X_g}, f_{Y_g|X_g})
                                                        \le I(f^*_{X_g}, f_{Y|X})
                                                        \le \sup_{F_X : E[X^2] \le P} I(F_X, f_{Y|X})
                                                        = C(P),

thus completing the proof.


Observation 5.34 (Shannon’s channel coding theorem for continuous memoryless


channels) We point out that Theorem 5.32 can be generalized to a wide class of
discrete-time continuous memoryless channels with input cost constraint (5.4.2)
where the cost function t (·) is arbitrary, by showing that

   C(P) := \sup_{F_X : E[t(X)] \le P} I(X; Y)

is the largest rate for which there exist block codes for the channel satisfying (5.4.2)
which are reliably good (i.e., with asymptotically vanishing error probability).
The proof is quite similar to that of Theorem 5.32, except that some modifications
are needed in the forward part as for a general (non-Gaussian) channel, the input
distribution FX used to construct the random code may not admit a pdf (e.g., cf.
[135, Chap. 7], [415, Theorem 11.14]).
Observation 5.35 (Capacity of memoryless fading channels) We briefly examine
the capacity of the memoryless fading channel, which is widely used to model wire-
less communications channels [151, 307, 387]. The channel is described by the
following multiplicative and additive noise equation:

Yi = Ai X i + Z i , for i = 1, 2, . . . , (5.4.16)

where Yi , X i , Z i , and Ai are the channel output, input, additive noise, and amplitude
fading coefficient (or gain) at time i. It is assumed that the fading process {Ai } and the
noise process {Z i } are each i.i.d. and that they are independent of each other and of the
input process. As in the case of the memoryless Gaussian (AWGN) channel, the input
power constraint is given by P and the noise {Z i } is Gaussian with Z i ∼ N (0, σ 2 ).
The fading coefficients Ai are typically Rayleigh or Rician distributed [151]. In both
cases, we assume that E[Ai2 ] = 1 so that the channel SNR is unchanged as P/σ 2 .
Note setting Ai = 1 for all i in (5.4.16) reduces the channel to the AWGN
channel in (5.4.8). We next examine the effect of the random fading coefficient

on the channel’s capacity. We consider two scenarios regularly considered in the


literature: (1) the fading coefficients are known at the receiver, and (2) the fading
coefficients are known at both the receiver and the transmitter.12

1. Capacity of the fading channel with decoder side information: A common


assumption used in many wireless communication systems is that the decoder
knows the values of the fading coefficients at each time instant; in this case,
we say that the channel has decoder side information (DSI). This assumption is
realistic for wireless systems where the fading amplitudes change slowly with
respect to the transmitted codeword so that the decoder can acquire knowledge
of the fading coefficients via the use of prearranged pilot signals. In this
case, as both A and Y are known at the receiver, we can consider (Y, A) as the
channel’s output and thus aim to maximize

I (X ; A, Y ) = I (X ; A) + I (X ; Y |A) = I (X ; Y |A),

where I (X ; A) = 0 since X and A are independent from each other. Thus


the channel capacity in this case, C DS I (P), can be solved as in the case of the
AWGN channel (with the minor change of having the input scaled by the fading)
to obtain:

   C_{DSI}(P) = \sup_{F_X : E[X^2] \le P} I(X; Y|A)
              = \sup_{F_X : E[X^2] \le P} \left[ h(Y|A) - h(Y|X, A) \right]
              = E_A\left[ \frac{1}{2}\log_2\left(1 + \frac{A^2 P}{\sigma^2}\right) \right],   (5.4.17)

where the expectation is taken with respect to the fading distribution. Note that
the capacity-achieving distribution here is also Gaussian with mean zero and
variance P and is independent of the fading coefficient.
At this point, it is natural to compare the capacity in (5.4.17) with that of the
AWGN channel in (5.4.13). In light of the concavity of the logarithm and using
Jensen’s inequality (in Theorem B.18), we readily obtain that
 

   C_{DSI}(P) = E_A\left[ \frac{1}{2}\log_2\left(1 + \frac{A^2 P}{\sigma^2}\right) \right]
              \le \frac{1}{2}\log_2\left(1 + \frac{E[A^2]\, P}{\sigma^2}\right)
              = \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right) =: C_G(P)   (5.4.18)

12 For other scenarios, see [151, 387].



which is the capacity of the AWGN channel with identical SNR, and where
the last step follows since E[A2 ] = 1. Thus, we conclude that fading degrades
capacity as C DS I (P) ≤ C G (P).
2. Capacity of the fading channel with full side information: We next assume that
both the receiver and the transmitter have knowledge of the fading coefficients;
this is the case of the fading channel with full side information (FSI). This
assumption applies to situations where there exists a reliable and fast feed-
back channel in the reverse direction where the decoder can communicate its
knowledge of the fading process to the encoder. In this case, the transmitter can
adaptively adjust its input power according to the value of the fading coefficient.
It can be shown (e.g., see [387]) using Lagrange multipliers that the capacity in
this case is given by

   C_{FSI}(P) = E_A\left[ \sup_{p(\cdot)\,:\, E_A[p(A)] = P} \frac{1}{2}\log_2\left(1 + \frac{A^2 p(A)}{\sigma^2}\right) \right]
              = E_A\left[ \frac{1}{2}\log_2\left(1 + \frac{A^2 p^*(A)}{\sigma^2}\right) \right]   (5.4.19)

where

   p^*(a) = \max\left\{ 0,\ \frac{1}{\lambda} - \frac{\sigma^2}{a^2} \right\}

and λ satisfies E A [ p(A)] = P. The optimal power allotment p ∗ (A) above is a


so-called water-filling allotment, which we examine in more detail in the next
section in the case of parallel AWGN channels.
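The following Monte Carlo sketch (added here as an illustration, not part of the text; it assumes NumPy, Rayleigh fading normalized so that E[A²] = 1, and arbitrary values of P and σ²) estimates C_DSI(P) from (5.4.17) and C_FSI(P) from (5.4.19), finding the water level θ = 1/λ by bisection, and compares them with the AWGN capacity C_G(P).

```python
# Compare AWGN capacity with Rayleigh-fading capacities under DSI and FSI.
import numpy as np

rng = np.random.default_rng(5)
P, sigma2 = 1.0, 1.0
a = rng.rayleigh(scale=1.0 / np.sqrt(2.0), size=500_000)   # Rayleigh fading, E[A^2] = 1

c_awgn = 0.5 * np.log2(1.0 + P / sigma2)
c_dsi = np.mean(0.5 * np.log2(1.0 + a**2 * P / sigma2))    # (5.4.17)

def avg_power(theta):
    # average transmit power under the water-filling policy max{0, theta - sigma^2/a^2}
    return np.mean(np.maximum(0.0, theta - sigma2 / a**2))

lo, hi = 0.0, 1.0
while avg_power(hi) < P:            # bracket the water level theta
    hi *= 2.0
for _ in range(60):                 # bisection on theta
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if avg_power(mid) < P else (lo, mid)
theta = 0.5 * (lo + hi)
p_star = np.maximum(0.0, theta - sigma2 / a**2)
c_fsi = np.mean(0.5 * np.log2(1.0 + a**2 * p_star / sigma2))   # (5.4.19)

print(f"C_G = {c_awgn:.4f},  C_DSI = {c_dsi:.4f},  C_FSI = {c_fsi:.4f}  (bits/channel use)")
```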

Finally, we note that real-world wireless channels are often not memoryless;
they exhibit statistical temporal memory in their fading process [80] and as a result
signals traversing the channels are distorted in a bursty fashion. We refer the reader
to [12, 108, 125, 135, 145, 211, 277, 298–301, 325, 334, 389, 420, 421] and the
references therein for models of channels with memory and for finite-state Markov
channel models which characterize the behavior of time-correlated fading channels
in various settings.

5.5 Capacity of Uncorrelated Parallel Gaussian Channels: The Water-Filling Principle

Consider a network of k mutually independent discrete-time memoryless Gaussian


channels with respective positive noise powers (variances) σ12 , σ22 , . . . and σk2 . If one
wants to transmit information using these channels simultaneously (in parallel), what
will be the system’s channel capacity, and how should the signal powers for each

channel be apportioned given a fixed overall power budget? The answer to the above
question lies in the so-called water-filling or water-pouring principle.

Theorem 5.36 (Capacity of uncorrelated parallel Gaussian channels) The capacity


of k uncorrelated parallel Gaussian channels under an overall input power constraint
P is given by

   C(P) = \sum_{i=1}^{k} \frac{1}{2}\log_2\left(1 + \frac{P_i}{\sigma_i^2}\right),

where σ_i^2 is the noise variance of channel i,

   P_i = \max\{0, \theta - \sigma_i^2\},

and θ is chosen to satisfy \sum_{i=1}^{k} P_i = P. This capacity is achieved by a tuple of
independent Gaussian inputs (X 1 , X 2 , . . . , X k ), where X i ∼ N (0, Pi ) is the input
to channel i, for i = 1, 2, . . . , k.

Proof By definition,

   C(P) = \sup_{F_{X^k} : \sum_{i=1}^{k} E[X_i^2] \le P} I(X^k; Y^k).

Since the noise random variables Z 1 , . . . , Z k are independent from each other,

   I(X^k; Y^k) = h(Y^k) - h(Y^k|X^k)
               = h(Y^k) - h(Z^k + X^k|X^k)
               = h(Y^k) - h(Z^k|X^k)
               = h(Y^k) - h(Z^k)
               = h(Y^k) - \sum_{i=1}^{k} h(Z_i)
               \le \sum_{i=1}^{k} h(Y_i) - \sum_{i=1}^{k} h(Z_i)
               \le \sum_{i=1}^{k} \frac{1}{2}\log_2\left(1 + \frac{P_i}{\sigma_i^2}\right),

where the first inequality follows from the chain rule for differential entropy and the
fact that conditioning cannot increase differential entropy, and the second inequality
holds since output Yi of channel i due to input X i with E[X i2 ] = Pi has its differential
entropy maximized if it is Gaussian distributed with zero-mean and variance Pi +σi2 .
Equalities hold above if all the X_i inputs are independent of each other with each
input X_i ∼ N(0, P_i) such that \sum_{i=1}^{k} P_i = P.

Thus, the problem is reduced to finding the power allotment that maximizes the
overall capacity subject to the constraint \sum_{i=1}^{k} P_i = P with P_i \ge 0. By using the
Lagrange multipliers technique and verifying the KKT conditions (see Example B.21
in Appendix B.8), the maximizer (P_1, \ldots, P_k) of

   \max\left[ \sum_{i=1}^{k} \frac{1}{2}\log_2\left(1 + \frac{P_i}{\sigma_i^2}\right) + \sum_{i=1}^{k} \lambda_i P_i - \nu\left( \sum_{i=1}^{k} P_i - P \right) \right]

can be found by taking the derivative of the above equation (with respect to P_i) and
setting it to zero, which yields

   \lambda_i = \begin{cases} -\dfrac{1}{2\ln(2)}\,\dfrac{1}{P_i + \sigma_i^2} + \nu = 0, & \text{if } P_i > 0; \\[2mm] -\dfrac{1}{2\ln(2)}\,\dfrac{1}{P_i + \sigma_i^2} + \nu \ge 0, & \text{if } P_i = 0. \end{cases}

Hence,

   \begin{cases} P_i = \theta - \sigma_i^2, & \text{if } P_i > 0; \\ P_i \ge \theta - \sigma_i^2, & \text{if } P_i = 0 \end{cases} \qquad \text{(equivalently, } P_i = \max\{0, \theta - \sigma_i^2\}\text{)},

where \theta := \log_2 e/(2\nu) is chosen to satisfy \sum_{i=1}^{k} P_i = P.

We illustrate the above result in Fig. 5.1 and elucidate why the Pi power allotments
form a water-filling (or water-pouring) scheme. In the figure, we have a vessel where
the height of each of the solid bins represents the noise power of each channel (while
the width is set to unity so that the area of each bin yields the noise power of the
corresponding Gaussian channel). We can thus visualize the system as a vessel with
an uneven bottom where the optimal input signal allocation Pi to each channel is
realized by pouring an amount P units of water into the vessel (with the resulting
overall area of filled water equal to P). Since the vessel has an uneven bottom, water
is unevenly distributed among the bins: noisier channels are allotted less signal power
(note that in this example, channel 3, whose noise power is largest, is given no input
power at all and is hence not used).
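A minimal water-filling sketch implementing Theorem 5.36 is given below (an added illustration, not part of the text; the function name water_filling and the four noise powers, chosen to mimic the situation of Fig. 5.1, are hypothetical, and NumPy is assumed). The water level θ is found by bisection on the total-power constraint.

```python
# Water-filling over k uncorrelated parallel Gaussian channels (Theorem 5.36).
import numpy as np

def water_filling(noise_powers, P, iters=100):
    noise_powers = np.asarray(noise_powers, dtype=float)
    lo, hi = noise_powers.min(), noise_powers.max() + P
    for _ in range(iters):                           # bisection on the water level theta
        theta = 0.5 * (lo + hi)
        if np.maximum(0.0, theta - noise_powers).sum() < P:
            lo = theta
        else:
            hi = theta
    powers = np.maximum(0.0, theta - noise_powers)   # P_i = max{0, theta - sigma_i^2}
    capacity = np.sum(0.5 * np.log2(1.0 + powers / noise_powers))
    return theta, powers, capacity

# hypothetical noise powers for channels 1..4 (channel 3 is the noisiest)
theta, powers, C = water_filling([1.0, 2.0, 6.0, 0.5], P=4.0)
print("water level theta:", theta)
print("power allocation :", powers)                  # the noisiest channel may get zero power
print("capacity C(P)    :", C, "bits/channel use")
```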

Observation 5.37 (Practical considerations) According to the water-filling prin-


ciple, one needs to use capacity-achieving Gaussian inputs and allocate more power
to less noisy channels for the optimization of channel capacity. However, Gaussian
inputs do not fit digital communication systems in practice. One may then wonder
what is the optimal power allocation scheme when the channel inputs are practi-
cally dictated to be discrete in value, such as inputs used in conjunction with binary
phase-shift keying (BPSK), quadrature phase-shift keying (QPSK), or 16 quadrature-
amplitude modulation (16-QAM) signaling. Surprisingly under certain conditions,


Fig. 5.1 The water-pouring scheme for uncorrelated parallel Gaussian channels. The horizontal
dashed line indicates the level to which the water rises, namely the value of θ for which
\sum_{i=1}^{k} P_i = P

the answer is different from the water-filling principle. By characterizing the relation-
ship between mutual information and minimum mean square error (MMSE) [165],
the optimal power allocation for parallel AWGN channels with inputs constrained
to be discrete is established in [250], resulting in a new graphical power allocation
interpretation called the mercury/water-filling principle: mercury of proper amounts
[250, Eq. (43)] must be individually poured into each channel bin before water of
amount P = \sum_{i=1}^{k} P_i is added to the vessel. It is thus named because mercury is
heavier than water and does not dissolve in it; so it can play the role of pre-adjuster
of bin heights. This line of inquiry concludes with the observation that when the
total transmission power P is small, the strategy that maximizes capacity follows
approximately the equal SNR principle; i.e., a larger power should be allotted to a
noisier channel to optimize capacity.
Furthermore, it was found in [400] that when the channel’s additive noise is
no longer Gaussian, the mercury adjustment fails to interpret the optimal power
allocation scheme. For additive Gaussian noise with arbitrary discrete inputs, the pre-
adjustment before the water pouring step is always upward; hence, the mercury-filling
scheme is used to increase bin heights. However, since the pre-adjustment of bin
heights can generally be in both upward and downward directions for channels with
non-Gaussian noise, the use of the name mercury/water filling becomes inappropriate
(see [400, Example 1] for quaternary-input additive Laplacian noise channels). In this
case, the graphical interpretation of the optimal power allocation scheme is simply
named two-phase water-filling principle [400].
We end this observation by emphasizing that a vital measure for practical digital
communication systems is the effective transmission rate subject to an acceptably
small decoding error rate (e.g., an overall bit error probability ≤ 10−5 ). Instead,
researchers typically adopt channel capacity as a design criterion in order to make
the analysis tractable and obtain a simple reference scheme for practical systems.

5.6 Capacity of Correlated Parallel Gaussian Channels

In the previous section, we considered a network of k parallel discrete-time mem-


oryless Gaussian channels in which the noise samples from different channels are
independent from each other. We found out that the power allocation strategy that
maximizes the system’s capacity is given by the water-filling scheme. We next study
a network of k parallel memoryless Gaussian channels where the noise variables from
different channels are correlated. Surprisingly, we obtain that water-filling provides
also the optimal power allotment policy.
Let K Z denote the covariance matrix of the noise tuple (Z 1 , Z 2 , . . . , Z k ), and
let K X denote the covariance matrix of the system input (X 1 , . . . , X k ), where we
assume (without loss of the generality) that each X i has zero mean. We assume that
K Z is positive-definite. The input power constraint becomes


   \sum_{i=1}^{k} E[X_i^2] = \mathrm{tr}(K_X) \le P,

where tr(·) denotes the trace of the k × k matrix K X . Since in each channel, the input
and noise variables are independent from each other, we have

I (X k ; Y k ) = h(Y k ) − h(Y k |X k )
= h(Y k ) − h(Z k + X k |X k )
= h(Y k ) − h(Z k |X k )
= h(Y k ) − h(Z k ).

Since h(Z k ) is not determined by the input, determining the system’s capacity reduces
to maximizing h(Y k ) over all possible inputs (X 1 , . . . , X k ) satisfying the power
constraint.
Now observe that the covariance matrix of Y k is equal to KY = K X + K Z , which
implies by Theorem 5.20 that the differential entropy of Y k is upper bounded by

   h(Y^k) \le \frac{1}{2}\log_2\left[(2\pi e)^k \det(K_X + K_Z)\right],

with equality iff Y^k is Gaussian. It remains to find out whether we can find inputs
(X 1 , . . . , X k ) satisfying the power constraint which achieve the above upper bound
and maximize it.
As in the proof of Theorem 5.18, we can orthogonally diagonalize K_Z as

   K_Z = A \Lambda A^T,

where A A^T = I_k (and thus \det(A)^2 = 1), I_k is the k × k identity matrix, and \Lambda is a
diagonal matrix with positive diagonal components consisting of the eigenvalues of
K_Z (as K_Z is positive-definite). Then

   \det(K_X + K_Z) = \det(K_X + A \Lambda A^T)
                   = \det(A A^T K_X A A^T + A \Lambda A^T)
                   = \det(A) \cdot \det(A^T K_X A + \Lambda) \cdot \det(A^T)
                   = \det(A^T K_X A + \Lambda)
                   = \det(B + \Lambda),

where B := AT K X A. Since for any two matrices C and D, tr(CD) = tr(DC), we


have that

tr(B) = tr(AT K X A) = tr(AAT K X ) = tr(Ik K X ) = tr(K X ).

Thus, the capacity problem is further transformed to maximizing \det(B + \Lambda) subject
to tr(B) ≤ P.
By observing that B + \Lambda is positive-definite (because \Lambda is positive-definite) and
using Hadamard's inequality given in Corollary 5.19, we have

   \det(B + \Lambda) \le \prod_{i=1}^{k} (B_{ii} + \lambda_i),

where λ_i is the component of matrix \Lambda located at the ith row and ith column, which
is exactly the ith eigenvalue of K_Z. Thus, the maximum value of \det(B + \Lambda) under
tr(B) ≤ P is realized by a diagonal matrix B (to achieve equality in Hadamard's
inequality) with

   \sum_{i=1}^{k} B_{ii} = P.

Finally, as in the proof of Theorem 5.36, we obtain a water-filling allotment for the
optimal diagonal elements of B:

   B_{ii} = \max\{0, \theta - \lambda_i\},

where θ is chosen to satisfy \sum_{i=1}^{k} B_{ii} = P. We summarize this result in the next
theorem.

Theorem 5.38 (Capacity of correlated parallel Gaussian channels) The capacity of k


correlated parallel Gaussian channels with positive-definite noise covariance matrix
K Z under overall input power constraint P is given by


   C(P) = \sum_{i=1}^{k} \frac{1}{2}\log_2\left(1 + \frac{P_i}{\lambda_i}\right),

where λi is the ith eigenvalue of K Z ,



   P_i = \max\{0, \theta - \lambda_i\},

and θ is chosen to satisfy \sum_{i=1}^{k} P_i = P. This capacity is achieved by a tuple of zero-
mean Gaussian inputs (X 1 , X 2 , . . . , X k ) with covariance matrix K X having the same
eigenvectors as K Z , where the ith eigenvalue of K X is Pi , for i = 1, 2, . . . , k.
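The following sketch (added for illustration, not part of the text; the noise covariance matrix below is an arbitrary example and NumPy is assumed) computes the capacity of Theorem 5.38 by eigendecomposing K_Z, water-filling over its eigenvalues, and reconstructing the optimal input covariance K_X.

```python
# Capacity of correlated parallel Gaussian channels via water-filling on the
# eigenvalues of the noise covariance K_Z (Theorem 5.38).
import numpy as np

def correlated_capacity(K_Z, P, iters=100):
    lam, U = np.linalg.eigh(K_Z)                  # eigenvalues/eigenvectors of K_Z
    lo, hi = lam.min(), lam.max() + P
    for _ in range(iters):                         # bisection on the water level theta
        theta = 0.5 * (lo + hi)
        if np.maximum(0.0, theta - lam).sum() < P:
            lo = theta
        else:
            hi = theta
    powers = np.maximum(0.0, theta - lam)          # P_i = max{0, theta - lambda_i}
    K_X = U @ np.diag(powers) @ U.T                # optimal input covariance (same eigenvectors)
    capacity = np.sum(0.5 * np.log2(1.0 + powers / lam))
    return capacity, K_X

# hypothetical positive-definite noise covariance for k = 3 channels
K_Z = np.array([[2.0, 0.8, 0.3],
                [0.8, 1.5, 0.5],
                [0.3, 0.5, 1.0]])
C, K_X = correlated_capacity(K_Z, P=3.0)
print("capacity:", C, "bits/channel use")
print("tr(K_X) :", np.trace(K_X))                  # should equal the power budget P
```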

We close this section by briefly examining the capacity of two important systems
used in wireless communications.

Observation 5.39 (Capacity of memoryless MIMO channels) As today’s wireless


communication systems persistently demand higher data rates, the exploitation of
multiple-input multiple-output (MIMO) systems has become a vital technological
option. By employing a transmitter with M transmit antennas and a receiver with
N receive antennas, a single-user memoryless MIMO channel, whose random fading
gains (or coefficients) are represented by a sequence of i.i.d. N × M matrices {Hi },
is described by
Y i = Hi X i + Z i , for i = 1, 2, . . . , (5.6.1)

where X i is the M × 1 transmitted vector, Y i is the N × 1 received vector, and Z i


is the N × 1 AWGN vector. In general, Y i , Hi , X i , and Z i are complex-valued.13
The MIMO channel model in (5.6.1) can be regarded as a vector extension of the
scalar system in (5.4.16). It is also a generalization of the Gaussian systems of
Theorems 5.36 and 5.38 with M = N and Hi being deterministically equal to the
N × N identity matrix.
The noise covariance matrix K Z is often assumed to be given by the identity
matrix I N .14 Assume that both the transmitter and the receiver are not only aware
of the distribution of Hi = H but know perfectly its value at each time instance i

13 For a multivariate Gaussian vector Z , its pdf has a slightly different form when it is complex-valued

as opposed to when it is real-valued. For example, when Z = (Z 1 , Z 2 , . . . , Z N )T is Gaussian with


zero-mean and covariance matrix K Z = σ 2 I N , we have
⎧& 'N & '

⎨ √1 exp − 2σ1 2 Nj=1 Z i2 , if Z real-valued
2
f Z (z) = & 'N
2πσ & '

⎩ 12 exp − σ12 Nj=1 |Z j |2 , if Z complex-valued.
πσ

Thus, in parallel to Theorem 5.18, the joint differential entropy for a complex-valued Gaussian Z
is equal to 2 3
h(Z ) = h(Z 1 , Z 2 , . . . , Z N ) = log2 (πe) N det(K Z ) ,
where the multiplicative factors 1/2 and 2 in the differential entropy formula in Theorem 5.18 are
removed. Accordingly, the multiplicative factor 1/2 in the capacity formula for real-valued AWGN
channels is no longer necessary when a complex-valued AWGN channel is considered (e.g., see
(5.6.2) and (5.6.3)).
14 This assumption can be made valid as long as a whitening (i.e., decorrelation) matrix W of Z
i
exists. One can thus multiply the received vector Y i with W to yield the desired equivalent channel
model with I N as the noise covariance matrix (see [153, Eq. (1)] and the ensuing description).

(as in the FSI case in Observation (5.35)). Then, we can follow a similar approach
to the one carried earlier in this section and obtain

det(KY ) = det(H K X H† + I N ),

where “† ” is the Hermitian (conjugate) transposition operation. Thus, the capacity


problem is transformed to maximizing det(H K X H† + I N ) subject to the power
constraint tr(K X ) ≤ P. As a result, the fading MIMO channel capacity assuming
perfect channel knowledge at both the transmitter and the receiver is
 

   C_{FSI}(P) = E_H\left[ \max_{K_X : \mathrm{tr}(K_X) \le P} \log_2 \det(H K_X H^\dagger + I_N) \right].   (5.6.2)

If, however, only the decoder has perfect knowledge of Hi while the transmitter only
knows its distribution (as in the DSI scenario in Observation (5.35)), then
 

   C_{DSI}(P) = \max_{K_X : \mathrm{tr}(K_X) \le P} E_H\left[ \log_2 \det(H K_X H^\dagger + I_N) \right].   (5.6.3)

It directly follows from (5.6.2) and (5.6.3) above (and the property of the maximum)
that in general,
   C_{DSI}(P) \le C_{FSI}(P).

A key finding emanating from the analysis of MIMO channels is that, by virtue of their
spatial (multi-antenna) diversity, such channels can provide significant capacity gains
vis-a-vis the traditional single-antenna (with M = N = 1) channel. For example, it
can be shown that when the receiver knows the channel fading coefficients perfectly
with the latter governed by a Rayleigh distribution, then MIMO channel capacity
scales linearly in the minimum of the number of receive and transmit antennas at
high channel SNR values, and thus it can be significantly larger than in the single-
antenna case [380, 387]. Detailed studies about MIMO systems, including their
capacity benefits under various conditions and configurations, can be found in [151,
153, 387] and the references therein. MIMO technology has become an essential
component of mobile communication standards, such as IEEE 802.11 Wi-Fi, 4th
generation (4G) Worldwide Interoperability for Microwave Access (WiMax), 4G
Long Term Evolution (LTE) and others; see for example [110].

Observation 5.40 (Capacity of memoryless OFDM systems) We have so far con-


sidered discrete-time “narrowband flat” fading channels (e.g., in Observations 5.35
and 5.39) in the sense that the underlying continuous-time channel has a constant
fading gain over a bandwidth which is larger than that of the transmitted signal (see
Sect. 5.8 for more details on band-limited continuous-time channels). There are how-
ever many situations such as in wideband systems where the reverse property holds:
the channel has a constant fading gain over a bandwidth which is smaller than the
bandwidth of the transmitted signal. In this case, the sent signal undergoes frequency
5.6 Capacity of Correlated Parallel Gaussian Channels 205

selective fading with different frequency components of the signal affected by differ-
ent fading due to multipath propagation effects (which occur when the signal arrives
at the receiver via several paths). It has been shown that such fading channels are well
handled by multi-carrier modulation schemes such as orthogonal frequency division
multiplexing (OFDM) which deftly exploits the channels’ frequency diversity to
provide resilience against the deleterious consequences of fading and interference.
OFDM transforms a single-user frequency selective channel into k parallel nar-
rowband fading channels, where k is the number of OFDM subcarriers. It can be
modeled as a memoryless multivariate channel:

Y i = Hi X i + Z i , for i = 1, 2, . . . ,

where X i , Y i , and Z i are respectively the k ×1 transmitted vector, the k ×1 received


vector and the k × 1 AWGN vector at time instance i. Furthermore, Hi is a k × k
diagonal channel gain matrix at time instance i. As in the case of MIMO systems
(see Observation 5.39), the vectors Y i , Hi , X i , and Z i are in general complex-
valued. Under the assumption that Hi can be perfectly estimated and remains constant
over the entire code transmission block, the (sum rate) capacity for a given power
allocation vector P = (P1 , P2 , · · · , Pk ) is given by


C(P) = Σ_{ℓ=1}^{k} log2( 1 + |h_ℓ|² P_ℓ / σ_ℓ² ) = Σ_{ℓ=1}^{k} log2( 1 + P_ℓ / (σ_ℓ²/|h_ℓ|²) ),

where σ_ℓ² is the variance of the ℓth component of Z_i and h_ℓ is the ℓth diagonal
entry of H_i. Thus the overall system capacity, C(P), optimized subject to the power
constraint Σ_{ℓ=1}^{k} P_ℓ ≤ P, can be obtained via the water-filling principle of Sect. 5.5:

C(P) = max_{P=(P_1, P_2, ··· , P_k): Σ_{ℓ=1}^{k} P_ℓ ≤ P} C(P)
     = Σ_{ℓ=1}^{k} log2( 1 + P_ℓ* / (σ_ℓ²/|h_ℓ|²) ),

where

P_ℓ* = max{0, θ − σ_ℓ²/|h_ℓ|²},

and the parameter θ is chosen to satisfy Σ_{ℓ=1}^{k} P_ℓ* = P.
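As an illustration, the following short Python sketch (not part of the original text) computes the water-filling allocation just described for a handful of hypothetical subcarrier noise levels σ_ℓ²/|h_ℓ|²; the function name, the bisection search for θ, and the numerical values are illustrative assumptions.

```python
# Water-filling over k parallel subchannels: find theta and P_l* = max(0, theta - noise_l).
import numpy as np

def water_filling(noise_levels, P, tol=1e-12):
    """Return (powers, theta) with powers[l] = max(0, theta - noise_levels[l]) and sum(powers) = P."""
    noise_levels = np.asarray(noise_levels, dtype=float)
    lo, hi = noise_levels.min(), noise_levels.max() + P   # bracket for the water level
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.maximum(0.0, theta - noise_levels).sum() > P:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    return np.maximum(0.0, theta - noise_levels), theta

if __name__ == "__main__":
    levels = [0.5, 1.0, 2.0, 4.0]        # hypothetical sigma_l^2 / |h_l|^2 values
    powers, theta = water_filling(levels, P=3.0)
    capacity = sum(np.log2(1.0 + p / n) for p, n in zip(powers, levels))
    print("theta =", round(theta, 3), " P_l* =", np.round(powers, 3))
    print("C(P) =", round(capacity, 3), "bits/channel use")
```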
Like MIMO, OFDM has been adopted by many communication standards, includ-
ing Digital Video Broadcasting (DVB-S/T), Digital Subscriber Line (DSL), Wi-Fi,
WiMax, and LTE. The reader is referred to wireless communication textbooks such
as [151, 255, 387] for a thorough examination of OFDM systems.

5.7 Non-Gaussian Discrete-Time Memoryless Channels

If a discrete-time channel has an additive but non-Gaussian memoryless noise and


an input power constraint, then it is often hard to calculate its capacity. Hence, in this
section, we introduce an upper bound and a lower bound on the capacity of such a
channel (we assume that the noise admits a pdf).
Definition 5.41 (Entropy power) For a continuous random variable Z with (well-
defined) differential entropy h(Z ) (measured in bits), its entropy power is denoted
by Ze and defined as
Z_e := (1/(2πe)) 2^{2·h(Z)}.
Lemma 5.42 For a discrete-time continuous-alphabet memoryless additive noise
channel with input power constraint P and noise variance σ 2 , its capacity satisfies

(1/2) log2[(P + σ²)/Z_e]  ≥  C(P)  ≥  (1/2) log2[(P + σ²)/σ²].        (5.7.1)

Proof The lower bound in (5.7.1) is already proved in Theorem 5.33. The upper
bound follows from

I(X; Y) = h(Y) − h(Z)
        ≤ (1/2) log2[2πe(P + σ²)] − (1/2) log2[2πe Z_e].

The entropy power of Z can be viewed as the variance of a corresponding Gaussian


random variable with the same differential entropy as Z . Indeed, if Z is Gaussian,
then its entropy power is equal to

Z_e = (1/(2πe)) 2^{2h(Z)} = Var(Z),
as expected.
Whenever two independent Gaussian random variables, Z 1 and Z 2 , are added, the
power (variance) of the sum is equal to the sum of the powers (variances) of Z 1 and
Z 2 . This relationship can then be written as

2^{2h(Z_1 + Z_2)} = 2^{2h(Z_1)} + 2^{2h(Z_2)},

or equivalently
Var(Z 1 + Z 2 ) = Var(Z 1 ) + Var(Z 2 ).

However, when the two independent random variables are not both Gaussian, the relationship
becomes

2^{2h(Z_1 + Z_2)} ≥ 2^{2h(Z_1)} + 2^{2h(Z_2)},        (5.7.2)

or equivalently
Z_e(Z_1 + Z_2) ≥ Z_e(Z_1) + Z_e(Z_2).        (5.7.3)

Inequality (5.7.2) (or equivalently (5.7.3)), whose proof can be found in [83, Sect. 17.8]
or [51, Theorem 7.10.4], is called the entropy power inequality. It reveals that the
entropy power of the sum of two independent random variables can exceed the sum of
their individual entropy powers, except in the Gaussian case.

Observation 5.43 (Capacity bounds in terms of Gaussian capacity and


non-Gaussianness) It can be readily verified that

(1/2) log2[(P + σ²)/Z_e] = (1/2) log2[(P + σ²)/σ²] + D(Z‖Z_G),

where D(Z‖Z_G) is the divergence between Z and a Gaussian random variable Z_G
of mean zero and variance σ². Note that D(Z‖Z_G) is called the non-Gaussianness
of Z (e.g., see [388]) and is a measure of the “non-Gaussianity” of the noise Z. Thus,
recalling from (5.4.18)
C_G(P) := (1/2) log2[(P + σ²)/σ²]
as the capacity of the channel when the additive noise is Gaussian, we obtain the
following equivalent form for (5.7.1):

C_G(P) + D(Z‖Z_G) ≥ C(P) ≥ C_G(P).        (5.7.4)
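As a numerical illustration of (5.7.1) and (5.7.4), the following Python sketch (not part of the original text) evaluates both capacity bounds when the additive noise Z is uniform on [−a, a], for which h(Z) = log2(2a) bits and Var(Z) = a²/3; the function name and parameter values are illustrative assumptions.

```python
# Capacity bounds for an additive-noise channel with uniform (non-Gaussian) noise.
import math

def capacity_bounds_uniform_noise(P, a):
    var = a * a / 3.0                                          # noise variance sigma^2
    h_bits = math.log2(2 * a)                                  # differential entropy of Z in bits
    entropy_power = 2 ** (2 * h_bits) / (2 * math.pi * math.e) # Z_e from Definition 5.41
    lower = 0.5 * math.log2((P + var) / var)                   # C_G(P): Gaussian-noise capacity
    upper = 0.5 * math.log2((P + var) / entropy_power)         # upper bound in (5.7.1)
    return lower, upper, upper - lower                         # gap equals D(Z || Z_G) by Obs. 5.43

if __name__ == "__main__":
    lo, hi, gap = capacity_bounds_uniform_noise(P=1.0, a=1.0)
    print(f"C_G(P) = {lo:.4f} <= C(P) <= {hi:.4f} bits/channel use")
    print(f"gap = D(Z||Z_G) = {gap:.4f} bits")
```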

5.8 Capacity of the Band-Limited White Gaussian Channel

We have so far considered discrete-time channels (with discrete or continuous alpha-


bets). We close this chapter by briefly presenting the capacity expression of the
continuous-time (waveform) band-limited channel with additive white Gaussian
noise. The reader is referred to [411], [135, Chap. 8], [30, Sects. 8.2 and 8.3] and
[196, Chap. 6] for rigorous and detailed treatments (including coding theorems) of
waveform channels.
The continuous-time band-limited AWGN channel is a common model for a radio
network or a telephone line. For such a channel, illustrated in Fig. 5.2, the output
waveform is given by

Y (t) = (X (t) + Z (t)) ∗ h(t), t ≥ 0,



Fig. 5.2 Band-limited waveform channel with additive white Gaussian noise: the input X(t) plus the noise Z(t) is passed through an ideal bandpass filter H(f) to produce the output Y(t)

where “∗” represents the convolution operation (recall that the convolution between
two signals a(t) and b(t) is defined as a(t) ∗ b(t) = ∫_{−∞}^{+∞} a(τ)b(t − τ) dτ). Here,
X(t) is the channel input waveform with average power constraint

lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} E[X²(t)] dt ≤ P        (5.8.1)

and bandwidth W cycles per second or Hertz (Hz); i.e., its spectrum or Fourier
transform X(f) := F[X(t)] = ∫_{−∞}^{+∞} X(t) e^{−j2πft} dt = 0 for all frequencies
|f| > W, where j = √−1 is the imaginary unit. Z(t) is the noise waveform
of a zero-mean stationary white Gaussian process with power spectral density
N_0/2; i.e., its power spectral density PSD_Z(f), which is the Fourier transform of the
process covariance (equivalently, correlation) function K_Z(τ) := E[Z(s)Z(s + τ)],
s, τ ∈ R, is given by

PSD_Z(f) = F[K_Z(t)] = ∫_{−∞}^{+∞} K_Z(t) e^{−j2πft} dt = N_0/2   for all f.

Finally, h(t) is the impulse response of an ideal bandpass filter with cutoff frequencies at ±W Hz:

H(f) = F[h(t)] = 1 if −W ≤ f ≤ W, and 0 otherwise.

Recall that one can recover h(t) by taking the inverse Fourier transform of H ( f );
this yields
h(t) = F⁻¹[H(f)] = ∫_{−∞}^{+∞} H(f) e^{j2πft} df = 2W sinc(2Wt),

where

sinc(t) := sin(πt)/(πt)
is the sinc function and is defined to equal 1 at t = 0 by continuity.
Note that we can write the channel output as

Y (t) = X (t) + Z̃ (t),

where Z̃ (t) := Z (t) ∗ h(t) is the filtered noise waveform. The input X (t) is not
affected by the ideal unit-gain bandpass filter since it has an identical bandwidth as
h(t). Note also that the power spectral density of the filtered noise is given by
PSD_Z̃(f) = PSD_Z(f) |H(f)|² = N_0/2 if −W ≤ f ≤ W, and 0 otherwise.

Taking the inverse Fourier transform of PSD Z̃ ( f ) yields the covariance function of
the filtered noise process:

K Z̃ (τ ) = F −1 [PSD Z̃ ( f )] = N0 W sinc(2W τ ) τ ∈ R. (5.8.2)

To determine the capacity (in bits per second) of this continuous-time band-
limited white Gaussian channel with parameters, P, W , and N0 , we convert it to
an “equivalent” discrete-time channel with power constraint P by using the well-
known sampling theorem (due to Nyquist, Kotelnikov and Shannon), which states
that sampling a band-limited signal with bandwidth W at a rate of 1/(2W ) is sufficient
to reconstruct the signal from its samples. Since X (t), Z̃ (t), and Y (t) are all band-
limited to [−W, W ], we can thus represent these signals by their samples taken 2W 1

seconds apart and model the channel by a discrete-time channel described by:

Yn = X n + Z̃ n , n = 0, ±1, ±2, . . . ,

where X_n := X(n/(2W)) are the input samples and Z̃_n := Z̃(n/(2W)) and Y_n := Y(n/(2W)) are
the random samples of the noise Z̃(t) and output Y(t) signals, respectively.
Since Z̃ (t) is a filtered version of Z (t), which is a zero-mean stationary Gaussian
process, we obtain that Z̃ (t) is also zero-mean, stationary and Gaussian. This directly
implies that the samples Z̃ n , n = 1, 2, . . . , are zero-mean Gaussian identically
distributed random variables. Now an examination of the expression of K Z̃ (τ ) in
(5.8.2) reveals that
K Z̃ (τ ) = 0

for τ = n/(2W), n = 1, 2, . . . , since sinc(t) = 0 for all nonzero integer values of t.
, n = 1, 2, . . . , since sinc(t) = 0 for all nonzero integer values of t.
Hence, the random variables Z̃ n , n = 1, 2, . . . , are uncorrelated and hence inde-
pendent (since they are Gaussian) and their variance is given by E[ Z̃ n2 ] = K Z̃ (0) =

N_0 W. We conclude that the discrete-time process {Z̃_n}_{n=1}^∞ is i.i.d. Gaussian with each
Z̃_n ∼ N(0, N_0 W). As a result, the above discrete-time channel is a discrete-time
memoryless Gaussian channel with power constraint P and noise variance N0 W ;
thus the capacity of the band-limited white Gaussian channel in bits per channel use
is given using (5.4.13) by
 
(1/2) log2( 1 + P/(N_0 W) )   bits/channel use.

Given that we are using the channel (with inputs X_n) every 1/(2W) seconds, we obtain
that the capacity in bits/second of the band-limited white Gaussian channel is given
by

C(P) = W log2( 1 + P/(N_0 W) )   bits/second,        (5.8.3)

where P/(N_0 W) is the channel SNR.¹⁵


We emphasize that the above derivation of (5.8.3) is heuristic as we have not rig-
orously shown the equivalence between the original band-limited Gaussian channel
and its discrete-time version and we have not established a coding theorem for the
original channel. We point the reader to the references mentioned at the beginning
of the section for a full development of this subject.
Example 5.44 (Telephone line channel) Suppose telephone signals are band-limited
to 4 kHz. Given an SNR of 40 decibels (dB), i.e.,

P
10 log10 = 40 dB,
N0 W

then from (5.8.3), we calculate that the capacity of the telephone line channel (when
modeled via the band-limited white Gaussian channel) is given by

4000 log2 (1 + 10000) = 53151.4 bits/second.
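The computation in Example 5.44 can be checked with a tiny Python sketch (not part of the original text) that evaluates (5.8.3) directly; the function name is an illustrative assumption.

```python
# Band-limited AWGN capacity (5.8.3) for the telephone-line numbers of Example 5.44.
import math

def bandlimited_awgn_capacity(W_hz, snr_db):
    snr = 10 ** (snr_db / 10.0)            # P / (N0 W)
    return W_hz * math.log2(1.0 + snr)     # bits per second

if __name__ == "__main__":
    print(f"{bandlimited_awgn_capacity(4000, 40):.1f} bits/second")  # ~53151.4
```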

Example 5.45 (Infinite bandwidth white Gaussian channel) As the channel band-
width W grows without bound, we obtain from (5.8.3) that

¹⁵ Note that (5.8.3) is achieved by a zero-mean i.i.d. Gaussian input {X_n}_{n=−∞}^{∞} with E[X_n²] = P, which
can be obtained by sampling a zero-mean, stationary and Gaussian X(t) with

PSD_X(f) = P/(2W) if −W ≤ f ≤ W, and 0 otherwise.

Examining this X(t) confirms that it satisfies (5.8.1):

(1/T) ∫_{−T/2}^{T/2} E[X²(t)] dt = E[X²(t)] = K_X(0) = P · sinc(2W · 0) = P.

lim_{W→∞} C(P) = (P/N_0) log2 e   bits/second,

which indicates that in the infinite-bandwidth regime, capacity grows linearly with
power.
Observation 5.46 (Band-limited colored Gaussian channel) If the above band-
limited channel has a stationary colored (nonwhite) additive Gaussian noise, then it
can be shown (e.g., see [135]) that the capacity of this channel becomes


C(P) = (1/2) ∫_{−W}^{W} max{ 0, log2[ θ / PSD_Z(f) ] } df,

where θ is the solution of

P = ∫_{−W}^{W} max{ 0, θ − PSD_Z(f) } df.

The above capacity formula is indeed reminiscent of the water-pouring scheme we


saw in Sects. 5.5 and 5.6, albeit it is herein applied in the spectral domain. In other
words, we can view the curve of PSD Z ( f ) as a bowl, and water is imagined being
poured into the bowl up to level θ under which the area of the water is equal to P
(see Fig. 5.3a). Furthermore, the distributed water indicates the shape of the optimum
transmission power spectrum (see Fig. 5.3b).
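The following Python sketch (not part of the original text) carries out this spectral water-pouring numerically for a hypothetical bowl-shaped noise PSD, approximating the two integrals on a uniform frequency grid and finding θ by bisection; the function name, the grid, and the numerical values are illustrative assumptions.

```python
# Spectral water-pouring for a band-limited colored Gaussian channel (Observation 5.46).
import numpy as np

def colored_gaussian_capacity(psd, W, P, grid=4001, tol=1e-12):
    f = np.linspace(-W, W, grid)
    N = psd(f)                                    # noise PSD samples over [-W, W]
    df = f[1] - f[0]
    lo, hi = N.min(), N.max() + P / (2 * W)       # bracket for the water level theta
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.maximum(0.0, theta - N).sum() * df > P:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    C = 0.5 * np.maximum(0.0, np.log2(theta / N)).sum() * df   # bits/second
    return C, theta

if __name__ == "__main__":
    W = 4000.0                                    # Hz (hypothetical)
    psd = lambda f: 1e-6 * (1.0 + (f / W) ** 2)   # hypothetical bowl-shaped PSD_Z(f)
    C, theta = colored_gaussian_capacity(psd, W, P=0.02)
    print(f"water level theta = {theta:.3e}, C(P) = {C:.1f} bits/second")
```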
Problems
1. Differential entropy under translation and scaling: Let X be a continuous random
variable with a pdf defined on its support S X .
(a) Translation: Show that differential entropy is invariant under translations:

h(X ) = h(X + c)

for any real constant c.


(b) Scaling: Show that
h(a X ) = h(X ) + log2 |a|

for any nonzero real constant a.


2. Determine the differential entropy (in nats) of random variable X for each of the
following cases.
(a) X is exponential with parameter λ > 0 and pdf f X (x) = λe−λx , x ≥ 0.
(b) X is Laplacian with parameter λ > 0, mean zero and pdf f_X(x) = (1/(2λ)) e^{−|x|/λ}, x ∈ R.
(c) X is log-normal with parameters μ ∈ R and σ > 0; i.e., X = eY , where
Y ∼ N (μ, σ 2 ) is a Gaussian random variable with mean μ and variance σ 2 .

The pdf of X is given by

f_X(x) = (1/(σx√(2π))) e^{−(ln x − μ)²/(2σ²)},   x > 0.

(d) The source X = a X 1 +bX 2 , where a and b are nonzero constants and X 1 and
X 2 are independent Gaussian random variables such that X 1 ∼ N (μ1 , σ12 )
and X 2 ∼ N (μ2 , σ22 ).
3. Generalized Gaussian: Let X be a generalized Gaussian random variable with
mean zero, variance σ 2 and pdf given by
f_X(x) = (αη/(2Γ(1/α))) e^{−η^α |x|^α},   x ∈ R,

where α > 0 is a parameter describing the distribution’s exponential rate of decay,

η = σ⁻¹ [ Γ(3/α) / Γ(1/α) ]^{1/2},

and

Fig. 5.3 Water-pouring for the band-limited colored Gaussian channel: (a) the noise spectrum PSD_Z(f), with the horizontal line representing θ, the level to which the water rises; (b) the input spectrum that achieves capacity


Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt,   x > 0,

is the gamma function (recall that Γ(1/2) = √π, Γ(1) = 1 and that Γ(x + 1) =
xΓ(x) for any positive x). The generalized Gaussian distribution (also called
the exponential power distribution) is well known to provide a good model for
symmetrically distributed random processes, including image wavelet transform
coefficients [258] and broad-tailed processes such as atmospheric impulsive
noise [18]. Note that when α = 2, f_X reduces to the Gaussian pdf with mean
zero and variance σ², and when α = 1, it reduces to the Laplacian pdf with
parameter σ/√2 (i.e., variance σ²).
(a) Show that the differential entropy of X is given by
h(X) = 1/α + ln[ 2Γ(1/α)/(αη) ]   (in nats).

(b) Show that when α = 2 and α = 1, h(X ) reduces to the differential entropy
of the Gaussian and Laplacian distributions, respectively.
4. Prove that, of all pdfs with support [0, 1], the uniform density function has the
largest differential entropy.
5. Of all pdfs with continuous support [0, K ], where K > 1 is finite, which pdf
has the largest differential entropy?
Hint: If f X is the pdf that maximizes differential entropy among all pdfs with
support [0, K ], then E[log f X (X )] = E[log f X (Y )] for any random variable Y
of support [0, K ].
6. Show that the exponential distribution has the largest differential entropy among
all pdfs with mean μ and support [0, ∞). (Recall that the pdf of the exponential
distribution with mean μ is given by f_X(x) = (1/μ) exp(−x/μ) for x ≥ 0.)
7. Show that among all continuous random variables X admitting a pdf with support
R and finite differential entropy and satisfying E[X ] = 0 and E[|X |] = λ, where
λ > 0 is a fixed parameter, the Laplacian random variable with pdf

f_X(x) = (1/(2λ)) e^{−|x|/λ}   for x ∈ R

maximizes differential entropy.
8. Find the mutual information between the dependent Gaussian zero-mean random
variables X and Y with covariance matrix
[ σ²    ρσ²
  ρσ²   σ²  ],

where ρ ∈ [−1, 1] is the correlation coefficient. Evaluate the value of I (X ; Y )


when ρ = 1, ρ = 0 and ρ = −1, and explain the results.

9. A variant of the fundamental inequality for the logarithm: For any x > 0 and
y > 0, show that

y ln(y/x) ≥ y − x,
with equality iff x = y.
10. Nonnegativity of divergence: Let X and Y be two continuous random variables
with pdfs f X and f Y , respectively, such that their supports satisfy S X ⊆ SY ⊆ R.
Use Problem 4.9 to show that

D(f_X ‖ f_Y) ≥ 0,

with equality iff f X = f Y almost everywhere.


11. Divergence between Laplacians: Let S be a zero-mean random variable admitting
a pdf with support R and satisfying E[|S|] = λ, where λ > 0 is a fixed parameter.
(a) Let f S and g S be two zero-mean Laplacian pdfs for S with parameters λ
and λ̃, respectively (see Problem 7 above for the Laplacian pdf expression).
Determine D(f_S ‖ g_S) in nats in terms of λ and λ̃.
(b) Show that for any valid pdf f for S, we have

D(f ‖ g_S) ≥ D(f_S ‖ g_S),

with equality iff f = f S (almost everywhere).


12. Let X , Y , and Z be jointly Gaussian random variables, each with mean 0 and
variance 1; let the correlation coefficient of X and Y as well as that of Y and Z
be ρ, while X and Z are uncorrelated. Determine h(X, Y, Z ).
13. Let random variables Z 1 and Z 2 have Gaussian joint distribution with E[Z 1 ] =
E[Z 2 ] = 0, E[Z 12 ] = E[Z 22 ] = 1 and E[Z 1 Z 2 ] = ρ, where 0 < ρ < 1. Also,
let U be a uniformly distributed random variable over the interval (0, 2πe).
Determine whether or not the following inequality holds:

h(U ) > h(Z 1 , Z 2 − 3Z 1 ).

14. Let Z 1 , Z 2 , and Z 3 be independent continuous random variables with identical


pdfs. Show that

I(Z_1 + Z_2; Z_1 + Z_2 + Z_3) ≥ (1/2) log2(3)   (in bits).
Hint: Use the entropy power inequality in (5.7.2).
15. An alternative form of the entropy power inequality: Show that the entropy power
inequality in (5.7.2) can be written as

h(Z 1 + Z 2 ) ≥ h(Y1 + Y2 ),

where Z 1 and Z 2 are two independent continuous random variables, and Y1 and
Y2 are two independent Gaussian random variables such that

h(Y1 ) = h(Z 1 )

and
h(Y2 ) = h(Z 2 ).

16. A relation between differential entropy and estimation error: Consider a contin-
uous random variable Z with a finite variance and admitting a pdf. It is desired
to estimate Z by observing a correlated random variable Y (assume that the joint
pdf of Z and Y and their conditional pdfs are well-defined). Let Ẑ = Ẑ (Y ) be
such estimate of Z .
(a) Show that the mean square estimation error satisfies

E[(Z − Ẑ(Y))²] ≥ 2^{2h(Z|Y)} / (2πe).

Hint: Note that Ẑ ∗ (Y ) = E[Z |Y ] is the optimal (MMSE) estimate of Z .


(b) Assume now that Z and Y are zero-mean unit-variance jointly Gaussian
random variables with correlation parameter ρ. Also assume that a simple
linear estimator is used: Ẑ (Y ) = aY + b, where a and b are chosen so that
the estimation error is minimal. Evaluate the tightness of the bound in (a)
and comment on the result.
17. Consider continuous real-valued random variables X and Y1 , Y2 , . . ., Yn , admit-
ting a joint pdf and conditional pdfs among them such that Y1 , Y2 , . . ., Yn are
conditionally independent and conditionally identically distributed given X.
(a) Show I (X ; Y1 , Y2 , . . . , Yn ) ≤ n · I (X ; Y1 ).
(b) Show that the capacity of the channel X → (Y1 , Y2 , . . . , Yn ) with input
power constraint P is less than n times the capacity of the channel X → Y1 ,
where the notation U → V refers to a channel with input U and output V .
18. Consider the channel X → (Y_1, Y_2, . . . , Y_n) with Y_i = X + Z_i for 1 ≤ i ≤ n,
where X is independent of {Z_i}_{i=1}^n. Assume {Z_i}_{i=1}^n are zero-mean equally
correlated Gaussian random variables with

E[Z_i Z_j] = σ²   for i = j,   and   E[Z_i Z_j] = σ²ρ   for i ≠ j,

where σ² > 0 and −1/(n − 1) ≤ ρ ≤ 1.

(a) By applying a power constraint on the input, i.e., E[X 2 ] ≤ P, where P > 0,
show the channel capacity Cn (P) of this channel is given by
  
C_n(P) = ((n − 1)/2) log(1 − ρ) + (1/2) log[ (1 − ρ) + n( P/σ² + ρ ) ].

(b) Prove that if P > σ²/(n − 1), then C_n(P) < nC_1(P).
Hint: It suffices to prove that exp{2Cn (P)} < exp{2nC1 (P)}. Use also the
following identity regarding the determinant of an n×n matrix with identical
diagonal entries and identical non-diagonal entries:
det(A) = (a − b)^{n−1} (a + (n − 1)b),

where A is the n × n matrix with all diagonal entries equal to a and all non-diagonal entries equal to b.

19. Consider a continuous-alphabet channel with a vectored output for a scalar input
as follows.

X→ Channel → Y1 , Y2

Suppose that the channel’s transition pdf satisfies

f Y1 ,Y2 |X (y1 , y2 |x) = f Y1 |X (y1 |x) f Y2 |X (y2 |x)

for every y1 , y2 and x.


(a) Show that I(X; Y_1, Y_2) = Σ_{i=1}^{2} I(X; Y_i) − I(Y_1; Y_2).
Hint: I (X ; Y1 , Y2 ) = h(Y1 , Y2 ) − h(Y1 |X ) − h(Y2 |X ).
(b) Prove that the channel capacity Ctwo (S) of using two outputs (Y1 , Y2 ) is less
than C1 (S) + C2 (S) under an input power constraint S, where C j (S) is the
channel capacity of using one output Y j and ignoring the other output.
(c) Further assume that f_{Y_i|X}(·|x) is Gaussian with mean x and variance σ_i². In fact,
these channels can be expressed as Y_1 = X + N_1 and Y_2 = X + N_2, where (N_1, N_2)
are independent Gaussian random variables with mean zero and covariance matrix
diag(σ_1², σ_2²). Using the fact that h(Y_1, Y_2) ≤ (1/2) log[ (2πe)² |K_{Y_1,Y_2}| ], with
equality holding when (Y_1, Y_2) are jointly Gaussian, where K_{Y_1,Y_2} is the covariance
matrix of (Y_1, Y_2), derive C_two(S) for the two-output channel under the power
constraint E[X²] ≤ S.
Hint: I (X ; Y1 , Y2 ) = h(Y1 , Y2 ) − h(N1 , N2 ) = h(Y1 , Y2 ) − h(N1 ) − h(N2 ).

20. Consider the three-input three-output memoryless additive Gaussian channel

Y = X + Z,

where X = [X 1 , X 2 , X 3 ], Y = [Y1 , Y2 , Y3 ], and Z = [Z 1 , Z 2 , Z 3 ] are all three-


dimensional real vectors. Assume that X is independent of Z, and the input
power constraint is S (i.e., E(X 12 + X 22 + X 32 ) ≤ S). Also, assume that Z is
Gaussian distributed with zero-mean and covariance matrix K, where
    ⎡ 1 0 0 ⎤
K = ⎢ 0 1 ρ ⎥ .
    ⎣ 0 ρ 1 ⎦

(a) Determine the capacity-cost function of the channel, if ρ = 0.


Hint: Directly apply Theorem 5.36.
(b) Determine the capacity-cost function of the channel, if 0 < ρ < 1.
Hint: Directly apply Theorem 5.38.
Chapter 6
Lossy Data Compression
and Transmission

6.1 Preliminaries

6.1.1 Motivation

In a number of situations, one may need to compress a source to a rate less than the
source entropy, which as we saw in Chap. 3 is the minimum lossless data compression
rate. In this case, some sort of data loss is inevitable and the resultant code is referred
to as a lossy data compression code. The following are examples for requiring the
use of lossy data compression.

Example 6.1 (Digitization or quantization of continuous signals) The information


content of continuous-alphabet signals , such as voice or analog images, is typically
infinite, requiring an unbounded number of bits to digitize them without incurring
any loss, which is not feasible. Therefore, a lossy data compression code must be
used to reduce the output of a continuous source to a finite number of bits.

Example 6.2 (Extracting useful information) In some scenarios, the source informa-
tion may not be operationally useful in its entirety. A quick example is the hypothesis
testing problem where the system designer is only concerned with knowing the like-
lihood ratio of the null hypothesis distribution against the alternative hypothesis
distribution (see Chap. 2). Therefore, any two distinct source letters which produce
the same likelihood ratio are not encoded into distinct codewords and the resultant
code is lossy.

Example 6.3 (Channel capacity bottleneck) Transmitting at a rate of r source sym-


bol/channel symbol a discrete (memoryless) source with entropy H over a channel
with capacity C such that r H > C is problematic. Indeed as stated by the lossless
joint source–channel coding theorem (see Theorem 4.32), if r H > C, then the sys-
tem’s error probability is bounded away from zero (in fact, in many cases, it grows
exponentially fast to one with increasing blocklength). Hence, unmanageable error

Fig. 6.1 Example for the application of lossy data compression: (top) a source with rH > C fed directly into a channel with capacity C yields an output with unmanageable error; (bottom) a lossy data compressor introducing error E produces a compressed source with rH′ < C, so the channel output has manageable error E

or distortion will be introduced at the destination (beyond the control of the sys-
tem designer). A more viable approach would be to reduce the source’s information
content via a lossy compression step so that the entropy H′ of the resulting source
satisfies rH′ < C (this can, for example, be achieved by grouping the symbols
of the original source and thus reducing its alphabet size). By Theorem 4.32, the
compressed source can then be reliably sent at rate r over the channel. With this
approach, error is only incurred (under the control of the system designer) in the
lossy compression stage (cf. Fig. 6.1).
Note that another solution that avoids the use of lossy compression would be to
reduce the source–channel transmission rate in the system from r to r′ source sym-
bol/channel symbol such that r′H < C holds; in this case, again by Theorem 4.32,
lossless reproduction of the source is guaranteed at the destination, albeit at the price
of slowing the system.

6.1.2 Distortion Measures

A discrete-time source is modeled as a random process {Z n }∞ n=1 . To simplify the


analysis, we assume that the source discussed in this section is memoryless and with
finite alphabet Z. Our objective is to compress the source with rate less than entropy
under a prespecified criterion given by a distortion measure.

Definition 6.4 (Distortion measure) A distortion measure is a mapping

ρ: Z × Ẑ → R⁺,

where Z is the source alphabet, Ẑ is a reproduction alphabet, and R⁺ is the set of
nonnegative real numbers.

From the above definition, the distortion measure ρ(z, ẑ) can be viewed as the
cost of representing the source symbol z ∈ Z by a reproduction symbol ẑ ∈ Ẑ. It is
then expected to choose a certain number of (typical) reproduction letters in Ẑ that
represent the source letters with the least cost.
When Ẑ = Z, the selection of typical reproduction letters is similar to par-
titioning the source alphabet into several groups, and then choosing one ele-
ment in each group to represent all group members. For example, suppose that
Z = Ẑ = {1, 2, 3, 4} and that, due to some constraints, we need to reduce the
number of outcomes to 2, and we require that the resulting expected cost cannot be
larger than 0.5. Assume that the source is uniformly distributed and that the distortion
measure is given by the following matrix:
              ⎡ 0 1 2 2 ⎤
[ρ(i, j)] :=  ⎢ 1 0 2 2 ⎥ .
              ⎢ 2 2 0 1 ⎥
              ⎣ 2 2 1 0 ⎦

We see that the two groups in Z which cost least in terms of expected distortion
E[ρ(Z, Ẑ)] should be {1, 2} and {3, 4}. We may choose, respectively, 1 and 3 as
the typical elements for these two groups (cf. Fig. 6.2). The expected cost of such
selection is

(1/4)ρ(1, 1) + (1/4)ρ(2, 1) + (1/4)ρ(3, 3) + (1/4)ρ(4, 3) = 1/2.

Note that the entropy of the source is reduced from H(Z) = 2 bits to H(Ẑ) = 1 bit.
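The grouping example above can be verified by a brute-force search over all two-element reproduction codebooks, as in the following Python sketch (not part of the original text); the variable names are illustrative assumptions.

```python
# Brute-force search for the best 2-letter reproduction codebook for the example above.
from itertools import combinations

rho = [[0, 1, 2, 2],
       [1, 0, 2, 2],
       [2, 2, 0, 1],
       [2, 2, 1, 0]]          # rho[z-1][zhat-1], the 4x4 distortion matrix
p = [0.25, 0.25, 0.25, 0.25]  # uniform source distribution on {1, 2, 3, 4}

best = None
for codebook in combinations(range(1, 5), 2):
    # each source letter is mapped to the cheapest representative in the codebook
    expected = sum(p[z - 1] * min(rho[z - 1][zh - 1] for zh in codebook)
                   for z in range(1, 5))
    if best is None or expected < best[1]:
        best = (codebook, expected)

print("best codebook:", best[0], "expected distortion:", best[1])
# Prints a codebook such as (1, 3) with expected distortion 0.5, matching the text.
```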
Sometimes, it is convenient to have |Ẑ| = |Z| + 1. For example,

|Z = {1, 2, 3}| = 3,   |Ẑ = {1, 2, 3, E}| = 4,

where E can be regarded as an erasure symbol, and the distortion measure is defined
by

Fig. 6.2 “Grouping” as a form of lossy data compression: 1 is the representative for group {1, 2} and 3 is the representative for group {3, 4}


              ⎡ 0 2 2 0.5 ⎤
[ρ(i, j)] :=  ⎢ 2 0 2 0.5 ⎥ .
              ⎣ 2 2 0 0.5 ⎦

In this case, assume again that the source is uniformly distributed and that to represent
source letters by distinct letters in {1, 2, 3} will yield four times the cost incurred
when representing them by E. Therefore, if only 2 outcomes are allowed, and the
expected distortion cannot be greater than 1/3, then employing typical elements 1
and E to represent groups {1} and {2, 3}, respectively, is an optimal choice. The
resultant entropy is reduced from log2(3) bits to [log2(3) − 2/3] bits. It needs to be
pointed out that having |Ẑ| > |Z| + 1 is usually not advantageous.

6.1.3 Frequently Used Distortion Measures

Example 6.5 (Hamming distortion measure) Let the source and reproduction alphabets
be identical, i.e., Z = Ẑ. Then, the Hamming distortion measure is given by

ρ(z, ẑ) := 0, if z = ẑ;   1, if z ≠ ẑ.

It is also named the probability-of-error distortion measure because

E[ρ(Z, Ẑ)] = Pr(Z ≠ Ẑ).

Example 6.6 (Absolute error distortion measure) Assuming that Z = Ẑ = R, the
absolute error distortion measure is given by

ρ(z, ẑ) := |z − ẑ|.

Example 6.7 (Squared error distortion measure) Again assuming that Z = Ẑ = R,
the squared error distortion measure is given by

ρ(z, ẑ) := (z − ẑ)2 .

The squared error distortion measure is perhaps the most popular distortion measure
used for continuous alphabets.

Note that all above distortion measures belong to the class of so-called difference
distortion measures, which have the form ρ(z, ẑ) = d(z − ẑ) for some nonnegative
function d(·). The squared error distortion measure has the advantages of simplicity
and having a closed-form solution for most cases of interest, such as when using least
squares prediction. Yet, this measure is not ideal for practical situations involving
data operated by human observers (such as image and speech data) as it is inadequate

in measuring perceptual quality. For example, two speech waveforms in which one is
a marginally time-shifted version of the other may have large square error distortion;
however, they sound quite similar to the human ear.
The above definition for distortion measures can be viewed as a single-letter
distortion measure since they consider only one random variable Z which draws a
single letter. For sources modeled as a sequence of random variables {Z n }, some
extension needs to be made. A straightforward extension is the additive distortion
measure.

Definition 6.8 (Additive distortion measure between vectors) The additive distor-
tion measure ρn between vectors z n and ẑ n of size n (or n-sequences or n-tuples) is
defined by

ρ_n(z^n, ẑ^n) = Σ_{i=1}^{n} ρ(z_i, ẑ_i).

Another example that is also based on a per-symbol distortion is the maximum


distortion measure:

Definition 6.9 (Maximum distortion measure)

ρ_n(z^n, ẑ^n) = max_{1≤i≤n} ρ(z_i, ẑ_i).

After defining the distortion measures for source sequences, a natural question to
ask is whether reproducing a source sequence z^n by a sequence ẑ^n of the same length is a
must or not. To be more precise, can we use z̃^k to represent z^n for k ≠ n? The answer
is certainly yes if a distortion measure for z^n and z̃^k is defined. A quick example will
be that the source is a ternary sequence of length n, while the (fixed-length) data
compression result is a set of binary indices of length k, which is taken as small as
possible subject to some given constraints. Hence, k is not necessarily equal to n.
One of the problems with taking k ≠ n is that the distortion measure for sequences can
no longer be defined based on per-letter distortions, and hence a per-letter formula
for the best lossy data compression rate cannot be rendered.
In order to alleviate the aforementioned (k ≠ n) problem, we claim that for
most cases of interest, it is reasonable to assume k = n. This is because one can
actually implement lossy data compression from Z^n to {0, 1}^k in two steps. The first
step corresponds to a lossy compression mapping h_n: Z^n → Ẑ^n, and the second
step performs indexing of h_n(Z^n) into {0, 1}^k:

Step 1: Find the data compression mapping

h_n: Z^n → Ẑ^n

for which the prespecified distortion and rate constraints are satisfied.
Step 2: Derive the (asymptotically) lossless data compression block code for source
h n (Z n ). When n is sufficiently large, the existence of such code with blocklength

k > H(h_n(Z^n))   (equivalently, R = k/n > (1/n) H(h_n(Z^n)))

is guaranteed by Shannon’s source coding theorem (Theorem 3.6).


Through the above two steps, a lossy data compression code from

Z^n  --(Step 1)-->  Ẑ^n  -->  {0, 1}^k

is established. Since the second step is already discussed in the (asymptotically)


lossless data compression context, we can say that the theorem regarding lossy data
compression is basically a theorem on the first step.

6.2 Fixed-Length Lossy Data Compression Codes

Similar to the lossless source coding theorem, the objective is to find the theoretical
limit of the compression rate for lossy source codes. Before introducing the main
theorem, we first formally define lossy data compression codes, the rate–distortion
region, and the (operational) rate–distortion function.

Definition 6.10 (Fixed-length lossy data compression code subject to an average


distortion constraint) An (n, M, D) fixed-length lossy data compression code for a
 consists of a compression
source {Z n } with alphabet Z and reproduction alphabet Z
function
h: Z^n → Ẑ^n

with a codebook size (i.e., the image h(Z n )) equal to |h(Z n )| = M = Mn and an
average (expected) distortion no larger than distortion threshold D ≥ 0:
 
E[ (1/n) ρ_n(Z^n, h(Z^n)) ] ≤ D.

The compression rate of the code is defined as (1/n) log2 M bits/source symbol, as
log2 M bits can be used to represent a sourceword of length n. Indeed, an equivalent
description of the above compression code can be made via an encoder–decoder pair
( f, g), where
f : Z n → {1, 2, . . . , M}

is an encoding function mapping each sourceword in Z n to an index in {1, . . . , M}


(which can be represented using log2 M bits), and

g: {1, 2, . . . , M} → Ẑ^n

is a decoding function mapping each index to a reconstruction (or reproduction)
vector in Ẑ^n such that the composition of the encoding and decoding functions
yields the above compression function h: g(f(z^n)) = h(z^n) for z^n ∈ Z^n.

Definition 6.11 (Achievable rate–distortion pair) For a given source {Z n } and a


sequence of distortion measures {ρn }n≥1 , a rate–distortion pair (R, D) is achievable
if there exists a sequence of fixed-length lossy data compression codes (n, Mn , D)
for the source with asymptotic code rate satisfying

limsup_{n→∞} (1/n) log M_n < R.

Definition 6.12 (Rate–distortion region) The rate–distortion region R of a source


{Z_n} is the closure of the set of all achievable rate–distortion pairs (R, D).

Lemma 6.13 (Time-sharing principle) Under an additive distortion measure ρn , the


rate–distortion region R is a convex set; i.e., if (R1 , D1 ) ∈ R and (R2 , D2 ) ∈ R,
then (λR1 + (1 − λ)R2 , λD1 + (1 − λ)D2 ) ∈ R for all 0 ≤ λ ≤ 1.

Proof It is enough to show that the set of all achievable rate–distortion pairs is convex
(since the closure of a convex set is convex). Also, assume without loss of generality
that 0 < λ < 1.
We will prove convexity of the set of all achievable rate–distortion pairs using a
time-sharing argument, which basically states that if we can use an (n, M1 , D1 ) code
∼C1 to achieve (R1 , D1 ) and an (n, M2 , D2 ) code ∼C2 to achieve (R2 , D2 ), then for any
rational number 0 < λ < 1, we can use ∼C1 for a fraction λ of the time and use ∼C2
for a fraction 1 − λ of the time to achieve (Rλ , Dλ ), where Rλ = λR1 + (1 − λ)R2
and Dλ = λD1 + (1 − λ)D2 ; hence, the result holds for any real number 0 < λ < 1
by the density of the rational numbers in R and the continuity of Rλ and Dλ in λ.
Let r and s be positive integers and let λ = r/(r + s); then 0 < λ < 1. Now assume
that the pairs (R1 , D1 ) and (R2 , D2 ) are achievable. Then, there exist a sequence
of (n, M1 , D1 ) codes ∼C1 and a sequence of (n, M2 , D2 ) codes ∼C2 such that for n
sufficiently large,
1
log2 M1 ≤ R1
n
and
1
log2 M2 ≤ R2 .
n

Now construct a sequence of new codes ∼C of blocklength n_λ = (r + s)n, codebook
size M = M_1^r × M_2^s and compression function h: Z^{(r+s)n} → Ẑ^{(r+s)n} such that

h(z^{(r+s)n}) = ( h_1(z_1^n), . . . , h_1(z_r^n), h_2(z_{r+1}^n), . . . , h_2(z_{r+s}^n) ),

where

z^{(r+s)n} = (z_1^n, . . . , z_r^n, z_{r+1}^n, . . . , z_{r+s}^n)

and h 1 and h 2 are the compression functions of ∼C1 and ∼C2 , respectively. In other
words, each reconstruction vector h(z (r +s)n ) of code ∼C is a concatenation of r recon-
struction vectors of code ∼C1 and s reconstruction vectors of code ∼C2 .
The average (or expected) distortion under the additive distortion measure ρn and
the rate of code ∼C are given by
 
E[ ρ_{(r+s)n}(Z^{(r+s)n}, h(Z^{(r+s)n})) / ((r + s)n) ]
  = (1/(r + s)) { E[ ρ_n(Z_1^n, h_1(Z_1^n))/n ] + ··· + E[ ρ_n(Z_r^n, h_1(Z_r^n))/n ]
                  + E[ ρ_n(Z_{r+1}^n, h_2(Z_{r+1}^n))/n ] + ··· + E[ ρ_n(Z_{r+s}^n, h_2(Z_{r+s}^n))/n ] }
  ≤ (1/(r + s)) (r D_1 + s D_2)
  = λD_1 + (1 − λ)D_2 = D_λ

and

(1/((r + s)n)) log2 M = (1/((r + s)n)) log2(M_1^r × M_2^s)
  = (r/(r + s)) (1/n) log2 M_1 + (s/(r + s)) (1/n) log2 M_2
  ≤ λR_1 + (1 − λ)R_2 = R_λ,

respectively, for n sufficiently large. Thus, (R_λ, D_λ) is achievable by ∼C.  □

Definition 6.14 (Rate–distortion function) The rate–distortion function, denoted by


R(D), of source {Z n } is the smallest R̂ for a given distortion threshold D such that
( R̂, D) is an achievable rate–distortion pair; i.e.,

R(D) := inf{ R̂ ≥ 0 : ( R̂, D) ∈ R}.

Observation 6.15 (Monotonicity and convexity of R(D)) Note that, under an addi-
tive distortion measure ρn , the rate–distortion function R(D) is nonincreasing and
convex in D (the proof is left as an exercise).

6.3 Rate–Distortion Theorem

We herein derive the rate–distortion theorem for an arbitrary discrete memoryless


source (DMS) using a bounded additive distortion measure ρn (·, ·); i.e., given a
single-letter distortion measure ρ(·, ·), ρn (·, ·) satisfies the additive property of Def-
inition 6.8 and

max_{(z,ẑ)∈Z×Ẑ} ρ(z, ẑ) < ∞.

The basic idea for identifying good data compression reproduction words from
the set of sourcewords emanating from a DMS is to draw them from the so-called
distortion typical set. This set is defined analogously to the jointly typical set studied
in channel coding (cf. Definition 4.7).

Definition 6.16 (Distortion typical set) The distortion δ-typical set with respect to
the memoryless (product) distribution P_{Z,Ẑ} on Z^n × Ẑ^n (i.e., when pairs of n-tuples
(z^n, ẑ^n) are drawn i.i.d. from Z × Ẑ according to P_{Z,Ẑ}) and a bounded additive
distortion measure ρ_n(·, ·) is defined by

D_n(δ) := { (z^n, ẑ^n) ∈ Z^n × Ẑ^n :
    | −(1/n) log2 P_{Z^n}(z^n) − H(Z) | < δ,
    | −(1/n) log2 P_{Ẑ^n}(ẑ^n) − H(Ẑ) | < δ,
    | −(1/n) log2 P_{Z^n,Ẑ^n}(z^n, ẑ^n) − H(Z, Ẑ) | < δ,
    and | (1/n) ρ_n(z^n, ẑ^n) − E[ρ(Z, Ẑ)] | < δ }.

Note that this is the definition of the jointly typical set with an additional con-
straint on the normalized distortion on sequences of length n being close to the
expected value. Since the additive distortion measure between two joint i.i.d. ran-
dom sequences is actually the sum of the i.i.d. random variables ρ(Z_i, Ẑ_i), i.e.,

ρ_n(Z^n, Ẑ^n) = Σ_{i=1}^{n} ρ(Z_i, Ẑ_i),

then the (weak) law of large numbers holds for the distortion typical set. Therefore,
an AEP-like theorem can be derived for distortion typical set.

Theorem 6.17 If (Z_1, Ẑ_1), (Z_2, Ẑ_2), . . ., (Z_n, Ẑ_n), . . . are i.i.d., and ρ_n(·, ·) is a
bounded additive distortion measure, then as n → ∞,

−(1/n) log2 P_{Z^n}(Z_1, Z_2, . . . , Z_n) → H(Z)   in probability,
−(1/n) log2 P_{Ẑ^n}(Ẑ_1, Ẑ_2, . . . , Ẑ_n) → H(Ẑ)   in probability,
−(1/n) log2 P_{Z^n,Ẑ^n}((Z_1, Ẑ_1), . . . , (Z_n, Ẑ_n)) → H(Z, Ẑ)   in probability,

and

(1/n) ρ_n(Z^n, Ẑ^n) → E[ρ(Z, Ẑ)]   in probability.
Proof Functions of i.i.d. random variables are also i.i.d. random variables. Thus by
the weak law of large numbers, we have the desired result. 

It needs to be pointed out that without the bounded property assumption on ρ, the
normalized sum of an i.i.d. sequence does not necessarily converge in probability to
a finite mean, hence the need for requiring that ρ be bounded.

Theorem 6.18 (AEP for distortion measure) Given a DMS {(Z_n, Ẑ_n)} with generic
joint distribution P_{Z,Ẑ} and any δ > 0, the distortion δ-typical set satisfies:

1. P_{Z^n,Ẑ^n}(D_n^c(δ)) < δ for n sufficiently large.
2. For all (z^n, ẑ^n) in D_n(δ),

P_{Ẑ^n}(ẑ^n) ≥ P_{Ẑ^n|Z^n}(ẑ^n|z^n) 2^{−n[I(Z;Ẑ)+3δ]}.        (6.3.1)

Proof The first result follows directly from Theorem 6.17 and the definition of the
distortion typical set D_n(δ). The second result can be proved as follows:

P_{Ẑ^n|Z^n}(ẑ^n|z^n) = P_{Z^n,Ẑ^n}(z^n, ẑ^n) / P_{Z^n}(z^n)
  = P_{Ẑ^n}(ẑ^n) · P_{Z^n,Ẑ^n}(z^n, ẑ^n) / [ P_{Z^n}(z^n) P_{Ẑ^n}(ẑ^n) ]
  ≤ P_{Ẑ^n}(ẑ^n) · 2^{−n[H(Z,Ẑ)−δ]} / [ 2^{−n[H(Z)+δ]} 2^{−n[H(Ẑ)+δ]} ]
  = P_{Ẑ^n}(ẑ^n) 2^{n[I(Z;Ẑ)+3δ]},

where the inequality follows from the definition of D_n(δ).  □

Before presenting the lossy data compression theorem, we need the following
inequality.

Lemma 6.19 For 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and n > 0,

(1 − xy)^n ≤ 1 − x + e^{−yn},        (6.3.2)

with equality holding iff (x, y) = (1, 0).

Proof Let g y (t) := (1 − yt)n . It can be shown by taking the second derivative of
g y (t) with respect to t that this function is strictly convex for t ∈ [0, 1]. Hence, using
∨ to denote disjunction, we have for any x ∈ [0, 1] that
 
(1 − xy)^n = g_y( (1 − x) · 0 + x · 1 )
  ≤ (1 − x) · g_y(0) + x · g_y(1)     [with equality holding iff (x = 0) ∨ (x = 1) ∨ (y = 0)]
  = (1 − x) + x · (1 − y)^n
  ≤ (1 − x) + x · (e^{−y})^n          [with equality holding iff (x = 0) ∨ (y = 0)]
  ≤ (1 − x) + e^{−ny}                 [with equality holding iff (x = 1)].

From the above derivation, we know that equality holds in (6.3.2) iff

[(x = 0)∨(x = 1)∨(y = 0)]∧[(x = 0)∨(y = 0)]∧[(x = 1)] = (x = 1, y = 0),

where ∧ denotes conjunction. (Note that (x = 0) represents {(x, y) ∈ R2 : x = 0


and y ∈ [0, 1]}; a similar definition applies to the other sets.) 

Theorem 6.20 (Shannon’s rate–distortion theorem for memoryless sources) Consider
a DMS {Z_n}_{n=1}^∞ with alphabet Z, reproduction alphabet Ẑ and a bounded
additive distortion measure ρ_n(·, ·); i.e.,

ρ_n(z^n, ẑ^n) = Σ_{i=1}^{n} ρ(z_i, ẑ_i)   and   ρ_max := max_{(z,ẑ)∈Z×Ẑ} ρ(z, ẑ) < ∞,

where ρ(·, ·) is a given single-letter distortion measure. Then, the source’s rate–
distortion function satisfies the following expression:

R(D) = min_{P_{Ẑ|Z}: E[ρ(Z,Ẑ)]≤D} I(Z; Ẑ).

Proof Define

R^(I)(D) := min_{P_{Ẑ|Z}: E[ρ(Z,Ẑ)]≤D} I(Z; Ẑ);        (6.3.3)

this quantity is typically called Shannon’s information rate–distortion function.


We will then show that the (operational) rate–distortion function R(D) given in
Definition 6.14 equals R (I ) (D).

1. Achievability Part (i.e., R(D + ε) ≤ R^(I)(D) + 4ε for arbitrarily small ε > 0):
We need to show that for any ε > 0, there exist 0 < γ < 4ε and a sequence of
lossy data compression codes {(n, M_n, D + ε)}_{n=1}^∞ with

limsup_{n→∞} (1/n) log2 M_n ≤ R^(I)(D) + γ < R^(I)(D) + 4ε.

The proof is as follows.


Step 1: Optimizing conditional distribution. Let P_{Ẑ|Z} be the conditional distribution
that achieves R^(I)(D), i.e.,

R^(I)(D) = min_{P_{Ẑ|Z}: E[ρ(Z,Ẑ)]≤D} I(Z; Ẑ) = I(Z; Ẑ).

Then

E[ρ(Z, Ẑ)] ≤ D.

Choose Mn to satisfy

R^(I)(D) + γ/2 ≤ (1/n) log2 M_n ≤ R^(I)(D) + γ

for some γ in (0, 4ε), for which the choice should exist for all sufficiently large
n > N_0 for some N_0. Define

δ := min{ γ/8, ε/(1 + 2ρ_max) }.

Step 2: Random coding. Independently select M_n words from Ẑ^n according to

P_{Ẑ^n}(z̃^n) = Π_{i=1}^{n} P_{Ẑ}(z̃_i),

and denote this random codebook by ∼Cn, where

P_{Ẑ}(z̃) = Σ_{z∈Z} P_Z(z) P_{Ẑ|Z}(z̃|z).

Step 3: Encoding rule. Define a subset of Z^n as

J(∼Cn) := { z^n ∈ Z^n : ∃ z̃^n ∈ ∼Cn such that (z^n, z̃^n) ∈ D_n(δ) },

where D_n(δ) is defined under P_{Ẑ|Z}. Based on the codebook ∼Cn = {c_1, c_2, . . . , c_{M_n}},
define the encoding rule as

h_n(z^n) = c_m,   if (z^n, c_m) ∈ D_n(δ)  (when more than one codeword satisfies the requirement, just pick any);
h_n(z^n) = any word in ∼Cn,   otherwise.

Note that when z n ∈ J (∼Cn ), we have (z n , h n (z n )) ∈ Dn (δ) and

(1/n) ρ_n(z^n, h_n(z^n)) ≤ E[ρ(Z, Ẑ)] + δ ≤ D + δ.

Step 4: Calculation of the probability of the complement of J(∼Cn). Let N_1 be
chosen such that for n > N_1,

P_{Z^n,Ẑ^n}(D_n^c(δ)) < δ.

Let

Λ := P_{Z^n}(J^c(∼Cn)).

Then, the expected probability of source n-tuples not belonging to J(∼Cn), averaged
over all randomly generated codebooks, is given by

E[Λ] = Σ_{∼Cn} P_{Ẑ^n}(∼Cn) ( Σ_{z^n ∉ J(∼Cn)} P_{Z^n}(z^n) )
     = Σ_{z^n∈Z^n} P_{Z^n}(z^n) ( Σ_{∼Cn: z^n ∉ J(∼Cn)} P_{Ẑ^n}(∼Cn) ).

For any given z^n, selecting a codebook ∼Cn satisfying z^n ∉ J(∼Cn) is equivalent to
independently drawing M_n n-tuples from Ẑ^n which are not jointly distortion typical
with z^n. Hence,

Σ_{∼Cn: z^n ∉ J(∼Cn)} P_{Ẑ^n}(∼Cn) = [ Pr( (z^n, Ẑ^n) ∉ D_n(δ) ) ]^{M_n}.

For convenience, we let K(z^n, z̃^n) denote the indicator function of D_n(δ), i.e.,

K(z^n, z̃^n) = 1 if (z^n, z̃^n) ∈ D_n(δ), and 0 otherwise.

Then

Σ_{∼Cn: z^n ∉ J(∼Cn)} P_{Ẑ^n}(∼Cn) = ( 1 − Σ_{z̃^n∈Ẑ^n} P_{Ẑ^n}(z̃^n) K(z^n, z̃^n) )^{M_n}.

Continuing the computation of E[Λ], we get


E[Λ] = Σ_{z^n∈Z^n} P_{Z^n}(z^n) ( 1 − Σ_{z̃^n∈Ẑ^n} P_{Ẑ^n}(z̃^n) K(z^n, z̃^n) )^{M_n}

  ≤ Σ_{z^n∈Z^n} P_{Z^n}(z^n) ( 1 − Σ_{z̃^n∈Ẑ^n} P_{Ẑ^n|Z^n}(z̃^n|z^n) 2^{−n(I(Z;Ẑ)+3δ)} K(z^n, z̃^n) )^{M_n}     (by (6.3.1))

  = Σ_{z^n∈Z^n} P_{Z^n}(z^n) ( 1 − 2^{−n(I(Z;Ẑ)+3δ)} Σ_{z̃^n∈Ẑ^n} P_{Ẑ^n|Z^n}(z̃^n|z^n) K(z^n, z̃^n) )^{M_n}

  ≤ Σ_{z^n∈Z^n} P_{Z^n}(z^n) [ 1 − Σ_{z̃^n∈Ẑ^n} P_{Ẑ^n|Z^n}(z̃^n|z^n) K(z^n, z̃^n) + exp{ −M_n · 2^{−n(I(Z;Ẑ)+3δ)} } ]     (from (6.3.2))

  ≤ Σ_{z^n∈Z^n} P_{Z^n}(z^n) [ 1 − Σ_{z̃^n∈Ẑ^n} P_{Ẑ^n|Z^n}(z̃^n|z^n) K(z^n, z̃^n) ] + exp{ −2^{n(R^(I)(D)+γ/2)} · 2^{−n(I(Z;Ẑ)+3δ)} }     (for R^(I)(D) + γ/2 < (1/n) log2 M_n)

  ≤ [ 1 − P_{Z^n,Ẑ^n}(D_n(δ)) ] + exp{ −2^{nδ} }     (for R^(I)(D) = I(Z; Ẑ) and δ ≤ γ/8)

  = P_{Z^n,Ẑ^n}(D_n^c(δ)) + exp{ −2^{nδ} }

  ≤ δ + δ = 2δ     (for all n > N := max{ N_0, N_1, (1/δ) log2 log(1/min{δ, 1}) }).

Since E[Λ] = E[P_{Z^n}(J^c(∼Cn))] ≤ 2δ, there must exist a codebook ∼C∗n such that
P_{Z^n}(J^c(∼C∗n)) is no greater than 2δ for n sufficiently large.

Step 5: Calculation of distortion. The distortion of the optimal codebook ∼C∗n (from
the previous step) satisfies for n > N :

(1/n) E[ρ_n(Z^n, h_n(Z^n))] = Σ_{z^n∈J(∼C∗n)} P_{Z^n}(z^n) (1/n) ρ_n(z^n, h_n(z^n))
                               + Σ_{z^n∉J(∼C∗n)} P_{Z^n}(z^n) (1/n) ρ_n(z^n, h_n(z^n))
  ≤ Σ_{z^n∈J(∼C∗n)} P_{Z^n}(z^n) (D + δ) + Σ_{z^n∉J(∼C∗n)} P_{Z^n}(z^n) ρ_max
  ≤ (D + δ) + 2δ · ρ_max
  ≤ D + δ(1 + 2ρ_max)
  ≤ D + ε.

This concludes the proof of the achievability part.

2. Converse Part (i.e., R(D + ε) ≥ R^(I)(D) for arbitrarily small ε > 0 and any D ∈
{D ≥ 0 : R^(I)(D) > 0}): We need to show that for any sequence of {(n, M_n, D_n)}_{n=1}^∞
codes with

limsup_{n→∞} (1/n) log2 M_n < R^(I)(D),

there exists ε > 0 such that

D_n = (1/n) E[ρ_n(Z^n, h_n(Z^n))] > D + ε

for n sufficiently large. The proof is as follows.
Step 1: Convexity of mutual information. By the convexity of mutual information
I(Z; Ẑ) with respect to P_{Ẑ|Z} for a fixed P_Z, we have

I(Z; Ẑ_λ) ≤ λ · I(Z; Ẑ_1) + (1 − λ) · I(Z; Ẑ_2),

where λ ∈ [0, 1], and

P_{Ẑ_λ|Z}(ẑ|z) := λ P_{Ẑ_1|Z}(ẑ|z) + (1 − λ) P_{Ẑ_2|Z}(ẑ|z).

Step 2: Convexity of R^(I)(D). Let P_{Ẑ_1|Z} and P_{Ẑ_2|Z} be two distributions achieving
R^(I)(D_1) and R^(I)(D_2), respectively. Since

E[ρ(Z, Ẑ_λ)] = Σ_{z∈Z} P_Z(z) Σ_{ẑ∈Ẑ} P_{Ẑ_λ|Z}(ẑ|z) ρ(z, ẑ)
             = Σ_{z∈Z, ẑ∈Ẑ} P_Z(z) [ λ P_{Ẑ_1|Z}(ẑ|z) + (1 − λ) P_{Ẑ_2|Z}(ẑ|z) ] ρ(z, ẑ)
             = λD_1 + (1 − λ)D_2,

we have

R^(I)(λD_1 + (1 − λ)D_2) ≤ I(Z; Ẑ_λ)
                          ≤ λ I(Z; Ẑ_1) + (1 − λ) I(Z; Ẑ_2)
                          = λ R^(I)(D_1) + (1 − λ) R^(I)(D_2).

Therefore, R^(I)(D) is a convex function.



Step 3: Strictly decreasing and continuity properties of R^(I)(D).
By definition, R^(I)(D) is nonincreasing in D. Also,

R^(I)(D) = 0   iff   D ≥ D_max := min_{ẑ∈Ẑ} Σ_{z∈Z} P_Z(z) ρ(z, ẑ),        (6.3.4)

which is finite by the boundedness of the distortion measure. Thus, since R (I ) (D)
is nonincreasing and convex, it directly follows that it is strictly decreasing and
continuous over {D ≥ 0 : R (I ) (D) > 0}.
Step 4: Main proof.

log2 M_n ≥ H(h_n(Z^n))
  = H(h_n(Z^n)) − H(h_n(Z^n)|Z^n),   since H(h_n(Z^n)|Z^n) = 0;
  = I(Z^n; h_n(Z^n))
  = H(Z^n) − H(Z^n|h_n(Z^n))
  = Σ_{i=1}^{n} H(Z_i) − Σ_{i=1}^{n} H(Z_i | h_n(Z^n), Z_1, . . . , Z_{i−1}),
        by the independence of Z^n and the chain rule for conditional entropy;
  ≥ Σ_{i=1}^{n} H(Z_i) − Σ_{i=1}^{n} H(Z_i | Ẑ_i),
        where Ẑ_i is the ith component of h_n(Z^n);
  = Σ_{i=1}^{n} I(Z_i; Ẑ_i)
  ≥ Σ_{i=1}^{n} R^(I)(D_i),   where D_i := E[ρ(Z_i, Ẑ_i)];
  = n Σ_{i=1}^{n} (1/n) R^(I)(D_i)
  ≥ n R^(I)( Σ_{i=1}^{n} (1/n) D_i ),   by convexity of R^(I)(D);
  = n R^(I)( (1/n) E[ρ_n(Z^n, h_n(Z^n))] ),

where the last step follows since the distortion measure is additive. Finally,
lim supn→∞ (1/n) log2 Mn < R (I ) (D) implies the existence of N and γ > 0
such that (1/n) log Mn < R (I ) (D) − γ for all n > N . Therefore, for n > N ,

R^(I)( (1/n) E[ρ_n(Z^n, h_n(Z^n))] ) < R^(I)(D) − γ,

which, together with the fact that R (I ) (D) is strictly decreasing, implies that

(1/n) E[ρ_n(Z^n, h_n(Z^n))] > D + ε

for some ε = ε(γ) > 0 and for all n > N .


3. Summary: For D ∈ {D ≥ 0 : R (I ) (D) > 0}, the achievability and converse parts
jointly imply that R (I ) (D) + 4ε ≥ R(D + ε) ≥ R (I ) (D) for arbitrarily small ε > 0.
These inequalities together with the continuity of R (I ) (D) yield that R(D) = R (I ) (D)
for D ∈ {D ≥ 0 : R (I ) (D) > 0}.
For D ∈ {D ≥ 0 : R (I ) (D) = 0}, the achievability part gives us R (I ) (D) + 4ε =
4ε ≥ R(D + ε) ≥ 0 for arbitrarily small ε > 0. This immediately implies that
R(D) = 0(= R (I ) (D)) as desired. 

As in the case of block source coding in Chap. 3 (compare Theorem 3.6 with
Theorem 3.15), the above rate–distortion theorem can be extended for the case of
stationary ergodic sources (e.g., see [42, 135]).

Theorem 6.21 (Shannon’s rate–distortion theorem for stationary ergodic sources)
Consider a stationary ergodic source {Z_n}_{n=1}^∞ with alphabet Z, reproduction
alphabet Ẑ and a bounded additive distortion measure ρ_n(·, ·); i.e.,

ρ_n(z^n, ẑ^n) = Σ_{i=1}^{n} ρ(z_i, ẑ_i)   and   ρ_max := max_{(z,ẑ)∈Z×Ẑ} ρ(z, ẑ) < ∞,

where ρ(·, ·) is a given single-letter distortion measure. Then, the source’s rate–
distortion function is given by

R(D) = R̄^(I)(D),

where

R̄^(I)(D) := lim_{n→∞} R_n^(I)(D)        (6.3.5)

is called the asymptotic information rate–distortion function, and

R_n^(I)(D) := min_{P_{Ẑ^n|Z^n}: (1/n) E[ρ_n(Z^n,Ẑ^n)] ≤ D} (1/n) I(Z^n; Ẑ^n)        (6.3.6)

is the nth-order information rate–distortion function.



Observation 6.22 (Notes on the asymptotic information rate–distortion function)


• Note that the quantity R̄^(I)(D) in (6.3.5) is well defined as long as the source is
stationary; furthermore, R̄^(I)(D) satisfies (see [135, p. 492])

R̄^(I)(D) = lim_{n→∞} R_n^(I)(D) = inf_n R_n^(I)(D).

Hence,

R̄^(I)(D) ≤ R_n^(I)(D)        (6.3.7)

holds for any n = 1, 2, . . ..
• Wyner–Ziv lower bound on R̄^(I)(D): The following lower bound on R̄^(I)(D), due
to Wyner and Ziv [412] (see also [42]), holds for stationary sources:

R̄^(I)(D) ≥ R_n^(I)(D) − μ_n,        (6.3.8)

where

μ_n := (1/n) H(Z^n) − H(Z)

represents the amount of memory in the source and H(Z) = lim_{n→∞} (1/n) H(Z^n) is
the source entropy rate. Note that as n → ∞, μ_n → 0; thus, the above Wyner–Ziv
lower bound is asymptotically tight in n.¹
• When the source {Z_n} is a DMS, it can readily be verified from (6.3.5) and (6.3.6)
that

R̄^(I)(D) = R_1^(I)(D) = R^(I)(D),

where R^(I)(D) is given in (6.3.3).


• R̄^(I)(D) for binary symmetric Markov sources: When {Z_n} is a stationary binary
symmetric Markov source, R̄^(I)(D) is partially explicitly known. More specifically,
it is shown by Gray [156] that

R̄^(I)(D) = h_b(q) − h_b(D)   for 0 ≤ D ≤ D_c,        (6.3.9)

where

D_c = (1/2) ( 1 − √( 1 − (1 − q)²/q ) ),

1A twin result to the above Wyner–Ziv lower bound, which consists of an upper bound on the
capacity-cost function of channels with stationary additive noise, is shown in [16, Corollary 1].
This result, which is expressed in terms of the nth-order capacity-cost function and the amount
of memory in the channel noise, illustrates the “natural duality” between the information rate–
distortion and capacity-cost functions originally pointed out by Shannon [345].

q = P{Z_n = 1|Z_{n−1} = 0} = P{Z_n = 0|Z_{n−1} = 1} > 1/2 is the source’s
transition probability and h_b(p) = −p log2(p) − (1 − p) log2(1 − p) is the binary
entropy function. Determining R̄^(I)(D) for D > D_c is still an open problem,
although R̄^(I)(D) can be estimated in this region via lower and upper bounds.
Indeed, the right-hand side of (6.3.9) still serves as a lower bound on R̄^(I)(D) for
D > D_c [156]. Another lower bound on R̄^(I)(D) is the above Wyner–Ziv bound
(6.3.8), while (6.3.7) gives an upper bound. Various bounds on R̄^(I)(D) are studied
in [43] and calculated (in particular, see [43, Fig. 1]).

The formula of the rate–distortion function obtained in the previous theorems


is also valid for the squared error distortion over the real numbers, even if it is
unbounded. Here, we put the boundedness assumption just to facilitate the exposi-
tion of the current proof.2 The discussion on lossy data compression, especially for
continuous-alphabet sources, will continue in Sect. 6.4. Examples of the calculation
of the rate–distortion function for memoryless sources will also be given in the same
section.
After introducing Shannon’s source coding theorem for block codes, Shannon’s
channel coding theorem for block codes and the rate–distortion theorem in the mem-
oryless system setting (or even stationary ergodic setting in the case of the source
coding and rate–distortion theorems), we briefly elucidate the “key concepts or tech-
niques” behind these lengthy proofs, in particular the notions of a typical set and
of random coding. The typical set construct—specifically, δ-typical set for source
coding, joint δ-typical set for channel coding, and distortion typical set for rate–
distortion—uses a law of large numbers or AEP argument to claim the existence of
a set with very high probability; hence, the respective information manipulation can
just focus on the set with negligible performance loss. The random coding technique
shows that the expectation of the desired performance over all possible information
manipulation schemes (randomly drawn according to some properly chosen statis-
tics) is already acceptably good, and hence the existence of at least one good scheme
that fulfills the desired performance index is validated. As a result, in situations
where the above two techniques apply, a similar theorem can often be established.
A natural question is whether we can extend the theorems to cases where these two
techniques fail. It is obvious that only when new methods (other than the above two)
are developed can the question be answered in the affirmative; see [73, 74, 157, 172,
364] for examples of more general rate–distortion theorems involving sources with
memory.

² For example, the boundedness assumption in the theorems can be replaced with assuming that there
exists a reproduction symbol ẑ_0 ∈ Ẑ such that E[ρ(Z, ẑ_0)] < ∞ [42, Theorems 7.2.4 and 7.2.5].
This assumption can accommodate the squared error distortion measure and a source with finite
second moment (including continuous-alphabet sources such as Gaussian sources); see also [135,
Theorem 9.6.2 and p. 479].

6.4 Calculation of the Rate–Distortion Function

In light of Theorem 6.20 and the discussion at the end of the previous section, we
know that for a wide class of memoryless sources

R(D) = R (I ) (D)

as given in (6.3.3).
We first note that, like channel capacity, R(D) cannot in general be explicitly
determined in a closed-form expression, and thus optimization-based algorithms can
be used for its efficient numerical computation [27, 49, 51, 88]. In the following,
we consider simple examples involving the Hamming and squared error distortion
measures where bounds or exact expressions for R(D) can be obtained.
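As an illustration of such numerical computation, the following Python sketch (not part of the original text) implements a Blahut–Arimoto-style alternating-minimization iteration that returns one (D, R) point on the rate–distortion curve of a DMS for a given Lagrange (slope) parameter s ≤ 0; the function name, iteration count, and example values are illustrative assumptions, not the specific algorithms of [27, 49, 51, 88].

```python
# Blahut-Arimoto-style iteration for a single (D, R) point of a DMS rate-distortion curve.
import numpy as np

def rd_point(p_z, rho, s, iters=500):
    """p_z: source pmf (length m); rho: m x k distortion matrix; s: slope parameter (<= 0)."""
    p_z = np.asarray(p_z, float)
    rho = np.asarray(rho, float)
    q = np.full(rho.shape[1], 1.0 / rho.shape[1])      # reproduction distribution
    for _ in range(iters):
        A = q * np.exp(s * rho)                        # unnormalized test channel
        W = A / A.sum(axis=1, keepdims=True)           # P(zhat | z)
        q = p_z @ W                                    # updated reproduction marginal
    D = float(np.sum(p_z[:, None] * W * rho))
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(W > 0, W / q, 1.0)
        R = float(np.sum(p_z[:, None] * W * np.log2(ratio)))
    return D, R

if __name__ == "__main__":
    # Bernoulli(0.3) source with Hamming distortion; compare with h_b(0.3) - h_b(D).
    p, rho = [0.7, 0.3], [[0, 1], [1, 0]]
    for s in (-8.0, -4.0, -2.0):
        D, R = rd_point(p, rho, s)
        print(f"s = {s}: D = {D:.3f}, R = {R:.3f} bits")
```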

6.4.1 Rate–Distortion Function for Discrete Memoryless


Sources Under the Hamming Distortion Measure

A specific application of the rate–distortion function that is useful in practice is when


the Hamming additive distortion measure is used with a (finite alphabet) DMS with
Ẑ = Z.
We first assume that the DMS is binary-valued (i.e., it is a Bernoulli source) with
Z = Ẑ = {0, 1}. In this case, the Hamming additive distortion measure satisfies

ρ_n(z^n, ẑ^n) = Σ_{i=1}^{n} z_i ⊕ ẑ_i,

where ⊕ denotes modulo-two addition. In this case, ρ_n(z^n, ẑ^n) is exactly the number of
bit errors or changes after compression. Therefore, the distortion bound D becomes
a bound on the average probability of bit error. Specifically, among n compressed
bits, it is expected to have E[ρ_n(Z^n, Ẑ^n)] bit errors; hence, the expected bit error rate
is (1/n) E[ρ_n(Z^n, Ẑ^n)]. The rate–distortion function for binary sources and the
Hamming additive distortion measure is given by the next theorem.

Theorem 6.23 Fix a binary DMS {Z_n}_{n=1}^∞ with marginal distribution P_Z(0) = 1 − P_Z(1) = p, where 0 < p < 1. Then the source's rate–distortion function under the Hamming additive distortion measure is given by

    R(D) = h_b(p) − h_b(D),  if 0 ≤ D < min{p, 1 − p};
    R(D) = 0,                if D ≥ min{p, 1 − p},

where h b (·) is the binary entropy function.


Proof Assume without loss of generality that p ≤ 1/2.


We first prove the theorem for 0 ≤ D < min{ p, 1 − p} = p. Observe that

    H(Z | Ẑ) = H(Z ⊕ Ẑ | Ẑ).

Also, observe that

    E[ρ(Z, Ẑ)] ≤ D implies that Pr{Z ⊕ Ẑ = 1} ≤ D.

Then, examining (6.3.3), we have

    I(Z; Ẑ) = H(Z) − H(Z | Ẑ)
            = h_b(p) − H(Z ⊕ Ẑ | Ẑ)
            ≥ h_b(p) − H(Z ⊕ Ẑ)      (conditioning never increases entropy)
            ≥ h_b(p) − h_b(D),

where the last inequality follows since the binary entropy function h_b(x) is increasing for x ≤ 1/2 and Pr{Z ⊕ Ẑ = 1} ≤ D. Since the above derivation is true for any P_{Ẑ|Z}, we have

    R(D) ≥ h_b(p) − h_b(D).

It remains to show that the lower bound is achievable by some P_{Ẑ|Z}, or equivalently, that H(Z | Ẑ) = h_b(D) for some P_{Ẑ|Z}. By defining P_{Z|Ẑ}(0|0) = P_{Z|Ẑ}(1|1) = 1 − D, we immediately obtain H(Z | Ẑ) = h_b(D). The desired P_{Ẑ|Z} can be obtained by simultaneously solving the equations

    1 = P_Ẑ(0) + P_Ẑ(1) = [P_Z(0)/P_{Z|Ẑ}(0|0)] P_{Ẑ|Z}(0|0) + [P_Z(0)/P_{Z|Ẑ}(0|1)] P_{Ẑ|Z}(1|0)
                        = [p/(1 − D)] P_{Ẑ|Z}(0|0) + [p/D] (1 − P_{Ẑ|Z}(0|0))

and

    1 = P_Ẑ(0) + P_Ẑ(1) = [P_Z(1)/P_{Z|Ẑ}(1|0)] P_{Ẑ|Z}(0|1) + [P_Z(1)/P_{Z|Ẑ}(1|1)] P_{Ẑ|Z}(1|1)
                        = [(1 − p)/D] (1 − P_{Ẑ|Z}(1|1)) + [(1 − p)/(1 − D)] P_{Ẑ|Z}(1|1),

which yield

    P_{Ẑ|Z}(0|0) = [(1 − D)/(1 − 2D)] (1 − D/p)

and

    P_{Ẑ|Z}(1|1) = [(1 − D)/(1 − 2D)] (1 − D/(1 − p)).

If 0 ≤ D < min{p, 1 − p} = p, then P_{Ẑ|Z}(0|0) > 0 and P_{Ẑ|Z}(1|1) > 0, completing the proof.
Finally, the fact that R(D) = 0 for D ≥ min{ p, 1 − p} = p follows directly from
(6.3.4) by noting that Dmax = min{ p, 1 − p} = p. 
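As a quick numerical illustration of Theorem 6.23 (a minimal Python sketch; the helper names below are ours, not part of the text), R(D) = h_b(p) − h_b(D) can be evaluated directly:

```python
import math

def h_b(x):
    """Binary entropy function in bits, with h_b(0) = h_b(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def rate_distortion_binary(D, p):
    """R(D) of a Bernoulli(p) source under the Hamming distortion (Theorem 6.23)."""
    if D >= min(p, 1.0 - p):
        return 0.0
    return h_b(p) - h_b(D)

if __name__ == "__main__":
    p = 0.3
    for D in (0.0, 0.05, 0.1, 0.2, 0.3, 0.5):
        print(f"D = {D:.2f}  ->  R(D) = {rate_distortion_binary(D, p):.4f} bits")
    # R(0) recovers the lossless rate h_b(p) ~ 0.881 bits, and R(D) = 0 once D >= min{p, 1-p} = 0.3.
```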

The above theorem can be extended to nonbinary (finite alphabet) memoryless sources, resulting in a more complicated (but exact) expression for R(D); see [207].
We instead present a simple lower bound on R(D) for a nonbinary DMS under the
Hamming distortion measure; this bound is a special case of the so-called Shannon
lower bound on the rate–distortion function of a DMS [345] (see also [42, 158], [83,
Problem 10.6]).

Theorem 6.24 Fix a DMS {Z_n}_{n=1}^∞ with distribution P_Z. Then, the source's rate–distortion function under the Hamming additive distortion measure and Ẑ = Z satisfies

    R(D) ≥ H(Z) − D log₂(|Z| − 1) − h_b(D)  for 0 ≤ D ≤ D_max,   (6.4.1)

where D_max is given in (6.3.4) as

    D_max := min_{ẑ} Σ_{z∈Z} P_Z(z) ρ(z, ẑ) = 1 − max_{ẑ} P_Z(ẑ),

H(Z) is the source entropy and h_b(·) is the binary entropy function. Furthermore, equality holds in the above bound for D ≤ (|Z| − 1) min_{z∈Z} P_Z(z).

Proof The proof is left as an exercise. (Hint: use Fano’s inequality and examine the
equality condition.)

Observation 6.25 (Special cases of Theorem 6.24)


• If the source is binary (i.e., |Z| = 2) with P_Z(0) = p, then

    (|Z| − 1) min_{z∈Z} P_Z(z) = min{p, 1 − p} = D_max.

Thus, the condition for equality in (6.4.1) always holds and Theorem 6.24 reduces to Theorem 6.23.
• If the source is uniformly distributed (i.e., P_Z(z) = 1/|Z| for all z ∈ Z), then

    (|Z| − 1) min_{z∈Z} P_Z(z) = (|Z| − 1)/|Z| = D_max.

Thus, we directly obtain from Theorem 6.24 that


    R(D) = log₂(|Z|) − D log₂(|Z| − 1) − h_b(D),  if 0 ≤ D ≤ (|Z| − 1)/|Z|;   (6.4.2)
    R(D) = 0,                                      if D > (|Z| − 1)/|Z|.

6.4.2 Rate–Distortion Function for Continuous Memoryless Sources Under the Squared Error Distortion Measure

We next examine the calculation or bounding of the rate–distortion function for continuous memoryless sources under the additive squared error distortion measure.
We first show that the Gaussian source maximizes the rate–distortion function among
all continuous sources with identical variance. This result, whose proof uses the fact
that the Gaussian distribution maximizes differential entropy among all real-valued
sources with the same variance (Theorem 5.20), can be seen as a dual result to
Theorem 5.33, which states that Gaussian noise minimizes the capacity of additive
noise channels. We also obtain the Shannon lower bound on the rate–distortion
function of continuous sources under the squared error distortion measure.
Theorem 6.26 (Gaussian sources maximize the rate–distortion function) Under the additive squared error distortion measure, namely ρ_n(z^n, ẑ^n) = Σ_{i=1}^n (z_i − ẑ_i)², the rate–distortion function of any continuous memoryless source {Z_i} with a pdf of support R, zero mean, variance σ², and finite differential entropy satisfies

    R(D) ≤ (1/2) log₂(σ²/D),  for 0 < D ≤ σ²;
    R(D) = 0,                 for D > σ²,

with equality holding when the source is Gaussian.


Proof By Theorem 6.20 (extended to the “unbounded” squared error distortion measure),

    R(D) = R^{(I)}(D) = min_{f_{Ẑ|Z}: E[(Z − Ẑ)²] ≤ D} I(Z; Ẑ).

So for any f_{Ẑ|Z} satisfying the distortion constraint,

    R(D) ≤ I(f_Z, f_{Ẑ|Z}).

For 0 < D ≤ σ², choose a dummy Gaussian random variable W with zero mean and variance aD, where a = 1 − D/σ², and is independent of Z. Let Ẑ = aZ + W. Then

    E[(Z − Ẑ)²] = E[(1 − a)² Z²] + E[W²] = (1 − a)² σ² + aD = D,

which satisfies the distortion constraint. Note that the variance of Ẑ is equal to E[a² Z²] + E[W²] = σ² − D. Consequently,
    R(D) ≤ I(Z; Ẑ)
         = h(Ẑ) − h(Ẑ | Z)
         = h(Ẑ) − h(W + aZ | Z)
         = h(Ẑ) − h(W | Z)                                    (by Lemma 5.14)
         = h(Ẑ) − h(W)                                        (by the independence of W and Z)
         = h(Ẑ) − (1/2) log₂(2πe(aD))
         ≤ (1/2) log₂(2πe(σ² − D)) − (1/2) log₂(2πe(aD))      (by Theorem 5.20)
         = (1/2) log₂(σ²/D).

For D > σ², let Ẑ satisfy Pr{Ẑ = 0} = 1 and be independent of Z. Then E[(Z − Ẑ)²] = E[Z²] + E[Ẑ²] − 2E[Z]E[Ẑ] = σ² < D, and I(Z; Ẑ) = 0. Hence, R(D) = 0 for D > σ².

The achievability of this upper bound by a Gaussian source (with zero mean and variance σ²) can be proved by showing that under the Gaussian source, (1/2) log₂(σ²/D) is a lower bound to R(D) for 0 < D ≤ σ². Indeed, when the source is Gaussian and for any f_{Ẑ|Z} such that E[(Z − Ẑ)²] ≤ D, we have

    I(Z; Ẑ) = h(Z) − h(Z | Ẑ)
            = (1/2) log₂(2πeσ²) − h(Z − Ẑ | Ẑ)
            ≥ (1/2) log₂(2πeσ²) − h(Z − Ẑ)                          (by Lemma 5.14)
            ≥ (1/2) log₂(2πeσ²) − (1/2) log₂(2πe Var[Z − Ẑ])        (by Theorem 5.20)
            ≥ (1/2) log₂(2πeσ²) − (1/2) log₂(2πe E[(Z − Ẑ)²])
            ≥ (1/2) log₂(2πeσ²) − (1/2) log₂(2πeD)
            = (1/2) log₂(σ²/D).


Theorem 6.27 (Shannon lower bound on the rate–distortion function: squared error distortion) Consider a continuous memoryless source {Z_i} with a pdf of support R and finite differential entropy under the additive squared error distortion measure. Then, its rate–distortion function satisfies

    R(D) ≥ h(Z) − (1/2) log₂(2πeD).
Proof The proof, which follows similar steps as in the achievability of the upper
bound in the proof of the previous theorem, is left as an exercise. 

The above two theorems yield that for any continuous memoryless source {Z_i} with zero mean and variance σ², its rate–distortion function under the mean square error distortion measure satisfies

    R_SH(D) ≤ R(D) ≤ R_G(D)  for 0 < D ≤ σ²,   (6.4.3)

where

    R_SH(D) := h(Z) − (1/2) log₂(2πeD)   and   R_G(D) := (1/2) log₂(σ²/D),

and with equality holding when the source is Gaussian. Thus, the difference between the upper and lower bounds on R(D) in (6.4.3) is

    R_G(D) − R_SH(D) = −h(Z) + (1/2) log₂(2πeσ²) = D(Z‖Z_G),   (6.4.4)

where D(Z‖Z_G) is the non-Gaussianness of Z, i.e., the divergence between Z and a Gaussian random variable Z_G of mean zero and variance σ².
Thus, the gap between the upper and lower bounds in (6.4.4) is zero if the source is Gaussian and strictly positive if the source is non-Gaussian. For example, if Z is uniformly distributed on the interval [−√(3σ²), √(3σ²)] and hence has variance σ², then D(Z‖Z_G) = 0.255 bits for σ² = 1. On the other hand, if Z is Laplacian distributed with variance σ², or parameter σ/√2 (its pdf is given by f_Z(z) = (1/(√2 σ)) exp{−√2 |z|/σ} for z ∈ R), then D(Z‖Z_G) = 0.104 bits for σ² = 1. We can thus deduce that the
Laplacian source is more similar to the Gaussian source than the uniformly distributed
source and hence its rate–distortion function is closer to Gaussian’s rate–distortion
function RG (D) than that of a uniform source. Finally in light of (6.4.4), the bounds
on R(D) in (6.4.3) can be expressed in terms of the Gaussian rate–distortion function
RG (D) and the non-Gaussianness D(Z Z G ), as follows:

    R_G(D) − D(Z‖Z_G) ≤ R(D) ≤ R_G(D)  for 0 < D ≤ σ².   (6.4.5)

Note that (6.4.5) is nothing but the dual of (5.7.4) and is an illustration of the duality
between the rate–distortion and capacity-cost functions.
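The two gap values quoted above can be checked with a few lines of Python using the standard closed-form differential entropies of the uniform, Laplacian, and Gaussian densities (a sketch under the stated zero-mean, unit-variance assumptions; the variable names are ours):

```python
import math

# Differential entropies in bits of zero-mean, unit-variance densities.
h_gauss   = 0.5 * math.log2(2 * math.pi * math.e)        # Gaussian N(0, 1)
h_uniform = math.log2(2 * math.sqrt(3.0))                # uniform on [-sqrt(3), sqrt(3)]
h_laplace = math.log2(2 * math.e / math.sqrt(2.0))       # Laplacian with variance 1 (parameter 1/sqrt(2))

# For matched mean and variance, the non-Gaussianness is D(Z||Z_G) = h(Z_G) - h(Z).
print(f"uniform source:   D(Z||Z_G) = {h_gauss - h_uniform:.3f} bits")   # about 0.255
print(f"laplacian source: D(Z||Z_G) = {h_gauss - h_laplace:.3f} bits")   # about 0.104
```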

Observation 6.28 (Rate–distortion function of Gaussian sources with memory)


Recall that Theorem 6.21 also holds for stationary ergodic Gaussian sources with
finite second moment under the squared error distortion (e.g., see Footnote 2 in the
previous section or [42, Theorems 7.2.4 and 7.2.5]). Note that a zero-mean stationary
Gaussian source {X i } is ergodic if its covariance function K X (τ ) → 0 as τ → ∞.
For such sources, the rate–distortion function R(D) can be determined paramet-
rically; see [42, Theorem 4.5.3]. Furthermore, if the sources are also Markov, then
R(D) admits an explicit analytical expression for small values of D. More specifi-
cally, consider a zero-mean unit-variance stationary Gauss–Markov source {X i } with
covariance function K X (τ ) = a τ , where 0 < a < 1 is the correlation coefficient.
Then,

    R(D) = (1/2) log₂((1 − a²)/D)  for D ≤ (1 − a)/(1 + a).

For D > (1 − a)/(1 + a), R(D) can be obtained parametrically [42, p. 114].

6.4.3 Rate–Distortion Function for Continuous Memoryless Sources Under the Absolute Error Distortion Measure

We herein focus on the rate–distortion function of continuous memoryless sources under the absolute error distortion measure. In particular, we show that among all
zero-mean real-valued sources with absolute mean λ (i.e., E[|Z |] = λ), the Laplacian
source with parameter λ (i.e., with variance 2λ2 ) maximizes the rate–distortion func-
tion. This result, which also provides the expression of the rate–distortion function of
Laplacian sources, is similar to Theorem 6.26 regarding the maximal rate–distortion
function under the squared error distortion measure (which is achieved by Gaussian
sources). It is worth pointing out that in image coding applications, the Laplacian
distribution is a good model to approximate the statistics of transform coefficients
such as discrete cosine and wavelet transform coefficients [315, 375]. Finally, anal-
ogously to Theorem 6.27, we obtain a Shannon lower bound on the rate–distortion
function under the absolute error distortion.

Theorem 6.29 (Laplacian sources maximize the rate–distortion function) Under the additive absolute error distortion measure, namely ρ_n(z^n, ẑ^n) = Σ_{i=1}^n |z_i − ẑ_i|, the rate–distortion function of any continuous memoryless source {Z_i} with a pdf of support R, zero mean and E|Z| = λ, where λ > 0 is a fixed parameter, satisfies

    R(D) ≤ log₂(λ/D),  for 0 < D ≤ λ;
    R(D) = 0,          for D > λ,

with equality holding when the source is Laplacian with mean zero and variance 2λ² (parameter λ); i.e., its pdf is given by f_Z(z) = (1/(2λ)) e^{−|z|/λ}, z ∈ R.
Proof Since

    R(D) = min_{f_{Ẑ|Z}: E[|Z − Ẑ|] ≤ D} I(Z; Ẑ),

we have that for any f_{Ẑ|Z} satisfying the distortion constraint,

    R(D) ≤ I(Z; Ẑ) = I(f_Z, f_{Ẑ|Z}).

For 0 < D ≤ λ, choose

    Ẑ = (1 − D/λ)² Z + sgn(Z) |W|,

where sgn(Z) is equal to 1 if Z ≥ 0 and to −1 if Z < 0, and W is a dummy random variable that is independent of Z and has a Laplacian distribution with mean zero and E[|W|] = (1 − D/λ)D, i.e., with parameter (1 − D/λ)D. Thus

    E[|Ẑ|] = E[ |(1 − D/λ)² Z + sgn(Z)|W|| ]
           = (1 − D/λ)² E[|Z|] + E[|W|]
           = (1 − D/λ)² λ + (1 − D/λ) D
           = λ − D   (6.4.6)

and

    E[|Z − Ẑ|] = E[ |(1 − D/λ)² Z + sgn(Z)|W| − Z| ]
               = E[ |(2 − D/λ)(D/λ) Z − sgn(Z)|W|| ]
               = (2 − D/λ)(D/λ) E[|Z|] − E[|W|]
               = (2 − D/λ)(D/λ) λ − (1 − D/λ) D
               = D,

and hence this choice of Ẑ satisfies the distortion constraint. We can therefore write for this Ẑ that
    R(D) ≤ I(Z; Ẑ)
         = h(Ẑ) − h(Ẑ | Z)
     (a) = h(Ẑ) − h(sgn(Z)|W| | Z)
         = h(Ẑ) − h(|W| | Z) − log₂ |sgn(Z)|
     (b) = h(Ẑ) − h(|W|)
     (c) = h(Ẑ) − h(W)
     (d) = h(Ẑ) − log₂[2e(1 − D/λ)D]
     (e) ≤ log₂[2e(λ − D)] − log₂[2e(1 − D/λ)D]
         = log₂(λ/D),

where (a) follows from the expression of Ẑ and the fact that differential entropy is invariant under translations (Lemma 5.14), (b) holds since Z and W are independent of each other, (c) follows from the fact that W is Laplacian and that the Laplacian is symmetric, and (d) holds since the differential entropy of a zero-mean Laplacian random variable Z′ with E[|Z′|] = λ′ is given by

    h(Z′) = log₂(2eλ′)   (in bits).

Finally, (e) follows by noting that E[|Ẑ|] = λ − D from (6.4.6) and from the fact that among all zero-mean real-valued random variables Z′ with E[|Z′|] = λ′, the Laplacian random variable with zero mean and parameter λ′ maximizes differential entropy (see Observation 5.21, item 3).
For D > λ, let Ẑ satisfy Pr(Ẑ = 0) = 1 and be independent of Z. Then E[|Z − Ẑ|] ≤ E[|Z|] + E[|Ẑ|] = λ < D. For this choice of Ẑ, R(D) ≤ I(Z; Ẑ) = 0 and hence R(D) = 0. This completes the proof of the upper bound.
We next show that the upper bound can be achieved by a Laplacian source with mean zero and parameter λ. This is proved by showing that for a Laplacian source with mean zero and parameter λ, log₂(λ/D) is a lower bound to R(D) for 0 < D ≤ λ: for such a Laplacian source and for any f_{Ẑ|Z} satisfying E[|Z − Ẑ|] ≤ D, we have

    I(Z; Ẑ) = h(Z) − h(Z | Ẑ)
            = log₂(2eλ) − h(Z − Ẑ | Ẑ)
            ≥ log₂(2eλ) − h(Z − Ẑ)        (by Lemma 5.14)
            ≥ log₂(2eλ) − log₂(2eD)
            = log₂(λ/D),
where the last inequality follows since

    h(Z − Ẑ) ≤ log₂(2e E[|Z − Ẑ|]) ≤ log₂(2eD)

by Observation 5.21 and the fact that E[|Z − Ẑ|] ≤ D.
Theorem 6.30 (Shannon lower bound on the rate–distortion function: absolute error
distortion) Consider a continuous memoryless source {Z i } with a pdf of support R
and finite differential entropy under the additive absolute error distortion measure.
Then, its rate–distortion function satisfies

R(D) ≥ h(Z ) − log2 (2eD).

Proof The proof is left as an exercise. 


Observation 6.31 (Shannon lower bound on the rate–distortion function: difference distortion) The general form of the Shannon lower bound for a source {Z_i} with finite differential entropy under the difference distortion measure ρ(z, ẑ) = d(z − ẑ), where d(·) is a nonnegative function, is as follows:

    R(D) ≥ h(Z) − sup_{X: E[d(X)] ≤ D} h(X),

where the supremum is taken over all random variables X with a pdf satisfying E[d(X)] ≤ D. It can be readily seen that this bound encompasses those in Theorems 6.27 and 6.30 as special cases.3

6.5 Lossy Joint Source–Channel Coding Theorem

By combining the rate–distortion theorem with the channel coding theorem, the
optimality of separation between lossy source coding and channel coding can be
established and Shannon’s lossy joint source–channel coding theorem (also known as
the lossy information-transmission theorem) can be shown for the communication of
a source over a noisy channel and its reconstruction within a distortion threshold at the
receiver. These results can be viewed as the “lossy” counterparts of the lossless joint
source–channel coding theorem and the separation principle discussed in Sect. 4.6.
Definition 6.32 (Lossy source–channel block code) Given a discrete-time source {Z_i}_{i=1}^∞ with alphabet Z and reproduction alphabet Ẑ, and a discrete-time channel with input and output alphabets X and Y, respectively, an m-to-n lossy source–channel block code with rate m/n source symbols/channel symbol is a pair of mappings (f^(sc), g^(sc)), where4

    f^(sc): Z^m → X^n   and   g^(sc): Y^n → Ẑ^m.

3 The asymptotic tightness of this bound as D approaches zero is studied in [249].
4 Note that, as pointed out in Sect. 4.6, n, f^(sc), and g^(sc) are all a function of m.
Fig. 6.3 An m-to-n block lossy source–channel coding system: the source block Z^m ∈ Z^m is mapped by the encoder f^(sc) to the channel input X^n, and the channel output Y^n is mapped by the decoder g^(sc) to the reproduction Ẑ^m ∈ Ẑ^m

The code's operation is illustrated in Fig. 6.3. The source m-tuple Z^m is encoded via the encoding function f^(sc), yielding the codeword X^n = f^(sc)(Z^m) as the channel input. The channel output Y^n, which is dependent on Z^m only via X^n (i.e., we have the Markov chain Z^m → X^n → Y^n), is decoded via g^(sc) to obtain the source tuple estimate Ẑ^m = g^(sc)(Y^n).
Given an additive distortion measure ρ_m(z^m, ẑ^m) = Σ_{i=1}^m ρ(z_i, ẑ_i), where ρ is a distortion function on Z × Ẑ, we say that the m-to-n lossy source–channel block code (f^(sc), g^(sc)) satisfies the average distortion fidelity criterion D, where D ≥ 0, if

    (1/m) E[ρ_m(Z^m, Ẑ^m)] ≤ D.

Theorem 6.33 (Lossy joint source–channel coding theorem) Consider a discrete-time stationary ergodic source {Z_i}_{i=1}^∞ with finite alphabet Z, finite reproduction alphabet Ẑ, bounded additive distortion measure ρ_m(·, ·) and rate–distortion function R(D),5 and consider a discrete-time memoryless channel with input alphabet X, output alphabet Y, and capacity C.6 Assuming that both R(D) and C are measured in the same units, the following hold:
• Forward part (achievability): For any D > 0, there exists a sequence of m-to-n_m lossy source–channel codes (f^(sc), g^(sc)) satisfying the average distortion fidelity criterion D for sufficiently large m if

    lim sup_{m→∞} (m/n_m) · R(D) < C.

• Converse part: On the other hand, for any sequence of m-to-n_m lossy source–channel codes (f^(sc), g^(sc)) satisfying the average distortion fidelity criterion D, we have

    (m/n_m) · R(D) ≤ C.

5 Note that Z and Ẑ can also be continuous alphabets with an unbounded distortion function.
In this case, the theorem still holds under appropriate conditions (e.g., [42, Problem 7.5], [135,
Theorem 9.6.3]) that can accommodate, for example, the important class of Gaussian sources under
the squared error distortion function (e.g., [135, p. 479]).
6 The channel can have either finite or continuous alphabets. For example, it can be the memoryless

Gaussian (i.e., AWGN) channel with input power P; in this case, C = C(P).
Proof The proof uses both the channel coding theorem (i.e., Theorem 4.11) and the
rate–distortion theorem (i.e., Theorem 6.21) and follows similar arguments as the
proof of the lossless joint source–channel coding theorem presented in Sect. 4.6. We
leave the proof as an exercise.
Observation 6.34 (Lossy joint source–channel coding theorem with signaling rates) The above theorem also admits another form when the source and channel are described in terms of “signaling rates” (e.g., [51]). More specifically, let T_s and T_c represent the durations (in seconds) per source letter and per channel input symbol, respectively.7 In this case, T_c/T_s represents the source–channel transmission rate measured in source symbols per channel use (or input symbol). Thus, again assuming that both R(D) and C are measured in the same units, the theorem becomes as follows:
• The source can be reproduced at the output of the channel with distortion less than D (i.e., there exist lossy source–channel codes asymptotically satisfying the average distortion fidelity criterion D) if

    (T_c/T_s) · R(D) < C.

• Conversely, for any lossy source–channel codes satisfying the average distortion fidelity criterion D, we have

    (T_c/T_s) · R(D) ≤ C.

6.6 Shannon Limit of Communication Systems

We close this chapter by applying Theorem 6.33 to a few useful examples of communication systems. Specifically, we obtain a bound on the end-to-end distortion of any communication system using the fact that if a source with rate–distortion function R(D) can be transmitted over a channel with capacity C via a source–channel block code of rate Rsc > 0 (in source symbols/channel use) and reproduced at the destination with an average distortion no larger than D, then we must have that

    Rsc · R(D) ≤ C,

or equivalently,

    R(D) ≤ (1/Rsc) C.   (6.6.1)

7 In
other words, the source emits symbols at a rate of 1/Ts source symbols per second and the
channel accepts inputs at a rate of 1/Tc channel symbols per second.
Solving for the smallest D, say D SL , satisfying (6.6.1) with equality8 yields a lower
bound, called the Shannon limit,9 on the distortion of all realizable lossy source–
channel codes for the system with rate Rsc .
In the following examples, we calculate the Shannon limit for some source–
channel configurations. The Shannon limit is not necessarily achievable in general,
although this is the case in the first two examples.
Example 6.35 (Shannon limit for a binary uniform DMS over a BSC)10 Let Z = Ẑ = {0, 1} and consider a binary uniformly distributed DMS {Z_i} (i.e., a Bernoulli(1/2) source) using the additive Hamming distortion measure. Note that in this case, E[ρ(Z, Ẑ)] = P(Z ≠ Ẑ) := Pb; in other words, the expected distortion is nothing but the source's bit error probability Pb. We desire to transmit the source over a BSC with crossover probability ε < 1/2.
From Theorem 6.23, we know that for 0 ≤ D ≤ 1/2, the source's rate–distortion function is given by

    R(D) = 1 − h_b(D),

where h_b(·) is the binary entropy function. Also from (4.5.5), the channel's capacity is given by

    C = 1 − h_b(ε).

The Shannon limit D_SL for this system with source–channel transmission rate Rsc is determined by solving (6.6.1) with equality:

    1 − h_b(D_SL) = (1/Rsc) [1 − h_b(ε)].

This yields that

    D_SL = h_b^{-1}( 1 − (1 − h_b(ε))/Rsc )   (6.6.2)

for D_SL ≤ 1/2, where for any t ≥ 0,

    h_b^{-1}(t) := inf{x : t = h_b(x)},  if 0 ≤ t ≤ 1;   (6.6.3)
    h_b^{-1}(t) := 0,                    if t > 1,

is the inverse of the binary entropy function on the interval [0, 1/2]. Thus, D_SL given in (6.6.2) gives a lower bound on the bit error probability Pb of any rate-Rsc source–channel code used for this system. In particular, if Rsc = 1 source symbol/channel use, we directly obtain from (6.6.2) that11

    D_SL = ε.   (6.6.4)

8 If the strict inequality R(D) < (1/Rsc) C always holds, then in this case, the Shannon limit is D_SL = D_min := E[ min_{ẑ∈Ẑ} ρ(Z, ẑ) ].
9 Other similar quantities used in the literature are the optimal performance theoretically achievable (OPTA) [42] and the limit of the minimum transmission ratio (LMTR) [87].
10 This example appears in various sources including [205, Sect. 11.8], [87, Problem 2.2.16], and [266, Problem 5.7].

Shannon limit over an equivalent binary-input hard-decision demodulated AWGN channel: It is well known that a BSC with crossover probability ε represents a binary-input AWGN channel used with antipodal (BPSK) signaling and hard-decision coherent demodulation (e.g., see [248]). More specifically, if the channel has noise power σ_N² = N0/2 (i.e., the channel's underlying continuous-time white noise has power spectral density N0/2) and uses antipodal signaling with average energy P per signal, then the BSC's crossover probability can be expressed in terms of P and N0 as follows:

    ε = Q(√(P/σ_N²)) = Q(√(2P/N0)),   (6.6.5)

where

    Q(x) = (1/√(2π)) ∫_x^∞ e^{−t²/2} dt

is the Gaussian Q-function. Furthermore, if the channel is used with a source–channel code of rate Rsc source (or information) bits/channel use, then 2P/N0 can be expressed in terms of a so-called SNR per source (or information) bit denoted by γb := Eb/N0, where Eb is the average energy per source bit. Indeed, we have that P = Rsc Eb and thus, using (6.6.5), we have

    ε = Q(√(2Rsc Eb/N0)) = Q(√(2Rsc γb)).   (6.6.6)

Thus, in light of (6.6.2), the Shannon limit for sending a uniform binary source over an AWGN channel used with antipodal modulation and hard-decision decoding satisfies the following in terms of the SNR per source bit γb:

    Rsc (1 − h_b(D_SL)) = 1 − h_b( Q(√(2Rsc γb)) )   (6.6.7)

for D SL ≤ 1/2. In Table 6.1, we use (6.6.7) to present the optimal (minimal) values
of γb (in dB) for a given target value of D SL and a given source–channel code rate
Rsc < 1. The table indicates, for example, that if we desire to achieve an end-to-end
bit error probability of no larger than 10−5 at a rate of 1/2, then the system’s SNR per
source bit can be no smaller than 1.772 dB. The Shannon limit values can similarly
be computed for rates Rsc > 1.

11 Source–channel systems with rate Rsc = 1 are typically referred to as systems with matched
source and channel bandwidths (or signaling rates). Also, when Rsc < 1 (resp., > 1), the system is
said to have bandwidth compression (resp., bandwidth expansion); e.g., cf. [274, 314, 358].
Table 6.1 Shannon limit values γb = Eb/N0 (dB) for sending a binary uniform source over a BPSK-modulated AWGN channel used with hard-decision decoding

    Rate Rsc | D_SL = 0 | D_SL = 10^-5 | D_SL = 10^-4 | D_SL = 10^-3 | D_SL = 10^-2
    1/3      | 1.212    | 1.210        | 1.202        | 1.150        | 0.077
    1/2      | 1.775    | 1.772        | 1.763        | 1.703        | 1.258
    2/3      | 2.516    | 2.513        | 2.503        | 2.423        | 1.882
    4/5      | 3.369    | 3.367        | 3.354        | 3.250        | 2.547
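The entries of Table 6.1 can be reproduced by solving (6.6.7) numerically for γb. Below is a minimal Python sketch (the function names are ours, not from the text) that uses bisection and the Q-function expressed via erfc:

```python
import math

def h_b(x):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def Q(x):
    """Gaussian tail function Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def shannon_limit_snr_db(Rsc, D_SL, lo=1e-4, hi=100.0):
    """Solve Rsc*(1 - h_b(D_SL)) = 1 - h_b(Q(sqrt(2*Rsc*gamma_b))) for gamma_b, as in (6.6.7)."""
    target = Rsc * (1.0 - h_b(D_SL))
    f = lambda g: (1.0 - h_b(Q(math.sqrt(2.0 * Rsc * g)))) - target
    for _ in range(200):            # bisection; f is increasing in gamma_b
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 10.0 * math.log10(0.5 * (lo + hi))

if __name__ == "__main__":
    for Rsc in (1/3, 1/2, 2/3, 4/5):
        print(f"Rsc = {Rsc:.3f}:  gamma_b >= {shannon_limit_snr_db(Rsc, 1e-5):.3f} dB  (D_SL = 1e-5)")
    # For Rsc = 1/2 this matches the 1.772 dB entry of Table 6.1 up to rounding.
```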

Optimality of uncoded communication: Note that for Rsc = 1, the Shannon limit in (6.6.4) can surprisingly be achieved by a simple uncoded scheme12: just directly transmit the source over the channel (i.e., set the blocklength m = 1 and the channel input X_i = Z_i for any time instant i = 1, 2, . . .) and declare the channel output as the reproduced source symbol (i.e., set Ẑ_i = Y_i for any i).13
In this case, the expected distortion (i.e., bit error probability) of this uncoded rate-one source–channel scheme is indeed given as follows:

    E[ρ(Z, Ẑ)] = Pb
               = P(X ≠ Y)
               = P(Y ≠ X | X = 1)(1/2) + P(Y ≠ X | X = 0)(1/2)
               = ε
               = D_SL.

We conclude that this rate-one uncoded source–channel scheme achieves the Shan-
non limit and is hence optimal. Furthermore, this scheme, which has no encod-
ing/decoding delay and no complexity, is clearly more desirable than using a separate
source–channel coding scheme,14 which would impose large encoding and decoding
delays and would demand significant computational/storage resources.
Note that for rates Rsc ≠ 1 and/or nonuniform sources, the uncoded scheme is not
optimal and hence more complicated joint or separate source–channel codes would
be required to yield a bit error probability arbitrarily close to (but strictly larger than)
the Shannon limit D SL . Finally, we refer the reader to [140], where necessary and
sufficient conditions are established for source–channel pairs under which uncoded
schemes are optimal.
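The optimality of the uncoded scheme can also be checked empirically: simulating direct transmission of a Bernoulli(1/2) source over a BSC(ε) gives a bit error rate that converges to ε = D_SL. A small Monte Carlo sketch in Python (assuming NumPy; the parameter values are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

def uncoded_bsc_bit_error_rate(eps, n_symbols=1_000_000):
    """Send a Bernoulli(1/2) source uncoded over a BSC(eps) and measure the bit error rate."""
    z = rng.integers(0, 2, size=n_symbols)        # source bits (also the channel input)
    flips = rng.random(n_symbols) < eps           # channel crossovers
    z_hat = z ^ flips                             # channel output = reproduced bits
    return np.mean(z != z_hat)

if __name__ == "__main__":
    for eps in (0.01, 0.05, 0.1):
        pb = uncoded_bsc_bit_error_rate(eps)
        print(f"eps = {eps:.2f}:  measured Pb = {pb:.4f}  (Shannon limit D_SL = eps)")
```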

Observation 6.36 The following two systems are extensions of the system consid-
ered in the above example.

12 Uncoded transmission schemes are also referred to as scalar or single-letter codes.


13 In other words, the code’s encoding and decoding functions, f (sc) and g (sc) , respectively, are both
equal to the identity mapping.
14 Note that in this system, since the source is incompressible, no source coding is actually required.

Still the separate coding scheme will consist of a near-capacity achieving channel code.
• Binary nonuniform DMS over a BSC: The system is identical to that of Example 6.35 with the exception that the binary DMS is nonuniformly distributed with P(Z = 0) = p. Using the expression of R(D) from Theorem 6.23, it can be readily shown that this system's Shannon limit is given by

    D_SL = h_b^{-1}( h_b(p) − (1 − h_b(ε))/Rsc )   (6.6.8)

for D_SL ≤ min{p, 1 − p}, where h_b^{-1}(·) is the inverse of the binary entropy function on the interval [0, 1/2] defined in (6.6.3). Setting p = 1/2 in (6.6.8) directly results in the Shannon limit given in (6.6.2), as expected.
• Nonbinary uniform DMS over a nonbinary symmetric channel: Given integer q ≥ 2, consider a q-ary uniformly distributed DMS with identical alphabet and reproduction alphabet Z = Ẑ = {0, 1, . . . , q − 1} using the additive Hamming distortion measure and the q-ary symmetric DMC (with q-ary input and output alphabets and symbol error rate ε) described in (4.2.11). Thus, using the expressions for the source's rate–distortion function in (6.4.2) and the channel's capacity in Example 4.19, we obtain that the Shannon limit of the system using rate-Rsc source–channel codes satisfies

    log₂(q) − D_SL log₂(q − 1) − h_b(D_SL) = (1/Rsc) [ log₂(q) − ε log₂(q − 1) − h_b(ε) ]   (6.6.9)

for D_SL ≤ (q − 1)/q. Setting q = 2 renders the source a Bernoulli(1/2) source and the channel a BSC with crossover probability ε, thus reducing (6.6.9) to (6.6.2).

Example 6.37 (Shannon limit for a memoryless Gaussian source over an AWGN channel [147]) Let Z = Ẑ = R and consider a memoryless Gaussian source {Z_i} of mean zero and variance σ² and the squared error distortion function. The objective is to transmit the source over an AWGN channel with input power constraint P and noise variance σ_N², and recover it with distortion fidelity no larger than D, for a given threshold D > 0.
By Theorem 6.26, the source's rate–distortion function is given by

    R(D) = (1/2) log₂(σ²/D)  for 0 < D < σ².

Furthermore, the capacity (or capacity-cost function) of the AWGN channel is given in (5.4.13) as

    C(P) = (1/2) log₂(1 + P/σ_N²).

The Shannon limit D_SL for this system with rate Rsc is obtained by solving

    R(D_SL) = C(P)/Rsc,

or equivalently,

    (1/2) log₂(σ²/D_SL) = (1/(2Rsc)) log₂(1 + P/σ_N²)

for 0 < D_SL < σ², which gives

    D_SL = σ² / (1 + P/σ_N²)^{1/Rsc}.   (6.6.10)

In particular, for a system with rate Rsc = 1, the Shannon limit in (6.6.10) becomes

    D_SL = σ² σ_N² / (P + σ_N²).   (6.6.11)

Optimality of a simple rate-one scalar source–channel coding scheme: The following simple (uncoded) source–channel coding scheme with rate Rsc = 1 can achieve the Shannon limit in (6.6.11). The code's encoding and decoding functions are scalar (with m = 1). More specifically, at any time instant i, the channel input (with power constraint P) is given by

    X_i = f^(sc)(Z_i) = √(P/σ²) Z_i,

and is sent over the AWGN channel. At the receiver, the corresponding channel output Y_i = X_i + N_i, where N_i is the additive Gaussian noise (which is independent of Z_i), is decoded via a scalar (MMSE) detector to yield the following reconstructed source symbol Ẑ_i:

    Ẑ_i = g^(sc)(Y_i) = [√(Pσ²)/(P + σ_N²)] Y_i.

A simple calculation reveals that this code's expected distortion is given by

    E[(Z − Ẑ)²] = E[ ( Z − [√(Pσ²)/(P + σ_N²)] (√(P/σ²) Z + N) )² ]
                = σ² σ_N² / (P + σ_N²)
                = D_SL,

which proves the optimality of this simple (delayless) source–channel code.


Extensions on the optimality of similar uncoded (scalar) schemes in Gaussian
sensor networks can be found in [38, 141].
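Both (6.6.11) and the optimality of the scalar MMSE scheme are easy to check numerically. The sketch below (ours; it assumes NumPy and the arbitrary example values σ² = 1, P = 4, σ_N² = 1) compares the closed-form Shannon limit (6.6.10) with the simulated distortion of the rate-one scalar code:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_awgn_shannon_limit(sigma2, P, sigmaN2, Rsc=1.0):
    """Shannon limit (6.6.10): D_SL = sigma^2 / (1 + P/sigma_N^2)**(1/Rsc)."""
    return sigma2 / (1.0 + P / sigmaN2) ** (1.0 / Rsc)

def scalar_scheme_distortion(sigma2, P, sigmaN2, n=1_000_000):
    """Monte Carlo distortion of the rate-one scalar scheme X = sqrt(P/sigma^2) Z, Zhat = MMSE(Y)."""
    z = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = np.sqrt(P / sigma2) * z + rng.normal(0.0, np.sqrt(sigmaN2), size=n)
    z_hat = (np.sqrt(P * sigma2) / (P + sigmaN2)) * y
    return np.mean((z - z_hat) ** 2)

if __name__ == "__main__":
    sigma2, P, sigmaN2 = 1.0, 4.0, 1.0
    print("D_SL (Rsc = 1):          ", gaussian_awgn_shannon_limit(sigma2, P, sigmaN2))  # 0.2
    print("scalar scheme distortion:", scalar_scheme_distortion(sigma2, P, sigmaN2))      # about 0.2
```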
Example 6.38 (Shannon limit for a memoryless Gaussian source over a fading chan-
nel) Consider the same system as the above example except that now the channel
is a memoryless fading channel as described in Observation 5.35 with input power
constraint P and noise variance σ 2N . We determine the Shannon limit of this system
with rate Rsc for two cases: (1) the fading coefficients are known at the receiver, and
(2) the fading coefficients are known at both the receiver and the transmitter.
1. Shannon limit with decoder side information (DSI): Using (5.4.17) for the channel capacity with DSI, we obtain that the Shannon limit with DSI is given by

    D_SL^(DSI) = σ² / { 2^{ E_A[ log₂(1 + A²P/σ_N²) ] } }^{1/Rsc}   (6.6.12)

for 0 < D_SL^(DSI) < σ². Making the fading process deterministic by setting A = 1 (almost surely) reduces (6.6.12) to (6.6.10), as expected.
2. Shannon limit with full side information (FSI): Similarly, using (5.4.19) for the fading channel capacity with FSI, we obtain the following Shannon limit:

    D_SL^(FSI) = σ² / { 2^{ E_A[ log₂(1 + A² p*(A)/σ_N²) ] } }^{1/Rsc}   (6.6.13)

for 0 < D_SL^(FSI) < σ², where p*(·) in (6.6.13) is given by

    p*(a) = max{ 0, 1/λ − σ_N²/a² }

and λ is chosen to satisfy E_A[p*(A)] = P.
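As a rough numerical sketch of (6.6.12) (ours, not from the text), the expectation E_A[log₂(1 + A²P/σ_N²)] can be estimated by Monte Carlo; purely for illustration we assume Rayleigh fading with unit second moment, as in Example 6.40 below:

```python
import numpy as np

rng = np.random.default_rng(2)

def dsi_shannon_limit(sigma2, P, sigmaN2, Rsc=1.0, n=1_000_000):
    """Monte Carlo evaluation of (6.6.12), assuming Rayleigh fading with E[A^2] = 1 (our choice)."""
    a = np.sqrt(rng.exponential(1.0, size=n))                 # A has density 2a*exp(-a^2), so A^2 ~ Exp(1)
    ergodic_c = np.mean(np.log2(1.0 + a**2 * P / sigmaN2))    # E_A[log2(1 + A^2 P / sigma_N^2)]
    return sigma2 / 2.0 ** (ergodic_c / Rsc)

if __name__ == "__main__":
    sigma2, P, sigmaN2 = 1.0, 4.0, 1.0
    print("fading (DSI) D_SL:", dsi_shannon_limit(sigma2, P, sigmaN2))
    print("AWGN (A = 1) D_SL:", sigma2 / (1.0 + P / sigmaN2))  # (6.6.10), for comparison
```

As expected from Jensen's inequality, the fading value exceeds the AWGN value for the same average power.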

Example 6.39 (Shannon limit for a binary uniform DMS over a binary-input AWGN channel) Consider the same binary uniform source as in Example 6.35 under the Hamming distortion measure, to be sent via a source–channel code over a binary-input AWGN channel used with antipodal (BPSK) signaling of power P and noise variance σ_N² = N0/2. Again, here the expected distortion is nothing but the source's bit error probability Pb. The source's rate–distortion function is given by Theorem 6.23 as presented in Example 6.35.
However, the channel capacity C(P) of the AWGN channel whose input takes on the two possible values +√P or −√P, whose output is real-valued (unquantized), and whose noise variance is σ_N² = N0/2, is given by evaluating the mutual information between the channel input and output under the input distribution P_X(+√P) = P_X(−√P) = 1/2 (e.g., see [63]):
    C(P) = (P/σ_N²) log₂(e) − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log₂[ cosh( P/σ_N² + y √(P/σ_N²) ) ] dy
         = [Rsc Eb/(N0/2)] log₂(e) − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log₂[ cosh( Rsc Eb/(N0/2) + y √(Rsc Eb/(N0/2)) ) ] dy
         = 2Rsc γb log₂(e) − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log₂[ cosh( 2Rsc γb + y √(2Rsc γb) ) ] dy,

where P = Rsc E b is the channel signal power, E b is the average energy per source
bit, Rsc is the rate in source bit/channel use of the system’s source–channel code,
and γb = E b /N0 is the SNR per source bit. The system’s Shannon limit satisfies
&
Rsc (1 − h b (D SL )) = 2Rsc γb log2 (e)
0 ∞ 
1 1
e−y /2 log2 [cosh(2Rsc γb + y 2Rsc γb )]dy ,
2
−√
2π −∞

or equivalently,

    h_b(D_SL) = 1 − 2γb log₂(e) + (1/(Rsc √(2π))) ∫_{−∞}^{∞} e^{−y²/2} log₂[ cosh( 2Rsc γb + y √(2Rsc γb) ) ] dy   (6.6.14)
for D SL ≤ 1/2. In Fig. 6.4, we use (6.6.14) to plot the Shannon limit versus γb (in
dB) for codes with rates 1/2 and 1/3. We also provide in Table 6.2 the optimal values
of γb for target values of D SL and Rsc .

Fig. 6.4 The Shannon limit for sending a binary uniform source over a BPSK-modulated AWGN channel with unquantized output; rates Rsc = 1/2 and 1/3 (plot of D_SL versus γb in dB; the curves approach approximately 0.186 dB for Rsc = 1/2 and −0.496 dB for Rsc = 1/3 as D_SL → 0; figure omitted)
Table 6.2 Shannon limit values γb = Eb/N0 (dB) for sending a binary uniform source over a BPSK-modulated AWGN channel with unquantized output

    Rate Rsc | D_SL = 0 | D_SL = 10^-5 | D_SL = 10^-4 | D_SL = 10^-3 | D_SL = 10^-2
    1/3      | −0.496   | −0.496       | −0.504       | −0.559       | −0.960
    1/2      | 0.186    | 0.186        | 0.177        | 0.111        | −0.357
    2/3      | 1.060    | 1.057        | 1.047        | 0.963        | 0.382
    4/5      | 2.040    | 2.038        | 2.023        | 1.909        | 1.152

The Shannon limits calculated above are pertinent due to the invention of near-
capacity achieving channel codes, such as turbo [44, 45] or LDPC [133, 134, 251,
252] codes. For example, the rate-1/2 turbo coding system proposed in [44, 45] can
approach a bit error rate of 10−5 at γb = 0.9 dB, which is only 0.714 dB away from
the Shannon limit of 0.186 dB. This implies that a near-optimal channel code has
been constructed, since in principle, no codes can perform better than the Shannon
limit. Source–channel turbo codes for sending nonuniform memoryless and Markov
binary sources over the BPSK-modulated AWGN channel are studied in [426–428].
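The 0.186 dB figure (and the other entries of Table 6.2) can be reproduced by evaluating the integral in (6.6.14) numerically and solving for γb. A Python sketch (ours) using a simple quadrature over a truncated range and bisection:

```python
import numpy as np

LOG2E = np.log2(np.e)

def bpsk_awgn_capacity(snr):
    """Soft-decision BPSK-AWGN capacity (bits/use) with snr = P / sigma_N^2, via the integral above."""
    y = np.linspace(-12.0, 12.0, 24001)
    dy = y[1] - y[0]
    gauss = np.exp(-y**2 / 2.0) / np.sqrt(2.0 * np.pi)
    u = np.abs(snr + y * np.sqrt(snr))
    # Numerically stable log2 cosh(u) = |u| log2(e) + log2(1 + exp(-2|u|)) - 1.
    log2cosh = u * LOG2E + np.log2(1.0 + np.exp(-2.0 * u)) - 1.0
    return snr * LOG2E - np.sum(gauss * log2cosh) * dy

def h_b(x):
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * np.log2(x) - (1.0 - x) * np.log2(1.0 - x)

def shannon_limit_snr_db(Rsc, D_SL, lo=1e-3, hi=50.0):
    """Solve Rsc (1 - h_b(D_SL)) = C(2 Rsc gamma_b) for gamma_b = Eb/N0, as in (6.6.14)."""
    target = Rsc * (1.0 - h_b(D_SL))
    for _ in range(60):                 # bisection; the capacity grows with gamma_b
        mid = 0.5 * (lo + hi)
        if bpsk_awgn_capacity(2.0 * Rsc * mid) < target:
            lo = mid
        else:
            hi = mid
    return 10.0 * np.log10(0.5 * (lo + hi))

if __name__ == "__main__":
    print(f"Rsc = 1/2, D_SL = 1e-5:  gamma_b >= {shannon_limit_snr_db(0.5, 1e-5):.3f} dB")   # about 0.186
    print(f"Rsc = 1/3, D_SL = 1e-5:  gamma_b >= {shannon_limit_snr_db(1/3, 1e-5):.3f} dB")   # about -0.496
```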

Example 6.40 (Shannon limit for a binary uniform DMS over a binary-input Rayleigh fading channel) Consider the same system as the one in the above example, except that the channel is a unit-power BPSK-modulated Rayleigh fading channel (with unquantized output). The channel is described by (5.4.16), where the input can take on one of the two values −1 or +1 (i.e., its input power is P = 1 = Rsc Eb), the noise variance is σ_N² = N0/2, and the fading distribution is Rayleigh:

    f_A(a) = 2a e^{−a²},  a > 0.

Assume also that the receiver knows the fading amplitude (i.e., the case of decoder side information). Then, the channel capacity is given by evaluating I(X; Y | A) under the uniform input distribution P_X(−1) = P_X(+1) = 1/2, yielding the following expression in terms of the SNR per source bit γb = Eb/N0:

    C_DSI(γb) = 1 − √(Rsc γb/π) ∫_0^{+∞} ∫_{−∞}^{+∞} f_A(a) e^{−Rsc γb (y+a)²} log₂( 1 + e^{4Rsc γb ya} ) dy da.

Now, setting Rsc R(D_SL) = C_DSI(γb) implies that the Shannon limit satisfies

    h_b(D_SL) = 1 − 1/Rsc + √(γb/(Rsc π)) ∫_0^{+∞} ∫_{−∞}^{+∞} f_A(a) e^{−Rsc γb (y+a)²} log₂( 1 + e^{4Rsc γb ya} ) dy da   (6.6.15)

for D SL ≤ 1/2. In Table 6.3, we present some Shannon limit values calculated from
(6.6.15).
Table 6.3 Shannon limit values γb = Eb/N0 (dB) for sending a binary uniform source over a BPSK-modulated Rayleigh fading channel with decoder side information

    Rate Rsc | D_SL = 0 | D_SL = 10^-5 | D_SL = 10^-4 | D_SL = 10^-3 | D_SL = 10^-2
    1/3      | 0.489    | 0.487        | 0.479        | 0.412        | −0.066
    1/2      | 1.830    | 1.829        | 1.817        | 1.729        | 1.107
    2/3      | 3.667    | 3.664        | 3.647        | 3.516        | 2.627
    4/5      | 5.936    | 5.932        | 5.904        | 5.690        | 4.331

Example 6.41 (Shannon limit for a binary uniform DMS over an AWGN channel) As in the above example, we consider a memoryless binary uniform source, but we assume that the channel is an AWGN channel (with real inputs and outputs) with power constraint P and noise variance σ_N² = N0/2. Recalling that the channel capacity is given by

    C(P) = (1/2) log₂(1 + P/σ_N²) = (1/2) log₂(1 + 2Rsc γb),

we obtain that the system's Shannon limit satisfies

    h_b(D_SL) = 1 − (1/(2Rsc)) log₂(1 + 2Rsc γb)

for D SL ≤ 1/2. In Fig. 6.5, we plot the above Shannon limit versus γb for systems
with Rsc = 1/2 and 1/3.
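For this AWGN case the Shannon limit can even be inverted in closed form: solving the last displayed equation for γb gives γb = (2^{2Rsc(1 − h_b(D_SL))} − 1)/(2Rsc). A short Python check (ours) reproduces the small-distortion asymptotes of Fig. 6.5 (about −0.55 dB for Rsc = 1/3 and essentially 0 dB for Rsc = 1/2):

```python
import math

def h_b(x):
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def gamma_b_limit_db(Rsc, D_SL):
    """Closed-form gamma_b from h_b(D_SL) = 1 - (1/(2 Rsc)) log2(1 + 2 Rsc gamma_b)."""
    gamma = (2.0 ** (2.0 * Rsc * (1.0 - h_b(D_SL))) - 1.0) / (2.0 * Rsc)
    return 10.0 * math.log10(gamma)

if __name__ == "__main__":
    print(f"Rsc = 1/3:  gamma_b >= {gamma_b_limit_db(1/3, 1e-6):.3f} dB")   # about -0.55 dB
    print(f"Rsc = 1/2:  gamma_b >= {gamma_b_limit_db(1/2, 1e-6):.3f} dB")   # essentially 0 dB
```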
Other examples of determining the Shannon limit for sending sources with mem-
ory over memoryless channels, such as discrete Markov sources under the Hamming
distortion function15 or Gauss–Markov sources under the squared error distortion
measure (e.g., see [98]) can be similarly considered. Finally, we refer the reader to
the end of Sect. 4.6 for a discussion of relevant works on lossy joint source–channel
coding.
Problems

1. Prove Observation 6.15.


15 For example, if the Markov source is binary symmetric, then its rate–distortion function is given by (6.3.9) for D ≤ Dc, and the Shannon limit for sending this source over, say, a BSC or an AWGN channel can be calculated. If the distortion region D > Dc is of interest, then (6.3.8) or the right side of (6.3.9) can be used as lower bounds on R(D); in this case, a lower bound on the Shannon limit can be obtained.

Fig. 6.5 The Shannon limits for sending a binary uniform source over a continuous-input AWGN channel; rates Rsc = 1/2 and 1/3 (plot of D_SL versus γb in dB; the curves approach approximately −0.001 dB for Rsc = 1/2 and −0.55 dB for Rsc = 1/3 as D_SL → 0; figure omitted)

2. Binary source with infinite distortion: Let {Z_i} be a DMS with binary source and reproduction alphabets Z = Ẑ = {0, 1}, distribution P_Z(1) = p = 1 − P_Z(0), where 0 < p ≤ 1/2, and the following distortion measure:

    ρ(z, ẑ) = 0,  if ẑ = z;
    ρ(z, ẑ) = 1,  if z = 1 and ẑ = 0;
    ρ(z, ẑ) = ∞,  if z = 0 and ẑ = 1.

(a) Determine the source’s rate–distortion function R(D) (in your calculations,
use the convention that 0 · ∞ = 0).
(b) Specialize R(D) to the case of p = 1/2 (uniform source).
3. Binary uniform source with erasure and infinite distortion: Consider a uniformly distributed DMS {Z_i} with alphabet Z = {0, 1} and reproduction alphabet Ẑ = {0, 1, E}, where E represents an erasure. Let the source's distortion function be given as follows:

    ρ(z, ẑ) = 0,  if ẑ = z;
    ρ(z, ẑ) = 1,  if ẑ = E;
    ρ(z, ẑ) = ∞,  otherwise.

Find the source’s rate–distortion function.


4. For the binary source and distortion measure considered in Problem 6.3, describe
a simple data compression scheme whose rate achieves the rate–distortion func-
tion R(D) for any given distortion threshold D.
5. Nonbinary uniform source with erasure and infinite distortion: Consider a sim-
ple generalization of Problem 6.3 above, where {Z i } is a uniformly distributed
nonbinary DMS with alphabet Z = {0, 1, . . . , q − 1}, reproduction alphabet
Ẑ = {0, 1, . . . , q − 1, E} and the same distortion function as above, where
q ≥ 2 is an integer. Find the source’s rate–distortion function and verify that it


reduces to the one derived in Problem 6.3 when q = 2.
6. Translated distortion: Consider a DMS {Z i } with alphabet Z and reproduction
alphabet Ẑ. Let R(D) denote the source’s rate–distortion function under the
distortion function ρ(·, ·). Consider a new distortion function ρ̂(·, ·) obtained by
adding to ρ(·, ·) a constant that depends on the source symbols. More specifically,
let

    ρ̂(z, ẑ) = ρ(z, ẑ) + c_z,

where z ∈ Z, ẑ ∈ Ẑ, and c_z is a constant that depends on the source symbol z. Show that the source's rate–distortion function R̂(D) associated with the new distortion function ρ̂(·, ·) can be expressed as follows in terms of R(D):

    R̂(D) = R(D − c̄)

for D ≥ c̄, where c̄ = Σ_{z∈Z} P_Z(z) c_z.
Note: This result was originally shown by Pinkston [302].
7. Scaled distortion: Consider a DMS {Z i } with alphabet Z, reproduction alphabet
Ẑ, distortion function ρ(·, ·), and rate–distortion function R(D). Let ρ̂(·, ·) be a
new distortion function obtained by scaling ρ(·, ·) via a positive constant a:

ρ̂(z, ẑ) = aρ(z, ẑ)

for z ∈ Z and ẑ ∈ Z.  Determine the source’s rate–distortion function R̂(D)


associated with the new distortion function ρ̂(·, ·) in terms of R(D).
8. Source symbols with zero distortion: Consider a DMS {Z i } with alphabet Z,
reproduction alphabet Ẑ, distortion function ρ(·, ·), and rate–distortion func-
tion R(D). Assume that one source symbol, say z_1, in Z has zero distortion: ρ(z_1, ẑ) = 0 for all ẑ ∈ Ẑ. Show that

    R(D) = (1 − P_Z(z_1)) R̃( D/(1 − P_Z(z_1)) ),

where R̃(D) is the rate–distortion function of the source {Z̃_i} with alphabet Z̃ = Z \ {z_1} and distribution

    P_Z̃(z) = P_Z(z)/(1 − P_Z(z_1)),  z ∈ Z̃,

and with the same reproduction alphabet Ẑ and distortion function ρ(·, ·).
Note: This result first appeared in [302].
9. Consider a DMS {Z i } with quaternary source and reproduction alphabets Z =
Ẑ = {0, 1, 2, 3}, probability distribution vector
    (P_Z(0), P_Z(1), P_Z(2), P_Z(3)) = (p/3, p/3, 1 − p, p/3)

for fixed 0 < p < 1, and distortion measure given by the following matrix:

    [ρ(z, ẑ)] = [ρ_zẑ] = [ 0  ∞  1  ∞
                           ∞  0  1  ∞
                           0  0  0  0
                           ∞  ∞  1  0 ].

Determine the source’s rate–distortion function.


10. Consider a binary DMS {Z_i} with Z = Ẑ = {0, 1}, distribution P_Z(0) = p, where 0 < p ≤ 1/2, and the following distortion matrix

    [ρ(z, ẑ)] = [ρ_zẑ] = [ b_1      a + b_1
                           a + b_2  b_2     ],

where a > 0, b1 , and b2 are constants.


(a) Find R(D) in terms of a, b1 , and b2 . (Hint: Use Problems 6.6 and 6.7.)
(b) What is R(D) when a = 1 and b1 = b2 = 0?
(c) What is R(D) when a = b2 = 1 and b1 = 0?
11. Prove Theorem 6.24.
12. Prove Theorem 6.30.
13. Memory decreases the rate distortion function: Give an example of a discrete
source with memory whose rate–distortion function is strictly less (at least in
some range of the distortion threshold) than the rate–distortion function of a
memoryless source with identical marginal distribution.
14. Lossy joint source–channel coding theorem—Forward part: Prove the forward
part of Theorem 6.33. For simplicity, assume that the source is memoryless.
15. Lossy joint source–channel coding theorem—Converse part: Prove the converse
part of Theorem 6.33.
16. Gap between the Laplacian rate–distortion function and the Shannon bound:
Consider a continuous memoryless source {Z i } with a pdf of support R, zero
mean, and E[|Z |] = λ, where λ > 0, under the absolute error distortion measure.
Show that for any 0 < D ≤ λ,

    R_L(D) − R_SLD(D) = D(Z‖Z_L),

where D(Z‖Z_L) is the divergence between Z and a Laplacian random variable


Z L of mean zero and parameter λ, R L (D) denotes the rate–distortion function
of source Z L (see Theorem 6.29) and R SLD (D) denotes the Shannon lower bound
under the absolute error distortion measure (see Theorem 6.30).
17. q-ary uniform DMS over the q-ary symmetric DMC: Given integer q ≥ 2,
consider a q-ary DMS {Z n } that is uniformly distributed over its alphabet Z =
{0, 1, . . . , q−1}, with reproduction alphabet Ẑ = Z and the Hamming distortion
measure. Consider also the q-ary symmetric DMC described in (4.2.11) with
q-ary input and output alphabets and symbol error rate ε ≤ (q − 1)/q.
.
Determine whether or not an uncoded source–channel transmission scheme of
rate Rsc = 1 source symbol/channel use (i.e., a source–channel code whose
encoder and decoder are both given by the identity function) is optimal for this
communication system.
18. Shannon limit of the erasure source–channel system: Given integer q ≥ 2, con-
sider the q-ary uniform DMS together with the distortion measure of Problem 6.5
above and the q-ary erasure channel described in (4.2.12), see also Problem 4.13.
(a) Find the system’s Shannon limit under a transmission rate of Rsc source
symbol/channel use.
(b) Describe an uncoded source–channel transmission scheme for the system
with rate Rsc = 1 and assess its optimality.
19. Shannon limit for a Laplacian source over an AWGN channel: Determine the
Shannon limit under the absolute error distortion criterion and a transmission
rate of Rsc source symbols/channel use for a communication system consisting
of a memoryless zero-mean Laplacian source with parameter λ and an AWGN
channel with input power constraint P and noise variance σ 2N .
20. Shannon limit for a nonuniform DMS over different channels: Find the Shannon
limit under the Hamming distortion criterion for each of the systems of Exam-
ples 6.39–6.41, where the source is a binary nonuniform DMS with PZ (0) = p,
where 0 ≤ p ≤ 1.
Appendix A
Overview on Suprema and Limits

We herein review basic results on suprema and limits which are useful for the devel-
opment of information theoretic coding theorems; they can be found in standard real
analysis texts (e.g., see [262, 398]).

A.1 Supremum and Maximum

Throughout, we work on subsets of R, the set of real numbers.

Definition A.1 (Upper bound of a set) A real number u is called an upper bound of
a non-empty subset A of R if every element of A is less than or equal to u; we say
that A is bounded above. Symbolically, the definition becomes

A ⊂ R is bounded above ⇐⇒ (∃ u ∈ R) such that (∀ a ∈ A), a ≤ u.

Definition A.2 (Least upper bound or supremum) Suppose A is a non-empty subset


of R. Then, we say that a real number s is a least upper bound or supremum of A if s is an upper bound of the set A and if s ≤ s′ for each upper bound s′ of A. In this case, we write s = sup A; other notations are s = sup_{x∈A} x and s = sup{x : x ∈ A}.

Completeness Axiom: (Least upper bound property) Let A be a non-empty subset


of R that is bounded above. Then A has a least upper bound.
It follows directly that if a non-empty set in R has a supremum, then this
supremum is unique. Furthermore, note that the empty set (∅) and any set not bounded
above do not admit a supremum in R. However, when working in the set of extended
real numbers given by R ∪ {−∞, ∞}, we can define the supremum of the empty set
as −∞ and that of a set not bounded above as ∞. These extended definitions will
be adopted in the text.
We now distinguish between two situations: (i) the supremum of a set A belongs
to A, and (ii) the supremum of a set A does not belong to A. It is quite easy to create
examples for both situations. A quick example for (i) involves the set (0, 1], while


the set (0, 1) can be used for (ii). In both examples, the supremum is equal to 1;
however, in the former case, the supremum belongs to the set, while in the latter case
it does not. When a set contains its supremum, we call the supremum the maximum
of the set.

Definition A.3 (Maximum) If sup A ∈ A, then sup A is also called the maximum of
A and is denoted by max A. However, if sup A ∈
/ A, then we say that the maximum
of A does not exist.

Property A.4 (Properties of the supremum)

1. The supremum of any set in R ∪ {−∞, ∞} always exists.


2. (∀ a ∈ A) a ≤ sup A.
3. If −∞ < sup A < ∞, then (∀ ε > 0)(∃ a0 ∈ A) a0 > sup A − ε.
(The existence of a0 ∈ (sup A − ε, sup A] for any ε > 0 under the condition of
| sup A| < ∞ is called the approximation property for the supremum.)
4. If sup A = ∞, then (∀ L ∈ R)(∃ B0 ∈ A) B0 > L.
5. If sup A = −∞, then A is empty.

Observation A.5 (Supremum of a set and channel coding theorems) In informa-


tion theory, a typical channel coding theorem establishes that a (finite) real number
α is the supremum of a set A. Thus, to prove such a theorem, one must show that α
satisfies both Properties 3 and 2 above, i.e.,

(∀ ε > 0)(∃ a0 ∈ A) a0 > α − ε (A.1.1)

and
(∀ a ∈ A) a ≤ α, (A.1.2)

where (A.1.1) and (A.1.2) are called the achievability (or forward) part and the
converse part, respectively, of the theorem. Specifically, (A.1.2) states that α is an
upper bound of A, and (A.1.1) states that no number less than α can be an upper
bound for A.

Property A.6 (Properties of the maximum)


1. (∀ a ∈ A) a ≤ max A, if max A exists in R ∪ {−∞, ∞}.
2. max A ∈ A.

From the above property, in order to obtain α = max A, one needs to show that
α satisfies both
(∀ a ∈ A) a ≤ α and α ∈ A.
A.2 Infimum and Minimum

The concepts of infimum and minimum are dual to those of supremum and maximum.

Definition A.7 (Lower bound of a set) A real number ℓ is called a lower bound of a non-empty subset A of R if every element of A is greater than or equal to ℓ; we say that A is bounded below. Symbolically, the definition becomes

    A ⊂ R is bounded below ⇐⇒ (∃ ℓ ∈ R) such that (∀ a ∈ A) a ≥ ℓ.

Definition A.8 (Greatest lower bound or infimum) Suppose A is a non-empty subset of R. Then, we say that a real number ℓ is a greatest lower bound or infimum of A if ℓ is a lower bound of A and if ℓ ≥ ℓ′ for each lower bound ℓ′ of A. In this case, we write ℓ = inf A; other notations are ℓ = inf_{x∈A} x and ℓ = inf{x : x ∈ A}.

Completeness Axiom: (Greatest lower bound property) Let A be a non-empty


subset of R that is bounded below. Then, A has a greatest lower bound.
As for the case of the supremum, it directly follows that if a non-empty set in R
has an infimum, then this infimum is unique. Furthermore, working in the set of
extended real numbers, the infimum of the empty set is defined as ∞ and that of a
set not bounded below as −∞.

Definition A.9 (Minimum) If inf A ∈ A, then inf A is also called the minimum of
A and is denoted by min A. However, if inf A ∈
/ A, we say that the minimum of A
does not exist.

Property A.10 (Properties of the infimum)


1. The infimum of any set in R ∪ {−∞, ∞} always exists.
2. (∀ a ∈ A) a ≥ inf A.
3. If ∞ > inf A > −∞, then (∀ ε > 0)(∃ a0 ∈ A) a0 < inf A + ε.
(The existence of a0 ∈ [inf A, inf A + ε) for any ε > 0 under the assumption of
| inf A| < ∞ is called the approximation property for the infimum.)
4. If inf A = −∞, then (∀ L ∈ R)(∃ B0 ∈ A) B0 < L.
5. If inf A = ∞, then A is empty.

Observation A.11 (Infimum of a set and source coding theorems) Analogously


to Observation A.5, a typical source coding theorem in information theory establishes
that a (finite) real number α is the infimum of a set A. Thus, to prove such a theorem,
one must show that α satisfies both Properties 3 and 2 above, i.e.,

(∀ ε > 0)(∃ a0 ∈ A) a0 < α + ε (A.2.1)

and
(∀ a ∈ A) a ≥ α. (A.2.2)
Here, (A.2.1) is called the achievability or forward part of the coding theorem; it
specifies that no number greater than α can be a lower bound for A. Also, (A.2.2) is
called the converse part of the theorem; it states that α is a lower bound of A.

Property A.12 (Properties of the minimum)


1. (∀ a ∈ A) a ≥ min A, if min A exists in R ∪ {−∞, ∞}.
2. min A ∈ A.

A.3 Boundedness and Suprema Operations

Definition A.13 (Boundedness) A subset A of R is said to be bounded if it is both


bounded above and bounded below; otherwise, it is called unbounded.

Lemma A.14 (Condition for boundedness) A subset A of R is bounded iff (∃ k ∈ R)


such that (∀ a ∈ A) |a| ≤ k.

Lemma A.15 (Monotone property) Suppose that A and B are non-empty subsets of
R such that A ⊂ B. Then
1. sup A ≤ sup B.
2. inf A ≥ inf B.

Lemma A.16 (Supremum for set operations) Define the “addition” of two sets A
and B as

A + B := {c ∈ R : c = a + b for some a ∈ A and b ∈ B}.

Define the “scalar multiplication” of a set A by a scalar k ∈ R as

k · A := {c ∈ R : c = k · a for some a ∈ A}.

Finally, define the “negation” of a set A as

−A := {c ∈ R : c = −a for some a ∈ A}.

Then, the following hold:


1. If A and B are both bounded above, then A + B is also bounded above and
sup(A + B) = sup A + sup B.
2. If 0 < k < ∞ and A is bounded above, then k · A is also bounded above and
sup(k · A) = k · sup A.
3. sup A = − inf(−A) and inf A = − sup(−A).
Property 1 does not hold for the “product” of two sets, where the “product” of
sets A and B is defined as

A · B := {c ∈ R : c = ab for some a ∈ A and b ∈ B}.

In this case, both of the following situations can occur:

    sup(A · B) > (sup A) · (sup B)
    sup(A · B) = (sup A) · (sup B).

Lemma A.17 (Supremum/infimum for monotone functions)


1. If f : R → R is a nondecreasing function, then

sup{x ∈ R : f (x) < ε} = inf{x ∈ R : f (x) ≥ ε}

and
sup{x ∈ R : f (x) ≤ ε} = inf{x ∈ R : f (x) > ε}.

2. If f : R → R is a nonincreasing function, then

sup{x ∈ R : f (x) > ε} = inf{x ∈ R : f (x) ≤ ε}

and
sup{x ∈ R : f (x) ≥ ε} = inf{x ∈ R : f (x) < ε}.

The above lemma is illustrated in Fig. A.1.

A.4 Sequences and Their Limits

Let N denote the set of “natural numbers” (positive integers) 1, 2, 3, . . .. A sequence


drawn from a real-valued function is denoted by

f : N → R.

In other words, f (n) is a real number for each n = 1, 2, 3, . . .. It is usual to write


f (n) = an , and we often indicate the sequence by any one of these notations

{a1 , a2 , a3 , . . . , an , . . .} or {an }∞
n=1 .

One important question that arises with a sequence is what happens when n gets
large. To be precise, we want to know that when n is large enough, whether or not
every an is close to some fixed number L (which is the limit of an ).
Fig. A.1 Illustration of Lemma A.17: plots of a nondecreasing and a nonincreasing function f(x), marking sup{x : f(x) < ε} = inf{x : f(x) ≥ ε}, sup{x : f(x) ≤ ε} = inf{x : f(x) > ε}, sup{x : f(x) > ε} = inf{x : f(x) ≤ ε}, and sup{x : f(x) ≥ ε} = inf{x : f(x) < ε} (figure omitted)

Definition A.18 (Limit) The limit of {an }∞


n=1 is the real number L satisfying: (∀ ε >
0)(∃ N ) such that (∀ n > N )
|an − L| < ε.

In this case, we write L = limn→∞ an . If no such L satisfies the above statement, we


say that the limit of {an }∞
n=1 does not exist.

Property A.19 If {an }∞ ∞


n=1 and {bn }n=1 both have a limit in R, then the following
hold:

1. limn→∞ (an + bn ) = limn→∞ an + limn→∞ bn .


2. limn→∞ (α · an ) = α · limn→∞ an .
3. limn→∞ (an bn ) = (limn→∞ an )(limn→∞ bn ).

Note that in the above definition, −∞ and ∞ cannot be a legitimate limit for any
sequence. In fact, if (∀ L)(∃ N ) such that (∀ n > N ) an > L, then we say that an
diverges to ∞ and write an → ∞. A similar argument applies to an diverging to −∞.


For convenience, we will work in the set of extended real numbers and thus state that
a sequence {an }∞
n=1 that diverges to either ∞ or −∞ has a limit in R ∪ {−∞, ∞}.

Lemma A.20 (Convergence of monotone sequences) If {an }∞ n=1 is nondecreasing in


n, then limn→∞ an exists in R ∪ {−∞, ∞}. If {an }∞ n=1 is also bounded from above—
i.e., an ≤ L ∀n for some L in R—then limn→∞ an exists in R.
Likewise, if {an }∞
n=1 is nonincreasing in n, then lim n→∞ an exists in R∪{−∞, ∞}.
If {an }∞
n=1 is also bounded from below—i.e., an ≥ L ∀n for some L in R—then
limn→∞ an exists in R.

As stated above, the limit of a sequence may not exist. For example, if a_n = (−1)^n, then a_n is close to either −1 or 1 for n large. Hence, more general definitions that can describe the limiting behavior of such a sequence are required.

Definition A.21 (limsup and liminf) The limit supremum of {a_n}_{n=1}^∞ is the extended real number in R ∪ {−∞, ∞} defined by

lim sup_{n→∞} a_n := lim_{n→∞} ( sup_{k≥n} a_k ),

and the limit infimum of {a_n}_{n=1}^∞ is the extended real number defined by

lim inf_{n→∞} a_n := lim_{n→∞} ( inf_{k≥n} a_k ).

Some also use an overlined “lim” and an underlined “lim” to denote limsup and liminf, respectively.

Note that the limit supremum and the limit infimum of a sequence are always
defined in R ∪ {−∞, ∞}, since the sequences supk≥n ak = sup{ak : k ≥ n} and
inf k≥n ak = inf{ak : k ≥ n} are monotone in n (cf. Lemma A.20). An immediate
result follows from the definitions of limsup and liminf.

Lemma A.22 (Limit) For a sequence {a_n}_{n=1}^∞,

lim_{n→∞} a_n = L ⇐⇒ lim sup_{n→∞} a_n = lim inf_{n→∞} a_n = L.

Some properties regarding the limsup and liminf of sequences (which are parallel
to Properties A.4 and A.10) are listed below.

Property A.23 (Properties of the limit supremum)


1. The limit supremum always exists in R ∪ {−∞, ∞}.
2. If | lim supm→∞ am | < ∞, then (∀ ε > 0)(∃ N ) such that (∀ n > N ) an <
lim supm→∞ am + ε. (Note that this holds for every n > N .)

3. If | lim supm→∞ am | < ∞, then (∀ ε > 0 and integer K )(∃ N > K ) such that
a N > lim supm→∞ am − ε. (Note that this holds only for one N , which is larger
than K .)

Property A.24 (Properties of the limit infimum)


1. The limit infimum always exists in R ∪ {−∞, ∞}.
2. If | lim inf m→∞ am | < ∞, then (∀ ε > 0 and K )(∃ N > K ) such that a N <
lim inf m→∞ am + ε. (Note that this holds only for one N , which is larger than
K .)
3. If | lim inf m→∞ am | < ∞, then (∀ ε > 0)(∃ N ) such that (∀ n > N ) an >
lim inf m→∞ am − ε. (Note that this holds for every n > N .)

The last two items in Properties A.23 and A.24 can be stated using the terminology
of sufficiently large and infinitely often, which is often adopted in information theory.

Definition A.25 (Sufficiently large) We say that a property holds for a sequence
{an }∞
n=1 almost always or for all sufficiently large n if the property holds for every
n > N for some N .

Definition A.26 (Infinitely often) We say that a property holds for a sequence {an }∞
n=1
infinitely often or for infinitely many n if for every K , the property holds for one
(specific) N with N > K .

Then, Properties 2 and 3 of Property A.23 can be, respectively, rephrased as: if |lim sup_{m→∞} a_m| < ∞, then (∀ ε > 0)

a_n < lim sup_{m→∞} a_m + ε for all sufficiently large n

and

a_n > lim sup_{m→∞} a_m − ε for infinitely many n.

Similarly, Properties 2 and 3 of Property A.24 become: if |lim inf_{m→∞} a_m| < ∞, then (∀ ε > 0)

a_n < lim inf_{m→∞} a_m + ε for infinitely many n

and

a_n > lim inf_{m→∞} a_m − ε for all sufficiently large n.

Lemma A.27
1. lim inf_{n→∞} a_n ≤ lim sup_{n→∞} a_n.
2. If a_n ≤ b_n for all sufficiently large n, then

lim inf_{n→∞} a_n ≤ lim inf_{n→∞} b_n and lim sup_{n→∞} a_n ≤ lim sup_{n→∞} b_n.

3. lim sup_{n→∞} a_n < r =⇒ a_n < r for all sufficiently large n.
4. lim sup_{n→∞} a_n > r =⇒ a_n > r for infinitely many n.
5.
lim inf_{n→∞} a_n + lim inf_{n→∞} b_n ≤ lim inf_{n→∞} (a_n + b_n)
                                      ≤ lim sup_{n→∞} a_n + lim inf_{n→∞} b_n
                                      ≤ lim sup_{n→∞} (a_n + b_n)
                                      ≤ lim sup_{n→∞} a_n + lim sup_{n→∞} b_n.
6. If lim_{n→∞} a_n exists, then

lim inf_{n→∞} (a_n + b_n) = lim_{n→∞} a_n + lim inf_{n→∞} b_n

and

lim sup_{n→∞} (a_n + b_n) = lim_{n→∞} a_n + lim sup_{n→∞} b_n.

Finally, one can also interpret the limit supremum and limit infimum in terms of the concept of clustering points. A clustering point is a point that a sequence {a_n}_{n=1}^∞ approaches (i.e., enters a ball of arbitrarily small radius centered at that point) infinitely many times. For example, if a_n = sin(nπ/2), then {a_n}_{n=1}^∞ = {1, 0, −1, 0, 1, 0, −1, 0, . . .}. Hence, there are three clustering points in this sequence, namely −1, 0 and 1. Then, the limit supremum of the sequence is nothing but its largest clustering point, and its limit infimum is exactly its smallest clustering point. Specifically, lim sup_{n→∞} a_n = 1 and lim inf_{n→∞} a_n = −1. This approach can sometimes be useful to determine the limsup and liminf quantities.
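For readers who wish to experiment numerically, the following is a minimal Python sketch (not part of the original text; the sequence a_n = sin(nπ/2) and the truncation length N are illustrative choices) that approximates the limsup and liminf by computing the tail suprema sup_{k≥n} a_k and tail infima inf_{k≥n} a_k of a finite truncation of the sequence.

import math

def tail_sup_inf(a, n):
    # Supremum and infimum of the tail {a_k : k >= n} of the finite truncation a.
    tail = a[n - 1:]          # the list is 0-indexed; a[n-1] corresponds to a_n
    return max(tail), min(tail)

N = 2000                      # truncation length (illustrative choice)
a = [math.sin(k * math.pi / 2) for k in range(1, N + 1)]

for n in (1, 10, 100, 1000):
    s, i = tail_sup_inf(a, n)
    print(n, round(s, 6), round(i, 6))
# The tail suprema stay near +1 and the tail infima near -1 for every n,
# consistent with lim sup a_n = 1 and lim inf a_n = -1 (the largest and
# smallest clustering points of the sequence).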

A.5 Equivalence

We close this appendix by providing some equivalent statements that are often used
to simplify proofs. For example, instead of directly showing that quantity x is less
than or equal to quantity y, one can take an arbitrary constant ε > 0 and prove that
x < y + ε. Since y + ε is a larger quantity than y, in some cases it might be easier to
show x < y + ε than proving x ≤ y. By the next theorem, any proof that concludes
that “x < y + ε for all ε > 0” immediately gives the desired result of x ≤ y.
Theorem A.28 For any x, y and a in R,
1. x < y + ε for all ε > 0 iff x ≤ y;
2. x < y − ε for some ε > 0 iff x < y;
3. x > y − ε for all ε > 0 iff x ≥ y;
4. x > y + ε for some ε > 0 iff x > y;
5. |a| < ε for all ε > 0 iff a = 0.
Appendix B
Overview in Probability and Random Processes

This appendix presents a quick overview of important concepts from probability


theory and the theory of random processes. We assume that the reader has a basic
knowledge of these subjects; for a thorough study, comprehensive texts such as [30,
47, 104, 162, 170] should be consulted. We close the appendix with a brief discussion
of Jensen’s inequality and the Lagrange multipliers optimization technique [46, 56].

B.1 Probability Space

Definition B.1 (σ-Fields) Let F be a collection of subsets of a non-empty set Ω. Then, F is called a σ-field (or σ-algebra) if the following hold:
1. Ω ∈ F.
2. F is closed under complementation: If A ∈ F, then A^c ∈ F, where A^c = {ω ∈ Ω : ω ∉ A}.
3. F is closed under countable unions: If A_i ∈ F for i = 1, 2, 3, . . ., then ∪_{i=1}^∞ A_i ∈ F.

It directly follows that the empty set ∅ is also an element of F (as Ω^c = ∅) and that F is closed under countable intersection since

∩_{i=1}^∞ A_i = ( ∪_{i=1}^∞ A_i^c )^c.

The largest σ-field of subsets of a given set Ω is the collection of all subsets of Ω (i.e., its power set), while the smallest σ-field is given by {Ω, ∅}. Also, if A is a proper (strict) non-empty subset of Ω, then the smallest σ-field containing A is given by {Ω, ∅, A, A^c}.
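As a small illustration (a sketch added here, not part of the original text; the finite set Ω = {0, 1, 2, 3} and the subset A = {1, 2} are arbitrary choices), the following Python snippet computes the smallest σ-field containing a given collection of subsets of a finite Ω by repeatedly closing under complementation and unions; for a single proper non-empty subset A it returns exactly {Ω, ∅, A, A^c}.

def generated_sigma_field(omega, generators):
    # Smallest sigma-field of subsets of the finite set `omega` that contains
    # every set in `generators` (all sets represented as frozensets).
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = set(family)
        new |= {omega - a for a in family}              # close under complementation
        new |= {a | b for a in family for b in family}  # close under (finite) unions
        if new == family:                               # fixed point: closure reached
            return family
        family = new

omega = {0, 1, 2, 3}
A = {1, 2}
sigma = generated_sigma_field(omega, [A])
print(sorted(map(sorted, sigma)))
# prints the four sets corresponding to {∅, A, A^c, Ω}, as stated above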


Definition B.2 (Probability space) A probability space is a triple (Ω, F, P), where Ω is a given set called the sample space containing all possible outcomes (usually observed from an experiment), F is a σ-field of subsets of Ω and P is a probability measure P : F → [0, 1] on the σ-field satisfying the following:
1. 0 ≤ P(A) ≤ 1 for all A ∈ F.
2. P(Ω) = 1.
3. Countable additivity: If A_1, A_2, . . . is a sequence of disjoint sets (i.e., A_i ∩ A_j = ∅ for all i ≠ j) in F, then

P( ∪_{k=1}^∞ A_k ) = Σ_{k=1}^∞ P(A_k).

It directly follows from Properties 1–3 of the above definition that P(∅) = 0. Usually, the σ-field F is called the event space and its elements (which are subsets of Ω satisfying the properties of Definition B.1) are called events.

B.2 Random Variables and Random Processes

A random variable X defined over the probability space (Ω, F, P) is a real-valued function X : Ω → R that is measurable (or F-measurable), i.e., satisfying the property that

X^{−1}((−∞, t]) := {ω ∈ Ω : X(ω) ≤ t} ∈ F

for each real t.1


The Borel σ-field of R, denoted by B(R), is the smallest σ-field of subsets of R containing all open intervals in R. The elements of B(R) are called Borel sets. For any random variable X, we use P_X to denote the probability distribution on B(R) induced by X, given by

P_X(B) := Pr[X ∈ B] = P({ω ∈ Ω : X(ω) ∈ B}), B ∈ B(R).

Note that the quantities P_X(B), B ∈ B(R), fully characterize the random variable X as they determine the probabilities of all events that concern X.

1 One may question why bother defining random variables based on some abstract probability space. One may continue that “a random variable X can simply be defined based on its probability distribution,” which is indeed true (cf. Observation B.3). A perhaps easier way to understand the abstract definition of a random variable is that the underlying probability space (Ω, F, P) on which it is defined is what truly occurs internally, but it is possibly non-observable. In order to infer which of the non-observable ω occurs, an experiment is performed resulting in an observable x that is a function of ω. Such an experiment yields the random variable X whose probability is defined over the probability space (Ω, F, P).

Observation B.3 (Distribution function versus probability space) In many appli-


cations, we are perhaps more interested in the distribution functions of random
variables than the underlying probability space on which they are defined. It can
be proved [47, Theorem 14.1] that given a real-valued nonnegative function F(·)
that is nondecreasing and right-continuous and satisfies lim x↓−∞ F(x) = 0 and
lim x↑∞ F(x) = 1, there exist a random variable X and an underlying probability
space such that the cumulative distribution function (cdf) of the random variable,
Pr[X ≤ x] = PX ((−∞, x]), defined over the probability space is equal to F(·).
This result releases us from the burden of referring to a probability space before
defining the random variable. In other words, we can define a random variable X
directly by its cdf, FX (x) = Pr[X ≤ x], without bothering to refer to its underlying
probability space. Nevertheless, it is important to keep in mind that, formally, random
variables are defined over underlying probability spaces.

The n-tuple of random variables X^n := (X_1, X_2, . . . , X_n) is called a random vector of length n. In other words, given a probability space (Ω, F, P), X^n is a measurable function from Ω to R^n, where R^n denotes the n-fold Cartesian product of R with itself: R^n := R × R × · · · × R. Also, the probability distribution P_{X^n} induced by X^n is given by

P_{X^n}(B) = P({ω ∈ Ω : X^n(ω) ∈ B}), B ∈ B(R^n),

where B(R^n) is the Borel σ-field of R^n; i.e., the smallest σ-field of subsets of R^n containing all open sets in R^n.
The joint cdf F_{X^n} of X^n is the function from R^n to [0, 1] given by

F_{X^n}(x^n) = P_{X^n}((−∞, x_i], i = 1, . . . , n) = P({ω ∈ Ω : X_i(ω) ≤ x_i, i = 1, . . . , n})

for x^n = (x_1, . . . , x_n) ∈ R^n.
A random process (or random source) is a collection of random variables that
arise from the same probability space. It can be mathematically represented by the
collection
{X t , t ∈ I },

where X t denotes the tth random variable in the process, and the index t runs over
an index set I which is arbitrary. The index set I can be uncountably infinite (e.g.,
I = R), in which case we are dealing with a continuous-time process. We will,
however, exclude such a case in this appendix for the sake of simplicity.
In this text, we focus mostly on discrete-time sources; i.e., sources with the count-
able index set I = {1, 2, . . .}. Each such source is denoted by

X := {X_n}_{n=1}^∞ = {X_1, X_2, . . .},

as an infinite sequence of random variables, where all the random variables take on
values from a common generic alphabet X ⊆ R. The elements in X are usually

called letters (or symbols or values). When X is a finite set, the letters of X can be
conveniently expressed via the elements of any appropriately chosen finite set (i.e.,
the letters of X need not be real numbers).2
The source X is completely characterized by the sequence of joint cdf’s {F_{X^n}}_{n=1}^∞.
When the alphabet X is finite, the source can be equivalently described by the
sequence of joint probability mass functions (pmf’s):

PX n (a n ) = Pr[X 1 = a1 , X 2 = a2 , . . . , X n = an ]

for all a n = (a1 , a2 , . . . , an ) ∈ X n , n = 1, 2, . . ..

B.3 Statistical Properties of Random Sources

For a random process X = {X_n}_{n=1}^∞ with alphabet X (i.e., X ⊆ R is the range of each X_i) defined over the probability space (Ω, F, P), consider X^∞, the set of all
sequences x := (x1 , x2 , x3 , . . .) of real numbers in X . An event E in F X , the smallest
σ-field generated by all open sets of X ∞ (i.e., the Borel σ-field of X ∞ ), is said to be
T-invariant with respect to the left-shift (or shift transformation) T : X ∞ → X ∞ if

TE ⊆ E,

where

TE := {Tx : x ∈ E} and Tx := T(x1 , x2 , x3 , . . .) = (x2 , x3 , . . .).

In other words, T is equivalent to “chopping the first component.” For example,


applying T onto an event E 1 defined below:

E 1 := {(x1 = 1, x2 = 1, x3 = 1, x4 = 1, . . .), (x1 = 0, x2 = 1, x3 = 1, x4 = 1, . . .),


(x1 = 0, x2 = 0, x3 = 1, x4 = 1, . . .)} , (B.3.1)

yields

2 More formally, the definition of a random variable X can be generalized by allowing it to take values that are not real numbers: a random variable over the probability space (Ω, F, P) is a function X : Ω → X satisfying the property that for every F ∈ F_X,

X^{−1}(F) := {ω ∈ Ω : X(ω) ∈ F} ∈ F,

where the alphabet X is a general set and F_X is a σ-field of subsets of X [159, 349]. Note that this definition allows X to be an arbitrary set (including being an arbitrary finite set). Furthermore, if we set X = R, then we revert to the earlier (standard) definition of a random variable.

TE 1 = {(x1 = 1, x2 = 1, x3 = 1, . . .), (x1 = 1, x2 = 1, x3 = 1 . . .),


(x1 = 0, x2 = 1, x3 = 1, . . .)}
= {(x1 = 1, x2 = 1, x3 = 1, . . .), (x1 = 0, x2 = 1, x3 = 1, . . .)} .

We then have TE 1 ⊆ E 1 , and hence E 1 is T-invariant.


It can be proved3 that if TE ⊆ E, then T2 E ⊆ TE. By induction, we can further
obtain
· · · ⊆ T3 E ⊆ T2 E ⊆ TE ⊆ E.

Thus, if an element say (1, 0, 0, 1, 0, 0, . . .) is in a T-invariant set E, then all its left-
shift counterparts (i.e., (0, 0, 1, 0, 0, 1 . . .) and (0, 1, 0, 0, 1, 0, . . .)) should be con-
tained in E. As a result, for a T-invariant set E, an element and all its left-shift coun-
terparts are either all in E or all outside E, but cannot be partially inside E. Hence, a
“T-invariant group” such as one containing (1, 0, 0, 1, 0, 0, . . .), (0, 0, 1, 0, 0, 1 . . .)
and (0, 1, 0, 0, 1, 0, . . .) should be treated as an indecomposable group in T-invariant
sets.
Although we are in particular interested in these “T-invariant indecomposable
groups” (especially when defining an ergodic random process), it is possible that
some single “transient” element, such as (0, 0, 1, 1, . . .) in (B.3.1), is included in
a T-invariant set, and will be excluded after applying left-shift operation T. This,
however, can be resolved by introducing the inverse operation T−1 . Note that T is
a many-to-one mapping, so its inverse operation does not exist in general. Similar
to taking the closure of an open set, the definition adopted below [349, p. 3] allows
us to “enlarge” the T-invariant set such that all right-shift counterparts of the single
“transient” element are included:

T^{−1}E := {x ∈ X^∞ : Tx ∈ E}.

We then notice from the above definition that if

T−1 E = E, (B.3.2)

then4
TE = T(T−1 E) = E,

and hence E is constituted only by the T-invariant groups because

· · · = T−2 E = T−1 E = E = TE = T2 E = · · · .

3 If A ⊆ B, then TA ⊆ TB. Thus T^2 E ⊆ TE holds whenever TE ⊆ E.
4 The proof of T(T−1 E) = E is as follows. If y ∈ T(T−1 E) = T({x ∈ X ∞ : Tx ∈ E}), then there
must exist an element x ∈ {x ∈ X ∞ : Tx ∈ E} such that y = Tx. Since Tx ∈ E, we have y ∈ E
and T(T−1 E) ⊆ E.
On the contrary, if y ∈ E, all x’s satisfying Tx = y belong to T−1 E. Thus, y ∈ T(T−1 E),
which implies E ⊆ T(T−1 E).

The sets that satisfy (B.3.2) are sometimes referred to as ergodic sets because as
time goes by (the left-shift operator T can be regarded as a shift to a future time), the
set always stays in the state that it has been before. A quick example of an ergodic
set for X = {0, 1} is one that consists of all binary sequences that contain finitely
many 0’s.5
We now classify several useful statistical properties of random process X =
{X 1 , X 2 , . . .}.

• Memoryless process: The process or source X is said to be memoryless if its random variables are independent and identically distributed (i.i.d.). Here by independence, we mean that any finite sequence X_{i_1}, X_{i_2}, . . . , X_{i_n} of random variables satisfies

Pr[X_{i_1} = x_1, X_{i_2} = x_2, . . . , X_{i_n} = x_n] = ∏_{l=1}^n Pr[X_{i_l} = x_l]

for all xl ∈ X , l = 1, . . . , n; we also say that these random variables are mutually
independent. Furthermore, the notion of identical distribution means that

Pr[X i = x] = Pr[X 1 = x]

for any x ∈ X and i = 1, 2, . . .; i.e., all the source’s random variables are governed
by the same marginal distribution.
• Stationary process: The process X is said to be stationary (or strictly stationary) if the probability of every sequence or event is unchanged by a left (time) shift, or equivalently, if for any j = 1, 2, . . ., the joint distribution of (X_1, X_2, . . . , X_n) satisfies

Pr[X 1 = x1 , X 2 = x2 , . . . , X n = xn ]
= Pr[X j+1 = x1 , X j+2 = x2 , . . . , X j+n = xn ]

for all xl ∈ X , l = 1, . . . , n.

It is direct to verify that a memoryless source is stationary. Also, for a


stationary source, its random variables are identically distributed.
• Ergodic process: The process X is said to be ergodic if any ergodic set (satisfy-
ing (B.3.2)) in F X has probability either 1 or 0. This definition is not very intuitive,
but some interpretations and examples may shed some light.

5 As the textbook only deals with one-sided random processes, the discussion on T-
invariance only focuses on sets of one-sided sequences. When a two-sided random process
. . . , X −2 , X −1 , X 0 , X 1 , X 2 , . . . is considered, the left-shift operation T of a two-sided sequence
actually has a unique inverse. Hence, TE ⊆ E implies TE = E. Also, TE = E iff T−1 E = E.
Ergodicity for two-sided sequences can therefore be directly defined using TE = E.

Observe that the definition has nothing to do with stationarity. It simply states
that events that are unaffected by time-shifting (both left- and right-shifting) must
have probability either zero or one.
Ergodicity implies that all convergent sample averages6 converge to a con-
stant (but not necessarily to the ensemble average or statistical expectation), and
stationarity assures that the time average converges to a random variable; hence,
it is reasonable to expect that they jointly imply the ultimate time average equals
the ensemble average. This is validated by the well-known ergodic theorem by
Birkhoff and Khinchin.

Theorem B.4 (Pointwise ergodic theorem) Consider a discrete-time stationary random process X = {X_n}_{n=1}^∞. For a real-valued function f(·) on R with finite mean (i.e., |E[f(X_n)]| < ∞), there exists a random variable Y such that

lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = Y with probability 1.

If, in addition to stationarity, the process is also ergodic, then

lim_{n→∞} (1/n) Σ_{k=1}^n f(X_k) = E[f(X_1)] with probability 1.


Example B.5 Consider the process {X_i}_{i=1}^∞ consisting of a family of i.i.d. binary random variables (obviously, it is stationary and ergodic). Define the function f(·) by f(0) = 0 and f(1) = 1. Hence,7

E[f(X_n)] = P_{X_n}(0) f(0) + P_{X_n}(1) f(1) = P_{X_n}(1)

is finite. By the pointwise ergodic theorem, we have

lim_{n→∞} (f(X_1) + f(X_2) + · · · + f(X_n))/n = lim_{n→∞} (X_1 + X_2 + · · · + X_n)/n = P_X(1).

As seen in the above example, one of the important consequences of the pointwise ergodic theorem is that the time average can ultimately replace the statistical average, which is a useful result. Hence, under stationarity and ergodicity, an observer who sees
6 Two alternative names for sample average are time average and Cesàro mean. In this book, these names will be used interchangeably.
7 As specified in Sect. B.2, P_{X_n}(0) = Pr[X_n = 0]. These two representations will be used alternatively throughout the book.

X_1^{30} = 154326543334225632425644234443

from the experiment of rolling a die, can draw the conclusion that the true distribution of the die rolls can be well approximated by

Pr{X_i = 1} ≈ 1/30,  Pr{X_i = 2} ≈ 6/30,  Pr{X_i = 3} ≈ 7/30,
Pr{X_i = 4} ≈ 9/30,  Pr{X_i = 5} ≈ 4/30,  Pr{X_i = 6} ≈ 3/30.

Such a result is also known as the law of large numbers. The relation between ergodicity and the law of large numbers will be further explored in Sect. B.5.
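The following minimal simulation sketch (not from the original text; it assumes a fair die, and the sample sizes and seed are illustrative) shows the empirical relative frequencies of an i.i.d. source approaching the true distribution as the number of observations grows, which is the numerical content of the above remark.

import random
from collections import Counter

random.seed(0)

def empirical_frequencies(n):
    # Relative frequencies of the faces 1..6 over n independent fair-die rolls.
    rolls = [random.randint(1, 6) for _ in range(n)]
    counts = Counter(rolls)
    return [counts[face] / n for face in range(1, 7)]

for n in (30, 1000, 100000):
    print(n, [round(f, 3) for f in empirical_frequencies(n)])
# As n grows, the printed relative frequencies approach the true probabilities
# 1/6 ~ 0.167 for every face: the time (sample) average replaces the statistical
# average, as the pointwise ergodic theorem / law of large numbers predicts.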

We close the discussion on ergodicity by remarking that in communications


theory, one may assume that the source is stationary or the source is stationary
ergodic. But it is not common to see the assumption of the source being ergodic
but nonstationary. This is perhaps because an ergodic but nonstationary source
does not in general facilitate the analytical study of communications problems.
This, to some extent, explains why the ergodicity assumption usually comes after the stationarity assumption. A specific example is the pointwise ergodic theorem, where the random process considered is presumed to be stationary.

• Markov chain for three random variables: Three random variables X , Y , and Z
are said to form a Markov chain if

PX,Y,Z (x, y, z) = PX (x) · PY |X (y|x) · PZ |Y (z|y); (B.3.3)

i.e., PZ |X,Y (z|x, y) = PZ |Y (z|y). This is usually denoted by X → Y → Z .

X → Y → Z is sometimes read as “X and Z are conditionally independent given


Y ” because it can be shown that (B.3.3) is equivalent to

PX,Z |Y (x, z|y) = PX |Y (x|y) · PZ |Y (z|y).

Therefore, X → Y → Z is equivalent to Z → Y → X . Accordingly, the


Markovian notation is sometimes expressed as X ↔ Y ↔ Z .
• kth order Markov sources: The sequence of random variables {X n }∞ n=1 = X 1 , X 2 ,
X 3 , . . . with common finite-alphabet X is said to form a kth order Markov chain
(or kth order Markov source or process) if for all n > k, xi ∈ X , i = 1, . . . , n,

Pr[X n = xn |X n−1 = xn−1 , . . . , X 1 = x1 ]


= Pr[X n = xn |X n−1 = xn−1 , . . . , X n−k = xn−k ]. (B.3.4)

Each x_{n−k}^{n−1} := (x_{n−k}, x_{n−k+1}, . . . , x_{n−1}) ∈ X^k is called the state of the Markov chain at time n.

When k = 1, then {X_n}_{n=1}^∞ is called a first-order Markov source (or just a Markov source or chain). In light of (B.3.4), for any n > 1, the random variables X_1, X_2, . . ., X_n directly satisfy the conditional independence property

Pr[X_i = x_i | X^{i−1} = x^{i−1}] = Pr[X_i = x_i | X_{i−1} = x_{i−1}] (B.3.5)

for all xi ∈ X , i = 1, . . . , n; this property is denoted as in (B.3.3) by

X1 → X2 → · · · → Xn

for n > 2. The same property applies to any finite number of random variables
from the source ordered in terms of increasing time indices.

We next summarize important concepts and facts about Markov sources (e.g.,
see [137, 162]).
– A kth order Markov chain is irreducible if with some probability, we can go
from any state in X k to another state in a finite number of steps, i.e., for all
x^k, y^k ∈ X^k there exists an integer j ≥ 1 such that

Pr[ X_j^{k+j−1} = x^k | X_1^k = y^k ] > 0.

– A kth order Markov chain is said to be time-invariant or homogeneous, if for


every n > k,

Pr[X n = xn |X n−1 = xn−1 , . . . , X n−k = xn−k ]


= Pr[X k+1 = xk+1 |X k = xk , . . . , X 1 = x1 ].

Therefore, a homogeneous first-order Markov chain can be defined through its transition probability:

[ Pr{X_2 = x_2 | X_1 = x_1} ]_{|X|×|X|},

and its initial state distribution P_{X_1}(x).


– In a first-order Markov chain, the period d(x) of state x ∈ X is defined by

d(x) := gcd {n ∈ {1, 2, 3, . . .} : Pr{X n+1 = x|X 1 = x} > 0} ,

where gcd denotes the greatest common divisor; in other words, if the Markov
chain starts in state x, then the chain cannot return to state x at any time that

is not a multiple of d(x). If Pr{X n+1 = x|X 1 = x} = 0 for all n, we say that
state x has an infinite period and write d(x) = ∞. We also say that state x is
aperiodic if d(x) = 1 and periodic if d(x) > 1. Furthermore, the first-order
Markov chain is called aperiodic if all its states are aperiodic. In other words,
the first-order Markov chain is aperiodic if

gcd {n ∈ {1, 2, 3, . . .} : Pr{X n+1 = x|X 1 = x} > 0} = 1 ∀x ∈ X .

– In an irreducible first-order Markov chain, all states have the same period. Hence,
if one state in such a chain is aperiodic, then the entire Markov chain is aperiodic.
– A distribution π(·) on X is said to be a stationary distribution for a homogeneous first-order Markov chain, if for every y ∈ X,

π(y) = Σ_{x∈X} π(x) Pr{X_2 = y | X_1 = x}

(a numerical sketch for computing π is given at the end of this subsection).

For a finite-alphabet homogeneous first-order Markov chain, π(·) always exists;


furthermore, π(·) is unique if the Markov chain is irreducible. For a finite-
alphabet homogeneous first-order Markov chain that is both irreducible and
aperiodic,
lim_{n→∞} Pr{X_{n+1} = y | X_1 = x} = π(y)

for all states x and y in X . If the initial state distribution is equal to a sta-
tionary distribution, then the homogeneous first-order Markov chain becomes a
stationary process.
– A finite-alphabet stationary Markov source is an ergodic process (and hence sat-
isfies the pointwise ergodic theorem) iff it is irreducible; see [30, p. 371]
and [349, Prop. I.2.9].
The general relations among i.i.d. sources, Markov sources, stationary sources, and ergodic sources are depicted in Fig. B.1.
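To make the stationary-distribution condition above concrete, here is a minimal Python sketch (not part of the original text; the two-state transition matrix is an arbitrary illustrative choice) that computes π for a finite-alphabet homogeneous first-order Markov chain by power iteration, i.e., by repeatedly applying the transition matrix to an initial distribution until convergence.

def stationary_distribution(P, tol=1e-12, max_iter=100000):
    # Approximate the stationary distribution pi of a row-stochastic transition
    # matrix P (a list of rows) by power iteration: pi <- pi P until convergence.
    n = len(P)
    pi = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(new[y] - pi[y]) for y in range(n)) < tol:
            return new
        pi = new
    return pi

# Illustrative two-state chain: Pr{X_2 = 1 | X_1 = 0} = 0.3, Pr{X_2 = 0 | X_1 = 1} = 0.4.
P = [[0.7, 0.3],
     [0.4, 0.6]]
print(stationary_distribution(P))
# ~ [0.5714, 0.4286], which indeed satisfies pi(y) = sum_x pi(x) Pr{X_2=y|X_1=x}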

B.4 Convergence of Sequences of Random Variables

In this section, we will discuss modes in which a random process X_1, X_2, . . . converges to a limiting random variable X. Recall that a random variable is a real-valued measurable function from Ω to R, where Ω is the sample space of the probability space over which the random variable is defined. So the following two expressions will be used interchangeably: X_1(ω), X_2(ω), X_3(ω), . . . ≡ X_1, X_2, X_3, . . ., for ω ∈ Ω. Note that the random variables in a random process are defined over the same probability space (Ω, F, P).

Fig. B.1 General relations of random processes (a diagram depicting how the classes of i.i.d., Markov, stationary, and ergodic sources relate to one another)

Definition B.6 (Convergence modes for random sequences)

1. Pointwise convergence on Ω.8
{X_n}_{n=1}^∞ is said to converge to X pointwise on Ω if

lim_{n→∞} X_n(ω) = X(ω) for all ω ∈ Ω.

This notion of convergence, which is familiar to us from real analysis, is denoted by X_n →^{p.w.} X.

2. Almost sure convergence or convergence with probability 1.
{X_n}_{n=1}^∞ is said to converge to X with probability 1, if

P{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)} = 1.

Almost sure convergence is denoted by X_n →^{a.s.} X; note that it is nothing but a probabilistic version of pointwise convergence.

3. Convergence in probability.
{X_n}_{n=1}^∞ is said to converge to X in probability, if for any ε > 0,

lim_{n→∞} P{ω ∈ Ω : |X_n(ω) − X(ω)| > ε} = lim_{n→∞} Pr{|X_n − X| > ε} = 0.

This mode of convergence is denoted by X_n →^{p} X.

8 Although such mode of convergence is not used in probability theory, we introduce it herein to
contrast it with the almost sure convergence mode (see Example B.7).

4. Convergence in rth mean.
{X_n}_{n=1}^∞ is said to converge to X in rth mean, if

lim_{n→∞} E[|X − X_n|^r] = 0.

This is denoted by X_n →^{L_r} X.

5. Convergence in distribution.
{X_n}_{n=1}^∞ is said to converge to X in distribution, if

lim_{n→∞} F_{X_n}(x) = F_X(x)

for every continuity point of F_X(x), where

F_{X_n}(x) = Pr{X_n ≤ x} and F_X(x) = Pr{X ≤ x}.

We denote this notion of convergence by X_n →^{d} X.

An example that facilitates the understanding of pointwise convergence and almost


sure convergence is as follows.

Example B.7 Consider a probability space (Ω, 2^Ω, P), where Ω = {0, 1, 2, 3}, 2^Ω is the power set of Ω, and P(0) = P(1) = P(2) = 1/3 and P(3) = 0. Define a random variable as

X_n(ω) = ω/n.

Then

Pr{X_n = 0} = Pr{X_n = 1/n} = Pr{X_n = 2/n} = 1/3.

It is clear that for every ω in Ω, X_n(ω) converges to X(ω), where X(ω) = 0 for every ω ∈ Ω; so

X_n →^{p.w.} X.

Now let X̃(ω) = 0 for ω = 0, 1, 2 and X̃(ω) = 1 for ω = 3. Then, both of the following statements are true:

X_n →^{a.s.} X and X_n →^{a.s.} X̃,

since

Pr{ lim_{n→∞} X_n = X̃ } = Σ_{ω=0}^{3} P(ω) · 1{ lim_{n→∞} X_n(ω) = X̃(ω) } = 1,

where 1{·} represents the set indicator function. However, X_n does not converge to X̃ pointwise because

lim_{n→∞} X_n(3) ≠ X̃(3).

In other words, pointwise convergence requires “equality” even for samples without
probability mass; however, these samples are ignored under almost sure convergence.

Observation B.8 (Uniqueness of convergence)

1. If X_n →^{p.w.} X and X_n →^{p.w.} Y, then X = Y pointwise. That is, (∀ ω ∈ Ω) X(ω) = Y(ω).
2. If X_n →^{a.s.} X and X_n →^{a.s.} Y (or X_n →^{p} X and X_n →^{p} Y) (or X_n →^{L_r} X and X_n →^{L_r} Y), then X = Y with probability 1. That is, Pr{X = Y} = 1.
3. If X_n →^{d} X and X_n →^{d} Y, then F_X(x) = F_Y(x) for all x.

For ease of understanding, the relations of the five modes of convergence can be depicted as follows. As usual, a double arrow denotes implication:

X_n →^{p.w.} X  =⇒  X_n →^{a.s.} X  =⇒  X_n →^{p} X  =⇒  X_n →^{d} X,

and

X_n →^{L_r} X (r ≥ 1)  =⇒  X_n →^{p} X.

There are some other relations among these five convergence modes (shown via dotted lines in the original diagram and labeled by Theorems B.9 and B.10); they are stated below.

Theorem B.9 (Monotone convergence theorem [47])

X_n →^{a.s.} X, (∀ n) Y ≤ X_n ≤ X_{n+1}, and E[|Y|] < ∞  =⇒  X_n →^{L_1} X  =⇒  E[X_n] → E[X].

Theorem B.10 (Dominated convergence theorem [47])

X_n →^{a.s.} X, (∀ n) |X_n| ≤ Y, and E[|Y|] < ∞  =⇒  X_n →^{L_1} X  =⇒  E[X_n] → E[X].

The implication from X_n →^{L_1} X to E[X_n] → E[X] can be easily seen from

|E[X_n] − E[X]| = |E[X_n − X]| ≤ E[|X_n − X|].

B.5 Ergodicity and Laws of Large Numbers

B.5.1 Laws of Large Numbers

Consider a random process X 1 , X 2 , . . . with common marginal mean μ. Suppose


that we wish to estimate μ on the basis of the observed sequence x1 , x2 , x3 , . . . .
The weak and strong laws of large numbers ensure that such inference is possible
(with reasonable accuracy), provided that the dependencies between X n ’s are suitably
restricted: e.g., the weak law is valid for uncorrelated X n ’s, while the strong law is
valid for independent X n ’s. Since independence is a more restrictive condition than
the absence of correlation, one expects the strong law to be more powerful than the
weak law. This is indeed the case, as the weak law states that the sample average

X1 + · · · + Xn
n
converges to μ in probability, while the strong law asserts that this convergence takes
place with probability 1.
The following two inequalities will be useful in the discussion of this subject.

Lemma B.11 (Markov’s inequality) For any integer k > 0, real number α > 0, and any random variable X,

Pr[|X| ≥ α] ≤ (1/α^k) E[|X|^k].
Proof Let F_X(·) be the cdf of the random variable X. Then,

E[|X|^k] = ∫_{−∞}^{∞} |x|^k dF_X(x)
         ≥ ∫_{{x∈R : |x|≥α}} |x|^k dF_X(x)
         ≥ ∫_{{x∈R : |x|≥α}} α^k dF_X(x)
         = α^k ∫_{{x∈R : |x|≥α}} dF_X(x)
         = α^k Pr[|X| ≥ α].

Equality holds iff

∫_{{x∈R : |x|<α}} |x|^k dF_X(x) = 0 and ∫_{{x∈R : |x|>α}} |x|^k dF_X(x) = 0,

namely,

Pr[X = 0] + Pr[|X| = α] = 1.

In the proof of Markov’s inequality, we use the general representation for integration with respect to a (cumulative) distribution function F_X(·), i.e.,

∫_X · dF_X(x), (B.5.1)

which is named the Lebesgue–Stieltjes integral. Such a representation can be applied for both discrete and continuous supports as well as the case that the probability density function does not exist. We use this notational convention to remove the burden of differentiating discrete random variables from continuous ones.

Lemma B.12 (Chebyshev’s inequality) For any random variable X with variance Var[X] and real number α > 0,

Pr[|X − E[X]| ≥ α] ≤ (1/α²) Var[X].

Proof By Markov’s inequality with k = 2, we have

Pr[|X − E[X]| ≥ α] ≤ (1/α²) E[|X − E[X]|²].

Equality holds iff

Pr[|X − E[X]| = 0] + Pr[|X − E[X]| = α] = 1,

equivalently, there exists p ∈ [0, 1] such that

Pr[X = E[X] + α] = Pr[X = E[X] − α] = p and Pr[X = E[X]] = 1 − 2p.

In the proofs of the above two lemmas, we also provide the condition under which
equality holds. These conditions indicate that equality usually cannot be fulfilled.
Hence in most cases, the two inequalities are strict.
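As a quick numerical sanity check (a sketch added here, not from the original text; the Exp(1) distribution, sample size, and values of α are illustrative choices), the following snippet compares an empirical tail probability with the corresponding Markov (k = 1) and Chebyshev bounds.

import random, statistics

random.seed(1)
N = 200000
samples = [random.expovariate(1.0) for _ in range(N)]    # Exp(1): mean 1, variance 1
mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
abs_dev = statistics.fmean(abs(x - mean) for x in samples)

for alpha in (1.0, 2.0, 3.0):
    tail = sum(abs(x - mean) >= alpha for x in samples) / N   # empirical Pr[|X - E[X]| >= alpha]
    markov = abs_dev / alpha                                  # Markov bound with k = 1
    chebyshev = var / alpha ** 2                              # Chebyshev bound
    print(alpha, round(tail, 4), round(markov, 4), round(chebyshev, 4))
# The empirical tail probability lies strictly below both bounds here, in line
# with the remark that the two inequalities are usually strict.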

Theorem B.13 (Weak law of large numbers) Let {X_n}_{n=1}^∞ be a sequence of uncorrelated random variables with common mean E[X_i] = μ. If the variables also have common variance, or more generally,

lim_{n→∞} (1/n²) Σ_{i=1}^n Var[X_i] = 0  (equivalently, (X_1 + · · · + X_n)/n →^{L_2} μ),

then the sample average

(X_1 + · · · + X_n)/n

converges to the mean μ in probability.

Proof By Chebyshev’s inequality,

Pr[ |(1/n) Σ_{i=1}^n X_i − μ| ≥ ε ] ≤ (1/(n²ε²)) Σ_{i=1}^n Var[X_i].

Note that the right-hand side of the above Chebyshev’s inequality is just the second moment of the difference between the n-sample average and the mean μ. Thus, the variance constraint is equivalent to the statement that (X_1 + · · · + X_n)/n →^{L_2} μ, which in turn implies (X_1 + · · · + X_n)/n →^{p} μ.
Theorem B.14 (Kolmogorov’s strong law of large numbers) Let {X_n}_{n=1}^∞ be a sequence of independent random variables with common mean E[X_n] = μ. If either
1. the X_n’s are identically distributed; or
2. the X_n’s are square-integrable9 with variances satisfying

Σ_{i=1}^∞ Var[X_i]/i² < ∞,

then

(X_1 + · · · + X_n)/n →^{a.s.} μ.
Note that the above i.i.d. assumption does not exclude the possibility of μ = ∞
(or μ = −∞), in which case the sample average converges to ∞ (or −∞) with
probability 1. Also note that there are cases of sequences of independent random
variables for which the weak law applies, but the strong law does not. This is due to
the fact that

9 A random variable X is said to be square-integrable if E[|X |2 ] < ∞.




Σ_{i=1}^n Var[X_i]/i² ≥ (1/n²) Σ_{i=1}^n Var[X_i].

The final remark is that Kolmogorov’s strong law of large numbers can be extended to a function of a sequence of independent random variables:

(g(X_1) + · · · + g(X_n))/n →^{a.s.} E[g(X_1)].

But such an extension cannot be applied to the weak law of large numbers, since g(Y_i) and g(Y_j) can be correlated even if Y_i and Y_j are not.

B.5.2 Ergodicity Versus Law of Large Numbers

After the introduction of Kolmogorov’s strong law of large numbers, one may find
that the pointwise ergodic theorem (Theorem B.4) actually indicates a similar result.
In fact, the pointwise ergodic theorem can be viewed as another version of the strong
law of large numbers, which states that for stationary and ergodic processes, time
averages converge with probability 1 to the ensemble expectation.
The notion of ergodicity is often misinterpreted, since its definition is not very intuitive. Some texts adopt as a definition that a stationary process satisfying the ergodic theorem is ergodic.10 However, the ergodic theorem is indeed a consequence of the original mathematical definition of ergodicity in terms of the shift-invariant property (see Sect. B.3 and the discussion in [160, pp. 174–175]).
Let us try to clarify the notion of ergodicity by the following remarks:
• The concept of ergodicity does not require stationarity. In other words, a nonsta-
tionary process can be ergodic.
• Many perfectly good models of physical processes are not ergodic, yet they obey
some form of law of large numbers. In other words, non-ergodic processes can be
perfectly good and useful models.

10 Here is one example. A stationary random process {X_n}_{n=1}^∞ is called ergodic if for an arbitrary integer k and function f(·) on X^k of finite mean,

(1/n) Σ_{i=1}^n f(X_{i+1}, . . . , X_{i+k}) →^{a.s.} E[f(X_1, . . . , X_k)].

As a result of this definition, a stationary ergodic source is the most general dependent random process for which the strong law of large numbers holds. This definition somehow implies that if a process is not stationary ergodic, then the strong law of large numbers is violated (or the time average does not converge with probability 1 to its ensemble expectation). But this is not true. One can weaken the conditions of stationarity and ergodicity from their original mathematical definitions to asymptotic stationarity and ergodicity, and still make the strong law of large numbers hold. (Cf. the last remark in this section and also Fig. B.2.)

Fig. B.2 Relation of ergodic random processes, respectively, defined through time-shift invariance and ergodic theorem (the figure contrasts ergodicity defined through the shift-invariance property with ergodicity defined through the ergodic theorem, i.e., stationarity together with the convergence of time averages, as in the law of large numbers)

• There is no finite-dimensional equivalent definition of ergodicity as there is for


stationarity. This fact makes it more difficult to describe and interpret ergodicity.
• I.i.d. processes are ergodic; hence, ergodicity can be thought of as a (kind of)
generalization of i.i.d.
• As mentioned earlier, stationarity and ergodicity imply the time average converges
with probability 1 to the ensemble mean. Now if a process is stationary but not
ergodic, then the time average still converges, but possibly not to the ensemble
mean.

For example, let {A_n}_{n=1}^∞ and {B_n}_{n=1}^∞ be two sequences of i.i.d. binary-valued random variables with Pr{A_n = 0} = Pr{B_n = 1} = 1/4. Suppose that X_n = A_n if U = 1, and X_n = B_n if U = 0, where U is an equiprobable binary random variable, and {A_n}_{n=1}^∞, {B_n}_{n=1}^∞ and U are independent. Then {X_n}_{n=1}^∞ is
stationary. Is the process ergodic? The answer is negative. If the stationary process
were ergodic, then from the pointwise ergodic theorem (Theorem B.4), its relative
frequency would converge to

Pr(X_n = 1) = Pr(U = 1) Pr(X_n = 1 | U = 1) + Pr(U = 0) Pr(X_n = 1 | U = 0)
            = Pr(U = 1) Pr(A_n = 1) + Pr(U = 0) Pr(B_n = 1) = 1/2.
However, one should observe that the outputs of (X_1, . . . , X_n) form a Bernoulli process with relative frequency of 1’s being either 3/4 or 1/4, depending on the value of U. Therefore,

(1/n) Σ_{i=1}^n X_i →^{a.s.} Y as n → ∞,

where Pr(Y = 1/4) = Pr(Y = 3/4) = 1/2, which contradicts the ergodic theorem (a simulation sketch of this example is given after these remarks).

From the above example, the pointwise ergodic theorem can actually be made useful in such a stationary but non-ergodic case, since an “apparent” stationary ergodic process (either {A_n}_{n=1}^∞ or {B_n}_{n=1}^∞) is actually being observed when measuring the relative frequency (3/4 or 1/4). This renders a surprising fundamental result for random processes, the ergodic decomposition theorem: under fairly general assumptions, any (not necessarily ergodic) stationary process is in fact a mixture of stationary ergodic processes, and hence one always observes a stationary ergodic outcome (e.g., see [159, 349]). As in the above example, one always observes either A_1, A_2, A_3, . . . or B_1, B_2, B_3, . . ., depending on the value of U, for which both sequences are stationary ergodic (i.e., the time-stationary observation X_n satisfies X_n = U · A_n + (1 − U) · B_n).
• The previous remark implies that ergodicity is not required for the strong law of
large numbers to be useful. The next question is whether or not stationarity is
required. Again, the answer is negative. In fact, the main concern of the law of
large numbers is the convergence of sample averages to its ensemble expectation.
It should be reasonable to expect that random processes could exhibit transient
behaviors that violate the stationarity definition, with their sample average still
converging. One can then introduce the notion of asymptotically mean stationary
to achieve the law of large numbers [159]. For example, a finite-alphabet time-
invariant (but not necessarily stationary) irreducible Markov chain satisfies the law
of large numbers. Thus, the stationarity and/or ergodicity properties of a process
can be weakened with the process still admitting laws of large numbers (i.e., time
averages and relative frequencies have desired and well-defined limits).
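The following minimal simulation sketch (not part of the original text; the sample size, number of realizations, and seed are illustrative) revisits the stationary but non-ergodic mixture example above: each realization’s time average settles near 3/4 or 1/4 according to the hidden switch U, rather than near the ensemble mean 1/2.

import random

def mixture_time_average(n, rng):
    # One realization of X_k = U*A_k + (1-U)*B_k and its time average,
    # with Pr{A_k = 1} = 3/4, Pr{B_k = 1} = 1/4 and U equiprobable.
    U = rng.random() < 0.5
    p_one = 0.75 if U else 0.25
    return sum(rng.random() < p_one for _ in range(n)) / n

rng = random.Random(7)
print([round(mixture_time_average(100000, rng), 3) for _ in range(6)])
# Each realization's time average is close to either 0.75 or 0.25 (depending on
# the hidden switch U), not to the ensemble mean Pr(X_n = 1) = 1/2, illustrating
# that the mixture process is stationary but not ergodic.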

B.6 Central Limit Theorem

Theorem B.15 (Central limit theorem) If {X_n}_{n=1}^∞ is a sequence of i.i.d. random variables with finite common marginal mean μ and variance σ², then

(1/√n) Σ_{i=1}^n (X_i − μ) →^{d} Z ∼ N(0, σ²),

where the convergence is in distribution (as n → ∞) and Z ∼ N(0, σ²) is a Gaussian distributed random variable with mean 0 and variance σ².
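A minimal simulation sketch of the central limit theorem follows (not from the original text; the Uniform(0, 1) source, block length n, number of trials, and seed are illustrative choices): the standardized sums have empirical mean, variance, and tail behavior close to those of N(0, σ²).

import math, random, statistics

rng = random.Random(3)
n, trials = 400, 20000
mu, sigma2 = 0.5, 1.0 / 12.0        # mean and variance of Uniform(0, 1)

# Empirical distribution of (1/sqrt(n)) * sum_{i=1}^{n} (X_i - mu)
sums = [sum(rng.random() - mu for _ in range(n)) / math.sqrt(n) for _ in range(trials)]
print(round(statistics.fmean(sums), 4))        # ~ 0
print(round(statistics.pvariance(sums), 4))    # ~ sigma2 = 0.0833
print(round(sum(s <= math.sqrt(sigma2) for s in sums) / trials, 4))
# The last value is close to 0.8413, the N(0, sigma2) probability of lying
# below one standard deviation, as the central limit theorem predicts.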

B.7 Convexity, Concavity, and Jensen’s inequality

Jensen’s inequality provides a useful bound for the expectation of convex (or concave)
functions.

Definition B.16 (Convexity) Consider a convex set11 O ⊆ R^m, where m is a fixed positive integer. Then, a function f : O → R is said to be convex over O if for every x, y in O and 0 ≤ λ ≤ 1,

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).

Furthermore, a function f is said to be strictly convex if equality holds only when λ = 0 or λ = 1.
Note that, unlike the usual notations x^n = (x_1, x_2, . . . , x_n) or x = (x_1, x_2, . . .) used throughout this book, we use x to denote a column vector in this section.
Definition B.17 (Concavity) A function f is concave if − f is convex.
Note that when O = (a, b) is an interval in R and function f : O → R has a
nonnegative (resp. positive) second derivative over O, then the function is convex
(resp. strictly convex). This can be shown via the Taylor series expansion of the
function.
Theorem B.18 (Jensen’s inequality) If f : O → R is convex over a convex set
O ⊂ Rm , and X = (X 1 , X 2 , . . . , X m )T is an m-dimensional random vector with
alphabet X ⊂ O, then
E[ f (X )] ≥ f (E[X ]).

Moreover, if f is strictly convex, then equality in the above inequality immediately


implies X = E[X ] with probability 1.

Note: O is a convex set; hence, X ⊂ O implies E[X ] ∈ O. This guarantees that


f (E[X ]) is defined. Similarly, if f is concave, then

E[ f (X )] ≤ f (E[X ]).

Furthermore, if f is strictly concave, then equality in the above inequality immedi-


ately implies that X = E[X ] with probability 1.
Proof Let y = a^T x + b be a “support hyperplane” for f with “slope” vector a^T and affine parameter b that passes through the point (E[X], f(E[X])), where a support hyperplane12 for the function f at a point x₀ is by definition a hyperplane passing through
11 A set O ⊆ R^m is said to be convex if for every x = (x_1, x_2, . . . , x_m)^T and y = (y_1, y_2, . . . , y_m)^T

in O (where T denotes transposition), and every 0 ≤ λ ≤ 1, λx + (1 − λ)y ∈ O; in other words,
the “convex combination” of any two “points” x and y in O also belongs to O.
12 A hyperplane y = a^T x + b is said to be a support hyperplane for a function f with “slope” vector a^T ∈ R^m and affine parameter b ∈ R if, among all hyperplanes with the same slope vector a^T, it is the largest one satisfying a^T x + b ≤ f(x) for every x ∈ O. A support hyperplane may not necessarily be made to pass through the desired point (x₀, f(x₀)). Here, since we only consider convex functions, the validity of the support hyperplane passing through (x₀, f(x₀)) is therefore guaranteed. Note that when x is one-dimensional (i.e., m = 1), a support hyperplane is simply referred to as a support line.

the point (x₀, f(x₀)) and lying entirely below the graph of f (see Fig. B.3 for an illustration of a support line for a convex function over R).

Fig. B.3 The support line y = ax + b of the convex function f(x)
Thus,
(∀x ∈ X ) a T x + b ≤ f (x).

By taking the expectation of both sides, we obtain

a T E[X ] + b ≤ E[ f (X )],

but we know that a T E[X ] + b = f (E[X ]). Consequently,

f (E[X ]) ≤ E[ f (X )].
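As a small numerical illustration of Jensen’s inequality (a sketch added here, not part of the original text; the discrete distribution and the convex function f(x) = x² are arbitrary choices), the following snippet checks E[f(X)] ≥ f(E[X]) by direct computation.

def expectation(pmf, g=lambda x: x):
    # E[g(X)] for a discrete random variable given as a dict {value: probability}.
    return sum(p * g(x) for x, p in pmf.items())

f = lambda x: x * x                     # a (strictly) convex function
pmf = {-1: 0.2, 0: 0.5, 3: 0.3}         # illustrative distribution

lhs = expectation(pmf, f)               # E[f(X)] = 2.9
rhs = f(expectation(pmf))               # f(E[X]) = 0.49
print(lhs, rhs, lhs >= rhs)             # Jensen's inequality E[f(X)] >= f(E[X]) holds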

B.8 Lagrange Multipliers Technique and


Karush–Kuhn–Tucker (KKT) Conditions

Optimization of a function f(x) over x = (x_1, . . . , x_n) ∈ X ⊆ R^n subject to some inequality constraints g_i(x) ≤ 0 for 1 ≤ i ≤ m and equality constraints h_j(x) = 0 for 1 ≤ j ≤ ℓ is a central technique for problems in information theory. An immediate example is to maximize the mutual information subject to an “inequality” power constraint and an “equality” probability unity-sum constraint in order to determine the channel capacity. We can formulate such an optimization problem [56, Eq. (5.1)] mathematically as13

min_{x∈Q} f(x), (B.8.1)

13 Since maximization of f (·) is equivalent to minimization of − f (·), it suffices to discuss the KKT
conditions for the minimization problem defined in (B.8.1).

where

Q := {x ∈ X : g_i(x) ≤ 0 for 1 ≤ i ≤ m and h_j(x) = 0 for 1 ≤ j ≤ ℓ}.

In most cases, solving the constrained optimization problem defined in (B.8.1) is hard. Instead, one may introduce a dual optimization problem without constraints as

L(λ, ν) := min_{x∈X} ( f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{j=1}^ℓ ν_j h_j(x) ). (B.8.2)

In the literature, λ = (λ_1, . . . , λ_m) and ν = (ν_1, . . . , ν_ℓ) are usually referred to


as Lagrange multipliers, and L(λ, ν) is called the Lagrange dual function. Note
that L(λ, ν) is a concave function of λ and ν since it is the minimization of affine
functions of λ and ν.
It can be verified that when λ_i ≥ 0 for 1 ≤ i ≤ m,

L(λ, ν) ≤ min_{x∈Q} ( f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{j=1}^ℓ ν_j h_j(x) ) ≤ min_{x∈Q} f(x). (B.8.3)

We are however interested in when the above inequality becomes equality (i.e., when
the so-called strong duality holds) because if there exist nonnegative λ̃ and ν̃ that
equate (B.8.3), then

f(x*) = min_{x∈Q} f(x) = L(λ̃, ν̃)
      = min_{x∈X} [ f(x) + Σ_{i=1}^m λ̃_i g_i(x) + Σ_{j=1}^ℓ ν̃_j h_j(x) ]
      ≤ f(x*) + Σ_{i=1}^m λ̃_i g_i(x*) + Σ_{j=1}^ℓ ν̃_j h_j(x*)
      ≤ f(x*), (B.8.4)

where (B.8.4) follows because the minimizer x ∗ of (B.8.1) lies in Q. Hence, if


the strong duality holds, the same x ∗ achieves both min x∈Q f (x) and L(λ̃, ν̃), and
λ̃i gi (x ∗ ) = 0 for 1 ≤ i ≤ m.14
The strong duality does not in general hold. A situation that guarantees the validity
of the strong duality has been determined by William Karush [212], and separately
Harold W. Kuhn and Albert W. Tucker [235]. In particular, when f(·) and {g_i(·)}_{i=1}^m are both convex, and {h_j(·)}_{j=1}^ℓ are affine, and these functions are all differentiable,

14 Equating (B.8.4) implies Σ_{i=1}^m λ̃_i g_i(x*) = 0. It can then be easily verified from λ̃_i g_i(x*) ≤ 0 for every 1 ≤ i ≤ m that λ̃_i g_i(x*) = 0 for 1 ≤ i ≤ m.

they found that the strong duality holds iff the KKT conditions are satisfied [56,
p. 258].

Definition B.19 (Karush–Kuhn–Tucker (KKT) conditions) Point x = (x_1, . . ., x_n) and multipliers λ = (λ_1, . . . , λ_m) and ν = (ν_1, . . . , ν_ℓ) are said to satisfy the KKT conditions if

g_i(x) ≤ 0,  λ_i ≥ 0,  λ_i g_i(x) = 0    for i = 1, . . . , m,
h_j(x) = 0    for j = 1, . . . , ℓ,
∂f/∂x_k (x) + Σ_{i=1}^m λ_i ∂g_i/∂x_k (x) + Σ_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) = 0    for k = 1, . . . , n.

Note that when f(·) and the constraints {g_i(·)}_{i=1}^m and {h_j(·)}_{j=1}^ℓ are arbitrary func-
tions, the KKT conditions are only necessary for the validity of the strong duality. In
other words, for a non-convex optimization problem, we can only claim that if the
strong duality holds, then the KKT conditions are satisfied but not vice versa.
A case that is particularly useful in information theory is when x is restricted to be a probability distribution. In such a case, apart from other problem-specific constraints, we additionally have n inequality constraints g_{m+i}(x) = −x_i ≤ 0 for 1 ≤ i ≤ n and one equality constraint h_{ℓ+1}(x) = Σ_{k=1}^n x_k − 1 = 0. Hence, the KKT conditions become

g_i(x) ≤ 0,  λ_i ≥ 0,  λ_i g_i(x) = 0    for i = 1, . . . , m,
g_{m+k}(x) = −x_k ≤ 0,  λ_{m+k} ≥ 0,  λ_{m+k} x_k = 0    for k = 1, . . . , n,
h_j(x) = 0    for j = 1, . . . , ℓ,
h_{ℓ+1}(x) = Σ_{k=1}^n x_k − 1 = 0,
∂f/∂x_k (x) + Σ_{i=1}^m λ_i ∂g_i/∂x_k (x) − λ_{m+k} + Σ_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) + ν_{ℓ+1} = 0    for k = 1, . . . , n.

From λ_{m+k} ≥ 0 and λ_{m+k} x_k = 0, we can obtain the well-known relation below:

λ_{m+k} = ∂f/∂x_k (x) + Σ_{i=1}^m λ_i ∂g_i/∂x_k (x) + Σ_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) + ν_{ℓ+1} = 0   if x_k > 0,
λ_{m+k} = ∂f/∂x_k (x) + Σ_{i=1}^m λ_i ∂g_i/∂x_k (x) + Σ_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) + ν_{ℓ+1} ≥ 0   if x_k = 0.

The above relation is the form of the KKT conditions most commonly seen in information theory problems.

Example B.20 Suppose that for nonnegative {q_{i,j}}_{1≤i≤n, 1≤j≤n} with Σ_{j=1}^n q_{i,j} = 1,

f(x) = − Σ_{i=1}^n Σ_{j=1}^n x_i q_{i,j} log ( q_{i,j} / Σ_{i'=1}^n x_{i'} q_{i',j} ),
g_i(x) = −x_i ≤ 0    for i = 1, . . . , n,
h(x) = Σ_{i=1}^n x_i − 1 = 0.

Then, the KKT conditions imply

x_i ≥ 0,  λ_i ≥ 0,  λ_i x_i = 0    for i = 1, . . . , n,
Σ_{i=1}^n x_i = 1,
− Σ_{j=1}^n q_{k,j} log ( q_{k,j} / Σ_{i'=1}^n x_{i'} q_{i',j} ) + 1 − λ_k + ν = 0    for k = 1, . . . , n,

which further implies that

λ_k = − Σ_{j=1}^n q_{k,j} log ( q_{k,j} / Σ_{i'=1}^n x_{i'} q_{i',j} ) + 1 + ν = 0   if x_k > 0,
λ_k = − Σ_{j=1}^n q_{k,j} log ( q_{k,j} / Σ_{i'=1}^n x_{i'} q_{i',j} ) + 1 + ν ≥ 0   if x_k = 0.

By this, the input distributions that achieve the channel capacities of some channels such as BSC and BEC can be identified.
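To make Example B.20 concrete, the following sketch (not part of the original text; the crossover probability 0.1 is an illustrative choice) evaluates, for a binary symmetric channel and a candidate input distribution, the quantity Σ_j q_{k,j} log(q_{k,j}/Σ_i x_i q_{i,j}) for each input k; by the KKT conditions, at the capacity-achieving input this quantity equals the capacity for every k with x_k > 0.

import math

def kkt_terms(Q, x):
    # For a channel with transition matrix Q (rows: inputs, columns: outputs)
    # and input distribution x, return D_k = sum_j Q[k][j] log2(Q[k][j] / p_j)
    # for each input k, where p_j = sum_i x[i] Q[i][j] is the output distribution.
    n_out = len(Q[0])
    p = [sum(x[i] * Q[i][j] for i in range(len(Q))) for j in range(n_out)]
    return [sum(Q[k][j] * math.log2(Q[k][j] / p[j])
                for j in range(n_out) if Q[k][j] > 0)
            for k in range(len(Q))]

eps = 0.1                                # BSC crossover probability (illustrative)
Q = [[1 - eps, eps], [eps, 1 - eps]]
x = [0.5, 0.5]                           # candidate (uniform) input distribution

print(kkt_terms(Q, x))                   # both terms ~ 0.531 bits
print(1 + eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps))   # C = 1 - h(eps) ~ 0.531
# Equal terms for all inputs with x_k > 0 are exactly the KKT condition of
# Example B.20, confirming that the uniform input achieves the BSC capacity.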

The next example shows the analogy between determining the channel capacity and the problem of optimal power allocation.

Example B.21 (Water-filling) Suppose that with σ_i² > 0 for 1 ≤ i ≤ n and P > 0,

f(x) = − Σ_{i=1}^n log ( 1 + x_i/σ_i² ),
g_i(x) = −x_i ≤ 0    for i = 1, . . . , n,
h(x) = Σ_{i=1}^n x_i − P = 0.

Then, the KKT conditions imply

x_i ≥ 0,  λ_i ≥ 0,  λ_i x_i = 0    for i = 1, . . . , n,
Σ_{i=1}^n x_i = P,
− 1/(σ_i² + x_i) − λ_i + ν = 0    for i = 1, . . . , n,

which further implies that

λ_i = − 1/(σ_i² + x_i) + ν = 0   if x_i > 0,
λ_i = − 1/(σ_i² + x_i) + ν ≥ 0   if x_i = 0,

equivalently,

x_i = 1/ν − σ_i²   if σ_i² < 1/ν,
x_i = 0            if σ_i² ≥ 1/ν.

This then gives the water-filling solution for the power allocation over parallel continuous-input AWGN channels.
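As an illustration of Example B.21 (a sketch added here, not part of the original text; the noise variances, power budget, and bisection tolerance are illustrative choices), the following snippet computes the water-filling allocation x_i = max(0, 1/ν − σ_i²) by solving Σ_i x_i = P for the water level 1/ν with bisection.

def water_filling(sigma2, P, tol=1e-12):
    # Water-filling allocation x_i = max(0, level - sigma2_i) with sum_i x_i = P,
    # where the water level (= 1/nu in Example B.21) is found by bisection.
    lo, hi = min(sigma2), max(sigma2) + P
    while hi - lo > tol:
        level = (lo + hi) / 2
        if sum(max(0.0, level - s) for s in sigma2) > P:
            hi = level
        else:
            lo = level
    return [max(0.0, lo - s) for s in sigma2]

sigma2 = [0.5, 1.0, 2.0, 4.0]   # noise variances of the parallel channels (illustrative)
P = 3.0                         # total power budget (illustrative)
x = water_filling(sigma2, P)
print([round(v, 3) for v in x], round(sum(x), 3))
# -> [1.667, 1.167, 0.167, 0.0] 3.0: low-noise channels get more power, and
# channels whose noise variance exceeds the water level get none, exactly as in
# the KKT solution of Example B.21.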
References

1. J. Aczél, Z. Daróczy, On Measures of Information and Their Characterization (Academic


Press, New York, 1975)
2. C. Adam, Information theory in molecular biology. Phys. Life Rev. 1, 3–22 (2004)
3. R. Ahlswede, Certain results in coding theory for compound channels I, in Proceedings of
Bolyai Colloquium on Information Theory, Debrecen, Hungary, 1967, pp. 35–60
4. R. Ahlswede, The weak capacity of averaged channels. Z. Wahrscheinlichkeitstheorie Verw.
Gebiete 11, 61–73 (1968)
5. R. Ahlswede, J. Körner, Source coding with side information and a converse for degraded
broadcast channels. IEEE Trans. Inf. Theory 21(6), 629–637 (1975)
6. R. Ahlswede, I. Csiszár, Common randomness in information theory and cryptography. I.
Secret sharing. IEEE Trans. Inf. Theory 39(4), 1121–1132 (1993)
7. R. Ahlswede, I. Csiszár, Common randomness in information theory and cryptography. II.
CR capacity. IEEE Trans. Inf. Theory 44(1), 225–240 (1998)
8. E. Akyol, K.B. Viswanatha, K. Rose, T.A. Ramstad, On zero-delay source-channel coding.
IEEE Trans. Inf. Theory 60(12), 7473–7489 (2014)
9. F. Alajaji, Feedback does not increase the capacity of discrete channels with additive noise.
IEEE Trans. Inf. Theory 41, 546–549 (1995)
10. F. Alajaji, T. Fuja, The performance of focused error control codes. IEEE Trans. Commun.
42(2/3/4), 272–280 (1994)
11. F. Alajaji, T. Fuja, Effect of feedback on the capacity of discrete additive channels with mem-
ory, in Proceedings of IEEE International Symposium on Information Theory, Trondheim,
Norway (1994)
12. F. Alajaji, T. Fuja, A communication channel modeled on contagion. IEEE Trans. Inf. Theory
40(6), 2035–2041 (1994)
13. F. Alajaji, N. Phamdo, N. Farvardin, T.E. Fuja, Detection of binary Markov sources over
channels with additive Markov noise. IEEE Trans. Inf. Theory 42(1), 230–239 (1996)
14. F. Alajaji, N. Phamdo, T.E. Fuja, Channel codes that exploit the residual redundancy in CELP-
encoded speech. IEEE Trans. Speech Audio Process. 4(5), 325–336 (1996)
15. F. Alajaji, N. Phamdo, Soft-decision COVQ for Rayleigh-fading channels. IEEE Commun.
Lett. 2, 162–164 (1998)


16. F. Alajaji, N. Whalen, The capacity-cost function of discrete additive noise channels with and
without feedback. IEEE Trans. Inf. Theory 46(3), 1131–1140 (2000)
17. F. Alajaji, P.-N. Chen, Z. Rached, Csiszár’s cutoff rates for the general hypothesis testing
problem. IEEE Trans. Inf. Theory 50(4), 663–678 (2004)
18. V.R. Algazi, R.M. Lerner, Binary detection in white non-Gaussian noise, Technical Report
DS-2138, M.I.T. Lincoln Lab, Lexington, MA (1964)
19. S.A. Al-Semari, F. Alajaji, T. Fuja, Sequence MAP decoding of trellis codes for Gaussian and
Rayleigh channels. IEEE Trans. Veh. Technol. 48(4), 1130–1140 (1999)
20. E. Arikan, An inequality on guessing and its application to sequential decoding. IEEE Trans.
Inf. Theory 42(1), 99–105 (1996)
21. E. Arikan, N. Merhav, Joint source-channel coding and guessing with application to sequential
decoding. IEEE Trans. Inf. Theory 44, 1756–1769 (1998)
22. E. Arikan, A performance comparison of polar codes and Reed-Muller codes. IEEE Commun.
Lett. 12(6), 447–449 (2008)
23. E. Arikan, Channel polarization: a method for constructing capacity-achieving codes for
symmetric binary-input memoryless channels. IEEE Trans. Inf. Theory 55(7), 3051–3073
(2009)
24. E. Arikan, Source polarization, in Proceedings of International Symposium on Information
Theory and Applications, July 2010, pp. 899–903
25. E. Arikan, I.E. Telatar, On the rate of channel polarization, in Proceedings of IEEE Interna-
tional Symposium on Information Theory, Seoul, Korea, June–July 2009, pp. 1493–1495
26. E. Arikan, N. ul Hassan, M. Lentmaier, G. Montorsi, J. Sayir, Challenges and some new
directions in channel coding. J. Commun. Netw. 17(4), 328–338 (2015)
27. S. Arimoto, An algorithm for computing the capacity of arbitrary discrete memoryless channel.
IEEE Trans. Inf. Theory 18(1), 14–20 (1972)
28. S. Arimoto, Information measures and capacity of order α for discrete memoryless channels,
Topics in Information Theory, Proceedings of Colloquium Mathematical Society Janos Bolyai,
Keszthely, Hungary, 1977, pp. 41–52
29. R.B. Ash, Information Theory (Interscience, New York, 1965)
30. R.B. Ash, C.A. Doléans-Dade, Probability and Measure Theory (Academic Press, MA, 2000)
31. S. Asoodeh, M. Diaz, F. Alajaji, T. Linder, Information extraction under privacy constraints.
Information 7(1), 1–37 (2016)
32. E. Ayanoǧlu, R. Gray, The design of joint source and channel trellis waveform coders. IEEE
Trans. Inf. Theory 33, 855–865 (1987)
33. J. Bakus, A.K. Khandani, Quantizer design for channel codes with soft-output decoding. IEEE
Trans. Veh. Technol. 54(2), 495–507 (2005)
34. V.B. Balakirsky, Joint source-channel coding with variable length codes. Probl. Inf. Transm.
1(37), 10–23 (2001)
35. A. Banerjee, P. Burlina, F. Alajaji, Image segmentation and labeling using the Polya urn
model. IEEE Trans. Image Process. 8(9), 1243–1253 (1999)
36. M.B. Bassat, J. Raviv, Rényi’s entropy and the probability of error. IEEE Trans. Inf. Theory
24(3), 324–330 (1978)
37. F. Behnamfar, F. Alajaji, T. Linder, MAP decoding for multi-antenna systems with non-
uniform sources: exact pairwise error probability and applications. IEEE Trans. Commun.
57(1), 242–254 (2009)
38. H. Behroozi, F. Alajaji, T. Linder, On the optimal performance in asymmetric Gaussian
wireless sensor networks with fading. IEEE Trans. Signal Process. 58(4), 2436–2441 (2010)
39. S. Ben-Jamaa, C. Weidmann, M. Kieffer, Analytical tools for optimizing the error correction
performance of arithmetic codes. IEEE Trans. Commun. 56(9), 1458–1468 (2008)
40. C.H. Bennett, G. Brassard, Quantum cryptography: public key and coin tossing, in Proceed-
ings of International Conference on Computer Systems and Signal Processing, Bangalore,
India, Dec 1984, pp. 175–179
41. C.H. Bennett, G. Brassard, C. Crepeau, U.M. Maurer, Generalized privacy amplification.
IEEE Trans. Inf. Theory 41(6), 1915–1923 (1995)
42. T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression (Prentice-
Hall, New Jersey, 1971)
43. T. Berger, Explicit bounds to R(D) for a binary symmetric Markov source. IEEE Trans. Inf.
Theory 23(1), 52–59 (1977)
44. C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon limit error-correcting coding and
decoding: Turbo-codes(1), in Proceedings of IEEE International Conference on Communi-
cations, Geneva, Switzerland, May 1993, pp. 1064–1070
45. C. Berrou, A. Glavieux, Near optimum error correcting coding and decoding: turbo-codes.
IEEE Trans. Commun. 44(10), 1261–1271 (1996)
46. D.P. Bertsekas, with A. Nedić, A.E. Ozdaglar, Convex Analysis and Optimization (Athena
Scientific, Belmont, MA, 2003)
47. P. Billingsley, Probability and Measure, 2nd edn. (Wiley, New York, 1995)
48. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, Berlin, 2006)
49. R.E. Blahut, Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf.
Theory 18(4), 460–473 (1972)
50. R.E. Blahut, Theory and Practice of Error Control Codes (Addison-Wesley, MA, 1983)
51. R.E. Blahut, Principles and Practice of Information Theory (Addison Wesley, MA, 1988)
52. R.E. Blahut, Algebraic Codes for Data Transmission (Cambridge University Press, Cam-
bridge, 2003)
53. M. Bloch, J. Barros, Physical-Layer Security: From Information Theory to Security Engi-
neering (Cambridge University Press, Cambridge, 2011)
54. A.C. Blumer, R.J. McEliece, The Rényi redundancy of generalized Huffman codes. IEEE
Trans. Inf. Theory 34(5), 1242–1249 (1988)
55. L. Boltzmann, Über die Beziehung zwischen dem Hauptsatze der mechanischen Wärmetheorie
und der Wahrscheinlichkeitsrechnung respective den Sätzen über das Wärmegleichgewicht.
Wiener Berichte 76, 373–435 (1877)
56. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge,
UK, 2003)
57. G. Brante, R. Souza, J. Garcia-Frias, Spatial diversity using analog joint source channel coding
in wireless channels. IEEE Trans. Commun. 61(1), 301–311 (2013)
58. L. Breiman, The individual ergodic theorems of information theory. Ann. Math. Stat. 28,
809–811 (1957). (with a correction made in vol. 31, pp. 809–810, 1960)
59. D.R. Brooks, E.O. Wiley, Evolution as Entropy: Toward a Unified Theory of Biology (Uni-
versity of Chicago Press, Chicago, 1988)
60. N. Brunel, J.P. Nadal, Mutual information, Fisher information, and population coding. Neural
Comput. 10(7), 1731–1757 (1998)
61. O. Bursalioglu, G. Caire, D. Divsalar, Joint source-channel coding for deep-space image
transmission using rateless codes. IEEE Trans. Commun. 61(8), 3448–3461 (2013)
62. V. Buttigieg, P.G. Farrell, Variable-length error-correcting codes. IEE Proc. Commun. 147(4),
211–215 (2000)
63. S.A. Butman, R.J. McEliece, The ultimate limits of binary coding for a wideband Gaussian
channel, DSN Progress Report 42–22, Jet Propulsion Lab, Pasadena, CA, Aug 1974, pp.
78–80
64. G. Caire, K. Narayanan, On the distortion SNR exponent of hybrid digital-analog space-time
coding. IEEE Trans. Inf. Theory 53, 2867–2878 (2007)
65. F.P. Calmon, Information-Theoretic Metrics for Security and Privacy, Ph.D. thesis, MIT, Sept
2015
66. F.P. Calmon, A. Makhdoumi, M. Médard, Fundamental limits of perfect privacy, in Proceed-
ings of IEEE International Symposium on Information Theory, Hong Kong, pp. 1796–1800,
June 2015
67. L.L. Campbell, A coding theorem and Rényi’s entropy. Inf. Control 8, 423–429 (1965)
68. L.L. Campbell, A block coding theorem and Rényi’s entropy. Int. J. Math. Stat. Sci. 6, 41–47
(1997)
69. A.T. Campo, G. Vazquez-Vilar, A. Guillen i Fabregas, T. Koch, A. Martinez, A derivation of
the source-channel error exponent using non-identical product distributions. IEEE Trans. Inf.
Theory 60(6), 3209–3217 (2014)
70. C. Chang, Error exponents for joint source-channel coding with side information. IEEE Trans.
Inf. Theory 57(10), 6877–6889 (2011)
71. B. Chen, G. Wornell, Analog error-correcting codes based on chaotic dynamical systems.
IEEE Trans. Commun. 46(7), 881–890 (1998)
72. P.-N. Chen, F. Alajaji, Strong converse, feedback capacity and hypothesis testing, in Proceed-
ings of Conference on Information Sciences and Systems, John Hopkins University, Baltimore
(1995)
73. P.-N. Chen, F. Alajaji, Generalized source coding theorems and hypothesis testing (Parts I
and II). J. Chin. Inst. Eng. 21(3), 283–303 (1998)
74. P.-N. Chen, F. Alajaji, A rate-distortion theorem for arbitrary discrete sources. IEEE Trans.
Inf. Theory 44(4), 1666–1668 (1998)
75. P.-N. Chen, F. Alajaji, Optimistic Shannon coding theorems for arbitrary single-user systems.
IEEE Trans. Inf. Theory 45, 2623–2629 (1999)
76. P.-N. Chen, F. Alajaji, Csiszár’s cutoff rates for arbitrary discrete sources. IEEE Trans. Inf.
Theory 47(1), 330–338 (2001)
77. X. Chen, E. Tuncel, Zero-delay joint source-channel coding using hybrid digital-analog
schemes in the Wyner-Ziv setting. IEEE Trans. Commun. 62(2), 726–735 (2014)
78. H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum
of observations. Ann. Math. Stat. 23(4), 493–507 (1952)
79. S.Y. Chung, On the Construction of some Capacity Approaching Coding Schemes, Ph.D.
thesis, MIT, 2000
80. R.H. Clarke, A statistical theory of mobile radio reception. Bell Syst. Tech. J. 47, 957–1000
(1968)
81. T.M. Cover, A. El Gamal, M. Salehi, Multiple access channels with arbitrarily correlated
sources. IEEE Trans. Inf. Theory 26(6), 648–657 (1980)
82. T.M. Cover, An algorithm for maximizing expected log investment return. IEEE Trans. Inf.
Theory 30(2), 369–373 (1984)
83. T.M. Cover, J.A. Thomas, Elements of Information Theory, 2nd edn. (Wiley, New York, 2006)
84. I. Csiszár, Joint source-channel error exponent. Probl. Control Inf. Theory 9, 315–328 (1980)
85. I. Csiszár, On the error exponent of source-channel transmission with a distortion threshold.
IEEE Trans. Inf. Theory 28(6), 823–828 (1982)
86. I. Csiszár, Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory
41(1), 26–34 (1995)
87. I. Csiszár, J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems
(Academic, New York, 1981)
88. I. Csiszár, G. Tusnady, Information geometry and alternating minimization procedures, in
Statistics and Decision, Supplement Issue, vol. 1 (1984), pp. 205–237
89. J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in
Proceedings of 24th International Conference on Machine Learning, June 2007, pp. 209–216
90. B. de Finetti, Funzione caratteristica di un fenomeno aleatorio, Atti della R. Accademia
Nazionale dei Lincei, Ser. 6, Memorie, Classe di Scienze Fisiche, Matematiche e Naturali,
vol. 4 (1931), pp. 251–299
91. J. del Ser, P.M. Crespo, I. Esnaola, J. Garcia-Frias, Joint source-channel coding of sources
with memory using Turbo codes and the Burrows-Wheeler transform. IEEE Trans. Commun.
58(7), 1984–1992 (2010)
92. J. Devore, A note on the observation of a Markov source through a noisy channel. IEEE Trans.
Inf. Theory 20(6), 762–764 (1974)
93. A. Diallo, C. Weidmann, M. Kieffer, New free distance bounds and design techniques for joint
source-channel variable-length codes. IEEE Trans. Commun. 60(10), 3080–3090 (2012)
94. W. Diffie, M. Hellman, New directions in cryptography. IEEE Trans. Inf. Theory 22(6), 644–
654 (1976)
95. R.L. Dobrushin, Asymptotic bounds of the probability of error for the transmission of mes-
sages over a memoryless channel with a symmetric transition probability matrix (in Russian).
Teor. Veroyatnost. i Primenen 7(3), 283–311 (1962)
96. R.L. Dobrushin, General formulation of Shannon’s basic theorems of information theory,
AMS Translations, vol. 33, AMS, Providence, RI (1963), pp. 323–438
97. R.L. Dobrushin, M.S. Pinsker, Memory increases transmission capacity. Probl. Inf. Transm.
5(1), 94–95 (1969)
98. S.J. Dolinar, F. Pollara, The theoretical limits of source and channel coding, TDA Progress
Report 42-102, Jet Propulsion Lab, Pasadena, CA, Aug 1990, pp. 62–72
99. Draft report of 3GPP TSG RAN WG1 #87 v0.2.0, The 3rd Generation Partnership Project
(3GPP), Reno, Nevada, USA, Nov 2016
100. P. Duhamel, M. Kieffer, Joint Source-Channel Decoding: A Cross-Layer Perspective with
Applications in Video Broadcasting over Mobile and Wireless Networks (Academic Press,
2010)
101. S. Dumitrescu, X. Wu, On the complexity of joint source-channel decoding of Markov
sequences over memoryless channels. IEEE Trans. Commun. 56(6), 877–885 (2008)
102. S. Dumitrescu, Y. Wan, Bit-error resilient index assignment for multiple description scalar
quantizers. IEEE Trans. Inf. Theory 61(5), 2748–2763 (2015)
103. J.G. Dunham, R.M. Gray, Joint source and noisy channel trellis encoding. IEEE Trans. Inf.
Theory 27, 516–519 (1981)
104. R. Durrett, Probability: Theory and Examples (Cambridge University Press, Cambridge,
2015)
105. A.K. Ekert, Quantum cryptography based on Bell’s theorem. Phys. Rev. Lett. 67(6), 661–663
(1991)
106. A. El Gamal, Y.-H. Kim, Network Information Theory (Cambridge University Press, Cam-
bridge, 2011)
107. P. Elias, Coding for noisy channels, IRE Convention Record, Part 4, pp. 37–46 (1955)
108. E.O. Elliott, Estimates of error rates for codes on burst-noise channel. Bell Syst. Tech. J. 42,
1977–1997 (1963)
109. S. Emami, S.L. Miller, Nonsymmetric sources and optimum signal selection. IEEE Trans.
Commun. 44(4), 440–447 (1996)
110. M. Ergen, Mobile Broadband: Including WiMAX and LTE (Springer, Berlin, 2009)
111. F. Escolano, P. Suau, B. Bonev, Information Theory in Computer Vision and Pattern Recog-
nition (Springer, Berlin, 2009)
112. I. Esnaola, A.M. Tulino, J. Garcia-Frias, Linear analog coding of correlated multivariate
Gaussian sources. IEEE Trans. Commun. 61(8), 3438–3447 (2013)
113. R.M. Fano, Class notes for “Transmission of Information,” Course 6.574, MIT, 1952
114. R.M. Fano, Transmission of Information: A Statistical Theory of Communication (Wiley, New
York, 1961)
115. B. Farber, K. Zeger, Quantizers with uniform decoders and channel-optimized encoders. IEEE
Trans. Inf. Theory 52(2), 640–661 (2006)
116. N. Farvardin, A study of vector quantization for noisy channels. IEEE Trans. Inf. Theory
36(4), 799–809 (1990)
117. N. Farvardin, V. Vaishampayan, On the performance and complexity of channel-optimized
vector quantizers. IEEE Trans. Inf. Theory 37(1), 155–159 (1991)
118. T. Fazel, T. Fuja, Robust transmission of MELP-compressed speech: an illustrative example
of joint source-channel decoding. IEEE Trans. Commun. 51(6), 973–982 (2003)
119. W. Feller, An Introduction to Probability Theory and its Applications, vol. I, 3rd edn. (Wiley,
New York, 1970)
120. W. Feller, An Introduction to Probability Theory and its Applications, vol. II, 2nd edn. (Wiley,
New York, 1971)
121. T. Fine, Properties of an optimum digital system and applications. IEEE Trans. Inf. Theory
10, 443–457 (1964)
122. T. Fingscheidt, T. Hindelang, R.V. Cox, N. Seshadri, Joint source-channel (de)coding for
mobile communications. IEEE Trans. Commun. 50, 200–212 (2002)
123. F. Fleuret, Fast binary feature selection with conditional mutual information. J. Mach. Learn.
Res. 5, 1531–1555 (2004)
124. M. Fossorier, Z. Xiong, K. Zeger, Progressive source coding for a power constrained Gaussian
channel. IEEE Trans. Commun. 49(8), 1301–1306 (2001)
125. B.D. Fritchman, A binary channel characterization using partitioned Markov chains. IEEE
Trans. Inf. Theory 13(2), 221–227 (1967)
126. M. Fresia, G. Caire, A linear encoding approach to index assignment in lossy source-channel
coding. IEEE Trans. Inf. Theory 56(3), 1322–1344 (2010)
127. M. Fresia, F. Perez-Cruz, H.V. Poor, S. Verdú, Joint source-channel coding. IEEE Signal
Process. Mag. 27(6), 104–113 (2010)
128. S.H. Friedberg, A.J. Insel, L.E. Spence, Linear Algebra, 4th edn. (Prentice Hall, 2002)
129. T.E. Fuja, C. Heegard, Focused codes for channels with skewed errors. IEEE Trans. Inf.
Theory 36(9), 773–783 (1990)
130. A. Fuldseth, T.A. Ramstad, Bandwidth compression for continuous amplitude channels based
on vector approximation to a continuous subset of the source signal space, in Proceedings IEEE
International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, Apr
1997, pp. 3093–3096
131. S. Gadkari, K. Rose, Robust vector quantizer design by noisy channel relaxation. IEEE Trans.
Commun. 47(8), 1113–1116 (1999)
132. S. Gadkari, K. Rose, Unequally protected multistage vector quantization for time-varying
CDMA channels. IEEE Trans. Commun. 49(6), 1045–1054 (2001)
133. R.G. Gallager, Low-density parity-check codes. IRE Trans. Inf. Theory 28(1), 8–21 (1962)
134. R.G. Gallager, Low-Density Parity-Check Codes (MIT Press, 1963)
135. R.G. Gallager, Information Theory and Reliable Communication (Wiley, New York, 1968)
136. R.G. Gallager, Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24(6), 668–674
(1978)
137. R.G. Gallager, Discrete Stochastic Processes (Kluwer Academic, Boston, 1996)
138. Y. Gao, E. Tuncel, New hybrid digital/analog schemes for transmission of a Gaussian source
over a Gaussian channel. IEEE Trans. Inf. Theory 56(12), 6014–6019 (2010)
139. J. Garcia-Frias, J.D. Villasenor, Joint Turbo decoding and estimation of hidden Markov
sources. IEEE J. Sel. Areas Commun. 19, 1671–1679 (2001)
140. M. Gastpar, B. Rimoldi, M. Vetterli, To code, or not to code: lossy source-channel communi-
cation revisited. IEEE Trans. Inf. Theory 49, 1147–1158 (2003)
141. M. Gastpar, Uncoded transmission is exactly optimal for a simple Gaussian sensor network.
IEEE Trans. Inf. Theory 54(11), 5247–5251 (2008)
142. A. Gersho, R.M. Gray, Vector Quantization and Signal Compression (Kluwer Academic
Press/Springer, 1992)
143. J.D. Gibson, T.R. Fisher, Alphabet-constrained data compression. IEEE Trans. Inf. Theory
28, 443–457 (1982)
144. M. Gil, F. Alajaji, T. Linder, Rényi divergence measures for commonly used univariate con-
tinuous distributions. Inf. Sci. 249, 124–131 (2013)
145. E.N. Gilbert, Capacity of a burst-noise channel. Bell Syst. Tech. J. 39, 1253–1266 (1960)
146. J. Gleick, The Information: A History, a Theory and a Flood (Pantheon Books, New York,
2011)
147. T. Goblick Jr., Theoretical limitations on the transmission of data from analog sources. IEEE
Trans. Inf. Theory 11(4), 558–567 (1965)
148. N. Goela, E. Abbe, M. Gastpar, Polar codes for broadcast channels. IEEE Trans. Inf. Theory
61(2), 758–782 (2015)
149. N. Görtz, On the iterative approximation of optimal joint source-channel decoding. IEEE J.
Sel. Areas Commun. 19(9), 1662–1670 (2001)
150. N. Görtz, Joint Source-Channel Coding of Discrete-Time Signals with Continuous Amplitudes
(Imperial College Press, London, UK, 2007)
151. A. Goldsmith, Wireless Communications (Cambridge University Press, UK, 2005)
152. A. Goldsmith, M. Effros, Joint design of fixed-rate source codes and multiresolution channel
codes. IEEE Trans. Commun. 46(10), 1301–1312 (1998)
153. A. Goldsmith, S.A. Jafar, N. Jindal, S. Vishwanath, Capacity limits of MIMO channels. IEEE
J. Sel. Areas Commun. 21(5), 684–702 (2003)
154. L. Golshani, E. Pasha, Rényi entropy rate for Gaussian processes. Inf. Sci. 180(8), 1486–1491
(2010)
155. M. Grangetto, P. Cosman, G. Olmo, Joint source/channel coding and MAP decoding of arith-
metic codes. IEEE Trans. Commun. 53(6), 1007–1016 (2005)
156. R.M. Gray, Information rates for autoregressive processes. IEEE Trans. Inf. Theory 16(4),
412–421 (1970)
157. R.M. Gray, Entropy and Information Theory (Springer, New York, 1990)
158. R.M. Gray, Source Coding Theory (Kluwer Academic Press/Springer, 1990)
159. R.M. Gray, Probability, Random Processes, and Ergodic Properties (Springer, Berlin, 1988),
last revised 2010
160. R.M. Gray, L.D. Davisson, Random Processes: A Mathematical Approach for Engineers
(Prentice-Hall, 1986)
161. R.M. Gray, D.S. Ornstein, Sliding-block joint source/noisy-channel coding theorems. IEEE
Trans. Inf. Theory 22, 682–690 (1976)
162. G.R. Grimmett, D.R. Stirzaker, Probability and Random Processes, 3rd edn. (Oxford Univer-
sity Press, New York, 2001)
163. S.F. Gull, J. Skilling, Maximum entropy method in image processing. IEE Proc. F (Commun.
Radar Signal Process.) 131(6), 646–659 (1984)
164. D. Gunduz, E. Erkip, Joint source-channel codes for MIMO block-fading channels. IEEE
Trans. Inf. Theory 54(1), 116–134 (2008)
165. D. Guo, S. Shamai, S. Verdú, Mutual information and minimum mean-square error in Gaussian
channels. IEEE Trans. Inf. Theory 51(4), 1261–1283 (2005)
166. A. Guyader, E. Fabre, C. Guillemot, M. Robert, Joint source-channel Turbo decoding of
entropy-coded sources. IEEE J. Sel. Areas Commun. 19(9), 1680–1696 (2001)
167. R. Hagen, P. Hedelin, Robust vector quantization by a linear mapping of a block code. IEEE
Trans. Inf. Theory 45(1), 200–218 (1999)
168. J. Hagenauer, R. Bauer, The Turbo principle in joint source channel decoding of variable
length codes, in Proceedings IEEE Information Theory Workshop, Sept 2001, pp. 33–35
169. J. Hagenauer, Source-controlled channel decoding. IEEE Trans. Commun. 43, 2449–2457
(1995)
170. B. Hajek, Random Processes for Engineers (Cambridge University Press, Cambridge, 2015)
171. R. Hamzaoui, V. Stankovic, Z. Xiong, Optimized error protection of scalable image bit
streams. IEEE Signal Process. Mag. 22(6), 91–107 (2005)
172. T.S. Han, Information-Spectrum Methods in Information Theory (Springer, Berlin, 2003)
173. T.S. Han, Multicasting multiple correlated sources to multiple sinks over a noisy channel
network. IEEE Trans. Inf. Theory 57(1), 4–13 (2011)
174. T.S. Han, M.H.M. Costa, Broadcast channels with arbitrarily correlated sources. IEEE Trans.
Inf. Theory 33(5), 641–650 (1987)
175. T.S. Han, S. Verdú, Approximation theory of output statistics. IEEE Trans. Inf. Theory 39(3),
752–772 (1993)
176. G.H. Hardy, J.E. Littlewood, G. Polya, Inequalities (Cambridge University Press, Cambridge,
1934)
177. E.A. Haroutunian, Estimates of the error probability exponent for a semi-continuous memo-
ryless channel (in Russian). Probl. Inf. Transm. 4(4), 37–48 (1968)
178. E.A. Haroutunian, M.E. Haroutunian, A.N. Harutyunyan, Reliability criteria in information
theory and in statistical hypothesis testing. Found. Trends Commun. Inf. Theory 4(2–3),
97–263 (2008)
179. J. Harte, T. Zillio, E. Conlisk, A.B. Smith, Maximum entropy and the state-variable approach
to macroecology. Ecology 89(10), 2700–2711 (2008)
180. R.V.L. Hartley, Transmission of information. Bell Syst. Tech. J. 7, 535 (1928)
181. B. Hayes, C. Wilson, A maximum entropy model of phonotactics and phonotactic learning.
Linguist. Inq. 39(3), 379–440 (2008)
182. M. Hayhoe, F. Alajaji, B. Gharesifard, A Polya urn-based model for epidemics on networks,
in Proceedings of American Control Conference, Seattle, May 2017, pp. 358–363
183. A. Hedayat, A. Nosratinia, Performance analysis and design criteria for finite-alphabet
source/channel codes. IEEE Trans. Commun. 52(11), 1872–1879 (2004)
184. S. Heinen, P. Vary, Source-optimized channel coding for digital transmission channels. IEEE
Trans. Commun. 53(4), 592–600 (2005)
185. F. Hekland, P.A. Floor, T.A. Ramstad, Shannon-Kotelnikov mappings in joint source-channel
coding. IEEE Trans. Commun. 57(1), 94–105 (2009)
186. M. Hellman, J. Raviv, Probability of error, equivocation and the Chernoff bound. IEEE Trans.
Inf. Theory 16(4), 368–372 (1970)
187. M.E. Hellman, Convolutional source encoding. IEEE Trans. Inf. Theory 21, 651–656 (1975)
188. M. Hirvensalo, Quantum Computing (Springer, Berlin, 2013)
189. B. Hochwald, K. Zeger, Tradeoff between source and channel coding. IEEE Trans. Inf. Theory
43, 1412–1424 (1997)
190. T. Holliday, A. Goldsmith, H.V. Poor, Joint source and channel coding for MIMO systems:
is it better to be robust or quick? IEEE Trans. Inf. Theory 54(4), 1393–1405 (2008)
191. G.D. Hu, On Shannon theorem and its converse for sequence of communication schemes in
the case of abstract random variables, Transactions of 3rd Prague Conference on Informa-
tion Theory, Statistical Decision Functions, Random Processes (Czechoslovak Academy of
Sciences, Prague, 1964), pp. 285–333
192. T.C. Hu, D.J. Kleitman, J.K. Tamaki, Binary trees optimum under various criteria. SIAM J.
Appl. Math. 37(2), 246–256 (1979)
193. Y. Hu, J. Garcia-Frias, M. Lamarca, Analog joint source-channel coding using non-linear
curves and MMSE decoding. IEEE Trans. Commun. 59(11), 3016–3026 (2011)
194. J. Huang, S. Meyn, M. Medard, Error exponents for channel coding with application to signal
constellation design. IEEE J. Sel. Areas Commun. 24(8), 1647–1661 (2006)
195. D.A. Huffman, A method for the construction of minimum redundancy codes. Proc. IRE 40,
1098–1101 (1952)
196. S. Ihara, Information Theory for Continuous Systems (World-Scientific, Singapore, 1993)
197. I. Issa, S. Kamath, A.B. Wagner, An operational measure of information leakage, in Proceed-
ings of the Conference on Information Sciences and Systems, Princeton University, Mar 2016,
pp. 234–239
198. H. Jafarkhani, N. Farvardin, Design of channel-optimized vector quantizers in the presence
of channel mismatch. IEEE Trans. Commun. 48(1), 118–124 (2000)
199. K. Jacobs, Almost periodic channels, Colloquium on Combinatorial Methods in Probability
Theory, Aarhus, 1962, pp. 118–126
200. X. Jaspar, C. Guillemot, L. Vandendorpe, Joint source-channel turbo techniques for discrete-
valued sources: from theory to practice. Proc. IEEE 95, 1345–1361 (2007)
201. E.T. Jaynes, Information theory and statistical mechanics. Phys. Rev. 106(4), 620–630 (1957)
202. E.T. Jaynes, Information theory and statistical mechanics II. Phys. Rev. 108(2), 171–190
(1957)
203. E.T. Jaynes, On the rationale of maximum-entropy methods. Proc. IEEE 70(9), 939–952
(1982)
204. M. Jeanne, J.-C. Carlach, P. Siohan, Joint source-channel decoding of variable-length codes
for convolutional codes and Turbo codes. IEEE Trans. Commun. 53(1), 10–15 (2005)
205. F. Jelinek, Probabilistic Information Theory (McGraw Hill, 1968)
206. F. Jelinek, Buffer overflow in variable length coding of fixed rate sources. IEEE Trans. Inf.
Theory 14, 490–501 (1968)
207. V.D. Jerohin, ε-entropy of discrete random objects. Teor. Veroyatnost. i Primenen 3, 103–107
(1958)
208. R. Johannesson, K. Zigangirov, Fundamentals of Convolutional Coding (IEEE, 1999)
209. O. Johnson, Information Theory and the Central Limit Theorem (Imperial College Press,
London, 2004)
210. N.L. Johnson, S. Kotz, Urn Models and Their Application: An Approach to Modern Discrete
Probability Theory (Wiley, New York, 1977)
211. L.N. Kanal, A.R.K. Sastry, Models for channels with memory and their applications to error
control. Proc. IEEE 66(7), 724–744 (1978)
212. W. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints,
M.Sc. Dissertation, Department of Mathematics, University of Chicago, Chicago, Illinois,
1939
213. A. Khisti, G. Wornell, Secure transmission with multiple antennas I: The MISOME wiretap
channel. IEEE Trans. Inf. Theory 56(7), 3088–3104 (2010)
214. A. Khisti, G. Wornell, Secure transmission with multiple antennas II: the MIMOME wiretap
channel. IEEE Trans. Inf. Theory 56(11), 5515–5532 (2010)
215. Y.H. Kim, A coding theorem for a class of stationary channels with feedback. IEEE Trans.
Inf. Theory 54(4), 1488–1499 (2008)
216. Y.H. Kim, A. Sutivong, T.M. Cover, State amplification. IEEE Trans. Inf. Theory 54(5),
1850–1859 (2008)
217. J. Kliewer, R. Thobaben, Iterative joint source-channel decoding of variable-length codes
using residual source redundancy. IEEE Trans. Wireless Commun. 4(3), 919–929 (2005)
218. P. Knagenhjelm, E. Agrell, The Hadamard transform—a tool for index assignment. IEEE
Trans. Inf. Theory 42(4), 1139–1151 (1996)
219. Y. Kochman, R. Zamir, Analog matching of colored sources to colored channels. IEEE Trans.
Inf. Theory 57(6), 3180–3195 (2011)
220. Y. Kochman, G. Wornell, On uncoded transmission and blocklength, in Proceedings IEEE
Information Theory Workshop, Sept 2012, pp. 15–19
221. E. Koken, E. Tuncel, On robustness of hybrid digital/analog source-channel coding with
bandwidth mismatch. IEEE Trans. Inf. Theory 61(9), 4968–4983 (2015)
222. A.N. Kolmogorov, On the Shannon theory of information transmission in the case of contin-
uous signals. IEEE Trans. Inf. Theory 2(4), 102–108 (1956)
223. A.N. Kolmogorov, A new metric invariant of transient dynamical systems and automorphisms
in Lebesgue spaces. Dokl. Akad. Nauk SSSR 119, 861–864 (1958)
224. A.N. Kolmogorov, S.V. Fomin, Introductory Real Analysis (Dover Publications, New York,
1970)
225. L.H. Koopmans, Asymptotic rate of discrimination for Markov processes. Ann. Math. Stat.
31, 982–994 (1960)
226. S.B. Korada, Polar Codes for Channel and Source Coding, Ph.D. Dissertation, EPFL, Lau-
sanne, Switzerland, 2009
227. S.B. Korada, R.L. Urbanke, Polar codes are optimal for lossy source coding. IEEE Trans. Inf.
Theory 56(4), 1751–1768 (2010)
228. S.B. Korada, E. Şaşoğlu, R. Urbanke, Polar codes: characterization of exponent, bounds, and
constructions. IEEE Trans. Inf. Theory 56(12), 6253–6264 (2010)
229. I. Korn, J.P. Fonseka, S. Xing, Optimal binary communication with nonequal probabilities.
IEEE Trans. Commun. 51(9), 1435–1438 (2003)
230. V.N. Koshelev, Direct sequential encoding and decoding for discrete sources. IEEE Trans.
Inf. Theory 19, 340–343 (1973)
231. V. Kostina, S. Verdú, Lossy joint source-channel coding in the finite blocklength regime. IEEE
Trans. Inf. Theory 59(5), 2545–2575 (2013)
232. V.A. Kotelnikov, The Theory of Optimum Noise Immunity (McGraw-Hill, New York, 1959)
233. G. Kramer, Directed Information for Channels with Feedback, Ph.D. Dissertation, ser. ETH
Series in Information Processing. Konstanz, Switzerland: Hartung-Gorre Verlag, vol. 11
(1998)
234. J. Kroll, N. Phamdo, Analysis and design of trellis codes optimized for a binary symmetric
Markov source with MAP detection. IEEE Trans. Inf. Theory 44(7), 2977–2987 (1998)
235. H.W. Kuhn, A.W. Tucker, Nonlinear programming, in Proceedings of 2nd Berkeley Sympo-
sium, Berkeley, University of California Press, 1951, pp. 481–492
236. S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22(1), 79–86
(1951)
237. S. Kullback, Information Theory and Statistics (Wiley, New York, 1959)
238. H. Kumazawa, M. Kasahara, T. Namekawa, A construction of vector quantizers for noisy
channels. Electron. Eng. Jpn. 67–B(4), 39–47 (1984)
239. A. Kurtenbach, P. Wintz, Quantizing for noisy channels. IEEE Trans. Commun. Technol. 17,
291–302 (1969)
240. F. Lahouti, A.K. Khandani, Efficient source decoding over memoryless noisy channels using
higher order Markov models. IEEE Trans. Inf. Theory 50(9), 2103–2118 (2004)
241. J.N. Laneman, E. Martinian, G. Wornell, J.G. Apostolopoulos, Source-channel diversity for
parallel channels. IEEE Trans. Inf. Theory 51(10), 3518–3539 (2005)
242. G.G. Langdon, An introduction to arithmetic coding. IBM J. Res. Dev. 28, 135–149 (1984)
243. G.G. Langdon, J. Rissanen, A simple general binary source code. IEEE Trans. Inf. Theory
28(5), 800–803 (1982)
244. K.H. Lee, D. Petersen, Optimal linear coding for vector channels. IEEE Trans. Commun.
24(12), 1283–1290 (1976)
245. J.M. Lervik, A. Grovlen, T.A. Ramstad, Robust digital signal compression and modulation
exploiting the advantages of analog communications, in Proceedings of IEEE GLOBECOM,
Nov 1995, pp. 1044–1048
246. F. Liese, I. Vajda, Convex Statistical Distances (Teubner, 1987)
247. J. Lim, D.L. Neuhoff, Joint and tandem source-channel coding with complexity and delay
constraints. IEEE Trans. Commun. 51(5), 757–766 (2003)
248. S. Lin, D.J. Costello, Error Control Coding: Fundamentals and Applications, 2nd edn. (Pren-
tice Hall, Upper Saddle River, NJ, 2004)
249. T. Linder, R. Zamir, On the asymptotic tightness of the Shannon lower bound. IEEE Trans.
Inf. Theory 40(6), 2026–2031 (1994)
250. A. Lozano, A.M. Tulino, S. Verdú, Optimum power allocation for parallel Gaussian channels
with arbitrary input distributions. IEEE Trans. Inf. Theory 52(7), 3033–3051 (2006)
251. D.J.C. MacKay, R.M. Neal, Near Shannon limit performance of low density parity check
codes. Electron. Lett. 33(6) (1997)
252. D.J.C. MacKay, Good error correcting codes based on very sparse matrices. IEEE Trans. Inf.
Theory 45(2), 399–431 (1999)
253. D.J.C. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge Uni-
versity Press, Cambridge, 2003)
254. F.J. MacWilliams, N.J.A. Sloane, The Theory of Error Correcting Codes (North-Holland Pub.
Co., 1978)
255. U. Madhow, Fundamentals of Digital Communication (Cambridge University Press, Cam-
bridge, 2008)
256. H. Mahdavifar, A. Vardy, Achieving the secrecy capacity of wiretap channels using polar
codes. IEEE Trans. Inf. Theory 57(10), 6428–6443 (2011)
257. H.M. Mahmoud, Polya Urn Models (Chapman and Hall/CRC, 2008)
258. S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation.
IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
259. C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT
Press, Cambridge, MA, 1999)
260. W. Mao, Modern Cryptography: Theory and Practice (Prentice Hall Professional Technical
Reference, 2003)
261. H. Marko, The bidirectional communication theory—a generalization of information theory.
IEEE Trans. Commun. Theory 21(12), 1335–1351 (1973)
262. J.E. Marsden, M.J. Hoffman, Elementary Classical Analysis (W.H. Freeman & Company,
1993)
263. J.L. Massey, Joint source and channel coding, in Communications and Random Process
Theory, ed. by J.K. Skwirzynski (Sijthoff and Nordhoff, The Netherlands, 1978), pp. 279–293
264. J.L. Massey, Cryptography—a selective survey, in Digital Communications, ed. by E. Biglieri,
G. Prati (Elsevier, 1986), pp. 3–21
265. J. Massey, Causality, feedback, and directed information, in Proceedings of International
Symposium on Information Theory and Applications, 1990, pp. 303–305
266. R.J. McEliece, The Theory of Information and Coding, 2nd edn. (Cambridge University Press,
Cambridge, 2002)
267. B. McMillan, The basic theorems of information theory. Ann. Math. Stat. 24, 196–219 (1953)
268. A. Méhes, K. Zeger, Performance of quantizers on noisy channels using structured families
of codes. IEEE Trans. Inf. Theory 46(7), 2468–2476 (2000)
269. N. Merhav, Shannon’s secrecy system with informed receivers and its application to systematic
coding for wiretapped channels. IEEE Trans. Inf. Theory 54(6), 2723–2734 (2008)
270. N. Merhav, E. Arikan, The Shannon cipher system with a guessing wiretapper. IEEE Trans.
Inf. Theory 45(6), 1860–1866 (1999)
271. N. Merhav, S. Shamai, On joint source-channel coding for the Wyner-Ziv source and the
Gel’fand-Pinsker channel. IEEE Trans. Inf. Theory 49(11), 2844–2855 (2003)
272. D. Miller, K. Rose, Combined source-channel vector quantization using deterministic anneal-
ing. IEEE Trans. Commun. 42, 347–356 (1994)
273. U. Mittal, N. Phamdo, Duality theorems for joint source-channel coding. IEEE Trans. Inf.
Theory 46(4), 1263–1275 (2000)
274. U. Mittal, N. Phamdo, Hybrid digital-analog (HDA) joint source-channel codes for broad-
casting and robust communications. IEEE Trans. Inf. Theory 48(5), 1082–1102 (2002)
275. J.W. Modestino, D.G. Daut, Combined source-channel coding of images. IEEE Trans. Com-
mun. 27, 1644–1659 (1979)
276. B. Moore, G. Takahara, F. Alajaji, Pairwise optimization of modulation constellations for
non-uniform sources. IEEE Can. J. Electr. Comput. Eng. 34(4), 167–177 (2009)
277. M. Mushkin, I. Bar-David, Capacity and coding for the Gilbert-Elliott channel. IEEE Trans.
Inf. Theory 35(6), 1277–1290 (1989)
278. T. Nakano, A.M. Eckford, T. Haraguchi, Molecular Communication (Cambridge University
Press, Cambridge, 2013)
279. T. Nemetz, On the α-divergence rate for Markov-dependent hypotheses. Probl. Control Inf.
Theory 3(2), 147–155 (1974)
280. T. Nemetz, Information Type Measures and Their Applications to Finite Decision-Problems,
Carleton Mathematical Lecture Notes, no. 17, May 1977
281. J. Neyman, E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses.
Philos. Trans. R. Soc. Lond. A 231, 289–337 (1933)
282. H. Nguyen, P. Duhamel, Iterative joint source-channel decoding of VLC exploiting source
semantics over realistic radio-mobile channels. IEEE Trans. Commun. 57(6), 1701–1711
(2009)
283. A. Nosratinia, J. Lu, B. Aazhang, Source-channel rate allocation for progressive transmission
of images. IEEE Trans. Commun. 51(2), 186–196 (2003)
284. J.M. Ooi, Coding for Channels with Feedback (Springer, Berlin, 1998)
285. E. Ordentlich, T. Weissman, On the optimality of symbol-by-symbol filtering and denoising.
IEEE Trans. Inf. Theory 52(1), 19–40 (2006)
286. X. Pan, A. Banihashemi, A. Cuhadar, Progressive transmission of images over fading channels
using rate-compatible LDPC codes. IEEE Trans. Image Process. 15(12), 3627–3635 (2006)
287. L. Paninski, Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253
(2003)
288. M. Park, D. Miller, Joint source-channel decoding for variable-length encoded data by exact
and approximate MAP source estimation. IEEE Trans. Commun. 48(1), 1–6 (2000)
289. R. Pemantle, A survey of random processes with reinforcement. Probab. Surv. 4, 1–79 (2007)
290. W.B. Pennebaker, J.L. Mitchell, JPEG: Still Image Data Compression Standard (Kluwer
Academic Press/Springer, 1992)
291. H. Permuter, T. Weissman, A.J. Goldsmith, Finite state channels with time-invariant deter-
ministic feedback. IEEE Trans. Inform. Theory 55, 644–662 (2009)
292. H. Permuter, H. Asnani, T. Weissman, Capacity of a POST channel with and without feedback.
IEEE Trans. Inf. Theory 60(10), 6041–6057 (2014)
293. N. Phamdo, N. Farvardin, T. Moriya, A unified approach to tree-structured and multistage
vector quantization for noisy channels. IEEE Trans. Inf. Theory 39(3), 835–850 (1993)
294. N. Phamdo, N. Farvardin, Optimal detection of discrete Markov sources over discrete mem-
oryless channels—applications to combined source-channel coding. IEEE Trans. Inf. Theory
40(1), 186–193 (1994)
295. N. Phamdo, F. Alajaji, N. Farvardin, Quantization of memoryless and Gauss-Markov sources
over binary Markov channels. IEEE Trans. Commun. 45(6), 668–675 (1997)
296. N. Phamdo, F. Alajaji, Soft-decision demodulation design for COVQ over white, colored, and
ISI Gaussian channels. IEEE Trans. Commun. 46(9), 1499–1506 (2000)
297. J.R. Pierce, An Introduction to Information Theory: Symbols, Signals and Noise, 2nd edn.
(Dover Publications Inc., New York, 1980)
298. C. Pimentel, I.F. Blake, Modeling burst channels using partitioned Fritchman’s Markov mod-
els. IEEE Trans. Veh. Technol. 47(3), 885–899 (1998)
299. C. Pimentel, T.H. Falk, L. Lisbôa, Finite-state Markov modeling of correlated Rician-fading
channels. IEEE Trans. Veh. Technol. 53(5), 1491–1501 (2004)
300. C. Pimentel, F. Alajaji, Packet-based modeling of Reed-Solomon block coded correlated
fading channels via a Markov finite queue model. IEEE Trans. Veh. Technol. 58(7), 3124–
3136 (2009)
301. C. Pimentel, F. Alajaji, P. Melo, A discrete queue-based model for capturing memory and soft-
decision information in correlated fading channels. IEEE Trans. Commun. 60(5), 1702–1711
(2012)
302. J.T. Pinkston, An application of rate-distortion theory to a converse to the coding theorem.
IEEE Trans. Inf. Theory 15(1), 66–71 (1969)
303. M.S. Pinsker, Information and Information Stability of Random Variables and Processes
(Holden-Day, San Francisco, 1964)
304. G. Polya, F. Eggenberger, Über die Statistik Verketteter Vorgänge. Z. Angew. Math. Mech. 3,
279–289 (1923)
305. G. Polya, F. Eggenberger, Sur l’Interpretation de Certaines Courbes de Fréquences. Comptes
Rendus C.R. 187, 870–872 (1928)
306. G. Polya, Sur Quelques Points de la Théorie des Probabilités. Ann. Inst. H. Poincaré 1,
117–161 (1931)
307. J.G. Proakis, Digital Communications (McGraw Hill, 1983)
308. L. Pronzato, H.P. Wynn, A.A. Zhigljavsky, Using Rényi entropies to measure uncertainty in
search problems. Lect. Appl. Math. 33, 253–268 (1997)
309. Z. Rached, Information Measures for Sources with Memory and their Application to Hypoth-
esis Testing and Source Coding, Doctoral dissertation, Queen’s University, 2002
310. Z. Rached, F. Alajaji, L.L. Campbell, Rényi’s entropy rate for discrete Markov sources, in
Proceedings on Conference of Information Sciences and Systems, Baltimore, Mar 1999
311. Z. Rached, F. Alajaji, L.L. Campbell, Rényi’s divergence and entropy rates for finite alphabet
Markov sources. IEEE Trans. Inf. Theory 47(4), 1553–1561 (2001)
312. Z. Rached, F. Alajaji, L.L. Campbell, The Kullback-Leibler divergence rate between Markov
sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)
313. M. Raginsky, I. Sason, Concentration of measure inequalities in information theory, commu-
nications, and coding, Found. Trends Commun. Inf. Theory. 10(1-2), 1–246, Now Publishers,
Oct 2013
314. T.A. Ramstad, Shannon mappings for robust communication. Telektronikk 98(1), 114–128
(2002)
315. R.C. Reininger, J.D. Gibson, Distributions of the two-dimensional DCT coefficients for
images. IEEE Trans. Commun. 31(6), 835–839 (1983)
316. A. Rényi, On the dimension and entropy of probability distributions. Acta Math. Acad. Sci.
Hung. 10, 193–215 (1959)
317. A. Rényi, On measures of entropy and information, in Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics Probability, vol. 1 (University of California Press,
Berkeley, 1961), pp. 547–561
318. A. Rényi, On the foundations of information theory. Rev. Inst. Int. Stat. 33, 1–14 (1965)
319. M. Rezaeian, A. Grant, Computation of total capacity for discrete memoryless multiple-access
channels. IEEE Trans. Inf. Theory 50(11), 2779–2784 (2004)
320. Z. Reznic, M. Feder, R. Zamir, Distortion bounds for broadcasting with bandwidth expansion.
IEEE Trans. Inf. Theory 52(8), 3778–3788 (2006)
321. T.J. Richardson, R.L. Urbanke, Modern Coding Theory (Cambridge University Press, Cam-
bridge, 2008)
322. J. Rissanen, Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20, 198–203
(1976)
323. H.L. Royden, Real Analysis, 3rd edn. (Macmillan Publishing Company, New York, 1988)
324. M. Rüngeler, J. Bunte, P. Vary, Design and evaluation of hybrid digital-analog transmission
outperforming purely digital concepts. IEEE Trans. Commun. 62(11), 3983–3996 (2014)
325. P. Sadeghi, R.A. Kennedy, P.B. Rapajic, R. Shams, Finite-state Markov modeling of fading
channels. IEEE Signal Process. Mag. 25(5), 57–80 (2008)
326. D. Salomon, Data Compression: The Complete Reference, 3rd edn. (Springer, Berlin, 2004)
327. L. Sankar, S.R. Rajagopalan, H.V. Poor, Utility-privacy tradeoffs in databases: an information-
theoretic approach. IEEE Trans. Inf. Forensic Secur. 8(6), 838–852 (2013)
328. L. Sankar, S.R. Rajagopalan, S. Mohajer, H.V. Poor, Smart meter privacy: a theoretical frame-
work. IEEE Trans. Smart Grid 4(2), 837–846 (2013)
329. E. Şaşoğlu, Polarization and polar codes. Found. Trends Commun. Inf. Theory 8(4), 259–381
(2011)
330. K. Sayood, Introduction to Data Compression, 4th edn. (Morgan Kaufmann, 2012)
331. K. Sayood, J.C. Borkenhagen, Use of residual redundancy in the design of joint source/channel
coders. IEEE Trans. Commun. 39, 838–846 (1991)
332. L. Schmalen, M. Adrat, T. Clevorn, P. Vary, EXIT chart based system design for iterative
source-channel decoding with fixed-length codes. IEEE Trans. Commun. 59(9), 2406–2413
(2011)
333. N. Sen, F. Alajaji, S. Yüksel, Feedback capacity of a class of symmetric finite-state Markov
channels. IEEE Trans. Inf. Theory 57, 4110–4122 (2011)
334. S. Shahidi, F. Alajaji, T. Linder, MAP detection and robust lossy coding over soft-decision
correlated fading channels. IEEE Trans. Veh. Technol. 62(7), 3175–3187 (2013)
335. S. Shamai, S. Verdú, R. Zamir, Systematic lossy source/channel coding. IEEE Trans. Inf.
Theory 44, 564–579 (1998)
336. G.I. Shamir, K. Xie, Universal source controlled channel decoding with nonsystematic quick-
look-in Turbo codes. IEEE Trans. Commun. 57(4), 960–971 (2009)
337. C.E. Shannon, A symbolic analysis of relay and switching circuits. Trans. Am. Inst. Electr.
Eng. 57(12), 713–723 (1938)
338. C.E. Shannon, A Symbolic Analysis of Relay and Switching Circuits, M.Sc. Thesis, Department
of Electrical Engineering, MIT, 1940
339. C.E. Shannon, An Algebra for Theoretical Genetics, Ph.D. Dissertation, Department of Math-
ematics, MIT, 1940
340. C.E. Shannon, A mathematical theory of communications. Bell Syst. Tech. J. 27, 379–423
and 623–656 (1948)
341. C.E. Shannon, Communication in the presence of noise. Proc. IRE 37, 10–21 (1949)
342. C.E. Shannon, Communication theory of secrecy systems. Bell Syst. Tech. J. 28, 656–715
(1949)
343. C.E. Shannon, The zero-error capacity of a noisy channel. IRE Trans. Inf. Theory 2, 8–19
(1956)
344. C.E. Shannon, Certain results in coding theory for noisy channels. Inf. Control 1(1), 6–25
(1957)
345. C.E. Shannon, Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv.
Rec. 4, 142–163 (1959)
346. C.E. Shannon, W.W. Weaver, The Mathematical Theory of Communication (University of
Illinois Press, Urbana, IL, 1949)
347. C.E. Shannon, R.G. Gallager, E.R. Berlekamp, Lower bounds to error probability for coding
in discrete memoryless channels I. Inf. Control 10(1), 65–103 (1967)
348. C.E. Shannon, R.G. Gallager, E.R. Berlekamp, Lower bounds to error probability for coding
in discrete memoryless channels II. Inf. Control 10(2), 523–552 (1967)
349. P.C. Shields, The Ergodic Theory of Discrete Sample Paths (American Mathematical Society,
1991)
350. P.C. Shields, Two divergence-rate counterexamples. J. Theor. Probab. 6, 521–545 (1993)
351. Y. Shkel, V.Y.F. Tan, S. Draper, Unequal message protection: asymptotic and non-asymptotic
tradeoffs. IEEE Trans. Inf. Theory 61(10), 5396–5416 (2015)
352. R. Sibson, Information radius. Z. Wahrscheinlichkeitstheorie Verw. Geb. 14, 149–161 (1969)
353. C.A. Sims, Rational inattention and monetary economics, in Handbook of Monetary Eco-
nomics, vol. 3 (2010), pp. 155–181
354. M. Skoglund, Soft decoding for vector quantization over noisy channels with memory. IEEE
Trans. Inf. Theory 45(4), 1293–1307 (1999)
355. M. Skoglund, On channel-constrained vector quantization and index assignment for discrete
memoryless channels. IEEE Trans. Inf. Theory 45(7), 2615–2622 (1999)
356. M. Skoglund, P. Hedelin, Hadamard-based soft decoding for vector quantization over noisy
channels. IEEE Trans. Inf. Theory 45(2), 515–532 (1999)
357. M. Skoglund, N. Phamdo, F. Alajaji, Design and performance of VQ-based hybrid digital-
analog joint source-channel codes. IEEE Trans. Inf. Theory 48(3), 708–720 (2002)
358. M. Skoglund, N. Phamdo, F. Alajaji, Hybrid digital-analog source-channel coding for band-
width compression/expansion. IEEE Trans. Inf. Theory 52(8), 3757–3763 (2006)
359. N.J.A. Sloane, A.D. Wyner (eds.), Claude Elwood Shannon: Collected Papers (IEEE Press,
New York, 1993)
360. K. Song, Rényi information, loglikelihood and an intrinsic distribution measure. J. Stat. Plan.
Inference 93(1–2), 51–69 (2001)
361. L. Song, F. Alajaji, T. Linder, On the capacity of burst noise-erasure channels with and without
feedback, in Proceedings of IEEE International Symposium on Information Theory, Aachen,
Germany, June 2017, pp. 206–210
362. J. Soni, R. Goodman, A Mind at Play: How Claude Shannon Invented the Information Age
(Simon & Schuster, 2017)
363. J.F. Sowa, Conceptual Structures: Information Processing in Mind and Machine (Addison-
Wesley Pub, MA, 1983)
364. Y. Steinberg, S. Verdú, Simulation of random processes and rate-distortion theory. IEEE Trans.
Inf. Theory 42(1), 63–86 (1996)
365. Y. Steinberg, N. Merhav, On hierarchical joint source-channel coding with degraded side
information. IEEE Trans. Inf. Theory 52(3), 886–903 (2006)
366. K.P. Subbalakshmi, J. Vaisey, On the joint source-channel decoding of variable-length encoded
sources: the additive-Markov case. IEEE Trans. Commun. 51(9), 1420–1425 (2003)
367. M. Taherzadeh, A.K. Khandani, Single-sample robust joint source-channel coding: achieving
asymptotically optimum scaling of SDR versus SNR. IEEE Trans. Inf. Theory 58(3), 1565–
1577 (2012)
368. G. Takahara, F. Alajaji, N.C. Beaulieu, H. Kuai, Constellation mappings for two-dimensional
signaling of nonuniform sources. IEEE Trans. Commun. 51(3), 400–408 (2003)
369. Y. Takashima, M. Wada, H. Murakami, Reversible variable length codes. IEEE Trans. Com-
mun. 43, 158–162 (1995)
370. C. Tan, N.C. Beaulieu, On first-order Markov modeling for the Rayleigh fading channel. IEEE
Trans. Commun. 48(12), 2032–2040 (2000)
371. I. Tal, A. Vardy, How to construct polar codes. IEEE Trans. Inf. Theory 59(10), 6562–6582
(2013)
372. I. Tal, A. Vardy, List decoding of polar codes. IEEE Trans. Inf. Theory 61(5), 2213–2226
(2015)
373. V.Y.F. Tan, S. Watanabe, M. Hayashi, Moderate deviations for joint source-channel coding of
systems with Markovian memory, in Proceedings IEEE Symposium on Information Theory,
Honolulu, HI, June 2014, pp. 1687–1691
374. A. Tang, D. Jackson, J. Hobbs, W. Chen, J.L. Smith, H. Patel, A. Prieto, D. Petrusca, M.I.
Grivich, A. Sher, P. Hottowy, W. Davrowski, A.M. Litke, J.M. Beggs, A maximum entropy
model applied to spatial and temporal correlations from cortical networks in vitro. J. Neurosci.
28(2), 505–518 (2008)
375. N. Tanabe, N. Farvardin, Subband image coding using entropy-coded quantization over noisy
channels. IEEE J. Sel. Areas Commun. 10(5), 926–943 (1992)
376. S. Tatikonda, Control Under Communication Constraints, Ph.D. Dissertation, MIT, 2000
377. S. Tatikonda, S. Mitter, Control under communication constraints. IEEE Trans. Autom. Con-
trol 49(7), 1056–1068 (2004)
378. S. Tatikonda, S. Mitter, The capacity of channels with feedback. IEEE Trans. Inf. Theory 55,
323–349 (2009)
379. H. Theil, Economics and Information Theory (North-Holland, Amsterdam, 1967)
380. I.E. Telatar, Capacity of multi-antenna Gaussian channels. Eur. Trans. Telecommun. 10(6),
585–596 (1999)
381. R. Thobaben, J. Kliewer, An efficient variable-length code construction for iterative source-
channel decoding. IEEE Trans. Commun. 57(7), 2005–2013 (2009)
382. C. Tian, S. Shamai, A unified coding scheme for hybrid transmission of Gaussian source over
Gaussian channel, in Proceedings International Symposium on Information Theory, Toronto,
Canada, July 2008, pp. 1548–1552
383. C. Tian, J. Chen, S.N. Diggavi, S. Shamai, Optimality and approximate optimality of source-
channel separation in networks. IEEE Trans. Inf. Theory 60(2), 904–918 (2014)
384. N. Tishby, F.C. Pereira, W. Bialek, The information bottleneck method, in Proceedings of
37th Annual Allerton Conference on Communication, Control, and Computing (1999), pp.
368–377
385. N. Tishby, N. Zaslavsky, Deep learning and the information bottleneck principle, in Proceed-
ings IEEE Information Theory Workshop, Apr 2015, pp. 1–5
386. S. Tridenski, R. Zamir, A. Ingber, The Ziv-Zakai-Rényi bound for joint source-channel coding.
IEEE Trans. Inf. Theory 61(8), 4293–4315 (2015)
387. D.N.C. Tse, P. Viswanath, Fundamentals of Wireless Communications (Cambridge University
Press, Cambridge, UK, 2005)
388. A. Tulino, S. Verdú, Monotonic decrease of the non-Gaussianness of the sum of independent
random variables: a simple proof. IEEE Trans. Inf. Theory 52(9), 4295–4297 (2006)
389. W. Turin, R. van Nobelen, Hidden Markov modeling of flat fading channels. IEEE J. Sel.
Areas Commun. 16, 1809–1817 (1998)
390. R.E. Ulanowicz, Information theory in ecology. Comput. Chem. 25(4), 393–399 (2001)
391. V. Vaishampayan, S.I.R. Costa, Curves on a sphere, shift-map dynamics, and error control
for continuous alphabet sources. IEEE Trans. Inf. Theory 49(7), 1658–1672 (2003)
392. V.A. Vaishampayan, N. Farvardin, Joint design of block source codes and modulation signal
sets. IEEE Trans. Inf. Theory 38, 1230–1248 (1992)
393. I. Vajda, Theory of Statistical Inference and Information (Kluwer, Dordrecht, 1989)
394. S. Vembu, S. Verdú, Y. Steinberg, The source-channel separation theorem revisited. IEEE
Trans. Inf. Theory 41, 44–54 (1995)
395. S. Verdú, α-mutual information, in Proceedings of Workshop Information Theory and Appli-
cations, San Diego, 2015
396. S. Verdú, T.S. Han, A general formula for channel capacity. IEEE Trans. Inf. Theory 40(4),
1147–1157 (1994)
397. S. Verdú, S. Shamai, Variable-rate channel capacity. IEEE Trans. Inf. Theory 56(6), 2651–
2667 (2010)
398. W.R. Wade, An Introduction to Analysis (Prentice Hall, Upper Saddle River, NJ, 1995)
399. D. Wang, A. Ingber, Y. Kochman, A strong converse for joint source-channel coding, in
Proceedings International Symposium on Information Theory, Cambridge, MA, 2012, pp.
2117–2121
400. S.-W. Wang, P.-N. Chen, C.-H. Wang, Optimal power allocation for (N, K)-limited access
channels. IEEE Trans. Inf. Theory 58(6), 3725–3750 (2012)
401. Y. Wang, F. Alajaji, T. Linder, Hybrid digital-analog coding with bandwidth compression for
Gaussian source-channel pairs. IEEE Trans. Commun. 57(4), 997–1012 (2009)
402. T. Wang, W. Zhang, R.G. Maunder, L. Hanzo, Near-capacity joint source and channel coding
of symbol values from an infinite source set using Elias Gamma error correction codes. IEEE
Trans. Commun. 62(1), 280–292 (2014)
403. W. Wang, L. Ying, J. Zhang, On the relation between identifiability, differential privacy, and
mutual-information privacy. IEEE Trans. Inf. Theory 62(9), 5018–5029 (2016)
404. T.A. Welch, A technique for high-performance data compression. Computer 17(6), 8–19
(1984)
405. N. Wernersson, M. Skoglund, T. Ramstad, Polynomial based analog source channel codes.
IEEE Trans. Commun. 57(9), 2600–2606 (2009)
406. T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, M.J. Weinberger, Universal discrete denois-
ing: known channel. IEEE Trans. Inf. Theory 51, 5–28 (2005)
407. S. Wicker, Error Control Systems for Digital Communication and Storage (Prentice Hall,
Upper Saddle River, NJ, 1995)
408. M.M. Wilde, Quantum Information Theory, 2nd edn. (Cambridge University Press, Cam-
bridge, 2017)
409. M.P. Wilson, K.R. Narayanan, G. Caire, Joint source-channel coding with side information
using hybrid digital analog codes. IEEE Trans. Inf. Theory 56(10), 4922–4940 (2010)
410. T.-Y. Wu, P.-N. Chen, F. Alajaji, Y.S. Han, On the design of variable-length error-correcting
codes. IEEE Trans. Commun. 61(9), 3553–3565 (2013)
411. A.D. Wyner, The capacity of the band-limited Gaussian channel. Bell Syst. Tech. J. 45, 359–
371 (1966)
412. A.D. Wyner, J. Ziv, Bounds on the rate-distortion function for stationary sources with memory.
IEEE Trans. Inf. Theory 17(5), 508–513 (1971)
413. A.D. Wyner, The wire-tap channel. Bell Syst. Tech. J. 54, 1355–1387 (1975)
414. H. Yamamoto, A source coding problem for sources with additional outputs to keep secret
from the receiver or wiretappers. IEEE Trans. Inf. Theory 29(6), 918–923 (1983)
415. R.W. Yeung, Information Theory and Network Coding (Springer, New York, 2008)
416. S. Yong, Y. Yang, A.D. Liveris, V. Stankovic, Z. Xiong, Near-capacity dirty-paper code design:
a source-channel coding approach. IEEE Trans. Inf. Theory 55(7), 3013–3031 (2009)
417. X. Yu, H. Wang, E.-H. Yang, Design and analysis of optimal noisy channel quantization with
random index assignment. IEEE Trans. Inf. Theory 56(11), 5796–5804 (2010)
418. S. Yüksel, T. Başar, Stochastic Networked Control Systems: Stabilization and Optimization
under Information Constraints (Springer, Berlin, 2013)
419. K.A. Zeger, A. Gersho, Pseudo-Gray coding. IEEE Trans. Commun. 38(12), 2147–2158
(1990)
420. L. Zhong, F. Alajaji, G. Takahara, A binary communication channel with memory based on
a finite queue. IEEE Trans. Inform. Theory 53, 2815–2840 (2007)
421. L. Zhong, F. Alajaji, G. Takahara, A model for correlated Rician fading channels based on a
finite queue. IEEE Trans. Veh. Technol. 57(1), 79–89 (2008)
422. Y. Zhong, F. Alajaji, L.L. Campbell, On the joint source-channel coding error exponent for
discrete memoryless systems. IEEE Trans. Inf. Theory 52(4), 1450–1468 (2006)
423. Y. Zhong, F. Alajaji, L.L. Campbell, On the joint source-channel coding error exponent of
discrete communication systems with Markovian memory. IEEE Trans. Inf. Theory 53(12),
4457–4472 (2007)
424. Y. Zhong, F. Alajaji, L.L. Campbell, Joint source-channel coding excess distortion exponent
for some memoryless continuous-alphabet systems. IEEE Trans. Inf. Theory 55(3), 1296–
1319 (2009)
425. Y. Zhong, F. Alajaji, L.L. Campbell, Error exponents for asymmetric two-user discrete mem-
oryless source-channel coding systems. IEEE Trans. Inf. Theory 55(4), 1487–1518 (2009)
426. G.-C. Zhu, F. Alajaji, Turbo codes for non-uniform memoryless sources over noisy channels.
IEEE Commun. Lett. 6(2), 64–66 (2002)
427. G.-C. Zhu, F. Alajaji, J. Bajcsy, P. Mitran, Transmission of non-uniform memoryless sources
via non-systematic Turbo codes. IEEE Trans. Commun. 52(8), 1344–1354 (2004)
428. G.-C. Zhu, F. Alajaji, Joint source-channel Turbo coding for binary Markov sources. IEEE
Trans. Wireless Commun. 5(5), 1065–1075 (2006)
429. J. Ziv, The behavior of analog communication systems. IEEE Trans. Inf. Theory 16(5), 587–
594 (1970)
430. J. Ziv, A. Lempel, A universal algorithm for sequential data compression. IEEE Trans. Inf.
Theory 23(3), 337–343 (1977)
431. J. Ziv, A. Lempel, Compression of individual sequences via variable-rate coding. IEEE Trans.
Inf. Theory 24(5), 530–536 (1978)
Index

A
Acceptance region, 41, 42, 44
Achievability, 62, 70, 85, 118, 143, 147, 189, 229, 248, 261
Achievable rate, 56, 66, 117, 160
Achievable rate-distortion pair, 225, 226
Additive noise channel, 73, 110, 113, 151, 158, 159, 161, 186, 195, 206
Additive White Gaussian Noise (AWGN), 186, 197, 207, 248, 251–253, 255, 258, 262, 297
AEP, 59, see also Shannon-McMillan-Breiman theorem
  continuous memoryless sources, 184
  discrete memoryless sources, 59, 60
  distortion typical, 227, 228
  joint, 116
  stationary ergodic sources, 70
Applications of information measures, 40
Arimoto-Blahut algorithm, 131
Arithmetic codes, 93
Asymptotic equipartition property, see AEP
Average code rate, 77, 83, 85, 90
  of order t, 86, 87
Average codeword length, 55, 77, 83, 88, 92, 103
  of order t, 86
Average distortion, 224, 248, 249
Average power, 186, 188, 189, 208
Average probability of error, 114, 118, 123, 188
AWGN, see additive white Gaussian noise

B
Band-limited, 207, 209–211
Bandpass filter, 208, 209
Bandwidth, 208, 210, 251
Bayesian hypothesis testing, 41
Bernoulli process or source, 110, 238, 250, 290
Bernoulli random variable, 153
Beta distribution, 72
Binary entropy function, 9, 9, 22, 69, 72, 134, 151, 153, 155, 237, 238, 240, 250
Binary Erasure Channel (BEC), 110, 111, 127, 136, 137, 140, 152, 153, 163, 296
Binary Symmetric Channel (BSC), 109, 109, 126, 134, 151, 153–155, 163, 250, 253, 258, 296
Binary Symmetric Erasure Channel (BSEC), 111, 112, 137, 140, 152
Birkhoff-Khinchin ergodic theorem, 279
Bit error probability, 250–252, 255
Bits, 8, 9, 27, 47, 54, 56, 58, 61, 63, 75, 92, 114, 125
Block code, 57, 66, 98, 114, 142, 163, 188, 247
Blocklength, 58, 62, 66, 75, 98, 114, 115, 117, 125, 129, 142, 148, 149, 154, 188, 219, 225, 252
Bose-Chaudhuri-Hocquenghem (BCH) codes, 126
Bound
  sharp, 24
  tight, 24, 236, 247
Bounded distortion, 226, 227, 229, 248
Burst noise channel, 161

C
Capacity
  BEC, 136
  BSC, 134
  BSEC, 137
  discrete channels with memory, 159
  fading channels, 195
  feedback, 160
  Gaussian channel, 185
  information, 117
  MIMO channels, 203
  non-Gaussian channels, 206, 207
  OFDM systems, 204
  operational, 117, 161, 189
  parallel Gaussian channels, 198, 202
  q-ary symmetric channel, 134
Capacity-cost function, 185, 236
  concavity, 185
Cascade channel, 157
Cauchy-Schwarz inequality, 50
Causality in channels, 109, 160
Central limit theorem, 187, 291
Cesaro-mean theorem, 67
Chain rule
  conditional entropy, 14, 19
  differential entropy, 172, 174
  divergence, 35
  entropy, 13, 18, 47, 99
  mutual information, 17, 19, 175
Channel
  binary erasure, see BEC
  binary symmetric, see BSC
  binary symmetric erasure, see BSEC
  cascade, 157
  discrete memoryless, see DMC
  Gaussian, see Gaussian channel
  product, 152
  quasi-symmetric, 134
  sum, 152
  symmetric, 132
  T-symmetric, 141
  weakly symmetric, 132
  with feedback, 109, 160
  with memory, 73, 107, 142, 159, 160, 197
  Z-channel, 151
Channel capacity
  additive noise channel, 151, 158, 159
  AWGN channel, 255, 258
  BEC, 136
  BSC, 134
  BSEC, 137, 152
  calculation, 130
  continuous channel, 216
  correlated parallel Gaussian channels, 201
  DMC, 117, 150, 152, 154
  fading channels, 195
  feedback, 160
  information, 117
  KKT conditions, 138
  memoryless Gaussian channel, 188
  MIMO channels, 203
  OFDM systems, 204
  operational, 117, 161
  product channel, 152
  q-ary symmetric channel, 134
  quasi-symmetric channel, 135, 140, 153
  Rayleigh fading channel, 257
  sum channel, 152
  uncorrelated parallel Gaussian channels, 197
  water-filling, 199
  weakly symmetric channel, 153
Channels with erasures, 113
Channels with errors and erasures, 111
Channels with memory, 160, 197
Channel transition matrix, 108, 109, 132, 134, 152
Chebyshev’s inequality, 60, 98, 287
Chernoff-Stein lemma, 43
Code
  block, see block code
  error-correcting, 126
  Huffman, 87, 87
  instantaneous, 80, see also code, prefix
  non-singular, 76, 80, 81, 100
  optimal, 80, 87, 89, 91, 149
  polar, 126, 127
  prefix, 80, 81, 83, 87, 94, 101–103
  random, 118, 195, 230
  Shannon-Fano-Elias, 91, 93
  uniquely decodable, 61, 75, 76, 78–81, 83, 85, 87, 100, 103
  zero error, 125
Code rate, 58, 61, 62, 78, 80, 83, 85, 86, 90, 115, 117, 142, 166, 188, 225, 251
Colored Gaussian channel, 211
Communication system, 3
  general description, 2
  Shannon limit, 249
Computation
  channel capacity, 130
  rate-distortion function, 238
Concavity, 37, 185, 292
Conditional divergence, 34, 35, 48, 49
Conditional entropy, 12–14, 19, 67
Conditional mutual information, 17, 35
Conditioning increases divergence, 35
Conditioning reduces entropy, 14
Continuous channel, 185
Continuous random variables, 165, 169, 175, 182, 213–215
Continuous-time process, 275
Convergence of random variables, 283
  in distribution, 284
  in probability, 283
  in rth mean, 284
  with probability one, 283
Converse, 63, 70, 85, 118, 143, 148, 189, 233, 248, 261
Convex set, 225, 292
Convexity, 291
Convolutional codes, 126
Covariance function, 183, 208, 243
Covariance matrix, 177, 178, 180, 201, 202, 213, 215, 216
Cryptography, 2, 40
Cumulative distribution function, 165, 275

D
D-adic source, 80
D-ary unit, 9, 46, 85–87
Data privacy, 2, 40
Data processing inequality, 20
  for divergence, 29
  system interpretation, 21
Data processing lemma, see data processing inequality
δ-typical set, 59
Determinant, 176, 180
Differential entropy
  conditional, 172
  definition, 169
  estimation error, 215
  exponential distribution, 211
  Gaussian distribution, 241
  Gaussian source, 171
  generalized Gaussian distribution, 213
  Laplacian distribution, 211
  log-normal distribution, 212
  multivariate Gaussian, 178
  operational characteristic, 167
  properties, 174
  uniform source, 171
Discrete Memoryless Channel (DMC), 107
Discrete Memoryless Source (DMS), 57
Discrimination, 26
Distortion measure, 220, see also rate-distortion function
  absolute error, 222
  additive, 223
  difference, 222, 247
  Hamming, 222
  maximum, 223
  squared error, 222
Distortion typical set, 227
Divergence, 26, 172
  additivity for independence, 36
  bound from variational distance, 31
  conditional, 34
  convexity, 37
  nonnegativity, 27
  Rényi, 46
  rate, 69
Divergence rate, 183
Dominated convergence theorem, 285
Doubly stochastic matrix, 134
Duality, 236, 243, 294

E
Eigenvalue, 178, 201, 202
Empirical, 60, 93, 116
Energy, 251, 256
Entropy, 8
  additivity for independence, 15
  and mutual information, 17
  chain rule, 13, 18
  concavity, 37
  conditional, 13, 14
    chain rule, 14, 19
    lower additivity, 15
  differential, 169
  independence bound, 19
  invertible functions, 48
  joint, 12
  non-negativity, 10
  properties, 10, 18
  Rényi, 45
  relative entropy, 26, 27, 172
  uniformity, 10
  upper bound, 11
  Venn diagram, 17
Entropy power, 206
Entropy rate, 67–69, 141, 143, 236
  differential, Gaussian sources, 183
  Markov source, 74, 99, 100
  Polya contagion process, 72, 100
  Rényi, 87
  spectral sup-entropy rate, 74
  stationary ergodic source, 70, 97, 142
  stationary process, 68, 86, 103
Entropy stability property, 59
Erasure, 110, 221, 259
Erasure channel, 110, 111, 113, 136, 152, 163, 262
Ergodic, 66, 69–71, 74, 75, 97, 142, 143, 147, 159, 161, 235, 237, 248, 277, 278, 279, 282, 289, 291
Error exponent, 149
Exchangeable process, 72
Exponential source, 182

F
Fading channel, 195–197, 203, 204, 255, 257
Fano’s inequality, 22
  channel coding, 123
  list decoding, 52
  source coding, 99
  ternary partitioning of the observation space, 52
Feedback, 109, 160
Feedback capacity, 160
Finite-memory Polya contagion process, 73
Fixed-length code, 56
Fixed-length data transmission code, 114
Fixed-length lossy data compression code, 224
Function
  concavity, 292
  convexity, 291
Fundamental inequality (FI), 10

G
Gallager, 126, 135
Gamma function, 72, 213
Gauss-Markov sources, 243, 258
Gaussian channel, 165, 184, 187–189, 194, 197, 198, 201, 202, 207, 210, 217, see also additive white Gaussian noise
Gaussian process or source, 183, 208, 241
Gaussian random variable, 171
Gaussian random vector, 177, 178
Generalized AEP, 70
Generalized Gaussian distribution, 212
Geometric distribution, 48
Gilbert-Elliott channel, 161
Golay codes, 126

H
Hadamard’s inequality, 180, 202
Hamming codes, 126
Hamming distortion, 222, 238, 240, 250, 253, 255, 258, 262
Hölder’s inequality, 50
Huffman codes, 87
  adaptive, 93
  generalized, 91
  sibling property, 94
Hypothesis testing, 40
  Bayes criterion, 41
  Neyman-Pearson criterion, 41
  simple hypothesis testing, 40
  type I error, 41
  type II error, 41

I
Independence bound on entropy, 19
Independent and identically distributed (i.i.d.), 8, 57, 278
Inequality
  arithmetic and geometric mean, 51
  Cauchy-Schwarz, 50
  Chebyshev’s, 287
  data processing, 20
  entropy power, 207
  Fano’s, 22
  fundamental, 10
  Hadamard’s, 180, 202
  Hölder’s, 50
  Jensen’s, 292
  Kraft, 78
  log-sum, 11
  Markov’s, 286
  Pinsker’s, 31
Infimum, 265
  approximation property for infimum, 265
  equivalence to greatest lower bound, 265
  monotone property, 266
  property for monotone function, 267
  set algebraic operations, 266
Infinitely often, definition, 270
Information capacity, 117
Information rate-distortion function, 229
  sources with memory, 235
Information rates for stationary Gaussian sources, 183
Instantaneous code, 80
Irreducible Markov chain or source, 69, 71, 74, 100, 147, 281, 282, 291

J
Jensen’s inequality, 12, 50, 54, 196, 292
Joint distribution, 12, 28, 35, 47, 49, 66, 72, 214, 228, 278
Joint entropy, 5, 12–14, 116
Joint source-channel coding, 141, 247, 248, 252, 254, 261
Joint source-channel coding theorem
  lossless general rate block codes, 147
  lossless rate-one block codes, 143
  lossy, 248
Jointly typical set, 116

K
Karush-Kuhn-Tucker (KKT) conditions, 138, 295
Kraft inequality, 78, 80, 81
Kullback-Leibler distance, divergence, 26, 69, 172

L
Lagrange multipliers, 138, 197, 199, 293
Laplacian source, 182, 243, 244
Law of large numbers
  strong law, 288, 289
  weak law, 287
L2-distance, 117
Lempel-Ziv codes, 95
Likelihood ratio, 27, 42, 43
Limit infimum, see liminf under sequence
Limit supremum, see limsup under sequence
List decoding, 52
Log-likelihood ratio, 27, 34, 43
Log-sum inequality, 11
Lossy information-transmission theorem, 248
Lossy joint source-channel coding theorem, 248
Low-Density Parity-Check (LDPC) codes, 126, 130

M
Machine learning, 2, 40
Markov chain, 281
  aperiodic, 281
  homogeneous, 281
  irreducible, 69, 71, 74, 100, 147, 281, 282, 291
  stationary distribution, 69, 74, 100, 282
  time-invariant, 69, 104, 147, 281
Markov source or process, 68, 74, 76, 100, 280
  stationary ergodic, 282
Markov’s inequality, 286
Martingale, 73
Matrix
  channel transition, 108, 110, 132, 134, 152
  covariance, 177
  doubly stochastic, 134
  positive-definite, 177, 180
  positive-semidefinite, 177
  transition probability, 71, 157, 158
Maximal probability of error, 115, 144
Maximum, 264
Maximum a Posteriori (MAP), 155
Maximum differential entropy, 180, 182
Maximum entropy principle, 40
Maximum Likelihood (ML), 154
Memoryless
  channel, 107, 184, 248
  source, 8, 57, 183, 226, 278
MIMO channels, 203
Minimum, 265, 266
Modes of convergences
  almost surely or with probability one, 283
  in distribution, 284
  in mean, 284
  in probability, 283
  pointwise, 283
  uniqueness of convergence limit, 285
Molecular communication, 40
Monotone convergence theorem, 285
Monotone sequence
  convergence, 269
Multivariate Gaussian, 177
Mutual information, 16
  bound for memoryless channel, 19
  chain rule, 17, 19
  conditional, 17, 21
  continuous random variables, 173, 175, 214
  convexity and concavity, 37
  for specific input symbol, 138
  properties, 16, 174
  Venn diagram, 17

N
Nats, 8, 58, 213, 214
Network epidemics, 2, 73
Neyman-Pearson lemma, 42
Noise, 73, 106, 110, 112, 113, 126, 148, 149, 151, 154, 158, 159, 161, 186, 189, 194, 195, 197, 200, 201, 206, 207, 209, 211, 213, 236, 241, 251, 253–255, 262, 297
Non-Gaussianness, 207, 243
Non-singular code, 76
Normal random variable, 171

O
OFDM systems, 204
Operational capacity, 117, 161, 189
Optimality of uncoded communication, 252, 254

P
Parallel channels, 197, 201, 202
Pinsker’s inequality, 31
Pointwise ergodic theorem, 279
Polar codes, 126, 127
Polya contagion channel, 160
Polya contagion process, 71
Polya’s urn model, 71
Power constraint, 186–188, 194, 195, 197, 201, 208, 215, 217, 253, 255, 258, 262, 293
Power spectral density, 183, 208, 209, 251
Prefix code, 80, 81, 83, 87, 94, 101–103
Probability density function, 73, 165, 287
Probability mass function, 8, 37, 48, 57, 276
Probability of error, 22, 53, 62, 106, 114, 115, 118, 125, 145, 188, 189
Probability space, 274
Processing of distribution, 28

Q
Quantization, 149, 166, 167, 170, 219
Quasi-symmetric channel, 132, 134, 136, 140, 141, 153

R
Random coding, 118, 150, 190, 230, 237
Random process, 274
Random variable, 274
Rate, see achievable rate; code rate; entropy rate; divergence rate
Rate-distortion function, 225, 226
  absolute error distortion, 244
  achievability, 229, 242
  Bernoulli source, 238
  binary sources, 238
  calculation, 238
  converse, 233
  convexity, 226, 233
  erasure, 259, 262
  Gauss-Markov sources, 243
  Gaussian sources, 241
  Hamming distortion, 238
  infinite distortion, 258, 259
  information, 229, 235
  Laplacian sources, 244
  Markov sources, 236
  operational, 224
  Shannon lower bound, 242, 247
  sources with memory, 235
  squared error distortion, 241
Rate-distortion function bound
  DMS, 240
  sources with memory, 236
Rate-distortion region, 225
Rate-distortion theorem
  memoryless sources, 229
  sources with memory, 235
  stationary ergodic sources, 235
Rayleigh fading, 195, 257
Redundancy, 75, 148
Reed-Muller codes, 126
Reed-Solomon codes, 126
Refinement of distribution, 28
Relative entropy, 26, 27, 172
Rényi’s
  divergence, 46
  entropy, 45, 86

S
Sampling theorem, 209
Self-information, 5
  joint, 12
  uniqueness, 6
Sequence, 267
  liminf, 269, 270
  limit, 267, 268, 269
  limsup, 269, 270
Set
  boundedness, 266
  convexity, 292
  jointly typical, 116
  typical, 59, 184
  volume, 184
Shannon limit
  binary DMS over AWGN channel, 255, 258
  binary DMS over BSC, 250, 253
  binary DMS over Rayleigh fading channel, 257
  communication systems, 249
  Gaussian source over AWGN channel, 253
  Gaussian source over fading channel, 255
  q-ary DMS over q-ary channel, 253
Shannon’s channel coding theorem
  continuous memoryless channel, 195
  DMC, 117
  information stable channels, 160
  memoryless Gaussian channel, 189
Shannon’s joint source-channel coding theorem
  lossless, 141
  lossy, 247
Shannon’s rate-distortion theorem
  memoryless sources, 229
  stationary ergodic sources, 235
Shannon’s source coding theorem
  fixed-length, DMS, 62
  fixed-length, stationary ergodic sources, 70
  variable-length, DMS, 85
  variable-length, stationary sources, 86
Shannon’s source-channel coding separation principle, 141, 247
Shannon, Claude E., 1
Shannon-Fano-Elias code, 91
Shannon-McMillan-Breiman theorem, 59, 60, 70, 116, 184, see also AEP
Shift-invariant transformation, 276
σ-field, 274
SNR, 188, 195, 197, 200, 210, 251, 256, 257
Source-channel separation, 141
Squared error distortion, 222
Stationary distribution, 69, 74, 100, 282
Stationary ergodic process, 71, 291
Stationary ergodic source, 66, 70, 75, 97, 142, 143, 235, 237, 243, 248, 289
Stationary process, 278, 279, 282
Stationary source, 67, 68, 86, 278, 280
Statistics of processes
  ergodic, 278, 289
  Markov
    aperiodic, 281
    first-order, 281
    irreducibility, 282
    kth order, 280
    stationary distribution, 282
    stationary ergodic, 282
  memoryless, 278
  sources with memory, 69
  stationary, 278
Stochastic control, 2, 40
Strong converse, 62, 63, 70
Sufficiently large, definition, 270
Suffix code, 90, 102
Support, 165, 166, 168, 169, 172, 173
Supremum, 263, 264
  approximation property for supremum, 264
  completeness axiom, 264
  equivalence to least upper bound, 263
  monotone property, 266
  property for monotone function, 267
  set algebraic operations, 266
Symmetric channel, 132

T
Telephone line channel, 187, 210
Time-sharing principle, 225
Transition matrix, 110–113, 117, 130, 132, 134, 140, 151, 152, 163
Tree, 57, 81, 82, 91, 94, 95
T-symmetric channel, 141
Turbo codes, 126, 130
Typical set, 184

U
Uncertainty, 5, 10, 14, 16, 17, 34, 40
Uniform distribution, 9, 11, 114, 120, 155, 188
Uniquely decodable, 61, 62, 75–81, 83, 85, 87, 100, 103

V
Variable-length codes, 76, 81, 87, 93
Variable-length codes with exponential cost functions, 86
Variational distance, 30
  bound from divergence, 31
Volume, 184

W
Water-pouring scheme, 200
Weakly symmetric channel, 132, 134, 153
Weakly δ-typical set, 59
White noise, 251, see also Gaussian channel
Wireless, 148, 184, 195–197
Wyner-Ziv rate-distortion function bound
  sources with memory, 236

Z
Zero error, 125
