
Università degli Studi di Siena

Facoltà di Ingegneria

Lecture notes on

Information Theory
and Coding

Mauro Barni
Benedetta Tondi

2012
Contents

1 Measuring Information
  1.1 Modeling of an Information Source
  1.2 Axiomatic definition of Entropy
  1.3 Property of the Entropy

2 Joint Entropy, Relative Entropy and Mutual Information
  2.1 Joint and Conditional Entropy
    2.1.1 Joint Entropy
    2.1.2 Conditional Entropy
  2.2 Relative Entropy and Mutual Information
    2.2.1 Relative Entropy
    2.2.2 Mutual Information

3 Sources with Memory
  3.1 Markov Chain (3 r.v.)
  3.2 Characterization of Stochastic Processes
    3.2.1 Entropy Rate
    3.2.2 Markov Sources
    3.2.3 Behavior of the Entropy of a Markov Chain

4 Asymptotic Equipartition Property and Source Coding
  4.1 A reminder of Statistics
  4.2 Asymptotic Equipartition Property
  4.3 Source Coding
    4.3.1 Memoryless Source Coding
    4.3.2 Extension to the Sources with Memory
  4.4 Data Compression
    4.4.1 Kraft Inequality
    4.4.2 Alternative proof of Shannon's source coding theorem for instantaneous codes

5 Channel Capacity and Coding
  5.1 Discrete Memoryless Channel
    5.1.1 A Mathematical Model for the channel
    5.1.2 Examples of discrete memoryless channels
  5.2 Channel Coding
    5.2.1 Preview of the Channel Coding Theorem
    5.2.2 Definitions and concepts
    5.2.3 Channel Coding Theorem
    5.2.4 Channel Coding in practice

6 Continuous Sources and Gaussian Channel
  6.1 Differential Entropy
  6.2 AEP for Continuous Sources
  6.3 Gaussian Sources
  6.4 Gaussian Channel (AWGN)
    6.4.1 The Coding problem: a qualitative analysis
    6.4.2 Coding Theorems for the Gaussian Channel
    6.4.3 Bandlimited Channels

7 Rate Distortion Theory
  7.1 Rate Distortion Function
    7.1.1 Interpretation of the Rate Distortion Theorem
    7.1.2 Computing the Rate Distortion Function
    7.1.3 Simultaneous representation of Independent Gaussian Random variables
  7.2 Lossy Coding
    7.2.1 The Encoding procedure in practice
    7.2.2 Scalar Quantization
    7.2.3 Vector Quantization
    7.2.4 Avoiding VQ: the decorrelation procedure
Chapter 1

Measuring Information

Even if information theory is considered a branch of communication theory,
it actually spans a wide number of disciplines including computer science,
probability, statistics, economics, etc. The most basic questions treated by
information theory are: how can 'information' be measured? How can
'information' be transmitted?
From a communication theory perspective it is reasonable to assume that
the information is carried either by signals or by symbols. Shannon's
sampling theory tells us that if the channel is bandlimited, in place of the
signal we can consider its samples without any loss. Therefore, it makes sense
to confine the information carriers to discrete sequences of symbols, unless
differently stated. This is what we will do throughout these lectures.

1.1 Modeling of an Information Source


It is common sense that the reception of a message about a certain event,
for instance the result of an experiment, brings us information. Of course
the information is received only if we do not know the content of the message
in advance; this suggests that the concept of "information" is related to the
ignorance about the result of the event. We gain information only because
of our prior uncertainty about the event. We can say that the amount of
a-priori uncertainty equals the amount of information delivered by the
subsequent knowledge of the result of the experiment.
The first successful attempt to formalize the concept of information was made
by Shannon, who is considered the father of Information Theory. In his paper
"A Mathematical Theory of Communication" (published in the Bell
System Technical Journal, 1948) Shannon stated the inverse link between
information and probability. Accordingly, the realization of an event gives
more information if it is less probable. For instance, the news that a football
match between Barcelona and Siena has been won by the Siena team carries
much more information than the opposite outcome.
Shannon’s intuition suggests that information is related to randomness. As
a consequence, information sources can be modeled by random processes,
whose statistical properties depend on the nature of the information sources
themselves. A discrete time information source X can then be mathemati-
cally modeled by a discrete-time random process {Xi }. The alphabet X over
which the random variables Xi are defined can be either discrete (|X | < ∞)
or continuous when X corresponds to R or a subset of R (|X | = ∞). The
simplest model for describing an information source is the discrete memo-
ryless source (DMS) model. In a DMS all the variables Xi are generated
independently and according to the same distribution, i.i.d.. In this case,
it is possible to represent a memoryless source through a unique random
variable X.

1.2 Axiomatic definition of Entropy


The first effort that Shannon made was searching for a measure of the
average information received when an event occurs. We now provide the
entire proof procedure leading to Shannon’s formula of the entropy for the
DMS case.
Let X be a random variable describing a memoryless source with alphabet
X = {x1, x2, ..., xn}. Concisely, let us call pi the quantity Pr{X = xi}. Then
pi ≥ 0 ∀i = 1, ..., n and Σ_{i=1}^{n} pi = 1. Let H be the (unknown) measure we
look for. According to Shannon's intuition, H(X) must be a function of the
probabilities according to which the symbols of X are emitted, that is

H(X) = Hn (p1 , p2 , ..., pn ). (1.1)

In addition, the function Hn (p1 , p2 , ..., pn ) should have several intuitive prop-
erties. It is possible to formulate these properties as axioms from which we
will deduce the specific form of the H function.
The four fundamental axioms are:

A.1 H2(1/2, 1/2) = 1 bit (binary unit).

This equality gives the unit of measure of the information.
It states that tossing a fair coin delivers 1 bit of information.

A.2 H2 (p, 1 − p) is a continuous function of p (p ∈ [0, 1]).


It expresses a natural requirement: small changes of probabilities of an ex-
periment with two outcomes must result in small changes of the uncertainty
of the experiment.

A.3 (Permutation-invariance)

Hn (σ(p1 , p2 , ..., pn )) = Hn (p1 , p2 , ..., pn ), (1.2)

for any permutation σ of the probabilities of the n symbols of the alphabet.


The uncertainty of an experiment (i.e. the information delivered by its exe-
cution) does not depend on how the probabilities are assigned to the symbols.

A.4 (Grouping property)

Hn(p1, p2, ..., pn) = Hn−1(p1 + p2, p3, ..., pn) + (p1 + p2) · H2( p1/(p1+p2), p2/(p1+p2) ),   (1.3)

where Hn−1 is the measure of the information we receive from the experiment
which considers the first two events grouped in a unique one, while the second
term gives the additional information regarding which of the two events
occurred.

The above axioms reduce the number of possible candidate functions for
H. Even more, it is possible to prove that they suffice to determine a unique
function, as the next theorem asserts.
Let us extend the list of the axioms including another property. We will use
it to prove the theorem. However, we point out that this is not a real axiom
since it is deducible from the others. We will introduce it only to ease the
proof.
We define A(n) = Hn(1/n, 1/n, ..., 1/n).

P.1 A(n) is a monotonically increasing function of n.


This property is reasonable: if all the symbols are equally likely, the more
symbols there are, the greater the uncertainty about the result of the experiment.

Before stating and proving the theorem, we introduce some useful notations:
let sk denote the partial sum of probabilities

sk = Σ_{i=1}^{k} pi,   (1.4)

and let h(p) denote the entropy of the binary source, h(p) = H2(p, 1 − p).

Theorem (Entropy definition).

There is only one function Hn which satisfies the four axioms listed above.
Such a function has the following expression:

Hn(p1, p2, ..., pn) = − Σ_{i=1}^{n} pi log2 pi.   (1.5)

Hn is referred to as the Entropy of X.

Proof. The proof is organized in five steps.

1) By considering A.4 together with A.3 we deduce that we can group any
two symbols, not only the first and the second one. We now want to extend
the grouping property to a number k of symbols. We have:

Hn(p1, p2, ..., pn) = Hn−1(s2, p3, ..., pn) + s2 h(p2/s2)
                    = Hn−2(s3, p4, ..., pn) + s3 h(p3/s3) + s2 h(p2/s2) = ...
                    = Hn−k+1(sk, pk+1, ..., pn) + Σ_{i=2}^{k} si h(pi/si).   (1.6)

We would like to express the sum in (1.6) as a function of Hk. To this aim,
we notice that starting from Hk and grouping the first k − 1 symbols yields

Hk(p1/sk, ..., pk/sk) = H2(sk−1/sk, pk/sk) + Σ_{i=2}^{k−1} (si/sk) h( (pi/sk) / (si/sk) )
                      = Σ_{i=2}^{k} (si/sk) h(pi/si).   (1.7)

By properly substituting the above equality in (1.6), we obtain the extension
to k elements of the grouping property, that is

Hn(p1, p2, ..., pn) = Hn−k+1(sk, pk+1, ..., pn) + sk Hk(p1/sk, ..., pk/sk).   (1.8)

2) Let us consider two integer values n and m and the function A(n · m). If
we apply m times the extended grouping property we have just found (Point
1), each time to n elements in A(n · m), we obtain:

A(n · m) = Hnm(1/(nm), ..., 1/(nm))
         = Hnm−n+1(1/m, 1/(nm), ..., 1/(nm)) + (1/m) Hn(1/n, ..., 1/n)
     (a) = Hnm−2n+2(1/m, 1/m, 1/(nm), ..., 1/(nm)) + (2/m) A(n) = ...
         = Hm(1/m, ..., 1/m) + A(n)
         = A(m) + A(n),   (1.9)

where in (a) we implicitly used axiom A.3.

3) From the previous point we deduce that

A(n^k) = k · A(n).   (1.10)

We now consider the following property:

Property. The unique function which satisfies property (1.10) over all the
integer values is the logarithm function. Then A(n) = log(n).

Proof. Let n be given. Then, for any arbitrary number r,

∃k : 2^k ≤ n^r < 2^(k+1).   (1.11)

By applying the (base-2) logarithm operator to each of the three members, we
obtain

k ≤ r log(n) < k + 1  →  k/r ≤ log(n) < k/r + 1/r.   (1.12)

Hence, the distance between log(n) and k/r is at most 1/r, i.e. |log(n) − k/r| < 1/r.

Similarly, we can apply the function A to the members of relation (1.11)¹.
By exploiting equality (1.10) and the fact that A(2) = 1 (Axiom 1), we get

k/r ≤ A(n) < k/r + 1/r,  or  |A(n) − k/r| ≤ 1/r.   (1.13)

Therefore |A(n) − log(n)| ≤ 2/r, which thanks to the arbitrariness of r
concludes the proof.

4) We are now able to show that the expression of the entropy in (1.5) holds
for the binary case. Let us consider a binary source with p = r/s for two
positive integers r and s (obviously, r ≤ s).

A(s) = log(s) = Hs(1/s, ..., 1/s)
     = Hs−r+1(r/s, 1/s, ..., 1/s) + (r/s) · Hr(1/r, ..., 1/r)
     = H2(r/s, (s−r)/s) + ((s−r)/s) · Hs−r(1/(s−r), ..., 1/(s−r)) + (r/s) · A(r).   (1.14)

From the last equality and thanks to point 3, we have

log(s) = h(r/s) + ((s−r)/s) · A(s−r) + (r/s) · A(r).   (1.15)

Making the binary entropy term explicit, we get

h(r/s) = log(s) − ((s−r)/s) · log(s−r) − (r/s) · log(r).   (1.16)

Since log(s) can be written as ((s−r)/s) log(s) + (r/s) log(s), we have

h(r/s) = − ((s−r)/s) · log((s−r)/s) − (r/s) · log(r/s),   (1.17)

and then

h(p) = −(1 − p) · log(1 − p) − p · log(p) = − Σ_{i=1}^{2} pi log pi.   (1.18)

¹ Remember that A(·) is a monotonic function of its argument.

We have confined our derivation to rational probabilities. However, the
following two considerations allow us to extend the analysis to real numbers: the
former is that rational numbers are dense in the real ones, the latter is that
the h function is continuous (A.2). Accordingly, for any irrational number p*
it is possible to construct a sequence of rational numbers pn which tends to
it as n tends to infinity. The corresponding sequence of values h(pn),
thanks to A.2, has limit h(p*). This extends the proof to all p ∈ [0, 1].

5) As a last step, we extend the validity of the expression (1.5) to any value
n. The proof is given by induction exploiting the relation for n = 2, already
proved. Let us consider a generic value n and suppose that for n − 1 the
expression holds. Then,

Hn−1(p1, ..., pn−1) = − Σ_{i=1}^{n−1} pi log pi.   (1.19)

We want to show that the same expression holds for Hn:

Hn(p1, ..., pn) = Hn−1(p1 + p2, p3, ..., pn) + (p1 + p2) · h(p1/(p1 + p2))
              = − Σ_{i=3}^{n} pi log pi − (p1 + p2) · log(p1 + p2)
                − p1 · log(p1/(p1 + p2)) − p2 · log(p2/(p1 + p2))
              = − Σ_{i=3}^{n} pi log pi − p1 log p1 − p2 log p2
              = − Σ_{i=1}^{n} pi log pi,   (1.20)

which completes our proof.
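As a quick numerical sanity check of the formula just derived, the short Python sketch below (the helper name entropy is ours, not part of the original notes) evaluates (1.5) on a toy pmf and verifies axiom A.1, the permutation invariance A.3 and the grouping property A.4.

    import math

    def entropy(probs):
        # Shannon entropy in bits, with the 0*log(0) = 0 convention
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A.1: a fair coin delivers 1 bit
    print(entropy([0.5, 0.5]))                                        # 1.0

    # A.3: permutation invariance (equal up to floating-point rounding)
    print(entropy([0.1, 0.2, 0.7]), entropy([0.7, 0.1, 0.2]))

    # A.4: grouping the first two symbols
    p1, p2, p3 = 0.1, 0.2, 0.7
    lhs = entropy([p1, p2, p3])
    rhs = entropy([p1 + p2, p3]) + (p1 + p2) * entropy([p1 / (p1 + p2), p2 / (p1 + p2)])
    print(abs(lhs - rhs) < 1e-12)                                     # True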

Information theory and statistical mechanics

The name entropy assigned to the quantity in (1.5) recalls the homonymous
quantity defined in physics. Roughly speaking, from a microscopical point
of view, Boltzmann defined the entropy S as the logarithm of the number of
microstates Ω having an energy equal to E, i.e. S = k ln[Ω(E)]², where k
is a normalizing constant, by following a procedure similar to that adopted
later by Shannon and described above. Boltzmann, who is one of the pioneers
of statistical mechanics and thermodynamics, contributed to describing the
entropy as a measure of the disorder of an individual, microscopic state of a
physical system. Therefore, Shannon's concept of uncertainty is very similar
to the disorder of a physical system. The analogies between information
theory and statistical mechanics go far beyond this. However, we do not
dwell further on this subject since it is beyond the scope of these lectures.

² Ω(E) indicates the number of microstates having an energy equal to E.

1.3 Property of the Entropy

According to Shannon's definition, given a discrete random variable X
with alphabet X and probability mass function p(x) = Pr{X = x}³, x ∈ X,
the entropy H(X) of the random variable X has the expression

H(X) = − Σ_{x∈X} p(x) log p(x),   (1.21)

where log is the base-2 logarithm. In (1.21) we use the convention that
0 log 0 = 0, which can be easily justified through de l'Hôpital's rule. This is in
agreement with the fact that adding zero-probability terms does not change
the value of the entropy.

³ For convenience, we denote pmfs by p(x) rather than by pX(x).

Property. Let X be a DMS with alphabet X and p(x) the corresponding
pmf. Then,

H(X) ≤ log2 |X|,   (1.22)

where the equality holds if and only if p(x) = 1/|X| ∀x ∈ X.

Proof. We exploit the relation ln z ≥ 1 − 1/z, where equality holds if and
only if z = 1. Note that the logarithm involved in the relation is the base-e
logarithm⁴. We have:
log2 |X| − H(X) = log2 |X| + Σ_{x∈X} p(x) log2 p(x)
                = Σ_{x∈X} p(x) [log2 |X| + log2 p(x)]
                = log2 e · Σ_{x∈X} p(x) [ln |X| + ln p(x)]
                = log2 e · Σ_{x∈X} p(x) ln(|X| p(x))
            (a)  ≥ log2 e · Σ_{x∈X} p(x) ( 1 − 1/(|X| p(x)) )
                = log2 e · ( Σ_{x∈X} p(x) − Σ_{x∈X} 1/|X| ) = 0.   (1.23)

Hence,

log2 |X| ≥ H(X),   (1.24)

where the equality holds if and only if p(x) = 1/|X| ∀x ∈ X (in which case (a) holds
with equality).
From the above property, we conclude that the uniform distribution for an
information source is the one that gives rise to the maximum entropy. This
fact provides new hints about the correspondence between information theory and
statistical mechanics. In a physical system the condition of equally likely
microstates is the configuration associated with the maximum possible disorder
of the system and hence with its maximum entropy.
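A minimal numerical illustration of this bound, written in Python (all helper names are ours): the uniform pmf attains log2 |X| exactly, while an arbitrarily drawn pmf on the same alphabet stays below it.

    import math, random

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    n = 4
    uniform = [1.0 / n] * n
    raw = [random.random() for _ in range(n)]
    p = [r / sum(raw) for r in raw]          # an arbitrary pmf on the same alphabet

    print(entropy(uniform), math.log2(n))    # the two values coincide (equality case)
    print(entropy(p) <= math.log2(n))        # True for every pmf on n symbols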

⁴ Remember the relation log2 z = log2 e · loge z, holding for logarithms with different
bases, which will be useful in the following.
Chapter 2

Joint Entropy, Relative Entropy and Mutual Information

In Chapter 1 we defined the entropy of a random variable as the measure


of the uncertainty of the random variable, or equivalently the measure of the
amount of information required on the average to describe the value assumed
by the random variable. In this chapter we introduce some related quantities.

2.1 Joint and Conditional Entropy


2.1.1 Joint Entropy
Given two discrete memoryless sources X and Y with alphabet X and Y
respectively, we define the information obtained by observing the couple of
random variables (X, Y ). The extension of the entropy definition to a pair
of random variables is called joint entropy and involves the joint distribution
pXY (x, y), that is the statistical quantity describing the dependence between
the variables.

Definition. The joint entropy H(X, Y) of a pair of discrete random variables
(X, Y) with joint distribution p(x, y) is defined as

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y).   (2.1)

The joint entropy can also be seen as the entropy of the vector random
variable Z = (X, Y) whose alphabet is the cartesian product X × Y.

The intuitive properties of a quantity describing the ‘joint information’ are


captured by H(X, Y ); for example, it can be shown that if X and Y are two
independent sources then

H(X, Y ) = H(X) + H(Y ). (2.2)

Proof. We exploit the relation between the pmfs of independent sources, i.e.
p(x, y) = p(x)p(y), and proceed with some simple algebra:

H(X, Y) = − Σ_{x,y} p(x, y) log p(x, y)
        = − Σ_{x,y} p(x, y) log p(x)p(y)
        = − Σ_x Σ_y p(x, y) log p(x) − Σ_x Σ_y p(x, y) log p(y)
        = − Σ_x p(x) log p(x) Σ_y p(y|x) − Σ_y p(y) log p(y) Σ_x p(x|y)
        = H(X) + H(Y).   (2.3)

The definition of joint entropy can be easily extended to m sources, X1,
X2, ..., Xm, as follows:

H(X1, X2, ..., Xm) = − Σ_{x1} Σ_{x2} ... Σ_{xm} p(x1, x2, ..., xm) log p(x1, x2, ..., xm).   (2.4)

Accordingly, equation (2.4) represents the entropy of the joint random variable
(X1, X2, ..., Xm) taking values in the alphabet X1 × X2 × ... × Xm.
Equation (2.2) can also be generalized to m independent sources, yielding

H(X1, X2, ..., Xm) = Σ_{i=1}^{m} H(Xi).   (2.5)
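The additivity property (2.2) is easy to check numerically. Below is a small Python sketch (our own naming, not part of the notes) that builds the joint pmf of two independent sources as the product of the marginals and compares the two sides of (2.2).

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px = [0.5, 0.3, 0.2]
    py = [0.6, 0.4]

    # joint pmf of two independent sources: p(x, y) = p(x) p(y)
    pxy = [a * b for a in px for b in py]

    print(entropy(pxy))                    # H(X, Y)
    print(entropy(px) + entropy(py))       # H(X) + H(Y): same value, as stated by (2.2)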

2.1.2 Conditional entropy


We now characterize the information received by observing a random
variable X when we already know the value taken by another random variable
Y. Reasonably, if the knowledge of Y gives us some information about X,
the information carried by X will no longer be H(X).
Given a single realization Y = y, we define the entropy of the conditional
distribution p(x|y)¹ as

H(X|Y = y) = − Σ_{x∈X} p(x|y) log p(x|y).   (2.6)

Definition. Given a pair of random variables (X, Y), the conditional entropy
H(X|Y) is defined as

H(X|Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y).   (2.7)

The conditional entropy can be expressed as the expected value of the
entropies of the conditional distributions averaged over the conditioning
random variable Y, i.e. H(X|Y) = Σ_{y∈Y} p(y) · H(X|Y = y).

We now list some useful properties of the conditional entropy:

• X and Y independent ⇒ H(X|Y ) = H(X);

Proof. Suggestion: exploit the relation p(x|y) = p(x) which holds for inde-
pendent sources.

• (Chain Rule)

H(X, Y) = H(X) + H(Y|X)    (a)
        = H(Y) + H(X|Y).   (b)   (2.8)

Equality (a) tells us that the information given by the pair of random variables
(X, Y), i.e. H(X, Y), is the same information we receive by considering
the information carried by X (that is H(X)), plus the 'new' information provided
by Y (H(Y|X)), that is the information that had not already been given by the
knowledge of X. Analogous considerations can be made for equality (b).

¹ In probability theory, p(x|y) denotes the conditional probability distribution of X given
Y, i.e. the probability distribution of X when Y is known to be a particular value.

Proof.

H(X, Y) = − Σ_{x,y} p(x, y) log p(x, y)
        = − Σ_{x,y} p(x, y) log p(y|x)p(x)
        = − Σ_x Σ_y p(x, y) log p(y|x) − Σ_x Σ_y p(x, y) log p(x)
        = H(Y|X) + H(X).   (2.9)

The same holds for equality (b).

• (Generalized Chain Rule)

By considering the case of m sources we get the generalized chain rule, which
takes the form

H(X1, X2, ..., Xm) = Σ_{i=1}^{m} H(Xi | Xi−1, Xi−2, ..., X1)
                   = H(X1) + H(X2|X1) + H(X3|X2, X1) + ...
                     + H(Xm | Xm−1, ..., X1).   (2.10)

Proof. Suggestion: for m = 2 it has been proved above. The proof for a
generic m follows by induction.

By referring to (2.10) the meaning of the term ‘chain rule’ becomes clear:
at each step in the chain we add only the new information brought by the
next random variable, that is the novelty with respect to the information we
already have.
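The chain rule (2.8) can be verified on any joint pmf. The Python sketch below (the joint distribution and helper names are an illustrative choice of ours) computes H(X, Y) and H(X) + H(Y|X) from the same table and obtains the same value, up to floating-point rounding.

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # an arbitrary joint pmf p(x, y): rows indexed by x, columns by y
    pxy = [[0.30, 0.10],
           [0.05, 0.25],
           [0.20, 0.10]]

    px = [sum(row) for row in pxy]                           # marginal of X
    H_joint = H([p for row in pxy for p in row])             # H(X, Y)
    # H(Y|X) = sum_x p(x) H(Y | X = x), the average of the per-row entropies
    H_Y_given_X = sum(pxi * H([p / pxi for p in row]) for pxi, row in zip(px, pxy))

    print(H_joint)                 # H(X, Y)
    print(H(px) + H_Y_given_X)     # H(X) + H(Y|X): same value, as the chain rule states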

• (Conditioning reduces entropy)

H(X|Y ) ≤ H(X). (2.11)

This relation asserts that the knowledge of Y can only reduce the uncertainty
about X. Said differently, conditioning reduces the value of the entropy or
at most leaves it unchanged if the two random variables are independent.

Proof.

H(X) − H(X|Y) = Σ_x Σ_y p(x, y) log ( p(x|y) / p(x) )
              = Σ_x Σ_y p(x, y) log ( p(x, y) / (p(x)p(y)) )
              = log e · Σ_x Σ_y p(x, y) ln ( p(x, y) / (p(x)p(y)) ).   (2.12)

By using the lower bound for the logarithm, ln z ≥ 1 − 1/z, from (2.12) we get

H(X) − H(X|Y) ≥ log e · Σ_x Σ_y p(x, y) ( 1 − p(x)p(y)/p(x, y) ) = 0,   (2.13)

where the equality holds if and only if X and Y are independent.

Warning
Inequality (2.11) is not necessarily true if we refer to the entropy of a conditional
distribution p(x|y) for a given occurrence y, that is H(X|Y = y). The
example below aims at clarifying this fact.
Let us consider the problem of determining the most likely winner of a football
match. Suppose that the weather affects differently the performance of
the two teams according to the values of Table 2.1; X is the random
variable describing the outcome of the match (1, ×, 2) and Y is the random
variable describing the weather condition (rain, sun). By looking at the table
of values we note that if it rains we are in great uncertainty about the
outcome of the match, while if it is sunny we are almost sure that the winner
of the match will be the first team. As a consequence, if we computed
H(X|Y = rain) we would find out that the obtained value is larger than
H(X). Since here we are conditioning on a particular event, this fact should
not be surprising, and it is not in conflict with relation (2.11).

• H(X, Y ) ≤ H(X) + H(Y );

Proof. It directly follows from the chain rule and from relation (2.11).

Y \ X      1       ×       2
rain      1/3     1/3     1/3
sun       9/10    1/10     0

Table 2.1: The table shows the probability of the various outcomes in the two
possible weather conditions.
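To make the warning concrete, the following Python sketch computes H(X), H(X|Y) and H(X|Y = rain) for Table 2.1. The weather prior p(y) is not specified in the text, so a uniform prior is assumed here purely for illustration; the helper names are ours.

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # conditional pmfs p(x|y) from Table 2.1, outcomes ordered as (1, x, 2)
    p_x_given_rain = [1/3, 1/3, 1/3]
    p_x_given_sun  = [9/10, 1/10, 0]
    p_rain, p_sun  = 0.5, 0.5          # assumed weather prior (not given in the notes)

    p_x = [p_rain * a + p_sun * b for a, b in zip(p_x_given_rain, p_x_given_sun)]

    print(H(p_x))                                                  # H(X)          ~ 1.34 bits
    print(H(p_x_given_rain))                                       # H(X|Y = rain) ~ 1.58 bits, larger than H(X)
    print(p_rain * H(p_x_given_rain) + p_sun * H(p_x_given_sun))   # H(X|Y)        ~ 1.03 bits, not larger than H(X)

Conditioning on the single unfavorable outcome Y = rain increases the uncertainty about X, while the average over Y does not, exactly as relation (2.11) prescribes.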

• (Conditional Chain Rule)

Given three random variables X1, X2, X3, the following relation holds:

H(X1, X2 | X3) = H(X1 | X3) + H(X2 | X1, X3).   (2.14)

As usual, it is easy to argue that the above relation can be generalized to any
number m of sources.

• (Mapping application)
If we apply a deterministic function g to a given random variable X, i.e. a
deterministic processing, the following relation holds:

H(g(X)) ≤ H(X).   (2.15)

This means that we have less a priori uncertainty about g(X) than about X;
in other words, considering g(X) in place of X causes a loss of information.
The equality in (2.15) holds only if g is an invertible function.

Proof. By considering the joint entropy, we apply the chain rule in two possible
ways, yielding

H(X, g(X)) = H(X) + H(g(X)|X) = H(X)   (2.16)

and

H(X, g(X)) = H(g(X)) + H(X|g(X)).   (2.17)

By equating the terms in (2.16) and (2.17) we obtain

H(g(X)) = H(X) − H(X|g(X)) ≤ H(X).   (2.18)

The inequality holds since the term H(X|g(X)) is always greater than or equal
to zero, and it equals zero only if the function g is invertible, so that it is possible
to recover X by applying g⁻¹ to g(X). If this is the case, knowing X or g(X)
is the same.
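A small Python check of (2.15), under our own (hypothetical) choice of pmf and mapping: a non-invertible g merges symbols, so their probabilities pool together and the entropy can only decrease.

    import math
    from collections import defaultdict

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}     # pmf of X over {0, 1, 2, 3}
    g = lambda x: x % 2                        # a deterministic, non-invertible mapping

    # pmf of g(X): symbols mapped to the same value pool their probability
    pgx = defaultdict(float)
    for x, p in px.items():
        pgx[g(x)] += p

    print(H(px.values()))      # H(X)    ~ 1.85 bits
    print(H(pgx.values()))     # H(g(X)) ~ 0.97 bits, never larger than H(X)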

2.2 Relative Entropy and Mutual Information

In this section we introduce two concepts related to the entropy: the
relative entropy and the mutual information.

2.2.1 Relative Entropy


The relative entropy is a way to measure the distance between two pmfs
p(x) and q(x) defined over the same alphabet.

Definition. The relative entropy or Kullback-Leibler distance or even divergence
between two probability mass functions p(x) and q(x) is defined as

D(p||q) = Σ_{x∈X} p(x) log ( p(x) / q(x) ).   (2.19)

We use the conventions 0 log(0/q) = 0, p log(p/0) = ∞ and 0 log(0/0) = 0.


Despite the name "distance", the relative entropy is not a true distance. In
fact, although the positivity property is fulfilled (see below), the divergence
is not a symmetric quantity and does not satisfy the triangle inequality
(which are the other properties that a distance function must satisfy).
According to a common interpretation, the relative entropy D(p||q) is a measure
of the inefficiency of assuming that the distribution is q when the true
distribution is p. For instance, let us suppose we have a source X whose
symbols are drawn according to an unknown distribution p(x); if we knew
another distribution q(x) and decided to use it in order to construct a source
coder, then D(p||q) would represent the extra bits we have to pay for the
encoding. This situation arises frequently in estimation problems, where p
and q are respectively the true and the estimated distribution of an observable
set.
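A minimal Python sketch of the divergence (2.19), with our own helper name and two illustrative pmfs; it also shows the lack of symmetry mentioned above.

    import math

    def D(p, q):
        # relative entropy D(p||q) in bits, with the conventions stated above
        total = 0.0
        for pi, qi in zip(p, q):
            if pi > 0:
                total += math.inf if qi == 0 else pi * math.log2(pi / qi)
        return total

    p = [0.5, 0.25, 0.25]     # "true" distribution
    q = [0.4, 0.4, 0.2]       # assumed (estimated) distribution

    print(D(p, q), D(q, p))   # both positive, and in general D(p||q) != D(q||p)
    print(D(p, p))            # 0: the divergence vanishes only when the two pmfs coincide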

• (Positivity)

D(p(x)||q(x)) ≥ 0. (2.20)
where the equality holds if and only if p(x) = q(x).

Proof. Suggestion: apply the relation ln z ≥ 1 − 1/z.



As for the entropy, it is possible to define the conditional version of the


relative entropy and prove the chain rule property.

Definition. The conditional relative entropy D(p(x|y)||q(x|y)) is given by
the average of the relative entropies between the conditional probability mass
functions p(x|Y = y) and q(x|Y = y) over the probability mass function p(y).
Formally,

D(p(x|y)||q(x|y)) = Σ_y p(y) Σ_x p(x|y) log ( p(x|y) / q(x|y) )
                  = Σ_x Σ_y p(x, y) log ( p(x|y) / q(x|y) ).   (2.21)

• (Chain rule for relative entropy)

D(p(x, y)||q(x, y)) = D(p(y)||q(y)) + D(p(x|y)||q(x|y)). (2.22)

Proof. Suggestion: use the expression for the joint probability (conditional
probability theorem) for both terms inside the argument of the logarithm
and replace the logarithm with an appropriate sum of two logarithms.

2.2.2 Mutual Information


We now introduce the concept of mutual information which allows to
measure the amount of information that two variables have in common.

Definition. Consider two random variables X and Y with joint probability
mass function p(x, y) and marginal probability mass functions p(x) and p(y).
The mutual information between the two random variables is obtained as the
difference between the entropy of one random variable and the conditional
entropy of the same random variable given the other, i.e.

I(X; Y) = H(X) − H(X|Y).   (2.23)

According to the intuition, the mutual information represents the reduction
in the uncertainty of X due to the knowledge of Y.
Let us derive the explicit expression for I(X; Y) as a function of the probabilities
by applying the definitions of entropy and conditional entropy:

I(X; Y) = − Σ_x p(x) log p(x) + Σ_x Σ_y p(x, y) log p(x|y)
     (a) = − Σ_x Σ_y p(x, y) log p(x) + Σ_x Σ_y p(x, y) log p(x|y)
         = Σ_x Σ_y p(x, y) log ( p(x, y) / (p(x)p(y)) ),   (2.24)

where in (a) we replaced p(x) with Σ_y p(x, y).

We now give some properties of the mutual information:


• (Symmetry)

I(X; Y ) = I(Y ; X). (2.25)


This property tells that, as expected, the information that X has in common
with Y is the same that Y has in common with X.
Proof. By referring to (2.24), from the commutative property of the product
and the symmetry of p(x, y) it is easy to see that X and Y can be exchanged
in the definition of the mutual information. An alternative
and interesting way to prove the symmetry of the mutual information is
by exploiting the relation between conditional and joint entropy. We have:

I(X; Y) = H(X) − H(X|Y)
        = H(X) − (H(X, Y) − H(Y))
        = H(X) + H(Y) − (H(X) + H(Y|X))
        = H(Y) − H(Y|X) = I(Y; X).   (2.26)

• (Positivity)

I(X; Y) ≥ 0.   (2.27)

Proof. The proof is exactly the same we used to show relation (2.11). However,
there is another way to prove the positivity of I(X; Y), that is through
the application of the relation ln z ≥ 1 − 1/z to the expression in (2.24). Notice
that the positivity of I has already been implicitly proved in Section 2.1.2
by proving the relation H(X|Y) ≤ H(X).

• X, Y independent r.v. ⇔ I(X; Y ) = 0;

The validity of the above assertion arises also from the following:

Observation.
The mutual information I(X; Y ) is the relative entropy between the joint
distribution p(x, y) and the product of the marginal distributions p(x) and
p(y):
I(X; Y ) = D(p(x, y)||p(x)p(y)). (2.28)

The more p(x, y) differs from the product of the marginal distributions,
the more the two variables are dependent and hence the larger the common
information between them.
Hence, the positivity of the mutual information directly follows from that of
the relative entropy.
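Relation (2.28) gives a direct recipe for computing the mutual information from a joint pmf. The Python sketch below (function name and example tables are ours) applies it to a dependent pair and to an independent pair.

    import math

    def mutual_information(pxy):
        # I(X;Y) in bits from a joint pmf given as a matrix pxy[x][y]
        px = [sum(row) for row in pxy]
        py = [sum(col) for col in zip(*pxy)]
        return sum(p * math.log2(p / (px[i] * py[j]))
                   for i, row in enumerate(pxy)
                   for j, p in enumerate(row) if p > 0)

    # dependent pair: I(X;Y) > 0
    print(mutual_information([[0.4, 0.1],
                              [0.1, 0.4]]))      # ~ 0.28 bits
    # independent pair, p(x,y) = p(x)p(y): I(X;Y) = 0
    print(mutual_information([[0.25, 0.25],
                              [0.25, 0.25]]))    # 0.0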

We now define the conditional mutual information as the reduction in the
uncertainty of X due to the knowledge of Y when we know another random
variable Z.

Definition. The conditional mutual information of the random variables X
and Y given Z is defined as

I(X; Y|Z) = H(X|Z) − H(X|Z, Y)
          = Σ_x Σ_y Σ_z p(x, y, z) log ( p(x, y|z) / (p(x|z)p(y|z)) ).   (2.29)

Notice that even in this case the conditioning refers to the average over the
values of Z.

• (Chain rule for mutual information)

I(X1, X2, ..., Xm; Y) = Σ_{i=1}^{m} I(Xi; Y | Xi−1, ..., X1).   (2.30)

We can indicate I(X1, X2, ..., Xm; Y) compactly as the mutual information
between the vector random variable (X1, X2, ..., Xm), taking values in X^m,
and Y. For i = 1 no conditioning is considered.

Proof.

I(X1, X2, ..., Xm; Y)
    (a) = H(X1, X2, ..., Xm) − H(X1, ..., Xm | Y)
    (b) = Σ_{i=1}^{m} H(Xi | Xi−1, ..., X1) − Σ_{i=1}^{m} H(Xi | Xi−1, ..., X1, Y)
    (c) = Σ_{i=1}^{m} I(Xi; Y | Xi−1, ..., X1),   (2.31)

where in (a) we simply rewrote the mutual information as a function of the
entropies, and in (b) we applied the chain rule and the conditional chain rule
for the entropy. Subtracting the terms of the two sums term by term yields the
conditional mutual information terms, giving (c).
An alternative way to prove (2.30) starts, as usual, from the explicit expression
of the mutual information as a function of the probabilities and goes
through some algebra.

Venn diagram
All the above relationships among the entropy and the related quantities
(H(X), H(Y ), H(X, Y ), H(X/Y ), H(Y /X) and I(X; Y )) can be expressed
in a Venn diagram. In a Venn diagram these quantities are visually repre-
sented as sets and their relationships are described as unions or intersections
among these sets, as illustrated in Figure 2.1.

Exercise:
To practice with the quantities introduced so far, prove the following rela-
tions:

• H(X, Y |Z) ≥ H(X|Z);

• I(X, Y ; Z) ≥ I(X; Z);

• I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z).


[Figure 2.1: Venn diagram illustrating the relationship between entropy and mutual
information. H(X, Y) is the whole region, H(X) and H(Y) are the two sets, H(X|Y) and
H(Y|X) are the parts of each set outside the other, and I(X; Y) is their intersection.]
Chapter 3

Sources with Memory

In the first two chapters we introduced the concept of information and


some related measures confining our analysis to discrete memoryless sources.
In this chapter we remove the memoryless assumption moving towards a
more general definition of information.
Among the sources with memory, Markov sources play a major role. The
rigorous definition of a Markov process will be given in Section 3.2.

3.1 Markov Chain (3 r.v.)


Let us consider the following configuration:

X → Y → Z. (3.1)

Definition. Three random variables X, Y and Z form a Markov chain in
that direction (denoted by →) if

p(z|y, x) = p(z|y),   (3.2)

i.e., given Y, the knowledge of X (which precedes Y in the chain) does not
change our knowledge about Z.

In a similar way, we can say that X, Y and Z form a Markov chain


X → Y → Z if the joint probability mass function can be written as

p(x, y, z) = p(z|y)p(y|x)p(x). (3.3)

We now state some interesting properties of Markov chains.


Property (1).

X→Y →Z ⇔ p(x, z|y) = p(x|y)p(z|y), (3.4)

that is the random variable X, Y and Z form a Markov chain with direction
→ if and only if X and Z are conditionally independent given Y .

Proof. We show first the validity of the direct implication, then that of the
reverse one.

• M ⇒ C.I. (Markovity implies Conditional Independence)

p(x, z|y) = p(x, y, z) / p(y)
        (a) = p(z|y) p(y|x) p(x) / p(y)
        (b) = p(z|y) p(x|y),   (3.5)

where in (a) we use the definition of Markov chain, while equality (b) follows
from Bayes' theorem;

• C.I. ⇒ M. (Conditional Independence implies Markovity)

p(x, y, z) = p(x, z|y) p(y)
         (a) = p(z|y) p(x|y) p(y)
         (b) = p(z|y) p(y|x) p(x),   (3.6)

where the conditional independence between X and Z given Y yields equality
(a), and equality (b) is again a consequence of Bayes' theorem.

Property (2).

X → Y → Z  ⇒  Z → Y → X,   (3.7)

that is, if three random variables form a Markov chain in one direction, they
also form a Markov chain in the reverse direction.

Proof. From Property (1) it is easy to see that X and Z play interchangeable
roles in the condition p(x, z|y) = p(x|y)p(z|y); hence (3.7) follows.

Observation.
If we have a deterministic function f , then

X → Y → f (Y ). (3.8)

In fact, since f (·) is a deterministic function of Y , if f (Y ) depends on X it


is surely through Y ; therefore, conditioning to Y makes X and f (Y ) inde-
pendent.

Property (3). For a Markov chain the Data Processing Inequality (DPI)
holds, that is

X→Y →Z ⇒ I(X; Z) ≤ I(X; Y ). (3.9)

The DPI states that proceeding along the chain leads to a reduction in the
information about the first random variable.

Proof. By exploiting the chain rule we expand the mutual information I(X; Y, Z)
in two different ways:

I(X; Y, Z) = I(X; Z) + I(X; Y |Z) (3.10)


= I(X; Y ) + I(X; Z|Y ). (3.11)

By the properties of Markov chains we know that I(X; Z|Y ) = 0. Then,


from the positivity of the mutual information (I(X; Y |Z) ≥ 0) the desired
relation holds. Similarly, by reversing the direction of the chain (according
to Property (2)), we can also prove that I(X; Z) ≤ I(Z; Y ).

Note: the data-processing inequality can be used to show that no clever
manipulation of the data can increase the information they contain and thus
improve the inferences that can be made from them. With specific reference
to the observation above, the DPI tells us that no deterministic processing
of Y can increase the information that Y contains about X, i.e.
I(X; Y) ≥ I(X; f(Y)).
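The DPI can be checked numerically on a simple chain. In the Python sketch below (an illustration of ours, not taken from the notes) X is a fair binary source, Y is obtained by passing X through a noisy binary channel and Z by passing Y through a second one, so that X → Y → Z; the mutual informations are then computed from the corresponding joint pmfs.

    import math

    def mi(pxy):
        px = [sum(r) for r in pxy]
        py = [sum(c) for c in zip(*pxy)]
        return sum(p * math.log2(p / (px[i] * py[j]))
                   for i, r in enumerate(pxy) for j, p in enumerate(r) if p > 0)

    px  = [0.5, 0.5]
    ch1 = [[0.9, 0.1], [0.1, 0.9]]      # p(y|x)
    ch2 = [[0.8, 0.2], [0.2, 0.8]]      # p(z|y)

    pxy = [[px[i] * ch1[i][j] for j in range(2)] for i in range(2)]
    # p(x, z) = sum_y p(x) p(y|x) p(z|y), i.e. the Markov factorization (3.3)
    pxz = [[sum(px[i] * ch1[i][y] * ch2[y][k] for y in range(2)) for k in range(2)]
           for i in range(2)]

    print(mi(pxy))    # I(X;Y) ~ 0.53 bits
    print(mi(pxz))    # I(X;Z) ~ 0.17 bits: processing Y further cannot increase the information about X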

3.2 Characterization of Stochastic Processes


So far we have described a source of information by a random variable.
Nevertheless, because of the introduction of memory in the source, this is
no longer correct. Indeed, a source with memory has to be modeled as a
stochastic process. For now, we limit our analysis to discrete time sources;
accordingly, we refer to discrete-time processes. As to the symbols emitted
by the sources, we still consider finite alphabets and then the processes we
consider are also discrete-state.
The stochastic process describing a discrete source with memory is then a
sequence of random variables X1, X2, ..., Xn, which is often denoted by the
notation x(k, n), where the index k refers to the sampling of the process at a
given instant, while the index n refers to the sampling in time for a given
realization. Due to the presence of memory, the random variables representing
the output of the source at the different time instants are not necessarily
identically distributed.
For simplicity, we shall use the notation Xn to represent the stochastic pro-
cess omitting the dependence on k.
For mathematical tractability, we limit our analysis to stationary processes
(stationary sources).

Definition. A stochastic process is said to be stationary if the joint distribution
of any subset of the sequence of random variables is invariant with
respect to shifts in the time index; that is,

p_{X1,...,Xn}(x1, ..., xn) = p_{X1+l,...,Xn+l}(x1, ..., xn),   (3.12)

for every value n, every shift l and all x1, x2, ..., xn ∈ X.

From a practical point of view, one may wonder whether such a model can
actually describe a source in a real context. It is possible to affirm that
the above model represents a good approximation of some real processes, at
least when they are considered over limited time intervals.

If a process is stationary each random variable Xi has the same distribu-


tion p(x), whereby the entropy can still be defined and is the same for any
i, i.e. H(Xi ) = H(X). However, as opposed to the memoryless case, the
entropy no longer defines the information we receive by observing an output
of the source when we know the previous outcomes. For sources with mem-
ory, in order to characterize such information, we need to introduce a new
concept: the Entropy Rate.

3.2.1 Entropy Rate


We consider a sequence of n dependent random variables X1 , X2 , ...., Xn
describing a source with memory over an interval of n instants. If we observe
n outputs we receive the amount of information H(X1 , ..., Xn ). It’s clear
that, due to the dependence among the variables, H(X1 , ..., Xn ) 6= nH(X).
We want to determine how the entropy of the sequence grows with n. In this
way, we would be able to get a definition of the growth rate of the entropy
of the stochastic process i.e. its entropy rate.
Before stating the next theorem we first introduce two quantities that intu-
itively seem to be both reasonable definitions of the entropy rate.
We can define the average information carried by one of the n symbols emit-
ted by the source as H(X1 , ..., Xn )/n. The question is: how large n should
be in order to get a good estimation of the effective average? Clearly, it is

necessary to let n grow so as to take into account all the memory.
The quantity

lim_{n→∞} H(X1, ..., Xn) / n   (3.13)

is the limit of the per-symbol entropy of the n random variables as n tends to
infinity and defines, in the literature, the entropy of the stochastic process {Xi}.
Another quantity that seems to be a good definition for the amount of
information received by observing the output is obtained as follows: take the
conditional entropy of the last random variable given the past and, to be
sure to consider the entire memory of the source, take the limit, i.e.

lim_{n→∞} H(Xn | Xn−1, ..., X1).   (3.14)

It must be pointed out that, in general, the above limits may not exist.
We now prove the important result that for stationary processes both limits
(3.13) and (3.14) exist and assume the same value.

Theorem (Entropy Rate).
If Xn is a stationary source we have

lim_{n→∞} H(X1, ..., Xn) / n = lim_{n→∞} H(Xn | Xn−1, ..., X1),   (3.15)

which is called the entropy rate and denoted by H(Xn).


Proof. We start by proving that the limit on the right-hand side of (3.15)
exists. In fact

H(Xn |Xn−1 , ..., X1 ) ≤ H(Xn |Xn−1 , ..., X2 ) = H(Xn−1 |Xn−2 , ..., X1 ), (3.16)

where the inequality follows from the fact that conditioning reduces the entropy,
and the equality follows from the stationarity assumption. Relation
(3.16) shows that H(Xn | Xn−1, ..., X1) is non-increasing in n. Since, in addition,
H(Xn | Xn−1, ..., X1) is a non-negative quantity for any n, according to a well
known result from calculus we can conclude that the limit in (3.14) exists
and is finite.

We now prove that the average information H(X1, ..., Xn)/n has the same
asymptotic limit value.
By the chain rule:

H(Xn, ..., X1) / n = (1/n) Σ_{i=1}^{n} H(Xi | Xi−1, ..., X1).   (3.17)

We indicate by a_n the quantity H(Xn | Xn−1, ..., X1) and by ā the value
lim_{n→∞} a_n, which we know to exist and be finite; then the average entropy
equals (1/n) Σ_{i=1}^{n} a_i. We can directly prove relation (3.15) by exploiting
the following result from calculus (the Cesàro mean):

a_n → ā  ⇒  (1/n) Σ_{i=1}^{n} a_i → ā.   (3.18)

Below, we give the formal proof.
Let us consider the absolute value of the difference between the mean value
of the a_i over n symbols and the value ā. According to the limit definition, we
have to show that this quantity can be made arbitrarily small as n → ∞.
We can write:

| (1/n) Σ_{i=1}^{n} a_i − ā | = | (1/n) Σ_{i=1}^{n} (a_i − ā) | ≤ (1/n) Σ_{i=1}^{n} |a_i − ā|.   (3.19)

By the limit definition, a_n → ā means that

∀ε > 0 ∃Nε : ∀n > Nε, |a_n − ā| < ε.   (3.20)

Hence, going on from (3.19) we obtain

(1/n) Σ_{i=1}^{n} |a_i − ā| = (1/n) Σ_{i=1}^{Nε} |a_i − ā| + (1/n) Σ_{i=Nε+1}^{n} |a_i − ā|.   (3.21)

Looking at the first summation, its numerator is a fixed and finite value (call
it k), while, thanks to the proper choice of Nε, all the terms of the second
summation are less than ε. Then we have

| (1/n) Σ_{i=1}^{n} a_i − ā | < k/n + ((n − Nε)/n) · ε  −→  ε  as n → ∞,   (3.22)

q.e.d.

Differently from the definition of the entropy, the definition of the entropy
rate is not so easy to handle for a generic source. The main difficulty resides
in the estimation of the joint distribution of the source, which is needed
for evaluating the joint or conditional entropy. All the more so because, strictly
speaking, the computation of the entropy rate requires that n goes to infinity,
making the estimate of the joint distribution a prohibitive task. There are
only some cases in which such an estimation is possible: one of these is the case
of Markov sources.

3.2.2 Markov Sources


In Section 3.1 we introduced the Markov chain for 3 random variables.
We now give the general definition of a Markov process and discuss its prin-
cipal features. For Markov sources we are able to evaluate the entropy of the
process H(Xn ).

Definition (Markov Chain). A discrete stochastic process Xn is a Markov


chain or Markov process if ∀n

p(xn |xn−1 , ..., x1 ) = p(xn |xn−1 ) (3.23)

for all x1 , x2 , ..., xn ∈ X n .

Then, in a Markov source the pmf at a given time instant n depends only
on what happened in the previous instant (n − 1).

From (3.23) it follows that for a Markov source the joint probability mass
function can be written as

p(x1 , x2 , ..., xn ) = p(xn |xn−1 )p(xn−1 |xn−2 )...p(x2 |x1 )p(x1 ). (3.24)

For clarity of notation, we indicate through ai a generic symbol of the source


alphabet; then X = {a1 , a2 , ..., am } for some m.

Definition (Time Invariant M.C.). The Markov process is said to be time


invariant (t.i.) if the conditional probability p(Xn = aj |Xn−1 = ai ) does not
depend on n; formally, for any n,

p(Xn = aj |Xn−1 = ai ) = p(X2 = aj |X1 = ai ), ∀ai , aj ∈ X . (3.25)

As a consequence, we can indicate the conditional probability of a time


invariant Markov chain with the notation Pij , without reference to the time
index.
When Xn is a Markov chain, Xn is called the state at time n.
A t.i. Markov chain is completely characterized by the initial state, via the
probability vector P^(1)¹, and by the probability transition matrix P = {Pij},
i, j = 1, 2, ..., |X|.
The probability vector at time n, which is

P (n) = (p(Xn = a1 ), p(Xn = a2 ), ..., p(Xn = am ))T , (3.26)

is recursively obtained by the probability vector at time n − 1 as follows

P (n),T = P (n−1),T · P. (3.27)

Observe that P^(n) = P^(n−1) holds if P^(n−1) is a (left) eigenvector of P with
eigenvalue 1 or, trivially, if P is the identity matrix.
From now on, we assume that the Markov chain is time invariant unless
otherwise stated.

Example (Two-state Markov chain).

Let us consider the simple example of a two-state Markov chain shown in
Figure 3.1 by means of a state diagram.

[Figure 3.1: Two-state Markov chain, with states A and B and transition
probabilities α (from A to B) and β (from B to A).]

The corresponding transition matrix is the following:

P = [ 1−α    α  ]
    [  β    1−β ].

We now introduce two interesting properties that a Markov chain may have.

¹ P^(1) denotes the vector of the probabilities of the alphabet symbols at time instant
n = 1, i.e. P^(1) = (p(X1 = a1), p(X1 = a2), ..., p(X1 = am))^T.

Definition (Irreducible M.C.). If it is possible to go, with positive probability,
from any state of the chain to any other state in a finite number of steps,
the Markov chain is said to be irreducible.

In particular, by referring to a given state i, we can define the period:

Ki = Lcf{n : Pr{Xn = i | X0 = i} > 0}²,   (3.28)

i.e. the largest common factor of the numbers of steps that, starting from
state i, allow one to come back to the state i itself (or equivalently, the largest
common factor of the lengths of the different paths from a state to itself).

Definition (Aperiodic M.C.). If Ki = 1 ∀i, the (irreducible) Markov chain
is said to be aperiodic.

² Lcf is the abbreviation for largest common factor (greatest common divisor).

Theorem (Uniqueness of the stationary distribution).


If a finite state (t.i.) Markov chain is irreducible and aperiodic, there ex-
ists a unique stationary distribution Π = limn→∞ P (n) whatever the initial
distribution P (1) is. This also means that:

ΠT = ΠT · P. (3.29)

From the above theorem we deduce that the stationary distribution is


so called because if the initial state distribution P (1) is itself Π the Markov
chain forms a stationary process. If this is the case it is easy to evaluate the
entropy rate by computing one of the two limits in (3.15).
In fact:

H(Xn) = lim_{n→∞} H(Xn | Xn−1, ..., X1)
    (a) = lim_{n→∞} H(Xn | Xn−1)
    (b) = lim_{n→∞} H(X2 | X1)
        = H(X2 | X1),   (3.30)

where (a) follows from the definition of Markov chain and (b) from the sta-
tionarity of the process. Hence, the quantity H(X2 |X1 ) is the entropy rate
of a stationary Markov chain.
In the sequel we express the entropy rate of a stationary M.C. as a function of
the quantities defining the Markov process, i.e. the initial state distribution
P (1) and the transition matrix P. Dealing with a stationary Markov chain

we know that P^(i) = Π for all i; then we have

H(X2 | X1) = Σ_{i=1}^{|X|} p(X1 = ai) H(X2 | X1 = ai)
           = Σ_{i=1}^{|X|} P^(1)_i H(X2 | X1 = ai)
           = Σ_{i=1}^{|X|} Πi H(X2 | X1 = ai)
           = − Σ_{i=1}^{|X|} Σ_{j=1}^{|X|} Πi Pij log Pij,   (3.31)

where, for a fixed i, p(X2 = aj | X1 = ai) = Pij is the probability of passing from
the state i to the state j, for j = 1, ..., |X|. In general, the distribution
p(X2 | X1 = ai) corresponds to the i-th row of the P matrix.

Going back to the example of the two-state Markov chain, we can now
easily compute the entropy rate. In fact, by looking at the state diagram
in Figure 3.1 it is easy to see that the Markov chain is irreducible and
aperiodic. Therefore we know that, for any starting distribution, the same
stationary distribution is reached. The components of the vector Π are the
stationary probabilities of the states A and B, i.e. ΠA and ΠB respectively.
The stationary distribution can be found by solving the equation Π^T = Π^T · P.
Alternatively, we can obtain the stationary distribution by setting to zero the
net probability flow across any cut in the state transition graph.
By imposing the balance at the cut of the state diagram in Figure 3.1 we
obtain the following system with two unknowns:

ΠA α = ΠB β,
ΠA + ΠB = 1,

where the second equality accounts for the fact that the sum of the probabilities
must be one. The above system, once solved, leads to the following
solution for the stationary distribution:

Π = ( β/(α+β), α/(α+β) ).   (3.32)

We are now able to compute the entropy rate H(Xn) from the expression in (3.31).
Let us call h(α) the entropy of the binary source given the state A (i.e.
H(X2 | X1 = A)), that is the entropy of the distribution in the first row of P.
Similarly, we define h(β) for the state B. The general expression for a two-state
Markov chain is

H(Xn) = (β/(α+β)) h(α) + (α/(α+β)) h(β).   (3.33)

Equation (3.33) tells us that in order to evaluate the entropy rate of the two-state
Markov chain it is sufficient to estimate the transition probabilities of
the process once the initial phase ends.

Note: the stationarity of the process has not been required for this derivation.
The entropy rate, in fact, is defined as a long-term behavior and is therefore the
same regardless of the initial state distribution. Hence, if the initial state
distribution is P^(1) ≠ Π we can always skip the initial phase and consider
the behavior of the process from a certain time onwards.
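The two-state example can be reproduced numerically. The Python sketch below (variable names and the specific values of α and β are ours) computes the stationary distribution (3.32) and the entropy rate (3.33), and checks the result by iterating the recursion (3.27) from an arbitrary initial distribution.

    import math

    def h(p):
        # binary entropy function in bits
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    alpha, beta = 0.2, 0.4                       # illustrative transition probabilities

    pi_A = beta / (alpha + beta)                 # stationary distribution (3.32)
    pi_B = alpha / (alpha + beta)
    rate = pi_A * h(alpha) + pi_B * h(beta)      # entropy rate (3.33)
    print(pi_A, pi_B, rate)

    # sanity check: iterate P^(n),T = P^(n-1),T * P from an arbitrary starting point
    P = [[1 - alpha, alpha], [beta, 1 - beta]]
    p = [1.0, 0.0]
    for _ in range(200):
        p = [p[0] * P[0][0] + p[1] * P[1][0], p[0] * P[0][1] + p[1] * P[1][1]]
    print(p)                                     # approaches (pi_A, pi_B)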

3.2.3 Behavior of the Entropy of a Markov Chain


We have already mentioned the relation between the entropy concept in
information theory and the notion of entropy derived from thermodynamics.
In this section we point out the similarity of a Markov chain with an isolated
physical system. As in a Markov chain, even in a physical system knowing
the present state makes the future of the system independent of the past:
think for instance of the positions and velocities of gas particles.
We now get more insight into the Markov chain Xn in order to show that
the entropy H(Xn) is nondecreasing, just as in thermodynamics.
Let p^(n) and q^(n) be two pmfs on the state space of a Markov chain at
time n. According to the time invariance assumption these two distributions
are obtained by starting from two different initial distributions p^(1) and q^(1). Let
p^(n+1) and q^(n+1) be the corresponding distributions at time n + 1, i.e. the
evolution of the chain.

Property. The relative entropy D(p(n) ||q (n) ) decreases with n; equivalently

D(p(n+1) ||q (n+1) ) ≤ D(p(n) ||q (n) ) for any n. (3.34)

Proof. We use the expression p^(n+1,n) (respectively q^(n+1,n)) to indicate the joint
probability distribution of the two discrete random variables representing the
state at time n and the state at time n + 1:

p^(n+1,n) = p_{Xn+1,Xn}(xn+1, xn)   (q^(n+1,n) = q_{Xn+1,Xn}(xn+1, xn)).   (3.35)

Similarly, by referring to the conditional distributions we have

p^(n+1|n) = p_{Xn+1|Xn}(xn+1 | xn)   (q^(n+1|n) = q_{Xn+1|Xn}(xn+1 | xn)).   (3.36)

According to the chain rule for relative entropy, we can write the following
two expansions:

D(p^(n+1,n) || q^(n+1,n)) = D(p^(n+1) || q^(n+1)) + D(p^(n|n+1) || q^(n|n+1))
                         = D(p^(n) || q^(n)) + D(p^(n+1|n) || q^(n+1|n)).   (3.37)

It is easy to see that the term D(p^(n+1|n) || q^(n+1|n)) is zero, since in a Markov
chain³ the probability of passing from a state at time n to another state at time
n + 1 (the transition probability) is the same whatever the probability vector
is. Then, from the positivity of D, equation (3.34) is proved.

The above property asserts that in a Markov chain the K-L distance between
the probability distributions tends to decrease as n increases.
We observe that equation (3.34), together with the positivity of D, allows us to
say that the sequence of the relative entropies D(p^(n) || q^(n)) admits a limit as
n → ∞. However, we have no guarantee that the limit is zero. This is not
surprising since we are working with a generic Markov chain and then the
long-term behavior of the chain may depend on the initial state (equivalently,
the stationary distribution may not be unique).

Corollary. Let p(n) be the state vector at time n; if we let Π be a stationary


distribution, we have

D(p^(n) || Π) is a monotonically non-increasing sequence in n,

thus implying that any state distribution gets closer and closer to each sta-
tionary distribution as time passes.
³ Keep in mind that when we say "Markov chain" we implicitly assume the time
invariance of the chain.

As a consequence, if we also assume that the Markov chain is irreducible


and aperiodic we know that the stationary distribution is unique and there-
fore the asymptotical limit of the sequence is zero, that is D(p(n) ||Π) → 0 as
n grows.

The above corollary permits us to state the following interesting property of
Markov chains.

Property. If the stationary distribution is uniform, the entropy H(Xn) increases
as n grows.
Proof. Due to the uniformity of the stationary distribution, i.e. Πi = 1/|X|
for any i = 1, 2, ..., |X|, the relative entropy can be expressed as

D(P^(n) || Π) = Σ_i P^(n)_i log ( P^(n)_i / Πi ) = Σ_i P^(n)_i log P^(n)_i + Σ_i P^(n)_i log |X|
             = log |X| − H(Xn).   (3.38)

Therefore, the monotonic decrease of the relative entropy as n grows implies
the monotonic increase of the entropy H(Xn).
This last property has very close ties with statistical thermodynamics,
which asserts that any closed system (remember that all the microstates are
equally likely) evolves towards a state of maximum entropy or “disorder”.
If we further inspect equation (3.38), we can deduce something more than
the relation H(Xn ) → log |X |. In fact, since D(P (n) ||Π) is monotonic non-
increasing, the entropy growth with n is monotonic, meaning that the entropy
does not swing (fluctuate).
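The monotonic behavior just described can be observed directly. The Python sketch below (the particular doubly stochastic matrix is an arbitrary choice of ours) starts from a deterministic state distribution and prints D(p^(n)||Π) and H(Xn) at each step: the former decreases towards 0, the latter increases towards log2 |X|.

    import math

    def H(p):
        return -sum(x * math.log2(x) for x in p if x > 0)

    def D(p, q):
        return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

    # a doubly stochastic transition matrix: every row and every column sums to 1
    P = [[0.7, 0.2, 0.1],
         [0.1, 0.7, 0.2],
         [0.2, 0.1, 0.7]]
    uniform = [1/3, 1/3, 1/3]        # the corresponding stationary distribution

    p = [1.0, 0.0, 0.0]              # initial state distribution
    for n in range(1, 7):
        print(n, round(D(p, uniform), 4), round(H(p), 4))
        p = [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]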

We now briefly characterize the Markov processes having a uniform stationary
distribution. First, we give the following definition:

Definition. A probability transition matrix P = {Pij}, where Pij = Pr{Xn+1 =
j | Xn = i}, is called doubly stochastic if

Σ_i Pij = 1,   j = 1, 2, ...   (3.39)

and

Σ_j Pij = 1,   i = 1, 2, ... ⁴   (3.40)

⁴ The second condition is always true: in any transition matrix the entries of a fixed
row sum to 1.
36 Chapter 3, Sources with Memory

We state the following theorem without giving the proof.

Theorem. A Markov chain admits a uniform stationary distribution if and only if the transition matrix P is doubly stochastic.
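As a numerical illustration of the above statements, the following Python sketch (an addition, with an arbitrary doubly stochastic matrix chosen for the example) checks the doubly stochastic condition, iterates p^{(n+1)} = p^{(n)} P, and prints D(p^{(n)}||Π) and H(X_n), which should respectively decrease and increase with n.

    import numpy as np

    def entropy(p):
        """Entropy in bits of a probability vector (0 log 0 := 0)."""
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def kl_divergence(p, q):
        """Relative entropy D(p||q) in bits (assumes q > 0 wherever p > 0)."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    # Example doubly stochastic transition matrix (rows and columns sum to 1).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.2, 0.5]])
    assert np.allclose(P.sum(axis=1), 1) and np.allclose(P.sum(axis=0), 1)

    pi = np.ones(3) / 3                 # uniform stationary distribution
    p = np.array([0.9, 0.05, 0.05])     # arbitrary initial state distribution

    for n in range(10):
        print(f"n={n}: D(p||Pi)={kl_divergence(p, pi):.4f}  H(X_n)={entropy(p):.4f}")
        p = p @ P                       # p^{(n+1)} = p^{(n)} P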
Chapter 4

Asymptotic Equipartition
Property and Source Coding

In information theory, the asymptotic equipartition property (AEP) is a


direct consequence of the weak law of large numbers defined in statistics.
For simplicity, we confine the discussion to discrete memoryless sources.

4.1 A reminder of Statistics


According to the law of large numbers, given n independent and identically distributed (i.i.d.) random variables X_i, the sample mean (1/n) Σ_{i=1}^n X_i is close to the expected value E[X] for large values of n. The definition which is commonly adopted is that of the weak law of large numbers, where the term "weak" refers to convergence in probability. According to this type of convergence,

X̄_n = (1/n) Σ_{i=1}^n X_i → E[X]   in probability,   (4.1)

means that:

∀ε > 0, P r{|X̄n − E[X]| > ε} −→ 0 as n → ∞. (4.2)

As a reminder, the above equation is the same one that, in statistics, proves the consistency of the point estimator X̄_n, directly following from Tchebycheff's inequality.
We point out that, although in (4.1)-(4.2) we considered the mean value E[X], the convergence in probability of the sample values to the ensemble ones can be defined also for the other statistics. Indeed, the law of large numbers rules the behavior of the relative frequencies k/n (where k is the number of successes in n trials) with respect to the probability p, that is, it states that

Pr{ |k/n − p| > ε } → 0   as n → ∞.   (4.3)

Since the estimation of any statistic based on the samples depends on the behavior of the relative frequencies, the convergence in probability of the sample values to the ensemble ones can be derived from the law of large numbers.

Example (Repeated Trials).
This example illustrates the concept of the law of large numbers and at the same time introduces the concept of typical sequence, formally defined later. Let the random variable X ∈ {0, 1} model the toss of a biased coin. We set 0 = head, 1 = tail and assume that p(1) = 0.9 = 1 − p and p(0) = 0.1 = p. We aim at showing that, when the number of tosses tends to infinity, the sequences drawn from X will have 90% of tails and 10% of heads with a probability arbitrarily close to 1.
Let k = αn be the number of 0's in the n-length sequence. Being a case of repeated trials, we have

Pr{n(0) = k} = \binom{n}{k} 0.1^k 0.9^{n−k}.   (4.4)

For any fixed k, the probability of a sequence having k 0's, in the specific case 0.1^k 0.9^{n−k}, tends to 0 as n → ∞, while the binomial coefficient tends to ∞^1, leading to an indeterminate form. By Stirling's formula^2 it is possible to prove that the probability that the number of 0's is close to k = pn = 0.1n tends to 1. All the corresponding sequences are referred to as typical sequences. As a consequence, all the other sequences (having a significantly different number of 0's) occur with an approximately zero probability. We say that the sequences drawn from the source have "the correct frequencies", where the term correct means that the relative frequencies coincide with the true probabilities.
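As a quick numerical check of this behavior (a sketch added here, with arbitrarily chosen tolerance and sample sizes), the following Python snippet estimates, for increasing n, the probability that the fraction of 0's in an n-length sequence stays within ±δ of p = 0.1; this probability approaches 1, in agreement with the law of large numbers.

    import numpy as np

    p, delta = 0.1, 0.02      # probability of a 0 and tolerance on the relative frequency
    rng = np.random.default_rng(0)

    for n in (100, 1000, 10000):
        trials = 2000
        # number of 0's in each of 'trials' sequences of length n
        k = rng.binomial(n, p, size=trials)
        prob_typical = np.mean(np.abs(k / n - p) <= delta)
        print(f"n={n}: Pr{{|k/n - p| <= {delta}}} ~ {prob_typical:.3f}")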

^1 Strictly speaking, this is not true for the all-zero sequence (k = n), which is unique and has a vanishing probability.
^2 Stirling's formula gives an approximation for factorials: n! ∼ (n/e)^n √(2πn).

4.2 Asymptotic Equipartition Property


In this section we introduce one of the most important theorems of Infor-
mation Theory which formally introduces the fundamental concept of typi-
cality.
We consider a discrete memoryless source X with alphabet X and probability
mass function p(x). We use the notation Xi to indicate the random variable
describing the outcome of the source at time i. According to the memoryless nature of the source, these variables are independent of each other.
The asymptotic equipartition property (AEP) is formalized in the following
theorem:

Theorem (AEP).
If X_1, X_2, ... are i.i.d. ∼ p(x), then

−(1/n) log p(X_1, X_2, ..., X_n) → H(X)   in probability.   (4.5)

Proof.

−(1/n) log p(X_1, X_2, ..., X_n) = −(1/n) log ∏_{i=1}^n p(X_i)
                                 = −(1/n) Σ_{i=1}^n log p(X_i)
                                 = (1/n) Σ_{i=1}^n log (1/p(X_i)).   (4.6)

By conveniently introducing the new random variables Y_i = log(1/p(X_i)), the above expression is just the sample mean of Y, i.e. (1/n) Σ_i Y_i. Therefore, by the law of large numbers we have that

−(1/n) log p(X_1, ..., X_n) → E[Y]   in probability.   (4.7)

Writing the expected value of Y explicitly yields

E[Y] = Σ_x p(x) log(1/p(x)) = H(X).   (4.8)

The above theorem allows us to give the definition of typical set and typical sequence.
Let us rewrite relation (4.5) as follows

−(1/n) log p(X_1, X_2, ..., X_n) − H(X) → 0   in probability.   (4.9)

According to the weak law of large numbers, equation (4.9) is equivalent to the following statement: ∀ε > 0, ∀δ > 0

∃N : ∀n > N,  Pr{ |−(1/n) log p(X_1, X_2, ..., X_n) − H(X)| > ε } < δ.   (4.10)

Hence, with high probability, the sequence X_1, X_2, ..., X_n satisfies the relation

H(X) − ε ≤ −(1/n) log p(X_1, X_2, ..., X_n) ≤ H(X) + ε,   (4.11)

which corresponds to the following lower and upper bound for the probability:

2^{−n(H(X)+ε)} ≤ p(X_1, X_2, ..., X_n) ≤ 2^{−n(H(X)−ε)}.   (4.12)


Definition. The typical set A_ε^{(n)} with respect to p(x) is the set of sequences x^n = (x_1, x_2, ..., x_n) ∈ X^n for which

2^{−n(H(X)+ε)} ≤ p(x_1, x_2, ..., x_n) ≤ 2^{−n(H(X)−ε)}.   (4.13)

We now give some informal insights into the properties of the typical set,
from which it is already possible to grasp the key ideas behind Shannon’s
source coding theorem.
By the above definition and according to the law of large numbers we can argue that

Pr{X^n ∈ A_ε^{(n)}} → 1   as n → ∞.   (4.14)

Then, the probability of any observed sequence will be almost surely close to 2^{−nH(X)}. A noticeable consequence is that the sequences inside the typical set, i.e. the so called typical sequences, are (approximately) equiprobable. Besides, since a sequence lying outside the typical set will almost never occur for large n, the number of typical sequences k can be roughly estimated as follows:

2^{−nH(X)} · k ≅ 1   ⇒   k ≅ 2^{nH(X)}.

It is easy to understand that, in the coding operation, these are the sequences that really matter. For instance, if we consider binary sources, the above relation states that nH(X) bits suffice on the average to describe n binary random variables, leading to a considerable bit saving (remember that H(X) < 1!). These considerations concerning the typical set are the essence of Shannon's source coding theorem and will be made rigorous in the following section. The theorem below gives a rigorous formalization of the properties of A_ε^{(n)}, and is a direct consequence of the AEP.

Theorem (Typical Set).
Let X ∼ p(x) be a DM source and A_ε^{(n)} the corresponding typical set as defined above:

1. ∀δ, ∀ε, n large, Pr{A_ε^{(n)}} ≥ 1 − δ;

2. ∀ε, |A_ε^{(n)}| ≤ 2^{n(H(X)+ε)}   ∀n;

3. ∀δ, ∀ε, n large, |A_ε^{(n)}| ≥ (1 − δ) 2^{n(H(X)−ε)}.

Proof.
1. It directly follows from equation (4.10), which can also be written as follows: ∀ε > 0, ∀δ > 0

∃N : ∀n > N,  Pr{ |−(1/n) log p(X_1, ..., X_n) − H(X)| < ε } > 1 − δ.   (4.15)

Since the expression in curly braces defines the typical set, equation (4.15) proves point 1.

2.

1 = Σ_{x^n ∈ X^n} p(x^n)
  ≥ Σ_{x^n ∈ A_ε^{(n)}} p(x^n)
  ≥ Σ_{x^n ∈ A_ε^{(n)}} 2^{−n(H(X)+ε)}      (a)
  = |A_ε^{(n)}| · 2^{−n(H(X)+ε)},   (4.16)

where (a) follows from the definition of the typical set.

Notice that the proof of point 2 does not involve the bounds on the probability of the observed sequence and then holds for any n.
3. ∀δ > 0 and for large n, from point 1 we have

1 − δ ≤ Pr{A_ε^{(n)}}
      = Σ_{x^n ∈ A_ε^{(n)}} p(x^n)
      ≤ Σ_{x^n ∈ A_ε^{(n)}} 2^{−n(H(X)−ε)}
      = |A_ε^{(n)}| · 2^{−n(H(X)−ε)}.   (4.17)

4.3 Source Coding


In this chapter we state and prove Shannon’s source coding theorem for
the discrete memoryless case. We also discuss the extension of Shannon’s
theorem to the source with memory.

4.3.1 Memoryless Source Coding


The celebrated Shannon’s source coding theorem, also known as noiseless
source coding theorem, refers to the case of discrete memoryless sources. The
theorem consists of two distinct parts: the direct theorem and the converse
theorem.
Before stating the theorem, we need to give the definition of code and ex-
tended code. Let X be the source we want to compress with alphabet X and
pmf PX . Working on symbols, we define a coding as a mapping procedure
from the source alphabet X to a code alphabet C. Due to its use in computer
science, we consider a binary code alphabet.
Definition (Binary code). A binary code C associates to each source symbol x a string of bits^3, i.e. it is a mapping C : X → {0, 1}*. For each x, C(x) denotes the associated codeword.

Definition (Expected length). The expected length L of a code C for a random variable X with probability mass function p(x) is given by

L = Σ_{x ∈ X} p(x) l(x),   (4.18)

where l(x) is the length of the codeword associated with x.

^3 A string of bits is an element of {0, 1}*, i.e. the set of all binary strings.

We now define a property that a code should have.

Property (Non-singular code). A code is said to be invertible or nonsingular if each symbol is mapped into a different string, that is

a_i ≠ a_j  ⇒  C(a_i) ≠ C(a_j),   a_i, a_j ∈ X.   (4.19)

Since we transmit and store sequences of symbols, the above property


does not guarantee the unambiguous description of the sequences and then
their correct decodability. We need to define a further property which passes
through the following definition:

Definition (Extended Code). The n-th extension C* of a code C is the mapping from n-length strings of elements in X to binary strings, that is C* : X^n → {0, 1}*. C* is defined as the concatenation of the codewords of C:

C*(x_1 x_2 ... x_n) = C(x_1)C(x_2) · · · C(x_n).   (4.20)

Property (Uniquely decodable code). A code is uniquely decodable if its ex-


tension is nonsingular.

We now start by enunciating the direct part of Shannon’s source coding


theorem.

Theorem (Shannon's Source Coding: direct).
Let X be a discrete memoryless source, with alphabet X, whose symbols are drawn according to a probability mass function p(x), and let x^n be an n-length sequence of symbols drawn from the source. Then, ∀ε > 0 and sufficiently large n, ∃ C(x^n) invertible s.t.

L/n ≤ H(X) + ε,   (4.21)

where L denotes the average length of the codewords, i.e. E[l(C(x^n))], and L/n is the code rate, i.e. the average number of bits per symbol.

Proof. The proof comes directly from the AEP theorem and the Typical Set theorem.
We search for a code having a rate which satisfies relation (4.21). In order to prove the theorem it is sufficient to find one such code.
Let us construct a code giving a short description of the source. We divide all sequences in X^n into two sets: the typical set A_ε^{(n)} and the complementary set A_ε^{(n),c}.
As to A_ε^{(n)}, we know from the AEP theorem that the sequences x^n belonging to it are (approximately) equiprobable and then we can use the same codeword length l(x^n) (l(C(x^n))) for each of them. We represent each typical sequence by giving the index of the sequence in the set. Since there are at most 2^{n(H(X)+ε)} sequences in A_ε^{(n)}, the indexing requires no more than n(H + ε) + 1 bits, where the extra bit is necessary because n(H + ε) may not be an integer. Spending another bit 0 as a flag, so as to make the code uniquely decodable, the total number of bits is at most n(H + ε) + 2. To sum up,

x^n ∈ A_ε^{(n)}  ⇒  l(x^n) ≤ n(H + ε) + 2.   (4.22)

We stress that the order of indexing is not important, since it does not affect the average length.
As to the encoding of the non-typical sequences, Shannon's idea is "to squander". Since the AEP theorem asserts that, as n tends to infinity, the sequences in the non-typical set A_ε^{(n),c} will almost never occur, it is not necessary to look for a short description. Specifically, Shannon suggested indexing each sequence in A_ε^{(n),c} by using no more than n log |X| + 1 bits (as before, the additional bit takes into account the fact that n log |X| may not be an integer). Observe that n log |X| bits would suffice to describe all the sequences (|A_ε^{(n)}| + |A_ε^{(n),c}| = |X|^n). Therefore, by using such a coding, we waste a lot of bits (surprisingly, this is good enough to yield an efficient encoding). Prefixing the indices by 1, we have

x^n ∈ A_ε^{(n),c}  ⇒  l(x^n) ≤ n log |X| + 2.   (4.23)

The description of the source provided by the above code is depicted in Figure
4.1. By using this code we now prove the theorem.
The code is obviously invertible. We now compute the average length of the codewords:

E[l(x^n)] = Σ_{x^n} p(x^n) l(x^n)
          = Σ_{x^n ∈ A_ε^{(n)}} p(x^n) l(x^n) + Σ_{x^n ∈ A_ε^{(n),c}} p(x^n) l(x^n)
          ≤ (n(H(X) + ε) + 2) · Σ_{x^n ∈ A_ε^{(n)}} p(x^n) + (n log |X| + 2) · Σ_{x^n ∈ A_ε^{(n),c}} p(x^n).   (4.24)
[Figure 4.1: Source code using the typical set. The typical set (probability close to 1) is described with n(H + ε) + 2 bits, the non-typical set with n log |X| + 2 bits.]

For any positive value δ, if n is sufficiently large, Pr{A_ε^{(n)}} ≥ 1 − δ; then expression (4.24) is upper bounded as follows:

E[l(x^n)] ≤ (n(H(X) + ε) + 2) + δ · (n log |X| + 2).   (4.25)

Then,

L/n ≤ H(X) + ε + 2/n + δ·(2/n) + δ log |X|
    = H(X) + ε',   (4.26)

where ε' = ε + 2/n + δ·(2/n) + δ log |X| can be made arbitrarily small for an appropriate choice of δ and n.
It is clear that, since the non-typical sequences will almost never occur for large n, the lengths of their codewords have a negligible impact on the average codeword length.
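A minimal sketch of the two-class code used in the proof is given below (an illustration; the source, block length and variable names are chosen for the example). Typical sequences are indexed with about n(H + ε) bits and prefixed by '0'; all the other sequences are indexed with about n log|X| bits and prefixed by '1'.

    import itertools, math

    p = {0: 0.2, 1: 0.8}                       # example binary DMS
    H = -sum(q * math.log2(q) for q in p.values())
    n, eps = 10, 0.1

    def is_typical(x):
        px = math.prod(p[s] for s in x)
        return 2 ** (-n * (H + eps)) <= px <= 2 ** (-n * (H - eps))

    seqs = list(itertools.product(p, repeat=n))
    typ = [x for x in seqs if is_typical(x)]
    atyp = [x for x in seqs if not is_typical(x)]

    lt = math.ceil(n * (H + eps)) + 1          # index length for typical sequences
    la = math.ceil(n * math.log2(len(p))) + 1  # index length for non-typical sequences

    code = {}
    for i, x in enumerate(typ):
        code[x] = '0' + format(i, f'0{lt}b')   # flag 0 + index within the typical set
    for i, x in enumerate(atyp):
        code[x] = '1' + format(i, f'0{la}b')   # flag 1 + index within the complementary set

    avg_len = sum(math.prod(p[s] for s in x) * len(code[x]) for x in seqs)
    print(f"average rate L/n = {avg_len / n:.3f} bits/symbol, H(X) = {H:.3f}")

For small n the rate is still far from H(X); the point of the theorem is that the gap vanishes as n grows.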

The above theorem states that the code rate L/n can get arbitrarily close to the entropy of the source. Nevertheless, in order to state that it is not possible to go below this value it is necessary to prove the converse theorem. The converse part shows that if we use an average codeword length even slightly below the entropy we are no longer able to decode.

Theorem (Shannon's Source Coding: converse).
Let X ∼ p(x) be a DMS with alphabet X. Let us indicate by P(err) the probability of not being able to decode, that is the probability of incurring a non-invertible mapping. Then, for any ν > 0 and any coding scheme C(x^n) such that for large n L/n = H(X) − ν, ∀δ > 0

P(err) ≥ 1 − δ − 2^{−νn/2}.   (4.27)

Proof. Since the average number of bits used for an n-length sequence is n(H(X) − ν), we can encode at most 2^{n(H(X)−ν)} sequences. Let us search for a good choice of the sequences to index; the best thing to do is to try to encode at least the sequences in A_ε^{(n)}. As to the non-typical sequences, if there are no bits left, we assign to each of them the same codeword (in this way the code loses the invertibility property, but only for non-typical sequences). However, it is easy to argue that the number of sequences we can encode through this procedure is less than the total number of typical sequences. In order to show that, let us set ε = ν/2, and then consider A_{ν/2}^{(n)}. We evaluate the probability of a correctly encoded sequence^4 (i.e. the probability of falling into the set of the correctly encoded sequences), namely P(corr), which has the following expression:

P(corr) = Σ_{x^n ∈ A_{ν/2}^{(n)} : x^n ↔ C(x^n)}^5  p(x^n)
        ≤ 2^{n(H(X)−ν)} · 2^{−n(H(X)−ν/2)} = 2^{−nν/2},   (4.28)

where the number of terms of the sum was upper bounded by the number of sequences we can index and p(x^n) by the upper bound on the probability of a typical sequence. Then, by considering that ∀δ > 0 and large n Pr{A_{ν/2}^{(n)}} ≥ 1 − δ, the probability that the source emits a sequence of A_{ν/2}^{(n)} which cannot be correctly coded is

P(err) ≥ 1 − δ − 2^{−nν/2}.   (4.29)

Notice that relation (4.29) is actually an underestimation of the bound for P(err), since we have not considered the contribution to P(err) of the non-typical sequences (which cannot be correctly decoded either). However, we know that they have an arbitrarily small probability for large n, so they do not have a significant impact on P(err).

^4 With the term 'correctly' we mean 'in an invertible way'.
^5 The double arrow indicates that the mapping is one to one.

The couple of theorems just proved gives a rigorous formalization of our


previous discussion. Accordingly, the entropy H(X) gives the measure of the
number of bits per symbol required on the average to describe a source of
information.
Some considerations follow from Shannon’s source coding theorem. First of
all, the coding procedure used to prove the theorem is highly impractical.
Secondly, Shannon asserts that it is possible to reach the entropy as long as
we take n sufficiently large, that is if we jointly encode long sequences of
symbols. This is another issue that raises a number of practical problems.

4.3.2 Extension to the Sources with Memory


In Section 3.2 we discussed sources with memory and defined the entropy rate. We now want to discuss the encoding limits, in terms of average length, for sources with memory. We prove that for stationary sources the entropy rate takes the role of the entropy in the memoryless case, with some subtle differences.

Extended and Adjoint Sources

To discuss information-theoretic concepts it is often useful to consider blocks rather than individual symbols, each block consisting of k subsequent source symbols. If we let X_n denote the source of information (with memory) with alphabet X, each such block can be seen as being produced by an extended source X_n^k with source alphabet X^k. Given the extended source, it is possible to define the corresponding memoryless source X_n^{k,*}, named adjoint source, by confining the memory to k. Then, the k-length blocks drawn from the adjoint source X_n^{k,*} are independent of each other.

Before stating the coding theorem for sources with memory we give the following lemma.

Lemma (Behavior of the average entropy).
For any stationary source with memory X_n, the sequence of values H(X_k, ..., X_1)/k tends to H(X_n) from above as k → ∞, that is, for large k

H(X_k, ..., X_1)/k − H(X_n) ≥ 0.   (4.30)

Proof. By applying the chain rule to the joint entropy twice, we have

H(Xk , ..., X1 ) = H(Xk−1 , ..., X1 ) + H(Xk |Xk−1 , ..., X1 ) (4.31)


= H(Xk−2 , ..., X1 ) + H(Xk−1 |Xk−2 , ..., X1 ) +
+H(Xk |Xk−1 , ..., X1 ). (4.32)

The stationarity assumption permits us to consider shifts of the random variables; hence, continuing from (4.32),

= H(X_{k−2}, ..., X_1) + H(X_k|X_{k−1}, ..., X_2) + H(X_k|X_{k−1}, ..., X_1)   (4.33)
≥ H(X_{k−2}, ..., X_1) + 2H(X_k|X_{k−1}, ..., X_1),   (4.34)

where inequality (4.34) is obtained by adding X_1 to the conditioning variables of the second entropy term in (4.33) (remember that conditioning reduces entropy). By iterating the same process k − 2 more times we obtain

H(X_k, ..., X_1) ≥ k H(X_k|X_{k−1}, ..., X_1).   (4.35)

Deriving from relation (4.35) an upper bound for the conditional entropy H(X_k|X_{k−1}, ..., X_1) and substituting it in the expression in (4.31), we get

H(X_k, ..., X_1) = H(X_{k−1}, ..., X_1) + H(X_k|X_{k−1}, ..., X_1)
                 ≤ H(X_{k−1}, ..., X_1) + H(X_k, ..., X_1)/k.   (4.36)

Moving the term H(X_k, ..., X_1)/k to the left-hand side of the inequality and dividing both sides by k − 1 yields

H(X_k, ..., X_1)/k ≤ H(X_{k−1}, ..., X_1)/(k − 1),   (4.37)

which proves that the sequence of the mean entropies over k outputs is non-increasing with respect to k.
Then, equation (4.30) follows from the definition of the entropy rate.

Theorem (Source Coding with Memory: direct).
Let X_n be a stationary source and let x^n be a sequence of n symbols emitted by the source. Then, ∀ε > 0 and for sufficiently large n, ∃ C(x^n) s.t.

L/n ≤ H(X_n) + ε.   (4.38)
Proof. Let us consider the k-th order extension of X_n, i.e. the source X_n^k having alphabet X^k. Let X_n^{k,*} denote the adjoint source, i.e. the corresponding discrete memoryless source that we get by confining the memory to a length k^6. We can apply the noiseless source coding theorem to X_n^{k,*}. According to the direct theorem we have that: ∀ε > 0, ∃N_0 : ∀N > N_0, ∃ C(x^{(k,*),N}) s.t.

E[l(x^{(k,*),N})]/N ≤ H(X_k, X_{k−1}, ..., X_1) + ε,   (4.39)

where x^{(k,*),N} denotes an N-length sequence of blocks drawn from the memoryless source. According to the entropy rate definition, for k → ∞

H(X_k, X_{k−1}, ..., X_1) → H(X_n) · k.   (4.40)

By the definition of the entropy rate, we know that for any positive number δ, if k is large enough,

H(X_k, X_{k−1}, ..., X_1)/k ≤ H(X_n) + δ.   (4.41)

Substituting (4.41) in (4.39) yields

E[l(x^{(k,*),N})]/N ≤ k · H(X_n) + k · δ + ε.   (4.42)

Since E[l(x^{(k,*),N})]/N is the average number of bits per block, we can divide by k in order to obtain the average number of bits per symbol. Then,

E[l(x^{(k,*),N})]/(k · N) ≤ H(X_n) + δ + ε/k.   (4.43)

The product k · N is the total length of the starting sequence of symbols, i.e. k · N = n. Thus, setting ε' = δ + ε/k we have

E[l(x^n)]/n ≤ H(X_n) + ε',   (4.44)

and the theorem is proved.
and the theorem is proved.

An important aspect which already comes out from the direct theorem is that, in order to reach the entropy rate, we have to encode blocks of symbols. However, with respect to the memoryless source coding we now have two parameters, the block length k and the number of blocks N, which both have to be large (tend to infinity) to approach the entropy rate.
We now consider the converse theorem.

^6 It must be pointed out that the joint entropy H(X_k, X_{k−1}, ..., X_1) is not the sum of the single entropies because of the presence of memory among the symbols within the block.

Theorem (Source Coding with Memory: converse).
Given a stationary source X_n, the average number of bits per symbol required by an invertible code cannot be less than the entropy rate H(X_n).

Proof. The proof is given by contradiction. Let us suppose that for large enough n it is possible to encode with a rate less than H(X_n), say H(X_n) − ν for some arbitrarily small ν > 0. Given such a code, we can apply the same mapping to the memoryless source X_n^{k,*}. In other words, given a sequence drawn from the source X_n^{k,*}, we consider it as if it were generated by the source with memory and we assign to it the corresponding codeword. Then, we get the following expression for the average number of bits per block,

E[l(x^{(k,*)})] = k · (H(X_n) − ν)
               ≤ k · (H(X_k, ..., X_1)/k) − k · ν
               = H(X_k, ..., X_1) − k · ν,   (4.45)

where the inequality follows from the Lemma stating the behavior of the average entropy. By looking at equation (4.45) we see the expected contradiction. Equation (4.45) says that it is possible to code the output of a DMS, namely X_n^{k,*}, at a rate lower than the entropy. This fact is in contrast with the noiseless source coding theorem.

4.4 Data Compression


Ever since Shannon proved the noiseless coding theorem, researchers have tried to develop practical codes which achieve Shannon's limit.
We know from the previous section that any practical source code must be uniquely decodable. Below, we define a more stringent property.

Property (Instantaneous code). A uniquely decodable code is said to be a prefix code or instantaneous code if no codeword is a prefix of another codeword^7.

^7 A codeword c_1 is a prefix of another codeword c_2 when the string of bits of the first codeword matches exactly the first l(c_1) bits of the second codeword.
[Figure 4.2: Relation among the classes of codes: instantaneous codes ⊂ uniquely decodable codes ⊂ nonsingular codes ⊂ all codes.]

X | Singular | Nonsingular, Not Uniquely Decodable | Uniquely Decodable, Not Instantaneous | Instantaneous
a | 0  | 0  | 1    | 0
b | 1  | 00 | 10   | 10
c | 11 | 1  | 100  | 110
d | 0  | 11 | 1000 | 111

Table 4.1: Examples of codes.

If a code is instantaneous it is possible to decode each codeword in a


sequence without reference to succeeding code symbols. This allows the re-
ceiver to decode immediately at the end of the codeword, thus reducing the
decoding delay. In practice, we always use instantaneous codes. Figure 4.2
illustrates the relations between the classes of codes. Some examples of var-
ious kinds of codes are given in Table 4.1.
In order to highlight the differences between the analysis developed in this chapter and Shannon's source coding theorem, we recall that Shannon refers, more generally, to uniquely decodable codes (indeed, he adopts nonsingular block codes).

4.4.1 Kraft Inequality

In designing codes, our concern is the length of the codewords rather than the specific codewords, since the length is the parameter which affects the transmission rate and the storage requirements. Our goal is indeed to design instantaneous codes with the minimum length. From the definition of prefix codes it is easy to argue that it may not always be possible to find a prefix code given a set of codeword lengths. For instance, it is easy to see that it is not possible to design a prefix code with lengths 2,3,2,2,2 (we are forced to use all the pairs 00,01,10,11 already for the four length-2 codewords, so any length-3 codeword would have one of them as a prefix).
The question is then: what codeword lengths can we use to design prefix codes?

Theorem (Kraft's inequality).
A necessary and sufficient condition for the existence of a prefix code with codeword lengths l_1, l_2, ..., l_{|X|} is that

Σ_{i=1}^{|X|} 2^{−l_i} ≤ 1.   (4.46)

Proof. Consider a binary tree, as the one depicted in Figure 4.3, in which each node has 2 children. The branches of the tree represent the code symbols, 0 or 1. Then, each codeword is represented by a node or a leaf on the tree, and the path from the root traces out the symbols of the codeword. The prefix code property implies that in the tree no codeword is an ancestor of any other codeword, that is, the presence of a codeword eliminates all its descendants as possible codewords. Then, for a prefix code, each codeword is represented by a leaf.

• (Necessary condition): for any instantaneous code, the codeword lengths satisfy the Kraft inequality.

Let l_max be the length of the longest codeword (i.e. the depth of the tree). A codeword at level l_i has 2^{l_max − l_i} descendants at level l_max, which cannot be codewords of a prefix code and must then be removed from the tree. For a prefix code with the given lengths (l_1, l_2, ..., l_{|X|}) to exist, the overall number of leaves that we remove from the tree must be no more than those available (2^{l_max}). In formula:

Σ_{i=1}^{|X|} 2^{l_max − l_i} ≤ 2^{l_max},   (4.47)

which, divided by 2^{l_max}, yields

Σ_{i=1}^{|X|} 2^{−l_i} ≤ 1.   (4.48)

[Figure 4.3: Code tree for Kraft inequality.]

• (Sufficient condition): given a set of lengths satisfying Kraft's inequality, there exists an instantaneous code with these codeword lengths.

Let us construct the code with the given set of lengths. For each length l_i, we consider 2^{l_max − l_i} leaves of the tree, label the root of the corresponding subtree (which corresponds to a node at depth l_i) as the codeword i, and remove all its descendants from the tree. This procedure can be repeated for all the lengths if there are enough leaves, that is if

Σ_i 2^{l_max − l_i} ≤ 2^{l_max}.   (4.49)

If Kraft's inequality holds, then we can construct a code which is prefix.

Then, Kraft's inequality tells us whether, given a set of codeword lengths, a prefix code exists. Note, however, that it does not tell us whether a given code satisfying it is instantaneous.
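The following sketch (an illustration; function names are ours) checks Kraft's inequality for a set of candidate lengths and, when it holds, builds a prefix code by assigning codewords in order of increasing length, mimicking the tree-pruning argument of the proof.

    def kraft_sum(lengths):
        return sum(2 ** -l for l in lengths)

    def build_prefix_code(lengths):
        """Return a list of binary codewords with the given lengths,
        or None if Kraft's inequality is violated."""
        if kraft_sum(lengths) > 1:
            return None
        code, next_val, prev_len = [], 0, 0
        for l in sorted(lengths):
            next_val <<= (l - prev_len)      # extend the current codeword to length l
            code.append(format(next_val, f'0{l}b'))
            next_val += 1                    # move to the next free node at this depth
            prev_len = l
        return code

    print(build_prefix_code([2, 2, 2, 3, 3]))   # Kraft sum = 1   -> e.g. 00, 01, 10, 110, 111
    print(build_prefix_code([2, 2, 2, 2, 3]))   # Kraft sum > 1   -> None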

Given a source X ∼ p(x), we seek the minimum average length of a prefix code for the source (C : X → {0, 1}*). To do so, we have to solve the following constrained optimization problem:

min_{l(x)} Σ_{x ∈ X} p(x) l(x),   (4.50)

subject to

Σ_x 2^{−l(x)} ≤ 1,
l(x) ∈ N.   (4.51)

Minimization of (4.50) subject to the constraints in (4.51) is hard to solve. In the next section we see, for a particular choice of codeword lengths, how far the average length remains from the minimum value (i.e. H(X)).

4.4.2 Alternative proof of Shannon's source coding theorem for instantaneous codes

It is interesting to compare the minimum average length L for instantaneous codes with the entropy of the source. We know from Shannon's theorem that L is surely greater than or at most equal to the entropy.

Property. For a dyadic source^8 X it is possible to build an instantaneous code with average length L = H(X).

^8 In a dyadic source the probabilities of the symbols are negative integer powers of 2, that is p_i = 2^{−α_i}, (α_i ∈ N).

Proof.

L − H(X) = Σ_i p_i l_i + Σ_i p_i log p_i
         = Σ_i p_i log 2^{l_i} + Σ_i p_i log p_i
         = Σ_i p_i log (p_i / 2^{−l_i})
         ≥ log e · Σ_i p_i (1 − 2^{−l_i}/p_i)
         = log e · (Σ_i p_i − Σ_i 2^{−l_i}) ≥ 0,   (4.52)

where the first inequality follows from log x ≥ log e · (1 − 1/x) and the last one from Kraft's inequality, since Σ_i p_i = 1.
If the source is dyadic, l_i = log_2(1/p_i) belongs to N for each i, and then, by using these lengths for the codewords, the derivation in (4.52) holds with equality.

What if the source is not dyadic? The following property tells us how far from H(X) the minimum average codeword length is (at most) in the general case.

Theorem (Average length).
For any source X, there exists a prefix code with average length satisfying

H(X) ≤ L ≤ H(X) + 1.   (4.53)

Proof. The left-hand side has already been proved in (4.52). In order to prove the right-hand side, let us assign the lengths l_i according to a round-off approach, i.e. by using the following approximation:

l_i = ⌈log(1/p_i)⌉ ≤ log(1/p_i) + 1.   (4.54)

These lengths satisfy Kraft's inequality, since Σ_i 2^{−⌈log(1/p_i)⌉} ≤ Σ_i p_i = 1, so a prefix code with such lengths exists. The average codeword length of this code is

L = Σ_i p_i l_i ≤ Σ_i p_i (log(1/p_i) + 1) = H(X) + 1.   (4.55)

Since this code, built by means of the round-off approximation, is only a particular choice, for the minimum-length code we surely have L ≤ H(X) + 1.
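As a quick numerical check (a sketch with an arbitrary non-dyadic pmf), the round-off lengths l_i = ⌈log_2(1/p_i)⌉ can be computed for a given pmf and compared with the entropy, verifying H(X) ≤ L ≤ H(X) + 1.

    import math

    p = [0.45, 0.25, 0.15, 0.10, 0.05]          # example non-dyadic pmf
    lengths = [math.ceil(math.log2(1 / pi)) for pi in p]

    H = -sum(pi * math.log2(pi) for pi in p)
    L = sum(pi * li for pi, li in zip(p, lengths))

    assert sum(2 ** -l for l in lengths) <= 1   # the lengths satisfy Kraft's inequality
    print(f"H(X) = {H:.3f}, L = {L:.3f}, H(X)+1 = {H + 1:.3f}")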

So far we have assumed that each source symbol is encoded separately. According to Shannon's theorem, to get closer to the entropy we must encode blocks of symbols together. Let X^k denote the extended source. Now, C maps each k-length block of source symbols into a string of bits, that is C : X^k → {0, 1}*. Then, the average codeword length in bits per symbol is L_k/k, L_k being the average codeword length for the k-th extended source.

Theorem (Instantaneous Source Coding).
For a memoryless source X, there exists an instantaneous code with average length L_k satisfying

H(X) ≤ L_k/k ≤ H(X) + 1/k.   (4.56)
Proof. Let L*_k be the minimum average length for a code of the extended source X^k. Applying the theorem on the average length to the extended source yields

H(X^k) ≤ L*_k ≤ H(X^k) + 1.   (4.57)

Since the source is memoryless, H(X^k) = kH(X); considering the average number of bits per symbol spent for the encoding, i.e. L*_k/k, equation (4.56) is proved.

As expected, for any source X, when k → ∞ we have that L_k/k → H(X). Then, requiring the code to be instantaneous (rather than merely uniquely decodable) does not change the minimum number of bits per symbol we have to spend for the lossless encoding of the source. Again, in order to reach the entropy value H(X) for a generic source we have to code long blocks of symbols and not separate symbols. The theorem can then be seen as a formulation of the Shannon coding theorem for instantaneous codes.

Coding of Sources with Memory

We want to restate the above theorem for sources with memory.
Let X_n be a stationary source and X_n^k the corresponding extended source. Let L_k denote the average length of a code for the extended source.

Theorem (Instantaneous Coding for Sources with Memory).
Given a source with memory X_n, there exists an instantaneous code for its k-th extension satisfying

H(X_n) ≤ L_k/k ≤ H(X_n) + ε + 1/k,   (4.58)
where ε is a positive quantity which can be taken arbitrarily small for large k.

Proof. If we consider the adjoint source X_n^{k,*}, we have the following bounds for the length of the optimum code:

H(X^{k,*}) ≤ L_k ≤ H(X^{k,*}) + 1,   (4.59)

and then

H(X_k, ..., X_1) ≤ L_k ≤ H(X_k, X_{k−1}, ..., X_1) + 1.   (4.60)

Dividing by k yields

H(X_k, ..., X_1)/k ≤ L_k/k ≤ H(X_k, ..., X_1)/k + 1/k.   (4.61)

The proof of the lower bound in (4.58) directly follows from the Lemma on the behavior of the average entropy in Section 4.3.2 (H(X_k, ..., X_1)/k ≥ H(X_n)), while in order to prove the upper bound we exploit the definition of the entropy rate: for any ε > 0, there exists k such that H(X_k, ..., X_1)/k ≤ H(X_n) + ε. Then, relation (4.58) holds.

From the above theorem it is evident that the benefit of coding blocks of symbols is twofold:

• the round-off to the next integer number, which costs 1 bit, is spread over k symbols (this is the same benefit we had for memoryless sources);

• the ratio H(X_k, ..., X_1)/k decreases as k increases (while in the memoryless case H(X_k, ..., X_1)/k = H(X)).

Therefore, we argue that coding blocks of symbols rather than individual symbols is even more necessary when we deal with sources with memory, since it leads to a great gain in terms of bit saving.
Chapter 5

Channel Capacity and Coding

In the previous chapters we dealt with the problem of source coding. Once encoded, the information must be transmitted through a communication channel to reach its destination. This chapter is devoted to the study of this second step of the communication process.

5.1 Discrete Memoryless Channel


Each communication channel is characterized by the relation between the
input and the output. For simplicity, throughout the analysis, we consider
only discrete time channels. We know that, from an information theory per-
spective, the signals carry information and thus have a random nature; specifically, they are stochastic processes x(k, t). According to Shannon's
sampling theorem, which also holds for random signals, if the signal band-
width is limited, we can consider its samples1 and then we can assume that
the channel is discrete in time. The sampling of the stochastic process yields
at the input of the channel the sequence of random variables x(k, nT ), as de-
picted in Figure 5.1. To ease the notation, we refer to the sequence x(k, nT )
as a sequence of random variables Xn , omitting the dependence on k. Clearly,
the channel input can be seen as the outcome of an information source.
As to the values assumed by each random variable Xn , if the input source
has a finite alphabet (|X | < ∞) we have a discrete channel, a continuous
channel otherwise.

1
As a matter of fact the requirement of limited bandwidth is not necessary due to the
presence of the channel which acts itself as bandwidth limiter.

59
60 Chapter 5, Channel Capacity and Coding

X1 , X2 , X3 , .... Y1 , Y2 , Y3 , ....
C

Figure 5.1: Discrete time channel. The input sequence is the sampling the stochas-
tic process x(k, t) with sampling step T .

5.1.1 A Mathematical Model for the channel


There are many factors, several of which of a random nature, that in a
physical channel cause the output to be different from the input, e.g. attenu-
ation, multipath, noise. Then, the input-output relation in a communication
channel is, generally, a stochastic relation.

Definition. A discrete channel is a statistical model with an input Xn and


an output Yn which can be seen as a noisy version of Xn . The sequences Xn
and Yn take value in X and Y respectively (|X |, |Y| < ∞).
Given the input and the output alphabet X and Y, a channel is described
by the probabilistic relationship between the input and the output, i.e. by
the set of transition probabilities

P r{Yk = y|X1 = x1 , X2 = x2 , ...., Xk = xk } y ∈ Y, (x1 , ..., xk ) ∈ X k (5.1)

where k denotes the discrete time at which the outcome is observed. Note
that, due to causality, conditioning is restricted to the inputs preceding k
and to the k-th input itself.
The channel is said memoryless when the output symbol at a given time
depends only on the current input. In this case the transition probabilities
become:
P r{Yk = y|Xk = x} ∀y ∈ Y, ∀x ∈ X . (5.2)
and the simplified channel scheme is illustrated in Figure 5.2. Assuming a
memoryless channel greatly restricts our model since in this way we do not consider several factors, like fading, which could affect the communication due to the introduction of intersymbol interference. Such phenomena require the adoption of much more complex models.
In order to further simplify the analysis, we also assume that the channel is
stationary. Frequently2 , we can make this assumption without loss of general-
ity since the channel variability is slow with respect to the transmission rate.
In other words, during the transmission of a symbol, the statistical proper-
ties of the channel do not change significantly. Then, since the probabilistic
model describing the channel does not change over time, we can characterize the channel by means of the transition probabilities p(y|x), where y ∈ Y and x ∈ X. These probabilities can be conveniently arranged in a matrix P = {P_ij}, where

P_ij = P{y_j | x_i},   j = 1, ..., |Y|,   i = 1, ..., |X|.   (5.3)

The matrix P is called channel matrix or channel transition matrix.

^2 This is not true when dealing with mobile channels.

[Figure 5.2: Discrete memoryless channel. The output signal (r.v.) at each time instant n depends only on the input signal (r.v.) at the same time.]
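In matrix form, the channel acts on a row vector of input probabilities as p(y) = p(x) P. The short sketch below (an illustration with made-up numbers) computes the output distribution for a BSC-like channel matrix.

    import numpy as np

    # Channel matrix: rows indexed by inputs, columns by outputs; each row sums to 1.
    eps = 0.1
    P = np.array([[1 - eps, eps],
                  [eps, 1 - eps]])

    p_x = np.array([0.7, 0.3])      # example input distribution
    p_y = p_x @ P                   # output distribution: p(y) = sum_x p(x) p(y|x)
    print("p(y) =", p_y)            # -> [0.66, 0.34] for these numbers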

5.1.2 Examples of discrete memoryless channels


Noiseless binary channel
Suppose that we have a channel in which the binary input is reproduced
exactly at the output. Then, any transmitted bit is received without error.
The transition matrix is

P = [ 1  0 ]
    [ 0  1 ]   (5.4)

This is a limit case, for which the channel is no longer probabilistic. A graphical representation of the noiseless channel is given in Figure 5.3.

Noisy channel with non-overlapping outputs

This is another example in which noise does not affect the transmission, even if the channel is probabilistic. Indeed, see Figure 5.4, the output of the channel depends randomly on the input; however, the input can be exactly determined from the output and then every transmitted bit can be recovered without any error. The transition matrix is

P = [ 1/2  1/2   0    0  ]
    [  0    0   1/2  1/2 ]   (5.5)
[Figure 5.3: Noiseless binary channel.]

[Figure 5.4: Model of the noisy channel with non-overlapping outputs.]

Noisy Typewriter

This is a more realistic example. A channel input is delivered unchanged at the output with probability 1/2 and transformed into the subsequent element with probability 1/2. In this case, the transmitted signal cannot be exactly recovered from the output. Figure 5.5 illustrates the behavior of this channel; the transition matrix has the following form

P = [ 1/2  1/2   0    0   ...   0  ]
    [  0   1/2  1/2   0   ...   0  ]
    [ ...  ...  ...  ...  ...   0  ]
    [ 1/2   0   ...  ...  ...  1/2 ]   (5.6)
Binary Symmetric Channel (BSC)

The binary symmetric channel is a binary channel in which the input symbols are flipped with probability ε and left unchanged with probability 1 − ε (Figure 5.6). The transition matrix of the BSC is

P = [ 1−ε   ε  ]
    [  ε   1−ε ]   (5.7)

This channel model is used very frequently in communication engineering. Without loss of generality, we will only consider BSCs with ε < 1/2. Indeed, if ε > 1/2, we can trivially flip the input symbols, thus yielding an error probability lower than 1/2.
[Figure 5.5: Noisy typewriter.]

[Figure 5.6: Binary symmetric channel.]

Binary Erasure Channel (BEC)

This channel is similar to the binary symmetric channel, but in this case the bits are lost (erased), rather than flipped, with a given probability α. The transition matrix is

P = [ 1−α   α    0  ]
    [  0    α   1−α ]   (5.8)

The channel model is depicted in Figure 5.7.

5.2 Channel Coding


From previous chapters we know that H(X) represents the fundamental
limit on the rate at which a discrete memoryless source can be encoded. We
we will prove that a similar fundamental limit also exists for the transmission
rate over communication channels.
The main goal when transmitting information over any communication chan-
nel is reliability, which is measured by the probability of correct reception at
the output of the channel. The surprising result that we will prove in this
[Figure 5.7: Binary erasure channel.]

chapter is that reliable transmission is possible even over noisy channels, as


long as the transmission rate is sufficiently low. The existence of a funda-
mental bound on the transmission rate, proved by Shannon, is one of the
most remarkable results of information theory.
By referring to the example of the noisy typewriter in Section 5.1.2, some
interesting considerations can be made. By using only half of the inputs,
it is possible to make the corresponding outputs disjoint, and then recover
the input symbols from the output. Then, this subset of the inputs can be
transmitted over the channel with no error. This is just an example in which
the limitation that the noise causes in the communication is not on the re-
liability of the communication but on the rate of the communication. This
example provides also a first insight into channel coding: limiting the inputs
to a subset is similar to the addition of redundancy which will be performed
through channel coding.

5.2.1 Preview of the Channel Coding Theorem


BSC: a qualitative analysis
By looking at the binary symmetric channel we try to apply a similar
approach to that used for the noisy typewriter in order to determine if non-
overlapping outputs, and then transmission without error, can be obtained
in the BSC case. To this purpose, we have to consider sequences of in-
put symbols instead of single inputs. Then, we define the n-th extension
of the channel or extended channel, which is a channel having input and output alphabets X^n = {0, 1}^n and Y^n = {0, 1}^n and transition probabilities p(y^n|x^n) = ∏_{i=1}^n p(y_i|x_i). Figure 5.8 gives a schematic representation of the extended channel. Due to the dispersion introduced by the channel, a set of possible output sequences corresponds to each n-length transmitted sequence. If the sets corresponding to different input sequences were disjoint, the transmission would be error-free. This happens only with channels having non-overlapping outputs. By looking at the BSC, Figure 5.6, we see that this is not the case, but we can consider a subset of the input sequences in order to make the corresponding sets disjoint. That is, we can consider 2^k input sequences for some value k (k < n). Note that, without noise, k bits would suffice to index 2^k sequences; the n − k additional bits in each sequence correspond to the 'redundancy'. In the sequel we better formalize this concept.

[Figure 5.8: Representation of the n-th extension of the channel; the channel dispersion maps the 2^n input sequences onto sets of output sequences.]
In the BSC, according to the law of large numbers, if a binary sequence of length n (for large n) is transmitted over the channel, with high probability the output will disagree with the input at about nε positions. The number of possible ways in which it is possible to have nε errors in an n-length sequence (or the number of possible sequences that disagree with the input in nε positions) is given by

\binom{n}{nε}.   (5.9)

By using Stirling's approximation n! ≈ n^n e^{−n} √(2πn) and by applying some algebra we obtain

\binom{n}{nε} ≈ 2^{n h(ε)} / √(2πn(1 − ε)ε).   (5.10)
Relation (5.10) gives an approximation of the number of sequences in each output set. Then, for each block of n inputs, there exist roughly 2^{nh(ε)} highly probable corresponding output blocks. Note that if ε = 1/2, then h(ε) = 1 and the entire output set would be required for the error-free transmission of only one input sequence.
On the other hand, by referring to the output of the channel, regarded as a source, the total number of highly probable sequences is roughly 2^{nH(Y)}. Therefore, the maximum number of input sequences that may produce almost
non-overlapping output sets is at most equal to

M = 2^{nH(Y)} / ( 2^{nh(ε)} / √(2πn(1 − ε)ε) ).   (5.11)

As a consequence, the maximum number of information bits that can be correctly transmitted is

k = log_2( 2^{n(H(Y)−h(ε))} · √(2πn(1 − ε)ε) ).   (5.12)

Then, the number of bits that can be transmitted each time, i.e. the transmission rate per channel use, is:

R = k/n = log_2( 2^{n(H(Y)−h(ε))} · √(2πn(1 − ε)ε) ) / n.   (5.13)
Finally, as n → ∞, R → H(Y ) − h(ε).
A close inspection of the limit expression for R reveals that we still have a degree of freedom that can be exploited to maximize the transmission rate; it consists in the input probabilities p(x), which determine the values of p(y) (remember that the transition probabilities of the channel are fixed by the stationarity assumption) and then H(Y). In the sequel we look for the input probability distribution maximizing H(Y), giving the maximum transmission rate. Since Y is a binary source, the maximum of H(Y) is 1, which is obtained when the input symbols are equally likely. So, the maximum transmission rate is R_max = 1 − h(ε).

Observation.
The quantity 1 − h(ε) is exactly the maximum value of the mutual information between the input and the output for the binary symmetric channel (BSC), that is

max_{p_X(x)} I(X; Y) = 1 − h(ε).   (5.14)

In fact, given the input bit x, the BSC behaves as a binary source, giving at the output the same bit with probability 1 − ε. Thus, we can state that H(Y|X) = h(ε) and consequently I(X; Y) = H(Y) − H(Y|X) = H(Y) − h(ε), whose maximum is indeed 1 − h(ε).
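The quantity 1 − h(ε) is easily evaluated numerically. The sketch below (an illustration with arbitrary parameters) computes I(X;Y) for a BSC as a function of the input distribution and confirms that the maximum, reached for equally likely inputs, equals 1 − h(ε).

    import numpy as np

    def h(p):
        """Binary entropy function in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def mutual_information_bsc(px0, eps):
        """I(X;Y) = H(Y) - H(Y|X) for a BSC with crossover probability eps."""
        py0 = px0 * (1 - eps) + (1 - px0) * eps
        return h(py0) - h(eps)

    eps = 0.1
    best = max((mutual_information_bsc(q, eps), q) for q in np.linspace(0.01, 0.99, 99))
    print(f"max I(X;Y) = {best[0]:.4f} at p(x=0) = {best[1]:.2f}")
    print(f"1 - h(eps) = {1 - h(eps):.4f}")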
[Figure 5.9: Communication channel: a message i ∈ {1, ..., M} is encoded into X^n, sent through the channel p(y|x), and the decoder produces an estimate î of the message from Y^n.]

Qualitative analysis of a general discrete memoryless channel


The previous analysis explains the essence of Shannon’s theorem on the
channel coding by focusing specifically on the binary symmetric channel.
In order to extend the previous analysis to a generic channel we need some
clarifications. Firstly, we note that when we refer to sets of outputs we do not
mean necessarily a compact set. Given an input, the corresponding output
sequence may be scattered throughout the whole space Y n , depending on the
behavior of the channel. Secondly, the output sets in a general channel have
usually different sizes since the channel is not symmetric.
We can affirm that, given an input sequence x^n, the number of possible output sequences y^n is approximately 2^{nH(Y|x^n)}, with high probability. This is indeed the approximate number of typical sequences with respect to the distribution p(y|X^n = x^n). By varying the input sequence x^n, we can consider the mean number of output sequences 2^{nH(Y|X)}. Since the total number of typical sequences for the source Y is still 2^{nH(Y)}, it follows that the maximum number of disjoint sets is 2^{n(H(Y)−H(Y|X))} = 2^{nI(X;Y)}. Accordingly, we can correctly transmit I(X; Y) information bits per channel use. By properly choosing the prior probabilities, we directly have the following expression for the maximum achievable rate:

R_max = max_{p_X(x)} I(X; Y).   (5.15)

This result is in agreement with the previous one for the BSC. We anticipate that the above expression represents the channel capacity.
In Section 5.2.3, we will give a rigorous formalization to the above consider-
ations by proving the noisy channel-coding theorem.

5.2.2 Definitions and concepts


Let {1, 2, ..., M } be the index set from which a message is drawn. Before
being transmitted into the channel the indexes are encoded. At the receiver
side, by observing the output of the channel the receiver guesses the index
through an appropriate decoding rule. The situation is depicted in Figure
5.9. Let us rigorously define some useful concepts, many of them already

discussed in the previous section.


Definition. A discrete memoryless channel (DMC) consists of two finite sets
X and Y and a collection of probability mass functions p(y|x), denoted by
(X , p(y|x), Y).
Definition. The n-th extension of the discrete memoryless channel corresponds to the channel (X^n, p(y^n|x^n), Y^n), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),   k = 1, 2, ..., n,   (5.16)

i.e. the output does not depend on the past inputs and outputs.
If the channel is used without feedback, i.e. if the input symbols do not depend on the past output symbols (p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1})), the channel transition probabilities for the n-th extension of the DMC can be written as

p(y^n | x^n) = ∏_{i=1}^n p(y_i | x_i).   (5.17)

We shall always implicitly refer to channels without feedback, unless stated otherwise.
Definition. An (M, n) code for the channel (X, p(y|x), Y) consists of:

1. An encoding function g : {1 : M} → X^n, which is a mapping from the index set to a set of codewords or codebook.

2. A decoding function f : Y^n → {1 : M}, which is a deterministic rule assigning a number (index) to each received vector.

Definition. Let λ_i be the error probability given that index i was sent, namely the conditional probability of error:

λ_i = Pr{f(y^n) ≠ i | x^n = g(i)}.   (5.18)

Often, we will use x^n(i) instead of g(i) to indicate the codeword associated to index i. As a consequence of the above definition, the maximal probability of error λ_max^{(n)} for an (M, n) code is defined as

λ_max^{(n)} = max_{i ∈ {1,2,...,M}} λ_i.   (5.19)

The average probability of error P_e^{(n)} for an (M, n) code is

P_e^{(n)} = (1/M) Σ_{i=1}^M λ_i,   (5.20)
where we implicitly assumed that the indexes are drawn in an equiprobable manner. We point out that the average probability of error, like the maximal one, refers to the n-length sequences.

Definition. The rate R of an (M, n) code is

R = (log M)/n   bits per channel use.   (5.21)

Definition. A rate R is said to be achievable if there exists a sequence of codes having rate R, i.e. (2^{nR}, n) codes, such that

lim_{n→∞} λ_max^{(n)} = 0.   (5.22)

Definition. The capacity of the channel is the supremum of all the achievable rates.

Jointly typical sequences and set

In order to describe the decoding process in Shannon's coding theorem it is necessary to introduce the concept of 'joint typicality'.

Definition. Given two DMSs X and Y, the set A_ε^{(n)} of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x, y) is the following set of n-long sequences

A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
    |−(1/n) log p(x^n) − H(X)| < ε,
    |−(1/n) log p(y^n) − H(Y)| < ε,
    |−(1/n) log p(x^n, y^n) − H(X, Y)| < ε },   (5.23)

where the first and the second conditions require the typicality of the sequences x^n and y^n respectively, and the last inequality requires the joint typicality of the couple of sequences (x^n, y^n).
We observe that if we did not consider the joint typicality condition, the number of possible couples of sequences in A_ε^{(n)} would be the product |A_{ε,x}^{(n)}| · |A_{ε,y}^{(n)}| ≅ 2^{n[H(X)+H(Y)]}. Intuition suggests that the total number of jointly typical sequences is approximately 2^{nH(X,Y)}, and then not all pairs of typical x^n and typical y^n are jointly typical, since H(X, Y) ≤ H(X) + H(Y). These considerations are formalized in the following theorem, which is the extension of the AEP theorem to the case of two sources.
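A joint typicality test can be sketched as follows (illustrative code; the joint pmf and the test sequences are made up): given the joint pmf p(x, y) of the input/output pair, it checks the three conditions in (5.23) for a pair of observed sequences.

    import math

    def jointly_typical(xn, yn, p_xy, eps):
        """Check the three conditions of (5.23) for the pair (x^n, y^n)."""
        n = len(xn)
        xs = {k[0] for k in p_xy}
        ys = {k[1] for k in p_xy}
        p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}   # marginal of X
        p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}   # marginal of Y
        Hx = -sum(p * math.log2(p) for p in p_x.values())
        Hy = -sum(p * math.log2(p) for p in p_y.values())
        Hxy = -sum(p * math.log2(p) for p in p_xy.values())
        cond1 = abs(-sum(math.log2(p_x[a]) for a in xn) / n - Hx) < eps
        cond2 = abs(-sum(math.log2(p_y[b]) for b in yn) / n - Hy) < eps
        cond3 = abs(-sum(math.log2(p_xy[(a, b)]) for a, b in zip(xn, yn)) / n - Hxy) < eps
        return cond1 and cond2 and cond3

    # Joint pmf of a BSC with eps = 0.1 and uniform input (example values).
    p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
    xn = (0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
    yn = (0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0)   # one flipped position
    print(jointly_typical(xn, yn, p_xy, eps=0.5))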

Theorem (joint AEP).
Let X and Y be two DMSs with marginal probabilities p_X and p_Y and let (x^n, y^n) be a couple of sequences of length n drawn from the two sources. Then:

1. Pr{A_ε^{(n)}} → 1 as n → ∞ (> 1 − δ for large n);

2. ∀ε, |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}   ∀n;

3. ∀δ, ∀ε, n large, |A_ε^{(n)}| ≥ (1 − δ) 2^{n(H(X,Y)−ε)};

4. Considering two sources X̃ and Ỹ with alphabets X and Y such that p_X̃ = p_X and p_Ỹ = p_Y but independent of each other, i.e. such that (X̃^n, Ỹ^n) ∼ p_X(x^n) p_Y(y^n), we have

Pr{(x̃^n, ỹ^n) ∈ A_ε^{(n)}} ≅ 2^{−nI(X;Y)}.   (5.24)

Formally,

∀ε > 0, ∀n,   Pr{(x̃^n, ỹ^n) ∈ A_ε^{(n)}} ≤ 2^{−n(I(X;Y)−3ε)},   (5.25)

and

∀ε > 0, ∀δ > 0, n large,   Pr{(x̃^n, ỹ^n) ∈ A_ε^{(n)}} ≥ (1 − δ) 2^{−n(I(X;Y)+3ε)}.   (5.26)

Proof. The first point says that for large enough n, with high probability, the
couple of sequences (xn , y n ) lies in the typical set. It directly follows from the
weak law of large numbers. In order to prove the second and the third point
we can use the same arguments of the proof of the AEP theorem. Instead, we
explicitly give the proof of point 4 which represents the novelty with respect
to the AEP theorem. The new sources X̃ n and Ỹ n are independent but have
the same marginals as X^n and Y^n, then

Pr{(x̃^n, ỹ^n) ∈ A_ε^{(n)}} = Σ_{(x̃^n,ỹ^n) ∈ A_ε^{(n)}} p_X̃(x̃^n) p_Ỹ(ỹ^n)
                           = Σ_{(x̃^n,ỹ^n) ∈ A_ε^{(n)}} p_X(x̃^n) p_Y(ỹ^n)
                           = Σ_{(x^n,y^n) ∈ A_ε^{(n)}} p_X(x^n) p_Y(y^n)
                           ≤ |A_ε^{(n)}| · 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}      (a)
                           ≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}      (b)
                           = 2^{−n(I(X;Y)−3ε)},   (5.27)

where inequality (a) follows from the AEP theorem, while (b) derives from point 2. Similarly, it is possible to find a lower bound for sufficiently large n, i.e.

Pr{(x̃^n, ỹ^n) ∈ A_ε^{(n)}} = Σ_{(x^n,y^n) ∈ A_ε^{(n)}} p(x^n) p(y^n)
                           ≥ (1 − δ) 2^{−n(H(X)+H(Y)−H(X,Y)+3ε)}
                           = (1 − δ) 2^{−n(I(X;Y)+3ε)}.   (5.28)

The above theorem suggests that we have to consider about 2^{nI(X;Y)} pairs before we are likely to come across a jointly typical pair.

5.2.3 Channel Coding Theorem


We are now ready to prove the other basic theorem of information theory
stated by Shannon in 1948, that is the channel coding theorem. As previ-
ously mentioned, the remarkable result of this theorem is that, even though
the channel introduces errors, the information can still be reliably sent over the channel at all rates up to the channel capacity. Shannon's key idea is to sequentially use the channel many times, so that the law of large numbers
comes into effect. Shannon’s outline of the proof is indeed strongly based on
the concept of typical sequences and in particular on a joint typicality based

decoding rule. However, the rigorous proof was given long after Shannon’s
initial paper. We now give the complete statement and proof of Shannon’s
second theorem.

Theorem (Channel Coding Theorem).
Let us define the channel capacity as follows:

C = max_{p_X(x)} I(X; Y).   (5.29)

For a discrete memoryless channel a rate R is achievable if and only if R < C.

According to the definition of achievable rate, the direct implication states that, for every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0. Conversely, the reverse implication says that for any sequence of (2^{nR}, n) codes with λ^{(n)} → 0, R ≤ C.
Let us now prove that all the rates R < C are achievable (direct implication, if). Later we will prove that any rate exceeding C is not achievable (converse implication, only if).

Proof. (Channel Coding Theorem: Achievability)
Let us fix p_X(x).
For any given rate R, we have to find a proper sequence of (2^{nR}, n) codes. The question that arises is how to build a codebook. It may come as a surprise that Shannon suggests to take the codewords at random. Specifically, we generate a (2^{nR}, n) code according to the distribution p(x) by taking 2^{nR} codewords drawn according to the distribution p(x^n) = ∏_{i=1}^n p(x_i), thus obtaining a mapping

g : {1, 2, ..., 2^{nR}} → X^n.   (5.30)

We can organize the codewords in a 2^{nR} × n matrix as follows

C = [ x_1(1)       x_2(1)       ...  x_n(1)
      x_1(2)       x_2(2)       ...  x_n(2)
      ...          ...          ...  ...
      x_1(2^{nR})  x_2(2^{nR})  ...  x_n(2^{nR}) ].   (5.31)

Each element of the matrix is drawn i.i.d. ∼ p(x). Each row i of the matrix corresponds to the codeword x^n(i).
Having defined the encoding function g, we define the correspondent decoding
function f . Shannon proposed a decoding rule based on joint typicality.
The receiver looks for a codeword that is jointly typical with the received
sequence. If a unique codeword exists satisfying this property, the receiver
5.2. Channel Coding 73

declares that word to be the transmitted codeword. Formally, given y n , if


(n)
the receiver finds a unique i s.t. (y n , xn (i)) ∈ Aε , then

f (y n ) = i. (5.32)

Otherwise, that is if no such i exists or if there is more than one such codeword, an error is declared and the transmission fails. Notice that jointly typical decoding is suboptimal. Indeed, the optimum procedure for minimizing the probability of error is maximum likelihood decoding. However, the proposed decoding rule is easier to analyze and asymptotically optimal.
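A minimal sketch of this decoding rule (ours, assuming a known input pmf p(x), a channel matrix p(y|x) and integer-indexed symbols) could look as follows:

```python
import numpy as np

def jointly_typical_decode(y, codebook, p_x, p_y_given_x, eps):
    """Return the index i of the unique codeword x^n(i) jointly typical with y^n,
    or None (decoding error) if there is no such codeword or more than one."""
    p_x = np.asarray(p_x, dtype=float)
    W = np.asarray(p_y_given_x, dtype=float)          # W[x, y] = p(y|x)
    p_xy = p_x[:, None] * W                           # joint pmf p(x, y)
    p_y = p_xy.sum(axis=0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    h_x, h_y, h_xy = H(p_x), H(p_y), H(p_xy)
    hits = []
    for i, x in enumerate(codebook):
        ex = -np.mean(np.log2(p_x[x]))                # -(1/n) log p(x^n)
        ey = -np.mean(np.log2(p_y[y]))                # -(1/n) log p(y^n)
        exy = -np.mean(np.log2(p_xy[x, y]))           # -(1/n) log p(x^n, y^n)
        # a zero-probability pair gives -inf, so the codeword is simply rejected
        if abs(ex - h_x) < eps and abs(ey - h_y) < eps and abs(exy - h_xy) < eps:
            hits.append(i)
    return hits[0] if len(hits) == 1 else None
```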
We now calculate the average probability of error over all codes generated at random according to the above described procedure, that is

P_e^{(n)} = Σ_C P_e^{(n)}(C) Pr(C),          (5.33)

where P_e^{(n)}(C) is the probability of error averaged over all codewords in codebook C. Then we have³

P_e^{(n)} = Σ_C Pr(C) (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i(C)
          = (1/2^{nR}) Σ_{i=1}^{2^{nR}} Σ_C Pr(C) λ_i(C).          (5.34)

By considering the specific code construction we adopted, it's easy to argue that λ_i does not depend on the particular index i sent. Thus, without loss of generality, we can assume i = 1, yielding

P_e^{(n)} = Σ_C Pr(C) λ_1(C).          (5.35)

If Y^n is the result of sending X^n(i) over the channel⁴, we define the event E_i as the event that the i-th codeword and the received one are jointly typical, that is

E_i = {(X^n(i), Y^n) ∈ A_ε^{(n)}},   i ∈ {1, 2, ..., 2^{nR}}.          (5.36)

³We point out that there is a slight abuse of notation, since P_e^{(n)}(C) in (5.33) corresponds to P_e^{(n)} in (5.20), while P_e^{(n)} in (5.33) denotes the probability of an error averaged over all the codes. Similarly, λ_i(C) corresponds to λ_i, where again the dependence on the codebook is made explicit.
⁴Both X^n(i) and Y^n are random since we are not conditioning on a particular code. We are interested in the average over C.

Since we assumed i = 1, we can define the error event E as the union of all the possible types of error which may occur during the decoding procedure (jointly typical decoding):

E = E_1^c ∪ E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}},          (5.37)

where the event E_1^c occurs when the transmitted codeword and the received one are not jointly typical, while the other events refer to the possibility that a wrong codeword (different from the transmitted one) is jointly typical with Y^n (the received sequence). Hence: Pr(E) = Pr(E_1^c ∪ E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}}).
We notice that the transmitted codeword and the received sequence must be
jointly typical, since they are probabilistically linked through the channel.
Hence, by bounding the probability of the union in (5.37) with the sum of
the probabilities, from the first and the fourth point of the joint AEP theorem
we obtain
Pr(E) ≤ Pr(E_1^c) + Σ_{i=2}^{2^{nR}} Pr(E_i)
      ≤ δ + (2^{nR} − 1) 2^{−n(I(X;Y)−3ε)}
      ≤ δ + 2^{nR} 2^{−n(I(X;Y)−3ε)}
      = δ + 2^{−n(I(X;Y)−R−3ε)}
      = δ′,          (5.38)

where δ′ can be made arbitrarily small for n → ∞ if R < I(X;Y). The intuitive meaning of the above derivation is the following: since, for any codeword different from the transmitted one, the probability of being jointly typical with the received sequence is approximately 2^{−nI(X;Y)}, we can use at most 2^{nI(X;Y)} codewords in order to keep the error probability arbitrarily small for large enough n. In other words, if we have not too many codewords (R < I), with high (arbitrarily close to 1) probability there is no other codeword that can be confused with the transmitted one.
At the beginning of the proof we fixed p_X(x), which determines the value of I(X;Y). Actually, p_X(x) is the remaining degree of freedom we can exploit in order to obtain the smallest Pr(E) for the given rate R. As a consequence, it is easy to argue that Pr(E) can be made arbitrarily small (for large n) if

the rate R is less than the maximum of the mutual information, that is

C = max_{p_X(x)} I(X; Y).          (5.39)

To conclude the proof we need a further step. In fact, the achievability definition is given in terms of the maximal probability of error λ_max^{(n)}, while up to now we have dealt with the average probability of error. We now show that

P_e^{(n)} → 0  ⇒  ∃ C s.t. λ_max^{(n)} → 0.          (5.40)
Since P_e^{(n)} = Pr(E) < δ′, there exists at least one code C (actually more than one) such that P_e^{(n)}(C) < δ′. Name it C*. Let us list the probabilities of error λ_i of the code C* in increasing order:

λ_1, λ_2, ..., λ_{2^{nR}}.

Now, we throw away the upper half of the codewords in C*, thus generating a new code C** with half the codewords. Being the average probability of error for the code C* lower than δ′, we deduce that

λ_{2^{nR}/2} < 2δ′.          (5.41)

(If it were not so, it is easy to argue that P_e^{(n)}(C*) would be greater than δ′.) But λ_{2^{nR}/2} is the maximal probability of error for the code C**, which is then arbitrarily small (tends to zero as n → ∞).
What about the rate of C**? Throwing out half the codewords reduces the rate from R to R − 1/n (= log(2^{nR−1})/n). This reduction is negligible for large n. Then, for large n, we have found a code having rate R and whose λ_max^{(n)} tends to zero. This concludes the proof that any rate below C is achievable.

Some considerations can be made regarding the proof: similarly to the source coding theorem, Shannon does not provide any usable way to construct the codes. The construction procedure used in the proof is highly impractical for many reasons. Firstly, Shannon's approach is asymptotic: both the number of codewords, 2^{nR}, and the length, n, have to go to infinity. Secondly, but not least, Shannon suggests generating the code at random; accordingly, we should write down all the codewords in the matrix C (see (5.31)) and moreover transmit the matrix to the receiver. It is easy to guess that, for large values of n, this scheme requires (storage and transmission) resources out of any proportion. In fact, without some structure in the code

it is not possible to decode. Only structured codes (i.e. codes generated


according to a rule) are easy to encode and decode in practice.

Now we must show that it is not possible to ‘do better’ than C (converse).
Before giving the proof we need to introduce two lemmas of general validity.

Lemma (Fano's inequality). Let X and Y be two dependent sources and let g be any deterministic reconstruction function such that X̂ = g(Y). The following upper bound on the residual uncertainty (or equivocation) about X given Y holds:

H(X|Y) ≤ h(P_e) + P_e log(|X| − 1)
       ≤ 1 + P_e log(|X| − 1),          (5.42)

where P_e = Pr(X̂ ≠ X).

Proof. We introduce an error random variable

E = { 1  if X̂ ≠ X   (with probability P_e)
      0  if X̂ = X   (with probability 1 − P_e).          (5.43)

By using the chain rule we can expand H(E, X|Y) in two different ways:

H(X, E|Y) = H(X|Y) + H(E|X, Y)          (5.44)
          = H(E|Y) + H(X|E, Y).          (5.45)

It's easy to see that H(E|X, Y) = 0, while H(E|Y) ≤ H(E) = h(P_e). As to H(X|E, Y), by writing the sum over E explicitly we have

H(X|E, Y) = (1 − P_e) H(X|E = 0, Y) + P_e H(X|E = 1, Y).          (5.46)

Relation (5.46) can be simplified by observing that, when E = 0, there is no uncertainty about the value of X (since x̂ = x, H(X|E = 0, Y) = 0), while, when E = 1, the estimate of X is not correct (x̂ ≠ x). Using the bound on the maximum entropy yields H(X|E = 1, Y) ≤ log(|X| − 1). Then, the sum in (5.46) can be bounded as:

H(X|E, Y) ≤ P_e log(|X| − 1).          (5.47)

By solving for H(X|Y) from the equalities (5.44)-(5.45) we eventually have

H(X|Y) ≤ h(P_e) + P_e log(|X| − 1)
       ≤ 1 + P_e log(|X| − 1),          (5.48)

which is the desired relation.


The second inequality provides a weaker upper bound which, however, avoids the evaluation of the binary entropy h(P_e).

Fano’s inequality is useful whenever we know a random variable Y and


we wish to guess the value of a correlated random variable X. It relates
the probability of error in guessing the random variable X, i.e. Pe , to the
conditional entropy H(X|Y ).
It’s interesting to note that Fano’s inequality can also be seen as a lower
bound on Pe . Looking at X and Y as the input and the output of a channel
and looking at g as the decoding function, Pe corresponds to the probability
of a decoding error5 .
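As a quick numerical illustration (ours, with an arbitrary toy joint pmf), one can verify the inequality for the MAP estimator x̂ = g(y) = argmax_x p(x|y):

```python
import numpy as np

def fano_check(p_xy):
    """Compare H(X|Y) with the Fano bound h(Pe) + Pe*log(|X|-1) for the MAP guess."""
    p_xy = np.asarray(p_xy, dtype=float)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    H_X_given_Y = H(p_xy) - H(p_xy.sum(axis=0))       # H(X,Y) - H(Y)
    Pe = 1.0 - p_xy.max(axis=0).sum()                 # error prob. of x_hat = argmax_x p(x, y)
    h = lambda q: 0.0 if q in (0.0, 1.0) else -q*np.log2(q) - (1-q)*np.log2(1-q)
    bound = h(Pe) + Pe * np.log2(p_xy.shape[0] - 1)
    return H_X_given_Y, bound                         # the first value never exceeds the second

# toy joint pmf p(x, y) (rows: x, columns: y); values are made up for illustration
p_xy = np.array([[0.25, 0.05, 0.05],
                 [0.05, 0.25, 0.05],
                 [0.02, 0.03, 0.25]])
print(fano_check(p_xy))        # e.g. ≈ (1.05, 1.06): the bound holds
```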

Lemma. Let us consider a discrete memoryless channel (DMC) with input and output sources X and Y. By referring to the extended channel we have

I(X^n; Y^n) ≤ nC.          (5.49)

Proof.

I(X^n; Y^n) = H(Y^n) − H(Y^n|X^n)
          (a)= H(Y^n) − Σ_i H(Y_i | Y_{i−1}, ..., Y_1, X^n)
          (b)= H(Y^n) − Σ_i H(Y_i | X_i)
          (c)≤ Σ_i H(Y_i) − Σ_i H(Y_i | X_i)
            = Σ_i I(X_i; Y_i) ≤ nC,          (5.50)

where (a) derives from the application of the generalized chain rule and (b) follows from the memoryless (and no-feedback) assumption. Since conditioning reduces uncertainty, H(Y^n) ≤ Σ_i H(Y_i), and relation (c) follows. We stress that the output symbols Y_i need not be independent, that is, in general p(y_i|y_{i−1}, ..., y_1) ≠ p(y_i). Since C is defined as the maximum of the mutual information over p(x), the last inequality clearly holds.

The above lemma shows that using the channel many times does not increase the per-use transmission rate beyond C.

⁵For the sake of clarity, we point out that Fano's inequality holds even in the more general case in which the function g(Y) is random, that is for any estimator X̂ such that X → Y → X̂.

Remark: the lemma also holds for non-DM channels, but this extension is beyond the scope of these notes.

We have now the necessary tools to prove the converse of the channel
coding theorem.

Proof. (Channel Coding Theorem: Converse)

We show that any sequence of (2^{nR}, n) codes with λ_max^{(n)} → 0 must have R ≤ C; equivalently, if R > C then P_e^{(n)} cannot tend to 0 (thus implying that λ_max^{(n)} does not tend to 0).
Given the index set {1, 2, ..., 2^{nR}}, a fixed encoding function which associates to an index (message) W a codeword X^n(W), and a fixed decoding rule g(·) such that Ŵ = g(Y^n), we have

W → X^n(W) → Y^n → Ŵ.          (5.51)

In (5.51), Y^n takes the role of the observation, W the role of the index we have to estimate, and Pr(Ŵ ≠ W) = P_e^{(n)} = (1/2^{nR}) Σ_i λ_i. The random variable W corresponds to a uniform source, since the indexes are drawn in an equiprobable manner; thus the entropy is H(W) = log(2^{nR}) = nR.
By using the definition of the mutual information we have

nR = H(W) = I(W; Y^n) + H(W|Y^n).          (5.52)

Since the channel directly acts on X^n, we deduce that p(y^n|x^n, w) = p(y^n|x^n), that is W → X^n → Y^n. Then, according to the properties of Markov chains and in particular to the DPI, from (5.52) it follows that

nR ≤ I(X^n; Y^n) + H(W|Y^n).          (5.53)

By exploiting the lemmas proved above, from (5.53) we get

nR ≤ I(X^n; Y^n) + 1 + P_e^{(n)} log(2^{nR} − 1)
   < nC + 1 + P_e^{(n)} nR.          (5.54)

Dividing by n yields:

R < C + 1/n + P_e^{(n)} R.          (5.55)

It follows that if n → ∞ and P_e^{(n)} → 0 then R < C + ε for any arbitrarily small ε, i.e. R ≤ C.
According to the direct channel coding theorem, n must tend to infinity so

Figure 5.10: Asymptotic lower bound on P_e by varying R.

that P_e^{(n)} can be made arbitrarily small. Therefore, if we want P_e^{(n)} → 0, it's necessary that the rate R stays below capacity. This fact proves that R < C is also a necessary condition for a rate R to be achievable.
From (5.55) there is another way to show that if R > C then P_e^{(n)} cannot tend to 0. Let us rewrite (5.55) as follows:

P_e^{(n)} ≥ 1 − C/R − 1/(nR).          (5.56)

Joining this condition with the positivity of P_e^{(n)} produces the asymptotic lower bound on P_e depicted in Figure 5.10. It's easy to see that if R > C the probability of error is bounded away from 0 for large n. As a consequence, we cannot achieve an arbitrarily low probability of error at rates above capacity.
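A few lines of Python (ours, with arbitrary example numbers) make the bound (5.56) concrete:

```python
def pe_lower_bound(R, C, n):
    """Asymptotic lower bound (5.56) on the probability of error at rate R > C."""
    return max(0.0, 1.0 - C / R - 1.0 / (n * R))

# transmitting 50% above capacity: the error probability stays bounded away from 0
print(pe_lower_bound(R=1.5, C=1.0, n=1000))    # ≈ 0.333 for large n
```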

5.2.4 Channel Coding in practice

The essence of the channel coding theorem is that, as long as R < C, it is


possible to send information without affecting the reliability of the transmis-
sion. Hence, the noisiness of the channel does not limit the reliability of the
transmission but only its rate. Moreover, Shannon proves that choosing the
codes at random is asymptotically the best choice whatever the channel is.
However, it is easy to deduce that for finite n the knowledge of the channel
may help to choose a better code.
The problems we have to face in practice are many. Hereinafter, we review the most common channels in order to compute their channel capacity C.

Evaluation of channel capacity

In order to evaluate the channel capacity of a given channel we have to solve the maximization

C = max_{p(x)} I(X; Y),          (5.57)

for a given p(y|x) and subject to the constraints on p(x):

p(x) ∈ [0, 1]  ∀x,     Σ_x p(x) = 1.          (5.58)

It’s possible to prove that since p(y|x) is fixed by the channel, the mutual
information is a concave function of p(x). Hence, a maximum for I(X; Y )
exists and is unique. However, being the objective function a nonlinear func-
tion, solving (5.57) is not easy and requires using methods of numerical op-
timization. There are only some simple channels, already introduced at the
beginning of the chapter, for which it is possible to determine C analytically.
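A standard numerical tool for this maximization is the Blahut-Arimoto algorithm; the sketch below (ours, not part of the original notes) alternates between the output distribution induced by p(x) and a multiplicative update of p(x), and returns brackets that squeeze C from below and above.

```python
import numpy as np

def blahut_arimoto(W, tol=1e-10, max_iter=10_000):
    """Estimate C = max_{p(x)} I(X;Y) (in bits) for a DMC with transition matrix
    W[x, y] = p(y|x); rows of W must sum to 1."""
    n_x = W.shape[0]
    p = np.full(n_x, 1.0 / n_x)                      # start from the uniform input
    lower = 0.0
    for _ in range(max_iter):
        q = p @ W                                    # output distribution q(y)
        ratio = np.divide(W, q, out=np.ones_like(W), where=W > 0)
        c = np.exp(np.sum(W * np.log(ratio), axis=1))   # c_x = exp(D(p(y|x)||q)) in nats
        lower, upper = np.log(p @ c), np.log(c.max())   # lower <= C <= upper (nats)
        p = p * c / (p @ c)                          # multiplicative update of p(x)
        if upper - lower < tol:
            break
    return lower / np.log(2), p                      # convert nats -> bits

# BSC with crossover eps = 0.1: the analytic result is C = 1 - h(0.1) ≈ 0.531 bits
eps = 0.1
W_bsc = np.array([[1 - eps, eps], [eps, 1 - eps]])
print(blahut_arimoto(W_bsc))                         # ≈ (0.531, uniform input)
```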

• Noisy typewriter

In this channel, if we know the input symbol there are two possible outputs (the same symbol or the subsequent one), each with probability 1/2. Then H(Y|X) = 1 and max I(X;Y) = max(H(Y) − H(Y|X)) = max(H(Y) − 1). The maximum of the entropy of the output source, which is log |Y|, can be achieved by using a p(x) distributed uniformly over all the inputs. Since the input and the output alphabets coincide, we have

C = log |Y| − 1 = log |X| − 1.          (5.59)

We deduce that, due to the action of the channel, we lose 1 information bit. Equivalently, the maximum rate of transmission is C = log(|X|/2). This suggests that the intuitive idea of considering half of the symbols, proposed at the beginning of Section 5.2, is an optimum choice. It may seem a paradox that the value C is obtained by considering the inputs equally likely, but this is not necessarily the way we have to choose the inputs if we want to transmit at rate C. In fact, in this particular case, taking only non-consecutive inputs permits sending information through the channel at the maximum rate C, without having to send n to infinity. This is not a contradiction: Shannon proposes a conceptually simple encoding and decoding scheme, but this does not preclude the existence of better schemes, especially for finite n. What is certain is that the transmission rate cannot go beyond C.

• BSC

Even for this channel the maximization of the mutual information is straight-
forward, since we can easily compute the probability distribution p(x) which
maximizes H(Y ). As we already know from the analysis in Section 5.2.1,
C = max(H(Y ) − h(ε)) = 1 − h(ε), which is achieved when the input distri-
bution is uniform.

• BEC

For the binary erasure channel (Figure 5.7) the evaluation of the capacity is a little bit more complex. Since H(Y|X) is a characteristic of the channel and does not depend on the probability of the input, we can write

C = max_{p(x)} (H(Y) − H(Y|X)) = max_{p(x)} H(Y) − h(α).          (5.60)

For a generic value of α, the absolute maximum value of H(Y) (log |Y| = log 3) cannot be achieved by any choice of the input distribution. Then, we have to explicitly solve the maximization problem. Let p_X(0) = π and p_X(1) = 1 − π. There are two ways to find the optimal π. According to the first method, from the output distribution given by the triplet p_Y(y) = (π(1−α), α, (1−π)(1−α)) we calculate the entropy H(Y) and then maximize over π. The other method exploits the grouping property, yielding

H(Y) = H_3(π(1−α), α, (1−π)(1−α))
     = H_2(α, 1−α) + (1−α) H_2(π, 1−π) = h(α) + (1−α) h(π).          (5.61)

The maximum of the above expression is obtained when h(π) = 1, that is for π = 1/2. It follows that C = h(α) + (1−α) − h(α) = 1 − α. The result is expected, since the BEC is nothing else than a noiseless binary channel which breaks down with probability α; then, C can be obtained by subtracting from 1 the fraction of time the channel remains inoperative.
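As a sanity check (ours), the Blahut-Arimoto sketch given above reproduces this result when fed with the BEC transition matrix:

```python
import numpy as np

alpha = 0.3
W_bec = np.array([[1 - alpha, alpha, 0.0],      # p(y|x=0) over outputs {0, erasure, 1}
                  [0.0, alpha, 1 - alpha]])     # p(y|x=1)
C_bec, p_opt = blahut_arimoto(W_bec)            # blahut_arimoto: sketch defined above
print(C_bec, p_opt)                             # ≈ 0.7 = 1 - alpha, uniform input
```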

Construction of the codes

The channel coding theorem promises the existence of block codes that allow us to transmit information at rates below capacity with arbitrarily small probability of error, provided that the block length is large enough. The greatest problem of channel coding is to find codes which in practice allow transmission at rates close to C. Ever since the appearance of Shannon's paper, people have searched for such codes. In addition, usable codes should be "simple", so that they can be encoded and decoded easily. If we generated the codewords at random, according to Shannon's scheme, we would have to list all the codewords and send them to the receiver, requiring a huge amount of memory. Furthermore, we need a way to associate the messages we have to transmit with the codewords. Besides, since the code must be invertible, the codewords have to be distinct. Shannon overcomes these problems by considering an asymptotic situation. Sending n to infinity is also what makes it possible to use jointly typical decoding as the decoding rule at the receiver side. Such a decoding scheme requires the receiver to check all the sequences which may have been sent in order to make the decision on the transmitted codeword. However, even a minimum distance algorithm may require up to 2^{nR} evaluations.
Chapter 6

Continuous Sources and Gaussian Channel

In this chapter, we deal with continuous sources. By following the same


steps of the analysis developed for the discrete case we highlight the concep-
tual differences the continuity assumption leads to.

6.1 Differential Entropy


Let X be a random variable taking values in R characterized by a prob-
ability density function fX (x) 1 .

Definition. The differential entropy h(X) is defined as

h(X) = − ∫_R f_X(x) log f_X(x) dx.          (6.1)

The lower case letter h is used in place of the capital letter H denoting
the entropy in the discrete case.
It can be shown that the differential entropy represents a valid measure for
the information carried by a continuous random variable: indeed, if h(X)
grows the prior uncertainty about the value of X increases. However, some
of the intuitiveness of the entropy is lost. The main reason for this is that
now the differential entropy can take negative values: this happens for in-
stance when we compute the entropy of a random variable with a uniform
distribution in a continuous range [0, a] where a < 1 (in this case in fact
h(X) = log a < 0).
1
In the continuous case we refer to pdf instead of pmf.
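A tiny Monte Carlo check (ours, with an arbitrary value of a) makes the negative value concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
a = 0.5
x = rng.uniform(0.0, a, size=200_000)                 # X ~ Uniform(0, a)
h_mc = -np.mean(np.log2(np.full_like(x, 1.0 / a)))    # Monte Carlo estimate of -E[log2 f(X)]
print(h_mc, np.log2(a))                               # both equal log2(a) = -1 bit here
```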


The quantities related to the differential entropy, like the joint and condi-
tional entropy, mutual information, and divergence, can be defined in the
same way as for the discrete case2 and most of their properties proved like-
wise.

6.2 AEP for Continuous Sources


We now revisit the AEP theorem which still holds for continuous memo-
ryless sources. The proof is omitted since it is very similar to that of the AEP
theorem for the discrete case. A major difference which must be remarked is
the fact that, dealing with continuous alphabets, the typical sequences can
not be counted or listed. Then, in the continuous case, we refer to the volume
occupied by the set of typical sequences, which in turn is about 2nh(X) . By
looking at this approximated value for the volume, it is worth noting that
a negative differential entropy corresponds to a small (but always positive)
volume occupied by the set of typical sequences, not implying any contra-
diction. Furthermore, as the intuition suggests, a low uncertainty about the
sequences is associated to a small volume occupied by the typical sequences.
For completeness, we give the formal definition of the volume. Assuming that a set S is measurable according to the Riemann or Lebesgue measure, the volume of S is defined as

Vol(S) = ∫ ··· ∫_S dx⃗,          (6.2)

where x⃗ = (x_1, x_2, ..., x_n). We can now state the AEP theorem.
where ~x = (x1 , x2 , ..., xn ). We can now state the AEP theorem.

Theorem. (AEP: continuous case)
Given a CMS³, X ∼ f_X(x), and defined the set A_ε^{(n)} as follows:

A_ε^{(n)} = { x^n : | −(1/n) log f_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) − h(X) | < ε },          (6.3)

we have:

1. ∀δ > 0, ∀ε > 0, n large, Pr{A_ε^{(n)}} ≥ 1 − δ;
2. ∀ε, ∀n, Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)};
3. ∀δ > 0, ∀ε > 0, n large, Vol(A_ε^{(n)}) ≥ (1 − δ) 2^{n(h(X)−ε)}.

²by paying attention to replace the sum with the integral.
³Continuous memoryless source.

Observation.
From the AEP theorem we know that the differential entropy is directly re-
lated to the volume occupied by the typical sequences. The following intuitive
properties hold:

1. h(X) = h(X − µ),

2. h(X) 6= h(αX),

where α is any scale factor.


Equality 1 follows from the translation invariance of the volume. We stress
that the same relation holds for the entropy in the discrete case. We leave
the proof as exercise.
Inequality 2 is due to the fact that scaling changes the volume. This property
contrasts with the discrete case for which it’s easy to deduce that H(X) =
H(αX). Let us prove this inequality. It is known that

X ∼ f_X(x)  implies  Y = αX ∼ (1/|α|) f_X(y/α).          (6.4)

Hence (α > 0),

h(Y) = − ∫ (1/α) f_X(y/α) log [ (1/α) f_X(y/α) ] dy.          (6.5)

By performing a variable change from y to x (modifying the differential and the support of the integral) we get

h(Y) = − ∫ f_X(x) log(1/α) dx − ∫ f_X(x) log f_X(x) dx
     = h(X) + log α ≠ h(X).          (6.6)

Observe that the additional term log α corresponds to the n-dimensional (volume) scaling factor, being Vol ≈ 2^{n(h(X)+log α)} = α^n 2^{nh(X)}.

6.3 Gaussian Sources


The calculation of the differential entropy is somewhat problematic and
actually impossible for a generic probability density function fX (x). A re-
markable case for which the computation is particularly easy is the case of a
Gaussian density function.

The probability density function of a Gaussian random variable X is:

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)},          (6.7)

where µ is the expected value (expectation) of the distribution and σ the standard deviation (i.e. σ² is the variance). The Gaussian (or normal) distribution in (6.7) is often denoted by N(µ, σ²).
Let us compute the differential entropy of the Gaussian-distributed random variable X:

h(X) = − ∫_R f_X(x) log f_X(x) dx
     = − ∫_R N(µ, σ²) log [ e^{−(x−µ)²/(2σ²)} / √(2πσ²) ] dx
     = log √(2πσ²) + log e · ∫_R N(µ, σ²) (x−µ)²/(2σ²) dx
     = (1/2) log 2πσ² + (log e/(2σ²)) ∫_R (x−µ)² N(µ, σ²) dx
  (a)= (1/2) log 2πσ² + (1/2) log e
     = (1/2) log 2πeσ²,          (6.8)

where in (a) we exploit the definition of the variance: σ² = E[(X − µ)²] = ∫_R (x − µ)² f_X(x) dx.
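The closed form (6.8) can be checked numerically; the sketch below (ours, with arbitrary µ and σ) estimates −E[ln f_X(X)] by Monte Carlo and converts to bits:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=500_000)
ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)   # ln f_X(x)
h_mc = -np.mean(ln_f) / np.log(2)                        # nats -> bits
print(h_mc, 0.5 * np.log2(2 * np.pi * np.e * sigma**2))  # both ≈ 3.05 bits
```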

Let us now consider the general case of n jointly Gaussian random variables forming a Gaussian vector X⃗ = (X_1, ..., X_n). We want to evaluate the differential entropy h(X_1, X_2, ..., X_n). A Gaussian vector X⃗ is distributed according to a multivariate Gaussian density function, which has the expression

f_{X⃗}(x⃗) = (1/√((2π)^n |C|)) e^{−(x⃗−µ⃗) C^{−1} (x⃗−µ⃗)^T / 2},          (6.9)

where µ⃗ is the vector of the expected values µ_i of the random variables X_i (i = 1, ..., n), and C is the covariance matrix. In the sequel we will use the compact notation C_{ij} to denote the element (i, j) of the covariance matrix C, i.e. C_{ij} = cov(X_i, X_j) = E[(X_i − µ_i)(X_j − µ_j)]. The Gaussian (normal) density function of a random vector X⃗ with mean µ⃗ and covariance C is commonly referred to as N(µ⃗, C).
Note that if C is a diagonal matrix, that is the n r.v.’s are independent, we

obtain the product of n one-dimensional Gaussian pdfs N(µ_i, σ_i²).


Let us now compute the entropy of the Gaussian random vector X⃗, or equivalently the joint entropy of the n random variables X_i:

h(X⃗) = − ∫···∫_{R^n} N(µ⃗, C) log [ e^{−(x⃗−µ⃗)C^{−1}(x⃗−µ⃗)^T/2} / √((2π)^n |C|) ] dx⃗
      = log √((2π)^n |C|) + log e · ∫···∫_{R^n} N(µ⃗, C) (x⃗−µ⃗)C^{−1}(x⃗−µ⃗)^T/2 dx⃗ ⁴
      = (1/2) log (2π)^n |C| + (log e/2) ∫···∫_{R^n} N(µ⃗, C) Σ_i Σ_j (x_i−µ_i)(C^{−1})_{ij}(x_j−µ_j) dx⃗
      = (1/2) log (2π)^n |C| + (1/2) log e · Σ_i Σ_j (C^{−1})_{ij} ∫···∫_{R^n} N(µ⃗, C)(x_i−µ_i)(x_j−µ_j) dx⃗
      = (1/2) log (2π)^n |C| + (1/2) log e · Σ_i Σ_j (C^{−1})_{ij} C_{ij}.          (6.10)

By exploiting the symmetry of the covariance matrix, the inner sum in the last equation of (6.10) can be rewritten as Σ_j (C^{−1})_{ij} C_{ji}, which is nothing else than the (i, i) element of the matrix product C^{−1} · C (i.e. the identity matrix). Then, going on from (6.10) we get:

h(X⃗) = (1/2) log (2π)^n |C| + (1/2) log e · Σ_{i=1}^{n} I_{ii}
      = (1/2) log [(2πe)^n |C|].          (6.11)
We have found the expression of the entropy of an n-length vector of jointly Gaussian random variables. Setting n = 1 yields the entropy of a Gaussian random variable, that is

h(X) = (1/2) log 2πeσ².          (6.12)

As expected, if the n random variables are independent (C is diagonal and then |C| = ∏_i σ_i²), we have h(X⃗) = Σ_i h(X_i) = Σ_i (1/2) log 2πeσ_i².
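Expression (6.11) is straightforward to evaluate numerically; below is a small sketch (ours) based on a log-determinant, checked against the diagonal case:

```python
import numpy as np

def gaussian_vector_entropy(C):
    """Differential entropy (bits) of N(mu, C) from (6.11); independent of mu."""
    C = np.asarray(C, dtype=float)
    sign, logdet = np.linalg.slogdet(2 * np.pi * np.e * C)   # log((2*pi*e)^n |C|)
    return 0.5 * logdet / np.log(2)                          # nats -> bits

# diagonal C: the joint entropy is the sum of the marginal entropies
C = np.diag([1.0, 4.0])
print(gaussian_vector_entropy(C),
      sum(0.5 * np.log2(2 * np.pi * np.e * s2) for s2 in [1.0, 4.0]))   # both ≈ 5.09
```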

We now prove that, among all the possible continuous distributions with
the same variance, the Gaussian distribution is the one that has the largest
entropy.
⁴We make use of the unit-sum property of the density function: ∫···∫_{R^n} N(µ⃗, C) dx⃗ = 1.

Property. Let f (x) be a Gaussian density function with variance σ 2 and let
g(x) be any other density function having the same variance. Then

h(f ) ≥ h(g)5 . (6.13)

Proof.

0 ≤ D(g(x)||f(x)) = ∫ g(x) log [ g(x)/f(x) ] dx
  = ∫ g(x) log g(x) dx − ∫ g(x) log f(x) dx
(a)= −h(g) − ∫ g(x) log [ (1/√(2πσ²)) e^{−x²/(2σ²)} ] dx
  = −h(g) − log(1/√(2πσ²)) + (log e/(2σ²)) ∫ x² g(x) dx
  = −h(g) + (1/2) log(2πσ²) + (1/2) log e
  = −h(g) + h(f),          (6.14)

where in (a), without any loss of generality, we considered a zero-mean density function (as already observed, the differential entropy does not depend on the mean value). From (6.14) we can easily obtain the desired relation.

Note: the previous property can be used to give a meaningful justification of the Gaussian assumption often made in noise characterization. According to the property, this is a worst-case assumption, since it corresponds to considering the situation of maximum a priori uncertainty. In other words, making the Gaussian assumption corresponds to applying the principle of maximum entropy.
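A quick numerical comparison (ours) of a Gaussian and a uniform density with the same variance illustrates the property; for the uniform on [−a, a], σ² = a²/3, so its differential entropy is log 2a:

```python
import numpy as np

sigma2 = 1.0
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)   # ≈ 2.05 bits, from (6.12)
h_unif = np.log2(2 * np.sqrt(3 * sigma2))            # ≈ 1.79 bits: uniform with the same variance
print(h_gauss, h_unif, h_gauss >= h_unif)            # the Gaussian always wins
```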

6.4 Gaussian Channel (AWGN)


In Chapter 5 we introduced the discrete communication channel and chan-
nel coding and analyzed the most common discrete channel models. For each
⁵Here we use a slightly different notation for the differential entropy, in order to make explicit the density function according to which the variable is generated.

model (BSC, BEC,...), we have supposed that the channel is characterized by


a transition matrix, thus viewing it as a black box. The analysis of continu-
ous sources allows us to directly consider the analog (physical) channel inside
which the continuous modulated waveform are transmitted. In this way, we
can configure the connection by designing the most appropriate modulation
for a given physical channel. From the knowledge of digital modulation the-
ory we know that the error probability of a link is related to the bandwidth
and the power of the transmitted signal.
In this chapter we formally describe the Gaussian channel 6 by specifying the
relation between the input X and the output Y of the channel. The AWGN
channel is characterized by an additive relation between input and output;
the output Y is obtained by adding to X a white Gaussian noise Z, that is

Y = X + Z, Z ∼ N (0, σz2 ). (6.15)

Being the added noise white, the channel is stationary and memoryless. This
channel is a model for a number of common communication channels, such
as the telephone channel and satellite links.
Without any limitation on the input, we argue that the capacity of the Gaus-
sian channel is infinite and we can obtain a perfect (with no error) transmis-
sion, as if the noise variance were zero. However, it is quite reasonable to
assume that the power of the input signal is constrained. In particular, we require that using the channel n times yields a transmitted power less than P_X (or σ_x²), i.e.

(1/n) Σ_i x_i² ≤ P_X.          (6.16)
Given the constraint in (6.16), we are interested in determining the maximum rate at which transmission is possible through the channel. It is worth stressing that, strictly speaking, we still have to demonstrate that the channel capacity concept holds for the continuous case. Before rigorously formalizing these concepts, we empirically show the basic ideas behind channel coding for transmission over an AWGN channel.

6.4.1 The Coding problem: a qualitative analysis


As done for the discrete case, we introduce the channel coding problem
through a qualitative analysis.
Consider the n-th extension of the channel. Working with a continuous
alphabet means that there are an infinite number of possible sequences x^n that can be
6
We still assume a discrete-time channel.

Figure 6.1: Representation of the n-th extended Gaussian channel.

transmitted over the channel through n uses. Due to the noise added by the channel during the transmission, for any input sequence there are in turn an infinite number of possible outputs. The power constraint ||x^n|| ≤ √(nP_X) allows us to say that all the possible inputs lie in an n-dimensional hypersphere (in R^n) of radius √(nP_X) (see Figure 6.1). What we want to determine is the maximum number of sequences that can be reliably transmitted over the channel (error-free transmission). Looking at the figure, we see that without limitation imposed on the power of the input signal we could reliably transmit an infinite number of sequences (the radius of the sphere being unbounded), despite the dispersion caused by the noise.
In order to find the maximum number of reliably transmissible sequences we can compute the maximum number of disjoint sets we can fit in the output space (Y^n) (Figure 6.1). Each sequence y^n in the set of output sequences is obtained as the sum x^n + z^n, where x^n is the corresponding input sequence and z^n is a Gaussian noise vector. Each coefficient z_i represents the noise relative to the i-th use of the channel (Z_i ∼ N(0, σ_z²), the Z_i being i.i.d.). The random output vector Y^n = x^n + Z^n has a Gaussian distribution with mean x^n and the same variance as the noise, i.e. σ_z² = P_N. Therefore, it's correct to represent the output set centered on the input sequence. Besides, if n is sufficiently large, we can affirm that with high probability the output points lie on the boundary of the √(nP_N)-radius hypersphere since, by the Law of Large Numbers,

||(x^n + Z^n) − x^n||² = ||Z^n||² = Σ_i Z_i² → nP_N   as n → ∞.          (6.17)

We now evaluate the volume of the n-dimensional hypersphere containing approximately, and with high probability, all the "typical" output sequences. In order to do this, we observe that a generic point of the output space can be denoted as X^n + Z^n. The total power, for large n, with high probability, is

||X^n + Z^n||² = ||X^n||² + ||Z^n||² + 2⟨X^n, Z^n⟩
              = Σ_i X_i² + Σ_i Z_i² + 2 Σ_i X_i Z_i          (6.18)
              ≤ nP_X + nP_N,

where in (6.18) we exploited the independence of the signal from the noise. Then, the received vectors lie inside a sphere of radius √(n(P_X + P_N)). Being the volume directly proportional to the n-th power of the radius with a proportionality constant a_n, the maximum number of non-overlapping (non-intersecting) spheres which it is possible to arrange in this volume is bounded by

a_n (n(P_X + P_N))^{n/2} / (a_n (nP_N)^{n/2}) = ((P_N + P_X)/P_N)^{n/2}.          (6.19)

Then, the number of bits that can be reliably transmitted for each use of the channel is at most

(1/2) log(1 + P_X/P_N).          (6.20)

The above arguments tell us that we cannot hope to send information at a rate larger than the value in (6.20) with no error. In the next section we will rigorously prove that as n → ∞ we can do almost as well as this.
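The bound (6.20) is easy to evaluate numerically; for instance (the power values below are arbitrary):

```python
import numpy as np

def awgn_bits_per_use(P_X, P_N):
    """Sphere-packing bound (6.20): bits per channel use for signal power P_X, noise power P_N."""
    return 0.5 * np.log2(1.0 + P_X / P_N)

print(awgn_bits_per_use(P_X=10.0, P_N=1.0))   # ≈ 1.73 bits per channel use
```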

6.4.2 Coding Theorems for the Gaussian Channel


Even for the continuous case it’s possible to refer to the same definitions
of code, error probability and achievable rate given in Section 5.2.2 for the
discrete case. Before stating and proving the channel coding theorem for the
Gaussian case, we define the jointly typical set and enunciate the joint AEP
theorem.

Definition. Let X and Y be two continuous sources with probability density functions f_X(x) and f_Y(y), respectively, and joint pdf f_{XY}(x, y). The jointly typical set A_ε^{(n)} is defined as:

A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
    | −(1/n) log f_X(x^n) − h(X) | < ε,
    | −(1/n) log f_Y(y^n) − h(Y) | < ε,
    | −(1/n) log f_{XY}(x^n, y^n) − h(X, Y) | < ε }.          (6.21)
Using the above definition we can state the following theorem:

Theorem (Joint AEP: continuous case).

1. ∀δ > 0, ∀ε > 0, n large, Pr{A_ε^{(n)}} ≥ 1 − δ;
2. ∀ε, ∀n, Vol(A_ε^{(n)}) ≤ 2^{n(h(X,Y)+ε)};
3. ∀δ > 0, ∀ε > 0, n large, Vol(A_ε^{(n)}) ≥ (1 − δ) 2^{n(h(X,Y)−ε)};
4. Let X̃ and Ỹ be two independent random variables with the same marginal distributions as X and Y, i.e. f_X̃ = f_X and f_Ỹ = f_Y. We have

Pr{(X̃^n, Ỹ^n) ∈ A_ε^{(n)}} ≃ 2^{−nI(X;Y)}.          (6.22)

Formally, the following two bounds hold:

Pr{(X̃^n, Ỹ^n) ∈ A_ε^{(n)}} ≤ 2^{−n(I(X;Y)−3ε)},          (6.23)

and, ∀δ > 0, n large,

Pr{(X̃^n, Ỹ^n) ∈ A_ε^{(n)}} ≥ (1 − δ) 2^{−n(I(X;Y)+3ε)}.          (6.24)

Proof. The proof is virtually identical to the proof of the AEP discrete the-
orem.

We are now ready to state and prove the coding theorem for the AWGN
channel, including both the direct and the converse part.

Theorem (Capacity of the Gaussian channel).
A rate R is achievable if and only if

R < C = max_{f_X(x): E[X²]≤P_X} I(X; Y),          (6.25)

and

C = (1/2) log(1 + P_X/P_N)   bits/use of the channel.          (6.26)
Proof. The proof is organized in three parts: in the first part we formally derive expression (6.26) for the Gaussian channel capacity, while in the second and third parts we prove respectively the achievability and the converse parts of the theorem.

• Without loss of generality we can assume that the constraint in (6.25) holds with equality. Then,

C = max_{f_X(x): E[X²]=P_X} I(X; Y) = max_{f_X(x): E[X²]=P_X} [h(Y) − h(Y|X)]
  = max_{f_X(x): E[X²]=P_X} [h(Y) − h(X + Z|X)]
(a)= max_{f_X(x): E[X²]=P_X} [h(Y) − h(Z|X)]
  = max_{f_X(x): E[X²]=P_X} [h(Y) − h(Z)]
  = max_{f_X(x): E[X²]=P_X} h(Y) − (1/2) log 2πeP_N,          (6.27)

where in (a) we exploited the fact that h(X + Z|X) = h(Z|X). We now look for a bound on h(Y). For simplicity, we force Y to be a zero-mean random variable (the entropy does not change); in this way, the variance of Y is⁷

σ_y² = E[Y²] = E[X²] + E[Z²] ≤ P_X + P_N.          (6.28)

We know that, for a fixed variance, the Gaussian distribution yields the maximum value of the entropy. Hence, from (6.27),

h(Y) − (1/2) log 2πeP_N ≤ (1/2) log 2πe(P_X + P_N) − (1/2) log 2πeP_N
                        = (1/2) log(1 + P_X/P_N).          (6.29)

In order to conclude the proof we have to show that an input distribution f_X(x) exists which allows us to reach the limit value. It's easy to see that such a distribution is the Gaussian distribution with σ_x² = P_X. In this case, in fact, Y is also a Gaussian random variable, with σ_y² = P_X + P_N.

⁷Remember that the noise Z has zero mean.

• (Achievability)

We now pass to the proof of the direct implication of the theorem (stating that any rate below C is achievable). As usual, we make use of the concepts of random coding and jointly typical decoding.
We consider the n-th extension of the channel. For a fixed rate R, the first step is the generation of the codebook for the 2^{nR} indexes. Since, as in the discrete case, we will consider large values of n, we can generate the codewords i.i.d. according to a density function f_X(x) with variance P_X − ε, so as to ensure the fulfillment of the power constraint (according to the LLN, the signal power (1/n) Σ_{i=1}^{n} x_i²(1) tends to σ² as n → ∞). Let x^n(1), x^n(2), ..., x^n(2^{nR}) be the codewords and x^n(i) the generic codeword transmitted through the channel. The sequence y^n at the output of the channel is decoded at the receiver by using the same procedure described for discrete channel decoding; that is, we search for a codeword which is jointly typical with the received sequence and we declare it to be the transmitted codeword.
We now evaluate the error probability. Without any loss of generality we assume that the codeword W = 1 was sent. Let us define the possible types of error:
- violation of the power constraint (tx side):

E_0 = { x^n(1) : (1/n) Σ_{i=1}^{n} x_i²(1) > P_X };          (6.30)

- the received sequence is not jointly typical with the transmitted one:

E_1 = { (x^n(1), y^n) ∉ A_ε^{(n)} };          (6.31)

- the received sequence is jointly typical with another codeword (different from the transmitted one):

E_i = { (x^n(i), y^n) ∈ A_ε^{(n)} },   i ≠ 1.          (6.32)

The error event, that is the event Ŵ ≠ 1, can be described as

E = E_0 ∪ E_1^c ∪ E_2 ∪ ... ∪ E_{2^{nR}}.          (6.33)

According to the code generation procedure used, the error probability averaged over all codewords and codes corresponds to the error probability for any given transmitted codeword. Hence,

Pr(E) = Pr(E|W = 1)
      = Pr(E_0 ∪ E_1^c ∪ E_2 ∪ ... ∪ E_{2^{nR}})
      ≤ P(E_0) + P(E_1^c) + Σ_{i=2}^{2^{nR}} P(E_i).          (6.34)

As n → ∞, by the law of large numbers and the joint AEP theorem (respectively), we know that P(E_0)⁸ and P(E_1^c) tend to zero. Besides, we know that X^n(1) and X^n(i) for any i ≠ 1 are independent by construction; then, the joint AEP theorem provides an upper bound to the probability that X^n(i) and the output Y^n (= X^n(1) + Z^n) are jointly typical, which is 2^{−n(I(X;Y)−3ε)}. Going on from (6.34), for sufficiently large n we have

Pr(E) ≤ ε_1 + ε_2 + (2^{nR} − 2) · 2^{−n(I(X;Y)−3ε)}
      ≤ ε_1 + ε_2 + 2^{−n(I(X;Y)−R−3ε)},          (6.35)

with ε_1 and ε_2 arbitrarily small. If R < I(X;Y), it's easy to see that we can choose a positive ε such that 3ε < I(X;Y) − R, thus yielding 2^{−n(I(X;Y)−R−3ε)} → 0 for n → ∞, and then an arbitrarily small error probability. So far we have considered the average error probability; we can repeat the same passages of Section 5.2.3 in order to prove that the maximal probability of error λ_max^{(n)} is arbitrarily small too. Therefore, any rate below I(X;Y), and then below C, is achievable.

• (Converse)

We now show that the capacity of the channel C is the supremum of all achievable rates. The proof differs from the one given for the discrete case. For any code satisfying the power constraint we show that if P_e^{(n)} → 0 then the rate R must be less than C.
Let W be a r.v. uniformly distributed over the index set W = {1, 2, ..., 2^{nR}}. Being H(W) = nR, we can write

nR = I(W; Y^n) + H(W|Y^n).          (6.36)

⁸We point out that, strictly speaking, a new version of the AEP theorem, accounting also for the constraint on the power, is needed.

Applying Fano's inequality to H(W|Y^n) yields⁹

H(W|Y^n) ≤ 1 + P_e^{(n)} log(|W| − 1)
         < 1 + P_e^{(n)} log(|W|)
         = 1 + P_e^{(n)} nR
         = n (1/n + P_e^{(n)} R).          (6.37)

Given that P_e^{(n)} → 0 for n → ∞, the expression in brackets can be made arbitrarily small for sufficiently large n. Let us name it ε_n. Then, we can write

nR ≤ I(W; Y^n) + nε_n
 (a)≤ I(X^n; Y^n) + nε_n
   = h(Y^n) − h(Y^n|X^n) + nε_n
 (b)≤ Σ_i h(Y_i) − h(Z^n) + nε_n
   = Σ_i (h(Y_i) − h(Z_i)) + nε_n,          (6.38)
i

where (a) follows from the fact that W → xn (W ) → Y n is a Markov chain.


Observing that h(Y n |X n ) = h(X n + Z n |X n ) = h(Z n |X n ) = h(Z n
P ) (the sig-
n
nal and the noise are assumed independent) and that h(Y ) ≤ i h(Yi ) (the
received signals may not be independent), inequality (b) holds.
As usual, we can use the Gaussian density function as an upper bound for
h(Yi ), while Zi is itself a Gaussian variable. We have: E[Yi2 ] = E[Xi2 ] +
E[Zi2 ] + 2E[X1 ]E[Zi ] = E[Xi2 ] + E[Zi2 ], where E[X 2
Pi ] is the2 average power
2 1
corresponding to the position i, i.e. E[Xi ] = 2nR w xi (w) (The random-
ness of Xi directly follows from the randomness of W since Xi = xi (W )).
Let us denote it by Pi . Then: E[Yi2 ] = Pi + PN and from (6.38) we have
nR ≤ Σ_i [ (1/2) log 2πe(P_i + P_N) − (1/2) log 2πeP_N ] + nε_n
   = Σ_i (1/2) log(1 + P_i/P_N) + nε_n.          (6.39)

⁹It can be proven that Fano's inequality also holds if the variable under investigation is discrete and the conditioning variable is continuous, but not in the reverse case.

Dividing by n we obtain:

R < (1/n) Σ_{i=1}^{n} (1/2) log(1 + P_i/P_N) + ε_n.          (6.40)

Since the log is a concave function, we can exploit the following property:

Property. Let f be a concave function. The following relation holds:

(1/n) Σ_{i=1}^{n} f(x_i) ≤ f( (1/n) Σ_{i=1}^{n} x_i ).          (6.41)

Proof. The proof follows by induction. For n = 2 the relation is true, due to the concavity of f. Supposing that relation (6.41) is true for n − 1, we have to prove that it also holds for n.
We can write:

f( (1/n) Σ_{i=1}^{n} x_i ) = f( x_n/n + ((n−1)/n) · (1/(n−1)) Σ_{i=1}^{n−1} x_i )
                           ≥ (1/n) f(x_n) + ((n−1)/n) f( (1/(n−1)) Σ_{i=1}^{n−1} x_i ).          (6.42)

Given the two points x_n and (1/(n−1)) Σ_{i=1}^{n−1} x_i, inequality (6.42) follows from the concavity of the function f. By applying relation (6.41) to the second term of the sum in (6.42)¹⁰ we obtain

f( (1/n) Σ_{i=1}^{n} x_i ) ≥ (1/n) f(x_n) + ((n−1)/n) · (1/(n−1)) Σ_{i=1}^{n−1} f(x_i)
                           = (1/n) Σ_{i=1}^{n} f(x_i).          (6.43)

By exploiting the property, from (6.40) we have

R < (1/2) log( 1 + (Σ_{i=1}^{n} P_i/n) / P_N ) + ε_n.          (6.44)

¹⁰Remember that we made the assumption that relation (6.41) holds for n − 1.

Let us now observe that

Σ_i P_i/n = (1/n) Σ_i (1/2^{nR}) Σ_w x_i²(w) = (1/2^{nR}) Σ_w ( (1/n) Σ_i x_i²(w) ) < P_X,          (6.45)

where the expression in round brackets is the average power of the codeword x^n(w), which averaged over all the codewords is less than P_X. We eventually get the following upper bound for the rate:

R < (1/2) log(1 + P_X/P_N) + ε_n = C + ε_n.          (6.46)

This proves that for n → ∞, if P_e^{(n)} → 0, then necessarily R ≤ C.

6.4.3 Bandlimited Channels

Up to now we have treated the Gaussian channel as if it had an infinite bandwidth. In practice, this is never the case. Nevertheless, from Shannon's sampling theorem we know that for a channel with finite bandwidth W we can consider a sampled version of the signal with T_c ≤ 1/2W (T_c = sampling step). In this way the channel does not distort the transmitted signal.
If the power spectral density of the white noise is N_0/2 watts/hertz, we can say that P_N = N_0 W. Substituting this into the expression of the capacity, we obtain

C = (1/2) log(1 + P_X/(N_0 W))   bits/transmission.          (6.47)

By assuming a sampling frequency equal to the minimum frequency dictated by Shannon's sampling theorem, we can use the channel 2W times per second. Then, the channel capacity in bits per second is

C = W log(1 + P_X/(N_0 W))   bits/sec,          (6.48)

where the ratio P_X/(N_0 W) is the SNR (Signal-to-Noise Ratio). This is the famous Shannon formula for the capacity of an additive white Gaussian noise channel (AWGN).
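A small sketch (ours, with made-up values of P and N_0) shows how (6.48) saturates at the limit (6.49) of the next subsection as the bandwidth grows:

```python
import numpy as np

def awgn_capacity_bps(W, P, N0):
    """Shannon's formula (6.48): capacity in bits/second for bandwidth W (Hz),
    signal power P (W) and one-sided noise power spectral density N0 (W/Hz)."""
    return W * np.log2(1.0 + P / (N0 * W))

P, N0 = 1e-6, 1e-12                      # hypothetical example values (P/N0 = 1e6)
for W in [1e3, 1e6, 1e9]:
    print(W, awgn_capacity_bps(W, P, N0))
print(np.log2(np.e) * P / N0)            # limiting value for W -> infinity, ≈ 1.44e6 bits/sec
```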

Shannon’s Capacity Curve

Looking at expression (6.48), the basic factors which determine the value of the channel capacity are the channel bandwidth W and the input signal power P_X. Increasing the input signal power obviously increases the channel capacity. However, the presence of the logarithm makes this growth slow. If we consider, instead, the channel bandwidth, which is the other parameter we can actually set, we realize that an increase of W (enlargement of the bandwidth) has two contrasting effects. On one side, a larger bandwidth allows us to increase the transmission rate; on the other side, it causes a higher input noise at the receiver, thus reducing the capacity. While for small values of W it's easy to see that enlarging the bandwidth leads to an overall increase of the capacity, for large W we have¹¹:

lim_{W→∞} C = log e · P/N_0.          (6.49)

Then, by increasing only the bandwidth, we cannot increase the capacity beyond (6.49).
We now introduce Shannon's capacity curve, which shows the existence of a trade-off between power and bandwidth in any communication system.
Since in any practical reliable communication system we have R < C, the following relation is satisfied:

R < W log(1 + P/(N_0 W)).          (6.50)

Dividing both sides by W yields

r < log(1 + P/(N_0 W)),          (6.51)

where r = R/W is the spectral efficiency, i.e. the number of bits per second that can be transmitted per unit of bandwidth (Hertz). By observing that P = E_b · R (where E_b is the energy per transmitted bit) we get

r < log(1 + r · E_b/N_0).          (6.52)

Figure 6.2: Shannon's capacity curve in the (E_b/N_0, r) plane, separating the achievable rates (bandwidth-efficient and power-efficient transmissions) from the non-achievable ones; the vertical asymptote is at E_b/N_0 = ln 2 ≈ −1.6 dB.

The above relation defines the achievable spectral efficiencies for any value of the ratio E_b/N_0. The locus of points r such that r = log(1 + r · E_b/N_0) is the so-called Shannon curve. Shannon's curve splits the (E_b/N_0, r) plane into two regions, as plotted in Figure 6.2. The region below the curve includes all the operative points for which reliable transmission is possible. In order to determine the behavior of the energy-to-noise ratio as r varies, we can exponentiate both sides of (6.52) (base 2) at equality, obtaining

E_b/N_0 = (2^r − 1)/r.          (6.53)
Then, we can evaluate the following limit values:

• r → ∞  ⇒  E_b/N_0 → ∞;
• r → 0  ⇒  E_b/N_0 → ln 2;

proving that the curve in Figure 6.2 has a vertical asymptote at E_b/N_0 = ln 2, below which no reliable transmission is possible (for any value of r). Clearly, the closer the working point is to the curve, the more efficient the communication system. All the communications whose main concern is the limitation of the transmitted power lie in the area of the plane in which r ≪ 1 (which is the area of power-efficient transmission). We refer to these systems as power-limited systems. On the contrary, all the systems for which the bandwidth of the channel is small, referred to as bandwidth-limited systems, lie in the area where r ≫ 1 (which is the area of spectrally efficient transmission). Nevertheless, there is an unavoidable trade-off between power efficiency and bandwidth efficiency.

¹¹We make use of the approximation ln(1 + P/(N_0 W)) ≈ P/(N_0 W), holding for P/(N_0 W) ≪ 1.
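The trade-off expressed by (6.53) is easy to tabulate; the sketch below (ours) also recovers the −1.59 dB asymptote:

```python
import numpy as np

def ebn0_db(r):
    """Minimum Eb/N0 (dB) required for spectral efficiency r, from (6.53)."""
    return 10 * np.log10((2.0 ** r - 1.0) / r)

for r in [0.01, 1.0, 2.0, 6.0]:
    print(r, ebn0_db(r))                 # -1.58, 0.0, 1.76, 10.2 dB
print(10 * np.log10(np.log(2)))          # r -> 0 limit: ln 2 ≈ -1.59 dB
```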
We now give some insights into how digital modulations are distributed with respect to Shannon's curve. From a theoretical point of view, the channel coding theorem asserts that it's possible to work with spectral efficiencies arbitrarily close to the curve with P_e = 0. In practice, classical modulation schemes always have a positive, although small, error probability and, despite this, they lie very far from the curve. Channel coding is what allows us to improve the performance of a system, moving the operative points closer to Shannon's capacity curve.
Below, we look at some examples of digital modulations. In the case of power-limited systems, high-dimensional schemes are frequently used (e.g. M-FSK), which allow power savings at the expense of bandwidth. Conversely, in the case of bandwidth-limited systems, the goal is to save bandwidth, so low-dimensional modulation schemes (e.g. M-PSK) are often implemented¹².

• B-PSK
For a binary PSK (B-PSK or 2-PSK), the error probability of a symbol corresponds to the bit error probability and is given by¹³

P_e = Q(√(2E_b/N_0)).          (6.54)

For P_e = 10^{−4} we get E_b/N_0 ≈ 8.5 dB (from the table of the Q function). Let T_s indicate the transmission time of a symbol. By using a B-PSK modulation scheme we transmit one bit per symbol, and then the per-symbol energy E_s corresponds to the energy per bit E_b (E_s = E_b). Let W denote the bandwidth of the impulse of duration T_s, i.e. W = 1/T_s¹⁴. Then r = R/W = (1/T_s)/(1/T_s) = 1.

¹²In all the examples we consider an error probability P_e of about 10^{−4}.
¹³The function Q gives the tail probability of the Gaussian distribution. More precisely, Q(x) denotes the probability that a normal (Gaussian) random variable N(µ, σ²) takes a value larger than x standard deviations (σ) above the mean (µ).
¹⁴Strictly speaking, a finite impulse has an infinite bandwidth. Nevertheless, in digital modulation applications it is common to take the bandwidth as the frequency range which encompasses most of (but not all) the energy of the impulse. Indeed, the higher frequencies contribute to giving the (exact) shape, which, in such cases, is unnecessary.

Figure 6.3: Location of the operative points of the classical modulation schemes (B-PSK, Q-PSK, M-PSK for M > 4, 2-FSK, 4-FSK, M-FSK for M > 4) on the Shannon plane.

The corresponding operative point for the B-PSK scheme is shown in Figure 6.3.
According to Shannon's limit, the same rate could have been reached with E_b/N_0 = 0 dB (i.e. with a power saving of 8.5 dB).
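The required E_b/N_0 quoted above can be recomputed by inverting the Q function, e.g. with scipy (sketch ours):

```python
import numpy as np
from scipy.stats import norm     # norm.sf is the Q function, norm.isf its inverse

# Required Eb/N0 for a target Pe on a B-PSK link: Pe = Q(sqrt(2*Eb/N0)), see (6.54)
Pe_target = 1e-4
gamma = norm.isf(Pe_target) ** 2 / 2          # Eb/N0 in linear scale
print(10 * np.log10(gamma))                   # ≈ 8.4 dB, consistent with the ~8.5 dB above
```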

• Q-PSK
In QPSK modulation the probability of symbol error is approximated by

P_e ≈ 2Q(√(2E_b/N_0)),          (6.55)

where now E_s = 2E_b.
The multiplicative term in (6.55) does not influence the value of E_b/N_0 for P_e = 10^{−4}, which depends almost only on the argument of the Q function and is then the same as in the B-PSK case. On the other hand, QPSK modulation transmits two information bits per symbol (R = 2/T_s), hence r = 2 (see Figure 6.3).

• M-PSK
From the general expression for the P_e of an M-PSK it follows that, as M grows, a larger E_b/N_0 is required (for the same value of P_e). Besides, the increase of the rate R with M is logarithmic (log M bits are transmitted simultaneously), and then the general expression for the spectral efficiency is r = (log_2 M / T_s)/(1/T_s) = log_2 M. The approximate location of the operative points in the (E_b/N_0, r) plane is illustrated in Figure 6.3.
As mentioned previously, phase modulations (low-dimensionality modulations) permit saving bandwidth at the expense of power efficiency. Nevertheless, they remain far away from Shannon's curve.

• M-FSK
Given a couple of orthogonal signals with energy E_s, the probability of error is given by Q(√(E_s/N_0)). Considering M orthogonal signals, the union bound for the error probability yields

P_e ≤ (M − 1) Q(√(E_s/N_0)) = (M − 1) Q(√(log_2 M · E_b/N_0)).          (6.56)

Neglecting the multiplicative term, for a modulation scheme with M > 2 we can save a factor log_2 M of bit energy E_b with respect to the 2-FSK scheme (with the same P_e)¹⁵. However, orthogonal modulations are characterized by a linear growth of the bandwidth with M, i.e. W = M/T_s. Hence, since R = log_2 M / T_s, we get r = log_2 M / M. Saving power comes at the cost of a bandwidth increase. Figure 6.3 shows the operative points for various M.

The promise of Coding

Orthogonal modulation schemes approach the capacity curve as the dimensionality M grows, but the price to pay is extremely high, since the bandwidth grows linearly with M. Starting from a simple example, we show that through coding we can transmit information reliably, saving power without increasing the bandwidth too much.

Example.
Consider a situation in which we want to transmit 2 bits. Instead of using a 4-PSK for transmitting the two bits in 2T_b seconds, we can consider three orthogonal signals in the interval 2T_b, as depicted in Figure 6.4. The three orthogonal signals, ψ_1, ψ_2 and ψ_3, constitute a basis for the three-dimensional space. We can then build four distinct waveforms to be associated to each

¹⁵Remember that for a 2-FSK P_e = Q(√(E_b/N_0)), while for a 4-PSK P_e = Q(√(2E_b/N_0)).

Figure 6.4: Three orthogonal signals in 2T_b.

of the starting configurations of two bits. For instance, the four waveforms could be the ones depicted in Figure 6.5. In vector notation:
s_1 = √E (1, 1, 1);
s_2 = √E (1, −1, −1);
s_3 = √E (−1, 1, −1);
s_4 = √E (−1, −1, 1),
where the signal energy is E_s = 3E¹⁶. The signal energy can be obtained from the bit energy as E_s = 2E_b (so E = (2/3)E_b).
We recall that the general approximation of the error probability as a function of the distance d among the transmitted signals is given by

P_e = Q(√(d²/(2N_0))).          (6.57)

Having increased the dimensionality of the system (from two to three), the above procedure allows us to take four signals more distant from each other than in the Q-PSK scheme. In fact, for an arbitrary couple of vectors in the constellation, we have

d² = 8E = (16/3) E_b > 4E_b,          (6.58)

¹⁶E indicates the energy of each pulse of duration T = (2/3)T_b composing the signal.

Figure 6.5: Possible waveforms we can associate to the configurations of two bits.

where 4E_b is the minimum distance between the signals in the Q-PSK constellation. Hence:

P_e = Q(√((8/3) E_b/N_0)) = Q(√((4/3) · 2E_b/N_0)),          (6.59)

leading to a coding gain of 4/3 with respect to the Q-PSK scheme. Nevertheless, the signals contain pulses narrower than T_b (the pulse width is T = (2/3)T_b), and then they occupy a larger bandwidth (W = 3/(2T_b)). As a consequence, for this system we have r = 2/3¹⁷. Therefore, there is always a trade-off between power and bandwidth, but in this case the trade-off is more advantageous, as the following generalization of the above procedure clarifies.

Generalized procedure
What we have described above is nothing but a primitive form of coding. Let us now suppose that we aim at transmitting k bits. We can use a code C(k, n) in order to associate to any configuration of k bits another configuration of n bits (with n > k). In this way the constellation of 2^k points can be represented in the n-dimensional space, where each point lies

¹⁷With a Q-PSK we would have r = 1.

on a vertex of an n-dimensional hypercube. The vector representation of a given point is of the following kind: s = √E (..., 1, ..., −1, ...). Since the error probability is dominated by the contribution given by the couple of points at the shortest distance, we consider again the following approximation:

P_e ≃ Q(√(d_min²/(2N_0))).          (6.60)

According to this procedure, we have E = (k/n) E_b. Indicating with d_H the Hamming distance between two codewords (number of positions in which two codewords differ), it's straightforward to deduce that the distance between two codewords has the expression d² = 4 d_H E = 4 d_H (k/n) E_b. Then, denoting by d_{H min} the minimum Hamming distance between two codewords, we can write

P_e ≃ Q(√(d_{H min} · (k/n) · 2E_b/N_0)),          (6.61)

where G_c = d_{H min} · (k/n) is the so-called coding gain.
In turn, the bandwidth becomes W = n/(k T_b) and then r = k/n. Therefore, a high redundancy (n − k) causes a significant bandwidth expansion.
It is evident that, for a given ratio k/n, the best approach to coding is to generate the codewords with the largest minimum distance (d_{H min}), since this parameter does not affect the bandwidth. Nevertheless, if we want a larger d_{H min} in order to increase the coding gain G_c, we need to add redundancy and thus decrease the ratio k/n.
Through coding, as through FSK modulation schemes, we save power at the expense of bandwidth; but now the exchange rate is much more advantageous, as Figure 6.6 shows.
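Expression (6.61) is easy to evaluate; the sketch below (ours) computes the coding gain and the approximate P_e for the (k = 2, n = 3, d_{H min} = 2) code of the previous example:

```python
import numpy as np
from scipy.stats import norm

def coded_pe(k, n, d_H_min, ebn0_db):
    """Approximate Pe (6.61) for a block code C(k, n) with minimum Hamming distance
    d_H_min; Gc = d_H_min * k/n is the coding gain and r = k/n the spectral efficiency."""
    gamma = 10 ** (ebn0_db / 10.0)                 # Eb/N0 in linear scale
    Gc = d_H_min * k / n
    return norm.sf(np.sqrt(Gc * 2.0 * gamma)), Gc  # norm.sf(x) = Q(x)

# the example above: Gc = 4/3, i.e. about 1.25 dB of gain over Q-PSK
print(coded_pe(k=2, n=3, d_H_min=2, ebn0_db=8.5))
```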
Since Shannon’s time, great efforts have been made by researches for design-
ing channel coding schemes that get as close as possible to Shannon’s limit.
The recent invention of the LDPC and Turbo codes finally allowed to get
very close to reach this goal.

Ex : Prove that by using a repetition code we have a Gc = 1, while for


the Hamming code Gc = 3.

Figure 6.6: The role of coding in the (E_b/N_0, r) plane.
Chapter 7

Rate Distortion Theory

The source coding theorem states that a discrete source X can be encoded
losslessly as long as R ≥ H(X). However, in many real applications
the presence of (a moderate amount of) reconstruction errors does not com-
promise the result of the transmission (or the storage); then, it may sometimes
be preferable to admit errors within certain limits, i.e. a quality loss,
in order to increase compression efficiency. In other words, lossless compression re-
moves the statistical redundancy, but there are other types of redundancy,
e.g. psychovisual and psychoacoustic redundancy (depending on the appli-
cation), that can be exploited in order to increase the compression
ratio. Think for instance of JPEG compression for still images!
In order to introduce a controlled reduction of quality we need to define
a distortion measure, that is a measure of the distance between the random
variable and its (lossy) representation. The basic problem tackled by
rate distortion theory is determining the minimum expected distortion we
must tolerate in order to compress the source at a given rate.
Rate distortion theory is particularly suited to deal with continuous sources.
We know that in the continuous case lossless coding cannot be used, because
a continuous source requires an infinite precision to
be represented exactly. Then, while for discrete sources the rate distortion
theory can be introduced as an additional (optional) tool to source coding,
for continuous sources it is an essential tool for representing the source.

7.1 Rate Distortion Function


Let us consider, without loss of generality, a source X with finite alphabet X,
X ∼ p_X(x)¹. Let X̂ denote the (lossy) reconstruction of the random
variable X, with finite alphabet X̂.

¹ Similar arguments hold for continuous sources.


We introduce the distance measure d(x, x̂); the most commonly used for
continuous and discrete alphabets are respectively the Euclidean and the
Hamming distance:

• Euclidean distance:
d(x, x̂) = (x − x̂)2 ; (7.1)

• Hamming distance:

  d(x, x̂) = x ⊕ x̂ = { 0  if x = x̂
                      { 1  if x ≠ x̂.        (7.2)

The distortion measure or distortion function D is defined as


  D = E[d(X, X̂)],   (7.3)

where the average is taken over all the alphabet symbols x and all the possible
values of the reconstruction x̂. In the Euclidean case the distortion function
is
  D = E[(X − X̂)²],   (7.4)
i.e. the mean square error between the signal and its reconstruction.
In the Hamming case the distortion function is
  D = E[X ⊕ X̂] = P_e,   (7.5)

i.e. the probability of a reconstruction error.


We can extend the definition of distance to sequences of symbols xⁿ and x̂ⁿ
as follows:

  d(xⁿ, x̂ⁿ) = (1/n) Σ_{i=1}^{n} d(x_i, x̂_i),   (7.6)

which, for stationary sources, leads to the same mean value as before, that is

  D = E[d(Xⁿ, X̂ⁿ)] = E[d(X, X̂)].   (7.7)
Having introduced the above quantities, we can give the following definition.

Definition. The rate distortion function R(D) gives the minimum number
of bits (R_min) guaranteeing a reconstruction distortion E[d(X, X̂)] ≤ D.

Note: for D = 0, no distortion is accepted, hence R(0) is exactly the entropy


of the source H(X)2 .

It’s easy to argue that R(D) is a monotonic decreasing function of D.


Then, we can also compute the inverse function D(R), named the distortion
rate function. Given the (maximum) number of bits we are willing to spend,
D(R) tells us the minimum amount of distortion which is introduced in the
reconstruction.

The main theorem of the rate distortion theory (Shannon, 1959), also
known as the lossy coding theorem, is the following.

Theorem (Rate Distortion).


Let X ∼ p(x). Then

  R(D) = min_{p(x̂|x): E[d(X,X̂)]≤D} I(X; X̂)   (7.8)

is the minimum achievable rate at distortion D.

Observe that the conditional distribution p(x̂|x) in (7.8) is the actual


degree of freedom in the minimization; it derives from the joint distribution
p(x̂, x) by exploiting the knowledge of p(x) (conditional probability theorem).
Even though for simplicity we refer to discrete sources, we stress that the
theorem holds both for discrete and continuous sources.
The rigorous proof of the theorem is beyond the scope of these notes. Instead,
we’ll provide an outline of the proof in order to point out the main ideas
behind it.
Before starting, it is necessary to extend the typicality definitions given in
Chapter 4 so as to take into account the distortion D. In the new context
addressed here, in fact, we aim to characterize sequences which are typical
also with respect to a given distortion measure, namely 'distortion typical
sequences'.

Definition. Let X be a discrete memoryless source with pmf p(x) and let
X̂ be the reconstructed source with pmf p(x̂)3 . Let p(x, x̂) be the joint
probability distribution.
² By referring to continuous sources, R(0) is ∞.
³ For notational simplicity we omit the subscript in p_X(x) and p_X̂(x̂), it being recoverable from the argument.

We define the distortion jointly typical set A_{d,ε}^{(n)} as follows:

  A_{d,ε}^{(n)} = { (xⁿ, x̂ⁿ) ∈ Xⁿ × X̂ⁿ :
      | −(1/n) log p(xⁿ) − H(X) | < ε,
      | −(1/n) log p(x̂ⁿ) − H(X̂) | < ε,
      | −(1/n) log p(xⁿ, x̂ⁿ) − H(X, X̂) | < ε,
      | d(xⁿ, x̂ⁿ) − E[d(X, X̂)] | < ε },        (7.9)

representing, respectively, the typicality of xn and x̂n with respect to p(x)


and p(x̂), the joint typicality and the typicality with respect to distortion.

Note that the difference from the previous definition of jointly typical
set resides only in the additional constraint which expresses the typicality of
the couples of sequences with respect to distortion. Instead of a probability,
the involved statistics for measuring this type of typicality is the distance
between the random variables. Let us define d(xi , x̂i ) = di and consider
the corresponding random variable Di , which is a function of the random
variables Xi and X̂i , i.e. Di = d(Xi , X̂i ). By applying the law of large
numbers, as n → ∞ the sample mean of di tends to the ensemble average,
that is

  (1/n) Σ_{i=1}^{n} D_i  →  E[D]  in probability as n → ∞.   (7.10)

Then, the additional requirement regarding distortion does not limit much
the number of sequences in A_{d,ε}^{(n)} with respect to the number of sequences
in the jointly typical set, since for large n a sequence belongs to A_{d,ε}^{(n)} with
probability arbitrarily close to 1.

Outline of the Proof.

• (Direct implication/Achievability)

To start with, let us fix p(x̂|x)4 . Knowing the marginal pmf p(x), from p(x̂|x)
we can derive the joint pmf p(x̂, x) and then p(x̂).
Fix also R.
The proof of achievability proceeds along the following steps.
⁴ Chosen according to the constraint E[d(X, X̂)] ≤ D.

Generation of the codebook. Generate a codebook C consisting of 2nR se-


quences x̂n drawn i.i.d. according to p(x̂).
Encoding. For any xⁿ find a sequence x̂ⁿ such that (xⁿ, x̂ⁿ) ∈ A_{d,ε}^{(n)}. If there
is more than one such x̂ⁿ, take the one with the least index. If there is no
such sequence, take the first sequence, i.e. x̂ⁿ(1), and declare an error. The
index i ∈ {1, 2, ..., 2^{nR}} of the chosen sequence is the codeword.
Decoding. The list of the sequences x̂n is known to the decoder. Then, the
decoder associates to the received index i the corresponding sequence x̂n (i).
Computation of the distortion. Consider the expected distortion over the
random choice of codebooks, i.e. EX n ,C [d(X n , X̂ n )] (the subscript of E indi-
cates the variables over which the expectation is taken).

By referring to the above procedure it is possible to prove that:

EX n ,C [d(X n , X̂ n )] ≤ D if R ≥ I(X, X̂)5 . (7.11)

We now give an intuitive view of the implication in (7.11). From the initial
choice of p(x̂, x) and from the definition of A_{d,ε}^{(n)} we argue that if the coder at
the transmitter side has found a distortion jointly typical sequence, then the
expected distortion is close to D. But the sequences x̂ⁿ are drawn only
according to the marginal distribution p(x̂) and not to the joint one. There-
fore, we have to evaluate the probability that a pair of sequences (xⁿ, x̂ⁿ)
(generated by the corresponding marginal distributions) is typical. Accord-
ing to the joint AEP theorem, Pr{(xⁿ, x̂ⁿ) jointly typical} ∼ 2^{−nI(X;X̂)}. Hence, the prob-
ability of finding at least one x̂ⁿ which is distortion typical with xⁿ during
the encoding procedure is approximately

  2^{nR} · 2^{−nI(X;X̂)} = 2^{−n(I(X;X̂)−R)}.

We can hope to find such a sequence only if R > I(X; X̂). If this is not the
case, the probability of finding a typical x̂ⁿ tends to zero as n → ∞.
Now, we can exploit the degree of freedom we have on p(x̂|x) in order to de-
termine the minimum rate at which reconstruction is possible along with the
fixed maximum distortion D. Hence: Rmin = R(D) = minp(x̂|x):E[d(x,x̂)]≤D I(X; X̂).

• (Reverse implication/Converse)

The proof is quite involved, so we do not sketch it in these notes.

⁵ Note the correspondence with the channel capacity proof, in which we show that P_e → 0 if R < I(X; Y).

Figure 7.1: Graphical representation of the lossy coding procedure (quantization)
for n = 3. The choice n = 3 is only to ease the graphical representation.

Note: in the rate distortion theorem we have considered the average distor-
tion E[d(X, X̂)]. Nevertheless, the same result holds by considering a stricter
distortion constraint.

7.1.1 Interpretation of the Rate Distortion Theorem


The rate distortion theorem states that the function R(D) in (7.8) spec-
ifies the lowest rate at which the output of a source can be encoded while
keeping the distortion less than or equal to D. The sketch of the proof allows
us to make some interesting considerations. The generation of the codebook
is nothing else than ‘quantizing’ blocks of source symbols. Indeed, the proof
considers a finite set of 2nR values, {x̂n }, to represent the sequences of sym-
bols xn . The concept arises naturally if we consider a continuous source X
(xn ∈ Rn ). Let R be the number of bits used to represent a symbol of the
source, then nR bits are associated to a n-length sequence of symbols. Fig-
ure 7.1 illustrates the quantization procedure. According to the distortion
constraint, for any sequence xn the search of the point x̂n is restricted to
the neighborhood of xn for which d(xn , x̂n ) ≤ D. If the quantization is suf-
ficiently dense, that is if R > I(X, X̂), with high probability it’s possible to

Figure 7.2: Lossy source coding: the Shannon scheme (mapping of xⁿ to a distortion jointly typical x̂ⁿ(i), assignment and transmission of the index i, decoding of x̂ⁿ(i)).

find a point x̂ⁿ in the neighborhood of xⁿ which satisfies the joint typicality
property for large n. The quantization can be more or less coarse depending
on the value of D. If the tolerable distortion is small the quantization must
be fine (R large), while it can be made coarser (smaller R) as the amount of
tolerable distortion increases. Indeed, looking at the figure, if D decreases we
would have to increase the number of reconstruction sequences x̂ⁿ so that
at least one of them falls inside each region with high probability.
Figure 7.2 schematically represents the lossy coding procedure.

It’s easy to argue that in the discrete source case (xn ∈ X n ) the same
procedure leads to a further quantization of an already quantized signal, but
nothing conceptually changes. We stress again that in this case, as opposed
to the continuous source case, lossless coding is possible. However, rate
distortion theory can be applied whenever we prefer to decrease the coding
rate at the price of introducing an acceptable distortion.
As it happened for the proofs of Source and Channel Coding Theorems, the
proof of the Rate Distortion Theorem does not indicate a practical coding
strategy. Therefore, we have to face the problem of finding the optimum
set of points {x̂ⁿ} to represent the source for finite n. To this purpose, it is
easy to guess that knowing the type of source helps, and should then be taken
into account in order to make the rate close to the theoretical value R(D).
This problem will be dealt with in Section 7.2.

7.1.2 Computing the Rate Distortion Function

We now compute the rate distortion function R(D) for some common
sources.

Bernoulli source

p(1) = p;
p(0) = 1 − p.

Let us suppose, without any loss of generality, that p ≤ 1/2.


The most natural choice for the distance measure is the Hamming distance
dH .
The rate distortion function for the Bernoulli source with parameter p is
given by

  R(D) = { h(p) − h(D)   if D ≤ p⁶
         { 0             if D > p.        (7.12)

Proof. We have to compute:

  min_{p(x̂|x): E[X⊕X̂]≤D} I(X; X̂).   (7.13)

Case 1: D > p

Let us take X̂ = 0. This choice allows us to achieve the lower bound for the
mutual information, i.e. I(X; X̂) = 0. We have to check whether the constraint is
satisfied. It is easy to argue that it is, since E[X ⊕ 0] = Pr(X = 1) = p < D.
Note that this solution is also suggested by intuition: a reconstruction with
an error less than or equal to a value (D) greater than p is trivially obtained
by encoding every sequence as a zero sequence.

Case 2: D ≤ p

It is possible to solve minimization (7.13) through the same procedure adopted


for the computation of the capacity for some simple channels in Section 5.2.4:
first we find a lower bound for I(X; X̂), later on we seek a distribution which
fulfils the constraint and attains the limit value.

⁶ Notice that the notation h(D) is correct since D = E[d_H] ≤ 1.

Figure 7.3: Joint distribution between X̂ and X given by the binary symmetric
channel with crossover probability D (test channel), with input probabilities r = Pr(X̂ = 1) and 1 − r, and output probabilities p and 1 − p.

Let us find the lower bound:

  I(X; X̂) = H(X) − H(X|X̂)
        (a) = h(p) − H(X ⊕ X̂ | X̂)
           ≥ h(p) − H(X ⊕ X̂)
        (b) = h(p) − h(E[X ⊕ X̂]),        (7.14)

where (a) follows from the fact that x = x̂ ⊕ (x ⊕ x̂), while (b) is obtained
by observing that X ⊕ X̂ is itself a binary source with P r{X ⊕ X̂ = 1} =
E[(X ⊕ X̂)].
Now, since the binary entropy h(r) grows with r (with r < 1/2) and E[(X ⊕
X̂)] is less than D, we have h(E[X ⊕ X̂]) ≤ h(D) and then by going on from
(7.14) we get
I(X; X̂) ≥ h(p) − h(D). (7.15)
At this point we know that R(D) ≥ h(p) − h(D). Let us show that a
conditional probability distribution p(x̂|x) attaining this value exists.
For establishing a relation between the two binary random variables X and
X̂, that is for determining a joint distribution, we can refer to the binary
symmetric channel (BSC) with ε = D, see Figure 7.3. Let us determine the
input of the channel X̂ so that the output X has the given distribution (with
the fixed p(x)). Let r = P r(X̂ = 1). Then, we require that

r(1 − D) + (1 − r)D = p, (7.16)

that is
  r = (p − D)/(1 − 2D).   (7.17)

For D ≤ p the choice p(x̂ = 1) = (p − D)/(1 − 2D):

1. satisfies the constraint, being E[X ⊕ X̂] = Pr{X ≠ X̂} = D;

2. reaches the minimum value I(X; X̂) = h(p) − h(D) by construction.

We have then proved that R(D) = h(p) − h(D).
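As a quick numerical sanity check of this result (a sketch added here, not part of the original derivation), the following Python fragment builds the BSC test channel of Figure 7.3 for illustrative values p = 0.3 and D = 0.1 and verifies that E[X ⊕ X̂] = D and I(X; X̂) = h(p) − h(D).

```python
# Sketch: numerical check of R(D) = h(p) - h(D) for a Bernoulli(p) source
# under Hamming distortion, using the BSC test channel of Figure 7.3.
# p and D are illustrative values with D <= p <= 1/2.
import numpy as np

def h(q):                                   # binary entropy (bits)
    return -q*np.log2(q) - (1-q)*np.log2(1-q)

p, D = 0.3, 0.1
r = (p - D) / (1 - 2*D)                     # P(Xhat = 1), eq. (7.17)

# joint pmf p(xhat, x): X is Xhat passed through a BSC with crossover D
P_xhat = np.array([1 - r, r])               # index 0 -> symbol 0, 1 -> symbol 1
BSC = np.array([[1 - D, D],
                [D, 1 - D]])                # rows: xhat, cols: x
joint = P_xhat[:, None] * BSC               # p(xhat, x)

P_x = joint.sum(axis=0)                     # marginal of X (should be [1-p, p])
Pe = joint[0, 1] + joint[1, 0]              # P(X != Xhat) = E[X xor Xhat]

I = sum(joint[i, j] * np.log2(joint[i, j] / (P_xhat[i] * P_x[j]))
        for i in range(2) for j in range(2))

print("P(X=1)        :", P_x[1])            # ~ p
print("E[X xor Xhat] :", Pe)                # ~ D
print("I(X;Xhat)     :", I)                 # ~ h(p) - h(D)
print("h(p) - h(D)   :", h(p) - h(D))
```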

Gaussian source

Let X be a Gaussian source, X ∼ N(0, σ_x²).
For this type of source it is reasonable to adopt the Euclidean distance
(squared error distortion). The rate distortion function is given by

  R(D) = { (1/2) log(σ_x²/D)   if D ≤ σ_x²
         { 0                   if D > σ_x².        (7.18)

Proof. We have to compute:

  min_{f(x̂|x): E[(X−X̂)²]≤D} I(X; X̂).   (7.19)

Case 1: D > σx2

We can take X̂ ≡ 0. With this choice the average error we make is the
variance of the random variable X which, being less than D, allows us to
satisfy the constraint: E[X²] = σ_x² ≤ D⁷. Besides, this choice attains the
absolute minimum for I(X; X̂), that is I(X; X̂) = 0. Then, R(D) = 0.

Case 2: D ≤ σx2

We go along the same steps we followed in the Bernoulli case.


We first find a lower bound for I(X; X̂):

  I(X; X̂) = h(X) − h(X|X̂)
           = (1/2) log(2πe σ_x²) − h(X − X̂ | X̂)
           ≥ (1/2) log(2πe σ_x²) − h(X − X̂)
        (a) ≥ (1/2) log(2πe σ_x²) − (1/2) log(2πe E[(X − X̂)²])
           ≥ (1/2) log(σ_x²/D),        (7.20)

where (a) derives from the fact that (X − X̂) is a random variable whose
⁷ We recall that for any random variable Z the relation σ_z² = E[Z²] − μ_z² holds.

Figure 7.4: Joint distribution between X̂ and X given by the AWGN (test channel): X = X̂ + N, with X̂ ∼ N(0, σ_x² − D), N ∼ N(0, D) and X ∼ N(0, σ_x²).

variance is surely less than the mean square error E[(X − X̂)2 ] and then, the
entropy of a Gaussian random variable with variance E[(X − X̂)2 ] gives an
upper bound for h(X − X̂) (principle of the maximum entropy).
We now have to find a distribution f (x̂|x) that attains the lower bound for
I. As before, it is easier to look at the reverse conditional probability f (x|x̂)
as the transitional probability of a channel and choose it in such a way that
the distribution of the channel output x is the desired one. Then, from the
knowledge of f (x) and f (x̂) we derive f (x̂|x). Let us consider the relation
between X̂ and X depicted in Figure 7.4 (test channel ), i.e. we assume that
the difference between X and its reconstruction X̂ is an additive Gaussian
noise N . It is easy to check that this choice:
1. satisfies the distortion constraint; indeed

E[(X − X̂)2 ] = E[N 2 ] = D (7.21)

2. achieves the lower bound:

  I(X; X̂) = h(X) − h(X|X̂)
           = (1/2) log(2πe σ_x²) − h(X − X̂ | X̂)
           = (1/2) log(2πe σ_x²) − h(N | X̂)
        (a) = (1/2) log(2πe σ_x²) − h(N)
           = (1/2) log(2πe σ_x²) − (1/2) log(2πe D)
           = (1/2) log(σ_x²/D),        (7.22)
where (a) follows from the fact that according to the chosen model N and
X̂ are independent.

Figure 7.5: Rate distortion curve for a Gaussian source. The rates lying above the curve are achievable (distortion less than D).

Figure 7.5 depicts the rate distortion curve for a Gaussian source. The
curve partitions the space into two regions; by varying D, only the rates
lying above the curve are achievable. For D → 0 we fall back into lossless
source coding, and then R → ∞ (entropy of a continuous random variable).
If instead the allowed reconstruction distortion is larger than σ_x², there is no need to
transmit any bit (R = 0).
For the Gaussian source we can express the distortion in terms of the rate
by reversing R(D), obtaining

D(R) = σx2 2−2R . (7.23)

Given the number of bits we are willing to spend for describing the source,
D(R) provides the minimum distortion we must tolerate in the reconstruction
(Figure 7.6). Obviously, the condition D = 0 is achievable only asymptoti-
cally.
Let us evaluate the signal to noise ratio associated to the rate distortion:

  SNR = σ_x²/D = 2^{2R}  →  SNR_dB ≃ 6R.   (7.24)
For any bit we add, the SNR increases by 6 dB.
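The following minimal Python sketch tabulates (7.23) and the corresponding SNR for an illustrative unit-variance source (the values are only meant to make the 6 dB-per-bit behavior visible).

```python
# Sketch: distortion-rate function D(R) = sigma_x^2 * 2^(-2R) of (7.23)
# and the corresponding SNR in dB, for an illustrative variance sigma_x^2 = 1.
import numpy as np

sigma2 = 1.0
for R in range(1, 9):                       # bits per sample
    D = sigma2 * 2.0 ** (-2 * R)            # minimum achievable MSE
    snr_db = 10 * np.log10(sigma2 / D)      # ~ 6 dB per added bit
    print(f"R = {R} bit/sample:  D(R) = {D:.5f}   SNR = {snr_db:.1f} dB")
```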

Note: it is possible to prove that, like the differential entropy for the Gaussian
source, the rate distortion function for the Gaussian source is larger than
the rate distortion function for any other continuous source with the same
variance. This means that, for a fixed D, the Gaussian source gives the
maximum R(D). This is a valuable result because for many sources the
computation of the rate distortion function is very difficult. In these cases,

Figure 7.6: Distortion rate curve for a Gaussian source. For a fixed R, the amount of distortion introduced cannot be less than the value of the curve at that point (the achievable distortions lie above the curve).

the rate distortion curve in Figure 7.5 provides an upper bound.

7.1.3 Simultaneous representation of Independent Gaus-


sian Random variables
Let us consider the problem of representing M independent zero mean
Gaussian random variables X₁, ..., X_M having different variances, i.e. X_i ∼
N(0, σ_i²), given a global distortion constraint:

  E[ Σ_{i=1}^{M} d_i(X_i, X̂_i) ] ≤ D.   (7.25)

We take d_i(X_i, X̂_i) = (X_i − X̂_i)² (squared-error distortion).


The problem we have to solve is the following: given R bits for representing
the M sources, what is the best possible allocation, that is the allocation
which minimizes the overall distortion D?
Because of the global constraint we have to join the M random variables in
a vector and encode it as a unique symbol. We then have to consider the
extension of the rate distortion function to the vector case, that is:

  R(D) = min_{f(x̂^M|x^M): E[||X^M − X̂^M||²]≤D} I(X^M; X̂^M),   (7.26)

where we used the Euclidean norm.


As usual, firstly we determine the lower bound and later search for a joint
distribution which reaches it.

1. Evaluation of the lower bound:

  I(X^M; X̂^M) = h(X^M) − h(X^M|X̂^M)
              = Σ_{i=1}^{M} h(X_i) − Σ_{i=1}^{M} h(X_i | X̂^M, X_{i−1}, ..., X_1)   (7.27)
              ≥ Σ_{i=1}^{M} h(X_i) − Σ_{i=1}^{M} h(X_i | X̂_i)                      (7.28)
              = Σ_{i=1}^{M} I(X_i; X̂_i)
              ≥ Σ_{i=1}^{M} R(D_i)                                                 (7.29)
              = Σ_{i=1}^{M} [ (1/2) log(σ_i²/D_i) ]⁺,                              (7.30)

where in equality (7.27) we exploited the independence of the random vari-


ables Xi (for the first term) and the chain rule (for the second term), while
inequality (7.28) follows from the fact that conditioning reduces entropy. In
equality (7.30), the plus sign in the subscript is a compact way for writing
the expression in (7.18) (each term of the sum is the expression in round
brackets if it is positive, 0 otherwise).
Each term Di , i = 1, ..., M denotes the average distortion Passigned to the
2 M
i-th variable (Di = E[(Xi − X̂i ) ]). Overall we must have i=1 Di ≤ D.

2. Search for a conditional probability f(x^M|x̂^M): to do so we wonder when
inequalities (7.28) and (7.29) are satisfied with equality. As to the former, since
X_i does not depend on X_1, ..., X_{i−1}, we have that h(X_i | X̂^M, X_{i−1}, ..., X_1) =
h(X_i | X̂^M). Hence, by choosing f(x^M|x̂^M) = ∏_{i=1}^{M} f(x_i|x̂_i)⁸, equation (7.28)
holds with equality. We still have the freedom of choosing the probabilities
f(x_i|x̂_i) for each i; we can take them in such a way that equation (7.29)
holds with equality too. From the previous evaluation of R(D) for the Gaus-
sian source, we know that if we consider the conditional probability f(x_i|x̂_i)
obtained by the test channel which adds noise N ∼ N(0, D_i) to an input
x̂_i ∼ N(0, σ_i² − D_i), we achieve the condition I(X_i; X̂_i) = R(D_i). Hence,
taking f(x_i|x̂_i) for each i in this way permits us to satisfy (7.29) with equality.

⁸ According to this expression, given the reconstruction x̂_i, the symbol x_i is conditionally independent of the other reconstructions.

We have then found an f(x^M|x̂^M) such that I(X^M; X̂^M) = Σ_{i=1}^{M} [ (1/2) log(σ_i²/D_i) ]⁺.
Now we remember that in our problem the distortion values D_i, i = 1, ..., M,
provide an additional degree of freedom we can exploit. Hence, from (7.30)
the final minimum is obtained by varying D_i, i = 1, ..., M, that is:

  R(D) = min_{D_i: Σ_i D_i = D} Σ_{i=1}^{M} [ (1/2) log(σ_i²/D_i) ]⁺.   (7.31)

In (7.31) the distortion constraint is expressed with equality since it is rea-


sonable to expect that the minimum value will be achieved exploiting all the
available distortion.
Then, in order to find the rate distortion function for the M independent
Gaussian random variables with global distortion constraint we have to solve
a constrained optimization problem.

We can solve the minimization in (7.31) by applying the Lagrange method.


Accordingly, we have to minimize the functional

  min_{D_i} { Σ_{i=1}^{M} [ (1/2) log(σ_i²/D_i) ]⁺ + λ ( Σ_{i=1}^{M} D_i − D ) }.   (7.32)

Let us write down the Karush-Kuhn-Tucker (KKT) conditions9 :


  d/dD_j { Σ_{i=1}^{M} [ (1/2) log(σ_i²/D_i) ]⁺ + λ ( Σ_{i=1}^{M} D_i − D ) } = 0   ∀j
  Σ_{i=1}^{M} D_i − D = 0
  λ ≥ 0.                                                                           (7.33)

⁹ For nonlinear optimization problems, the Karush-Kuhn-Tucker (KKT) conditions are
necessary conditions that a solution has to satisfy for being optimal. In some cases, the
KKT conditions are also sufficient for optimality; this happens when the objective func-
tion is convex and the feasible set is convex too (convex inequality constraints and linear
equality constraints). The system of equations corresponding to the KKT conditions is
usually solved numerically, except in the few special cases where a closed-form solution
can be derived analytically.
In the minimization problem considered here, the KKT are necessary and sufficient con-
ditions for optimality. Besides, we will be able to solve them analytically.

Solving system (7.33) is complicated by the presence, in the objective func-
tion, of the superscript plus sign in the terms of the sum. Let us assume
for the moment that D_i ≤ σ_i² ∀i; in this case we have the system

  d/dD_j { Σ_{i=1}^{M} (1/2) log(σ_i²/D_i) + λ ( Σ_{i=1}^{M} D_i − D ) } = 0   ∀j
  Σ_{i=1}^{M} D_i − D = 0
  λ ≥ 0.                                                                       (7.34)

The computation of the KKT conditions in (7.34) is now much easier:


 
  d/dD_j [ −(1/2) log D_j + λ D_j ] = 0   →   D_j = 1/(2λ) = λ′,   ∀j
  Σ_{i=1}^{M} D_i − D = 0                 →   D_j = D/M
  λ′ ≥ 0.
If D/M ≤ σ_i² ∀i, then the solution of the minimization (7.31) is D_i = D/M for each
i, which means distributing the distortion equally among the variables. Note
that this does not correspond to allocating the bits/symbol equally among
the variables, since D_i → R(D_i) = (1/2) log(σ_i²/D_i) (more bits are allocated to the
r.v.'s with larger variance).
If instead D/M > σ_i² for some i, it is straightforward to argue that letting those
random variables take a distortion D/M does not make sense. Indeed, when
an admitted distortion D_i = σ_i² is reached for a variable X_i, this means that
we are assigning no bits to that random variable. Therefore, for the random
variables X_i such that the distortion D/M exceeds the value of the variance
σ_i², the best thing to do is to assign to them a distortion D_i = σ_i² and to
reallocate the 'surplus' (D/M − σ_i²) uniformly among the remaining variables
in order to reduce the bits/symbol for them (compensation principle). Then
the optimal distortion distribution is

  D_i = { λ      if λ < σ_i²
        { σ_i²   if λ ≥ σ_i²,        (7.35)

where λ satisfies Σ_{i=1}^{M} D_i = D.
The method described is a kind of reverse water-filling and is graphically

Figure 7.7: Reverse water-filling procedure for independent Gaussian random variables.

illustrated in Figure 7.7.


It is possible to prove that solution (7.35) is the same solution that we would
have found by solving the initial set of KKT conditions in (7.33).
Then, the rate distortion function for M independent Gaussian sources with
overall maximum distortion D is
  R(D) = Σ_{i=1}^{M} (1/2) log(σ_i²/D_i),   (7.36)

with {D_i}_{i=1}^{M} satisfying (7.35).
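A minimal numerical sketch of the allocation (7.35) is given below; it finds the common level λ by bisection for illustrative variances and distortion budget (all names and values are arbitrary choices for the example).

```python
# Sketch of the reverse water-filling allocation (7.35): given the variances
# of M independent Gaussian sources and a total distortion budget D, find the
# common level lambda by bisection and the resulting per-source rates.
import numpy as np

def reverse_water_filling(variances, D_total):
    variances = np.asarray(variances, dtype=float)
    lo, hi = 0.0, variances.max()
    for _ in range(100):                       # bisection on lambda
        lam = 0.5 * (lo + hi)
        Di = np.minimum(lam, variances)        # eq. (7.35)
        if Di.sum() > D_total:
            hi = lam
        else:
            lo = lam
    Di = np.minimum(lo, variances)
    Ri = np.maximum(0.5 * np.log2(variances / Di), 0.0)   # bits per source
    return Di, Ri

# illustrative variances and budget
Di, Ri = reverse_water_filling([4.0, 2.0, 1.0, 0.25], D_total=1.5)
print("distortions:", Di.round(3), " sum =", Di.sum().round(3))
print("rates      :", Ri.round(3), " total R(D) =", Ri.sum().round(3))
```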

Figure 7.8: Quantization scheme: the quantizer Q (lossy) followed by an entropy coder (lossless).

7.2 Lossy Coding

7.2.1 The Encoding procedure in practice

The rate distortion theorem quantifies the trade-off between distortion


and coding rate for lossy coding by means of the rate distortion function
R(D). For any source X and distortion measure D, R(D) gives the minimum
number of bits per symbol required to reconstruct the source with a prescribed
maximum distortion. As already pointed out in Section 7.1.1, we emphasize
that, as it was the case with the source coding and the channel coding the-
orems, the values provided by the rate distortion function are ‘fundamental
limits’: they can be achieved asymptotically and with increasing complexity
of the encoding-decoding scheme (again, Shannon’s scheme cannot be imple-
mented).
In this section we search for the ‘best’ possible quantization procedure, that
is the procedure which allows in practice to get as close as possible to R(D).
Specifically, we search for the set of reconstruction sequences {x̂ni } which sat-
isfies the reconstruction distortion constraint. Figure 7.8 illustrates the idea.
The encoder (Q) observes the source outputs xn ∈ Rn (or X n ) and maps
them into representation sequences of length n, x̂n ∈ X̂ n . The quantization
scheme should work on long blocks of source outputs (vector quantization).
Indeed, similarly to what happened for the lossless source coding, quantizing
together the random variables allows us to reduce the rate even for memoryless
sources (independent r.v.). The presence of the downstream entropy coder
is due to the following reason: using any practical (suboptimum) quanti-
zation scheme, the bit rate at the output of Q does not correspond to the
entropy of the output source, or equivalently the probability distribution of
the encoded/assigned index is probably far from being uniform. Therefore,
we can improve the compression efficiency through lossless coding, thus get-

Figure 7.9: Scalar quantization: arrangement of the reconstruction points in a regular lattice (green crosses); the blue star is an example of reconstruction point for a vectorial scheme.

ting closer to R(D)10 .


The rest of this chapter is devoted to the study of the quantization process.
Without any loss of generality, from now on we will refer to the continuous
source case.

Quantization
Let xⁿ ∈ Rⁿ denote an n-long vector of source outputs. In scalar quantiza-
tion each single source output xi is quantized into a number of levels which
are later encoded into a binary sequence. In vector quantization instead the
entire n-length vector of outputs is seen as a unique symbol to be quantized.
A vector quantization scheme allows a much greater flexibility with respect
to the scalar one, at the price of an increased complexity. Let us see an
example.
Suppose that we want to encode with a rate R. Assume for simplicity
n = 2. Figure 7.9 shows how, through the scalar procedure, the reconstruc-
tion (quantized) levels in the R2 space are constrained to be disposed on a
regular rectangular lattice and the only degrees of freedom are the quantiza-
tion steps along the two axes. In general, we have n2R steps to set. Through
the vector procedure, instead, we directly work in the R2 space and we can
¹⁰ In the rate distortion theorem entropy coding is not necessary. Shannon proves that the distortion jointly typical encoding scheme reaches R(D), which is the minimum. Then, it is as if the reconstructed sequences come out according to a uniform distribution.

put the reconstruction vectors wherever we want (e.g. the blue star). We
have 2nR points to set in general. Nevertheless, because of the lack of reg-
ularity, all the 2nR points must be listed and for any output vector xn all
the 2nR distances must be computed to find the closest reconstruction point,
with a complexity which increases exponentially with n.

7.2.2 Scalar Quantization


Defining a scalar quantizer corresponds to defining the set of quantized or
reconstruction levels x̂1 , x̂2 , ..., x̂m (m = 2R ) and the corresponding decision
regions Ri , i = 1, 2, ..., m (the partitioning of the space R). Since the decision
regions are intervals they are defined by means of the decision boundaries ai ,
i = 1, 2, ..., m.

Uniform quantizer
The uniform quantizer is the simplest type of quantizer. In a uniform
quantizer the spacing between the reconstruction points is constant,
that is the reconstruction levels are spaced evenly. Consequently, the deci-
sion boundaries are also spaced evenly and all the intervals have the same
size except for the outer intervals. Then, a uniform quantizer is completely
defined by the following parameters:
- levels: m = 2R ;
- quantization step: ∆.
Assuming that the source pdf is centered in the origin we have the two quan-
tization schemes depicted in Figure 7.10, depending on the value of m (odd or
even). Through a uniform scheme, once the number of levels and the quan-
tization step are fixed, the reconstruction levels and the decision boundaries
are uniquely defined.
Figure 7.11 illustrates the quantization function Q : R → C (where C is
a countable subset of R). For odd m, the quantizer is called a midthread
quantizer since the axes cross the step of the quantization function in the
middle of the tread. Similarly, when m is even we have a so called midrise
quantizer (crossing occurs in the middle of the rise).

In the sequel, we study the design of a uniform quantizer first for a source
having a uniform distribution, and later for a non uniformly distributed
source.

Figure 7.10: Uniform quantization: reconstruction levels and decision boundaries, (a) for m odd and (b) for m even.

• X uniform in [-A,A].
It is easy to argue that for this type of distribution the uniform quan-
tizer is the most appropriate choice.
Given m (number of reconstruction levels), we want to design the value
of the parameter ∆ which minimizes the distortion D, that is the quan-
tization error/noise. The distribution being confined in the interval
[−A, A], we deduce that Δ = 2A/m.
The distortion (mean square quantization error) has the expression:

  D = E[(X − X̂)²]
    = ∫_R (x − Q(x))² f_X(x) dx
    = Σ_{i=1}^{m} ∫_{R_i} (x − x̂_i)² f_X(x) dx
 (a) = (1/2A) Σ_{i=1}^{m} ∫_{R_i} (x − x̂_i)² dx
    = (m/2A) ∫_{−Δ/2}^{Δ/2} x² dx
    = (1/Δ) ∫_{−Δ/2}^{Δ/2} x² dx = Δ²/12,        (7.37)

where each element i of the sum in (a) is the inertia moment centered
on x̂i .
In this case, it is easy to see that the indexes obtained through the

Figure 7.11: Quantization functions for a uniform quantizer: (a) midthread quantizer (m odd); (b) midrise quantizer (m even).



encoding (output symbols) are uniformly distributed too. In this case


entropy coding is useless.
Remembering that the variance of a source uniformly distributed in
[−A, A] is A2 /3, we can compute the signal to noise ratio as follows:

  SNR = (A²/3)/(Δ²/12) = m² = 2^{2R}.   (7.38)

Then, the scalar uniform quantization of a uniformly distributed source


yields an SN R = 6R dB, with a 6 dB increase for each additional bit11 .
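The following short simulation (a sketch with illustrative parameters, not part of the original notes) confirms the Δ²/12 result and the roughly 6 dB-per-bit behavior for a uniform source.

```python
# Sketch: midrise uniform quantization of X ~ U[-A, A] with m = 2^R levels;
# the simulated MSE should match Delta^2 / 12 and the SNR about 6R dB.
import numpy as np

rng = np.random.default_rng(0)
A, R = 1.0, 3
m = 2 ** R
Delta = 2 * A / m

x = rng.uniform(-A, A, size=200_000)
# midrise quantizer: index of the bin, then its center
idx = np.clip(np.floor(x / Delta), -m // 2, m // 2 - 1)
x_hat = (idx + 0.5) * Delta

D = np.mean((x - x_hat) ** 2)
print("simulated D  :", D)
print("Delta^2 / 12 :", Delta**2 / 12)
print("SNR (dB)     :", 10 * np.log10((A**2 / 3) / D))   # about 6R
```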

• Non uniform X.
In this situation there are some ranges of values in which it is more
probable to have an observation. Then, we would like to increase the
density of the reconstruction levels in the more populated zone of the
distribution and use a sparser allocation in the other regions.
This is not possible by using a uniform quantizer (the only parameter
we can design is the spacing ∆). The question is then how to design
the constant spacing ∆ in order to achieve the minimum reconstruction
error.
Let us suppose m to be even (similar arguments hold if m is odd). We
must compute:
  D = Σ_{i=1}^{m} ∫_{R_i} (x − x̂_i)² f_X(x) dx
    = 2 [ Σ_{i=1}^{m/2−1} ∫_{(i−1)Δ}^{iΔ} ( x − iΔ + Δ/2 )² f_X(x) dx        (granular noise)
        + ∫_{(m/2−1)Δ}^{∞} ( x − (m/2)Δ + Δ/2 )² f_X(x) dx ].               (overload noise)

The problem of finding the best ∆ (minimizing D) must be solved


numerically. However, it is interesting to point out the different con-
tributions given by the two terms. Since Δ must have a finite value,
through the quantization procedure we support only a limited range of
the possible output values. This corresponds to clipping the output
whenever the input exceeds the supported range. The second term of
the sum is the clipping error, known as overload noise. Conversely, the
error made in quantizing the values within the supported range is referred
to as granular noise and is given by the sum in the first term. It is
easy to see that by decreasing Δ the contribution of the granular noise
decreases, while the overload noise increases. Vice versa, if we increase
Δ, the overload noise decreases at the price of a higher granular noise.
The choice of Δ is then a trade-off between these two types of noise,
and designing the quantizer corresponds to finding the proper balance.
Obviously, this balancing will depend on the to-be-quantized distribu-
tion, and specifically on how much weight the tail of the distribution
carries with respect to its central part.

¹¹ We remind that 6 dB is also the maximum growth of the SNR per additional bit that we have for the Gaussian rate distortion function.

Example.
Numerical values for ∆ obtained for a Gaussian distribution with σ 2 =
1:

1. m = 2 → Δ_opt = 1.596, SNR = 4.40 dB;

2. m = 4 → Δ_opt = 0.996, SNR = 9.24 dB;

3. m = 8 → Δ_opt = 0.586, SNR = 14.27 dB.

Note that by increasing R by 1 bit/sample the SNR increases much


less than 6 dB (limit value for the optimum quantizer).
It is possible to show that for a Laplacian distribution with the same m
the value ∆opt is larger. This is expected, being the tail of the Laplacian
distribution much heavier than that of the Gaussian one.
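As a sketch of how Δ_opt can be found in practice (an illustration added here, with a Monte Carlo estimate of the distortion and a simple grid search; all parameters are illustrative), the following fragment reproduces values close to those listed above for a unit-variance Gaussian.

```python
# Sketch: grid search for the step Delta of a midrise uniform quantizer that
# minimizes the MSE for X ~ N(0, 1), estimated by Monte Carlo. For m = 2, 4, 8
# the minimizing Delta should come out close to 1.596, 0.996 and 0.586.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)

def mse_uniform_midrise(x, m, Delta):
    idx = np.clip(np.floor(x / Delta), -m // 2, m // 2 - 1)
    x_hat = (idx + 0.5) * Delta
    return np.mean((x - x_hat) ** 2)

for m in (2, 4, 8):
    grid = np.linspace(0.2, 2.5, 231)
    errors = [mse_uniform_midrise(x, m, d) for d in grid]
    best = grid[int(np.argmin(errors))]
    D = min(errors)
    print(f"m = {m}:  Delta_opt ~ {best:.3f}   SNR ~ {10*np.log10(1/D):.2f} dB")
```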

Note: it’s worth noting that using a uniform quantization for nonuniform
distributions implies that the probabilities of the encoded output symbols
(indexes) are not equiprobable and then in this case a gain can be obtained
by means of entropy coding.

Non uniform quantizer


It is evident that the choice of a uniform quantizer for quantizing a non
uniform source distribution is quite unnatural. To this purpose, it is surely
more suitable to resort to a non uniform quantizer, which permits us to exploit the
local mass concentration of the probability distribution near the origin, where
consequently the input is more likely to fall (see the example illustration in

Figure 7.12: Example of uniform quantization (above) and non-uniform quantization (below) of a relatively peaked distribution.

Figure 7.12).
Again, suppose that the source is centered in the origin. With respect to
the uniform quantizer, a nonuniform quantizer gives the designer much more
freedom. Given the number of reconstruction levels m, the non uniform
quantizer is defined by setting:
- reconstruction levels: x̂i , i = 1, ..., m;
- decision boundaries: ai , i = 0, ..., m (where a0 = −∞ and am = +∞).
Hence, we have 2m − 1 parameters (degrees of freedom) to set in such a way
that the quantization error is minimized. It is easy to guess that, as in the
example in Figure 7.12 (bottom), for non uniform sources, the optimum
decision regions will have in general different sizes.

Max-Lloyd quantizer

In order to design the best nonuniform quantizer we have to search for


the decision boundaries and the reconstruction levels that minimize the mean
squared quantization error. Given the distortion error
  D = Σ_{i=1}^{m} ∫_{R_i} (x − x̂_i)² f_X(x) dx
    = Σ_{i=1}^{m} ∫_{a_{i−1}}^{a_i} (x − x̂_i)² f_X(x) dx,        (7.39)

we have to minimize D by varying (a_i)_{i=1}^{m−1} and (x̂_i)_{i=1}^{m}.
Let us start by deriving the minimum with respect to the decision boundaries
a_j, that is, by computing ∂D/∂a_j = 0 for j = 1, ..., m − 1.

  ∂D/∂a_j = ∂/∂a_j Σ_{i=1}^{m} ∫_{a_{i−1}}^{a_i} (x − x̂_i)² f_X(x) dx
          = ∂/∂a_j ∫_{a_{j−1}}^{a_j} (x − x̂_j)² f_X(x) dx + ∂/∂a_j ∫_{a_j}^{a_{j+1}} (x − x̂_{j+1})² f_X(x) dx.   (7.40)

Exploiting the following general relation:


  ∂/∂k ∫_α^k f(x) dx = f(k),   (7.41)

from (7.40) we get

  ∂D/∂a_j = −(a_j − x̂_{j+1})² f_X(a_j) + (a_j − x̂_j)² f_X(a_j).   (7.42)

Setting equation (7.42) to zero yields:

− (aj − x̂j+1 )2 + (aj − x̂j )2 = 0, (7.43)

where we dropped the multiplicative constant f_X(a_j) (f_X(a_j) ≠ 0).
By exploiting the relation a² − b² = (a + b)(a − b), after easy algebraic

manipulation we get:

(2aj − (x̂j + x̂j+1 ))(x̂j+1 − x̂j ) = 0. (7.44)

Since (x̂_{j+1} − x̂_j) ≠ 0, we must have:

  a_j = (x̂_j + x̂_{j+1})/2,   ∀j.   (7.45)
Then, each decision boundary must be the midpoint of the two neighboring
reconstruction levels. As a consequence, the reconstruction levels will not lie
in the middle of the regions/intervals.
Let us now pass to the derivative with respect to the reconstruction levels. We
have to compute ∂D/∂x̂_j = 0 for j = 1, ..., m.

  ∂D/∂x̂_j = ∂/∂x̂_j Σ_{i=1}^{m} ∫_{a_{i−1}}^{a_i} (x − x̂_i)² f_X(x) dx
           = ∂/∂x̂_j ∫_{a_{j−1}}^{a_j} (x − x̂_j)² f_X(x) dx
           = ∫_{a_{j−1}}^{a_j} ∂/∂x̂_j [ (x − x̂_j)² ] f_X(x) dx.   (7.46)

Equating expression (7.46) to 0 yields


  2 ∫_{a_{j−1}}^{a_j} (x − x̂_j) f_X(x) dx = 0,   (7.47)

and then

  x̂_j = ∫_{a_{j−1}}^{a_j} x f_X(x) dx / ∫_{a_{j−1}}^{a_j} f_X(x) dx,   ∀j = 1, ..., m.   (7.48)

By observing that f_X(x | x ∈ R_j) = f_X(x) / ∫_{a_{j−1}}^{a_j} f_X(x) dx, we have

  x̂_j = ∫_{a_{j−1}}^{a_j} x f_X(x | x ∈ R_j) dx = E[X | X ∈ R_j].   (7.49)

Then, the output point for each quantization interval is the centroid of the
probability density function in that interval.
To sum up, we have found the optimum boundaries (a_i)_{i=1}^{m−1} expressed as
a function of the reconstruction levels (x̂_i)_{i=1}^{m} and, in turn, the optimum

reconstruction levels (x̂_i)_{i=1}^{m} as a function of (a_i)_{i=1}^{m−1}. Therefore, in order to
find the decision boundaries and the reconstruction levels we have to employ
an iterative procedure (Max-Lloyd algorithm): at first we choose the m
reconstruction levels at random (or using some heuristic); then we calculate
the boundaries and update the levels by computing the centroids:

  a_j = (x̂_j + x̂_{j+1})/2,   j = 1, ..., m − 1
  x̂_j = E[X | X ∈ R_j],      j = 1, ..., m.        (7.50)

The iterative procedure converges to a local minimum. However, the conver-


gence to the absolute minimum of D is not guaranteed and depends on the
choice of the initial conditions.
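A minimal sample-based sketch of the iteration (7.50) is reported below (an illustration added to these notes): the conditional means are estimated from a large set of samples of an illustrative unit-variance Gaussian source with m = 4 levels.

```python
# Sketch of the Max-Lloyd iteration (7.50) for a scalar source, using a large
# set of samples to estimate the conditional means E[X | X in R_j]. The source
# is an illustrative unit-variance Gaussian and m = 4 levels are used.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.standard_normal(200_000))
m = 4

levels = np.linspace(-1.5, 1.5, m)                 # initial reconstruction levels
for _ in range(50):
    bounds = 0.5 * (levels[:-1] + levels[1:])      # a_j = (xhat_j + xhat_{j+1}) / 2
    regions = np.searchsorted(bounds, x)           # region index of each sample
    levels = np.array([x[regions == j].mean() for j in range(m)])  # centroids

x_hat = levels[np.searchsorted(bounds, x)]
print("boundaries :", bounds.round(3))             # about [-0.98, 0, 0.98]
print("levels     :", levels.round(3))             # about [-1.51, -0.45, 0.45, 1.51]
print("MSE        :", np.mean((x - x_hat) ** 2))   # about 0.118
```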

Entropy-constrained quantizer

With respect to the uniform quantizers, the nonuniform quantizers allow us


to define smaller step sizes in high probability regions and larger step sizes
in low probability regions. This corresponds to ‘equalize’ the probability
distribution. In this way, the gain achieved by means of the downstream
entropy coder for the nonuniform quantization scheme is much less than that
achieved in the uniform case. In order to understand this point we must stress
that, in all the introduced quantization schemes, we have minimized D for a
given number of reconstruction levels m (cardinality of the output alphabet).
But what about the rate R which is the parameter we are interested in?
Clearly, its value depends on the probability distribution of the output of
the quantizer. When we have a uniform distribution and we quantize it
uniformly, the distribution of the output indexes is uniform and then R =
log m. However, in general, when we deal with nonuniform distributions, the
index distribution at the output of the quantizer is non uniform and then
the effective rate is determined by the downstream entropy coder.
An alternative strategy is to design the quantizer by fixing the rate at the
output of the downstream entropy coder, i.e. the entropy, rather than the
output alphabet size (m).
The entropy of the quantizer output is:
  H(Q) = − Σ_{i=1}^{m} P_i log P_i,   (7.51)

where Pi is the probability that the input falls in the i-th quantization bin12 ,
that is

  P_i = ∫_{a_{i−1}}^{a_i} f_X(x) dx,   (7.52)

where ai−1 and ai are the decision boundaries of the i-th decision region.
Hence for a fixed rate R, the optimum decision boundaries ai and reconstruc-
tion levels x̂i can be obtained through the minimization of the distortion D
subject to the constraint R = H(Q).
Such a quantizer is called Entropy constrained quantizer (ECQ), and is the
best nonuniform scalar quantizer. However the minimization with the addi-
tional constraint on H(Q) is much more complex and must be solved numer-
ically.
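As an illustration of (7.51)-(7.52) (a sketch added here), the following fragment computes H(Q) for a uniform midrise quantizer applied to a unit-variance Gaussian source, using the illustrative values m = 8 and Δ = 0.586 of the previous example; the result is noticeably smaller than log₂ m, which is exactly the margin a downstream entropy coder can exploit.

```python
# Sketch: entropy (7.51) of the index stream produced by a uniform midrise
# quantizer applied to a unit-variance Gaussian; the bin probabilities (7.52)
# are obtained from the Gaussian CDF. Illustrative values m = 8, Delta = 0.586.
import numpy as np
from scipy.stats import norm

m, Delta = 8, 0.586
# decision boundaries: -inf, -(m/2-1)Delta, ..., -Delta, 0, Delta, ..., +inf
inner = Delta * np.arange(-(m // 2 - 1), m // 2)
a = np.concatenate(([-np.inf], inner, [np.inf]))

P = np.diff(norm.cdf(a))                           # P_i of eq. (7.52)
H = -np.sum(P * np.log2(P))
print("bin probabilities:", P.round(3))
print("H(Q) =", round(H, 3), "bits  (vs log2(m) =", np.log2(m), ")")
```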

7.2.3 Vector Quantization


From the proof of the rate distortion theorem we argue that, similarly
to what happened for lossless source coding, vector quantization gives ad-
vantages even when dealing with memoryless sources. However, block-wise
quantization is even more important (actually essential) when we deal with
sources with memory. Below, we provide some examples to give an idea of the
gain which derives by working with vector schemes with respect to scalar
schemes, both for the memoryless and the memory case.
Since now the source outputs are quantized in blocks, the reconstruction lev-
els are vectors in Rn (let n be the length of each block of symbols), namely
x̂n1 , x̂n2 , ..., x̂nm (m = 2nR )13 , and the decision regions Ri , i = 1, 2, ..., m, are
partitions of the Rn space.

Vector quantization vs scalar quantization

Example (two Uniform Independent Sources).


Let us consider the quantization of two memoryless sources X and Y both
having uniform distribution in [−A, A]. To start with, let us suppose R = 2.

▸ Scalar Quantization of the sources.


We can design a uniform quantizer by allocating the reconstruction lev-
¹² i.e. the probability that the output is x̂_i.
¹³ Sometimes, the vector notation x̂⃗_i (instead of x̂ⁿ_i) is used to denote the reconstruction points.

Figure 7.13: Quantization of two Uniform Independent Sources: (a) regular lattice (scalar quantization); (b) triangular tessellation (example of vector quantization).

els and the boundaries as in Figure 7.13(a). For uniform distributions


it is easy to guess that the uniform quantization is also the optimal
solution of the Max-Lloyd quantizer (and the ECQ).
Accordingly, we have

  D = ∫∫_{R²} ||x⃗ − Q(x⃗)||² f_X⃗(x⃗) dx⃗
    = Σ_{i=1}^{m} ∫_{R_i} (1/(4A²)) ||x⃗ − x̂⃗_i||² dx⃗,        (7.53)

where each term i of the sum is the central moment of inertia (c.m.i.)
of the region R_i. Since x̂⃗_i¹⁴ is the central point of each region R_i and
the regions are all the same in shape and dimension, the contribution
of each term of the sum is the same (the c.m.i. is translation invariant).
Then we have

  D = (m/(4A²)) I₂,        (7.54)

where I₂ is the central moment of inertia of a square having area 4A²/16.

▸ Vector quantization
If we use a vector scheme we have much more freedom in the definition
of the quantization regions, being them no more constrained to a rigid
reticular structure as in the scalar case. Dealing with uniform distri-
¹⁴ which here is a couple (x̂_{1i}, x̂_{2i}).

butions, we might think that the possibility of choosing the shape of


the regions does not lead to a gain. However, the vector quantization
gives an advantage in terms of distortion even in such a case. Indeed,
we can suppose to use decision regions of different form (e.g. triangu-
lar regions, as in Figure 7.13(b). Using for instance hexagonal regions
having the same area of the square regions we have that I7 < I2 ,
where I7 is the moment of inertia of the hexagonal region. This simple
choice for the v.q.15 already diminishes the distortion. According to
the behavior of the c.m.i., the distortion gain would be even higher if
we used multi-sided geometrical figure (with the same area) to cover
the space, the minimum of I being attained by the sphere. However, in
all these cases we have a boundary effect due to the fact that it is not
possible to exactly cover a square domain through figures with more
than 4 sides, as shown in the example in Figure 7.14 for the hexagonal
regions. This effect becomes negligible when we have many reconstruc-
tion levels (fine quantization). Furthermore, we point out that by using
geometrical figure with many-sides (> 6) a without-gap coverage of the
internal space is not possible (think to the limit case of the sphere).
So far, we have considered a simple example of quantization of blocks
of symbols of length 2. The gain of the v.q. with respect the s.q. in-
creases when we quantize blocks consisting of more than 2 symbols.
The reason is twofold:
- the gain in terms of the c.m.i. obtained by using n-dimensional hy-
persphere in place of n-dimensional hypercube grows with n;
- the coverage of the space is possible by means of geometrical fig-
ures with a larger number of sides and the boundary effect becomes
negligible.
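The numerical check announced above is sketched here: a Monte Carlo estimate (with unit cell area, chosen only for illustration) of the central moment of inertia of a square and of a regular hexagon of equal area, supporting the claim that I_hex < I₂.

```python
# Sketch: Monte Carlo comparison of the central moment of inertia (per unit
# area) of a square and of a regular hexagon of equal area.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
area = 1.0

# square of area 1, centered at the origin
s = np.sqrt(area)
pts = rng.uniform(-s / 2, s / 2, size=(N, 2))
I_square = np.mean(np.sum(pts ** 2, axis=1))

# regular hexagon of area 1: circumradius Rh, rejection sampling in its
# bounding box, membership tested with the 6 edge half-planes
Rh = np.sqrt(2 * area / (3 * np.sqrt(3)))
apothem = np.sqrt(3) / 2 * Rh
cand = rng.uniform(-Rh, Rh, size=(3 * N, 2))
angles = np.deg2rad(30 + 60 * np.arange(6))
normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
inside = np.all(cand @ normals.T <= apothem + 1e-12, axis=1)
hex_pts = cand[inside]
I_hex = np.mean(np.sum(hex_pts ** 2, axis=1))

print("I_square  ~", round(I_square, 5))   # analytic value: 1/6 ~ 0.1667
print("I_hexagon ~", round(I_hex, 5))      # slightly smaller, ~ 0.160
```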

Note: by considering the limit case, that is when the length of the block n
approaches ∞, the problem faced with in rate distortion can be seen as a
problem of ‘sphere covering’. The sphere covering problem consists in find-
ing the minimum number of spheres through which it is possible to cover a
given space while satisfying a condition on the maximum reconstruction distortion
(maximum radius of the spheres).

In the example above we have considered the case of independent uni-


form sources. It is possible to show that the gain of the v.q. with respect to the
s.q. is higher when we deal with the quantization of independent non-uniform
¹⁵ We use v.q. as short for vector quantization, s.q. for scalar quantization.

Figure 7.14: Hexagonal tessellation of the square domain: boundary effect.

sources, and especially of sources having peaked and tailed distributions. The
reason is that, using the v.q., the freedom in the choice of the reconstruction
levels allows us to better exploit the greater concentration of the probability
distribution in some areas with respect to others, thus reducing the quanti-
zation error.

We now consider an example of source with memory and show that in this
case the gain derived by using the v.q. is really much stronger. We stress that
the rate distortion theorem has been proved for the memoryless case. In the
case of dependent sources the theorem should be rephrased by considering
the entropy rate in place of the entropy H. It is possible to prove that the
theoretic limit value for the rate distortion R(D) is lower than that of the
memoryless case. This is not a surprise: for reconstructing the source with
a given fixed distortion D the number of information bits required is less
if we can exploit the dependence between subsequent outputs (correlation).
Nevertheless, this is possible only by means of vector quantization schemes.

Example (two Dependent Sources).


Let us consider the problem of the quantization of two sources X and Y with
joint distribution

  f_XY = { 1/(ab)   if (X, Y) ∈ A
         { 0        if (X, Y) ∉ A,        (7.55)
where A is the rectangular region of side lengths a and b depicted in Figure
7.15.
The correlation between X and Y makes it necessary to resort to vector
quantization. Indeed, any scalar quantization scheme looks at the marginal

Figure 7.15: Example of two correlated random variables X and Y. The vector quantization (star tessellation, blue) is necessary, since any scalar scheme (cross tessellation, green) leads to an unsatisfactory distribution of the reconstruction points.

distributions fX and fY separately. Clearly, in this way, it places many


reconstruction levels in regions in which the input couples never falls, causing
a noticeable waste of resources. The situation is illustrated in Figure 7.15.
Assuming for simplicity that the marginal distributions are not far from being
uniform (a << b), or equivalently a fine quantization (high m), the optimum
scalar solution for the quantization is to divide the space in m equally sized
square regions. The distortion introduced is again
m
D= I2 , (7.56)
4A2
where now I2 is the c.m.i. of a square with area ( a+b
√ )2 · 1 .
2 m
Through a vector scheme, instead, by exploiting the correlation between the
sources, we can place the decision regions only in A (see figure 7.15). In this
way, for the same number of reconstruction levels, the area of each square
region would be ab N
, which is much less than before. Consequently, the dis-
tortion is given by the c.m.i. I2 of a smaller square and then has a lower
value.
This example shows that, for the case of source with memory, the gain
achieved by the v.q. is more significant, since it derives from the reduc-
tion of the size of the regions and not only from the choice of a more suitable
shape, as for the memoryless sources case.

The Linde-Buzo-Gray (LBG) algorithm

The Linde-Buzo-Gray algorithm is the generalization of the optimum Max


Lloyd quantizer to vector quantization. The source output is grouped into n-
length blocks and each resulting n dimensional vector is quantized as a unique
symbol. In this way, the decision regions Ri can no longer be described as
easily as in the case of the scalar quantization (where the regions were simply
intervals!). However, in hindsight, the M-L quantizer is nothing else than a
minimum distance quantizer 16 ; then the update of the iterative algorithm
can be generalized as follows:
  R_i = { x⃗ : ||x⃗ − x̂⃗_i|| < ||x⃗ − x̂⃗_j|| ∀j ≠ i }
  x̂⃗_i = E[X⃗ | R_i]   ∀i.        (7.57)

The system defines the updating of the decision regions and reconstruction
levels for the LBG algorithm.
¹⁶ This is the meaning of placing the boundary points at the midpoints between the
reconstruction levels.

The main practical drawbacks of the LBG quantization are:


1. the evaluation of the expected value requires the computation of an n-
dimensional integral. Besides, it requires the knowledge of the joint
distribution fX~ (~x), which then should be estimated!
2. The convergence of the algorithm (in time and in space) strongly de-
pends on the choice of the initial conditions, i.e. the initial placement
of the quantization levels.
Clustering by K-means

For solving the problem of the estimation of fX~ (~x), Linde, Buzo and
Gray propose to design the vector quantizer by using a clustering procedure,
specifically the K-means algorithm.
Given a large training set of output vectors ~xi from the source and an initial
set of k reconstruction vectors ~xˆj , j = 1, ..., k, we can partition the points
(~xi ) instead of splitting the space Rn . We define each cluster Cj as follows:

  C_j = { x⃗_i : ||x⃗_i − x̂⃗_j|| ≤ ||x⃗_i − x̂⃗_w||, ∀w ≠ j }.   (7.58)

In this way, once the clusters are defined, we can update the reconstruction
vectors by simply evaluating the mean value of the points inside each cluster
(without having to compute any integral!). Then, the new levels are
  x̂⃗_j = (1/|C_j|) Σ_{i: x⃗_i ∈ C_j} x⃗_i,   ∀j = 1, ..., k.   (7.59)

At each step, a distortion contribution Dj is associated to cluster j,

  D_j = (1/|C_j|) Σ_{i: x⃗_i ∈ C_j} ||x⃗_i − x̂⃗_j||,   ∀j.   (7.60)

By iterating the procedure of cluster’s definition, (7.58), and update of the


centroids, (7.59), the algorithm converges to a local optimum. However, the
solution found and the convergence speed still depend strongly on the initial
distribution of the reconstruction vectors.
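A minimal sketch of the LBG/K-means design (7.58)-(7.59) is given below for a two-dimensional correlated Gaussian source; the training-set size, covariance matrix and codebook size are illustrative choices, not prescriptions.

```python
# Sketch of the LBG design via K-means (7.58)-(7.59): a 2-D vector quantizer
# with k = 2^(nR) codevectors trained on samples from a correlated Gaussian
# source (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)

# training set: pairs of strongly correlated Gaussian samples
n_train = 50_000
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=n_train)

k = 16                                                 # 2^(nR) with n = 2, R = 2
codebook = X[rng.choice(n_train, k, replace=False)]    # random initial codevectors

for _ in range(30):
    # cluster assignment (7.58): nearest codevector
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # centroid update (7.59)
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            codebook[j] = members.mean(axis=0)

D = d2[np.arange(n_train), labels].mean() / 2          # per-sample MSE (n = 2)
print("per-sample distortion:", round(D, 4))
```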

7.2.4 Avoiding VQ: the decorrelation procedure


We have seen that the vector quantization procedure is necessary to approach
the R(D) curve, but is computationally heavy. Even using the LBG algo-
rithm for the design of the quantizer, the real problem is that the number of

Figure 7.16: Avoiding vector quantization: the input sequence xⁿ is transformed by the decorrelator (e.g. transform coding, predictive coding) into a sequence yⁿ which has a lower correlation and is thus better suited to scalar quantization (Q) followed by entropy coding.

reconstruction levels (centroids) we have to determine and store for a given


rate is high, (2nR ), and above all grows exponentially with n. Besides, the
quantization procedure requires to evaluate for each input 2nR distances in
order to determine the closest ‘centroid’.
Then, for the memoryless case, in which the gain derived from the use of
the v.q. is of minor importance, we are often content with the scalar quan-
tization. Differently, for the case of sources with memory, the gain of the
v.q. is significant and then we have to resort to clever tricks. The idea is
to act on the output of the source X ~ at the purpose of eliminating the de-
pendence between subsequent symbols. The block at the beginning of the
chain in Figure 7.16 implements a transform based coder or a predicted coder
for decorrelating the outputs of the source. In this way, at the output of the
decorrelator the scalar quantization can be applied without significant loss
of performance.

Transform-based Coding

We search for a transformation A such that the new source Y⃗ = AX⃗
shows no or little dependence between the variables.
We suppose for simplicity that X⃗ is a Gaussian vector with zero mean. Then,
the multivariate pdf is

  f_X⃗(x⃗) = 1/√((2π)ⁿ |C_x|) · exp( −(1/2) x⃗ᵀ C_x⁻¹ x⃗ ),   (7.61)

where C_x = E[X⃗ · X⃗ᵀ] is the covariance matrix (C_x = {C_ij}_{i,j=1}^{n} with
C_ij = E[X_i X_j]). Being C_x a symmetric and positive definite matrix, it
admits an inverse matrix (C_x⁻¹).
Let us evaluate the behavior of the random vector after the transformation
A is applied, i.e. the distribution of the output source Y~ . Being the transfor-
mation linear, we know that fY~ (~y ) is still a multivariate Gaussian pdf with

μ⃗_y = 0. The covariance matrix is

  C_y = E[Y⃗ · Y⃗ᵀ] = E[AX⃗ · (AX⃗)ᵀ] = E[AX⃗ · X⃗ᵀAᵀ] = A E[X⃗ · X⃗ᵀ] Aᵀ = A C_x Aᵀ.   (7.62)

Since Cx is positive definite, we can choose A in such a way that Cy is a


diagonal matrix (Yi independent r.v.). Being A the matrix which diagonalizes
C_x (diagonalization matrix), the rows of A are formed by the eigenvectors
of Cx . Besides, since Cx is symmetric, there exists an orthonormal basis
formed by the eigenvectors of Cx . Then, A is an orthonormal matrix (AT =
A−1 ). As a consequence, the transformation does not change the entropy of
the source and then does not lead to any loss of information. Indeed, the
entropy of a Gaussian random vector Y~ (h(Y~ )) depends on Y~ only through
the determinant of Cy 17 , and we have

|Cy | = |ACx AT | = |A||Cx ||A−1 | = |Cx |. (7.63)

Then, it follows that h(Y⃗) = h(X⃗).
In this way, we can work on the source Y⃗ obtained at the output of the transform
block and only later go back to X⃗ (X⃗ = A⁻¹Y⃗) without any loss.
Since X⃗ is a source with memory, we know that h(X⃗) < Σ_i h(X_i) (remember
that if the length of the vector n tends to infinity, h(X⃗) → nH(X)). On the
contrary, the entropy of the decorrelated random vector Y~ which results from
the diagonalization procedure can be computed as the sum of the entropies
of the single r.v. Yi . Indeed, from the resulting diagonal covariance matrix
 2 
σy1 0 . . . . . . 0
. .. 
 0 σy21 . . . 

 . . ... ... .  .
Cy =  .
. . . . . (7.64)
 . .. ..
 ..

. . 0 
0 . . . . . . 0 σy2n
Qn
it follows that |Cy | = i=1 σi2 , and then
n n
1 X1
h(Y~ ) = log((2πe)n |Cy |) =
X
log(2πeσi2 ) = h(Yi ). (7.65)
2 i=1
2 i=1
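The diagonalization argument can be checked numerically with a minimal sketch,
assuming a synthetic first-order Gauss-Markov covariance (the correlation value,
vector length and sample size are arbitrary choices, not taken from the notes):
the rows of A are the eigenvectors of $C_x$, the empirical covariance of
$\vec{Y}$ comes out (approximately) diagonal, and $|C_y| \approx |C_x|$ as in
(7.63).

```python
# Decorrelation by diagonalizing Cx (the KLT idea); all parameters illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, N, rho = 4, 100_000, 0.9

# AR(1)-like covariance: Cx[i, j] = rho^|i - j|
Cx = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Zero-mean Gaussian sample vectors with covariance Cx
X = rng.multivariate_normal(np.zeros(n), Cx, size=N)

# Rows of A = orthonormal eigenvectors of Cx, so A Cx A^T is diagonal
eigvals, eigvecs = np.linalg.eigh(Cx)
A = eigvecs.T
Y = X @ A.T                              # y = A x, applied row-wise

Cy = np.cov(Y, rowvar=False)
print(np.round(Cy, 3))                               # off-diagonal entries ~ 0
print(np.linalg.det(Cx), np.linalg.det(Cy))          # determinants (hence h) ~ equal
```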

Figure 7.17: Variability of the couple of random variables before and after the
decorrelation takes place. (a) Location of the points before the decorrelation:
if we apply scalar quantization directly to this source we ignore the
correlation between the variables (large variances $\sigma_1^2$, $\sigma_2^2$).
(b) Location of the points after the decorrelator: the variance of the random
variables is greatly reduced, as well as their correlation (which is
approximately zero).

At this point, we can apply scalar quantization to the resulting source
$\vec{Y}$, incurring only the small loss in gain we have in the memoryless case
(discussed in Section 7.2.3).
Figure 7.17 illustrates the effect of the transformation for the case of two
dependent Gaussian random variables $X_1$ and $X_2$. The point cloud in
Figure 7.17(a) describes the density of the vectors $\vec{X} = (X_1, X_2)$,
according to the bivariate Gaussian distribution. Looking at each variable
separately, the ranges of variability are large ($\sigma_{x_1}^2$ and
$\sigma_{x_2}^2$ are large). Therefore, as discussed in the previous sections,
directly applying scalar quantization to this source implies a waste of
resources, yielding a high value of the rate distortion function R(D)^{18}.
Applying the transformation A to $\vec{X}$ corresponds to rotating the axes in
such a way that the variability ranges of the variables are reduced
(minimized), see Figure 7.17(b). It is already evident here (for n = 2) that
decorrelating the variables corresponds to compacting the energy^{19}. At the
output of the transformation block we have independent Gaussian variables and
then $R(D) = \sum_{i=1}^{n} R(D_i)$. At this point, we know that the best way to
distribute the distortion between the variables is given by the Reverse Water
Filling procedure, which allocates the bits to the random variables depending
on their variance; a numerical sketch of this allocation is given below.
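The following is only an illustrative implementation of Reverse Water Filling
for independent Gaussian components under squared-error distortion, using a
bisection search for the water level; the function name and the example
variances are my own choices, not part of the notes.

```python
# Reverse Water Filling sketch for independent Gaussian components.
import numpy as np

def reverse_water_filling(variances, D_total):
    """Return per-component distortions D_i = min(lam, sigma_i^2) and rates
    R_i = 0.5*log2(sigma_i^2 / D_i), with sum(D_i) ~ D_total."""
    variances = np.asarray(variances, dtype=float)
    lo, hi = 0.0, variances.max()
    for _ in range(100):                       # bisection on the water level lam
        lam = 0.5 * (lo + hi)
        if np.minimum(lam, variances).sum() > D_total:
            hi = lam
        else:
            lo = lam
    D = np.minimum(lam, variances)
    R = 0.5 * np.log2(variances / D)
    return D, R

sigma2 = np.array([9.0, 4.0, 1.0, 0.25])       # component variances (illustrative)
D, R = reverse_water_filling(sigma2, D_total=2.0)
print(D, R)   # low-variance components get few (possibly zero) bits
```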

The Discrete Cosine Transform (DCT).

Given a source $\vec{X}$, the optimum transform for decorrelating the source
samples is the Karhunen-Loève Transform (KLT). The KLT is precisely the matrix
A which diagonalizes the covariance matrix $C_x$. However, this implies that
the KLT depends on the statistical properties of the source $\vec{X}$, and the
covariance matrix $C_x$ must be estimated in order to compute the KLT.
Furthermore, for non-stationary sources the estimation procedure must be
periodically repeated.
As a consequence, computing the KLT is computationally very expensive. In
practice, we need transforms that (although suboptimal) do not depend on the
statistical properties of the data, but have a fixed analytical expression.
^{17} From the analysis of the previous chapter we recall that $h(\vec{Y}) = \frac{1}{2}\log\big((2\pi e)^n |C_y|\big)$.
^{18} Remember that for a Gaussian r.v. $R(D) = \frac{1}{2}\log\frac{\sigma_x^2}{D}$.
^{19} The overall energy is preserved by the transformation.

The DCT (Discrete Cosine Transform) is one of the most popular transforms
employed in place of the KLT. For highly correlated Gaussian Markov sources, it
has been theoretically shown that the DCT behaves like the KLT (KLT ≈ DCT).
Then, in image compression applications^{20}, the DCT has the property of
decorrelating the variables, even if, being suboptimal, some correlation
remains among the transformed coefficients. This is the reason why many
compression schemes (e.g. JPEG) work in the frequency domain: since the DCT
coefficients are almost decorrelated, scalar quantization can be applied with a
negligible loss in gain. The DCT also compacts the energy into a small number
of coefficients: the variance of the DCT coefficients is large at low
frequencies and decreases at high frequencies. In this way, due to the low
variability of the high-frequency coefficients, through the Reverse Water
Filling procedure we can allocate 0 bits to them, that is, discard them,
introducing only a very small distortion.
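As a rough illustration of this energy-compaction property (not a result taken
from the notes), the following sketch generates a highly correlated AR(1)
Gaussian sequence, splits it into blocks and applies the orthonormal DCT-II. It
assumes SciPy's scipy.fft.dct is available; the block length and correlation
value are arbitrary choices.

```python
# Energy compaction of the DCT on a correlated (AR(1)) Gaussian source.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)
rho, n, N = 0.95, 8, 20_000            # correlation, block length, number of blocks

# Generate a long, roughly unit-variance AR(1) sequence and split it into blocks
z = rng.standard_normal(n * N)
x = np.empty_like(z)
x[0] = z[0]
for i in range(1, len(z)):
    x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * z[i]
blocks = x.reshape(N, n)

# Orthonormal DCT-II of each block (applied along the last axis)
coeffs = dct(blocks, type=2, norm='ortho', axis=-1)

print(np.round(blocks.var(axis=0), 2))   # roughly equal variances in the signal domain
print(np.round(coeffs.var(axis=0), 2))   # variance concentrated in the first coefficients
```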

Predictive Coding

When the elements of the vector $\vec{X}$ are highly correlated (large amount
of memory among the components), consecutive symbols will have similar values.
Then, a possible approach to perform decorrelation is by means of 'prediction'.
Using the output at time instant n − 1, i.e. $X_{n-1}$, as the prediction for
the output at the subsequent time instant n (Zero Order Prediction), we can
transmit only the 'novel' information brought by the output $X_n$, that is:

$$D_n = X_n - X_{n-1}. \qquad (7.66)$$

In this way, the quantizer Q works on the symbols $d_n$ ($d_n = x_n - x_{n-1}$)^{21}.

Quantizing $D_n$ instead of $X_n$ has many advantages, which derive from the
following properties:

• $\sigma_d^2 \ll \sigma_x^2$;
  remember that the variance is the parameter which determines the number of
  bits we have to spend for encoding with a prescribed maximum distortion;
  specifically, a lower $\sigma^2$ corresponds to a lower rate distortion
  function R(D).

^{20} The source of an image is approximately a Markov source with high enough correlation.
^{21} The symbols $d_n$ can be seen as the output (at time n) of a new source $\vec{D}$ with reduced memory.

Proof.

$$E[D_n^2] = E[(X_n - X_{n-1})^2] = \sigma_x^2 + \sigma_x^2 - 2E[X_n X_{n-1}] \overset{(a)}{=} 2\sigma_x^2 - 2\rho\sigma_x^2 = 2\sigma_x^2(1 - \rho), \qquad (7.67)$$

where (a) follows from the definition of the correlation coefficient ρ, which
for a couple of zero-mean r.v.'s X and Y has the expression

$$\rho = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[XY]}{\sigma_X \sigma_Y}. \qquad (7.68)$$

Due to the high correlation between $X_n$ and $X_{n-1}$, ρ is close to 1, and
then from (7.67) it follows that $\sigma_d^2 \ll \sigma_x^2$ (a numerical check
is sketched right after this list).

• $\rho_d \ll \rho$;
  the correlation (memory) between the new variables $D_n$ is less than the
  correlation among the original source outputs $X_n$ (as an example, if the
  source is a first-order Gauss-Markov process with ρ close to 1, the symbols
  $d_n$ obtained according to (7.66) are almost completely decorrelated).
  Then, working on the symbols $d_n$ (instead of on $x_n$), the loss incurred
  by using scalar quantization is much smaller.
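A quick empirical check of these two properties, assuming a unit-variance
first-order Gauss-Markov (AR(1)) source with ρ = 0.95 (an arbitrary choice):
the variance of the differences comes out close to $2\sigma_x^2(1-\rho)$ and
their lag-one correlation is much weaker than that of the source itself.

```python
# Empirical check of sigma_d^2 ~ 2*sigma_x^2*(1 - rho) and rho_d << rho.
import numpy as np

rng = np.random.default_rng(2)
rho, N = 0.95, 200_000

# Unit-variance AR(1) source
x = np.empty(N)
x[0] = rng.standard_normal()
for i in range(1, N):
    x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

d = np.diff(x)                            # d_n = x_n - x_{n-1}, as in (7.66)

def lag1_corr(s):
    return np.corrcoef(s[:-1], s[1:])[0, 1]

print(x.var(), d.var(), 2 * (1 - rho))    # d.var() close to 2*(1 - rho)
print(lag1_corr(x), lag1_corr(d))         # ~0.95 versus a value close to 0
```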

However, it must be pointed out that the impact of the quantization of the
symbols $d_n$ on the original symbols $x_n$ is different from the case in which
we directly quantize $x_n$. In detail, the problem that arises when the
differences are taken as in (7.66) is described in the following.

Coding/decoding scheme (Open loop)


At the first step the encoder transmits the first symbol $x_1$. Then, at the
second step:
⇒ Transmitter side:

$$d_2 = x_2 - x_1 \;\xrightarrow{\;Q\;}\; \hat{d}_2 = d_2 + q_2, \qquad (7.69)$$

where $q_i$ denotes the quantization error at step i.

⇒ Receiver side: knowing $x_1$ and having received $\hat{d}_2$,

$$\hat{x}_2 = x_1 + \hat{d}_2 = x_1 + d_2 + q_2 = x_2 + q_2. \qquad (7.70)$$


At the third step:
⇒ Transmitter side:

$$d_3 = x_3 - x_2 \;\xrightarrow{\;Q\;}\; \hat{d}_3 = d_3 + q_3. \qquad (7.71)$$

⇒ Receiver side: knowing only the quantized version of $x_2$,

$$\hat{x}_3 = \hat{x}_2 + \hat{d}_3 = x_2 + q_2 + q_3. \qquad (7.72)$$

Proceeding in this way, the n-th decoded symbol is

$$\hat{x}_n = x_n + \sum_{i=2}^{n} q_i. \qquad (7.73)$$

It is evident that this encoding scheme leads to a propagation of the error (or
drift) at the receiver.
Then, to avoid error propagation at the receiver, the encoder must employ
closed-loop encoding, computing the differences with respect to the quantized
value at the previous step, i.e. $D_n = X_n - \hat{X}_{n-1}$. In the following,
we analyze this scheme in detail.

Coding/Decoding scheme (Closed Loop)


In order to avoid the propagation of the decoding error, the coder must base
the prediction on the quantized values of the source symbols instead of the
original ones. In fact, defining the differences with respect to the quantized
values, that is $d_n = x_n - \hat{x}_{n-1}$, at the third step we would have:
⇒ Transmitter side:

$$d_3 = x_3 - \hat{x}_2 \;\xrightarrow{\;Q\;}\; \hat{d}_3 = d_3 + q_3. \qquad (7.74)$$

⇒ Receiver side:

$$\hat{x}_3 = \hat{x}_2 + \hat{d}_3 = \hat{x}_2 + d_3 + q_3 = x_3 + q_3. \qquad (7.75)$$

The n-th decoded symbol is now

$$\hat{x}_n = x_n + q_n, \qquad (7.76)$$

thus avoiding the accumulation of the quantization errors. Figure 7.18
illustrates the closed loop encoding scheme and the corresponding decoding
procedure.

[Figure: encoder: $x_n$ → (+) → $d_n$ → Q → $\hat{d}_n$ (to the entropy coder),
with $\hat{x}_{n-1}$ fed back through a delay $Z^{-1}$; decoder: $\hat{d}_n$ →
(+) → $\hat{x}_n$, with $\hat{x}_{n-1}$ obtained through a delay $Z^{-1}$.]

Figure 7.18: Predictive coding scheme. Closed loop encoder (on the left) and
decoder (on the right).

Note: we have considered the simplest type of predictor, i.e. the Zero Order
Predictor. More efficient prediction schemes can be obtained by using a linear
combination of a certain number of past source outputs (FIR filter).
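To see the difference between the two schemes numerically, here is a minimal
sketch (my own construction, with a uniform scalar quantizer of arbitrary step
and an AR(1) test signal) comparing the open-loop predictor, whose
reconstruction error accumulates as in (7.73), with the closed-loop scheme of
Figure 7.18, whose error stays bounded as in (7.76).

```python
# Open-loop vs closed-loop zero-order prediction (DPCM); parameters illustrative.
import numpy as np

rng = np.random.default_rng(3)
step = 0.1

def quantize(v):
    return step * np.round(v / step)      # uniform scalar quantizer

# Correlated test signal: unit-variance AR(1) with rho = 0.95
rho, N = 0.95, 5_000
x = np.empty(N)
x[0] = rng.standard_normal()
for i in range(1, N):
    x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

# Open loop: d_n = x_n - x_{n-1}; the decoder can only add quantized differences
x_open = np.empty(N); x_open[0] = x[0]
for n in range(1, N):
    x_open[n] = x_open[n - 1] + quantize(x[n] - x[n - 1])

# Closed loop: d_n = x_n - xhat_{n-1}; encoder and decoder share the same predictor
x_closed = np.empty(N); x_closed[0] = x[0]
for n in range(1, N):
    x_closed[n] = x_closed[n - 1] + quantize(x[n] - x_closed[n - 1])

print(np.mean((x - x_open) ** 2))     # grows with n: quantization errors add up
print(np.mean((x - x_closed) ** 2))   # remains of the order of a single q_n
```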
