Università degli Studi di Siena
Facoltà di Ingegneria
Lecture notes on
Information Theory
and Coding
Mauro Barni
Benedetta Tondi
2012
Contents
1 Measuring Information
  1.1 Modeling of an Information Source
  1.2 Axiomatic definition of Entropy
  1.3 Property of the Entropy
Chapter 1
Measuring Information
Intuitively, a piece of news carries more information if it is less probable. For instance, the news that a football match between Barcelona and Siena has been won by the Siena team carries much more information than the opposite outcome.
Shannon’s intuition suggests that information is related to randomness. As
a consequence, information sources can be modeled by random processes,
whose statistical properties depend on the nature of the information sources
themselves. A discrete time information source X can then be mathemati-
cally modeled by a discrete-time random process {Xi }. The alphabet X over
which the random variables Xi are defined can be either discrete (|X | < ∞)
or continuous when X corresponds to R or a subset of R (|X | = ∞). The
simplest model for describing an information source is the discrete memo-
ryless source (DMS) model. In a DMS all the variables Xi are generated
independently and according to the same distribution, i.i.d.. In this case,
it is possible to represent a memoryless source through a unique random
variable X.
In addition, the function Hn (p1 , p2 , ..., pn ) should have several intuitive prop-
erties. It is possible to formulate these properties as axioms from which we
will deduce the specific form of the H function.
The four fundamental axioms are:
A.3 (Permutation-invariance)
The above axioms reduce the number of possible candidate functions for H. Moreover, it is possible to prove that they suffice to determine a unique function, as the next theorem asserts.
Let us extend the list of axioms by including another property, which we will use to prove the theorem. We point out, however, that this is not a real axiom, since it can be deduced from the others; we introduce it only to ease the proof.
Before stating and proving the theorem, we introduce some useful notation. We define A(n) = H(1/n, 1/n, ..., 1/n), and let h(p) denote the entropy of the binary source, H_2(p, 1 − p).
1) By considering A.4 together with A.3 we deduce that we can group any
two symbols, not only the first and the second one. We now want to extend
the grouping property to a number k of symbols. We have:
$$
\begin{aligned}
H_n(p_1, p_2, \dots, p_n) &= H_{n-1}(s_2, p_3, \dots, p_n) + s_2\, h\!\left(\frac{p_2}{s_2}\right) \\
&= H_{n-2}(s_3, p_4, \dots, p_n) + s_3\, h\!\left(\frac{p_3}{s_3}\right) + s_2\, h\!\left(\frac{p_2}{s_2}\right) = \dots \\
&= H_{n-k+1}(s_k, p_{k+1}, \dots, p_n) + \sum_{i=2}^{k} s_i\, h\!\left(\frac{p_i}{s_i}\right). 
\end{aligned} \tag{1.6}
$$
2) Let us consider two integer values n and m and the function A(n · m). If
we apply m times the extended grouping property we have just found (Point
1), each time to n elements in A(n · m), we obtain:
$$
\begin{aligned}
A(n \cdot m) &= H_{nm}\!\left(\frac{1}{nm}, \dots, \frac{1}{nm}\right) \\
&= H_{nm-n+1}\!\left(\frac{1}{m}, \frac{1}{nm}, \dots, \frac{1}{nm}\right) + \frac{1}{m}\, H_n\!\left(\frac{1}{n}, \dots, \frac{1}{n}\right) \\
&\overset{(a)}{=} H_{nm-2n+2}\!\left(\frac{1}{m}, \frac{1}{m}, \frac{1}{nm}, \dots, \frac{1}{nm}\right) + \frac{2}{m}\, A(n) = \dots \\
&= H_{m}\!\left(\frac{1}{m}, \dots, \frac{1}{m}\right) + A(n) \\
&= A(m) + A(n), 
\end{aligned} \tag{1.9}
$$
Property. The unique function which satisfies property (1.10) over all the
integer values is the logarithm function. Then A(n) = log(n).
4) We are now able to show that the expression of the entropy in (1.5) holds for the binary case. Let us consider a binary source with p = r/s for some positive integers r and s (obviously, r ≤ s).
$$
\begin{aligned}
A(s) = \log(s) &= H_s\!\left(\frac{1}{s}, \dots, \frac{1}{s}\right) \\
&= H_{s-r+1}\!\left(\frac{r}{s}, \frac{1}{s}, \dots, \frac{1}{s}\right) + \frac{r}{s}\, H_r\!\left(\frac{1}{r}, \dots, \frac{1}{r}\right) \\
&= H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) + \frac{s-r}{s}\, H_{s-r}\!\left(\frac{1}{s-r}, \dots, \frac{1}{s-r}\right) + \frac{r}{s}\, A(r).
\end{aligned} \tag{1.14}
$$
$$
h\!\left(\frac{r}{s}\right) = -\frac{r}{s}\log\frac{r}{s} - \frac{s-r}{s}\log\frac{s-r}{s}, \tag{1.17}
$$
and then
$$
h(p) = -(1-p)\log(1-p) - p\log(p) = -\sum_{i=1}^{2} p_i \log p_i. \tag{1.18}
$$
1 Remember that A(·) is a monotonic function of its argument.
5) As a last step, we extend the validity of the expression (1.5) to any value
n. The proof is given by induction exploiting the relation for n = 2, already
proved. Let us consider a generic value n and suppose that for n − 1 the
expression holds. Then,
$$
H_{n-1}(p_1, \dots, p_{n-1}) = -\sum_{i=1}^{n-1} p_i \log p_i. \tag{1.19}
$$
$$
\begin{aligned}
H_n(p_1, \dots, p_n) &= H_{n-1}(p_1+p_2, p_3, \dots, p_n) + (p_1+p_2)\, h\!\left(\frac{p_1}{p_1+p_2}\right) \\
&= -\sum_{i=3}^{n} p_i \log p_i - (p_1+p_2)\log(p_1+p_2) \\
&\quad - (p_1+p_2)\frac{p_1}{p_1+p_2}\log\frac{p_1}{p_1+p_2} - (p_1+p_2)\frac{p_2}{p_1+p_2}\log\frac{p_2}{p_1+p_2} \\
&= -\sum_{i=3}^{n} p_i \log p_i - p_1\log p_1 - p_2\log p_2 \\
&= -\sum_{i=1}^{n} p_i \log p_i. 
\end{aligned} \tag{1.20}
$$
The name entropy assigned to the quantity in (1.5) recalls the homonymous quantity defined in physics. Roughly speaking, from a microscopic point of view, Boltzmann defined the entropy S as the logarithm of the number of microstates Ω(E)² accessible to the system.
For a discrete source X with pmf p(x)³, the entropy takes the form
$$
H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x), \tag{1.21}
$$
where log is the base-2 logarithm. In (1.21) we use the convention that 0 log 0 = 0, which can be easily justified through de l'Hôpital's rule. This is in agreement with the fact that adding zero-probability terms does not change the value of the entropy.
2 Ω(E) indicates the number of microstates having an energy equal to E.
3 For convenience, we denote pmfs by p(x) rather than by pX(x).
1.3 Property of the Entropy
Property. For a discrete source X with alphabet X, H(X) ≤ log2 |X|. The proof is conveniently carried out by passing to the natural logarithm⁴. We have:
$$
\begin{aligned}
\log_2|\mathcal{X}| - H(X) &= \log_2|\mathcal{X}| + \sum_{x} p(x)\log_2 p(x) \\
&= \sum_{x} p(x)\left[\log_2|\mathcal{X}| + \log_2 p(x)\right] \\
&= \log_2 e \cdot \sum_{x} p(x)\left[\ln|\mathcal{X}| + \ln p(x)\right] \\
&= \log_2 e \cdot \sum_{x} p(x)\ln(|\mathcal{X}|\,p(x)) \\
&\overset{(a)}{\ge} \log_2 e \cdot \sum_{x} p(x)\left(1 - \frac{1}{|\mathcal{X}|\,p(x)}\right) \\
&= \log_2 e \cdot \left(\sum_{x} p(x) - \sum_{x}\frac{1}{|\mathcal{X}|}\right) = 0. 
\end{aligned} \tag{1.23}
$$
Hence,
$$
\log_2|\mathcal{X}| \ge H(X), \tag{1.24}
$$
where the equality holds if and only if p(x) = 1/|X| for all x (in which case (a) holds with equality).
From the above property, we see that the uniform distribution is the one that gives rise to the maximum entropy of an information source. This fact provides new hints about the correspondence between information theory and statistical mechanics: in a physical system, the condition of equally likely microstates is the configuration associated with the maximum possible disorder of the system, and hence with its maximum entropy.
4 Remember the relation log2 z = log2 e · loge z, which holds for logarithms with different bases and will be useful in the following.
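As a quick numerical illustration of the bound in (1.24) (this sketch is not part of the original notes), the following Python snippet computes the entropy of a few distributions over an alphabet of size 4 and checks that only the uniform one attains log2|X|. The function name `entropy` and the example pmfs are arbitrary choices.

```python
import math

def entropy(pmf):
    """Base-2 entropy of a pmf, using the convention 0*log(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

alphabet_size = 4
uniform = [1 / alphabet_size] * alphabet_size
skewed = [0.7, 0.1, 0.1, 0.1]
degenerate = [1.0, 0.0, 0.0, 0.0]

for name, pmf in [("uniform", uniform), ("skewed", skewed), ("degenerate", degenerate)]:
    print(f"{name:10s}  H = {entropy(pmf):.4f}  (log2|X| = {math.log2(alphabet_size):.4f})")
# The uniform pmf reaches the bound log2|X| = 2 bits; the others stay strictly below it.
```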
Chapter 2
Joint Entropy, Relative Entropy and Mutual Information
The joint entropy can also be seen as the entropy of the vector random
variable Z = (X, Y ) whose alphabet is the cartesian product X × Y.
Proof. We exploit the relation between the pmfs of independent sources, i.e. p(x, y) = p(x)p(y), and proceed with some simple algebra.
$$
\begin{aligned}
H(X,Y) &= -\sum_{x,y} p(x,y)\log p(x,y) \\
&= -\sum_{x,y} p(x,y)\log p(x)p(y) \\
&= -\sum_{x}\sum_{y} p(x,y)\log p(x) - \sum_{x}\sum_{y} p(x,y)\log p(y) \\
&= -\sum_{x} p(x)\log p(x)\sum_{y} p(y|x) - \sum_{y} p(y)\log p(y)\sum_{x} p(x|y) \\
&= H(X) + H(Y).
\end{aligned} \tag{2.3}
$$
The entropy of X given the occurrence Y = y is defined from the conditional distribution p(x|y)¹ as
$$
H(X|Y=y) = -\sum_{x\in\mathcal{X}} p(x|y)\log p(x|y). \tag{2.6}
$$
Proof. Suggestion: exploit the relation p(x|y) = p(x) which holds for inde-
pendent sources.
• (Chain Rule)
$$
H(X,Y) \overset{(a)}{=} H(X) + H(Y|X) \overset{(b)}{=} H(Y) + H(X|Y). \tag{2.8}
$$
Equality (a) tells us that the information given by the pair of random variables (X, Y), i.e. H(X, Y), is the same information we receive by considering the information carried by X (H(X)), plus the 'new' information provided by Y (H(Y|X)), that is, the information that has not been given by the knowledge of X. Analogous considerations can be made for equality (b).
1 In probability theory, p(x|y) denotes the conditional probability distribution of X given Y, i.e. the probability distribution of X when Y is known to take a particular value.
Proof.
$$
\begin{aligned}
H(X,Y) &= -\sum_{x,y} p(x,y)\log p(x,y) \\
&= -\sum_{x,y} p(x,y)\log p(y|x)p(x) \\
&= -\sum_{x}\sum_{y} p(x,y)\log p(y|x) - \sum_{x}\sum_{y} p(x,y)\log p(x) \\
&= H(Y|X) + H(X).
\end{aligned} \tag{2.9}
$$
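A small numerical check of the chain rule (2.8)-(2.9), added here for illustration (it is not part of the original notes): given an arbitrary joint pmf p(x, y) (the 2×3 table below is made up), the sketch computes H(X,Y), H(X) and H(Y|X) and verifies that H(X,Y) = H(X) + H(Y|X).

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf p(x, y): rows are values of X, columns values of Y.
p_xy = [[0.10, 0.20, 0.10],
        [0.25, 0.05, 0.30]]

H_XY = H([p for row in p_xy for p in row])
p_x = [sum(row) for row in p_xy]
H_X = H(p_x)
# H(Y|X) = sum_x p(x) * H(Y | X = x)
H_Y_given_X = sum(px * H([p / px for p in row]) for px, row in zip(p_x, p_xy))

print(f"H(X,Y)        = {H_XY:.4f}")
print(f"H(X) + H(Y|X) = {H_X + H_Y_given_X:.4f}")  # matches H(X,Y)
```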
By considering the case of m sources we get the generalized chain rule, which
takes the form
$$
\begin{aligned}
H(X_1, X_2, \dots, X_m) &= \sum_{i=1}^{m} H(X_i|X_{i-1}, X_{i-2}, \dots, X_1) \\
&= H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1) + \dots + H(X_m|X_{m-1}, \dots, X_1).
\end{aligned} \tag{2.10}
$$
Proof. Suggestion: for m = 2 it has been proved above. The proof for a
generic m follows by induction.
By referring to (2.10) the meaning of the term ‘chain rule’ becomes clear:
at each step in the chain we add only the new information brought by the
next random variable, that is the novelty with respect to the information we
already have.
• (Conditioning reduces entropy)
$$
H(X|Y) \le H(X). \tag{2.11}
$$
This relation asserts that the knowledge of Y can only reduce the uncertainty about X. Said differently, conditioning reduces the value of the entropy, or at most leaves it unchanged if the two random variables are independent.
Proof.
$$
\begin{aligned}
H(X) - H(X|Y) &= \sum_{x}\sum_{y} p(x,y)\log\frac{p(x|y)}{p(x)} \\
&= \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \\
&= \log e\sum_{x}\sum_{y} p(x,y)\ln\frac{p(x,y)}{p(x)p(y)}.
\end{aligned} \tag{2.12}
$$
By using the lower bound for the logarithm, ln z ≥ 1 − 1/z, from (2.12) we get
$$
H(X) - H(X|Y) \ge \log e\sum_{x}\sum_{y} p(x,y)\left(1 - \frac{p(x)p(y)}{p(x,y)}\right) = 0. \tag{2.13}
$$
Warning
Inequality (2.11) is not necessarily true if we refer to the entropy of a conditional distribution p(x|y) for a given occurrence y, that is H(X|Y = y). The example below aims at clarifying this fact.
Let us consider the problem of determining the most likely winner of a foot-
ball match. Suppose that the weather affects differently the performance of
the two teams according to the values of the table 2.1; X is the random
variable describing the outcome of the match (1, ×, 2) and Y is the random
variable describing the weather condition (rain, sun). By looking at the ta-
ble of values we note that if it rains we are in great uncertainty about the
outcome of the match, while if it’s sunny we are almost sure that the win-
ner of the match will be the first team. As a consequence, if we computed H(X|Y = rain) we would find out that the obtained value is larger than H(X). Since we are conditioning on a particular event, this should not come as a surprise, and it is not in conflict with relation (2.11).
Proof. It directly follows from the chain rule and from relation (2.11).
Y\X    1      ×      2
rain   1/3    1/3    1/3
sun    9/10   1/10   0
Table 2.1: The table shows the probability of the various outcomes in the two possible weather conditions.
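To make the warning concrete (this check is not in the original notes), the sketch below evaluates the entropies for Table 2.1. The table gives p(x|y); since the weather prior is not specified in the notes, we assume, purely for illustration, that P(rain) = P(sun) = 1/2.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x_given_y = {"rain": [1/3, 1/3, 1/3],        # outcomes 1, x, 2 (Table 2.1)
               "sun":  [9/10, 1/10, 0.0]}
p_y = {"rain": 0.5, "sun": 0.5}                # assumed prior, not given in the notes

p_x = [sum(p_y[y] * p_x_given_y[y][i] for y in p_y) for i in range(3)]
H_X = H(p_x)
H_X_given_rain = H(p_x_given_y["rain"])
H_X_given_Y = sum(p_y[y] * H(p_x_given_y[y]) for y in p_y)

print(f"H(X)         = {H_X:.3f}")             # about 1.34 bits
print(f"H(X|Y=rain)  = {H_X_given_rain:.3f}")  # log2(3), larger than H(X)
print(f"H(X|Y)       = {H_X_given_Y:.3f}")     # smaller than H(X), as (2.11) requires
```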
As usual it’s easy to argue that the above relation can be generalized to any
number m of sources.
• (Mapping application)
If we apply a deterministic function g to a given random variable X, i.e. a deterministic processing, the following relation holds:
$$
H(g(X)) \le H(X). \tag{2.15}
$$
This means that we have less a priori uncertainty about g(X) than about X; in other words, considering g(X) in place of X may cause a loss of information. The equality in (2.15) holds only if g is an invertible function.
Proof. By considering the joint entropy H(X, g(X)), we apply the chain rule in two possible ways, yielding
$$
H(X, g(X)) = H(X) + H(g(X)|X) = H(X), \tag{2.16}
$$
where H(g(X)|X) = 0 because g is deterministic, and
$$
H(X, g(X)) = H(g(X)) + H(X|g(X)). \tag{2.17}
$$
By equating the terms in (2.16) and (2.17) we obtain
$$
H(g(X)) = H(X) - H(X|g(X)) \le H(X).
$$
The inequality holds since the term H(X|g(X)) is always greater than or equal to zero, and it equals zero only if the function g is invertible, so that it is possible to recover X by applying g⁻¹ to g(X). If this is the case, knowing X or g(X) is the same.
2.2 Relative Entropy and Mutual Information
• (Positivity)
D(p(x)||q(x)) ≥ 0. (2.20)
where the equality holds if and only if p(x) = q(x).
Proof. Suggestion: use the expression for the joint probability (conditional
probability theorem) for both terms inside the argument of the logarithm
and replace the logarithm with an appropriate sum of two logarithms.
• (Positivity)
I(X; Y ) ≥ 0 (2.27)
Proof. The proof is exactly the same as the one used to show relation (2.11). However, there is another way to prove the positivity of I(X;Y), namely through the application of the relation ln z ≥ 1 − 1/z to the expression in (2.24). Notice that the positivity of I has already been implicitly proved in Section 2.1.2 by proving the relation H(X|Y) ≤ H(X).
The validity of the above assertion arises also from the following:
Observation.
The mutual information I(X; Y ) is the relative entropy between the joint
distribution p(x, y) and the product of the marginal distributions p(x) and
p(y):
I(X; Y ) = D(p(x, y)||p(x)p(y)). (2.28)
The more p(x, y) differs from the product of the marginal distributions, the more the two variables are dependent, and hence the larger the common information between them.
Hence, the positivity of the mutual information directly follows from that of
the relative entropy.
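As a sanity check of identity (2.28) (added here, not in the original notes), the following sketch computes I(X;Y) both as H(X) − H(X|Y) and as D(p(x,y)||p(x)p(y)) for a made-up 2×2 joint pmf; the two values coincide.

```python
import math

p_xy = [[0.30, 0.10],
        [0.15, 0.45]]                          # hypothetical joint pmf, sums to 1
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# I(X;Y) = H(X) - H(X|Y)
H_X = H(p_x)
H_X_given_Y = sum(py * H([p_xy[x][y] / py for x in range(2)])
                  for y, py in enumerate(p_y))
I_from_entropies = H_X - H_X_given_Y

# I(X;Y) = D(p(x,y) || p(x)p(y))
I_from_divergence = sum(p_xy[x][y] * math.log2(p_xy[x][y] / (p_x[x] * p_y[y]))
                        for x in range(2) for y in range(2) if p_xy[x][y] > 0)

print(f"{I_from_entropies:.6f}  {I_from_divergence:.6f}")
```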
Notice that even in this case the conditioning refers to the average over the values of Z.
We can indicate I(X1, X2, ..., Xm; Y) with the equivalent notation I(X⃗; Y), where the variable X⃗ = (X1, X2, ..., Xm) takes values in X^m. For i = 1 no conditioning is considered.
Proof.
$$
\begin{aligned}
I(X_1, X_2, \dots, X_m; Y) &\overset{(a)}{=} H(X_1, X_2, \dots, X_m) - H(X_1, \dots, X_m|Y) \\
&\overset{(b)}{=} \sum_{i=1}^{m} H(X_i|X_{i-1}, \dots, X_1) - \sum_{i=1}^{m} H(X_i|X_{i-1}, \dots, X_1, Y) \\
&\overset{(c)}{=} \sum_{i=1}^{m} I(X_i; Y|X_{i-1}, \dots, X_1),
\end{aligned} \tag{2.31}
$$
Venn diagram
All the above relationships among the entropy and the related quantities
(H(X), H(Y ), H(X, Y ), H(X/Y ), H(Y /X) and I(X; Y )) can be expressed
in a Venn diagram. In a Venn diagram these quantities are visually repre-
sented as sets and their relationships are described as unions or intersections
among these sets, as illustrated in Figure 2.1.
Exercise:
To practice with the quantities introduced so far, prove the following rela-
tions:
[Figure 2.1: Venn diagram illustrating the relationship between entropy and mutual information.]
Chapter 3
Sources with Memory
X → Y → Z. (3.1)
i.e., given Y , the knowledge of X (which precedes Y in the chain) does not
change our knowledge about Z.
Property (1).
that is, the random variables X, Y and Z form a Markov chain with direction → if and only if X and Z are conditionally independent given Y.
Proof. We show first the validity of the direct implication, then that of the
reverse one.
Property (2).
X→Y →Z ⇒ Z → Y → X, (3.7)
that is, if three random variable form a Markov chain in a direction, they
also form a Markov chain in the inverse direction.
Proof. From Property (1) it is easy to argue that X and Z play interchangeable roles, hence proving (3.7).
Observation.
If we have a deterministic function f , then
X → Y → f (Y ). (3.8)
Property (3). For a Markov chain X → Y → Z the Data Processing Inequality (DPI) holds, that is I(X;Y) ≥ I(X;Z).
The DPI states that proceeding along the chain leads to a reduction in the
information about the first random variable.
Proof. By exploiting the chain rule we expand the mutual information I(X; Y, Z)
in two different ways:
output of the source at a given time instant are not necessarily identically
distributed.
For simplicity, we shall use the notation Xn to represent the stochastic pro-
cess omitting the dependence on k.
For mathematical tractability, we limit our analysis to stationary processes
(stationary sources).
for every value n and every shift l and for all x1 , x2 , ...., xn ∈ X .
From a practical point of view, one may wonder whether such a model can actually describe a source in a real context. It is fair to say that the above model represents a good approximation of some real processes, at least over limited time intervals.
It must be pointed out that, in general, the above limits may not exist.
We now prove the important result that for stationary processes both limits
(3.13) and (3.14) exist and assume the same value.
Theorem (Entropy Rate).
If Xn is a stationary source we have
$$
\lim_{n\to\infty}\frac{H(X_1, \dots, X_n)}{n} = \lim_{n\to\infty} H(X_n|X_{n-1}, \dots, X_1), \tag{3.15}
$$
H(Xn |Xn−1 , ..., X1 ) ≤ H(Xn |Xn−1 , ..., X2 ) = H(Xn−1 |Xn−2 , ..., X1 ), (3.16)
where the inequality follows from the fact that conditioning reduces the en-
tropy, and the equality follows from the stationarity assumption. Relation
(3.16) shows that H(Xn |Xn−i , ..., X1 ) is non-increasing in n. Since, in addi-
tion, for any n H(Xn |Xn−i , ..., X1 ) is a positive quantity, according to a well
known result from calculus we can conclude that the limit in (3.14) exists
and is finite.
We now prove that the average information H(X1 , ...., Xn )/n has the same
asymptotical limit value.
By the chain rule:
$$
\frac{H(X_n, \dots, X_1)}{n} = \sum_{i=1}^{n}\frac{H(X_i|X_{i-1}, \dots, X_1)}{n}. \tag{3.17}
$$
By looking at the first term of the sum we argue that its value is fixed (call it k) and finite while, thanks to the proper choice of Nε, all the terms of the second summation are less than ε. Then we have
$$
\left|\frac{1}{n}\sum_{i=1}^{n} a_i - \bar{a}\right| < \frac{k}{n} + \frac{n-N_\varepsilon+1}{n}\,\varepsilon \xrightarrow{\;n\to\infty\;} \varepsilon, \tag{3.22}
$$
q.e.d.
of Markov sources.
Then, in a Markov source the pmf at a given time instant n depends only
on what happened in the previous instant (n − 1).
From (3.23) it follows that for a Markov source the joint probability mass
function can be written as
p(x1 , x2 , ..., xn ) = p(xn |xn−1 )p(xn−1 |xn−2 )...p(x2 |x1 )p(x1 ). (3.24)
[Figure 3.1: State diagram of the two-state Markov chain: from state A the chain moves to state B with probability α and stays in A with probability 1 − α; from state B it moves to A with probability β and stays in B with probability 1 − β.]
1 P(1) denotes the vector of the probabilities of the alphabet symbols at time instant n = 1, i.e. P(1) = (p(X1 = a1), p(X1 = a2), ..., p(X1 = am))^T.
i.e. the largest common factor of the number of steps that starting from
state i allow to come back to the state i itself (or equivalently, the largest
common factor of the lengths of different paths from a state to itself).
ΠT = ΠT · P. (3.29)
where (a) follows from the definition of Markov chain and (b) from the sta-
tionarity of the process. Hence, the quantity H(X2 |X1 ) is the entropy rate
of a stationary Markov chain.
In the sequel we express the entropy rate of a stationary M.C. as a function of
the quantities defining the Markov process, i.e. the initial state distribution
P (1) and the transition matrix P. Dealing with a stationary Markov chain
2 Lcf is the abbreviation for largest common factor, or greatest common divisor.
where, for a fixed i, p(X2 = aj|X1 = ai) = Pij is the probability of passing from state i to state j, for j = 1, ..., |X|. The distribution p(X2|X1 = ai) thus corresponds to the i-th row of the matrix P.
Going back to the example of the two state Markov chain we now can
easily compute the entropy rate. In fact, by looking at the state diagram
in Figure (3.1) it’s easy to argue that the Markov chain is irreducible and
aperiodic. Therefore we know that, for any starting distribution, the same
stationary distribution is reached. The components of the vector Π are the
stationary probabilities of the states A and B, i.e. ΠA and ΠB respectively.
The stationary distribution can be found by solving the equation ΠT = ΠT ·P .
Alternatively, we can obtain the stationary distribution by setting to zero the
net probability flows across any cut in the state transition graph.
By imposing the balance at the cut to the state diagram in Figure (3.1) we
have the following system with two unknowns
ΠA α = ΠB β,
ΠA + ΠB = 1,
where the second equality accounts for the fact that the sum of the proba-
bilities must be one. The above system, once solved, leads to the following
solution for the stationary distribution:
$$
\Pi = \left(\frac{\beta}{\alpha+\beta},\ \frac{\alpha}{\alpha+\beta}\right). \tag{3.32}
$$
We are now able to compute the entropy rate H(Xn) from the expression in (3.31).
Let us call h(α) the entropy of the binary source given the state A (i.e.
H(X2 |X1 = A)), that is the entropy of the distribution of the first row in P.
Similarly, we define h(β) for the state B. The general expression for a two
state Markov chain is
$$
H(X_n) = \frac{\beta}{\alpha+\beta}\, h(\alpha) + \frac{\alpha}{\alpha+\beta}\, h(\beta). \tag{3.33}
$$
Equation (3.33) tells us that in order to evaluate the entropy rate of the two
state Markov chain it’s sufficient to estimate the transition probabilities of
the process once the initial phase ends.
Note: the stationarity of the process has not been required for the derivation.
The entropy rate, in fact, is defined as a long term behavior and then is the
same regardless of the initial state distribution. Hence, if the initial state
distribution is P(1) ≠ Π we can always skip the initial phase and consider
the behavior of the process from a certain time onwards.
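A numerical sketch of (3.32)-(3.33), not part of the original notes: for illustrative values of α and β we compute the stationary distribution both from the closed form and by iterating the transition matrix from an arbitrary initial distribution, and then evaluate the entropy rate.

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.3, 0.1          # illustrative transition probabilities A->B and B->A
P = [[1 - alpha, alpha],        # row i = distribution of the next state given state i
     [beta, 1 - beta]]

# Closed-form stationary distribution (3.32)
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

# Check by iterating an arbitrary initial distribution
dist = [1.0, 0.0]
for _ in range(200):
    dist = [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

entropy_rate = pi[0] * h(alpha) + pi[1] * h(beta)   # equation (3.33)
print("stationary (closed form):", [round(p, 4) for p in pi])
print("stationary (iterated):   ", [round(p, 4) for p in dist])
print("entropy rate:", round(entropy_rate, 4), "bits/symbol")
```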
Let p(n) and q (n) be two pmfs on the state space of a Markov chain at
time n. According to the time invariance assumption these two distributions
are obtained by starting from two different initial states p(1) and q (1) . Let
p(n+1) and q (n+1) be the corresponding distribution at time n + 1, i.e. the
evolution of the chain.
Property. The relative entropy D(p⁽ⁿ⁾||q⁽ⁿ⁾) decreases with n; equivalently,
$$
D(p^{(n+1)}\|q^{(n+1)}) \le D(p^{(n)}\|q^{(n)}). \tag{3.34}
$$
Proof. We use the expression p(n+1,n) (q (n+1,n) ) to indicate the joint proba-
bility distribution of the two discrete random variables, one representing the
state at time n and the other representing the state at time n + 1,
p(n+1,n) = pXn+1 ,Xn (xn+1 , xn ) (q (n+1,n) = qXn+1 ,Xn (xn+1 , xn )). (3.35)
p(n+1|n) = pXn+1 |Xn (xn+1 |xn ) (q (n+1|n) = qXn+1 |Xn (xn+1 |xn )). (3.36)
According to the chain rule for relative entropy, we can write the following
two expansions
It is easy to see that the term D(p⁽ⁿ⁺¹|ⁿ⁾||q⁽ⁿ⁺¹|ⁿ⁾) is zero since, in a Markov chain³, the probability of passing from a state at time n to another state at time n + 1 (the transition probability) is the same whatever the state probability vector is. Then, from the positivity of D, equation (3.34) is proved.
The above property asserts that in a Markov chain the K-L distance be-
tween the probability distributions tends to decrease as n increases.
We observe that equation (3.34) together with the positivity of D allows to
say that the sequence of the relative entropies D(p(n) ||q (n) ) admits limit as
n → ∞. However, we have no guarantee that the limit is zero. This is not
surprising since we are working with a generic Markov chain and then the
long term behavior of the chain may depend on the initial state (equivalently,
the stationary distribution is not unique).
thus implying that any state distribution gets closer and closer to each sta-
tionary distribution as time passes.
3 Keep in mind that when we say "Markov chain" we implicitly assume the time invariance of the chain.
and
$$
\sum_{j} P_{ij} = 1, \qquad i = 1, 2, \dots{}^{4} \tag{3.40}
$$
4 This is always true: in any transition matrix, the sum over the columns for a fixed row is 1.
Chapter 4
Asymptotic Equipartition Property and Source Coding
means that:
As a reminder, the above equation is the same one that, in statistics, proves the consistency of the point estimator X̄n, and it follows directly from Chebyshev's inequality.
We point out that, although in (4.1)-(4.2) we considered the mean value
E[X], the convergence in probability of the sample values to the ensemble
ones can be defined also for the other statistics. Indeed, the law of large numbers rules the behavior of the relative frequencies k/n (where k is the number of successes in n trials) with respect to the probability p, that is, it states that
$$
\Pr\left\{\left|\frac{k}{n} - p\right| > \varepsilon\right\} \to 0 \quad \text{as } n\to\infty. \tag{4.3}
$$
Since the estimation of any statistic based on the samples depends on the
behavior of the relative frequencies, the convergence in probability of the sam-
ple values to the ensemble ones can be derived from the law of large numbers.
For any k, the probability of a sequence having k 0's, in the specific case 0.1^k 0.9^{n−k}, tends to 0 as n → ∞, while the binomial coefficient tends to ∞¹, leading to an indeterminate form. By Stirling's formula² it is possible to prove that, if we consider k = pn = 0.1n, Pr{n(0) = k} ≃ 1. All the corre-
sponding sequences are referred to as typical sequences. As a consequence,
all the other sequences (having a different number of 0’s) occur with an ap-
proximately zero probability. We say that the sequences drawn from the
sources have “the correct frequencies”, where the term correct means that
the relative frequencies coincide with the true probabilities.
1 Strictly speaking, this is not true for the sequence with k = n, which is unique and therefore has vanishing probability.
2 Stirling's formula gives an approximation for factorials: n! ∼ (n/e)^n √(2πn).
Theorem (AEP).
If X1 , X2 , ... are i.i.d. ∼ p(x), then
$$
-\frac{1}{n}\log p(X_1, X_2, \dots, X_n) \xrightarrow{\ \text{in probability}\ } H(X). \tag{4.5}
$$
Proof.
$$
\begin{aligned}
-\frac{1}{n}\log p(X_1, X_2, \dots, X_n) &= -\frac{1}{n}\log\prod_{i=1}^{n} p(X_i) \\
&= -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i) \\
&= \frac{1}{n}\sum_{i=1}^{n}\log\frac{1}{p(X_i)}.
\end{aligned} \tag{4.6}
$$
By conveniently introducing the new random variables Yi = log(1/p(Xi)), the above expression is just the sample mean of Y, i.e. (1/n) Σi Yi. Therefore, by the law of large numbers we have that
$$
-\frac{1}{n}\log p(X_1, \dots, X_n) \to E[Y] \quad \text{in probability}. \tag{4.7}
$$
Writing the expected value of Y explicitly yields
$$
E[Y] = \sum_{x} p(x)\log\frac{1}{p(x)} = H(X). \tag{4.8}
$$
The above theorem allows us to give the definitions of typical set and typical sequence.
Let us rewrite relation (4.5) as follows
$$
-\frac{1}{n}\log p(X_1, X_2, \dots, X_n) - H(X) \to 0 \quad \text{in probability}. \tag{4.9}
$$
Hence, with high probability, the sequence X1, X2, ..., Xn satisfies the relation
$$
H(X) - \varepsilon \le -\frac{1}{n}\log p(X_1, X_2, \dots, X_n) \le H(X) + \varepsilon, \tag{4.11}
$$
which corresponds to the following lower and upper bound for the probability:
We now give some informal insights into the properties of the typical set,
from which it is already possible to grasp the key ideas behind Shannon’s
source coding theorem.
By the above definition and according to the law of large numbers we can
argue that
$$
\Pr\{X^n \in A_\varepsilon^{(n)}\} \to 1 \quad \text{as } n\to\infty. \tag{4.14}
$$
Then, the probability of any observed sequence will be almost surely close to
2−nH(X) . A noticeable consequence is that the sequences inside the typical
set, i.e. the so called typical sequences, are equiprobable. Besides, since a
sequence lying outside the typical set will almost never occur for large n, the
number of typical sequences k can be roughly estimated as follows:
$$
2^{-nH(X)}\cdot k \cong 1 \;\Rightarrow\; k \cong 2^{nH(X)}.
$$
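These counting arguments can be checked numerically for small n (this experiment is not in the original notes). For a binary DMS with P(0) = 0.1, as in the example above, the sketch below enumerates all sequences of length n, collects those satisfying (4.11), and compares their number and total probability with the bounds 2^{n(H+ε)} and 1.

```python
import math
from itertools import product

p0 = 0.1                                    # P(X = 0), as in the example above
H = -(p0 * math.log2(p0) + (1 - p0) * math.log2(1 - p0))
n, eps = 20, 0.2

typical_count, typical_prob = 0, 0.0
for seq in product((0, 1), repeat=n):
    k = seq.count(0)
    prob = p0**k * (1 - p0)**(n - k)
    if abs(-math.log2(prob) / n - H) <= eps:  # condition (4.11)
        typical_count += 1
        typical_prob += prob

print(f"H(X) = {H:.3f} bits,  2^(nH) = {2**(n*H):.3e}")
print(f"|typical set| = {typical_count}  (bound 2^(n(H+eps)) = {2**(n*(H+eps)):.3e})")
print(f"P(typical set) = {typical_prob:.3f}")   # grows toward 1 as n increases
```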
It is easy to understand that, in the coding operation, these are the sequences that really matter. For instance, if we consider binary sources, the above relation states that nH(X) bits suffice on the average to describe n source symbols.
Proof.
1. It directly follows from equation (4.10), which can also be written as follows: ∀ε > 0, ∀δ > 0
$$
\exists N : \forall n > N \quad \Pr\left\{\left|-\frac{1}{n}\log p(X_1, \dots, X_n) - H(X)\right| < \varepsilon\right\} > 1 - \delta. \tag{4.15}
$$
Since the expression in curly braces defines the typical set, equation (4.15) proves point 1.
2.
$$
\begin{aligned}
1 &= \sum_{x^n \in \mathcal{X}^n} p(x^n) \\
&\ge \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) \\
&\overset{(a)}{\ge} \sum_{x^n \in A_\varepsilon^{(n)}} 2^{-n(H(X)+\varepsilon)} \\
&= |A_\varepsilon^{(n)}|\cdot 2^{-n(H(X)+\varepsilon)}
\end{aligned} \tag{4.16}
$$
$$
\begin{aligned}
1 - \delta &\le \Pr\{A_\varepsilon^{(n)}\} \\
&= \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) \\
&\le \sum_{x^n \in A_\varepsilon^{(n)}} 2^{-n(H(X)-\varepsilon)} \\
&= |A_\varepsilon^{(n)}|\cdot 2^{-n(H(X)-\varepsilon)}
\end{aligned} \tag{4.17}
$$
$$
\frac{L}{n} \le H(X) + \varepsilon, \tag{4.21}
$$
where L denotes the average length of the codewords, i.e. E[l(c(x^n))], and L/n is the code rate, i.e. the average number of bits per symbol.
Proof. The proof follows directly from the AEP theorem and the Typical set theorem. We search for a code having a rate which satisfies relation (4.21); in order to prove the theorem it is sufficient to find one such code.
Let us construct a code giving a short description of the source. We divide all the sequences in X^n into two sets: the typical set A_ε^(n) and the complementary set A_ε^(n),c.
As to A_ε^(n), we know from the AEP theorem that the sequences x^n belonging to it are equiprobable and then we can use the same codeword length l(x^n) (l(C(x^n))) for each of them. We represent each typical sequence by giving the index of the sequence in the set. Since there are at most 2^{n(H(X)+ε)} sequences in A_ε^(n), the indexing requires no more than n(H + ε) + 1 bits, where the extra bit is necessary because n(H + ε) may not be an integer. Spending another bit 0 as a flag, so as to make the code uniquely decodable, the total length is at most n(H + ε) + 2 bits. To sum up,
$$
x^n \in A_\varepsilon^{(n)} \;\Rightarrow\; l(x^n) \le n(H + \varepsilon) + 2. \tag{4.22}
$$
We stress that the order of indexing is not important, since it does not affect
the average length.
As to the encoding of the non-typical sequences, Shannon's idea is "to squander". Since the AEP theorem asserts that, as n tends to infinity, the sequences in the non-typical set A_ε^(n),c will (almost) never occur, it is not necessary to look for a short description. Specifically, Shannon suggested indexing each sequence in A_ε^(n),c by using no more than n log|X| + 1 bits (as before, the additional bit takes into account the fact that n log|X| may not be an integer). Observe that n log|X| bits would suffice to describe all the sequences (|A_ε^(n)| + |A_ε^(n),c| = |X|^n). Therefore, by using such a coding, we waste a lot of bits (surprisingly, this is good enough to yield an efficient encoding). Prefixing the indices by 1, we have
$$
x^n \in A_\varepsilon^{(n),c} \;\Rightarrow\; l(x^n) \le n\log|\mathcal{X}| + 2. \tag{4.23}
$$
The description of the source provided by the above code is depicted in Figure
4.1. By using this code we now prove the theorem.
The code is obviously invertible. We now compute the average length of the codewords:
$$
\begin{aligned}
E[l(x^n)] &= \sum_{x^n} p(x^n)\, l(x^n) \\
&= \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n)\, l(x^n) + \sum_{x^n \in A_\varepsilon^{(n),c}} p(x^n)\, l(x^n) \\
&\le (n(H(X)+\varepsilon)+2)\sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) + (n\log|\mathcal{X}|+2)\sum_{x^n \in A_\varepsilon^{(n),c}} p(x^n).
\end{aligned} \tag{4.24}
$$
[Figure 4.1: Description of the source: the non-typical set is described with n log|X| + 2 bits, while the typical set, which has probability close to 1, is described with n(H + ε) + 2 bits.]
For all positive values of δ, if n is sufficiently large, Pr{A_ε^(n)} ≥ 1 − δ; then expression (4.24) is upper bounded as follows:
Then,
$$
\frac{L}{n} \le H(X) + \varepsilon + \frac{2}{n} + \frac{2}{n}\delta + \delta\log|\mathcal{X}| = H(X) + \varepsilon', \tag{4.26}
$$
The above theorem states that the code rate L/n can get arbitrarily close to the entropy of the source. Nevertheless, in order to state that it is not possible to go below this value it is necessary to prove the converse theorem. The converse part shows that if we use an average codeword length even slightly below the entropy we are no longer able to decode.
Proof. Since the average number of bits used for an n-length sequence is n(H(X) − ν), we can encode at most 2^{n(H(X)−ν)} sequences. Let us search for a good choice of the sequences to index; the best thing to do is trying to encode at least the sequences in A_ε^(n). As to the non-typical sequences, if there are no bits left, we assign to each of them the same codeword (in this way the code loses the invertibility property, but only for non-typical sequences).
However, it is easy to argue that the number of sequences we can encode through this procedure is less than the total number of typical sequences. In order to show that, let us set ε = ν/2, and then consider A_{ν/2}^(n). We evaluate the probability of a correctly encoded sequence⁴ (i.e. the probability of falling into the set of the correctly encoded sequences), namely P(corr), which has the following expression:
$$
P(\mathrm{corr}) = \sum_{x^n \in A_{\nu/2}^{(n)}:\, x^n \leftrightarrow c(x^n)} p(x^n) \le 2^{n(H(X)-\nu)}\cdot 2^{-n(H(X)-\nu/2)} = 2^{-n\frac{\nu}{2}}, \tag{4.28}
$$
where the number of terms of the sum was upper bounded by the number of sequences that can be encoded, and p(x^n) by the upper bound on the probability of a typical sequence. Then, by considering that ∀δ > 0 and large n, Pr{A_{ν/2}^(n)} ≥ 1 − δ, the probability that the source emits a sequence of A_{ν/2}^(n) which cannot be correctly coded is
$$
P(\mathrm{err}) \ge 1 - \delta - 2^{-n\frac{\nu}{2}}. \tag{4.29}
$$
Before stating the coding theorem for sources with memory we give the
following lemma.
Lemma (Behavior of the average entropy).
For any stationary source with memory Xn, the sequence of values H(X_k, ..., X_1)/k tends to H(Xn) from above as k → ∞, that is, for large k
$$
\frac{H(X_k, \dots, X_1)}{k} - H(X_n) \ge 0. \tag{4.30}
$$
Proof. By applying the chain rule to the joint entropy twice, we have
Deriving from relation (4.35) an upper bound for the conditional entropy
H(Xk |Xk−1 , ..., X1 ) and substituting it in the expression in (4.31) we get
$$
\frac{L}{n} \le H(X_n) + \varepsilon. \tag{4.38}
$$
Proof. Let us consider the k-th order extension of Xn, i.e. the source Xn^{k,∗} having blocks of k consecutive symbols of Xn as its output symbols, treated as independent of each other. By applying the source coding theorem for memoryless sources to this block source, for any ε > 0 and N large enough we can find a code such that
$$
\frac{E[l(x^{(k,*),N})]}{N} \le H(X_k, X_{k-1}, \dots, X_1) + \varepsilon, \tag{4.39}
$$
where x^{(k,∗),N} denotes an N-length sequence of blocks drawn from the memoryless source. According to the entropy rate definition, for k → ∞ the per-symbol joint entropy H(X_k, ..., X_1)/k tends to H(Xn).
By the definition of the entropy rate, we know that for any positive number
δ, if k is large enough,
$$
\frac{E[l(x^{(k,*),N})]}{N} \le k\cdot H(X_n) + k\cdot\delta + \varepsilon. \tag{4.42}
$$
Since E[l(x^{(k,∗),N})]/N is the average number of bits per block, we can divide by k in order to obtain the average number of bits per symbol. Then,
$$
\frac{E[l(x^{(k,*),N})]}{k\cdot N} \le H(X_n) + \delta + \frac{\varepsilon}{k}. \tag{4.43}
$$
The product k · N is the total length of the starting sequence of symbols, i.e. k · N = n. Thus, setting ε' = δ + ε/k we have
$$
\frac{E[l(x^n)]}{n} \le H(X_n) + \varepsilon', \tag{4.44}
$$
and the theorem is proved.
An important aspect which already comes out from the direct theorem is that, in order to reach the entropy rate, we have to encode blocks of symbols.
6 It must be pointed out that the joint entropy H(X_k, X_{k−1}, ..., X_1) is not the sum of the single entropies because of the presence of memory among the symbols within the block.
However, with respect to memoryless source coding we now have two parameters, the block length k and the number of blocks N, which both have to be large (tend to infinity) in order to approach the entropy rate.
We now consider the converse theorem.
Proof. The proof is given by contradiction. Let us suppose that for large enough n it is possible to encode with a rate less than H(Xn), say H(Xn) − ν for an arbitrarily small ν > 0. Given such a code, we can apply the same mapping to the memoryless source Xn^{k,∗}. In other words, given a sequence drawn from the source Xn^{k,∗}, we consider it as if it were generated from Xn and we assign it the corresponding codeword. Then, we get the following expression for the average number of bits per block:
$$
\begin{aligned}
E[l(x^{(k,*)})] &= k\cdot(H(X_n) - \nu) \\
&\le k\cdot\frac{H(X_k, \dots, X_1)}{k} - k\cdot\nu \\
&= H(X_k, \dots, X_1) - k\cdot\nu,
\end{aligned} \tag{4.45}
$$
where the inequality follows from the Lemma stating the behavior of the average entropy. By looking at equation (4.45) we realize the expected contradiction. Equation (4.45) says that it is possible to code the output of a DMS, namely Xn^{k,∗}, at a rate lower than the entropy. This fact is in contrast with the noiseless source coding theorem.
[Figure: Nested classes of codes: instantaneous codes ⊂ uniquely decodable codes ⊂ nonsingular codes ⊂ all codes.]
$$
\sum_{i=1}^{|\mathcal{X}|} 2^{-l_i} \le 1. \tag{4.46}
$$
Proof. Consider a binary tree, as the one depicted in Figure 4.3, in which each node has 2 children. The branches of the tree represent the symbols of the codewords, 0 or 1. Then, each codeword is represented by a node or a leaf of the tree, and the path from the root traces out the symbols of the codeword. The prefix-code property implies that in the tree no codeword is an ancestor of any other codeword, that is, the presence of a codeword eliminates all its descendants as possible codewords. Then, for a prefix code, each codeword is represented by a leaf.
Let lmax be the length of the longest codeword (i.e. the depth of the tree).
A codeword at level li has 2lmax −li descendants at level lmax , which cannot be
codewords of a prefix code and must then be removed from the tree. For a
prefix code with the given lengths (l1, l2, ..., l|X|) to exist, the overall number
of leaves that we remove from the tree must be less than those available
(2^{l_{max}}). In formula:
$$
\sum_{i=1}^{|\mathcal{X}|} 2^{l_{max}-l_i} \le 2^{l_{max}}, \tag{4.47}
$$
Let us construct the code with the given set of lengths. For each length li, we consider 2^{lmax−li} leaves of the tree, label the root of the corresponding subtree (which corresponds to a node at depth li) as the codeword i and remove all its descendants from the tree. This procedure can be repeated for all the lengths if there are enough leaves, that is if
$$
\sum_{i} 2^{l_{max}-l_i} \le 2^{l_{max}}. \tag{4.49}
$$
subject to
$$
\sum_{x} 2^{-l(x)} \le 1, \qquad l(x) \in \mathbb{N}. \tag{4.51}
$$
8 In a dyadic source the probabilities of the symbols are negative integer powers of 2, that is pi = 2^{−αi} (αi ∈ ℕ).
Proof.
$$
\begin{aligned}
L - H(X) &= \sum_{i} p_i l_i + \sum_{i} p_i\log p_i \\
&= \sum_{i} p_i\log 2^{l_i} + \sum_{i} p_i\log p_i \\
&= \sum_{i} p_i\log\frac{p_i}{2^{-l_i}} \\
&\ge \log e\sum_{i} p_i\left(1 - \frac{2^{-l_i}}{p_i}\right) \\
&= \log e\Big(\underbrace{\sum_{i} p_i}_{=1} - \sum_{i} 2^{-l_i}\Big) \ge 0. 
\end{aligned} \tag{4.52}
$$
If the source is dyadic, li = log2(1/pi) belongs to ℕ for each i, and then, by using these lengths for the codewords, the derivation in (4.52) holds with equality.
What if the source is not dyadic? The following property tells us how far from H(X) the minimum average codeword length is (at most) in the general case.
Proof. The left-hand side has already been proved in (4.52). In order to
prove the right-hand side, let us assign the lengths li according to a round-
off approach, i.e. by using the following approximation:
$$
l_i = \left\lceil \log\frac{1}{p_i} \right\rceil \le \log\frac{1}{p_i} + 1. \tag{4.54}
$$
The average codeword length of this code is
$$
L = \sum_{i} p_i l_i \le \sum_{i} p_i\left(\log\frac{1}{p_i} + 1\right) = H(X) + 1. \tag{4.55}
$$
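A sketch of the round-off assignment in (4.54), added here for illustration (not part of the original notes): for a made-up non-dyadic pmf we compute li = ⌈log2(1/pi)⌉, verify that these lengths satisfy the Kraft inequality (4.46), and check the bounds H(X) ≤ L < H(X) + 1.

```python
import math

pmf = [0.5, 0.25, 0.15, 0.1]                           # illustrative, non-dyadic pmf
H = -sum(p * math.log2(p) for p in pmf)

lengths = [math.ceil(math.log2(1 / p)) for p in pmf]   # round-off assignment (4.54)
kraft = sum(2.0**(-l) for l in lengths)                # must be <= 1 by (4.46)
L = sum(p * l for p, l in zip(pmf, lengths))           # average codeword length (4.55)

print("lengths:", lengths)
print(f"Kraft sum = {kraft:.3f} (<= 1)")
print(f"H(X) = {H:.3f} <= L = {L:.3f} < H(X) + 1 = {H + 1:.3f}")
```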
where ε is a positive quantity which can be taken arbitrarily small for large
k.
and then
$$
H(X_k, \dots, X_1) \le L_k \le H(X_k, X_{k-1}, \dots, X_1) + 1. \tag{4.60}
$$
Dividing by k yields
From the above theorem it is evident that the benefit of coding blocks of
symbols is twofold:
• the round off to the next integer number, which costs 1 bit, is spread
on k symbols (this is the same benefit we had for memoryless sources);
Chapter 5
Channel Capacity and Coding
1 As a matter of fact, the requirement of limited bandwidth is not necessary, due to the presence of the channel, which itself acts as a bandwidth limiter.
[Figure 5.1: Discrete-time channel. The input sequence X1, X2, X3, ... is obtained by sampling the stochastic process x(k, t) with sampling step T; the channel C produces the output sequence Y1, Y2, Y3, ....]
where k denotes the discrete time at which the outcome is observed. Note
that, due to causality, conditioning is restricted to the inputs preceding k
and to the k-th input itself.
The channel is said memoryless when the output symbol at a given time
depends only on the current input. In this case the transition probabilities
become:
P r{Yk = y|Xk = x} ∀y ∈ Y, ∀x ∈ X . (5.2)
and the simplified channel scheme is illustrated in Figure 5.2. Assuming a memoryless channel greatly restricts our model since in this way we do not consider several factors, like fading, which could affect the communication through the introduction of intersymbol interference. Such phenomena require the adoption of much more complex models.
In order to further simplify the analysis, we also assume that the channel is
stationary. Frequently2 , we can make this assumption without loss of general-
ity since the channel variability is slow with respect to the transmission rate.
In other words, during the transmission of a symbol, the statistical proper-
ties of the channel do not change significantly. Then, since the probabilistic
2 This is not true when dealing with mobile channels.
[Figure 5.2: Discrete memoryless channel. The output signal (r.v.) at each time instant n depends only on the input signal (r.v.) at the same time.]
model describing the channel does not change over time, we can characterize
the channel by means of the transition probabilities p(y|x), where y ∈ Y
and x ∈ X . These probabilities can be conveniently arranged in a matrix
P = {Pij }, where
[Figure 5.3: Noiseless binary channel: input 0 is mapped to output 0 and input 1 to output 1 with probability 1.]
[Figure 5.4: Model of the noisy channel with non-overlapping outputs: input 0 produces output a or b with probability 1/2 each, input 1 produces output c or d with probability 1/2 each.]
Noisy Typewriter
[Figure: Noisy typewriter channel: each input letter (a, b, c, ..., z) is received either unchanged or as the next letter of the alphabet, with probability 1/2 each.]
[Figure: Binary symmetric channel (BSC): each input bit is received correctly with probability 1 − ε and flipped with probability ε.]
[Figure 5.7: Binary erasure channel (BEC): each input bit is received correctly with probability 1 − α and erased (output e) with probability α.]
[Figure: Channel dispersion: each of the 2^n input sequences X^n is spread by the channel over a set of possible output sequences Y^n.]
to make the corresponding set disjoint. That is, we can consider 2k input
sequences for some value k (k < n). Note that, without noise, k bits would
suffice to index 2k sequences; the n − k additional bits in each sequence cor-
respond to the ‘redundancy’. In the sequel we better formalize this concept.
In the BSC, according to the law of large numbers, if a binary sequence of length n (for large n) is transmitted over the channel, with high probability the output will disagree with the input at about nε positions. The number of possible ways in which it is possible to have nε errors in an n-length sequence (or the number of possible sequences that disagree with the input in nε positions) is given by
$$
\binom{n}{n\varepsilon}. \tag{5.9}
$$
By using Stirling's approximation n! ≈ n^n e^{−n} √(2πn) and applying some algebra we obtain
$$
\binom{n}{n\varepsilon} \approx \frac{2^{nh(\varepsilon)}}{\sqrt{2\pi n(1-\varepsilon)\varepsilon}}. \tag{5.10}
$$
Relation (5.10) gives an approximation on the number of sequences in each
output set. Then, for each block of n inputs, there exist roughly 2nh(ε) highly
probable corresponding output blocks. Note that if ε = 1/2, then h(ε) = 1
and the entire output set would be required for an error-free transmission of
only one input sequence.
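A quick numerical comparison of the two sides of (5.10), added for illustration (not in the original notes): for a few block lengths we compare the exact binomial coefficient with the Stirling-based approximation; the relative agreement improves as n grows.

```python
import math

def h(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.1
for n in (50, 200, 1000):
    k = round(n * eps)
    exact = math.comb(n, k)
    approx = 2**(n * h(eps)) / math.sqrt(2 * math.pi * n * (1 - eps) * eps)
    print(f"n = {n:5d}:  C(n, n*eps) = {exact:.3e},  approximation (5.10) = {approx:.3e}")
```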
On the other hand, by referring to the output of the channel, regarded as a source, the total number of highly probable sequences is roughly 2^{nH(Y)}. Therefore, the maximum number of input sequences that may produce almost non-overlapping output sets is
$$
M = \frac{2^{nH(Y)}}{2^{nh(\varepsilon)}/\sqrt{2\pi n(1-\varepsilon)\varepsilon}}. \tag{5.11}
$$
Then, the number of bits that can be transmitted each time, i.e. the transmission rate per channel use, is:
$$
R = \frac{k}{n} = \frac{\log_2\!\left(2^{n(H(Y)-h(\varepsilon))}\cdot\sqrt{2\pi n(1-\varepsilon)\varepsilon}\right)}{n}. \tag{5.13}
$$
Finally, as n → ∞, R → H(Y) − h(ε).
A close inspection of the limit expression for R reveals that we still have a degree of freedom that can be exploited to maximize the transmission rate; it consists in the input probabilities p(x), which determine the values of p(y) (remember that the transition probabilities of the channel are fixed by the stationarity assumption) and hence H(Y). In the sequel we look for the input
probability distribution maximizing H(Y ), giving the maximum transmis-
sion rate. Since Y is a binary source, the maximum of H(Y ) is 1, which
is obtained when the input symbols are equally likely. So, the maximum
transmission rate is Rmax = 1 − h(ε).
Observation.
The quantity 1−h(ε) is exactly the maximum value of the mutual information
between the input and the output for the binary symmetric channel (BSC),
that is
$$
\max_{p_X(x)} I(X;Y) = 1 - h(\varepsilon). \tag{5.14}
$$
In fact, given the input bit x, the BSC behaves as a binary source,
giving at the output the same bit with probability 1 − ε. Thus, we can
state that H(Y |X) = h(ε) and consequently I(X; Y ) = H(Y ) − H(Y |X) =
H(Y ) − h(ε), whose maximum is indeed 1 − h(ε).
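The maximization in (5.14) can be checked numerically (this sketch is not part of the original notes): for a BSC with crossover probability ε we evaluate I(X;Y) = H(Y) − h(ε) over a grid of input distributions and compare the maximum with the closed form 1 − h(ε); the maximum is attained for equally likely inputs.

```python
import math

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.11                                   # illustrative crossover probability
best_I, best_q = 0.0, 0.0
for i in range(1001):
    q = i / 1000                             # q = P(X = 1)
    p_y1 = q * (1 - eps) + (1 - q) * eps     # output distribution P(Y = 1)
    I = h(p_y1) - h(eps)                     # I(X;Y) = H(Y) - H(Y|X)
    if I > best_I:
        best_I, best_q = I, q

print(f"numerical max I(X;Y) = {best_I:.4f} at P(X=1) = {best_q:.2f}")
print(f"closed form 1 - h(eps) = {1 - h(eps):.4f}")
```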
[Figure: Communication system: a message i ∈ {1, ..., M} is mapped by the encoder to X^n, sent through the channel p(y|x), and the decoder produces an estimate î of the message from Y^n.]
This result is in agreement with the previous one for the BSC. We foretell
that the above expression represents the channel capacity.
In Section 5.2.3, we will give a rigorous formalization to the above consider-
ations by proving the noisy channel-coding theorem.
i.e. the output does not depend on the past inputs and outputs.
If the channel is used without feedback, i.e. if the input symbols do not depend
on the past output symbols (p(xk |xk−1 , y k−1 ) = p(xk |xk−1 )), the channel
transition probabilities for the nth extension of the DMC can be written as
$$
p(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x_i). \tag{5.17}
$$
Often, we will use x^n(i) instead of g(i) to indicate the codeword associated with index i. As a consequence of the above definition, the maximal probability of error λ_max^(n) for an (M, n) code is defined as
$$
\lambda_{max}^{(n)} = \max_{i\in\{1,2,\dots,M\}} \lambda_i. \tag{5.19}
$$
The average probability of error P_e^(n) for an (M, n) code is
$$
P_e^{(n)} = \frac{1}{M}\sum_{i=1}^{M}\lambda_i, \tag{5.20}
$$
$$
R = \frac{\log M}{n} \quad \text{bits per channel use.} \tag{5.21}
$$
Definition. A rate R is said to be achievable if there exists a sequence of
codes having rate R, i.e. (2nR , n) codes, such that
$$
\lim_{n\to\infty}\lambda_{max}^{(n)} = 0. \tag{5.22}
$$
Definition. The capacity of the channel is the supremum of all the achievable
rates.
$$
\begin{aligned}
A_\varepsilon^{(n)} = \Big\{(x^n, y^n) \in \mathcal{X}^n\times\mathcal{Y}^n :\ 
&\Big|-\tfrac{1}{n}\log p(x^n) - H(X)\Big| < \varepsilon,\ \ \Big|-\tfrac{1}{n}\log p(y^n) - H(Y)\Big| < \varepsilon, \\
&\Big|-\tfrac{1}{n}\log p(x^n, y^n) - H(X,Y)\Big| < \varepsilon\Big\},
\end{aligned} \tag{5.23}
$$
where the first and the second conditions require the typicality of the sequences x^n and y^n respectively, and the last inequality requires the joint typicality of the couple of sequences (x^n, y^n).
We observe that, if we did not consider the joint typicality, the number of possible couples of sequences in A_ε^(n) would be the product |A_{ε,x}^(n)| · |A_{ε,y}^(n)| ≅ 2^{n[H(X)+H(Y)]}. Intuition suggests that the total number of jointly typical sequences is approximately 2^{nH(X,Y)}, and then not all pairs of typical x^n and typical y^n are jointly typical, since H(X,Y) ≤ H(X) + H(Y). These considerations are formalized in the following theorem, which is the extension of the AEP theorem to the case of two sources.
1. Pr{A_ε^(n)} → 1 as n → ∞ (> 1 − δ for large n);
2. ∀ε, |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}, ∀n;
3. ∀δ, ∀ε and n large, |A_ε^(n)| ≥ (1 − δ) 2^{n(H(X,Y)−ε)};
4. if (X̃^n, Ỹ^n) are drawn according to p(x^n)p(y^n), i.e. X̃^n and Ỹ^n are independent but have the same marginals as p(x^n, y^n), then the probability that (X̃^n, Ỹ^n) is jointly typical is approximately 2^{−nI(X;Y)}. Formally,
$$
\Pr\{(\tilde{X}^n, \tilde{Y}^n) \in A_\varepsilon^{(n)}\} \le 2^{-n(I(X;Y)-3\varepsilon)},
$$
and, for sufficiently large n,
$$
\Pr\{(\tilde{X}^n, \tilde{Y}^n) \in A_\varepsilon^{(n)}\} \ge (1-\delta)\,2^{-n(I(X;Y)+3\varepsilon)}.
$$
Proof. The first point says that for large enough n, with high probability, the couple of sequences (x^n, y^n) lies in the typical set; it directly follows from the weak law of large numbers. In order to prove the second and the third point we can use the same arguments as in the proof of the AEP theorem. Instead, we explicitly give the proof of point 4, which represents the novelty with respect to the AEP theorem. The new sources X̃^n and Ỹ^n are independent but have the same marginal distributions as p(x^n, y^n).
where inequality (a) follows from the AEP theorem, while (b) derives from point 2. Similarly, it is possible to find a lower bound for sufficiently large n, i.e.
$$
\Pr\{(\tilde{x}^n, \tilde{y}^n) \in A_\varepsilon^{(n)}\} = \sum_{(x^n, y^n)\in A_\varepsilon^{(n)}} p(x^n)\,p(y^n)
$$
The above theorem suggests that we have to consider about 2nI(X;Y ) pairs
before we are likely to come across a jointly typical pair.
decoding rule. However, the rigorous proof was given long after Shannon’s
initial paper. We now give the complete statement and proof of Shannon’s
second theorem.
Each element of the matrix is drawn i.i.d. ∼ p(x). Each row i of the matrix
corresponds to the codeword xn (i).
Having defined the encoding function g, we define the corresponding decoding function f. Shannon proposed a decoding rule based on joint typicality: the receiver looks for a codeword that is jointly typical with the received sequence. If a unique codeword x^n(i) exists satisfying this property, the receiver declares that index i was transmitted, i.e.
$$
f(y^n) = i. \tag{5.32}
$$
Otherwise, that is, if no such i exists or if there is more than one such codeword, an error is declared and the transmission fails. Notice that joint typicality decoding is suboptimal. Indeed, the optimum procedure for minimizing the probability of error is maximum likelihood decoding. However, the proposed decoding rule is easier to analyze and asymptotically optimal.
We now calculate the average probability of error over all codes generated at random according to the above described procedure, that is
$$
P_e^{(n)} = \sum_{\mathcal{C}} P_e^{(n)}(\mathcal{C})\,\Pr(\mathcal{C}), \tag{5.33}
$$
where P_e^(n)(C) is the probability of error averaged over all codewords in codebook C. Then we have³
$$
P_e^{(n)} = \sum_{\mathcal{C}}\Pr(\mathcal{C})\,\frac{1}{2^{nR}}\sum_{i=1}^{2^{nR}}\lambda_i(\mathcal{C}) = \frac{1}{2^{nR}}\sum_{i=1}^{2^{nR}}\sum_{\mathcal{C}}\Pr(\mathcal{C})\,\lambda_i(\mathcal{C}). \tag{5.34}
$$
If Y^n is the result of sending X^n(i) over the channel⁴, we define the event Ei as the event that the i-th codeword and the received one are jointly typical, that is
$$
E_i = \{(X^n(i), Y^n) \in A_\varepsilon^{(n)}\}, \quad i \in \{1, 2, \dots, 2^{nR}\}. \tag{5.36}
$$
3 We point out that there is a slight abuse of notation, since P_e^(n)(C) in (5.33) corresponds to P_e^(n) in (5.20), while P_e^(n) in (5.33) denotes the probability of error averaged over all the codes. Similarly, λi(C) corresponds to λi, where again the dependence on the codebook is made explicit.
4 Both X^n(i) and Y^n are random since we are not conditioning on a particular code. We are interested in the average over C.
Since we assumed i = 1, we can define the error event E as the union of all
the possible types of error which may occur during the decoding procedure
(jointly typical decoding):
where the event E1c occurs when the transmitted codeword and the received
one are not jointly typical, while the other events refer to the possibility that
a wrong codeword (different from the transmitted one) is jointly typical with
Y^n (the received sequence). Hence: Pr(E) = Pr(E_1^c ∪ E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}}).
We notice that the transmitted codeword and the received sequence must be jointly typical, since they are probabilistically linked through the channel. Hence, by bounding the probability of the union in (5.37) with the sum of the probabilities, from the first and the fourth point of the joint AEP theorem we obtain
$$
\begin{aligned}
\Pr(E) &\le \Pr(E_1^c) + \sum_{i=2}^{2^{nR}}\Pr(E_i) \\
&\le \delta + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y)-3\varepsilon)} \\
&= \delta + (2^{nR}-1)\,2^{-n(I(X;Y)-3\varepsilon)},
\end{aligned}
$$
λ1, λ2, ..., λ_{2^{nR}}.
Now, we throw away the upper half of the codewords in C∗, thus generating a new code C⋆ with half the codewords. The average probability of error for the code C∗ being lower than δ′, we deduce that
(if it were not so, it is easy to argue that P_e^(n)(C) would be greater than δ′). But λ_{2^{nR}/2} is the maximal probability of error for the code C⋆, which then is arbitrarily small (tends to zero as n → ∞).
What about the rate of C⋆? Throwing out half the codewords reduces the rate from R to R − 1/n (= log(2^{nR−1})/n). This reduction is negligible for large n. Then, for large n, we have found a code having rate R and whose λ_max^(n) tends to zero. This concludes the proof that any rate below C is achievable.
Now we must show that it is not possible to ‘do better’ than C (converse).
Before giving the proof we need to introduce two lemmas of general validity.
Lemma (Fano's inequality). Let X and Y be two dependent sources and let g be any deterministic reconstruction function s.t. X̂ = g(Y). The following upper bound on the remaining uncertainty (or equivocation) about X given Y holds:
$$
H(X|Y) \le h(P_e) + P_e\log(|\mathcal{X}| - 1), \quad \text{with } P_e = \Pr\{\hat{X} \ne X\}.
$$
By using the chain rule we can expand H(E, X|Y) in two different ways:
It is easy to see that H(E|X, Y) = 0, while H(E|Y) ≤ H(E) = h(Pe). As to H(X|E, Y), by writing the sum over E explicitly we have
Proof.
where (a) derives from the application of the generalized chain rule and (b) follows from the memoryless (and no-feedback) assumption. Since conditioning reduces uncertainty, H(Y^n) ≤ Σi H(Yi), we have relation (c). We stress that the output symbols Yi do not need to be independent, that is, in general p(yi|y_{i−1}, ..., y1) ≠ p(yi). Since C is defined as the maximal mutual information over p(x), the last inequality clearly holds.
The above lemma shows that using the channel many times does not in-
crease the transmission rate.
5 For the sake of clarity, we point out that Fano's inequality holds even in the more general case in which the function g(Y) is random, that is, for any estimator X̂ such that X → Y → X̂.
Remark : the lemma holds also for non DM channels, but this extension is
out of the scope of these notes.
We have now the necessary tools to prove the converse of the channel
coding theorem.
W → X n (W ) → Y n → Ŵ . (5.51)
In (5.51), Y^n takes the role of the observation, W the role of the index we have to estimate, and Pr(Ŵ ≠ W) = P_e^(n) = (1/2^{nR}) Σi λi. The random variable W corresponds to a uniform source, since the indexes are drawn in an equiprobable manner; thus the entropy has the expression H(W) = log(2^{nR}).
By using the definition of the mutual information we have
Since the channel directly acts on X n , we deduce that p(y n |xn , w) = p(y n |xn ),
that is W → X n → Y n . Then, according to the properties of the Markov
chains and in particular to DPI, from (5.52) it follows that
Dividing by n yields:
$$
R < C + \frac{1}{n} + P_e^{(n)} R. \tag{5.55}
$$
It follows that if n → ∞ and P_e^(n) → 0 then R < C + ε for any arbitrarily small ε, i.e. R ≤ C.
According to the direct channel coding theorem, n must tend to infinity so
[Figure 5.10: Asymptotic lower bound on the error probability Pe as a function of the rate R: the bound is zero for R ≤ C and behaves as 1 − C/R for R > C.]
that P_e^(n) can be made arbitrarily small. Therefore, if we want P_e^(n) → 0 it is necessary that the rate R stays below capacity. This fact proves that R ≤ C is also a necessary condition for a rate R to be achievable.
From (5.55) there is another possible way through which we can show that if R > C then P_e^(n) does not tend to 0. Let us rewrite (5.55) as follows
$$
P_e^{(n)} \ge 1 - \frac{C}{R} - \frac{1}{nR}. \tag{5.56}
$$
Joining this condition with the positivity of P_e^(n) produces the asymptotic lower bound on Pe depicted in Figure 5.10. It is easy to see that if R > C the probability of error is bounded away from 0 for large n. As a consequence, we cannot achieve an arbitrarily low probability of error at rates above capacity.
It is possible to prove that, since p(y|x) is fixed by the channel, the mutual information is a concave function of p(x). Hence, a maximum for I(X;Y) exists and is unique. However, since the objective function is nonlinear, solving (5.57) is not easy and requires methods of numerical optimization. There are only some simple channels, already introduced at the beginning of the chapter, for which it is possible to determine C analytically.
• Noisy typewriter
In this channel, if we know the input symbol we have two possible outputs (the same or the subsequent symbol) with probability 1/2 each. Then, H(Y|X) = 1 and max I(X;Y) = max(H(Y) − H(Y|X)) = max(H(Y) − 1). The maximum of the entropy of the output source, which is log|Y|, can be achieved by using p(x) distributed uniformly over all the inputs. Since the input and the output alphabets coincide, we have
• BSC
Even for this channel the maximization of the mutual information is straight-
forward, since we can easily compute the probability distribution p(x) which
maximizes H(Y ). As we already know from the analysis in Section 5.2.1,
C = max(H(Y ) − h(ε)) = 1 − h(ε), which is achieved when the input distri-
bution is uniform.
• BEC
For the binary erasure channel (Figure 5.7) the evaluation of the capacity is
a little bit more complex. Since H(Y |X) is a characteristic of the channel
and does not depend on the probability of the input, we can write
For a generic value of α the absolute maximum value for H(Y ) (log |Y| =
log 3) cannot be achieved for any choice of the input distribution. Then,
we have to explicitly solve the maximization problem. Let pX (0) = π and
pX(1) = 1 − π. There are two ways to evaluate π. According to the first method, from the output distribution given by the triplet pY(y) = (π(1−α), α, (1−π)(1−α)) we calculate the entropy H(Y) and then maximize over π. The other method exploits the grouping property, yielding H(Y) = h(α) + (1 − α)h(π).
The maximum of the above expression is obtained when h(π) = 1, and hence for π = 1/2. It follows that C = h(α) + (1 − α) − h(α) = 1 − α. The result is expected, since the BEC is nothing else than a noiseless binary channel which breaks down with probability α; then C can be obtained by subtracting from 1 the fraction of time the channel remains inoperative.
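As a cross-check of the BEC result (added here, not in the original notes), the sketch below maximizes I(X;Y) = H(Y) − H(Y|X) numerically over π = pX(0) for an illustrative erasure probability α and compares the maximum with 1 − α.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

alpha = 0.3                                   # illustrative erasure probability
best_I, best_pi = 0.0, 0.0
for i in range(1, 1000):
    pi = i / 1000                             # pi = p_X(0)
    p_y = [pi * (1 - alpha), alpha, (1 - pi) * (1 - alpha)]   # outputs: 0, e, 1
    I = H(p_y) - H([1 - alpha, alpha])        # H(Y) - H(Y|X), with H(Y|X) = h(alpha)
    if I > best_I:
        best_I, best_pi = I, pi

print(f"numerical capacity = {best_I:.4f} at pi = {best_pi:.3f}")
print(f"closed form 1 - alpha = {1 - alpha:.4f}")
```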
The channel coding theorem promises the existence of block codes that allow us to transmit information at rates below capacity with arbitrarily small probability of error, provided that the block length is large enough. The greatest problem
Chapter 6
Continuous Sources and Gaussian Channel
The lower-case letter h is used in place of the capital letter H, which denotes the entropy in the discrete case.
It can be shown that the differential entropy represents a valid measure for
the information carried by a continuous random variable: indeed, if h(X)
grows the prior uncertainty about the value of X increases. However, some
of the intuitiveness of the entropy is lost. The main reason for this is that
now the differential entropy can take negative values: this happens for in-
stance when we compute the entropy of a random variable with a uniform
distribution in a continuous range [0, a] where a < 1 (in this case in fact
h(X) = log a < 0).
1 In the continuous case we refer to pdfs instead of pmfs.
The quantities related to the differential entropy, like the joint and condi-
tional entropy, mutual information, and divergence, can be defined in the
same way as for the discrete case2 and most of their properties proved like-
wise.
Observation.
From the AEP theorem we know that the differential entropy is directly re-
lated to the volume occupied by the typical sequences. The following intuitive
properties hold:
2. h(X) ≠ h(αX),
$$
h(\alpha X) = -\int f_X(x)\log\frac{1}{\alpha}\,dx - \int f_X(x)\log f_X(x)\,dx = h(X) + \log\alpha \ne h(X). \tag{6.6}
$$
Observe that the additional term log α corresponds to the volume scaling factor in n dimensions, the volume being approximately 2^{n(h(X)+log α)} = α^n 2^{nh(X)}.
Let us now consider the general case of n jointly Gaussian random variables forming a Gaussian vector X⃗ = (X1, ..., Xn). We want to evaluate the differential entropy h(X1, X2, ..., Xn). A Gaussian vector X⃗ is distributed according to a multivariate Gaussian density function, which has the expression
$$
f_{\vec{X}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |C|}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu})\,C^{-1}(\vec{x}-\vec{\mu})^T}, \tag{6.9}
$$
We now prove that, among all the possible continuous distributions with
the same variance, the Gaussian distribution is the one that has the largest
entropy.
4. We make use of the unitary sum property of the density function: ∫···∫_{R^n} N(μ⃗, C) dx⃗ = 1.
Property. Let f(x) be a Gaussian density function with variance σ² and let g(x) be any other density function having the same variance. Then h(g) ≤ h(f).
Proof.

0 ≤ D(g(x)||f(x)) = ∫ g(x) log (g(x)/f(x)) dx
= ∫ g(x) log g(x) dx − ∫ g(x) log f(x) dx
(a)= −h(g) − ∫ g(x) log [ (1/√(2πσ²)) e^{−x²/(2σ²)} ] dx
= −h(g) − log(1/√(2πσ²)) + log e · ∫ g(x) (x²/(2σ²)) dx
= −h(g) − log(1/√(2πσ²)) + (log e)/(2σ²) · ∫ x² g(x) dx
= −h(g) + (1/2) log(2πσ²) + (1/2) log e
= −h(g) + h(f),   (6.14)
where in (a), without any loss of generality, we considered a zero mean density
function (as we will see, the differential entropy does not depend on the mean
value). From (6.14), we can easily obtain the desired relation.
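A quick numerical illustration (my own sketch, using scipy's closed-form entropies): among a Gaussian, a Laplace and a uniform density with the same variance, the Gaussian has the largest differential entropy.

```python
import numpy as np
from scipy.stats import norm, laplace, uniform

sigma = 1.5  # common standard deviation (assumed value)

# Parametrize each distribution so that its variance equals sigma**2
gauss = norm(scale=sigma)
lap   = laplace(scale=sigma / np.sqrt(2))                          # var = 2*b^2
unif  = uniform(loc=-sigma * np.sqrt(3), scale=2 * sigma * np.sqrt(3))  # var = width^2 / 12

for name, rv in [("Gaussian", gauss), ("Laplace", lap), ("Uniform", unif)]:
    print(f"{name:8s}  h = {rv.entropy() / np.log(2):.4f} bits")
# The Gaussian value, 0.5*log2(2*pi*e*sigma^2), is the largest.
```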
Since the added noise is white, the channel is stationary and memoryless. This channel is a model for a number of common communication channels, such as the telephone channel and satellite links.
Without any limitation on the input, the capacity of the Gaussian channel would be infinite and we could obtain a perfect (error-free) transmission, as if the noise variance were zero. However, it is quite reasonable to assume that the power of the input signal is constrained. In particular we require that using the channel n times yields a transmitted power less than P_X (or σ_x²), i.e.

(1/n) Σ_i x_i² ≤ P_X.   (6.16)
Given the constraint in (6.16), we are interested in determining the maximum rate at which transmission is possible through the channel. It is worth stressing that, strictly speaking, we still have to show that the channel capacity concept carries over to the continuous case. Before rigorously formalizing these concepts, we empirically show the basic ideas behind channel coding for transmission over an AWGN channel.
[Figure 6.1: sphere-packing picture of the AWGN channel: the input sequences x^n lie in a sphere of radius √(nP_X) in the input space X^n, while the channel dispersion spreads each of them over a small sphere of radius √(nP_N) in the output space Y^n.]
Consider an input sequence x^n transmitted over the channel through n uses. Due to the noise added by the channel during the transmission, for any input sequence there is in turn an infinite number of possible outputs. The power constraint ||x^n|| ≤ √(nP_X) allows us to say that all the possible inputs lie in an n-dimensional hypersphere (in R^n) of radius √(nP_X) (see Figure 6.1). What we want to determine is the maximum number of sequences that can be reliably transmitted over the channel (error-free transmission). Looking at the figure, we see that without any limitation imposed on the power of the input signal we could reliably transmit an infinite number of sequences (the radius of the sphere being unbounded), despite the dispersion caused by the noise.
In order to find the maximum number of reliably transmissible sequences we can compute the maximum number of disjoint sets that we can arrange in the output space (Y^n) (Figure 6.1). Each sequence y^n in the set of output sequences is obtained as the sum x^n + z^n, where x^n is the corresponding input sequence and z^n is a Gaussian noise vector. Each coefficient z_i represents the noise relative to the i-th use of the channel (Z_i ∼ N(0, σ_z²), the Z_i being i.i.d.). The random output vector Y^n = x^n + Z^n has a Gaussian distribution with mean x^n and the same variance as the noise, i.e. σ_z² = P_N. Therefore, it is correct to represent the output set as centered on the input sequence. Besides, if n is sufficiently large, we can affirm that with high probability the output points lie on the boundary of the hypersphere of radius √(nP_N), since by the Law of Large Numbers

||(x^n + Z^n) − x^n||² = ||Z^n||² = Σ_i Z_i² → nP_N  as n → ∞.   (6.17)
where in equality (6.18) we exploited the independence of the signal from the noise. Then, the received vectors lie inside a sphere of radius √(n(P_X + P_N)). Since the volume of a sphere is directly proportional to the n-th power of the radius, with a proportionality constant a_n, the maximum number of non-overlapping (non-intersecting) spheres that can be arranged in this volume is bounded by

a_n (n(P_X + P_N))^{n/2} / (a_n (nP_N)^{n/2}) = ((P_N + P_X)/P_N)^{n/2}.   (6.19)
Then, the number of bits that can be reliably transmitted for each use of the channel is at most

(1/2) log(1 + P_X/P_N).   (6.20)
The above arguments tell us that we cannot hope to send information at a rate larger than the value in (6.20) with no error. In the next section we will rigorously prove that as n → ∞ we can do almost as well as this.
We first extend the definition of jointly typical sequences to continuous random variables:

A_ε^(n) = { (x^n, y^n) ∈ X^n × Y^n :
  | −(1/n) log f_X(x^n) − h(X) | < ε,
  | −(1/n) log f_Y(y^n) − h(Y) | < ε,
  | −(1/n) log f_XY(x^n, y^n) − h(X, Y) | < ε }.   (6.21)

1. ∀δ > 0, ∀ε > 0 and n large enough, Pr{A_ε^(n)} ≥ 1 − δ;
2. ∀ε, Vol(A_ε^(n)) ≤ 2^{n(h(X,Y)+ε)} for all n;
3. ∀δ > 0, ∀ε > 0 and n large enough, Vol(A_ε^(n)) ≥ (1 − δ) 2^{n(h(X,Y)−ε)}.

Moreover, if X̃^n and Ỹ^n are drawn independently according to the marginals f_X and f_Y,

Pr{(X̃^n, Ỹ^n) ∈ A_ε^(n)} ≤ 2^{−n(I(X;Y)−3ε)},   (6.23)

Pr{(X̃^n, Ỹ^n) ∈ A_ε^(n)} ≥ (1 − δ) 2^{−n(I(X;Y)+3ε)}.   (6.24)
Proof. The proof is virtually identical to the proof of the AEP discrete the-
orem.
We are now ready to state and prove the coding theorem for the AWGN
channel, including both the direct and the converse part.
Theorem. For the Gaussian channel with power constraint P_X and noise power P_N, all rates below C are achievable, and

C = (1/2) log(1 + P_X/P_N)  bits per use of the channel.   (6.26)
Proof. The proof is organized in three parts: in the first part we formally derive expression (6.26) for the Gaussian channel capacity, while in the second and in the third part we prove respectively the achievability and the converse parts of the theorem.
where in (a) we exploited the fact that h(X + Z|X) = h(Z|X). We now look for a bound on h(Y). For simplicity, we force Y to be a zero mean random variable (the entropy does not change); in this way, the variance of Y is P_X + P_N. We know that, for a fixed variance, the Gaussian distribution yields the maximum value of the entropy. Hence, from (6.27),

h(Y) − (1/2) log 2πe P_N ≤ (1/2) log 2πe(P_X + P_N) − (1/2) log 2πe P_N = (1/2) log(1 + P_X/P_N).   (6.29)
• (Achievability)
We now pass to the proof of the direct implication of the theorem (stating
that any rate below C is achievable). As usual, we make use of the concepts
of random coding and jointly typical decoding.
We consider the n-th extension of the channel. For a fixed rate R, the first step is the generation of the codebook for the 2^{nR} indexes. Since, as in the discrete case, we will consider large values of n, we can generate the codewords i.i.d. according to a density function f_X(x) with variance P_X − ε, so as to ensure the fulfillment of the power constraint (according to the LLN, the signal power (1/n) Σ_{i=1}^n x_i²(1) tends to σ² as n → ∞). Let x^n(1), x^n(2), ..., x^n(2^{nR}) be the codewords and x^n(i) the generic codeword transmitted through the channel. The sequence y^n at the output of the channel is decoded at the receiver by using the same procedure described for discrete channel decoding; that is, we search for a codeword which is jointly typical with the received sequence and we declare it to be the transmitted codeword.
We now evaluate the error probability. Without any loss of generality we
assume that the codeword W = 1 was sent. Let us define the possible types
of error:
- violation of the power constraint (tx side):

E_0 = { x^n(1) : (1/n) Σ_{i=1}^n x_i²(1) > P_X };   (6.30)

- the received sequence is not jointly typical with the transmitted one:

E_1 = { (x^n(1), y^n) ∉ A_ε^(n) };   (6.31)
According to the code generation procedure used, the error probability averaged over all codewords and codes corresponds to the error probability for a generic transmitted codeword, say W = 1.
As n → ∞, by the law of large numbers and the joint AEP theorem (respectively) we know that P(E_0) (footnote 8) and P(E_1^c) tend to zero. Besides, we know that X^n(1) and X^n(i) for any i ≠ 1 are independent by construction; then, the joint AEP theorem provides an upper bound to the probability that X^n(i) and the output Y^n (= X^n(1) + Z^n) are jointly typical, which is 2^{−n(I(X;Y)−3ε)}. Going on from (6.34), for sufficiently large n we have
with ε1 and ε2 arbitrarily small. If R < I(X; Y), it is easy to see that we can choose a positive ε such that 3ε < I(X; Y) − R, thus yielding 2^{−n(I(X;Y)−R−3ε)} → 0 for n → ∞ and then an arbitrarily small error probability. So far we have considered the average error probability; we can repeat the same passages of Section 5.2.3 in order to prove that the maximal probability of error λ_max^{(n)} is arbitrarily small too. Therefore, any rate below I(X; Y), and then below C, is achievable.
• (Converse)
We now show that the capacity of the channel C is the supremum of all achievable rates. The proof differs from the one given for the discrete case. For any code satisfying the power constraint we show that if P_e^{(n)} → 0 then the rate R must be less than C. Let W be a r.v. uniformly distributed over the index set W = {1, 2, ..., 2^{nR}}. Since H(W) = nR we can write
8. We point out that, strictly speaking, a new version of the AEP theorem, accounting also for the constraint on the power, is needed.
nR ≤ I(W; Y^n) + nε_n
  (a)≤ I(Y^n; X^n) + nε_n
   = h(Y^n) − h(Y^n|X^n) + nε_n
  (b)≤ Σ_i h(Y_i) − h(Z^n) + nε_n
   = Σ_i (h(Y_i) − h(Z_i)) + nε_n,   (6.38)

and, bounding each term h(Y_i) with the entropy of a Gaussian of variance P_i + P_N,

nR ≤ Σ_i (1/2) log(1 + P_i/P_N) + nε_n.   (6.39)
9. It can be proven that Fano's inequality also holds if the variable under investigation is discrete and the conditioning variable is continuous, but not in the reverse case.
Dividing by n we obtain:

R < (1/n) Σ_{i=1}^n (1/2) log(1 + P_i/P_N) + ε_n.   (6.40)
Since the log is a concave function we can exploit the following property:

Property. For a concave function f,  f(Σ_{i=1}^n x_i/n) ≥ (1/n) Σ_{i=1}^n f(x_i).   (6.41)
Proof. The proof follows by induction. For n = 2 the relation is true, due to
the concavity of f . Supposing that relation (6.41) is true for n − 1, we have
to prove that it also holds for n.
We can write:

f(Σ_{i=1}^n x_i/n) = f( x_n/n + ((n−1)/n) · (1/(n−1)) Σ_{i=1}^{n−1} x_i ) ≥ (1/n) f(x_n) + ((n−1)/n) f( (1/(n−1)) Σ_{i=1}^{n−1} x_i ).   (6.42)
Given the two points x_n and (1/(n−1)) Σ_{i=1}^{n−1} x_i, inequality (6.42) follows from the concavity of the function f. By applying relation (6.41) to the second term of the sum in (6.42) (footnote 10) we obtain
f(Σ_{i=1}^n x_i/n) ≥ (1/n) f(x_n) + ((n−1)/n) · (1/(n−1)) Σ_{i=1}^{n−1} f(x_i) = (1/n) Σ_{i=1}^n f(x_i).   (6.43)
10. Remember that we made the assumption that relation (6.41) holds for n − 1.
where the expression in round brackets is the average power of the codeword x^n(w), which, averaged over all the codewords, is less than P_X. We eventually get the following upper bound for the rate:

R < (1/2) log(1 + P_X/P_N) + ε_n = C + ε_n.   (6.46)
This proves that for n → ∞, if P_e^{(n)} → 0, then necessarily R ≤ C.
where the ratio P_X/(N_0 W) is the SNR (Signal to Noise Ratio). This is Shannon's famous formula for the capacity of the additive white Gaussian noise (AWGN) channel.
Looking at expression (6.48), the basic factors which determine the value
of the channel capacity are the channel bandwidth W and the input signal
power PX . Increasing the input signal power obviously increases the channel
capacity. However, the presence of the logarithm makes this growth slow. If
we consider, instead, the channel bandwidth, which is the other parameter
we can actually set, we realize that an increase of W (enlargement of the bandwidth) has two contrasting effects. On one side, a larger bandwidth makes it possible to increase the transmission rate; on the other side, it causes a higher noise power at the receiver, thus reducing the capacity. While for small values of W it is easy to see that enlarging the bandwidth leads to an overall increase of the capacity, for large W we have (footnote 11):
lim_{W→∞} C = log e · P/N_0.   (6.49)
Then, by increasing only the bandwidth, we cannot increase the capacity
beyond (6.49).
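A small numerical sketch (assumed values for P and N_0; code and names are mine) showing the saturation of C(W) = W log2(1 + P/(N_0 W)) towards (P/N_0) log2 e:

```python
import numpy as np

P, N0 = 1e-3, 1e-9     # assumed signal power [W] and noise PSD [W/Hz]

def awgn_capacity(W):
    """Shannon capacity in bit/s for bandwidth W [Hz]."""
    return W * np.log2(1 + P / (N0 * W))

for W in (1e4, 1e5, 1e6, 1e7, 1e8):
    print(f"W = {W:9.0f} Hz  ->  C = {awgn_capacity(W):12.0f} bit/s")

print("limit (P/N0)*log2(e) =", P / N0 * np.log2(np.e), "bit/s")
```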
We now introduce Shannon’s capacity curve which shows the existence of a
tradeoff between power and bandwidth in any communication system.
Since in any practical reliable communication system we have R < C, the
following relation is satisfied:
R < W log(1 + P/(N_0 W)).   (6.50)
where r = R/W is the spectral efficiency, i.e. the number of bits per second
that can be transmitted in a bandwidth unit (Hertz). By observing that
P = Eb · R (we indicate by Eb the energy per transmitted bit) we get
r < log(1 + r · E_b/N_0).   (6.52)
[Figure 6.2: Shannon's capacity curve in the (E_b/N_0 [dB], r [dB]) plane: the rates above the curve are non-achievable, those below are achievable; the region r ≫ 1 corresponds to bandwidth-efficient transmissions, the region r ≪ 1 to power-efficient transmissions; the curve approaches the value ln 2 = −1.6 dB on the E_b/N_0 axis.]

Solving (6.52) with equality for E_b/N_0 gives the boundary of the achievable region:

E_b/N_0 = (2^r − 1)/r.   (6.53)
Then, we can evaluate the following limit values:

• r → ∞  ⇒  E_b/N_0 → ∞;

• r → 0  ⇒  E_b/N_0 → ln 2;

proving that the curve in Figure 6.2 has a vertical asymptote at E_b/N_0 = ln 2, and below this value no reliable transmission is possible (for any value of r).
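A short sketch (my own code) that evaluates the Shannon limit curve (6.53) and confirms the asymptote at ln 2 ≈ −1.59 dB as r → 0:

```python
import numpy as np

def ebn0_limit(r):
    """Minimum Eb/N0 (linear) for spectral efficiency r [bit/s/Hz], from (6.53)."""
    return (2.0 ** r - 1.0) / r

for r in (8.0, 2.0, 1.0, 0.1, 0.01, 1e-4):
    print(f"r = {r:8.4f}  ->  Eb/N0 = {10*np.log10(ebn0_limit(r)):7.3f} dB")

print("asymptote: 10*log10(ln 2) =", 10 * np.log10(np.log(2)), "dB")  # ~ -1.59 dB
```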
Clearly, the closer the working point is to the curve, the more efficient the communication system. All the communications whose main concern
11. We make use of the approximation ln(1 + P/(N_0 W)) ≈ P/(N_0 W), which holds for P/(N_0 W) ≪ 1.
is the limitation of the transmitted power lie in the area of the plane in which r ≪ 1 (the area of power-efficient transmission). We refer to these systems as power-limited systems. On the contrary, all the systems for which the bandwidth of the channel is small, referred to as bandwidth-limited systems, lie in the area where r ≫ 1 (the area of spectrally efficient transmission). Nevertheless, there is an unavoidable trade-off between power efficiency and bandwidth efficiency.
We now give some insights into how digital modulations are distributed with respect to Shannon's curve. From a theoretical point of view, the channel coding theorem asserts that it is possible to work with spectral efficiencies arbitrarily close to the curve with P_e = 0. In practice, classical modulation schemes always have a positive, although small, error probability and, despite this, they lie very far from the curve. Channel coding is what makes it possible to improve the performance of a system, moving the operative points closer to Shannon's capacity curve.
Below, we see some examples of digital modulations. In the case of power-limited systems, high-dimensional schemes are frequently used (e.g. M-FSK), which allow saving power at the expense of bandwidth. Conversely, in the case of bandwidth-limited systems, the goal is to save bandwidth, so low-dimensional modulation schemes (e.g. M-PSK) are often implemented (footnote 12).
• B-PSK
For a binary PSK (B-PSK or 2-PSK), the symbol error probability corresponds to the bit error probability and is given by (footnote 13)

P_e = Q(√(2E_b/N_0)).   (6.54)

For P_e = 10^{−4} we get E_b/N_0 = 8.5 dB (from the table of the Q function). Let T_s indicate the transmission time of a symbol. By using a B-PSK modulation scheme we transmit one bit per symbol, and then the per-symbol energy E_s corresponds to the per-bit energy E_b (E_s = E_b). Let W denote the bandwidth of the impulse of duration T_s, i.e. W = 1/T_s (footnote 14). Then r = R/W = (1/T_s)/(1/T_s) = 1.
12. In all the examples we consider an error probability P_e of about 10^{−4}.
13. The function Q gives the tail probability of the Gaussian distribution. More precisely, Q(x) denotes the probability that a normal (Gaussian) random variable N(µ, σ²) takes a value more than x standard deviations (σ) above the mean (µ).
14. Strictly speaking, a finite-duration impulse has an infinite bandwidth. Nevertheless, in digital modulation applications it is common to take the bandwidth as the frequency range which encompasses most of (but not all) the energy of the impulse. Indeed, the higher frequencies only contribute to the (exact) shape of the pulse, which, in such cases, is unnecessary.
[Figure 6.3: Location of the operative points of the classical modulation schemes (B-PSK, Q-PSK, 2-FSK, 4-FSK, and the behavior of M-FSK for M > 4) on Shannon's plane (E_b/N_0 in dB, with reference values −1.6, 0 and 8.5 dB).]
The corresponding operative point for the B-PSK scheme is shown in Figure
6.3.
According to Shannon's limit, the same rate could have been reached with E_b/N_0 = 0 dB, i.e. with a power saving of 8.5 dB (a numerical check of this figure is sketched below).
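A small check of the 8.5 dB figure (my own sketch; it uses scipy's Gaussian survival function as the Q function):

```python
import numpy as np
from scipy.stats import norm

# B-PSK: Pe = Q(sqrt(2*Eb/N0)).  Invert for Pe = 1e-4.
Pe = 1e-4
x = norm.isf(Pe)            # Q(x) = Pe  ->  x ~ 3.72
ebn0 = x**2 / 2.0           # since x = sqrt(2*Eb/N0)
print("Eb/N0 =", 10 * np.log10(ebn0), "dB")   # ~ 8.4 dB, in line with the 8.5 dB quoted above
```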
• Q-PSK
In the QPSK modulation the probability of symbol error is approximated by

P_e ≈ 2Q(√(2E_b/N_0)),   (6.55)
• M-PSK
From the general expression for the P_e of an M-PSK it follows that as M grows we need a larger E_b/N_0 (for the same value of P_e). Besides, the increase of the rate R with M is logarithmic (log₂ M bits are transmitted simultaneously), and then the general expression for the spectral efficiency is r = (log₂ M / T_s)/(1/T_s) = log₂ M. The approximate location of the operative points in the (E_b/N_0, r) plane is illustrated in Figure 6.3.
As mentioned previously, phase modulations (low-dimensionality modulations) permit saving bandwidth at the expense of power efficiency. Nevertheless, they remain far away from Shannon's curve.
• M-FSK
Given a couple of orthogonal signals with energy E_s, the probability of error is given by Q(√(E_s/N_0)). Considering M orthogonal signals, the union bound for the error probability yields

P_e ≤ (M − 1) Q(√(E_s/N_0)) = (M − 1) Q(√((E_b/N_0) log₂ M)).   (6.56)
Example.
Consider a situation in which we want to transmit 2 bits. Instead of using a 4-PSK for transmitting the two bits in 2T_b seconds, we can consider three orthogonal signals in the interval 2T_b, as depicted in Figure 6.4. The three orthogonal signals, ψ1, ψ2 and ψ3, constitute a basis for the three-dimensional space. We can then build four distinct waveforms to be associated to each

15. Remember that for a 2-FSK P_e = Q(√(E_b/N_0)), while for a 4-PSK P_e = Q(√(2E_b/N_0)).
[Figure 6.4: the three orthogonal waveforms ψ1, ψ2 and ψ3 defined on the interval [0, 2T_b].]
of the starting configurations of two bits. For instance, the four waveforms could be the ones depicted in Figure 6.5. In vector notation:
s1 = √E (1, 1, 1);
s2 = √E (1, −1, −1);
s3 = √E (−1, 1, −1);
s4 = √E (−1, −1, 1),
where the signal energy is E_s = 3E (footnote 16). The signal energy can be obtained from the bit energy as E_s = 2E_b (so E = (2/3)E_b).
We recall that the general approximation of the error probability as a function of the distance d between the transmitted signals is given by

P_e = Q(√(d²/(2N_0))).   (6.57)
Having increased the dimensionality of the system (from two to three), the above procedure allows us to take four signals more distant from each other than in the Q-PSK scheme. In fact, taking an arbitrary couple of vectors in the constellation, we have

d² = 8E = (16/3) E_b > 4E_b,   (6.58)
16. E indicates the energy of each pulse of duration T = (2/3)T_b composing the signal.
[Figure 6.5: Possible waveforms s1 (00), s2 (01), s3 (11) and s4 (10) that we can associate to the four configurations of two bits.]
where 4E_b is the minimum distance between the signals in the Q-PSK constellation. Hence:

P_e = Q(√((8/3)(E_b/N_0))) = Q(√((4/3)(2E_b/N_0))),   (6.59)

leading to a coding gain of 4/3 with respect to the Q-PSK scheme. Nevertheless, the signals contain pulses narrower than T_b (the pulse width is T = (2/3)T_b), and then they occupy a larger bandwidth (W = 3/(2T_b)). As a consequence, for this system we have r = 2/3 (footnote 17). Therefore, there is always a trade-off between power and bandwidth, but in this case the trade-off is more advantageous, as the following generalization of the above procedure clarifies.
Generalized procedure
What we have described above is nothing but a primitive form of coding. Let us now suppose that we aim at transmitting k bits. We can use a code C(k, n) in order to associate to any configuration of k bits another configuration of n bits (with n > k). In this way the constellation of 2^k points can be represented in the n-dimensional space, where each point lies

17. With a Q-PSK we would have r = 1.
[Figure 6.6: The role of coding in the (E_b/N_0, r) plane: coding moves the operative points of the M-PSK and M-FSK schemes closer to Shannon's curve.]
Chapter 7
Rate Distortion Theory
The source coding theorem states that a discrete source X can be encoded losslessly as long as R ≥ H(X). However, in many real applications, the presence of (a moderate amount of) reconstruction errors does not compromise the result of the transmission (or the storage); then, sometimes, it may be preferable to admit errors within certain limits, i.e. a quality loss, in order to increase compression efficiency. In other words, lossless compression removes the statistical redundancy, but there are other types of redundancy, e.g. psychovisual and psychoacoustic redundancy (depending on the application), that can be taken into account in order to increase the compression ratio. Think for instance of JPEG compression for still images!
In order to introduce a controlled reduction of quality we need to define a distortion measure, that is a measure of the distance between the random variable and its (lossy) representation. The basic problem tackled by rate distortion theory is determining the minimum expected distortion that must be tolerated in order to compress the source at a given rate.
Rate distortion theory is particularly suited to deal with continuous sources.
We know that in the continuous case lossless coding cannot be used, because a continuous source requires infinite precision to be represented exactly. Then, while for discrete sources rate distortion theory can be introduced as an additional (optional) tool for source coding, for continuous sources it is an essential tool for representing the source.
• Euclidean distance:

d(x, x̂) = (x − x̂)²;   (7.1)

• Hamming distance:

d(x, x̂) = x ⊕ x̂ = 0 if x = x̂,  1 if x ≠ x̂.   (7.2)
where the average is taken over all the alphabet symbols x and all the possible values of the reconstruction x̂. In the Euclidean case the distortion function is

D = E[(X − X̂)²],   (7.4)

i.e. the mean square error between the signal and its reconstruction. In the Hamming case the distortion function is

D = E[X ⊕ X̂] = P_e,   (7.5)
Definition. The rate distortion function R(D) gives the minimum number of bits (R_min) guaranteeing a reconstruction distortion E[d(X, X̂)] ≤ D.
The main theorem of rate distortion theory (Shannon, 1959), also known as the lossy coding theorem, is the following.
Definition. Let X be a discrete memoryless source with pmf p(x) and let X̂ be the reconstructed source with pmf p(x̂) (footnote 3). Let p(x, x̂) be the joint probability distribution.

2. Referring to continuous sources, R(0) is ∞.
3. For notational simplicity we omit the subscript in p_X(x) and p_X̂(x̂), since it is recoverable from the argument.
We define the distortion jointly typical set A_{d,ε}^(n) as follows:

A_{d,ε}^(n) = { (x^n, x̂^n) ∈ X^n × X̂^n :
  | −(1/n) log p(x^n) − H(X) | < ε,
  | −(1/n) log p(x̂^n) − H(X̂) | < ε,
  | −(1/n) log p(x^n, x̂^n) − H(X, X̂) | < ε,
  | d(x^n, x̂^n) − E[d(X, X̂)] | < ε },   (7.9)
Note that the difference from the previous definition of the jointly typical set resides only in the additional constraint, which expresses the typicality of the pairs of sequences with respect to distortion. Instead of a probability, the statistic involved in measuring this type of typicality is the distance between the random variables. Let us define d(x_i, x̂_i) = d_i and consider the corresponding random variable D_i, which is a function of the random variables X_i and X̂_i, i.e. D_i = d(X_i, X̂_i). By applying the law of large numbers, as n → ∞ the sample mean of the d_i tends to the ensemble average, that is

(1/n) Σ_{i=1}^n D_i → E[D]  in probability as n → ∞.   (7.10)

Then, the additional requirement regarding distortion does not limit much the number of sequences in A_{d,ε}^(n) with respect to the number of sequences in the jointly typical set, since for large n a sequence belongs to A_{d,ε}^(n) with probability arbitrarily close to 1.
• (Direct implication/Achievability)
To start with, let us fix p(x̂|x) (footnote 4). Knowing the marginal pmf p(x), from p(x̂|x) we can derive the joint pmf p(x̂, x) and then p(x̂). Fix also R. The proof of achievability proceeds along the following steps.

4. Chosen according to the constraint E[d(X, X̂)] ≤ D.
We now give an intuitive view of the implication in (7.11). From the initial choice of p(x̂, x) and from the definition of A_{d,ε}^(n), we argue that if the coder at the transmitter side has found a distortion jointly typical sequence, then the expected distortion is close to D. But the sequences x̂^n are drawn only according to the marginal distribution p(x̂) and not to the joint one. Therefore, we have to evaluate the probability that a pair of sequences (x^n, x̂^n), generated by the corresponding marginal distributions, is typical. According to the joint AEP theorem, Pr{(x^n, x̂^n) ∈ A_{d,ε}^(n)} ≈ 2^{−nI(X;X̂)}. Hence, we can evaluate the probability of finding at least one x̂^n which is distortion typical with x^n during the encoding procedure. We can hope to find such a sequence only if R > I(X; X̂). If this is not the case, the probability of finding a typical x̂^n tends to zero as n → ∞.
Now, we can exploit the degree of freedom we have on p(x̂|x) in order to determine the minimum rate at which reconstruction is possible under the fixed maximum distortion D. Hence:

R_min = R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X; X̂).
• (Reverse implication/Converse)
5. Note the correspondence with the channel capacity proof, in which we show that P_e → 0 if R < I(X; Y).
[Figure: quantization of the source space: each reconstruction sequence x̂^n represents the source sequences x^n falling in its region R_i (within distortion D).]
Note: in the rate distortion theorem we have considered the average distor-
tion E[d(X, X̂)]. Nevertheless, the same result holds by considering a stricter
distortion constraint.
Quantization
The encoding procedure amounts to finding, for each source sequence x^n, a reconstruction sequence x̂^n(i) such that (x^n, x̂^n(i)) ∈ A_{d,ε}^(n), i.e. to finding a point x̂^n in the neighborhood of x^n which satisfies the joint typicality property for large n. The quantization can be more or less coarse depending on the value of D. If the tolerable distortion is small the quantization must be fine (R large), while it can be made coarser (smaller R) as the amount of tolerable distortion increases. Indeed, looking at the figure, if D decreases we have to increase the number of reconstruction sequences x̂^n so that at least one of them falls inside each region with high probability. Figure 7.2 schematically represents the lossy coding procedure.
It is easy to argue that in the discrete source case (x^n ∈ X^n) the same procedure leads to a further quantization of an already quantized signal, but nothing conceptually changes. We stress again that in this case, as opposed to the continuous source case, lossless coding is possible. However, rate distortion theory can be applied whenever we prefer to decrease the coding rate at the price of introducing an acceptable distortion.
As happened for the proofs of the Source and Channel Coding Theorems, the proof of the Rate Distortion Theorem does not indicate a practical coding strategy. Therefore, we have to face the problem of finding the optimum set of points {x̂^n} to represent the source for finite n. To this purpose, it is easy to guess that knowing the type of source helps, and it should then be taken into account in order to make the rate close to the theoretical value R(D). This problem will be addressed in Section 7.2.
We now compute the rate distortion function R(D) for some common
sources.
Bernoulli source
p(1) = p;
p(0) = 1 − p.
Case 1: D > p
Let us take X̂ ≡ 0. This choice allows us to achieve the lower bound for the mutual information, i.e. I(X; X̂) = 0. We have to check whether the constraint is satisfied. It is easy to argue that it is, since E[X ⊕ 0] = p(x = 1) = p < D.
Note that this solution is also suggested by intuition: a reconstruction with
an error less than or equal to a value (D) greater than p is trivially obtained
by encoding every sequence as a zero sequence.
Case 2: D ≤ p

6. Notice that the notation h(D) is correct since D = E[d_H] ≤ 1.
[Figure 7.3: Joint distribution between X̂ and X given by the binary symmetric channel with crossover probability D (test channel): P(X̂ = 1) = r, P(X̂ = 0) = 1 − r at the input; P(X = 1) = p, P(X = 0) = 1 − p at the output.]
where (a) follows from the fact that x = x̂ ⊕ (x ⊕ x̂), while (b) is obtained by observing that X ⊕ X̂ is itself a binary source with Pr{X ⊕ X̂ = 1} = E[X ⊕ X̂]. Now, since the binary entropy h(r) grows with r (for r < 1/2) and E[X ⊕ X̂] is not larger than D, we have h(E[X ⊕ X̂]) ≤ h(D), and then, continuing from (7.14), we get

I(X; X̂) ≥ h(p) − h(D).   (7.15)
At this point we know that R(D) ≥ h(p) − h(D). Let us show that a conditional probability distribution p(x̂|x) attaining this value exists. For establishing a relation between the two binary random variables X and X̂, that is for determining a joint distribution, we can refer to the binary symmetric channel (BSC) with ε = D, see Figure 7.3. Let us determine the input distribution of the channel X̂ so that the output X has the given distribution (with the fixed p(x)). Let r = Pr(X̂ = 1). Then, we require that

r(1 − D) + (1 − r)D = p,   (7.16)

that is

r = (p − D)/(1 − 2D).   (7.17)

For D ≤ p the choice p(x̂ = 1) = (p − D)/(1 − 2D) is a valid probability distribution and attains the bound, so that R(D) = h(p) − h(D).
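A minimal sketch (my own code) of the Bernoulli rate distortion function R(D) = h(p) − h(D) for D ≤ p (here p ≤ 1/2 is assumed), and 0 otherwise:

```python
import numpy as np

def h(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def rate_distortion_bernoulli(p, D):
    """R(D) for a Bernoulli(p) source, p <= 1/2, under Hamming distortion."""
    return h(p) - h(D) if D <= p else 0.0

p = 0.3
for D in (0.0, 0.05, 0.1, 0.2, 0.3, 0.4):
    print(f"D = {D:.2f}  ->  R(D) = {rate_distortion_bernoulli(p, D):.4f} bits")
```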
Gaussian source
Let X be a Gaussian source, X ∼ N(0, σ_x²). For this type of source it is reasonable to adopt the Euclidean distance (squared error distortion). The rate distortion function is given by

R(D) = (1/2) log(σ_x²/D)  if D ≤ σ_x²;   R(D) = 0  if D > σ_x².   (7.18)
Case 1: D > σ_x²
We can take X̂ ≡ 0. With this choice the average error we make is the variance of the random variable X which, being less than D, satisfies the constraint: E[X²] = σ_x² ≤ D (footnote 7). Besides, this choice attains the absolute minimum for I(X; X̂), that is I(X; X̂) = 0. Then, R(D) = 0.

7. We remind that for any random variable Z the relation σ_z² = E[Z²] − µ_z² holds.

Case 2: D ≤ σ_x²
where (a) derives from the fact that (X − X̂) is a random variable whose
[Figure 7.4: Joint distribution between X̂ and X given by the AWGN test channel: X̂ ∼ N(0, σ_x² − D), N ∼ N(0, D), X = X̂ + N ∼ N(0, σ_x²).]
variance is surely less than the mean square error E[(X − X̂)2 ] and then, the
entropy of a Gaussian random variable with variance E[(X − X̂)2 ] gives an
upper bound for h(X − X̂) (principle of the maximum entropy).
We now have to find a distribution f (x̂|x) that attains the lower bound for
I. As before, it is easier to look at the reverse conditional probability f (x|x̂)
as the transitional probability of a channel and choose it in such a way that
the distribution of the channel output x is the desired one. Then, from the
knowledge of f (x) and f (x̂) we derive f (x̂|x). Let us consider the relation
between X̂ and X depicted in Figure 7.4 (test channel ), i.e. we assume that
the difference between X and its reconstruction X̂ is an additive Gaussian
noise N . It is easy to check that this choice:
1. satisfies the distortion constraint; indeed
[Figure 7.5: rate distortion curve R(D) of a Gaussian source; the curve reaches R = 0 at D = σ_x².]
Figure 7.5 depicts the rate distortion curve for a Gaussian source. The curve partitions the plane into two regions: for a given D, only the rates lying above the curve are achievable. For D → 0 we fall back into lossless source coding, and then R → ∞ (the entropy of a continuous random variable). If instead the reconstruction distortion is larger than σ_x², there is no need to transmit any bit (R = 0).
For the Gaussian source we can express the distortion in terms of the rate by inverting R(D), obtaining

D(R) = σ_x² · 2^{−2R}.

Given the number of bits we are willing to spend for describing the source, D(R) provides the minimum distortion we must tolerate in the reconstruction (Figure 7.6). Obviously, the condition D = 0 is achievable only asymptotically.
Let us evaluate the signal to noise ratio associated to the rate distortion:

SNR = σ_x²/D = 2^{2R}  →  SNR_dB = 6R.   (7.24)

For any bit we add, the SNR increases by 6 dB.
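A brief sketch (my own code, assumed variance) of the Gaussian D(R) and the 6 dB/bit rule:

```python
import numpy as np

sigma2 = 4.0   # assumed source variance

def distortion(R):
    """Distortion-rate function of a Gaussian source, D(R) = sigma^2 * 2^(-2R)."""
    return sigma2 * 2.0 ** (-2 * R)

for R in range(0, 5):
    D = distortion(R)
    snr_db = 10 * np.log10(sigma2 / D)
    print(f"R = {R}  ->  D = {D:.4f},  SNR = {snr_db:.2f} dB")   # ~ 6 dB per bit
```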
Note: it is possible to prove that, like the differential entropy for the Gaussian
source, the rate distortion function for the Gaussian source is larger than
the rate distortion function for any other continuous source with the same
variance. This means that, for a fixed D, the Gaussian source gives the
maximum R(D). This is a valuable result because for many sources the
computation of the rate distortion function is very difficult. In these cases,
[Figure 7.6: Distortion rate curve D(R) for a Gaussian source; achievable distortions lie above the curve, which starts from σ_x² at R = 0. For fixed R, the amount of distortion introduced cannot be less than the value of the curve at that point.]
8. According to this expression, given the reconstruction x̂_i, the symbol x_i is conditionally independent of the other reconstructions.
We have then found an f(x^M|x̂^M) such that I(X^M; X̂^M) = Σ_{i=1}^M (1/2) [log(σ_i²/D_i)]⁺. Now we remember that in our problem the distortion values D_i, i = 1, ..., M, provide an additional degree of freedom we can exploit. Hence, from (7.30) the final minimum is obtained by varying D_i, i = 1, ..., M, that is:

R(D) = min_{D_i: Σ_i D_i = D} Σ_{i=1}^M (1/2) [log(σ_i²/D_i)]⁺.   (7.31)
9. For nonlinear optimization problems, the Karush-Kuhn-Tucker (KKT) conditions are necessary conditions that a solution has to satisfy for being optimal. In some cases, the KKT conditions are also sufficient for optimality; this happens when the objective function is convex and the feasible set is convex too (convex inequality constraints and linear equality constraints). The system of equations corresponding to the KKT conditions is usually solved numerically, except in the few special cases where a closed-form solution can be derived analytically. In the minimization problem considered here, the KKT conditions are necessary and sufficient for optimality. Besides, we will be able to solve them analytically.
D_i = λ  if λ < σ_i²;   D_i = σ_i²  if λ ≥ σ_i²,   (7.35)

where λ satisfies Σ_{i=1}^M D_i = D.
The method described is a kind of reverse water-filling and is graphically illustrated in Figure 7.7.
[Figure 7.7: Reverse water-filling procedure for independent Gaussian random variables: the level λ is drawn across the variances σ_1², ..., σ_M²; components with σ_i² above λ get D_i = λ, the others get D_i = σ_i².]
with {D_i}_{i=1}^M satisfying (7.35).
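A minimal sketch (my own code) of the reverse water-filling rule (7.35): find λ by bisection so that Σ_i min(λ, σ_i²) = D, then compute the total rate.

```python
import numpy as np

def reverse_waterfilling(variances, D, iters=100):
    """Return (D_i allocation, total rate in bits) for independent Gaussian components."""
    var = np.asarray(variances, dtype=float)
    assert 0 < D <= var.sum(), "D must lie in (0, sum of variances]"
    lo, hi = 0.0, var.max()
    for _ in range(iters):                      # bisection on the water level lambda
        lam = 0.5 * (lo + hi)
        if np.minimum(lam, var).sum() > D:
            hi = lam
        else:
            lo = lam
    Di = np.minimum(lam, var)
    rate = 0.5 * np.log2(var / Di).clip(min=0).sum()
    return Di, rate

variances = [4.0, 2.0, 1.0, 0.25]   # assumed component variances
Di, R = reverse_waterfilling(variances, D=1.5)
print("D_i =", np.round(Di, 4), " total R =", round(R, 4), "bits")
```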
[Figure: lossy coding chain: the source sequence x^n is quantized (Q, lossy) into x̂^n, and the quantization indexes are then compressed by a lossless entropy coder; the reconstruction levels may also be arranged according to a vectorial scheme.]
Quantization
Let x^n ∈ R^n denote an n-long vector of source outputs. In scalar quantization each single source output x_i is quantized into a number of levels which are later encoded into a binary sequence. In vector quantization, instead, the entire n-length vector of outputs is seen as a unique symbol to be quantized. A vector quantization scheme allows much greater flexibility than the scalar one, at the price of increased complexity. Let us see an example.
Suppose that we want to encode with a rate R. Assume for simplicity n = 2. Figure 7.9 shows how, through the scalar procedure, the reconstruction (quantized) levels in the R² space are constrained to lie on a regular rectangular lattice, and the only degrees of freedom are the quantization steps along the two axes. In general, we have n·2^R steps to set. Through the vector procedure, instead, we directly work in the R² space and we can

10. In the rate distortion theorem entropy coding is not necessary. Shannon proves that the distortion jointly typical encoding scheme reaches R(D), which is the minimum. Then, it is as if the reconstructed sequences came out according to a uniform distribution.
put the reconstruction vectors wherever we want (e.g. the blue star). In general we have 2^{nR} points to set. Nevertheless, because of the lack of regularity, all the 2^{nR} points must be listed and, for any output vector x^n, all the 2^{nR} distances must be computed to find the closest reconstruction point, with a complexity which increases exponentially with n.
Uniform quantizer
The uniform quantizer is the simplest type of quantizer. In a uniform quantizer the spacings between the reconstruction points are all equal, that is the reconstruction levels are spaced evenly. Consequently, the decision boundaries are also spaced evenly and all the intervals have the same size, except for the outer intervals. Then, a uniform quantizer is completely defined by the following parameters:
- number of levels: m = 2^R;
- quantization step: ∆.
Assuming that the source pdf is centered in the origin, we have the two quantization schemes depicted in Figure 7.10, depending on the value of m (odd or even). With a uniform scheme, once the number of levels and the quantization step are fixed, the reconstruction levels and the decision boundaries are uniquely defined.
Figure 7.11 illustrates the quantization function Q : R → C (where C is a countable subset of R). For odd m, the quantizer is called a midthread quantizer, since the axes cross the step of the quantization function in the middle of the tread. Similarly, when m is even we have a so-called midrise quantizer (the crossing occurs in the middle of the rise).
In the sequel, we study the design of a uniform quantizer first for a source
having a uniform distribution, and later for a non uniformly distributed
source.
[Figure 7.10: uniform quantization schemes on R: reconstruction levels and decision boundaries with step ∆, for (a) m odd and (b) m even.]
• X uniform in [-A, A].
It is easy to argue that for this type of distribution the uniform quantizer is the most appropriate choice. Given m (the number of reconstruction levels), we want to design the value of the parameter ∆ which minimizes the distortion D, that is the quantization error/noise. Since the distribution is confined to the interval [−A, A], we deduce that ∆ = 2A/m. The distortion (mean square quantization error) has the expression:

D = E[(X − X̂)²] = ∫_R (x − Q(x))² f_X(x) dx = Σ_{i=1}^m ∫_{R_i} (x − x̂_i)² f_X(x) dx
  (a)= Σ_{i=1}^m (1/2A) ∫_{R_i} (x − x̂_i)² dx = (m/2A) ∫_{−∆/2}^{∆/2} x² dx = (1/∆) ∫_{−∆/2}^{∆/2} x² dx = ∆²/12,   (7.37)

where each element i of the sum in (a) is the moment of inertia centered on x̂_i. In this case, it is easy to see that the indexes obtained through the quantization are equiprobable, so that no gain can be obtained by entropy coding.
[Figure 7.11: the quantization function Q(x), a staircase with step ∆.]
The resulting signal-to-noise ratio is

SNR = (A²/3)/(∆²/12) = m² = 2^{2R}.   (7.38)
• Non uniform X.
In this situation there are some ranges of values in which it is more
probable to have an observation. Then, we would like to increase the
density of the reconstruction levels in the more populated zone of the
distribution and use a sparser allocation in the other regions.
This is not possible by using a uniform quantizer (the only parameter
we can design is the spacing ∆). The question is then how to design
the constant spacing ∆ in order to achieve the minimum reconstruction
error.
Let us suppose m to be even (similar arguments hold if m is odd). We
must compute:
D = Σ_{i=1}^m ∫_{R_i} (x − x̂_i)² f_X(x) dx
  = 2 [ Σ_{i=1}^{m/2−1} ∫_{(i−1)∆}^{i∆} (x − i∆ + ∆/2)² f_X(x) dx   (granular noise)
      + ∫_{(m/2−1)∆}^{∞} (x − (m/2)∆ + ∆/2)² f_X(x) dx ].           (overload noise)
The quantizer output saturates whenever the input exceeds the supported range: the second term of the sum is the corresponding clipping error, known as overload noise. Conversely, the error made in quantizing the values in the supported range is referred to as granular noise and is given by the first term of the sum. It is easy to see that by decreasing ∆ the contribution of the granular noise decreases, while the overload noise increases. Vice versa, if we increase ∆, the overload noise decreases at the price of a higher granular noise. The choice of ∆ is then a tradeoff between these two types of noise, and designing the quantizer corresponds to finding the proper balance. Obviously, this balancing will depend on the to-be-quantized distribution, and specifically on the relative weight of its tails with respect to its central part.
Example. Numerical values of the optimal ∆ can be computed for a Gaussian distribution with σ² = 1; a small numerical sketch is given below.
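A possible way to obtain such values numerically (my own sketch, Monte Carlo plus grid search; the exact figures depend on m):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500_000)     # unit-variance Gaussian samples

def mse_uniform_midrise(samples, m, delta):
    """MSE of a uniform mid-rise quantizer with m levels and step delta (with clipping)."""
    idx = np.clip(np.floor(samples / delta), -m // 2, m // 2 - 1)
    xhat = (idx + 0.5) * delta
    return np.mean((samples - xhat) ** 2)

for m in (2, 4, 8, 16):
    deltas = np.linspace(0.01, 4.0, 400)
    errors = [mse_uniform_midrise(x, m, d) for d in deltas]
    best = int(np.argmin(errors))
    print(f"m = {m:2d}:  optimal delta ~ {deltas[best]:.2f},  D ~ {errors[best]:.4f}")
```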
Note: it is worth noting that using a uniform quantizer for non-uniform distributions implies that the encoded output symbols (indexes) are not equiprobable; in this case, therefore, a gain can be obtained by means of entropy coding.
[Figure 7.12: quantization of a non-uniform pdf f_X(x): (top) uniform quantizer with step ∆; (bottom) non-uniform quantizer with reconstruction levels x̂_i and decision boundaries a_i.]
Again, suppose that the source is centered in the origin. With respect to
the uniform quantizer, a nonuniform quantizer gives the designer much more
freedom. Given the number of reconstruction levels m, the non uniform
quantizer is defined by setting:
- reconstruction levels: x̂i , i = 1, ..., m;
- decision boundaries: ai , i = 0, ..., m (where a0 = −∞ and am = +∞).
Hence, we have 2m − 1 parameters (degrees of freedom) to set in such a way that the quantization error is minimized. It is easy to guess that, as in the example in Figure 7.12 (bottom), for non-uniform sources the optimum decision regions will in general have different sizes.
Max-Lloyd quantizer

∂D/∂a_j = ∂/∂a_j Σ_{i=1}^m ∫_{a_{i−1}}^{a_i} (x − x̂_i)² f_X(x) dx
        = ∂/∂a_j ∫_{a_{j−1}}^{a_j} (x − x̂_j)² f_X(x) dx + ∂/∂a_j ∫_{a_j}^{a_{j+1}} (x − x̂_{j+1})² f_X(x) dx.   (7.40)

∂D/∂a_j = (a_j − x̂_j)² f_X(a_j) − (a_j − x̂_{j+1})² f_X(a_j).   (7.42)

where we threw away the multiplicative constant f_X(a_j) (f_X(a_j) ≠ 0). By exploiting the relation a² − b² = (a + b)(a − b), after easy algebraic
manipulation we get:

a_j = (x̂_j + x̂_{j+1})/2,  ∀j.   (7.45)
Then, each decision boundary must be the midpoint of the two neighboring
reconstruction levels. As a consequence, the reconstruction levels will not lie
in the middle of the regions/intervals.
Let us now pass to the derivative with respect to the reconstruction levels. We have to compute ∂D/∂x̂_j = 0 for j = 1, ..., m.

∂D/∂x̂_j = ∂/∂x̂_j Σ_{i=1}^m ∫_{a_{i−1}}^{a_i} (x − x̂_i)² f_X(x) dx
         = ∂/∂x̂_j ∫_{a_{j−1}}^{a_j} (x − x̂_j)² f_X(x) dx
         = ∫_{a_{j−1}}^{a_j} ∂/∂x̂_j (x − x̂_j)² f_X(x) dx.   (7.46)
and then

x̂_j = ( ∫_{a_{j−1}}^{a_j} x f_X(x) dx ) / ( ∫_{a_{j−1}}^{a_j} f_X(x) dx )   ∀j = 1, ..., m.   (7.48)

By observing that f_X(x | x ∈ R_j) = f_X(x) / ∫_{a_{j−1}}^{a_j} f_X(x) dx, we have

x̂_j = ∫_{a_{j−1}}^{a_j} x f_X(x | x ∈ R_j) dx = E[X | X ∈ R_j].   (7.49)

Then, the output point for each quantization interval is the centroid of the probability density function in that interval.
To sum up, we have found the optimum boundaries (a_i)_{i=1}^{m−1} expressed as a function of the reconstruction levels (x̂_i)_{i=1}^m and, in turn, the optimum reconstruction levels (x̂_i)_{i=1}^m as a function of (a_i)_{i=1}^{m−1}. Therefore, in order to find the decision boundaries and the reconstruction levels we have to employ an iterative procedure (Max-Lloyd algorithm): at first we choose the m reconstruction levels at random (or using some heuristic); then we calculate the boundaries and update the levels by computing the centroids:

a_j = (x̂_j + x̂_{j+1})/2,   j = 1, ..., m − 1
x̂_j = E[X | X ∈ R_j],      j = 1, ..., m.   (7.50)
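A compact sketch of the Max-Lloyd iteration (my own sample-based implementation: centroids and conditional expectations are estimated from data rather than from integrals of f_X):

```python
import numpy as np

def max_lloyd(samples, m, iters=50):
    """Sample-based Max-Lloyd quantizer design: returns (levels, boundaries)."""
    x = np.sort(np.asarray(samples, dtype=float))
    # heuristic initialization: m empirical quantiles as reconstruction levels
    levels = np.quantile(x, (np.arange(m) + 0.5) / m)
    for _ in range(iters):
        levels = np.sort(levels)
        boundaries = 0.5 * (levels[:-1] + levels[1:])   # a_j = (x_j + x_{j+1}) / 2
        idx = np.searchsorted(boundaries, x)            # region of each sample
        for j in range(m):                              # x_j = E[X | X in R_j]
            if np.any(idx == j):
                levels[j] = x[idx == j].mean()
    return levels, boundaries

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200_000)
levels, bounds = max_lloyd(samples, m=4)
print("levels:", np.round(levels, 3))   # ~ [-1.51, -0.45, 0.45, 1.51] for a unit Gaussian
```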
Entropy-constrained quantizer
If the quantization indexes are entropy coded, the rate is no longer log₂ m but the entropy of the quantizer output,

R = H(Q) = −Σ_{i=1}^m P_i log P_i,

where P_i is the probability that the input falls in the i-th quantization bin, that is

P_i = ∫_{a_{i−1}}^{a_i} f_X(x) dx,   (7.52)

where a_{i−1} and a_i are the decision boundaries of the i-th decision region. Hence, for a fixed rate R, the optimum decision boundaries a_i and reconstruction levels x̂_i can be obtained through the minimization of the distortion D subject to the constraint R = H(Q). Such a quantizer is called an entropy-constrained quantizer (ECQ), and it is the best nonuniform scalar quantizer. However, the minimization with the additional constraint on H(Q) is much more complex and must be solved numerically.
[Figure: quantization of the square [−A, A]²: (a) regular lattice (scalar quantization); (b) triangular tessellation (example of vector quantization).]
where each term i of the sum is the central moment of inertia (c.m.i.) of the region R_i. Since x̂⃗_i (footnote 14) is the central point of each region R_i and the regions all have the same shape and size, the contribution of each term of the sum is the same (the c.m.i. is translation invariant). Then we have

D = (m/4A²) I_2,   (7.54)

where I_2 is the central moment of inertia of a square having area 4A²/16.
• Vector quantization
If we use a vector scheme we have much more freedom in the definition of the quantization regions, since they are no longer constrained to a rigid lattice structure as in the scalar case. Dealing with uniform distributions,

14. which here is a pair (x̂_i^1, x̂_i^2).
Note: by considering the limit case, that is when the length of the block n approaches ∞, the problem addressed by rate distortion theory can be seen as a problem of 'sphere covering'. The sphere covering problem consists in finding the minimum number of spheres through which it is possible to cover a given space while satisfying a condition on the maximum reconstruction distortion (maximum radius of the spheres).
[Figure: boundary effect in the vector quantization of the square [−A, A]².]
sources, and especially of sources having peaked and heavy-tailed distributions. The reason is that, using vector quantization, the freedom in the choice of the reconstruction levels makes it possible to better exploit the greater concentration of the probability distribution in some areas with respect to others, thus reducing the quantization error.
We now consider an example of a source with memory and show that in this case the gain derived from using vector quantization is much larger. We stress that the rate distortion theorem has been proved for the memoryless case. In the case of dependent sources the theorem should be rephrased by considering the entropy rate in place of the entropy H. It is possible to prove that the theoretical limit value for the rate distortion R(D) is lower than that of the memoryless case. This is not a surprise: for reconstructing the source with a given fixed distortion D, the number of information bits required is smaller if we can exploit the dependence between subsequent outputs (correlation). Nevertheless, this is possible only by means of vector quantization schemes.
[Figure 7.15: Example of two correlated random variables X and Y with joint density f_XY = 1/(ab) on a region of dimensions a × b. The vector quantization (star tessellation, blue) is necessary, since any scalar scheme (cross tessellation, green) leads to an unsatisfactory distribution of the reconstruction points.]
The system defines the updating of the decision regions and reconstruction levels for the LBG algorithm.

16. This is the meaning of placing the boundary points at the midpoints between the reconstruction levels.
For solving the problem of the estimation of f_X⃗(x⃗), Linde, Buzo and Gray propose to design the vector quantizer by using a clustering procedure, specifically the K-means algorithm. Given a large training set of output vectors x⃗_i from the source and an initial set of k reconstruction vectors x̂⃗_j, j = 1, ..., k, we can partition the points (x⃗_i) instead of splitting the space R^n. We define each cluster C_j as follows:

C_j = { x⃗_i : ||x⃗_i − x̂⃗_j|| ≤ ||x⃗_i − x̂⃗_l||  ∀l ≠ j }.

In this way, once the clusters are defined, we can update the reconstruction vectors by simply evaluating the mean value of the points inside each cluster (without having to compute any integral!). Then, the new levels are

x̂⃗_j = (1/|C_j|) Σ_{i: x⃗_i ∈ C_j} x⃗_i,  ∀j = 1, ..., k,   (7.59)

and the distortion within each cluster is

D_j = (1/|C_j|) Σ_{i: x⃗_i ∈ C_j} ||x⃗_i − x̂⃗_j||,  ∀j.   (7.60)
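A minimal sketch of the LBG design loop (my own implementation of the K-means-style iteration described above; the toy source is an assumption for the example):

```python
import numpy as np

def lbg(train, k, iters=30, seed=0):
    """Design a k-point vector quantizer from training vectors (rows of `train`)."""
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), size=k, replace=False)].copy()
    for _ in range(iters):
        # nearest-neighbour partition (clusters C_j)
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # centroid update (7.59)
        for j in range(k):
            members = train[labels == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook, labels

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])        # strongly correlated 2-D Gaussian source
train = rng.multivariate_normal([0, 0], cov, size=20_000)
codebook, labels = lbg(train, k=8)
print(np.round(codebook, 2))
```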
Transform-based Coding
µ⃗_y = 0. The covariance matrix is C_y = A C_x Aᵀ.
[Figure 7.17: Variability of the couple of random variables before and after the decorrelation takes place. (a) Point cloud of x⃗ = (x1, x2), with large variances (σ1², σ2²) along the original axes. (b) Location of the points after the decorrelator (axes x'1, x'2): the variance of the random variables is greatly reduced, as well as their correlation (which is approximately zero).]
At this point, we can apply scalar quantization to the resulting source Y⃗, incurring only the small gain loss we have in the memoryless case (discussed in Section 7.2.3).
Figure 7.17 illustrates the effect of the transformation for the case of two dependent Gaussian random variables X1 and X2. The point cloud in Figure 7.17(a) describes the density of the vectors X⃗ = (X1, X2), according to the bivariate Gaussian distribution. By looking at each variable separately, the ranges of variability are large (σ_{x1}² and σ_{x2}² are large). Therefore, as discussed in the previous sections, directly applying scalar quantization to this source implies a waste of resources, yielding a high rate distortion R(D) (footnote 18). Applying the transformation A to X⃗ corresponds to rotating the axes in such a way that the variability ranges of the variables are reduced (minimized), see Figure 7.17(b). It is already evident here (for n = 2) that decorrelating the variables corresponds to compacting the energy (footnote 19). At the output of the transformation block we have independent Gaussian variables and then R(D) = Σ_{i=1}^n R(D_i). At this point, we know that the best way of distributing the distortion among the variables is given by the reverse water-filling procedure, which allocates the bits to the random variables depending on their variance.
Given a source X⃗, the optimum transform for decorrelating the source samples is the Karhunen-Loève Transform (KLT). The KLT is precisely the matrix A which diagonalizes the covariance matrix C_x. However, this implies that the KLT depends on the statistical properties of the source X⃗, and the covariance matrix C_x must be estimated in order to compute the KLT. Furthermore, for non-stationary sources the estimation procedure must be periodically repeated. As a consequence, computing the KLT is computationally very expensive. In practice, we need transforms that (although suboptimum) do not depend on the statistical properties of the data, but have a fixed analytical expression. The DCT (Discrete Cosine Transform) is one of the most popular transforms employed in place of the KLT. For highly correlated Gaussian Markov sources, it has been theoretically shown that the DCT behaves like the KLT.
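A short sketch (my own code, assumed covariance values) of KLT-style decorrelation via the eigen-decomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
Cx = np.array([[4.0, 1.9],
               [1.9, 1.0]])                       # assumed source covariance
X = rng.multivariate_normal([0, 0], Cx, size=100_000)

C_est = np.cov(X, rowvar=False)                    # estimate Cx from the data
eigvals, eigvecs = np.linalg.eigh(C_est)           # KLT basis: eigenvectors of Cx
A = eigvecs.T                                      # rows of A are the analysis vectors
Y = X @ A.T                                        # transformed (decorrelated) samples

print(np.round(np.cov(Y, rowvar=False), 3))        # ~ diagonal: correlation removed
print("component variances:", np.round(eigvals, 3))
```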
Predictive Coding
When the elements of the vector X⃗ are highly correlated (large amount of memory among the components), consecutive symbols will have similar values. Then, a possible approach to perform decorrelation is by means of 'prediction'. Using the output at time instant n − 1, i.e. X_{n−1}, as the prediction for the output at the subsequent time instant n (Zero Order Prediction), we can transmit only the 'novel' information brought by the output X_n, that is:

D_n = X_n − X_{n−1}.   (7.66)

In this way, the quantizer Q works on the symbols d_n (d_n = x_n − x_{n−1}) (footnote 21).

20. The source of an image is approximately a Markov source with high enough correlation.
21. The symbols d_n can be seen as the output (at time n) of a new source D⃗ with reduced memory.
Proof.

ρ = cov(XY)/(σ_X σ_Y) = E[XY]/(σ_X σ_Y).   (7.68)

Due to the high correlation between X_n and X_{n−1}, ρ is close to 1 and then from (7.67) it follows that σ_d² ≪ σ_x².

• ρ_d ≪ ρ: the correlation (memory) between the new variables D_n is less than the correlation among the original source outputs X_n (as an example, if the source is a first-order Markov process, the symbols d_n obtained according to (7.66) are completely decorrelated).

Then, working on the symbols d_n (instead of on x_n), the loss incurred by using scalar quantization is much smaller.
However, it must be pointed out that the impact of the quantization of the symbols d_n on the original symbols x_n is different with respect to the case in which we directly quantize x_n. In detail, the problem incurred by considering the differences in (7.66) is described in the following.

⇒ Transmitter side:

d_2 = x_2 − x_1 →(Q) d̂_2 = d_2 + q_2,   (7.69)

d_3 = x_3 − x_2 →(Q) d̂_3 = d_3 + q_3.   (7.71)

⇒ Receiver side: by computing the prediction at the transmitter on the reconstructed samples x̂_{n−1} (closed-loop prediction), at the receiver we obtain

x̂_n = x_n + q_n,   (7.76)

thus avoiding the accumulation of quantization errors. Figure 7.18 illustrates the closed-loop encoding scheme and the corresponding decoding procedure.
Note: we have considered the simplest type of predictor, i.e. the Zero Order predictor.
[Figure 7.18: Predictive coding scheme. Closed-loop encoder (on the left) and decoder (on the right); both use a one-sample delay (Z⁻¹) on the reconstructed value x̂_{n−1}.]
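A minimal sketch (my own code) of the closed-loop predictive scheme with a zero-order predictor and a uniform quantizer, showing that the reconstruction error stays bounded by the single-step quantization error:

```python
import numpy as np

def quantize(d, step):
    """Uniform quantizer applied to the prediction residual."""
    return step * np.round(d / step)

def dpcm_encode_decode(x, step):
    """Closed-loop zero-order prediction: the encoder predicts from x_hat, not from x."""
    x_hat = np.zeros_like(x)
    prev = 0.0                              # x_hat_{n-1}, shared by encoder and decoder
    for n, xn in enumerate(x):
        d_hat = quantize(xn - prev, step)   # transmitted (quantized) difference
        x_hat[n] = prev + d_hat             # decoder reconstruction
        prev = x_hat[n]                     # closed loop: predict from the reconstruction
    return x_hat

rng = np.random.default_rng(0)
# toy highly-correlated source (assumed for the example)
x = np.cumsum(rng.normal(0, 0.1, size=10_000))
step = 0.05
x_hat = dpcm_encode_decode(x, step)
print("max |x - x_hat| =", np.abs(x - x_hat).max(), "<= step/2 =", step / 2)
```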