
Lecture Notes

on

Information Theory

Univ.-Prof. Dr. rer. nat. Rudolf Mathar


RWTH Aachen University
Institute for Theoretical Information Technology
Kopernikusstr. 16, 52074 Aachen, Germany
Contents

1 Introduction

2 Fundamentals of Information Theory
2.1 Preliminary Definitions
2.2 Inequalities
2.3 Information Measures for Random Sequences
2.4 Asymptotic Equipartition Property (AEP)
2.5 Differential Entropy

3 Source Coding
3.1 Variable Length Encoding
3.2 Prefix Codes
3.3 Kraft-McMillan Theorem
3.4 Average Code Word Length
3.5 Noiseless Coding Theorem
3.6 Compact Codes
3.7 Huffman Coding
3.8 Block Codes for Stationary Sources
3.8.1 Huffman Block Coding
3.9 Arithmetic Coding

4 Information Channels
4.1 Discrete Channel Model
4.2 Channel Capacity
4.3 Binary Channels
4.3.1 Binary Symmetric Channel (BSC)
4.3.2 Binary Asymmetric Channel (BAC)
4.3.3 Binary Z-Channel (BZC)
4.3.4 Binary Asymmetric Erasure Channel (BAEC)
4.4 Channel Coding
4.5 Decoding Rules
4.6 Error Probabilities
4.7 Discrete Memoryless Channel
4.8 The Noisy Coding Theorem
4.9 Converse of the Noisy Coding Theorem
1 Introduction
According to Merriam-Webster.com, "Information is any entity or form that provides the answer to a question of some kind or resolves uncertainty. It is thus related to data and knowledge, as data represents values attributed to parameters, and knowledge signifies understanding of real things or abstract concepts."
However, modern information theory is not a theory which deals with the above on general grounds. Instead, information theory is a mathematical theory to model and analyze how information is transferred. Its starting point is an article by Claude E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, 1948.

Figure 1.1: Claude Elwood Shannon (1916 – 2001)

Quoting from the introduction of this article provides insight into the main focus of information
theory: ”The fundamental problem of communication is that of reproducing at one point either
exactly or approximately a message selected at another point. Frequently the messages have
meaning. . . . These semantic aspects of communications are irrelevant to the engineering problem.
. . . The system must be designed to operate for each possible selection, not just the one which will
actually be chosen since this is unknown at the time of design.”
Later, in 1964, a book by Claude E. Shannon and Warren Weaver with a slightly modified title, "The Mathematical Theory of Communication", appeared at University of Illinois Press, emphasizing the


Figure 1.2: The general model of a communication system (block diagram: source → source encoder → channel encoder → modulator → analog channel with noise → demodulator → channel decoder → source decoder → destination; a channel estimation block feeds the demodulator).

generality of this work.


Information theory provides methods and analytical tools to design such systems. The basic components of a communication system are shown in Fig. 1.2. Although the starting point of information theory was in electrical engineering and communications, the theory turned out to be useful for modeling phenomena in a variety of fields, particularly in physics, mathematics, statistics, computer science and economics. It cannot be regarded merely as a subset of communication theory; it is much more general. In recent years, its concepts were applied and even further developed in biological information processing, machine learning and data science.
This lecture focuses on the latter. We first provide the basic concepts of information theory and
prove some main theorems in communications, which refer to source coding, channel coding and
the concept of channel capacity. This will be followed by the relation between rate distortion
theory and autoencoders. Biological information processing will be modeled and analyzed by the
concept of mutual information. This will finally lead to artificial neural networks and contributions
of information theory to understanding how such networks learn in the training phase.
2 Fundamentals of Information Theory

2.1 Preliminary Definitions

In this chapter we provide basic concepts of information theory like entropy, mutual information and the Kullback-Leibler divergence. We also prove fundamental properties and some important inequalities between these quantities. Only discrete random variables (r.v.) will be considered, denoted by capitals X, Y and Z and having only finite sets of possible values to attain, the so called support. Only the distribution of the r.v. will be relevant for what follows. Discrete distributions can be characterized by stochastic vectors, denoted by

p = (p_1, ..., p_m),   p_i ≥ 0,   Σ_i p_i = 1.

For intuitively motivating a measure of uncertainty consider the following two random experi-
ments with four outcomes and corresponding probabilities

p = (0.7, 0.1, 0.1, 0.1)


q = (0.25, 0.25, 0.25, 0.25)

Certainly the result of the second experiment will be more uncertain than of the first one. On the
other hand, having observed the outcome of the second experiment provides more information
about the situation. In this sense, we treat information and uncertainty as equivalently describing
the same phenomenon.
Now, an appropriate measure of uncertainty was introduced by Shannon in his 1948 paper. He did
this axiomatically, essentially requiring three properties of such a measure and then necessarily
deriving entropy as introduced below.
We start by requesting that the information content of some event E shall only depend on its
probability p = P (E). Furthermore the information content is measured by some function h :
[0, 1] → R satisfying the following axioms.

(i) h is continuous on [0, 1]


(ii) h(p · q) = h(p) + h(q)
(iii) there is some constant c > 1 such that h(1/c) = 1

The first axiom (i) requires that a small change in p results in a small change of the measure
of its information content. Number (ii) says that for two independent events E1 and E2 with
probabilities p and q respectively the intersection of both, i.e., the event that both occur at the

7
8 CHAPTER 2. FUNDAMENTALS OF INFORMATION THEORY

same time, has information content h(p) + h(q). The information content shall hence be additive
for independent events. Finally, by (iii) a certain normalization is fixed.
Now, if (i),(ii), and (iii) are satisfied by some information measure h then necessarily

h(p) = − logc (p), p ∈ [0, 1].

A convenient description is achieved by introducing a discrete random variable X with finite support X = {x_1, ..., x_m} and distribution P(X = x_i) = p_i, i = 1, ..., m, p_i ≥ 0, Σ_i p_i = 1. Entropy is then defined as the average information content of the events {X = x_i}.

Definition 2.1. Let c > 1 be fixed.


H(X) = − Σ_i p_i log_c p_i = − Σ_i P(X = x_i) log_c P(X = x_i)

is called entropy of X or of p = (p1 , . . . , pm ).

Remark 2.2.
a) H(X) depends only on the distribution of X, not on the specific support.
b) If pi = 0 for some i, we set pi log pi = 0. This follows easily from continuity.
c) The base of the logarithm will be omitted in the following. After the base has been chosen,
it is considered to be fixed and constant throughout.
d) Let p(x) denote the probability mass function (pmf), also called discrete density of X, i.e.,

p : X → [0, 1] : xi 7→ p(xi ) = pi .

Then H(X) may be written as

H(X) = E[ log( 1/p(X) ) ],

the expectation of the r.v. log(1/p(X)).

Example 2.3. a) Let X ∼ Bin(1, p), i.e., P(X = 0) = 1 − p and P(X = 1) = p, p ∈ [0, 1]. Then

H(X) = −p log p − (1 − p) log(1 − p).

b) Let X ∼ U({1, ..., m}), i.e., P(X = i) = 1/m for all i = 1, ..., m. Then

H(X) = − Σ_{i=1}^m (1/m) log(1/m) = − log(1/m) = log m.

Particularly if m = 26, the size of the Latin alphabet, then H(X) = log2 26 = 4.7004

c) Consider the frequency of characters in written English regarded as precise estimates of


corresponding probabilities.
character A B C ··· Y Z
probability 0.08167 0.01492 0.02782 ··· 0.01974 0.00074
It holds that

H(X) = −0.08167 log2 0.08167 − · · · − 0.00074 log2 0.00074 = 4.219 < 4.7004.

The last inequality holds in general as will be clarified later.
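As a small illustration (not part of the original notes), the following Python sketch computes the entropy of the distributions used above; the function name entropy is chosen here only for this example.

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_i p_i * log_base(p_i), with 0*log 0 := 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357 bits, the "less uncertain" experiment
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4), the uniform case
print(entropy([1/26] * 26))               # ~4.7004 bits = log2(26)
```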

The extension of the above definition to two-dimensional or even higher dimensional random
vectors and conditional distributions is obvious.

Let (X, Y) be a discrete random vector with support X × Y = {x_1, ..., x_m} × {y_1, ..., y_d} and distribution P(X = x_i, Y = y_j) = p_ij, p_ij ≥ 0, Σ_{i,j} p_ij = 1.
Definition 2.4. a)

H(X, Y) = − Σ_{i,j} P(X = x_i, Y = y_j) log P(X = x_i, Y = y_j) = − Σ_{i,j} p_ij log p_ij

is called the joint entropy of (X, Y).

b)

H(X | Y) = − Σ_j P(Y = y_j) Σ_i P(X = x_i | Y = y_j) log P(X = x_i | Y = y_j)
         = − Σ_{i,j} P(X = x_i, Y = y_j) log P(X = x_i | Y = y_j)

is called the conditional entropy or equivocation of X given Y.


Theorem 2.5. (Chain rule)

H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

Proof. Denote by p(x_i), p(x_i, y_j) and p(y_j | x_i) the corresponding probability mass functions. It holds that

H(X, Y) = − Σ_{i,j} p(x_i, y_j) [ log p(x_i, y_j) − log p(x_i) + log p(x_i) ]
        = − Σ_{i,j} p(x_i, y_j) log p(y_j | x_i) − Σ_i ( Σ_j p(x_i, y_j) ) log p(x_i)
        = H(Y | X) + H(X),

using Σ_j p(x_i, y_j) = p(x_i).

The second equality is shown analogously by interchanging the roles of X and Y .



Theorem 2.6. (Jensen's inequality) If f is a convex function and X is a random variable, then

E f(X) ≥ f(E X).   (∗)

(∗) holds for any random variable (discrete, absolutely continuous, or other) as long as the expectation is defined. For a discrete random variable with distribution (p_1, ..., p_m), (∗) reads as

Σ_{i=1}^m p_i f(x_i) ≥ f( Σ_{i=1}^m p_i x_i )   for all x_1, ..., x_m ∈ dom(f).

Proof. We prove this for discrete distributions by induction on the number of mass points. The proof of conditions for equality when f is strictly convex is left to the reader. For a two-mass-point distribution, the inequality becomes

p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),

which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with k − 1 mass points. Then writing p'_i = p_i / (1 − p_k) for i = 1, ..., k − 1, we have

Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p'_i f(x_i)
                     ≥ p_k f(x_k) + (1 − p_k) f( Σ_{i=1}^{k−1} p'_i x_i )
                     ≥ f( p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p'_i x_i )
                     = f( Σ_{i=1}^k p_i x_i ),

where the first inequality follows from the induction hypothesis and the second follows from
the definition of convexity. The proof can be extended to continuous distributions by continuity
arguments.

Prior to showing relations between the entropy concepts we consider some important inequalities.

Theorem 2.7. (log-sum inequality) Let a_i, b_i ≥ 0, i = 1, ..., m. Then

Σ_i a_i log(a_i / b_i) ≥ ( Σ_i a_i ) log( Σ_j a_j / Σ_j b_j ),

with equality if and only if a_i / b_i = constant.
We use the conventions 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0 (by continuity).

Proof. The function f(t) = t log t, t ≥ 0, is strictly convex, since f''(t) = 1/t > 0 for t > 0. Assume without loss of generality that a_i, b_i > 0.
By convexity of f:

Σ_{i=1}^m α_i f(t_i) ≥ f( Σ_{i=1}^m α_i t_i ),   α_i ≥ 0,   Σ_{i=1}^m α_i = 1.

Setting α_i = b_i / Σ_j b_j and t_i = a_i / b_i, it follows

Σ_i (b_i / Σ_j b_j)(a_i / b_i) log(a_i / b_i) ≥ ( Σ_i (b_i / Σ_j b_j)(a_i / b_i) ) log( Σ_i (b_i / Σ_j b_j)(a_i / b_i) )

⇔ (1 / Σ_j b_j) Σ_i a_i log(a_i / b_i) ≥ (1 / Σ_j b_j) Σ_i a_i log( Σ_i a_i / Σ_j b_j )

⇔ Σ_i a_i log(a_i / b_i) ≥ Σ_i a_i log( Σ_j a_j / Σ_j b_j ).

Corollary 2.8. Let p = (p_1, ..., p_m), q = (q_1, ..., q_m) be stochastic vectors, i.e., p_i, q_i ≥ 0, Σ_i p_i = Σ_i q_i = 1. Then

− Σ_{i=1}^m p_i log p_i ≤ − Σ_{i=1}^m p_i log q_i,

with equality if and only if p = q.

Proof. In Theorem 2.7, set a_i = p_i, b_i = q_i and note that Σ_i p_i = Σ_i q_i = 1.

Theorem 2.9. Let X, Y, Z be discrete random variables as above.

a) 0 ≤ H(X) ≤ log m. Equality on the left holds iff X has a one-point distribution, i.e., ∃ x_i : P(X = x_i) = 1. Equality on the right holds iff X is uniformly distributed, i.e., P(X = x_i) = 1/m for all i = 1, ..., m.
b) 0 ≤ H(X|Y) ≤ H(X). Equality on the left holds iff P(X = x_i | Y = y_j) = 1 for all (i, j) with P(X = x_i, Y = y_j) > 0, i.e., X is totally dependent on Y. Equality on the right holds iff X and Y are stochastically independent.
c) H(X) ≤ H(X, Y) ≤ H(X) + H(Y). Equality on the left holds iff Y is totally dependent on X. Equality on the right holds iff X and Y are stochastically independent.
d) H(X|Y, Z) ≤ min{H(X|Y), H(X|Z)}.

Proof. a) The left inequality 0 ≤ H(X) holds by definition. For the right inequality,

H(X) = − Σ_{i=1}^m p_i log p_i = Σ_{i=1}^m p_i log(1/p_i)
     ≤ log( Σ_{i=1}^m p_i · (1/p_i) )   (Jensen's inequality, log concave)
     = log m.

b) Equality on the left holds iff H(X|Y) = 0, that is, if X is a deterministic function of Y. Similarly, equality on the right holds iff H(X) = H(X|Y). Since H(X|Y) = H(X) − I(X; Y) (see Definition 2.10 below), this holds iff I(X; Y) = 0, that is, X and Y are statistically independent.

c) (left) By the chain rule, Theorem 2.5, H(X, Y) = H(X) + H(Y|X) ≥ H(X), since H(Y|X) ≥ 0, with equality as in b). (right) From b), 0 ≤ H(X) − H(X|Y). Using the chain rule, Theorem 2.5, we can write H(X) − H(X|Y) = H(X) − [H(X, Y) − H(Y)]. Hence H(X) + H(Y) ≥ H(X, Y), with equality as in b).

d) H(X|Y, Z) = H(X|Z) − I(X; Y|Z) ≤ H(X|Z), since I(X; Y|Z) ≥ 0. The same argument gives H(X|Y, Z) ≤ H(X|Y).

Definition 2.10. Let X, Y, Z be discrete random variables

I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X)

is called mutual information of X and Y .

I(X; Y | Z) = H(X | Z) − H(X | Y, Z)

is called conditional mutual information of X and Y given Z.

Interpretation: I(X; Y ) is the reduction in uncertainty about X when Y is given or the amount
of information about X provided by Y .

Relation between entropy and mutual information (see Fig. 2.1):

I(X; Y) = H(X) − H(X|Y)
H(X|Y) = H(X, Y) − H(Y)
I(X; Y) = H(X) + H(Y) − H(X, Y)

Figure 2.1: Relation between entropy and mutual information (Venn diagram: H(X, Y) is the union of H(X) and H(Y), split into H(X|Y), I(X; Y) and H(Y|X)).

Note: By Theorem 2.9 b), we know I(X; Y) ≥ 0.

By definition it holds that

I(X; Y) = − Σ_i p(x_i) log p(x_i) + Σ_{i,j} p(x_i, y_j) log p(x_i | y_j)
        = − Σ_{i,j} p(x_i, y_j) log p(x_i) + Σ_{i,j} p(x_i, y_j) log p(x_i | y_j)
        = Σ_{i,j} p(x_i, y_j) log [ p(x_i | y_j) / p(x_i) ]
        = Σ_{i,j} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ],

which shows symmetry in X and Y.


Example 2.11. (Binary symmetric channel, BSC)
Let the symbol error probability be ε, 0 ≤ ε ≤ 1. Then

P(Y = 0|X = 0) = P(Y = 1|X = 1) = 1 − ε,
P(Y = 0|X = 1) = P(Y = 1|X = 0) = ε.

Assume that the input symbols are uniformly distributed, P(X = 0) = P(X = 1) = 1/2. Then for the joint distribution: P(X = 0, Y = 0) = P(Y = 0|X = 0) P(X = 0) = (1 − ε) · 1/2, and so on, that is

Figure 2.2: Binary symmetric channel (X → Y; each input symbol is received correctly with probability 1 − ε and flipped with probability ε).

Joint distribution P(X = x, Y = y):

          Y = 0        Y = 1
X = 0     (1 − ε)/2    ε/2          1/2
X = 1     ε/2          (1 − ε)/2    1/2
          1/2          1/2

Further,

P(X = 0|Y = 0) = P(X = 0, Y = 0) / P(Y = 0) = 1 − ε
P(X = 1|Y = 1) = 1 − ε
P(X = 0|Y = 1) = P(X = 1|Y = 0) = ε

For log = log_2:

H(X) = H(Y) = − (1/2) log(1/2) − (1/2) log(1/2) = 1 bit
H(X, Y) = 1 − (1 − ε) log(1 − ε) − ε log ε
H(X | Y) = H(Y | X) = −(1 − ε) log(1 − ε) − ε log ε
0 ≤ I(X; Y) = 1 + (1 − ε) log(1 − ε) + ε log ε ≤ 1
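A short Python sketch (added here for illustration, not from the notes) reproduces these BSC quantities from the joint distribution; the value of ε is an arbitrary example.

```python
import math

def H(dist):
    """Entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

eps = 0.1                                   # illustrative symbol error probability
p_joint = [(1 - eps) / 2, eps / 2,          # P(X=0,Y=0), P(X=0,Y=1)
           eps / 2, (1 - eps) / 2]          # P(X=1,Y=0), P(X=1,Y=1)

H_X  = H([0.5, 0.5])                        # = 1 bit (uniform input)
H_Y  = H([0.5, 0.5])                        # output is also uniform here
H_XY = H(p_joint)                           # = 1 + H2(eps)
I_XY = H_X + H_Y - H_XY                     # = 1 + (1-eps)log2(1-eps) + eps*log2(eps)
print(H_XY, I_XY)
```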

Definition 2.12. (Kullback-Leibler divergence, KL divergence)

Let p = (p_1, ..., p_n), q = (q_1, ..., q_n) be stochastic vectors. Then

D(p‖q) = Σ_{i=1}^n p_i log(p_i / q_i)

is called the KL divergence between p and q (or relative entropy).

D(p‖q) measures the divergence (distance, dissimilarity) between the distributions p and q. However, it is not a metric: it is neither symmetric nor does it satisfy the triangle inequality. It measures how difficult it is for p to pretend to be q.
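The following small Python sketch (an illustration added here, with a hypothetical function name) computes D(p‖q) and shows the asymmetry on the two distributions from the motivating example.

```python
import math

def kl_divergence(p, q, base=2):
    """D(p || q) = sum_i p_i * log(p_i / q_i); requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(p, q), kl_divergence(q, p))   # the two values differ: D is not symmetric
```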

Theorem 2.13. (Relative entropy)


a) D(p‖q) ≥ 0 with equality iff p = q.

b) D(p‖q) is convex in the pair (p, q).

c) I(X; Y) = D( (p(x_i, y_j))_{i,j} ‖ (p(x_i) p(y_j))_{i,j} ) ≥ 0.

Proof. a) Immediate by definition and Corollary 2.8.

b) Use the log-sum inequality 2.7. Let p, r and q, s be stochastic vectors. For all i = 1, ..., n it holds that

(λp_i + (1 − λ)r_i) log [ (λp_i + (1 − λ)r_i) / (λq_i + (1 − λ)s_i) ] ≤ λp_i log(λp_i / λq_i) + (1 − λ)r_i log( (1 − λ)r_i / (1 − λ)s_i ).

Summing over i = 1, ..., n it follows, for all λ ∈ [0, 1],

D(λp + (1 − λ)r ‖ λq + (1 − λ)s) ≤ λ D(p‖q) + (1 − λ) D(r‖s).

c) By definition.

Note: D(p‖q) ≠ D(q‖p) in general.

Lemma 2.14. For any distributions p, q with support X = {x_1, ..., x_m} and stochastic matrix W = (p(y_j | x_i))_{i,j} ∈ R^{m×d},

D(p‖q) ≥ D(pW ‖ qW).

Proof. Let w_1, ..., w_d be the columns of W, that is, W = (w_1, ..., w_d). Using the log-sum inequality 2.7, Σ_i a_i log(a_i/b_i) ≥ (Σ_i a_i) log(Σ_j a_j / Σ_j b_j), with a_i = p(x_i) p(y_j | x_i) and b_i = q(x_i) p(y_j | x_i) for each fixed j:

D(p‖q) = Σ_{i=1}^m p(x_i) log [ p(x_i) / q(x_i) ]
       = Σ_{i=1}^m Σ_{j=1}^d p(x_i) p(y_j | x_i) log [ p(x_i) p(y_j | x_i) / ( q(x_i) p(y_j | x_i) ) ]
       ≥ Σ_{j=1}^d (p w_j) log [ (p w_j) / (q w_j) ]
       = D(pW ‖ qW).

Theorem 2.15. H(p) is a concave function of p = (p1 , ....., pm ).



Proof. Let u = (1/m, ..., 1/m) be the uniform distribution. Then D(p‖u) = Σ_{i=1}^m p_i log( p_i / (1/m) ) = log m − H(p), i.e., H(p) = log m − D(p‖u). By Theorem 2.13 b),

D(λp + (1 − λ)q ‖ λu + (1 − λ)u) ≤ λ D(p‖u) + (1 − λ) D(q‖u),

i.e., D(·‖u) is convex in p. Thus H(p) = log m − D(p‖u) is a concave function of p.

2.2 Inequalities

Definition 2.16. Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the joint probability mass function (discrete density) satisfies

p(x, y, z) = p(x) p(y | x) p(z | y).

For X → Y → Z, the conditional distribution of Z depends only on Y and is conditionally


independent of X.
Lemma 2.17. a) If X → Y → Z, then p(x, z | y) = p(x | y) p(z | y).
b) If X → Y → Z, then Z → Y → X.
c) If Z = f(Y), then X → Y → Z.

Proof. a)

p(x, z | y) = p(x, z, y) / p(y) = p(x) p(y | x) p(z | y) / p(y) = p(x, y) p(z | y) / p(y) = p(x | y) p(z | y).

b) p(x, y, z) = p(x) p(y | x) p(z | y) = p(x, y) · [ p(z, y) / (p(y) p(z)) ] · p(z) = p(z) p(y | z) p(x | y), i.e., Z → Y → X.

c) If Z = f(Y), then p(x, y, z) = p(x, y) if z = f(y) and 0 otherwise, hence p(x, y, z) = p(x, y) 1(z = f(y)) = p(x) p(y | x) p(z | y) with

p(z | y) = 1 if z = f(y), and 0 otherwise.

Theorem 2.18. (Data-processing inequality) If X → Y → Z, then I(X; Z) ≤ min{I(X; Y), I(Y; Z)}:
"No processing of Y can increase the information that Y contains about X".

Proof. By the chain rule,

I(X; Y, Z) = I(X; Z) + I(X; Y | Z) = I(X; Y) + I(X; Z | Y).

Since X and Z are conditionally independent given Y, we have I(X; Z | Y) = 0. Since I(X; Y | Z) ≥ 0, we get

I(X; Y) ≥ I(X; Z).

Equality holds iff I(X; Y | Z) = 0, i.e., X → Z → Y. I(X; Z) ≤ I(Y; Z) is shown analogously.

Theorem 2.19. (Fano inequality) Assume X, Y are random variables with the same support X = {x_1, ..., x_m}. Define P_e = P(X ≠ Y), the "error probability". Then

H(X|Y) ≤ H(P_e) + P_e log(m − 1).

Since H(P_e) ≤ log 2, this implies P_e ≥ [ H(X|Y) − log 2 ] / log(m − 1).

Proof. We know:
1) H(X|Y) = Σ_{x≠y} p(x, y) log(1/p(x|y)) + Σ_x p(x, x) log(1/p(x|x))
2) P_e log(m − 1) = Σ_{x≠y} p(x, y) log(m − 1)
3) H(P_e) = −P_e log P_e − (1 − P_e) log(1 − P_e)
4) ln t ≤ t − 1, t ≥ 0.
Using this we obtain

H(X|Y) − P_e log(m − 1) − H(P_e)
= Σ_{x≠y} p(x, y) log [ P_e / ( p(x|y)(m − 1) ) ] + Σ_x p(x, x) log [ (1 − P_e) / p(x|x) ]
≤ (log e) [ Σ_{x≠y} p(x, y) ( P_e / ((m − 1) p(x|y)) − 1 ) + Σ_x p(x, x) ( (1 − P_e) / p(x|x) − 1 ) ]
= (log e) [ (P_e / (m − 1)) Σ_{x≠y} p(y) − Σ_{x≠y} p(x, y) + (1 − P_e) Σ_x p(x) − Σ_x p(x, x) ]
= (log e) [ P_e − P_e + (1 − P_e) − (1 − P_e) ] = 0.

Lemma 2.20. If X and Y are i.i.d. random variables with entropy H(X), then

P(X = Y) ≥ 2^{−H(X)}.

Proof. Let p(x) denote the p.m.f. of X. Use Jensen's inequality: f(t) = 2^t is a convex function. Hence, with Z = log p(X), we obtain E(2^Z) ≥ 2^{E(Z)}, that is,

2^{−H(X)} = 2^{E(log p(X))} ≤ E( 2^{log p(X)} )
          = Σ_x p(x) 2^{log p(x)}
          = Σ_x p^2(x)
          = P(X = Y),

where the last equality uses that X and Y are i.i.d.

2.3 Information Measures for Random Sequences

Consider sequences of random variables X_1, X_2, ..., denoted as X = {X_n}_{n∈N}. A naive approach to define the entropy of X is

H(X) = lim_{n→∞} H(X_1, ..., X_n).

In most cases this limit will be infinite. Instead consider the entropy rate.

Definition 2.21. Let X = {X_n}_{n∈N} be a sequence of discrete random variables.

H_∞(X) = lim_{n→∞} (1/n) H(X_1, ..., X_n)

is called the entropy rate of X, provided the limit exists. H_∞(X) may be interpreted as the average uncertainty per symbol.

Example 2.22. a) Let X = {X_n}_{n∈N} be an i.i.d. sequence with H(X_i) < ∞. Then

H_∞(X) = lim_{n→∞} (1/n) H(X_1, ..., X_n) = lim_{n→∞} (1/n) Σ_{i=1}^n H(X_i) = H(X_1).

b) Let {(X_n, Y_n)}_{n∈N} be an i.i.d. sequence of pairs with I(X_k; Y_k) < ∞. Then

I_∞(X; Y) = lim_{n→∞} (1/n) I(X_1, ..., X_n; Y_1, ..., Y_n)
          = lim_{n→∞} (1/n) Σ_{k=1}^n I(X_k; Y_k)
          = I(X_1; Y_1).

Going beyond i.i.d. sequences, let us introduce the following definition.

Definition 2.23. A sequence of random variables X = {X_n}_{n∈N} is called (strongly) stationary if

p(X_{i_1}, ..., X_{i_k}) = p(X_{i_1 + t}, ..., X_{i_k + t})

for all 1 ≤ i_1 < ... < i_k and t ∈ N.

• The joint distribution of any finite selection of random variables from {X_n} is invariant w.r.t. shifts.
• An equivalent condition for discrete random variables with support X is:

P(X_1 = s_1, ..., X_n = s_n) = P(X_{1+t} = s_1, ..., X_{n+t} = s_n)

for all s_1, ..., s_n ∈ X, n ∈ N, t ∈ N. For stationary sequences all marginal distributions P(X_i) are the same.

Theorem 2.24. Let X = {X_n}_{n∈N} be a stationary sequence. Then

a) H(X_n | X_1, ..., X_{n−1}) is monotonically decreasing in n,
b) H(X_n | X_1, ..., X_{n−1}) ≤ (1/n) H(X_1, ..., X_n),
c) (1/n) H(X_1, ..., X_n) is monotonically decreasing in n,
d) lim_{n→∞} H(X_n | X_1, ..., X_{n−1}) = lim_{n→∞} (1/n) H(X_1, ..., X_n) = H_∞(X).

Definition 2.25.

a) X = {X_n}_{n∈N_0} is called a Markov chain (MC) with state space X = {s_1, ..., s_m} if

P(X_n = s_n | X_{n−1} = s_{n−1}, ..., X_0 = s_0) = P(X_n = s_n | X_{n−1} = s_{n−1})

for all s_0, ..., s_n ∈ X and n ∈ N.

b) It is called homogeneous if the transition probabilities P(X_n = s_n | X_{n−1} = s_{n−1}) are independent of n.
c) p(0) = (p_1(0), ..., p_m(0)) ∼ X_0 is called the initial distribution.
d) Π = (p_ij)_{1≤i,j≤m} = (P(X_n = j | X_{n−1} = i))_{1≤i,j≤m} is called the transition matrix.
e) p = (p_1, ..., p_m) is called stationary if pΠ = p.

Lemma 2.26. Let X = {X_n}_{n∈N_0} be a stationary homogeneous MC. Then

H_∞(X) = − Σ_{i,j} p_i(0) p_ij log p_ij.

Proof. By Theorem 2.24,

H_∞(X) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_0)
       = lim_{n→∞} H(X_1 | X_0)
       = − Σ_i p_i(0) Σ_j p_ij log p_ij
       = − Σ_{i,j} p_i(0) p_ij log p_ij.

Remark: A homogeneous MC is stationary if p(0)Π = p(0), i.e., if the initial distribution is a so called stationary distribution.
Example 2.27. (2-state homogeneous MC)
Two states: X = {0, 1}. Transition matrix

Π = ( 1−α    α  )
    (  β    1−β ),   0 ≤ α, β ≤ 1.

Compute a stationary distribution p = (p_1, p_2) by solving pΠ = p.

Figure 2.3: Transition graph (two states 0 and 1 with crossover probabilities α and β).

Solution: p* = ( β/(α+β), α/(α+β) ).
Choose p(0) = p*. Then X = {X_n}_{n∈N_0} is a stationary MC with H(X_n) = H( β/(α+β), α/(α+β) ). However,

H_∞(X) = H(X_1 | X_0) = [β/(α+β)] H(α) + [α/(α+β)] H(β),

where H(α) = H(α, 1−α) denotes the binary entropy.
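As an illustration (not part of the notes), the following Python sketch evaluates the stationary distribution, the marginal entropy and the entropy rate of the 2-state chain for arbitrarily chosen α, β.

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.2, 0.4                                     # illustrative transition probabilities
p_star = (beta / (alpha + beta), alpha / (alpha + beta))   # stationary distribution

H_marginal = H2(p_star[0])                                 # H(X_n) under stationarity
H_rate = p_star[0] * H2(alpha) + p_star[1] * H2(beta)      # H_inf(X) = H(X_1 | X_0)
print(H_marginal, H_rate)                                  # entropy rate <= marginal entropy
```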
Example 2.28. (Random walk on a weighted graph) Consider an undirected weighted graph with nodes {1, ..., m} and edges with weights w_ij ≥ 0, w_ji = w_ij; no edge between i and j means w_ij = 0.
The random walk on the graph, X = {X_n}_{n∈N_0}, is a MC with support X = {1, ..., m} and

P(X_{n+1} = j | X_n = i) = w_ij / Σ_{k=1}^m w_ik = p_ij,   1 ≤ i, j ≤ m.

Stationary distribution (we guess it and then verify that it is indeed stationary):

p_i* = Σ_j w_ij / Σ_{i,j} w_ij = w_i / w,   p* = (p_1*, ..., p_m*).

Figure 2.4: Random walk as a weighted graph

Assume that the random walk starts at time 0 with the stationary distribution p_i(0) = p_i*, i = 1, ..., m. Then X = {X_n}_{n∈N_0} is a stationary sequence (MC) and

H_∞(X) = H(X_1 | X_0)
       = − Σ_i p_i* Σ_j p_ij log p_ij
       = − Σ_i (w_i / w) Σ_j (w_ij / w_i) log(w_ij / w_i)
       = − Σ_{i,j} (w_ij / w) log(w_ij / w_i)
       = − Σ_{i,j} (w_ij / w) log(w_ij / w) + Σ_{i,j} (w_ij / w) log(w_i / w)
       = H( (w_ij / w)_{i,j} ) − H( (w_i / w)_i ).

If all edges have equal weight, then

p_i* = E_i / (2E),

where E_i is the number of edges emanating from node i and E is the total number of edges. In this case

H_∞(X) = log(2E) − H( E_1/(2E), ..., E_m/(2E) ).

H_∞(X) depends only on the entropy of the stationary distribution and the total number of edges.

2.4 Asymptotic Equipartition Property (AEP)

In information theory, the AEP is the analog of the law of large numbers (LLN).

LLN: Let {X_i} be i.i.d. r.v.s, X_i ∼ X. Then

(1/n) Σ_{i=1}^n X_i → E(X)   almost everywhere (and in probability) as n → ∞.

AEP: Let X_i be discrete i.i.d. with joint pmf p^(n)(x_1, ..., x_n). Then

(1/n) log [ 1 / p^(n)(X_1, ..., X_n) ]   is "close to" H(X) as n → ∞.

Thus

p^(n)(X_1, ..., X_n)   is "close to" 2^{−nH(X)} as n → ∞.

"Close to" must be made precise.
Consequence: existence of the typical set, containing sequences with sample entropy close to the true entropy, and the non-typical set, which contains the other sequences.
Definition 2.29. A sequence of random variables X_n is said to converge to a random variable X
(i) in probability if ∀ε > 0: P(|X_n − X| > ε) → 0 as n → ∞,
(ii) in mean square if E((X_n − X)^2) → 0 as n → ∞,
(iii) with probability 1 (or almost everywhere) if P(lim_{n→∞} X_n = X) = 1.

Theorem 2.30. Let {X_n} be i.i.d. discrete random variables, X_i ∼ X with support X, and let (X_1, ..., X_n) have joint pmf p^(n)(x_1, ..., x_n). Then −(1/n) log p^(n)(X_1, ..., X_n) → H(X) in probability as n → ∞.

Proof. Y_i = log p(X_i) are also i.i.d. By the weak law of large numbers,

−(1/n) log p^(n)(X_1, ..., X_n) = −(1/n) Σ_{i=1}^n log p(X_i) → −E log p(X) = H(X),

with convergence in probability.

Definition 2.31.

A_ε^(n) = { (x_1, ..., x_n) ∈ X^n | 2^{−n(H(X)+ε)} ≤ p^(n)(x_1, ..., x_n) ≤ 2^{−n(H(X)−ε)} }

is called the typical set w.r.t. ε and p.

For X_i i.i.d. ∼ p(x), Theorem 2.30 states that −(1/n) log p^(n)(X_1, ..., X_n) → H(X) in probability, so with high probability a drawn sequence lies in the typical set.

Theorem 2.32.

a) If (x_1, ..., x_n) ∈ A_ε^(n), then

H(X) − ε ≤ −(1/n) log p^(n)(x_1, ..., x_n) ≤ H(X) + ε.

b) P(A_ε^(n)) > 1 − ε for n sufficiently large.
c) |A_ε^(n)| ≤ 2^{n(H(X)+ε)}   (| · | denotes cardinality).
d) |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)} for n sufficiently large.

Proof. a) Obvious from the definition.
b) Follows from Theorem 2.30 (convergence in probability).
c)

1 = Σ_{x∈X^n} p^(n)(x) ≥ Σ_{x∈A_ε^(n)} p^(n)(x) ≥ Σ_{x∈A_ε^(n)} 2^{−n(H(X)+ε)} = 2^{−n(H(X)+ε)} |A_ε^(n)|.

d) For sufficiently large n, P(A_ε^(n)) > 1 − ε, hence

1 − ε < P(A_ε^(n)) ≤ Σ_{x∈A_ε^(n)} 2^{−n(H(X)−ε)} = 2^{−n(H(X)−ε)} |A_ε^(n)|.

(n)
For given  > 0 and sufficiently large n. X n decomposes into a set T = A (typical set) such
that
• P ((X1 , ..., Xn ) ∈ T c ) ≥ 
• For all x = (x1 , ..., xn ) ∈ T :
1
| − log p(n) (x1 , ..., xn ) − H(X) |≤ 
n
the normalized log-prob of all sequences in T is nearly equal and close to H(X).
Graphically:
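The following Python sketch (added for illustration under the assumption of a Bernoulli(p) source, which is not specified in the notes) checks Theorem 2.32 numerically: sequences are grouped by their number of ones, so probability and size of the typical set can be summed exactly.

```python
from math import comb, log2

p, n, eps = 0.3, 100, 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)             # entropy of the Bernoulli(p) source

prob_typical, size_typical = 0.0, 0
for k in range(n + 1):                               # k = number of ones in the sequence
    logp_seq = k * log2(p) + (n - k) * log2(1 - p)   # log2 p^(n)(x) for any such sequence
    if -n * (H + eps) <= logp_seq <= -n * (H - eps):
        prob_typical += comb(n, k) * 2 ** logp_seq
        size_typical += comb(n, k)

print(prob_typical)                  # close to 1 for large n (Theorem 2.32 b)
print(log2(size_typical) / n, H)     # roughly H(X) bits per symbol (Theorem 2.32 c, d)
```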

The AEP and Data Compression

Let X_1, ..., X_n be i.i.d. with support X and write X^(n) = (X_1, ..., X_n). The aim is to find a short description/encoding of all values x^(n) = (x_1, ..., x_n) ∈ X^n. The key idea is index coding: allocate each of the |X|^n values an index.

• By Theorem 2.32 c), |A_ε^(n)| ≤ 2^{n(H(X)+ε)}. Indexing of all x^(n) ∈ A_ε^(n) requires at most n(H(X) + ε) + 1 bits (1 bit extra since n(H(X) + ε) may not be an integer).
• Indexing of all x^(n) ∈ X^n requires at most n log|X| + 1 bits.

Prefix each code word for x^(n) ∈ A_ε^(n) by 0 and each code word for x^(n) ∉ A_ε^(n) by 1. Let l(x^(n)) denote the length of the code word for x^(n). Then

E[ l(X^(n)) ] = Σ_{x^(n)∈X^n} p(x^(n)) l(x^(n))
= Σ_{x^(n)∈A_ε^(n)} p(x^(n)) l(x^(n)) + Σ_{x^(n)∉A_ε^(n)} p(x^(n)) l(x^(n))
≤ Σ_{x^(n)∈A_ε^(n)} p(x^(n)) (n(H(X)+ε) + 2) + Σ_{x^(n)∉A_ε^(n)} p(x^(n)) (n log|X| + 2)
= P(X^(n) ∈ A_ε^(n)) (n(H(X)+ε) + 2) + P(X^(n) ∉ A_ε^(n)) (n log|X| + 2)
≤ n(H(X)+ε) + εn log|X| + 2
= n( H(X) + ε + ε log|X| + 2/n )
= n( H(X) + ε' )

for any ε' > 0 with n sufficiently large. It follows:
Theorem 2.33. Let {X_n} be i.i.d. For any ε > 0 there exists n ∈ N and a binary code that maps each x^(n) one-to-one onto a binary string satisfying

E[ (1/n) l(X^(n)) ] ≤ H(X) + ε.

Hence, for sufficiently large n there exists a code for X^(n) such that the expected average codeword length per symbol is arbitrarily close to H(X).

2.5 Differential Entropy

Remark 2.34. So far we defined entropy for discrete random variables with finite support. A first extension is to discrete random variables with countably many support points, X = {x_1, x_2, ...}, distribution p = (p_1, p_2, ...):

H(X) = − Σ_{i=1}^∞ p_i log p_i.

Note: The sum may be infinite or may not even exist. More important is the extension of entropy to random variables X with a density f.
Definition 2.35. Let X be absolutely continuous with density f(x). Then

h(X) = − ∫_{−∞}^{∞} f(x) log f(x) dx

is called the differential entropy of X.

Remarks:
a) The integral in Def. 2.35 may be infinite or may not even exist (Exercises).
b) As a general implicit assumption in defining h(X) we include: "provided the integral exists".
c) h(X) = E[ −log f(X) ].
Example 2.36.

a) X ∼ U(0, a), f(x) = (1/a) 1(0 < x ≤ a):

h(X) = − ∫_0^a (1/a) log(1/a) dx = log a,   a > 0.

b) X ∼ N(μ, σ^2), f(x) = (1/√(2πσ^2)) e^{−(x−μ)^2/(2σ^2)}, x ∈ R:

h(X) = (1/2) ln(2πeσ^2)   (in nats).
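A small Python sketch (illustration only, not from the notes) compares the closed-form differential entropy of a Gaussian with a Monte Carlo estimate of E[−ln f(X)]; the parameter values are arbitrary.

```python
import math
import random

mu, sigma = 0.0, 2.0
h_formula = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)   # h(X) in nats

# Monte Carlo estimate of h(X) = E[-ln f(X)] for X ~ N(mu, sigma^2)
random.seed(0)
n = 200_000
def neg_log_f(x):
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

h_mc = sum(neg_log_f(random.gauss(mu, sigma)) for _ in range(n)) / n
print(h_formula, h_mc)    # the two values agree up to sampling noise
```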
Definition 2.37. a) Let X = (X_1, ..., X_n) be a random vector with joint density f(x_1, ..., x_n). Then

h(X_1, ..., X_n) = − ∫ ... ∫ f(x_1, ..., x_n) log f(x_1, ..., x_n) dx_1 ... dx_n

is called the joint differential entropy of X.

b) Let (X, Y) be a random vector with joint density f(x, y) and conditional density

f(x | y) = f(x, y) / f(y)   if f(y) > 0,

and 0 otherwise. Then

h(X | Y) = − ∫∫ f(x, y) log f(x | y) dx dy

is called the conditional differential entropy of X given Y.


Definition 2.38. The mutual information between two random variables X and Y with joint density f(x, y) is defined as

I(X; Y) = h(X) − h(X | Y) = h(Y) − h(Y | X).

Interpretation: amount of information about X provided by Y and vice versa.

I(X; Y) = h(X) − h(X | Y)
        = − ∫ f(x) log f(x) dx + ∫∫ f(x, y) log f(x | y) dx dy
        = ∫∫ f(x, y) log [ f(x | y) / f(x) ] dx dy
        = ∫∫ f(x, y) log [ f(x, y) / (f(x) f(y)) ] dx dy,   (∗)

also showing interchangeability of X and Y.


Definition 2.39. The relative entropy or Kullback-Leibler divergence between two densities f and g is defined as

D(f‖g) = ∫ f(x) log [ f(x) / g(x) ] dx.

From (∗) it follows that

I(X; Y) = D( f(x, y) ‖ f(x) f(y) ).   (∗∗)

Theorem 2.40. D(f‖g) ≥ 0 with equality iff f = g (almost everywhere).

Proof. Let S = {x | f(x) > 0} be the support of f. Then

−D(f‖g) = ∫_S f log(g/f)
        ≤ log ∫_S f · (g/f)   (Jensen's inequality; log is concave, so E log(Z) ≤ log E(Z))
        = log ∫_S g ≤ log 1 = 0.

Equality holds iff f = g a.e.


Corollary 2.41.

a) I(X; Y ) ≥ 0 with equality iff X and Y are independent.


b) h(X | Y ) ≤ h(X) with equality iff X, Y are independent.
c) − ∫ f log f ≤ − ∫ f log g.

Proof. a) Follows from (∗∗) and Theorem 2.40.
b) I(X; Y) = h(X) − h(X | Y) ≥ 0 by a).
c) By definition of D(f‖g) and Theorem 2.40.

Theorem 2.42. (Chain rule for differential entropy)

h(X_1, ..., X_n) = Σ_{i=1}^n h(X_i | X_1, ..., X_{i−1}).

Proof. From the definition it follows that

h(X, Y ) = h(X) + h(Y | X).

This implies

h(X1 , .., Xi ) = h(X1 , .., Xi−1 ) + h(Xi | X1 , ..., Xi−1 ).

The assertion follows by induction.


Corollary 2.43.

h(X_1, ..., X_n) ≤ Σ_{i=1}^n h(X_i),

with equality iff X_1, ..., X_n are stochastically independent.


Theorem 2.44. Let X ∈ R^n have density f(x), let A ∈ R^{n×n} be of full rank and b ∈ R^n. Then

h(AX + b) = h(X) + log |det A|.

Proof. If X ∼ f(x), then Y = AX + b has density (1/|det A|) f(A^{−1}(y − b)), y ∈ R^n. Hence

h(Y) = − ∫ (1/|det A|) f(A^{−1}(y − b)) log [ (1/|det A|) f(A^{−1}(y − b)) ] dy
     = − log(1/|det A|) − ∫ f(x) log f(x) dx   (substituting x = A^{−1}(y − b))
     = log |det A| + h(X).

Theorem 2.45. Let X ∈ R^n be absolutely continuous with density f(x) and Cov(X) = C positive definite. Then

h(X) ≤ (1/2) ln( (2πe)^n |C| ),

i.e., N_n(μ, C) has the largest differential entropy amongst all random vectors with positive definite covariance matrix C.

Proof. W.l.o.g. assume EX = 0 (see Theorem 2.44). Let

q(x) = 1 / ( (2π)^{n/2} |C|^{1/2} ) exp{ −(1/2) x^T C^{−1} x }

be the density of N_n(0, C). Let X ∼ f(x) with EX = 0 and Cov(X) = E(XX^T) = ∫ xx^T f(x) dx = C. Then

h(X) = − ∫ f(x) ln f(x) dx
     ≤ − ∫ f(x) ln q(x) dx   (Cor. 2.41 c)
     = − ∫ f(x) ln [ (2π)^{−n/2} |C|^{−1/2} exp{ −(1/2) x^T C^{−1} x } ] dx
     = ln( (2π)^{n/2} |C|^{1/2} ) + (1/2) ∫ x^T C^{−1} x f(x) dx
     = ln( (2π)^{n/2} |C|^{1/2} ) + (1/2) ∫ tr( C^{−1} xx^T ) f(x) dx
     = ln( (2π)^{n/2} |C|^{1/2} ) + (1/2) tr( C^{−1} ∫ xx^T f(x) dx )
     = ln( (2π)^{n/2} |C|^{1/2} ) + n/2
     = ln( (2πe)^{n/2} |C|^{1/2} ) = (1/2) ln( (2πe)^n |C| ).
3 Source Coding

[Block diagram as in Fig. 1.2: source → source encoder → channel encoder → modulator → analog channel with noise → demodulator → channel decoder → source decoder → destination.]

Communication channel from an information theoretic point of view.

3.1 Variable Length Encoding

Given some source alphabet X = {x_1, ..., x_m} and code alphabet Y = {y_1, ..., y_d}, the aim is to find a code word formed over Y for each character x_1, ..., x_m. In other words, each character x_i ∈ X is uniquely mapped onto a "word" over Y.

Definition 3.1. An injective mapping

g : X → ∪_{ℓ=0}^∞ Y^ℓ : x_i ↦ g(x_i) = (w_i1, ..., w_{i n_i})

is called an encoding. g(x_i) = (w_i1, ..., w_{i n_i}) is called the code word of character x_i, and n_i is called the length of code word i.


Example 3.2.

        g1    g2      g3     g4
a       1     1       0      0
b       0     10      10     01
c       1     100     110    10
d       00    1000    111    11

g1: no encoding (not injective). g2: encoding, words are separable. g3: encoding, shorter, words separable. g4: encoding, even shorter, but not separable.

Hence, separability of concatenated words over Y is important.


Definition 3.3. An encoding g is called uniquely decodable (u.d.) or uniquely decipherable, if the mapping

G : ∪_{k=0}^∞ X^k → ∪_{ℓ=0}^∞ Y^ℓ : (a_1, ..., a_k) ↦ (g(a_1), ..., g(a_k))

is injective.
Example 3.4. Use the previous encoding g3
g3
a 0
b 10
c 110
d 111
111100011011100010
1 1 1|1 0 0 0 1 1 0 1 1 1 0 0 0 1 0
1 1 1|1 0 |0 0 1 1 0 1 1 1 0 0 0 1 0
1 1 1|1 0 |0|0 |1 1 0|1 1 1|0| 0|0|1 0
dbaacdaaab

(g3 is a so called prefix code)

3.2 Prefix Codes

Definition 3.5. A code is called prefix code, if no complete code word is prefix of some
other code word, i.e., no code word evolves from continuing some other.

Formally:
a ∈ Y k is called prefix of b ∈ Y l , k ≤ l, if there is some c ∈ Y l−k such that b = (a, c).
Theorem 3.6. Prefix codes are uniquely decodable.

Properties of prefix codes:



– Prefix codes are easy to construct based on the code word lengths.
– Decoding of prefix codes is fast and requires no memory storage.

Next aim: characterize uniquely decodable codes by their code word lengths.

3.3 Kraft-McMillan Theorem

Theorem 3.7. (Kraft-McMillan Theorem)

a) [McMillan (1959)]: All uniquely decodable codes with code word lengths n_1, ..., n_m satisfy

Σ_{j=1}^m d^{−n_j} ≤ 1.

b) [Kraft (1949)]: Conversely, if n_1, ..., n_m ∈ N are such that Σ_{j=1}^m d^{−n_j} ≤ 1, then there exists a u.d. code (even a prefix code) with code word lengths n_1, ..., n_m.

Proof.

(a) Let g be a u.d. code with codeword lengths n_1, ..., n_m. Let r = max_i n_i be the maximum codeword length and β_ℓ = |{i | n_i = ℓ}| the number of codewords of length ℓ ∈ N, ℓ ≤ r. For every k ∈ N it holds that

( Σ_{j=1}^m d^{−n_j} )^k = ( Σ_{ℓ=1}^r β_ℓ d^{−ℓ} )^k = Σ_{ℓ=k}^{kr} γ_ℓ d^{−ℓ}

with

γ_ℓ = Σ_{1≤i_1,...,i_k≤r, i_1+...+i_k=ℓ} β_{i_1} ··· β_{i_k},   ℓ = k, ..., kr.

γ_ℓ is the number of source words of length k whose encoding has length ℓ, and d^ℓ is the number of all words of length ℓ over Y. Since g is u.d., each code word has at most one source word, hence γ_ℓ ≤ d^ℓ and

( Σ_{j=1}^m d^{−n_j} )^k ≤ Σ_{ℓ=k}^{kr} d^ℓ d^{−ℓ} = kr − k + 1 ≤ kr   for all k ∈ N.

Further,

Σ_{j=1}^m d^{−n_j} ≤ (kr)^{1/k} → 1   (k → ∞),

so that Σ_{j=1}^m d^{−n_j} ≤ 1.

Example 3.8.

      g3     g4
a     0      0
b     10     01
c     110    10
d     111    11
      u.d.   not u.d.

For g3: 2^{−1} + 2^{−2} + 2^{−3} + 2^{−3} = 1.
For g4: 2^{−1} + 2^{−2} + 2^{−2} + 2^{−2} = 5/4 > 1.
g4 is not u.d., and there is no u.d. code with code word lengths 1, 2, 2, 2.
(b) We illustrate the construction with an example. Assume n_1 = n_2 = 2, n_3 = n_4 = n_5 = 3, n_6 = 4. Then Σ_{i=1}^6 2^{−n_i} = 15/16 ≤ 1.
Construct a prefix code by a binary code tree: assign each character x_i a node at depth n_i such that no chosen node is a descendant of another.

[Binary code tree with leaves x_1, ..., x_6 at depths 2, 2, 3, 3, 3, 4.]

The corresponding code is given as

x_i      x_1   x_2   x_3   x_4   x_5   x_6
g(x_i)   11    10    011   010   001   0001
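The following Python sketch (not from the notes) checks the Kraft sum and constructs a binary prefix code from given lengths; the resulting codewords differ from the tree above but have the same lengths, which is all the theorem requires.

```python
def kraft_sum(lengths, d=2):
    """Left-hand side of the Kraft-McMillan inequality."""
    return sum(d ** (-n) for n in lengths)

def prefix_code_from_lengths(lengths):
    """Kraft construction (binary): assign codewords in order of increasing length.
    Works whenever the Kraft sum is <= 1."""
    code, next_val, prev_len = [], 0, 0
    for n in sorted(lengths):
        next_val <<= (n - prev_len)                    # extend the current node to depth n
        code.append(format(next_val, "0{}b".format(n)))
        next_val += 1                                  # move to the next free node at depth n
        prev_len = n
    return code

print(kraft_sum([1, 2, 3, 3]))                       # 1.0  -> a u.d. code exists (g3)
print(kraft_sum([1, 2, 2, 2]))                       # 1.25 -> no u.d. code (g4)
print(prefix_code_from_lengths([2, 2, 3, 3, 3, 4]))  # lengths from the example above
```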

3.4 Average Code Word Length

Given a code g(x1 ), . . . , g(xm ) with code word lengths n1 , . . . , nm .


Question: What is a reasonable measure of the “length of a code”?

Definition 3.9. The expected code word length is defined as

n̄ = n̄(g) = Σ_{j=1}^m n_j p_j = Σ_{j=1}^m n_j P(X = x_j).

Example 3.10.

        p_i    g2      g3
a       1/2    1       0
b       1/4    10      10
c       1/8    100     110
d       1/8    1000    111
n̄(g)           15/8    14/8

H(X) = 14/8

3.5 Noiseless Coding Theorem

Theorem 3.11. Noiseless Coding Theorem, Shannon (1949) Let random variable X de-
scribe a source with distribution P (X = xi ) = pi , i = 1, . . . , m. Let the code alphabet
Y = {y1 , . . . , yd } have size d.
a) Each u.d. code g with code word lengths n1 , . . . , nm satisfies

n̄(g) ≥ H(X)/ log d.

b) Conversely, there is a prefix code, hence a u.d. code g with

n̄(g) ≤ H(X)/ log d + 1.

Proof. a) For any u.d. code it holds by McMillan's Theorem that

H(X)/log d − n̄(g) = (1/log d) Σ_{j=1}^m p_j log(1/p_j) − Σ_{j=1}^m p_j n_j
                  = (1/log d) Σ_{j=1}^m p_j log(1/p_j) + (1/log d) Σ_{j=1}^m p_j log d^{−n_j}
                  = (1/log d) Σ_{j=1}^m p_j log( d^{−n_j} / p_j )
                  = (log e/log d) Σ_{j=1}^m p_j ln( d^{−n_j} / p_j )
                  ≤ (log e/log d) Σ_{j=1}^m p_j ( d^{−n_j}/p_j − 1 )   (since ln x ≤ x − 1, x ≥ 0)
                  = (log e/log d) Σ_{j=1}^m ( d^{−n_j} − p_j ) ≤ 0.

b) Shannon-Fano Coding
W.l.o.g. assume that pj > 0 for all j.

Choose integers n_j such that d^{−n_j} ≤ p_j < d^{−n_j + 1} for all j. Then

Σ_{j=1}^m d^{−n_j} ≤ Σ_{j=1}^m p_j ≤ 1,

such that by Kraft's Theorem a u.d. (prefix) code g with these lengths exists. Furthermore,

log p_j < (−n_j + 1) log d

holds by construction. Hence

Σ_{j=1}^m p_j log p_j < (log d) Σ_{j=1}^m p_j (−n_j + 1),

equivalently,

H(X) > (log d) ( n̄(g) − 1 ).
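As an illustration of the Shannon-Fano lengths used in part b) (a sketch added here, assuming d = 2 and a dyadic example distribution), the code lengths n_j = ⌈−log_d p_j⌉ can be computed as follows; a prefix code with these lengths then exists by Kraft's theorem.

```python
import math

def shannon_fano_lengths(probs, d=2):
    """Choose n_j with d^(-n_j) <= p_j < d^(-n_j + 1), i.e. n_j = ceil(-log_d p_j)."""
    # small tolerance guards against floating point rounding of exact powers of d
    return [math.ceil(-math.log(p, d) - 1e-12) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_fano_lengths(probs)
avg = sum(p * n for p, n in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths, avg, H)     # here avg = H exactly; in general H <= avg < H + 1
```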

3.6 Compact Codes

Is there always a u.d. code g with

n̄(g) = H(X)/ log d?

No! Check the previous proof. Equality holds if and only if p_j = d^{−n_j} for all j = 1, ..., m.

Example 3.12. Consider binary codes, i.e., d = 2. X = {a, b}, p1 = 0.6, p2 = 0.4. The
shortest possible code is g(a) = (0), g(b) = (1).

H(X) = −0.6 log2 0.6 − 0.4 log2 0.4 = 0.97095


n̄(g) = 1.

Definition 3.13. Any code of shortest possible average code word length is called compact.
How to construct compact codes?

3.7 Huffman Coding

[Huffman tree construction for the probabilities below: the two least probable nodes are merged repeatedly, 0.05 + 0.05 = 0.1, 0.1 + 0.05 = 0.15, 0.1 + 0.1 = 0.2, 0.15 + 0.15 = 0.3, 0.2 + 0.2 = 0.4, 0.3 + 0.3 = 0.6, 0.4 + 0.6 = 1.0; branches are labeled 0/1.]

a 0.05, b 0.05, c 0.05, d 0.1, e 0.1, f 0.15, g 0.2, h 0.3

A compact code g ∗ is given by:


Character: a b c d e f g h
Code word: 01111 01110 0110 111 110 010 10 00
It holds (log to the base 2):

n̄(g ∗ ) = 5 · 0.05 + · · · + 2 · 0.3 = 2.75


H(X) = −0.05 · log2 0.05 − · · · − 0.3 · log2 0.3 = 2.7087
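A minimal Huffman coding sketch in Python (added here for illustration; tie-breaking is arbitrary, so the codewords may differ from the table above while the optimal average length stays the same):

```python
import heapq

def huffman_code(probs):
    """Binary Huffman coding; probs is a dict {symbol: probability}.
    Returns {symbol: codeword}."""
    heap = [[p, i, {s: ""}] for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                    # two least probable nodes ...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, [p1 + p2, counter, merged])   # ... are merged into one node
        counter += 1
    return heap[0][2]

probs = {"a": 0.05, "b": 0.05, "c": 0.05, "d": 0.1,
         "e": 0.1, "f": 0.15, "g": 0.2, "h": 0.3}
code = huffman_code(probs)
n_bar = sum(probs[s] * len(w) for s, w in code.items())
print(code, n_bar)          # average length ~2.75 bits, as computed above
```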

Huffman codes are optimal, i.e., they have shortest average codeword length. We consider the case d = 2.
Lemma 3.14. Let X = {x_1, ..., x_m} with probabilities p_1 ≥ ... ≥ p_m > 0. There exists an optimal binary prefix code g with codeword lengths n_1, ..., n_m such that
(i) n_1 ≤ ... ≤ n_m,
(ii) n_{m−1} = n_m,
(iii) g(x_{m−1}) and g(x_m) differ only in the last position.

Proof. Let g be an optimal prefix code with lengths n_1, ..., n_m.

(i) If p_i > p_j then necessarily n_i ≤ n_j, 1 ≤ i < j ≤ m. Otherwise exchange g(x_i) and g(x_j) to obtain a code g' with

n̄(g') − n̄(g) = p_i n_j + p_j n_i − p_i n_i − p_j n_j = (p_i − p_j)(n_j − n_i) < 0,

contradicting optimality of g.
(ii) There is an optimal prefix code g with n_1 ≤ ... ≤ n_m. If n_{m−1} < n_m, delete the last n_m − n_{m−1} positions of g(x_m) to obtain a better code.
(iii) If n_1 ≤ ... ≤ n_{m−1} = n_m for an optimal prefix code g and g(x_{m−1}) and g(x_m) differ by more than the last position, delete the last position in both to obtain a better code.

Lemma 3.15. Let X = {x_1, ..., x_m} with probabilities p_1 ≥ ... ≥ p_m > 0. Let X' = {x'_1, ..., x'_{m−1}} with probabilities p'_i = p_i, i = 1, ..., m − 2, and p'_{m−1} = p_{m−1} + p_m. Let g' be an optimal prefix code for X' with codewords g'(x'_i), i = 1, ..., m − 1. Then

g(x_i) = g'(x'_i),              i = 1, ..., m − 2,
g(x_{m−1}) = (g'(x'_{m−1}), 0),
g(x_m)     = (g'(x'_{m−1}), 1)

is an optimal prefix code for X.

Proof. Denote the codeword lengths of g and g' by n_i and n'_i, respectively.

n̄(g) = Σ_{j=1}^{m−2} p_j n'_j + (p_m + p_{m−1})(n'_{m−1} + 1)
     = Σ_{j=1}^{m−2} p'_j n'_j + p'_{m−1}(n'_{m−1} + 1)
     = Σ_{j=1}^{m−1} p'_j n'_j + p_{m−1} + p_m = n̄(g') + p_{m−1} + p_m.

Assume g is not optimal for X. Then there exists an optimal prefix code h with properties (i)-(iii) of Lemma 3.14 and n̄(h) < n̄(g). Set

h'(x'_j) = h(x_j),   j = 1, ..., m − 2,
h'(x'_{m−1}) = h(x_{m−1}) with the last position deleted.

Then n̄(h') + p_{m−1} + p_m = n̄(h) < n̄(g) = n̄(g') + p_{m−1} + p_m. Hence n̄(h') < n̄(g'), contradicting optimality of g'.

3.8 Block Codes for Stationary Sources

Encode blocks/words of length N by words over the code alphabet Y. Assume that blocks
are generated by a stationary source, a stationary sequence of random variables {Xn }n∈N .
Notation for a block code:

g^(N) : X^N → ∪_{ℓ=0}^∞ Y^ℓ

Block codes are "normal" variable length codes over the extended alphabet X^N. A fair measure of the "length" of a block code is the average code word length per character,

n̄(g^(N)) / N.

Theorem 3.16. Noiseless Coding Theorem for Block Codes


Let X = {X_n}_{n∈N} be a stationary source and let the code alphabet Y = {y_1, ..., y_d} have size d.

a) Each u.d. block code g^(N) satisfies

n̄(g^(N)) / N ≥ H(X_1, ..., X_N) / (N log d).

b) Conversely, there is a prefix block code, hence a u.d. block code g^(N), with

n̄(g^(N)) / N ≤ H(X_1, ..., X_N) / (N log d) + 1/N.

Hence, in the limit as N → ∞: there is a sequence of u.d. block codes g^(N) such that

lim_{N→∞} n̄(g^(N)) / N = H_∞(X) / log d.

3.8.1 Huffman Block Coding

In principle, Huffman encoding can be applied to block codes. However, problems include
– The size of the Huffman table is m^N, thus growing exponentially with the block length.
– The code table needs to be transmitted to the receiver.
– The source statistics are assumed to be stationary. No adaptivity to changing probabil-
ities.
– Encoding and decoding only per block. Delays occur at the beginning and end. Padding
may be necessary.

3.9 Arithmetic Coding

Assume that:
– Message (xi1 , . . . , xiN ), xij ∈ X , j = 1, . . . , N is generated by some source {Xn }n∈N .
– All (conditional) probabilities

P(X_n = x_{i_n} | X_1 = x_{i_1}, ..., X_{n−1} = x_{i_{n−1}}) = p(i_n | i_1, ..., i_{n−1}),

x_{i_1}, ..., x_{i_n} ∈ X, n = 1, ..., N, are known to the encoder and decoder, or can be estimated.
Then,

P (X1 = xi1 , . . . , Xn = xin ) = p(i1 , . . . , in )

can be easily computed as

p(i1 , . . . , in ) = p(in | i1 , . . . , in−1 ) · p(i1 , . . . , in−1 ).

Iteratively construct intervals.

Initialization, n = 1: c(1) = 0, c(m + 1) = 1,

I(j) = [ c(j), c(j + 1) ),   c(j) = Σ_{i=1}^{j−1} p(i),   j = 1, ..., m

(cumulative probabilities).

Recursion over n = 2, ..., N:

I(i_1, ..., i_n) = [ c(i_1, ..., i_{n−1}) + Σ_{i=1}^{i_n − 1} p(i | i_1, ..., i_{n−1}) · p(i_1, ..., i_{n−1}),
                     c(i_1, ..., i_{n−1}) + Σ_{i=1}^{i_n} p(i | i_1, ..., i_{n−1}) · p(i_1, ..., i_{n−1}) ).

Program code available from Togneri, deSilva, p. 151, 152.



Example 3.17. [Diagram: the unit interval [0, 1] is partitioned into subintervals of lengths p(1), ..., p(m) with boundaries c(1), ..., c(m); the subinterval of symbol 2 is further partitioned into pieces of lengths p(1|2)p(2), ..., p(m|2)p(2) with boundaries c(2, 1), ..., c(2, m); the piece of (2, m) is partitioned into lengths p(1|2, m)p(2, m), ..., p(m|2, m)p(2, m) with boundaries c(2, m, 1), ..., c(2, m, m); and so on.]

Encode the message (x_{i_1}, ..., x_{i_N}) by the binary representation of some number in the interval I(i_1, ..., i_N).

A scheme which usually works quite well is as follows.

Let l = l(i_1, ..., i_N) and r = r(i_1, ..., i_N) denote the left and right bound of the corresponding interval. Carry out the binary expansions of l and r until they differ. Since l < r, at the first place they differ there will be a 0 in the expansion of l and a 1 in the expansion of r. The number 0.a_1 a_2 ... a_{t−1} 1 falls within the interval and requires the least number of bits.
(a_1 a_2 ... a_{t−1} 1) is the encoding of (x_{i_1}, ..., x_{i_N}).

The probability of occurrence of the message (x_{i_1}, ..., x_{i_N}) is equal to the length of the representing interval. Approximately

− log_2 p(i_1, ..., i_N)

bits are needed to represent the interval, which is close to optimal.


Example 3.18. Assume a memoryless source with 4 characters and probabilities

x_i            a     b     c     d
P(X_n = x_i)   0.3   0.4   0.1   0.2

Encode the word (bad). The interval lengths of the successive refinements are:

a 0.3       b 0.4       c 0.1       d 0.2
ba 0.12     bb 0.16     bc 0.04     bd 0.08
baa 0.036   bab 0.048   bac 0.012   bad 0.024

(bad) = [0.396, 0.42)

0.396 = 0.01100...,   0.420 = 0.01101...
(bad) = (01101)
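The following Python sketch (illustration only, for the memoryless source of this example) reproduces the interval [0.396, 0.42) and the binary expansions used above.

```python
p = {"a": 0.3, "b": 0.4, "c": 0.1, "d": 0.2}      # memoryless source of Example 3.18
symbols = list(p)                                  # fixed symbol order a, b, c, d

def interval(message):
    """Nested interval [low, high) of a message for a memoryless source."""
    low, width = 0.0, 1.0
    for s in message:
        offset = sum(p[t] for t in symbols[:symbols.index(s)])   # cumulative probability c(.)
        low, width = low + offset * width, p[s] * width
    return low, low + width

low, high = interval("bad")
print(low, high)                                   # ~[0.396, 0.42)

def bits(x, n):
    """First n binary digits of x in [0, 1)."""
    out = ""
    for _ in range(n):
        x *= 2
        out += str(int(x))
        x -= int(x)
    return out

print(bits(low, 5), bits(high, 5))                 # 01100 vs 01101 -> code word 01101
```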
4 Information Channels

[Block diagram as in Fig. 1.2: source → source encoder → channel encoder → modulator → analog channel with noise → demodulator → channel decoder → source decoder → destination.]

Communication channel from an information theoretic point of view.

4.1 Discrete Channel Model

Discrete information channels are described by


– A pair of random variables (X, Y ) with support X × Y , where X is the input r.v.,
X = {x1 , . . . , xm } the input alphabet and Y is the output r.v., Y = {y1 , . . . , yd } the
output alphabet.
– The channel matrix

W = (w_ij)_{i=1,...,m, j=1,...,d}

with

w_ij = P(Y = y_j | X = x_i),   i = 1, ..., m, j = 1, ..., d.

– Input distribution
P (X = xi ) = pi , i = 1, . . . , m,
p = (p1 , . . . , pm ).
Discrete Channel Model :

[Diagram: input X = x_i → channel W = (w_ij)_{1≤i≤m, 1≤j≤d} → output Y = y_j.]

W is composed of rows w_1, ..., w_m.
Lemma 4.1. Let X and Y be the input r.v. and the output r.v. of a discrete channel with channel matrix W, respectively. Denote the input distribution by P(X = x_i) = p_i, i = 1, ..., m, with p = (p_1, ..., p_m). Then
(a) H(Y) = H(pW),
(b) H(Y | X = x_i) = H(w_i),
(c) H(Y | X) = Σ_{i=1}^m p_i H(w_i).

Proof. (a) Determine the distribution of Y:

P(Y = y_j) = Σ_{i=1}^m P(Y = y_j | X = x_i) P(X = x_i) = Σ_{i=1}^m p_i w_ij = (pW)_j,   j = 1, ..., d.

(b) H(Y | X = x_i) = H(w_i) by definition.

(c) H(Y | X) = Σ_{i=1}^m p_i H(Y | X = x_i) = Σ_{i=1}^m p_i H(w_i).

4.2 Channel Capacity

The mutual information between X and Y is

I(X; Y) = H(Y) − H(Y | X)
        = H(pW) − Σ_{i=1}^m p_i H(w_i)
        = − Σ_{j=1}^d ( Σ_{i=1}^m p_i w_ij ) log( Σ_{i=1}^m p_i w_ij ) + Σ_{i,j} p_i w_ij log w_ij
        = − Σ_{i,j} p_i w_ij log( Σ_{l=1}^m p_l w_lj ) + Σ_{i,j} p_i w_ij log w_ij
        = Σ_i p_i Σ_j w_ij log [ w_ij / Σ_l p_l w_lj ]
        = Σ_{i=1}^m p_i D(w_i ‖ pW) = I(p; W),

where D denotes the Kullback-Leibler divergence.


The aim is to use the input distribution that maximizes mutual information I(X; Y ) for a
given channel W .
Definition 4.2.
C = max_{p=(p_1,...,p_m)} I(X; Y) = max_p I(p; W)

is called channel capacity.


Determining capacity is in general a complicated optimization problem.

4.3 Binary Channels

4.3.1 Binary Symmetric Channel (BSC)

Example 4.3. (BSC) Input distribution p = (p_0, p_1), channel matrix

W = ( 1−ε    ε  )
    (  ε    1−ε ).

In this case I(X; Y) is

I(X; Y) = I(p; W) = H(pW) − Σ_{i=1}^m p_i H(w_i)
        = H( p_0(1−ε) + p_1 ε, p_0 ε + (1−ε)p_1 ) − p_0 H(1−ε, ε) − p_1 H(ε, 1−ε)
        = H_2( p_0(1−ε) + p_1 ε ) − H_2(ε)

and is to be maximised over (p_0, p_1), p_0, p_1 ≥ 0, p_0 + p_1 = 1, where

H_2(q) = −q log q − (1 − q) log(1 − q),   0 ≤ q ≤ 1,

satisfies H_2(q) ≤ log 2. The maximum is achieved if p_0 = p_1 = 1/2. The capacity-achieving distribution is p* = (1/2, 1/2) with capacity

C = max I(X; Y) = log 2 + (1 − ε) log(1 − ε) + ε log ε
  = 1 + (1 − ε) log_2(1 − ε) + ε log_2 ε   (in bits).

Capacity of the BSC as a function of ε:

[Plot of C = C(ε) for 0 ≤ ε ≤ 1: C(0) = C(1) = 1, C(0.5) = 0, symmetric around ε = 0.5.]
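A one-line capacity function for the BSC (an illustrative sketch, not part of the notes) reproduces the values of this curve.

```python
import math

def H2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bsc_capacity(eps):
    """C(eps) = 1 - H2(eps) bits per channel use."""
    return 1.0 - H2(eps)

for eps in [0.0, 0.1, 0.3, 0.5, 0.9, 1.0]:
    print(eps, round(bsc_capacity(eps), 4))
# C(0) = C(1) = 1, C(0.5) = 0, and the curve is symmetric around eps = 0.5
```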
Remark 4.4. To compute channel capacity for a given channel with channel matrix W, we need to solve

C = max_p I(p; W) = max_p Σ_{i=1}^m p_i D(w_i ‖ pW).

Theorem 4.5. The capacity of the channel W is attained at p* = (p*_1, ..., p*_m) if and only if

D(w_i ‖ p*W) = ζ   for all i = 1, ..., m with p*_i > 0,

and D(w_i ‖ p*W) ≤ ζ for all i with p*_i = 0. Moreover,

C = I(p*; W) = ζ.

Proof. Mutual information I(p; W) is a concave function of p. Hence the KKT conditions (cf., e.g., Boyd and Vandenberghe 2004) are necessary and sufficient for optimality of some input distribution p. Using the above representation, some elementary algebra shows that

∂/∂p_k I(p; W) = D(w_k ‖ pW) − 1.

Indeed,

∂/∂p_k H(pW) = ∂/∂p_k [ − Σ_j ( Σ_i p_i w_ij ) log( Σ_i p_i w_ij ) ]
             = − Σ_j [ w_kj log( Σ_i p_i w_ij ) + ( Σ_i p_i w_ij ) · w_kj / ( Σ_i p_i w_ij ) ]
             = − Σ_j [ w_kj log( Σ_i p_i w_ij ) + w_kj ],

thus

∂/∂p_k I(p; W) = ∂/∂p_k H(pW) − ∂/∂p_k ( Σ_i p_i H(w_i) )
              = − Σ_j w_kj log( Σ_i p_i w_ij ) + Σ_j w_kj log w_kj − 1
              = Σ_j w_kj log [ w_kj / Σ_i p_i w_ij ] − 1
              = D(w_k ‖ pW) − 1.

The full set of KKT conditions now reads as

Σ_{j=1}^m p_j = 1
p_i ≥ 0,   i = 1, ..., m
λ_i ≥ 0,   i = 1, ..., m
λ_i p_i = 0,   i = 1, ..., m
D(w_i ‖ pW) + λ_i + ν = 0,   i = 1, ..., m,

which shows the assertion.
Theorem 4.6. (G. Alirezaei, 2018)
Given a channel with square channel matrix W = (w_ij)_{i,j=1,...,m}, denote self-information by ρ(q) = −q log q, q ≥ 0. Assume that W is invertible with inverse

T = (t_ij)_{i,j=1,...,m}.

Then, measured in nats, the capacity is

C = ln [ Σ_k exp( − Σ_{i,j} t_ki ρ(w_ij) ) ]

and the capacity-achieving distribution is given by

p*_s = e^{−C} Σ_k t_ks exp( − Σ_{i,j} t_ki ρ(w_ij) )
     = Σ_k t_ks exp( − Σ_{i,j} t_ki ρ(w_ij) ) / Σ_k exp( − Σ_{i,j} t_ki ρ(w_ij) ),   s = 1, ..., m.

Proof. p is capacity achieving iff D(w_i ‖ pW) = ζ for all i with p_i > 0. Recall ρ(q) = −q ln q, q ≥ 0, and let T be the inverse of W, T = W^{−1}, so that T 1_m = T W 1_m = I 1_m = 1_m, i.e., Σ_i t_ki = 1 for all k.
Then it holds that

ζ = D(w_i ‖ pW) = Σ_j w_ij ln [ w_ij / Σ_{l=1}^m p_l w_lj ]
  = − Σ_j [ w_ij ln( Σ_l p_l w_lj ) + ρ(w_ij) ],   i = 1, ..., m.

Hence for all k = 1, ..., m, multiplying by t_ki and summing over i,

ζ ( Σ_i t_ki ) = − Σ_i t_ki Σ_j [ w_ij ln( Σ_l p_l w_lj ) + ρ(w_ij) ]
ζ = − Σ_j ( Σ_i t_ki w_ij ) ln( Σ_l p_l w_lj ) − Σ_{i,j} t_ki ρ(w_ij)
  = − ln( Σ_l p_l w_lk ) − Σ_{i,j} t_ki ρ(w_ij),

since Σ_i t_ki w_ij = δ_kj. Resolving for p = (p_1, ..., p_m):

Σ_l p_l w_lk = exp( −ζ − Σ_{i,j} t_ki ρ(w_ij) )   for all k = 1, ..., m.   (∗∗)

Summation over k:

1 = Σ_k exp( −ζ − Σ_{i,j} t_ki ρ(w_ij) ) = e^{−ζ} Σ_k e^{ − Σ_{i,j} t_ki ρ(w_ij) }.

It follows that

ζ = ln( Σ_k e^{ − Σ_{i,j} t_ki ρ(w_ij) } ) = C   (capacity).

To determine p_s, multiply (∗∗) by t_ks and sum over k:

Σ_k t_ks Σ_l p_l w_lk = Σ_k t_ks e^{−C} e^{ − Σ_{i,j} t_ki ρ(w_ij) }
Σ_l p_l Σ_k w_lk t_ks = Σ_k t_ks e^{−C} e^{ − Σ_{i,j} t_ki ρ(w_ij) },

and Σ_k w_lk t_ks = δ_ls, hence

p_s = e^{−C} Σ_k t_ks e^{ − Σ_{i,j} t_ki ρ(w_ij) },   s = 1, ..., m   (capacity-achieving distribution).
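The closed form of Theorem 4.6 can be evaluated directly; the following sketch (added for illustration, assuming NumPy is available) does so and checks it against the BSC. Note that the formula is only meaningful when the resulting p* is a valid (nonnegative) distribution; otherwise the maximum lies on the boundary of the simplex.

```python
import numpy as np

def capacity_square_channel(W):
    """Capacity (in nats) and input distribution for an invertible square channel
    matrix W, following the closed form of Theorem 4.6."""
    W = np.asarray(W, dtype=float)
    T = np.linalg.inv(W)
    rho = -W * np.log(np.where(W > 0, W, 1.0))   # rho(q) = -q ln q, with rho(0) = 0
    s = np.exp(-T @ rho.sum(axis=1))             # s_k = exp(-sum_{i,j} t_ki rho(w_ij))
    C = np.log(s.sum())
    p_star = (T.T @ s) / s.sum()                 # p*_s = e^{-C} sum_k t_ks s_k
    return C, p_star

eps = 0.1
W = [[1 - eps, eps], [eps, 1 - eps]]             # BSC as a sanity check
C, p = capacity_square_channel(W)
print(C / np.log(2), p)                          # ~0.531 bits and p* = (0.5, 0.5)
```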

4.3.2 Binary Asymmetric Channel (BAC)

Example 4.7. (BAC) [Channel diagram: P(Y = 1 | X = 0) = ε, P(Y = 0 | X = 1) = δ.] Channel matrix

W = ( 1−ε    ε  )
    (  δ    1−δ ).

The capacity-achieving distribution is

p*_0 = 1/(1 + b),   p*_1 = b/(1 + b),

with

b = [ aε − (1 − ε) ] / [ δ − a(1 − δ) ]   and   a = exp[ ( h(δ) − h(ε) ) / ( 1 − ε − δ ) ],

where h(ε) = H(ε, 1 − ε), the entropy of (ε, 1 − ε).

Note that ε = δ yields the previous result for the BSC.


Derivation of capacity for the BAC:

By Theorem 4.5 the capacity-achieving input distribution p = (p_0, p_1) satisfies

D(w_1 ‖ pW) = D(w_2 ‖ pW).

This is an equation in the variables p_0, p_1 which, jointly with the condition p_0 + p_1 = 1, has the solution

p*_0 = 1/(1 + b),   p*_1 = b/(1 + b),   (4.1)

with

b = [ aε − (1 − ε) ] / [ δ − a(1 − δ) ]   and   a = exp[ ( h(δ) − h(ε) ) / ( 1 − ε − δ ) ],

and h(ε) = H(ε, 1 − ε), the entropy of (ε, 1 − ε).

4.3.3 Binary Z-Channel (BZC)

Example 4.8. The so called Z-channel is a special case of the BAC with ε = 0.

[Channel diagram of the Z-channel: P(Y = 0 | X = 0) = 1, P(Y = 0 | X = 1) = δ, P(Y = 1 | X = 1) = 1 − δ.]

The capacity-achieving distribution is obtained from the BAC formulas by setting ε = 0.

4.3.4 Binary Asymmetric Erasure Channel (BAEC)

Example 4.9. (BAEC) [Channel diagram: input 0 is received as 0 with probability 1 − ε and erased (output e) with probability ε; input 1 is erased with probability δ and received as 1 with probability 1 − δ.] With outputs ordered (0, e, 1), the channel matrix is

W = ( 1−ε    ε     0  )
    (  0     δ    1−δ ).

The capacity-achieving distribution is determined by finding the solution x* of

ε log ε − δ log δ = (1 − δ) log(δ + εx) − (1 − ε) log(ε + δ/x)

and setting

p*_0 / p*_1 = x*,   p*_0 + p*_1 = 1.

Derivation of capacity for the BAEC:

By Theorem 4.5 the capacity-achieving distribution p* = (p*_0, p*_1), p*_0 + p*_1 = 1, is given by the solution of

(1 − ε) log [ (1 − ε) / (p_0(1 − ε)) ] + ε log [ ε / (p_0 ε + p_1 δ) ]
= δ log [ δ / (p_0 ε + p_1 δ) ] + (1 − δ) log [ (1 − δ) / (p_1(1 − δ)) ].   (4.2)

Substituting x = p_0 / p_1, equation (4.2) reads equivalently as

ε log ε − δ log δ = (1 − δ) log(δ + εx) − (1 − ε) log(ε + δ/x).

By differentiating w.r.t. x it is easy to see that the right hand side is monotonically increasing, such that exactly one solution p* = (p*_0, p*_1) exists, which can be computed numerically.
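Since the right hand side is monotonically increasing in x, the solution x* can be found by simple bisection; the following sketch (illustration only, with arbitrary ε, δ) does this without any external solver.

```python
import math

eps, delta = 0.1, 0.2                       # illustrative erasure probabilities

lhs = eps * math.log(eps) - delta * math.log(delta)

def rhs(x):
    return (1 - delta) * math.log(delta + eps * x) - (1 - eps) * math.log(eps + delta / x)

# rhs is monotonically increasing, so bisection on a wide bracket finds the unique root
lo, hi = 1e-9, 1e9
for _ in range(200):
    mid = math.sqrt(lo * hi)                # geometric midpoint, since x spans many orders
    if rhs(mid) < lhs:
        lo = mid
    else:
        hi = mid
x = math.sqrt(lo * hi)
p0, p1 = x / (1 + x), 1 / (1 + x)
print(p0, p1)                               # capacity-achieving input distribution (p0*, p1*)
```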

4.4 Channel Coding

Consider transmission of blocks of length N .


Denote:

XN = (X1 , . . . , XN ) input random vector of length N


YN = (Y1 , . . . , YN ) output random vector of length N

where X1 , . . . , XN ∈ X , Y1 , . . . , YN ∈ Y.

Only a subset of all possible blocks of length N is used as input, the channel code.

Definition 4.10. A set of M codewords of length N , denoted by

CN = {c1 , . . . , cM } ⊆ X N

is called (N, M )-code.


R = log_2 M / N

is called the code rate. It represents the number of information bits per transmitted channel symbol.
Transmission is characterized by
– the channel code C_N = {c_1, ..., c_M},
– the transmission probabilities

p_N(b_N | a_N) = P(Y_N = b_N | X_N = a_N),

– the decoding rule

h_N : Y^N → C_N : b_N ↦ h_N(b_N).

It can be represented graphically as:


[Diagram: a codeword c_j ∈ C_N = {c_1, ..., c_M} is sent over the channel p_N(b_N | a_N); the received word b_N ∈ Y^N is mapped back to a codeword by the decoding rule h_N.]

4.5 Decoding Rules

Definition 4.11. A decoding rule h_N : Y^N → C_N is called minimum error rule (ME) or ideal observer if

c_j = h_N(b) ⇒ P(X_N = c_j | Y_N = b) ≥ P(X_N = c_i | Y_N = b)

for all i = 1, ..., M. Equivalently,

c_j = h_N(b) ⇒ P(Y_N = b | X_N = c_j) P(X_N = c_j) ≥ P(Y_N = b | X_N = c_i) P(X_N = c_i)

for all i = 1, ..., M.

With ME-decoding, b is decoded as the codeword c_j which has greatest conditional probability of having been sent given that b is received. Hence,

h_N(b) ∈ arg max_{i=1,...,M} P(X_N = c_i | Y_N = b).

The ME decoding rule depends on the input distribution.

Definition 4.12. A decoding rule h_N : Y^N → C_N is called maximum likelihood rule (ML) if

c_j = h_N(b) ⇒ P(Y_N = b | X_N = c_j) ≥ P(Y_N = b | X_N = c_i)

for all i = 1, ..., M.

With ML-decoding, b is decoded as the codeword c_j which has greatest conditional probability of b being received given that c_j was sent. Hence,

h_N(b) ∈ arg max_{i=1,...,M} P(Y_N = b | X_N = c_i).
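As an illustration (not part of the notes), the following Python sketch implements ML and ME decoding for a toy DMC; the BSC parameters, the repetition code and the uniform prior are arbitrary choices made here for the example.

```python
from itertools import product
from math import prod

eps = 0.1
w = {(0, 0): 1 - eps, (0, 1): eps, (1, 0): eps, (1, 1): 1 - eps}   # BSC symbol probabilities

code = [(0, 0, 0), (1, 1, 1)]            # a toy (N = 3, M = 2) repetition code
prior = {code[0]: 0.5, code[1]: 0.5}     # input distribution on the codewords

def p_block(b, c):
    """P(Y_N = b | X_N = c) for a DMC: product of symbol transition probabilities."""
    return prod(w[(ci, bi)] for ci, bi in zip(c, b))

def ml_decode(b):
    return max(code, key=lambda c: p_block(b, c))

def me_decode(b):
    return max(code, key=lambda c: p_block(b, c) * prior[c])

for b in product([0, 1], repeat=3):
    print(b, ml_decode(b), me_decode(b))   # with a uniform prior the two rules coincide
```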

4.6 Error Probabilities

For a given code C_N = {c_1, ..., c_M},

– e_j(C_N) = P( h_N(Y_N) ≠ c_j | X_N = c_j )

is the probability of a decoding error for code word c_j.

– e(C_N) = Σ_{j=1}^M e_j(C_N) P(X_N = c_j)

is the error probability of the code C_N.

– ê(C_N) = max_{j=1,...,M} e_j(C_N)

is the maximum error probability.

4.7 Discrete Memoryless Channel

Definition 4.13. A discrete channel is called memoryless (DMC) if

N
 Y 
P YN = bN | Xn = aN = P Y1 = bi | X1 = ai
i=1

for all N ∈, aN = (a1 , . . . , aN ) ∈ X N , bN = (b1 , . . . , bN ) ∈ Y N .

Remark 4.14. From the above definition it follows that the channel
– is memoryless and nonanticipating,
– has the same symbol transition probabilities at each position,
– has block transition probabilities that depend only on the channel matrix.
Definition 4.15. Suppose a source produces R bits per second (rate R), hence NR bits in N seconds. The total number of messages in N seconds is 2^{NR} (rounded to an integer), and M codewords must be available for encoding all messages. Thus

M = 2^{NR} ⇐⇒ R = (log M) / N

(number of bits per channel use).

Lemma 4.16. (XN , YN ) is a DMC if and only if for all l = 1, . . . , N

P(Yl = bl | X1 = a1 , . . . , XN = aN , Y1 = b1 , . . . , Yl−1 = bl−1 ) = P(Y1 = bl | X1 = al ).

Proof. ” ⇐= ”: By the definition of conditional probability and the assumption, applied repeatedly,

P(YN = bN | XN = aN )
= P(YN = bN | XN = aN , YN−1 = bN−1 ) · P(YN−1 = bN−1 , XN = aN ) / P(XN = aN )
= P(Y1 = bN | X1 = aN ) · P(YN−1 = bN−1 | XN = aN )
= P(Y1 = bN | X1 = aN ) P(Y1 = bN−1 | X1 = aN−1 ) · P(YN−2 = bN−2 | XN = aN )
= . . .
= ∏_{i=1}^{N} P(Y1 = bi | X1 = ai )

” =⇒ ”: Conversely, for every l = 1, . . . , N ,

P(Yl = bl | XN = aN , Y1 = b1 , . . . , Yl−1 = bl−1 )
= P(Y1 = b1 , . . . , Yl = bl | XN = aN ) / P(Y1 = b1 , . . . , Yl−1 = bl−1 | XN = aN )
= P(Y1 = bl | X1 = al ),

where the last equality follows by applying the product formula to numerator and denominator (after marginalizing over the remaining output symbols).

Hence, if {(Xn , Yn )} forms a sequence of independent input–output pairs with the same channel behaviour at each position, then (XN , YN ) forms a DMC.

4.8 The Noisy Coding Theorem

Theorem 4.17. (Shannon 1949)
Given some discrete memoryless channel of capacity C. Let 0 < R < C and let MN ∈ ℕ be a sequence of integers such that

(log MN ) / N < R.

Then there exists a sequence of (N, MN )-codes CN with MN codewords of length N and a constant a > 0 such that

ê(CN ) ≤ e^{−N a} .

Hence, the maximum error probability tends to zero exponentially fast as the block length
N tends to infinity.
Example 4.18. Consider the BSC with ε = 0.03.

C = 1 + (1 − ε) log2 (1 − ε) + ε log2 ε = 0.8056

Choose R = 0.8. Then

(log2 MN ) / N < R ⇔ MN < 2^{N R} ,

hence choose

MN = ⌊2^{0.8 N}⌋ .

N                          10         20            30
|X^N| = 2^N                1 024      1 048 576     1.0737 · 10^9
MN = ⌊2^{0.8 N}⌋           256        65 536        16.777 · 10^6
Percentage of used
codewords                  25%        6.25%         1.56%
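A short computational sketch reproducing these numbers (function and variable names are ours, not from the notes):

import math

def bsc_capacity(eps):
    """C = 1 + (1 - eps)*log2(1 - eps) + eps*log2(eps) for a BSC."""
    return 1 + (1 - eps) * math.log2(1 - eps) + eps * math.log2(eps)

eps, R = 0.03, 0.8
print(round(bsc_capacity(eps), 4))        # 0.8056
for N in (10, 20, 30):
    MN = math.floor(2 ** (R * N))         # number of codewords actually used
    print(N, 2 ** N, MN, f"{100 * MN / 2 ** N:.2f}%")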

4.9 Converse of the Noisy Coding Theorem

Theorem 4.19. (Wolfowitz 1957)
Given some discrete memoryless channel of capacity C. Let R > C and let MN ∈ ℕ be a sequence of integers such that

(log MN ) / N > R.

Then for any sequence of (N, MN )-codes CN with MN codewords of length N it holds that

lim_{N→∞} e(CN ) = 1.

Hence, such codes tend to be fully unreliable.


Theorem 4.20. (Outline of the proof of Theorem 4.17) Use random coding, i.e., random codewords C1 , . . . , CM ∈ X^N , Ci = (Ci1 , . . . , CiN ), i = 1, . . . , M , with Cij ∈ X i.i.d. ∼ p(x), i = 1, . . . , M , j = 1, . . . , N .
(a)
Theorem 4.21. For a DMC with ML-decoding it holds for all 0 ≤ γ ≤ 1 and j = 1, . . . , M that

E(ej (C1 , . . . , CM )) ≤ (M − 1)^γ ( Σ_{j=1}^{d} ( Σ_{i=1}^{m} pi p1 (yj | xi )^{1/(1+γ)} )^{1+γ} )^N ,

where m and d denote the number of input and output symbols, respectively.

Proof. Set

G(γ, p) = − ln( Σ_{j=1}^{d} ( Σ_{i=1}^{m} pi p1 (yj | xi )^{1/(1+γ)} )^{1+γ} )

and R = (ln M)/N . Since (M − 1)^γ ≤ M^γ = e^{γ N R} , the bound above gives

E(ej (C1 , . . . , CM )) ≤ exp(−N (G(γ, p) − γR)).

Set G∗(R) = max_{0≤γ≤1} max_p {G(γ, p) − γR}.


(b)
Theorem 4.22. For a DMC with ML-decoding there exists a code c1 , . . . , cM ∈ X^N s.t.

ê(c1 , . . . , cM ) ≤ 4e^{−N G∗(R)} .
Proof. Use 2M random codewords. Then

(1/(2M)) Σ_{j=1}^{2M} E(ej (C1 , . . . , C2M )) ≤ e^{−N G∗((ln 2M)/N)} .

Hence there exists a sample c1 , . . . , c2M s.t.

(1/(2M)) Σ_{j=1}^{2M} ej (c1 , . . . , c2M ) ≤ e^{−N G∗((ln 2M)/N)} .        (∗)

Remove M codewords, namely those with

ek (c1 , . . . , c2M ) > 2e^{−N G∗((ln 2M)/N)} .

There are at most M of them, otherwise (∗) would be violated. Since removing codewords does not increase the error probabilities under ML-decoding and G∗((ln 2M)/N) ≥ G∗(R) − (ln 2)/N , it follows for the remaining codewords that

ej (ci1 , . . . , ciM ) ≤ 4e^{−N G∗(R)}    for all j = 1, . . . , M .

(c)
Theorem 4.23. If R = (ln M)/N < C, then

G∗(R) = max_p max_{0≤γ≤1} {G(γ, p) − γR}
      ≥ max_{0≤γ≤1} {G(γ, p∗) − γR} > 0,

where p∗ denotes the capacity-achieving distribution. For a detailed proof please refer to RM, pp. 103–114.
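The quantities G(γ, p) and G∗(R) from this outline can be evaluated numerically. A small sketch (helper names are ours; a simple grid search replaces the exact maximisation, and R is measured in nats to match R = (ln M)/N above):

import math

def G(gamma, p, W):
    """-ln( sum_j ( sum_i p_i * p1(y_j | x_i)^(1/(1+gamma)) )^(1+gamma) )."""
    total = 0.0
    for j in range(len(W[0])):
        inner = sum(p[i] * W[i][j] ** (1.0 / (1.0 + gamma)) for i in range(len(p)))
        total += inner ** (1.0 + gamma)
    return -math.log(total)

def G_star(R, p, W, steps=1000):
    """Approximate max over 0 <= gamma <= 1 of G(gamma, p) - gamma * R."""
    return max(G(k / steps, p, W) - (k / steps) * R for k in range(steps + 1))

eps = 0.03
W = [[1 - eps, eps], [eps, 1 - eps]]      # BSC(0.03), as in Example 4.18
p = [0.5, 0.5]                            # capacity-achieving input distribution
R = 0.8 * math.log(2)                     # 0.8 bits per channel use, in nats
print(G_star(R, p, W))                    # small but > 0 since R < C, so ê <= 4*exp(-N*G_star)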
5 Rate Distortion Theory

Motivation:
a) By the source coding theorem (Theorems 3.7 and 3.9), error-free / lossless encoding needs on average at least H(X) bits per symbol.
b) A signal is to be represented by bits. What is the minimum number of bits needed so that a given maximum distortion is not exceeded?
Example 5.1. a) Representing a real number by k bits: X = ℝ, X̂ = {(b1 , . . . , bk ) | bi ∈ {0, 1}}.
b) 1-bit quantization: X = ℝ, X̂ = {0, 1}.
Definition 5.2. A distortion measure is a mapping d : X × X̂ → ℝ+ .
Examples:
a) Hamming distance, X = X̂ = {0, 1}:

d(x, x̂) = 0 if x = x̂, and d(x, x̂) = 1 otherwise.

b) Squared error: d(x, x̂) = (x − x̂)² .


Definition 5.3. The distortion measure between sequences x^n , x̂^n is defined as

d(x^n , x̂^n ) = (1/n) Σ_{i=1}^{n} d(xi , x̂i ) .

Definition 5.4. A (2^{nR} , n) rate distortion code of rate R and block length n consists of an encoder

fn : X^n → {1, 2, . . . , 2^{nR} }

and a decoder

gn : {1, 2, . . . , 2^{nR} } → X̂^n .

The expected distortion of (fn , gn ) is

D = E d(X^n , X̂^n ) = E d(X^n , gn (fn (X^n ))).

Remarks:
a) X , X̂ are assumed to be finite.
b) 2^{nR} means ⌈2^{nR}⌉ if it is not an integer.


c) fn takes 2^{nR} different values; we need ≈ nR bits to represent each of them. Hence R = number of bits per source symbol needed to represent fn (X^n ).
d) D = E d(X^n , X̂^n ) = E d(X^n , gn (fn (X^n ))) = Σ_{x^n ∈ X^n} p(x^n ) d(x^n , gn (fn (x^n )))
e) {gn (1), . . . , gn (2^{nR} )} is called the codebook, while fn^{−1}(1), . . . , fn^{−1}(2^{nR} ) are called assignment regions.
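As a toy illustration of Definitions 5.3–5.4 and remark d) (a sketch with parameters chosen by us, not from the notes): a fair binary source, block length n = 2, rate R = 1/2, Hamming distortion, and a two-word codebook.

from itertools import product

n = 2                               # block length; 2**(n*R) = 2 indices for R = 1/2
codebook = {1: (0, 0), 2: (1, 1)}   # g_n: index -> reconstruction block

def f_n(x):
    """Encoder: map the source block to the index of the closest reconstruction."""
    return min(codebook, key=lambda idx: sum(a != b for a, b in zip(x, codebook[idx])))

def d_block(x, xhat):
    """Average per-symbol Hamming distortion between two blocks (Definition 5.3)."""
    return sum(a != b for a, b in zip(x, xhat)) / len(x)

# Expected distortion D = sum_x p(x) d(x, g_n(f_n(x))) for a fair Bernoulli source
D = sum(0.25 * d_block(x, codebook[f_n(x)]) for x in product((0, 1), repeat=n))
print(D)    # 0.25, while the rate 1/2 exceeds R(0.25) = 1 - H(0.25) ≈ 0.189 (cf. Theorem 5.9)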
The ultimate goal of lossy source coding is to
– minimise R for a given D or
– minimise D for a given R.
Definition 5.5. A rate distortion pair (R, D) is called achievable if there exists a sequence of (2^{nR} , n) rate distortion codes (fn , gn ) such that

lim_{n→∞} E d(X^n , gn (fn (X^n ))) ≤ D.

Definition 5.6. The rate distortion function is defined as

R(D) = inf{ R : (R, D) is achievable }.
Definition 5.7. The information rate distortion function RI (D) is defined as follows:

RI (D) = min_{p(x̂|x): Σ_{(x,x̂)} p(x,x̂) d(x,x̂) ≤ D} I(X; X̂)
       = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X; X̂) .

Compare with capacity:
– C: given the channel p(x̂ | x), maximize I(X; X̂) over the input distribution p(x);
– RI (D): given the source p(x), minimize I(X; X̂) over all “channels” p(x̂ | x) such that the expected distortion does not exceed D.
Theorem 5.8. a) RI (D) is a convex, non-increasing function of D.
b) RI (D) = 0 if D > D∗ , where D∗ = min_{x̂∈X̂} E d(X, x̂).
c) RI (0) ≤ H(X).
Proof. See Yeung, p. 198 ff., and Cover and Thomas, p. 316 ff.
Theorem 5.9. Let X ∈ {0, 1}, P(X = 0) = 1 − p, P(X = 1) = p, 0 ≤ p ≤ 1, and let d be the Hamming distance. Then

RI (D) = H(p) − H(D)   if 0 ≤ D ≤ min{p, 1 − p},
RI (D) = 0             otherwise.

Proof. W.l.o.g. assume p < 1/2, otherwise interchange 0 and 1. We have to determine

min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X; X̂) .

Assume D ≤ p < 1/2. Then

I(X; X̂) = H(X) − H(X | X̂)
= H(X) − H(X ⊕ X̂ | X̂)
≥ H(X) − H(X ⊕ X̂)
= H(p) − H(P(X ≠ X̂))
≥ H(p) − H(D),

where the last inequality uses P(X ≠ X̂) = E d(X, X̂) ≤ D ≤ 1/2 and the monotonicity of H on [0, 1/2].

This lower bound is attained by the following joint distribution of (X, X̂):

            X̂ = 0                     X̂ = 1                   Total
X = 0    (1−D)(1−p−D)/(1−2D)      D(p−D)/(1−2D)           1−p
X = 1    D(1−p−D)/(1−2D)          (1−D)(p−D)/(1−2D)       p
Total    (1−p−D)/(1−2D)           (p−D)/(1−2D)            1

This corresponds to a BSC from X̂ to X with crossover probability D.

It follows that

P(X ≠ X̂) = E d(X, X̂) = D(p − D)/(1 − 2D) + D(1 − p − D)/(1 − 2D) = D.
Further,

I(X; X̂) = H(X) − H(X | X̂)
= H(p) − [ H(X | X̂ = 0) P(X̂ = 0) + H(X | X̂ = 1) P(X̂ = 1) ]
= H(p) − [ H(D) P(X̂ = 0) + H(1 − D) P(X̂ = 1) ]
= H(p) − H(D),

using H(X | X̂ = 0) = H(D) and H(X | X̂ = 1) = H(1 − D) = H(D),


such that the lower bound is attained.

If D ≥ p, set P(X̂ = 0) = 1 and get the joint distribution

          X̂ = 0    X̂ = 1    Total
X = 0     1 − p     0        1 − p
X = 1     p         0        p
Total     1         0        1

Then E d(X, X̂) = P(X ≠ X̂) = P(X = 1) = p ≤ D and

I(X; X̂) = H(X) − H(X | X̂) = H(p) − H(X | X̂ = 0) · 1 = H(p) − H(p) = 0.

[Figure: plot of RI (D) for the Bin(1, 1/2) source, i.e. RI (D) = 1 − H(D) for 0 ≤ D ≤ 1/2 and RI (D) = 0 otherwise.]
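A small sketch evaluating the formula of Theorem 5.9 (helper names are ours, not from the notes):

import math

def H2(q):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R_I(D, p):
    """Rate distortion function of a Bernoulli(p) source under Hamming distortion."""
    return H2(p) - H2(D) if 0 <= D <= min(p, 1 - p) else 0.0

# Bin(1, 1/2) source: R_I(D) = 1 - H(D) for 0 <= D <= 1/2
for D in (0.0, 0.1, 0.25, 0.5):
    print(D, round(R_I(D, 0.5), 3))    # 1.0, 0.531, 0.189, 0.0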

Theorem 5.10. (converse to the rate distortion theorem )

R(D) ≥ RI (D)

Proof. Recall the general situation: X1 , . . . , Xn i.i.d. ∼ X ∼ p(x), x ∈ X , and X̂^n = gn (fn (X^n )) takes at most 2^{nR} values. Hence

H(X̂^n ) ≤ log 2^{nR} ≤ nR.



We first show: if (R, D) is achievable, then R ≥ RI (D). Suppose (R, D) is achievable. Then

nR ≥ H(X̂^n )
≥ H(X̂^n ) − H(X̂^n | X^n )
= I(X̂^n ; X^n ) = I(X^n ; X̂^n )
= H(X^n ) − H(X^n | X̂^n )
= Σ_{i=1}^{n} H(Xi ) − Σ_{i=1}^{n} H(Xi | X̂^n , X1 , . . . , Xi−1 )
≥ Σ_{i=1}^{n} I(Xi ; X̂i )
≥ Σ_{i=1}^{n} RI (E d(Xi , X̂i ))
= n · (1/n) Σ_{i=1}^{n} RI (E d(Xi , X̂i ))
≥ n RI ( (1/n) Σ_{i=1}^{n} E d(Xi , X̂i ) )
= n RI ( E d(X^n , X̂^n ) )
≥ n RI (D),

using the independence of the Xi , the fact that conditioning reduces entropy, the definition of RI , its convexity (Jensen's inequality) and its monotonicity. Hence R ≥ RI (D) for every achievable pair (R, D), that is, R(D) ≥ RI (D).


The reverse of the inequality in Theorem 5.10 also holds, which yields:
Theorem 5.11.
R(D) = RI (D)
Proof.
– R(D) ≥ RI (D): Theorem 5.10.
– R(D) ≤ RI (D): see Yeung, Section 9.5, pp. 206–212, and Cover and Thomas, Section 10.5, pp. 318–324.
