Information Theory
Contents

1 Introduction

3 Source Coding
  3.1 Variable Length Encoding
  3.2 Prefix Codes
  3.3 Kraft-McMillan Theorem
  3.4 Average Code Word Length
  3.5 Noiseless Coding Theorem
  3.6 Compact Codes
  3.7 Huffman Coding
  3.8 Block Codes for Stationary Sources
      3.8.1 Huffman Block Coding
  3.9 Arithmetic Coding

4 Information Channels
  4.1 Discrete Channel Model
  4.2 Channel Capacity
  4.3 Binary Channels
      4.3.1 Binary Symmetric Channel (BSC)
      4.3.2 Binary Asymmetric Channel (BAC)
      4.3.3 Binary Z-Channel (BZC)
      4.3.4 Binary Asymmetric Erasure Channel (BAEC)
  4.4 Channel Coding
  4.5 Decoding Rules
  4.6 Error Probabilities
  4.7 Discrete Memoryless Channel
  4.8 The Noisy Coding Theorem
  4.9 Converse of the Noisy Coding Theorem
1 Introduction
According to Merriam-Webster.com, "Information is any entity or form that provides the answer to a question of some kind or resolves uncertainty. It is thus related to data and knowledge, as data represents values attributed to parameters, and knowledge signifies understanding of real things or abstract concepts."
However, modern information theory is not a theory that deals with the above on such general grounds. Instead, information theory is a mathematical theory for modeling and analyzing how information is transferred. Its starting point is an article by Claude E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, 1948.
Quoting from the introduction of this article provides insight into the main focus of information theory: "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning. ... These semantic aspects of communication are irrelevant to the engineering problem. ... The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design."
Later, in 1964, a book by Claude E. Shannon and Warren Weaver with the slightly modified title "The Mathematical Theory of Communication" appeared at the University of Illinois Press, emphasizing the generality the theory had attained by then.
2 Fundamentals of Information Theory
In this chapter we provide basic concepts of information theory such as entropy, mutual information, and the Kullback-Leibler divergence. We also prove fundamental properties and some important inequalities relating these quantities. Only discrete random variables (r.v.) will be considered, denoted by capitals X, Y and Z and attaining values in a finite set only, the so-called support. Only the distribution of a r.v. will be relevant for what follows. Discrete distributions can be characterized by stochastic vectors, denoted by
p = (p_1, . . . , p_m),   p_i ≥ 0,   Σ_{i=1}^m p_i = 1.
To motivate a measure of uncertainty intuitively, consider two random experiments with four outcomes each and corresponding probabilities, for instance (1, 0, 0, 0) for the first and (1/4, 1/4, 1/4, 1/4) for the second. Certainly the result of the second experiment is more uncertain than that of the first one. On the other hand, having observed the outcome of the second experiment provides more information about the situation. In this sense, we treat information and uncertainty as equivalent descriptions of the same phenomenon.
Now, an appropriate measure of uncertainty was introduced by Shannon in his 1948 paper. He did
this axiomatically, essentially requiring three properties of such a measure and then necessarily
deriving entropy as introduced below.
We start by requesting that the information content of some event E shall only depend on its
probability p = P (E). Furthermore the information content is measured by some function h :
[0, 1] → R satisfying the following axioms.
The first axiom (i) requires that a small change in p results in a small change of the measure of its information content. Axiom (ii) says that for two independent events E_1 and E_2 with probabilities p and q, respectively, the intersection of both, i.e., the event that both occur at the same time, has information content h(p) + h(q). The information content shall hence be additive for independent events. Finally, by (iii) a certain normalization is fixed.
Now, if (i), (ii) and (iii) are satisfied by some information measure h, then necessarily h(p) = −c log p for some constant c > 0; the normalization (iii) fixes the constant, resp. the base of the logarithm. This leads to the following definition.
Definition 2.1. Let X be a discrete random variable with support X = {x_1, . . . , x_m} and distribution p = (p_1, . . . , p_m). Then
H(X) = −Σ_{i=1}^m p_i log p_i
is called the entropy of X.
Remark 2.2.
a) H(X) depends only on the distribution of X, not on the specific support.
b) If p_i = 0 for some i, we set p_i log p_i = 0. This convention is justified by continuity, since t log t → 0 as t → 0.
c) The base of the logarithm will be omitted in the following. After the base has been chosen,
it is considered to be fixed and constant throughout.
d) Let p(x) denote the probability mass function (pmf), also called the discrete density of X, i.e.,
p : X → [0, 1], x_i ↦ p(x_i) = p_i.
b) Let X ∼ U({1, . . . , m}), i.e., P(X = i) = 1/m for all i = 1, . . . , m. Then
H(X) = −Σ_{i=1}^m (1/m) log(1/m) = −log(1/m) = log m.
In particular, if m = 26, the size of the Latin alphabet, then H(X) = log_2 26 ≈ 4.7004 bits. Using the relative letter frequencies of English text instead of the uniform distribution one obtains
H(X) = −0.08167 log_2 0.08167 − · · · − 0.00074 log_2 0.00074 ≈ 4.219 < 4.7004.
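As an added illustration (not part of the original notes), the entropy of a finite distribution can be computed directly from the definition; the short Python sketch below reproduces log_2 26 ≈ 4.7004 bits for the uniform distribution over 26 letters. Names such as entropy and uniform_26 are ad hoc.

import math

def entropy(p, base=2.0):
    # H(p) = -sum_i p_i log p_i, using the convention 0 log 0 = 0
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

uniform_26 = [1.0 / 26] * 26
print(entropy(uniform_26))   # ~4.7004 bits = log2(26)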
The extension of the above definition to two-dimensional or even higher dimensional random
vectors and conditional distributions is obvious.
Theorem 2.5 (Chain rule). H(X, Y) = H(X) + H(Y | X).
Proof. Denote by p(x_i), p(x_i, y_j) and p(y_j | x_i) the corresponding probability mass functions. Writing log p(x_i, y_j) = log p(y_j | x_i) + log p(x_i), it holds that
H(X, Y) = −Σ_{i,j} p(x_i, y_j) log p(x_i, y_j)
        = −Σ_{i,j} p(x_i, y_j) log p(y_j | x_i) − Σ_i ( Σ_j p(x_i, y_j) ) log p(x_i)
        = H(Y | X) + H(X),
using Σ_j p(x_i, y_j) = p(x_i).
Theorem 2.6 (Jensen's inequality). If f is a convex function and X is a random variable, then
E f(X) ≥ f(E X).   (∗)
(∗) holds for any random variable (discrete, absolutely continuous, or other) as long as the expectations are defined. For a discrete random variable with distribution (p_1, . . . , p_m), (∗) reads as
Σ_{i=1}^m p_i f(x_i) ≥ f( Σ_{i=1}^m p_i x_i )   for all x_1, . . . , x_m ∈ dom(f).
Proof. We prove this for discrete distributions by induction on the number of mass points. The proof of the conditions for equality when f is strictly convex is left to the reader. For a two-mass-point distribution, the inequality becomes
p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),
which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with k − 1 mass points. Then, writing p'_i = p_i / (1 − p_k) for i = 1, . . . , k − 1, we have
Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p'_i f(x_i)
                     ≥ p_k f(x_k) + (1 − p_k) f( Σ_{i=1}^{k−1} p'_i x_i )
                     ≥ f( p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p'_i x_i )
                     = f( Σ_{i=1}^k p_i x_i ),
where the first inequality follows from the induction hypothesis and the second from the definition of convexity. The proof can be extended to continuous distributions by continuity arguments.
Prior to showing relations between the entropy concepts we consider some important inequalities.
Theorem 2.7 (Log-sum inequality). Let a_1, . . . , a_m and b_1, . . . , b_m be nonnegative numbers. Then
Σ_i a_i log(a_i / b_i) ≥ ( Σ_i a_i ) log( Σ_j a_j / Σ_j b_j ).
Proof. The function f(t) = t log t, t ≥ 0, is strictly convex, since f''(t) = 1/t > 0 for t > 0. Assume without loss of generality that a_i, b_i > 0. By convexity of f:
Σ_{i=1}^m α_i f(t_i) ≥ f( Σ_{i=1}^m α_i t_i ),   α_i ≥ 0, Σ_{i=1}^m α_i = 1.
Setting α_i = b_i / Σ_j b_j and t_i = a_i / b_i, it follows that
Σ_i (b_i / Σ_j b_j)(a_i / b_i) log(a_i / b_i) ≥ ( Σ_i (b_i / Σ_j b_j)(a_i / b_i) ) log( Σ_i (b_i / Σ_j b_j)(a_i / b_i) )
⇔ (1 / Σ_j b_j) Σ_i a_i log(a_i / b_i) ≥ ( Σ_i a_i / Σ_j b_j ) log( Σ_i a_i / Σ_j b_j )
⇔ Σ_i a_i log(a_i / b_i) ≥ ( Σ_i a_i ) log( Σ_j a_j / Σ_j b_j ).
P
Corollary 2.8. Let p = (p1 , ...., pm ), q = (qi ..., qm ) be stochastic vectors, i.e pi , qi ≥ 0,
P i pi =
i qi = 1. Then
Xm Xm
− pi log pi ≤ − pi log qi ,
i=1 i=1
P P
Proof. In theorem 2.7, set ai = pi , bi = qi and note that i pi = i qi = 1.
we have that this holds only if I(X; Y) = 0, that is, if X and Y are statistically independent.
c) i) By the chain rule (Theorem 2.5), H(X, Y) = H(X) + H(Y | X) ≥ H(X), since H(Y | X) ≥ 0, with "equality" from b) (ii).
   ii) From b), 0 ≤ H(X) − H(X | Y). Using the chain rule (Theorem 2.5) we can write H(X) − H(X | Y) = H(X) − [H(X, Y) − H(Y)]. Hence we get H(X) + H(Y) ≥ H(X, Y), with "equality" from b) (i).
d) H(X | Y, Z) = H(X | Z) − I(X; Y | Z) ≤ H(X | Z), since I(X; Y | Z) ≥ 0.
Interpretation: I(X; Y ) is the reduction in uncertainty about X when Y is given or the amount
of information about X provided by Y .
[Figure: Venn-diagram illustration relating H(X, Y), H(X) and H(Y).]
P (Y = 0|X = 0) = P (Y = 1|X = 1) = 1 − ε
P (Y = 0|X = 1) = P (Y = 1|X = 0) = ε.
Assume that the input symbols are uniformly distributed, P(X = 0) = P(X = 1) = 1/2. Then for the joint distribution: P(X = 0, Y = 0) = P(Y = 0 | X = 0) P(X = 0) = (1 − ε) · 1/2, and analogously for the other entries, that is
[Figure: transition diagram of the binary symmetric channel: 0 → 0 and 1 → 1 with probability 1 − ε, 0 → 1 and 1 → 0 with probability ε.]

Joint distribution P(X = x, Y = y):

            Y = 0          Y = 1
X = 0       (1 − ε)/2      ε/2          | 1/2
X = 1       ε/2            (1 − ε)/2    | 1/2
marginal    1/2            1/2

Further,
P(X = 0 | Y = 0) = P(X = 0, Y = 0) / P(Y = 0) = 1 − ε,
P(X = 1 | Y = 1) = 1 − ε,
P(X = 0 | Y = 1) = P(X = 1 | Y = 0) = ε.
Hence
H(X) = H(Y) = −(1/2) log(1/2) − (1/2) log(1/2) = 1 bit,
H(X, Y) = 1 − (1 − ε) log(1 − ε) − ε log ε,
H(X | Y) = H(Y | X) = −(1 − ε) log(1 − ε) − ε log ε,
0 ≤ I(X; Y) = 1 + (1 − ε) log(1 − ε) + ε log ε ≤ 1.
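As a small added illustration, the mutual information of the BSC can be evaluated numerically from the formula I(X; Y) = 1 + (1 − ε) log(1 − ε) + ε log ε = 1 − H(ε); the following Python sketch (names chosen ad hoc) prints it for a few crossover probabilities.

import math

def binary_entropy(eps):
    # H(eps) = -eps log2(eps) - (1-eps) log2(1-eps), with 0 log 0 = 0
    return -sum(q * math.log2(q) for q in (eps, 1.0 - eps) if q > 0)

for eps in (0.0, 0.1, 0.25, 0.5):
    print(f"eps = {eps:4.2f}:  I(X;Y) = {1.0 - binary_entropy(eps):.4f} bit")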
D(p‖q) measures the divergence (distance, dissimilarity) between the distributions p and q. However, it is not a metric: it is neither symmetric nor does it satisfy the triangle inequality. Intuitively, it measures how difficult it is for p to pretend to be q.
c) By definition.
Lemma 2.14. For any distributions p, q with support X = {x_1, . . . , x_m} and any stochastic matrix W = (p(y_j | x_i))_{i,j} ∈ R^{m×d},
D(p‖q) ≥ D(pW‖qW).
Proof. Let w_1, . . . , w_d be the columns of W, that is, W = (w_1, . . . , w_d). Using the log-sum inequality (Theorem 2.7),
Σ_i a_i log(a_i / b_i) ≥ ( Σ_i a_i ) log( Σ_j a_j / Σ_j b_j ),
with a_{ij} = p(x_i) p(y_j | x_i) and b_{ij} = q(x_i) p(y_j | x_i), we obtain
D(p‖q) = Σ_{i=1}^m p(x_i) log( p(x_i) / q(x_i) )
       = Σ_{i=1}^m Σ_{j=1}^d p(x_i) p(y_j | x_i) log( p(x_i) p(y_j | x_i) / ( q(x_i) p(y_j | x_i) ) )
       ≥ Σ_{j=1}^d (p w_j) log( p w_j / (q w_j) )
       = D(pW‖qW).
Proof. Let u = (1/m, . . . , 1/m) be the uniform distribution. Then
D(p‖u) = Σ_{i=1}^m p_i log( p_i / (1/m) ) = log m − H(p).
Hence, by Theorem 2.13 b), H(p) = log m − D(p‖u) ≤ log m.
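Both Corollary 2.8 and the identity H(p) = log m − D(p‖u) are easy to check numerically; the Python sketch below (an added illustration with ad hoc helper functions) does so for a randomly generated distribution.

import math, random

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    # D(p || q) = sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(m):
    w = [random.random() for _ in range(m)]
    return [x / sum(w) for x in w]

m = 5
p, q = random_distribution(m), random_distribution(m)
u = [1.0 / m] * m
print(kl_divergence(p, q) >= 0)                                        # Corollary 2.8
print(abs(entropy(p) - (math.log2(m) - kl_divergence(p, u))) < 1e-12)  # H(p) = log m - D(p||u)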
2.2 Inequalities
Definition 2.16. Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the joint probability mass function (discrete density) satisfies
p(x, y, z) = p(x) p(y | x) p(z | y).
Proof. a)
p(x, z | y) = p(x, z, y) / p(y) = p(x) p(y | x) p(z | y) / p(y)
            = p(x, y) p(z | y) / p(y) = p(x | y) p(z | y).
b) Since X and Z are conditionally independent given Y, we have I(X; Z | Y) = 0. Expanding I(X; (Y, Z)) by the chain rule in two ways gives I(X; Y) + I(X; Z | Y) = I(X; Z) + I(X; Y | Z). Since I(X; Y | Z) ≥ 0, we obtain
I(X; Y) ≥ I(X; Z).
Equality holds iff I(X; Y | Z) = 0, i.e., X → Z → Y. The inequality I(X; Z) ≤ I(Y; Z) is shown analogously.
Theorem 2.19 (Fano's inequality). Assume X, Y are random variables with the same support X = {x_1, . . . , x_m}. Define P_e = P(X ≠ Y), the "error probability". Then
H(X | Y) ≤ H(P_e) + P_e log(m − 1).
Since H(P_e) ≤ log 2, this implies that P_e ≥ ( H(X | Y) − log 2 ) / log(m − 1).
Proof. We know:
1) H(X | Y) = Σ_{x≠y} p(x, y) log( 1 / p(x | y) ) + Σ_x p(x, x) log( 1 / p(x | x) ),
2) P_e log(m − 1) = Σ_{x≠y} p(x, y) log(m − 1),
3) H(P_e) = −P_e log P_e − (1 − P_e) log(1 − P_e),
4) ln t ≤ t − 1 for t > 0.
Using 1)-4) we obtain
H(X | Y) − P_e log(m − 1) − H(P_e)
  = Σ_{x≠y} p(x, y) log( P_e / ( (m − 1) p(x | y) ) ) + Σ_x p(x, x) log( (1 − P_e) / p(x | x) )
  ≤ log e [ P_e − P_e + (1 − P_e) − (1 − P_e) ] = 0.
Lemma 2.20. If X and Y are i.i.d. random variables with entropy H(X), then
P(X = Y) ≥ 2^{−H(X)}.
Proof. Let p(x) denote the p.m.f. of X. Use Jensen's inequality: f(t) = 2^t is a convex function. Hence, with Z = log p(X), we obtain E(2^Z) ≥ 2^{E(Z)}, that is,
P(X = Y) = Σ_x p(x)^2 = E( 2^{log p(X)} ) ≥ 2^{E log p(X)} = 2^{−H(X)}.
2.3 Information Measures for Random Sequences

Consider sequences of random variables X_1, X_2, . . ., denoted by X = {X_n}_{n∈N}. A naive approach to defining the entropy of X is H(X) = lim_{n→∞} H(X_1, . . . , X_n). In most cases this limit will be infinite. Instead, consider the entropy rate.
Definition 2.21. H_∞(X) = lim_{n→∞} (1/n) H(X_1, . . . , X_n) is called the entropy rate of X, provided the limit exists. H_∞(X) may be interpreted as the average uncertainty per symbol.
Example 2.22. a) Let X = {X_n}_{n∈N} be an i.i.d. sequence with H(X_i) < ∞. Then
H_∞(X) = lim_{n→∞} (1/n) H(X_1, . . . , X_n) = lim_{n→∞} (1/n) Σ_{i=1}^n H(X_i) = H(X_1).
b) Let X = {(X_n, Y_n)}_{n∈N} be an i.i.d. sequence of pairs with I(X_k; Y_k) < ∞. Then
I_∞(X; Y) = lim_{n→∞} (1/n) I(X_1, . . . , X_n; Y_1, . . . , Y_n)
          = lim_{n→∞} (1/n) Σ_{k=1}^n I(X_k; Y_k)
          = I(X_1; Y_1).
Going beyond i.i.d. sequences, we introduce the following notion: a sequence X = {X_n}_{n∈N} is called stationary if
P(X_1 = s_1, . . . , X_n = s_n) = P(X_{1+t} = s_1, . . . , X_{n+t} = s_n)
for all s_1, . . . , s_n ∈ X, n ∈ N, t ∈ N. For stationary sequences all marginal distributions P(X_i) are the same.
Definition 2.25.
a) X = {X_n}_{n∈N_0} is called a Markov chain (MC) with state space X = {x_1, . . . , x_m} if
P(X_{n+1} = x_j | X_n = x_i, X_{n−1} = x_{i_{n−1}}, . . . , X_0 = x_{i_0}) = P(X_{n+1} = x_j | X_n = x_i) = p_{ij}
for all n and all states; the p_{ij} are called the transition probabilities and p_i(0) = P(X_0 = x_i) the initial distribution.
Assume that the random walk starts at time 0 with the stationary distribution p_i(0) = p*_i, i = 1, . . . , m. Then X = {X_n}_{n∈N_0} is a stationary sequence (MC) and its entropy rate is
H_∞(X) = −Σ_{i,j} p_i(0) p_{ij} log p_{ij} = −Σ_{i,j} p*_i p_{ij} log p_{ij}.
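As an added numerical illustration (with an arbitrarily chosen transition matrix), the entropy rate of a finite Markov chain can be computed by first determining its stationary distribution and then evaluating the formula above.

import numpy as np

def entropy_rate(P):
    # stationary distribution p*: left eigenvector of P for eigenvalue 1
    evals, evecs = np.linalg.eig(P.T)
    stat = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    stat = stat / stat.sum()
    logP = np.zeros_like(P)
    np.log2(P, out=logP, where=P > 0)          # log2 p_ij, leaving zero entries untouched
    return -np.sum(stat[:, None] * P * logP), stat

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
H_inf, p_star = entropy_rate(P)
print(H_inf, p_star)     # entropy rate in bits per symbol; here p* = [0.8, 0.2]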
2.4 Asymptotic Equipartition Property (AEP)

In information theory, the AEP is the analog of the law of large numbers (LLN).
AEP: let X_1, X_2, . . . be discrete i.i.d. with joint pmf p^(n)(x_1, . . . , x_n). Then
(1/n) log( 1 / p^(n)(X_1, . . . , X_n) ) is "close to" H(X) as n → ∞.
Thus
p^(n)(X_1, . . . , X_n) is "close to" 2^{−nH(X)} as n → ∞.
"Close to" must be made precise.
Consequence: existence of the typical set, with sample entropy close to the true entropy, and the nontypical set, which contains all other sequences.
Definition 2.29. A sequence of random variables X_n is said to converge to a random variable X
(i) in probability if for all ε > 0, P(|X_n − X| > ε) → 0 as n → ∞,
(ii) in mean square if E(X_n − X)^2 → 0 as n → ∞,
(iii) with probability 1 (or almost surely) if P(lim_{n→∞} X_n = X) = 1.
Theorem 2.30. Let {X_n} be i.i.d. discrete random variables, X_i ∼ X, with support X, and let p^(n) denote the joint pmf of (X_1, . . . , X_n). Then −(1/n) log p^(n)(X_1, . . . , X_n) → H(X) in probability as n → ∞.
Proof. The random variables Y_i = log p(X_i) are also i.i.d. By the weak law of large numbers,
−(1/n) log p^(n)(X_1, . . . , X_n) = −(1/n) Σ_{i=1}^n log p(X_i) → −E log p(X_1) = H(X)
in probability.
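The convergence asserted in Theorem 2.30 can be observed empirically; the following Python sketch (an added illustration with an arbitrarily chosen four-point distribution) compares the normalized sample log-probability with H(X) = 1.75 bits.

import math, random

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(q * math.log2(q) for q in p.values())       # H(X) = 1.75 bits

random.seed(0)
symbols, weights = list(p), list(p.values())
for n in (10, 100, 10_000):
    sample = random.choices(symbols, weights=weights, k=n)
    sample_entropy = -sum(math.log2(p[s]) for s in sample) / n
    print(n, round(sample_entropy, 4), "vs", H)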
Definition 2.31. The typical set A_ε^(n) is
A_ε^(n) = { (x_1, . . . , x_n) ∈ X^n : 2^{−n(H(X)+ε)} ≤ p^(n)(x_1, . . . , x_n) ≤ 2^{−n(H(X)−ε)} }.
Theorem 2.32.
a) If (x_1, . . . , x_n) ∈ A_ε^(n), then H(X) − ε ≤ −(1/n) log p^(n)(x_1, . . . , x_n) ≤ H(X) + ε.
b) P(A_ε^(n)) > 1 − ε for n sufficiently large.
c) |A_ε^(n)| ≤ 2^{n(H(X)+ε)} (|·| denotes cardinality).
d) |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)} for n sufficiently large.
Proof. a) Obvious from the definition.
b) Follows from Theorem 2.30.
c) 1 = Σ_{x∈X^n} p^(n)(x) ≥ Σ_{x∈A_ε^(n)} p^(n)(x) ≥ Σ_{x∈A_ε^(n)} 2^{−n(H(X)+ε)} = 2^{−n(H(X)+ε)} |A_ε^(n)|.
d) For sufficiently large n, P(A_ε^(n)) > 1 − ε, hence 1 − ε < P(A_ε^(n)) ≤ Σ_{x∈A_ε^(n)} 2^{−n(H(X)−ε)} = 2^{−n(H(X)−ε)} |A_ε^(n)|.
(n)
For given > 0 and sufficiently large n. X n decomposes into a set T = A (typical set) such
that
• P ((X1 , ..., Xn ) ∈ T c ) ≥
• For all x = (x1 , ..., xn ) ∈ T :
1
| − log p(n) (x1 , ..., xn ) − H(X) |≤
n
the normalized log-prob of all sequences in T is nearly equal and close to H(X).
Graphically:
Let X_1, . . . , X_n be i.i.d. with support X and X^(n) = (X_1, . . . , X_n). The aim is to find a short description/encoding of all values x^(n) = (x_1, . . . , x_n) ∈ X^n. The key idea is index coding: allocate to each of the |X^n| values an index.
• By Theorem 2.32 c), |A_ε^(n)| ≤ 2^{n(H(X)+ε)}. Indexing all x^(n) ∈ A_ε^(n) hence requires at most n(H(X) + ε) + 1 bits (1 bit extra since n(H(X) + ε) may not be an integer).
• Indexing all x^(n) ∉ A_ε^(n) requires at most n log|X| + 1 bits. Prefixing each index by one flag bit indicating whether the sequence is typical gives code word lengths of at most n(H(X) + ε) + 2 and n log|X| + 2, respectively.
For the expected code word length l(X^(n)) this yields
E( l(X^(n)) ) ≤ P(X^(n) ∈ A_ε^(n)) ( n(H(X) + ε) + 2 ) + P(X^(n) ∉ A_ε^(n)) ( n log|X| + 2 )
             ≤ n(H(X) + ε) + ε n log|X| + 2
             = n( H(X) + ε + ε log|X| + 2/n )
             = n( H(X) + ε' )
for any ε' > 0 with n sufficiently large. It follows:
Theorem 2.33. Let {X_n} be i.i.d. For any ε > 0 there exist n ∈ N and a binary code that maps each X^(n) one-to-one onto a binary string satisfying
E( (1/n) l(X^(n)) ) ≤ H(X) + ε.
Hence, for sufficiently large n there exists a code for X^(n) such that the expected average codeword length per symbol is arbitrarily close to H(X).
Remark 2.34. Up to now, entropy has been defined for discrete random variables with finite support. Extension: for a discrete random variable with countably many support points X = {x_1, x_2, . . .} and distribution p = (p_1, p_2, . . .), set
H(X) = −Σ_{i=1}^∞ p_i log p_i.
Note: the sum may be infinite or may not even exist.

2.5 Differential Entropy

Important is the extension of entropy to random variables X with a density f.
Definition 2.35. Let X be absolutely continuous with density f(x). Then
h(X) = −∫_{−∞}^{∞} f(x) log f(x) dx
is called the differential entropy of X.
Example 2.36. b) Let X ∼ N(μ, σ^2) with density f(x) = (1/√(2πσ^2)) e^{−(x−μ)^2/(2σ^2)}, x ∈ R. Then
h(X) = (1/2) ln(2πeσ^2).
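The closed form h(X) = (1/2) ln(2πeσ^2) can be checked by Monte Carlo estimation of −E ln f(X); the Python sketch below is an added illustration with arbitrarily chosen parameters μ = 1, σ = 2.

import math, random

mu, sigma = 1.0, 2.0
random.seed(0)

def log_density(x):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

n = 200_000
samples = [random.gauss(mu, sigma) for _ in range(n)]
h_mc = -sum(log_density(x) for x in samples) / n           # Monte Carlo estimate of h(X)
h_exact = 0.5 * math.log(2 * math.pi * math.e * sigma**2)  # (1/2) ln(2 pi e sigma^2)
print(h_mc, h_exact)                                       # both ~2.112 nats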
Definition 2.37. a) Let X = (X_1, . . . , X_n) be a random vector with joint density f(x_1, . . . , x_n). Then
h(X_1, . . . , X_n) = −∫ · · · ∫ f(x_1, . . . , x_n) log f(x_1, . . . , x_n) dx_1 · · · dx_n
is called the joint differential entropy of X.
b) Let (X, Y) be a random vector with joint density f(x, y) and conditional density
f(x | y) = f(x, y) / f(y) if f(y) > 0,
and 0 otherwise. Then
h(X | Y) = −∫∫ f(x, y) log f(x | y) dx dy.
This implies
Theorem 2.45. Let X ∈ R^n be absolutely continuous with density f(x) and Cov(X) = C, with C positive definite. Then
h(X) ≤ (1/2) ln( (2πe)^n |C| ),
i.e., N_n(μ, C) has the largest entropy amongst all random variables with positive definite covariance matrix C.
Proof. W.l.o.g. assume EX = 0 (see Theorem 2.44). Let
q(x) = (2π)^{−n/2} |C|^{−1/2} exp{ −(1/2) x^T C^{−1} x }
be the density of N_n(0, C). Let X ∼ f(x) with EX = 0 and Cov(X) = E(XX^T) = ∫ xx^T f(x) dx. Then
h(X) = −∫ f(x) ln f(x) dx
     ≤ −∫ f(x) ln q(x) dx                                      (Corollary 2.41)
     = −∫ f(x) ln[ (2π)^{−n/2} |C|^{−1/2} exp{ −(1/2) x^T C^{−1} x } ] dx
     = ln( (2π)^{n/2} |C|^{1/2} ) + (1/2) ∫ x^T C^{−1} x f(x) dx
     = ln( (2π)^{n/2} |C|^{1/2} ) + (1/2) ∫ tr( C^{−1} x x^T ) f(x) dx
     = ln( (2π)^{n/2} |C|^{1/2} ) + (1/2) tr( C^{−1} ∫ x x^T f(x) dx )
     = ln( (2π)^{n/2} |C|^{1/2} ) + n/2
     = ln( (2πe)^{n/2} |C|^{1/2} ) = (1/2) ln( (2πe)^n |C| ).
3 Source Coding
Given a source alphabet X = {x_1, . . . , x_m} and a code alphabet Y = {y_1, . . . , y_d}, the aim is to find a code word formed over Y for each character x_1, . . . , x_m. In other words, each character x_i ∈ X is uniquely mapped onto a "word" over Y.
Definition 3.1. An injective mapping
g : X → ∪_{ℓ=0}^∞ Y^ℓ,  x_i ↦ g(x_i) = (w_{i1}, . . . , w_{in_i}),
is called a (variable length) encoding; n_i is the length of the code word g(x_i).
Example 3.2.

      g1    g2     g3    g4
a     1     1      0     0
b     0     10     10    01
c     1     100    110   10
d     00    1000   111   11

g1: no encoding (not injective); g2: encoding, words are separable; g3: encoding, shorter, words separable; g4: encoding, even shorter, but not separable.
Definition 3.3. A code g is called uniquely decodable (u.d.) if its extension to strings of source characters, (x_{i_1}, . . . , x_{i_k}) ↦ (g(x_{i_1}), . . . , g(x_{i_k})), is injective.
Example 3.4. Use the previous encoding g3:
a ↦ 0,  b ↦ 10,  c ↦ 110,  d ↦ 111.
Received string: 111100011011100010. Decoding proceeds code word by code word:
111 | 10 | 0 | 0 | 110 | 111 | 0 | 0 | 0 | 10
which yields d b a a c d a a a b.
Definition 3.5. A code is called prefix code, if no complete code word is prefix of some
other code word, i.e., no code word evolves from continuing some other.
Formally:
a ∈ Y^k is called a prefix of b ∈ Y^l, k ≤ l, if there is some c ∈ Y^{l−k} such that b = (a, c).
Theorem 3.6. Prefix codes are uniquely decodable.
– Prefix codes are easy to construct based on the code word lengths.
– Decoding of prefix codes is fast and requires no memory storage.
Next aim: characterize uniquely decodable codes by their code word lengths.
Theorem 3.7 (Kraft-McMillan). a) Every u.d. code over a code alphabet of size d with code word lengths n_1, . . . , n_m satisfies Σ_{j=1}^m d^{−n_j} ≤ 1.
b) Conversely, if positive integers n_1, . . . , n_m satisfy Σ_{j=1}^m d^{−n_j} ≤ 1, then there exists a u.d. code (even a prefix code) with code word lengths n_1, . . . , n_m.
Proof.
(a) Let g be a u.d. code with codeword lengths n_1, . . . , n_m and let r = max_i n_i be the maximum codeword length. For l ∈ N, l ≤ r, let β_l = |{ i : n_i = l }| be the number of codewords of length l. For every k ∈ N it holds that
( Σ_{j=1}^m d^{−n_j} )^k = ( Σ_{l=1}^r β_l d^{−l} )^k = Σ_{l=k}^{k·r} γ_l d^{−l}
with
γ_l = Σ_{1 ≤ i_1, . . . , i_k ≤ r, i_1 + · · · + i_k = l} β_{i_1} · · · β_{i_k},   l = k, . . . , k·r.
γ_l is the number of sequences of k source characters whose code words have total length l, and d^l is the number of all strings of length l over Y. Since g is u.d., each code string corresponds to at most one source sequence, hence
γ_l ≤ d^l
and
( Σ_{j=1}^m d^{−n_j} )^k ≤ Σ_{l=k}^{k·r} d^l d^{−l} = kr − k + 1 ≤ kr   for all k ∈ N.
Further,
Σ_{j=1}^m d^{−n_j} ≤ (kr)^{1/k} → 1   (k → ∞),
so that Σ_{j=1}^m d^{−n_j} ≤ 1.
Example 3.8.

      g3     g4
a     0      0
b     10     01
c     110    10
d     111    11
      u.d.   not u.d.

For g3: 2^{−1} + 2^{−2} + 2^{−3} + 2^{−3} = 1.
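The Kraft sums of this example are easily verified by machine; the following one-function Python sketch (added for illustration) evaluates Σ_j d^{−n_j} for the code word lengths of g3 and g4.

def kraft_sum(lengths, d=2):
    # sum_j d^(-n_j); <= 1 is necessary for unique decodability (Kraft-McMillan)
    return sum(d ** (-n) for n in lengths)

g3_lengths = [1, 2, 3, 3]     # a: 0, b: 10, c: 110, d: 111
g4_lengths = [1, 2, 2, 2]     # a: 0, b: 01, c: 10, d: 11
print(kraft_sum(g3_lengths))  # 1.0
print(kraft_sum(g4_lengths))  # 1.25 > 1, hence g4 cannot be uniquely decodable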
[Figure: binary code tree with leaves x_1, . . . , x_6, illustrating how a prefix code is constructed from prescribed code word lengths.]
Definition 3.9. Let the source be described by a random variable X with P(X = x_j) = p_j and let g be a code with code word lengths n_1, . . . , n_m. Then
n̄ = n̄(g) = Σ_{j=1}^m n_j p_j = Σ_{j=1}^m n_j P(X = x_j)
is called the average code word length of g.
Example 3.10.

       p_i    g2      g3
a      1/2    1       0
b      1/4    10      10
c      1/8    100     110
d      1/8    1000    111
n̄(g)          15/8    14/8

H(X) = 14/8.
Theorem 3.11 (Noiseless Coding Theorem, Shannon 1949). Let the random variable X describe a source with distribution P(X = x_i) = p_i, i = 1, . . . , m, and let the code alphabet Y = {y_1, . . . , y_d} have size d.
a) Every u.d. code g with code word lengths n_1, . . . , n_m satisfies n̄(g) ≥ H(X)/log d.
b) There exists a prefix code g with n̄(g) < H(X)/log d + 1.
Proof. a) Using ln x ≤ x − 1, x > 0, and the Kraft inequality,
H(X)/log d − n̄(g) = (1/log d) Σ_{j=1}^m p_j log( d^{−n_j} / p_j )
                   ≤ (log e / log d) Σ_{j=1}^m p_j ( d^{−n_j} / p_j − 1 )
                   = (log e / log d) ( Σ_{j=1}^m d^{−n_j} − Σ_{j=1}^m p_j ) ≤ 0.
b) Shannon-Fano coding. W.l.o.g. assume that p_j > 0 for all j. Choose integers n_j such that d^{−n_j} ≤ p_j < d^{−n_j+1} for all j. Then
Σ_{j=1}^m d^{−n_j} ≤ Σ_{j=1}^m p_j ≤ 1,
so by the Kraft-McMillan theorem there is a prefix code g with these code word lengths. Taking logarithms in p_j < d^{−n_j+1} gives log p_j < (log d)(−n_j + 1), hence
Σ_{j=1}^m p_j log p_j < (log d) Σ_{j=1}^m p_j (−n_j + 1),
equivalently,
H(X) > (log d) ( n̄(g) − 1 ).
Can the lower bound n̄(g) = H(X)/log d always be attained? No! Check the previous proof: equality holds if and only if p_j = d^{−n_j} for all j = 1, . . . , m.
Example 3.12. Consider binary codes, i.e., d = 2, and X = {a, b} with p_1 = 0.6, p_2 = 0.4. The shortest possible code is g(a) = (0), g(b) = (1), with n̄(g) = 1 > H(X) ≈ 0.971.
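The Shannon-Fano construction from part b) of the proof can be sketched in a few lines of Python (an added illustration): choose n_j = ⌈−log_d p_j⌉, which satisfies d^{−n_j} ≤ p_j < d^{−n_j+1}, and compare the resulting average length with the entropy.

import math

def shannon_fano_lengths(probs, d=2):
    # n_j = ceil(-log_d p_j); the epsilon guards against floating-point round-up
    return [math.ceil(-math.log(p, d) - 1e-12) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]            # distribution of Example 3.10
lengths = shannon_fano_lengths(probs)
avg_len = sum(p * n for p, n in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths, avg_len, H)                    # [1, 2, 3, 3], 1.75, 1.75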
Definition 3.13. Any code of shortest possible average code word length is called compact.
How to construct compact codes?
[Huffman construction: repeatedly merge the two smallest probabilities (0.05 + 0.05 = 0.1, 0.05 + 0.1 = 0.15, 0.1 + 0.1 = 0.2, 0.15 + 0.15 = 0.3, 0.2 + 0.2 = 0.4, 0.3 + 0.3 = 0.6, 0.4 + 0.6 = 1.0), labelling the two branches of each merge with 1 and 0. The resulting code words:]

a    0.05    01111
b    0.05    01110
c    0.05    0110
d    0.1     111
e    0.1     110
f    0.15    010
g    0.2     10
h    0.3     00
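A compact binary Huffman coder fits in a few lines; the Python sketch below (an added illustration) merges the two smallest probabilities with a heap and reproduces the optimal average code word length 2.75 for the distribution above (individual code words may differ from the table, since ties can be broken differently).

import heapq
from math import log2

def huffman_code(probs):
    # probs: dict symbol -> probability; returns dict symbol -> binary code word
    heap = [[p, i, {s: ""}] for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)       # the two smallest probabilities
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, [p0 + p1, counter, merged])
        counter += 1                           # tie-breaker, avoids comparing dicts
    return heap[0][2]

probs = {"a": 0.05, "b": 0.05, "c": 0.05, "d": 0.1,
         "e": 0.1, "f": 0.15, "g": 0.2, "h": 0.3}
code = huffman_code(probs)
avg = sum(probs[s] * len(w) for s, w in code.items())
H = -sum(p * log2(p) for p in probs.values())
print(code)
print(avg, H)      # average length 2.75 vs entropy ~2.71 bits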
Huffman codes are optimal, i.e., they have the shortest average codeword length. We consider the case d = 2.
Lemma 3.14. Let X = {x_1, . . . , x_m} with probabilities p_1 ≥ · · · ≥ p_m > 0. There exists an optimal binary prefix code g with codeword lengths n_1, . . . , n_m such that
(i) n_1 ≤ · · · ≤ n_m,
(ii) n_{m−1} = n_m,
(iii) g(x_{m−1}) and g(x_m) differ only in the last position.
Proof. (i) Suppose that p_i > p_j but n_i > n_j for some i < j in an optimal prefix code g. Exchanging the code words of x_i and x_j yields a prefix code g' with
n̄(g') − n̄(g) = p_i n_j + p_j n_i − p_i n_i − p_j n_j = (p_i − p_j)(n_j − n_i) < 0,
contradicting the optimality of g.
(ii) There is an optimal prefix code g with n_1 ≤ · · · ≤ n_m. If n_{m−1} < n_m, delete the last n_m − n_{m−1} positions of g(x_m) to obtain a better code.
(iii) If n_1 ≤ · · · ≤ n_{m−1} = n_m for an optimal prefix code g and g(x_{m−1}) and g(x_m) differ by more than the last position, delete the last position in both to obtain a better code.
Lemma 3.15. Let X = {x_1, . . . , x_m} with probabilities p_1 ≥ · · · ≥ p_m > 0, and let X' = {x'_1, . . . , x'_{m−1}} with probabilities p'_i = p_i, i = 1, . . . , m − 2, and p'_{m−1} = p_{m−1} + p_m. Let g' be an optimal prefix code for X' with code words g'(x'_i), i = 1, . . . , m − 1. Then the code g defined by
g(x_i) = g'(x'_i),              i = 1, . . . , m − 2,
g(x_{m−1}) = (g'(x'_{m−1}), 0),
g(x_m)     = (g'(x'_{m−1}), 1)
is an optimal prefix code for X.
Proof. With n'_j denoting the code word lengths of g',
n̄(g) = Σ_{j=1}^{m−2} p_j n'_j + (p_m + p_{m−1})(n'_{m−1} + 1)
     = Σ_{j=1}^{m−2} p'_j n'_j + p'_{m−1}(n'_{m−1} + 1)
     = Σ_{j=1}^{m−1} p'_j n'_j + p_{m−1} + p_m = n̄(g') + p_{m−1} + p_m.
Assume g is not optimal for X. Then there exists an optimal prefix code h with properties (i)-(iii) of Lemma 3.14 and n̄(h) < n̄(g). Set
h'(x'_j) = h(x_j),  j = 1, . . . , m − 2,
h'(x'_{m−1}) = h(x_{m−1}) with the last position deleted.
Then n̄(h') + p_{m−1} + p_m = n̄(h) < n̄(g) = n̄(g') + p_{m−1} + p_m. Hence n̄(h') < n̄(g'), contradicting the optimality of g'.
Encode blocks/words of length N by words over the code alphabet Y. Assume that the blocks are generated by a stationary source, i.e., a stationary sequence of random variables {X_n}_{n∈N}. Notation for a block code:
g^(N) : X^N → ∪_{ℓ=0}^∞ Y^ℓ.
Block codes are "normal" variable length codes over the extended alphabet X^N. A fair measure of the "length" of a block code is the average code word length per character, n̄(g^(N))/N.
Applying the Noiseless Coding Theorem to the source X^N gives:
a) Each u.d. block code g^(N) satisfies
n̄(g^(N))/N ≥ H(X_1, . . . , X_N) / (N log d).
b) Conversely, there is a prefix block code, hence a u.d. block code g^(N), with
n̄(g^(N))/N ≤ H(X_1, . . . , X_N) / (N log d) + 1/N.
Hence, for a sequence of such codes,
lim_{N→∞} n̄(g^(N))/N = H_∞(X) / log d.
In principle, Huffman encoding can be applied to block codes. However, problems include
– The size of the Huffman table is m^N, thus growing exponentially with the block length.
– The code table needs to be transmitted to the receiver.
– The source statistics are assumed to be stationary. No adaptivity to changing probabilities.
– Encoding and decoding only per block. Delays occur at the beginning and end. Padding
may be necessary.
Assume that:
– the message (x_{i_1}, . . . , x_{i_N}), x_{i_j} ∈ X, j = 1, . . . , N, is generated by some source {X_n}_{n∈N},
– all (conditional) probabilities p(i_n | i_1, . . . , i_{n−1}) are available to encoder and decoder.
Start with the intervals
I(j) = [ c(j), c(j + 1) ),   c(j) = Σ_{i=1}^{j−1} p(i),   j = 1, . . . , m
(cumulative probabilities). Recursion over n = 2, . . . , N, where c(i_1, . . . , i_{n−1}) denotes the left endpoint of I(i_1, . . . , i_{n−1}):
I(i_1, . . . , i_n) = [ c(i_1, . . . , i_{n−1}) + p(i_1, . . . , i_{n−1}) Σ_{i=1}^{i_n − 1} p(i | i_1, . . . , i_{n−1}),
                       c(i_1, . . . , i_{n−1}) + p(i_1, . . . , i_{n−1}) Σ_{i=1}^{i_n} p(i | i_1, . . . , i_{n−1}) ).
Example 3.17. In the first step the unit interval [0, 1) is partitioned into intervals of lengths p(1), p(2), . . . , p(m). Encode the message (x_{i_1}, . . . , x_{i_N}) by the binary representation of some number in the interval I(i_1, . . . , i_N). The probability of occurrence of the message (x_{i_1}, . . . , x_{i_N}) is equal to the length of the representing interval, so that approximately
−log_2 p(i_1, . . . , i_N)
bits suffice to specify such a number. For instance, for an i.i.d. source over X = {a, b, c, d} with probabilities 0.3, 0.4, 0.1, 0.2, the interval of b is subdivided in the second step into the intervals of ba, bb, bc, bd with lengths 0.12, 0.16, 0.04, 0.08.
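For an i.i.d. source the recursion simplifies, since p(i_n | i_1, . . . , i_{n−1}) = p(i_n); the following Python sketch (an added illustration using the alphabet of the example) computes the interval I(i_1, . . . , i_N) of a message and the approximate number of bits needed to identify a point in it.

import math

probs = {"a": 0.3, "b": 0.4, "c": 0.1, "d": 0.2}
cum, acc = {}, 0.0
for s in probs:               # c(j): cumulative probability of the preceding symbols
    cum[s] = acc
    acc += probs[s]

def interval(message):
    low, width = 0.0, 1.0
    for s in message:         # refine [low, low + width) symbol by symbol
        low += width * cum[s]
        width *= probs[s]
    return low, low + width

lo, hi = interval("ba")
print(lo, hi)                            # [0.3, 0.42): the sub-interval of b with length 0.12
print(math.ceil(-math.log2(hi - lo)))    # ~ -log2 p(message) bits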
4 Information Channels
– Channel matrix W = (w_{ij}) ∈ R^{m×d} with
w_{ij} = P(Y = y_j | X = x_i),  i = 1, . . . , m, j = 1, . . . , d.
– Input distribution
P(X = x_i) = p_i,  i = 1, . . . , m,  p = (p_1, . . . , p_m).
Discrete channel model: the input X is sent, the channel disturbs it according to W, and the output Y is received.
where W is composed of the rows w_1, . . . , w_m, i.e., w_i denotes the i-th row of W:
W = (w_1; . . . ; w_m).
Lemma 4.1. Let X and Y be the input r.v. and the output r.v. of a discrete channel with channel matrix W, respectively, and denote the input distribution by P(X = x_i) = p_i, i = 1, . . . , m, with p = (p_1, . . . , p_m). Then
(a) H(Y) = H(pW),
(b) H(Y | X = x_i) = H(w_i),
(c) H(Y | X) = Σ_{i=1}^m p_i H(w_i).
Proof. For the output distribution,
P(Y = y_j) = Σ_{i=1}^m P(Y = y_j | X = x_i) P(X = x_i) = Σ_{i=1}^m p_i w_{ij} = (pW)_j,  j = 1, . . . , d,
from which (a)-(c) follow directly from the definitions.
The capacity-achieving input distribution can be characterized as follows: p achieves the capacity C = max_p I(p; W) if and only if there is a constant ζ such that D(w_i‖pW) = ζ for all i with p_i > 0 and D(w_i‖pW) ≤ ζ for all i with p_i = 0; in this case ζ = C.
Proof. The mutual information I(p; W) is a concave function of p. Hence the KKT conditions (cf., e.g., Boyd and Vandenberghe 2004) are necessary and sufficient for optimality of some input distribution p. Using the above representation, some elementary algebra shows that
∂/∂p_i I(p; W) = D(w_i‖pW) − 1.
Indeed,
∂/∂p_k H(pW) = ∂/∂p_k [ −Σ_j ( Σ_i p_i w_{ij} ) log( Σ_i p_i w_{ij} ) ]
             = −Σ_j [ w_{kj} log( Σ_i p_i w_{ij} ) + ( Σ_i p_i w_{ij} ) w_{kj} / ( Σ_i p_i w_{ij} ) ]
             = −Σ_j [ w_{kj} log( Σ_i p_i w_{ij} ) + w_{kj} ],
thus
∂/∂p_k I(p; W) = ∂/∂p_k H(pW) − ∂/∂p_k ( Σ_i p_i H(w_i) )
              = −Σ_j w_{kj} log( Σ_i p_i w_{ij} ) + Σ_j w_{kj} log w_{kj} − 1
              = Σ_j w_{kj} log( w_{kj} / Σ_i p_i w_{ij} ) − 1
              = D(w_k‖pW) − 1.
Proof. Assume now that the channel matrix W is square (m = d) and invertible. p is capacity-achieving iff D(w_i‖pW) = ζ for all i with p_i > 0. Let φ(q) = −q ln q, q ≥ 0, and let T = (t_{ki}) be the inverse of W, T = W^{−1}, so that T1_m = TW1_m = I1_m = 1_m. Then it holds that
ζ = D(w_i‖pW) = Σ_j w_{ij} ln( w_{ij} / Σ_{l=1}^m p_l w_{lj} )
  = −Σ_j [ w_{ij} ln( Σ_l p_l w_{lj} ) + φ(w_{ij}) ],   i = 1, . . . , m.
Multiplying by t_{ki} and summing over i yields, for each k (using Σ_i t_{ki} w_{ij} = δ_{kj} and Σ_i t_{ki} = 1),
ln( Σ_l p_l w_{lk} ) = −ζ − Σ_{i,j} t_{ki} φ(w_{ij}),
that is,
Σ_l p_l w_{lk} = (pW)_k = e^{−ζ} e^{−Σ_{i,j} t_{ki} φ(w_{ij})},  k = 1, . . . , m.   (**)
Summing (**) over k and using Σ_k (pW)_k = 1, it follows that
ζ = ln( Σ_k e^{−Σ_{i,j} t_{ki} φ(w_{ij})} ) = C   (capacity).
To determine p_s, multiply (**) by t_{ks} and sum over k:
Σ_k t_{ks} Σ_l p_l w_{lk} = Σ_k t_{ks} e^{−C} e^{−Σ_{i,j} t_{ki} φ(w_{ij})},
Σ_l p_l Σ_k w_{lk} t_{ks} = Σ_l p_l δ_{ls} = p_s,
hence
p_s = e^{−C} Σ_k t_{ks} e^{−Σ_{i,j} t_{ki} φ(w_{ij})},  s = 1, . . . , m   (capacity-achieving distribution).
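When the channel matrix is not square or not invertible, no closed form is available in general; the capacity can then be approximated numerically, e.g. with the Blahut-Arimoto algorithm, which is not derived in these notes. The Python sketch below is an added illustration; it uses the bounds I(p; W) ≤ C ≤ max_i D(w_i‖pW), consistent with the characterization above.

import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    # W: row-stochastic channel matrix (rows = input symbols)
    m = W.shape[0]
    p = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        q = p @ W                                             # output distribution pW
        d = np.array([np.sum(row[row > 0] * np.log(row[row > 0] / q[row > 0]))
                      for row in W])                          # D(w_i || pW) in nats
        mutual_info = float(p @ d)                            # I(p; W) <= C
        if np.max(d) - mutual_info < tol:                     # max_i D(w_i || pW) >= C
            break
        p *= np.exp(d - d.max())                              # Blahut-Arimoto update
        p /= p.sum()
    return mutual_info / np.log(2), p                         # capacity in bits, p*

W = np.array([[0.9, 0.1],                                     # BSC with epsilon = 0.1
              [0.1, 0.9]])
C, p_star = blahut_arimoto(W)
print(C, p_star)    # ~0.5310 = 1 - H(0.1), p* ~ [0.5, 0.5]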
Example 4.7 (Binary asymmetric channel, BAC). The channel matrix is
W = ( 1−ε    ε  )
    (  δ    1−δ ),
i.e., the input 0 is received correctly with probability 1 − ε and flipped to 1 with probability ε, while the input 1 is received correctly with probability 1 − δ and flipped to 0 with probability δ.
The capacity-achieving input distribution p* = (p*_0, p*_1) satisfies D(w_0‖pW) = D(w_1‖pW). This is an equation in the variables p_0, p_1 which, jointly with the condition p_0 + p_1 = 1, has the solution
p*_0 = 1/(1 + b),   p*_1 = b/(1 + b),   (4.1)
with
b = ( aε − (1 − ε) ) / ( δ − a(1 − δ) )   and   a = exp( ( h(δ) − h(ε) ) / ( 1 − ε − δ ) ),
where h(ε) = H(ε, 1 − ε) denotes the entropy of (ε, 1 − ε).
Example 4.8. The so-called Z-channel is a special case of the BAC with ε = 0: the input 0 is always received correctly, while the input 1 is flipped to 0 with probability δ, i.e.,
W = ( 1     0  )
    ( δ    1−δ ).
Example 4.9 (Binary asymmetric erasure channel, BAEC). The output alphabet is {0, e, 1}, where e denotes an erasure. With the output symbols ordered as 0, e, 1, the channel matrix is
W = ( 1−ε    ε     0  )
    (  0     δ    1−δ ),
i.e., the input 0 is erased with probability ε, the input 1 is erased with probability δ, and no crossovers occur.
By 4.5 the capacity-achieving distribution p* = (p*_0, p*_1), p*_0 + p*_1 = 1, is given by the solution of D(w_0‖pW) = D(w_1‖pW), i.e.,
(1 − ε) log( (1 − ε) / ( p_0 (1 − ε) ) ) + ε log( ε / ( p_0 ε + p_1 δ ) )
   = δ log( δ / ( p_0 ε + p_1 δ ) ) + (1 − δ) log( (1 − δ) / ( p_1 (1 − δ) ) ).   (4.2)
Substituting x = p_0 / p_1 and setting p*_0 / p*_1 = x*, p*_0 + p*_1 = 1, equation (4.2) reads equivalently as an equation in the single variable x. By differentiating w.r.t. x it is easy to see that the right hand side is monotonically increasing, such that exactly one solution x*, and hence exactly one p* = (p*_0, p*_1), exists; it can be computed numerically.
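Equation (4.2) is straightforward to solve numerically; the Python sketch below (an added illustration with arbitrarily chosen ε = 0.1, δ = 0.3) finds the capacity-achieving input distribution of the BAEC by bisection on p_0.

import math

eps, delta = 0.1, 0.3

def gap(p0):
    # D(w_0 || pW) - D(w_1 || pW) for the BAEC as a function of p0 (p1 = 1 - p0)
    p1 = 1.0 - p0
    q = (p0 * (1 - eps), p0 * eps + p1 * delta, p1 * (1 - delta))
    d0 = (1 - eps) * math.log((1 - eps) / q[0]) + eps * math.log(eps / q[1])
    d1 = delta * math.log(delta / q[1]) + (1 - delta) * math.log((1 - delta) / q[2])
    return d0 - d1

lo, hi = 1e-9, 1.0 - 1e-9     # gap > 0 near p0 = 0 and gap < 0 near p0 = 1
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if gap(lo) * gap(mid) <= 0:
        hi = mid
    else:
        lo = mid
p0 = 0.5 * (lo + hi)
print(p0, 1.0 - p0)           # capacity-achieving input distribution (p0*, p1*)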
where X1 , . . . , XN ∈ X , Y1 , . . . , YN ∈ Y.
Only a subset of all possible blocks of length N is used as input, the channel code.
CN = {c1 , . . . , cM } ⊆ X N
[Figure: channel coding scheme — one of the code words c_1, . . . , c_M is transmitted over the channel with block transition probabilities p_N(b_N | a_N), and the received block b_N is mapped by the decoder h_N back to a code word.]
With ME decoding, b is decoded as the codeword c_j which has the greatest conditional probability of having been sent given that b is received. Hence,
h_N(b) ∈ argmax_{i=1,...,M} P( X_N = c_i | Y_N = b ).
With ML decoding, b is decoded as the codeword c_j which has the greatest conditional probability of b being received given that c_j was sent. Hence,
h_N(b) ∈ argmax_{i=1,...,M} P( Y_N = b | X_N = c_i ).
–
e(C_N) = Σ_{j=1}^M e_j(C_N) P(X_N = c_j)
is the error probability of the code C_N.
–
ê(C_N) = max_{j=1,...,M} e_j(C_N)
is the maximum error probability of C_N.
Definition 4.13. A discrete channel is called a discrete memoryless channel (DMC) if for all N, a_N ∈ X^N and b_N ∈ Y^N,
P( Y_N = b_N | X_N = a_N ) = Π_{i=1}^N P( Y_1 = b_i | X_1 = a_i ).
Remark 4.14. From the above definition it follows that the channel
– is memoryless and nonanticipating,
– has the same transition probabilities for symbols at each position,
– has block transition probabilities that depend only on the channel matrix.
Definition 4.15. Suppose a source produces R bits per second (rate R), hence NR bits in N seconds. Let the total number of messages in N seconds be 2^{NR} (rounded to an integer) and let M code words be available for encoding all messages. Then
M = 2^{NR} ⟺ R = (log M)/N
(number of bits per channel use).
Proof. "⟸":
P(Y_N = b_N | X_N = a_N)
 = P(Y_N = b_N | X_N = a_N, Y_{N−1} = b_{N−1}) · P(Y_{N−1} = b_{N−1}, X_N = a_N) / P(X_N = a_N)
 = P(Y_1 = b_N | X_1 = a_N) · P(Y_{N−1} = b_{N−1} | X_N = a_N)
 = P(Y_1 = b_N | X_1 = a_N) · P(Y_{N−1} = b_{N−1} | X_{N−1} = a_{N−1})
 = . . .
 = Π_{i=1}^N P(Y_1 = b_i | X_1 = a_i).
"⟹":
P(Y_l = b_l | X_N = a_N, Y_{l−1} = b_{l−1}) = P(Y_l = b_l | X_N = a_N) / P(Y_{l−1} = b_{l−1} | X_N = a_N)
 = P(Y_1 = b_l | X_1 = a_l),
where Y_l = b_l in the numerator refers to the block of the first l output symbols.
Hence, the maximum error probability tends to zero exponentially fast as the block length
N tends to infinity.
Example 4.18. Consider the BSC with ε = 0.03 and choose R = 0.8. Then
(log_2 M_N)/N < R ⟺ M_N < 2^{NR},
hence choose M_N = ⌊2^{0.8N}⌋.

N                          10       20          30
|X^N| = 2^N                1 024    1 048 576   1.0737 · 10^9
M_N = ⌊2^{0.8N}⌋           256      65 536      16.777 · 10^6
Percentage of used         25%      6.25%       1.56%
codewords
lim e(CN ) = 1.
N →∞
Proof. Set
G(γ, p) = −ln( Σ_{j=1}^d ( Σ_{i=1}^m p_i p_1(y_j | x_i)^{1/(1+γ)} )^{1+γ} )
and R = (ln M)/N. There are at most M of them, otherwise (∗) would be violated. For the remaining ones,
e_j(c_{i_1}, . . . , c_{i_M}) ≤ 4 e^{−N G*(R)}   for all j = 1, . . . , M.
(c)
Theorem 4.23. If R = (ln M)/N < C, then
where p* denotes the capacity-achieving distribution. (For a detailed proof please refer to [RM], pp. 103-114.)
5 Rate Distortion Theory
Motivation:
a) By the source coding theorem (Th. 3.7 and 3.9): error-free/lossless encoding needs on average at least H(X) bits per symbol.
b) A signal is to be represented by bits. What is the minimum number of bits needed so that a certain maximum distortion is not exceeded?
Example 5.1. a) Representing a real number by k bits: X = R, X̂ = {(b_1, . . . , b_k) | b_i ∈ {0, 1}}.
b) 1-bit quantization: X = R, X̂ = {0, 1}.
Definition 5.2. A distortion measure is a mapping d : X × X̂ → R_+.
Examples:
a) Hamming distance, X = X̂ = {0, 1}:
d(x, x̂) = 0 if x = x̂, and 1 otherwise.
Definition 5.4. A (2^{nR}, n) rate distortion code of rate R and block length n consists of an encoder
f_n : X^n → {1, 2, . . . , 2^{nR}}
and a decoder
g_n : {1, 2, . . . , 2^{nR}} → X̂^n.
The expected distortion of (f_n, g_n) is E d(X^n, g_n(f_n(X^n))).
Remarks:
a) X and X̂ are assumed to be finite.
b) 2^{nR} means ⌈2^{nR}⌉ if it is not an integer.
e) {g_n(1), . . . , g_n(2^{nR})} is called the codebook, while f_n^{−1}(1), . . . , f_n^{−1}(2^{nR}) are called assignment regions.
The ultimate goal of lossy source coding is to
– minimise R for a given D, or
– minimise D for a given R.
Definition 5.5. A rate distortion pair (R, D) is called achievable if there exists a sequence of (2^{nR}, n) rate distortion codes (f_n, g_n) such that
lim_{n→∞} E d(X^n, g_n(f_n(X^n))) ≤ D.
Definition 5.6. The rate distortion function R(D) is the infimum of all rates R such that (R, D) is achievable.
Definition 5.7. The information rate distortion function R_I(D) is defined as follows:
R_I(D) = min { I(X; X̂) : conditional distributions p(x̂ | x) with E d(X, X̂) ≤ D }.
For a Bernoulli(p) source (P(X = 1) = p ≤ 1/2) with Hamming distortion one obtains, for D < p, the lower bound I(X; X̂) ≥ H(p) − H(D) whenever E d(X, X̂) ≤ D. This lower bound is attained by the following joint distribution of (X, X̂).
              X̂ = 0                       X̂ = 1
X = 0         (1−D)(1−p−D)/(1−2D)         D(p−D)/(1−2D)         | 1−p
X = 1         D(1−p−D)/(1−2D)             (1−D)(p−D)/(1−2D)     | p
Total         (1−p−D)/(1−2D)              (p−D)/(1−2D)          | 1

Under this distribution P(X ≠ X̂ | X̂ = x̂) = D for both values of x̂, hence E d(X, X̂) = D and H(X | X̂) = H(D). It follows that
I(X; X̂) = H(X) − H(X | X̂) = H(p) − H(D).
Thus the lower bound is attained. If D ≥ p, set P(X̂ = 0) = 1, i.e., use the joint distribution

         X̂ = 0    X̂ = 1
X = 0    1−p      0        | 1−p
X = 1    p        0        | p
Total    1        0        | 1

Then E d(X, X̂) = P(X ≠ X̂) = P(X = 1) = p ≤ D and
I(X; X̂) = H(X) − H(X | X̂) = H(p) − H(X | X̂ = 0) · 1 = H(p) − H(p) = 0,
hence R_I(D) = 0 for D ≥ p.
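The resulting rate distortion function of the Bernoulli(p) source with Hamming distortion, R(D) = H(p) − H(D) for 0 ≤ D < p ≤ 1/2 and R(D) = 0 for D ≥ p, is easily tabulated; the Python sketch below is an added illustration with p = 0.25.

import math

def binary_entropy(q):
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def rate_distortion_bernoulli(p, D):
    # R(D) = H(p) - H(D) for 0 <= D < p, and 0 for D >= p
    return binary_entropy(p) - binary_entropy(D) if D < p else 0.0

p = 0.25
for D in (0.0, 0.05, 0.1, 0.2, 0.25, 0.4):
    print(D, round(rate_distortion_bernoulli(p, D), 4))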
We now show the converse part R(D) ≥ R_I(D). For any (2^{nR}, n) rate distortion code with expected distortion at most D,
nR ≥ H(X̂^n)
   ≥ H(X̂^n) − H(X̂^n | X^n)
   = I(X̂^n; X^n) = I(X^n; X̂^n)
   = H(X^n) − H(X^n | X̂^n)
   = Σ_{i=1}^n H(X_i) − Σ_{i=1}^n H(X_i | X̂^n, X_1, . . . , X_{i−1})
   ≥ Σ_{i=1}^n I(X_i; X̂_i)
   ≥ Σ_{i=1}^n R_I( E d(X_i, X̂_i) )
   = n · (1/n) Σ_{i=1}^n R_I( E d(X_i, X̂_i) )
   ≥ n R_I( (1/n) Σ_{i=1}^n E d(X_i, X̂_i) )
   = n R_I( E d(X^n, X̂^n) )
   ≥ n R_I(D),
using the convexity and monotonicity of R_I in the last two steps.
Together,
– R(D) ≥ R_I(D) (converse, shown above), and
– R(D) ≤ R_I(D) (achievability)
yield R(D) = R_I(D). See Yeung, Section 9.5, pp. 206-212, and Cover and Thomas, Section 10.5, pp. 318-324.