
Shannon’s Source Coding Theorem

Kim Boström∗

Institut für Physik, Universität Potsdam, 14469 Potsdam, Germany

∗ Electronic address: [email protected]

The idea of Shannon's famous source coding theorem [1] is to encode only typical messages. Since the typical messages form a tiny subset of all possible messages, we need fewer resources to encode them. We will show that the probability for the occurrence of non-typical strings tends to zero in the limit of large message lengths. Thus we have the paradoxical situation that although we "forget" to encode most messages, we lose no information in the limit of very long strings. In fact, we make use of redundancy, i.e. we do not encode "unnecessary" information represented by strings which almost never occur.

Recall that a random message of length N is a string x ≡ x_1 · · · x_N of letters, which are independently drawn from an alphabet A = {a_1, . . . , a_K} with a priori probabilities

    p(a_k) = p_k ∈ (0, 1],   k = 1, . . . , K,   (1)

where Σ_k p_k = 1. Each given string x of a random message is an instance or realization of the message ensemble X ≡ X_1 · · · X_N, where each random letter X_n is identical to a fixed letter ensemble X,

    X_n = X,   n = 1, . . . , N.   (2)

A particular message x = x_1 · · · x_N appears with the probability

    p(x_1 · · · x_N) = p(x_1) · · · p(x_N),   (3)

which expresses the fact that the letters are statistically independent of each other.
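To make the setup concrete, here is a minimal sketch in Python. The alphabet and the probabilities are illustrative choices of my own (not taken from the text); the sketch draws i.i.d. messages as in Eq. (2) and evaluates the product probability of Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Example letter ensemble (an assumption for illustration): A = {a, b, c}
letters = np.array(["a", "b", "c"])
probs = np.array([0.5, 0.3, 0.2])   # a priori probabilities p_k, summing to 1

def sample_message(N):
    """Draw a message x = x_1 ... x_N with i.i.d. letters, as in Eq. (2)."""
    return rng.choice(letters, size=N, p=probs)

def message_probability(x):
    """p(x_1 ... x_N) = p(x_1) * ... * p(x_N), Eq. (3)."""
    p_of = dict(zip(letters, probs))
    return float(np.prod([p_of[letter] for letter in x]))

x = sample_message(10)
print("message:", "".join(x), "  p(x) =", message_probability(x))
```

Because p(x) shrinks exponentially with N, one usually works with log-probabilities, which is exactly the bookkeeping the entropy performs below.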
[FIG. 1: Lossy coding. Schematic: the code words cover only the typical messages, which form a small subset of all possible messages.]

Now consider a very long message x. Typically, the letter a_k will appear with the frequency N_k ≈ N p_k. Hence, the probability of such a typical message is roughly

    p(x) ≈ p_typ ≡ p_1^{N p_1} · · · p_K^{N p_K} = ∏_{k=1}^{K} p_k^{N p_k}.   (4)

We see that the typical messages are uniformly distributed, each with probability p_typ. This indicates that the set T of typical messages has the size

    |T| ≈ 1 / p_typ.   (5)

If we encode each member of T by a binary string, we need

    I_N = log |T| = −N Σ_{k=1}^{K} p_k log p_k ≡ N H(X)   (6)

bits, where H(X) is the Shannon entropy of the letter ensemble. Thus for very long messages the average number of bits per letter reads

    I ≡ (1/N) I_N = H(X).   (7)

This is Shannon's source coding theorem in a nutshell.
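A quick numerical check of Eqs. (4), (6) and (7), again with an illustrative alphabet of my own choosing: the entropy fixes both the typical probability p_typ and the estimated size of T.

```python
import numpy as np

probs = np.array([0.5, 0.3, 0.2])   # example p_k (an assumption, not from the text)
N = 100                              # message length

H = -np.sum(probs * np.log2(probs))              # Shannon entropy in bits, Eq. (6)
log2_p_typ = np.sum(N * probs * np.log2(probs))  # log2 of p_typ, Eq. (4)

print(f"H(X)             = {H:.4f} bits/letter")
print(f"log2 p_typ       = {log2_p_typ:.1f}  (= -N*H = {-N*H:.1f})")
print(f"log2 |T|  (est.) = {-log2_p_typ:.1f}  ->  I_N ≈ N*H = {N*H:.1f} bits, Eq. (6)")
print(f"bits per letter  = {(-log2_p_typ)/N:.4f}, Eq. (7)")
```

Note that |T| itself is astronomically large (about 2^{NH}); only its logarithm, the bit count, is ever handled in practice.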
Now let us get a bit more into detail. In order to rigorously prove the theorem we need the concept of a random variable and the law of large numbers. Given the letter ensemble X, a function f : A → R defines a discrete, real random variable. The realizations of f(X) are the real numbers f(x), x ∈ A. The average of f(X) is defined as

    ⟨f(X)⟩ := Σ_{x∈A} p(x) f(x) = Σ_{k=1}^{K} p_k f(a_k),   (8)

and the variance is given by

    Δ²f(X) := ⟨f²(X)⟩ − ⟨f(X)⟩².   (9)

For the sequence f(X) ≡ f(X_1), . . . , f(X_N) we define its arithmetic average as

    A := (1/N) Σ_{n=1}^{N} f(X_n),   (10)

which is also a random variable. Since the X_n are identical copies of the letter ensemble X, the average of A is equal to the average of f(X),

    ⟨A⟩ = (1/N) Σ_{n=1}^{N} ⟨f(X_n)⟩ = ⟨f(X)⟩,   (11)
and the variance of A reads

    Δ²A = ⟨A²⟩ − ⟨A⟩²   (12)
        = (1/N²) Σ_{n,m} ⟨f(X_n) f(X_m)⟩ − (1/N²) Σ_{n,m} ⟨f(X_n)⟩⟨f(X_m)⟩   (13)
        = (1/N²) Σ_{n} [ ⟨f²(X_n)⟩ − ⟨f(X_n)⟩² ]   (14)
        = (1/N) Δ²f(X).   (15)

Step (14) uses the statistical independence of the X_n: all cross terms with n ≠ m cancel. The relative standard deviation of A thus reads

    ΔA / ⟨A⟩ = (1/√N) · Δf(X) / ⟨f(X)⟩.   (16)

Concluding, in the limit of large N the arithmetic average of the sequence f(X) and the ensemble average of f(X) coincide. This is the law of large numbers. It is responsible for the validity of statistical experiments. Without this law, we could never verify statistical properties of a system by performing many experiments. In particular, quantum mechanics would be free of any physical meaning.
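As a sanity check of Eq. (16), here is a small simulation (with an example alphabet and a function f chosen by me, not taken from the text) showing that the spread of the arithmetic average shrinks like 1/√N:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.5, 0.3, 0.2])      # example p_k (assumption)
f_vals = np.array([1.0, 2.0, 5.0])     # an arbitrary real function f(a_k) (assumption)

mean_f = np.sum(probs * f_vals)                  # <f(X)>, Eq. (8)
var_f = np.sum(probs * f_vals**2) - mean_f**2    # Delta^2 f(X), Eq. (9)

for N in [10, 100, 1000]:
    # Draw many messages and compute the arithmetic average A for each, Eq. (10)
    samples = rng.choice(f_vals, size=(2000, N), p=probs)
    A = samples.mean(axis=1)
    # Empirical spread of A vs. the prediction Delta f(X) / sqrt(N), Eqs. (15)-(16)
    print(f"N={N:5d}  std(A)={A.std():.4f}   predicted={np.sqrt(var_f / N):.4f}")
```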
Let us reformulate the law of large numbers in the ε, δ-language. For δ > 0 we define the typical set T of a random sequence X as the set of realizations x ≡ x_1 · · · x_N such that

    ⟨f(X)⟩ − δ ≤ (1/N) Σ_{n=1}^{N} f(x_n) ≤ ⟨f(X)⟩ + δ.   (17)

The law of large numbers implies that for every ε, δ > 0 there is a natural number N_0 such that for all N > N_0 the total probability of all typical sequences fulfills

    P_T ≡ Σ_{x∈T} p(x) ≥ 1 − ε.   (18)

The total probability P_T represents the probability for a randomly chosen sequence x to lie in the typical set T. Now consider the special random variable

    f(X) := − log p(X).   (19)

The average of f(X) equals the Shannon entropy of the ensemble X,

    ⟨f(X)⟩ = − Σ_{x∈A} p(x) log p(x) = H(X).   (20)

The typical set now contains all messages x whose probability fulfills

    H − δ ≤ − (1/N) Σ_{n=1}^{N} log p(x_n) ≤ H + δ,   (21)

or equivalently

    2^{−N(H+δ)} ≤ p(x) ≤ 2^{−N(H−δ)},   (22)

where H ≡ H(X). By the law of large numbers, the probability for a randomly drawn message x to be a member of T reads

    P_T ≡ Σ_{x∈T} p(x) ≥ 1 − ε.   (23)

If we encode only typical sequences, the probability of error

    P_err := 1 − P_T ≤ ε   (24)

can be made arbitrarily small by choosing N large enough.
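The typicality condition (21)-(22) and the error bound (24) can be checked empirically. In this sketch (alphabet, δ and message lengths are my own illustrative choices) we draw messages, test whether −(1/N) log2 p(x) lies within δ of H, and estimate P_T:

```python
import numpy as np

rng = np.random.default_rng(2)
probs = np.array([0.5, 0.3, 0.2])       # example p_k (assumption)
H = -np.sum(probs * np.log2(probs))     # entropy, Eq. (20)
delta = 0.05
K = len(probs)

for N in [50, 500, 5000]:
    # letter indices of many i.i.d. messages
    msgs = rng.choice(K, size=(1000, N), p=probs)
    # empirical -(1/N) log2 p(x) for each message, cf. Eq. (21)
    rate = -np.log2(probs[msgs]).mean(axis=1)
    typical = np.abs(rate - H) <= delta
    P_T = typical.mean()                # estimate of Eq. (23)
    print(f"N={N:5d}  P_T ≈ {P_T:.3f}   P_err ≈ {1 - P_T:.3f}   (cf. Eq. 24)")
```

As N grows, the empirical P_err drops toward zero, which is exactly what makes discarding the atypical messages harmless.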
Now let us determine how many typical sequences there are. The left-hand side of (22) gives

    p(x) ≥ 2^{−N(H+δ)}   (25)
    ⇔ Σ_{x∈T} p(x) ≥ |T| 2^{−N(H+δ)}.   (26)

Since the total probability Σ_{x∈T} p(x) cannot exceed 1, relation (26) implies |T| ≤ 2^{N(H+δ)}. The right-hand side of (22) gives

    p(x) ≤ 2^{−N(H−δ)}   (27)
    ⇔ Σ_{x∈T} p(x) ≤ |T| 2^{−N(H−δ)},   (28)

which together with (23) yields

    |T| 2^{−N(H−δ)} ≥ 1 − ε   (29)
    ⇔ |T| ≥ (1 − ε) 2^{N(H−δ)}.   (30)

Relations (26) and (30) can thus be combined into the crucial relation

    (1 − ε) 2^{N(H−δ)} ≤ |T| ≤ 2^{N(H+δ)}.   (31)

For N → ∞ we can choose ε, δ → 0 and obtain the desired expression

    |T| → 2^{N H(X)},   (32)

thus we need I_N → N H(X) bits to encode the message. Equivalently, the information content per letter reads I = H(X) bits.
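To get a feeling for the numbers in (31)-(32), here is a small worked comparison; the biased binary source is my own illustrative choice. The typical set is a vanishing fraction of all K^N messages, yet it carries essentially all the probability.

```python
import numpy as np

# Illustrative binary source (assumption): p(0) = 0.9, p(1) = 0.1
probs = np.array([0.9, 0.1])
H = -np.sum(probs * np.log2(probs))     # ≈ 0.469 bits per letter
N = 1000

log2_all = N * np.log2(len(probs))      # log2 of the number of all messages, K^N
log2_typ = N * H                        # log2 |T| for large N, Eq. (32)

print(f"H(X) ≈ {H:.3f} bits/letter")
print(f"all messages:      2^{log2_all:.0f}")
print(f"typical messages:  2^{log2_typ:.0f}   (Eqs. 31-32)")
print(f"fraction typical:  2^{log2_typ - log2_all:.0f}")
print(f"bits needed: I_N ≈ {log2_typ:.0f} instead of {log2_all:.0f}")
```

Compression to roughly 469 bits per 1000 letters is possible because the ≈ 2^{−531} fraction of messages that are typical carries nearly all the probability.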
Finally, let us investigate whether we can further improve the compression. Relation (30) gives a lower bound for the size of the typical set. Let us compress below H bits per letter by fixing some ε′ > 0 and encoding only sequences that lie in a "subtypical set" T′ ⊂ T whose size reads

    |T′| ≤ (1 − ε) 2^{N(H−δ−ε′)} < 2^{N(H−δ−ε′)}.   (33)

The right-hand side of (22) states that the probability of a typical sequence is bounded from above by

    p(x) ≤ p_max ≡ 2^{−N(H−δ)}.   (34)
If we encode only those typical sequences that lie in the subtypical set T′, the probability that a randomly drawn sequence lies in T′ fulfills

    P_{T′} = Σ_{x∈T′} p(x)   (35)
           ≤ |T′| · p_max = 2^{N(H−δ−ε′)} 2^{−N(H−δ)}   (36)
           = 2^{−Nε′}.   (37)

Because ε′ > 0, the probability of a successful encoding goes to 0 for N → ∞,

    P_{T′} → 0.   (38)

Concluding, if we compress the messages below N H(X) bits, we are not able to encode all typical messages, and for N → ∞ we lose all information.
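The bound (37) collapses very quickly. A one-line check with an illustrative choice of ε′ (my own numbers, not from the text):

```python
# Decay of the bound in Eq. (37): P_T' <= 2^(-N * eps'),
# for an illustrative choice eps' = 0.01 (assumption).
eps_prime = 0.01
for N in [100, 1_000, 10_000, 100_000]:
    print(f"N = {N:7d}   P_T' <= 2^{-N * eps_prime:.0f}")
```

Already at N = 10^4 the success probability is at most 2^{−100} ≈ 10^{−30}: compressing below the entropy rate fails essentially with certainty.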

A good review of the issue can also be found in [2, 3].

[1] C. E. Shannon and W. Weaver, A Mathematical Theory of Communication, The Bell System Technical Journal 27, 379–423, 623–656 (1948).
[2] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, http://wol.ra.phy.cam.ac.uk/mackay/itprnn/book.html (1995–2000).
[3] J. Preskill, Lecture notes, http://www.theory.caltech.edu/people/preskill/ph219/ (1997–1999).