A Mathematical Theory of Communication

By C. E. SHANNON

Vol. 27, pp. 379–423, 623–656, July, October, 1948. Reissued December, 1957.
Copyright 1948 by American Telephone and Telegraph Co. Printed in U. S. A.

INTRODUCTION
such as time, bandwidth, number of relays, etc., tend to vary linearly with
the logarithm of the number of possibilities. For example, adding one relay
to a group doubles the number of possible states of the relays. It adds 1
to the base 2 logarithm of this number. Doubling the time roughly squares
the number of possible messages, or doubles the logarithm, etc.
2. It is nearer to our intuitive feeling as to the proper measure. This is
closely related to (1) since we intuitively measure entities by linear com-
parison with common standards. One feels, for example, that two punched
cards should have twice the capacity of one for information storage, and two
identical channels twice the capacity of one for transmitting information.
3. It is mathematically more suitable. Many of the limiting operations
are simple in terms of the logarithm but would require clumsy restatement in
terms of the number of possibilities.
The choice of a logarithmic base corresponds to the choice of a unit for
measuring information. If the base 2 is used the resulting units may be
called binary digits, or more briefly bits, a word suggested by J. W. Tukey.
A device with two stable positions, such as a relay or a flip-flop circuit, can
store one bit of information. N such devices can store N bits, since the
total number of possible states is 2^N and \log_2 2^N = N. If the base 10 is
used the units may be called decimal digits. Since
\log_2 M = \log_{10} M / \log_{10} 2 = 3.32 \log_{10} M,

a decimal digit is about 3 1/3 bits. A digit wheel on a desk computing machine
has ten stable positions and therefore has a storage capacity of one decimal
digit. In analytical work where integration and differentiation are involved
the base e is sometimes useful. The resulting units of information will be
called natural units. Change from the base a to base b merely requires
multiplication by logb a.
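As a small illustration of these unit conversions, the following sketch (not from the paper; the function name and the sample figures are my own) converts an amount of information between logarithmic bases:

    # Illustrative sketch: converting an amount of information between bits,
    # decimal digits, and natural units. A change from base a to base b
    # multiplies the measure by log_b(a).
    import math

    def convert(amount, base_from, base_to):
        """Convert an information measure from one logarithmic base to another."""
        return amount * math.log(base_from, base_to)

    # One decimal digit expressed in bits: log2(10) ~ 3.32 bits.
    print(convert(1, 10, 2))   # ~ 3.32
    # 16 two-state relays store 16 bits, i.e. about 4.8 decimal digits.
    print(convert(16, 2, 10))  # ~ 4.82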
By a communication system we will mean a system of the type indicated
schematically in Fig. 1. It consists of essentially five parts:
1. An information source which produces a message or sequence of mes-
sages to be communicated to the receiving terminal. The message may be
of various types: e.g. (a) A sequence of letters as in a telegraph or teletype
system; (b) A single function of time f(t) as in radio or telephony; (c) A
function of time and other variables as in black and white television. Here
the message may be thought of as a function f(x, y, t) of two space coordi-
nates and time, the light intensity at point (x, y) and time t on a pickup tube
plate; (d) Two or more functions of time, say f(t), g(t), h(t). This is the
case in “three dimensional” sound transmission or if the system is intended
to service several individual channels in multiplex; (e) Several functions of
[Fig. 1 - Schematic diagram of a general communication system. The figure shows: information source → transmitter → signal → received signal → receiver → destination, with a noise source acting between transmitter and receiver; the message enters the transmitter and leaves the receiver.]
In the more general case with different lengths of symbols and constraints
on the allowed sequences, we make the following definition:
Definition: The capacity C of a discrete channel is given by

C = \lim_{T \to \infty} \frac{\log N(T)}{T}

where N(T) is the number of allowed signals of duration T.
a space was the last symbol transmitted. If so then only a dot or a dash
can be sent next and the state always changes. If not, any symbol can be
transmitted and the state changes if a space is sent, otherwise it remains
the same. The conditions can be indicated in a linear graph as shown in
Fig. 2. The junction points correspond to the states and the lines indicate
the symbols possible in a state and the resulting state. In Appendix 1 it is
shown that if the conditions on allowed sequences can be described in this
form C will exist and can be calculated in accordance with the following
result:
Theorem 1: Let b_{ij}^{(s)} be the duration of the s-th symbol which is allowable in
state i and leads to state j. Then the channel capacity C is equal to \log W
where W is the largest real root of the determinant equation:

\left| \sum_s W^{-b_{ij}^{(s)}} - \delta_{ij} \right| = 0.
[Fig. 2 - Linear graph of the constraints on telegraph symbols; the lines between the two states are labeled DOT, DASH, and WORD SPACE.]
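A minimal sketch of Theorem 1 in code follows. The symbol durations are assumptions chosen to resemble the telegraph example (dot 2, dash 4, letter space 3, word space 6 time units); they are not stated in this excerpt.

    # Find the largest real root W of |sum_s W^(-b_ij^(s)) - delta_ij| = 0 by
    # bisection, then C = log2 W.  Assumed durations: dot 2, dash 4,
    # letter space 3, word space 6.
    import math

    def det(W):
        a = W ** -2 + W ** -4   # dot or dash: leads (back) to the "no space last" state
        s = W ** -3 + W ** -6   # letter space or word space: leads to the "space last" state
        # determinant of [[a - 1, s], [a, -1]]
        return (a - 1) * (-1) - s * a

    lo, hi = 1.0, 4.0           # det changes sign in this interval
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if det(lo) * det(mid) > 0 else (lo, mid)
    print("W =", lo, " C = log2 W =", math.log2(lo))   # about 0.54 bits per unit time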
We have seen that under very general conditions the logarithm of the
number of possible signals in a discrctc channel increases linearly with time.
The capacity to transmit information can be specified by giving this rate of
increase, the number of bits per second required to specify the particular
signal used.
We now consider the information source. How is an information source
to be described mathematically, and how much information in bits per sec-
ond is produced in a given source? The main point at issue is the effect of
statistical knowledge about the source in reducing the required capacity
(B) Using the same five letters let the probabilities be .4, .1, .2, .2, .1
respectively, with successive choices independent. A typical
message from this source is then:
AAACDCBDCEAADADACEDA
EADCABEDADDCECAAAAAD
(C) A more complicated structure is obtained if successive symbols are
not chosen independently but their probabilities depend on preced-
ing letters. In the simplest case of this type a choice depends only
on the preceding letter and not on ones before that. The statistical
structure can then be described by a set of transition probabilities
p_i(j), the probability that letter i is followed by letter j. The in-
dices i and j range over all the possible symbols. A second equiv-
alent way of specifying the structure is to give the "digram" prob-
abilities p(i, j), i.e., the relative frequency of the digram i j. The
letter frequencies p(i), (the probability of letter i), the transition
probabilities p_i(j) and the digram probabilities p(i, j) are related by
the following formulas:

p(i, j) = p(i) p_i(j)
p(i) = \sum_j p(i, j) = \sum_j p(j, i) = \sum_j p(j) p_j(i)
\sum_j p_i(j) = \sum_i p(i) = \sum_{i,j} p(i, j) = 1.
As a specific example suppose there are three letters A, B, C with the prob-
ability tables:

  transition probabilities p_i(j)        p(i)          digram probabilities p(i, j)
            A      B      C                                    A       B       C
  i = A     0     4/5    1/5      p(A) =  9/27        i = A    0      4/15    1/15
  i = B    1/2    1/2     0       p(B) = 16/27        i = B   8/27    8/27     0
  i = C    1/2    2/5    1/10     p(C) =  2/27        i = C   1/27   4/135   1/135
As we have indicated above a discrete source for our purposes can be con-
sidered to be represented by a Markoff process. Among the possible discrete
Markoff processes there is a group with special properties of significance in
communication theory. This special class consists of the "ergodic" processes
and we shall call the corresponding sources ergodic sources.
[Fig. 4 - A graph corresponding to the source in example C; the lines between the states A, B, C are labeled with the transition probabilities.]
In the ergodic case it can be shown that with any starting conditions the
probabilities P_j(N) of being in state j after N symbols approach the equi-
librium values as N \to \infty.
function of n. With equally likely events there is more choice, or un-
certainty, when there are more possible events.
3. If a choice be broken down into two successive choices, the original
H should be the weighted sum of the individual values of H. The
meaning of this is illustrated in Fig. 6. At the left we have three
possibilities p_1 = 1/2, p_2 = 1/3, p_3 = 1/6. On the right we first choose be-
tween two possibilities each with probability 1/2, and if the second occurs
make another choice with probabilities 2/3, 1/3. The final results have
the same probabilities as before. We require, in this special case,
that

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3).

The coefficient 1/2 is because this second choice only occurs half the time.
Fig. 7 - Entropy in the case of two possibilities with probabilities p and (1 - p).
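The decomposition just quoted is easy to check numerically. A minimal sketch, assuming base-2 logarithms:

    # Check that H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3).
    import math

    def H(*p):
        """Entropy -sum p_i log2 p_i of a finite distribution."""
        return -sum(x * math.log2(x) for x in p if x > 0)

    left = H(1/2, 1/3, 1/6)
    right = H(1/2, 1/2) + 0.5 * H(2/3, 1/3)
    print(left, right)                 # both are about 1.459 bits
    assert abs(left - right) < 1e-12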
1. H = 0 if and only if all the p_i but one are zero, this one having the
value unity. Thus only when we are certain of the outcome does H vanish.
Otherwise H is positive.
2. For a given n, H is a maximum and equal to \log n when all the p_i are
equal, i.e., 1/n. This is also intuitively the most uncertain situation.
3. Suppose there are two events, x and y, in question with m possibilities
for the first and n for the second. Let p(i, j) be the probability of the joint
occurrence of i for the first and j for the second. The entropy of the joint
event is

H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j)

while

H(x) = -\sum_{i,j} p(i, j) \log \sum_j p(i, j),
H(y) = -\sum_{i,j} p(i, j) \log \sum_i p(i, j).

It is easily shown that

H(x, y) \le H(x) + H(y)

with equality only if the events are independent (i.e., p(i, j) = p(i) p(j)).
The uncertainty of a joint event is less than or equal to the sum of the
individual uncertainties.
4. Any change toward equalization of the probabilities p_1, p_2, \cdots, p_n
increases H. Thus if p_1 < p_2 and we increase p_1, decreasing p_2 an equal
amount so that p_1 and p_2 are more nearly equal, then H increases. More
generally, if we perform any "averaging" operation on the p_i of the form

p_i' = \sum_j a_{ij} p_j

where \sum_i a_{ij} = \sum_j a_{ij} = 1, and all a_{ij} \ge 0, then H increases (except in the
special case where this transformation amounts to no more than a permuta-
tion of the p_j with H of course remaining the same).
5. Suppose there are two chance events x and y as in 3, not necessarily
independent. For any particular value i that x can assume there is a con-
ditional probability p_i(j) that y has the value j. This is given by

p_i(j) = \frac{p(i, j)}{\sum_j p(i, j)}.

We define the conditional entropy of y, H_x(y), as the average of the entropy
of y for each value of x, weighted according to the probability of getting
that particular x. That is

H_x(y) = -\sum_{i,j} p(i, j) \log p_i(j).

This quantity measures how uncertain we are of y on the average when we
know x. Substituting the value of p_i(j) we obtain

H_x(y) = -\sum_{i,j} p(i, j) \log p(i, j) + \sum_{i,j} p(i, j) \log \sum_j p(i, j)
       = H(x, y) - H(x)

or

H(x, y) = H(x) + H_x(y).
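The identity H(x, y) = H(x) + H_x(y) and the subadditivity noted above can be verified directly from a joint table. A short sketch, using an invented 2 x 3 joint distribution:

    # Check H(x, y) = H(x) + H_x(y) and H(x, y) <= H(x) + H(y).
    import math

    p = [[0.10, 0.20, 0.10],      # p(i, j) for i = 0, 1 and j = 0, 1, 2 (assumed values)
         [0.30, 0.05, 0.25]]

    def H(probs):
        return -sum(q * math.log2(q) for q in probs if q > 0)

    px = [sum(row) for row in p]                       # p(i) = sum_j p(i, j)
    py = [p[0][j] + p[1][j] for j in range(3)]
    Hxy = H([q for row in p for q in row])
    Hx, Hy = H(px), H(py)
    Hx_y = -sum(p[i][j] * math.log2(p[i][j] / px[i])   # H_x(y) = -sum p(i,j) log p_i(j)
                for i in range(2) for j in range(3) if p[i][j] > 0)
    assert abs(Hxy - (Hx + Hx_y)) < 1e-12
    assert Hxy <= Hx + Hy + 1e-12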
or

\log p \cong N \sum_i p_i \log p_i

when N is large.
A closely related result deals with the number of sequences of various
probabilities. Consider again the sequences of length N and let them be
arranged in order of decreasing probability. We define n(q) to be the
number we must take from this set starting with the most probable one in
order to accumulate a total probability q for those taken.
Theorem 4:

\lim_{N \to \infty} \frac{\log n(q)}{N} = H

when q does not equal 0 or 1.
We may interpret \log n(q) as the number of bits required to specify the
sequence when we consider only the most probable sequences with a total
probability q. Then \frac{\log n(q)}{N} is the number of bits per symbol for the
specification. The theorem says that for large N this will be independent of
q and equal to H. The rate of growth of the logarithm of the number of
reasonably probable sequences is given by H, regardless of our interpreta-
tion of "reasonably probable." Due to these results, which are proved in
Appendix 3, it is possible for most purposes to treat the long sequences as
though there were just 2^{HN} of them, each with a probability 2^{-HN}.
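The growth of log n(q)/N toward H can be watched numerically on a small independent-letter source. A sketch (the three-letter source and the value q = 0.5 are assumptions for illustration; convergence is slow):

    # Count the most probable length-N sequences needed to reach total probability q
    # and compare log2(count)/N with the entropy H.
    import math
    from itertools import product

    p = {"A": 0.6, "B": 0.3, "C": 0.1}
    H = -sum(q * math.log2(q) for q in p.values())

    def n_of_q(N, q):
        probs = sorted((math.prod(p[s] for s in seq) for seq in product(p, repeat=N)),
                       reverse=True)
        total, count = 0.0, 0
        for pr in probs:
            total += pr
            count += 1
            if total >= q:
                return count

    for N in (4, 8, 12):
        print(N, math.log2(n_of_q(N, 0.5)) / N, "vs H =", H)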
The next two theorems show that H and H' can be determined by limit-
ing operations directly from the statistics of the message sequences, without
reference to the states and transition probabilities between states.
Theorem 5: Let p(B_i) be the probability of a sequence B_i of symbols from
the source. Let

G_N = -\frac{1}{N} \sum_i p(B_i) \log p(B_i)

where the sum is over all sequences B_i containing N symbols. Then G_N is a
monotonic decreasing function of N and \lim_{N \to \infty} G_N = H.

where l_{ij}^{(s)} is the duration of the s-th symbol leading from state i to state j
and the B_i satisfy
Thus the messages of high probability are represented by short codes and
those of low probability by long codes. From these inequalities we have
The code for P_s will differ from all succeeding ones in one or more of its
m_s places, since all the remaining P_i are at least 2^{-m_s} larger and their binary
expansions therefore differ in the first m_s places. Consequently all the codes
are different and it is possible to recover the message from its code. If the
channel sequences are not already sequences of binary digits, they can be
ascribed binary numbers in an arbitrary fashion and the binary code thus
translated into signals suitable for the channel.
The average number II’ of binary digits used per symbol of original mes-
sage is easily estimated. WC have
II’ = f z??t,p.
But,
and therefore,
A    0
B    10
C    110
D    111

The average number of binary digits used in encoding a sequence of N sym-
bols will be

N\left( \tfrac{1}{2} \times 1 + \tfrac{1}{4} \times 2 + \tfrac{2}{8} \times 3 \right) = \tfrac{7}{4} N.
It is easily seen that the binary digits 0, 1 have probabilities 1/2, 1/2; so the H for
the coded sequences is one bit per symbol. Since, on the average, we have 7/4
binary symbols per original letter, the entropies on a time basis are the
same. The maximum possible entropy for the original set is \log 4 = 2,
occurring when A, B, C, D have probabilities 1/4, 1/4, 1/4, 1/4. Hence the relative
entropy is 7/8. We can translate the binary sequences into the original set of
symbols on a two-to-one basis by the following table:

00    A'
01    B'
10    C'
11    D'

This double process then encodes the original message into the same symbols
but with an average compression ratio 7/8.
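A short simulation of this example (the message length is my own choice) confirms the 7/4 average code length and the 7/8 compression ratio:

    # The code A->0, B->10, C->110, D->111 with probabilities 1/2, 1/4, 1/8, 1/8.
    import math, random

    probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
    code = {"A": "0", "B": "10", "C": "110", "D": "111"}

    H = -sum(p * math.log2(p) for p in probs.values())
    avg_len = sum(p * len(code[s]) for s, p in probs.items())
    print(H, avg_len)                      # both are 7/4 = 1.75

    random.seed(0)
    msg = random.choices(list(probs), weights=list(probs.values()), k=100_000)
    bits = "".join(code[s] for s in msg)
    print(len(bits) / len(msg))            # ~ 1.75 binary digits per original symbol
    print((len(bits) / 2) / len(msg))      # ~ 7/8 new symbols per original symbol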
In such a case one can construct a fairly good coding of the message on a
0, 1 channel by sending a special sequence, say 0000, for the infrequent
symbol A and then a sequence indicating the number of B's following it.
This could be indicated by the binary representation with all numbers con-
taining the special sequence deleted. All numbers up to 16 are represented
as usual; 16 is represented by the next binary number after 16 which does
not contain four zeros, namely 17 = 10001, etc.
It can be shown that as p -+ 0 the coding approaches ideal provided the
length of the special sequence is properly adjusted.
The conditional entropy H_y(x) will, for convenience, be called the equi-
vocation. It measures the average ambiguity of the received signal.
[Figure (correction system schematic): CORRECTION DATA]
transmitting from a source which produces binary digits with probability p for
1 (correct) and q for 0 (incorrect). This requires a channel of capacity

-[p \log p + q \log q]

which is the equivocation of the original system.
The rate of transmission R can be written in two other forms due to the
identities noted above. We have

R = H(x) - H_y(x)
  = H(y) - H_x(y)
  = H(x) + H(y) - H(x, y).
The first defining expression has already been interpreted as the amount of
information sent less the uncertainty of what was sent. The second meas-
ures the amount received less the part of this which is due to noise. The
third is the sum of the two amounts less the joint entropy and therefore in a
sense is the number of bits per second common to the two. Thus all three
expressions have a certain intuitive significance.
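The equality of the three expressions is easy to verify on a small example. A sketch, assuming a binary symmetric channel with crossover probability 0.1 and input probabilities 0.7 and 0.3:

    # Check R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y).
    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px, eps = [0.7, 0.3], 0.1
    pxy = [[px[i] * (1 - eps if i == j else eps) for j in (0, 1)] for i in (0, 1)]
    py = [pxy[0][j] + pxy[1][j] for j in (0, 1)]

    Hx, Hy = H(px), H(py)
    Hxy = H([pxy[i][j] for i in (0, 1) for j in (0, 1)])
    Hy_x = Hxy - Hy            # equivocation H_y(x), from H(x, y) = H(y) + H_y(x)
    Hx_y = Hxy - Hx
    forms = (Hx - Hy_x, Hy - Hx_y, Hx + Hy - Hxy)
    print(forms)               # all three agree
    assert max(forms) - min(forms) < 1e-12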
The capacity C of a noisy channel should be the maximum possible rate
of transmission, i.e., the rate when the source is properly matched to the
channel. We therefore define the channel capacity by

C = \max \left[ H(x) - H_y(x) \right]

where the maximum is with respect to all possible information sources used
as input to the channel. If the channel is noiseless, H_y(x) = 0. The defini-
tion is then equivalent to that already given for a noiseless channel since the
maximum entropy for the channel is its capacity.
These results are the main justification for the definition of C and will
now be proved.
Theorem 11: Let a discrete channel have the capacity C and a discrete
source the entropy per second H. If H < C there exists a coding system
such that the output of the source can be transmitted over the channel with
an arbitrarily small frequency of errors (or an arbitrarily small equivocation).
If H > C it is possible to encode the source so that the equivocation is less
than H - C + \epsilon where \epsilon is arbitrarily small. There is no method of encod-
ing which gives an equivocation less than H - C.
The method of proving the first part of this theorem is not by exhibiting
a coding method having the desired properties, but by showing that such a
code must exist in a certain group of codes. In fact we will average the
frequency of errors over this group and show that this average can be made
less than \epsilon. If the average of a set of numbers is less than \epsilon there must
exist at least one in the set which is less than \epsilon. This will establish the
desired result.
All the e’s and 6’s implied by the words “small” and “about” in these
statements approach zero as we allow T to increase and So to approach the
maximizing source.
The situation is summarized in Fig. 10 where the input sequences are
points on the left and output sequences points on the right. The fan of
cross lines represents the range of possible causes for a typical output.
Now suppose we have another source producing information at rate R
with R < C. In the period T this source will have 2TR high probability
outputs. We wish to associate these with a selection of the possible channel
[Fig. 10 - Schematic representation of the relations between inputs and outputs in a
channel. The figure shows the 2^{H(x)T} high probability messages on the left, the
2^{H(y)T} high probability received signals on the right, and the fan of reasonable
effects from each message M.]
The probability that none of the points in the fan is a message (apart from
the actual originating message) is

P = \left[ 1 - 2^{T(R - H(x))} \right]^{2^{T H_y(x)}}.

H(x) - H_y(x) = C + \epsilon
14. DISCUSSION
The demonstration of theorem 11, while not a pure existence proof, has
some of the deficiencies of such proofs. An attempt to obtain a good
approximation to ideal coding by following the method of the proof is gen-
erally impractical. In fact, apart from some rather trivial cases and
certain limiting situations, no explicit description of a series of approxima-
tion to the ideal has been found. Probably this is no accident but is related
to the difficulty of giving an explicit construction for a good approximation
to a random sequence.
An approximation to the ideal would have the property that if the signal
is altered in a reasonable way by the noise, the original can still be recovered.
In other words the alteration will not in general bring it closer to another
reasonable signal than the original. This is accomplished at the cost of a
certain amount of redundancy in the coding. The redundancy must be
introduced in the proper way to combat the particular noise structure
involved. However, any redundancy in the source will usually help if it is
utilized at the receiving point. Ih particular, if the source already has a
certain redundancy and no attempt is made to eliminate it in matching to the
channel, this redundancy will help combat noise. For example, in a noiseless
telegraph channel one could save about 50% in time by proper encoding of
the messages. This is not done and most of the redundancy of English
remains in the channel symbols. This has the advantage, however, of
allowing considerable noise in the channel. A sizable fraction of the letters
can be received incorrectly and still reconstructed by the context. In
fact this is probably not a bad approximation to the ideal in many cases,
since the statistical structure of English is rather involved and the reasonable
English sequences are not too far (in the sense required for the theorem) from a
random selection.
As in the noiseless case a delay is generally required to approach the ideal
encoding. It now has the additional function of allowing a large sample of
noise to affect the signal before any judgment is made at the receiving point
as to the original message. Increasing the sample size always sharpens the
possible statistical assertions.
The content of theorem 11 and its proof can be formulated in a somewhat
different way which exhibits the connection with the noiseless case more
clearly. Consider the possible signals of duration T and suppose a subset
of them is selected to be used. Let those in the subset all be used with equal
probability, and suppose the receiver is constructed to select, as the original
signal, the most probable cause from the subset, when a perturbed signal
is received. We define N(T, q) to be the maximum number of signals we
can choose for the subset such that the probability of an incorrect inter-
pretation is less than or equal to q.
[Fig. 11 - Example of a discrete channel. Transmitted symbols on the left, received
symbols on the right; the transition probabilities p and q label the noisy lines.]
\frac{\partial U}{\partial P} = -1 - \log P + \lambda = 0

\frac{\partial U}{\partial Q} = -2 - 2 \log Q - 2\alpha + 2\lambda = 0.

Eliminating \lambda

\log P = \log Q + \alpha

P = Q e^{\alpha} = Q \beta

P = \frac{\beta}{\beta + 2}, \qquad Q = \frac{1}{\beta + 2}.

The channel capacity is then

C = \log \frac{\beta + 2}{\beta}.
Note how this checks the obvious values in the cases p = 1 and p = 1/2.
In the first, \beta = 1 and C = \log 3, which is correct since the channel is then
noiseless with three possible symbols. If p = 1/2, \beta = 2 and C = \log 2.
Here the second and third symbols cannot be distinguished at all and act
together like one symbol. The first symbol is used with probability P =
1/2 and the second and third together with probability 1/2. This may be
distributed in any desired way and still achieve the maximum capacity.
For intermediate values of p the channel capacity will lie between log
2 and log 3. The distinction between the second and third symbols conveys
some information but not as much as in the noiseless case. The first symbol
is used somewhat more frequently than the other two because of its freedom
from noise.
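The closed form above can be checked against a direct maximization. A sketch, assuming base-2 logarithms and the channel of Fig. 11 (first symbol noiseless; each of the other two received correctly with probability p and as the other with q = 1 - p):

    import math

    def closed_form(p):
        q = 1 - p
        alpha = -sum(r * math.log2(r) for r in (p, q) if r > 0)
        beta = 2 ** alpha
        return math.log2((beta + 2) / beta)

    def brute_force(p, steps=2000):
        """Maximize H(y) - H_x(y) over input distributions (P, Q, Q)."""
        q, best = 1 - p, 0.0
        noise = -sum(r * math.log2(r) for r in (p, q) if r > 0)
        for k in range(steps + 1):
            P = k / steps
            Q = (1 - P) / 2
            py = [P, Q * p + Q * q, Q * q + Q * p]
            Hy = -sum(r * math.log2(r) for r in py if r > 0)
            best = max(best, Hy - 2 * Q * noise)
        return best

    for p in (1.0, 0.9, 0.5):
        print(p, closed_form(p), brute_force(p))   # the two columns agree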
s = 1,2, *** .
Hence:
F pi Pit = exp tc T kc + 5 h,f p*j log p,j]
or,
Pi = F kf exp [C C l&f + C lb p,j log p.j].
I 8.j
If each input symbol has the same set of probabilities on the lines emerging
from it, and the same is true of each output symbol, the capacity can be
easily calculated. Examples are shown in Fig. 12. In such a case H_x(y)
is independent of the distribution of probabilities on the input symbols, and
is given by -\sum p_i \log p_i where the p_i are the values of the transition proba-
bilities from any input symbol. The channel capacity is

\max \left[ H(y) - H_x(y) \right] = \max H(y) + \sum p_i \log p_i.
The maximum of H(y) is clearly \log m where m is the number of output
symbols, since it is possible to make them all equally probable by making
the input symbols equally probable. The channel capacity is therefore

C = \log m + \sum p_i \log p_i.

[Fig. 12 - Examples of discrete channels with the same transition probabilities for each
input and for each output; three channels are shown, labeled a, b and c.]
In Fig. 12a it would be

C = \log 4 - \log 2 = \log 2.
This could be achieved by using only the 1st and 3d symbols. In Fig. 12b
C = \log 4 - \tfrac{2}{3} \log 3 - \tfrac{1}{3} \log 6
  = \log 4 - \log 3 - \tfrac{1}{3} \log 2
  = \log \frac{2^{5/3}}{3}.

In Fig. 12c we have

C = \log 3 - \tfrac{1}{2} \log 2 - \tfrac{1}{3} \log 3 - \tfrac{1}{6} \log 6.
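The formula C = log m + Σ p_i log p_i is trivial to evaluate in code. A sketch; the transition-probability sets below are assumptions chosen to reproduce the values computed above for Fig. 12a and 12c:

    import math

    def capacity(m, transition_probs):
        """C = log2 m + sum p_i log2 p_i (logs base 2)."""
        return math.log2(m) + sum(p * math.log2(p) for p in transition_probs)

    print(capacity(4, [1/2, 1/2]))          # Fig. 12a: log 4 - log 2 = 1 bit
    print(capacity(3, [1/2, 1/3, 1/6]))     # Fig. 12c: about 0.126 bit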
Suppose the symbols fall into several groups such that the noise never
causes a symbol in one group to be mistaken for a symbol in another group.
Let the capacity for the nth group be C_n when we use only the symbols
in this group. Then it is easily shown that, for best use of the entire set,
the total probability I’, of all symbols in the nib group should be
P, = &.
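A minimal sketch of this grouping rule (the group capacities below are assumed for illustration); it can be shown that the combined channel then has capacity log Σ 2^{C_n}:

    import math

    group_capacities = [1.0, 0.5, 2.0]          # assumed C_n, in bits
    weights = [2 ** c for c in group_capacities]
    total = sum(weights)
    P = [w / total for w in weights]
    print(P)                                    # optimal group probabilities
    print(math.log2(total))                     # capacity of the combined channel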
where a’,, , bij , . . . byj are the length of the symbols which may be chosen
in state i and lead to state j. These are linear difference equations and the
behavior as L -+ 00 must be of the type
iVj = AjW"
F (T WwbC’ - 6ij)Ai = 0.
For this to be possible the determinant

D(W) = \left| \sum_s W^{-b_{ij}^{(s)}} - \delta_{ij} \right|

must vanish and this determines W, which is, of course, the largest real root
of D = 0.
The quantity C is then given by

C = \lim_{L \to \infty} \frac{\log \sum_j A_j W^L}{L} = \log W

and we also note that the same growth properties result if we require that all
blocks start in the same (arbitrarily chosen) state.
APPENDIX 2
Similarly

A(t^n) = n A(t)

= -K \sum p_i \log \frac{n_i}{\sum n_i} = -K \sum p_i \log p_i.

If the p_i are incommensurable, they may be approximated by rationals and
the same expression must hold by our continuity assumption. Thus the
expression holds in general. The choice of coefficient K is a matter of con-
venience and amounts to the choice of a unit of measure.
APPENDIX 3
and \frac{\log p}{N} is limited by

\frac{\log p}{N} = \sum \left( P_i p_{ij} \pm \delta \right) \log p_{ij}

or

\left| \frac{\log p}{N} - \sum P_i p_{ij} \log p_{ij} \right| < \eta.

This proves Theorem 3.
Theorem 4 follows immediately from this on calculating upper and lower
bounds for n(q) based on the possible range of values of p in Theorem 3.
In the mixed (not ergodic) case if

L = \sum p_i L_i

and the entropies of the components are H_1 \ge H_2 \ge \cdots \ge H_n we have the

Theorem: \lim_{N \to \infty} \frac{\log n(q)}{N} = \varphi(q) is a decreasing step function,

\varphi(q) = H_s \quad \text{in the interval} \quad \sum_{i=1}^{s-1} \alpha_i < q < \sum_{i=1}^{s} \alpha_i.
and summing this for all N gives G_N = \frac{1}{N} \sum_1^N F_n. Hence G_N \ge F_N and G_N
monotonic decreasing. Also they must approach the same limit. By using
Theorem 3 we see that \lim_{N \to \infty} G_N = H.
APPENDIX 4
MAXIMIZING THE RATE FOR A SYSTEM OF CONSTRAINTS
Suppose we have a set of constraints on sequences of symbols that is of
the finite state type and can be represented therefore by a linear graph.
Let \ell_{ij}^{(s)} be the lengths of the various symbols that can occur in passing from
state i to state j. What distribution of probabilities P_i for the different
states and p_{ij}^{(s)} for choosing symbol s in state i and going to state j maximizes
the rate of generating information under these constraints? The constraints
define a discrete channel and the maximum rate must be less than or equal
to the capacity C of this channel, since if all blocks of large length were
equally likely, this rate would result, and if possible this would be best. We
will show that this rate can be achieved by proper choice of the P_i and p_{ij}^{(s)}.
The rate in question is

\frac{-\sum P_i p_{ij}^{(s)} \log p_{ij}^{(s)}}{\sum P_i p_{ij}^{(s)} \ell_{ij}^{(s)}} = \frac{N}{M}.
Let \ell_{ij} = \sum_s \ell_{ij}^{(s)}. Evidently for a maximum p_{ij}^{(s)} = k \exp \ell_{ij}^{(s)}. The con-
straints on maximization are \sum P_i = 1, \sum_j p_{ij} = 1, \sum_i P_i (p_{ij} - \delta_{ij}) = 0.
Hence we maximize

U = \frac{-\sum P_i p_{ij} \log p_{ij}}{\sum P_i p_{ij} \ell_{ij}} + \lambda \sum_i P_i + \sum \mu_i p_{ij} + \sum \eta_j P_i (p_{ij} - \delta_{ij})

\frac{\partial U}{\partial p_{ij}} = -\frac{M P_i (1 + \log p_{ij}) + N P_i \ell_{ij}}{M^2} + \lambda + \mu_i + \eta_j P_i = 0.
Solving for p_{ij}

p_{ij} = A_i B_j D^{-\ell_{ij}}.

Since

\sum_j p_{ij} = 1, \qquad A_i^{-1} = \sum_j B_j D^{-\ell_{ij}}

p_{ij} = \frac{B_j D^{-\ell_{ij}}}{\sum_s B_s D^{-\ell_{is}}}.

The correct value of D is the capacity C and the B_j are solutions of

B_i = \sum_j B_j C^{-\ell_{ij}}

for then

p_{ij} = \frac{B_j}{B_i} C^{-\ell_{ij}}

\sum_i P_i \frac{B_j}{B_i} C^{-\ell_{ij}} = P_j

or

\sum_i \frac{P_i}{B_i} C^{-\ell_{ij}} = \frac{P_j}{B_j}.

So that if \gamma_i satisfy

\sum_i \gamma_i C^{-\ell_{ij}} = \gamma_j

P_i = B_i \gamma_i.

Both of the sets of equations for B_i and \gamma_i can be satisfied since C is such that

\left| C^{-\ell_{ij}} - \delta_{ij} \right| = 0.

In this case the rate is

\frac{-\sum P_i p_{ij} \log \frac{B_j}{B_i} C^{-\ell_{ij}}}{\sum P_i p_{ij} \ell_{ij}} = C - \frac{\sum P_i p_{ij} \log \frac{B_j}{B_i}}{\sum P_i p_{ij} \ell_{ij}}

but

\sum P_i p_{ij} (\log B_j - \log B_i) = \sum_j P_j \log B_j - \sum_i P_i \log B_i = 0.

Hence the rate is C and as this could never be exceeded this is the maximum,
justifying the assumed solution.
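A single-state special case of this result is easy to verify in code. A sketch with assumed symbol durations: if C solves Σ 2^{-C t_s} = 1 and we choose p_s = 2^{-C t_s}, the rate (-Σ p_s log p_s)/(Σ p_s t_s) equals C, while other choices give a lower rate:

    import math, random

    durations = [1, 2, 3, 3]                     # assumed, for illustration

    def root(f, lo, hi, iters=100):
        for _ in range(iters):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(lo) * f(mid) > 0 else (lo, mid)
        return lo

    C = root(lambda c: sum(2 ** (-c * t) for t in durations) - 1, 0.01, 10.0)

    def rate(p):
        return -sum(x * math.log2(x) for x in p) / sum(x * t for x, t in zip(p, durations))

    p_opt = [2 ** (-C * t) for t in durations]
    print(C, rate(p_opt))                        # equal (C = 1 for these durations)
    random.seed(1)
    w = [random.random() for _ in durations]
    print(rate([x / sum(w) for x in w]), "<=", C)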
f_\theta(t) = \sin (t + \theta),
we may give a probability distribution for 0, P(0). The set then becomes
an ensemble.
Some further examples of ensembles of functions are:
1. A finite set of functions f_k(t) (k = 1, 2, \cdots, n) with the probability of
f_k being p_k.
f(\alpha_1, \alpha_2, \cdots, \alpha_n; t)
p(\alpha_1, \cdots, \alpha_n)
where the t_k are the points of the Poisson distribution. This ensemble
can be considered as a type of impulse or shot noise where all the impulses
are identical.
5. The set of English speech functions with the probability measure given
by the frequency of occurrence in ordinary use.
An ensemble of functions f_\alpha(t) is stationary if the same ensemble results
when all functions are shifted any fixed amount in time. The ensemble

f_\theta(t) = \sin (t + \theta)

is stationary if \theta is distributed uniformly from 0 to 2\pi. If we shift each func-
tion by t_1 we obtain

f_\theta(t + t_1) = \sin (t + t_1 + \theta)
             = \sin (t + \varphi)

with \varphi distributed uniformly from 0 to 2\pi. Each function has changed
but the ensemble as a whole is invariant under the translation. The other
examples given above are also stationary.
An ensemble is ergodic if it is stationary, and there is no subset of the func-
tions in the set with a probability different from 0 and 1 which is stationary.
The ensemble

\sin (t + \theta)

is ergodic. No subset of these functions of probability \ne 0, 1 is transformed
into itself under all time translations. On the other hand the ensemble

a \sin (t + \theta)

with a distributed normally and \theta uniform is stationary but not ergodic.
The subset of these functions with a between 0 and 1 for example is
stationary.
Of the examples given, 3 and 4 are ergodic, and 5 may perhaps be con-
sidered so. If an ensemble is ergodic we may say roughly that each func-
tion in the set is typical of the ensemble. More precisely it is known that
with an ergodic ensemble an average of any statistic over the ensemble is
equal (with probability 1) to an average over all the time translations of a
particular function in the set.3 Roughly speaking, each function can be ex-
pected, as time progresses, to go through, with the proper frequency, all the
convolutions of any of the functions in the set.
Just as we may perform various operations on numbers or functions to
obtain new numbers or functions, we can perform operations on ensembles
to obtain new ensembles. Suppose, for example, we have an ensemble of
functions f_\alpha(t) and an operator T which gives for each function f_\alpha(t) a result
g_\alpha(t):

g_\alpha(t) = T f_\alpha(t).

Probability measure is defined for the set g_\alpha(t) by means of that for the set
f_\alpha(t). The probability of a certain subset of the g_\alpha(t) functions is equal
to that of the subset of the f_\alpha(t) functions which produce members of the
given subset of g functions under the operation T. Physically this corre-
sponds to passing the ensemble through some device, for example, a filter,
a rectifier or a modulator. The output functions of the device form the
ensemble g_\alpha(t).
A device or operator T will be called invariant if shifting the input merely
shifts the output, i.e., if

g_\alpha(t) = T f_\alpha(t)

implies

g_\alpha(t + t_1) = T f_\alpha(t + t_1)

for all f_\alpha(t) and all t_1. It is easily shown (see Appendix 5) that if T is in-
variant and the input ensemble is stationary then the output ensemble is
stationary. Likewise if the input is ergodic the output will also be ergodic.
A filter or a rectifier is invariant under all time translations. The opera-
tion of modulation is not since the carrier phase gives a certain time struc-
ture. However, modulation is invariant under all translations which are
multiples of the period of the carrier.
Wiener has pointed out the intimate relation between the invariance of
physical devices under time translations and Fourier theory.4 He has
3 This is the famous ergodic theorem, or rather one aspect of this theorem which was
proved in somewhat different formulations by Birkhoff, von Neumann, and Koopman, and
subsequently generalized by Wiener, Hopf, Hurewicz and others. The literature on ergodic
theory is quite extensive and the reader is referred to the papers of these writers for pre-
cise and general formulations; e.g., E. Hopf, "Ergodentheorie," Ergebnisse der Mathematik
und ihrer Grenzgebiete, Vol. 5; "On Causality Statistics and Probability," Journal of
Mathematics and Physics, Vol. XIII, No. 1, 1934; N. Wiener, "The Ergodic Theorem,"
Duke Mathematical Journal, Vol. 5, 1939.
4 Communication theory is heavily indebted to Wiener for much of its basic philosophy
and theory. His classic NDRC report, "The Interpolation, Extrapolation, and Smoothing
of Stationary Time Series," to appear soon in book form, contains the first clear-cut
formulation of communication theory as a statistical problem, the study of operations
on time series. This work, although chiefly concerned with the linear prediction and
filtering problem, is an important collateral reference in connection with the present paper.
We may also refer here to Wiener's forthcoming book "Cybernetics" dealing with the
general problems of communication and control.
5 For a proof of this theorem and further discussion see the author's paper "Communi-
cation in the Presence of Noise" to be published in the Proceedings of the Institute of
Radio Engineers.
H = -\int p(x) \log p(x)\, dx.

With an n-dimensional distribution p(x_1, \cdots, x_n) we have

H = -\int \cdots \int p(x_1, \cdots, x_n) \log p(x_1, \cdots, x_n)\, dx_1 \cdots dx_n.

If we have two arguments x and y the joint and conditional entropies of p(x, y) are given by

H(x, y) = -\iint p(x, y) \log p(x, y)\, dx\, dy

and

H_x(y) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)}\, dx\, dy, \qquad
H_y(x) = -\iint p(x, y) \log \frac{p(x, y)}{p(y)}\, dx\, dy

where

p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx.
The entropies of continuous distributions have most (but not all) of the
properties of the discrete case. In particular we have the following:
\int a(x, y)\, dx = \int a(x, y)\, dy = 1, \qquad a(x, y) \ge 0.
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / 2\sigma^2}

H(x) = -\int p(x) \log p(x)\, dx
     = \log \sqrt{2\pi}\,\sigma + \frac{\sigma^2}{2\sigma^2}
     = \log \sqrt{2\pi}\,\sigma + \log \sqrt{e}
     = \log \sqrt{2\pi e}\,\sigma.
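The closed form log √(2πe) σ is easy to confirm by numerical integration. A sketch, assuming natural logarithms:

    # Numerically integrate -p(x) log p(x) for a Gaussian and compare with the closed form.
    import math

    sigma = 1.7
    def p(x):
        return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)

    dx = 0.001
    xs = [i * dx for i in range(-20000, 20001)]
    H_numeric = -sum(p(x) * math.log(p(x)) * dx for x in xs)
    H_closed = math.log(math.sqrt(2 * math.pi * math.e) * sigma)
    print(H_numeric, H_closed)       # agree to several decimal places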
Similarly the n-dimensional gaussian distribution with associated
quadratic form a_{ij} is given by

p(x_1, \cdots, x_n) = \frac{|a_{ij}|^{1/2}}{(2\pi)^{n/2}} \exp \left( -\tfrac{1}{2} \sum a_{ij} x_i x_j \right)

and the entropy can be calculated as

H = \log (2\pi e)^{n/2} |a_{ij}|^{-1/2}

where |a_{ij}| is the determinant whose elements are a_{ij}.
7. If x is limited to a half line (p(x) = 0 for x \le 0) and the first moment of
x is fixed at a:

a = \int_0^\infty p(x)\, x\, dx,

then the maximum entropy occurs when

p(x) = \frac{1}{a} e^{-x/a}

and is equal to \log ea.

In this case the Jacobian is simply the determinant |a_{ij}|^{-1} and

H(y) = H(x) + \log |a_{ij}|.
For a given average power N, white noise has the maximum possible
entropy. This follows from the maximizing properties of the Gaussian
distribution noted above.
The entropy for a continuous stochastic process has many properties
analogous to that for discrete processes. In the discrete case the entropy
was related to the logarithm of the probability of long sequences, and to the
number of reasonably probable sequences of long length. In the continuous
case it is related in a similar fashion to the logarithm of the probability
density for a long series of samples, and the volume of reasonably high prob-
ability in the function space.
More precisely, if we assume p(x_1, \cdots, x_n) continuous in all the x_i for all n,
then for sufficiently large n

\left| \frac{\log p}{n} - H' \right| < \epsilon
for all choices of (x1 , . . * , r,) apart from a set whose total probability is
less than 6, with 6 and e arbitrarily small. This follows from the ergodic
property if WCdivide the space into a large number of small cells.
The relation of H to volume can be stated as follows: Under the same as-
sumptions consider the n-dimensional space corresponding to p(x_1, \cdots, x_n).
Let V_n(q) be the smallest volume in this space which includes in its interior
a total probability q. Then

\lim_{n \to \infty} \frac{\log V_n(q)}{n} = H'

provided q does not equal 0 or 1.
These results show that for large n there is a rather well-defined volume (at
least in the logarithmic sense) of high probability, and that within this
volume the probability density is relatively uniform (again in the logarithmic
sense).
In the white noise case the distribution function is given by

p(x_1, \cdots, x_n) = \frac{1}{(2\pi N)^{n/2}} \exp \left( -\frac{1}{2N} \sum x_i^2 \right).

Since this depends only on \sum x_i^2 the surfaces of equal probability density
are spheres and the entire distribution has spherical symmetry. The region
of high probability is a sphere of radius \sqrt{nN}. As n \to \infty the probability
of being outside a sphere of radius \sqrt{n(N + \epsilon)} approaches zero and \frac{1}{n} times
the logarithm of the volume of the sphere approaches \log \sqrt{2\pi e N}.
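This concentration near radius √(nN) is easy to see in a simulation. A sketch (the noise power and sample counts are my own choices):

    # White Gaussian samples of power N: the length of an n-dimensional sample
    # clusters around sqrt(n*N) as n grows.
    import random, math

    random.seed(0)
    N = 2.0                                        # noise power (variance per coordinate)
    for n in (10, 100, 10_000):
        radii = []
        for _ in range(200):
            r2 = sum(random.gauss(0, math.sqrt(N)) ** 2 for _ in range(n))
            radii.append(math.sqrt(r2))
        mean_r = sum(radii) / len(radii)
        print(n, mean_r, "vs sqrt(nN) =", math.sqrt(n * N))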
In the continuous case it is convenient to work not with the entropy ZZ of
an ensemble but with a derived quantity which we will call the entropy
power. This is defined as the power in a white noise limited to the same
band as the original ensemble and having the same entropy. In other words
if H’ is the entropy of an ensemble its entropy power is
SIN’? %‘t
(ntF
cost-1
---+- co5 t SIN t’
0.384
t4 2t.2 t3
, - “3 ----) 0 11 GJ I
q J, W
-2.66
P---- t
-6.66 a -!-
at* [
cos (I-cqt-cos t
1
J = Q I Y(fi) I2
C.E.Shannon
where the f_i are equally spaced through the band W. This becomes in
the limit

\exp \frac{1}{W} \int_W \log |Y(f)|^2\, df.
White Gaussian noise has the peculiar property that it can absorb any
other noise or signal ensemble which may be added to it with a resultant
entropy power approximately equal to the sum of the white noise power and
the signal power (measured from the average signal value, which is normally
zero), provided the signal power is small, in a certain sense, compared to
the noise.
Consider the function space associated with these ensembles having n
dimensions. The white noise corresponds to a spherical Gaussian distribu-
tion in this space. The signal ensemble corresponds to another probability
distribution, not necessarily Gaussian or spherical. Let the second moments
of this distribution about its center of gravity be a_{ij}. That is, if
p(x_1, \cdots, x_n) is the density distribution function

a_{ij} = \int \cdots \int p(x_1, \cdots, x_n)(x_i - \alpha_i)(x_j - \alpha_j)\, dx_1 \cdots dx_n

where the \alpha_i are the coordinates of the center of gravity. Now a_{ij} is a posi-
tive definite quadratic form, and we can rotate our coordinate system to
align it with the principal directions of this form. a_{ij} is then reduced to
diagonal form b_{ii}. We require that each b_{ii} be small compared to N, the
squared radius of the spherical distribution.
In this case the convolution of the noise and signal produce a Gaussian
distribution whose corresponding quadratic form is

N + b_{ii}.

The entropy power of this distribution is

\left[ \prod (N + b_{ii}) \right]^{1/n}

or approximately

\left[ (N)^n + \sum b_{ii} (N)^{n-1} \right]^{1/n} \doteq N + \frac{1}{n} \sum b_{ii}.

The last term is the signal power, while the first is the noise power.
where H(x) is the entropy of the input and H_y(x) the equivocation. The
channel capacity C is defined as the maximum of R when we vary the input
over all possible ensembles. This means that in a finite dimensional ap-
proximation we must vary P(x) = P(x_1, \cdots, x_n) and maximize

\iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy

using the fact that \iint P(x, y) \log P(x)\, dx\, dy = \int P(x) \log P(x)\, dx. The
channel capacity is thus expressed

C = \lim_{T \to \infty} \max_{P(x)} \frac{1}{T} \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy.
P_x(y) = Q(y - x)

and we can assign a definite entropy to the noise (independent of the sta-
tistics of the signal), namely the entropy of the distribution Q(n). This
entropy will be denoted by H(n).
Theorem 16: If the signal and noise are independent and the received
signal is the sum of the transmitted signal and the noise then the rate of
transmission is

R = H(y) - H(n),

i.e., the entropy of the received signal less the entropy of the noise. The
channel capacity is

C = \max_{P(x)} \left[ H(y) - H(n) \right].
This is the upper limit given in the theorem. The lower limit can be ob-
tained by considering the rate if we make the transmitted signal a white
noise, of power P. In this case the entropy power of the received signal
must be at least as great as that of a white noise of power P + ATI since we
have shown in a previous theorem that the entropy power of the sum of two
ensembles is greater than or equal to the sum of the individual entropy
powers. Hence
\max H(y) \ge W \log 2\pi e (P + N_1)

and

C \ge W \log 2\pi e (P + N_1) - W \log 2\pi e N_1
  = W \log \frac{P + N_1}{N_1}.
As P increases, the upper and lower bounds approach each other, so we
have as an asymptotic rate
W \log \frac{P + N}{N_1}.
If the noise is itself white, N = N_1 and the result reduces to the formula
proved previously:

C = W \log \left( 1 + \frac{P}{N} \right).
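In code the formula is a one-liner; a trivial sketch with assumed figures:

    # Capacity of a band W perturbed by white noise of power N with average signal power P.
    import math

    def capacity(W, P, N):
        """C = W log2(1 + P/N), in bits per second."""
        return W * math.log2(1 + P / N)

    # e.g. a 3000 cycle band with a 30 dB signal-to-noise ratio (assumed figures):
    print(capacity(3000, 1000, 1))      # about 30,000 bits per second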
If the noise is Gaussian but with a spectrum which is not necessarily flat,
N_1 is the geometric mean of the noise power over the various frequencies in
the band W. Thus

N_1 = \exp \frac{1}{W} \int_W \log N(f)\, df

where N(f) is the noise power at frequency f.
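The geometric-mean noise power N_1 can be evaluated numerically for any spectrum. A sketch with a made-up spectrum N(f) over an assumed band:

    import math

    W = 4000.0
    def N_of_f(f):
        return 1.0 + 0.5 * math.sin(2 * math.pi * f / W)   # assumed spectrum, always > 0

    M = 100_000
    df = W / M
    integral = sum(math.log(N_of_f((k + 0.5) * df)) * df for k in range(M))
    N1 = math.exp(integral / W)
    arithmetic_mean = sum(N_of_f((k + 0.5) * df) * df for k in range(M)) / W
    print(N1, "<=", arithmetic_mean)     # the geometric mean never exceeds the arithmetic mean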
Theorem 19: If we set the capacity for a given transmitter power P
equal to

C = W \log \frac{P + N - \eta}{N_1}

subject to the constraint that all the functions f(t) in the ensemble be less
than or equal to \sqrt{S}, say, for all t. A constraint of this type does not work
out as well mathematically as the average power limitation. The most we
have obtained for this case is a lower bound valid for all S/N, an "asymptotic"
upper bound (valid for large S/N) and an asymptotic value of C for S/N small.

C \ge W \log \frac{2}{\pi e^3} \frac{S}{N}

where S is the peak allowed transmitter power. For sufficiently large S/N

C \le W \log \frac{\frac{2}{\pi e} S + N}{N} (1 + \epsilon)

where \epsilon is arbitrarily small. As S/N \to 0

C \to W \log \left( 1 + \frac{S}{N} \right).

W \log (4S + 2\pi e N)(1 + \epsilon) - W \log (2\pi e N) = W \log \frac{\frac{2}{\pi e} S + N}{N} (1 + \epsilon).
in the output. This function is never negative. The input function (in
the general case) can be thought of as the sum of a series of shifted functions
a \frac{\sin 2\pi W t}{2\pi W t}

where a, the amplitude of the sample, is not greater than \sqrt{S}. Hence the
output is the sum of shifted functions of the non-negative form above with
the same coefficients. These functions being non-negative, the greatest
positive value for any t is obtained when all the coefficients a have their
maximum positive values, i.e., \sqrt{S}. In this case the input function was a
constant of amplitude \sqrt{S} and since the filter has unit gain for D.C., the
output is the same. Hence the output ensemble has a peak power S.
The entropy of the output ensemble can be calculated from that of the
input ensemble by using the theorem dealing with such a situation. The
output entropy is equal to the input entropy plus the geometrical mean
gain of the filter;

W \log 4S - 2W = W \log \frac{4S}{e^2}.

W \log \frac{2}{\pi e^3} \frac{S}{N}.
We now wish to show that, for small S/N (peak signal power over average
white noise power), the channel capacity is approximately

C = W \log \left( 1 + \frac{S}{N} \right).

Since the average signal power P is less than or equal to the peak S, it follows that for all S/N

C \le W \log \left( 1 + \frac{S}{N} \right).
To obtain this we need only assume (1) that the source and system are
ergodic so that a very long sample will be, with probability nearly 1, typical
of the ensemble, and (2) that the evaluation is “reasonable” in the sense
that it is possible, by observing a typical input and output x_1 and y_1, to
form a tentative evaluation on the basis of these samples; and if these
samples are increased in duration the tentative evaluation will, with proba-
bility 1, approach the exact evaluation based on a full knowledge of P(x, y).
Let the tentative evaluation be \rho(x, y). Then the function \rho(x, y) ap-
proaches (as T \to \infty) a constant for almost all (x, y) which are in the high
probability region corresponding to the system:

\rho(x, y) \to v(P(x, y))

and we may also write

\rho(x, y) \to \iint \rho(x, y) P(x, y)\, dx\, dy

since

\iint P(x, y)\, dx\, dy = 1.
v = \overline{(x(t) - y(t))^2}.

In this very commonly used criterion of fidelity the distance function
\rho(x, y) is (apart from a constant factor) the square of the ordinary
euclidean distance between the points x and y in the associated function
space:

\rho(x, y) = \frac{1}{T} \int_0^T [x(t) - y(t)]^2\, dt.

\rho(x, y) = \frac{1}{T} \int_0^T f(t)^2\, dt.

\rho(x, y) = \frac{1}{T} \int_0^T |x(t) - y(t)|\, dt.

v = \iint \rho(x, y) P(x, y)\, dx\, dy.
This means that we consider, in effect, all the communication systems that
might be used and that transmit with the required fidelity. The rate of
transmission in bits per second is calculated for each one and we choose that
having the least rate. This latter rate is the rate we assign the source for
the fidelity in question.
The justification of this definition lies in the following result:
Theorem 21: If a source has a rate R_1 for a valuation v_1 it is possible to
encode the output of the source and transmit it over a channel of capacity C
with fidelity as near v_1 as desired provided R_1 < C. This is not possible
if R_1 > C.
The last statement in the theorem follows immediately from the definition
of R_1 and previous results. If it were not true we could transmit more than
C bits per second over a channel of capacity C. The first part of the theorem
is proved by a method analogous to that used for Theorem 11. We may, in
the first place, divide the (x, y) space into a large number of small cells and
represent the situation as a discrete case. This will not change the evalua-
tion function by more than an arbitrarily small amount (when the cells are
very small) because of the continuity assumed for \rho(x, y). Suppose that
P_1(x, y) is the particular system which minimizes the rate and gives R_1. We
choose from the high probability y's a set at random containing

2^{(R_1 + \epsilon) T}

members where \epsilon \to 0 as T \to \infty. With large T each chosen point will be
connected by a high probability line (as in Fig. 10) to a set of x's. A calcu-
lation similar to that used in proving Theorem 11 shows that with large T
almost all x’s are covered by the fans from the chosen y points for almost
all choices of the y’s. The communication system to bc used operates as
follows: The selected points are assigned binary numbers. When a message
x is originated it will (with probability approaching 1 as T \to \infty) lie within
one at least of the fans. The corresponding binary number is transmitted
(or one of them chosen arbitrarily if there are several) over the channel by
suitable coding means to give a small probability of error. Since RI < C
this is possible. At the receiving point the corresponding y is reconstructed
and used as the recovered message.
The evaluation v' for this system can be made arbitrarily close to v_1 by
taking T sufficiently large. This is due to the fact that for each long sample
of message x(t) and recovered message y(t) the evaluation approaches v_1
(with probability 1).
It is interesting to note that, in this system, the noise in the recovered
message is actually produced by a kind of general quantizing at the trans-
mitter and is not produced by the noise in the channel. It is more or less
analogous to the quantizing noise in P.C.M.
R = \min_{P_x(y)} \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy
Theorem 22: The rate for a white noise source of power Q and band W_1
relative to an R.M.S. measure of fidelity is

R = W_1 \log \frac{Q}{N}

where N is the allowed mean square error between original and recovered
messages.
More generally with any message source we can obtain inequalities bound-
ing the rate relative to a mean square error criterion.
Theorem 23: The rate for any source of band W_1 is bounded by

W_1 \log \frac{Q_1}{N} \le R_1 \le W_1 \log \frac{Q}{N}

where Q is the average power of the source, Q_1 its entropy power and N the
allowed mean square error.
The lower bound follows from the fact that the max H_y(x) for a given
\overline{(x - y)^2} = N occurs in the white noise case. The upper bound results if we
place the points (used in the proof of Theorem 21) not in the best way but
at random in a sphere of radius \sqrt{Q - N}.
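Theorems 22 and 23 are again direct to evaluate. A sketch with assumed figures (band, powers, and allowed error are my own choices):

    import math

    def white_noise_rate(W1, Q, N):
        """Theorem 22: R = W1 log2(Q / N) for a white noise source of power Q."""
        return W1 * math.log2(Q / N)

    def bounds_any_source(W1, Q, Q1, N):
        """Theorem 23: W1 log2(Q1/N) <= R1 <= W1 log2(Q/N); Q1 is the entropy power."""
        return W1 * math.log2(Q1 / N), W1 * math.log2(Q / N)

    # Assumed example: a 5000 cycle source of power 10, entropy power 6,
    # to be reproduced with mean square error 0.1.
    print(white_noise_rate(5000, 10, 0.1))
    print(bounds_any_source(5000, 10, 6, 0.1))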
ACKNOWLEDGMENTS
The writer is indebted to his colleagues at the Laboratories, particularly
to Dr. H. W. Bode, Dr. J. R. Pierce, Dr. B. McMillan, and Dr. B. M. Oliver
for many helpful suggestions and criticisms during the course of this work.
Credit should also be given to Professor N. Wiener, whose elegant solution
of the problems of filtering and prediction of stationary ensembles has con-
siderably influenced the writer’s thinking in this field.
APPENDIX 5
Let S_1 be any measurable subset of the g ensemble, and S_2 the subset of
the f ensemble which gives S_1 under the operation T. Then

S_1 = T S_2.

Let H^\lambda be the operator which shifts all functions in a set by the time \lambda.
Then

H^\lambda S_1 = H^\lambda T S_2 = T H^\lambda S_2

since T is invariant and therefore commutes with H^\lambda. Hence if m[S] is the
probability measure of the set S

m[H^\lambda S_1] = m[T H^\lambda S_2] = m[H^\lambda S_2]
             = m[S_2] = m[S_1]
H_2 = -\int q(x_i) \log q(x_i)\, dx_i.
We consider then

\delta H = -\int \left\{ [1 + \log r(x)]\, \delta r(x) + \lambda [1 + \log p(x)]\, \delta p(x) + \mu [1 + \log q(x)]\, \delta q(x) \right\} dx.
If we multiply the first by p(s_i) and the second by q(s_i) and integrate with
respect to s we obtain

H_3 = -\lambda H_1
H_3 = -\mu H_2

or solving for \lambda and \mu and replacing in the equations
Then r(x_i) will also be normal with quadratic form C_{ij}. If the inverses of
these forms are a_{ij}, b_{ij}, c_{ij} then

c_{ij} = a_{ij} + b_{ij}.

We wish to show that these functions satisfy the minimizing conditions if
and only if a_{ij} = K b_{ij} and thus give the minimum H_3 under the constraints.
First we have
where

P(X_i) is the probability measure of the strip over X_i,
P(Y_i) is the probability measure of the strip over Y_i,
P(X_i, Y_i) is the probability measure of the intersection of the strips.
A further subdivision can never decrease R_1. For let X_1 be divided into
X_1 = X_1' + X_1'' and let

P(Y_1) = a, \qquad P(X_1) = b + c
P(X_1') = b, \qquad P(X_1', Y_1) = d
P(X_1'') = c, \qquad P(X_1'', Y_1) = e
P(X_1, Y_1) = d + e.

Then in the sum we have replaced (for the X_1, Y_1 intersection)

(d + e) \log \frac{d + e}{a(b + c)}

by

d \log \frac{d}{ab} + e \log \frac{e}{ac}
and consequently the sum is increased. Thus the various possible subdivi-
sions form a directed set, with R monotonic increasing with refinement of
the subdivision. We may define R unambiguously as the least upper bound
for the R_1 and write it

R = \frac{1}{T} \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy.
This integral, understood in the above sense, includes both the continuous
and discrete cases and of course many others which cannot be represented
in either form. It is trivial in this formulation that if x and u are in one-to-
one correspondence, the rate from u to y is equal to that from x to y. If v
is any function of y (not necessarily with an inverse) then the rate from x to
y is greater than or equal to that from x to v since, in the calculation of the
approximations, the subdivisions of y are essentially a finer subdivision of
those for v. More generally if y and v are related not functionally but
statistically, i.e., we have a probability measure space (y, v), then R(x, v) \le
R(x, y). This means that any operation applied to the received signal, even
though it involves statistical elements, does not increase R.
Another notion which should be defined precisely in an abstract formu-
lation of the theory is that of “dimension rate,” that is the average number
of dimensions required per second to specify a member of an ensemble. In
the band limited case 2W numbers per second are sufficient. A general
definition can be framed as follows. Let f_\alpha(t) be an ensemble of functions
and let \rho_T[f_\alpha(t), f_\beta(t)] be a metric measuring the "distance" from f_\alpha to f_\beta
over the time T (for example the R.M.S. discrepancy over this interval).
Let N(\epsilon, \delta, T) be the least number of elements f which can be chosen such
that all elements of the ensemble apart from a set of measure \delta are within
the distance \epsilon of at least one of those chosen. Thus we are covering the
space to within \epsilon apart from a set of small measure \delta. We define the di-
mension rate \lambda for the ensemble by the triple limit