
A Mathematical Theory of Communication

By C. E. SHANNON

(Concluded from July 1948 issue)

PART III: MATHEMATICAL PRELIMINARIES


In this final installment of the paper we consider the case where the
signals or the messages or both are continuously variable, in contrast with
the discrete nature assumed until now. To a considerable extent the con-
tinuous case can be obtained through a limiting process from the discrete
case by dividing the continuum of messages and signals into a large but finite
number of small regions and calculating the various parameters involved on
a discrete basis. As the size of the regions is decreased these parameters in
general approach as limits the proper values for the continuous case. There
are, however, a few new effects that appear and also a general change of
emphasis in the direction of specialization of the general results to particu-
lar cases.
We will not attempt, in the continuous case, to obtain our results with
the greatest generality, or with the extreme rigor of pure mathematics, since
this would involve a great deal of abstract measure theory and would ob-
scure the main thread of the analysis. A preliminary study, however, indi-
cates that the theory can be formulated in a completely axiomatic and
rigorous manner which includes both the continuous and discrete cases and
many others. The occasional liberties taken with limiting processes in the
present analysis can be justified in all cases of practical interest.

18. SETS AND ENSEMBLES OF FUNCTIONS

We shall have to deal in the continuous case with sets of functions and
ensembles of functions. A set of functions, as the name implies, is merely a
class or collection of functions, generally of one variable, time. It can be
specified by giving an explicit representation of the various functions in the
set, or implicitly by giving a property which functions in the set possess and
others do not. Some examples are:
1. The set of functions:
$$f_\theta(t) = \sin(t + \theta).$$
Each particular value of θ determines a particular function in the set.

2. The set of all functions of time containing no frequencies over W cycles
per second.
3. The set of all functions limited in band to W and in amplitude to A.
4. The set of all English speech signals as functions of time.
An ensemble of functions is a set of functions together with a probability
measure whereby we may determine the probability of a function in the
set having certain properties.¹ For example, with the set
$$f_\theta(t) = \sin(t + \theta),$$
we may give a probability distribution for θ, P(θ). The set then becomes
an ensemble.
Some further examples of ensembles of functions are:
Some further examples of ensembles of functions are:
1. A finite set of functions f_k(t) (k = 1, 2, ..., n) with the probability of
f_k(t) being p_k.
2. A finite dimensional family of functions
$$f(\alpha_1, \ldots, \alpha_n; t)$$
with a probability distribution for the parameters α_i:
$$p(\alpha_1, \ldots, \alpha_n).$$
For example we could consider the ensemble defined by
$$f(a_1, \ldots, a_n, \theta_1, \ldots, \theta_n; t) = \sum_{n} a_n \sin n(\omega t + \theta_n)$$
with the amplitudes a_n distributed normally and independently, and the
phases θ_n distributed uniformly (from 0 to 2π) and independently.
3. The ensemble
$$f(a_i; t) = \sum_{n=-\infty}^{+\infty} a_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$
with the a_n normal and independent, all with the same standard deviation
√N. This is a representation of "white" noise, band-limited to the band
from 0 to W cycles per second and with average power N.² (A Python sketch
of this construction, added for illustration, follows this list of examples.)

¹ In mathematical terminology the functions belong to a measure space whose total
measure is unity.
² This representation can be used as a definition of band-limited white noise. It has
certain advantages in that it involves fewer limiting operations than do definitions that
have been used in the past. The name "white noise," already firmly entrenched in the
literature, is perhaps somewhat unfortunate. In optics white light means either any
continuous spectrum as contrasted with a point spectrum, or a spectrum which is flat with
wavelength (which is not the same as a spectrum flat with frequency).

4. Let points be distributed on the t axis according to a Poisson distribu-
tion. At each selected point the function f(t) is placed and the different
functions added, giving the ensemble
$$\sum_{k=-\infty}^{+\infty} f(t + t_k)$$
where the t_k are the points of the Poisson distribution. This ensemble
can be considered as a type of impulse or shot noise where all the impulses
are identical.
5. The set of English speech functions with the probability measure given
by the frequency of occurrence in ordinary use.
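The white-noise representation of example 3 lends itself to a direct numerical construction. The following Python sketch is an illustration added here, not part of the original paper; the function name and parameter choices are my own, and the infinite series is necessarily truncated to finitely many terms.

```python
import numpy as np

def bandlimited_white_noise(W, N, T, fs, rng=None):
    """Draw one member of the ensemble of example 3:
    f(t) = sum_n a_n sin(pi(2Wt - n)) / (pi(2Wt - n)),
    with the a_n independent normal variables of standard deviation sqrt(N).
    W: band in cycles per second, N: average power, T: duration in seconds,
    fs: rate at which the waveform is evaluated for inspection."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(0.0, T, 1.0 / fs)
    n = np.arange(int(2 * W * T))                  # sample indices spaced 1/(2W) apart
    a = rng.normal(0.0, np.sqrt(N), size=n.size)   # gaussian coordinates a_n
    # np.sinc(x) = sin(pi x) / (pi x), the interpolation kernel of the series
    return t, np.sinc(2 * W * t[:, None] - n[None, :]) @ a

t, f = bandlimited_white_noise(W=100.0, N=2.0, T=4.0, fs=1000.0)
print(np.mean(f ** 2))   # time-average power, roughly N = 2 apart from edge effects
```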
An ensemble of functions f_α(t) is stationary if the same ensemble results
when all functions are shifted any fixed amount in time. The ensemble
$$f_\theta(t) = \sin(t + \theta)$$
is stationary if θ is distributed uniformly from 0 to 2π. If we shift each func-
tion by t_1 we obtain
$$f_\theta(t + t_1) = \sin(t + t_1 + \theta) = \sin(t + \varphi)$$
with φ distributed uniformly from 0 to 2π. Each function has changed
but the ensemble as a whole is invariant under the translation. The other
examples given above are also stationary.
An ensemble is ergodic if it is stationary, and there is no subset of the func-
tions in the set with a probability different from 0 and 1 which is stationary.
The ensemble
$$\sin(t + \theta)$$
is ergodic. No subset of these functions of probability ≠ 0, 1 is transformed
into itself under all time translations. On the other hand the ensemble
$$a \sin(t + \theta)$$
with a distributed normally and θ uniform is stationary but not ergodic.
The subset of these functions with a between 0 and 1, for example, is
stationary.
Of the examples given, 3 and 4 are ergodic, and 5 may perhaps be con-
sidered so. If an ensemble is ergodic we may say roughly that each func-
tion in the set is typical of the ensemble. More precisely it is known that
with an ergodic ensemble an average of any statistic over the ensemble is
equal (with probability 1) to an average over all the time translations of a
particular function in the set.³ Roughly speaking, each function can be ex-
pected, as time progresses, to go through, with the proper frequency, all the
convolutions of any of the functions in the set.
Just as we may perform various operations on numbers or functions to
obtain new numbers or functions, we can perform operations on ensembles
to obtain new ensembles. Suppose, for example, we have an ensemble of
functions f_α(t) and an operator T which gives for each function f_α(t) a result
g_α(t):
$$g_\alpha(t) = T f_\alpha(t).$$
Probability measure is defined for the set g_α(t) by means of that for the set
f_α(t). The probability of a certain subset of the g_α(t) functions is equal
to that of the subset of the f_α(t) functions which produce members of the
given subset of g functions under the operation T. Physically this corre-
sponds to passing the ensemble through some device, for example, a filter,
a rectifier or a modulator. The output functions of the device form the
ensemble g_α(t).
A device or operator T will be called invariant if shifting the input merely
shifts the output, i.e., if
$$g_\alpha(t) = T f_\alpha(t)$$
implies
$$g_\alpha(t + t_1) = T f_\alpha(t + t_1)$$
for all f_α(t) and all t_1. It is easily shown (see Appendix 5) that if T is in-
variant and the input ensemble is stationary then the output ensemble is
stationary. Likewise if the input is ergodic the output will also be ergodic.
A filter or a rectifier is invariant under all time translations. The opera-
tion of modulation is not since the carrier phase gives a certain time struc-
ture. However, modulation is invariant under all translations which are
multiples of the period of the carrier.
Wiener has pointed out the intimate relation between the invariance of
physical devices under time translations and Fourier theory.⁴ He has
³ This is the famous ergodic theorem, or rather one aspect of this theorem, which was
proved in somewhat different formulations by Birkhoff, von Neumann, and Koopman, and
subsequently generalized by Wiener, Hopf, Hurewicz and others. The literature on ergodic
theory is quite extensive and the reader is referred to the papers of these writers for pre-
cise and general formulations; e.g., E. Hopf, "Ergodentheorie," Ergebnisse der Mathematik
und ihrer Grenzgebiete, Vol. 5; "On Causality Statistics and Probability," Journal of
Mathematics and Physics, Vol. XIII, No. 1, 1934; N. Wiener, "The Ergodic Theorem,"
Duke Mathematical Journal, Vol. 5, 1939.
⁴ Communication theory is heavily indebted to Wiener for much of its basic philosophy
and theory. His classic NDRC report "The Interpolation, Extrapolation, and Smoothing
of Stationary Time Series," to appear soon in book form, contains the first clear-cut
formulation of communication theory as a statistical problem, the study of operations

shown, in fact, that if a device is linear as well as invariant Fourier analysis


is then the appropriate mathematical tool for dealing with the problem.
An ensemble of functions is the appropriate mathematical representation
of the messages produced by a continuous source (for example speech), of
the signals produced by a transmitter, and of the perturbing noise. Com-
munication theory is properly concerned, as has been emphasized by Wiener,
not with operations on particular functions, but with operations on en-
sembles of functions. A communication system is designed not for a par-
ticular speech function and still less for a sine wave, but for the ensemble of
speech functions.

19. BAND LIMITED ENSEMBLES OF FUNCTIONS

If a function of time f(t) is limited to the band from 0 to W cycles per
second it is completely determined by giving its ordinates at a series of dis-
crete points spaced 1/2W seconds apart in the manner indicated by the follow-
ing result.⁵
Theorem 13: Let f(t) contain no frequencies over W. Then
$$f(t) = \sum_{n=-\infty}^{\infty} X_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$
where
$$X_n = f\left(\frac{n}{2W}\right).$$

In this expansion f(t) is represented as a sum of orthogonal functions.


The coefficients X_n of the various terms can be considered as coordinates in
an infinite dimensional "function space." In this space each function cor-
responds to precisely one point and each point to one function.
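Theorem 13 can be checked numerically. The sketch below is an illustration added here, not part of the paper; the function name and test signal are my own choices, and the infinite series is truncated to the samples supplied.

```python
import numpy as np

def cardinal_series(samples, W, t):
    """Evaluate f(t) = sum_n X_n sinc(2Wt - n) from samples X_n = f(n / 2W).
    The series is truncated to the samples supplied, so points near the ends
    of the sampled stretch are reconstructed less accurately."""
    n = np.arange(len(samples))
    return np.sinc(2 * W * np.asarray(t)[:, None] - n[None, :]) @ samples

W = 50.0                                        # band limit in cycles per second
n = np.arange(500)
X = np.sin(2 * np.pi * 30.0 * n / (2 * W))      # samples of a 30-cycle tone inside the band
t = np.array([2.013, 2.507, 3.251])             # interior points between sample instants
print(cardinal_series(X, W, t))
print(np.sin(2 * np.pi * 30.0 * t))             # agrees closely with the reconstruction
```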
A function can be considered to be substantially limited to a time T if all
the ordinates X_n outside this interval of time are zero. In this case all but
2TW of the coordinates will be zero. Thus functions limited to a band W
and duration T correspond to points in a space of 2TW dimensions.
A subset of the functions of band W and duration T corresponds to a re-
gion in this space. For example, the functions whose total energy is less
than or equal to E correspond to points in a 2TW dimensional sphere with
radius $r = \sqrt{2WE}$.
⁴ (continued) on time series. This work, although chiefly concerned with the linear prediction and
filtering problem, is an important collateral reference in connection with the present paper.
We may also refer here to Wiener's forthcoming book "Cybernetics" dealing with the
general problems of communication and control.
⁵ For a proof of this theorem and further discussion see the author's paper "Communi-
cation in the Presence of Noise" to be published in the Proceedings of the Institute of Radio
Engineers.
An ensemble of functions of limited duration and band will be represented
by a probability distribution P(x_1, ..., x_n) in the corresponding n dimensional
space. If the ensemble is not limited in time we can consider the 2TW co-
ordinates in a given interval T to represent substantially the part of the
function in the interval T and the probability distribution P(x_1, ..., x_n)
to give the statistical structure of the ensemble for intervals of that duration.
20. ENTROPY OF A CONTINUOUS DISTRIBUTION
The entropy of a discrete set of probabilities p_1, ..., p_n has been defined as
$$H = -\sum p_i \log p_i\,.$$
In an analogous manner we define the entropy of a continuous distribution
with the density distribution function p(x) by
$$H = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx\,.$$
With an n dimensional distribution p(x_1, ..., x_n) we have
$$H = -\int \cdots \int p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n\,.$$

If we have two arguments x and y (which may themselves be multi-dimen-
sional) the joint and conditional entropies of p(x, y) are given by
$$H(x, y) = -\iint p(x, y) \log p(x, y)\, dx\, dy$$
and
$$H_x(y) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)}\, dx\, dy$$
$$H_y(x) = -\iint p(x, y) \log \frac{p(x, y)}{p(y)}\, dx\, dy$$
where
$$p(x) = \int p(x, y)\, dy\,, \qquad p(y) = \int p(x, y)\, dx\,.$$
The entropies of continuous distributions have most (but not all) of the
properties of the discrete case. In particular we have the following:

1. If x is limited to a certain volume v in its space, then H(x) is a maximum
and equal to log v when p(x) is constant (1/v) in the volume.
2. With any two variables x, y we have
$$H(x, y) \le H(x) + H(y)$$
with equality if (and only if) x and y are independent, i.e., p(x, y) = p(x) p(y)
(apart possibly from a set of points of probability zero).
3. Consider a generalized averaging operation of the following type:
$$p'(y) = \int a(x, y)\, p(x)\, dx$$
with
$$\int a(x, y)\, dx = \int a(x, y)\, dy = 1\,, \qquad a(x, y) \ge 0\,.$$
Then the entropy of the averaged distribution p'(y) is equal to or greater
than that of the original distribution p(x).
4. We have
$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x)$$
and
$$H_x(y) \le H(y)\,.$$
5. Let p(x) be a one-dimensional distribution. The form of p(x) giving a
maximum entropy subject to the condition that the standard deviation
of x be fixed at σ is gaussian. To show this we must maximize
$$H(x) = -\int p(x) \log p(x)\, dx$$
with
$$\sigma^2 = \int p(x)\, x^2\, dx \quad \text{and} \quad 1 = \int p(x)\, dx$$
as constraints. This requires, by the calculus of variations, maximizing
$$\int \left[-p(x) \log p(x) + \lambda p(x) x^2 + \mu p(x)\right] dx\,.$$
The condition for this is
$$-1 - \log p(x) + \lambda x^2 + \mu = 0$$
and consequently (adjusting the constants to satisfy the constraints)
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}.$$

Similarly in n dimensions, suppose the second order moments of
p(x_1, ..., x_n) are fixed at A_ij:
$$A_{ij} = \int \cdots \int x_i x_j\, p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n\,.$$
Then the maximum entropy occurs (by a similar calculation) when
p(x_1, ..., x_n) is the n dimensional gaussian distribution with the second
order moments A_ij.
6. The entropy of a one-dimensional gaussian distribution whose standard
deviation is σ is given by
$$H(x) = \log \sqrt{2\pi e}\,\sigma\,.$$
This is calculated as follows:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$$
$$-\log p(x) = \log \sqrt{2\pi}\,\sigma + \frac{x^2}{2\sigma^2}$$
$$\begin{aligned}
H(x) &= -\int p(x) \log p(x)\, dx \\
&= \int p(x) \log \sqrt{2\pi}\,\sigma\, dx + \int p(x)\, \frac{x^2}{2\sigma^2}\, dx \\
&= \log \sqrt{2\pi}\,\sigma + \frac{\sigma^2}{2\sigma^2} \\
&= \log \sqrt{2\pi}\,\sigma + \log \sqrt{e} \\
&= \log \sqrt{2\pi e}\,\sigma\,.
\end{aligned}$$
(A numerical check of this value, added for illustration, follows this list of
properties.)
Similarly the n dimensional gaussian distribution with associated
quadratic form a_ij is given by
$$p(x_1, \ldots, x_n) = \frac{|a_{ij}|^{1/2}}{(2\pi)^{n/2}} \exp\left(-\tfrac{1}{2}\sum a_{ij}\, x_i x_j\right)$$
and the entropy can be calculated as
$$H = \log (2\pi e)^{n/2} |a_{ij}|^{-1/2}$$
where |a_ij| is the determinant whose elements are a_ij.


7. If x is limited to a half line (p(x) = 0 for x ≤ 0) and the first moment of
x is fixed at a:
$$a = \int_0^\infty p(x)\, x\, dx\,,$$
then the maximum entropy occurs when
$$p(x) = \frac{1}{a}\, e^{-x/a}$$
and is equal to log ea.
8. There is one important difference between the continuous and discrete
entropies. In the discrete case the entropy measures in an absolute
way the randomness of the chance variable. In the continuous case the
measurement is relative to the coordinate system. If we change coordinates
the entropy will in general change. In fact if we change to coordinates
y_1, ..., y_n the new entropy is given by
$$H(y) = -\int \cdots \int p(x_1, \ldots, x_n)\, J\!\left(\frac{x}{y}\right) \log\left[p(x_1, \ldots, x_n)\, J\!\left(\frac{x}{y}\right)\right] dy_1 \cdots dy_n$$
where $J\!\left(\frac{x}{y}\right)$ is the Jacobian of the coordinate transformation. On ex-
panding the logarithm and changing variables to x_1, ..., x_n, we obtain:
$$H(y) = H(x) - \int \cdots \int p(x_1, \ldots, x_n) \log J\!\left(\frac{x}{y}\right) dx_1 \cdots dx_n\,.$$

Thus the new entropy is the old entropy less the expected logarithm of
the Jacobian. In the continuous case the entropy can be considered a
measure of randomness relative to an assumed standard, namely the co-
ordinate system chosen with each small volume element dx_1 ... dx_n given
equal weight. When we change the coordinate system the entropy in
the new system measures the randomness when equal volume elements
dy_1 ... dy_n in the new system are given equal weight.
In spite of this dependence on the coordinate system the entropy
concept is as important in the continuous case as the discrete case. This
is due to the fact that the derived concepts of information rate and
channel capacity depend on the difference of two entropies and this
difference does not depend on the coordinate frame, each of the two terms
being changed by the same amount.
The entropy of a continuous distribution can be negative. The scale
of measurements sets an arbitrary zero corresponding to a uniform dis-
tribution over a unit volume. A distribution which is more confined than
this has less entropy and will be negative. The rates and capacities will,
however, always be non-negative.
9. A particular case of changing coordinates is the linear transformation
$$y_j = \sum_i a_{ij}\, x_i\,.$$
In this case the Jacobian is simply the determinant $|a_{ij}|^{-1}$ and
$$H(y) = H(x) + \log |a_{ij}|\,.$$
In the case of a rotation of coordinates (or any measure preserving trans-
formation) J = 1 and H(y) = H(x).
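The following sketch is an added illustration, not part of the paper; the histogram estimator and the constants are my own choices. It checks the gaussian entropy of item 6 numerically and also item 9's rule that a scale change y = ax adds log a to the entropy.

```python
import numpy as np

def differential_entropy(samples, bins=2000):
    """Crude histogram estimate of -∫ p(x) log p(x) dx (natural logarithms)."""
    p, edges = np.histogram(samples, bins=bins, density=True)
    dx = edges[1] - edges[0]
    p = p[p > 0]
    return -np.sum(p * np.log(p)) * dx

rng = np.random.default_rng(0)
sigma, a = 2.0, 3.0
x = rng.normal(0.0, sigma, size=1_000_000)

print(differential_entropy(x))                    # estimate of H(x)
print(np.log(np.sqrt(2 * np.pi * np.e) * sigma))  # log sqrt(2*pi*e)*sigma ≈ 2.112
print(differential_entropy(a * x))                # ≈ previous value + log 3 ≈ 3.211
```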

21. ENTROPY OF AN ENSEMBLE OF FUNCTIONS

Consider an ergodic ensemble of functions limited to a certain band of
width W cycles per second. Let
$$p(x_1, \ldots, x_n)$$
be the density distribution function for amplitudes x_1, ..., x_n at n successive
sample points. We define the entropy of the ensemble per degree of free-
dom by
$$H' = -\lim_{n \to \infty} \frac{1}{n} \int \cdots \int p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n\,.$$
We may also define an entropy H per second by dividing, not by n, but by
the time T in seconds for n samples. Since n = 2TW, H = 2WH'.
With white thermal noise p is gaussian and we have
$$H' = \log \sqrt{2\pi e N}\,,$$
$$H = W \log 2\pi e N\,.$$

For a given average power N, white noise has the maximum possible
entropy. This follows from the maximizing properties of the Gaussian
distribution noted above.
The entropy for a continuous stochastic process has many properties
analogous to that for discrete processes. In the discrete case the entropy
was related to the logarithm of the probability of long sequences, and to the
number of reasonably probable sequences of long length. In the continuous
case it is related in a similar fashion to the logarithm of the probability
density for a long series of samples, and the volume of reasonably high prob-
ability in the function space.
More precisely, if we assume p(x_1, ..., x_n) continuous in all the x_i for all n,
then for sufficiently large n
$$\left| \frac{\log p(x_1, \ldots, x_n)}{n} + H' \right| < \epsilon$$
for all choices of (x_1, ..., x_n) apart from a set whose total probability is
less than δ, with δ and ε arbitrarily small. This follows from the ergodic
property if we divide the space into a large number of small cells.

The relation of H' to volume can be stated as follows: Under the same as-
sumptions consider the n dimensional space corresponding to p(x_1, ..., x_n).
Let V_n(q) be the smallest volume in this space which includes in its interior
a total probability q. Then
$$\lim_{n \to \infty} \frac{\log V_n(q)}{n} = H'$$
provided q does not equal 0 or 1.


These results show that for large n there is a rather well-defined volume (at
least in the logarithmic sense) of high probability, and that within this
volume the probability density is relatively uniform (again in the logarithmic
sense).
In the white noise case the distribution function is given by
$$p(x_1, \ldots, x_n) = \frac{1}{(2\pi N)^{n/2}} \exp\left(-\frac{1}{2N} \sum x_i^2\right).$$
Since this depends only on $\sum x_i^2$ the surfaces of equal probability density
are spheres and the entire distribution has spherical symmetry. The region
of high probability is a sphere of radius $\sqrt{nN}$. As $n \to \infty$ the probability
of being outside a sphere of radius $\sqrt{n(N + \epsilon)}$ approaches zero and $\tfrac{1}{n}$ times
the logarithm of the volume of the sphere approaches $\log \sqrt{2\pi e N}$.
In the continuous case it is convenient to work not with the entropy H of
an ensemble but with a derived quantity which we will call the entropy
power. This is defined as the power in a white noise limited to the same
band as the original ensemble and having the same entropy. In other words
if H' is the entropy of an ensemble its entropy power is

$$N_1 = \frac{1}{2\pi e} \exp 2H'\,.$$

In the geometrical picture this amounts to measuring the high probability


volume by the squared radius of a sphere having the same volume. Since
white noise has the maximum entropy for a given power, the entropy power
of any noise is less than or equal to its actual power.
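As a small illustration of the definition (added here, not from the paper; the constants are arbitrary), the entropy power of a gaussian ensemble reproduces its actual power, while a non-gaussian ensemble of the same power has a smaller entropy power:

```python
import numpy as np

def entropy_power(h_prime):
    """Entropy power N1 = exp(2 H') / (2 pi e) for entropy H' per degree of freedom."""
    return np.exp(2.0 * h_prime) / (2.0 * np.pi * np.e)

N = 3.0
h_gauss = np.log(np.sqrt(2 * np.pi * np.e * N))   # gaussian ensemble of power N
print(entropy_power(h_gauss))                     # exactly 3.0

h_uniform = np.log(np.sqrt(12 * N))               # uniform distribution of the same power
print(entropy_power(h_uniform))                   # 12 N / (2 pi e) ≈ 0.70 N, less than N
```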

21. ENTROPY LOSS IN LINEAR FILTERS

Theorem 14: If an ensemble having an entropy H_1 per degree of freedom
in band W is passed through a filter with characteristic Y(f) the output
ensemble has an entropy
$$H_2 = H_1 + \frac{1}{W} \int_W \log |Y(f)|^2\, df\,.$$

The operation of the filter is essentially a linear transformation of co-


ordinates. If we think of the different frequency components as the original
coordinate system, the new frequency components are merely the old ones
multiplied by factors. The coordinate transformation matrix is thus es-

TABLE I
[Table I lists, for five ideal gain characteristics (shown as small graphs in the
original), the entropy power factor, the entropy power gain in decibels
(-8.68, -5.32, -4.15, -2.66 and -8.68 db respectively), and the corresponding
impulse responses. The gain curves and most of the impulse-response formulas
are not legible in this scan; among the legible entries, the first characteristic
has entropy power factor 1/e² and the third has 0.384.]

sentially diagonalized in terms of these coordinates. The Jacobian of the
transformation is (for n sine and n cosine components)
$$J = \prod_{i=1}^{n} |Y(f_i)|^2$$

where the f_i are equally spaced through the band W. This becomes in
the limit
$$\exp \frac{1}{W} \int_W \log |Y(f)|^2\, df\,.$$

Since J is constant its average value is this same quantity and applying the
theorem on the change of entropy with a change of coordinates, the result
follows. We may also phrase it in terms of the entropy power. Thus if
the entropy power of the first ensemble is N_1 that of the second is
$$N_2 = N_1 \exp \frac{1}{W} \int_W \log |Y(f)|^2\, df\,.$$
The final entropy power is the initial entropy power multiplied by the geo-
metric mean gain of the filter. If the gain is measured in db, then the
output entropy power will be increased by the arithmetic mean db gain
over W.
In Table I the entropy power loss has been calculated (and also expressed
in db) for a number of ideal gain characteristics. The impulsive responses
of these filters are also given for W = 2π, with phase assumed to be 0.
The entropy loss for many other cases can be obtained from these results.
For example the entropy power factor 1/e² for the first case also applies to any
gain characteristic obtained from 1 − ω by a measure preserving transforma-
tion of the ω axis. In particular a linearly increasing gain G(ω) = ω, or a
"saw tooth" characteristic between 0 and 1, have the same entropy loss.
The reciprocal gain has the reciprocal factor. Thus 1/ω has the factor e².
Raising the gain to any power raises the factor to this power.
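The first entry of Table I can be reproduced numerically. The sketch below is an added illustration, not from the paper; it computes the geometric-mean power gain exp[(1/W)∫ log|Y(f)|² df] for the triangular characteristic Y(f) = 1 − f/W.

```python
import numpy as np

def entropy_power_factor(gain, W, n=200_000):
    """Geometric-mean power gain exp((1/W) ∫_0^W log gain(f)^2 df), by the midpoint rule."""
    f = (np.arange(n) + 0.5) * (W / n)
    return np.exp(np.mean(np.log(gain(f) ** 2)))

W = 1.0
factor = entropy_power_factor(lambda f: 1.0 - f / W, W)
print(factor, 1.0 / np.e ** 2)        # ≈ 0.1353 in both cases
print(10 * np.log10(factor))          # ≈ -8.7 db, the first value quoted in Table I
```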
22. ENTROPY OF THE SUM OF TWO ENSEMBLES

If we have two ensembles of functions f_α(t) and g_β(t) we can form a new
ensemble by "addition." Suppose the first ensemble has the probability
density function p(x_1, ..., x_n) and the second q(x_1, ..., x_n). Then the
density function for the sum is given by the convolution:
$$r(x_1, \ldots, x_n) = \int \cdots \int p(y_1, \ldots, y_n)\, q(x_1 - y_1, \ldots, x_n - y_n)\, dy_1 \cdots dy_n\,.$$
Physically this corresponds to adding the noises or signals represented by
the original ensembles of functions.

The following result is derived in Appendix 6.
Theorem 15: Let the average power of two ensembles be N_1 and N_2 and
let their entropy powers be $\bar{N}_1$ and $\bar{N}_2$. Then the entropy power of the
sum, $\bar{N}_3$, is bounded by
$$\bar{N}_1 + \bar{N}_2 \le \bar{N}_3 \le N_1 + N_2\,.$$
White Gaussian noise has the peculiar property that it can absorb any
other noise or signal ensemble which may be added to it with a resultant
entropy power approximately equal to the sum of the white noise power and
the signal power (measured from the average signal value, which is normally
zero), provided the signal power is small, in a certain sense, compared to
the noise.
Consider the function space associated with these ensembles having n
dimensions. The white noise corresponds to a spherical Gaussian distribu-
tion in this space. The signal ensemble corresponds to another probability
distribution, not necessarily Gaussian or spherical. Let the second moments
of this distribution about its center of gravity be a_ij. That is, if
p(x_1, ..., x_n) is the density distribution function
$$a_{ij} = \int \cdots \int p(x_1, \ldots, x_n)(x_i - \alpha_i)(x_j - \alpha_j)\, dx_1 \cdots dx_n$$
where the α_i are the coordinates of the center of gravity. Now a_ij is a posi-
tive definite quadratic form, and we can rotate our coordinate system to
align it with the principal directions of this form. a_ij is then reduced to
diagonal form b_ii. We require that each b_ii be small compared to N, the
squared radius of the spherical distribution.
In this case the convolution of the noise and signal produces a Gaussian
distribution whose corresponding quadratic form is
$$N + b_{ii}\,.$$
The entropy power of this distribution is
$$\left[\prod_i (N + b_{ii})\right]^{1/n}$$
or approximately
$$\left[(N)^n + \sum_i b_{ii}(N)^{n-1}\right]^{1/n} = N + \frac{1}{n}\sum_i b_{ii}\,.$$
The last term is the signal power, while the first is the noise power.

PART IV: THE CONTINUOUS CHANNEL


23. THE CAPACITY OF A CONTINUOUS CHANNEL

In a continuous channel the input or transmitted signals will be con-


tinuous functions of time f(t) belonging to a certain set, and the output or
received signals will be perturbed versions of these. We will consider only
the case where both transmitted and received signals are limited to a certain
band W. They can then be specified, for a time T, by 2TW numbers, and
their statistical structure by finite dimensional distribution functions.
Thus the statistics of the transmitted signal will be determined by
$$P(x_1, \ldots, x_n) = P(x)$$
and those of the noise by the conditional probability distribution
$$P_{x_1, \ldots, x_n}(y_1, \ldots, y_n) = P_x(y)\,.$$
The rate of transmission of information for a continuous channel is defined
in a way analogous to that for a discrete channel, namely
$$R = H(x) - H_y(x)$$
where H(x) is the entropy of the input and H_y(x) the equivocation. The
channel capacity C is defined as the maximum of R when we vary the input
over all possible ensembles. This means that in a finite dimensional ap-
proximation we must vary P(x) = P(x_1, ..., x_n) and maximize
$$-\int P(x) \log P(x)\, dx + \iint P(x, y) \log \frac{P(x, y)}{P(y)}\, dx\, dy\,.$$
This can be written
$$\iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy$$
using the fact that $\iint P(x, y) \log P(x)\, dx\, dy = \int P(x) \log P(x)\, dx$. The
channel capacity is thus expressed
$$C = \lim_{T \to \infty} \max_{P(x)} \frac{1}{T} \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy\,.$$
It is obvious in this form that R and C are independent of the coordinate
system since the numerator and denominator in $\log \frac{P(x, y)}{P(x)P(y)}$ will be multi-
plied by the same factors when x and y are transformed in any one to one
way. This integral expression for C is more general than H(x) − H_y(x).
Properly interpreted (see Appendix 7) it will always exist while H(x) − H_y(x)
may assume an indeterminate form ∞ − ∞ in some cases. This occurs, for
example, if x is limited to a surface of fewer dimensions than n in its n dimen-
sional approximation.
If the logarithmic base used in computing H(x) and Hy(x) is two then C
is the maximum number of binary digits that can be sent per second over the
channel with arbitrarily small equivocation, just as in the discrete case.
This can be seen physically by dividing the space of signals into a large num-
ber of small cells, sufficiently small so that the probability density Px(y)
of signal x being perturbed to point y is substantially constant over a cell
(either of x or y). If the cells are considered as distinct points the situation
is essentially the same as a discrete channel and the proofs used there will
apply. But it is clear physically that this quantizing of the volume into
individual points cannot in any practical situation alter the final answer
significantly, provided the regions are sufficiently small. Thus the capacity
will be the limit of the capacities for the discrete subdivisions and this is
just the continuous capacity defined above.
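For a single sample of a gaussian input with independent gaussian noise added, the integral expression above can be evaluated numerically and compared with its closed form. The following sketch is an added illustration (the grid and powers are arbitrary choices of mine, not the paper's):

```python
import numpy as np

P, N = 4.0, 1.0                          # signal power and noise power for one sample
g = np.linspace(-20.0, 20.0, 1201)
dx = g[1] - g[0]
x, y = np.meshgrid(g, g, indexing="ij")

def log_gauss(u, var):
    return -u ** 2 / (2 * var) - 0.5 * np.log(2 * np.pi * var)

log_Pxy = log_gauss(x, P) + log_gauss(y - x, N)   # joint density of input and output (log)
log_Px = log_gauss(x, P)                          # input density: gaussian of power P
log_Py = log_gauss(y, P + N)                      # output density: gaussian of power P + N
Pxy = np.exp(log_Pxy)

print(np.sum(Pxy * (log_Pxy - log_Px - log_Py)) * dx * dx)   # ≈ 0.805 nats per sample
print(0.5 * np.log((P + N) / N))                              # (1/2) log (P+N)/N ≈ 0.805
```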
On the mathematical side it can be shown first (see Appendix 7) that if u
is the message, x is the signal, y is the received signal (perturbed by noise)
and v the recovered message then
$$H(x) - H_y(x) \ge H(u) - H_v(u)$$

regardless of what operations are performed on u to obtain x or on y to obtain


v. Thus no matter how we encode the binary digits to obtain the signal, or
how we decode the received signal to recover the message, the discrete rate
for the binary digits does not exceed the channel capacity we have defined.
On the other hand, it is possible under very general conditions to find a
coding system for transmitting binary digits at the rate C with as small an
equivocation or frequency of errors as desired. This is true, for example, if,
when we take a finite dimensional approximating space for the signal func-
tions, P(x, y) is continuous in both x and y except at a set of points of prob-
ability zero.
An important special case occurs when the noise is added to the signal
and is independent of it (in the probability sense). Then P_x(y) is a function
only of the difference n = (y − x),
$$P_x(y) = Q(y - x)\,,$$
and we can assign a definite entropy to the noise (independent of the sta-
tistics of the signal), namely the entropy of the distribution Q(n). This
entropy will be denoted by H(n).
Theorem 16: If the signal and noise are independent and the received
signal is the sum of the transmitted signal and the noise then the rate of
transmission is
$$R = H(y) - H(n)\,,$$
i.e., the entropy of the received signal less the entropy of the noise. The
channel capacity is
$$C = \max_{P(x)} \left[H(y) - H(n)\right].$$
We have, since y = x + n:
$$H(x, y) = H(x, n)\,.$$
Expanding the left side and using the fact that x and n are independent
$$H(y) + H_y(x) = H(x) + H(n)\,.$$
Hence
$$R = H(x) - H_y(x) = H(y) - H(n)\,.$$

Since H(n) is independent of P(x), maximizing R requires maximizing


H(y), the entropy of the received signal. If there are certain constraints on
the ensemble of transmitted signals, the entropy of the received signal must
be maximized subject to these constraints.

24. CHANNEL CAPACITY WITH AN AVERAGE POWER LIMITATION

A simple application of Theorem 16 is the case where the noise is a white


thermal noise and the transmitted signals are limited to a certain average
power P. Then the received signals have an average power P + N where
N is the average noise power. The maximum entropy for the received sig-
nals occurs when they also form a white noise ensemble since this is the
greatest possible entropy for a power P + N and can be obtained by a
suitable choice of the ensemble of transmitted signals, namely if they form a
white noise ensemble of power P. The entropy (per second) of the re-
ceived ensemble is then
$$H(y) = W \log 2\pi e (P + N)\,,$$
and the noise entropy is
$$H(n) = W \log 2\pi e N\,.$$
The channel capacity is
$$C = H(y) - H(n) = W \log \frac{P + N}{N}\,.$$


Summarizing we have the following:
Theorem 17: The capacity of a channel of band W perturbed by white

thermal noise of power N when the average transmitter power is P is given by


$$C = W \log \frac{P + N}{N}\,.$$
This means of course that by sufficiently involved encoding systems we
can transmit binary digits at the rate $W \log_2 \frac{P + N}{N}$ bits per second, with
arbitrarily small frequency of errors. It is not possible to transmit at a
higher rate by any encoding system without a definite positive frequency of
errors.
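A direct numerical reading of Theorem 17 (an added example; the bandwidth and powers are arbitrary choices, not values from the paper):

```python
import numpy as np

def capacity_bits_per_second(W, P, N):
    """C = W log2((P + N)/N), the capacity of Theorem 17 in bits per second."""
    return W * np.log2((P + N) / N)

# A 3000-cycle band with a 30 db signal-to-noise ratio (P/N = 1000):
print(capacity_bits_per_second(W=3000.0, P=1000.0, N=1.0))   # ≈ 29,900 bits per second
```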
To approximate this limiting rate of transmission the transmitted signals
must approximate, in statistical properties, a white noise.⁶ A system which
approaches the ideal rate may be described as follows: Let M = 2^s samples
of white noise be constructed each of duration T. These are assigned
binary numbers from 0 to (M - 1). At the transmitter the message se-
quences are broken up into groups of s and for each group the corresponding
noise sample is transmitted as the signal. At the receiver the M samples are
known and the actual received signal (perturbed by noise) is compared with
each of them. The sample which has the least R.M.S. discrepancy from the
received signal is chosen as the transmitted signal and the corresponding
binary number reconstructed. This process amounts to choosing the most
probable (a posteriori) signal. The number M of noise samples used will
depend on the tolerable frequency ε of errors, but for almost all selections of
samples we have
$$\lim_{\epsilon \to 0} \lim_{T \to \infty} \frac{\log M(\epsilon, T)}{T} = W \log \frac{P + N}{N}\,,$$
so that no matter how small ε is chosen, we can, by taking T sufficiently
large, transmit as near as we wish to
$$TW \log \frac{P + N}{N}$$
binary digits in the time T.
Formulas similar to $C = W \log \frac{P + N}{N}$ for the white noise case have
been developed independently by several other writers, although with some-
what different interpretations. We may mention the work of N. Wiener,⁷
W. G. Tuller,⁸ and H. Sullivan in this connection.
In the case of an arbitrary perturbing noise (not necessarily white thermal
noise) it does not appear that the maximizing problem involved in deter-
⁶ This and other properties of the white noise case are discussed from the geometrical
point of view in "Communication in the Presence of Noise," loc. cit.
⁷ "Cybernetics," loc. cit.
⁸ Sc.D. thesis, Department of Electrical Engineering, M.I.T., 1948.

mining the channel capacity C can be solved explicitly. However, upper


and lower bounds can be set for C in terms of the average noise power N
and the noise entropy power N_1. These bounds are sufficiently close to-
gether in most practical cases to furnish a satisfactory solution to the
problem.
Theorem 18: The capacity of a channel of band W perturbed by an arbi-
trary noise is bounded by the inequalities

$$W \log \frac{P + N_1}{N_1} \le C \le W \log \frac{P + N}{N_1}$$

where
P = average transmitter power
N = average noise power
N_1 = entropy power of the noise.
Here again the average power of the perturbed signals will be P + N.
The maximum entropy for this power would occur if the received signal
were white noise and would be W log 2πe(P + N). It may not be possible
to achieve this; i.e. there may not be any ensemble of transmitted signals
which, added to the perturbing noise, produce a white thermal noise at the
receiver, but at least this sets an upper bound to H(y). We have, therefore
$$C = \max H(y) - H(n) \le W \log 2\pi e (P + N) - W \log 2\pi e N_1\,.$$
This is the upper limit given in the theorem. The lower limit can be ob-
tained by considering the rate if we make the transmitted signal a white
noise, of power P. In this case the entropy power of the received signal
must be at least as great as that of a white noise of power P + N_1 since we
have shown in a previous theorem that the entropy power of the sum of two
ensembles is greater than or equal to the sum of the individual entropy
powers. Hence
$$\max H(y) \ge W \log 2\pi e (P + N_1)$$
and
$$C \ge W \log 2\pi e (P + N_1) - W \log 2\pi e N_1 = W \log \frac{P + N_1}{N_1}\,.$$

As P increases, the upper and lower bounds approach each other, so we


have as an asymptotic rate
$$W \log \frac{P + N}{N_1}\,.$$

If the noise is itself white, N = N_1, and the result reduces to the formula
proved previously:
$$C = W \log \left(1 + \frac{P}{N}\right).$$
If the noise is Gaussian but with a spectrum which is not necessarily flat,
N_1 is the geometric mean of the noise power over the various frequencies in
the band W. Thus
$$N_1 = \exp \frac{1}{W} \int_W \log N(f)\, df$$
where N(f) is the noise power at frequency f.
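Combining Theorem 18 with this geometric-mean expression for N_1 gives a direct numerical recipe for gaussian noise of arbitrary spectrum. The sketch below is an added illustration; the noise spectrum and powers are arbitrary choices of mine:

```python
import numpy as np

def capacity_bounds(P, noise_spectrum, W, n=10_000):
    """Theorem 18 bounds in bits per second for gaussian noise with power density
    noise_spectrum(f) over the band 0..W; N1 is the geometric mean of the spectrum."""
    f = (np.arange(n) + 0.5) * (W / n)
    Nf = noise_spectrum(f)
    N = np.mean(Nf)                        # average noise power
    N1 = np.exp(np.mean(np.log(Nf)))       # noise entropy power
    return W * np.log2((P + N1) / N1), W * np.log2((P + N) / N1)

# Noise power rising linearly across a 1000-cycle band, transmitter power P = 10:
print(capacity_bounds(P=10.0, noise_spectrum=lambda f: 0.5 + f / 1000.0, W=1000.0))
```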


Theorem 19: If we set the capacity for a given transmitter power P
equal to
$$C = W \log \frac{P + N - \eta}{N_1}$$
then η is monotonic decreasing as P increases and approaches 0 as a limit.
Suppose that for a given power P_1 the channel capacity is
$$W \log \frac{P_1 + N - \eta_1}{N_1}\,.$$
This means that the best signal distribution, say p(x), when added to the
noise distribution q(x), gives a received distribution r(y) whose entropy
power is (P_1 + N − η_1). Let us increase the power to P_1 + ΔP by adding
a white noise of power ΔP to the signal. The entropy of the received signal
is now at least
$$H(y) = W \log 2\pi e (P_1 + N - \eta_1 + \Delta P)$$
by application of the theorem on the minimum entropy power of a sum.
Hence, since we can attain the H indicated, the entropy of the maximizing
distribution must be at least as great and η must be monotonic decreasing.
To show that η → 0 as P → ∞ consider a signal which is a white noise with
a large P. Whatever the perturbing noise, the received signal will be
approximately a white noise, if P is sufficiently large, in the sense of having
an entropy power approaching P + N.

25. THE CHANNEL CAPACITY WITH A PEAK POWER LIMITATION

In some applications the transmitter is limited not by the average power


output but by the peak instantaneous power. The problem of calculating
the channel capacity is then that of maximizing (by variation of the ensemble
of transmitted symbols)
H(y) - H(n)

subject to the constraint that all the functions f(t) in the ensemble be less
than or equal to √S, say, for all t. A constraint of this type does not work
out as well mathematically as the average power limitation. The most we
have obtained for this case is a lower bound valid for all S/N, an "asymptotic"
upper bound (valid for large S/N) and an asymptotic value of C for S/N small.
Theorem 20: The channel capacity C for a band W perturbed by white
thermal noise of power N is bounded by
$$C \ge W \log \frac{2}{\pi e^3} \frac{S}{N}\,,$$
where S is the peak allowed transmitter power. For sufficiently large S/N
$$C \le W \log \frac{\frac{2}{\pi e} S + N}{N} (1 + \epsilon)$$
where ε is arbitrarily small. As S/N → 0 (and provided the band W starts
at 0)
$$C \sim W \log \left(1 + \frac{S}{N}\right).$$
We wish to maximize the entropy of the received signal. If S/N is large
this will occur very nearly when we maximize the entropy of the trans-
mitted ensemble.
The asymptotic upper bound is obtained by relaxing the conditions on
the ensemble. Let us suppose that the power is limited to S not at every
instant of time, but only at the sample points. The maximum entropy of
the transmitted ensemble under these weakened conditions is certainly
greater than or equal to that under the original conditions. This altered
problem can be solved easily. The maximum entropy occurs if the different

vS
samples are independent and have a distribution function which is constant
from - to + 0. The entropy can be calculated as
II" log -IS.

The received signal will then have an entropy less than

Il ' log (4S + 21reN){1 + f)



with ε → 0 as S/N → ∞ and the channel capacity is obtained by subtracting
the entropy of the white noise, W log 2πeN:
$$W \log (4S + 2\pi e N)(1 + \epsilon) - W \log (2\pi e N) = W \log \frac{\frac{2}{\pi e} S + N}{N} (1 + \epsilon)\,.$$
This is the desired upper bound to the channel capacity.
To obtain a lower bound consider the same ensemble of functions. Let
these functions be passed through an ideal filter with a triangular transfer
characteristic. The gain is to be unity at frequency 0 and decline linearly
down to gain 0 at frequency W. We first show that the output functions
of the filter have a peak power limitation S at all times (not just the sample
points). First we note that a pulse $\frac{\sin 2\pi W t}{2\pi W t}$ going into the filter produces
$$\frac{1}{2} \frac{\sin^2 \pi W t}{(\pi W t)^2}$$
in the output. This function is never negative. The input function (in
the general case) can be thought of as the sum of a series of shifted functions
$$a\, \frac{\sin 2\pi W t}{2\pi W t}$$
where a, the amplitude of the sample, is not greater than √S. Hence the
output is the sum of shifted functions of the non-negative form above with
the same coefficients. These functions being non-negative, the greatest
positive value for any t is obtained when all the coefficients a have their
maximum positive values, i.e. √S. In this case the input function was a
constant of amplitude √S and since the filter has unit gain for D.C., the
output is the same. Hence the output ensemble has a peak power S.
The entropy of the output ensemble can be calculated from that of the
input ensemble by using the theorem dealing with such a situation. The
output entropy is equal to the input entropy plus the geometrical mean
gain of the filter;

$$\int_0^W \log G^2\, df = 2\int_0^W \log \frac{W - f}{W}\, df = -2W\,.$$
Hence the output entropy is
$$W \log 4S - 2W = W \log \frac{4S}{e^2}$$
and the channel capacity is greater than
$$W \log \frac{2}{\pi e^3} \frac{S}{N}\,.$$
We now wish to show that, for small S/N (peak signal power over average
white noise power), the channel capacity is approximately
$$C = W \log \left(1 + \frac{S}{N}\right).$$
More precisely $C \big/ W \log (1 + S/N) \to 1$ as $S/N \to 0$. Since the average signal
power P is less than or equal to the peak S, it follows that for all S/N
$$C \le W \log \left(1 + \frac{P}{N}\right) \le W \log \left(1 + \frac{S}{N}\right).$$
Therefore, if we can find an ensemble of functions such that they correspond
to a rate nearly $W \log (1 + S/N)$ and are limited to band W and peak S the
result will be proved. Consider the ensemble of functions of the following
type. A series of t samples have the same value, either $+\sqrt{S}$ or $-\sqrt{S}$,
then the next t samples have the same value, etc. The value for a series
is chosen at random, probability 1/2 for $+\sqrt{S}$ and 1/2 for $-\sqrt{S}$. If this
ensemble be passed through a filter with triangular gain characteristic (unit
gain at D.C.), the output is peak limited to S. Furthermore the average
power is nearly S and can be made to approach this by taking t sufficiently
large. The entropy of the sum of this and the thermal noise can be found
by applying the theorem on the sum of a noise and a small signal. This
theorem will apply if
$$\sqrt{t}\, \frac{S}{N}$$
is sufficiently small. This can be insured by taking S/N small enough (after
t is chosen). The entropy power will be S + N to as close an approximation
as desired, and hence the rate of transmission as near as we wish to
$$W \log \left(\frac{S + N}{N}\right).$$

PART V: THE RATE FOR A CONTINUOUS SOURCE


26. FIDELITY EVALUATION FUNCTIONS

In the case of a discrete source of information we were able to determine a


definite rate of generating information, namely the entropy of the under-
lying stochastic process. With a continuous source the situation is con-
siderably more involved. In the first place a continuously variable quantity
can assume an infinite number of values and requires, therefore, an infinite
number of binary digits for exact specification. This means that to transmit
the output of a continuous source with exact recovery at the receiving point
requires, in general, a channel of infinite capacity (in bits per second).
Since, ordinarily, channels have a certain amount of noise, and therefore a
finite capacity, exact transmission is impossible.
This, however, evades the real issue. Practically, we are not interested
in exact transmission when we have a continuous source, but only in trans-
mission to within a certain tolerance. The question is, can we assign a
definite rate to a continuous source when we require only a certain fidelity
of recovery, measured in a suitable way. Of course, as the fidelity require-
ments are increased the rate will increase. It will be shown that we can, in
very general cases, define such a rate, having the property that it is possible,
by properly encoding the information, to transmit it over a channel whose
capacity is equal to the rate in question, and satisfy the fidelity requirements.
A channel of smaller capacity is insufficient.
It is first necessary to give a general mathematical formulation of the idea
of fidelity of transmission. Consider the set of messages of a long duration,
say T seconds. The source is described by giving the probability density,
in the associated space, that the source will select the message in question
P(x). A given communication system is described (from the external point
of view) by giving the conditional probability P_x(y) that if message x is
produced by the source the recovered message at the receiving point will
be y. The system as a whole (including source and transmission system)
is described by the probability function P(x, y) of having message x and
final output y. If this function is known, the complete characteristics of
the system from the point of view of fidelity are known. Any evaluation
of fidelity must correspond mathematically to an operation applied to
P(x, y). This operation must at least have the properties of a simple order-
ing of systems; i.e. it must be possible to say of two systems represented by
P_1(x, y) and P_2(x, y) that, according to our fidelity criterion, either (1) the
first has higher fidelity, (2) the second has higher fidelity, or (3) they have

equal fidelity. This means that a criterion of fidelity can be represented by
a numerically valued function:
$$v(P(x, y))$$
whose argument ranges over possible probability functions P(x, y).
We will now show that under very general and reasonable assumptions
the function v(P(x, y)) can be written in a seemingly much more specialized
form, namely as an average of a function p(x, y) over the set of possible values
of x and y:
$$v(P(x, y)) = \iint P(x, y)\, p(x, y)\, dx\, dy\,.$$
To obtain this we need only assume (1) that the source and system are
ergodic so that a very long sample will be, with probability nearly 1, typical
of the ensemble, and (2) that the evaluation is "reasonable" in the sense
that it is possible, by observing a typical input and output x_1 and y_1, to
form a tentative evaluation on the basis of these samples; and if these
samples are increased in duration the tentative evaluation will, with proba-
bility 1, approach the exact evaluation based on a full knowledge of P(x, y).
Let the tentative evaluation be p(x, y). Then the function p(x, y) ap-
proaches (as T → ∞) a constant for almost all (x, y) which are in the high
probability region corresponding to the system:
$$p(x, y) \to v(P(x, y))$$
and we may also write
$$p(x, y) \to \iint P(x, y)\, p(x, y)\, dx\, dy$$
since
$$\iint P(x, y)\, dx\, dy = 1\,.$$

This establishes the desired result.


The function p(x, y) has the general nature of a "distance" between x
and y.9 It measures how bad it is (according to our fidelity criterion) to
receive y when x is transmitted. The general result given above can be
restated as follows: Any reasonable evaluation can be represented as an
average of a distance function over the set of messages and recovered mes-
sages x and y weighted according to the probability P(x, y) of getting the
pair in question, provided the duration T of the messages be taken suffi-
ciently large.
⁹ It is not a "metric" in the strict sense, however, since in general it does not satisfy
either p(x, y) = p(y, x) or p(x, y) + p(y, z) ≥ p(x, z).

The following are simple examples of evaluation functions:


1. R.M.S. criterion.
$$v = \overline{(x(t) - y(t))^2}$$
In this very commonly used criterion of fidelity the distance function
p(x, y) is (apart from a constant factor) the square of the ordinary
euclidean distance between the points x and y in the associated function
space:
$$p(x, y) = \frac{1}{T} \int_0^T [x(t) - y(t)]^2\, dt\,.$$
(A small numerical sketch of this and the absolute error criterion, added
for illustration, follows these examples.)

2. Frequency weighted R.M.S. criterion. More generally one can apply


different weights to the different frequency components before using an
R.M.S. measure of fidelity. This is equivalent to passing the difference
x(t) - y(t) through a shaping filter and then determining the average
power in the output. Thus let
$$e(t) = x(t) - y(t)$$
and
$$f(t) = \int_{-\infty}^{\infty} e(\tau)\, k(t - \tau)\, d\tau$$
then
$$p(x, y) = \frac{1}{T} \int_0^T f(t)^2\, dt\,.$$

3. Absolute error criterion.
$$p(x, y) = \frac{1}{T} \int_0^T |x(t) - y(t)|\, dt$$

4. The structure of the ear and brain determine implicitly an evaluation, or


rather a number of evaluations, appropriate in the case of speech or music
transmission. There is, for example, an "intelligibility" criterion in
which p(x, y) is equal to the relative frequency of incorrectly interpreted
words when message x(t) is received as y(t). Although we cannot give
an explicit representation of p(x, y) in these cases it could, in principle,
be determined by sufficient experimentation. Some of its properties
follow from well-known experimental results in hearing, e.g., the ear is
relatively insensitive to phase and the sensitivity to amplitude and fre-
quency is roughly logarithmic.
5. The discrete case can be considered as a specialization in which we have

tacitly assumed an evaluation based on the frequency of errors. The


function p(x, y) is then defined as the number of symbols in the sequence
y differing from the corresponding symbols in x divided by the total num-
ber of symbols in x.
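As a concrete reading of examples 1 and 3 above, the distance functions can be computed from sampled message and recovered-message waveforms. The sketch is an added illustration, not part of the paper; the signals are arbitrary choices of mine.

```python
import numpy as np

def rms_distance(x, y):
    """(1/T) ∫ [x(t) - y(t)]^2 dt for uniformly spaced samples over the duration."""
    return np.mean((x - y) ** 2)

def abs_distance(x, y):
    """(1/T) ∫ |x(t) - y(t)| dt for uniformly spaced samples over the duration."""
    return np.mean(np.abs(x - y))

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
x = np.sin(2 * np.pi * 5 * t)                                   # original message
y = x + 0.1 * np.random.default_rng(1).normal(size=t.size)      # noisy recovered message
print(rms_distance(x, y), abs_distance(x, y))                   # ≈ 0.01 and ≈ 0.08
```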

27. THE RATE FOR A SOURCE RELATIVE TO A FIDELITY EVALUATION

We are now in a position to define a rate of generating information for a


continuous source. We are given P(x) for the source and an evaluation v
determined by a distance function p(x, y) which will be assumed continuous
in both x and y. With a particular system P(x, y) the quality is measured by
$$v = \iint p(x, y)\, P(x, y)\, dx\, dy\,.$$
Furthermore the rate of flow of binary digits corresponding to P(x, y) is
$$R = \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy\,.$$
We define the rate R_1 of generating information for a given quality v_1 of
reproduction to be the minimum of R when we keep v fixed at v_1 and vary
P_x(y). That is:
$$R_1 = \min_{P_x(y)} \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy$$
subject to the constraint:
$$v_1 = \iint P(x, y)\, p(x, y)\, dx\, dy\,.$$

This means that we consider, in effect, all the communication systems that
might be used and that transmit with the required fidelity. The rate of
transmission in bits per second is calculated for each one and we choose that
having the least rate. This latter rate is the rate we assign the source for
the fidelity in question.
The justification of this definition lies in the following result:
Theorem 21: If a source has a rate R_1 for a valuation v_1 it is possible to
encode the output of the source and transmit it over a channel of capacity C
with fidelity as near v_1 as desired provided R_1 ≤ C. This is not possible
if R_1 > C.
The last statement in the theorem follows immediately from the definition
of R_1 and previous results. If it were not true we could transmit more than
C bits per second over a channel of capacity C. The first part of the theorem
is proved by a method analogous to that used for Theorem 11. We may, in
the first place, divide the (x, y) space into a large number of small cells and

represent the situation as a discrete case. This will not change the evalua-
tion function by more than an arbitrarily small amount (when the cells are
very small) because of the continuity assumed for p(x, y). Suppose that
P_1(x, y) is the particular system which minimizes the rate and gives R_1. We
choose from the high probability y's a set at random containing
$$2^{(R_1 + \epsilon)T}$$
members where ε → 0 as T → ∞. With large T each chosen point will be
connected by a high probability line (as in Fig. 10) to a set of x's. A calcu-
lation similar to that used in proving Theorem 11 shows that with large T
almost all x's are covered by the fans from the chosen y points for almost
all choices of the y's. The communication system to be used operates as
follows: The selected points are assigned binary numbers. When a message
x is originated it will (with probability approaching 1 as T → ∞) lie within
one at least of the fans. The corresponding binary number is transmitted
(or one of them chosen arbitrarily if there are several) over the channel by
suitable coding means to give a small probability of error. Since R_1 ≤ C
this is possible. At the receiving point the corresponding y is reconstructed
and used as the recovered message.
The evaluation v_1' for this system can be made arbitrarily close to v_1 by
taking T sufficiently large. This is due to the fact that for each long sample
of message x(t) and recovered message y(t) the evaluation approaches v_1
(with probability 1).
It is interesting to note that, in this system, the noise in the recovered
message is actually produced by a kind of general quantizing at the trans-
mitter and is not produced by the noise in the channel. It is more or less
analogous to the quantizing noise in P.C.M.

28. THE CALCULATION OF RATES

The definition of the rate is similar in many respects to the definition of


channel capacity. In the former

$$R = \min_{P_x(y)} \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy$$
with P(x) and $v_1 = \iint P(x, y)\, p(x, y)\, dx\, dy$ fixed. In the latter
$$C = \max_{P(x)} \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy$$
with P_x(y) fixed and possibly one or more other constraints (e.g., an average
power limitation) of the form $K = \iint P(x, y)\, \lambda(x, y)\, dx\, dy$.

A partial solution of the general maximizing problem for determining the


rate of a source can be given. Using Lagrange's method we consider
$$\iint \left[ P(x, y) \log \frac{P(x, y)}{P(x)P(y)} + \lambda\, P(x, y)\, p(x, y) + \nu(x)\, P(x, y) \right] dx\, dy\,.$$
The variational equation (when we take the first variation on P(x, y))
leads to
$$P_y(x) = B(x)\, e^{-\lambda p(x, y)}$$
where λ is determined to give the required fidelity and B(x) is chosen to
satisfy
$$\int B(x)\, e^{-\lambda p(x, y)}\, dx = 1\,.$$
This shows that, with best encoding, the conditional probability of a cer-
tain cause for various received y, P_y(x), will decline exponentially with the
distance function p(x, y) between the x and y in question.
In the special case where the distance function p(x, y) depends only on the
(vector) difference between x and y,
$$p(x, y) = p(x - y)$$
we have
$$\int B(x)\, e^{-\lambda p(x - y)}\, dx = 1\,.$$
Hence B(x) is constant, say α, and
$$P_y(x) = \alpha\, e^{-\lambda p(x - y)}\,.$$

Unfortunately these formal solutions are difficult to evaluate in particular


cases and seem to be of little value. In fact, the actual calculation of rates
has been carried out in only a few very simple cases.
If the distance function p(x, y) is the mean square discrepancy between
x and y and the message ensemble is white noise, the rate can be determined.
In that case we have
$$R = \min [H(x) - H_y(x)] = H(x) - \max H_y(x)$$
with $N = \overline{(x - y)^2}$. But the max $H_y(x)$ occurs when y − x is a white noise,
and is equal to $W_1 \log 2\pi e N$ where $W_1$ is the bandwidth of the message en-
semble. Therefore
$$R = W_1 \log 2\pi e Q - W_1 \log 2\pi e N = W_1 \log \frac{Q}{N}$$
where Q is the average message power. This proves the following:

Theorem 22: The rate for a white noise source of power Q and band WI
relative to an R.M.S. measure of fidelity is

R = WI log Q
N
where N is the allowed mean square error between original and recovered
messages.
More generally with any message source we can obtain inequalities bound-
ing the rate relative to a mean square error criterion.
Theorem 23: The rate for any source of band WI is bounded by

WI log ~ ::; R :::;; WI log ~

where Q is the average power of the source, Q1 its entropy power and N the
allowed mean square error.
The lower bound follows from the fact that the max $H_y(x)$ for a given
$\overline{(x - y)^2} = N$ occurs in the white noise case. The upper bound results if we
place the points (used in the proof of Theorem 21) not in the best way but
at random in a sphere of radius $\sqrt{Q - N}$.
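To see the spread between the two bounds of Theorem 23, the sketch below works a case with an assumed, non-flat power spectrum (all figures are my own assumptions). For a Gaussian source the average power $Q$ is the arithmetic mean of the spectrum while the entropy power $Q_1$ works out to its geometric mean, so $Q_1 \le Q$ and the bounds bracket the rate.

```python
import numpy as np

# Sketch of Theorem 23's bounds for a Gaussian source with an assumed spectrum
# S(f) over the band W1.  Q is the arithmetic mean of S, Q1 its geometric mean.

W1 = 5000.0                                   # band, cycles per second (assumed)
f = np.linspace(0.0, W1, 10001)
S = 2.0 + 1.5 * np.cos(np.pi * f / W1)        # assumed power spectrum
N = 0.5                                       # allowed mean square error (assumed)

Q = S.mean()                                  # average power
Q1 = np.exp(np.log(S).mean())                 # entropy power (geometric mean)

lower = W1 * np.log2(Q1 / N)                  # W1 log(Q1/N), bits per second
upper = W1 * np.log2(Q / N)                   # W1 log(Q /N), bits per second
print(Q1, Q)                                  # Q1 <= Q always
print(lower, upper)
```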

ACKNOWLEDGMENTS

The writer is indebted to his colleagues at the Laboratories, particularly
to Dr. H. W. Bode, Dr. J. R. Pierce, Dr. B. McMillan, and Dr. B. M. Oliver
for many helpful suggestions and criticisms during the course of this work.
Credit should also be given to Professor N. Wiener, whose elegant solution
of the problems of filtering and prediction of stationary ensembles has con-
siderably influenced the writer's thinking in this field.
APPENDIX 5
Let $S_1$ be any measurable subset of the $g$ ensemble, and $S_2$ the subset of
the $f$ ensemble which gives $S_1$ under the operation $T$. Then

$$S_1 = T S_2.$$

Let $H^\lambda$ be the operator which shifts all functions in a set by the time $\lambda$.
Then

$$H^\lambda S_1 = H^\lambda T S_2 = T H^\lambda S_2$$

since $T$ is invariant and therefore commutes with $H^\lambda$. Hence if $m[S]$ is the
probability measure of the set $S$

$$m[H^\lambda S_1] = m[T H^\lambda S_2] = m[H^\lambda S_2] = m[S_2] = m[S_1]$$

where the second equality is by definition of measure in the $g$ space, the
third since the $f$ ensemble is stationary, and the last by definition of $g$ meas-
ure again.
To prove that the ergodic property is preserved under invariant operations,
let $S_1$ be a subset of the $g$ ensemble which is invariant under $H^\lambda$, and let $S_2$
be the set of all functions $f$ which transform into $S_1$. Then

$$H^\lambda S_1 = H^\lambda T S_2 = T H^\lambda S_2 = S_1$$

so that $H^\lambda S_2$ is included in $S_2$ for all $\lambda$. Now, since

$$m[H^\lambda S_2] = m[S_1]$$

this implies

$$H^\lambda S_2 = S_2$$

for all $\lambda$ with $m[S_2] \neq 0, 1$. This contradiction shows that $S_1$ does not exist.
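The measure-preservation argument can be made concrete with a small simulation. The sketch below (an assumed setup, not part of the original appendix) applies an operator that commutes with time shift, here a short moving-average filter, to a stationary input ensemble, and checks empirically that the output ensemble's statistics do not depend on the time origin.

```python
import numpy as np

# Empirical illustration (assumed setup): a time-invariant operator T
# (a moving-average filter) applied to a stationary input ensemble gives
# an output ensemble whose statistics are the same at every instant.

rng = np.random.default_rng(0)
n_functions, n_samples = 2000, 200
f = rng.normal(size=(n_functions, n_samples))   # stationary (white) input ensemble

kernel = np.array([0.5, 0.3, 0.2])              # the invariant operator T: a filter
g = np.array([np.convolve(fi, kernel, mode="valid") for fi in f])

# Ensemble statistics at two different instants should agree up to sampling error.
t1, t2 = 50, 150
print(g[:, t1].mean(), g[:, t2].mean())         # both near 0
print(g[:, t1].var(),  g[:, t2].var())          # both near sum(kernel**2) = 0.38
```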
APPENDIX 6
The upper bound, $N_3 \le N_1 + N_2$, is due to the fact that the maximum
possible entropy for a power $N_1 + N_2$ occurs when we have a white noise of
this power. In this case the entropy power is $N_1 + N_2$.
To obtain the lower bound, suppose we have two distributions in $n$ dimen-
sions $p(x_i)$ and $q(x_i)$ with entropy powers $N_1$ and $N_2$. What form should
$p$ and $q$ have to minimize the entropy power $N_3$ of their convolution $r(x_i)$:

$$r(x_i) = \int p(y_i)\, q(x_i - y_i)\, dy_i.$$

The entropy $H_3$ of $r$ is given by

$$H_3 = -\int r(x_i) \log r(x_i)\, dx_i.$$

We wish to minimize this subject to the constraints

$$H_1 = -\int p(x_i) \log p(x_i)\, dx_i$$
$$H_2 = -\int q(x_i) \log q(x_i)\, dx_i.$$

We consider then

$$U = -\int [r(x) \log r(x) + \lambda\, p(x) \log p(x) + \mu\, q(x) \log q(x)]\, dx$$

$$\delta U = -\int \big\{[1 + \log r(x)]\,\delta r(x) + \lambda[1 + \log p(x)]\,\delta p(x) + \mu[1 + \log q(x)]\,\delta q(x)\big\}\, dx.$$

If $p(x)$ is varied at a particular argument $x_i = s_i$, the variation in $r(x)$ is

$$\delta r(x_i) = q(x_i - s_i)$$

and

$$\delta U = -\int q(x_i - s_i) \log r(x_i)\, dx_i - \lambda \log p(s_i) = 0$$

and similarly when $q$ is varied. Hence the conditions for a minimum are

$$\int q(x_i - s_i) \log r(x_i)\, dx_i = -\lambda \log p(s_i)$$
$$\int p(x_i - s_i) \log r(x_i)\, dx_i = -\mu \log q(s_i).$$

If we multiply the first by $p(s_i)$ and the second by $q(s_i)$ and integrate with
respect to $s$ we obtain

$$H_3 = -\lambda H_1$$
$$H_3 = -\mu H_2$$

or solving for $\lambda$ and $\mu$ and replacing in the equations

$$H_1 \int q(x_i - s_i) \log r(x_i)\, dx_i = -H_3 \log p(s_i)$$
$$H_2 \int p(x_i - s_i) \log r(x_i)\, dx_i = -H_3 \log q(s_i).$$

Now suppose $p(x_i)$ and $q(x_i)$ are normal

$$p(x_i) = \frac{|A_{ij}|^{1/2}}{(2\pi)^{n/2}} \exp\left(-\tfrac{1}{2}\sum A_{ij}\, x_i x_j\right)$$

$$q(x_i) = \frac{|B_{ij}|^{1/2}}{(2\pi)^{n/2}} \exp\left(-\tfrac{1}{2}\sum B_{ij}\, x_i x_j\right).$$

Then $r(x_i)$ will also be normal with quadratic form $C_{ij}$. If the inverses of
these forms are $a_{ij}$, $b_{ij}$, $c_{ij}$ then

$$c_{ij} = a_{ij} + b_{ij}.$$

We wish to show that these functions satisfy the minimizing conditions if
and only if $a_{ij} = K b_{ij}$ and thus give the minimum $H_3$ under the constraints.
First we have

$$\log r(x_i) = \frac{n}{2}\log\frac{1}{2\pi} + \tfrac{1}{2}\log|C_{ij}| - \tfrac{1}{2}\sum C_{ij}\, x_i x_j.$$

This should equal

$$\frac{H_3}{H_1}\left[\frac{n}{2}\log\frac{1}{2\pi} + \tfrac{1}{2}\log|A_{ij}| - \tfrac{1}{2}\sum A_{ij}\, s_i s_j\right]$$

which requires $A_{ij} = \dfrac{H_1}{H_3}\, C_{ij}$.

In this case $A_{ij} = \dfrac{H_1}{H_2}\, B_{ij}$ and both equations reduce to identities.
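The bound $N_3 \ge N_1 + N_2$ can also be checked numerically for particular distributions. The sketch below (assumed example densities, in one dimension where the entropy power of a density with entropy $H$, natural logarithms, is $e^{2H}/2\pi e$) computes the three entropy powers for two non-Gaussian densities and their convolution.

```python
import numpy as np

# Numerical check of the entropy-power inequality N3 >= N1 + N2 in one
# dimension for two assumed (non-Gaussian) densities p and q.

dx = 0.01
x = np.arange(-30.0, 30.0, dx)

def entropy_power(density):
    density = density / (density.sum() * dx)            # normalize on the grid
    mask = density > 0
    H = -np.sum(density[mask] * np.log(density[mask])) * dx
    return np.exp(2 * H) / (2 * np.pi * np.e)

p = np.exp(-np.abs(x))                                   # Laplace-type density
q = np.where(np.abs(x) < 2.0, 1.0, 0.0)                  # uniform density on (-2, 2)

pn = p / (p.sum() * dx)
qn = q / (q.sum() * dx)
r = np.convolve(pn, qn, mode="same") * dx                # convolution r = p * q

N1, N2, N3 = entropy_power(p), entropy_power(q), entropy_power(r)
print(N1, N2, N3, N3 >= N1 + N2)                         # the inequality should hold
```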

APPENDIX 7
The following will indicate a more general and more rigorous approach to
the central definitions of communication theory. Consider a probability
measure space whose elements are ordered pairs (x, y). The variables x, y
are to be identified as the possible transmitted and received signals of some
long duration $T$. Let us call the set of all points whose $x$ belongs to a subset
$S_1$ of $x$ points the strip over $S_1$, and similarly the set whose $y$ belongs to $S_2$
the strip over $S_2$. We divide $x$ and $y$ into a collection of non-overlapping
measurable subsets $X_i$ and $Y_j$ and approximate to the rate of transmission $R$ by

$$R_1 = \frac{1}{T} \sum_{i, j} P(X_i, Y_j) \log \frac{P(X_i, Y_j)}{P(X_i)P(Y_j)}$$

where
$P(X_i)$ is the probability measure of the strip over $X_i$,
$P(Y_j)$ is the probability measure of the strip over $Y_j$,
$P(X_i, Y_j)$ is the probability measure of the intersection of the strips.

A further subdivision can never decrease $R_1$. For let $X_1$ be divided into
$X_1 = X_1' + X_1''$ and let

$$P(Y_1) = a \qquad P(X_1) = b + c$$
$$P(X_1') = b \qquad P(X_1', Y_1) = d$$
$$P(X_1'') = c \qquad P(X_1'', Y_1) = e$$

Then in the sum we have replaced (for the $X_1$, $Y_1$ intersection)

$$(d + e) \log \frac{d + e}{a(b + c)} \qquad \text{by} \qquad d \log \frac{d}{ab} + e \log \frac{e}{ac}.$$

It is easily shown that with the limitation we have on $b, c, d, e$,

$$\left[\frac{d + e}{b + c}\right]^{d + e} \le \frac{d^d e^e}{b^d c^e}$$

and consequently the sum is increased. Thus the various possible subdivi-
sions form a directed set, with $R$ monotonic increasing with refinement of
the subdivision. We may define $R$ unambiguously as the least upper bound
for the $R_1$ and write it

$$R = \frac{1}{T} \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\, dx\, dy.$$

This integral, understood in the above sense, includes both the continuous
and discrete cases and of course many others which cannot be represented
in either form. It is trivial in this formulation that if $x$ and $u$ are in one-to-
one correspondence, the rate from $u$ to $y$ is equal to that from $x$ to $y$. If $v$
is any function of $y$ (not necessarily with an inverse) then the rate from $x$ to
$y$ is greater than or equal to that from $x$ to $v$ since, in the calculation of the
approximations, the subdivisions of $y$ are essentially a finer subdivision of
those for $v$. More generally if $y$ and $v$ are related not functionally but
statistically, i.e., we have a probability measure space $(y, v)$, then $R(x, v) \le$
$R(x, y)$. This means that any operation applied to the received signal, even
though it involves statistical elements, does not increase $R$.
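The refinement argument above can be observed numerically. The sketch below (an assumed correlated-Gaussian joint density sampled on a fine grid, purely illustrative and omitting the $1/T$ factor) computes the approximation $R_1$ for a coarse partition and for successive refinements; the values never decrease as the partition is refined.

```python
import numpy as np

# Illustrative check (assumed joint density) that refining the partition of
# the (x, y) space never decreases the approximation R1 (the 1/T factor is omitted).

n = 256
x = np.linspace(-4.0, 4.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
rho = 0.8
density = np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2)))
P = density / density.sum()                       # joint probabilities on the fine grid

def R1(P, block):
    """Group the fine grid into block x block cells and compute the sum."""
    m = P.shape[0] // block
    Pc = P.reshape(m, block, m, block).sum(axis=(1, 3))
    Px = Pc.sum(axis=1, keepdims=True)
    Py = Pc.sum(axis=0, keepdims=True)
    mask = Pc > 0
    return float(np.sum(Pc[mask] * np.log2(Pc[mask] / (Px @ Py)[mask])))

for block in (64, 32, 16, 8, 4):                  # finer and finer partitions
    print(block, R1(P, block))                    # values increase toward the integral
```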
Another notion which should be defined precisely in an abstract formu-
lation of the theory is that of "dimension rate," that is the average number
of dimensions required per second to specify a member of an ensemble. In
the band limited case 2W numbers per second are sufficient. A general
definition can be framed as follows. Let $f_\alpha(t)$ be an ensemble of functions
and let $\rho_T[f_\alpha(t), f_\beta(t)]$ be a metric measuring the "distance" from $f_\alpha$ to $f_\beta$
over the time $T$ (for example the R.M.S. discrepancy over this interval).
Let $N(\varepsilon, \delta, T)$ be the least number of elements $f$ which can be chosen such
that all elements of the ensemble apart from a set of measure $\delta$ are within
the distance $\varepsilon$ of at least one of those chosen. Thus we are covering the
space to within $\varepsilon$ apart from a set of small measure $\delta$. We define the di-
mension rate $\lambda$ for the ensemble by the triple limit

$$\lambda = \lim_{\varepsilon \to 0}\, \lim_{\delta \to 0}\, \lim_{T \to \infty} \frac{\log N(\varepsilon, \delta, T)}{T \log \varepsilon}.$$

This is a generalization of the measure type definitions of dimension in
topology, and agrees with the intuitive dimension rate for simple ensembles
where the desired result is obvious.
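As a rough numerical companion to the band-limited remark above (all figures assumed, and using $|\log \varepsilon|$ in the denominator so that the ratio comes out positive), a function of band $W$ observed for time $T$ is fixed by about $2WT$ numbers; if each lies in an amplitude range $A$ and is located to accuracy $\varepsilon$, a crude count gives $N(\varepsilon, \delta, T) \approx (A/\varepsilon)^{2WT}$, so the ratio in the definition tends to $2W$.

```python
import math

# Crude counting heuristic (assumed figures): N ~ (A/eps)**(2*W*T), so
# log N / (T * |log eps|) approaches 2W as eps -> 0.

W, A, T = 100.0, 2.0, 10.0
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    logN = 2 * W * T * math.log(A / eps)
    print(eps, logN / (T * abs(math.log(eps))))   # approaches 2W = 200
```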
