A Mathematical Theory of Communication: Part III: Mathematical Preliminaries
By C. E. SHANNON
We shall have to deal in the continuous case with sets of functions and
ensembles of functions. A set of functions, as the name implies, is merely a
class or collection of functions, generally of one variable, time. It can be
specified by giving an explicit representation of the various functions in the
set, or implicitly by giving a property which functions in the set possess and
others do not. Some examples are:
1. The set of functions:
f_θ(t) = sin (t + θ).
Each particular value of θ determines a particular function in the set.
with the a_i normal and independent, all with the same standard deviation √N. This is a representation of "white" noise, band-limited to the band from 0 to W cycles per second and with average power N.
where the t_k are the points of the Poisson distribution. This ensemble
can be considered as a type of impulse or shot noise where all the impulses
are identical.
5. The set of English speech functions with the probability measure given
by the frequency of occurrence in ordinary use.
An ensemble of functions f_α(t) is stationary if the same ensemble results when all functions are shifted any fixed amount in time. The ensemble
f_θ(t) = sin (t + θ)
is stationary if θ is distributed uniformly from 0 to 2π. If we shift each function by t₁ we obtain
f_θ(t + t₁) = sin (t + t₁ + θ)
= sin (t + φ)
with φ distributed uniformly from 0 to 2π. Each function has changed
but the ensemble as a whole is invariant under the translation. The other
examples given above are also stationary.
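To make the invariance concrete, here is a small numerical sketch (not part of the original text; the variable names and sample sizes are ours) which draws phases θ uniformly on (0, 2π) and checks that the ensemble statistics of sin(t + θ) are unchanged by a time shift.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100_000)   # random phases, one per member

def ensemble_stats(t):
    """Ensemble mean and mean square of sin(t + theta) at a fixed time t."""
    values = np.sin(t + theta)
    return values.mean(), (values ** 2).mean()

t1, shift = 0.7, 2.3
# Shifting every member by `shift` leaves the statistics unchanged, because
# theta + shift (mod 2*pi) is again uniformly distributed on (0, 2*pi).
print(ensemble_stats(t1))          # approximately (0, 0.5)
print(ensemble_stats(t1 + shift))  # approximately (0, 0.5) as well
```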
An ensemble is ergodic if it is stationary, and there is no subset of the func-
tions in the set with a probability different from 0 and 1 which is stationary.
The ensemble
sin (t + θ)
is ergodic. No subset of these functions of probability ≠ 0, 1 is transformed
into itself under all time translations. On the other hand the ensemble
a sin (t + θ)
with a distributed normally and θ uniform is stationary but not ergodic.
The subset of these functions with a between 0 and 1 for example is
stationary.
Of the examples given, 3 and 4 are ergodic, and 5 may perhaps be con-
sidered so. If an ensemble is ergodic we may say roughly that each func-
tion in the set is typical of the ensemble. More precisely it is known that
with an ergodic ensemble an average of any statistic over the ensemble is
equal (with probability 1) to an average over all the time translations of a
particular function in the set.³ Roughly speaking, each function can be ex-
pected, as time progresses, to go through, with the proper frequency, all the
convolutions of any of the functions in the set.
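The following sketch (ours, with illustrative parameters) contrasts the two ensembles just discussed: for the ergodic ensemble sin(t + θ) the time average of f² along a single function approaches the ensemble value 1/2, while for a sin(t + θ) with a normal it approaches a²/2 and so depends on which function was drawn.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 2000.0, 400_000)     # a long stretch of time for time averages
theta = rng.uniform(0.0, 2.0 * np.pi)     # pick one function from each ensemble
a = rng.normal()

# Ergodic ensemble sin(t + theta): the time average of f^2 is 1/2 for every
# member, matching the ensemble average.
print(np.mean(np.sin(t + theta) ** 2))            # ~ 0.5

# Non-ergodic ensemble a*sin(t + theta): the time average of f^2 is a^2/2 and
# so differs from member to member; no single function is typical.
print(np.mean((a * np.sin(t + theta)) ** 2))      # ~ a**2 / 2
print(a ** 2 / 2)
```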
Just as we may perform various operations on numbers or functions to
obtain new numbers or functions, we can perform operations on ensembles
to obtain new ensembles. Suppose, for example, we have an ensemble of
functions f_α(t) and an operator T which gives for each function f_α(t) a result g_α(t):
g_α(t) = T f_α(t)
Probability measure is defined for the set g_α(t) by means of that for the set f_α(t). The probability of a certain subset of the g_α(t) functions is equal to that of the subset of the f_α(t) functions which produce members of the given subset of g functions under the operation T. Physically this corresponds to passing the ensemble through some device, for example, a filter, a rectifier or a modulator. The output functions of the device form the ensemble g_α(t).
A device or operator T will be called invariant if shifting the input merely
shifts the output, i.e., if
g_α(t) = T f_α(t)
implies
g_α(t + t₁) = T f_α(t + t₁)
for all f_α(t) and all t₁. It is easily shown (see appendix 1) that if T is in-
variant and the input ensemble is stationary then the output ensemble is
stationary. Likewise if the input is ergodic the output will also be ergodic.
A filter or a rectifier is invariant under all time translations. The opera-
tion of modulation is not since the carrier phase gives a certain time struc-
ture. However, modulation is invariant under all translations which are
multiples of the period of the carrier.
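As a discrete-time illustration (ours), an invariant operator can be modeled by convolution with a fixed kernel; shifting the input and then filtering gives the same result as filtering and then shifting, apart from edge effects of the finite record.

```python
import numpy as np

# A fixed convolution kernel plays the role of an invariant operator T
# (a simple moving-average "filter"); the names are illustrative only.
kernel = np.array([0.25, 0.5, 0.25])

def T(f):
    return np.convolve(f, kernel, mode="same")

t = np.arange(200)
f = np.sin(0.1 * t) + 0.3 * np.sin(0.37 * t)
shift = 5

shift_output = np.roll(T(f), shift)     # filter, then shift the output
filter_shifted = T(np.roll(f, shift))   # shift the input, then filter

# Away from the ends of the finite record the two agree exactly:
print(np.allclose(shift_output[20:-20], filter_shifted[20:-20]))   # True
```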
Wiener has pointed out the intimate relation between the invariance of
physical devices under time translations and Fourier theory.⁴ He has
³ This is the famous ergodic theorem, or rather one aspect of this theorem which was proved in somewhat different formulations by Birkhoff, von Neumann, and Koopman, and subsequently generalized by Wiener, Hopf, Hurewicz and others. The literature on ergodic theory is quite extensive and the reader is referred to the papers of these writers for precise and general formulations; e.g., E. Hopf, "Ergodentheorie," Ergebnisse der Mathematik und ihrer Grenzgebiete, Vol. 5; "On Causality, Statistics and Probability," Journal of Mathematics and Physics, Vol. XIII, No. 1, 1934; N. Wiener, "The Ergodic Theorem," Duke Mathematical Journal, Vol. 5, 1939.
⁴ Communication theory is heavily indebted to Wiener for much of its basic philosophy and theory. His classic NDRC report, "The Interpolation, Extrapolation, and Smoothing of Stationary Time Series," to appear soon in book form, contains the first clear-cut formulation of communication theory as a statistical problem, the study of operations on time series. This work, although chiefly concerned with the linear prediction and filtering problem, is an important collateral reference in connection with the present paper. We may also refer here to Wiener's forthcoming book "Cybernetics" dealing with the general problems of communication and control.
For a proof of this theorem and further discussion see the author's paper "Communication in the Presence of Noise," to be published in the Proceedings of the Institute of Radio Engineers.
and
p(x) = ∫ p(x, y) dy
p(y) = ∫ p(x, y) dx.
The entropies of continuous distributions have most (but not all) of the
properties of the discrete case. In particular we have the following:
with
σ² = ∫ p(x) x² dx    and    1 = ∫ p(x) dx
as constraints. This requires, by the calculus of variations, maximizing
∫ [-p(x) log p(x) + λ p(x) x² + μ p(x)] dx.
The maximizing p(x) is the Gaussian of standard deviation σ, and its entropy is
H(x) = log √(2π) σ + 1/2 = log √(2πe) σ.
p(x₁, ..., xₙ) = (|a_ij|^(1/2) / (2π)^(n/2)) exp (-1/2 Σ a_ij x_i x_j)
a = ∫₀^∞ p(x) x dx,
p(x) = (1/a) e^(-x/a)
and is equal to log ea.
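A quick numerical check (ours; it assumes natural logarithms and uses scipy's quadrature) of the two maximizing distributions above: the Gaussian of standard deviation σ has entropy log √(2πe) σ, and the exponential of mean a on the half line has entropy log ea.

```python
import numpy as np
from scipy import integrate

sigma, a = 1.7, 2.3

def entropy(p, lo, hi):
    """Differential entropy -integral p(x) log p(x) dx by quadrature (nats)."""
    value, _ = integrate.quad(lambda x: -p(x) * np.log(p(x)), lo, hi)
    return value

gauss = lambda x: np.exp(-x ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
expo = lambda x: np.exp(-x / a) / a

print(entropy(gauss, -20, 20), np.log(np.sqrt(2 * np.pi * np.e) * sigma))  # agree
print(entropy(expo, 1e-12, 60), np.log(np.e * a))                           # agree
```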
8. There is one important difference between the continuous and discrete
entropies. In the discrete case the entropy measures in an absolute
way the randomness of the chance variable. In the continuous case the
measurement is relative to the coordinate system. If we change coordinates the entropy will in general change. In fact if we change to coordinates y₁ ... yₙ the new entropy is given by
H(y) = H(x) - ∫ ... ∫ p(x₁, ..., xₙ) log J(x/y) dx₁ ... dxₙ
where J(x/y) is the Jacobian of the coordinate transformation.
Thus the new entropy is the old entropy less the expected logarithm of
the Jacobian. In the continuous case the entropy can be considered a
measure of randomness relative to an assumed standard, namely the coordinate system chosen with each small volume element dx₁ ... dxₙ given
equal weight. When we change the coordinate system the entropy in
the new system measures the randomness when equal volume elements
dy₁ ... dyₙ in the new system are given equal weight.
In spite of this dependence on the coordinate system the entropy
concept is as important in the continuous case as the discrete case. This
is due to the fact that the derived concepts of information rate and
channel capacity depend on the difference of two entropies and this
difference does not depend on the coordinate frame, each of the two terms
being changed by the same amount.
The entropy of a continuous distribution can be negative. The scale
of measurements sets an arbitrary zero corresponding to a uniform dis-
tribution over a unit volume. A distribution which is more confined than
this has less entropy, and the entropy will be negative. The rates and capacities will,
however, always be non-negative.
9. A particular case of changing coordinates is the linear transformation
y_i = Σ_j a_ij x_j.
In this case the Jacobian is simply the determinant |a_ij|⁻¹ and
H(y) = H(x) + log |a_ij|.
In the case of a rotation of coordinates (or any measure preserving trans-
formation) J = 1 and H(y) = H(x).
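A short closed-form check (ours) of this behavior: for a Gaussian x the entropies before and after a linear transformation y = Ax are known explicitly, and their difference is log |det A|, which vanishes for a rotation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.normal(size=(n, n))        # the matrix a_ij of the linear transformation
cov_x = np.eye(n)                  # take x Gaussian with unit covariance

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian with the given covariance (nats)."""
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * cov))

H_x = gaussian_entropy(cov_x)
H_y = gaussian_entropy(A @ cov_x @ A.T)   # y = A x has covariance A A^T

print(H_y - H_x)                            # equals log |det A| ...
print(np.log(abs(np.linalg.det(A))))        # ... the logarithm of the Jacobian
```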
H′ = -Lim_{n→∞} (1/n) ∫ ... ∫ p(x₁, ..., xₙ) log p(x₁, ..., xₙ) dx₁ ... dxₙ
For a given average power N, white noise has the maximum possible
entropy. This follows from the maximizing properties of the Gaussian
distribution noted above.
The entropy for a continuous stochastic process has many properties
analogous to that for discrete processes. In the discrete case the entropy
was related to the logarithm of the probability of long sequences, and to the
number of reasonably probable sequences of long length. In the continuous
case it is related in a similar fashion to the logarithm of the probability
density for a long series of samples, and the volume of reasonably high prob-
ability in the function space.
More precisely, if we assume p(x₁ ... xₙ) continuous in all the x_i for all n, then for sufficiently large n
|(log p)/n - H′| < ε
for all choices of (x₁, ..., xₙ) apart from a set whose total probability is less than δ, with δ and ε arbitrarily small. This follows from the ergodic
property if we divide the space into a large number of small cells.
The relation of H to volume can be stated as follows: Under the same as-
sumptions consider the n dimensional space corresponding to p(x₁, ..., xₙ). Let Vₙ(q) be the smallest volume in this space which includes in its interior a total probability q. Then
Lim_{n→∞} (log Vₙ(q))/n = H′
provided q does not equal 0 or 1.
p(x₁, ..., xₙ) = (1/(2πN)^(n/2)) exp (-(1/2N) Σ x_i²)
Since this depends only on Σ x_i² the surfaces of equal probability density are spheres and the entire distribution has spherical symmetry. The region of high probability is a sphere of radius √(nN). As n → ∞ the probability of being outside a sphere of radius √(n(N + ε)) approaches zero and 1/n times the logarithm of the volume of the sphere approaches log √(2πeN).
In the continuous case it is convenient to work not with the entropy H of
an ensemble but with a derived quantity which we will call the entropy
power. This is defined as the power in a white noise limited to the same
band as the original ensemble and having the same entropy. In other words
if H' is the entropy of an ensemble its entropy power is
N₁ = (1/2πe) exp 2H′.
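In code (a sketch of ours, using natural logarithms): for white Gaussian noise of power N the entropy power recovers N itself, while a uniform distribution of the same power has the smaller entropy power 6N/πe.

```python
import numpy as np

def entropy_power(H):
    """N1 = (1/(2*pi*e)) * exp(2*H) for an entropy H per degree of freedom (nats)."""
    return np.exp(2.0 * H) / (2.0 * np.pi * np.e)

N = 4.0
H_gauss = 0.5 * np.log(2 * np.pi * np.e * N)   # white Gaussian noise of power N
H_unif = np.log(2 * np.sqrt(3 * N))            # uniform on (-sqrt(3N), sqrt(3N)), also power N

print(entropy_power(H_gauss))   # 4.0: the entropy power of Gaussian noise is its power
print(entropy_power(H_unif))    # ~ 2.81 = (6/(pi*e)) * N, smaller than N
```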
H₂ = H₁ + (1/W) ∫_W log |Y(f)|² df,
TABLE I
GAIN | ENTROPY POWER FACTOR | ENTROPY POWER GAIN IN DECIBELS | IMPULSE RESPONSE
1 - ω | 1/e² | -8.68 | sin² πt / (πt)²
1 - ω² | (2/e)⁴ | -5.32 | 2 [sin t / t³ - cos t / t²]
√(1 - ω²) | (2/e)² | -2.66 | (π/2) J₁(t) / t
1 on (0, 1 - u), falling linearly to 0 at ω = 1 | e^(-2u) | -8.68 u | (1/ut²) [cos (1 - u)t - cos t]
where the f_i are equally spaced through the band W. This becomes in the limit
exp (1/W) ∫_W log |Y(f)|² df.
Since J is constant its average value is this same quantity and applying the
theorem on the change of entropy with a change of coordinates, the result
follows. We may also phrase it in terms of the entropy power. Thus if
the entropy power of the first ensemble is N₁ that of the second is
N₂ = N₁ exp (1/W) ∫_W log |Y(f)|² df.
The final entropy power is the initial entropy power multiplied by the geo-
metric mean gain of the filter. If the gain is measured in db, then the
output entropy power will be increased by the arithmetic mean db gain
over W.
In Table I the entropy power loss has been calculated (and also expressed
in db) for a number of ideal gain characteristics. The impulsive responses
of these filters are also given for W = 2π, with phase assumed to be 0.
The entropy loss for many other cases can be obtained from these results.
For example the entropy power factor 1/e² for the first case also applies to any gain characteristic obtained from 1 - ω by a measure preserving transformation of the ω axis. In particular a linearly increasing gain G(ω) = ω, or a "saw tooth" characteristic between 0 and 1 have the same entropy loss. The reciprocal gain has the reciprocal factor. Thus 1/ω has the factor e². Raising the gain to any power raises the factor to this power.
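A numerical check (ours) of the first row of Table I: the entropy power factor is the geometric mean of the power gain G² over the band, which for G = 1 - ω on a unit band is 1/e², or about -8.68 db.

```python
import numpy as np

# Midpoints of a fine grid on the band (0, 1), avoiding the zero of the gain at w = 1.
w = (np.arange(200_000) + 0.5) / 200_000
G = 1.0 - w                                    # the first gain characteristic of Table I

factor = np.exp(np.mean(np.log(G ** 2)))       # geometric mean of the power gain G^2
print(factor, 1 / np.e ** 2)                   # ~ 0.1353 in both cases
print(10 * np.log10(factor))                   # ~ -8.68 db
```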
22. ENTROPY OF THE SUM OF TWO ENSEMBLES
If we have two ensembles of functions f_α(t) and g_β(t) we can form a new ensemble by "addition." Suppose the first ensemble has the probability density function p(x₁, ..., xₙ) and the second q(x₁, ..., xₙ). Then the density function for the sum is given by the convolution:
r(x₁, ..., xₙ) = ∫ ... ∫ p(y₁, ..., yₙ) q(x₁ - y₁, ..., xₙ - yₙ) dy₁ ... dyₙ.
White Gaussian noise has the peculiar property that it can absorb any
other noise or signal ensemble which may be added to it with a resultant
entropy power approximately equal to the sum of the white noise power and
the signal power (measured from the average signal value, which is normally
zero), provided the signal power is small, in a certain sense, compared to
the noise.
Consider the function space associated with these ensembles having n
dimensions. The white noise corresponds to a spherical Gaussian distribu-
tion in this space. The signal ensemble corresponds to another probability
distribution, not necessarily Gaussian or spherical. Let the second moments
of this distribution about its center of gravity be a_ij. That is, if p(x₁, ..., xₙ) is the density distribution function
a_ij = ∫ ... ∫ p(x₁, ..., xₙ)(x_i - α_i)(x_j - α_j) dx₁ ... dxₙ
where the α_i are the coordinates of the center of gravity. Now a_ij is a positive definite quadratic form, and we can rotate our coordinate system to align it with the principal directions of this form. a_ij is then reduced to diagonal form b_ii. We require that each b_ii be small compared to N, the squared radius of the spherical distribution.
In this case the convolution of the noise and signal produce a Gaussian
distribution whose corresponding quadratic form is
N + b_ii.
The entropy power of this distribution is
[Π (N + b_ii)]^(1/n)
or approximately
= [(N)ⁿ + Σ b_ii (N)^(n-1)]^(1/n)
= N + (1/n) Σ b_ii.
The last term is the signal power, while the first is the noise power.
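A one-dimensional sketch (ours; the particular signal distribution is only an example) of this absorption property: convolving Gaussian noise of power N with a small independent signal of power P gives a distribution whose entropy power is very nearly N + P.

```python
import numpy as np
from scipy import integrate

N, P = 1.0, 0.05                       # noise power and a comparatively small signal power
half = np.sqrt(3 * P)                  # signal taken uniform on (-half, half), so its power is P

def density_sum(y):
    """Density of noise + signal: the Gaussian of power N convolved with the uniform."""
    g = lambda x: np.exp(-x ** 2 / (2 * N)) / np.sqrt(2 * np.pi * N)
    value, _ = integrate.quad(lambda s: g(y - s) / (2 * half), -half, half)
    return value

H, _ = integrate.quad(lambda y: -density_sum(y) * np.log(density_sum(y)), -10, 10)
print(np.exp(2 * H) / (2 * np.pi * np.e), N + P)    # entropy power ~ 1.0499 vs N + P = 1.05
```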
where H(x) is the entropy of the input and H_y(x) the equivocation. The
channel capacity C is defined as the maximum of R when we vary the input
over all possible ensembles. This means that in a finite dimensional ap-
proximation we must vary P(x) = P(x₁, ..., xₙ) and maximize
∫∫ P(x, y) log [P(x, y)/(P(x)P(y))] dx dy
using the fact that ∫∫ P(x, y) log P(x) dx dy = ∫ P(x) log P(x) dx. The
channel capacity is thus expressed
C = Lim_{T→∞} Max_{P(x)} (1/T) ∫∫ P(x, y) log [P(x, y)/(P(x)P(y))] dx dy.
It is obvious in this form that R and C are independent of the coordinate system since the numerator and denominator in log [P(x, y)/(P(x)P(y))] will be multiplied by the same factors when x and y are transformed in any one-to-one way. This integral expression for C is more general than H(x) - H_y(x). Properly interpreted (see Appendix 7) it will always exist while H(x) - H_y(x)
and we can assign a definite entropy to the noise (independent of the sta-
tistics of the signal), namely the entropy of the distribution Q(n). This
entropy will be denoted by H(n).
Theorem 16: If the signal and noise are independent and the received
signal is the sum of the transmitted signal and the noise then the rate of
transmission is
R = H(y) - H(n)
i.e., the entropy of the received signal less the entropy of the noise. The
channel capacity is
C = Max_{P(x)} [H(y) - H(n)].
W log [(P + N₁)/N₁] ≤ C ≤ W log [(P + N)/N₁]
where
P = average transmitter power
N = average noise power
N₁ = entropy power of the noise.
Here again the average power of the perturbed signals will be P + N.
The maximum entropy for this power would occur if the received signal
were white noise and would be W log 2πe(P + N). It may not be possible
to achieve this; i.e. there may not be any ensemble of transmitted signals
which, added to the perturbing noise, produce a white thermal noise at the
receiver, but at least this sets an upper bound to H(y). We have, therefore
C = Max H(y) - H(n)
≤ W log 2πe(P + N) - W log 2πeN₁.
This is the upper limit given in the theorem. The lower limit can be ob-
tained by considering the rate if we make the transmitted signal a white
noise, of power P. In this case the entropy power of the received signal
must be at least as great as that of a white noise of power P + N₁ since we
have shown in a previous theorem that the entropy power of the sum of two
ensembles is greater than or equal to the sum of the individual entropy
powers. Hence
Max H(y) ≥ W log 2πe(P + N₁)
and
C ≥ W log 2πe(P + N₁) - W log 2πeN₁
= W log [(P + N₁)/N₁].
If the noise is itself white, N = N₁, and the result reduces to the formula proved previously:
C = W log (1 + P/N).
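The sketch below (ours; the noise spectrum is an arbitrary example, and rates are in nats per second) evaluates these bounds for a Gaussian noise with a non-flat spectrum, taking N as the average of the spectrum and N₁ as its geometric mean over the band, as described in the following paragraph.

```python
import numpy as np

W = 1.0                                   # band width
P = 10.0                                  # average transmitter power
f = (np.arange(10_000) + 0.5) / 10_000    # frequencies across the band, normalized to (0, 1)
noise_spectrum = 1.0 + 0.8 * np.cos(2 * np.pi * f)   # an example non-flat noise spectrum

N = noise_spectrum.mean()                            # average noise power
N1 = np.exp(np.mean(np.log(noise_spectrum)))         # entropy power: geometric mean over the band

lower = W * np.log((P + N1) / N1)
upper = W * np.log((P + N) / N1)
print(N, N1)            # 1.0 and 0.8: N1 <= N, with equality only for white noise
print(lower, upper)     # the two bounds which bracket the capacity C
```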
If the noise is Gaussian but with a spectrum which is not necessarily flat,
N 1 is the geometric mean of the noise power over the various frequencies in
the band W. Thus
Suppose that for a given power P₁ the channel capacity is
W log [(P₁ + N - η₁)/N₁].
This means that the best signal distribution, say p(x), when added to the
noise distribution q(x), gives a received distribution r(y) whose entropy
power is (P₁ + N - η₁). Let us increase the power to P₁ + ΔP by adding a white noise of power ΔP to the signal. The entropy of the received signal is now at least
H(y) = W log 2πe(P₁ + N - η₁ + ΔP)
by application of the theorem on the minimum entropy power of a sum.
Hence, since we can attain the H indicated, the entropy of the maximizing
distribution must be at least as great and η must be monotonic decreasing.
To show that η → 0 as P → ∞ consider a signal which is a white noise with
a large P. Whatever the perturbing noise, the received signal will be
approximately a white noise, if P is sufficiently large, in the sense of having
an entropy power approaching P + N.
subject to the constraint that all the functions f(t) in the ensemble be less than or equal to √S, say, for all t. A constraint of this type does not work
out as well mathematically as the average power limitation. The most we
have obtained for this case is a lower bound valid for all S/N, an "asymptotic" upper bound (valid for large S/N) and an asymptotic value of C for S/N small.
Theorem 20: The channel capacity C for a band W perturbed by white thermal noise of power N is bounded by
C ≥ W log (2/πe³)(S/N)
where S is the peak allowed transmitter power.
samples are independent and have a distribution function which is constant from -√S to +√S. The entropy can be calculated as
W log 4S.
The rate of transmission is then less than
W log (4S + 2πeN)(1 + ε) - W log (2πeN) = W log [((2/πe)S + N)/N](1 + ε).
This is the desired upper bound to the channel capacity.
To obtain a lower bound consider the same ensemble of functions. Let
these functions be passed through an ideal filter with a triangular transfer
characteristic. The gain is to be unity at frequency 0 and decline linearly
down to gain 0 at frequency W. We first show that the output functions
of the filter have a peak power limitation S at all times (not just the sample
points). First we note that a pulse sin 2πWt / 2πWt going into the filter produces
(1/2) sin² πWt / (πWt)²
in the output. This function is never negative. The input function (in
the general case) can be thought of as the sum of a series of shifted functions
a sin 2πWt / 2πWt
where a, the amplitude of the sample, is not greater than √S. Hence the
output is the sum of shifted functions of the non-negative form above with
the same coefficients. These functions being non-negative, the greatest
positive value for any t is obtained when all the coefficients a have their
maximum positive values, i.e. √S. In this case the input function was a constant of amplitude √S and since the filter has unit gain for D.C., the
output is the same. Hence the output ensemble has a peak power S.
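A numerical check (ours) of the filtering claim: multiplying the flat spectrum of sin 2πWt/2πWt by the triangular gain and inverting the transform reproduces the non-negative pulse (1/2) sin² πWt/(πWt)² at every t.

```python
import numpy as np
from scipy import integrate

W = 1.0

def filtered_pulse(t):
    """Pass sin(2*pi*W*t)/(2*pi*W*t) through the triangular gain (W - f)/W, 0 <= f <= W."""
    # The input spectrum is flat, 1/(2W) on |f| < W; multiply by the gain and invert
    # the (real, even) Fourier transform by quadrature.
    integrand = lambda f: (1.0 / (2 * W)) * ((W - f) / W) * 2 * np.cos(2 * np.pi * f * t)
    value, _ = integrate.quad(integrand, 0.0, W)
    return value

for t in (0.25, 0.7, 1.3, 2.6):
    closed_form = 0.5 * np.sin(np.pi * W * t) ** 2 / (np.pi * W * t) ** 2
    print(filtered_pulse(t), closed_form)    # the two agree, and are never negative
```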
The entropy of the output ensemble can be calculated from that of the
input ensemble by using the theorem dealing with such a situation. The
output entropy is equal to the input entropy plus the geometrical mean
gain of the filter;
∫₀^W log G² df = 2 ∫₀^W log ((W - f)/W) df = -2W
and the output entropy is
W log 4S - 2W = W log (4S/e²).
C ≥ W log (2/πe³)(S/N).
We now wish to show that, for small S/N (peak signal power over average white noise power), the channel capacity is approximately
C = W log (1 + S/N).
More precisely C/W log (1 + S/N) → 1 as S/N → 0. Since the average signal power P is less than or equal to the peak S, it follows that for all S/N
C ≤ W log (1 + P/N) ≤ W log (1 + S/N).
Therefore, if we can find an ensemble of functions such that they correspond
to a rate nearly W log (1 + S/N) and are limited to band W and peak S the
result will be proved. Consider the ensemble of functions of the following
type. A series of t samples have the same value, either +√S or -√S, then the next t samples have the same value, etc. The value for a series is chosen at random, probability 1/2 for +√S and 1/2 for -√S. If this
ensemble be passed through a filter with triangular gain characteristic (unit
gain at D.C.), the output is peak limited to S. Furthermore the average
power is nearly S and can be made to approach this by taking t sufficiently
large. The entropy of the sum of this and the thermal noise can be found
by applying the theorem on the sum of a noise and a small signal. This
theorem will apply if
S/N
is sufficiently small. The rate of transmission is then close to
W log [(S + N)/N].
since
∫∫ P(x, y) dx dy = 1
ρ(x, y) = (1/T) ∫₀^T [x(t) - y(t)]² dt
f(t) = ∫ e(τ) k(t - τ) dτ
then
ρ(x, y) = (1/T) ∫₀^T f(t)² dt.
For the absolute error criterion
ρ(x, y) = (1/T) ∫₀^T |x(t) - y(t)| dt
v = ∫∫ ρ(x, y) P(x, y) dx dy
This means that we consider, in effect, all the communication systems that
might be used and that transmit with the required fidelity. The rate of
transmission in bits per second is calculated for each one and we choose that
having the least rate. This latter rate is the rate we assign the source for
the fidelity in question.
The justification of this definition lies in the following result:
Theorem 21: If a source has a rate R₁ for a valuation v₁ it is possible to encode the output of the source and transmit it over a channel of capacity C with fidelity as near v₁ as desired provided R₁ ≤ C. This is not possible if R₁ > C.
The last statement in the theorem follows immediately from the definition
of R₁ and previous results. If it were not true we could transmit more than
C bits per second over a channel of capacity C. The first part of the theorem
is proved by a method analogous to that used for Theorem 11. We may, in
the first place, divide the (x, y) space into a large number of small cells and
represent the situation as a discrete case. This will not change the evalua-
tion function by more than an arbitrarily small amount (when the cells are
very small) because of the continuity assumed for p(x, y). Suppose that
P₁(x, y) is the particular system which minimizes the rate and gives R₁. We
choose from the high probability y's a set at random containing
R₁ = Min ∫∫ P(x, y) log [P(x, y)/(P(x)P(y))] dx dy
with P(x) and v₁ = ∫∫ P(x, y) ρ(x, y) dx dy fixed. In the latter
C = Max ∫∫ P(x, y) log [P(x, y)/(P(x)P(y))] dx dy
with P_x(y) fixed and possibly one or more other constraints (e.g., an average power limitation) of the form K = ∫∫ P(x, y) λ(x, y) dx dy.
The variational equation (when we take the first variation on P(x, y))
leads to
P_y(x) = B(x) e^(-λρ(x, y))
where A is determined to give the required fidelity and B(x) is chosen to
satisfy
∫ B(x) e^(-λρ(x, y)) dx = 1.
This shows that, with best encoding, the conditional probability of a cer-
tain cause for various received y, PI/C\") will decline exponentially with the
distance function p(x, y) between the x and y is question.
In the special case where the distance function ρ(x, y) depends only on the (vector) difference between x and y,
ρ(x, y) = ρ(x - y)
we have
∫ B(x) e^(-λρ(x - y)) dx = 1.
= W₁ log (Q/N)
where Q is the average message power. This proves the following:
Theorem 22: The rate for a white noise source of power Q and band W₁ relative to an R.M.S. measure of fidelity is
R₁ = W₁ log (Q/N)
where N is the allowed mean square error between original and recovered
messages.
More generally with any message source we can obtain inequalities bound-
ing the rate relative to a mean square error criterion.
Theorem 23: The rate for any source of band W₁ is bounded by
W₁ log (Q₁/N) ≤ R₁ ≤ W₁ log (Q/N)
where Q is the average power of the source, Q1 its entropy power and N the
allowed mean square error.
The lower bound follows from the fact that the maximum H_y(x) for a given mean square error N occurs in the white noise case. The upper bound results if we place the points (used in the proof of Theorem 21) not in the best way but at random in a sphere of radius √(Q - N).
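A small numerical illustration (ours; the figures are arbitrary) of Theorem 23's bounds W₁ log (Q₁/N) ≤ R₁ ≤ W₁ log (Q/N); when the source is itself white noise, Q₁ = Q and the two bounds collapse to the rate of Theorem 22.

```python
import numpy as np

W1 = 3000.0        # source band, cycles per second
Q = 1.0            # average power of the source
Q1 = 0.4 * Q       # entropy power of the source (Q1 = Q only if the source is white noise)
N = 0.01           # allowed mean square error

lower = W1 * np.log(Q1 / N)     # W1 log(Q1/N), in nats per second
upper = W1 * np.log(Q / N)      # W1 log(Q/N); for a white source this is the exact rate
print(lower, upper)
print(upper / np.log(2))        # the Theorem 22 rate expressed in bits per second
```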
ACKNOWLEDGMENTS
We consider then
and similarly when q is varied. Hence the conditions for a minimum are
If we multiply the first by p(s_i) and the second by q(s_i) and integrate with
respect to s we obtain
H₃ = -λH₁,    H₃ = -μH₂
or solving for λ and μ and replacing in the equations
Then r(x_i) will also be normal with quadratic form C_ij. If the inverses of these forms are a_ij, b_ij, c_ij then
c_ij = a_ij + b_ij.
We wish to show that these functions satisfy the minimizing conditions if and only if a_ij = K b_ij and thus give the minimum H₃ under the constraints.
First we have
log r(x_i) = (1/2) log [(1/2π)ⁿ |C_ij|] - (1/2) Σ C_ij x_i x_j
APPENDIX 7
The following will indicate a more general and more rigorous approach to
the central definitions of communication theory. Consider a probability
measure space whose elements are ordered pairs (x, y). The variables x, y
are to be identified as the possible transmitted and received signals of some
long duration T. Let us call the set of all points whose x belongs to a subset S₁ of x points the strip over S₁, and similarly the set whose y belongs to S₂ the strip over S₂. We divide x and y into a collection of non-overlapping
measurable subsets X_i and Y_j and approximate the rate of transmission R by
R₁ = (1/T) Σ_{i,j} P(X_i, Y_j) log [P(X_i, Y_j)/(P(X_i)P(Y_j))]
where
P(X_i) is the probability measure of the strip over X_i
P(Y_j) is the probability measure of the strip over Y_j
P(X_i, Y_j) is the probability measure of the intersection of the strips.
A further subdivision can never decrease R₁. For let X₁ be divided into X₁ = X₁′ + X₁″ and let
a = P(Y₁)    b = P(X₁′)    c = P(X₁″)
d = P(X₁′, Y₁)    e = P(X₁″, Y₁)    d + e = P(X₁, Y₁).
Then in the sum we have replaced (for the X₁, Y₁ intersection)
(d + e) log [(d + e)/(a(b + c))]    by    d log (d/ab) + e log (e/ac)
and consequently the sum is increased. Thus the various possible subdivi-
sions form a directed set, with R monotonic increasing with refinement of
the subdivision. We may define R unambiguously as the least upper bound
for the R₁ and write it
R = (1/T) ∫∫ P(x, y) log [P(x, y)/(P(x)P(y))] dx dy.
This integral, understood in the above sense, includes both the continuous
and discrete cases and of course many others which cannot be represented
in either form. It is trivial in this formulation that if x and u are in one-to-one correspondence, the rate from u to y is equal to that from x to y. If v
is any function of y (not necessarily with an inverse) then the rate from x to
y is greater than or equal to that from x to v since, in the calculation of the
approximations, the subdivisions of y are essentially a finer subdivision of
those for v. More generally if y and v are related not functionally but
statistically, i.e., we have a probability measure space (y, v), then R(x, v) ≤
R(x, y). This means that any operation applied to the received signal, even
though it involves statistical elements, does not increase R.
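The monotonicity under refinement can be seen numerically; the sketch below (ours, with T taken as 1) computes R₁ for a joint distribution on a fine grid and on a coarsened grid obtained by merging cells, and the coarse value never exceeds the fine one.

```python
import numpy as np

def R1(P):
    """The sum of P_ij log(P_ij / (P_i P_j)) for a joint probability matrix (T = 1)."""
    Px = P.sum(axis=1, keepdims=True)
    Py = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Px @ Py)[mask])))

rng = np.random.default_rng(3)
P_fine = rng.random((8, 8))
P_fine /= P_fine.sum()                          # a joint distribution on a fine 8 x 8 subdivision

# Coarsen by merging pairs of adjacent cells in both x and y (a 4 x 4 subdivision).
P_coarse = P_fine.reshape(4, 2, 4, 2).sum(axis=(1, 3))

print(R1(P_coarse), "<=", R1(P_fine))           # refinement never decreases R1
```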
Another notion which should be defined precisely in an abstract formu-
lation of the theory is that of "dimension rate," that is the average number
of dimensions required per second to specify a member of an ensemble. In
the band limited case 2W numbers per second are sufficient. A general
definition can be framed as follows. Let f_α(t) be an ensemble of functions and let ρ_T[f_α(t), f_β(t)] be a metric measuring the "distance" from f_α to f_β over the time T (for example the R.M.S. discrepancy over this interval). Let N(ε, δ, T) be the least number of elements f which can be chosen such that all elements of the ensemble apart from a set of measure δ are within the distance ε of at least one of those chosen. Thus we are covering the space to within ε apart from a set of small measure δ. We define the dimension rate λ for the ensemble by the triple limit