

Predictive Coding-Part I
PETER ELIAS†

Summary - Predictive coding is a procedure for transmitting messages which are sequences of magnitudes. In this coding method, the transmitter and the receiver store past message terms, and from them estimate the value of the next message term. The transmitter transmits, not the message term, but the difference between it and its predicted value. At the receiver this error term is added to the receiver prediction to reproduce the message term. This procedure is defined, and messages, prediction, entropy, and ideal coding are discussed to provide a basis for Part II, which will give the mathematical criterion for the best predictor for use in the predictive coding of particular messages, will give examples of such messages, and will show that the error term which is transmitted in predictive coding may always be coded efficiently.

INTRODUCTION

TWO MAJOR contributions have been made within the past few years to the mathematical theory of communication. One of these is Wiener's work on the prediction and filtering of random, stationary time series, and the other is Shannon's work, defining the information content of a message which is such a time series, and relating this quantity to the bandwidth and time required for the transmission of the message.1 This paper makes use of the point of view suggested by Wiener's work on prediction to attack a problem in Shannon's field: prediction is used to make possible the efficient coding of a class of messages of considerable physical interest.

Consider a message which is a time series, a function m_i which is defined for all integer i, positive or negative. Such a series might be derived from the sampling used in a pulse-code modulation system.2 From a knowledge of the statistics of the set of messages to be transmitted, we may find a predictor which operates on all the past values of the function, m_j with j less than i, and produces a prediction p_i of the value which m_i will next assume. Now consider the error e_i, which is defined as the difference between the message and its predicted value:

    e_i = m_i - p_i.    (1)

All of the information generated by the source in selecting the term m_i is given just as well by e_i; the error term may be transmitted, and will enable the receiver to reconstruct the original message, for the portion of the message that is not transmitted, p_i, may be considered as information about the past of the message and not about its present; indeed, since p_i is a quite determinate mathematical function, it contains no information at all by Shannon's definition of this quantity.3

The communications procedure which will be discussed is illustrated in Fig. 1. There is a message-generating source that feeds into a memory at the transmitter. The transmitter has a predictor, which operates on the past of the message as stored in the memory to produce an estimate of its future. The subtractor subtracts the prediction from the message term and produces an error term e_i, which is applied as an input to the coder. The coder codes the error term, and this coded term is sent to the receiver. In the receiver the transmitting process is reversed. The receiver also has a memory and an identical predictor, and has predicted the same value p_i for the message as did the predictor at the transmitter. When the coded correction term is received, it is decoded to reproduce the error term e_i. This is added to the predicted value p_i and the message term m_i is reproduced. The message term is then presented to the observer at the receiver, and is also stored in the receiver memory to permit the prediction of the following values of the message.

[Fig. 1. Predictive coding and decoding procedure.]

This procedure is essentially a coding scheme, and will be called predictive coding. The memory, predictor, subtractor, and coder at the transmitter, and the memory, predictor, adder, and decoder at the receiver may be considered as complex coding and decoding devices. Predictive coding may then be compared with the ideal coding methods given by Shannon and Fano.4 In general, predictive coding cannot take less channel space for the transmission of a message at a given rate than does an ideal coding scheme, and it will often take more. However, there is a large class of message-generating processes which are at present coded in a highly inefficient way, and for which the use of large codebook memories, such as are required for the ideal coding methods, is impractical. Time series which are obtained by sampling a smoothly varying function of time are examples in this class. For many such processes predictive coding can give an efficient code, using a reasonable amount of apparatus at the transmitter and the receiver.

† Elec. Engrg. Dept. and Res. Lab. Elec., Mass. Inst. Tech., Cambridge, Mass.
1 For historical remarks on the origin of modern information theory see C. E. Shannon and W. Weaver, "The Mathematical Theory of Communication," Univ. of Illinois Press, Urbana, Ill., p. 52 (footnote) and p. 95 (footnote); 1949.
2 B. M. Oliver, J. R. Pierce, and C. E. Shannon, "The philosophy of PCM," Proc. IRE, vol. 36, pp. 1324-1331; November, 1948; also, W. R. Bennett, "Spectra of quantized signals," Bell Sys. Tech. Jour., vol. 27, pp. 446-472; July, 1948.
3 Shannon and Weaver, op. cit., p. 31.
4 Shannon and Weaver, op. cit., p. 30; also R. M. Fano, Tech. Rep. No. 65, Res. Lab. Elect., M.I.T., Cambridge, Mass.; 1949.
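The loop of Fig. 1 is easy to state in code. The Python sketch below assumes integer message samples and uses a simple repeat-the-last-value predictor purely for illustration; the choice of predictor is the subject of Part II, and the function names here are not from the paper.

```python
# Minimal sketch of the Fig. 1 procedure: only the error e_i = m_i - p_i is
# transmitted, and identical memories and predictors at the two ends keep the
# receiver's predictions equal to the transmitter's.

def predict(memory):
    """Illustrative predictor: repeat the last stored term (0 before any history)."""
    return memory[-1] if memory else 0

def transmit(message):
    """Transmitter: turn the message terms into error terms."""
    memory, errors = [], []
    for m in message:
        p = predict(memory)
        errors.append(m - p)          # e_i = m_i - p_i, as in (1)
        memory.append(m)              # store the true term for later predictions
    return errors

def receive(errors):
    """Receiver: add each decoded error to the local prediction."""
    memory, message = [], []
    for e in errors:
        p = predict(memory)
        m = p + e                     # m_i = p_i + e_i
        message.append(m)
        memory.append(m)              # receiver memory stays identical to the transmitter's
    return message

message = [3, 4, 4, 5, 7, 8, 8, 9]
errors = transmit(message)            # [3, 1, 0, 1, 2, 1, 0, 1]: small, low-entropy terms
assert receive(errors) == message
```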

It should be noted that in the transmission scheme of Fig. 1 errors accumulate. That is, any noise which is introduced after the transmitter memory, or at the receiver, or in transmission, will be perpetuated as an error in all future values of the message, as will any discrepancy between the operation of the two memories, or the two predictors. This means that eventually errors will accumulate to such an extent that the message will disappear in the noise. If, therefore, continuous messages, i.e., time series each member of which is selected from a continuum of magnitudes, are to be transmitted, it will be necessary periodically to clear the memories of both the receiver and the transmitter and start afresh. This is undesirable, since after each such clearing there will be no remembered values on which to base a prediction, and more information transmission will be required for a period following each such clearing, until enough remembered values have accumulated to permit good prediction once more.

A more satisfactory alternative is the use of some pulse-code transmission system in which only quantized magnitudes of input are accepted. Such a system may be made virtually error-free.5 A system of this kind has the further advantage that the only very reliable memory units now available or in immediate prospect are of a quantized nature, most of them being capable only of storing binary digits. The use of a quantized system requires that the predicted values be selected from the permissible quantized set of message values. Strictly interpreted, this severely limits the permissible predictors; if by a choice of scale the permissible quantized levels are made equal to the integers, then the restriction on p(m_{i-1} ... m_{i-j} ...) is that it take integer values for all sets of integer arguments. Actually the ordinary extrapolation formulas have this property, and may be used as predictors. But it is not necessary to limit the choice of predictors so severely. The problem may be evaded by using any function as a predictor and computing its value to a predetermined number of places by digital computing techniques, the prediction then being taken to be the function rounded off to the nearest integer. If the predictor as originally computed was optimum in some well-defined sense, then the rounded predictor will presumably be less good in that sense, but in cases where predictive coding may be expected to be useful the difference will usually be small.

It is necessary to define precisely what is meant by an optimum predictor for use in predictive coding, i.e., to define some quantity which depends upon the choice of the predictor, and define as optimum a predictor which minimizes this quantity. Wiener's work uses as a criterion the minimization of the mean square of the error term e_i. Wiener has pointed out that other criteria are possible, but that the mathematical work is made simpler by the mean square choice.6 Minimizing the mean square error corresponds to minimizing the power of the error term, and if no further coding is to be done, this is a reasonable criterion for predictive coding purposes. However, in the system illustrated in Fig. 1, the error term is coded before it is transmitted, and its power may be radically altered in the coding process. What we are really interested in minimizing is the channel space which the system will require for the transmission of the error term. This leads to the following criterion, which will be justified in Part II of this paper: that predictor is best which leads to an average error-term distribution having minimum entropy.

The coder of Fig. 1 also requires some consideration. Predictive coding eliminates the codebook requirement by using prediction. To take advantage of the resultant savings in equipment, it is necessary to show that the coder itself will not require a large codebook. This reduces to the problem of showing that a message whose terms are assumed independent of one another may always be coded efficiently by a process with a small memory requirement. It will be shown that this is true. It is necessary to use two kinds of coding processes: one for cases in which the entropy of the distribution from which the successive terms are chosen is large compared to unity, and another for cases in which the entropy is small compared to unity.

The following sections of the present paper are devoted to a discussion of messages, prediction, entropy, and ideal coding. Part II will discuss the predictor criterion given above, the classes of messages for which a predictor that is optimum by this criterion may be found, and other classes of messages for which predictive coding may be of use. Mathematically defined examples of message-generating processes which belong to these classes will be given, and the problem of coding the error term so as to take advantage of the minimal entropy of its average distribution will be examined.

5 Oliver, Pierce, and Shannon, loc. cit.
6 N. Wiener, "The Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications," published in 1942 as an NDRC report, and in 1949 as a book, by the Mass. Inst. Tech. Press, Cambridge, Mass., and John Wiley & Sons, Inc., New York, N.Y., especially p. 13.
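The rounding of an arbitrary predictor to the permissible quantized levels, described above, might look as follows. The linear extrapolation 2m_{i-1} - m_{i-2} is only one convenient choice of raw predictor, and the level set is an assumption for the example, not taken from the paper.

```python
# Sketch of the rounding device described above: compute any convenient
# real-valued predictor numerically, then take the nearest permissible
# quantized level, so that transmitter and receiver agree exactly.

def raw_predictor(memory):
    """An arbitrary real-valued predictor; here, linear extrapolation."""
    if len(memory) < 2:
        return float(memory[-1]) if memory else 0.0
    return 2.0 * memory[-1] - memory[-2]

def quantized_prediction(memory, levels):
    """Round the computed prediction to the nearest permissible level."""
    p = raw_predictor(memory)
    return min(levels, key=lambda q: abs(q - p))

levels = range(16)                             # assumed set of quantized message values
print(quantized_prediction([2, 5], levels))    # extrapolates to 8, already a level
print(quantized_prediction([9, 14], levels))   # extrapolates to 19, rounded to 15
```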


CHARACTERIZATION OF MESSAGES

A necessary preliminary to a discussion of messages is a precise definition of what "message" is taken to mean.7 Since a communication system is designed to transmit many messages, what is actually of interest is the characterization of the ensemble from which the transmitted messages are chosen, or the stochastic process by which they are generated. As a preliminary definition, we may say that a message is a single-valued real function of time, chosen from an ensemble of such functions. It will be denoted by m(α, t), where α is a real number between zero and one which labels the particular message chosen from the ensemble, and m(α, t) is defined, for each such α, for all values of t from -∞ to ∞. This definition must be restricted in several respects, in part to take into account the physical requirements of transmitting systems and in part for mathematical convenience.

First, it is assumed that the ensemble from which the messages are chosen is ergodic. This means that any one message of the ensemble, except for a set whose measure in α is zero, is typical of the ensemble in the following sense: let Q(α) be the probability distribution of the parameter of distribution α. Then with probability one, for any function f[m(α, t)] and almost any α_1,

    lim_{T→∞} (1/2T) ∫_{-T}^{T} f[m(α_1, t)] dt = ∫_0^1 f[m(α, t)] dQ(α).    (2)

I.e., any function of m has the same average value when averaged over time as a function of a single message, as when averaged over the ensemble of all possible messages. We can thus find out all possible statistical information about the ensemble by observing a single message over its entire history. The ergodic requirement implies that the ensemble is stationary: i.e., that the statistics do not change with time. Its practical importance is that it permits us to speak indifferently of the message or the ensemble, and makes it unnecessary to specify the sense in which we speak of an average. In particular, it permits the substitution of measurable time averages for experimentally awkward ensemble averages.

Second, it is assumed that the average square of the message [in either sense of (2)] is finite. The message will be represented in physical systems by a voltage or a current, or the displacement of a membrane, or the pressure in a gas, or by several such physical variables, as it proceeds from its origin to its destination. All of these representations require power; in particular, representation as a voltage or a current between two points separated by a fixed impedance, which is a necessary intermediate representation in any presently used electrical communication method, requires a power proportional to the square of the message. Since only a finite amount of power may be supplied to a physical transmitter, it is obviously required that the average message power be bounded.

Third, it is assumed that the spectrum of the message vanishes for frequencies greater than some fixed frequency f_0. This will not in general be true for the radio-frequency spectrum of the messages as they are generated by a source, and it has been shown that a function with an infinitely extended spectrum cannot be reduced to a function with a spectrum of finite range by any physically realizable filter; the transfer characteristic of a filter can be zero only for a set of frequencies of total measure zero.8 However, this is no practical problem. For since the message has a finite total power distributed over the spectrum, there will always be an f_0 so high that a negligible fraction of the total power will be located beyond it in the power spectrum.

The reason for this assumption is that, as Shannon has pointed out, any function of time that is band-limited may be replaced by a time series, which gives the values of the function at times separated by an interval 1/2f_0.9 For any band-limited function we have the following identity:

    m(t) = Σ_{i=-∞}^{∞} m(i/2f_0) {sin π(2f_0 t - i)} / {π(2f_0 t - i)}.    (3)

The values of the function at the sampling points t = i/2f_0, which are the coefficients of this series, thus completely determine the function. If the function is not initially band-limited, the expansion will give a function which passes through the same values at the sampling points, but which is band-limited. As we assume band-limited messages, for our purpose the series and the function are equivalent, and since the series is easier to deal with in the sequel, it is desirable to change the definition of the message. Henceforth the message will be defined as the series of coefficients in the expansion (3). By choice of the unit of time, the sampling interval is made unity, and the message is then m_i(α), defined for all (positive and negative) integer values of the index i.

A message is thus a time series drawn from an ergodic ensemble of such series, and each term in any one message is drawn from a probability distribution whose form is determined by the preceding terms of that message. For the reasons indicated in the first section, we will be interested primarily in quantized messages, for which this probability distribution will be discrete. However, it will at times be more convenient in the analysis and the examples to deal with continuous distributions, it being understood that quantization will ultimately be used. In the discrete case, the message term m_i will be selected from a discrete probability distribution M_k, where M_k(m_{i-1} ... m_{i-j} ...) is the conditional distribution giving the probability that, for a particular set of past values m_{i-1} ... m_{i-j} ..., the message term m_i will take the integer value k. In the continuous case, the message term m_i will be chosen from a continuous conditional distribution M(m_i : m_{i-1} ... m_{i-j} ...). Both of these distributions are dependent on the set of values of the preceding message terms m_{i-1} ... m_{i-j} ..., but are of course independent of the value of the index i, by the stationary nature of the ensemble.

7 Such definitions are given by Wiener, ibid., and Wiener, "Cybernetics," Mass. Inst. Tech. Press, and John Wiley & Sons, Inc., 1948; also by Shannon and Weaver, loc. cit. Our discussion starts with a definition like Wiener's and ends with one like Shannon's.
8 Wiener, "The Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications," NDRC Report, Mass. Inst. Tech. Press, Cambridge, Mass., p. 37; 1942.
9 C. E. Shannon, "Communication in the presence of noise," Proc. IRE, vol. 37, pp. 10-21; January, 1949.
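The identity (3) is easy to check numerically. The short Python sketch below truncates the infinite sum, so agreement is only approximate; np.sinc(x) computes sin(πx)/(πx), so each term of (3) is m(i/2f_0)·sinc(2f_0 t - i). The particular band limit and test signal are arbitrary choices for the illustration.

```python
# Numerical check of the sampling identity (3): a band-limited signal is
# recovered from its samples m(i/2f0) by sinc interpolation.

import numpy as np

f0 = 4.0                                    # assumed band limit; the test signal lies below it
def signal(t):
    return np.cos(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 3.0 * t)

i = np.arange(-400, 401)                    # truncation of the infinite index range
samples = signal(i / (2 * f0))              # the time series that replaces m(t)

t = np.linspace(-2.0, 2.0, 9)
rebuilt = np.array([np.sum(samples * np.sinc(2 * f0 * tt - i)) for tt in t])

print(np.max(np.abs(rebuilt - signal(t))))  # small residual, due only to truncation
```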


Stochastic processes of this sort are known as Markoff processes and have an extensive mathematical literature.10 An nth order Markoff process is one in which the distribution from which each term is chosen depends on the set of values of the n preceding terms only; a process in which each term is chosen from a single unconditional probability distribution may be called a Markoff process of order zero. It should be noted that, while any Markoff process yielding a message with a finite second moment is included in this definition, we will expect most of the messages to be Markoff processes of a rather special kind. The messages have been derived by the time-sampling of a continuously varying physical quantity. The sampling rate must be high enough so that the sampling does not suppress significant variations in the message, i.e., the f_0 must be above the bulk of the spectral power of the message. Now for most such messages, the average rate of variation with time is much lower than the highest rate that the system must be capable of transmitting. Consequently, it is to be expected that on the average, successive message values will be near to one another. This means, in particular, that in the discrete case the index k is not just an arbitrary labeling of a particular symbol, as it is, for example, in Shannon's finite-order Markoff approximations to English,11 but may be expected to give a genuine metric: message values with indexes near to one another may be expected to have probabilities near to one another, and the conditional distributions mentioned above may be expected to be unimodal. This is not a restriction on what kinds of series will be considered to be messages, but is rather a specification of the class of messages for which predictive coding may be expected to be of use, as will be discussed in detail in Part II of this paper.

For a message ensemble for which the conditional distributions are not given a priori, it is necessary to determine them by the observation of a number of messages, or of a single message for a long time. It is obviously impossible to do this on the assumption that the distribution from which a particular message term is chosen depends on the infinite set of past message values. What can, in fact, be measured are the zeroth order approximation, in which each term is treated as if it were drawn from the same distribution, giving M(m_i), an unconditional distribution; the first order conditional distribution M(m_i : m_{i-1}); and so on to the nth order conditional distribution for some finite n. A communications system which is designed to transmit this approximation will be inefficient: the approximating process itself would generate messages with a greater information content than the messages which are actually being transmitted, and a system designed for the approximation will waste time or power or bandwidth when transmitting the real message. This will be discussed more fully later.

PREDICTION

Norbert Wiener has developed a very general method for finding the linear predictor for a given ensemble of messages which minimizes the root mean square error of prediction. His method was developed for the difficult case of nonband-limited messages, i.e., continuous functions of time which cannot be reduced to time series. However, he has also solved the much simpler problem of the prediction of time series, such as the messages which were defined above. The details of this work are thoroughly covered in the literature,12 and this section will merely define some terms, note some results, and discuss the prediction problem from a point of view which is weighted towards probability considerations and not towards Fourier transform considerations.

From a time series, a linear prediction p_i of the value of a message term m_i is a linear combination of the previous message values

    p_i = Σ_{j=1}^{∞} a_j m_{i-j}.

The error e_i is defined as

    e_i = p_i - m_i.

The predictor itself may be considered to be the set of coefficients a_j. The best linear predictor, in the rms sense, is the set of coefficients which, on the average, minimizes e_i^2. Wiener has shown that this predictor is determined, not by the message ensemble directly, but by the autocorrelation function of the ensemble. In general, there will be many ensembles with the same autocorrelation function, and the same linear predictor will be the best in the rms sense for all of them.

The autocorrelation function for a time series is defined by

    c_k = lim_{N→∞} (1/2N) Σ_{i=-N}^{N} m_i m_{i-k}.

Devices for rapidly obtaining approximate autocorrelation functions have been constructed.13 These devices accept the message directly as an input, and graph or tabulate the function. By the use of such devices, or by a statistical examination of the message, or in some cases by an a priori knowledge of the message-generating process, it is possible to determine the autocorrelation function. The best linear predictor in the rms sense may then be determined. But it should be noted that there may be nonlinear predictors which are very much better.

10 Shannon and Weaver, op. cit., p. 15; also, M. Frechet, cited there, and P. Levy, "Processus Stochastiques et Mouvement Brownien," Gauthier-Villars, 1948, which give further references.
11 Shannon and Weaver, op. cit., pp. 9-15.
12 Wiener, op. cit. Also H. W. Bode and C. E. Shannon, "A simplified derivation of linear least square smoothing and prediction theory," Proc. IRE, vol. 38, pp. 417-425; April, 1950.
13 T. P. Cheatham, Jr., Tech. Rep. No. 122, Res. Lab. Elect., M.I.T. (to be published). See also, Y. W. Lee, T. P. Cheatham, Jr., and J. B. Wiesner, "The Application of Correlation Functions in the Detection of Small Signals in Noise," Tech. Rep. No. 141, Res. Lab. Elect., M.I.T.; 1949.
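A finite-order version of this procedure can be sketched numerically: estimate c_k by a time average over one long message (using the ergodic property), then solve the resulting normal equations for the coefficients a_j. This truncates the infinite-past linear predictor to n terms and is an illustration only, not the paper's or Wiener's own derivation.

```python
# Estimate the autocorrelation c_k from a single long message and solve the
# normal equations for an order-n linear predictor. For the first-order
# process used here the best linear predictor is known to be a_1 = 0.9,
# a_2 = a_3 = 0, which the numerical answer approaches.

import numpy as np

rng = np.random.default_rng(0)
a_true = 0.9
m = np.zeros(200_000)
for i in range(1, m.size):
    m[i] = a_true * m[i - 1] + rng.normal()   # a simple message-generating process

def autocorr(x, k):
    """Time-average estimate of c_k, standing in for the ensemble average."""
    return np.mean(x * x) if k == 0 else np.mean(x[k:] * x[:-k])

n = 3
c = np.array([autocorr(m, k) for k in range(n + 1)])
C = np.array([[c[abs(j - k)] for k in range(n)] for j in range(n)])  # Toeplitz system
a = np.linalg.solve(C, c[1:])                 # normal equations: C a = (c_1, ..., c_n)

print(a)                                      # approximately (0.9, 0, 0)
```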


Indeed, given a complete knowledge of the stochastic definition of the message, i.e., a complete knowledge of the conditional probability distributions M(m_i : m_{i-1} ... m_{i-j} ...) or M_k(m_{i-1} ... m_{i-j} ...), the best rms predictor, with no restriction as to linearity, is directly available. Obviously the best rms predictor for a message term m_i defined in this way is the mean of the distribution from which it is chosen, which is determined by the past message history: i.e., the best rms predictor, p*, is

    p* = Σ_{k=-∞}^{∞} k M_k(m_{i-1} ... m_{i-j} ...)

or

    p* = ∫_{-∞}^{∞} m_i M(m_i : m_{i-1} ... m_{i-j} ...) dm_i

in the discrete and continuous cases respectively. For the mean of a distribution is that point about which its second moment is a minimum. Of course, the mean need not be a linear function of the past message values. However, it is some determinate function of these values unless the message values are completely uncorrelated, i.e., unless the Markoff process is of order zero. In this case, it is just the constant which is the mean of the zero-order distribution. We therefore have as the unconditionally best rms predictor the function p*(m_{i-1} ... m_{i-j} ...).

From this same general statistical viewpoint the best predictor on a mean-absolute error basis is the prediction of the median of the conditional distribution, since the median is that point about which the first absolute moment is a minimum. Like the mean, the median is defined by the conditional distribution M as a function of the past history of the message. This definition may not be unique: if there is a region of zero probability density between the two halves of a probability distribution, any point in the region is a median. However, the definition may be made unique by selecting a point within this range, for those sets of past message values for which the ambiguity arises. We will denote the best predictor in the mean-absolute sense by p**, it being understood that the definition has been made unique in some suitable way if the ensemble is such as to require this.

Finally, it may be desired to predict in such a way that in the discrete case, the probability of no error is a maximum, and in the continuous case the probability density has the maximum possible value at zero error. This requires modal prediction. The mode of the conditional distribution will not be unique if there are several equal probabilities which are each larger than any other probability in the discrete case, or if the continuous distribution attains its maximum value at more than one point. The difficulty may again be removed by a suitable choice, and p*** will signify the best modal predictor.

In any of these cases, and indeed for any other prediction criterion which yields a determinate value of the prediction as a function of the past history of the message, the error term e_i is drawn from a distribution E(e_i : m_{i-1} ... m_{i-j} ...) or E_k(m_{i-1} ... m_{i-j} ...) which is of exactly the same form as the original distribution of the message term, but which has been shifted along the axis by the amount of the prediction. If it is desired to limit predictions to the possible quantized values of a discrete probability distribution, it is only necessary to make p** and p*** unique in a way which does this in the cases of ambiguity; where the median and mode are uniquely defined, they will always coincide with one of the possible values of the message. For rms prediction it is necessary to take the quantized value that is nearest to the computed mean of the distribution as the value of p*.

As an example of a predictable function, consider

    M(m_i : m_{i-1}) = (1/σ√(2π)) exp[-(m_i - a m_{i-1})^2 / 2σ^2].    (4)

The unconditional distribution of m_i may be found by using the reproductive property of the normal distribution. M(m_i) will be normal, with a standard deviation σ', and a m_{i-1} will have a normal distribution with standard deviation aσ'; then,

    σ^2 + a^2 σ'^2 = σ'^2,    σ' = σ / √(1 - a^2),    (5)

and

    M(m_i) = (1/σ'√(2π)) exp[-m_i^2 / 2σ'^2].    (6)

The zero-order approximation to this first-order Markoff process has, then, a message term distribution of the same form as the original conditional distribution, but a standard deviation which is larger by a factor 1/√(1 - a^2). By our definition in a previous section the process will generate messages only if a < 1; otherwise the standard deviation will be infinite, and the message will require infinite power for transmission. A more general example in complete analogy to (4) is:

    M(m_i : m_{i-1} ... m_{i-j} ...) = (1/σ√(2π)) exp[-(m_i - Σ_j a_j m_{i-j})^2 / 2σ^2].    (7)

Wiener's prediction procedure is designed for functions of the form (7), in which each term of the time series is drawn from a normal distribution with constant σ, with a mean which is a linear combination of past values, the permissible combinations being limited by the requirement that the resultant average distribution have a finite second moment. The linear combination of past values which is the mean of the conditional distribution is also the best linear rms predictor, and is indeed the best rms predictor p*, as noted above. Wiener's method is then a procedure for finding this linear combination in terms of the autocorrelation function of the message.

The combination of past terms in the exponent may be rewritten as a sum of differences, less a constant times the message value m_i. The stochastic function determined by the conditional distribution will then be an approximation to the solution of the difference equation obtained by setting the exponent in (7) equal to zero. In the limit σ → 0, the stochastic function will become precisely the function which is a solution to this equation, as determined by the set of past message values (initial conditions); as σ grows, the function will wander about in the neighborhood of this solution, diverging from it more and more as i increases. In (4) above, the equation obtained is just m_i - a m_{i-1} = 0, and the solution, m_i = a m_{i-1}, gives a geometric approach to the origin.


In the case of continuous functions of time, taking appropriate limits gives a normal distribution about a linear function of the past which may include integral or differential operators on the past. The bulk of Wiener's analysis is devoted to this case. Although the method was designed with functions like (7) in mind, it is clearly not limited to them. In the case of time series it is possible to use a distribution which is not normal, with a standard deviation (or other parameter or parameters) which is not constant, but is also determined by the past values of the message. So long as the mean of the distribution is still a linear combination of past values, the predictor derived from the autocorrelation function will still give the best rms predictor. If the mean is a nonlinear function of the past values, the predictor obtained from the autocorrelation function will be the best linear approximation to this nonlinear function in the rms sense.

Where the best predictor is indeed linear, or is well approximated by a linear combination of past values, the great practical superiority of Wiener's method over the use of the conditional distribution should be clear. For in this method only the autocorrelation function, a function of a single variable, needs to be measured; the predictor can then be computed no matter what the order of the Markoff process may be. Using the conditional probability distribution directly, an nth order Markoff process will require the observational determination of a function of n + 1 variables. This becomes a task of fantastic proportions when n is as large as four or five: it is practical for small n only for a quantized system with very few possible quantized levels.

The direct use of the conditional distribution may, however, be quite valuable if the best rms predictor is a highly nonlinear function of only a few past values, particularly in a quantized system. Nonlinearity is no more difficult to treat than is the linear case as far as analysis by this method is concerned. For the synthesis problem, the lack of suitable nonlinear elements for the physical construction of nonlinear operators on the past is confined to the case of continuous functions of time; in the case of time series with quantized terms, digital computer techniques can provide any desired nonlinear function of any number of variables, at, of course, an expense in equipment which may become very large for large n.

When the conditional distribution always has a point of symmetry, we may note that the best rms predictor p* is equal to the best mean absolute predictor p**. If the distribution is also always unimodal, then the best modal predictor p*** will also be the same as p*. In particular, this will be the case for the examples (4) and (7), but it does not, of course, depend on the linearity of the predictor.

ENTROPY, AVERAGING, AND IDEAL CODING

The entropy H of a probability distribution M has been defined as14

    H = -Σ_{k=-∞}^{∞} M_k log M_k

and

    H = -∫_{-∞}^{∞} M(m_i) log M(m_i) dm_i    (8)

in the discrete and continuous cases, respectively. The entropy of a probability distribution may be used as a measure of the information content of a symbol or message value m_i chosen from this distribution. The choice of the logarithmic base corresponds to the choice of a unit of entropy: when logarithms are taken to the base two, as is convenient in many discrete cases, the unit of entropy is the "bit," a contraction for binary digit, since in a two-symbol system with the two symbols equiprobable, the entropy per symbol is one bit for this choice of base. In the continuous case computations are often made simpler by the use of natural logarithms. The resultant unit of entropy is called by Shannon the natural unit. We have one natural unit = log_2 e bits.

Wiener, Shannon, and Fano14 give a number of reasons for the use of this function as a measure of information per symbol, and the arguments are plausible and satisfying, but as Shannon remarks, the ultimate justification of the definition is in the implications and applications of entropy as a measure of information.15 For the analysis of communications systems, the definition is completely justified by theorems which prove that it is possible to code any message with entropy H bits per symbol in a binary code which uses an average of H + ε binary digits per message symbol, where ε is a positive quantity which may be made as small as desired, and by equivalent theorems in the case of the discrete channel with noise (i.e., where there is a finite probability that a symbol transmitted at one quantized level will be received at a different level), and in the case of the continuous channel with noise (in which the message term is chosen from a continuous distribution, and is received mixed with noise, so that each received term is the sum of a signal term and a noise term, and reception is always approximate).

14 This is the definition given by Shannon (Shannon and Weaver, op. cit.) and Fano (R. M. Fano, "The Transmission of Information," Tech. Rep. No. 65, Res. Lab. Elec., M.I.T.; 1949). Wiener ("Cybernetics," op. cit., p. 76) gives a definition with the opposite sign. There is no real conflict here, however, for Wiener is talking about a different measure. Wiener asks how much information we are given about a message term, whose exact value will never be known, when we are given the probability distribution from which it is chosen. The answer is that we know a good deal when the distribution is narrow, and very little when the distribution is broad. Correspondingly, entropy as Wiener defines it has a large positive value for very narrow distributions and a large negative value for very broad distributions. This measure is useful in determining how much information has been transmitted when a message term which is contaminated by noise with a known distribution is received; we can use Bayes' theorem and find the probability distribution of the original message, and measure information transmitted by measuring the entropy of this distribution. Shannon, on the other hand, asks how much information is transmitted by the precise transmission of a message symbol, when we know a priori the probability distribution from which it was selected. In this case, if the distribution is very narrow, the message term tells us very little when it arrives; we knew what it would be before we received it. If the distribution is broad, however, then the arrival of the term tells us a good deal. This requires the use of the opposite sign for entropy. Shannon's definition will be used throughout this paper; it is the more appropriate one for the kind of problem with which we are concerned.
15 Shannon and Weaver, op. cit., p. 19.
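For a small discrete distribution, (8) and the unit conversion noted above are one line of arithmetic apiece; the distribution below is an arbitrary example.

```python
# Entropy (8) of a discrete distribution, in bits (base-2 logarithms) and in
# natural units; one natural unit = log2(e) bits.

import math

def entropy_bits(p):
    """H = -sum_k p_k log2 p_k, skipping zero-probability terms."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

p = [0.5, 0.25, 0.125, 0.125]
H = entropy_bits(p)
print(H)                          # 1.75 bits per symbol
print(H / math.log2(math.e))      # the same entropy expressed in natural units
```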


For messages as defined in a previous section, we have, in general, that the entropy of the distribution from which any single message term is chosen is a function of the message history: in both the continuous and discrete cases we are concerned with conditional distributions, whose form depends on the set of values of the terms m_{i-1} ... m_{i-j} ... which precede the message term m_i whose entropy is defined in (8). For such cases, i.e., Markoff processes of order one or greater, the entropy is defined in terms of the probability, not of each message term, but of a sequence of N message terms, the limit being taken as N approaches infinity. Following Shannon,16 we define G_N in the discrete and the continuous cases as

    G_N = -(1/N) Σ_{m_i=-∞}^{∞} ... Σ_{m_{i-N}=-∞}^{∞} M(m_i, ..., m_{i-N}) log M(m_i, ..., m_{i-N})

and

    G_N = -(1/N) ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} M(m_i, ..., m_{i-N}) log M(m_i, ..., m_{i-N}) dm_i ... dm_{i-N}.    (9)

Then the entropy per symbol of the process is defined as

    H = lim_{N→∞} G_N.    (10)

The distribution M(m_i, ..., m_{i-N}) in (9) is not a conditional but a joint distribution: the distribution which determines the probability of getting a given set of values for the N + 1 message terms m_{i-N} to m_i. Now the joint probability distribution of order N + 1 is related to the conditional probability distribution and the joint distribution of order N by

    M(m_i, ..., m_{i-N}) = M(m_i : m_{i-1} ... m_{i-N}) M(m_{i-1}, ..., m_{i-N}).    (11)

Using the relation (11) in the expression (9), for a message-generating process which is a Markoff process of finite order k, and taking the limit (10), we have

    H = -∫_{-∞}^{∞} ... ∫_{-∞}^{∞} M(m_{i-1}, ..., m_{i-k}) [ ∫_{-∞}^{∞} M(m_i : m_{i-1} ... m_{i-k}) log M(m_i : m_{i-1} ... m_{i-k}) dm_i ] dm_{i-1} ... dm_{i-k},    (12)

with a similar relation for the discrete case, in which the integrals are replaced by sums. In words, what (12) states is that the entropy for the process as a whole is just the average over all past histories of the entropy of the conditional distribution of order k which defines the process: the information content per symbol of a message generated by such a stochastic process is the average of the entropies of the distributions from which the successive message terms are chosen.

It was noted that only a finite order Markoff process can, in general, be used as a model of a message source, and that, in general, the use of such an approximation is inefficient. We may now state this more exactly. If a kth order Markoff process is approximated by a process of order less than k, then the entropy of the approximating process will be greater than or equal to the entropy of the original process, with the equality holding only if the original process is actually of order less than k: i.e., only if the kth order conditional distribution can be expressed in terms of conditional distributions of lower order. The result holds also for suitably convergent processes of infinite order. It is a direct consequence of the following more general theorem.

Averaging Theorem I

Let P(x: y) be a probability density distribution of x, for each value of the parameter y: i.e., for all y,

    ∫_{-∞}^{∞} P(x: y) dx = 1,

and

    P(x: y) ≥ 0

for all x and y. Let Q(y) be a probability density distribution of y:

    ∫_{-∞}^{∞} Q(y) dy = 1,    Q(y) ≥ 0.

Let R(x) be the distribution P(x: y) averaged over the parameter y, and let H' be its entropy:

    R(x) = ∫_{-∞}^{∞} Q(y) P(x: y) dy,

    H' = -∫_{-∞}^{∞} R(x) log R(x) dx.    (13)

Let H(y) be the entropy of the distribution P(x: y) as a function of the parameter y, and let H be its average value:

    H(y) = -∫_{-∞}^{∞} P(x: y) log P(x: y) dx,

    H = ∫_{-∞}^{∞} Q(y) H(y) dy.

Then we always have H' ≥ H, and the equality holds only when the y dependence of P(x: y) is fictitious. In words, the entropy of the average distribution is always greater than the average of the entropy of the distribution.

16 Shannon and Weaver, op. cit., p. 25.
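A two-point discrete case makes Averaging Theorem I concrete; the conditional distributions below are arbitrary illustrative choices, not taken from the paper.

```python
# Discrete check of Averaging Theorem I: the entropy H' of the averaged
# distribution R = sum_y Q(y) P(.|y) is at least the average H of the
# entropies of the individual distributions P(.|y).

import math

def entropy(p):
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

P = {0: [0.9, 0.1, 0.0], 1: [0.0, 0.2, 0.8]}   # P(x | y) for the two parameter values
Q = {0: 0.5, 1: 0.5}                           # distribution of the parameter y

R = [sum(Q[y] * P[y][x] for y in Q) for x in range(3)]   # averaged distribution
H_prime = entropy(R)                            # entropy of the average
H_avg = sum(Q[y] * entropy(P[y]) for y in Q)    # average of the entropies

print(H_prime, H_avg)                           # about 1.46 versus about 0.60
assert H_prime >= H_avg
```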


The proof is given in the appendix.17 The theorem remains true for discrete distributions, and the statement is unchanged except for the uniform substitution of the summation indexes i and j for the continuous variables x and y and the replacement of integrations by sums. By successive application of the proof it is also obvious that the result holds for a distribution which is a function of n parameters y_1 to y_n. The application to Markoff processes is direct, for a conditional distribution of order k - 1 may be expressed as an integral of the form R(x) in (13), where P(x: y) is the conditional distribution of order k and y is the term m_{i-k}.

The theorem is also applicable to cases in which the dependence of the distribution on past history is not explicit. If the dependence of the distribution M(m_i : m_{i-1} ... m_{i-j} ...) on the set of past message values is through a dependence on one or several parameters (e.g., the mean and the standard deviation of a distribution are functions of the set of past message values but the distribution is always normal), the conclusion still holds: the entropy of the average distribution, averaged over the distribution of the parameters, is always greater than the average over the parameters of the entropy. This is illustrated by the example of (4). The average message term distribution of the process is a normal distribution with a standard deviation σ/√(1 - a^2), with an entropy which may easily be computed18 as

    H' = log σ√(2πe) + log (1/√(1 - a^2)),

but each message term has a normal distribution with standard deviation σ, with entropy just

    H = log σ√(2πe),

which is thus the average entropy of the process as a whole. The difference between these two entropies may be made as large as we like by letting a approach one.

A second averaging theorem which will be useful later deals with averages over a single distribution.

Averaging Theorem II

Let P(x) be a probability distribution with entropy H:

    ∫_{-∞}^{∞} P(x) dx = 1,    P(x) ≥ 0 for all x,

    H = -∫_{-∞}^{∞} P(x) log P(x) dx.

Let Q(x, y) be a weighting function:

    ∫_{-∞}^{∞} Q(x, y) dx = ∫_{-∞}^{∞} Q(x, y) dy = 1,

    Q(x, y) ≥ 0 for all x and y.

Let R(x) be the averaged distribution with entropy H':

    R(x) = ∫_{-∞}^{∞} P(y) Q(x, y) dy,

    H' = -∫_{-∞}^{∞} R(x) log R(x) dx.

Then we always have H' ≥ H, and the equality holds only when the weighting function is a Dirac delta function. This theorem is given by Shannon.19 It is also true in the discrete case: the equality then holds only if the average distribution R(x), or R_i in the discrete case, is a mere permutation of the distribution P(x), or P_i.

At the beginning of this section it was stated that it is possible to code a message with entropy H bits per symbol by a coding method which uses H + ε binary output symbols per input symbol, on an average. Such a coding scheme will be called an ideal code. Shannon has given two such coding procedures, and Fano has given one which is quite similar to one of Shannon's.20 We will call coding by means of Shannon's second procedure, or by means of Fano's method, Shannon-Fano coding. Both are procedures for giving short codes to common messages and long codes to rare messages. They are given in the references. We will here only note the important result. Coding a group of N message terms at once, the average number H_N of output binary symbols per input message symbol is bounded:

    G_N ≤ H_N ≤ G_N + 1/N.    (14)

Here G_N is the quantity defined in (9). As N increases, G_N approaches H, the true entropy of the process, so H_N also approaches H. For a discrete process, an efficient code may be defined as one for which the ratio H/H_N is near one. It is clear that there are two reasons why a Shannon-Fano code for small N may be inefficient: first, if G_N is small, the ratio G_N/H_N may be small, if H_N is near its upper bound in (14). Second, for small N, G_N may be a poor approximation to H.

It should be noted that it is not reasonable to define an efficiency measure for continuous distributions as a ratio of entropies. For a process which is ultimately to be quantized, the entropy of a continuous distribution does not approximate the entropy of the discrete distribution which is obtained by quantization, unless the scale of the variable in the continuous distribution is so chosen as to make the interval between quantized levels unity. Using a different choice of scale adds a constant to the entropy of the distribution, so that the ratio which defines efficiency is changed. For this reason, until a quantizing level spacing is chosen, it is possible to speak only of the differences between the entropies of continuous distributions, and not of their ratios.

17 The content of this theorem is implied by the derivations leading up to Shannon's fundamental theorem. Shannon and Weaver, op. cit., p. 28. However, the theorem can be stated and proved as a property of entropy as a functional of a probability distribution, with no reference to sequences of message terms, and the proof is so straightforward and simple that the theorem deserves an independent statement.
18 Shannon and Weaver, op. cit., p. 56.
19 Shannon and Weaver, op. cit., p. 21, property 4 for the discrete case; p. 55, property 3 for the continuous case.
20 Shannon and Weaver, op. cit., p. 29; Fano, op. cit. Shannon's procedure is simpler to handle mathematically; Fano's is perhaps somewhat simpler to grasp. Fano's method is not quite completely determinate. In cases in which the two methods do not agree, Fano's provides a more efficient code than Shannon's.
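The bookkeeping behind (14) for the case N = 1 can be checked in a few lines. Shannon's procedure assigns a symbol of probability p_k a codeword of ⌈-log2 p_k⌉ binary digits; the Kraft inequality then guarantees that a prefix code with these lengths exists. The sketch below verifies the bound for an arbitrary distribution; it does not construct the actual codewords.

```python
# Check that Shannon codeword lengths ceil(-log2 p_k) satisfy the Kraft
# inequality and give an average length between the entropy G_1 and G_1 + 1,
# the N = 1 case of the bound (14).

import math

p = [0.4, 0.3, 0.2, 0.1]                               # per-symbol probabilities
lengths = [math.ceil(-math.log2(pk)) for pk in p]      # Shannon codeword lengths

entropy = -sum(pk * math.log2(pk) for pk in p)         # G_1 for independent terms
avg_len = sum(pk * lk for pk, lk in zip(p, lengths))   # H_1, the average code length

assert sum(2.0 ** -lk for lk in lengths) <= 1.0        # Kraft inequality: a prefix code exists
print(entropy, avg_len)                                # about 1.85 <= 2.4 < 2.85
```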


APPENDIX

Proof of Averaging Theorem I

Expanding H' by the definitions given, we have

    H' = -∫_{-∞}^{∞} dx { ∫_{-∞}^{∞} dy Q(y) P(x: y) log ∫_{-∞}^{∞} dz Q(z) P(x: z) }.

Adding and subtracting a term gives

    H' = -∫_{-∞}^{∞} dx ∫_{-∞}^{∞} dy Q(y) P(x: y) log P(x: y)
         - ∫_{-∞}^{∞} dx ∫_{-∞}^{∞} dy Q(y) P(x: y) log [ ∫_{-∞}^{∞} dz Q(z) P(x: z) / P(x: y) ].

The quotient in the last integral cannot cause trouble, since the integrand as a whole approaches zero with P(x: y). Interchanging the order of integration in the first integral and using the definition of H gives

    H' = H - ∫_{-∞}^{∞} dx ∫_{-∞}^{∞} dy Q(y) P(x: y) log [ ∫_{-∞}^{∞} dz Q(z) P(x: z) / P(x: y) ].    (14)

Changing the logarithmic base will multiply both sides of (14) by the same constant, so we are free to use natural logarithms and measure entropy in natural units. Using the inequality log u ≤ u - 1 in the integral in (14) gives

    H' ≥ H - ∫_{-∞}^{∞} dx ∫_{-∞}^{∞} dy Q(y) P(x: y) [ ∫_{-∞}^{∞} dz Q(z) P(x: z) / P(x: y) - 1 ].

Integrating first with respect to z, we have by the normalization requirements on P(x: y) and Q(y) that

    H' ≥ H + 1 - 1 = H.

The equality can be realized only when log u = u - 1, i.e., when u = 1, or in this case when

    ∫_{-∞}^{∞} P(x: z) Q(z) dz = P(x: y).    (15)

For this to hold, P(x: y) must have no dependence on the variable y, since y does not appear on the left of (15). Q.E.D.

In the discrete case, the precise same proof holds when summations are uniformly substituted for integrations.

Predictive Coding-Part II

Summary - In Part I predictive coding was defined and messages, prediction, entropy, and ideal coding were discussed. In the present paper the criterion to be used for predictors for the purpose of predictive coding is defined: that predictor is optimum in the information theory (IT) sense which minimizes the entropy of the average error-term distribution. Ordered averages of distributions are defined and it is shown that if a predictor gives an ordered average error-term distribution it will be a best IT predictor. Special classes of messages are considered for which a best IT predictor can easily be found, and some examples are given.

The error terms which are transmitted in predictive coding are treated as if they were statistically independent. If this is indeed the case, or a good approximation, then it is still necessary to show that sequences of message terms which are statistically independent may always be coded efficiently, without impractically large memory requirements, in order to show that predictive coding may be practical and efficient in such cases. This is done in the final section of this paper.

DEFINITION OF INFORMATION-THEORY CRITERION FOR PREDICTORS

We have now a sufficient vocabulary and collection of results to define and discuss a criterion of prediction that is appropriate for the kind of communications scheme outlined in the Introduction. An obvious definition is: that predictor is best, in the sense of information theory, which requires the minimum channel space for the transmission of its error term. But this specification is not yet sufficient. It is necessary to define to some extent the way in which the error term is to be coded, in order to define a predictor uniquely for a given message-generating process. One procedure is to use Shannon-Fano coding for the transmission of the error term. This means that the predictor p(m_{i-1} ... m_{i-j} ...) should be chosen to minimize the ensemble average of the entropy of the error distribution. The average of

    -∫_{-∞}^{∞} E(m_i : m_{i-1} ... m_{i-j} ...) log E(m_i : m_{i-1} ... m_{i-j} ...) dm_i    (16)

or of

    -Σ_{k=-∞}^{∞} E_k(m_{i-1} ... m_{i-j} ...) log E_k(m_{i-1} ... m_{i-j} ...)

is to be minimized, the averaging being done over the
