The Algebra of
Probable Inference
by Richard T. Cox
PROFESSOR OF PHYSICS
THE JOHNS HOPKINS UNIVERSITY
BALTIMORE:
The Johns Hopkins Press
© 1961 by The Johns Hopkins Press, Baltimore 18, Md.
Distributed in Great Britain by Oxford University Press, London
Printed in the United States of America by Horn-Shafer Co., Baltimore
Library of Congress Catalog Card Number: 61-8039
to my wife Shelby
Preface
Contents
Preface vii
I. Probability 1
II. Entropy 35
III. Expectation 69
Notes 99
Index 109
I
Probability
~~a = a, (2.1)
a.a = a, (2.2 I)    a V a = a, (2.2 II)
a.b = b.a, (2.3 I)    a V b = b V a, (2.3 II)
a.(b.c) = (a.b).c, (2.4 I)    a V (b V c) = (a V b) V c, (2.4 II)
a.(b V c) = (a.b) V (a.c), (2.5 I)    a V (b.c) = (a V b).(a V c), (2.5 II)
~(a.b) = ~a V ~b, (2.6 I)    ~(a V b) = ~a.~b, (2.6 II)
(a.b) V b = b, (2.7 I)    (a V b).b = b, (2.7 II)
i.j | h = F[(i | h), (j | i.h)]. (3.1)
Since the probabilities are all numbers, F is a numerical function
of two numerical variables.
The form of the function F is in part arbitrary, but it can not
be entirely so, because the equation must be consistent with
Boolean algebra. Let us see what restriction is placed on the
form of F by the Boolean equation
b.(c.d) = (b.c).d.
If we let
h = a, i = b, j = c.d,
we find
b.c.d | a = F[(b | a), (c.d | a.b)],
and, denoting b | a by x, c | a.b by y and d | a.b.c by z, a second
use of Eq. (3.1) on c.d | a.b gives
b.c.d | a = F[x, F(y, z)]. (3.2)
If instead we let
h = a, i = b.c, j = d,
we find
b.c.d | a = F[(b.c | a), z],
and, if we now let
h = a, i = b, j = c,
we have
b.c | a = F(x, y),
so that
b.c.d | a = F[F(x, y), z].
Equating this expression for b.c.d | a with that given by Eq.
(3.2), we have
F[x, F(y, z)] = F[F(x, y), z] (3.3)
as a functional equation to be satisfied by the function F.10
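Equation (3.3) may be checked numerically; the following sketch (the code is ours, not the book's) confirms that the product, the form to which the argument eventually leads, satisfies the associativity equation.

```python
import math
import random

# Numerical sketch (ours, not the book's): the product F(x, y) = x*y,
# the form to which the argument leads, satisfies the associativity
# equation F[x, F(y, z)] = F[F(x, y), z] of Eq. (3.3).

def F(x, y):
    return x * y

random.seed(1)
for _ in range(1000):
    x, y, z = random.random(), random.random(), random.random()
    assert math.isclose(F(x, F(y, z)), F(F(x, y), z))
```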
Let F be assumed differentiable and let ∂F(u, v)/∂u be denoted
by F1(u, v) and ∂F(u, v)/∂v by F2(u, v). Then, by differentiating
this equation with respect to x and y, we obtain the two equations,
By Boolean algebra, (a V ~a).j = j and (a V ~a).h = h. Hence
(a V ~a).j | h = [(a V ~a) | h][j | (a V ~a).h],
of which the left-hand member is equal simply to j | h and the
right-hand member to (a V ~a | h)(j | h). It follows that
a V ~a | h = 1.
The truism, as we should suppose, is thus certain on every
hypothesis.
It is to be understood that the absurdity, a.~a, is excluded as
an hypothesis but, at the same time, it should be stressed that not
every false hypothesis is thus excluded. A proposition is false if
it contradicts a fact but absurd only if it contradicts itself. It is
permissible logically and often worth while to consider the prob-
ability of an inference on an hypothesis which is contrary to fact
in one respect or another.
An hypothesis h, on which an inference i is certain, is said to
imply the inference. Every hypothesis, for example, thus implies
the truism. There are some discourses in which a proposition
If i is irrelevant to j on the hypothesis h, so that j | h.i = j | h, then i.j | h = (i | h)(j | h.i) = (i | h)(j | h), and also i.j | h = (j | h)(i | h.j), whence i | h.j = i | h.
The equation just obtained shows that also j is then irrelevant to i
on the same hypothesis. The relation is therefore one of mutual
irrelevance between the propositions i and j on the hypothesis h,
and it is conveniently defined by the condition,
i.j | h = (i | h)(j | h). (3.9)
If h alone implies j, so also does h.i. Then j | h and j | h.i
are both unity and therefore equal, and i and j are mutually
irrelevant. Thus every proposition implied by a given hypothe-
sis is irrelevant on that hypothesis to every other proposition.
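The condition of mutual irrelevance can be made concrete with a small example (the die and the propositions are ours, not the book's).

```python
from fractions import Fraction

# Illustrative check (the example is ours): on the hypothesis h that a
# true die shows one of 1..6, let i be "the throw is even" and j be
# "the throw is 4 or less".  Condition (3.9) of mutual irrelevance holds,
# and the relation is symmetric between i and j.

outcomes = set(range(1, 7))
i = {2, 4, 6}        # the throw is even
j = {1, 2, 3, 4}     # the throw is 4 or less

def prob(event):
    return Fraction(len(event), len(outcomes))

assert prob(i & j) == prob(i) * prob(j)   # i.j | h = (i | h)(j | h), Eq. (3.9)
assert prob(i & j) / prob(i) == prob(j)   # j | h.i = j | h
assert prob(i & j) / prob(j) == prob(i)   # i | h.j = i | h
```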
~i.j | h = f(i | h) f[f(i V j | h)/f(i | h)].
By Eqs. (2.3 I), (3.7) and (4.1),
~i.j | h = j.~i | h = (j | h) f(i | j.h) = (j | h) f[(i.j | h)/(j | h)].
Equating these two expressions for ~i.j | h, we have
(j | h) f[(i.j | h)/(j | h)] = f(i | h) f[f(i V j | h)/f(i | h)]. (4.3)
This equation holds for arbitrary meanings of i and j. Let
i = a.b, j = a V b,
so that
i.j = (a.b).(a V b) = [(a.b).a] V [(a.b).b] by Eq. (2.5 I)
= a.b = i by Eqs. (2.3 I) and (2.7 I),
and, by a similar argument resting on Eqs. (2.5 II), (2.3 I) and
(2.7 II),
i V j = j.
Thus Eq. (4.3) becomes
(j | h) f[(i | h)/(j | h)] = f(i | h) f[f(j | h)/f(i | h)].
This equation is given in a more concise and symmetrical form if
we denote i | h by f(y), so that f(i | h) = y, and j | h by z. In
this way we obtain the equation,
z f[f(y)/z] = y f[f(z)/y]. (4.4)
For brevity, let Eq. (4.4) be written
z f(u) = y f(v),
where u denotes f(y)/z, v denotes f(z)/y, f' the first derivative of f
and f'' the second derivative. Differentiating with respect to y
and with respect to z gives
f'(u) f'(y) = f(v) - v f'(v)
and
f(u) - u f'(u) = f'(v) f'(z),
and differentiating either of these with respect to the remaining
variable gives
(u/z) f''(u) f'(y) = (v/y) f''(v) f'(z).
Multiplying together the corresponding members of the first
and last of these equations, we eliminate y and z at the same time,
obtaining
u f''(u) f(u) f'(y) = v f''(v) f(v) f'(z).
With this equation and the second and third of the preceding
group, it is possible to eliminate f'(y) and f'(z). The resulting
equation is
u f''(u) f(u) / {[u f'(u) - f(u)] f'(u)} = v f''(v) f(v) / {[v f'(v) - f(v)] f'(v)}.
Each member of this equation is the same function of a different
variable and the two variables, u and v, are mutually independent.
This function of an arbitrary variable x must therefore be equal
to a constant. Calling this constant c, we have
x f''(x) f(x) = c[x f'(x) - f(x)] f'(x).
This equation may be put in the form
df'/f' = c(df/f - dx/x),
whence, by integration, we find that
f' = A(f/x)^c,
where A is a constant. The variables being separable, another
integration gives
f^r = Ax^r + B,
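Fixing the constants so that f(f(x)) = x and f(0) = 1, namely A = -1 and B = 1, gives f(x) = (1 - x^r)^(1/r), and one may verify numerically (the check is ours) that this function satisfies Eq. (4.4).

```python
import math
import random

# Hedged check (ours): with A = -1 and B = 1 in f^r = Ax^r + B, the
# solution is f(x) = (1 - x^r)^(1/r).  For any r > 0 this f satisfies
# Eq. (4.4), z f[f(y)/z] = y f[f(z)/y], wherever both sides are defined
# (that is, where y^r + z^r >= 1).

def make_f(r):
    return lambda x: (1.0 - x ** r) ** (1.0 / r)

random.seed(7)
for r in (0.5, 1.0, 2.0, 3.5):
    f = make_f(r)
    for _ in range(200):
        y = random.uniform(0.9, 1.0)
        z = random.uniform(0.9, 1.0)
        assert math.isclose(z * f(f(y) / z), y * f(f(z) / y), rel_tol=1e-9)
```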
has still a meaning and the value zero, unless j is also impossible
on the hypothesis h alone, and, in any case, i.j | h = 0. If both
i | h and j | h are zero, then both j | h.i and i | h.j are meaningless,
but, a fortiori, i.j | h = 0. It is convenient to comprise all
these cases under a common term and call any two propositions
mutually exclusive on a given hypothesis if their conjunction is
impossible on that hypothesis, whether they are singly so or not.
In this sense, any proposition which is impossible on an hypothe-
sis is mutually exclusive on that hypothesis with every proposi-
tion, including even itself, and the absurdity is mutually exclusive
with every proposition on every possible hypothesis.
It is worth remarking that, if two propositions are mutually
irrelevant on a given hypothesis, then each is irrelevant to the
contradictory of the other and the contradictories of both are
mutually irrelevant. To see this, let i and j be propositions
The two axioms which, in the two chapters preceding this one,
have been found sufficient for the probabilities of the conjunctive
and contradictory inferences, suffice also for the probability of the
disjunctive inference. That only two axioms are required is a
consequence of the fact that, among the three operations: con-
tradiction, conjunction and disjunction, there are only two in-
dependent ones: contradiction and either of the others but not
both. For the Boolean equations, ~(i V j) = ~i.~j and
~~(i V j) = i V j, can be combined to give
i V j = ~(~i.~j).
Thus
i V j | h = (i | h) + (~i.j | h).
a1 V a2 V ... V am | h = Σi (ai | h) - Σi Σj>i (ai.aj | h)
+ Σi Σj>i Σk>j (ai.aj.ak | h) - ... ± (a1.a2. ... .am | h). (5.3)
The limits of the summations in this equation are such that none
of the propositions, a1, a2, ... am, is conjoined with itself in any
inference and also that no two inferences in any summation are
conjunctions of the same propositions in different order. In the
three-fold summation, for example, there is no such term as
a1.a1.a2 | h, and the only conjunction of a1, a2 and a3 is in the
term a1.a2.a3 | h, because the limits exclude probabilities such as
a2.a1.a3 | h, obtained from this one by permuting the proposi-
tions. For the m-fold summation, therefore, there is only one
possible order of the m propositions and the summation is reduced
to a single term. Its sign is positive if m is odd and negative if m
is even.
Of the three probabilities now on the right, both the first and the
third are those of disjunctions of m propositions, for which we
assume, for the sake of the mathematical induction, that Eq.
(5.3) is valid. For the first of these probabilities, Eq. (5.3) gives
an expression which can be substituted without change in Eq.
(5.4). The expression to be substituted for the other is obtained
by replacing a1 in Eq. (5.3) by a1.am+1, a2 by a2.am+1, ... am by
am.am+1. This expression, with the simplification allowed by the
equality of am+1 and am+1.am+1, is given by the equation,
(a1.am+1) V (a2.am+1) V ... V (am.am+1) | h
= Σi (ai.am+1 | h) - Σi Σj>i (ai.aj.am+1 | h) + ...
± (a1.a2. ... .am.am+1 | h).
Substituting these expressions in Eq. (5.4) and collecting terms of
the same order, we find
a1 V a2 V ... V am+1 | h = Σi (ai | h) - Σi Σj>i (ai.aj | h)
+ Σi Σj>i Σk>j (ai.aj.ak | h) - ... ± (a1.a2. ... .am+1 | h),
where the summations now run over all m + 1 propositions.
This is the same as Eq. (5.3), except that the number of proposi-
tions appearing in the inferences, which was m in that equation,
is m + 1 in this one. Therefore Eq. (5.3), being valid when m is
2, as in Eq. (5.2), is now proved for all values of m.
The rather elaborate way in which the limits of summation
were indicated in the preceding equations was needed to avoid
If the propositions, a1, a2, ... am, are all mutually exclusive
on the hypothesis h, so that every conjunction of two or more of
them is impossible, Eq. (5.6) becomes simply
a1 V a2 V ... V am | h = Σi (ai | h). (5.8)
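Both the alternating expansion and its reduction for mutually exclusive propositions can be checked by direct enumeration (the events chosen are ours, not the book's).

```python
from fractions import Fraction
from itertools import combinations

# Enumeration check (the events are ours): Eq. (5.3) expands the
# probability of a disjunction into alternating sums over conjunctions,
# and Eq. (5.8) reduces it to a simple sum when the propositions are
# mutually exclusive.

outcomes = set(range(1, 7))          # the hypothesis h of a true die

def prob(event):
    return Fraction(len(event), len(outcomes))

def union_prob(events):
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 == 1 else -1
        for combo in combinations(events, k):
            total += sign * prob(set.intersection(*combo))
    return total

a1, a2, a3 = {1, 2, 3}, {2, 4}, {3, 4, 5}
assert union_prob([a1, a2, a3]) == prob(a1 | a2 | a3)              # Eq. (5.3)

b1, b2, b3 = {1}, {2, 3}, {6}                                      # exclusive
assert union_prob([b1, b2, b3]) == prob(b1) + prob(b2) + prob(b3)  # Eq. (5.8)
```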
6. A Remark on Measurement
and hence
(a | a V ~a)(b | a) = (b | b V ~b)(a | b).
If then a | a V ~a and b | b V ~b were each equal to ½, it would
follow that b | a = a | b for arbitrary meanings of a and b. This
would be a monstrous conclusion, because b | a and a | b can
have any ratio from zero to infinity. Instead of supposing that
a | a V ~a = ½, we may more reasonably conclude, when the
hypothesis is the truism, that all probabilities are entirely un-
defined except those of the truism itself and its contradictory, the
absurdity. This conclusion agrees with common sense and might
perhaps have been reached without the formal argument, because
the knowledge of a probability, though it is knowledge of a par-
ticular and limited kind, is still knowledge, and it would be sur-
prising if it could be derived from the truism, which is the ex-
pression of complete ignorance, asserting nothing.
Not only must the hypothesis of a probability assert something,
if the probability is to be defined within any limits narrower
than the extremes of certainty and impossibility, but also what it
asserts must have some relevance to the inference. For example,
the probability of the inference, "There will be scattered thunder-
showers tonight in the lower Shenandoah Valley," is entirely
undefined on the hypothesis, "Dingoes are used as half tamed
hunting dogs by the Australian aborigines," although the hypoth-
esis is by no means without meaning and gives a fairly precise
definition and a value near certainty to the inference, "The
Australian aborigines are not vegetarians."
The instances in which probabilities are precisely defined are
thus circumscribed on two sides. On the one hand, the hypoth-
esis must provide some information relevant to the inferences, for
otherwise their probabilities are not defined at all. On the other
hand, this information must contain nothing which favors one of
the inferences more than another, for then the judgment of indif-
ference on which precise definition rests is impossible. The cases
are exceptional in which our actual knowledge provides an
II
Entropy
y dη(xy)/d(xy) = dη(x)/dx, x dη(xy)/d(xy) = dη(y)/dy,
whence we obtain, by eliminating dη(xy)/d(xy),
x dη(x)/dx = y dη(y)/dy.
Since x and y are independent variables, this equation requires
that each of its members be equal to a constant. Calling this
constant k, we have then
dη(w) = (k/w) dw,
whence we find by integration that
η(w) = k ln w + C,
where C is a constant of integration. By substitution in Eq.
(7.1), we find that C = 0. Thus
η(w) = k ln w.
In thermodynamics, k is the well known Boltzmann constant and
has a value determined by the unit of heat and the scale of tem-
perature. In the theory of probability it is convenient to assign
it unit value, so that
η(w) = ln w. (7.2)
Whatever value is assigned to k, when w = 1, η = 0; when there
is only one possible inference, there is no diversity or uncertainty.
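These properties, and the connection with counting yes-or-no questions, can be verified directly (the code and its figures are ours, not the book's).

```python
import math

# Illustration (ours): with k = 1, the entropy of w equally probable
# alternatives is ln w; it vanishes for w = 1; and it is additive over
# independent choices, eta(m*w) = eta(m) + eta(w).  Measured in units of
# ln 2, it counts the yes/no questions needed to find one subject among
# w candidates: halving the field at each question takes ceil(log2 w).

def eta(w):
    return math.log(w)

def questions_needed(w):
    n = 0
    while w > 1:
        w = math.ceil(w / 2)   # the less favorable answer still halves the field
        n += 1
    return n

assert eta(1) == 0.0
assert math.isclose(eta(4 * 13), eta(4) + eta(13))        # suit and rank of a card
assert questions_needed(52) == math.ceil(math.log2(52))   # 6 questions
```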
The special appropriateness of the logarithm rather than some
other function in this expression can be made plainer by consider-
ing the game of twenty questions, in which one player or one side
chooses a subject and the other player or side asks questions to
find out what it is. The rules vary with the age and skill of the
players, but a usual requirement is that all questions must be
answerable by "yes" or "no." The skill of the questioner is
shown by finding the subject with as few questions as possible.
If one player opens the game by saying, "I am thinking of a
famous person," the other, if it is a child just learning to play,
propositions, a1, a2, ... am, are mutually exclusive and form an
exhaustive set. This is the set of inferences for the entropy of
which we now seek an expression.
If the same number of chances were in every block, the propositions,
a1, a2, ... am, would all be equally probable. Their entropy
could be denoted by η(m) and it would be equal to ln m.
case would be formally identical with that considered in the pre-
ceding chapter, the number of societies in the present example
corresponding to the number of suits of cards in the former one
and the number of chances held in each society corresponding to
the number of cards in each suit. The entropy, η(m), would
measure the information required to find in which society the
winning chance is held, and the additional information required
to find the winning chance among those held there would be
measured by an entropy denoted by η(w) and equal to ln w, where
w is the number of chances sold in each block. In this case,
therefore, we should have the equation,
η(m) + η(w) = ln m + ln w = ln (mw) = ln W.
When there are different numbers of chances in the various
blocks and the inferences, a1, a2, ... am, are therefore no longer
equally probable, their entropy is no longer a function of m alone.
Consequently it can not be denoted by η(m) and it is, of course,
not equal to ln m. Let us denote it by η(a1, a2, ... am | h) until,
in a later chapter, we can explain and justify a simpler notation,
and let us seek an expression for it by asking how much additional
information we shall need to find the winning chance, if we sup-
pose that we first obtain the information which this entropy
measures.
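The decomposition being sought can be previewed numerically (the block sizes are invented for illustration): the entropy of the block propositions, plus the expected entropy of the further search within whichever block holds the winning chance, equals ln W.

```python
import math

# Sketch of the decomposition the text is building (block sizes invented):
# W chances are divided into blocks of w1, w2, ... wm chances.  The entropy
# of the propositions a1, ... am, plus the expected entropy of the search
# within the block found to hold the winning chance, equals ln W.

def shannon(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

w = [10, 20, 5, 15]                      # chances held in each block
W = sum(w)
block_probs = [wi / W for wi in w]       # (ai | h) = wi / W

eta_blocks = shannon(block_probs)                         # eta(a1, ... am | h)
expected_within = sum(p * math.log(wi)
                      for p, wi in zip(block_probs, w))   # sum (ai | h) eta(wi)
assert math.isclose(eta_blocks + expected_within, math.log(W))
```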
If we find in the first inquiry that a member of the Board of
Trade holds the winning chance, the required additional informa-
tion will be measured by η(w1), whereas it will be measured by
η(w2) if we find that the winning chance is held in the League of
Women Voters. We can not know, in advance of the first
inquiry, how much additional information will be needed after
9. Systems of Propositions
The propositions which compose the system (A V B). C are
those which belong to either A or B and to C and therefore to both
η(A.B | h) = η(A | h) + η(B | h) - η(A V B | h) (10.5)
and
η(A V B | h) = η(A | h) + η(B | h) - η(A.B | h). (10.6)
By mathematical induction based on Eq. (10.5), it is now fairly
simple to obtain expressions for the entropies of conjunctions and
disjunctions of any number of systems. The proof will be
omitted and only the equations given. Let A1, A2, ... AM be
any systems. Then
η(A1.A2. ... .AM | h) = Σi η(Ai | h) - Σi Σj>i η(Ai V Aj | h)
+ Σi Σj>i Σk>j η(Ai V Aj V Ak | h) - ...
± η(A1 V A2 V ... V AM | h) (10.7)
and
η(A1 V A2 V ... V AM | h) = Σi η(Ai | h)
- Σi Σj>i η(Ai.Aj | h) + Σi Σj>i Σk>j η(Ai.Aj.Ak | h)
- ... ± η(A1.A2. ... .AM | h). (10.8)
whereas
η(A | b1.h) = 0 and η(A | b2.h) = ln 2.
Also
b1 | h = 1/3, b2 | h = 2/3
and
η(A | B.h) = (b1 | h)η(A | b1.h) + (b2 | h)η(A | b2.h) = (2/3) ln 2.
Therefore
η(A V B | h) = η(A | h) - η(A | B.h)
= ln 3 - (4/3) ln 2 = (1/3) ln (27/16) > 0.
The uncertainty about the color of the ball in the man's right
hand is measured in each case by the entropy of A. It is measured
by η(A | h) if the color of the ball in his left hand is unknown,
by η(A | b1.h) if the ball in his left hand is known to be
white, and by η(A | b2.h) if it is known to be black. In the former
case the entropy of A is decreased by the additional information,
whereas in the latter case it is increased. Although the decrease
is only half as probable as the increase, it is more than twice as
great, and it therefore counts for more in the expectation, as is
shown by the fact that η(A | B.h) is less than η(A | h).
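The figures of this example, as read here, can be checked directly (the code is ours, not the book's).

```python
import math

# Check of the example's figures as read here: eta(A | h) =
# ln 3 - (2/3) ln 2 before the left-hand ball's color is known; learning
# b1 (white) lowers the entropy of A to 0, learning b2 (black) raises it
# to ln 2, yet on the average the added information lowers it.

eta_prior = math.log(3) - (2 / 3) * math.log(2)   # eta(A | h)
eta_b1, eta_b2 = 0.0, math.log(2)                 # eta(A | b1.h), eta(A | b2.h)
p_b1, p_b2 = 1 / 3, 2 / 3                         # b1 | h, b2 | h

eta_cond = p_b1 * eta_b1 + p_b2 * eta_b2          # eta(A | B.h) = (2/3) ln 2
assert eta_b2 > eta_prior            # the particular outcome b2 increases it
assert eta_cond < eta_prior          # but the expectation is a decrease
decrease, increase = eta_prior - eta_b1, eta_b2 - eta_prior
assert decrease > 2 * increase       # "more than twice as great"
assert math.isclose(eta_prior - eta_cond, math.log(27 / 16) / 3)
```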
From Eq. (11.1) and the theorem (11.i), there follows directly
another theorem:
If each of two systems is definable by a set of mutually exclusive
propositions, the entropy of their conjunction is equal to
the sum of their entropies if the systems are mutually irrelevant,
and otherwise is less.24 (11.ii)
η(A.B | h) = η(A | h) + η(B | h)
where x and y are any quantities and A, B and C are any con-
stants. More generally, we have the theorem,
The expectation of a linear function of any quantities is equal
to the same linear function of the expectations of the quantities.
(13.i)
When all the expectations involved in a given discussion are
reckoned on the same hypothesis, the symbol for the hypothesis
may, without confusion, be omitted from the symbols for the ex-
pectations. Thus, with the omission of the symbol h, the pre-
ceding equation may be written in the form,
(Ax + By + C) = A(x) + B(y) + C.
The simpler notation will be used henceforth except when refer-
ence to the hypothesis is necessary in order to avoid ambiguity.
For functions which are not linear, there is no theorem corres-
ponding to (13.i). For example, the expectation of the product
of two quantities is not, in general, equal to the product of their
expectations. The expectation of the product xy is given by
(xy) = Σr Σs xr ys (xr.ys | h),
whereas the product of the expectations is given by
(x)(y) = Σr xr (xr | h) Σs ys (ys | h) = Σr Σs xr ys (xr | h)(ys | h).
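Both points, the linearity theorem (13.i) and the failure of the product rule without mutual irrelevance, can be verified with a small example (the values and probabilities are ours, not the book's).

```python
from fractions import Fraction

# Worked check (the quantities are invented): expectation is linear,
# (Ax + By + C) = A(x) + B(y) + C, but (xy) equals (x)(y) only when the
# propositions assigning values to x and y are mutually irrelevant.

xs = {1: Fraction(1, 2), 3: Fraction(1, 2)}   # values xr, probabilities (xr | h)
ys = {0: Fraction(1, 4), 4: Fraction(3, 4)}   # values ys, probabilities (ys | h)

def E(dist):
    return sum(v * p for v, p in dist.items())

Ex, Ey = E(xs), E(ys)

# mutually irrelevant: (xr.ys | h) = (xr | h)(ys | h)
joint = {(vx, vy): px * py for vx, px in xs.items() for vy, py in ys.items()}
Exy = sum(vx * vy * p for (vx, vy), p in joint.items())
assert Exy == Ex * Ey

# linearity holds in any case: (2x + 3y + 1) = 2(x) + 3(y) + 1
Elin = sum((2 * vx + 3 * vy + 1) * p for (vx, vy), p in joint.items())
assert Elin == 2 * Ex + 3 * Ey + 1

# with full relevance (y is 0 whenever x is 1, and 4 whenever x is 3),
# the product rule for expectations fails
dep = {(1, 0): Fraction(1, 2), (3, 4): Fraction(1, 2)}
dep_Ey = sum(vy * p for (_, vy), p in dep.items())
dep_Exy = sum(vx * vy * p for (vx, vy), p in dep.items())
assert dep_Exy != Ex * dep_Ey
```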
hand member is the same as that of Eq. (13.7) and the left-hand
member is
Σr {xr (xr | h)[Σi (bi | xr.h) - Σi Σj>i (bi.bj | xr.h) + ...
± (b1.b2. ... .bn | xr.h)]}.
The bracketed quantity, by which xr (xr | h) is multiplied, is equal
to b1 V b2 V ... V bn | xr.h and hence to 1, because the set,
b1, b2, ... bn, is exhaustive. Thus the whole expression is reduced
to Σr xr (xr | h), which is equal to (x | h), and thus Eq. (13.7) is
proved.
proved.
There is an evident likeness between this equation and Eq.
(10.2), which defines the conditional entropy.
It was remarked in the chapter before this one that the expecta-
tion of the square of the deviation of a quantity measures the
dispersion of its probable values and is small if the quantity is
unlikely to have values appreciably different from its expectation.
We therefore conclude from Eq. (14.8) and the inequality (14.9)
that:
(x) = Σr xr pr.
Among M instances in the ensemble, let the number in which
x has the value xr be denoted by mr. The average value of x in
these M instances is then given by
xav = Σr xr mr / M.
If M is a very large number, mr/M is almost certain to be
very nearly equal to pr. Therefore xav is almost certain to be
very nearly equal to (x).
In such a subject as statistical mechanics, in which the numbers
of instances are ordinarily enormous, it is common practice to
ignore the distinction between the expectation and the average,
as though they were not only equal quantities but also inter-
changeable concepts.
When we say that a true die will show, on the average, one
deuce in every six throws, we are, in effect, considering an en-
semble not of single throws but of sequences of six. One such
sequence is one instance in this ensemble, and the number of
deuces in the sequence is a quantity whose possible values are the
integers from 0 to 6. Its expectation in a single instance and its
approximate average in a large number are both equal to 1. The
law of great numbers, in the aspect illustrated by this example, is
often called the law of averages.
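A simulation (ours, with an arbitrary seed and sample size) illustrates the law of averages in this example.

```python
import random

# Simulation (parameters ours): in an ensemble of sequences of six throws
# of a true die, the number of deuces per sequence has expectation 1, and
# its average over many sequences is almost certain to lie near 1, as the
# law of great numbers asserts.

random.seed(42)
M = 200_000
total = 0
for _ in range(M):
    total += sum(1 for _ in range(6) if random.randint(1, 6) == 2)
average = total / M
assert abs(average - 1.0) < 0.02
```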
aM+1.pr | m.h = pr (pr | m.h),
whence, summing with respect to r, we see that
aM+1 | m.h = Σr pr (pr | m.h) = (p | m.h).
Thus Eq. (16.1) can be written
aM+1 | m.h = (p^(m+1) (1 - p)^(M-m) | h) / (p^m (1 - p)^(M-m) | h). (16.2)
In some examples, p is not limited to discrete values but has a
continuous range. In such a case, Eq. (16.2) requires no change,
but the summations in Eq. (16.1) must be replaced by integrals.
If we denote by f (p) dp the probability on the hypothesis h that
p has a value within the infinitesimal range dp, the equation
becomes
aM+1 | m.h = ∫0^1 p^(m+1) (1 - p)^(M-m) f(p) dp / ∫0^1 p^m (1 - p)^(M-m) f(p) dp. (16.3)
If f(p) is constant, this gives
aM+1 | m.h = (m + 1)/(M + 2). (16.4)
With μ written for m/M, Eq. (16.3) can be put in the form
aM+1 | m.h = ∫0^1 p [p^μ (1 - p)^(1-μ)]^M f(p) dp / ∫0^1 [p^μ (1 - p)^(1-μ)]^M f(p) dp.
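Equation (16.4) can be confirmed exactly: the integrals in Eq. (16.3) are beta integrals, ∫0^1 p^a (1 - p)^b dp = a! b!/(a + b + 1)!, and with f(p) constant their ratio is (m + 1)/(M + 2). A check (the code is ours):

```python
from fractions import Fraction
from math import factorial

# Exact check (the evaluation scheme is ours): with f(p) constant, the
# ratio of integrals in Eq. (16.3) reduces to (m + 1)/(M + 2), Eq. (16.4),
# using the beta integral: int_0^1 p^a (1-p)^b dp = a! b! / (a + b + 1)!.

def beta_int(a, b):
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def succession(m, M):
    return beta_int(m + 1, M - m) / beta_int(m, M - m)

for M in range(1, 25):
    for m in range(M + 1):
        assert succession(m, M) == Fraction(m + 1, M + 2)
```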
In this chapter so far, and the one before it, we have been
concerned with examples at two extremes. In the example of the
dice, considered in the preceding chapter, the required conditions
of irrelevance are fully met. By contrast, in the example of the
sunrise, they are not met at all, and the calculation from the rule
of succession is, in this example, a travesty of the proper use of
the principle. Between these extremes we carry on the familiar
daily reasoning by which we bring our experience to bear on our
expectation. In an ordinary case, we are obliged, under the given
circumstances, whatever they are, to anticipate an unknown
event. We look to experience for occasions in which the circum-
stances were similar and where we know the event which followed
them. We determine our expectation of a particular event in
the present instance by the frequency with which like events have
occurred in the past, allowing as best we can for whatever dis-
parity we find between the present and the former circumstances.
The ensemble is the conventional form for this reasoning.
Some cases it fits with high precision, others with low, and for
some it is scarcely useful. Suppose that someone is reading a
book about a subject which he knows well in some respects but
not in others, and that he finds, among the author's assertions,
instances both true and false in the matters he knows about. If
he finds more truth than error in these matters, he will judge that
an assertion about an unfamiliar matter is more probably true
than false, other things being equal. His reasoning has the same
character as an application of the rule of succession but not the
same precision. In the algebra of propositions,
a = a V (b.~b) = (a V b).(a V ~b)
for every meaning of a, and thus there is no proposition so simple
that it can not be expressed as the conjunction of others. Hence
there is no unambiguous way of counting the assertions in a dis-
course. Although it is possible often to recognize true and false
statements and sometimes to observe a clear preponderance of
one kind over the other, yet this observation can not always be
expressed by a ratio of numbers of instances, as it must be if the
rule of succession is to be applicable.
In every case in which we use the ensemble to estimate a prob-
ability, whether with high precision or low, we depend on the
similarity of the circumstances associated with the known and
unknown events. It seems strange, therefore, that Venn, who
defined probability in terms of the ensemble, should have ex-
cluded argument by analogy from the theory, as he did in the
passage quoted in the first chapter. For every estimate of prob-
ability made by that definition is an argument by analogy.
g.i | h = (i | h)(g | h.i) = (g | h)(i | h.g),
whence
(g | h.i)/(g | h) = (i | h.g)/(i | h).
If g implies i, then i | h.g = 1 and therefore
(g | h.i)/(g | h) = 1/(i | h). (18.1)
By this equation, g | h.i > g | h unless g | h = 0 or i | h = 1.
The reasons for these two exceptions are obvious. If g I h = 0,
g is an impossible inference to begin with and no accumulation of
evidence will make it possible. If i | h = 1, i is implied by h and
its verification, since it gives no information which was not al-
ready implicit in h alone, can not change the probability of g.
In all other cases, Eq. (18.1) shows that the verification of any
proposition i increases the probability of every proposition g
which implies it.
Moreover, the smaller is i | h, the prior probability of i, the
greater is (g | h.i)/(g | h), the factor by which its verification
increases the probability of g.
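Equation (18.1) can be illustrated with simple numbers (ours, not the text's).

```python
from fractions import Fraction

# Simple numbers (ours) for Eq. (18.1): if g implies i, verifying i
# multiplies the probability of g by 1/(i | h).  Take g | h = 1/10 and
# i | h = 1/4; since g implies i, g.i | h = g | h.

g_given_h = Fraction(1, 10)
i_given_h = Fraction(1, 4)

g_given_hi = g_given_h / i_given_h       # (g | h.i) = (g.i | h)/(i | h)
assert g_given_hi == Fraction(2, 5)
assert g_given_hi / g_given_h == 1 / i_given_h   # the factor 1/(i | h)
```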
4. (p. 4) The opinion which would comprise all kinds of probable inference
in an extended logic (whether independent of the logic of necessary inference
or including it as a special case) is an old one. It was expressed, for example,
by Leibnitz, who wrote: "Opinion, based on probability, deserves perhaps the
name knowledge also; otherwise nearly all historic knowledge and many
other kinds will fall. But without disputing about terms, I hold that the
investigation of the degrees of probability is very important, that we are still
lacking in it, and that this lack is a great defect of our logics." Nouveaux
Essais sur l'Entendement Humain, book 4, ch. 2, Langley's translation. Similar
statements occur in the same work in book 2, ch. 21, and book 4, ch. 16.
6. (p. 5) Boole himself used only the signs of ordinary algebra and a number
of later writers have followed his practice. It has the advantage of keeping
us aware of the resemblances between Boolean and ordinary algebra. But it
has the corresponding disadvantage of helping us to forget their points of
contrast, and it is besides somewhat inconvenient in a discussion in which the
signs of Boolean and ordinary algebra appear in the same equations. With the
signs used here, which are the choice of many authors, the only required pre-
caution against confusion is to reserve the sign · for conjunction in Boolean
algebra and avoid its use as the sign of ordinary multiplication.
8. (p. 12) It is interesting that vector algebra and logical algebra were
developed at nearly the same time. Although Boole's Laws of Thought did not
appear until 1854, he had already published a part of its contents some years
earlier in The Mathematical Analysis of Logic. Hamilton's first papers on
quaternions and Grassmann's Lineale Ausdehnungslehre were published in
1844, and Saint-Venant's memoir on vector algebra the next year.
The following quotation from P. G. Tait's Quaternions is apt in this
connection:
"It is curious to compare the properties of these quaternion symbols with
those of the Elective Symbols of Logic, as given in Boole's wonderful treatise
on the Laws of Thought; and to think that the same grand science of mathe-
matical analysis, by processes remarkably similar to each other, reveals to us
truths in the science of position far beyond the powers of the geometer, and
truths of deductive reasoning to which unaided thought could never have led
the logician."
9. (p. 12) Many symbols have been used for probabilities. Any will serve if
it indicates the propositions of which it is a function, distinguishes the in-
ference from the hypothesis and is unlikely to be confused with any other
symbol used in the same discourse with it. It should, of course, also be easily
read, written and printed.
10. (p. 14) A functional equation almost the same as this was solved by
Abel. The solution may be found in Oeuvres Completes de Niels Henrik Abel,
edited by L. Sylow and S. Lie (Christiania: Impr. de Groendahl & soen, 1881).
I owe this reference to the article by Jaynes cited in Note 1.
12. (p. 29) This may be the meaning of Kronecker's often quoted remark,
"God made the whole numbers. Everything else is the work of man."
13. (p. 30) The principle of insufficient reason, invoked to justify this
judgment, was so called early in the development of the theory of probability,
in antithesis to the principle of sufficient reason. It was meant by the latter
principle that causes identical in all respects have always the same effects. On
the other hand, if it is known only that the causes are alike in some respects,
whereas their likeness or difference in other respects is unknown, the reason
for expecting the same effect from all is insufficient. Alternatives become
possible and probability replaces certainty.
In much of the early theory and some more recent, there is an underlying
assumption, which does not quite come to the surface, that, in every case of
this kind, alternatives can be found among which there is not only insufficient
reason for expecting any one with certainty but even insufficient reason for
expecting one more than another. This assumption was doubtless derived
from games of chance, in which it is ordinarily valid. Its tacit acceptance,
however, was probably also made easier by the use of the antithetical terms,
sufficient reason and insufficient reason. The antithesis suggests what the
assumption asserts, that there are only two cases to be distinguished, the one
in which there is no ground for doubt and the one in which there is no ground
for preference.
The term principle of indifference, introduced by Keynes, does not carry
this implication and is besides apter and briefer.
14. (p. 31) This opinion is clearly expressed in the following quotation
from W. S. Jevons:
"But in the absence of all knowledge the probability should be considered
= ½, for if we make it less than this we incline to believe it false rather than
true. Thus, before we possessed any means of estimating the magnitude of
the fixed stars, the statement that Sirius was greater than the sun had a
probability of exactly ½; it was as likely that it would be greater as that it
would be smaller; and so of any other star.... If I ask the reader to assign
the odds that a `Platythliptic Coefficient is positive' he will hardly see his
way to doing so, unless he regard them as even." The Principles of Science: a
treatise on logic and scientific method (London and New York, Macmillan, 2nd
ed. 1877).
15. (p. 31) This example is, of course, from The Hunting of the Snark by Lewis
Carroll. Readers who wish to pursue the subject farther are referred also to
La Chasse au Snark, une agonie en huit crises, par Lewis Carroll. Traduit pour
la première fois en français en 1929 par Louis Aragon. (Paris: P. Seghers,
1949).
16. (p. 33) The influence of games of chance on the early development of
the mathematical theory of probability is well described in the work of Isaac
Todhunter, A History of the Mathematical Theory of Probability from the time
of Pascal to that of Laplace (Cambridge and London: Macmillan, 1865). The
theory is usually held to have begun in a correspondence on games between
Pascal and Fermat. A hundred years earlier, the mathematician Cardan had
written a treatise on games, De Ludo Aleae, but it was published after Pascal
and Fermat had ended their correspondence. Cardan, according to Todhunter,
was an inveterate gambler, and his interests were thus more practical and less
theoretical than those of the eminent mathematicians who followed him in
the field. It is therefore not surprising that he was less disposed than they
were to take for granted the equality of chances and instructed his readers
how to make sure of the matter when playing with persons of doubtful
character.
17. (p. 35) The word entropy was coined in 1865 by Clausius as the name
of a thermodynamic quantity, which he defined in terms of heat and tempera-
ture but which, he rightly supposed, must have an alternative interpretation
in terms of molecular configurations and motions. This conjecture was con-
firmed as statistical mechanics was developed by Maxwell, Boltzmann and
Gibbs. As this development proceeded, the association of entropy with proba-
bility became, by stages, more explicit, so that Gibbs could write in 1889:
"In reading Clausius, we seem to be reading mechanics; in reading Maxwell,
and in much of Boltzmann's most valuable work, we seem rather to be reading
in the theory of probabilities. There is no doubt that the larger manner in
which Maxwell and Boltzmann proposed the problems of molecular science
enabled them in some cases to get a more satisfactory and complete answer,
even for those questions which do not seem at first sight to require so broad
a treatment." (This passage is quoted from a tribute to Clausius published in
the Proceedings of the American Academy of Arts and Sciences and reprinted
in Gibbs' Collected Works.)
What Gibbs wrote in 1889 of the work of Maxwell and Boltzmann could
not have been said of statistical mechanics as it had been presented the year
before by J. J. Thomson in his Applications of Dynamics to Physics and
Chemistry, but it applies to Gibbs' own work, Elementary Principles in Sta-
tistical Mechanics, published in 1902. In the comparison of these two books,
it is worth noticing that Thomson mentioned entropy only to explain that
he preferred not to use it, because it "depends upon other than purely dy-
namical considerations," whereas Gibbs made it the guiding concept in his
method. As different as they are, however, these two books have one very
important feature in common, which they share also with the later works of
Boltzmann. This common trait is that the conclusions do not depend on any
particular model of a physical system, whether the model of a gas as a swarm
of colliding spherical particles or any other. Generalized coordinates were
used in all these works and thus entropy was made independent of any par-
ticular structure, although it remained still a quantity with its meaning defined
only in thermodynamics and statistical mechanics.
There was still wanting the extension of thought by which entropy would
become a logical rather than a physical concept and could be attributed to a
set of events of any kind or a set of propositions on any subject. It is true that
several writers on probability had noted the need of some such concept and
had even partly defined it. In Keynes' Treatise, for example, there is a chapter
on "The weight of arguments," in which the following passage is found:
"As the relevant evidence at our disposal increases, the magnitude of the
probability of the argument may either decrease or increase, according as the
18. (p. 37) This conclusion was derived from experimentally known proper-
ties of gases by Gibbs in his work, "On the equilibrium of heterogeneous
substances." It is known as Gibbs' paradox.
information was used by Hartley in the paper already cited. The name bit
as an abbreviation of binary digit was adopted by Shannon on the suggestion
of J. W. Tukey. I do not know who first used the game of twenty questions
to illustrate the measurement of information by entropy.
20. (p. 43) In statistical mechanics the condition in which the possible
microscopic states of a physical system are all equally probable is called the
microcanonical distribution. It is the condition of equilibrium of an isolated
system with a given energy, and the fact that it is also the condition of maxi-
mum entropy is in agreement with the second law of thermodynamics.
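The claim underlying note 20 — that among all distributions over a fixed set of states, the equally probable (microcanonical) one has the greatest entropy — can be checked numerically. The following sketch is illustrative only; the function names and the random sampling are not from the text:

```python
import math
import random

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

n = 4
uniform = [1.0 / n] * n
h_uniform = entropy(uniform)  # log2(n) = 2 bits for n = 4

# Any other distribution over the same n states has lower entropy.
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    p = [x / sum(w) for x in w]
    assert entropy(p) <= h_uniform + 1e-12
```

The loop samples a thousand arbitrary distributions over four states and confirms that none exceeds the entropy of the uniform one.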
21. (p. 43) A proposal to extend the meaning of such an established term
as entropy calls for some justification. There is good precedent, of course, in
the generalizations already made. In the work of Boltzmann and Gibbs
entropy has a broader meaning than Clausius gave it, and it has a broader
meaning still in the work of Shannon. The further generalization proposed
here does not change its meaning in any case in which it has had a meaning
heretofore. It only defines it where it has been undefined until now and it
does this by reasoning so natural that it seems almost unavoidable.
22. (p. 53) Boole, in The Laws of Thought, applied his algebra to classes of
things as well as to propositions, and it might be supposed that a system of
propositions, as defined in the chapter just ended, could be considered a class
of things in Boole's sense. There is indeed a likeness between them, and it
is this which allows the conjunction and disjunction of systems. But in
respect to contradiction the analogy fails, for the propositions which do not
belong to a system A, although they form a Boolean class, do not constitute a
system. This is because of the rule that every proposition which implies a
proposition of a system itself belongs to that system. Innumerable propositions
belong to the system A but imply propositions which do not belong to it. It
is this fact which keeps the system A from having a system standing in such
a relation to it as to be denoted by -A.
23. (p. 56) In the case in which each of the systems A and B is defined by
a set of mutually exclusive propositions, the definition of conditional entropy
given in Eq. (10.2) is the same as Shannon's. He also gave Eq. (10.4) for the
entropy of the conjunction.
24. (p. 65) This theorem has its physical counterpart in the fact that the
thermodynamic entropy of a physical system is the sum of the entropies of
its parts, at least so long as the parts are not made too fine. There is a system
of propositions associated in statistical mechanics with every physical system,
and the logical entropy of the one system is identified with the thermodynamic
entropy of the other. If, in the system of propositions, there is one which is
certain, the microscopic state of the physical system is uniquely determined.
In a physical system of several parts, a microscopic state of the whole system
26. (p. 66) If we can believe the ballad, he did neither, but instead fell
into an intermediate state, whence he was changed by enchantment into a
grisly elf. His sister broke the spell and restored him to life in his human form.
This complication, although it is essential to the theme of the ballad, seems
unnecessary in the present discussion.
27. (p. 68) This view has been expressed by authors whose opinions on
other subjects were widely different, as, for example:
Milton, in Paradise Lost: "That power Which erring men call Chance."
Hume, in An Enquiry concerning Human Understanding: "Though there be
no such thing as chance in the world, our ignorance of the real cause of any
event has the same influence on the understanding and begets a like species
of belief or opinion."
Jevons, in The Principles of Science: "There is no doubt in lightning as to
the point it shall strike; in the greatest storm there is nothing capricious; not
a grain of sand lies upon the beach, but infinite knowledge would account for
its lying there; and the course of every falling leaf is guided by the principles
of mechanics which rule the motions of the heavenly bodies.
"Chance then exists not in nature, and cannot coexist with knowledge; it is
merely an expression, as Laplace remarked, for our ignorance of the causes in
action, and our consequent inability to predict the result, or to bring it about
infallibly."
28. (p. 79) This principle was proved, in a more precise form than that
given here, in the Ars Conjectandi of James Bernoulli, published in 1713, eight
years after his death. His proof applied only to the case in which all the
probabilities are equal. The general proof was published in 1837 by Poisson.
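The principle Bernoulli proved — that the frequency of an event in a long run of trials tends toward its probability — is easily illustrated by simulation. The probability chosen and the trial counts below are arbitrary:

```python
import random

random.seed(1)
p = 0.3  # probability of the event on a single trial (illustrative)

def observed_frequency(n_trials):
    """Fraction of successes in n_trials independent trials."""
    hits = sum(1 for _ in range(n_trials) if random.random() < p)
    return hits / n_trials

# As the number of trials grows, the observed frequency approaches p.
for n in (100, 10_000, 1_000_000):
    print(n, observed_frequency(n))
```

With a million trials the observed frequency lies within a fraction of a percent of p, while with a hundred it may stray noticeably.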
Impossibility
  as zero probability, 22
  in relation to entropy, 36, 42, 45
  in relation to mutual exclusion, 23
  in relation to systems of propositions, 50
Indifference, judgment of, 30 ff. and Note 13
Induction
  as an example of probable inference, 2
  as inference about an ensemble, 93
  cumulative effect of verifications, 92 ff.
  Hume's criticism, 94 ff.
  induction justified by the rules of probable inference, 95 f.
  may approximate but can not attain certainty, 93 ff.
Inductive reasoning, defined, 91
Inductive system, 49
Information measured by entropy, 39 f., 40 ff., 48, 58 ff., 63 ff.
Instances in an ensemble, described, 79
  (See also Ensemble.)
Insufficient reason, 30 and Note 13
Irreducible set, 53 ff.
Irrelevance
  as minimum entropy of a disjunctive system, 62 f.
  associated with chance, 65 ff.
  in an ensemble of instances, 80 f., 82 ff.
  in conjoined systems, 65
  in relation to contradiction, 23 f.
  in relation to expectation, 72, 77 ff.
  in relation to implication, 18
  in the law of great numbers, 79
  in the proof of the rule of succession, 82 ff., 88 f.
  of propositions, defined, 18
  of systems, defined, 61

J
Jaynes, E. T., Notes 1, 17
Jeffreys, Harold, Notes 1, 4
Jevons, W. S., Notes 7, 14, 27

K
Keynes, J. M., Notes 1, 3, 4, 13, 17, 29
Khinchin, A. I., Note 17
Kleene, S. C., Note 1
Kolmogorov, A., Note 1
Koopman, O., Note 1
Kronecker, Leopold, Note 12

L
Laplace, 86, 89 and Notes 27, 29, 31
Law of averages, 81
Law of great numbers, 79, 81
Leibnitz, G. W., Note 4
Linear function, expectation of, 71
Lottery as an illustration of expectation, 69

M
Maundeville, Sir John, 3
Maxwell, J. C., Note 17
Measurement
  always partly arbitrary, 1
  of different quantities, compared, 29 f.
  of diversity and uncertainty, 35 ff., 47 f.
  of information, 39 f., 48
  of relevance, 60
  probabilities measurable by judgments of indifference, 30 f.
  probabilities measurable by the rule of succession, 86, 88
Meinong, A., Note 17
Milton, John, Note 27