Bayesian Probability
Herman Bruyninckx
Dept. of Mechanical Engineering, K.U.Leuven, Belgium
http://people.mech.kuleuven.ac.be/~bruyninc
November 2002
Abstract
This document introduces the foundations of Bayesian probability theory. The emphasis is on
understanding why Bayesian probability theory works, and on realizing that the theory relies, on the
one hand, on a very limited number of fundamental properties for information processing, and, on the
other hand, on a number of application-dependent and arbitrary choices for decision making.
The theory presented in this document is known under many names:
• Bayesianism.
• the Bayesian framework.
• the Bayesian paradigm.
• plausible inference.
• Bayesian reasoning.
• ...
This course speaks most often of Bayesian probability (theory), mainly because its terminology borrows
most from statistical probability theory. The “theory” is much more than just a theory: it’s a systematic
way of approaching an application problem in which one is confronted with incomplete knowledge about
the problem. When reading papers that use Bayesian probability, one gets the impression that the core of the systematic approach is that one applies the well-known laws of probability all the time: the sum rule, the product rule, and Bayes’ rule. However, this is only part of the whole picture, because applying Bayesian probability to an application problem also means that one takes great care to unambiguously define and motivate the system models and decision criteria that are used in the problem solution.
Message 1 (Modelling, information processing, decision making.) This document makes an explicit effort to separate the discussion of Bayesian probability theory into these three sub-problems: modelling, information processing, and decision making, and to give a clear definition of what information means in the context of Bayesian probability.
1.1 Examples
This section gives some examples of applications of Bayesian probability theory:
(Parameter) estimation.
• Police speed radar.
• Prediction of celestial body motions.
• Steering space ships to Jupiter.
Pattern matching/Hypothesis testing.
• Detection of tumor in a scan image.
• Detection of gene sequences.
• Speech recognition.
Model building.
• Reverse engineering.
• Autonomously navigating robot.
Inference.
• Reasoning in court decisions.
• Weather forecasting.
Output:
• an analysis of the “system”: the understanding and interpretation of what is going on in the system; the explication of what were the causes of the observed effects in the “system.”
• predictions: about how the “system” will evolve in the near future, and how it will react to specific inputs.
• decisions: does the “system” satisfy its requirements? What actions should be taken on the basis of the available information?
All the reasoning about the “system” is done under uncertainty: numerical conclusions about the “system” are (usually) not given as logical, binary numbers, which are either true or false. On the contrary, each conclusion is accompanied by a measure of its uncertainty. How to represent information about the “system” in a numerical manner is explained in a following Section.
Message 2 (Probability as information processing tool) This course explains why one can have
good faith in using the mathematics of probability theory as a consistent, unique and plausible tool for
dealing with uncertainty in real-world systems.
(Note that this message doesn’t talk about the whole of Bayesian probability theory, but just about the
information processing part of it.)
Definition 1 (Consistency) It does not matter in what form or order the available information is pro-
cessed by the Bayesian probability tools and algorithms, because the result will always be the same. That
is, the Bayesian framework is free from paradoxes and internal contradictions.
A later Section will show how this internal consistency is derived from an axiomatic basis.
Definition 2 (Unique) Bayesian probability theory provides a unique way to process information from “input” to “output.”
However, this unique way can be too complex to describe, and/or too computationally intensive to calcu-
late. Hence, many approximate processing algorithms have been developed (and will be introduced in a
later Section).
Hence, Bayesian probability has become quite popular in much of the modern research and products
in Artificial Intelligence. One of the major achievements in the 20th century development of Bayesian
probability is that this “vague” definition of plausibility can be nicely formalised in mathematical form,
and that the axioms of probability theory (sum, product and Bayes’ rule) can be derived from it. It is also
interesting to learn that this development was almost exclusively driven by physicists, and that almost no
mathematicians or statisticians were involved.
Fact 1 (Determinism) Bayesian probability is a fully deterministic theory to deal with nondeterministic systems and data.
Definition 4 (Uncertainty, Belief, Evidence) Uncertainty, Belief and Evidence are used as synonyms
of information.
Message 3 (PDF = function + measure) A PDF is not fully characterized by the function over the
variables space: also the (density) measure at each point of this space is important.
So, only expressions like
$$\int_D p(x)\, dx \qquad (1)$$
have real meaning (i.e., the amount of “probability mass” contained in the domain of the integral). The
dx is called the measure of the variables space: the total probability mass is the product of the value of
the PDF function, times the “density” of variables at this particular place. This density or measure is in
general not equal in different places of the state space. For example, on the surface of the earth, there is
a diminishing amount of ground between two longitude and latitude increments (i.e., the “square” formed
by moving one unit in either direction), when coming closer to the poles.
Measures are not just used for notational purposes (i.e., to denote the variables over which to integrate),
but they are examples of so-called differential forms. An n-dimensional differential form maps n tangent vectors to a real number. The tangent vectors form the edges of a parallelepiped of volume in the parameter space; the result of applying the differential form to this parallelepiped is the amount of
volume enclosed in it. Measures have their own transformation rules when changing the representation
(change of coordinates, change of physical units, etc.): changing the representation changes the coordinates
of the tangent vectors, and hence also the mathematical representation of the differential forms should
change, in order to keep the enclosed volume invariant. An invariant measure puts structure on the
parameter space. It can also be interpreted as the generalized form of a uniform probability distribution:
with this distribution, each unit of volume in parameter space is equally probable.
The concept of an invariant measure is important for parameter spaces that are fundamentally different
from Rn , i.e., that have a different structure than Rn , even though they may have n real numbers as
coordinates. The best-known example is probably the measure used to calculate the surface integral over a
sphere: depending on whether one uses Cartesian x, y, and z coordinates, or spherical r and θ coordinates,
the measure changes, [36]:
$$\int_A f(x, y, z)\, \{dx\, dy\, dz\} = \int_A f(r, \theta, \phi)\, \{r^2\, dr\, \sin(\theta)\, d\theta\, d\phi\}. \qquad (2)$$
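A quick numerical check of Eq. (2) (a sketch of my own, using numpy; the integrand f and the sample size are arbitrary choices): integrating f(x, y, z) = x² + y² + z² over the unit ball by Monte Carlo, once in Cartesian coordinates and once in spherical coordinates with the r² sin(θ) measure, gives the same result.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    f = lambda x, y, z: x**2 + y**2 + z**2

    # Cartesian coordinates: uniform samples in the bounding cube [-1, 1]^3,
    # keeping only the points that fall inside the unit ball.
    pts = rng.uniform(-1.0, 1.0, size=(N, 3))
    inside = (pts**2).sum(axis=1) <= 1.0
    I_cart = 8.0 * np.where(inside, f(pts[:, 0], pts[:, 1], pts[:, 2]), 0.0).mean()

    # Spherical coordinates: uniform samples in the (r, theta, phi) box,
    # weighted by the measure r^2 sin(theta) from Eq. (2).
    r = rng.uniform(0.0, 1.0, N)
    theta = rng.uniform(0.0, np.pi, N)
    phi = rng.uniform(0.0, 2.0 * np.pi, N)
    x = r * np.sin(theta) * np.cos(phi)
    y = r * np.sin(theta) * np.sin(phi)
    z = r * np.cos(theta)
    box = 1.0 * np.pi * 2.0 * np.pi        # volume of the (r, theta, phi) box
    I_sph = box * (f(x, y, z) * r**2 * np.sin(theta)).mean()

    print(I_cart, I_sph, 4.0 * np.pi / 5.0)  # all three approximately 2.513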
Fact 2 (Properties of PDF) Classical statistics often characterizes a PDF by saying that its integral
over its whole domain is equal to 1. However, this “1” doesn’t have any intrinsic meaning, and could be
replaced by any other positive number.
Definition 5 (Model) The combination of a set of variables, their information and relationship func-
tion(s) is called a model.
Choice 1 (Choice of model) Any “system” can be given many possible mathematical models; which one
to choose is a rather arbitrary choice of the practitioners. They can be guided in their choice by various
criteria: level of detail of the model; computational complexity; tradition; etc.
Fact 3 (Importance of model selection) Choosing an appropriate model for a particular application is not a simple job, and (the mathematical representation chosen in) the model determines to a very large extent the efficiency and accuracy of the information processing that will take place in the system.
Figure 1: Examples of discrete and continuous information functions, i.e., probability distributions.
Learning the syntax and semantics of the model (state, information functions, relationships) in a particular application domain is usually also the hardest part of building a reasoning system for that application. Once the model has been built and understood, all information processing should be straightforward, as will be shown in later sections.
Definition 8 (Dependent variables) Two variables X and Y are (statistically) dependent if a change
in the value of X is correlated with a change in the value of Y .
This does not necessarily mean that there is a physical causal connection between X and Y . In the example of the alarm, X could be JohnCalls, and Y is MaryCalls; there is no physical connection between John and Mary making calls; but in this case, both are sometimes influenced by a common cause, i.e., the alarm that sounds.
Definition 9 (Conditional PDF) The information function p(X, Y, Z, . . . ) between the variables X, Y, Z, . . .
can depend on the given values of some other variables A, B, C, . . . . This sort of information function is
called a conditional PDF, and the dependence is denoted by a vertical bar: p(X, Y, Z, . . . |A, B, C, . . . ).
Figure 1 shows a discrete and a continuous PDF in one single dimension. Figure 2 shows a set of
four PDFs from a very common family of one-dimensional PDFs, called normal distribution or Gaussian
distribution. These latter distributions are very popular, because (i) the information about many “systems”
can be described by Gaussians, and (ii) their mathematical properties are very attractive. That is, one
needs only two parameters to represent the normal distribution, denoted by N (µ, σ): its mean µ and its
standard deviation σ:
$$N(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \qquad (3)$$
Figure 2: Gaussian probability distributions, with mean 0 and different standard deviations σ.
σ² is called the variance of the distribution. The Gaussian distribution above is a univariate distribution, i.e., it is a function of one single parameter x. Its multivariate generalization has the form
$$N(\mu, P) = \frac{1}{\sqrt{(2\pi)^n \det(P)}} \exp\left(-\frac{1}{2}(x - \mu)^T P^{-1}(x - \mu)\right). \qquad (4)$$
x is an n-dimensional vector, µ is the vector of the mean (first moment, or “expected value”) of x:
$$\mu = \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_n \left\{ x\, p(x|\mu, P) \right\}. \qquad (5)$$
‖P‖ is the two-norm (i.e., the square root of the largest eigenvalue of PᵀP, [11]) of the second moment, or covariance matrix P:
$$P = \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_n \left\{ (x - \mu)(x - \mu)^T p(x|\mu, P) \right\}. \qquad (6)$$
Note that the term (x − µ)(x − µ)T in the above equation is a matrix, while the term (x − µ)T P −1 (x − µ)
in Eq. (4) is a number. It can be shown that this number has all properties of a distance on the parameter
space of x; in other words, the inverse of the covariance matrix P is a metric on that space, [27], and
hence determines structure on the space: points in the space are “ordered” relative to each other by their
distances.
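For concreteness, a small sketch of my own that evaluates the multivariate Gaussian of Eq. (4), with the term (x − µ)ᵀP⁻¹(x − µ) computed as the squared distance induced by the metric P⁻¹ (the numbers are arbitrary):

    import numpy as np

    def gaussian_pdf(x, mu, P):
        """Evaluate the multivariate Gaussian N(mu, P) of Eq. (4) at x."""
        n = len(mu)
        d = x - mu
        # (x - mu)^T P^{-1} (x - mu): a scalar, interpretable as the squared
        # distance between x and mu in the metric P^{-1}.
        dist2 = d @ np.linalg.solve(P, d)
        norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(P))
        return np.exp(-0.5 * dist2) / norm

    mu = np.array([0.0, 0.0])
    P = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    print(gaussian_pdf(np.array([1.0, -1.0]), mu, P))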
1.5 Marginalization and Bayes network
Definition 10 (Marginalization) Given a PDF p(X, Y, Z) that represents the information about the ways in which the values of the variables X, Y and Z can occur together, marginalization is the process of deriving the information about X and Y alone, given all possible values of Z.
The mathematical instantiation of marginalization is conceptually simple: integrate p(X, Y, Z) over all possible values of Z:
$$p(X, Y) = \int_Z p(X, Y, Z)\, dZ. \qquad (7)$$
The variables that are marginalized away may look superfluous, but they are not superfluous at all: they are needed in the system because it’s usually through them that information enters the model, and they can be used to infer information about the variables that the application user is interested in.
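For discrete variables the integral of Eq. (7) becomes a sum over the marginalized axis; a minimal sketch of my own, with a hypothetical joint PMF:

    import numpy as np

    # Hypothetical joint PMF p(X, Y, Z) over binary variables, axes (X, Y, Z).
    p_xyz = np.array([[[0.10, 0.05], [0.15, 0.10]],
                      [[0.05, 0.20], [0.10, 0.25]]])
    assert np.isclose(p_xyz.sum(), 1.0)

    # Marginalization, Eq. (7): sum (integrate) over all values of Z.
    p_xy = p_xyz.sum(axis=2)
    print(p_xy)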
Calculating the integrals needed in marginalization can, in practice, be very time consuming. This difficulty has motivated two complementary research developments: (i) more efficient integration algorithms, and (ii) modelling of uncertain systems with PDF functions that are not too computationally complex to work with. Note that both developments involve a trade-off between efficiency and accuracy. One successful result of this quest for efficient representation (and hence also processing) of information is Bayesian networks.
Definition 11 (Bayes network) A Bayes network is a directed graph structure to represent the (proba-
bilistic) dependencies between various variables in a system, and that is used to reduce the computational
complexity with respect to working with the full joint PDF.
Later sections will give more details, but Figure 3 illustrates the basic idea: the variables X, Y and Z
are connected by a PDF p(X, Y, Z); the network represents the fact that X and Y are independent, given
the information about the variable Z. This structure in the dependencies can be exploited to simplify
marginalization operations on p(X, Y, Z).
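The computational gain can be made concrete: if X and Y are conditionally independent given Z, the joint factorizes as p(X, Y, Z) = p(X|Z) p(Y|Z) p(Z), and one stores three small tables instead of one full joint. A sketch of my own, with hypothetical tables:

    import numpy as np

    # Hypothetical conditional tables for the network of Figure 3.
    p_z = np.array([0.3, 0.7])                 # p(Z)
    p_x_given_z = np.array([[0.9, 0.1],        # p(X|Z): rows indexed by Z
                            [0.2, 0.8]])
    p_y_given_z = np.array([[0.6, 0.4],        # p(Y|Z)
                            [0.3, 0.7]])

    # Reconstruct the full joint only to check the factorization:
    # p(x, y, z) = p(x|z) p(y|z) p(z).
    joint = np.einsum('zx,zy,z->xyz', p_x_given_z, p_y_given_z, p_z)
    assert np.isclose(joint.sum(), 1.0)

    # Marginal p(X) directly from the factored form, without the full joint:
    p_x = p_x_given_z.T @ p_z
    assert np.allclose(p_x, joint.sum(axis=(1, 2)))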
The collection of information comes from various sources: prior knowledge about the system (such as
physical laws it has to obey, or known initial conditions); measurements from sensors that can observe
(parts of) the system; data from the past; etc.
Figure 3: Smallest element in a Bayesian network: X and Y are conditionally independent, given Z.
Figure 4: A mobile robot drives in an office corridor, and builds up information about where it is positioned
with respect to the doors in the corridor.
Message 4 (Conservation of information) The processing of information should not change the in-
formation content; i.e., it should not delete, nor add information.
This seems a trivial statement to make, but is not in practice. One of the fundamental questions to answer
first before one can check whether information is conserved, is how to measure information. This will be
dealt with in Section 2.
Definition 12 (Decision and utility functions) Both terms are used as synonyms in this text. A
utility function is a mapping from the state space of the system to the real line. It represents the value of
the information acquired thus far.
Fact 6 (Choice of decision function) The choice of a decision function is most often very arbitrary.
There is no equally attractive axiomatic theory on decision making as for the axiomatic foundations of
Bayes’ rule (see later).
In the example of Figure 4, the robot must make a decision to enter a particular door or not, based on the information collected thus far. Hence, from all the possible positions, it has to choose only one single position, and plan its motion to move through the “door” as if this chosen position were the real position.
Message 5 In every application, always make the explicit distinction between modelling, information
processing, and decision making.
Fact 7 (Process first, decide later) In Bayesian probability, information processing is performed every
time the system transforms, and/or gets new information. Decisions can be taken at any moment, using
the information gathered until that moment. The decision doesn’t transform the available information.
Fact 8 (Conflicting beliefs) Many reasoning systems feel the need to introduce the concept of conflicting beliefs. In Bayesian probability, this concept does not exist on the level of information processing. However, it is possible that two different people define decision functions whose evaluations contradict each other.
2 Information measures
Any uncertainty reasoning system should be able to tell its users how much information it has gathered
about a particular part of the system under study. This means the system must have a procedure to
reduce the available information (stored in the form of possibly high-dimensional PDFs) to one single
scalar number.
Of course, an infinite number of ways exist to reduce a PDF to one single number. But history has provided
us with a (small) number of choices that have interesting and “natural” properties. Entropy is one of the
most “pure” measures, because, as this Section will show, it is derived from a minimal and very plausible
set of desirable information measure properties.
information becomes additive if and only if I is some function of log(p), or ln(p). The simplest choice is, of course, I = ln(p), which is the rationale behind the abundance of logarithms in statistics, for example in the information measures discussed in the following Subsections. Indeed, addition is the natural operator on the space of these logarithms, and is easy to work with.
The following subsection explains how Shannon found one particular logarithm-based function as a
very attractive measure of information.
The first and second specifications model our intuition that (i) small changes in probability imply only
small changes in entropy, and (ii) our uncertainty about the exact value of a parameter increases when it
is a member of a larger group.
The third desideratum is represented mathematically as follows: the information measure H obeys the following additive composition law:
$$H(p_1, \ldots, p_n) = H(w_1, \ldots, w_r) + w_1\, H(p_1|w_1, \ldots, p_k|w_1) + w_2\, H(p_{k+1}|w_2, \ldots, p_{k+m}|w_2) + \cdots, \qquad (8)$$
where w₁ is the probability of the set {x₁, . . . , x_k}, w₂ is the probability of the set {x_{k+1}, . . . , x_{k+m}}, and so on, Figure 5; p_i|w_j is the probability of the alternative x_i if one knows that the parameter x comes from the set that has probability w_j.
Figure 5: Grouping in a discrete probability distribution pi , in three sub-groups. The dashed lines represent
the discrete probability density function wj of the three groups.
For example, assume that x comes from a set of three members, with the alternatives occurring with probabilities 1/2, 1/3 and 1/6, respectively. If one then groups the second and third alternatives together (i.e., w₁ = p₁ = 1/2, the probability of the set {x₁}, and w₂ = p₂ + p₃ = 1/3 + 1/6 = 1/2, the probability of the set {x₂, x₃}), the composition law gives H(1/2, 1/3, 1/6) = H(1/2, 1/2) + ½ H(2/3, 1/3), since 2/3 and 1/3 are the probabilities of x₂ and x₃ within the set {x₂, x₃}.
The three above-mentioned axioms suffice to derive an analytical expression for the information measure function H(p). The first axiom implies that it is sufficient to determine H(p) for rational values $p_i = n_i / \sum_{j=1}^{n} n_j$ (with n_j integer numbers) only; the reason is that the rational numbers are a dense subset of the real numbers. One then uses the composition law to find that H(p) can be found from the uniform probability distribution P = (1/N, . . . , 1/N) over $N = \sum_{i=1}^{n} n_i$ alternatives. Indeed, the composition law says that the entropy H(p) is equal to the entropy H(P), because in P one can group the first n₁ alternatives, the following n₂ alternatives, and so on, which reduces to the original distribution. For example, let n = 3 and (n₁, n₂, n₃) = (3, 4, 2) such that N = 3 + 4 + 2 = 9; denoting H(1/N, . . . , 1/N) by H(N) yields
$$H\left(\frac{3}{9}, \frac{4}{9}, \frac{2}{9}\right) + \frac{3}{9} H(3) + \frac{4}{9} H(4) + \frac{2}{9} H(2) = H(9).$$
In general, this can be written as
$$H(p_1, \ldots, p_n) + \sum_i p_i\, H(n_i) = H\left(\sum_i n_i\right). \qquad (9)$$
A solution to this equation is given by H(n) = K ln(n), with K > 0 because of the monotonicity rule. All
this yields the following expression for the entropy:
$$H(p_1, \ldots, p_n) = K \ln\left(\sum_i n_i\right) - K \sum_i p_i \ln(n_i) \qquad (10)$$
$$= -K \sum_i p_i \ln\left(\frac{n_i}{\sum_j n_j}\right) \qquad (11)$$
$$= -K \sum_i p_i \ln(p_i). \qquad (12)$$
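Equation (12) and the composition law are easy to check numerically; a sketch of my own (with K = 1 and natural logarithms), re-using the grouping example given earlier:

    import numpy as np

    def H(p):
        """Shannon entropy, Eq. (12), with K = 1 and natural logarithms."""
        p = np.asarray(p, dtype=float)
        return -np.sum(p * np.log(p))

    # Composition law example: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3).
    lhs = H([1/2, 1/3, 1/6])
    rhs = H([1/2, 1/2]) + 0.5 * H([2/3, 1/3])
    print(lhs, rhs)            # both approximately 1.0114
    assert np.isclose(lhs, rhs)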
[Figure 6: four discrete probability distributions and their entropies: (.1, .2, .3, .2, .1, .1) with H = 1.6957; (.1, .2, .2, .2, .2, .1) with H = 1.7480; (.1, .2, .1, .2, .1, .2, .1) with H = 1.8866; (.2, .2, .2, .2, .2) with H = 1.6094.]
where dx is the density of the continuous PDF that we obtain from “taking the limit” of the discrete PDF n_i, and similarly for dY and m_j.
Hence, only relative information measures are possible. This is consistent with the fact that there is no
absolute zero information, to which the “distance” of a PDF p(x) dx could be taken.
Fact 11 (Mutual information) The relative information measure (often called “mutual information”) H(p, m) of two continuous PDFs p(x) and m(x) is defined as:
$$H(p, m) = -\int p(x) \ln\left(\frac{p(x)}{m(x)}\right) dx. \qquad (15)$$
This scalar is also called the Kullback-Leibler divergence (after the duo that first presented it, [22], [23]), or also mutual entropy, or cross entropy, of both probability measures p(x) and m(x), [22, 28]. It is a (coordinate-independent) measure for how much information one needs to go from the probability distribution m(x) to the probability distribution p(x). Like Shannon’s entropy, H(p(x):m(x)) is a global measure, since all information contributions log(p(x)/m(x)) are weighted by p(x) dx, and then added together.
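For discrete distributions, the integral in Eq. (15) becomes a sum; a sketch of my own (following the sign convention of Eq. (15), whose negation is the usual Kullback-Leibler divergence; the numbers are arbitrary):

    import numpy as np

    def relative_information(p, m):
        """H(p, m) of Eq. (15), discretized: -sum p ln(p/m)."""
        p, m = np.asarray(p, float), np.asarray(m, float)
        return -np.sum(p * np.log(p / m))

    p = np.array([0.1, 0.4, 0.5])
    m = np.array([0.2, 0.3, 0.5])
    print(relative_information(p, m))   # negative whenever p differs from m
    print(relative_information(p, p))   # exactly 0 when both are equal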
Rao [27] was the first to come up with a real distance function on the manifold MΣ of probability
distributions p(x, Σ) over the state space x and described by a parameter vector Σ = {σ1 , . . . , σn }.
Message 7 (PDF manifold) MΣ is a parameterized space of PDFs, which is not the same space as the
state space of the system on which the PDF is defined.
The v^i are the coordinates of the tangent vector in the basis formed by the tangent vectors of the logarithms of the σ-coordinates. A metric M at the point p (which is a probability distribution) is a bilinear mapping that gives a real number when applied to two (logarithmic) tangent vectors v and w attached to p. Rao showed that the covariance of both vectors satisfies all properties of a metric. Hence, the elements g_ij of the matrix representing the metric are found from the covariance of the coordinate tangent vectors:
$$g_{ij}(\Sigma) = \int \partial_i l(x, \Sigma)\, \partial_j l(x, \Sigma)\, p(x, \Sigma)\, dx. \qquad (18)$$
The matrix g_ij got the name Fisher information matrix. The covariance “integrates out” the dependency on the state space coordinates x, hence the metric is only a function of the statistical coordinates Σ. This metric is defined on the tangent space to the manifold M_Σ of the σ-parameterized family of probability distributions over the x-parameterized state space X. Kullback and Leibler already proved the following relationship between the relative entropy of two “infinitesimally separated” probability distributions Σ and Σ + v on the one hand, and the Fisher Information matrix g_ij(Σ) on the other hand:
$$H(\Sigma : \Sigma + v) = \frac{1}{2} \sum_{i,j} g_{ij}(\Sigma)\, v^i v^j + O(\|v\|^3). \qquad (19)$$
Hence, Fisher Information represents the local behaviour of the relative entropy: it indicates the rate of
change in information in a given direction of the probability manifold (not in a given direction of the state
space!).
Fact 12 (Fisher Information of a Gaussian PDF) It can be proven that the Fisher Information of a
Gaussian distribution gives the distribution’s covariance matrix.
This covariance matrix P is a function of the variables on the PDF parameter space, i.e., the mean µ and the standard deviation σ.
Fact 13 (Generalized least-squares) Equation (20) generalizes the well-known least-squares criterion
of measuring the deviation between a state space vector x and another state space vector µ.
In addition, the covariance matrix itself is often used to derive measures for the information contained in
the corresponding Gaussian PDF. This is a list of often used measures:
• Trace. That is, the sum of the eigenvalues of the matrix. This is one of the so-called linear invariants
of the covariance matrix.
• Determinant. Another invariant, being the product of the eigenvalues.
• Ratio of singular values. Most often, the condition number (ratio of largest to smallest singular value) is taken as the measure.
Fact 14 (Arbitrariness of Gaussian measures) None of the above-mentioned measures derived from
the covariance matrix of a Gaussian PDF has the status of a natural or absolute information measure.
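These scalar measures take one line each; a sketch of my own for a hypothetical covariance matrix:

    import numpy as np

    P = np.array([[4.0, 1.0],
                  [1.0, 2.0]])         # hypothetical covariance matrix

    eig = np.linalg.eigvalsh(P)         # P is symmetric positive definite
    trace = P.trace()                   # sum of the eigenvalues
    det = np.linalg.det(P)              # product of the eigenvalues
    cond = eig.max() / eig.min()        # condition number (largest/smallest)

    print(trace, det, cond)
    assert np.isclose(trace, eig.sum()) and np.isclose(det, eig.prod())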
2.6 No information—Ignorance
Every piece of software, hence also every set of software agents that together implement an “intelligent”
robot, must start from a certain initial state. In the context of plausible inference, one is often tempted
to describe this initial state as the state in which the robot knows “nothing” yet. This raises the two
questions as to (i) what such a total ignorance really means, and (ii) how to represent it. Since many
researchers didn’t find satisfactory answers to these two questions, they jumped to the conclusion that
probability theory is not a valid framework for A.I. However, the fact that they didn’t find satisfactory
answers says much about their own state of ignorance, since ignorance can be dealt with in a very clean and
formal way. (But, it is true, not always in a simple way.) The French scientist Pierre Simon de Laplace
(1749–1827) was the first to propose the uniform distribution of a parameter as the state of ignorance
about its exact value. This approach of assigning an a priori distribution in this way later got the name
of Laplace’s principle of indifference. Harold Jeffreys [19] generalized this, and presented the invariant
volume form as the non-informative prior distribution. However, Jeffreys was not accepted by the then
active community of statisticians, so his ideas were not widespread. A similar fate befell Edwin Thompson
Jaynes (1922–1998), in the 50s, 60s and 70s, although he did add mathematical rigour to the somewhat
intuitive ideas of Jeffreys, [14, 29], and, especially, did a lot to spread the Bayesian ideas.
Let’s first discuss the question about what total ignorance means. In fact, it doesn’t mean much: one
always knows something about the system one is interested in; or, at least, one could come up with some
models, even though one would have no idea about the values of the parameters in it. Or, in the words of
Jaynes: “merely knowing the physical meaning of our parameters [in a model], already constitutes highly
relevant prior information which our intuition is able to use at once” (emphasis is Jaynes’).
The question about which “ignorance priors” (or “noninformative priors”) to choose has still not been
answered completely satisfactorily: Jeffreys’ non-informative prior distribution works only for location
parameters, such as the mean value of a parameter. For other properties, such as e.g. the standard
deviation, other ignorance prior distributions are needed. Jaynes’s approaches to find ignorance priors are
[15, 16, 17, 18]:
1. Invariance under transformations. If the only thing one knows about the system is a model or an
hypothesis, this ignorance should not change if the mathematical representation of the model is
transformed into an equivalent representation. For example, a uniform distribution on x does not
necessarily lead to a uniform distribution on y = x2 .
2. Maximum Entropy (MaxEnt) principle. If all one knows about a system is a number of constraints it has to obey, the prior distribution of the parameters describing this system is given by maximizing the entropy of the distribution, taking into account the constraints. For example, if one knows the mean and variance of a parameter, the probability distribution that adds no extra a priori information (and hence has the largest entropy) turns out to be the Gaussian with the given mean and variance, e.g., [3, 18, 24, 25]; a small numeric check follows after this list.
3. Extra “I don’t know” hypothesis. If the robot has a set of hypotheses for the system under considera-
tion, and it doesn’t know at all which one to prefer, or even whether one of these hypotheses is valid,
it can add a new hypothesis that just says “I don’t know what hypothesis is valid.” Total ignorance
is then represented by giving all the probability to this last hypothesis.
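As a small numeric check of the Gaussian example in item 2 (a sketch of my own, using the closed-form differential entropies of three common distributions chosen to have the same variance):

    import numpy as np

    sigma = 1.0   # common standard deviation for all three distributions

    # Differential entropies (natural log) of distributions with variance sigma^2:
    h_gauss = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)
    h_uniform = np.log(np.sqrt(12.0) * sigma)   # uniform of width sqrt(12)*sigma
    b = sigma / np.sqrt(2.0)                    # Laplace scale with variance sigma^2
    h_laplace = 1.0 + np.log(2.0 * b)

    print(h_gauss, h_uniform, h_laplace)        # ~1.419, ~1.242, ~1.347
    assert h_gauss > h_uniform and h_gauss > h_laplace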
There still exists discussion about the appropriateness of these approaches; much of this controversy is
caused by the fact that most researchers do not thoroughly understand the importance of structure. . .
Also for the MaxEnt criterion, one has found a small set of very intuitive desiderata, [33, 34], that lead to
an axiomatic treatment of, and hence fundamental structure on, the space of probability distributions of
a system:
Axioms for MaxEnt
I Subset independence: information about one domain should not
affect the information description in another domain, provided
that there are no constraints linking both domains.
II Coordinate invariance.
III System independence: if the principle is valid in general, it
should also work on specific individual systems.
IV Scaling: the entropy of the probability distribution of a system
should not change if no new information is added to the system.
This text does not give the proof that these intuitive desiderata indeed lead to the MaxEnt principle, but
the approach is very similar to the approach applied on the desiderata for entropy; see the cited references
for more details.
2. The negation of the negation of a proposition A is equal to the proposition A. This means that a function g must exist such that:
$$g\big(g(p(A))\big) = p(A). \qquad (23)$$
3. The same function g should also satisfy the following law of logic:
$$g\big(p(A \text{ OR } B)\big) = g\big(p(A)\big) \text{ AND } g\big(p(B)\big). \qquad (24)$$
These structural prescriptions are sufficient to derive the product rule p(A AND B|M ) = p(A|M )p(B|A AND M ),
and the additivity to one, p(A) + p(NOT A) = 1. (M represents all available knowledge (“models”) used
in the inference procedure.)
Jaynes builds further on the approach by Cox, and states some assumptions (e.g., invariance) a bit more explicitly. In his unfinished manuscript [18], the basic rules of plausible reasoning are, in summary, formulated as follows: (I) degrees of plausibility are represented by real numbers; (II) the rules have a qualitative correspondence with common sense; (III) the rules are consistent: every allowed way of reasoning must lead to the same result, all available evidence must be taken into account, and equivalent states of knowledge must be assigned equivalent plausibilities.
These “axioms” are not accepted without discussion, e.g., [10, p. 241]:
• Assigning equivalent probabilities for equivalent states seems to assume that the modeller has “ab-
solute knowledge,” since one often doesn’t know that states are equivalent.
• Many people state that representing probability by one single real number is not always possible or
desirable.
It can be proved, [5, 18], in a way again very similar to the proof given for the entropy desiderata, that
the above-mentioned axioms lead to the following well-known probability rules:
Bayesian calculus
$$\text{Bayes' rule:} \quad p(\text{Model}|\text{Data}, H) = \frac{p(\text{Data}|\text{Model}, H)}{p(\text{Data}|H)}\; p(\text{Model}|H). \qquad (27)$$
Equation (27) suggestively uses the names “Data” and “Model,” since this is the contents of these variables
in many robotics inference problems.
Fact 15 (Bayes’ rule and PDFs) Bayes’ rule is defined to work with probabilities, not densities, but
the differentials dx “cancel out” of the equations, [25, p. 104].
Fact 16 (Bayes’ rule and the product rule) Bayes’ rule follows straightforwardly from applying the product rule twice, developing p(Model, Data|H) once for Model and once for Data.
The denominator in Bayes’ rule is usually considered to be nothing more than just a normalization constant,
[25, p. 105]; it is independent of the Model, and predicts the Data given only the prior information.
The term p(Model|Data, H) is called the posterior (probability); p(Data|Model, H)/p(Data|H) is the
likelihood ; and p(Model|H) is the prior probability of the hypothesis. The likelihood is not a probability
distribution in itself; as the ratio of two probability distributions with values between 0 and 1 it can have
any positive real value.
Fact 17 (Bayes’ rule represents model-based learning) It expresses the probability of the Model,
given the Data (and the background information H), as a function of (i) the probability of the Data when
the Model is assumed to be known, and (ii) the probability of that Model given the background information.
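In its simplest discrete form, Bayes’ rule (27) is a one-liner; a sketch of my own, with hypothetical numbers for two candidate models:

    import numpy as np

    # Hypothetical prior over two models and likelihoods of the observed Data.
    prior = np.array([0.5, 0.5])                # p(Model | H)
    likelihood = np.array([0.8, 0.3])           # p(Data | Model, H)

    evidence = np.sum(likelihood * prior)       # p(Data | H), the normalization
    posterior = likelihood * prior / evidence   # Bayes' rule, Eq. (27)
    print(posterior)                            # [0.727..., 0.272...]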
Let’s take a look at what each term in Bayes’ rule means. The left-hand side is what the reasoning
system wants to calculate: the PDF representing the information on the Model, taking into account all
Data; this PDF is a function over the Model’s parameter space. The right-most term is also a PDF over
the same parameter space, but before the Data was taken into account. The other term is written down
as the ratio of two PDF functions, seemingly having both the same domain, i.e., the parameter space of
the Data. However, the likelihood is also a function over the Model space. The Data is “measured” in its
own parameter space, but in Bayes’ rule it is used after transformation to the Model domain, through the
mathematical relationship between both. This relationship is sometimes called the measurement equation,
and makes up an indispensable part of the modelling step in any Bayesian reasoning system.
The transformation of a PDF through a functional relationship between two domains is quite straight-
forward. Assume that a PDF p(x) is given over the x domain, and that x is transformed to the y domain
through a functional relationship x = f(y). Then the PDFs are related as follows:
$$p(x)\, dx = p\big(f(y)\big) \left|\frac{\partial f}{\partial y}\right| dy. \qquad (28)$$
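Equation (28) can be checked by sampling; in this sketch of mine, x is uniform on [0, 1] and x = f(y) = y², so the PDF of y = √x should be p(y) = 2y:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, 100_000)   # p(x) = 1 on [0, 1]
    y = np.sqrt(x)                        # so x = f(y) = y^2, and df/dy = 2y

    # Eq. (28): p(y) = p(f(y)) |df/dy| = 1 * 2y.
    hist, edges = np.histogram(y, bins=20, range=(0, 1), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - 2.0 * centers)))  # small: histogram matches 2y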
Figure 7 shows Bayes’ rule in action, working on a Gaussian PDF as prior, and a likelihood which is
proportional to a Gaussian PDF. The resulting posterior is again a Gaussian PDF, [24, p. 7]. Recall,
however, that the prior and the likelihood in general do not have the same function domain. So, what is
depicted in the Figure is the Data PDF after transformation through the measurement equation.
Bayesian probability theory gives an axiomatically founded treatment of all structures and concepts
needed in reasoning with uncertainty; that’s why it is presented in this text. Classical (“orthodox”)
statistics gives some shortcuts for particular problems; this is convenient for specific implementations
(such as many parameter identification applications) but it is seldom easy and intuitive to know what
shortcuts are used in a given robotics task.
3.1 Optimality of information processing
The reasoning in Section 2.1 also allows one to interpret Bayes’ rule as a procedure to combine information
from two sources without loss of information: the first source is the prior information already contained in
the current state, and the second source is the new information added by the current measurement. This
relationship is straightforward: take Bayes’ rule for two models M1 and M2 that receive the same new
Data:
$$p(M_1|\text{Data}) = \frac{p(\text{Data}|M_1)}{p(\text{Data})}\, p(M_1), \qquad p(M_2|\text{Data}) = \frac{p(\text{Data}|M_2)}{p(\text{Data})}\, p(M_2).$$
Taking the logarithms of the ratio of both relationships yields
$$\ln\frac{p(M_1|\text{Data})}{p(M_2|\text{Data})} = \ln\frac{p(\text{Data}|M_1)}{p(\text{Data}|M_2)} + \ln\frac{p(M_1)}{p(M_2)}. \qquad (29)$$
The left-hand side is the information after the measurement; the right-hand side represents the information
contributions of both sources. Hence:
Fact 18 (Bayes’ rule is optimal information processor) Bayes’ rule has equal information “before”
its application as “after,” and hence is optimal in the sense that it doesn’t add nor delete information,
[37].
[Figure 7: Bayes’ rule in action on a one-dimensional example: a Gaussian prior (top), a likelihood proportional to a Gaussian (middle), and the resulting Gaussian posterior (bottom).]
4 Estimation
Estimation is the activity of taking as input the PDF that represents the information about a system, and performing a data reduction to one single scalar that is believed to summarize the information about this parameter. By the way we define estimation, it is clear that some arbitrary choices are involved.
In general, the reasoning system should build the joint PDF over all model and data parameters, as the
complete source of information about the whole system. However, the dimension of this PDF would grow
unbounded for systems that receive new information on a regular basis. Hence, one reduces these high-
dimensional PDFs to lower-dimensional ones, in a way that one can hope to keep most of the information
in this data reduction step.
Marginalization is one way to reduce the amount of representation data; estimators are another way.
The following types of estimators are most widely used:
Maximum Likelihood (ML) The PDF over the parameters that represent the Model information is
replaced by that parameter combination that has the highest value in the likelihood function.
Figure 8 depicts an ML estimate. It also shows that ML only works reliably for so-called unimodal
PDFs: PDFs that have only one “peak.” But even in that case, the ML estimate could be a poor
data reduction; e.g., in the case of a very “flat” PDF, there are many parameters with almost the
same function value, so the maximum is poorly conditioned numerically.
Maximum a Posteriori (MAP) Instead of looking only at the likelihood function, the MAP estimator considers the complete posterior PDF. However, one usually looks for the maximum of p(x), while one should look for the interval I in parameter space on which the integral $\int_I p(x)\, dx$ is largest, [26].
Least Squares (LS) This is the minimum of the “squared error function,” i.e., the function (x̂ − x)T (x̂ −
x). It seems at first sight that this estimator involves no probability distributions on the parameter(s)
x; but the squared error is proportional to the exponent of a Gaussian distribution with equal
covariance in all parameters.
Mean value The integral $\int x\, p(x)\, dx$ gives the average or mean value of the PDF p(x). This is also one scalar, and hence a valid procedure for doing data reduction. Figure 8 also shows that the mean is a poor choice for a multi-modal PDF.
Some of these estimators become equivalent in special cases:
• ML = MAP if the prior is noninformative, i.e., “flat.”
• LS = ML, for independent and identically distributed (“iid”), symmetric and zero-mean distributions.
• ML = MAP = mean for Gaussian PDFs.
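For a posterior PDF discretized on a grid, some of these estimators take only a few lines; a sketch of my own with a hypothetical bimodal posterior, illustrating how far apart the MAP and mean estimates can be:

    import numpy as np

    x = np.linspace(-5.0, 5.0, 1001)
    dx = x[1] - x[0]

    # Hypothetical bimodal posterior: mixture of two Gaussians.
    p = 0.7 * np.exp(-0.5 * (x - 2.0)**2) + 0.3 * np.exp(-0.5 * (x + 2.0)**2)
    p /= np.sum(p) * dx                   # normalize to unit probability mass

    map_estimate = x[np.argmax(p)]        # MAP: the maximum of the posterior
    mean_estimate = np.sum(x * p) * dx    # mean: the integral of x p(x) dx
    print(map_estimate, mean_estimate)    # ~2.0 versus ~0.8: they disagree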
5 Hypothesis tests
Many reasoning systems are asked to answer the question whether they have enough information to decide in favour of a particular “hypothesis” in the system. In the Bayesian framework, the validity of an
Figure 8: Example of Maximum Likelihood estimation (top). In the bottom figure, i.e., the case of a
multi-modal PDF, the ML and mean are not so good estimators.
hypothesis cannot be determined absolutely; only the relative quantification of hypotheses is meaningful.
Equation (29), derived from Bayes’ rule in a previous Section, is the basis for hypothesis testing. Written
without the logarithms one gets:
$$\frac{p(M_1|\text{Data})}{p(M_2|\text{Data})} = \frac{p(\text{Data}|M_1)}{p(\text{Data}|M_2)}\; \frac{p(M_1)}{p(M_2)}. \qquad (30)$$
This relationship answers the question whether the information available in the reasoning system prefers
model M1 over model M2 , depending on whether the left-hand side ratio is larger or smaller than 1. Or
rather, it seems to answer this question, because taking a closer look reveals some problems:
• The terms in Eq. (30) are not scalar numbers, but functions.
• In addition, the models M1 and M2 need not even be defined over the same parameter spaces. M1
could have parameters x, y and z, while M2 has parameters a and b. So, the ratios in the equation
have no unambiguous meaning. (In contrast to the PDF ratio in the likelihood function of Bayes’
rule, which is a physically meaningful ratio.)
This means that Eq. (30) has only qualitative value, but is quantitatively wrong: the ratio of any two
PDF functions doesn’t make sense, in general. However, the particular ratio involving the Data PDFs
does make sense, because both PDFs are defined over the same Data parameter space.
So, a data reduction step must be performed on some of the PDFs in Eq. (30), using information
measures to transform the PDFs into scalars that can be multiplied, divided, and compared. In this
respect, hypothesis tests are, in fact, the same things as parameter estimators, [25, p. 104]: estimating
a parameter to have a certain value is the same problem as testing the hypothesis that it has that value
against the hypotheses that it has other values.
The relative probability of two Models not only depends on how well they predict the Data (i.e., the
first term on the right-hand side), but also on how complex the models are (i.e., represented by the second
term). The predictive power and the complexity are always in a trade-off. More complex models have a higher probability to fit the Data better. Simpler models have fewer “degrees of freedom,” and hence the probability mass can be spread less. So, p(M|H) will be higher for a simpler model M. This principle is often called Occam’s razor, after (the Latin name of) the Englishman William of Ockham (1288–1348), who became famous (long after his death!) for his quotations “Frustra fit per plura, quod fieri potest per pauciora” (It is vain to do with more what can be done with less) and “Essentia non sunt multiplicanda praeter necessitatem” (Entities should not be multiplied unnecessarily).
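This Occam effect can be made concrete with a small coin-flipping sketch of my own (the data are hypothetical): model M1 fixes the head probability at 1/2, while the more complex model M2 gives the head probability θ a uniform prior, so its marginal likelihood spreads its probability mass over all possible data sets.

    from math import factorial

    n, k = 10, 5   # hypothetical data: 5 heads out of 10 ordered coin flips

    # M1: fair coin. Probability of this particular sequence of flips:
    p_data_m1 = 0.5 ** n

    # M2: head probability theta with a uniform prior; marginal likelihood
    # integral_0^1 theta^k (1 - theta)^(n - k) dtheta = k! (n - k)! / (n + 1)!
    p_data_m2 = factorial(k) * factorial(n - k) / factorial(n + 1)

    print(p_data_m1, p_data_m2, p_data_m1 / p_data_m2)  # Bayes factor ~2.7 > 1

The simpler model is preferred here, even though the complex model contains the true parameter value.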
6 Model building
Some of the more advanced reasoning systems have the goal of doing the modelling part of a Bayesian
system autonomously, i.e., without supervision by a human operator. However, these so-called model
building systems rely on estimation and hypothesis testing. Indeed, they start from a set of “modelling
primitives”, with which to build models. Then, model building corresponds to finding which combination
of primitives fits “best” with the data. The system is typically also given some algorithms or procedures to define which combinations of primitives should be looked at.
Figure 9: A Markov model system.
Then, the Kalman Filter is nothing else but Bayes’ rule applied to this particular kind of model. The
innovation in Fig. 10 is the difference between the predicted value of the measurement and the actual
value; this is indeed the new information, because one learns nothing new from predictions that turn out
to correspond to the reality.
Transition to next state: $x(k+1) = F(k)\, x(k) + v(k)$.
Prediction of next state: $\hat{x}(k+1|k) = F(k)\, \hat{x}(k)$.
Prediction of next state covariance: $P(k+1|k) = F(k)\, P(k)\, F^T(k) + Q(k)$.
(Figure 10: the prediction step of the Kalman Filter; v(k) is the process noise, with covariance Q(k).)
A Kalman Filter is an information processor, but it is also trivially usable as an estimator, because, for Gaussian PDFs, the mean of the PDF corresponds to the Maximum Likelihood estimate.
Note that the Kalman Filter works on full PDFs at all times (but these PDFs are analytically simple),
and hence within the model no information is created or lost. Of course, this only holds as long as the linear
process and measurement models are reliable. And quite often they are only first-order approximations.
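A minimal numeric sketch of my own (the matrices and noise levels are arbitrary choices) of one predict/update cycle of a linear Kalman Filter, matching the prediction equations of Figure 10:

    import numpy as np

    def kalman_step(x, P, F, Q, H, R, z):
        """One Kalman Filter cycle: prediction followed by measurement update."""
        # Prediction (as in Figure 10):
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Update: the innovation is the difference between the measurement and
        # its prediction; only this difference carries new information.
        innovation = z - H @ x_pred
        S = H @ P_pred @ H.T + R                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        x_new = x_pred + K @ innovation
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new

    # Hypothetical 1D constant-velocity example: state = (position, velocity).
    F = np.array([[1.0, 1.0], [0.0, 1.0]])
    Q = 0.01 * np.eye(2)
    H = np.array([[1.0, 0.0]])                   # we measure position only
    R = np.array([[0.5]])
    x, P = np.zeros(2), np.eye(2)
    x, P = kalman_step(x, P, F, Q, H, R, z=np.array([1.2]))
    print(x, np.diag(P))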
Kalman Filters are the information processing workhorses in aviation and aerospace applications: they have been used for over forty years to keep rockets, airplanes and satellites on their correct course. Kalman Filters are also very popular in robotics applications, such as autonomously navigating mobile robots (Fig. 11).
Message 8 (Data reduction through Maximum Likelihood) Many Bayesian algorithms tackle the complexity involved in the full application of Bayes’ rule by reducing the information contained in a PDF to only one single parameter, i.e., the Maximum Likelihood estimate.
Figure 11: Mobile robot navigation based on tracking of beacons in the environment, by means of a Kalman
Filter. Whenever the robot is able to see beacons in its environment (and is able to match them with its
a priori map) it can reduce its position uncertainty. This is reflected by the covariance ellipses shown in
the picture. (Figure courtesy of J. Vandorpe.)
Message 9 (ML and local maxima) Every estimation or information processing algorithm that uses not the full PDFs but only a Maximum Likelihood (or any other) estimate is prone to converge to a local maximum, and not the global maximum.
The Viterbi algorithm is an example of such a ML-based approach. The underlying system is modelled by
an HMM, and data correlated with the state transitions is coming in at every sample instant. The goal of
the reasoning system is to estimate which state transition sequence has generated the measured signals.
Instead of calculating a PDF over all possible transition paths that could have led from the past to the
current measurement, the Viterbi algorithm only stores the most probable path.
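To make this concrete, here is a compact sketch of my own (for a small hypothetical HMM; the matrices are arbitrary): per time step, only the most probable path into each state is kept, and the best predecessors are stored for backtracking.

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Most probable state sequence of an HMM (log domain).
        pi: initial state probabilities, A[i, j]: transition i -> j,
        B[i, o]: probability of observing symbol o in state i."""
        logp = np.log(pi) + np.log(B[:, obs[0]])
        back = []
        for o in obs[1:]:
            # For each next state, keep only the best predecessor.
            cand = logp[:, None] + np.log(A) + np.log(B[:, o])[None, :]
            back.append(np.argmax(cand, axis=0))
            logp = np.max(cand, axis=0)
        # Backtrack through the stored best predecessors.
        path = [int(np.argmax(logp))]
        for b in reversed(back):
            path.append(int(b[path[-1]]))
        return path[::-1]

    # Hypothetical 2-state, 2-symbol HMM:
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(viterbi(pi, A, B, obs=[0, 1, 1, 0]))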
• Expectation. Based on the latest ML estimates of the states, the expected measurements are pre-
dicted.
• Maximization. These predicted measurements are compared to the real measurements, and from the
difference the state estimate is adapted, in a Maximum Likelihood sense.
Figure 12 depicts this EM loop, in two snapshots of the reasoning system inside a mobile robot that navigates in an unknown environment, building a map of that environment. The first figure still shows mapping inaccuracies, due to the inaccurate self-motion measurements of the mobile robot. The second row shows a better converged estimate at a later point in time.
Figure 12: Intermediate and final topological maps and occupancy grids during an EM algorithm for model
building by a mobile robot.
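The same E/M alternation can be illustrated on a much smaller problem than map building; the following sketch of mine fits the means of a two-component one-dimensional Gaussian mixture (unit variances and equal weights are simplifying assumptions, and this E step keeps full expected memberships rather than a hard ML reduction):

    import numpy as np

    rng = np.random.default_rng(2)
    data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

    mu = np.array([-1.0, 1.0])           # initial guesses for the two means
    for _ in range(20):
        # Expectation: membership probability of each point for each component.
        d = np.exp(-0.5 * (data[:, None] - mu[None, :])**2)
        w = d / d.sum(axis=1, keepdims=True)
        # Maximization: re-estimate the means from the weighted data.
        mu = (w * data[:, None]).sum(axis=0) / w.sum(axis=0)

    print(mu)   # converges near the true means (-2, 3)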
Figure 13: Sample-based information processing for a mobile robot that wants to localize itself in an office
like environment of which it has a map. The differently coloured robots and the correspondingly coloured
samples represent the information at subsequent instants in time.
Fact 19 (Sampled representation of PDF) Any PDF can be approximated by a number of samples, i.e., at areas I in the parameter space where $\int_I p(x)\, dx$ is high, one puts more samples than where this integral is low. All operations on PDFs are then replaced by operations on the individual samples.
Fact 20 (Monte Carlo) The name Monte Carlo is associated with many sample-based methods.
Fact 21 (Sampling from uniform interval) The only computationally efficient sampling algorithms
sample from a uniform distribution over the interval [0, 1]. Sampling from other PDFs is always trans-
formed, via various routes, to such a sampling.
The Cumulative Density Function (CDF) is an important concept in sampling, Fig. 14. The CDF of
any given PDF can be used to generate samples from that PDF, via the Inverse CDF sampling procedure,
Fig. 15: one takes uniform samples from the Y axis of the CDF plot, and each of the inverse function
values is a sample from the original PDF.
Fact 22 (When is sampling possible?) The PDF must be “easy” to evaluate, i.e., the value of p(x)
should be fast to calculate.
Figure 14: The Cumulative Density Function of the given PDF is the function that corresponds to the
integral of the PDF along its domain.
The ICDF sampling is simplest for discrete or discretized PDFs; it can be quite complicated for a
general analytical PDF. So, discrete approximations to these PDFs are most often used.
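For a discretized PDF, the Inverse CDF procedure of Figure 15 reduces to a cumulative sum and a lookup; a sketch of my own:

    import numpy as np

    rng = np.random.default_rng(3)

    # Discretized target PDF on a grid (here a standard Gaussian).
    x = np.linspace(-4.0, 4.0, 400)
    p = np.exp(-0.5 * x**2)
    p /= p.sum()

    cdf = np.cumsum(p)                     # discrete CDF
    cdf[-1] = 1.0                          # guard against rounding below 1
    u = rng.uniform(0.0, 1.0, 10_000)      # uniform samples on the Y axis
    samples = x[np.searchsorted(cdf, u)]   # inverse CDF lookup

    print(samples.mean(), samples.std())   # close to 0 and 1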
Fact 23 (Normalization of samples) After each information processing step, it is common practice to normalize the samples. That is, instead of keeping information about p(x_i) dx one stores a number of samples of “unit” magnitude, where the number corresponds to the integral $\int_I p(x)\, dx$.
The ”information processing step” can be: (i) transformation of the type y = f (x), or (ii) the application
of Bayes’ rule.
Fact 24 (Monte Carlo integration) The Bayesian framework uses a lot of integrations, for example,
for doing marginalizations. Monte Carlo integration is a sample-based approximation of a real integral,
formed by the sum of samples of the PDF, normalized by their “density.”
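A sketch of my own of this idea: the integral $\int f(x)\, p(x)\, dx$ is approximated by the average of f over samples drawn from p(x).

    import numpy as np

    rng = np.random.default_rng(4)

    # Estimate E[f(x)] = integral f(x) p(x) dx with p = N(0, 1) and f(x) = x^2.
    samples = rng.normal(0.0, 1.0, 100_000)
    estimate = np.mean(samples**2)
    print(estimate)   # close to the exact value 1.0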
Figure 15: Sampling from any PDF can be done by “inverse” uniform sampling from its CDF.
References
[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking, Principles, Techniques, and Software. Artech
House, 1993.
[2] T. Bayes. Essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1764. Reprinted in Biometrika, 45:293–315, 1958, and in Facsimiles of two papers by Bayes, W. Edwards Deming.
[3] G. L. Bretthorst. An introduction to parameter estimation using Bayesian probability theory. In
P. F. Fougère, editor, Maximum Entropy and Bayesian Methods, pages 53–79. Kluwer, Dordrecht,
The Netherlands, 1990.
[4] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley, New York, NY, 1991.
[5] R. T. Cox. Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1):1–
13, 1946. Reprinted in [30, p. 353].
[6] P.-S. de Laplace. Théorie analytique des probabilités. Courcier Imprimeur, 1812. 2nd edition, 1814;
3rd edition, 1820.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the
EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[8] G. J. Erickson and C. R. Smith, editors. Maximum-Entropy and Bayesian Methods in Science and
Engineering. Vol. 1: Foundations; Vol. 2: Applications. Kluwer Academic Publishers, Dordrecht, The
Netherlands, 1988.
[9] R. A. Fisher. Statistical methods for research workers. Oliver and Boyd, Edinburgh, Scotland, 13th
edition, 1967.
[10] M. Ginsberg. Essentials of Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, 1993.
[11] G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.
[12] I. J. Good. A derivation of the probabilistic explanation of information. Journal of the Royal Statistical
Society Ser. B, 28:578–581, 1966.
[13] E. T. Jaynes. How does the brain do plausible reasoning? Technical Report 421, Stanford University
Microwave Laboratory, 1957. Reprinted in [8, Vol. 1, p. 1–24].
[14] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
Reprinted in [29, p. 6–16].
[15] E. T. Jaynes. Prior probabilities. IEEE Trans. Systems Science and Cybernetics, 4:227–241, 1968.
Reprinted in [29, p. 116–130].
[16] E. T. Jaynes. Where do we stand on Maximum Entropy? In R. D. Levine and M. Tribus, editors,
The maximum entropy formalism, pages 15–118. MIT Press, 1978. Reprinted in [29, p. 211–314].
[17] E. T. Jaynes. Bayesian methods: general background. In J. H. Justice, editor, MaxEnt’84: Maximum
Entropy and Bayesian Methods in Geophysical Inverse Problems, pages 1–25. Cambridge University
Press, 1986.
[18] E. T. Jaynes. Probability theory: The logic of science. Unfinished manuscript, http://bayes.wustl.edu/etj, 1996.
[19] H. Jeffreys. Theory of Probability. Clarendon Press, 1939. 2nd edition, 1948; 3rd edition, 1961.
Reprinted by Oxford University Press, 1998.
[20] R. E. Kalman. A new approach to linear filtering and prediction problems. Trans. ASME J. Basic
Eng., 82:34–45, 1960.
[21] M. G. Kendall and A. O’Hagan. Kendall’s advanced theory of statistics. 2B: Bayesian inference.
Arnold, London, England, 1994.
[22] S. Kullback. Information theory and statistics. Wiley, New York, NY, 1959.
[23] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics,
22:79–86, 1951.
[24] D. V. Lindley. Introduction to probability and statistics from a Bayesian viewpoint. Vol. 1: Probability,
Vol. 2: Inference. Cambridge University Press, 1965.
[25] T. J. Loredo. From Laplace to supernova SN 1987a: Bayesian inference in astrophysics. In P. F.
Fougère, editor, Maximum Entropy and Bayesian Methods, pages 81–142. Kluwer, Dordrecht, The
Netherlands, 1990.
[26] D. J. C. MacKay. Information theory, inference and learning algorithms. Textbook in preparation. http://wol.ra.phy.cam.ac.uk/mackay/itprnn/, 1999.
[27] C. R. Rao. Information and the accuracy attainable in the estimation of statistical parameters.
Bulletin of the Calcutta Mathematics Society, 37:81–91, 1945.
[28] C. C. Rodrı́guez. The metrics induced by the Kullback number. In J. Skilling, editor, Maximum
Entropy and Bayesian Methods, pages 415–422. Kluwer, 1989.
[29] R. D. Rosenkrantz, editor. E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics.
D. Reidel, 1983. Second paperbound edition, Kluwer Academic Publishers, 1989.
[30] G. Shafer and J. Pearl, editors. Readings in Uncertain Reasoning. Morgan Kaufmann, San Mateo,
CA, 1990.
[31] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423,
1948.
[32] C. E. Shannon and W. Weaver. The mathematical theory of communication. University of Illinois
Press, Urbana, IL, 1949.
[33] J. E. Shore and R. W. Johnson. Axiomatic derivation of the principle of maximum entropy and
the principle of minimum cross-entropy. IEEE Trans. Information Theory, 26(1):26–37, 1980. Error
corrections in Vol. 29, No. 6, pp. 942–943, 1983.
[34] J. Skilling. The axioms of Maximum Entropy. In Erickson and Smith [8], pages 173–187 (Vol. 1).
[35] H. W. Sorenson. Least-squares estimation from Gauss to Kalman. IEEE Spectrum, 7:63–68, 1970.
[36] G. Strang. Calculus. Wellesley-Cambridge Press, Wellesley, MA, 1991.
[37] A. Zellner. Optimal information processing and Bayes’s theorem. The American Statistician, 42:278–
284, 1988.