Bayesian
Programming
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series
This series reflects the latest advances and applications in machine learning and pat-
tern recognition through the publication of a broad range of reference works, text-
books, and handbooks. The inclusion of concrete examples, applications, and meth-
ods is highly encouraged. The scope of the series includes, but is not limited to, titles
in the areas of machine learning, pattern recognition, computational intelligence,
robotics, computational/statistical learning theory, natural language processing,
computer vision, game AI, game theory, neural networks, computational neurosci-
ence, and other relevant topics, such as machine learning applied to bioinformatics
or cognitive science, which might be proposed by potential contributors.
Bayesian
Programming
Pierre Bessière
CNRS, Paris, France
Emmanuel Mazer
CNRS, Grenoble, France
Juan-Manuel Ahuactzin
ProbaYes, Puebla, Mexico
Kamel Mekhnacha
ProbaYes, Grenoble, France
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged, please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Contents

Foreword
Preface
1 Introduction
1.1 Probability: an alternative to logic
1.2 A need for a new computing paradigm
1.3 A need for a new modeling methodology
1.4 A need for new inference algorithms
1.5 A need for a new programming language and new hardware
1.6 A place for numerous controversies
1.7 Running real programs as exercises
17 Glossary
17.1 Bayesian filter
17.2 Bayesian inference
17.3 Bayesian network
17.4 Bayesian program
17.5 Coherence variable
17.6 Conditional statement
17.7 Decomposition
17.8 Description
17.9 Forms
17.10 Incompleteness
17.11 Mixture
17.12 Noise
17.13 Preliminary knowledge
17.14 Question
Bibliography
Index
Foreword
Stuart Russell
University of California, Berkeley
Preface
The Bayesian Programming project first began in the late 1980s, when we
came across Jaynes’ ideas and writings on the subject of “Probability as
Logic.”
We were convinced that dealing with uncertainty and incompleteness was
one of the main challenges for robotics and a sine qua non condition to the
advancement toward autonomous robots, but we had no idea of how to tackle
these questions.
Edwin T. Jaynes’ proposition of probability as an alternative and an exten-
sion of logic to deal with both incompleteness and uncertainty was a revelation
for us. The theoretical solution was described in his writings, especially in the
early version of his book Probability Theory: The Logic of Science [Jaynes,
2003] that he was distributing, already aware that illness might prevent him
from finishing it.
He described the principles of what he called “the robot,” which was not
a physical device, but an inference engine to automate probabilistic reasoning
— a kind of Prolog for probability instead of logic. We decided to try to
implement such an inference engine and to apply it to robotics.
The main lines of the Bayesian programming formalism were designed, no-
tably, by Olivier Lebeltel. The first prototype versions of the inference engine
were developed in Lisp and several applications to robotic programming were
designed.
In the late 1990s we realized, first, for research applications, that probabilistic
modeling could be successfully applied to any sensory-motor system, whether
artificial or living, and, second, that in industry, Bayesian programming had
numerous potential applications beyond robotics.
To investigate the research applications along with our international part-
ners, we set up two successive European projects: BIBA (Bayesian Inspired
Brain and Artifacts) and BACS (Bayesian Approach to Cognitive Systems).
In these projects we made progress on three different scientific questions:
Along with the coauthors of this book, the main developers of ProBT
were David Raulo, Ronan Le Hy, and Christopher Tay. We would like to
emphasize their hard work and their invaluable contributions, which have led
to an effective programming tool. Professor Philippe Lerray and Linda Smail
also contributed to the definition and implementation of key algorithms found
in ProBT, namely the structural identification and symbolic simplification
algorithms.
None of this would have been possible without the work of all the PhD
students and postdocs who have used and improved Bayesian programming
over the years, either within our group at Grenoble University (France) or
elsewhere in Europe (in approximate chronological order): Eric Dedieu, Olivier
Lebeltel, Christophe Coué, Ruben Senen Garcı́a Ramı́rez, Julien Diard, Cédric
Pradalier, Jihene Serkhane, Guy Ramel, Adriana Tapus, Carla Maria Chagas
e Cavalcante Koike, Miriam Amavizca, Jean Laurens, Francis Colas, Ronan Le
Hy, Pierre-Charles Dangauthier, Shrihari Vasudevan, Joerg Rett, Estelle Gilet,
Xavier Perrin, João Filipe Ferreira, Clement Moulin-Frier, Gabriel Synnaeve,
and Raphael Laurent. Many of these former students have been mentored or
coadvised by Julien Diard, Jorge Dias, Thierry Fraichard, Christian Laugier,
or Roland Siegwart, whom we also thank for the countless discussions we had
together.
A very special thanks to Jacques Droulez and Jean-Luc Schwartz for their
inspiring and essential collaborations.
We would also like to thank the team who assisted us during the actual
making of the book: Mr. Peter Bjornsen for correcting the many errors found in
the initial version, Mr. Shashi Kumar for tuning the configuration file needed
by LATEX, Ms. Linda Leggio and Mr. David Grubbs for the organization.
Finally thanks to CNRS (Centre National de la Recherche Scientifique),
INRIA, University of Grenoble, and the European Commission, who actively
supported this project.
Pierre Bessière
Emmanuel Mazer
Juan-Manuel Ahuactzin
Kamel Mekhnacha
Chapter 1
Introduction
Computing a cost price to decide on a selling price may seem a purely arithmetic
operation consisting of adding elementary costs. However, these elementary costs
are often not known exactly. For instance, a part's cost may be biased by exchange
rates, production cost may be biased by the number of orders, and transportation
costs may be biased by the time of year. Exchange rates, the number of orders, and
the time of year, when unknown, are hidden variables, which induce uncertainty
in the computation of the cost price.
Analyzing the content of an e-mail to filter spam is a difficult task, because
no word or combination of words can give you absolute certainty about
the nature of the e-mail. At most, the presence of certain words is a strong
clue that an e-mail is spam. It may never be a conclusive proof, because
the context may completely change its meaning. For instance, if one of your
friends forwards you a spam message to discuss the spam phenomenon,
its whole content is suddenly not spam any longer. A linguistic model of spam
is irremediably incomplete because of this boundless contextual information.
Filtering spam is not hopeless, and some very efficient solutions exist, but a
perfect result is a chimera.
Machine control and dysfunction diagnosis are very important to industry.
However, the dream of building a complete model of a machine and all its possible
failures is an illusion. One should recall the first "bug" of the computer
era: the moth located in relay 70, panel F, of the Harvard Mark II computer.
Once again, this does not mean that control and diagnosis are hopeless; it only
means that models of these machines should take into account their own
incompleteness and the resulting uncertainty.
In 1781, Sir William Herschel discovered Uranus, the seventh planet of the
solar system. In 1846, Johann Galle observed Neptune, the eighth planet, for
the first time. In the meantime, both Urbain Leverrier, a French astronomer,
and John Adams, an English one, became interested in the “uncertain” tra-
jectory of Uranus. The planet was not following the exact trajectory that
Newton’s theory of gravitation predicted. They both came to the conclusion
that these irregularities could be the result of a hidden variable not taken into
account by the model: the existence of an eighth planet. They even went much
further, finding the most probable position of this eighth planet. The Berlin
observatory received Leverrier’s prediction on September 23, 1846, and Galle
observed Neptune the very same day!
to the already quoted James C. Maxwell in 1850 and to the visionary Henri
Poincaré in 1902:
and finally, by Edwin T. Jaynes in his book Probability Theory: The Logic
of Science, where he brilliantly presents the subjectivist alternative and sets
out clearly and simply the basis of the approach:
5. "We see, by this Essay, that the theory of probabilities is, at bottom, only common sense reduced to calculus; it makes us appreciate with exactness what accurate minds feel by a sort of instinct, often without being able to account for it." (translated from the original French)
6. See FAQ/FAM: Cox theorem, Section 16.8.
Descriptions are the basic elements that are used, combined, composed,
manipulated, computed, compiled, and questioned in different ways to build
Bayesian programs.
with the same description to solve very different problems. This clear separa-
tion between the model and its use is a very important feature of Bayesian
Programming.
adapted and tuned to more or less specific models and a software architecture
to combine them in a coherent and unique tool.
Numerous such Bayesian inference algorithms have been proposed in the
literature. The purpose of this book is not to present these different computing
techniques and their associated models once more. Instead, we offer a synthesis
of this work and a number of bibliographic references for those who would like
more detail on these subjects.
Chapters 12 to 15 are dedicated to that.
think that these repetitions are useful as our goal in this chapter is to give a
synthetic overview of all these models.
# import all
from pypl import *
# print it
print 'P_dice = ', P_dice
This may require some computer science proficiency, which is not required
from the readers of this book. Running these programs is a plus but is not
necessary in the comprehension of this book. To run these programs on a
computer, a Python package called pypl is needed. The source code of the
examples as well as the Python package can be downloaded free of charge
from "http://www.probayes.com/Bayesian-Programming-Book/." The Python
package is based on ProBT, a C++ multiplatform professional library used
to automate probabilistic calculus.
Additional exercises and programs are available on this Web site.
Part I
Bayesian Programming
Principles
Chapter 2
Basic Concepts
2.1 Variable
2.2 Probability
2.3 The normalization postulate
2.4 Conditional probability
2.5 Variable conjunction
2.6 The conjunction postulate (Bayes theorem)
2.7 Syllogisms
2.8 The marginalization rule
2.9 Joint distribution and questions
2.10 Decomposition
2.11 Parametric forms
2.12 Identification
2.13 Specification = Variables + Decomposition + Parametric forms
2.14 Description = Specification + Identification
2.15 Question
2.16 Bayesian program = Description + Question
2.17 Results
2.1 Variable
The variables necessary to write this program are as follows:
1. Spam: a binary variable, false if the e-mail is not spam and true
otherwise.
2.2 Probability
A variable can have one and only one value at a given time, so the value
of Spam is either true or false, as the e-mail may either be spam or not.
However, this value may be unknown. Unknown does not mean that you
do not have any information concerning Spam. For instance, you may know
that the average rate of nonspam e-mail is 25%.
This information may be formalized, writing:
∀y ∈ Y:  Σ_{x ∈ X} P([X = x] | [Y = y]) = 1.0   (2.4)
{(false, false), (false, true), (true, false), (true, true)}   (2.6)
This may be generalized as the conjunction of an arbitrary number of
variables. For instance, in the sequel, we will be very interested in the joint
probability distribution of the conjunction of N + 1 variables:
P(X ∧ Y) = P(X) P(Y | X) = P(Y) P(X | Y)   (2.8)

This rule is better known under the form of the so-called Bayes theorem:

P(Y | X) = P(Y) P(X | Y) / P(X)   (2.9)
However, we prefer the first form, which clearly states that it is a means
2.7 Syllogisms
It is very important to acquire a clear intuitive feeling of what a condi-
tional probability and the conjunction rule mean. A first step toward this
understanding may be to restate the classical logical syllogisms in their prob-
abilistic forms.
Let us first recall the two logical syllogisms:
1. Modus Ponens: a ∧ [a ⇒ b] → b
if a is true and if a implies b then b is true.
2. Modus Tollens: ¬b ∧ [a ⇒ b] → ¬a
if b is false and if a implies b then a is false.
Σ_X P(X ∧ Y) = Σ_X [P(Y) P(X | Y)]   from (2.8)
             = P(Y) Σ_X P(X | Y)      (2.13)
             = P(Y)                   from (2.5)
3. P(Y | X) = P(X ∧ Y) / Σ_Y P(X ∧ Y)

4. P(X | Y) = P(X ∧ Y) / Σ_X P(X ∧ Y)

5. P(X ∧ Y) = P(X ∧ Y)
There are five and only five interesting possible computations with two
variables and these five calculi all come down to sum, product, and division
on the joint probability distribution P (X ∧ Y ).
This is of course also true for a joint distribution on more than two vari-
ables.
For our spam instance, if you know the joint distribution:
P (Spam ∧ W0 ∧ . . . ∧ Wn ∧ . . . ∧ WN −1 ) (2.14)
you can compute any of the 3^(N+1) − 2^(N+1) possible questions that you can
imagine on this set of N + 1 variables (each variable may be placed in one of
three subsets, and the 2^(N+1) assignments in which the searched subset is empty
are excluded).
A question is defined by partitioning a set of variables into three subsets: the
searched variables (on the left of the conditioning bar), the known variables
(on the right of the conditioning bar), and the free variables. The searched
variables set must not be empty.
Examples of these questions are:
3. The a priori probability for the nth word of the dictionary to appear:
P(Wn)
= [Σ_{Spam ∧ W0 ∧ ··· ∧ Wn−1 ∧ Wn+1 ∧ ··· ∧ WN−1} P(Spam ∧ W0 ∧ ··· ∧ WN−1)]
  / [Σ_{Spam ∧ W0 ∧ ··· ∧ WN−1} P(Spam ∧ W0 ∧ ··· ∧ WN−1)]   (2.17)
4. The probability for the nth word to appear, knowing that the text is
a spam:

P(Wn | [Spam = true])
= [Σ_{W0 ∧ ··· ∧ Wn−1 ∧ Wn+1 ∧ ··· ∧ WN−1} P([Spam = true] ∧ W0 ∧ ··· ∧ WN−1)]
  / [Σ_{W0 ∧ ··· ∧ WN−1} P([Spam = true] ∧ W0 ∧ ··· ∧ WN−1)]   (2.18)
5. The probability for the e-mail to be a spam knowing that the nth
word appears in the text:

P(Spam | [Wn = true])
= [Σ_{W0 ∧ ··· ∧ Wn−1 ∧ Wn+1 ∧ ··· ∧ WN−1} P(Spam ∧ W0 ∧ ··· [Wn = true] ··· ∧ WN−1)]
  / [Σ_{Spam ∧ W0 ∧ ··· ∧ Wn−1 ∧ Wn+1 ∧ ··· ∧ WN−1} P(Spam ∧ W0 ∧ ··· [Wn = true] ··· ∧ WN−1)]   (2.19)
6. Finally, the most interesting one, the probability that the e-mail is a
spam knowing for all N words in the dictionary if they are present or
not in the text:
P(Spam | w0 ∧ ··· ∧ wN−1)
= P(Spam ∧ w0 ∧ ··· ∧ wN−1) / Σ_Spam P(Spam ∧ w0 ∧ ··· ∧ wN−1)   (2.20)
2.10 Decomposition
The key challenge for a Bayesian programmer is to specify a way to com-
pute the joint distribution that has the three main qualities of being a good
model, easy to compute, and easy to learn.
This is done using a decomposition that restates the joint distribution as
a product of simpler distributions.
Starting from the joint distribution and applying recursively the conjunc-
tion rule we obtain:
P(Spam ∧ W0 ∧ ··· ∧ WN−1)
= P(Spam) × P(W0 | Spam) × P(W1 | Spam ∧ W0) × ··· × P(WN−1 | Spam ∧ W0 ∧ ··· ∧ WN−2)   (2.21)
This is an exact mathematical expression.
We simplify it drastically by assuming that the probability of appearance
of a word knowing the nature of the text (spam or not) is independent of the
appearance of the other words.
For instance, we assume that:

P(Wn | Spam ∧ W0 ∧ ··· ∧ Wn−1) = P(Wn | Spam)   (2.22)
We finally obtain:
P(Spam ∧ W0 ∧ ··· ∧ WN−1) = P(Spam) ∏_{n=0}^{N−1} P(Wn | Spam)   (2.23)
FIGURE 2.1: The graphical model of a small Bayesian spam filter based on
five words.
• P (Spam):
Each of the N forms P (Wn | Spam) must in turn be specified. The first
idea is to simply count the number of times the nth word of the dictionary
appears in both spam and nonspam. This would naively lead to histograms:
• P(Wn | Spam):
  – P(Wn | [Spam = false]) = anf / af
  – P(Wn | [Spam = true]) = ant / at
where anf stands for the number of appearances of the nth word in nonspam
e-mails and af stands for the total number of nonspam e-mails. Similarly, ant
stands for the number of appearances of the nth word in spam e-mails and at
stands for the total number of spam e-mails.
The drawback of histograms is that when no observation has been made,
the probabilities are null. For instance, if the nth word has never been observed
in spam then:

P([Wn = true] | [Spam = true]) = 0   (2.24)
A very strong assumption indeed, which says that what has not yet been
observed is impossible! Consequently, we prefer to assume that the parametric
forms P (Wn | Spam) are Laplace succession laws rather than histograms:
• P(Wn | Spam):
  – P(Wn | [Spam = false]) = (1 + anf) / (|Wn| + af)
  – P(Wn | [Spam = true]) = (1 + ant) / (|Wn| + at)
where |Wn | stands for the number of possible values of variable Wn . Here,
|Wn | = 2 as Wn is a binary variable.
If the nth word has never been observed in spam then:
P([Wn = true] | [Spam = true]) = 1 / (2 + at)   (2.25)
which tends toward zero when at tends toward infinity but never equals zero.
An event not yet observed is not completely impossible, even if it becomes
very improbable if it has never been observed in a long series of experiments.
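As a quick worked instance (using, for illustration, the 750 spam e-mails that appear later in the Results section), suppose a word has never been observed in spam (ant = 0) after at = 750 observations:

P([Wn = true] | [Spam = true]) = (1 + 0) / (2 + 750) = 1/752 ≈ 0.0013

a small probability, but not a null one.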
2.12 Identification
The N forms P(Wn | Spam) are not yet completely specified because the
2N + 2 parameters anf and ant (for n = 0, ..., N−1), af, and at have no values yet.
The identification of these parameters could be done either by batch pro-
cessing of a series of classified e-mails or by an incremental updating of the
parameters using the user’s classifications of the e-mails as they arrive.
Both methods could be combined: the system could start with initial stan-
dard values of these parameters issued from a generic database, then some
incremental learning customizes the classifier to each individual user.
2.15 Question
Once you have a description (a way to compute the joint distribution), it
is possible to ask any question, as we saw in Section 2.9.
For instance, after some simplification, the answers to our six questions
are:
1.

P(Spam ∧ W0 ∧ ··· ∧ WN−1) = P(Spam) ∏_{n=0}^{N−1} P(Wn | Spam)   (2.26)

By definition, the joint distribution is equal to the decomposition.
2.
P (Spam) = P (Spam) (2.27)
as P (Spam) appears as such in the decomposition.
3.

P(Wn) = Σ_Spam P(Spam) P(Wn | Spam)   (2.28)

The a priori probability for the nth word of the dictionary to appear,
which gives:

P([Wn = true]) = 0.25 × (1 + anf) / (2 + af) + 0.75 × (1 + ant) / (2 + at)   (2.29)
We see that the denomination “a priori” is here misleading as P (Wn )
is completely defined by the description and cannot be fixed in this
model.
4.

P(Wn | [Spam = true]) = (1 + ant) / (2 + at)   (2.30)
as the probability for the nth word to appear knowing that the text
is a spam is already specified in the description.
5.

P(Spam | [Wn = true])
= P(Spam) P([Wn = true] | Spam) / Σ_Spam P(Spam) P([Wn = true] | Spam)   (2.31)
which is the probability for the e-mail to be spam knowing that the nth
word appears in the text.
6.

P(Spam | w0 ∧ ··· ∧ wN−1)
= P(Spam) ∏_{n=0}^{N−1} P(wn | Spam) / Σ_Spam [P(Spam) ∏_{n=0}^{N−1} P(wn | Spam)]   (2.32)
These three sets define a valid question. For example, assuming you only
know two words of the message, the question P (Spam | W0 ∧ W1 ) is defined
by the subsets:

1. Searched = {Spam}
2. Known = {W0, W1}
3. Free = {W2, ..., WN−1}
P(Searched | known)
= Σ_Free P(Searched ∧ known ∧ Free) / Σ_{Searched ∧ Free} P(Searched ∧ known ∧ Free)   (2.34)
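When the joint distribution is small enough to enumerate, Equation 2.34 can be evaluated by brute force. The following Python sketch is ours (it does not use the pypl API); the joint distribution is represented as a dictionary mapping full assignments to probabilities, and the toy example at the end is purely illustrative.

import itertools

def ask(joint, variables, searched, known):
    # Returns P(searched | known) as a dict over the values of the searched variables.
    num = {}
    for assignment, p in joint.items():
        a = dict(zip(variables, assignment))
        if any(a[v] != val for v, val in known.items()):
            continue                                  # keep only assignments compatible with "known"
        key = tuple(a[v] for v in searched)
        num[key] = num.get(key, 0.0) + p              # sum over the Free variables
    z = sum(num.values())                             # sum over Searched and Free (denominator)
    return {k: v / z for k, v in num.items()}

# Toy usage with two binary variables X and Y:
variables = ("X", "Y")
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
print(ask(joint, variables, searched=("Y",), known={"X": 1}))   # P(Y | X = 1)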
Program {
  Description {
    Specification (π) {
      Variables
      Decomposition
      Forms
    }
    Identification (based on δ)
  }
  Question
}   (2.35)
Program {
  Description {
    Specification (π) {
      Variables: Spam, W0, W1, ..., WN−1
      Decomposition:
        P(Spam ∧ W0 ∧ ... ∧ Wn ∧ ... ∧ WN−1) = P(Spam) ∏_{n=0}^{N−1} P(Wn | Spam)
      Forms:
        P(Spam): P([Spam = false]) = 0.25, P([Spam = true]) = 0.75
        P(Wn | Spam): P(Wn | [Spam = false]) = (1 + anf) / (2 + af)
                      P(Wn | [Spam = true]) = (1 + ant) / (2 + at)
    }
    Identification (based on δ)
  }
  Question: P(Spam | w0 ∧ ... ∧ wn ∧ ... ∧ wN−1)
}   (2.36)
2.17 Results
If we consider a spam filter with an N word dictionary, then any given
e-mail contains one and only one of the 2^N possible subsets of the dictionary.
Here we restrict our spam filter to a five word dictionary so that we can
analyze the 2^5 = 32 subsets. Assume that a set of 1000 e-mails is used in the
identification phase and that the resulting numbers of nonspam and spam e-mails
are 250 and 750, respectively. Assume also that the resulting counter tables for anf and ant
are those shown in Table 2.1 and the corresponding distribution P (Wn |Spam)
is given in Table 2.2.
It is now possible to compute the probability for an e-mail to be a spam
or not given it contains or not each of the N words. This may be done using
equation 2.32.
Table 2.3 shows the obtained results for different subsets of words present
in the e-mail.
TABLE 2.1: Counters Resulting from an Analysis of 1000 E-mails. (The values
anf and ant denote the number of e-mails that contained the nth word in nonspam
and spam e-mails, respectively.)

n   Word          anf   ant
0   fortune         0   375
1   next          125     0
2   programming   250     0
3   money           0   750
4   you           125   375
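To illustrate how such results are computed, here is a small Python sketch of ours (it does not use the pypl/ProBT API) that evaluates Equation 2.32 with the counters of Table 2.1, using the Laplace succession laws and the priors P([Spam = true]) = 0.75, P([Spam = false]) = 0.25. The word pattern passed at the end is just an example; Tables 2.2 and 2.3 are not reproduced here.

# Spam posterior (Equation 2.32) from the counters of Table 2.1.
words = ["fortune", "next", "programming", "money", "you"]
a_nf = [0, 125, 250, 0, 125]      # appearances in the 250 nonspam e-mails
a_nt = [375, 0, 0, 750, 375]      # appearances in the 750 spam e-mails
a_f, a_t = 250, 750

def p_word(n, present, spam):
    # Laplace succession law for P(Wn | Spam); |Wn| = 2.
    p_true = (1.0 + (a_nt[n] if spam else a_nf[n])) / (2.0 + (a_t if spam else a_f))
    return p_true if present else 1.0 - p_true

def p_spam(present):
    # Equation 2.32: P([Spam = true] | w0 ∧ ... ∧ w4).
    scores = {}
    for spam, prior in ((True, 0.75), (False, 0.25)):
        s = prior
        for n, w in enumerate(present):
            s *= p_word(n, w, spam)
        scores[spam] = s
    return scores[True] / (scores[True] + scores[False])

# For instance, an e-mail containing only the word "money":
print(p_spam([False, False, False, True, False]))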
Chapter 3

Incompleteness and Uncertainty

The goal of this chapter is twofold: (i) to present the concept of incomplete-
ness and (ii) to demonstrate how incompleteness is a source of uncertainty.
second stage, we pretend that some of the variables and functions of the model
are not available. In other words, we generate a synthetic incompleteness of
our model. The goal is to show the consequences of this incompleteness and
to present a first step toward Bayesian modeling.
FIGURE 3.1: The treatment unit receives two water streams of quality I0
and I1 and generates an output stream of quality O. The resulting quality
depends on I0 , I1 , two unknown variables H and F , and a control variable
C. An operator regulates C, while the value of F is estimated by a sensor
variable S.
The unit takes two water streams as inputs with respective water qualities
I0 and I1 . Two different streams are used because partly purified water is
recycled to dilute the more polluted stream, to facilitate its decontamination.
The unit produces an output stream of quality O.
The internal functioning state of the water treatment unit is described by
the variable F . This variable F quantifies the efficiency of the unit but is not
directly measurable. For instance, as the sandboxes become more loaded with
contaminants the purification becomes less and less efficient and the value of
F becomes lower and lower.
A sensor S helps to estimate the efficiency F of the unit.
A controller C is used to regulate and optimize O, the quality of the water
in the output stream.
Finally, some external factor H may disturb the operation of the unit. For
instance, this external factor could be the temperature or humidity of the air.
For didactic purposes, we consider that these seven variables may each
take 11 different integer values ranging from 0 to 10. The value 0 is the worst
value for I0 , I1 , F , and O, and 10 is the best.
When all variables have their nominal values, the ideal quality Q of the
output stream is given by the equation:
Q = Int((I0 + I1 + F) / 3)   (3.1)

where Int(x) is the integer part of x.
The value of Q never exceeds the value O∗ , reached when the unit is in
perfect condition, with:
O* = Int((I0 + I1 + 10) / 3)   (3.2)
The external factor H may reduce the ideal quality Q and the control
C may try to compensate for this disturbance or the bad condition of the
treatment unit because of F . Consequently, the output quality O is obtained
according to the following equations:
α = Int((I0 + I1 + F + C − H) / 3)   (3.3)

O = α            if 0 ≤ α ≤ O*
O = 2O* − α      if α ≥ O*            (3.4)
O = 0            otherwise
We consider the example of a unit directly connected to the sewer: [I0 = 2],
[I1 = 8].
When [C = 0] (no control) and [H = 0] (no disturbance), Figure 3.2 gives
the value of the quality O according to F , (O∗ = 6).
When the state of operation is not optimal (F different from 10), it is
possible to compensate using C. However, if we over-control, then it may
happen that the output deteriorates. For instance, if [I0 = 2], [I1 = 8], [F = 8],
[H = 0], the outputs obtained for the different values of C are shown in Figure
3.3.
The operation of the unit may be degraded by H. For instance, if [I0 = 2],
[I1 = 8], [F = 8], [C = 0], the outputs obtained for the different values of H are
shown in Figure 3.4.
Finally, the value of the sensor S depends on I0 and F as follows:
S = Int((I0 + F) / 2)   (3.5)
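Equations 3.2 to 3.5 are simple enough to be transcribed directly. The following Python sketch is ours (not part of the book's software); it assumes that Int is implemented with integer division and that all variables are integers in 0..10.

# Sketch of the deterministic water treatment unit (Equations 3.2 to 3.5).
def unit(i0, i1, f, c, h):
    o_star = (i0 + i1 + 10) // 3              # O*, Equation 3.2
    alpha = (i0 + i1 + f + c - h) // 3        # Equation 3.3
    if 0 <= alpha <= o_star:                  # Equation 3.4
        o = alpha
    elif alpha >= o_star:
        o = 2 * o_star - alpha
    else:
        o = 0
    s = (i0 + f) // 2                         # sensor reading, Equation 3.5
    return s, o

# Example: [I0 = 2], [I1 = 8], [F = 8], [C = 0], [H = 0]
print(unit(2, 8, 8, 0, 0))                    # -> (5, 6): sensor reads 5, output quality 6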
The outputs of S in the 121 possible situations for I0 and F are shown
in Figure 3.5. Note that, if we know I0 , I1 , F , H, and C, we know with
certainty the values of both S and O. At this stage, our water treatment unit
is a completely deterministic process. Consequently, a complete model can be
constructed. Now consider what happens if we ignore the exact equations that
rule the water treatment unit and, of course, the existence of the external
factor H. The starting point for constructing our own model is limited to
[Figure 3.2: the output quality O (vertical axis, 0 to 10) as a function of F (horizontal axis, 0 to 10).]
[Figure 3.3: the output O as a function of the control C.]
1. [S = 1] ⇒ [O = 4]
2. [S = 3] ⇒ [O = 5]
3. [S = 4] ⇒ [O = 6]
4. [S = 6] ⇒ [O = 5]
For some other values of S it is not possible to predict the output O with
certainty:
1. If [S = 2], then O may take either the value four or five, with a slightly
higher probability for four. Indeed, when [S = 2], F may be either two or
three (see Figure 3.6), and O will then be, respectively, either four or five.
2. If [S = 5], then O may take the value either five or six, with a slightly
lower probability for five. When [S = 5], F may be either eight or
nine.
FIGURE 3.4: The output O as a function of the external factor H with inputs,
functioning state, and control fixed to: [I0 = 2], [I1 = 8], [F = 8], [C = 0].
results with more uncertainty due to the effect on the output of the hidden
variable H. The obtained data when [I0 = 2], [I1 = 8], [C = 2] is presented on
Figure 3.8.
In contrast with our previous experiment, this time no value of S is suffi-
cient to infer the value of O exactly.
The dispersion of the observations is the direct translation of the effect
of H. Taking into account the effect of hidden variables such as H and even
measuring their importance is one of the major challenges that Bayesian Pro-
gramming must face. This is not an easy task when you are not even aware
of the nature and number of these hidden variables!
[Figure 3.5: the sensor reading S for the 121 possible combinations of I0 and F.]
model with no hidden variables. The effect of these hidden variables is that
the model and the phenomenon never have exactly reproducible behavior.
Uncertainty appears as a direct consequence of this incompleteness. Indeed,
the model may not completely take into account the data and may not predict
exactly the behavior of the phenomenon. For instance, in the above example,
the influence of the hidden variable H makes it impossible to predict with
certainty the output O given the inputs I0 and I1 , the reading of the sensor
S, and the control C.
[Figure 3.6: the sensor reading S as a function of F.]
is not completely hidden but is only partially known and accessible. Even
though it is weak, this incompleteness still generates uncertainty.
4. Note the absence of H.
FIGURE 3.7: The histogram of the observed sensor state S and the output
O when the inputs, the control, and the external factor are fixed to [I0 = 2],
[I1 = 8], [C = 2], [H = 0], and the internal function F is generated randomly
with a uniform distribution.
S = Int((I0 + F) / 2)   (3.8)
It would lead to false predictions of the output O and, consequently, to
wrong control decision on C to optimize this output.
For instance, scanning the 11 different possible values for C when [I0 = 2],
[I1 = 8], [F = 8] and consequently [S = 5], the above model predicts that in-
differently for [C = 0], [C = 1], and [C = 2], O will take its optimal value: six
(see Figure 3.3).
The observations depict a somewhat different and more complicated “re-
ality” as shown in Figure 3.9. The choice of C to optimize O is now more
complicated but also more informed. The adequate choice of C to produce
the optimal output [O = 6] is now, with nearly equivalent probabilities, to
select a value of C greater than or equal to two. Indeed, this is a completely
different choice from when the “exact” model is used!
FIGURE 3.8: The histogram of the observed sensor state S and the output
O when the inputs and the control are set to [I0 = 2], [I1 = 8], [C = 2], and
the values of the external factor and the internal functioning H ∧ F are drawn
at random.
with a well established formal theory: probability calculus. The sequel of this
book will try to explain how to do this.
In the Bayesian Programming approach, the programmer does not propose
an exact model but rather expresses a probabilistic canvas in the specification
phase. This probabilistic canvas gives some hints about what observations
are expected. The specification is not a fixed and rigid model purporting
completeness. Rather, it is a framework, with open parameters, waiting to
be shaped by the experimental data. Learning is the means of setting these
parameters. The resulting probabilistic descriptions come from both: (i) the
views of the programmer and (ii) the physical interactions specific of each
phenomenon. Even the influence of the hidden variables is taken into account
and quantified; the more important their effects, the more noisy the data, and
the more uncertain the resulting descriptions.
The theoretical foundations of Bayesian Programming may be summed up
by Figure 3.10.
The first step in Figure 3.10 transforms the irreducible incompleteness
into uncertainty. Starting from the specification and the experimental data,
learning builds probability distributions.
The maximum entropy principle is the theoretical foundation of this first
step. Given some specifications and some data, the probability distribution
that maximizes the entropy is the distribution that best represents the combined
specification and data. Entropy gives a precise, mathematical, and quantifiable
meaning to the quality of a distribution.5

FIGURE 3.9: The histogram of the observed output O and the control C
when the inputs are set to [I0 = 2], [I1 = 8], and the internal functioning F is
set to [F = 8], with H drawn at random.
Two extreme examples may help to understand what occurs:
1. Suppose that we are studying a formal phenomenon. There may not
be any hidden variables. A complete model may be proposed. The
phenomenon and the model could be identical. For instance, this
would be the case if we take the equations of Section 3.1.1 as the
model of the phenomenon described in that same section. If we select
this model as the specification, any data set will lead to a descrip-
tion made of Diracs. There is no uncertainty; any question may be
answered either by true or false. Logic appears as a special case of
the Bayesian approach in that particular context (see Cox [1979]).
2. At the opposite extreme, suppose that the specification consists of
very poor hypotheses about the modeled phenomenon, for instance,
by ignoring H and also the inputs I0 and I1 in a model of the above
process. Learning will only lead to flat distributions, containing no
information. No relevant decisions can be made, only completely ran-
dom ones.
Specifications allow us to build general models where inaccuracy and hidden
5. See FAQ/FAM, Section 16.11, "Maximum entropy principle justifications," for justifications for the use of the maximum entropy principle.
variables may be explicitly represented. These models may lead to good pre-
diction and decision. The formalism also allows us to take into account missing
variables. In real life, such models are in general poorly informative and may
not be useful in practical applications. They give no certitudes, although they
provide a means of taking the best possible decision according to the available
information. This is the case here when the only hidden variable is H.
The second step in Figure 3.10 consists of reasoning with the probability
distributions obtained by the first step. To do so, we only require the two
basic rules of Bayesian inference presented in Chapter 2. These two rules are
to Bayesian inference what the resolution principle is to logical reasoning (see
Robinson [1965], Robinson [1979], Robinson and Silbert [1982a], and Robinson
and Silbert [1982b]). These inferences may be as complex and subtle as those
usually achieved with logical inference tools, as will be demonstrated in the
different examples presented in the sequel of this book.
Chapter 4
Description = Specification +
Identification
1. "The scientific mind forbids us to have an opinion on questions that we do not understand, on questions that we do not know how to formulate clearly. Above all, one must know how to pose problems. And, whatever may be said, in scientific life problems do not pose themselves. It is precisely this 'sense of the problem' that marks the true scientific mind. For a scientific mind, all knowledge is an answer to a question. If there has been no question, there can be no scientific knowledge. Nothing is self-evident. Nothing is given. Everything is constructed." (translated from the French) [Bachelard, 1938].
Descriptions are the basic elements that are used, combined, composed, ma-
nipulated, computed, compiled, and questioned in different ways to build
Bayesian programs.
Dir = Floor( [90 (Px5 − Px0) + 45 (Px4 − Px1) + 5 (Px3 − Px2)]
             / [9 (1 + Px0 + Px1 + Px2 + Px3 + Px4 + Px5)] )   (4.1)
• The robot is piloted solely by its rotation speed (the translation speed
is fixed). It receives motor commands from the Rot variable, calculated
from the difference between the rotation speeds of the left and right
wheels. Rot takes on values between −10 (fastest to the left) and +10
(fastest to the right).
FIGURE 4.2: The sensor and motor variables of the Khepera robot.
4.1.2.1 Specification
Having defined our goal, we describe the three steps necessary to define
the preliminary knowledge.
Variables: First, the programmer specifies which variables are pertinent for
the task. To push objects it is necessary to have an idea of the position of the
objects relative to the robot. The front proximeters provide this information.
However, we chose to summarize the information from these six proximeters
by the two variables Dir and P rox.
We also chose to set the translation speed to a constant and to operate
the robot by its rotation speed Rot.
Description = Specification + Identification 51
These three variables are all we require to push obstacles. Their definitions
are summarized as follows
This equality simply results from the application of the conjunction rule
(2.8).
4. Bell-shaped distributions are distributions of discrete variables that have a Gaussian shape. They are denoted with the B symbol and defined by their means and standard deviations, as regular Gaussian distributions of continuous variables are.
4.1.2.2 Identification
To set the values of these free parameters we drive the robot with a joystick
and collect a set of data.
Every tenth of a second, we obtain the value of Dir and P rox from the
proximeters and the value of Rot from the joystick. Let us call the particular
set of data corresponding to this experiment δpush . A datum collected at time
t is a triplet (rott , dirt , proxt ). During the 30 seconds of learning, 300 such
triplets are recorded.
From the collection δpush of such data, it is very simple to compute the
corresponding values of the free parameters. We first sort the data into 336
groups, each corresponding to a given position of the object and then compute
the mean and standard deviations of Rot for each of these groups.
There are only 300 triplets for 336 groups. Moreover, these 300 triplets
are concentrated around some particular position often observed when push-
ing obstacles. Consequently, it may often happen that a given position never
occurs and that no data is collected for this particular situation. In that case,
we set the corresponding mean to 0 and the standard deviation to 10. The
bell-shaped distribution is then flat, close to a uniform distribution.
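This grouping step can be written in a few lines. The sketch below is ours (not the original pypl code); it turns the recorded triplets into one mean and one standard deviation of Rot per (Dir, Prox) cell, with the default values (mean 0, standard deviation 10) for unobserved cells.

# Identification: one bell-shaped distribution of Rot per (Dir, Prox) cell.
import math
from collections import defaultdict

def identify(triplets):
    # triplets: list of (rot, dir, prox) collected every tenth of a second
    groups = defaultdict(list)
    for rot, d, p in triplets:
        groups[(d, p)].append(rot)
    params = {}
    for d in range(-10, 11):            # 21 values of Dir
        for p in range(0, 16):          # 16 values of Prox
            rots = groups.get((d, p), [])
            if rots:
                mean = sum(rots) / len(rots)
                std = math.sqrt(sum((r - mean) ** 2 for r in rots) / len(rots))
                params[(d, p)] = (mean, max(std, 1e-6))   # avoid a zero standard deviation
            else:
                params[(d, p)] = (0.0, 10.0)              # unobserved cell: nearly flat
    return params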
Figure 4.3 presents three of the 336 curves. The first one corresponds to an
obstacle very close to the left ([Dir = −10], [P rox = 13]), and shows that the
robot should turn to the left rapidly with average uncertainty. The second one
corresponds to an obstacle right in front and in contact ([Dir = 0], [P rox =
15]), and shows that the robot should go straight with very low uncertainty.
Finally, the last one shows an unobserved situation where the uncertainty is
maximal ([Dir = 3], [P rox = 0]).
FIGURE 4.3: P (Rot | Dir ∧ P rox) when pushing objects for different situ-
ations.
4.1.2.3 Results
To render the pushing obstacle behavior just learned, a decision on Rot is
made every tenth of a second according to the following algorithm.
1. The sensors are read and the values of dirt and proxt are computed.
2. The corresponding distribution P Rot | Dir = dirt ∧ P rox = proxt
We keep the exact same specification, changing only the data to be learned.
The resulting description is, however, completely different: following contours
of objects instead of pushing them.
4.1.3.1 Specification
Variables: To follow the contours, we must know where the object is situ-
ated relative to the robot. This is defined by the variables Dir and P rox, as
in the previous experiment. We must also pilot the robot using its rotation
speed with the variable Rot. The required variables are thus exactly the same
as previously:
4.1.3.2 Identification
In contrast, the learned data are completely different, because we are driv-
ing the robot to do some contour following (see Movie 3). The learning process
is the same but the data set, called δf ollow , is completely different.
The collection δf ollow of data leads to completely different values of the 336
means and standard deviation of the bell-shaped distributions. This clearly
appears in the following distributions presented for the same relative positions
of the object and the robot as in the previous experiment:
• Figure 4.4 shows the two distributions obtained after learning for both
experiments (pushing objects and following contours) when the object
is close to the left ([Dir = −10], [P rox = 13]). When pushing, the robot
turns left to face the object; on the contrary, when following, the robot
goes straight, bordering the object.
FIGURE 4.4: P(Rot | [Dir = −10] ∧ [Prox = 13]) when pushing objects and
when following contours.
• Figure 4.5 shows the two distributions obtained after learning for both
experiments (pushing objects and following contours) when the object is
in contact right in front of the robot P(Rot | [Dir = 0] ∧ [Prox = 15]).
When pushing, the robot goes straight. On the contrary, when following,
the robot turns to the right to have the object on its left. However, the
uncertainty is larger in this last case.
FIGURE 4.5: P(Rot | [Dir = 0] ∧ [Prox = 15]) when pushing objects and
when following contours.
4.1.3.3 Result
The restitution process is also the same, but as the bell-shaped distribu-
tions are different, the resulting behavior is completely different, as demonstrated
by Movie 3. It should be noted that one turn around the object is
enough to learn the contour following behavior.
4.2.1 Specification
4.2.1.1 Variables
Following our Bayesian Programming methodology, the first step in build-
ing this description is to choose the pertinent variables.
The variables to be used by our Bayesian model are obviously the following:

I0, I1, F, S, C, O ∈ {0, ..., 10}
4.2.1.2 Decomposition
Using the conjunction postulate (2.8) iteratively, we can write that the
joint probability distribution of the six variables is equal to:
P (I0 ∧ I1 ∧ F ∧ S ∧ C ∧ O)
= P (I0 ) × P (I1 |I0 ) × P (F |I0 ∧ I1 ) × P (S|I0 ∧ I1 ∧ F ) (4.17)
×P (C|I0 ∧ I1 ∧ F ∧ S) × P (O|I0 ∧ I1 ∧ F ∧ S ∧ C)
This is an exact mathematical expression. The designer knows more about
the process than this exact form. For instance, he or she knows that:
1. The qualities of the two input streams I0 and I1 are independent:
P (I1 | I0 ) = P (I1 )
P (F | I0 ∧ I1 ) = P (F )
P (S | I0 ∧ I1 ∧ F ) = P (S | I0 ∧ F ) (4.18)
P (C | I0 ∧ I1 ∧ F ∧ S) = P (C) (4.19)
P (O | I0 ∧ I1 ∧ F ∧ S ∧ C) = P (O | I0 ∧ I1 ∧ S ∧ C) (4.20)
P (I0 ∧ I1 ∧ F ∧ S ∧ C ∧ O)
= P (I0 ) × P (I1 ) × P (F ) × P (S|I0 ∧ F ) (4.21)
×P (C) × P (O|I0 ∧ I1 ∧ S ∧ C)
We see here a first example of the “art of decomposing” a joint distribution.
The decomposition is a means to compute the joint distribution and, conse-
quently, answer all possible questions. This decomposition has the following
qualities:
4.2.1.3 Forms
To finish the specification task, we must still specify the parametric forms
of the distribution appearing in the decomposition:
P (F ) ≡ Uniform (4.24)
where:
δ_{Int((I0 + F) / 2)}   (4.26)

S = Int((I0 + F) / 2)   (4.27)
4. Not knowing the desired output O, all possible controls are equally
probable:
4.2.2 Identification
After the specification phase, we end up with 11^5 = 161,051 free parameters
to identify. To do this we will run the simulator described in Chapter
3, drawing at random with uniform distributions I0, I1, F, H, and C. For
each of these draws (for instance, 10 of them), we compute the corresponding
values of the sensor S and the output O. We then update the 11^4 histograms
according to the values of I0, I1, S, C, and O.
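A possible transcription of this identification loop in Python is sketched below (it is ours, not the ProBT implementation); the simulator follows Equations 3.2 to 3.5 of Chapter 3, H is drawn at random as stated above, and the number of draws passed at the end is only an example.

# Identification by simulation: fill the histograms of P(O | I0 ∧ I1 ∧ S ∧ C).
import random
from collections import defaultdict

def simulate(i0, i1, f, c, h):
    # Deterministic unit of Chapter 3 (Equations 3.2 to 3.5).
    o_star = (i0 + i1 + 10) // 3
    alpha = (i0 + i1 + f + c - h) // 3
    if 0 <= alpha <= o_star:
        o = alpha
    elif alpha >= o_star:
        o = 2 * o_star - alpha
    else:
        o = 0
    return (i0 + f) // 2, o                      # (S, O)

def identify(n_draws):
    counts = defaultdict(lambda: [0] * 11)       # (i0, i1, s, c) -> counts over the 11 values of O
    for _ in range(n_draws):
        i0, i1, f, h, c = (random.randint(0, 10) for _ in range(5))
        s, o = simulate(i0, i1, f, c, h)
        counts[(i0, i1, s, c)][max(0, min(10, o))] += 1   # clamp defensively to 0..10
    return counts

histograms = identify(100000)                    # for instance, one hundred thousand draws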
Program {
  Description {
    Specification (π) {
      Variables: I0, I1, F, S, C, O
      Decomposition:
        P(I0 ∧ I1 ∧ F ∧ S ∧ C ∧ O)
        = P(I0) × P(I1) × P(F) × P(S | I0 ∧ F) × P(C) × P(O | I0 ∧ I1 ∧ S ∧ C)
      Forms:
        P(I0) ≡ Uniform
        P(I1) ≡ Uniform
        P(F) ≡ Uniform
        P(S | I0 ∧ F) ≡ δ_{S = Int((I0 + F) / 2)}
        P(C) ≡ Uniform
        P(O | I0 ∧ I1 ∧ S ∧ C) ≡ Histograms
    }
    Identification
  }
  Question:
}   (4.29)
4.2.4 Results
Such histograms have already been presented in previous chapters; for instance,
Figure 3.8, reproduced below as Figure 4.7, shows a collection of 11 of these
histograms P(O | I0 ∧ I1 ∧ S ∧ C) when [I0 = 2], [I1 = 8], [C = 2],
and S varies. The complete description of the elementary water treatment
unit is made of 11^4 histograms, 11^3 times as much data as in Figure 4.7.
[Figure 4.7: the 11 histograms P(O | [I0 = 2] ∧ [I1 = 8] ∧ S ∧ [C = 2]), one for each value of S (same data as Figure 3.8).]
may be found in the FAQ/FAM: Objectivism vs. subjectivism controversy and the “mind
projection fallacy,” Section 16.13.
This strict and simple methodology and framework for the expression of the
preliminary knowledge of the programmer present several fundamental advan-
tages:
Chapter 5

The Importance of Conditional Independence

What we call chance is, and may only be, the ignored cause of a known effect.

Dictionnaire Philosophique
Voltaire [1993–1764, 2005]
The goal of this chapter is both to explain the notion of conditional inde-
pendence and to demonstrate its importance in actually solving and comput-
ing complex problems.
The units M0 and M1 take the same inputs I0 and I1 . They respectively
produce O0 and O1 as outputs, which in turn are used as inputs by M2. M3
takes I3 and O2 as inputs, and finally produces O3 as output. The four water
treatment units have four internal states (respectively F0 , F1 , F2 , and F3 ), four
sensors (respectively S0 , S1 , S2 , and S3 ), four controllers (respectively C0 , C1 ,
C2 , and C3 ), and may all be perturbed by some external factors (respectively
H0 , H1 , H2 , and H3 ). The production of each of these units is governed by
Equations 3.3 and 3.4. The sensors take their values according to Equation
3.5.
5.2.1 Specification
5.2.1.1 Variables
There are now 19 variables in our global Bayesian model:
I0 , I1 , I3 , F0 , F1 , F2 , F3 , S0 , S1 , S2 , S3 , C0 , C1 , C2 , C3 , O0 , O1 , O2 , O3 ∈ {0, . . . , 10}
5.2.1.2 Decomposition
Using the conjunction postulate (2.8) iteratively as in the previous chapter
it is possible to write the joint probability on the 19 variables as:
P (I0 ∧ I1 ∧ I3 ∧ . . . ∧ O2 ∧ O3 )
= P (I0 ∧ I1 ∧ I3 )
× P (F0 ∧ S0 ∧ C0 ∧ O0 | I0 ∧ I1 ∧ I3 )
× P (F1 ∧ S1 ∧ C1 ∧ O1 | I0 ∧ I1 ∧ I3 ∧ F0 ∧ S0 ∧ C0 ∧ O0 )
× P (F2 ∧ S2 ∧ C2 ∧ O2 | I0 ∧ I1 ∧ I3 ∧ F0 ∧ S0 ∧ C0 ∧ O0 ∧ F1 ∧ S1 ∧ C1 ∧ O1 )
× P (F3 ∧ S3 ∧ C3 ∧ O3 | I0 ∧ I1 ∧ I3 . . . ∧ S1 ∧ C1 ∧ O1 ∧ F2 ∧ S2 ∧ C2 ∧ O2 )
P(F0 ∧ S0 ∧ C0 ∧ O0 | I0 ∧ I1 ∧ I3) = P(F0 ∧ S0 ∧ C0 ∧ O0 | I0 ∧ I1)   (5.1)

P(F1 ∧ S1 ∧ C1 ∧ O1 | I0 ∧ I1 ∧ I3 ∧ F0 ∧ S0 ∧ C0 ∧ O0) = P(F1 ∧ S1 ∧ C1 ∧ O1 | I0 ∧ I1)   (5.2)

P(F2 ∧ S2 ∧ C2 ∧ O2 | I0 ∧ I1 ∧ I3 ∧ F0 ∧ S0 ∧ C0 ∧ O0 ∧ F1 ∧ S1 ∧ C1 ∧ O1)
= P(F2 ∧ S2 ∧ C2 ∧ O2 | I0 ∧ I1 ∧ F0 ∧ S0 ∧ C0 ∧ O0 ∧ F1 ∧ S1 ∧ C1 ∧ O1)
know the value of O1 , then we do not care anymore about the values
of F1 , S1 , and C1 . This is called conditional independence between
variables and is a main tool to build interesting and efficient descrip-
tions. One should be very careful that conditional independence has
nothing in common with independence. The variable S2 depends on
C0 (P(S2 | C0) ≠ P(S2)), but is conditionally independent of C0 if
O0 is known (P (S2 | C0 ∧ O0 ) = P (S2 | O0 )).
See Section 5.3.1 in the sequel for further discussions of this point.
This finally leads to:
P (F2 ∧ S2 ∧ C2 ∧ O2 |I0 ∧ I1 ∧ I3 ∧ F0 ∧ S0 ∧ · · · ∧ C1 ∧ O1 )
= P (F2 ∧ S2 ∧ C2 ∧ O2 |O0 ∧ O1 )
(5.3)
5. For the same kind of reasons, we find:
P (F3 ∧ S3 ∧ C3 ∧ O3 |I0 ∧ I1 ∧ I3 ∧ F0 ∧ S0 ∧ · · · ∧ C2 ∧ O2 )
= P (F3 ∧ S3 ∧ C3 ∧ O3 |I3 ∧ O2 )
(5.4)
P (I0 ∧ I1 ∧ I3 ∧ . . . ∧ O2 ∧ O3 )
= P (I0 ∧ I1 ∧ I3 )
× P (F0 ∧ S0 ∧ C0 ∧ O0 | I0 ∧ I1 )
× P (F1 ∧ S1 ∧ C1 ∧ O1 | I0 ∧ I1 )
× P (F2 ∧ S2 ∧ C2 ∧ O2 | O0 ∧ O1 )
× P (F3 ∧ S3 ∧ C3 ∧ O3 | I3 ∧ O2 ) (5.5)
1. Concerning M0:
P (F0 ∧ S0 ∧ C0 ∧ O0 | I0 ∧ I1 )
= P (F0 )
× P (S0 | I0 ∧ F0 )
× P (C0 )
× P (O0 | I0 ∧ I1 ∧ S0 ∧ C0 ) (5.6)
2. Concerning M1:
P (F1 ∧ S1 ∧ C1 ∧ O1 | I0 ∧ I1 )
= P (F1 )
× P (S1 | I0 ∧ F1 )
× P (C1 )
× P (O1 | I0 ∧ I1 ∧ S1 ∧ C1 ) (5.7)
3. Concerning M2:
P (F2 ∧ S2 ∧ C2 ∧ O2 | O0 ∧ O1 )
= P (F2 )
× P (S2 | O0 ∧ F2 )
× P (C2 )
× P (O2 | O0 ∧ O1 ∧ S2 ∧ C2 ) (5.8)
4. Concerning M3:
P (F3 ∧ S3 ∧ C3 ∧ O3 | I3 ∧ O2 )
= P (F3 )
× P (S3 | I3 ∧ F3 )
× P (C3 )
× P (O3 | I3 ∧ O2 ∧ S3 ∧ C3 ) (5.9)
After some reordering, we obtain the following final decomposition and the
associated graphical model (Figure 5.2):
P (I0 ∧ I1 ∧ I3 ∧ . . . ∧ O2 ∧ O3 )
= P (I0 ) × P (I1 ) × P (I3 )
× P (F0 ) × P (F1 ) × P (F2 ) × P (F3 )
× P (C0 ) × P (C1 ) × P (C2 ) × P (C3 )
× P (S0 | I0 ∧ F0 ) × P (S1 | I0 ∧ F1 ) × P (S2 | O0 ∧ F2 ) × P (S3 | I3 ∧ F3 )
× P (O0 | I0 ∧ I1 ∧ S0 ∧ C0 ) × P (O1 | I0 ∧ I1 ∧ S1 ∧ C1 )
× P (O2 | O0 ∧ O1 ∧ S2 ∧ C2 ) × P (O3 | I3 ∧ O2 ∧ S3 ∧ C3 ) (5.10)
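For reference, the conditional independencies of Equation 5.10 can be read as a list of parents for each variable; the small Python dictionary below is our own summary of that structure (it matches the graphical model of Figure 5.2 shown next), not code from the book.

# Parent sets read directly from the decomposition of Equation 5.10.
parents = {
    "I0": [], "I1": [], "I3": [],
    "F0": [], "F1": [], "F2": [], "F3": [],
    "C0": [], "C1": [], "C2": [], "C3": [],
    "S0": ["I0", "F0"], "S1": ["I0", "F1"], "S2": ["O0", "F2"], "S3": ["I3", "F3"],
    "O0": ["I0", "I1", "S0", "C0"], "O1": ["I0", "I1", "S1", "C1"],
    "O2": ["O0", "O1", "S2", "C2"], "O3": ["I3", "O2", "S3", "C3"],
}

for var in ("S2", "O3"):
    print(var, "depends directly on", parents[var])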
FIGURE 5.2: The graphical model of the decomposition of the joint distri-
bution as defined by Equation 5.10.
5.2.1.3 Forms
The distributions P (I0 ), P (I1 ), P (I3 ), P (F0 ), P (F1 ), P (F2 ), P (F3 ),
P (C0 ), P (C1 ), P (C2 ), and P (C3 ) are all assumed to be uniform distributions.
P(S0 | I0 ∧ F0), P(S1 | I0 ∧ F1), P(S2 | O0 ∧ F2), and P(S3 | I3 ∧ F3)
are all Dirac distributions.
Finally, the four distributions relating the output to the inputs, the sensor,
and the control are all specified as histograms as in the previous chapter.
5.2.2 Identification
We have now four series of 11^4 histograms to identify (4 × 11^5 free
parameters).
In a real control and diagnosis problem, the four production units, even
though they are identical, would most probably function slightly differently
(because of incompleteness and some other hidden variables besides H). In
that case, the best thing to do would be to perform four different identification
campaigns to take these small differences into account by learning.
In this didactic example, as the four units are simulated and formal, they
really are perfectly identical (there are no possible hidden variables but H in
our formal model as specified by Equations 3.3, 3.4, and 3.5). Consequently,
we use the exact same histogram for the four units as was identified in the
previous chapter.
Program {
  Description {
    Specification (π) {
      Variables: I0, I1, I3, F0, ..., C3, O0, O1, O2, O3
      Decomposition:
        P(I0 ∧ I1 ∧ I3 ∧ ... ∧ O2 ∧ O3)
        = P(I0) × P(I1) × P(I3)
          × P(F0) × P(F1) × P(F2) × P(F3)
          × P(C0) × P(C1) × P(C2) × P(C3)
          × P(S0 | I0 ∧ F0) × P(S1 | I0 ∧ F1) × P(S2 | O0 ∧ F2) × P(S3 | I3 ∧ F3)
          × P(O0 | I0 ∧ I1 ∧ S0 ∧ C0) × P(O1 | I0 ∧ I1 ∧ S1 ∧ C1)
          × P(O2 | O0 ∧ O1 ∧ S2 ∧ C2) × P(O3 | I3 ∧ O2 ∧ S3 ∧ C3)
      Forms:
        P(I0), ..., P(C3) ≡ Uniform
        P(S0 | I0 ∧ F0), ..., P(S3 | I3 ∧ F3) ≡ Dirac
        P(O0 | I0 ∧ I1 ∧ S0 ∧ C0) ≡ Histogram
        ...
        P(O3 | I3 ∧ O2 ∧ S3 ∧ C3) ≡ Histogram
    }
    Identification
  }
  Question: ?
}   (5.11)
For the first case, for production unit M0, the first entry I0 and the internal
state F0 are independent, but are not conditionally independent knowing S0,
the reading of the sensor. Figure 5.3 shows, for instance, the corresponding
probabilities for I0 = 5 and S0 = 6.
[Figure 5.3: the distributions P(F0), P(F0 | I0 = 5), P(F0 | S0 = 6), and P(F0 | I0 = 5 ∧ S0 = 6) over the efficiency F0 of unit 0.]
This is a very common situation where the two causes of a phenomenon are
independent but are conditionally dependent on one another, knowing their
common consequence (see Figure 5.4). Otherwise, no sensor measuring several
factors would be of any use.
[Figure 5.4: the graphical model in which I0 and F0 are both parents of S0.]
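This effect can be reproduced numerically by enumerating the small subgraph of Figure 5.4 with uniform priors on I0 and F0 and the Dirac sensor S0 = Int((I0 + F0) / 2) of Equation 3.5. The sketch below is ours; it computes the four distributions compared in Figure 5.3.

# I0 and F0 are independent a priori but become dependent once S0 is known.
def posterior_f0(i0_obs=None, s0_obs=None):
    weights = [0.0] * 11
    for i0 in range(11):
        for f0 in range(11):
            s0 = (i0 + f0) // 2                  # Dirac sensor
            if i0_obs is not None and i0 != i0_obs:
                continue
            if s0_obs is not None and s0 != s0_obs:
                continue
            weights[f0] += 1.0                   # uniform priors: equal weight per (i0, f0) pair
    z = sum(weights)
    return [w / z for w in weights]

print(posterior_f0())                            # P(F0): uniform
print(posterior_f0(i0_obs=5))                    # P(F0 | I0 = 5): still uniform (independence)
print(posterior_f0(s0_obs=6))                    # P(F0 | S0 = 6)
print(posterior_f0(i0_obs=5, s0_obs=6))          # P(F0 | I0 = 5 ∧ S0 = 6): only F0 in {7, 8} remain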
[Figure 5.5: the distributions P(S2), P(S2 | C0 = 10), P(S2 | O0 = 5), and P(S2 | C0 = 10 ∧ O0 = 5) over the sensor reading S2.]
This is also a very common situation where there is a causal chain between
three variables (see Figure 5.6).
[Figure 5.6: the causal chain C0 → O0 → S2.]
For any model of any phenomenon, knowing what depends on what, what
does not influence what, and, most importantly, which bias can be neglected
compared to another, is fundamental knowledge.
A model where everything depends on everything else is a very poor model,
indeed. In probabilistic terms, such a model would be a joint distribution on
all the relevant variables coded as a huge table containing the probabilities
of all the possible cases. In our example, simple as it is, it would be a ta-
ble containing the 263 probability values necessary for the joint distribution
74 Bayesian Programming
1119 ≈ 263
Such a table would encode all the necessary information, but in a very
poor manner. Hopefully, a model does not usually code the joint distribution
in this way but rather uses a decomposition and the associated conditional
independencies to express the knowledge in a structured and formal way. The
probabilistic model of the production models as expressed by Equation 5.10
only requires on the order of 2^18 probability values to encode the joint distribution.
Chapter 6

Bayesian Program = Description + Question
clear separation between the model and its use is a very important feature of
Bayesian Programming.
Program {
  Description {
    Specification (π) {
      Variables: I0, I1, I3, F0, ..., C3, O0, O1, O2, O3
      Decomposition:
        P(I0 ∧ I1 ∧ I3 ∧ ... ∧ O2 ∧ O3)
        = P(I0) × P(I1) × P(I3)
          × P(F0) × P(F1) × P(F2) × P(F3)
          × P(C0) × P(C1) × P(C2) × P(C3)
          × P(S0 | I0 ∧ F0) × P(S1 | I0 ∧ F1) × P(S2 | O0 ∧ F2) × P(S3 | I3 ∧ F3)
          × P(O0 | I0 ∧ I1 ∧ S0 ∧ C0) × P(O1 | I0 ∧ I1 ∧ S1 ∧ C1)
          × P(O2 | O0 ∧ O1 ∧ S2 ∧ C2) × P(O3 | I3 ∧ O2 ∧ S3 ∧ C3)
      Forms:
        P(I0), ..., P(C3) ≡ Uniform
        P(S0 | I0 ∧ F0), ..., P(S3 | I3 ∧ F3) ≡ Dirac
        P(O0 | I0 ∧ I1 ∧ S0 ∧ C0) ≡ Histogram
        ...
        P(O3 | I3 ∧ O2 ∧ S3 ∧ C3) ≡ Histogram
    }
    Identification
  }
  Question: ?
}   (6.1)
Quite a simple model indeed, thanks to our knowledge of this process, even if it took some time to make all its subtleties explicit! However, the question is still unspecified. Specifying and answering different possible questions is the purpose of the rest of this chapter.
6.2.1 Question
For instance, we may look for the value of O0 knowing the values of I0 , I1 ,
S0 , and C0 . The corresponding question will be:
P (O0 | I0 ∧ I1 ∧ S0 ∧ C0 )
This distribution is directly available since it is given to specify the de-
scription. This could also be inferred using the simple algorithm formulae
2.34.
Let’s define F ree, Searched, and Known as:
F ree = I3 ∧ F0 ∧ F1 ∧ F2 ∧ F3 ∧ S1 ∧ S2 ∧ S3 ∧ C1 ∧ C2 ∧ C3 ∧ O1 ∧ O2 ∧ O3
Known = I0 ∧ I1 ∧ S0 ∧ C0
Searched = O0
(6.2)
In this particular case P(Searched | Known) appears in the decomposition and the previous Bayesian program may be rewritten as follows.
Va: Free, Known, Searched
Dc:
  P(Free ∧ Searched ∧ Known)
  = P(Free ∧ Known) × P(Searched | Known)
Fo:
  P(Free ∧ Known): any distribution
  P(Searched | Known): any distribution
Identification
Qu: P(Searched | known)
(6.3)
P(Searched | known)
= [Σ_Free P(Free ∧ Searched ∧ known)] / [Σ_Free Σ_Searched P(Free ∧ Searched ∧ known)]

P(Searched | known)
= [Σ_Free P(Free ∧ known) × P(Searched | known)] / [Σ_Free Σ_Searched P(Free ∧ known) × P(Searched | known)]

Using the marginalization rule Σ_Free P(Free ∧ known) = P(known):

P(Searched | known)
= [P(known) × P(Searched | known)] / [Σ_Free P(Free ∧ known) × Σ_Searched P(Searched | known)]

Using again the marginalization rule Σ_Free P(Free ∧ known) = P(known) and the normalization rule Σ_Searched P(Searched | known) = 1, we obtain:

P(Searched | known) = P(Searched | known)

which could have been obtained directly since it appears in the decomposition.
6.2.2 Results
The corresponding results have already been presented in Chapter 4 for
[I0 = 2], [I1 = 8], [C0 = 2], and for all the values of S0 (see Figure 4.7).
P(O3 | I0 ∧ I1 ∧ I3 ∧ S0 ∧ S1 ∧ S2 ∧ S3 ∧ C0 ∧ C1 ∧ C2 ∧ C3) (6.4)

Searched = {O3}
Known = {I0, I1, I3, S0, S1, S2, S3, C0, C1, C2, C3} (6.5)
Free = {F0, F1, F2, F3, O0, O1, O2}
P(Searched | known) = [Σ_Free P(Searched ∧ known ∧ Free)] / P(known)
As the distributions for the entries, the sensors, and the controllers are uniform, P(known) is a constant for any value {i0, i1, i3, s0, s1, s2, s3, c0, c1, c2, c3} of the Known variables.
P(O3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3) =
(1/Z) × Σ_{F0,F1,F2,F3,O0,O1,O2} [
    P(i0) × P(i1) × P(i3)
  × P(F0) × P(F1) × P(F2) × P(F3)
  × P(c0) × P(c1) × P(c2) × P(c3)
  × P(s0 | i0 ∧ F0) × P(s1 | i0 ∧ F1)
  × P(s2 | O0 ∧ F2) × P(s3 | i3 ∧ F3)
  × P(O0 | i0 ∧ i1 ∧ s0 ∧ c0)
  × P(O1 | i0 ∧ i1 ∧ s1 ∧ c1)
  × P(O2 | O0 ∧ O1 ∧ s2 ∧ c2)
  × P(O3 | i3 ∧ O2 ∧ s3 ∧ c3) ]

The uniform terms P(i0), ..., P(c3) and P(F0), ..., P(F3) are constants and may be absorbed into the normalization constant Z:

P(O3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3) =
(1/Z) × Σ_{F0,F1,F2,F3,O0,O1,O2} [
    P(s0 | i0 ∧ F0) × P(s1 | i0 ∧ F1)
  × P(s2 | O0 ∧ F2) × P(s3 | i3 ∧ F3)
  × P(O0 | i0 ∧ i1 ∧ s0 ∧ c0)
  × P(O1 | i0 ∧ i1 ∧ s1 ∧ c1)
  × P(O2 | O0 ∧ O1 ∧ s2 ∧ c2)
  × P(O3 | i3 ∧ O2 ∧ s3 ∧ c3) ]
Finally, after some reordering (see Chapter 14) of the sums to minimize
the amount of computation, we obtain:
P(O3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3) =
(1/Z) × Σ_{F0} P(s0 | i0 ∧ F0) × Σ_{F1} P(s1 | i0 ∧ F1) × Σ_{F3} P(s3 | i3 ∧ F3)
  × Σ_{O0} [ P(O0 | i0 ∧ i1 ∧ s0 ∧ c0) × Σ_{F2} P(s2 | O0 ∧ F2)
    × Σ_{O1} [ P(O1 | i0 ∧ i1 ∧ s1 ∧ c1)
      × Σ_{O2} [ P(O2 | O0 ∧ O1 ∧ s2 ∧ c2) × P(O3 | i3 ∧ O2 ∧ s3 ∧ c3) ] ] ]
6.3.2 Results
P(O3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3) (6.6)
print model.ask(O[3],I0^I1^I3^C^S)
The inference engine may provide other and extra simplifications:
Sum_{O1} {P(O1|I(0) I(1) S1 C1)
Sum_{O0} {P(O0|I(0) I(1) S0 C0)
Sum_{F2} {P(F2)P(S2|O0 F2) }
sum_{O2} {P(O2|O0 O1 S2 C2)
P(O3|I(3) O2 S3 C3)
}
}
}
Some of the results obtained for the forward simulation of the water treat-
ment center are presented in Figure 6.1. For the same three inputs (i0 = 1, i1 =
FIGURE 6.1: Direct distributions for the output O3 for three different con-
trols.
P(C0 ∧ C1 ∧ C2 ∧ C3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ o3) (6.7)
The forecast for this control choice is presented in Figure 6.2. It shows that even if this is the best control choice, the probability of obtaining o3 = 9 is only 3%, whereas the probability of obtaining o3 = 6 is around 25%.
This suggests that searching the controls for a given value of o3 may not be the best question. Indeed, this question could lead to a control choice that ensures the highest probability for o3 = 9, but at the price of very high probabilities for much worse outputs.
P(C0 ∧ C1 ∧ C2 ∧ C3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ o3)
[Figure 6.2: the forecast distribution P(O3) for the selected control choice.]
Va: I0, ..., O3, H, VALMIN
Dc:
  P(I0 ∧ I1 ∧ I3 ∧ ... ∧ O2 ∧ O3 ∧ H ∧ VALMIN)
  = P(VALMIN) × P(H | VALMIN ∧ O3)
  × P(I0) × P(I1) × P(I3)
  × P(F0) × P(F1) × P(F2) × P(F3)
  × P(C0) × P(C1) × P(C2) × P(C3)
  × P(S0 | I0 ∧ F0) × P(S1 | I0 ∧ F1)
  × P(S2 | O0 ∧ F2) × P(S3 | I3 ∧ F3)
  × P(O0 | I0 ∧ I1 ∧ S0 ∧ C0)
  × P(O1 | I0 ∧ I1 ∧ S1 ∧ C1)
  × P(O2 | O0 ∧ O1 ∧ S2 ∧ C2)
  × P(O3 | I3 ∧ O2 ∧ S3 ∧ C3)
Fo:
  P(VALMIN) ≡ Uniform
  P(I0), ..., P(C3) ≡ Uniform
  P(S0 | I0 ∧ F0), ..., P(S3 | I3 ∧ F3) ≡ Dirac
  P(O0 | I0 ∧ I1 ∧ S0 ∧ C0) ≡ Histogram
  ...
  P(O3 | I3 ∧ O2 ∧ S3 ∧ C3) ≡ Histogram
  P(H | VALMIN ∧ O3) ≡ Dirac: if (O3 ≥ VALMIN): H = 1, else H = 0
Identification
Qu: P(C0,1,2,3 | i0,1,3 ∧ s0,1,2,3 ∧ h ∧ valmin)
Let us, for example, analyze the cases where V ALM IN ∈ {5, 6, 7}.
For example, P (C0,1,2,3 | i0,1,3 ∧ s0,1,2,3 ∧ h = 1 ∧ valmin = 5) will pro-
vide a distribution on controls C0,1,2,3 which maximizes the chances for O3 to
be greater than five.
Figure 6.3 presents, for the three considered values of VALMIN, the probability distributions obtained on O3.
[Figure 6.3: the distributions on O3 obtained for the three considered values of VALMIN.]
Finally we may combine the readings with the desired control and
use the initial model to compute the distribution on O3 :
opt_known_val = known_val^new_opt_val
newresultA = question.instantiate(opt_known_val)
6.5 Diagnosis
We may also use our Bayesian model to diagnose failures. Let us suppose
that the output is only seven. This means that at least one of the four units
is in poor working condition. We want to identify these defective units so we
can fix them.
86 Bayesian Programming
6.5.1 Question
The question is: “What is going wrong?” We must look for the values of
F0 , F1 , F2 , and F3 knowing the entries, the sensor values, the control, and the
final output. The corresponding question is:
P (F0 ∧ F1 ∧ F2 ∧ F3 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3 ∧ o3 )
6.5.2 Results
P (F2 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3 ∧ o3 )
[Figure: the distribution P(F2 | i0 ∧ i1 ∧ i3 ∧ s0 ∧ s1 ∧ s2 ∧ s3 ∧ c0 ∧ c1 ∧ c2 ∧ c3 ∧ o3) over the efficiency F2 of unit 2.]
One distribution may be obtained by setting the values of the Known variables
P(Searched | known)
If you are given some values of the control variables, you can predict the outputs exactly, but if you are given the goal, you have numerous control solutions to achieve it.
In these cases, functional models do not have enough information to give
you any hint to help you choose between the different values of X satisfying
X = F−1 (Y ).
In contrast, P (X | Y ) gives much more information, because it allows you
to find the relative probabilities of the different solutions.
Part II
Bayesian Programming
Cookbook
Chapter 7
Information Fusion
P(S | r1 ∧ ... ∧ rN ∧ π) (7.1)

P(S ∧ R1 ∧ ... ∧ RN | π) = P(S | π) × Π_{n=1}^{N} P(Rn | S ∧ π) (7.2)
Va: S, R1, ..., RN
Dc:
  P(S ∧ R1 ∧ ... ∧ RN | π)
  = P(S | π) × Π_{n=1}^{N} P(Rn | S ∧ π)
Fo: any
Id
Qu: P(S | r1 ∧ ... ∧ rN ∧ π)
(7.3)

P(S | r1 ∧ ... ∧ rN ∧ π) = (1/Z) × P(S | π) × Π_{n=1}^{N} P(rn | S ∧ π) (7.4)
P(S | r1 ∧ ... ∧ r_{N−1} ∧ π)
= (1/Z) × Σ_{RN} [ P(S | π) × Π_{n=1}^{N} P(Rn | S ∧ π) ]
= (1/Z) × P(S | π) × Π_{n=1}^{N−1} P(rn | S ∧ π) × Σ_{RN} P(RN | S ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N−1} P(rn | S ∧ π)
(7.5)
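The naive fusion of Equations 7.3 and 7.4 is easy to reproduce numerically. The following minimal sketch is written in plain Python, independently of the ProBT/pl API used elsewhere in this book; the discrete state space, the uniform prior, the bell-shaped sensor models, and the readings are all assumed toy values chosen for illustration:

import math

def normalize(p):
    z = sum(p)
    return [x / z for x in p]

states = range(11)                        # S in {0, ..., 10}
prior = [1.0 / 11] * 11                   # P(S): uniform prior

def likelihood(reading, s, sigma=2.0):
    # hypothetical sensor model P(rn | S = s): unnormalized bell shape
    return math.exp(-((reading - s) ** 2) / (2 * sigma ** 2))

readings = [4, 5, 7]                      # r1, ..., rN
posterior = list(prior)
for r in readings:
    posterior = [p * likelihood(r, s) for s, p in zip(states, posterior)]
posterior = normalize(posterior)          # (1/Z) * P(S) * product of P(rn | S)

print(max(zip(posterior, states)))        # most probable state given the readings

The important point is that each new reading simply multiplies the current distribution by one more likelihood, which is exactly why fusion with a subset of the readings (Equation 7.5) amounts to dropping the corresponding factors.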
FIGURE 7.1: In this section, the boat is located at X = Y = 0 and the
three landmarks are located at (−50, −50), (−50, 0), and (0, −50).
We assume the uncertainty gets bigger as the boat is further away from the landmark and becomes:

gd1(X, Y) = fd1(X, Y) / 10 + α

where α is some minimal uncertainty on the reading of the distance. Similarly, we define the bearing knowing the location of the boat:

fb1(X, Y) = arctan((Y − 50) / (X − 50))
from math import hypot

def g_d_1(Output_, Input_):
    # standard deviation grows with the distance to landmark L1, plus a minimal value
    Output_[0] = hypot(Input_[X] + 50.0, Input_[Y] + 50.0) / 10.0 + 5
These functions are used to define a conditional probability distribu-
tion on D1:
plCndNormal(D1,X^Y, \
plPythonExternalFunction(X^Y,f_d_1), \
plPythonExternalFunction(X^Y,g_d_1)))
This distribution is added to the list (JointDistributionList) of all
the distributions defining the specification. The joint distribution is then
defined as:
localisation_model=plJointDistribution(X^Y^D1^D2^D3^B1^B2^B3,\
JointDistributionList)
We can ask a lot of different questions to this description as, for instance,
for a supposed observer in position (0, 0):
1. The distribution on the position knowing the three distances and the
three bearings (see Figure 7.2a):
P (X ∧ Y |d1 ∧ d2 ∧ d3 ∧ b1 ∧ b2 ∧ b3 ∧ π)
P (X ∧ Y |b1 ∧ b2 ∧ b3 ∧ π)
FIGURE 7.2: The probability of being at a given location given the readings:
(a): P (X ∧ Y |d1 = 70 ∧ d2 = 50 ∧ d3 = 50 ∧ b1 = 225 ∧ b2 = 180 ∧ b3 = 270 ∧ π);
(b): P (X ∧ Y |b1 = 225 ∧ b2 = 180 ∧ b3 = 270 ∧ π).
FIGURE 7.3: The probability of being at a given location given the readings:
(a): P (X ∧ Y |d2 = 50 ∧ d3 = 50 ∧ π)
(b): P (X ∧ Y |d1 = 70 ∧ d2 = 50 ∧ d3 = 70 ∧ π)
P (X ∧ Y |d2 ∧ d3 ∧ π)
4. A question with the three distances but with a wrong value (d3 = 70)
for the reading d3 (see Figure 7.3b).
P (B3 |b1 ∧ b2 ∧ d1 ∧ d2 ∧ π)
The result of this question could be either used to search for the third
landmark if you do not know which direction to look for it (most probable
direction 270) or, even, to detect a potential problem with your measure of B3
if it is not coherent with the other readings (i.e., it has a very low probability,
for instance for b3 = 150).
FIGURE 7.4: The probability distribution on the location of the third land-
mark knowing b1 = 225, b2 = 180, d1 = 70, d2 = 50.
P(S ∧ R1 ∧ ... ∧ RN | π) = P(S | π) × Π_{k=1}^{K} P(Rk^1 ∧ ... ∧ Rk^{Mk} | S ∧ π) (7.7)
with Σ_{k=1}^{K} Mk = N.
Va: S, R1, ..., RN
Dc:
  P(S ∧ R1 ∧ ... ∧ RN | π)
  = P(S | π) × Π_{k=1}^{K} P(Rk^1 ∧ ... ∧ Rk^{Mk} | S ∧ π)
Fo: any
Id
Qu: P(S | r1 ∧ ... ∧ rN ∧ π)
(7.8)
P(S | r1 ∧ ... ∧ rN ∧ π) can still be computed very efficiently as it is simply equal to:

P(S | r1 ∧ ... ∧ rN ∧ π) = (1/Z) × P(S | π) × Π_{k=1}^{K} P(rk^1 ∧ ... ∧ rk^{Mk} | S ∧ π) (7.9)
We can express in this way that the measure of the bearing depends on
the distance of the landmark. For instance, by having a decreasing function
gbn (Dn ) we may say that the further the landmark the more precise the mea-
sure of the bearing:
For this example, we choose to set the standard deviation to vary from a minimum of 5 to a maximum of 20:

gbn(dn) = max(5, 20 − 500 / (10 + dn))
Va: X, Y, D1, D2, D3, B1, B2, B3
Dc:
  P(X ∧ Y ∧ ... ∧ B3 | π)
  = P(X ∧ Y | π) × Π_{n=1}^{3} [ P(Dn | X ∧ Y ∧ π) × P(Bn | Dn ∧ X ∧ Y ∧ π) ]
Fo:
  P(X ∧ Y | π) = Uniform
  P(Dn | X ∧ Y ∧ π) = B([µ = fdn(X, Y)], [σ = gdn(X, Y)])
  P(Bn | Dn ∧ X ∧ Y ∧ π) = B([µ = fbn(X, Y)], [σ = gbn(Dn)])
Id
Qu: P(X ∧ Y | b1 ∧ b2 ∧ b3 ∧ π)
(7.11)
We may use this description to recompute the distribution on the location
of the boat using only the bearing readings:
P (X ∧ Y |b1 ∧ b2 ∧ b3 ∧ π)
FIGURE 7.5: The probability distribution on the location of the boat assum-
ing the precision of the bearings depends on the distance with the following
readings: b1 = 225, b2 = 180, b3 = 270 (to be compared with Figure 7.2b).
7.3 Classification
7.3.1 Statement of the problem
It may happen quite often that we are not interested in knowing the state
in all its details. It may also happen that the information provided by the
sensors is so imprecise that we can only access a rough evaluation of this
state.
In both cases, it is possible to introduce a variable C to classify the states
into categories that are either considered sufficient for the task or are imposed
as the best that can be achieved.
Rather than P (S|r1 ∧ . . . ∧ rN ∧ π), the question asked to this model is:
P (C|r1 ∧ . . . ∧ rN ∧ π) (7.12)
Va: S, C, R1, ..., RN
Dc:
  P(S ∧ C ∧ R1 ∧ ... ∧ RN | π)
  = P(S | π) × P(C | S ∧ π) × Π_{n=1}^{N} P(Rn | S ∧ π)
Fo: any
Id
Qu: P(C | r1 ∧ ... ∧ rN ∧ π)
(7.13)
P (C|S ∧ π) encodes the definition of the classes.
P(S | π) × P(C | S ∧ π) may be replaced by P(C | π) × P(S | C ∧ π) if it is more convenient to define the classes by specifying the constraints on the states knowing the classes.
P(C | r1 ∧ ... ∧ rN ∧ π) can be computed by the following formula:

P(C | r1 ∧ ... ∧ rN ∧ π)
= (1/Z) × Σ_S [ P(S | π) × P(C | S ∧ π) × Π_{n=1}^{N} P(rn | S ∧ π) ] (7.14)
The sum over S may be costly, but it is often reduced in practice because the classes impose by definition that P(C | S ∧ π) is zero over large ranges of S.
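A minimal sketch of Equation 7.14 in plain Python (again independent of the ProBT bindings, with assumed toy distributions rather than the book's identified models) makes the extra marginalization over S explicit:

import math

states = range(11)                                   # S
classes = [0, 1]                                     # C, e.g. safe / unsafe
prior_s = [1.0 / 11] * 11                            # P(S)

def p_c_given_s(c, s):                               # hypothetical class model P(C | S)
    unsafe = 1.0 if s >= 7 else 0.0
    return unsafe if c == 1 else 1.0 - unsafe

def p_r_given_s(r, s, sigma=2.0):                    # hypothetical sensor model P(rn | S)
    return math.exp(-((r - s) ** 2) / (2 * sigma ** 2))

readings = [6, 8]
post = []
for c in classes:
    total = 0.0
    for s in states:                                 # sum over S of Equation 7.14
        lik = 1.0
        for r in readings:
            lik *= p_r_given_s(r, s)
        total += prior_s[s] * p_c_given_s(c, s) * lik
    post.append(total)
z = sum(post)
print([p / z for p in post])                         # P(C | r1 ... rN)

Because p_c_given_s is zero for most states of one class, only a fraction of the terms of the inner sum actually contribute, which is the reduction mentioned above.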
P(C = unsafe | X ∧ Y ∧ π) = min(1.0, 30 / √((Y + 50)² + (X + 50)²)) (7.15)
If we are very close (d < 30) the danger is certain and it becomes less and
less probable when we get further from the shipwreck.
P(Bn | X ∧ Y ∧ π):
  θ = atan2(−(Y + Ln^Y), −(X + Ln^X))

  if 0 < θ < π/2:   P(Bn = NE) = 0.5; P(Bn = SE) = 0.2; P(Bn = NW) = 0.2; P(Bn = SW) = 0.1
  if π/2 < θ < π:   P(Bn = NE) = 0.2; P(Bn = SE) = 0.1; P(Bn = NW) = 0.5; P(Bn = SW) = 0.2
  if −π/2 < θ < 0:  P(Bn = NE) = 0.2; P(Bn = SE) = 0.5; P(Bn = NW) = 0.1; P(Bn = SW) = 0.2
  if −π < θ < −π/2: P(Bn = NE) = 0.1; P(Bn = SE) = 0.2; P(Bn = NW) = 0.2; P(Bn = SW) = 0.5
(7.16)
The corresponding Bayesian program is now:
Va: X, Y, C, B1, B2, B3
Dc:
  P(X ∧ Y ∧ ... ∧ B3 | π)
  = P(X ∧ Y | π) × P(C | X ∧ Y ∧ π) × Π_{n=1}^{3} P(Bn | X ∧ Y ∧ π)
Fo:
  P(X ∧ Y | π) = Uniform
  P(C | X ∧ Y ∧ π): Equation (7.15)
  P(Bn | X ∧ Y ∧ π): Equation (7.16)
Id
Qu: P(C | B1 ∧ B2 ∧ B3 ∧ π)
(7.17)
We can divide the space into four regions, each labeled with the true reading of the bearings in that region. If we use these readings as examples, we can obtain the probability of being “unsafe” for the corresponding reading (see Figure 7.6). Note that this probability does not correspond to the probability of being in that region but to the probability of being in danger if we have this reading.
P (Rn |S ∧ A ∧ π) (7.18)
[Figure 7.6 regions and values: (SW, SW, SE): 0.44; (SW, SW, SW): 0.37; (SW, NW, SE): 0.57; (SW, NW, SW): 0.44; landmarks L1, L2, L3.]
FIGURE 7.6: The probability of being “unsafe” if the readings correspond
to one of the labels of the danger regions.
Va: S, A, R1, ..., RN
Dc:
  P(S ∧ A ∧ R1 ∧ ... ∧ RN | π)
  = P(A ∧ S | π) × Π_{n=1}^{N} P(Rn | S ∧ A ∧ π)
Fo: any
Id
Qu:
  P(S | r1 ∧ ... ∧ rN ∧ π)
  P(S | a ∧ r1 ∧ ... ∧ rN ∧ π)
(7.19)
P(A ∧ S | π) encodes the possible relations between the state of the phenomenon and the ancillary clues.
There could be none: the ancillary clue could be independent of the state, P(A ∧ S | π) = P(A | π) × P(S | π).
But if there are some relations that deserve to be taken into account, as
in the classification case, P (A ∧ S|π) could be either defined as P (A|π) ×
P (S|A ∧ π) or as P (S|π) × P (A|S ∧ π).
The important innovation here is that the sensor models depend on both S and A. Either A is used to refine the value of the parameters of the sensor models, or it can even completely change the very nature and mathematical form of the sensor model.
This model may be used to estimate the state either knowing the value a
of the ancillary clue or ignoring it.
If the visibility is perfect (V = true), the sensor model used is the same
as the one used in the Bayesian program 7.6 at the beginning of this chapter.
If there is no visibility (V = f alse), the sensor model assumes a larger uncer-
tainty. For example, the sensor readings with no visibility will have a standard
deviation of 30, which will be reduced to 10 when there is good visibility.
Va: X, Y, V, B1, B2, B3
Dc:
  P(X ∧ Y ∧ ... ∧ B3 | π)
  = P(X ∧ Y | π) × P(V | π) × Π_{n=1}^{3} P(Bn | V ∧ X ∧ Y ∧ π)
Fo:
  P(X ∧ Y | π) = Uniform
  P(V | π) = Soft evidence
  P(Bn | [V = false] ∧ X ∧ Y ∧ π) = B([µ = fn(X, Y)], [σ = 30])
  P(Bn | [V = true] ∧ X ∧ Y ∧ π) = B([µ = fn(X, Y)], [σ = 10])
Id
Qu: P(X ∧ Y | b1 ∧ b2 ∧ b3 ∧ π)
(7.20)
P(X ∧ Y | b1 ∧ b2 ∧ b3 ∧ π)
= (1/Z) × Σ_V [ P(V | π) × Π_{n=1}^{3} P(bn | V ∧ X ∧ Y ∧ π) ]
= (1/Z) × [ P([V = 0] | π) × Π_{n=1}^{3} P(bn | [V = 0] ∧ X ∧ Y ∧ π)
          + P([V = 1] | π) × Π_{n=1}^{3} P(bn | [V = 1] ∧ X ∧ Y ∧ π) ]
(7.21)
(7.21)
P (V |π) may be considered as “soft evidence.” Indeed, the weather forecast
gives you an estimation of the visibility in percent that can be used as the
value for this soft evidence.
The computation above appears as a weighting sum between the two mod-
els, the weights being the estimation of the visibility.
We may compute P (X ∧ Y |b1 ∧ b2 ∧ b3 ∧ π) for different values of this soft
evidence:
1. P ([V = f alse] |π) = 1 (no visibility) see Figure 7.7a.
2. P ([V = f alse] |π) = 0.9 (almost no visibility ) see Figure 7.7b.
[Figure 7.7: the distribution P(X ∧ Y | b1 ∧ b2 ∧ b3 ∧ π) for different values of the soft evidence on the visibility V (panels (a) to (d)).]
3
This is a main concern in physical experiments where considerable effort is made to
build the setup that will warrant that the sensors measure only the search quantity. For
instance, neutrino detectors are buried several thousand meters under mountains so as to
be free from cosmic radiation.
Va: S, F1, ..., FN, R1, ..., RN
Dc: P(S ∧ F1 ∧ ... ∧ RN | π)
Va: X, Y, F1, F2, F3, D1, D2, D3
Dc:
  P(X ∧ Y ∧ ... ∧ D3 | π)
  = P(X ∧ Y | π) × Π_{n=1}^{3} P(Fn | π) × Π_{n=1}^{3} P(Dn | X ∧ Y ∧ Fn ∧ π)
Fo:
  P(X ∧ Y | π) = Uniform
  P(Fn = true | π) = 0.3
  P(Dn | X ∧ Y ∧ [Fn = 0] ∧ π) = G([µ = dn], [σ = 1 + dn/10])
  P(Dn | X ∧ Y ∧ [Fn = 1] ∧ π) = Uniform
Id
Qu: P(X ∧ Y | d1 ∧ d2 ∧ d3 ∧ π)
(7.25)
P (X ∧ Y |d1 ∧ d2 ∧ d3 ∧ π) is a weighted sum between the different varia-
tions of the models according to the different combinations of false alarms for
the different targets. The weights are given by the product of the probabilities
of false alarms:
P(X ∧ Y | d1 ∧ d2 ∧ d3 ∧ π)
= (1/Z) × Σ_{F1∧F2∧F3} [ P(X ∧ Y | π) × Π_{n=1}^{3} P(Fn | π) × Π_{n=1}^{3} P(dn | X ∧ Y ∧ Fn ∧ π) ]
(7.26)
We observe in Figure 7.8 (to be compared to Figure 7.3b) that this new
model with a false alarm is more robust than the one without one when we
have a wrong reading on D3.
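The robustness comes from the fact that each reading contributes a mixture of its normal sensor model and the uniform false alarm model (the sum over F1 ∧ F2 ∧ F3 in Equation 7.26 factorizes over the readings). The following plain Python sketch illustrates this on an assumed one-dimensional toy problem (landmark positions, noise levels, and readings are all hypothetical, not the book's example):

import math

def gauss(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

landmarks = [0, 40, 80]                     # hypothetical 1D landmark positions
p_false = 0.3                               # P(Fn = 1)
uniform = 1.0 / 101                         # false alarm model on readings in [0, 100]

def posterior(readings):
    post = []
    for x in range(101):                    # uniform prior on the position X
        p = 1.0
        for lm, d in zip(landmarks, readings):
            mu = abs(x - lm)                # expected distance reading
            p *= (1 - p_false) * gauss(d, mu, 1 + mu / 10.0) + p_false * uniform
        post.append(p)
    z = sum(post)
    return [v / z for v in post]

good = posterior([20, 20, 60])              # consistent readings for x = 20
bad = posterior([20, 20, 95])               # third reading is wrong
print(max(range(101), key=lambda x: bad[x]))   # still close to 20: robust to the outlier

Without the false alarm term, the wrong third reading would drag the estimate away; with it, the outlying reading is simply explained as a probable false alarm.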
[Figure 7.8: the distribution P(X ∧ Y | d1 ∧ d2 ∧ d3 ∧ π) obtained with the false alarm model when the reading d3 is wrong.]
P(A ∧ R1 ∧ ... ∧ RN | π) = P(A | π) × Π_{n=1}^{N} P(Rn | A ∧ π) (7.27)
Va: A, R1, ..., RN
Dc:
  P(A ∧ R1 ∧ ... ∧ RN | π)
  = P(A | π) × Π_{n=1}^{N} P(Rn | A ∧ π)
Fo: any
Id
Qu: P(A | r1 ∧ ... ∧ rN ∧ π)
(7.28)
Knowing the heading direction, whatever the position of the boat, the bearing of the obstacle should be different from the heading, as we want to avoid it:

P(B0 | H) = (1/Z) × [1 − λ B(µ = H, σ′)]
[Figure: the distribution P(B0 | H), which gives low probability to obstacle bearings close to the heading H.]
Va: H, B0, B1
Dc:
  P(H ∧ B0 ∧ B1 | π)
  = P(H | π) × Π_{n=0}^{1} P(Bn | H ∧ π)
Fo:
  P(H | π) = Uniform
  P(B0 | H) = (1/Z) × [1 − λ B(µ = H, σ′)]
  P(B1 | H) = B(B1, µ = H, σ)
Id
Qu: P(H | b0 ∧ b1 ∧ π)
(7.30)
Two different calculi may lead to the same result. This is the case if you
try to compute the same thing with two different methods in a “consistent”
or “coherent”1 calculus system.
You can impose it as a constraint of your model by specifying that a given
equation should be respected. Solving the equation then consists of finding the
conditions of the two terms of the equation in order to make them “equal.”2
It can finally be used as a programming notion when you “assign”3 the
result of a calculus to a given variable in order to use it in a subsequent
calculus.4
However, for all these fundamental notions of logic, mathematic, and com-
puting the results of the calculus are always values either Boolean, numeric,
or symbolic.
In probabilistic computing, the basic objects that are manipulated are not
values but rather probability distributions on variables. In this context, the
“equality” has a different meaning as it should say that two variables have
the same probability distribution.
To realize this, we introduce in this chapter the notion of coherence vari-
ables. A coherence variable is a Boolean variable. If the coherence variable is
equal to 1 (or “true”) it imposes that the two variables are “coherent” which
means that they should share the same probability distribution knowing the
same premises.
Va: A, B, Λ
Dc:
  P(A ∧ B ∧ Λ | π)
  = P(A | π) × P(B | π) × P(Λ | A ∧ B ∧ π)
Fo:
  P([Λ = 1] | A ∧ B ∧ π) = δ_{A=B}
Id
Qu:
  P(A | [Λ = 1] ∧ π)
  P(B | [Λ = 1] ∧ π)
(8.3)
The interesting question is P(A | [Λ = 1] ∧ π):

P(a | λ ∧ π)
= (1/Z) × Σ_B [ P(a | π) × P(B | π) × P(λ | a ∧ B ∧ π) ]
= (1/Z) × [ P(a | π) × P(b | π) × P(λ | a ∧ b ∧ π) + P(a | π) × P(b̄ | π) × P(λ | a ∧ b̄ ∧ π) ]
= (1/Z) × P(a | π) × P(b | π)
(8.4)
P(a | λ ∧ π)
= [P(a | π) × P(b | π)] / [P(a | π) × P(b | π) + P(ā | π) × P(b̄ | π)]
(8.5)
P(a | λ ∧ π) = (y × x) / (y × x + (1 − y) × (1 − x)) (8.7)
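A minimal numeric check of this result (writing x and y for the two priors P(b | π) and P(a | π), the identification implied by Equations 8.5 and 8.7) can be done by brute-force enumeration of the joint distribution of Program 8.3; the sketch below is plain Python with arbitrary example values:

def p_a_given_lambda(y, x):
    # closed form of Equation 8.7
    return (y * x) / (y * x + (1 - y) * (1 - x))

def p_a_given_lambda_enum(y, x):
    # enumeration of P(A ∧ B ∧ [Λ = 1]) with P([Λ = 1] | A ∧ B) = δ(A = B)
    num = sum(y_a * x_b for a, y_a in [(1, y), (0, 1 - y)]
                        for b, x_b in [(1, x), (0, 1 - x)] if a == b and a == 1)
    den = sum(y_a * x_b for a, y_a in [(1, y), (0, 1 - y)]
                        for b, x_b in [(1, x), (0, 1 - x)] if a == b)
    return num / den

print(p_a_given_lambda(0.7, 0.6))        # 0.777...
print(p_a_given_lambda_enum(0.7, 0.6))   # same value, by enumeration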
5
See Section 2.6.2 titled “Godel’s theorem” of Jaynes’ book [2003] (pages 45–47) for a
very stimulating discussion on this subject and about the perspectives it opens relatively
to the meaning of Godel’s theorem in probability.
8.1.3.3 P(a|λ̄ ∧ π)
What happens if Λ is set to false?
In that case we get:
P(a | λ̄ ∧ π) = (1/Z) × P(a | π) × P(b̄ | π) (8.8)
If B is true, we get that A is false and if B is false, we get that A is true.
The logical interpretation is that a ⇔ b̄ and the algebraic interpretation is
that the value of ¬B is equal to A.
If we have no certainty on B and a uniform prior on A then we get:
P(a | λ̄ ∧ π) = P(b̄ | π) = 1 − P(b | π) (8.9)
Va: A, B, Λ
Dc:
  P(A ∧ B ∧ Λ | π)
  = P(A | π) × P(B | π) × P(Λ | A ∧ B ∧ π)
Fo:
  P([Λ = 1] | A ∧ B ∧ π) = δ_{A=B}
Id
Qu:
  P(A | [Λ = 1] ∧ π)
  P(B | [Λ = 1] ∧ π)
(8.12)
For P(A | [Λ = 1] ∧ π) we get:

P(A | λ ∧ π)
= (1/Z) × Σ_B [ P(A | π) × P(B | π) × P(λ | A ∧ B ∧ π) ]
= (1/Z) × P(A | π) × P([B = A] | π)
(8.13)
And, if we further assume that P (A) is uniform, we get for all possible
values of A:
P (A|λ ∧ π) = 0 (8.17)
P([A = x] | λ ∧ π) = P([B = x] | π) / Σ_{a∈A} P([B = a] | π) (8.18)

where the sum on A is made only for the values of A that are in the range of B (see Figure 8.1).

P([A = x] | λ ∧ π) = P([B = x] | π) / Σ_{a∈A} P([B = a] | π) (8.19)

P(A | λ ∧ π) = 0 (8.20)

P([A = x] | λ ∧ π) = P([B = x] | π) / Σ_{a∈A} P([B = a] | π) (8.21)

where the sum on A is made only for the values of A that are in the range of B.
[Figure 8.1: (a) the prior P(A); (b) the prior P(B); (c) the resulting distribution P(A | λ = 1).]
The mapping is not restricted to a linear mapping; we may use, for instance, a logarithmic mapping where f(B) = int(log2(B)). We get:

P([A = x] | λ ∧ π) = Σ_B [ P(B | π) × δ_{x=int(log2(B))} ] / Σ_{a∈A} Σ_B [ P(B | π) × δ_{a=int(log2(B))} ] (8.23)
FIGURE 8.2: The variable B ∈ [1, 255] is mapped into the variable A ∈
[1, 7] using the log2 function: (a) P (A|λ ∧ π) with a uniform prior on B, (b)
P (A|λ ∧ π) with a Gaussian prior on B.
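Equation 8.23 is a pushforward of the prior on B through the mapping, followed by a normalization. The sketch below (plain Python, with A taken over {0, ..., 7}, which is the range of int(log2 B) for B in [1, 255]) reproduces the uniform-prior case of Figure 8.2a:

import math

def p_a_given_lambda(prior_b):
    # prior_b: dict {b: P(B = b)}; returns P(A | lambda) for A = int(log2(B))
    post = {a: 0.0 for a in range(8)}
    for b, p in prior_b.items():
        post[int(math.log2(b))] += p           # sum over B of P(B) * delta(a = int(log2 B))
    z = sum(post.values())
    return {a: v / z for a, v in post.items()}

uniform_b = {b: 1.0 / 255 for b in range(1, 256)}
print(p_a_given_lambda(uniform_b))             # the mass roughly doubles from one value of A to the next

Replacing uniform_b by any other prior on B (for example a discretized Gaussian) gives the second case of Figure 8.2.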
Let us check that this can effectively be realized with a generic model of
the form:
P(A ∧ B ∧ C ∧ D ∧ E ∧ Λ)
= P(A ∧ C ∧ E) × P(Λ | A ∧ B) × P(B ∧ D ∧ E)
(8.24)
It is generic in the sense that: (i) A and B are bound by the coherence
variable λ, (ii) A is part of a model including C not shared with B and E
shared with B, and (iii) symmetrically B is part of a model including D not
shared with A and E shared with A.
Va: A, B, C, D, E, Λ
Dc:
  P(A ∧ B ∧ C ∧ D ∧ E ∧ Λ | π)
  = P(A ∧ C ∧ E | π) × P(Λ | A ∧ B ∧ π) × P(B ∧ D ∧ E | π)
Fo: any
Id
Qu:
  P(A | λ ∧ π)
  P(A | c ∧ λ ∧ π)
  P(A | e ∧ λ ∧ π)
  P(A | c ∧ e ∧ λ ∧ π)
  P(A | d ∧ c ∧ e ∧ λ ∧ π)
(8.25)
P([A = x] | λ ∧ π)
∝ Σ_{B∧C∧D∧E} [ P([A = x] ∧ C ∧ E | π) × P(λ | A ∧ B ∧ π) × P(B ∧ D ∧ E | π) ]
∝ Σ_{B∧E} [ P([A = x] ∧ E | π) × P(λ | A ∧ B ∧ π) × P(B ∧ E | π) ]
∝ Σ_E [ P([A = x] ∧ E | π) × P([B = x] ∧ E | π) ]
∝ P([B = x] | λ ∧ π)
(8.26)

We have also:

P([A = x] | c ∧ λ ∧ π)
∝ Σ_{B∧D∧E} [ P([A = x] ∧ c ∧ E | π) × P(λ | A ∧ B ∧ π) × P(B ∧ D ∧ E | π) ]
∝ Σ_E [ P([A = x] ∧ c ∧ E | π) × P([B = x] ∧ E | π) ]
∝ P([B = x] | c ∧ λ ∧ π)
(8.27)
and:

P([A = x] | e ∧ λ ∧ π)
∝ Σ_{B∧C∧D} [ P([A = x] ∧ C ∧ e | π) × P(λ | A ∧ B ∧ π) × P(B ∧ D ∧ e | π) ]
∝ P([A = x] ∧ e | π) × P([B = x] ∧ e | π)
∝ P([B = x] | e ∧ λ ∧ π)
(8.28)

and also:

P([A = x] | c ∧ e ∧ λ ∧ π)
∝ Σ_{B∧D} [ P([A = x] ∧ c ∧ e | π) × P(λ | A ∧ B ∧ π) × P(B ∧ D ∧ e | π) ]
∝ P([A = x] ∧ c ∧ e | π) × P([B = x] ∧ e | π)
∝ P([B = x] | c ∧ e ∧ λ ∧ π)
(8.29)

and, finally:

P([A = x] | d ∧ c ∧ e ∧ λ ∧ π)
∝ Σ_B [ P([A = x] ∧ c ∧ e | π) × P(λ | A ∧ B ∧ π) × P(B ∧ d ∧ e | π) ]
∝ P([A = x] ∧ c ∧ e | π) × P([B = x] ∧ d ∧ e | π)
∝ P([B = x] | d ∧ c ∧ e ∧ λ ∧ π)
(8.30)
You may not be convinced that expressing both these noninformative pri-
ors is a practical necessity. Let us get back to the “false alarm” example of
Section 7.5 of Chapter 7. The sensor model is the following:
P(S ∧ R1 ∧ ... ∧ RN | π) = Π_{n=1}^{N} [ P(Rn | π) × P(S | Rn ∧ π) ] (8.33)
P(S ∧ S1 ∧ ... ∧ SN ∧ R1 ∧ ... ∧ RN | π)
= P(S | S1 ∧ ... ∧ SN) × Π_{n=1}^{N} [ P(Rn | π) × P(Sn | Rn ∧ π) ]
(8.34)
where each expert expresses his own opinion Sn and where the distribution P(S | S1 ∧ ... ∧ SN) is in charge of the synthesis of these diverging opinions. It has two essential shortcomings: (i) P(S | S1 ∧ ... ∧ SN) is a very big distribution, most of the time very difficult to formalize, and (ii) computing P(S | r1 ∧ ... ∧ rN ∧ π) requires marginalizing out the N variables Sn, which is a very cumbersome computation.
Yet another approach could be to say that we use the “regular” fusion
model:
P(S ∧ R1 ∧ ... ∧ RN | π) = P(S | π) × Π_{n=1}^{N} P(Rn | S ∧ π) (8.35)
P(S ∧ R1 ∧ ... ∧ RN | π)
= P(S | π) × Π_{n=1}^{N} [ P(Rn | πn) × P(S | Rn ∧ πn) / Σ_{Rn} [ P(Rn | πn) × P(S | Rn ∧ πn) ] ]
(8.37)
P(S | r1 ∧ ... ∧ rN ∧ π)
= [ P(S | π) / P(r1 ∧ ... ∧ rN | π) ] × Π_{n=1}^{N} [ P(rn | πn) × P(S | rn ∧ πn) / Σ_{Rn} [ P(Rn | πn) × P(S | Rn ∧ πn) ] ]
(8.38)
Va: S, R1, ..., RN, Λ1, ..., ΛN
Dc:
  P(S ∧ R1 ∧ ... ∧ ΛN | π)
  = P(S | π) × Π_{n=1}^{N} [ P(Rn | π) × P(Λn | S ∧ Rn ∧ π) ]
Fo: see text
Id
Qu:
  P(S | r1 ∧ ... ∧ rN ∧ λ1 ∧ ... ∧ λN ∧ π)
  P(S | λ1 ∧ ... ∧ λN ∧ π)
(8.40)
The answer to the first question is the following:

P(S | r1 ∧ ... ∧ rN ∧ λ1 ∧ ... ∧ λN ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N} P(λn | S ∧ rn ∧ π)
(8.41)

P(S | λ1 ∧ ... ∧ λN ∧ π)
= (1/Z′) × P(S | π) × Π_{n=1}^{N} [ Σ_{Rn} P(Rn | π) × P(λn | S ∧ Rn ∧ π) ]
(8.42)

P(S | λ1 ∧ ... ∧ λN ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N} P([Rn = S] | π)
(8.43)
P(S | r1 ∧ ... ∧ rN ∧ λ1 ∧ ... ∧ λN ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N} e^{−dn(S, rn)}
(8.45)
If, for instance, S, R0, and R1 are three integer variables varying between 1 and 100 and dn(S, Rn) = |S − Rn| / σn, we can compute the distribution on S, P(S | R0 = 50 ∧ R1 = 70 ∧ λ0 ∧ λ1), when we have two different readings R0 = 50 and R1 = 70. Figure 8.3 shows two cases: one with an identical precision for the readings, σ0 = σ1 = 10, and the other with two different precisions: σ0 = 10 and σ1 = 20.
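This particular case of Equation 8.45 is easy to reproduce with a few lines of plain Python (again a sketch rather than the book's ProBT program):

import math

def fuse(readings, sigmas):
    post = []
    for s in range(1, 101):                        # uniform prior P(S), S in [1, 100]
        p = 1.0
        for r, sig in zip(readings, sigmas):
            p *= math.exp(-abs(s - r) / sig)       # e^{-dn(S, rn)} with dn = |S - rn| / sigma_n
        post.append(p)
    z = sum(post)
    return [v / z for v in post]

equal = fuse([50, 70], [10, 10])                   # identical precisions (Figure 8.3a)
unequal = fuse([50, 70], [10, 20])                 # different precisions (Figure 8.3b)
print(max(range(1, 101), key=lambda s: equal[s - 1]))    # plateau between 50 and 70 (argmax returns its first point)
print(max(range(1, 101), key=lambda s: unequal[s - 1]))  # 50: the more precise reading dominates

With equal precisions the posterior is flat between the two readings; with unequal precisions it is pulled toward the more precise one, as in Figure 8.3.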
For the second question, we have:

P(S | λ1 ∧ ... ∧ λN ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N} [ Σ_{Rn} P(Rn | π) × e^{−dn(S, Rn)} ]
(8.46)
6
The same remark as above. See Section 2.6.2 titled “Godel’s theorem” of Jaynes’ book
[2003] (pages 45–47) for a very stimulating discussion on this subject and about the per-
spectives it opens relative to the meaning of Godel’s theorem in probability.
[Figure 8.3: the distribution P(S | R0 = 50 ∧ R1 = 70 ∧ λ0 ∧ λ1): (a) identical precisions σ0 = σ1 = 10; (b) different precisions σ0 = 10 and σ1 = 20.]
P(S | r1 ∧ ... ∧ rN ∧ λ1 ∧ ... ∧ λN ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N} P(rn | S ∧ πn)
(8.47)
which is the exact same expression as for the naive Bayesian fusion (see Equa-
tion 7.4).
However, we now have the freedom to specify priors for the readings as
the P (Rn |π) appears in the decomposition of the Bayesian program (Equation
8.40). The answer to the second question is then:
P(S | λ1 ∧ ... ∧ λN ∧ π)
= (1/Z′) × P(S | π) × Π_{n=1}^{N} [ Σ_{Rn} P(Rn | π) × P(Rn | S ∧ πn) ]
(8.48)
Va: X, Y, F2, F3, D2, D3, Λ2, Λ3
Dc:
  P(X ∧ Y ∧ ... ∧ Λ3 | π)
  = P(X ∧ Y | π) × Π_{n=2}^{3} [ P(Dn | π) × P(Fn | π) ] × Π_{n=2}^{3} P(Λn | X ∧ Y ∧ Dn ∧ Fn ∧ π)
Fo:
  P(X ∧ Y | π) = Uniform
  P([F2 = 1] | π) = 0.3
  P([F3 = 1] | π) = 0.3
  P(Dn | π) = Uniform
  P(λn | X ∧ Y ∧ Dn ∧ [Fn = 0] ∧ π) = G([µ = dn], [σ = 1 + dn/10])
  P(λn | X ∧ Y ∧ Dn ∧ [Fn = 1] ∧ π) = 1/2
Id
Qu: P(X ∧ Y | d2 ∧ d3 ∧ λ2 ∧ λ3 ∧ π)
(8.49)
If there is no false alarm, the position of the boat (X ∧Y ) and the different
distances should be coherent and, on the contrary, if there is a false alarm there
is no reason to justify any relation between the position and the observed
readings.
When there is no false alarm a good measure of the coherence is given by
the regular sensor model. This can be encoded as:
P(λn | X ∧ Y ∧ Dn ∧ [Fn = 0] ∧ π)
= P(Dn | X ∧ Y ∧ π)
= G([µ = dn], [σ = 1 + dn/10])
(8.50)
When there is a false alarm, we do not know if the position of the boat
and the distances are coherent:
P(X ∧ Y | d2 ∧ d3 ∧ λ2 ∧ λ3 ∧ π)
= (1/Z) × Σ_{F2∧F3} [ P(X ∧ Y | π) × Π_{n=2}^{3} P(Fn | π) × Π_{n=2}^{3} P(λn | X ∧ Y ∧ dn ∧ Fn ∧ π) ]
(8.52)
P(S | r1 ∧ ... ∧ rN ∧ λ1 ∧ ... ∧ λN ∧ π)
= (1/Z) × P(S | π) × Π_{n=1}^{N} P(S | rn ∧ πn)
(8.53)
which has the appealing property of being again a simple product of proba-
bility distributions as in the sensor fusion case.
P(S | λ1 ∧ ... ∧ λN ∧ π)
= (1/Z′) × P(S | π) × Π_{n=1}^{N} [ Σ_{Rn} P(Rn | π) × P(S | Rn ∧ πn) ]
= (1/Z′) × P(S | π) × Π_{n=1}^{N} P(S | πn)
(8.54)
P (Bn |H ∧ πn ) = B (µ = H, σ) (8.55)
Va: H, B1, B2, B3, Λ1, Λ2, Λ3
Dc:
  P(H ∧ B1 ∧ ... ∧ Λ3 | π)
  = P(H | π) × Π_{n=1}^{3} [ P(Bn | π) × P(Λn | H ∧ Bn ∧ π) ]
Fo:
  P(H | π) = Uniform
  P(Bn | π) = Uniform
  P(λn | H ∧ Bn ∧ π) = B(Bn, µ = H, σ)
Id
Qu: P(H | b1 ∧ λ1 ∧ b2 ∧ λ̄2 ∧ b3 ∧ λ̄3 ∧ π)
(8.57)
This way we obtain for the question P H|b1 ∧ λ1 ∧ b2 ∧ λ¯2 ∧ b3 ∧ λ¯3 ∧ π :
The first question “P (B|P (A|π ′ ) ∧ π)” is often solved by replacing the
prior P (A|π) by the soft evidence P (A|π ′ ) and by computing P (B|π):
P(B | π) = Σ_A [ P(A | π′) × P(B | A ∧ π) ] (8.60)
P(ΠA′ ∧ A′ ∧ ΛA ∧ A ∧ B | π)
= P(ΠA′ | π) × P(A′ | ΠA′ ∧ π) × P(ΛA | A′ ∧ A ∧ π) × P(A ∧ B | π)
(8.61)
Va: ΠA′, A′, ΛA, A, B
Dc:
  P(ΠA′ ∧ A′ ∧ ΛA ∧ A ∧ B | π)
  = P(ΠA′ | π) × P(A′ | ΠA′ ∧ π) × P(ΛA | A′ ∧ A ∧ π) × P(A ∧ B | π)
Fo:
  P(ΠA′ | π) = Uniform
  P(A′ | ΠA′ ∧ π) = f(A′, ΠA′)
  P(λA | A′ ∧ A ∧ π) = δ_{A=A′}
Id
Qu:
  P(A | πA′ ∧ λA ∧ π)
  P(B | πA′ ∧ λA ∧ π)
(8.62)

P(B | πA′ ∧ λA ∧ π) (8.63)
As already mentioned, the first difficulty is to assign the variable A with a
probability distribution imposed as known for a given inference. This can be
easily solved using coherence variables which have been designed especially to
solve this kind of problem. It leads to the following decomposition:
P(A′ ∧ ΛA ∧ A ∧ B | π)
= P(A′ | π) × P(ΛA | A′ ∧ A ∧ π) × P(A | π) × P(B | A ∧ π)
(8.64)
have two parameters, µA′ the mean and σA′ the standard deviation, and where G(x, y, z) = (1 / (√(2π) × z)) × e^{−(x−y)²/z²}.
P(B | πA′ ∧ λA ∧ π)
= (1/Z) × Σ_{A∧A′} [ P(πA′ | π) × P(A′ | πA′ ∧ π) × P(λA | A′ ∧ A ∧ π) × P(A ∧ B | π) ]
= (1/Z′) × Σ_A [ P([A′ = A] | πA′ ∧ π) × P(A ∧ B | π) ]
(8.65)

P(B | πA′ ∧ λA ∧ π)
= (1/Z′) × Σ_A [ P([A′ = A] | πA′ ∧ π) × P(A | π) × P(B | A ∧ π) ]
(8.66)
which is similar to Equation 8.60 but where both appear as the imposed
probability distribution P ([A′ = A] |πA′ ∧ π) and the prior P (A|π).
If P (A ∧ B|π) = P (B|π) × P (A|B ∧ π) we get:
P(B | πA′ ∧ λA ∧ π)
= (1/Z′) × Σ_A [ P([A′ = A] | πA′ ∧ π) × P(B | π) × P(A | B ∧ π) ]
(8.67)
Va: X, Y, D1, D2, D3, D′1, D′2, D′3, Λ1, Λ2, Λ3, M1, M2, M3, Σ1, Σ2, Σ3
Dc:
  P(X ∧ Y ∧ ... ∧ Σ3 | π)
  = Π_{n=1}^{3} [ P(Mn ∧ Σn | π) × P(D′n | Mn ∧ Σn ∧ π) ]
  × P(X ∧ Y | π) × Π_{n=1}^{3} P(Dn | X ∧ Y ∧ π)
  × Π_{n=1}^{3} P(Λn | Dn ∧ D′n ∧ π)
Fo:
  P(D′n | µn ∧ σn ∧ π) = B(D′n, µn, σn)
  P(X ∧ Y | π) = Uniform
  P(Dn | X ∧ Y ∧ π) = δ_{Dn = dn(X,Y)}
  P(Λn | Dn ∧ D′n ∧ π) = δ_{Dn = D′n}
Id
Qu: P(X ∧ Y | µ1 ∧ µ2 ∧ µ3 ∧ σ1 ∧ σ2 ∧ σ3 ∧ λ1 ∧ λ2 ∧ λ3 ∧ π)
(8.68)
P(D′n | µn ∧ σn ∧ π) is a bell-shaped distribution of parameters µn and σn.
P(Dn | X ∧ Y ∧ π), the previous sensor model, does not necessarily need to encode uncertainty anymore, as this uncertainty is provided by the sensor itself with the soft evidence µn and σn. It may be chosen as a Dirac distribution, taking a value of one when Dn is equal to the distance dn(X, Y) of the boat from the nth target.
P(Λn | Dn ∧ D′n ∧ π) is the coherence variable Dirac distribution used to bind the soft evidence distribution on D′ to D.
P(X ∧ Y | µ1 ∧ µ2 ∧ µ3 ∧ σ1 ∧ σ2 ∧ σ3 ∧ λ1 ∧ λ2 ∧ λ3 ∧ π)
∝ Π_{n=1}^{3} B(dn(X, Y), µn, σn)
(8.69)
It may seem a very complicated model to finally obtain the same result as
with the naive fusion model.
However, we have now a complete separation between the soft evidence
modeling the sensors and the internal model describing the geometrical rela-
tions. Especially, we are completely free to specify this internal model as we
want; all dependencies and all priors are acceptable.
8.6 Switch
8.6.1 Statement of the problem
A complex model is most of the time made of several submodels, partially
independent from one another, only connected by well-defined interfaces.
An interesting feature is to be able to switch on or off some part of the
model when needed. This feature could be implemented with coherence vari-
ables.
Va: A, I1, B, I2, C, I3, Λ12, Λ13, Λ23
Dc:
  P(A ∧ ... ∧ Λ23 | π)
  = P(A ∧ I1 | π) × P(B ∧ I2 | π) × P(C ∧ I3 | π)
  × P(Λ12 | I1 ∧ I2 ∧ π) × P(Λ13 | I1 ∧ I3 ∧ π) × P(Λ23 | I2 ∧ I3 ∧ π)
Fo:
  P(Λij | Ii ∧ Ij ∧ π) = δ_{Ii = Ij}
Id
Qu:
  P(A | b ∧ λ12 ∧ π)
  P(A | b ∧ c ∧ λ12 ∧ λ13 ∧ π)
  P(C | b ∧ λ12 ∧ λ13 ∧ π)
  P(C | b ∧ λ23 ∧ π)
(8.70)
This decomposition is a good example to prove that algebraic notation is
P(A | b ∧ λ12 ∧ π)
= (1/Z) × Σ_{I1} [ P(A ∧ I1 | π) × P(b ∧ [I2 = I1] | π) ]
(8.71)
where two submodels are activated to search for the probability distribution
on A knowing b.
P(C | b ∧ λ23 ∧ π)
= (1/Z) × Σ_{I2} [ P(b ∧ I2 | π) × P(C ∧ [I3 = I2] | π) ]
(8.74)
8.7 Cycles
8.7.1 Statement of the problem
Another common problem appears when you have several “tracks” of rea-
soning to draw the same conclusion.
In this case, you have cycles in your Bayesian graph which lead to problems
expressing the model in the Bayesian programming formalism.
The simplest case may be expressed with only three variables A, B, and C.
Let us suppose that we know, on the one hand, a dependency between A
and C (P (C|A)) and, on the other hand, a dependency between A and B
(P (B|A)) followed by a dependency between B and C (P (C|B)).
In this case, we cannot express the joint probability distribution as a prod-
uct of these three elementary distributions, as C appears twice on the left.
An attractive solution is to write that P (A ∧ B ∧ C) = P (A) × P (B|A) ×
P (C|A ∧ B), but then the known distributions P (C|A) and P (C|B) do not
appear in the decomposition, instead in this decomposition the distribution
P (C|A ∧ B) appears, which is not known and may be very difficult to express.
Here again, the coherence variables offer an easy solution. C is the variable deduced directly from A, while a new variable C′ is deduced from the inference chain starting from A, inferring B, and finally inferring C′. We then only need a coherence variable Λ to express that the distributions on C and C′ should be “equal.”
Va: A, B, C, C′, Λ
Dc:
  P(A ∧ B ∧ C ∧ C′ ∧ Λ | π)
  = P(A | π) × P(C | A ∧ π)
  × P(B | A ∧ π) × P(C′ | B ∧ π)
  × P(Λ | C ∧ C′ ∧ π)
Fo:
  P(Λ | C ∧ C′ ∧ π) = δ_{C = C′}
Id
Qu: P(C | a ∧ λ ∧ π)
(8.75)
[Figure: the one-dimensional robot: base location Pr, range sensor reading R, object location Po, actuator command C, link location Pa, link length L, and arm extremity Pe.]
The position of this robot’s base in the world reference frame is stored in
a variable Pr .
The robot has a range sensor that is able to measure the distance of an
object (variable R). Knowing Pr and R you may infer the position of the
object Po as in a perfect world we would have: Po = Pr + R.
The robot bears a prismatic joint. The command of this joint is the variable
C. Knowing Pr and C you can infer the position of the link Pa as, yet in a
perfect world we would have: Pa = Pr + C.
The length of the link is supposed to be L. Knowing Pa and L, we know
the position of the extremity of the arm Pe as Pe = Pa + L.
However, the world is not perfect. We may have some uncertainty on the
position of the robot P (Pr ), the precision of the sensor P (Po |Pr ∧ R), the
command of the robot P (Pa |Pr ∧C), and even the length of the arm P (Pe |Pa ∧
L).
The goal of the robot is to touch the object with its arm. When the contact
is made, then we have Po < Pe and Pe − Po < ǫt . The two kinematic chains
make a closed loop. This is modeled by a coherence variable Λ equal to one if
and only if the contact is realized.
This finally leads to the following Bayesian program:
Va: Pr, R, Po, C, Pa, L, Pe, Λ
Dc:
  P(Pr ∧ ... ∧ Λ | π)
  = P(Pr | π) × P(R | π) × P(Po | Pr ∧ R ∧ π)
  × P(C | π) × P(Pa | Pr ∧ C ∧ π)
  × P(L | π) × P(Pe | Pa ∧ L ∧ π)
  × P(Λ | Po ∧ Pe ∧ π)
Fo:
  P(Po | Pr ∧ R ∧ π) = Normal(Pr, ǫr)
  P(C | π) = Uniform
  P(Pa | Pr ∧ C ∧ π) = Normal(pr + c, ǫc)
  P(L | π) = Normal(L0, ǫL)
  P(Pe | Pa ∧ L ∧ π) = δ_{pe = pa + l}
  P(Λ | Po ∧ Pe ∧ π) = δ_{0 ≤ pe − po ≤ ǫt}
Id
Qu:
  P(C | r ∧ λ ∧ π)
  P(Po | r ∧ c ∧ λ ∧ π)
  P(L | r ∧ c ∧ λ ∧ π)
(8.76)
This program assumes the following error models:
1. N ormal(Pr , ǫr ): error model for the sensor.
2. N ormal(pr + c, ǫc ): error model for the control.
3. N ormal(L0 , ǫL ): error model for the manufacturing.
4. δ0≤pe −po ≤ǫt : error model for the task.
Numerous interesting questions may be asked about this model. Let us
take three of them as examples:
P (C|r ∧ λ ∧ π) (8.77)
where knowing the distance measured by the sensor we search the control that
will drive the robot to the contact with the object (inverse kinematic).
P(Po | r ∧ c ∧ λ ∧ π) (8.78)
where we look for the position of the object knowing both the reading of the
sensor and the command that leads to contact (localization).
P (L|r ∧ c ∧ λ ∧ π) (8.79)
where we derive the probability distribution on the length of the arm knowing
the sensor’s reading and the command (calibration).
# conditional normal on Pa: mean given by actuator_model, constant standard deviation 2
PPa = plCndNormal(Pa, Pr^C, \
                  plPythonExternalFunction(Pr^C, actuator_model), \
                  2)
A functional Dirac is used to implement the distribution on the coherence variable Λ:
Lambda = plSymbol("Lambda", plIntegerType(0,1))

def Coherence(Output_, Input_):
    # Lambda = 1 when the contact condition 0 < Pe - Po < 1 is satisfied
    r = Input_[Pe] - Input_[Po]
    if r.to_float() > 0 and r.to_float() < 1:
        Output_[Lambda] = 1
    else:
        Output_[Lambda] = 0

DiracLambda = plFunctionalDirac(Lambda, Pe^Po, \
                                plPythonExternalFunction(Lambda, Pe^Po, Coherence))
inverse_kinematic = model.ask_mc_sample(C,Lambda^R,500)
Le Hasard
Émile Borel [1914]

1 Quels que soient les progrès des connaissances humaines, il y aura toujours place pour l'ignorance et par suite pour le hasard et la probabilité. (Whatever the progress of human knowledge, there will always be room for ignorance, and hence for chance and probability.)
Va: Rain, Sprinkler, GrassWet
Dc:
  P(Sprinkler ∧ Rain ∧ GrassWet | π1)
  = P(Rain | π1) × P(Sprinkler | Rain ∧ π1)
  × P(GrassWet | Rain ∧ Sprinkler ∧ π1)
Fo:
  P([Rain = 1] | π1) = 171/365
  P([Sprinkler = 1] | [Rain = 0] ∧ π1) = 0.40
  P([Sprinkler = 1] | [Rain = 1] ∧ π1) = 0.01
  P([GrassWet = 1] | Rain ∧ Sprinkler ∧ π1) = δ_{Rain ∨ Sprinkler}
Id
Qu: P(Rain | [GrassWet = 1] ∧ π1)
(9.1)
where 171 is the number of rainy days in the considered area, 40% is the
percentage of times the sprinkler triggers when the weather is dry, and 1%
the percentage of times the sprinkler triggers when it should not as the rain
already watered the vegetation.
The answer to the question may be computed by the following formula:

P(Rain | [GrassWet = 1] ∧ π1)
∝ P(Rain | π1) × Σ_Sprinkler [ P(Sprinkler | Rain ∧ π1) × P([GrassWet = 1] | Rain ∧ Sprinkler ∧ π1) ]
(9.2)
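Equation 9.2 can be evaluated by direct enumeration; the following minimal plain Python sketch uses the numbers of the Bayesian program 9.1:

p_rain = 171.0 / 365
p_sprinkler_given_rain = {0: 0.40, 1: 0.01}       # P([Sprinkler = 1] | Rain)

def p_grasswet(rain, sprinkler):                  # delta(Rain OR Sprinkler)
    return 1.0 if (rain or sprinkler) else 0.0

post = []
for rain in (0, 1):
    p_r = p_rain if rain else 1 - p_rain
    total = 0.0
    for sprinkler in (0, 1):
        p_s = p_sprinkler_given_rain[rain] if sprinkler else 1 - p_sprinkler_given_rain[rain]
        total += p_s * p_grasswet(rain, sprinkler)
    post.append(p_r * total)
z = sum(post)
print(post[1] / z)                                # P([Rain = 1] | [GrassWet = 1]) is about 0.69

The value of about 69% is the one reused below in Equations 9.6 and 9.8.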
Va: Rain, Sprinkler, GrassWet, RoofWet
Dc:
  P(Sprinkler ∧ Rain ∧ GrassWet ∧ RoofWet | π2)
  = P(Rain | π2) × P(Sprinkler | Rain ∧ π2)
  × P(GrassWet | Rain ∧ Sprinkler ∧ π2)
  × P(RoofWet | Rain ∧ π2)
Fo:
  P([Rain = 1] | π2) = 171/365
  P([Sprinkler = 1] | [Rain = 0] ∧ π2) = 0.40
  P([Sprinkler = 1] | [Rain = 1] ∧ π2) = 0.01
  P([GrassWet = 1] | Rain ∧ Sprinkler ∧ π2) = δ_{Rain ∨ Sprinkler}
  P([RoofWet = 1] | Rain ∧ π2) = δ_{[Rain=1]}
Id
Qu: P([RoofWet = 1] | [GrassWet = 1] ∧ π2)
(9.4)
The answer to the question may be computed by the following formula:

P([RoofWet = 1] | [GrassWet = 1] ∧ π2)
∝ Σ_Rain [ P([RoofWet = 1] | Rain ∧ π2) × P(Rain | π2)
         × Σ_Sprinkler [ P(Sprinkler | Rain ∧ π2) × P([GrassWet = 1] | Rain ∧ Sprinkler ∧ π2) ] ]
(9.5)
As P([RoofWet = 1] | Rain ∧ π2) = δ_{[Rain=1]}, we finally get:

P([RoofWet = 1] | [GrassWet = 1] ∧ π2) = P([Rain = 1] | [GrassWet = 1] ∧ π2) = 69%
(9.6)
Va: Rain, GrassWet, RoofWet
Dc:
  P(Rain ∧ GrassWet ∧ RoofWet | π3)
  = P(Rain ∧ GrassWet | π3) × P(RoofWet | Rain ∧ π3)
Fo:
  P(Rain ∧ GrassWet | π3) = P(Rain ∧ GrassWet | π1)
  P(RoofWet | Rain ∧ π3) = δ_{Rain=1}
Id
Qu: P([RoofWet = 1] | [GrassWet = 1] ∧ π3)
(9.7)
P(Rain ∧ GrassWet | π3) = P(Rain ∧ GrassWet | π1) may be seen as calling the Bayesian program (9.1) as a probabilistic subroutine. We get directly:

P([RoofWet = 1] | [GrassWet = 1] ∧ π3) = P([Rain = 1] | [GrassWet = 1] ∧ π1) = 69%
(9.8)
model=plJointDistribution(Rain^Roof^GrassWet,\
jointlist)
In this model the variable Sprinkler is not used, but it will produce
the exact same result for all the questions with the variable Rain, Roof ,
and GrassW et as the extended model built for verification:
extendedmodel=plJointDistribution(Rain^Roof^GrassWet^Sprinkler,\
jointlist)
For instance, 171 in the Bayesian program (Equation 9.1) is the average
number of rainy days in Paris. It is a parameter that has been identified using
a set of climate data δP aris . To be more exact, Bayesian program (Equation
9.1) should have been written:
Va: Rain, Sprinkler, GrassWet
Dc:
  P(Sprinkler ∧ Rain ∧ GrassWet | π1)
  = P(Rain | π1) × P(Sprinkler | Rain ∧ π1)
  × P(GrassWet | Rain ∧ Sprinkler ∧ π1)
Fo:
  P([Rain = 1] | δParis ∧ π1) = n/365
  P([Sprinkler = 1] | [Rain = 0] ∧ π1) = 0.40
  P([Sprinkler = 1] | [Rain = 1] ∧ π1) = 0.01
  P([GrassWet = 1] | Rain ∧ Sprinkler ∧ π1) = δ_{Rain ∨ Sprinkler}
Id: learn n as the average number of rainy days in the data set δParis
Qu: P(Rain | [GrassWet = 1] ∧ δParis ∧ π1)
(9.9)
Using another set of data as, for instance, δnice , would lead to another
value of this parameter n, namely 88.
Of course, the questions asked of the model learned on δParis and of the model learned on δNice lead to different answers.
We can introduce a new variable Location with two values: nice and paris
and use a conditional probability distribution to select the submodel we would
like to use knowing our location.
Va: Rain, GrassWet, RoofWet, Location
Dc:
  P(Rain ∧ GrassWet ∧ RoofWet ∧ Location | π4)
  = P(Location | π4)
  × P(Rain ∧ GrassWet | Location ∧ π4)
  × P(RoofWet | Rain ∧ π4)
Fo:
  P(Location | π4) = any
  P(Rain ∧ GrassWet | [Location = paris] ∧ π4) = P(Rain ∧ GrassWet | δParis ∧ π1)
  P(Rain ∧ GrassWet | [Location = nice] ∧ π4) = P(Rain ∧ GrassWet | δNice ∧ π1)
  P([RoofWet = 1] | Rain ∧ π4) = δ_{[Rain=1]}
Id
Qu:
  P([RoofWet = 1] | [GrassWet = 1] ∧ [Location = paris] ∧ π4)
  P([RoofWet = 1] | [GrassWet = 1] ∧ [Location = nice] ∧ π4)
(9.13)
where we state that, knowing the location, we call the Bayesian program specified by preliminary knowledge π1 with learning done either on the data set δParis:

P(Rain ∧ GrassWet | [Location = paris] ∧ π4) = P(Rain ∧ GrassWet | δParis ∧ π1)
(9.14)
or on the data set δN ice
#selecting subroutines
#defines a new variable
Location = plSymbol("Location", plLabelType(['Paris','Nice']))
locval=plValues(Location)
jointlist=plComputableObjectList()
#
#push a uniform distribution for the location
jointlist.push_back(plUniform(Location))
#
#define the two distributions corresponding to Paris and Nice
PGrasswetkLocation=plDistributionTable(GrassWet,Location)
locval[Location]=’Paris’
submodel.replace(Rain,PRainParis)
PGrasswetkLocation.push(submodel.ask(GrassWet),locval)
locval[Location]=’Nice’
submodel.replace(Rain,PRainNice)
PGrasswetkLocation.push(submodel.ask(GrassWet),locval)
#and push it in the joint distribution list
jointlist.push_back(PGrasswetkLocation)
#
#idem for the conditional distribution on Rain
PRainkGrasswetLocation=\
plDistributionTable(Rain,GrassWet^Location,Location)
locval[Location]=’Paris’
submodel.replace(Rain,PRainParis)
PRainkGrasswetLocation.push(submodel.ask(Rain,GrassWet),locval)
locval[Location]=’Nice’
submodel.replace(Rain,PRainNice)
PRainkGrasswetLocation.push(submodel.ask(Rain,GrassWet),locval)
#and push it in the joint distribution list
jointlist.push_back(PRainkGrasswetLocation)
model=plJointDistribution(Rain^Roof^GrassWet^Location,\
jointlist)
Va: I0, I1, S0, C0, O0, S1, C1, O1, S2, C2, O2, I3, S3, C3, O3
Dc:
  P(I0 ∧ I1 ∧ ··· ∧ O3 | πcenter)
  = P(I0 ∧ I1 ∧ I3 | πcenter)
  × P(S0 ∧ C0 ∧ O0 | I0 ∧ I1 ∧ πunit1)
  × P(S1 ∧ C1 ∧ O1 | I0 ∧ I1 ∧ πunit2)
  × P(S2 ∧ C2 ∧ O2 | O0 ∧ O1 ∧ πunit3)
  × P(S3 ∧ C3 ∧ O3 | I3 ∧ O2 ∧ πunit4)
Fo:
Id
Qu:
(9.17)
The results obtained for the different questions are evidently the same, except for the diagnosis ones. Indeed, in Equation 9.17 we have chosen to hide the Fi variables and consequently we can no longer ask questions using them. Here again, it is similar to what occurs with classical subroutine calls, where you cannot, in the calling program, use internal variables of the subroutines. An alternative would have been to explicitly use the Fi variables in the above program to preserve the ability to ask diagnosis questions.
Va: X, Y, D2, D3
Dc:
  P(X ∧ Y ∧ D2 ∧ D3 | π)
  = P(X ∧ Y | π) × P(D2 | X ∧ Y ∧ π) × P(D3 | X ∧ Y ∧ π)
Fo:
  P(X ∧ Y | π) = Uniform
  P(D2 | X ∧ Y ∧ π) = P(D2 | X ∧ Y ∧ πS2)
  P(D3 | X ∧ Y ∧ π) = P(D3 | X ∧ Y ∧ πS3)
Id
Qu: P(X ∧ Y | d2 ∧ d3 ∧ π)
(9.18)
The first sensor becomes faulty if the two conditions A2 and B2 are met, while the other becomes faulty if one of the conditions A3 or B3 is met. We obtain two descriptions, πS2 and πS3, which relate the position X, Y to the sensor readings.
Va: X, Y, F2, D2, A2, B2
Dc:
  P(X ∧ Y ∧ F2 ∧ D2 ∧ A2 ∧ B2 | πS2)
  = P(X ∧ Y | πS2) × P(A2 | πS2) × P(B2 | πS2)
  × P(F2 | A2 ∧ B2 ∧ πS2) × P(D2 | X ∧ Y ∧ F2 ∧ πS2)
Fo:
  P(X ∧ Y | πS2) = Uniform
  P(A2 = true | πS2) = 0.2, P(B2 = true | πS2) = 0.1
  P(F2 | A2 ∧ B2 ∧ πS2) = δ_{A2 ∧ B2}
  P(D2 | X ∧ Y ∧ F2 ∧ πS2):
    [F2 = 0]: B([µ = fd2(X, Y)], [σ = gd2(X, Y)])
      with fd2 = √((X + 50)² + Y²) and gd2 = 1 + fd2(X, Y)/10
    [F2 = 1]: Uniform
Id
Qu: P(D2 | X ∧ Y ∧ πS2)
(9.19)
Va: X, Y, F3, D3, A3, B3
Dc:
  P(X ∧ Y ∧ F3 ∧ D3 ∧ A3 ∧ B3 | πS3)
  = P(X ∧ Y | πS3) × P(A3 | πS3) × P(B3 | πS3)
  × P(F3 | A3 ∧ B3 ∧ πS3) × P(D3 | X ∧ Y ∧ F3 ∧ πS3)
Fo:
  P(X ∧ Y | πS3) = Uniform
  P(A3 = true | πS3) = 0.01, P(B3 = true | πS3) = 0.03
  P(F3 | A3 ∧ B3 ∧ πS3) = δ_{A3 ∨ B3}
  P(D3 | X ∧ Y ∧ F3 ∧ πS3):
    [F3 = 0]: B([µ = fd3(X, Y)], [σ = gd3(X, Y)])
      with fd3 = √(X² + (Y + 50)²) and gd3 = 1 + fd3(X, Y)/10
    [F3 = 1]: Uniform
Id
Qu: P(D3 | X ∧ Y ∧ πS3)
(9.20)
These two descriptions differ by the type of fault model, δ_{A2∧B2} versus δ_{A3∨B3}, and also by the sensor models, which have to take into account the location of the landmarks: fd2 = √((X + 50)² + Y²) versus fd3 = √(X² + (Y + 50)²).
9.5 Superposition
The standard fusion program for our localization problem (see Section
7.3) can be restated as follows (at first, dropping the distances for the sake of
simplicity):
Va: X, Y, XB, YB, B1, B2, B3
Dc:
  P(X ∧ Y ∧ XB ∧ YB ∧ B1 ∧ B2 ∧ B3 | πB)
  = P(B1 ∧ B2 ∧ B3 | πB)
  × P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πB)
  × P(X ∧ Y | XB ∧ YB ∧ B1 ∧ B2 ∧ B3 ∧ πB)
Fo:
  P(B1 ∧ B2 ∧ B3 | πB) = Uniform
  P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πB) = P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πk)
  P(X ∧ Y | XB ∧ YB ∧ B1 ∧ B2 ∧ B3 ∧ πB):
    if XB > 0, YB > 0: P(X ∧ Y | B1 ∧ B2 ∧ B3 ∧ πk)
    else: Uniform
Id
Qu: P(X ∧ Y | b1 ∧ b2 ∧ b3 ∧ πB)
(9.21)
Program 9.21 implements the following idea: if the position corresponding to the measurements is within the specified region, the distribution on the location remains unchanged; otherwise it is unknown.
Figure 9.1 represents the distribution for a given measurement of the bearings corresponding to the location X = Y = 0.
[Figure 9.1: the distributions on the location for a given measurement of the bearings corresponding to X = Y = 0 (panels (a) and (b)).]
Since the two regions do not overlap, we can stitch our two localization procedures on the same space with the program in Equation 9.22. The result is presented in Figure 9.2. The sensor superposition still gives good results on the boundaries of the regions.
Should the regions overlap, it is then possible to fuse all the sensors in that
region. For example, if we use the program (Equation 7.6) and if we assume a
new valid region XB ≥ −20, Y B ≥ −20 for the localization with bearings we
may introduce two new variables XF , Y F and use the sensor fusion program
in the region −20 ≤ XF ≤ 0, −20 ≤ Y F ≤ 0.
Va: X, Y, XB, YB, XD, YD, B1, B2, B3, D1, D2, D3, CB, CD
Dc:
  P(X ∧ Y ∧ XB ∧ YB ∧ XD ∧ YD ∧ B1 ∧ ... ∧ D3 ∧ CB ∧ CD | πS)
  = P(B1 ∧ B2 ∧ B3 ∧ D1 ∧ D2 ∧ D3 | πS)
  × P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πS) × P(CB | XB ∧ YB ∧ πS)
  × P(XD ∧ YD | D1 ∧ D2 ∧ D3 ∧ πS) × P(CD | XD ∧ YD ∧ πS)
  × P(X ∧ Y | CB ∧ CD)
Fo:
  P(B1 ∧ ... ∧ D3 | πS) = Uniform
  P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πS) = P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πk)
  P(XD ∧ YD | D1 ∧ D2 ∧ D3 ∧ πS) = P(XD ∧ YD | D1 ∧ D2 ∧ D3 ∧ πk)
  P(CB | XB ∧ YB ∧ πS):
    if XB ≥ 0, YB ≥ 0: P(CB = 1) = 1
    else: P(CB = 1) = 0
  P(CD | XD ∧ YD ∧ πS):
    if XD < 0, YD < 0: P(CD = 1) = 1
    else: P(CD = 1) = 0
  P(X ∧ Y | CD ∧ CB ∧ πS):
    if CD = 1: P(XD ∧ YD | D1 ∧ D2 ∧ D3 ∧ πk)
    else if CB = 1: P(XB ∧ YB | B1 ∧ B2 ∧ B3 ∧ πk)
    else: Uniform
Id
Qu: P(X ∧ Y | d1 ∧ d2 ∧ d3 ∧ b1 ∧ b2 ∧ b3 ∧ πS)
(9.22)
[Figure 9.2: the distribution on the location obtained by stitching the two localization procedures on the same space (Equation 9.22).]
• PXDYD_K_D1D2D3=\
localisation_model_with_bearings.ask(XD^YD,D1^D2^D3)
Let’s recall the Khepera robot (see Chapter 4) and formalize the programs
to push a part or to follow its contour. The probabilistic variables Dir, P rox,
and Rot, respectively, denote the direction of the nearest obstacle, its prox-
imity, and the rotational speed of the robot (the robot is assumed to move at
constant speed).
Sp(πb ∧ δb):
Va: Dir, Prox, Rot
Dc:
  P(Dir ∧ Prox ∧ Rot | πb ∧ δb)
  = P(Dir ∧ Prox | πb ∧ δb) × P(Rot | Dir ∧ Prox ∧ πb ∧ δb)
Fo:
  P(Dir ∧ Prox | πb ∧ δb) = Uniform
  P(Rot | Dir ∧ Prox ∧ πb ∧ δb) = Normal(σb, µb)
Id:
  σb(Dir, Prox) ← δb
  µb(Dir, Prox) ← δb
Qu: P(Rot | Dir ∧ Prox ∧ πb ∧ δb)
(10.1)
The light sensors of the robot may also be used to build a new variable
θl indicating the direction of a light beam in the robot reference frame. This
new variable may be used to move toward the light (phototaxis) using the
program in Equation 10.2.
Sp(πphototaxis):
Va: Θl, Rot
Dc:
  P(Θl ∧ Rot | πphototaxis)
  = P(Θl | πphototaxis) × P(Rot | Θl ∧ πphototaxis)
Fo:
  P(Θl | πphototaxis) = Uniform
  P(Rot | Θl ∧ πphototaxis) = Normal(µ = Θl, σ = 2)
Id:
Qu: P(Rot | Θl ∧ πphototaxis)
(10.2)
The probabilistic conditional statement will be used to combine the two be-
haviors: “phototaxy” and “avoidance” into a more complex behavior (“home”)
leading the robot to reach its base (where the light is located) while avoiding
the obstacles.
Sp(πhome):
Va: Dir, Prox, Θl, H, Rot
Dc:
  P(Dir ∧ Prox ∧ Θl ∧ H ∧ Rot | πhome)
  = P(Dir ∧ Prox ∧ Θl | πhome)
  × P(H | Prox ∧ πhome)
  × P(Rot | Dir ∧ Prox ∧ H ∧ Θl ∧ πhome)
Fo:
  P(Dir ∧ Prox ∧ Θl | πhome) = Uniform
  P(H = avoidance | Prox ∧ πhome) = S-Shape(Prox)
  P(Rot | Dir ∧ Prox ∧ H ∧ Θl ∧ πhome):
    H = avoidance: P(Rot | Dir ∧ Prox ∧ πavoidance ∧ δavoidance)
    H = phototaxy: P(Rot | Θl ∧ πphototaxy)
Id:
Qu: P(Rot | Dir ∧ Prox ∧ Θl ∧ πhome)
(10.3)
P(H | Prox ∧ πhome) = 1 / (1 + e^{β(α − Prox)}) (10.4)
FIGURE 10.1: The shape of the sigmoid defines the way to mix behaviors.
Far from the obstacle (P rox = 0) the probability of [H = 1] is 0 meaning that
you want to do phototaxy, on the contrary, close to the obstacle (P rox = 15)
the probability of [H = 1] is 1 meaning that you only care about avoiding
the obstacle. In between the behavior will be a combination of phototaxy and
avoidance behaviors (see below).
FIGURE 10.2: The top left distribution shows the knowledge on Rot given
by the phototaxy description; the top right is the probability on Rot given
by the “avoidance” description; the bottom left shows the knowledge of the
“command variable” H; finally, the bottom right shows the probability distri-
bution on Rot resulting from the marginalization (weighted sum) of variable
H, and the robot will most probably turn right.
FIGURE 10.3: We are now far from the obstacle; the probability of [H =
phototaxy] is higher than the probability of [H = avoidance] (bottom left).
Consequently, the result of the combination is completely different than in the
previous case and the robot will most probably turn left.
Sp(πhome):
Va: Dir, Prox, Θl, H, Rot
Dc: identical to 10.3
Qu: P(H | Dir ∧ Prox ∧ Θl ∧ πhome)
(10.6)
Sp(πM):
Va: H ∈ [1, ..., n], I, S1, ..., Sn
Dc:
  P(H ∧ I ∧ S1 ∧ ... ∧ Sn | πM)
  = P(I | πM) × P(H | I ∧ πM)
  × P(S1 | H ∧ I ∧ πM) × ... × P(Sn | H ∧ I ∧ πM)
Fo:
  P(I | πM), P(H | I ∧ πM): any
  P(S | H ∧ I ∧ πM):
    H = 1: P(S | I ∧ π1)
    ...
    H = n: P(S | I ∧ πn)
Id:
Qu: P(S | I ∧ πM)
(10.7)
P(S | I ∧ πM) = (1/Z) × Σ_{h=1,...,n} [ P([H = h] | I ∧ πM) × P(S | I ∧ πh) ] (10.8)
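The probabilistic "if-then-else" of Equation 10.8 is simply a mixture: the distribution proposed by each behavior is weighted by the probability of that behavior. The following plain Python sketch illustrates it with two assumed toy rotation models and the sigmoid of Equation 10.4 (the values of α and β, and the two Gaussians, are hypothetical, not the identified Khepera models):

import math

rotations = [r / 10.0 for r in range(-10, 11)]            # Rot in [-1, 1]

def normal(values, mu, sigma):
    w = [math.exp(-((v - mu) ** 2) / (2 * sigma ** 2)) for v in values]
    z = sum(w)
    return [x / z for x in w]

avoidance = normal(rotations, 0.8, 0.2)                   # P(Rot | ... pi_avoidance): turn right
phototaxy = normal(rotations, -0.5, 0.3)                  # P(Rot | ... pi_phototaxy): turn left

def mix(prox, alpha=7.0, beta=1.0):
    p_avoid = 1.0 / (1.0 + math.exp(beta * (alpha - prox)))   # sigmoid of Equation 10.4
    return [p_avoid * a + (1 - p_avoid) * p for a, p in zip(avoidance, phototaxy)]

far = mix(prox=1)      # far from the obstacle: dominated by phototaxy
near = mix(prox=14)    # close to the obstacle: dominated by avoidance
print(rotations[max(range(len(far)), key=lambda i: far[i])])    # about -0.5
print(rotations[max(range(len(near)), key=lambda i: near[i])])  # about 0.8

Far from the obstacle the mixture is essentially the phototaxy distribution, close to it the avoidance distribution, and in between a genuine combination of both, exactly as in Figures 10.2 and 10.3.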
Va:
  O1 ∈ [1, 6], S1 ∈ [1, 6]
  ...
  ON ∈ [1, 6], SN ∈ [N, N × 6]
Dc:
  P(O1 ∧ S1 ∧ ... ∧ Oi ∧ Si ∧ ... ∧ ON ∧ SN)
  = P(O1) × P(S1 | O1)
  × P(O2) × P(S2 | S1 ∧ O2)
  × ...
  × P(ON) × P(SN | SN−1 ∧ ON)
Fo:
  P(O1) = P(O2) = ... = P(ON) = Uniform
  P(S1 | O1) = δ_{o1}
  P(S2 | S1 ∧ O2) = δ_{s1+o2}
  ...
  P(SN | SN−1 ∧ ON) = δ_{sN−1+oN}
Id:
Qu: P(SN)
(11.1)
[Figure: (a) the distribution P(S3); (b) the distribution P(O3 | S3 = 14).]
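A minimal Python sketch (an illustration, not the book's code) of the same computation: it builds P(SN) for the sum of N fair dice by iterating the decomposition above, and then inverts it with Bayes' rule to obtain P(O3 | S3 = 14). All names here are hypothetical.

```python
import numpy as np

def dice_sum_distribution(n):
    # P(S1) for a single fair die
    p_s = np.zeros(6 * n + 1)
    p_s[1:7] = 1.0 / 6.0
    for _ in range(n - 1):
        # P(S_t = k) = sum over o of P(S_{t-1} = k - o) * P(O = o)
        p_next = np.zeros_like(p_s)
        for o in range(1, 7):
            p_next[o:] += p_s[:-o] / 6.0
        p_s = p_next
    return p_s  # index k holds P(S_n = k)

p_s3 = dice_sum_distribution(3)

# P(O3 | S3 = 14) ∝ P(O3) * P(S3 = 14 | O3) = (1/6) * P(S2 = 14 - o3)
p_s2 = np.pad(dice_sum_distribution(2), (0, 6))   # pad with zeros beyond the support
posterior = np.array([p_s2[14 - o] / 6.0 for o in range(1, 7)])
posterior /= posterior.sum()

print("P(S3 = 14) =", round(p_s3[14], 4))
print("P(O3 | S3 = 14) =", np.round(posterior, 3))
```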
Va:
  S^t, ∀t ∈ [0, …, T]: S^t ∈ D_S
  O^t, ∀t ∈ [1, …, T]: O^t ∈ D_O
Dc:
  P(S^0 ∧ O^1 ∧ … ∧ S^t ∧ O^t ∧ … ∧ S^T ∧ O^T)
    = P(S^0) × ∏_{t∈[1…T]} [P(S^t | S^{t−1}) × P(O^t | S^t)]
Fo:
  P(S^0) = Initial condition
  P(S^t | S^{t−1}) = Transition Model
  P(O^t | S^t) = Sensor Model
Id:
Qu: P(S^T | o^1 ∧ … ∧ o^T)
(11.2)
The variables S t have the same definition domain and denote the states
of the system at time t. For example, a sequence s0 , s1 , ..., sT represents a
possible evolution for the system. The variable Ot denotes the observation
of the system at time t. These variables share the same definition domain.
The Bayesian program 11.2 encodes the two main hypotheses used in classical
Bayesian filters.
• The probability distribution on the current state only depends on the previous state, P(S^t | S^{t−1}) (order 1 Markov hypothesis).
• The observation only depends on the current state: P(O^t | S^t).
The parametric form P(S^t | S^{t−1}) does not depend on t and defines a stationary transition model. The filtering question can then be written as:

P(S^T | o^1 ∧ ⋯ ∧ o^T) ∝ Σ_{S^1⋯S^{T−1}} [ ∏_{t=1}^{T} P(o^t | S^t) × P(S^t | S^{t−1}) ] × P(S^0)   (11.3)
Summing first over the variables S^1 ⋯ S^{T−2}, we obtain:

P(S^T | o^1 ∧ ⋯ ∧ o^T)
  ∝ P(o^T | S^T) × Σ_{S^{T−1}} [ P(S^T | S^{T−1}) × Σ_{S^1⋯S^{T−2}} [ ∏_{t=1}^{T−1} P(o^t | S^t) × P(S^t | S^{t−1}) ] × P(S^0) ]   (11.4)
In the term Σ_{S^1⋯S^{T−2}} [ ∏_{t=1}^{T−1} P(o^t | S^t) × P(S^t | S^{t−1}) ] × P(S^0) we recognize the
same filtering question at the preceding instant:

P(S^{T−1} | o^1 ∧ ⋯ ∧ o^{T−1}) ∝ Σ_{S^1⋯S^{T−2}} [ ∏_{t=1}^{T−1} P(o^t | S^t) × P(S^t | S^{t−1}) ] × P(S^0)   (11.5)
And we finally get the recursive expression:

P(S^T | o^1 ∧ ⋯ ∧ o^T) ∝ P(o^T | S^T) × Σ_{S^{T−1}} [ P(S^T | S^{T−1}) × P(S^{T−1} | o^1 ∧ ⋯ ∧ o^{T−1}) ]   (11.6)
Σ_{S^{T−1}} [ P(S^T | S^{T−1}) × P(S^{T−1} | o^1 ∧ ⋯ ∧ o^{T−1}) ] = P(S^T | o^1 ∧ ⋯ ∧ o^{T−1})   (11.7)
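The recursion in Equations 11.6 and 11.7 translates directly into a short program. The following Python sketch (an illustration under assumed models, not the book's code) runs the predict/update cycle on a small discrete state space with a hypothetical transition matrix and sensor model.

```python
import numpy as np

n_states = 10
# Hypothetical transition model P(S^t | S^{t-1}): mostly stay, sometimes move right
transition = np.zeros((n_states, n_states))
for s in range(n_states):
    transition[s, s] = 0.6
    transition[s, (s + 1) % n_states] = 0.4
# Hypothetical sensor model P(O^t | S^t): the observation equals the state 70% of the time
sensor = np.full((n_states, n_states), 0.3 / (n_states - 1))
np.fill_diagonal(sensor, 0.7)

belief = np.full(n_states, 1.0 / n_states)   # P(S^0): uniform initial condition
observations = [2, 3, 3, 4, 5]

for o in observations:
    # Prediction (Equation 11.7): sum over S^{T-1} of P(S^T | S^{T-1}) * P(S^{T-1} | past observations)
    predicted = transition.T @ belief
    # Estimation (Equation 11.6): multiply by P(o^T | S^T) and normalize
    belief = sensor[:, o] * predicted
    belief /= belief.sum()

print("most probable state:", int(np.argmax(belief)))
```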
P(X^t | X^{t−1} ∧ π) = B(μ = X^{t−1}, [σ = 5])   (11.9)
P(Y^t | Y^{t−1} ∧ π) = B(μ = Y^{t−1}, [σ = 5])   (11.10)
Sp(π):
  Va: X^0, Y^0, ⋯, X^T, Y^T, B_1^1, ⋯, B_3^T
  Dc:
    P(X^0 ∧ ⋯ ∧ B_3^T)
      = P(X^0 ∧ Y^0) × ∏_{t=1}^{T} [ P(X^t | X^{t−1}) × P(Y^t | Y^{t−1}) × ∏_{i=1}^{3} P(B_i^t | X^t ∧ Y^t) ]
  Fo:
    P(X^0 ∧ Y^0) = Uniform
    P(X^t | X^{t−1}) = B(μ = X^{t−1}, [σ = 5])
    P(Y^t | Y^{t−1}) = B(μ = Y^{t−1}, [σ = 5])
    P(B_i^t | X^t ∧ Y^t) = B(μ = f_b^i(X^t, Y^t), [σ = 10])
Id:
Qu: P(X^T ∧ Y^T | B_1^1 ∧ ⋯ ∧ B_3^T)
(11.11)
[Figure: probability distributions P over the position (X, Y); panels (a), (b), and (c).]
Va:
  S^0 … S^T, ∀t ∈ [0, …, T]: S^t ∈ D_S
  M^0 … M^T, ∀t ∈ [0, …, T]: M^t ∈ D_M
  O^1 … O^T, ∀t ∈ [1, …, T]: O^t ∈ D_O
Dc:
  P(S^0 ∧ M^0 ∧ O^1 ∧ … ∧ S^T ∧ M^T ∧ O^T)
    = P(S^0) × P(M^0) × ∏_{t∈[1…T]} [ P(M^t) × P(S^t | S^{t−1} ∧ M^{t−1}) × P(O^t | S^t) ]
Fo:
  P(S^0) = Initial condition
  P(M^t) = Priors on commands
  P(S^t | S^{t−1} ∧ M^{t−1}) = Transition Model
  P(O^t | S^t) = Sensor Model
Id:
Qu: P(S^T | o^0 ∧ m^0 ∧ o^1 ∧ m^1 ∧ … ∧ o^T)
(11.12)
The specification of Equation 11.12 permits adding several interesting questions to the questions already attached to a Bayesian filter, for instance:
• Forecasting: P(S^k | m^0 ∧ o^1 ∧ m^1 ∧ … ∧ o^T ∧ m^T ∧ m^{T+1} ∧ … ∧ m^k), to estimate the state of the system in the future (k > T) from the past and present measurements and the past and future commands.
• Control: to select the current control so as to reach a given state in the future (s^k), knowing the past and present measurements and the past commands.
Note that the recursive property is only valid for the filtering question.
The other questions, even if completely valid from a mathematical point of
view, may be intractable in practice because of the cumbersome computations they require.
The boat is instructed to move with constant speed (Vx, Vy) during the next time
interval. This hypothesis leads to the following transition model:
Sp(π):
  Va: X^0, Y^0, V_x^0, V_y^0, ⋯, V_y^T, B_1^1, ⋯, B_3^T
  Dc:
    P(X^0 ∧ ⋯ ∧ B_3^T)
      = P(X^0 ∧ Y^0) × P(V_x^0) × P(V_y^0)
        × ∏_{t=1}^{T} [ P(V_x^t) × P(V_y^t) × P(X^t | X^{t−1} ∧ V_x^{t−1}) × P(Y^t | Y^{t−1} ∧ V_y^{t−1}) × ∏_{i=1}^{3} P(B_i^t | X^t ∧ Y^t) ]
  Fo:
    P(X^0 ∧ Y^0) = Uniform
    P(V_x^t) = Constant
    P(V_y^t) = Constant
    P(X^t | X^{t−1} ∧ V_x^{t−1}) = B(μ = X^{t−1} + V_x^{t−1} × δt, [σ = 5])
    P(Y^t | Y^{t−1} ∧ V_y^{t−1}) = B(μ = Y^{t−1} + V_y^{t−1} × δt, [σ = 5])
    P(B_i^t | X^t ∧ Y^t) = B(μ = f_b^i(X^t, Y^t), [σ = 10])
Id:
Qu: P(X^T ∧ Y^T | v_x^0 ∧ ⋯ ∧ b_3^T)
(11.15)
This program may be, for instance, used to estimate the position of the
boat knowing successive observations of the bearings and assuming a constant
velocity toward the upper right corner as in Figure 11.3.
FIGURE 11.3: The precision of the location increases even though the boat is moving at constant speed toward the upper right corner. Such a model is very similar to what is used in a GPS (global positioning system).
JointDistributionList = plComputableObjectList()
# use the mutable distribution as the prior on the state distribution
JointDistributionList.push_back(PXt_1Yt_1)
# use available knowledge on M_t
JointDistributionList.push_back(plUniform(Mxt_1 ^ Myt_1))
# define the prediction term
JointDistributionList.push_back(plCndNormal(Xt, Xt_1 ^ Mxt_1,
    plPythonExternalFunction(Xt_1 ^ Mxt_1, f_vx), 2))
JointDistributionList.push_back(plCndNormal(Yt, Yt_1 ^ Myt_1,
    plPythonExternalFunction(Yt_1 ^ Myt_1, f_vy), 2))
# define the sensor model
JointDistributionList.push_back(plCndNormal(B1, Xt ^ Yt,
    plPythonExternalFunction(Xt ^ Yt, f_b_1), 10.0))
JointDistributionList.push_back(plCndNormal(B2, Xt ^ Yt,
    plPythonExternalFunction(Xt ^ Yt, f_b_2), 10.0))
JointDistributionList.push_back(plCndNormal(B3, Xt ^ Yt,
    plPythonExternalFunction(Xt ^ Yt, f_b_3), 10.0))
# define the joint distribution
filtered_localisation_model = plJointDistribution(JointDistributionList)
Part III
Bayesian Programming
Formalism and Algorithms
Chapter 12
Bayesian Programming Formalism
Discours de la Méthode
Descartes [1637]
2. The normalization rule, which states that the sum of the probabilities
of a and ¬a is one.
¹ For sources giving justifications of these two rules, see Chapter 16, Section 16.8 on the Cox theorem.
² These two rules are sufficient as long as we work with discrete variables. To use continuous variables, considerably more elaborate mathematics is required. See Chapter 16, “Discrete versus continuous variables” (Section 16.9) for a discussion on this matter.
P (X ∧ Y |Z ∧ π) = P (X|Z ∧ π) (12.5)
stands for:
∀xi ∈ X, ∀yj ∈ Y, ∀zk ∈ Z,
(12.6)
P (xi ∧ yj |zk ∧ π) = P (xi |zk ∧ π)
P (X ∧ Y |π) = P (X|π) × P (Y |X ∧ π)
(12.7)
= P (Y |π) × P (X|Y ∧ π)
According to our convention for probabilistic formulas including variables,
this may be restated as:
∀xi ∈ X, ∀yj ∈ Y,
(12.8)
P (xi ∧ yj |π) = P (xi |π) × P (yj |xi ∧ π) = P (yj |π) × P (xi |yj ∧ π)
which may be directly deduced from the conjunction rule for propositions
(Equation 12.1).
³ In contrast, the disjunction of two variables, defined as the set of propositions x_i ∨ x_j, is not a variable. These propositions are not mutually exclusive.
where the first equality derives from the normalization rule for propositions
(Equation 12.2), the second from the exhaustiveness of propositions xi , and
the third from both the application of Equation 12.3 and the mutual exclu-
sivity of propositions xi .
Σ_X [P(X ∧ Y | π)] = Σ_X [P(Y | π) × P(X | Y ∧ π)]
                   = P(Y | π) × Σ_X [P(X | Y ∧ π)]   (12.12)
                   = P(Y | π)
Program {
  Description {
    Specification (π) {
      Variables
      Decomposition
      Forms (Parametric or Program)
    }
    Identification (based on δ)
  }
  Question
}
12.12 Description
The purpose of a description is to specify an effective method of computing
a joint distribution on a set of variables {X1 , X2 , · · · , XN } given a set of ex-
perimental data δ and some specification π. This joint distribution is denoted
as: P (X1 ∧ X2 ∧ · · · ∧ XN |δ ∧ π).
12.13 Specification
To specify preliminary knowledge, the programmer must undertake the
following:
P (X1 ∧ X2 ∧ · · · ∧ XN |δ ∧ π)
= P (L1 ∧ · · · ∧ LK |δ ∧ π)
(12.13)
= P (L1 |δ ∧ π) × P (L2 |L1 ∧ δ ∧ π)
× · · · × P (LK |LK−1 ∧ · · · ∧ L1 ∧ δ ∧ π)
We then obtain:
P (X1 ∧ X2 ∧ · · · ∧ XN |δ ∧ π)
= P (L1 |δ ∧ π) × P (L2 |R2 ∧ δ ∧ π) × · · · × P (LK |RK ∧ δ ∧ π)
(12.15)
Such a simplification of the joint distribution as a product of simpler
distributions is called a decomposition.
This ensures that each variable appears at most once on the left
of a conditioning bar, which is the necessary and sufficient condition
to write mathematically valid decompositions.
12.14 Questions
Given a description (i.e., P (X1 ∧ X2 ∧ · · · ∧ XN |δ ∧ π)), a question is ob-
tained by partitioning {X1 , X2 , · · · , XN } into three sets: the searched vari-
ables, the known variables, and the free variables.
We define the variables Searched, Known, and F ree as the conjunction
of the variables belonging to these sets. We define a question as the set of
distributions:
P (Searched|Known ∧ δ ∧ π) (12.16)
made of as many “instantiated questions” as the cardinality of Known, each
instantiated question being the distribution:
P (Searched|known ∧ δ ∧ π) (12.17)
12.15 Inference
Given the joint distribution P (X1 ∧ X2 ∧ · · · ∧ XN |δ ∧ π), it is always pos-
sible to compute any possible question using the following general inference:
P(Searched | known ∧ δ ∧ π)
  = Σ_{Free} [P(Searched ∧ Free | known ∧ δ ∧ π)]
  = Σ_{Free} [P(Searched ∧ Free ∧ known | δ ∧ π)] / P(known | δ ∧ π)
  = Σ_{Free} [P(Searched ∧ Free ∧ known | δ ∧ π)] / Σ_{Free∧Searched} [P(Searched ∧ Free ∧ known | δ ∧ π)]
  = (1/Z) × Σ_{Free} [P(Searched ∧ Free ∧ known | δ ∧ π)]
(12.18)
where the first equality results from the marginalization rule (Equation 12.11),
the second results from the conjunction rule (Equation 12.7), and the third
corresponds to a second application of the marginalization rule. The denom-
inator appears to be a normalization term. Consequently, by convention, we
will replace it by Z.
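As a concrete illustration (not taken from the book), the following Python sketch performs this generic inference by brute force on a tiny joint distribution stored as an array: it sums over the free variables and normalizes. The three-variable joint used here is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small joint distribution P(A, B, C) over three binary variables, as a 2x2x2 array
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def ask(joint, searched_axis, known_axis, known_value):
    """P(Searched | Known = known_value): sum over the remaining (free) axes, then normalize."""
    # Fix the known variable to its observed value
    table = np.take(joint, known_value, axis=known_axis)
    # Sum over every axis except the searched one (these are the Free variables)
    searched_axis_after = searched_axis - (1 if known_axis < searched_axis else 0)
    free_axes = tuple(ax for ax in range(table.ndim) if ax != searched_axis_after)
    unnormalized = table.sum(axis=free_axes)
    return unnormalized / unnormalized.sum()   # the 1/Z normalization term

# Example: P(A | C = 1), with A on axis 0 and C on axis 2
print("P(A | C = 1) =", ask(joint, searched_axis=0, known_axis=2, known_value=1))
```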
Theoretically, this allows us to solve any Bayesian inference problem. In practice, however, the cost of computing this expression exhaustively and exactly is, for most problems, far too high.

Chapter 13
Bayesian Models Revisited
The goal of this chapter is to review the main probabilistic models currently
used.
We systematically use the Bayesian Programming formalism to present
these models, because it is precise and concise, and it simplifies their compar-
ison.
We mainly concentrate on the definition of these models. Discussions about
inference and computation are postponed to Chapter 14 and discussions about
learning and identification are postponed to Chapter 15.
We chose to divide the different probabilistic models into three categories:
the general purpose probabilistic models, the engineering oriented probabilis-
tic models, and the cognitive oriented probabilistic models.
In the first category, the modeling choices are made independently of any
specific knowledge about the modeled phenomenon. Most of the time, these
choices are essentially made to keep the inference tractable. However, the
technical simplifications of these models may be compatible with large classes
of problems and consequently may have numerous applications.
In the second category, on the contrary, the modeling choices and simpli-
fications are decided according to some specific knowledge about the modeled
phenomenon. These choices may sometimes lead to very poor models from
a computational viewpoint. However, most of the time, problem-dependent
knowledge, such as conditional independence between variables, leads to very
significant and effective simplifications and computational improvements.
Several of these models were already presented in more detail in previous
chapters. Certain models will appear several times in different categories but
are presented with a different point of view for each presentation. We think
that these repetitions are useful as our goal in this chapter is to give a synthetic
overview of all these models.
Sp(π):
  Va: X1, ⋯, XN
  Dc:
    P(X1 ∧ ⋯ ∧ XN | π) = ∏_{n=1}^{N} [P(Xn | Rn ∧ π)]
  Fo: any
Id:
Qu: P(Xn | known)
(13.1)
• The pertinent variables are not constrained and have no specific seman-
tics.
• The parametric forms are not constrained but they are very often re-
stricted to probability tables.
Readings on Bayesian networks and graphical models should start with the
following introductory textbooks: Probabilistic Reasoning in Intelligent Sys-
tems: Networks of Plausible Inference [Pearl, 1988], Graphical Models [Lau-
ritzen, 1996], Learning in Graphical Models [Jordan, 1999], and Graphical
Models for Machine Learning and Digital Communication [Frey, 1998].
See FAQ-FAM, Chapter 16, Section 16.3 for a summary of the differences
between Bayesian programming and Bayesian networks.
Sp(π):
  Va: X_1^0, ⋯, X_N^0, ⋯, X_1^T, ⋯, X_N^T
  Dc:
    P(X_1^0 ∧ ⋯ ∧ X_N^T | π) = ∏_{t=0}^{T} ∏_{n=1}^{N} [P(X_n^t | R_n^t ∧ π)]
  Fo: any
Id:
Qu: P(X_n^t | known)
(13.2)
• The product ∏_{n=1}^{N} P(X_n^t | R_n^t ∧ π) defines a graph for a time slice, and all time slices are identical when the time index t changes.¹
The best introduction, survey, and starting point on DBNs is the PhD
thesis of K. Murphy, Dynamic Bayesian Networks: Representation, Inference
and Learning [Murphy, 2002].
¹ The first time slice may be different as it expresses initial conditions.
Sp(π):
  Va: S^0, ⋯, S^T, O^0, ⋯, O^T
  Dc:
    P(S^0 ∧ ⋯ ∧ S^T ∧ O^0 ∧ ⋯ ∧ O^T | π)
      = P(S^0 ∧ O^0) × ∏_{t=1}^{T} [P(S^t | S^{t−1}) × P(O^t | S^t)]
  Fo:
    P(S^0 ∧ O^0)
    P(S^t | S^{t−1})
    P(O^t | S^t)
Id:
Qu: P(S^{t+k} | O^0 ∧ ⋯ ∧ O^t)
  (k = 0) ≡ Filtering
  (k > 0) ≡ Prediction
  (k < 0) ≡ Smoothing
(13.3)
– on P(S^t | S^{t−1}), called the dynamic model, which formalizes the transition from the state at time t − 1 to the state at time t;
– on P(O^t | S^t), called the observation model, which expresses what can be observed about the state at time t.

The filtering question P(S^t | O^0 ∧ ⋯ ∧ O^t) may be computed simply from P(S^{t−1} | O^0 ∧ ⋯ ∧ O^{t−1}) with the following formula:

P(S^t | O^0 ∧ ⋯ ∧ O^t) ∝ P(O^t | S^t) × Σ_{S^{t−1}} [P(S^t | S^{t−1}) × P(S^{t−1} | O^0 ∧ ⋯ ∧ O^{t−1})]   (13.4)
Another interesting point of view for this equation is to consider that there
are two phases, a prediction phase and an estimation phase:
• During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous moment:
  P(S^t | O^0 ∧ ⋯ ∧ O^{t−1}) = Σ_{S^{t−1}} [P(S^t | S^{t−1}) × P(S^{t−1} | O^0 ∧ ⋯ ∧ O^{t−1})]   (13.5)
• During the estimation phase, the prediction is either confirmed or invalidated using the last observation:
  P(S^t | O^0 ∧ ⋯ ∧ O^t) ∝ P(O^t | S^t) × P(S^t | O^0 ∧ ⋯ ∧ O^{t−1})   (13.6)
Sp(π):
  Va: S^0, ⋯, S^T, O^0, ⋯, O^T
  Dc:
    P(S^0 ∧ ⋯ ∧ O^T | π)
      = P(S^0 ∧ O^0 | π) × ∏_{t=1}^{T} [P(S^t | S^{t−1} ∧ π) × P(O^t | S^t ∧ π)]
  Fo:
    P(S^0 ∧ O^0 | π) ≡ Matrix
    P(S^t | S^{t−1} ∧ π) ≡ Matrix
    P(O^t | S^t ∧ π) ≡ Matrix
Id:
Qu: Max_{S^1∧⋯∧S^{T−1}} [P(S^1 ∧ ⋯ ∧ S^{T−1} | S^T ∧ O^0 ∧ ⋯ ∧ O^T ∧ π)]
(13.7)

Max_{S^1∧⋯∧S^{T−1}} [P(S^1 ∧ ⋯ ∧ S^{T−1} | S^T ∧ O^0 ∧ ⋯ ∧ O^T ∧ π)]   (13.8)
What is the most probable series of states that leads to the present state,
knowing the past observations?2
This particular question may be answered with a specific and very efficient
algorithm called the Viterbi algorithm, which is presented in Chapter 14.
A specific learning algorithm called the Baum–Welch algorithm has also
been developed for HMMs (see Chapter 15).
A good introduction to HMMs is Rabiner’s tutorial [Rabiner, 1989].
Sp(π):
  Va: S^0, ⋯, S^T, O^0, ⋯, O^T
  Dc:
    P(S^0 ∧ ⋯ ∧ O^T | π)
      = P(S^0 ∧ O^0 | π) × ∏_{t=1}^{T} [P(S^t | S^{t−1} ∧ π) × P(O^t | S^t ∧ π)]
  Fo:
    P(S^t | S^{t−1} ∧ π) ≡ G(S^t, A • S^{t−1}, Q)
    P(O^t | S^t ∧ π) ≡ G(O^t, H • S^t, R)
Id:
Qu: P(S^T | O^0 ∧ ⋯ ∧ O^T ∧ π)
(13.9)
• The transition model P(S^t | S^{t−1} ∧ π) and the observation model P(O^t | S^t ∧ π) are both specified using Gaussian laws with means that are linear functions of the conditioning variables.
With these hypotheses, and using the recursive formula in Equation 13.4,
it is possible to solve the inference problem analytically to answer the usual
P(S^T | O^0 ∧ ⋯ ∧ O^T ∧ π) question. This leads to an extremely efficient algorithm, which explains the popularity of Kalman filters and the number of their
everyday applications.3
When there are no obvious linear transition and observation models, it is
still often possible, using a first-order Taylor’s expansion, to treat these models
as locally linear. This generalization is commonly called extended Kalman
filters.
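As a concrete (and deliberately minimal) illustration of these hypotheses, here is a one-dimensional Kalman filter sketch in Python. It is not the book's code: the scalar dynamic model (a = 1), observation model (h = 1), and noise variances q, r are arbitrary assumptions playing the roles of A, H, Q, and R above.

```python
def kalman_step(mean, var, observation, a=1.0, q=2.0, h=1.0, r=4.0):
    # Prediction: propagate the Gaussian belief through the linear dynamic model
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Estimation: correct the prediction with the observation model
    gain = pred_var * h / (h * h * pred_var + r)
    new_mean = pred_mean + gain * (observation - h * pred_mean)
    new_var = (1.0 - gain * h) * pred_var
    return new_mean, new_var

mean, var = 0.0, 100.0                  # vague initial belief on the state
for z in [1.2, 1.9, 3.1, 3.8, 5.2]:     # simulated observations
    mean, var = kalman_step(mean, var, z)
    print(f"state estimate: {mean:6.2f}  variance: {var:6.2f}")
```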
A good tutorial by Welch and Bishop may be found on the Web
(http://www.cs.unc.edu/~welch/kalman/). For a more complete mathemat-
ical presentation, one should refer to a report by Barker et al. [1994], but
these are only two sources from the vast literature on this subject.
³ A very popular application of Kalman filters is the GPS (global positioning system). The recursive evaluation of the position explains why the precision is poor when you turn on your GPS and improves rapidly after a while. The dynamic model takes into account the previous position and speed to predict the future position (Equation 13.5), while the observation model confirms (or invalidates) this prediction using the signal coming from the satellites (Equation 13.6).
Sp(π):
  Va: X1, ⋯, XN
  Dc:
    P(X1 ∧ ⋯ ∧ XN | π) = Σ_{m=1}^{M} [αm × P(X1 ∧ ⋯ ∧ XN | πm)]
  Fo:
    For instance:
    P(X1 ∧ ⋯ ∧ XN | πm) ≡ G(X1 ∧ ⋯ ∧ XN, μm, σm)
Id:
Qu:
(13.10)
It should be noted that this is not a valid Bayesian program. In particular,
the decomposition does not have the right form:
P(X1 ∧ ⋯ ∧ XN | π) = P(L1 | π) × ∏_{k=2}^{K} [P(Lk | Rk ∧ π)]   (13.11)
It is, however, a very popular and convenient way to specify distributions
P (X1 ∧ · · · ∧ XN |π), especially when the types of the component distributions
Sp(π):
  Va: X1, ⋯, XN, H
  Dc:
    P(X1 ∧ ⋯ ∧ XN ∧ H | π) = P(H | π) × P(X1 ∧ ⋯ ∧ XN | H ∧ π)
  Fo:
    P(H | π) ≡ Table
    P(X1 ∧ ⋯ ∧ XN | [H = m] ∧ π) ≡ P(X1 ∧ ⋯ ∧ XN | πm)
Id:
Qu:
  P(X1 ∧ ⋯ ∧ XN | π) = Σ_{m=1}^{M} [P([H = m] | π) × P(X1 ∧ ⋯ ∧ XN | πm)]
(13.12)
P(X1 ∧ ⋯ ∧ XN | π) = Σ_{m=1}^{M} [P([H = m] | π) × P(X1 ∧ ⋯ ∧ XN | πm)]   (13.14)
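To illustrate Equation 13.14, here is a small Python sketch (an assumed two-component, one-dimensional Gaussian mixture, not an example from the book) that evaluates the marginal P(X | π) as the table-weighted sum of the component distributions and also recovers the posterior P([H = m] | x).

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# P(H | pi): a probability table over the two components (assumed values)
p_h = [0.3, 0.7]
# Component parameters (mu_m, sigma_m), also assumed
components = [(-2.0, 1.0), (3.0, 1.5)]

def p_x(x):
    # Equation 13.14: weighted sum over the hidden component H
    return sum(p_h[m] * gaussian(x, *components[m]) for m in range(len(p_h)))

def p_h_given_x(x):
    # Bayes' rule: P([H = m] | x) is proportional to P([H = m]) * P(x | pi_m)
    joint = [p_h[m] * gaussian(x, *components[m]) for m in range(len(p_h))]
    z = sum(joint)
    return [j / z for j in joint]

print("P(x = 0.5)     =", round(p_x(0.5), 4))
print("P(H | x = 0.5) =", [round(p, 3) for p in p_h_given_x(0.5)])
```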
Sp(π):
  Va: X1, ⋯, XN
  Dc:
    P(X1 ∧ ⋯ ∧ XN | π)
      = ∏_{m=0}^{M} [e^{−λm × fm(X1 ∧ ⋯ ∧ XN)}]
      = e^{−Σ_{m=0}^{M} [λm × fm(X1 ∧ ⋯ ∧ XN)]}
  Fo:
    f0(X1 ∧ ⋯ ∧ XN) = 1
    f1, ⋯, fM
Id:
Qu:
(13.15)
⟨fm(X1 ∧ ⋯ ∧ XN)⟩ = Σ_{X1∧⋯∧XN} [P(X1 ∧ ⋯ ∧ XN | π) × fm(X1 ∧ ⋯ ∧ XN)]   (13.16)
Sp(π):
  Va: Φ, R1, ⋯, RN
  Dc:
    P(Φ ∧ R1 ∧ ⋯ ∧ RN | π) = P(Φ | π) × ∏_{n=1}^{N} [P(Rn | Φ ∧ π)]
  Fo: any
Id:
Qu: P(Φ | r1 ∧ ⋯ ∧ rN ∧ π)
(13.17)
• Φ is the variable used to describe the phenomenon, while R1, ⋯, RN are the variables encoding the readings of the sensors.
• The decomposition:
  P(Φ ∧ R1 ∧ ⋯ ∧ RN | π) = P(Φ | π) × ∏_{n=1}^{N} [P(Rn | Φ ∧ π)]   (13.18)
may seem peculiar, as the readings of the different sensors are obviously
not independent from one another. The exact meaning of this equation
is that the phenomenon Φ is considered to be the main reason for the
contingency of the readings. Consequently, it is stated that knowing
Φ, the readings Rn are independent. Φ is the cause of the readings
and, knowing the cause, the consequences are independent. Indeed, this
is a very strong hypothesis, far from always being satisfied. However,
it very often gives satisfactory results and has the main advantage of
considerably reducing the complexity of the computation.
• The distributions P (Rn |Φ ∧ π) are called sensor models. Indeed, these
distributions encode the way a given sensor responds to the observed
phenomenon. When dealing with industrial sensors, this is the kind of in-
formation directly provided by the device manufacturer. However, these
distributions may also be identified very easily by experiment.
• The most common question asked of this fusion model is:
P (Φ|r1 ∧ · · · ∧ rN ∧ π) (13.19)
It should be noted that this is an inverse question as the model has been
specified the other way around by giving the distributions P (Rn |Φ ∧ π).
The capacity to answer such inverse questions easily is one of the main
advantages of probabilistic modeling, thanks to Bayes’ rule.
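This inverse fusion question can be answered with a few lines of code. The sketch below (illustrative Python, not the book's implementation; the phenomenon values and the two sensor models are invented) computes P(Φ | r1 ∧ r2) ∝ P(Φ) × P(r1 | Φ) × P(r2 | Φ) on a discretized Φ.

```python
import numpy as np

phi = np.linspace(0.0, 10.0, 101)          # discretized phenomenon (e.g., a distance)
prior = np.ones_like(phi) / phi.size       # P(Phi): uniform prior

def sensor_model(reading, phi_values, sigma):
    # P(reading | Phi): a Gaussian sensor response around the true value (assumed)
    return np.exp(-0.5 * ((reading - phi_values) / sigma) ** 2)

# Two sensor readings of the same phenomenon, with different reliabilities
likelihood_1 = sensor_model(4.2, phi, sigma=1.0)
likelihood_2 = sensor_model(5.0, phi, sigma=0.5)

posterior = prior * likelihood_1 * likelihood_2    # naive fusion: product of sensor models
posterior /= posterior.sum()                       # normalization (the 1/Z term)

print("fused estimate of Phi:", round(phi[np.argmax(posterior)], 2))
```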
13.2.2 Classification
The classification problem may be seen as the same as the sensor fusion
problem just described. Usually, the problem is called a classification problem
when the possible value for Φ is limited to a small number of classes and it is
called a sensor fusion problem when Φ can be interpreted as a “measure.”
A slightly more subtle definition of classification uses one more variable.
In this model, not only is there the variable Φ, used to merge the information,
but there is C, used to classify the situation. C has far fewer values than Φ
and it is possible to specify P (Φ|C), which, for each class, makes the possible
values of Φ explicit. Answering the classification question P (C|r1 ∧ · · · ∧ rN )
supposes a summation over the different values of Φ.
The Bayesian program then obtained is as follows:
Sp(π):
  Va: C, Φ, R1, ⋯, RN
  Dc:
    P(C ∧ Φ ∧ R1 ∧ ⋯ ∧ RN | π) = P(C | π) × P(Φ | C ∧ π) × ∏_{n=1}^{N} [P(Rn | Φ ∧ π)]
  Fo: any
Id:
Qu: P(C | r1 ∧ ⋯ ∧ rN ∧ π)
(13.20)
Such models are typically used for sequence recognition, which is why the most common question asked of them is P(S^1 ∧ ⋯ ∧ S^{T−1} | S^T ∧ O^0 ∧ ⋯ ∧ O^T ∧ π) (see Equation 13.8).
Sp(π):
  Va: S^0, ⋯, S^T, A^0, ⋯, A^T, O^0, ⋯, O^T
  Dc:
    P(S^0 ∧ ⋯ ∧ O^T | π)
      = P(S^0 ∧ O^0 | π) × ∏_{t=0}^{T} [P(A^t | π)]
        × ∏_{t=1}^{T} [P(S^t | S^{t−1} ∧ A^{t−1} ∧ π) × P(O^t | S^t ∧ π)]
  Fo:
Id:
Qu: P(S^T | a^0 ∧ ⋯ ∧ a^{T−1} ∧ o^0 ∧ ⋯ ∧ o^{T−1} ∧ π)
(13.21)
The resulting model is used to answer the question:

P(S^T | a^0 ∧ ⋯ ∧ a^{T−1} ∧ o^0 ∧ ⋯ ∧ o^{T−1} ∧ π)   (13.22)
which estimates the state of the robot, given past actions and observations.
When this state represents the position of the robot in its environment, this
amounts to localization.
A reference for Markov localization and its use in robotics is Thrun’s book
titled Probabilistic Robotics [Thrun et al., 2005].
⟨ Σ_{t=0}^{T} γ^t × R^t ⟩   (13.23)
where γ is a discount factor (less than one), R^t is the reward obtained
at time t, and ⟨·⟩ denotes the mathematical expectation. Given this measure, the
goal of the planning process is to find an optimal mapping from probability
distributions over states to actions (a policy). This planning process, which
leads to intractable computation, is sometimes approximated using iterative
algorithms called policy iteration or value iteration. These algorithms start
with random policies, and improve them at each step until some numerical
convergence criterion is met.
Sp(π):
  Va: S^0, ⋯, S^T, A^0, ⋯, A^T
  Dc:
    P(S^0 ∧ ⋯ ∧ A^T | π)
      = P(S^0 | π) × ∏_{t=0}^{T} [P(A^t | π)] × ∏_{t=1}^{T} [P(S^t | S^{t−1} ∧ A^{t−1} ∧ π)]
  Fo:
Id:
Qu: P(A^0 ∧ ⋯ ∧ A^T | s^0 ∧ s^T ∧ π)
(13.24)
models (in the sense that these models are shared by these issues). In other
words, a few template mathematical constructs, based only on probabilities
and Bayes’ rule, can be applied to a large assortment of problems that have
to be addressed by cognitive systems.
Our purpose, in this section, is to demonstrate these assertions by propos-
ing a step by step inspection of these cognitive problems and, for each of
them, by describing a candidate Bayesian model.4 The book titled Probabilis-
tic Reasoning and Decision Making in Sensory-Motor Systems [Bessière et al.,
2008] proposed much more detail on several of these models. A recent paper
in Science by Tenenbaum et al. [2011] offers an interesting general overview
of this matter.
13.3.1 Ambiguities
Natural cognitive systems are immersed in rich and widely variable envi-
ronments. It would be difficult to assume that such systems apprehend their
environments in all their details, all the time, if only because of limited sensory
or memory capacities. As a consequence, relations between the characteristics
of external phenomena and internal states cannot always be bijections. In
other words, internal states will sometimes be ambiguous with respect to ex-
ternal situations.
⁴ A large part of this section was originally published in Acta Biotheoretica under the title “Common Bayesian models for common cognitive issues” by Colas et al. [2010].
In this expression, P (Φ) is a prior on the phenomenon; that is, the ex-
pectation about the phenomenon before any observation has occurred.
P (S | Φ) is the probability distribution over sensations, given the phe-
nomenon, which is also known as the likelihood of the phenomenon (when
considered not as a probability distribution but as a function of Φ); it
is the direct model.
• The probabilistic question of perception is P (Φ | S), the probability
distribution on the phenomenon, based on a given sensation. This ques-
tion, which is the posterior distribution on the phenomenon after some
observation, is solved by Bayesian inference:
P(Φ | s ∧ π) = [P(Φ | π) × P(s | Φ ∧ π)] / Σ_Φ [P(Φ | π) × P(s | Φ ∧ π)]   (13.27)
13.3.1.3 Discussion
Natural cognitive systems are equipped with a variety of rich sensors: rich,
as they continuously provide several measurements about a given phenomenon
(e.g., multiple cells in the retina), and various, as they can measure differ-
ent manifestations of the same phenomenon (e.g., both hearing and seeing a
falling object). The difficulty arises when several of these measurements are to
be used together to recover characteristics of the phenomenon. We argue that
the concepts of fusion, multimodality, and conflict can generally be cast into
the same model structure. This model is structured around the conditional independence assumption. We also present extensions of this model that alle-
viate this assumption, thus trading model simplicity for expressiveness.
13.3.2.1 Fusion
The corresponding Bayesian model has already been presented several times:
Sp(π):
  Va: Φ, S1, S2, ⋯, SN
  Dc:
    P(Φ ∧ S1 ∧ ⋯ ∧ SN | π) = P(Φ | π) × ∏_{n=1}^{N} [P(Sn | Φ ∧ π)]
  Fo: any
Id:
Qu: P(Φ | s1 ∧ ⋯ ∧ sN ∧ π) ∝ P(Φ | π) × ∏_{n=1}^{N} [P(sn | Φ ∧ π)]
(13.28)
• This model offers the advantage of being much simpler than a complete
model without any independence, as the joint is written as a product
of low dimensional factors. The size of a complete joint probability dis-
tribution is exponential in the number of variables, whereas it is only
linear when assuming naive fusion.
13.3.2.2 Multimodality
Fusion is often considered within a sensory modality, such as vision or
touch. However, it is often beneficial to consider multiple sources of informa-
tion from various sensory modalities when forming a percept in what is known
as multimodality.
The model for multimodality matches the assumptions of the naive Bayesian
fusion model presented above. The main assumption is that the probability
distribution over each sensation is independent of the others given the phe-
nomenon. Some of these models use MLE, while others compute the complete
posterior distribution.
13.3.2.3 Conflicts
When various sources of information are involved, each can sometimes lead
to significantly different individual percepts. In experiments, a conflict arises
when sensations from different modalities induce different behavior compared
with when each modality is observed individually.
The models accounting for conflicts are still naive fusion and maximum
likelihood (with a naive fusion decomposition). Conflicts arise when the prob-
ability distributions corresponding to each sensation are significantly different.
The concept of a conflict is of a similar nature to that of an ill-posed
problem. Both are defined with respect to characteristics of the result. Con-
versely, inversion and fusion are both structural properties of a problem and
its associated model.
questions asked, this new model is only concerned with the part Φ of the
global phenomenon Φ′ , either knowing the ancillary cues, A, or not.
For instance, Landy et al. [1995] introduced a modified weak fusion frame-
work. They tackled the cue combination problem in depth estimation and
argued for the addition of so-called ancillary cues.5 These cues do not provide
direct information on depth but help to assess the reliability of the various cues
that appear in the fusion process. Ancillary cues are the basis for a dynamic
reweighing of the depth cues.
Yuille and Bülthoff [1996] propose the term strong coupling for nonnaive
Bayesian fusion. They consider the examples of shape recovery from shading
and texture, and the coupling of binocular depth cues with monocular cues
obtained from motion parallax.
13.3.2.5 Discussion
Fusion is often seen as a product of models. It can be used when the under-
lying models are defined independently so that they can be combined to form
a shared variable. Each model links this variable to distinct properties, and
the fusion operates on the conjunction (logical product) of these properties.
This idea of fusion as a product of independent models is also mirrored
by the inference process. Indeed, most of the time, the result of the fusion
process is proportional to the product of each individual result obtained by
the underlying models, P(Φ | s1 ⋯ sN) ∝ ∏_{n=1}^{N} P(Φ | sn). However, this may
not be the case, depending on the exact specification of the underlying models.
There are some more complex fusion models that ensure that this inference
product holds (see [Pradalier et al., 2003] and Section 8.4 on fusion with
coherence variables).
It is also interesting to note that, when all the probability distributions in
the product are Gaussian, the resulting probability distribution is also Gaus-
sian. Its mean is a weighted sum of the means of each Gaussian weighted
according to a function of their variance. Therefore, many weighted models
proposed in the literature can be interpreted as Bayesian fusion models with
Gaussian uncertainty. The weights also acquire the meaning of representing
uncertainty, which can sometimes be manipulated.
Finally, a conflict can only occur when it is assumed that a unique object is
to be perceived. When the discrepancy is too large, segmentation occurs, lead-
ing to the perception of separate objects (e.g., perception of transparency).
Likewise, fusion is a process of combining information from different sources
for one object or feature. Therefore, to account for segmentation, there is a
need for more complex models. Such models can deal with either one or mul-
tiple objects, as well as provide a mechanism for deciding whether there is one
or more objects. This theoretical issue is called binding, unity assumption, or
⁵ See Section 7.4.
pairing [Sato et al., 2007], and has received recent attention using hierarchical
Bayesian models.
13.3.3.1 Modularity
Hierarchies rely on the notion of modularity, the simplest instance of which
is the subroutine; that is, the use of part of a model (submodel) within another
model. A model can be seen as a resource that can be exploited by other
models.
The corresponding Bayesian model is the following (see Chapter 9):
Sp(π1):
  Va: A, B
  Dc:
    P(A ∧ B | π1) = P(A | π1) × P(B | A ∧ π1)
  Fo:
    P(B | A ∧ π1) = P(B | A ∧ π2)
Id:
Qu: any
(13.32)
For example, Laskey and Mahoney [1997] propose the network fragments
framework, inspired by object oriented software analysis and design to define
submodules of a global probabilistic model. They apply this tool to military
situation assessment, where a military analyst has to reason on various levels
(basic units, regiment, overall situation, etc.).
Koller and Pfeffer [1997] also propose object oriented Bayesian networks as
a modeling language for Bayesian models based on an object oriented frame-
work.6
Sp(π):
  Va: Y, H, X
  Dc:
    P(Y ∧ H ∧ X | π) = P(Y | π) × P(H | Y ∧ π) × P(X | H ∧ Y ∧ π)
  Fo:
    P(Y | π) ≡ Any
    P(H | Y ∧ π) ≡ Any
    P(X | [H = m] ∧ Y ∧ π) ≡ P(X | Y ∧ πm)
Id:
Qu:
  P(X | y ∧ π) ∝ Σ_{m=1}^{M} [P([H = m] | y ∧ π) × P(X | y ∧ πm)]
(13.34)
• The simplest form is obtained when there is no variable Y and when
⁶ See more on this subject in the FAQ-FAM, Section 16.3, “Bayesian programming versus Bayesian networks.”
P(X | y ∧ π) ∝ Σ_{m=1}^{M} [P([H = m] | y ∧ π) × P(X | y ∧ πm)]   (13.37)
These models can be fit using the general framework of Bayesian model
recognition. Let ∆ = {∆i } be the variables corresponding to the data used
for learning (∆i , a variable for each datum), and let Π be the variable corre-
sponding to the model. Generally, model recognition is performed using the
following decomposition:
P (Π ∆) = P (Π)P (∆ | Π) (13.39)
Bayesian formalism allows for these hyper-parameters in the same way. Let
Λ be the variable representing these hyper-parameters. We write the joint
probability distribution as:
P (Λ Θ ∆ | π ′′ ) = P (Λ | π ′′ )P (Θ | Λ π ′′ )P (∆ | Θ π ′′ ) (13.43)
′′ ′′
where P (Λ | π ) is a prior on hyper-parameters, P (Θ | Λ π ) is the distri-
bution on parameters Θ according to hyper-parameters and P (∆ | Θ π ′′ ) is
the likelihood function, as above. As a result, inference on the parameters is
modified slightly:
P(Θ | δ π″) ∝ [Σ_Λ P(Λ | π″) × P(Θ | Λ π″)] × P(δ | Θ π″)   (13.44)
where nk is the number of times that the observation [∆i = k] has been
made. To a first approximation, this number is proportional to the probability
P([∆i = k] | Π) itself and we obtain, with N the total number of observations:

P(Π | δ) ∝ P(Π) × ∏_{k=1}^{K} P([∆i = k] | Π)^{N × P([∆i = k] | Π)}   (13.46)
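A small Python sketch (illustrative only; the two candidate models and the observation counts are invented) of this kind of model recognition: the posterior over models is proportional to the prior times the likelihood of the observed counts.

```python
import math

# Two candidate models Pi, each a distribution over K = 3 possible observations
models = {
    "pi_1": [0.6, 0.3, 0.1],
    "pi_2": [0.2, 0.4, 0.4],
}
prior = {"pi_1": 0.5, "pi_2": 0.5}

# n_k: how many times each observation value k has been seen
counts = [4, 9, 7]

posterior = {}
for name, probs in models.items():
    # log P(Pi) + sum_k n_k * log P(Delta = k | Pi)
    log_p = math.log(prior[name]) + sum(n * math.log(p) for n, p in zip(counts, probs))
    posterior[name] = math.exp(log_p)

z = sum(posterior.values())
for name in posterior:
    posterior[name] /= z

print(posterior)
```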
For example, Gopnik and Schulz [2004] studied the learning of causal de-
pendencies by young children. The experiments included trying to decide
which objects are “blickets” (imaginary objects that are supposed to illu-
minate a given machine). Some objects are put on a machine that lights
up depending on which objects are placed on it. The patterns of response
were predicted well by a causal Bayesian network, even after adding some
prior knowledge (“blickets are rare”). The learning phase involved selecting
the causal dependency structure that matches the observations among all the
possibilities.
Xu and Garcia [2008] performed an impressive experiment with 8-month-old
children, demonstrating that they are already able to do model recognition.
They are, indeed, able to estimate the proportion of blue and red balls in an
urn from a few samples.
Another example is the application of embedded hidden Markov models.
These are models in which a top-level Bayesian model reasons on nodes that
are themselves Bayesian models. Nefian and Hayes [1999] proposed such a
formalism in the context of face recognition. Each submodel is responsible
for the recognition of a particular feature of the face (forehead, eyes, nose,
mouth, and chin). The global model ensures perception of the facial structure
by recognizing each feature in the correct order. Neal et al. [2003] applied
embedded HMMs to the tracking of 3-D human motion from 2-D tracker
images. The high dimensionality of the problem proved to be less an issue for
their algorithm than it was for the previous approach.
13.3.3.4 Abstraction
Usually, a modeler uses learning in order to select a unique model or set
of parameter values, which is then applied to the problem at hand. That
is, the learning process computes a probability distribution over models (or
their parameters) to be applied, and a decision, based on this probability
distribution, is used to select only one model or parameter set.
Another way to use model recognition is to include it as part of a higher-
level program in order to maintain the uncertainty on the models at the time
of their application. This is called model abstraction.
Let Π be the variable representing the submodels, let ∆ be the variable
representing the data, and let X be the sought-after variable that depends
on the model. The joint probability distribution can be decomposed in the
following way:
P (X ∆ Π) = P (Π)P (∆ | Π)P (X | Π) (13.48)
where P (Π) and P (∆ | Π) are the priors and likelihood functions defined as
before and P (X | Π) describes the influence of the model, Π, on the variable
of interest, X.
The question is concerned with the distribution over X given the data, ∆:

P(X | δ) ∝ Σ_Π [P(Π) × P(δ | Π) × P(X | Π)]   (13.49)
This inference is similar to model recognition except for the factor P (X | Π).
With respect to the question, P (X | δ), the details of the models can be
abstracted.
When applied to classes of models and their parameters (i.e., when X is
Θ), this abstraction model yields the Bayesian model selection (BMS) method.
It can also be used to jointly compute the distribution over joint models and
parameters, using P (Π Θ | ∆) [Kemp and Tenenbaum, 2008].
Diard and Bessière [2008] used abstraction for robot localization. They
defined several Bayesian maps corresponding to various locations in the en-
vironment. Each map is a model of sensorimotor interactions with a part of
the environment. Then, they built an abstracted map based on these models.
In this new map, the location of the robot is defined in terms of the submap
that best fits the observations obtained from the robot’s sensors. The aim
of their abstracted map was to navigate in the environment. Therefore, they
were more interested in the action to be taken than in the actual location.
However, the choice of an action was made with respect to the uncertainty of
the location.
A similar model was also recently applied to the domain of multimodal
perception, under the name of causal inference [Körding et al., 2007; Sato
et al., 2007]. When sensory cues are close, they could originate from a sin-
gle source, and the small spatial discrepancies could be explained away by
noise; on the other hand, when cues are largely separated, they more prob-
ably originate from distinct sources, instead, and their spatial positions are
not correlated. The optimal strategy, when estimating the positions of these
cues, is then to have both alternative models coexist and to integrate over the
number of sources during the final estimation.
13.3.4 Loops
It would be hard to assume that natural cognitive systems process their
complex sensory systems in a single direction, uniquely from sensory input
toward motor outputs. Indeed, neurophysiology highlights a variety of ways
that neural system activity can be fed back in loops. Models that include loops
are mainly temporal in nature and deal with memory systems. Examples are
mostly taken from models of artificial systems, and especially from the robotics
community.
Sp(π):
  Va: S^0, ⋯, S^T, O^0, ⋯, O^T
  Dc:
    P(S^0 ∧ ⋯ ∧ S^T ∧ O^0 ∧ ⋯ ∧ O^T | π)
      = P(S^0 ∧ O^0) × ∏_{t=1}^{T} [P(S^t | S^{t−1}) × P(O^t | S^t)]
  Fo:
    P(S^0 ∧ O^0)
    P(S^t | S^{t−1})
    P(O^t | S^t)
Id:
Qu: P(S^{t+k} | O^0 ∧ ⋯ ∧ O^t)
  (k = 0) ≡ Filtering
  (k > 0) ≡ Prediction
  (k < 0) ≡ Smoothing
(13.50)
Sp(π):
  Va: S^0, ⋯, S^T, A^0, ⋯, A^T, O^0, ⋯, O^T
  Dc:
    P(S^0 ∧ ⋯ ∧ O^T | π)
      = P(S^0 ∧ O^0 | π) × ∏_{t=0}^{T} [P(A^t | π)]
        × ∏_{t=1}^{T} [P(S^t | S^{t−1} ∧ A^{t−1} ∧ π) × P(O^t | S^t ∧ π)]
  Fo:
Id:
Qu: P(S^T | a^0 ∧ ⋯ ∧ a^{T−1} ∧ o^0 ∧ ⋯ ∧ o^{T−1} ∧ π)
(13.51)
Incorporating such knowledge into models makes them less generic and
more sophisticated. Figure 13.2 shows the joint distribution of the full model
from Koike [2008].
FIGURE 13.2: Joint probability factorization for the full model as designed
by Koike [2005].
13.3.4.4 Discussion
There is no clear definition of a loop. In this section, we have only presented
the definition and examples of temporal loops. These loops can be compared
to loops in the field of computer science, occurring when the execution flow
passes several times through the same set of instructions. Such instructions are
specified once for all the executions of the loop, and the global program is its
replication through time.
This replication often occurs with fixed time spans. However, in biologi-
cal systems, multiple loops may take place simultaneously with different and
sometimes varying time constants. In robotics, many processes are run concur-
rently with different levels of priority. There is a need in Bayesian modeling for
a proper way of integrating and synchronizing loops with different time scales.
Finally, loops can also be considered without reference to time. Bayesian fil-
ters are a single model that is replicated at each time step, with an optional
temporal dependency on preceding time steps. Models have also been pro-
posed for spatial replication of models, with dependencies occurring over a
neighborhood. One interesting difference is that temporal relations between
instances are oriented according to the passage of time, whereas models of
spatial loops, such as the Markov random field, rely on a symmetrical relation
between neighbors.
Chapter 14
Bayesian Inference Algorithms
Revisited
“Five to one against and falling?” she said, “four to one against
and falling...three to one...two...one...probability factor of one to
one...we have normality, I repeat we have normality.” She turned
her microphone off, then turned it back on, with a slight smile and
continued:“Anything you still can’t cope with is therefore your
own problem.”
The Hitchhiker’s Guide to the Galaxy
Douglas Adams [1995]
where the first equality results from the marginalization rule (12.11), the sec-
ond results from the conjunction rule (12.7), and the third corresponds to a
second application of the marginalization rule.
The denominator appears to be a normalization term. Consequently, by
convention, we will either replace it with Z or write a proportional equation
(∝) instead of an equality one (=).
Finally, it is possible to replace the joint distribution by its decomposition
(14.1).
to produce a new expression requiring less computation that gives the same
result or a good approximation of it?
Section 14.2 presents the different possibilities. We will see that these
symbolic computations can be either exact (Section 14.2.1) or approximate
(Section 14.2.2), in which case they lead to an expression that, while not
mathematically equal to Equation 14.4, should be close enough.
Once simplified, the expression obtained is used to compute:
P (Searched|known ∧ δ ∧ π) (14.5)
(on the search space defined by F ree), where most of the probability density is
concentrated and which mostly contribute to the sum. Finally, marginalizing
in a high-dimensional space appears to be a very similar problem to searching
the modes in a high-dimensional space.
to obtain another expression to compute the same result with far fewer el-
ementary operations (sum and product). It is called symbolic computation
because this can be done independently of the possible numerical values of
the considered variables.
We will present these different algorithms as pure and simple algebraic
manipulations of expression 14.8 above, even if most of them have been his-
torically proposed from different points of view (especially in the form of
manipulation of graphs and message passing along their arcs).
P (X1 ∧ X2 ∧ · · · ∧ X9 )
= P (X1 ) × P (X2 |X1 ) × P (X3 |X1 ) × P (X4 |X2 ) × P (X5 |X2 ) (14.9)
×P (X6 |X3 ) × P (X7 |X3 ) × P (X8 |X6 ) × P (X9 |X6 )
and we see that Σ_{X9} [P(X9 | X6)] vanishes as it sums to one. We obtain:
However, further rearranging the order of the sums in Equation 14.13 may
lead to more gains.
First, P(X1) may be factorized out of the sum:

P(X1 | x5 ∧ x7) ∝ P(X1) × Σ_{X2∧X3} [P(X2 | X1) × P(X3 | X1) × P(x5 | X2) × P(x7 | X3)]   (14.14)
P(X1 | x5 ∧ x7) ∝ P(X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)] × Σ_{X3} [P(X3 | X1) × P(x7 | X3)]   (14.16)
P(X2 | x5 ∧ x7) ∝ P(x5 | X2) × Σ_{X1} [P(X1) × P(X2 | X1) × Σ_{X3} [P(X3 | X1) × P(x7 | X3)]]   (14.17)
P (Lk |Rk ∧ δ ∧ π)
Each of these questions is called a belief. The given value of known is called
the evidence.
We return to the example of the previous section. The family of interesting
questions is:
P(X1 | x5 ∧ x7)
P(X2 | x5 ∧ x7)
P(X3 | x5 ∧ x7)
P(X4 | x5 ∧ x7)
P(X6 | x5 ∧ x7)
P(X8 | x5 ∧ x7)
P(X9 | x5 ∧ x7)
(14.22)
Using the simplification scheme of the previous section for each of these
seven questions, we obtain:
P(X1 | x5 ∧ x7) ∝ P(X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)] × Σ_{X3} [P(X3 | X1) × P(x7 | X3)]   (14.23)
P(X2 | x5 ∧ x7) ∝ P(x5 | X2) × Σ_{X1} [P(X1) × P(X2 | X1) × Σ_{X3} [P(X3 | X1) × P(x7 | X3)]]   (14.24)
P(X3 | x5 ∧ x7) ∝ P(x7 | X3) × Σ_{X1} [P(X1) × P(X3 | X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)]]   (14.25)
P(X4 | x5 ∧ x7) ∝ Σ_{X2} [P(X4 | X2) × P(x5 | X2) × Σ_{X1} [P(X1) × P(X2 | X1) × Σ_{X3} [P(X3 | X1) × P(x7 | X3)]]]   (14.26)
P(X6 | x5 ∧ x7) ∝ Σ_{X3} [P(X6 | X3) × P(x7 | X3) × Σ_{X1} [P(X1) × P(X3 | X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)]]]   (14.27)
P(X8 | x5 ∧ x7) ∝ Σ_{X6} [P(X8 | X6) × Σ_{X3} [P(X6 | X3) × P(x7 | X3) × Σ_{X1} [P(X1) × P(X3 | X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)]]]]   (14.28)
P(X9 | x5 ∧ x7) ∝ Σ_{X6} [P(X9 | X6) × Σ_{X3} [P(X6 | X3) × P(x7 | X3) × Σ_{X1} [P(X1) × P(X3 | X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)]]]]   (14.29)
1. Step 0: First, P(x5 | X2) and P(x7 | X3), which appear everywhere, can be computed immediately.
2. Step 1: Then Σ_{X2} [P(X2 | X1) × P(x5 | X2)] and Σ_{X3} [P(X3 | X1) × P(x7 | X3)] can be computed directly.
3. Step 2: The first question P(X1 | x5 ∧ x7) can then be computed:
P(X1 | x5 ∧ x7) ∝ P(X1) × Σ_{X2} [P(X2 | X1) × P(x5 | X2)] × Σ_{X3} [P(X3 | X1) × P(x7 | X3)]   (14.30)
4. Step 3: Then the two questions P(X2 | x5 ∧ x7) and P(X3 | x5 ∧ x7) can be solved:
P(X2 | x5 ∧ x7) ∝ P(x5 | X2) × Σ_{X1} [ (P(X2 | X1) × P(X1 | x5 ∧ x7)) / (Σ_{X2} [P(X2 | X1) × P(x5 | X2)]) ]   (14.31)
P(X3 | x5 ∧ x7) ∝ P(x7 | X3) × Σ_{X1} [ (P(X3 | X1) × P(X1 | x5 ∧ x7)) / (Σ_{X3} [P(X3 | X1) × P(x7 | X3)]) ]   (14.32)
5. Step 4: The next two expressions, P(X4 | x5 ∧ x7) and P(X6 | x5 ∧ x7), can be deduced directly from the two previous ones as:
P(X4 | x5 ∧ x7) ∝ Σ_{X2} [P(X4 | X2) × P(X2 | x5 ∧ x7)]   (14.33)
P(X6 | x5 ∧ x7) ∝ Σ_{X3} [P(X6 | X3) × P(X3 | x5 ∧ x7)]   (14.34)
6. Step 5: Finally, the last two questions, P(X8 | x5 ∧ x7) and P(X9 | x5 ∧ x7), can be deduced in the same way from P(X6 | x5 ∧ x7):
P(X8 | x5 ∧ x7) ∝ Σ_{X6} [P(X8 | X6) × P(X6 | x5 ∧ x7)]   (14.35)
P(X9 | x5 ∧ x7) ∝ Σ_{X6} [P(X9 | X6) × P(X6 | x5 ∧ x7)]   (14.36)
Pearl [Pearl, 1988] under the name of Belief Propagation, and by Lauritzen
and Spiegelhalter [Lauritzen and Spiegelhalter, 1988; Lauritzen, 1996] as the
Sum–Product algorithm.
When the graph associated with the Bayesian network has no undirected
cycles,2 it is always possible to find this ordering, ensuring that each sub-
expression is evaluated once and only once.
On the other hand, when the graph of the Bayesian net has some undi-
rected cycles the situation is trickier and such a clever ordering of the compu-
tation may not be found.
For instance, let us modify the above example by adding a dependency
between X2 and X3 . We then obtain the new decomposition:
P (X1 ∧ X2 ∧ · · · ∧ X9 )
= P (X1 ) × P (X2 |X1 ) × P (X3 |X2 ∧ X1 ) × P (X4 |X2 ) × P (X5 |X2 )
×P (X6 |X3 ) × P (X7 |X3 ) × P (X8 |X6 ) × P (X9 |X6 )
(14.37)
which corresponds to the graph of Figure 14.3 below.
Applying the simplification rules to the different questions, we obtain:
P(X1 | x5 ∧ x7) ∝ P(X1) × Σ_{X2∧X3} [P(X2 | X1) × P(X3 | X2 ∧ X1) × P(x5 | X2) × P(x7 | X3)]   (14.38)
² It is either a tree or a polytree.
P(X2 | x5 ∧ x7) ∝ P(x5 | X2) × Σ_{X1∧X3} [P(X1) × P(X2 | X1) × P(X3 | X2 ∧ X1) × P(x7 | X3)]   (14.39)
P(X3 | x5 ∧ x7) ∝ P(x7 | X3) × Σ_{X1∧X2} [P(X1) × P(X2 | X1) × P(X3 | X2 ∧ X1) × P(x5 | X2)]   (14.40)
The four other cases are unchanged relative to these three (see Equations
14.33, 14.34, 14.35, and 14.36).
Obviously, the different elements appearing in these three expressions may
not be neatly separated as in the previous case. The conjunction of variables
X1 ∧ X2 ∧ X3 must be considered as a whole: they form a new variable A =
X1 ∧ X2 ∧ X3 . The decomposition (Equation 14.37) becomes:
P(X1 ∧ X2 ∧ ⋯ ∧ X9) ∝ P(A) × P(X4 | A) × P(X5 | A) × P(X6 | A) × P(X7 | A) × P(X8 | X6) × P(X9 | X6)   (14.41)
This corresponds to the graph in Figure 14.4 below, which again has a tree
structure.
We have recreated the previous case, where the message-passing algorithms
may be applied. However, this has not eliminated our troubles completely,
because to compute P (X1 |x5 ∧ x7 ), P (X2 |x5 ∧ x7 ), and P (X3 |x5 ∧ x7 ), we
shall now require marginalization of the distribution P (A|x5 ∧ x7 ):
P(X1 | x5 ∧ x7) ∝ Σ_{X2∧X3} [P(A | x5 ∧ x7)]   (14.42)
P(X2 | x5 ∧ x7) ∝ Σ_{X1∧X3} [P(A | x5 ∧ x7)]   (14.43)

P(X3 | x5 ∧ x7) ∝ Σ_{X1∧X2} [P(A | x5 ∧ x7)]   (14.44)
P(Searched | known ∧ δ ∧ π) = (1/Z) × Σ_{Free} [ P(L1 | δ ∧ π) × ∏_{k=2}^{K} [P(Lk | Rk ∧ δ ∧ π)] ]   (14.45)

is transformed into:

Max_{Searched} [P(Searched | known ∧ δ ∧ π)] = Max_{Searched} [ P(L1 | δ ∧ π) × ∏_{k=2}^{K} [P(Lk | Rk ∧ δ ∧ π)] ]   (14.46)
The distributive law applies to the couple (Max, ∏) in the same way
as it applies to the couple (Σ, ∏). Consequently, most of the previous
simplifications are still valid with this new couple of operators.
The sum-product algorithm becomes the max-product algorithm, or more
commonly, the min-sum algorithm, as it may be further transformed by op-
erating on the inverse of the logarithm [MacKay, 2003].
It is also known as the Viterbi algorithm [Viterbi, 1967] and it is partic-
ularly used with hidden Markov models (HMMs) to find the most probable
series of states that lead to the present state, knowing the past observations
as stated in the Bayesian program in Equation 13.7.
[Figure: a Bayesian network over the variables C0, C1, ⋯, C(4d+5) and B, whose depth is parameterized by d.]
P(B | a) =
  Σ_{C3} [ P(C3 | a) ×
    Σ_{C2} [ P(C2 | a) ×
      Σ_{C7} [ P(C7 | C2 ∧ C3) ×
        Σ_{C1} [ P(C1 | a) ×
          Σ_{C6} [ P(C6 | C1 ∧ C2) ×
            Σ_{C9} [ P(C9 | C6 ∧ C7) ×
              Σ_{C0} [ P(C0 | a) ×
                Σ_{C5} [ P(C5 | C0 ∧ C1) ×
                  Σ_{C4} [ P(C4 | C0) ×
                    Σ_{C8} [ P(C8 | C4 ∧ C5) × P(B | C8 ∧ C9) ] ] ] ] ] ] ] ] ] ]
(14.47)
P(B | a) =
  Σ_{C3} [ P(C3 | a) ×
    Σ_{C2} [ P(C2 | a) ×
      Σ_{C1} [ P(C1 | a) ×
        Σ_{C0} [ P(C0 | a) ×
          Σ_{C7} [ P(C7 | C2 ∧ C3) ×
            Σ_{C6} [ P(C6 | C1 ∧ C2) ×
              Σ_{C9} [ P(C9 | C6 ∧ C7) ×
                Σ_{C5} [ P(C5 | C0 ∧ C1) ×
                  Σ_{C4} [ P(C4 | C0) ×
                    Σ_{C8} [ P(C8 | C4 ∧ C5) × P(B | C8 ∧ C9) ] ] ] ] ] ] ] ] ] ]
(14.49)
Using this ordering, the number NC of arithmetic operations (additions
and multiplications) required to compute P(B | a) and the number NU
required to update this table for a new evidence value a′ are, respectively:
Notice that the constant part of the sum (requiring no update when the evi-
dence value of A changes) is now much higher in the nested loops, which leads
to fewer computations when changing the evidence A, but at a bigger initial
cost.
Let us now assume that n is fixed to 2. Table 14.1 gives the number
of arithmetic operations (additions and multiplications) required to compute
and update P(B | a), for both optimization criteria and for different values of d.
Optimization criterion
        Compilation          Update
d       NC        NU         NC        NU
1       440       248        696       120
2       1112      536        1592      120
3       2072      600        2488      120
6       5016      216        5176      120
10      7064      216        7224      120
100     53144     216        53304     120
300     155544    216        155704    120

TABLE 14.1: Computational Cost for Different Values of d When Using the
“First Compilation Time Minimization” (Left) and “Update Time Minimization”
(Right) Criteria (The results of the chosen optimization criterion are shown in bold.)
the variational method boils down to the mean field approximation. Minimiz-
ing F (Q, P ) is greatly simplified using the acyclic graph structure of P (X).
These approaches have been used successfully in a considerable number of
specific models where exact inference becomes intractable, that is, when the
graph is highly connected. A general introduction to variational methods may
be found in introductory texts by Jordan [1999], Jordan and Weiss [2002], and
Jaakkola and Jordan [1999].
The first class groups the sampling-based techniques, while the second con-
cerns the variational methods.
Sampling-based (or Monte Carlo) approaches for approximate Bayesian
inference group together several stochastic simulation techniques that can be
applied to solve optimization and numerical integration problems in large-
dimensional spaces. Since their introduction in the physics literature in the
1950s, Monte Carlo methods have been at the center of the recent Bayesian
revolution in applied statistics and related fields [Geweke, 1996]. They are
applied in numerous other fields such as, for instance, image synthesis [Keller,
1996], CAD modeling [Mekhnacha et al., 2001], and mobile robotics [Dellaert
et al., 1999; Fox et al., 1999].
The aim of this section is to present some of the most popular sampling-
based techniques and their use in the problem of approximate Bayesian infer-
ence.
Forward sampling consists of drawing values for the variables one by one, following an ancestral ordering, from P(Xj | pa(Xj)), where pa(Xj) are the parents of Xj, for which values have already been drawn.
Suppose for example that we are interested in drawing a point from the distribution P(X1 X2) = P(X1) P(X2 | X1), where P(X1) and P(X2 | X1) are simple distributions for which direct sampling methods are available. Drawing a point x^(i) = (x1^(i), x2^(i)) from P(X1 X2) using forward sampling consists of: (i) drawing x1^(i) from P(X1), then (ii) drawing x2^(i) from P(X2 | X1 = x1^(i)).
This sampling scheme may be used when no evidence values are available
or when evidence concerns only the conditioning (right side) variables.
When evidence on the conditioned (left side) variables is available, forward
sampling may also be used by introducing rejection of samples that are not
consistent with the evidence. In this case, this algorithm may be very inefficient
(have a high rejection rate) for evidence values with small probabilities of
occurrence. Moreover, applying this algorithm is impossible when evidence
concerns continuous variables.
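A tiny illustrative Python sketch of forward sampling on the two-variable example (both distributions are invented for the illustration):

```python
import random

def sample_x1():
    # P(X1): an assumed simple prior over two values
    return "a" if random.random() < 0.3 else "b"

def sample_x2_given_x1(x1):
    # P(X2 | X1): an assumed conditional distribution
    p_true = 0.9 if x1 == "a" else 0.2
    return random.random() < p_true

def forward_sample():
    x1 = sample_x1()                 # (i) draw x1 from P(X1)
    x2 = sample_x2_given_x1(x1)      # (ii) draw x2 from P(X2 | X1 = x1)
    return x1, x2

samples = [forward_sample() for _ in range(10000)]
print("estimated P(X2 = true) =", sum(x2 for _, x2 in samples) / len(samples))
```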
1. draw a point x_q from the proposal distribution Q(X),
2. evaluate c × Q(x_q),
3. generate a uniform random value u in [0, c × Q(x_q)],
4. if P(x_q) > u then the point x_q is accepted; otherwise, the point is rejected.
It is clear that this rejection sampling is efficient only if the distribution Q(X)
is a good approximation of P(X). Otherwise, the rejection rate will be very high.
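The corresponding loop is only a few lines of Python (a generic sketch under the assumption that c × Q(x) upper-bounds the unnormalized target P(x); the target and proposal used here are invented):

```python
import math
import random

def p_unnormalized(x):
    # Target P(x): an unnormalized bimodal density (assumed for the illustration)
    return math.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * math.exp(-0.5 * (x + 2.0) ** 2)

# Proposal Q(x): uniform on [-6, 6], so Q(x) = 1/12 there
q_density = 1.0 / 12.0
c = 20.0   # chosen so that c * Q(x) >= p_unnormalized(x) on the support

def rejection_sample():
    while True:
        x_q = random.uniform(-6.0, 6.0)         # draw from Q
        u = random.uniform(0.0, c * q_density)  # uniform in [0, c * Q(x_q)]
        if p_unnormalized(x_q) > u:             # accept or reject
            return x_q

samples = [rejection_sample() for _ in range(5000)]
print("sample mean:", sum(samples) / len(samples))
```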
Î = (1/N) × Σ_i P(x^(i)) g(x^(i)),

where {x^(i)}_{i=1}^{N} are randomly drawn in the integration space.
Because high-dimensional probability distributions are often concentrated
on a small region T of the state (integration) space, known as its “typical set”
[MacKay, 1996, 2003], the number N of points drawn uniformly from the state
(integration) space must be sufficiently large to cover the region T containing
most of the probability mass of P(X).
Instead of exploring the integration space uniformly, Monte Carlo methods
try to use the information provided by the distribution P(X) to explore this
space more efficiently. The main idea of these techniques is to approximate
the integral 14.53 by estimating the expectation of the function g(X) under
the distribution P(X):

I = ∫ P(X) g(X) d^k X = ⟨g(X)⟩.

This Monte Carlo method assumes the capacity to sample the distribution
P(X) efficiently. It is called “perfect Monte Carlo integration”. A good
survey of Monte Carlo sampling techniques can be found in Neal [1993].
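For instance, a perfect Monte Carlo estimate of ⟨g(X)⟩ in Python, assuming (as the method requires) that we can sample P(X) directly; here P is taken to be a standard normal and g(x) = x², which are arbitrary choices for the illustration:

```python
import random

def sample_p():
    # Direct sampling from P(X), assumed available (here a standard normal)
    return random.gauss(0.0, 1.0)

def g(x):
    return x * x

N = 100000
# Perfect Monte Carlo integration: I is approximated by the mean of g(x_i) with x_i ~ P(X)
estimate = sum(g(sample_p()) for _ in range(N)) / N
print("estimate of <g(X)> =", round(estimate, 3), "(exact value: 1.0)")
```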
P(A ∧ B ∧ C ∧ D ∧ E ∧ F) = P(A) P(B) P(C | A ∧ B) P(D | B) P(E | C) P(F | D)   (14.55)
where {c^(j)}_{j=1}^{N_C} and {d^(k)}_{k=1}^{N_D} are generated from P(C | a b) and P(D | b)?
More generally, is it more efficient to use the sum/product evaluation tree
built using the SRA algorithm (see Section 14.2.1.2) to estimate the integrals
More generally, is it more efficient to use the sum/product evaluation tree
built using the SRA algorithm (see Section 14.2.1.2) to estimate the integrals
(sums) using Monte Carlo approximation?
To answer this question, we must consider error propagation in the estima-
tion of intermediate terms and the convergence of this estimation. In ProBT,
we use Equation 14.57 rather than 14.58 to estimate integrals (sums). In other
words, no elimination ordering is done. This choice is motivated as follows:
• It is more efficient to use the estimator in Equation 14.57 to avoid error
propagation.
• Monte Carlo methods for integral estimation perform better in high-
dimensional spaces [Neal, 1993].
ProBT allows two ways to control the cost/accuracy of the estimate.
The first way is to specify the number of sample points to be used for
estimating the integral. This allows the user to express constraints on the
computational cost and on the required accuracy of the estimate. This param-
eter (i.e., the number of sample points) is also used internally in the MCSEM
algorithm (see Section 14.4.3) for a posteriori distributions optimization.
MAP is known to be a very hard problem. Its complexity has been investigated
in Park [2002], and some approximation methods have been proposed
for discrete Bayesian networks. However, we think that the continuous case is
harder and needs better adapted algorithms.
For general purpose Bayesian inference problems, the optimization method
to be used must satisfy a set of criteria in relation to the shape and nature of
the objective function (target distribution) to optimize. The method must:
14.4.3.4 Initialization
The population of the GA is initialized at random from the search space. To
minimize computing time in this initialization phase, we use a small number
N0 of points to estimate integrals. We propose the following algorithm as an
automatic initialization procedure for the initial temperature T0 , able to adapt
to the complexity of the problem.
INITIALIZATION()
  FOR each population[i] DO
    REPEAT
      population[i] = random(Space)
      value[i] = E^T_{N0}(population[i])
      IF (value[i] == 0.0) THEN T = T + ∆T
    UNTIL (value[i] > 0.0)
  END FOR
  Reevaluate population()
TEMPERATURE REDUCTION()
  WHILE (T > T_ε) DO
    FOR i = 1 TO nc1 DO
      Run GA()
    END FOR
    T = T * α
    Reevaluate population()
  END WHILE
  T = 0.0
  Reevaluate population()
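For intuition, the following is a schematic plain-Python rendering of this annealing loop on a toy one-dimensional objective; the "Run GA()" step is reduced to a single mutation-and-selection move, and every value used (α, T_ε, nc1, the population size, the objective itself) is an illustrative placeholder rather than what MCSEM actually uses.

import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    # Toy multimodal objective standing in for the estimated distribution.
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 3.0) ** 2)

pop = rng.uniform(-10, 10, size=50)       # INITIALIZATION: random population
T, T_eps, alpha, nc1 = 5.0, 0.01, 0.8, 10

def run_ga_step(pop, T):
    """One mutation/selection move: accept the child with a Boltzmann-like
    probability at temperature T, otherwise keep the parent."""
    children = pop + rng.normal(0.0, 1.0, size=pop.shape)
    f_pop, f_child = objective(pop), objective(children)
    accept = rng.random(pop.shape) < np.exp((f_child - f_pop) / T)
    return np.where(accept, children, pop)

# TEMPERATURE REDUCTION
while T > T_eps:
    for _ in range(nc1):
        pop = run_ga_step(pop, T)
    T *= alpha                             # T = T * alpha
best = pop[np.argmax(objective(pop))]      # final greedy evaluation
print(best)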
Var(Ê^0_{β×N}(X)) = (1/β) Var(Ê^0_N(X)),

where N_max is the number of points that allows convergence of the estimates
Ê^0_N(X) for all individuals of the population.
This phase of the algorithm is schematized in Figure 14.9.
ProBT also provides an anytime version of the MCSEM algorithm. In this
version, the user is allowed to fix the maximum number of evaluations of the
objective function or the maximum time to be used to maximize it.
A preliminary implementation of the MCSEM algorithm and its use in
high-dimensional inference problems has been presented in Mekhnacha et al.
[2001, 2000] in which this algorithm is used as a resolution module in a prob-
abilistic CAD system.
La Science et l'Hypothèse
Henri Poincaré [1902]

1. Le savant doit ordonner ; on fait la science avec des faits comme une maison avec des pierres ; mais une accumulation de faits n'est pas plus une science qu'un tas de pierres n'est une maison.
Pr
  Ds
    Sp(π)
      Va: Vrot, Dir, Prox
      Dc:
        P(Vrot ∧ Dir ∧ Prox ∧ π)
          = P(Dir ∧ Prox | π)
          × P(Vrot | Dir ∧ Prox ∧ π)
      Fo:
        P(Dir ∧ Prox | π) = Uniform
        P(Vrot | Dir ∧ Prox ∧ π) = B(µ = f(δ), σ = g(δ))
    Id(δ_behavior): µ_{dir,prox} = f(δ_behavior), σ_{dir,prox} = g(δ_behavior)
  Qu: P(Vrot | Dir ∧ Prox ∧ π ∧ δ_behavior)
(15.1)
For this particular identification, k = card(Dir) × card(Prox) probability
distributions P(Vrot)_{1,...,k} are computed from the set of observations
δ_behavior = {vrot_i, dir_i, prox_i : i ∈ {1, ..., l}}. These distributions are indexed
by the values dir and prox. The parameters µ_{dir,prox} and σ_{dir,prox} of each
distribution are obtained by pruning the set {vrot_i, dir_i, prox_i : i ∈ {1, ..., l}} to
obtain a subset {vrot_j, dir_j, prox_j} with dir_j = dir and prox_j = prox, and by
computing the experimental mean and standard deviation:
µ_{dir,prox} = E(vrot_j)
σ²_{dir,prox} = E((vrot_j − µ_{dir,prox})²)
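A minimal NumPy sketch of this identification step, assuming the observation set δ_behavior is given as three equal-length arrays vrot, dir, and prox; for each observed (dir, prox) pair the experimental mean and standard deviation of vrot are computed on the corresponding subset. The data below are synthetic placeholders.

import numpy as np

def identify_gaussians(vrot, dir_, prox):
    """For each observed (dir, prox) pair, compute the experimental mean and
    standard deviation of vrot over the matching subset of the data."""
    params = {}
    for d in np.unique(dir_):
        for p in np.unique(prox):
            subset = vrot[(dir_ == d) & (prox == p)]
            if subset.size > 0:
                params[(d, p)] = (subset.mean(), subset.std())
    return params

# Illustrative data: 1000 readings of (vrot, dir, prox).
rng = np.random.default_rng(4)
dir_ = rng.integers(0, 8, 1000)
prox = rng.integers(0, 4, 1000)
vrot = rng.normal(0.1 * dir_, 0.05 + 0.01 * prox)
params = identify_gaussians(vrot, dir_, prox)
print(len(params), params[(3, 2)])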
This shows that multiple descriptions may be obtained with multiple data
sets; however, many hypotheses remain hidden in this implementation.
Pr
  Ds
    Sp(π0)
      Va: O, O1, ..., ON, Λ
      Dc:
        P(O ∧ O1 ∧ ... ∧ ON ∧ Λ ∧ π0)
          = P(Λ | π0)
          × P(O | Λ ∧ π0) × P(O1 | Λ ∧ π0) × ... × P(ON | Λ ∧ π0)
      Fo:
        P(Λ | π0) = prior knowledge on the distribution of Λ
        P(O | Λ ∧ π0) = prior on the distribution of O
        P(O1 | Λ ∧ π0) = prior on the distribution of O1
        ...
        P(ON | Λ ∧ π0) = prior on the distribution of ON
    Id:
  Qu: P(O | O1 ∧ ... ∧ ON ∧ π0)
(15.2)
P(O | O1 = o1 ∧ ... ∧ ON = oN ∧ π0) is the probability distribution obtained
by instantiating the question of the Bayesian program 15.2 with the
data set δ. It is the result of several modeling choices:
This distribution may also be obtained with another approach: let’s con-
sider the following subprogram.
Pr
  Ds
    Sp(πλ)
      Va: O1, ..., ON, Λ
      Dc:
        P(O1 ∧ ... ∧ ON ∧ Λ ∧ πλ)
          = P(Λ | πλ)
          × P(O1 | Λ ∧ πλ) × ... × P(ON | Λ ∧ πλ)
      Fo:
        P(Λ | πλ) = prior knowledge on the distribution of Λ
        P(O1 | Λ ∧ πλ) = prior on the distribution of O1
        ...
        P(ON | Λ ∧ πλ) = prior on the distribution of ON
    Id:
  Qu: P(Λ | O1 ∧ ... ∧ ON ∧ πλ)
(15.3)
We denote by P(Λ | δ ∧ πλ) = P(Λ | O1 = o1 ∧ ... ∧ ON = oN ∧ πλ) the
probability distribution obtained by instantiating the question of program
15.3 with the previous readings δ. We can now define a simple program:
Pr
  Ds
    Sp(δ ∧ π)
      Va: O, Λ
      Dc:
        P(Λ ∧ O ∧ π) = P(Λ | δ ∧ π) × P(O | Λ ∧ π)
      Fo:
        P(Λ | δ ∧ π): instantiated question of program 15.3
        P(O | Λ ∧ π): prior distribution on O
    Id:
  Qu: P(O | δ ∧ π)
(15.4)
We can show that:
P(O | δ ∧ π) = P(O | O1 = o1 ∧ ... ∧ ON = oN ∧ π0)    (15.5)
On the one hand, the question of the program in Equation 15.3 can be
computed as:
P(Λ | O1 ∧ ... ∧ ON ∧ πλ)
  = [P(Λ | πλ) × P(O1 | Λ ∧ πλ) × ... × P(ON | Λ ∧ πλ)]
    / [Σ_Λ P(Λ | πλ) × P(O1 | Λ ∧ πλ) × ... × P(ON | Λ ∧ πλ)]
(15.6)
while the question of the program in Equation 15.4 can be computed as:
P(O | δ ∧ π) = Σ_Λ P(Λ | δ ∧ π) × P(O | Λ ∧ π)    (15.7)
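To make Equations 15.6 and 15.7 concrete, here is a minimal NumPy sketch for a discretized parameter Λ and binary observations: the posterior over Λ is the normalized product of the prior with the likelihood terms, and the predictive distribution on O is the posterior-weighted sum of P(O | Λ). The grid, the uniform prior, and the readings are illustrative.

import numpy as np

# Discretized parameter Λ = P(O = 1), with a uniform prior (illustrative choice).
lam = np.linspace(0.01, 0.99, 99)
prior = np.full_like(lam, 1.0 / lam.size)

# Observed readings δ = (o1, ..., oN).
delta = np.array([1, 1, 0, 1, 1, 1, 0, 1])

# Equation 15.6: posterior over Λ proportional to prior × product of likelihoods.
likelihood = np.prod(np.where(delta[:, None] == 1, lam, 1.0 - lam), axis=0)
posterior = prior * likelihood
posterior /= posterior.sum()

# Equation 15.7: predictive P(O = 1 | δ) = sum over Λ of posterior × P(O = 1 | Λ).
p_o1 = np.sum(posterior * lam)
print(p_o1)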
Pr
  Ds
    Sp(δ ∧ π)
      Va: O, Λ
      Dc:
        P(O ∧ Λ ∧ π) = P(Λ | δ ∧ π) × P(O | Λ ∧ π)
      Fo:
        P(Λ | δ ∧ π): δ_{µ*}
        P(O | λ ∧ π): N(λ, σ)
    Id:
  Qu: P(O | δ ∧ π)
(15.11)
file = 'C:/Users/mazer/Documents/Publications/\
BPbook/Chapters/chapter15/code/previous_O.csv'
# define the data source, ignoring unknown fields
previous_O = plCSVDataDescriptor(file, O^O_I)
previous_O.ignore_unknown_variables()
# define the type of ML learner
learner_O = plLearn1dNormal(O)
# use the data
i = learner_O.learn_using_data_descriptor(previous_O)
# retrieve the learned distribution
distrib = learner_O.get_distribution()
# print it
print distrib
P (Oi = 1) = λ
P (Oi = 0) = 1 − λ
Pr
  Ds
    Sp(δ ∧ π)
      Va: O, Λ
      Dc:
        P(O ∧ Λ ∧ π) = P(Λ | δ ∧ π) × P(O | Λ ∧ π)
      Fo:
        P(Λ | δ ∧ π) := Beta(α + Σ_{i=1}^N o_i, β + N − Σ_{i=1}^N o_i)
        P(O | Λ ∧ π): Binomial distribution
    Id:
  Qu: P(O | δ ∧ π)
(15.13)
Pr
  Ds
    Sp(δ ∧ π)
      Va: O, Λ
      Dc:
        P(O ∧ Λ ∧ π) = P(Λ | δ ∧ π) × P(O | Λ ∧ π)
      Fo:
        P(Λ | δ ∧ π) := δ_{E(Beta(α + Σ_{i=1}^N o_i, β + N − Σ_{i=1}^N o_i))}
        P(O | Λ ∧ π): Binomial distribution
    Id:
  Qu: P(O | δ ∧ π)
(15.14)
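As a cross-check of programs 15.13 and 15.14, here is a tiny sketch using the standard Beta-Bernoulli conjugate update (assuming the usual Beta(α, β) parameterization): program 15.13 keeps the full Beta posterior and integrates it out to answer P(O | δ), while program 15.14 replaces it by a Dirac at its expectation. For the simple question P(O = 1 | δ) the two coincide; they differ for questions that are nonlinear in Λ.

import numpy as np

o = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # observed Bernoulli readings
alpha, beta = 1.0, 1.0                    # illustrative Beta prior parameters
N, s = o.size, o.sum()

# Program 15.13: exact conjugate posterior P(Λ | δ) = Beta(alpha + s, beta + N - s).
post_a, post_b = alpha + s, beta + N - s

# Predictive P(O = 1 | δ) obtained by integrating over the Beta posterior:
p_full = post_a / (post_a + post_b)

# Program 15.14: replace the posterior by a Dirac at its expectation E[Beta]
# and use that single value λ* in the Binomial form.
lam_star = post_a / (post_a + post_b)
p_dirac = lam_star

print(p_full, p_dirac)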
PrE: Program for the E step

Pr
  Ds
    Sp(π ∧ πi)
      Va: O, Z, Λ
      Dc:
        P(O ∧ Z ∧ Λ | π ∧ πi)
          = P(Λ | πi) × P(O ∧ Z | Λ ∧ π)
      Fo:
        P(Λ | πi): (i)
        P(O ∧ Z | Λ ∧ π): model
    Id:
  Qu: P(Z | O ∧ πi)

PrM: Program for the M step

Pr
  Ds
    Sp(π)
      Va: O, Z, Λ
      Dc:
        P(O ∧ Z ∧ Λ | π)
          = P(Λ | π) × P(O ∧ Z | Λ ∧ π)
      Fo:
        P(Λ | π): (ii)
        P(O ∧ Z | Λ ∧ π): model
    Id:
  Qu: P(Λ | O ∧ Z ∧ π)
(15.15)
Kullback-Leibler-distance = +∞
define P(Λ | π0)   {E step prior}
i = 0
while Kullback-Leibler-distance > ε do
  {E step}
  define PrE with P(Λ | πi) as prior
  infer P(Z | O ∧ πi)
  instantiate with the readings: P(Z | O = δ ∧ πi)
  {M step}
  define PrM with P(Λ | π)   {initial prior}
  infer P(Λ | O ∧ Z ∧ π)
  compute the soft evidence:
    P(Λ | πi+1) = Σ_{z∈Z} P(z | O = δ ∧ πi) × P(Λ | O = δ ∧ Z = z ∧ π)
  i = i + 1
  compute Kullback-Leibler-distance(P(Λ | πi+1), P(Λ | πi))
end while
return P(Λ | πi+1)
The algorithm starts with the E step: a prior distribution on the parameters,
P(Λ | π0), is given to initialize the process. The result of the inference
P(Z | O ∧ πi) is instantiated with the observed data: P(Z | δ ∧ πi) =
P(Z | O = o1 ∧ ... ∧ on ∧ πi). The algorithm then proceeds with the program designed
for the M step. The distribution P(Λ | π) is set to the prior on Λ before
considering any data. The program is used to compute P(Λ | O ∧ Z ∧ π).
In this version, we use soft evidence (Section 8.5) to compute the new prior P(Λ | πi+1)
for the next E step:

P(Λ | πi+1) = Σ_{z∈Z} P(z | O = δ ∧ πi) × P(Λ | O = δ ∧ Z = z ∧ π)
Pr
  Ds
    Sp(π ∧ δ)
      Va: Λ, R, X
      Dc:
        P(R ∧ X ∧ Λ ∧ π ∧ δ)
          = P(Λ | π ∧ δ)
          × P(R ∧ X | Λ ∧ π)
      Fo:
        P(Λ | π ∧ δ) = P(Λ | πk)
        P(R ∧ X | Λ ∧ π): model
    Id:
  Qu: P(X | R ∧ π ∧ δ)
(15.16)
Given an initial template for the EM algorithm we can design further varia-
tions leading to different results, computing times, and convergence properties.
For example, a Dirac distribution δλ∗ may be used in the E step to describe
P (Λ | πi+1 ).
P(Λ | πi+1) = Σ_{z∈Z} P(z | O = δ ∧ πi) × P(Λ | O = δ ∧ Z = z ∧ π)
λ* = arg max_λ P(Λ = λ | πi+1)    (15.17)
P(Λ | πi+1) = δ_{λ*}
Pr
  Ds
    Sp(π)
      Va:
        A1, ..., AN; Ai: observations; N: cardinality of the learning set
        C1, ..., CN; Ci ∈ [1, ..., k]; k: number of classes
        Λ
      Dc:
        P(A1 ∧ C1 ∧ ... ∧ AN ∧ CN ∧ Λ ∧ π)
          = P(Λ | π)
          × P(A1 ∧ C1 | Λ ∧ π) × ... × P(AN ∧ CN | Λ ∧ π)
      Fo:
        P(Λ | π) = prior knowledge on the distribution of Λ
        P(Ai ∧ Ci | Λ ∧ π) = observation model
    Id:
  Qu: P(Λ | A1 ∧ C1 ∧ ... ∧ AN ∧ CN ∧ π)
(15.18)
The distribution on Λ is then given by
P(Λ | δ ∧ π) = P(Λ | A1 = a1 ∧ C1 = c1 ∧ ... ∧ AN = aN ∧ CN = cN ∧ π)    (15.19)
When the class is not observed in the data, it is an instance of the un-
supervised learning problem. Unsupervised learning may be used to classify
multidimensional data, to discretize variables, to approximate a complex dis-
tribution, or to learn with an incomplete data set. For example, we may want
to study the weight of a species. One possible assumption is to consider that
the weight is dependent on the gender of each individual. The EM algorithm
will be used to obtain a classifier (female or male) while only being able to
observe the weight. Let’s describe a version of this classifier.
We consider a population of N individuals. The descriptions will use the
following variables:
• Λ^σ_f, Λ^µ_f are the parameters of the Normal distribution used to represent
the probability on the weights for females.
• Λ^σ_m, Λ^µ_m are the parameters of the Normal distribution used to represent
the probability on the weights for males.
P(O ∧ Z | Λ ∧ π) = P(W_j | C_j ∧ Λ^σ_f ∧ Λ^µ_f ∧ Λ^σ_m ∧ Λ^µ_m ∧ π) =
  C_j = 0: P(W_j | Λ^σ_f ∧ Λ^µ_f) = N(Λ^σ_f, Λ^µ_f)
  C_j = 1: P(W_j | Λ^σ_m ∧ Λ^µ_m) = N(Λ^σ_m, Λ^µ_m)
(15.20)
The prior for the E step, (i), is defined as a Dirac (see Equation 15.17). Its
initial value is set according to some background knowledge: λ*_g = 0.7 and
λ^{µ*}_f < λ^{µ*}_m. The prior for the M step is (ii) P(Λ | π) = Uniform.
The result of this EM algorithm is a set of identified values λ^k_g, Λ^{σk}_f,
Λ^{µk}_f, Λ^{σk}_m, and Λ^{µk}_m, which are the parameters of the distributions on the
gender and on the weights knowing the gender. These parameters may be used
to classify any individual knowing its weight, using the program in Equation 15.21.
Pr
  Ds
    Sp(π ∧ δ)
      Va: C, W
      Dc:
        P(C ∧ W ∧ π ∧ δ)
          = P(C | π ∧ δ)
          × P(W | C ∧ π ∧ δ)
      Fo:
        P(C | π ∧ δ) = Binomial(λ^k_g)
        P(W | C ∧ π ∧ δ):
          C = 0: P(W | C ∧ π ∧ δ) = N(Λ^{σk}_f, Λ^{µk}_f)
          C = 1: P(W | C ∧ π ∧ δ) = N(Λ^{σk}_m, Λ^{µk}_m)
    Id:
  Qu: P(C | W ∧ π ∧ δ)
(15.21)
[Figure: plot of the resulting distribution P(W) as a function of the weight W, for W between 0 and 100.]
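What follows is a compact plain-Python/NumPy sketch of this weight/gender example, using the standard EM updates with point estimates of the parameters (the Dirac variant of Equation 15.17); the synthetic data, the initialization, and the convention that C = 1 denotes males are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)

# Synthetic weights: females ~ N(60, 8), males ~ N(80, 10), in equal proportion.
w = np.concatenate([rng.normal(60, 8, 500), rng.normal(80, 10, 500)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initialization reflecting the background knowledge: lam_g = 0.7 and mu_f < mu_m.
lam_g, mu_f, sig_f, mu_m, sig_m = 0.7, 55.0, 10.0, 85.0, 10.0

for _ in range(100):
    # E step: responsibility P(C = 1 | W = w_j) for every individual.
    pf = (1.0 - lam_g) * normal_pdf(w, mu_f, sig_f)
    pm = lam_g * normal_pdf(w, mu_m, sig_m)
    r = pm / (pf + pm)
    # M step: point estimates (Dirac) of the parameters from the soft assignments.
    lam_g = r.mean()
    mu_m = np.average(w, weights=r)
    sig_m = np.sqrt(np.average((w - mu_m) ** 2, weights=r))
    mu_f = np.average(w, weights=1.0 - r)
    sig_f = np.sqrt(np.average((w - mu_f) ** 2, weights=1.0 - r))

print(lam_g, (mu_f, sig_f), (mu_m, sig_m))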
15.2.2.1 HMM

An HMM indexed by u is defined as Bayesian program 15.22. The variables
O^u_t and S^u_t respectively denote the observations and the states at time t, and
Λ^u_0, Λ^u_M, Λ^u_S are the model parameters. Λ^u_0 is the set of parameters for the
initial condition on states, Λ^u_M is the set of parameters for the sensor model, and
Λ^u_S = Λ^u_1 ∧ Λ^u_2 ∧ ... ∧ Λ^u_j ∧ ... ∧ Λ^u_{a(u)−1}
are the parameters for the state transitions, with P(S_t | S_{t−1} = j) parameterized
by Λ^u_j. The cardinality a(u) of the states S^u_{0,...,T} may vary from one
HMM to another.
Pr
  Ds
    Sp(π^u)
      Va:
        Λ^u_0, Λ^u_M, Λ^u_S,
        S^u_t, ∀t ∈ [0, ..., T]: S^u_t ∈ {1, ..., a(u)},
        O^u_t, ∀t ∈ [1, ..., T]: O^u_t ∈ D
      Dc:
        P(S^u_0 ∧ O^u_1 ∧ ... ∧ S^u_t ∧ O^u_t ∧ ... ∧ S^u_T ∧ O^u_T ∧ Λ^u_0 ∧ Λ^u_M ∧ Λ^u_S | π^u)
          = P(Λ^u_0 ∧ Λ^u_M ∧ Λ^u_S | π^u)
          × P(S^u_0 | Λ^u_0 ∧ π^u)
          × Π_{t ∈ [1...T]} [P(S^u_t | S^u_{t−1} ∧ Λ^u_S ∧ π^u) × P(O^u_t | S^u_t ∧ Λ^u_M ∧ π^u)]
      Fo:
        P(Λ^u_0 ∧ Λ^u_M ∧ Λ^u_S | π^u)
        P(S^u_0 | Λ^u_0 ∧ π^u) = multinomial
        P(S^u_t | S^u_{t−1} ∧ Λ^u_S ∧ π^u) = multinomial
        P(O^u_t | S^u_t ∧ Λ^u_M ∧ π^u) = sensor model
    Id: P(Λ^u_0 ∧ Λ^u_M ∧ Λ^u_S | π^u) identified with EM
  Qu: P(S^u_T | O^u_1 ∧ ... ∧ O^u_T)
(15.22)
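To make the question P(S^u_T | O^u_1 ∧ ... ∧ O^u_T) concrete, here is a small NumPy sketch of the standard forward (filtering) recursion for a single discrete HMM with known parameters; the transition and sensor tables below are illustrative and are not the result of the EM identification described in the text.

import numpy as np

# Illustrative HMM with a(u) = 3 states and 2 observation symbols.
p_s0 = np.array([0.6, 0.3, 0.1])                  # P(S_0 | Λ_0)
A = np.array([[0.7, 0.2, 0.1],                    # P(S_t | S_{t-1}): rows = previous state
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.9, 0.1],                         # P(O_t | S_t): rows = state
              [0.5, 0.5],
              [0.1, 0.9]])

def filter_hmm(observations):
    """Forward recursion: returns P(S_T | O_1 ... O_T) for the given sequence."""
    belief = p_s0
    for o in observations:
        predicted = A.T @ belief                  # sum over previous states
        belief = B[:, o] * predicted              # weight by the sensor model
        belief /= belief.sum()                    # normalize
    return belief

print(filter_hmm([0, 0, 1, 1, 1]))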
P(Λ | π) = Uniform
for j = 1 → L do
  set δ = {o^j_1, ..., o^j_{n(j)}, a(u)}
  run EM
  update P(Λ | π)   {use EM as a filter}
end for
return P(Λ | π) as P(Λ | π^u)
The question is used to obtain a probability distribution over the S^u_t given
an observation o^u_t. In turn, it can be used in a prediction step at time t to be
compared with other sequences u′.
Pr
  Ds
    Sp(π′^u_t)
      Va:
        S^u_t ∈ {1, ..., a(u)}
        O^u_t ∈ D
      Dc:
        P(S^u_t ∧ O^u_t ∧ π′^u_t)
          = P(S^u_t | π′^u_t) × P(O^u_t | S^u_t ∧ π′^u_t)
      Fo:
        P(S^u_t | π′^u_t) = P(S^u_t | O^u_t = o_t ∧ π^u_t) (see Equation 15.23)
        P(O^u_t | S^u_t ∧ π′^u_t) = sensor model
    Id:
  Qu: P(O^u_t | π′^u_t)
(15.24)
Pr
  Ds
    Sp(π^u_T)
      Va:
        U_t, U_{t−1} ∈ {1, ..., m}: current and previous sequence
        O_t = ⋃_{u ∈ [1...m]} {O^u_t}: O^u_t ∈ D: current observation vector
      Dc:
        P(U_{t−1} ∧ U_t ∧ O_t ∧ π_t)
          = P(U_{t−1} | π_t) × P(U_t | U_{t−1} ∧ π_t) × P(O_t | U_t ∧ π_t)
      Fo:
        P(U_{t−1} | π_t) = previous estimation
        P(U_t | U_{t−1}) = transition model
        P(O_t | U_t = u ∧ π_t)
          = P(O^u_t | π′^u_t)   [prediction with model u (15.24)]
          × Uniform(O_t − {O^u_t})
    Id:
  Qu: P(U_t | O_t ∧ π^u_T)
(15.25)
Pr
  Ds
    Sp(π_M)
      Va: H ∈ [1, ..., n], I, S1, ..., Sn
      Dc:
        P(H ∧ I ∧ S | π_M)
          = P(I | π_M)
          × P(H | I ∧ π_M)
          × P(S | H ∧ I ∧ π_M)
      Fo:
        P(I | π_M) = Uniform
        P(H | I ∧ π_M) = given or learned with EM
        P(S | H ∧ I ∧ π_M):
          H = 1: P(S | I ∧ π_1)
          ...
          H = i: P(S | I ∧ π_i) = question to model π_i
          ...
          H = n: P(S | I ∧ π_n)
    Id:
  Qu: P(S | I ∧ π_M)
(15.26)
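The following is a small NumPy sketch of the questions attached to program 15.26 for discrete variables: the forecast P(S | I ∧ π_M) is the P(H | I ∧ π_M)-weighted sum of the candidate models' answers, and inverting the same decomposition gives a posterior on H that identifies which model best explains an observed S. All tables are illustrative placeholders.

import numpy as np

# Hypothetical answers of n = 3 candidate models to the question P(S | I = i),
# for a variable S with 4 possible values (rows = models).
p_s_given_model = np.array([[0.70, 0.20, 0.05, 0.05],
                            [0.10, 0.60, 0.20, 0.10],
                            [0.25, 0.25, 0.25, 0.25]])

# P(H | I = i ∧ π_M): weights of the models, given or learned with EM.
p_h = np.array([0.5, 0.3, 0.2])

# Question of the meta-model: P(S | I = i ∧ π_M) = Σ_H P(H | I) P(S | H ∧ I).
p_s = p_h @ p_s_given_model
print(p_s)

# Recognition: P(H | S = s ∧ I = i ∧ π_M) ∝ P(H | I) P(S = s | H ∧ I).
s = 1
post_h = p_h * p_s_given_model[:, s]
post_h /= post_h.sum()
print(post_h)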
To learn P (H | I ∧ πM ), we use the EM algorithm by stating:
• Z=H
• O=S
• Λ = parameters of a multinomial law for H
The analyst has to consider the bias-variance trade-off. In the next section we con-
sider selecting a model among a huge number of possible models: given a set
of variables and a data set on these variables, we select the most appropriate
decomposition.
b(1) = 1
b(n) = Σ_{i=1}^{n} (−1)^{i+1} C(n, i) 2^{i(n−i)} b(n − i) = n^{2^{O(n)}}    (15.27)

where C(n, i) denotes the binomial coefficient.
For example, the number of acyclic graphs for 10 variables is of the order
of 10^18. It may not be necessary to consider all the possibilities since several
descriptions may lead to the same results no matter which data are given as
a learning example: they may belong to the same Markov equivalence class.
For example, if we could equally learn P(A), P(B), P(C), P(A | B),
P(B | A), P(C | B), and P(B | C) from a set of triplets (a_i, b_i, c_i), then the
three decompositions P(C) P(B | C) P(A | B), P(A) P(B | A) P(C | B),
and P(B) P(C | B) P(A | B) will lead to the same joint distribution
P(A ∧ B ∧ C), while the decomposition P(A) P(C) P(B | C ∧ A) will lead
to a different one since P(B | C ∧ A) ≠ P(B | C).
The double exponential number of possible models makes the problem
intractable, and methods such as the model selection algorithm presented in
Section 15.2.3 cannot be applied. The existing algorithms rely on heuristics,
and could be classified into two classes:
We use the work of Leray [2006] to briefly present the main approaches.
where P(x_l ∧ x_k), P(x_l), P(x_k) are computed as histograms from the
data set.
• Laplace mutual information: as the MI measure, but where P(x_l ∧ x_k),
P(x_l), P(x_k) are computed as Laplace laws from the data set.
• Normalized mutual information:

NMI(X_l, X_k) = − MI(X_l, X_k) / H(X_l, X_k)

where the joint entropy H(X_l, X_k) is computed as

H(X_l, X_k) = − Σ_{x_l ∈ X_l} Σ_{x_k ∈ X_k} P(x_l ∧ x_k) log(P(x_l ∧ x_k))
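A short NumPy sketch of these quantities computed from a discrete data set: the joint histogram gives P(x_l ∧ x_k), from which MI, the joint entropy H(X_l, X_k), and the normalized score follow; the sample data are illustrative.

import numpy as np

def nmi_score(xl, xk):
    """Mutual information, joint entropy, and normalized MI from two discrete columns."""
    joint, _, _ = np.histogram2d(xl, xk, bins=[np.unique(xl).size, np.unique(xk).size])
    p_joint = joint / joint.sum()                       # P(xl ∧ xk) as a histogram
    p_l = p_joint.sum(axis=1, keepdims=True)            # P(xl)
    p_k = p_joint.sum(axis=0, keepdims=True)            # P(xk)
    nz = p_joint > 0
    mi = np.sum(p_joint[nz] * np.log(p_joint[nz] / (p_l @ p_k)[nz]))
    h = -np.sum(p_joint[nz] * np.log(p_joint[nz]))      # joint entropy H(Xl, Xk)
    return mi, h, -mi / h                               # NMI as defined above

rng = np.random.default_rng(6)
xl = rng.integers(0, 3, 2000)
xk = (xl + rng.integers(0, 2, 2000)) % 3                # correlated with xl
print(nmi_score(xl, xk))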
BIC(Ds, D) = Σ_{i}^{n} log(P(D_i | Θ ∧ Dc)) − (1/2) dim(Θ) log(N)

In the same spirit, the score based on the minimum description length,
MDL, penalizes descriptions having a large number (N_Dc) of distributions in
their decomposition:

MDL(Ds, D) = Σ_{i}^{n} log(P(D_i | Θ ∧ Dc)) − N_Dc log(N) − c dim(Θ) log(N)
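As a small worked example, the sketch below computes the BIC score of one candidate decomposition, P(A) P(B | A), on binary data, with Θ identified as maximum-likelihood histograms; the data set, the smoothing constant, and the structure itself are illustrative assumptions.

import numpy as np

def bic_score(data_a, data_b):
    """BIC for the decomposition P(A) P(B | A) on binary data, with Θ fitted
    as maximum-likelihood histograms (tiny smoothing avoids log 0)."""
    N = data_a.size
    eps = 1e-9
    p_a = np.bincount(data_a, minlength=2) / N
    p_b_given_a = np.array([
        np.bincount(data_b[data_a == a], minlength=2) / max((data_a == a).sum(), 1)
        for a in (0, 1)])
    loglik = np.sum(np.log(p_a[data_a] + eps) + np.log(p_b_given_a[data_a, data_b] + eps))
    dim_theta = 1 + 2          # one free parameter for P(A), one per value of A for P(B|A)
    return loglik - 0.5 * dim_theta * np.log(N)

rng = np.random.default_rng(7)
a = rng.integers(0, 2, 1000)
b = (a ^ (rng.random(1000) < 0.2)).astype(int)   # B strongly dependent on A
print(bic_score(a, b))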
Many other scores are used to compare one description to another, for example,
Akaike's information criterion (AIC) and the Bayesian Dirichlet equivalent
uniform criterion (BDEU). Once a score has been selected, it is used to navigate
locally in the large space of descriptions.
Pr
  Ds
    Sp(π)
      Va: X1, ..., Xn
      Dc:
        P(X1 ∧ ... ∧ Xn | π)
          = P(X1 | π) × Π_k [P(Xk | X^k_1 ∧ ... ∧ X^k_p ∧ π)]
      Fo:
        P(X1 | π) = Histograms
        P(Xk | X^k_1 ∧ ... ∧ X^k_p ∧ π) = Histograms
    Id:
  Qu:
(15.29)
These operations are ways to walk in the search space, moving from one graph to another
by modifying edges. The chosen score is used to evaluate the quality of a move,
and a decision is made by selecting the best alternative. The algorithm never goes
back and stops when no move leads to a better solution than the current one.
For example, we may consider the following operations on Bayesian networks,
provided they lead to a valid decomposition:
• remove:
P(Xk | X^k_1 ∧ ... ∧ X^k_j ∧ ... ∧ X^k_p) → P(Xk | X^k_1 ∧ ... ∧ X^k_p)
• add:
P(Xk | X^k_1 ∧ ... ∧ X^k_p) → P(Xk | X^k_1 ∧ ... ∧ X^k_j ∧ ... ∧ X^k_p)
• reverse:
P(Xk | X^k_1 ∧ ... ∧ X^k_j ∧ ... ∧ X^k_p) → P(Xk | X^k_1 ∧ ... ∧ X^k_p) P(X^k_j | Xk)
The variations among this type of algorithm are numerous. They can
also be used to explore very small subspaces of the initial search space by
considering only a small subset of the conditional distributions that may be
modified.
The greedy search (GS) algorithm uses the output of the DMST
algorithm to define the initial structure.
learner = plStructureLearner(result_dmst);
score_gs = plNodeScoreBIC_ (dataset);
learner.GS(score_gs);
result_gs = learner.get_joint_distribution(dataset);
Part IV
Frequently Asked
Questions — Frequently
Argued Matters
Chapter 16
Frequently Asked Questions and
Frequently Argued Matters
1. C'est au savant moderne que convient, plus qu'à tout autre, l'austère conseil de Kipling: "Si tu peux voir s'écrouler soudain l'ouvrage de ta vie, et te remettre au travail, si tu peux souffrir, lutter, mourir sans murmurer, tu seras un homme, mon fils." Dans l'oeuvre de la science seulement on peut aimer ce qu'on détruit, on peut continuer le passé en le niant, on peut vénérer son maître en le contredisant.
• Guy Ramel, PhD thesis which proposes a new approach for object
recognition that incorporates visual and range information with the spatial
arrangement between objects (see [Ramel, 2006] (in French) and [Ramel
and Siegwart, 2008]).
• Miriam Amavizca, PhD thesis titled “3D Human Hip Volume Recon-
struction with Incomplete Multimodal Medical Images” (see [Amavizca,
2005](in French) and [Amavizca, 2008]).
• Ronan Le Hy, PhD thesis titled “Playing to Train Your Video Game
Avatar,” where it is demonstrated how a player of an FPS video game
can teach an avatar how to play (see [Le Hy, 2007](in French) and [Le Hy
et al., 2004; Le Hy and Bessière, 2008]).
• Pierre-Charles Dangauthier, PhD thesis titled “Bayesian Learning:
Foundations, Method and Applications,” which deals with different
aspects of learning and especially addresses the automatic selection
and creation of relevant variables to build a model (see [Dangauthier,
2007](in French) and [Dangauthier et al., 2004, 2005, 2007]).
• Shrihari Vasudevan, PhD thesis titled “Spatial Cognition for Mobile
Robots: A Hierarchical Probabilistic Concept-Oriented Representation
of Space” (see [Vasudevan, 2008] and [Vasudevan and Siegwart, 2008]).
• Francis Colas investigated the role of position uncertainty in the pe-
ripheral visual field to guide eye movement saccades (see [Colas et al.,
2009]).
• Jorg Rett, PhD thesis titled “Robot-Human Interface Using Laban
Movement Analysis Inside a Bayesian Framework” (see [Rett, 2008] and
[Rett et al., 2010]).
• Estelle Gilet, PhD thesis titled “Bayesian Modeling of Sensory-Motor
Loop: An Application to Handwriting,” where a Bayesian Action Per-
ception (BAP) model of the reading-writing sensory motor loop is pro-
posed (see [Gilet, 2009](in French) and [Gilet et al., 2011]).
• Xavier Perrin, PhD thesis titled “Semi-Autonomous Navigation of an
Assistive Robot Using Low Throughput Interfaces,” where a Bayesian
strategy to help a disabled person to drive a wheelchair using an EEG
signal is proposed (see [Perrin, 2009] and [Perrin et al., 2010]).
• Joao Filipe Ferreira, PhD thesis titled “Bayesian Cognitive Models for
3D Structure and Motion Multimodal Perception” (see [Ferreira, 2011]
and [Ferreira et al., 2012]).
• Clement Moulin-Frier, PhD thesis titled "Emergence of Articulatory-
Acoustic Systems from Deictic Interaction Games in a 'Vocalize to Localize'
Framework" (see [Moulin-Frier, 2011] (in French) and [Moulin-Frier
et al., 2011, 2012]).
• Gabriel Synnaeve, PhD thesis titled “Bayesian Programming and Learn-
ing for Multi-Player Video Games: Application to RTS AI,” where a
probabilistic model of a “bot” to automatically play Starcraft is pro-
posed (see [Synnaeve, 2012] and also [Synnaeve and Bessière, 2010,
2011a,b,c]).
For an up-to-date description of the industrial applications please consult
the Web site of ProbaYes (http://probayes.com).
As stated in the very first paragraph, the use of computers makes the
difference between “programming” and “modeling”:
(a|c) ∈ R (16.2)
[[(a|c′ ) > (a|c)] ∧ [(b|a ∧ c′ ) = (b|a ∧ c)]] ⇒ [(a ∧ b|c′ ) > (a ∧ b|c)]
(16.4)
Starting from these postulates, Richard T. Cox demonstrated that plausible
reasoning should follow two rules from which the whole theory can be
rebuilt:
Furthermore, this theorem shows that any technique for plausibility calculus
that does not respect these two rules would contradict at least one of the postulates.
2. See an interesting discussion in Appendix A.3, p. 656 of Jaynes' book [2003] arguing that rational numbers are sufficient.
An alternative approach builds probability on measure theory as defined by Emile Borel and Henri-Léon Lebesgue and leads to An-
drey Kolmogorov’s axiomatization of probability. Of course this approach to
probability is of primary importance and has countless applications. However,
it is a very different concept of, and viewpoint on, probability than the epistemological
position adopted in this book, where probability is considered as an
alternative to and an extension of logic (see Jaynes [2003] for an extended discussion
and comparison between his approach and Kolmogorov's).
If a cone rests on its point, we know that it will fall, but we do not
know on which side; it seems that chance alone will decide. If the
cone were perfectly symmetric, if its axis were perfectly vertical,
if it were subject to no force other than gravity, it would not fall
at all. But the slightest defect of symmetry will make it lean
slightly to one side or the other, and as soon as it leans, however
little, it will fall completely on that side. Even with a perfect
symmetry, a very slight tremor, a breath of air, may tilt it by a few
arc seconds; this will be enough to cause its fall and to determine
the direction of the fall, which will be that of the initial inclination.
A very small cause that escapes us determines a considerable
effect that we cannot fail to see, and then we say that this effect is due
to chance. If we knew exactly the laws of nature and the
state of the universe at the initial instant, we could exactly predict
the state of this same universe at a later instant. But, even
if the laws of nature held no more secrets for us, we could know the
initial state only approximately. If this allows us to predict the
later state with the same approximation, that is all we need: the
phenomenon has been predicted, it is governed by laws. However, it is
not always the case; it may happen that small differences in the initial
conditions produce very large ones in the final phenomena. The prediction
becomes impossible and we are facing a fortuitous phenomenon.4
4. Pour trouver une meilleure définition du hasard, il nous faut examiner quelques-uns des
faits qu'on s'accorde à regarder comme fortuits, et auxquels le calcul des probabilités paraît
s’appliquer; nous rechercherons ensuite quels sont leurs caractères communs. Le premier
exemple que nous allons choisir est celui de l’équilibre instable; si un cône repose sur sa
pointe, nous savons bien qu’il va tomber, mais nous ne savons pas de quel côté; il nous
semble que le hasard seul va en décider. Si le cône était parfaitement symétrique, si son axe
était parfaitement vertical, s’il n’était soumis à aucune autre force que la pesanteur, il ne
tomberait pas du tout. Mais le moindre défaut de symétrie va le faire pencher légèrement
d’un côté ou de l’autre, et dès qu’il penchera, si peu que ce soit, il tombera tout à fait de ce
côté. Si même la symétrie est parfaite, une trépidation très légère, un souffle d’air pourra
le faire incliner de quelques secondes d’arc; ce sera assez pour déterminer sa chute et même
le sens de sa chute qui sera celui de l’inclinaison initiale. Une cause très petite, qui nous
échappe, détermine un effet considérable que nous ne pouvons pas ne pas voir, et alors nous
disons que cet effet est dû au hasard. Si nous connaissions exactement les lois de la nature
et la situation de l’univers à l’instant initial, nous pourrions prédire exactement la situation
de ce même univers à un instant ultérieur. Mais, lors même que les lois naturelles n’auraient
plus de secret pour nous, nous ne pourrions connaître la situation qu'approximativement.
Si cela nous permet de prévoir la situation ultérieure avec la même approximation, c’est
tout ce qu’il nous faut, nous disons que le phénomène a été prévu, qu’il est régi par des
lois; mais il n’en est pas toujours ainsi, il peut arriver que de petites différences dans les
conditions initiales en engendrent de très grandes dans les phénomènes finaux; une petite
erreur sur les premières produirait une erreur énorme sur les derniers. La prédiction devient
impossible et nous avons le phénomène fortuit.
Σ_{q=1}^{Q} [n_q] = N    (16.7)

Σ_{q=1}^{Q} [n_q × e_q] = E    (16.8)

where E is the global energy of the system and where e_q is the energy of state q.
If W(ν_k) is the number of permutations of microscopic states that realize
the macroscopic state ν_k,

W(ν_k) = N! / (n_1! × ··· × n_Q!)    (16.9)
For Boltzmann, the most probable macroscopic state is the one that can
be realized by the highest number of possible permutations of microscopic
states. In other words, the macroscopic state that maximizes W (νk ).
Using the Stirling formula to approximate the factorial for large n:
log(n!) = n × log(n) − n + log(√(2πn)) + 1/(12n) + o(1/n²)    (16.10)
we get:

log(W(ν_k)) ≃ −N × Σ_{q=1}^{Q} [(n_q / N) × log(n_q / N)]    (16.11)
and the most probable macroscopic state is the one that maximizes the entropy
yet respects the constraints of Equations 16.7 and 16.8.
then the most probable probability distribution is the one that maximizes the
entropy yet respects the M constraints in Equation 16.13. The most probable
probability distribution is the one that corresponds to the highest number
of possible permutations of the observations compatible with the imposed
constraints.
The Laplace succession law controversy, which has made for an exciting
debate for the last 150 years, is an interesting example of the two different
points of view. Laplace proposed to model a series of experiments using the
following law:

P(x) = (1 + n_x) / (Ω + n)    (16.15)

where n_x is the number of times the value x appears in the series, Ω is the
cardinality of the variable X, and n is the total number of observations in the series.
If the observed series is the life of an individual and the variable X stands
for "the individual survives this year," then Ω = 2 and we get for a 14-year-old
boy a probability of surviving one more year equal to 15/16, whereas for his
75-year-old grandfather we get a probability of 76/77.
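For the record, the numbers above come from a direct application of Equation 16.15 with Ω = 2 and n equal to the number of years already survived; the two calls below reproduce them.

def laplace_succession(n_x, n, omega=2):
    """Laplace succession law: P(x) = (1 + n_x) / (omega + n)."""
    return (1 + n_x) / (omega + n)

# A 14-year-old boy: 14 observed survivals out of 14 years.
print(laplace_succession(14, 14))   # 15/16
# His 75-year-old grandfather: 75 observed survivals out of 75 years.
print(laplace_succession(75, 75))   # 76/77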
Using this kind of argument, the objectivists have been making fun of
Laplace and his succession law, saying that they were both stupid.
The subjectivists' position is that the Laplace succession law is just one model among many possible ones.
5. They even often deny the existence or, at least, the necessity of these observers.
You can find in Jaynes’ book many examples of misuses of probability due
to an objectivist interpretation and especially a review of apparent paradoxes
that can be easily solved with a subjectivist point of view.
Of course we presented here only the two extreme positions, while many
intermediate approaches exist. For instance, a usual definition of "Bayesianism"
refers to probabilists who accept the use of priors as the reasoning subject's
knowledge. Even this position has been largely attacked by objectivists, with
endless discussions on the relevance of the priors used. From a subjectivist
position, the subject is free and takes his own risks when using a given prior.
If he makes a wrong choice, then he will get an inappropriate model.
In this book we went much further in the subjectivist direction. We do not
only use priors but "preliminary knowledge." Priors are limited to the specification
of a few parametric forms to summarize the subject's preliminary knowledge,
whereas, in contrast, preliminary knowledge is made of the whole specification part of
the Bayesian program: (i) the choice of the relevant variables, (ii) the
choice of the decomposition assuming conditional independences, and (iii) the
choice of the parametric forms for each of the distributions appearing in the
decomposition.
A major contribution of this book is precisely this formalization of the
preliminary knowledge which, we hope, has been shown in these pages to be
general and generic enough to model many different problems.
Pr
  Ds
    Sp(π)
      Va: I0, I1, F, S, C, O
      Dc:
        P(I0 ∧ I1 ∧ F ∧ S ∧ C ∧ O)
          = P(I0) × P(I1) × P(F) × P(S | I0 ∧ F)
          × P(C) × P(O | I0 ∧ I1 ∧ S ∧ C)
      Fo:
        P(I0) = Uniform
        P(I1) = Uniform
        P(F) = Uniform
        P(S | I0 ∧ F) = δ_{S = Int((I0 + F)/2)}
        P(C) = Uniform
        P(O | I0 ∧ I1 ∧ S ∧ C) = Histograms
    Id
  Qu:
(16.16)
However, these hypotheses are not always necessary. If, for a given model,
you are sure that some of the variables appearing in your model will always
be known (appearing only on the right side of a question), then you do not
need to specify prior distributions for these variables, as these distributions
will cancel out in the inference by appearing both in the numerator and in the
denominator of the expression that must be computed to solve any of the
possible questions.
This is the case in the water treatment example for variables I0 and
I1 , which are always known. The answer to any question of the form
P (Search|known ∧ i0 ∧ i1 ) is obtained by:
P(Search | known ∧ i0 ∧ i1)
  = Σ_{Free} [P(i0) P(i1) P(F) P(S | i0 ∧ F) P(C) P(O | i0 ∧ i1 ∧ S ∧ C)]
    / Σ_{Free ∧ Search} [P(i0) P(i1) P(F) P(S | i0 ∧ F) P(C) P(O | i0 ∧ i1 ∧ S ∧ C)]
  = Σ_{Free} [P(F) P(S | i0 ∧ F) P(C) P(O | i0 ∧ i1 ∧ S ∧ C)]
    / Σ_{Free ∧ Search} [P(F) P(S | i0 ∧ F) P(C) P(O | i0 ∧ i1 ∧ S ∧ C)]
(16.17)
The answer depends neither on P(I0) nor on P(I1).
The situation is not the same for the variables F and C, which are, for some
interesting questions, either searched or left free. For them, P(F) and P(C)
must be specified.
Consequently, in Bayesian programming you can specify a distribution as
Unknown, and ProBT will provide an error message if you try to ask a question
that requires this distribution.
Chapter 17
Glossary
La Méthode
Edgar Morin [1981]
This chapter is a very short summary of the book where the central con-
cepts are recalled as an extended glossary.
Pr
  Ds
    Sp(π)
      Va: S^0, ..., S^T, O^0, ..., O^T
      Dc:
        P(S^0 ∧ ... ∧ S^T ∧ O^0 ∧ ... ∧ O^T | π)
          = P(S^0 ∧ O^0) × Π_{t=1}^{T} [P(S^t | S^{t−1}) × P(O^t | S^t)]
      Fo:
        P(S^0 ∧ O^0)
        P(S^t | S^{t−1})
        P(O^t | S^t)
    Id
  Qu:
    P(S^{t+k} | O^0 ∧ ... ∧ O^t)
      (k = 0) ≡ Filtering
      (k > 0) ≡ Prediction
      (k < 0) ≡ Smoothing
(17.1)
See Section 13.1.2 for details and special cases like hidden Markov models
(HMMs), Kalman filters, and particle filters.
P(Searched | known ∧ δ ∧ π)
  = Σ_{Free} [P(Searched ∧ Free | known ∧ δ ∧ π)]
  = Σ_{Free} [P(Searched ∧ Free ∧ known | δ ∧ π)] / P(known | δ ∧ π)
  = Σ_{Free} [P(Searched ∧ Free ∧ known | δ ∧ π)]
    / Σ_{Free ∧ Searched} [P(Searched ∧ Free ∧ known | δ ∧ π)]
  = (1/Z) × Σ_{Free} [P(Searched ∧ Free ∧ known | δ ∧ π)]
  = (1/Z) × Σ_{Free} [P(L_1 | δ ∧ π) × Π_{k=2}^{K} [P(L_k | R_k ∧ δ ∧ π)]]
(17.2)
Most of the time only approximate inference is feasible, and all the algorithms
try to solve these optimization problems in clever ways. See Chapter
14 for details.
Pr
  Ds
    Sp(π)
      Va: X1, ..., XN
      Dc:
        P(X1 ∧ ... ∧ XN | π) = Π_{n=1}^{N} [P(Xn | Rn ∧ π)]
      Fo: any
    Id
  Qu: P(Xn | known)
(17.3)
Note the particularity of the decomposition (see Section 17.7), where one
and only one variable Xn appears on the left of the conditioning sign, depending
on its "antecedents" Rn.
See Section 13.1.1 for details and Section 16.3 for a discussion on “Bayesian
programming versus Bayesian networks.”
Program
  Description
    Specification(π)
      Variables
      Decomposition
      Forms
    Identification (based on δ)
  Question
adds knowledge to simplify the model, and the forms (see Section 17.9) where
the modeler specifies the mathematical means to compute and learn the ele-
ments of the decomposition.
17.7 Decomposition
P(X1 ∧ X2 ∧ ... ∧ XN | δ ∧ π)
  = P(L1 | δ ∧ π) × P(L2 | R2 ∧ δ ∧ π) × ... × P(LK | RK ∧ δ ∧ π)    (17.5)
17.8 Description
This model results from, on the one hand, knowledge provided by the programmer,
called specification (see Section 17.15), and, on the other hand, knowledge
coming from the "environment" and learned during identification.
17.9 Forms
Forms are the third necessary ingredient in the specification (see Section
17.15) part of a Bayesian program to be able to compute the joint distribution.
A form can either be a parametric form or a question (see Section 17.14) to
another Bayesian program.
17.10 Incompleteness
17.11 Mixture
Pr
  Ds
    Sp(π)
      Va: X1, ..., XN, H
      Dc:
        P(X1 ∧ ... ∧ XN ∧ H | π)
          = P(H | π) × P(X1 ∧ ... ∧ XN | H ∧ π)
      Fo:
        P(H | π) ≡ Table
        P(X1 ∧ ... ∧ XN | [H = m] ∧ π) ≡ P(X1 ∧ ... ∧ XN | πm)
    Id
  Qu:
    P(X1 ∧ ... ∧ XN | π)
      = Σ_{m=1}^{M} [P([H = m] | π) × P(X1 ∧ ... ∧ XN | πm)]
(17.6)
The usual form of the mixture (the weighted sum) appears as the result
of the inference done by ignoring and marginalizing out the variable H.
It is then easy to go one step further in generalization by considering that
the variable H could be conditioned by some of the variables. Doing this
we obtain the standard probabilistic conditional statement form (see Section
17.6).
17.12 Noise
Noise is anything that is not music, anything not written in the score,
anything not specified in your model, everything that you are ignoring, which
may be much.
See the discussion “Noise or ignorance?” in Section 16.12.
17.14 Question
Questions are formally defined as a partition of the set of variables in three
sub-sets:
• The known variables for which a value is imposed.
• The searched variables for which you are computing a probability dis-
tribution knowing the known variables.
• The free variables that you are ignoring and that have to be marginalized
out.
A question is a family of probability distributions, defined by P(Searched | Known),
made up of as many distributions as there are possible values of Known. Each
instantiated question is defined by P(Searched | known) and the answer is given
by the computation:

P(Searched | known) = (1/Z) × Σ_{Free} [P(Searched ∧ known ∧ Free)]    (17.7)
17.15 Specification
Specification is where the preliminary knowledge (see Section 17.13) of a
programmer is specified in a Bayesian program.
17.16 Subroutines
Subroutine calls in Bayesian programming consist in specifying a form of
one distribution appearing in the decomposition of a Bayesian program π1 as
a question (see Section 17.14) to another Bayesian program π2 (see Chapter
9 for details and examples).
When a question is asked to π1 , during the necessary inferences, each time
the form corresponding to the question to π2 has to be evaluated, it triggers
supplementary inferences to answer this question.
This mechanism allows the conception of hierarchical models where a
Bayesian program can be built using other Bayesian programs, possibly
written by others, as its elementary components.
17.17 Variable
Variables in Bayesian programming are defined formally as a set of mutu-
ally exclusive and exhaustive logical propositions. More intuitively this corre-
sponds to the concept of discrete variables. Continuous variables when neces-
sary are discretized (see the discussion on this matter in Section 16.9).
These variables have absolutely no intrinsic character of randomness. They
are not “random variables” formally defined as functions from the set of events
into R (or Rn for a random vector). The knowledge about these variables
may be probabilistic but it is not the nature of the variable itself. Indeed,
in Bayesian programming either a variable has a known value which appears
on the right part of a conditioning symbol, or it is known by a probability
distribution on its possible values.
When writing a Bayesian program the choice of the relevant variables that
should appear in this program is the most difficult part. When the appropriate
set of variables has been selected a large part of the modeling work has been
done. However, this remark is not specific to probabilistic modeling but is true
for any kind of modeling work.
Bibliography
D. Alais and D. Burr. The ventriloquist effect results from near-optimal bi-
modal integration. Current Biology, 14:257–262, February 2004.
M. S. Banks. Neuroscience: what you see and hear is what you get. Current
Biology, 14(6):236–238, 2004.
P. Bessiere, E. Dedieu, and O. Lebeltel. Wings Were Not Designed to Let An-
imals Fly. In Third European Conference on Artificial Evolution (Megève,
France), volume 1363 of Lecture Notes in Computer Science, pages 237–250.
Springer-Verlag, 1997.
K. Drewing and M. Ernst. Integration of force and position cues for shape
perception through active touch. Brain Research, 1078:92–100, 2006.
S. J. Gould. The Streak of Streaks. The New York Review of Books, 35(13),
1988.
A. Keller. The Fast Calculation of Form Factors Using Low Discrepancy Point
Sequence. In Proc. of the 12th Spring Conf. on Computer Graphics, pages
195–204, Bratislava, 1996.
J. Rett, J. Dias, and J.-M. Ahuactzin. Bayesian reasoning for Laban move-
ment analysis used in human-machine interaction. Int. J. Reasoning-based
Intelligent Systems, 2(1):13–35, 2010.
R. Y. Rubinstein. Simulation and the Monte Carlo Method. John Wiley and
Sons, 1981.
T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic-
statistical modeling. Journal of Artificial Intelligence Research, page 454,
2001.
Y. Sato, T. Toyoizumi, and K. Aihara. Bayesian inference explains perception
of unity and ventriloquism aftereffect: Identification of common sources of
audiovisual stimuli. Neural Computation, 19(12):3335–3355, 2007.
J.-L. Schwartz, J. Serkhane, P. Bessière, and L.-J. Boë. La robotique de
la parole, ou comment modéliser la communication par gestes orofaciaux.
Primatologie, 6:329–352, 2004.
J. Serkhane, J.-L. Schwartz, L.-J. Boë, B. Davis, P. Bessière, and E. Mazer.
Etude comparative de vocalisations de bébés humains et de bébés robots. In
XXIVème Journées d’Etude sur la Parole (JEP), LORIA et ATLIF, Nancy
(France), 2002.
J. Serkhane, J.-L. Schwartz, and P. Bessière. Simulating Vocal Imitation
in Infants, using a Growth Articulatory Model and Speech Robotics. In
International Congress of Phonetic Sciences (ICPhS), Barcelona, Spain,
page x, 2003.
J. Serkhane, J.-L. Schwartz, and P. Bessière. Building a talking baby robot: A
contribution to the study of speech acquisition and evolution. Interaction
Studies, 6(2):253–286, 2005.
J. Serkhane, J.-L. Schwartz, L.-J. Boë, B. Davis, and C. Matyear. Infants’
vocalizations analyzed with an articulatory model: A preliminary report.
Journal of Phonetics, 35(3):321–340, Mar. 2007.
J. Serkhane, J.-L. Schwartz, and P. Bessière. Building a Talking Baby Robot:
A Contribution to the Study of Speech Acquisition and Evolution. In
P. Bessière, editor, Probabilistic Reasoning and Decision Making in Sensory-
Motor Systems, pages 329–357. Springer, 2008.
J. E. Serkhane. Un bébé androïde vocalisant: Etude et modélisation
des mécanismes d’exploration vocale et d’imitation orofaciale dans le
développement de la parole. PhD thesis, Inst. Nat. Polytechnique de Greno-
ble, November 2005.
R. D. Shachter, B. D’Ambrosio, and B. A. Del Favero. Symbolic Probabilistic
Inference in Belief Networks. In Proceedings of the Eighth National Confer-
ence on Artificial Intelligence, AAAI’90, pages 126–131. AAAI Press, 1990.
A. F. Smith and G. O. Roberts. Bayesian computation via the Gibbs sampler
and related Monte Carlo methods. Journal of the Royal Statistical Society
B, 55:3–23, 1993.