Learning Probabilistic Graphical Models in R - Sample Chapter
David Bellot
This book is for anyone who has to deal with lots of data and
draw conclusions from it, especially when the data is noisy
or uncertain. Data scientists, machine learning enthusiasts,
engineers, and those who are curious about the latest
advances in machine learning will find PGM interesting.
Familiarize yourself with probabilistic graphical models through
real-world problems and illustrative code examples in R
Preface
Probabilistic graphical models are among the most advanced techniques in machine
learning for representing data and models in the real world with probabilities. In many
instances, they use the Bayesian paradigm to describe algorithms that can draw
conclusions from noisy and uncertain real-world data.
The book covers topics such as inference (automated reasoning) and learning, which
is automatically building models from raw data. It explains how all the algorithms
work step by step and presents readily usable solutions in R with many examples.
After covering the basic principles of probabilities and the Bayes formula, it presents
Probabilistic Graphical Models (PGMs) and several types of inference and learning
algorithms. The reader will go from the design to the automatic fitting of the model.
Then, the book focuses on useful models that have proven track records in solving
many data science problems, such as Bayesian classifiers, mixture models, Bayesian
linear regression, and also simpler models that are used as basic components to
build more complex models.
Chapter 4, Bayesian Modeling - Basic Models, covers simple and powerful Bayesian
models that can be used as building blocks for more advanced models and shows
you how to fit and query them with adapted algorithms.
Chapter 5, Approximate Inference, covers the second way to perform inference in
PGMs, using sampling algorithms, and presents the main sampling algorithms
such as MCMC.
Chapter 6, Bayesian Modeling - Linear Models, shows you a more Bayesian view of the
standard linear regression algorithm and a solution to the problem of over-fitting.
Chapter 7, Probabilistic Mixture Models, goes over more advanced probabilistic models
in which the data comes from a mixture of several simple models.
Appendix, References, includes all the books and articles which have been used to
write this book.
Probabilistic Reasoning
Among all the predictions that were made about the 21st century, maybe the most
unexpected one was that we would collect such a formidable amount of data about
everything, every day, and everywhere in the world. Recent years have seen an
incredible explosion of data collection about our world, our lives, and technology;
this is the main driver of what we can certainly call a revolution. We live in the Age
of Information. But collecting data is nothing if we don't exploit it and try to extract
knowledge out of it.
At the beginning of the 20th century, with the birth of statistics, the world was all about
collecting data and making statistics. At that time, the only reliable tools were pencils
and paper and of course, the eyes and ears of the observers. Scientific observation was
still in its infancy, despite the prodigious development of the 19th century.
More than a hundred years later, we have computers, we have electronic sensors,
we have massive data storage and we are able to store huge amounts of data
continuously about, not only our physical world, but also our lives, mainly through
the use of social networks, the Internet, and mobile phones. Moreover, the density of
our storage technology has increased so much that we can, nowadays, store months
if not years of data into a very small volume that can fit in the palm of our hand.
But storing data is not acquiring knowledge. Storing data is just keeping it
somewhere for future use. At the same time as our storage capacity dramatically
evolved, the capacity of modern computers increased too, at a pace that is sometimes
hard to believe. When I was a doctoral student, I remember how proud I was when
I received in the laboratory that brand-new, shiny, all-powerful PC for carrying out my
research work. Today, my old smartphone, which fits in my pocket, is more than 20
times faster.
Therefore in this book, you will learn one of the most advanced techniques to
transform data into knowledge: machine learning. This technology is used in every
aspect of modern life now, from search engines, to stock market predictions, from
speech recognition to autonomous vehicles. Moreover it is used in many fields
where one would not suspect it at all, from quality assurance in product chains to
optimizing the placement of antennas for mobile phone networks.
Machine learning is the marriage between computer science and probabilities and
statistics. A central theme in machine learning is the problem of inference or how to
produce knowledge or predictions using an algorithm fed with data and examples.
And this brings us to the two fundamental aspects of machine learning: the design of
algorithms that can extract patterns and high-level knowledge from vast amounts of
data and also the design of algorithms that can use this knowledge or, in scientific
terms, learning and inference.
Pierre-Simon Laplace (1749-1827), a French mathematician and one of the greatest
scientists of all time, was presumably among the first to understand an important
aspect of data collection: data is unreliable, uncertain and, as we say today, noisy.
He was also the first to develop the use of probabilities to deal with such aspects of
uncertainty and to represent one's degree of belief about an event or information.
In his Essai philosophique sur les probabilités (1814), Laplace formulated an original
mathematical system for reasoning about new and old data, in which one's belief
about something could be updated and improved as soon as new data were
available. Today we call that Bayesian reasoning. Indeed, Thomas Bayes was
the first, in the middle of the 18th century, to discover this principle. Without
any knowledge of Bayes' work, Pierre-Simon Laplace rediscovered the same
principle and formulated the modern form of the Bayes theorem. It is interesting
to note that Laplace eventually learned about Bayes' posthumous publications
and acknowledged Bayes to be the first to describe the principle of this inductive
reasoning system. In fairness to Laplace, we could speak about Laplacian reasoning
instead of Bayesian reasoning and call the result the Bayes-Price-Laplace theorem.
More than a century later, this mathematical technique was reborn thanks to new
discoveries in computing probabilities and gave birth to one of the most important
and used techniques in machine learning: the probabilistic graphical model.
From now on, it is important to note that the term graphical refers to the theory of
graphs, that is, mathematical objects with nodes and edges (and not graphics or
drawings). You know that, when you want to explain to someone the relationships
between different objects or entities, you take a sheet of paper and draw boxes that
you connect with lines or arrows. It is an easy and neat way to show relationships,
whatever they are, between different elements.
Chapter 1
Probabilistic Graphical Models (PGM for short) are exactly that: you want to
describe relationships between variables. However, you don't have any certainty
about your variables, but rather beliefs or uncertain knowledge. And we know now
that probabilities are the way to represent and deal with such uncertainties, in a
mathematical and rigorous way.
A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge
about facts and events using probabilities. It is also one of the most advanced machine
learning techniques nowadays and has many industrial success stories.
Probabilistic graphical models can deal with our imperfect knowledge about the
world because our knowledge is always limited. We can't observe everything, and we
can't represent the whole universe in a computer. We are intrinsically limited as human
beings, as are our computers. With probabilistic graphical models, we can build
simple learning algorithms or complex expert systems. With new data, we can
improve those models and refine them as much as we can and also we can infer new
information or make predictions about unseen situations and events.
In this first chapter you will learn about the fundamentals needed to understand
probabilistic graphical models; that is, probabilities and the simple rules of calculus on
which they are based. We will have an overview of what we can do with probabilistic
graphical models and the related R packages. These techniques are so successful that
we will have to restrict ourselves to just the most important R packages.
We will see how to develop simple models, piece by piece, like a brick game and
how to connect models together to develop even more advanced expert systems.
We will cover the following concepts and applications and each section will contain
numerical examples that you can directly use with R:
Machine learning
Machine learning
This book is about a field of science called machine learning, or more generally
artificial intelligence. To perform a task, to reach conclusions from data, a computer
as well as any living being needs to observe and process information of a diverse
nature. For a long time now, we have been designing and inventing algorithms and
systems that can solve a problem, very accurately and at incredible speed, but all
algorithms are limited to the very specific task they were designed for. On the other
hand, living beings in general and human beings (as well as many other animals)
exhibit this incredible capacity to adapt and improve using their experience, their
errors, and what they observe in the world.
Trying to understand how it is possible to learn from experience and adapt to
changing conditions has always been a great topic of science. Since the invention of
computers, one of the main goals has been to reproduce this type of skill in a machine.
Machine learning is the study of algorithms that can learn and adapt from data
and observation, reason, and perform tasks using learned models and algorithms.
As the world we live in is inherently uncertain, in the sense that even the simplest
observation such as the color of the sky is impossible to determine absolutely, we
needed a theory that can encompass this uncertainty. The most natural one is the
theory of probability, which will serve as the mathematical foundation of the
present book.
But when the amount of data grows to very large datasets, even the simplest
probabilistic tasks can become cumbersome and we need a framework that will
allow the easy development of models and algorithms that have the necessary
complexity to deal with real-world problems.
By real-world problems, we mean tasks that a human being is able to do,
such as understanding people's speech, driving a car, trading on the stock exchange,
recognizing people's faces in a picture, or making a medical diagnosis.
At the beginning of artificial intelligence, building such models and algorithms was a
very complex task, and every new algorithm had to be invented, implemented, and
programmed anew, with its inherent sources of errors and bias. The framework we present
in this book, called probabilistic graphical models, aims at separating the tasks of
designing a model and implementing algorithms. Because it is based on probability
theory and graph theory, it has very strong mathematical foundations. But also, it is a
framework where the practitioner doesn't need to write and rewrite algorithms all the
time, for algorithms were designed to solve very generic problems and already exist.
Moreover, probabilistic graphical models are based on machine learning techniques
which will help the practitioner to create new models from data in the easiest way.
Algorithms in probabilistic graphical models can learn new models from data and
answer all sorts of questions using those data and the models, and of course adapt
and improve the models when new data is available.
In this book, we will also see that probabilistic graphical models are a mathematical
generalization of many standard and classical models that we all know and that we
can reuse, mix, and modify within this framework.
The rest of this chapter will introduce required notions in probabilities and graph
theory to help you understand and use probabilistic graphical models in R.
One last note about the title of the book: Learning Probabilistic Graphical Models in R.
In fact this title has two meanings: you will learn how to make probabilistic graphical
models, and you will learn how the computer can learn probabilistic graphical
models. This is machine learning!
In machine learning, probabilities are the basic components of most of the systems
and algorithms. You might want to know the probability that an e-mail you received
is a spam (junk) e-mail. You want to know the probability that the next customer on
your online site will buy the same item as the previous customer (and whether your
website should advertise it right away). You want to know the probability that, next
month, your shop will have as many customers as this month.
As you can see with these examples, the line between purely frequentist and purely
Bayesian is far from being clear. And the good news is that the rules of probability
calculus are rigorously the same, whatever interpretation you choose (or not).
Conditional probability
A central theme in machine learning and especially in probabilistic graphical
models is the notion of a conditional probability. In fact, let's be clear, probabilistic
graphical models are all about conditional probability. Let's get back to our horse
race example. We say that, if you know nothing about the riders and their horses,
you would assign, say, a probability of 0.1 to each (assuming there are 10 horses).
Now, you just learned that the best rider in the country is participating in this race.
Would you give him the same chance as the others? Certainly not! Therefore the
probability for this rider to win is, say, 19% and therefore, we will say that all other
riders have a probability to win of only 9%. This is a conditional probability: that is, a
probability of an event based on knowing the outcome of another event. This notion
of probability perfectly matches the intuitive idea of changing our minds or, in more
technical terms, updating our beliefs given a new piece of information. At the same time, we also
saw a simple example of a Bayesian update, where we reconsidered and updated our
beliefs given a new fact. Probabilistic graphical models are all about that, but just
with more complex situations.
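As a rough sketch, the horse race update can be written down in R. The vector names are illustrative, and rider 1 is arbitrarily taken to be the champion:

```r
# Ten riders and no information: a uniform distribution over the winners
n_riders <- 10
belief_before <- rep(1 / n_riders, n_riders)

# After learning that rider 1 is the best in the country (numbers from the
# text): 19% for that rider and 9% for each of the nine others
belief_after <- c(0.19, rep(0.09, n_riders - 1))

# Both are valid probability distributions: each of them sums to 1
sum(belief_before)
sum(belief_after)
```

The new information moves probability mass toward one rider, but the total belief is still shared out among the same ten riders.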
A sample space $\Omega$ is the set of all possible outcomes of an experiment. In this set, we
call a point $\omega$ of $\Omega$ a realization. And finally we call a subset of $\Omega$ an event.
For example, if we toss a coin once, we can have heads (H) or tails (T). We say that the
sample space is $\Omega = \{H, T\}$. An event could be "I get a head (H)". If we toss the coin twice,
the sample space is bigger and we can have all those possibilities: $\Omega = \{HH, HT, TH, TT\}$.
An event could be "I get a head first". Therefore my event is $E = \{HH, HT\}$.
A more advanced example could be the measurement of someone's height in
meters. The sample space is all the positive numbers from 0.0 to 10.9. Chances
are that none of your friends will be 10.9 meters tall, but it does no harm to the
theory. An event could be all the basketball players, that is, measurements that are
2 meters or more. In mathematical notation we write in terms of intervals $\Omega = [0, 10.9]$
and $E = [2, 10.9]$.
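The two-toss coin example above can be sketched in a few lines of R. The variable names are illustrative, and the last line assumes a fair coin, so that all four outcomes are equally likely:

```r
# Sample space for two coin tosses: every combination of H and T
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)
omega <- paste0(tosses$first, tosses$second)

# The event "I get a head first" is the subset of outcomes starting with H
E <- omega[substr(omega, 1, 1) == "H"]

# Assuming equally likely outcomes, Pr(E) is the fraction of outcomes in E
pr_E <- length(E) / length(omega)
```

An event is nothing more than a subset of the sample space, which is why plain vector subsetting is enough to represent it here.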
A probability is a real number Pr(E) that we assign to every event E. A probability
must satisfy the three following axioms. Before writing them, it is time to recall why
we're using these axioms. If you remember, we said that, whatever the interpretation
of the probabilities that we make (frequentist or Bayesian), the rules governing the
calculus of probability are the same:
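In their standard (Kolmogorov) formulation, which is assumed here, the three axioms can be written as follows, with $\Omega$ the sample space and $E, E_1, E_2, \dots$ events:

```latex
\begin{align*}
&\Pr(E) \ge 0 \quad \text{for every event } E \\
&\Pr(\Omega) = 1 \\
&\Pr\Big(\bigcup_{i} E_i\Big) = \sum_{i} \Pr(E_i)
  \quad \text{for pairwise disjoint events } E_1, E_2, \dots
\end{align*}
```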
When throwing two dice, the sum X of the numbers is a random variable
For each possible event, we can associate a probability $p_i$ and the set of all those
probabilities is the probability distribution of the random variable.
Let's see an example: we consider an experiment in which we toss a coin three times.
A sample point (from the sample space) is the result of the three tosses. For example,
HHT, two heads and one tail, is a sample point.
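Therefore, it is easy to enumerate all the possible outcomes and find that the
sample space is:
$$\Omega = \{HHH, HHT, HTH, THH, TTH, THT, HTT, TTT\}$$
Writing $H_i$ for the event that the $i$-th toss is a head, and giving each of the eight
sample points the same probability $\frac{1}{8}$, we see that $P(H_1) = P(H_2) = P(H_3) = \frac{1}{2}$.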
Under this probability model, the events $H_1$, $H_2$, $H_3$ are mutually independent.
To verify, we first write that:
$$P(H_1 \cap H_2 \cap H_3) = P(\{HHH\}) = \frac{1}{8} = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = P(H_1) P(H_2) P(H_3)$$
$$P(H_1 \cap H_2) = P(\{HHH, HHT\}) = \frac{2}{8} = \frac{1}{2} \cdot \frac{1}{2} = P(H_1) P(H_2)$$
The same applies to the two other pairs. Therefore $H_1$, $H_2$, $H_3$ are mutually
independent. In general, we write that the probability of two independent events is the
product of their probabilities: $P(A \cap B) = P(A) \cdot P(B)$. And we write that the probability of
two disjoint events is the sum of their probabilities: $P(A \cup B) = P(A) + P(B)$.
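These identities can be checked numerically. The following is a minimal sketch that enumerates the three-toss sample space and assumes, as in the text, a probability of 1/8 for each sample point; the helper `pr` and the events `none` and `all3` are names introduced only for this illustration:

```r
# Enumerate the 8 equally likely outcomes of three coin tosses
tosses <- expand.grid(t1 = c("H", "T"), t2 = c("H", "T"), t3 = c("H", "T"),
                      stringsAsFactors = FALSE)
p <- rep(1 / nrow(tosses), nrow(tosses))   # 1/8 for each sample point

# Indicators of the events H1, H2, H3 (head on toss 1, 2, 3)
H1 <- tosses$t1 == "H"
H2 <- tosses$t2 == "H"
H3 <- tosses$t3 == "H"

# Probability of an event = sum of the probabilities of its sample points
pr <- function(event) sum(p[event])

# Independence: joint probabilities equal the products of the marginals
all.equal(pr(H1 & H2), pr(H1) * pr(H2))
all.equal(pr(H1 & H3), pr(H1) * pr(H3))
all.equal(pr(H2 & H3), pr(H2) * pr(H3))
all.equal(pr(H1 & H2 & H3), pr(H1) * pr(H2) * pr(H3))

# Disjoint events: "no head at all" and "three heads" cannot both occur,
# so the probability of their union is the sum of their probabilities
none <- !H1 & !H2 & !H3
all3 <- H1 & H2 & H3
all.equal(pr(none | all3), pr(none) + pr(all3))
```

Each call to `all.equal` returns `TRUE` under this model; changing the vector `p` to a non-uniform distribution is an easy way to see independence break.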
If we consider a different outcome, we can define another probability distribution.
For example, let's consider again the experiment in which a coin is tossed three
times. This time we consider the random variable X as the number of heads obtained
after three tosses.
Outcome   X
HHH       3
HHT       2
HTH       2
THH       2
TTH       1
THT       1
HTT       1
TTT       0
So the range for the random variable X is now {0,1,2,3}. If we assume the same
probability for all points as before, that is $\frac{1}{8}$, then we can deduce the probability
function on the range of X:
x         0     1     2     3
P(X=x)    1/8   3/8   3/8   1/8
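The probability function of X can be recovered by counting. This sketch codes heads as 1 and tails as 0 (a convention chosen here for convenience) and assumes eight equally likely sample points:

```r
# All 8 outcomes of three tosses, coded 1 for heads and 0 for tails
tosses <- expand.grid(t1 = 0:1, t2 = 0:1, t3 = 0:1)

# X = number of heads in each outcome
X <- rowSums(tosses)

# Each outcome has probability 1/8, so P(X = x) is the number of outcomes
# with x heads divided by 8
p_X <- table(X) / nrow(tosses)
p_X
```

The counts 1, 3, 3, 1 are the binomial coefficients for three tosses, which is why the middle values of X are three times as likely as the extremes.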
If we keep adding more and more experiments and therefore more and more variables,
we can write a very big and complex joint probability distribution. For example, I
could be interested in the probability that it will rain tomorrow, that the stock market
will rise and that there will be a traffic jam on the highway that I take to go to work.
It's a complex one but not unrealistic. I'm almost sure that the stock market and the
weather are really not dependent. However, the traffic condition and the weather
are seriously connected. I would like to write the distribution P(W, M, T) (weather,
market, traffic), but it seems to be overly complex. In fact, it is not and this is what we
will see throughout this book.
A probabilistic graphical model is a joint probability distribution. And nothing else.
One last and very important notion regarding joint probability distributions is
marginalization. When you have a probability distribution over several random
variables, that is a joint probability distribution, you may want to eliminate some of
the variables from this distribution to have a distribution on fewer variables. This
operation is very important. The marginal distribution p(X) of a joint distribution
p(X, Y) is obtained by the following operation:
$p(X) = \sum_y p(X, Y = y)$, where we sum the probabilities over all the possible values y of Y.
By doing so, you can eliminate Y from P(X, Y). As an exercise, I'll let you think about
the link between this and the probability of two disjoint events that we saw earlier.
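Marginalization is easy to try on a small table. In the sketch below, the joint distribution over weather and traffic is hypothetical: the four numbers are made up for illustration and only need to sum to 1.

```r
# Hypothetical joint distribution P(weather, traffic); the numbers are
# invented for this example and sum to 1
joint <- matrix(c(0.20, 0.10,
                  0.15, 0.55),
                nrow = 2, byrow = TRUE,
                dimnames = list(weather = c("rain", "dry"),
                                traffic = c("jam", "fluid")))

# Marginalize traffic out: sum each row to obtain P(weather)
p_weather <- rowSums(joint)

# Marginalize weather out: sum each column to obtain P(traffic)
p_traffic <- colSums(joint)
```

Each marginal is again a probability distribution, so `sum(p_weather)` and `sum(p_traffic)` both equal 1.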
Bayes' rule
Let's continue our exploration of the basic concepts we need to play with
probabilistic graphical models. We saw the notion of marginalization, which
is important because, when you have a complex model, you may want to
extract information about one or a few variables of interest. And this is when
marginalization is used.
But the two most important concepts are conditional probability and Bayes' rule.
Knowing it's going to rain tomorrow, what is now the probability of a traffic
jam? Presumably higher than if you knew nothing.
This is a conditional probability. In more formal terms, we can write the following
formula:
$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)} \quad \text{and} \quad P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
From these two equations we can easily deduce the Bayes formula:
$$P(X \mid Y) = \frac{P(Y \mid X) \cdot P(X)}{P(Y)}$$
This formula is the most important and it helps invert probabilistic relationships.
This is the chef d'oeuvre of Laplace's career and one of the most important formulas in
modern science. Yet it is very simple.
In this formula, we call P(X | Y) the posterior distribution of X given Y. Therefore,
we also call P(X) the prior distribution. We also call P(Y | X) the likelihood and
finally P(Y) is the normalization factor.
The normalization factor needs a bit of explanation and development here. Recall
that $P(X, Y) = P(Y \mid X) P(X)$. And also, we saw that $P(Y) = \sum_x P(X = x, Y)$, an operation
we called marginalization, whose goal was to eliminate (or marginalize out) a
variable from a joint probability distribution.
So from there, we can write $P(Y) = \sum_x P(X = x, Y) = \sum_x P(Y \mid X = x) P(X = x)$.
Thanks to this magic bit of simple algebra, we can rewrite the Bayes' formula in its
general form and also the most convenient one:
$$P(X \mid Y) = \frac{P(Y \mid X) \cdot P(X)}{\sum_x P(Y \mid X = x) P(X = x)}$$
The simple beauty of this form is that we only need to specify and use P(Y | X)
and P(X), that is, the likelihood and the prior. Despite the simple form, the sum in the
denominator, as we will see in the rest of this book, can be a hard problem to solve
and advanced techniques will be required for advanced problems.
$$P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{\sum_i P(D \mid \theta_i) P(\theta_i)}$$
The prior distribution $P(\theta)$ is what I believe about $\theta$ before everything else is
known: my initial belief.
The likelihood $P(D \mid \theta)$: given a value for $\theta$, what is the data D that I could generate, or
in other terms what is the probability of D for all values of $\theta$?
This formula also gives the basis of a forward process to update my beliefs about
the variable $\theta$. Applying Bayes' rule will calculate the new distribution of $\theta$. And if I
receive new information again, I can update my beliefs again, and again.
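Here is a minimal sketch of this repeated updating. It assumes a hypothetical coin whose unknown probability of heads takes one of three candidate values, and tosses that are independent given that value; the grid of values, the uniform prior, the data, and the helper `bayes_step` are all illustrative assumptions:

```r
# Candidate values of theta (probability of heads) and a uniform prior
theta <- c(0.2, 0.5, 0.8)
prior <- rep(1 / 3, 3)

# One application of Bayes' rule for a single observed toss ("H" or "T")
bayes_step <- function(belief, toss) {
  likelihood <- if (toss == "H") theta else 1 - theta
  belief * likelihood / sum(belief * likelihood)
}

# Update the belief one observation at a time
D <- c("H", "H", "T")
belief <- prior
for (toss in D) belief <- bayes_step(belief, toss)

# Same result in a single step, using the likelihood of the whole data set
# (two heads and one tail)
lik_all <- theta^2 * (1 - theta)
batch <- prior * lik_all / sum(prior * lik_all)
all.equal(belief, batch)
```

Under the independence assumption, updating observation by observation and updating once with all the data give the same posterior, which is what makes the step-by-step process legitimate.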
The likelihood
In this example, we won't need a specific package; we just need to write a simple
function to implement a simple form of the Bayes' rule.
The prior distribution is our initial belief on how the machine is working or not.
We identified a first random variable M for the state of the machine. This random
variable can have two states {working, broken}. We believe our machine is working
well because it's a good machine, so let's say the prior distribution is as follows:
P ( M = working ) = 0.99
P ( M = broken ) = 0.01
It simply says that our belief that the machine is working is really high, with a
probability of 99% and only a 1% chance that it is broken. Here, clearly we're using
the Bayesian interpretation of probability because we don't have many machines but
just one. We could also ask the machine's vendor about the frequency of working
machines he or she is able to produce. And we could use his or her number and, in
that case, this probability would have a frequentist interpretation. Nevertheless, the
Bayes' rule works in all the cases.
The second random variable is L and it is the light bulb produced by the machine.
The light bulb can either be good or bad. So this random variable will have two states
again {good, bad}.
Again, we need to give a distribution for the light bulb variable L: the
Bayes' formula requires that we specify a prior distribution and a likelihood
distribution. In this case, what we need is the likelihood P(L | M) and not simply P(L).
Here we need in fact to define two probability distributions: one when the machine
works M = working and one when the machine is broken M = broken. And we ask the
question twice:
How likely is it to have a good or a bad light bulb when the machine
is working?
How likely is it to have a good or a bad light bulb when the machine is
not working?
Let's try to give our best guess, either Bayesian or frequentist, because we have
some statistics:
P ( L = good | M = working ) = 0.99
P ( L = bad | M = working ) = 0.01
P ( L = good | M = broken ) = 0.6
P ( L = bad | M = broken ) = 0.4
Here we believe that, if the machine is working, it will only give one bad light bulb
out of 100, which is even higher than what we said before. But in this case, we know
that the machine is working so we expect a very high success rate. However, if the
machine is broken, we say we expect at least 40% of the light bulbs to be bad. From
now on, we have fully specified our model and we can start using it.
Using a Bayesian model means computing posterior distributions when a new fact is
available. In our case, we want to know if the machine is working knowing that
we just observed that our latest light bulb was not working. So we want to compute
P(M | L). We just specified P(M) and P(L | M), so the last thing we have to do is to
use the Bayes' formula to invert the probability distribution.
For example, let's say the last produced light bulb is bad, that is, L = bad. Using the
Bayes formula we obtain:
$$P(M = \text{working} \mid L = \text{bad}) = \frac{P(L = \text{bad} \mid M = \text{working}) \cdot P(M = \text{working})}{P(L = \text{bad} \mid M = \text{working}) P(M = \text{working}) + P(L = \text{bad} \mid M = \text{broken}) P(M = \text{broken})}$$
$$= \frac{0.01 \times 0.99}{0.01 \times 0.99 + 0.4 \times 0.01} = 0.71$$
Or if you prefer, a 71% chance that the machine is working. It's lower but follows our
intuition that the machine might still work. After all even if we received a bad light
bulb, it's only one and maybe the next will still be good.
Let's try to redo the same problem, with equal priors on the state of the machine:
a 50% chance the machine is working and 50% the machine is broken. The result
is therefore:
$$\frac{0.01 \times 0.5}{0.01 \times 0.5 + 0.4 \times 0.5} = 0.024$$
It is a 2.4% chance the machine is working! That's very low. Indeed, given the
apparent quality of this machine, as modeled in the likelihood, it appears very
surprising that the machine can produce a bad light bulb. In this case, we didn't
make the assumption that the machine was working as in the previous example, and
having a bad light bulb can be seen as an indication that something is wrong.
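Both results are easy to reproduce. The following sketch uses the likelihood values given in the text; the helper `posterior_bad` is a name introduced only for this illustration:

```r
# Likelihood of a bad light bulb for each state of the machine (from the text)
p_bad <- c(working = 0.01, broken = 0.40)

# Posterior over the machine state after observing one bad light bulb
posterior_bad <- function(prior) prior * p_bad / sum(prior * p_bad)

# Confident prior: the machine is believed to work with probability 0.99
post_confident <- posterior_bad(c(working = 0.99, broken = 0.01))

# Equal priors: no idea whether the machine works or not
post_equal <- posterior_bad(c(working = 0.5, broken = 0.5))

round(post_confident["working"], 2)   # 0.71
round(post_equal["working"], 3)       # 0.024
```

The likelihood is the same in both runs; only the prior changes, and it alone accounts for the gap between 71% and 2.4%.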
So we defined three variables, the prior with two states working and broken, the
likelihood we specified for each condition of the machine (working or broken), and the
distribution over the variable L of the light bulb. So that's four values in total and the R
matrix is indeed like the conditional distribution we defined in the previous section:
likelihood   good   bad
working      0.99   0.01
broken       0.60   0.40
The data variable contains the sequence of observed light bulbs we will use to test
our machine and compute the posterior probabilities. So, now we can define our
Bayesian update function as follows:
bayes <- function(prior, likelihood, data)
{
    # One row per observation, one column per state of the prior
    posterior <- matrix(0, nrow = length(data), ncol = length(prior))
    dimnames(posterior) <- list(data, names(prior))

    initial_prior <- prior
    for (i in seq_along(data))
    {
        # Bayes' rule: prior times likelihood of the observation, normalized
        posterior[i, ] <- prior * likelihood[, data[i]] /
            sum(prior * likelihood[, data[i]])

        # The posterior becomes the prior for the next observation
        prior <- posterior[i, ]
    }

    return(rbind(initial_prior, posterior))
}
In the end, the function returns a matrix with the initial prior and all subsequent
posterior distributions.
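To make the inputs concrete, here is a self-contained sketch of the same sequential update written with `Reduce`. The prior and likelihood values are the ones given earlier in the text; the data vector of ten bad light bulbs in a row and the helper name `bayes_step` are assumptions made for this illustration.

```r
# Prior over the machine state and likelihood matrix P(L | M), as in the text
prior <- c(working = 0.99, broken = 0.01)
likelihood <- rbind(working = c(good = 0.99, bad = 0.01),
                    broken  = c(good = 0.60, bad = 0.40))

# Assumed observations: ten bad light bulbs in a row
data <- rep("bad", 10)

# One Bayesian update for a single observation (hypothetical helper)
bayes_step <- function(belief, obs) {
  belief * likelihood[, obs] / sum(belief * likelihood[, obs])
}

# accumulate = TRUE keeps the initial prior and every intermediate posterior
steps <- Reduce(bayes_step, data, prior, accumulate = TRUE)
beliefs <- do.call(rbind, steps)
```

The matrix `beliefs` has one row for the initial prior and one row per observation, with the columns `working` and `broken`, so it can be passed to `matplot` directly.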
Let's do a few runs to understand how it works. We will use the function matplot to
draw the evolution of the two distributions, one for the posterior probability that the
machine is working (in green) and the other in red, meaning that the machine is broken:
matplot(bayes(prior, likelihood, data), type = "b", lty = 1, pch = 20,
        col = c(3, 2))
The result can be seen on the following graph: as the bad light bulbs arrive, the
probability that the machine is working quickly falls (the plain or green line). We
expected something like 1 bad light bulb out of 100, and not that many. So this
machine needs maintenance now. The red or dashed line represents the probability
that the machine is broken.
If the prior was different, we would have seen a different evolution. For example,
let's say that we have no idea if the machine is broken or not, that is, we give an
equal chance to each situation:
prior <- c(working = 0.5, broken = 0.5)
Again we obtain a quick convergence to very high probabilities that the machine is
broken, which is not surprising given the long sequence of bad light bulbs:
If we keep playing with the data we might see different behaviors again. For
example, let's say we assume the machine is working well, with a 99% probability.
And we observe a sequence of 10 light bulbs, among which the first one is bad. In R
we have:
prior <- c(working = 0.99, broken = 0.01)
data <- c("bad", "good", "good", "good", "good", "good", "good", "good", "good",
          "good")
matplot(bayes(prior, likelihood, data), type = "b", pch = 20, col = c(3, 2))
The algorithm hesitates at first because, given such a good machine, it's unlikely
to see a bad light bulb, but then it will converge back to high probabilities again,
because the sequence of good light bulbs does not indicate any problem.
This concludes our first example of a Bayesian model with R. In the rest of this
chapter, we will see how to create real-world models, with more than just two very
simple random variables, and how to solve two important problems:
A careful reader should now ask: doesn't this little algorithm we just saw solve the
problem of inference? Indeed it does, but only when one has two discrete variables,
which is a bit too simple to capture the complexity of the world. We will introduce
now the core of this book and the main tool for performing Bayesian inference:
probabilistic graphical models.
Probabilistic models
If you remember, we saw that it is possible to represent really advanced concepts
using a probability distribution; when we have many random variables, we call this
distribution a joint distribution. It is not unusual to have hundreds,
if not thousands or more, of those random variables. Representing such a big
distribution is extremely hard and in most cases impossible.
From a database of patients, we want to assess and discover all the probability
distributions and their associated parameters, automatically of course.
We want to put questions to the model, such as, "If I observe a series of
symptoms, is my patient healthy or not?" Similarly, "If I change this or that in
my patient's diet and give this drug, will my patient recover?"
Because each of the symptoms can exist in different degrees, it is natural to represent
the variables as random variables. For example, if the patient's nose is a bit blocked,
we will assign a probability of, say, 60% to this variable, that is P(N=blocked)=0.6 and
P(N=not blocked)=0.4.
In this example, the probability distribution P(Se,N,H,S,C,Cold) will require $4 \times 2^5 = 128$
values in total (4 seasons and 2 values for each of the other random variables). It's quite
a lot and honestly it's quite hard to determine things such as the probability that the
nose is not blocked and that the patient has a headache and sneezes and so on.
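The count is a simple product over the variables, which the following sketch makes explicit. The vector `n_states` is an illustrative encoding of the model, and the ten extra binary symptoms in the last line are hypothetical:

```r
# Number of states of each variable: Season has 4, the others are binary
n_states <- c(Se = 4, N = 2, H = 2, S = 2, C = 2, Cold = 2)

# A full joint table needs one probability per combination of states
n_values <- prod(n_states)
n_values                          # 128

# Every additional binary variable doubles the size of the table:
# with 10 more hypothetical binary symptoms, it grows by a factor of 2^10
n_values_more <- n_values * 2^10
n_values_more                     # 131072
```

This exponential growth is the reason a raw joint table quickly becomes impractical, even for a modest number of variables.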
However, we can say that a headache is not directly related to a cough or a blocked
nose, except when the patient has a cold. Indeed, the patient could have a headache
for many other reasons.
Moreover, we can say that the Season has quite a direct effect on Sneezing, Blocked
Nose, or Cough but less or none on Headache. In a probabilistic graphical model, we
will represent these dependency relationships with a graph, as follows, where each
random variable is a node in the graph and each relationship is an arrow between
two nodes:
As you can see in the preceding figure, there is a direct relationship between each
node and each variable of the probabilistic graphical model and also a direct
relationship between arrows and the way we can simplify the joint probability
distribution in order to make it tractable.
Using a graph as a model to simplify a complex (and sometimes complicated)
distribution presents numerous benefits:
Algorithms to perform inference and learning can exploit graph theory and its
associated algorithms: compared to the raw joint probability distribution,
using a PGM can speed up computations by several orders of magnitude.
Factorizing a distribution
In the previous example on the diagnosis of the common cold, we defined a simple
model with a few variables Se, N, H, S, C, and Cold. We saw that, for such a simple
expert system, we needed 128 parameters!
We also saw that we can make a few independence assumptions based only on
common sense or common knowledge. Later in this book, we will see how to
discover those assumptions from a data set (also called structural learning).
So we can rewrite our joint probability distribution taking into account these
assumptions as follows:
P(Se, N, H, S, C, Cold) = P(Se) P(S | Se, Cold) P(N | Se, Cold) P(Cold) P(C | Cold) P(H | Cold)
In this distribution, we did a factorization; that is, we expressed the original joint
distribution as a product of factors. In this case, the factors are simpler probability
distributions such as P(C | Cold), the probability of coughing given that one has a
cold. And as we considered all the variables to be binary (except Season, which can
take of course four values), each small factor (distribution) will need only a few
parameters to be determined: 4 + 16 + 16 + 2 + 2^2 + 2^2 = 46, where each of the two
factors conditioned on Season needs 2 × 4 × 2 = 16 values. Only 46 easy parameters
instead of 128! It's a big improvement.
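The size of each factor can be tallied with a few lines of base R. This is a minimal sketch: `card` and `parents` are illustrative names introduced here, and the count uses one table entry per combination of a node's states and its parents' states, with Season kept at four values.

```r
# Number of states of each variable (Season has 4, the others are binary).
card <- c(Se = 4, N = 2, H = 2, S = 2, C = 2, Cold = 2)

# Parents of each node, read directly from the factorization above.
parents <- list(
  Se   = character(0),
  Cold = character(0),
  S    = c("Se", "Cold"),
  N    = c("Se", "Cold"),
  C    = "Cold",
  H    = "Cold"
)

# A factor P(X | parents) has one entry per joint state of X and its parents.
factor_size <- sapply(names(parents), function(v) prod(card[c(v, parents[[v]])]))
print(factor_size)
print(sum(factor_size))
```

The two factors conditioned on Season have 2 × 4 × 2 = 16 entries each, so the factorized model needs 46 entries overall, still far fewer than the 128 of the full joint table.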
I said the parameters are easy, because they're easy to determine, either by hand or
from data. For example, we don't know if the patient has a cold or not, so we can
assign equal probability to the variable Cold, that is P(Cold = true)=P(Cold = false)=0.5.
Similarly, it's easy to determine P(C | Cold) because, if the patient has a cold
(Cold=true), he or she will likely cough. If he or she has no cold, then chances will be
low for the patient to cough, but not zero because the cause could be something else.
Directed models
In general, a directed probabilistic graphical model factorizes a joint distribution
over the random variables X1, X2, ..., Xn as follows:
$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa(X_i))$$
pa(Xi) is the subset of parent variables of the variable Xi as defined in the graph.
The parents are easy to read on a graph: when an arrow goes from A to B, then A is
the parent of B. A node can have as many children as needed and a node can have as
many parents as needed too.
Directed models are good for representing problems in which causality has to be
modeled. They are also well suited to learning parameters from data because each
local probability distribution is easy to learn.
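To make the product of local distributions concrete, here is a small base R sketch on a fragment of the cold model (Cold with its two children C and H). Apart from P(Cold) = 0.5, which comes from the text, the numbers are hypothetical and chosen only for illustration.

```r
states <- c("true", "false")

# P(Cold): equal probabilities, as in the text.
p_cold <- c(true = 0.5, false = 0.5)

# P(C | Cold) and P(H | Cold): hypothetical values.
# Each column is one state of Cold and sums to 1.
p_c <- matrix(c(0.8, 0.2, 0.1, 0.9), nrow = 2,
              dimnames = list(C = states, Cold = states))
p_h <- matrix(c(0.6, 0.4, 0.2, 0.8), nrow = 2,
              dimnames = list(H = states, Cold = states))

# Joint P(Cold, C, H) = P(Cold) P(C | Cold) P(H | Cold)
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(Cold = states, C = states, H = states))
for (cold in states) for (cough in states) for (head in states) {
  joint[cold, cough, head] <- p_cold[[cold]] * p_c[cough, cold] * p_h[head, cold]
}

# A valid joint distribution sums to 1.
print(sum(joint))

# Marginal probability of coughing, obtained by summing out Cold and H.
p_cough <- sum(joint[, "true", ])
print(p_cough)
```

Each local table was specified on its own, yet their product is automatically a proper joint distribution, which is what makes the small factors convenient building blocks.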
Several times in this chapter, we mentioned the fact that PGMs can be built using
simple blocks and assembled to make a bigger model. In the case of directed models,
the blocks are the small probability distributions P(Xi | pa(Xi)).
Moreover, if one wants to extend the model by defining new variables and relations,
it is as simple as extending the graph. The algorithms designed for directed PGMs
work for any graph, whatever its size.
Nevertheless, not all probability distributions can be represented by a directed PGM
and sometimes it is necessary to relax certain assumptions.
Also it is important to note the graph must be acyclic. It means that you can't have an
arrow from node A to node B and from node B to node A as in the following figure:
In fact, this graph does not represent a factorization at all as defined earlier and it
would mean something like A is a cause of B while at the same time B is a cause of A. It's
paradoxical and has no equivalent mathematical formula.
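Whether a directed graph is acyclic can be checked mechanically. The following base R sketch repeatedly removes nodes that have no incoming arrow; if some nodes can never be removed, the graph contains a cycle. The function name `is_acyclic` and the matrix encoding (a 1 in row i, column j means an arrow from node i to node j) are assumptions made for this illustration, not part of any package.

```r
is_acyclic <- function(adj) {
  # adj[i, j] == 1 means there is an arrow from node i to node j.
  remaining <- rep(TRUE, nrow(adj))
  repeat {
    # In-degree of every node, counting only arrows from remaining nodes.
    indeg <- colSums(adj[remaining, , drop = FALSE])
    roots <- remaining & indeg == 0
    if (!any(roots)) break
    remaining[roots] <- FALSE
  }
  # The graph is acyclic only if every node could be removed.
  !any(remaining)
}

nodes <- c("A", "B")

# A -> B only: a valid directed model.
chain <- matrix(c(0, 0, 1, 0), nrow = 2, dimnames = list(nodes, nodes))

# A -> B and B -> A: the forbidden situation described above.
loop <- matrix(c(0, 1, 1, 0), nrow = 2, dimnames = list(nodes, nodes))

print(is_acyclic(chain))
print(is_acyclic(loop))
```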
When the assumption or relationships are not directed, there exists a second form
of probabilistic graphical model in which all the edges are undirected. It is called an
undirected probabilistic graphical model or a Markov network.
Undirected models
An undirected probabilistic graphical model factorizes a joint distribution over the
random variables X1, X2, ..., Xn as follows:
$$P(X_1, X_2, \dots, X_n) = \frac{1}{Z} \prod_{c=1}^{C} \phi_c(\chi_c)$$
The first term on the left-hand side is our now usual joint probability
distribution
In the preceding figure, we have four nodes and the factor functions φ_c will be defined
on the subsets that are maximal cliques, that is, {A,B,C} and {A,D}. So the distribution is
not very complex after all. This type of model is used a lot in applications such as
computer vision, image processing, finance, and many more applications where the
relationships between the variables follow a regular pattern.
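For the four-node example with cliques {A,B,C} and {A,D}, the normalized distribution can be computed by brute force in base R. This is a sketch under stated assumptions: the variables are taken to be binary, and the two factor functions `phi_abc` and `phi_ad` are hypothetical, chosen only to show the mechanics.

```r
vals <- c(0, 1)
grid <- expand.grid(A = vals, B = vals, C = vals, D = vals)

# Hypothetical factor functions; any non-negative functions would do.
phi_abc <- function(a, b, cc) ifelse(a == b & b == cc, 2, 1)  # favors agreement
phi_ad  <- function(a, d) ifelse(a == d, 3, 1)

# Unnormalized value of each configuration: product over the cliques.
unnorm <- phi_abc(grid$A, grid$B, grid$C) * phi_ad(grid$A, grid$D)

# The normalization constant is the sum over all configurations.
Z <- sum(unnorm)
grid$p <- unnorm / Z

print(Z)
print(sum(grid$p))
```

Unlike the directed case, the factors are not probabilities themselves, so dividing by the sum over all configurations is what turns their product into a distribution.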
The light bulb machine, though, is defined by two variables only: L and M. And the
factorization is very simple:
P(L, M) = P(M) P(L | M)
Note that the installation can take several minutes because this package depends
on many other packages, especially one we will use often called gRbase, which
provides the base functions for manipulating graphs.
When the package is installed, you can load the base package with:
library("gRbase")
First of all, we want to define a simple undirected graph with five variables A, B, C, D
and E:
graph <- ug("A:B:E + C:E:D")
class(graph)
We define a graph with a clique between A, B, and E, and another clique between C, E,
and D. This will form a butterfly graph. The syntax is very simple: in the string each
clique is separated by a + and each clique is defined by the name of each variable
separated by a colon.
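To see what this string encodes, the following base R sketch expands it into the list of edges it implies. It is only an illustration of how the notation can be read, not a description of how gRbase parses it internally; the variable names are introduced here.

```r
spec <- "A:B:E + C:E:D"

# Split on "+" to get the cliques, then on ":" to get the nodes of each clique.
parts <- trimws(strsplit(spec, "+", fixed = TRUE)[[1]])
cliques <- strsplit(parts, ":", fixed = TRUE)

# Inside a clique, every pair of nodes is joined by an edge.
edges <- do.call(rbind, lapply(cliques, function(cl) t(combn(cl, 2))))
print(edges)
```

The two cliques give three edges each, and they share only the node E, which is what produces the butterfly shape.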
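Next we need to install a graph visualization library. We will use the popular
Rgraphviz, which is distributed through Bioconductor rather than CRAN, so to
install it and plot the graph you can enter:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Rgraphviz")
library("Rgraphviz")
plot(graph)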
Next we want to define a directed graph. Let's say we have again the same
{A,B,C,D,E} variables:
dag <- dag("A + B:A + C:B + D:B + E:C:D")
dag
plot(dag)
The syntax is again very simple: a node without parents comes alone, such as A;
otherwise each term gives a node followed by its parents, separated by colons.
In this library, several syntaxes are available to define graphs, and you can also build
them node by node. Throughout the book we will use several notations as well as a
very important representation: the matrix notation. Indeed, a graph can be equivalently
represented by a square matrix where each row and each column represents a node
and the coefficient in the matrix will be 1 if there is an edge, and 0 otherwise. If the
graph is undirected, the matrix will be symmetric; otherwise, it need not be.
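As an illustration of the matrix notation, the directed graph defined above can be written by hand as a base R matrix. This sketch assumes the convention that rows are parents and columns are children; the object names are introduced here for the example.

```r
nodes <- c("A", "B", "C", "D", "E")
adj <- matrix(0, nrow = 5, ncol = 5, dimnames = list(nodes, nodes))

# Arrows of the graph "A + B:A + C:B + D:B + E:C:D" (row = parent, column = child).
adj["A", "B"] <- 1
adj["B", "C"] <- 1
adj["B", "D"] <- 1
adj["C", "E"] <- 1
adj["D", "E"] <- 1
print(adj)

# A directed graph generally gives a non-symmetric matrix.
print(all(adj == t(adj)))

# Dropping the directions means marking each edge both ways,
# which makes the matrix symmetric.
und <- (adj + t(adj) > 0) * 1
print(all(und == t(und)))
```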
Finally, with this second test we obtain the following graph:
Now we want to define a simple graph for the light bulb machine and provide
numerical probabilities. Then we will do our computations again and check that the
results are the same.
First we define the values for each node:
machine_val <- c("working","broken")
light_bulb_val <- c("good","bad")
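Then we define the numerical values as percentages for the two random variables and
build the corresponding tables with cptable (the node names machine and light_bulb
are the ones that appear in the output printed later in this section):
machine_prob <- c(99,1)
light_bulb_prob <- c(99,1,60,40)
M <- cptable(~machine, values=machine_prob, levels=machine_val)
L <- cptable(~light_bulb|machine, values=light_bulb_prob, levels=light_bulb_val)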
Here, cptable means conditional probability table: it's a term to designate the
memory representation of a probability distribution in the case of a discrete random
variable. We will come back to this notion in Chapter 2, Exact Inference.
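The layout of such a table can be reproduced in base R to see how the four percentages map to conditional probabilities. This is only an illustration of the idea, not the package's internal representation: the values are arranged with one column per state of the parent variable, and each column is normalized to sum to 1.

```r
machine_val <- c("working", "broken")
light_bulb_val <- c("good", "bad")
light_bulb_prob <- c(99, 1, 60, 40)

# One column per state of the machine, one row per state of the light bulb.
cpt <- matrix(light_bulb_prob, nrow = 2,
              dimnames = list(light_bulb = light_bulb_val, machine = machine_val))

# Turn the percentages into probabilities by normalizing each column.
cpt <- sweep(cpt, 2, colSums(cpt), "/")
print(cpt)
```

Each column is a complete distribution over the light bulb for one state of the machine, which is why the values come in pairs (99, 1) and (60, 40).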
Finally, we compile the new graphical model before using it. Again, this notion
will make more sense in Chapter 2, Exact Inference, when we look at inference
algorithms such as the Junction Tree Algorithm:
plist <- compileCPT(list(M,L))
plist
Here, you clearly recognize the probability distributions that we defined earlier in
this chapter.
If we print the variables' distribution we will find again what we had before:
plist$machine
plist$light_bulb
> plist$machine
machine
working  broken
   0.99    0.01
> plist$light_bulb
          machine
light_bulb working broken
      good    0.99    0.6
      bad     0.01    0.4
And now we ask the model for the posterior probability. The first step is to enter
evidence into the model (that is, to say that we observed a bad light bulb) as
follows:
net <- grain(plist)
net2 <- setEvidence(net, evidence=list(light_bulb="bad"))
querygrain(net2, nodes=c("machine"))
The library will compute the result by applying its inference algorithm and will
output the following result:
$machine
machine
  working    broken
0.7122302 0.2877698
And this result is exactly the same as the one we obtained with the Bayes method we
defined earlier.
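The same posterior can be recomputed by hand in base R from the probabilities printed above, as a quick cross-check of the library's answer. The variable names below are introduced only for this sketch.

```r
# Prior on the machine and probability of a bad bulb for each machine state,
# taken from the tables printed above.
p_machine <- c(working = 0.99, broken = 0.01)
p_bad_given_machine <- c(working = 0.01, broken = 0.40)

# Bayes formula: posterior is proportional to prior times likelihood.
joint_bad <- p_machine * p_bad_given_machine
posterior <- joint_bad / sum(joint_bad)
print(posterior)
```

The normalizing term sum(joint_bad) is the overall probability of observing a bad light bulb, here 0.0139.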
Therefore we are now ready to create more powerful models and explore the
different algorithms suitable for solving different problems. This is what we're going
to learn in the next chapter on exact inference in graphical models.
Summary
In this first chapter we learned the base concepts of probabilities.
We saw how and why they are used to represent uncertainty about data and
knowledge, while also introducing the Bayes formula. This is the most important
formula to compute posterior probabilities, that is, to update our beliefs and
knowledge about a fact when new data is available.
We saw what a joint probability distribution is and learned that it can quickly
become too complex and intractable to deal with. We learned the basics of
probabilistic graphical models as a generic framework to perform tractable, efficient,
and easy modeling with probabilistic models. Finally, we introduced the different
types of probabilistic graphical model and learned how to use R packages to write
our first models.
In the next chapter, we will learn the first set of algorithms to do Bayesian inference
with probabilistic graphical models, that is, to put questions and queries to our
models. We will introduce new features of the R packages and, at the same time,
we'll learn how these algorithms work and can be used in an efficient manner.