Building Probabilistic Graphical Models With Python
Here, X_i is a node in the graph G, and Par_G(X_i) denotes the parents of the node X_i in the graph G.
Reasoning patterns
In this section, we shall look at different kinds of reasoning used in a Bayes network.
We shall use the Libpgm library to create a Bayes network. Libpgm reads the
network information, such as nodes, edges, and the CPD probabilities associated
with each node, from a JSON-formatted file with a specific format. This JSON file
is read into the NodeData and GraphSkeleton objects to create a discrete Bayesian
network (which, as the name suggests, is a Bayes network where the CPDs take
discrete values). The TableCPDFactorization object wraps the discrete Bayesian
network and allows us to query the CPDs in the network. The JSON file for this
example, job_interview.txt, should be placed in the same folder as the IPython
Notebook so that it can be loaded automatically.
The following discussion uses integers 0, 1, and 2 for discrete outcomes of each
random variable, where 0 is the worst outcome. For example, Interview = 0 indicates
the worst outcome of the interview and Interview = 2 is the best outcome.
For More Information:
www.packtpub.com/building-probabilistic-graphical-models-with-python/book
Chapter 2
Causal reasoning
The first kind of reasoning we shall explore is called causal reasoning. Initially,
we observe the prior probability of an event unconditioned by any evidence (for
this example, we shall focus on the Offer random variable). We then introduce
observations of one of the parent variables. Consistent with our logical reasoning, we
note that if one of the parents (equivalent to causes) of an event is observed, then we
have stronger beliefs about the child random variable (Offer).
We start by defining a function that reads the JSON data file and creates an object
we can use to run probability queries. The following code is from the Bayes net-
Causal Reasoning.ipynb IPython Notebook:
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
def getTableCPD():
    nd = NodeData()
    skel = GraphSkeleton()
    jsonpath = "job_interview.txt"
    nd.load(jsonpath)
    skel.load(jsonpath)
    # load the Bayesian network
    bn = DiscreteBayesianNetwork(skel, nd)
    tablecpd = TableCPDFactorization(bn)
    return tablecpd
We can now use the specificquery function to run inference queries on the
network we have defined. What is the prior probability P(Offer = 1) of getting
an offer?
Note that the probability query takes two dictionary arguments: the first one being
the query and the second being the evidence set, which is specified by an empty
dictionary, as shown in the following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Offer='1'),dict())
The following is the output of the preceding code:
0.432816
It is about 43 percent. If we now introduce evidence that the candidate has poor
grades, how does it change the probability of getting an offer? We will evaluate the
value of P(Offer = 1 | Grades = 0), as shown in the following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Offer='1'),dict(Grades='0'))
Directed Graphical Models
The following is the output of the preceding code:
0.35148
As expected, it decreases the probability of getting an offer, since we
reason that students with poor grades are unlikely to get an offer. Adding
further evidence that the candidate's experience is low as well, we evaluate
P(Offer = 1 | Grades = 0, Experience = 0), as shown in the following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Offer='1'),dict(Grades='0',Experience='0'))
The following is the output of the preceding code:
0.2078
As expected, it drops even lower on the additional evidence, from 35 percent to
20 percent.
What we have seen is that the introduction of the observed parent random variable
strengthens our beliefs, which leads us to the name causal reasoning.
In the following diagram, we can see the different paths taken by causal and
evidential reasoning:
[Figure: Causal and evidential reasoning. Dark gray: query node; gray: observed
node; arrows: direction of inference. Causal reasoning runs from Degree Score and
Experience down through Job Interview to Job Offer; evidential reasoning runs from
Job Offer up through Job Interview to Experience and Degree Score.]
Evidential reasoning
Evidential reasoning is when we observe the value of a child variable and wish
to reason about how it strengthens our beliefs about its parents. We will evaluate
the prior probability of high experience, P(Experience = 1), as shown in the
following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Experience='1'),dict())
The output of the preceding code is as follows:
0.4
We now introduce evidence that the candidate's interview was good and evaluate
the value for P(Experience=1|Interview=2), as shown in the following code:
tcpd=getTableCPD()
print tcpd.specificquery(dict(Experience='1'),dict(Interview='2'))
The output of the preceding code is as follows:
0.864197530864
We see that if the candidate scores well on the interview, the probability that the
candidate was highly experienced increases, which follows the reasoning that the
candidate must have good experience or education, or both. In evidential reasoning,
we reason from effect to cause.
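Evidential reasoning is Bayes' theorem at work: the posterior belief in a cause is proportional to the likelihood of the observed effect times the prior. As a minimal sketch with made-up numbers (these are not the CPDs in job_interview.txt), consider a single cause Experience and a single effect Interview:

```python
# Evidential reasoning is Bayes' theorem: P(cause | effect) is proportional
# to P(effect | cause) * P(cause). The numbers below are hypothetical,
# not the CPDs from job_interview.txt.
p_exp = {0: 0.6, 1: 0.4}             # prior P(Experience)
p_int2_given_exp = {0: 0.2, 1: 0.7}  # P(Interview=2 | Experience)

# Marginalize out the cause to get P(Interview=2).
p_int2 = sum(p_int2_given_exp[e] * p_exp[e] for e in p_exp)

# Posterior belief in high experience after seeing a great interview.
posterior = p_int2_given_exp[1] * p_exp[1] / p_int2
print(round(posterior, 4))  # 0.7
```

A great interview lifts the belief in high experience from the prior of 0.4 to 0.7 in this toy setup, which is the same effect-to-cause direction of reasoning as the query above.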
Inter-causal reasoning
Inter-causal reasoning, as the name suggests, is a type of reasoning where multiple
causes of a single effect interact. We first determine the prior probability of having
high, relevant experience; thus, we will evaluate P(Experience = 1), as shown in the
following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Experience='1'),dict())
The following is the output of the preceding code:
0.4
By introducing evidence that the interview went extremely well, we think that
the candidate must be quite experienced. We will now evaluate the value for
P(Experience = 1 | Interview = 2), as shown in the following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Experience='1'),dict(Interview='2'))
The following is the output of the preceding code:
0.864197530864
The Bayes network confirms what we think is true (the candidate is experienced),
and the probability of high experience goes up from 0.4 to 0.86. Now, if we
introduce evidence that the candidate didn't have good grades and still managed
to get a good score in the interview, we may conclude that the candidate must be
so experienced that his grades didn't matter at all. We will evaluate the value for
P(Experience = 1 | Interview = 2, Grades = 0), as shown in the following code:
tcpd=getTableCPD()
tcpd.specificquery(dict(Experience='1'),dict(Interview='2',Grades='0'))
The output of the preceding code is as follows:
0.909090909091
This confirms our hunch: even though the probability of high experience went
up only a little, it strengthens our belief about the candidate's high experience.
This example shows the interplay between the two parents of the Job Interview
node, Experience and Degree Score, and shows us that if we know one of the
causes behind an effect, it reduces the importance of the other cause. In other
words, we have explained away the poor grades on observing the experience of
the candidate. This phenomenon is commonly called explaining away.
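Explaining away can be reproduced on a toy V-structure by brute-force enumeration of the joint distribution. The CPTs below are hypothetical (not the ones in job_interview.txt), with Experience (E) and Grades (G) as independent causes of a good Interview (I):

```python
# Explaining away on a toy V-structure E -> I <- G, computed by brute-force
# enumeration. Hypothetical CPTs, not the CPDs from job_interview.txt.
p_e = {0: 0.5, 1: 0.5}                                       # P(Experience)
p_g = {0: 0.5, 1: 0.5}                                       # P(Grades)
p_i1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.6, (1, 1): 0.9}  # P(I=1 | E, G)

def joint(e, g, i):
    p_i = p_i1[(e, g)] if i == 1 else 1.0 - p_i1[(e, g)]
    return p_e[e] * p_g[g] * p_i

def query(e_val, evidence):
    """P(E = e_val | evidence), by summing the joint over all assignments."""
    num = den = 0.0
    for e in (0, 1):
        for g in (0, 1):
            for i in (0, 1):
                world = {'E': e, 'G': g, 'I': i}
                if all(world[k] == v for k, v in evidence.items()):
                    den += joint(e, g, i)
                    if e == e_val:
                        num += joint(e, g, i)
    return num / den

# A good interview raises our belief in high experience...
print(round(query(1, {'I': 1}), 4))          # 0.6818
# ...but additionally observing good grades explains the interview away,
# lowering the belief in the other cause:
print(round(query(1, {'I': 1, 'G': 1}), 4))  # 0.6
```

Once one cause (Grades) is observed, it accounts for part of the evidence, so the posterior on the other cause (Experience) drops back toward its prior.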
The following diagram shows the path of interaction between nodes involved in
inter-causal reasoning:
[Fig x.x: Inter-causal reasoning. Dark gray: query node; gray: observed node;
arrows: direction of inference. Inference flows between Experience and Degree
Score through the observed Job Interview node, toward Job Offer.]
Bayes networks are usually drawn with the independent events on top, and
influence flows from top to bottom (as in the job interview example). It may be
useful to recall that causal reasoning flows from top to bottom, evidential reasoning
flows from bottom to top, and inter-causal reasoning flows sideways.
D-separation
Having understood that the direction of arrows indicates that one node can
influence another node in a Bayes network, let's see how exactly influence flows
in a Bayes network. We can see that the grades eventually influence the job offer,
but in the case of a very big Bayes network, it would not help to state that the leaf
node is influenced by all the nodes at the top of the Bayes network. Are there
conditions where influence does not flow? We shall see that there are simple rules
that explain the flow of influence in the following table:
Chain          No variables observed    Y has been observed
X → Y → Z      active                   blocked
X ← Y ← Z      active                   blocked
X ← Y → Z      active                   blocked
X → Y ← Z      blocked                  active
The preceding table depicts the open and closed active trails between three nodes
X, Y, and Z. In the first column, no variables are observed, whereas in the second
column, Y has been observed.
We shall first consider the case where no random variables have been observed.
Consider the chains of nodes in the first column of the preceding table. Note that
the rules in the first three rows allow influence to flow from the first to the last
node. Influence can flow along the path of the edges, even if these chains are
extended into longer sequences. It must be pointed out that the flow of influence
is not restricted by the directionality of the links that connect them.
However, there is one case we should watch out for, called the V-structure,
X → Y ← Z, probably called so because the directions of the edges point inward.
In this case, influence cannot flow from X to Z, since it is blocked by Y. In longer
chains, influence will flow unless it is obstructed by a V-structure. For example,
in the chain A → B → X ← Y ← Z, because of the V-structure at X, influence can
flow along A → B → X and along X ← Y ← Z, but not across the node X.
We can now state the concept of an active trail (of influence). When no evidence
is observed, a trail is active if it contains no V-structures. If multiple trails exist
between two variables, the variables are conditionally independent if none of the
trails is active.
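The three-node rules above can be written down as a small predicate. This is a simplified sketch: in a full network, a V-structure also opens when any descendant of the middle node is observed, not only the middle node itself.

```python
# The three-node active-trail rules from the table above, as a predicate.
# Simplification: in a full network, a V-structure also opens when any
# descendant of Y is observed, not only Y itself.
def triple_active(d1, d2, y_observed):
    """d1 is the X-Y edge direction, d2 the Y-Z direction ('->' or '<-')."""
    v_structure = (d1 == '->' and d2 == '<-')  # X -> Y <- Z
    if v_structure:
        return y_observed       # blocked until Y is observed
    return not y_observed       # chains and forks are blocked by observing Y

# With nothing observed, only the V-structure blocks influence:
assert triple_active('->', '->', False)      # X -> Y -> Z
assert triple_active('<-', '<-', False)      # X <- Y <- Z
assert triple_active('<-', '->', False)      # X <- Y -> Z
assert not triple_active('->', '<-', False)  # X -> Y <- Z

# Observing Y flips every row of the table:
assert not triple_active('->', '->', True)
assert triple_active('->', '<-', True)
```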
[Figure: Flow of influence without any variable being observed. Nodes: Experience
(E), Degree Score (S), Job Interview (Ji), Job Offer (Jo), and Postgraduate Degree
Admission (A). Influence flows through E→Ji, E→Ji→Jo, S→Ji, S→Ji→Jo, and S→A;
it doesn't flow through E→Ji←S or E→Ji←S→A.]
Let's now look at the second case where we do have observed evidence variables. It
is easier to understand if we compare the case with the previous chains, where we
observe the random variable Y.
The trails marked active in the previous table are active trails, and the others are
blocked trails. It can be observed that the introduction of evidence simply flips the
status of each trail: a previously active trail becomes blocked, and a previously
blocking V-structure opens up.
We can now state that, given some evidence, a trail will be active if the middle
node of any V-structure, or any of its descendants (for example, Y or its descendants
in X → Y ← Z), is present in the evidence set. In other words, observing Y or any
of its children will open up the blocking V-structure, making it an active trail.
Additionally, as seen in the following table, an open trail gets blocked by the
introduction of evidence, and vice versa.
[Figure: Flow of influence with the Job Interview variable being observed. Influence
doesn't flow through E→Ji→Jo or S→Ji→Jo; it flows through E→Ji←S and
E→Ji←S→A.]
The D-separation example
In this section, we shall look at using the job candidate example to understand
D-separation. In the process of performing causal reasoning, we will query for the
job offer and shall introduce the observed variables in the parents of the job offer to
verify the concepts of active trails, which we have seen in the previous section. The
following code is from the D-separation.ipynb IPython Notebook.
We first query the job offer with no other observed variables, as shown in the
following code:
getTableCPD().specificquery(dict(Offer='1'),dict())
The output of the preceding code is as follows:
0.432816
We know from the active trail rules that observing Experience should change the
probability of the offer, as shown in the following code:
getTableCPD().specificquery(dict(Offer='1'),dict(Experience='1'))
The output of the preceding code is as follows:
0.6438
As per the output, it changes. Now, let's add the Interview observed variable, as
shown in the following code:
getTableCPD().specificquery(dict(Offer='1'),dict(Interview='1'))
The output of the preceding code is as follows:
0.6
We get a slightly different probability for Offer. We know from the D-separation
rules that observing Interview should block the active trail from Experience to
Offer, as shown in the following code:
getTableCPD().specificquery(dict(Offer='1'),dict(Interview='1',Experience='1'))
The output of the preceding code is as follows:
0.6
Observe that the probability of Offer does not change from 0.6, despite the
addition of the observed Experience variable. We can try other values of the
Interview node's parent variables, as shown in the following code:
query=dict(Offer='1')
results=[getTableCPD().specificquery(query,e)
         for e in [dict(Interview='1',Experience='0'),
                   dict(Interview='1',Experience='1'),
                   dict(Interview='1',Grades='1'),
                   dict(Interview='1',Grades='0')]]
print results
The output of the preceding code is as follows:
[0.6, 0.6, 0.6, 0.6]
The preceding code shows that once the Interview variable is observed, the active
trail between Experience and Offer is blocked. Therefore, Experience and Offer
are conditionally independent given Interview, which means that observing the
values of the interview's parents, Experience and Grades, does not contribute to
changing the probability of the offer.
Blocking and unblocking a V-structure
Let's look at the only V-structure in the network, Experience → Interview ← Grades,
and see the effect that observed evidence has on the active trail.
getTableCPD().specificquery(dict(Grades='1'),dict(Experience='0'))
getTableCPD().specificquery(dict(Grades='1'),dict())
The result of the preceding code is as follows:
0.3
0.3
According to the rules of D-separation, the interview node is a V-structure between
Experience and Grades, and it blocks the active trails between them. The preceding
code shows that the introduction of the observed variable Experience has no effect
on the probability of the grades.
getTableCPD().specificquery(dict(Grades='1'),dict(Interview='1'))
The following is the output of the preceding code:
0.413016270338
The following code should activate the trail between Experience and Grades:
getTableCPD().specificquery(dict(Grades='1'),dict(Interview='1',Experience='0'))
getTableCPD().specificquery(dict(Grades='1'),dict(Interview='1',Experience='1'))
The output of the preceding code is as follows:
0.588235294118
0.176470588235
The preceding code now shows the existence of an active trail between Experience
and Grades, where changing the observed Experience value changes the
probability of Grades.
Factorization and I-maps
So far, we have understood that a graph G is a representation of a distribution P.
We can formally define the relationship between a graph G and a distribution P
in the following way.
If G is a graph over the random variables X_1, X_2, ..., X_n, we can state that a
distribution P factorizes over G if P(X_1, X_2, ..., X_n) = Π_i P(X_i | Par_G(X_i)).
Here, Par_G(X_i) are the parent nodes of X_i. In other words, the joint distribution
can be defined as the product of the conditional probability of each random
variable given its parents.
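The factorization can be made concrete on a tiny graph. Here is a sketch for the two-node graph A → B with hypothetical CPDs; multiplying each node's CPD given its parents yields a valid joint distribution:

```python
# A factorization sketch for the tiny graph A -> B, with hypothetical CPDs.
# The joint is the product of each node's CPD given its parents, and a
# valid joint must sum to 1 over all assignments.
from itertools import product

p_a = {0: 0.7, 1: 0.3}               # P(A); A has no parents
p_b_given_a = {0: {0: 0.9, 1: 0.1},  # P(B | A=0)
               1: {0: 0.4, 1: 0.6}}  # P(B | A=1)

def joint(a, b):
    return p_a[a] * p_b_given_a[a][b]

total = sum(joint(a, b) for a, b in product((0, 1), repeat=2))
print(round(total, 9))  # 1.0
```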
The interplay between factorization and independence is a useful phenomenon
that allows us to state that if the distribution factorizes over a graph, and two
nodes X and Y are D-separated given Z in that graph, then the distribution
satisfies those independencies (X ⊥ Y | Z).
Alternately, we can state that the graph G is an Independency map (I-map)
for a distribution P if P factorizes over G, because of which we can read the
independencies from the graph, regardless of the parameters. An I-map may not
encode all the independencies in the distribution. However, if the graph encodes
all the independencies in the distribution, it is called a Perfect map (P-map). The
graph of the job interview is an example of an I-map.
The Naive Bayes model
We can sum this up by saying that a graph can be seen from the following
two viewpoints:
Factorization: This is where a graph allows a distribution to be represented
I-map: This is where the independencies encoded by the graph hold in
the distribution
The Naive Bayes model is one that makes simplistic independence assumptions.
We use the Naive Bayes model to perform binary classification. Here, we are given
a set of instances, where each instance consists of a set of features X_1, X_2, ..., X_n
and a class y. The task in classification is to predict the correct class y when the
features X_1, X_2, ..., X_n are given.
For example, we are given a set of newsgroup postings that are drawn from two
newsgroups. Given a particular posting, we would like to predict which newsgroup
that particular posting was drawn from. Each posting is an instance that consists
of a bag of words (we make the assumption that the order of words doesn't matter;
just the presence or absence of the words is taken into account), and therefore, the
features X_1, X_2, ..., X_n indicate the presence or absence of words.
Here, we shall look at the Naive Bayes model as a classier.
The difference between Naive Bayes and the job candidate example is that Naive
Bayes is so called because it makes naive conditional independence assumptions,
and the model factorizes as the product of a prior and individual conditional
probabilities, as shown in the following formula:
P(C, X_1, X_2, ..., X_n) = P(C) Π_{i=1}^{n} P(X_i | C)
Although the term on the left is the joint distribution, which needs a huge number
of independent parameters (2^(n+1) − 1 if each feature is a binary value), the Naive
Bayes representation on the right needs only 2n + 1 parameters, thus reducing the
number of parameters from exponential (in a typical joint distribution) to linear
(in Naive Bayes).
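The parameter counts are easy to verify. For a binary class and n binary features, the full joint table over n + 1 binary variables has 2^(n+1) entries, minus one because probabilities sum to 1, while Naive Bayes needs one parameter for P(C = 1) plus one P(X_i = 1 | C = c) per feature and class:

```python
# Parameter counting for n binary features and a binary class C.
def full_joint_params(n):
    return 2 ** (n + 1) - 1    # full joint over n+1 binary variables

def naive_bayes_params(n):
    return 2 * n + 1           # P(C) plus P(Xi=1 | C=c) for c in {0, 1}

print(full_joint_params(10))   # 2047
print(naive_bayes_params(10))  # 21
```

Even at just ten features, the full joint needs about a hundred times more parameters than Naive Bayes, and the gap grows exponentially with n.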
In the context of the newsgroup example, we have a set of words such as {atheist,
medicine, religion, anatomy} drawn from the alt.atheism and sci.med newsgroups.
In this model, you could say that the probability of each word appearing depends
only on the class (that is, the newsgroup) and is independent of the other words in
the posting. Clearly, this is an overly simplified assumption, but it has been shown
to have fairly good performance in domains where the number of features is large
and the number of instances is small, such as text classification, which we shall see
with a Python program.
[Figure: The Naive Bayes model. The Class node is the parent of each feature node:
Word 1, Word 2, ..., Word n-1, Word n.]
Once we see strong correlations among features, a hierarchical Bayes network can
be thought of as an evolved version of the Naive Bayes model.
The Naive Bayes example
In the Naive Bayes example, we will use the Naive Bayes implementation from
Scikit-learn, a machine learning library, to classify newsgroup postings. We have
chosen two newsgroups from the datasets provided by Scikit-learn (alt.atheism
and sci.med), and we shall use Naive Bayes to predict which newsgroup a particular
posting is from. The following code is from the Naive Bayes.ipynb file:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics,cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
cats = ['alt.atheism', 'sci.med']
newsgroups= fetch_20newsgroups(subset='all',
    remove=('headers', 'footers', 'quotes'), categories=cats)
We first load the newsgroup data using the utility function provided by
Scikit-learn (this downloads the dataset from the Internet and may take some time).
The newsgroups object is a map; the newsgroup postings are saved against the data
key, and the target variables are in newsgroups.target, as shown in the following
code:
newsgroups.target
The output of the preceding code is as follows:
array([1, 0, 0, ..., 0, 0, 0], dtype=int64)
Since the features are words, we transform them to another representation using
Term Frequency-Inverse Document Frequency (Tf-idf). The purpose of Tf-idf is to
de-emphasize words that occur in all postings (such as "the", "by", and "for") and
instead emphasize words that are unique to a particular class (such as religion and
creationism, which are from the alt.atheism newsgroup). We can do this by creating
a TfidfVectorizer object and then transforming all the newsgroup data to a vector
representation, as shown in the following code:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups.data)
The vectors object now contains features that we can use as the input data to the
Naive Bayes classifier. A shape query reveals that it contains 1789 instances, and
each instance contains about 24 thousand features, as shown in the following code.
However, many of these features can be 0, indicating that the words do not appear
in that particular posting:
vectors.shape
The output of the preceding code is as follows:
(1789, 24202)
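To make the tf-idf weighting concrete, here is a from-scratch sketch on a toy corpus using the simple textbook formula tf × log(N / df). Note that sklearn's TfidfVectorizer uses a smoothed idf and l2-normalizes each row, so its numbers will differ from this variant:

```python
# A from-scratch sketch of tf-idf on a toy corpus, using the simple
# textbook formula tf * log(N / df). sklearn's TfidfVectorizer smooths
# the idf and l2-normalizes rows, so its values differ.
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / float(len(doc))       # term frequency in doc
    df = sum(1 for d in docs if term in d)       # document frequency
    return tf * math.log(len(docs) / float(df))  # tf-idf weight

# "the" occurs in every document, so its idf (and weight) is zero,
# while the rarer "dog" gets a positive weight.
print(round(tf_idf("the", docs[0], docs), 4))  # 0.0
print(round(tf_idf("dog", docs[1], docs), 4))  # 0.3662
```

This is exactly the de-emphasis of ubiquitous words described above: the more documents a word appears in, the smaller its idf factor.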
Scikit-learn provides a few versions of the Naive Bayes classifier; the one we will
use is called MultinomialNB. Since using a classifier typically involves splitting the
dataset into train, test, and validation sets, training on the train set, and testing the
efficacy on the validation set, we can use the utility provided by Scikit-learn to do
the same for us. The cross_validation.cross_val_score function automatically
splits the data into multiple sets and returns the f1 score (a metric that measures a
classifier's accuracy), as shown in the following code:
clf = MultinomialNB(alpha=.01)
print "CrossValidation Score: ", np.mean(cross_validation.cross_val_score(
    clf, vectors, newsgroups.target, scoring='f1'))
The output of the preceding code is as follows:
CrossValidation Score: 0.954618416381
We can see that despite the assumption that all features are conditionally
independent given the class, the classifier maintains a decent f1 score of 95 percent.
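The f1 score reported above is the harmonic mean of precision and recall. A quick sketch of the computation, using hypothetical confusion counts (not the newsgroup results):

```python
# The f1 score is the harmonic mean of precision and recall, computed here
# from hypothetical confusion counts, not the newsgroup results above.
tp, fp, fn = 90, 10, 5             # true positives, false positives, false negatives
precision = tp / float(tp + fp)    # 0.9: how many predicted positives are right
recall = tp / float(tp + fn)       # ~0.9474: how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9231
```

Because it is a harmonic mean, f1 punishes a classifier that trades one of precision or recall away for the other, which is why it is preferred over plain accuracy for imbalanced text datasets.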
Summary
In this chapter, we learned how conditional independence properties allow a joint
distribution to be represented as a Bayes network. We then took a tour of the types
of reasoning, understood how influence can flow through a Bayes network, and
explored the same concepts using Libpgm. Finally, we used a simple Bayes network
(Naive Bayes) to solve a real-world problem of text classification.
In the next chapter, we shall learn about undirected graphical models, or
Markov networks.