
Machine Reasoning using

Bayesian Network

Ching-Yung Lin, Ph.D.


Columbia University and Graphen, Inc.
February 28, 2020
Outline
• Introduction
• Probability Review
• Bayesian Network
• Inference Methods
• Network Structure Learning
Evolution of Intelligence

(Diagram: direction of evolution across sensors, perception, recognition, memory, representation, reasoning, and strategy)

3
Introduction
Suppose the doctor is trying to
determine if a patient has inhalational
anthrax. She observes the following
symptoms:
• The patient has a cough
• The patient has difficulty in breathing
• The patient has a fever

4
Introduction
Dealing with uncertainty:
You would like to determine how likely the
patient is infected with inhalational
anthrax given that the patient has a cough,
a fever, and difficulty breathing

5
Introduction
New evidence: X-ray image shows that the
patient has a wide mediastinum.
Belief update: your belief that the patient is
infected with inhalational anthrax is now
much higher.

6
Introduction

• In the previous slides, what you observed affected your


belief that the patient is infected with anthrax
• This is called reasoning with uncertainty
• Wouldn’t it be nice if we had some tools for reasoning
with uncertainty? In fact, we do…

7
Bayesian Network

(Network diagram: Has Anthrax → Cough, Fever, Difficulty Breathing, Wide Mediastinum)

• Need a representation and reasoning system that is based on conditional independence
• Compact yet expressive representation
• Efficient reasoning procedures
• Bayesian Network is such a representation
• Named after Thomas Bayes (ca. 1702 –1761)
• Term coined in 1985 by Judea Pearl (1936
– ), 2011 winner of the ACM Turing Award
• Many applications, e.g., spam filtering, speech
recognition, robotics, diagnostic systems and
even syndromic surveillance 8
Outline
• Introduction
• Probability Review
• Bayesian Network
• Inference methods
• Network Structure Learning
Probabilities
We will write P(A = true) to mean the probability that A = true.
One definition of probability: the relative frequency with which an
outcome would be obtained if the process were repeated a large number
of times under similar conditions

(Venn diagram: the regions P(A = true) and P(A = false); the sum of the red and blue areas is 1.)

11
Conditional Probability
• P(A = true | B = true) : Out of all the outcomes in which B is
true, how many also have A equal to true
• Read as: “Probability of A given B”

F = “Have a fever”
C = “Coming down with a cold”

P(F = true) = 1/10
P(C = true) = 1/15
P(F = true | C = true) = 1/2

“Fevers are rare and colds are rarer, but if
you’re coming down with a cold there’s a
50-50 chance you’ll have a fever.”

12
The Joint Probability Distribution
• P(A = true, B = true) :“the probability of A = true and B = true”
• Notice that:
P(F = true | C = true)
= Area of “C and F” region / Area of “C” region
= P(C = true, F = true) / P(C = true)

13
The Joint Probability Distribution
• Joint probabilities can be between any number of variables,
e.g. P(A = true, B = true, C = true)
• For each combination of variables, we need to say how probable that combination is

A     B     C     P(A,B,C)
false false false 0.1
false false true  0.2
false true  false 0.05
false true  true  0.05
true  false false 0.3
true  false true  0.1
true  true  false 0.05
true  true  true  0.15

The probabilities sum to 1.

14
The Joint Probability Distribution
• Once you have the joint probability distribution, you can calculate any probability involving A, B, and C
• Note: may need to use marginalization and Bayes’ rule

A     B     C     P(A,B,C)
false false false 0.1
false false true  0.2
false true  false 0.05
false true  true  0.05
true  false false 0.3
true  false true  0.1
true  true  false 0.05
true  true  true  0.15

Examples of things you can compute:
• P(A = true) = sum of P(A,B,C) over the rows with A = true
• P(A = true, B = true | C = true) = P(A = true, B = true, C = true) / P(C = true)
15
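
A minimal Python sketch of these two computations over the joint table above; plain dictionaries, no particular library assumed:

```python
# Joint distribution P(A, B, C) from the table above; keys are (A, B, C).
joint = {
    (False, False, False): 0.10, (False, False, True): 0.20,
    (False, True,  False): 0.05, (False, True,  True):  0.05,
    (True,  False, False): 0.30, (True,  False, True):  0.10,
    (True,  True,  False): 0.05, (True,  True,  True):  0.15,
}

# Marginalization: P(A = true) = sum of the rows with A = true.
p_a = sum(p for (a, b, c), p in joint.items() if a)

# Conditioning: P(A = true, B = true | C = true)
#             = P(A = true, B = true, C = true) / P(C = true)
p_c = sum(p for (a, b, c), p in joint.items() if c)
print(p_a, joint[(True, True, True)] / p_c)   # 0.6 and 0.15 / 0.5 = 0.3
```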
Independence
Variables A and B are independent if any of the
following hold:
• P(A,B) = P(A) P(B)
• P(A | B) = P(A)
• P(B | A) = P(B)

16
Independence
How is independence useful?
• Suppose you have n coin flips and you want to
calculate the joint distribution P(C1, …, Cn)
• If the coin flips are not independent, you need 2^n
values in the table
• If the coin flips are independent, then
P(C1, ..., Cn) = ∏_{i=1}^{n} P(Ci)

17
Conditional Independence
• C and A are conditionally independent given B if the following
holds:
P(C | A, B) = P(C | B)
• Example: “Cancer is a common cause of the two symptoms: a
positive X-ray and dyspnoea”
(Network: B = Lung cancer is the parent of both A = positive X-ray and C = dyspnoea)

• Joint distribution: P(A,B,C) = P(C|A,B) P(A,B) = P(C|B) P(A,B) = P(C|B) P(A|B) P(B)
18
Outline
• Introduction
• Probability Review
• Bayesian Network
• Inference methods
A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph (here: A → B, B → C, B → D)

2. A set of tables for each node in the graph: conditional probability table

A     P(A)
false 0.4
true  0.6

A     B     P(B|A)
false false 0.03
false true  0.97
true  false 0.6
true  true  0.4

B     D     P(D|B)
false false 0.01
false true  0.99
true  false 0.04
true  true  0.96

B     C     P(C|B)
false false 0.3
false true  0.7
true  false 0.8
true  true  0.2
A Directed Acyclic Graph
Each node in the graph is a random
variable

A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g. A is a parent of B.

(Graph: A → B, B → C, B → D)

An arrow from node X to node Y means X has a direct influence on Y.

21
A Set of Tables for Each Node
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node; a root node, which has no parents, simply carries a prior distribution P(Xi).
(Graph: A → B, B → C, B → D)

A     P(A)
false 0.4
true  0.6

A     B     P(B|A)
false false 0.03
false true  0.97
true  false 0.6
true  true  0.4

B     C     P(C|B)
false false 0.3
false true  0.7
true  false 0.8
true  true  0.2

B     D     P(D|B)
false false 0.01
false true  0.99
true  false 0.04
true  true  0.96
Bayesian Networks
Two important properties:
1. Encodes the conditional independence relationships between the
variables in the graph structure
2. Is a compact representation of the joint probability distribution
over the variables
(Graph: A → B, B → C, B → D)

23
Conditional Independence
The probability distribution for each node depends only on its parents
C1 and C2 are conditionally independent given X

(Graph: parents P1, P2 → X → children C1, C2)

24
The Joint Probability Distribution
Due to the conditional independence property, the
joint probability distribution over all the variables X1,
…, Xn in the Bayesian net can be computed using the
formula:

P(X1 = x1, ..., Xn = xn) = ∏_{i=1}^{n} P(Xi = xi | Parents(Xi))

25
Using a Bayesian Network Example
P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)
= (0.6) * (0.4) * (0.2) * (0.96)
= 0.04608

(Graph: A → B, B → C, B → D)

26
Using a Bayesian Network Example
P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)   ← the factorization comes from the graph structure
= (0.6) * (0.4) * (0.2) * (0.96)   ← the values come from the conditional probability tables

(Graph: A → B, B → C, B → D)

A     P(A)
false 0.4
true  0.6

A     B     P(B|A)
false false 0.03
false true  0.97
true  false 0.6
true  true  0.4

B     D     P(D|B)
false false 0.01
false true  0.99
true  false 0.04
true  true  0.96

B     C     P(C|B)
false false 0.3
false true  0.7
true  false 0.8
true  true  0.2

27
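
As a small sanity check, the same factorization can be evaluated directly in Python; nothing beyond the CPTs above is assumed:

```python
# CPTs for the network A -> B, B -> C, B -> D (values from the tables above).
p_A = {True: 0.6, False: 0.4}
p_B_given_A = {(True, True): 0.4, (True, False): 0.6,
               (False, True): 0.97, (False, False): 0.03}   # key: (A, B)
p_C_given_B = {(True, True): 0.2, (True, False): 0.8,
               (False, True): 0.7, (False, False): 0.3}     # key: (B, C)
p_D_given_B = {(True, True): 0.96, (True, False): 0.04,
               (False, True): 0.99, (False, False): 0.01}   # key: (B, D)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) via the chain rule over the DAG."""
    return p_A[a] * p_B_given_A[(a, b)] * p_C_given_B[(b, c)] * p_D_given_B[(b, d)]

print(joint(True, True, True, True))   # 0.6 * 0.4 * 0.2 * 0.96 = 0.04608
```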
Another example
• I'm at work, neighbor Jeff calls to say my alarm is ringing, but neighbor Mary doesn't call.
Sometimes it's set off by minor earthquakes. Is there a burglar?

• Variables: Burglary, Earthquake, Alarm, JeffCalls, MaryCalls

• Network topology reflects "causal" knowledge:

• A burglar can set the alarm off


• An earthquake can set the alarm off
• The alarm can cause Mary to call
• The alarm can cause Jeff to call
Another example: Earthquake or Burglar
B:Burglary E: Earthquake

A:Alarm

M:Mary Calls J:Jeff Calls

29
Bayesian Network for Alarm Domain
(Graph: Burglary → Alarm ← Earthquake; Alarm → Mary Calls; Alarm → Jeff Calls)

P(B = true) = .001        P(E = true) = .002

B  E  P(A = true | B, E)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

A  P(M = true | A)
T  .70
F  .01

A  P(J = true | A)
T  .90
F  .05
P(J = true, M = true, A = true, B = false, E = false)
= P(J = true | A = true) P(M = true | A = true) P(A = true | B = false, E = false) P(B = false) P(E = false)
= 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.00063

30
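
For readers who want to reproduce this numerically, here is a hedged sketch of the alarm network in pgmpy (one of the Python libraries listed later). State 0 stands for false and state 1 for true; older pgmpy releases call the model class BayesianModel rather than BayesianNetwork:

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([('B', 'A'), ('E', 'A'), ('A', 'M'), ('A', 'J')])

cpd_b = TabularCPD('B', 2, [[0.999], [0.001]])            # P(B): row 0 = false, row 1 = true
cpd_e = TabularCPD('E', 2, [[0.998], [0.002]])            # P(E)
cpd_a = TabularCPD('A', 2,                                # P(A | B, E); columns: (B,E) = FF, FT, TF, TT
                   [[0.999, 0.71, 0.06, 0.05],
                    [0.001, 0.29, 0.94, 0.95]],
                   evidence=['B', 'E'], evidence_card=[2, 2])
cpd_m = TabularCPD('M', 2, [[0.99, 0.30], [0.01, 0.70]],  # P(M | A); columns: A = F, T
                   evidence=['A'], evidence_card=[2])
cpd_j = TabularCPD('J', 2, [[0.95, 0.10], [0.05, 0.90]],  # P(J | A)
                   evidence=['A'], evidence_card=[2])
model.add_cpds(cpd_b, cpd_e, cpd_a, cpd_m, cpd_j)

# "Jeff calls but Mary doesn't" -- how likely is a burglary?
infer = VariableElimination(model)
print(infer.query(['B'], evidence={'J': 1, 'M': 0}))
```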
Outline
• Introduction
• Probability Review
• Bayesian Network
• Inference methods
• Network Structure Learning
Inference
• How can one infer the (probabilities of) values of one or more
network variables, given observed values of others?
• P( X | E )
E = The evidence variable(s)

X = The query variable(s)

• Bayes net contains all information needed for this inference


• If only one variable has an unknown value, it is easy to infer
• In the general case, the problem is NP-hard

32
Inference: example
(Network: Has Anthrax → Has Cough, Has Fever, Has Difficulty Breathing, Has Wide Mediastinum)

• An example of a query would be:


P(Anthrax = true | Fever = true, Cough = true)
• Note: Even though HasDifficultyBreathing and HasWideMediastinum
are in the Bayesian network, they are not given values in the query
• They are treated as unobserved variables

33
Inference in Bayesian Network
• Exact inference:
Variable Elimination
Junction Tree

• Approximate inference:
Markov Chain Monte Carlo
Variational Methods
From Bayesian Network to Junction Tree – Exact Inference

• In the Bayesian network: node → random variable, edge → precedence relationship, and each node carries a conditional probability table (CPT)
• Conditional dependence among random variables allows information to be propagated from one node to another ⇨ the foundation of probabilistic inference
• Given evidence (observations) E, output the posterior probabilities of the query, P(Q|E)
• This is not straightforward: exact inference is NP-hard
• Bayes’ theorem cannot be applied directly to non-singly connected networks, as it would yield erroneous results
• Therefore, junction trees are used to implement exact inference
35
Conversion of Bayesian Network into Junction Tree

(Pipeline: moralization → triangulation → clique identification → junction tree construction)

– Parallel Moralization connects all parents of each node


– Parallel Triangulation chordalizes cycles with more than 3 edges
– Clique identification finds cliques using node elimination
• Node elimination is a one-step look-ahead algorithm, which brings challenges in processing large-scale graphs
– Parallel Junction tree construction builds a hypergraph of the Bayesian network based on the running intersection property

36
Constructing Junction Trees
1. Moralization: construct an undirected graph from the DAG
2. Triangulation: Selectively add arcs to the moral graph
3. Build a junction graph by identifying the cliques and separators
4. Build the junction tree by finding an appropriate spanning tree

37
Step 1: Moralization: marry the parents

(Figure: the DAG G = (V, E) over nodes a through h, and its moral graph GM)

1. For all w ∈ V:
• For all u,v∈parents(w) add an edge e=u-v.
2. Undirect all edges.
38
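
A small sketch of this step with networkx; the edge list below is an assumption about the example DAG, reconstructed from the clique slide later in this section:

```python
import itertools
import networkx as nx

# The example DAG (assumed here to match the Huang & Darwiche example over a..h).
G = nx.DiGraph([('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'e'),
                ('d', 'f'), ('e', 'f'), ('c', 'g'), ('e', 'h'), ('g', 'h')])

# Moralization: marry the parents of every node, then drop edge directions.
GM = G.to_undirected()
for w in G.nodes:
    for u, v in itertools.combinations(G.predecessors(w), 2):
        GM.add_edge(u, v)

# Recent networkx versions also ship this as a one-liner:
# GM = nx.moral_graph(G)
print(sorted(GM.edges))
```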
Step 2: Triangulation

(Figure: the moral graph GM and the triangulated graph GT over nodes a through h)

Add edges to GM such that there is no cycle of length ≥ 4 that does not contain a chord.

(Figure: a chordless 4-cycle is not allowed (NO); adding a chord makes it allowed (YES))

39
Step 3: Build the junction graph
• A junction graph for an undirected graph G is an undirected, labeled
graph.
• Clique: a subgraph that is complete and maximal.
• The nodes are the cliques in G.
• If two cliques intersect, they are joined in the junction graph by an
edge labeled with their intersection. (separators)

40
(Figure: the Bayesian network G = (V, E), its moral graph GM, and its triangulated graph GT over nodes a through h)

(Figure: junction graph GJ (not complete). The nodes are the cliques abd, ade, def, ace, ceg, egh; the edges are labeled with the separators, e.g. ceg ∩ egh = eg, and other separators include ad, ae, ce, de, e, eg.)

41
Step 4: Junction Tree
• A junction tree is a sub-graph of the junction graph
that
• Is a tree
• Contains all the cliques (spanning tree)
• Satisfies the running intersection property:
for each pair of nodes U, V, all nodes on the path
between U and V contain U ∩ V

42
Step 4: Junction Tree (cont.)
• Theorem: An undirected graph is triangulated if and only if its junction
graph has a junction tree
• Definition: The weight of a link in a junction graph is the number of
variables in its label. The weight of a junction tree is the sum of the
weights of its labels.
• Theorem: A sub-tree of the junction graph of a triangulated graph is a
junction tree if and only if it is a spanning tree of maximal weight

43
There are several methods to find a maximum-weight spanning tree (MST).
Kruskal’s algorithm: successively choose a link of maximal weight unless it creates a cycle.

(Figure: the junction graph GJ (not complete) on the left and the resulting junction tree GJT on the right, with the cliques abd, ade, def, ace, ceg, egh joined through separators such as ad, ae, ce, de, eg.)

44
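
Putting steps 2 through 4 together, here is a rough sketch with networkx, continuing from the moral graph GM built in the moralization sketch above; the exact functions used are version-dependent:

```python
from itertools import combinations
import networkx as nx

# Step 2: triangulate the moral graph (GM from the earlier sketch).
GT, elimination_order = nx.complete_to_chordal_graph(GM)

# Step 3: junction graph -- nodes are the maximal cliques of the chordal graph,
# and intersecting cliques are joined by an edge weighted by the size of their
# intersection (the separator).
cliques = [frozenset(c) for c in nx.chordal_graph_cliques(GT)]
GJ = nx.Graph()
GJ.add_nodes_from(cliques)
for X, Y in combinations(cliques, 2):
    sep = X & Y
    if sep:
        GJ.add_edge(X, Y, weight=len(sep), separator=sep)

# Step 4: a junction tree is a maximum-weight spanning tree of the junction graph.
# (Recent networkx versions also offer nx.junction_tree(G) directly from the DAG.)
GJT = nx.maximum_spanning_tree(GJ, weight='weight')
for X, Y, data in GJT.edges(data=True):
    print(set(X), '--', set(data['separator']), '--', set(Y))
```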
Inference using junction tree
• Potential φ_X: a function that maps each instantiation x of a set of variables X to a nonnegative real number
• Marginalization: suppose X ⊆ Y; then φ_X = Σ_{Y\X} φ_Y
• Constraints on potentials
  1) Consistency property: for each clique X and neighboring separator S, it holds that φ_S = Σ_{X\S} φ_X
  2) The potentials encode the joint distribution P(U) of the network according to
     P(U) = (∏ over cliques X of φ_X) / (∏ over separators S of φ_S)
• Property: for each clique/separator X, it holds that φ_X = P(X)
  That means, for any variable V ∈ X, we can compute its marginal by P(V) = Σ_{X\{V}} φ_X
Inference without evidence

from Huang&Darwiche, 1996


Initialization

from Huang&Darwiche, 1996


Example for initialization

from Huang&Darwiche, 1996


Global propagation
(Figure: adjacent clusters X and Y with separator R)
• Single message pass
Consider two adjacent clusters X and Y with separator R.
A message pass from X to Y occurs in two steps:

from Huang&Darwiche, 1996
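
The two steps (projection then absorption, following Huang & Darwiche) can be sketched as follows. Representing a potential as a plain dictionary from variable instantiations to numbers is an illustrative simplification, not the paper's data structure:

```python
def marginalize(phi, keep):
    """Sum out of potential phi every variable not in `keep`.
    A potential is modeled as a dict mapping frozensets of (variable, value)
    pairs to nonnegative numbers."""
    out = {}
    for assignment, p in phi.items():
        key = frozenset((v, x) for v, x in assignment if v in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def pass_message(phi_X, phi_R, phi_Y, sep_vars):
    """Single message pass from clique X to clique Y through separator R."""
    # Step 1 (projection): the new separator potential is the marginal of the
    # sending clique's potential onto the separator variables.
    phi_R_new = marginalize(phi_X, sep_vars)
    # Step 2 (absorption): multiply the receiving clique's potential by the
    # ratio of new to old separator potential (0/0 is treated as 0).
    phi_Y_new = {}
    for assignment, p in phi_Y.items():
        key = frozenset((v, x) for v, x in assignment if v in sep_vars)
        ratio = 0.0 if phi_R_new.get(key, 0.0) == 0 else phi_R_new[key] / phi_R[key]
        phi_Y_new[assignment] = p * ratio
    return phi_R_new, phi_Y_new
```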


Global propagation

from Huang&Darwiche, 1996


Marginalization
• Once we have a consistent junction tree, we can compute P(V) for
each variable of interest V by computing the marginals.

from Huang&Darwiche, 1996


Inference with evidence

from Huang&Darwiche, 1996


Observations and Likelihoods
• An observation is a statement of the form V = v
• Observations are the simplest forms of evidence.
• Collections of observations are denoted by E.
• Define a likelihood Λ_V for each variable V to encode observations: Λ_V(v) = 1 for every value v if V is not observed; if V is observed, Λ_V(v) = 1 for the observed value and 0 otherwise.

from Huang&Darwiche, 1996


Example of likelihood encoding
• Suppose we have observations C =on, E = off

from Huang&Darwiche, 1996


Initialization with observations
Observation entry

from Huang&Darwiche, 1996


Normalization

from Huang&Darwiche, 1996


Approximate inference
• Exact inference is feasible in small to medium-sized networks
• Takes a very long time for large networks
• Turn to approximate inference techniques which are much faster and
give pretty good results
Sampling
• Input: Bayesian network with set of nodes X
• Sample = a tuple with assigned values
s=(X1=x1,X2=x2,… ,Xk=xk)
• Tuple may include all variables (except evidence) or a
subset
• Sampling schemas dictate how to generate samples (tuples)
• Ideally, samples are distributed according to P(X|E)
Sampling algorithms
• Gibbs Sampling (MCMC)
• Importance Sampling
• Sequential Monte-Carlo (Particle Filtering) in Dynamic
Bayesian Networks
• etc.
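
As a flavor of the sampling approach, here is a minimal forward (prior) sampling sketch for the earlier A → B → {C, D} network, using the CPT dictionaries defined in the joint-probability sketch above. Rejection sampling is used for the evidence, which is the simplest (not the most efficient) of these schemes:

```python
import random

def sample_once():
    """Draw one sample by forward sampling: sample each node given its
    already-sampled parents (network A -> B, B -> C, B -> D)."""
    a = random.random() < p_A[True]
    b = random.random() < p_B_given_A[(a, True)]
    c = random.random() < p_C_given_B[(b, True)]
    d = random.random() < p_D_given_B[(b, True)]
    return a, b, c, d

# Estimate P(A = true | D = true): keep only samples consistent with the evidence.
samples = [sample_once() for _ in range(100_000)]
kept = [s for s in samples if s[3]]          # evidence D = true
print(sum(s[0] for s in kept) / len(kept))
```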
A list of Python libraries
Outline
• Introduction
• Probability Review
• Bayesian Network
• Inference methods
• Network Structure Learning
Original Graph – Asia Example
Prediction
• Set one (or more) columns to
None and let Pomegranate predict
the value based on the other
observations
• Need to already have Bayesian
network with conditional
probabilities (through .bif file or
generated network)
• Right example: original Asia
network, 64% accuracy
• Get random samples from Asia
network, then re-feed the samples to
generate graph estimate
• Run prediction again on “smoke”
• 49% accuracy
• Takeaway: networks and
conditional probabilities generated
from samples may differ completely
from the original network
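
A rough sketch of this workflow with the pre-1.0 pomegranate API (the 1.x releases changed the interface substantially); the array contents below are placeholders, not the actual Asia data:

```python
import numpy as np
from pomegranate import BayesianNetwork

# Toy binary data standing in for samples drawn from the Asia network.
X = np.random.randint(2, size=(1000, 8))

# Learn a structure and CPTs from the samples (greedy structure search).
model = BayesianNetwork.from_samples(X, algorithm='greedy')

# Predict a hidden column: set it to None and let the network fill it in.
row = list(X[0])
row[1] = None                     # e.g. hide the "smoke" column
print(model.predict([row]))
```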
Generating Bayesian Network = NP-Hard
(Figure: structure learning with PGMPY and with Pomegranate)
Generate Models with Pomegranate (greedy)
• Results can depend heavily on the samples
Generate Acyclic Permutations
• For each node, randomly assign it to level 1, 2, …, K
• Randomly pick two nodes
• One node from level k
• Other node from level k+1
• Add a directed edge from first to second node
• Edges cannot skip levels or connect nodes in the same level
• This leveling system prevents generating networks with cycles
• Use PGMPY’s K2Score to quantify network fit (see the sketch below)
http://www.lx.it.pt/~asmc/pub/talks/09-TA/ta_pres.pdf
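
A sketch of this leveled random-DAG generation plus K2 scoring; apart from pgmpy's BayesianNetwork and K2Score, everything here (function names, toy data) is hypothetical helper code:

```python
import random
import numpy as np
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import K2Score

def random_leveled_dag(nodes, max_level=3, edge_prob=0.5):
    """Assign each node a random level, then add edges only from level k to
    level k+1; edges never skip levels or stay inside a level, so no cycles."""
    level = {n: random.randint(1, max_level) for n in nodes}
    model = BayesianNetwork()
    model.add_nodes_from(nodes)
    model.add_edges_from((u, v) for u in nodes for v in nodes
                         if level[v] == level[u] + 1 and random.random() < edge_prob)
    return model

# Toy binary data standing in for the real samples.
data = pd.DataFrame(np.random.randint(2, size=(500, 4)), columns=list('ABCD'))

# Generate many random acyclic candidates and keep the one with the best K2 score.
scorer = K2Score(data)
candidates = [random_leveled_dag(list(data.columns)) for _ in range(200)]
best = max(candidates, key=scorer.score)
print(best.edges())
```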
Example where max_level = 3
Network with best score
How to handle larger datasets with many
columns/nodes and cardinality?
Use histogram binning to reduce the cardinality
• Instead of keeping a very high cardinality, reduce it with histogram binning (a fixed
number of bins, or percentile-based bins).
• This also reduces the chance of a value appearing only once or twice
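
For example, with pandas (the file and column names below are placeholders):

```python
import pandas as pd

df = pd.read_csv('data.csv')                 # hypothetical input file

# Fixed number of equal-width bins...
df['age_binned'] = pd.cut(df['age'], bins=5)

# ...or percentile-based (equal-frequency) bins, which spread rare values
# across bins and avoid categories that occur only once or twice.
df['income_binned'] = pd.qcut(df['income'], q=4, duplicates='drop')
```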
Example with max_level = 5
Permutation + Pruning Algorithm
• Loop over tuples of 1, 2, … nodes and run through each permutation of
node-edge connections within those tuples
• After every loop, look through the permutations that are generated
and pick the ones with highest scores
• Scores are calculated by doing prediction on the leaf nodes (nodes without
any children)
Networks with high scores
Network with high score
Ardi Machine Learning includes Bayesian
Networks
• ml-ui: get the file from the UI, process the parameters, store an entry in the database, wait for an available worker to take the work, and read the results
• ml-manager: requests a worker to take the work
• ml-worker: read the files and parameters, run the Bayesian code (BayesNetworkHandler.py, built on PGMPY and Pomegranate), and return the results
• ml-db: stores the data and parameters
Acknowledgements
• Some of the materials are based on work by the following:
• Dr. Cheng, Dr. Wong, Dr. Hamo, Dr. Silberstein, Dr. Huang, Mr. Chang-Ogimoto, etc.
