
Data Mining 2018

Bayesian Networks (2)

Ad Feelders

Universiteit Utrecht



Learning Bayesian Networks

1. Parameter learning: structure known/given; we only need to estimate the conditional probabilities from the data.
2. Structure learning: structure unknown; we need to learn the network's structure as well as the corresponding conditional probabilities from the data.



Smoothing by adding “prior counts”

Recall that the maximum likelihood estimate of p(x_i | x_pa(i)) is:

$$\hat{p}(x_i \mid x_{\mathrm{pa}(i)}) = \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}$$

But sometimes we have no (n(x_pa(i)) = 0) or very few observations to estimate these (conditional) probabilities.



Smoothing by adding “prior counts”

Add “prior counts” to “smooth” the estimates.

$$\hat{p}^{\,s}(x_i \mid x_{\mathrm{pa}(i)}) = \frac{n(x_{\mathrm{pa}(i)})\,\hat{p}(x_i \mid x_{\mathrm{pa}(i)}) + m(x_{\mathrm{pa}(i)})\,p^0(x_i \mid x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)}) + m(x_{\mathrm{pa}(i)})}$$

where m(x_pa(i)) is the prior precision, p̂^s(x_i | x_pa(i)) is the smoothed estimate, and p^0(x_i | x_pa(i)) is our prior estimate of p(x_i | x_pa(i)).

Common to take m(xpa(i) ) to be the same for all parent configurations.

Weighted average of ML estimate and prior estimate.



ML estimate vs. smoothed estimate

For example,

$$\hat{p}_{3|1,2}(1 \mid 1,2) = \frac{n(x_1 = 1, x_2 = 2, x_3 = 1)}{n(x_1 = 1, x_2 = 2)} = \frac{0}{2} = 0$$

Suppose we set m = 2, and p^0(x_i | x_pa(i)) = p̂(x_i). Then we get

$$\hat{p}^{\,s}_{3|1,2}(1 \mid 1,2) = \frac{2 \times 0 + 2 \times 0.4}{2 + 2} = 0.2,$$

since p̂_3(1) = 0.4.
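The computation can be reproduced in R. A minimal sketch with the counts of this example hard-coded (the variable names are mine; bnlearn can produce similarly smoothed estimates with bn.fit(..., method = "bayes", iss = ...), which uses a uniform prior instead of the marginal ML estimate):

# counts and prior settings from the example above
n_joint  <- 0    # n(x1=1, x2=2, x3=1)
n_parent <- 2    # n(x1=1, x2=2)
m        <- 2    # prior precision m(x_pa(i))
p_prior  <- 0.4  # prior estimate p0, here the marginal estimate of p3(1)

p_ml <- if (n_parent > 0) n_joint / n_parent else 0  # ML estimate (0 here)
(n_parent * p_ml + m * p_prior) / (n_parent + m)     # smoothed estimate: 0.2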




Data Set and Likelihood

P(X1, X2, X3, X4) = p1(X1) p2(X2) p3|12(X3 | X1, X2) p4|3(X4 | X3)

obs  X1  X2  X3  X4   contribution to the likelihood
 1    1   1   1   1   p1(1) p2(1) p3|12(1|1,1) p4|3(1|1)
 2    1   1   1   1   p1(1) p2(1) p3|12(1|1,1) p4|3(1|1)
 3    1   1   2   1   p1(1) p2(1) p3|12(2|1,1) p4|3(1|2)
 4    1   2   2   1   p1(1) p2(2) p3|12(2|1,2) p4|3(1|2)
 5    1   2   2   2   p1(1) p2(2) p3|12(2|1,2) p4|3(2|2)
 6    2   1   1   2   p1(2) p2(1) p3|12(1|2,1) p4|3(2|1)
 7    2   1   2   3   p1(2) p2(1) p3|12(2|2,1) p4|3(3|2)
 8    2   1   2   3   p1(2) p2(1) p3|12(2|2,1) p4|3(3|2)
 9    2   2   2   3   p1(2) p2(2) p3|12(2|2,2) p4|3(3|2)
10    2   2   1   3   p1(2) p2(2) p3|12(1|2,2) p4|3(3|1)



For all observations
Likelihood function for all observations together:

$$\begin{aligned}
L(D) ={}& p_1(1)^5 (1-p_1(1))^5 \, p_2(1)^6 (1-p_2(1))^4 \\
&\times p_{3|1,2}(1|1,1)^2 (1-p_{3|1,2}(1|1,1)) \, (1-p_{3|1,2}(1|1,2))^2 \\
&\times p_{3|1,2}(1|2,1) (1-p_{3|1,2}(1|2,1))^2 \, p_{3|1,2}(1|2,2) (1-p_{3|1,2}(1|2,2)) \\
&\times p_{4|3}(1|1)^2 \, p_{4|3}(2|1) (1-p_{4|3}(1|1)-p_{4|3}(2|1)) \\
&\times p_{4|3}(1|2)^2 \, p_{4|3}(2|2) (1-p_{4|3}(1|2)-p_{4|3}(2|2))^3
\end{aligned}$$

Or in log form:

$$\begin{aligned}
\log L(D) ={}& 5 \log p_1(1) + 5 \log(1-p_1(1)) + 6 \log p_2(1) + 4 \log(1-p_2(1)) \\
&+ 2 \log p_{3|1,2}(1|1,1) + \log(1-p_{3|1,2}(1|1,1)) \\
&+ 2 \log(1-p_{3|1,2}(1|1,2)) + \log p_{3|1,2}(1|2,1) + 2 \log(1-p_{3|1,2}(1|2,1)) \\
&+ \log p_{3|1,2}(1|2,2) + \log(1-p_{3|1,2}(1|2,2)) \\
&+ 2 \log p_{4|3}(1|1) + \log p_{4|3}(2|1) + \log(1-p_{4|3}(1|1)-p_{4|3}(2|1)) \\
&+ 2 \log p_{4|3}(1|2) + \log p_{4|3}(2|2) + 3 \log(1-p_{4|3}(1|2)-p_{4|3}(2|2))
\end{aligned}$$



Structure Learning

The log-likelihood function for a Bayesian network is:

$$L = \sum_{i=1}^{k} \sum_{x_i,\, x_{\mathrm{pa}(i)}} n(x_i, x_{\mathrm{pa}(i)}) \log p(x_i \mid x_{\mathrm{pa}(i)})$$

The maximum likelihood estimate of p(x_i | x_pa(i)) is:

$$\hat{p}(x_i \mid x_{\mathrm{pa}(i)}) = \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}.$$



The Log-Likelihood Score
If we fill in the maximum likelihood estimates in the log-likelihood function
we obtain the value of the log-likelihood function evaluated at its
maximum:

$$L = \sum_{i=1}^{k} \sum_{x_i,\, x_{\mathrm{pa}(i)}} n(x_i, x_{\mathrm{pa}(i)}) \log \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}$$

This is called the log-likelihood score.


The higher its value, the better the model fits the data.
The saturated model (complete graph) always has the highest
log-likelihood score.
To avoid overfitting, we must penalize model complexity.
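Because the score is a sum of per-node terms (see "Score Decomposes" below), it can be computed family by family. A minimal R sketch for discrete data stored as factors in a data frame; family_score is a hypothetical helper, not part of bnlearn:

# log-likelihood contribution of one node given its parent set,
# evaluated at the maximum likelihood estimates
family_score <- function(data, node, parents = character(0)) {
  if (length(parents) == 0) {
    n_xi <- table(data[[node]])                        # n(x_i)
    return(sum(n_xi[n_xi > 0] * log(n_xi[n_xi > 0] / sum(n_xi))))
  }
  cfg      <- interaction(data[parents], drop = TRUE)  # parent configurations
  n_joint  <- table(data[[node]], cfg)                 # n(x_i, x_pa(i))
  n_parent <- rep(table(cfg), each = nrow(n_joint))    # n(x_pa(i)), aligned with n_joint
  keep     <- n_joint > 0                              # convention: 0 log 0 = 0
  sum(n_joint[keep] * log(n_joint[keep] / n_parent[keep]))
}

The log-likelihood score of a network is then the sum of family_score over all nodes, each with its own parent set.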



Scoring Functions for Structure Learning

Scoring functions:

AIC(M) = L_M − dim(M)
BIC(M) = L_M − (log n / 2) · dim(M)

where L_M is the log-likelihood score of model M and dim(M) is the number of parameters of M.

BIC gives a higher penalty for model complexity (for n > 7), so it tends to lead to less complex models than AIC.

Note: earlier we defined AIC(M) = 2(L_sat − L_M) + 2 dim(M). Dividing by −2 and ignoring the constant L_sat gives the current definition.
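In bnlearn these scores are available directly; a sketch, assuming a network structure g (a bn object) and a discrete data frame d:

score(g, d, type = "loglik")  # log-likelihood score L_M
score(g, d, type = "aic")     # L_M - dim(M)
score(g, d, type = "bic")     # L_M - (log(n)/2) * dim(M)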



Optimization Problem
Given
1. training data,
2. a scoring function (BIC or AIC),
3. a space of possible models (all DAGs),
find the model that maximizes the score.

Most model search algorithms do not require an a priori ordering of the variables!

The number of labeled acyclic directed graphs on k nodes is given by the recurrence (with a_0 = 1):

$$a_k = \sum_{j=1}^{k} (-1)^{j-1} \binom{k}{j} 2^{j(k-j)} a_{k-j}$$

For example, a_6 = 3,781,503.
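The recurrence is easy to evaluate directly; a small R sketch (num_dags is a hypothetical helper):

# number of labeled DAGs on k nodes, via the recurrence above (a_0 = 1)
num_dags <- function(k) {
  a <- 1  # a[1] stores a_0
  for (m in 1:k) {
    j <- 1:m
    a <- c(a, sum((-1)^(j - 1) * choose(m, j) * 2^(j * (m - j)) * a[m - j + 1]))
  }
  a[k + 1]
}
sapply(1:6, num_dags)  # 1, 3, 25, 543, 29281, 3781503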


Heuristic Search

Define which models are neighbours of a given model (typically: addition, removal, or reversal of an arc).
Traverse the search space looking for high-scoring models, e.g. by greedy hill-climbing.



Score Decomposes

The log-likelihood score

$$L = \sum_{i=1}^{k} \sum_{x_i,\, x_{\mathrm{pa}(i)}} n(x_i, x_{\mathrm{pa}(i)}) \log \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}$$

must be computed many times for different models in structure learning.

Luckily, it is a sum of terms, where each term contains the variables {i} ∪ pa(i).

Hence, when making a change to the model, we only have to recompute the score for those variables for which the parent set has changed!



Example Data Set

obs X1 X2 X3 X4
1 1 1 1 1
2 1 1 1 1
3 1 1 2 1
4 1 2 2 1
5 1 2 2 2
6 2 1 1 2
7 2 1 2 3
8 2 1 2 3
9 2 2 2 3
10 2 2 1 3



Score this model

[Figure: the model to be scored, a DAG with edges X1 → X3, X2 → X3, and X3 → X4, matching the factorization p1(X1) p2(X2) p3|12(X3 | X1, X2) p4|3(X4 | X3).]



Relevant Data For Scoring Node 1

(The example data set is repeated on this slide; only column X1 is needed to score node 1.)

Score node 1 = 5 log(5/10) + 5 log(5/10)



Relevant Data For Scoring Node 2

(The example data set is repeated on this slide; only column X2 is needed to score node 2.)

Score node 2 = 6 log(6/10) + 4 log(4/10)



Relevant Data For Scoring Node 3

(The example data set is repeated on this slide; columns X1, X2, X3 are needed, since node 3 has parents X1 and X2.)

Score node 3 = 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)



Relevant Data For Scoring Node 4

(The example data set is repeated on this slide; columns X3 and X4 are needed, since node 4 has parent X3.)

Score node 4 = 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)



Total Score

Summing the log-likelihood score over all nodes, we get:

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09
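This total can be checked with bnlearn; a sketch in which the example data set and the scored structure are entered by hand:

library(bnlearn)

d <- data.frame(                                 # the example data set
  X1 = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)),
  X2 = factor(c(1, 1, 1, 2, 2, 1, 1, 1, 2, 2)),
  X3 = factor(c(1, 1, 2, 2, 2, 1, 2, 2, 2, 1)),
  X4 = factor(c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3))
)
g <- model2network("[X1][X2][X3|X1:X2][X4|X3]")  # X1 -> X3 <- X2, X3 -> X4
score(g, d, type = "loglik")                     # approximately -29.09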



Add an edge from X1 to X2

[Figure: the DAG above with the edge X1 → X2 added.]



Score is Decomposable

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + [ 6 log(6/10) + 4 log(4/10) ]   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09

When we add an edge from X1 to X2, only the parent set of node 2 changes. Therefore, only the score of node 2 (the bracketed part above) has to be recomputed.



Relevant Data For Re-scoring Node 2

(The example data set is repeated on this slide; columns X1 and X2 are needed, since X1 is now a parent of node 2.)

New score node 2 = 3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5)



Score Decomposes

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + [ 3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5) ]   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09

The bracketed part is the new contribution of node 2 to the score.



Add an edge from X1 to X4

[Figure: the original DAG with the edge X1 → X4 added.]



Score Decomposes

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + [ 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6) ]   (node 4)
  ≈ −29.09

When we add an edge from X1 to X4, only the parent set of node 4 changes. Therefore, only the score of node 4 (the bracketed part above) has to be recomputed.



Relevant Data For Re-scoring Node 4

(The example data set is repeated on this slide; columns X1, X3, X4 are needed, since the new parent set of node 4 is {X1, X3}.)

New score node 4 = 2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1



Score Decomposes

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + [ 2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1 ]   (node 4)
  ≈ −22.16

The bracketed part is the new contribution of node 4 to the score.



Counting Parameters

The number of parameters of a Bayesian network is:

$$\sum_{i=1}^{k} (d_i - 1) \prod_{j \in \mathrm{pa}(i)} d_j$$

where k is the number of variables in the network, and d_i is the number of possible values of X_i. The product ∏_{j ∈ pa(i)} d_j is the number of parent configurations for X_i.

If X_i has no parents, the number of parent configurations should be taken to be 1, so X_i contributes d_i − 1 parameters in that case.
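A sketch in R (dim_bn is a hypothetical helper; for a bn object, bnlearn's nparams(g, d) computes the same quantity):

# dim(M) = sum over nodes of (d_i - 1) * (number of parent configurations)
dim_bn <- function(nlev, parents) {
  sum(sapply(names(nlev), function(i) {
    (nlev[[i]] - 1) * prod(unlist(nlev[parents[[i]]]))  # prod over empty set is 1
  }))
}
nlev    <- list(X1 = 2, X2 = 2, X3 = 2, X4 = 3)
parents <- list(X1 = character(0), X2 = character(0),
                X3 = c("X1", "X2"), X4 = "X3")
dim_bn(nlev, parents)  # 1 + 1 + 1*4 + 2*2 = 10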



A Simple Structure Learning Algorithm

Algorithm 1 BN Structure Learning


1: G ← initial graph
2: max ← score(G )
3: repeat
4: nb ← neighbours(G )
5: for all G 0 ∈ nb do
6: if score (G 0 ) > max then
7: max ← score(G 0 )
8: G ← G0
9: end if
10: end for
11: until no change to G
12: return G
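A simplified R sketch of this loop, restricted to arc additions and using an AIC-penalized version of the hypothetical family_score helper from above (real implementations such as bnlearn's hc() also consider arc removal and reversal, and cache delta scores):

# AIC contribution of one family: log-likelihood minus number of parameters
family_aic <- function(data, node, parents = character(0)) {
  npar <- (nlevels(data[[node]]) - 1) *
    prod(vapply(parents, function(p) nlevels(data[[p]]), integer(1)))
  family_score(data, node, parents) - npar
}

# TRUE if adding from -> to would create a directed cycle,
# i.e. if `to` is already an ancestor of `from`
creates_cycle <- function(parents, from, to) {
  is_anc <- function(x)
    to %in% parents[[x]] || any(vapply(parents[[x]], is_anc, logical(1)))
  is_anc(from)
}

greedy_add <- function(data) {
  nodes   <- names(data)
  parents <- setNames(rep(list(character(0)), length(nodes)), nodes)
  repeat {
    best <- list(delta = 0, from = NULL, to = NULL)
    for (to in nodes)
      for (from in setdiff(nodes, c(to, parents[[to]]))) {
        if (creates_cycle(parents, from, to)) next
        delta <- family_aic(data, to, c(parents[[to]], from)) -
                 family_aic(data, to, parents[[to]])
        if (delta > best$delta) best <- list(delta = delta, from = from, to = to)
      }
    if (is.null(best$from)) break  # no score-improving addition left
    parents[[best$to]] <- c(parents[[best$to]], best$from)
  }
  parents  # the parent sets define the learned DAG
}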



Complexity: Naive

A DAG with k nodes has k(k − 1) possible directed edges:
  edge present: delete or reverse;
  edge absent: add.
So there are O(k²) neighbours that have to be scored.
There are k components in the score, and for each component we have to compute the sufficient statistics, which requires traversing the training data (n rows). So in total, scoring a single neighbour takes O(kn) time.
The total complexity is O(k³n) per search step.



Complexity: Exploiting Decomposability

To score a neighbour we actually only have to score a single node (add, delete) or two nodes (reversal), since we only have to recompute the score for the nodes whose parent set changed by the local operation performed. So the complexity of scoring a neighbour is only O(n) instead of O(kn).
If we compute the change in score due to a local operation (add, delete, reverse), then we can reuse the "delta scores" computed in previous iterations. We only need to recompute the delta scores for nodes whose parent set changed in the previous iteration. So we only have to compute O(k) changes in score.
Hence the total complexity per search step is O(kn).



Add an edge from X1 to X2

[Figure: the current DAG with the candidate edge X1 → X2 added.]

∆Score(add(X1 → X2)) = (3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5))
                     − (6 log(6/10) + 4 log(4/10)) = 0



Add an edge from X1 to X4

[Figure: the current DAG with the candidate edge X1 → X4 added.]

∆Score(add(X1 → X4)) = (2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1)
                     − (2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6))
                     ≈ 6.93
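With the hypothetical family_score helper and the data frame d from the sketches above, this delta is a one-liner:

family_score(d, "X4", c("X1", "X3")) - family_score(d, "X4", "X3")  # approximately 6.93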



Add an edge from X1 to X4

Suppose we decide to add the arrow X1 → X4.

In the next iteration only the ∆ scores of operations that change the parent set of X4 have to be recomputed.
For example, ∆Score(add(X1 → X2)) doesn't have to be recomputed because it is the same as in the previous iteration.



Interpretation: warning!

[Figure: three DAGs over nodes 1, 2, 3, all with skeleton 1 — 2 — 3 and no v-structure (e.g. 1 → 2 → 3, 1 ← 2 ← 3, 1 ← 2 → 3).]

These models cannot be distinguished from data alone. They represent the same independencies!

AIC and BIC give equivalent networks the same score.

Markov Equivalence and Essential Graph

Two DAGs are Markov equivalent if and only if
1. they have the same skeleton (same undirected graph when you drop the directions of all edges), and
2. they have the same immoralities (v-structures).

Essential Graph:
For a given DAG, an edge becomes bi-directional in the essential graph if there is an equivalent DAG in which the direction of the edge is reversed.
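In bnlearn the essential graph of a DAG is computed by cpdag(); a small sketch with two Markov-equivalent three-node chains:

library(bnlearn)

g1 <- model2network("[X1][X2|X1][X3|X2]")  # X1 -> X2 -> X3
g2 <- model2network("[X3][X2|X3][X1|X2]")  # X1 <- X2 <- X3
all.equal(cpdag(g1), cpdag(g2))            # TRUE: same essential graph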



Example Analysis

We analyze a data set concerning risk factors for coronary heart disease. For a sample of 1841 car-workers, the following information was recorded:

Variable Description
A Does the person smoke?
B Is the person’s work strenuous mentally?
C Is the person’s work strenuous physically?
D Systolic blood pressure < 140mm?
E Ratio of beta to alpha lipoproteins < 3?
F Is there a family history of coronary heart disease?



Example Analysis
For learning Bayesian networks, we use the bnlearn package in R.
Hill-climbing with the BIC score function:
> coronary.hc <- hc(coronary)
> plot(coronary.hc)

[Figure: the network structure learned for the coronary data, as plotted by plot(coronary.hc).]
The Search Process
> coronary.hc <- hc(coronary, debug=T)
----------------------------------------------------------------
* starting from the following network:
model:
[A][B][C][D][E][F]

* current score: -7061.714


* caching score delta for arc A -> B (17.531166).
* caching score delta for arc A -> C (9.981480).
* caching score delta for arc A -> D (1.757126).
* caching score delta for arc A -> E (4.941129).
* caching score delta for arc A -> F (-3.224701).
* caching score delta for arc B -> C (264.272873).
* caching score delta for arc B -> D (2.313656).
* caching score delta for arc B -> E (21.030213).
* caching score delta for arc B -> F (2.303571).
* caching score delta for arc C -> D (-3.711314).
* caching score delta for arc C -> E (4.577177).
* caching score delta for arc C -> F (-3.673929).
* caching score delta for arc D -> E (2.645583).
* caching score delta for arc D -> F (-3.197133).
* caching score delta for arc E -> F (-2.257169).



The Search Process

The initial model (the mutual independence model [A][B][C][D][E][F]) has a BIC score of -7061.714.
The output gives the change in score between the current model and its neighbors.
Why is the score of only 15 of the 30 neighbors computed (e.g. A -> B, but not B -> A)?
A -> B and B -> A are Markov equivalent, and therefore have the same score.
Adding B -> C results in the largest increase in score, so we move to that neighbor.



The Search Process

* best operation was: adding B -> C .


* current network is :

model:
[A][B][D][E][F][C|B]

* current score: -6797.441


* caching score delta for arc A -> C (9.975823).
* caching score delta for arc B -> C (-264.272873).
* caching score delta for arc D -> C (-1.472731).
* caching score delta for arc E -> C (-6.587044).
* caching score delta for arc F -> C (-6.059896).



The Search Process

We don't have to recompute the change in score caused by, for example, adding A -> B, because the parent set of B is the same as in the previous iteration. Therefore, adding A -> B now will cause the same score change as in the previous iteration.
Only the parent set of C has changed, so we just have to recompute the change in score caused by adding arcs X -> C.
Adding B -> E results in the largest increase in score, so we move to that neighbor. The current model becomes: [A][B][D][F][C|B][E|B].



Final Model and its Essential Graph

[Figure: two graphs side by side over nodes A, B, C, D, E, F: the final model (left) and its essential graph (right).]



Example of Model Use

# estimate parameters for selected model structure
> coronary.hc.fit <- bn.fit(coronary.hc,coronary.dat,"mle")
# predict B from remaining variables
> coronary.hc.pred <- predict(coronary.hc.fit,node="B",
data=coronary.dat)
# make confusion matrix
> table(coronary.dat$B,coronary.hc.pred)
coronary.hc.pred
no yes
no 944 186
yes 208 503
> (944+503)/1841
[1] 0.7859859
> (944+186)/1841
[1] 0.6137968



Bayesian Networks as Classifiers

[Figure: a Bayesian network with class variable C and attributes A1, ..., A8.]

Markov Blanket: Parents, Children and Parents of Children.



Markov Blanket of C : Moral Graph

[Figure: the moral graph of the network above, with the Markov blanket of C highlighted.]

Markov Blanket: Parents, Children and Parents of Children.

Local Markov property: C ⊥⊥ rest | boundary(C)
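In bnlearn the Markov blanket of a node in a learned structure can be read off directly; a sketch, for the coronary model learned earlier:

# parents, children, and other parents of children of B
mb(coronary.hc, node = "B")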
Right Heart Catheterization Data: Variable Description

1. cat1: primary disease category (9 different values)
2. death: did the patient die within 180 days after admission to ICU?
3. swang1: was right heart catheterization (Swan-Ganz catheter) performed within the first 24 hours?
4. gender: male/female
5. race: black/white/other
6. ninsclas: type of medical insurance of patient (six different values)
7. income: income of patient, divided into 4 categories
8. ca: cancer status (yes/no/metastatic)
9. age: age of patient, divided into 5 categories
10. meanbp1: mean blood pressure of patient, divided into 2 categories



Right Heart Catheterization Data: Descriptive Statistics
> summary(rhc.dat)
cat1 death swang1 gender race
ARF :2490 No :2013 No RHC:3551 Female:2543 black: 920
MOSF w/Sepsis :1227 Yes:3722 RHC :2184 Male :3192 other: 355
COPD : 457 white:4460
CHF : 456
Coma : 436
MOSF w/Malignancy: 399
(Other) : 270

ninsclas income ca age


Medicaid : 647 $11-$25k :1165 Metastatic: 384 (50,60] : 917
Medicare :1458 $25-$50k : 893 No :4379 (60,70] :1390
Medicare & Medicaid: 374 > $50k : 451 Yes : 972 (70,80] :1337
No insurance : 322 Under $11k:3226 (80,102]: 667
Private :1698 [18,50] :1424
Private & Medicare :1236

meanbp1
(85,259]:1975
[0,85] :3760



Learning the Graph Structure

# load Bayesian network library
> library(bnlearn)

# load library for graph visualization
> library(Rgraphviz)

# use hill climbing with BIC scoring,
# starting from empty graph
> rhc.bn <- hc(rhc.dat)

# plot the model structure
> plot(as(amat(rhc.bn),"graphNEL"))



Graph Structure for Right Heart Catheterization Data

[Figure: the learned graph structure for the right heart catheterization data, over race, ninsclas, gender, income, age, cat1, swang1, ca, meanbp1, and death.]


Performing Inference

# estimate the network parameters
> rhc.bn.fit <- bn.fit(rhc.bn,data=rhc.dat)

# perform sampling based inference:
# probability of death for metastatic cancer and mean blood pressure > 85
> cpquery(rhc.bn.fit,event=death=="Yes",evidence=
ca=="Metastatic" & meanbp1=="(85,259]",n=100000)
[1] 0.9033019

# probability of death for no cancer and mean blood pressure > 85
> cpquery(rhc.bn.fit,event=death=="Yes",evidence=
ca=="No" & meanbp1=="(85,259]",n=100000)
[1] 0.6020206



Combining Data and Prior Knowledge
An expert studies the graph and argues that the edge from swang1 to
meanbp1 is in the wrong direction, since the blood pressure influences the
decision to apply right heart catheterization, not the other way around.

Can we turn the edge around without changing the “meaning” of the
network, i.e. without changing the conditional independencies expressed by
the graph?

[Figure: the learned graph structure for the right heart catheterization data, repeated from the previous slide.]



Combining Data and Prior Knowledge
Common sense suggests that the variables can be divided into a number of ordered blocks, where arrows are not allowed to point from a variable in a higher block to a variable in a lower block.

As an example, consider the following block structure:
1. race, gender
2. age, income
3. ninsclas
4. cat1, ca, meanbp1
5. swang1
6. death

We can use the blacklist parameter to avoid edges pointing from higher blocks to lower blocks, as in the sketch below.



Learning with a Blacklist

# learn structure with blacklist
> rhc.bn.ord <- hc(rhc.dat,blacklist=blackL)

# has the score become much worse?
> score(rhc.bn.ord,rhc.dat)
[1] -54059.03
> score(rhc.bn,rhc.dat)
[1] -53749.15

# have inferences changed much?
> rhc.bn.ord.fit <- bn.fit(rhc.bn.ord,data=rhc.dat)

> cpquery(rhc.bn.ord.fit,event=death=="Yes",evidence=
ca=="Metastatic" & meanbp1=="(85,259]",n=100000)
[1] 0.9039467
> cpquery(rhc.bn.ord.fit,event=death=="Yes",evidence=
ca=="No" & meanbp1=="(85,259]",n=100000)
[1] 0.610249



The Blacklist
The blacklist simply enumerates all the forbidden edges:

> blackL
X1 X2
1 cat1 gender
2 cat1 race
3 cat1 ninsclas
4 cat1 income
5 cat1 age
6 death cat1
7 death swang1
etc.



Graph Structure Learned with Blacklist

[Figure: the graph structure learned with the blacklist, over the same ten variables.]
