
Data Mining 2018

Bayesian Networks (2)

Ad Feelders

Universiteit Utrecht



Learning Bayesian Networks

1. Parameter learning: structure known/given; we only need to estimate the conditional probabilities from the data.
2. Structure learning: structure unknown; we need to learn the network's structure as well as the corresponding conditional probabilities from the data.



Smoothing by adding “prior counts”

Recall that the maximum likelihood estimate of p(x_i | x_pa(i)) is:

$$\hat{p}(x_i \mid x_{\mathrm{pa}(i)}) = \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}$$

But sometimes we have no (n(x_pa(i)) = 0) or very few observations to estimate these (conditional) probabilities.



Smoothing by adding “prior counts”

Add “prior counts” to “smooth” the estimates.

$$\hat{p}^{\,s}(x_i \mid x_{\mathrm{pa}(i)}) = \frac{n(x_{\mathrm{pa}(i)})\,\hat{p}(x_i \mid x_{\mathrm{pa}(i)}) + m(x_{\mathrm{pa}(i)})\,p^0(x_i \mid x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)}) + m(x_{\mathrm{pa}(i)})}$$

where m(x_pa(i)) is the prior precision, p̂^s(x_i | x_pa(i)) is the smoothed estimate, and p^0(x_i | x_pa(i)) is our prior estimate of p(x_i | x_pa(i)).

Common to take m(xpa(i) ) to be the same for all parent configurations.

Weighted average of ML estimate and prior estimate.



ML estimate vs. smoothed estimate

For example,

$$\hat{p}_{3|1,2}(1 \mid 1,2) = \frac{n(x_1 = 1, x_2 = 2, x_3 = 1)}{n(x_1 = 1, x_2 = 2)} = \frac{0}{2} = 0$$

Suppose we set m = 2, and p^0(x_i | x_pa(i)) = p̂(x_i). Then we get

$$\hat{p}^{\,s}_{3|1,2}(1 \mid 1,2) = \frac{2 \times 0 + 2 \times 0.4}{2 + 2} = 0.2,$$

since p̂_3(1) = 0.4.
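The computation can be reproduced in R. A minimal sketch with the counts of this example hard-coded (the variable names are mine; bnlearn can produce similarly smoothed estimates with bn.fit(..., method = "bayes", iss = ...), which uses a uniform prior instead of the marginal ML estimate):

# counts and prior settings from the example above
n_joint  <- 0    # n(x1=1, x2=2, x3=1)
n_parent <- 2    # n(x1=1, x2=2)
m        <- 2    # prior precision m(x_pa(i))
p_prior  <- 0.4  # prior estimate p0, here the marginal estimate of p3(1)

p_ml <- if (n_parent > 0) n_joint / n_parent else 0  # ML estimate (0 here)
(n_parent * p_ml + m * p_prior) / (n_parent + m)     # smoothed estimate: 0.2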




Data Set and Likelihood

P(X1, X2, X3, X4) = p1(X1) p2(X2) p3|12(X3 | X1, X2) p4|3(X4 | X3)

obs  X1  X2  X3  X4   contribution to the likelihood
 1    1   1   1   1   p1(1) p2(1) p3|12(1|1,1) p4|3(1|1)
 2    1   1   1   1   p1(1) p2(1) p3|12(1|1,1) p4|3(1|1)
 3    1   1   2   1   p1(1) p2(1) p3|12(2|1,1) p4|3(1|2)
 4    1   2   2   1   p1(1) p2(2) p3|12(2|1,2) p4|3(1|2)
 5    1   2   2   2   p1(1) p2(2) p3|12(2|1,2) p4|3(2|2)
 6    2   1   1   2   p1(2) p2(1) p3|12(1|2,1) p4|3(2|1)
 7    2   1   2   3   p1(2) p2(1) p3|12(2|2,1) p4|3(3|2)
 8    2   1   2   3   p1(2) p2(1) p3|12(2|2,1) p4|3(3|2)
 9    2   2   2   3   p1(2) p2(2) p3|12(2|2,2) p4|3(3|2)
10    2   2   1   3   p1(2) p2(2) p3|12(1|2,2) p4|3(3|1)



For all observations
Likelihood function for all observations together:

$$\begin{aligned}
L(D) ={}& p_1(1)^5 (1-p_1(1))^5 \, p_2(1)^6 (1-p_2(1))^4 \\
&\times p_{3|1,2}(1|1,1)^2 (1-p_{3|1,2}(1|1,1)) \, (1-p_{3|1,2}(1|1,2))^2 \\
&\times p_{3|1,2}(1|2,1) (1-p_{3|1,2}(1|2,1))^2 \, p_{3|1,2}(1|2,2) (1-p_{3|1,2}(1|2,2)) \\
&\times p_{4|3}(1|1)^2 \, p_{4|3}(2|1) (1-p_{4|3}(1|1)-p_{4|3}(2|1)) \\
&\times p_{4|3}(1|2)^2 \, p_{4|3}(2|2) (1-p_{4|3}(1|2)-p_{4|3}(2|2))^3
\end{aligned}$$

Or in log form:

$$\begin{aligned}
\log L(D) ={}& 5 \log p_1(1) + 5 \log(1-p_1(1)) + 6 \log p_2(1) + 4 \log(1-p_2(1)) \\
&+ 2 \log p_{3|1,2}(1|1,1) + \log(1-p_{3|1,2}(1|1,1)) \\
&+ 2 \log(1-p_{3|1,2}(1|1,2)) + \log p_{3|1,2}(1|2,1) + 2 \log(1-p_{3|1,2}(1|2,1)) \\
&+ \log p_{3|1,2}(1|2,2) + \log(1-p_{3|1,2}(1|2,2)) \\
&+ 2 \log p_{4|3}(1|1) + \log p_{4|3}(2|1) + \log(1-p_{4|3}(1|1)-p_{4|3}(2|1)) \\
&+ 2 \log p_{4|3}(1|2) + \log p_{4|3}(2|2) + 3 \log(1-p_{4|3}(1|2)-p_{4|3}(2|2))
\end{aligned}$$



Structure Learning

The log-likelihood function for a Bayesian network is:

$$L = \sum_{i=1}^{k} \sum_{x_i,\, x_{\mathrm{pa}(i)}} n(x_i, x_{\mathrm{pa}(i)}) \log p(x_i \mid x_{\mathrm{pa}(i)})$$

The maximum likelihood estimate of p(x_i | x_pa(i)) is:

$$\hat{p}(x_i \mid x_{\mathrm{pa}(i)}) = \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}.$$



The Log-Likelihood Score
If we fill in the maximum likelihood estimates in the log-likelihood function
we obtain the value of the log-likelihood function evaluated at its
maximum:

$$L = \sum_{i=1}^{k} \sum_{x_i,\, x_{\mathrm{pa}(i)}} n(x_i, x_{\mathrm{pa}(i)}) \log \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}$$

This is called the log-likelihood score.


The higher its value, the better the model fits the data.
The saturated model (complete graph) always has the highest
log-likelihood score.
To avoid overfitting, we must penalize model complexity.
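Because the score is a sum of per-node terms (see "Score Decomposes" below), it can be computed family by family. A minimal R sketch for discrete data stored as factors in a data frame; family_score is a hypothetical helper, not part of bnlearn:

# log-likelihood contribution of one node given its parent set,
# evaluated at the maximum likelihood estimates
family_score <- function(data, node, parents = character(0)) {
  if (length(parents) == 0) {
    n_xi <- table(data[[node]])                        # n(x_i)
    return(sum(n_xi[n_xi > 0] * log(n_xi[n_xi > 0] / sum(n_xi))))
  }
  cfg      <- interaction(data[parents], drop = TRUE)  # parent configurations
  n_joint  <- table(data[[node]], cfg)                 # n(x_i, x_pa(i))
  n_parent <- rep(table(cfg), each = nrow(n_joint))    # n(x_pa(i)), aligned with n_joint
  keep     <- n_joint > 0                              # convention: 0 log 0 = 0
  sum(n_joint[keep] * log(n_joint[keep] / n_parent[keep]))
}

The log-likelihood score of a network is then the sum of family_score over all nodes, each with its own parent set.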



Scoring Functions for Structure Learning

Scoring functions:

AIC(M) = L_M − dim(M)
BIC(M) = L_M − (log n / 2) · dim(M)

where L_M is the log-likelihood score of model M and dim(M) is the number of parameters of M.

BIC gives a higher penalty for model complexity (for n > 7), so it tends to lead to less complex models than AIC.

Note: earlier we defined AIC(M) = 2(L_sat − L_M) + 2 dim(M). Dividing by −2 and ignoring the constant L_sat gives the current definition.
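In bnlearn these scores are available directly; a sketch, assuming a network structure g (a bn object) and a discrete data frame d:

score(g, d, type = "loglik")  # log-likelihood score L_M
score(g, d, type = "aic")     # L_M - dim(M)
score(g, d, type = "bic")     # L_M - (log(n)/2) * dim(M)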



Optimization Problem
Given
1. training data,
2. a scoring function (BIC or AIC),
3. a space of possible models (all DAGs),
find the model that maximizes the score.

Most model search algorithms do not require an a priori ordering of the variables!

The number of labeled acyclic directed graphs on k nodes is given by the recurrence (with a_0 = 1):

$$a_k = \sum_{j=1}^{k} (-1)^{j-1} \binom{k}{j} 2^{j(k-j)} a_{k-j}$$

For example, a_6 = 3,781,503.
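The recurrence is easy to evaluate directly; a small R sketch (num_dags is a hypothetical helper):

# number of labeled DAGs on k nodes, via the recurrence above (a_0 = 1)
num_dags <- function(k) {
  a <- 1  # a[1] stores a_0
  for (m in 1:k) {
    j <- 1:m
    a <- c(a, sum((-1)^(j - 1) * choose(m, j) * 2^(j * (m - j)) * a[m - j + 1]))
  }
  a[k + 1]
}
sapply(1:6, num_dags)  # 1, 3, 25, 543, 29281, 3781503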


Heuristic Search

Define which models are neighbours of a given model (typically: addition, removal, or reversal of an arc).
Traverse the search space looking for high-scoring models, e.g. by greedy hill-climbing.



Score Decomposes

The log-likelihood score

$$L = \sum_{i=1}^{k} \sum_{x_i,\, x_{\mathrm{pa}(i)}} n(x_i, x_{\mathrm{pa}(i)}) \log \frac{n(x_i, x_{\mathrm{pa}(i)})}{n(x_{\mathrm{pa}(i)})}$$

must be computed many times for different models in structure learning.

Luckily, it is a sum of terms, where each term contains the variables {i} ∪ pa(i).

Hence, when making a change to the model, we only have to recompute the score for those variables for which the parent set has changed!



Example Data Set

obs X1 X2 X3 X4
1 1 1 1 1
2 1 1 1 1
3 1 1 2 1
4 1 2 2 1
5 1 2 2 2
6 2 1 1 2
7 2 1 2 3
8 2 1 2 3
9 2 2 2 3
10 2 2 1 3



Score this model

[Figure: the model to be scored, a DAG with edges X1 → X3, X2 → X3, and X3 → X4, matching the factorization p1(X1) p2(X2) p3|12(X3 | X1, X2) p4|3(X4 | X3).]



Relevant Data For Scoring Node 1

(The example data set is repeated on this slide; only column X1 is needed to score node 1.)

Score node 1 = 5 log(5/10) + 5 log(5/10)



Relevant Data For Scoring Node 2

(The example data set is repeated on this slide; only column X2 is needed to score node 2.)

Score node 2 = 6 log(6/10) + 4 log(4/10)



Relevant Data For Scoring Node 3

(The example data set is repeated on this slide; columns X1, X2, X3 are needed, since node 3 has parents X1 and X2.)

Score node 3 = 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)



Relevant Data For Scoring Node 4

(The example data set is repeated on this slide; columns X3 and X4 are needed, since node 4 has parent X3.)

Score node 4 = 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)



Total Score

Summing the log-likelihood score over all nodes, we get:

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09
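This total can be checked with bnlearn; a sketch in which the example data set and the scored structure are entered by hand:

library(bnlearn)

d <- data.frame(                                 # the example data set
  X1 = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)),
  X2 = factor(c(1, 1, 1, 2, 2, 1, 1, 1, 2, 2)),
  X3 = factor(c(1, 1, 2, 2, 2, 1, 2, 2, 2, 1)),
  X4 = factor(c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3))
)
g <- model2network("[X1][X2][X3|X1:X2][X4|X3]")  # X1 -> X3 <- X2, X3 -> X4
score(g, d, type = "loglik")                     # approximately -29.09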



Add an edge from X1 to X2

[Figure: the DAG above with the edge X1 → X2 added.]



Score is Decomposable

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + [ 6 log(6/10) + 4 log(4/10) ]   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09

When we add an edge from X1 to X2, only the parent set of node 2 changes. Therefore, only the score of node 2 (the bracketed part above) has to be recomputed.



Relevant Data For Re-scoring Node 2

(The example data set is repeated on this slide; columns X1 and X2 are needed, since X1 is now a parent of node 2.)

New score node 2 = 3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5)



Score Decomposes

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + [ 3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5) ]   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09

The bracketed part is the new contribution of node 2 to the score.



Add an edge from X1 to X4

[Figure: the original DAG with the edge X1 → X4 added.]



Score Decomposes

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + [ 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6) ]   (node 4)
  ≈ −29.09

When we add an edge from X1 to X4, only the parent set of node 4 changes. Therefore, only the score of node 4 (the bracketed part above) has to be recomputed.



Relevant Data For Re-scoring Node 4

(The example data set is repeated on this slide; columns X1, X3, X4 are needed, since the new parent set of node 4 is {X1, X3}.)

New score node 4 = 2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1



Score Decomposes

L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + [ 2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1 ]   (node 4)
  ≈ −22.16

The bracketed part is the new contribution of node 4 to the score.



Counting Parameters

The number of parameters of a Bayesian network is:

$$\sum_{i=1}^{k} (d_i - 1) \prod_{j \in \mathrm{pa}(i)} d_j$$

where k is the number of variables in the network, and d_i is the number of possible values of X_i. The product ∏_{j ∈ pa(i)} d_j is the number of parent configurations for X_i.

If X_i has no parents, the number of parent configurations should be taken to be 1, so X_i contributes d_i − 1 parameters in that case.
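A sketch in R (dim_bn is a hypothetical helper; for a bn object, bnlearn's nparams(g, d) computes the same quantity):

# dim(M) = sum over nodes of (d_i - 1) * (number of parent configurations)
dim_bn <- function(nlev, parents) {
  sum(sapply(names(nlev), function(i) {
    (nlev[[i]] - 1) * prod(unlist(nlev[parents[[i]]]))  # prod over empty set is 1
  }))
}
nlev    <- list(X1 = 2, X2 = 2, X3 = 2, X4 = 3)
parents <- list(X1 = character(0), X2 = character(0),
                X3 = c("X1", "X2"), X4 = "X3")
dim_bn(nlev, parents)  # 1 + 1 + 1*4 + 2*2 = 10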



A Simple Structure Learning Algorithm

Algorithm 1 BN Structure Learning


1: G ← initial graph
2: max ← score(G )
3: repeat
4: nb ← neighbours(G )
5: for all G 0 ∈ nb do
6: if score (G 0 ) > max then
7: max ← score(G 0 )
8: G ← G0
9: end if
10: end for
11: until no change to G
12: return G
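A simplified R sketch of this loop, restricted to arc additions and using an AIC-penalized version of the hypothetical family_score helper from above (real implementations such as bnlearn's hc() also consider arc removal and reversal, and cache delta scores):

# AIC contribution of one family: log-likelihood minus number of parameters
family_aic <- function(data, node, parents = character(0)) {
  npar <- (nlevels(data[[node]]) - 1) *
    prod(vapply(parents, function(p) nlevels(data[[p]]), integer(1)))
  family_score(data, node, parents) - npar
}

# TRUE if adding from -> to would create a directed cycle,
# i.e. if `to` is already an ancestor of `from`
creates_cycle <- function(parents, from, to) {
  is_anc <- function(x)
    to %in% parents[[x]] || any(vapply(parents[[x]], is_anc, logical(1)))
  is_anc(from)
}

greedy_add <- function(data) {
  nodes   <- names(data)
  parents <- setNames(rep(list(character(0)), length(nodes)), nodes)
  repeat {
    best <- list(delta = 0, from = NULL, to = NULL)
    for (to in nodes)
      for (from in setdiff(nodes, c(to, parents[[to]]))) {
        if (creates_cycle(parents, from, to)) next
        delta <- family_aic(data, to, c(parents[[to]], from)) -
                 family_aic(data, to, parents[[to]])
        if (delta > best$delta) best <- list(delta = delta, from = from, to = to)
      }
    if (is.null(best$from)) break  # no score-improving addition left
    parents[[best$to]] <- c(parents[[best$to]], best$from)
  }
  parents  # the parent sets define the learned DAG
}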



Complexity: Naive

A DAG with k nodes has k(k − 1) possible directed edges:
  edge present: delete or reverse;
  edge absent: add.
So there are O(k²) neighbours that have to be scored.
There are k components in the score, and for each component we have to compute the sufficient statistics, which requires traversing the training data (n rows). So in total, scoring a single neighbour takes O(kn) time.
The total complexity is O(k³n) per search step.



Complexity: Exploiting Decomposability

To score a neighbour we actually only have to score a single node (add, delete) or two nodes (reversal), since we only have to recompute the score for the nodes whose parent set changed by the local operation performed. So the complexity of scoring a neighbour is only O(n) instead of O(kn).
If we compute the change in score due to a local operation (add, delete, reverse), then we can reuse the "delta scores" computed in previous iterations. We only need to recompute the delta scores for nodes whose parent set changed in the previous iteration. So we only have to compute O(k) changes in score.
Hence the total complexity per search step is O(kn).



Add an edge from X1 to X2

[Figure: the current DAG with the candidate edge X1 → X2 added.]

∆Score(add(X1 → X2)) = (3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5))
                     − (6 log(6/10) + 4 log(4/10)) = 0



Add an edge from X1 to X4

[Figure: the current DAG with the candidate edge X1 → X4 added.]

∆Score(add(X1 → X4)) = (2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1)
                     − (2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6))
                     ≈ 6.93
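With the hypothetical family_score helper and the data frame d from the sketches above, this delta is a one-liner:

family_score(d, "X4", c("X1", "X3")) - family_score(d, "X4", "X3")  # approximately 6.93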



Add an edge from X1 to X4

Suppose we decide to add the arrow X1 → X4.

In the next iteration only the ∆ scores of operations that change the parent set of X4 have to be recomputed.
For example, ∆Score(add(X1 → X2)) doesn't have to be recomputed because it is the same as in the previous iteration.



Interpretation: warning!

[Figure: three DAGs over nodes 1, 2, 3, all with skeleton 1 — 2 — 3 and no v-structure (e.g. 1 → 2 → 3, 1 ← 2 ← 3, 1 ← 2 → 3).]

These models cannot be distinguished from data alone. They represent the same independencies!

AIC and BIC give equivalent networks the same score.

Markov Equivalence and Essential Graph

Two DAGs are Markov equivalent if and only if
1. they have the same skeleton (same undirected graph when you drop the directions of all edges), and
2. they have the same immoralities (v-structures).

Essential Graph:
For a given DAG, an edge becomes bi-directional in the essential graph if there is an equivalent DAG in which the direction of the edge is reversed.
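In bnlearn the essential graph of a DAG is computed by cpdag(); a small sketch with two Markov-equivalent three-node chains:

library(bnlearn)

g1 <- model2network("[X1][X2|X1][X3|X2]")  # X1 -> X2 -> X3
g2 <- model2network("[X3][X2|X3][X1|X2]")  # X1 <- X2 <- X3
all.equal(cpdag(g1), cpdag(g2))            # TRUE: same essential graph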



Example Analysis

We analyze a data set concerning risk factors for coronary heart disease. For a sample of 1841 car-workers, the following information was recorded:

Variable Description
A Does the person smoke?
B Is the person’s work strenuous mentally?
C Is the person’s work strenuous physically?
D Systolic blood pressure < 140mm?
E Ratio of beta to alpha lipoproteins < 3?
F Is there a family history of coronary heart disease?



Example Analysis
For learning Bayesian networks, we use the bnlearn package in R.
Hill-climbing with the BIC score function:
> coronary.hc <- hc(coronary)
> plot(coronary.hc)

[Figure: the network structure learned for the coronary data, as plotted by plot(coronary.hc).]
The Search Process
> coronary.hc <- hc(coronary, debug=T)
----------------------------------------------------------------
* starting from the following network:
model:
[A][B][C][D][E][F]

* current score: -7061.714


* caching score delta for arc A -> B (17.531166).
* caching score delta for arc A -> C (9.981480).
* caching score delta for arc A -> D (1.757126).
* caching score delta for arc A -> E (4.941129).
* caching score delta for arc A -> F (-3.224701).
* caching score delta for arc B -> C (264.272873).
* caching score delta for arc B -> D (2.313656).
* caching score delta for arc B -> E (21.030213).
* caching score delta for arc B -> F (2.303571).
* caching score delta for arc C -> D (-3.711314).
* caching score delta for arc C -> E (4.577177).
* caching score delta for arc C -> F (-3.673929).
* caching score delta for arc D -> E (2.645583).
* caching score delta for arc D -> F (-3.197133).
* caching score delta for arc E -> F (-2.257169).



The Search Process

The initial model (the mutual independence model [A][B][C][D][E][F]) has a BIC score of -7061.714.
The output gives the change in score between the current model and its neighbors.
Why is the score of only 15 of the 30 neighbors computed (e.g. A -> B, but not B -> A)?
A -> B and B -> A are Markov equivalent, and therefore have the same score.
Adding B -> C results in the largest increase in score, so we move to that neighbor.



The Search Process

* best operation was: adding B -> C .


* current network is :

model:
[A][B][D][E][F][C|B]

* current score: -6797.441


* caching score delta for arc A -> C (9.975823).
* caching score delta for arc B -> C (-264.272873).
* caching score delta for arc D -> C (-1.472731).
* caching score delta for arc E -> C (-6.587044).
* caching score delta for arc F -> C (-6.059896).



The Search Process

We don't have to recompute the change in score caused by, for example, adding A -> B, because the parent set of B is the same as in the previous iteration. Therefore, adding A -> B now will cause the same score change as in the previous iteration.
Only the parent set of C has changed, so we just have to recompute the change in score caused by adding arcs X -> C.
Adding B -> E results in the largest increase in score, so we move to that neighbor. The current model becomes: [A][B][D][F][C|B][E|B].



Final Model and its Essential Graph

[Figure: two graphs side by side over nodes A, B, C, D, E, F: the final model (left) and its essential graph (right).]



Example of Model Use

# estimate parameters for selected model structure
> coronary.hc.fit <- bn.fit(coronary.hc,coronary.dat,"mle")
# predict B from remaining variables
> coronary.hc.pred <- predict(coronary.hc.fit,node="B",
data=coronary.dat)
# make confusion matrix
> table(coronary.dat$B,coronary.hc.pred)
coronary.hc.pred
no yes
no 944 186
yes 208 503
> (944+503)/1841
[1] 0.7859859
> (944+186)/1841
[1] 0.6137968



Bayesian Networks as Classifiers

[Figure: a Bayesian network with class variable C and attributes A1, ..., A8.]

Markov Blanket: Parents, Children and Parents of Children.



Markov Blanket of C : Moral Graph

[Figure: the moral graph of the network above, with the Markov blanket of C highlighted.]

Markov Blanket: Parents, Children and Parents of Children.

Local Markov property: C ⊥⊥ rest | boundary(C)
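In bnlearn the Markov blanket of a node in a learned structure can be read off directly; a sketch, for the coronary model learned earlier:

# parents, children, and other parents of children of B
mb(coronary.hc, node = "B")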
Right Heart Catheterization Data: Variable Description

1. cat1: primary disease category (9 different values)
2. death: did the patient die within 180 days after admission to ICU?
3. swang1: was right heart catheterization (Swan-Ganz catheter) performed within the first 24 hours?
4. gender: male/female
5. race: black/white/other
6. ninsclas: type of medical insurance of patient (six different values)
7. income: income of patient, divided into 4 categories
8. ca: cancer status (yes/no/metastatic)
9. age: age of patient, divided into 5 categories
10. meanbp1: mean blood pressure of patient, divided into 2 categories



Right Heart Catheterization Data: Descriptive Statistics
> summary(rhc.dat)
cat1 death swang1 gender race
ARF :2490 No :2013 No RHC:3551 Female:2543 black: 920
MOSF w/Sepsis :1227 Yes:3722 RHC :2184 Male :3192 other: 355
COPD : 457 white:4460
CHF : 456
Coma : 436
MOSF w/Malignancy: 399
(Other) : 270

ninsclas income ca age


Medicaid : 647 $11-$25k :1165 Metastatic: 384 (50,60] : 917
Medicare :1458 $25-$50k : 893 No :4379 (60,70] :1390
Medicare & Medicaid: 374 > $50k : 451 Yes : 972 (70,80] :1337
No insurance : 322 Under $11k:3226 (80,102]: 667
Private :1698 [18,50] :1424
Private & Medicare :1236

meanbp1
(85,259]:1975
[0,85] :3760



Learning the Graph Structure

# load Bayesian network library
> library(bnlearn)

# load library for graph visualization
> library(Rgraphviz)

# use hill climbing with BIC scoring,
# starting from empty graph
> rhc.bn <- hc(rhc.dat)

# plot the model structure
> plot(as(amat(rhc.bn),"graphNEL"))



Graph Structure for Right Heart Catheterization Data

[Figure: the learned graph structure for the right heart catheterization data, over race, ninsclas, gender, income, age, cat1, swang1, ca, meanbp1, and death.]


Performing Inference

# estimate the network parameters
> rhc.bn.fit <- bn.fit(rhc.bn,data=rhc.dat)

# perform sampling based inference:
# probability of death for metastatic cancer and mean blood pressure > 85
> cpquery(rhc.bn.fit,event=death=="Yes",evidence=
ca=="Metastatic" & meanbp1=="(85,259]",n=100000)
[1] 0.9033019

# probability of death for no cancer and mean blood pressure > 85
> cpquery(rhc.bn.fit,event=death=="Yes",evidence=
ca=="No" & meanbp1=="(85,259]",n=100000)
[1] 0.6020206



Combining Data and Prior Knowledge
An expert studies the graph and argues that the edge from swang1 to
meanbp1 is in the wrong direction, since the blood pressure influences the
decision to apply right heart catheterization, not the other way around.

Can we turn the edge around without changing the “meaning” of the
network, i.e. without changing the conditional independencies expressed by
the graph?

[Figure: the learned graph structure for the right heart catheterization data, repeated from the previous slide.]



Combining Data and Prior Knowledge
Common sense suggests that the variables can be divided into a number of ordered blocks, where arrows are not allowed to point from a variable in a higher block to a variable in a lower block.

As an example, consider the following block structure:
1. race, gender
2. age, income
3. ninsclas
4. cat1, ca, meanbp1
5. swang1
6. death

We can use the blacklist parameter to avoid edges pointing from higher blocks to lower blocks, as in the sketch below.



Learning with a Blacklist

# learn structure with blacklist
> rhc.bn.ord <- hc(rhc.dat,blacklist=blackL)

# has the score become much worse?
> score(rhc.bn.ord,rhc.dat)
[1] -54059.03
> score(rhc.bn,rhc.dat)
[1] -53749.15

# have inferences changed much?
> rhc.bn.ord.fit <- bn.fit(rhc.bn.ord,data=rhc.dat)

> cpquery(rhc.bn.ord.fit,event=death=="Yes",evidence=
ca=="Metastatic" & meanbp1=="(85,259]",n=100000)
[1] 0.9039467
> cpquery(rhc.bn.ord.fit,event=death=="Yes",evidence=
ca=="No" & meanbp1=="(85,259]",n=100000)
[1] 0.610249



The Blacklist
The blacklist simply enumerates all the forbidden edges:

> blackL
X1 X2
1 cat1 gender
2 cat1 race
3 cat1 ninsclas
4 cat1 income
5 cat1 age
6 death cat1
7 death swang1
etc.



Graph Structure Learned with Blacklist

[Figure: the graph structure learned with the blacklist, over the same ten variables.]
