Data Mining - Utrecht University - 11. Slides
Ad Feelders
Universiteit Utrecht
p̂(xi | xpa(i) ) = n(xi , xpa(i) ) / n(xpa(i) )
For example

p̂3|1,2 (1|1, 2) = n(x1 = 1, x2 = 2, x3 = 1) / n(x1 = 1, x2 = 2) = 0/2 = 0
Then we get

p̂ˢ3|1,2 (1|1, 2) = (2 × 0 + 2 × 0.4) / (2 + 2) = 0.2,

since p̂3 (1) = 0.4.
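The arithmetic above can be sketched as a small function. Note the prior sample size of 2 (the weight on the marginal estimate) is an assumption read off the slide's arithmetic, not stated explicitly:

```python
# Smoothed conditional probability estimate: the MLE p̂3|1,2(1|1,2) = 0/2 = 0
# is shrunk toward the marginal p̂3(1) = 0.4.
def smoothed_estimate(n_joint, n_parent, p_marginal, prior_n=2):
    # prior_n = 2 is an assumption inferred from the slide's numbers
    mle = n_joint / n_parent
    return (n_parent * mle + prior_n * p_marginal) / (n_parent + prior_n)

print(smoothed_estimate(0, 2, 0.4))  # 0.2
```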
P(X1 , X2 , X3 , X4 ) = p1 (X1 ) p2 (X2 ) p3|12 (X3 |X1 , X2 ) p4|3 (X4 |X3 )

obs  X1  X2  X3  X4   likelihood contribution
  1   1   1   1   1   p1 (1) p2 (1) p3|12 (1|1, 1) p4|3 (1|1)
  2   1   1   1   1   p1 (1) p2 (1) p3|12 (1|1, 1) p4|3 (1|1)
  3   1   1   2   1   p1 (1) p2 (1) p3|12 (2|1, 1) p4|3 (1|2)
  4   1   2   2   1   p1 (1) p2 (2) p3|12 (2|1, 2) p4|3 (1|2)
  5   1   2   2   2   p1 (1) p2 (2) p3|12 (2|1, 2) p4|3 (2|2)
  6   2   1   1   2   p1 (2) p2 (1) p3|12 (1|2, 1) p4|3 (2|1)
  7   2   1   2   3   p1 (2) p2 (1) p3|12 (2|2, 1) p4|3 (3|2)
  8   2   1   2   3   p1 (2) p2 (1) p3|12 (2|2, 1) p4|3 (3|2)
  9   2   2   2   3   p1 (2) p2 (2) p3|12 (2|2, 2) p4|3 (3|2)
 10   2   2   1   3   p1 (2) p2 (2) p3|12 (1|2, 2) p4|3 (3|1)
L(D) = p1 (1)^5 (1 − p1 (1))^5 · p2 (1)^6 (1 − p2 (1))^4
  · p3|1,2 (1|1, 1)^2 (1 − p3|1,2 (1|1, 1)) · (1 − p3|1,2 (1|1, 2))^2
  · p3|1,2 (1|2, 1) (1 − p3|1,2 (1|2, 1))^2 · p3|1,2 (1|2, 2) (1 − p3|1,2 (1|2, 2))
  · p4|3 (1|1)^2 p4|3 (2|1) (1 − p4|3 (1|1) − p4|3 (2|1))
  · p4|3 (1|2)^2 p4|3 (2|2) (1 − p4|3 (1|2) − p4|3 (2|2))^3
Or in log form, plugging in the maximum likelihood estimates

p̂(xi | xpa(i) ) = n(xi , xpa(i) ) / n(xpa(i) ),

we get

L = Σ_{i=1}^{k} Σ_{xi , xpa(i)} n(xi , xpa(i) ) log [ n(xi , xpa(i) ) / n(xpa(i) ) ]
Scoring functions:

AIC(M) = LM − dim(M)
BIC(M) = LM − (log n / 2) · dim(M)

where LM is the log-likelihood score of model M and dim(M) is the number of parameters of M.
BIC gives a higher penalty for model complexity as soon as (log n)/2 > 1, i.e. for n > 7, so it tends to lead to less complex models than AIC.
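A quick check of these two scores and of the penalty crossover claim (a sketch; `LM`, `dim` and `n` are just the quantities defined above):

```python
from math import log

# AIC and BIC scores as defined on the slide (higher score = better model)
def aic(LM, dim):
    return LM - dim

def bic(LM, dim, n):
    return LM - (log(n) / 2) * dim

# BIC's per-parameter penalty log(n)/2 exceeds AIC's penalty of 1
# exactly when log(n)/2 > 1, which first happens at n = 8.
print(log(7) / 2 < 1 < log(8) / 2)  # True
```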
obs X1 X2 X3 X4
1 1 1 1 1
2 1 1 1 1
3 1 1 2 1
4 1 2 2 1
5 1 2 2 2
6 2 1 1 2
7 2 1 2 3
8 2 1 2 3
9 2 2 2 3
10 2 2 1 3
Score node 1 = 5 log(5/10) + 5 log(5/10)
Score node 2 = 6 log(6/10) + 4 log(4/10)
L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09
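The decomposed score can be verified directly from the counts. A minimal sketch, computing each node's contribution from the 10 observations above:

```python
from collections import Counter
from math import log

# The 10 observations of the running example, rows are (x1, x2, x3, x4).
data = [(1,1,1,1),(1,1,1,1),(1,1,2,1),(1,2,2,1),(1,2,2,2),
        (2,1,1,2),(2,1,2,3),(2,1,2,3),(2,2,2,3),(2,2,1,3)]

# Parent sets for the DAG X1 -> X3 <- X2, X3 -> X4 (0-based indices).
parents = {0: (), 1: (), 2: (0, 1), 3: (2,)}

def node_score(i, pa):
    """Maximized log-likelihood term of node i:
    sum over (x_i, x_pa) of n(x_i, x_pa) * log(n(x_i, x_pa) / n(x_pa))."""
    joint = Counter((row[i],) + tuple(row[j] for j in pa) for row in data)
    par = Counter(tuple(row[j] for j in pa) for row in data)
    return sum(n * log(n / par[key[1:]]) for key, n in joint.items())

L = sum(node_score(i, pa) for i, pa in parents.items())
print(round(L, 2))  # -29.09
```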
When we add an edge from X1 to X2 , only the parent set of node 2 changes.
Therefore, only the score of node 2 has to be recomputed.
L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09
L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6)   (node 4)
  ≈ −29.09
When we add an edge from X1 to X4 , only the parent set of node 4 changes.
Therefore, only the score of node 4 has to be recomputed.
L = 5 log(5/10) + 5 log(5/10)   (node 1)
  + 6 log(6/10) + 4 log(4/10)   (node 2)
  + 2 log(2/3) + log(1/3) + 2 log 1 + log(1/3) + 2 log(2/3) + log(1/2) + log(1/2)   (node 3)
  + 2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1   (node 4)
  ≈ −22.16
The number of parameters of the model is

dim(M) = Σ_{i=1}^{k} (d_i − 1) · Π_{j ∈ pa(i)} d_j ,

where d_i is the number of values of X_i .
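A minimal sketch of this parameter count, applied to the running example (X1 , X2 , X3 binary, X4 ternary, DAG X1 → X3 ← X2 , X3 → X4 ):

```python
from math import prod

# dim(M) = sum over nodes i of (d_i - 1) * prod of d_j over parents j of i
def dim_model(card, parents):
    """card[i] = number of values of X_i; parents[i] = parent set of node i."""
    return sum((card[i] - 1) * prod(card[j] for j in parents[i])
               for i in card)

card = {1: 2, 2: 2, 3: 2, 4: 3}
parents = {1: (), 2: (), 3: (1, 2), 4: (3,)}
print(dim_model(card, parents))  # 1 + 1 + 4 + 4 = 10
```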
∆Score(add(X1 → X2 )) = (3 log(3/5) + 2 log(2/5) + 3 log(3/5) + 2 log(2/5))
  − (6 log(6/10) + 4 log(4/10)) = 0
∆Score(add(X1 → X4 )) = (2 log 1 + 2 log(2/3) + log(1/3) + log(1/2) + log(1/2) + 3 log 1)
  − (2 log(2/4) + log(1/4) + log(1/4) + 2 log(2/6) + log(1/6) + 3 log(3/6))
  ≈ 6.93
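Both score differences can be checked with the local-score function from before; only the child node's term is recomputed:

```python
from collections import Counter
from math import log

data = [(1,1,1,1),(1,1,1,1),(1,1,2,1),(1,2,2,1),(1,2,2,2),
        (2,1,1,2),(2,1,2,3),(2,1,2,3),(2,2,2,3),(2,2,1,3)]

def node_score(i, pa):
    # maximized log-likelihood contribution of node i with parent set pa
    joint = Counter((row[i],) + tuple(row[j] for j in pa) for row in data)
    par = Counter(tuple(row[j] for j in pa) for row in data)
    return sum(n * log(n / par[key[1:]]) for key, n in joint.items())

# Adding an edge changes only the child's parent set, so each gain is a
# difference of two local scores (0-based: X1=0, X2=1, X3=2, X4=3).
delta_12 = node_score(1, (0,)) - node_score(1, ())      # add X1 -> X2
delta_14 = node_score(3, (2, 0)) - node_score(3, (2,))  # add X1 -> X4
print(round(abs(delta_12), 4), round(delta_14, 2))
```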
Essential Graph: [figure: equivalent DAGs over nodes 1, 2, 3 and their essential graph]
We analyze a data set concerning risk factors for coronary heart disease.
For a sample of 1841 car-workers, the following information was recorded:
Variable Description
A Does the person smoke?
B Is the person’s work strenuous mentally?
C Is the person’s work strenuous physically?
D Systolic blood pressure < 140 mm Hg?
E Ratio of beta to alpha lipoproteins < 3?
F Is there a family history of coronary heart disease?
The Search Process
> coronary.hc <- hc(coronary, debug=T)
----------------------------------------------------------------
* starting from the following network:
model:
[A][B][C][D][E][F]
model:
[A][B][D][E][F][C|B]
meanbp1: (85,259]: 1975 observations, [0,85]: 3760 observations
[figure: network over race, ninsclas, cat1, swang1, ca, meanbp1, death]

Can we turn the edge around without changing the “meaning” of the network, i.e. without changing the conditional independencies expressed by the graph?

[figure: network over race, ninsclas, cat1, swang1, ca, meanbp1, death]
> cpquery(rhc.bn.ord.fit, event = death == "Yes",
+   evidence = ca == "Metastatic" & meanbp1 == "(85,259]", n = 100000)
[1] 0.9039467
> cpquery(rhc.bn.ord.fit, event = death == "Yes",
+   evidence = ca == "No" & meanbp1 == "(85,259]", n = 100000)
[1] 0.610249
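`cpquery` estimates these probabilities by sampling from the fitted network and keeping only the samples that match the evidence. A sketch of that idea on the 4-variable running example with ML-fitted CPTs (the RHC data itself is not reproduced here, so the query below is a stand-in):

```python
import random
from collections import Counter

random.seed(0)
data = [(1,1,1,1),(1,1,1,1),(1,1,2,1),(1,2,2,1),(1,2,2,2),
        (2,1,1,2),(2,1,2,3),(2,1,2,3),(2,2,2,3),(2,2,1,3)]
parents = {0: (), 1: (), 2: (0, 1), 3: (2,)}  # X1 -> X3 <- X2, X3 -> X4

# cpt[i][parent_config][value] = ML estimate of p(value | parent_config)
cpt = {i: {} for i in parents}
for i, pa in parents.items():
    par = Counter(tuple(r[j] for j in pa) for r in data)
    joint = Counter((tuple(r[j] for j in pa), r[i]) for r in data)
    for (cfg, v), n in joint.items():
        cpt[i].setdefault(cfg, {})[v] = n / par[cfg]

def sample():
    # forward-sample one joint configuration in topological order
    x = {}
    for i in (0, 1, 2, 3):
        dist = cpt[i][tuple(x[j] for j in parents[i])]
        x[i] = random.choices(list(dist), weights=dist.values())[0]
    return x

# Estimate P(X4 = 3 | X1 = 2) by rejection sampling; the exact value
# under the fitted network is 0.4.
hits = total = 0
for _ in range(100_000):
    x = sample()
    if x[0] == 2:          # evidence
        total += 1
        hits += (x[3] == 3)  # event
print(round(hits / total, 3))
```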
> blackL
X1 X2
1 cat1 gender
2 cat1 race
3 cat1 ninsclas
4 cat1 income
5 cat1 age
6 death cat1
7 death swang1
etc.
[figure: network learned over race, gender, age, cat1, income, meanbp1, ca, ninsclas, swang1, death]