
Lecture 15

Model selection/ structure learning

Koller & Friedman chapter 14


MacKay chapter 28

Kevin Murphy

8 November 2004
Structure learning: why?

• We often want to learn the structure of the graphical model:


– Scientific discovery (data mining)
– Use a good model for prediction, compression, classification etc.
• Often there may be more than one good model
– Look for features that they all share
– Average predictions over models
Learning gene regulatory pathways
Structure learning: how?

• Constraint-based approach:
– Assume some way of testing conditional independencies
X1 ⊥ X2|X3
– Then construct model consistent with these results
• Search-and-score approach:
– Define a scoring function for measuring model quality (e.g., marginal
likelihood or penalized likelihood)
– Use a search algorithm to find a (local) maximum of the score
Identifiability

• DAGs are I-equivalent if they encode the same set of conditional
independencies; e.g., X → Y → Z and X ← Y ← Z are
indistinguishable given just observational data.
• However, X → Y ← Z has a v-structure, which has a unique
statistical signature. Hence some arc directions can be inferred from
passive observation.
• The set of I-equivalent DAGs can be represented by a PDAG (partially
directed acyclic graph).
• Distinguishing between members of an equivalence class requires
interventions/experiments.
Constraint-based approach

• The build-PDAG algorithm from K&F chapter 3 can recover the true
DAG up to I-equivalence in O(N^3 2^d) time if we make the following
assumptions:
– The maximum fan-in (number of parents) of any node is d
– The independence test oracle can handle up to 2d + 2 variables
– The underlying distribution P* is faithful to G*, i.e., there are no
spurious independencies that are not sanctioned by G* (G* is a
P-map of P*).
• This is often called the IC or PC algorithm.
Constraint-based approach

• Bad
– Faithfulness assumption rules out certain CPDs like noisy-OR.
– Hard to make a reliable independence test (especially given small
data sets) which does not make too many errors (either false
positives or false negatives).
– One misleading independence test result can cause multiple errors
in the resulting PDAG, so overall the approach is not very robust
to noise.
• Good
– PC algorithm is less dumb than local search
Independence tests

• An independence test X ⊥ Y seeks to accept or reject the null
hypothesis H0 that P*(X, Y) = P*(X) P*(Y).
• We need a decision rule that maps data to accept/reject.
• We define a scalar measure of deviance d(D) from the null hypothesis.
• The p-value of a threshold t is the probability of falsely rejecting the
null hypothesis:
p(t) = P({D : d(D) > t} | H0, N)
• Note that we need to know the size of the data set N (stopping rule)
ahead of time!
• We usually choose a threshold t so that the probability of a false
rejection is below some significance level α = 0.05.
Independence tests

• For discrete data, a common deviance is the χ² statistic, which
measures how far the counts are from what we would expect given
independence:
d_χ²(D) = Σ_{x,y} (O_{x,y} − E_{x,y})² / E_{x,y} = Σ_{x,y} (N(x, y) − N P(x) P(y))² / (N P(x) P(y))
• The p-value requires summing over all datasets of size N:
p(t) = P({D : d(D) > t} | H0, N)
• Since this is expensive in general, a standard approximation is to
consider the expected distribution of d(D) (under the null hypothesis)
as N → ∞, and use this to define thresholds to achieve a given
significance.
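
As an illustration, here is a minimal MATLAB sketch (with hypothetical counts) that computes the χ² deviance for a 2×2 contingency table and the asymptotic p-value:

counts = [30 20; 10 40];            % hypothetical counts N(x, y)
N = sum(counts(:));
Px = sum(counts, 2) / N;            % marginal P(x) (column vector)
Py = sum(counts, 1) / N;            % marginal P(y) (row vector)
E = N * (Px * Py);                  % expected counts N P(x) P(y) under H0
d = sum(sum((counts - E).^2 ./ E)); % chi^2 deviance
dof = (size(counts,1) - 1) * (size(counts,2) - 1);
pval = 1 - chi2cdf(d, dof)          % asymptotic p-value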
Example of classical hypothesis testing

• When spun on edge N = 250 times, a Belgian one-euro coin came
up heads Y = 140 times and tails 110.
• We would like to distinguish two models, or hypotheses: H0 means
the coin is unbiased (so p = 0.5); H1 means the coin is biased (has
probability of heads p ≠ 0.5).
• The p-value is "less than 7%": p = P(Y ≥ 140) + P(Y ≤ 110) = 0.066:
n = 250; p0 = 0.5; y = 140;
% p-value = P(Y >= y) + P(Y <= n-y) under the null p0 = 0.5
pval = (1 - binocdf(y-1, n, p0)) + binocdf(n-y, n, p0)
• If Y = 141, we get p = 0.0497, so we can reject the null hypothesis
at significance level 0.05.
• But is the coin really biased?
Bayesian approach

• We want to compute the posterior ratio of the 2 hypotheses:
P(H1|D) / P(H0|D) = P(D|H1) P(H1) / (P(D|H0) P(H0))
• Let us assume a uniform prior P(H0) = P(H1) = 0.5.
• Then we just focus on the ratio of the marginal likelihoods:
P(D|H1) = ∫₀¹ dθ P(D|θ, H1) P(θ|H1)
• For H0, there is no free parameter, so
P(D|H0) = 0.5^N
where N is the number of coin tosses in D.
Parameter prior

• How to compute P(D|H1)?
• Let us assume a beta prior on the coin bias θ:
P(θ|α, H1) = β(θ; α_h, α_t) = θ^{α_h − 1} (1 − θ)^{α_t − 1} / Z(α_h, α_t)
where
Z(α_h, α_t) = ∫₀¹ dθ θ^{α_h − 1} (1 − θ)^{α_t − 1} = Γ(α_h) Γ(α_t) / Γ(α_h + α_t)
• Γ(n) = (n − 1)! for positive integers.
• Mean: E[θ] = α_h / (α_h + α_t).
• If we set α_h = α_t = 1, we get a uniform prior (and Z = 1).
Parameter posterior

• Suppose we see N_h heads and N_t tails. The parameter posterior is
P(θ|D, α) = P(θ|α) P(D|θ, α) / P(D|α)
= (1 / P(D|α)) · (1 / Z(α_h, α_t)) · θ^{α_h − 1} (1 − θ)^{α_t − 1} · θ^{N_h} (1 − θ)^{N_t}
= β(θ; α_h + N_h, α_t + N_t)
Parameter posterior - small sample, uniform prior

[Figure: prior β(θ; 1, 1), likelihood, and posterior, for (N_h, N_t) = (1, 0), (1, 1), (10, 1), (10, 5), (10, 10).]
Parameter posterior - small sample, strong prior

[Figure: prior β(θ; 10, 10), likelihood, and posterior, for the same (N_h, N_t) settings.]
Parameter posterior - coin data, uniform prior

[Figure: prior β(θ; 1, 1), likelihood, and posterior, for (N_h, N_t) = (140, 110) and (125, 125).]

Nh = 140; Nt = 110;                           % coin data
thetas = 0:0.01:1;
alphaH = 1; alphaT = 1;
prior = betapdf(thetas, alphaH, alphaT);
lik = thetas.^Nh .* (1-thetas).^Nt;           % unnormalized likelihood
post = betapdf(thetas, alphaH+Nh, alphaT+Nt); % conjugate posterior
Model evidence

• Suppose we see N_h heads and N_t tails. The parameter posterior is
P(θ|D, α) = P(θ|α) P(D|θ, α) / P(D|α) = β(θ; α_h + N_h, α_t + N_t)
where the marginal likelihood (evidence) is
P(D|α) = Z(α_h + N_h, α_t + N_t) / Z(α_h, α_t)
= [Γ(α) / Γ(α + N)] · [Γ(α_h + N_h) / Γ(α_h)] · [Γ(α_t + N_t) / Γ(α_t)]
with α = α_h + α_t and N = N_h + N_t.
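
A minimal MATLAB sketch (with small hypothetical counts) checking this closed form against numerical integration of likelihood × prior:

ah = 1; at = 1; Nh = 3; Nt = 2;     % hypothetical small counts
logZ = @(a, b) gammaln(a) + gammaln(b) - gammaln(a + b);  % log beta function
ev_closed = exp(logZ(ah + Nh, at + Nt) - logZ(ah, at))    % closed form
f = @(th) th.^(Nh + ah - 1) .* (1 - th).^(Nt + at - 1) ./ exp(logZ(ah, at));
ev_numeric = integral(f, 0, 1)      % agrees with the closed form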
Sequentially evaluating the evidence

• By the chain rule of probability,
P(x_{1:N}) = P(x_1) P(x_2|x_1) P(x_3|x_{1:2}) · · ·
• Also, after N data cases, P(θ|D_{1:N}) = Dir(α⃗ + N⃗), so
P(X = k | D_{1:N}, α⃗) = (N_k + α_k) / Σ_i (N_i + α_i) = (N_k + α_k) / (N + α)
• Suppose D = H, T, T, H, H. Then
P(D) = (α_h/α) · (α_t/(α+1)) · ((α_t+1)/(α+2)) · ((α_h+1)/(α+3)) · ((α_h+2)/(α+4))
= [α_h (α_h+1) (α_h+2)] · [α_t (α_t+1)] / [α (α+1) · · · (α+4)]
= [(α_h) · · · (α_h + N_h − 1)] · [(α_t) · · · (α_t + N_t − 1)] / [(α) · · · (α + N − 1)]
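
A minimal MATLAB sketch of this sequential computation for D = H, T, T, H, H, compared against the closed form:

ah = 1; at = 1; D = [1 0 0 1 1];    % 1 = heads, 0 = tails
p = 1; nh = 0; nt = 0;
for x = D
  denom = ah + at + nh + nt;        % pseudo-counts plus counts so far
  if x == 1, p = p * (ah + nh) / denom; nh = nh + 1;
  else,      p = p * (at + nt) / denom; nt = nt + 1;
  end
end
p                                               % sequential evidence (= 1/60 here)
exp(betaln(ah + nh, at + nt) - betaln(ah, at))  % closed form; identical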
Model evidence

• For integers,
(a)(a+1) · · · (a+M−1) = (a+M−1)! / (a−1)!
= [(a+M−1)(a+M−2) · · · (a)(a−1) · · · 2 · 1] / [(a−1)(a−2) · · · 2 · 1]
• For reals, we replace (a − 1)! with Γ(a).
• Hence
P(D) = [(α_h) · · · (α_h + N_h − 1)] · [(α_t) · · · (α_t + N_t − 1)] / [(α) · · · (α + N − 1)]
= [Γ(α) / Γ(α + N)] · [Γ(α_h + N_h) / Γ(α_h)] · [Γ(α_t + N_t) / Γ(α_t)]
Ratio of evidences (Bayes factor)

• We compute the ratio of marginal likelihoods (evidences):
P(H1|D) / P(H0|D) = P(D|H1) / P(D|H0) = [Z(α_h + N_h, α_t + N_t) / Z(α_h, α_t)] · (1 / 0.5^N)
With α_h = α_t = α:
= [Γ(140 + α) Γ(110 + α) / Γ(250 + 2α)] · [Γ(2α) / (Γ(α) Γ(α))] · 2^250
• Must work in the log domain!
alphas = [0.37 1 2.7 7.4 20 55 148 403 1096];
Nh = 140; Nt = 110; N = Nh + Nt;
numer = gammaln(Nh+alphas) + gammaln(Nt+alphas) + gammaln(2*alphas);
denom = gammaln(N+2*alphas) + 2*gammaln(alphas);
r = exp(numer - denom + N*log(2));   % ratio P(D|H1)/P(D|H0), in log domain
So, is the coin biased or not?
• We plot the likelihood ratio vs hyperparameter α:
[Figure: P(D|H1)/P(D|H0) vs. hyperparameter α.]

• For a uniform prior (α = 1), P(H1|D)/P(H0|D) = 0.48, (weakly)
favoring the fair coin hypothesis H0!
• At best, for α = 50, we can make the biased hypothesis twice as
likely.
• Not as dramatic as saying "we reject the null hypothesis (fair coin)
with significance 6.6%".
From coins to dice

• Likelihood: binomial → multinomial
P(D|θ⃗) = Π_i θ_i^{N_i}
• Prior: beta → Dirichlet
P(θ⃗|α⃗) = (1 / Z(α⃗)) Π_i θ_i^{α_i − 1}
where
Z(α⃗) = Π_i Γ(α_i) / Γ(Σ_i α_i)
• Posterior: beta → Dirichlet
P(θ⃗|D) = Dir(α⃗ + N⃗)
• Evidence (marginal likelihood)
P(D|α⃗) = Z(α⃗ + N⃗) / Z(α⃗) = [Π_i Γ(α_i + N_i) / Γ(α_i)] · [Γ(Σ_i α_i) / Γ(Σ_i α_i + N_i)]
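
A minimal MATLAB sketch of this evidence for a six-sided die (hypothetical counts), computed in the log domain:

alpha = ones(1, 6);                 % uniform Dirichlet prior
counts = [10 8 12 9 11 10];         % hypothetical face counts N_i
logZ = @(a) sum(gammaln(a)) - gammaln(sum(a));
logev = logZ(alpha + counts) - logZ(alpha)   % log P(D | alpha)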
From dice to tabular Bayes nets

• If we assume global parameter independence, the evidence decomposes
into one term per node:
P(D|G) = Π_i P(D(X_i, X_{π_i}) | α⃗_i)
• If we also assume local parameter independence, each node term
decomposes into a product over rows (conditioning cases):
P(D|G) = Π_i Π_{k ∈ Val(π_i)} P(D(X_i, X_{π_i} = k) | α⃗_{i,·,k})
= Π_i Π_{k ∈ Val(π_i)} Z(α⃗_{i,·,k} + N⃗_{i,·,k}) / Z(α⃗_{i,·,k})
= Π_i Π_{k ∈ Val(π_i)} [Π_j Γ(α_{ijk} + N_{ijk}) / Γ(α_{ijk})] · [Γ(Σ_j α_{ijk}) / Γ(Σ_j α_{ijk} + N_{ijk})]
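
A minimal MATLAB sketch of a single node's term (hypothetical counts), summing log Dirichlet evidences over the conditioning contexts k:

Nijk  = [20 5; 10 15; 5 25];        % hypothetical counts: rows j, contexts k
alpha = ones(size(Nijk));           % flat pseudo-counts
logZ = @(a) sum(gammaln(a)) - gammaln(sum(a));
logfam = 0;
for k = 1:size(Nijk, 2)
  logfam = logfam + logZ(alpha(:,k) + Nijk(:,k)) - logZ(alpha(:,k));
end
logfam                              % log P(D(X_i, X_pi) | alpha_i)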
Example of model selection

• Suppose we generate data from X → Y, where P(X = 0) =
P(X = 1) = 0.5 and
P(Y = 1|X = 0) = 0.5 − ε,  P(Y = 1|X = 1) = 0.5 + ε.
• As we increase ε, we increase the dependence of Y on X.
• Let us consider 3 hypotheses: H0 = X ⊥ Y, H1 = X → Y,
H2 = X ← Y, and use uniform priors.
• We will plot model posteriors vs N for different ε and different random
trials (see the sketch below):
P(Hi|D_{1:N}) = P(D_{1:N}|Hi) P(Hi) / Σ_j P(D_{1:N}|Hj) P(Hj)
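
A minimal MATLAB sketch of this normalization, using hypothetical log evidences for the three hypotheses:

logev = [-70.1 -68.9 -68.9];        % hypothetical log P(D|H0), ..., log P(D|H2)
logprior = log([1 1 1] / 3);        % uniform prior
logpost = logev + logprior;
logpost = logpost - max(logpost);   % shift for numerical stability
post = exp(logpost) / sum(exp(logpost))   % normalized model posteriors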
Example of model selection
red = H0 (independence), blue/green = H1/H2 (dependence).
See BNT/examples/static/StructLearn/model-select1.m.
[Figure: model posteriors vs. N for ε ∈ {0.05, 0.10, 0.15, 0.20} (columns) and random seeds 1-3 (rows).]
Score equivalence

• X → Y and X ← Y are I-equivalent (have the same likelihood).
• Suppose we use a uniform Dirichlet prior for each node in each graph,
with equivalent sample size α (K2 prior):
P(θ_X|H1) = Dir(α, α),  P(θ_{X|Y=i}|H2) = Dir(α, α)
• In H1, the equivalent sample size for X is 2α, but in H2 it is 4α
(since there are two conditioning contexts). Hence the posterior
probabilities are different.
• The BDe (Bayesian Dirichlet likelihood equivalent) prior uses weights
α_{X_i|X_{π_i}} = α P0(X_i, X_{π_i}), where P0 could be represented
by, e.g., a Bayes net.
• The BDeu (uniform) prior is P0(X_i, X_{π_i}) = 1 / (|X_i| |X_{π_i}|).
• Using the BDeu prior, the curves for X → Y and X ← Y are
indistinguishable. Using the K2 prior, they are not.
Bayesian Occam’s razor
• Why is P (H0|D) higher when then dependence on X and Y is weak
(small )?
• It is not because the prior P (Hi) explicitly favors simpler models
(although this is possible).
R
• It because the evidence P (D) = dwP (D|w)P (w), automatically
penalizes complex models.
• Occam’s razor says “If two models are equally predictive, prefer the
simpler one”.
• This is an automatic consequence of using Bayesian model selection.
• Maximum likelihood would always pick the most complex model,
since it has more parameters, and hence can fit the training data
better.
• Good test for a learning algorithm: feed it random noise, see if it
“discovers” structure!
Laplace approximation to the evidence

• Consider a large sample approximation, where the parameter posterior
becomes peaked.
• Take a second order Taylor expansion around θ̂_MP:
log P(θ|D) ≈ log P(θ̂_MP|D) − (1/2) (θ − θ̂)ᵀ H (θ − θ̂)
where
H = − ∂² log P(θ|D) / ∂θ ∂θᵀ, evaluated at θ̂_MP,
is the Hessian.
• By properties of Gaussian integrals,
P(D) ≈ ∫ dθ P(D|θ̂) P(θ̂) exp(−(1/2) (θ − θ̂)ᵀ H (θ − θ̂))
= P(D|θ̂) P(θ̂) (2π)^{d/2} |H|^{−1/2}
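
A minimal MATLAB sketch (1d Bernoulli example with a uniform Beta(1,1) prior) comparing the Laplace approximation to the exact log evidence:

Nh = 140; Nt = 110;
logjoint = @(th) Nh*log(th) + Nt*log(1 - th);  % log P(D|theta); uniform prior adds 0
thetaHat = Nh / (Nh + Nt);                     % posterior mode
H = Nh / thetaHat^2 + Nt / (1 - thetaHat)^2;   % 1d Hessian at the mode
logZ_laplace = logjoint(thetaHat) + 0.5*log(2*pi) - 0.5*log(H)
logZ_exact = betaln(1 + Nh, 1 + Nt) - betaln(1, 1)  % exact, for comparison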
Occam factor

• H is like the precision (inverse covariance) of a Gaussian.
• In the 1d case, |H|^{−1/2} = σ_{θ|D}, the width of the posterior.
• Consider a uniform prior with width σ_θ, so P(θ̂) = 1/σ_θ.
Then P(D) ≈ P(D|θ̂) P(θ̂) σ_{θ|D} ≈ P(D|θ̂) · σ_{θ|D} / σ_θ.
• The ratio of the posterior accessible volume of the parameter space
to the prior volume, σ_{θ|D}/σ_θ, is called the Occam factor, i.e., the
factor by which Hi's hypothesis space collapses when the data arrive.
Bayesian Occam’s razor

• P (D|H1) is smallest, since it is too simple a model.


• P (D|H3) is second smallest, since it is too complex, so it spreads
its probability mass more thinly over the (D, θ) space (fewer dots on
the horizontal line).
• We trust an expert who predicts a few specific (and correct!) things
more than an expert who predicts many things.
























 
 











                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        

                        
Bayesian image interpretation

• How many boxes are behind the tree?
• The interpretation that the tree is in front of one box is much more
probable than there being 2 boxes which happen to have the same
height and color (a suspicious coincidence).
• This can be formalized by assuming (uniform) priors on the box
parameters, and computing the Occam factors.
Leave one out cross validation (LOOCV)

• The evidence can be evaluated sequentially:
P(x_{1:N}) = P(x_1) P(x_2|x_1) P(x_3|x_{1:2}) · · ·
• LOOCV approximates P(X_t|X_{1:t−1}, θ̂_{1:t−1}) under different
permutations of the data.
• Advantages of LOOCV:
– Simple (no need to integrate out parameters)
– Robust (works well even if the "truth is not in the model class")
• Disadvantages of LOOCV:
– Slow (in general, must rerun training many times)
– Does not use all the data
Minimum description length (MDL)

• Another way of thinking about Bayesian Occam's razor is in terms of
information theory.
• To losslessly send a message about an event x with probability P(x)
takes L(x) = − log2 P(x) bits.
• Suppose instead of sending the raw data, you send a model and then
the residual errors (the parts of the data not predicted by the model).
• This takes L(D, H) bits:
L(D, H) = − log P(H) − log P(D|H) = − log P(H|D) + const
• The best model is the one with the overall shortest message.

[Figure: description length vs. model complexity — #bits for model, #bits for data, and #bits total; the best model minimizes the total.]
BIC approximation to the evidence

• Laplace approximation:
P(D) ≈ P(D|θ̂) P(θ̂) (2π)^{d/2} |H|^{−1/2}
• Taking logs:
log P(D) = log P(D|θ̂) + log P(θ̂) + (d/2) log(2π) − (1/2) log |H|
• BIC (Bayesian Information Criterion): drop terms that are independent
of N, and approximate log |H| ≈ d log N. So
log P(D) ≈ log P(D|θ̂_ML) − (d/2) log N
where d is the number of free parameters.
• AIC (Akaike Information Criterion): derived instead by minimizing KL
divergence to the true distribution; its penalty is independent of N:
log P(D) ≈ log P(D|θ̂_ML) − d
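
A minimal MATLAB sketch of the BIC score for one tabular CPD family (hypothetical counts):

Njk = [20 5; 10 15; 5 25];            % hypothetical counts of (X_i = j, pa = k)
N = sum(Njk(:));
theta = Njk ./ sum(Njk, 1);           % ML estimate of P(X_i = j | pa = k)
loglik = sum(sum(Njk .* log(theta)));
d = (size(Njk,1) - 1) * size(Njk,2);  % free parameters: (|X_i|-1) |Val(pa)|
bic = loglik - (d/2) * log(N)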
Log-likelihood in information theoretic terms

(1/N) ℓ = (1/N) Σ_i Σ_j Σ_k N_{ijk} log θ_{ijk}
= Σ_i Σ_j Σ_k P̂(X_i = j, X_{π_i} = k) log P̂(X_i = j | X_{π_i} = k)
= Σ_{ijk} P̂(X_i = j, X_{π_i} = k) log [ P̂(X_i = j, X_{π_i} = k) P̂(X_i = j) / (P̂(X_{π_i} = k) P̂(X_i = j)) ]
= Σ_i Σ_{jk} P̂(X_i = j, X_{π_i} = k) log [ P̂(X_i = j, X_{π_i} = k) / (P̂(X_{π_i} = k) P̂(X_i = j)) ]
+ Σ_i Σ_j ( Σ_k P̂(X_i = j, X_{π_i} = k) ) log P̂(X_i = j)
= Σ_i [ I(X_i, X_{π_i}) − H(X_i) ]
BIC in information theoretic terms

score_BIC(G|D) = ℓ(θ̂) − (d(G)/2) log N
= N Σ_i I(X_i, X_{π_i}) − N Σ_i H(X_i) − (d/2) log N
• The mutual information term grows linearly in N, while the complexity
penalty grows only logarithmically in N.
• So for large datasets, we pay more attention to fitting the data better.
• Also, the structural prior is independent of N, so it does not matter
very much.
Desirable properties of a scoring function

• Consistency: if the data is generated by G*, then G* and all
I-equivalent models maximize the score.
• Decomposability:
score(G|D) = Σ_i FamScore(D(X_i, X_{π_i}))
which makes it cheap to compare the scores of G and G′ if they only
differ in a small number of families.
• The Bayesian score (evidence), likelihood, and penalized likelihood
(BIC) are all decomposable and consistent.
Maximizing the score

• Consider the family of DAGs G_d with maximum fan-in (number of
parents) equal to d.
• Theorem 14.4.3: It is NP-hard to find
G* = arg max_{G ∈ G_d} score(G, D)
for any d ≥ 2.
• In general, we need to use heuristic local search.
Maximizing the score: tractable cases

• For d ≤ 1 (i.e., trees), we can solve the problem in O(n²) time using
max spanning tree (next lecture).
• If we know the ordering of the nodes, we can solve the problem in
O(d · (n choose d)) time (see below).
Known order (K2 algorithm)

• Suppose we have a total ordering of the nodes X1 ≺ X2 ≺ · · · ≺ Xn
and want to find a DAG consistent with it with maximum score.
• The choice of parents for Xi, from Pa_i ⊆ {X1, ..., X_{i−1}}, is
independent of the choice for Xj: since we obey the ordering, we
cannot create a cycle.
• Hence we can pick the best set of parents for each node independently.
• For Xi, we need to search all O((i−1 choose d)) subsets of size up to
d for the set which maximizes FamScore.
• We can use greedy techniques for this, cf. learning a decision tree
(see the sketch below).
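
A minimal MATLAB sketch of this greedy variant, where famscore(i, pa, D) is a hypothetical helper returning a decomposable family score, and nodes are numbered in the given order:

function pa = k2_parents(i, d, D)
  % Greedily grow the parent set of node i from its predecessors 1..i-1.
  pa = [];
  improved = true;
  while improved && numel(pa) < d
    improved = false;
    best = famscore(i, pa, D);          % hypothetical family-score helper
    bestj = 0;
    for j = setdiff(1:i-1, pa)          % candidates respect the ordering
      s = famscore(i, [pa j], D);
      if s > best, best = s; bestj = j; improved = true; end
    end
    if improved, pa = [pa bestj]; end   % add the single best parent
  end
end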
What if order isn’t known?

• Search in the space of DAGs.


• Search in the space of orderings, then conditioned on ≺, pick best
graph using K2 (Rao-Blackwellised sampling).
• Can also search in space of undirected graphs.
• Can also search in space of graphs of variable size, to allow creation
of hidden nodes (next lecture).
Searching in DAG space

• Typical search operators:
– Add an edge
– Delete an edge
– Reverse an edge
• We can get from any graph to any other graph in at most O(n²)
moves (the diameter of the search space).
• Moves are reversible.
• Simplest search algorithm: greedy hill climbing (see the sketch below).
• We can only apply a search operator o to the current graph G if
the resulting graph o(G) satisfies the constraints, e.g., acyclicity,
indegree bound, induced treewidth bound ("thin junction trees"),
hard prior knowledge.
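
A minimal MATLAB sketch of greedy hill climbing, assuming hypothetical helpers neighbors(G) (the constraint-satisfying add/delete/reverse neighbors of G) and score(G, D) (a decomposable score):

function G = greedy_dag_search(G, D)
  while true
    base = score(G, D);                 % hypothetical score helper
    cand = neighbors(G);                % cell array of valid neighbor DAGs
    bestDelta = 0; bestG = [];
    for c = 1:numel(cand)
      delta = score(cand{c}, D) - base;
      if delta > bestDelta, bestDelta = delta; bestG = cand{c}; end
    end
    if isempty(bestG), break; end       % local maximum (or plateau)
    G = bestG;                          % take the best improving move
  end
end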
Cost of evaluating moves

• There are O(n²) operators we could apply at each step.
• For each operator, we need to check if o(G) is acyclic.
• We can check acyclicity in O(e) time, where e = O(nd) is the number
of edges.
• For local moves, we can check acyclicity in amortized O(1) time
using the ancestor matrix.
• If o(G) is acyclic, we need to evaluate its quality. This requires
computing sufficient statistics for every family, which takes O(Mn)
time for M training cases.
• Suppose there are K steps to convergence. (We expect K ≫ n²,
since the diameter is n².)
• Hence the total time is O(K · n² · Mn).
Exploiting decomposable score

• If the operator is valid, we need to evaluate its quality. Define
δ_G(o) = score(o(G)|D) − score(G|D)
• If the score is decomposable, and we want to modify an edge involving
X and Y, we only need to look at the sufficient statistics for X and
Y's families.
• E.g., if o = add X → Y:
δ_G(o) = FamScore(Y, Pa(Y, G) ∪ {X} | D) − FamScore(Y, Pa(Y, G) | D)
• So we can evaluate the quality of a move in O(M) time by extracting
sufficient statistics for the columns related to X, Y and their parents.
• This reduces the time from O(Kn³M) to O(Kn²M).
Exploiting decomposable score

• After, e.g., adding X → Y, we only need to update δ(o) for the O(n)
operators that involve X or Y.
• Also, we can update a heap in O(n log n) time and thereby find the
best o in O(1) time at each step.
• So the total cost goes from O(Kn²M) to O(K(nM + n log n)).
• For large M, we can use fancy data structures (e.g., kd-trees) to
compute sufficient statistics in sub-linear time.
Local maxima

• Greedy hill climbing will stop when it reaches a local maximum or a
plateau (a set of neighboring networks that have the same score).
• Unfortunately, plateaux are common, since equivalence classes form
contiguous regions of search space (thm 14.4.4), and such classes
can be exponentially large.
• Solutions:
– Random restarts
– TABU search (prevent the algorithm from undoing an operator
applied in the last L steps, thereby forcing it to explore new terrain).
– Data perturbation (dynamic local search): reweight the data and
take a step.
– Simulated annealing: if δ(o) > 0, take the move; else accept it with
probability e^{δ(o)/t}, where t is the temperature. Slow!
Searching in space of equivalence classes

• The space of class PDAGs is smaller.
• We avoid many of the plateaux of I-equivalent DAGs.
• Operators are more complicated to implement and evaluate, but can
still be done locally (see the paper by Max Chickering).
• Cannot exploit causal/interventional data (which can distinguish
members of an equivalence class).
• Currently less common than searching in DAG space.
Learning the ICU-Alarm network with TABU search

• Learned structures are often simpler than the "true" model (fewer
edges), but predict just as well.
• Can only recover structure up to Markov equivalence.
• 10 minutes to learn the structure for 100 variables and 5000 cases.

[Figure: KL divergence vs. #samples, for parameter learning (known structure) and structure learning on the ICU-Alarm network.]
