Model Selection / Structure Learning
Koller & Friedman Chapter 14; MacKay Chapter 28
Kevin Murphy
8 November 2004
Structure learning: why?
• Constraint-based approach:
– Assume some way of testing conditional independencies
X1 ⊥ X2|X3
– Then construct model consistent with these results
• Search-and-score approach:
– Define a scoring function for measuring model quality (e.g., marginal likelihood or penalized likelihood)
– Use a search algorithm to find a (local) maximum of the score (a minimal search sketch follows below)
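• For concreteness, a minimal sketch (not from the lecture) of such a search: greedy hill-climbing over DAG adjacency matrices using single-edge additions and deletions, where scoreFn is an assumed user-supplied scoring function (e.g., BIC or log marginal likelihood) and D is the data.

% Hypothetical sketch of score-based local search over DAGs.
% scoreFn(A, D) is an assumed scoring function for the DAG with adjacency
% matrix A (A(i,j) = 1 means an edge i -> j), evaluated on data D.
function [A, best] = greedy_dag_search(D, n, scoreFn)
  A = zeros(n);                         % start from the empty graph
  best = scoreFn(A, D);
  improved = true;
  while improved
    improved = false;
    for i = 1:n
      for j = [1:i-1, i+1:n]            % all ordered pairs i ~= j
        A2 = A;
        A2(i,j) = 1 - A2(i,j);          % toggle edge i -> j (add or delete)
        if ~is_acyclic(A2), continue; end
        s = scoreFn(A2, D);
        if s > best                     % accept any improving single-edge move
          A = A2; best = s; improved = true;
        end
      end
    end
  end
end

function ok = is_acyclic(A)
  % a directed graph is acyclic iff no node can reach itself
  n = size(A,1); paths = A; reach = A;
  for k = 2:n
    paths = min(paths * A, 1);          % indicator of paths of length k
    reach = reach + paths;
  end
  ok = all(diag(reach) == 0);
end

• Practical implementations also consider edge reversals, and exploit score decomposability so that a single-edge change only requires rescoring one family.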
Identifiability
• The build-PDAG algorithm from K&F chapter 3 can recover the true DAG up to I-equivalence in $O(n^{2d+2})$ time if we make the following assumptions:
– The maximum fan-in (number of parents) of any node is d
– The independence test oracle can handle up to 2d + 2 variables
– The underlying distribution P∗ is faithful to G∗, i.e., there are no spurious independencies that are not sanctioned by G∗ (G∗ is a P-map of P∗).
• This is often called the IC or PC algorithm.
Constraint-based approach
• Bad
– The faithfulness assumption rules out certain CPDs, like noisy-OR.
– It is hard to make a reliable independence test (especially given small data sets) which does not make too many errors (either false positives or false negatives); a sketch of such a test follows after this list.
– One misleading independence test result can cause multiple errors in the resulting PDAG, so overall the approach is not very robust to noise.
• Good
– PC algorithm is less dumb than local search
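• For illustration, a generic sketch (an assumption, not the lecture's prescription) of the kind of test involved: a G² (likelihood-ratio) test of X ⊥ Y | Z for discrete data, where data is an N×3 matrix of integer codes for X, Y, and a single conditioning variable Z.

% Hypothetical sketch: G^2 test of X _|_ Y | Z for discrete variables.
% data is N-by-3; columns hold integer codes 1..rx, 1..ry, 1..rz.
function [pval, G2, df] = gsq_ci_test(data)
  x = data(:,1); y = data(:,2); z = data(:,3);
  rx = max(x); ry = max(y); rz = max(z);
  Nxyz = accumarray([x y z], 1, [rx ry rz]);   % joint counts
  Nxz  = squeeze(sum(Nxyz, 2));                % rx-by-rz
  Nyz  = squeeze(sum(Nxyz, 1));                % ry-by-rz
  Nz   = squeeze(sum(sum(Nxyz, 1), 2));        % rz-by-1
  G2 = 0;
  for k = 1:rz
    for i = 1:rx
      for j = 1:ry
        if Nxyz(i,j,k) > 0
          expected = Nxz(i,k) * Nyz(j,k) / Nz(k);
          G2 = G2 + 2 * Nxyz(i,j,k) * log(Nxyz(i,j,k) / expected);
        end
      end
    end
  end
  df = (rx - 1) * (ry - 1) * rz;
  pval = 1 - gammainc(G2/2, df/2);             % chi-square tail probability
end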
Independence tests
[Figure: Beta(1.0, 1.0) prior, likelihood, and posterior panels for increasing amounts of data: (1 head, 1 tail), (10 heads, 1 tail), (10 heads, 5 tails), (10 heads, 10 tails).]
Parameter posterior - small sample, strong prior
[Figure: Beta(10.0, 10.0) prior, likelihood, and posterior panels for (1 head, 0 tails), (1 head, 1 tail), (10 heads, 1 tail), (10 heads, 5 tails), (10 heads, 10 tails).]
Parameter posterior - coin data, uniform prior
[Figure: Beta(1.0, 1.0) prior, likelihood, and posterior for 140 heads and 110 tails.]
% Grid of theta values and Beta(alphaH, alphaT) prior
thetas = 0:0.01:1;
alphaH = 1; alphaT = 1;
prior = betapdf(thetas, alphaH, alphaT);
Nh = 140; Nt = 110;                            % observed data (coin example above)
lik = thetas.^Nh .* (1-thetas).^Nt;            % (unnormalized) likelihood
post = betapdf(thetas, alphaH+Nh, alphaT+Nt);  % posterior is Beta(alphaH+Nh, alphaT+Nt)
Model evidence
[Figure: the evidence in favor of the biased-coin hypothesis as a function of the prior strength α.]
• For a uniform prior, P(H1|D)/P(H0|D) = 0.48, (weakly) favoring the fair coin hypothesis H0! (A numerical check follows below.)
• At best, for α = 50, we can make the biased hypothesis twice as
likely.
• Not as dramatic as saying “we reject the null hypothesis (fair coin)
with significance 6.6%”.
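• As a numerical check of the 0.48 figure (a sketch assuming the 140-heads, 110-tails data above), the two marginal likelihoods can be computed in log space with gammaln:

% Sketch: Bayes factor for the biased-coin example (140 heads, 110 tails).
Nh = 140; Nt = 110; N = Nh + Nt;
logbeta  = @(a,b) gammaln(a) + gammaln(b) - gammaln(a+b);
logpD_H0 = N * log(0.5);                       % fair coin: P(D|H0) = 0.5^N
alpha    = 1;                                  % Beta(alpha, alpha) prior on theta under H1
logpD_H1 = logbeta(alpha+Nh, alpha+Nt) - logbeta(alpha, alpha);
ratio = exp(logpD_H1 - logpD_H0)               % approx 0.48 for alpha = 1
% With equal prior probabilities P(H0) = P(H1) = 0.5, this ratio also
% equals the posterior odds P(H1|D) / P(H0|D).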
From coins to dice
• Likelihood: binomial → multinomial
$P(D \mid \vec\theta) = \prod_i \theta_i^{N_i}$
• Prior: beta → Dirichlet
$P(\vec\theta \mid \vec\alpha) = \frac{1}{Z(\vec\alpha)} \prod_i \theta_i^{\alpha_i - 1}, \qquad Z(\vec\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(\sum_i \alpha_i)}$
• Posterior: beta → Dirichlet
$P(\vec\theta \mid D) = \mathrm{Dir}(\vec\alpha + \vec N)$
• Evidence (marginal likelihood); a code sketch follows below:
$P(D \mid \vec\alpha) = \frac{Z(\vec\alpha + \vec N)}{Z(\vec\alpha)} = \prod_i \frac{\Gamma(\alpha_i + N_i)}{\Gamma(\alpha_i)} \cdot \frac{\Gamma(\sum_i \alpha_i)}{\Gamma(\sum_i (\alpha_i + N_i))}$
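• In code, this evidence is most conveniently evaluated in log space with gammaln; a minimal sketch (the function name is an invention for illustration):

% Sketch: log marginal likelihood of counts N under a Dir(alpha) prior,
% i.e., log [ Z(alpha + N) / Z(alpha) ].
function L = dirichlet_log_evidence(alpha, N)
  % alpha and N are vectors of the same length (pseudo-counts and observed counts)
  L = sum(gammaln(alpha + N) - gammaln(alpha)) ...
      + gammaln(sum(alpha)) - gammaln(sum(alpha + N));
end

• For the coin example, dirichlet_log_evidence([1 1], [140 110]) reproduces log P(D|H1) from the earlier slides.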
From dice to tabular Bayes nets
• If we assume global parameter independence, the evidence decomposes into one term per node:
$P(D \mid G) = \prod_i P(D(X_i, X_{\pi_i}) \mid \vec\alpha_i)$
• If we also assume local parameter independence, each node term decomposes into a product over rows (conditioning cases); a code sketch follows below:
$P(D \mid G) = \prod_i \prod_{k \in \mathrm{Val}(\pi_i)} P(D(X_i, X_{\pi_i} = k) \mid \vec\alpha_{i,\cdot,k})$
$= \prod_i \prod_{k \in \mathrm{Val}(\pi_i)} \frac{Z(\vec\alpha_{i,\cdot,k} + \vec N_{i,\cdot,k})}{Z(\vec\alpha_{i,\cdot,k})}$
$= \prod_i \prod_{k \in \mathrm{Val}(\pi_i)} \left[ \prod_j \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})} \right] \frac{\Gamma(\sum_j \alpha_{ijk})}{\Gamma(\sum_j (\alpha_{ijk} + N_{ijk}))}$
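• A minimal sketch of this decomposed score; the data layout is an assumption: a cell array counts{i} whose (j,k) entry is N_ijk, with matching pseudo-counts alphas{i}(j,k) = α_ijk.

% Sketch: log P(D | G) for a tabular Bayes net with Dirichlet priors.
function L = bn_log_evidence(counts, alphas)
  L = 0;
  for i = 1:numel(counts)
    Ni = counts{i}; Ai = alphas{i};
    for k = 1:size(Ni, 2)                      % one term per parent configuration
      L = L + sum(gammaln(Ai(:,k) + Ni(:,k)) - gammaln(Ai(:,k))) ...
            + gammaln(sum(Ai(:,k))) - gammaln(sum(Ai(:,k) + Ni(:,k)));
    end
  end
end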
Example of model selection
[Figure: model selection results as a function of the number of samples (0–100), for dependence strengths e = 0.05, 0.10, 0.15, 0.20 and several random seeds.]
Score equivalence
• X → Y and X ← Y are I-equivalent (have the same likelihood).
• Suppose we use a Dirichlet prior with parameter α in every entry, for each node in each graph (the K2 prior):
$P(\theta_X \mid H_1) = \mathrm{Dir}(\alpha, \alpha), \qquad P(\theta_{X \mid Y=i} \mid H_2) = \mathrm{Dir}(\alpha, \alpha)$
• In H1, the equivalent sample size for X is 2α, but in H2 it is 4α (since there are two conditioning contexts). Hence the posterior probabilities of the two graphs are different.
• The BDe (Bayesian Dirichlet likelihood equivalent) prior uses weights $\alpha_{X_i \mid X_{\pi_i}} = \alpha\, P_0(X_i, X_{\pi_i})$, where $P_0$ could be represented by, e.g., a Bayes net.
• The BDeu (uniform) prior is $P_0(X_i, X_{\pi_i}) = \frac{1}{|X_i|\, |X_{\pi_i}|}$.
• Using the BDeu prior, the score curves for X → Y and X ← Y are indistinguishable; using the K2 prior, they are not (a numerical check follows below).
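• A small numerical check of this point (a sketch, not from the slides) for binary X and Y: score both structures on the same count table under the K2 prior (every Dirichlet parameter equal to 1) and under the BDeu prior with equivalent sample size α.

% Sketch: compare X -> Y and X <- Y under K2 and BDeu priors (binary X, Y).
Nxy = [30 10; 5 15];                           % hypothetical counts N(X=i, Y=j)
logdir = @(a, n) sum(gammaln(a + n) - gammaln(a)) ...
                 + gammaln(sum(a)) - gammaln(sum(a + n));
% log P(D | X -> Y): marginal of X, plus Y given each value of X
sc_XtoY = @(aX, aYgX) logdir(aX, sum(Nxy, 2)) ...
        + logdir(aYgX, Nxy(1,:)') + logdir(aYgX, Nxy(2,:)');
% log P(D | X <- Y): marginal of Y, plus X given each value of Y
sc_YtoX = @(aY, aXgY) logdir(aY, sum(Nxy, 1)') ...
        + logdir(aXgY, Nxy(:,1)) + logdir(aXgY, Nxy(:,2));
% K2 prior: every Dirichlet parameter equal to 1 -- the two scores differ
k2   = [sc_XtoY([1;1], [1;1]),  sc_YtoX([1;1], [1;1])]
% BDeu prior with equivalent sample size alpha -- the two scores coincide
alpha = 1;
bdeu = [sc_XtoY([alpha/2; alpha/2], [alpha/4; alpha/4]), ...
        sc_YtoX([alpha/2; alpha/2], [alpha/4; alpha/4])]

• The BDeu scores agree because α_ijk = α · P0(x_i, x_πi) for a common uniform P0; the K2 scores generally do not.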
Bayesian Occam’s razor
• Why is P(H0|D) higher when the dependence between X and Y is weak (small e)?
• It is not because the prior P(Hi) explicitly favors simpler models (although this is possible).
• It is because the evidence $P(D) = \int P(D \mid w)\, P(w)\, dw$ automatically penalizes complex models.
• Occam’s razor says “If two models are equally predictive, prefer the
simpler one”.
• This is an automatic consequence of using Bayesian model selection.
• Maximum likelihood would always pick the most complex model,
since it has more parameters, and hence can fit the training data
better.
• Good test for a learning algorithm: feed it random noise, see if it
“discovers” structure!
Laplace approximation to the evidence
Bayesian Occam’s razor
• How many boxes are behind the tree?
• One box is more probable than two boxes that happen to have the same height and color (a suspicious coincidence).
Minimum description length (MDL)
[Figure: total description length (#bits) as a function of the model; the best model minimizes the total.]
BIC approximation to the evidence
• Laplace approximation:
$P(D) \approx P(D \mid \hat\theta)\, P(\hat\theta)\, (2\pi)^{d/2}\, |H|^{-1/2}$
where H is the Hessian of the negative log posterior at $\hat\theta$.
• Taking logs:
$\log P(D) = \log P(D \mid \hat\theta) + \log P(\hat\theta) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log |H|$
• BIC (Bayesian Information Criterion): drop the terms that are independent of N and approximate $\log |H| \approx d \log N$. So
$\log P(D) \approx \log P(D \mid \hat\theta_{ML}) - \frac{d}{2}\log N$
where d is the number of free parameters (a code sketch follows below).
• AIC (Akaike Information Criterion): derived by minimizing the KL divergence between the true distribution and the fitted model; the penalty is d instead of $\frac{d}{2}\log N$:
$\mathrm{score}_{AIC} = \log P(D \mid \hat\theta_{ML}) - d$
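• A minimal sketch of the BIC score for a single family (one node and its parents) of a tabular Bayes net; the count-matrix layout and the helper name are assumptions.

% Sketch: BIC score of one family, counts(j,k) = N_ijk, N = total number of cases.
function s = bic_family_score(counts, N)
  theta = counts ./ max(sum(counts, 1), 1);    % MLE of P(Xi = j | pa_i = k)
  ll = sum(counts(theta > 0) .* log(theta(theta > 0)));   % maximized log likelihood
  d  = (size(counts, 1) - 1) * size(counts, 2);           % free parameters in this CPT
  s  = ll - (d / 2) * log(N);
end

• The BIC score of a whole graph is the sum of these family scores, which is what makes local search updates cheap.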
Log-likelihood in information theoretic terms
$\frac{1}{N}\ell = \frac{1}{N}\sum_i \sum_j \sum_k N_{ijk} \log \hat\theta_{ijk}$
$= \sum_i \sum_j \sum_k \hat P(X_i = j, X_{\pi_i} = k) \log \hat P(X_i = j \mid X_{\pi_i} = k)$
$= \sum_{ijk} \hat P(X_i = j, X_{\pi_i} = k) \log \frac{\hat P(X_i = j, X_{\pi_i} = k)\, \hat P(X_i = j)}{\hat P(X_{\pi_i} = k)\, \hat P(X_i = j)}$
$= \sum_i \sum_{jk} \hat P(X_i = j, X_{\pi_i} = k) \log \frac{\hat P(X_i = j, X_{\pi_i} = k)}{\hat P(X_{\pi_i} = k)\, \hat P(X_i = j)} + \sum_i \sum_j \Big( \sum_k \hat P(X_i = j, X_{\pi_i} = k) \Big) \log \hat P(X_i = j)$
$= \sum_i \big( I(X_i, X_{\pi_i}) - H(X_i) \big)$
where $\hat P$ is the empirical distribution, I is mutual information, and H is entropy.
BIC in information theoretic terms
$\mathrm{score}_{BIC}(G \mid D) = \ell(\hat\theta) - \frac{d(G)}{2}\log N = N \sum_i I(X_i, X_{\pi_i}) - N \sum_i H(X_i) - \frac{d(G)}{2}\log N$
• The mutual information term grows linearly in N, while the complexity penalty is only logarithmic in N.
• So for large datasets, we pay relatively more attention to fitting the data.
• Also, the structural prior is independent of N, so it does not matter very much.
Desirable properties of a scoring function
• For d ≤ 1 (i.e., trees), we can solve the problem in $O(n^2)$ time using a maximum spanning tree algorithm (next lecture).
• If we know the ordering of the nodes, we can solve the problem in $O\!\left(d \binom{n}{d}\right)$ time (see below).
Known order (K2 algorithm)
[Figure: KL divergence versus number of samples, comparing parameter learning with structure learning.]
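• Finally, a minimal sketch of K2-style greedy parent selection under a known ordering; familyScore(i, parents, D) is an assumed decomposable scoring function (e.g., the Bayesian or BIC family score above) and d is the fan-in bound.

% Sketch of the K2 idea: process nodes in the given order; each node greedily
% adds the predecessor that most improves its family score, up to d parents.
function parents = k2_greedy(order, d, D, familyScore)
  n = numel(order);
  parents = cell(1, n);
  for pos = 1:n
    i = order(pos);
    pa = [];
    best = familyScore(i, pa, D);
    candidates = order(1:pos-1);               % only earlier nodes may be parents
    while numel(pa) < d && ~isempty(candidates)
      scores = zeros(1, numel(candidates));
      for c = 1:numel(candidates)
        scores(c) = familyScore(i, [pa candidates(c)], D);
      end
      [s, cbest] = max(scores);
      if s <= best, break; end                 % stop when no candidate helps
      pa = [pa candidates(cbest)];             % accept the best new parent
      candidates(cbest) = [];
      best = s;
    end
    parents{i} = pa;
  end
end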