Model Selection / Structure Learning
Koller & Friedman Chapter 14; MacKay Chapter 28
Kevin Murphy
8 November 2004
Structure learning: why?
• Constraint-based approach:
– Assume some way of testing conditional independencies
X1 ⊥ X2|X3
– Then construct model consistent with these results
• Search-and-score approach:
– Define a scoring function for measuring model quality (e.g., marginal likelihood or penalized likelihood)
– Use a search algorithm to find a (local) maximum of the score (a minimal search sketch follows below)
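• For concreteness, a minimal sketch (not from the lecture) of such a search: greedy hill-climbing over DAG adjacency matrices using single-edge additions and deletions, where scoreFn is an assumed user-supplied scoring function (e.g., BIC or log marginal likelihood) and D is the data.

% Hypothetical sketch of score-based local search over DAGs.
% scoreFn(A, D) is an assumed scoring function for the DAG with adjacency
% matrix A (A(i,j) = 1 means an edge i -> j), evaluated on data D.
function [A, best] = greedy_dag_search(D, n, scoreFn)
  A = zeros(n);                         % start from the empty graph
  best = scoreFn(A, D);
  improved = true;
  while improved
    improved = false;
    for i = 1:n
      for j = [1:i-1, i+1:n]            % all ordered pairs i ~= j
        A2 = A;
        A2(i,j) = 1 - A2(i,j);          % toggle edge i -> j (add or delete)
        if ~is_acyclic(A2), continue; end
        s = scoreFn(A2, D);
        if s > best                     % accept any improving single-edge move
          A = A2; best = s; improved = true;
        end
      end
    end
  end
end

function ok = is_acyclic(A)
  % a directed graph is acyclic iff no node can reach itself
  n = size(A,1); paths = A; reach = A;
  for k = 2:n
    paths = min(paths * A, 1);          % indicator of paths of length k
    reach = reach + paths;
  end
  ok = all(diag(reach) == 0);
end

• Practical implementations also consider edge reversals, and exploit score decomposability so that a single-edge change only requires rescoring one family.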
Identifiability
• The build-PDAG algorithm from K&F chapter 3 can recover the true DAG up to I-equivalence in $O(n^{2d+2})$ time if we make the following assumptions:
– The maximum fan-in (number of parents) of any node is d
– The independence test oracle can handle up to 2d + 2 variables
– The underlying distribution P∗ is faithful to G∗, i.e., there are no spurious independencies that are not sanctioned by G∗ (G∗ is a P-map of P∗).
• This is often called the IC or PC algorithm.
Constraint-based approach
• Bad
– The faithfulness assumption rules out certain CPDs, like noisy-OR.
– It is hard to make a reliable independence test (especially given small data sets) which does not make too many errors (either false positives or false negatives); a sketch of such a test follows after this list.
– One misleading independence test result can cause multiple errors in the resulting PDAG, so overall the approach is not very robust to noise.
• Good
– PC algorithm is less dumb than local search
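• For illustration, a generic sketch (an assumption, not the lecture's prescription) of the kind of test involved: a G² (likelihood-ratio) test of X ⊥ Y | Z for discrete data, where data is an N×3 matrix of integer codes for X, Y, and a single conditioning variable Z.

% Hypothetical sketch: G^2 test of X _|_ Y | Z for discrete variables.
% data is N-by-3; columns hold integer codes 1..rx, 1..ry, 1..rz.
function [pval, G2, df] = gsq_ci_test(data)
  x = data(:,1); y = data(:,2); z = data(:,3);
  rx = max(x); ry = max(y); rz = max(z);
  Nxyz = accumarray([x y z], 1, [rx ry rz]);   % joint counts
  Nxz  = squeeze(sum(Nxyz, 2));                % rx-by-rz
  Nyz  = squeeze(sum(Nxyz, 1));                % ry-by-rz
  Nz   = squeeze(sum(sum(Nxyz, 1), 2));        % rz-by-1
  G2 = 0;
  for k = 1:rz
    for i = 1:rx
      for j = 1:ry
        if Nxyz(i,j,k) > 0
          expected = Nxz(i,k) * Nyz(j,k) / Nz(k);
          G2 = G2 + 2 * Nxyz(i,j,k) * log(Nxyz(i,j,k) / expected);
        end
      end
    end
  end
  df = (rx - 1) * (ry - 1) * rz;
  pval = 1 - gammainc(G2/2, df/2);             % chi-square tail probability
end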
Independence tests
[Figure: Beta(1.0, 1.0) prior, likelihood, and posterior panels for increasing amounts of data: (1 head, 1 tail), (10 heads, 1 tail), (10 heads, 5 tails), (10 heads, 10 tails).]
Parameter posterior - small sample, strong prior
[Figure: Beta(10.0, 10.0) prior, likelihood, and posterior panels for (1 head, 0 tails), (1 head, 1 tail), (10 heads, 1 tail), (10 heads, 5 tails), (10 heads, 10 tails).]
Parameter posterior - coin data, uniform prior
[Figure: Beta(1.0, 1.0) prior, likelihood, and posterior for 140 heads and 110 tails.]
% Grid of theta values and Beta(alphaH, alphaT) prior
thetas = 0:0.01:1;
alphaH = 1; alphaT = 1;
prior = betapdf(thetas, alphaH, alphaT);
Nh = 140; Nt = 110;                            % observed data (coin example above)
lik = thetas.^Nh .* (1-thetas).^Nt;            % (unnormalized) likelihood
post = betapdf(thetas, alphaH+Nh, alphaT+Nt);  % posterior is Beta(alphaH+Nh, alphaT+Nt)
Model evidence
[Figure: the evidence in favor of the biased-coin hypothesis as a function of the prior strength α.]
• For a uniform prior, P(H1|D)/P(H0|D) = 0.48, (weakly) favoring the fair coin hypothesis H0! (A numerical check follows below.)
• At best, for α = 50, we can make the biased hypothesis twice as
likely.
• Not as dramatic as saying “we reject the null hypothesis (fair coin)
with significance 6.6%”.
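• As a numerical check of the 0.48 figure (a sketch assuming the 140-heads, 110-tails data above), the two marginal likelihoods can be computed in log space with gammaln:

% Sketch: Bayes factor for the biased-coin example (140 heads, 110 tails).
Nh = 140; Nt = 110; N = Nh + Nt;
logbeta  = @(a,b) gammaln(a) + gammaln(b) - gammaln(a+b);
logpD_H0 = N * log(0.5);                       % fair coin: P(D|H0) = 0.5^N
alpha    = 1;                                  % Beta(alpha, alpha) prior on theta under H1
logpD_H1 = logbeta(alpha+Nh, alpha+Nt) - logbeta(alpha, alpha);
ratio = exp(logpD_H1 - logpD_H0)               % approx 0.48 for alpha = 1
% With equal prior probabilities P(H0) = P(H1) = 0.5, this ratio also
% equals the posterior odds P(H1|D) / P(H0|D).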
From coins to dice
• Likelihood: binomial → multinomial
$P(D \mid \vec\theta) = \prod_i \theta_i^{N_i}$
• Prior: beta → Dirichlet
$P(\vec\theta \mid \vec\alpha) = \frac{1}{Z(\vec\alpha)} \prod_i \theta_i^{\alpha_i - 1}, \qquad Z(\vec\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(\sum_i \alpha_i)}$
• Posterior: beta → Dirichlet
$P(\vec\theta \mid D) = \mathrm{Dir}(\vec\alpha + \vec N)$
• Evidence (marginal likelihood); a code sketch follows below:
$P(D \mid \vec\alpha) = \frac{Z(\vec\alpha + \vec N)}{Z(\vec\alpha)} = \prod_i \frac{\Gamma(\alpha_i + N_i)}{\Gamma(\alpha_i)} \cdot \frac{\Gamma(\sum_i \alpha_i)}{\Gamma(\sum_i (\alpha_i + N_i))}$
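• In code, this evidence is most conveniently evaluated in log space with gammaln; a minimal sketch (the function name is an invention for illustration):

% Sketch: log marginal likelihood of counts N under a Dir(alpha) prior,
% i.e., log [ Z(alpha + N) / Z(alpha) ].
function L = dirichlet_log_evidence(alpha, N)
  % alpha and N are vectors of the same length (pseudo-counts and observed counts)
  L = sum(gammaln(alpha + N) - gammaln(alpha)) ...
      + gammaln(sum(alpha)) - gammaln(sum(alpha + N));
end

• For the coin example, dirichlet_log_evidence([1 1], [140 110]) reproduces log P(D|H1) from the earlier slides.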
From dice to tabular Bayes nets
• If we assume global parameter independence, the evidence decomposes into one term per node:
$P(D \mid G) = \prod_i P(D(X_i, X_{\pi_i}) \mid \vec\alpha_i)$
• If we also assume local parameter independence, each node term decomposes into a product over rows (conditioning cases); a code sketch follows below:
$P(D \mid G) = \prod_i \prod_{k \in \mathrm{Val}(\pi_i)} P(D(X_i, X_{\pi_i} = k) \mid \vec\alpha_{i,\cdot,k})$
$= \prod_i \prod_{k \in \mathrm{Val}(\pi_i)} \frac{Z(\vec\alpha_{i,\cdot,k} + \vec N_{i,\cdot,k})}{Z(\vec\alpha_{i,\cdot,k})}$
$= \prod_i \prod_{k \in \mathrm{Val}(\pi_i)} \left[ \prod_j \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})} \right] \frac{\Gamma(\sum_j \alpha_{ijk})}{\Gamma(\sum_j (\alpha_{ijk} + N_{ijk}))}$
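• A minimal sketch of this decomposed score; the data layout is an assumption: a cell array counts{i} whose (j,k) entry is N_ijk, with matching pseudo-counts alphas{i}(j,k) = α_ijk.

% Sketch: log P(D | G) for a tabular Bayes net with Dirichlet priors.
function L = bn_log_evidence(counts, alphas)
  L = 0;
  for i = 1:numel(counts)
    Ni = counts{i}; Ai = alphas{i};
    for k = 1:size(Ni, 2)                      % one term per parent configuration
      L = L + sum(gammaln(Ai(:,k) + Ni(:,k)) - gammaln(Ai(:,k))) ...
            + gammaln(sum(Ai(:,k))) - gammaln(sum(Ai(:,k) + Ni(:,k)));
    end
  end
end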
Example of model selection
[Figure: model selection results as a function of the number of samples (0–100), for dependence strengths e = 0.05, 0.10, 0.15, 0.20 and several random seeds.]
Score equivalence
• X → Y and X ← Y are I-equivalent (have the same likelihood).
• Suppose we use a Dirichlet prior with parameter α in every entry, for each node in each graph (the K2 prior):
$P(\theta_X \mid H_1) = \mathrm{Dir}(\alpha, \alpha), \qquad P(\theta_{X \mid Y=i} \mid H_2) = \mathrm{Dir}(\alpha, \alpha)$
• In H1, the equivalent sample size for X is 2α, but in H2 it is 4α (since there are two conditioning contexts). Hence the posterior probabilities of the two graphs are different.
• The BDe (Bayesian Dirichlet likelihood equivalent) prior uses weights $\alpha_{X_i \mid X_{\pi_i}} = \alpha\, P_0(X_i, X_{\pi_i})$, where $P_0$ could be represented by, e.g., a Bayes net.
• The BDeu (uniform) prior is $P_0(X_i, X_{\pi_i}) = \frac{1}{|X_i|\, |X_{\pi_i}|}$.
• Using the BDeu prior, the score curves for X → Y and X ← Y are indistinguishable; using the K2 prior, they are not (a numerical check follows below).
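• A small numerical check of this point (a sketch, not from the slides) for binary X and Y: score both structures on the same count table under the K2 prior (every Dirichlet parameter equal to 1) and under the BDeu prior with equivalent sample size α.

% Sketch: compare X -> Y and X <- Y under K2 and BDeu priors (binary X, Y).
Nxy = [30 10; 5 15];                           % hypothetical counts N(X=i, Y=j)
logdir = @(a, n) sum(gammaln(a + n) - gammaln(a)) ...
                 + gammaln(sum(a)) - gammaln(sum(a + n));
% log P(D | X -> Y): marginal of X, plus Y given each value of X
sc_XtoY = @(aX, aYgX) logdir(aX, sum(Nxy, 2)) ...
        + logdir(aYgX, Nxy(1,:)') + logdir(aYgX, Nxy(2,:)');
% log P(D | X <- Y): marginal of Y, plus X given each value of Y
sc_YtoX = @(aY, aXgY) logdir(aY, sum(Nxy, 1)') ...
        + logdir(aXgY, Nxy(:,1)) + logdir(aXgY, Nxy(:,2));
% K2 prior: every Dirichlet parameter equal to 1 -- the two scores differ
k2   = [sc_XtoY([1;1], [1;1]),  sc_YtoX([1;1], [1;1])]
% BDeu prior with equivalent sample size alpha -- the two scores coincide
alpha = 1;
bdeu = [sc_XtoY([alpha/2; alpha/2], [alpha/4; alpha/4]), ...
        sc_YtoX([alpha/2; alpha/2], [alpha/4; alpha/4])]

• The BDeu scores agree because α_ijk = α · P0(x_i, x_πi) for a common uniform P0; the K2 scores generally do not.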
Bayesian Occam’s razor
• Why is P(H0|D) higher when the dependence between X and Y is weak (small e)?
• It is not because the prior P(Hi) explicitly favors simpler models (although this is possible).
• It is because the evidence $P(D) = \int P(D \mid w)\, P(w)\, dw$ automatically penalizes complex models.
• Occam’s razor says “If two models are equally predictive, prefer the
simpler one”.
• This is an automatic consequence of using Bayesian model selection.
• Maximum likelihood would always pick the most complex model,
since it has more parameters, and hence can fit the training data
better.
• Good test for a learning algorithm: feed it random noise, see if it
“discovers” structure!
Laplace approximation to the evidence
Bayesian Occam’s razor
• How many boxes are behind the tree?
• One box is more probable than two boxes that happen to have the same height and color (a suspicious coincidence).
Minimum description length (MDL)
[Figure: total description length (#bits) as a function of the model; the best model minimizes the total.]
BIC approximation to the evidence
• Laplace approximation:
$P(D) \approx P(D \mid \hat\theta)\, P(\hat\theta)\, (2\pi)^{d/2}\, |H|^{-1/2}$
where H is the Hessian of the negative log posterior at $\hat\theta$.
• Taking logs:
$\log P(D) = \log P(D \mid \hat\theta) + \log P(\hat\theta) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log |H|$
• BIC (Bayesian Information Criterion): drop the terms that are independent of N and approximate $\log |H| \approx d \log N$. So
$\log P(D) \approx \log P(D \mid \hat\theta_{ML}) - \frac{d}{2}\log N$
where d is the number of free parameters (a code sketch follows below).
• AIC (Akaike Information Criterion): derived by minimizing the KL divergence between the true distribution and the fitted model; the penalty is d instead of $\frac{d}{2}\log N$:
$\mathrm{score}_{AIC} = \log P(D \mid \hat\theta_{ML}) - d$
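• A minimal sketch of the BIC score for a single family (one node and its parents) of a tabular Bayes net; the count-matrix layout and the helper name are assumptions.

% Sketch: BIC score of one family, counts(j,k) = N_ijk, N = total number of cases.
function s = bic_family_score(counts, N)
  theta = counts ./ max(sum(counts, 1), 1);    % MLE of P(Xi = j | pa_i = k)
  ll = sum(counts(theta > 0) .* log(theta(theta > 0)));   % maximized log likelihood
  d  = (size(counts, 1) - 1) * size(counts, 2);           % free parameters in this CPT
  s  = ll - (d / 2) * log(N);
end

• The BIC score of a whole graph is the sum of these family scores, which is what makes local search updates cheap.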
Log-likelihood in information theoretic terms
$\frac{1}{N}\ell = \frac{1}{N}\sum_i \sum_j \sum_k N_{ijk} \log \hat\theta_{ijk}$
$= \sum_i \sum_j \sum_k \hat P(X_i = j, X_{\pi_i} = k) \log \hat P(X_i = j \mid X_{\pi_i} = k)$
$= \sum_{ijk} \hat P(X_i = j, X_{\pi_i} = k) \log \frac{\hat P(X_i = j, X_{\pi_i} = k)\, \hat P(X_i = j)}{\hat P(X_{\pi_i} = k)\, \hat P(X_i = j)}$
$= \sum_i \sum_{jk} \hat P(X_i = j, X_{\pi_i} = k) \log \frac{\hat P(X_i = j, X_{\pi_i} = k)}{\hat P(X_{\pi_i} = k)\, \hat P(X_i = j)} + \sum_i \sum_j \Big( \sum_k \hat P(X_i = j, X_{\pi_i} = k) \Big) \log \hat P(X_i = j)$
$= \sum_i \big( I(X_i, X_{\pi_i}) - H(X_i) \big)$
where $\hat P$ is the empirical distribution, I is mutual information, and H is entropy.
BIC in information theoretic terms
$\mathrm{score}_{BIC}(G \mid D) = \ell(\hat\theta) - \frac{d(G)}{2}\log N = N \sum_i I(X_i, X_{\pi_i}) - N \sum_i H(X_i) - \frac{d(G)}{2}\log N$
• The mutual information term grows linearly in N, while the complexity penalty is only logarithmic in N.
• So for large datasets, we pay relatively more attention to fitting the data.
• Also, the structural prior is independent of N, so it does not matter very much.
Desirable properties of a scoring function
• For d ≤ 1 (i.e., trees), we can solve the problem in $O(n^2)$ time using a maximum spanning tree algorithm (next lecture).
• If we know the ordering of the nodes, we can solve the problem in $O\!\left(d \binom{n}{d}\right)$ time (see below).
Known order (K2 algorithm)
[Figure: KL divergence versus number of samples, comparing parameter learning with structure learning.]
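• Finally, a minimal sketch of K2-style greedy parent selection under a known ordering; familyScore(i, parents, D) is an assumed decomposable scoring function (e.g., the Bayesian or BIC family score above) and d is the fan-in bound.

% Sketch of the K2 idea: process nodes in the given order; each node greedily
% adds the predecessor that most improves its family score, up to d parents.
function parents = k2_greedy(order, d, D, familyScore)
  n = numel(order);
  parents = cell(1, n);
  for pos = 1:n
    i = order(pos);
    pa = [];
    best = familyScore(i, pa, D);
    candidates = order(1:pos-1);               % only earlier nodes may be parents
    while numel(pa) < d && ~isempty(candidates)
      scores = zeros(1, numel(candidates));
      for c = 1:numel(candidates)
        scores(c) = familyScore(i, [pa candidates(c)], D);
      end
      [s, cbest] = max(scores);
      if s <= best, break; end                 % stop when no candidate helps
      pa = [pa candidates(cbest)];             % accept the best new parent
      candidates(cbest) = [];
      best = s;
    end
    parents{i} = pa;
  end
end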