Artificial Intelligence Paper 5
Abstract
Many algorithms for score-based Bayesian network structure learning (BNSL)
take as input a collection of potentially optimal parent sets for each variable in
a data set. Constructing these collections naively is computationally intensive
since the number of parent sets grows exponentially with the number of variables.
Therefore, pruning techniques are not only desirable but essential. While effective
pruning exists for the Bayesian Information Criterion (BIC), current results for the
Bayesian Dirichlet equivalent uniform (BDeu) score reduce the search space very
modestly, hampering the use of (the often preferred) BDeu. We derive new non-
trivial theoretical upper bounds for the BDeu score that considerably improve on
the state of the art. Since the new bounds are efficient and easy to implement, they
can be promptly integrated into many BNSL methods. We show that the gains can be significant on multiple UCI data sets, highlighting the practical implications of the theoretical advances.
1 Introduction
A Bayesian network [19] is a widely used probabilistic graphical model. It is composed of (i) a
structure defined by a directed acyclic graph (DAG) where each node is associated with a random
variable, and where arcs represent dependencies between the variables entailing the Markov condi-
tion: every variable is conditionally independent of its non-descendant variables given its parents;
and (ii) a collection of conditional probability distributions defined for each variable given its par-
ents in the graph. Their graphical nature makes Bayesian networks well suited to representing the complex probabilistic relationships that arise in many real-world problems [8].
Bayesian network structure learning (BNSL) with complete data is NP-hard [3]. We tackle score-
based learning, that is, finding the structure maximising a given (data-dependent) score [14]. In
particular, we focus on the Bayesian Dirichlet equivalent uniform (BDeu) score [4], which cor-
responds to the log probability of the structure given (multinomial) data and a uniform prior on
structures. The BDeu score is decomposable, that is, it can be written as a sum of local scores of the domain variables: $\mathrm{BDeu}(G) = \sum_{i \in V} \mathrm{LBDeu}(i, S_i)$, where $\mathrm{LBDeu}$ is the local score function, $V = \{1, \ldots, n\}$ is the set of (indices of) variables in the dataset, which is in correspondence with the nodes of the Bayesian network to be learned, and $S_i \subseteq V^{\setminus i}$, with $V^{\setminus i} = V \setminus \{i\}$, is the parent set of node $i$ in the DAG structure $G$. A common approach divides the problem into two steps:
1. CANDIDATE PARENT SET IDENTIFICATION: For each variable of the domain, find a suitable collection of candidate parent sets and their local scores.
2. STRUCTURE OPTIMISATION: Given the collection of candidate parent sets, choose a parent set for each variable so as to maximise the overall score while avoiding directed cycles.
That is, $\mathrm{LBDeu}(S)$ is a sum of $q(S)$ values, each of which is specific to a particular instantiation of the variables $S$. We call such values local local BDeu scores (llB). In particular, $\mathrm{LLBDeu}(S, j) = 0$ if its $n_j = 0$, so we can concentrate only on those which actually appear in the data:
$$\mathrm{LBDeu}(S) = \sum_{j \in D_u^{S}} \mathrm{LLBDeu}(S, j)\,.$$
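As a reference point, the following is a minimal sketch of the standard BDeu local score written exactly as such a sum of llBs over the instantiations that appear in the data. The function names (`local_bdeu`, `log_gamma_ratio`) are ours, $\alpha_{\mathrm{ess}}$ is the equivalent sample size, and the per-variable numbers of states are assumed to be supplied in `q`; this is a sketch, not the authors' implementation.

```python
from collections import Counter
from math import lgamma

def log_gamma_ratio(a, n):
    """log Gamma_a(n) = log Gamma(n + a) - log Gamma(a) = sum_{l < n} log(l + a)."""
    return lgamma(n + a) - lgamma(a)

def local_bdeu(data, i, S, q, alpha_ess=1.0):
    """Standard BDeu local score LBDeu(i, S) written as a sum of llBs.

    data: list of tuples (one value per variable); i: child index;
    S: tuple of parent indices; q: number of states q(v) of each variable."""
    q_S = 1
    for v in S:
        q_S *= q[v]
    alpha_j = alpha_ess / q_S          # prior mass per parent instantiation
    alpha_jk = alpha_j / q[i]          # prior mass per (parent, child) configuration

    n_j = Counter(tuple(row[v] for v in S) for row in data)
    n_jk = Counter((tuple(row[v] for v in S), row[i]) for row in data)

    score = 0.0
    for j, nj in n_j.items():                      # only instantiations with n_j > 0
        llb = -log_gamma_ratio(alpha_j, nj)        # -log Gamma_{alpha_j}(n_j)
        llb += sum(log_gamma_ratio(alpha_jk, njk)  # + sum_k log Gamma_{alpha_jk}(n_{j,k})
                   for (jj, _), njk in n_jk.items() if jj == j)
        score += llb                               # LBDeu(S) = sum over j of LLBDeu(S, j)
    return score
```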
3 Pruning in Candidate Parent Set Identification
The pruning of parent sets rests on the (simple) observation that a parent set cannot be optimal if one of its subsets has a higher score [20]. Thus, when learning Bayesian networks from data using BDeu, it is important to have an upper bound $\mathrm{ub}(S) \ge \max_{T : T \supset S} \mathrm{LBDeu}(T)$ so as to potentially prune a whole area of the search space at once. Ideally, one would like an upper bound that is tight and cheap to compute, so that one can score parent sets $S$ incrementally and, at the same time, check whether it is worth 'expanding' $S$: if $\mathrm{ub}(S)$ is not greater than $\max_{R : R \subseteq S} \mathrm{LBDeu}(R)$, then it is unnecessary to expand $S$. In practice, however, there is a trade-off between these two desiderata.
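In code, the resulting test is a one-line comparison; a minimal sketch, where `upper_bound` and `best_subset_score` are placeholders for whichever bound and bookkeeping are used:

```python
def worth_expanding(S, upper_bound, best_subset_score):
    """Expand S (i.e., go on to score supersets of S) only if the upper bound on all
    supersets of S can still beat the best score found among the subsets of S."""
    return upper_bound(S) > best_subset_score(S)
```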
With that in mind, we can define candidate parent set identification more formally:
CANDIDATE PARENT SET IDENTIFICATION: For each variable $i \in V$, find the collection of parent sets $L_i = \{S \subseteq V^{\setminus i} : S' \subset S \Rightarrow \mathrm{LBDeu}(S') < \mathrm{LBDeu}(S)\}$.
Unfortunately, we cannot predict the elements of $L_i$ and have to compute the scores for a larger list $\hat{L}_i \supseteq L_i$. The practical benefit of our bounds is to reduce $|\hat{L}_i|$, and consequently to lower the computational cost, while ensuring that $\hat{L}_i \supseteq L_i$. Before presenting the best known upper bound [6, 9, 10], we present a lemma on the variation of counts with expansions of the parent set.
Lemma 1. For $S \subseteq T \subseteq V^{\setminus i}$, $j_S \in D_u^{S}$ and $j_T \in D_u^{T}$ with $j_T^S = j_S$, we have $|D_u^{T \cup \{i\}}| \ge |D_u^{S \cup \{i\}}|$ and $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$.
As an example, consider the small dataset of Table 1. The number of non-zero counts never decreases as we add a new variable to the parent set of variable $i = 3$. With $S = \{1\}$ and $T = \{1, 2\}$, we have $|D_u^{S \cup \{i\}}| = 3$ and $|D_u^{T \cup \{i\}}| = 4$. Conversely, the number of (unique) occurrences compatible with a given instantiation of the parent set never increases with its expansion: for example, with $j_S = (\mathrm{var}_1{:}1)$ and $j_T = (\mathrm{var}_1{:}1, \mathrm{var}_2{:}1)$, we have $|D_u^{S \cup \{i\}}(j_S)| = 2$ and $|D_u^{T \cup \{i\}}(j_T)| = 2$.
Table 1: Example of data $D$, its reductions by parent sets $S = \{1\}$ and $T = \{1, 2\}$, and the unique occurrences compatible with $j_S \in D_u^{S}$ and $j_T, j'_T \in D_u^{T}$, with $j_T^S = j'^S_T = j_S$. The child variable is $i = 3$, and we have $j_S = (\mathrm{var}_1{:}1)$, $j_T = (\mathrm{var}_1{:}1, \mathrm{var}_2{:}1)$, $j'_T = (\mathrm{var}_1{:}1, \mathrm{var}_2{:}0)$.

  D (vars 1 2 3):               (0 0 0), (1 0 0), (1 1 0), (1 1 1)
  D_u^{S∪{i}} (vars 1 3):       (0 0), (1 0), (1 1)
  D_u^{T∪{i}} (vars 1 2 3):     (0 0 0), (1 0 0), (1 1 0), (1 1 1)
  D_u^{S∪{i}}(j_S) (vars 1 3):  (1 0), (1 1)
  D_u^{T∪{i}}(j_T) (vars 1 2 3):  (1 1 0), (1 1 1)
  D_u^{T∪{i}}(j'_T) (vars 1 2 3): (1 0 0)
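These counts can be checked mechanically; a small sketch, assuming the four data rows recoverable from Table 1 and using our own helper names:

```python
# The four data rows of Table 1 (columns are variables 1, 2, 3); the child is i = 3.
D = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]

def unique_rows(cols):
    """Unique instantiations of the given variables (1-based indices)."""
    return {tuple(row[c - 1] for c in cols) for row in D}

def unique_rows_given(cols, parents, j):
    """Unique instantiations of `cols` among rows whose parent values equal j."""
    return {tuple(row[c - 1] for c in cols) for row in D
            if tuple(row[c - 1] for c in parents) == j}

assert len(unique_rows((1, 3))) == 3                           # |Du^{S u {i}}|
assert len(unique_rows((1, 2, 3))) == 4                        # |Du^{T u {i}}|
assert len(unique_rows_given((1, 3), (1,), (1,))) == 2         # |Du^{S u {i}}(j_S)|
assert len(unique_rows_given((1, 2, 3), (1, 2), (1, 1))) == 2  # |Du^{T u {i}}(j_T)|
```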
Theorem 1 (Bound f [7, 10]). Let $S \subseteq V^{\setminus i}$ and $j \in D_u^{S}$, and let $f(S, j) = -|D_u^{S \cup \{i\}}(j)| \log q(i)$. Then $\mathrm{LLBDeu}(S, j) \le f(S, j)$. Moreover, if $\mathrm{LBDeu}(S') \ge \sum_{j \in D_u^{S}} f(S, j) = f(S)$ for some $S' \subset S$, then all $T \supseteq S$ are not in $L_i$.
This means we compute the number of non-zero counts per instantiation, $|D_u^{S \cup \{i\}}(j)|$, and we 'gain' $\log q(i)$ for each of them. Note that $f(S) = -|D_u^{S \cup \{i\}}| \log q(i)$, which by Lemma 1 is monotonically non-increasing over expansions of the parent set $S$. Hence $f(S)$ is not only an upper bound on $\mathrm{LBDeu}(S)$ but also on $\mathrm{LBDeu}(T)$ for every $T \supseteq S$. Bound $f$ is cheap to compute but is unfortunately too loose. We derive much tighter upper bounds which are actually bounds on these llBs. Thus, an upper bound for a local BDeu score is obtained by simple addition, as just described. We will derive an upper bound on $\mathrm{LLBDeu}(S, j)$ (where $n_j > 0$) by considering instantiation counts for the full parent set $V^{\setminus i}$, the parent set which includes all possible parents for child $i$. We call these full instantiation counts. Evidently, the number of full parent instantiations $q(V^{\setminus i})$ grows exponentially with $|V^{\setminus i}|$, but it is linear in $|D|$ when we consider only the unique elements $D_u^{V^{\setminus i}}$.
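For illustration, bound $f$ only needs the number of distinct (parents, child) configurations seen in the data; a minimal sketch under our own naming, where data rows are tuples of discrete values and `q_i` is the number of states of the child:

```python
from math import log

def bound_f(data, i, S, q_i):
    """f(S) = -|Du^{S u {i}}| log q(i): upper bound on LBDeu(T) for every T containing S."""
    distinct = {tuple(row[v] for v in S) + (row[i],) for row in data}
    return -len(distinct) * log(q_i)
```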
Lemma 2. Let $x$ be a positive integer. Then $\Gamma_\alpha(0) = 1$ and $\log \Gamma_\alpha(x) = \sum_{\ell=0}^{x-1} \log(\ell + \alpha)$.
Lemma 3. For $x$ a positive integer and $v \ge 1$, it holds that $\log\!\left(\Gamma_\alpha(x) / \Gamma_{\alpha/v}(x)\right) \ge \log v$.
Corollary 1. Let $x_1, \ldots, x_k$ be a list of non-negative integers in decreasing order with $x_1 > 0$. Then
$$\Gamma_\alpha\!\left(\sum_{l=1}^{k} x_l\right) \ge \prod_{l=1}^{k} \Gamma_\alpha(x_l) \prod_{l=1}^{k'-1} (1 + x_l/\alpha)\,,$$
where $k' \le k$ is the last positive integer in the list (the second product disappears if $k' = 1$).
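As a quick sanity check of Corollary 1 (our worked instance), take $k = 2$, $x_1 = 2$, $x_2 = 1$, so $k' = 2$; by Lemma 2,
$$\Gamma_\alpha(3) = \alpha(\alpha+1)(\alpha+2)\,, \qquad \Gamma_\alpha(2)\,\Gamma_\alpha(1)\left(1 + \tfrac{2}{\alpha}\right) = \alpha(\alpha+1)\cdot\alpha\cdot\frac{\alpha+2}{\alpha} = \alpha(\alpha+1)(\alpha+2)\,,$$
so the inequality holds with equality in this case.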
Lemma 5. For $S \subseteq V^{\setminus i}$ and $j \in D_u^{S}$, assume that $\vec{n}_j = (n_{j,k})_{k \in c(i)}$ are in decreasing order over $k = 1, \ldots, q(i)$ (this is without loss of generality, since we can name and process them in any order). Then for any $\alpha \ge \alpha_j = \alpha_{\mathrm{ess}}/q(S)$, we have
$$\mathrm{LLBDeu}(S, j) \le f(S, j) + g(S, j, \alpha)\,, \quad \text{with } g(S, j, \alpha) = -\sum_{l=1}^{k'-1} \log(1 + n_{j,l}/\alpha)\,,$$
where $k'$ is, as in Corollary 1, the index of the last positive count.
Proof. First we prove that $f(S, j_S) + g(S, j_S)$ is an upper bound for $\mathrm{LLBDeu}(S, j_S)$. From Lemma 6, if we take any instantiation of the fully expanded parent set, $j \in D_u^{V^{\setminus i}}$ with $j^S = j_S$, we have that $g(S, j_S, \alpha) \le g(V^{\setminus i}, j, \alpha)$ for any $\alpha$. As Lemma 6 is valid for every full instantiation $j$, we can take the minimum over them to get the tightest bound. From Lemma 5, $\mathrm{LLBDeu}(S, j_S) \le f(S, j_S) + g(S, j_S)$. Now, if we sum all the llBs, we obtain the second part of the theorem for $S$. Finally, we need to show that this second part holds for any $T \supset S$, which follows from $f(T) \le f(S)$ (as the total number of non-zero counts only increases, by Lemma 1) and
$$\sum_{j_T \in D_u^{T}} g(T, j_T) = \sum_{j_S \in D_u^{S}} \left( \sum_{j_T \in D_u^{T} : j_T^S = j_S} g(T, j_T) \right) \le \sum_{j_S \in D_u^{S}} g(S, j_S)\,.$$
That holds as $g(T, j_T) \le 0$ and, with $j_T^S = j_S$, at least one term $g(T, j_T)$ is smaller than $g(S, j_S)$, as their minimisation spans the same full instantiations (and $g(\cdot, \cdot, \alpha)$ is non-decreasing in $\alpha$).
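In code, the per-instantiation bound of Lemma 5 needs only the non-zero counts of $j$ in decreasing order; a minimal sketch with our own naming:

```python
from math import log

def bound_fg(counts, q_i, alpha):
    """Upper bound f(S, j) + g(S, j, alpha) on LLBDeu(S, j) from Lemma 5.

    counts: the child-state counts n_{j,k} of one parent instantiation j (zeros allowed);
    q_i: number of child states q(i); alpha: any value >= alpha_ess / q(S)."""
    nz = sorted((c for c in counts if c > 0), reverse=True)  # n_{j,1} >= ... >= n_{j,k'} > 0
    f = -len(nz) * log(q_i)                                  # f(S, j) = -|Du^{S u {i}}(j)| log q(i)
    g = -sum(log(1.0 + c / alpha) for c in nz[:-1])          # sum over l = 1, ..., k'-1
    return f + g
```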
The BDeu score is simply the log marginal probability of the observed data given suitably chosen Dirichlet priors over the parameters of a BN structure. Consequently, llBs are intimately connected to the Dirichlet-multinomial conjugacy. Given a Dirichlet prior $\vec{\alpha}_j = (\alpha_{j,1}, \ldots, \alpha_{j,q(i)})$, the probability of observing data $D_{\vec{n}_j}$ with counts $\vec{n}_j = (n_{j,1}, \ldots, n_{j,q(i)})$ is:
$$\log \Pr(D_{\vec{n}_j} \mid \vec{\alpha}_j) = \log \int_p \Pr(D_{\vec{n}_j} \mid p) \Pr(p \mid \vec{\alpha}_j)\, dp\,,$$
where the first distribution under the integral is multinomial and the second is Dirichlet. Note that
$$\log \int_p \Pr(D_{\vec{n}_j} \mid p) \Pr(p \mid \vec{\alpha}_j)\, dp \le \max_p \log \Pr(D_{\vec{n}_j} \mid p)\,, \qquad (1)$$
since $\int_p \Pr(p \mid \vec{\alpha}_j)\, dp = 1$. Note also that llBs are not the probability of observing the sufficient-statistics counts, but of a particular dataset; that is, there is no multinomial coefficient which would account for all the permutations yielding the same sufficient statistics. Therefore, we may devise a bound based on the maximum (log-)likelihood estimate.
Lemma 7. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^{S}$. Then $\mathrm{LLBDeu}(S, j) \le \mathrm{ML}(\vec{n}_j)$, where $\mathrm{ML}(\vec{n}_j) = \sum_{k \in c(i)} n_{j,k} \log(n_{j,k}/n_j)$ (with $0 \log 0 = 0$).
Corollary 2. Let $S \subseteq V^{\setminus i}$ and $j_S \in D_u^{S}$. Then $\mathrm{LLBDeu}(S, j_S) \le \sum_{j \in D_u^{V^{\setminus i}} : j^S = j_S} \mathrm{ML}(\vec{n}_j)$.
We can improve further on this bound of Corollary 2 by considering llBs as a function $h$ of $\alpha$ for fixed $\vec{n}_j$, since we can study and exploit the shape of their curves:
$$h_{\vec{n}_j}(\alpha) = -\log \Gamma_\alpha(n_j) + \sum_{k \in c(i)} \log \Gamma_{\alpha/q(i)}(n_{j,k})\,.$$
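Both $\mathrm{ML}(\vec{n}_j)$ and $h_{\vec{n}_j}(\alpha)$ are straightforward to evaluate from the counts; a minimal sketch under our own naming, using `math.lgamma` for the log-Gamma terms:

```python
from math import lgamma, log

def log_gamma_a(n, a):
    """log Gamma_a(n) = log Gamma(n + a) - log Gamma(a)."""
    return lgamma(n + a) - lgamma(a)

def ml(counts):
    """ML(n_j) = sum_k n_{j,k} log(n_{j,k}/n_j), with 0 log 0 = 0 (Lemma 7)."""
    n = sum(counts)
    return sum(c * log(c / n) for c in counts if c > 0)

def h(counts, q_i, alpha):
    """h_{n_j}(alpha) = -log Gamma_alpha(n_j) + sum_k log Gamma_{alpha/q(i)}(n_{j,k}).

    With alpha = alpha_ess / q(S) this is the llB itself, which is why the text
    studies llBs as a function of alpha."""
    n = sum(counts)
    return -log_gamma_a(n, alpha) + sum(log_gamma_a(c, alpha / q_i) for c in counts)
```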
Proof. We rewrite $n_{j_S,k}$ as the sum of counts from full mass functions: $n_{j_S,k} = \sum_{j \in D_u^{V^{\setminus i}} : j^S = j_S} n_{j,k}$. Thus, $\mathrm{LLBDeu}(S, j_S)$ is the log probability $\log \Pr(D_{\vec{n}_{j_S}} \mid \vec{\alpha}_{j_S})$ of observing a data sequence with counts $\vec{n}_{j_S} = (\sum_{j \in D_u^{V^{\setminus i}} : j^S = j_S} n_{j,k})_{k \in c(i)}$ under the Dirichlet-multinomial with parameter vector $\vec{\alpha}_{j_S}$. Assume an arbitrary order for the full mass functions related to elements in $\{j \in D_u^{V^{\setminus i}} : j^S = j_S\}$ and name them $j_1, \ldots, j_w$, with $w = |\{j \in D_u^{V^{\setminus i}} : j^S = j_S\}|$. Exploiting the multinomial-Dirichlet conjugacy we can express this probability as a product of conditional probabilities:
$$\Pr(D_{\vec{n}_{j_S}} \mid \vec{\alpha}_{j_S}) = \prod_{\ell=1}^{w} \Pr\!\left( D_{\vec{n}_{j_\ell}} \,\Big|\, \sum_{t=1}^{\ell-1} \vec{n}_{j_t} + \vec{\alpha}_{j_S} \right)\,,$$
$$\mathrm{LLBDeu}(S, j_S) = \sum_{\ell=1}^{w} \log \Pr\!\left( D_{\vec{n}_{j_\ell}} \,\Big|\, \sum_{t=1}^{\ell-1} \vec{n}_{j_t} + \vec{\alpha}_{j_S} \right) \le \log \Pr(\vec{n}_{j_1} \mid \vec{\alpha}_{j_S}) + \sum_{t=2}^{w} \mathrm{ML}(\vec{n}_{j_t})\,.$$
These are obtained by applying Expression (1) to all but the first term. Since the choice of the order is arbitrary, we can pick it in our best interest and the theorem is obtained.
While the bound of Theorem 3 is valid for $S$, it gives no assurances about its supersets $T$, so it is of little direct use (if we need to compute it for every $T \supset S$, then it is better to compute the scores themselves). To address that, we replace the first term of the right-hand-side summation with a proper upper bound.
Theorem 4 (Bound h). Let $S \subseteq V^{\setminus i}$, $\alpha = \alpha_{\mathrm{ess}}/q(S)$, $j_S \in D_u^{S}$, and let $\bar{h}_{\vec{n}_j}(\alpha) = h_{\vec{n}_j}(\alpha)$ if $\alpha \le 1$ and $\frac{\partial h_{\vec{n}_j}}{\partial \alpha}(\alpha) \ge 0$, and zero otherwise. Let
$$h(S, j_S) = \min_{j \in D_u^{V^{\setminus i}} : j^S = j_S} \Big( -\mathrm{ML}(\vec{n}_j) + \min\big\{ \mathrm{ML}(\vec{n}_j);\ f(V^{\setminus i}, j) + g(V^{\setminus i}, j, \alpha);\ \bar{h}_{\vec{n}_j}(\alpha) \big\} \Big) + \sum_{j \in D_u^{V^{\setminus i}} : j^S = j_S} \mathrm{ML}(\vec{n}_j)\,. \qquad (2)$$
Then $\mathrm{LLBDeu}(S, j_S) \le h(S, j_S)$. Moreover, if $\mathrm{LBDeu}(S') \ge \sum_{j_S \in D_u^{S}} h(S, j_S) = h(S)$ for some $S' \subset S$, then $S$ and all its supersets are not in $L_i$.
Proof. For the parent set $S$, the bound based on $\mathrm{ML}(\vec{n}_j)$ (that is, using the first option in the inner minimisation) is valid by Corollary 2. The other two options make use of Theorem 3 and their own results: the bound based on $f(V^{\setminus i}, j) + g(V^{\setminus i}, j, \alpha)$ is valid by Lemma 6, while the bound based on $\bar{h}_{\vec{n}_j}(\alpha)$ comes from Lemma 9, and thus the result holds for $S$. Take $T \supset S$. It is straightforward that
$$\mathrm{LBDeu}(T) \le \sum_{j_T \in D_u^{T}} h(T, j_T) = \sum_{j_S \in D_u^{S}} \left( \sum_{j_T \in D_u^{T} : j_T^S = j_S} h(T, j_T) \right) \le \sum_{j_S \in D_u^{S}} h(S, j_S)\,,$$
since $\sum_{j_T \in D_u^{T} : j_T^S = j_S} h(T, j_T) \le h(S, j_S)$, because both sides run over the same full instantiations and the right-hand side uses the tighter minimisation of Expression (2) only once, while the left-hand side can use that tighter minimisation once for every $j_T$, and Lemmas 6 and 9 ensure that the computed values $f(V^{\setminus i}, j) + g(V^{\setminus i}, j, \alpha)$ and $\bar{h}_{\vec{n}_j}(\alpha)$ are valid for $T$.
We point out that the mathematical results may seem harder to use in practice than they actually are. Computing $g(S)$ and $h(S)$ to prune a parent set $S$ and all its supersets can be done in linear time, since one pass through the data is enough to collect and process all required counts (AD-trees [18] can be used to get even greater speedups). Since the computation of a score already takes linear time in the number of data samples, we have cheap bounds which are provably superior to the current state-of-the-art pruning for BDeu. Finally, we also point out that bounds $g$ and $h$ prune the search space differently, as their independent theoretical derivations suggest. Therefore, we combine both to get a tighter bound, which we call $C4 = \min\{g; h\}$. Their differences are illustrated in the sequel.
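For concreteness, the following is a minimal sketch of bound $h$ as in Expression (2), under our own naming. Here `full_counts` is assumed to hold, for each full instantiation $j$ compatible with $j_S$, the vector of child counts $\vec{n}_j$, and the derivative test for $\bar{h}$ uses the exact sums from the proof of Lemma 8; the combined bound $C4$ then simply takes, for each parent set, the smaller of this value and the $g$-based bound.

```python
from math import lgamma, log

def log_gamma_a(n, a):
    """log Gamma_a(n) = log Gamma(n + a) - log Gamma(a)."""
    return lgamma(n + a) - lgamma(a)

def ml(counts):
    """ML(n_j) = sum_k n_{j,k} log(n_{j,k}/n_j), with 0 log 0 = 0 (Lemma 7)."""
    n = sum(counts)
    return sum(c * log(c / n) for c in counts if c > 0)

def fg_full(counts, q_i, alpha):
    """f(V\\i, j) + g(V\\i, j, alpha) for one full instantiation j (Theorem 1, Lemma 5)."""
    nz = sorted((c for c in counts if c > 0), reverse=True)
    return -len(nz) * log(q_i) - sum(log(1.0 + c / alpha) for c in nz[:-1])

def h_bar(counts, q_i, alpha):
    """h-bar of Theorem 4: h_{n_j}(alpha) if alpha <= 1 and dh/dalpha(alpha) >= 0, else 0.

    The derivative is the exact sum from the proof of Lemma 8; we also require at least
    two non-zero counts so that Lemmas 8 and 9 apply (conservative fallback otherwise)."""
    n = sum(counts)
    deriv = (sum(1.0 / (l * q_i + alpha) for c in counts for l in range(c))
             - sum(1.0 / (l + alpha) for l in range(n)))
    if alpha <= 1 and deriv >= 0 and sum(1 for c in counts if c > 0) >= 2:
        return -log_gamma_a(n, alpha) + sum(log_gamma_a(c, alpha / q_i) for c in counts)
    return 0.0

def bound_h(full_counts, q_i, alpha):
    """h(S, j_S) as in Expression (2).

    full_counts: one child-count vector n_j per full instantiation j compatible with j_S;
    alpha = alpha_ess / q(S)."""
    mls = [ml(c) for c in full_counts]
    best = min(-m + min(m, fg_full(c, q_i, alpha), h_bar(c, q_i, alpha))
               for m, c in zip(mls, full_counts))
    return best + sum(mls)
```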
6 Experiments
To analyse the empirical gains of the new bounds, we computed the list of candidate parent sets for each variable in multiple UCI datasets [13]. In all experiments, we set $\alpha_{\mathrm{ess}} = 1$ and discretise all continuous variables by their median value. To provide an idea of the processing time, small datasets ($n \le 10$) took less than a few minutes to complete, while larger ones ($n \ge 20$) took around one day per variable (if using a single modern core). The main method is presented in Algorithm 1. Parent sets are explored in order of size (outermost loop); for each (non-pruned) parent set $S$, we verify that no subset of $S$ is better than $S$ itself before including it in the resulting set, and then we expand it by adding an extra parent, so long as the pruning criterion is not met. This algorithm is presented in simplified terms: it is possible to cache most of the results to speed up computations; a sketch of the exploration is given below.
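Since Algorithm 1 itself is not reproduced here, the following is a minimal sketch, under our own naming, of the exploration it describes: `score(S)` stands for $\mathrm{LBDeu}(i, S)$ and `upper_bound(S)` for any bound valid for $S$ and all its supersets ($f$, $g$, $h$ or $C4$).

```python
from itertools import combinations

def candidate_parent_sets(variables, i, score, upper_bound, max_in_degree=None):
    """Candidate parent set identification for child i with bound-based pruning.

    score(S)       -> LBDeu(i, S) for a frozenset S of parents
    upper_bound(S) -> an upper bound valid for S and all of its supersets
    Returns {S: score} for the candidates that were kept."""
    others = [v for v in variables if v != i]
    limit = len(others) if max_in_degree is None else max_in_degree
    kept, best_subset, pruned = {}, {}, set()

    for size in range(limit + 1):                        # parent sets by increasing size
        for S in map(frozenset, combinations(others, size)):
            if any(S > P for P in pruned):               # some subset was already pruned
                continue
            sub_best = max((best_subset.get(S - {v}, float('-inf')) for v in S),
                           default=float('-inf'))        # best score among strict subsets
            s = score(S)
            best_subset[S] = max(s, sub_best)
            if s > sub_best:                             # no subset beats S: keep it
                kept[S] = s
            if upper_bound(S) <= best_subset[S]:         # no superset can beat a subset:
                pruned.add(S)                            # do not expand S any further
    return kept
```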
For small datasets, it is feasible to score every candidate parent set, so that we can compare how far the upper bounds for a given parent set $S$ (and all its supersets) are from the true best score among $S$ and all its supersets. Figure 1 shows such a comparison for the variable Standard-of-living-index in the cmc dataset. It is clear that the new bound $C4 = \min\{g; h\}$ is much tighter than the current best bound in the literature (here called $f$) and moves considerably closer to the true best score.
Figure 2: Number of scores computed per maximum number of parents with different bounds.
Table 2: Number of computations pruned ($|L^c| = |\text{search space}| - |L|$) with each bound: $f$, $g$, $h$ and $C4$. Each dataset is characterised by its number of variables and observations, $n$ and $N$, and by the number of all possible parent combinations (|search space|). in-d is the maximum imposed in-degree (∞ means no limit was imposed) and $|g{<}h|/|h{<}g|$ is the ratio of the number of times bound $g$ was active over the number of times bound $h$ was active within $C4$.

Dataset         n   N        |search space|   in-d   |Lc_f|        |Lc_g|        |Lc_h|        |Lc_C4|       |g<h|/|h<g|
diabetes        9   768      2,304            5      0             0             0             0             ∞
                                              7      0             72            184           184           123.176
                                              ∞      0             81            193           193           123.176
nursery         9   12,960   2,304            5      0             0             342           342           6.176
                                              7      8             188           626           630           5.331
                                              ∞      16            197           635           639           5.331
cmc             10  1,473    5,120            5      0             23            35            47            7.556
                                              7      6             746           766           828           6.045
                                              ∞      16            846           866           928           6.045
heart-h         12  294      24,576           5      0             252           86            252           1.434
                                              8      1,109         7,469         6,090         7,504         1.019
                                              ∞      1,321         8,273         6,894         8,308         1.019
solar-flare     12  1,066    24,576           5      884           1,810         2,170         2,462         2.799
                                              8      3,741         8,890         10,561        11,043        2.885
                                              ∞      4,043         9,672         11,348        11,834        2.877
vowel           14  990      1.147·10^5       5      1,564         1,833         1,707         1,837         0.182
                                              9      7,579         22,701        21,962        22,729        6.152·10^-2
                                              ∞      8,350         26,298        25,411        26,330        6.162·10^-2
zoo             17  101      1.114·10^6       5      7,760         18,026        18,818        20,604        0.252
                                              11     3.782·10^5    7.834·10^5    7.483·10^5    7.925·10^5    0.104
                                              ∞      4.054·10^5    8.262·10^5    7.91·10^5     8.353·10^5    0.104
vote            17  435      1.114·10^6       5      0             0             0             0             0.919
                                              11     40,544        2.594·10^5    2.126·10^5    2.776·10^5    0.414
                                              ∞      55,067        3.021·10^5    2.552·10^5    3.203·10^5    0.414
segment         17  2,310    1.114·10^6       5      0             0             0             0             0.184
                                              11     39,948        2.229·10^5    2.915·10^5    2.915·10^5    0.256
                                              ∞      51,902        2.614·10^5    3.317·10^5    3.317·10^5    0.256
pendigits       17  10,992   9.83·10^5        5      0             0             0             0             0
                                              11     0             2,386         47,757        41,619        4.143·10^-2
                                              ∞      0             17,445        76,982        70,321        4.383·10^-2
lymph           18  148      2.359·10^6       5      7,295         8,344         6,076         8,344         12,375.846
                                              11     1.02·10^6     1.237·10^6    9.489·10^5    1.237·10^6    73,280.769
                                              ∞      1.182·10^6    1.406·10^6    1.114·10^6    1.406·10^6    73,327.538
primary-tumor   18  339      2.359·10^6       5      2,460         3,555         2,667         3,555         5.465·10^-2
                                              11     4.292·10^5    8.572·10^5    6.751·10^5    8.572·10^5    6.856·10^-3
                                              ∞      5.105·10^5    1.015·10^6    8.223·10^5    1.015·10^6    6.797·10^-3
vehicle         19  846      4.981·10^6       5      0             108           54            108           1.197
                                              12     6.614·10^5    2.082·10^6    1.848·10^6    2.12·10^6     0.474
                                              ∞      7.582·10^5    2.319·10^6    2.084·10^6    2.358·10^6    0.473
hepatitis       20  155      7.864·10^6       5      0             0             0             0             397.196
                                              12     2.155·10^6    3.341·10^6    2.338·10^6    3.341·10^6    8,198.164
                                              ∞      2.599·10^6    3.795·10^6    2.905·10^6    3.795·10^6    6,100.199
colic           23  368      9.647·10^7       5      1,170         2,415         934           2,415         ∞
                                              14     2.116·10^7    2.122·10^7    2.042·10^7    2.122·10^7    ∞
                                              ∞      2.277·10^7    2.284·10^7    2.203·10^7    2.284·10^7    ∞
autos           26  205      8.724·10^8       5      1.388·10^5    1.829·10^5    1.544·10^5    1.829·10^5    ∞
                                              15     1.265·10^8    1.265·10^8    1.258·10^8    1.265·10^8    45,904.333
                                              ∞      1.432·10^8    1.432·10^8    1.425·10^8    1.432·10^8    45,904.333
flags           29  194      7.785·10^9       5      2.782·10^5    2.834·10^5    1.275·10^5    2.834·10^5    ∞
                                              17     1.085·10^9    1.085·10^9    1.083·10^9    1.085·10^9    ∞
                                              ∞      1.196·10^9    1.196·10^9    1.194·10^9    1.196·10^9    ∞
7 Conclusions
We have devised new theoretical bounds for learning Bayesian networks with the BDeu score. These
bounds come from analysing the score function from multiple angles and provide significant benefits
in reducing the search space of parent sets for each node of the network. Empirical results with
multiple UCI datasets illustrate the benefits that can be achieved in practice with the theoretical
bounds. In particular, the new bounds allow us to explore the whole search space of parent sets using BDeu more efficiently, without imposing bounds on the maximum in-degree, which was previously a major bottleneck for domains beyond a dozen or so variables.
As future work, tighter bounds may be possible by replacing the maximum likelihood estimation terms in the formulas, as well as by using different search orders for exploring the space of parent sets, which could benefit even further from these bounds. In particular, if one were to run a branch-and-bound approach to explore the parent sets of a node, it would be possible to use these bounds more effectively by considering not only the parent sets and corresponding full instantiations but also partial instantiations formed by disallowing some variables to be parents in some of the branches. The mathematical details to realise such ideas, as well as an improved implementation of our bounds using sophisticated tailored data structures, are natural next steps in this research.
References
[1] Mark Bartlett and James Cussens. Integer linear programming for the Bayesian network struc-
ture learning problem. Artificial Intelligence, 244:258–271, 2017.
[2] Eunice Yuh-Jie Chen, Yujia Shen, Arthur Choi, and Adnan Darwiche. Learning Bayesian
networks with ancestral constraints. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,
and R. Garnett, editors, Advances in Neural Information Processing Systems 29 (NIPS), pages
2325–2333. Curran Associates, Inc., 2016.
[3] David M. Chickering, David Heckerman, and Christopher Meek. Large-sample learning of
Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, October 2004.
[4] Gregory F. Cooper and Edward Herskovits. A Bayesian Method for the Induction of Proba-
bilistic Networks from Data. Machine Learning, 9:309–347, 1992.
[5] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th
Conference on Uncertainty in Artificial Intelligence (UAI), pages 153–160. AUAI Press, 2011.
[6] James Cussens. An upper bound for BDeu local scores. Proc. ECAI-2012 workshop on algo-
rithmic issues for inference in graphical models (AIGM), 2012.
[7] James Cussens and Mark Bartlett. GOBNILP 1.6.2 User/Developer Manual, 2015.
[8] James Cussens, Mark Bartlett, Elinor M. Jones, and Nuala A. Sheehan. Maximum likelihood
pedigree reconstruction using integer linear programming. Genetic Epidemiology, 37(1):69–
83, 2013.
[9] Cassio de Campos and Qiang Ji. Properties of Bayesian Dirichlet scores to learn Bayesian
network structures. In AAAI Conference on Artificial Intelligence (AAAI), pages
431–436, 2010.
[10] Cassio de Campos and Qiang Ji. Efficient structure learning of Bayesian networks using con-
straints. Journal of Machine Learning Research, 12:663–689, 2011.
[11] Cassio de Campos, Mauro Scanagatta, Giorgio Corani, and Marco Zaffalon. Entropy-based
pruning for learning Bayesian networks using BIC. Artificial Intelligence, 260(C):42–50, 2018.
[12] Cassio de Campos, Zhi Zeng, and Qiang Ji. Structure learning of Bayesian networks using con-
straints. In Proc. of the 26th International Conference on Machine Learning (ICML), volume
382, pages 113–120. ACM, 2009.
[13] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[14] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[15] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian net-
work structure using LP relaxations. In Proceedings of 13th International Conference on Arti-
ficial Intelligence and Statistics (AISTATS 2010), volume 9, pages 358–365, 2010. Journal of
Machine Learning Research Workshop and Conference Proceedings.
[16] Mikko Koivisto. Parent Assignment Is Hard for the MDL, AIC, and NML Costs. In Computa-
tional Learning Theory (COLT), volume 4005, pages 289–303. Springer, 2006.
[17] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks.
Journal of Machine Learning Research, 5:549–573, 2004.
[18] Andrew Moore and Mary Soon Lee. Cached sufficient statistics for efficient machine learning
with large datasets. Journal of Artificial Intelligence Research, 8:67–91, 1998.
[19] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[20] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm
for learning Bayesian networks. In Conference on Uncertainty in Artificial Intelligence (UAI),
pages 584–590, 2005.
[21] Changhe Yuan and Brandon Malone. An improved admissible heuristic for learning optimal
Bayesian networks. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelli-
gence (UAI), pages 924–933, Catalina Island, CA, 2012.
[22] Changhe Yuan and Brandon Malone. Learning optimal Bayesian networks: A shortest path
perspective. Journal of Artificial Intelligence Research, 48:23–65, October 2013.
Appendix A - Proofs
Lemma 1. For $S \subseteq T \subseteq V^{\setminus i}$, $j_S \in D_u^{S}$ and $j_T \in D_u^{T}$ with $j_T^S = j_S$, we have $|D_u^{T \cup \{i\}}| \ge |D_u^{S \cup \{i\}}|$ and $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$.
Proof. Given that $S \subseteq T \subseteq V^{\setminus i}$, every instantiation in $D_u^{S \cup \{i\}}$ is compatible with one or more instantiations in $D_u^{T \cup \{i\}}$, and it follows that $|D_u^{T \cup \{i\}}| \ge |D_u^{S \cup \{i\}}|$. The relationship is reversed when we consider the number of unique occurrences compatible with a given instantiation. By construction $j_T^S = j_S$, so if there is an instantiation $j_T \in D_u^{T}$, there must be at least one instantiation $j_S \in D_u^{S}$, and it follows that $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$. Note that both $|D_u^{T \cup \{i\}}(j_T)|$ and $|D_u^{S \cup \{i\}}(j_S)|$ are bounded by $q(i)$: one instantiation for each possible value the child $i$ can assume.
Lemma 2. Let $x$ be a positive integer. Then $\Gamma_\alpha(0) = 1$ and $\log \Gamma_\alpha(x) = \sum_{\ell=0}^{x-1} \log(\ell + \alpha)$.
Lemma 4. Let $x$ and $y$ be non-negative integers. If both are positive, then $\Gamma_\alpha(x + y) \ge \Gamma_\alpha(x)\, \Gamma_\alpha(y)\, (1 + y/\alpha)$; if either is zero, then $\Gamma_\alpha(x + y) = \Gamma_\alpha(x)\, \Gamma_\alpha(y)$.
Proof. If either $x$ or $y$ is zero, then its term cancels out and the equality holds. Otherwise we apply Lemma 2 three times and manipulate the products:
$$\frac{\Gamma_\alpha(x + y)}{\Gamma_\alpha(x)\, \Gamma_\alpha(y)} = \frac{\prod_{z=0}^{x+y-1}(z + \alpha)}{\prod_{z=0}^{y-1}(z + \alpha) \prod_{z=0}^{x-1}(z + \alpha)} = \prod_{z=y}^{x+y-1}(z + \alpha) \prod_{z=0}^{x-1}\frac{1}{z + \alpha} = \prod_{z=0}^{x-1}\frac{y + z + \alpha}{z + \alpha} \ge \frac{y + \alpha}{\alpha}\,.$$
Lemma 5. For $S \subseteq V^{\setminus i}$ and $j \in D_u^{S}$, assume that $\vec{n}_j = (n_{j,k})_{k \in c(i)}$ are in decreasing order over $k = 1, \ldots, q(i)$ (this is without loss of generality, since we can name and process them in any order). Then for any $\alpha \ge \alpha_j = \alpha_{\mathrm{ess}}/q(S)$, we have
$$\mathrm{LLBDeu}(S, j) \le f(S, j) + g(S, j, \alpha)\,, \quad \text{with } g(S, j, \alpha) = -\sum_{l=1}^{k'-1} \log(1 + n_{j,l}/\alpha)\,,$$
with $\alpha \ge \alpha_j$ and $\log\!\left(\Gamma_{\alpha_{j,k}}(n_{j,k}) / \Gamma_{\alpha_j}(n_{j,k})\right) \le -\log q(i)$ by Lemma 3 whenever $n_{j,k} > 0$.
Lemma 6. For $S \subseteq T \subseteq V^{\setminus i}$, $j_T \in D_u^{T}$, and $j_S \in D_u^{S}$ with $j_T^S = j_S$, $f(T, j_T) \ge f(S, j_S)$ and $g(T, j_T, \alpha) \ge g(S, j_S, \alpha)$.
Proof. Because $j_T^S = j_S$, $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$. Moreover, $n_{j_T,k} \le n_{j_S,k}$ for every $k \in c(i)$ (the counts get partitioned as more parents are introduced to go from $S$ to $T$), so $(1 + n_{j_T,k}/\alpha) \le (1 + n_{j_S,k}/\alpha)$ for every $k$, and the result follows.
Lemma 7. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^{S}$. Then $\mathrm{LLBDeu}(S, j) \le \mathrm{ML}(\vec{n}_j)$, where $\mathrm{ML}(\vec{n}_j) = \sum_{k \in c(i)} n_{j,k} \log(n_{j,k}/n_j)$ (with $0 \log 0 = 0$).
Proof. The llB is simply the log probability of observing a data sequence with counts $\vec{n}_j$ under a Dirichlet-multinomial distribution with parameter vector $\vec{\alpha}_j$. The result follows from Expression (1) and holds for any prior $\vec{\alpha}_j$.
Lemma 8. If there is no $k$ such that $n_{j,k} = n_j$, then $h_{\vec{n}_j}$ is a concave function for positive $\alpha \le 1$.
Proof. Using the identity in Lemma 2, or, equivalently, by exploiting known properties of the digamma and trigamma functions, we have:
$$\frac{\partial h_{\vec{n}_j}}{\partial \alpha}(\alpha) = \sum_{k=1}^{q(i)} \sum_{\ell=0}^{n_{j,k}-1} \frac{1}{\ell\, q(i) + \alpha} - \sum_{\ell=0}^{n_j - 1} \frac{1}{\ell + \alpha}$$
and
$$\frac{\partial^2 h_{\vec{n}_j}}{\partial \alpha^2}(\alpha) = \sum_{\ell=0}^{n_j - 1} \frac{1}{(\ell + \alpha)^2} - \sum_{k=1}^{q(i)} \sum_{\ell=0}^{n_{j,k}-1} \frac{1}{(\ell\, q(i) + \alpha)^2}\,.$$
It suffices to show that $\frac{\partial^2 h_{\vec{n}_j}}{\partial \alpha^2}(\alpha)$ is always negative under the conditions of the theorem. If there are at least two $n_{j,k} > 0$, then
$$\frac{\partial^2 h_{\vec{n}_j}}{\partial \alpha^2}(\alpha) \le \sum_{\ell=0}^{n_j - 1} \frac{1}{(\ell + \alpha)^2} - \frac{2}{\alpha^2}$$
simply by ignoring all those negative terms with $\ell \ge 1$. Now we approximate it by the infinite sum of quadratic reciprocals:
$$\frac{\partial^2 h_{\vec{n}_j}}{\partial \alpha^2}(\alpha) \le \sum_{\ell=0}^{n_j - 1} \frac{1}{(\ell + \alpha)^2} - \frac{2}{\alpha^2} = -\frac{1}{\alpha^2} + \frac{1}{(1 + \alpha)^2} + \sum_{\ell=2}^{n_j - 1} \frac{1}{(\ell + \alpha)^2} < -\frac{1}{\alpha^2} + \frac{1}{(1 + \alpha)^2} + \sum_{\ell=2}^{\infty} \frac{1}{\ell^2} = -\frac{1}{\alpha^2} + \frac{1}{(1 + \alpha)^2} + \frac{\pi^2}{6} - 1\,,$$
which is negative for any $\alpha \le 1$ (the gap between the two fractions containing $\alpha$ obviously decreases as $\alpha$ increases, so it is enough to check the sign at the largest value $\alpha = 1$). Thus $\frac{\partial^2 h_{\vec{n}_j}}{\partial \alpha^2}(\alpha) < 0$.
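A quick numerical sanity check of the sign claim (our code; it evaluates the exact second-derivative sums above on a sample count vector):

```python
def d2h(counts, q_i, alpha):
    """Second derivative of h_{n_j} at alpha, using the exact sums from the proof above."""
    n = sum(counts)
    return (sum(1.0 / (l + alpha) ** 2 for l in range(n))
            - sum(1.0 / (l * q_i + alpha) ** 2 for c in counts for l in range(c)))

# With at least two non-zero counts, the second derivative stays negative for 0 < alpha <= 1:
assert all(d2h([5, 3, 1], 3, a / 100.0) < 0 for a in range(1, 101))
```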
Lemma 9. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^{V^{\setminus i}}$ such that there is no $k$ with $n_{j,k} = n_j$. If $\alpha \le q(S)$ and $\frac{\partial h_{\vec{n}_j}}{\partial \alpha}(\alpha/q(S))$ is non-negative, then $h_{\vec{n}_j}(\alpha/q(T)) \le h_{\vec{n}_j}(\alpha/q(S))$ for every $T \supseteq S$.
Proof. Since there is no $k$ with $n_{j,k} = n_j$ and $\alpha/q(S) \le 1$, $h_{\vec{n}_j}$ is concave (Lemma 8), and since $\frac{\partial h_{\vec{n}_j}}{\partial \alpha}(\alpha/q(S)) \ge 0$, $h_{\vec{n}_j}$ is non-decreasing on $(0, \alpha/q(S)]$; as $q(T) \ge q(S)$ implies $\alpha/q(T) \le \alpha/q(S)$, the result follows.
Corollary 1. Let $x_1, \ldots, x_k$ be a list of non-negative integers in decreasing order with $x_1 > 0$. Then
$$\Gamma_\alpha\!\left(\sum_{l=1}^{k} x_l\right) \ge \prod_{l=1}^{k} \Gamma_\alpha(x_l) \prod_{l=1}^{k'-1} (1 + x_l/\alpha)\,,$$
where $k' \le k$ is the last positive integer in the list (the second product disappears if $k' = 1$).
Proof. Repeatedly apply Lemma 4 to $x_t + \left(\sum_{l=t+1}^{k} x_l\right)$ until all elements are processed. While both the current $x_t$ and the rest of the list are positive (that is, until $t = k' - 1$), we gain the extra term $(1 + x_t/\alpha)$. After that, we only 'collect' the Gamma functions, so the result follows.
Corollary 2. Let $S \subseteq V^{\setminus i}$ and $j_S \in D_u^{S}$. Then $\mathrm{LLBDeu}(S, j_S) \le \sum_{j \in D_u^{V^{\setminus i}} : j^S = j_S} \mathrm{ML}(\vec{n}_j)$.
Proof. This follows from the properties of the maximum likelihood estimation, because it is monotonically non-decreasing with the expansion of parent sets (we fit better in maximum likelihood when having more parents).