
On Pruning for Score-Based Bayesian Network Structure Learning

arXiv:1905.09943v1 [stat.ML] 23 May 2019

Alvaro H. C. Correia*          James Cussens*           Cassio de Campos*
Utrecht University             University of York       Utrecht University
Utrecht, The Netherlands       York, United Kingdom     Utrecht, The Netherlands
[email protected]              [email protected]         [email protected]

Abstract
Many algorithms for score-based Bayesian network structure learning (BNSL)
take as input a collection of potentially optimal parent sets for each variable in
a data set. Constructing these collections naively is computationally intensive,
since the number of parent sets grows exponentially with the number of variables.
Therefore, pruning techniques are not only desirable but essential. While effective
pruning exists for the Bayesian Information Criterion (BIC), current results for the
Bayesian Dirichlet equivalent uniform (BDeu) score reduce the search space very
modestly, hampering the use of (the often preferred) BDeu. We derive new non-trivial
theoretical upper bounds for the BDeu score that considerably improve on
the state of the art. Since the new bounds are efficient and easy to implement, they
can be promptly integrated into many BNSL methods. We show that the gains can be
significant in multiple UCI data sets, highlighting the practical implications of the
theoretical advances.

1 Introduction
A Bayesian network [19] is a widely used probabilistic graphical model. It is composed of (i) a
structure defined by a directed acyclic graph (DAG) where each node is associated with a random
variable, and where arcs represent dependencies between the variables entailing the Markov condition:
every variable is conditionally independent of its non-descendant variables given its parents;
and (ii) a collection of conditional probability distributions defined for each variable given its parents
in the graph. Their graphical nature makes Bayesian networks ideal for representing the complex
probabilistic relationships that arise in many real-world problems [8].
Bayesian network structure learning (BNSL) with complete data is NP-hard [3]. We tackle score-based
learning, that is, finding the structure maximising a given (data-dependent) score [14]. In
particular, we focus on the Bayesian Dirichlet equivalent uniform (BDeu) score [4], which corresponds
to the log probability of the structure given (multinomial) data and a uniform prior on
structures. The BDeu score is decomposable, that is, it can be written as a sum of local scores of
the domain variables: $\mathrm{BDeu}(G) = \sum_{i \in V} \mathrm{LBDeu}(i, S_i)$, where LBDeu is the local score function,
$V = \{1, \dots, n\}$ is the set of (indices of) variables in the dataset, which is in correspondence with the
nodes of the Bayesian network to be learned, and $S_i \subseteq V^{\setminus i}$, with $V^{\setminus i} = V \setminus \{i\}$, is the parent set of
node $i$ in the DAG structure $G$. A common approach divides the problem into two steps:
1. CANDIDATE PARENT SET IDENTIFICATION: For each variable of the domain, find a suitable
collection of candidate parent sets and their local scores.
2. STRUCTURE OPTIMISATION: Given the collection of candidate parent sets, choose a parent
set for each variable so as to maximise the overall score while avoiding directed cycles.

Preprint. Under review.


* All authors contributed equally to this work.
This paper concerns pruning ideas to help solve candidate parent set identification. The problem is
unlikely to admit a polynomial-time (in $n$) algorithm (it is proven to be LOGSNP-hard [16] for BIC),
so usually one chooses a maximum in-degree $d$ (number of parents per node) and then computes the
score of parent sets with in-degree at most $d$. Increasing the maximum in-degree can considerably
improve the chances of finding better structures but requires more computation, since there
are $\Theta(n^d)$ candidate parent sets (per variable) for a bound of $d$ if an exhaustive search is performed,
and $2^{n-1}$ without an in-degree constraint. For instance, $d > 2$ can already become prohibitive [1].
Our contribution is to provide new theoretical upper bounds for the local scores in order to prune
non-optimal parent sets without ever having to compute their scores. Such upper bounds can then be
used together with any search approach [2, 5, 10, 12, 15, 17, 21, 22]. These bounds are efficient to
compute and easy to implement, so they can be readily integrated into existing software for BNSL.
While the main goal of this paper is to provide new theoretical upper bounds that are provably
superior to the state of the art [6, 9, 10], we also investigate how effective such bounds are in
practice. This is done by performing experiments with multiple datasets from the UCI Machine
Learning Repository [13]. The results support our motivation for new, tighter bounds, in particular
by allowing us to learn more efficiently without a maximum in-degree $d$, which may be especially
important in domains with complex relations.
The paper is organised as follows. Section 2 provides the notation and required definitions, and
Section 3 gives a brief description of the current best bound for BDeu in the literature. Section 4 presents
an improved bound whose derivation follows the same approach as the existing one, but exploits
properties of the score function to get tighter results. This new bound is effective in many datasets,
as we show in the experiments. Still, it does not capture all cases, and other bounds can be devised.
Section 5 looks at the problem from a new angle and introduces an upper bound based on a (tweaked)
maximum likelihood estimation. These bounds are finally combined and empirically compared
against each other in Section 6. Section 7 concludes the paper and gives directions for future research.
The proofs of intermediate lemmas and corollaries are left to the appendix for brevity.

2 Definitions and notation


First of all, because the collection of scores is computed independently for each variable in the
dataset (BDeu is decomposable), we drop $i$ from the notation and use simply $\mathrm{LBDeu}(S)$ to refer to
the score of node $i$ with parent set $S$. We need some further notation:
• $c(i)$ is the state space of variable $i$, and $c(S)$ is the set of all joint instantiations/configurations
of the random variables in $S \subseteq V$, that is, $c(S) = \prod_{j \in S} c(j)$, the Cartesian product of the state
spaces of the involved variables. Moreover, $q(S) = |c(S)|$, and we abuse notation to write $q(i) = |c(i)|$.
• The data $D$ is a multiset (that is, repetitions are allowed) of elements from $c(V)$, with $D^S$ the
reduction of $D$ only to the part regarding the variables in $S \subseteq V$ (note that $D = D^V$),
and $D^S(j_{S'}) \subseteq D^S$, with $j_{S'} \in c(S')$, the elements of $D^S$ whose restriction to $S \cap S'$ equals $j_{S'}^{S \cap S'}$. The
subscript of $j_{S'}$ is omitted if clear from the context. We use the notation $D_u$ instead of $D$ to
denote the set of unique elements of the corresponding multiset $D$.
• For $j \in c(S)$, we define $n_j = |D^S(j)|$, that is, the number of occurrences of $j$ in $D^S$.
• $\vec\alpha_j = (\alpha_{j,k})_{k \in c(i)}$ is the prior vector for parent set $S \subseteq V^{\setminus i}$ under configuration $j \in c(S)$, which
in the BDeu score satisfies $\alpha_{j,k} = \alpha_{\mathrm{ess}}/q(S \cup \{i\})$, with $\alpha_{\mathrm{ess}}$ the equivalent sample size, a user
parameter that defines the strength of the prior.

Let $\Gamma_\alpha(x) = \Gamma(x+\alpha)/\Gamma(\alpha)$ for $x$ a nonnegative integer and $\alpha > 0$ ($\Gamma$ denotes the Gamma function). Denote
$\sum_{k \in c(i)} \alpha_{j,k} = \alpha_{\mathrm{ess}}/q(S)$ by $\alpha_j$. The local score for $i$ with parent set $S \subseteq V^{\setminus i}$ can be written as

$$\mathrm{LBDeu}(S) = \sum_{j \in c(S)} \mathrm{LLBDeu}(S, j), \quad \text{with} \quad \mathrm{LLBDeu}(S, j) = -\log \Gamma_{\alpha_j}(n_j) + \sum_{k \in c(i)} \log \Gamma_{\alpha_{j,k}}(n_{j,k}).$$

That is, $\mathrm{LBDeu}(S)$ is a sum of $q(S)$ values, each of which is specific to a particular instantiation of
the variables $S$. We call such values local local BDeu scores (llB). In particular, $\mathrm{LLBDeu}(S, j) = 0$
if its $n_j = 0$, so we can concentrate only on those instantiations which actually appear in the data:

$$\mathrm{LBDeu}(S) = \sum_{j \in D_u^S} \mathrm{LLBDeu}(S, j).$$
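To make these definitions concrete, the following is a minimal Python sketch of the local score computation (not the authors' code; the data layout, rows as tuples of discrete values, and all helper names are our own assumptions):

    import math
    from collections import Counter
    from scipy.special import gammaln

    def log_gamma_ratio(x, a):
        # log Gamma_a(x) = log Gamma(x + a) - log Gamma(a)
        return gammaln(x + a) - gammaln(a)

    def lbdeu(data, i, S, q, a_ess=1.0):
        # Local BDeu score LBDeu(S) of child i with parent set S (a tuple of
        # variable indices); q maps each variable index to its number of
        # states, and a_ess is the equivalent sample size alpha_ess.
        qS = math.prod(q[s] for s in S)        # q(S); the empty product is 1
        alpha_j = a_ess / qS                   # alpha_j = alpha_ess / q(S)
        alpha_jk = a_ess / (qS * q[i])         # alpha_ess / q(S u {i})
        n_jk = Counter((tuple(row[s] for s in S), row[i]) for row in data)
        n_j = Counter()
        for (j, _), n in n_jk.items():
            n_j[j] += n                        # n_j = sum_k n_{j,k}
        score = sum(log_gamma_ratio(n, alpha_jk) for n in n_jk.values())
        score -= sum(log_gamma_ratio(n, alpha_j) for n in n_j.values())
        return score

Only instantiations that occur in the data contribute, exactly as in the last summation above.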

3 Pruning in Candidate Parent Set Identification
The pruning of parent sets rests on the (simple) observation that a parent set cannot be optimal if
one of its subsets has a higher score [20]. Thus, when learning Bayesian networks from data using
BDeu, it is important to have an upper bound $\mathrm{ub}(S) \ge \max_{T \supset S} \mathrm{LBDeu}(T)$ so as to potentially
prune a whole area of the search space at once. Ideally, one would like an upper bound that is tight
and cheap to compute, so that one can score parent sets $S$ incrementally and, at the same time, check
whether it is worth 'expanding' $S$: if $\mathrm{ub}(S)$ is not greater than $\max_{R \subseteq S} \mathrm{LBDeu}(R)$, then it is
unnecessary to expand $S$. In practice, however, there is a trade-off between these two desiderata.
With that in mind, we can define candidate parent set identification more formally:

CANDIDATE PARENT SET IDENTIFICATION: For each variable $i \in V$, find a collection of
parent sets $L_i$ such that $L_i = \{S \subseteq V^{\setminus i} : S' \subset S \Rightarrow \mathrm{LBDeu}(S') < \mathrm{LBDeu}(S)\}$.

Unfortunately, we cannot predict the elements of $L_i$ and have to compute the scores for a list
$\overline{L}_i \supseteq L_i$. The practical benefit of our bounds is to reduce $|\overline{L}_i|$, and consequently to lower the
computational cost, while ensuring that $\overline{L}_i \supseteq L_i$. Before presenting the best known upper bound
[6, 9, 10], we present a lemma on the variation of counts with expansions of the parent set.
Lemma 1. For $S \subseteq T \subseteq V^{\setminus i}$, $j_S \in D_u^S$ and $j_T \in D_u^T$ with $j_T^S = j_S$, we have $|D_u^{T \cup \{i\}}| \ge |D_u^{S \cup \{i\}}|$ and $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$.
As an example, consider the small dataset of Table 1. The number of non-zero counts never decreases
as we add a new variable to the parent set of variable $i = 3$. With $S = \{1\}$ and $T = \{1, 2\}$, we
have $|D_u^{S \cup \{i\}}| = 3$ and $|D_u^{T \cup \{i\}}| = 4$. Conversely, the number of (unique) occurrences compatible
with a given instantiation of the parent set never increases with its expansion: for example, with
$j_S = (\mathrm{var}_1{:}1)$ and $j_T = (\mathrm{var}_1{:}1, \mathrm{var}_2{:}1)$, we have $|D_u^{S \cup \{i\}}(j_S)| = 2$ and $|D_u^{T \cup \{i\}}(j_T)| = 2$.

Table 1: Example of data $D$, its reductions by parent sets $S = \{1\}$ and $T = \{1, 2\}$, and the number
of unique occurrences compatible with $j_S \in D_u^S$ and $j_T, j'_T \in D_u^T$, with $j_T^S = j'^S_T = j_S$. The child
variable is $i = 3$, and we have $j_S = (\mathrm{var}_1{:}1)$, $j_T = (\mathrm{var}_1{:}1, \mathrm{var}_2{:}1)$, $j'_T = (\mathrm{var}_1{:}1, \mathrm{var}_2{:}0)$.

      D     | D_u^{S∪{i}} | D_u^{T∪{i}} | D_u^{S∪{i}}(j_S) | D_u^{T∪{i}}(j_T) | D_u^{T∪{i}}(j'_T)
    1 2 3   |    1 3      |   1 2 3     |       1 3        |      1 2 3       |      1 2 3
    0 0 0   |    0 0      |   0 0 0     |       1 0        |      1 1 0       |      1 0 0
    1 0 0   |    1 0      |   1 0 0     |       1 1        |      1 1 1       |
    1 1 0   |    1 1      |   1 1 0     |                  |                  |
    1 1 1   |             |   1 1 1     |                  |                  |
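The counts in Table 1 can be verified mechanically; a throwaway snippet of ours (variable indices are 0-based, so child var3 is index 2):

    data = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
    DuS = {(r[0], r[2]) for r in data}            # D_u^{S u {i}}, S = {var1}
    DuT = {(r[0], r[1], r[2]) for r in data}      # D_u^{T u {i}}, T = {var1, var2}
    assert len(DuS) == 3 and len(DuT) == 4        # first part of Lemma 1
    assert len({r for r in DuS if r[0] == 1}) == 2        # compatible with j_S
    assert len({r for r in DuT if r[:2] == (1, 1)}) == 2  # compatible with j_T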
Theorem 1 (Bound f [7, 10]). Let $S \subseteq V^{\setminus i}$, $j \in D_u^S$, and let $f(S, j) = -|D_u^{S \cup \{i\}}(j)| \log q(i)$. Then
$\mathrm{LLBDeu}(S, j) \le f(S, j)$. Moreover, if $\mathrm{LBDeu}(S') \ge \sum_{j \in D_u^S} f(S, j) = f(S)$ for some $S' \subset S$,
then all $T \supseteq S$ are not in $L_i$.
This means we compute the number of non-zero counts per instantiation, $|D_u^{S \cup \{i\}}(j)|$, and we 'gain'
$\log q(i)$ for each of them. Note that $f(S) = -|D_u^{S \cup \{i\}}| \log q(i)$, which by Lemma 1 is monotonically
non-increasing over expansions of the parent set $S$. Hence $f(S)$ is not only an upper bound
on $\mathrm{LBDeu}(S)$ but also on $\mathrm{LBDeu}(T)$ for every $T \supseteq S$. Bound f is cheap to compute but is unfortunately
too loose. We derive much tighter upper bounds which are actually bounds on these llBs;
an upper bound for a local BDeu score is then obtained by simple addition, as just described. We
will derive an upper bound on $\mathrm{LLBDeu}(S, j)$ (where $n_j > 0$) by considering instantiation counts for
the full parent set $V^{\setminus i}$, the parent set which includes all possible parents for child $i$. We call these full
instantiation counts. Evidently, the number of full parent instantiations $q(V^{\setminus i})$ grows exponentially
with $|V|$, but it is linear in $|D|$ when we consider only the unique elements $D_u^{V^{\setminus i}}$.
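For illustration, Bound f can be computed in a single pass over the data; a sketch under the same conventions as before (bound_f is our name):

    import math

    def bound_f(data, i, S, q):
        # f(S) = -|D_u^{S u {i}}| log q(i): we 'gain' log q(i) once per
        # unique (parent configuration, child value) pair in the data.
        unique = {tuple(row[s] for s in S) + (row[i],) for row in data}
        return -len(unique) * math.log(q[i])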

4 Exploiting Gamma function properties


First, we extend the current state-of-the-art upper bound of Theorem 1 by exploiting some properties
of the Gamma function. For that, we need some intermediate results, where we assume $\alpha > 0$.

Lemma 2. Let $x$ be a positive integer. Then $\Gamma_\alpha(0) = 1$ and $\log \Gamma_\alpha(x) = \sum_{\ell=0}^{x-1} \log(\ell + \alpha)$.

Lemma 3. For $x$ a positive integer and $v \ge 1$, it holds that $\log\big(\Gamma_\alpha(x)/\Gamma_{\alpha/v}(x)\big) \ge \log v$.

Lemma 4. Let $x, y$ be non-negative integers with $x + y > 0$. Then

$$\Gamma_\alpha(x+y) = \Gamma_\alpha(x)\,\Gamma_\alpha(y) \;\text{ if } x \cdot y = 0, \qquad \Gamma_\alpha(x+y) \ge \Gamma_\alpha(x)\,\Gamma_\alpha(y)\,(1 + y/\alpha) \;\text{ otherwise.}$$

Corollary 1. Let $x_1, \dots, x_k$ be a list of non-negative integers in decreasing order with $x_1 > 0$. Then

$$\Gamma_\alpha\Big(\sum_{l=1}^{k} x_l\Big) \ge \prod_{l=1}^{k} \Gamma_\alpha(x_l) \prod_{l=1}^{k'-1} (1 + x_l/\alpha),$$

where $k' \le k$ is the index of the last positive integer in the list (the second product disappears if $k' = 1$).

Lemma 5. For $S \subseteq V^{\setminus i}$ and $j \in D_u^S$, assume that $\vec n_j = (n_{j,k})_{k \in c(i)}$ is in decreasing order over
$k = 1, \dots, q(i)$ (this is without loss of generality, since we can name and process the states in any order).
Then for any $\alpha \ge \alpha_j = \alpha_{\mathrm{ess}}/q(S)$, we have

$$\mathrm{LLBDeu}(S, j) \le f(S, j) + g(S, j, \alpha), \quad \text{with} \quad g(S, j, \alpha) = -\sum_{l=1}^{k'-1} \log(1 + n_{j,l}/\alpha),$$

where $k' \le q(i)$ is the largest index such that $n_{j,k'} > 0$.


The difference here is the summation coming from the gap in the super-multiplicativity of $\Gamma$ (Lemma 4
and Corollary 1). That extra term gives us a tighter bound on $\mathrm{LLBDeu}(S, j)$, but $f(S) + \sum_{j \in D_u^S} g(S, j, \alpha)$
is no longer monotonic over expansions of $S$ (though it is monotone in $\alpha$). Hence this quantity
is not an upper bound on $\mathrm{LBDeu}(T)$ for every $T \supseteq S$, and we need further results on $g(S, j, \alpha)$.
Lemma 6. For $S \subseteq T \subseteq V^{\setminus i}$, $j_T \in D_u^T$, and $j_S \in D_u^S$ with $j_T^S = j_S$, we have $f(T, j_T) \ge f(S, j_S)$ and
$g(T, j_T, \alpha) \ge g(S, j_S, \alpha)$.

Theorem 2 (Bound g). Let $S \subseteq V^{\setminus i}$, $j_S \in D_u^S$, and $\hat g(S, j_S) = \min_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} g(V^{\setminus i}, j, \alpha_{\mathrm{ess}}/q(S))$.
Then $\mathrm{LLBDeu}(S, j_S) \le f(S, j_S) + \hat g(S, j_S)$. Also, if $\mathrm{LBDeu}(S') \ge f(S) + \sum_{j_S \in D_u^S} \hat g(S, j_S) = g(S)$
for some $S' \subset S$, then all $T \supseteq S$ are not in $L_i$.

Proof. First we prove that $f(S, j_S) + \hat g(S, j_S)$ is an upper bound for $\mathrm{LLBDeu}(S, j_S)$. From
Lemma 6, if we take any instantiation of the fully expanded parent set, $j \in D_u^{V^{\setminus i}}$ with $j^S = j_S$,
we have that $g(S, j_S, \alpha) \le g(V^{\setminus i}, j, \alpha)$ for any $\alpha$. As Lemma 6 is valid for every full
instantiation $j$, we can take the minimum over them to get the tightest bound. From Lemma 5,
$\mathrm{LLBDeu}(S, j_S) \le f(S, j_S) + \hat g(S, j_S)$. Now, if we sum all the llBs, we obtain the second part
of the theorem for $S$. Finally, we need to show that this second part holds for any $T \supset S$, which
follows from $f(T) \le f(S)$ (as the total number of non-zero counts only increases, by Lemma 1)
and

$$\sum_{j_T \in D_u^T} \hat g(T, j_T) = \sum_{j_S \in D_u^S} \Big(\sum_{j_T \in D_u^T :\, j_T^S = j_S} \hat g(T, j_T)\Big) \le \sum_{j_S \in D_u^S} \hat g(S, j_S).$$

The latter holds because $\hat g(T, j_T) \le 0$ and, with $j_T^S = j_S$, at least one term $\hat g(T, j_T)$ is no larger than $\hat g(S, j_S)$,
as their minimisations span the same full instantiations (and $g(\cdot, \cdot, \alpha)$ is non-decreasing in $\alpha$).
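A sketch of Bound g under the same conventions, grouping the full-instantiation counts once and minimising over the full instantiations compatible with each $j_S$; it reuses bound_f from the earlier sketch, and all names are again our own:

    import math
    from collections import Counter, defaultdict

    def full_counts(data, i, V):
        # Child-count vectors (n_{j,k})_k for every full instantiation j
        # of V \ {i} that occurs in the data.
        parents = [v for v in V if v != i]
        n_jk = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
        by_j = defaultdict(list)
        for (j, _), n in n_jk.items():
            by_j[j].append(n)
        return parents, by_j

    def g_term(counts, alpha):
        # g(., j, alpha) = -sum_{l=1}^{k'-1} log(1 + n_{j,l}/alpha), i.e. over
        # all but the smallest of the decreasingly sorted positive counts.
        counts = sorted(counts, reverse=True)
        return -sum(math.log(1 + n / alpha) for n in counts[:-1])

    def bound_g(data, i, S, V, q, a_ess=1.0):
        alpha = a_ess / math.prod(q[s] for s in S)    # alpha_ess / q(S)
        parents, by_j = full_counts(data, i, V)
        proj = [parents.index(s) for s in S]
        best = {}
        for j, counts in by_j.items():
            jS = tuple(j[p] for p in proj)
            best[jS] = min(best.get(jS, 0.0), g_term(counts, alpha))
        return bound_f(data, i, S, q) + sum(best.values())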

5 Exploiting the likelihood function

Bound g (Theorem 2) was based on the best full instantiation $j \in D_u^{V^{\setminus i}}$ that is compatible with an
llB of the parent set $S$. Knowing that function $g$ is monotonic over parent set sizes, we could look at an
instantiation of the fully extended parent set to derive a bound for the llB of $S$ and all its supersets.
Even though the results are valid for every full instantiation, we can only compute Bound g using
one of them at a time. The new bound of this section comes from the realisation that it is possible to
exploit all full instantiations to derive a valid bound on the llB of $S$. For that purpose, we need some
properties of inferences with the Dirichlet-multinomial distribution and conjugacy.

The BDeu score is simply the log marginal probability of the observed data given suitably chosen
Dirichlet priors over the parameters of a BN structure. Consequently, llBs are intimately connected
to the Dirichlet-multinomial conjugacy. Given a Dirichlet prior $\vec\alpha_j = (\alpha_{j,1}, \dots, \alpha_{j,q(i)})$, the probability
of observing data $D_{\vec n_j}$ with counts $\vec n_j = (n_{j,1}, \dots, n_{j,q(i)})$ is

$$\log \Pr(D_{\vec n_j} \mid \vec\alpha_j) = \log \int_p \Pr(D_{\vec n_j} \mid p)\, \Pr(p \mid \vec\alpha_j)\, dp,$$

where the first distribution under the integral is multinomial and the second is Dirichlet. Note that

$$\log \int_p \Pr(D_{\vec n_j} \mid p)\, \Pr(p \mid \vec\alpha_j)\, dp \le \max_p \log \Pr(D_{\vec n_j} \mid p), \qquad (1)$$

since $\int_p \Pr(p \mid \vec\alpha_j)\, dp = 1$. Note also that llBs are not the probability of observing sufficient statistics
counts, but of a particular dataset; that is, there is no multinomial coefficient that would account for
all the permutations yielding the same sufficient statistics. Therefore, we may devise a bound based
on the maximum (log-)likelihood estimation.
Lemma 7. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^S$. Then $\mathrm{LLBDeu}(S, j) \le \mathrm{ML}(\vec n_j)$, where
$\mathrm{ML}(\vec n_j) = \sum_{k \in c(i)} n_{j,k} \log(n_{j,k}/n_j)$ (with $0 \log 0 = 0$).

Corollary 2. Let $S \subseteq V^{\setminus i}$ and $j_S \in D_u^S$. Then $\mathrm{LLBDeu}(S, j_S) \le \sum_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} \mathrm{ML}(\vec n_j)$.
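Summed over all $j_S$, the bound of Corollary 2 on $\mathrm{LBDeu}(S)$ is simply the sum of the ML terms of all full instantiations; a sketch reusing full_counts from the earlier sketch:

    import math

    def ml_term(counts):
        # ML(n_j) = sum_k n_{j,k} log(n_{j,k} / n_j), with 0 log 0 = 0.
        n = sum(counts)
        return sum(c * math.log(c / n) for c in counts if c > 0)

    def bound_ml(data, i, V):
        # Note this aggregate is independent of S: the compatible groups of
        # full instantiations partition D_u^{V \ i} for every parent set S.
        _, by_j = full_counts(data, i, V)
        return sum(ml_term(counts) for counts in by_j.values())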

We can improve further on this bound of Corollary 2 by considering llBs as a function $h$ of $\alpha$ for
fixed $\vec n_j$, since we can study and exploit the shape of their curves:

$$h_{\vec n_j}(\alpha) = -\log \Gamma_\alpha(n_j) + \sum_{k \in c(i)} \log \Gamma_{\alpha/q(i)}(n_{j,k}).$$

Lemma 8. If $\nexists k : n_{j,k} = n_j$, then $h_{\vec n_j}$ is a concave function for positive $\alpha \le 1$.


The concavity of $h_{\vec n_j}$ is useful for the following reason.

Lemma 9. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^{V^{\setminus i}}$ such that $\nexists k : n_{j,k} = n_j$. If $\alpha \le q(S)$ and $\frac{\partial h_{\vec n_j}}{\partial \alpha}(\alpha/q(S))$ is
non-negative, then $h_{\vec n_j}(\alpha/q(T)) \le h_{\vec n_j}(\alpha/q(S))$ for every $T \supseteq S$.
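Numerically, the curve $h_{\vec n_j}$ and the derivative test used in Lemma 9 are straightforward to evaluate; a sketch using the digamma function (helper names are our own):

    from scipy.special import gammaln, digamma

    def h_curve(counts, alpha, qi):
        # h_{n_j}(a) = -log Gamma_a(n_j) + sum_k log Gamma_{a/q(i)}(n_{j,k})
        nj = sum(counts)
        return (-(gammaln(nj + alpha) - gammaln(alpha))
                + sum(gammaln(n + alpha / qi) - gammaln(alpha / qi)
                      for n in counts))

    def h_derivative(counts, alpha, qi):
        # Uses d/da log Gamma_a(x) = digamma(x + a) - digamma(a); the chain
        # rule contributes the extra factor 1/q(i) in the second sum.
        nj = sum(counts)
        return (-(digamma(nj + alpha) - digamma(alpha))
                + sum((digamma(n + alpha / qi) - digamma(alpha / qi)) / qi
                      for n in counts))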
The final step to improve the bound is to consider any score for a parent set as a function of the
(log-)probabilities over full mass functions.

Theorem 3. Let $S \subseteq V^{\setminus i}$ and $j_S \in D_u^S$. Then

$$\mathrm{LLBDeu}(S, j_S) \le \log \Pr(D_{\vec n_{j^\star}} \mid \vec\alpha_{j_S}) + \sum_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S,\; j \ne j^\star} \mathrm{ML}(\vec n_j), \quad \text{where} \quad j^\star = \arg\min_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} \log \Pr(D_{\vec n_j} \mid \vec\alpha_{j_S}).$$

Proof. We rewrite $n_{j_S,k}$ as a sum of counts from full mass functions: $n_{j_S,k} = \sum_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} n_{j,k}$.
Thus, $\mathrm{LLBDeu}(S, j_S)$ is the log probability $\log \Pr(D_{\vec n_{j_S}} \mid \vec\alpha_{j_S})$ of observing a
data sequence with counts $\vec n_{j_S} = \big(\sum_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} n_{j,k}\big)_{k \in c(i)}$ under the Dirichlet-multinomial with
parameter vector $\vec\alpha_{j_S}$. Assume an arbitrary order for the full mass functions related to elements in
$\{j \in D_u^{V^{\setminus i}} : j^S = j_S\}$ and name them $j_1, \dots, j_w$, with $w = |\{j \in D_u^{V^{\setminus i}} : j^S = j_S\}|$. Exploiting
the Dirichlet-multinomial conjugacy, we can express this probability as a product of conditional
probabilities:

$$\Pr(D_{\vec n_{j_S}} \mid \vec\alpha_{j_S}) = \prod_{\ell=1}^{w} \Pr\Big(D_{\vec n_{j_\ell}} \,\Big|\, \sum_{t=1}^{\ell-1} \vec n_{j_t} + \vec\alpha_{j_S}\Big),$$

$$\mathrm{LLBDeu}(S, j_S) = \sum_{\ell=1}^{w} \log \Pr\Big(D_{\vec n_{j_\ell}} \,\Big|\, \sum_{t=1}^{\ell-1} \vec n_{j_t} + \vec\alpha_{j_S}\Big) \le \log \Pr(D_{\vec n_{j_1}} \mid \vec\alpha_{j_S}) + \sum_{t=2}^{w} \mathrm{ML}(\vec n_{j_t}).$$

These inequalities are obtained by applying Expression (1) to all but the first term. Since the choice of the order
is arbitrary, we can choose it in our best interest, and the theorem follows.

While the bound of Theorem 3 is valid for $S$, it gives no assurances about the supersets of $S$, so it is of
little direct use (if we need to compute it for every $T \supset S$, then it is better to compute the scores
themselves). To address that, we replace the first term of the right-hand side with a proper
upper bound.
Theorem 4 (Bound h). Let $S \subseteq V^{\setminus i}$, $\alpha = \alpha_{\mathrm{ess}}/q(S)$, and $j_S \in D_u^S$. Define $\tilde h_{\vec n_j}(\alpha) = h_{\vec n_j}(\alpha)$ if $\alpha \le 1$ and
$\frac{\partial h_{\vec n_j}}{\partial \alpha}(\alpha) \ge 0$, and $\tilde h_{\vec n_j}(\alpha) = 0$ otherwise. Let

$$\hat h(S, j_S) = \min_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} \Big(-\mathrm{ML}(\vec n_j) + \min\big\{\mathrm{ML}(\vec n_j);\; f(V^{\setminus i}, j) + g(V^{\setminus i}, j, \alpha);\; \tilde h_{\vec n_j}(\alpha)\big\}\Big) + \sum_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} \mathrm{ML}(\vec n_j). \qquad (2)$$

Then $\mathrm{LLBDeu}(S, j_S) \le \hat h(S, j_S)$. Moreover, if $\mathrm{LBDeu}(S') \ge \sum_{j_S \in D_u^S} \hat h(S, j_S) = h(S)$ for some
$S' \subset S$, then $S$ and all its supersets are not in $L_i$.

Proof. For the parent set $S$, the bound based on $\mathrm{ML}(\vec n_j)$ (that is, using the first option in the inner
minimisation) is valid by Corollary 2. The other two options make use of Theorem 3 and their own
results: the bound based on $f(V^{\setminus i}, j) + g(V^{\setminus i}, j, \alpha)$ is valid by Lemma 6, while the bound based on $\tilde h_{\vec n_j}(\alpha)$
comes from Lemma 9, and thus the result holds for $S$. Take $T \supset S$. It is straightforward that

$$\mathrm{LBDeu}(T) \le \sum_{j_T \in D_u^T} \hat h(T, j_T) = \sum_{j_S \in D_u^S} \Big(\sum_{j_T \in D_u^T :\, j_T^S = j_S} \hat h(T, j_T)\Big) \le \sum_{j_S \in D_u^S} \hat h(S, j_S),$$

since $\sum_{j_T \in D_u^T :\, j_T^S = j_S} \hat h(T, j_T) \le \hat h(S, j_S)$: both sides run over the same full instantiations,
the right-hand side uses the tighter minimisation of Expression (2) only once while the left-hand
side may use that tighter minimisation once for every $j_T$, and Lemmas 6 and 9 ensure that the computed
values $f(V^{\setminus i}, j) + g(V^{\setminus i}, j, \alpha)$ and $\tilde h_{\vec n_j}(\alpha)$ remain valid for $T$.

We point out that the mathematical results may seem harder to use in practice than they actually are.
Computing $g(S)$ and $h(S)$ to prune a parent set $S$ and all its supersets can be done in linear time,
since one pass through the data is enough to collect and process all required counts (AD-trees [18]
can be used to get even greater speedups). Since the computation of a score already takes linear time
in the number of data samples, we have cheap bounds which are provably superior to the current
state-of-the-art pruning for BDeu. Finally, we also point out that bounds g and h prune the search
space differently, as their independent theoretical derivations suggest. Therefore, we combine both
to get a tighter bound, which we call $\mathrm{C4} = \min\{g;\, h\}$. Their differences are illustrated in the sequel.
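Putting the pieces together, here is a sketch of Bound h and of C4, reusing full_counts, g_term, ml_term and bound_g from the earlier sketches. For brevity we replace $\tilde h_{\vec n_j}(\alpha)$ by 0, which is exactly its value whenever the conditions of Theorem 4 fail, and remains a valid (if slightly looser) third option otherwise, since log-probabilities are never positive:

    import math
    from collections import defaultdict

    def bound_h(data, i, S, V, q, a_ess=1.0):
        alpha = a_ess / math.prod(q[s] for s in S)
        parents, by_j = full_counts(data, i, V)
        proj = [parents.index(s) for s in S]
        ml_sum = defaultdict(float)   # sum of ML terms per j_S
        gain = defaultdict(float)     # min over j of -ML + min{ML, f+g, 0}
        for j, counts in by_j.items():
            jS = tuple(j[p] for p in proj)
            ml = ml_term(counts)
            fg = -len(counts) * math.log(q[i]) + g_term(counts, alpha)
            ml_sum[jS] += ml
            gain[jS] = min(gain[jS], -ml + min(ml, fg, 0.0))
        return sum(ml_sum[jS] + gain[jS] for jS in ml_sum)

    def c4(data, i, S, V, q, a_ess=1.0):
        # C4 = min{g(S), h(S)}
        return min(bound_g(data, i, S, V, q, a_ess),
                   bound_h(data, i, S, V, q, a_ess))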

6 Experiments
To analyse the empirical gains of the new bounds, we computed the list of candidate parent sets for
each variable in multiple UCI datasets [13]. In all experiments, we set $\alpha_{\mathrm{ess}} = 1$ and discretise all
continuous variables by their median value. To provide an idea of the processing time, small datasets
($n \le 10$) took less than a few minutes to complete, while larger ones ($n \ge 20$) took around one day
per variable (if using a single modern core). The main method is presented in Algorithm 1 (a runnable
sketch follows the pseudocode below). Parent sets are explored in order of size (outermost loop), and
for each (non-pruned) parent set $S$, we verify that no subset of $S$ is better than $S$ itself before including
it in the resulting set, and then we expand $S$ by adding an extra parent, so long as the pruning criterion
is not met. The algorithm is presented in simplified terms: it is possible to cache most of the results
to speed up computations.

Algorithm 1 Parent Set Identification

Input: $(i, V, D, \text{in-}d)$. Output: $L_i$.
$L_i \leftarrow \{\emptyset\}$, $\overline{L}_i \leftarrow \{\emptyset\}$, $d \leftarrow 0$.
$b(L_i, T) = \max_{S \in L_i :\, S \subset T} \mathrm{LBDeu}(S)$.
while $d \le \text{in-}d$ do
    for $S \in \overline{L}_i : |S| = d$ do
        $L_i \leftarrow L_i \cup \{S\}$ if $\mathrm{LBDeu}(S) > b(L_i, S)$.
        $\overline{L}_i \leftarrow \overline{L}_i \cup \{S \cup \{t\} : (t \in V^{\setminus i} \setminus S) \wedge (b(L_i, S \cup \{t\}) < \mathrm{C4}(S \cup \{t\}))\}$ if $d < \text{in-}d$.
    end for
    $d \leftarrow d + 1$
end while
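A runnable Python sketch of Algorithm 1, using lbdeu and c4 from the earlier sketches (a simplified illustration of the pseudocode above, not the authors' implementation; in particular, no caching is done):

    def parent_set_identification(data, i, V, q, in_d, a_ess=1.0):
        L_i = {}             # accepted parent sets, mapped to their scores
        frontier = [()]      # non-pruned candidate sets of the current size d

        def b(T):
            # Best score among accepted strict subsets of T, i.e. b(L_i, T).
            return max((s for R, s in L_i.items() if set(R) < set(T)),
                       default=float("-inf"))

        for d in range(in_d + 1):
            next_frontier = set()
            for S in frontier:
                score = lbdeu(data, i, S, q, a_ess)
                if score > b(S):
                    L_i[S] = score
                if d < in_d:
                    for t in set(V) - {i} - set(S):
                        T = tuple(sorted(S + (t,)))
                        if b(T) < c4(data, i, T, V, q, a_ess):
                            next_frontier.add(T)
            frontier = sorted(next_frontier)
        return L_i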

For small datasets, it is feasible to score every candidate parent set, so that we can compare how far
the upper bounds for a given parent set $S$ (and all its supersets) are from the true best score among
$S$ itself and all its supersets. Figure 1 shows such a comparison for the variable Standard-of-living-index in
the cmc dataset. It is clear that the new bound $\mathrm{C4} = \min\{g; h\}$ is much tighter than the current best
bound in the literature (here called f) and improves considerably towards the true best score.

[Figure 1 plots, for each candidate parent set, the upper bound values of f and C4 together with the true best score, against the cardinality of the candidate parent sets.]

Figure 1: Upper bound values for each candidate parent set for variable Standard-of-living-index in
the cmc dataset. Parent sets are arbitrarily ordered within the same cardinality.

The practical benefits of the new bounds are best observed when comparing the number of scores
computed to construct $L = \bigcup_{i \in V} \overline{L}_i$ for each dataset. In Figure 2, we see that the previously
available bound (orange-square curves) is indeed loose, as the number of scores computed is often
close to the size of the entire search space (green-diamond curves). Conversely, each of the new
bounds (g and h) often reduces the computational cost by more than half with respect to f's. It
is also worth noticing that bound h does not always dominate g, or vice versa. For instance, for the
datasets zoo and heart-h, h was more effective, while g was more active in the remaining datasets of
Figure 2. That justifies combining g and h into C4.
We also ran Algorithm 1 on the UCI datasets presented in Table 2 with the maximum in-degree as
defined there. The size of the search space (for all variables in the dataset) is also shown, together
with the number of pruned cases. The results in Table 2 show that the number of computations pruned
with bound C4 is up to an order of magnitude higher in comparison to bound f. An interesting result
was obtained for the diabetes dataset, where pruning takes place for BDeu but failed to happen for
the BIC score [11], which is understood as having stronger pruning available.

[Figure 2 contains six panels (heart-h, zoo, primary-tumor, solar-flare, segment, lymph), each plotting the number of scores computed against the maximum number of parents for bounds f, g, h, C4, and the size of the full search space.]

Figure 2: Number of scores computed per maximum number of parents with different bounds.

Table 2: Number of computations pruned ($|L^c| = |\text{search space}| - |L|$) with each bound: f, g, h
and C4. Each dataset is characterised by its number of variables and observations, $n$ and $N$, and the
number of all possible parent combinations |search space|. in-d is the maximum imposed in-degree
(∞ means no in-degree constraint), and |g<h|/|h<g| is the ratio of the number of times bound g was
active over h within C4 (∞ means h was never the tighter of the two).

Dataset        n   N       |search space|  in-d  |Lc_f|       |Lc_g|       |Lc_h|       |Lc_C4|      |g<h|/|h<g|
diabetes       9   768     2,304           5     0            0            0            0            ∞
                                           7     0            72           184          184          123.176
                                           ∞     0            81           193          193          123.176
nursery        9   12,960  2,304           5     0            0            342          342          6.176
                                           7     8            188          626          630          5.331
                                           ∞     16           197          635          639          5.331
cmc            10  1,473   5,120           5     0            23           35           47           7.556
                                           7     6            746          766          828          6.045
                                           ∞     16           846          866          928          6.045
heart-h        12  294     24,576          5     0            252          86           252          1.434
                                           8     1,109        7,469        6,090        7,504        1.019
                                           ∞     1,321        8,273        6,894        8,308        1.019
solar-flare    12  1,066   24,576          5     884          1,810        2,170        2,462        2.799
                                           8     3,741        8,890        10,561       11,043       2.885
                                           ∞     4,043        9,672        11,348       11,834       2.877
vowel          14  990     1.147·10^5      5     1,564        1,833        1,707        1,837        0.182
                                           9     7,579        22,701       21,962       22,729       6.152·10^-2
                                           ∞     8,350        26,298       25,411       26,330       6.162·10^-2
zoo            17  101     1.114·10^6      5     7,760        18,026       18,818       20,604       0.252
                                           11    3.782·10^5   7.834·10^5   7.483·10^5   7.925·10^5   0.104
                                           ∞     4.054·10^5   8.262·10^5   7.91·10^5    8.353·10^5   0.104
vote           17  435     1.114·10^6      5     0            0            0            0            0.919
                                           11    40,544       2.594·10^5   2.126·10^5   2.776·10^5   0.414
                                           ∞     55,067       3.021·10^5   2.552·10^5   3.203·10^5   0.414
segment        17  2,310   1.114·10^6      5     0            0            0            0            0.184
                                           11    39,948       2.229·10^5   2.915·10^5   2.915·10^5   0.256
                                           ∞     51,902       2.614·10^5   3.317·10^5   3.317·10^5   0.256
pendigits      17  10,992  9.83·10^5       5     0            0            0            0            0
                                           11    0            2,386        47,757       41,619       4.143·10^-2
                                           ∞     0            17,445       76,982       70,321       4.383·10^-2
lymph          18  148     2.359·10^6      5     7,295        8,344        6,076        8,344        12,375.846
                                           11    1.02·10^6    1.237·10^6   9.489·10^5   1.237·10^6   73,280.769
                                           ∞     1.182·10^6   1.406·10^6   1.114·10^6   1.406·10^6   73,327.538
primary-tumor  18  339     2.359·10^6      5     2,460        3,555        2,667        3,555        5.465·10^-2
                                           11    4.292·10^5   8.572·10^5   6.751·10^5   8.572·10^5   6.856·10^-3
                                           ∞     5.105·10^5   1.015·10^6   8.223·10^5   1.015·10^6   6.797·10^-3
vehicle        19  846     4.981·10^6      5     0            108          54           108          1.197
                                           12    6.614·10^5   2.082·10^6   1.848·10^6   2.12·10^6    0.474
                                           ∞     7.582·10^5   2.319·10^6   2.084·10^6   2.358·10^6   0.473
hepatitis      20  155     7.864·10^6      5     0            0            0            0            397.196
                                           12    2.155·10^6   3.341·10^6   2.338·10^6   3.341·10^6   8,198.164
                                           ∞     2.599·10^6   3.795·10^6   2.905·10^6   3.795·10^6   6,100.199
colic          23  368     9.647·10^7      5     1,170        2,415        934          2,415        ∞
                                           14    2.116·10^7   2.122·10^7   2.042·10^7   2.122·10^7   ∞
                                           ∞     2.277·10^7   2.284·10^7   2.203·10^7   2.284·10^7   ∞
autos          26  205     8.724·10^8      5     1.388·10^5   1.829·10^5   1.544·10^5   1.829·10^5   ∞
                                           15    1.265·10^8   1.265·10^8   1.258·10^8   1.265·10^8   45,904.333
                                           ∞     1.432·10^8   1.432·10^8   1.425·10^8   1.432·10^8   45,904.333
flags          29  194     7.785·10^9      5     2.782·10^5   2.834·10^5   1.275·10^5   2.834·10^5   ∞
                                           17    1.085·10^9   1.085·10^9   1.083·10^9   1.085·10^9   ∞
                                           ∞     1.196·10^9   1.196·10^9   1.194·10^9   1.196·10^9   ∞

7 Conclusions
We have devised new theoretical bounds for learning Bayesian networks with the BDeu score. These
bounds come from analysing the score function from multiple angles and provide significant benefits
in reducing the search space of parent sets for each node of the network. Empirical results with
multiple UCI datasets illustrate the benefits that can be achieved in practice with the theoretical
bounds. In particular, the new bounds allow us to explore the whole search space of parent sets
using BDeu more efficiently without imposing bounds on the maximum in-degree, which was
previously a major bottleneck for domains beyond a few dozen variables.
As future work, tighter bounds may be possible by replacing the maximum likelihood estimation
terms in the formulas, as well as by using different search orders for exploring the space of parent
sets, which could benefit even further from these bounds. In particular, if one were to run a branch-and-bound
approach to explore the parent sets of a node, it would be possible to use these bounds
more effectively by considering not only the parent sets and corresponding full instantiations but
also the partial instantiations that are formed by disallowing some variables to be parents in some of the
branches. The mathematical details to realise such ideas, as well as an improved implementation of
our bounds using sophisticated tailored data structures, are natural next steps in this research.

References
[1] Mark Bartlett and James Cussens. Integer linear programming for the Bayesian network structure
learning problem. Artificial Intelligence, 244:258–271, 2017.
[2] Eunice Yuh-Jie Chen, Yujia Shen, Arthur Choi, and Adnan Darwiche. Learning Bayesian
networks with ancestral constraints. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,
and R. Garnett, editors, Advances in Neural Information Processing Systems 29 (NIPS), pages
2325–2333. Curran Associates, Inc., 2016.
[3] David M. Chickering, David Heckerman, and Christopher Meek. Large-sample learning of
Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, October 2004.
[4] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning, 9:309–347, 1992.
[5] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th
Conference on Uncertainty in Artificial Intelligence (UAI), pages 153–160. AUAI Press, 2011.
[6] James Cussens. An upper bound for BDeu local scores. In Proc. ECAI-2012 Workshop on
Algorithmic Issues for Inference in Graphical Models (AIGM), 2012.
[7] James Cussens and Mark Bartlett. GOBNILP 1.6.2 User/Developer Manual, 2015.
[8] James Cussens, Mark Bartlett, Elinor M. Jones, and Nuala A. Sheehan. Maximum likelihood
pedigree reconstruction using integer linear programming. Genetic Epidemiology, 37(1):69–83, 2013.
[9] Cassio de Campos and Qiang Ji. Properties of Bayesian Dirichlet scores to learn Bayesian
network structures. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages
431–436, 2010.
[10] Cassio de Campos and Qiang Ji. Efficient structure learning of Bayesian networks using
constraints. Journal of Machine Learning Research, 12:663–689, 2011.
[11] Cassio de Campos, Mauro Scanagatta, Giorgio Corani, and Marco Zaffalon. Entropy-based
pruning for learning Bayesian networks using BIC. Artificial Intelligence, 260(C):42–50, 2018.
[12] Cassio de Campos, Zhi Zeng, and Qiang Ji. Structure learning of Bayesian networks using
constraints. In Proc. of the 26th International Conference on Machine Learning (ICML), volume
382, pages 113–120. ACM, 2009.
[13] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[14] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[15] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian
network structure using LP relaxations. In Proceedings of the 13th International Conference on
Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 358–365, 2010. Journal of
Machine Learning Research Workshop and Conference Proceedings.
[16] Mikko Koivisto. Parent assignment is hard for the MDL, AIC, and NML costs. In Computational
Learning Theory (COLT), volume 4005, pages 289–303. Springer, 2006.
[17] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks.
Journal of Machine Learning Research, 5:549–573, 2004.
[18] Andrew Moore and Mary Soon Lee. Cached sufficient statistics for efficient machine learning
with large datasets. Journal of Artificial Intelligence Research, 8:67–91, 1998.
[19] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[20] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm
for learning Bayesian networks. In Conference on Uncertainty in Artificial Intelligence (UAI),
pages 584–590, 2005.
[21] Changhe Yuan and Brandon Malone. An improved admissible heuristic for learning optimal
Bayesian networks. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence
(UAI), pages 924–933, Catalina Island, CA, 2012.
[22] Changhe Yuan and Brandon Malone. Learning optimal Bayesian networks: A shortest path
perspective. Journal of Artificial Intelligence Research, 48:23–65, October 2013.

Appendix A - Proofs
Lemma 1. For $S \subseteq T \subseteq V^{\setminus i}$, $j_S \in D_u^S$ and $j_T \in D_u^T$ with $j_T^S = j_S$, we have $|D_u^{T \cup \{i\}}| \ge |D_u^{S \cup \{i\}}|$ and $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$.

Proof. Given that $S \subseteq T \subseteq V^{\setminus i}$, every instantiation in $D_u^{S \cup \{i\}}$ is compatible with one or more
instantiations in $D_u^{T \cup \{i\}}$, and it follows that $|D_u^{T \cup \{i\}}| \ge |D_u^{S \cup \{i\}}|$. The relationship is reversed when we
consider the number of unique occurrences compatible with a given instantiation: since $j_T^S = j_S$,
the elements of $D_u^{T \cup \{i\}}(j_T)$ differ only in the value of child $i$ and each corresponds to a distinct
element of $D_u^{S \cup \{i\}}(j_S)$, so $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$. Note that both $|D_u^{T \cup \{i\}}(j_T)|$ and $|D_u^{S \cup \{i\}}(j_S)|$
are bounded by $q(i)$: one instantiation for each possible value child $i$ can assume.
Lemma 2. Let $x$ be a positive integer. Then $\Gamma_\alpha(0) = 1$ and $\log \Gamma_\alpha(x) = \sum_{\ell=0}^{x-1} \log(\ell + \alpha)$.

Proof. Follows from the definition of $\Gamma_\alpha$ and the identity $\Gamma(x + 1) = x\Gamma(x)$.
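As a quick numeric sanity check of the identity (a throwaway snippet of ours):

    import math
    from scipy.special import gammaln

    x, a = 7, 0.25
    lhs = gammaln(x + a) - gammaln(a)              # log Gamma_a(x)
    rhs = sum(math.log(l + a) for l in range(x))   # telescoping sum above
    assert abs(lhs - rhs) < 1e-9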


Lemma 3. For $x$ a positive integer and $v \ge 1$, it holds that $\log\big(\Gamma_\alpha(x)/\Gamma_{\alpha/v}(x)\big) \ge \log v$.

Proof. By applying Lemma 2, we obtain

$$\log\frac{\Gamma_\alpha(x)}{\Gamma_{\alpha/v}(x)} = \sum_{\ell=0}^{x-1} \log\frac{\ell+\alpha}{\ell+\alpha/v} = \log v + \sum_{\ell=1}^{x-1} \log\frac{\ell+\alpha}{\ell+\alpha/v} \ge \log v,$$

because each term of the remaining sum (if any) is non-negative.

Lemma 4. Let $x, y$ be non-negative integers with $x + y > 0$. Then

$$\Gamma_\alpha(x+y) = \Gamma_\alpha(x)\,\Gamma_\alpha(y) \;\text{ if } x \cdot y = 0, \qquad \Gamma_\alpha(x+y) \ge \Gamma_\alpha(x)\,\Gamma_\alpha(y)\,(1 + y/\alpha) \;\text{ otherwise.}$$

Proof. If either $x$ or $y$ is zero, its $\Gamma_\alpha$ term equals 1 and the equality holds. Otherwise we
apply Lemma 2 three times and manipulate the products:

$$\frac{\Gamma_\alpha(x+y)}{\Gamma_\alpha(x)\,\Gamma_\alpha(y)} = \frac{\prod_{z=0}^{x+y-1}(z+\alpha)}{\prod_{z=0}^{y-1}(z+\alpha)\,\prod_{z=0}^{x-1}(z+\alpha)} = \frac{\prod_{z=y}^{x+y-1}(z+\alpha)}{\prod_{z=0}^{x-1}(z+\alpha)} = \prod_{z=0}^{x-1}\frac{y+z+\alpha}{z+\alpha} \ge \frac{y+\alpha}{\alpha} = 1 + \frac{y}{\alpha}.$$

Lemma 5. For $S \subseteq V^{\setminus i}$ and $j \in D_u^S$, assume that $\vec n_j = (n_{j,k})_{k \in c(i)}$ is in decreasing order over
$k = 1, \dots, q(i)$ (this is without loss of generality, since we can name and process the states in any order).
Then for any $\alpha \ge \alpha_j = \alpha_{\mathrm{ess}}/q(S)$, we have

$$\mathrm{LLBDeu}(S, j) \le f(S, j) + g(S, j, \alpha), \quad \text{with} \quad g(S, j, \alpha) = -\sum_{l=1}^{k'-1} \log(1 + n_{j,l}/\alpha),$$

where $k'$ is the largest index such that $n_{j,k'} > 0$.

Proof. Since the counts $n_{j,k}$ are in decreasing order over $k$, we can apply Corollary 1 to
$\Gamma_{\alpha_j}(n_j) = \Gamma_{\alpha_j}\big(\sum_l n_{j,l}\big)$:

$$\mathrm{LLBDeu}(S, j) \le -\log\Big(\prod_{l=1}^{q(i)} \Gamma_{\alpha_j}(n_{j,l}) \prod_{l=1}^{k'-1}\big(1 + \tfrac{n_{j,l}}{\alpha_j}\big)\Big) + \sum_{k \in c(i)} \log \Gamma_{\alpha_{j,k}}(n_{j,k})$$

$$= \sum_{k \in c(i)} \log\frac{\Gamma_{\alpha_{j,k}}(n_{j,k})}{\Gamma_{\alpha_j}(n_{j,k})} - \sum_{l=1}^{k'-1}\log\big(1 + \tfrac{n_{j,l}}{\alpha_j}\big) \le -|D_u^{S \cup \{i\}}(j)|\log q(i) - \sum_{l=1}^{k'-1}\log\big(1 + \tfrac{n_{j,l}}{\alpha}\big),$$

where the last step uses $\alpha \ge \alpha_j$ and $\log\big(\Gamma_{\alpha_{j,k}}(n_{j,k})/\Gamma_{\alpha_j}(n_{j,k})\big) \le -\log q(i)$, which holds by
Lemma 3 (with $\alpha_{j,k} = \alpha_j/q(i)$) whenever $n_{j,k} > 0$, while the terms with $n_{j,k} = 0$ vanish.

Lemma 6. For $S \subseteq T \subseteq V^{\setminus i}$, $j_T \in D_u^T$, and $j_S \in D_u^S$ with $j_T^S = j_S$, we have $f(T, j_T) \ge f(S, j_S)$ and
$g(T, j_T, \alpha) \ge g(S, j_S, \alpha)$.

Proof. Because $j_T^S = j_S$, $|D_u^{T \cup \{i\}}(j_T)| \le |D_u^{S \cup \{i\}}(j_S)|$, which gives $f(T, j_T) \ge f(S, j_S)$. Moreover,
$n_{j_T,k} \le n_{j_S,k}$ for every $k \in c(i)$ (the counts get partitioned as more parents are introduced to go
from $S$ to $T$), so $(1 + n_{j_T,k}/\alpha) \le (1 + n_{j_S,k}/\alpha)$ for every $k$, and the result follows.

Lemma 7. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^S$. Then $\mathrm{LLBDeu}(S, j) \le \mathrm{ML}(\vec n_j)$, where
$\mathrm{ML}(\vec n_j) = \sum_{k \in c(i)} n_{j,k} \log(n_{j,k}/n_j)$ (with $0 \log 0 = 0$).

Proof. The llB is simply the log probability of observing a data sequence with counts $\vec n_j$ under a
Dirichlet-multinomial distribution with parameter vector $\vec\alpha_j$. The result follows from Expression (1)
and holds for any prior $\vec\alpha_j$.
Lemma 8. If $\nexists k : n_{j,k} = n_j$, then $h_{\vec n_j}$ is a concave function for positive $\alpha \le 1$.

Proof. Using the identity in Lemma 2, or, equivalently, by exploiting known properties of the
digamma and trigamma functions, we have

$$\frac{\partial h_{\vec n_j}}{\partial \alpha}(\alpha) = \sum_{k=1}^{q(i)} \sum_{\ell=0}^{n_{j,k}-1} \frac{1}{\ell q(i) + \alpha} - \sum_{\ell=0}^{n_j-1} \frac{1}{\ell + \alpha}$$

and

$$\frac{\partial^2 h_{\vec n_j}}{\partial \alpha^2}(\alpha) = \sum_{\ell=0}^{n_j-1} \frac{1}{(\ell + \alpha)^2} - \sum_{k=1}^{q(i)} \sum_{\ell=0}^{n_{j,k}-1} \frac{1}{(\ell q(i) + \alpha)^2}.$$

It suffices to show that $\frac{\partial^2 h_{\vec n_j}}{\partial \alpha^2}(\alpha)$ is always negative under the conditions of the lemma. Since no
single state holds all the mass, there are at least two $n_{j,k} > 0$, and each of them contributes a term
$-1/\alpha^2$ (its term with $\ell = 0$), so

$$\frac{\partial^2 h_{\vec n_j}}{\partial \alpha^2}(\alpha) \le \sum_{\ell=0}^{n_j-1} \frac{1}{(\ell + \alpha)^2} - \frac{2}{\alpha^2}$$

simply by ignoring all the other negative terms, with $\ell \ge 1$. Now we bound the remainder by the
infinite sum of quadratic reciprocals:

$$\sum_{\ell=0}^{n_j-1} \frac{1}{(\ell + \alpha)^2} - \frac{2}{\alpha^2} = -\frac{1}{\alpha^2} + \frac{1}{(1 + \alpha)^2} + \sum_{\ell=2}^{n_j-1} \frac{1}{(\ell + \alpha)^2} < -\frac{1}{\alpha^2} + \frac{1}{(1 + \alpha)^2} + \sum_{\ell=2}^{\infty} \frac{1}{\ell^2} = -\frac{1}{\alpha^2} + \frac{1}{(1 + \alpha)^2} + \frac{\pi^2}{6} - 1,$$

which is negative for any $\alpha \le 1$ (the gap between the two fractions containing $\alpha$ decreases as $\alpha$
increases, so it suffices to check the sign at the largest value $\alpha = 1$). Thus $\frac{\partial^2 h_{\vec n_j}}{\partial \alpha^2}(\alpha) < 0$.
Lemma 9. Let $S \subseteq V^{\setminus i}$ and $j \in D_u^{V^{\setminus i}}$ such that $\nexists k : n_{j,k} = n_j$. If $\alpha \le q(S)$ and $\frac{\partial h_{\vec n_j}}{\partial \alpha}(\alpha/q(S))$ is
non-negative, then $h_{\vec n_j}(\alpha/q(T)) \le h_{\vec n_j}(\alpha/q(S))$ for every $T \supseteq S$.

Proof. Since $\nexists k : n_{j,k} = n_j$ and $\alpha/q(S) \le 1$, $h_{\vec n_j}$ is concave on $(0, \alpha/q(S)]$ (Lemma 8), and since
$\frac{\partial h_{\vec n_j}}{\partial \alpha}(\alpha/q(S)) \ge 0$, $h_{\vec n_j}$ is non-decreasing on that interval. As $q(T) \ge q(S)$ implies
$\alpha/q(T) \le \alpha/q(S)$, the result follows.
Corollary 1. Let $x_1, \dots, x_k$ be a list of non-negative integers in decreasing order with $x_1 > 0$. Then

$$\Gamma_\alpha\Big(\sum_{l=1}^{k} x_l\Big) \ge \prod_{l=1}^{k} \Gamma_\alpha(x_l) \prod_{l=1}^{k'-1} (1 + x_l/\alpha),$$

where $k' \le k$ is the index of the last positive integer in the list (the second product disappears if $k' = 1$).

Proof. Repeatedly apply Lemma 4 to $\Gamma_\alpha\big(x_t + \sum_{l=t+1}^{k} x_l\big)$, for $t = 1, 2, \dots$, until all elements are
processed. While both the current $x_t$ and the rest of the list are positive (that is, until $t = k' - 1$),
we gain the extra term $(1 + x_t/\alpha)$. After that, we only 'collect' the Gamma functions, so the result
follows.
Corollary 2. Let $S \subseteq V^{\setminus i}$ and $j_S \in D_u^S$. Then $\mathrm{LLBDeu}(S, j_S) \le \sum_{j \in D_u^{V^{\setminus i}} :\, j^S = j_S} \mathrm{ML}(\vec n_j)$.

Proof. This follows from Lemma 7 and the properties of the maximum likelihood estimate, which is
monotonically non-decreasing with the expansion of parent sets (the maximum likelihood fit only
improves when more parents are added).

