    = ∫ θ_iJK · P(θ | S, D) dθ

The parametrization

    θ̂_ijk = (α_ijk + N_ijk) / (α_ij + N_ij)                (3)

will produce predictions equivalent to those obtained by averaging over all parameter configurations. We refer to θ̂_ijk as the standard parametrization.
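For concreteness, the following is a minimal sketch of computing the standard parametrization of Equation 3 from the Dirichlet hyperparameters and data counts. The nested-list layout and all identifiers are illustrative assumptions, not the paper's implementation.

    # Minimal sketch of the standard parametrization (Equation 3).
    # Assumes hyperparameters alpha[i][j][k] and data counts n[i][j][k] are
    # available for node i, parent configuration j, and value k; the layout
    # and names here are illustrative, not taken from the paper.

    def standard_parametrization(alpha, n):
        """Return theta_hat[i][j][k] = (alpha_ijk + N_ijk) / (alpha_ij + N_ij)."""
        theta_hat = []
        for alpha_i, n_i in zip(alpha, n):
            theta_i = []
            for alpha_ij, n_ij in zip(alpha_i, n_i):
                denom = sum(alpha_ij) + sum(n_ij)          # alpha_ij + N_ij
                theta_i.append([(a + c) / denom for a, c in zip(alpha_ij, n_ij)])
            theta_hat.append(theta_i)
        return theta_hat

    # Example: one binary node with a single (empty) parent configuration,
    # K2 prior (alpha_ijk = 1) and counts [3, 7] gives [[[0.333..., 0.666...]]].
    print(standard_parametrization([[[1, 1]]], [[[3, 7]]]))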
3.2 Averaging Structural Features with a Fixed Ordering

The decomposition by Buntine used by Friedman and Koller was a dynamic programming solution which calculated, with relative efficiency, the posterior probability of structural features such as P(X_L → X_M | D). Given Assumption 1 and Equation 6, Equation 5 can be written as:

    P(X_L → X_M | D) = κ · Σ_S Π_{i=0..N} ρ_iLM                (7)

where the ρ_iLM functions are given by:

    ρ_iLM = δ[M ≠ i ∨ X_L ∈ P_i] · p_s(X_i, P_i)
            · Π_{j=1..q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] · Π_{k=1..r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)]                (8)
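The Gamma-function products in Equation 8 are the familiar Dirichlet-multinomial local marginal-likelihood term for node X_i under one candidate parent set. Below is a minimal log-space sketch of that factor alone (the δ[...] and p_s(X_i, P_i) factors of Equation 8 are omitted); the count layout and names are illustrative assumptions.

    import math

    def log_local_score(alpha_i, n_i):
        """Log of  prod_j [ Gamma(a_ij)/Gamma(a_ij + N_ij)
                            * prod_k Gamma(a_ijk + N_ijk)/Gamma(a_ijk) ]
        for one node under one candidate parent set (the Gamma factors of
        Equation 8).  alpha_i[j][k] and n_i[j][k] index parent configuration j
        and node value k; the layout is an illustrative assumption."""
        log_score = 0.0
        for alpha_ij, n_ij in zip(alpha_i, n_i):
            a_ij, big_n_ij = sum(alpha_ij), sum(n_ij)
            log_score += math.lgamma(a_ij) - math.lgamma(a_ij + big_n_ij)
            for a, c in zip(alpha_ij, n_ij):
                log_score += math.lgamma(a + c) - math.lgamma(a)
        return log_score

    # K2 prior, one parent configuration, counts [3, 7]:
    print(log_local_score([[1, 1]], [[3, 7]]))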
Now define:

    Σ^LM_m ≡ ρ_00LM · ρ_01LM · … · ρ_0mLM
           + ρ_10LM · ρ_01LM · … · ρ_0mLM
           + ⋮
           + ρ_{μ_0}0LM · ρ_{μ_1}1LM · … · ρ_{μ_m}mLM

Using this notation, the following recursion relation can be derived:

    Σ^LM_i = Σ^LM_{i−1} · ( Σ_{ν=1..μ_i} ρ_νiLM ),        Σ^LM_{−1} = 1

Finally, expanding out the recurrence relation yields the expression for P(X_L → X_M | D):
    P(X_L → X_M | D) = κ · Π_{i=0..N} ( Σ_{ν=1..μ_i} ρ_νiLM )                (9)

Once the ρ_νiLM terms are calculated, the right-hand side of Equation 9 can be evaluated in O(N · C(N, k)) time, where C(N, k) denotes the binomial coefficient. Calculating the ρ terms themselves requires the calculation of one hyperparameter α_νijk and one sufficient statistic N_νijk for each parameter of the network, each of which can be calculated in O(N_r) time. Calculating the complete set for all (j, k) combinations thus requires O(C(N, k) · N · N_r · N_f^k) time and O(C(N, k) · N · N_f^k) space (because there are O(N_f^k) network parameters per node).

We have taken the trouble to subscript the indices J and K from Equation 1. Both are indexed with an x to indicate that they are fixed by the particular configuration of X, and J is indexed by S to emphasize that the value of the parent-configuration index depends on the structure of the network. Although this notation may seem cumbersome, we hope it clarifies the analysis later.

The ρ̂_iJ^S_xK_x functions again can be calculated using only information local to node X_i and P_i. Following a derivation identical to that for averaging P(X_L → X_M | D) in Section 3.2 yields the result:

    P(X = x | D) = κ · Π_{i=0..N} ( Σ_{ν=1..μ_i} ρ̂_νiJ^ν_xK_x )                (13)

Here the S index has been replaced with a ν indicating which parent set for node X_i is being considered. Once again, this summation can be performed in O(C(N, k) · N · N_r · N_f^k) time and O(C(N, k) · N · N_f^k) space.
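Equations 9 and 13 share the same product-of-per-node-sums form, so a single routine can evaluate either once the ρ (or ρ̂) terms are available. The sketch below works in log space to avoid underflow and leaves out the normalizing constant κ, which can be recovered afterwards by normalizing over the mutually exclusive alternatives (for example, the states of the class variable); the inputs and names are illustrative assumptions.

    import math

    def logsumexp(values):
        m = max(values)
        return m + math.log(sum(math.exp(v - m) for v in values))

    def log_unnormalized_posterior(log_rho_per_node):
        """Compute log( prod_i sum_nu rho_{nu,i} ), the product-of-sums form
        shared by Equations 9 and 13.  log_rho_per_node[i] is the list of
        log-rho values for the candidate parent sets of node i."""
        return sum(logsumexp(log_rhos) for log_rhos in log_rho_per_node)

    # Two nodes with two candidate parent sets each (illustrative log values):
    print(log_unnormalized_posterior([[-1.0, -2.0], [-0.5, -3.0]]))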
3.4 Approximate Model Averaging

The derivation in the preceding section is an extension of the concept underlying Buntine's dynamic programming solution. The functional form of the above solution allows us to easily prove the following theorem:

Theorem 2  There exists a single Bayesian network model M* = ⟨S*, θ*⟩ that will produce predictions equivalent to those produced by model averaging over all models in L_k(π).

Proof: Let S* be defined so that each node X_i has the parent set P*_i = ∪_{ν=1..μ_i} P^ν_i, and let θ* be defined by:

    θ*_ijk = (1 / κ^{1/N}) · Σ_{ν=1..μ_i} ρ̂_νiJ^ν_jk                (14)

where the x subscript for J^ν_x has now been replaced with a j and K_x has been replaced with a k subscript, because we are now considering a particular coordinate (i, j, k). It can be seen by direct comparison that the single-network prediction using M* and Equation 1 will yield Equation 13. □
If we define functions f(X_i, P^ν_i | D) such that f(X_i, P^ν_i | D) = ρ̂_νiJ^ν_jk / θ̂_νiJ^ν_jk, then Equation 14 can be written as:

    θ*_ijk = (1 / κ^{1/N}) · Σ_{ν=1..μ_i} θ̂_νiJ^ν_jk · f(X_i, P^ν_i | D)                (15)

The functions f(X_i, P^ν_i | D) do not depend on the indices J^ν_j and k, and they are proportional to the local contribution of the posterior probability that the parent set of X_i is in fact P^ν_i. Equation 15 thus provides the interpretation that M* represents a structure-based smoothing in which each standard parameter θ̂_νijk (the standard parametrization of X_i under candidate parent set P^ν_i) is weighted according to the likelihood that P^ν_i is the true parent set of X_i.
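To illustrate this structure-based smoothing reading of Equation 15, the sketch below assembles one CPT entry of the summary model M* as a weighted combination of the standard parametrizations obtained under each candidate parent set. The data structures (configurations as dictionaries, candidates as triples of parent set, weight, and θ̂ table) and the handling of the global κ^{1/N} factor as a caller-supplied argument are illustrative assumptions, not the paper's implementation.

    def summary_cpt_row(full_config, value, candidates, kappa_root=1.0):
        """Structure-based smoothing (Equation 15) for one entry of node X_i's
        CPT in the summary model M*.  `full_config` maps each parent in the
        union parent set P*_i to a value; `candidates` is a list of
        (parent_set, weight, theta_hat) triples, where weight plays the role of
        f(X_i, P_i^nu | D) and theta_hat maps (restricted_config, value) to the
        standard-parametrization probability.  All names are illustrative."""
        total = 0.0
        for parent_set, weight, theta_hat in candidates:
            # Project the union-parent-set configuration down to this candidate.
            restricted = tuple(sorted((p, full_config[p]) for p in parent_set))
            total += weight * theta_hat[(restricted, value)]
        return total / kappa_root

    # Two candidate parent sets for X_i: {} and {"A"}; weights stand in for f.
    candidates = [
        (frozenset(),      0.3, {((), 0): 0.6, ((), 1): 0.4}),
        (frozenset({"A"}), 0.7, {((("A", 0),), 0): 0.8, ((("A", 0),), 1): 0.2,
                                 ((("A", 1),), 0): 0.3, ((("A", 1),), 1): 0.7}),
    ]
    print(summary_cpt_row({"A": 1}, 1, candidates))   # 0.3*0.4 + 0.7*0.7 = 0.61

Note that the individual rows produced this way need not sum to one; as in the proof of Theorem 2, the overall normalization is absorbed by the κ factor.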
Theorem 2 implies that, rather than performing the O(C(N, k) · N) summation in Equation 13 for each case to be classified, in principle we need only construct a single model M* and use standard O(N) Bayesian network inference for each case.

A serious practical difficulty is that this approach requires, in the worst case, the construction of a completely connected Bayesian network and is thus intractable for reasonable N. An obvious pruning strategy, however, is to truncate the sum in Equation 14 to include no more than n parents. If we reorder the possible parent sets for node X_i as OP ≡ {P^1_i, …, P^{μ_i}_i} such that f(X_i, P^ν_i | D) > f(X_i, P^λ_i | D) only if ν < λ, then a reasonable approximation for P*_i can be constructed by the following procedure (sketched in code below):

Procedure 1 (Approximate P*_i construction)
Given: n and OP.
1. Let P*_i = ∅.
2. For ν = 1 to μ_i: if |P*_i ∪ P^ν_i| ≤ n, let P*_i = P*_i ∪ P^ν_i; else continue.
We denote the class of structures being averaged over using this procedure as L_k^n(π | D), and we call the method Approximate Model Averaging (AMA). Obviously lim_{n→N} L_k^n(π | D) = L_k(π). Furthermore, we empirically show in Section 4 that the loss in ROC area, ε, due to this approximation for n ≥ 10 lies around −0.6% ≤ ε ≤ 0.6% with 99% confidence for N ≤ 100 and for a wide range of other parameters.

4 Experimental Tests

In this section we describe several experimental investigations that were designed to test the performance of AMA on distributions that do not necessarily fall into L_k(π). We first generate synthetic data to allow us to more extensively vary parameters and to generate statistically significant results, then we perform tests on several real-world machine learning data sets.

4.1 Experimental Setup

There are at least five parameters for which we sought to characterize the performance of AMA predictions: the number of nodes N, the approximation limit n on the number of parents in the summary network, the maximum in-degree ("density") K of the generating network, the maximum number of parents k allowed in L_k(π), and the number of records N_r. It is beyond the scope of this paper to present a comprehensive comparison over this full five-dimensional space; however, here we sample their settings to provide insight into the dependence of the results on these parameters.
We compared AMA to three other Bayesian network classifiers: a single naive network (SNN) with the standard parametrization [Domingos and Pazzani, 1997], a model that model-averaged over all naive structures (NMA) [Dash and Cooper, 2002], and a non-restricted two-stage thick-thin greedy search (GTT) over the space of DAGs. GTT starts with an empty graph and repeatedly adds the arc (without creating a cycle) that maximally increases the marginal likelihood P(D | S) until no arc addition will result in a positive increase, then it repeatedly removes arcs until no arc deletion will result in a positive increase in P(D | S). GTT assumes no ordering on the nodes.
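For reference, the following is a minimal sketch of such a two-stage thick-thin search, under stated assumptions: the structure score (for example, log P(D | S) assembled from local terms like the one sketched after Equation 8) is supplied by the caller as a callable, a DAG is represented as a dict mapping each node to its parent set, and all identifiers are illustrative rather than taken from the paper's implementation.

    from itertools import permutations

    def creates_cycle(dag, parent, child):
        """Return True if adding parent -> child would create a directed cycle.
        dag maps each node to the set of its parents."""
        stack, seen = [parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(dag[node])      # walk back through the ancestors of `parent`
        return False

    def greedy_thick_thin(nodes, score):
        """Two-stage thick-thin greedy search over DAGs, as described above.
        `score(dag)` should return the (log) marginal likelihood of the
        candidate structure; supplying it is left to the caller."""
        dag = {v: set() for v in nodes}
        current = score(dag)

        def best_move(moves):                # moves are (apply, undo) pairs
            best = None
            for apply_move, undo_move in moves:
                apply_move()
                s = score(dag)
                undo_move()
                if s > current and (best is None or s > best[0]):
                    best = (s, apply_move)
            return best

        # Thickening: add the single best arc until no addition improves the score.
        while True:
            moves = [(lambda a=a, b=b: dag[b].add(a), lambda a=a, b=b: dag[b].discard(a))
                     for a, b in permutations(nodes, 2)
                     if a not in dag[b] and not creates_cycle(dag, a, b)]
            best = best_move(moves)
            if best is None:
                break
            current = best[0]; best[1]()
        # Thinning: remove the single best arc until no deletion improves the score.
        while True:
            moves = [(lambda a=a, b=b: dag[b].discard(a), lambda a=a, b=b: dag[b].add(a))
                     for b in nodes for a in list(dag[b])]
            best = best_move(moves)
            if best is None:
                break
            current = best[0]; best[1]()
        return dag

    # Toy usage with a stand-in score that rewards one specific arc:
    toy = lambda dag: 1.0 if "A" in dag["B"] else 0.0
    print(greedy_thick_thin(["A", "B"], toy))   # {'A': set(), 'B': {'A'}}

Rebuilding and rescoring the full move list on every iteration is deliberately naive; a practical implementation would exploit score decomposability and rescore only the family affected by each move.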
All abbreviations and symbols are summarized in Table 1 as a reference for the reader.

Symbol   Description
N        Number of nodes
N_r      Number of training records
K        Maximum in-degree of generating graphs
k        Maximum in-degree in L_k
n        Maximum in-degree in summary MA network
SNN      Single naive network
NMA      Naive model averaging
GTT      Greedy thick-thin
AMA      Approximate model averaging

Table 1: Table of symbols.

The inner loop of each test performed basically the same procedure: given the five parameters {N, N_r, K, n, k} and a total number of trials N_trials, we did the following:

Procedure 2 (Basic testing loop)
Given: N, N_r, N_test, K, n, k, and N_trials.
Do:
1. Generate N_trials random graphs G(N, K).
2. For each graph G(N, K) do:
   (a) Generate N_r training records and N_test test records.
   (b) Train the two classifiers to be compared, M1 (typically the AMA classifier) and M2 (the classifier it is compared against), on the training records.
   (c) Test M1 and M2 on the test data, measuring the ROC areas R1 and R2, respectively, of each.
   (d) Calculate the quantity δ = (R1 − R2) / (T − R2), where T is the ROC area of a perfect classifier (i.e., T = 1).
3. Average δ over all N_trials.

The performance metric δ indicates what percentage of M2's missing ROC area is captured by M1.
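The δ metric of step 2(d) and its averaging in step 3 reduce to a few lines; the sketch below takes the ROC areas R1 and R2 as given numbers (in the actual procedure they come from the trained classifiers M1 and M2), and the example values are purely illustrative.

    def delta(r1, r2, t=1.0):
        """Performance metric from Procedure 2: the fraction of M2's missing
        ROC area (relative to a perfect classifier with area t) captured by M1."""
        return (r1 - r2) / (t - r2)

    def average_delta(roc_pairs):
        """Average delta over the (R1, R2) pairs produced by the N_trials runs."""
        return sum(delta(r1, r2) for r1, r2 in roc_pairs) / len(roc_pairs)

    # Three illustrative trials:
    print(average_delta([(0.93, 0.90), (0.88, 0.85), (0.97, 0.94)]))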
For some experiments it was necessary to generate networks randomly from a uniform distribution over DAGs with fixed K. We employed a lazy data generation procedure whereby node conditional probability distributions were generated only when they were required by the sampling, a technique which allows generation of data for arbitrarily dense networks.
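A minimal sketch of this lazy generation idea: conditional-distribution rows are drawn (here from a uniform Dirichlet, an illustrative choice) only when a parent configuration is first encountered during forward sampling, so dense generating graphs never require materializing full conditional probability tables. The graph representation and helper names are assumptions for illustration.

    import random

    def topological_order(parents_of):
        """Return the nodes of a DAG in an order consistent with parenthood."""
        order, placed = [], set()
        while len(order) < len(parents_of):
            for node, parents in parents_of.items():
                if node not in placed and set(parents) <= placed:
                    order.append(node)
                    placed.add(node)
        return order

    def sample_records(parents_of, cardinality, num_records, seed=0):
        """Forward-sample `num_records` records from a DAG whose conditional
        distributions are generated lazily: a row is drawn from a
        Dirichlet(1, ..., 1) only the first time its parent configuration is
        actually needed."""
        rng = random.Random(seed)
        rows = {}                            # (node, parent_config) -> distribution

        def row(node, parent_config):
            key = (node, parent_config)
            if key not in rows:
                draws = [rng.gammavariate(1.0, 1.0) for _ in range(cardinality[node])]
                total = sum(draws)
                rows[key] = [d / total for d in draws]
            return rows[key]

        order = topological_order(parents_of)
        records = []
        for _ in range(num_records):
            values = {}
            for node in order:
                config = tuple(values[p] for p in parents_of[node])  # parents listed in fixed order
                dist = row(node, config)
                values[node] = rng.choices(range(cardinality[node]), weights=dist)[0]
            records.append(values)
        return records

    # A three-node chain A -> B -> C with binary variables:
    parents = {"A": [], "B": ["A"], "C": ["B"]}
    print(sample_records(parents, {"A": 2, "B": 2, "C": 2}, num_records=5))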
In all cases we assume a uniform prior over non-forbidden structures and thus allow p_s(X_i, P_i) = 1 for all i. We also adopted the K2 parameter prior [Cooper and Herskovits, 1992], which sets α_ijk = 1 for all (i, j, k). This criterion has the property of weighting all local distributions of parameters uniformly. All variables in our synthetic tests were binary: N_c = N_f = 2, and for all experiments N_test = 1000. In many experiments we sampled the density of generating graphs K uniformly from the set {1, 2, …, N}; we use the notation K ← {1, …, N} to denote this procedure.

In all experiments, π was chosen to be a fixed total ordering of the variables. At least three heuristics were used to generate π: (1) generate a random ordering, (2) generate two opposite random orderings and average the predictions of each, and (3) use a topological sort of the graph obtained by GTT. These methods produced comparable results, but (2) and (3) performed a few percent better. In all experiments performed below, method (3) was used to generate π.
4.2 Experimental Results

The first experiment tested the degree of error incurred by model averaging over the class L_k^n(π | D) instead of the full class L_k(π). The degree of error was measured in terms of the percent difference in ROC areas, Δ, using k = 1, N_r = 100, N_trials = 40, N varied over the values {15, 20, 30, 40, 50, 60, 80, 100}, n varied over the values {5, 7, 10, 12, 15}, and K ← {1, …, N}. The compiled results are shown in Table 2a–c. In this and all subsequent tables, the error bars denote the 99% confidence interval of the mean.

  (a)                      (b)                       (c)
   n    Δ (%)               N     Δ (%)               k    Δ (%)
   1    15 ± 2.4            20    .023 ± .067         1    .032 ± .09
   5    .74 ± .55           30    .0016 ± .18         2    .036 ± .11
   7    .24 ± .42           40    .036 ± .22          3    .093 ± .23
   10   .15 ± .29           60    .10 ± .22           4    .042 ± .32
   12   −.01 ± .36          80    .17 ± .29
   15   .12 ± .14           100   .23 ± .32

Table 2: The percent difference (Δ) in ROC area between model averaging over L_k(π) and model averaging over L_k^n(π | D) as various parameters are varied, with 99% confidence intervals.

Table 2a shows the dependence on n averaged over all values of N, showing that for an approximation level n ≳ 10 the difference in ROC area between the exact MA and AMA is less than 0.5% with confidence P > 0.99. Table 2b shows the dependence on N averaged over the values of n ∈ {10, 12, 15}. This table shows that with n set to reasonable values the approximation error is bounded under 0.6% with 99% confidence up to N = 100. Finally, one might expect the approximation error to increase as k is increased, since for a fixed n, L_k(π) includes increasingly more structures than L_k^n(π | D) as k increases. Table 2c shows the results of varying k in an experiment with N = 20, N_r = 100, K ← {1, …, N}, n = 12 and k varied from 1 to 4. For all values of k the error in terms of the difference in ROC area is below 0.4% with 99% confidence.
Next, we generated synthetic data to test the performance of AMA relative to the other methods. For all tests, the approximation parameter was fixed to n = 12. We performed four tests, each varying one of the parameters in Table 1 while fixing the remaining parameters to particular values. These results are shown in Table 3.

Due to space constraints we only report here the results for AMA versus GTT. However, we note that both GTT and AMA achieved on average higher performance than SNN for almost every experimental run performed. NMA performed comparably to AMA for small N_r, but performed worse than both GTT and AMA for large N_r. Later (Table 5) we present quantitative comparisons of all four techniques on real-world data.
   N    δ (%)        Q1    Q3         k     δ (%)        Q1    Q3
   5    16 ± 1.3     7.7   27         1     .322 ± .72   -4.7  13
   10   11 ± 1.5     4.0   23         2     9.05 ± .46   2.7   16
   20   7.1 ± 1.6    .14   17         3     10.5 ± .37   3.3   16
   40   12 ± 1.4     .75   23         4     10.9 ± .38   3.5   17
   80   14 ± 1.5     1.8   28         5     10.9 ± .43   3.8   17
   160  20 ± 4.1     .17   44         6     10.2 ± .45   3.1   16

   K    δ (%)        Q1    Q3         N_r   δ (%)        Q1    Q3
   2    7.6 ± 1.7    .43   13         25    9.4 ± 2.7    1.2   13
   4    14 ± 2.2     3.6   23         50    9.4 ± 2.5    1.6   16
   8    11 ± 2.1     2.4   18         100   11 ± 2.3     4.0   16
   16   7.2 ± 1.9    -.40  13         400   10 ± 2.2     3.6   16
   32   5.3 ± 1.8    -1.1  7.9        800   5.5 ± 3.0    -.30  13
   50   6.0 ± 1.6    -.50  9.7        3200  −7.4 ± 5.2   -26   13

Table 3: The performance of AMA versus GTT as several parameters are varied. The error terms indicate the 99% significance level. Q1 and Q3 denote the first and third quartiles, respectively.

The top-left quadrant of Table 3 shows the results of varying the number of nodes N, while holding N_r = 100, k = 2, and with K ← {1, …, N}. For all values of N using this configuration of parameters, AMA outperformed GTT at the 99% significance level, the difference generally increasing as N grew very large or very small. Probably the minimum in this curve around N = 20 reflects the tradeoff that as N gets very small, bounding the in-degree of the networks to k = 2 comes closer and closer to the limiting case of k = N; whereas, generally, as the ratio N/N_r increases we expect model averaging to benefit over a single model.

In the bottom-left quadrant of Table 3, we varied the density of generating graphs K while holding N_r = 100, k = 2 and N = 50. Again, for this set of measurements, AMA always outperformed GTT at the 99% significance level. The results were most apparent when K ≲ 10, an encouraging result since it is a common belief that most real-world networks are sparse.

In the top-right of Table 3, we varied the maximum number of parents allowed in L_k from 1 ≤ k ≤ 6, while holding N_r = 100, N = 20, and K ← {1, …, N}. The value of δ is surprisingly insensitive to the value of k for k ≥ 2. Finally, in the bottom-right, we varied the number of records N_r while fixing N = 20, k = 3 and K ← {1, …, N}. As expected, for smaller values of N_r model averaging performs well versus GTT. As N_r grows, GTT eventually outperforms AMA at the 99% significance level. It was observed in other experiments that the ability of GTT to outperform AMA at high N_r depended strongly on the value of k. When it was computationally feasible to set k ≈ N, AMA typically would outperform GTT even at high N_r.
The performance of AMA was also tested by generating training and test data with the benchmark ALARM network. In this case, N = 36 and K = 4 were fixed by the network, and a test was performed with k = 3, n = 10, and N_r systematically varied. The results in Table 4 are shown for classification on the kinked tube and anaphylaxis nodes. These results are of interest because they demonstrate that the qualitative performance of the AMA classifier depends not just on global network features but also on features specific to the classification node.

   N_r   δ_kt (%)     Q1_kt  Q3_kt    δ_an (%)     Q1_an  Q3_an
   50    32 ± 13      24     55       2.6 ± 3.2    -9.2   17
   100   23 ± 11      8.8    53       .66 ± 3.3    -11    16
   200   13 ± 9.0     -1.2   32       −2.6 ± 4.2   -17    16
   400   12 ± 6.9     -1.2   34       −3.3 ± 5.4   -21    18
   800   3.7 ± 6.9    -8.5   23       2.0 ± 5.0    -11    21
   3200  −.07 ± 14    -19    15       5.6 ± 7.1    -7.5   19

Table 4: AMA performance vs. GTT on synthetic data generated using the ALARM network and classifying on kinked tube (kt) and anaphylaxis (an).

Finally, we performed measurements of the performance of AMA, SNN, GTT and NMA on 21 data sets taken from the UCI online database [Blake and Merz, 1998]. These results are shown in Table 5. Here the score δ_d^i for classifier C_i was calculated according to Procedure 2, where M2 = C_i and M1 was taken to be the maximum scoring classifier for the data set d. For example, in the monks-2 database, AMA was the highest scoring classifier and covered 48% of the remaining area for SNN and GTT and 21% of the remaining area for NMA. The ROC area will in general depend on which state of the classification variable is considered to be the "positive" state; therefore, the scores in Table 5 are average scores over all ROC curves associated with a particular classification variable, and thus some data sets (e.g., wine) have no zero entries when two or more classifiers score highest on different curves.

   Data set      δ_SNN  δ_GTT  δ_NMA  δ_AMA   N   k   N_r
   haberman      0.35   0.35   0      0       4   4   306
   hayes-roth    0.32   0.32   0      0.01    6   6   132
   monks-3       0.83   0.24   0.82   0       7   7   552
   monks-1       0.98   0      0.98   0       7   7   554
   monks-2       0.48   0.48   0.21   0       7   7   600
   chess krk     0.54   0      0.54   0.32    7   7   28055
   ecoli         0.03   0.01   0.02   0       8   8   336
   yeast         0.04   0.11   0.04   0.07    8   8   1484
   abalone       0.12   0.08   0.05   0       9   9   4176
   cpu-perf      0.13   0.31   0.01   0.11    10  10  209
   glass         0.10   0.04   0.15   0.13    10  10  214
   cmc           0.01   0.07   0.01   0.04    10  10  1473
   sol-flare-C   0.03   0.09   0.02   0.01    11  11  322
   sol-flare-M   0      0.44   0.17   0.20    11  11  322
   sol-flare-X   0.06   0.01   0.18   0.33    11  11  322
   wine          0.14   0.01   0.16   0.06    14  7   177
   credit-scrn   0      0.12   0.09   0.02    16  5   652
   letter-rec    0.38   0      0.38   0.01    17  5   20000
   thyroid       0.17   0.28   0      0.11    21  5   7200
   brst-canc-w   0.28   0      0.29   0.23    32  3   569
   connect-4     0.49   0      0.49   0.52    43  2   67557

Table 5: Experimental results for 21 UCI data sets.
Considering the top two scoring classifiers for each data set (those with the two lowest δ values in a row), AMA was typically more robust on these data sets than the other classifiers, scoring in the top two 15/21 times compared to 7/21, 10/21, and 11/21 for SNN, GTT and NMA, respectively. The average difference Δ^i between classifier i and AMA, Δ^i ≡ (1/21) Σ_d (δ_d^i − δ_d^AMA), was calculated to gauge the statistical significance of these experiments. The results were: Δ^SNN = 15.8 ± 6.7% (significant at the 99% level), Δ^GTT = 3.7 ± 5.3% (significant only at the 75% level), and Δ^NMA = 11.7 ± 6.3% (significant at the 95% level). These results are promising, but need to be extended to further investigate the performance of AMA.
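The average difference Δ^i reduces to a paired mean over the per-data-set scores; the sketch below computes just that mean (the significance levels quoted above require an additional test that is not reproduced here), using a few values from Table 5 as the example.

    def average_difference(delta_i, delta_ama):
        """Delta^i = (1/D) * sum_d (delta_d^i - delta_d^AMA): the mean advantage
        of AMA over classifier i across the D data sets."""
        assert len(delta_i) == len(delta_ama)
        return sum(a - b for a, b in zip(delta_i, delta_ama)) / len(delta_i)

    # First three rows of Table 5 for SNN vs. AMA (haberman, hayes-roth, monks-3):
    print(average_difference([0.35, 0.32, 0.83], [0, 0.01, 0]))   # ~0.497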
5 Discussion

We have shown that it is possible to construct a single DAG model that will perform linear-time approximate model averaging over the L_k^n(π | D) class of models. We have demonstrated empirically that even with relatively little effort in choosing a good partial ordering π, classifications obtained by model averaging over L_k^n(π | D) can be beneficial compared to other BN classifiers. The benefits of AMA were not without cost; although comprehensive measurements were not taken, it was observed that the initial model construction time for AMA was typically 3-10 times longer than the time it took for the greedy search to converge, for the values of k and n used in our experiments.

The AMA technique is interesting because of its simplicity of implementation. Existing systems that use Bayesian network classifiers can trivially be adapted to use model averaging by replacing their existing model with a single summary model.

Future work includes finding a better method for optimizing the ordering π, possibly by doing a search over orderings as in [Friedman and Koller, 2000]. Also, our experimental results should be expanded to more extensively characterize the performance of AMA for classification on real-world data. Another possible extension is to relax the assumption of complete data, possibly by using the EM algorithm or MCMC sampling to estimate Equation 14 from data.
6 Acknowledgements

This work was supported in part by grant number S99-GSRP-085 from the National Aeronautics and Space Administration under the Graduate Students Research Program, by grant IIS-9812021 from the National Science Foundation, and by grant LM06696 from the National Library of Medicine.

References

[Buntine, 1991] W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-91), pages 52–60, San Mateo, California, 1991. Morgan Kaufmann Publishers.

[Cooper and Herskovits, 1992] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

[Dash and Cooper, 2002] Denver Dash and Gregory F. Cooper. Exact model averaging with naive Bayesian classifiers. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), to appear, 2002.

[Domingos and Pazzani, 1997] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.

[Friedman and Koller, 2000] Nir Friedman and Daphne Koller. Being Bayesian about network structure. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 201–210, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[Friedman et al., 1997] Nir Friedman, Dan Geiger, Moises Goldszmidt, G. Provan, P. Langley, and P. Smyth. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.

[Heckerman et al., 1995] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

[Madigan and Raftery, 1994] David Madigan and Adrian E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89:1535–1546, 1994.

[Madigan and York, 1995] David Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.

[Meila and Jaakkola, 2000] Marina Meila and Tommi S. Jaakkola. Tractable Bayesian learning of tree belief networks. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 380–388, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[Spirtes et al., 1993] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Springer Verlag, New York, 1993.

[Verma and Pearl, 1991] T. S. Verma and Judea Pearl. Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence 6, pages 255–269. Elsevier Science Publishing Company, Inc., New York, N.Y., 1991.