Model Averaging with Discrete Bayesian Network Classifiers*

Denver Dash
Decision Systems Laboratory, Intelligent Systems Program
University of Pittsburgh, Pittsburgh, PA 15213

Gregory F. Cooper
Center for Biomedical Informatics
University of Pittsburgh, Pittsburgh, PA 15213

Abstract

This paper considers the problem of performing classification by model-averaging over a class of discrete Bayesian network structures consistent with a partial ordering and with bounded in-degree k. We show that for N nodes this class contains in the worst case at least Ω(\binom{N/2}{k}^{N/2}) distinct network structures, but we show that the model-averaging summation over this class can be performed in O(\binom{N}{k} · N) time. We use this fact to show that it is possible to efficiently construct a single directed acyclic graph (DAG) whose predictions approximate those of exact model-averaging over this class, allowing approximate model-averaged predictions to be performed in O(N) time. We evaluate the procedure in a supervised classification context, and show empirically that this technique can be beneficial for classification even when the generating distribution is not a member of the class being averaged over, and we characterize the performance over several parameters on simulated and real-world data.

* This work was partially performed during a summer internship at the Machine Learning and Perception group, Microsoft Research, Cambridge, UK.

1 Introduction

The general supervised classification problem seeks to create a model based on labelled data D, which can be used to classify future vectors of features F = {F_1, F_2, ..., F_N} into one of various classes of interest. A Bayesian network (BN) classifier is a probabilistic model that accomplishes this goal by explicating causal interactions/conditional independencies between the features in F. The simplest Bayesian network classifier for this task is the naive classifier, which, without inferring any structural information from the database, can still perform surprisingly well at the classification task [Domingos and Pazzani, 1997]. More sophisticated algorithms for learning BNs from data [Verma and Pearl, 1991, Cooper and Herskovits, 1992, Spirtes et al., 1993, Heckerman et al., 1995, Friedman et al., 1997] can also be used effectively to construct a BN model by extracting information about conditional independencies from D to build more accurate structural models. Classification using a single Bayesian network model for a fixed number of classes when the feature vector is completely instantiated can be performed in O(N) time.

This procedure of selecting a single model for classification has the potential drawback of over-fitting the data, however, leading to poor generalization capability. A strictly Bayesian approach, averaging classifications over all models weighted by their posterior probability given the data, has been shown to reduce over-fitting and provide better generalization [Madigan and Raftery, 1994]. Unfortunately, the space of network structures is super-exponential in the number of model variables, and thus an exact method for full model-averaging is likely to be intractable.

In this paper we consider the possibility of performing exact and approximate model-averaging (MA) over a particular class of structures rather than over the general space of DAGs. Recently, Dash and Cooper [2002] demonstrated that exact model averaging over the restricted class of naive networks could be performed by a simple re-parametrization of a naive network, and they showed that this technique consistently outperformed a single naive classifier with the standard parametrization. The present paper generalizes that result; in particular, we show that exact model averaging over the class of BN structures consistent with a partial ordering π and with bounded in-degree k, despite its super-exponential size, can be performed with relatively small time and space requirements.
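To make the size of this restricted class concrete, the short sketch below (illustrative only, not part of the original paper; function names are ours) counts the structures consistent with a fixed total ordering when each node may take at most k parents, alongside the two-layer worst-case lower bound quoted in the abstract.

```python
# Illustrative only: count the structures being averaged over.  For a fixed
# *total* ordering of N nodes, node i may choose any subset of its i
# predecessors of size at most k, so the class size is
#   prod_i sum_{j <= k} C(i, j).
# The two-layer worst case quoted in the abstract gives at least
#   C(N/2, k)^(N/2)  structures (valid for k <= N/2).
from math import comb

def num_structures_total_order(N: int, k: int) -> int:
    """Number of DAGs consistent with a fixed total ordering of N nodes
    when each node may have at most k parents."""
    total = 1
    for i in range(N):  # node i has i predecessors in the ordering
        total *= sum(comb(i, j) for j in range(min(i, k) + 1))
    return total

def two_layer_lower_bound(N: int, k: int) -> int:
    """The Omega(C(N/2, k)^(N/2)) worst-case count for a two-layer ordering."""
    half = N // 2
    return comb(half, k) ** half

if __name__ == "__main__":
    for N in (10, 20, 40):
        print(N, num_structures_total_order(N, 2), two_layer_lower_bound(N, 2))
```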
Methods for approximate MA classification using both selective pruning [Madigan and Raftery, 1994, Volinsky, 1997] and Monte-Carlo [Madigan and York, 1995] techniques exist and have been shown to improve prediction tasks; however, these methods do not possess the time efficiency of inference that we seek.

Friedman and Koller [2000] studied the ability to estimate structural features of a network (for example, the probability of an arc from X_i to X_j) by performing an MCMC search over orderings of nodes. Their method relied on a decomposition, which they credited to Buntine [1991], that we extend in order to prove our key theoretical result. We discuss this issue in detail in Section 3.3. Their work differs from ours in two key respects: (1) their approach does not capture the single-network (and thus the efficiency of calculation) approximation to the MA problem, and (2) they perform model averaging only to calculate the probabilities of structural features, explicitly not for classification.

Meila and Jaakkola [2000] discuss the ability to perform exact model averaging over all trees. They also use assumptions and decompositions similar to ours; however, our calculation is more general in allowing nodes to have more than one parent, but it is less general in that it assumes a partial ordering of the nodes. Their approach also does not allow for O(N) model-averaged classifications.

Our primary contributions in this paper are as follows: (1) we extend the factorization of conditionals to apply to the task of classification, (2) we show that MA calculations over this class can be approximated by a single network structure S* which can be constructed efficiently, thereafter allowing approximate MA predictions to be performed in O(N) time, and (3) we demonstrate empirically that, especially when the number of records is small compared to the size of the network, using this technique for classification can be beneficial compared to (a) a single naive classifier, (b) single networks learned from data using a greedy Bayesian learning algorithm, and (c) exact model averaging over naive classifiers.

In Section 2 we formally frame the problem and state our assumptions and notation. In Section 3 we derive the MA solution and show that the MA predictions are approximated by those of a single structure bearing a particular set of parameters. In Section 4 we present the experimental comparisons, and in Section 5 we discuss our conclusions and future directions.

2 Assumptions and Notation

The general supervised classification problem can be framed as follows: Given a set of features F = {F_1, F_2, ..., F_N}, a set of classes C = {C_1, C_2, ..., C_{N_c}}, and a labelled database D = {D_1, D_2, ..., D_R}, construct a model to predict into which class future feature vectors are most likely to reside.

We use the notation X_i to refer to the nodes when we need a uniform notation, adopting the convention that X_i ≡ F_i and X_0 ≡ C, and we use X to denote the collective set of nodes in the network. A directed graph G(X) is defined as a pair ⟨X, E⟩, where E is a set of directed edges X_i → X_j such that X_i, X_j ∈ X.

We assume that each node X_i is a discrete multinomial variable with r_i possible states {x_i^1, x_i^2, ..., x_i^{r_i}}. We use P_i to denote the parent set of X_i, and we let q_i denote the number of possible joint configurations of parents for node X_i, which we enumerate as {p_i^1, p_i^2, ..., p_i^{q_i}}. We use θ_{ijk} to denote a single parameter of the network, θ_{ijk} = P(X_i = x_i^k | P_i = p_i^j), and the symbol θ to denote the collective parameters of the network. In general we use the common (ijk) coordinate notation to identify the k-th state and the j-th parent configuration of the i-th node in the network. We make the common assumptions in BN structure learning of Dirichlet priors, parameter independence, and parameter modularity [Heckerman et al., 1995]. We also assume that the database D consists of complete labelled instances.

3 Theoretical Results

In this section we show how to efficiently calculate the quantity P(X = x | D) averaged over the class of structures we are considering.

3.1 Fixed Network Structure

For a fixed network structure S and a fixed set of network parameters θ, the quantity P(X = x | S, θ) can be calculated in O(N) time:

    P(X = x | S, θ) = \prod_{i=0}^{N} θ_{iJK},    (1)

where all (j, k) coordinates are fixed by the configuration of X to the value (j, k) = (J, K).

When, rather than a fixed set of parameters, a database D is given, from an ideal Bayesian perspective it is necessary to average over all possible configurations of the parameters θ:

    P(X = x | S, D) = \int P(X = x | S, θ) · P(θ | S, D) dθ
                    = \int \prod_{i=0}^{N} θ_{iJK} · P(θ | S, D) dθ,

where the second line follows from Equation 1.
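As a concrete illustration of Equation 1, here is a minimal sketch (ours, not the authors' code) of the O(N) evaluation of P(X = x | S, θ) for a fixed structure and fixed parameters; the dictionary-based CPT layout is an assumption made purely for the example.

```python
# Minimal sketch (not from the paper) of the O(N) evaluation in Equation 1:
# P(X = x | S, theta) = prod_i theta_{iJK}, where J and K are the parent
# configuration and state of node i fixed by the complete instantiation x.
from typing import Dict, List, Tuple

def joint_probability(x: Dict[str, int],
                      parents: Dict[str, List[str]],
                      cpt: Dict[str, Dict[Tuple[int, ...], List[float]]]) -> float:
    """Evaluate prod_i P(X_i = x_i | parents(X_i) fixed by x)."""
    p = 1.0
    for node, value in x.items():
        j = tuple(x[pa] for pa in parents[node])   # parent configuration J
        p *= cpt[node][j][value]                   # theta_{iJK}
    return p

# Toy example: class C with one feature F1, all variables binary.
parents = {"C": [], "F1": ["C"]}
cpt = {"C": {(): [0.6, 0.4]},
       "F1": {(0,): [0.8, 0.2], (1,): [0.3, 0.7]}}
print(joint_probability({"C": 1, "F1": 0}, parents, cpt))  # 0.4 * 0.3 = 0.12
```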
Given the assumption of parameter independence and Dirichlet priors, the quantity P(X = x | S, D) can be written just in terms of sufficient statistics and Dirichlet hyperparameters [Cooper and Herskovits, 1992, Heckerman et al., 1995]:

    P(X = x | S, D) = \prod_{i=0}^{N} \frac{α_{iJK} + N_{iJK}}{α_{iJ} + N_{iJ}},    (2)

where here and throughout we use the notation Q_{ij} = \sum_k Q_{ijk} for a quantity Q. Comparing this result to Equation 1 illustrates the well-known result that a single network with a fixed set of parameters θ̂ given by

    θ̂_{ijk} = \frac{α_{ijk} + N_{ijk}}{α_{ij} + N_{ij}}    (3)

will produce predictions equivalent to those obtained by averaging over all parameter configurations. We refer to θ̂_{ijk} as the standard parametrization.

3.2 Averaging Structural Features with a Fixed Ordering

The decomposition by Buntine used by Friedman and Koller was a dynamic programming solution which calculated, with relative efficiency, the posterior probability of a structural feature (for example, a particular arc X_L → X_M) averaged over all in-degree-bounded networks consistent with a fixed ordering. Here we re-derive the result.

The derivation required an additional assumption, labelled "structure modularity" by Friedman and Koller:

Assumption 1 (Structure modularity) The prior of a structure S, P(S), can be factorized according to the network:

    P(S) ∝ \prod_{i=0}^{N} p_s(X_i, P_i),    (4)

where p_s(X_i, P_i) is some function that depends only on the local structure (X_i and P_i).

Obviously the uniform distribution, which is O(1), will satisfy this assumption, as will other common network prior forms.

The posterior probability P(X_L → X_M | D) can be written as:

    P(X_L → X_M | D) = κ \sum_S δ(X_L → X_M ∈ S) · P(D | S) · P(S),    (5)

where κ = 1/P(D) is a constant that depends only on the database, and δ(X) is the Kronecker delta function, equal to 1 if X = true and 0 otherwise.

Given the assumptions of complete data, multinomial variables, Dirichlet priors and parameter independence, the marginal likelihood P(D | S) can be written just in terms of hyperparameters and sufficient statistics [Cooper and Herskovits, 1992, Heckerman et al., 1995]:

    P(D | S) = \prod_{i=0}^{N} \prod_{j=1}^{q_i} \frac{\Gamma(α_{ij})}{\Gamma(α_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(α_{ijk} + N_{ijk})}{\Gamma(α_{ijk})}.    (6)

Given Assumption 1 and Equation 6, Equation 5 can be written as:

    P(X_L → X_M | D) = κ \sum_S \prod_{i=0}^{N} ρ_{iLM},    (7)

where the ρ_{iLM} functions are given by:

    ρ_{iLM} = δ[M ≠ i ∨ X_L ∈ P_i] · p_s(X_i, P_i) · \prod_{j=1}^{q_i} \frac{\Gamma(α_{ij})}{\Gamma(α_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(α_{ijk} + N_{ijk})}{\Gamma(α_{ijk})},    (8)

and can be calculated using information that depends only on node X_i and P_i.

Buntine considered a total ordering on the nodes, but generalization to a partial ordering is straightforward. For a given partial ordering π and a particular node X_i we want to enumerate all of X_i's possible parent sets up to a maximum size k. To this end, we will typically use the superscript ν to index the different parent sets. For example, four nodes partitioned as ⟨{X_1, X_3}, {X_2, X_4}⟩ and a maximum in-degree k = 2 might yield the following enumeration of parent sets for X_2: {P_2^0 = ∅, P_2^1 = {X_1}, P_2^2 = {X_3}, P_2^3 = {X_1, X_3}}, any of which we might refer to as P_2^ν. We assume the hyperparameters α_{ijk}^ν for any parent set P_i^ν can be calculated in constant time. For example, the K2 scoring metric [Cooper and Herskovits, 1992], which sets α_{ijk} = 1 for all (ijk), will satisfy this requirement. The class of models consistent with π and with bounded in-degree k we denote as L_k(π):

Definition 1 (Multi-level k-graphs, L_k(π)) For a given integer k ≤ N and a given partial ordering π of X, a DAG G = ⟨X, E⟩ is a multi-level k-graph with respect to π (denoted as L_k(π)) if arcs are directed down levels and no variable has more than k parents: X_i → X_j ∈ E ⇒ Level(X_i) < Level(X_j), and X_i ∈ X ⇒ |P_i| ≤ k.

When model-averaging over the class L_k(π), each node at level l can choose at most k parents from the nodes in levels l' < l.
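The following sketch (ours, not from the paper) makes the parent-set enumeration described above concrete; it reproduces the example of X_2 under the ordering ⟨{X_1, X_3}, {X_2, X_4}⟩ with k = 2.

```python
# Minimal sketch (not from the paper) of enumerating the candidate parent
# sets P_i^nu: for a node at a given level of the partial ordering pi, every
# subset of the nodes at strictly earlier levels with at most k elements is
# a legal parent set.
from itertools import combinations
from typing import List, Tuple

def enumerate_parent_sets(predecessors: List[str], k: int) -> List[Tuple[str, ...]]:
    """All subsets of `predecessors` of size 0..k (the empty set included)."""
    sets = []
    for size in range(min(k, len(predecessors)) + 1):
        sets.extend(combinations(predecessors, size))
    return sets

# The example above: ordering <{X1, X3}, {X2, X4}> with k = 2, so X2 may
# draw parents only from the first level {X1, X3}.
print(enumerate_parent_sets(["X1", "X3"], k=2))
# [(), ('X1',), ('X3',), ('X1', 'X3')]
```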
For k < N/2, Equation 7 in the worst case includes a summation over Ω(\binom{N/2}{k}^{N/2}) network structures (this worst case corresponding to two layers, each with N/2 nodes). However, the following theorem shows that Equation 7 can be calculated with relative efficiency:

Theorem 1 For N variables with a maximum number of states per variable given by N_f and a database of N_r records, the right-hand side of Equation 7 can be calculated in O(\binom{N}{k} · N · N_r · N_f^k) time and using O(\binom{N}{k} · N · N_f^k) space if the summation is restricted to network structures in L_k(π).

Proof: We use μ_i to denote the maximum number of parent sets available to node X_i. Expanding the sum in Equation 7 and using the notation ρ^ν_{iLM} to denote ρ_{iLM} for the ν-th parent set P^ν_i yields:

    P(X_L → X_M | D) ∝   ρ^1_{0LM} · ρ^1_{1LM} · ... · ρ^1_{NLM}
                       + ρ^2_{0LM} · ρ^1_{1LM} · ... · ρ^1_{NLM}
                       + ...
                       + ρ^{μ_0}_{0LM} · ρ^1_{1LM} · ... · ρ^1_{NLM}
                       + ρ^1_{0LM} · ρ^2_{1LM} · ... · ρ^1_{NLM}
                       + ...
                       + ρ^{μ_0}_{0LM} · ρ^{μ_1}_{1LM} · ... · ρ^{μ_N}_{NLM},

a sum containing Ω(\binom{N/2}{k}^{N/2}) terms in the worst case.

We define the symbol Σ^{LM}_m to denote the structure sum of the products up to and including the m-th node:

    Σ^{LM}_m ≡   ρ^1_{0LM} · ρ^1_{1LM} · ... · ρ^1_{mLM}
               + ρ^2_{0LM} · ρ^1_{1LM} · ... · ρ^1_{mLM}
               + ...
               + ρ^{μ_0}_{0LM} · ρ^{μ_1}_{1LM} · ... · ρ^{μ_m}_{mLM}.

Using this notation, the following recursion relation can be derived:

    Σ^{LM}_i = Σ^{LM}_{i-1} · \sum_{ν=1}^{μ_i} ρ^ν_{iLM},    Σ^{LM}_{-1} = 1.

Finally, expanding out the recurrence relation yields the expression for P(X_L → X_M | D):

    P(X_L → X_M | D) = κ \prod_{i=0}^{N} \sum_{ν=1}^{μ_i} ρ^ν_{iLM}.    (9)

Once the ρ^ν_{iLM} terms are calculated, the right-hand side of Equation 9 can be evaluated in O(N · \binom{N}{k}) time. Calculating the ρ terms themselves requires the calculation of one hyperparameter α^ν_{ijk} and one sufficient statistic N^ν_{ijk} for each parameter of the network, each of which can be calculated in O(N_r) time. Calculating the complete set for all (j, k) combinations thus requires O(\binom{N}{k} · N · N_r · N_f^k) time and O(\binom{N}{k} · N · N_f^k) space (because there are O(N_f^k) network parameters per node). □

In fact, the N_f^k characterization is a loose upper bound, and can likely be reduced by compressing contingency tables.

3.3 Model Averaging for Prediction

Here we show that a similar dynamic programming solution exists for model-averaging the quantity P(X = x | D) over the class L_k(π). This quantity can be written as:

    P(X = x | D) = κ \sum_S \prod_{i=0}^{N} θ̂_{iJK} · P(D | S) · P(S),    (10)

where the θ̂_{iJK} are the standard parameters given in Equation 3. Given structure modularity and Equation 6, Equation 10 can be written in a form very similar to Equation 7:

    P(X = x | D) = κ \sum_S \prod_{i=0}^{N} ρ̂_{iJ^S_x K_x},    (11)

where here the ρ̂_{iJ^S_x K_x} functions are given by:

    ρ̂_{iJ^S_x K_x} = θ̂_{iJ^S_x K_x} · p_s(X_i, P_i) · \prod_{j=1}^{q_i} \frac{\Gamma(α_{ij})}{\Gamma(α_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(α_{ijk} + N_{ijk})}{\Gamma(α_{ijk})}.    (12)

We have taken the trouble to subscript the indices J and K from Equation 1. Both are indexed with an x to indicate that they are fixed by the particular configuration of X, and J is indexed by S to emphasize that the value of the parent configuration index depends on the structure of the network. Although this notation may seem cumbersome, we hope it clarifies the analysis later.

The ρ̂_{iJ^S_x K_x} functions again can be calculated using only information local to node X_i and P_i. Following a derivation identical to that for averaging P(X_L → X_M | D) in Section 3.2 yields the result:

    P(X = x | D) = κ \prod_{i=0}^{N} \sum_{ν=1}^{μ_i} ρ̂^ν_{iJ^ν_x K_x}.    (13)

Here the S index has been replaced with a ν indicating which parent set for node X_i is being considered. Once again this summation can be performed in O(\binom{N}{k} · N · N_r · N_f^k) time and O(\binom{N}{k} · N · N_f^k) space.
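The sketch below (ours, not the authors' implementation) spells out this product-of-sums computation for a single complete instantiation x, under the K2 prior α_ijk = 1 and a uniform structure prior p_s = 1; the data layout (a list of dicts mapping node names to discrete states) and all function names are assumptions made for illustration.

```python
# Sketch of Equations 12-13 (ours, not the authors' code): the exact
# model-averaged joint over L_k(pi), up to the constant kappa, for one
# complete instantiation x.  Assumes the K2 prior alpha_ijk = 1 and a
# uniform structure prior p_s(X_i, P_i) = 1.
from collections import Counter
from math import exp, lgamma

def local_rho_hat(node, pset, x, data, arity):
    """rho_hat for one (node, parent set): theta_hat_{iJxKx} times the
    local marginal-likelihood term of Equation 12."""
    r_i = arity[node]
    counts = Counter()                        # sufficient statistics N_ijk
    for row in data:
        counts[(tuple(row[p] for p in pset), row[node])] += 1
    log_ml = 0.0                              # local marginal likelihood (log)
    for j in {key[0] for key in counts}:      # only observed parent configs
        n_ij = sum(counts[(j, k)] for k in range(r_i))
        log_ml += lgamma(r_i) - lgamma(r_i + n_ij)       # alpha_ij = r_i
        for k in range(r_i):
            log_ml += lgamma(1 + counts[(j, k)])         # alpha_ijk = 1
    j_x = tuple(x[p] for p in pset)           # parent configuration fixed by x
    n_ijx = sum(counts[(j_x, k)] for k in range(r_i))
    theta_hat = (1 + counts[(j_x, x[node])]) / (r_i + n_ijx)   # Equation 3
    # For realistic N_r this should be kept in log space to avoid underflow.
    return theta_hat * exp(log_ml)

def model_averaged_joint(x, data, arity, parent_sets):
    """Equation 13 with kappa dropped: product over nodes of the sum of
    rho_hat over that node's candidate parent sets P_i^nu."""
    value = 1.0
    for node, candidates in parent_sets.items():
        value *= sum(local_rho_hat(node, ps, x, data, arity) for ps in candidates)
    return value

# For classification, evaluate model_averaged_joint once per state of the
# class node X_0 (holding the feature values fixed) and normalise.
```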
3.4 Approximate Model Averaging

The derivation in the preceding section is an extension of the concept underlying Buntine's dynamic programming solution. The functional form of the above solution allows us to easily prove the following theorem:

Theorem 2 There exists a single Bayesian network model M* = ⟨S*, θ*⟩ that will produce predictions equivalent to those produced by model averaging over all models in L_k(π).

Proof: Let S* be defined so that each node X_i has the parent set P*_i = \bigcup_{ν=1}^{μ_i} P^ν_i, and let θ* be defined by:

    θ*_{ijk} = \sqrt[N]{κ} \sum_{ν=1}^{μ_i} ρ̂^ν_{iJ^ν_j k},    (14)

where the x subscript for J^ν_x has now been replaced with a j and K_x has been replaced with a k subscript, because we are now considering a particular coordinate (ijk). It can be seen by direct comparison that the single-network prediction using M* and Equation 1 will yield Equation 13. □

If we define functions f(X_i, P^ν_i | D) such that f(X_i, P^ν_i | D) = ρ̂^ν_{iJ^ν_j k} / θ̂^ν_{iJ^ν_j k}, then Equation 14 can be written as:

    θ*_{ijk} = \sqrt[N]{κ} \sum_{ν=1}^{μ_i} θ̂^ν_{iJ^ν_j k} · f(X_i, P^ν_i | D).    (15)

The functions f(X_i, P^ν_i | D) do not depend on the indices J^ν_j and k, and they are proportional to the local contribution of the posterior probability that the parent set of X_i is in fact P^ν_i. Equation 15 thus provides the interpretation that M* represents a structure-based smoothing where each standard parameter θ̂^ν_{ijk} is weighted based on the likelihood that P^ν_i is the true parent set of X_i.

Theorem 2 implies that, rather than performing the O(\binom{N}{k} · N) summation in Equation 13 for each case to be classified, in principle we need only construct a single model M* and use standard O(N) Bayesian network inference for each case.

A serious practical difficulty is that this approach requires in the worst case the construction of a completely connected Bayesian network and is thus intractable for reasonable N. An obvious pruning strategy, however, is to truncate the sum in Equation 14 to include no more than n parents. If we reorder the possible parent sets for node X_i as OP ≡ {P^1_i, ..., P^{μ_i}_i} such that f(X_i, P^ν_i | D) > f(X_i, P^λ_i | D) only if ν < λ, then a reasonable approximation to P*_i can be constructed by the following procedure:

Procedure 1 (Approximate P*_i construction)
Given: n and OP.
1. Let P*_i = ∅.
2. For ν = 1 to μ_i:
   if |P*_i ∪ P^ν_i| ≤ n, let P*_i = P*_i ∪ P^ν_i;
   else continue.
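A minimal sketch of Procedure 1 (ours, not the authors' code) is shown below; the candidate parent sets are assumed to be supplied already sorted by the weights f(X_i, P^ν_i | D). Once P*_i is fixed, the CPT entries of M* over P*_i are filled in according to Equation 14.

```python
# Minimal sketch (not from the paper's code) of Procedure 1: greedily build
# the truncated parent set P_i^* for the summary network M*, given the
# candidate parent sets already sorted by f(X_i, P_i^nu | D) in decreasing
# order.  `n` is the cap on |P_i^*|.
from typing import Iterable, Set, Tuple

def approximate_parent_set(ordered_parent_sets: Iterable[Tuple[str, ...]],
                           n: int) -> Set[str]:
    """Union candidate parent sets while the union stays within n parents."""
    p_star: Set[str] = set()
    for pset in ordered_parent_sets:
        candidate = p_star | set(pset)
        if len(candidate) <= n:
            p_star = candidate
        # else: skip this parent set and keep scanning (the "else continue").
    return p_star

# Example: three candidates ranked by f, with n = 2.
print(approximate_parent_set([("X1",), ("X3",), ("X1", "X5")], n=2))
# {'X1', 'X3'} -- the third set would push the union to 3 parents, so it is skipped
```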
We denote the class of structures being averaged over using this procedure as L^n_k(π | D), and we call the method Approximate Model Averaging (AMA). Obviously lim_{n→N} L^n_k(π | D) = L_k(π). Furthermore, we show empirically in Section 4 that the loss in ROC area, ε, due to this approximation for n ≥ 10 lies around −0.6% ≤ ε ≤ 0.6% with 99% confidence for N ≤ 100 and for a wide range of other parameters.

4 Experimental Tests

In this section we describe several experimental investigations that were designed to test the performance of AMA on distributions that do not necessarily fall into L_k(π). We first generate synthetic data, which allows us to vary parameters more extensively and to generate statistically significant results, and then we perform tests on several real-world machine learning data sets.

4.1 Experimental Setup

There are at least five parameters for which we sought to characterize the performance of AMA predictions: the number of nodes N, the approximation limit n on the number of parents, the maximum in-degree ("density") K of the generating network, the maximum number of parents k allowed in L_k(π), and the number of records N_r. It is beyond the scope of this paper to present a comprehensive comparison over this full five-dimensional space; however, here we sample their settings to provide insight into the dependence of the results on these parameters.

We compared AMA to three other Bayesian network classifiers: a single naive network (SNN) with the standard parametrization [Domingos and Pazzani, 1997], a model that model-averages over all naive structures (NMA) [Dash and Cooper, 2002], and a non-restricted two-stage greedy thick-thin search (GTT) over the space of DAGs. GTT starts with an empty graph and repeatedly adds the arc (without creating a cycle) that maximally increases the marginal likelihood P(D | S) until no arc addition will result in a positive increase; it then repeatedly removes arcs until no arc deletion will result in a positive increase in P(D | S). GTT assumes no ordering on the nodes.

All abbreviations and symbols are summarized in Table 1 as a reference for the reader.

Table 1: Table of symbols.

  Symbol  Description
  N       Number of nodes
  Nr      Number of training records
  K       Maximum in-degree of generating graphs
  k       Maximum in-degree in L_k
  n       Maximum in-degree in the summary MA network
  SNN     Single naive network
  NMA     Naive model averaging
  GTT     Greedy thick-thin
  AMA     Approximate model averaging

The inner loop of each test performed basically the same procedure: given the five parameters {N, N_r, K, n, k} and a total number of trials N_trials, we did the following:

Procedure 2 (Basic testing loop)
Given: N, N_r, N_test, K, n, k, and N_trials.
Do:
1. Generate N_trials random graphs G(N, K).
2. For each graph G(N, K) do:
   (a) Generate N_r training records and N_test test records.
   (b) Train the two classifiers to be compared, M_1 (typically the AMA classifier) and M_2, on the training records.
   (c) Test M_1 and M_2 on the test data, measuring the ROC areas R_1 and R_2, respectively, of each.
   (d) Calculate the quantity δ = (R_1 − R_2)/(T − R_2), where T is the ROC area of a perfect classifier (i.e., 1 if the axes are normalized).
3. Average δ over all N_trials trials.

The performance metric δ indicates what percentage of M_2's missing ROC area is captured by M_1.

For some experiments it was necessary to generate networks randomly from a uniform distribution over DAGs with fixed K. We employed a lazy data generation procedure whereby node conditional probability distributions were generated only when they were required by the sampling, a technique which allows generation of data for arbitrarily dense networks.

In all cases we assume a uniform prior over non-forbidden structures and thus set p_s(X_i, P_i) = 1 for all i. We also adopted the K2 parameter prior [Cooper and Herskovits, 1992], which sets α_{ijk} = 1 for all (i, j, k). This criterion has the property of weighting all local distributions of parameters uniformly. All variables in our synthetic tests were binary: N_c = N_f = 2, and for all experiments N_test = 1000. In many experiments we sampled the density of generating graphs K uniformly from the set {1, 2, ..., N}; we use the notation K ←- {1, ..., N} to denote this procedure.

In all experiments, π was chosen to be a fixed total ordering of the variables. At least three heuristics were used to generate π: (1) generate a random ordering, (2) generate two opposite random orderings and average the predictions of each, and (3) use a topological sort of the graph obtained by GTT. These methods produced comparable results, but (2) and (3) performed a few percent better. In all experiments reported below, method (3) was used to generate π.

4.2 Experimental Results

The first experiment tested the degree of error incurred by model averaging over the class L^n_k(π | D) instead of the full class L_k(π). The degree of error was measured in terms of the percent difference in ROC areas, ∆, using k = 1, N_r = 100, N_trials = 40, N varied over the values {15, 20, 30, 40, 50, 60, 80, 100}, n varied over the values {5, 7, 10, 12, 15}, and K ←- {1, ..., N}. The compiled results are shown in Table 2a-c. In this and all subsequent tables, the error bars denote the 99% confidence interval of the mean.

Table 2: The percent difference (∆) in ROC area between model averaging over L_k(π) and model averaging over L^n_k(π | D) as various parameters are varied, with 99% confidence intervals.

  (a)  n   ∆ (%)          (b)  N    ∆ (%)           (c)  k  ∆ (%)
       1   15 ± 2.4            20   .023 ± .067          1  .032 ± .09
       5   .74 ± .55           30   .0016 ± .18          2  .036 ± .11
       7   .24 ± .42           40   .036 ± .22           3  .093 ± .23
       10  .15 ± .29           60   .10 ± .22            4  .042 ± .32
       12  −.01 ± .36          80   .17 ± .29
       15  .12 ± .14           100  .23 ± .32

Table 2a shows the dependence on n averaged over all values of N, showing that for an approximation level n ≳ 10 the difference in ROC area between exact MA and AMA is less than 0.5% with confidence P > 0.99. Table 2b shows the dependence on N averaged over the values of n ∈ {10, 12, 15}. This table shows that with n set to reasonable values the approximation error is bounded under 0.6% with 99% confidence up to N = 100. Finally, one might expect the approximation error to increase as k is increased, since for a fixed n, L_k(π) includes increasingly more structures than L^n_k(π | D) as k increases. Table 2c shows the results of varying k in an experiment with N = 20, N_r = 100, K ←- {1, ..., N}, n = 12 and k varied from 1 to 4. For all values of k the error in terms of the difference in ROC area is below 0.4% with 99% confidence.

Next, we generated synthetic data to test the performance of AMA relative to the other methods. For all tests, the approximation parameter was fixed to n = 12. We performed four tests, each varying one of the parameters in Table 1 while fixing the remaining parameters to particular values. These results are shown in Table 3.
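Each of these synthetic runs begins by drawing a random generating network G(N, K) (Procedure 2, step 1). The sketch below (ours, not the authors' code) shows one simple way to draw such a bounded-in-degree DAG; it does not reproduce the paper's uniform-over-DAGs sampling or its lazy parametrization.

```python
# Minimal sketch (not from the paper's code) of one way to draw a random
# generating graph G(N, K): pick a random total order, then give each node
# a random number (at most K) of parents chosen from its predecessors.
import random
from typing import Dict, List

def random_dag(N: int, K: int, rng: random.Random) -> Dict[int, List[int]]:
    """Return a parent-list representation of a DAG on nodes 0..N-1."""
    order = list(range(N))
    rng.shuffle(order)
    parents: Dict[int, List[int]] = {}
    for pos, node in enumerate(order):
        num_parents = rng.randint(0, min(K, pos))
        parents[node] = rng.sample(order[:pos], num_parents)
    return parents

print(random_dag(N=6, K=3, rng=random.Random(0)))
```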
Due to space constraints we only report here the results for AMA versus GTT. However, we note that both GTT and AMA achieved on average higher performance than SNN for almost every experimental run performed. NMA performed comparably to AMA for small N_r, but performed worse than both GTT and AMA for large N_r. Later (Table 5) we present quantitative comparisons of all four techniques on real-world data.

Table 3: The performance of AMA versus GTT as several parameters are varied. The error terms indicate the 99% significance level. Q1 and Q3 denote the first and third quartiles, respectively.

  N    δ (%)       Q1    Q3        k  δ (%)        Q1    Q3
  5    16 ± 1.3    7.7   27        1  .322 ± .72   -4.7  13
  10   11 ± 1.5    4.0   23        2  9.05 ± .46   2.7   16
  20   7.1 ± 1.6   .14   17        3  10.5 ± .37   3.3   16
  40   12 ± 1.4    .75   23        4  10.9 ± .38   3.5   17
  80   14 ± 1.5    1.8   28        5  10.9 ± .43   3.8   17
  160  20 ± 4.1    .17   44        6  10.2 ± .45   3.1   16

  K    δ (%)       Q1    Q3        Nr    δ (%)        Q1    Q3
  2    7.6 ± 1.7   .43   13        25    9.4 ± 2.7    1.2   13
  4    14 ± 2.2    3.6   23        50    9.4 ± 2.5    1.6   16
  8    11 ± 2.1    2.4   18        100   11 ± 2.3     4.0   16
  16   7.2 ± 1.9   -.40  13        400   10 ± 2.2     3.6   16
  32   5.3 ± 1.8   -1.1  7.9       800   5.5 ± 3.0    -.30  13
  50   6.0 ± 1.6   -.50  9.7       3200  −7.4 ± 5.2   -26   13

The top-left quadrant of Table 3 shows the results of varying the number of nodes N while holding N_r = 100, k = 2, and K ←- {1, ..., N}. For all values of N using this configuration of parameters, AMA outperformed GTT at the 99% significance level, the difference generally increasing as N grew very large or very small. The minimum in this curve around N = 20 probably reflects a tradeoff: as N gets very small, bounding the in-degree of the networks to k = 2 comes closer and closer to the limiting case of k = N; whereas, generally, as the ratio N/N_r increases we expect model averaging to benefit over a single model.

In the bottom-left quadrant of Table 3, we varied the density of generating graphs K while holding N_r = 100, k = 2 and N = 50. Again, for this set of measurements, AMA always outperformed GTT at the 99% significance level. The results were most apparent when K ≲ 10, an encouraging result since it is a common belief that most real-world networks are sparse.

In the top-right quadrant of Table 3, we varied the maximum number of parents allowed in L_k from 1 ≤ k ≤ 6, while holding N_r = 100, N = 20, and K ←- {1, ..., N}. The value of δ is surprisingly insensitive to the value of k for k ≥ 2. Finally, in the bottom-right quadrant, we varied the number of records N_r while fixing N = 20, k = 3 and K ←- {1, ..., N}. As expected, for smaller values of N_r model averaging performs well versus GTT. As N_r grows, GTT eventually outperforms AMA at the 99% significance level. It was observed in other experiments that the ability of GTT to outperform AMA at high N_r depended strongly on the value of k. When it was computationally feasible to set k ≈ N, AMA typically would outperform GTT even at high N_r.

The performance of AMA was also tested by generating training and test data with the benchmark ALARM network. In this case, N = 36 and K = 4 were fixed by the network, and a test was performed with k = 3, n = 10, and N_r systematically varied. The results in Table 4 are shown for classification on the kinked tube and anaphylaxis nodes. These results are of interest because they demonstrate that the qualitative performance of the AMA classifier depends not just on global network features but also on features specific to the classification node.

Table 4: AMA performance vs. GTT on synthetic data generated using the ALARM network and classifying on kinked tube (kt) and anaphylaxis (an).

  Nr    δ_kt (%)    Q1_kt  Q3_kt   δ_an (%)     Q1_an  Q3_an
  50    32 ± 13     24     55      2.6 ± 3.2    -9.2   17
  100   23 ± 11     8.8    53      .66 ± 3.3    -11    16
  200   13 ± 9.0    -1.2   32      −2.6 ± 4.2   -17    16
  400   12 ± 6.9    -1.2   34      −3.3 ± 5.4   -21    18
  800   3.7 ± 6.9   -8.5   23      2.0 ± 5.0    -11    21
  3200  −.07 ± 14   -19    15      5.6 ± 7.1    -7.5   19

Finally, we performed measurements of the performance of AMA, SNN, GTT and NMA on 21 data sets taken from the UCI online database [Blake and Merz, 1998]. These results are shown in Table 5. Here the score δ_d^i for classifier C_i was calculated according to Procedure 2, where M_2 = C_i and M_1 was taken to be the maximum-scoring classifier for the data set d. For example, in the monks-2 database, AMA was the highest-scoring classifier and covered 48% of the remaining area for SNN and GTT and 21% of the remaining area for NMA. The ROC area will in general depend on which state of the classification variable is considered to be the "positive" state; therefore, the scores in Table 5 are averages over all ROC curves associated with a particular classification variable, and as a result some data sets (e.g., wine) have no zero entries when two or more classifiers score highest on different curves.

Table 5: Experimental results for 21 UCI data sets.

  Data set      δ_SNN  δ_GTT  δ_NMA  δ_AMA  N   k   Nr
  haberman      0.35   0.35   0      0      4   4   306
  hayes-roth    0.32   0.32   0      0.01   6   6   132
  monks-3       0.83   0.24   0.82   0      7   7   552
  monks-1       0.98   0      0.98   0      7   7   554
  monks-2       0.48   0.48   0.21   0      7   7   600
  chess krk     0.54   0      0.54   0.32   7   7   28055
  ecoli         0.03   0.01   0.02   0      8   8   336
  yeast         0.04   0.11   0.04   0.07   8   8   1484
  abalone       0.12   0.08   0.05   0      9   9   4176
  cpu-perf      0.13   0.31   0.01   0.11   10  10  209
  glass         0.10   0.04   0.15   0.13   10  10  214
  cmc           0.01   0.07   0.01   0.04   10  10  1473
  sol-flare-C   0.03   0.09   0.02   0.01   11  11  322
  sol-flare-M   0      0.44   0.17   0.20   11  11  322
  sol-flare-X   0.06   0.01   0.18   0.33   11  11  322
  wine          0.14   0.01   0.16   0.06   14  7   177
  credit-scrn   0      0.12   0.09   0.02   16  5   652
  letter-rec    0.38   0      0.38   0.01   17  5   20000
  thyroid       0.17   0.28   0      0.11   21  5   7200
  brst-canc-w   0.28   0      0.29   0.23   32  3   569
  connect-4     0.49   0      0.49   0.52   43  2   67557
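The δ scores reported in Table 5 above combine a per-state ROC area with the relative metric of Procedure 2(d). The sketch below (ours, not the authors' code, with hypothetical toy scores) shows that computation for a single "positive" class state.

```python
# Sketch (ours) of the scoring used above: a ROC area for one "positive"
# class state and the metric delta = (R1 - R2) / (T - R2) from
# Procedure 2(d), with T = 1 for normalised axes.
import numpy as np

def roc_area(y_true, scores):
    """AUC as the probability that a random positive case outranks a random
    negative case (ties counted as 1/2)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def delta(r1, r2, t=1.0):
    """Fraction of M2's missing ROC area captured by M1."""
    return (r1 - r2) / (t - r2)

# Toy example with hypothetical classifier scores for the positive state.
y = np.array([1, 0, 1, 1, 0, 0])
r_best = roc_area(y, np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.1]))
r_other = roc_area(y, np.array([0.6, 0.5, 0.7, 0.4, 0.8, 0.3]))
print(delta(r_best, r_other))
```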
We have underlined the top two scoring classifiers for each data set to emphasize the fact that AMA was typically more robust on these data sets than the other classifiers, scoring in the top two 15/21 times, compared to 7/21, 10/21, and 11/21 for SNN, GTT and NMA, respectively. The average difference ∆^i between classifier i and AMA,

    ∆^i ≡ \frac{1}{21} \sum_d (δ_d^{AMA} − δ_d^i),

was calculated to gauge the statistical significance of these experiments. The results were: ∆^{SNN} = 15.8 ± 6.7% (significant at the 99% level), ∆^{GTT} = 3.7 ± 5.3% (significant only at the 75% level), and ∆^{NMA} = 11.7 ± 6.3% (significant at the 95% level). These results are promising, but need to be extended to further investigate the performance of AMA.

5 Discussion

We have shown that it is possible to construct a single DAG model that will perform linear-time approximate model averaging over the L^n_k(π | D) class of models. We have demonstrated empirically that, even with relatively little effort in choosing a good partial ordering π, classifications obtained by model averaging over L^n_k(π | D) can be beneficial compared to other BN classifiers. The benefits of AMA were not without cost; although comprehensive measurements were not taken, it was observed that the initial model construction typically took 3-10 times longer for AMA than the greedy search took to converge, for the values of k and n used in our experiments.

The AMA technique is interesting because of its simplicity of implementation. Existing systems that use Bayesian network classifiers can trivially be adapted to use model averaging by replacing their existing model with a single summary model.

Future work includes finding a better method for optimizing the ordering π, possibly by doing a search over orderings as in [Friedman and Koller, 2000]. Also, our experimental results should be expanded to more extensively characterize the performance of AMA for classification on real-world data. Another possible extension is to relax the assumption of complete data, possibly by using the EM algorithm or MCMC sampling to estimate Equation 14 from data.

6 Acknowledgements

This work was supported in part by grant number S99-GSRP-085 from the National Aeronautics and Space Administration under the Graduate Students Research Program, by grant IIS-9812021 from the National Science Foundation, and by grant LM06696 from the National Library of Medicine.

References

[Blake and Merz, 1998] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

[Buntine, 1991] W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-91), pages 52-60, San Mateo, California, 1991. Morgan Kaufmann Publishers.

[Cooper and Herskovits, 1992] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309-347, 1992.

[Dash and Cooper, 2002] Denver Dash and Gregory F. Cooper. Exact model averaging with naive Bayesian classifiers. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), to appear, 2002.

[Domingos and Pazzani, 1997] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130, 1997.

[Friedman and Koller, 2000] Nir Friedman and Daphne Koller. Being Bayesian about network structure. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 201-210, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[Friedman et al., 1997] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.

[Heckerman et al., 1995] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.

[Madigan and Raftery, 1994] David Madigan and Adrian E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89:1535-1546, 1994.

[Madigan and York, 1995] David Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215-232, 1995.

[Meila and Jaakkola, 2000] Marina Meila and Tommi S. Jaakkola. Tractable Bayesian learning of tree belief networks. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 380-388, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[Spirtes et al., 1993] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Springer Verlag, New York, 1993.

[Verma and Pearl, 1991] T.S. Verma and Judea Pearl. Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence 6, pages 255-269. Elsevier Science Publishing Company, Inc., New York, NY, 1991.

[Volinsky, 1997] C.T. Volinsky. Bayesian Model Averaging for Censored Survival Models. PhD dissertation, University of Washington, 1997.
