    = ∫ θ_iJK · P(θ | S, D) dθ

The parametrization

    θ̂_ijk = (α_ijk + N_ijk) / (α_ij + N_ij)                (3)

will produce predictions equivalent to those obtained by averaging over all parameter configurations. We refer to θ̂_ijk as the standard parametrization.
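For concreteness, the following is a minimal sketch of computing the standard parametrization of Equation 3 from the Dirichlet hyperparameters and data counts. The nested-list layout and all identifiers are illustrative assumptions, not the paper's implementation.

    # Minimal sketch of the standard parametrization (Equation 3).
    # Assumes hyperparameters alpha[i][j][k] and data counts n[i][j][k] are
    # available for node i, parent configuration j, and value k; the layout
    # and names here are illustrative, not taken from the paper.

    def standard_parametrization(alpha, n):
        """Return theta_hat[i][j][k] = (alpha_ijk + N_ijk) / (alpha_ij + N_ij)."""
        theta_hat = []
        for alpha_i, n_i in zip(alpha, n):
            theta_i = []
            for alpha_ij, n_ij in zip(alpha_i, n_i):
                denom = sum(alpha_ij) + sum(n_ij)          # alpha_ij + N_ij
                theta_i.append([(a + c) / denom for a, c in zip(alpha_ij, n_ij)])
            theta_hat.append(theta_i)
        return theta_hat

    # Example: one binary node with a single (empty) parent configuration,
    # K2 prior (alpha_ijk = 1) and counts [3, 7] gives [[[0.333..., 0.666...]]].
    print(standard_parametrization([[[1, 1]]], [[[3, 7]]]))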
3.2 Averaging Structural Features with a Fixed Ordering

The decomposition by Buntine used by Friedman and Koller was a dynamic programming solution which calculated, with relative efficiency, the posterior probability of structural features such as P(X_L → X_M | D). Given Assumption 1 and Equation 6, Equation 5 can be written as:

    P(X_L → X_M | D) = κ · Σ_S Π_{i=0..N} ρ_iLM                (7)

where the ρ_iLM functions are given by:

    ρ_iLM = δ[M ≠ i ∨ X_L ∈ P_i] · p_s(X_i, P_i)
            · Π_{j=1..q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] · Π_{k=1..r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)]                (8)
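The Gamma-function products in Equation 8 are the familiar Dirichlet-multinomial local marginal-likelihood term for node X_i under one candidate parent set. Below is a minimal log-space sketch of that factor alone (the δ[...] and p_s(X_i, P_i) factors of Equation 8 are omitted); the count layout and names are illustrative assumptions.

    import math

    def log_local_score(alpha_i, n_i):
        """Log of  prod_j [ Gamma(a_ij)/Gamma(a_ij + N_ij)
                            * prod_k Gamma(a_ijk + N_ijk)/Gamma(a_ijk) ]
        for one node under one candidate parent set (the Gamma factors of
        Equation 8).  alpha_i[j][k] and n_i[j][k] index parent configuration j
        and node value k; the layout is an illustrative assumption."""
        log_score = 0.0
        for alpha_ij, n_ij in zip(alpha_i, n_i):
            a_ij, big_n_ij = sum(alpha_ij), sum(n_ij)
            log_score += math.lgamma(a_ij) - math.lgamma(a_ij + big_n_ij)
            for a, c in zip(alpha_ij, n_ij):
                log_score += math.lgamma(a + c) - math.lgamma(a)
        return log_score

    # K2 prior, one parent configuration, counts [3, 7]:
    print(log_local_score([[1, 1]], [[3, 7]]))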
Now define:

    Σ^LM_m ≡ ρ_00LM · ρ_01LM · … · ρ_0mLM
           + ρ_10LM · ρ_01LM · … · ρ_0mLM
           + ⋮
           + ρ_{μ_0}0LM · ρ_{μ_1}1LM · … · ρ_{μ_m}mLM

Using this notation, the following recursion relation can be derived:

    Σ^LM_i = Σ^LM_{i−1} · ( Σ_{ν=1..μ_i} ρ_νiLM ),        Σ^LM_{−1} = 1

Finally, expanding out the recurrence relation yields the expression for P(X_L → X_M | D):
    P(X_L → X_M | D) = κ · Π_{i=0..N} ( Σ_{ν=1..μ_i} ρ_νiLM )                (9)

Once the ρ_νiLM terms are calculated, the right-hand side of Equation 9 can be evaluated in O(N · C(N, k)) time, where C(N, k) denotes the binomial coefficient. Calculating the ρ terms themselves requires the calculation of one hyperparameter α_νijk and one sufficient statistic N_νijk for each parameter of the network, each of which can be calculated in O(N_r) time. Calculating the complete set for all (j, k) combinations thus requires O(C(N, k) · N · N_r · N_f^k) time and O(C(N, k) · N · N_f^k) space (because there are O(N_f^k) network parameters per node).

We have taken the trouble to subscript the indices J and K from Equation 1. Both are indexed with an x to indicate that they are fixed by the particular configuration of X, and J is indexed by S to emphasize that the value of the parent-configuration index depends on the structure of the network. Although this notation may seem cumbersome, we hope it clarifies the analysis later.

The ρ̂_iJ^S_xK_x functions again can be calculated using only information local to node X_i and P_i. Following a derivation identical to that for averaging P(X_L → X_M | D) in Section 3.2 yields the result:

    P(X = x | D) = κ · Π_{i=0..N} ( Σ_{ν=1..μ_i} ρ̂_νiJ^ν_xK_x )                (13)

Here the S index has been replaced with a ν indicating which parent set for node X_i is being considered. Once again, this summation can be performed in O(C(N, k) · N · N_r · N_f^k) time and O(C(N, k) · N · N_f^k) space.
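Equations 9 and 13 share the same product-of-per-node-sums form, so a single routine can evaluate either once the ρ (or ρ̂) terms are available. The sketch below works in log space to avoid underflow and leaves out the normalizing constant κ, which can be recovered afterwards by normalizing over the mutually exclusive alternatives (for example, the states of the class variable); the inputs and names are illustrative assumptions.

    import math

    def logsumexp(values):
        m = max(values)
        return m + math.log(sum(math.exp(v - m) for v in values))

    def log_unnormalized_posterior(log_rho_per_node):
        """Compute log( prod_i sum_nu rho_{nu,i} ), the product-of-sums form
        shared by Equations 9 and 13.  log_rho_per_node[i] is the list of
        log-rho values for the candidate parent sets of node i."""
        return sum(logsumexp(log_rhos) for log_rhos in log_rho_per_node)

    # Two nodes with two candidate parent sets each (illustrative log values):
    print(log_unnormalized_posterior([[-1.0, -2.0], [-0.5, -3.0]]))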
3.4 Approximate Model Averaging

The derivation in the preceding section is an extension of the concept underlying Buntine's dynamic programming solution. The functional form of the above solution allows us to easily prove the following theorem:

Theorem 2  There exists a single Bayesian network model M* = ⟨S*, θ*⟩ that will produce predictions equivalent to those produced by model averaging over all models in L_k(π).

Proof: Let S* be defined so that each node X_i has the parent set P*_i = ∪_{ν=1..μ_i} P^ν_i, and let θ* be defined by:

    θ*_ijk = (1 / κ^{1/N}) · Σ_{ν=1..μ_i} ρ̂_νiJ^ν_jk                (14)

where the x subscript for J^ν_x has now been replaced with a j and K_x has been replaced with a k subscript, because we are now considering a particular coordinate (i, j, k). It can be seen by direct comparison that the single-network prediction using M* and Equation 1 will yield Equation 13. □
If we define functions f(X_i, P^ν_i | D) such that f(X_i, P^ν_i | D) = ρ̂_νiJ^ν_jk / θ̂_νiJ^ν_jk, then Equation 14 can be written as:

    θ*_ijk = (1 / κ^{1/N}) · Σ_{ν=1..μ_i} θ̂_νiJ^ν_jk · f(X_i, P^ν_i | D)                (15)

The functions f(X_i, P^ν_i | D) do not depend on the indices J^ν_j and k, and they are proportional to the local contribution of the posterior probability that the parent set of X_i is in fact P^ν_i. Equation 15 thus provides the interpretation that M* represents a structure-based smoothing in which each standard parameter θ̂_νijk (the standard parametrization of X_i under candidate parent set P^ν_i) is weighted according to the likelihood that P^ν_i is the true parent set of X_i.
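To illustrate this structure-based smoothing reading of Equation 15, the sketch below assembles one CPT entry of the summary model M* as a weighted combination of the standard parametrizations obtained under each candidate parent set. The data structures (configurations as dictionaries, candidates as triples of parent set, weight, and θ̂ table) and the handling of the global κ^{1/N} factor as a caller-supplied argument are illustrative assumptions, not the paper's implementation.

    def summary_cpt_row(full_config, value, candidates, kappa_root=1.0):
        """Structure-based smoothing (Equation 15) for one entry of node X_i's
        CPT in the summary model M*.  `full_config` maps each parent in the
        union parent set P*_i to a value; `candidates` is a list of
        (parent_set, weight, theta_hat) triples, where weight plays the role of
        f(X_i, P_i^nu | D) and theta_hat maps (restricted_config, value) to the
        standard-parametrization probability.  All names are illustrative."""
        total = 0.0
        for parent_set, weight, theta_hat in candidates:
            # Project the union-parent-set configuration down to this candidate.
            restricted = tuple(sorted((p, full_config[p]) for p in parent_set))
            total += weight * theta_hat[(restricted, value)]
        return total / kappa_root

    # Two candidate parent sets for X_i: {} and {"A"}; weights stand in for f.
    candidates = [
        (frozenset(),      0.3, {((), 0): 0.6, ((), 1): 0.4}),
        (frozenset({"A"}), 0.7, {((("A", 0),), 0): 0.8, ((("A", 0),), 1): 0.2,
                                 ((("A", 1),), 0): 0.3, ((("A", 1),), 1): 0.7}),
    ]
    print(summary_cpt_row({"A": 1}, 1, candidates))   # 0.3*0.4 + 0.7*0.7 = 0.61

Note that the individual rows produced this way need not sum to one; as in the proof of Theorem 2, the overall normalization is absorbed by the κ factor.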
Theorem 2 implies that, rather than performing the O(C(N, k) · N) summation in Equation 13 for each case to be classified, in principle we need only construct a single model M* and use standard O(N) Bayesian network inference for each case.

A serious practical difficulty is that this approach requires, in the worst case, the construction of a completely connected Bayesian network and is thus intractable for reasonable N. An obvious pruning strategy, however, is to truncate the sum in Equation 14 to include no more than n parents. If we reorder the possible parent sets for node X_i as OP ≡ {P^1_i, …, P^{μ_i}_i} such that f(X_i, P^ν_i | D) > f(X_i, P^λ_i | D) only if ν < λ, then a reasonable approximation for P*_i can be constructed by the following procedure (sketched in code below):

Procedure 1 (Approximate P*_i construction)
Given: n and OP.
1. Let P*_i = ∅.
2. For ν = 1 to μ_i: if |P*_i ∪ P^ν_i| ≤ n, let P*_i = P*_i ∪ P^ν_i; else continue.
We denote the class of structures being averaged over using this procedure as L_k^n(π | D), and we call the method Approximate Model Averaging (AMA). Obviously lim_{n→N} L_k^n(π | D) = L_k(π). Furthermore, we empirically show in Section 4 that the loss in ROC area, ε, due to this approximation for n ≥ 10 lies around −0.6% ≤ ε ≤ 0.6% with 99% confidence for N ≤ 100 and for a wide range of other parameters.

4 Experimental Tests

In this section we describe several experimental investigations that were designed to test the performance of AMA on distributions that do not necessarily fall into L_k(π). We first generate synthetic data to allow us to more extensively vary parameters and to generate statistically significant results, then we perform tests on several real-world machine learning data sets.

4.1 Experimental Setup

There are at least five parameters for which we sought to characterize the performance of AMA predictions: the number of nodes N, the approximation limit n on the number of parents in the summary network, the maximum in-degree ("density") K of the generating network, the maximum number of parents k allowed in L_k(π), and the number of records N_r. It is beyond the scope of this paper to present a comprehensive comparison over this full five-dimensional space; however, here we sample their settings to provide insight into the dependence of the results on these parameters.
We compared AMA to three other Bayesian network classifiers: a single naive network (SNN) with the standard parametrization [Domingos and Pazzani, 1997], a model that model-averaged over all naive structures (NMA) [Dash and Cooper, 2002], and a non-restricted two-stage thick-thin greedy search (GTT) over the space of DAGs. GTT starts with an empty graph and repeatedly adds the arc (without creating a cycle) that maximally increases the marginal likelihood P(D | S) until no arc addition will result in a positive increase, then it repeatedly removes arcs until no arc deletion will result in a positive increase in P(D | S). GTT assumes no ordering on the nodes.
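For reference, the following is a minimal sketch of such a two-stage thick-thin search, under stated assumptions: the structure score (for example, log P(D | S) assembled from local terms like the one sketched after Equation 8) is supplied by the caller as a callable, a DAG is represented as a dict mapping each node to its parent set, and all identifiers are illustrative rather than taken from the paper's implementation.

    from itertools import permutations

    def creates_cycle(dag, parent, child):
        """Return True if adding parent -> child would create a directed cycle.
        dag maps each node to the set of its parents."""
        stack, seen = [parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(dag[node])      # walk back through the ancestors of `parent`
        return False

    def greedy_thick_thin(nodes, score):
        """Two-stage thick-thin greedy search over DAGs, as described above.
        `score(dag)` should return the (log) marginal likelihood of the
        candidate structure; supplying it is left to the caller."""
        dag = {v: set() for v in nodes}
        current = score(dag)

        def best_move(moves):                # moves are (apply, undo) pairs
            best = None
            for apply_move, undo_move in moves:
                apply_move()
                s = score(dag)
                undo_move()
                if s > current and (best is None or s > best[0]):
                    best = (s, apply_move)
            return best

        # Thickening: add the single best arc until no addition improves the score.
        while True:
            moves = [(lambda a=a, b=b: dag[b].add(a), lambda a=a, b=b: dag[b].discard(a))
                     for a, b in permutations(nodes, 2)
                     if a not in dag[b] and not creates_cycle(dag, a, b)]
            best = best_move(moves)
            if best is None:
                break
            current = best[0]; best[1]()
        # Thinning: remove the single best arc until no deletion improves the score.
        while True:
            moves = [(lambda a=a, b=b: dag[b].discard(a), lambda a=a, b=b: dag[b].add(a))
                     for b in nodes for a in list(dag[b])]
            best = best_move(moves)
            if best is None:
                break
            current = best[0]; best[1]()
        return dag

    # Toy usage with a stand-in score that rewards one specific arc:
    toy = lambda dag: 1.0 if "A" in dag["B"] else 0.0
    print(greedy_thick_thin(["A", "B"], toy))   # {'A': set(), 'B': {'A'}}

Rebuilding and rescoring the full move list on every iteration is deliberately naive; a practical implementation would exploit score decomposability and rescore only the family affected by each move.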
All abbreviations and symbols are summarized in Table 1 as a reference for the reader.

Symbol   Description
N        Number of nodes
N_r      Number of training records
K        Maximum in-degree of generating graphs
k        Maximum in-degree in L_k
n        Maximum in-degree in summary MA network
SNN      Single naive network
NMA      Naive model averaging
GTT      Greedy thick-thin
AMA      Approximate model averaging

Table 1: Table of symbols.

The inner loop of each test performed basically the same procedure: given the five parameters {N, N_r, K, n, k} and a total number of trials N_trials, we did the following:

Procedure 2 (Basic testing loop)
Given: N, N_r, N_test, K, n, k, and N_trials.
Do:
1. Generate N_trials random graphs G(N, K).
2. For each graph G(N, K) do:
   (a) Generate N_r training records and N_test test records.
   (b) Train the two classifiers to be compared, M1 (typically the AMA classifier) and M2 (the classifier it is compared against), on the training records.
   (c) Test M1 and M2 on the test data, measuring the ROC areas R1 and R2, respectively, of each.
   (d) Calculate the quantity δ = (R1 − R2) / (T − R2), where T is the ROC area of a perfect classifier (i.e., T = 1).
3. Average δ over all N_trials.

The performance metric δ indicates what percentage of M2's missing ROC area is captured by M1.
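The δ metric of step 2(d) and its averaging in step 3 reduce to a few lines; the sketch below takes the ROC areas R1 and R2 as given numbers (in the actual procedure they come from the trained classifiers M1 and M2), and the example values are purely illustrative.

    def delta(r1, r2, t=1.0):
        """Performance metric from Procedure 2: the fraction of M2's missing
        ROC area (relative to a perfect classifier with area t) captured by M1."""
        return (r1 - r2) / (t - r2)

    def average_delta(roc_pairs):
        """Average delta over the (R1, R2) pairs produced by the N_trials runs."""
        return sum(delta(r1, r2) for r1, r2 in roc_pairs) / len(roc_pairs)

    # Three illustrative trials:
    print(average_delta([(0.93, 0.90), (0.88, 0.85), (0.97, 0.94)]))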
For some experiments it was necessary to generate networks randomly from a uniform distribution over DAGs with fixed K. We employed a lazy data generation procedure whereby node conditional probability distributions were generated only when they were required by the sampling, a technique which allows generation of data for arbitrarily dense networks.
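A minimal sketch of this lazy generation idea: conditional-distribution rows are drawn (here from a uniform Dirichlet, an illustrative choice) only when a parent configuration is first encountered during forward sampling, so dense generating graphs never require materializing full conditional probability tables. The graph representation and helper names are assumptions for illustration.

    import random

    def topological_order(parents_of):
        """Return the nodes of a DAG in an order consistent with parenthood."""
        order, placed = [], set()
        while len(order) < len(parents_of):
            for node, parents in parents_of.items():
                if node not in placed and set(parents) <= placed:
                    order.append(node)
                    placed.add(node)
        return order

    def sample_records(parents_of, cardinality, num_records, seed=0):
        """Forward-sample `num_records` records from a DAG whose conditional
        distributions are generated lazily: a row is drawn from a
        Dirichlet(1, ..., 1) only the first time its parent configuration is
        actually needed."""
        rng = random.Random(seed)
        rows = {}                            # (node, parent_config) -> distribution

        def row(node, parent_config):
            key = (node, parent_config)
            if key not in rows:
                draws = [rng.gammavariate(1.0, 1.0) for _ in range(cardinality[node])]
                total = sum(draws)
                rows[key] = [d / total for d in draws]
            return rows[key]

        order = topological_order(parents_of)
        records = []
        for _ in range(num_records):
            values = {}
            for node in order:
                config = tuple(values[p] for p in parents_of[node])  # parents listed in fixed order
                dist = row(node, config)
                values[node] = rng.choices(range(cardinality[node]), weights=dist)[0]
            records.append(values)
        return records

    # A three-node chain A -> B -> C with binary variables:
    parents = {"A": [], "B": ["A"], "C": ["B"]}
    print(sample_records(parents, {"A": 2, "B": 2, "C": 2}, num_records=5))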
In all cases we assume a uniform prior over non-forbidden structures and thus allow p_s(X_i, P_i) = 1 for all i. We also adopted the K2 parameter prior [Cooper and Herskovits, 1992], which sets α_ijk = 1 for all (i, j, k). This criterion has the property of weighting all local distributions of parameters uniformly. All variables in our synthetic tests were binary: N_c = N_f = 2, and for all experiments N_test = 1000. In many experiments we sampled the density of generating graphs K uniformly from the set {1, 2, …, N}; we use the notation K ← {1, …, N} to denote this procedure.

In all experiments, π was chosen to be a fixed total ordering of the variables. At least three heuristics were used to generate π: (1) generate a random ordering, (2) generate two opposite random orderings and average the predictions of each, and (3) use a topological sort of the graph obtained by GTT. These methods produced comparable results, but (2) and (3) performed a few percent better. In all experiments performed below, method (3) was used to generate π.
4.2 Experimental Results

The first experiment tested the degree of error incurred by model averaging over the class L_k^n(π | D) instead of the full class L_k(π). The degree of error was measured in terms of the percent difference in ROC areas, Δ, using k = 1, N_r = 100, N_trials = 40, N varied over the values {15, 20, 30, 40, 50, 60, 80, 100}, n varied over the values {5, 7, 10, 12, 15}, and K ← {1, …, N}. The compiled results are shown in Table 2a–c. In this and all subsequent tables, the error bars denote the 99% confidence interval of the mean.

  (a)                      (b)                       (c)
   n    Δ (%)               N     Δ (%)               k    Δ (%)
   1    15 ± 2.4            20    .023 ± .067         1    .032 ± .09
   5    .74 ± .55           30    .0016 ± .18         2    .036 ± .11
   7    .24 ± .42           40    .036 ± .22          3    .093 ± .23
   10   .15 ± .29           60    .10 ± .22           4    .042 ± .32
   12   −.01 ± .36          80    .17 ± .29
   15   .12 ± .14           100   .23 ± .32

Table 2: The percent difference (Δ) in ROC area between model averaging over L_k(π) and model averaging over L_k^n(π | D) as various parameters are varied, with 99% confidence intervals.

Table 2a shows the dependence on n averaged over all values of N, showing that for an approximation level n ≳ 10 the difference in ROC area between the exact MA and AMA is less than 0.5% with confidence P > 0.99. Table 2b shows the dependence on N averaged over the values of n ∈ {10, 12, 15}. This table shows that with n set to reasonable values the approximation error is bounded under 0.6% with 99% confidence up to N = 100. Finally, one might expect the approximation error to increase as k is increased, since for a fixed n, L_k(π) includes increasingly more structures than L_k^n(π | D) as k increases. Table 2c shows the results of varying k in an experiment with N = 20, N_r = 100, K ← {1, …, N}, n = 12 and k varied from 1 to 4. For all values of k the error in terms of the difference in ROC area is below 0.4% with 99% confidence.
Next, we generated synthetic data to test the performance of AMA relative to the other methods. For all tests, the approximation parameter was fixed to n = 12. We performed four tests, each varying one of the parameters in Table 1 while fixing the remaining parameters to particular values. These results are shown in Table 3.

Due to space constraints we only report here the results for AMA versus GTT. However, we note that both GTT and AMA achieved on average higher performance than SNN for almost every experimental run performed. NMA performed comparably to AMA for small N_r, but performed worse than both GTT and AMA for large N_r. Later (Table 5) we present quantitative comparisons of all four techniques on real-world data.
   N    δ (%)        Q1    Q3         k     δ (%)        Q1    Q3
   5    16 ± 1.3     7.7   27         1     .322 ± .72   -4.7  13
   10   11 ± 1.5     4.0   23         2     9.05 ± .46   2.7   16
   20   7.1 ± 1.6    .14   17         3     10.5 ± .37   3.3   16
   40   12 ± 1.4     .75   23         4     10.9 ± .38   3.5   17
   80   14 ± 1.5     1.8   28         5     10.9 ± .43   3.8   17
   160  20 ± 4.1     .17   44         6     10.2 ± .45   3.1   16

   K    δ (%)        Q1    Q3         N_r   δ (%)        Q1    Q3
   2    7.6 ± 1.7    .43   13         25    9.4 ± 2.7    1.2   13
   4    14 ± 2.2     3.6   23         50    9.4 ± 2.5    1.6   16
   8    11 ± 2.1     2.4   18         100   11 ± 2.3     4.0   16
   16   7.2 ± 1.9    -.40  13         400   10 ± 2.2     3.6   16
   32   5.3 ± 1.8    -1.1  7.9        800   5.5 ± 3.0    -.30  13
   50   6.0 ± 1.6    -.50  9.7        3200  −7.4 ± 5.2   -26   13

Table 3: The performance of AMA versus GTT as several parameters are varied. The error terms indicate the 99% significance level. Q1 and Q3 denote the first and third quartiles, respectively.

The top-left quadrant of Table 3 shows the results of varying the number of nodes N, while holding N_r = 100, k = 2, and with K ← {1, …, N}. For all values of N using this configuration of parameters, AMA outperformed GTT at the 99% significance level, the difference generally increasing as N grew very large or very small. Probably the minimum in this curve around N = 20 reflects the tradeoff that as N gets very small, bounding the in-degree of the networks to k = 2 comes closer and closer to the limiting case of k = N; whereas, generally, as the ratio N/N_r increases we expect model averaging to benefit over a single model.

In the bottom-left quadrant of Table 3, we varied the density of generating graphs K while holding N_r = 100, k = 2 and N = 50. Again, for this set of measurements, AMA always outperformed GTT at the 99% significance level. The results were most apparent when K ≲ 10, an encouraging result since it is a common belief that most real-world networks are sparse.

In the top-right of Table 3, we varied the maximum number of parents allowed in L_k from 1 ≤ k ≤ 6, while holding N_r = 100, N = 20, and K ← {1, …, N}. The value of δ is surprisingly insensitive to the value of k for k ≥ 2. Finally, in the bottom-right, we varied the number of records N_r while fixing N = 20, k = 3 and K ← {1, …, N}. As expected, for smaller values of N_r model averaging performs well versus GTT. As N_r grows, GTT eventually outperforms AMA at the 99% significance level. It was observed in other experiments that the ability of GTT to outperform AMA at high N_r depended strongly on the value of k. When it was computationally feasible to set k ≈ N, AMA typically would outperform GTT even at high N_r.
The performance of AMA was also tested by generating training and test data with the benchmark ALARM network. In this case, N = 36 and K = 4 were fixed by the network, and a test was performed with k = 3, n = 10, and N_r systematically varied. The results in Table 4 are shown for classification on the kinked tube and anaphylaxis nodes. These results are of interest because they demonstrate that the qualitative performance of the AMA classifier depends not just on global network features but also on features specific to the classification node.

   N_r   δ_kt (%)     Q1_kt  Q3_kt    δ_an (%)     Q1_an  Q3_an
   50    32 ± 13      24     55       2.6 ± 3.2    -9.2   17
   100   23 ± 11      8.8    53       .66 ± 3.3    -11    16
   200   13 ± 9.0     -1.2   32       −2.6 ± 4.2   -17    16
   400   12 ± 6.9     -1.2   34       −3.3 ± 5.4   -21    18
   800   3.7 ± 6.9    -8.5   23       2.0 ± 5.0    -11    21
   3200  −.07 ± 14    -19    15       5.6 ± 7.1    -7.5   19

Table 4: AMA performance vs. GTT on synthetic data generated using the ALARM network and classifying on kinked tube (kt) and anaphylaxis (an).

Finally, we performed measurements of the performance of AMA, SNN, GTT and NMA on 21 data sets taken from the UCI online database [Blake and Merz, 1998]. These results are shown in Table 5. Here the score δ_d^i for classifier C_i was calculated according to Procedure 2, where M2 = C_i and M1 was taken to be the maximum scoring classifier for the data set d. For example, in the monks-2 database, AMA was the highest scoring classifier and covered 48% of the remaining area for SNN and GTT and 21% of the remaining area for NMA. The ROC area will in general depend on which state of the classification variable is considered to be the "positive" state; therefore, the scores in Table 5 are average scores over all ROC curves associated with a particular classification variable, and thus some data sets (e.g., wine) have no zero entries when two or more classifiers score highest on different curves.

   Data set      δ_SNN  δ_GTT  δ_NMA  δ_AMA   N   k   N_r
   haberman      0.35   0.35   0      0       4   4   306
   hayes-roth    0.32   0.32   0      0.01    6   6   132
   monks-3       0.83   0.24   0.82   0       7   7   552
   monks-1       0.98   0      0.98   0       7   7   554
   monks-2       0.48   0.48   0.21   0       7   7   600
   chess krk     0.54   0      0.54   0.32    7   7   28055
   ecoli         0.03   0.01   0.02   0       8   8   336
   yeast         0.04   0.11   0.04   0.07    8   8   1484
   abalone       0.12   0.08   0.05   0       9   9   4176
   cpu-perf      0.13   0.31   0.01   0.11    10  10  209
   glass         0.10   0.04   0.15   0.13    10  10  214
   cmc           0.01   0.07   0.01   0.04    10  10  1473
   sol-flare-C   0.03   0.09   0.02   0.01    11  11  322
   sol-flare-M   0      0.44   0.17   0.20    11  11  322
   sol-flare-X   0.06   0.01   0.18   0.33    11  11  322
   wine          0.14   0.01   0.16   0.06    14  7   177
   credit-scrn   0      0.12   0.09   0.02    16  5   652
   letter-rec    0.38   0      0.38   0.01    17  5   20000
   thyroid       0.17   0.28   0      0.11    21  5   7200
   brst-canc-w   0.28   0      0.29   0.23    32  3   569
   connect-4     0.49   0      0.49   0.52    43  2   67557

Table 5: Experimental results for 21 UCI data sets.
Considering the top two scoring classifiers for each data set (those with the two lowest δ values in a row), AMA was typically more robust on these data sets than the other classifiers, scoring in the top two 15/21 times compared to 7/21, 10/21, and 11/21 for SNN, GTT and NMA, respectively. The average difference Δ^i between classifier i and AMA, Δ^i ≡ (1/21) Σ_d (δ_d^i − δ_d^AMA), was calculated to gauge the statistical significance of these experiments. The results were: Δ^SNN = 15.8 ± 6.7% (significant at the 99% level), Δ^GTT = 3.7 ± 5.3% (significant only at the 75% level), and Δ^NMA = 11.7 ± 6.3% (significant at the 95% level). These results are promising, but need to be extended to further investigate the performance of AMA.
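The average difference Δ^i reduces to a paired mean over the per-data-set scores; the sketch below computes just that mean (the significance levels quoted above require an additional test that is not reproduced here), using a few values from Table 5 as the example.

    def average_difference(delta_i, delta_ama):
        """Delta^i = (1/D) * sum_d (delta_d^i - delta_d^AMA): the mean advantage
        of AMA over classifier i across the D data sets."""
        assert len(delta_i) == len(delta_ama)
        return sum(a - b for a, b in zip(delta_i, delta_ama)) / len(delta_i)

    # First three rows of Table 5 for SNN vs. AMA (haberman, hayes-roth, monks-3):
    print(average_difference([0.35, 0.32, 0.83], [0, 0.01, 0]))   # ~0.497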
5 Discussion

We have shown that it is possible to construct a single DAG model that will perform linear-time approximate model averaging over the L_k^n(π | D) class of models. We have demonstrated empirically that even with relatively little effort in choosing a good partial ordering π, classifications obtained by model averaging over L_k^n(π | D) can be beneficial compared to other BN classifiers. The benefits of AMA were not without cost; although comprehensive measurements were not taken, it was observed that the initial model construction time for AMA was typically 3-10 times longer than the time it took for the greedy search to converge, for the values of k and n used in our experiments.

The AMA technique is interesting because of its simplicity of implementation. Existing systems that use Bayesian network classifiers can trivially be adapted to use model averaging by replacing their existing model with a single summary model.

Future work includes finding a better method for optimizing the ordering π, possibly by doing a search over orderings as in [Friedman and Koller, 2000]. Also, our experimental results should be expanded to more extensively characterize the performance of AMA for classification on real-world data. Another possible extension is to relax the assumption of complete data, possibly by using the EM algorithm or MCMC sampling to estimate Equation 14 from data.
6 Acknowledgements

This work was supported in part by grant number S99-GSRP-085 from the National Aeronautics and Space Administration under the Graduate Students Research Program, by grant IIS-9812021 from the National Science Foundation, and by grant LM06696 from the National Library of Medicine.

References

[Buntine, 1991] W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-91), pages 52–60, San Mateo, California, 1991. Morgan Kaufmann Publishers.

[Cooper and Herskovits, 1992] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

[Dash and Cooper, 2002] Denver Dash and Gregory F. Cooper. Exact model averaging with naive Bayesian classifiers. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), to appear, 2002.

[Domingos and Pazzani, 1997] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.

[Friedman and Koller, 2000] Nir Friedman and Daphne Koller. Being Bayesian about network structure. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 201–210, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[Friedman et al., 1997] Nir Friedman, Dan Geiger, Moises Goldszmidt, G. Provan, P. Langley, and P. Smyth. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.

[Heckerman et al., 1995] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

[Madigan and Raftery, 1994] David Madigan and Adrian E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89:1535–1546, 1994.

[Madigan and York, 1995] David Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.

[Meila and Jaakkola, 2000] Marina Meila and Tommi S. Jaakkola. Tractable Bayesian learning of tree belief networks. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 380–388, San Francisco, CA, 2000. Morgan Kaufmann Publishers.

[Spirtes et al., 1993] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Springer Verlag, New York, 1993.

[Verma and Pearl, 1991] T. S. Verma and Judea Pearl. Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence 6, pages 255–269. Elsevier Science Publishing Company, Inc., New York, N.Y., 1991.