Structure Learning Via Non-Parametric Factorized Joint Likelihood Function
DOI:10.3233/IFS-141125
IOS Press
Muhammad Naeem a,∗ and Sohail Asghar b
a Department of Computer Science, Mohammad Ali Jinnah University, Islamabad, Pakistan
b University Institute of Information Technology, PMAS-Arid Agriculture University, Rawalpindi, Pakistan
Abstract. Bayesian Belief Networks have inspired the machine learning community in the domain of structure learning. Numerous scoring functions have been introduced for structure learning. The performance of these scoring functions usually tends to favor a dense Bayesian network through an implicit over-fitting phenomenon. Motivated by this limitation, this study introduces a novel scoring function which is optimized to produce non-complex learnt structures while targeting higher classification accuracy. The introduced scoring function assumes no external parameter for its fine tuning. It is decomposable and possesses no penalty factor. The introduced scoring function holds its mathematical interpretation within information theory. This scoring function is designed to maximize the discriminant function for the given dataset query variables with respect to the class feature and other non-class features. The empirical evaluation of the introduced scoring function points out that its classification accuracy is significantly better than other existent scoring functions using the greedy search algorithm K2 and hill climbing. The outcome of this study illustrates that a simplistic structure with higher classification performance is possible by exercising the proposed scoring function in Bayesian Belief structure learning.
Keywords: Bayesian belief network, machine learning, discriminative learning, scoring functions, hill climbing algorithm
1. Introduction

Structure learning of Bayesian Belief Networks (BBN) has proved its strength in numerous domains of application. The crux of BBN is the notion of learning a structure given a dataset. Although it is a well-researched field, it remains a computationally hard task. As this problem is NP-hard [13, 17], researchers are left with the only option of using heuristic search. In this study, we have proposed a technique for addressing this task. Our approach is based on the combination of the space of query variable orderings and the space of learnt structures.

A densely connected learnt structure generally gives good data-fitting performance. However, such a structure carries the overhead of heavy computation in parameter learning. Such a complex structure is limited to datasets with a medium number of data attributes (features). On the other hand, a learnt structure with a small number of arcs between nodes usually results in poor classification accuracy. The prime contribution of this study is to reduce the classification error rate while producing a least complex structure. The proposed scoring function produces a network with better accuracy using a small number of parental nodes, in the range of 2 to 4. This aspect certainly suits situations where improved accuracy is required from smarter learnt network structures. We have carried out a large number of experiments to support this claim.

∗ Corresponding author: Muhammad Naeem, Department of Computer Science, Mohammad Ali Jinnah University, Islamabad 44000, Pakistan. Tel.: +92 51 111878787; Fax: +92 51 2822446; E-mails: [email protected]; [email protected] (Sohail Asghar).

A BBN classifier is technically composed of two components: a scoring function and a
searching heuristic; the latter is the way through which a scoring function is evaluated. We have examined six scoring functions, amongst which five are quite well established. All of these functions have been extensively used in BBN-based structure learning. These include Minimum Description Length (MDL) [18], Akaike Information Criterion (AIC) [6], BDeu [17] and factorized conditional log likelihood (fCLL) [2]. Cooper et al. [5] introduced the algorithm K2, in which greedy search was employed while a Bayes scoring function was used. We discuss them with an example in the next section in the perspective of a few shortcomings inside them.

In [15], Yang et al. performed experiments on three random networks for comparing the performance of various scoring functions using K2 as the searching algorithm. They pointed out that BIC can correctly discover the network topology. They also pointed out that BDeu is sensitive to its external fine-tuning parameter, namely the alpha value. A comparison between the complexity of the model and the fitting of the training data was suggested in [16]. They argued over the performance of the Bayes scoring function in comparison to AIC [16]. According to them, AIC exhibits better behavior in coping with over-fitting in BBN model selection. The limitation in both of the above is that they did not employ a learning algorithm for finding good hypothesis structures. The second limitation is related to the counts of experimental data. Both of them produce a few networks.

2. Case study for evaluation of scoring metrics

We have already delivered a brief introduction of structure learning. However, it is quite useful to describe a broader picture of structure learning. Moreover, we also need to discuss a case study of the dataset segment (publicly available at the UCI repository). Fig. 1 delineates the whole picture of structure learning. It points out four essential onion-style layers. A given dataset, after passing through a prerequisite preprocessing step, is handed over to the feature selection layer. This study is not focused on separating optimized features; however, since a searching algorithm like K2 (we used the K2 search algorithm) is sensitive to feature ordering, a part of the proposed metric initially re-orders the attribute features with respect to their discriminant effect on the class feature. Numerous search algorithms have been proposed, characterized by the formulation of the search space, orderings over the network variables and equivalence classes of learnt structures. Some notable search algorithms include genetic search, hill climbing, tree augmented network, greedy search, simulated annealing and tabu search [2], and an implementation of all of them is also available in Weka [9]. The innermost layer is comprised of the scoring metric, which results in a numerical value (herein known as the score). The search algorithm uses this score in deciding the parental nodes (attribute feature and class feature) for each candidate node (attribute feature).
The outcome of this layer is a DAG; in fact, every node in this DAG possesses a Conditional Probability Table (CPT). This process is known as parameter learning, which is also determined by the combined effect of the search algorithm and the scoring metric [13]. An estimation algorithm such as the simple estimator uses this array of CPTs so that every instance of the test data can be classified.

Fig. 2. A simple Bayesian network structure.

Fig. 3. Percentage of accuracy and simplicity for the segment dataset. Accuracy: 91.42, 91.39, 91.69, 94.85, 94.63, 94.85, 95.28, 95.32; Fullness: 100, 95, 28.3, 55.8, 55.8, 27.9, 51.3, 51.3.

Fig. 2 shows a very simple BBN structure for the dataset segment. The right section of Fig. 2 contains the abbreviations of the features of the segment dataset; in this structure, the position of the class node is
already fixed. However, apart from this quickly built structure, some implicit information is missing. This structure is in fact void of a lot of underlying mutually coherent information between the nodes. Technically, such a structure is termed as suffering from data under-fitting, because the learnt structure is not truly representing the dataset upon which it was built. We repeated this experiment, but this time we used Entropy as the scoring metric while exercising K2 as the search algorithm again and keeping the other external parameters fixed. The scoring metric Entropy unveils a lot of mutual relationships in the given dataset. For example, it identifies a set of four features as the parental nodes for the feature short-line-density-2. These parental nodes include region-centroid-row, region-centroid-col, short-line-density-5 and the class node. The Entropy metric also introduces a lot of other links; as these links carry discriminative information towards the class feature, they eventually result in improved classification accuracy. The accuracy of the classifier in the simple BBN was 91.42%, while this accuracy was increased up to 94.84% by using the Entropy scoring metric (see Fig. 3).

Here the question arises: which factor is responsible for this increase in accuracy? Should we extrapolate that realizing more parental nodes may justify this reason? We must establish a relationship between the fullness (simplicity) of the DAG and the outcome of its relevant classification accuracy. It will be useful to develop a simple mathematical formula for defining the fullness of a DAG such that

Fullness = 100 × (number of edges between class node and attribute nodes) / (all links)

Keeping in view of the above formula, the general BBN is giving poor classification. However, as we increase the number of links, it favors data fitting [13, 17].
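To make the fullness measure concrete, the following minimal Python sketch computes it for a learnt structure represented as a list of directed edges; the function name and the edge representation are illustrative assumptions rather than part of the original experimental code.

```python
def fullness(edges, class_node):
    """Fullness = 100 * (# edges incident to the class node) / (total # edges).

    `edges` is an iterable of (parent, child) pairs describing the learnt DAG;
    `class_node` is the identifier of the class feature.
    """
    edges = list(edges)
    if not edges:
        return 0.0
    class_links = sum(1 for parent, child in edges
                      if parent == class_node or child == class_node)
    return 100.0 * class_links / len(edges)


# Example: a naive-Bayes-like structure over three attributes is "100% full",
# because every edge connects the class node to an attribute node.
nb_edges = [("class", "f1"), ("class", "f2"), ("class", "f3")]
augmented = nb_edges + [("f1", "f2")]  # one extra attribute-to-attribute link
print(fullness(nb_edges, "class"))    # 100.0
print(fullness(augmented, "class"))   # 75.0
```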
But things are not so straightforward, unluckily. Albeit this trend is mostly observed in the literature, and during our experimentation as well, the other factor opposing optimized data fitting is the data over-fitting problem. The result of the scoring metric (Fig. 3) is a good example in this context. The continuous increase in the links of the DAG eventually results in an over-fitting problem where the model apparently favors the training set but is unable to classify the same number of test instances. We can term this problem as fixing the balance between data under-fitting and data over-fitting in structure learning. While observing the graphical result in Fig. 3, one can argue about the anomalous performance of MDL. It produces only one extra link other than the basic links between the attribute nodes and the class node. This link is between the 7th node hedge-mean and the 6th node hedge-sd (as parental node). The reason is evident: the relationship was incorrectly identified while leaving out a lot of potentially correct links, as shown in Fig. 4. According to Fig. 4, this wrong link has been fixed using the proposed scoring metric. The detail of the other links, including distribution and parent set, is also described, which establishes this NPFiLM-based structure as the best among all of the candidate structures. The simplicity of the Bayes scoring metric is also equivalent to that of NPFiLM; however, as we already described, identifying the true relationship for the purpose of maximizing the discriminant function for a class node is a challenge in classification. When we used K2 as the search algorithm, the pre-ordering of the feature set also plays an important role. The proposed scoring metric also does the same job, the mathematical detail of which is illustrated in Section 5.

3. Proposed scoring function

A scoring function can be expanded in terms of local scores. The local score refers to the score produced by the CPT of every query variable conditioned on the class feature. The sum of all such scores is termed the scoring function of a given structure, model or directed acyclic graph.

We assume that D denotes the training dataset with n features such that $f_i$ is the ith feature. Then the local score can be denoted by $\phi_i$. The final scoring function can be obtained by cumulating all local scoring functions:

$$\Phi(B, D) = \sum_{i=1}^{n} \phi_i(f_i, D) \qquad (2)$$
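Because the score in Equation (2) is a sum of per-variable local terms, a greedy searcher such as K2 only has to re-evaluate a single local term when it tries a new parent. The sketch below illustrates that mechanics in Python; the `local_score` placeholder, the data structures and the function names are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List, Sequence

# A local score takes (variable, parent_set, data) and returns a float.
LocalScore = Callable[[str, Sequence[str], List[dict]], float]

def k2_style_search(order: Sequence[str],
                    data: List[dict],
                    local_score: LocalScore,
                    max_parents: int = 4) -> Dict[str, List[str]]:
    """Greedy parent selection over a fixed variable ordering.

    For each variable, parents are added one at a time from its predecessors
    as long as the (decomposable) local score improves, up to `max_parents`.
    """
    parents: Dict[str, List[str]] = {}
    for i, var in enumerate(order):
        chosen: List[str] = []
        best = local_score(var, chosen, data)
        candidates = list(order[:i])          # only predecessors may become parents
        improved = True
        while improved and len(chosen) < max_parents:
            improved = False
            scored = [(local_score(var, chosen + [c], data), c) for c in candidates]
            if scored:
                score, cand = max(scored)
                if score > best:               # keep the single best improving parent
                    best, improved = score, True
                    chosen.append(cand)
                    candidates.remove(cand)
        parents[var] = chosen
    return parents
```

With a decomposable score, adding or removing one parent changes only that variable's local term, which is the property exploited later in Equation (15); the `max_parents` default of 4 simply mirrors the 2 to 4 parent range evaluated in Section 5.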
3.1. Definition 1
A simple decomposition can be expressed as:

$$\mathrm{Size}(M) = \sum_{x \in X} \left(|pa(x)| \times |x|\right) \qquad (5)$$

It means that the complexity of a structure M can be estimated by the product of parent-node counts and query-variable states.

3.2. Lemma 1

Let M be a BBN given query variables X. The optimized and data-fitted model Ḿx of the underlying dataset holds essential links only. It can be expressed that there is no other optimized network Mx which can hold a lesser number of links.

Proof. Let M be any ordinary structure. Its parameter distribution is PUX. Let Ḿx be an optimized model. It can be noted that whenever two query variables xi and xj are directly linked, this certainly enhances the goodness of data fitting of the structure being learnt. If these links are present in model M, then they must be present in Ḿx. However, if the size of M is shorter than the size of Ḿx, then it is inferred that some links in M exhibit the opposite direction to those of the so-called optimized structure. This rationalizes the search for a structure of smallest size. If this is a BBN possessing indispensable links only, then the structure must be an optimized structure.

The value of the penalty factor is again a questionable issue; a score expression carrying such a factor is, in fact, not a decomposable score, and the penalty factor always carries computational overhead [19]. In order to address this problem, the general principle of William Ockham (1285–1349) is quite useful, according to which one should "select the simplest hypothesis such that the hypothesis be consistent with the underlying observation". Researchers have pointed out that this principle has its own place in BBN structure learning [4]. Proceeding with the general principle of the Ockham hypothesis, let us consider F a non-class variable and C a class variable. We are to find out the mathematical value of the relationship between the instances of these two features. Let F contain a distinct states, while the class feature C possesses b distinct states.

$$F = \{f_i \mid i = 1, \ldots, a\} \qquad (7)$$

$$C = \{c_j \mid j = 1, \ldots, b\} \qquad (8)$$

We shall proceed further keeping in view the above simple point estimation in structure learning. The joint probability between both of these feature variables can be formulated as:

$$\delta(C, F) = \sum_{c}\sum_{f} P(C = c, F = f) \qquad (9)$$

We need to normalize the joint probability given in Equation (9).
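As a quick illustration of Equation (5), the following sketch computes the complexity of a candidate structure from per-variable state counts and parent sets; the dictionary-based representation is an assumption made for the example, not notation taken from the paper.

```python
def structure_size(states, parents):
    """Size(M) = sum over variables x of |pa(x)| * |x|  (Equation (5)).

    `states`  maps each variable to its number of distinct states |x|;
    `parents` maps each variable to the list of its parent variables pa(x).
    """
    return sum(len(parents.get(x, [])) * n_states for x, n_states in states.items())


# A three-attribute example: f1 has two parents, f2 has one, f3 has none.
states = {"f1": 3, "f2": 2, "f3": 4}
parents = {"f1": ["class", "f2"], "f2": ["class"], "f3": []}
print(structure_size(states, parents))  # 2*3 + 1*2 + 0*4 = 8
```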
In structure learning, the discriminant joint probabilities ξC,F possess the notion of converting it into maximum a priori probabilistic inference given a case of simple point estimation.

Proof. Let ϑ(F) denote the marginal probability of a query variable. The value of λC,F can be expressed as a conditional probability by replacing the denominator as:

$$\lambda_{C,F} = \left[\max\arg_{c} \left(\sum_{f} \frac{P(c, f)}{\vartheta(F)}\right)\right] \qquad (12)$$

The simple point estimation potential λC,F as shown in Equation (12) is decomposable into a simple point conditional joint probability. However, we are not restricted to simple point estimation; rather, our objective is to cover the whole dataset. This requires that we generalize the simple point estimation into a scoring function.

Proof. Let there be n non-class query variables with a single class variable in the dataset D. We can achieve the reduction of simple point estimation into a generalized maximum a posteriori inference notation, namely NPFiLM, as below:

$$\mathrm{NPFiLM}(D, G) = \sum_{i=1}^{n} \max\arg\,(X_i, Pa(X_i), C, D) \qquad (13)$$

The simple calculation between two feature variables, in order to turn them into a decomposable value, can be denoted as $\sum_{j=1}^{q_i} \sum_{k=1}^{r_1} N_{ijk}$, where i is the feature iterator, j is the parent iterator, k is the feature state iterator and c is the class iterator. If we consider the inclusion of the class variable, then a minor change will turn this value of estimation into $\sum_{j=1}^{q_i} \sum_{k=1}^{r_1} N_{ijck}$. Now, we plug this value into Equation (13), which yields Equation (14).

If we introduce a directed link from node Xi to node Xj, then only the local value of NPFiLM will be changed. If this inclusion causes an improvement in the structure G, then this improvement can be described as:

$$\Delta(X_p \to X_q) = \max\arg \sum_{i=1}^{n} (X_q, Pa(X_q) \cup \{X_p\}, C, D) - \max\arg \sum_{i=1}^{n} (X_q, Pa(X_q), C, D) \qquad (15)$$

Equations (14) and (15) point out that there is only a simple frequency calculation involving the feature node, parent node and class node. Hence it can be inferred that NPFiLM belongs to the class of decomposable scoring functions. The decomposition property is certainly useful when a searching algorithm like K2 or hill climbing needs to calculate the final score for each candidate out of the set of all query variables in a queue. From this large set of factors, only those factors are useful which tend to contribute towards the explanation of a class state. However, this also becomes the reason for an essential observation. Let us consider the feature set and class variable as defined in Equations (7) and (8). Let us consider the last feature in the ordered list. Certainly this must be linked to the class variable with a precise discriminant value of joint probability in a BBN. An inclusion of a new query variable in the set of its parent list will be restricted by a higher value of the discriminant value. However, as the new node is linked, such chances are quite narrow, because the factor of the joint probability distribution will start thinning with the increase of new parental values. It means that, in a randomly ordered set of features, there are chances that we get a merely simple Naïve Bayes graph. We already mentioned in the introduction section that simple Naïve Bayes does not possess goodness of data fitting [13, 17, 19]. Moreover, there is no need to introduce a scoring function where the learnt structure is known a priori without any ambiguity or confusion. This raises the question of tackling the problem of converging the structure into a very thin model. A prior intelligent ordering of all query variables may give the answer to this problem.
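To make the count-based local value concrete, here is a small Python sketch that accumulates the N_ijck frequencies for one variable given a candidate parent set and the class, and turns them into a local score by keeping, for each parent configuration, the class state that best explains the observed counts. This is a hedged reading of Equations (13) to (15): the normalization and exact arg-max convention of the published NPFiLM may differ, and the helper names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def local_counts(var: str, parents: Sequence[str], class_var: str,
                 data: List[Dict[str, str]]) -> Counter:
    """Frequency table N_{ijck}: parent configuration j, class state c, feature state k."""
    counts: Counter = Counter()
    for row in data:
        j = tuple(row[p] for p in parents)      # parent configuration of this row
        counts[(j, row[class_var], row[var])] += 1
    return counts

def local_score(var: str, parents: Sequence[str], class_var: str,
                data: List[Dict[str, str]]) -> float:
    """For each parent configuration, keep the class state that best explains the
    observed counts, then sum over configurations (a hedged reading of Eq. (13))."""
    by_config: Dict[Tuple, Counter] = {}
    for (j, c, _k), n in local_counts(var, parents, class_var, data).items():
        by_config.setdefault(j, Counter())[c] += n
    return float(sum(max(cnt.values()) for cnt in by_config.values()))

# Such a local score plugs into the greedy searcher sketched after Equation (2)
# once the class variable is bound, e.g. via functools.partial.
```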
3.5. Definition 2

Let the feature set be represented as F = {f1, f2, f3, ..., fn}; then the ordering weight of every feature will be calculated by the weight factor shown as:

$$\omega_F = \lambda_{C,F} - \lambda_{F,C} \qquad (16)$$

The terms λC,F and λF,C play the role of existence restrictions. Let us assume both of them to be existence restrictions such that (F, C) ∈ λC,F : the link F → C elucidates the discriminant objective with respect to the class variable, whereas (F, C) ∈ λF,C : the link C → F points out the discriminant scoring value with respect to the feature.

In our earlier research, the correct topological ordering between two features was discussed at the point estimation level [11]. We discussed a parametric property of Bayesian Belief Networks with a suggestion of possible correctness in the point estimation [11]. Conventional metrics often fail to describe the correct point estimation topology in a lot of situations; this ultimately leads to the selection of potentially unsuitable neighbor and parental nodes. However, I2S was shown to be capable of rightly identifying it in the majority of cases as compared to Bayes, MDL, BDeu and Entropy [11]. Madden [10] pointed out that a model in which a class node is fixed at the root level can bring an increased goodness of data fitting. This type of scheme was named in the literature as “selective BN augmented NBC” [10]. Hence the latter score value must be subtracted from the first value, which results in a weighted score vector as depicted in Equation (17).

$$\lambda_{F,C} = \max\arg_{c} \left(\sum_{f} \frac{P(f, c)}{\vartheta(C)}\right) \qquad (17)$$

A simple descending-order function is applicable to the weights achieved from Equation (16), which results in an ordered list of input variables.

$$\overleftarrow{F} = \{\omega_f^i \mid i = 1, \ldots, n\} \qquad (18)$$

If we plug this ordered set into Equation (16), it will appear in the following form:

$$\mathrm{NPFiLM}(D, G, \overleftarrow{X}) = \sum_{i=1}^{n}\left[\max\arg_{c} \frac{1}{|C_c|} \sum_{j=1}^{q_i} \sum_{k=1}^{r_1} N_{ijck}\right] \qquad (19)$$

3.6. Lemma 4

The ordered set instantiated by a mutual-information-oriented metric can convert NPFiLM into a well behaved scoring function.

Proof. Let us consider a set of n un-sorted query variables such that F = {f1, f2, f3, ..., fn}. Let us start from any succeeding query variable, say the jth query variable fj, such that it lies somewhere in the trivial un-ordered list denoted by {ρ, fj, ς} ∴ ρ ∩ ς = φ, where ρ is the set of predecessor nodes and ς is the set of successor nodes. It has already been described that K2 incrementally adds parental nodes for a node out of its predecessors ρ only; otherwise the expression ∀(fs, c) ∈ ς, fs → fj : false should hold. Hence it can be concluded that an intelligently ordered set can beget NPFiLM into a well behaved scoring function. One can notice that NPFiLM does not possess any a priori parametric value for fine-tuning purposes. Moreover, there is no concept of a penalty factor in the final mathematical expression of the proposed scoring function.

4. Information theoretic interpretation

Before describing empirical results, we illustrate that NPFiLM has an interesting mutual coherence interpretation with its roots in information-theoretic elucidation. We revive some basic concepts of entropy and mutual information and their mapping to each other. Let X be a feature and C be a class; then the joint entropy is a function of the conditional entropy of X given C, the conditional entropy of the class given X, and the Mutual Information (MI), such that:

$$H(X, C) = H(X|C) + MI(X, C) + H(C|X) \qquad (20)$$
The conditional entropy can also be written in terms of the mutual information between the two variables (X and C). This is expressed as:

$$H(X|C) = H(X) - MI(X, C) \qquad (22)$$

Fig. 5. Surface plot for joint distribution of two variables for Mutual Information (MI).

From Equations (21) and (22), one can easily conclude and establish the relationship between NPFiLM and MI, indicated as:

$$H(X) - MI(X, C) = \mathrm{NPFiLM}(X|C_m) + \mathrm{NPFiLM}(X|C_n) \qquad (23)$$

Notice that the proposed measure can also be expressed in the form of Equation (24).
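The identities in Equations (20) and (22) are easy to check numerically; the short sketch below estimates the entropies and the mutual information from observed (feature, class) pairs. It is only a verification aid under the usual plug-in (maximum-likelihood) estimates and is not part of the proposed scoring function.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

def entropies(pairs: Iterable[Tuple[str, str]]):
    """Return (H(X), H(C), H(X,C), MI(X,C)) from observed (x, c) pairs, in bits."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    pc = Counter(c for _, c in pairs)

    def H(counter):
        return -sum((v / n) * math.log2(v / n) for v in counter.values())

    h_x, h_c, h_xc = H(px), H(pc), H(joint)
    mi = h_x + h_c - h_xc                      # MI(X,C) = H(X) + H(C) - H(X,C)
    return h_x, h_c, h_xc, mi

sample = [("a", "0"), ("a", "0"), ("b", "1"), ("b", "1"), ("a", "1"), ("b", "0")]
h_x, h_c, h_xc, mi = entropies(sample)
# Equation (20): H(X,C) == H(X|C) + MI + H(C|X), with H(X|C) = H(X,C) - H(C) etc.
assert abs(h_xc - ((h_xc - h_c) + mi + (h_xc - h_x))) < 1e-9
# Equation (22): H(X|C) == H(X) - MI(X,C)
assert abs((h_xc - h_c) - (h_x - mi)) < 1e-9
```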
However, an interesting argument arises: what is the impact of NPFiLM as a scoring metric compared to Mutual Information and our earlier proposed scoring metric I2S? Figure 5 indicates the surface chart of MI when the numbers of states in both of the features are gradually increased. Notice that in MI the value always behaves symmetrically, while the relationship of two nodes in a BBN never yields a symmetric accuracy. Secondly, for a uniform distribution of two variables, it only reaches its maxima in certain cases when a specific number of states in both of the features is achieved. On the contrary, for NPFiLM (Fig. 7) the number of states plays an important role: as the enumeration of class states increases, there is a probability that the joint distribution will also get sparse and thinner in each individual distribution. Such a well behaved phenomenon is also observed in the BBN classifier, where an equal distribution of the class node with respect to the other query variables results in poor classification accuracy. Moreover, we also demonstrate the behavior of I2S; the values (see Fig. 6) point out that I2S is also asymmetric in behavior like NPFiLM, which is in quite good harmony with graphical learning. However, it is not as well behaved as NPFiLM: a change in the state counts of the query variables delivers a drastic change in the corresponding value of I2S, while a state count change in the class node gives only a slow change in the I2S value. Keeping in view this limitation in I2S and MI, this phenomenon is adjusted in NPFiLM to tailor it into a suitable scoring function for BBN.

Fig. 6. Surface plot for joint distribution of two variables for I2S.

5. Empirical validation of NPFiLM

In this study we have evaluated our proposed scoring function over fifty datasets publicly available at the UCI machine learning repository [1]. We selected datasets representing various domains of interest in order to avoid any possible implicit bias.
The error rate is defined as the ratio of the sum of False Positive and False Negative to the sum of all four quadrant values of a typical confusion matrix. It is mathematically defined as:

$$\mathrm{Error\ rate} = \frac{FP + FN}{TP + TN + FP + FN} \qquad (25)$$

In Equation (25), TP and TN denote the positive and negative classes being classified correctly. The results are shown by comparing six well known BBN scoring functions. The scoring function comparison has been achieved by adjusting a parameter known as the number of parents. This value has been denoted by p in all of the figures in this section. The value of p ranges from 2 to 4. The reason behind restricting this value to four is that, at values of 5 and above, numerous existent scoring functions were not capable of parameter learning: if the features contain numerous distinct states, then the multiplication of these distinct states for more than four features results in a very large CPT, for which the common memory resources get exhausted. A solution for such a problem lies in stream mining, where the in-transit results are kept on hard disk for further processing.

First, the error rate of the proposed scoring function is evaluated by comparing it to every other scoring function (see Figs. 8 and 10). If there is no significant difference observed, it means both of the competing scoring functions have delivered a more or less equally data-fitted model. Secondly, the proposed scoring function is compared with every other scoring function in terms of average classification rates (see Figs. 9 and 11).

The results of Figs. 8 and 9 were obtained using the greedy search mechanism whose implementation in BBN is known as K2. The results from these figures make it quite evident that the proposed scoring function mostly outperforms the others, whereas the worst performance was observed in the case of the Entropy scoring function. The results of Bayes, AIC, BDeu and MDL were approximately comparable, except in the case of MDL with the setting of a maximum of four parental nodes. This means the empirical evidence indicates that MDL is not suitable for a thick network. Some datasets were so large in number of features that a few of the classifiers did not give a result in reasonable time; thus the results of these datasets have been excluded from the average performance comparison.
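Equation (25) is simply the complement of accuracy; a minimal sketch, assuming a binary confusion matrix laid out as (TP, TN, FP, FN):

```python
def error_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (25): misclassified instances over all instances."""
    total = tp + tn + fp + fn
    return (fp + fn) / total if total else 0.0

print(error_rate(tp=90, tn=85, fp=10, fn=15))  # 0.125, i.e. 87.5% accuracy
```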
Fig. 9. Average error rate of scoring functions for 50 datasets using greedy search K2.
Figures 10 and 11 report the corresponding results for the hill climber (HCL) with the peer scoring functions and our introduced measure NPFiLM. The results in general validate the findings of Lerner et al. [3] that hill climbing is potentially prone to yield better results (see Figs. 10 and 11); however, the error rate and accuracy results for NPFiLM were still better as compared to its peer scoring functions. One dataset, arrhythmia, was significantly wider in the size of its feature set: it contains 280 features and sixteen classes, albeit its sample size is small (only 452). We noticed that the scoring functions Bayes, AIC, BDeu, MDL and Entropy did not give a result even after a pass of 48 hours when using HCL. This left us with no option but to omit this dataset's result from the figures.

During the experiment of comparison with peer scoring functions, it was noticed that the Entropy scoring function always yields a very thick and dense network in which a significantly large number of arcs is established. Although adding each arc in general increases the training accuracy, at the same time it is being done at the cost of a reduction in accuracy on test samples. In fact, there is a need to select a scoring function where neither a very thick nor a very thin network is generated, so that over-fitting and under-fitting can both be aggregated to their optimum value. As far as BDeu is concerned, its performance is mainly characterized by the alpha value which controls the penalty factor. The same is true for the Bayes scoring metric. Unluckily, there is no mechanism which can dictate in advance the specific value of alpha for any particular dataset. An optimized value of alpha for each dataset is always variable and cannot be predicted in advance. As far as AIC and MDL are concerned, we term both of them sister scoring functions because they differ only in the way of controlling the penalty factor.

Some techniques were used to improve the performance of MDL [8], where the searching algorithm during the construction of the network was revised. They replaced the greedy algorithm K2 of Cooper and Herskovits [5] by a branch and bound algorithm with MDL as the scoring function. The superiority of MDL over its peer scoring function Bayes was also elaborated [12]. They analyzed two important scoring functions, MDL and Bayes, and concluded that MDL is superior in time complexity as compared to Bayes as the number of nodes gradually increases. Secondly, it is a good practice
Rajaram et al. [14] stated that “small alterations” to Naïve Bayes are in abundance in the literature, in pursuit of correct adjustments for the specific nature of a dataset. A true balance prevents the network from growing very dense or very thin. This clearly prevents the test network from suffering from the over-fitting or under-fitting phenomenon. However, in its peer methods, the density or scarcity is not well controlled. In the case of BDeu and Bayes, the value of alpha greatly determines the density level. The scoring function Entropy was observed to be prone to giving a dense network in the majority of cases, whereas MDL tends towards a very thin network. Admittedly, the data-fitting phenomenon goes in favor of higher accuracy. However, the dark side of this aspect is the additional complexity in the network, as the network carries heavily linked nodes at the cost of parameter learning. Moreover, the data-fitting phenomenon starts favoring the training data while bypassing the test data. NPFiLM has been designed in a way to keep the balance between data over-fitting and data under-fitting. The study delivered the analysis and comparison from different angles using greedy search with various settings of the parental set. NPFiLM performed significantly better than the others in all of these comparisons. The proposed measure is extrapolated to construct a realistic network which is likely to tally with the practical gist of domain experts. According to our findings, BIC has shown its capability to behave well for large sample sizes, but for small datasets its error rate is significantly high.

References

[4] F.V. Jensen and T.D. Nielsen, Bayesian networks and decision graphs, Information Science and Statistics, Springer, New York, 2007. ISBN 978-0-387-68281-5.
[5] G.F. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (1992), 309–347.
[6] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (1974), 716–723.
[7] H. Xu, J. Yang, P. Jia and Y. Ding, Effective structure learning for estimation of distribution algorithms via L1-regularized Bayesian networks, Int J Adv Robotic Sy 10(17) (2013), 1–14.
[8] J. Suzuki, Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using the branch and bound technique, Cybernetics and Systems Analysis 44(2) (2008), 219–224.
[13] N. Friedman, D. Geiger and M. Goldszmidt, Bayesian network classifiers, Machine Learning 29 (1997), 131–163.
[14] R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran and D.S.J. Prakash, NB+: An improved Naïve Bayesian algorithm, Knowledge-Based Systems 24(5) (2011), 563–569.
[15] S. Yang and K.C. Chang, Comparison of score metrics for Bayesian network learning, IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3 (1996), 2155–2160.
[16] T. Van Allen and R. Greiner, Model selection criteria for learning belief nets: An empirical comparison, In ICML'00 (2000), 1047–1054.
[17] W.L. Buntine, Theory refinement on Bayesian networks, Proc. UAI'91 (1991), 52–60.
[18] W. Lam and F. Bacchus, Learning Bayesian belief networks: An approach based on the MDL principle, Comp Intell 10 (1994), 269–294.
[19] Z. Liu, B. Malone and C. Yuan, Empirical evaluation of scoring functions for Bayesian network model selection, BMC Bioinformatics 13(Suppl 15), S14 (2012).