
Journal of Intelligent & Fuzzy Systems 27 (2014) 1589–1599
DOI: 10.3233/IFS-141125
IOS Press

Structure learning via non-parametric factorized joint likelihood function

Muhammad Naeem^a,* and Sohail Asghar^b
^a Department of Computer Science, Mohammad Ali Jinnah University, Islamabad, Pakistan
^b University Institute of Information Technology, PMAS-Arid Agriculture University, Rawalpindi, Pakistan
Abstract. The Bayesian Belief Network has inspired the machine learning community in the domain of structure learning. Numerous scoring functions have been introduced for structure learning. The performance of these scoring functions usually tends to favor dense Bayesian networks through an implicit over-fitting phenomenon. Motivated by this limitation, this study introduces a novel scoring function which is optimized to produce non-complex learnt structures while targeting higher classification accuracy. The introduced scoring function assumes no external parameter for fine tuning, is decomposable, and possesses no penalty factor. It has a mathematical interpretation rooted in information theory. The scoring function is designed to maximize the discriminant function of the dataset query variables with respect to the class feature and the other non-class features. The empirical evaluation points out that its classification accuracy is significantly better than that of other existing scoring functions using the greedy search algorithm K2 and a hill climber. The outcome of this study illustrates that a simple structure with higher classification performance is possible by exercising the proposed scoring function in Bayesian Belief structure learning.

Keywords: Bayesian belief network, machine learning, discriminative learning, scoring functions, hill climbing algorithm

1. Introduction

Structure learning of Bayesian Belief Networks (BBN) has proved its strength in numerous domains of application. The crux of BBN is the notion of learning a structure given a dataset. Although it is a well-researched field, it is still a computationally hard task. As this problem is NP-hard [13, 17], researchers are left with the only option of using heuristic search. In this study, we have proposed a technique for addressing this task. Our approach is based on combining the space of query variable orderings and the space of learnt structures.

When the number of parent nodes in structure learning is large, data fitting favors higher classification performance. However, such a structure carries the overhead of large calculations in parameter learning. Such a complex structure is limited to datasets with a medium number of data attributes (features). On the other hand, a learnt structure with fewer arcs between nodes usually results in poor classification accuracy. The prime contribution of this study is to reduce the classification error rate while giving a least complex structure. The proposed scoring function produces a network with better accuracy with a smaller number of parental nodes, in the range of 2 to 4. This aspect certainly suits situations where improved accuracy is required from smaller learnt network structures. We have carried out a large number of experiments to support this claim.

* Corresponding author. Muhammad Naeem, Department of Computer Science, Mohammad Ali Jinnah University, Islamabad 44000, Pakistan. Tel.: +92 51 111878787; Fax: +92 51 2822446; E-mails: [email protected]; [email protected] (Sohail Asghar).


A BBN classifier is technically composed of two components: a scoring function and a searching heuristic, i.e., the way through which a scoring function is evaluated. We have tweaked six scoring functions, of which five are quite well established and have been extensively used in BBN-based structure learning. These include Minimum Description Length (MDL) [18], Akaike Information Criterion (AIC) [6], BDeu [17], and factorized conditional log likelihood (fCLL) [2]. Cooper et al. [5] introduced the algorithm K2, in which greedy search was employed with a Bayes scoring function. We discuss these functions with an example in the next section, in the perspective of a few of their shortcomings.

In [15], Yang et al. performed experiments on three random networks for comparing the performance of various scoring functions using K2 as the searching algorithm. They pointed out that BIC can correctly discover the correct network topology. They also pointed out that BDeu is sensitive to its external fine-tuning parameter, namely the alpha value. A comparison between the complexity of the model and the fit to the training data was suggested in [16]. They compared the performance of the Bayes scoring function to AIC [16]; according to them, AIC copes better with over-fitting in BBN model selection. The limitation in both of the above is that they did not employ a learning algorithm for finding good hypothesis structures. The second limitation is related to the amount of experimental data: both of them produce only a few networks with random generation styles. Recently, Liu et al. [19] performed an empirical evaluation of the scoring functions AIC, BDeu, MDL and the Bayesian Information Criterion (BIC). They have shown that MDL and BIC can deliver better results in finding a good data-fitting BBN topology. MDL is usually suited to complex Bayesian networks [19]. Keeping the prior work in view, our objective is to develop a scoring function which can rightly balance the complexity of the learnt structure against the goodness of fit to the underlying training dataset.

The rest of the paper is organized into five further sections. Section 2 is related to an exemplification of the motivation of this study. In the third section, we statistically illustrate the evolution of the proposed scoring function NPFiLM. The outcome of this section is a mathematical equation expressing the decomposable metric. The fourth section is an explanation of the proposed scoring function in the context of information theory. The fifth section is related to the empirical validation of the proposed metric, while the last section is devoted to the conclusion expressing the final outcome of this study.

2. Case study for evaluation of scoring metrics

We have already delivered a brief introduction to structure learning. However, it is quite useful to describe a broader picture of structure learning. Moreover, we also need to discuss a case study on the dataset segment (publicly available at the UCI repository). Fig. 1 delineates the whole picture of structure learning. It points out four essential onion-style layers. A given dataset, after passing through a prerequisite preprocessing step, is handed over to the feature selection layer.

Fig. 1. Structure Learning: A Broader Picture.

This study is not focused on separating optimized features, but a searching algorithm like K2 (we used the K2 search algorithm) is sensitive to feature ordering. A part of the proposed metric initially re-orders the entire set of attribute features with respect to the discriminant effect from the class feature. Numerous search algorithms have been proposed, characterized by the formulation of the search space, orderings over the network variables and equivalence classes of learnt structures. Some notable search algorithms include genetic search, hill-climbing, tree augmented network, greedy search, simulated annealing and tabu search [2], and an implementation of all of them is also available in weka [9]. The innermost layer is comprised of the scoring metric, which results in a numerical value (herein known as the score). The search algorithm uses this score in deciding the parental nodes (attribute features and class feature) for each candidate node (attribute feature).
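How a search procedure consumes such a score can be sketched in a few lines. The following is a minimal, illustrative K2-style greedy loop in Python: for each node, taken in a fixed ordering, it keeps adding the preceding variable that most improves a decomposable local score until no candidate helps or a parent limit is reached. The `local_score` here is a plain smoothed log-likelihood stand-in, not any of the metrics discussed in this paper, and all names and the toy data are assumptions made only for illustration.

```python
from itertools import product
from math import log

def local_score(data, child, parents, alpha=1.0):
    """Decomposable stand-in score: smoothed log-likelihood of `child`
    given a parent set (illustrative only, not one of the paper's metrics)."""
    child_states = sorted({row[child] for row in data})
    parent_states = [sorted({row[p] for row in data}) for p in parents]
    score = 0.0
    for combo in product(*parent_states):
        rows = [r for r in data if all(r[p] == v for p, v in zip(parents, combo))]
        n_j = len(rows)
        for s in child_states:
            n_jk = sum(1 for r in rows if r[child] == s)
            score += n_jk * log((n_jk + alpha) / (n_j + alpha * len(child_states)))
    return score

def k2_search(data, ordering, max_parents=2):
    """Greedy K2-style structure search over a fixed variable ordering."""
    parents = {v: [] for v in ordering}
    for i, child in enumerate(ordering):
        best = local_score(data, child, parents[child])
        improved = True
        while improved and len(parents[child]) < max_parents:
            improved, best_cand = False, None
            # candidate parents are only the variables preceding `child` in the ordering
            for cand in ordering[:i]:
                if cand in parents[child]:
                    continue
                s = local_score(data, child, parents[child] + [cand])
                if s > best:
                    best, best_cand, improved = s, cand, True
            if improved:
                parents[child].append(best_cand)
    return parents

toy = [{"outlook": "sun", "windy": 0, "play": "no"},
       {"outlook": "sun", "windy": 1, "play": "no"},
       {"outlook": "rain", "windy": 0, "play": "yes"},
       {"outlook": "rain", "windy": 1, "play": "no"},
       {"outlook": "overcast", "windy": 0, "play": "yes"},
       {"outlook": "overcast", "windy": 1, "play": "yes"}]
print(k2_search(toy, ["play", "outlook", "windy"], max_parents=2))
```

The point of the sketch is only that the search is driven entirely by repeated local-score evaluations; swapping in a different decomposable metric changes nothing else in the loop.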

The execution of the search algorithm (like K2) ends up in the formulation of a structure in the shape of a Directed Acyclic Graph (DAG). The joint probability distribution of this structure can be described by the chain rule for a BBN as:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa_i) \qquad (1)$$
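Equation (1) amounts to multiplying one CPT entry per node. A minimal sketch, assuming CPTs are stored as nested dictionaries keyed by parent assignments; the network and the numbers are invented for illustration:

```python
# Joint probability of a complete assignment under Eq. (1):
# P(x1, ..., xn) = product over i of P(xi | pa_i)

parents = {"A": [], "B": ["A"], "C": ["A", "B"]}          # DAG: A -> B, {A, B} -> C
cpt = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "C": {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
          (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}},
}

def joint_probability(assignment):
    """Chain-rule product of one CPT entry per node for a full assignment."""
    p = 1.0
    for node, pa in parents.items():
        pa_key = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_key][assignment[node]]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))   # 0.4 * 0.2 * 0.6 = 0.048
```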

It must be kept in mind that the shape of the learnt structure is influenced by both the search algorithm and the scoring metric. This structure is not merely a piece of DAG; in fact every node in this DAG possesses a Conditional Probability Table (CPT). Filling these tables is known as parameter learning, which is also determined by the combined effect of the search algorithm and the scoring metric [13]. An estimation algorithm such as the simple estimator uses this array of CPTs so that every instance of the dataset is classified according to a user-supplied value of the folded cross validation technique. Here it must be brought to the notice of shrewd readers that the estimation algorithm has no impact on the final accuracy of the classifier; researchers have focused on the estimation algorithm in the context of its time and space complexity. Parameter learning is always limited by the available finite memory.

Fig. 2 shows a very simple BBN structure for the dataset segment. The right section of Fig. 2 contains the abbreviations of the features of the dataset segment. It is quite evident from Fig. 2 that the CPT overhead is not very large, as every node is linked to a single node, which is the class node. This type of structure technically requires no scoring metric, as the relationship of every node is already fixed. However, apart from this quickly built structure, some implicit information is missing. This structure is in fact void of a lot of underlying mutually coherent information between the nodes. Technically such a structure will be termed as suffering from data under-fitting, because the learnt structure does not truly represent the dataset upon which it was built. We repeated this experiment, but this time we used Entropy as the scoring metric while again exercising K2 as the search algorithm and fixing the other external parameters the same. The scoring metric Entropy unveils a lot of mutually coherent relationships in the given dataset. For example, it identifies a set of four features as the parental nodes for the feature short-line-density-2. These parental nodes include region-centroid-row, region-centroid-col, short-line-density-5 and the class node. The Entropy metric also introduces a lot of other mutually coherent relationships which were supposed to be useful in increasing the discriminant function towards the class feature; eventually this results in improved classification accuracy. The accuracy of the classifier with the simple BBN was 91.42%, while this accuracy was increased up to 94.84% by using the scoring metric Entropy (see Fig. 3).

Fig. 2. A Simple Bayesian Network Structure.

Fig. 3. Relationship between DAG fullness and its classification accuracy (x-axis: scoring metric). The plotted values are:

    Metric         Gen. BBN   MDL     fCLL    AIC     BDeu    Entropy  Bayes   NP-FiLM
    Accuracy (%)   91.42      91.39   91.69   94.85   94.63   94.85    95.28   95.32
    Fullness (%)   100        95      28.3    55.8    55.8    27.9     51.3    51.3

Here the question arises as to which factor is responsible for this increase in accuracy. Should we extrapolate that realizing more parental nodes is what justifies it? We must establish a relationship between the fullness (simplicity) of the DAG and its resulting classification accuracy. It will be useful to define a simple formula for the fullness of a DAG, such that:

Fullness = 100 × (number of edges between the class node and attribute nodes) / (total number of links)
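Under the above formula, fullness can be read directly off the arc list of a learnt DAG. A small sketch follows, with invented arc lists used only for illustration:

```python
def fullness(arcs, class_node):
    """Fullness = 100 * (arcs incident to the class node) / (all arcs)."""
    class_arcs = sum(1 for a, b in arcs if class_node in (a, b))
    return 100.0 * class_arcs / len(arcs) if arcs else 0.0

# A Naive-Bayes-like graph: every arc touches the class node, so fullness is 100.
nb_arcs = [("class", f) for f in ("f1", "f2", "f3", "f4")]
# A denser graph with extra feature-to-feature arcs has lower fullness.
dense_arcs = nb_arcs + [("f1", "f2"), ("f2", "f3")]

print(fullness(nb_arcs, "class"))     # 100.0
print(fullness(dense_arcs, "class"))  # 66.7 (4 of 6 arcs touch the class node)
```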

Keeping the above formula in view, the general BBN gives poor classification. However, as we increase the number of links, this favors the data fitting [13, 17]. But things are unluckily not so straightforward. Albeit this trend is mostly observed in the literature, and during our experimentation as well, the other factor opposing optimized data fitting is the data over-fitting problem. The result of the scoring metrics (Fig. 3) is a good example in this context. A continuous increase in the links of the DAG eventually results in an over-fitting problem, where the model apparently favors the training set but is unable to classify the test instances equally well. We can term this problem as fixing the balance between data under-fitting and data over-fitting in structure learning. While observing the graphical result in Fig. 3, one can argue about the anomalous performance of MDL. It produces only one extra link besides the basic links between the attribute nodes and the class node. This link is between the 7th node hedge-mean and the 6th node hedge-sd (as parental node). The reason is evident: this relationship was incorrectly identified while leaving out a lot of potentially correct links, as shown in Fig. 4.

According to Fig. 4, this wrong link has been fixed using the proposed scoring metric. The detail of the other links, including their distribution and parent sets, is also described, which results in this NPFiLM-based structure being the best among all of the candidate structures. The simplicity of the Bayes scoring metric is also equivalent to that of NPFiLM; however, as we already described, identifying the true relationships for the purpose of maximizing the discriminant function for a class node is a challenge in classification. When K2 is used as the search algorithm, the pre-ordering of the feature set also plays an important role. The proposed scoring metric also does this job, the mathematical detail of which is illustrated in Section 5.

Fig. 4. An augmented Bayesian structure representing causal influence (scoring metric = NPFiLM).

3. Proposed scoring function

A scoring function can be expanded in terms of local scores. The local score refers to the score produced by the CPT of every query variable conditioned on the class feature. The sum of all such scores is termed the scoring function of a given structure, model or directed acyclic graph.

We assume that D denotes the training dataset with n features such that $f_i$ is the ith feature. Then the local score can be denoted by $\phi_i$, and the final scoring function $\Phi$ can be obtained by accumulating all local scoring functions:

$$\Phi(B, D) = \sum_{i=1}^{n} \phi_i(f_i, D) \qquad (2)$$

The scoring function of a structure is formulated on the log likelihood LL obtained from the training dataset. The LL may be expressed as the log probability of dataset D given model G, as shown by Equation (3):

$$LL(G \mid D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log\left(\frac{N_{ijk}}{N_{ij}}\right) \qquad (3)$$

Here $N_{ijk}$ is the number of cases in which the ith feature is in its kth state given the jth configuration of its parents. One can observe that LL is a frequency calculation problem; hence it is certainly a decomposable value.
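Equation (3) is a sum of count-based terms, one per node, parent configuration and state, which is exactly what makes it decomposable. A minimal sketch of this computation from raw counts; the structure encoding and the toy data are assumptions made for illustration:

```python
from collections import Counter
from math import log

def log_likelihood(data, structure):
    """LL(G|D) = sum_i sum_j sum_k N_ijk * log(N_ijk / N_ij), as in Eq. (3).
    `structure` maps each variable to its list of parents."""
    ll = 0.0
    for var, parents in structure.items():
        # N_ijk: count of each (parent configuration j, child state k) pair
        n_ijk = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
        # N_ij: count of each parent configuration j
        n_ij = Counter(tuple(row[p] for p in parents) for row in data)
        ll += sum(n * log(n / n_ij[j]) for (j, _k), n in n_ijk.items())
    return ll

data = [{"x": 0, "c": 0}, {"x": 0, "c": 0}, {"x": 1, "c": 0}, {"x": 1, "c": 1}]
print(log_likelihood(data, {"c": [], "x": ["c"]}))   # each variable contributes one local term
```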

3.1. Definition 1

Let M denote a BBN over the variables X. Furthermore, the BBN parameters $\Theta_M$ are assumed to be independent (locally as well as globally). Hence the size of the BBN structure is a function of the number of unique states S of the query variables and the links L between the query variables, as shown by the equation:

$$Size(M) = f(L, S) \qquad (4)$$

A simple decomposition can be expressed as:

$$Size(M) = \sum_{x \in X} \left( |pa(x)| \times |x| \right) \qquad (5)$$

It means that the complexity of a structure M can be estimated by the product of parent-node counts and query-variable states.
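Equation (5) can be evaluated directly from the graph: for each node, multiply its parent count by its state count and sum. A short sketch under that reading, with invented node counts:

```python
def structure_size(parent_counts, state_counts):
    """Size(M) = sum over nodes of |pa(x)| * |x|, per Eq. (5)."""
    return sum(parent_counts[x] * state_counts[x] for x in parent_counts)

# Toy setting: three binary features and a 3-state class node.
state_counts = {"class": 3, "f1": 2, "f2": 2, "f3": 2}
naive_bayes = {"class": 0, "f1": 1, "f2": 1, "f3": 1}   # each feature has one parent
augmented = {"class": 0, "f1": 1, "f2": 2, "f3": 3}     # extra feature-to-feature arcs

print(structure_size(naive_bayes, state_counts))  # 6
print(structure_size(augmented, state_counts))    # 12
```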
3.2. Lemma 1

Let M be a BBN over the query variables X. The optimized and data-fitted model $\acute{M}_x$ of the underlying dataset holds essential links only; there is no other optimized network $M_x$ which can hold a smaller number of links.

Proof. Let M be any ordinary structure with parameter distribution $P_{UX}$, and let $\acute{M}_x$ be an optimized model. It can be noted that whenever two query variables $x_i$ and $x_j$ are directly linked, this certainly enhances the goodness of data fitting of the structure being learnt. If these links are present in model M, then they must be present in $\acute{M}_x$. However, if the size of M is smaller than the size of $\acute{M}_x$, then it is inferred that some links in M exhibit the opposite direction to those of the so-called optimized structure. This rationalizes the search for a structure of smallest size. If this is a BBN possessing indispensable links only, then the structure must be an optimized structure.

It is clear that an extra arc must be trimmed out. Here the notion of an extra arc refers to an arc causing no increase in the goodness of data fitting. An extra arc is the cause of two problems:

• It will generate over-fitting during model training, leading to a compromise on the goodness of data fitting.
• The size of the CPT will be increased, leading to an increase in the computational complexity in the phase of model evaluation.

Penalizing the network, that is, applying a penalty factor, is the usual solution to the addition of such an arc. The penalty factor is responsible for adjusting the complexity of the structure. The LL-based scoring function SF can be denoted by subtracting the penalty factor from the value of LL:

$$SF(G \mid D) = LL(D \mid G) - \sum_{i=1}^{n} PF(X_i, G, D) \qquad (6)$$

The scoring functions which we are discussing in this study suffer from the penalty factor. However, what the value of the penalty factor should be is again a questionable issue. In fact, the expression above is not a decomposable score, and the penalty factor always carries computational overhead [19]. In order to address this problem, the general principle of William Ockham (1285–1349) is quite useful, according to which one should "select the simplest hypothesis such that the hypothesis be consistent with the underlying observation". Researchers have pointed out that this principle has its own place in BBN structure learning [4]. Proceeding with the general principle of the Ockham hypothesis, let us consider F a non-class variable and C a class variable. We are to find out the mathematical value of the relationship between the instances of these two features. Let F contain a distinct states, with the class feature C possessing b distinct states:

$$F = \{ f_i \mid i = 1, \ldots, a \} \qquad (7)$$

$$C = \{ c_j \mid j = 1, \ldots, b \} \qquad (8)$$

We shall proceed further keeping in view the above simple point estimation in structure learning. The joint probability between both of these feature variables can be formulated as:

$$\delta(C, F) = \sum_{c} \sum_{f} P(C = c, F = f) \qquad (9)$$

We need to normalize the joint probability given in Equation (9). Here normalization refers to scaling the value between zero and one in order to assess the degree of magnitude. In the literature, this type of probability distribution is known by the name of potential. The potential ξ can be expressed as:

$$\xi_{C,F} = \sum_{c} \sum_{f} P(C = c, F = f) \qquad (10)$$

The goal is to maximize the discriminant objective function given the potential ξ. A change in this potential will appear in the form:

$$\acute{\xi}_{C,F} = \left[ \max \arg_{c} \left( \sum_{f} P(C = c, F = f) \right) \right] \qquad (11)$$

The potential ξ has been realized into $\acute{\xi}_{C,F}$. It is a prior discriminant on simple point estimation. In information theory, it can be considered as another measure of coherence between two query variables. The relationship expressed in the above equation exists in the capacity of a pluggable unit. We can integrate this unit into a relationship between query variables and their class variable.
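One literal reading of Equations (9) to (11) can be followed on a small contingency table: Eq. (9) accumulates the whole joint mass of P(C, F), and Eq. (11) keeps only the class state whose accumulated mass is largest. The sketch below follows that reading; the joint table is invented, and the reading itself is an assumption where the original notation is ambiguous:

```python
# Toy joint distribution P(C, F): rows are class states, columns are feature states.
joint = {
    "c1": {"f1": 0.30, "f2": 0.10, "f3": 0.05},
    "c2": {"f1": 0.05, "f2": 0.20, "f3": 0.30},
}

# Eq. (9): total joint mass (sums to 1 for a proper distribution).
delta = sum(p for row in joint.values() for p in row.values())

# Eq. (11): keep only the class whose accumulated joint mass is largest.
best_class = max(joint, key=lambda c: sum(joint[c].values()))
xi_prime = sum(joint[best_class].values())

print(delta)                  # 1.0
print(best_class, xi_prime)   # 'c2', 0.55
```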

3.3. Lemma 2

In structure learning, the discriminant joint probabilities $\acute{\xi}_{C,F}$ possess the notion of converting the problem into a maximum a priori probabilistic inference, given a case of simple point estimation.

Proof. Let ϑ(F) denote the marginal probability of a query variable. The value of $\lambda_{C,F}$ can be expressed as a conditional probability by introducing this marginal as a denominator in Equation (11):

$$\lambda_{C,F} = \left[ \max \arg_{c} \left( \sum_{f} \frac{P(c, f)}{\vartheta(F)} \right) \right] \qquad (12)$$

The simple point estimation potential $\lambda_{C,F}$ as shown in Equation (12) is decomposable into a simple point conditional joint probability. However, we are not restricted to simple point estimation; rather, our objective encompasses a whole dataset. This requires that we generalize $\lambda_{C,F}$ into an estimator for a given dataset.

3.4. Lemma 3

NPFiLM can be expressed as a decomposable scoring function.

Proof. Let there be n non-class query variables with a single class variable in the dataset D. We can achieve the reduction of simple point estimation into a generalized maximum a posteriori inference notation, namely NPFiLM, as below:

$$NPFiLM(D, G) = \sum_{i=1}^{n} \max \arg (X_i, Pa(X_i), C, D) \qquad (13)$$

The simple calculation between two feature variables, in order to turn them into a decomposable value, can be denoted as $\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk}$, where i is the feature iterator, j the parent-configuration iterator, k the feature-state iterator and c the class iterator. If we consider the inclusion of the class variable, then a minor change turns this value of estimation into $\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijck}$.

Now, we plug this value into Equation (13) such that:

$$NPFiLM(D, G) = \sum_{i=1}^{n} \left[ \max \arg_{c} \left( \frac{1}{|C_c|} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijck} \right) \right] \qquad (14)$$

If we introduce a directed link from node $X_p$ to node $X_q$, then only the local value of NPFiLM will change. If this inclusion causes an improvement in the structure G, then this improvement can be described as:

$$\Delta(X_p \rightarrow X_q) = \sum_{i=1}^{n} \max \arg \left( X_q, Pa(X_q) \cup \{X_p\}, C, D \right) - \sum_{i=1}^{n} \max \arg \left( X_q, Pa(X_q), C, D \right) \qquad (15)$$

Equations (14) and (15) point out that only a simple frequency calculation is involved, relating the counts of the feature node, its parent nodes and the class node. Hence it can be inferred that NPFiLM belongs to the class of decomposable scoring functions. The decomposition property is certainly useful when a searching algorithm like K2 or hill climbing needs to recalculate the final score after insertion or removal of an arc from the DAG.
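The practical value of this decomposability is that an arc change touches only one local term, so a search procedure can update a cached total instead of rescoring the whole network. The sketch below shows only that bookkeeping; the `local_score` callable is a generic placeholder, not NPFiLM itself, and all names are illustrative:

```python
def total_score(structure, local_score):
    """A decomposable score is a sum of per-node local terms (cf. Eqs. (2) and (13))."""
    return sum(local_score(node, parents) for node, parents in structure.items())

def score_after_adding_arc(structure, local_score, cached_total, parent, child):
    """Only the child's local term changes when parent -> child is added (cf. Eq. (15))."""
    old_local = local_score(child, structure[child])
    new_local = local_score(child, structure[child] + [parent])
    return cached_total - old_local + new_local, new_local - old_local

if __name__ == "__main__":
    # Placeholder local score that simply prefers small parent sets (illustrative only).
    local_score = lambda node, parents: -len(parents)
    structure = {"c": [], "x1": ["c"], "x2": ["c"]}
    total = total_score(structure, local_score)
    new_total, gain = score_after_adding_arc(structure, local_score, total, "x1", "x2")
    print(total, new_total, gain)   # -2 -3 -1
```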

At this point we shall again describe the phenomenon leading to our motivation for a novel scoring function. According to this phenomenon, the number of possible configurations for a query variable (node), say $X_i$, starts getting large with every increment of potential candidates out of the set of all query variables in a queue. From this large set of factors, only those factors are useful which tend to contribute towards the explanation of a class state. However, this also becomes the reason for an essential observation. Let us consider the feature set and class variable as defined in Equations (7) and (8), and consider the last feature in the ordered list. Certainly this must be linked to the class variable with a precise discriminant value of joint probability in a BBN. The inclusion of a new query variable in its parent list will be restricted by a higher value of the discriminant. However, once a new node is linked, such chances become quite narrow, because the factors of the joint probability distribution start thinning with every new parent. It means that, in a randomly ordered set of features, there is a chance that we end up with a merely simple Naïve Bayes graph. We already mentioned in the introduction that the simple Naïve Bayes does not possess goodness of data fitting [13, 17, 19]. Moreover, there is no need to introduce a scoring function where the learnt structure is known a priori without any ambiguity or confusion. This raises the question of tackling the problem of the structure converging into a very thin model. A prior intelligent ordering of all query variables may give the answer to this problem.

3.5. Definition 2

Let the feature set be represented as $F = \{f_1, f_2, f_3, \ldots, f_n\}$; then the ordering weight of every feature will be calculated by the weight factor:

$$\omega_F = \lambda_{C,F} - \lambda_{F,C} \qquad (16)$$

The terms $\lambda_{C,F}$ and $\lambda_{F,C}$ play the role of existence restrictions. Let us assume both of them to be existence restrictions such that $(F, C) \in \lambda_{C,F}$: the link F → C elucidates the discriminant objective with respect to the class variable, whereas $(F, C) \in \lambda_{F,C}$: the link C → F points out the discriminant scoring value with respect to the feature.

In our earlier research, the correct topological ordering between two features was discussed at the point estimation level [11]. We discussed a parametric property of the Bayesian Belief network with a suggestion of possible correctness in the point estimation [11]. It was reported there, by means of proposing a point estimation metric (Integration to Segregation, I2S), that well renowned scoring functions fail in a lot of situations to precisely capture the causal relationship between two query variables that is needed to describe the correct point estimation topology; this ultimately makes the selection of potential neighbor nodes and parental nodes unsuitable. However, I2S was shown to be capable of rightly identifying it in the majority of the cases as compared to Bayes, MDL, BDeu and Entropy [11]. Madden [10] pointed out that a model in which a class node is fixed at the root level can bring an increased goodness of data fitting. This type of scheme was named in the literature "selective BN augmented NBC" [10]. Hence the latter score value must be subtracted from the first value, which results in a weighted score vector as depicted in Equation (17):

$$\lambda_{F,C} = \left[ \max \arg_{c} \left( \sum_{f} \frac{P(f, c)}{\vartheta(C)} \right) \right] \qquad (17)$$

A simple descending-order sort is applied to the weights achieved from Equation (16), which results in an ordered list of input variables:

$$\overleftarrow{F} = \{ \omega_f^{\,i} \mid i = 1 \ldots n \} \qquad (18)$$

If we plug this ordered set into Equation (16) then it appears in the following form:

$$NPFiLM(D, G, \overleftarrow{X}) = \sum_{i=1}^{n} \left[ \max \arg_{c} \left( \frac{1}{|C_c|} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijck} \right) \right] \qquad (19)$$

3.6. Lemma 4

The ordered set instantiated by a mutual-information-oriented metric can convert NPFiLM into a well behaved scoring function.

Proof. Let us consider a set of n unsorted query variables $F = \{f_1, f_2, f_3, \ldots, f_n\}$. Let us start from any of the succeeding query variables, say the jth query variable $f_j$, such that it lies somewhere in the trivial unordered list denoted by $\{\rho, f_j, \varsigma\}$ with $\rho \cap \varsigma = \phi$, where ρ is the set of predecessor nodes and ς is the set of successor nodes. It has already been described that K2 incrementally adds, as a parental node of a given node, a candidate out of a given ordering list whose addition possibly increments the net score of the underlying structure. The K2 search algorithm can select any of the parent set appearing before $f_j$. If a feature $f_s$ is present such that it can considerably contribute towards the net score of the structure, then the expression $\forall (f_s, c) \in \rho,\; f_s \rightarrow f_j : \mathrm{true}$ must hold, and otherwise the expression $\forall (f_s, c) \in \varsigma,\; f_s \rightarrow f_j : \mathrm{false}$ should hold. Hence it can be concluded that an intelligently ordered set can beget NPFiLM into a well behaved scoring function. One can notice that NPFiLM does not possess any a priori parametric value for fine-tuning purposes. Moreover, there is no concept of a penalty factor in the final mathematical expression of the proposed scoring function.
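Lemma 4 only requires that the features be pre-sorted by some mutual-information-oriented weight before K2 consumes them. The sketch below uses plain mutual information with the class as a stand-in for the weight of Eq. (16); this stand-in, the names and the toy data are assumptions, not the exact ω computation of the paper:

```python
from collections import Counter
from math import log

def mutual_information(data, feature, class_var):
    """MI(feature; class) from empirical frequencies (stand-in weight for Eq. (16))."""
    n = len(data)
    pf = Counter(r[feature] for r in data)
    pc = Counter(r[class_var] for r in data)
    pfc = Counter((r[feature], r[class_var]) for r in data)
    return sum((nfc / n) * log((nfc / n) / ((pf[f] / n) * (pc[c] / n)))
               for (f, c), nfc in pfc.items())

def ordered_features(data, features, class_var):
    """Eq. (18) in spirit: features sorted by descending weight, as consumed by K2."""
    weights = {f: mutual_information(data, f, class_var) for f in features}
    return sorted(features, key=weights.get, reverse=True), weights

data = [{"a": 0, "b": 0, "cls": 0}, {"a": 0, "b": 1, "cls": 0},
        {"a": 1, "b": 0, "cls": 1}, {"a": 1, "b": 1, "cls": 1}]
order, w = ordered_features(data, ["a", "b"], "cls")
print(order, w)   # 'a' carries all of the class information, 'b' none
```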

4. Information theoretic interpretation

Before describing empirical results, we illustrate that NPFiLM has an interesting mutual coherence interpretation with its roots in information theory. We revive some basic concepts of entropy and mutual information and their mapping to each other. Let X be a feature and C a class; then the joint entropy is a function of the conditional entropy of X given C, the conditional entropy of the class given X, and the Mutual Information (MI):

$$H(X, C) = H(X \mid C) + MI(X, C) + H(C \mid X) \qquad (20)$$

The conditional entropy of X given C can be split into two components; the first component maximizes the discriminant function, while the second component is a trivial part which can be ignored. In fact, the notion of NPFiLM has its roots in this splitting of the conditional entropy of X given C, as shown below:

$$H(X \mid C) = NPFiLM(X \mid C_m) + NPFiLM(X \mid C_n) \qquad (21)$$

This establishes the relationship of NPFiLM to conditional entropy. On the other hand, the conditional entropy itself is a component of the entropy of a single variable (X in this case) and the mutual information between the two variables (X and C). This is expressed as:

$$H(X \mid C) = H(X) - MI(X, C) \qquad (22)$$

From Equations (21) and (22), one can easily conclude and establish the relationship of NPFiLM and MI:

$$H(X) - MI(X, C) = NPFiLM(X \mid C_m) + NPFiLM(X \mid C_n) \qquad (23)$$

Notice that the proposed measure can be expressed through entropy and Mutual Information, giving the notion that NPFiLM has an explicit illustration in entropy and mutual information:

$$NPFiLM(X \mid C_m) = H(X) - MI(X, C) - NPFiLM(X \mid C_n) \qquad (24)$$

However, an interesting question arises: what is the impact of NPFiLM as a scoring metric compared to Mutual Information and to our earlier proposed scoring metric I2S? Figure 5 shows the surface chart of MI when the numbers of states in both of the features are gradually increased. Notice that the MI value always behaves symmetrically, while the relationship between two nodes in a BBN never yields a symmetric accuracy. Secondly, for a uniform distribution of two variables it only reaches its maxima in certain cases, when a specific number of states in both of the features is achieved. For NPFiLM, on the contrary (Fig. 7), the number of states plays an important role: as the number of class states increases, its joint distribution also gets sparser and thinner in each individual distribution. Such a well behaved phenomenon is also observed in the BBN classifier, where an equal distribution of the class node with respect to the other query variables results in poor classification accuracy. Moreover, we also demonstrate the behavior of I2S: although the values (see Fig. 6) point out that I2S is also asymmetric in behavior like NPFiLM, which is in quite harmony with graphical learning, it is not as well behaved as NPFiLM. It shows that a change in the state counts of the query variables delivers a drastic change in the corresponding value of I2S, while a state count change in the class node gives only a slow change in the I2S value. Keeping in view this limitation of I2S and MI, this phenomenon is adjusted in NPFiLM to tailor it into a suitable scoring function for BBN.

Fig. 5. Surface plot for joint distribution of two variables for Mutual Information (MI).
Fig. 6. Surface plot for joint distribution of two variables for I2S.
Fig. 7. Surface plot for joint distribution of two variables for NPFiLM.
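Equations (20) and (22) are standard identities and can be confirmed numerically from any joint distribution; a small check with an invented joint table:

```python
from math import log2

# Invented joint distribution P(X, C) for two binary variables.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}
pc = {c: sum(v for (_, cc), v in p.items() if cc == c) for c in (0, 1)}

H_X = -sum(v * log2(v) for v in px.values())
H_C = -sum(v * log2(v) for v in pc.values())
H_XC = -sum(v * log2(v) for v in p.values())      # joint entropy H(X, C)
H_X_given_C = H_XC - H_C
H_C_given_X = H_XC - H_X
MI = sum(v * log2(v / (px[x] * pc[c])) for (x, c), v in p.items())

print(abs(H_X_given_C - (H_X - MI)) < 1e-12)                  # Eq. (22)
print(abs(H_XC - (H_X_given_C + MI + H_C_given_X)) < 1e-12)   # Eq. (20)
```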

5. Empirical validation of NPFiLM

In this study we have evaluated our proposed scoring function over fifty datasets publicly available at the UCI machine learning repository [1]. We selected datasets representing various domains of interest in order to avoid any possible implicit bias towards a specific scoring function. There are numerous classification performance metrics, usually coined in a single term known as class imbalance measures. These include selectivity, specificity, f-measure and many more. However, each of them is suited to examining the classification performance from a specific angle. Simple accuracy and error rate are two metrics which are quite generalized and are not perceived in accordance with some specific dimension or perspective. Hence we have selected the error rate, where the classification with the lowest error rate is preferred.

The error rate is the ratio between the sum of False Positives and False Negatives and the sum of all four quadrant values of a typical confusion matrix. It is mathematically defined as:

$$\mathrm{Error\ rate} = \frac{FP + FN}{TP + TN + FP + FN} \qquad (25)$$

In Equation (25), TP and TN denote the positive and negative classes being classified correctly. The results are shown by comparing six well known BBN scoring functions. The scoring function comparison has been achieved by adjusting a parameter known as the number of parents. This value is denoted by p in all of the figures in this section, and ranges from 2 to 4. The reason for restricting this value to four is that at a value of 5 and above, numerous existing scoring functions were not capable of parameter learning: if the features contain numerous distinct states, then the multiplication of these distinct states for more than four parent features results in a very large CPT for which the common memory resources get exhausted. A solution for such a problem lies in stream mining, where the in-transit results are kept on hard disk for further computational tasks; however, this aspect of research is out of the scope of this study. The other parameters of the experimental setup include the value of cross validation. It has been shown by numerous researchers that setting a larger value for cross validation is more suitable for evaluation purposes. However, a large value means repeated experiments. Researchers have agreed that a good trade-off is a value of ten, where we are almost very near to optimized results. Moreover, the same is true for the experimental setup in weka [9].

There are some common details which are applicable to all of the figures in this section. Firstly, evaluation is presented in the form of win-neutral-lose: it is assessed in how many datasets the proposed scoring function wins (lower error rate) or loses (higher error rate) while comparing against every other scoring function (see Figs. 8 and 10). If no significant difference is observed, then both of the competing scoring functions have delivered more or less the same data-fitted model. Secondly, the proposed scoring function is compared with every other scoring function in terms of average classification rates (see Figs. 9 and 11).
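Equation (25) and the win/neutral/lose protocol reduce to a few lines. The sketch below computes the error rate from a confusion matrix and tallies per-dataset wins between two scoring functions; the fixed margin stands in for a proper significance test, and all numbers are invented:

```python
def error_rate(tp, tn, fp, fn):
    """Eq. (25): misclassified cases over all cases of the confusion matrix."""
    return (fp + fn) / (tp + tn + fp + fn)

def win_neutral_lose(errors_a, errors_b, margin=0.005):
    """Per-dataset win/neutral/lose tally for scoring function A against B."""
    win = sum(1 for a, b in zip(errors_a, errors_b) if b - a > margin)
    lose = sum(1 for a, b in zip(errors_a, errors_b) if a - b > margin)
    return win, len(errors_a) - win - lose, lose

print(error_rate(tp=50, tn=40, fp=6, fn=4))                      # 0.1
print(win_neutral_lose([0.10, 0.20, 0.15], [0.12, 0.20, 0.19]))  # (2, 1, 0)
```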

Fig. 8. Comparison of error rate of NPFiLM vs. other scoring functions using greedy search K2. P indicates the maximum number of parental nodes.
Fig. 9. Average error rate of the scoring functions over 50 datasets using greedy search K2.

The results of Figs. 8 and 9 were obtained using the greedy search mechanism whose implementation in BBN is known as K2. These figures make it quite evident that the proposed scoring function mostly outperforms the others, whereas the worst performance was observed in the case of the Entropy scoring function. The results of Bayes, AIC, BDeu and MDL were approximately comparable, except in the case of MDL with a maximum of four parental nodes. This empirical evidence indicates that MDL is not suitable for a thick network. Some datasets were so large in number of features that a few of the classifiers did not give a result in reasonable time; thus the results for these datasets have been excluded from the average performance comparison.

In order to validate the performance accuracy of NPFiLM, we exercised various sessions of experiments from different angles. Previously we restricted ourselves to comparison using the K2 greedy algorithm. The greedy search algorithm is still popular in BBN [7]. Moreover, numerous researchers have argued over the performance of the K2 greedy algorithm. Lerner et al. [3] demonstrated that substituting the K2 search algorithm by the Hill Climbing (HCL) search technique has the potential to improve the accuracy of the BBN structure. This motivated us to collect the results of these well known scoring functions using HCL. Again, we set the default settings of BBN using HCL search and made the comparison to our introduced measure NPFiLM. The results in general validate the findings of Lerner et al. [3] that hill climbing is potentially prone to yield better results (see Figs. 10 and 11); however, the error rate / accuracy results for NPFiLM were still better compared to its peer scoring functions. One dataset, arrhythmia, was significantly wider in the size of its feature set: it contains 280 features and sixteen classes, albeit its sample size is short (only 452). We noticed that the scoring functions Bayes, AIC, BDeu, MDL and Entropy did not give a result even after a pass of 48 hours when using HCL. This left us with no option but to omit this dataset's result from the figures.

Fig. 10. Win/neutral/lose: NPFiLM BBN versus peer scoring functions using HCL.
Fig. 11. Average error rate: NPFiLM versus peer scoring functions using HCL.

During the experiments of comparison with peer scoring functions, it was noticed that the Entropy scoring function always yields a very thick and dense network in which a significantly large number of arcs are established. Although adding each arc in general increases the training accuracy, it does so at the cost of reduced accuracy on the test samples. In fact there is a need to select a scoring function where neither a very thick nor a very thin network is generated, so that over-fitting and under-fitting can both be kept at their optimum level. As far as BDeu is concerned, its performance is mainly characterized by the alpha value which controls the penalty factor; the same is true for the Bayes scoring metric. Unluckily, no mechanism is known which can dictate in advance the specific value of alpha for any particular dataset. An optimized value of alpha for each dataset is always variable and cannot be predicted in advance. As far as AIC and MDL are concerned, we term both of them sister scoring functions because they differ only in the way the penalty factor is controlled.

Some techniques were used to improve the performance of MDL [8], where the searching algorithm during the construction of the network was revised: the greedy algorithm K2 [5] (Cooper and Herskovits) was replaced by a branch and bound algorithm with MDL as the scoring function. The superiority of MDL over its peer scoring function Bayes was also elaborated in [12]. The authors analyzed the two scoring functions MDL and Bayes and concluded that MDL is superior in time complexity as compared to Bayes as the number of nodes gradually increases. Secondly, it is good practice to use prior knowledge or expert knowledge for developing a pre-ordered set of the nodes before subjecting the variable nodes to K2 search.

Another finding in this study concerns the behavior of the Bayes scoring function with respect to over-learning: simply put, Bayes can lead to over-fitting in numerous cases.

A general analysis of the figures in this section reveals that the accuracy obtained with NPFiLM is stable even though a very important influential parameter, the parent count, is altered from 2 to 4. In fact, this measure keeps a balance between the scarcity and the high density of the network with only a small alteration. Rajaram et al. [14] stated that "small alterations" to Naïve Bayes are in abundance in the literature, in pursuit of correct adjustments for the specific nature of a dataset. A true balance prevents the network from growing very dense or very thin. This clearly keeps the test network from suffering from the over-fitting or under-fitting phenomenon. However, in its peer methods, the density or scarcity is not well controlled. In the case of BDeu and Bayes, the value of alpha greatly determines the density level. The scoring function Entropy was observed to be prone to giving a dense network in the majority of the cases, whereas MDL and AIC give relatively better performance. However, we have shown that in all of the cases, using K2 and hill climbing with variation in the size of the potential parent set, the proposed measure outperformed.

6. Conclusion

The essential innovation in this study is that whenever the learnt structure gets complex due to a higher number of parents, the data fitting phenomenon usually goes in favor of higher accuracy. However, the dark side of this aspect is the additional complexity in the network, as the network carries heavily linked nodes at the cost of parameter learning. Moreover, the data fitting phenomenon starts favoring the training data and bypassing the test data. NPFiLM has been designed to keep a balance between data over-fitting and data under-fitting. The study delivered the analysis and comparison from different angles using greedy search with various settings of the parental set. NPFiLM performed significantly better than the others in all of these comparisons. The proposed measure is extrapolated to construct a realistic network which is likely to tally with the practical gist of domain experts. According to our findings, BIC has shown its capability to behave well for large sample sizes, but for small datasets its error rate is significantly high.

References

[1] A. Frank and A. Asuncion, UCI Repository of machine learning databases. Tech. Rep., Univ. California, Sch. Inform. Comp. Sci., Irvine, CA. Available from: http://www.ics.uci.edu/~mlearn/MLRepository.html (2010), accessed April 2013.
[2] A.M. Carvalho, T. Roos, A.L. Oliveira and P. Myllymäki, Discriminative learning of Bayesian networks via factorized conditional log-likelihood, Journal of Machine Learning Research 12 (2011), 2181–2210.
[3] B. Lerner and R. Malka, Investigation of the K2 algorithm in learning Bayesian network classifiers, Applied Artificial Intelligence 25(1) (2011), 74–96.
[4] F.V. Jensen and T.D. Nielsen, Bayesian Networks and Decision Graphs, Information Science and Statistics, Springer, New York (2007). ISBN 978-0-387-68281-5.
[5] G.F. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (1992), 309–347.
[6] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (1974), 716–723.
[7] H. Xu, J. Yang, P. Jia and Y. Ding, Effective structure learning for estimation of distribution algorithms via L1-regularized Bayesian networks, International Journal of Advanced Robotic Systems 10(17) (2013), 1–14.
[8] J. Suzuki, Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using the branch and bound technique, IEICE Transactions on Information and Systems 82(2) (1999), 356–367.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I.H. Witten, The WEKA data mining software: An update, ACM SIGKDD Explorations 11 (2009), 10–18.
[10] M.G. Madden, On the classification performance of TAN and general Bayesian networks, Knowledge-Based Systems 22(7) (2009), 489–495.
[11] M. Naeem and S. Asghar, A novel mutual dependence measure for structure learning, Journal of the National Science Foundation of Sri Lanka 41 (2013), 203–208.
[12] M.Z. Zgurovskii, P.I. Bidyuk and A.N. Terent'ev, Methods of constructing Bayesian networks based on scoring functions, Cybernetics and Systems Analysis 44(2) (2008), 219–224.
[13] N. Friedman, D. Geiger and M. Goldszmidt, Bayesian network classifiers, Machine Learning 29 (1997), 131–163.
[14] R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran and D.S.J. Prakash, NB+: An improved Naïve Bayesian algorithm, Knowledge-Based Systems 24(5) (2011), 563–569.
[15] S. Yang and K.C. Chang, Comparison of score metrics for Bayesian network learning, in: Proc. IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3 (1996), 2155–2160.
[16] T. Van Allen and R. Greiner, Model selection criteria for learning belief nets: An empirical comparison, in: Proc. ICML'00 (2000), 1047–1054.
[17] W.L. Buntine, Theory refinement on Bayesian networks, in: Proc. UAI (1991), 52–60.
[18] W. Lam and F. Bacchus, Learning Bayesian belief networks: An approach based on the MDL principle, Computational Intelligence 10 (1994), 269–294.
[19] Z. Liu, B. Malone and C. Yuan, Empirical evaluation of scoring functions for Bayesian network model selection, BMC Bioinformatics 13(Suppl 15) (2012), S14.
