Lecture No. 03

The document discusses Bayesian classifiers and the Bayes theorem. It explains how the Bayes theorem can be used to solve classification problems by computing posterior probabilities. It then describes the naive Bayes classifier, which assumes that the attributes are conditionally independent given the class, making it practical to estimate the probabilities needed for classification from training data.


5.3 Bayesian Classifiers

A Bayesian classifier combines prior knowledge of the classes with new evidence gathered from data. The use of the Bayes theorem for solving classification problems will be explained, followed by a description of two implementations of Bayesian classifiers: the naive Bayes classifier and the Bayesian belief network.

5.3.1 Bayes Theorem

Consider a football game between two rival teams: Team 0 and Team 1. Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the games won by Team 0, only 30% of them come from playing on Team 1's football field. On the other hand, 75% of the victories for Team 1 are obtained while playing at home. If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner?

This question can be answered by using the well-known Bayes theorem. For
completeness, we begin with some basic definitions from probability theory.
Readers who are unfamiliar with concepts in probability may refer to Appendix C for a brief review of this topic.
Let X and Y be a pair of random variables. Their joint probability, P(X = x, Y = y), refers to the probability that variable X will take on the value x and variable Y will take on the value y. A conditional probability is the probability that a random variable will take on a particular value given that the outcome for another random variable is known. For example, the conditional probability P(Y = y|X = x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x. The joint and conditional probabilities for X and Y are related in the following way:

P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y).   (5.9)

Rearranging the last two expressions in Equation 5.9 leads to the following formula, known as the Bayes theorem:

P(Y|X) = P(X|Y) P(Y) / P(X).   (5.10)

The Bayes theorem can be used to solve the prediction problem stated
at the beginning of this section. For notational convenience, let X be the
random variable that represents the team hosting the match and Y be the
random variable that represents the winner of the match. Both X and Y can

take on values from the set {0,1}. We can summarize the information given
in the problem as follows:

Probability Team 0 wins is P(Y = 0) = 0.65.
Probability Team 1 wins is P(Y = 1) = 1 − P(Y = 0) = 0.35.
Probability Team 1 hosted the match it won is P(X = 1|Y = 1) = 0.75.
Probability Team 1 hosted the match won by Team 0 is P(X = 1|Y = 0) = 0.3.

Our objective is to compute P(Y = 1|X = 1), which is the conditional probability that Team 1 wins the next match it will be hosting, and compare it against P(Y = 0|X = 1). Using the Bayes theorem, we obtain

P(Y = 1|X = 1) = P(X = 1|Y = 1) × P(Y = 1) / P(X = 1)
               = P(X = 1|Y = 1) × P(Y = 1) / [P(X = 1, Y = 1) + P(X = 1, Y = 0)]
               = P(X = 1|Y = 1) × P(Y = 1) / [P(X = 1|Y = 1) P(Y = 1) + P(X = 1|Y = 0) P(Y = 0)]
               = (0.75 × 0.35) / (0.75 × 0.35 + 0.3 × 0.65)
               = 0.5738,

where the law of total probability (see Equation C.5 on page 722) was applied in the second line. Furthermore, P(Y = 0|X = 1) = 1 − P(Y = 1|X = 1) = 0.4262. Since P(Y = 1|X = 1) > P(Y = 0|X = 1), Team 1 has a better chance than Team 0 of winning the next match.
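The arithmetic above is easy to verify in a few lines of Python. The sketch below simply restates the probabilities from the problem statement (the variable names are our own, chosen for illustration):

# Probabilities given in the problem statement
p_y1 = 0.35           # P(Y=1): Team 1 wins
p_y0 = 0.65           # P(Y=0): Team 0 wins
p_x1_given_y1 = 0.75  # P(X=1|Y=1): Team 1 hosted the matches it won
p_x1_given_y0 = 0.30  # P(X=1|Y=0): Team 1 hosted the matches Team 0 won

# Law of total probability: P(X=1)
p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0

# Bayes theorem: P(Y=1|X=1)
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1
print(round(p_y1_given_x1, 4))  # 0.5738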

5.3.2 Using the Bayes Theorem for Classification

Before describing how the Bayes theorem can be used for classification, let
us formalize the classification problem from a statistical perspective. Let X
denote the attribute set and Y denote the class variable. If the class variable
has a non-deterministic relationship with the attributes, then we can treat
X and Y as random variables and capture their relationship probabilistically
using P(Y|X). This conditional probability is also known as the posterior probability for Y, as opposed to its prior probability, P(Y).
During the training phase, we need to learn the posterior probabilities P(Y|X) for every combination of X and Y based on information gathered from the training data. By knowing these probabilities, a test record X' can be classified by finding the class Y' that maximizes the posterior probability,

P(Y'|X'). To illustrate this approach, consider the task of predicting whether a loan borrower will default on their payments. Figure 5.9 shows a training set with the following attributes: Home Owner, Marital Status, and Annual Income. Loan borrowers who defaulted on their payments are classified as Yes, while those who repaid their loans are classified as No.

a."'
"""""od"

Figure5.9.Training
setforpredicting problem.
theloandefault

Suppose we are given a test record with the following attribute set: X = (Home Owner = No, Marital Status = Married, Annual Income = $120K). To classify the record, we need to compute the posterior probabilities P(Yes|X) and P(No|X) based on information available in the training data. If P(Yes|X) > P(No|X), then the record is classified as Yes; otherwise, it is classified as No.
Estimating the posterior probabilities accurately for every possible combination of class label and attribute value is a difficult problem because it requires a very large training set, even for a moderate number of attributes. The Bayes theorem is useful because it allows us to express the posterior probability in terms of the prior probability P(Y), the class-conditional probability P(X|Y), and the evidence, P(X):

P(Y|X) = P(X|Y) × P(Y) / P(X).   (5.11)
When comparing the posterior probabilities for different values of Y, the de-
nominator term, P(X), is always constant, and thus, can be ignored. The

prior probability P(Y) can be easily estimated from the training set by computing the fraction of training records that belong to each class. To estimate the class-conditional probabilities P(X|Y), we present two implementations of Bayesian classification methods: the naive Bayes classifier and the Bayesian belief network. These implementations are described in Sections 5.3.3 and 5.3.5, respectively.

5.3.3 Naive Bayes Classifier


A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label y. The conditional independence assumption can be formally stated as follows:

&
P ( X I Y: a ) : f r l x S v : 11, (5.12)
i:l

where each attribute set X : {Xr, X2,..., X4} consistsof d attributes.

Conditional Independence

Before delving into the details of how a naive Bayes classifier works, let us examine the notion of conditional independence. Let X, Y, and Z denote three sets of random variables. The variables in X are said to be conditionally independent of Y, given Z, if the following condition holds:

P(X|Y, Z) = P(X|Z).   (5.13)

An example of conditional independence is the relationship between a person's arm length and his or her reading skills. One might observe that people with longer arms tend to have higher levels of reading skills. This relationship can be explained by the presence of a confounding factor, which is age. A young
child tends to have short arms and lacks the reading skills of an adult. If the
age of a person is fixed, then the observed relationship between arm length
and reading skills disappears. Thus, we can conclude that arm length and
reading skills are conditionally independent when the age variable is fixed.

The conditional independence between X and Y can also be written into a form that looks similar to Equation 5.12:

P(X, Y|Z) = P(X, Y, Z) / P(Z)
          = [P(X, Y, Z) / P(Y, Z)] × [P(Y, Z) / P(Z)]
          = P(X|Y, Z) × P(Y|Z)
          = P(X|Z) × P(Y|Z),   (5.14)

where Equation 5.13 was used to obtain the last line of Equation 5.14.

How a Naive Bayes Classifier Works

With the conditional independence assumption, instead of computing the


class-conditional probability for every combination of X, we only have to estimate the conditional probability of each X_i, given Y. The latter approach is more practical because it does not require a very large training set to obtain a good estimate of the probability.
To classify a test record, the naive Bayes classifier computes the posterior
probability for each class Y:

P(Y|X) = P(Y) ∏_{i=1}^{d} P(X_i|Y) / P(X).   (5.15)

Since P(X) is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term, P(Y) ∏_{i=1}^{d} P(X_i|Y). In the next two subsections, we describe several approaches for estimating the conditional probabilities P(X_i|Y) for categorical and continuous attributes.

Estimating Conditional Probabilities for Categorical Attributes


For a categorical attribute X_i, the conditional probability P(X_i = x_i|Y = y) is estimated according to the fraction of training instances in class y that take on a particular attribute value x_i. For example, in the training set given in Figure 5.9, three out of the seven people who repaid their loans also own a home. As a result, the conditional probability P(Home Owner = Yes|No) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers who are single is given by P(Marital Status = Single|Yes) = 2/3.
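This counting procedure is easy to reproduce in code. The sketch below uses a small pandas DataFrame as an illustrative stand-in for the training set of Figure 5.9 (the individual rows are assumed; they are chosen so that the fractions quoted in the text come out correctly), and the helper name cond_prob is our own:

import pandas as pd

# Illustrative reconstruction of the training set in Figure 5.9
# (rows are assumed; they reproduce the fractions quoted in the text).
df = pd.DataFrame({
    "home_owner": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "marital":    ["Single", "Married", "Single", "Married", "Divorced",
                   "Married", "Divorced", "Single", "Married", "Single"],
    "defaulted":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

def cond_prob(data, attr, value, cls):
    # Estimate P(attr = value | defaulted = cls) as a simple fraction of counts
    subset = data[data["defaulted"] == cls]
    return (subset[attr] == value).sum() / len(subset)

print(cond_prob(df, "home_owner", "Yes", "No"))   # 3/7 = 0.4286...
print(cond_prob(df, "marital", "Single", "Yes"))  # 2/3 = 0.6667...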

Estimating Conditional Probabilities for Continuous Attributes

There are two ways to estimate the class-conditional probabilities for contin-
uous attributes in naive Bayes classifiers:

1. We can discretize each continuous attribute and then replace the con-
tinuous attribute value with its corresponding discrete interval. This
approach transforms the continuous attributes into ordinal attributes.
The conditional probability P(X_i|Y = y) is estimated by computing the fraction of training records belonging to class y that fall within the corresponding interval for X_i. The estimation error depends on the discretization strategy (as described in Section 2.3.6 on page 57), as well as the number of discrete intervals. If the number of intervals is too large, there are too few training records in each interval to provide a reliable estimate for P(X_i|Y). On the other hand, if the number of intervals is too small, then some intervals may aggregate records from different classes and we may miss the correct decision boundary.

2. We can assume a certain form of probability distribution for the contin-


uous variable and estimate the parameters of the distribution using the
training data. A Gaussian distribution is usually chosen to represent the class-conditional probability for continuous attributes. The distribution is characterized by two parameters, its mean, μ, and variance, σ². For each class y_j, the class-conditional probability for attribute X_i is

P(X_i = x_i|Y = y_j) = 1/(√(2π) σ_ij) exp(−(x_i − μ_ij)² / (2σ_ij²)).   (5.16)

The parameter μ_ij can be estimated based on the sample mean of X_i (x̄) for all training records that belong to the class y_j. Similarly, σ_ij² can be estimated from the sample variance (s²) of such training records. For example, consider the annual income attribute shown in Figure 5.9. The sample mean and variance for this attribute with respect to the class No are

x̄ = (125 + 100 + 70 + ... + 75) / 7 = 110
s² = [(125 − 110)² + (100 − 110)² + ... + (75 − 110)²] / (7 − 1) = 2975
s = √2975 = 54.54.

Given a test record with annual income equal to $120K, we can compute its class-conditional probability as follows (this calculation is reproduced in a short code sketch below):

P(Income = 120|No) = 1/(√(2π) × 54.54) exp(−(120 − 110)² / (2 × 2975)) = 0.0072.

Note that the preceding interpretation of class-conditional probability is somewhat misleading. The right-hand side of Equation 5.16 corresponds to a probability density function, f(X_i; μ_ij, σ_ij). Since the function is continuous, the probability that the random variable X_i takes a particular value is zero. Instead, we should compute the conditional probability that X_i lies within some interval, x_i and x_i + ε, where ε is a small constant:

P(x_i ≤ X_i ≤ x_i + ε|Y = y_j) = ∫_{x_i}^{x_i+ε} f(X_i; μ_ij, σ_ij) dX_i
                               ≈ f(x_i; μ_ij, σ_ij) × ε.   (5.17)

Since ε appears as a constant multiplicative factor for each class, it cancels out when we normalize the posterior probability for P(Y|X). Therefore, we can still apply Equation 5.16 to approximate the class-conditional probability P(X_i|Y).
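The density value computed above for the annual income attribute can be reproduced with a minimal sketch that plugs the estimates x̄ = 110 and s² = 2975 into Equation 5.16:

import math

mean_no, var_no = 110.0, 2975.0   # estimates for Annual Income given class No
x = 120.0                         # income of the test record (in $1000s)

# Gaussian density from Equation 5.16
density = math.exp(-(x - mean_no) ** 2 / (2 * var_no)) / math.sqrt(2 * math.pi * var_no)
print(round(density, 4))  # 0.0072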

Example of the Naive Bayes Classifier

Consider the data set shown in Figure 5.10(a). We can compute the class-
conditional probability for each categorical attribute, along with the sample
mean and variance for the continuous attribute using the methodology de-
scribed in the previous subsections. These probabilities are summarized in
Figure 5.10(b).
To predict the class label of a test record X = (Home Owner = No, Marital Status = Married, Income = $120K), we need to compute the posterior probabilities P(No|X) and P(Yes|X). Recall from our earlier discussion that these posterior probabilities can be estimated by computing the product of the prior probability P(Y) and the class-conditional probabilities ∏ P(X_i|Y), which corresponds to the numerator of the right-hand side term in Equation 5.15.
The prior probabilities of each class can be estimated by calculating the
fraction of training records that belong to each class. Since there are three records that belong to the class Yes and seven records that belong to the class

Figure 5.10. The naive Bayes classifier for the loan classification problem: (a) the training set of Figure 5.9 and (b) the estimated probabilities, summarized below.

P(Home Owner = Yes|No) = 3/7
P(Home Owner = No|No) = 4/7
P(Home Owner = Yes|Yes) = 0
P(Home Owner = No|Yes) = 1
P(Marital Status = Single|No) = 2/7
P(Marital Status = Divorced|No) = 1/7
P(Marital Status = Married|No) = 4/7
P(Marital Status = Single|Yes) = 2/3
P(Marital Status = Divorced|Yes) = 1/3
P(Marital Status = Married|Yes) = 0

For Annual Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

No, P(Yes) = 0.3 and P(No) = 0.7. Using the information provided in Figure 5.10(b), the class-conditional probabilities can be computed as follows:

P(X|No) = P(Home Owner = No|No) × P(Status = Married|No) × P(Annual Income = $120K|No)
        = 4/7 × 4/7 × 0.0072 = 0.0024.

P(X|Yes) = P(Home Owner = No|Yes) × P(Status = Married|Yes) × P(Annual Income = $120K|Yes)
         = 1 × 0 × 1.2 × 10⁻⁹ = 0.

Putting them together, the posterior probability for class No is P(No|X) = α × 7/10 × 0.0024 = 0.0016α, where α = 1/P(X) is a constant term. Using a similar approach, we can show that the posterior probability for class Yes is zero because its class-conditional probability is zero. Since P(No|X) > P(Yes|X), the record is classified as No.
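The whole calculation can be strung together in a short sketch. The numbers below are taken directly from Figure 5.10(b), and the comparison is made on the unnormalized numerator of Equation 5.15, since P(X) cancels when the classes are compared:

import math

def gaussian(x, mean, var):
    # Gaussian density of Equation 5.16
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Priors and class-conditional estimates from Figure 5.10(b)
prior = {"No": 7 / 10, "Yes": 3 / 10}
p_home_owner_no = {"No": 4 / 7, "Yes": 1.0}   # P(Home Owner = No | class)
p_married = {"No": 4 / 7, "Yes": 0.0}         # P(Marital Status = Married | class)
income_params = {"No": (110.0, 2975.0), "Yes": (90.0, 25.0)}  # (mean, variance)

scores = {}
for cls in ("No", "Yes"):
    mean, var = income_params[cls]
    likelihood = p_home_owner_no[cls] * p_married[cls] * gaussian(120.0, mean, var)
    scores[cls] = prior[cls] * likelihood     # numerator of Equation 5.15

print(scores)                        # class Yes scores 0 because P(Married|Yes) = 0
print(max(scores, key=scores.get))   # 'No'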

M-estimate of Conditional Probability

The preceding example illustrates a potential problem with estimating poste-


rior probabilities from training data. If the class-conditional probability for
one of the attributes is zero, then the overall posterior probability for the class
vanishes. This approach of estimating class-conditional probabilities using
simple fractions may seem too brittle, especially when there are few training
examples available and the number of attributes is large.
In a more extreme case, if the training examples do not cover many of
the attribute values, we may not be able to classify some of the test records.
For example, if P(Marital Status = Divorced|No) is zero instead of 1/7, then a record with attribute set X = (Home Owner = Yes, Marital Status = Divorced, Income = $120K) has the following class-conditional probabilities:

P(X|No) = 3/7 × 0 × 0.0072 = 0.

P(X|Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0.

The naive Bayes classifier will not be able to classify the record. This problem can be addressed by using the m-estimate approach for estimating the conditional probabilities:

P(x_i|y_j) = (n_c + mp) / (n + m),   (5.18)

where n is the total number of instances from class y_j, n_c is the number of training examples from class y_j that take on the value x_i, m is a parameter known as the equivalent sample size, and p is a user-specified parameter. If there is no training set available (i.e., n = 0), then P(x_i|y_j) = p. Therefore p can be regarded as the prior probability of observing the attribute value x_i among records with class y_j. The equivalent sample size determines the tradeoff between the prior probability p and the observed probability n_c/n.
In the example given in the previous section, the conditional probability P(Status = Married|Yes) = 0 because none of the training records for the class has the particular attribute value. Using the m-estimate approach with m = 3 and p = 1/3, the conditional probability is no longer zero:

P(Marital Status = Married|Yes) = (0 + 3 × 1/3) / (3 + 3) = 1/6.



If we assume p = 1/3 for all attributes of class Yes and p = 2/3 for all attributes of class No, then

P(X|No) = P(Home Owner = No|No) × P(Status = Married|No) × P(Annual Income = $120K|No)
        = 6/10 × 6/10 × 0.0072 = 0.0026.

P(X|Yes) = P(Home Owner = No|Yes) × P(Status = Married|Yes) × P(Annual Income = $120K|Yes)
         = 4/6 × 1/6 × 1.2 × 10⁻⁹ = 1.3 × 10⁻¹⁰.

The posterior probability for class No is P(No|X) = α × 7/10 × 0.0026 = 0.0018α, while the posterior probability for class Yes is P(Yes|X) = α × 3/10 × 1.3 × 10⁻¹⁰ = 4.0 × 10⁻¹¹α. Although the classification decision has not changed, the m-estimate approach generally provides a more robust way for estimating probabilities when the number of training examples is small.
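A minimal sketch of the m-estimate correction, using the counts from this example (the function name and argument names are our own):

def m_estimate(n_c, n, m, p):
    # m-estimate of P(x_i | y_j): (n_c + m*p) / (n + m), Equation 5.18
    return (n_c + m * p) / (n + m)

# P(Marital Status = Married | Yes): 0 of 3 records, with m = 3 and p = 1/3
print(m_estimate(0, 3, 3, 1/3))   # 1/6 = 0.1667...

# P(Home Owner = No | No): 4 of 7 records, with m = 3 and p = 2/3
print(m_estimate(4, 7, 3, 2/3))   # 6/10 = 0.6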

Characteristics of Naive Bayes Classifiers

Naive Bayes classifiers generally have the following characteristics:

o They are robust to isolated noise points because such points are averaged out when estimating conditional probabilities from data. Naive Bayes classifiers can also handle missing values by ignoring the example during model building and classification.

o They are robust to irrelevant attributes. If X_i is an irrelevant attribute, then P(X_i|Y) becomes almost uniformly distributed. The class-conditional probability for X_i has no impact on the overall computation of the posterior probability.

o Correlated attributes can degrade the performance of naive Bayes classifiers because the conditional independence assumption no longer holds for such attributes. For example, consider the following probabilities:

P(A = 0|Y = 0) = 0.4, P(A = 1|Y = 0) = 0.6,
P(A = 0|Y = 1) = 0.6, P(A = 1|Y = 1) = 0.4,

where A is a binary attribute and Y is a binary class variable. Suppose there is another binary attribute B that is perfectly correlated with A when Y = 0, but is independent of A when Y = 1. For simplicity, assume that the class-conditional probabilities for B are the same as for A. Given a record with attributes A = 0, B = 0, we can compute its posterior probabilities as follows:

P(Y = 0|A = 0, B = 0) = P(A = 0|Y = 0) P(B = 0|Y = 0) P(Y = 0) / P(A = 0, B = 0)
                      = 0.16 × P(Y = 0) / P(A = 0, B = 0),

P(Y = 1|A = 0, B = 0) = P(A = 0|Y = 1) P(B = 0|Y = 1) P(Y = 1) / P(A = 0, B = 0)
                      = 0.36 × P(Y = 1) / P(A = 0, B = 0).

If P(Y = 0) = P(Y = 1), then the naive Bayes classifier would assign the record to class 1. However, the truth is

P(A = 0, B = 0|Y = 0) = P(A = 0|Y = 0) = 0.4,

because A and B are perfectly correlated when Y = 0. As a result, the posterior probability for Y = 0 is

P(Y = 0|A = 0, B = 0) = P(A = 0, B = 0|Y = 0) P(Y = 0) / P(A = 0, B = 0)
                      = 0.4 × P(Y = 0) / P(A = 0, B = 0),

which is larger than that for Y = 1. The record should have been classified as class 0, as the short sketch below illustrates.
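The sketch below reproduces this arithmetic, assuming equal priors; it is an illustration rather than part of the original example:

# Class-conditional probabilities for A (and, by assumption, for B)
p_a0 = {0: 0.4, 1: 0.6}   # P(A=0 | Y=y)
p_y  = {0: 0.5, 1: 0.5}   # equal priors

# Naive Bayes scores for the record A=0, B=0 (treats B as independent of A given Y)
naive = {y: p_a0[y] * p_a0[y] * p_y[y] for y in (0, 1)}
print(naive)   # approximately {0: 0.08, 1: 0.18}: naive Bayes picks class 1

# True scores: B is a copy of A when Y=0, so P(A=0, B=0 | Y=0) = P(A=0 | Y=0) = 0.4
true = {0: 0.4 * p_y[0], 1: p_a0[1] * p_a0[1] * p_y[1]}
print(true)    # approximately {0: 0.2, 1: 0.18}: the correct class is 0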

5.3.4 Bayes Error Rate


Suppose we know the true probability distribution that governs P(X|Y). The Bayesian classification method allows us to determine the ideal decision boundary for the classification task, as illustrated in the following example.
Example 5.3. Consider the task of identifying alligators and crocodiles based on their respective lengths. The average length of an adult crocodile is about 15 feet, while the average length of an adult alligator is about 12 feet. Assuming

Figure 5.11. Comparing the likelihood functions of a crocodile and an alligator.

that their length X follows a Gaussian distribution with a standard deviation equal to 2 feet, we can express their class-conditional probabilities as follows:

P(X|Crocodile) = 1/(√(2π) × 2) exp[−(1/2)((X − 15)/2)²]   (5.19)
P(X|Alligator) = 1/(√(2π) × 2) exp[−(1/2)((X − 12)/2)²]   (5.20)
Figure 5.11 shows a comparison between the class-conditional probabilities for a crocodile and an alligator. Assuming that their prior probabilities are the same, the ideal decision boundary is located at some length x̂ such that

P(X = x̂|Crocodile) = P(X = x̂|Alligator).

Using Equations 5.19 and 5.20, we obtain

((x̂ − 15)/2)² = ((x̂ − 12)/2)²,

which can be solved to yield x̂ = 13.5. The decision boundary for this example is located halfway between the two means.

Figure 5.12. Representing probabilistic relationships using directed acyclic graphs.

When the prior probabilities are different, the decision boundary shifts toward the class with lower prior probability (see Exercise 10 on page 319). Furthermore, the minimum error rate attainable by any classifier on the given data can also be computed. The ideal decision boundary in the preceding example classifies all creatures whose lengths are less than x̂ as alligators and those whose lengths are greater than x̂ as crocodiles. The error rate of the classifier is given by the sum of the area under the posterior probability curve for crocodiles (from length 0 to x̂) and the area under the posterior probability curve for alligators (from x̂ to ∞):

Error = ∫_0^x̂ P(Crocodile|X) dX + ∫_x̂^∞ P(Alligator|X) dX.

The total error rate is known as the Bayes error rate.
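Under the stated Gaussian assumptions and equal priors, both the boundary and the resulting error rate can be computed numerically. The sketch below uses scipy.stats.norm and is an illustration of the idea rather than part of the original text:

from scipy.stats import norm

mu_alligator, mu_crocodile, sigma = 12.0, 15.0, 2.0

# With equal priors the boundary is where the two likelihoods are equal,
# i.e. halfway between the means.
x_hat = (mu_alligator + mu_crocodile) / 2
print(x_hat)  # 13.5

# Bayes error with equal priors: half the crocodile mass below x_hat plus
# half the alligator mass above x_hat (the mass below length 0 is negligible).
error = 0.5 * norm.cdf(x_hat, mu_crocodile, sigma) + \
        0.5 * norm.sf(x_hat, mu_alligator, sigma)
print(round(error, 4))  # 0.2266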

5.3.5 Bayesian Belief Networks

The conditional independence assumption made by naive Bayes classifiers may


seem too rigid, especially for classification problems in which the attributes
are somewhat correlated. This section presents a more flexible approach for modeling the class-conditional probabilities P(X|Y). Instead of requiring all the attributes to be conditionally independent given the class, this approach allows us to specify which pairs of attributes are conditionally independent. We begin with a discussion on how to represent and build such a probabilistic model, followed by an example of how to make inferences from the model.

Model Representation

A Bayesian belief network (BBN), or simply, Bayesian network, provides a


graphical representation of the probabilistic relationships among a set of ran-
dom variables. There are two key elements of a Bayesian network:

1. A directed acyclic graph (dag) encoding the dependence relationships


among a set of variables.

2. A probability table associatingeach node to its immediate parent nodes.

Consider three random variables, A, B, and C, in which A and B are independent variables and each has a direct influence on a third variable, C. The relationships among the variables can be summarized into the directed acyclic graph shown in Figure 5.12(a). Each node in the graph represents a variable, and each arc asserts the dependence relationship between the pair of variables. If there is a directed arc from X to Y, then X is the parent of Y and Y is the child of X. Furthermore, if there is a directed path in the network from X to Z, then X is an ancestor of Z, while Z is a descendant of X. For example, in the diagram shown in Figure 5.12(b), A is a descendant of D and D is an ancestor of B. Both B and D are also non-descendants of A. An important property of the Bayesian network can be stated as follows:

Property 1 (Conditional Independence). A node in a Bayesian network is conditionally independent of its non-descendants, if its parents are known.

In the diagram shown in Figure 5.12(b), A is conditionally independent of both B and D given C because the nodes for B and D are non-descendants of node A. The conditional independence assumption made by a naive Bayes classifier can also be represented using a Bayesian network, as shown in Figure 5.12(c), where y is the target class and {X_1, X_2, ..., X_d} is the attribute set.
Besides the conditional independence conditions imposed by the network topology, each node is also associated with a probability table.

1. If a node X does not have any parents, then the table contains only the
prior probability P(X).

2. If a node X has only one parent, Y, then the table contains the condi-
tional probability P(XIY).

3. If a node X has multiple parents, {Y_1, Y_2, ..., Y_k}, then the table contains the conditional probability P(X|Y_1, Y_2, ..., Y_k).

Figure 5.13. A Bayesian belief network for detecting heart disease and heartburn in patients. The probability tables shown in the figure include:

P(E = Yes) = 0.7, P(D = Healthy) = 0.25
P(Hb = Yes|D = Healthy) = 0.2, P(Hb = Yes|D = Unhealthy) = 0.85
P(HD = Yes|E = Yes, D = Healthy) = 0.25, P(HD = Yes|E = Yes, D = Unhealthy) = 0.45
P(HD = Yes|E = No, D = Healthy) = 0.55, P(HD = Yes|E = No, D = Unhealthy) = 0.75
P(CP = Yes|HD = Yes, Hb = Yes) = 0.8, P(CP = Yes|HD = Yes, Hb = No) = 0.6
P(CP = Yes|HD = No, Hb = Yes) = 0.4, P(CP = Yes|HD = No, Hb = No) = 0.1
P(BP = High|HD = Yes) = 0.85, P(BP = High|HD = No) = 0.2

Figure 5.13 shows an example of a Bayesian network for modeling patients with heart disease or heartburn problems. Each variable in the diagram is assumed to be binary-valued. The parent nodes for heart disease (HD) correspond to risk factors that may affect the disease, such as exercise (E) and diet (D). The child nodes for heart disease correspond to symptoms of the disease, such as chest pain (CP) and high blood pressure (BP). For example, the diagram shows that heartburn (Hb) may result from an unhealthy diet and may lead to chest pain.
The nodes associated with the risk factors contain only the prior probabilities, whereas the nodes for heart disease, heartburn, and their corresponding symptoms contain the conditional probabilities. To save space, some of the probabilities have been omitted from the diagram. The omitted probabilities can be recovered by noting that P(X = x̄) = 1 − P(X = x) and P(X = x̄|Y) = 1 − P(X = x|Y), where x̄ denotes the opposite outcome of x.
For example, the conditional probability

P(Heart Disease = No|Exercise = No, Diet = Healthy)
  = 1 − P(Heart Disease = Yes|Exercise = No, Diet = Healthy)
  = 1 − 0.55 = 0.45.

Model Building

Model building in Bayesian networks involves two steps: (1) creating the structure of the network, and (2) estimating the probability values in the tables associated with each node. The network topology can be obtained by encoding the subjective knowledge of domain experts. Algorithm 5.3 presents a systematic procedure for inducing the topology of a Bayesian network.

Algorithm 5.3 Algorithm for generating the topology of a Bayesian network.

1: Let T = (X_1, X_2, ..., X_d) denote a total order of the variables.
2: for j = 1 to d do
3:   Let X_T(j) denote the jth highest order variable in T.
4:   Let π(X_T(j)) = {X_T(1), X_T(2), ..., X_T(j−1)} denote the set of variables preceding X_T(j).
5:   Remove the variables from π(X_T(j)) that do not affect X_j (using prior knowledge).
6:   Create an arc between X_T(j) and the remaining variables in π(X_T(j)).
7: end for

Example 5.4. Consider the variables shown in Figure 5.13. After performing Step 1, let us assume that the variables are ordered in the following way: (E, D, HD, Hb, CP, BP). From Steps 2 to 7, starting with variable D, we obtain the following conditional probabilities:

o P(D|E) is simplified to P(D).
o P(HD|E, D) cannot be simplified.
o P(Hb|HD, E, D) is simplified to P(Hb|D).
o P(CP|Hb, HD, E, D) is simplified to P(CP|Hb, HD).
o P(BP|CP, Hb, HD, E, D) is simplified to P(BP|HD).

Based on these conditional probabilities, we can create arcs between the nodes (E, HD), (D, HD), (D, Hb), (HD, CP), (Hb, CP), and (HD, BP). These arcs result in the network structure shown in Figure 5.13.
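The procedure of Algorithm 5.3 can be sketched in a few lines of Python. The parents_of mapping below encodes the prior knowledge assumed in Example 5.4 (which preceding variables actually affect each variable); it is supplied by hand for illustration, not derived by the algorithm:

# Total order of the variables chosen in Example 5.4
order = ["E", "D", "HD", "Hb", "CP", "BP"]

# Prior knowledge: which variables directly affect each variable
# (this mapping encodes the simplifications listed in Example 5.4).
parents_of = {
    "E": [], "D": [],
    "HD": ["E", "D"],
    "Hb": ["D"],
    "CP": ["Hb", "HD"],
    "BP": ["HD"],
}

# Algorithm 5.3: keep only the relevant predecessors of each variable as its parents
arcs = []
for j, var in enumerate(order):
    preceding = order[:j]                      # the set pi(X_T(j))
    parents = [p for p in preceding if p in parents_of[var]]
    arcs.extend((p, var) for p in parents)

print(arcs)
# [('E', 'HD'), ('D', 'HD'), ('D', 'Hb'), ('HD', 'CP'), ('Hb', 'CP'), ('HD', 'BP')]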
Algorithm 5.3 guarantees a topology that does not contain any cycles. The
proof for this is quite straightforward. If a cycle exists, then there must be at
least one arc connecting the lower-ordered nodes to the higher-ordered nodes,
and at least another arc connecting the higher-ordered nodes to the lower-
ordered nodes. Since Algorithm 5.3 prevents any arc from connecting the

lower-ordered nodes to the higher-ordered nodes, there cannot be any cycles


in the topology.
Nevertheless, the network topology may change if we apply a different ordering scheme to the variables. Some topologies may be inferior because they produce many arcs connecting different pairs of nodes. In principle, we may have to examine all d! possible orderings to determine the most appropriate topology, a task that can be computationally expensive. An alternative approach is to divide the variables into causal and effect variables, and then draw the arcs from each causal variable to its corresponding effect variables. This approach eases the task of building the Bayesian network structure.
Once the right topology has been found, the probability table associated with each node is determined. Estimating such probabilities is fairly straightforward and is similar to the approach used by naive Bayes classifiers.

Example of Inferencing Using BBN

Suppose we are interested in using the BBN shown in Figure 5.13 to diagnose
whether a person has heart disease. The following cases illustrate how the
diagnosis can be made under different scenarios.

Case 1: No Prior Information

Without any prior information, we can determine whether the person is likely to have heart disease by computing the prior probabilities P(HD = Yes) and P(HD = No). To simplify the notation, let α ∈ {Yes, No} denote the binary values of Exercise and β ∈ {Healthy, Unhealthy} denote the binary values of Diet.

P(HD = Yes) = Σ_α Σ_β P(HD = Yes|E = α, D = β) P(E = α, D = β)
            = Σ_α Σ_β P(HD = Yes|E = α, D = β) P(E = α) P(D = β)
            = 0.25 × 0.7 × 0.25 + 0.45 × 0.7 × 0.75 + 0.55 × 0.3 × 0.25 + 0.75 × 0.3 × 0.75
            = 0.49.

Since P(HD = No) = 1 − P(HD = Yes) = 0.51, the person has a slightly higher chance of not getting the disease.
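The marginalization above is easy to reproduce from the probability tables of Figure 5.13. The sketch below encodes only the entries needed for this case (the dictionary names are our own):

# Prior probabilities of the risk factors (from Figure 5.13)
p_e = {"Yes": 0.7, "No": 0.3}               # P(Exercise)
p_d = {"Healthy": 0.25, "Unhealthy": 0.75}  # P(Diet)

# P(HD = Yes | E, D) from the heart-disease probability table
p_hd_yes = {
    ("Yes", "Healthy"): 0.25, ("Yes", "Unhealthy"): 0.45,
    ("No", "Healthy"): 0.55,  ("No", "Unhealthy"): 0.75,
}

# Case 1: marginalize over Exercise and Diet
p_hd = sum(p_hd_yes[(e, d)] * p_e[e] * p_d[d] for e in p_e for d in p_d)
print(round(p_hd, 2))  # 0.49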

Case 2: High Blood Pressure

If the person has high blood pressure, we can make a diagnosis about heart disease by comparing the posterior probabilities, P(HD = Yes|BP = High) against P(HD = No|BP = High). To do this, we must compute P(BP = High):

P(BP = High) = Σ_γ P(BP = High|HD = γ) P(HD = γ)
             = 0.85 × 0.49 + 0.2 × 0.51 = 0.5185,

where γ ∈ {Yes, No}. Therefore, the posterior probability that the person has heart disease is

P(HD = Yes|BP = High) = P(BP = High|HD = Yes) P(HD = Yes) / P(BP = High)
                      = 0.85 × 0.49 / 0.5185 = 0.8033.

Similarly, P(HD = No|BP = High) = 1 − 0.8033 = 0.1967. Therefore, when a person has high blood pressure, it increases the risk of heart disease.

Case 3: High Blood Pressure, Healthy Diet, and Regular Exercise

Suppose we are told that the person exercises regularly and eats a healthy diet. How does the new information affect our diagnosis? With the new information, the posterior probability that the person has heart disease is

P(HD = Yes|BP = High, D = Healthy, E = Yes)
  = [P(BP = High|HD = Yes, D = Healthy, E = Yes) / P(BP = High|D = Healthy, E = Yes)] × P(HD = Yes|D = Healthy, E = Yes)
  = P(BP = High|HD = Yes) P(HD = Yes|D = Healthy, E = Yes) / Σ_γ P(BP = High|HD = γ) P(HD = γ|D = Healthy, E = Yes)
  = 0.85 × 0.25 / (0.85 × 0.25 + 0.2 × 0.75)
  = 0.5862,

while the probability that the person does not have heart disease is

P(HD = No|BP = High, D = Healthy, E = Yes) = 1 − 0.5862 = 0.4138.



The model therefore suggests that eating healthily and exercising regularly
may reduce a person's risk of getting heart disease.
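Continuing the sketch from Case 1, Cases 2 and 3 reduce to a direct application of the Bayes theorem once P(BP = High|HD) is read from Figure 5.13 (0.85 for HD = Yes and 0.2 for HD = No):

p_bp_high = {"Yes": 0.85, "No": 0.2}  # P(BP = High | HD) from Figure 5.13

# Case 2: condition on high blood pressure only (P(HD = Yes) = 0.49 from Case 1)
p_hd_yes = 0.49
num = p_bp_high["Yes"] * p_hd_yes
den = num + p_bp_high["No"] * (1 - p_hd_yes)
print(round(num / den, 4))  # 0.8033

# Case 3: additionally condition on D = Healthy and E = Yes, so the relevant
# prior becomes P(HD = Yes | D = Healthy, E = Yes) = 0.25
p_hd_yes_de = 0.25
num = p_bp_high["Yes"] * p_hd_yes_de
den = num + p_bp_high["No"] * (1 - p_hd_yes_de)
print(round(num / den, 4))  # 0.5862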

Characteristics of BBN

Following are some of the general characteristics of the BBN method:

1. BBN provides an approach for capturing the prior knowledge of a particular domain using a graphical model. The network can also be used to encode causal dependencies among variables.

2. Constructing the network can be time consuming and requires a large amount of effort. However, once the structure of the network has been determined, adding a new variable is quite straightforward.

3. Bayesian networks are well suited to dealing with incomplete data. Instances with missing attributes can be handled by summing or integrating the probabilities over all possible values of the attribute.

4. Because the data is combined probabilistically with prior knowledge, the method is quite robust to model overfitting.

5.4 Artificial Neural Network (ANN)


The study of artificial neural networks (ANN) was inspired by attempts to simulate biological neural systems. The human brain consists primarily of nerve cells called neurons, linked together with other neurons via strands of fiber called axons. Axons are used to transmit nerve impulses from one neuron to another whenever the neurons are stimulated. A neuron is connected to the axons of other neurons via dendrites, which are extensions from the cell body of the neuron. The contact point between a dendrite and an axon is called a synapse. Neurologists have discovered that the human brain learns by changing the strength of the synaptic connection between neurons upon repeated stimulation by the same impulse.
Analogous to human brain structure, an ANN is composed of an interconnected assembly of nodes and directed links. In this section, we will examine a family of ANN models, starting with the simplest model called the perceptron, and show how the models can be trained to solve classification problems.
Lab No. 02 (NB) - Colaboratory

from sklearn.datasets import make_classification

X, y = make_classification(
    n_features=6,
    n_classes=3,
    n_samples=800,
    n_informative=2,
    random_state=1,
    n_clusters_per_class=1,
)

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y, marker="*");

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)

from sklearn.naive_bayes import GaussianNB

# Build a Gaussian Classifier


model = GaussianNB()

# Model training
model.fit(X_train, y_train)

# Predict Output
predicted = model.predict([X_test[6]])

print("Actual Value:", y_test[6])


print("Predicted Value:", predicted[0])

Actual Value: 0
Predicted Value: 0

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test, average="weighted")

print("Accuracy:", accuray)
print("F1 Score:", f1)

Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328

labels = [0,1,2]
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot();

