Bayes Classifier

Classification using Bayes Decision Theory

• In this approach classification is carried out using probabilities of classes.

• It is assumed that we know the a priori or the prior probability of each


class. If we have two classes C1 and C2 , then the prior probability for
class C1 is PC1 and the prior probability for class C2 is PC2 .

• If the prior probability is not known, the classes are taken to be equally
likely.

• If prior probability is not known and there are two classes C1 and C2 , then it is assumed PC1 = PC2 = 0.5.

• If PC1 and PC2 are known, then when a new pattern x comes along, we
need to calculate P (C1 |x) and P (C2 |x).

• Bayes theorem is used to compute P (C1 |x) and P (C2 |x).

• Then if P (C1 |x) ≥ P (C2 |x), the pattern is assigned to Class 1 and if
P (C1|x) < P (C2 |x), it is assigned to Class 2. This is called the Bayes
decision rule.
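
As a minimal sketch of this rule in code (the Gaussian class-conditional densities and their parameters below are illustrative assumptions, not taken from the notes), the comparison of posteriors reduces to comparing prior times likelihood, since P (x) is common to both classes:

import math

def gaussian_pdf(x, mean, std):
    # An assumed class-conditional density p(x | Ci); any density model could be used here.
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def bayes_decide(x, prior1=0.5, prior2=0.5):
    # P(Ci | x) is proportional to P(Ci) * p(x | Ci); the common factor P(x) cancels.
    score1 = prior1 * gaussian_pdf(x, mean=0.0, std=1.0)  # assumed parameters for C1
    score2 = prior2 * gaussian_pdf(x, mean=2.0, std=1.0)  # assumed parameters for C2
    return "C1" if score1 >= score2 else "C2"

print(bayes_decide(0.4))  # falls nearer the assumed C1 density, so prints "C1"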
Bayes Rule
• If P (Ci) is the prior probability of Class i, and P (X|Ci) is the condi-
tional density of X given class Ci , then the a posteriori or posterior
probability of Ci is given by

P (Ci | X) = P (X | Ci ) P (Ci ) / P (X)

• Bayes theorem provides a way of calculating the posterior probability


P (Ci | X). In other words, after observing X, the posterior probability
that the class is Ci can be calculated from Bayes theorem. It is useful
to convert prior probabilities to posterior probabilities.

• P (X) is given by

P (X) = Σi P (X | Ci ) P (Ci )

• Let the probability that an elephant is black be 80% and that an ele-
phant is white be 20%. This means P(elephant is black) = 0.8 and
P(elephant is white) =0.2 . With only this information, any elephant
will be classified as being black. This is because the probability of
error in this case is only 0.2 as opposed to classifying the elephant as
white which results in a probability of error of 0.8 . When additional in-
formation is available, it can be used along with the information above.

Suppose the probability that an elephant belongs to region X is 0.2. Now, if an elephant belongs to region X, we need to calculate the posterior probability that it is white, i.e. P(elephant is white | elephant belongs to region X), or P (W | X). This can be calculated using Bayes theorem. Suppose that 95% of the time when an elephant is white, it belongs to region X, i.e. P (X | W ) = 0.95. Then

P (W | X) = P (X | W ) P (W ) / P (X) = (0.95 × 0.2) / 0.2 = 0.95

The probability of error is 0.05 which is the probability that the ele-
phant is not white given that it belongs to region X.

Minimum Error Rate Classifier

• If it is required to classify a pattern X, then the minimum error rate


classifier classifies the pattern X to the class C which has the maximum
posterior probability P (C | X).

• If the test pattern X is classified as belonging to Class C, then the error


in the classification will be (1 - P (C | X)).

• It is evident to reduce the error, X has to be classified as belonging to


the class for which P (C | X) is maximum.

• The expected probability of error is given by

∫X (1 − P (C|X)) P (X) dX

This is minimum when P (C | X) is maximum (for a specified value of P (X)).

• Let us consider an example of how to use the minimum error rate classifier for a classification problem. We have three classes, small, medium and large, with prior probabilities

P (small) = 1/3
P (medium) = 1/2
P (large) = 1/6

We have a set of nails, bolts and rivets in a box and the three classes
correspond to the size of these objects in the box.

Now let us consider the class-conditional probabilities of these objects :

For Class small we get

P (nail | small) = 1/4
P (bolt | small) = 1/2
P (rivet | small) = 1/4

For Class medium we get

P (nail | medium) = 1/2
P (bolt | medium) = 1/6
P (rivet | medium) = 1/3

For Class large we get

P (nail | large) = 1/3
P (bolt | large) = 1/3
P (rivet | large) = 1/3

Now we can find the probability of the class labels given that it is a
nail, bolt or rivet. For doing this we need to use Bayes Classifier. Once
we get these probabilities, we can find the corresponding class labels of
the objects.

P (small | nail) = P (nail | small) P (small) / [P (nail | small) P (small) + P (nail | medium) P (medium) + P (nail | large) P (large)]

This will give


P (small | nail) = (1/4 · 1/3) / (1/4 · 1/3 + 1/2 · 1/2 + 1/3 · 1/6) = 0.2143

Similarly, we calculate P (medium | nail) and we get


P (medium | nail) = (1/2 · 1/2) / (1/4 · 1/3 + 1/2 · 1/2 + 1/3 · 1/6) = 0.6429

and also P (large | nail)


P (large | nail) = (1/3 · 1/6) / (1/4 · 1/3 + 1/2 · 1/2 + 1/3 · 1/6) = 0.1429

Since P (medium | nail) > P (small | nail) and


P (medium | nail) > P (large | nail)

we classify nail as belonging to the class medium. The probability of


error P (error | nail) = 1 − 0.6429 = 0.3571

In a similar way, we can find the posterior probability for bolt


P (small | bolt) = (1/2 · 1/3) / (1/2 · 1/3 + 1/6 · 1/2 + 1/3 · 1/6) = 0.5455

P (medium | bolt) = (1/6 · 1/2) / (1/2 · 1/3 + 1/6 · 1/2 + 1/3 · 1/6) = 0.2727

P (large | bolt) = (1/3 · 1/6) / (1/2 · 1/3 + 1/6 · 1/2 + 1/3 · 1/6) = 0.1818

Since P (small | bolt) > P (medium | bolt) and


P (small | bolt) > P (large | bolt)

we classify bolt as belonging to the class small and the probability of


error P (error | bolt) = 1 − 0.5455 = 0.4545

In a similar way, we can find the posterior probability for rivet


P (small | rivet) = (1/4 · 1/3) / (1/4 · 1/3 + 1/3 · 1/2 + 1/3 · 1/6) = 0.2727

P (medium | rivet) = (1/3 · 1/2) / (1/4 · 1/3 + 1/3 · 1/2 + 1/3 · 1/6) = 0.5455

P (large | rivet) = (1/3 · 1/6) / (1/4 · 1/3 + 1/3 · 1/2 + 1/3 · 1/6) = 0.1818

Since P (medium | rivet) > P (small | rivet) and


P (medium | rivet) > P (large | rivet)

we classify rivet as belonging to the class medium and the probability of error P (error | rivet) = 1 − 0.5455 = 0.4545
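
The calculations in this example can be checked with a short script; this is only a sketch of the arithmetic above, with the priors and class-conditional probabilities taken directly from the values given for the three classes:

priors = {"small": 1/3, "medium": 1/2, "large": 1/6}
cond = {  # P(object | class) as given above
    "small":  {"nail": 1/4, "bolt": 1/2, "rivet": 1/4},
    "medium": {"nail": 1/2, "bolt": 1/6, "rivet": 1/3},
    "large":  {"nail": 1/3, "bolt": 1/3, "rivet": 1/3},
}

for obj in ["nail", "bolt", "rivet"]:
    evidence = sum(priors[c] * cond[c][obj] for c in priors)  # P(object)
    posterior = {c: priors[c] * cond[c][obj] / evidence for c in priors}
    best = max(posterior, key=posterior.get)
    print(obj, {c: round(p, 4) for c, p in posterior.items()}, "->", best)

This reproduces the posteriors worked out above (0.2143/0.6429/0.1429 for nail, 0.5455/0.2727/0.1818 for bolt and 0.2727/0.5455/0.1818 for rivet) and assigns nail and rivet to medium and bolt to small.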

Naive Bayes Classifier

• A naive Bayes classifier is based on applying Bayes theorem to find the class of a pattern.

• The assumption made here is that every feature is class conditionally


independent.

• Due to this assumption, the probabilistic classifier is simple.

• In other words, it is assumed that the effect of each feature on a given


class is independent of the value of other features.

• Since this simplifies the computation, though it may not be always


true, it is considered to be a naive classifier.

• Even though this assumption is made, the Naive Bayes Classifier is


found to give results comparable in performance to other classifiers like
neural network classifiers and classification trees.

• Since the calculations are simple, this classifier can be used for large
databases where the results are obtained fast with reasonable accuracy.

• Using the minimum error rate classifier, we classify the pattern X to the class with the maximum posterior probability P (C | X). In the naive Bayes classifier, this can be written as

P (C | f1 , ..., fd ).
where f1 , ..., fd are the features.

• Using Bayes theorem, this can be written as

P (C | f1 , ..., fd ) = P (C) P (f1 , ..., fd | C) / P (f1 , ..., fd )

Since every feature fi is independent of every other feature fj , for j ≠ i, given the class,

P (fi , fj | C) = P (fi | C) P (fj | C)

So we get,

P (C, f1 , ..., fd ) = P (C) P (f1 | C) P (f2 | C) · · · P (fd | C) = P (C) ∏i P (fi | C)

The conditional distribution over the class variable C is

P (C | f1 , . . . , fd ) = (1/Z) P (C) ∏i P (fi | C)

where Z is a scaling factor.

• The Naive Bayes classification uses only the prior probabilities of classes
P(C) and the independent probability distributions p(fi | C).
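
A minimal sketch of this decision rule in code is given below; the priors and probability tables used here are hypothetical placeholders (they are not from the notes), but the scoring is exactly P (C) multiplied by the product of the P (fi | C) values:

def naive_bayes_classify(features, priors, likelihoods):
    # priors:      dict mapping class -> P(C)
    # likelihoods: dict mapping class -> list of dicts, one per feature, value -> P(value | C)
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(features):
            score *= likelihoods[c][i].get(value, 0.0)  # an unseen value gives probability 0 here
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical two-class problem with two categorical features.
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": [{"a": 0.7, "b": 0.3}, {"x": 0.2, "y": 0.8}],
    "no":  [{"a": 0.4, "b": 0.6}, {"x": 0.5, "y": 0.5}],
}
print(naive_bayes_classify(("a", "y"), priors, likelihoods))  # picks "yes"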

Parameter Estimation

• In supervised learning, a training set is given. Using the training set, all the parameters of the Bayes model can be computed.

• If nC of the training examples out of n belong to Class C, then the


prior probability of Class C will be

P (C) = nC / n

• In a class C, if n1 samples take a range of values (or a single value) out of a total of nC samples in the class, then the class-conditional probability of the feature being in this range for this class will be

P (f1 is in range (a,b) | C) = n1 / nC

In case of the feature taking a small number of integer values, this can
be calculated for each of these values. For example, it would be

P (f1 is 6 | C) = n2 / nC

if n2 of the nC patterns of Class C take on the value 6.

• If some class and feature never occur together, then that probability
will be zero. When this is multiplied by other probabilities, it may
make some probabilities zero. To prevent this, it is necessary to give a
small value of probability to every probability estimate.
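
One common way of doing this (the notes do not name a specific method) is Laplace, or add-one, smoothing, where the estimate for a feature value v in class C becomes (count(v, C) + 1) / (nC + V ), with V the number of distinct values the feature can take. A minimal sketch:

def smoothed_probability(value_count, class_count, num_values):
    # Laplace (add-one) smoothing: the estimate is never exactly zero.
    return (value_count + 1) / (class_count + num_values)

print(smoothed_probability(0, 100, 3))   # a value never seen with the class still gets 1/103
print(smoothed_probability(30, 100, 3))  # 31/103, close to the unsmoothed 30/100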

• Let us estimate the parameters for a training set which has 100 pat-
terns of Class 1, 90 patterns of Class 2, 140 patterns of Class 3 and 100
patterns of Class 4. The prior probability of each class can be calcu-
lated.
The prior probability of Class 1 is

P (C1) = 100 / (100 + 90 + 140 + 100) = 0.233

The prior probability of Class 2 is

P (C2) = 90 / (100 + 90 + 140 + 100) = 0.209

The prior probability of Class 3 is

P (C3) = 140 / (100 + 90 + 140 + 100) = 0.326

The prior probability of Class 4 is

P (C4) = 100 / (100 + 90 + 140 + 100) = 0.233

Out of the 100 examples of Class 1, if we consider a particular feature f1 and 30 patterns take on the value 0, 45 take on the value 1 and 25 take on the value 2, then the probability that in Class 1 the feature f1 is 0 is

P (f1 is 0 | C1) = 30/100 = 0.30

The probability that in Class 1 the feature f1 is 1 is

P (f1 is 1 | C1) = 45/100 = 0.45

The probability that in Class 1 the feature f1 is 2 is

P (f1 is 2 | C1) = 25/100 = 0.25
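
These estimates are simply ratios of counts, as the short sketch below shows; it only reproduces the numbers above from the stated counts:

class_counts = {"C1": 100, "C2": 90, "C3": 140, "C4": 100}
n = sum(class_counts.values())  # 430 training patterns in all
priors = {c: round(k / n, 3) for c, k in class_counts.items()}
print(priors)  # {'C1': 0.233, 'C2': 0.209, 'C3': 0.326, 'C4': 0.233}

# Feature f1 within Class 1: 30 patterns take value 0, 45 take value 1, 25 take value 2.
f1_counts_in_c1 = {0: 30, 1: 45, 2: 25}
f1_given_c1 = {v: k / class_counts["C1"] for v, k in f1_counts_in_c1.items()}
print(f1_given_c1)  # {0: 0.3, 1: 0.45, 2: 0.25}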

Example for Naive Bayes Classifier

Let us take an example dataset.


Consider the example used for decision trees, given in Table 1. We have a new pattern

money = 90, has-exams=yes, and weather=fine

We need to classify this pattern as either belonging to goes-to-movie=yes or goes-to-movie=no.

There are four examples out of 11 belonging to goes-to-movie=yes.

The prior probability P(goes-to-movie=yes) = 4/11 = 0.364

The prior probability P(goes-to-movie=no) = 7/11 = 0.636

There are 4 examples with money in the range 50-150 and goes-to-movie=no and 1 example with money in the range 50-150 and goes-to-movie=yes. Therefore,

Money Has-exams weather Goes-to-movie
25 no fine no
200 no hot yes
100 no rainy no
125 yes rainy no
30 yes rainy no
300 yes fine yes
55 yes hot no
140 no hot no
20 yes fine no
175 yes fine yes
110 no fine yes

Table 1: Example training data set

P (money 50-150 | goes-to-movie = yes) = 1/4 = 0.25 and

P (money 50-150 | goes-to-movie = no) = 4/7 = 0.571

There are 4 examples with has-exams=yes and goes-to-movie=no and 2 examples with has-exams=yes and goes-to-movie=yes. Therefore,

P (has-exams = yes | goes-to-movie = yes) = 2/4 = 0.5

P (has-exams = yes | goes-to-movie = no) = 4/7 = 0.571

There are 2 examples with weather=fine and goes-to-movie=no and 3 examples with weather=fine and goes-to-movie=yes. Therefore,

P (weather = fine | goes-to-movie = yes) = 3/4 = 0.75

P (weather = fine | goes-to-movie = no) = 2/7 = 0.286

Therefore

P (goes-to-movie = yes | X) = 0.364 * 0.25 * 0.5 * 0.75 = 0.034

P (goes-to-movie = no | X) = 0.636 * 0.571 * 0.571 * 0.286 = 0.059

Since P (goes-to-movie = no | X) is larger, the new pattern is classified as belonging to the class goes-to-movie=no.
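
The whole calculation can be verified with a short script over Table 1. This is only a sketch of the counting and multiplication done above, with money binned as below 50, 50-150 and above 150:

# Rows of Table 1: (money, has-exams, weather, goes-to-movie)
data = [
    (25, "no", "fine", "no"),    (200, "no", "hot", "yes"),   (100, "no", "rainy", "no"),
    (125, "yes", "rainy", "no"), (30, "yes", "rainy", "no"),  (300, "yes", "fine", "yes"),
    (55, "yes", "hot", "no"),    (140, "no", "hot", "no"),    (20, "yes", "fine", "no"),
    (175, "yes", "fine", "yes"), (110, "no", "fine", "yes"),
]

def money_bin(m):
    return "<50" if m < 50 else ("50-150" if m <= 150 else ">150")

def score(label):
    # prior * P(money 50-150 | label) * P(has-exams = yes | label) * P(weather = fine | label)
    rows = [r for r in data if r[3] == label]
    prior = len(rows) / len(data)
    p_money = sum(money_bin(r[0]) == "50-150" for r in rows) / len(rows)
    p_exams = sum(r[1] == "yes" for r in rows) / len(rows)
    p_fine = sum(r[2] == "fine" for r in rows) / len(rows)
    return prior * p_money * p_exams * p_fine

print(round(score("yes"), 3), round(score("no"), 3))  # 0.034 0.059, so the class is goes-to-movie=no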

Bayesian Belief Network

• A Bayesian network is a graphical model of a situation which repre-


sents a set of variables and the dependencies between them by using
probability.

• The nodes in a Bayesian network represent the variables and the di-
rectional arcs represent the dependencies between the variables. The
direction of the arrows show the direction of the dependency.

• Each variable is associated with a conditional probability table which


gives the probability of this variable for different values of the variables
on which this node depends.

• Using this model, it is possible to perform inference and learning.

• Bayesian networks that model a sequence of variables varying in time


are called dynamic Bayesian networks.

• Bayesian networks with decision nodes, which solve decision problems under uncertainty, are called influence diagrams.

• The graphical model will be a directed acyclic graph with the nodes
representing the variables. The arcs will represent the dependencies
between nodes. If there is a directed arc between A and B, then A is
called the parent of B and B is a child of A.

• A variable which has no parent is a variable which does not depend


on any other variable. It is a variable which is independent and is not
conditioned on any other variable.

• The conditional probability table associated with each node gives the
probability of this variable for different values of its parent nodes.

• If a node does not have any parents, the conditional probability table
is very simple as there are no dependencies.

• The joint distribution of the variable values can be written as the product of the local conditional distributions of each node given its parents. In other words, if f1 , f2 , ..., fd are the variables, the joint distribution P (f1 , f2 , ..., fd ) is given by

P (f1 , f2 , ..., fd ) = ∏j P (fj | parents(fj ))

Using the above equation, it is possible to get the joint distribution of


the variables for different values.

• As an example, let us consider the following scenario.

Lakshman travels by air if he is on an official visit. If he is on a personal


visit, he travels by air if he has money. If he does not travel by plane,
he travels by train but sometimes also takes a bus.

The variables involved are :

1. Lakshman travels by air(A)

2. Goes on official visit(F)

3. Lakshman has money(M)

4. Lakshman travels by train(T)

5. Lakshman travels by bus(B)

This situation is converted into a belief network as shown in Figure 1.

In the graph, we can see the dependencies with respect to the variables.
The probability values at a variable are dependent on the value of its par-
ents. In this case, the variable A is dependent on F and M. The variable T
is dependent on A and variable B is dependent on A. The variables F and M
are independent variables which do not have any parent node. So their prob-
abilities are not dependent on any other variable. Node A has the biggest
conditional probability table as A depends on F and M. T and B depend on A.
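
Concretely, applying the factorization of the joint distribution given above to this network, we get

P (F, M, A, T, B) = P (F ) P (M) P (A | F, M) P (T | A) P (B | A)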

Once the graph, which is a directed acyclic graph (DAG), is drawn, the probabilities of the variables for the different values of their parent nodes have to be entered. This requires us to know the problem at hand and estimate these probabilities. So for each node the conditional probability table has to be entered.

Figure 1: Bayesian Belief Network with nodes M (Lakshman has money), F (goes on official trip), A (Lakshman travels by air), T (Lakshman travels by train) and B (Lakshman travels by bus), each with its conditional probability table.

First we take the independent nodes. Node F has a probability of P (F ) =


0.7. Node M has a probability of P (M) = 0.3.

We next come to node A. The conditional probability table for this node
can be represented as

F M P(A | F, M)
T T 0.98
T F 0.98
F T 0.90
F F 0.10

The conditional probability table for T can be represented as

A P(T | A)
T 0.0
F 0.6

The conditional probability table for B is

A P(B | A)
T 0.0
F 0.4

Using the Bayesian belief network, we can get the probability of a com-
bination of these variables. For example, we can get the probability that
Lakshman travels by train, does not travel by air, goes on an official trip and
has money. In other words, we are finding P (T, ¬A, F, M). The probability
of each variable given its parent is found and multiplied together to give the
probability.

P (T, ¬A, F, M) = P (T | ¬A) * P (¬A | F, M) * P (F ) * P (M)

= 0.6 * (1 − 0.98) * 0.7 * 0.3 = 0.6 * 0.02 * 0.7 * 0.3 = 0.0025
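
This query can also be computed directly from the conditional probability tables above; the sketch below takes the probability of ¬A given its parents as 1 − P (A | F, M):

p_F, p_M = 0.7, 0.3
p_A_given = {(True, True): 0.98, (True, False): 0.98, (False, True): 0.90, (False, False): 0.10}
p_T_given_A = {True: 0.0, False: 0.6}
p_B_given_A = {True: 0.0, False: 0.4}  # not needed for this particular query

# P(T, not A, F, M) = P(T | not A) * P(not A | F, M) * P(F) * P(M)
p = p_T_given_A[False] * (1 - p_A_given[(True, True)]) * p_F * p_M
print(round(p, 4))  # 0.0025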

Assignment
1. Let the probability that a road is wet be P(W) = 0.3 and the probability of rain be P(R) = 0.3. Given that 90% of the time when the roads are wet it is because it has rained, and that it has rained, calculate the posterior probability that the roads are wet.
2. Let blue, green, and red be three classes of objects with prior prob-
abilities given by P(blue) = 0.3, P(green) = 0.4, P(red) = 0.3. Let
there be three types of objects: pencils, pens, and paper. Let the class-
conditional probabilities of these objects be given as follows. Use Bayes
classifier to classify pencil, pen, and paper.

P(pencil|green) = 0.3 P(pen|green) = 0.5 P(paper|green) = 0.2


P(pencil|blue) = 0.5 P(pen|blue) = 0.2 P(paper|blue) = 0.3
P(pencil|red) = 0.2 P(pen|red) = 0.3 P(paper|red) = 0.5

3. Consider a two-class (Tasty or nonTasty) problem with the following


training data. Use Naive Bayes classifier to classify
Cook = Asha, Health − Status = Bad, Cuisine = Continental

Cook Health-Status Cuisine Tasty

Asha Bad Indian Yes


Asha Good Continental Yes
Sita Bad Indian No
Sita Good Indian Yes
Usha Bad Indian Yes
Usha Bad Continental No
Sita Bad Continental No
Sita Good Continental Yes
Usha Good Indian Yes
Usha Good Continental No

4. Consider the following dataset with three features f1 , f2 , and f3 . Con-
sider the test pattern f1 = a, f2 = c, f3 = f . Classify it using
NNC and Naive Bayes Classifier.

f1 f2 f3 Class Label

a c e No
b c f Yes
b c e No
b d f Yes
a d f Yes
a d f No

5. The profit a businessman makes depends on how fresh the provisions


are. Further, if there is a festival approaching, his profit increases. On
the other hand, towards the end of the month, his sales come down. If
he makes enough profit, he celebrates Diwali in a big way. Draw the
belief network and suggest the likely conditional probability tables for
all variables. Using this data, find the probability that the businessman
celebrates Diwali big given that the provisions are fresh.