Bayes Classifier
• If the prior probability is not known, the classes are taken to be equally
likely.
• If the prior probability is not known and there are two classes C1 and C2,
then it is assumed that P(C1) = P(C2) = 0.5.
• If P(C1) and P(C2) are known, then when a new pattern x comes along, we
need to calculate P(C1|x) and P(C2|x).
• Then if P (C1 |x) ≥ P (C2 |x), the pattern is assigned to Class 1 and if
P (C1|x) < P (C2 |x), it is assigned to Class 2. This is called the Bayes
decision rule.
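A minimal Python sketch of this decision rule is given below; the likelihood values used in the example call are assumptions chosen purely for illustration.

def bayes_decision(p_x_given_c1, p_x_given_c2, p_c1=0.5, p_c2=0.5):
    # Posteriors are proportional to likelihood times prior; the common
    # factor P(x) does not affect the comparison.
    if p_x_given_c1 * p_c1 >= p_x_given_c2 * p_c2:
        return "Class 1"
    return "Class 2"

# With equal priors and assumed likelihoods 0.3 and 0.7, the pattern is
# assigned to Class 2.
print(bayes_decision(0.3, 0.7))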
Bayes Rule
• If P (Ci) is the prior probability of Class i, and P (X|Ci) is the condi-
tional density of X given class Ci , then the a posteriori or posterior
probability of Ci is given by
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• P(X) is given by
P(X) = Σi P(X|Ci) P(Ci)
• Let the probability that an elephant is black be 80% and that an elephant
is white be 20%. This means P(elephant is black) = 0.8 and
P(elephant is white) = 0.2. With only this information, any elephant
will be classified as being black, because the probability of error in this
case is only 0.2, as opposed to classifying the elephant as white, which
results in a probability of error of 0.8. When additional information is
available, it can be used along with the information above. For example,
if 95% of the elephants in a particular region X are known to be white,
an elephant from region X is classified as white; the probability of error
is then 0.05, which is the probability that the elephant is not white
given that it belongs to region X.
The overall probability of error of the classification is given by
∫X (1 − P(C|X)) P(X) dX
This is minimum when P(C|X) is maximum (for a specified value of P(X)).
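For the two-class case this can be made explicit; the short derivation below is a sketch in LaTeX using the notation of this section.

% Each pattern X is assigned to the class with the larger posterior, so the
% conditional probability of error is the smaller posterior:
\[
P(\text{error} \mid X) = \min\{P(C_1 \mid X),\ P(C_2 \mid X)\}
\]
% Averaging over X gives the overall probability of error,
\[
P(\text{error}) = \int \min\{P(C_1 \mid X),\ P(C_2 \mid X)\}\, P(X)\, dX,
\]
% which the Bayes decision rule minimizes pointwise by always picking the
% class with the larger posterior.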
We have a set of nails, bolts and rivets in a box, and the three classes
correspond to the size (small, medium and large) of these objects in the box.
The prior probabilities are
P(small) = 1/3
P(medium) = 1/2
P(large) = 1/6
P(nail | small) = 1/4
P(bolt | small) = 1/2
P(rivet | small) = 1/4
P(nail | medium) = 1/2
P(bolt | medium) = 1/6
P(rivet | medium) = 1/3
P(nail | large) = 1/3
P(bolt | large) = 1/3
P(rivet | large) = 1/3
Now we can find the probability of each class label given that the object
picked is a nail, a bolt or a rivet. For this we use the Bayes classifier.
Once we get these probabilities, we can find the corresponding class labels
of the objects.
P(small | nail) = P(nail | small) P(small) / [P(nail | small) P(small) + P(nail | medium) P(medium) + P(nail | large) P(large)]

P(medium | bolt) = (1/6 · 1/2) / (1/2 · 1/3 + 1/6 · 1/2 + 1/3 · 1/6) = 0.2727

P(large | bolt) = (1/3 · 1/6) / (1/2 · 1/3 + 1/6 · 1/2 + 1/3 · 1/6) = 0.1818

P(medium | rivet) = (1/3 · 1/2) / (1/4 · 1/3 + 1/3 · 1/2 + 1/3 · 1/6) = 0.5455

P(large | rivet) = (1/3 · 1/6) / (1/4 · 1/3 + 1/3 · 1/2 + 1/3 · 1/6) = 0.1818
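As a check on these numbers, the short Python sketch below computes all the posteriors from the priors and class-conditional probabilities listed above.

from fractions import Fraction as F

# Priors over the three size classes (from the text).
prior = {"small": F(1, 3), "medium": F(1, 2), "large": F(1, 6)}

# Class-conditional probabilities P(object | size) (from the text).
cond = {
    "small":  {"nail": F(1, 4), "bolt": F(1, 2), "rivet": F(1, 4)},
    "medium": {"nail": F(1, 2), "bolt": F(1, 6), "rivet": F(1, 3)},
    "large":  {"nail": F(1, 3), "bolt": F(1, 3), "rivet": F(1, 3)},
}

def posterior(size, obj):
    # Bayes rule: P(size | obj) = P(obj | size) P(size) / P(obj).
    evidence = sum(cond[s][obj] * prior[s] for s in prior)
    return cond[size][obj] * prior[size] / evidence

for obj in ("nail", "bolt", "rivet"):
    for size in ("small", "medium", "large"):
        print(f"P({size} | {obj}) = {float(posterior(size, obj)):.4f}")

# The output reproduces, for example, P(medium | bolt) = 0.2727 and
# P(medium | rivet) = 0.5455.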
Naive Bayes Classifier
• Since the calculations are simple, this classifier can be used for large
databases where the results are obtained fast with reasonable accuracy.
• The classifier computes, for each class C, the posterior probability
P(C | f1, ..., fd), where f1, ..., fd are the features.
• The naive assumption is that the features are independent given the class,
so that for any two features
P(fi, fj | C) = P(fi | C) P(fj | C)
So we get
P(C | f1, ..., fd) = (1/Z) P(C) Π(i = 1..d) P(fi | C)
where Z = P(f1, ..., fd) is a scaling factor that is the same for every class.
• The Naive Bayes classification uses only the prior probabilities of classes
P(C) and the independent probability distributions p(fi | C).
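A minimal Python sketch of this scoring rule is shown below; the dictionary-based representation of the priors and the per-feature conditional probabilities is an assumption made purely for illustration.

def naive_bayes_classify(priors, conditionals, pattern):
    # priors: {class_label: P(C)}
    # conditionals: {class_label: [ {feature_value: P(f_i = value | C)}, ... ]}
    # pattern: list of feature values, one entry per feature.
    best_label, best_score = None, -1.0
    for label, p_c in priors.items():
        # Unnormalized score P(C) * prod_i P(f_i | C); the scaling factor Z
        # is the same for every class and can be ignored.
        score = p_c
        for i, value in enumerate(pattern):
            score *= conditionals[label][i].get(value, 0.0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label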
Parameter Estimation
• The prior probability of a class C is estimated from the training set as
P(C) = nC / n
where nC is the number of training patterns belonging to class C and n is
the total number of training patterns.
• If n1 of the nC patterns of class C have feature f1 in the range (a, b),
the probability of the feature being in this range in this class will be
P(f1 is in range (a, b)) = n1 / nC
In case of the feature taking a small number of integer values, this can
be calculated for each of these values. For example, if n2 of the patterns
have f1 = 6, it would be
P(f1 is 6) = n2 / nC
• If some class and feature value never occur together in the training set,
the corresponding probability estimate will be zero. When this is multiplied
by the other probabilities, it makes the whole product zero. To prevent
this, it is necessary to give a small non-zero value to every probability
estimate; one way of doing this is sketched below.
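One common way of doing this is add-one (Laplace) smoothing; the sketch below is an illustration of the idea, not a method prescribed in these notes.

def smoothed_estimate(count, class_count, num_values, alpha=1.0):
    # Estimate P(f = v | C) as (count + alpha) / (class_count + alpha * num_values),
    # so a value never seen with the class still gets a small non-zero probability.
    return (count + alpha) / (class_count + alpha * num_values)

# Example with assumed counts: a feature value seen 0 times in a class of
# 90 patterns, where the feature can take 3 distinct values.
print(smoothed_estimate(0, 90, 3))   # 1/93, small but non-zero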
• Let us estimate the parameters for a training set which has 100 pat-
terns of Class 1, 90 patterns of Class 2, 140 patterns of Class 3 and 100
patterns of Class 4. The prior probability of each class can be calcu-
lated.
The prior probability of Class 1 is
P(C1) = 100 / (100 + 90 + 140 + 100) = 0.233
Similarly,
P(C2) = 90 / (100 + 90 + 140 + 100) = 0.210
P(C3) = 140 / (100 + 90 + 140 + 100) = 0.326
P(C4) = 100 / (100 + 90 + 140 + 100) = 0.233
If, in a class with 100 patterns, 30 patterns have f1 = 0, 45 have f1 = 1
and 25 have f1 = 2, then for that class
P(f1 is 0) = 30/100 = 0.30
P(f1 is 1) = 45/100 = 0.45
P(f1 is 2) = 25/100 = 0.25
Consider the following training set, where the class label is Goes-to-movie:

Money   Has-exams   Weather   Goes-to-movie
25      no          fine      no
200     no          hot       yes
100     no          rainy     no
125     yes         rainy     no
30      yes         rainy     no
300     yes         fine      yes
55      yes         hot       no
140     no          hot       no
20      yes         fine      no
175     yes         fine      yes
110     no          fine      yes

The prior probability P(goes-to-movie = no) = 7/11 = 0.636 and
P(goes-to-movie = yes) = 4/11 = 0.364.
The class-conditional probabilities estimated from the table are
P(money 50-150 | goes-to-movie = yes) = 1/4 = 0.25 and
P(money 50-150 | goes-to-movie = no) = 4/7 = 0.571
P(has-exams | goes-to-movie = yes) = 2/4 = 0.5
P(has-exams | goes-to-movie = no) = 4/7 = 0.571
P(weather = fine | goes-to-movie = yes) = 2/4 = 0.5
P(weather = fine | goes-to-movie = no) = 2/7 = 0.286
Therefore
P(goes-to-movie = yes | X) = P(yes) P(money 50-150 | yes) P(has-exams | yes) P(weather = fine | yes)
= 0.364 * 0.25 * 0.5 * 0.5 = 0.023
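The same calculation can be written out in Python; the sketch below multiplies the priors and the class-conditional estimates listed above and compares the two unnormalized scores, the larger of which gives the predicted class.

# P(yes) * product of P(feature | yes), and similarly for the "no" class.
p_yes = 0.364 * 0.25 * 0.5 * 0.5
p_no = 0.636 * (4 / 7) * (4 / 7) * (2 / 7)

print(f"score(yes) = {p_yes:.3f}")   # 0.023, as above
print(f"score(no)  = {p_no:.3f}")
print("classified as", "yes" if p_yes > p_no else "no")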
Bayesian Belief Network
• The nodes in a Bayesian network represent the variables and the directed
arcs represent the dependencies between the variables. The direction of an
arrow shows the direction of the dependency.
• The graphical model will be a directed acyclic graph with the nodes
representing the variables. The arcs will represent the dependencies
between nodes. If there is a directed arc from A to B, then A is
called the parent of B and B is a child of A.
• The conditional probability table associated with each node gives the
probability of this variable for different values of its parent nodes.
• If a node does not have any parents, the conditional probability table
is very simple as there are no dependencies.
• The joint probability P(f1, f2, ..., fd) is given by
P(f1, f2, ..., fd) = Π(j = 1..d) P(fj | parents(fj))
In the graph, we can see the dependencies between the variables. The
probability values at a variable depend on the values of its parents. In
this case, the variable A is dependent on F and M. The variable T is
dependent on A and the variable B is dependent on A. The variables F and M
are independent variables which do not have any parent node, so their
probabilities are not dependent on any other variable. Node A has the
biggest conditional probability table as A depends on both F and M; T and B
each depend only on A.
Once the graph, which is a directed acyclic graph (DAG), is drawn, the
probabilities of each variable for the different values of its parent nodes
have to be entered. This requires us to know the problem at hand and estimate
[Figure: Bayesian belief network for the example. M (Lakshman goes on an
official trip) and F (Lakshman has money) are the parent nodes of A
(Lakshman travels by air); A is the parent of T (Lakshman travels by train)
and B (Lakshman travels by bus). The conditional probability tables shown
in the figure are:

M  F  P(A | M, F)
T  T  0.98
T  F  0.98
F  T  0.98
F  F  0.10

A  P(T | A)
T  0.00
F  0.60

A  P(B | A)
T  0.00
F  0.40]
these probabilities. So for each node the Conditional Probability Table has
to be entered.
We next come to node A; its conditional probability table P(A | M, F) is
given in the figure above. The conditional probability tables for its
children T and B can be represented as

A  P(T | A)
T  0.0
F  0.6

A  P(B | A)
T  0.0
F  0.40
Using the Bayesian belief network, we can get the probability of a
combination of these variables. For example, we can get the probability that
Lakshman travels by train, does not travel by air, goes on an official trip
and has money. In other words, we are finding P(T, ¬A, F, M). The
probability of each variable given its parents is found and these values are
multiplied together to give the required probability, as in the sketch below.
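A small numerical sketch of this computation in Python is given below. The priors P(M) and P(F) are not given in the extract above, so the values used here are assumptions chosen only for illustration; the conditional probabilities are taken from the tables.

# P(T, not A, F, M) factorizes over the network as
# P(M) * P(F) * P(A = false | M, F) * P(T = true | A = false);
# B does not appear in the query, so it is summed out (omitted).

p_M = 0.3   # assumed prior P(M = true); not given in the extract
p_F = 0.4   # assumed prior P(F = true); not given in the extract

p_A_given_MF = {(True, True): 0.98, (True, False): 0.98,
                (False, True): 0.98, (False, False): 0.10}
p_T_given_A = {True: 0.00, False: 0.60}

p_not_A = 1.0 - p_A_given_MF[(True, True)]   # P(A = false | M = true, F = true)
p_joint = p_M * p_F * p_not_A * p_T_given_A[False]
print(f"P(T, not A, F, M) = {p_joint:.5f}")  # 0.3 * 0.4 * 0.02 * 0.6 = 0.00144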
Assignment
1. Let the probability that a road is wet be P(W) = 0.3 and the probability
of rain be P(R) = 0.3. Given that 90% of the time when the roads are wet
it is because it has rained, and given that it has rained, calculate the
posterior probability that the roads are wet.
2. Let blue, green, and red be three classes of objects with prior prob-
abilities given by P(blue) = 0.3, P(green) = 0.4, P(red) = 0.3. Let
there be three types of objects: pencils, pens, and paper. Let the class-
conditional probabilities of these objects be given as follows. Use Bayes
classifier to classify pencil, pen, and paper.
4. Consider the following dataset with three features f1 , f2 , and f3 . Con-
sider the test pattern f1 = a, f2 = c, f3 = f . Classify it using
NNC and Naive Bayes Classifier.
f1 f2 f3 Class Label
a c e No
b c f Yes
b c e No
b d f Yes
a d f Yes
a d f No