Unit 2: PR Statistical Decision Making
Pattern Recognition
Statistical Decision Making
Dr. Srinath S
Syllabus for Unit - 2
• Statistical Decision Making:
• Introduction, Bayes’ Theorem
• Conditionally Independent Features
• Decision Boundaries
Classification (Revision)
Classification is the task of assigning a class label to an input pattern, where the label indicates
one of a given set of classes. Classification is carried out with the help of a model obtained
using a learning procedure. There are two categories of classification: supervised learning and
unsupervised learning.
• Supervised learning makes use of a set of examples which already have the
class labels assigned to them.
P[A|B] = P[A ∩ B] / P[B], if P[B] ≠ 0
Similarly, P[B|A] = P[A ∩ B] / P[A], if P[A] is not equal to 0
• The original sample space is the red coloured rectangular box.
• P(A|B) asks: what is the probability of A occurring when the sample space is restricted to B?
• Hence P(B) appears in the denominator.
• The area in question is the intersection of A and B.
P[A|B] = P[A ∩ B] / P[B] and P[B|A] = P[A ∩ B] / P[A]
So
P[A ∩ B] = P[B]·P[A|B] = P[A]·P[B|A]
or
P[B]·P[A|B] = P[A]·P[B|A]
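A quick numeric check of these relations, using a fair six-sided die as an illustrative example (A = "even roll", B = "roll greater than 3"):

```python
from fractions import Fraction

# Sample space of a fair six-sided die; every outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}          # event A: the roll is even
B = {4, 5, 6}          # event B: the roll is greater than 3

def P(event):
    return Fraction(len(event), len(omega))

# Conditional probabilities from the definitions above
P_A_given_B = P(A & B) / P(B)
P_B_given_A = P(A & B) / P(A)

# Product rule: P(A ∩ B) = P(B)·P(A|B) = P(A)·P(B|A)
assert P(A & B) == P(B) * P_A_given_B == P(A) * P_B_given_A
print(P(A & B), P_A_given_B, P_B_given_A)   # 1/3 2/3 2/3
```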
For a feature vector X, Bayes' rule gives P(wi|X) = P(X|wi)·P(wi) / P(X).
This is the probability of any vector X being assigned to class wi; P(X) is the evidence (the overall probability of observing X).
Example for Bayes' Rule / Theorem
• Given Bayes' rule: P(A|B) = P(B|A)·P(A) / P(B)
Example 1:
• Probability of (King | Face card)
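A short worked computation for this example, assuming a standard 52-card deck (4 kings, 12 face cards, and every king is a face card):

P(King | Face) = P(Face | King)·P(King) / P(Face) = (1 × 4/52) / (12/52) = 4/12 = 1/3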
The probability of having a fever, given that a person has a cold, is P(f|C) = 0.4. The prior
probability of a cold is P(C) = 0.01, and the overall probability of fever is P(f) = 0.02.
Then, using Bayes' theorem, the probability that a person has a cold, given that she (or he)
has a fever, is:

P(C|f) = P(f|C)·P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2
Generalized Bayes Theorem
• Consider that we have 3 classes A1, A2 and A3.
• The area under the red box is the sample space.
• Consider that they are mutually exclusive and collectively exhaustive.
• Mutually exclusive means that if one event occurs, then another event cannot happen.
• Collectively exhaustive means that the events together cover the whole sample space, i.e. P(A1),
P(A2) and P(A3) account for the total rectangular red coloured space (they sum to 1).
• Consider now that another event B occurs over A1, A2 and A3.
• Some area of B is common with A1, with A2 and with A3, as shown in the figure below:
• The portion common to A1 and B is: P(A1 ∩ B) = P(B|A1)·P(A1)
• The portion common to A2 and B is: P(A2 ∩ B) = P(B|A2)·P(A2)
• The portion common to A3 and B is: P(A3 ∩ B) = P(B|A3)·P(A3)
• So B is represented by: P(B) = P(B|A1)·P(A1) + P(B|A2)·P(A2) + P(B|A3)·P(A3)
So the given problem can be represented as:

P(Ai|B) = P(B|Ai)·P(Ai) / [P(B|A1)·P(A1) + P(B|A2)·P(A2) + P(B|A3)·P(A3)]
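A minimal Python sketch of the generalized rule, with made-up priors and likelihoods purely for illustration:

```python
# Generalized Bayes' theorem over mutually exclusive, collectively exhaustive
# events A1..A3, given that an event B is observed. Numbers are placeholders.
priors = {"A1": 0.3, "A2": 0.5, "A3": 0.2}          # P(Ai), sums to 1
likelihoods = {"A1": 0.10, "A2": 0.40, "A3": 0.25}  # P(B | Ai)

# Total probability: P(B) = sum_i P(B|Ai) * P(Ai)
p_b = sum(likelihoods[a] * priors[a] for a in priors)

# Posterior for each event: P(Ai|B) = P(B|Ai) * P(Ai) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_b for a in priors}
print(p_b, posteriors)   # posteriors sum to 1
```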
Example-4.
Given: 1% of people have a certain genetic defect (so 99% do not have the genetic defect).
In 90% of tests on people with the genetic defect, the defect is detected (true positives).
9.6% of tests on people without the defect come back positive (false positives).
A = chance of having the genetic defect. That was given in the question as 1%. (P(A) = 0.01)
That also means the probability of not having the gene (~A) is 99%. (P(~A) = 0.99)
X = A positive test result.
P(A|X) = Probability of having the genetic defect given a positive test result. (To be computed)
P(X|A) = Chance of a positive test result given that the person actually has the genetic defect = 90%. (0.90)
P(X|~A) = Chance of a positive test if the person doesn't have the genetic defect. That was given in the question as 9.6% (0.096)
Now we have all of the information we need to put into the equation:
• P(A) = 0.01
• P(~A) = 0.99
• P(X|A) = 0.90
• P(X|~A) = 0.096
First compute P(X), the probability of testing positive:
P(X) = P(X|A)·P(A) + P(X|~A)·P(~A) = (0.9 × 0.01) + (0.096 × 0.99) = 0.009 + 0.09504 = 0.10404
Then:
P(A|X) = (0.9 × 0.01) / 0.10404 ≈ 0.087
So a person who tests positive has only about an 8.7% chance of actually having the genetic defect.
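A quick Python check of Example-4's numbers:

```python
# Bayes' theorem for Example-4: P(genetic defect | positive test)
p_a = 0.01               # P(A): prior probability of the defect
p_not_a = 0.99           # P(~A)
p_x_given_a = 0.90       # P(X|A): true positive rate
p_x_given_not_a = 0.096  # P(X|~A): false positive rate

# Evidence: P(X) = P(X|A)*P(A) + P(X|~A)*P(~A)
p_x = p_x_given_a * p_a + p_x_given_not_a * p_not_a

p_a_given_x = p_x_given_a * p_a / p_x
print(round(p_x, 5), round(p_a_given_x, 4))   # 0.10404 0.0865
```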
Example-6
A disease occurs in 0.5% of the population, i.e. P(D) = 0.5/100 = 0.005.
The test detects the disease in 99% of diseased people (P(PT|D) = 0.99) and gives a positive
result for 5% of people without the disease (P(PT|~D) = 0.05).
What is the probability of a person having the disease, given a positive result?
◦ We know:
P(D) = chance of having the disease
P(~D) = chance of not having the disease

◦ P(disease | positive test) = P(PT|D) × P(D) / [P(PT|D) × P(D) + P(PT|~D) × P(~D)]
◦ = (0.99 × 0.005) / ((0.99 × 0.005) + (0.05 × 0.995))
Therefore:
P(disease | positive test) = (0.99 × 0.005) / 0.0547 = 0.09
i.e. about 9%.
• If the likelihood ratio R = [p(x|A)·P(A)] / [p(x|B)·P(B)] is greater than 1, we should select class A
as the most likely class of the sample; otherwise we select class B.
• A boundary between the decision regions is called a decision boundary.
• Optimal decision boundaries separate the feature space into decision regions R1, R2, …, Rn such
that class Ci is more probable for values of x in Ri than in any other region.
• For feature values exactly on the decision boundary between two classes, the two classes are
equally probable.
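A minimal sketch of these ideas in Python for a single feature x and two classes modelled with Gaussian class-conditional densities (the densities, means and priors below are illustrative assumptions):

```python
import numpy as np

def gauss(x, mu, sigma):
    # Assumed Gaussian class-conditional density p(x | class)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

P_A, P_B = 0.5, 0.5                     # assumed priors
xs = np.linspace(-3.0, 6.0, 901)        # grid of feature values

# Likelihood ratio R(x) = p(x|A)P(A) / (p(x|B)P(B))
R = (gauss(xs, 0.0, 1.0) * P_A) / (gauss(xs, 3.0, 1.0) * P_B)

decide_A = R > 1                        # decision rule: choose A wherever R > 1
# The decision boundary is where R crosses 1: both classes are equally probable there
boundary = xs[np.argmin(np.abs(R - 1.0))]
print(boundary)                         # ~1.5 for equal priors and equal variances
```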
• Two random variables X and Y are statistically independent if p(x, y) = p(x)·p(y).
• X = height and Y = weight are usually not independent; their joint probabilities typically show
dependence.
• Independence is equivalent to saying
• P(y|x) = P(y) or
• P(x|y) = P(x)
Conditional Independence
• Two random variables X and Y are said to be independent given Z if and only if
P(X, Y | Z) = P(X|Z)·P(Y|Z), or equivalently P(X | Y, Z) = P(X|Z).
– For example, a smaller height suggests a smaller age, and vocabulary varies with age.
– So vocabulary is dependent on height; but once the age Z is known, vocabulary and height
become conditionally independent (see the sketch below).
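A small numeric sketch of this situation, with entirely made-up probabilities for Z = age, X = height and Y = vocabulary:

```python
import numpy as np

# Toy demo: X (height) and Y (vocabulary) are dependent overall,
# but conditionally independent given Z (age). All numbers are made up.
p_z = np.array([0.5, 0.5])              # P(Z): child, adult
p_x_given_z = np.array([[0.8, 0.2],     # P(X|Z): rows = Z, cols = short / tall
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.9, 0.1],     # P(Y|Z): rows = Z, cols = small / large vocabulary
                        [0.3, 0.7]])

# Build the joint P(X, Y, Z) assuming conditional independence given Z
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

# Given each value of Z, P(X, Y | Z) factorises into P(X|Z)·P(Y|Z)
for z in range(2):
    pxy_given_z = joint[:, :, z] / joint[:, :, z].sum()
    assert np.allclose(pxy_given_z, np.outer(p_x_given_z[z], p_y_given_z[z]))

# But marginally, P(X, Y) != P(X)·P(Y): X and Y are dependent
pxy = joint.sum(axis=2)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
print(np.allclose(pxy, np.outer(px, py)))   # False
```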
For the mammal (M) vs. non-mammal (N) example, with conditionally independent features:

P(A|M) = (6/7) × (6/7) × (2/7) × (2/7) = 0.06
P(A|N) = (1/13) × (10/13) × (3/13) × (4/13) = 0.0042

P(A|M)·P(M) = 0.06 × (7/20) = 0.021
P(A|N)·P(N) = 0.0042 × (13/20) = 0.0027

Since P(A|M)·P(M) > P(A|N)·P(N) => Mammals
Example: ‘Play Tennis’ data
• The naïve Bayes classifier is very popular for document classification.
• (Naïve means all attributes are treated as equal and independent: every attribute gets equal
weightage and is assumed independent of the others.)
Based on the examples in the table, classify the following datum x:
x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong)
• That means: Play tennis or not?
h_NB = argmax over h in {yes, no} of P(h)·P(x|h) = argmax over h in {yes, no} of P(h)·Π_t P(a_t|h)
     = argmax over h in {yes, no} of P(h)·P(Outlook = sunny|h)·P(Temp = cool|h)·P(Humidity = high|h)·P(Wind = strong|h)
• Working:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.
P(yes)·P(sunny|yes)·P(cool|yes)·P(high|yes)·P(strong|yes) = 0.0053
P(no)·P(sunny|no)·P(cool|no)·P(high|no)·P(strong|no) = 0.0206
Answer: PlayTennis(x) = no
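A compact Python sketch of this classifier. The 14 training rows are assumed to be the standard ‘Play Tennis’ table referenced above (they reproduce the 9/14, 5/14, 3/9 and 3/5 figures quoted):

```python
from collections import Counter, defaultdict

# Assumed standard 14-example Play Tennis data: (Outlook, Temp, Humidity, Wind) -> label
data = [
    (("Sunny", "Hot", "High", "Weak"), "no"),
    (("Sunny", "Hot", "High", "Strong"), "no"),
    (("Overcast", "Hot", "High", "Weak"), "yes"),
    (("Rain", "Mild", "High", "Weak"), "yes"),
    (("Rain", "Cool", "Normal", "Weak"), "yes"),
    (("Rain", "Cool", "Normal", "Strong"), "no"),
    (("Overcast", "Cool", "Normal", "Strong"), "yes"),
    (("Sunny", "Mild", "High", "Weak"), "no"),
    (("Sunny", "Cool", "Normal", "Weak"), "yes"),
    (("Rain", "Mild", "Normal", "Weak"), "yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "yes"),
    (("Overcast", "Mild", "High", "Strong"), "yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "yes"),
    (("Rain", "Mild", "High", "Strong"), "no"),
]

label_counts = Counter(label for _, label in data)
cond_counts = defaultdict(int)              # (attribute index, value, label) -> count
for features, label in data:
    for i, value in enumerate(features):
        cond_counts[(i, value, label)] += 1

def naive_bayes(x):
    """Return the label h maximising P(h) * prod_t P(a_t | h), with all scores."""
    scores = {}
    for label, n_label in label_counts.items():
        score = n_label / len(data)                              # prior P(h)
        for i, value in enumerate(x):
            score *= cond_counts[(i, value, label)] / n_label    # P(a_t | h)
        scores[label] = score
    return max(scores, key=scores.get), scores

x = ("Sunny", "Cool", "High", "Strong")
print(naive_bayes(x))    # ('no', {'no': ~0.0206, 'yes': ~0.0053})
```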
What is our probability of error?
• For the two class situation, we have
• P(error|x) = { P(ω1|x) if we decide ω2
{ P(ω2|x) if we decide ω1
• We can minimize the probability of error by following the posterior:
Decide ω1 if P(ω1|x) > P(ω2|x)
Probability of error becomes P(error|x) = min [P(ω1|x), P(ω2|x)]
Equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.
That is, the evidence term p(x) is not used in decision making.
Conversely, if we have uniform priors, then the decision will rely exclusively on the
likelihoods.
Take Home Message: Decision making relies on both the priors and the likelihoods and
Bayes Decision Rule combines them to achieve the minimum probability of error.
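A brief Python illustration of the take-home message at a single feature value x, with assumed Gaussian likelihoods and priors:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Assumed Gaussian class-conditional density p(x | class)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

P_w1, P_w2 = 0.7, 0.3                                   # assumed priors
x = 0.5
like1, like2 = gauss(x, 0.0, 1.0), gauss(x, 2.0, 1.0)   # p(x|w1), p(x|w2)

# Posteriors via Bayes' rule; the evidence only normalises and cancels in the comparison
evidence = like1 * P_w1 + like2 * P_w2
post1, post2 = like1 * P_w1 / evidence, like2 * P_w2 / evidence

decision = "w1" if like1 * P_w1 > like2 * P_w2 else "w2"
p_error_x = min(post1, post2)                           # P(error|x) under the Bayes rule
print(decision, round(post1, 3), round(post2, 3), round(p_error_x, 3))
```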
Application of Naïve Bayes Classifier for NLP
• Consider the following sentences:
– S1 : The food is Delicious : Liked
– S2 : The food is Bad : Not Liked
– S3 : Bad food : Not Liked
– Given a new sentence, classify it as Liked or Not Liked.
        F1 (Food)   F2 (Delicious)   F3 (Bad)   Output
• S1        1             1              0        1 (Liked)
• S2        1             0              1        0 (Not Liked)
• S3        1             0              1        0 (Not Liked)
• For a new sentence containing the features Delicious and Food:
• P(Liked | attributes) ∝ P(Delicious | Liked) × P(Food | Liked) × P(Liked) = (1/1) × (1/1) × (1/3) = 0.33
• P(Not Liked | attributes) ∝ P(Delicious | Not Liked) × P(Food | Not Liked) × P(Not Liked) = (0/2) × (2/2) × (2/3) = 0
• Since 0.33 > 0, the sentence is classified as Liked.
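A minimal Python sketch of this word-presence naïve Bayes; the test sentence ("Delicious food") is an assumed example:

```python
# Word-presence naive Bayes for the three training sentences above.
train = [
    ({"food": 1, "delicious": 1, "bad": 0}, "liked"),      # S1
    ({"food": 1, "delicious": 0, "bad": 1}, "not liked"),  # S2
    ({"food": 1, "delicious": 0, "bad": 1}, "not liked"),  # S3
]

def score(words_present, label):
    """Unnormalised P(label) * prod P(word present | label)."""
    docs = [features for features, l in train if l == label]
    s = len(docs) / len(train)                               # prior P(label)
    for word in words_present:
        s *= sum(d.get(word, 0) for d in docs) / len(docs)   # P(word | label)
    return s

new_sentence = {"delicious", "food"}          # assumed new sentence: "Delicious food"
scores = {label: score(new_sentence, label) for label in {"liked", "not liked"}}
print(scores, max(scores, key=scores.get))    # liked wins: 0.33 vs 0.0
```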