pr2 Bayes

The document discusses Bayes decision theory in the context of classifying fish species based on their lightness feature. It covers concepts such as prior probabilities, decision rules, likelihood, and the minimization of classification errors using posterior probabilities. Additionally, it contrasts generative and discriminative models for solving decision problems in pattern recognition.


Pattern Recognition

Course Instructor
Prof. Jyotsna Singh
Bayes decision theory
 Assume that an image segmentation module has already extracted the shape of the fish.
 A feature extraction module has characterized each shape/pattern with one feature: the average lightness of the shape.
 Decision problem: we want to assign each shape/pattern to one of the two classes considered (salmon, sea bass).
Bayes decision theory

 Design classifiers to recommend decisions that minimize some total expected "risk".
 The simplest risk is the classification error (i.e., costs are equal).
 Typically, the risk includes the cost associated with different decisions.
Terminology
 State of nature ω (random variable):
e.g. ω1 for sea bass, ω2 for salmon
 Probabilities P(ω1) and P(ω2) (priors):
e.g. prior knowledge of how likely it is to get a sea bass or a salmon
 Probability density function p(x) (evidence):
e.g. how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)
 Conditional probability density p(x|ωj) (likelihood):
e.g. how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
 Conditional probability P(ωj|x) (posterior):
e.g. the probability that the fish belongs to class ωj given measurement x.
We assume that we cannot know deterministically which is the
“class” (salmon or sea bass) of the next fish incoming on the
conveyor belt.
So the problem must be formulated in probabilistic terms.
 Bayes decision theory formalizes this situation with the concept of
“state of nature” (usually called “class” in pattern recognition).
 Let ω =ω1 or ω =ω2 be the variable that identifies the class,
where ω is a random variable.
 we have two states-of-nature/classes: ω1 and ω2

 The two classes could have the same prior probability:
P(ω1) = P(ω2)
P(ω1) + P(ω2) = 1 (we have just two species of fish)
Decision Rule Using Prior Probabilities
The a priori or prior probability reflects our knowledge of
how likely we expect a certain state of nature before we can
actually observe it.
 In the fish example, it is the probability that we will see either a
salmon or a sea bass next on the conveyor belt.
 The prior may vary depending on the situation.
 If we get equal numbers of salmon and sea bass in a catch, then the
priors are equal or uniform.
 Depending on the season, we may get more salmon than sea bass.

 We write P(ω = ω1), or just P(ω1), for the prior probability that the next fish is a sea bass.
 The priors must exhibit exclusivity and exhaustivity. For c states of nature, or classes:
P(ω1) + P(ω2) + … + P(ωc) = 1, with P(ωi) ≥ 0 for every i.
Decision Rule Using Prior Probabilities
 A decision rule prescribes what action to take based on
observed input.

 Idea Check: What is a reasonable decision rule if
➢ the only available information is the prior, and
➢ the two decisions have the same risk?
 If we must make a decision without being able to see the incoming fish, the only rational decision would be:
Assign the fish to ω1 if P(ω1) > P(ω2), else assign the fish to ω2.
This "blind" (a priori) decision works well only if one class is much more likely, e.g., P(ω1) >> P(ω2).
 If the priors are uniform, this rule will behave poorly.
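As a small illustration of this prior-only rule, here is a minimal Python sketch; the class names and the 0.7/0.3 priors are made-up values for illustration, not numbers from the lecture.

# Hypothetical priors for the two classes (values chosen only for illustration).
priors = {"sea_bass": 0.7, "salmon": 0.3}

def decide_from_priors(priors):
    # "Blind" rule: always pick the class with the larger prior probability.
    return max(priors, key=priors.get)

print(decide_from_priors(priors))  # prints "sea_bass" for every incoming fish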
Feature space
 In general, we must “see” the pattern to make a rational
decision according to Bayesian theory.
We must see the fish and characterize it with some features.
 A feature is an observable variable. A feature space is a set from
which we can sample or observe values.
 Examples of features:
 Length
 Width
 Lightness
 For simplicity, let's assume that our features are all continuous
values.
 Denote a scalar feature as x and a vector feature as x (boldface). For a d-dimensional feature space, x ∈ R^d.
Class-Conditional Density or Likelihood

 The class-conditional probability density function is the probability density function for x, our feature, given that the state of nature is ω:
p(x|ω)
 For example, x could be the average lightness of the pattern. As fish coming in on the belt will have "random" lightness values, the lightness feature x should be treated as a random variable with class-conditional distribution p(x|ωi).
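The lecture does not fix a particular form for p(x|ω), so the sketch below simply assumes Gaussian class-conditional densities for the lightness feature, with made-up means and standard deviations (salmon taken to be darker than sea bass).

from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    # Density of N(mean, std^2) evaluated at x.
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))

# Hypothetical (mean, std) of the lightness feature for each class.
LIGHTNESS_PARAMS = {"salmon": (4.0, 1.0), "sea_bass": (7.0, 1.5)}

def likelihood(x, cls):
    # Class-conditional density p(x | class) under the Gaussian assumption.
    mean, std = LIGHTNESS_PARAMS[cls]
    return gaussian_pdf(x, mean, std)

print(likelihood(5.0, "salmon"), likelihood(5.0, "sea_bass"))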
Posterior Probability Bayes Formula
 If we know the prior distribution and the class-conditional density, how does this affect our decision rule?
 Posterior probability is the probability of a certain state of nature given our observables: P(ω|x).
 Use Bayes' formula:
P(ωj|x) = p(x|ωj) P(ωj) / p(x), where the evidence is p(x) = Σj p(x|ωj) P(ωj).
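A minimal sketch of Bayes' formula and the MAP decision rule discussed on the next slides, reusing the hypothetical priors and the Gaussian likelihood function from the sketches above:

def posterior(x, priors, likelihood):
    # Bayes formula: P(w_j | x) = p(x | w_j) P(w_j) / p(x), where the evidence
    # p(x) is the sum over all classes of p(x | w_j) P(w_j).
    joint = {c: likelihood(x, c) * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

def map_decision(x, priors, likelihood):
    # MAP rule: assign x to the class with the largest posterior probability.
    post = posterior(x, priors, likelihood)
    return max(post, key=post.get)

print(map_decision(5.0, {"salmon": 0.5, "sea_bass": 0.5}, likelihood))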
The MAP decision rule
MAP decision rule with more than two classes
MAP rule for error probability minimization
The Maximum Likelihood or ML rule
Decision regions
Probability of Error
The performance of any decision rule can be measured by its probability of error P[error], which, making use of the theorem of total probability, can be broken up into
P[error] = Σi P(error|ωi) P(ωi).
Probability of error
 A mistake occurs when an input vector belonging to class C1 is assigned to class C2 or vice versa.
 The probability of this occurring is given by
p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫R1 p(x, C2) dx + ∫R2 p(x, C1) dx
 Clearly, to minimize p(mistake) we should arrange that each x is assigned to whichever class has the smaller value of the integrand.
 Thus, if p(x, C1) > p(x, C2) for a given value of x, then we should assign that x to class C1.
Using the product rule p(x, Ck) = p(Ck|x) p(x), and noting that the factor p(x) is common to all terms, we see that
➢ Each x should be assigned to the class having the largest posterior probability p(Ck|x).
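A sketch of how this minimum error probability could be evaluated numerically: under the minimum-error rule the error contribution at each x is the smaller of the two joint densities, so p(mistake) is the integral of that minimum. The equal priors, the grid range, and the reuse of the hypothetical likelihood function above are all illustrative assumptions.

def error_probability(priors, likelihood, xs):
    # At each grid point the error contribution is the minimum over classes of
    # the joint density p(x, C_k) = p(x | C_k) P(C_k); integrate by trapezoids.
    integrand = [min(likelihood(x, c) * priors[c] for c in priors) for x in xs]
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (integrand[i] + integrand[i - 1]) * (xs[i] - xs[i - 1])
    return total

grid = [i * 0.01 for i in range(1501)]  # lightness values from 0 to 15
print(error_probability({"salmon": 0.5, "sea_bass": 0.5}, likelihood, grid))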
Bayes decision rule
 From the product rule of probability we have
p(x, Ck) = p(Ck|x)p(x).
Because the factor p(x) is common to both terms, we can
restate this result as saying that the minimum probability of
making a mistake is obtained if each value of x is assigned to
the class for which the posterior probability p(Ck|x) is largest.
 For the more general case of K classes, it is slightly easier to maximize the probability of being correct, which is given by
p(correct) = Σk p(x ∈ Rk, Ck) = Σk ∫Rk p(x, Ck) dx,
 which is maximized when the regions Rk are chosen such that each x is assigned to the class for which p(x, Ck) is largest.
[Figure: schematic illustration of the joint probabilities p(x, Ck) for each of two classes plotted against x, together with the decision boundary x = x̂.]
➢Values of x ≥ x̂ are classified as class C2 and hence belong to decision region R2, whereas points x < x̂ are classified as C1 and belong to R1.
➢Errors arise from the blue, green, and red regions of the figure: for x < x̂ the errors are due to points from class C2 being misclassified as C1 (joint red and green regions), and for points in the region x ≥ x̂ the errors are due to points from class C1 being misclassified as C2 (blue region).
➢As we vary the location x̂ of the decision boundary, the combined area of the blue and green regions remains constant, whereas the size of the red region varies.

➢The optimal choice for x̂ is where the curves for p(x, C1) and p(x, C2) cross,
corresponding to x̂ = x0 , because in this case the red region disappears.
➢This is equivalent to the minimum misclassification rate decision rule, which
assigns each value of x to the class having the higher posterior probability p(Ck|x).
Probability of Error
MAP rule for error probability minimization
error probability for LRT
error probability for LRT
Bayes Decision Rule (with Equal Costs)
From error to risk
From error to risk
Loss Function
Loss Matrix
 For many applications, our objective will be more complex than
simply minimizing the number of misclassifications.
 Let us consider again the medical diagnosis problem.
 If a patient who does not have cancer is incorrectly diagnosed as having
cancer, the consequences may be some patient distress plus the need
for further investigations.
 Conversely, if a patient with cancer is diagnosed as healthy, the result
may be premature death due to lack of treatment.
 Thus, the consequences of these two types of mistake can be
dramatically different.
Loss Function
We can formalize such issues through the introduction of a loss function, also called a cost function, which is a single, overall measure of the loss incurred in taking any of the available decisions or actions.
An example of a loss matrix with elements Lkj for the cancer treatment problem. The rows correspond to the true class, whereas the columns correspond to the assignment of class made by our decision criterion:

                 decide cancer   decide normal
true cancer            0              1000
true normal            1                 0

Suppose that, for a new value of x, the true class is Ck and we assign x to class Cj. In so doing, we incur some level of loss Lkj, which we can view as the k, j element of a loss matrix.
Cancer example:
✓Loss matrix says that there is no loss incurred if the correct decision is made,
✓There is a loss of 1 if a healthy patient is diagnosed as having cancer,
✓Whereas there is a loss of 1000 if a patient having cancer is diagnosed as healthy.
Minimizing the expected loss
 The optimal solution is the one which minimizes the loss function.
However,
 The loss function depends on the true class, which is unknown.
 For a given input vector x, our uncertainty in the true class is expressed
through the joint probability distribution p(x, Ck)
 So, we minimize the average loss, where the average is computed with respect to this distribution, which is given by
E[L] = Σk Σj ∫Rj Lkj p(x, Ck) dx
 Thus the decision rule that minimizes the expected loss is the one that assigns each new x to the class j for which the quantity Σk Lkj p(Ck|x) is a minimum.
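A minimal sketch of this minimum-expected-loss rule using the cancer loss matrix from the earlier slide; the 0.02/0.98 posterior values below are made-up numbers chosen to show how the asymmetric losses shift the decision.

# Loss matrix L_kj: rows (first key) are the true class, columns the decision.
LOSS = {("cancer", "cancer"): 0, ("cancer", "normal"): 1000,
        ("normal", "cancer"): 1, ("normal", "normal"): 0}

def min_risk_decision(posteriors):
    # Choose the decision j that minimizes sum_k L_kj * P(C_k | x).
    risks = {j: sum(LOSS[(k, j)] * posteriors[k] for k in posteriors)
             for j in ("cancer", "normal")}
    return min(risks, key=risks.get)

# Even a small posterior probability of cancer tips the decision toward "cancer".
print(min_risk_decision({"cancer": 0.02, "normal": 0.98}))  # -> "cancer"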
Minimizing the expected loss
Minimizing the expected loss
Reject option
It may be appropriate to use an automatic system to classify observations for which there is little doubt, while leaving a human expert to classify the more ambiguous cases.
 We can achieve this by introducing a threshold θ and rejecting those inputs x for which the largest of the posterior probabilities p(Ck|x) is less than or equal to θ.
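A minimal sketch of this reject rule; the class names, posterior values, and threshold below are illustrative assumptions.

def decide_with_reject(posteriors, theta):
    # Reject (defer to a human expert) if the largest posterior is <= theta,
    # otherwise return the most probable class.
    best = max(posteriors, key=posteriors.get)
    return "reject" if posteriors[best] <= theta else best

print(decide_with_reject({"salmon": 0.55, "sea_bass": 0.45}, theta=0.8))  # -> "reject"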
 We have broken the classification problem down into two
separate stages,
 The inference stage in which we use training data to learn a model for
p(Ck|x), and
 the subsequent decision stage in which we use these posterior probabilities to
make optimal class assignments.
 An alternative possibility would be to solve both problems
together and simply learn a function that maps inputs x directly
into decisions.
Such a function is called a discriminant function.
Generative models
We can identify three distinct approaches to solving decision problems, all
of which have been used in practical applications. These are given, in
decreasing order of complexity, by:

 First solve the inference problem of determining the class-conditional densities p(x|Ck) for each class Ck individually.
 Separately infer the prior class probabilities p(Ck).
 Then use Bayes' theorem to find the posterior class probabilities p(Ck|x).
 Having found the posterior probabilities, we use decision theory to determine class membership for each new input x.
 Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.
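A minimal sketch of this generative route, assuming (purely for illustration) Gaussian class-conditionals fitted to a few made-up labeled lightness values; it reuses the gaussian_pdf helper from the earlier sketch.

def fit_gaussian(samples):
    # Maximum-likelihood estimates of the mean and standard deviation.
    m = sum(samples) / len(samples)
    var = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, max(var, 1e-6) ** 0.5

# Made-up labeled lightness measurements, purely for illustration.
train = {"salmon": [3.8, 4.1, 4.5, 3.9], "sea_bass": [6.8, 7.2, 7.5]}
params = {c: fit_gaussian(xs) for c, xs in train.items()}
priors = {c: len(xs) / sum(len(v) for v in train.values()) for c, xs in train.items()}

def generative_posterior(x):
    # Bayes' theorem with the estimated class-conditionals and priors.
    joint = {c: gaussian_pdf(x, *params[c]) * priors[c] for c in params}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

print(generative_posterior(5.0))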
Discriminative models

 First solve the inference problem of determining the posterior class probabilities p(Ck|x),
 subsequently use decision theory to assign each new x to one of the
classes.
 Approaches that model the posterior probabilities directly are called
discriminative models.
Discriminant function
 Find a function f(x), called a discriminant function, which
maps each input x directly onto a class label.
 For instance, in the case of two-class problems, f(·) might be
binary valued and such that
f = 0 represents class C1
f = 1 represents class C2.
 In this case, probabilities play no role.
