Unit-II Bayesian Decision Theory

This document introduces Bayesian decision theory for pattern classification problems. It discusses using prior probabilities and cost functions to determine optimal classification decisions given measurements. The key concepts covered are: applying Bayes' rule to calculate posterior probabilities given priors and likelihoods; defining loss/risk functions to quantify costs of different decisions; and deriving the Bayes decision rule and Bayes risk to minimize total risk.


UNIT-II BAYESIAN DECISION THEORY
INTRODUCTION

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
It quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany those decisions.
It assumes that all relevant probability distributions are known.
Can we exploit prior knowledge in our fish classification problem?
  Is the sequence of fish types predictable? (statistics)
  Is each class equally probable? (uniform priors)
  What is the cost of an error? (risk, optimization)

BAYESIAN DECISION THEORY
PRIOR PROBABILITIES

The state of nature is prior information. Model it as a random variable, ω:
  ω = ω1: the event that the next fish is a sea bass
  category 1: sea bass; category 2: salmon
P(ω1) = probability of category 1
P(ω2) = probability of category 2
P(ω1) + P(ω2) = 1
Exclusivity: ω1 and ω2 share no basic events.
Exhaustivity: the union of all outcomes is the sample space (either ω1 or ω2 must occur).
If all incorrect classifications have an equal cost:
  Decide ω1 if P(ω1) > P(ω2); otherwise, decide ω2.
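A minimal sketch of this prior-only rule (the prior values below are hypothetical, chosen only for illustration):

```python
# Decision from priors alone: always decide the category with the larger prior.
# The prior values are illustrative, not taken from the slides.
priors = {"sea bass": 0.6, "salmon": 0.4}      # P(omega_1), P(omega_2); must sum to 1

decision = max(priors, key=priors.get)         # decide omega_1 if P(omega_1) > P(omega_2)
print(f"Always decide: {decision}")
```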

BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES

A decision rule based only on prior information always produces the same result and ignores measurements.
If P(ω1) >> P(ω2), we will be correct most of the time.
Probability of error: P(error) = min(P(ω1), P(ω2)).
Given a feature x (lightness), which is a continuous random variable, p(x|ωj) is the class-conditional probability density function.
p(x|ω1) and p(x|ω2) describe the difference in lightness between the populations of sea bass and salmon.
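The class-conditional densities for the fish example are given only as curves in the original figure; the sketch below stands in two assumed Gaussian densities for p(x|ω1) and p(x|ω2) so that later snippets have something concrete to evaluate (the Gaussian form, means, and spreads are assumptions):

```python
import numpy as np

# Assumed class-conditional densities p(x|omega_j) for the lightness feature x.
# The Gaussian form and its parameters are illustrative stand-ins, not the
# densities from the original figure.
def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def p_x_given_w1(x):          # p(x | omega_1): sea bass lightness (assumed)
    return gaussian_pdf(x, mean=13.0, std=1.5)

def p_x_given_w2(x):          # p(x | omega_2): salmon lightness (assumed)
    return gaussian_pdf(x, mean=17.0, std=2.0)

print(p_x_given_w1(14.0), p_x_given_w2(14.0))
```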
BAYESIAN DECISION THEORY
PROBABILITY FUNCTIONS

A probability density function is denoted in lowercase and represents a function of a continuous variable.
p_x(x|ω), often abbreviated as p(x), denotes a probability density function for the random variable X. Note that p_x(x|ω) and p_y(y|ω) can be two different functions.
P(x|ω) denotes a probability mass function, and must obey the following constraints:

$$P(x) \ge 0, \qquad \sum_{x \in \mathcal{X}} P(x) = 1$$

Probability mass functions are typically used for discrete random variables, while densities describe continuous random variables (the latter must be integrated).
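A tiny sanity check of these constraints for a discretised feature (the mass values themselves are made up):

```python
# A valid probability mass function is non-negative and sums to 1 over its support.
# The values below are arbitrary, for illustration only.
pmf = {"light": 0.2, "medium": 0.5, "dark": 0.3}   # P(x) for a discretised lightness

assert all(p >= 0 for p in pmf.values())           # P(x) >= 0
assert abs(sum(pmf.values()) - 1.0) < 1e-12        # sum over x of P(x) = 1
print("valid probability mass function")
```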
BAYESIAN DECISION THEORY
BAYES FORMULA

Suppose we know both P(ωj) and p(x|ωj), and we can measure x. How does this influence our decision?
The joint probability of finding a pattern that is in category j and has feature value x is:

$$p(\omega_j, x) = P(\omega_j \mid x)\, p(x) = p(x \mid \omega_j)\, P(\omega_j)$$

Rearranging terms, we arrive at Bayes formula:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

where, in the case of two categories, the evidence is:

$$p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$$
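A direct transcription of the two-category form of Bayes formula; the likelihood values and priors passed in at the bottom are placeholders used only to show the mechanics:

```python
def posteriors_two_class(px_w1, px_w2, prior_w1, prior_w2):
    """Bayes formula: P(w_j|x) = p(x|w_j) P(w_j) / p(x), where the evidence is
    p(x) = p(x|w_1) P(w_1) + p(x|w_2) P(w_2)."""
    evidence = px_w1 * prior_w1 + px_w2 * prior_w2
    return px_w1 * prior_w1 / evidence, px_w2 * prior_w2 / evidence

# Placeholder numbers only; note the two posteriors always sum to 1.
post_w1, post_w2 = posteriors_two_class(0.20, 0.05, 2/3, 1/3)
print(post_w1, post_w2, post_w1 + post_w2)
```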
BAYESIAN DECISION THEORY
BAYES FORMULA

Bayes formula,

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

can be expressed in words as:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

By measuring x, we can convert the prior probability, P(ωj), into a posterior probability, P(ωj|x).
The evidence can be viewed as a scale factor and is often ignored in optimization applications (e.g., speech recognition).
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES

Two-class fish sorting problem (P(ω1) = 2/3, P(ω2) = 1/3):
For every value of x, the posteriors sum to 1.0.
At x = 14, the probability that the pattern belongs to category ω2 is 0.08, and to category ω1 is 0.92.
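These numbers follow from Bayes formula once the class-conditional density values at x = 14 are known. The density values in the sketch below are assumptions: only their ratio matters, and it is chosen so that the posteriors come out at 0.92 and 0.08 with the stated priors.

```python
# Posteriors at x = 14 for the fish example with P(w1) = 2/3, P(w2) = 1/3.
# The density values p(14|w1) and p(14|w2) are assumed; only their ratio matters
# here, and it is picked so the result matches the 0.92 / 0.08 split above.
prior_w1, prior_w2 = 2/3, 1/3
px_w1, px_w2 = 0.575, 0.100            # assumed p(x=14 | w1), p(x=14 | w2)

evidence = px_w1 * prior_w1 + px_w2 * prior_w2
post_w1 = px_w1 * prior_w1 / evidence
post_w2 = px_w2 * prior_w2 / evidence
print(round(post_w1, 2), round(post_w2, 2))   # 0.92 0.08
```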
BAYESIAN DECISION THEORY
BAYES DECISION RULE

Decision rule: for an observation x, decide ω1 if P(ω1|x) > P(ω2|x); otherwise, decide ω2.

Probability of error:

$$P(\mathrm{error} \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases}$$

Under the Bayes decision rule, P(error|x) = min[P(ω1|x), P(ω2|x)].

The average probability of error is given by:

$$P(\mathrm{error}) = \int P(\mathrm{error}, x)\, dx = \int P(\mathrm{error} \mid x)\, p(x)\, dx$$

If for every x we ensure that P(error|x) is as small as possible, then the integral is as small as possible. Thus, the Bayes decision rule minimizes P(error).
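A numerical sketch of this quantity on a grid: it evaluates P(error) = ∫ min[P(ω1|x), P(ω2|x)] p(x) dx, using the same assumed Gaussian class-conditional densities and priors as before (all parameters are illustrative):

```python
import numpy as np

# Numerical check of the Bayes error with assumed Gaussian class-conditionals.
# Means, spreads, and priors are illustrative, not taken from the slides.
def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

prior = np.array([2/3, 1/3])                        # P(w1), P(w2)
x = np.linspace(0.0, 30.0, 3001)
lik = np.vstack([gaussian_pdf(x, 13.0, 1.5),        # p(x|w1)
                 gaussian_pdf(x, 17.0, 2.0)])       # p(x|w2)

evidence = prior @ lik                              # p(x) = sum_j p(x|w_j) P(w_j)
posterior = prior[:, None] * lik / evidence         # P(w_j|x); columns sum to 1
p_error_given_x = posterior.min(axis=0)             # min[P(w1|x), P(w2|x)]

dx = x[1] - x[0]
bayes_error = np.sum(p_error_given_x * evidence) * dx   # approximates the integral
print(f"Bayes error P(error) ~ {bayes_error:.4f}")
```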
[Figure: Bayes Decision Rule]
BAYESIAN DECISION THEORY
EVIDENCE

The evidence, p(x), is a scale factor that ensures the posterior probabilities sum to 1:
  P(ω1|x) + P(ω2|x) = 1
We can eliminate this scale factor, since it appears on both sides of the comparison:
  Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2)
Special cases:
  If p(x|ω1) = p(x|ω2): x gives us no useful information; the decision rests entirely on the priors.
  If P(ω1) = P(ω2): the decision is based entirely on the likelihood, p(x|ωj).
CONTINUOUS FEATURES
GENERALIZATION OF THE TWO-CLASS PROBLEM

Generalization of the preceding ideas:
  Use more than one feature (e.g., length and lightness).
  Use more than two states of nature (e.g., N-way classification).
  Allow actions other than deciding on the state of nature (e.g., rejection: refusing to take an action when the alternatives are close or confidence is low).
  Introduce a loss function that is more general than the probability of error (e.g., errors are not equally costly).
CONTINUOUS FEATURES
LOSS FUNCTION

Let {ω1, ω2, ..., ωc} be the set of c categories (states of nature).
Let {α1, α2, ..., αa} be the set of a possible actions.
Let λ(αi|ωj) be the loss incurred for taking action αi when the state of nature is ωj.
CONTINUOUS FEATURES
LOSS FUNCTION (CONTINUED)

λ(αi|ωj) is the loss incurred for taking action αi when the state of nature is ωj.
The posterior, P(ωj|x), can be computed from Bayes formula:

$$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}$$

where the evidence is:

$$p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$$

The expected loss from taking action αi is:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$
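A minimal sketch of the conditional-risk computation; the loss matrix and the posterior vector are made-up numbers used only to show the sum:

```python
import numpy as np

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) P(omega_j | x).
# The loss matrix and posteriors are illustrative values only.
loss = np.array([[0.0, 2.0],      # lambda(alpha_1 | omega_1), lambda(alpha_1 | omega_2)
                 [1.0, 0.0]])     # lambda(alpha_2 | omega_1), lambda(alpha_2 | omega_2)
posterior = np.array([0.7, 0.3])  # P(omega_1 | x), P(omega_2 | x) for some observed x

cond_risk = loss @ posterior      # R(alpha_i | x) for i = 1, 2
print(cond_risk)                  # -> [0.6 0.7]
```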
CONTINUOUS FEATURES
BAYES RISK

An expected loss is called a risk; R(αi|x) is called the conditional risk.
A general decision rule is a function α(x) that tells us which action to take for every possible observation.
The overall risk is given by:

$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

If we choose α(x) so that R(α(x)|x) is as small as possible for every x, the overall risk is minimized.
So: compute the conditional risk for every action α and select the action that minimizes R(αi|x). The resulting minimum overall risk is denoted R* and is referred to as the Bayes risk.
The Bayes risk is the best performance that can be achieved.
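Putting the pieces together, the sketch below applies this rule (choose the action with the smallest conditional risk) at every point of a grid and integrates to approximate R*. The densities, priors, and loss matrix are the same illustrative assumptions used earlier.

```python
import numpy as np

# Bayes decision rule with a general loss: at each x, take the action with the
# smallest conditional risk, then integrate to approximate the Bayes risk R*.
# Densities, priors, and losses are illustrative assumptions.
def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

prior = np.array([2/3, 1/3])                        # P(w1), P(w2)
loss = np.array([[0.0, 2.0],                        # lambda(alpha_1 | w1), lambda(alpha_1 | w2)
                 [1.0, 0.0]])                       # lambda(alpha_2 | w1), lambda(alpha_2 | w2)

x = np.linspace(0.0, 30.0, 3001)
lik = np.vstack([gaussian_pdf(x, 13.0, 1.5), gaussian_pdf(x, 17.0, 2.0)])
evidence = prior @ lik                              # p(x)
posterior = prior[:, None] * lik / evidence         # P(w_j | x)

cond_risk = loss @ posterior                        # R(alpha_i | x) at every grid point
best_action = cond_risk.argmin(axis=0)              # Bayes rule alpha(x): 0 -> alpha_1, 1 -> alpha_2

dx = x[1] - x[0]
bayes_risk = np.sum(cond_risk.min(axis=0) * evidence) * dx   # R* = integral of R(alpha(x)|x) p(x) dx
print(f"Bayes risk R* ~ {bayes_risk:.4f}")
print("decision switches near x =", x[np.argmax(best_action)])
```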
CONTINUOUS FEATURES
TWO-CATEGORY CLASSIFICATION

Let α1 correspond to deciding ω1, α2 to deciding ω2, and let λij = λ(αi|ωj).
The conditional risks are:

$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$$

Our decision rule is: choose ω1 if R(α1|x) < R(α2|x); otherwise decide ω2.
This results in the equivalent rule:
choose ω1 if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x); otherwise decide ω2.
If the loss incurred for making an error is greater than that incurred for being correct, the factors (λ21 − λ11) and (λ12 − λ22) are positive, and the ratio of these factors simply scales the posteriors.
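A one-line check of this rule with explicit loss differences; all numbers are illustrative:

```python
# Two-category decision with general losses: choose omega_1 if
# (lambda_21 - lambda_11) P(omega_1|x) > (lambda_12 - lambda_22) P(omega_2|x).
# All values below are illustrative.
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
post_w1, post_w2 = 0.7, 0.3                         # P(omega_1|x), P(omega_2|x)

if (lam21 - lam11) * post_w1 > (lam12 - lam22) * post_w2:
    print("decide omega_1")
else:
    print("decide omega_2")
```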
CONTINUOUS FEATURES
LIKELIHOOD

By employing Bayes formula, we can replace the posteriors with the prior probabilities and conditional densities:
choose ω1 if (λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2); otherwise decide ω2.
If (λ21 − λ11) is positive, our rule becomes:

$$\text{choose } \omega_1 \text{ if: } \quad \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

If the loss factors are identical and the prior probabilities are equal, this reduces to a standard likelihood ratio test:

$$\text{choose } \omega_1 \text{ if: } \quad \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > 1$$
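The same rule written as a likelihood ratio test in code; the loss values, priors, and density values at the observed x are illustrative:

```python
# Likelihood ratio test: choose omega_1 if
#   p(x|omega_1) / p(x|omega_2) > [(lam12 - lam22) / (lam21 - lam11)] * P(omega_2) / P(omega_1)
# All values below are illustrative assumptions.
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
prior_w1, prior_w2 = 2/3, 1/3
px_w1, px_w2 = 0.21, 0.06          # assumed p(x|omega_1), p(x|omega_2) at the observed x

threshold = ((lam12 - lam22) / (lam21 - lam11)) * (prior_w2 / prior_w1)
decision = "omega_1" if px_w1 / px_w2 > threshold else "omega_2"
print(f"likelihood ratio = {px_w1 / px_w2:.2f}, threshold = {threshold:.2f}, decide {decision}")
```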
MINIMUM ERROR RATE CLASSIFICATION

Consider a symmetrical or zero-one loss function:

$$\lambda(\alpha_i \mid \omega_j) =
\begin{cases}
0 & i = j \\
1 & i \ne j
\end{cases}
\qquad i, j = 1, \dots, c$$

The conditional risk is:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) = \sum_{j \ne i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$$

The conditional risk is the average probability of error.
To minimize error, maximize P(ωi|x); this is also known as maximum a posteriori (MAP) decoding.
Minimum error rate classification: choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
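A small check that, under the zero-one loss, minimising the conditional risk is the same as maximising the posterior; the posterior vector is an assumed example:

```python
import numpy as np

# Zero-one loss: lambda(alpha_i | omega_j) = 0 if i == j else 1.
# Minimising the conditional risk is then the same as maximising the posterior (MAP).
c = 3
loss = np.ones((c, c)) - np.eye(c)                 # zero-one loss matrix
posterior = np.array([0.2, 0.5, 0.3])              # assumed P(omega_j | x), sums to 1

cond_risk = loss @ posterior                       # R(alpha_i | x) = 1 - P(omega_i | x)
print(cond_risk)                                   # -> [0.8 0.5 0.7]
print(cond_risk.argmin() == posterior.argmax())    # True: min risk == max posterior
```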

CLASSIFIERS, DISCRIMINANT FUNCTIONS AND DECISION SURFACES

The multi-category case:
Define a set of discriminant functions gi(x), i = 1, ..., c.
The classifier assigns a feature vector x to class ωi if:
  gi(x) > gj(x) for all j ≠ i
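A minimal sketch of such a classifier, using gi(x) = p(x|ωi) P(ωi) with assumed Gaussian densities and priors (all parameters are illustrative):

```python
import numpy as np

# Multi-category classification with discriminant functions:
# assign x to omega_i if g_i(x) > g_j(x) for all j != i.
# Here g_i(x) = p(x|omega_i) P(omega_i); densities and priors are assumed.
def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

means, stds = [10.0, 14.0, 18.0], [1.5, 1.5, 2.0]   # one assumed Gaussian per category
priors = [0.5, 0.3, 0.2]

def classify(x):
    g = [gaussian_pdf(x, m, s) * p for m, s, p in zip(means, stds, priors)]
    return int(np.argmax(g)) + 1                    # index i of the largest discriminant

print(classify(13.2))    # category whose discriminant is largest at x = 13.2
```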
Let gi(x) = −R(αi|x)
(the maximum discriminant then corresponds to the minimum risk).

For minimum error rate classification, we take gi(x) = P(ωi|x)
(the maximum discriminant then corresponds to the maximum posterior).

Any choice of gi that preserves the ordering gives the same decisions, for example:
  gi(x) = p(x|ωi) P(ωi)
  gi(x) = ln p(x|ωi) + ln P(ωi)
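For Gaussian class-conditional densities the logarithmic form is convenient because it removes the exponential; a sketch under assumed parameters:

```python
import math

# Log discriminant g_i(x) = ln p(x|omega_i) + ln P(omega_i) for a 1-D Gaussian
# class-conditional density. All parameters are assumed for illustration.
def log_gaussian_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def g(x, mean, std, prior):
    return log_gaussian_pdf(x, mean, std) + math.log(prior)

x = 14.0
g1 = g(x, mean=13.0, std=1.5, prior=2/3)
g2 = g(x, mean=17.0, std=2.0, prior=1/3)
print("decide omega_1" if g1 > g2 else "decide omega_2")
```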

The feature space is divided into c decision regions:
  if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi).

The two-category case:
A classifier for two categories is called a dichotomizer; it uses two discriminant functions g1 and g2.
Let g(x) ≡ g1(x) − g2(x).
Decide ω1 if g(x) > 0; otherwise decide ω2.

The computation of g(x) can use either of two equivalent discriminants:

$$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$$

$$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
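Putting the two-category pieces together: a dichotomizer that computes g(x) as the sum of the log-likelihood ratio and the log-prior ratio, and decides ω1 when g(x) > 0. Densities and priors are assumed, as in the earlier sketches.

```python
import math

# Dichotomizer: g(x) = ln[p(x|omega_1)/p(x|omega_2)] + ln[P(omega_1)/P(omega_2)];
# decide omega_1 if g(x) > 0, otherwise omega_2. Parameters are assumed.
def log_gaussian_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def g(x, prior_w1=2/3, prior_w2=1/3):
    log_likelihood_ratio = log_gaussian_pdf(x, 13.0, 1.5) - log_gaussian_pdf(x, 17.0, 2.0)
    return log_likelihood_ratio + math.log(prior_w1 / prior_w2)

for x in (12.0, 16.5):
    print(x, "-> omega_1" if g(x) > 0 else "-> omega_2")
```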