Unit-II Bayesian Decision Theory
For a random variable taking values x ∈ X, the probability satisfies P(x) ≥ 0.

Probability mass functions are typically used for discrete random variables, while densities describe continuous random variables (the latter must be integrated).
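A minimal sketch of this distinction in code; the fair-die mass function and the standard normal density below are assumed examples, not taken from these notes:

```python
import math

# Discrete random variable: probabilities are summed.
pmf = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}   # assumed fair-die PMF
print(round(sum(pmf.values()), 3))            # 1.0 -> P(x) >= 0 and sums to one
print(round(pmf[2] + pmf[4] + pmf[6], 3))     # P(x is even) = 0.5

# Continuous random variable: the density must be integrated.
def p(x):
    """Standard normal density (assumed example)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Riemann-sum approximation of P(-1 <= x <= 1) = integral of p(x) dx
dx = 1e-4
prob = sum(p(-1 + i * dx) * dx for i in range(int(2 / dx)))
print(round(prob, 3))                         # ~0.683
```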
BAYESIAN DECISION THEORY
BAYES FORMULA

Suppose we know both P(ωj) and p(x | ωj), and we can measure x. How does this influence our decision?

The joint probability of finding a pattern that is in category ωj and that this pattern has feature value x is:

p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)

Rearranging terms, we arrive at Bayes formula:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

where, in the case of two categories:

p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)
BAYESIAN DECISION THEORY
BAYES FORMULA

Bayes formula:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

can be expressed in words as:

posterior = (likelihood × prior) / evidence

By measuring x, we can convert the prior probability, P(ωj), into a posterior probability, P(ωj | x).

Evidence can be viewed as a scale factor and is often ignored in optimization applications (e.g., speech recognition).
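A minimal sketch of this computation in code; the likelihood and prior values below are arbitrary placeholders, not taken from these notes:

```python
def posterior(likelihoods, priors):
    """Bayes formula: P(w_j | x) = p(x | w_j) P(w_j) / p(x).

    likelihoods[j] = p(x | w_j) evaluated at the measured x,
    priors[j]      = P(w_j).
    The evidence p(x) is the normalizing sum over all categories.
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical two-category example: likelihood values at some measured x.
print(posterior(likelihoods=[0.30, 0.10], priors=[2/3, 1/3]))
# -> [0.857..., 0.142...]; the posteriors always sum to 1.
```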
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
Two-class fish sorting problem (P(ω1) = 2/3, P(ω2) = 1/3):

For every value of x, the posteriors sum to 1.0.

At x = 14, the probability that the pattern is in category ω2 is 0.08, and for category ω1 it is 0.92.
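A small numerical sketch of these posteriors: the priors match the fish example, but the Gaussian class-conditional densities below are assumptions made only for illustration (the 0.92/0.08 values above come from the course's own densities, which are not reproduced here):

```python
import math

def gauss(x, mu, sigma):
    """Univariate Gaussian density (assumed class-conditional form)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Priors from the fish-sorting example; the densities are hypothetical.
priors = (2/3, 1/3)
densities = (lambda x: gauss(x, mu=11.0, sigma=2.0),   # hypothetical p(x | w1)
             lambda x: gauss(x, mu=15.0, sigma=2.0))   # hypothetical p(x | w2)

x = 14.0
joint = [d(x) * P for d, P in zip(densities, priors)]  # p(x | wj) P(wj)
evidence = sum(joint)                                  # p(x)
post = [j / evidence for j in joint]                   # P(wj | x)
print(post, round(sum(post), 6))                       # posteriors sum to 1.0
```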
BAYESIAN DECISION THEORY
BAYES DECISION RULE

Decision rule:
For an observation x, decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise, decide ω2.

Probability of error:

P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

The average probability of error is given by:

P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx

Under the Bayes decision rule:

P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]

If for every x we ensure that P(error | x) is as small as possible, then the integral is as small as possible. Thus, the Bayes decision rule minimizes P(error).
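A minimal numerical sketch of this rule and of the error integral; the Gaussian class-conditional densities and the integration grid are assumptions for illustration, not from these notes:

```python
import math

def gauss(x, mu, sigma):
    """Univariate Gaussian density (assumed class-conditional form)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed two-class setup (illustrative only).
priors = (2/3, 1/3)
p1 = lambda x: gauss(x, 11.0, 2.0)        # hypothetical p(x | w1)
p2 = lambda x: gauss(x, 15.0, 2.0)        # hypothetical p(x | w2)

def decide(x):
    """Bayes rule: decide w1 iff P(w1 | x) > P(w2 | x) (the evidence cancels)."""
    return "w1" if p1(x) * priors[0] > p2(x) * priors[1] else "w2"

# P(error) = integral of P(error | x) p(x) dx, and under the Bayes rule
# P(error | x) p(x) = min[ p(x | w1) P(w1), p(x | w2) P(w2) ].
dx, p_error, x = 0.01, 0.0, 0.0
while x < 26.0:                            # grid wide enough to cover both densities
    p_error += min(p1(x) * priors[0], p2(x) * priors[1]) * dx
    x += dx

print(decide(14.0), round(p_error, 4))     # decision at x = 14 and the Bayes error
```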
Bayes Decision Rule
BAYESIAN DECISION THEORY
EVIDENCE
The evidence, p(x), is a scale factor that assures the conditional probabilities sum to 1:

P(ω1 | x) + P(ω2 | x) = 1

We can eliminate the scale factor (which appears on both sides of the equation):

Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2)

Special cases:
If p(x | ω1) = p(x | ω2): x gives us no useful information; the decision rests entirely on the priors.
If P(ω1) = P(ω2): the decision is based entirely on the likelihood, p(x | ωj).
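A tiny sketch of these two special cases; all numbers below are arbitrary placeholders:

```python
def decide(lik1, lik2, prior1, prior2):
    """Decide w1 iff p(x|w1) P(w1) > p(x|w2) P(w2); the evidence p(x) cancels."""
    return "w1" if lik1 * prior1 > lik2 * prior2 else "w2"

# Special case 1: equal likelihoods -> x carries no information, the priors decide.
print(decide(lik1=0.2, lik2=0.2, prior1=2/3, prior2=1/3))    # w1 (larger prior)

# Special case 2: equal priors -> the likelihoods alone decide.
print(decide(lik1=0.1, lik2=0.4, prior1=0.5, prior2=0.5))    # w2 (larger likelihood)
```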
CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM

Generalization of the preceding ideas:

Use of more than one feature (e.g., length and lightness)

Use of more than two states of nature (e.g., N-way classification)

Allowing actions other than merely deciding the state of nature (e.g., rejection: refusing to take an action when the alternatives are close or confidence is low)

Introducing a loss function that is more general than the probability of error (e.g., errors are not equally costly)
CONTINUOUS FEATURES
LOSS FUNCTION

Let {ω1, ω2, ..., ωc} be the set of c categories.

Let {α1, α2, ..., αa} be the set of a possible actions.

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
CONTINUOUS FEATURES
LOSS FUNCTION
λ(αi | ωj) is the loss incurred for taking action αi when the state of nature is ωj.

The posterior, P(ωj | x), can be computed from Bayes formula:

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

where the evidence is:

p(x) = Σ_{j=1}^{c} p(x | ωj) P(ωj)

The expected loss from taking action αi is:

R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
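A minimal sketch of this expected-loss computation; the loss matrix and posterior values below are hypothetical placeholders:

```python
def conditional_risk(loss, posteriors):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), for each action a_i.

    loss[i][j]    = loss for taking action a_i when the true state is w_j,
    posteriors[j] = P(w_j | x).
    """
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

# Hypothetical 2-action / 2-category loss matrix and posteriors at some x.
loss = [[0.0, 5.0],      # lambda(a1 | w1), lambda(a1 | w2)
        [1.0, 0.0]]      # lambda(a2 | w1), lambda(a2 | w2)
posteriors = [0.7, 0.3]

risks = conditional_risk(loss, posteriors)
print(risks)                                  # [1.5, 0.7]
best = min(range(len(risks)), key=risks.__getitem__)
print("take action a%d" % (best + 1))         # minimum-risk (Bayes) action: a2
```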
CONTINUOUS FEATURES
BAYES RISK
An expected loss is called a risk.
R(αi | x) is called the conditional risk.

A general decision rule is a function α(x) that tells us which action to take for every possible observation.

The overall risk is given by:

R = ∫ R(α(x) | x) p(x) dx

If we choose α(x) so that R(α(x) | x) is as small as possible for every x, the overall risk will be minimized.

Compute the conditional risk for every action α and select the action that minimizes R(αi | x). The resulting minimum overall risk is denoted R*, and is referred to as the Bayes risk.
The Bayes risk is the best performance that can be achieved.
CONTINUOUS FEATURES
TWO-CATEGORY CLASSIFICATION
Let α1 correspond to ω1, α2 to ω2, and λij = λ(αi | ωj).

The conditional risk is given by:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Our decision rule is:

choose ω1 if R(α1 | x) < R(α2 | x); otherwise decide ω2

This results in the equivalent rule:

choose ω1 if (λ21 - λ11) P(ω1 | x) > (λ12 - λ22) P(ω2 | x); otherwise decide ω2

If the loss incurred for making an error is greater than that incurred for being correct, the factors (λ21 - λ11) and (λ12 - λ22) are positive, and the ratio of these factors simply scales the posteriors.
CONTINUOUS FEATURES
LIKELIHOOD
By employing Bayes formula, we can replace the posteriors by the prior probabilities and the conditional densities:

choose ω1 if (λ21 - λ11) p(x | ω1) P(ω1) > (λ12 - λ22) p(x | ω2) P(ω2); otherwise decide ω2

If λ21 - λ11 is positive, our rule becomes:

choose ω1 if:  p(x | ω1) / p(x | ω2)  >  [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)]

If the loss factors are identical and the prior probabilities are equal, this reduces to a standard likelihood ratio test:

choose ω1 if:  p(x | ω1) / p(x | ω2)  >  1
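A minimal sketch of this threshold test; the likelihood values, priors, and losses below are arbitrary placeholders:

```python
def decide_w1(lik1, lik2, prior1, prior2, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Likelihood-ratio test: decide w1 iff
    p(x|w1)/p(x|w2) > [(l12 - l22)/(l21 - l11)] * P(w2)/P(w1),
    assuming l21 - l11 > 0.
    """
    threshold = (l12 - l22) / (l21 - l11) * (prior2 / prior1)
    return lik1 / lik2 > threshold

# Zero-one losses and equal priors: the threshold reduces to 1.
print(decide_w1(lik1=0.3, lik2=0.2, prior1=0.5, prior2=0.5))         # True
# Asymmetric losses (errors on w2 are 4x as costly) raise the threshold to 4.
print(decide_w1(lik1=0.3, lik2=0.2, prior1=0.5, prior2=0.5, l12=4))  # False
```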
MINIMUM ERROR RATE CLASSIFICATION

Consider a symmetrical, or zero-one, loss function:

λ(αi | ωj) = 0 if i = j, 1 if i ≠ j,    i, j = 1, 2, ..., c

The conditional risk is:

R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 - P(ωi | x)

The conditional risk is the average probability of error.

To minimize error, maximize P(ωi | x); this is also known as maximum a posteriori (MAP) decoding.

Minimum error rate classification:

choose ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
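A minimal sketch of the MAP rule; the posterior values below are hypothetical placeholders:

```python
def map_decision(posteriors):
    """Minimum-error-rate rule: choose the w_i with the largest P(w_i | x)."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best + 1                        # 1-based category index

# Hypothetical posteriors for a 3-category problem at some measured x.
print(map_decision([0.2, 0.5, 0.3]))       # -> category 2

# Under the zero-one loss, R(a_i | x) = 1 - P(w_i | x), so maximizing the
# posterior is the same as minimizing the conditional risk.
print([round(1 - p, 2) for p in [0.2, 0.5, 0.3]])    # [0.8, 0.5, 0.7]
```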
Classifiers, Discriminant Functions
and Decision Surfaces
The multi-category case:

Set of discriminant functions gi(x), i = 1, ..., c

The classifier assigns a feature vector x to class ωi if:

gi(x) > gj(x) for all j ≠ i
Let gi(x) = - R(αi | x)
(maximum discriminant corresponds to minimum risk!)

For the minimum error rate, we take gi(x) = P(ωi | x)
(maximum discriminant corresponds to maximum posterior!)

Equivalent discriminants (a monotonically increasing transformation does not change the decision):

gi(x) = p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)
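A minimal sketch of the log-discriminant form; the Gaussian class-conditional densities, means, and priors below are assumptions made only for illustration:

```python
import math

def log_gauss(x, mu, sigma):
    """ln of a univariate Gaussian density (assumed class-conditional form)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Hypothetical 3-category setup: (mean, std, prior) per class.
classes = [(2.0, 1.0, 0.5), (5.0, 1.5, 0.3), (9.0, 2.0, 0.2)]

def g(x, i):
    """g_i(x) = ln p(x | w_i) + ln P(w_i)."""
    mu, sigma, prior = classes[i]
    return log_gauss(x, mu, sigma) + math.log(prior)

x = 4.0
scores = [g(x, i) for i in range(len(classes))]
print(scores.index(max(scores)) + 1)       # class with the largest discriminant
```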
The feature space is divided into c decision regions:

if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi)

The two-category case:

A classifier is a dichotomizer that has two discriminant functions g1 and g2.

Let g(x) ≡ g1(x) - g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2.
The computation of g(x):

g(x) = P(ω1 | x) - P(ω2 | x)

g(x) = ln [ p(x | ω1) / p(x | ω2) ] + ln [ P(ω1) / P(ω2) ]
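A minimal sketch of this two-category dichotomizer, assuming illustrative Gaussian class-conditional densities together with the fish-example priors (the densities are placeholders, not from these notes):

```python
import math

# Assumed two-category setup (illustrative only).
priors = (2/3, 1/3)
params = ((11.0, 2.0), (15.0, 2.0))        # hypothetical (mean, std) for w1, w2

def log_density(x, mu, sigma):
    """ln of a univariate Gaussian density (assumed class-conditional form)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def g(x):
    """Dichotomizer: g(x) = ln[p(x|w1)/p(x|w2)] + ln[P(w1)/P(w2)]."""
    return (log_density(x, *params[0]) - log_density(x, *params[1])
            + math.log(priors[0] / priors[1]))

for x in (10.0, 14.0):
    print(x, "-> w1" if g(x) > 0 else "-> w2")   # decide w1 iff g(x) > 0
```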