
Lecture 5

Bayesian Decision Theory


A fundamental statistical approach to quantifying the tradeoffs
between various decisions using the probabilities and costs that
accompany such decisions. Reasoning is based on Bayes Rule.
Discriminant Functions for the Gaussian Density / Quadratic
Classifiers
Apply the results of Bayesian Decision Theory to derive the
discriminant functions for the case of Gaussian class-conditional
probabilities.
Bayesian Decision Theory
The Likelihood Ratio Test
Likelihood Ratio Test
Want to classify an object based on the evidence provided by a
measurement (a feature vector) x.
A reasonable decision rule would be: choose the class that is most
probable given x. Mathematically, choose class ω_i such that

P(ω_i | x) ≥ P(ω_j | x)   for all j = 1, ..., C
Consider the decision rule for a 2-class problem:

Class(x) = ω_1  if P(ω_1 | x) > P(ω_2 | x)
           ω_2  if P(ω_1 | x) < P(ω_2 | x)
Likelihood Ratio Test
Choose class ω_1 if P(ω_1 | x) > P(ω_2 | x). Equivalently:

P(x | ω_1) P(ω_1) / P(x) > P(x | ω_2) P(ω_2) / P(x)        (Bayes Rule)
P(x | ω_1) P(ω_1) > P(x | ω_2) P(ω_2)                      (eliminate P(x) > 0)
P(x | ω_1) / P(x | ω_2) > P(ω_2) / P(ω_1)                  (as P(ω_i) > 0)

Let

Λ(x) = P(x | ω_1) / P(x | ω_2)        (the likelihood ratio)

then

Likelihood Ratio Test:
Class(x) = ω_1  if Λ(x) > P(ω_2) / P(ω_1)
           ω_2  if Λ(x) < P(ω_2) / P(ω_1)
An example
Derive a decision rule for the 2-class problem based on the Likelihood
Ratio Test assuming equal priors and class-conditional densities:

P(x | ω_1) = (1/√(2π)) exp(-(x - 4)²/2),    P(x | ω_2) = (1/√(2π)) exp(-(x - 10)²/2)

Solution: Substitute the likelihoods and priors into the expressions in the LRT:

Λ(x) = [(√(2π))⁻¹ exp(-.5(x - 4)²)] / [(√(2π))⁻¹ exp(-.5(x - 10)²)],    P(ω_2)/P(ω_1) = .5/.5 = 1

Choose class ω_1 if:

Λ(x) > 1
exp(-.5(x - 4)²) > exp(-.5(x - 10)²)
(x - 4)² < (x - 10)²        (taking logs and changing signs)
x < 7

The LRT decision rule is:

Class(x) = ω_1  if x < 7
           ω_2  if x > 7
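As a quick check, here is a minimal Python sketch of this two-class LRT (function names are illustrative, not from the lecture); it evaluates the likelihood ratio of the two unit-variance Gaussians above and compares it with the prior ratio, reproducing the x < 7 boundary for equal priors.

import math

def gaussian(x, mu, var=1.0):
    # Univariate normal density N(mu, var) evaluated at x
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def lrt_classify(x, p1=0.5, p2=0.5):
    # Choose class 1 if Lambda(x) = P(x|w1)/P(x|w2) exceeds P(w2)/P(w1)
    lam = gaussian(x, 4.0) / gaussian(x, 10.0)
    return 1 if lam > p2 / p1 else 2

print(lrt_classify(6.9), lrt_classify(7.1))   # prints: 1 2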
This LRT result makes sense intuitively: the two likelihoods are identical except for their mean values, so the decision boundary x = 7 falls halfway between the two means.
[Figure from the slides: the class-conditional densities P(x|ω_1) and P(x|ω_2), centered at x = 4 and x = 10, with decision regions R_1 (say ω_1) and R_2 (say ω_2).]
How would the LRT decision rule change if the priors were such that P(ω_1) = 2P(ω_2)?
Bayesian Decision Theory
The Likelihood Ratio Test
The Probability of Error
Probability of Error
Performance of a decision rule is measured by its probability of error:
P(error) = Σ_{i=1}^{C} P(error | ω_i) P(ω_i)

The class-conditional probability of error is:

P(error | ω_i) = Σ_{j≠i} P(choose ω_j | ω_i) = Σ_{j≠i} ∫_{R_j} P(x | ω_i) dx

where R_j = {x : Class(x) = ω_j}.
For the 2-class problem:

P(error) = P(ω_1) ∫_{R_2} P(x | ω_1) dx + P(ω_2) ∫_{R_1} P(x | ω_2) dx = P(ω_1) ε_1 + P(ω_2) ε_2

where ε_1 is the integral of the likelihood P(x | ω_1) over the region R_2 where ω_2 is chosen, and similarly for ε_2.

Back to the Example
For the decision rule of the previous example, the integrals ε_1 and ε_2 are depicted below.
Since we assumed equal priors,

P(error) = .5(ε_1 + ε_2)
[Figure from the slides: the densities P(x|ω_1) and P(x|ω_2) with the threshold at x = 7; ε_1 is the tail of P(x|ω_1) falling in R_2 (x > 7) and ε_2 is the tail of P(x|ω_2) falling in R_1 (x < 7).]
Write out the expression for P(error) for this example.
ε_1 = (2π)^(-1/2) ∫_{7}^{∞} exp(-.5(x - 4)²) dx,    ε_2 = (2π)^(-1/2) ∫_{-∞}^{7} exp(-.5(x - 10)²) dx
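These tail integrals are standard normal tail probabilities, so they can be evaluated with the error function; a small sketch using only the Python standard library (values are approximate):

import math

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# eps1 = P(x > 7 | w1) with x|w1 ~ N(4, 1);  eps2 = P(x < 7 | w2) with x|w2 ~ N(10, 1)
eps1 = 1.0 - normal_cdf((7 - 4) / 1.0)
eps2 = normal_cdf((7 - 10) / 1.0)
p_error = 0.5 * (eps1 + eps2)
print(eps1, eps2, p_error)   # both tails equal Phi(-3), about 0.00135, so P(error) is about 0.00135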
Probability of Error
Thinking about the 2-class problem, not all decision rules are equally good with respect to minimizing P(error). For our example consider this (silly) rule:

Class(x) = ω_1  if x < 100
           ω_2  if x > 100

For this (silly) rule ε_1 ≈ 0 but ε_2 ≈ 1, so P(error) ≈ .5, which is much more than the error for the rule defined by the likelihood ratio test. In fact:

Bayes Error Rate: For any given problem, the minimum probability of error is achieved by the Likelihood Ratio Test decision rule. This probability of error is called the Bayes Error Rate and is the BEST any classifier can do.
Bayesian Decision Theory
The Likelihood Ratio Test
The Probability of Error
Bayes Risk
Bayes Risk
So far we have assumed that the penalty for misclassifying a class ω_1 example as class ω_2 is the same as that for misclassifying a class ω_2 example as class ω_1. But consider misclassifying
  a faulty airplane as a safe airplane (puts people's lives in danger)
  a safe airplane as a faulty airplane (costs the airline company money)
We can formalize this concept in terms of a cost function C_ij.
Let C_ij denote the cost of choosing class ω_i when ω_j is the true class.
The Bayes Risk is the expected value of the cost:

E[C] = Σ_{i=1}^{2} Σ_{j=1}^{2} C_ij P(decide ω_i, ω_j true class) = Σ_{i=1}^{2} Σ_{j=1}^{2} C_ij P(x ∈ R_i | ω_j) P(ω_j)
Bayes Risk
What is the decision rule that minimizes the Bayes Risk?
First note: P(x ∈ R_i | ω_j) = ∫_{R_i} P(x | ω_j) dx
Then the Bayes Risk is:

E[C] = ∫_{R_1} [C_11 P(ω_1) P(x | ω_1) + C_12 P(ω_2) P(x | ω_2)] dx
     + ∫_{R_2} [C_21 P(ω_1) P(x | ω_1) + C_22 P(ω_2) P(x | ω_2)] dx

Now remember that ∫_{R_1} P(x | ω_j) dx + ∫_{R_2} P(x | ω_j) dx = ∫_{R_1 ∪ R_2} P(x | ω_j) dx = 1.
Adding and subtracting the same quantity A = C_21 P(ω_1) ∫_{R_1} P(x|ω_1) dx + C_22 P(ω_2) ∫_{R_1} P(x|ω_2) dx:

E[C] = C_11 P(ω_1) ∫_{R_1} P(x|ω_1) dx + C_12 P(ω_2) ∫_{R_1} P(x|ω_2) dx
     + C_21 P(ω_1) ∫_{R_2} P(x|ω_1) dx + C_22 P(ω_2) ∫_{R_2} P(x|ω_2) dx
     + C_21 P(ω_1) ∫_{R_1} P(x|ω_1) dx + C_22 P(ω_2) ∫_{R_1} P(x|ω_2) dx      (+A)
     - C_21 P(ω_1) ∫_{R_1} P(x|ω_1) dx - C_22 P(ω_2) ∫_{R_1} P(x|ω_2) dx      (-A)

   = C_21 P(ω_1) ∫_{R_1 ∪ R_2} P(x|ω_1) dx + C_22 P(ω_2) ∫_{R_1 ∪ R_2} P(x|ω_2) dx
     + (C_12 - C_22) P(ω_2) ∫_{R_1} P(x|ω_2) dx - (C_21 - C_11) P(ω_1) ∫_{R_1} P(x|ω_1) dx

   = C_21 P(ω_1) + C_22 P(ω_2)
     + (C_12 - C_22) P(ω_2) ∫_{R_1} P(x|ω_2) dx - (C_21 - C_11) P(ω_1) ∫_{R_1} P(x|ω_1) dx
We want to find the region R_1 that minimizes the Bayes Risk. From the expression above, the first two terms of E[C] are constant with respect to R_1. Thus the optimal region is:

R_1* = arg min_{R_1} ∫_{R_1} [(C_12 - C_22) P(ω_2) P(x|ω_2) - (C_21 - C_11) P(ω_1) P(x|ω_1)] dx
     = arg min_{R_1} ∫_{R_1} g(x) dx
Note we are assuming C_21 > C_11 and C_12 > C_22, that is, the cost of a misclassification is higher than the cost of a correct classification. Thus:

(C_12 - C_22) > 0  AND  (C_21 - C_11) > 0
Bayes Risk (2)
Temporarily forget about the specific expression of g(x) and consider the type of decision region R_1* we are looking for: select the intervals that minimize the integral ∫_{R_1} g(x) dx, that is, the intervals where g(x) < 0.
Thus we will choose R_1* such that

(C_21 - C_11) P(ω_1) P(x|ω_1) > (C_12 - C_22) P(ω_2) P(x|ω_2)

Rearranging the terms yields:

P(x|ω_1) / P(x|ω_2) > [(C_12 - C_22) P(ω_2)] / [(C_21 - C_11) P(ω_1)]

Therefore we obtain the decision rule

A Likelihood Ratio Test:
Class(x) = ω_1  if P(x|ω_1)/P(x|ω_2) > [(C_12 - C_22) P(ω_2)] / [(C_21 - C_11) P(ω_1)]
           ω_2  if P(x|ω_1)/P(x|ω_2) < [(C_12 - C_22) P(ω_2)] / [(C_21 - C_11) P(ω_1)]
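This rule is a one-line comparison against a cost-weighted threshold; the following sketch (illustrative names, not from the lecture) implements it for arbitrary likelihood functions, priors and costs:

def bayes_risk_classify(x, lik1, lik2, p1, p2, c11=0.0, c12=1.0, c21=1.0, c22=0.0):
    # Two-class LRT minimizing the Bayes Risk.
    # lik1, lik2: callables returning P(x|w1) and P(x|w2); p1, p2: priors;
    # cij: cost of deciding class i when j is the true class.
    # Assumes c21 > c11 and c12 > c22 (misclassification costs more than a correct decision).
    threshold = (c12 - c22) * p2 / ((c21 - c11) * p1)
    return 1 if lik1(x) / lik2(x) > threshold else 2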
Bayes Risk: An Example
Consider the following 2-class classification problem. The likelihood functions for each class are:

P(x|ω_1) = (2π·3)^(-1/2) exp(-.5 x²/3),    P(x|ω_2) = (2π)^(-1/2) exp(-.5 (x - 2)²)
The priors are: P(ω_1) = P(ω_2) = .5
Define the (mis)classification costs as: C_11 = C_22 = 0, C_12 = 1, C_21 = √3
Problem: Determine the decision rule that minimizes the Bayes Risk.
Bayes Risk: An Example (2)
Solution:

Λ(x) = [(2π·3)^(-1/2) exp(-.5 x²/3)] / [(2π)^(-1/2) exp(-.5 (x - 2)²)]
     = 3^(-1/2) exp(-.5 x²/3) / exp(-.5 (x - 2)²)

Choose class ω_1 if Λ(x) > [.5 (1 - 0)] / [.5 (√3 - 0)] = 1/√3:

3^(-1/2) exp(-.5 x²/3) / exp(-.5 (x - 2)²) > 1/√3
exp(-.5 x²/3) > exp(-.5 (x - 2)²)
-.5 x²/3 > -.5 (x - 2)²            (taking logs)
2x² - 12x + 12 > 0
x² - 6x + 6 > 0
x < 1.27 or x > 4.73
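A quick numeric check of the thresholds above, assuming the densities, priors and costs as given (a sketch, not part of the lecture):

import math

def lik1(x):                        # P(x|w1): normal with mean 0, variance 3
    return math.exp(-0.5 * x * x / 3.0) / math.sqrt(2 * math.pi * 3.0)

def lik2(x):                        # P(x|w2): normal with mean 2, variance 1
    return math.exp(-0.5 * (x - 2.0) ** 2) / math.sqrt(2 * math.pi)

threshold = (1 - 0) * 0.5 / ((math.sqrt(3) - 0) * 0.5)      # (C12-C22)P(w2)/((C21-C11)P(w1)) = 1/sqrt(3)
roots = (3 - math.sqrt(3), 3 + math.sqrt(3))                # solutions of x^2 - 6x + 6 = 0
print(roots)                                                # approximately (1.27, 4.73)
# The likelihood ratio exceeds the threshold exactly outside the interval (1.27, 4.73):
for x in (0.0, 3.0, 6.0):
    print(x, lik1(x) / lik2(x) > threshold)                 # True, False, True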
Bayesian Decision Theory
The Likelihood Ratio Test
The Probability of Error
Bayes Risk
Bayes, MAP and ML Criteria
Variations of the LRT
The LRT decision rule minimizing the Bayes Risk is also known as the Bayes Criterion:

Bayes Criterion:  Class(x) = ω_1  if Λ(x) > [(C_12 - C_22) P(ω_2)] / [(C_21 - C_11) P(ω_1)]
                             ω_2  if Λ(x) < [(C_12 - C_22) P(ω_2)] / [(C_21 - C_11) P(ω_1)]

To minimize the probability of error, use the Bayes Criterion with zero-one costs (C_ij = 0 if i = j and 1 otherwise); this version of the LRT decision rule is referred to as the Maximum A Posteriori (MAP) Criterion.

MAP Criterion:  Class(x) = ω_1  if P(ω_1|x) > P(ω_2|x)
                           ω_2  if P(ω_1|x) < P(ω_2|x)

Finally, for the case of equal priors P(ω_i) and a zero-one cost function, the LRT decision rule is called the Maximum Likelihood (ML) Criterion, since it maximizes the likelihood P(x|ω_i).

ML Criterion:  Class(x) = ω_1  if P(x|ω_1) > P(x|ω_2)
                          ω_2  if P(x|ω_1) < P(x|ω_2)
Variations of the LRT (2)
Two more decision rules are commonly cited in the related literature.
The Neyman-Pearson Criterion, which also leads to an LRT decision rule: it fixes one class error probability, say ε_1 ≤ α, and seeks to minimize the other.
The Minimax Criterion, derived from the Bayes Criterion, which seeks to minimize the maximum Bayes Risk.
Bayesian Decision Theory
The Likelihood Ratio Test
The Probability of Error
Bayes Risk
Bayes, MAP and ML Criteria
Multi-class functions
Decision rules for multi-class problems
The decision rule minimizing P(error) generalizes to multi-class problems.
The derivation is easier if we express P(error) in terms of making a correct assignment:

P(error) = 1 - P(correct)

The probability of making a correct assignment is

P(correct) = Σ_{i=1}^{C} P(ω_i) ∫_{R_i} P(x|ω_i) dx
           = Σ_{i=1}^{C} ∫_{R_i} P(x|ω_i) P(ω_i) dx
           = Σ_{i=1}^{C} ∫_{R_i} P(ω_i|x) P(x) dx            (call the i-th integral T_i)

The problem of minimizing P(error) is equivalent to that of maximizing P(correct).
To maximize P(correct) we have to maximize each of the integrals T_i.
In turn, each integral T_i will be maximized by choosing the class ω_i that yields the maximum P(ω_i|x), so we define R_i to be the region where P(ω_i|x) is maximum.
Therefore, the decision rule that minimizes P(error) is the MAP Criterion.
Minimum Bayes Risk
Denote by α_i the decision to choose class ω_i, and define the overall decision rule as a function
α : x → {α_1, α_2, ..., α_C}  such that  α(x) = α_i if x is assigned to class ω_i.
The (conditional) risk R(α_i|x) of assigning x to class ω_i is

R(α_i|x) = Σ_{j=1}^{C} C_ij P(ω_j|x)

The Bayes Risk associated with the decision rule α(x) is

R(α(x)) = ∫ R(α(x)|x) P(x) dx

To minimize this expression we have to minimize the conditional risk R(α(x)|x) at each point x in the feature space, which is equivalent to choosing the ω_i for which R(α_i|x) is minimum.
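As a sketch (illustrative names, not from the lecture), the multi-class minimum-risk rule picks the decision with the smallest expected cost given the posteriors; with zero-one costs it reduces to the MAP rule:

import numpy as np

def min_risk_decision(posteriors, costs):
    # posteriors: length-C array with P(w_j|x); costs: CxC array with C_ij =
    # cost of deciding class i when j is the true class. Returns the 0-based
    # index of the class minimizing R(a_i|x) = sum_j C_ij P(w_j|x).
    risks = np.asarray(costs, float) @ np.asarray(posteriors, float)
    return int(np.argmin(risks))

posteriors = [0.2, 0.5, 0.3]
zero_one = 1.0 - np.eye(3)                      # zero-one costs
print(min_risk_decision(posteriors, zero_one))  # 1, i.e. the class with the largest posterior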
Bayesian Decision Theory
The Likelihood Ratio Test
The Probability of Error
Bayes Risk
Bayes, MAP and ML Criteria
Multi-class functions
Discriminant Functions
Discriminant Functions
All decision rules presented in this lecture have the same structure:
at each point x in feature space, choose the class ω_i which maximizes (or minimizes) some measure g_i(x).
Formally, there is a set of discriminant functions {g_i(x)}, i = 1, ..., C, and the following decision rule:

assign x to class ω_i  if  g_i(x) > g_j(x) for all j ≠ i

We can visualize the decision rule as a network or machine that computes C discriminant functions and selects the class corresponding to the largest one.
The three basic decision rules Bayes, MAP and ML in terms of discriminant functions:

Criterion   Discriminant Function
Bayes       g_i(x) = -R(α_i|x)
MAP         g_i(x) = P(ω_i|x)
ML          g_i(x) = P(x|ω_i)
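The "select max" machine is just an argmax over the discriminant values; a minimal sketch (names are illustrative):

def classify(x, discriminants):
    # discriminants: list of callables g_i(x), one per class.
    # Returns the 0-based index of the class with the largest discriminant.
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

print(classify(6.0, [lambda x: -(x - 4) ** 2, lambda x: -(x - 10) ** 2]))   # 0 (closer to 4)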
Discriminant Functions for the class of Gaussian Distributions
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Bayes classifiers for Normally distributed classes
The (MAP) decision rule minimizing the probability of error can be formulated as a family of discriminant functions:

Choose class ω_i if g_i(x) > g_j(x) for all j ≠ i, with g_i(x) = P(ω_i|x)

For classes that are normally distributed, this family can be reduced to very simple expressions.

General expression for Gaussian densities
The multivariate Normal density is defined as

f_X(x) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp(-1/2 (x - μ)^T Σ^{-1} (x - μ))

Using Bayes rule, the MAP discriminant function becomes

g_i(x) = P(x|ω_i) P(ω_i) / P(x)
       = 1 / ((2π)^{n/2} |Σ_i|^{1/2}) · exp(-1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i)) · P(ω_i) / P(x)

Eliminating constant terms:

g_i(x) = |Σ_i|^{-1/2} exp(-1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i)) P(ω_i)

Taking the log, since it is a monotonically increasing function:

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - 1/2 log(|Σ_i|) + log(P(ω_i))

This is the Quadratic Discriminant Function.
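A direct NumPy sketch of this quadratic discriminant (assuming a positive-definite covariance; names are illustrative, not from the lecture):

import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    # g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log P(w_i)
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    diff = x - mu
    sign, logdet = np.linalg.slogdet(sigma)        # log|Sigma|, assuming Sigma is positive definite
    maha = diff @ np.linalg.solve(sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * maha - 0.5 * logdet + np.log(prior)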
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 1: Σ_i = σ²I
The features are statistically independent, with the same variance for all classes.
The quadratic function becomes

g_i(x) = -1/2 (x - μ_i)^T (σ²I)^{-1} (x - μ_i) - 1/2 log(|σ²I|) + log(P(ω_i))
       = -1/(2σ²) (x - μ_i)^T (x - μ_i) - (N/2) log(σ²) + log(P(ω_i))
       = -1/(2σ²) (x - μ_i)^T (x - μ_i) + log(P(ω_i))            (dropping the second term, constant across classes)
       = -1/(2σ²) (x^T x - 2 μ_i^T x + μ_i^T μ_i) + log(P(ω_i))

Eliminate the term x^T x, as it is constant for all classes. Then

g_i(x) = 1/(2σ²) (2 μ_i^T x - μ_i^T μ_i) + log(P(ω_i)) = w_i^T x + w_i0

where

w_i = μ_i / σ²   and   w_i0 = -1/(2σ²) μ_i^T μ_i + log(P(ω_i))

As the discriminant is linear, the decision boundaries g_i(x) = g_j(x) will be hyper-planes.
If we also assume equal priors, this reduces to the minimum distance (nearest mean) classifier:

Minimum distance classifier: g_i(x) = -1/(2σ²) (x - μ_i)^T (x - μ_i)

Properties of the class-conditional probabilities:
the loci of constant probability for each class are hyper-spheres.
Case 1: Σ_i = σ²I
The decision boundaries are the hyperplanes g_i(x) = g_j(x), and (after some algebra) can be written as

w^T (x - x_0) = 0

where

w = μ_i - μ_j
x_0 = 1/2 (μ_i + μ_j) - σ² / ‖μ_i - μ_j‖² · ln[P(ω_i)/P(ω_j)] · (μ_i - μ_j)

The hyperplane separating R_i and R_j passes through the point x_0 and is orthogonal to the vector w.
Case 1: Σ_i = σ²I, Example
Compute the decision boundaries for a 3-class, 2D problem with equal priors and the following class-conditional parameters:

μ_1 = (3, 2)^T,   μ_2 = (7, 4)^T,   μ_3 = (2, 5)^T

Σ_1 = Σ_2 = Σ_3 = [ 2  0
                    0  2 ]
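Under the Case 1 assumptions of this example (equal priors, Σ_i = 2I), the classifier reduces to the nearest-mean rule, and each pairwise boundary is the perpendicular bisector of the two class means; a small sketch (illustrative names):

import numpy as np

means = np.array([[3.0, 2.0], [7.0, 4.0], [2.0, 5.0]])   # mu_1, mu_2, mu_3 from the example

def nearest_mean(x):
    # Minimum (Euclidean) distance classifier: valid here since Sigma_i = 2*I and priors are equal
    d2 = ((means - np.asarray(x, float)) ** 2).sum(axis=1)
    return int(np.argmin(d2)) + 1                          # classes numbered 1..3

# Pairwise boundary between classes i and j: w^T (x - x0) = 0 with w = mu_i - mu_j, x0 = (mu_i + mu_j)/2
w_12 = means[0] - means[1]
x0_12 = 0.5 * (means[0] + means[1])
print(nearest_mean([4.0, 2.5]), w_12, x0_12)               # class 1; boundary data for classes 1 and 2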
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ (Σ diagonal)
Case 2: Σ_i = Σ (Σ diagonal)
The classes have the same covariance matrix, but the features are allowed to have different variances.
The quadratic function becomes

g_i(x) = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i) - 1/2 log(|Σ|) + log(P(ω_i))
       = -1/2 (x - μ_i)^T diag(σ_1², ..., σ_N²)^{-1} (x - μ_i) - 1/2 log(|diag(σ_1², ..., σ_N²)|) + log(P(ω_i))
       = -1/2 Σ_{k=1}^{N} (x_k - μ_ik)² / σ_k² - 1/2 log(Π_{k=1}^{N} σ_k²) + log(P(ω_i))
       = -1/2 Σ_{k=1}^{N} (x_k² - 2 x_k μ_ik + μ_ik²) / σ_k² - 1/2 log(Π_{k=1}^{N} σ_k²) + log(P(ω_i))

Eliminate the terms in x_k², as they are constant for all classes:

g_i(x) = -1/2 Σ_{k=1}^{N} (-2 x_k μ_ik + μ_ik²) / σ_k² - 1/2 log(Π_{k=1}^{N} σ_k²) + log(P(ω_i))
Properties
This discriminant is linear, so the decision boundaries g_i(x) = g_j(x) are hyper-planes.
The loci of constant probability are hyper-ellipses aligned with the feature axes.
The only difference from the previous classifier is that the distance along each axis is normalized by the variance of that axis.
Case 2: Σ_i = Σ (Σ diagonal), Example
Compute the decision boundaries for a 3-class, 2D problem with equal priors and the following class-conditional parameters:

μ_1 = (3, 2)^T,   μ_2 = (5, 4)^T,   μ_3 = (2, 5)^T

Σ_1 = Σ_2 = Σ_3 = [ 1  0
                    0  2 ]
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ (Σ diagonal)
Case 3: Σ_i = Σ (Σ non-diagonal)
Case 3: Σ_i = Σ (Σ non-diagonal)
All classes have the same covariance matrix, but it is not necessarily diagonal.
The quadratic discriminant function becomes

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - 1/2 log(|Σ_i|) + log(P(ω_i))
       = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i) - 1/2 log(|Σ|) + log(P(ω_i))

Eliminate the term log(|Σ|), which is constant for all classes:

g_i(x) = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i) + log(P(ω_i))

The quadratic term is called the Mahalanobis distance, a very important distance in Statistical PR.

Mahalanobis distance:  ‖x - y‖²_{Σ^{-1}} = (x - y)^T Σ^{-1} (x - y)

The Mahalanobis distance is a vector distance that uses a Σ^{-1} norm.
Σ^{-1} can be thought of as a stretching factor on the space.
For Σ = I the Mahalanobis distance becomes the Euclidean distance.
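A short NumPy sketch of the Mahalanobis distance (assuming Σ is positive definite; names are illustrative):

import numpy as np

def mahalanobis_sq(x, y, sigma):
    # Squared Mahalanobis distance (x-y)^T Sigma^{-1} (x-y)
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(diff @ np.linalg.solve(sigma, diff))

# With Sigma = I it reduces to the squared Euclidean distance:
print(mahalanobis_sq([1, 2], [0, 0], np.eye(2)))                              # 5.0
print(mahalanobis_sq([1, 2], [0, 0], np.array([[1.0, 0.7], [0.7, 2.0]])))     # stretched by Sigma^{-1}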
Case 3: Σ_i = Σ (Σ non-diagonal)
Expansion of the quadratic term in the discriminant yields

g_i(x) = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i) + log(P(ω_i))
       = -1/2 (x^T Σ^{-1} x - 2 μ_i^T Σ^{-1} x + μ_i^T Σ^{-1} μ_i) + log(P(ω_i))

Removing the term x^T Σ^{-1} x, which is constant for all classes:

g_i(x) = -1/2 (-2 μ_i^T Σ^{-1} x + μ_i^T Σ^{-1} μ_i) + log(P(ω_i))

Reorganizing terms we get:

g_i(x) = w_i^T x + w_i0   with   w_i = Σ^{-1} μ_i   and   w_i0 = -1/2 μ_i^T Σ^{-1} μ_i + log(P(ω_i))

Properties
The discriminant is linear, so the decision boundaries are hyper-planes.
The constant probability loci are hyper-ellipses aligned with the eigenvectors of Σ.
If we can assume equal priors, the classifier becomes a minimum (Mahalanobis) distance classifier:

Equal priors: g_i(x) = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i)
Case 3: Example
Compute the decision boundaries for a 3-class, 2D problem with equal priors and the following class-conditional parameters:

μ_1 = (3, 2)^T,   μ_2 = (5, 4)^T,   μ_3 = (2, 5)^T

Σ_1 = Σ_2 = Σ_3 = [ 1   .7
                    .7  2 ]
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ (Σ diagonal)
Case 3: Σ_i = Σ (Σ non-diagonal)
Case 4: Σ_i = σ_i²I
Case 4: Σ_i = σ_i²I
Each class has a different covariance matrix, which is proportional to the identity matrix.
The quadratic discriminant becomes

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - 1/2 log(|Σ_i|) + log(P(ω_i))
       = -1/(2σ_i²) (x - μ_i)^T (x - μ_i) - (N/2) log(σ_i²) + log(P(ω_i))

The expression cannot be reduced any further, so
  the decision boundaries are quadratic (hyper-spheres in this case), and
  the loci of constant probability are hyper-spheres centered at the class means.
Case 4: Σ_i = σ_i²I, Example
Compute the decision boundaries for a 3-class, 2D problem with the following class-conditional parameters:

μ_1 = (3, 2)^T,   μ_2 = (5, 4)^T,   μ_3 = (2, 5)^T

Σ_1 = [ .5  0        Σ_2 = [ 1  0        Σ_3 = [ 2  0
        0  .5 ]              0  1 ]              0  2 ]
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ (Σ diagonal)
Case 3: Σ_i = Σ (Σ non-diagonal)
Case 4: Σ_i = σ_i²I
Case 5: Σ_i ≠ Σ_j, General Case
Case 5: Σ_i ≠ Σ_j, General Case
We have already derived the expression for the general case; it is:

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - 1/2 log(|Σ_i|) + log(P(ω_i))

Reorganizing terms in a quadratic form yields

g_i(x) = x^T W_i x + w_i^T x + w_i0

where

W_i = -1/2 Σ_i^{-1},
w_i = Σ_i^{-1} μ_i,
w_i0 = -1/2 μ_i^T Σ_i^{-1} μ_i - 1/2 log(|Σ_i|) + log(P(ω_i))

Properties
The loci of constant probability for each class are hyper-ellipses, oriented with the eigenvectors of Σ_i for that class.
The decision boundaries are quadratic: hyper-ellipses or hyper-paraboloids.
The quadratic expression in the discriminant is proportional to the Mahalanobis distance using the class-conditional covariance Σ_i.
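A sketch that builds the coefficients W_i, w_i and w_i0 of this quadratic form from the class parameters (assuming an invertible Σ_i; names are illustrative):

import numpy as np

def quadratic_form_coeffs(mu, sigma, prior):
    # Return (W, w, w0) such that g(x) = x^T W x + w^T x + w0
    mu = np.asarray(mu, float)
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    sign, logdet = np.linalg.slogdet(sigma)
    w0 = -0.5 * mu @ sigma_inv @ mu - 0.5 * logdet + np.log(prior)
    return W, w, w0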
Case 5: Σ_i ≠ Σ_j, Example
Compute the decision boundaries for a 3-class, 2D problem with the following class-conditional parameters:

μ_1 = (3, 2)^T,   μ_2 = (5, 4)^T,   μ_3 = (2, 5)^T

Σ_1 = [ 1  1        Σ_2 = [ 1  0        Σ_3 = [ 2  0
        1  2 ]              0  1 ]              0  2 ]
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ (Σ diagonal)
Case 3: Σ_i = Σ (Σ non-diagonal)
Case 4: Σ_i = σ_i²I
Case 5: Σ_i ≠ Σ_j, General Case
Numerical Example
Numerical Example
Derive the discriminant function for the 2-class, 3D classification problem defined by the following Gaussian likelihoods:

μ_1 = (0, 0, 0)^T;   μ_2 = (1, 1, 1)^T;   Σ_1 = Σ_2 = (1/4) I (3x3);   P(ω_2) = 2 P(ω_1)

Solution: Since P(ω_2) = 2 P(ω_1), the priors are P(ω_1) = 1/3 and P(ω_2) = 2/3, and since Σ_1 = Σ_2 = σ²I with σ² = 1/4 we can use the Case 1 discriminant:

g_1(x) = -1/(2σ²) (x - μ_1)^T (x - μ_1) + log(P(ω_1))
       = -1/(2·(1/4)) (x_1² + x_2² + x_3²) + log(1/3)
       = -2 (x_1² + x_2² + x_3²) + log(1/3)

g_2(x) = -1/(2·(1/4)) ((x_1 - 1)² + (x_2 - 1)² + (x_3 - 1)²) + log(2/3)
       = -2 ((x_1 - 1)² + (x_2 - 1)² + (x_3 - 1)²) + log(2/3)

Classify x as ω_1 if g_1(x) > g_2(x):

-2 (x_1² + x_2² + x_3²) + log(1/3) > -2 ((x_1 - 1)² + (x_2 - 1)² + (x_3 - 1)²) + log(2/3)

x_1 + x_2 + x_3 < (6 - log 2)/4 ≈ 1.32

Therefore the decision rule is:

Class(x) = ω_1  if x_1 + x_2 + x_3 < 1.32
           ω_2  if x_1 + x_2 + x_3 > 1.32

Classify the test example x_u = (.1, .7, .8)^T:

.1 + .7 + .8 = 1.6 > 1.32  ⇒  x_u ∈ ω_2
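The result can also be checked numerically by evaluating the discriminants directly; a sketch under the stated assumptions (illustrative names):

import numpy as np

mu1, mu2 = np.zeros(3), np.ones(3)
sigma = 0.25 * np.eye(3)                      # Sigma_1 = Sigma_2 = (1/4) I
p1, p2 = 1.0 / 3.0, 2.0 / 3.0                 # from P(w2) = 2 P(w1)

def g(x, mu, prior):
    # Equal covariances, so the -1/2 log|Sigma| term is common to both classes and omitted
    diff = np.asarray(x, float) - mu
    return -0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior)

x_u = np.array([0.1, 0.7, 0.8])
print(g(x_u, mu1, p1), g(x_u, mu2, p2))       # g2 > g1, so x_u is assigned to w2
print(x_u.sum(), (6 - np.log(2)) / 4)         # the feature sum 1.6 is above the threshold, agreeing with the rule above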
Quadratic Classifiers
Bayes classifiers for Normally distributed classes
Case 1: Σ_i = σ²I
Case 2: Σ_i = Σ (Σ diagonal)
Case 3: Σ_i = Σ (Σ non-diagonal)
Case 4: Σ_i = σ_i²I
Case 5: Σ_i ≠ Σ_j, General Case
Numerical Example
Conclusions
Conclusions
We can draw the following conclusions:
The Bayes classifier for normally distributed classes (general case) is a quadratic classifier.
The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier.
The minimum Mahalanobis distance classifier is Bayes-optimal for normally distributed classes with equal covariance matrices and equal priors.
The minimum Euclidean distance classifier is Bayes-optimal for normally distributed classes with equal covariance matrices proportional to the identity matrix and equal priors.
Both the Euclidean and Mahalanobis minimum-distance classifiers are linear classifiers.
Some of the most popular classifiers can be derived from decision-theoretic principles and some simplifying assumptions.
Using a specific (Euclidean or Mahalanobis) minimum distance classifier implicitly corresponds to certain statistical assumptions.
We can rarely verify whether these assumptions hold for real problems; in most cases we are limited to asking whether the classifier solves our problem or not.
