
Lecture 12: Classification

g Discriminant functions
g The optimal Bayes classifier
g Quadratic classifiers
g Euclidean and Mahalanobis metrics
g K Nearest Neighbor Classifiers

Intelligent Sensor Systems
Ricardo Gutierrez-Osuna
Wright State University
Discriminant functions
g A convenient way to represent a pattern classifier is in terms of
a family of discriminant functions gi(x) with a simple MAX gate
as the classification rule
[Figure: block diagram of a pattern classifier. The features x1, x2, x3, …, xd feed the discriminant functions g1(x), g2(x), …, gC(x); a "select max" stage (optionally incorporating costs) produces the class assignment.]

Assign x to class ωi if gi(x) > gj(x) ∀ j ≠ i

g How do we choose the discriminant functions gi(x)?
n Depends on the objective function to minimize
g Probability of error
g Bayes Risk
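A minimal sketch of this discriminant-function/MAX-gate view (added here, in Python with NumPy; the three affine discriminants are purely hypothetical):

import numpy as np

def classify(x, discriminants):
    """MAX gate: assign x to the class whose discriminant g_i(x) is largest."""
    return int(np.argmax([g(x) for g in discriminants]))

# Hypothetical example with three affine discriminants g_i(x) = w_i . x + b_i
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, -0.5, 1.0])
discriminants = [lambda x, w=w, bi=bi: float(w @ x + bi) for w, bi in zip(W, b)]

print(classify(np.array([0.2, 0.9]), discriminants))  # -> 1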
Minimizing probability of error
g Probability of error P[error|x] is “the probability of assigning x
to the wrong class”
n For a two-class problem, P[error|x] is simply

P(error \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}

g It makes sense to design the classification rule so that it minimizes the average probability of error P(error) across all possible values of x

P(error) = \int_{-\infty}^{+\infty} P(error, x)\,dx = \int_{-\infty}^{+\infty} P(error \mid x)\,P(x)\,dx

g To ensure P(error) is minimum, we minimize P(error|x) by choosing the class with the maximum posterior P(ωi|x) at each x
n This is called the MAXIMUM A POSTERIORI (MAP) RULE
g And the associated discriminant functions become

g_i^{MAP}(x) = P(\omega_i \mid x)
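As a small numeric illustration (added here, not on the original slide; the likelihood and prior values are made up), the MAP rule amounts to multiplying likelihoods by priors and taking the argmax:

import numpy as np

def map_decision(likelihoods, priors):
    """MAP rule: choose the class with the largest posterior P(w_i | x).
    likelihoods[i] = P(x | w_i) at the observed x, priors[i] = P(w_i).
    The evidence P(x) is class-independent, so it cancels in the argmax."""
    unnormalized = np.asarray(likelihoods) * np.asarray(priors)
    posteriors = unnormalized / unnormalized.sum()
    return int(np.argmax(posteriors)), posteriors

cls, post = map_decision(likelihoods=[0.10, 0.30], priors=[0.7, 0.3])
print(cls, post)  # -> 1, posteriors ≈ [0.44, 0.56]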

Minimizing probability of error
g We “prove” the optimality of the MAP rule graphically
n The right plot shows the posterior P(ωi|x) for each of the two classes
n The bottom plots show the P(error) incurred by the MAP rule and by another rule
n Which one has the lower P(error) (color-filled area)?

[Figure: posteriors P(ω1|x) and P(ω2|x) versus x; below, the decision regions for the MAP rule and for the “other” rule (choose RED / choose BLUE / choose RED), with the resulting P(error) shown as a filled area.]

Quadratic classifiers
g Let us assume that the likelihood densities are Gaussian
P(x \mid \omega_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]

g Using Bayes rule, the MAP discriminant functions become


g_i(x) = P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)} = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right] \frac{P(\omega_i)}{P(x)}

n Eliminating constant terms


 1 
exp − (x − i )T ∑ i−1(x − i )P(
-1/2
gi (x) = ∑ i i )
 2 

n We take natural logs (the logarithm is monotonically increasing)

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2} \log|\Sigma_i| + \log P(\omega_i)
g This is known as a Quadratic Discriminant Function
g The quadratic term is known as the Mahalanobis distance
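A compact code sketch of this quadratic discriminant (added here; it assumes the class means, covariances, and priors have already been estimated, e.g. from sample statistics):

import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    """g_i(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 log|Sigma_i| + log P(w_i)."""
    diff = x - mean
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    return -0.5 * mahalanobis_sq - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

def qda_classify(x, means, covs, priors):
    """Assign x to the class with the largest quadratic discriminant."""
    scores = [quadratic_discriminant(x, m, S, p) for m, S, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

Using np.linalg.solve rather than an explicit matrix inverse is a standard numerical choice; for ill-conditioned covariances, np.linalg.slogdet would be preferable to np.linalg.det.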

Mahalanobis distance
g The Mahalanobis distance can be thought of as a vector distance that uses a Σi⁻¹ norm
\|x - y\|^2_{\Sigma_i^{-1}} = (x - y)^T \Sigma_i^{-1} (x - y)

[Figure: in the (x1, x2) plane, the locus of points at constant Mahalanobis distance from the mean μ, (x − μ)^T Σi⁻¹ (x − μ) = K, is an ellipse centered at μ.]

n Σ⁻¹ can be thought of as a stretching factor on the space
n Note that for an identity covariance matrix (Σi = I), the Mahalanobis distance becomes the familiar Euclidean distance
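A small numeric sketch of the Mahalanobis distance and its Euclidean special case (added here; the covariance values are made up):

import numpy as np

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)^T Sigma^{-1} (x - y)."""
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ np.linalg.solve(cov, diff))

x, mu = np.array([2.0, 0.0]), np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0], [0.0, 1.0]])   # large variance along the first axis
print(mahalanobis_sq(x, mu, cov))          # 1.0: the "stretched" axis shrinks the distance
print(mahalanobis_sq(x, mu, np.eye(2)))    # 4.0: identity covariance gives squared Euclidean distance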
g In the following slides we look at special cases of the Quadratic
classifier
n For convenience we will assume equiprobable priors so we can drop the term
log(P(ωi))

Special case 1: Σi = σ²I
g In this case, the discriminant
becomes
g_i(x) = -(x - \mu_i)^T (x - \mu_i)

n This is known as a MINIMUM DISTANCE CLASSIFIER
n Notice the linear decision boundaries
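A minimal nearest-mean sketch of this special case (added here; it assumes the class means are already known):

import numpy as np

def minimum_distance_classify(x, class_means):
    """Sigma_i = sigma^2 I: assign x to the class with the nearest mean."""
    dists = [float(np.sum((x - mu) ** 2)) for mu in class_means]
    return int(np.argmin(dists))

print(minimum_distance_classify(np.array([1.0, 2.0]),
                                [np.array([0.0, 0.0]), np.array([1.0, 3.0])]))  # -> 1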

Special case 2: Σi= Σ (Σ diagonal)
g In this case, the discriminant
becomes
g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)
n This is known as a MAHALANOBIS
DISTANCE CLASSIFIER
n Still linear decision boundaries

Special case 3: Σi=Σ (Σ non-diagonal)
g In this case, the discriminant
becomes
g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)
n This is also known as a
MAHALANOBIS DISTANCE
CLASSIFIER
n Still linear decision boundaries
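A short check, added here, of why a covariance matrix shared by all classes produces linear boundaries: the term x^T Σ⁻¹ x is common to every discriminant and cancels when two classes are compared,

g_i(x) - g_j(x) = (\mu_i - \mu_j)^T \Sigma^{-1} x - \tfrac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right)

which is affine in x, so the decision boundary g_i(x) = g_j(x) is a hyperplane.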

Case 4: Σi = σi²I, example
g In this case the quadratic
expression cannot be simplified
any further
g Notice that the decision
boundaries are no longer linear
but quadratic


Case 5: Σi≠Σj general case, example
g In this case there are no
constraints so the quadratic
expression cannot be
simplified any further
g Notice that the decision
boundaries are also quadratic


Limitations of quadratic classifiers
g The fundamental limitation is the unimodal Gaussian
assumption
n For non-Gaussian or multimodal
Gaussian, the results may be
significantly sub-optimal

g A practical limitation is associated with the minimum required size for the dataset
n If the number of examples per class is less than the number of
dimensions, the covariance matrix becomes singular and,
therefore, its inverse cannot be computed
g In this case it is common to assume the same covariance structure
for all classes and compute the covariance matrix using all the
examples, regardless of class
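A sketch (added here, not part of the original slides) of the shared-covariance fallback just described, where a single covariance matrix is estimated from all examples regardless of class:

import numpy as np

def fit_shared_covariance(X, y):
    """Estimate per-class means but one covariance matrix from all examples."""
    means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    shared_cov = np.cov(X, rowvar=False)   # estimated over the whole dataset, regardless of class
    return means, shared_cov

With a single covariance matrix the quadratic terms cancel between classes and the resulting classifier is linear, as in special cases 2 and 3.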

Conclusions
g We can extract the following conclusions
n The Bayes classifier for normally distributed classes is quadratic
n The Bayes classifier for normally distributed classes with equal
covariance matrices is a linear classifier
n The minimum Mahalanobis distance classifier is optimum for
g normally distributed classes and equal covariance matrices and equal priors
n The minimum Euclidean distance classifier is optimum for
g normally distributed classes and equal covariance matrices proportional to
the identity matrix and equal priors
n Both Euclidean and Mahalanobis distance classifiers are linear
g The goal of this discussion was to show that some of the most
popular classifiers can be derived from decision-theoretic
principles and some simplifying assumptions
n It is important to realize that using a specific (Euclidean or Mahalanobis)
minimum distance classifier implicitly corresponds to certain statistical
assumptions
n The question of whether these assumptions hold can rarely be answered in practice; in most cases we are limited to posing and answering the question “does this classifier solve our problem or not?”
K Nearest Neighbor classifier
g The kNN classifier is based on non-parametric density
estimation techniques
n Let us assume we seek to estimate the density function P(x) from a
dataset of examples
n P(x) can be approximated by the expression

P(x) \cong \frac{k}{NV}

where V is the volume surrounding x, N is the total number of examples, and k is the number of examples inside V

n The volume V is determined by the distance R_k(x) between x and its k-th nearest neighbor (for example, V = \pi R_k^2(x) in two dimensions), so the estimate becomes

P(x) \cong \frac{k}{NV} = \frac{k}{N \cdot c_D \cdot R_k^D(x)}

g Where c_D is the volume of the unit sphere in D dimensions

[Figure: a circle of radius R_k centered at x, enclosing its k nearest neighbors.]
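A sketch of this kNN density estimate (added here; it assumes Euclidean distance and that x is not itself part of the dataset):

import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """kNN density estimate P(x) ~ k / (N * c_D * R_k(x)^D)."""
    N, D = data.shape
    R_k = np.sort(np.linalg.norm(data - x, axis=1))[k - 1]  # distance to the k-th nearest neighbor
    c_D = pi ** (D / 2) / gamma(D / 2 + 1)                   # volume of the unit sphere in D dimensions
    return k / (N * c_D * R_k ** D)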

K Nearest Neighbor classifier
g We use the previous result to estimate the posterior probability
n The class-conditional densities are estimated with

P(x \mid \omega_i) \cong \frac{k_i}{N_i V}

n And the priors can be estimated by

P(\omega_i) \cong \frac{N_i}{N}

n With the unconditional density again estimated as P(x) \cong \frac{k}{NV}, the posterior probability becomes

P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)} = \frac{\frac{k_i}{N_i V} \cdot \frac{N_i}{N}}{\frac{k}{NV}} = \frac{k_i}{k}

n Yielding the discriminant functions

g_i(x) = \frac{k_i}{k}
g This is known as the k Nearest Neighbor classifier
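A minimal sketch (added here) of the resulting rule gi(x) = ki/k: find the k nearest labeled examples and take a majority vote.

import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Assign x to the class with the largest k_i among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean measure of "closeness"
    neighbor_labels = y_train[np.argsort(dists)[:k]]
    classes, counts = np.unique(neighbor_labels, return_counts=True)
    return classes[np.argmax(counts)], dict(zip(classes, counts / k))  # class, P(w_i|x) = k_i/k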

K Nearest Neighbor classifier
g The kNN classifier is a very intuitive method
n Examples are classified based on their similarity with training data
g For a given unlabeled example xu∈ℜD, find the k “closest” labeled examples in the
training data set and assign xu to the class that appears most frequently within the k-
subset
g The kNN only requires
n An integer k
n A set of labeled examples
n A measure of “closeness”

[Figure: 2-D scatter plot (axis 1 vs. axis 2) of labeled training examples from multiple classes, with an unlabeled query point “?” that is assigned to the class appearing most frequently among its k nearest neighbors.]
kNN in action: example 1
g We generate data for a 2-dimensional, 3-class problem where the class-conditional densities are multimodal and the classes are non-linearly separable
g We used kNN with
n k = five
n Metric = Euclidean distance

kNN in action: example 2
g We generate data for a 2-dim 3-class
problem, where the likelihoods are
unimodal, and are distributed in rings
around a common mean
n These classes are also non-linearly separable
g We used kNN with
n k = five
n Metric = Euclidean distance

kNN versus 1NN
[Figure: side-by-side comparison of the decision regions produced by the 1-NN, 5-NN, and 20-NN classifiers.]

Characteristics of the kNN classifier
g Advantages
n Analytically tractable, simple implementation
n Nearly optimal in the large sample limit (N→∞)
g P[error]Bayes ≤ P[error]1-NN ≤ 2·P[error]Bayes
n Uses local information, which can yield highly adaptive behavior
n Lends itself very easily to parallel implementations
g Disadvantages
n Large storage requirements
n Computationally intensive recall
n Highly susceptible to the curse of dimensionality
g 1NN versus kNN
n The use of large values of k has two main advantages
g Yields smoother decision regions
g Provides probabilistic information: The ratio of examples for each class
gives information about the ambiguity of the decision
n However, too large a value of k is detrimental
g It destroys the locality of the estimation
g In addition, it increases the computational burden