Pattern Classification
Philip O. Ogunbona
Autumn

1 Pattern Recognition/Classifier
2 Approaches to Classification
3 Applications
4 References
We take for granted the fact that we are able to move around in our world and
recognize:
cars
people
animals
objects in general
despite the variety in their form and existence.
What features help in these tasks?
In pattern recognition we study how to design machines that can recognize and classify "things".
We study the statistics of the features that describe "things".
We study how to measure the performance of pattern recognition systems and
select good systems.
[Figure: Block diagram of a pattern recognition system: Sensor → Feature Selector/Extractor → Classifier → Decision; the sensor output is the representation pattern and the feature selector/extractor output is the feature pattern.]
Denote the C classes by ω1, . . . , ωC.
There is a variable, z, that indicates which class, ωi, a pattern x belongs to. That is, z = i if x ∈ ωi.
Part of the design process is to evaluate and set optimal operating parameters.
The idea is that once we have a designed classifier we can estimate the class
membership of an unknown pattern.
There is an assumption that the samples used for training are drawn from the
same probability distribution as the test samples and the operational samples.
[Figure: Block diagram of the pattern recognition system (Sensor → Feature Selector/Extractor → Classifier → Decision), repeated to introduce its components.]
1 The representation pattern is the raw data we obtain from the sensor, e.g. image or video pixels, the price of a stock, etc.
2 The feature selector/extractor transforms the representation pattern into a feature pattern, a smaller set of measurements that describe the pattern.
3 The trained classifier uses the feature pattern to make a decision regarding the pattern presented at its input.
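As a rough sketch of this processing chain (the stage functions and the simple mean/spread features below are illustrative assumptions, not part of the lecture):

# Minimal sketch of the sensing -> feature extraction -> classification pipeline.
# All stage implementations here are illustrative placeholders.
import numpy as np

def sense():
    # Representation pattern: raw data from the sensor (here, a fake 1-D signal).
    return np.random.default_rng(0).normal(size=256)

def extract_features(raw):
    # Feature pattern: a low-dimensional summary of the raw representation.
    return np.array([raw.mean(), raw.std()])

def classify(features, threshold=0.5):
    # Trained classifier: makes a class decision from the feature pattern.
    return "class_1" if features[1] > threshold else "class_2"

decision = classify(extract_features(sense()))
print(decision)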
Problem
Given a training set of patterns of known class, we seek to design a classifier that is
optimal for the expected operating conditions.
Supervised: The given data is labelled with the class each pattern belongs to, and the classifier is designed from these labelled examples.
Unsupervised: The given data is not labelled and the idea is to find groups in the data and the features that distinguish one group from another.
[Figure: Taxonomy of classification approaches: supervised versus unsupervised.]
An interesting observation may show that bass is usually longer than salmon.
We take several samples of the two fishes and measure their lightness.
We again represent our measurement as a histogram.
The answer to the question, "Will this feature sufficiently classify the fishes?" is more satisfying.
The X∗ or L∗ is a decision threshold.
[Figure 5: "Histogram" of the lightness of the fishes (count versus lightness for salmon and sea bass); the lightness marked X∗ will lead to the smallest number of errors.]
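A minimal sketch of how such a threshold could be chosen from data: sweep candidate values of the lightness and keep the one giving the fewest misclassifications on the samples. The Gaussian lightness samples are assumed for illustration only.

import numpy as np

rng = np.random.default_rng(0)
salmon = rng.normal(loc=3.0, scale=1.0, size=200)    # assumed lightness samples (salmon darker)
sea_bass = rng.normal(loc=6.0, scale=1.0, size=200)  # assumed lightness samples (sea bass lighter)

# Sweep candidate thresholds; classify "sea bass" if lightness > threshold.
candidates = np.sort(np.concatenate([salmon, sea_bass]))
errors = [np.sum(salmon > t) + np.sum(sea_bass <= t) for t in candidates]
x_star = candidates[int(np.argmin(errors))]
print(f"X* = {x_star:.2f}, training errors = {min(errors)}")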
We may also measure the width of the fishes and combine the two measurements into a feature vector,

x = (x1, x2)ᵀ

where x1 is the lightness and x2 is the width.
[Figure: Feature space spanned by lightness and width.]
Using too many features may, however, lead to the dimensionality curse.
A very complicated model may also result in overfitting: the training data is separated "perfectly", but new patterns are poorly classified.
This is the generalization problem.
[Figure 7: Feature space (lightness versus width) with a complex decision boundary of the classifier.]
Occam's Razor
The principle of using a model that is only as complex as necessary to describe a system is captured in the so-called "Occam's razor": favour simpler explanations over those that are needlessly complicated.
Bayes rule for minimum error: assign x ∈ ωj if,

P(ωj|x) > P(ωk|x),   k = 1, . . . , C; k ≠ j
We use Bayes' theorem to express the a posteriori probabilities P(ωj|x) in terms of the a priori probabilities and the class-conditional density functions p(x|ωi):

P(ωi|x) = p(x|ωi) P(ωi) / p(x)

where

p(x) = Σ_{j=1}^{C} p(x|ωj) P(ωj)
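As a small numerical illustration of the formula, the sketch below evaluates the posteriors for a two-class problem; the Gaussian class-conditional densities and the priors are assumed values.

import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                  # P(w1), P(w2): assumed
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]   # p(x|w1), p(x|w2): assumed

def posteriors(x):
    likelihoods = np.array([d.pdf(x) for d in densities])
    evidence = np.sum(likelihoods * priors)    # p(x) = sum_j p(x|wj) P(wj)
    return likelihoods * priors / evidence     # P(wi|x)

x = 1.0
post = posteriors(x)
print(post, "-> assign to class", np.argmax(post) + 1)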
In terms of the class-conditional densities we can write the decision rule as: assign x to ωj if,

p(x|ωj) P(ωj) > p(x|ωk) P(ωk),   k = 1, . . . , C; k ≠ j

For two classes this can be expressed with the likelihood ratio,

Lr(x) = p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1)

Take as an example a two-class discrimination problem with

p(x|ω2) = 0.6 N(x|1, 1) + 0.4 N(x|−1, 2)

[Figure: Plot for the two-class discrimination example over −4 ≤ x ≤ 4.]

If Lr(x) > P(ω2)/P(ω1), the observed sample is classified as ω1; otherwise as ω2.
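A sketch of the likelihood ratio rule for this example. The density p(x|ω1) was not recoverable from the slide, so a unit-variance Gaussian centred at 0 is assumed, the second argument of N(·|·, ·) is read as a variance, and equal priors are assumed.

import numpy as np
from scipy.stats import norm

def p1(x):  # p(x|w1): assumed N(x|0, 1) for illustration
    return norm(0.0, 1.0).pdf(x)

def p2(x):  # p(x|w2) = 0.6 N(x|1, 1) + 0.4 N(x|-1, 2), second parameter read as a variance
    return 0.6 * norm(1.0, 1.0).pdf(x) + 0.4 * norm(-1.0, np.sqrt(2.0)).pdf(x)

P1, P2 = 0.5, 0.5                                # priors: assumed equal
def classify(x):
    return 1 if p1(x) / p2(x) > P2 / P1 else 2   # Lr(x) > P(w2)/P(w1) -> decide w1

xs = np.linspace(-4, 4, 9)
print([classify(x) for x in xs])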
Bayes decision rule - minimum risk
This decision rule minimizes an expected loss or risk.
The overall risk is

r = Σ_{i=1}^{C} ri = Σ_{i=1}^{C} ∫_{Ωi} Σ_{j=1}^{C} λji P(ωj|x) p(x) dx

where λji is the loss incurred in deciding ωi when the true class is ωj, and Ωi is the region of feature space in which we decide ωi.
The risk is minimized if the regions Ωi are chosen such that, whenever Σ_{j=1}^{C} λji P(ωj|x) ≤ Σ_{j=1}^{C} λjk P(ωj|x) for k = 1, . . . , C, then x ∈ Ωi.
The conditional loss (risk) of deciding ωi is

li(x) = Σ_{j=1}^{C} λji P(ωj|x)
The minimum risk decision rule for two classes is simply to decide ω1 if l1(x) < l2(x).
A common choice is the zero-one loss matrix, λji = 0 if j = i and λji = 1 otherwise: no loss is incurred for deciding ωi when x ∈ class ωi, and unit loss is incurred for any misclassification.
With this loss the minimum risk rule assigns x to the class with the largest posterior probability P(ωi|x); this is the same as Bayes rule for minimum error.
In the two-category case it decides ω1 whenever P(ω1|x) > P(ω2|x); this is the same as Bayes rule for minimum error in the two-category case.
The corresponding risks in the case of the zero-one loss matrix are

li(x) = Σ_{j=1}^{C} λji P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)
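As a minimal numerical check of these expressions (the posterior vector and the general loss matrix below are assumed values for illustration):

import numpy as np

posterior = np.array([0.7, 0.2, 0.1])      # P(wj|x) for C = 3 classes: assumed values

# loss[j, i] = lambda_ji, the loss of deciding w_i when the true class is w_j (assumed matrix)
loss = np.array([[0.0, 1.0, 4.0],
                 [2.0, 0.0, 1.0],
                 [1.0, 3.0, 0.0]])

risks = posterior @ loss                   # l_i(x) = sum_j lambda_ji P(wj|x)
print("conditional risks:", risks, "-> decide class", np.argmin(risks) + 1)

zero_one = 1.0 - np.eye(3)                 # zero-one loss matrix
print(np.allclose(posterior @ zero_one, 1.0 - posterior))   # l_i(x) = 1 - P(wi|x)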
This decision rule finds application in signal processing, e.g. radar signal detection and other two-class detection problems.
If we term class ω1 the positive class and ω2 the negative class (this is just a convention):
Type I error probability: called the false negative rate, that is, the proportion of positive samples incorrectly assigned to the negative class,

ϵ1 = ∫_{Ω2} p(x|ω1) dx

Type II error probability: called the false positive rate, that is, the proportion of negative samples incorrectly classified as positive,

ϵ2 = ∫_{Ω1} p(x|ω2) dx
The rule is to minimize ϵ1 subject to ϵ2 being fixed at a specified value ϵ0 (the Neyman-Pearson criterion).
This can be written as: if p(x|ω1) / p(x|ω2) > µ then x ∈ Ω1,
where µ is chosen so that the false alarm error ∫_{Ω1} p(x|ω2) dx = ϵ0; a numerical solution is usually employed to find µ.
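A minimal sketch of finding µ, assuming unit-variance Gaussian class-conditional densities separated by d (an assumption for illustration). With equal variances the likelihood-ratio threshold is equivalent to a threshold x0 on x, which can be read off the inverse survival function of p(x|ω2).

import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities: p(x|w1) = N(x|d, 1), p(x|w2) = N(x|0, 1), with d = 2
d = 2.0
p1, p2 = norm(d, 1.0), norm(0.0, 1.0)
eps0 = 0.05                                    # required false alarm rate

# With equal-variance Gaussians the likelihood ratio is monotone in x, so the
# region Omega_1 is {x > x0}; pick x0 so that the false alarm integral equals eps0.
x0 = p2.isf(eps0)                              # P(x > x0 | w2) = eps0
mu = p1.pdf(x0) / p2.pdf(x0)                   # corresponding likelihood-ratio threshold
print(f"x0 = {x0:.3f}, mu = {mu:.3f}, detection rate = {p1.sf(x0):.3f}")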
[Figure: ROC curves, P(True positive) against P(False alarm), for increasing class separation d = 0, 1, 2, 4.]
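The curves in the figure can be sketched by sweeping a decision threshold for two unit-variance Gaussians whose means are separated by d; the Gaussian model is an assumption suggested by the d labels, not stated on the slide.

import numpy as np
from scipy.stats import norm

thresholds = np.linspace(-4.0, 8.0, 7)
for d in (0, 1, 2, 4):
    p_fa = norm(0.0, 1.0).sf(thresholds)   # P(false alarm) = P(x > t | w2)
    p_tp = norm(d, 1.0).sf(thresholds)     # P(true positive) = P(x > t | w1)
    pairs = ", ".join(f"({fa:.2f}, {tp:.2f})" for fa, tp in zip(p_fa, p_tp))
    print(f"d = {d}: (P_FA, P_TP) points: {pairs}")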
The Bayes decision rule requires knowledge of the prior class probabilities and the class-conditional densities, which are often not available in practice and must be estimated from data.
An alternative is to specify a discriminant function, h(x), directly and assign

h(x) > k ⇒ x ∈ ω1
h(x) < k ⇒ x ∈ ω2

for a constant k.
The discriminant techniques rely on the form of the function being specified and not on the underlying distribution.
A widely used example is the linear discriminant function,

g(x) = ωᵀx + ω0 = Σ_{i=1}^{p} ωi xi + ω0

The decision boundary g(x) = 0 describes a hyperplane with unit normal in the direction of ω and a perpendicular distance |ω0|/|ω| from the origin.
[Figure: Geometry of the linear discriminant: the hyperplane g = 0 with normal ω, the regions g > 0 and g < 0, the distance g(x)/|ω| of a point x from the hyperplane, and the distance |ω0|/|ω| of the hyperplane from the origin.]
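A minimal sketch evaluating g(x) and the two distances shown in the figure, for assumed values of the weight vector and bias:

import numpy as np

w = np.array([3.0, 4.0])     # weight vector (assumed values)
w0 = -5.0                    # bias term (assumed value)

def g(x):
    return w @ x + w0        # g(x) = w^T x + w0

x = np.array([2.0, 1.0])
print("g(x) =", g(x))                                                     # sign gives the side of the hyperplane
print("distance of x to hyperplane:", g(x) / np.linalg.norm(w))           # g(x)/|w|
print("distance of hyperplane to origin:", abs(w0) / np.linalg.norm(w))   # |w0|/|w|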
Classifiers that use a linear discriminant function are called linear machines.
Consider assigning x to the class of the nearest prototype point pi. Since minimizing ||x − pi||² is equivalent to maximizing gi(x) = xᵀpi − ½||pi||², we can write gi(x) = ωiᵀx + ωi0, where

ωi = pi     (1)

ωi0 = −½ ||pi||²     (2)

to show that it is indeed a linear machine.
The prototype points could be chosen as the mean of each class and we have a
nearest class mean classifier.
Note also that the decision regions of a linear machine are always convex.
[Figure: Nearest class mean classifier: prototype points and the decision boundary line, the perpendicular bisector of the line joining the prototype points.]
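A small sketch of the nearest class mean classifier written both as a minimum-distance rule and as the linear machine of equations (1) and (2); the class means used are assumed values, and the two forms give the same decision.

import numpy as np

prototypes = np.array([[0.0, 0.0],     # class means p_i (assumed values)
                       [4.0, 0.0],
                       [0.0, 3.0]])

def nearest_mean(x):
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

def linear_machine(x):
    # g_i(x) = w_i^T x + w_i0 with w_i = p_i and w_i0 = -||p_i||^2 / 2, as in (1)-(2)
    w0 = -0.5 * np.sum(prototypes**2, axis=1)
    return int(np.argmax(prototypes @ x + w0))

x = np.array([2.5, 1.0])
print(nearest_mean(x), linear_machine(x))   # the two rules agree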
The linear machine has a simple form but suffers the limitation of not being able to handle situations where the decision regions have to be non-convex.
The examples below show two-class problems where a linear discriminant fails to separate the classes; they require piecewise-linear discriminant functions.
Such a function can be formed by allowing several prototypes per class,

gij(x) = xᵀpij − ½ pijᵀpij,   j = 1, . . . , ni; i = 1, . . . , C

where pij, j = 1, . . . , ni, are the prototypes for class ωi.
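A sketch of this piecewise-linear rule, assuming some prototype locations for illustration: each class keeps the maximum of gij(x) over its own prototypes, and x is assigned to the class with the largest value.

import numpy as np

# Prototypes p_ij per class (assumed values); class 1 uses two prototypes
# because its decision region is not convex.
prototypes = {
    1: [np.array([0.0, 0.0]), np.array([6.0, 0.0])],
    2: [np.array([3.0, 0.0])],
}

def g(x, p):
    return x @ p - 0.5 * p @ p          # g_ij(x) = x^T p_ij - (1/2) p_ij^T p_ij

def classify(x):
    scores = {i: max(g(x, p) for p in ps) for i, ps in prototypes.items()}
    return max(scores, key=scores.get)

for x in (np.array([0.5, 0.0]), np.array([3.2, 0.0]), np.array([5.5, 0.0])):
    print(x, "->", classify(x))         # class 1 wins on both sides of class 2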