
Machine Learning: Algorithms and

Applications

Philip O. Ogunbona

Advanced Multimedia Research Lab


University of Wollongong

Pattern Classification
Autumn



Outline of Topics

1 Pattern Recognition/Classifier

2 Approaches to Classification

3 Elementary Decision Theory

4 References



What is pattern recognition/classifier?

We take for granted the fact that we are able to move around in our world and
recognize:
cars
people
animals
objects in general
despite the variety in their form and existence.
What features help in these tasks?
In pattern recognition we study how to design machines that can recognize and classify "things".
We study the statistics of the features that describe "things".
We study how to measure the performance of pattern recognition systems and
select good systems.



Basic Model - Pattern Classifier

Figure 1: Pattern Classifier [7]. Block diagram: Sensor → Feature Selector/Extractor → Classifier (Decision); the sensor output is the representation pattern and the selector/extractor output is the feature pattern.

The pattern is a set of numbers or values represented as a p-dimensional vector,

x = (x1, x2, · · · , xp)^t

where t (or sometimes T) denotes vector transpose.



Basic Model - Pattern Classifier

Figure 1: Pattern Classifier [7] (block diagram repeated from the previous slide).

The pattern could be:


pixels in an image
closing prices of a share on the stock market
recordings of a speech signal
measurements on weather variables
group of measurements about a real estate property
group of measurements about the behaviour and lifestyle of people
etc.



Basic Model - Pattern Classifier

Figure 1: Pattern Classifier [7] (block diagram repeated from the previous slide).

We assume that there are C classes denoted by,

ω1 , . . . , ωC

There is a variable, z, that indicates the class ωi to which a pattern x belongs. That is,

if z = i, then the pattern x belongs to ωi, i ∈ {1, . . . , C}



Basic Model - Pattern Classifier

The problem is how to design the pattern classifier.


Designing a pattern classifier entails:
specifying the classifier model parameters
ensuring that response for a given pattern is optimal

The design process assumes we have a set of patterns of known class, {(xi, zi)}, called the training or design set, which is used to design the classifier.

Part of the design process is to evaluate and set optimal operating parameters.

The idea is that once we have a designed classifier we can estimate the class
membership of an unknown pattern.

There is an assumption that the samples used for training are drawn from the
same probability distribution as the test samples and the operational samples.



Basic Model - Pattern Classifier

A closer look at the simplified classifier model:

Figure 2: Pattern Classifier [7] (the same block diagram as Figure 1).

1 Representation pattern is the raw data we obtain from the sensor, e.g. image or video pixels, the price of a stock, etc.

2 Feature pattern is a small set of variables obtained through some transformation - feature selection and/or extraction.

3 The trained classifier uses the feature pattern to make a decision regarding the
pattern presented at its input.



Basic Model - Pattern Classifier

Further consideration about classifier design:

Problem
Given a training set of patterns of known class, we seek to design a classifier that is
optimal for the expected operating conditions.

1 The given set of training patterns is finite.


2 The classifier model cannot be too complex. In other words, it cannot have too many parameters; otherwise it may over-fit the training data.
3 It is not important to achieve optimal performance on the design set.
4 It is very important to achieve optimal generalization performance: the expected performance on data representative of the true operating condition - the infinite set from which the design set is drawn.



Supervised and Unsupervised Classification
There are two main categories of classifiers:
Supervised: The classifier design process has a set of data samples with
associated labels (class type). These are exemplars or training data.

Unsupervised: The given data is not labelled and the idea is to find groups in the
data and the features that distinguish one group from another.

There is a third category, namely semi-supervised classifiers, in which both labelled and unlabelled data are used for training.

Figure 3: Main categories of classifiers - supervised and unsupervised.



Supervised Classification

Example - from Duda-Hart-Stork [2]


We are required to design a classifier for a fishing company so as to automate the sorting process. The company is interested in sorting salmon from bass. The cost of misclassification (e.g. selling a salmon as bass) could be high!

Possible features of interest:
length
width
number and shape of fins
position of mouth
lightness
These features will vary because of measurement errors or conditions.

(The slide shows example images of a salmon and a bass.)



Supervised Classification

An interesting observation may show that bass is usually longer than salmon.
We take several samples of the two fishes and measure their lengths.
We may represent our measurements as a histogram.
We may ask the question, "Will this feature sufficiently classify the fishes?"

Figure 4: "Histogram" of fish lengths; the length marked L∗ will lead to the smallest number of errors.



Supervised Classification

Perhaps the cost of using length alone to classify is too high.
We take several samples of the two fishes and measure their lightness.
We again represent our measurements as a histogram.
The answer to the question "Will this feature sufficiently classify the fishes?" is now more satisfying.
The X∗ or L∗ is a decision threshold.

Figure 5: "Histogram" of lightness of the fishes; the lightness marked X∗ will lead to the smallest number of errors.
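
As a minimal sketch of such a single-feature decision threshold (the numerical value of X∗ and the assumption that larger lightness means sea bass are illustrative only, not values from the slides):

```python
# Minimal sketch of a one-feature threshold classifier.
# X_STAR is a hypothetical decision threshold; the assumption that larger
# lightness corresponds to sea bass is also illustrative.
X_STAR = 5.0

def classify_by_lightness(lightness: float) -> str:
    """Compare the lightness measurement to the decision threshold X*."""
    return "sea bass" if lightness > X_STAR else "salmon"

print(classify_by_lightness(6.2))   # -> 'sea bass'
print(classify_by_lightness(3.1))   # -> 'salmon'
```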



Supervised Classification

Assume that we believe that we could do better at classifying the fishes by using two (2) features.
We now have a two-dimensional feature vector,

x = (x1, x2)^t

The feature space can be visualized.
How to obtain the "best" decision boundary is the classifier design problem.

Figure 6: Feature space (lightness versus width) with decision boundary of classifier.



Supervised Classification

As we increase the number of features there is a need to deal with a high-dimensional feature vector.
The problem of "too many features" is referred to as the curse of dimensionality.
A very complicated model may also result in over-fitting: the training data is separated "perfectly", but new patterns are poorly classified.
This is the generalization problem.

Figure 7: Feature space (lightness versus width) with complex decision boundary of classifier.



Supervised Classification

Occam’s Razor
The principle of using a model that is only as complex as necessary to describe a system is captured in the so-called "Occam's razor": favour simpler explanations over those that are needlessly complicated.

The principle underlies the very popular method of sparse representation.



Bayes decision rule - minimum error
This approach to classification (also called discrimination) assumes that we have
full knowledge of the probability density function of each class

Let the C classes have known a priori probabilities, P(ω1 ), . . . , P(ωC )

We make use of the measurement vector x to assign x to one of the C classes

Formulate a decision rule to assign x to class ωj if the probability of class ωj given the observation x (i.e. P(ωj|x), the posterior probability) is the highest over all classes ω1, . . . , ωC; that is, x ∈ ωj if

P(ωj|x) > P(ωk|x),   k = 1, . . . , C; k ≠ j

The measurement space is partitioned into C regions, Ω1, Ω2, . . . , ΩC; x ∈ Ωj ⇒ x is in class ωj.



Bayes decision rule - minimum error

We use Bayes' theorem to express the a posteriori probabilities P(ωj|x) in terms of the a priori probabilities and the class-conditional density functions p(x|ωi):

P(ωi|x) = p(x|ωi) P(ωi) / p(x)

where

p(x) = Σ_{j=1}^{C} p(x|ωj) P(ωj)

In terms of the class-conditional density we can write the decision rule as: assign x to ωj if

p(x|ωj) P(ωj) > p(x|ωk) P(ωk),   k = 1, . . . , C; k ≠ j

This is Bayes' rule for minimum error.
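
As a minimal sketch of this rule (the one-dimensional Gaussian class-conditional densities and the priors below are illustrative assumptions, not values from the slides), the minimum-error decision is an argmax over p(x|ωj)P(ωj):

```python
# Minimal sketch of the Bayes minimum-error rule with assumed 1-D Gaussian
# class-conditional densities; means, variances and priors are illustrative only.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])            # P(omega_j), assumed
class_conditionals = [norm(0.0, 1.0),         # p(x | omega_1), assumed
                      norm(2.0, 1.5),         # p(x | omega_2), assumed
                      norm(-1.0, 0.5)]        # p(x | omega_3), assumed

def bayes_min_error(x: float) -> int:
    """Return the (1-based) class j maximizing p(x|omega_j) * P(omega_j)."""
    scores = np.array([cc.pdf(x) for cc in class_conditionals]) * priors
    return int(np.argmax(scores)) + 1         # dividing by p(x) leaves the argmax unchanged

print(bayes_min_error(1.2))
```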



Bayes decision rule - minimum error

In a two-class case we can write the Bayes' minimum error rule in terms of the likelihood ratio, Lr(x): assign x to class ω1 if

Lr(x) = p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1)

Take as an example a two-class discrimination problem with class ω1 normally distributed as p(x|ω1) = N(x|0, 1) and class ω2 a normal mixture with

p(x|ω2) = 0.6 N(x|1, 1) + 0.4 N(x|−1, 2)

The plots of p(x|ωi) P(ωi), i = 1, 2, with P(ω1) = P(ω2) = 0.5, are shown below.

Figure: the weighted class-conditional densities p(x|ω1)P(ω1) and p(x|ω2)P(ω2) plotted over x ∈ [−4, 4].


Bayes decision rule - minimum error
Plots of the likelihood ratio Lr(x) and the threshold P(ω2)/P(ω1) are shown below.

Figure 8: Likelihood ratio Lr(x) and the threshold P(ω2)/P(ω1) plotted over x ∈ [−4, 4].

If Lr(x) > P(ω2)/P(ω1), the observed sample is classified as ω1.
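
A minimal sketch of this likelihood-ratio test for the example densities above (scipy's norm takes a standard deviation, so the sketch assumes the second argument of N(·|µ, ·) on the slide is a variance; the threshold is P(ω2)/P(ω1) = 1 for the stated equal priors):

```python
# Likelihood-ratio test for the example above:
#   p(x|omega_1) = N(x|0,1),  p(x|omega_2) = 0.6 N(x|1,1) + 0.4 N(x|-1,2),
#   P(omega_1) = P(omega_2) = 0.5, so the threshold P(omega_2)/P(omega_1) = 1.
from scipy.stats import norm

def p_x_given_w1(x):
    return norm(0.0, 1.0).pdf(x)

def p_x_given_w2(x):
    # NOTE: the slide writes N(x|-1, 2); the sketch assumes the 2 is a variance,
    # so the standard deviation passed to scipy is sqrt(2).
    return 0.6 * norm(1.0, 1.0).pdf(x) + 0.4 * norm(-1.0, 2.0 ** 0.5).pdf(x)

def classify(x, prior_w1=0.5, prior_w2=0.5):
    lr = p_x_given_w1(x) / p_x_given_w2(x)          # likelihood ratio L_r(x)
    return "omega_1" if lr > prior_w2 / prior_w1 else "omega_2"

print(classify(-0.5))
```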
Bayes decision rule - minimum risk
This decision rule minimizes an expected loss or risk.

Define a loss matrix, Λ, with components,

λji = cost of assigning a pattern x to ωi when x ∈ ωj

The conditional risk of assigning a pattern x to class ωi is defined as

li(x) = Σ_{j=1}^{C} λji P(ωj|x)

The average risk over decision region Ωi is

ri = ∫_{Ωi} li(x) p(x) dx = ∫_{Ωi} Σ_{j=1}^{C} λji P(ωj|x) p(x) dx



Bayes decision rule - minimum risk
The overall expected cost or risk is obtained by summing the risks associated with all the classes:

r = Σ_{i=1}^{C} ri = Σ_{i=1}^{C} ∫_{Ωi} Σ_{j=1}^{C} λji P(ωj|x) p(x) dx

The risk is minimized if the regions Ωi are chosen such that if

Σ_{j=1}^{C} λji P(ωj|x) p(x) ≤ Σ_{j=1}^{C} λjk P(ωj|x) p(x),   k = 1, . . . , C

then x ∈ Ωi.

This is the Bayes decision rule for minimum risk.

The Bayes risk, r∗, is

r∗ = ∫ min_{i=1,...,C} Σ_{j=1}^{C} λji P(ωj|x) p(x) dx



Bayes decision rule - minimum risk
For a two-category classification problem we can write the conditional risks as:

li(x) = Σ_{j=1}^{C} λji P(ωj|x)

l1(x) = λ11 P(ω1|x) + λ21 P(ω2|x)
l2(x) = λ12 P(ω1|x) + λ22 P(ω2|x)

The minimum risk decision rule is simply to decide ω1 if l1 (x) < l2 (x).

This can be expressed in terms of posterior probabilities as: Decide ω1 if

(λ11 − λ12 )P(ω1 |x) < (λ22 − λ21 )P(ω2 |x)

In terms of the prior probabilities and conditional densities we decide ω1 if,

(λ11 − λ12 )p(x|ω1 )P(ω1 ) < (λ22 − λ21 )p(x|ω2 )P(ω2 )
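
A minimal sketch of this two-category minimum-risk rule (the loss values, priors and Gaussian class-conditional densities below are illustrative assumptions, not taken from the slides):

```python
# Two-category minimum-risk decision: decide omega_1 when l1(x) < l2(x).
# Loss matrix entries lam[j][i] = cost of assigning x to omega_(i+1)
# when it belongs to omega_(j+1); all numbers here are made up.
import numpy as np
from scipy.stats import norm

lam = np.array([[0.0, 10.0],    # lambda_11, lambda_12
                [1.0,  0.0]])   # lambda_21, lambda_22
priors = np.array([0.6, 0.4])                    # P(omega_1), P(omega_2), assumed
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]     # p(x|omega_1), p(x|omega_2), assumed

def min_risk_decision(x: float) -> int:
    post = np.array([d.pdf(x) for d in densities]) * priors
    post /= post.sum()                           # posteriors P(omega_j | x)
    risks = lam.T @ post                         # risks[i] = sum_j lambda_ji P(omega_j|x)
    return int(np.argmin(risks)) + 1             # 1-based class index

print(min_risk_decision(1.0))
```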



Bayes decision rule - minimum risk

If we consider the special case of the equal cost (also called symmetrical or zero-one) loss matrix, Λ, in which

λij = 1 if i ≠ j;   λij = 0 if i = j,

a substitution of this condition into the Bayes decision rule for minimum risk gives

Σ_{j=1}^{C} P(ωj|x) p(x) − P(ωi|x) p(x) ≤ Σ_{j=1}^{C} P(ωj|x) p(x) − P(ωk|x) p(x)

for k = 1, . . . , C. This is easily simplified as

p(x|ωi) P(ωi) ≥ p(x|ωk) P(ωk),   k = 1, . . . , C

when x ∈ class ωi.
This is the same as Bayes rule for minimum error.



Bayes decision rule - minimum risk
If we consider the special case of the zero-one loss matrix, Λ, in which

λij = 1 if i ≠ j;   λij = 0 if i = j,

and a two-category classification, the Bayes decision rule for minimum risk gives

p(x|ω1) P(ω1) ≥ p(x|ω2) P(ω2)

when x ∈ class ω1.
This is the same as Bayes rule for minimum error in the two-category case.

The corresponding risks in the case of the zero-one loss matrix are

li(x) = Σ_{j=1}^{C} λji P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)



Neyman-Pearson decision rule

This is an alternative to the Bayes decision rule for a two-class problem.

In a two-class problem, two types of errors are identified:

Type I: classify a pattern of class ω1 as belonging to class ω2, with associated error probability

ϵ1 = ∫_{Ω2} p(x|ω1) dx

Type II: classify a pattern from class ω2 as belonging to class ω1, with associated error probability

ϵ2 = ∫_{Ω1} p(x|ω2) dx

The Neyman-Pearson decision rule is to minimize ϵ1 subject to ϵ2 being equal to a constant, ϵ0, say.



Neyman-Pearson decision rule

This decision rule finds application in signal processing, e.g. radar signal detection and other two-class (target present/absent) detection problems.

If we term class ω1 the positive class and ω2 the negative class (this is just a convention):

Type I error probability: called the false negative rate, that is, the proportion of positive samples incorrectly assigned to the negative class,

ϵ1 = ∫_{Ω2} p(x|ω1) dx

Type II error probability: called the false positive rate, that is, the proportion of negative samples incorrectly classified as positive,

ϵ2 = ∫_{Ω1} p(x|ω2) dx

Type II error is also called false alarm.



Neyman-Pearson decision rule

This decision rule minimizes the objective function

r = ∫_{Ω2} p(x|ω1) dx + µ { ∫_{Ω1} p(x|ω2) dx − ϵ0 }
  = 1 − ∫_{Ω1} p(x|ω1) dx + µ { ∫_{Ω1} p(x|ω2) dx − ϵ0 }
  = (1 − µ ϵ0) + ∫_{Ω1} { µ p(x|ω2) − p(x|ω1) } dx

where µ is the Lagrange multiplier.

The objective function is minimized if we choose Ω1 such that the integrand is negative. That is, if µ p(x|ω2) − p(x|ω1) < 0 then x ∈ Ω1.

This can be written as: if p(x|ω1)/p(x|ω2) > µ then x ∈ Ω1.

µ is chosen so that the false alarm error ∫_{Ω1} p(x|ω2) dx = ϵ0; a numerical solution is usually employed to find µ.
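
As a minimal sketch of this numerical step (the two Gaussian class-conditional densities and the value of ϵ0 below are illustrative assumptions; for these densities the likelihood ratio increases with x, so Ω1 is a one-sided region {x > t}):

```python
# Minimal sketch: Neyman-Pearson threshold for two assumed 1-D Gaussian
# class-conditionals, p(x|omega_1) = N(2, 1) and p(x|omega_2) = N(0, 1).
# For these densities the likelihood ratio is monotonically increasing in x,
# so Omega_1 = {x > t}; pick t so that the false-alarm rate equals eps0.
from scipy.stats import norm
from scipy.optimize import brentq

p1 = norm(2.0, 1.0)    # p(x | omega_1), assumed
p2 = norm(0.0, 1.0)    # p(x | omega_2), assumed
eps0 = 0.05            # required false-alarm (Type II) error, assumed

# Solve integral over Omega_1 of p(x|omega_2) dx = P2(X > t) = eps0 numerically.
t = brentq(lambda x: p2.sf(x) - eps0, -10.0, 10.0)

mu = p1.pdf(t) / p2.pdf(t)   # corresponding Lagrange multiplier / LR threshold
print(f"decision threshold t = {t:.3f}, likelihood-ratio threshold mu = {mu:.3f}")
print("detection probability 1 - eps1 =", round(p1.sf(t), 3))
```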



Neyman-Pearson decision rule
The performance of the decision rule is displayed in the form of a receiver
operating characteristic (ROC) curve that plots true positive against false
positive: (1 − ϵ1 ) against ϵ2

Figure 9: Receiver operating characteristic (ROC) curves for two univariate normal distributions and varying values of d, where d = |µ1 − µ2| and µ1, µ2 are the means of the distributions; P(true positive) is plotted against P(false alarm) for d = 0, 1, 2, 4.
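
A minimal sketch of how such ROC curves can be computed (unit-variance normals separated by d; the threshold grid and the area-under-curve summary are illustrative choices):

```python
# ROC sketch for two unit-variance univariate normals separated by d:
#   positive class p(x|omega_1) = N(d, 1), negative class p(x|omega_2) = N(0, 1).
# Sweeping a threshold t with Omega_1 = {x > t} gives
#   P(true positive) = P1(X > t) and P(false alarm) = P2(X > t).
import numpy as np
from scipy.stats import norm

def roc_points(d, thresholds):
    tpr = norm(d, 1.0).sf(thresholds)      # true-positive rate, 1 - eps1
    fpr = norm(0.0, 1.0).sf(thresholds)    # false-alarm rate, eps2
    return fpr, tpr

thresholds = np.linspace(-6.0, 10.0, 400)
for d in (0, 1, 2, 4):
    fpr, tpr = roc_points(d, thresholds)
    order = np.argsort(fpr)                # sort points by false-alarm rate
    auc = np.sum(np.diff(fpr[order]) * (tpr[order][:-1] + tpr[order][1:]) / 2.0)
    print(f"d = {d}: approximate area under the ROC curve = {auc:.3f}")
```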



Discriminant Functions

Bayes decision rules require knowledge of the prior class probabilities and the class-conditional densities, which are often not available in practice and must be estimated from data.

The class of techniques being introduced makes no assumption about p(x|ωi) but rather assumes a form of the discriminant functions.

A discriminant function is a function of the feature vector x that leads to a classification rule.

For a two-class problem, a discriminant function h(x) is such that

h(x) > k ⇒ x ∈ ω1
h(x) < k ⇒ x ∈ ω2

for a constant k.



Discriminant Functions

Discriminant functions are not unique. If f(·) is a monotonic increasing function, then

g(x) = f(h(x)) > k′ ⇒ x ∈ ω1
g(x) = f(h(x)) < k′ ⇒ x ∈ ω2

where k′ = f(k), gives the same decision as h(x).

For a classification problem with C classes we define C discriminant functions, gi(x), such that

gi(x) > gj(x) ⇒ x ∈ ωi,   j = 1, . . . , C; j ≠ i

This implies that a feature vector is assigned to the class with the largest discriminant.

The discriminant techniques rely on the form of the function being specified and
not on the underlying distribution

Parameters of the functional form are adjusted by a training procedure
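
As a minimal sketch of this assignment rule (the two discriminant functions below are arbitrary illustrative choices, not forms given in the slides):

```python
# Minimal sketch: assign a pattern x to the class whose discriminant g_i(x)
# is largest. The two discriminant functions are arbitrary illustrative choices.
def g1(x):                       # hypothetical discriminant for omega_1
    return 2.0 * x[0] - x[1] + 1.0

def g2(x):                       # hypothetical discriminant for omega_2
    return -x[0] + 0.5 * x[1]

def assign(x, discriminants=(g1, g2)):
    values = [g(x) for g in discriminants]
    return values.index(max(values)) + 1     # 1-based class index

print(assign((1.0, 2.0)))        # prints 1 for these example functions
```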



Linear discriminant functions

Linear discriminant functions are a linear combination of the components of the measurement (or feature) vector, x = (x1, x2, . . . , xp)^t, such that

g(x) = ω^t x + ω0 = Σ_{i=1}^{p} ωi xi + ω0

where we need to specify the weight vector ω and the threshold weight ω0.

The equation describes a hyperplane with unit normal in the direction of ω and perpendicular distance |ω0|/|ω| from the origin.



Linear discriminant functions

Figure 10: Geometry of the linear discriminant function: the hyperplane g(x) = 0 has normal in the direction of w and lies at distance |w_0|/|w| from the origin; a pattern x lies at signed distance g(x)/|w| from the hyperplane, with g > 0 on one side and g < 0 on the other.

The value of the discriminant function for a pattern x, divided by |ω| (i.e. g(x)/|ω|), is the signed perpendicular distance of x from the hyperplane.
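
A minimal sketch of this computation (the weight values below are arbitrary illustrative numbers):

```python
# Minimal sketch: a linear discriminant g(x) = w^t x + w0 and the signed
# perpendicular distance g(x)/||w|| of a pattern from the hyperplane g(x) = 0.
# The weights are arbitrary illustrative numbers.
import numpy as np

w = np.array([1.0, -2.0])   # weight vector omega, assumed
w0 = 0.5                    # threshold weight omega_0, assumed

def g(x: np.ndarray) -> float:
    return float(w @ x + w0)

x = np.array([2.0, 1.0])
print("g(x) =", g(x))                                  # its sign gives the class
print("signed distance =", g(x) / np.linalg.norm(w))   # g(x)/|w|
```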



Linear discriminant functions

Classifiers that use a linear discriminant function are called linear machines.

The minimum-distance classifier is an example. It uses the nearest-neighbour decision rule.

Let the prototype points of the classifier be p1, . . . , pC. Each point represents a class, ωi. The minimum-distance classifier assigns x to the class ωi with the nearest prototype pi:

||x − pi||² = x^t x − 2 x^t pi + pi^t pi

The class assigned to x is ωi with

i = arg max_i ( x^t pi − (1/2) pi^t pi )



Linear discriminant functions

We can relate this assignment to the linear discriminant function

gi(x) = ωi^t x + ωi0

where

ωi = pi                      (1)
ωi0 = − (1/2) ||pi||²        (2)

to show that it is indeed a linear machine.

The prototype points could be chosen as the mean of each class and we have a
nearest class mean classifier.
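
A minimal sketch of such a nearest class mean classifier written as a linear machine (the class means and synthetic training samples below are illustrative assumptions):

```python
# Nearest class mean classifier as a linear machine:
#   g_i(x) = p_i^t x - 0.5 * ||p_i||^2, assign x to the class with largest g_i.
# The sample data is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
class_means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # assumed true means
train = [m + rng.normal(size=(20, 2)) for m in class_means]  # 20 samples per class

prototypes = [samples.mean(axis=0) for samples in train]     # p_i = class mean

def classify(x: np.ndarray) -> int:
    scores = [p @ x - 0.5 * (p @ p) for p in prototypes]     # g_i(x)
    return int(np.argmax(scores)) + 1                        # 1-based class index

print(classify(np.array([2.5, 2.0])))   # likely class 2 for these means
```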



Linear discriminant functions
Each boundary is the perpendicular bisector of the line joining the prototype points of contiguous regions.

Note also that the decision regions of a linear machine are always convex.

Figure 11: Decision regions of the minimum-distance classifier; each decision boundary line is the perpendicular bisector of the line joining two prototype points.



Piecewise linear discriminant functions

The linear machine has a simple form but suffers the limitation of not being able to separate classes when the decision regions have to be non-convex.

The examples below show two-class problems that a linear discriminant will fail to separate; they require piecewise linear discriminant functions.

Figure 12: Groups not separable by linear discriminant functions; the required decision regions are not convex.



Piecewise linear discriminant functions

Figure 13: Quick illustration of convex and non-convex regions



Piecewise linear discriminant functions

We may solve the previous two-class problem by using a piecewise linear discriminant function to generalize the minimum-distance classifier.

We allow more than one prototype for each class.

Suppose there are ni prototypes in class ωi, denoted pi^1, . . . , pi^{ni}, i = 1, . . . , C.

The discriminant function which assigns pattern x to class ωi is defined as

gi(x) = max_{j=1,...,ni} gij(x)

where gij is a linear subsidiary discriminant function, given by

gij(x) = x^t pi^j − (1/2) (pi^j)^t pi^j,   j = 1, . . . , ni;  i = 1, . . . , C
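
A minimal sketch of this piecewise linear rule (the prototype locations below are arbitrary illustrative choices, not from the slides):

```python
# Minimal sketch: piecewise linear discriminant with several prototypes per
# class, g_i(x) = max_j [ x^t p_i^j - 0.5 * ||p_i^j||^2 ].
# The prototype locations are arbitrary illustrative choices.
import numpy as np

prototypes = {
    1: [np.array([0.0, 0.0]), np.array([4.0, 4.0])],   # class omega_1 prototypes
    2: [np.array([0.0, 4.0]), np.array([4.0, 0.0])],   # class omega_2 prototypes
}

def g(x, protos):
    """Class discriminant g_i(x): maximum of the subsidiary discriminants g_ij(x)."""
    return max(p @ x - 0.5 * (p @ p) for p in protos)

def classify(x):
    return max(prototypes, key=lambda i: g(x, prototypes[i]))

print(classify(np.array([3.8, 3.9])))   # near a class-1 prototype -> 1
print(classify(np.array([0.2, 3.9])))   # near a class-2 prototype -> 2
```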



Bibliography

[1] Ethem Alpaydin.


Introduction to Machine Learning.
The MIT Press, Cambridge Massachusetts, second edition, 2010.
[2] Richard O. Duda, Peter E. Hart, and David G. Stork.
Pattern Classification.
John Wiley and Sons, Second edition, 2001.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning.
MIT Press, 2016.
[4] John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy.
Fundamentals of Machine Learning for Predictive Data Analytics - Algorithms, Worked Examples and Case Studies.
The MIT Press, Cambridge Massachusetts, 2015.
[5] Tom M. Mitchell.
Machine Learning.
WCB McGraw-Hill, 1997.
[6] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
Foundations of Machine Learning.
MIT Press, 2012.
[7] A. Webb.
Statistical Pattern Recognition.
John Wiley and Sons, Second edition, 2002.

