Pattern Recognition
P.S.Sastry
[email protected]
Dept. Electrical Engineering
Indian Institute of Science, Bangalore
PR video course p.1/90
Reference Books
R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley, 2002.
C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Indian Edition, 2003.
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.
Pattern Recognition
A basic attribute of people: categorisation of sensory input
Pattern → PR System → Class label
Examples of Pattern Recognition tasks:
Reading facial expressions
Recognising speech
Reading a document
Identifying a person by fingerprints
Diagnosis from medical images
Machine Recognition of Patterns
pattern → Feature Extractor → X → Classifier → class label
Feature extractor makes some measurements on the input pattern.
X is called the Feature Vector. Often, X ∈ ℝⁿ.
Classifier maps each feature vector to a class label.
Features to be used are problem-specific.
Some Examples of PR Tasks
Character Recognition
Pattern: Image
Class: identity of the character
Features: binary image, projections (e.g., row and column sums), moments etc.
Some Examples of PR Tasks
Speech Recognition
Pattern: 1-D signal (or its sampled version)
Class: identity of speech units
Features: LPC model of chunks of speech, spectral info, cepstrum etc.
Pattern can become a sequence of feature vectors.
Examples contd...
Fingerprint-based identity verification
Pattern: image plus an identity claim
Class: Yes / No
Features: position of minutiae, orientation field of ridge lines etc.
Examples contd...
Video-based Surveillance
Pattern: video sequence
Class: e.g., level of alertness
Features: motion trajectories, parameters of a prefixed model etc.
Examples contd...
Credit Screening
Pattern: details of an applicant (for, e.g., a credit card)
Class: Yes / No
Features: income, job history, level of credit, credit history etc.
Examples contd...
Imposter detection (of, e.g., a credit card)
Pattern: a sequence of transactions
Class: Yes / No
Features: amount of money, locations of transactions, times between transactions etc.
Examples contd...
Document Classification
Pattern: a document and a query
Class: relevant or not (in general, a rank)
Features: word occurrence counts, word context etc.
Other examples: spam filtering, diagnostics of machinery etc.
Design of Pattern Recognition Systems
Features depend on the problem. Measure relevant quantities.
Some techniques are available to extract more relevant quantities from the initial measurements (e.g., PCA).
After feature extraction, each pattern is a vector.
Classifier is a function to map such vectors into class labels.
Many general techniques of classifier design are available.
Need to test and validate the final system.
Some notation
Feature Space, X: set of all possible feature vectors.
Classifier: a decision rule or a function, h : X → {1, …, M}.
Often, X = ℝⁿ. Convenient to take M = 2.
Then we take the labels as {0, 1} or {−1, +1}.
Then, any binary-valued function on X is a classifier.
What h to choose? We want a correct or optimal classifier.
How to design classifiers?
How to judge performance?
How to provide performance guarantees?
We first consider the 2-class problem.
Can handle the M > 2 case if we know how to handle the 2-class problem.
Simplest alternative: design M 2-class classifiers. One vs Rest.
There are other possibilities: e.g., tree-structured classifiers.
The 2-class problem is the basic problem.
We will also look at M-class classifiers.
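The One-vs-Rest construction can be sketched in a few lines. The three linear scorers below are made-up illustrations, not rules from the course: each binary classifier scores "its" class against the rest, and we predict the class whose scorer responds most strongly.

```python
import numpy as np

def one_vs_rest_predict(scorers, x):
    """Pick the class whose 'this class vs the rest' scorer
    responds most strongly to the feature vector x."""
    return int(np.argmax([s(x) for s in scorers]))

# Three hypothetical linear scorers on a 2-D feature vector.
scorers = [
    lambda x: x[0] - x[1],      # class 0 vs rest
    lambda x: x[1] - x[0],      # class 1 vs rest
    lambda x: -x[0] - x[1],     # class 2 vs rest
]
print(one_vs_rest_predict(scorers, np.array([2.0, 0.5])))   # scores 1.5, -1.5, -2.5
```

Using real-valued scores rather than hard Yes/No outputs also resolves ties when several binary classifiers claim the same pattern.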
A simple PR problem
Problem: Spot the Right Candidate
Features:
x1: marks based on academic record
x2: marks in the interview
A Classifier: ax1 + bx2 > c ⇒ Good
Another Classifier: x1 x2 > c ⇒ Good
(or (x1 + a)(x2 + b) > c).
Design of classifier:
We have to choose a specific form for the classifier.
What values to use for parameters such as a, b, c?
Designing Classifiers
Need to decide how feature vector values determine the class.
(How different marks reflect goodness of a candidate)
In most applications, it is not possible to design the classifier from the physics of the problem.
The difficulties are:
Lot of variability in patterns of a single class
Variability in feature vector values
Feature vectors of patterns from different classes can be arbitrarily close
Noise in measurements
Designing Classifiers contd...
Often the only information available for the design is a training set of example patterns.
Training set: {(Xi, yi), i = 1, …, ℓ}.
Here Xi is an example feature vector of class yi.
Generation of training set: take representative patterns of known category (data collection) and obtain the feature vectors (choice of feature measurements).
Now learn an appropriate function h as classifier (model choice).
Test and validate the classifier on more data.
A simple PR problem
Problem: Spot the Right Candidate
Features:
x1: marks based on academic record
x2: marks in the interview
A Classifier: ax1 + bx2 > c ⇒ Good
We have chosen a specific form for the classifier.
Design of classifier: What values to use for a, b, c?
Information available: experience/history of past candidates
Training Set
Another example problem
Problem: recognize persons of medium build
Features: height and weight
The classifier is nonlinear here.
Learning from Examples
Designing a classifier is a typical problem of learning from examples (also called learning with a teacher).
The nature of feedback from the teacher determines the difficulty of the learning problem.
In the context of Pattern Recognition:
feature vector → Classifier → class; feedback ← Teacher
Supervised learning: teacher gives the true class label for each feature vector
Reinforcement learning: noisy assessment of performance, e.g., correct/incorrect
Unsupervised learning: no teacher input (clustering problems)
When the class labels of training patterns as given by the teacher are noisy, we consider it as supervised learning with label noise or classification noise.
Many classifier design algorithms do supervised learning.
Function Learning
Closely related problem. Output is continuous-valued rather than discrete as in classifiers.
Here training set examples could be {(Xi, yi), i = 1, …, ℓ}, Xi ∈ X, yi ∈ ℝ.
The prediction variable, y, is continuous rather than taking finitely many values. (There can be noise in examples.)
Similar learning techniques are needed to infer the underlying functional relationship between X and y (regression function of y on X).
Examples of Function Learning
Time series prediction: given a series x1, x2, …, find a function to predict xn.
Based on past values: find a best function xn = h(xn−1, xn−2, …, xn−p)
Predict stock prices, exchange rates etc.
Linear prediction model used in speech analysis
More general predictors can use other variables also.
Predict rainfall based on measurements and (possibly) previous years' data.
In general, System Identification. (An application: smart sensors)
Examples contd... : Equaliser
Tx → x(k) → channel → Z(k) → filter → y(k) → Rx
We want y(k) = x(k). Design (or adapt) the filter to achieve this.
We can choose a filter as
y(k) = Σ_{i=1}^{T} a_i Z(k − i)
Finding the best a_i is a function learning problem.
Training set: {(x(k), Z(k)), k = 1, 2, …, N}
How do we know x(k) at the receiver end?
Prior agreements (protocols)
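A minimal least-squares sketch of this equalizer design. The channel taps, noise level and the one-symbol decision delay below are assumptions for illustration, not details from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: binary symbols sent through an assumed 2-tap channel with noise.
N, T = 500, 5
x = rng.choice([-1.0, 1.0], size=N)                    # transmitted x(k)
z = 0.9 * x + 0.4 * np.concatenate(([0.0], x[:-1]))    # channel output Z(k)
z += 0.05 * rng.standard_normal(N)

# Filter form from the slide: y(k) = sum_{i=1}^{T} a_i Z(k-i).
# Using only past channel outputs, we estimate the delayed symbol x(k-1),
# a standard practical choice (an assumption here).
A = np.array([z[k - T:k][::-1] for k in range(T, N)])  # rows [Z(k-1), ..., Z(k-T)]
target = x[T - 1:N - 1]                                # desired output x(k-1)
a, *_ = np.linalg.lstsq(A, target, rcond=None)         # best taps in least squares

err_rate = float(np.mean(np.sign(A @ a) != target))
print(f"symbol error rate after equalization: {err_rate:.3f}")
```

In practice the taps are adapted online (e.g., by LMS) rather than by a batch solve, but the training-set formulation is the same.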
Learning from examples
Learning from examples is basic to both classification and regression.
Training set: {(X1, y1), …, (Xℓ, yℓ)}.
We essentially want to fit a best function: y = f(X).
Suppose X ∈ ℝ. Then this is a familiar curve-fitting problem.
Model choice (choice of form for f) is very important here.
If we choose f to be a polynomial, the higher the degree, the better the performance on training data.
But that is not the real performance: a polynomial of degree ℓ − 1 would give zero error!!
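A quick numerical illustration of this point (the sine target and noise level are my choices): fitting ℓ noisy points with polynomials of increasing degree drives the training error to essentially zero at degree ℓ − 1, which says nothing about performance on new data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

l = 10                                    # number of training examples
X = np.linspace(0.0, 1.0, l)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(l)

train_mse = {}
for degree in (1, 3, l - 1):
    p = Polynomial.fit(X, y, degree)      # least-squares polynomial fit
    train_mse[degree] = float(np.mean((p(X) - y) ** 2))
    print(f"degree {degree}: training MSE = {train_mse[degree]:.2e}")

# The degree l-1 polynomial interpolates every training point (near-zero
# training error) but oscillates wildly between the points.
```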
Learning from examples: Generalization
To obtain a classifier (or a regression function) we use the training set.
We know the class labels of patterns (or the values of the prediction variable) in the training set.
Errors on the training set do not necessarily tell how good the classifier is.
Any classifier that amounts to only storing the training set is useless.
We are interested in generalization abilities: how does our classifier perform on unseen or new patterns?
Design of Classifiers
The classifier should perform well in spite of the inherent variability of patterns and noise in feature extraction and/or in class labels as given in the training set.
Statistical Pattern Recognition: an approach where the variabilities are captured through probabilistic models.
There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods etc.
In this course we consider classification and regression (function learning) problems in the statistical framework.
Statistical Pattern Recognition
X is the feature space. (We take X = ℝⁿ.) We consider a 2-class problem for simplicity of notation.
A given feature vector can come from different classes with different probabilities.
Let fi be the probability density function of the feature vectors from class-i, i = 0, 1.
The fi are called class conditional densities.
Let X = (X1, …, Xn) ∈ ℝⁿ represent the feature vector.
Then fi is the (conditional) joint density of the random variables X1, …, Xn given that X is from class-i.
Class conditional densities model the variability in the feature values.
For example, the two classes can be uniformly distributed in the two regions as shown. (The two classes are separable here.)
(Figure: two non-overlapping regions labelled Class 1 and Class 2)
When class regions are separable, an important special case is linear separability.
(Figure: two panels, each showing regions labelled Class 1 and Class 2)
The classes in the left panel above are linearly separable (can be separated by a line) while those in the right panel are not linearly separable (though separable).
In general, the two class conditional densities can overlap. (The same value of feature vector can be from different classes with different probabilities.)
(Figure: overlapping density contours labelled Class 1 and Class 2)
The statistical viewpoint gives us one way of looking for an optimal classifier.
We can say we want a classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).
Let
qi(X) = Prob[class = i | X], i = 0, 1.
qi is called the posterior probability (function) for class-i.
Consider the classifier
h(X) = 0 if q0(X) > q1(X)
       = 1 otherwise
q0(X) > q1(X) would imply that the feature vector X is more likely to come from class-0 than from class-1.
Hence, intuitively, such a classifier should minimize the probability of error in classification.
Statistical Pattern Recognition
X (= ℝⁿ) is the feature space. Y = {0, 1} is the set of class labels.
H = {h | h : X → Y} is the set of classifiers of interest.
For a feature vector X, let y(X) denote the class label of X. In general, y(X) would be random.
Now we want to assign a figure of merit to each possible classifier in H.
For example, we can rate different classifiers by
F(h) = Prob[h(X) ≠ y(X)]
F(h) is the probability that h misclassifies a random X.
The optimal classifier would be one with the lowest value of F.
Given an h, we can calculate F(h) only if we know the probability distributions of the classes.
Minimizing F is not a straightforward optimization problem.
Here we are treating all errors as the same. We will consider more general notions of risk later.
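Although F(h) is generally unknown, it can be estimated by simulation once we posit a model. A sketch under assumed 1-D Gaussian class conditional densities (the model is my choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed model: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1), equal priors.
n = 100_000
y = rng.integers(0, 2, size=n)                    # true labels y(X)
X = rng.standard_normal(n) + np.where(y == 1, 1.0, -1.0)

def h(x):
    return (x > 0).astype(int)                    # a candidate classifier: threshold at 0

F_hat = float(np.mean(h(X) != y))                 # empirical Prob[h(X) != y(X)]
print(f"estimated F(h) = {F_hat:.3f}")            # true value here is about 0.159
```

For this symmetric model the threshold-at-zero rule is in fact the Bayes classifier, so no h can do better than this estimated error.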
Statistical PR contd.
Recall that fi(X) denotes the probability density function of feature vectors of class-i (the class conditional densities).
Let pi = Prob[y(X) = i]. These are called prior probabilities.
Recall the posterior probabilities, qi(X) = Prob[y(X) = i | X]. Now, by Bayes theorem,
qi(X) = fi(X) pi / Z
where Z = f0(X) p0 + f1(X) p1 is the normalising constant.
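This computation is easy to write out directly; the Gaussian class conditional densities and equal priors below are illustrative assumptions:

```python
import math

def posteriors(x, p0, p1, f0, f1):
    """Bayes theorem from the slide: q_i(x) = f_i(x) * p_i / Z,
    with Z = f0(x)*p0 + f1(x)*p1 as the normalising constant."""
    Z = f0(x) * p0 + f1(x) * p1
    return f0(x) * p0 / Z, f1(x) * p1 / Z

def gauss(mu, sigma):
    """1-D Gaussian density with mean mu and std sigma."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

q0, q1 = posteriors(0.2, 0.5, 0.5, gauss(-1.0, 1.0), gauss(1.0, 1.0))
print(q0, q1)   # the two posteriors always sum to 1
```

Note that Z never needs to be computed when we only compare q0 with q1, since it is common to both.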
Bayes Classifier
Consider the classifier given by
h(X) = 0 if q0(X)/q1(X) > 1
       = 1 otherwise
This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.
q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).
We will prove the optimality of the Bayes classifier later.
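Using the equivalence q0(X) > q1(X) ⟺ p0 f0(X) > p1 f1(X), the Bayes classifier is one comparison, with no normalising constant needed. The densities and priors below are again illustrative assumptions:

```python
import math

p0, p1 = 0.5, 0.5
f0 = lambda x: math.exp(-0.5 * (x + 1.0) ** 2) / math.sqrt(2 * math.pi)   # class-0 density, N(-1, 1)
f1 = lambda x: math.exp(-0.5 * (x - 1.0) ** 2) / math.sqrt(2 * math.pi)   # class-1 density, N(+1, 1)

def bayes_classify(x):
    # h(X) = 0 if q0(X)/q1(X) > 1, i.e. p0*f0(X) > p1*f1(X); else 1.
    return 0 if p0 * f0(x) > p1 * f1(x) else 1

print(bayes_classify(-0.5), bayes_classify(0.5))
```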
story so far
We consider PR as a two-step process: feature measurement/extraction and classification.
A classifier is to map feature vectors to class labels.
The main difficulty in designing classifiers is due to large variability in feature vector values.
The main information we have for the design is a training set of examples.
Function learning is a closely related problem.
In both cases we need to learn from (training) examples.
story so far
The statistical view is: model the variation in feature vectors from a given class through probability models.
One objective can be: find a classifier that has the least probability of error.
For example, the Bayes Classifier is the optimal one if we know the class conditional densities.
Organization of the course
Overview of classifier learning strategies
Nearest Neighbour classification rule
Bayes Classifier for minimizing risk
Some variations on this theme (e.g., Neyman-Pearson Classifier)
Techniques for estimating class conditional densities to implement the Bayes classifier
Organization of the course
Linear discriminant functions
Learning linear discriminant functions (Perceptron,
LMS, logistic regression)
Linear least-squares regression
Simple overview of statistical learning theory
Empirical risk minimization and VC theory
Organization of the course
Learning nonlinear classifiers
Neural Networks (Backpropagation, RBF networks)
Support Vector Machines
Kernel-based methods
Feature extraction and dimensionality reduction (PCA)
Boosting and ensemble classifiers (AdaBoost)
Thank You!