Pattern recognition
Lasse Holmström and Petri Koistinen
TABLE 1 | Typical application areas for pattern recognition
Automated visual inspection of products in manufacturing
Automatic speech recognition
Classification of text into categories (e.g., spam vs. non-spam)
Character recognition (for printed or for handwritten text)
Computer-aided diagnosis using a variety of medical data
Classification of ground cover types in remote sensing
Face and gesture recognition from images or image sequences

We denote the true class of the object by J. The classification is to be made on the basis of features Xi measured from the object. Together they form the feature vector X = [X1, . . . , Xd]^T. The feature vector exhibits random variation, partly due to the different properties of the different classes and partly due to variation within each class. We regard the class J and the feature vector X as random quantities which have a joint distribution. While the class J is a discrete random variable, the feature vector X can have either a continuous or a discrete distribution, or some of its components may be discrete and the others continuous.

Many authors avoid the use of a separate random variable to denote the class of the object and instead denote the jth class, for example, by ωj and use a phrase like 'X belongs to class ωj', or even the corresponding notation X ∈ ωj, to indicate that the class of the object with feature vector X is j. Especially the notation X ∈ ωj can be confusing for the newcomer who has not yet internalized the central idea that the feature vector of an object does not determine its class uniquely. It is therefore better to use a separate random variable (such as J) to denote the class of the object.

The pattern classification (or discrimination) task is to design a classifier which tries to guess the class J of the object based on the value of the feature vector X. This guess is calculated by a classifier g (also called a decision or discrimination rule), which is simply a function defined on R^d such that g(x) is the classifier's guess of J when the feature vector X has the value x. We then say that X is classified, allocated, or assigned to class g(X). The classifier errs when g(X) ≠ J. An alternative viewpoint is that the classifier g determines to which of the decision regions Aj = {x ∈ R^d : g(x) = j} the feature vector X belongs. The boundaries of the decision regions are called decision boundaries.

Usually, the classifier g returns one of the valid class labels 1, . . . , c. In some applications, however, the classifier is also allowed to reject the feature vector, which is known as the reject option. Rejected feature vectors are set aside, for example, to be classified by a human expert. Rejection can be represented by a separate label, such as g(x) = 0.

If the joint distribution of X and J is known, then decision theory allows us to find classifiers which are optimal according to various criteria. These theoretically optimal classifiers can be kept as guides when some aspects of the joint distribution must be estimated. According to the multiplication rule of probability theory, the joint density of two random quantities can be factorized as the marginal density of one times the conditional density of the other. Therefore, the joint density f(x, j) of X and J can be factorized as

f(x, j) = Pj fj(x) = fX(x) P(j|x).    (1)

In the first factorization, Pj = P(J = j) is the marginal probability of class j, and fj(x) is the class-conditional density (probability density function or probability mass function) of X given that J = j. In the second factorization, fX(x) is the marginal density of X, and P(j|x) is the conditional probability of J = j given that X = x. Pj is called the prior probability and P(j|x) the posterior probability of class j. Both the prior and the posterior probabilities of the classes sum to one. From the factorizations (1), we obtain the posterior probabilities as

P(j|x) = Pj fj(x) / fX(x),  where  fX(x) = Σ_{j=1}^c Pj fj(x).    (2)

See Figure 1 for an illustration of these concepts.

We next derive the form of the classifier which has the least possible risk. Let λ(j0, j) be the loss (or cost) when J = j0 and g(X) = j, for j0 = 1, . . . , c and j = 1, . . . , c. We may allow the reject option by defining λ(j0, j) also for j = 0. Then the expected loss or risk for classifier g is given by

R(g) = E[λ(J, g(X))].    (3)

Usually, one would penalize only for misclassifications, which corresponds to choosing λ(j, j) = 0 for all j. If we further select λ(j0, j) equal to one when j0 ≠ j, then R(g) is simply the error probability of g. The situation where certain kinds of misclassifications are more serious than others can be modeled by using unequal losses.
FIGURE 1 | The joint distribution of class J and feature vector X when there are two classes and the feature vector dimension is one. The first panel shows a jittered scatter plot of (X, J), the middle panel shows the class-conditional densities f1 and f2 as well as the marginal density fX, whereas the last panel shows the two posterior probabilities. The prior probabilities are P1 = 0.6 and P2 = 1 − P1.
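As a concrete illustration of the posteriors (2) and the risk (3), the following minimal sketch assumes two made-up univariate Gaussian class-conditional densities and the priors P1 = 0.6, P2 = 0.4 in the spirit of Figure 1; the specific densities are illustrative choices, not those used to draw the figure.

```python
import numpy as np
from scipy.stats import norm

# Made-up class-conditional densities and priors, for illustration only.
priors = np.array([0.6, 0.4])            # P_1, P_2
f1 = norm(loc=-1.0, scale=1.0)           # class-conditional density f_1
f2 = norm(loc=2.0, scale=1.5)            # class-conditional density f_2

def posteriors(x):
    """Posterior probabilities computed via Eq. (2); index 0 is class 1, index 1 is class 2."""
    joint = np.array([priors[0] * f1.pdf(x), priors[1] * f2.pdf(x)])  # P_j f_j(x)
    return joint / joint.sum(axis=0)     # divide by the marginal density f_X(x)

def estimate_risk(g, loss, n=100_000, seed=0):
    """Monte Carlo estimate of the risk R(g) = E[loss(J, g(X))] of Eq. (3)."""
    rng = np.random.default_rng(seed)
    j = rng.choice(2, size=n, p=priors)  # sample the class J (0-based here)
    x = np.where(j == 0, rng.normal(-1.0, 1.0, n), rng.normal(2.0, 1.5, n))
    decisions = np.array([g(xi) for xi in x])
    return loss[j, decisions].mean()

zero_one_loss = 1.0 - np.eye(2)            # penalize one unit per misclassification
g = lambda x: int(posteriors(x)[1] > 0.5)  # assign x to the more probable class
print(estimate_risk(g, zero_one_loss))     # approximates the error probability of g
```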
By writing the expectation (3) as an iterated expectation, we obtain

R(g) = E[E[λ(J, g(X)) | X]] = ∫_{R^d} Σ_{j0=1}^c λ(j0, g(x)) P(j0|x) fX(x) dx.

(This formula is valid when X is continuously distributed, but in the general case, the result is a Lebesgue integral with respect to the marginal distribution of X.) Because fX(x) ≥ 0, it is clear that the optimal classifier g* is obtained by minimizing at each x the conditional risk

E[λ(J, g(X)) | X = x] = Σ_{j0=1}^c λ(j0, g(x)) P(j0|x).

Such is always the case in Bayesian decision theory. Hence, the optimal classifier is

g*(x) = arg min_j Σ_{j0=1}^c λ(j0, j) P(j0|x),    (4)

where arg min_j selects that value of the argument j which minimizes the expression following it, when j ranges over the possible values: over 1, . . . , c when rejection is not allowed, and over 0, 1, . . . , c otherwise. If there are ties, then the minimizing argument can be selected arbitrarily among the minimizers. (We also use the arg max operator, which is defined similarly.) The resulting classifier g* is the Bayes classifier for minimum risk.

If we penalize one unit for each misclassification and do not allow rejections, then the loss λ(j0, j) is zero when j0 = j and one when j0 ≠ j, where j0, j ∈ {1, . . . , c}. The corresponding risk is the classifier error probability, which is the most widely used criterion in practice. In this case

Σ_{j0=1}^c λ(j0, j) P(j0|x) = Σ_{j0≠j} P(j0|x) = 1 − P(j|x).

Here, we used the fact that the posterior probabilities of the classes sum to one. Because minimizing 1 − P(j|x) is the same as maximizing P(j|x), we see that the Bayes classifier for minimum error probability is given by

g*(x) = arg max_{j=1,...,c} P(j|x).    (5)

The resulting classifier is often called simply the Bayes classifier. It assigns x to that class whose posterior probability is greatest. When the class-conditional distributions overlap, even the Bayes classifier has a positive error probability. Using Eq. (2) and noticing that the positive factor fX(x) is common to all the classes, we obtain the following alternative expression for the Bayes classifier,

g*(x) = arg max_{j=1,...,c} Pj fj(x).    (6)

Besides the risk, other kinds of criteria are important, such as the Neyman–Pearson criterion, which is available in a two-class problem; see, for example, Ref 3. There the classification error of one class is minimized given a fixed error probability for the other class.
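To make Eqs. (4) and (5) concrete, the following minimal sketch evaluates the Bayes rule for a generic loss matrix, with an optional reject action as an extra column; the posterior values and loss entries are made-up illustrations.

```python
import numpy as np

def bayes_decision(posterior, loss):
    """Bayes classifier of Eq. (4): choose the action minimizing the conditional risk at x.

    posterior: shape (c,), the posterior probabilities P(j|x).
    loss: shape (c, a), loss[j0, a] = lambda(j0, action a); the action set may
          include a reject action in addition to the c class labels.
    """
    conditional_risk = posterior @ loss   # sum over j0 of lambda(j0, a) P(j0|x), per action
    return int(np.argmin(conditional_risk))

# Made-up example with c = 3 classes and rejection as the last action:
# correct decisions cost 0, misclassifications cost 1, rejection costs 0.2.
c = 3
loss = np.hstack([1.0 - np.eye(c), np.full((c, 1), 0.2)])

posterior = np.array([0.42, 0.38, 0.20])          # some posterior P(j|x) at a given x
print(bayes_decision(posterior, loss))            # chooses the reject action (index 3)

# With the 0-1 loss and no reject action, Eq. (4) reduces to Eq. (5):
print(bayes_decision(posterior, 1.0 - np.eye(c)) == int(np.argmax(posterior)))  # True
```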
needed for the quadratic classifier. The resulting classifier selects the maximizing argument among c first-degree polynomials. In a pioneering work published in 1936, R. A. Fisher proposed a method which is equivalent with this in the two-class case. This approach can be called Gaussian linear discriminant analysis (LDA), or the (normal-based) linear classifier.

Often the sample sizes from each of the classes are so small compared with the feature vector dimension that the covariance estimates in the heteroscedastic model are highly variable. One option is then to use the homoscedastic model, even when there are no grounds for believing that the covariance matrices are equal. Another option is to use regularized discriminant analysis (RDA),23 which regularizes the covariance estimates of QDA by shrinking them, first, toward the pooled covariance estimate and, second, toward a multiple of the identity matrix. The method uses two regularizing parameters, which can be selected by cross-validation.

As discussed by Ripley (Section 2.4),11 several predictive classifiers can be expressed in closed form within the normal model. See Ref 24 for more recent work.

The smoothing parameters hj are selected to minimize the classification error probability using, for example, cross-validation on the training set. While accurate estimates f̂j(x; hj) of the class-conditional densities fj of course will lead to a good classifier, what really matters is that the classifier properly models the decision boundaries. For this reason, even biased kernel estimates may produce a classifier with a low classification error.

Sometimes the kernel estimate of a density improves if the shapes of the kernel and the density are chosen to be similar. The shape of the kernel can be adjusted by generalizing (9) to

f̂(x; H) = (1/n) Σ_{i=1}^n K_H(x − Xi),

where H is a symmetric positive definite scaling matrix and K_H(x) = |H^{−1/2}| K(H^{−1/2} x). A special case is a diagonal matrix H = diag(h1, . . . , hd), which allows different levels of smoothing for the different variables of the feature vector.

Another popular nonparametric approach to classification has been the k-nearest neighbor method (k-NN). It can be interpreted as an instance of the plug-in rule (7) as follows. Let 1 ≤ k ≤ n and x ∈ R^d. Consider the training data fixed and permute the k training vectors Xi nearest to x in an order of increasing distance from x,
FIGURE 2 | Decision regions obtained using QDA (left) and the 1-nearest neighbor classifier (right) for the same training data.
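A minimal sketch of the kind of comparison shown in Figure 2, fitting QDA and a 1-nearest neighbor classifier to the same training data; it assumes scikit-learn is available and uses randomly generated two-class data, not the data behind the figure.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Made-up two-class training data: two Gaussian clouds in the plane.
rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(n, 2)),
               rng.normal([2.0, 1.0], 1.5, size=(n, 2))])
y = np.repeat([1, 2], n)

qda = QuadraticDiscriminantAnalysis().fit(X, y)       # normal-based quadratic classifier
nn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)   # 1-nearest neighbor classifier

# The two classifiers carve the feature space into quite different decision regions.
grid = np.array([[x1, x2] for x1 in np.linspace(-8, 6, 200)
                          for x2 in np.linspace(-8, 8, 200)])
print((qda.predict(grid) != nn1.predict(grid)).mean())  # fraction of the grid where they disagree
```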
so that x is assigned to the class with the most training vectors among its k nearest neighbors. Ties need to be resolved separately, and one may, for example, choose that class whose farthest training vector from x is closest to it among the tied classes. The justification of the k-nearest neighbor classifier through density estimation is most plausible when k ≫ 1 and n is large with k ≪ n. Then even a small neighborhood of x can contain many training vectors from all classes and the approximations in Eq. (11) are credible. In practice, however, very small values of k are often used, one of the most popular nonparametric discrimination methods being in fact the 1-nearest neighbor classifier. Figure 2 shows decision regions obtained with QDA and the 1-nearest neighbor classifier.

For large n, finding the nearest neighbors can incur a heavy computational overhead, and various training set preprocessing schemes have been proposed to alleviate this problem. Computational efficiency can also be improved using so-called editing techniques that aim to select representative subsets of training vectors for each class and construct a classifier based on the training set thus reduced in size. For further reading on kernel discriminant analysis and nearest neighbor methods, see, for example, Refs 3,10,12.

log[P(1|x) / P(2|x)] = α + β^T x,    (13)

where P(2|x) = 1 − P(1|x). The assumption (13) is equivalent with

P(1|x) = exp(α + β^T x) / (1 + exp(α + β^T x)),
P(2|x) = 1 / (1 + exp(α + β^T x)).    (14)

The marginal distribution of X is here typically left unmodeled by conditioning on the feature vectors in the training data.

Conditionally on Xi, the class Ji has the value 1 with probability P(1|Xi) and the value 2 with probability P(2|Xi), and so the conditional likelihood for (α, β) is

Π_{i=1}^n [P(1|Xi)]^{1(Ji=1)} [P(2|Xi)]^{1(Ji=2)},

where the two posterior probabilities are as in Eq. (14), and the indicators of the two classes 1(Ji = 1) and 1(Ji = 2) select the first term when Ji = 1 and the second term when Ji = 2. This is suitable both for separate and for mixture sampling.
The parameter values maximizing the conditional likelihood can be calculated with widely available software for fitting generalized linear models (GLMs), and then one can use Eq. (8), which is equivalent with classifying to class one whenever P̂(1|x) > 0.5.

The multiclass analog of the logistic regression model is the multinomial (or multiple) logistic model, where the posterior probabilities are modeled as

log[P(j|x) / P(c|x)] = rj(x),  j = 1, . . . , c − 1,    (15)

where rj(x) = αj + βj^T x are c − 1 linear functions. Equivalently,

P(j|x) = exp(rj(x)) / Σ_{k=1}^c exp(rk(x)),  j = 1, . . . , c,    (16)

where we have made use of the convention that exp(rc(x)) ≡ 1. The conditional likelihood is now multinomial,

Π_{i=1}^n Π_{j=1}^c [P(j|Xi)]^{1(Ji=j)},

and also this model can be fitted using software for GLMs. Instead of the linear form, one can also choose the functions rj to be nonlinear in their parameters; for example, they could be neural networks of the feedforward type, such as multilayer perceptrons or radial basis function networks. However, then the model falls outside the class of GLMs.
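The softmax posteriors (16) and the multinomial conditional log-likelihood can be sketched as follows; the parameter values and data are made up, and in practice the maximization would be left to GLM or general-purpose optimization software as noted above.

```python
import numpy as np

def softmax_posteriors(x, alpha, beta):
    """Posterior probabilities of Eq. (16) with linear r_j(x) = alpha_j + beta_j^T x.

    alpha has shape (c,) and beta has shape (c, d); the last class is fixed by the
    convention exp(r_c(x)) = 1, i.e. alpha[c-1] = 0 and beta[c-1] = 0.
    """
    r = alpha + beta @ x          # r_j(x), j = 1, ..., c
    e = np.exp(r - r.max())       # subtracting the maximum leaves the ratios unchanged
    return e / e.sum()

def neg_log_likelihood(alpha, beta, X, J):
    """Negative multinomial conditional log-likelihood of the training data (classes 1, ..., c)."""
    return -sum(np.log(softmax_posteriors(x, alpha, beta)[j - 1]) for x, j in zip(X, J))

# Made-up example with c = 3 classes and d = 2 features.
alpha = np.array([0.5, -0.2, 0.0])
beta = np.array([[1.0, -1.0], [0.3, 0.8], [0.0, 0.0]])
X = np.array([[0.2, 1.1], [-0.7, 0.4]])
J = np.array([1, 3])
print(neg_log_likelihood(alpha, beta, X, J))
```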
becomes either an input to other units or the output of the network. If there are no nonlinearities in the output layer, then the MLP evaluates the function r(x, w), where

rj(x, w) = Σ_{i=1}^m w^(2)_{ji} ϕ(Σ_{ℓ=1}^d w^(1)_{iℓ} xℓ + w^(1)_{i0}) + w^(2)_{j0},  j = 1, . . . , c.    (17)

Here the weight vector w contains all the weights (and biases) of the MLP: w^(1)_{iℓ} is the weight from the ℓth input to the ith hidden unit and w^(1)_{i0} is its bias weight; w^(2)_{ji} is the weight from the ith hidden unit to the jth output unit and w^(2)_{j0} is its bias weight. The most popular choice for the nonlinearity ϕ is the logistic sigmoid ϕ(z) = 1/(1 + exp(−z)). Thanks to their flexibility, MLPs have been very popular in all kinds of nonlinear regression and classification tasks.

As already mentioned, one approach to fitting an MLP is to use c − 1 outputs (or even an overparametrized model with c outputs) together with the structure (16) (which is often called the softmax function) and the multinomial conditional likelihood. However, perhaps the most widely used fitting criterion in classification tasks is the squared error criterion

Σ_{i=1}^n ‖ti − r(Xi, w)‖² = min_w!
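A minimal sketch of evaluating Eq. (17) for a one-hidden-layer MLP with linear output units, together with the squared error term for one training pair; the weight values, targets, and layer sizes are made-up illustrations.

```python
import numpy as np

def mlp_outputs(x, W1, b1, W2, b2):
    """Evaluate Eq. (17): a one-hidden-layer MLP with linear output units.

    W1 (shape (m, d)) holds the weights w^(1)_{il}, b1 (shape (m,)) the hidden biases
    w^(1)_{i0}; W2 (shape (c, m)) holds w^(2)_{ji}, b2 (shape (c,)) the output biases w^(2)_{j0}.
    """
    phi = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid nonlinearity
    hidden = phi(W1 @ x + b1)                  # activations of the m hidden units
    return W2 @ hidden + b2                    # outputs r_j(x, w), j = 1, ..., c

# Made-up weights for d = 2 inputs, m = 3 hidden units, and c = 2 outputs.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = np.array([0.5, -1.2])
r = mlp_outputs(x, W1, b1, W2, b2)
print(r)                                       # network outputs r(x, w)

t = np.array([1.0, 0.0])                       # target vector coding the true class
print(np.sum((t - r) ** 2))                    # squared error term for this training pair
```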
See Ref 25 for a view on the relationship between neural and statistical classifiers and, for example, Refs 11,19 for more information on neural network models for classification.

SUPPORT VECTOR MACHINES

Support vector machines (SVMs)26 are currently of great interest to theoretical and applied researchers, and they have strong connections to computational learning theory. The basic idea is easiest to understand when we have a linearly separable two-class problem; the resulting classifier is called the maximal margin classifier. The idea is to search for the optimal separating hyperplane which has the maximal margin of separation between the training vectors from the two classes, so maximal margin classifiers estimate the decision boundary directly. Being a separating hyperplane means that the training vectors from the two classes lie on different sides of the hyperplane, and having maximal margin means that the distance from the hyperplane to the nearest training vector is maximal. The support vectors are those training vectors which lie nearest to the optimal hyperplane. This optimization problem can be formulated as a quadratic programming problem. In real applications, the training data is usually not linearly separable, and then the maximal margin hyperplane does not exist. A solution is to seek the so-called soft-margin hyperplane instead. Also this leads to a quadratic program. As the construction of SVM classifiers leads to standard convex optimization problems, there are no complications with local minima as there are with MLPs. These quadratic programs can be solved either by general-purpose quadratic program solvers or by techniques developed specially for SVMs.

Suppose we transform the original feature vectors into some high-dimensional or even infinite-dimensional (Hilbert) space using a nonlinear mapping ϕ before constructing the maximal margin or the soft-margin hyperplane. Using a dual formulation of the original quadratic program, one obtains another quadratic program, which depends on the training vectors only through their inner products. In the transformed space, the inner products can be represented using a kernel function

⟨ϕ(Xi), ϕ(Xj)⟩ = K(Xi, Xj).

Thus, the inner products needed for the construction of the SVM classifier can be calculated in the original feature space. Also the resulting classifier can be implemented using the kernel, and then working out what the nonlinear mapping ϕ is becomes unnecessary. This is the so-called kernel trick. By Mercer's theorem, the kernel has to be a non-negative definite function. In applications, one may start by choosing a convenient form for the kernel among the wide selection of valid kernels available in the literature.

SVM classifiers can also be used in a multiclass problem using one of several approaches. For more information on SVMs, see, for example, Refs 4,16,18,27 and the Web page.28
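A minimal sketch of training a soft-margin SVM classifier with a Gaussian (RBF) kernel, assuming scikit-learn; the data, the kernel choice, and the penalty parameter C are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up two-class data that is not linearly separable in the original feature space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, 2)   # the class depends on the radius

# Soft-margin SVM with the kernel K(u, v) = exp(-gamma * ||u - v||^2);
# C controls the trade-off between a wide margin and margin violations.
svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

print(svm.n_support_)                         # number of support vectors in each class
print(svm.predict([[0.0, 0.0], [2.0, 2.0]]))  # classify two new feature vectors
```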
OTHER METHODS

A number of methods construct the decision regions using recursive partitioning of the feature space. Well-known approaches in this vein are the tree-structured classifiers, such as the classification and regression tree method that includes the original CART and its variants. An important property of this approach is that, where most classifiers operate as 'black boxes', tree classifiers are also able to provide a potentially useful explanation for the assignment of a pattern to a particular class in the form of threshold values on its features Xi. Another graphical-model-based approach is the Bayesian belief network.

To improve discrimination performance, the combination of several classifiers into a single 'committee' classifier has also received much attention in recent years. This approach includes, for example, bagging and boosting, which, by perturbing the training set, generate a collection of classifiers that are then combined into a single classifier.

For more information on these and other techniques not covered in this article, see, for example, Refs 11,16,29.

FEATURE EXTRACTION AND SELECTION

Often the measured pattern vectors are too high dimensional for useful estimation of a classifier. The raw patterns are therefore first transformed into lower dimensional feature vectors in a way that hopefully preserves the salient class information of the original measurements. However, such a transformation can never improve the theoretically optimal classification result. Indeed, let ϕ : R^D → R^d be the transformation of the raw pattern vector Y into the feature vector X (here usually d ≪ D). Then any classifier g in R^d induces a classifier g∘ϕ in the original pattern space, and if the lowest (Bayes) error for Y is e*, then we have P(g(X) ≠ J) = P((g∘ϕ)(Y) ≠ J) ≥ e*.
Still, in practice, dimension reduction makes classifier estimation easier, which often more than compensates for the possible loss of classification information in the transformation. Also, if one wishes to use a particular classifier type, a clever feature transformation can sometimes be used to improve classifier performance. A prime example of this idea is the support vector machine, which combines a nonlinear mapping with a simple linear classifier.

Ideally, the classifier and the feature transformation should be designed in a joint process, but in practice these two design steps are often separated. While application-specific information is usually needed for best results, there are also some generic methods for transforming the raw patterns into feature vectors. These include feature selection, feature extraction typically based on a linear mapping, as well as general purpose dimension reduction techniques.

In feature selection, one tries to choose from the raw pattern Y = [Y1, . . . , YD]^T the variables Yi that are most useful in discrimination. Thus, X = ϕ(Y) = [Y_{i1}, . . . , Y_{id}]^T, where i1 < · · · < id. Estimated classification error or some other measure of class separation can be used to rank the performance of different subsets of indices. Because of combinatorial complexity, exhaustive search through all possible subsets is rarely feasible, and one resorts to various suboptimal, incremental schemes that add or delete one feature at a time.

A typical linear feature extraction method uses a transformation of the form ϕ(Y) = [a1^T Y, . . . , ad^T Y]^T, where a1, . . . , ad are orthonormal vectors chosen to optimize some measure of within-class spread and between-class separation. The ai can be selected, for example, from among the eigenvectors of the sample within-class covariance matrix SW to maximize the ratio ai^T SB ai / λi, where SB is the sample between-class covariance matrix and λi is the eigenvalue corresponding to ai. Thus,

SW = Σ_{j=1}^c (nj/n) Σ̂j,   SB = Σ_{j=1}^c (nj/n) (µ̂j − µ̂)(µ̂j − µ̂)^T,

indexed in an order of decreasing eigenvalues, we have that Var(X1) ≥ · · · ≥ Var(XD). By defining a feature vector as X = [X1, . . . , Xd]^T, where d ≪ D, dimension reduction is achieved while at the same time most of the variability in the original pattern vector Y can be captured. Of course, this approach does not take into account the class labels of the patterns in any way, and indeed there are versions of principal component analysis which incorporate this information, too. Another general purpose dimension reduction and feature extraction technique that has found some use in pattern recognition is metric multidimensional scaling.

Good accounts of various feature selection and extraction methods can be found, for example, in Refs 3 and 6.
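The principal component construction described above can be sketched as follows: project the centered raw patterns onto the eigenvectors of their sample covariance matrix that correspond to the d largest eigenvalues. The data here are randomly generated for illustration only.

```python
import numpy as np

def pca_features(Y, d):
    """Linear feature extraction phi(Y) = [a_1^T Y, ..., a_d^T Y]^T, where the a_i are
    the eigenvectors of the sample covariance matrix with the d largest eigenvalues."""
    Y_centered = Y - Y.mean(axis=0)
    cov = np.cov(Y_centered, rowvar=False)             # D x D sample covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1][:d]          # indices of the d largest eigenvalues
    A = eigenvectors[:, order]                         # D x d matrix [a_1, ..., a_d]
    return Y_centered @ A                              # n x d matrix of feature vectors X

# Made-up high-dimensional patterns: D = 20 correlated raw measurements, reduced to d = 3.
rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))
X = pca_features(Y, d=3)
print(X.shape)               # (500, 3)
print(np.var(X, axis=0))     # Var(X1) >= Var(X2) >= Var(X3), as discussed above
```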
CLASSIFIER ASSESSMENT

Once the classifier has been implemented, one should assess whether it meets the design criteria, such as being sufficiently quick to compute and having an adequate error rate. A naive way of estimating the error rate (assuming mixture sampling) is to calculate the relative frequency of errors in the design sample. This is called the resubstitution estimator or the apparent error rate. It is optimistically biased, and the bias can be severe for complex classifiers. Instead, the recommended approach is to split the data into two separate sets, the training set and the test set. The classifier is estimated using data in the training set, and its performance is assessed on the independent test set. This is called the holdout estimate. Several classifiers require the choice of tuning parameters, or model, architecture, or kernel selection. This is typically based on cross-validation. In order to keep strict separation between the design and the test set, the cross-validation then needs to be done using the training set only.
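A minimal sketch of this workflow, assuming scikit-learn and made-up data: the tuning parameter of a k-nearest neighbor classifier is chosen by cross-validation on the training set only, and the error rate is then estimated on the held-out test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int) + 1

# Split once into a training set and an independent test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Choose the tuning parameter k by 5-fold cross-validation on the training set only.
candidate_k = [1, 3, 5, 9, 15]
cv_error = [1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_train, y_train, cv=5).mean()
            for k in candidate_k]
best_k = candidate_k[int(np.argmin(cv_error))]

# Holdout estimate of the error rate: fit on the training set, assess on the test set.
clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
holdout_error = np.mean(clf.predict(X_test) != y_test)
print(best_k, holdout_error)
```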
formulation of the real-world problem and from the invention of good features, not so much from new classification algorithms. This is especially relevant in new application domains.
REFERENCES
1. Hand DJ. Construction and Assessment of Classification Rules. Chichester, England: John Wiley & Sons; 1997.
2. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. New York: John Wiley & Sons; 2000.
3. Webb A. Statistical Pattern Recognition. 2nd ed. Chichester, England: John Wiley & Sons; 2002.
4. Theodoridis S, Koutroumbas K. Pattern Recognition. 4th ed. Burlington, MA: Academic Press; 2008.
5. Young TY, Calvert TW. Classification, Estimation and Pattern Recognition. New York: American Elsevier Publishing; 1974.
6. Devijver PA, Kittler J. Pattern Recognition: A Statistical Approach. London: Prentice-Hall; 1982.
7. Therrien CW. Decision, Estimation and Classification: An Introduction to Pattern Recognition and Related Topics. New York: John Wiley & Sons; 1989.
8. Schalkoff RJ. Pattern Recognition: Statistical, Structural and Neural Approaches. New York: John Wiley & Sons; 1992.
9. Fukunaga K. Introduction to Statistical Pattern Recognition. 2nd ed. San Diego, CA: Academic Press; 1990 (original ed. published 1972).
10. McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley & Sons; 1992.
11. Ripley BD. Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge University Press; 1996.
12. Devroye L, Györfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. New York: Springer; 1996.
13. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London: Academic Press; 1979.
14. Krzanowski WJ, Marriott FHC. Multivariate Analysis: Part 2, Classification, Covariance Structures and Repeated Measurements, vol. 2, Kendall's Library of Statistics. London: Edward Arnold; 1995.
15. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. New York: John Wiley & Sons; 2003.
16. Izenman AJ. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. New York: Springer; 2008.
17. Hand D, Mannila H, Smyth P. Principles of Data Mining. Cambridge, MA: The MIT Press; 2001.
18. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer; 2001.
19. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006.
20. Krishnaiah PR, Kanal LN, eds. Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality. Amsterdam: North-Holland; 1982.
21. Young TY, Fu K-S, eds. Handbook of Pattern Recognition and Image Processing. New York: Academic Press; 1986.
22. Chen CH, Wang PSP, eds. Handbook of Pattern Recognition and Computer Vision. 3rd ed. Singapore: World Scientific; 2005.
23. Friedman JH. Regularized discriminant analysis. J Am Stat Assoc 1989, 84:165–175.
24. Srivastava S, Gupta MR, Frigyik BA. Bayesian quadratic discriminant analysis. J Mach Learn Res 2007, 8:1277–1305.
25. Holmström L, Koistinen P, Laaksonen J, Oja E. Neural and statistical classifiers—taxonomy and two case studies. IEEE Trans Neural Netw 1997, 8:5–17.
26. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995, 20:273–297.
27. Cristianini N, Shawe-Taylor J. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press; 2000.
28. Kernel-machines.org. http://www.kernel-machines.org/. Accessed May 3, 2010.
29. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002.