
Overview

Pattern recognition
Lasse Holmström1∗ and Petri Koistinen2

We give an overview of pattern recognition, concentrating on the problem of
pattern classification. Several popular discrimination methods are reviewed using
decision theory as a unifying framework. © 2010 John Wiley & Sons, Inc. WIREs Comp Stat
2010, 2:404–413

∗ Correspondence to: [email protected]
1 Department of Mathematical Sciences, University of Oulu, Oulu, Finland
2 Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
DOI: 10.1002/wics.99

Pattern recognition is an engineering discipline, where the goal is to build systems that are able to classify real-world objects of interest into one of a number of classes on the basis of measurements. The objects or patterns can be, for example, printed or handwritten characters, biological cells, or acoustic or electronic signals. The measurements can be, for example, signal waveforms, images, or image sequences. Table 1 lists typical application areas. To achieve the goal, one may apply techniques from signal and image processing, computer science, neural computation, and statistics.

Often the raw measurements one can make are so many in number that inferring their statistical properties is hopeless. Therefore, they are usually first transformed into a vector whose components are called features. This transformation is called feature selection or feature extraction and it is very much application dependent. Although some generic methods are available, one should study the application-specific literature carefully to identify promising features.

When each pattern is represented by a numerical feature vector, one speaks of statistical pattern recognition. There are approaches grouped under the term syntactic or structural pattern recognition, where the patterns are represented using more complicated structures such as strings in a formal grammar. The rest of this overview discusses statistical pattern recognition. If we have available a set of patterns whose classes and feature vectors are already known, then one speaks of supervised pattern recognition or discriminant analysis. However, sometimes the classes have not yet been defined, and one attempts to find classes of objects with similar properties. Then one has a problem in unsupervised pattern recognition or clustering. This article concentrates on supervised pattern recognition.

READING MATERIAL

Pattern recognition has a long history. It had its beginnings in the statistical literature of the 1930s. The advent of computers in the 1950s and 1960s brought a demand for practical applications, and the field developed significantly. The 1970s brought great developments in the probabilistic theory of pattern recognition. One significant boost for the field was the sudden growth of neural network research in the late 1980s and 1990s. Currently, new developments take place, for example, in the machine learning community. Emerging applications also drive the development of the field.

For further study of pattern recognition, we recommend one of the many texts currently available, such as Refs 1–4. Older but still useful texts include Refs 5–9. On a more advanced level, McLachlan10 gives a scholarly treatment of discriminant analysis and Ripley11 presents an advanced synthesis of statistical pattern recognition and neural networks, whereas the monograph12 can be consulted for rigorous probabilistic results on the generalization ability of a range of classification rules. Several texts on multivariate statistical analysis such as Refs 13–16 and texts and monographs on machine learning methods such as Refs 17–19 are relevant. Handbooks are also available, including Refs 20–22. A variety of scientific journals and conferences contain both theoretical and application-oriented articles on pattern recognition.

CLASSIFIERS BASED ON DECISION THEORY

In the pattern classification problem, an object is to be classified as belonging to one of the c mutually exclusive classes (or categories), labeled 1, . . . , c.




TABLE 1 | Typical application areas for pattern recognition.

Automated visual inspection of products in manufacturing
Automatic speech recognition
Classification of text into categories (e.g., spam vs. non-spam)
Character recognition (for printed or for handwritten text)
Computer-aided diagnosis using a variety of medical data
Classification of ground cover types in remote sensing
Face and gesture recognition from images or image sequences

We denote the true class of the object by J. The classification is to be made on the basis of features Xi measured from the object. Together they form the feature vector X = [X1, . . . , Xd]T. The feature vector exhibits random variation, partly due to the different properties of the different classes and partly due to variation within each class. We regard the class J and the feature vector X to be random quantities, which have a joint distribution. While the class J is a discrete random variable, the feature vector X can have either a continuous or a discrete distribution, or some of its components may have a discrete and the others a continuous distribution.

Many authors avoid the use of a separate random variable to denote the class of the object and instead denote the jth class, for example, by ωj and use a phrase like 'X belongs to class ωj' or even the corresponding notation X ∈ ωj to indicate that the class of the object with feature vector X is j. Especially the notation X ∈ ωj can be confusing for the newcomer, who has not yet internalized the central idea that the feature vector of the object does not determine its class uniquely. It is therefore better to use a separate random variable (such as J) to denote the class of the object.

The pattern classification (or discrimination) task is to design a classifier, which tries to guess the class J of the object based on the value of the feature vector X. This guess is calculated by a classifier g (also called a decision or discrimination rule), which is simply a function defined on Rd such that g(x) is the classifier's guess of J, when the feature vector X has the value x. We then say that X is classified, allocated, or assigned to class g(X). The classifier errs, when g(X) ≠ J. An alternative viewpoint is that the classifier g determines to which of the decision regions Aj = {x ∈ Rd : g(x) = j} the feature vector X belongs. Their boundaries are called decision boundaries.

Usually, the classifier g returns one of the valid class labels 1, . . . , c. In some applications, however, the classifier is also allowed to reject the feature vector, which is known as the reject option. Rejected feature vectors are set aside, for example, to be classified by a human expert. Rejection can be represented by a separate label, such as g(x) = 0.

If the joint distribution of X and J is known, then decision theory allows us to find classifiers which are optimal according to various criteria. These theoretically optimal classifiers can be kept as guides, when some aspects of the joint distribution must be estimated. According to the multiplication rule of probability theory, the joint density of two random quantities can be factorized as the marginal density of one times the conditional density of the other. Therefore, the joint density f(x, j) of X and J can be factorized as

f(x, j) = Pj fj(x) = fX(x) P(j|x).   (1)

In the first factorization, Pj = P(J = j) is the marginal probability of class j, and fj(x) is the class-conditional density (probability density function or probability mass function) of X given that J = j. In the second factorization, fX(x) is the marginal density of X, and P(j|x) is the conditional probability of J = j given that X = x. Pj is called the prior probability and P(j|x) the posterior probability of class j. Both the prior and the posterior probabilities of the classes sum to one. From the factorizations (1), we obtain the posterior probabilities as

P(j|x) = Pj fj(x) / fX(x),  where fX(x) = Σ_{j=1}^c Pj fj(x).   (2)

See Figure 1 for an illustration of these concepts.

We next derive the form of the classifier which has the least possible risk. Let λ(j0, j) be the loss (or cost), when J = j0 and g(X) = j, for j0 = 1, . . . , c and j = 1, . . . , c. We may allow the reject option by defining λ(j0, j) also for j = 0. Then the expected loss or risk for classifier g is given by

R(g) = E[λ(J, g(X))].   (3)

Usually, one would penalize only for misclassifications, which corresponds to choosing λ(j, j) = 0 for all j. If we further select λ(j0, j) equal to one when j0 ≠ j, then R(g) is simply the error probability of g. The situation where certain kinds of misclassifications are more serious than others can be modeled by using unequal losses.




FIGURE 1 | The joint distribution of class J and feature vector X, when there are two classes, and the feature vector dimension is one. The first panel shows a jittered scatter plot of (X, J), the middle panel shows the class-conditional densities f1 and f2 as well as the marginal density fX, whereas the last panel shows the two posterior probabilities. The prior probabilities are P1 = 0.6 and P2 = 1 − P1.

By writing the expectation (3) as an iterated expectation, we obtain

R(g) = E[E[λ(J, g(X)) | X]] = ∫_{Rd} Σ_{j0=1}^c λ(j0, g(x)) P(j0|x) fX(x) dx.

(This formula is valid when X is continuously distributed, but in the general case, the result is a Lebesgue integral with respect to the marginal distribution of X.) Because fX(x) ≥ 0, it is clear that the optimal classifier g∗ is obtained by minimizing at each x the conditional risk

E[λ(J, g(X)) | X = x] = Σ_{j0=1}^c λ(j0, g(x)) P(j0|x).

Such is always the case in Bayesian decision theory. Hence, the optimal classifier is

g∗(x) = arg min_j Σ_{j0=1}^c λ(j0, j) P(j0|x),   (4)

where arg min_j selects that value of the argument j which minimizes the expression following it, when j ranges over the possible values: over 1, . . . , c when rejection is not allowed, and over 0, 1, . . . , c otherwise. If there are ties, then the minimizing argument can be selected arbitrarily among the minimizers. (We also use the arg max operator, which is defined similarly.) The resulting classifier g∗ is the Bayes classifier for minimum risk.

If we penalize one unit for each misclassification and do not allow rejections, then the loss λ(j0, j) is zero when j0 = j and one when j0 ≠ j, and here j0, j ∈ {1, . . . , c}. The corresponding risk is the classifier error probability, which is the most widely used criterion in practice. In this case

Σ_{j0=1}^c λ(j0, j) P(j0|x) = Σ_{j0≠j} P(j0|x) = 1 − P(j|x).

Here, we used the fact that the posterior probabilities of the classes sum to one. Because minimizing 1 − P(j|x) is the same as maximizing P(j|x), we see that the Bayes classifier for minimum error probability is given by

g∗(x) = arg max_{j=1,...,c} P(j|x).   (5)

The resulting classifier is often called simply the Bayes classifier. It assigns x to the class whose posterior probability is greatest. When the class-conditional distributions overlap, even the Bayes classifier has positive error probability. Using Eq. (2) and noticing that the positive factor fX(x) is common to all the classes, we obtain the following alternative expression for the Bayes classifier,

g∗(x) = arg max_{j=1,...,c} Pj fj(x).   (6)

Besides the risk, other kinds of criteria are important, such as the Neyman–Pearson criterion, which is available in a two-class problem; see, for example, Ref 3. There the classification error of one class is minimized given a fixed error probability for the other class.
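To make Eqs (2), (5), and (6) concrete, the short sketch below computes posterior probabilities and the minimum-error Bayes classification for a two-class, one-dimensional setting similar to Figure 1. The unit-variance normal class-conditional densities, the class means, and the function names are illustrative assumptions, not specifications from the article.

```python
import numpy as np

# Hypothetical two-class example in one dimension (cf. Figure 1):
# class-conditional densities f1, f2 are normal with unit variance,
# prior probabilities P1 = 0.6, P2 = 0.4.
priors = np.array([0.6, 0.4])
means = np.array([-1.0, 2.0])

def class_density(x, j):
    """Class-conditional density f_j(x), here a N(mean_j, 1) density."""
    return np.exp(-0.5 * (x - means[j]) ** 2) / np.sqrt(2 * np.pi)

def posteriors(x):
    """Posterior probabilities P(j|x) from Eq. (2)."""
    joint = np.array([priors[j] * class_density(x, j) for j in range(2)])
    return joint / joint.sum()          # divide by f_X(x)

def bayes_classifier(x):
    """Minimum-error Bayes rule, Eqs (5)/(6); returns class label 1 or 2."""
    scores = [priors[j] * class_density(x, j) for j in range(2)]
    return int(np.argmax(scores)) + 1

x = 0.3
print(posteriors(x), bayes_classifier(x))
```

Because fX(x) is common to all classes, bayes_classifier compares only the products Pj fj(x), exactly as in Eq. (6).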




DESIGNING CLASSIFIERS ON THE BASIS OF TRAINING DATA

Although the formulas for the optimal classifiers depend on the joint distribution of the pair (X, J), one does not know it in practice, but only has available a sequence of pairs D = ((X1, J1), . . . , (Xn, Jn)) of feature vectors and classes of objects. One typically assumes that the training data are obtained either by mixture sampling or by separate sampling. In mixture sampling the training data are assumed to be an independent and identically distributed (i.i.d.) sample from the joint distribution of (X, J). In separate sampling, the feature vectors are assumed to be sampled independently and separately from each of the classes. Consequently, in separate sampling the sample sizes of the different classes do not necessarily reflect their prior probabilities. These training data (or design data) are then used for designing the classifier applying, for example, some of the following ideas.

1. One first forms estimates P̂j and f̂j(x) for the prior probabilities and class-conditional densities, which are then plugged into the formula for the optimal classifier. For example, instead of Eq. (6), one can use the plug-in classifier

   g(x) = arg max_{j=1,...,c} P̂j f̂j(x).   (7)

   The prior probabilities Pj are usually estimated by the relative frequencies of the classes in some large data set (which does not need to be the training set). The class-conditional densities fj(x) can be estimated using various parametric or nonparametric approaches.

2. The optimal classifiers depend on the joint distribution only through the posterior probabilities P(j|x). Therefore, one viable approach is to form directly estimates P̂(j|x) of the posterior probabilities and plug them into the formulas for the optimal classifiers. For example, instead of Eq. (5), one can use

   g(x) = arg max_{j=1,...,c} P̂(j|x).   (8)

   The posterior probabilities can be estimated using various parametric or nonparametric regression approaches.

3. While the two previous approaches use estimates as if they were equal to the unknown population quantities, in the Bayesian approach to statistics one usually averages over such uncertainty. A natural idea is to base the classification decision on the predictive distribution of the class label (i.e., its conditional distribution) conditional on the training data D and the observed feature vector x. The predictive distributions of the classes are then used in place of the theoretical class posterior probabilities. In some cases, the predictive distributions can be calculated in closed form, but otherwise they must be approximated, for example, by Markov chain Monte Carlo methods.

4. Another approach is to estimate the decision regions directly, for example, by making assumptions about the forms of the decision boundaries.

NORMAL-BASED CLASSIFIERS

A classical approach is to assume that the class-conditional distributions belong to a family of distributions described by a finite set of parameters. Once these parameters have been estimated, one can use the plug-in rule (7). The most popular choice is to model the distribution of X conditional on J = j as a multinormal distribution N(µj, Σj), where µj is the mean vector and Σj the covariance matrix of class j. The resulting quadratic and linear classifiers are among the most popular classifiers in applications and are discussed in all pattern recognition texts.

Letting fj(x | µj, Σj) denote the density of N(µj, Σj), one sees immediately that the logarithm of Pj fj(x | µj, Σj) is a quadratic function of x. The parameters µj and Σj can be estimated by the sample mean µ̂j and sample covariance matrix Σ̂j of those training feature vectors originating from class j, but other estimates of µj and Σj can be used as well. Once the parameter estimates have been plugged in, the classifier (7) is equivalent to selecting the maximal one out of c quadratic functions of x, which are called discriminant functions. This procedure is often called quadratic discriminant analysis (QDA), and the classifier may also be called the Gaussian classifier or the (normal-based) quadratic classifier.

Instead of the previous heteroscedastic model, where each class has its own covariance matrix, one can also use a homoscedastic model, where the covariance matrix is the same in all the classes, that is, Σ1 = · · · = Σc = Σ. In this case, the quadratic term is common to all the classes and can be canceled from the discriminant functions.




The common covariance matrix can be estimated by pooling the within-class covariance matrix estimates needed for the quadratic classifier. The resulting classifier selects the maximizing argument among c first-degree polynomials. In a pioneering work published in 1936, R. A. Fisher proposed a method, which is equivalent with this, in the two-class case. This approach can be called Gaussian linear discriminant analysis (LDA), or the (normal-based) linear classifier.

Often the sample sizes from each of the classes are so small compared with the feature vector dimension that the covariance estimates in the heteroscedastic model are highly variable. One option is then to use the homoscedastic model, even when there are no grounds for believing that the covariance matrices are equal. Another option is to use regularized discriminant analysis (RDA),23 which regularizes the covariance estimates of QDA by shrinking them, first, toward the pooled covariance estimate and, second, toward a multiple of the identity matrix. The method uses two regularizing parameters, which can be selected by cross-validation.

As discussed by Ripley (Section 2.4),11 several predictive classifiers can be expressed in closed form within the normal model. See Ref 24 for more recent work.
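To make the plug-in construction concrete, here is a minimal sketch of normal-based quadratic discriminant analysis under the heteroscedastic model: class priors, means, and covariances are estimated from labeled training data, and the logarithm of P̂j f̂j(x | µ̂j, Σ̂j) serves as the discriminant function. The simulated data and the function names are illustrative assumptions, not taken from the article.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate priors, means, and covariances per class (plug-in QDA)."""
    params = {}
    for j in np.unique(y):
        Xj = X[y == j]
        params[j] = (len(Xj) / len(X),          # prior estimate P̂_j
                     Xj.mean(axis=0),           # sample mean µ̂_j
                     np.cov(Xj, rowvar=False))  # sample covariance Σ̂_j
    return params

def qda_discriminant(x, prior, mean, cov):
    """log(P̂_j f_j(x)) up to an additive constant: a quadratic function of x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return np.log(prior) - 0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff)

def qda_classify(x, params):
    """Plug-in rule (7): pick the class with the largest discriminant."""
    return max(params, key=lambda j: qda_discriminant(x, *params[j]))

# Illustrative data: two bivariate normal classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (60, 2)), rng.normal([3, 2], 1.5, (40, 2))])
y = np.array([1] * 60 + [2] * 40)
print(qda_classify(np.array([2.0, 1.0]), fit_qda(X, y)))
```

Pooling the per-class covariance estimates into a single matrix and reusing the same discriminants would give the linear (LDA) variant, and shrinking the estimates as in RDA would change only the fitting step.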
NONPARAMETRIC APPROACHES

The normal-based classifier assumes that the class-conditional densities are Gaussian. In many situations, nonparametric estimates of the class-conditional densities may lead to a better plug-in classifier (7). The most popular nonparametric density estimator is the kernel method. Let X1, . . . , Xn be a random sample from a d-variate distribution with a density f. The kernel estimator of f is

f̂(x; h) = (1/n) Σ_{i=1}^n Kh(x − Xi),   (9)

where the kernel K satisfies ∫_{Rd} K = 1, h > 0 is the smoothing parameter, and Kh(x) = h^{−d} K(h^{−1} x) is the scaled kernel. The performance of the estimator does not depend much on the choice of K and one often uses the standard multinormal kernel. What matters is a proper value for the smoothing parameter h. A small value of h produces a spiky estimate with high variance, whereas a large value of h results in a smooth estimate with a large bias. Using training data from class j, the class-conditional density fj can be estimated by a kernel estimator f̂j(x; hj) and then used in the classifier (7). This is often called kernel discriminant analysis (KDA). The smoothing parameters hj are selected to minimize the classification error probability using, for example, cross-validation on the training set. While accurate estimates f̂j(x; hj) of the class-conditional densities fj of course will lead to a good classifier, what really matters is that the classifier properly models the decision boundaries. For this reason, even biased kernel estimates may produce a classifier with a low classification error.

Sometimes the kernel estimate of a density improves if the shapes of the kernel and the density are chosen to be similar. The shape of the kernel can be adjusted by generalizing (9) to f̂(x; H) = (1/n) Σ_{i=1}^n KH(x − Xi), where H is a symmetric positive definite scaling matrix and KH(x) = |H^{−1/2}| K(H^{−1/2} x). A special case is a diagonal matrix H = diag(h1, . . . , hd), which allows different levels of smoothing for different variables of the feature vector.
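The following sketch implements the kernel estimator (9) with a standard multinormal kernel and uses per-class estimates f̂j(x; hj) in the plug-in rule (7), i.e., a bare-bones kernel discriminant analysis. The bandwidth values, the simulated data, and the variable names are illustrative assumptions rather than recommendations from the article.

```python
import numpy as np

def kernel_density(x, data, h):
    """Kernel estimate (9) at point x with a standard multinormal kernel."""
    d = data.shape[1]
    diffs = (x - data) / h                                   # scaled differences
    k = np.exp(-0.5 * np.sum(diffs ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return np.mean(k) / h ** d                               # average of K_h(x - X_i)

def kda_classify(x, X, y, bandwidths, priors):
    """Plug-in rule (7) with kernel estimates of the class-conditional densities."""
    classes = sorted(bandwidths)
    scores = [priors[j] * kernel_density(x, X[y == j], bandwidths[j]) for j in classes]
    return classes[int(np.argmax(scores))]

# Illustrative use with two classes and hand-picked bandwidths h_1, h_2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
print(kda_classify(np.array([1.0, 1.0]), X, y, {1: 0.8, 2: 0.8}, {1: 0.5, 2: 0.5}))
```

In practice, the hj would be chosen by cross-validation on the training set, as described above.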
Another popular nonparametric approach to classification has been the k-nearest neighbor method (k-NN). It can be interpreted as an instance of the plug-in rule (7) as follows. Let 1 ≤ k ≤ n and x ∈ Rd. Consider the training data fixed and permute the k training vectors Xi nearest to x in an order of increasing distance from x,

‖x − Xi1‖ ≤ ‖x − Xi2‖ ≤ · · · ≤ ‖x − Xik‖.   (10)

Let δk = ‖x − Xik‖ be the distance from x to its kth nearest neighbor and denote by B = B(x, δk) the ball with radius δk centered at x. Let mj be the number of training vectors from class j in this ball. If the total number of class j training vectors is nj and the class-conditional density fj does not change much in the neighborhood B, then

mj/nj ≈ P(X ∈ B | J = j) = ∫_B fj(y) dy ≈ fj(x) · Vol(B),   (11)

where Vol(B) = cd δk^d and the constant cd depends on the dimension d of the feature space. Solving for fj(x) in Eq. (11) one then obtains a natural density estimate

f̂j(x; k) = (mj/nj) / (cd δk^d).   (12)

Here k plays the role of a smoothing parameter similar to h in kernel estimation: a small k produces a highly variable density function estimate, whereas a large k results in a smooth, possibly biased estimate.




FIGURE 2 | Decision regions obtained using QDA (left) and the 1-nearest neighbor classifier (right) for the same training data.

If the training set is obtained using mixture sampling, a convenient estimate of the class prior probability is P̂j = nj/n and the rule (7) then takes the form

g(x) = arg max_{j=1,...,c} (nj/n) · (mj/nj) / (cd δk^d) = arg max_{j=1,...,c} mj,

so that x is assigned to the class with the most training vectors among its k nearest neighbors. Ties need to be resolved separately, and one may, for example, choose that class whose farthest training vector from x is closest to it among the tied classes. The justification of the k-nearest neighbor classifier through density estimation is most plausible when k ≫ 1 and n is large with k ≪ n. Then even a small neighborhood of x can contain many training vectors from all classes and the approximations in Eq. (11) are credible. In practice, however, very small values of k are often used, one of the most popular nonparametric discrimination methods being in fact the 1-nearest neighbor classifier. Figure 2 shows decision regions obtained with QDA and the 1-nearest neighbor classifier.

For large n, finding the nearest neighbors can incur a heavy computational overhead and various training set preprocessing schemes have been proposed to alleviate this problem. Computational efficiency can also be improved using so-called editing techniques that aim to select representative subsets of training vectors for each class and construct a classifier based on the training set thus reduced in size. For further reading on kernel discriminant analysis and nearest neighbor methods, see, for example, Refs 3,10,12.
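A direct implementation of the resulting majority-vote rule takes only a few lines; the sketch below uses Euclidean distances and brute-force search, and it breaks ties arbitrarily (here, in favor of the smallest class label), a simplification of the farthest-neighbor tie-break mentioned above. The data and parameter values are illustrative.

```python
import numpy as np

def knn_classify(x, X, y, k=5):
    """k-nearest neighbor rule: assign x to the class with the most
    training vectors among its k nearest neighbors (Euclidean distance)."""
    dists = np.linalg.norm(X - x, axis=1)        # distances to all training vectors
    neighbors = y[np.argsort(dists)[:k]]         # labels of the k nearest ones
    labels, counts = np.unique(neighbors, return_counts=True)
    return labels[np.argmax(counts)]             # majority vote

# Illustrative use with simulated two-class data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (60, 2)), rng.normal([3, 2], 1.5, (40, 2))])
y = np.array([1] * 60 + [2] * 40)
print(knn_classify(np.array([2.0, 1.0]), X, y, k=5))
```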
LOGISTIC DISCRIMINATION

A popular statistical approach to classification in a two-class problem is logistic regression. There one models the log-odds, in favor of class one, using a linear function of x,

log[P(1|x) / P(2|x)] = α + βT x,   (13)

where P(2|x) = 1 − P(1|x). The assumption (13) is equivalent with

P(1|x) = exp(α + βT x) / (1 + exp(α + βT x)),  P(2|x) = 1 / (1 + exp(α + βT x)).   (14)

The marginal distribution of X is here typically left unmodeled by conditioning on the feature vectors in the training data.

Conditionally on Xi, the class Ji has the value 1 with probability P(1|Xi) and the value 2 with probability P(2|Xi), and so the conditional likelihood for (α, β) is

Π_{i=1}^n [P(1|Xi)]^{1(Ji=1)} [P(2|Xi)]^{1(Ji=2)},

where the two posterior probabilities are as in Eq. (14), and the indicators of the two classes 1(Ji = 1) and 1(Ji = 2) select the first term when Ji = 1 and the second term when Ji = 2. This is suitable both for separate and for mixture sampling.




The parameter values maximizing the conditional likelihood can be calculated with widely available software for fitting generalized linear models (GLMs), and then one can use Eq. (8), which is equivalent with classifying to class one whenever P̂(1|x) > 0.5.

The multiclass analog of the logistic regression model is the multinomial (or multiple) logistic model, where the posterior probabilities are modeled as

log[P(j|x) / P(c|x)] = rj(x),  j = 1, . . . , c − 1,   (15)

where rj(x) = αj + βjT x are c − 1 linear functions. Equivalently,

P(j|x) = exp(rj(x)) / Σ_{k=1}^c exp(rk(x)),  j = 1, . . . , c,   (16)

where we have made use of the convention that rc(x) ≡ 0. The conditional likelihood is now multinomial,

Π_{i=1}^n Π_{j=1}^c [P(j|Xi)]^{1(Ji=j)},

and also this model can be fitted using software for GLMs. Instead of the linear form, one can also choose the functions rj to be nonlinear in their parameters; for example, they could be neural networks of the feedforward type, such as multilayer perceptrons or radial basis function networks. However, then the model falls outside the class of GLMs.
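As an illustration of Eqs (13) and (14), the sketch below fits the two-class logistic model by maximizing the conditional log-likelihood with plain gradient ascent. This is a toy substitute for the GLM software mentioned above; the learning rate, iteration count, simulated data, and class coding (y = 1 for class one, y = 0 for class two) are arbitrary illustrative choices.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximize the conditional log-likelihood of Eqs (13)-(14) by gradient ascent.
    y is coded 1 for class one and 0 for class two."""
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X])          # prepend a column of ones for alpha
    theta = np.zeros(d + 1)                      # theta = (alpha, beta)
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(-Z @ theta))    # P(1|x_i) from Eq. (14)
        theta += lr * Z.T @ (y - p1) / n         # gradient of the log-likelihood
    return theta[0], theta[1:]                   # alpha, beta

# Illustrative data: class one centered at +1, class two at -1.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [0] * 50)
alpha, beta = fit_logistic(X, y)
x_new = np.array([0.5, 0.2])
p1 = 1.0 / (1.0 + np.exp(-(alpha + beta @ x_new)))
print("P(1|x) =", p1, "-> class", 1 if p1 > 0.5 else 2)
```

Classifying to class one whenever the estimated P(1|x) exceeds 0.5 reproduces the rule described above.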
MULTILAYER PERCEPTRONS

Figure 3 shows the structure of a multilayer perceptron (MLP) with d inputs, one layer of m hidden units, and c output units. Each unit calculates a linear combination of its inputs, optionally applies a nonlinear function, and passes the result as its output, which becomes either an input to other units or the output of the network. If there are no nonlinearities in the output layer, then the MLP evaluates the function r(x, w), where

rj(x, w) = Σ_{i=1}^m wji^(2) φ( Σ_{ℓ=1}^d wiℓ^(1) xℓ + wi0^(1) ) + wj0^(2),  j = 1, . . . , c.   (17)

Here the weight vector w contains all the weights (and biases) of the MLP: wiℓ^(1) is the weight from the ℓth input to the ith hidden unit and wi0^(1) is its bias weight; wji^(2) is the weight from the ith hidden unit to the jth output unit and wj0^(2) is its bias weight. The most popular choice for the nonlinearity φ is the logistic sigmoid φ(z) = 1/(1 + exp(−z)). Thanks to their flexibility, MLPs have been very popular in all kinds of nonlinear regression and classification tasks.

FIGURE 3 | A multilayer perceptron with one layer of hidden units.

As already mentioned, one approach to fitting an MLP is to use c − 1 outputs (or even an overparametrized model with c outputs) together with the structure (16) (which is often called the softmax function) and the multinomial conditional likelihood. However, perhaps the most widely used fitting criterion in classification tasks is the squared error criterion

Σ_{i=1}^n ‖ti − r(Xi, w)‖² → min over w,

where the ith target ti is the indicator vector of the class Ji, that is, tij is one if Ji = j and zero otherwise. Under mixture sampling, the outputs rj(x, ŵ) of the estimated network can be interpreted as estimates of the posterior probabilities of the classes, and then one can use the rule (8). One often adds a regularization term to the optimization criterion to make the estimates more stable. This can be, for example, of the weight decay form λ‖w‖², where λ > 0. The number of hidden units and the value of the regularization parameter λ can be chosen, for example, with cross-validation.

All the formulations lead to a nonconvex optimization problem which has several local and global extrema. One can fit the MLP using either general purpose optimization routines or special optimization methods tailored for MLPs. The gradient (and higher derivatives) of the optimization criterion can be calculated using formulas which are explained in the literature in connection with the backpropagation method.
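A minimal sketch of the forward pass (17) with logistic sigmoid hidden units is given below. The weight shapes, the random initialization, and the optional softmax conversion (corresponding to the structure (16)) are illustrative assumptions; training, for example by backpropagation on the squared error criterion, is omitted.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Evaluate r(x, w) of Eq. (17): one hidden layer, linear output layer.
    W1 is m x d (hidden weights), b1 has length m, W2 is c x m, b2 has length c."""
    hidden = sigmoid(W1 @ x + b1)       # phi(sum_l w_il^(1) x_l + w_i0^(1))
    return W2 @ hidden + b2             # sum_i w_ji^(2) (.) + w_j0^(2)

# Illustrative network: d = 2 inputs, m = 5 hidden units, c = 3 outputs.
rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(5, 2)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
r = mlp_forward(np.array([0.3, -1.2]), W1, b1, W2, b2)

# With the softmax structure (16), the outputs could be turned into
# posterior probability estimates and used in the rule (8).
p = np.exp(r - r.max())
p /= p.sum()
print(r, p, "-> class", int(np.argmax(p)) + 1)
```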




See Ref 25 for a view on the relationship between neural and statistical classifiers and, for example, Refs 11,19 for more information on neural network models for classification.

SUPPORT VECTOR MACHINES

Support vector machines (SVMs)26 are currently of great interest to theoretical and applied researchers and they have strong connections to computational learning theory. The basic idea is easiest to understand when we have a linearly separable two-class problem. The resulting classifier is called the maximal margin classifier. The idea is to search for the optimal separating hyperplane which has the maximal margin of separation between the training vectors from the two classes, so maximal margin classifiers estimate the decision boundary directly. Being a separating hyperplane means that the training vectors from the two classes lie on different sides of the hyperplane, and having maximal margin means that the distance from the hyperplane to the nearest training vector is maximal. The support vectors are those training vectors which lie nearest to the optimal hyperplane. This optimization problem can be formulated as a quadratic programming problem. In real applications, the training data are usually not linearly separable and then the maximal margin hyperplane does not exist. A solution is to seek the so-called soft-margin hyperplane instead. This also leads to a quadratic program. As the construction of SVM classifiers leads to standard convex optimization problems, there are no complications with local minima as there are with MLPs. These quadratic programs can be solved either by general purpose quadratic program solvers or by techniques developed specially for SVMs.

Suppose we transform the original feature vectors into some high-dimensional or even infinite-dimensional (Hilbert) space using a nonlinear mapping ϕ before constructing the maximal margin or the soft-margin hyperplane. Using a dual formulation of the original quadratic program, one obtains another quadratic program, which depends on the training vectors only through their inner products. In the transformed space, the inner products can be represented using a kernel function

⟨ϕ(Xi), ϕ(Xj)⟩ = K(Xi, Xj).

Thus, the inner products needed for the construction of the SVM classifier can be calculated in the original feature space. The resulting classifier can also be implemented using the kernel, and then working out what the nonlinear mapping ϕ is becomes unnecessary. This is the so-called kernel trick. By Mercer's theorem, the kernel has to be a non-negative definite function. In applications, one may start by choosing a convenient form for the kernel among the wide selection of valid kernels available in the literature.

SVM classifiers can also be used in a multiclass problem using one of several approaches. For more information on SVMs, see, for example, Refs 4,16,18,27 and the Web page.28
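In practice, an off-the-shelf solver is typically employed for the quadratic program. The sketch below trains a soft-margin SVM with a Gaussian (RBF) kernel using scikit-learn; the library choice is an assumption on our part rather than a tool named in the article, and the kernel, the penalty parameter C, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # assumed external dependency: scikit-learn

# Illustrative two-class data that is not linearly separable.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (80, 2)), rng.normal(2.0, 1.0, (80, 2))])
y = np.array([1] * 80 + [2] * 80)

# Soft-margin SVM with a Gaussian (RBF) kernel; C controls the margin/error trade-off.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.predict([[1.0, 1.0]]))        # predicted class label
print(len(clf.support_))                # number of support vectors
```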
OTHER METHODS

A number of methods construct the decision regions using recursive partitioning of the feature space. Well-known approaches in this vein are the tree-structured classifiers, such as the classification and regression tree method that includes the original CART and its variants. An important property of this approach is that, where most classifiers operate as 'black boxes', tree classifiers are also able to provide a potentially useful explanation for the assignment of a pattern to a particular class in the form of threshold values on its features Xi. Another graphical-model-based approach is the Bayesian belief network.

To improve discrimination performance, the combination of several classifiers into a single 'committee' classifier has also received much attention in recent years. This approach includes, for example, bagging and boosting, which, by perturbing the training set, generate a collection of classifiers that are combined into a single combined classifier.

For more information on these and other techniques not covered in this article, see, for example, Refs 11,16,29.

FEATURE EXTRACTION AND SELECTION

Often the measured pattern vectors are too high dimensional for useful estimation of a classifier. The raw patterns are therefore first transformed to lower dimensional feature vectors in a way that hopefully preserves the salient class information of the original measurements. However, such a transformation can never improve the theoretically optimal classification result. Indeed, let ϕ : RD → Rd be the transformation of the raw pattern vector Y into the feature vector X (here usually d ≪ D). Then any classifier g in Rd induces a classifier g∘ϕ in the original pattern space, and if the lowest (Bayes) error for Y is e∗, then we have P(g(X) ≠ J) = P((g∘ϕ)(Y) ≠ J) ≥ e∗.




Still, in practice, dimension reduction makes classifier estimation easier, which often more than compensates for the possible loss of classification information in the transformation. Also, if one wishes to use a particular classifier type, a clever feature transformation can sometimes be used to improve classifier performance. A prime example of this idea is the support vector machine, which combines a nonlinear mapping with a simple linear classifier.

Ideally, the classifier and the feature transformation should be designed in a joint process, but in practice these two design steps are often separated. While application-specific information is usually needed for best results, there are also some generic methods for transforming the raw patterns into feature vectors. These include feature selection, feature extraction typically based on a linear mapping, as well as general purpose dimension reduction techniques.

In feature selection, one tries to choose from the raw pattern Y = [Y1, . . . , YD]T the variables Yi that are most useful in discrimination. Thus, X = ϕ(Y) = [Yi1, . . . , Yid]T, where i1 < · · · < id. Estimated classification error or some other measure of class separation can be used to rank the performance of different subsets of indices. Because of combinatorial complexity, exhaustive search through all possible subsets is rarely feasible and one resorts to various suboptimal, incremental schemes that add or delete one feature at a time.

A typical linear feature extraction method uses a transformation of the form ϕ(Y) = [a1T Y, . . . , adT Y]T, where a1, . . . , ad are orthonormal vectors chosen to optimize some measure of within-class spread and between-class separation. The ai can be selected, for example, from among the eigenvectors of the sample within-class covariance matrix SW to maximize the ratio aiT SB ai / λi, where SB is the sample between-class covariance matrix and λi is the eigenvalue corresponding to ai. Thus,

SW = Σ_{j=1}^c (nj/n) Σ̂j,  SB = Σ_{j=1}^c (nj/n) (µ̂j − µ̂)(µ̂j − µ̂)T,

where nj is the number of class j training vectors and µ̂j, µ̂, and Σ̂j are sample versions of the class j mean, the overall mean, and the class j covariance matrix, respectively, all computed from the original raw training data.

The most widely used dimension reduction technique is principal component analysis. The features are of the form Xi = uiT Y, where ui is the ith eigenvector of the covariance matrix of Y, which needs to be estimated from the training data. The Xi are uncorrelated and, when the eigenvectors are indexed in an order of decreasing eigenvalues, we have that Var(X1) ≥ · · · ≥ Var(XD). By defining a feature vector as X = [X1, . . . , Xd]T, where d ≪ D, dimension reduction is achieved while at the same time most of the variability in the original pattern vector Y is captured. Of course, this approach does not take into account the class labels of the patterns in any way, and indeed there are versions of principal component analysis which incorporate this information, too. Another general purpose dimension reduction and feature extraction technique that has found some use in pattern recognition is metric multidimensional scaling.

Good accounts of various feature selection and extraction methods can be found, for example, in Refs 3 and 6.
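As a sketch of the principal component features Xi = uiT Y just described, the following computes the leading eigenvectors of the sample covariance matrix of the raw patterns and projects the (centered) patterns onto the first d of them. The centering step, the choice d = 2, and the simulated data are illustrative assumptions.

```python
import numpy as np

def pca_features(Y, d):
    """Project raw patterns (rows of Y) onto the d leading eigenvectors
    of their sample covariance matrix: X_i = u_i^T (Y - mean)."""
    Yc = Y - Y.mean(axis=0)                        # center the raw patterns
    cov = np.cov(Yc, rowvar=False)                 # sample covariance matrix of Y
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]          # indices of the d largest eigenvalues
    U = eigvecs[:, order]                          # eigenvectors u_1, ..., u_d
    return Yc @ U                                  # feature vectors, one per row

# Illustrative use: reduce D = 10 dimensional raw patterns to d = 2 features.
rng = np.random.default_rng(6)
Y = rng.normal(size=(200, 10))
X = pca_features(Y, d=2)
print(X.shape)          # (200, 2)
```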
CLASSIFIER ASSESSMENT

Once the classifier has been implemented, one should assess whether it meets the design criteria, such as being sufficiently quick to compute and having an adequate error rate. A naive way of estimating the error rate (assuming mixture sampling) is to calculate the relative frequency of errors in the design sample. This is called the resubstitution estimator or the apparent error rate. It is optimistically biased, and the bias can be severe for complex classifiers. Instead, the recommended approach is to split the data into two separate sets, the training set and the test set. The classifier is estimated using data in the training set, and its performance is assessed on the independent test set. This is called the holdout estimate. Several classifiers require the choice of tuning parameters or model, architecture, or kernel selection. This is typically based on cross-validation. In order to keep strict separation between the design and the test set, the cross-validation then needs to be done using the training set only.

CONCLUSION

Pattern recognition has a long history, starting from the investigations of statisticians in the 1930s and continuing with the groundwork of the 1950s and 1960s. In more recent times, the field has received new impetus especially from the neural networks and machine learning communities.



However exciting the new classification methods may seem, one should keep in mind that the greatest progress usually comes from a careful formulation of the real-world problem and from the invention of good features, not so much from new classification algorithms. This is especially relevant in new application domains.

REFERENCES

1. Hand DJ. Construction and Assessment of Classification Rules. Chichester, England: John Wiley & Sons; 1997.
2. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. New York: John Wiley & Sons; 2000.
3. Webb A. Statistical Pattern Recognition. 2nd ed. Chichester, England: John Wiley & Sons; 2002.
4. Theodoridis S, Koutroumbas K. Pattern Recognition. 4th ed. Burlington, MA: Academic Press; 2008.
5. Young TY, Calvert TW. Classification, Estimation and Pattern Recognition. New York: American Elsevier Publishing; 1974.
6. Devijver PA, Kittler J. Pattern Recognition: A Statistical Approach. London: Prentice-Hall; 1982.
7. Therrien CW. Decision, Estimation and Classification: An Introduction to Pattern Recognition and Related Topics. New York: John Wiley & Sons; 1989.
8. Schalkoff RJ. Pattern Recognition: Statistical, Structural and Neural Approaches. New York: John Wiley & Sons; 1992.
9. Fukunaga K. Introduction to Statistical Pattern Recognition. 2nd ed. San Diego, CA: Academic Press; 1990 (Original ed. published 1972).
10. McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley & Sons; 1992.
11. Ripley BD. Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge University Press; 1996.
12. Devroye L, Györfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. New York: Springer; 1996.
13. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London: Academic Press; 1979.
14. Krzanowski WJ, Marriott FHC. Multivariate Analysis: Part 2, Classification, Covariance Structures and Repeated Measurements, vol. 2, Kendall's Library of Statistics. London: Edward Arnold; 1995.
15. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. New York: John Wiley & Sons; 2003.
16. Izenman AJ. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. New York: Springer; 2008.
17. Hand D, Mannila H, Smyth P. Principles of Data Mining. Cambridge, MA: The MIT Press; 2001.
18. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer; 2001.
19. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006.
20. Krishnaiah PR, Kanal LN, eds. Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality. Amsterdam: North-Holland; 1982.
21. Young TY, Fu K-S, eds. Handbook of Pattern Recognition and Image Processing. New York: Academic Press; 1986.
22. Chen CH, Wang PSP, eds. Handbook of Pattern Recognition and Computer Vision. 3rd ed. Singapore: World Scientific; 2005.
23. Friedman JH. Regularized discriminant analysis. J Am Stat Assoc 1989, 84:165–175.
24. Srivastava S, Gupta MR, Frigyik BA. Bayesian quadratic discriminant analysis. J Mach Learn Res 2007, 8:1277–1305.
25. Holmström L, Koistinen P, Laaksonen J, Oja E. Neural and statistical classifiers—taxonomy and two case studies. IEEE Trans Neural Netw 1997, 8:5–17.
26. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995, 20:273–297.
27. Cristianini N, Shawe-Taylor J. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press; 2000.
28. Kernel-machines.org. http://www.kernel-machines.org/, Accessed May 3, 2010.
29. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002.

