Pattern Recognition
De Gruyter Graduate
Pattern Recognition
Matthias Richter
Institute of Anthropomatics and Robotics, Chair IES, Karlsruhe Institute of Technology
Adenauerring 4
76131 Karlsruhe
[email protected]
Matthias Nagel
Institute of Theoretical Informatics, Cryptography and IT Security
Karlsruhe Institute of Technology Am Fasanengarten 5
76131 Karlsruhe
[email protected]
ISBN 978-3-11-053793-2
e-ISBN (PDF) 978-3-11-053794-9
e-ISBN (EPUB) 978-3-11-053796-3
www.degruyter.com
Preface
Pattern Recognition ⊂ Machine Learning ⊂ Artificial Intelligence:
This relation could give the impression that pattern recognition is only a tiny, very spe-
cialized topic. That, however, is misleading. Pattern recognition is a very important
field of machine learning and artificial intelligence with its own rich structure and
many interesting principles and challenges. For humans, and also for animals, their
natural abilities to recognize patterns are essential for navigating the physical world
which they perceive with their naturally given senses. Pattern recognition here per-
forms an important abstraction from sensory signals to categories: on the most basic
level, it enables the classification of objects into “Eatable” or “Not eatable” or, e.g.,
into “Friend” or “Foe.” These categories (or, synonymously, classes) do not always
have a tangible character. Examples of non-material classes are, e.g., “secure situa-
tion” or “dangerous situation.” Such classes may even shift depending on the context,
for example, when deciding whether an action is socially acceptable or not. Therefore,
everybody is very much acquainted, at least at an intuitive level, with what pattern
recognition means to our daily life. This fact is surely one reason why pattern recogni-
tion as a technical subdiscipline is a source of so much inspiration for scientists and
engineers. In order to implement pattern recognition capabilities in technical systems,
it is necessary to formalize it in such a way that the designer of a pattern recognition
system can systematically engineer the algorithms and devices necessary for a techni-
cal realization. This textbook summarizes a lecture course about pattern recognition
that one of the authors (Jürgen Beyerer) has been giving for students of technical and
natural sciences at the Karlsruhe Institute of Technology (KIT) since 2005. The aim of
this book is to introduce the essential principles, concepts and challenges of pattern
recognition in a comprehensive and illuminating presentation. We will try to explain
all aspects of pattern recognition in a well understandable, self-contained fashion.
Facts are explained with a sufficiently deep mathematical treatment, but
without going into the very last technical details of a mathematical proof. The given
explanations will aid readers to understand the essential ideas and to comprehend
their interrelations. Above all, readers will gain the big picture that underlies all of
pattern recognition.
The authors would like to thank their peers and colleagues for their support:
Special thanks are owed to Dr. Ioana Gheța who was very engaged during the early
phases of the lecture “Pattern Recognition” at the KIT. She prepared most of the many
slides and accompanied the course along many lecture periods.
Thanks as well to Dr. Martin Grafmüller and to Dr. Miro Taphanel for supporting
the lecture Pattern Recognition with great dedication.
Moreover, many thanks to Prof. Michael Heizmann and Prof. Fernando Puente
León for inspiring discussions, which have positively influenced the evolution of
the lecture.
Thanks to Christian Hermann and Lars Sommer for providing additional figures
and examples of deep learning. Our gratitude also to our friends and colleagues Alexey
Pak, Ankush Meshram, Chengchao Qu, Christian Hermann, Ding Luo, Julius Pfrom-
mer, Julius Krause, Johannes Meyer, Lars Sommer, Mahsa Mohammadikaji, Mathias
Anneken, Mathias Ziearth, Miro Taphanel, Patrick Philipp, and Zheng Li for providing
valuable input and corrections for the preparation of this manuscript.
Lastly, we thank De Gruyter for their support and collaboration in this project.
List of Tables | XI
Notation | XVII
Introduction | XIX
2 Features | 10
2.1 Types of features and their traits | 10
2.1.1 Nominal scale | 10
2.1.2 Ordinal scale | 12
2.1.3 Interval scale | 13
2.1.4 Ratio scale and absolute scale | 13
2.2 Feature space inspection | 13
2.2.1 Projections | 14
2.2.2 Intersections and slices | 15
2.3 Transformations of the feature space | 17
2.4 Measurement of distances in the feature space | 17
2.4.1 Basic definitions | 19
2.4.2 Elementary norms and metrics | 20
2.4.3 A metric for sets | 22
2.4.4 Metrics on the ordinal scale | 23
2.4.5 The Kullback–Leibler divergence | 23
2.4.6 Tangential distance measure | 28
2.5 Normalization | 32
2.5.1 Alignment, elimination of physical dimension, and leveling of
proportions | 32
2.5.2 Lighting adjustment of images | 33
2.5.3 Distortion adjustment of images | 37
Bibliography | 271
Glossary | 275
Index | 281
List of Tables
Table 1 Capabilities of humans and machines in relation to pattern recognition | XXI
Table 7.1 Character sequences generated by Markov models of different order | 212
List of Figures
Fig. 1 Examples of artificial and natural objects | XX
Fig. 2 Industrial bulk material sorting system | XXI
Fig. 3.1 Example of a random distribution of mixed discrete and continuous quantities | 99
Fig. 3.2 The decision space K | 100
Fig. 3.3 Workflow of the MAP classifier | 102
Fig. 3.4 3-dimensional probability simplex in barycentric coordinates | 103
Fig. 3.5 Connection between the likelihood ratio and the optimal decision region | 106
Fig. 3.6 Decision of an MAP classifier in relation to the a posteriori probabilities | 108
Fig. 3.7 Underlying densities in the reference example for classification | 109
Fig. 3.8 Optimal decision regions | 110
Fig. 3.9 Risk of the Minimax classifier | 112
Fig. 3.10 Decision boundary with uneven priors | 115
Fig. 3.11 Decision regions of a generic Gaussian classifier | 116
Fig. 3.12 Decision regions of a generic two-class Gaussian classifier | 117
Fig. 3.13 Decision regions of a Gaussian classifier with the reference example | 118
Fig. 7.1 Techniques for extending linear discriminants to more than two classes | 174
Fig. 7.2 Nonlinear separation by augmentation of the feature space. | 176
Fig. 7.3 Decision regions of a linear regression classifier | 177
Fig. 7.4 Four steps of the perceptron algorithm | 179
Fig. 7.5 Feed-forward neural network with one hidden layer | 181
Fig. 7.6 Decision regions of a feed-forward neural network | 183
Fig. 7.7 Neuron activation of an autoencoder with three hidden neurons | 185
Fig. 7.8 Pre-training with stacked autoencoders. | 187
Fig. 7.9 Comparison of ReLU and sigmoid activation functions | 189
Fig. 7.10 A single convolution block in a convolutional neural network | 189
Fig. 7.11 High level structure of a convolutional neural network. | 190
Fig. 7.12 Types of features captured in convolution blocks of a convolutional neural
network | 191
Fig. 7.13 Detection and classification of vehicles in aerial images with CNNs | 192
Fig. 7.14 Structure of the CNN used in Herrmann et al. [2016] | 193
Fig. 7.15 Classification with maximum margin | 195
Fig. 7.16 Decision regions of a hard margin SVM | 203
Fig. 7.17 Geometric interpretation of the slack variables ξ i , i = 1, . . . ,N. | 204
Fig. 7.18 Decision regions of a soft margin SVM | 205
Fig. 7.19 Decision boundaries of hard margin and soft margin SVMs | 206
Fig. 7.20 Toy example of a matched filter | 207
Fig. 7.21 Discrete first order Markov model with three states ω i . | 211
Fig. 7.22 Discrete first order hidden Markov model | 213
Fig. 9.1 Relation of the world model P(m,ω) and the training and test sets D and T | 232
Fig. 9.2 Sketch of different class assignments under different model families | 233
Fig. 9.3 Expected test error, empirical training error, and VC confidence vs. VC
dimension | 234
Fig. 9.4 Classification error probability | 235
Fig. 9.5 Classification outcomes in a 2-class scenario | 236
Fig. 9.6 Performance indicators for a binary classifier | 237
Fig. 9.7 Example of ROC curves | 239
Fig. 9.8 Converting a multi-class confusion matrix to binary confusion matrices | 240
Fig. 9.9 Five-fold cross-validation | 242
Fig. 9.10 Schematic example of AdaBoost training. | 244
Fig. 9.11 AdaBoost classifier obtained by training in Figure 9.10 | 245
Fig. 9.12 Reasons to refuse to classify an object | 245
Fig. 9.13 Classifier with rejection option | 246
Fig. 9.14 Rejection criteria and the corresponding rejection regions | 247
Notation
General identifiers
Special identifiers
c Number of classes
ℂ Set of complex numbers
d Dimension of feature space
D Set of training samples
i, j, k Indices along the dimension, i.e., i, j, k ∈ {1, . . . , d}, or along the number of sam-
ples, i.e., i, j, k ∈ {1, . . . , N}
I Identity matrix
j Imaginary unit, j2 = −1
J Fisher information matrix
k(⋅,⋅) Kernel function
k(⋅) Decision function
K Decision space
l Cost function l : Ω0 /∼ × Ω/∼ → ℝ
L Cost matrix ∈ ℝ(c+1)×c
m Feature vector
mi Feature vector of the i-th sample
m ij The j-th component of the i-th feature vector
M ij The component at the i-th row and j-th column of the matrix M
M Feature space
N Number of samples
ℕ Set of natural numbers
o Object
ω Class of objects, i.e., ω ⊆ Ω
ω0 Rejection class
Ω Set of objects (the relevant part of the world) Ω = {o1 , . . . , o N }
Ω/∼ The domain factorized w.r.t. the classes, i.e., the set of classes Ω/∼ =
{ω1 , . . . , ω c }
Ω0 /∼ The set of classes including the rejection class, Ω0 /∼ = Ω/∼ ∪ {ω0 }
p(m) Probability density function for random variable m evaluated at m
P(ω) Probability mass function for (discrete) random variable ω evaluated at ω
Pr(e) Probability of an event e
P(A) Power set, i.e., the set of all subsets of A
ℝ Set of real numbers
S Set of all samples, S = D ⊎ T ⊎ V
T Set of test samples
V Set of validation samples
U Unit matrix, i.e., the matrix all of whose entries are 1
θ Parameter vector
Θ Parameter space
ℤ Set of integer numbers
Special symbols
∝ “proportional to”-relation
→ᴾ Convergence in probability
→ʷ Weak convergence
⇝ Leads to (not necessarily in a strict mathematical sense)
⊎ Disjoint union of sets, i.e., C = A ⊎ B ⇔ C = A ∪ B and A ∩ B = ∅.
⟨⋅,⋅⟩ Scalar product
∇, ∇e Gradient, Gradient w.r.t. e
Cov{⋅} Covariance
δᵢʲ Kronecker delta/symbol; δᵢʲ = 1 iff i = j, else δᵢʲ = 0
δ[⋅] Generalized Kronecker symbol, i.e., δ[Π] = 1 iff Π is true and δ[Π] = 0 otherwise
E{⋅} Expected value
N(μ, σ 2 ) Normal/Gaussian distribution with expectation μ and variance σ 2
N(µ, Σ) Multivariate normal/Gaussian distribution with expectation µ and covariance ma-
trix Σ
tr A Trace of the matrix A
Var{⋅} Variance
Pattern acquisition, Sensing, Measuring In the first step, suitable properties of the
objects to be classified have to be gathered and put into computable representa-
tions. Although pattern might suggest that this (necessary) step is part of the actual
pattern recognition task, it is not. However, this process has to be considered insofar
as it provides an awareness of any possible complications it may cause in the
subsequent steps. Measurements of any kind are usually affected by random noise
and other disturbances that, depending on the application, can not be mitigated
by methods of metrology alone: for example, changes of lighting conditions in
uncontrolled and uncontrollable environments. A pattern recognition system has
to be designed so that it is capable of solving the classification task regardless of
such factors.
Feature definition, Feature acquisition Suitable features have to be selected based
on the available patterns and methods for extracting these features from the pat-
terns have to be defined. The general aim is to find the smallest set of the most
informative and discriminative features. A feature is discriminative if it varies lit-
tle with objects within a single class, but varies significantly with objects from
different classes.
Design of the classifier After the features have been determined, rules to assign a
class to an object have to be established. The underlying mathematical model has
to be selected so that it is powerful enough to discern all given classes and thus
solve the classification task. On the other hand, it should not be more complicated
than necessary.
These lecture notes on pattern recognition are mainly concerned with the last two
issues. The complete process of designing a pattern recognition system will be covered
in its entirety and the underlying mathematical background of the required building
blocks will be given in depth.
Pattern recognition systems are generally parts of larger systems, in which pattern
recognition is used to derive decisions from the result of the classification. Industrial
sorting systems are typical of this (see Figure 2). Here, products are processed differ-
ently depending on their class memberships.
Hence, as a pattern recognition system is not an end in itself, the design of such a
system has to consider the consequences of a bad decision caused by a misclassifica-
tion. This puts pattern recognition between human and machine. The main advantage
of automatic pattern recognition is that it can execute recurring classification tasks
with great speed and without fatigue. However, an automatic classifier can only dis-
cern the classes that were considered in the design phase and it can only use those
features that were defined in advance. A pattern recognition system to tell apples from
oranges may label a pear as an apple and a lemon as an orange if lemons and pears
were not known in the design phase. The features used for classification might be
chosen poorly and not be discriminative enough. Different environmental conditions
(e.g., lighting) in the laboratory and in the field that were not considered beforehand
might impair the classification performance, too. Humans, on the other hand, can use
their associative and cognitive capabilities to achieve good classification performance.
Fig. 2. Industrial bulk material sorting system: a line-scan camera with illumination observes the bulk material on a conveyor in front of a background plate; a computer classifies the objects and controls the ejection stage.
Definition 1.1 (Equivalence relation). Let Ω be a set of elements with some relation ∼.
Suppose further that o, o1 , o2 , o3 ∈ Ω are arbitrary. The relation ∼ is said to be an
equivalence relation if it fulfills the following conditions:
1. Reflexivity: o ∼ o.
2. Symmetry: o1 ∼ o2 ⇔ o2 ∼ o1 .
3. Transitivity: o1 ∼ o2 and o2 ∼ o3 ⇒ o1 ∼ o3 .
An equivalence relation partitions Ω into equivalence classes: the equivalence class [o]∼ = {o′ ∈ Ω | o′ ∼ o} of an element o ∈ Ω is the set
of all elements that are equivalent to o. The object o is also called a representative of
the set [o]∼ . In the context of pattern recognition, each o ∈ Ω denotes an object and
each [o]∼ denotes a class. A different approach to classifying every element of a set is
given by partitioning the set into non-empty, pairwise disjoint subsets whose union is the whole set.
It is easy to see that equivalence relations and partitions describe synonymous con-
cepts: every equivalence relation induces a partition, and every partition induces an
equivalence relation.
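To make the correspondence concrete, here is a minimal Python sketch (not from the book) that groups objects into equivalence classes [o]∼ given a relation; the modulo-3 relation used for the demonstration is a made-up example.

```python
def equivalence_classes(objects, rel):
    """Partition `objects` into equivalence classes of the relation `rel`.

    Assumes `rel` really is reflexive, symmetric, and transitive;
    each class is represented by its first encountered element.
    """
    classes = []  # list of lists; each inner list is one class [o]~
    for o in objects:
        for cls in classes:
            if rel(o, cls[0]):      # transitivity: comparing with one
                cls.append(o)       # representative is enough
                break
        else:
            classes.append([o])     # o starts a new equivalence class
    return classes

# Toy relation: two integers are equivalent iff they leave the same
# remainder modulo 3 (reflexive, symmetric, transitive).
same_mod3 = lambda a, b: a % 3 == b % 3
print(equivalence_classes(range(10), same_mod3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```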
The underlying principle of all pattern recognition is illustrated in Figure 1.1.
On the left it shows—in abstract terms—the world and a (sub)set Ω of objects that
Fig. 1.1. The principle of pattern recognition: the domain Ω ⊆ World is partitioned into classes ω1 , . . . , ω c ; sensing, measuring, and characterizing maps each object o i to a feature vector mi ; decision boundaries partition the feature space into decision regions R1 , . . . , R c .
live within the world. The set Ω is given by the pattern recognition task and is also
called the domain. Only the objects in the domain are relevant to the task; this is the
so called closed world assumption. The task also partitions the domain into classes
ω1 , ω2 , ω3 , . . . ⊆ Ω. A suitable mapping associates every object o i with a feature vector
mi ∈ M inside the feature space M. The goal is now to find rules that partition M
along decision boundaries so that the resulting partition of M matches the partition of the domain into classes.
Hence, the rule for classifying an object o is
ω̂(o) = ω i if m(o) ∈ Ri , i = 1, . . . , c. (1.2)
This means that the estimated class ω̂ (o) of object o is set to the class ω i if the
feature vector m (o) falls inside the region Ri . For this reason, the Ri are also called
decision regions. The concept of a classifier can now be stated more precisely:
Definition 1.3 (Classifier). A classifier is a collection of rules that state how to evaluate
feature vectors in order to sort objects into classes. Equivalently, a classifier is a system
of decision boundaries in the feature space.
Readers experienced in machine learning will find these concepts very familiar. In fact,
machine learning and pattern recognition are closely intertwined: pattern recognition
is (mostly) supervised learning, as the classes are known in advance. This topic will be
picked up again later in this chapter.
1.2 Structure of a pattern recognition system
In the previous section it was already mentioned that a pattern recognition system maps objects onto feature vectors (see Figure 1.1) and that the classification is carried out in the feature space. This section focuses on the steps involved and defines the terms pattern and feature.
Fig. 1.2. Processing pipeline of a pattern recognition system: objects from Ω are sensed, preprocessed, and segmented into patterns, from which features are extracted and passed to the classifier.
Figure 1.2 shows the processing pipeline of a pattern recognition system. In the
first steps, the relevant properties of the objects from Ω must be converted into a machine-
readable representation. These first steps (yellow boxes in Figure 1.2) are usually per-
formed by methods of sensor engineering, signal processing, or metrology, and are
not directly part of the pattern recognition system. The result of these operations is
the pattern of the object under inspection.
Definition 1.4 (Pattern). A pattern is the collection of the observed or measured prop-
erties of a single object.
The most prominent pattern is the image, but patterns can also be (text) documents,
audio recordings, seismograms, or indeed any other signal or data. The pattern of
an object is the input to the actual pattern recognition, which is itself composed of
two major steps (gray boxes in Figure 1.2): previously defined features are extracted
from the pattern and the resulting feature vector is passed to the classifier, which then
outputs an equivalence class according to Equation (1.2).
A feature is any quality or quantity that can be derived from the pattern, for example,
the area of a region in an image, the count of occurrences of a key word within a text,
or the position of a peak in an audio signal.
From an abstract point of view, pattern recognition is mapping the set of objects Ω to
be classified to the equivalence classes ω ∈ Ω/ ∼, i.e., Ω → Ω/ ∼ or o → ω. In some
cases, this view is sufficient for treating the pattern recognition task. For example, if
the objects are e-mails and the task is to classify the e-mails as either “ham” =̂ ω1 or
“spam” =̂ ω2 , this view is sufficient for deriving the following simple classifier: The
body of an incoming e-mail is matched against a list of forbidden words. If it contains
more than S of these words, it is marked as spam, otherwise it is marked as ham.
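As a minimal sketch of this word-count rule (not the book's implementation), the following Python snippet treats the e-mail body as object, pattern, and feature vector at once; the forbidden-word list and the threshold S are placeholder choices.

```python
FORBIDDEN_WORDS = {"lottery", "winner", "prince", "jackpot"}  # placeholder list
S = 2  # threshold: more than S forbidden words => spam

def classify_email(body: str) -> str:
    """Return 'spam' if the body contains more than S forbidden words, else 'ham'."""
    words = body.lower().split()
    hits = sum(1 for w in words if w in FORBIDDEN_WORDS)
    return "spam" if hits > S else "ham"

print(classify_email("you are a winner of the lottery says the prince"))  # spam
print(classify_email("meeting moved to three o'clock"))                    # ham
```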
For a more complicated classification system, as well as for many other pattern
recognition problems, it is helpful and can provide additional insights to break up the
mapping Ω → Ω/ ∼ into several intermediate steps. In this book, the pattern recog-
nition process is subdivided into the following steps: observation, sensing, measure-
ment; feature extraction; decision preparation; and classification. This subdivision is
outlined in Figure 1.3.
To come back to the example mentioned above, an e-mail is already digital data,
hence it does not need to be sensed. It can be further seen as an object, a pattern, and
a feature vector, all at once. A spam classification application that takes the e-mail as
input and accomplishes the desired assignment to one of the two categories could be
considered as a black box that performs the mapping Ω → Ω/ ∼ directly.
In many other cases, especially if objects of the physical world are to be classified,
the intermediate steps of Ω → P → M → K → Ω/ ∼ will help to better analyze and
understand the internal mechanisms, challenges and problems of object classification.
It also supports engineering a better pattern recognition system. The concept of the
pattern space P is especially helpful if the raw data acquired about an object has a
very high dimension, e.g., if an image of an object is taken as the pattern. Explicit
use of P will be made in Section 2.4.6, where the tangent distance is discussed, and
in Section 2.6.3, where invariant features are considered. The concept of the decision
space K helps to generalize classifiers and is especially useful to treat the rejection
Objects Ω → (observation, sensing, measurement) → pattern space P → (feature extraction) → feature space M → (decision preparation) → decision space K → (classification) → classes Ω/∼
Fig. 1.3. Subdividing the pattern recognition process allows deeper insights and helps to better
understand important concepts such as: the curse of dimensionality, overfitting, and rejection.
problem in Section 9.4. Lastly, the concept of the feature space M is fundamental
to pattern recognition and permeates the whole textbook. Features can be seen as a
concentrated extract from the pattern, which essentially carries the information about
the object which is relevant for the classification task.
Overall, any pattern recognition task can be formally defined by a quintuple
(Ω, ∼, ω0 , l, S), where Ω is the set of objects to be classified, ∼ is an equivalence
relation that defines the classes in Ω, ω0 is the rejection class (see Section 9.4), l is a
cost function that assesses the classification decision ω̂ compared to the true class ω
(see Section 3.3), and S is the set of examples with known class memberships. Note
that the rejection class ω0 is not always needed and may be empty. Similarly, the cost
function l may be omitted, in which case it is assumed that incorrect classification
creates the same costs independently of the class and no cost is incurred by a correct
classification (0–1 loss).
These concepts will be further developed and refined in the following chapters.
For now, we will return to a more concrete discussion of how to design systems that
can solve a pattern recognition task.
1.4 Design of a pattern recognition system
Figure 1.4 shows the principal steps involved in designing a pattern recognition sys-
tem: data gathering, selection of features, definition of the classifier, training of the
classifier, and evaluation. Every step is prone to making different types of errors, but
the sources of these errors can broadly be sorted into four categories:
1. Too small a dataset,
2. A non-representative dataset,
3. Features with little discriminative power, and
4. An unsuitable classifier model.
Fig. 1.4. Principal design steps of a pattern recognition system: data gathering (training, validation, and test samples), selection and definition of features (operators for feature extraction), definition of the classifier (mathematical model), training of the classifier, and evaluation of the classifier (performance of the classifier).
The following section will describe the different steps in detail, highlighting the chal-
lenges faced and pointing out possible sources of error.
The first step is always to gather samples of the objects to be classified. The result-
ing dataset is labeled S and consists of patterns of objects where the corresponding
classes are known a priori, for example because the objects have been labeled by a
domain expert. As the class of each sample is known, deriving a classifier from S con-
stitutes supervised learning. The complement to supervised learning is unsupervised
learning, where the class of the objects in S is not known and the goal is to uncover
some latent structure in the data. In the context of pattern recognition, however, un-
supervised learning is only of minor interest.
A common mistake when gathering the dataset is to pick only especially charac-
teristic, prototypical samples from each class. At first glance, this simplifies the following steps,
because it seems easier to determine the discriminative features. Unfortunately, these
seemingly discriminative features are often useless in practice. Furthermore, in many
Data set S: training set D (50 % of S), validation set V (25 % of S), testing set T (25 % of S).
Fig. 1.5. Rule of thumb to partition the dataset into training, validation and test sets.
situations, the most informative samples are those that represent edge cases. Consider
a system where the goal is to pick out defective products. If the dataset only consists
of the most perfect samples and the most defective samples, it is easy to find highly
discriminative features and one will assume that the classifier will perform with high
accuracy. Yet in practice, imperfect, but acceptable products may be picked out or
products with a subtle, but serious defect may be missed. A good dataset contains
both extreme and common cases. More generally, the challenge is to obtain a dataset
that is representative of the underlying distribution of classes. However, an unrepre-
sentative dataset is often intentional or practically impossible to avoid when one of the
classes is very sparsely populated but representatives of all classes are needed. In the
above example of picking out defective products, it is conceivable that on average only
one in a thousand products has a defect. In practice, one will select an approximately
equal number of defective and intact products to build the dataset S. This means that
the so called a priori distribution of classes must not be determined from S, but has to
be obtained elsewhere.
The dataset S is further partitioned into a training set D, a validation set V, and a
test set T. A rule of thumb is to use 50 % of S for D, 25 % of S for V, and the remaining
25 % of S for T (see Figure 1.5). The test set T is held back and not considered during
most of the design process. It is only used once to evaluate the classifier in the last
design step (see Figure 1.4). The distinction between training and validation set is not
always necessary. The validation set V is needed if the classifier in question is governed
not only by parameters that are estimated from the training set D, but also depends
on so called design parameters or hyper parameters. The optimal design parameters
are determined using the validation set.
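One way to realize the 50/25/25 rule of thumb is a random permutation of the sample indices; the sketch below is not from the book and assumes the samples and labels are already available as NumPy arrays.

```python
import numpy as np

def split_dataset(samples, labels, rng=np.random.default_rng(0)):
    """Split S into training D (50 %), validation V (25 %), and test T (25 %)."""
    n = len(samples)
    idx = rng.permutation(n)                  # random order of the samples
    n_train, n_val = n // 2, n // 4
    d_idx = idx[:n_train]                     # D: 50 % of S
    v_idx = idx[n_train:n_train + n_val]      # V: 25 % of S
    t_idx = idx[n_train + n_val:]             # T: remaining 25 % of S
    return ((samples[d_idx], labels[d_idx]),
            (samples[v_idx], labels[v_idx]),
            (samples[t_idx], labels[t_idx]))

# Usage with toy data: 100 two-dimensional feature vectors and binary labels.
m = np.random.randn(100, 2)
y = np.random.randint(0, 2, size=100)
D, V, T = split_dataset(m, y)
print(len(D[0]), len(V[0]), len(T[0]))  # 50 25 25
```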
A general issue is that the available dataset is often too small. The reason is that
obtaining and (manually) pre-classifying a dataset is typically very time consuming
and thus costly. In some cases, the number of samples is naturally limited, e.g., when
the goal is to classify earthquakes. The partition into training, test and validation sets
further reduces the number of available samples, sometimes to a point where carry-
ing out the remaining design phases is no longer reasonable. Chapter 9 will suggest
methods for dealing with small datasets.
The second step of the design process (see Figure 1.4) is concerned with choosing
suitable features. Different types of features and their characteristics will be covered
in Chapter 2 and will not be discussed at this point. However, two general design
principles should be considered when choosing features:
1. Simple, comprehensible features should be preferred. Features that correspond
to immediate (physical) properties of the objects or features which are otherwise
meaningful, allow understanding and optimizing the decisions of the classifier.
2. The selection should contain a small number of highly discriminative features.
The features should show little deviation within classes, but vary greatly between
classes.
The latter principle is especially important to avoid the so called curse of dimension-
ality (sometimes also called the Hughes effect): a higher dimensional feature space
means that a classifier operating in this feature space will depend on more parameters.
Determining the appropriate parameters is a typical estimation problem. The more
parameters need to be estimated, the more samples are needed to adhere to a given
error bound. Chapter 6 will give more details on this topic.
The third design step is the definition of a suitable classifier (see Figure 1.4). The
boundary between feature extraction and classifier is arbitrary and was already called
“blurry” in Figure 1.2. In the example in Figure 2.4c, one has the option to either stick
with the features and choose a more powerful classifier that can represent curved
decision boundaries, or to transform the features and choose a simple classifier that
only allows linear decision boundaries. It is also possible to take the output of one
classifier as input for a higher order classifier. For example, the first classifier could
classify each pixel of an image into one of several categories. The second classifier
would then operate on the features derived from the intermediate image. Ultimately,
it is mostly a question of personal preference where to put the boundary and whether
feature transformation is part of the feature extraction or belongs to the classifier.
After one has decided on a classifier, the fourth design step (see Figure 1.4) is to
train it. Using the training and validation sets D and V, the (hyper-)parameters of
the classifier are estimated so that the classification is in some sense as accurate as
possible. In many cases, this is achieved by defining a loss function that punishes
misclassification, then optimizing this loss function w.r.t. the classifier parameters.
As the dataset can be considered as a (finite) realization of a stochastic process, the
parameters are subject to statistical estimation errors. These errors will become smaller
the more samples are available.
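As an illustrative sketch of training by loss minimization (not the book's method), the following fits a linear two-class classifier by gradient descent on the logistic loss; the learning rate and number of steps are arbitrary assumptions.

```python
import numpy as np

def train_linear_classifier(M, y, lr=0.1, steps=500):
    """Fit weights w, b by minimizing the mean logistic loss over the training set.

    M: (N, d) feature vectors, y: (N,) labels in {0, 1}.
    """
    N, d = M.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = M @ w + b
        p = 1.0 / (1.0 + np.exp(-z))      # predicted class-1 probability
        grad_w = M.T @ (p - y) / N        # gradient of the mean logistic loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w                  # gradient descent step
        b -= lr * grad_b
    return w, b

# Toy training set: two Gaussian blobs.
rng = np.random.default_rng(1)
M = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_linear_classifier(M, y)
accuracy = np.mean(((M @ w + b) > 0).astype(int) == y)
print(f"training accuracy: {accuracy:.2f}")
```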
An edge case occurs when the sample size is so small and the classifier has so
many parameters that the estimation problem is under-determined. It is then possible
to choose the parameters in such a way that the classifier classifies all training samples
correctly. Yet novel, unseen samples will most probably not be classified correctly, i.e.,
the classifier does not generalize well. This phenomenon is called overfitting and will
be revisited in Chapter 6.
In the fifth and last step of the design process (see Figure 1.4), the classifier is
evaluated using the test set T, which was previously held back. In particular, this step
is important to detect whether the classifier generalizes well or whether it has been
overfitted. If the classifier does not perform as needed, any of the previous steps—in
particular the choice of features and classifier—can be revisited and adjusted. Strictly
speaking, the test set T is already depleted and must not be used in a second run.
Instead, each separate run should use a different test set, which has not yet been seen
in the previous design steps. However, in many cases it is not possible to gather new
samples. Again, Chapter 9 will suggest methods for dealing with such situations.
1.5 Exercises
(1.1) Let S be the set of all computer science students at the KIT. For x,y ∈ S, let x ∼ y
be true iff x and y are attending the same class. Is x ∼ y an equivalence relation?
(1.5) Let x,y ∈ ℕ and f : ℕ → ℕ be a function on the natural numbers. Is the relation
x ∼ y ⇔ f(x) ≤ f(y) an equivalence relation?
(1.6) Let A be a set of algorithms and for each X ∈ A let r(X,n) be the runtime of
that algorithm for an input of length n. Is the following relation an equivalence
relation?
X ∼ Y ⇔ r(X,n) ∈ O (r(Y,n)) for X,Y ∈ A.
where O (f(n)) denotes the set of all functions of n that are asymptotically bounded
above by f(n).
2 Features
A good understanding of features is fundamental for designing a proper pattern recog-
nition system. Thus this chapter deals with all aspects of this concept, beginning with
a classification of the kinds of features and ending with methods for reducing the dimen-
sionality of the feature space. A typical beginner's mistake is to apply mathematical
operations to the numeric representation of a feature just because it is syntactically
possible, even though these operations have no meaning whatsoever for the underlying prob-
lem. Therefore, the first section elaborates on the different types of possible features
and their traits.
2.1 Types of features and their traits
2.1.1 Nominal scale
The nominal scale is made up of pure labels. The only meaningful question to ask is
whether two variables have the same value: the nominal scale only allows comparing
two values w.r.t. equivalence. There is no meaningful transformation besides relabel-
ing. No empirical operation is permissible, i.e., there is no mathematical operation of
nominal features that is also meaningful in the material world.
Table 2.1. Taxonomy of scales of measurement. Empirical relations are mathematical relations that emerge from experiments, e.g., comparing the volume of
two objects by measuring how much water they displace. Likewise, empirical operations are mathematical operations that can be carried out in an experiment,
e.g., adding the mass of two objects by putting them together, or taking the ratio of two masses by putting them on a balance scale and noting the point of the
fulcrum when the scale is balanced.
Nominal scale — Empirical relation: equivalence. Allowed transformation: any one-to-one relabeling. Typical domain: names, integers. Expressiveness: very low. Examples: telephone numbers, postal codes, gender.
Ordinal scale — Empirical relations: equivalence, ordering ≺. Allowed transformation: any strictly increasing mapping. Typical domain: integers. Expressiveness: low. Examples: school grades, degree of hardening, wind intensity.
Interval scale — Empirical relations and operations: equivalence, ordering ≺, addition ⊕. Allowed transformation: m ↦ a·m + b with a > 0. Typical domain: real numbers. Expressiveness: medium. Examples: temperature in °F, calendar time, geographic altitude.
Ratio scale — Empirical relations and operations: equivalence, ordering ≺, addition ⊕, multiplication ⊗. Allowed transformation: m ↦ a·m with a > 0. Typical domain: real numbers. Expressiveness: high. Examples: temperature in K, electric current, bank account balance.
Absolute scale — Empirical relations and operations: equivalence, ordering ≺, addition ⊕, multiplication ⊗. Allowed transformation: identity. Typical domain: natural numbers. Expressiveness: very high. Examples: electron count, Euler characteristic, number of test failures.
A typical example is the sex of a human. The two possible values can be either
written as “f” vs. “m,” “female” vs. “male,” or be denoted by the special symbols ♀
vs. ♂. The labels are different, but the meaning is the same. Although nominal values
are sometimes represented by digits, one must not interpret them as numbers. For
example, the postal codes used in Germany are digits, but there is no meaning in, e.g.,
adding two postal codes. Similarly, nominal features do not have an ordering, i.e., the
postal code 12345 is not “smaller” than the postal code 56789. Of course, most of the
time there are options for how to introduce some kind of lexicographic sorting scheme,
but this is purely artificial and has no meaning for the underlying objects.
With respect to statistics, the permissible average is not the mean (since summa-
tion is not allowed) or the median (since there is no ordering), but the mode, i.e., the
most common value in the dataset.
2.1.2 Ordinal scale
The next higher scale is made of values on an ordinal scale. The ordinal scale allows
comparing values w.r.t. equivalence and rank. Any transformation of the domain must
preserve the order, which means that the transformation must be strictly increasing.
But there is still no way to add an offset to one value in order to obtain a new value or
to take the difference between two values.
Probably the best known example is school grades. In the German grading system,
the grade 1 (“excellent”) is better than 2 (“good”), which is better than 3 (“satisfactory”)
and so on. But quite surely the difference in a student’s skills is not the same between
the grades 1 and 2 as between 2 and 3, although the “difference” in the grades is unity
in both cases. In addition, teachers often report the arithmetic mean of the grades
in an exam, even though the arithmetic mean does not exist on the ordinal scale. In
consequence, it is syntactically possible to compute the mean, even though the result,
e.g., 2.47 has no place on the grading scale, other than it being “closer” to a 2 than
a 3. The Anglo-Saxon grading system, which uses the letters “A” to “F”, is somewhat
immune to this confusion.
The correct average involving an ordinal scale is obtained by the median: the value
that separates the lower half of the sample from the upper half. In other words, 50 %
of the sample is smaller, and 50 % is larger than the median. One can also measure
the scatter of a dataset using the quantile distance. The p-quantile of a dataset is the
value that separates the lower p ⋅ 100 % from the upper (1 − p) ⋅ 100 % of the dataset
(the median is the 0.5-quantile). The p-quantile distance is the distance (number of
values) between the p and (1 − p)-quantile. Common values for p are p = 0, which
results in the range of the data set, and p = 0.25, which results in the inter-quartile
range.
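A short sketch (not from the book) of the permissible statistics on the ordinal scale, using school grades coded as integers; NumPy's nearest-rank quantile is used so that all results stay on the original scale.

```python
import numpy as np

# German school grades (ordinal, coded as integers 1..6)
grades = np.array([1, 2, 2, 3, 3, 3, 4, 5, 2, 1, 3])

# Median: the permissible average on the ordinal scale.
# method="nearest" keeps the result on the original grading scale.
median = np.quantile(grades, 0.5, method="nearest")

def quantile_distance(x, p):
    """p-quantile distance: spread between the p- and the (1-p)-quantile."""
    return (np.quantile(x, 1 - p, method="nearest")
            - np.quantile(x, p, method="nearest"))

print(median)                           # 3
print(quantile_distance(grades, 0.25))  # inter-quartile range
print(quantile_distance(grades, 0.0))   # range of the dataset
```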
2.1.3 Interval scale
The interval scale allows adding an offset to one value to obtain a new one, or to calcu-
late the difference between two values—hence the name. However, the interval scale
lacks a naturally defined zero. Values from the interval scale are typically represented
using real numbers, which contain the symbol “0,” but this symbol has no special
meaning and its position on the scale is arbitrary. For this reason, scalar multiplication
of values from the interval scale is meaningless. Permissible transformations
preserve the order, but may shift the position of the zero.
A prominent example is the (relative) temperature in °F and °C. The conversion
from Celsius to Fahrenheit is given by T_F = (9/5 °F/°C) · T_C + 32 °F. The temperatures 10 °C
and 20 °C on the Celsius scale correspond to 50 °F and 68 °F on the Fahrenheit scale.
Hence, one cannot say that 20 °C is twice as warm as 10 °C: this statement does not
hold w.r.t. the Fahrenheit scale.
The interval scale is the first of the discussed scales that allows computing the
arithmetic mean and standard deviation.
2.1.4 Ratio scale and absolute scale
The ratio scale has a well defined, non-arbitrary zero, and therefore allows calculating
ratios of two values. This implies that there is a scalar multiplication and that any
transformation must preserve the zero. Many features from the field of physics belong
to this category and any transformation is merely a change of units. Note that although
there is a semantically meaningful zero, this does not mean that features from this
scale may not attain negative values. An example is one’s account balance, which
has a defined zero (no money in the account), but may also become negative (open
liabilities).
The absolute scale shares these properties, but is equipped with a natural unit
and features of this scale can not be negative. In other words, features of the absolute
scale represent counts of some quantities. Therefore, the only allowed transformation
is the identity.
2.2 Feature space inspection
For a well-working system, the question of how to find “good,” i.e., discriminative, fea-
tures of objects needs to be answered. The primary course of action is to visually
inspect the feature space for good candidates.
In order to find discriminative features, one needs to get an idea about the structure
of the feature space. In the one- or two-dimensional case, this can be easily done by
looking at a visual representation of the dataset in question, e.g., a histogram or a
Fig. 2.1. Iris flower dataset as an example of how projection helps the inspection of the feature space: (a) three-dimensional feature space (petal length, petal width, sepal length); (b) two-dimensional projection onto petal length and petal width with aligned histograms.
scatter plot. Even with three dimensions, a perspective view of the data might suffice.
However, this approach becomes problematic when the number of dimensions is larger
than three.
2.2.1 Projections
Fig. 2.2. Difference between the full projection and the slice projection techniques: (a) 3D scatter plot of all samples; (b) projection of all samples onto the m1,m2-plane; (c) 3D scatter plot of samples in a slice; (d) projection of samples in the slice onto the m1,m2-plane.
A classic example is the Iris flower dataset, which quantifies the morphological variation of Iris flowers of three
related species.
Figure 2.1a depicts a perspective drawing of the three features petal width, petal
length and sepal length. Figure 2.1b shows a two-dimensional projection and two
aligned histograms of the same data by omitting the sepal length. The latter clearly
shows that the features petal length and petal width are already sufficient to distin-
guish the species Iris setosa from the others. Further two-dimensional projections
might show that Iris versicolor and Iris virginica can also be easily separated from
each other.
2.2.2 Intersections and slices
If the distribution of the samples in the feature space is more complex, simple projec-
tions might fail. Even worse, this approach might lead to the wrong conclusion that the
samples of two different classes cannot be separated by the features in question even
Fig. 2.3. Construction of a slice: the mean plane is spanned by the directional vectors a and b, with normal vector n1 and oriented distance u from the origin.
though they can be. Figure 2.2 shows this issue using artificial data. The objects of the
first class are all distributed within a solid sphere. The samples of the second class lie
close to the surface of a second, larger sphere. This sphere encloses the samples of the
first class, but the radius is large enough to separate the classes.
The initial situation is depicted in Figure 2.2a. Even though the samples can be
separated, any projection to a two-dimensional subspace will suggest that the classes
overlap each other, as shown in Figure 2.2b. However, if one projects slices of the data
instead of all of it at once, the structure becomes apparent. Figure 2.2c shows the result of
such a slice in the three dimensional space and Figure 2.2d shows the corresponding
projection. The latter clearly shows that one class only encloses the other but can be
distinguished nonetheless.
The principal idea of the construction is illustrated in Figure 2.3. The slice is de-
fined by its mean plane (yellow) and a bound ε that defines half of the thickness of
the slice. Any sample that is located at a distance less than this bound is projected
onto the plane. The mean plane itself is given by its two directional vectors a, b and
its oriented distance u from the origin. The mean plane on its own, i.e., a slice with
zero thickness (ε = 0), does not normally suffice to “catch” any sample points: if the
samples are continuously distributed, the probability that a sample is intersected by the
mean plane is zero.
Let d ∈ ℕ be the dimension of the feature space. A two-dimensional plane is
defined either by its two directional vectors a and b or as the intersection of d − 2
linearly independent hyperplanes. Hence, let
{a, b, n1 , . . . , n_{d−2} } (2.1)
denote an orthonormal basis of the feature space, where each nj is the normal vector
of a hyperplane. Let u1 , . . . , u d−2 be the oriented distances of the hyperplanes from
the origin. The two-dimensional plane is defined by the solution of the system of linear
equations
n_1ᵀ m − u_1 = 0
⋮
n_{d−2}ᵀ m − u_{d−2} = 0. (2.2)
Let m = (m_1 , . . . , m_d )ᵀ be an arbitrary point of the feature space. The distance of
m from the plane in the direction of n_j is given by n_jᵀ m − u_j , hence the total Euclidean
distance of m from the plane is
v = √( ∑_{j=1}^{d−2} (n_jᵀ m − u_j)² ). (2.3)
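The slice technique follows directly from Equations (2.2) and (2.3); the sketch below (not from the book) represents the mean plane by a point on it and the orthonormal directional vectors a and b, and returns the in-plane coordinates of all samples within distance ε of the plane.

```python
import numpy as np

def slice_projection(M, a, b, p0, eps):
    """Project all samples within distance eps of the plane onto the plane.

    M:    (N, d) array of feature vectors
    a, b: orthonormal directional vectors spanning the plane
    p0:   a point on the plane (fixes the offsets u_1, ..., u_{d-2})
    eps:  half thickness of the slice
    Returns the (n, 2) in-plane coordinates of the selected samples.
    """
    X = M - p0                                   # work relative to the plane
    coords = np.stack([X @ a, X @ b], axis=1)    # in-plane coordinates
    residual = X - coords[:, :1] * a - coords[:, 1:2] * b
    v = np.linalg.norm(residual, axis=1)         # distance to the plane, cf. Eq. (2.3)
    return coords[v <= eps]

# Toy usage: a random 5-dimensional dataset, sliced along the first two axes.
rng = np.random.default_rng(0)
M = rng.normal(size=(1000, 5))
a = np.array([1.0, 0, 0, 0, 0])
b = np.array([0, 1.0, 0, 0, 0])
print(slice_projection(M, a, b, p0=np.zeros(5), eps=0.5).shape)
```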
2.3 Transformations of the feature space
Because the sample size is limited, it is usually advisable to restrict the number of
features used. Apart from limiting the selection, this can also be achieved by a suit-
able transformation of the feature space (see Figure 2.4). In Figure 2.4a it is possible
to separate the two classes using the feature m1 alone. Hence, the feature m2 is not
needed and can be omitted. In Figure 2.4b, both features are needed, but the classes
are separable by a straight line. Alternatively, the feature space could be rotated in
such a way that the new feature m2 is sufficient to discriminate between the classes.
The annular classes in Figure 2.4c are not linearly separable, but a nonlinear transfor-
mation into polar coordinates shows that the classes can be separated by the radial
component. Section 2.7 will present methods for automating such transformations to
some degree. Especially the principal component analysis will play a central role.
2.4 Measurement of distances in the feature space
As will be shown in later chapters, many classifiers need to calculate some kind of
distance between feature vectors. A very simple, yet surprisingly well-performing clas-
sifier is the so-called nearest neighbor classifier: Given a dataset with known points
Fig. 2.4. Transformations of the feature space: (a) a single feature m1 suffices to separate the classes; (b) linearly separable classes; (c) annular classes that become separable by the radial component after a transformation to polar coordinates (r, φ).
in the feature space and known class memberships for each point, a new point with
unknown membership is assigned to the same class as the nearest known point. Obvi-
ously, the concept “being nearest to” requires a measure of distance.
If the feature vector was an element of a standard Euclidean vector space, one
could use the well known Euclidean distance
‖m − m′‖ = √( ∑_{i=1}^{d} |m_i − m′_i|² ), (2.5)
but this approach relies on some assumptions that are generally not true for real-world
applications. The cause of this can be summarized by the heterogeneity of the compo-
nents of the feature space, meaning
– features on different scales of measurement,
– features with different (physical) units,
– features with different meanings and
– features with differences in magnitude.
Above all, Equation (2.5) requires that all components m_i , m′_i , i = 1, . . . , d are at least
on an interval scale. In practice, the components are often a mixture of real numbers,
ordinal values and nominal values. In these cases, the Euclidean distance in Equa-
tion (2.5) does not make sense; even worse, it is syntactically incorrect.
In cases where all the components are real numbers, there is still the problem of
different scales or units. For example, the same (physical) feature, “length,” can be
given in “inches” or “miles.” The problem gets even worse if the components stem from
different physical magnitudes, e.g., if the first component is a mass and the second
component is a length. A simple solution to this problem is a weighted sum of the
individual component distances, i.e.,
d d
D (m, m ) = ∑ α i D i (m i , mi ) for α i > 0 and ∑ α i = 1. (2.6)
i=1 i=1
To discuss the oncoming concepts, we must first define the terms that will be used.
Definition 2.1 (Metric, metric space). Let M be a set and m, m′, m″ ∈ M. A func-
tion D : M × M → ℝ≥0 is called a metric iff
1. D(m, m′) ≥ 0 (non-negativity)
2. D(m, m′) = 0 ⇔ m = m′ (reflexivity, coincidence)
3. D(m, m′) = D(m′, m) (symmetry)
4. D(m, m″) ≤ D(m, m′) + D(m′, m″) (triangle inequality)
A set M equipped with a metric D is called a metric space.
With respect to real-world applications, having a metric feature space is an ideal, but
unrealistic situation. Luckily, fewer requirements will often suffice. As will be seen
in Section 2.4.5, the Kullback–Leibler divergence is not a metric because it lacks the
symmetry property and violates the triangle inequality, but it is quite useful nonethe-
less. Those functions that fulfil some, but not all of the above requirements are usually
called distance functions, discrepancies, or divergences. None of these terms is pre-
cisely defined. Moreover, “distance function” is also used as a synonym for metric and
should be avoided to prevent confusion. “Divergence” is generally only used for func-
tions that quantify the difference between probability distributions, i.e., the term is
used in a very specific context. Another important concept is given by the term (vector)
norm:
Definition 2.2 (Norm, normed vector space). Let M be a vector space over the real
numbers and let m, m′ ∈ M. A function ‖⋅‖ : M → ℝ≥0 is called a norm iff
1. ‖m‖ ≥ 0 and ‖m‖ = 0 ⇔ m = 0 (positive definiteness)
2. ‖αm‖ = |α| ‖m‖ with α ∈ ℝ (homogeneity)
3. ‖m + m′‖ ≤ ‖m‖ + ‖m′‖ (triangle inequality)
A vector space M equipped with a norm ‖⋅‖ is called a normed vector space.
Due to the prerequisite of the definition, a normed vector space can only be applied to
features on a ratio scale. A norm can be used to construct a metric, which means that
every normed vector space is a metric space, too.
Definition 2.3 (Induced metric). Let M be a normed vector space with norm ‖⋅‖ and
let m, m′ ∈ M. Then
D(m, m′) := ‖m − m′‖ (2.7)
defines an induced metric on M.
Note that because of the homogeneity property, Definition 2.2 requires the value to
be on a ratio scale; otherwise the scalar multiplication would not be well defined.
However, the induced metric from Definition 2.3 can be applied to an interval scale,
too, because the proof does not need the scalar multiplication. Of course, one must
not say that the metric D(m, m′) = ‖m − m′‖ stems from a norm, because there is no
such thing as a norm on an interval scale.
2.4.2 Elementary norms and metrics
Inarguably, the most familiar example of a norm is the Euclidean norm. But this norm
is just a special embodiment of a whole family of vector norms that can be used to
quantify the distance of features on a ratio scale. The norms of this family are called
Minkowski norms or p-norms.
Definition 2.4 (Minkowski norm, p-norm). Let M denote a real vector space of finite
dimension d and let r ∈ ℤ ∪ {∞} be a constant parameter. Then
‖m‖_r = ( ∑_{i=1}^{d} |m_i|^r )^{1/r}   if r < ∞,
‖m‖_r = max_{i=1,...,d} |m_i|           if r = ∞ (2.8)
is a norm on M.
The name “p-norm” comes from the fact that the parameter is traditionally denoted
by p and not r as seen here. This book uses r to avoid a clash of names, because p is
already used to denote a probability density function.
Fig. 2.5. Unit circles for Minkowski norms with different choices of r (r = 0.4, 0.6, 1, 2, 5, −1, −2, −5). Only the upper right quadrant of the two-dimensional Euclidean space is shown.
Although r can be any integer or infinity, only a few choices are of greater impor-
tance. For r = 2, one obtains the Euclidean norm
‖m‖_e = ‖m‖_2 = √( ∑_{i=1}^{d} |m_i|² ). (2.9)
For r = 1, the result is the sum of absolute values,
‖m‖_1 = ∑_{i=1}^{d} |m_i|. (2.10)
This norm—or more precisely: the induced metric—is also known as taxicab metric
or Manhattan metric. One can visualize this metric as the distance that a car must go
between two points of a city with a rectilinear grid of streets like in Manhattan. For
r = ∞, the resulting norm
‖m‖_max = ‖m‖_∞ = max_{i=1,...,d} |m_i| (2.11)
is called maximum norm or Chebyshev norm. Figure 2.5 depicts the unit circles for
different choices of r in the upper right quadrant of the two-dimensional Euclidean
space.
Furthermore, the Mahalanobis norm is another common metric for real vector
spaces:
Definition 2.5 (Mahalanobis norm). Let M denote a real vector space of finite dimen-
sion d and let A ∈ ℝ^{d×d} be a positive definite matrix. Then
‖m‖_A = √( mᵀ A m ) (2.12)
is a norm on M.
To a certain degree, the Mahalanobis norm is another way to generalize the Euclidean
norm: they coincide for A = Id . More generally, elements A ii on the diagonal of A can
be thought of as scaling the corresponding dimension i, while off-diagonal elements
A ij , i ≠ j assess the dependence between the dimensions i and j. The Mahalanobis norm also
appears in the multivariate normal distribution (see Definition 3.3), where the matrix
A is the inverse of the covariance matrix Σ of the data.
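Both families of norms are easily evaluated numerically; in the sketch below (not from the book), the Mahalanobis matrix A is chosen as the inverse sample covariance, which is a common but not the only possible choice.

```python
import numpy as np

def minkowski_norm(m, r):
    """p-norm of Eq. (2.8); r = np.inf gives the maximum (Chebyshev) norm."""
    if np.isinf(r):
        return np.max(np.abs(m))
    return np.sum(np.abs(m) ** r) ** (1.0 / r)

def mahalanobis_norm(m, A):
    """Mahalanobis norm sqrt(m^T A m) for a positive definite matrix A."""
    return float(np.sqrt(m @ A @ m))

m = np.array([3.0, -4.0])
print(minkowski_norm(m, 1), minkowski_norm(m, 2), minkowski_norm(m, np.inf))
# 7.0 5.0 4.0

# Mahalanobis distance of two samples w.r.t. the inverse covariance of a dataset.
data = np.random.default_rng(0).normal(size=(200, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
A = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis_norm(data[0] - data[1], A))
```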
So far only norms and their induced metrics that require at least an interval scale
were considered. The metrics handle all quantitative scales of Table 2.1. The next sec-
tions will introduce metrics for features on other scales.
2.4.3 A metric for sets
Let us assume one has a finite set U and the features in question are subsets of U. In
other words, the feature space M is the power set P(U) of U. On the one hand the
features are clearly not ordinal, because the relation “⊆” induces only a partial order.
Of course, it is possible to artificially define an ad hoc total order because M is finite,
but the focus shall remain on generally meaningful metrics. On the other hand, a mere
nominal feature only allows to state if two values (here: two sets) are equal or not.
However, two sets can also be said to be “nearly equal” when both the intersection
and the set difference is non-empty (i.e., they share some, but not all elements). The
Tanimoto metric reflects these situations.
Definition 2.6 (Tanimoto metric). Let U be a finite set, M = P (U) and S1 , S2 ∈ M, i.e.,
S1 , S2 ⊆ U. Then
DTanimoto (S1 , S2 ) = ( |S1 | + |S2 | − 2 |S1 ∩ S2 | ) / ( |S1 | + |S2 | − |S1 ∩ S2 | ) ∈ [0, 1] (2.13)
defines a metric on M.
Here, we will omit the proof that DTanimoto is indeed a metric (interested readers are re-
ferred to, e.g., the proof of Lipkus [1999]) and instead investigate its properties. If S1 and
S2 denote the same set, then |S1 | = |S2 | = |S1 ∩ S2 | and therefore DTanimoto (S1 , S2 ) = 0.
Conversely, if S1 and S2 do not have any element in common, |S1 ∩ S2 | = 0 holds and
DTanimoto (S1 , S2 ) = 1. Altogether, the Tanimoto metric varies on the interval from 0
(identical) to 1 (completely different).
Moreover, the Tanimoto metric takes the overall number of elements into account.
Two sets that differ in one element are judged to be increasingly similar, as the number
of shared elements increases. For example, let U = {a, . . . , z}, S1 = {a, b, c}, S2 =
{a, b, d}, S1′ = {a, b, d, e, f} and S2′ = {a, b, d, e, g}. It follows that
DTanimoto (S1 , S2 ) = (3 + 3 − 4)/(3 + 3 − 2) = 1/2 and (2.14)
DTanimoto (S1′ , S2′ ) = (5 + 5 − 8)/(5 + 5 − 4) = 1/3. (2.15)
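The Tanimoto metric translates directly into code; the sketch below (not from the book) reproduces the two example values and assumes non-empty sets.

```python
def tanimoto(s1: set, s2: set) -> float:
    """Tanimoto metric of Eq. (2.13): 0 for identical sets, 1 for disjoint sets.

    Both sets are assumed to be non-empty."""
    inter = len(s1 & s2)
    return (len(s1) + len(s2) - 2 * inter) / (len(s1) + len(s2) - inter)

print(tanimoto({"a", "b", "c"}, {"a", "b", "d"}))                      # 0.5
print(tanimoto({"a", "b", "d", "e", "f"}, {"a", "b", "d", "e", "g"}))  # 0.333...
```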
2.4.4 Metrics on the ordinal scale
It is not immediately clear how to define a meaningful metric for ordinal features,
since there is no empirical addition on that scale. A possible solution is to consider the
metric D(m, m′) of two ordinal features m, m′ as the number of swaps of neighboring
elements needed to reach m′ from m.
Consider, for example, the set of characters in the English language {A, B, C, . . . , Z},
where the order corresponds to the position in the alphabet. The metric informally de-
fined above would yield D(A,C) = 2 and D(A,A) = 0. This example can be generalized
as follows:
Definition 2.7 (Permutation metric). Let M be a locally finite and totally ordered set
with a unique successor function succ, i.e., for each element x ∈ M there is a unique next
element x′ ∈ M. Then
D(x, y) := min {k ∈ ℕ0 | succᵏ(x) = y or succᵏ(y) = x} (2.16)
is a metric on M.
Another way to look at Definition 2.7 is to homomorphically map M into the integers
(i.e., successive elements of M are mapped to successive integers) and calculate the
absolute difference of the numbers corresponding to the elements.
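Following this remark, the permutation metric can be computed by mapping each element to its rank and taking the absolute difference; a sketch (not from the book) for the Latin alphabet:

```python
import string

ALPHABET = string.ascii_uppercase                 # totally ordered set A < B < ... < Z
RANK = {ch: i for i, ch in enumerate(ALPHABET)}   # homomorphic map into the integers

def permutation_metric(x: str, y: str) -> int:
    """Number of successor steps between two elements of the ordered set."""
    return abs(RANK[x] - RANK[y])

print(permutation_metric("A", "C"))  # 2
print(permutation_metric("A", "A"))  # 0
```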
2.4.5 The Kullback–Leibler divergence
The Kullback–Leibler divergence (KL divergence) does not directly quantify a differ-
ence between features m, but between probability distributions (characterized by the
probability mass function or the probability density) over the features. It is often used
as a meta metric to compare objects o i , i = 1, . . . ,N that are in turn characterized
by a set of features Oi = {mj | j = 1, . . . , M i }. To this end, the features in Oi are
used to estimate the probability mass function P̂(m | o i ) or the probability density p̂(m | o i ) for each
object o i . The KL divergence is then used to compute the distance between two object-
dependent distributions and, by proxy, the distance between two objects. Here, the
P̂(m | o i ) or p̂(m | o i ) can themselves be interpreted as features of o i . An extended
example of this approach is given below.
Definition 2.8 (Kullback–Leibler divergence).
1. Let P, P′ be two probability mass functions on the same discrete space M. The Kullback–
Leibler divergence of P with respect to P′ is given by
D(P ‖ P′) = ∑_{m ∈ supp P} P(m) ln ( P(m) / P′(m) ). (2.17)
2. Let p, p′ be two probability density functions on the same space M. The Kullback–
Leibler divergence of p with respect to p′ is given by
D(p ‖ p′) = ∫⋅⋅⋅∫_{supp p} p(m) ln ( p(m) / p′(m) ) dm. (2.18)
In both cases, the KL divergence can be interpreted as the expected value of the log-likelihood ratio if m is distributed according to p(m),
D(p ‖ p′) = E{ log ( p(m) / p′(m) ) }. (2.20)
The likelihood ratio is the crucial quantity of optimal statistical tests to decide between
two competing hypotheses H1 : m ∼ p(m) and H2 : m ∼ p′(m) (Neyman and Pearson
[1992]). In other words, the Kullback–Leibler divergence measures the mean discrim-
inability between H1 and H2 .
To become accustomed to the Kullback–Leibler divergence, we will now dis-
cuss some simple examples. First, consider the family of Bernoulli distributions
parametrized by the probability of success τ ∈ [0,1], that is, P_τ(m) = τ^m (1 − τ)^{1−m} for m ∈ {0, 1}.
Fig. 2.6. Kullback–Leibler divergence D(P_{a=0.3} ‖ P_b ) of two Bernoulli distributions as a function of the parameter b.
Second, consider two Gaussian densities p1 (m) = p(m | μ1 , σ1 ) and p2 (m) = p(m | μ2 , σ2 ). For their Kullback–Leibler divergence, one obtains
D(p1 ‖ p2 ) = ∫_ℝ p(m | μ1 , σ1 ) ln [ (2πσ1²)^{−1/2} exp(−(m − μ1 )²/(2σ1²)) / ( (2πσ2²)^{−1/2} exp(−(m − μ2 )²/(2σ2²)) ) ] dm
= −(1/2) ln(σ1²/σ2²) − (1/(2σ1²)) ∫_ℝ p(m | μ1 , σ1 )(m − μ1 )² dm + (1/(2σ2²)) ∫_ℝ p(m | μ1 , σ1 )(m − μ2 )² dm
= −(1/2) ln(σ1²/σ2²) − σ1²/(2σ1²) + (σ1² + (μ1 − μ2 )²)/(2σ2²)
= (1/2)( σ1²/σ2² − ln(σ1²/σ2²) − 1 ) + (μ1 − μ2 )²/(2σ2²). (2.24)
Fig. 2.7. Pairs of Gaussian distributions with equal variance σ² = 0.5 and their KL divergences: (a) μ_1 = −1, μ_2 = 1, D_KL = 4; (b) μ_1 = −2, μ_2 = 2, D_KL = 16.
Fig. 2.8. Pairs of Gaussian distributions with different variances and their KL divergences: (a) μ_1 = 0, σ_1² = 1, μ_2 = 1, σ_2² = 0.5, D_KL = 1.153; (b) μ_1 = 1, σ_1² = 0.5, μ_2 = 0, σ_2² = 1, D_KL = 0.597.
Fig. 2.10. Combustion engine, microscopic image of bore texture and detail with texture model with
groove parameters. Source: Krahe and Beyerer [1997].
where i = 1, 2 indicates the first or second set. Here, µi is the expected value of u in
the ith groove set,
µ_i = E_i{u} = ∭ u p_i(u, ∆) du d∆ = (μ_{a_i}, μ_{b_i})^T,  (2.26)
and ρ_i = E{(a_{ij} − μ_{a_i})(b_{ij} − μ_{b_i})} / (σ_{a_i} σ_{b_i}) is the correlation coefficient within the ith groove set. The covariance matrix of u in the ith groove set is
C_i = E{(u − µ_i)(u − µ_i)^T} = ( σ_{a_i}²  ρ_i σ_{a_i} σ_{b_i} ;  ρ_i σ_{a_i} σ_{b_i}  σ_{b_i}² ).  (2.27)
The parameter λ_i denotes the groove density in the ith set, i.e.,
λ_i = 1 / E{∆_{ij}}.  (2.28)
This model will be used to construct a measure of the distance between two groove
sets. Recall from Definition 2.8 that the KL divergence between p1 and p2 is asymmetric.
In order to derive a symmetric measure of the distance between two groove sets, we
simply take the sum of the KL divergence between p_1 and p_2 and the KL divergence
with these arguments transposed, D(p_1, p_2) := D(p_1 ∥ p_2) + D(p_2 ∥ p_1).
To conclude this section on metrics, we will discuss the tangential distance measure.
This method does not introduce a new metric, but rather builds on top of a given one
and makes this metric more robust against small, systematic disturbances of the fea-
ture vectors that may be caused by varying lighting conditions, out of focus images,
small rotations of the pattern, etc. The key is that these disturbances should be sys-
tematic and not due to random noise.
Consider, for example, the problem of optical character recognition. Figure 2.11
shows two possible systematic variations of a character (the pattern): rotation and line
thickness. This variation causes differing patterns, and will therefore result in different
feature vectors. However, since the variations in the pattern are systematic, so will
the variations in the feature vectors. More precisely, small variations of the pattern
will move the feature vector within a small neighborhood of the feature vector of the
original pattern (given that the feature mapping is smooth in the appropriate sense).
This observation leads to the following assumption: the feature vectors O_i = {m_j | j =
1, . . . , M_i} derived from the patterns of an object o_i lie on a topological manifold. The
mathematical details and implications of this insight are quite profound and outside
the scope of this book. Nonetheless, Appendix B gives a primer of the most important
terms and concepts of the underlying theory. For the purposes of this section, it is
sufficient to interpret “manifold” as a lower-dimensional hypersurface embedded in
the feature space. In other words, the features mj ∈ Oi of the object o i do not populate
the feature space arbitrarily, but are restricted to some surface within the feature space.
An example is shown in Figure 2.12, where the black curves show the manifolds to which the features of two objects o_i and o_k are restricted. In the context of the OCR example, the two objects stand for different characters, e.g., o_i for the character “A” and o_k for the character “B.” Formally, the manifolds are given by the set of all points generated by an action A of some transformation p on the object o,
M_i := {A(p, o_i) | p ∈ Π}.
Again, the mathematical definitions of the terms are found in Appendix B. Here,
it is sufficient to interpret A(p,o i ) as the feature vector that is extracted from some
systematic variation parametrized by p.
Figure 2.12 also highlights an important issue: given a feature vector m to clas-
sify and two feature vectors mi and mk derived from the objects o i and o k , respectively,
computing the distance between the features might lead to the wrong conclusions.
Here, m is closest to mk and hence one could conclude that the object that produced
Fig. 2.12. Tangential distance measure: Improving the distance measure of two feature vectors by
linear interpolation of the manifold corresponding to the underlying object.
m is more similar to o_k than to o_i. If, however, one considers the entire manifold of
features of both objects, one arrives at a different picture: the closest point m_i′ to m on
M_i is closer than the closest point m_k′ on M_k. In consequence, m is closer to o_i
than to o_k, which is the opposite of what was deduced from the distances of the given
features. This motivates the following improved distance measure:
D_Manifold(m, m_i) := min_{m′ ∈ M_i} ‖m − m′‖, i.e., the distance from m to the manifold that contains m_i.
Unfortunately, the manifold is generally not known and even if it were, computing
the minimal distance is generally computationally infeasible. A solution to this is
given by the tangential distance measure: similar to a first order Taylor expansion,
the true distance D_Manifold(m, m_i) is approximated using a tangential (that is, linear)
approximation at m_i,
D_Tangent(m, m_i) := min_{‖a‖ < ε} ‖m − (m_i + T_{m_i} a)‖.  (2.33)
Here, Tmi denotes the tangent space (or, more precisely, the projection onto the tangent
space) of the manifold at mi . The search for the closest distance is further restricted to
a small neighborhood {mi + Tmi a | ‖a‖ < ε} around mi . The reason is that, similar to
the Taylor expansion, the linear approximation becomes more and more inaccurate
the further one deviates from mi .
Figure 2.12 illustrates this approach: The feature vector m is closer to the tangent
identified by Tmi (purple line) than to the tangent identified by Tmk (orange line).
Hence, m is correctly assigned to the object o i instead of o k . Note that in the figure,
the neighborhoods (denoted by the perpendicular stops on the tangents) are chosen
to be very large. In consequence, the approximation to the manifold Mk does not
hold. In practice, one would probably choose a smaller neighborhood. However, the
neighborhood must not be chosen too small either, because then the distance to the
tangent T_{m_i} will not differ much from the distance to the original feature vector m_i.
Note: If one chooses the Euclidean norm, the minimization w.r.t. a in Equa-
tion (2.33) reduces to a quadratic optimization problem, which can be solved using
standard tools.
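The following NumPy sketch solves exactly this quadratic problem with np.linalg.lstsq and optionally clips ‖a‖ to the neighborhood ε; the function name and the toy tangent are our own choices, not the book's implementation.

import numpy as np

def tangential_distance(m, m_i, T, eps=np.inf):
    """Distance from m to the tangent {m_i + T a}, with ||a|| restricted to eps."""
    a, *_ = np.linalg.lstsq(T, m - m_i, rcond=None)   # least-squares solution of T a = m - m_i
    norm_a = np.linalg.norm(a)
    if norm_a > eps:                                   # stay inside the trusted neighborhood
        a *= eps / norm_a
    return np.linalg.norm(m - (m_i + T @ a))

# toy example: a one-dimensional tangent in the plane
m_i = np.array([0.0, 0.0])
T = np.array([[1.0], [0.0]])          # tangent direction along the first axis
m = np.array([3.0, 1.0])
print(tangential_distance(m, m_i, T))            # 1.0: only the offset orthogonal to the tangent remains
print(tangential_distance(m, m_i, T, eps=1.0))   # about 2.236: the clipped neighborhood increases the distance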
However, exactly computing the tangent space T_m requires the evaluation of the
gradient of A(p, o_k) at m. Unfortunately, this information is rarely available in practice.
The tangent space can, however, be approximated by a secant t̂(m):
where det [∆p1 , . . . , ∆pq ] ≠ 0, i.e., the ∆pj are linearly independent. The small dis-
turbances ∆p_j, j = 1, . . . , q can be obtained by recording the objects under various
conditions, or (more commonly) by simulating these conditions in software from a
small number of actual measurements.
An example of this is shown in Figure 2.13, where only the patterns in the orange
boxes were obtained by measurement. The remaining variations were approximated
using linear interpolation. The corresponding features lie on the secants between the
features of the measured patterns. Comparing Figure 2.13 to the true variations in
Figure 2.11 highlights another important trade-off when using this method: how many
measurements, or sampling points of the manifold, should one obtain? Measuring the
variations takes a considerable effort, but measuring too few variations will reduce
the quality of the approximation.
Lastly, Figure 2.12 illustrates another drawback of the method: the manifolds Mk =
{A (p, o k ) | p ∈ Π} and Mi = {A (p, o i ) | p ∈ Π} are drawn as closed curves. However,
this is not necessarily true. Indeed, the manifold need not even be connected, but
might consist of several, disconnected strips. As a result, the tangential approximation
may significantly underestimate the actual distance to the manifold. Furthermore,
the secant approximation may assume a manifold where there is none, and therefore
invalidate the whole method. However, these issues rarely occur in practice.
2.5 Normalization
The previous section has already presented an approach to enhance the robustness of
a distance measure. Normalization has a similar goal, but is applied at an earlier stage
in the processing chain. Instead of improving a metric, normalization tries
– to eliminate extraneous disturbances of the patterns,
– to eliminate extraneous variations of the patterns and
– to eliminate extraneous variations of the features.
If modifications are already avoided at the stage of the patterns, the deduced features
become independent of those modifications. Surely, this task is highly domain specific,
and requires a good understanding of the concrete pattern recognition system. For this
reason, this section can only present some examples of what can be done in certain
cases. These examples are:
1. Planimetric adjustment of images;
2. lighting adjustment of images;
3. amplitude recovery of audio signals, e.g., by automatic gain adjustment;
4. distortion adjustment of images (due to lens aberrations);
5. alignment, elimination of physical dimension, and leveling of proportions; and
6. dynamic time warping.
Let us assume one has features with values on an interval scale at least, and let m be
the feature vector modeled as a random variable. Then Item 5 can be realized by
m′ = ( (m_1 − E{m_1}) / √Var{m_1}, . . . , (m_d − E{m_d}) / √Var{m_d} )^T.  (2.35)
In practice, the expectation and the variance are unknown and are replaced by their empirical estimates computed from a sample of N feature vectors:
m̄_j = (1/N) ∑_{i=1}^{N} m_{ij},  (2.37)
s_j = √( (1/(N−1)) ∑_{i=1}^{N} (m_{ij} − m̄_j)² )   and  (2.38)
m′ = ( (m_1 − m̄_1)/s_1, . . . , (m_d − m̄_d)/s_d )^T.  (2.39)
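A minimal NumPy sketch of Equations (2.37)–(2.39); the assumption that samples are stored row-wise and the function name are our own choices.

import numpy as np

def zscore_normalize(samples: np.ndarray) -> np.ndarray:
    """Normalize each feature to zero mean and unit sample standard deviation.

    samples: array of shape (N, d), one feature vector per row.
    """
    mean = samples.mean(axis=0)            # Equation (2.37)
    std = samples.std(axis=0, ddof=1)      # Equation (2.38), N - 1 in the denominator
    return (samples - mean) / std          # Equation (2.39)

data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
print(zscore_normalize(data))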
The adjustment and normalization of the lighting conditions of images is a very broad
field. As this textbook is about pattern recognition, this section can only touch on
this topic. A more detailed discussion can be found in the relevant literature, e.g., in
Machine Vision by Beyerer et al. [2016]. The examples shown here are taken from that
book.
Chromaticity normalization
Here and in the discussions below, the color values of the pixels are assumed to be
within the interval [0, 1]. Chromaticity normalization transforms each pixel so that all
the pixels (except the black ones) have the same intensity. More formally, given the
color components (r, g, b) of a pixel, the transformation maps to the color value
(r′, g′, b′) = { (1/(r + g + b)) (r, g, b)   if r + g + b > 0,
                 (0, 0, 0)                   if r + g + b = 0.  (2.40)
Figure 2.14 shows an example.
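A possible NumPy realization of Equation (2.40) on an (H, W, 3) RGB image with values in [0, 1]; the function name is our own and the sketch assumes this memory layout.

import numpy as np

def chromaticity_normalize(img: np.ndarray) -> np.ndarray:
    """Apply Equation (2.40) pixel-wise to an RGB image with values in [0, 1]."""
    s = img.sum(axis=2, keepdims=True)          # r + g + b per pixel
    out = np.zeros_like(img)
    np.divide(img, s, out=out, where=s > 0)     # black pixels (sum == 0) stay (0, 0, 0)
    return out

img = np.random.rand(4, 4, 3)
print(chromaticity_normalize(img).sum(axis=2))  # every non-black pixel now sums to 1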
Illumination normalization tries to equalize the overall proportions of the color
channels in order to mitigate the effect of different light sources with varying color.
Fig. 2.15. Normalization of lighting conditions by iterated normalization of the chromaticity and
illumination. Source: Beyerer et al. [2016].
A filter T is linear if T(αf + βg) = αT(f) + βT(g), where f and g are signals and α, β ∈ ℝ are scalars. Informally, this requires that the result of applying T to a combination of signals has to be the same as first applying T to each signal and then combining the results.
Let us now assume a very simple signal model for the image generation process,
given by
g(x) = s(x) △ b(x),   x = (x, y)^T.  (2.45)
Here g : ℝ2 → ℝ denotes the final image as seen by the pattern recognition system,
s : ℝ2 → ℝ the true underlying image we wish to recover, b : ℝ2 → ℝ the disturbance,
and △ a binary operator on two signals.
If △ is addition, then s is said to be subject to additive noise. Additive systems
are often preferred because they can usually be treated by linear filters, which have
been intensively studied and are well understood tools. If △ is not addition, one can
try to map the system equation into a different space so that the transformed system
is additive. Generally, let T denote a linear filter and U such a transformation. Then
the filter can be applied by
U −1 TUg = U −1 TU (s △ b) = U −1 (TUs + TUb) . (2.46)
Fig. 2.16. Images of the surface of agglomerated cork. Source: Beyerer et al. [2016].
Furthermore, assume that g, s, b > 0 and that the support of the Fourier transforms
of ln s and ln b is not identical. This implies that the logarithm of the disturbance
ln b varies much more slowly than the logarithm ln s of the true signal. Under these
assumptions, the image can be improved by a high pass filter (H) in combination with
a logarithmic transformation. It follows that
(exp ∘ H ∘ ln) g = (exp ∘ H ∘ ln)(s ⋅ b) = exp( H ln s + H ln b ) ≈ exp(ln s + 0) = s,  (2.48)
since H ln s ≈ ln s and H ln b ≈ 0 by assumption.
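A possible realization of Equation (2.48), assuming the high-pass H is built as "identity minus Gaussian low-pass" via scipy.ndimage.gaussian_filter; the cutoff parameter sigma and the synthetic example are our own choices.

import numpy as np
from scipy.ndimage import gaussian_filter

def homomorphic_filter(g: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """exp(H ln g) with H realized as identity minus a Gaussian low-pass, cf. Equation (2.48).

    g must be strictly positive; sigma controls which slow variations are removed.
    """
    log_g = np.log(g)
    high_pass = log_g - gaussian_filter(log_g, sigma)   # H ln g
    return np.exp(high_pass)

# synthetic example: multiplicative slow disturbance b on a fast texture s
x = np.linspace(0, 1, 256)
s = 1.0 + 0.2 * np.sin(2 * np.pi * 40 * np.outer(x, np.ones_like(x)))  # fast texture
b = 1.0 + 0.8 * np.outer(x, x)                                         # slowly varying disturbance
restored = homomorphic_filter(s * b, sigma=20)   # approximates s up to a constant factor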
As before, g denotes the final image as seen by the system, s the true but unknown
image, and b a disturbance. It is assumed that s is a homogeneous process (a reasonable assumption for a regular texture like the cork surface in Figure 2.16) and that b is a slowly varying disturbance.
[Figure: distortion adjustment. The real-world scene is mapped by the distortion V to the preliminary image g(x, y); the reconstructed image γ(ξ, η) is obtained by applying the inverse mapping V⁻¹.]
Usually, (x, y) = V(ξ, η) does not denote a valid lattice point of the preliminary
image, hence g(V(ξ, η)) must be interpolated. In practice, these three methods are
customary (a minimal sketch of the bilinear case follows the list):
– nearest-neighbor interpolation,
– bilinear interpolation, and
– bicubic interpolation.
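As announced above, a minimal sketch of bilinear interpolation at a non-lattice position (x, y) of the preliminary image g; the indexing convention g[y, x] and the function name are assumptions of this sketch.

import numpy as np

def bilinear(g: np.ndarray, x: float, y: float) -> float:
    """Bilinearly interpolate image g (indexed g[y, x]) at the real-valued position (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, g.shape[1] - 1), min(y0 + 1, g.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * g[y0, x0] + fx * g[y0, x1]
    bottom = (1 - fx) * g[y1, x0] + fx * g[y1, x1]
    return (1 - fy) * top + fy * bottom

g = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear(g, 1.5, 2.5))   # 11.5: the average of the four surrounding pixels 9, 10, 13, 14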
Dynamic time warping is necessary if one has to match two signals of different lengths.
A typical example is a pair of audio recordings with different speed profiles. The goal is
to find the best mapping between the two signals that equalizes the different temporal
speed courses (see Figure 2.18). Let
A = (a_1, . . . , a_M)  (2.53)
B = (b_1, . . . , b_L)  (2.54)
be two discrete signals with lengths M and L, respectively. The goal is to find a sequence of pairs of indices C = ((i_1, j_1), . . . , (i_K, j_K)) such that the accumulated distance along the pairing is minimized,
C_opt = argmin_C ∑_{k=1}^{K} ‖a_{i_k} − b_{j_k}‖.  (2.60)
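The classic dynamic-programming solution of this problem is sketched below; the step pattern (diagonal, horizontal, and vertical moves) is an assumption here, since the constraint equations are not reproduced above, and the example sequences are our own.

import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """Minimal accumulated distance between sequences a (length M) and b (length L)."""
    M, L = len(a), len(b)
    D = np.full((M + 1, L + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, L + 1):
            cost = abs(a[i - 1] - b[j - 1])          # |a_i - b_j|
            D[i, j] = cost + min(D[i - 1, j - 1],    # match both samples
                                 D[i - 1, j],        # stretch b
                                 D[i, j - 1])        # stretch a
    return float(D[M, L])

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0])    # the same shape, stretched in time
print(dtw(a, b))                                     # 0.0: perfect alignment despite different lengths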
Though the whole chapter has already been about features, it has been implicitly assumed that those features are already present. Of course, every section provided some examples of features where they helped to illustrate the concepts.
The first section classified features according to their scale of measurement, the
third section illustrated some transformations of the feature space, the fourth section
dealt with distance measures, and the previous section gave some examples of feature
normalization. In summary, all the sections relied on the fact that there were already
available features that could be handled, modified, or transformed. Only the second
section focused a little bit on how to obtain the features. But even that section assumed
that there already was a pool of features from which to select.
In order to fill this gap, this section will put the focus on the question of how to
initially find good features. The first subsection will give some examples of descriptive
features and why descriptive features should be preferred. The second subsection is
about features derived from a model of the generation process of the object. The third
subsection will present a way of systematically constructing invariant features, and is
closely related to Section 2.4.6.
The most straightforward approach is to select standard descriptive features that char-
acterize obvious traits of the object’s class and that carry a natural interpretation. De-
spite the simplicity of this heuristic method, it is often quite successful. Moreover,
descriptive features have the distinct advantage that they can easily be understood by
the system designer, which simplifies debugging when the pattern recognition system fails.
Fig. 2.19. An object and its minimum bounding rectangle.
Geometric features
If the border of an object can be identified, the object’s area is computable as well. The
degree of filling m is defined as
m = (Area of the object) / (Area of the bounding rectangle) ∈ [0, 1].  (2.61)
Usually there are two options for how a minimum bounding rectangle can be defined (see Figure 2.19). The minimum bounding rectangle whose edges are aligned parallel to the coordinate axes is sometimes called the Feret box. But normally the term minimum bounding rectangle (MBR) denotes the rectangle that ignores this constraint and is rotated so that it is properly aligned with the enclosed object.
For both definitions of the box, m is invariant w.r.t. the position and the scale of the object. In addition, when using the MBR, m is also invariant w.r.t. rotation. But the Feret box is easier to compute.
A natural generalization of a bounding box is the convex hull (see Figure 2.20). Accordingly, the degree of convexity is defined as
m = (Area of the object) / (Area of the convex hull) ∈ [0, 1].  (2.62)
Again, m is invariant w.r.t. translation, scaling and rotation. The degree of com-
pactness or form factor relates the perimeter to the area and is defined as
m = 4π ⋅ Area / Perimeter² ∈ [0, 1].  (2.63)
The coefficient 4π has been chosen so that 0 ≤ m ≤ 1, because the circle has the
smallest perimeter of all areas of the same size in Euclidean geometry (see Figure 2.21).
Fig. 2.21. Degree of compactness (form factor) for triangles, squares, and circles.
Tab. 2.2. Number of connected components B, genus (number of holes) L, and Euler number E = B − L for selected letters.

Letter:  A  Ä  B  C  D  E  F  ...  O  Ö  P  ...  U  Ü  V  W  X  Y  Z
B:       1  3  1  1  1  1  1  ...  1  3  1  ...  1  3  1  1  1  1  1
L:       1  1  2  0  1  0  0  ...  1  1  1  ...  0  0  0  0  0  0  0
E:       0  2 −1  1  0  1  1  ...  0  2  0  ...  1  3  1  1  1  1  1
Topological features
Topological features depart from the tangible geometry and describe an object such
that the features become invariant with respect to rubber-sheeting transformations.
Suitable features include, for example, the number of connected components (B) or
the genus, i.e., the number of holes (L). The Euler number is defined as
E = B − L. (2.64)
Table 2.2 lists the number of connected components, the genus, and the Euler
number of the letters.
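As a hedged illustration, the following sketch estimates B, L, and E from a binary character image with scipy.ndimage.label; counting a hole as a background component that does not touch the image border is an assumption of this sketch, not a prescription of the text.

import numpy as np
from scipy.ndimage import label

def euler_number(mask: np.ndarray):
    """Return (B, L, E) for a binary image: components, holes, Euler number E = B - L."""
    B = label(mask)[1]                          # connected components of the foreground
    bg_labels, n_bg = label(~mask)              # components of the background
    border = np.unique(np.concatenate([bg_labels[0, :], bg_labels[-1, :],
                                       bg_labels[:, 0], bg_labels[:, -1]]))
    L = n_bg - len(border[border > 0])          # background regions not touching the border = holes
    return B, L, B - L

# a crude 'O': one component, one hole, Euler number 0
o = np.zeros((7, 7), dtype=bool)
o[1:6, 1:6] = True
o[2:5, 2:5] = False
print(euler_number(o))   # (1, 1, 0)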
Features can also be derived in the frequency domain. Let a gray-scale image be given as the mapping
g : ℝ² → ℝ,   (x, y) ↦ g(x, y),  (2.65)
and let F denote the operator of Fourier transformation. Figures 2.22c and 2.22f illustrate the corresponding Fourier transforms F g of a flawless and a faulty texture.
These figures clearly show a difference. The left periodogram has only one sym-
metric peak at the fundamental frequency, but the right periodogram has some addi-
tional peaks at the subharmonic frequencies. Hence, a reasonable feature is derived from the ratio of the spectral energy at the subharmonic frequencies to the energy at the fundamental frequency.
In the flawless case, this ratio becomes small, because in the numerator the sub-
harmonics vanish. In contrast, the ratio becomes large in the faulty case. For more
details on this topic, see Beyerer et al. [2016].
In its basic form, an autoregressive (AR) model predicts the current state of a system as a linear combination of previous states plus a random disturbance, g_n = ∑_{i ∈ U} a_i g_{n−i} + e_n, where g_n denotes the state of the system at n and e_n denotes the disturbance, while the a_i are the parameters of the model.
In the context of the analysis of time series, the causal neighborhood is naturally
in the past, but the AR model can also be applied to structured images (textures) if
the pixels are enumerated appropriately. A pragmatic approach is to define all pixels
below and to the left of a given pixel as the neighborhood of that pixel.
The number |U| of elements in the neighborhood is called the order of the AR model.
As U only refers to “past” states, and there is a defined starting point (the origin), a
recursive evaluation of the system model is possible. However, first one needs to find
the AR parameters for the given image. To simplify notation, we write the system
equation as
g_{mn} = a^T γ_{mn} + e_{mn},  (2.70)
where γ_{mn} collects the gray values of the pixels in the causal neighborhood U_{mn} of the pixel (m, n).
The unknown parameters are the coefficient vector a and the variance σ² of the noise. Hence the feature vector is given by m = (σ², a^T)^T. The objective of the optimization is to minimize the variance of the noise, which here can be interpreted as the prediction error of the AR model:
σ² = Var{e_{mn}} = E{e²_{mn}} = E{(g_{mn} − a^T γ_{mn})²}
  = E{g²_{mn} − 2 a^T γ_{mn} g_{mn} + a^T γ_{mn} γ_{mn}^T a}
  = E{g²_{mn}} − 2 a^T E{γ_{mn} g_{mn}} + a^T E{γ_{mn} γ_{mn}^T} a → minimize.  (2.72)
Do not be confused by the fact that the left side of the equation is a constant (σ2 ),
but the remainder of the equation seems to depend on the position (m, n): since all
involved processes are at least weakly stationary (see Appendix C), the value of the
last line does not actually depend on (m, n).
To calculate the expectations of γmn and g mn , we assume that the process is lo-
cally ergodic (see, again, Appendix C) in the neighborhood Umn . This means that the
expectation can be estimated by an average over a neighborhood within the same re-
alization. With this in mind, the necessary condition for finding the optimal a is that the derivative of Equation (2.72) with respect to a vanishes.
1 See Appendix C for an explanation of the terms “weakly stationary”, “white noise”, etc.
With the local sums
G = ∑_{(m,n) ∈ U_{mn}} g²_{mn}  (2.74)
and
H = ∑_{(m,n) ∈ U_{mn}} γ_{mn} g_{mn},  (2.75)
the expectations required to determine a and σ² can be estimated from the image data.
Fig. 2.23. Synthetic honing textures using an AR model as an example of model-driven features.
The AR model is generic and does not make use of any context-specific knowledge. The question of what order the AR model should have cannot be answered in general. Usually, one has to resort to a trial-and-error approach until the achieved result is acceptable. Generally, this leads to an unnecessarily high dimension of the parameter space. Furthermore, the AR coefficients do not have an easily interpretable meaning in terms of the modeled pattern.
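To make the estimation concrete, the following sketch fits the coefficient vector a and the noise variance σ² of a causal 2D AR model by ordinary least squares; the neighborhood offsets (pixels above and to the left) and all names are our own choices under the assumptions stated in the text.

import numpy as np

def fit_ar(image: np.ndarray, offsets=((0, -1), (-1, 0), (-1, -1))):
    """Estimate AR coefficients a and noise variance sigma^2, cf. Equations (2.70) and (2.72).

    offsets: causal neighborhood U given as (row, column) displacements.
    """
    H, W = image.shape
    rows, cols = np.meshgrid(np.arange(1, H), np.arange(1, W), indexing="ij")
    g = image[rows, cols].ravel()                                   # g_mn
    gamma = np.stack([image[rows + dr, cols + dc].ravel()           # neighborhood vector gamma_mn
                      for dr, dc in offsets], axis=1)
    a, *_ = np.linalg.lstsq(gamma, g, rcond=None)                   # minimizes the mean squared prediction error
    sigma2 = np.mean((g - gamma @ a) ** 2)
    return a, sigma2

rng = np.random.default_rng(0)
a_hat, s2 = fit_ar(rng.random((64, 64)))
print(a_hat, s2)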
The next section deals with an adjusted, purpose-specific model for the same ex-
ample, i.e., honing textures, as a counterpart to the general AR model.
A single groove with direction e_v and distance d_v from the origin contributes a gray-scale profile of the form x ↦ g_v(x^T e_v − d_v).  (2.81)
Fig. 2.24. Physical formation process and parametric model of a honing texture: (a) generation of a honing texture by simultaneous rotation and stroke movements; (b) parametric model of a single honing groove.
As the supports of different grooves are not disjoint, their gray-scale values are
added in the overlapping regions. This is an error of the model, but the error is negli-
gible and this assumption significantly simplifies the calculation.
With α v ∈ [0, π), the directional vector of the groove can be written as ev =
(cos α v , sin α v )T . Then the parameters of the groove model are the angle α v , the dis-
tance d v , and the groove profile function g v . Due to the movement of the honing tool,
the grooves normally have one out of two principal directions, i.e., a simplified para-
metric stochastic model is
p(α_v) = ½ δ(α_v − β_1) + ½ δ(α_v − β_2)  (2.83)
with two parameters β1 and β2 and with δ(α) denoting the Dirac distribution that is
nonzero only at α = 0.
Likewise, the density of the grooves depends primarily on the density and distribu-
tion of the abrasive grain material on the honing tool. The distances d v of the grooves
from the origin are chosen such that they are uniformly distributed. The number q of
grooves in an interval of size L is assumed to be Poisson distributed
P(q) = ((λL)^q / q!) e^{−λL},   q = 0, 1, 2, . . . .  (2.84)
Lastly, the groove profile function g v (⋅) is assumed to be from a parametric family
of functions that is totally defined by its expectation (E{g v })(⋅).
In summary, in order to learn the model, the parameters β1 , β2 , λ, and (E{g v })(⋅)
need to be estimated. Figure 2.25 illustrates the results. In comparison to the general
Fig. 2.25. Synthetic honing texture as an example of model-driven features using a physically moti-
vated model. Compare this result to the AR model in Figure 2.23.
AR model (see Figure 2.23), the artificially generated surface resembles the original
surface much better, even though the number of parameters is much smaller. For more
details on this approach, see Beyerer [1994].
A general problem of choosing features is to choose those that are invariant with re-
spect to variations within the same class. These variations can be of two types:
firstly, the observable patterns vary because they belong to different objects within
the same class. Secondly, the same object can induce different patterns due to distur-
bances (see Figure 2.26). This section sheds light on the question of how to construct
mappings from varying patterns onto invariant features in a systematic way.
For this reason this section is strongly related to Section 2.4.6. The reader is advised
to recall the important concepts and definitions from that section. Section 2.4.6 tried to
enforce the robustness of the distance measurement in the feature space against variations of
the features. The new contribution of this section is to make the feature itself invariant
with respect to variations of the pattern.
Recall the situation from Section 2.4.6 and Figure 2.12. The feature space M is a
smooth manifold and the disturbance is modeled as a Lie transformation group Π that
acts on the feature space. The group action is denoted by A : Π × M → M. For an
arbitrary but fixed feature mi , the orbit Πmi is given by {A(p, mi ) | p ∈ Π}. Refer to
Appendix B for the definitions of these terms.
The objective is to find a new feature m̃ and a suitable feature transformation m_i ↦ m̃(m_i) such that m̃(m_i) is constant on each orbit {A(p, m_i) | p ∈ Π}.
Fig. 2.26. Variations of the objects and variations of the patterns due to the measurement lead to variations of the features.
To begin with, consider a toy example to illustrate the computational complexity
of a brute force solution. Consider a two-dimensional point x = (x_1, x_2)^T ∈ ℝ². There
are several options for what a suitable Lie transformation group Π could be:
1. The translation group Π = τ = ℝ², acting by x′ = x + a for a ∈ τ. As the name suggests, the points are just moved around in the plane. The number of degrees of freedom of this group, i.e., the dimension of the Lie transformation group, equals two.
2. The congruence group
   Π = C = { (R, a) | R = ( cos α  sin α ; −sin α  cos α ), α ∈ ℝ, a ∈ ℝ² }  (2.85)
   additionally comprises rotations and has three degrees of freedom. It acts by x′ = Rx + a.
3. The similarity group
   Π = S = { (T, a) | T = k ( cos α  sin α ; −sin α  cos α ), k ∈ ℝ_{>0}, α ∈ ℝ, a ∈ ℝ² }  (2.86)
   also allows scaling. The dimension of the Lie transformation group is four. Like the congruence group, it acts by x′ = Tx + a.
4. The affine group
   Π = A = { (P, a) | det P ≠ 0, a ∈ ℝ² }  (2.87)
   includes translations, rotations, scalings, and shearing, and has six degrees of freedom. It also acts by x′ = Px + a.
Assume that each class is represented by one feature vector mj for j = 1, . . . , c and
m is the previously unseen feature vector that should be classified. One approach for
classification would be to choose ω̂ = ω_i with
(i, p*) = argmin_{j ∈ {1,...,c}, p ∈ Π} ‖m − A(p, m_j)‖.  (2.88)
This means that one has to calculate all the transformations of all the classes
and choose the class that comes closest to the provided unseen feature vector. Un-
fortunately, the complexity grows exponentially with the dimension of the Lie trans-
formation group. In order to obtain a feeling for this implication, consider the six-
dimensional affine group from above and assume that each dimension is discretized
into 10³ steps. This results in the computation of 10^18 values per class. A machine able
to perform 10^9 of these computations per second would need 31.7 years to classify
just one sample. These numbers clearly show that a brute force approach is not an
option.
There are three approaches to systematically constructing invariant features:
1. The integral method: average a suitable function f over the orbit,
   m̃ = ∫_Π f(A(p, m)) dp.  (2.89)
2. The differential method: require that m̃ be constant along the orbit, i.e., that the derivative of m̃(A(p, m)) with respect to the group parameters vanishes.
3. Normalization: trace the feature vector back to a designated point of the orbit.
For the integral method, the function f must be chosen such that not too much information is lost, because the integral must still return different values for different classes. For example, the choice f ≡ 0 forces the integral to be always zero. Without a doubt, this is a perfect, but obviously too restrictive, invariant. There is no choice of f that is generally applicable to just any feature.
As an example, consider the group G of rotations about the origin acting on points m = (m_1, m_2)^T ∈ ℝ². The group acts on the points by the usual multiplication of a vector by a matrix, and the orbits are circles centered at the origin. The integral approach leads to
m̃ = ∫_G f(A(g, m)) dg = ∫_{−π}^{π} f( ( cos α  sin α ; −sin α  cos α ) (m_1 ; m_2) ) dα.
Choosing, for example, f(x) = x^T x, which a rotation leaves unchanged, the integrand equals m_1² + m_2² for every α, and the integral evaluates to m̃ = 2π (m_1² + m_2²).
This result is correct because, up to a missing root, m_1² + m_2² is the (squared) distance
of the point from the origin. This is an invariance with respect to rotations around the
origin, because it equals the radius of the orbit. Although we lose information about
the precise location of the feature m, we do not lose too much information, because
the distance from the origin still suffices to distinguish different orbits. The drawback
is that we had to guess the suitable function f .
The general idea of calculating an integral with respect to a Lie transformation
group is to express a group element g ∈ G by its so-called normal coordinates. In
this case the normal coordinate representation is g(α) = ( cos α  sin α ; −sin α  cos α ). The domain of the normal coordinates is (an isometric copy of) a subset of ℝ^l, and hence the integral can be pulled back to an already known integral over the real numbers. But finding the normal coordinate representation of a Lie group is not always as easy as this example might suggest.
The differential method instead requires that m̃ be constant on the orbit, i.e., that its derivative with respect to the group parameter vanishes:
0 ≐ ∂ m̃(A(g, m)) / ∂g
  = ∂/∂α  m̃( ( cos α  sin α ; −sin α  cos α ) (m_1 ; m_2) )
(assumption: m̃ : ℝ² → ℝ, (u, v)^T ↦ m̃(u, v))
  = (m_2 cos α − m_1 sin α) ∂/∂u m̃(m_1 cos α + m_2 sin α, m_2 cos α − m_1 sin α)
    − (m_1 cos α + m_2 sin α) ∂/∂v m̃(m_1 cos α + m_2 sin α, m_2 cos α − m_1 sin α)
(substitution: χ = m_1 cos α + m_2 sin α, ξ = m_2 cos α − m_1 sin α)
  = ξ ∂/∂u m̃(χ, ξ) − χ ∂/∂v m̃(χ, ξ).  (2.94)
Close inspection of the last line reveals that
m̃(u, v) = u² + v²  (2.95)
is one solution of the partial differential equation, because ∂m̃/∂u = 2u and ∂m̃/∂v = 2v, and therefore ξ ⋅ 2χ − χ ⋅ 2ξ = 0 follows.
By definition, m̃ is invariant under the group action, and we obtain
m̃(m_1, m_2) = m̃( ( cos α  sin α ; −sin α  cos α ) (m_1 ; m_2) ) = m_1² + m_2²  (2.96)
for any α. This is the same result as obtained by the integral approach. Instead of
guessing some function f , the problem is rather to find a formula for the solution.
Normalization by example
In contrast to the first two methods, normalization does not provide a predetermined
course of action. The general idea is to pull back each feature vector to a designated
point on the orbit (see Appendix B). In other words, each orbit is characterized by a
single canonical representative. Hence, if m is the feature vector, one needs to find the
corresponding group element g ∈ G such that A(g, m) maps to the representative of
the orbit of m. Obviously, g depends on m, and the question of how g can be calculated
remains an open question for the general case. In this section, an example of two-
dimensional contours is presented.
A two-dimensional contour (see Figure 2.27) can be considered as a continuous
closed curve in the Euclidean plane given by
z(l) = (x(l), y(l))^T   with x : [0, L] → ℝ and y : [0, L] → ℝ  (2.97)
with boundary condition x(0) = x(L) and y(0) = y(L). As x and y are continuous
functions with period L, they are especially suited to be expanded in Fourier series,
x(l) = ∑_{n=−∞}^{∞} X_n e^{j n 2πl/L}   with   X_n = (1/L) ∫_0^L x(l) e^{−j n 2πl/L} dl  (2.98)
y(l) = ∑_{n=−∞}^{∞} Y_n e^{j n 2πl/L}   with   Y_n = (1/L) ∫_0^L y(l) e^{−j n 2πl/L} dl.  (2.99)
Note that X_n and Y_n are complex values, but the additional restriction X_n^* = X_{−n} and Y_n^* = Y_{−n} holds, so that the imaginary parts cancel pairwise, because x and y are real functions. Hence, z(l) can be written as
z(l) = ∑_{n=−∞}^{∞} Z_n e^{j n 2πl/L}   with   Z_n = (X_n, Y_n)^T.  (2.100)
The Fourier series can be written in a more compact form if z(l) is not regarded as a two-dimensional real vector but as a complex function, i.e., z(l) = x(l) + j y(l) with coefficients Z_n = X_n + j Y_n.
The property X_n^* = X_{−n} and Y_n^* = Y_{−n} does not carry over to the coefficients Z_n, because z(l) is a true complex function and therefore the imaginary parts do not cancel in general. On the one hand, one has
Z_n^* = X_n^* + (j Y_n)^* = X_n^* − j Y_n^*,  (2.103)
but on the other hand Z_{−n} = X_{−n} + j Y_{−n} = X_n^* + j Y_n^*, which differs from Z_n^* in general.
From now on, we look at z(l) as a true complex function, and it is assumed that the coefficients Z_0, Z_1, Z_{−1}, Z_2, Z_{−2}, . . . are not restricted to be pairwise complex conjugates.
At this point, one has a feature vector
m = (Z_0, Z_1, Z_{−1}, . . . , Z_n, Z_{−n})^T.
A translation only affects the first coefficient Z_0. Actually, this coefficient is nothing else than the center of mass of the contour. Hence, omitting this coefficient (or implicitly setting it to zero) describes the same contour with its center of mass moved to the origin. A translation invariant feature vector is therefore
m̃ = (Z_1, Z_{−1}, Z_2, Z_{−2}, . . . , Z_n, Z_{−n})^T.
Scaling invariance is also easy to obtain. If the contour z(l) is scaled by a real, positive value a ∈ ℝ_{>0} to z′(l) = a z(l), all coefficients are scaled by the same value, Z′_n = a Z_n. Hence, dividing all coefficients by the absolute value of the first element yields a scaling invariant feature vector
m̃ = ( Z_1/|Z_1|, Z_{−1}/|Z_1|, . . . , Z_n/|Z_1|, Z_{−n}/|Z_1| )^T.
This step resulted in a feature vector whose first component has an absolute value
of one, but an arbitrary direction, on the complex unit circle. A rotation (but not scale)
invariant feature vector is obtained if all coefficients are multiplied in such a way that
the coefficient Z1 points in the direction of the real axis, i.e., if the first coefficient
becomes a positive real number. This is true because all coefficients are multiplied
by the same value ejα if the contour z(l) is multiplied by ejα . This is a rotation in the
complex plane. Let
φ_1 = Arg Z_1 ∈ (−π, π]  (2.110)
denote the argument of the first coefficient. This means that Z_1 e^{−jφ_1} = |Z_1|. Hence
m̃ = (Z_1 e^{−jφ_1}, Z_{−1} e^{−jφ_1}, . . . , Z_n e^{−jφ_1}, Z_{−n} e^{−jφ_1})^T
  = (|Z_1|, Z_{−1} e^{−jφ_1}, . . . , Z_n e^{−jφ_1}, Z_{−n} e^{−jφ_1})^T  (2.112)
is a rotation (but not scale) invariant feature vector. In other words, the orientation of
the contour is encoded in the phases of the coefficients.
If the last two steps are combined, one obtains a scale and rotation invariant fea-
ture vector. Of course, this actually means that all coefficients are divided by Z1 . As
the first coefficient becomes one, it can implicitly be omitted.
In summary, let
m = (Z_1, Z_{−1}, Z_2, Z_{−2}, . . . , Z_n, Z_{−n})^T  (2.113)
be the feature vector of the Fourier series approximation of a contour with n coefficients. Then
m̃ = (Z̃_{−1}, Z̃_2, Z̃_{−2}, . . . , Z̃_n, Z̃_{−n})^T = ( Z_{−1}/Z_1, Z_2/Z_1, Z_{−2}/Z_1, . . . , Z_n/Z_1, Z_{−n}/Z_1 )^T  (2.114)
is an example of a translation, scale and rotation invariant feature vector of the contour. The values Z̃_0 = 0 and Z̃_1 = 1 are implicitly omitted. More generally, because z′(l) = a e^{jα} z(l), with a ∈ ℝ_{>0} and α ∈ ℝ, leads to Z′_k = Z_k a e^{jα}, each ratio m̃ = Z_n/Z_m with n, m ≠ 0 is invariant with respect to scaling and rotation of the contour.
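A minimal NumPy sketch of Equation (2.114) computed from sampled contour points via the FFT; the sampling of the contour, the number n of coefficients, and the test curve are assumptions of this sketch.

import numpy as np

def fourier_descriptor(contour: np.ndarray, n: int = 4) -> np.ndarray:
    """Invariant descriptor (Z_-1/Z_1, Z_2/Z_1, Z_-2/Z_1, ..., Z_n/Z_1, Z_-n/Z_1), cf. (2.114).

    contour: complex samples z(l) = x(l) + j*y(l) of a closed curve, equally spaced in l.
    """
    Z = np.fft.fft(contour) / len(contour)       # Fourier coefficients Z_k; Z_0 is the center of mass
    ratios = [Z[-1] / Z[1]]                      # Z_-1 / Z_1
    for k in range(2, n + 1):
        ratios += [Z[k] / Z[1], Z[-k] / Z[1]]    # Z_k / Z_1 and Z_-k / Z_1
    return np.array(ratios)

# an ellipse with a small extra harmonic, and a translated, scaled, rotated copy of it
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
contour = 2 * np.cos(t) + 1j * np.sin(t) + 0.1 * np.exp(-3j * t)
moved = (3 + 2j) + 0.5 * np.exp(0.7j) * contour
print(np.allclose(fourier_descriptor(contour), fourier_descriptor(moved)))  # True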
Generally, a high dimension of the feature space is unfavorable, for reasons that will
be explained in Section 6.1. The last section of this chapter will treat the question of
how a high dimension of the feature space can be reduced.
The presentation starts with the concepts of principal component analysis (PCA)
and independent component analysis (ICA). Both methods derive new features by
combining the original features and projecting the result to a subspace of smaller di-
mension. They share the objective of approximately representing the collection of all
samples D with the desired (lower) dimensional space, so that when one would recon-
struct the samples in the original feature space, the mean square error between the
original features and the reconstructed features is minimized. In other words, these
methods do not take the class affiliation into consideration but regard the whole col-
lection of samples D at once.
The third method this section will treat is multiple discriminant analysis (MDA).
This method initially focuses on optimal class separation, but apart from that, it op-
erates similarly to the other two methods. That is, this method also calculates com-
binations of the original features and projects them to a subspace. But the way the
combinations and projections are calculated differs.
All these methods suffer from two drawbacks. First, they only work for features
on an interval scale at least, because subtraction and scalar multiplication must be de-
fined. Second, the descriptive meaning of the original features might get lost. Instead
of concrete features these methods might generally return opaque transformations.
For these reasons, the last method presents a systematic way of selecting a subset
of the original feature vector such that the smaller set of features is still good enough.
Throwing away some components is the same as a projection, but it keeps the axes “as
is.” The advantage of this method is that it works for arbitrary kinds of features and
retains the meaning of the original features. The disadvantage is that it is less powerful,
because it does not make use of combinations of the features.
The idea of principal component analysis is to find a lower dimensional subspace such
that the data is optimally represented in terms of the mean square error. This subsec-
tion proceeds as follows. First, the method is presented for the case where the subspace
is chosen to be zero-, one-, or two-dimensional, because these cases can be easily de-
picted and they descriptively convey the underlying idea. Then the general case is
presented for an arbitrary number of dimensions. At the end, the method of principal
component analysis is generalized to kernelized principal component analysis, which
uses in addition a nonlinear transformation in order to improve the representation.
This shows that the point with the least squared distance to the points m_1, . . . , m_N is the
center of mass of these points (see Figure 2.28),
m̄ = (1/N) ∑_{k=1}^{N} m_k.
Iteratively, one can now seek the one-dimensional line that best represents the points. That line is given by m̄ + a e, where e denotes the normalized directional vector and a the scalar parameter of the line. Let m̆_k = m̄ + a_k e denote the orthogonal projection of the feature m_k onto the line. The optimization functional is
J_1(a_1, . . . , a_N, e) = ∑_{k=1}^{N} ‖m̆_k − m_k‖² = ∑_{k=1}^{N} ‖m̄ + a_k e − m_k‖²  (2.119)
  = ∑_{k=1}^{N} a_k² − 2 ∑_{k=1}^{N} a_k e^T (m_k − m̄) + ∑_{k=1}^{N} ‖m_k − m̄‖².  (2.120)
As the minimum is an inner point, it suffices to find the point where the first derivatives are zero:
∂J_1(a_1, . . . , a_N, e) / ∂a_k = 2 a_k − 2 e^T (m_k − m̄) ≐ 0   ⇔   a_k = e^T (m_k − m̄).  (2.121)
Putting this solution into the last line of Equation (2.120) yields
J_1(e) = ∑_{k=1}^{N} a_k² − 2 ∑_{k=1}^{N} a_k² + ∑_{k=1}^{N} ‖m_k − m̄‖²
  = −∑_{k=1}^{N} a_k² + ∑_{k=1}^{N} ‖m_k − m̄‖²
  = −∑_{k=1}^{N} e^T (m_k − m̄)(m_k − m̄)^T e + ∑_{k=1}^{N} ‖m_k − m̄‖²
  = −e^T S e + ∑_{k=1}^{N} ‖m_k − m̄‖²,   where the second term is fixed and independent of e.  (2.122)
The matrix
S := ∑_{k=1}^{N} (m_k − m̄)(m_k − m̄)^T ∈ ℝ^{d×d}  (2.123)
is the so-called scatter matrix. Minimizing J_1(e) is hence equivalent to maximizing e^T S e under the constraint ‖e‖ = 1. Introducing a Lagrange multiplier λ and setting the derivative with respect to e to zero yields
S e = λ e
and therefore
e^T S e = λ.
The line before the last line shows that the sought value of λ is an eigenvalue of
the matrix S. Since S is symmetric by construction (see Equation (2.123)), it is diago-
nalizable and such an eigenvalue must exist. The last line reveals that the greatest
eigenvalue must be picked to maximize eT Se.
In summary, the best line has a base point at the center of mass and the same
direction as the eigenvector with the largest eigenvalue of the scatter matrix (see Fig-
ure 2.29).
In order to complete the usual notation, let the column-wise concatenation of the zero-mean feature vectors,
M := (m_1 − m̄, . . . , m_N − m̄) ∈ ℝ^{d×N},  (2.126)
denote the so-called data matrix. Then the scatter matrix can be written as
S = ∑_{k=1}^{N} (m_k − m̄)(m_k − m̄)^T = M M^T.  (2.127)
We now turn to the general case. Again, m̆_k will denote the projection of m_k onto a d′-dimensional affine subspace given by
m̄ + ∑_{i=1}^{d′} a_i e_i  (2.128)
with {e_1, . . . , e_{d′}} constituting an orthonormal basis. Then the objective function is
J_{d′}(a_{1,1}, . . . , a_{N,d′}, e_1, . . . , e_{d′}) = ∑_{k=1}^{N} ‖m̆_k − m_k‖² = ∑_{k=1}^{N} ‖ m̄ + ∑_{i=1}^{d′} a_{k,i} e_i − m_k ‖²  (2.129)
for d′ < d. A generalized variant of the same course of action as above leads to the following result: the optimal affine subspace of dimension d′ has its base point at the average value m̄ and is spanned by the eigenvectors belonging to the d′ largest eigenvalues of the scatter matrix S (see Figure 2.30).
Hence, the usual procedure to calculate the d′-dimensional principal component analysis consists of the following steps:
1. Calculate the empirical mean m̄ = (1/N) ∑_{k=1}^{N} m_k and the scatter matrix S = ∑_{k=1}^{N} (m_k − m̄)(m_k − m̄)^T.
2. Calculate the eigenvectors e_1, . . . , e_d of S, ordered by decreasing eigenvalues λ_1 ≥ ⋯ ≥ λ_d.
3. Project the centered feature vectors onto the first d′ eigenvectors, m̃ = (e_1, . . . , e_{d′})^T (m − m̄), to obtain the transformed feature vector m̃ of smaller dimension d′.
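The following minimal NumPy sketch carries out exactly these steps; it uses np.linalg.eigh because S is symmetric, and the sample data and all names are our own choices.

import numpy as np

def pca(samples: np.ndarray, d_prime: int):
    """PCA of samples with shape (N, d); returns mean, the d' leading eigenvectors, projections."""
    mean = samples.mean(axis=0)
    M = (samples - mean).T                        # data matrix, one zero-mean feature per column
    S = M @ M.T                                   # scatter matrix, Equation (2.127)
    eigval, eigvec = np.linalg.eigh(S)            # ascending eigenvalues of the symmetric matrix S
    order = np.argsort(eigval)[::-1][:d_prime]    # keep the d' largest
    E = eigvec[:, order]
    return mean, E, (samples - mean) @ E          # projected features per row

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.diag([3.0, 0.3]) @ np.array([[0.8, -0.6], [0.6, 0.8]])
mean, E, X_proj = pca(X, d_prime=1)
print(E.T, X_proj.var())    # the first principal direction and the variance along it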
If the support of the random process is chosen to be a bounded set of natural numbers, i.e., t ∈ {1, . . . , d}, then the stochastic process can be written as a vector (m(1), . . . , m(d))^T and one is in the same situation as for principal component analysis.
Let m be a random vector and let
µ = E{m},  (2.135)
Σ = Cov{m},  (2.136)
E = (ϵ_1, . . . , ϵ_d),  (2.137)
Λ = diag(κ_1, . . . , κ_d)  (2.138)
denote its expectation, its covariance matrix, the matrix whose columns ϵ_i are the orthonormal eigenvectors of Σ, and the diagonal matrix of the corresponding eigenvalues κ_i. Then
Λ = E^T Σ E.  (2.140)
The transformed random vector
m̃ = E^T (m − µ)  (2.141)
has zero mean,
E{m̃} = 0,  (2.142)
and its component-wise variances are the eigenvalues,
κ_i = Var{m̃_i}.  (2.144)
Fig. 2.31. The variance of the dataset is encoded in the principal components so that the variance
along a component is proportional to the corresponding eigenvalue.
Moreover, the components of m̃ are pairwise uncorrelated,
Cov{m̃_i, m̃_j} = 0   for i ≠ j.  (2.145)
That being said, we now return to principal component analysis. Instead of a ran-
dom vector m, one has a set of feature vectors mk that are nothing else than realizations
of m, and the empirical mean m̄ is an unbiased estimator of the expectation vector,
µ̂ = m̄.  (2.146)
Except for a correction factor, something similar holds for the scatter matrix. An
unbiased estimator for the covariance matrix is
Σ̂ = S / (N − 1).  (2.147)
The component-wise variance of the transformed feature can be unbiasedly esti-
mated by the scaled eigenvalues of the scatter matrix
κ̂_i = λ_i / (N − 1).  (2.148)
This situation is depicted in Figure 2.31.
The ith component of the transformed feature vector is m̃_i = e_i^T (m − m̄), and the reconstruction of the ith component in the original feature space is given by
e_i m̃_i = e_i e_i^T (m − m̄),   with e_i e_i^T ∈ ℝ^{d×d}.  (2.150)
Hence m^[1] = (I − e_1 e_1^T)(m − m̄) is the feature vector with the first component removed. More generally, we denote the transformed vector m without the entries corresponding to the first i eigenvectors e_i by
m^[i] = ( I − ∑_{j=1}^{i} e_j e_j^T ) (m − m̄).
Because distinct eigenvectors are orthogonal, the sequence m^[1], m^[2], m^[3], . . . can be calculated recursively:
m^[1] = (I − e_1 e_1^T)(m − m̄),  (2.152)
m^[2] = (I − e_2 e_2^T) m^[1],  (2.153)
m^[3] = (I − e_3 e_3^T) m^[2].  (2.154)
The Equations (2.152) to (2.154) and so on can be thought of as follows: At first, the
direction of maximum variance is determined and the variation of the data w.r.t. this
direction is removed. Then, within the data modified in this way, again the direction
of maximum variance is determined, and so on, and so on. Therefore, in a greedy
manner, the maximum variance directions are identified recursively and the pertaining
components of the data are consecutively subtracted.
For a single zero-mean feature vector m_k − m̄ with k = 1, . . . , N, the squared projection error onto the d′-dimensional subspace is
‖ ( I − ∑_{i=1}^{d′} e_i e_i^T ) (m_k − m̄) ‖² = ‖ ( ∑_{i=d′+1}^{d} e_i e_i^T ) (m_k − m̄) ‖²  (2.155)
and the total squared error for all feature vectors is the sum of the remaining eigenvalues:
∑_{k=1}^{N} ‖ ( I − ∑_{i=1}^{d′} e_i e_i^T ) (m_k − m̄) ‖² = ∑_{i=d′+1}^{d} λ_i.  (2.156)
By construction, the principal component analysis yields the best (w.r.t. mean square error) d′-dimensional approximation. Furthermore, we already know that the eigenvalues are proportional to the variance of the data with respect to the corresponding direction. Now, assume some other arbitrary d′-dimensional projection and calculate the variance of the data in the dimensions being thrown away. These variances will be greater than the term above. In this sense, the sum ∑_{i=d′+1}^{d} λ_i is minimal, i.e., as little information as possible is lost.
Abusing some concepts and notation, we can clarify what is meant by “loss of information.” By virtue of the coerced normalization, one can regard the sequence of eigenvalues λ_1, . . . , λ_d as a probability distribution
χ_i := λ_i / ∑_{j=1}^{d} λ_j.  (2.157)
For any other linear transformation of the feature space, the corresponding entropy
of the variances in each direction is larger. In other words, the principal component
analysis yields that linear transformation for which the entropy of the “variances” be-
comes minimal. Hence, the variances are as unequally distributed as possible. In a lax
interpretation, one could say that the first dimension bears as much information about
the data as possible, the second dimension bears most of the remaining information,
the third dimension bears most of the information without the first two dimensions,
and so on.
The following list recapitulates the essential characteristics of a principal compo-
nent analysis.
– The components of the transformed feature vectors are pairwise uncorrelated.
– The variances of the components of the transformed feature are maximally un-
equally distributed for all linear transformations. (The variances have minimal
entropy.)
– The PCA yields the best d′-dimensional approximation in terms of the squared deviation.
– The PCA does not aim for the optimal separability of the classes, but tries to provide
the best representation of all the data D as a whole. Nonetheless, experience shows
that the PCA yields feature spaces of good quality with low dimensions.
– The descriptive meaning of the original features is lost.
Fig. 2.32. Mean face computed from the YALE faces dataset of
Georghiades et al. [2001].
In the eigenfaces approach (one of the classical face recognition methods), faces are represented as the deviations from a mean face. The mean face as well
as the “directions” of the deviation are calculated using PCA.
Let g(x, y) denote the gray-scale image of a face with (x, y) ∈ {1, . . . , n}2 . Note
that all images are required to be of the same size, but there is no technical reason to
require them to be square. However, this restriction simplifies the following discussion.
In addition, all images should show the face in the same pose and be aligned with a
common reference frame (e.g., eye centers on the same height) for this technique to
work well. The pixels are arranged into a vector m ∈ ℝ^d with dimension d = n². Note
that here the pattern itself is used as the feature vector. As above, let M = (m_1 − m̄, . . . , m_N − m̄) ∈ ℝ^{d×N}
denote the data matrix and S = M M^T ∈ ℝ^{d×d} denote the scatter matrix.
Usually, the next step would be to calculate the eigenvectors and eigenvalues of
S. In practice, however, this is infeasible due to the size of S and the resulting compu-
tational complexity of the eigen-decomposition. Consider, for example, small facial
images measuring 32 × 32 pixels, i.e., n = 32 (in real applications the images will
be larger). Then the “feature vectors” m will be of dimension d = n² = 32² = 1024 and the
scatter matrix S will have d² = n⁴ = 1,048,576 entries.
The costly eigen-decomposition can be avoided by exploiting the structure of the
problem: the dimensionality of the space induced by the training sample is smaller
than the dimensionality of the feature space. In other words, the number N of features
in the training sample is much smaller than d. This is an odd situation: in most cases,
the number of samples is much larger than d. As we will see in Chapter 4, N ≫ d
is (often) actually required in order to successfully estimate the decision boundaries
in the feature space. Note, however, that at the moment we do not wish to derive a
classifier, rather, we wish to find a compact representation of the facial images that
can be used with a classifier.
Nevertheless, as here N < d, consider instead the matrix
K := M^T M ∈ ℝ^{N×N}.  (2.160)
Fig. 2.33. First 20 eigenfaces computed from the YALE faces dataset of Georghiades et al. [2001].
The first components clearly correspond to different lighting conditions, while the other components
correspond to changes in pose and facial structure.
Fig. 2.34. First 20 eigenvalues corresponding to the eigenfaces in Figure 2.33. Note that most of the variation is captured by just the first two components, which correspond to lighting directions.
If η_i denotes an eigenvector of K with eigenvalue λ_i, then M η_i is an eigenvector of the scatter matrix S with the same eigenvalue. The normalized eigenvectors of S are therefore obtained as
e_i = M η_i / ‖M η_i‖   for i = 1, . . . , N.  (2.161)
Since the eigenvectors are computed from images, they can themselves be con-
verted into images. Figures 2.32 and 2.33 show the mean vector m and the eigenvectors
ei , i = 1, . . . ,10 of the extended YALE face dataset B (Georghiades et al. [2001]) inter-
preted as gray-scale images. This dataset contains pictures of the faces of 39 subjects.
The images were recorded under different lighting conditions and cropped and ro-
tated so that the faces of two different images are aligned. One can clearly see that
the eigenvectors represent major modes of change: lighting, pose, and facial structure.
The eigenvalues corresponding to the eigenvectors are shown in Figure 2.34. As ex-
pected from the previous discussion, the eigenvalues are very unequally distributed;
most of the variation in the dataset is represented by the first two components. The
third, fourth, etc. eigenvalues are of much smaller magnitude, which means that the
associated components explain finer, but less common, details.
For classification, a d′-dimensional feature vector (where d′ ≪ d) according to
m̃ = (e_1, . . . , e_{d′})^T (m − m̄) is used. Note that this approach is not restricted to facial
image recognition. It can also be used with 3D facial data from depth sensors, or indeed
any other type of data.
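The following sketch demonstrates the computational trick of Equation (2.161) with random data standing in for face images; the sizes and names are our own, and the final check verifies that the mapped vector is indeed an eigenvector of the large scatter matrix.

import numpy as np

# N images with d = 32*32 pixels each, N << d
rng = np.random.default_rng(2)
N, d = 50, 32 * 32
faces = rng.random((N, d))

mean_face = faces.mean(axis=0)
M = (faces - mean_face).T                    # data matrix, shape (d, N)
K = M.T @ M                                  # small N x N matrix, Equation (2.160)
eta_val, eta = np.linalg.eigh(K)             # eigenpairs of K (ascending order)

# map the eigenvectors of K to eigenvectors of S = M M^T, Equation (2.161)
E = M @ eta[:, ::-1]                         # largest eigenvalues first
E /= np.linalg.norm(E, axis=0)

S = M @ M.T
print(np.allclose(S @ E[:, 0], eta_val[-1] * E[:, 0]))   # True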
This model, however, can only represent planes that are somewhat similar to the
models chosen as the basis. Of course, one can simply provide a larger selection of
different planes, but then the size of the feature vector will increase, which typically
has adverse effects on classification performance. A better model should capture the
principal modes of change in an airplane: the size, the position of the wings, the ori-
entation of the wings, etc. Such a model can be derived using PCA. The base models
are collected into a data matrix, and PCA is used to extract the mean model ḡ and the eigenvectors e_i. A new model is then represented by
g_new = ḡ + m_1 e_1 + m_2 e_2 + m_3 e_3 + ⋯ .  (2.164)
As it turns out, the first eigenvector e1 mainly accounts for the size of the airplane
body, the second and third eigenvectors e2 and e3 mainly encode the position and
orientation of the wings, and the other eigenvectors encode minor details. PCA has
found the most relevant modes of change purely from the data provided, without any
guidance from a human expert. With this model, most of the inherent variation in
the airplane models can be expressed using only the first three or four eigenvectors.
The resulting feature descriptor m = (m1 ,m2 ,m3 ,m4 )T is very compact, yet sufficient
for the task of describing different types of airplanes. More information about this
approach can be found in Laubenheimer [2004].
Standard principal component analysis aims to find the best orthogonal transforma-
tion of the feature space such that the projection onto the first d′ dimensions (or prin-
cipal components) is the best approximation among all other orthogonal transforma-
tions.
Fig. 2.36. Kernelized PCA: the feature space M is first mapped by ϕ into the intermediate space F; standard PCA applied in F then yields the reduced feature space M′.
But sometimes it might yield better results if PCA is not applied to the original
feature space, but to an intermediate higher dimensional space. In other words, there is
a nonlinear function ϕ : M → F that maps from the feature space M with dimension d
to a new Hilbert space F with higher—possibly infinite—dimension. Then the principal
component analysis is applied to this intermediate space in order to obtain the feature
space M′ with reduced dimension d′. This idea is depicted in Figure 2.36. Only if ϕ is
chosen to be truly nonlinear does this approach provide any benefits in comparison
with standard principal component analysis. Otherwise, the results do not differ.
The reason behind the name kernelized PCA, and why it is covered by a whole
section on its own, is that there is a clever calculation trick. This trick is referred to as
the kernel trick and will be revisited and explained in greater detail in Section 7.7.
The naïve approach to realize Figure 2.36 would be to explicitly choose F and ϕ
and perform a principal component analysis in the high-dimensional space F. In this
case, one needs to compute the inner products of vectors explicitly mapped by ϕ with
possibly prohibitive computational costs. As the final goal is to ease the complexity
by dimension reduction, this seems like a step in the wrong direction. The trick is to
rewrite all the formulas so that the map ϕ only occurs in pairs within the inner product
of F. This means that terms like ϕ(m_i)^T ϕ(m_j) (2.165) are the only places where ϕ appears. Then all these terms can be replaced by a so-called kernel function
k: M × M → ℝ (2.166)
that absorbs two mappings and the inner product into one simply evaluable function
so that ϕ never needs to be calculated explicitly.
This being said, the upcoming course of action in this section is already clear. One
starts with the “regular” PCA on F and tries to rewrite all formulas so that all ϕ vanish
and k remains.
Standard PCA centers the data first, i.e., in the first step mk − m is calculated.
Without explicit knowledge about ϕ it is neither possible nor computationally feasi-
ble to calculate ϕ(mk ) − ϕ(m) in the same way. Hence, one assumes that ϕ already
generates zero-mean data ϕ(m1 ), . . . , ϕ(mN ) with ∑Nk=1 ϕ(mk ) = 0. Of course, this
assumption is rather far-fetched, and at the end this condition will be dropped again,
but provisionally this is assumed to be true.
The following derivation mainly follows that of Schölkopf et al., which can be
found in Schölkopf et al. [1997].
Let m1 , . . . , mN ∈ M denote the original features. Then ϕ(m1 ), . . . , ϕ(mN ) ∈ F
are the non-linearly transformed, high-dimensional features. Furthermore, let
D := (ϕ(m_1), . . . , ϕ(m_N))  (2.167)
denote the data matrix, so that
C = (1/N) ∑_{k=1}^{N} ϕ(m_k) ϕ(m_k)^T = (1/N) D D^T  (2.168)
is the scatter matrix (see Equation (2.126)). Two aspects are important to note: first, the scatter matrix is additionally normalized by the factor 1/N; second, these terms assume that (1/N) ∑_{k=1}^{N} ϕ(m_k) = 0.
Following the usual procedure, the eigenvalue equation
λv = Cv for v ∈ F, λ ∈ ℝ (2.169)
must be solved next. As always, C is diagonalizable and all eigenvalues are non-
negative λ ≥ 0, because of the special form C = DDT . As Equation (2.169) will not be
explicitly solved, the definition of C is inserted into Equation (2.169),
λ v = C v = ( (1/N) ∑_{k=1}^{N} ϕ(m_k) ϕ(m_k)^T ) v = (1/N) ∑_{k=1}^{N} (ϕ(m_k)^T v) ϕ(m_k),  (2.170)
where each ϕ(m_k)^T v is a real scalar. Hence, for λ ≠ 0,
v = ∑_{k=1}^{N} α_k ϕ(m_k) = D α   for some α ∈ ℝ^N,  (2.171)
i.e., every eigenvector with nonzero eigenvalue lies in a subspace that has dimension N at most. There are at most N eigenvectors with
a nonzero eigenvalue and these eigenvectors are in the span {ϕ(m1 ), . . . , ϕ(mN )};
all other (possibly infinitely many) eigenvalues are zero and their eigenvectors are or-
thogonal to that subspace. Intuitively, this is not a surprise. If there are only N feature
vectors (data points), these points can span at most an (N − 1)-dimensional subspace.
(Two points are always on one line, three points are always on one plane, and so on.)
This does not change, if the points are mapped into a space with higher dimension
first. As one is only interested in principal components with nonzero variance (no
other components bear any information at all), one can presume λ > 0 from now on.
The eigenvector Equation (2.170) corresponds to a system of linear equations with
possibly infinitely many rows. Because we know that all interesting solutions with
λ > 0 are in the span {ϕ(m1 ), . . . , ϕ(mN )}, it suffices to consider the projection onto
this space. This means one can multiply Equation (2.170) from the left by DT (see Equa-
tion (2.167)) without losing any interesting solution.
In conclusion, Equation (2.170) together with Equation (2.171) and a left multiplication by D^T lead to an eigenvalue problem for the so-called kernel matrix
K := D^T D ∈ ℝ^{N×N}.  (2.173)
Now compare the definition of the kernel matrix K (Equation (2.173)) with the definition of the scatter matrix C (Equation (2.168)) and note that this is the same trick that has already been used to reduce the complexity of the eigenface problem (Equation (2.160)). Furthermore, one can see that
K_{ij} = ϕ(m_i)^T ϕ(m_j)  (2.175)
holds. This means each matrix entry K_{ij} is the inner product of the corresponding
feature vectors in the high-dimensional space. For later use, we introduce the kernel
function
k : M × M → ℝ,   (m_i, m_j) ↦ ϕ(m_i)^T ϕ(m_j),  (2.176)
and set
K_{ij} = k(m_i, m_j).  (2.177)
Again, because K is symmetric and positive-definite, and because only nonzero
solutions are of interest, one factor K can be canceled out in Equation (2.174). There
remains
λα = Kα (2.178)
The eigenvectors α of Equation (2.178) are organized into a matrix Ã, which corresponds to the projection matrix A on F via

A = DÃ.   (2.182)
Again, with the usual PCA this would require computing m′ = A^T ϕ(m). This can be rewritten as

m′ = A^T ϕ(m) = (DÃ)^T ϕ(m) = Ã^T (D^T ϕ(m)) = Ã^T (ϕ(m_1)^T ϕ(m), . . . , ϕ(m_N)^T ϕ(m))^T = Ã^T (k(m_1, m), . . . , k(m_N, m))^T.   (2.183)
The last step finishes the derivation of the kernelized PCA. In summary, Equation (2.178) must be solved with K defined as in Equation (2.177) under the normalization condition from Equation (2.179). The eigenvectors found must be organized
into a projection matrix Ã, and Equation (2.183) returns the principal components m′ for any feature vector m. All steps require at most evaluations of the kernel function k; the transformation ϕ is never needed explicitly.
Attention: Although k : M × M → ℝ seems to come out of the blue, it is still
assumed that it corresponds to some inner product on some unknown vector space F
of unknown dimension and that there is a (nonlinear) mapping ϕ from M into F such
that k(⋅, ⋅) = ⟨ϕ(⋅), ϕ(⋅)⟩. Moreover, it is assumed that ∑Nk=1 ϕ(mk ) = 0. This means
that the transformed dataset has zero mean.
Two last questions still need to be answered:
– Which functions are allowed for the kernel k : M × M → ℝ without a map ϕ being
explicitly given?
– How can the condition ∑Nk=1 ϕ(mk ) = 0 be relaxed if ϕ is not given?
The first question is answered by Mercer's theorem: if the kernel k is symmetric and positive semidefinite (i.e., every kernel matrix built from k is symmetric and positive semidefinite), then there is a vector space V with an inner product ⟨⋅, ⋅⟩ and a mapping φ : M → V such that

k(⋅, ⋅) = ⟨φ(⋅), φ(⋅)⟩.   (2.186)
We now tackle the problem of data with nonzero mean. Recall that the data matrix is
defined as
D = (ϕ(m1 ), . . . , ϕ(mN )) (2.187)
and the kernel matrix as
K = DT D ∈ ℝN×N . (2.188)
Let further

D̃ = (ϕ(m_1) − ϕ̄, . . . , ϕ(m_N) − ϕ̄)   (2.189)

with ϕ̄ = (1/N) ∑_{k=1}^N ϕ(m_k) be the centered data matrix and

K̃ = D̃^T D̃ ∈ ℝ^{N×N}   (2.190)

the corresponding centered kernel matrix. With U ∈ ℝ^{N×N} denoting the matrix whose entries are all equal to 1, the centered data matrix can be written as

D̃ = D − (1/N) D U   (2.191)
and putting this into the definition of the centered kernel matrix yields
K̃ = (D − (1/N) D U)^T (D − (1/N) D U)
  = D^T D − (1/N) D^T D U − (1/N) U D^T D + (1/N²) U D^T D U
  = K − (1/N) K U − (1/N) U K + (1/N²) U K U.   (2.192)
The eigenvector equation (2.178) needs to be solved with K̃ instead of K, but as one can see, no explicit evaluation of ϕ is required.
Moreover, the projection must be redefined. As in Equation (2.183), let m be the vector under consideration and à be the projection matrix. Write u ∈ ℝ^N for the vector of ones. A similar calculation as above leads to

m′ = Ã^T ((k(m_1, m), . . . , k(m_N, m))^T − (1/N) K u)   (2.193)

as the new projection formula.
To finish this section, the following list gives a ready-to-use sequence of instructions for the kernelized PCA, as the section before did for the usual PCA. Let m_1, . . . , m_N ∈ M = ℝ^d be a training set and m ∈ M an additional feature vector. k : M × M → ℝ denotes a permissible kernel function. Fix some d′ < min{d, N}.
1. Calculate the matrices

K with K_ij = k(m_i, m_j)   (2.194)

and

K̃ = K − (1/N) K U − (1/N) U K + (1/N²) U K U   with U ∈ ℝ^{N×N} the matrix whose entries are all equal to 1.   (2.195)

The remaining steps (solving the eigenvalue problem (2.178) with K̃, organizing the normalized eigenvectors into the projection matrix Ã, and applying the projection formula (2.193)) yield the projected feature vector m′ of smaller dimension d′.

Fig. 2.37. Kernelized PCA with radial kernel function k(m, s) = exp(−(1/2) ‖m − s‖²).
Figure 2.37 gives an example of a kernelized PCA. In this example, the final dimension d′ is not chosen to be smaller but is equal to the original dimension, d′ = d = 2.
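To make the listed steps concrete, the following Python/NumPy sketch applies them with a radial kernel to a toy dataset. All function and variable names (rbf_kernel, kernel_pca, the choice gamma = 0.5, and the scaling of the eigenvectors by 1/√λ, one common way to satisfy the normalization condition (2.179)) are illustrative assumptions, not part of the text.

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # radial kernel k(a, b) = exp(-gamma * ||a - b||^2)
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def kernel_pca(M, m_new, d_prime=2, gamma=0.5):
        """Project m_new onto the first d_prime kernel principal components
        of the training set M (one sample per row)."""
        N = M.shape[0]
        # Kernel matrix K (Eq. (2.177)) and centered version K~ (Eq. (2.195))
        K = np.array([[rbf_kernel(M[i], M[j], gamma) for j in range(N)] for i in range(N)])
        U = np.ones((N, N))
        K_tilde = K - K @ U / N - U @ K / N + U @ K @ U / N**2
        # Eigenvectors of K~ (Eq. (2.178)), largest d_prime eigenvalues first
        eigval, eigvec = np.linalg.eigh(K_tilde)
        order = np.argsort(eigval)[::-1][:d_prime]
        lam, A_tilde = eigval[order], eigvec[:, order]
        # Assumed normalization: scale each eigenvector by 1/sqrt(lambda)
        A_tilde = A_tilde / np.sqrt(np.maximum(lam, 1e-12))
        # Projection of the new sample (Eq. (2.193))
        kappa = np.array([rbf_kernel(M[i], m_new, gamma) for i in range(N)])
        return A_tilde.T @ (kappa - K @ np.ones(N) / N)

    rng = np.random.default_rng(0)
    M = rng.normal(size=(20, 2))                 # toy training set
    print(kernel_pca(M, np.array([0.1, -0.2])))  # two kernel principal components

Note that only kernel evaluations appear in the sketch; the mapping ϕ is never used explicitly, exactly as argued above.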
In particular, uncorrelatedness of the components,

Cov{m′_i, m′_j} = 0   for i ≠ j,   (2.200)

follows as well. Thus, one could say that the independent component analysis goes one step further than the PCA. Actually, ICA can be seen as a two-step process. First, the components are decorrelated. Second, the components are orthogonally transformed so that they become independent (see Figure 2.38).
Before we dive more deeply into ICA, we will review some fundamentals from
probability theory.
Fig. 2.38. ICA as a two-step process: decorrelation ("whitening"), e.g., by PCA, followed by a transformation that makes the components independent.
The random variables a, b are called uncorrelated iff either of the following two equivalent conditions is met: Cov{a, b} = 0 or, equivalently, E{ab} = E{a} E{b}.
Definition 2.11 (Independence). Let p(a) and p(b) denote the marginal densities of the random variables a and b respectively. Moreover, let p(a | b) and p(b | a) be their conditional densities. a and b are called independent if any one (and therefore all) of the following equivalent conditions holds: p(a, b) = p(a) p(b), p(a | b) = p(a), or p(b | a) = p(b).
m′ = Y Z (m − E{m}) = Y m̃,   (2.207)

where m̃ := Z (m − E{m}) denotes the decorrelated feature vector.
(a) Original feature space  (b) Intermediate space of decorrelated features (after whitening)  (c) Transformed space of independent features
The decorrelation ("whitening") matrix Z is obtained from the eigendecomposition of the covariance matrix,

Z = √(Λ^{-1}) E^T = diag(1/√κ_1, . . . , 1/√κ_d) E^T.   (2.208)
See Equations (2.137) and (2.138) for the definition of Λ and E. The scaling is neces-
sary, because otherwise the covariance matrix of the transformed feature would equal
Λ (see Equations (2.143) to (2.145)) but we want it to be the identity. In practice, if
only the set of features m1 , . . . , mN is known, one uses the PCA to obtain an unbiased
estimator
Ẑ = diag(√((N − 1)/λ_1), . . . , √((N − 1)/λ_d)) A^T.   (2.209)
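As an illustration of the decorrelation step, the following NumPy sketch estimates Ẑ from a sample as in Equation (2.209) and checks that the whitened features have (approximately) unit covariance; the function name whiten and the toy data are made up for this example.

    import numpy as np

    def whiten(M):
        """Decorrelate ("whiten") the rows of M (samples) as in Eq. (2.209)."""
        N = M.shape[0]
        mean = M.mean(axis=0)
        S = (M - mean).T @ (M - mean)          # scatter matrix
        lam, A = np.linalg.eigh(S)             # eigenvalues/eigenvectors (PCA)
        Z_hat = np.diag(np.sqrt((N - 1) / lam)) @ A.T
        M_tilde = (M - mean) @ Z_hat.T         # whitened samples
        return M_tilde, Z_hat

    rng = np.random.default_rng(1)
    M = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
    M_tilde, _ = whiten(M)
    print(np.cov(M_tilde, rowvar=False))       # approximately the identity matrix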
Unfortunately, this goal is not always achievable, nor is such a matrix unique. The
idea is to find an objective function that measures the “magnitude of independence”
with respect to Y. Actually, there are two major approaches to defining something like a "magnitude of independence;" they lead to different algorithms for ICA. The first
approach leads to the non-Gaussian family of ICA algorithms, which, as the name sug-
gests, is inspired by the central limit theorem. This approach is not part of the subject
matter of this textbook. The second approach is inspired by Shannon’s information
theory and uses the concept of mutual information to measure the independence of
random variables. This textbook follows this second approach.
Definition 2.13 (Differential entropy). Let a and b be two absolutely continuous ran-
dom variables with density p(a,b). Furthermore, let p(a) and p(b) denote their
marginal densities and p(a | b) and p(b | a) denote their conditional densities.
1. The differential entropy of the random variable a (respectively b) is defined as h(a) := −∫ p(a) log p(a) da (and analogously for b).
Differential entropy is the generalization of the entropy H(⋅) to the case of continuous
random variables. The original concept of entropy is only defined for the discrete case.
Therefore, sometimes the term continuous entropy is also used. Also, the generalization is syntactically straightforward: h(⋅) uses densities and integrals where H(⋅) uses discrete probabilities and sums (but the result must be handled with care). The differential entropy is not always positive and is not invariant under continuous coordinate transformations. Both properties would be expected of a "real" entropy, i.e., a measure of information or uncertainty, and both are fulfilled by the original, discrete entropy. Nevertheless, for the purposes of this book, these problems can be ignored.
In addition, let p(a)p(b) denote the product density of the marginal densities. Then the mutual information of a and b can be expressed as D(p(a, b) ‖ p(a) p(b)). This shows that the mutual information between two random variables is the Kullback–Leibler divergence (see Section 2.4.5) between the joint distribution and the product of the marginal distributions. This divergence becomes zero if and only if the random variables are independent.
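For discrete (or discretized) random variables, this quantity can be computed directly. The following sketch, with a hypothetical helper mutual_information, evaluates D(p(a, b) ‖ p(a) p(b)) from a joint probability table; it is meant only to illustrate the definition, not any particular ICA algorithm.

    import numpy as np

    def mutual_information(joint):
        """I(a; b) = D_KL(p(a,b) || p(a) p(b)) for a discrete joint distribution."""
        p_ab = joint / joint.sum()
        p_a = p_ab.sum(axis=1, keepdims=True)        # marginal of a (column vector)
        p_b = p_ab.sum(axis=0, keepdims=True)        # marginal of b (row vector)
        mask = p_ab > 0
        return np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask]))

    # Independent variables: mutual information is (numerically) zero
    print(mutual_information(np.outer([0.3, 0.7], [0.5, 0.5])))
    # Strongly dependent variables: mutual information is clearly positive
    print(mutual_information(np.array([[0.45, 0.05], [0.05, 0.45]])))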
This yields the objective function for the second step of the ICA. Let m̃ = (m̃_1, . . . , m̃_d)^T denote the random vector after decorrelation as in Equation (2.207). Then

J(Y) = D(p(m′) ‖ p(m′_1) ⋅ ⋅ ⋅ p(m′_d))   with m′ = Y m̃ and Y^T Y = I   (2.217)

is to be minimized; the minimizer Y* is the transformation matrix sought. In practice, this optimization problem can only be solved numerically, and J(Y*) = 0 cannot be guaranteed. Still, the resulting transform will make the features approximately independent.
(a) First component suffices to separate the classes.  (b) Both components are necessary to separate the classes.
Fig. 2.40. The case for multiple discriminant analysis: PCA does not take class information into
account. In particular, it does not aim for optimal class separability.
In the situation of Figure 2.40b, reducing the features to the first principal component discards the information needed to separate the classes, i.e., the PCA projection is suboptimal with respect to the actual problem. In contrast, in Figure 2.40a, a reduction to the first component still suffices to distinguish between the classes.
This shortcoming is tackled by multiple discriminant analysis (MDA). As the name
suggests, MDA considers different classes and aims for an optimal separation right
from the beginning. If the problem has c classes, then MDA finds the best projection
onto a (c − 1)-dimensional subspace.
In the two-class case, D = D1 ∪ D2 will denote the partition of the dataset. Additionally, |D1| = N1 and |D2| = N2 denote the cardinalities of these sets. The goal is to find a vector w ∈ ℝ^d such that

m′ = w^T m   (2.220)
yields a projection that optimally separates both classes. Figure 2.41 illustrates the
situation.
Fig. 2.41. Quantities in the two-class case of multiple discriminant analysis.
A good choice of w is one that, on the one hand, pushes the projected mean points of the two classes far apart and, on the other hand, keeps the spread (standard deviation) of each projected class small. This means one should optimize the ratio between these two quantities. To that end, one defines the mean of the projected classes by

m′_i = (1/N_i) ∑_{m′ ∈ D′_i} m′ = (1/N_i) ∑_{m ∈ D_i} w^T m = w^T ((1/N_i) ∑_{m ∈ D_i} m) = w^T m_i   for i = 1, 2   (2.221)
and the squared standard deviation of the projected classes by

s′_i² = (1/N_i) ∑_{m′ ∈ D′_i} (m′ − m′_i)² = (1/N_i) ∑_{m ∈ D_i} (w^T m − w^T m_i)²
     = w^T ((1/N_i) ∑_{m ∈ D_i} (m − m_i)(m − m_i)^T) w   for i = 1, 2.   (2.222)
The objective function is then the ratio

J(w) = (m′_1 − m′_2)² / (s′_1² + s′_2²)   (2.223)

and needs to be maximized. This form of the objective function is called the Fisher linear discriminant. This functional is not in an optimal form to solve the maximization problem, because the dependence on w is not explicit, nor is this form suited to be generalized to higher dimensions. So the next step is to rewrite Equation (2.223) in terms of quadratic forms in w.
A close look at Equation (2.222) reveals that the middle factor is again the scatter
matrix (here normalized by N i ):
S_i := (1/N_i) ∑_{m ∈ D_i} (m − m_i)(m − m_i)^T   for i = 1, 2   (2.224)

and their sum, the intra-class scatter matrix,

S_W := S_1 + S_2.   (2.225)

With these definitions, the denominator of Equation (2.223) becomes s′_1² + s′_2² = w^T S_W w. Analogously, the numerator can be written as (m′_1 − m′_2)² = w^T S_B w with the inter-class scatter matrix S_B := (m_1 − m_2)(m_1 − m_2)^T. The suffix B in S_B stands for "between." Then the objective function takes the form
J(w) = (w^T S_B w) / (w^T S_W w).   (2.229)
This form is called the Rayleigh coefficient or Rayleigh quotient. The aim is to max-
imize the ratio of the deviation between the classes compared to the deviation within
the classes.
The Rayleigh quotient is invariant under scaling w, hence it suffices to maximize
the numerator wT SB w for all w such that the denominator wT SW w equals 1,
max_{w ∈ ℝ^d} (w^T S_B w) / (w^T S_W w) = max_{w ∈ ℝ^d, w^T S_W w = 1} w^T S_B w.   (2.230)
This constrained maximization problem leads, via a Lagrangian approach, to the generalized eigenvalue problem S_B w = λ S_W w. Under the assumption that S_W is invertible, one obtains the standard eigenvalue problem

S_W^{-1} S_B w = λ w.   (2.233)
Luckily, this equation does not need to be solved directly. From the definition of
SB one can see that
S_B w = (m_1 − m_2) (m_1 − m_2)^T w   (2.234)

always has the same direction as (m_1 − m_2), because (m_1 − m_2)^T w is a scalar. Therefore, Equation (2.233) can be simplified by setting S_B w = λ′ (m_1 − m_2) for some unknown scalar λ′,

S_W^{-1} λ′ (m_1 − m_2) = λ w   ⇔   w = λ^{-1} λ′ S_W^{-1} (m_1 − m_2).   (2.235)
As only the direction of w matters, the unknown scalars can be dropped and w can be normalized with respect to S_W,

w ← w / √(w^T S_W w) = S_W^{-1} (m_1 − m_2) / √((m_1 − m_2)^T S_W^{-T} S_W S_W^{-1} (m_1 − m_2))
  = S_W^{-1} (m_1 − m_2) / √((m_1 − m_2)^T S_W^{-1} (m_1 − m_2)).   (2.236)
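The two-class solution can be computed in a few lines; the following NumPy sketch follows Equations (2.224), (2.225), and (2.236) on made-up data (the function name fisher_direction and all parameter values are illustrative).

    import numpy as np

    def fisher_direction(D1, D2):
        """Two-class Fisher linear discriminant direction, Eq. (2.236). Rows are samples."""
        m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
        S1 = (D1 - m1).T @ (D1 - m1) / len(D1)    # per-class scatter, Eq. (2.224)
        S2 = (D2 - m2).T @ (D2 - m2) / len(D2)
        SW = S1 + S2                              # intra-class scatter, Eq. (2.225)
        w = np.linalg.solve(SW, m1 - m2)          # direction S_W^{-1} (m1 - m2)
        return w / np.sqrt(w @ SW @ w)            # normalization as in Eq. (2.236)

    rng = np.random.default_rng(2)
    D1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
    D2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(100, 2))
    w = fisher_direction(D1, D2)
    print(w, (D1 @ w).mean(), (D2 @ w).mean())    # projected class means are well separated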
In the general case of c classes, the projection is carried out by a matrix W ∈ ℝ^{d×(c−1)} instead of a single vector,

m′ = W^T m ∈ ℝ^{c−1}.   (2.238)

As before, let

m_i = (1/N_i) ∑_{m ∈ D_i} m   (2.239)

denote the class means and

m̄ = (1/N) ∑_{m ∈ D} m = (1/N) ∑_{i=1}^c N_i m_i   (2.240)

the overall mean. The scatter matrices for each class are

S_i = (1/N_i) ∑_{m ∈ D_i} (m − m_i)(m − m_i)^T,   (2.241)

the intra-class scatter matrix is

S_W = ∑_{i=1}^c S_i,   (2.242)

and the inter-class scatter matrix is

S_B = ∑_{i=1}^c N_i (m_i − m̄)(m_i − m̄)^T.   (2.243)

Note that for c = 2, this definition differs from the previous definition.
The corresponding definitions can be set up for the projected feature m′ = W^T m:

m′_i = (1/N_i) ∑_{m′ ∈ D′_i} m′   (class means)   (2.244)
m̄′ = (1/N) ∑_{m′ ∈ D′} m′   (overall mean)   (2.245)
S′_i = (1/N_i) ∑_{m′ ∈ D′_i} (m′ − m′_i)(m′ − m′_i)^T   (scatter matrices of each class)   (2.246)
S′_W = ∑_{i=1}^c S′_i   (intra-class scattering)   (2.247)
S′_B = ∑_{i=1}^c N_i (m′_i − m̄′)(m′_i − m̄′)^T   (inter-class scattering)   (2.248)

As in the two-class case, one finds the relations

S′_W = W^T S_W W   (2.249)
S′_B = W^T S_B W   (2.250)

between the scattering of the original features and the projected features (see Equations (2.226) and (2.228)). This leads again to the Rayleigh quotient
J(W) = |S′_B| / |S′_W| = |W^T S_B W| / |W^T S_W W|   (2.251)
as the objective function. Here, |M| denotes the determinant of the matrix M. The columns w_1, . . . , w_{c−1} of the matrix W that maximizes J(W) are the eigenvectors of the generalized eigenproblem S_B w_i = λ_i S_W w_i that belong to the (c − 1) greatest eigenvalues λ_1, . . . , λ_{c−1}. The task is to find the roots of the characteristic polynomial det(S_B − λ S_W) = 0.
Example: Fisherfaces
In Section 2.7.1 it was shown how PCA can be used to represent the images of faces in
an approach called eigenfaces. Similarly, MDA can be used to extract Fisherfaces from
a dataset of images: the images are collected into vectors mi , from which the MDA
matrix W is computed according to Equation (2.254). The columns wi , the Fisherfaces,
of the matrix W can then be reorganized into images and inspected.
As an example, Fisherfaces were extracted from the extended YALE face dataset B
(which was used to extract the eigenfaces, too, see Section 2.7.1). Images of the same
subject were grouped into the same class, yielding 39 classes, and hence 38 Fisherfaces,
in all. The corresponding images of ten Fisherfaces are shown in Figure 2.42. Unlike
the eigenfaces in Figure 2.33, humans have trouble interpreting the meaning of these
images. If one concentrates enough, it is possible to see outlines of the eyes, the nose,
and the mouth. One can also see the outline of the chin in the fifth picture from the
left of the upper row, but it is difficult to imagine how Fisherfaces could be useful in
determining the identity of a person. Yet, when the feature vector m = WT m of a given
unknown image m is used in classification, Fisherfaces prove to be quite effective.
In an experiment, a linear soft margin support vector machine (see Section 7.7)
was trained to recognize the 39 identities in the extended YALE dataset using both
eigenfaces and Fisherfaces as representations. With eigenfaces, 19 % accuracy was
achieved with 76 components, 42 % accuracy with 127 components, and 65 % accuracy
with 200 components. With Fisherfaces, on the other hand, the classification was 75 %
accurate with only 38 components. This experiment shows that MDA is much more
efficient at encoding discriminative information than PCA.
Fig. 2.42. First ten Fisher faces computed from the YALE faces dataset of Georghiades et al. [2001].
Unlike with the eigenfaces in Figure 2.33, there is no directly human-interpretable structure in the
images.
A common disadvantage of all the previous methods to reduce the dimension of the
feature space is that the result is an opaque transformation matrix. The conversion
leads to a new feature vector whose components are nebulous combinations of the
former feature components. Hence, a (potentially existing) descriptive meaning gets
lost. Moreover, the previous methods only work for features on at least an interval
scale.
Feature selection means choosing a subset of features from a wider set of features that are considered reasonable for the problem at hand. In terms of the previous methods, feature selection projects onto subspaces that are aligned with the original axes, i.e., components are just left out or kept, but neither combined nor rotated. For this reason, this method also works for features on a lower scale.
Instead of an objective function that only depends on the data, the performance
of a selection of features is directly evaluated with respect to a previously chosen
classifier. For each selected set of features, the classifier is tuned on the training set, the
classifier is applied to the test set, and the estimated class assignments are compared
with the real classes of the data. As opposed to the other methods, the outcome of
the dimension reduction thus depends on the established classifier. The workflow is
depicted in Figure 2.43.
More formally, let D ∈ M denote a training set with d = dim M and D =
{m1 , . . . , mN }. Let I = {1, . . . , d} be the index set of all dimensions and I ∈ P(I) a se-
lection of indices of the dimensions. For a feature vector m, let m|I denote the feature
vector restricted to the selected components. Similarly, D|I denotes the restricted set.
Fig. 2.43. Workflow of feature selection: for each candidate selection I ∈ P(I), the features D|I are extracted, a classifier is trained and applied, and the performance evaluation drives the selection of the next subset.
The task is to find the I* ∈ P(I) such that the classifier has the best performance on D|I* among all I ∈ P(I). To test every subset of I, 2^d runs are necessary. Hence, this is only possible if d is small, because for each subset I the classifier needs to be trained anew, tested, and evaluated on D|I. For reasonable values of d this is already prohibitive. If the desired dimension d′ < d is already given in advance, the number of subsets is still the binomial coefficient (d over d′). Thus a brute force approach is normally impossible.
A suboptimal, but feasible approach is a greedy technique. First, the single-element set with the best feature component is selected. This requires d runs. The best component is kept. Then the component out of the d − 1 remaining ones is chosen that shows the best performance in conjunction with the already chosen first one. This procedure is repeated until the desired number of components is chosen or until joining a new component does not improve the performance. This way only d + (d − 1) + ⋅ ⋅ ⋅ + (d − d′ + 1) subsets need to be evaluated.
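A sketch of this greedy wrapper in Python is given below; it assumes a hypothetical evaluate(selection) function that trains and tests the chosen classifier on D restricted to the selected components and returns a score.

    def greedy_selection(all_indices, evaluate, d_target):
        """Greedy forward selection of at most d_target feature indices.
        evaluate(selection) is assumed to return a score (higher is better)."""
        selected, best_score = [], float("-inf")
        while len(selected) < d_target:
            candidates = [i for i in all_indices if i not in selected]
            scored = [(evaluate(selected + [i]), i) for i in candidates]
            score, best_i = max(scored)
            if score <= best_score:      # no improvement: stop early
                break
            selected.append(best_i)
            best_score = score
        return selected

    # Toy usage: pretend features 2 and 5 carry all the information
    toy_scores = {frozenset([2]): 0.7, frozenset([2, 5]): 0.95}
    evaluate = lambda sel: toy_scores.get(frozenset(sel), 0.5)
    print(greedy_selection(range(8), evaluate, d_target=3))   # -> [2, 5]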
This approach is called a wrapper approach, because it wraps the feature selection
around a classifier. Besides wrappers, there are embedded approaches, where a classi-
fier implicitly performs the feature selection, and filter approaches, that do not depend
on a classifier at all. However, these methods are outside the scope of this book.
Especially when dealing with images or video, one often has the problem that the
feature extraction step assumes that all the patterns are of the same size. For example,
the eigenfaces and Fisherfaces approaches in Sections 2.7.1 and 2.7.4 assume that
the facial images are of the same size. One possible solution, the one pursued in these
examples, is to crop and rescale the images so that they fulfill this constraint. However,
doing so will inevitably remove information that is then unavailable for classification.
An alternative solution is given by the bag of visual words approach. Here, several
low level features are extracted from different parts of the pattern and then combined
into one higher level descriptor that characterizes the whole pattern. By construction,
this descriptor always has the same dimensionality, irrespective of the size of the un-
derlying pattern or the number of extracted low level features.
The approach has its roots in the bag of words model from natural language pro-
cessing. Without going into too much detail, this model can be described as follows.
Document generation is modeled as the repeated and independent drawing of words
w k that follow a probability distribution P(w). The overall probability of generating
the document τ = (w1 ,w2 , . . . ,w K ), that is, the sequence of K words w k , is thus given
by
P(τ) = ∏_{k=1}^K P(w_k).   (2.255)
In a classification setting, e.g., when the goal is to classify e-mails into “ham”
=̂ ω1 or “spam” =̂ ω2 , characteristic word distributions P(w | ω1 ) for ham and P(w | ω2 )
for spam can be estimated from a collection of documents by counting the words
that occur in them. An unseen e-mail p = (w1 , . . . ,wK ) with K words can then be
classified by assigning the class that maximizes the likelihood
L(ω) = ∏_{k=1}^K P(w_k | ω)   for ω ∈ {ω_1, ω_2}.   (2.256)
Alternatively, one can estimate the word distributions for every e-mail in the train-
ing set and train a classifier using these distributions as feature vectors. In other words,
the P(w) estimated from the document p (the pattern) acts as the feature vector m, i.e.,
m = (P(w1 ), . . . , P(w k ))T .
Two things are important here. First, the order of the words in the document does
not matter, nor does their surrounding text. Second, this method works irrespectively
of the length of the documents. Both of these are caused by the underlying idea of
treating a document as an unordered collection—a bag—of words.
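A minimal sketch of the likelihood rule in Equation (2.256): the word distributions P(w | ω) are assumed to have been estimated beforehand; here they are simply made up for illustration.

    import math

    # Hypothetical word distributions P(w | ham) and P(w | spam)
    p_ham  = {"meeting": 0.4, "report": 0.4, "offer": 0.1, "winner": 0.1}
    p_spam = {"meeting": 0.1, "report": 0.1, "offer": 0.4, "winner": 0.4}

    def log_likelihood(words, p_w):
        # log L(omega) = sum_k log P(w_k | omega), cf. Equation (2.256)
        return sum(math.log(p_w[w]) for w in words)

    email = ["winner", "offer", "offer", "meeting"]
    scores = {"ham": log_likelihood(email, p_ham), "spam": log_likelihood(email, p_spam)}
    print(max(scores, key=scores.get))   # -> "spam"

As in the text, neither the order of the words nor the length of the document enters the decision.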
The same idea can be used to classify images of varying size. Here, it is assumed
that the images are composed of visual words from some (for now) nondescript visual
vocabulary. Similar to document classification, an image can then be characterized
by observing the words that occur in the image (see Figure 2.44). Note that in gen-
eral the original image cannot be reconstructed from the bag representation, because
the vocabulary might not contain all of the possible visual words that appear in the
image and because the position of the words in the image is not specified in the bag
representation.
Fig. 2.44. Illustration of the underlying idea of bag of visual words: An image can be thought of as
the composition of words from some visual vocabulary and can therefore be characterized by the
words that appear in it.
This approach is divided into two steps: learning the visual vocabulary from a
training set, i.e., defining the visual words, and extracting a higher level descriptor
from an image. Formally, let D = {pn | n = 1, . . . ,N} be a set of N patterns pn ∈ P.
These patterns should be representative for the patterns that are to be classified later
on, but information about their classes is not needed. Low level features mt (pn ) ∈
ℝd , t = 1, . . . ,T(pn ) are extracted from each of the N patterns pn . Note that the number
of extracted low level features T(pn ) depends on pn . In general, a different number of
low level features is extracted from each pattern, T(pn ) ≠ T(pm ) for n ≠ m, e.g., when
the features are extracted on key points as in the example below. The mt (pn ) are then
used to partition the ℝd into K non-overlapping tiles z k , k = 1, . . . ,K (see Figure 2.45),
i.e.,
⨄_{k=1}^K z_k = ℝ^d.   (2.257)
The z k form the visual vocabulary, that is, each z k corresponds to a visual word.
It is tempting to think of the low level features m_t(p_n) as the alphabet from which the visual words are constructed, but this is a false analogy: unlike characters, every low level feature appears in one and only one word z_k. A better analogy is to consider the m_t(p_n) ∈ z_k as different spellings ("colour" and "color") or perhaps synonyms ("color" and "hue") of the same word z_k.
Once the vocabulary is determined, a K-dimensional high level descriptor f can be extracted from an unseen pattern p′. Once again, low level features m_t(p′) ∈ ℝ^d with t = 1, . . . , T(p′) are extracted from p′. Note that the type of the features must be the same as the type that was used to determine the vocabulary. The high level descriptor f is built as a count statistic (i.e., a histogram) over the m_t(p′) with respect to the z_k,
f_k = (1/T(p′)) ∑_{t=1}^{T(p′)} δ[m_t(p′) ∈ z_k]   for k = 1, . . . , K,

where δ[⋅] denotes the generalized Kronecker symbol. In other words, the entry f_k is the fraction (i.e., the relative frequency) of low level feature descriptors that fall into the tile z_k. Figure 2.46 shows an example descriptor derived from T = 20 low level features using the vocabulary from Figure 2.45.

Fig. 2.45. A visual vocabulary: the two-dimensional feature space (axes m_1, m_2) is partitioned into K = 10 tiles z_1, . . . , z_10.
Fig. 2.46. Example of a bag of visual words descriptor constructed using the vocabulary in Figure 2.45. T(p′) = 20 low level features were extracted from the underlying pattern.
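Both steps (learning the vocabulary and extracting the descriptor) can be sketched as follows, assuming that the tiles z_k are defined by a simple k-means clustering of the training features, which is one common but not the only choice; all names and parameter values are illustrative.

    import numpy as np

    def kmeans(X, K, iters=20, seed=0):
        # Learn K cluster centers: each center defines one tile z_k of the vocabulary.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), K, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        return centers

    def bow_descriptor(features, centers):
        # f_k = fraction of the pattern's low level features falling into tile z_k
        labels = np.argmin(((features[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        return np.bincount(labels, minlength=len(centers)) / len(features)

    rng = np.random.default_rng(3)
    training_features = rng.normal(size=(400, 2))    # low level features from all patterns
    vocabulary = kmeans(training_features, K=10)
    new_pattern_features = rng.normal(size=(20, 2))  # T(p') = 20 features of a new pattern
    print(bow_descriptor(new_pattern_features, vocabulary))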
The bag of visual words approach has some important properties:
1. As in text processing, the size of f does not depend on the size of the underlying
pattern p.
2. Determining the vocabulary is unsupervised, i.e., information about the patterns’
classes ω(pn ), n = 1, . . . ,N is not needed to determine the vocabulary.
3. Invariance properties of the low level features m_t propagate to the high level descriptor f, e.g., if the m_t are invariant under rotation, scale or illumination, then so is f.
4. Spatial (and in the case of a video, temporal) relations between image patches
are discarded. On the one hand, this makes f robust against translation; on the
other hand, this may remove discriminative information and prevent localization
of objects in an image.
5. If there is a semantic interpretation of the low level features mt , the high level
descriptor f can also be interpretable.
6. The dimensionality of f is often much larger than the dimensionality of the low
level features mt , but the exact number of the z k usually does not have a significant
impact on the classification performance.
2 Maximum a posteriori classification according to Equation (3.23) under the assumption that the
features are statistically independent
(a) Primitive features: features that require little or no computation time, e.g., color channels, gradient magnitude, texture codes.  (b) Dense sampling: features are extracted from every foreground pixel.
Fig. 2.47. Modifications to use bag of words in bulk material sorting. Images from Richter et al.
[2016].
et al. [2004]). Nowadays, the method has been superseded by convolutional neural
networks (see Section 7.6), but the approach is still useful in other domains.
Fig. 2.48. Structure of the bag of words approach in Richter et al. [2016].
2.8 Exercises
(2.2) Assign the following features to their corresponding feature scale (nominal, or-
dinal, interval, ratio, or absolute): (School) grades, car brands, date of birth, area
of the canvas of a sail, number of cows in a herd, motor temperature in ∘ C, engine
speed/revolution, height of body, clothing size, optical magnification, account
balance, electrical voltage, place in a race, gender, variety of apple, display of a
Geiger counter, population density, annual income in EUR, intelligence quotient
(IQ).
(2.3) Compute the Kullback–Leibler divergence DKL between the following probability
distributions P1 and P2 :
(2.4) Suppose there are three states, referred to as 1, 2, and 3, and two random variables with the probability mass functions P1(X) and P2(X) such that P1(1) = P1(2) = P1(3) = 1/3 and P2(1) = 1/6. How must the remaining probabilities P2(2) := a and P2(3) := b be chosen so that the Kullback–Leibler divergence DKL(P1 ‖ P2) is minimized?
(2.5) The traffic police invented a new, innovative test to assess a road user’s ability
to drive a vehicle: Drivers are prompted to fire a gun at a target. The hit pattern is
compared to the following reference, obtained from drunk and sober drivers:
(Scatter plot: hit patterns of sober and drunk drivers in the x-y plane, roughly within [−0.5, 0.5] × [−0.5, 0.5].)
(2.6) A well-known car manufacturer manipulated the software that controls the am-
monium nitrate injection to the catalytic converter so that less ammonium nitrate
was used. The resulting engine fumes contain less smelly ammonia (NH3 ), but
more pollutant nitrogen oxides (NOX ). In a randomized test of engine fumes for
these two compounds, the following scatter plot resulted:
(Scatter plot: m_1 (% NH3) versus m_2 (% NOX), with clusters of "manipulated" and "not manipulated" cars at multiples of a.)
Construct a one-dimensional feature m′ that can be used to classify cars into "manipulated" (ω1) or "not manipulated" (ω2). Give a formula to compute m′.
(2.7) The following feature is derived from a Fourier contour descriptor, i.e., the coef-
ficients Z i ∈ ℂ, i ∈ ℤ of the Fourier series expansion of the contour:
m := Z_8/|Z_1| − Z_6/|Z_1| + Z_4/|Z_1| − Z_2/|Z_1| + Z_0/|Z_1|.
Is m invariant under scaling, translation, and rotation? Why or why not?
(2.8) The following feature is derived from a Fourier contour descriptor, i.e., the coef-
ficients Z i ∈ ℂ, i ∈ ℤ of the Fourier series expansion of the contour:
m := (α|Z_0| + |Z_2|) / √(|Z_3|² + |Z_4|^β + γ),   α, β, γ ∈ ℝ.
How must α, β, γ ∈ ℝ be chosen for m to be invariant under translation and scal-
ing?
(2.10) In an optical inspection system the following variables are computed for each
recorded object: center of mass c, perimeter of the contour P, area of the object
A, length of the main and secondary axes l1 and l2 , and the rotation angle φ of
the main axis w.r.t. the image coordinate system. The variables are shown in the
following sketch:
(Sketch: object with center of mass c, area A, perimeter P, main and secondary axis lengths l_1 and l_2, and rotation angle φ.)

m_1 = ‖c‖_2 / A
m_2 = l_1/l_2 + P²/A
m_3 = (l_2/U) φ
m_4 = A/l_1 − cos φ
m_5 = φ/‖c‖_2 + (l_1 − l_2)/A
3 Bayesian decision theory
Up to now, this book has dealt with the question of how to select, define, and extract
features from observed patterns of objects. In Figure 1.2, this is depicted as the first
step of a pattern recognition system (first blue box) after the preparatory steps. From
now on, our attention will be turned to the second step: how the features are used to
assign objects to classes. Eventually, this task will be solved by classifiers.
Firstly, the entirety of classifiers can be divided into methods that use a proba-
bilistic description of the problem and those which do not. The latter are considered in
Chapter 7. The class of probabilistic methods is rather extensive, hence its discussion
is divided into the Chapters 3 to 6. Under the assumption that the probabilistic descrip-
tion of the system is already given, the question of how the actual classification is to
be performed needs to be answered. This is the subject of this chapter. The problem
of how to obtain the probabilistic description temporarily remains open and will be
answered later (Chapters 4 to 6).
In summary, the assumptions of this chapter are that the probabilistic description
of the system is already given and that the features of the objects are already defined
and extracted.
The general idea is to think of the world as a random generator that outputs pairs of
features with associated object classes (m, ω). The collection of all pairs is described by
a probability distribution. In addition, all elements of a sequence of pairs are pairwise
independent, i.e., the joint distribution of N pairs (m1 , ω1 ), . . . , (mN , ω N ) equals the
product of the individual distributions,
p((m_1, ω_1), . . . , (m_N, ω_N)) = ∏_{j=1}^N p(m_j, ω_j).   (3.1)
The marginal distribution of the classes, P(ω), is called the a priori distribution (of the classes), and the marginal distribution of the features is

p(m) = ∑_{ω=ω_1}^{ω_c} P(m, ω).   (3.3)

The class-specific (conditional) feature distribution

p(m | ω)   (3.4)

is also called the likelihood.
(a) Full joint distribution  (b) Marginal distributions
Fig. 3.1. Example of a random distribution P(m, ω) of mixed discrete and continuous quantities.
Finally, the conditional distribution of the classes given the features,

P(ω | m),   (3.5)

is called the a posteriori distribution. So far, the set of classes was written as Ω/∼ = {ω_1, . . . , ω_c}, where each ω_i ⊆ Ω denoted one class and was formally defined as a subset of Ω. Now, we identify each of them with a unit vector in a c-dimensional space K,

ω_i ∈ K = ℝ^c   with ω_{ij} = δ_i^j,   (3.9)
ω_i = (0, . . . , 0, 1, 0, . . . , 0)^T,   with the 1 at the ith position.   (3.10)

Fig. 3.2. The decision space K for c = 3 classes. The orange arrow shows a decision vector k, the blue triangle shows the probability simplex, where ∑_i k_i(m) = 1 and k_i ≥ 0, i = 1, . . . , c.
The unit vectors ω_i are also called target vectors, k ∈ K a decision vector, and its components k_i decision functions.
In light of this new perspective, a classifier follows a two-step construction scheme.
In the first step, a feature vector is mapped to a value k in the decision space spanned
by the ωi (see Figure 3.2),
k ∈ span{ω_1, . . . , ω_c} = { ∑_{i=1}^c λ_i ω_i | λ_i ∈ ℝ }.   (3.11)
The second step is to take the target vector ω̂ as an estimation that has the shortest
distance from k, i.e.,
ω̂ = arg min_{ω_i : i=1,...,c} ‖k − ω_i‖.   (3.12)
While the second step is uniform for all classifiers, the actual logic of a classifier
is merged into the mapping in the first step,
k : M → K,   m ↦ k(m).   (3.13)
Beyond a uniform description of classifiers, the vectorized approach prevents a
misconception of the classes’ numbering. While the numbering misleadingly implies
an ordering of the classes, the vectorized approach clearly shows that each pair of
target vectors has the same distance. More concisely: the classes stem from a nominal,
not an ordinal, scale, and the vector description reflects this.
A parametric decision function is a mapping k(m, θ) that additionally depends
on a parameter vector θ = (θ1 , . . . , θ k )T ∈ Θ from a parameter space Θ. Learning
by examples means to find the parameter vector θ̂ such that the mapping k(m_i, θ̂)
approximates the true class ω(mi ) for each of the training samples in some opti-
mal way. Each decision region R_i ⊆ M in the feature space corresponds to a subset {k | ‖k − ω_i‖ < ‖k − ω_j‖ ∀j ≠ i} ⊆ K in the decision space. More precisely,

R_i = {m | ‖k(m, θ) − ω_i‖ < ‖k(m, θ) − ω_j‖ ∀j ≠ i}   (3.14)
∂R_i = {m | ‖k(m, θ) − ω_i‖ ≤ ‖k(m, θ) − ω_j‖ ∀j ≠ i} \ R_i.   (3.15)
This shows that the structure and parametrization of k(⋅, θ) determines the deci-
sion boundaries in M.
After these preliminary remarks, we now start with the first specific classifier. As the
first objective, one can require that the expected squared Euclidean distance between
the decision vector and the true target vector is minimal. Let P(m, ω) denote the joint
distribution of the feature vector and the target vectors. Then the objective function is
f(k) = E{‖k(m) − ω‖²} = ∑_{i=1}^c ∫_M ‖k(m) − ω_i‖² P(m, ω_i) dm   (3.16)
As k is an optimal solution, the inequality must hold for any k∆ . This is surely
true if there is a k such that the second term is identically zero. Note that it is not
required that the second term be zero: this is a sufficient but not a necessary condition.
Fig. 3.3. Workflow of the MAP classifier: the decision functions k_i(m) = P(ω_i | m) are evaluated and the maximum determines the estimate ω̂.
In the last line, the dependence of k on m was made explicit again and the term was reordered so that k_Δ^T(m) was moved out of the brackets, because this term is outside of our control. But the overall formula becomes zero if the term in the brackets is zero, and this term is only made up of known quantities and k. One obtains

0 = ∑_{i=1}^c (k(m) − ω_i) P(ω_i | m) = k(m) ∑_{i=1}^c P(ω_i | m) − ∑_{i=1}^c ω_i P(ω_i | m)
  = k(m) − E{ω | m}   ⇒   k(m) = E{ω | m},   (3.21)

where ∑_{i=1}^c P(ω_i | m) = 1 was used.
The last line concludes the derivation. Going one step back and explicitly writing
out the expectation gives a more illustrative representation of the optimal target vector:
k(m) = E{ω | m} = ∑_{i=1}^c ω_i P(ω_i | m)
     = (1, 0, . . . , 0)^T ⋅ P(ω_1 | m) + ⋅ ⋅ ⋅ + (0, . . . , 0, 1)^T ⋅ P(ω_c | m) = (P(ω_1 | m), P(ω_2 | m), . . . , P(ω_c | m))^T.   (3.22)
The target vector is formed by the conditional distributions or a posteriori distributions of the classes, but this is only the first step of the classifier. Putting the last two steps together, the decision rule becomes

ω̂ = arg max_ω P(ω | m).   (3.23)
Fig. 3.4. 3-dimensional probability simplex in barycentric coordinates, i.e., projected onto the two-
dimensional plane so that every point in the simplex is identified by three coordinates k1 ,k2 ,k3 ≥ 0
with k1 + k2 + k3 = 1.
Intuitively, this result does not come as a surprise. The optimal classifier with
respect to the least expected square error always takes the class with the highest a
posteriori probability. Therefore, this classifier is called the maximum a posteriori
(MAP) classifier. Every point that can be described by k shares the property that the
sum of its components equals one. For this reason, the blue simplex in Figure 3.2 is
also called the probability simplex. Figure 3.3 illustrates the workflow of the MAP
classifier; Figure 3.4 depicts a 2-dimensional projection of the probability simplex
from Figure 3.2.
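In code, the two steps of the MAP classifier reduce to computing the a posteriori probabilities (up to the common factor p(m)) and taking the argmax. The following sketch assumes a toy two-class problem with known Gaussian likelihoods and priors; all values are made up.

    import numpy as np

    def gaussian_pdf(m, mu, sigma):
        return np.exp(-0.5 * ((m - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    priors = np.array([0.7, 0.3])                       # P(omega_1), P(omega_2)
    likelihoods = lambda m: np.array([gaussian_pdf(m, 0.0, 1.0),
                                      gaussian_pdf(m, 2.0, 1.0)])

    def map_classify(m):
        # k_i(m) proportional to P(omega_i | m); the common factor p(m) does not affect the argmax
        k = likelihoods(m) * priors
        return np.argmax(k) + 1                         # class index (1-based)

    print([map_classify(m) for m in (-1.0, 0.9, 1.3, 3.0)])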
Note that although this classifier is optimal in the sense of least square error in
the decision space, this does not mean that it will classify without error. The error
probability is given by
P e := P(ω̂ ≠ ω), (3.24)
i.e., this is the probability that the estimated class ω̂ does not match the true class ω. Since the a posteriori probability P(ω | m) depends on the class-specific feature
distribution p(m | ω), the asymptotic (i.e., the true theoretical) error probability can
only be zero if the class-specific feature distributions do not overlap, or if the a priori
probability of overlapping distributions is nonzero only for one of the classes. In prac-
tical applications, the former is almost never the case, while the latter is nonsensical
as it prevents the classifier from deciding on one or more of the classes.
In the previous Section, the MAP classifier was derived as the optimal classifier with
respect to the least expected square error. Of course, this criterion is only correct if each
error is equally bad. Quite often one is faced with applications in which one kind of
classification error is worse or more costly than an error of another kind. For example,
an undetected cause for alarm might be worse than a false alarm.
Therefore, the goal of this chapter is to extend the Bayesian framework by a cost
function:
l : Ω0 /∼ × Ω/∼ → ℝ (3.25)
where Ω0 /∼ := Ω/∼ ∪ {ω0 } and ω0 denotes the rejection class, which will be discussed
in more detail in Section 9.4. For now, it suffices to treat it as just another class without
any deeper meaning. The cost function l expresses the costs of deciding on the class
ω̂ if the true class is actually ω. In the case of a finite number of classes, the costs can
also be expressed as a matrix with entries l_ij = l(ω_i, ω_j). Normally, one has l(ω_i, ω_i) = 0 and l(ω_i, ω_j) ≥ 0 for i ≠ j. Instead of reducing the average error, the new objective is to reduce the expected cost
R = E{l(ω̂(m), ω)}.   (3.27)

To minimize R, define the a posteriori risk of deciding on class ω_i given the observation m as

R(ω_i | m) = ∑_{j=1}^c l(ω_i, ω_j) P(ω_j | m).   (3.28)

The overall risk can then be written as

R = E{l(ω̂(m), ω)}
  = ∫_M ∑_{j=1}^c l(ω̂(m), ω_j) P(m, ω_j) dm
  = ∫_M (∑_{j=1}^c l(ω̂(m), ω_j) P(ω_j | m)) p(m) dm
  = ∫_M R(ω̂(m) | m) p(m) dm.   (3.30)
In summary, the optimal decision is to choose the class with the smallest a posteriori
risk.
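The following sketch makes this concrete: the a posteriori risk of Equation (3.28) is evaluated for every candidate class and the class with the smallest risk is chosen. The cost matrix is invented for illustration.

    import numpy as np

    # l[i, j]: cost of deciding omega_i when the true class is omega_j (illustrative values)
    costs = np.array([[0.0, 10.0],     # missing a true alarm (omega_2) is expensive
                      [1.0,  0.0]])

    def bayes_decision(posteriors, costs):
        # R(omega_i | m) = sum_j l_ij P(omega_j | m), Eq. (3.28); pick the smallest risk
        risks = costs @ posteriors
        return int(np.argmin(risks)) + 1

    # Even though omega_1 is more probable here, its risk is larger:
    print(bayes_decision(np.array([0.8, 0.2]), costs))   # -> 2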
Viewed from a distance, all this is not very surprising. Taking the less risky choice
is probably what everyone would intuitively do, but now this is even mathematically
proven to be the optimum. Hence, the optimal Bayesian decision is a benchmark for
any other approach and defines an upper bound on the performance of any classifier.
At this point, one could ask, why is there any reason to consider a different classifier
if the optimum is already achieved? What is the rest of this book about? The answer
is simple: the Bayesian classifier requires probability densities that are typically not
(fully) known in real-world scenarios. The Bayesian classifier is optimal because it
uses every piece of information that can eventually be known about the entirety of
all the features and objects of the domain. Of course, an omniscient classifier has no
difficulties in making the best decision.
Bayesian decision theory uses a cost function and the a posteriori probability
of the classes. While the cost of a wrong decision can hopefully be determined with
some degree of certainty, an accurate a posteriori probability or more precisely its
determining pieces are hard to find. The class-specific distribution of the features and
the a priori distribution of the classes are normally unknown. Note that the marginal
distribution of the features p(m) in the denominator of
P(ω | m) = p(m | ω) P(ω) / p(m) ∝ p(m | ω) P(ω)   (3.32)
Fig. 3.5. Connection between the likelihood ratio and the optimal decision region.
is not required to find the minimum or maximum with respect to ω for a fixed m.
The quality of the Bayesian classifier stands or falls with the accuracy of these
quantities. Therefore, the next two chapters will treat the question of how those quanti-
ties are estimated in practice. The next chapter will focus on the so-called parametrized
methods, and the following one will deal with parameter-free methods that work with-
out a model assumption. Besides the technical challenge of how those distributions
are mathematically obtained, another issue becomes evident, too. The question is
what the concept “probability” actually means. This more philosophical matter will
constitute the introductory part of the coming chapter.
But before that, the remainder of this chapter will discuss some simple examples
and deduce another estimator that limits the maximal risk in the worst case.
Once again, consider a simple two-class scenario. Let l_ij = l(ω_i, ω_j) be a shorter notation for the cost function. The a posteriori risk for a fixed feature m is

R(ω_1 | m) = l_11 P(ω_1 | m) + l_12 P(ω_2 | m)   and   R(ω_2 | m) = l_21 P(ω_1 | m) + l_22 P(ω_2 | m).

The estimator decides on the class ω_1 iff R(ω_1 | m) < R(ω_2 | m). This can be equivalently rewritten as a threshold test on the likelihood ratio p(m | ω_1)/p(m | ω_2) (see Figure 3.5). An important special case is the cost function l_ij = 1 − δ_ij, i.e., every misclassification is equally costly and a correct decision costs nothing. Then the a posteriori risk

R(ω_i | m) = ∑_{j≠i} P(ω_j | m) = 1 − P(ω_i | m)   (3.37)

equals the converse of the a posteriori probability. As the estimator chooses the least risk, this leads to the already known MAP classifier. Hence, the MAP classifier is a special case of the general Bayesian classifier with this particular cost function.
The overall risk is
R = ∫_M R(ω̂(m) | m) p(m) dm
  = ∫_M (1 − P(ω̂(m) | m)) p(m) dm        (using Equation (3.37))
  = 1 − ∫_M P(ω̂(m) | m) p(m) dm
  = 1 − ∫_M P(ω̂(m), m) dm.   (3.38)
To interpret the last term, reconsider Figure 3.1 and compare it with Figure 3.6.
Generally there will be tuples (ω(o1 ),m(o1 )), (ω(o2 ),m(o2 )) whose underlying ob-
jects belong to different classes, ω(o1 ) ≠ ω(o2 ), but show the same feature vector
m(o1 ) = m(o2 ). This always happens if the supports of the likelihoods overlap.
The probability of the whole ensemble of all tuples (ω, m) is one by definition, i.e.,
(a) Joint distribution  (b) Marginal distributions
Fig. 3.6. Decision of an MAP classifier in relation to the a posteriori probabilities. The a priori probabilities are P(ω_i) = 1/5 for i = 1, . . . , 5. The MAP classifier always chooses the class with the highest a posteriori probability (shaded areas/thick lines). The overall risk can be computed by summing the individual class-specific risks for each of these regions separately.
∑_{j=1}^c ∫_M P(ω_j, m) dm = 1, but the estimator ω̂ is a function of the feature vector and always chooses one (fixed) class, depending on the features. In the case of the MAP classifier, the estimator ω̂ always decides on the class with the highest a posteriori probability (thick lines in Figure 3.6b). Hence, the ensemble (ω̂(m), m) is only a subset of all the options. Rewriting the last term of Equation (3.38) in an artificially complicated manner leads to
∫_M P(ω̂(m), m) dm = ∑_{j=1}^c ∫_M δ[ω̂(m) = ω_j] P(ω_j, m) dm
  = ∑_{j=1}^c ∫_{{m ∈ M | ω̂(m) = ω_j}} P(ω_j, m) dm
  = ∑_{j=1}^c ∫_{R_j} P(ω_j, m) dm ≤ 1,   (3.39)
where δ[ ⋅ ] denotes the generalized Kronecker symbol. This means that only those
regions of the density are integrated that correspond to classes the estimator chooses
(shaded areas in Figure 3.6a). In summary, the term in Equation (3.39) represents
the probability that the classifier decides correctly. Conversely, Equation (3.38) is the
probability of a wrong decision. This can also be rewritten as
R = 1 − ∑_{j=1}^c ∫_{R_j} P(ω_j, m) dm = ∑_{j=1}^c ∫_{M\R_j} P(ω_j, m) dm.   (3.40)
As a final result, one can state that the overall risk of the MAP classifier equals the
probability of a wrong decision.
Fig. 3.7. (a) Class-specific feature distributions p(m | ω_1) and p(m | ω_2) of the reference example; (b) the corresponding a posteriori probabilities.
Fig. 3.8. Optimal decision regions derived from the true underlying distribution. In region R_1, one has that P(ω_1 | m) > P(ω_2 | m). In region R_2, the opposite is true. The dataset consists of 100 samples for each class (200 samples in all). The optimal classifier has an empirical classification error of 16/200 = 8 % for the dataset shown.
The posterior class probabilities can be computed using the Bayes rule and are shown
in Figure 3.7b.
The optimal decision boundaries correspond to the set of points where P(ω1 | m) =
P(ω2 | m), as shown in Figure 3.8. Since the relevant distributions are known, it is
possible to compute the theoretical classification error using Equation (3.40). The
probability of error turns out to be approximately P e ≈ 6.16 %.
This model was also used to create a training sample D and a test sample T to use
in the classifier examples throughout this book. Both D and T consist of 200 samples
each, where 100 samples were drawn from p(m | ω1 ) and 100 samples were drawn
from p(m | ω2 ). Figure 3.8 shows the training set D alongside the decision regions.
It can be seen that some samples fall into the wrong decision regions. In fact, the
empirical classification error is 8 %. Note that this error deviates from the theoretical
classification error calculated above. The important lesson to take away from this
is that the empirical classification error is subject to random perturbations, since it
depends on the dataset. More reliable estimates of the true probability of error can
be obtained by using a larger test sample, or by using the techniques described in
Section 9.2.
The remainder of this section will introduce another kind of classifier, the Minimax
classifier. All of the concepts that have been discussed so far need either the a pos-
teriori distribution P(ω | m) or must resort to the a priori distribution P(ω), thanks to
Equation (3.32). Knowing the former is nearly a forlorn hope in real-world applications,
but even having a feasible estimate of the latter can be a challenging task. Normally,
one has to rely on expert knowledge of the application. Hence, a natural question to
ask is what happens if one implements a pattern recognition system with a specific a
priori distribution in mind that does not reflect the reality. To start, reconsider the risk
R = ∫_M R(ω̂(m) | m) p(m) dm = ∑_{i=1}^c ∫_{R_i} R(ω_i | m) p(m) dm        (using Equation (3.30))
  = ∑_{i=1}^c ∫_{R_i} ∑_{j=1}^c l_ij P(ω_j | m) p(m) dm        (using Equation (3.28))
  = ∑_{i=1}^c ∫_{R_i} ∑_{j=1}^c l_ij p(m | ω_j) P(ω_j) dm.   (3.42)
The second line holds because the decision regions Ri are a partition of M and
the classifier ω̂ is equivalent to those regions. Hence, ω̂ is constant on each of them
and so ω̂(m) = ω_i for m ∈ R_i.
Although the Minimax classifier is not limited to the binary case, the following
discussion will be restricted to this case in order to keep the formalism simple. With
c = 2 and P(ω2 ) = 1 − P(ω1 ), it follows that
R = l_22 + (l_12 − l_22) ∫_{R_1} p(m | ω_2) dm
  + P(ω_1) [(l_11 − l_22) + (l_21 − l_11) ∫_{R_2} p(m | ω_1) dm − (l_12 − l_22) ∫_{R_1} p(m | ω_2) dm].   (3.43)
One can see that the risk is a linear function of the a priori probability of ω1 . While
the costs l ij and the class-specific distributions of the feature vector are not adjustable,
but rather determined by the environment, the decision regions Ri are the definable
design parameters. They determine both the ordinate and the slope of this linear function. This can lead to a problem, as illustrated in Figure 3.9.

Fig. 3.9. Risk R as a function of the a priori probability P(ω_1): the risk curve of the optimal Bayesian classifiers, the risk R_design of a classifier designed for P_design, the resulting risk at the true a priori probability P_true, and the constant Minimax risk R_Minimax.
For every a priori probability, there exists a particular choice of the decision re-
gions such that the risk is minimized. This relation is depicted by the blue curve in
Figure 3.9, which represents the optimal Bayesian classifiers. This curve is always zero
at the end points, because if any a priori probability equals 1, an error-free classifi-
cation is possible. It attains its maximum somewhere in the middle, say at PMinimax ,
where the uncertainty is high. Of course, the uncertainty is maximized if all classes
are equally distributed, but the maximal risk can lie somewhat off the mark if different
costs are associated with each class.
Now, assume that the a priori probability was Pdesign and the classifier was origi-
nally constructed with this assumption in mind (hence, the classifier was thought to
have a risk Rdesign ). But if for some reason this assumption was wrong, and the clas-
sifier is deployed in a scenario in which the true a priori probability is Ptrue ≠ Pdesign ,
then Equation (3.43) holds and the true risk Rtrue lies on the tangent (red line) that
goes through the initially assumed point. This might even lead to a risk that is much
higher than the minimal risk RMinimax in the worst case.
The idea is to construct a classifier such that Equation (3.43) becomes independent
of the a priori distribution. This means the slope in Equation (3.43) must be set to zero
by an appropriate choice of the decision regions Ri . Implicitly, this choice belongs to
the optimal Bayesian classifier for the worst-case a priori probability PMinimax . Then the
classifier has the risk RMinimax , which remains constant even if the a priori probability
diverges, because the tangent is constant (yellow line in Figure 3.9).
In summary, the objective is not to construct a classifier that has the minimal risk
for a specific a priori probability, but a classifier that has the minimal risk in the worst
case. The name “Minimax” comes from the fact that one tries to minimize the maximal
risk.
Anyway, in classical pattern recognition the optimal Bayesian classifier usually
performs better than the Minimax classifier. The situation in Figure 3.9 is somewhat
far-fetched because Pdesign and Ptrue are extremely different. Though not optimal, a
Bayesian classifier that is tuned for Pdesign will still be better for Ptrue than the Minimax
classifier if Ptrue is not too far away. (In Figure 3.9 these are all points where the red
tangent is still below the yellow line.) The origin of the Minimax approach is located
in game theory. Here, two players have to consecutively perform moves and one has
to decide what the best move is. The a priori probability describes what the adversary
is likely to do for its next move, but of course the adversary can adapt its own strategy
to the outcome of the decision one still has to make. Governed by the assumption that
the adversary will always do what is worst for the player, the idea is to minimize the
maximal risk. In classical pattern recognition, however, the adversary is the environ-
ment, and is therefore passive. Under normal conditions, one should be able to make
a reasonable assumption about the a priori distribution that is hopefully not too far
off.
The last Section of this Chapter repeats the previously introduced concepts in the case
that the class-specific feature distributions are Gaussian. Besides the goal of presenting
some concrete examples, this section also serves to reinforce the understanding of the
previous information.
A one-dimensional Gaussian (normal) random variable m with mean μ and standard deviation σ has the density

p(m) = (1/(√(2π) σ)) exp(−(1/2) ((m − μ)/σ)²).   (3.44)
The importance of the normal distribution is explained by the central limit theorem. In its most basic and classical form, it states that the normalized sum of a sequence of independent and identically distributed random variables with existing expectation and variance converges in distribution to a normally distributed random variable. Stated more precisely: if x_1, x_2, . . . are i.i.d. with E{x_i} = μ and Var{x_i} = σ² < ∞, then the distribution of (1/(σ√n)) ∑_{i=1}^n (x_i − μ) converges to the standard normal distribution N(0, 1) as n → ∞.
Note that this theorem does not explain the ubiquity of the normal distribution. Indeed,
there are more sophisticated and generalized variants of the central limit theorem, but
convergence (in distribution) can still be guaranteed under very mild assumptions.
Without going into too much detail, the individual random variables do not even need
to be identically distributed, and the pairwise independence can be replaced by some
limit value condition that ensures that no subsequence has too much influence. Hence,
the normal distribution gained extreme popularity in modeling naturally occurring
phenomena.
Loosely formulated, one could say that if a feature is generated by summing many
independent contributions, it is reasonable to approximate the distribution of the fea-
ture by a Gaussian distribution. This holds in one dimension, but also in many dimen-
sions. The multi-dimensional normal distribution is given by

N(m; µ, Σ) = (1/((2π)^{d/2} √|Σ|)) exp(−(1/2) (m − µ)^T Σ^{-1} (m − µ)).

Note that Σ is a covariance matrix and therefore symmetric and positive definite. In the 2-dimensional case, Σ can be decomposed as

Σ = ( σ_1²  ρσ_1σ_2 ; ρσ_1σ_2  σ_2² )

for σ_1, σ_2 > 0 and ρ ∈ [−1, 1]. Here, ρ is called the correlation coefficient.
For the rest of this chapter, we will assume that the class-specific feature distributions are normal. This means for each j = 1, . . . , c,

p(m | ω_j) = N(m; µ_j, Σ_j).

Note that such an assumption is only reasonable if the features are at least on an interval scale.
Because the MAP classifier chooses the class with the highest a posteriori probability, and due to the strict monotonicity of the logarithm and Equation (3.32), the following holds for every fixed feature vector m:

ω̂ = arg max_ω P(ω | m) = arg max_ω (ln p(m | ω) + ln P(ω)).
(a) Decision boundaries for P(ω_1) = 0.7 and P(ω_2) = 0.3   (b) Decision boundaries for P(ω_1) = 0.95 and P(ω_2) = 0.05
Fig. 3.10. Decision boundary of a two-class Gaussian classifier with unequal a priori probabilities.
Left: One-dimensional feature space; Right: Two-dimensional feature space.
As the logarithmic representation is easier and numerically more stable with respect to normally distributed features, the decision function is reformulated as k_i(m) = ln p(m | ω_i) + ln P(ω_i).
Fig. 3.11. Decision regions of a generic Gaussian classifier (i.e., full covariances) with c = 4 classes and two features (d = 2). The diagram shows p(m) = ∑_{i=1}^4 P(m, ω_i), where the regions are colored according to the decision made in the region.
In the simplest case, assume that all classes share the same isotropic covariance matrix, Σ_i = σ² I. The decision boundary between two classes ω_i and ω_j is then given by

0 = k_i(m) − k_j(m) = (1/σ²) (µ_i − µ_j)^T m − (1/(2σ²)) (‖µ_i‖² − ‖µ_j‖²) + ln (P(ω_i)/P(ω_j)).   (3.54)

The decision boundaries are hyperplanes that are perpendicular to the connection lines between the expectation vectors. If the a priori probabilities of the classes are equal, i.e., ln(P(ω_i)/P(ω_j)) = 0, then the hyperplanes lie at the center points between the expectation
vectors. If the a priori probabilities are not equal, the hyperplanes move toward the
component with lower a priori probability. Examples with one and two features are
shown in Figure 3.10.
For the second step, continue assuming an equal covariance matrix Σi = Σ for
all classes, but no longer are they necessarily diagonal. Again, the crucial part of the
decision function is
k_i(m) = −(1/2) (m − µ_i)^T Σ^{-1} (m − µ_i) + ln P(ω_i)   (3.55)

and the decision boundary is given by

0 = (µ_i − µ_j)^T Σ^{-1} m − (1/2) (µ_i^T Σ^{-1} µ_i − µ_j^T Σ^{-1} µ_j) + ln (P(ω_i)/P(ω_j)).   (3.56)
Again, the decision boundaries are hyperplanes. But unlike before (see Equation (3.54)),
the hyperplanes are rotated by Σ−1 and thus are not perpendicular to the connection
lines between the centers of the Gaussians µi .
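For illustration, the following sketch evaluates decision functions of the form k_i(m) = ln p(m | ω_i) + ln P(ω_i) for Gaussian class densities with arbitrary (class-specific) covariance matrices; with a shared Σ this reduces to the linear case above. All parameter values are made up.

    import numpy as np

    def gaussian_decision_functions(m, mus, Sigmas, priors):
        """k_i(m) = ln p(m | omega_i) + ln P(omega_i) for Gaussian class densities."""
        k = []
        for mu, Sigma, P in zip(mus, Sigmas, priors):
            diff = m - mu
            # the common term -(d/2) ln(2*pi) is omitted; it does not affect the argmax
            k.append(-0.5 * diff @ np.linalg.solve(Sigma, diff)
                     - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(P))
        return np.array(k)

    mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
    priors = [0.7, 0.3]
    m = np.array([1.5, 0.5])
    k = gaussian_decision_functions(m, mus, Sigmas, priors)
    print(k, "-> decide omega_%d" % (np.argmax(k) + 1))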
(a) Parabolic decision boundary  (b) Hyperbolic decision boundaries  (c) Linear decision boundaries  (d) Elliptic decision boundary
Fig. 3.12. Decision regions of a generic Gaussian classifier (i.e., full covariances) with c = 2 classes and two features (d = 2) are conic sections. The diagram shows p(m) = ∑_{i=1}^2 P(m, ω_i), where the regions are colored according to the decision made in the region.
Fig. 3.13. Application to the reference example of Section 3.3.2. Decision regions of a classifier with
Gaussian densities p(m | ω j ) = N(m; µj ,Σj ) (j = 1,2). The parameters µj and Σj are estimated
from a training sample. The training error is etrain = 12.5 %; the testing error is etest = 7 %, but
asymptotically approaches etest ≈ 9.6 %. The training set is the same as in Figure 3.8. Test samples
are shown with hollow marks.
A Gaussian mixture is a random variable whose density equals the convex combination of Gaussian densities. More formally, m is called a Gaussian mixture with K components if its density can be written as p(m) = ∑_{k=1}^K π_k N(m; µ_k, Σ_k) with π_k ≥ 0 and ∑_{k=1}^K π_k = 1.
Note that the term “Gaussian mixture” is misleading in that m is not a mixture of
Gaussian random variables (which would itself be a Gaussian random variable, see
Theorem 3.2). Instead, its probability density function is a mixture of Gaussian probability density functions.
Gaussian mixtures are very popular because they are easy to handle and enjoy a
powerful approximation property: Every density (within reason) can be approximated
by Gaussian mixtures with arbitrary precision. More precisely, let f be a density with
a finite number of discontinuities on every compact subset of its support. Let
fn = ∑_{k=1}^{Kn} πn,k N(m; µn,k, Σn,k)   (3.63)
be a sequence of Gaussian mixtures. Then there are K n , π n,k , µn,k , Σn,k such that f n
converges uniformly to f except at the points of discontinuity (see Maz’ya and Schmidt
[1996]). Note that K n is the number of components of the nth member f n and that, in
general, each f n has different components. Furthermore, the µn,k and Σn,k are not
necessarily a superset of the components of the previous member.
Unfortunately, there is no general rule for how many components K are necessary to obtain a specific approximation error. But assume the task is to approximate each d-dimensional class-specific feature distribution with K components. Then for every class there are (1/2)K(d² + 3d + 2) − 1 parameters:

πj,1, . . . , πj,K   (K − 1 parameters),
µj,1, . . . , µj,K   (Kd parameters),
Σj,1, . . . , Σj,K   ((1/2)Kd(d + 1) parameters).
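As an illustration, the following minimal sketch (not from the book) evaluates a Gaussian mixture density of the form in Equation (3.63) and counts its free parameters; the mixture weights, means, and covariances are made-up example values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal sketch (not from the book): a Gaussian mixture density with d = 2, K = 3.
weights = np.array([0.5, 0.3, 0.2])                    # pi_k, sum to one
means   = [np.array([0.0, 0.0]),
           np.array([3.0, 1.0]),
           np.array([-2.0, 2.0])]                      # mu_k (assumed values)
covs    = [np.eye(2), np.diag([2.0, 0.5]),
           np.array([[1.0, 0.4], [0.4, 1.0]])]         # Sigma_k (assumed values)

def mixture_pdf(m):
    """Convex combination of Gaussian densities, cf. Eq. (3.63)."""
    return sum(w * multivariate_normal.pdf(m, mean=mu, cov=S)
               for w, mu, S in zip(weights, means, covs))

K, d = len(weights), 2
n_params = (K - 1) + K * d + K * d * (d + 1) // 2      # = (1/2) K (d^2 + 3d + 2) - 1
print(mixture_pdf(np.array([1.0, 0.5])), n_params)     # density value and 17 parameters
```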
3.4 Exercises
(3.1) Show: If two random events a, b are stochastically independent, i.e., P(a | b) = P(a), then b is also independent of the opposite event ā, i.e., P(ā | b) = P(ā).
(3.2) Let m1 and m2 be two feature vectors that are to be classified using a maximum a posteriori (MAP) classifier.
1. When is the result of classification according to maximum a posteriori proba-
bility, ω̂ = arg maxω P(ω | m), the same as that of classification according to
maximum likelihood, ω̂ = arg maxω p(m | ω)?
2. Under what conditions will the result of the MAP classifier only depend on
the a priori probabilities P(ω)?
(3.3) In a classification problem with three classes ω1 , ω2 , ω3 , with P(ω1 ) = 0.1 and
P(ω2 ) = 0.6, the following is known about the feature vectors m ∈ ℝd :
(3.4) Suppose there are two classes ω1 and ω2 and a feature m ∈ ℝ with the following class-dependent feature distributions:

p(m | ω1) =  m       if 0 < m ≤ 1,
             2 − m   if 1 < m ≤ 2,
             0       else,

and

p(m | ω2) =  m − 1   if 1 < m ≤ 2,
             3 − m   if 2 < m ≤ 3,
             0       else.
1. Sketch the class-dependent feature distributions p(m | ω1 ) and p(m | ω2 ) in a
single diagram.
2. Calculate the decision boundary of the Bayesian optimal classifier under the
assumption that P(ω1 ) = P(ω2 ) = 0.5. Mark the boundary in your diagram.
3. Calculate the decision boundary for P(ω1 ) = 0.25 and mark it in your diagram.
4. Calculate the error probabilities in both cases.
(3.5) Let ω1 to ω4 be four classes with P(ω1 ) = P(ω2 ), P(ω3 ) = 0.3 and P(ω4 ) = 0.5.
For a feature m1 , the following are to hold:
(3.6) Let ω be a class and let m = (m1 ,m2 )T be a feature vector of stochastically inde-
pendent features m1 and m2 . Give the a posteriori class probability P(ω | m) using
Bayes’ law. Simplify as much as possible.
(3.7) A bulk material sorter is used to separate healthy wheat grains (ω1 ) from grains
infected with ergot (a fungus that produces a very potent toxin, ω2 ) and assorted
foreign bodies like dirt and the grains of other plants (ω3 ). If an infected grain
remains undetected, the infection will spread and 100,000 grains with a value of
1 EUR will have to be discarded, on average. If a foreign body remains undetected,
the damage will only be to the brand image, which is calculated at 0.01 EUR.
The sorting system uses a Bayesian classifier to classify each individual grain,
where only the length of the object is used as a feature. The sensor used can only
detect whether a grain is longer or shorter than 7 mm.
It is known that the material stream consists of 97 % healthy grains, 2 % infected
grains, and 1 % foreign materials. The manufacturer of the length sensor gives the
following performance characteristics:
P(length < 7 mm | ω1) = 90/100
P(length < 7 mm | ω2) = 3/100
P(length < 7 mm | ω3) = 5/100
1. Construct the cost matrix L for classification according to minimal a posteriori
risk.
2. Which class will be chosen by a maximum a posteriori classifier if the sensor
signals length < 7 mm?
3. Which class will be chosen by a risk minimizing classifier if the sensor signals
length < 7 mm?
4 Parameter estimation
The previous chapter assumed that the quantities

P(ω|m) = p(m|ω)P(ω)/p(m) ∝ p(m|ω)P(ω)   (4.1)

(see Equation (3.32)) are known, or at least the right-hand side (without p(m)). As al-
ready stated, the methods for determining these quantities can basically be divided
into two groups, parametric and non-parametric methods. This chapter deals with the
parametric approach.
In principle, one already has a mathematical model p(m|ωi, θ) of the distribution that captures its major traits and restricts the degrees of freedom to a finite-dimensional parameter vector θ ∈ ℝq. For example, p(m|ωi, θ) might be the family of normal densities with unknown expectation and variance, i.e., θ = (μ, σ²)ᵀ. Furthermore,
a dataset D = {m1 , . . . , mN } is given which is assumed to have been generated by
p(m|ω i , θ) for a fixed, but unknown θ. The goal is to find the “true” value of the
parameter vector θ given the samples D.
This definition of an estimator is rather broad and especially does not make any state-
ment about quality. A constant function is an estimator, too, but intuition tells us that
this should not be a good one. Performance indicators, which help to decide if an es-
timator is reasonable with respect to the application, are a subsequent topic of this
chapter.
First, however, there will be a short excursus on two doctrines about the meaning
(semantics) of probability. Both philosophies share the same syntactical foundation:
The axiom system of Kolmogorov governs how to calculate with probabilities.
Moreover, the probability Pr(B|C) = Pr(B ∩ C)/Pr(C) is called the conditional probability of B given C.
dataset D = {m1, . . . , mN}, a reasonable approach is to choose θ̂(D) such that the observation D becomes most likely,

θ̂(D) = arg max_{θ∈Θ} ∏_{m∈D} p(m | θ).   (4.3)
This approach is called the likelihood method and this estimator is called the maximum
likelihood estimator (see Definition 4.7).
In the Bayesian framework, the parameter vector is assumed to be a random quan-
tity θ, too, so that there is a joint distribution of (m, θ). For practical purposes the
joint distribution is rather uninteresting, but the distribution assumption is expressed
as a conditional distribution p(m|θ). Applying Bayes’ law, this can be rewritten to a
conditional distribution of θ given an observation D. Then θ̂(D) is assigned the value
for which the a posteriori probability attains a maximum. Unsurprisingly, this is called
the maximum a posteriori estimator.
Let us repeat this so that the difference between both philosophies becomes abun-
dantly clear. In classical statistics, the distribution of the features is given by p(m | θ).
The parameter vector θ is an unknown, but constant quantity. On the Bayesian view,
the distribution of the features is a conditional distribution given by p(m|θ). The pa-
rameter vector θ is a random variable itself. Moreover, in classical statistics the pa-
rameter is chosen such that the observation becomes most likely. There is no point in
speaking about something like the probability of the parameter. In the Bayesian world,
on the contrary, the parameter vector θ is chosen to have the maximum probability
conditioned by the observation.
In maximum a posteriori pattern classification, there is one feature distribution
per class ω i . Consequently, one needs to estimate one parameter vector θi per class.
More precisely, one seeks the distributions p(m|ω i , θi ) with i = 1, . . . , c. The general
assumption is that one uses supervised learning. This means that the number of classes
and the class assignment of samples is given beforehand. The whole dataset can be
decomposed into a partition D = D1 ⊎ ⋅ ⋅ ⋅ ⊎ Dc and furthermore samples from Di bear
information about the unknown parameter vector θi , but do not have any influence on
the parameter vectors θj (j ≠ i) of the other classes. In conclusion, the task of parameter
estimation can be independently repeated for each class and one can assume without
loss of generality that only one class exists. Here and below, the explicit notation of a
class will be suppressed.
Definition 4.1 stated that any statistic that maps into the parameter space is an
estimator. One of the minimal requirements for a reasonable estimator is its unbiased-
ness. The principal idea is to put random variables (instead of observations) into the
estimator, so that the estimator becomes a random quantity on its own. The estimator
is called unbiased if its expectation equals the parameter being estimated.
E{θ̂} = E{θ̂(m1, . . . , mN)} = θ   for all θ ∈ Θ.
Definition 4.4 (Cramér–Rao bound). Let the hypotheses be the same as in Defini-
tion 4.3, first item, with the simplification Θ = ℝ and the following additions:
1. θ̂ is unbiased,
2. θ̂ : MN → Θ does not depend on the unknown value θ,
3. E{θ̂²} = E{θ̂(m1, . . . , mN)²} < ∞,
4. The density p(m | θ) is differentiable with respect to θ, and
5. ∂/∂θ ∫_M p(m | θ) dm = ∫_M ∂/∂θ p(m | θ) dm.

Then the variance of the estimator is bounded from below by

Var{θ̂} ≥ 1 / (N E{(∂/∂θ ln p(m | θ))²}) = 1 / (N ∫_M (∂/∂θ ln p(m | θ))² p(m | θ) dm).   (4.6)
An estimator whose variance equals this bound for all θ is called a CR-efficient estima-
tor. Before a sketch of the proof is given, let us discuss the prerequisites. The first two
are rather natural. The need for unbiasedness was already explained. An estimator
that would depend on the value being estimated is not forbidden by definition, but
rather pointless for practical purposes. Hence, this is no real limitation. The last three requirements are rather technical and are normally summarized by the term regularity conditions. Distributions that comply with them are called regular distributions. For all practical (engineering) purposes, one can take them for granted.
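The following minimal simulation sketch (not from the book) illustrates the bound: for m ∼ N(μ, σ²) the Fisher information of μ is 1/σ², so Equation (4.6) gives Var{μ̂} ≥ σ²/N, and the empirical mean attains this bound. All numerical values are made-up example parameters.

```python
import numpy as np

# Minimal simulation sketch (not from the book): Cramer-Rao bound for the mean
# of a Gaussian. The empirical mean is CR-efficient, so its variance equals sigma^2/N.
rng = np.random.default_rng(0)
mu, sigma, N, runs = 3.0, 2.0, 50, 20_000   # made-up example values

estimates = rng.normal(mu, sigma, size=(runs, N)).mean(axis=1)
print("empirical Var{mu_hat}:", estimates.var())
print("Cramer-Rao bound     :", sigma**2 / N)   # both are approximately 0.08
```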
The proof is mainly an application of the Cauchy–Schwarz inequality

∫ x(m)² dm ∫ y(m)² dm ≥ (∫ x(m) y(m) dm)²   (4.7)
for square-integrable functions. The sketch of the proof is only presented for N = 1 to avoid cumbersome integrals. Because θ̂ is unbiased, E{θ̂ − θ} = 0 for all θ, and likewise for the partial derivative, ∂/∂θ E{θ̂ − θ} = 0. Altogether, this leads to

0 = ∂/∂θ E{θ̂ − θ} = ∂/∂θ ∫ (θ̂(m) − θ) p(m | θ) dm
  = ∫ ∂/∂θ (θ̂(m) − θ) · p(m | θ) dm + ∫ (θ̂(m) − θ) ∂/∂θ p(m | θ) dm
  = −∫ p(m | θ) dm + ∫ (θ̂(m) − θ) p(m | θ) ∂/∂θ ln p(m | θ) dm,   (4.8)

where the last step uses Equation (4.9) and the first integral in the last line equals 1.
Hence, ∫ (θ̂(m) − θ) p(m | θ) ∂/∂θ ln p(m | θ) dm = 1, and squaring yields

1 = (∫ (θ̂(m) − θ) p(m | θ) ∂/∂θ ln p(m | θ) dm)²
  = (∫ (θ̂(m) − θ) √p(m | θ) · √p(m | θ) ∂/∂θ ln p(m | θ) dm)²
  ≤ ∫ (θ̂(m) − θ)² p(m | θ) dm · ∫ (∂/∂θ ln p(m | θ))² p(m | θ) dm   (by (4.7))
  = Var{θ̂} · E{(∂/∂θ ln p(m | θ))²},   (4.10)

where the last expectation is the Fisher information J(θ).
Of course, Equation (4.14) requires an explanation of what “⪰” means in the context of matrices. We say Cov{θ̂} ⪰ (1/N) J⁻¹(θ) iff (Cov{θ̂} − (1/N) J⁻¹(θ)) is a positive semi-definite matrix. This is equivalent to

αᵀ Cov{θ̂} α ≥ (1/N) αᵀ J⁻¹(θ) α   for all α ∈ ℝᵏ   (4.16)

and, in particular, implies

tr Cov{θ̂} ≥ (1/N) tr J⁻¹(θ).   (4.17)
After unbiased estimators and CR-efficient estimators, the third type of estimator
we consider is the consistent estimator.
Definition 4.5 (Consistent estimator). Again, the setting is the same as in Definition 4.3, first item. An estimator is called a consistent estimator iff θ̂ converges almost surely to θ as N → ∞, with θ̂ = θ̂(m1, . . . , mN).
This means an estimator is consistent if its value converges almost surely to the true
value. Actually, this should be a minimal requirement of a reasonable estimator. One
should realize that neither is an unbiased estimator necessarily consistent, nor is a
consistent estimator necessarily unbiased. These properties are independent of each
other.
Before looking at some examples of estimators that illustrate the above concepts, the terms will be discussed and weighed against each other. From a purely theoretical perspective, the unbiasedness of an estimator is a minimal requirement. The gap between the expectation of the estimator and the true value,

b(θ̂) = E{θ̂} − θ,   (4.19)

is called the bias of the estimator. If an estimator is truly biased (not only biased, but even asymptotically biased), then the bias will remain no matter how many samples are used. For this reason a bias is also called a systematic error. In contrast, the variance Var{θ̂} of an estimator typically diminishes when more samples are considered. Hence, the variance is called the stochastic error of the estimator. As the theory can use as many samples as needed, the stochastic error is not crucial. From a practical perspective this is not entirely true, because the number of samples is limited and cannot be arbitrarily increased. The mean squared error (MSE) of an estimator,

E{(θ̂ − θ)²} = b(θ̂)² + Var{θ̂},   (4.20)

equals the sum of the squared bias and the variance. Given two estimators θ̂1 and θ̂2 with Var{θ̂2} ≪ Var{θ̂1}, a biased estimator θ̂2 can have a much smaller mean squared error than an unbiased estimator θ̂1. This is depicted in Figure 4.1.
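A minimal simulation sketch (not from the book) of this trade-off: the unbiased empirical mean is compared with a simple shrinkage estimator that is biased toward zero but has a smaller variance; all numerical values are made-up example parameters.

```python
import numpy as np

# Minimal simulation sketch (not from the book): MSE decomposition of Eq. (4.20).
rng = np.random.default_rng(1)
mu, sigma, N, runs = 1.0, 3.0, 10, 50_000

samples = rng.normal(mu, sigma, size=(runs, N))
theta_1 = samples.mean(axis=1)          # unbiased, variance sigma^2 / N = 0.9
theta_2 = 0.5 * theta_1                 # biased toward 0, only a quarter of the variance

mse = lambda est: np.mean((est - mu) ** 2)
print("MSE of unbiased estimator :", mse(theta_1))   # ~ 0.9
print("MSE of shrinkage estimator:", mse(theta_2))   # ~ 0.25*0.9 + 0.5^2 = ~0.475
```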
As a starting example, consider an estimator of the expectation. For this purpose, the mi (i = 1, . . . , N) are independently and identically distributed with p(m | μ) and parameter θ = μ. Moreover, E{mi} = μ and σ² = Var{mi} = E{(mi − μ)²}. Note that although the distribution is assumed to be parametrized by its expectation μ, the distribution is not assumed to be Gaussian. The expectation and variance are called μ and σ² for convenience only.
The empirical mean suggests itself as an estimator for the expectation:

μ̂ = (1/N) ∑_{i=1}^N mi.   (4.21)
Fig. 4.1. Comparison of an unbiased estimator with large variance (θ̂1, blue) with a biased estimator with small variance (θ̂2, red).
Var{μ̂} = (1/N²) ∑_{k=1}^N ∑_{l=1}^N E{(mk − μ)(ml − μ)}
        = (1/N²) ∑_{k=1}^N E{(mk − μ)²} + (1/N²) ∑_{k=1}^N ∑_{l≠k} E{(mk − μ)(ml − μ)}
        = (1/N²) ∑_{k=1}^N σ² + (1/N²) ∑_{k=1}^N ∑_{l≠k} E{mk − μ} E{ml − μ}
        = σ²/N,   (4.23)

where the third equality uses the independence of the samples and the double sum vanishes because E{mk − μ} = 0.
The variance of the estimator vanishes linearly with respect to the sample size. In other words, the standard deviation √Var{μ̂} decreases with 1/√N. This asymptotic behavior is usual for most applications. Applying Chebyshev's inequality,

Pr(|μ̂ − E{μ̂}| ≥ ε) ≤ Var{μ̂}/ε²   for all ε > 0,   (4.24)

yields

Pr(|μ̂ − μ| ≥ ε) ≤ σ²/(N ε²),   (4.25)
which shows that the estimator is also consistent.
The empirical mean as an estimator for the expectation can more or less be found
through an educated guess. We now turn to a more systematic approach to finding
estimators. The maximum likelihood estimator has already been mentioned in the
introduction of this chapter (see Equation (4.3)). For many distribution assumptions, the maximum likelihood estimator of the expectation equals the empirical mean.
Due to the strict monotonicity of the logarithm, the likelihood function and the log-
likelihood function share the same extremal points. In practice, the log-likelihood
function is often easier to use, since it involves sums instead of products. The maximum
likelihood estimator determines the parameter that maximizes the likelihood given
the observation. In other words, maximum likelihood estimation chooses the value
θ = θ̂ which makes the given observation D maximally probable under the model.
Definition 4.7 (Maximum likelihood estimator). The hypothesis will be the same as
in Definition 4.6. Then
θ̂ML(D) = arg max_{θ∈Θ} ∏_{m∈D} p(m | θ) = arg max_{θ∈Θ} ∑_{m∈D} ln p(m | θ)   (4.28)
Under the usual implicit assumption that all functions are sufficiently smooth,
0 =! ∇θ l(θ) = ∑_{m∈D} ∇θ ln p(m | θ)   with   ∇θ = (∂/∂θ1, . . . , ∂/∂θq)ᵀ   (4.29)
is a necessary condition.
The first example will be to find the ML estimator for the expectation value of a
d-dimensional normal distribution. Let mk ∼ N(µ, Σ) with µ unknown but known Σ.
It follows that
ln p(mk) = −(1/2)(mk − µ)ᵀ Σ⁻¹ (mk − µ) − (d/2) ln 2π − (1/2) ln det Σ
⇒ ∇µ ln p(mk) = Σ⁻¹ (mk − µ).   (4.30)
0 = ∑_{k=1}^N Σ⁻¹ (mk − µ̂ML)  ⇔  µ̂ML = (1/N) ∑_{k=1}^N mk.   (4.31)
For a univariate normal distribution with both parameters unknown, θ = (θ1, θ2)ᵀ = (μ, σ²)ᵀ, one obtains

ln p(mk) = −(1/(2θ2))(mk − θ1)² − (1/2) ln θ2 − (1/2) ln 2π
⇒ ∇θ ln p(mk) = ( (1/θ2)(mk − θ1),  (1/(2θ2²))(mk − θ1)² − 1/(2θ2) )ᵀ =! 0.   (4.32)

Summing over all samples and solving the resulting system of equations yields

θ1 = μ̂ML = (1/N) ∑_{k=1}^N mk,   (4.33)
θ2 = σ̂²ML = (1/N) ∑_{k=1}^N (mk − μ̂ML)².   (4.34)

Analogously, for a d-dimensional normal distribution with both µ and Σ unknown, the ML estimators are

µ̂ML = (1/N) ∑_{k=1}^N mk,   (4.35)
Σ̂ML = (1/N) ∑_{k=1}^N (mk − µ̂ML)(mk − µ̂ML)ᵀ.   (4.36)
Note that the ML estimator for the variance is biased. It would be unbiased if the true expectation μ were known. But as the ML estimate μ̂ is put into the estimator for the variance, this estimator systematically underestimates the variance due to the additional uncertainty coming from μ̂. It can be shown that the unbiased estimator is (N/(N − 1)) σ̂²ML. In any case, both estimators are consistent.
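A minimal sketch (not from the book) of Equations (4.33)/(4.34) and the bias correction; the sample is drawn from made-up parameters.

```python
import numpy as np

# Minimal sketch (not from the book): ML estimates of mean and variance of a
# univariate normal sample and the bias-corrected variance N/(N-1) * sigma_ML^2.
rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=25)   # made-up example sample
N = data.size

mu_ml     = data.mean()                          # Eq. (4.33)
sigma2_ml = np.mean((data - mu_ml) ** 2)         # Eq. (4.34), biased (divides by N)
sigma2_ub = N / (N - 1) * sigma2_ml              # unbiased correction

print(mu_ml, sigma2_ml, sigma2_ub)
# numpy's ddof argument switches between the two conventions:
assert np.isclose(sigma2_ml, np.var(data, ddof=0))
assert np.isclose(sigma2_ub, np.var(data, ddof=1))
```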
4.2 Bayesian estimation of the class-specific distributions

In this section, the estimation problem is reconsidered under the Bayesian framework.
Unlike the former approach, the parameter vector θ is also regarded as a random quan-
tity. Moreover, the classical approach introduced the parameter right from the start
and aimed to estimate the parameter directly from the given dataset. In the Bayesian
concept, the parameter vector fades a little bit into the background, because here the
starting point is the original aim of estimating the class of an unknown object given
the training samples. The parameter is introduced as an intermediate link between
the training samples and the unknown object.
The fundamental quantity of the Bayesian classification is the a posteriori distribution of the classes,

P(ωi|m) = p(m|ωi) P(ωi) / p(m),   i = 1, . . . , c.   (4.37)

As usual, the data set is D = D1 ⊎ ⋅⋅⋅ ⊎ Dc with m ∈ Di ⇔ ω(m) = ωi. Taking into account that all quantities in Equation (4.37) are based on the data D, the formula can be extended to

P(ωi|m, D) = p(m|ωi, D) P(ωi|D) / p(m|D).   (4.38)
The conceptual difference of the Bayesian view is to regard every probability as a conditional probability. Any unconditional distribution is just a convenient shorthand for cases where the condition is negligible. This means one actually wants to know the probability that a realized feature m of a random feature m belongs to class ωi, given that the concrete dataset D has been observed before. In this sense, P(ω|m) is only an abbreviation for P(ω|m, D); it is justified because the conditional probabilities are nearly the same for any sufficiently large dataset.
Equation (4.38) can immediately be simplified again, because supervised sam-
pling is assumed. This means that the membership of a sample m in one of the parti-
tions Di is controlled, because its class is known. This has two consequences:
First, though the a priori distribution of the classes P(ω|D) depends on D, one
must not use a realization of the random variable D, because the realization is gen-
erally not truly sampled but artificially composed. This means that the proportions
of the partition D1 ⊎ ⋅ ⋅ ⋅ ⊎ Dc do not reflect the distribution of the classes. Hence, the
assumption is that an a priori distribution P(ω) is known.
Second, one assumes that the class-specific feature distribution does not depend
on samples of a different class. This means that
P(m|ω i , D) = P(m|ω i , Di ). (4.39)
Applying these considerations to Equation (4.38) and replacing the denominator by a
summation over all classes yields
P(ωi|m, D) = p(m|ωi, Di) P(ωi) / ∑_{j=1}^c p(m|ωj, Dj) P(ωj).   (4.40)
Note the additional indices on the right-hand side of the above equation. Hence, the only quantity to be determined is the class-specific feature distribution p(m|ωi, Di) given the matching partition of a specific dataset. This quantity can be calculated independently for each of the c classes, and it is only required for matching indices of ωi and Di. For this reason, the explicit notation of the class is omitted below, but it is implicitly stipulated that m is conditioned on the same class as the samples in Di.
Until now, no parameter vector θ has been introduced; everything was based on the data D. We now assume that the feature distribution has a known parametric form with an unknown parameter θ that is a random quantity. Then one can write

p(m|D) = ∫_Θ p(m|θ, D) p(θ|D) dθ = ∫_Θ p(m|θ) p(θ|D) dθ.   (4.42)

The latter equality assumes that the distribution of the feature is conditionally independent of the data D given the parameter vector θ.
The open question is whether the last line of Equation (4.42) must be calculated
every time and for each class when a new feature m is to be classified. (Recall that the
indices were suppressed and that Equation (4.42) is only a sub-term in Equation (4.40).)
Under certain conditions the answer is that this is not ultimately necessary and the
calculation can be decoupled into two steps.
Assume the data D imply strong evidence for one singular parameter, i.e., the density p(θ|D) has a sharp and singular maximum at

θ̂(D) = arg max_{θ∈Θ} p(θ|D).   (4.43)

Then

p(m|D) = ∫_Θ p(m|θ) p(θ|D) dθ ≈ p(m|θ̂(D))
and the integral calculation can be avoided. In summary, the conditional feature dis-
tribution with respect to the dataset can be approximately replaced by a conditional
feature distribution with respect to the parameter vector with the highest a posteriori
distribution given the data.
The first example considers a univariate normal distribution with random expec-
tation μ but known variance σ2 , i.e., m k ∼ N(μ, σ2 ). The expectation is also normally
distributed with μ ∼ N(μ0 , σ20 ). We start with the calculation of the a posteriori distri-
bution of μ given the data,

p(μ|D) = p(D|μ) p(μ) / ∫ p(D|μ) p(μ) dμ
       ∝ p(μ) ∏_{k=1}^N p(mk|μ)
       ∝ exp{−(1/2)((μ − μ0)/σ0)²} · ∏_{k=1}^N exp{−(1/2)((mk − μ)/σ)²}
       ∝ exp{−(1/2)[((μ − μ0)/σ0)² + ∑_{k=1}^N ((mk − μ)/σ)²]}
       ∝ exp{−(1/2)[(N/σ² + 1/σ0²) μ² − 2((1/σ²) ∑_{k=1}^N mk + μ0/σ0²) μ]}
⇒ p(μ|D) = α exp{−(1/2)((μ − μN)/σN)²},   (4.45)
where the quantities in the last line are

μN = (Nσ0² / (Nσ0² + σ²)) μ̂N + (σ² / (Nσ0² + σ²)) μ0,   with μ̂N = (1/N) ∑_{k=1}^N mk,   (4.46)
σN² = σ0² σ² / (Nσ0² + σ²),   and   (4.47)
α = 1 / (√(2π) σN).   (4.48)
The quantities μ N , σ2N and μ̂ N can be found by comparing the coefficients of the
last and the second last line. The factor α can be easily determined, because the last
line shows that p(μ | D) is a Gaussian density.
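A minimal sketch (not from the book) of this intermediate result: it computes μN and σN² of Equations (4.46)/(4.47) and the predictive variance σ² + σN² of Equation (4.53), using the parameter values of Figure 4.2 (true μ = 3, σ² = 2, prior μ0 = −1, σ0² = 0.5).

```python
import numpy as np

# Minimal sketch (not from the book): Bayesian estimation of a Gaussian mean with
# known variance sigma^2 and prior mu ~ N(mu_0, sigma_0^2).
rng = np.random.default_rng(3)
sigma2, mu0, sigma0_2 = 2.0, -1.0, 0.5           # known variance and prior parameters
data = rng.normal(3.0, np.sqrt(sigma2), size=20)

N = data.size
mu_hat_N = data.mean()
mu_N     = (N * sigma0_2 * mu_hat_N + sigma2 * mu0) / (N * sigma0_2 + sigma2)  # Eq. (4.46)
sigma_N2 = (sigma0_2 * sigma2) / (N * sigma0_2 + sigma2)                        # Eq. (4.47)

print("posterior  :", mu_N, sigma_N2)            # shifts from mu_0 toward the sample mean
print("predictive :", mu_N, sigma2 + sigma_N2)   # p(m|D) = N(mu_N, sigma^2 + sigma_N^2)
```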
Before we go on to finally calculate the feature distribution p(m|D), let us discuss
this intermediate result. The estimate of μ given the data D is a Gaussian density on
its own. We consider the two extreme cases with respect to the sample number N. If
there is no sample, N = 0, then
μN = (Nσ0² / (Nσ0² + σ²)) μ̂N + (σ² / (Nσ0² + σ²)) μ0 = μ0   and   (4.49)
σN² = σ0² σ² / (Nσ0² + σ²) = σ0²,   (4.50)

because for N = 0 the first coefficient vanishes and the second equals one.
Fig. 4.2. Sequence of Bayesian a posteriori densities estimating the mean μ of a Gaussian distribution for N = 0, 10, 20, 30, 50 samples; the true Gaussian has μ = 3, σ² = 2, and the prior distribution of μ was assumed to have μ0 = −1 and σ0² = 0.5.
This is a reasonable result, as the distribution of the best estimate equals the prior if
no data is given. In contrast, as N → ∞,
lim_{N→∞} μN = lim_{N→∞} (Nσ0² / (Nσ0² + σ²)) μ̂N + lim_{N→∞} (σ² / (Nσ0² + σ²)) μ0 = lim_{N→∞} μ̂N = lim_{N→∞} (1/N) ∑_{k=1}^N mk,   (4.51)
lim_{N→∞} σN² = lim_{N→∞} σ0² σ² / (Nσ0² + σ²) = 0,   (4.52)

because the first coefficient tends to one and the second to zero.
In conclusion, for infinitely many samples, the uncertainty of the estimation vanishes
and the a posteriori distribution converges to a Dirac distribution at the empirical mean
of the samples. This means that any resemblance to the a priori assumption vanishes
and the result depends solely on the data and actually equals the ML estimator. An
example of such a sequence of a posteriori distributions is depicted in Figure 4.2.
To conclude the example, we must still calculate the conditional feature distribu-
tion given the dataset p(m|D). As all densities are Gaussian, the calculation of Equa-
tion (4.42) needs little effort. Again, α denotes a universal normalizing constant in
p(m|D) = ∫ p(m|μ) p(μ|D) dμ
       = α ∫ exp{−(1/2)((m − μ)/σ)²} exp{−(1/2)((μ − μN)/σN)²} dμ
       = α exp{−(1/2) (m − μN)² / (σ² + σN²)}.   (4.53)
The analogous multivariate case, mk ∼ N(µ, Σ) with known Σ and prior µ ∼ N(µ0, Σ0), yields

µN = NΣ0 (NΣ0 + Σ)⁻¹ µ̂N + Σ (NΣ0 + Σ)⁻¹ µ0,   with µ̂N = (1/N) ∑_{k=1}^N mk,   (4.55)
ΣN = Σ0 Σ (NΣ0 + Σ)⁻¹,   and   (4.56)
α = 1 / √((2π)^d det ΣN).   (4.57)
Analogously to Equation (4.53), the feature distribution equals
p(m|D) = ∫ p(m|µ) p(µ|D) dµ
       ∝ ∫ exp{−(1/2)(m − µ)ᵀ Σ⁻¹ (m − µ)} · exp{−(1/2)(µ − µN)ᵀ ΣN⁻¹ (µ − µN)} dµ
       ∝ exp{−(1/2)(m − µN)ᵀ (Σ + ΣN)⁻¹ (m − µN)}.   (4.58)
In summary, the multivariate result is m ∼ N(µN , Σ + ΣN ).
At the end of this section, the following list recapitulates the principal steps of the Bayesian approach to estimating the feature distribution. The implicitly suppressed notation of the classes is re-introduced. These steps must be performed for each class ωi, i = 1, . . . , c:
1. p(m|θi, ωi) is assumed to be structurally known; the parameter vector θi is a random quantity, too.
2. p(θi | ωi) includes the a priori knowledge about the θi.
3. The dataset Di bears the additional knowledge about θi; Di is assumed to be a set of independently and identically distributed feature vectors m1, . . . , mNi ∼ p(m | θi, ωi),

p(Di|θi, ωi) = ∏_{k=1}^{Ni} p(mk|θi, ωi),   (4.59)
followed by
p(m|D, ω i ) = ∫ p(m|θi , ω i )p(θi |Di , ω i ) dθi . (4.61)
4.3 Bayesian parameter estimation

The goal of the ML estimator is to find the best value θ̂ that can be plugged into the parametric density p(m | θ). In contrast, the Bayesian technique does not yield a single value θ̂, but a whole a posteriori distribution p(θ|D). Hence, the classical
approach and the Bayesian approach are not directly comparable. The class-specific
feature distribution is calculated by integrating over the a posteriori distribution of the parameter, p(m|D) = ∫_Θ p(m|θ) p(θ|D) dθ (see Equation (4.42)). As already stated on page 133, this computational effort can be
avoided if the a posteriori distribution can be approximated by a Dirac distribution. In
this case, the additional integration degenerates into a simple replacement of θ by the
value of θ̂ with the highest a posteriori probability. But setting θ̂ = arg maxθ∈Θ p(θ|D)
(see Equation (4.43)) is only one option for condensing a full density into a single value
of the parameter. This section will present the two most important ways of Bayesian
parameter estimation.
The basic approach is to find the estimate θ̂(D) such that the expectation

E{l(θ̂(D), θ)}   (4.63)

of a loss function l(θ̂, θ) is minimized. For the quadratic loss l(θ̂, θ) = ‖θ̂ − θ‖², this expectation becomes

E{‖θ̂(D) − θ‖²} = ∫_{M^N} ∫_Ω (θ̂(D) − θ)ᵀ (θ̂(D) − θ) p(θ|D) dθ p(D) dD,   (4.65)

so it suffices to minimize the inner integral

I(D) = ∫_Ω (θ̂(D) − θ)ᵀ (θ̂(D) − θ) p(θ|D) dθ   (4.66)

point-wise for every dataset D.
where U denotes the unit matrix, i.e., the matrix all of whose entries are unity. In summary, the estimator with the least quadratic error is

θ̂(D) = E{θ | D},   (4.70)

i.e., the mean of the a posteriori density p(θ|D). The second option is the 0–1 loss
l(θ̂, θ) = 0 if ‖θ̂ − θ‖ < ∆, and 1 else,   (4.71)
for an arbitrary but fixed ∆ > 0. An interesting special case is surely ∆ = 0. But as
{θ̂ = θ} is a null set, the direct approach does not lead to any result.
With this loss function, the a posteriori expected loss

E{l(θ̂(D), θ) | D} = ∫_{{θ : ‖θ̂(D)−θ‖ > ∆}} p(θ|D) dθ = 1 − ∫_{{θ : ‖θ̂(D)−θ‖ < ∆}} p(θ|D) dθ

is minimized point-wise. The last line is minimal if the integration is over a region where p(θ|D) is large. If ∆ becomes small enough and if the density is sufficiently smooth, this is achieved for arg max_θ p(θ | D). Hence, it follows that
θ̂(D) = arg max_θ p(θ | D)   (4.73)
for ∆ → 0. This is called the maximum a posteriori estimator.
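A minimal sketch (not from the book) contrasting the two point estimates: for a skewed a posteriori density the posterior mean (quadratic loss) and the posterior mode (0–1 loss, ∆ → 0) differ. A Gamma density is used here purely as a made-up stand-in for some posterior p(θ|D).

```python
import numpy as np
from scipy.stats import gamma

# Minimal sketch (not from the book): posterior mean vs. posterior mode (MAP).
a, scale = 3.0, 2.0                       # assumed shape/scale of the stand-in posterior
posterior = gamma(a, scale=scale)

theta_mmse = posterior.mean()             # minimizes the expected quadratic loss: 6.0
theta_map  = (a - 1.0) * scale            # mode of the Gamma density: 4.0

# the mode can also be found numerically from the density itself
grid = np.linspace(0.01, 30.0, 10_000)
theta_map_num = grid[np.argmax(posterior.pdf(grid))]
print(theta_mmse, theta_map, theta_map_num)
```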
4.4 Additional remarks on Bayesian classification

Now, we briefly turn our attention back to Bayesian classification. With the results of Chapter 4, it is possible to discuss in greater depth the errors that arise in Bayesian classification.
Although Bayesian classification is the optimal classification, it is not free of errors.
Basically, three different sources of errors can be distinguished:
4.5 Exercises
(4.1) The weight of a letter m in grams varies between m = 10 and m = 20. There are
two possibilities for estimating the weight of a given letter:
– Estimate m̂1 = 15, independently of the true weight of the letter.
– Estimate m̂2 = x, where x is the display of an inaccurate scale with E{x} = m and Var{x} = 36.
How large is the mean squared error (MSE) for each estimator? Which estimator
has the smaller MSE?
(4.4) Let X be a random variable over a population with expectation μ and variance
σ2 . Further, let x1 , . . . ,x N be an i.i.d. sample of size N > 4 over the population.
The following estimator of the expected value μ of X is proposed:
μ̂ := (1/(N − 4)) ∑_{i=3}^{N−2} xi,   (4.77)
i.e., the first and last two elements of the sample are discarded.
1. Show that μ̂ is an unbiased estimator of μ.
2. Is μ̂ a better estimator than the maximum likelihood estimator
μ̂ML = (1/N) ∑_{i=1}^N xi ?   (4.78)
(4.5) Let m1, . . . , mN be a sample of N i.i.d. elements and consider the following estimator of the variance σ² = Var{m}:

σ̂² = (1/(α − N)) ∑_{i=1}^N (mi − μ)²,   (4.79)
(4.6) Let m1, . . . , mN be a sample of N i.i.d. elements and consider the following estimator of the expected value μ = E{f(m)}:

μ̂ = (N/(N − α)) ∑_{i=1}^N f(mi),   (4.80)
for some function f(⋅). For which values of α will μ̂ be an unbiased estimator of μ?
5 Parameter free methods
At the beginning of this chapter we will first review what has been done so far. The
principal goal is to assign a class to an unknown object given its features and a training
set of objects with known features and classes. From a more abstract point of view,
one wants to learn some kind of rule, given a finite training sample of special cases.
The rule will then be applied to a new situation, where one hopes that the proposed
rule has general significance.
This is a two-step process: In the first step, the general rule must be found from
specific instantiations, the second step is to apply the (hopefully) general rule to a
new specific situation. If necessary, the rule found from the first step will need some
intermediate formal rewriting into a form that is applicable in the second step. The
first step, from the special to the general, is called “induction,” the second step, from
the general to the special, is called “deduction” (see Figure 5.1). In the context of this
textbook, the induction is to find a class-specific feature distribution given a dataset D;
the deduction is to apply the a posteriori probability to an unknown feature vector. The
necessary formal rewriting is Bayes’ law in order to obtain the a posteriori probability.
Instead of following this indirection, it is sometimes possible to directly infer from
the given data to the new situation. The term “transduction” was introduced by Vapnik
in the context of support vector machines (see Section 7.7) to describe this shortcut. In
this chapter, however, we are concerned with the induction step.
The induction step from D to p(m|ω) is an ill-posed inverse problem. For a deeper
understanding of inverse problems, see, e.g., the work of Aster et al. [2013] or Rieder
[2003]. For our purposes, the following intuition is sufficient: The induction is called
“inverse,” because the deduction is thought of as the forward model. It is ill-posed if
one of the following conditions (going back to Jacques Hadamard) holds:
– The inverse mapping is not well defined,
– the inverse mapping is not unique, or
– the inverse mapping is not continuous.
Fig. 5.1. The triangle of inference: induction from the training set D to p(m|ω) (an ill-posed inverse problem), deduction from p(m|ω) to P(ω|m) for unknown objects (the forward problem), and the direct shortcut called “transduction.”
In this context, the dataset poses only a finite number of conditions on the infinite-
dimensional solution space of all density functions. This means that in general, the
data does not suffice to determine a solution. Regularization, i.e., enforcing further
restrictions on the space of solutions, can offer a way out. Such additional restrictions
might be
– to make additional assumptions, e.g., on the range of parameters,
– to bring in additional prior knowledge, and
– to formulate desirable traits of the solution as auxiliary constraints.
The risk of regularization lies in restricting the space of permissible solutions in such
a way that the true solution is unintentionally excluded. In the previous chapter, the
class-specific feature distribution was assumed to be structurally known. Hence, the
space of all densities was restricted to a finitely parametrized family of densities. Un-
fortunately, there are only a handful of standard densities that are still analytically and
computationally feasible. But it is questionable how well these densities fit the appli-
cations. Especially the assumption that a multi-dimensional feature space is governed
by a product of simple densities seems bold.
Although this chapter still considers the induction step and tries to find a density
p(m | ω), it follows a totally different approach. The parameter-free methods do not
constitute a specific form of the density right from the beginning, but try to look at the
samples as a kind of discrete approximation of the true density. While the number of
samples increases, a sequence of densities is created that eventually converges to the
true density. This textbook will present two such methods: the Parzen window method
and the k-nearest neighbor method¹.
Let m denote a random vector, p(m) its density, m a realization of m, and Am ⊆ M
a neighborhood around m. Eventually, p(m) will be the unknown, true density we
want to approximate. Then
Pm = ∫_{Am} p(m̆) dm̆   (5.1)
= ∫_M p(m̆) (1/V) ∫_{Am̆} dm dm̆ = ∫_M p(m̆) dm̆ = 1,   (5.3)

because the inner integral equals V.
The aforementioned conditions are only necessary, not sufficient, to ensure convergence. In particular, the order of the limits matters: without any additional assumptions, one has

p(m) = lim_{V→0} lim_{N→∞} kN,V/(NV) ≠ lim_{N→∞} lim_{V→0} kN,V/(NV) = 0.   (5.11)
The latter is easy to see: if N is arbitrary but fixed, then for a sufficiently small volume V all the points are located outside the neighborhood and k is constantly zero.
In theory this is no problem, because one can first define a sequence of decreas-
ing volumes (outer limit) and then take as many samples as necessary to get a good
approximation of lim k/N (inner limit). In practice, the situation is more complicated.
Generally, the number of samples N is given in advance or is at least bounded and
there is normally no option for getting a fresh sequence of samples for each volume V.
Hence the question is: What is a reasonable size for V, given some samples?
Choosing a large V helps to get a reliable approximation of PV = kN,V/N, because many samples fall in A. But unfortunately, the approximation PV/V of the density becomes too coarse. In the extreme case, the neighborhood A = M equals the whole support. Then k = N, because all the samples must fall in M, and Pm = 1 is a perfect approximation, but the moving average degenerates to the uniform distribution. In contrast, choosing a small V is appropriate for getting a good local approximation of the density, provided that PV ≈ k(N,V)/N can still be reliably estimated. This becomes more difficult the smaller V is chosen, because the event of a sample falling in A becomes more unlikely. In the extreme case, there is no sample at all, and so PV = 0. Then the approximate
density is ragged, with areas that are constantly zero and small peaks around each
sample.
Informally, the volume V must not diminish too fast with respect to N. In Section 5.1, a proof of convergence is presented if V⁻¹ = O(√N). Two approaches are well established in practice. The Parzen window method assigns the volume V ∝ 1/√N and k is estimated from the sampling. The k-nearest neighbor method assigns k ∝ √N and the volume V is estimated from the sampling, i.e., the neighborhood around each point is blown up until exactly k points are included.
Figure 5.2 shows an example of a comparison between the Parzen window method
and the k-nearest neighbor method. The samples (blue points) are drawn uniformly
within the unit disk with radius r = 1. The center point of the neighborhood is m = 0
and A0 is chosen to be a disc (red line). Note that for the Parzen window method, the radius decreases with N^(−1/4), because the area of the disk is proportional to N^(−1/2).
5.1 The Parzen window method

The Parzen window method assigns the volume of the neighborhood with respect to the sample size. For now, the neighborhood is chosen to be a simple d-dimensional cube with edge length hN and volume VN = hN^d. To this end, let
φ(u) = rect(u) := 1 if |uj| ≤ 1/2 for all j = 1, . . . , d, and 0 else,   (5.12)
denote the indicator function of the unit cube centered at the origin. Then u ↦ φ((m − u)/hN) denotes the indicator function of a cube centered at m with edge length hN. Let
m1 , . . . , mN denote the samples. For each m ∈ M, the number of samples within its
neighborhood can be counted by
kN(m) = ∑_{i=1}^N φ((m − mi)/hN).   (5.13)
p̂N(m) = (1/N) ∑_{i=1}^N (1/VN) φ((m − mi)/hN) = (1/N) ∑_{i=1}^N δN(m − mi)   (5.15)
The window function φ satisfies this condition as it is constantly zero outside the unit
cube.
The symbol δ N for the sequence of scaled window functions in Equation (5.15) was
not chosen arbitrarily, but serves to highlight a connection with Dirac sequences.
The first two requirements state that δ N is a formal density function. The third require-
ment demands that all the probability is eventually concentrated in an arbitrary small
neighborhood around the origin. In other words, the δ N approach the Dirac distribu-
tion δ. Unlike δ, however, all the δ N are regular distributions. Quite often, δ is even
defined as the weak limit of a Dirac sequence. In this case, the convergence holds by
definition.
Here, δN(m) = (1/VN) φ(m/hN) was initially chosen to be the uniform distribution over a rectangular neighborhood. A natural generalization is to replace the uniform window function by a Gaussian density. To this end, redefine

φ(u) := exp{−(1/2)‖u‖²}

and set

δN(m) = (1/VN) φ(m/hN) = (1/((2π)^(d/2) hN^d)) exp{−‖m‖²/(2 hN²)}.   (5.18)

Here, VN = (2π)^(d/2) hN^d denotes the volume of the (infinite) support of the window function scaled by hN. Hence, the parameter hN can be understood as controlling the variance of the Gaussian window function.
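A minimal sketch (not from the book) of the Parzen window estimate of Equation (5.15) with the Gaussian window of Equation (5.18) for a one-dimensional sample; the sample and the spread hN are made-up example values.

```python
import numpy as np

# Minimal sketch (not from the book): Parzen window density estimate with a
# Gaussian window for d = 1.
rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=100)     # m_1, ..., m_N (made-up sample)
h_N = 0.5                                    # spread of the window (assumed value)
d = 1
V_N = (2.0 * np.pi) ** (d / 2) * h_N ** d    # volume of the scaled Gaussian window

def parzen_estimate(m):
    """p_hat_N(m) = 1/N * sum_i (1/V_N) * phi((m - m_i)/h_N)."""
    u = (m - samples) / h_N
    return np.mean(np.exp(-0.5 * u ** 2) / V_N)

grid = np.linspace(-4.0, 4.0, 9)
print(np.round([parzen_estimate(m) for m in grid], 3))   # roughly bell-shaped values
```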
The estimated density p̂ N (m) can be regarded as a random quantity p̂ N (m) on
its own if the samples that support the density are considered as random variables.
Hence, it is possible to examine its expectation μ N (m) and variance σ2N (m) point-wise
for every m. The estimated density p̂ N (m) converges to the true density p(m) in terms
of mean squared error if
lim_{N→∞} VN = 0,   (5.23)
lim_{N→∞} N VN = ∞.   (5.24)
Note that the second condition is actually a repetition of the definition of a window
function in Equation (5.16). The first condition forces the window function to be modest
in the neighborhood of the origin. The last two conditions force the spread of the
window function to vanish but not faster than the number of samples increases.
The expectation is calculated in two steps, because the first equality is needed
again later. For any i = 1, . . . , N,
μN(m) = E{p̂N(m)} = (1/N) ∑_{i=1}^N E{δN(m − mi)} = [δN ∗ p](m)   (5.26)
follows. As the Dirac distribution is the neutral element with respect to convolution, lim_{N→∞} μN(m) = lim_{N→∞} [δN ∗ p](m) = [δ ∗ p](m) = p(m) yields the desired result. Note that in the above line, the limit and the integral were
silently swapped. This step used the requirement Equation (5.22).
The calculation of the variance uses Equation (5.26) and exploits the fact that the
variance of a sum of independent variables is the sum of the individual variances:
σN²(m) = Var{p̂N(m)} = (1/N²) ∑_{i=1}^N Var{δN(m − mi)}
        = (1/N²) ∑_{i=1}^N [E{δN(m − mi)²} − (E{δN(m − mi)})²]
        ≤ (1/N) ∫_M δN²(m − u) p(u) du
        = (1/N) ∫_M (1/VN) φ((m − u)/hN) δN(m − u) p(u) du
        ≤ (sup_u φ(u))/(N VN) ∫_M δN(m − u) p(u) du = (sup_u φ(u))/(N VN) μN(m),   (5.28)

where the first equality uses the i.i.d. assumption, the first inequality drops the non-negative term (E{δN(m − mi)})² = μN(m)², the following equality uses (5.14), and the last step uses (5.21) and (5.26).
Fig. 5.3. Application to the reference example of Section 3.3.2. Decision regions of a classifier with Parzen window density estimators (hN ≈ 0.5, N = 100 for either class) with Gaussian window function. The training error is etrain = 6 %; the testing error is etest = 6.5 %, but asymptotically approaches etest ≈ 7 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
With hN = h · N^(−1/(2d)), so that VN = (2π)^(d/2) hN^d ∝ 1/√N, the estimate with Gaussian windows becomes

p̂N(m) = (1/(N (2π)^(d/2) hN^d)) ∑_{i=1}^N exp{−‖m − mi‖²/(2 hN²)}
       = (1/(√N (2π)^(d/2) h^d)) ∑_{i=1}^N exp{−(N^(1/d)/(2 h²)) ‖m − mi‖²}.   (5.30)
The Parzen window method can be used to estimate the class-specific feature dis-
tributions p(m | ω i ), i = 1, . . . , c, which are in turn used in an MAP classifier. Figure 5.3
shows the decision regions of the ongoing example, where the class-specific densities
p(m | ω1 ) and p(m | ω2 ) were estimated using Parzen windows. One can see that the
decision boundary comes very close to the decision boundary of the Bayesian optimal
classifier, and with 7 %, the asymptotic test error is only slightly larger than the 6.16 %
of the optimal classifier. Note, however, that the outcome very much depends on the
number and the location of the training samples, as well as the dimensionality of the
feature space itself (see Section 6.1).
In summary, the Parzen window method is characterized by three crucial traits.
Universality The Parzen window method does not require any prior knowledge about
the probability distribution. Even sophisticated multi-modal distributions can be
estimated.
Choice of parameter Although the theoretical convergence to the true density holds
in any case, the quality of the result in practice depends heavily on the initial
choice of the volume V or the associated spread h.
Data-independent size of neighborhood For a fixed sample size N, any point of the
feature space is covered by the same neighborhood, independently of the sample
density.
Figures 5.4 and 5.5 depict the estimation of a normal distribution with a Gaussian
window function for different sample sizes N and initial spreads h for m ∈ ℝ and
m ∈ ℝ2 , respectively.
5.2 The k-nearest neighbor method

Similar to the Parzen window method, the k-nearest neighbor method aims to estimate a density in virtue of

p̂(m) = (k/N)/V = k/(NV).   (5.31)
But in contrast to the Parzen window method, the number of considered samples k
only depends on the total number of samples N and instead the volume V is fitted so
that exactly k samples fall in the neighborhood of m. Review Figure 5.2 for a graphical
illustration of the difference between both methods.
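A minimal sketch (not from the book) of the k-nearest neighbor density estimate of Equation (5.31): the volume V is that of the smallest ball around m containing the k nearest samples. The sample and the choice of k are made-up example values.

```python
import numpy as np
from math import gamma, pi, sqrt

# Minimal sketch (not from the book): k-NN density estimate p_hat(m) = k / (N * V).
rng = np.random.default_rng(5)
samples = rng.normal(0.0, 1.0, size=(256, 2))   # N = 256 two-dimensional samples
N, d = samples.shape
k = int(round(sqrt(N)))                         # k proportional to sqrt(N)

def knn_density(m):
    dists = np.sort(np.linalg.norm(samples - m, axis=1))
    r = dists[k - 1]                                # radius to the k-th nearest neighbor
    V = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d   # volume of a d-ball of radius r
    return k / (N * V)

print(knn_density(np.array([0.0, 0.0])))   # roughly 1/(2*pi) ~ 0.16, the N(0, I) mode
print(knn_density(np.array([3.0, 3.0])))   # in the tail, much smaller
```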
The neighborhood of a point m can be thought of as a cell centered at m that is
inflated until it contains k N samples; the cell is small if the neighborhood is dense
and large if the neighborhood is sparsely populated. Unfortunately, this intuition is
Fig. 5.4. Parzen window density estimation with a Gaussian window function for varying sample sizes N ∈ {1, 10, 50, 100} (blue curves) and spreads h ∈ {1.0, 0.5, 0.1} with m ∈ ℝ; the true density is drawn in red.
Fig. 5.5. Parzen window density estimation with a Gaussian window function for varying sample sizes N and spreads h ∈ {2.0, 1.0, 0.5} with m ∈ ℝ²; the true density is a single Gaussian with µ = 0 and Σ = I, p(m) = N(µ, I) (not shown). Note that the scale of the applicate is different in each row.
Fig. 5.6. k-nearest neighbor density estimation for varying sample sizes N and k = √N (blue curve); the true density is drawn in red. Panels: N = 1, k = 1; N = 16, k = 4; N = 64, k = 8; N = 256, k = 16.
lim_{N→∞} kN = ∞,   (5.32)
lim_{N→∞} kN/N = 0   (5.33)

are sufficient conditions for p̂(m) → p(m) in probability for every m where p(m) is continuous.
The first condition ensures that kN/N is a good local approximation of the probability (the stochastic error vanishes); the second condition ensures that k grows sufficiently slowly so that the volume of the cell becomes zero (the systematic error vanishes).
Instead of reproducing the proof, let us highlight a substantial difference from the Parzen window method: when the Parzen window method is employed with a differentiable window function, each function of the sequence is differentiable, too. Moreover, every approximation fulfills the formal requirements of a density.
with one pole at m1 (here, α denotes the volume of the d-dimensional unit sphere).
For k > 1, the volume of the neighborhood of m changes smoothly with respect to m,
as long as the k samples that define the neighborhood stay the same. If m moves into
the area of influence of a different sample, the volume of the neighborhood changes in
a non-differentiable way. Generally, the points of nondifferentiability of the estimated
function do not match the samples, but are placed in between.
Moreover, in the 1-dimensional case, i.e., m ∈ ℝ, the integral of each approximated density is not equal to one but diverges to infinity even for k > 1. Far away from the finite number of samples N, every approximation asymptotically behaves like m ↦ 1/m, and therefore the integral becomes infinite. This means that although the density estimate converges to the true density function point-wise in probability, the approximation is not a density on its own.
5.3 k-nearest neighbor classification

If ki of the k samples inside the cell around m belong to class ωi, then

P̂(m, ωi) = (ki/N)/V = ki/(NV)   (5.35)

is an estimator of the joint probability. Applying Bayes' law leads to

P̂(ωi|m) = P̂(m, ωi) / ∑_{j=1}^c P̂(m, ωj) = ki/k.   (5.36)
The appealing point of the last line is that it not only provides the a posteriori prob-
ability directly, but that it does not suffer from the technical difficulties of the previous result. Because there are only finitely many classes, P̂(ω|m) is not a probability density
function, but a probability mass function and sums to one. Moreover, the volume V
and the number of samples N that caused those difficulties cancel out.
Together with the usual maximum a posteriori rule, the estimated class is the
class of the most frequently represented samples in the neighborhood. For k = 1, this
classifier is called the nearest neighbor classifier. The formal decision rule is
ω̂(m) = ωi  ⇔  arg min_{mj∈D} ‖m − mj‖ ∈ Di,   (5.37)
where α is a normalization constant that ensures that the ki(m) sum to one:

α := 1 / ∑_{j=1}^c kj(m).   (5.39)
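A minimal sketch (not from the book) of k-nearest neighbor classification: the estimated a posteriori probabilities are the class frequencies ki/k among the k nearest training samples, and the decision is the majority class. Training data are drawn from two made-up Gaussian classes.

```python
import numpy as np

# Minimal sketch (not from the book): k-NN classifier using Eq. (5.36).
rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)      # class labels: omega_1 -> 0, omega_2 -> 1
k = 5

def knn_classify(m):
    nearest = np.argsort(np.linalg.norm(X - m, axis=1))[:k]
    counts = np.bincount(y[nearest], minlength=2)   # k_i for each class
    return counts.argmax(), counts / k              # decision and P_hat(omega_i | m)

print(knn_classify(np.array([0.5, 0.2])))   # -> (0, [1.0, 0.0]) or similar
print(knn_classify(np.array([2.5, 2.8])))   # -> (1, ...)
```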
Plotting the decision boundaries in the feature space yields a Voronoi tessellation
(see Figure 5.7). The only free design parameter is the scaling of the individual com-
ponents, or, equivalently, the choice of the metric. This parameter heavily influences
which neighbor is considered to be “near.” Figure 5.8 illustrates this effect for a fixed
standard Euclidean metric but different scales of the first axis.
A natural extension is to base the classification decision not only on the nearest
neighbor but on the majority class of several neighbors, i.e., assign the most frequent
class among the k > 1 neighbors. Let Ak(m) denote a neighborhood of m that includes exactly the k nearest samples.
P∗ = R = 1 − ∫_M P(ω̂(m), m) dm.   (5.41)
In the case of the nearest neighbor classifier, the error probability depends on the num-
ber of samples being drawn. The classification is erroneous if the sample mi nearest to
the test sample m has a different class than the true class of m. Let P N denote the error
probability of the nearest neighbor classifier with N samples. We state without proof
that P = limN→∞ P N exists and call P the asymptotic error probability of the nearest
neighbor classifier. With c classes, both error probabilities are at most (c − 1)/c, because
this is the probability of being wrong if one merely guesses. The optimal Bayes error
probability P∗ is a lower bound for P. But it is also possible to bound the error probabil-
ity of the nearest neighbor classifier from above in terms of P∗ . Cover and Hart [1967]
Fig. 5.10. Decision regions of a nearest neighbor classifier with k = 1. The training error is etrain = 0 % by definition of the classifier. The testing error is etest = 10 %, and asymptotically approaches etest ≈ 9.4 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Fig. 5.11. Decision regions of a nearest neighbor classifier with k = 3. The training error is etrain = 4.5 %. The testing error is etest = 8 %, and asymptotically approaches etest ≈ 7.5 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Fig. 5.12. Decision regions of a nearest neighbor classifier with k = 5. The training error is etrain = 7 %. The testing error is etest = 6 %, and asymptotically approaches etest ≈ 7.1 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Fig. 5.13. Asymptotic error bounds of the nearest neighbor classifier: the asymptotic error P as a function of the Bayes error P∗, bounded by P = P∗ from below and by P = P∗(2 − (c/(c−1)) P∗) from above; the line P = 2P∗ and the maximal error (c−1)/c are shown for reference.
If P∗ = 0, then the supports of the class-specific feature distributions must be disjoint. Hence, for infinitely many samples, the nearest sample asymptotically has the correct class and the nearest neighbor classifier always decides correctly. In contrast, if P∗ = (c − 1)/c, then the Bayes classifier is not better than guessing and the nearest neighbor classifier is not worse.
By dropping the last term in Equation (5.42), the weaker upper bound P ≤ 2P∗ (5.43) is obtained.
starting points to control the quality of the approximation. Even worse, in the most
general case, examples can be constructed such that convergence is arbitrarily slow
and not even monotonic.
5.4 Exercises
(5.1) Given the mapping y = A(x) = x², why is the inference from y to x an ill-posed inverse problem?
(5.2) One wishes to establish a functional relation between two scalar measurements
x and y. Considering the underlying physics, it is known that the relation must
be linear. Suppose that there are N > 2 noisy measurements (x i ,y i ), i = 1, . . . , N.
The task is formulated as follows:
Find the parameters a,b ∈ ℝ of a straight line that interpolates the data points, i.e., y i =
a x i + b for all i = 1, . . . ,N.
Why is this (inverse) problem ill-posed? How can the task be reformulated so that
the inverse problem is well-posed?
(5.3) Suppose there are given six mappings y = A i (x), i = 1, . . . , 6 with the properties
shown in the table below. For which of the mappings A i is the inference from y to
x an ill-posed inverse problem?
Property                         Mapping A1 A2 A3 A4 A5 A6
Ai⁻¹ is well defined             × × × × ×
Ai⁻¹ is injective                × ×
Ai is injective                  × × × × ×
Ai⁻¹ is surjective               × × × ×
Ai is surjective                 ×
Ai⁻¹ is continuous               × × × ×
Ai⁻¹ is linear                   × × × ×
x = Ai⁻¹(y) is unique            × × × ×
D = {8.1, 8.9, 7.6, 9.7, 12.2, 7.1, 10.4, 9.3, 14.9, 10.1},
D = {8.0, 8.5, 7.6, 9.7, 12.2, 7.1, 10.5, 9.3, 14.9, 10.0},
(5.6) Use the following sample D = {m1 , . . . , m6 } to graphically classify the points
m1 = (−2,2)T , m2 = (2,0)T and m3 = (−1,−5)T using the nearest neighbor method:
m1 = (0, 0)ᵀ,  m2 = (4, 2)ᵀ,  m3 = (2, 6)ᵀ,
m4 = (4, −4)ᵀ,  m5 = (−6, −6)ᵀ,  m6 = (−4, 2)ᵀ.
6 General considerations

Section 2.7 introduced techniques to reduce the dimension of a feature space but lacked an explanation of the reasons why a small number of dimensions is favorable. The introduction of this book established some design principles for how to select features and referred to the “curse of dimensionality” (see Section 1.4, page 8), but did not give an explanation of the term. This section will fill this gap.
We begin with the exact opposite and first give an example that seems to support the commonsense (but false) belief that a large number of features should lead to better classification. After that, this belief is disproved and it will be shown why the example is misleading.
Recall an example from Section 3.3.4. The number of classes is c = 2 and both class-specific feature distributions are p(m|ωi) = N(m; µi, Σ) with shared covariance matrix Σ. Moreover, the a priori distribution P(ω1) = P(ω2) = 1/2 is assumed. As already known from Equation (3.56), the decision boundary of the corresponding classifier is a hyperplane given by

Λ(m) = (Σ⁻¹(µ1 − µ2))ᵀ (m − (1/2)(µ1 + µ2)) = 0   (6.1)

and the decision rule is

ω̂(m) = ω1 if Λ(m) > 0, and ω2 else.   (6.2)
Putting the random variable m into Λ(m) makes this a Gaussian distributed ran-
dom variable itself, because it is a linear transformation of m. Thus, it is possible to
calculate the conditional expectation and variance with respect to the true class. There
follows

E{Λ(m) | ω = ω1} = E{(Σ⁻¹(µ1 − µ2))ᵀ (m − (1/2)(µ1 + µ2)) | ω = ω1}
                 = (µ1 − µ2)ᵀ Σ⁻¹ (E{m | ω = ω1} − (1/2)(µ1 + µ2))
                 = (1/2)(µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2),   (6.3)

and likewise ω = ω2 yields

E{Λ(m) | ω = ω2} = −(1/2)(µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2).   (6.4)
s := ‖µ1 − µ2‖_M = √((µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2))   (6.6)

as the Mahalanobis distance w.r.t. Σ⁻¹ between the expectation values µ1 and µ2. In summary, this leads to
The last line shows that the error probability vanishes (R → 0) if the Mahalanobis
distance of the expectation values increases (s → ∞). Until now, no assumption about
the dimension d of the feature space has been made. The mutual positions of the ex-
pectation values cannot be easily changed, because they are given by the nature of the
application, but one could try to put more features into the feature vector and thereby
increase its dimension. As long as these additional features contain new information
about the problem, the Mahalanobis distance is increased. In mathematical terms,
s → ∞ if d → ∞. Note that simply duplicating components does not increase the
Mahalanobis distance: the increase depends on the correlation between the existing
and the additional features. In consequence, one could argue the more data the better,
or, at greater length, the more data, the higher the dimension, the greater the distance,
the lower the error. Figure 6.1 seems to support this statement. The plot indicates the
support of two uniform distributions in different dimensions. Under a projection onto
the first dimension, both supports overlap each other by one-half. This overlapping
region is responsible for a false classification. In two dimensions, the overlapping re-
gion of the rectangles only counts for one-quarter of the area. So the proportion of
possible false classifications declines. In three dimensions, the cubes can be perfectly
separated.
Although this seems to support the belief that a larger number of dimensions
improves the classification, the statement is generally wrong in real-world applica-
tions. The conclusion is only true if the class-specific feature distributions are perfectly
known or if there were infinitely many training samples.
Consider another, this time factual, example that illustrates the real effect of in-
creasing the number of dimensions (Beyerer [1994]). The task was to automatically
assess the quality of honed surfaces of cylinders given a catalog of N = 33 pictures
of such a surface with manually assigned grades on an ordinal scale between 1 and
10. To solve the problem, 25 heuristic and model-driven features were defined. The
idea for classification was to estimate the grade n by a linear regression n̂ = Am
with the smallest mean square error. A feature selection according to Section 2.7.5
showed that increasing the number of features improved the classification result at
the beginning, but beyond a certain point, additional features increased the error again
(see Figure 6.2).

Fig. 6.2. Dependence of the error rate on the dimension of the feature space in Beyerer [1994]: the estimation error is plotted against the number of features d (5 to 25).
This phenomenon can be best understood by the following allegory. Each feature
bears some net payload and some irrelevant payload that counts as a disturbance.
As long as a new feature adds more important information into the system than dis-
turbance, the classification performance increases. But if all the net payload that a
feature potentially could add is already included by the existing features, then only
the disturbance goes on top, and the performance degrades. To rectify the misleading
perception, the concept of interval probability will be introduced with the help of an
example. Let m ∼ N(µ, σ2 I) with dimension d ∈ ℕ. As all components are indepen-
dent, the probability that a sample falls in the d-dimensional cube with edge length
4σ around its expectation value equals

P = ∏_{i=1}^{d} P(|m_i − µ_i| ≤ 2σ) = (Φ(2) − Φ(−2))^d ≈ 0.95^d.

This is a strictly decreasing function with respect to d. For d = 1, the bulk of the
samples lies within the interval [µ1 − 2σ, µ1 + 2σ]. But for d = 100, it follows that
P ≈ 0.95^100 ≈ 0.0059, and only a small fraction of the samples lies within the cube.
Generally, the higher the dimension, the more sparsely the samples are scattered.
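The numbers above follow directly from the per-component probability; a minimal sketch (using only the Python standard library):

from math import erf, sqrt

p1 = erf(2 / sqrt(2))     # P(|m_i - mu_i| <= 2*sigma) for one component, ~0.9545

for d in (1, 2, 10, 100):
    print(f"d = {d:3d}:  P(cube) ~ {p1 ** d:.4f}")
# d =   1:  P(cube) ~ 0.9545
# d = 100:  P(cube) ~ 0.0095   (with the rounded value 0.95 per component: ~0.0059)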
This leads to the notion of sample density, to conceptualize this statement more
precisely. Given a finite number of samples N_i per class, the smallest axis-aligned
enclosing cuboid is considered (see Figure 6.3). The sample density is defined as
the number of samples per unit volume, γ_i := N_i / V_i. So as to better compare the cuboids,
and abstract from the different ratios of the edges, the geometric mean of the edge
lengths, s_i := (∏_{j=1}^{d} s_ij)^{1/d}, is used. Then γ_i = N_i / V_i = N_i / (s_i)^d. In order to keep the density
constant as the dimension grows, the number of samples would have to grow exponentially with d.
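The collapse of the sample density can be observed directly; a minimal sketch (assuming numpy; the number of samples and the use of a standard normal distribution are arbitrary choices) computes γ = N/(s)^d for the enclosing cuboid of random samples:

import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # a fixed number of samples per class

for d in (1, 2, 3, 5, 10):
    samples = rng.standard_normal((N, d))
    edges = samples.max(axis=0) - samples.min(axis=0)   # edge lengths s_ij of the enclosing cuboid
    s = np.exp(np.log(edges).mean())                    # geometric mean of the edge lengths
    print(f"d = {d:2d}:  s = {s:4.2f},  density gamma = N / s^d = {N / s ** d:.3g}")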
Fig. 6.3. Density of a sample for feature spaces of increasing dimensionality (d = 1, 2, 3). In each plot, the number of samples per class is the same.
Fig. 6.4. Examples of feature dimension d and parameter dimension q. If q = d, the decision boundary is linear; larger q enables more complicated decision boundaries.
⇒ ∇_µ ln p(m) = (1/σ²)(m − µ)
⇒ (∇_µ ln p(m))(∇_µ ln p(m))^T = (1/σ⁴)(m − µ)(m − µ)^T
⇒ J(µ) = E{(∇_µ ln p(m))(∇_µ ln p(m))^T} = (1/σ⁴) E{(m − µ)(m − µ)^T} = (1/σ²) I.    (6.12)

So the inverse of the Fisher information matrix is J^{-1}(µ) = σ² I and it follows that

tr Cov{µ̂} ≥ (1/N) tr J^{-1}(µ) = qσ²/N → ∞   (q → ∞).    (6.13)
The trace of the covariance matrix is the sum of the variances of the individual
components, and grows linearly in the number of parameters. Unfortunately, the trace has no direct geometric
interpretation. But since in this example the features are all pairwise independent, the
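The bound of Equation (6.13) can be checked with a small Monte Carlo experiment; a minimal sketch (assuming numpy; the sample size, the number of trials, and σ are arbitrary) estimates µ by the sample mean and compares the empirical trace of the covariance with qσ²/N:

import numpy as np

rng = np.random.default_rng(1)
sigma, N, trials = 1.0, 50, 1000

for q in (1, 5, 25, 100):
    # maximum likelihood estimate of mu (the sample mean) from N observations, repeated many times
    estimates = rng.normal(0.0, sigma, size=(trials, N, q)).mean(axis=1)
    tr_cov = estimates.var(axis=0, ddof=1).sum()      # empirical tr Cov{mu_hat}
    print(f"q = {q:3d}:  tr Cov ~ {tr_cov:6.3f}   (bound q*sigma^2/N = {q * sigma**2 / N:.3f})")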
Learning (Estimation)
k_i(m) = −(1/2) (m − µ̂_i)^T Σ̂_i^{-1} (m − µ̂_i) − (d/2) ln 2π − (1/2) ln det Σ̂_i + ln P(ω_i)    (6.15)

The individual terms require estimating µ̂_i (O(dN) operations), Σ̂_i (O(d²N)), and P(ω_i) (O(N)).
The covariance matrix Σ̂ has d(d+1)/2 distinct entries, hence the estimation is asymp-
totically dominated by O(d²N). The necessary matrix inversion Σ̂^{-1} requires O(d^{2.4})
operations, but d^{2.4} < d²N due to the fact that d < N. Therefore, the overall complexity
to determine k_i is O(d²N). As there are i = 1, . . . , c such decision functions, the total
cost is O(cd²N).
Classification
k_i(m) = −(1/2) (m − µ̂_i)^T Σ̂_i^{-1} (m − µ̂_i) − (d/2) ln 2π − (1/2) ln det Σ̂_i + ln P(ω_i)    (6.16)

Evaluating the quadratic form requires O(d²) operations; the remaining terms are precomputed constants, i.e., O(1) each.
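The estimation and evaluation steps of Equations (6.15) and (6.16) can be written down compactly; the following is a minimal sketch (assuming numpy; the function names and the toy data are illustrative only):

import numpy as np

def learn_gaussian_classifier(M, labels, c):
    """Estimate the parameters of Equation (6.15) for each class.
    Estimating mu_i is O(dN), Sigma_i is O(d^2 N), and P(omega_i) is O(N)."""
    params, N = [], len(labels)
    for i in range(c):
        Mi = M[labels == i]
        mu = Mi.mean(axis=0)
        Sigma = np.cov(Mi, rowvar=False)
        params.append((mu, np.linalg.inv(Sigma),
                       np.log(np.linalg.det(Sigma)), np.log(len(Mi) / N)))
    return params

def classify(params, m, d):
    """Evaluate Equation (6.16) for each class; the quadratic form costs O(d^2)."""
    scores = []
    for mu, Sigma_inv, logdet, logprior in params:
        diff = m - mu
        k = -0.5 * diff @ Sigma_inv @ diff - 0.5 * d * np.log(2 * np.pi) \
            - 0.5 * logdet + logprior
        scores.append(k)
    return int(np.argmax(scores))

# toy usage with two 2-dimensional Gaussian classes
rng = np.random.default_rng(2)
M = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([3, 3], 1, (100, 2))])
labels = np.repeat([0, 1], 100)
params = learn_gaussian_classifier(M, labels, c=2)
print(classify(params, np.array([2.5, 3.1]), d=2))   # expected: 1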
6.2 Overfitting
The term overfitting denotes a phenomenon that generally occurs when a model with
a large number of parameters is fit to a set with too few samples. After the model is
chosen and the number of parameters is fixed, the remaining objective is to minimize
the classification error over the dataset. If the model is powerful enough, the error
with respect to the specifically given dataset can be reduced to zero: the model learns
the data by heart (see Figure 6.5b). However, this usually does not coincide with a
good general solution and the classification error on new and unseen samples will be
large. An overly simple model, on the other hand, is not able to sufficiently reduce
the error at all (see Figure 6.5a), because it lacks the necessary flexibility to match the
data. The ability to achieve a low error rate on both the training data and the testing
data is called generalization.
In order to check whether a chosen model fits the problem, the dataset can be
divided into a training set D and a test set T (see Figure 1.5). The training set is used to
estimate the parameters of the model, the test set is used to assess the model’s perfor-
mance and ability to generalize. Nonetheless, such a check can never be a strict proof
that the model is the correct one, but only a test for plausibility. Hence, the question
remains how to find the right model. In Figure 6.5c, the optimal decision boundary (of
a Bayesian classifier) can be given, because the example was artificially created and
the underlying model from which the data set was generated was known. In reality,
this is hardly ever true. Hence, the viable approach is to employ Occam’s Razor. This
principle states that among different competing hypotheses that are equally consistent
with the given data, the hypothesis with the fewest assumptions should be selected.
Figure 6.5 illustrates the effect of overfitting by means of an example. The decision
boundary tries to optimally separate the classes, within the limits of its ability.
The next example is of overfitting in the context of regression analysis. Let f(x) =
(1/2)x² − x and y = f(x) + r, where r is some Gaussian distributed noise. Five samples
(x_1, y_1), . . . , (x_5, y_5) are given and it is only known that x and y are governed by some
polynomial rule. The task is to find the best estimate f̂ with f̂(x) = ∑_{i=0}^{k} a_i x^i. For order
k = 4, the regression f̂ is able to perfectly fit the given samples, but overall the regression
with order k = 2 resembles the true function much better, even though it exhibits a small,
nonzero training error (see Figure 6.6). If there had been a sixth sample (x_6, y_6) and both
regression functions had been kept fixed, the quadratic polynomial would very likely
have been a better fit.
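This behavior is easy to reproduce; a minimal sketch (assuming numpy; the sample positions, noise level, and random seed are arbitrary, so the exact numbers will vary):

import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 0.5 * x**2 - x                        # the true underlying function

x = np.array([-1.0, 0.5, 2.0, 3.0, 4.5])            # five training samples ...
y = f(x) + rng.normal(scale=0.5, size=x.size)       # ... disturbed by Gaussian noise

grid = np.linspace(-1.0, 4.5, 200)                  # unseen positions inside the sample range
for k in (2, 4):
    coeffs = np.polyfit(x, y, deg=k)                # least-squares polynomial of order k
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_mse = np.mean((np.polyval(coeffs, grid) - f(grid)) ** 2)
    print(f"k = {k}: training MSE = {train_mse:.3f},  deviation from f = {true_mse:.3f}")
# The order-4 fit interpolates the samples (training MSE ~ 0), but for most noise
# realizations its deviation from the true function is larger than that of the order-2 fit.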
Fig. 6.5. Trade-off between generalization and training error: (a) linear decision boundary: simple model, but large training error; (b) overfitting with a highly flexible decision boundary: the training error is zero; (c) optimal decision boundary. The classifier should neither be too simple to represent the underlying classes, nor so complex that it fails to generalize from the training data.
Fig. 6.6. Overfitting in a regression scenario. Two polynomials of order k = 2 and k = 4 are fitted to samples that were generated from a polynomial of order k = 2 (the underlying function).
But if there had been many more samples (possibly infinitely many), both regres-
sion functions would eventually converge to the true function. Hence, the model of
order k = 4 is not generally worse than the model of order k = 2, it is only worse for
a small number of samples. This leads to the following rules of thumb, which are in
accordance with Occam’s Razor:
– The smaller the dataset, the simpler the model should be, and
– the higher the number of parameters of the model (or classifier), the more samples
are required.
6.3 Exercises
where g k (m) denotes a Gaussian density? How many parameters need to be esti-
mated when using a Parzen window method instead?
(6.2) Given two classes ω1 and ω2 that are to be classified using four-dimensional
features m ∈ ℝ4 , how many parameters must be estimated when using a linear
classifier? How many parameters must be estimated for a maximum a posteriori
classifier, under the assumption that the features are class-conditionally normally
distributed, i.e., p(m | ω c ) = N(µc ,Σc ), c = 1,2?
(6.3) A micro-controller for Internet of Things applications can only save up to 256
parameters. You are tasked to use this micro-controller for classification in a six-
dimensional feature space.
1. How many linear classifiers can be realized using this micro-controller? How
many parameters will remain unused?
2. How many classes can be separated using a maximum a posteriori classifier
with multivariate Gaussian distributions as class-dependent feature distribu-
tions? How many parameters will remain unused?
In other words, what is the smallest dimension d in which more than 90% of the
probability mass will be outside of the hypercube [−2,5]d ?
7 Special classifiers
The remaining chapters of this book collect some further topics of pattern recognition.
Except for Section 9.4, these chapters deal with certain important classifier methods.
In contrast to the techniques of Chapters 3 to 5, these classifiers do not estimate a
distribution first, but try to find a classification rule directly from the given data instead.
This means that these classifiers follow the transduction path from Figure 5.1.
7.1 Linear discriminants

A linear discriminant is a linear function that operates on the feature space. With two
classes (c = 2), w ∈ ℝd and b ∈ ℝ, it is given by
k(m) = wT m + b (7.1)
ω̂ = ω̂(m) = { ω1        if k(m) > 0
            { ω2        if k(m) < 0          (7.2)
            { whatever  if k(m) = 0.
This means that the decision boundary is given by the (d − 1)-dimensional hyper-
plane H = {m ∈ 𝕄 | k(m) = 0}. Here, w is a vector perpendicular to the hyperplane and
b determines the distance to the origin. The oriented distance of a point to the plane
is given by D(m, H) = k(m), provided that w is normalized (‖w‖2 = 1).
The resemblance to the decision function of Chapter 3 is not an accident. In case
of a Bayesian classifier for normally distributed features with identical covariances,
the decision function is linear. However, here and in the following sections, the ap-
proach is reversed: instead of inspecting the decision boundary of a Bayesian classifier,
the decision boundary is explicitly stated. The parameters of the plane are directly
determined from the training samples in such a way that the classification error is
minimized. How to carry out such a minimization will be discussed in the following
sections.
Equation (7.2) shows the decision rule for the case of c = 2 classes, but the approach
can be extended to more classes as well. In the upcoming discussion, it is implicitly
assumed that the classes are linearly separable.
The most straightforward solution is to determine one hyperplane per class. Each
hyperplane divides the space into two half-spaces so that the samples that belong to
Fig. 7.1. Different techniques for extending linear discriminants to more than two classes: (a) one linear discriminant function per class; (b) one linear discriminant for each pair of classes. All these methods except for the linear machine introduce ambiguous regions.
the class fall in one half-space, while the samples that belong to the other classes fall
in the other half-space. An unseen sample is classified as belonging to class ω i if it
falls in the corresponding half-space, but not in a half-space that corresponds to the
other classes, i.e., as before
ω̂(m) = ω_i  ⇔  k_i(m) > 0 and k_j(m) < 0 for all j ≠ i.    (7.3)
The resulting decision regions are depicted in Figure 7.1a. A major drawback of
this approach is that a large volume of the feature space belongs to an ambiguous
region, where no classification is possible.
Another approach is to determine one hyperplane H_{i,j} with i, j = 1, . . . , c, i ≠ j,
for each pair of classes, resulting in c(c − 1)/2 linear discriminants. A sample is classified
as being in class ω i if its feature vector lies on the correct side of all hyperplanes that
separate ω i from the other classes ω j ,
ω̂(m) = ω_i  ⇔  k_{i,j}(m) > 0 for all j ≠ i.    (7.4)
This approach is depicted in Figure 7.1b and usually leads to smaller ambiguous re-
gions, at the cost of a more complicated classifier.
A third approach is given by the linear machine. As with the first approach, there
is only one linear discriminant k i for each class, but the decision rule is different: A
point is assigned to a class if the corresponding linear discriminant is larger than any
other:
ω̂(m) = ω_i  ⇔  k_i(m) > k_j(m) for all j ≠ i.    (7.5)
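A minimal sketch of the linear machine rule in Equation (7.5) (assuming numpy; the weight vectors, offsets, and the test point are arbitrary placeholders):

import numpy as np

def linear_machine(W, b, m):
    """Linear machine: one linear discriminant k_i(m) = w_i^T m + b_i per class,
    decide for the class with the largest value (Eq. (7.5))."""
    return int(np.argmax(W @ m + b))

# toy usage with c = 3 classes in a 2-dimensional feature space
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  0.0],    # w_2
              [ 0.0,  1.0]])   # w_3
b = np.array([0.0, 0.0, -1.0])
print(linear_machine(W, b, np.array([2.0, 0.5])))   # expected: 0 (class omega_1)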
Going back to the two-class case c = 2, there is another possibility of extending the
linear discriminants. Explicitly writing out the vectorized term,

k(m) = w^T m + b = ∑_{i=1}^{d} w_i m_i + b,    (7.7)

suggests that it can be extended by higher order combinations, e.g., quadratic terms,

k(m) = ∑_{j=1}^{d} ∑_{i=1}^{d} w_{ij} m_i m_j + ∑_{i=1}^{d} w_i m_i + b = m^T W m + w^T m + b.    (7.8)
Fig. 7.2. Nonlinear separation by augmentation of the feature space: (a) linear separation in the augmented, 3-dimensional feature space and (b) the corresponding nonlinear separation in the original 2-dimensional feature space. The purple surface in (a) shows the embedding of the original feature space, the orange plane is the decision boundary of the linear discriminant in ℝ³. The augmented feature vector is defined as y := (1, m1, m2, m1m2)^T and the parameters of the linear discriminant are a = (1, 0, 0, 1)^T.
k(m) = ∑_{i=0}^{d*} a_i y_i(m) = (a_0 , . . . , a_{d*}) (1, y_1(m), . . . , y_{d*}(m))^T = a^T y.    (7.9)
Fig. 7.3. Application to the reference example of Section 3.3.2. Decision regions of a linear regression classifier with augmented feature vector y = (1, m1, m2, m1m2, m1², m2²)^T. The training and testing errors are etrain = 8.5 % and etest = 8.5 %. The testing error asymptotically approaches etest ≈ 8.7 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Figure 7.3 shows the decision regions of a linear regression classifier with an aug-
mented feature vector (see the figure caption). By design of the feature vector, the decision
boundary is a conic section and is visually very similar to the decision boundary of the
Gaussian classifier in Figure 3.13. The linear regression classifier does not make any
explicit assumption about the density of the features. However, such assumptions are
implicit in the choice of feature augmentation.
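A minimal sketch of such an augmentation (assuming numpy; the parameter vector a is an arbitrary placeholder): a linear discriminant a^T y(m) in the augmented space describes a conic section in the original space.

import numpy as np

def augment(m):
    """Augmented feature vector y(m) = (1, m1, m2, m1*m2, m1^2, m2^2)^T,
    as used for the classifier shown in Figure 7.3."""
    m1, m2 = m
    return np.array([1.0, m1, m2, m1 * m2, m1**2, m2**2])

a = np.array([-9.0, 0.0, 0.0, 0.0, 1.0, 1.0])   # k(m) = m1^2 + m2^2 - 9
m = np.array([1.0, 2.0])
print(a @ augment(m))    # negative: m lies inside the circle of radius 3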
What remains are techniques to determine a separating hyperplane from the given
samples. The perceptron algorithm introduced by Rosenblatt [1957, 1962] serves as
the first example. A perceptron is a binary classifier (c = 2) and requires that the
training set D is linearly separable. The pseudocode to learn the classifier is shown in
Algorithm 7.1. For each sample mi , an indicator variable
z_i = { +1 if ω(m_i) = ω1
      { −1 if ω(m_i) = ω2          (7.10)
is introduced so that both correct and false classifications can be covered with a single
statement (see line 4 of Algorithm 7.1). This indicator variable is often found in binary
classifiers and will reappear in Section 7.7.
The perceptron algorithm starts with an arbitrary but fixed hyperplane, and it-
eratively constructs a sequence of hyperplanes until the training error is 0. This can
only work if the training data is linearly separable, but if it is, then the algorithm is
guaranteed to converge.
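Algorithm 7.1 is not reproduced here, but its core loop can be sketched as follows (assuming numpy; the stopping rule and the toy data are illustrative choices, and the exact bookkeeping may differ from the algorithm in the text):

import numpy as np

def perceptron(M, z, eta=1.0, max_epochs=1000):
    """Perceptron learning on a linearly separable set.
    M: (N, d) feature matrix, z: labels in {+1, -1} (z_i = +1 for omega_1)."""
    N, d = M.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for m_i, z_i in zip(M, z):
            if z_i * (w @ m_i + b) <= 0:      # sample on the wrong side (or on the plane)
                w += eta * z_i * m_i           # move the hyperplane towards the sample
                b += eta * z_i
                errors += 1
        if errors == 0:                        # training error is zero: stop
            return w, b
    raise RuntimeError("did not converge; data may not be linearly separable")

# toy usage
M = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
z = np.array([1, 1, -1, -1])
w, b = perceptron(M, z)
print(w, b)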
The speed of convergence depends on the samples, the norm of the normal vector w, and the
learning rate η. Novikoff [1962] showed that the influence of a single
update eventually vanishes and that the sequence of hyperplanes converges.
Theorem 7.1 (Novikoff’s Perceptron Theorem). Let D be linearly separable with a mar-
gin γ > 0. This means there exists a hyperplane given by w and b with ‖w‖ = 1 and
for all m_i ∈ D one has

z_i (w^T m_i + b) ≥ γ.    (7.11)
Fig. 7.4. Four steps of the perceptron algorithm (preliminary hyperplanes, final hyperplane, and an intuitively better hyperplane). The algorithm converged to a separating, but suboptimal hyperplane.
The hyperplane found by the perceptron algorithm depends on the initial hyperplane w0, b0, the learning rate η, and the order in which the samples
in D are processed. Theorem 7.1 already required the existence of a hyperplane with
margin γ. Surely, it would be desirable to find such a hyperplane with γ as large as
possible. This idea is picked up in Section 7.7.
The aim of linear regression is to find a linear function (see Equation (7.1)) that best
maps a set of input vectors to their corresponding output. Note that “best” is only
loosely defined and can, for example, mean minimal squared error, minimal absolute
error, or any other loss function.
In the context of pattern recognition, linear regression can be applied to learn a
linear decision function for each class ω i . The input is given by the dataset D and the
corresponding output is the (perfect) decision function, that is, the objective is to find
a decision function k_i(m) = a_i^T m with a_i ∈ ℝ^d that ideally satisfies

k_i(m) = a_i^T m = { 1 if ω(m) = ω_i
                   { 0 otherwise          (7.13)

for every i = 1, . . . , c. Given the training sample D, the optimization goals become

m_k^T a_i = z_ki := { 1 if ω(m_k) = ω_i
                    { 0 otherwise,        (7.14)
(m_1 , . . . , m_N)^T a_i = (z_1i , . . . , z_Ni)^T ,   i.e.,   M a_i = z_i ,    (7.15)

with M := (m_1 , . . . , m_N)^T and z_i := (z_1i , . . . , z_Ni)^T.
Writing out the squared error of Equation (7.16), taking the gradient with respect to a_i,
setting ∇_{a_i} = 0, and solving for a_i yields the optimal (in the sense of minimal squared
error) solution

â_i = (M^T M)^{-1} M^T z_i ,   i = 1, . . . , c.    (7.19)
The term (M^T M)^{-1} M^T is called a pseudo-inverse of the matrix M. Substituting this
result into Equation (7.13) gives the decision functions

k_i(m) = z_i^T M (M^T M)^{-1} m ,   i = 1, . . . , c.    (7.20)
As the pseudo-inverse and the feature vector m do not depend on ω_i, the entire decision
vector can be written as

k(m) = (z_1 , . . . , z_c)^T M (M^T M)^{-1} m.    (7.21)
Note that the decision function does not take into account an offset b. However,
an offset, as well as nonlinearities, can be included using the techniques discussed in
Section 7.1.2. The structure of the classifier remains the same.
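A minimal sketch of this classifier (assuming numpy; the function names and the toy training set are illustrative) computes the pseudo-inverse solution of Equation (7.19) and classifies by the largest decision function:

import numpy as np

def fit_linear_regression_classifier(M, labels, c):
    """Least-squares decision functions k_i(m) = a_i^T m (cf. Eq. (7.19)).
    M: (N, d) matrix of training features, labels: class indices 0..c-1."""
    N, d = M.shape
    Z = np.zeros((N, c))
    Z[np.arange(N), labels] = 1.0            # target z_ki = 1 iff m_k belongs to class i
    A = np.linalg.pinv(M) @ Z                # pseudo-inverse solution, shape (d, c)
    return A

def classify(A, m):
    return int(np.argmax(m @ A))             # assign the class with the largest k_i(m)

# toy usage with an augmented constant feature acting as offset (cf. Section 7.1.2)
M = np.array([[1.0, 0.2, 0.1], [1.0, 0.3, 0.2], [1.0, 0.9, 0.8], [1.0, 1.0, 0.9]])
labels = np.array([0, 0, 1, 1])
A = fit_linear_regression_classifier(M, labels, c=2)
print(classify(A, np.array([1.0, 0.25, 0.15])))   # expected: 0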
7.4 Artificial neural networks

Artificial neural networks are biologically inspired networks of artificial neurons and
synapses. The neurons sum the input from all incoming synapses, apply a nonlinear
function, and output the result of the computation to all outgoing synapses. Artificial
neural networks are modeled as directed graphs, where the nodes correspond to neu-
rons and the edges correspond to synapses. A special type of neural network is the
feed-forward network. These networks are directed acyclic graphs that are organized
into layers such that the neurons of one layer have outgoing edges only to the neurons
of the next layer, but not to neurons in the same or other layers. The first layer of a
feed-forward network is called the input layer and the last layer is called the output
layer. The layers in between are called hidden layers. The processing flow goes from
the input to the output layer, but not the other way around.
Figure 7.5 shows a feed-forward neural network with one hidden layer. The input
layer consists of (d+1) neurons, where d neurons distribute the features m1 , . . . ,m d to
the neurons of the hidden layer and one neuron outputs a constant level. Such neurons
are called bias neurons. The hidden layer consists of n neurons, where one neuron is,
again, a bias neuron and the other (n − 1) neurons compute the features from the
input layer. Finally, the output layer consists of c neurons, where in this example each
output computes the decision function k l for the corresponding class. Each synapse
is endowed with a weight. Here, w ji denotes the weight from the i-th input neuron to
the j-th hidden neuron, w̃ lj denotes the weight from the j-th hidden neuron to the l-th
output neuron, and b j and b̃ l denote the weights from the bias neurons.
Overall, this network computes the discriminant functions
k_l(m) = f( ∑_{j=1}^{n} w̃_lj f( ∑_{i=1}^{d} w_ji m_i + b_j ) + b̃_l ),   l = 1, . . . , c
       = f( w̃_l^T h + b̃_l )    with    h := ( f(w_1^T m + b_1), . . . , f(w_n^T m + b_n) )^T.    (7.22)
The activation function f(⋅) is not further specified, but a typical choice is the Fermi
function f(ξ) := 1/(1 + e^{−ξ}), which approaches 0 as ξ goes to −∞, approaches 1 as ξ goes
to ∞, and is 1/2 for ξ = 0 (a sigmoid activation). Surprisingly, feed-forward neural net-
works with one hidden layer can represent any continuous function m ↦ k that maps
features to decision vectors. This result can be proven using Kolmogorov's General
Representation Theorem (Kolmogorov [1963]):

Theorem 7.2 (Kolmogorov). Every continuous function k : [0,1]^d → ℝ can be written as a superposition

k(m) = ∑_{j=1}^{2d+1} Ξ_j ( ∑_{i=1}^{d} ψ_ij(m_i) ).    (7.23)
In general, Ξ j and ψ ij are nonlinear functions. Because of this result, neural networks
are sometimes also called universal function approximators.
A feed-forward neural network is typically trained using backpropagation. Back-
propagation iteratively minimizes the training error by propagating the error from the
output layer to the input layer and adjusting the weights on the way. More formally,
the training performs a gradient descent on the squared training error

e := ∑_{v=1}^{N} ‖k(m_v) − ω(m_v)‖² ,    (7.24)

updating every weight in the direction of the negative gradient, e.g., w_ji ← w_ji − η ∂e/∂w_ji,
where η denotes the learning rate (likewise for w̃_lj, b_j, and b̃_l). Details on the algorithm can
be found, e.g., in Duda et al. [2001].
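The forward pass of Equation (7.22) is compact in code; a minimal sketch (assuming numpy; the weights here are random placeholders, not trained):

import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))       # the Fermi function used as activation

def forward(m, W, b, W_tilde, b_tilde):
    """Forward pass of a feed-forward net with one hidden layer (Eq. (7.22)).
    W: (n, d) input-to-hidden weights, W_tilde: (c, n) hidden-to-output weights."""
    h = sigmoid(W @ m + b)                  # hidden activations
    return sigmoid(W_tilde @ h + b_tilde)   # decision vector k(m)

# toy usage: d = 2 features, n = 3 hidden neurons, c = 2 outputs (random weights)
rng = np.random.default_rng(4)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
W_tilde, b_tilde = rng.normal(size=(2, 3)), rng.normal(size=2)
print(forward(np.array([0.5, -1.0]), W, b, W_tilde, b_tilde))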
The advantages of feed-forward neural networks are that they require no prior
knowledge and are relatively easy to configure, yet allow approximating arbitrary
decision functions and afford a very fast classification.
Fig. 7.6. Application to the reference example of Section 3.3.2. Decision regions of a feed-forward
neural network with two hidden layers with 10 neurons each. Note that a different initialization
usually results in vastly different decision boundaries. The training and testing errors are etrain =
8 % and etest = 5.5 %. The testing error asymptotically approaches etest ≈ 7.6 %. The training set is
the same as in Figure 3.8. Test samples are shown with hollow marks.
On the other hand, large neural networks involve a large number of parameters
to estimate, with all the associated problems (see Section 6.1). Training large neural
networks is computationally very expensive, especially if the training set is also large.
There are no clear guidelines for how to structure a neural network for a given prob-
lem, and it is difficult to interpret the underlying computation—a neural network is a
mathematical black box that only allows numerical interpretation. Neural networks
are prone to overfitting the training data and there is no guarantee that the learned
parameters constitute a global optimum, as the loss function in Equation (7.24) is non-
convex. Nonetheless, neural networks achieve remarkable results in many different
application domains.
Figure 7.6 shows the decision regions of a feed-forward neural network with two
hidden layers, each of which contains ten neurons. The decision boundary is compli-
cated and does not seem to match the optimal decision boundary well, yet with 7.6 %,
the asymptotic testing error is only 1.5 percentage points larger than the Bayes error
rate. However, the decision regions can look vastly different when the architecture
is changed, e.g., when using a different number of hidden layers, or if the layers con-
tain a different number of hidden neurons. Even a different choice of initial weights
can have a significant impact on the decision boundary. For these reasons, a thor-
ough evaluation of the parameter space is paramount when using this highly flexible
classifier.
7.5 Autoencoders
An artificial neural network can also be used to compress data by letting it learn to
reproduce its input (i.e., to learn the identity function). If the hidden layer consists
of fewer neurons than the input layer, the network is forced to learn some lower-
dimensional encoding of the data. Such networks are called autoencoders. As an ex-
ample (which is due to Ritter et al. [1990]), consider a dataset of eight training vectors
D = {m_1 , . . . , m_8} with m_i ∈ ℝ⁸ for i = 1, . . . , 8. Let the i-th entry of m_i be 9/10 and
every other entry be 1/10:

m_i := (1/10 , . . . , 9/10 , . . . , 1/10)^T ,    (7.26)

where the entry 9/10 is at the i-th position.
Figure 7.7 shows the activation of each neuron (except the bias neurons) of an
autoencoder that was trained on this dataset. The net was trained for 5000 iterations
with a learning rate of η = 0.25 and the final mean squared training error (see Equa-
tion (7.24)) was e = 0.047. It can be seen that while the input is reconstructed almost
perfectly, the compression in the hidden layer is not lossless.
Note that the compression is only valid for the seen data: unseen training samples
may not be compressed with a low reconstruction error. For example, this network pro-
duces a squared error of e = 6.51 when reconstructing the vector m = 1 = (1, . . . ,1)T .
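A small re-implementation of this experiment can be sketched as follows (assuming numpy; the initialization is random, so the final error values will differ from the ones reported above):

import numpy as np

rng = np.random.default_rng(5)

# training vectors of Equation (7.26): 9/10 at the i-th position, 1/10 elsewhere
X = np.full((8, 8), 0.1) + 0.8 * np.eye(8)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 8-3-8 network: the 3-neuron hidden layer forces a compressed encoding
W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(8, 3)), np.zeros(8)
eta = 0.25                                    # learning rate as in the text

for _ in range(5000):
    H = sigmoid(X @ W1.T + b1)                # hidden codes, shape (8, 3)
    Y = sigmoid(H @ W2.T + b2)                # reconstructions, shape (8, 8)
    dY = 2 * (Y - X) * Y * (1 - Y)            # backpropagated error at the output layer
    dH = (dY @ W2) * H * (1 - H)              # ... propagated to the hidden layer
    W2 -= eta * dY.T @ H
    b2 -= eta * dY.sum(axis=0)
    W1 -= eta * dH.T @ X
    b1 -= eta * dH.sum(axis=0)

train_error = np.sum((sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2) - X) ** 2)
unseen = np.ones(8)                           # an input the network has never seen
unseen_error = np.sum((sigmoid(W2 @ sigmoid(W1 @ unseen + b1) + b2) - unseen) ** 2)
print(train_error, unseen_error)              # the second value is typically much larger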
7.6 Deep learning

Theorem 7.2 implies that a neural network with one hidden layer is sufficient to rep-
resent any function and therefore derive arbitrarily complicated decision boundaries,
provided that it contains enough neurons. Yet there are still reasons to prefer networks
with many layers with fewer neurons: deeper networks have the same approximation
capabilities as shallow networks, but generally require fewer parameters (see Schmid-
huber [2015]). For example, there are functions that require a polynomial number of
parameters (w.r.t. the number of inputs) in a network with n hidden layers, but require
an exponential number of parameters with n − 1 layers (Schmidhuber [2015]). As dis-
cussed in Section 6.1, having fewer parameters typically leads to better generalization
properties and reduces the risk of overfitting the training data.
Multiple layers can also be used to model hierarchical part–object relationships,
where the first few layers model the individual parts, and the following layers model the
composition of those parts. A typical example of such a model is a car, where the first
few layers could model car parts, such as wheels, doors, windows, headlights, etc., and
the following layers model the car’s frame and body. Lastly, the human visual cortex is
also organized in many hierarchical layers, that fulfill increasingly complicated tasks.
Unfortunately, there is no clear definition of what constitutes a “deep” network.
Generally, a neural network with one hidden layer is considered shallow, whereas
a network with ten hidden layers is already considered deep. One of the pioneering
works in deep learning, LeNet-5 by LeCun et al. [1998] had seven layers, but other
architectures can have more than 1000 layers (He et al. [2016]).
Artificial neural networks with multiple layers are typically trained using the back-
propagation algorithm, which, as mentioned above, performs a gradient descent to
minimize the squared prediction error on the training set. However, deep networks
cause two major issues with backpropagation: First, the gradient may vanish or ex-
plode during the backward pass due to exponential changes from one layer to another.
In effect, gradient descent may require an unfeasibly large number of iterations to con-
verge. Second, since gradient descent only considers local information about the first
derivative, backpropagation often becomes stuck at saddle points or in local optima.
This means that even if the gradient descent converges, there is no guarantee that the
solution found is also a good solution.
Deep networks also tend to have large parameter spaces and therefore need more
training data than models with fewer parameters. Such training data requires storage
and computation time, neither of which were always as abundant as they are today.
Furthermore, there were no clear concepts for representing practical problems or en-
coding prior knowledge in deep neural networks. At the same time, alternatives such
as the SVM achieved a similar classification performance, but had a much more solid
theoretical foundation.
In recent years, these issues have largely been solved. There are huge datasets,
such as ImageNet (Deng et al. [2009]), that contain millions of labeled training samples.
Graphical processing units allow significantly accelerating the computation of the
gradient and therefore allow many more iterations with small learning rates. This
means that even an almost vanishingly small gradient can be sufficient to escape a
saddle point, provided that it is followed for long enough.
But there have also been significant theoretical advances to improve the training
algorithm. Unsupervised pre-training allows using unlabeled training data to initialize
a network with a good solution, before the supervised gradient descent is performed.
Stochastic gradient descent, momentum, and weight decay speed up the training and
avoid falling into local optima. Rectified linear units instead of sigmoid activation
avoid vanishing gradients with large activations.
Similarly, specialized architectures have led to breakthroughs in certain areas.
The long short term memory (LSTM) (see Hochreiter and Schmidhuber [1997]) ap-
proach is well suited to handle sequential data, such as audio, text, or time series.
Since LSTMs “remember” training errors during backpropagation, they also solve the
problem of vanishing gradients. Convolutional neural networks convolve the input
data with banks of trainable filters and are especially well suited for multidimensional
data that contain repeating structures, e.g., image data.
A detailed discussion of these techniques is outside the scope of this book, but we
will briefly explore the fundamental ideas and motivations in the following.
In unsupervised pre-training, the weights are initialized by treating each layer individ-
ually. Here, the goal is to find a good initialization for the supervised backpropagation.
The intuition is that the pre-training will move the parameter vector near a local opti-
mum, which can then be quickly found by gradient descent. Unsupervised pre-training
also does not require annotated training data, but only unlabeled data, which is much
easier to obtain in large quantities. This reduces the need for labeled data, since the
supervised training will only run for a short time. The only requirement is that the
unlabeled data must be from the same domain as the labeled data, e.g., images of cars,
if the goal is to classify car models, etc.
Fig. 7.8. Pre-training with stacked autoencoders decomposes training a deep network layer by layer.
Each layer is trained as an autoencoder of its input, where the input is the output of the previous
layer.
Stochastic gradient descent approximates the gradient using random subsets of the
training data. In particular, in the t-th iteration, only the batch Bt ⊂ D is used to
approximate the gradient,
∂e/∂w_t = ∑_{v∈D} ∂e_v/∂w_t ≈ ∑_{v∈B_t} ∂e_v/∂w_t .    (7.27)
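The quality of the batch approximation in Equation (7.27) can be illustrated numerically; a minimal sketch (assuming numpy; the least-squares toy objective, batch size, and seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(6)

# toy squared-error objective e(w) = sum_v (x_v^T w - y_v)^2 over a large training set D
X = rng.standard_normal((10000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(10000)
w = np.zeros(5)

grad_full = 2 * X.T @ (X @ w - y)                        # gradient over all of D
batch = rng.choice(len(X), size=64, replace=False)       # a random batch B_t
grad_batch = 2 * X[batch].T @ (X[batch] @ w - y[batch])  # batch approximation

cos = grad_full @ grad_batch / (np.linalg.norm(grad_full) * np.linalg.norm(grad_batch))
print(cos)   # typically well above 0.9: the batch gradient points in almost the same direction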
A rectified linear unit (ReLU) denotes an activation function that avoids the problem
of vanishing gradients. In particular, the ReLU activation function is f(ξ) := max(0, ξ). In
contrast to the sigmoid activation f(ξ) = e^ξ/(1 + e^ξ), its derivative does not decay towards
zero for large positive activations (see Figure 7.10a).

Fig. 7.10. (a) The ReLU activation f(ξ) = max(0, ξ) compared to the sigmoid activation f(ξ) = e^ξ/(1 + e^ξ). Panels (b) and (c) illustrate the ReLU nonlinearity and the max pooling operation applied between consecutive layers.
Fig. 7.11. High level structure of a toy example convolutional neural network with Q = 2 convolution
blocks (convolution layer, ReLU, max pooling) and R = 2 fully connected layers. Stride and padding
are not shown in the figure. All nodes in the last feature maps are connected without restriction to
the nodes in the following (non-convolutional) hidden layer. In the figure this is indicated by the
gray shading between these layers. Note: Real networks, e.g., in Krizhevsky et al. [2012], are usually
much larger.
To reduce the computational effort and the number of parameters to learn, the
convolution is often combined with a downscaling operation. A stride of n computes
the convolution only at every n-th position in both spatial directions and therefore
shrinks the output by a factor of n² (a factor of n in both directions). For example, the
output of a convolution layer with stride two is one-fourth the size of the input, because
the convolution is computed only on every odd row and column of the input. Stride is
also often used to replace the pooling layer, as both have a similar effect, but strides
significantly reduce the computation time.
Another reduction due to convolution is that positions on the boundary are omit-
ted, because the convolution can only be computed at positions where the convolution
matrix fully fits into the image. As this reduction is typically undesired, the input can
be padded by a certain number of pixels to allow convolution at these positions. There
is no clear guideline on how to fill the padded area, but common approaches are to
tile the input periodically, or to reflect the input at the border pixels.
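The bookkeeping for stride and padding reduces to a simple formula for the spatial output size; a minimal helper (an illustration, not taken from the text; the examples assume a 5×5 convolution matrix on a 32×32 input):

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution layer along one dimension.
    A stride of n shrinks the output roughly by a factor of n; padding
    compensates for the positions lost at the border."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, 5))                       # 28: border positions are lost
print(conv_output_size(32, 5, padding=2))            # 32: padding restores the size
print(conv_output_size(32, 5, stride=2, padding=2))  # 16: stride 2 halves each direction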
The next layer performs a nonlinear mapping of the convolution result, usually
using the ReLU activation function (see Figure 7.10b). Lastly, a so called max pooling
layer subsamples the result by propagating only the maximum activation within a
small neighborhood to the next layer (Figure 7.10c). These max pooling layers cause
the features to become tolerant to translation and improve the tolerance for noise in the
input data. At the same time, max pooling leads to data reduction (Figure 7.11). Com-
bined with successive convolutions, this means that later stages see a larger portion
of the input image.

Fig. 7.12. Image patches that produce large filter responses in a convolutional neural network with
five convolution blocks. For each block, six convolution matrices are shown. The first layer responds
mainly to color, edges, and corners. The following layers capture increasingly complicated structures
and even whole concepts, e.g., “group of humans,” in the fifth layer. The images were generated as
described in Zeiler and Fergus [2014].
These Q convolution blocks are followed by R fully connected layers (the final
layers in Figure 7.11), where typically Q > R. The fully connected layers play the role of
a regular multilayer feed-forward neural network that classifies the features extracted
by the convolutional layers. However, the parameters of both the feature extraction
and the classification are learned at the same time. In effect, the features are very well
tuned to the classifier and vice versa.
CNNs are usually trained to minimize the error on the training set using (stochastic)
gradient descent, and not using pre-training. This way, it is ensured that the convo-
lution matrices in the same layer learn different weights, and that the convolution
matrices of consecutive layers are geared to each other.
To avoid overfitting in the fully connected layer, one can use so called dropout,
where a random fraction of neurons are deactivated in each training iteration (Srivas-
tava et al. [2014]). The idea is that dropout forces the network to introduce redundan-
cies and therefore reduces the risk of overfitting.
CNNs have been shown to be remarkably successful at difficult image recognition
tasks. For example, prior to the advent of CNNs, the state of the art in image categoriza-
tion (the task of classifying an image into one of many categories) achieved an error
rate of 45.7 % on the ImageNet dataset. The convolutional neural network approach
by Krizhevsky et al. [2012] reduced this error rate to 37.5 %! Their network consisted
of five convolution blocks followed by two fully connected layers.

Fig. 7.13. Detection and classification of vehicles in aerial images with CNNs. Red and blue boxes
show detection of cars and trucks, green and turquoise boxes show the corresponding ground truth.
Results of Sommer et al. [2017], images from the DLR3K dataset (Liu and Mattyus [2015]).
Here, and with CNNs in general, the different convolution blocks produce different
types of features: The first block essentially detects edges and corners; the second
block responds to primitive textures; the third block detects parts of objects, etc. This
can be seen in Figure 7.12. The figure shows image patches that produce large filter
responses for six distinct convolution matrices for each layer in a convolutional neural
network with five convolution blocks. Although the details depend on the input data
and the architecture of the network, it is common to all CNNs that the deeper the layer,
the higher the level of abstraction in the corresponding features. Remarkably, a similar
structure is found in the human visual cortex.
on a training set, where mi and mj denote the output of the network for the i-th and
j-th training sample, and y ij ∈ {−1,1} is an indicator variable that is 1 if the training
samples show the same person and −1 otherwise. b > 0 is the classification threshold,
i.e., mi and mj are said to belong to the same person iff ‖mi − mj ‖2 ≤ b. Once trained,
only one of the CNNs is used in classification and the other one is discarded. Such a
training procedure with two or more networks with the same structure (but different
parameters) is also known as a Siamese setup (Bromley et al. [1993]).
7.7 Support vector machines

The support vector machine (SVM) is one of the most versatile classifiers. It is relatively
simple, yet extremely powerful, and provides good generalization even with a small
number of training samples. The SVM is a linear discriminant classifier for two classes
(c = 2), but can be extended to multiple classes using the techniques discussed in
Section 7.1.1. The SVM can be explained based on five fundamental ideas:
1. Linear separation with maximum distance of the separating hyperplane to the
nearest training samples (the support vectors).
2. Dual formulation of the linear classifier to reduce the number of parameters to
estimate.
3. Nonlinear mapping of the features to a high-dimensional feature space Φ.
4. Implicit use of the (possibly ∞-dimensional) space of eigenfunctions of a so-called
kernel function K as the transformed feature space Φ. The transformed features do
not have to be explicitly computed and the classifier has a small number of free
parameters even though dim(Φ) is large (kernel trick).
5. Relaxation of the linear separability requirement by introducing slack variables.
These ideas will be discussed in the following. For now, it is assumed that the training
set D = {m1 , . . . ,mN }, m ∈ ℝd is linearly separable. This assumption will be relaxed
in the fifth step.
As was already seen in Section 7.2, typically more than one linear discriminant can sep-
arate a given dataset. Yet, some of the discriminants are intuitively “better” than others,
because they generalize better. One method of finding such
discriminants is to impose additional conditions on the separating hyperplane. With
SVMs, the goal is to find those hyperplane parameters (w,b) such that the margin γ,
the distance between the hyperplane and the closest training samples, is maximized:
Fig. 7.15. A linearly separable training set (classes ω1 and ω2), the optimal separating hyperplane with maximum margin γ, and the support vectors {m+} and {m−}.
The intuition is that a larger margin results in better generalization. This intuition
was already expressed in Figure 7.4, where the final hyperplane of the perceptron
does separate the training data, but the margin is very small. The dashed line marks a
hyperplane with a larger margin.
Interestingly, there is exactly one hyperplane that satisfies Equation (7.33), pro-
vided that D is linearly separable. This hyperplane is fully defined by the support
vectors {m+ } and {m− }, i.e., the vectors that are closest to the hyperplane (see Fig-
ure 7.15). The SVM concentrates only on the boundaries between classes and therefore
only on the most difficult samples.
But how does one estimate the parameters w and b from a training set D and how
does the margin γ relate to w and b? To derive an answer, recall the linear decision
function
k(m) = w^T m + b = ( ∑_{i=1}^{d} w_i m_i ) + b = ⟨w, m⟩ + b,    (7.34)
where m is assigned to the class ω1 if k(m) > 0 and to ω2 otherwise; that is, the
classification depends on the sign of the decision function k(m). Observe further that
the sign of the decision function does not change when the parameters are scaled
by some factor β > 0, that is, sign(wT m + b) = sign(βwT m + βb). By definition, the
support vectors {m+ } of ω1 and {m− } of ω2 lie exactly on the margin and the optimal
separating hyperplane has the same distance from all support vectors (see Figure 7.15).
w^T m+ + b = +1   and   w^T m− + b = −1   ⇒   w^T m+ − w^T m− = 2.    (7.35)
In other words, the margin γ = ‖w‖^{-1} assumes its maximum iff ‖w‖ becomes
minimal. This translates into the following optimization problem:
Taking the derivative of L with respect to w and b and setting the partial derivatives
to 0 yields
∂L/∂w = w − ∑_{i=1}^{N} z_i α_i m_i = 0   ⇔   w = ∑_{i=1}^{N} z_i α_i m_i   and    (7.39)

∂L/∂b = ∑_{i=1}^{N} z_i α_i = 0.    (7.40)
The primal formulation of the decision function in Equation (7.34) suggests that there
are (d + 1) parameters to estimate: d parameters for w and one parameter for b. If w is
normalized, the number of free parameters reduces to d. Yet, above it was hinted that
the hyperplane is fully determined by the support vectors. If the number of support
vectors is smaller than d, this means that there are actually fewer parameters that need
to be estimated.
Indeed, Equation (7.39) shows that the weight vector w can be written as a linear
combination of the training samples (the same result is used in the perceptron algo-
rithm, see line 5 in Algorithm 7.1). Substituting w = ∑Ni=1 α i z i mi in Equation (7.34)
yields
N
k(m) = ⟨w, m⟩ + b = ⟨ ∑ α i z i mi , m⟩ + b
i=1
N
= ∑ α i z i ⟨mi , m⟩ + b. (7.41)
i=1
Equation (7.41) is called the dual form of Equation (7.34). In this formulation, the
number of free parameters is N: (N − 1) parameters for the α i (recall Equation (7.40))
and one parameter for b. This number does not depend on the dimensionality d of
the feature space! Below it will be seen that the α i are nonzero only for the support
vectors. Also note that the feature vectors mi in Equation (7.41) only appear inside of
inner products, but not on their own. This will become important in Section 7.7.4.
Substituting Equations (7.39) and (7.40) in Equation (7.38) yields the dual formu-
lation of the Lagrange function:
L(α) = ∑_{i=1}^{N} α_i − (1/2) ∑_{(i,j)=(1,1)}^{(N,N)} z_i z_j α_i α_j ⟨m_i , m_j⟩ → max.    (7.42)
The dual formulation depends only on the dual variables α i and must be max-
imized instead of minimized. That is, if α∗ solves the following dual constrained
quadratic optimization problem,
Maximize:    ∑_{i=1}^{N} α_i − (1/2) ∑_{(i,j)=(1,1)}^{(N,N)} z_i z_j α_i α_j ⟨m_i , m_j⟩

Subject to:  ∑_{i=1}^{N} z_i α_i = 0   and
             α_i ≥ 0,   i = 1, . . . , N,    (7.43)
then the vector w* = ∑_{i=1}^{N} z_i α*_i m_i realizes the linear classifier with maximum margin
γ = ‖w*‖^{-1}. For b* it follows that

b* = −(1/2) ( max_{z_i = −1} {⟨w*, m_i⟩} + min_{z_i = 1} {⟨w*, m_i⟩} ).    (7.44)
The above constrained quadratic optimization problem can be solved very effi-
ciently, e.g., with quadratic programming techniques. Note that the solution (w∗ ,b∗ )
is unique. Non-support vector samples mi , i.e., the mi that do not fall on the boundary
of the margin, do not influence the solution at all. During the optimization, the corre-
sponding weights α∗i vanish: only the α∗j that correspond to support vectors are ≠ 0.
This is, in fact, the reason for calling these samples support vectors.
Following this reasoning, the number of parameters that need to be estimated for
the SVM classifier is neither d (as suggested by Equation (7.34)) nor N (as suggested
by Equation (7.41)). Rather, the number of non-vanishing parameters is equal to the
number of support vectors, |SV|, where SV denotes the set of support vectors. This
explains why an SVM is able to find a separating hyperplane in very high-, and even
infinite-dimensional spaces.
ϕ : 𝕄 → Φ ,    m ↦ ϕ(m) = (φ_1(m), . . . , φ_{d*}(m))^T ,    (7.45)
where w ∈ ℝ^{d*}.
In this formulation, the number of free parameters is (d∗ + 1), which means that
additional parameters have to be estimated and all the discussed drawbacks apply
(see Sections 6.1 and 7.1.2). Note that although the separation is nonlinear in 𝕄, it
is linear in Φ. In essence, the linear separation with maximum margin to the near-
est samples {ϕ(m)} is conducted in the space Φ instead of the space 𝕄. If d∗ > d
(which is generally, but not necessarily, the case), the mapping ϕ(m) determines a
d-dimensional sub-manifold in Φ.
Note that in a two-class problem (c = 2), a dataset D = {m1 , . . . ,mN } of d-
dimensional feature vectors mi can always be linearly separated if d ≥ N − 1 and
the {mi } do not reside in a (d − 1)-dimensional subspace of 𝕄. As a consequence,
linear separation may always be achieved by a suitable mapping ϕ(⋅).
Recall the dual form of the decision function in Equation (7.41). When applying
the mapping ϕ(⋅), the decision function becomes
k(m) = ∑_{i=1}^{N} α_i z_i ⟨ϕ(m_i), ϕ(m)⟩ + b.    (7.47)
Definition 7.3 (Kernel function). A function K is said to be a kernel function (or kernel
for short) if for all m, m′ ∈ 𝕄

K(m, m′) = ⟨ϕ(m), ϕ(m′)⟩.
This is the basic insight of the kernel trick: an inner product of mapped feature
vectors may be replaced by a kernel function that computes both in one step. The
necessary and sufficient conditions for an arbitrary bivariate function K(⋅,⋅) to be a
kernel function are given by Mercer’s theorem (Mercer [1909]).
with positive coefficients λ_j > 0 (i.e., K denotes an inner product in the feature space
Φ associated with K) iff

∫_𝕄 ∫_𝕄 K(m, m′) f(m) f(m′) dm dm′ ≥ 0

for all f ≢ 0 with ∫ f²(m) dm < ∞. The λ_j and φ_j(⋅) are the solutions to the eigenvalue
problem

∫_𝕄 K(m, m′) φ(m′) dm′ = λ φ(m).    (7.52)
Given a function K(⋅, ⋅) that satisfies the hypotheses of Mercer’s theorem (and hence is
a kernel function), one can use this function in the dual formulation of the classifier in
Equation (7.49) to implicitly use the possibly infinite-dimensional transformed feature
space Φ without needing to explicitly compute the corresponding feature vectors {ϕj }.
In other words: the kernel function K induces the feature space Φ.
Note, again, that even though the feature vector ϕ(m) may have a very high, pos-
sibly even infinite dimensionality d∗ , the classifier in Equation (7.49) is still fully de-
termined by only N free parameters.
The simplest kernel function is the scalar product itself, K(m, u) := ⟨m, u⟩. The corresponding mapping is the identity function ϕ(m) = m; therefore, this kernel is also called the linear kernel.
A more interesting kernel arises by squaring the scalar product:
K(m,u) := ⟨m, u⟩² = ( ∑_{i=1}^{d} m_i u_i )² = ( ∑_{i=1}^{d} m_i u_i )( ∑_{j=1}^{d} m_j u_j )
        = ∑_{i=1}^{d} ∑_{j=1}^{d} m_i m_j u_i u_j = ∑_{(i,j)=(1,1)}^{(d,d)} (m_i m_j)(u_i u_j).    (7.54)
The corresponding mapping is ϕ(m) = (m_1 m_1 , m_1 m_2 , . . . , m_d m_d)^T, i.e., the vector
of all monomials of degree 2 of the entries {m_i}. The dimensionality of Φ is
d* = (1/2)(d + 1)d, because the terms m_i m_j appear twice for every i ≠ j, but must
only be counted once.
The above kernel function can be modified by adding a constant c ∈ ℝ before
squaring:
K(m,u) := (⟨m, u⟩ + c)² = ( ∑_{i=1}^{d} m_i u_i + c )( ∑_{j=1}^{d} m_j u_j + c )
        = ∑_{i=1}^{d} ∑_{j=1}^{d} m_i m_j u_i u_j + 2c ∑_{i=1}^{d} m_i u_i + c²
        = ∑_{(i,j)=(1,1)}^{(d,d)} (m_i m_j)(u_i u_j) + ∑_{i=1}^{d} (√(2c) m_i)(√(2c) u_i) + c².    (7.56)
The dimensionality of the corresponding feature space is

d* = (d + 2 choose 2) = (1/2)(d + 2)(d + 1).    (7.57)
Akin to the reasoning above, the transformed feature vectors contain all monomi-
als of degree ≤ q. For this reason, this kernel function is also known as the polynomial
kernel. Together with the linear kernel, the polynomial kernel is one of the standard
kernels often used with SVMs.
Another popular kernel is the Gaussian kernel, or radial basis function (RBF) ker-
nel:
K(m,u) := exp{ −‖m − u‖² / σ² }.    (7.59)
Unlike the linear and polynomial kernels, this kernel produces a mapping into an
infinite-dimensional space Φ, that is, the eigenfunction decomposition in Theorem 7.4
has an infinite number of solutions (Rasmussen and Williams [2006]).
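The kernel identities above can be verified numerically; the following is a minimal sketch (assuming numpy; the vectors and the constant c are arbitrary) that checks that (⟨m, u⟩ + c)² equals the inner product of the explicitly mapped feature vectors from Equation (7.56):

import numpy as np

rng = np.random.default_rng(7)
m, u = rng.standard_normal(3), rng.standard_normal(3)
c = 2.0

def phi(v, c):
    # explicit mapping for K(m, u) = (<m, u> + c)^2: all products v_i v_j,
    # the scaled entries sqrt(2c) v_i, and the constant c
    return np.concatenate([np.outer(v, v).ravel(), np.sqrt(2 * c) * v, [c]])

kernel = (m @ u + c) ** 2
inner = phi(m, c) @ phi(u, c)
print(kernel, inner)     # both values agree (up to rounding)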
A decision function can also be found when a kernel function is used. The opti-
mization problem simply swaps the scalar product in Equation (7.43) for a kernel:
Maximize:    ∑_{i=1}^{N} α_i − (1/2) ∑_{(i,j)=(1,1)}^{(N,N)} z_i z_j α_i α_j K(m_i , m_j)

Subject to:  ∑_{i=1}^{N} z_i α_i = 0   and
             α_i ≥ 0,   i = 1, . . . , N.    (7.60)
is equivalent to the hyperplane in the induced space Φ with maximum margin (Cris-
tianini and Shawe-Taylor [2000])

γ = 1 / √( ∑_{m_i ∈ SV} α*_i ).    (7.62)
Note that it follows from Mercer's theorem that the (kernel) matrix (K(m_i , m_j))_{i,j=1}^{N}
is positive definite. This means that the optimization problem in Equation (7.60) is
convex and has a unique global optimum that can be found using, e.g., quadratic
programming.
Figure 7.16 shows the decision regions of a hard margin SVM with a Gaussian kernel
(σ = 4), learned from the ongoing reference dataset. The decision regions are very com-
plicated and rugged. Clearly, this classifier overfits the data and does not generalize
well. This can also be seen in the low training error of etrain = 3.5 %, but rather large
testing error of etest = 11 %. Note that in theory a hard margin SVM should have no
training error (etrain = 0). In practice this is rarely achieved, as the optimization of
Equation (7.60) is usually terminated before the (true) optimum is reached.
Probability of error
The probability of making an error when classifying unseen samples mi ∈ ̸ D may be
eye-balled as
P( ω̂(m) ≠ ω(m), m ∉ D ) ≈ |SV| / N .    (7.63)
The reasoning goes as follows. If some training sample mi ∈ D is left out in train-
ing, it will be correctly classified if it is not a support vector. If it is a support vector,
there is a chance that it will be misclassified by the SVM trained on the reduced train-
ing set. If this is repeated for all mi ∈ D, one makes at most |SV| errors (leave-one-out
argument). It follows that for a fixed training set, the classifier with fewer support
vectors will perform better (consistently with Occam’s razor).
Fig. 7.16. Application to the reference example of Section 3.3.2. Decision regions of a hard margin
SVM classifier with Gaussian kernel (σ = 4). The training error is etrain = 3.5 % and the testing error
is etest = 11 %. The latter asymptotically approaches etest ≈ 11.2 %. The training set is the same as
in Figure 3.8. Test samples are shown with hollow marks.
So far we have assumed that the dataset D is linearly separable (either in 𝕄 or in Φ).
However, there are cases where D is not linearly separable or separable only with
a small margin. To allow the SVM to work well in these cases, one introduces the so
called slack variables ξ i ≥ 0, i = 1, . . . ,N that measure how much a training sample mi
violates the margin or even how far mi lies on the wrong side of the separating hyper-
plane, see Figure 7.17. For the linear classifier with maximum margin, the optimization
goal becomes
Minimize:    ⟨w, w⟩ + C ∑_{i=1}^{N} ξ_i²
The design parameter C > 0 defines how much emphasis should be put on correct
classification (i.e., C is large) versus a large margin (i.e., C is small). A dual formulation
of the above and the use of a kernel K instead of the scalar product ⟨w, m⟩ leads to
Fig. 7.17. Geometric interpretation of the slack variables ξ_i, i = 1, . . . , N: the slack variables measure (w.r.t. the margin γ) if, and how far, training samples penetrate the margin.
α_i ≥ 0,   i = 1, . . . , N,    (7.65)

where the offset b* is chosen so that z_i k(m_i) = 1 − α*_i/C for all i with α*_i ≠ 0. A proof of
the above can be found, for example, in Cristianini and Shawe-Taylor [2000].
The decision regions of a soft margin SVM can be seen in Figure 7.18. Again, the
classifier was trained using the ongoing reference dataset from Section 3.3.2. As with
Figure 7.16, the SVM uses a Gaussian kernel, albeit here the kernel parameter was
chosen to be σ = 1. The design parameter was chosen to be C = 1. Unlike with a hard
margin, the soft margin SVM does not overfit the training data, but generalizes well.
The decision boundaries are reasonably close to the decision regions of the Bayesian
optimal classifier and the asymptotic testing error is only 0.7 percentage points above
the optimal Bayes error rate. Different choices for σ and C vary the shape of the decision
boundary: higher values of σ generally lead to a smooth decision boundary, whereas
higher values of C lead to a more complicated boundary.
Fig. 7.18. Application to the reference example of Section 3.3.2. Decision regions of a soft margin
SVM classifier with Gaussian kernel (σ = 1) and C = 1. The training and testing errors are etrain =
etest = 6.5 %. The testing error asymptotically approaches etest ≈ 6.8 %. The training set is the
same as in Figure 3.8. Test samples are shown with hollow marks.
In practice, one can estimate the hyperparameter C and the kernel parameters (σ
in the example) using the validation set V. To this end, an SVM classifier with fixed
hyperparameters is trained using the training set D and the classification performance
on V is recorded. The process is repeated for various combinations of the parameters,
where the parameters are often determined using a rule (e.g., grid search, where the
parameters are drawn from a regular grid) or are randomly sampled (randomized search).
Finally, the parameters that yield the highest classification performance are kept.
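A minimal sketch of such a parameter search (assuming scikit-learn is available; the synthetic two-class data and the candidate values for C and σ are placeholders, not the reference example; gamma = 1/σ² corresponds to the Gaussian kernel of Equation (7.59)):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(8)

# two Gaussian classes as a stand-in for a real dataset
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.5, 1.0, (200, 2))])
y = np.repeat([0, 1], 200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

best = None
for C in (0.1, 1.0, 10.0):
    for sigma in (0.5, 1.0, 4.0):
        clf = SVC(C=C, kernel="rbf", gamma=1.0 / sigma**2)   # soft margin SVM, Gaussian kernel
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)                      # performance on the validation set
        if best is None or score > best[0]:
            best = (score, C, sigma, len(clf.support_vectors_))

score, C, sigma, n_sv = best
print(f"best: C = {C}, sigma = {sigma}, validation accuracy = {score:.3f}, |SV| = {n_sv}")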
7.7.6 Discussion
SVMs are very powerful classifiers, which are applicable to a wide range of problems.
As discussed in Section 7.7.4, the complexity of an SVM is determined not by the di-
mensionality of the feature space, but only by the number of support vectors. Because
of this, an SVM is generally less prone to overfitting than other classifiers. Unlike ar-
tificial neural networks, which often yield suboptimal classifiers, an SVM classifier
is uniquely determined by the training data and the learning algorithm will always
produce a globally optimal classifier. Furthermore, the SVM algorithm allows for a
geometric interpretation and is easy to apply without the need of prior knowledge
about the problem. Using appropriate kernel functions, an SVM classifier can even be
used to classify complicated objects like genome sequences or the words of a (natural)
language. Most of all, it is based on a very well developed theoretical foundation (see
Boser et al. [1992], Cortes and Vapnik [1995], Schölkopf and Burges [1999]). One major
drawback, however, is that the presented SVM is only a binary classifier. Extension
to multiple classes typically requires training at least one SVM for each class (see Sec-
tion 7.1.1), but extensions to true multi-class SVMs exist as well (e.g., Crammer and
Singer [2001]).

Fig. 7.19. Decision boundaries (shown in the original feature space) of a hard margin (left) and a soft margin (right) SVM with Gaussian kernel on the same dataset. The shown feature vectors are the training data D. The support vectors are marked with circles around them. Example according to Cristianini and Shawe-Taylor [2000].
Figure 7.19 shows an example of the different decision boundaries that a hard
margin SVM (left) and a soft margin SVM (right) can derive. The boundaries are shown
in the original, two-dimensional feature space. Both SVMs use a Gaussian kernel (see
Equation (7.59)) with the same kernel parameter σ2 = 0.5 and were trained on the
same data. The shown data are the training data D. Support vectors are marked with
orange circles. The hard margin SVM shows perfect classification on the training data,
but has a relatively complicated decision boundary. The decision region of the soft
margin SVM, on the other hand, is smooth and relatively simple, but results in three
errors on the training set. Furthermore, the number of support vectors is significantly
higher with the soft margin SVM than with the hard margin SVM.
7.8 Matched filters
Often the goal is not only classifying an object, but also locating that object within an
image. Consider the toy example in Figure 7.20. Here, the goal is to find the location
of three characters A, B, and C against a noisy background.
Fig. 7.20. (a) Noisy image with three objects (the letters A, B, and C). (b) A matched filter for the
letter A is moved across the image.
In other words: the image
not only contains the object to be classified, but also unwanted noise. Matched filters,
also known as template matching, are a popular tool to achieve just that.
The idea behind matched filters is that the objects to be found are known in
advance—which is always the case in a classification setting—and that a prototypical
template can be derived for each of the objects. Matched filters provide a mathemati-
cal mechanism whose response assumes extremal values in places where the image
matches the template. In the above example, a matched filter for the character “A”
produces an image that is (nearly) black everywhere except at the center of the “A” in
the original image. In other words, objects are found within an image (or any other
type of signal) by moving the template over the image and recording the positions
where the filter response is maximal, i.e., where the template matches.
In the following, this intuitive description is formalized in a discrete notation, that
is, the images are considered to be two-dimensional discrete signals, as opposed to
continuous signals. In particular, let g ij denote the value of the image at the pixel
position (i,j) and let
g mn := (. . . ,g m−i,n−j , . . .)T , ∀(i,j) ∈ U (7.67)
denote the image patch around the position (m,n). In other words, g mn is the vector
of all image pixels in the region U around the patch origin (m,n). The image patch
is modeled as being composed of a true, underlying object image omn and additive
stationary noise rmn with E{rmn } = 0,
g mn = omn + rmn . (7.68)
The object and noise terms are defined in the same way as the image patch, which
means that g mn , omn , rmn ∈ ℝ|U| are all of the same size. Given a filter v ∈ ℝ|U| , the
response of the filter to the image patch g mn is obtained by taking the inner product
of the two vectors:
k mn = vT g mn = vT omn + vT rmn .
In the following, the indices mn are dropped for the sake of notational brevity. For
example, the above equation will be written as simply k = vT o + vT r.
The question now is how to find a suitable filter. Many different approaches are
possible. For example, one could train a linear SVM classifier and take the weight
vector as a filter. With matched filters, however, the filter v := (. . . ,v ij , . . .)T is chosen
so that it maximizes the signal-to-noise ratio
with some constant c ∈ ℝ. Using the relation between v and w defined above finally
yields the discrete matched filter
v = cQ−1 (Q−1 )T o = cK−1rr o . (7.76)
Here, (Q−1 )T acts as a whitening filter that de-correlates the noise in the image patch
g mn . However, the transformed image patch resides now in a different space than the
object image o. To correct for this, Q−1 modifies o to match after the whitening of the
image.
Note that while matched filters correct for (pixel) noise, they are still very sensitive
to rotation, scale, and other distortions of the input image. Since these perturbations
are very common in detection tasks, the image has to be normalized before applying
the filter.
In order to use matched filters for classification, one matched filter vi is created
for each class ω i to be recognized. The filters are moved over the image and for each
position the best match is recorded. More formally, the feature vector at position x =
(x,y)T is given by m(x) := (. . . ,g x−i,y−j , . . .)T with (i,j) ∈ U,
i.e., the column vector of image pixels within the (shifted) region U around x. The
decision function for each filter vi , i = 1, . . . ,c, is given by k i (m(x)) = vTi m(x) and the
decision vector becomes
k(m(x)) = (vT1 , . . . , vTc )T m(x) = V m(x). (7.79)
This decision vector is evaluated at every pixel of the input image. As there is always
a maximal entry in the decision vector, there will be a match at every pixel, even though
this certainly cannot be the case. In practice, match candidates with low responses
will be discarded as “none of the classes.” This maximum criterion is an example of
classification with rejection, which will be explored in more detail in Section 9.4.
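To make the sliding-window evaluation of Equations (7.76) to (7.79) concrete, the following sketch computes matched filter responses for the special case of white noise, in which the whitening step is the identity and the filter v i is simply the (zero-mean) object template. The tiny templates, the noise level, and the rejection threshold are made-up values for illustration only.

# Matched filtering by sliding class templates over an image (sketch).
import numpy as np

def matched_filter_responses(image, templates):
    """Return a (c, H', W') array of responses k_i at every position."""
    th, tw = templates[0].shape
    H, W = image.shape
    out = np.empty((len(templates), H - th + 1, W - tw + 1))
    for i, t in enumerate(templates):
        v = t - t.mean()                      # zero-mean template used as filter v_i
        for r in range(H - th + 1):
            for c in range(W - tw + 1):
                patch = image[r:r + th, c:c + tw]
                out[i, r, c] = np.sum(v * patch)   # inner product v_i^T g_mn
    return out

# Hypothetical 2x2 "letter" templates and a noisy image containing template 0.
templates = [np.array([[1., 0.], [0., 1.]]), np.array([[0., 1.], [1., 0.]])]
rng = np.random.default_rng(1)
image = rng.normal(0.0, 0.1, (8, 8))
image[3:5, 4:6] += templates[0]

k = matched_filter_responses(image, templates)
best = np.unravel_index(np.argmax(k), k.shape)     # class index and position of best match
if k[best] > 0.5:                                  # reject weak responses ("none of the classes")
    print("class", best[0], "detected at position", best[1:])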
More details about matched filters can be found, for example, in Beyerer et al.
[2016].
7.9 Classification of sequences
An underlying assumption in the discussion so far was that the data are independent
and identically distributed, i.e., the feature vector mi does not depend on the feature
vectors seen before. So far this assumption has been valid, but it does not hold, e.g.,
for videos, where the content of the next frame depends on the content of the current
frame, for speech, where grammar restricts which words can follow another word in
a valid sentence, or for games, where the next move depends on the moves that have
been played before. In general, any sequence in which the probability of drawing an
object depends on which objects have been drawn before, i.e., any sequence that depends
on some state, violates the i.i.d. assumption.
As an example, consider the recognition of spoken words, or more specifically,
the classification of utterances into characters that make up a word. In this scenario,
each character constitutes a class, i.e., ω1 =̂ A, ω2 =̂ B, ω3 =̂ C, etc. Words are gener-
ated by some source that sequentially attains one of the classes as its internal state
and produces the corresponding character. Clearly, the characters of a word are not
independent of the surrounding characters. For example, if the letters observed so far
are “T” and “E”, the characters “A”, “N” and “D” are more probable to be observed
next than the character “X”. A classifier will be more powerful if these dependencies
are modeled into it.
In our example, however, it is not possible to observe the characters directly. In
other words, the classes are hidden from our view. It is possible, after some suitable
signal processing, to observe associated phonemes—the smallest indivisible parts of
speech—v i , e.g., (in IPA phoneme notation) v1 =̂ /i:/, v2 =̂ /e/, v3 =̂ /æ/, etc. Given a
sequence of such phonemes, the goal is to recognize the corresponding word, i.e., the
sequence of states (characters) that will produce the observed sequence of phonemes.
More concretely, consider the word “sequence.” This word is produced by reaching
the states S, E, Q, U, E, N, C, and E one after another. When spoken, however, one
can only observe the phoneme sequence /ˈsiːkwəns/. The goal is to work this process
backwards, that is, to map the observed phonemes to the word “sequence” by virtue of
a model of the generation process. Note that here the number of phonemes is the same
as the number of characters. This does not always have to be the case: “model” has five
characters, but the (US English) pronunciation /’madl/ contains only four phonemes.
Sequences of any type can be modeled using discrete Markov models. A Markov model
describes the probability distribution to switch to the state ω(t) at time t, given the
system’s states ω(t − 1), ω(t − 2), . . . in the previous time steps (t − 1), (t − 2), . . . If
all of these time steps were to be taken into account, such a model would not be very
useful: inference and learning would take place in a very high-dimensional space,
7.9 Classification of sequences | 211
a22
ω2
a12 a23
a21 a32
Fig. 7.21. Discrete first order Markov
a13 model with three states ω i . The probabil-
ity of a transition from state ω i to state
a11 ω1 ω3 a33
ω k is denoted a ik . Image after Duda et al.
a31 [2001].
with all the associated problems (see Section 6.1). A discrete Markov model of the l-th
order therefore assumes that the probability of going into a state depends on the l
preceding states, but not more:
P(ω(t) | ω(t − 1), ω(t − 2), ω(t − 3), . . .) = P(ω(t) | ω(t − 1), . . . , ω(t − l)). (7.81)
Here, the above is referred to as the state transition probability. In a first order model,
the probability of switching from state ω i to state ω k is abbreviated as
a ik := P(ω(t) = ω k | ω(t − 1) = ω i ).
Finally, a Markov model is completed by the a priori state probabilities P(ω i ), which
encode the probability of starting in the state ω i .
First order Markov models may be represented by a stochastic automaton, where
the states correspond to the classes ω i and the state transition probabilities are given
by the a ik . An example is shown in Figure 7.21.
Markov models can be, and have been, used to generate sequences of characters.
Table 7.1 presents some examples of sequences that were generated by Markov models
of increasing order. The state transition probabilities and the a priori probabilities
were estimated from large corpora of German, English, and Russian texts.
The sequences generated by the 0th order models were fully determined by the
prior probabilities and do not resemble words from the corresponding languages very
much. Higher order models, on the other hand, include the transition probabilities
and even generate valid words like “IN” and “WHEY”—even though the model does
not explicitly encode the concept of a word!
Table 7.1. Character sequences generated by Markov models of different order. Table reproduced
from Hoffmann [1998].
German:
0  EME GKNEET ERS TITBL BTZENFNDBGD EAI E LASZ BETEATR IASMIRCH EGEOM
1  AUSZ KEINU WONDINGLIN DUFRN ISAR STEISBERER ITEHM ANORER
2  PLANZEUDGES PHIN INE UNDEN VEBEICHT GES AUF ES SO UNG GAN DICH WANDERSO
3  ICH FOLGEMAESZIG BIS STEHEN DISPONIN SEELE NAMEN
English:
0  OCRO HLI RGWR NMIELWIS EULL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
1  ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANOY TOBE SEACE CTISBE
2  IN NO IST LAT WHEY CRACTICT FROURE BIRS GROCID PONDENOME OF DEMONSTRURES OF THE REPTAGIN IS REGOACTIONA OF CRE
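Generating such sequences only requires the a priori probabilities P(ω i ) and the transition probabilities a ik . The following sketch samples from a first order model over a hypothetical three-symbol alphabet; the probabilities are invented and do not correspond to the corpora behind Table 7.1.

# Sampling a character sequence from a discrete first order Markov model (sketch).
import numpy as np

symbols = ["A", "B", "C"]                      # states omega_i (hypothetical)
prior = np.array([0.5, 0.3, 0.2])              # a priori state probabilities P(omega_i)
A = np.array([[0.1, 0.6, 0.3],                 # transition probabilities a_ik:
              [0.4, 0.2, 0.4],                 # row i is the distribution of the next
              [0.3, 0.5, 0.2]])                # state given the current state omega_i

def sample_sequence(length, rng):
    state = rng.choice(len(symbols), p=prior)          # start state drawn from P(omega_i)
    seq = [symbols[state]]
    for _ in range(length - 1):
        state = rng.choice(len(symbols), p=A[state])   # next state drawn from a_ik
        seq.append(symbols[state])
    return "".join(seq)

rng = np.random.default_rng(42)
print(sample_sequence(20, rng))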
So far we have assumed that the model state at a time t is known with absolute certainty,
but this is not always the case. In the introductory speech recognition example on p. 210,
the underlying states (characters) were only indirectly observable via the associated
phonemes. A hidden Markov model (HMM) is an extension to a Markov model that can
deal with such situations.
In addition to states and state transition probabilities, an HMM consists of observa-
tions v j and emission probabilities that denote the probability of seeing an observable
given a chain of states ω(t), . . . , ω(t − l), where l is the order of the model. In a first or-
der HMM, the emission probability of observing v(t) = v j depends only on the current
state ω(t) = ω i and is denoted by
b ij := P(v(t) = v j | ω(t) = ω i ).
Note that with a hidden Markov model, the states are not directly observable. In-
stead, the state sequence can only be inferred from the sequence of observations v j .
Figure 7.22 shows a first order hidden Markov model with three hidden states and four
observations.
There are three important tasks when working with hidden Markov models:
1. Evaluation: given the model parameters, compute the probability of an observed sequence (the forward problem).
2. Decoding: given the model parameters and an observed sequence, infer the most probable sequence of hidden states (the backward problem).
3. Learning: given one or more observed sequences, estimate the model parameters a ik and b ij .
Fig. 7.22. Discrete first order hidden Markov model with three hidden states ω i and four possible
observations v j . The a ik denote the state transition probabilities, while the b ij denote the probability
that v j is observed when entering state ω i . Image after Duda et al. [2001].
The first task is straightforward, as it only involves known quantities and is mostly of
theoretical importance. The second and third tasks, on the other hand, are very impor-
tant in the context of pattern recognition. The second task is analogous to classification
in the conventional setting: observations correspond to feature vectors, whereas the
states correspond to classes. At first, it seems that this task is not much more com-
plicated than the forward problem, but unfortunately it is much more complicated.
The usual approach to the backward problem is the Viterbi algorithm (Viterbi [1967],
Forney [1973]), a variant of dynamic programming. The last task corresponds to learn-
ing a conventional classifier. Here, the goal is to construct an HMM from the data.
Most common approaches estimate the parameters a ik and b ij using an expectation
maximization (EM) algorithm (Dempster et al. [1977]). Without going into detail, EM al-
ternates between computing the expected value of the likelihood of the data given the
parameters (expectation step) and maximizing the expected likelihood by changing
the parameters (maximization step). This procedure eventually converges to a (local)
maximum of the likelihood of the parameters. EM is especially useful in cases where the
likelihood function cannot be maximized analytically.
Naive brute-force methods to solve the inference and estimation tasks fail because
the number of possible sequences grows exponentially with the length T of the se-
quence: the number of possible sequences of length T is c^T , where c is the number of
states.
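The Viterbi algorithm avoids this exponential blow-up by dynamic programming over a c × T table. The following sketch shows the recursion in the log domain; the two-state model and its parameters are hypothetical and only serve to illustrate the bookkeeping.

# Viterbi algorithm: most probable hidden state sequence of a first order HMM (sketch).
import numpy as np

def viterbi(obs, prior, A, B):
    """obs: observation indices; prior: P(omega_i); A: a_ik; B: b_ij = P(v_j | omega_i)."""
    c, T = A.shape[0], len(obs)
    logd = np.full((T, c), -np.inf)        # log of best path probability ending in state i
    back = np.zeros((T, c), dtype=int)     # backpointers
    logd[0] = np.log(prior) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for i in range(c):
            scores = logd[t - 1] + np.log(A[:, i])
            back[t, i] = np.argmax(scores)
            logd[t, i] = scores[back[t, i]] + np.log(B[i, obs[t]])
    # Backtrack from the best final state.
    path = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state, 3-observation HMM.
prior = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], prior, A, B))   # prints [0, 0, 1, 1]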
Although HMMs can be employed whenever one has to deal with sequences,
the most successful applications lie in the recognition of handwriting, gestures, and
speech. A full treatment of the training and inference algorithms as well as applica-
tions are outside the scope of this book. Interested readers are instead referred to other
sources, e.g., Moon and Stirling [2000] or Fink [2003].
7.10 Exercises
(7.3) Construct the kernel function K(p,q) with p = (p1 ,p2 ,p3 )T and q = (q1 ,q2 ,q3 )T
that implicitly performs the following feature mapping φ : ℝ3 → ℝ4 :
φ(m) = φ((m1 ,m2 ,m3 )T ) = (m1 m2 , m2 m3 , m3 m1 , (m1 + m2 + m3 )3 )T .
8 Classification with nominal features
Consider again the classification of an unknown plant. A botanist will ask a sequence
of questions regarding different nominal properties like the shape of the leaf, the for-
mation of the buds of the plant, the shape of the blossom, the color of the blossom,
etc. Every question rules out certain plant candidates, until there is only one option
left—the final decision.
In such situations it is typical that the next question depends on the answer to the
previous question. Formally, this technique of asking questions can be represented
by a decision tree. In this sense, a floral field guide is a “manual” decision tree clas-
sifier. Other examples of such decision trees are medical diagnoses, error detection
procedures for technical equipment, and codified business procedures.
A decision tree is a tree structure, where the inner nodes correspond to questions,
the links between the nodes represent the answers, and the leaves represent the deci-
sions, or classes. An example of a decision tree to classify fruit is shown in Figure 8.1.
In the example, which was taken from Duda et al. [2001], the classes are
Ω/∼ = {ω1 , . . . ,ω7 }
= {Apple, Watermelon, Grape, Grapefruit, Lemon, Cherry, Banana}
and the discrete, four dimensional space of nominal features is given by
m ∈ 𝕄 ⊆ 𝕄1 × 𝕄2 × 𝕄3 × 𝕄4
= {green, yellow, red} × {big, medium, small} × {round, thin} × {sweet, sour}.
Decision trees are easy to understand and can, unlike most of the classifiers dis-
cussed so far, be intuitively interpreted by humans. Of course, the interpretability
breaks down with deeper, more complex trees, but in principle every classification
decision can be reproduced and understood by a human.
Figure 8.1 also shows some key properties of decision tree classifiers: The branches
of a node must be mutually exclusive (no answer may appear on two branches) and
exhaustive (all answers that are possible at a particular node must be covered). Fur-
thermore, a question in a node must have a deterministic and unambiguous answer.
The classification of an unknown object is achieved by sequential decisions along a
path through the tree, until a leaf node is reached. The path itself is determined by the
answers to the questions in the inner nodes. Note that the same question may appear
at multiple points in the decision tree, even on a single path through the tree. For exam-
ple, the question “Size?” appears three times in the example in Figure 8.1. Depending
on where the question is asked, it may have a different number of possible answers
(outgoing edges). Leaf nodes represent classes. An object that reaches
a given leaf node is assigned to the class that is represented by that node. Multiple
leaf nodes may (and usually do) represent the same class. The tree in Figure 8.1, for
example, contains two leaf nodes each for the classes “Apple” and “Grape”.
Decision trees are generally very fast classifiers, provided the tree is well structured
and not too deep. Decision trees allow easily incorporating prior knowledge about the
pattern recognition task into the classifier. For example, one can augment a decision
tree learned from a sample (see Section 8.1.1) by rules that represent expert knowledge.
Decision trees are also applicable to features of a higher scale, e.g., the interval scale.
Then, the nodes typically quantize the features into two or more possible subsets of
values. If, for example, the size were measured in mm, the question “Size?” in the third
node from the left in the second level of Figure 8.1 may be quantized so that “small”
means a size of ≤ 20 mm and “medium” means a size of > 20 mm. Then, the question
would become “Size ≤ 20 mm?”.
Using grouping and quantization, every decision tree can be transformed into a
binary decision tree, i.e., a tree where each internal node has exactly two outgoing
edges. In a binary decision tree, the nodes usually represent yes/no questions. A bina-
rized tree of Figure 8.1 is shown in Figure 8.2. In the following, we will only consider
binary trees.
Fig. 8.1. Decision tree to classify fruit. Inner nodes represent questions about the features, edges
represent possible answers, and leaf nodes represent classes. Recreated according to Duda et al.
[2001].
Fig. 8.2. Binarized version of the decision tree of Figure 8.1.
Learning a decision tree corresponds to determining the structure, that is, the nodes
and branches, of a tree using the training set D. The training procedure is recursive.
At each step, a node is constructed according to some splitting criterion (see below).
The resulting branches split the training set into two disjoint subsets, D = DYes ⊎ DNo .
The subset DYes is associated with the left branch of the split and DNo is associated
with the right branch. On each branch, a node is again constructed according to the
splitting criterion, but only using the samples that reach that node. This procedure is
repeated recursively until a stopping criterion is met.
Like other classifiers, decision trees may overfit the training set. Overfitting hap-
pens if the structure of the tree is too fine grained and the corresponding decision paths
are too detailed. This can be spotted in the learning phase, when too few samples of
the training set reach the deeper nodes and leaves of the decision tree. In the extreme
case, only one training sample is allotted to each leaf node.
To prevent this scenario, a learning algorithm should create a compact tree, i.e., a
tree with as few nodes as possible. A common greedy approach chooses a question for
the current node so that the training sets in the resulting split DYes and DNo are as pure
as possible. The pureness of a dataset is measured using a heterogeneity or impurity
measure, denoted by i(⋅). The impurity measure i(n) should assume a minimum if
the dataset Dn at node n consists of samples of only one class, and should attain a
maximum if the classes are uniformly distributed, i.e., if each class is represented by
the same number of samples in Dn .
There are three standard measures that fulfill these properties: the entropy mea-
sure, the Gini impurity measure, and the misclassification measure. A qualitative com-
parison of the measures for a two-class scenario is shown in Figure 8.3.
Definition 8.1 (Entropy measure). The entropy measure corresponds to the entropy of
the empirical class distribution in the training set,
i(n) = − ∑_{k=1}^{c} P̂(ω k | m,n) log2 ( P̂(ω k | m,n) ) . (8.2)
Here, n denotes the current node and the probability distribution is estimated as the
ratio of the number N nk of samples of class ω k that reach node n to the total number
N n of samples that reach node n:
P̂(ω k | m,n) := N nk / N n . (8.3)
Definition 8.2 (Gini impurity measure). The Gini impurity measure (or simply the Gini
measure) estimates the expected error probability if the class were to be randomly
assigned according to the class distribution at the node n:
i(n) = ∑_{k=1}^{c} ∑_{l=1, l≠k}^{c} P̂(ω l | m,n) P̂(ω k | m,n) = 1 − ∑_{k=1}^{c} ( P̂(ω k | m,n) )² . (8.4)
Definition 8.3 (Misclassification measure). The misclassification measure estimates the probability
of misclassification if every sample at node n were assigned to the most frequent class at that node,
i(n) = 1 − max_k P̂(ω k | m,n) . (8.5)
Fig. 8.3. Qualitative comparison of impurity measures in dependence on the class probability
P(ω(m) = ω1 ) in a two-class scenario.
Impurity minimization
Given an impurity measure i(n), the question to ask at node n is chosen so that the
impurity of the split is minimized, i.e., chosen to maximize the decrease in impurity
∆i(n) = i(n) − (N nYes /N n ) i(nYes ) − (N nNo /N n ) i(nNo ), (8.7)
where nYes and nNo denote the child nodes of the split and N nYes , N nNo the number of
training samples that reach them.
This process is iterated, each time using the corresponding partitions of the train-
ing data, until the decrease in impurity falls below a threshold (∆i(n) < τ i ), or until
the number of training samples that reach the node n becomes too low. In addition
to stopping the training procedure on these conditions, a fully grown decision tree
may be post-processed by merging or pruning nodes after training. These topics are,
however, outside the scope of this textbook.
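The greedy choice of a split can be written down directly from the Gini measure (8.4) and the impurity decrease ∆i(n) described above. The following sketch scores candidate thresholds on a single quantitative feature; the data and thresholds are made up for illustration.

# Greedy split selection by impurity minimization (sketch).
import numpy as np

def gini(labels):
    """Gini impurity i(n) of the samples reaching a node, Equation (8.4)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def impurity_decrease(feature, labels, threshold):
    """Decrease in impurity of the split 'feature <= threshold', weighted by subset sizes."""
    yes = labels[feature <= threshold]
    no = labels[feature > threshold]
    n = len(labels)
    return gini(labels) - len(yes) / n * gini(yes) - len(no) / n * gini(no)

# Hypothetical one-dimensional training data with two classes.
m = np.array([1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5])
omega = np.array([1, 1, 1, 2, 2, 2, 2, 2])

candidates = (m[:-1] + m[1:]) / 2           # midpoints between sorted feature values
best = max(candidates, key=lambda t: impurity_decrease(m, omega, t))
print("best split: m <= %.2f, impurity decrease %.3f"
      % (best, impurity_decrease(m, omega, best)))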
In Figure 8.4, the reference dataset from Section 3.3.2 was used to train a decision
tree with the greedy strategy outlined above. Gini impurity was used as split criterion
and recursion was stopped when fewer than 15 training samples were available for a split.
The decision rules of the tree are shown in Figure 8.5. The testing error is relatively
large and from a visual inspection of the decision regions it is evident that the tree is
more complicated than it should be. The reasons for this will be explored in the next
section.
Fig. 8.4. Application to the reference example of Section 3.3.2. Decision regions of a decision tree
classifier. Training was stopped when there were fewer than 15 samples available for a split. The
training and testing errors are etrain = 7 % and etest = 12 %, respectively. The testing error asymp-
totically approaches etest ≈ 14.6 %. The training set is the same as in Figure 3.8. Test samples are
shown with hollow marks.
Fig. 8.5. Decision rules of the decision tree trained on the reference dataset of Section 3.3.2.
Fig. 8.6. Impact of the features used in decision tree learning. This example descriptively underlines
what is meant by qualifying the boundary between feature extraction and classification as “blurry”
in Figure 1.2.
Fig. 8.7. A decision tree that does not generalize well. The filled marks show the training sample D.
As mentioned in Chapter 2, the choice of features can have a significant impact on the
performance of the classifier. Since decision trees are transparent classifiers, they are
well suited to exemplify this point. Figure 8.6 shows the impact of inadequate features
on the classifier. Here, the features are on the quantitative scale, and each node of the
decision tree splits the feature space parallel to the axes. As can be seen in Figure 8.6a,
the features m1 and m2 produce an unnecessarily complicated and overfitted decision
tree. Yet, a simple feature transformation produces the minimal decision tree shown in
Figure 8.6b. Similarly, the decision tree in Figure 8.7 does not generalize well. Allowing
the misclassification of one training sample would eliminate the thin decision region
around m2 = 0.4 and produce a much simpler decision tree without overfitting.
8.2 Random forests
In practice, decision trees are often observed to overfit the data, especially when the
trees are relatively deep. As mentioned above, one method to address this issue is to
prune the tree after construction. An alternative solution is to train an ensemble of
several classifiers and average their predictions. The idea is that while every classifier
in the ensemble may give inaccurate predictions, the random part of the pertaining
error will average out over the collective predictions. In other words: averaging over
the ensemble will reduce the variance of the classification system. We will state this
intuition more precisely below.
Methods that take this approach, i.e., methods that derive a decision from a set of
classifiers, are called ensemble methods. Ensemble methods subsume a broad range
of techniques, much more than can be covered in this book. Rokach [2010] and Zhou
[2012] give a much more thorough discussion of this subject. Here, we will restrict
ourselves to the following general understanding: ensemble methods predict the class
of a sample according to a weighted average
k(m) = ∑_{j=1}^{M} α j kj (m), (8.8)
where the kj (m) denote the base classifiers in the ensemble, the α j ∈ ℝ denote weights
associated with each classifier, and k(m) is the overall decision function of the ensem-
ble.
One particularly successful instance of ensemble methods is the method of random
forests, sometimes also called random decision forests or randomized forests. As the
name suggests, a random forest is composed of decision trees kj (m), j = 1, . . . ,M,
where each tree is weighted equally, α j = 1/M . We will discuss another ensemble method,
AdaBoost, in Section 9.3, where the weights α j of the classifiers are adapted during the
training phase. Interestingly, under certain conditions, AdaBoost can be interpreted
as a special case of a random forest (Breiman [2001]).
Similar to the SVM, random forests have shown remarkable classification perfor-
mance out of the box in many practical classification tasks. This success can largely
be attributed to a key idea that lends random forests the first half of their name: ran-
domization.
Random forests use randomization during training in two ways: First, each tree
in the ensemble is trained on a random subsample of the training set. Second, at each
node in each tree, only a random subspace of the feature space is considered for a
split.
More formally, let D = {m1 , . . . ,mN } be the training set of N d-dimensional train-
ing samples mi = (m i1 , . . . ,m id )T with known class memberships ω(mi ), i = 1, . . . , N.
To train the decision tree kj (m), first a new training set D̃ j = {mr(1) , . . . ,mr(B) } is con-
structed by randomly sampling from D with replacement. Here, r(l) is a function that
maps l to a random integer 1 ≤ k ≤ N and B ≤ N is the size of the sub-sampled training
set D̃ j (typically B ≈ 0.7N). Note that r(⋅) may map to the same integer more than once,
meaning that one and the same training sample mi may occur more than once in D̃ j .
The motivation behind the resampling is that each tree will be sensitive to a slightly
different version of the classification problem. As a result, each tree will make different
errors in classification, but these errors will presumably be corrected by the other trees
in the ensemble. In a broader context, D̃ j is called a bootstrap sample (bootstrapping
is a statistical technique to deal with small datasets) and the aggregation of decision
trees trained on different bootstrap samples is referred to as bootstrap aggregating or
bagging for short.
Still, the trees are likely to choose the same features in the first few splits, since
these splits favor features that are highly correlated with the class memberships ω(mi ).
To circumvent this issue, only a random subset of d′ ≤ d features (typically with d′ ≈ √d)
is considered when finding each split. In other words, during the decision
tree learning, the question to ask at node n is chosen to minimize the impurity of the
split according to Equation (8.7), but only d′ randomly selected features are considered
as candidates. Note that each split considers a different set of features, that is, the d′
feature candidates are chosen anew in each individual iteration of the decision tree
learning.
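Both sources of randomness can be sketched in a few lines. The tree learner below is taken from scikit-learn purely as a stand-in, with B ≈ 0.7N and d′ ≈ √d as suggested above; the data are synthetic.

# Bagging and per-split feature sub-sampling for a random forest (sketch).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, z, M=10, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    N, d = X.shape
    B = int(0.7 * N)                       # size of the bootstrap sample D~_j
    d_sub = max(1, int(np.sqrt(d)))        # number d' of feature candidates per split
    forest = []
    for _ in range(M):
        idx = rng.integers(0, N, size=B)   # r(l): draw indices with replacement
        # max_features delegates the per-split feature sub-sampling to the tree learner.
        tree = DecisionTreeClassifier(max_features=d_sub).fit(X[idx], z[idx])
        forest.append(tree)
    return forest

# Hypothetical data: two Gaussian classes in four dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
z = np.hstack([np.zeros(100, int), np.ones(100, int)])
forest = train_random_forest(X, z, M=10, rng=rng)
print("number of trees:", len(forest))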
The main effect of bagging and feature sub-sampling is that the trees will become
decorrelated, because they specialize on different training samples and emphasize
different features. As a result, this minimizes the variance of the whole ensemble.
This result can be seen from the following calculation, where the weights α j = 1/M
in Equation (8.8) are pulled in front of the sum (Hastie et al. [2001]). Note that the
discussion uses real-valued decision functions k j (⋅) instead of the vectorial kj (⋅) in
Equation (8.8) to simplify the notation, but the argument still holds for vector-valued
decision functions.
With the assumptions E{k j (m)} = 0, Var{k j (m)} = σ² and Cov{k j (m), k l (m)} =
ρσ² for all j ≠ l, j,l = 1, . . . , M:
Var{ (1/M) ∑_{j=1}^{M} k j (m) } = (1/M²) ∑_{j=1}^{M} ∑_{l=1}^{M} Cov{k j (m), k l (m)}
= (1/M²) ∑_{j=1}^{M} ( ∑_{l≠j} Cov{k j (m), k l (m)} + Var{k j (m)} )
= (1/M²) ∑_{j=1}^{M} ( (M − 1)ρσ² + σ² )
= ( M(M − 1)ρσ² + Mσ² ) / M²
= (M − 1)ρσ²/M + σ²/M
= ρσ² + σ² (1 − ρ)/M . (8.9)
It can be seen that the variance of k(m) is reduced when
– the correlation ρ between the trees is reduced, or
– M is increased, i.e., more trees are added to the ensemble.
Of course, adding more trees to the ensemble will probably increase the correlation
between the trees and thereby increase the first term of the above equation. At the
same time, removing too many trees will increase the second term. The “correct” num-
ber of trees in the ensemble depends on the classification performance, but in many
applications a number M on the order of tens will provide a good baseline.
What remains to be discussed is classification with random forests, i.e., how to im-
plement Equation (8.8). Breiman [2001] suggests deriving a class probability estimate
Fig. 8.8. Application to the reference example of Section 3.3.2. Decision regions of a random forest
classifier. The forest was composed of 10 decision trees. Training was stopped when there were
fewer than two samples available for a split. The training and testing errors are etrain = 1 % and
etest = 8.9 %, respectively. The testing error asymptotically approaches etest ≈ 8.31 %, which is very
close to the 6.16 % asymptotic testing error of the optimal classifier. The training set is the same as
in Figure 3.8. Test samples are shown with hollow marks.
P̂(ω k | m) from a majority vote. Each tree kj (⋅) in the ensemble will classify the sample
m and the probability estimate for class ω k is the fraction of trees that voted for ω k ,
P̂(ω k | m) = (1/M) ∑_{j=1}^{M} δ[arg maxω P̂ j (ω | m) = ω k ] , (8.10)
where P̂ j (ω | m) is the a posteriori probability derived from the decision tree kj (m) and
δ[⋅] denotes the generalized Kronecker symbol.
In practice, Equation (8.10) does not require probability estimates P̂ j (ω | m), but
only a class assignment ω̂ j (m). In other words, one can simply use the class assignment
stored in the leaf nodes of the tree kj (⋅), as described in Section 8.1.
An alternative approach is due to Ho [1995], where the class probability estimate
is the average of the probability estimates P̂ j (ω | m) of the trees,
P̂(ω k | m) = (1/M) ∑_{j=1}^{M} P̂ j (ω k | m) . (8.11)
Here, simple class assignments are not enough: instead, this method requires
full probability estimates. A straightforward approach to obtain these estimates is to
store the class membership probabilities P̂ j (ω k | m,n) in each leaf node n of each tree
kj (⋅). The overall estimate of the tree, P̂ j (ω k | m), is then given by the leaf node n that
is reached by m. The P̂ j (ω k | m,n) can be estimated as in Equation (8.3), i.e., with N jnk
denoting the number of training samples of class ω k that reach node n of tree kj (⋅),
and N jn denoting the total number of training samples that reach that node,
P̂ j (ω k | m,n) := N jnk / N jn . (8.12)
In either case, the final class estimate is given by the maximum a posteriori classi-
fier as in Equation (3.23),
ω̂(m) = arg max_{ω k ∈ Ω/∼} P̂(ω k | m) . (8.13)
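The two combination rules (8.10) and (8.11) and the final decision (8.13) are easily implemented once the per-tree estimates are available. In the following sketch, the per-tree estimates P̂ j (ω k | m) for a single sample m are given as a hypothetical M × c array.

# The two ensemble rules (8.10), (8.11) and the final decision (8.13), sketched
# for M hypothetical trees whose per-tree estimates are given for one sample m.
import numpy as np

P_trees = np.array([[0.7, 0.3],      # tree 1: P^_1(omega_1 | m), P^_1(omega_2 | m)
                    [0.4, 0.6],      # tree 2
                    [0.8, 0.2],      # tree 3
                    [0.9, 0.1],      # tree 4
                    [0.3, 0.7]])     # tree 5
M, c = P_trees.shape

# Equation (8.10): fraction of trees that vote for omega_k (majority vote).
votes = P_trees.argmax(axis=1)
P_vote = np.bincount(votes, minlength=c) / M

# Equation (8.11): average of the per-tree probability estimates.
P_avg = P_trees.mean(axis=0)

# Equation (8.13): maximum a posteriori decision.
print("vote-based estimate", P_vote, "-> class", P_vote.argmax() + 1)
print("averaged estimate  ", P_avg, "-> class", P_avg.argmax() + 1)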
Fig. 8.9. Strict string matching: the pattern m = “bdac” is found in the text t at shift s = 5. Figure
according to Duda et al. [2001].
8.3 String matching
Another specialized technique may be used when the pattern is a sequence of symbols.
As an example, consider the classification of DNA sequences in biomedical applica-
tions. DNA can be thought of as a series of pairs of the bases adenine, thymine, cytosine,
and guanine. However, as thymine always pairs with adenine and cytosine always pairs
with guanine, DNA can be sufficiently described as a sequence of symbols in the al-
phabet A = {A,C,G,T}. Here, A stands for adenine, C for cytosine, G for guanine, and T
for the base thymine.
AGCTTCGAATC. Long sequences of symbols (where the meaning of “long” depends
on the context) are also called a text and a substring of a sequence is denoted a factor.
Symbols may represent either nominal or ordinal features. With the DNA example, the
symbols are nominal.
The task in string matching is to find a sequence m in a given text t, where t is
usually much longer than m (see Figure 8.9). In other words, the task is to answer the
question: is the sequence m a factor of the text t and where is it located? With the DNA
example, string matching can be used to find markers for genetic disorders in the DNA
of a patient.
Often, it is not necessary or even possible to find exact matches. With genetic
markers, random mutations may cause individual genes to change, without affecting
the overall behavior of the factor. One method to deal with such situations is to use the
nearest neighbor classifier (see Section 5.3), where the distance between two sequences
m1 and m2 is the edit distance of the sequences. The edit distance between m1 and m2
is the minimum number of string operations—insertions, deletions and substitutions—
to transform the sequence m1 into the sequence m2 (see Figure 8.10).
During the training of the classifier, all given strings (factors) and their class mem-
berships are stored. When classifying an unknown sequence m, all edit distances of
m to the stored factors are computed and the class of the sequence with the minimum
distance to m is assigned.
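The edit distance itself is computed by the standard dynamic programming recurrence over prefixes of the two sequences. The following sketch also shows the nearest neighbor step over a small, invented set of stored factors.

# Edit (Levenshtein) distance between two symbol sequences (sketch).
def edit_distance(a, b):
    # d[i][j] = minimum number of insertions, deletions and substitutions
    # needed to transform a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(a)][len(b)]

# Nearest neighbor classification over stored factors (hypothetical training data).
factors = {"GATTACA": "marker 1", "AGCTTCG": "marker 2"}
query = "GATTTACA"
best = min(factors, key=lambda f: edit_distance(query, f))
print(best, "->", factors[best], "distance", edit_distance(query, best))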
Another strategy is to include in the alphabet the special wildcard symbol ⋆, which
matches with any character in A. The wildcard symbol may appear both in the text
and the target sequence m. An example of string matching with wildcard symbols is
shown in Figure 8.11.
Fig. 8.10. Approximate string matching: the pattern m = “structures” matches the factor “strictures”
in the text t at shift s = 11 with edit distance 1. Figure according to Duda et al. [2001].
Fig. 8.11. String matching with the wildcard symbol ⋆, which matches any single character: the
pattern m = “patt⋆r⋆s” matches the factor “pa⋆ter⋆s” in the text t. Figure according to Duda et al.
[2001].
8.4 Grammars
Besides string matching, another approach to classifying sequences is the use of gram-
mars, where every class is represented by a grammar G i , i = 1, . . . ,c. All sequences that
are generated by the grammar G i are considered equivalent. In this sense, a grammar
G i corresponds to a model of the patterns of the class ω i . Classification with grammars
corresponds to parsing: A pattern is assigned to the class whose grammar generates
the pattern.
Formally, a grammar G is a quadruple G = (A,V,S,P) of an alphabet A of terminal
symbols, variables V, the starting variable S ∈ V, and a set of production rules P that
replace variables v ∈ V with strings of variables or terminal symbols. L(G) denotes the
language of the grammar G, where the language is the set of all sequences of terminal
symbols that can be produced by G.
An example of a grammar (according to Duda et al. [2001]) is given by the alphabet
A := {a,b,c}, the variables V = {A,B,C,S} and the production rules
{ p1 : S → AB or BC }
{
{ }
}
{ p2 : A → BA or a }
P={ } .
{
{ p3 : B → CC or b }
}
{ }
{ p4 : C → AB or a }
Parsing a grammar can be approached bottom–up or top–down. Bottom–up pars-
ing starts at the sequences and applies rules in reverse until the starting variable S
is reached (see Figure 8.12, left). Top–down parsing starts with S and applies rules
until the sequence is matched (Figure 8.12, right). The details of both approaches are
outside the scope of this textbook.
Fig. 8.12. Bottom up (left) and top down (right) parsing of the sequence “baaba” given the example
grammar (see text). In both cases, the sequence is accepted.
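A bottom-up parse as in the left half of Figure 8.12 can be carried out with the CYK algorithm, since the example grammar is in Chomsky normal form (every rule produces either two variables or a single terminal). The following sketch fills the same triangular table of variables; it illustrates one possible implementation, not the book's.

# Bottom-up (CYK) parsing of a sequence with the example grammar (sketch).
rules = {                     # p1 ... p4 from the text
    "S": [("A", "B"), ("B", "C")],
    "A": [("B", "A"), ("a",)],
    "B": [("C", "C"), ("b",)],
    "C": [("A", "B"), ("a",)],
}

def cyk_accepts(sequence, start="S"):
    n = len(sequence)
    # table[i][l] = set of variables that can generate sequence[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, symbol in enumerate(sequence):                       # length-1 factors
        for v, productions in rules.items():
            if (symbol,) in productions:
                table[i][1].add(v)
    for l in range(2, n + 1):                                   # longer factors
        for i in range(n - l + 1):
            for split in range(1, l):
                for v, productions in rules.items():
                    for p in productions:
                        if len(p) == 2 and p[0] in table[i][split] \
                                and p[1] in table[i + split][l - split]:
                            table[i][l].add(v)
    return start in table[0][n]

print(cyk_accepts("baaba"))     # True: the sequence is generated by the grammar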
The learning problem corresponds to the construction of a grammar from the data
in D and is also known as grammar induction. However, there are generally infinitely
many grammars that are consistent with a finite number of sample sequences. A solu-
tion is again offered by Ockham’s razor: the simplest grammar that is consistent with
the data is to be preferred over more complex grammars that also fit the data. Like
parsing, methods to learn a grammar from data are outside the scope of this textbook.
8.5 Exercises
(8.1) Use the features animal, weight, and speed to construct a balanced decision tree
that classifies the following sample without error:
(8.2) Given below is a training sample of two-dimensional features for two classes ω1
and ω2 and a decision tree learned from the data.
Sketch the decision boundary of this classifier. Which leaves are a symptom of
overfitting? Can the tree be simplified to eliminate the overfitting?
The decision tree reads:
m1 < 5?
  Y: m1 < 3?
    Y: ω2
    N: m2 < 2?
      Y: ω1
      N: ω2
  N: m2 < 5?
    Y: m1 < 8?
      Y: ω1
      N: m2 < 4?
        Y: ω2
        N: ω1
    N: ω2
9 Classifier-independent concepts
In the final chapter of this book, we will explore topics that are, in a sense, orthogonal
to the classifiers we have seen so far. The first section deals with fundamental concepts
and the limits of all statistical learning. The following sections will give an overview of
methods for empirically evaluating a classifier’s performance. The last two sections
will introduce boosting, a meta-technique for combining the predictions of several
weak classifiers into one strong classifier, and will discuss techniques for classifying
with the option of rejecting a sample.
9.1 Learning theory
The first two issues are part of the topic of proper sampling, and so of lesser interest in
the context of learning theory. The third point addresses the quality of the features and
the problem of extracting the relevant information from the patterns. The fourth point,
which here is reformulated compared to point 4 on p. 6 in Section 1.4, gives rise to the
central problem of statistical learning. In the following discussion of this problem, we
assume the classification is binary (i.e. c = 2). In this case, it is sufficient to consider a
single decision function k(m | θ) instead of ω̂(m | θ).
Fig. 9.1. Relation of the world model P(m,ω) and the training and test sets D and T. The training set D
and the test set T are drawn from a stochastic process with unknown joint probability distribution
P(m,ω). The training error etraining (m,θ) and the test error etest (m,θ) of a model θ are estimated on
the sets D and T.
Under what conditions does a small training error lead to a small test error? More
generally, statistical learning theory is concerned with the ability of a classifier to
generalize, that is, to classify unseen samples with little error. Note that the two are
inversely related: if the ability to generalize is great, the mean test error will be small,
and vice versa.
More formally, given a decision function k(m,θ) governed by a controllable pa-
rameter vector θ, what can be known about the expected test error ε(θ) = E{etest (m,θ)}?
Under what conditions will ε(θ) be minimal and what is the probably approximately
correct (PAC, Valiant [1984]) upper bound on ε(θ)? With probability of at least 1 − η,
ε(θ) ≤ etraining (θ) + Φ(ν,N,η) (9.1)
holds. Here, N = |D| is the number of samples in the training set, ν is the VC dimension
(see below) of the set K = {k(m,θ)|θ ∈ Θ} of decision functions and Φ(ν,N,η) denotes
the VC confidence (Vapnik and Vapnik [1998]), defined by
Φ(ν,N,η) = √[ ( ν (log(2N/ν) + 1) − log(η/4) ) / N ] . (9.2)
This bound holds regardless of the underlying distribution of the data if this distri-
bution is the same for all samples and all the samples were produced independently,
Fig. 9.2. Sketch of different class assignments to a sample using the model families Kh of two-
dimensional hyperplanes and Ke of ellipses for separation. The VC dimension of Kh is ν = 3,
whereas that of Ke is ν ≥ 4.
that is, if the data is distributed i.i.d. A key quantity in Equation (9.2) is the Vapnik–
Chervonenkis dimension (VC dimension) ν of the set of decision functions K. Briefly,
ν acts as a measure of the complexity of the model family represented by K. Since a
rigorous definition requires concepts that are outside the scope of this text, we will
give only an informal, intuitive description of ν:
Consider a given set of N samples to be assigned to two classes. Because each
sample can belong to either class, but not to both, there are 2^N constellations of pos-
sible class assignments in total. The VC dimension ν of a set K of decision functions
is defined as the maximum number of samples that can be separated by K for all pos-
sible class assignments, independently of the spatial distribution of the samples. An
example with two-dimensional features m ∈ ℝ2 is shown in Figure 9.2. In the first
row, the set K is the set of two-dimensional hyperplanes, Kh = {wT m − b = 0}. In the
second row, K is the set of ellipses, Ke = {(m − µ)T Λ(m − µ) − b = 0}, where µ denotes
the center of the ellipse and Λ ∈ ℝ2×2 is a positive semidefinite matrix. Note that in
general a set of four two-dimensional samples cannot be separated by a hyperplane
(e.g., the configuration in the first panel of the second row of Figure 9.2—the XOR prob-
lem), but three samples can. Thus, the VC dimension is ν = 3. In general, if K denotes
the set of hyperplanes in ℝd , then ν = d + 1. For polynomials, ν grows with the degree
of the polynomial.
Since the VC dimension ν depends on the model parameters θ, all three quantities
in Equation (9.1), the expected test error ε(θ), the empirical training error etraining (θ),
and the VC confidence Φ(ν,N,η), change with varying ν. Figure 9.3 outlines how the
three quantities change with increasing VC dimension ν. It can be seen that with larger
ν, the VC confidence Φ also increases, while the empirical training error decreases. The expected test
error ε(θ) first decreases, but increases again when the classifier begins to overfit the
data. The optimal ν, that is, the model family with optimal complexity with respect to
ε, is found where etraining (θ) + Φ(ν,N,η) is minimal.
Fig. 9.3. Qualitative plot of the expected test error ε(θ), the empirical training error etraining (θ), and
VC confidence Φ(ν,N,η), against VC dimension ν.
Inserting Equation (9.2) into the bound yields
ε(θ) ≤ etraining (θ) + √[ ( ν (log(2N/ν) + 1) − log(η/4) ) / N ] , (9.3)
which, together with Figure 9.3, gives rise to the following observations:
1. If the number of training samples is vastly larger than the VC dimension, i.e., N/ν →
∞, then Φ → 0 and in turn ε → etraining . In this case, it is sufficient to minimize
the training error, as a small training error guarantees a small test error. This is
known as the empirical risk minimization (ERM) principle (Vapnik and Vapnik
[1998]).
2. If, on the other hand, the ratio N/ν is small, then Φ dominates the bound. A small
training error no longer guarantees a small test error. Here, etraining and Φ must
be minimized simultaneously instead. This principle is known as the structural
risk minimization (SRM) principle (Vapnik and Vapnik [1998]).
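To get a feeling for the interplay of N and ν, the following sketch simply evaluates the VC confidence of Equation (9.2) for a fixed ν and increasing sample sizes; the numbers are arbitrary.

# VC confidence Phi(nu, N, eta) of Equation (9.2) for a few ratios N / nu (sketch).
import numpy as np

def vc_confidence(nu, N, eta=0.05):
    return np.sqrt((nu * (np.log(2 * N / nu) + 1) - np.log(eta / 4)) / N)

nu = 10                                    # e.g. hyperplanes in a 9-dimensional feature space
for N in (20, 100, 1000, 100000):
    print(f"N = {N:6d}, N/nu = {N / nu:7.1f}, Phi = {vc_confidence(nu, N):.3f}")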
Fig. 9.4. Classification error probability for (a) a non-optimal and (b) the optimal choice of the
decision boundary. Even with an optimal decision boundary, there is a remaining, irreducible error
probability.
Most practical applications of pattern recognition fall in the second case: there is a
relatively small dataset on which to train the classifier, which means one should either
choose a model with low VC dimension or train the model using SRM.
9.2 Empirical evaluation of classifier performance
The design phases of a pattern recognition system in Figure 1.4 listed the evaluation
of the classifier as the last step. In the previous chapters, estimators of the error proba-
bility P e were given for some specific classifiers, but this is not the only possibility for
assessing the performance of a classifier. This section will fill that gap by introducing
some common performance measures and techniques to validate classifiers with a
finite test sample.
Fig. 9.5. (a) Terms for classification outcomes: correct rejection, false alarm, slack, and discovery
(hit). (b) Confusion matrix: rows correspond to the true class ω, columns to the predicted class ω̂,
with the cells TN, FP, FN, and TP.
We will first focus on a binary classifier that decides between only c = 2 classes.
Usually, one class is called the “negative class” and the other one the “positive class.”
Here, we associate ω1 with the negative class and ω2 with the positive class. Binary
classifiers are always used if the goal is to detect the presence or absence of some qual-
ity, e.g., diseased vs. healthy fruit, defective vs. intact workpiece, etc. Moreover, every
multi-class classification task can be solved using a combination of binary classifiers
(see, for example, Section 7.1.1). Because binary classifiers play such an important role
in pattern recognition, many technical terms have been invented to precisely describe
their characteristics. In order to simplify the upcoming discussion, we restrict the sce-
nario to a one-dimensional feature space (d = 1) and a linear classifier. This means
the decision boundary is just a single point m∗ and the decision regions are given by
R1 = {m ∈ M|m < m∗ } and R2 = {m ∈ M|m > m∗ }.
As already discussed in the previous chapters, the feature distributions usually
overlap, which means that there is some minimum error probability that cannot be
reduced any further. This situation is depicted in Figure 9.4. In both sub-figures, the
decision boundary is marked by m∗ and the optimal decision boundary according
to Bayes is labeled mB . A classification error occurs if a feature falls in a different
decision region than that to which the true class belongs. In this example, the overall
error equals
P e = ∫_{R1} p(m,ω2 ) dm + ∫_{R2} p(m,ω1 ) dm , (9.4)
which is the sum of the red and blue areas in Figure 9.4. In Figure 9.4b, this sum attains
its minimum, as the decision boundary equals the optimal boundary mB = m∗ . In
Figure 9.4a, the sum is larger, because the boundary introduces an additional reducible
error (purple frame).
Depending on the true class ω and the classifier prediction ω̂, one can distinguish
four cases (see Figure 9.5a):
– Correct rejection (true negative): the true class is ω1 and the sample is classified as ω1 .
– False alarm (false positive): the true class is ω1 , but the sample is classified as ω2 .
– Slack (false negative): the true class is ω2 , but the sample is classified as ω1 .
– Discovery (hit, true positive): the true class is ω2 and the sample is classified as ω2 .
Fig. 9.6. Class-conditional decision probabilities of the four outcomes: P(m ∈ R1 | ω1 ) (correct
rejection), P(m ∈ R2 | ω1 ) (false alarm), P(m ∈ R1 | ω2 ) (slack), and P(m ∈ R2 | ω2 ) (discovery).
Figure 9.6 shows the class-conditional decision probabilities that correspond to the
probabilities for each of the cases. Please note that these are not the same as the error
probability, which is the unconditional probability that the classification is false.
When evaluating a classifier using a test set T, the outcome of each case is counted
and recorded in a confusion matrix, as shown in Figure 9.5b. In the two-class setting,
the cells are often labeled as follows: true negatives, which counts the number of cor-
rect rejections, false positives, which counts the number of false alarms, false negatives,
which is the number of falsely rejected samples, and true positives, which is the num-
ber of discoveries in the dataset. We denote the number of samples in each cell by TN,
FP, FN, and TP, respectively. Using these quantities, one can compute higher order
performance measures that characterize the classifier. These measures usually approx-
imate a characteristic classification probability. A list of common measures is given in
Table 9.1.
These measures are coupled in interesting ways. Changing a parameter to improve
one measure will also change the values of the other measures. Unfortunately, the
coupling is sometimes reversed: improving one measure may have a negative impact
on another. Consider the example in Figure 9.7a, where the classifier has only one
parameter m∗ . Increasing m∗ will decrease the false positive rate (the fall-out), but
the true positive rate (the recall) will also be decreased. Moving the decision boundary
in the opposite direction will increase the recall, but also increase the fall-out.
Table 9.1. Common binary classification performance measures derived from a confusion matrix.
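The measures are simple ratios of the four counts and are easily computed from a confusion matrix. The sketch below uses the common definitions of recall, precision, fall-out, specificity, and accuracy on invented counts; the exact selection of measures in Table 9.1 is not reproduced here.

# Common binary performance measures derived from a confusion matrix (sketch).
def binary_measures(TP, FP, FN, TN):
    return {
        "recall (true positive rate)":    TP / (TP + FN),
        "precision":                      TP / (TP + FP),
        "fall-out (false positive rate)": FP / (FP + TN),
        "specificity":                    TN / (TN + FP),
        "accuracy":                       (TP + TN) / (TP + FP + FN + TN),
    }

# Hypothetical counts from evaluating a detector on a test set T.
for name, value in binary_measures(TP=40, FP=10, FN=5, TN=45).items():
    print(f"{name:33s} {value:.3f}")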
We now turn our attention back to the generic multi-class setting. Generalizing the
classification error probability P e of the binary classifier (see Equation (9.4)) to the
multi-class setting yields
P e = 1 − ∑_{k=1}^{c} ∫_{R k} p(m,ω k ) dm .
Fig. 9.7. Examples of ROC curves for two Gaussian feature distributions with variance σ 2 = 1 and
distance d between the expectations μ1 and μ2 : (a) underlying class-specific feature distributions
with recall and fall-out marked, (b) ROC curves for d = 0.7, 1.3, 2.0, and d → ∞.
Intuitively, the error probability P e is the probability that the feature vector m does
not fall in the correct decision region.
Other measures from the binary case, like precision and recall, cannot directly
be applied in a multi-class setting. However, these measures can still be computed
individually for each class: the c-class classification problem is treated as c separate
binary classification problems, where the goal is to separate the target class ω i from
all other classes ω k , k ≠ i, denoted ω̄ i for short.
Formally, let C(ω j ,ω k ) denote the number of samples with true class ω = ω j that
were classified as ω̂ = ω k . The number of true positives, false positives, false negatives,
Fig. 9.8. Converting a multi-class confusion matrix to binary confusion matrices: A multi-class
confusion matrix of c classes can be subsumed into c binary confusion matrices, one for each class
ω i . Example here: the reduced confusion matrix with respect to ω2 , i.e., i = 2.
and true negatives for class ω i are then computed by (see Figure 9.8):
TPi = C(ω i ,ω i ), FPi = ∑_{j≠i} C(ω j ,ω i ), FNi = ∑_{k≠i} C(ω i ,ω k ), TNi = ∑_{j≠i} ∑_{k≠i} C(ω j ,ω k ).
Given these counts, the measures from Table 9.1 can be derived for each class
individually. Sometimes, the class-wise measures are further averaged over all classes
to grade the overall performance with a single number.
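The reduction of Figure 9.8 amounts to summing the appropriate row and column of the confusion matrix C(ω j ,ω k ). A sketch with a hypothetical 3 × 3 matrix:

# Per-class TP, FP, FN, TN from a multi-class confusion matrix (sketch).
import numpy as np

C = np.array([[50,  3,  2],    # hypothetical counts C(omega_j, omega_k):
              [ 4, 45,  6],    # row = true class, column = predicted class
              [ 1,  5, 44]])

def per_class_counts(C, i):
    TP = C[i, i]
    FP = C[:, i].sum() - TP        # predicted as omega_i although the true class differs
    FN = C[i, :].sum() - TP        # true class omega_i but predicted differently
    TN = C.sum() - TP - FP - FN
    return TP, FP, FN, TN

for i in range(C.shape[0]):
    TP, FP, FN, TN = per_class_counts(C, i)
    print(f"class {i + 1}: recall {TP / (TP + FN):.3f}, precision {TP / (TP + FP):.3f}")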
In the following, assume that a classifier with c classes was trained on the training set
D and the error probability will be estimated from the test set T. Again, all elements in
the training and test sets are i.i.d. The number of samples in T is denoted by |T| = NT .
Denote by Tj the set of the |Tj | = NT,j test samples that belong to class ω j . Let P j denote
the (unknown) error probability for class ω j and let n j denote the number of incorrectly
classified samples in Tj . The probability of incorrectly classifying n j items of Tj is
Pr{n j items of Tj misclassified} = (NT,j choose n j ) P j^{n j} (1 − P j )^{NT,j − n j} . (9.10)
Note, again, that this error probability cannot be computed, because the P j are
not known. It can, however, be estimated from the test sample using the maximum
likelihood estimate of P j , P̂ j = n j /NT,j , which leads to the maximum likelihood estimate
of the overall error probability of the classifier,
P̂ e = ∑_{j=1}^{c} P(ω j ) n j /NT,j . (9.11)
This highlights two problems. First, the variance of the estimator is inversely pro-
portional to NT,j , which means that small test subsets will result in a relatively high
variance of the estimated error probability. Second, the comparison of two different
classifiers with respect to their error probability is only valid if the difference in P̂ e is
significant w.r.t. the stochastic error of the estimate √Var{P̂ e }.
The question, then, is how to choose a good test sample in order to get good esti-
mates P̂ e . General approaches to choosing adequate test sample sizes are described by
Guyon et al. [1998]. Here, however, we are only interested in choosing the number of
test samples NT so that, with probability (1 − a), the true error rate P e does not exceed
the estimated error rate P̂ e by more than some small quantity ε(NT , a):
Pr{P e > P̂ e + ε(NT ,a)} ≤ a .
Defining ε(NT ,a) := βP e as a fraction of the true classification error, one can
compute suitable test set sizes. For example, for a = 0.05 and β = 0.2, i.e., that there
should be a 95 % probability that the true error P e does not exceed the estimated error
P̂ e by more than 20 %, it follows that the number of test samples should be NT ≈ 100/P e
(Guyon et al. [1998]). Note that this result is independent of the number of classes c.
D1 := S1 ∪ S2 ∪ S3 ∪ S4 , T1 := S5 , and analogously for the folds 2 to 5; the fold-wise
estimates are combined as
P̂ e = (1/5) ∑_{f=1}^{5} P̂ e,f and Var{P̂ e } ≈ (1/4) ∑_{f=1}^{5} ( P̂ e,f − P̂ e )² .
Fig. 9.9. Example of a five-fold cross-validation: The dataset S is partitioned into five equally sized
subsets Si . In each fold, one subset is used as the test set T and the training set D is the union of the
remaining subsets.
In the jth of the m rounds (also called folds), the learning set D = S \ Sj is con-
structed from all but the jth subset, which is used as the test set T = Sj . The classifier
is trained using D and evaluated using T as usual, and the classification error—or any
other performance measure—is recorded. The process is repeated for each j = 1, . . . ,m,
so that every sample is used for testing once. The final estimate of the test error is taken
as the arithmetic mean of the m test errors estimated in the folds. Cross-validation also
makes it possible to estimate the variance of the test error estimate. Typical values for the number of folds are m = 5 or m = 10. A schematic example of a five-fold cross-validation
is shown in Figure 9.9.
A special case occurs if m = N = |S|, i.e., when the number of folds is equal to
the number of samples. Here, the classifier is trained with all but one sample in S
and evaluated on the sample that was left out. This scheme is therefore also known
as the leave-one-out cross-validation. The estimates obtained with a leave-one-out
cross-validation are typically more reliable than with m-fold cross-validation, but this
increased precision is paid for by a much larger evaluation effort.
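The procedure translates directly into code. The following Python sketch is a minimal illustration (NumPy only; `train` and `predict` are hypothetical callables standing in for an arbitrary classifier), not a reference implementation; in practice the folds would usually also be stratified by class.

```python
import numpy as np

def cross_validate(X, z, train, predict, m=5, seed=0):
    """m-fold cross-validation estimate of the classification error.

    train(X, z) -> model and predict(model, X) -> labels are placeholders
    for an arbitrary classifier.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(z)), m)    # partition S into m subsets S_f
    errors = []
    for f in range(m):
        test_idx = folds[f]                               # T = S_f
        train_idx = np.concatenate([folds[j] for j in range(m) if j != f])  # D = S \ S_f
        model = train(X[train_idx], z[train_idx])
        errors.append(np.mean(predict(model, X[test_idx]) != z[test_idx]))
    errors = np.array(errors)
    # mean of the fold errors and variance estimate as in Figure 9.9
    return errors.mean(), errors.var(ddof=1)
```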
9.3 Boosting
[Excerpt of Algorithm 9.1 (AdaBoost), concluding steps:
α_j ← log((1 − ε_j)/ε_j)
w_i ← w_i · exp(α_j · δ[z_i ≠ k_j(m_i)])   for i = 1, . . . , N
return k(m) = sign(∑_{j=1}^{M} α_j k_j(m))]
The underlying idea of boosting is that an ensemble of many weak classifiers, i.e., classifiers that perform marginally better than a random guess, can form a strong classifier.
The decision function k(m) of the strong classifier is a weighted sum of those of the
weak classifiers k j (m),
k(m) = sign( ∑_{j=1}^{M} α_j k_j(m) ).    (9.15)
The weights α j are adapted so that the weak classifiers k j (m) are weighted accord-
ing to their classification performance on the training set. Note that in Section 8.2, we
have discussed another instance of ensemble methods: Random Forests. Here, how-
ever, the base learners k j (⋅) may be any type of classifier, not just decision trees. The
only restriction is, again, that the k j (⋅) must perform better than random guessing.
Note that here we restrict ourselves to binary classification, hence the scalar decision
function k j (⋅) is used in place of the vectorial kj (⋅).
One of the best-known boosting algorithms is AdaBoost (short for adaptive boost-
ing) of Freund and Schapire [1997]. The algorithm is very simple, yet produces very
strong classifiers out of the box. The basic idea of AdaBoost is to iteratively generate
weak classifiers that minimize the weighted training error on the training set. Initially,
all training samples are weighted equally, but after each iteration more weight is put on
misclassified samples. Consequently, the classifier chosen in the subsequent iterations
will put more emphasis on correctly classifying samples that were often misclassified.
Pseudo-code of the algorithm is given in algorithm 9.1. If the weak classifier is cho-
sen from a set of base classifiers, line 3 of the algorithm is replaced with choosing the
[Figure 9.10, first panel annotations: initial weights w_i = 1/10 for i = 1, . . . , 10; first weak classifier k_1 with ε_1 = 0.3 and α_1 ≈ 0.42; the subsequent panels show the reweighted samples.]
Fig. 9.10. Schematic example of AdaBoost training (M = 3) for two classes with the decision functions k_i(m) := sign(m_{d_i} − τ_i) with d_i ∈ {1, 2}. The size of a mark indicates the magnitude of the associated weight. Example adapted from Freund and Schapire [2004].
classifier that minimizes the weighted training error. A visual example of AdaBoost
training and the resulting strong classifier is given in Figures 9.10 and 9.11.
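As an illustration of the training loop, here is a compact Python sketch of binary AdaBoost with decision stumps as weak classifiers. It follows the update rules quoted from Algorithm 9.1 above; the stump learner, the weight normalization, and all names are our own additions.

```python
import numpy as np

def fit_stump(X, z, w):
    """Weak learner: decision stump k(m) = s * sign(m[d] - tau) minimizing the weighted error."""
    best = None
    for d in range(X.shape[1]):
        for tau in np.unique(X[:, d]):
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, d] - tau >= 0, 1.0, -1.0)
                eps = np.sum(w * (pred != z))
                if best is None or eps < best[0]:
                    best = (eps, d, tau, s)
    return best                                    # (eps_j, dimension, threshold, sign)

def stump_predict(X, d, tau, s):
    return s * np.where(X[:, d] - tau >= 0, 1.0, -1.0)

def adaboost(X, z, M=3):
    """Labels z in {-1, +1}; returns the weighted ensemble [(alpha_j, d, tau, s), ...]."""
    N = len(z)
    w = np.full(N, 1.0 / N)                        # initially all samples weighted equally
    ensemble = []
    for _ in range(M):
        eps, d, tau, s = fit_stump(X, z, w)
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # guard against division by zero
        alpha = np.log((1 - eps) / eps)            # alpha_j <- log((1 - eps_j) / eps_j)
        wrong = stump_predict(X, d, tau, s) != z
        w *= np.exp(alpha * wrong)                 # put more weight on misclassified samples
        w /= w.sum()                               # normalization keeps eps_j a fraction
        ensemble.append((alpha, d, tau, s))
    return ensemble

def strong_classify(ensemble, X):
    votes = sum(a * stump_predict(X, d, tau, s) for a, d, tau, s in ensemble)
    return np.sign(votes)                          # k(m) = sign(sum_j alpha_j k_j(m))
```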
Boosting algorithms are meta-algorithms that can be used with many different
weak classifiers or even combinations of classifiers of different types. Popular choices
are shallow decision trees, i.e., decision trees that are only a few levels deep, and linear
SVMs. While simple, the approach is very powerful. Given that the weak classifiers are
simple classifiers, the strong classifier is usually very fast to compute. At the same time,
boosted classifiers typically generalize well and do not overfit easily. Weak classifiers that perform well on the training set will receive larger weights α_j than classifiers that do not.
The main design parameters are the types of the weak classifiers k j (⋅) or a set of
pre-selected classifiers to choose from, and the number of iterations M. There is no
need for prior knowledge of the distribution of the features, but boosting typically
needs large training sets to work well.
9.4 Rejection
In Chapter 1, and throughout the book so far, all objects were assumed to belong to one
of the classes ω1 , . . . ,ω c . In other words, the relevant part of the world Ω is partitioned
into equivalence classes ω i , Ω/∼ = {ω1 , . . . ,ω c }. The underlying (implicit) assumption
Fig. 9.11. Visual representation of the AdaBoost classifier obtained after three iterations in Fig-
ure 9.10.
Fig. 9.12. Reasons to refuse to classify an object: (i) the feature vector is placed near a border between classes or (ii) the feature vector is placed within unpopulated regions of the feature space.
that all classes are known at design time and that every object o ∈ Ω belongs to one
of the classes ω i is called the closed world assumption.
In practice, this assumption rarely holds, but it is often close enough to the truth
to not cause any harm. An intelligent scale that classifies fruit and vegetables, for example, will work with a model of just the produce items offered in the supermarket. Other
products in the store are not included in the model, and will be misclassified when
put on the scale, but the harm due to the misclassification is negligible.
In any case, there are at least two valid reasons to refuse a classification:
1. Ambiguous situation: the decision functions k i do not exhibit a significant maxi-
mum.
2. Unknown object: The object lies outside the domain explained by Ω/∼.
The first case occurs when the feature vector of an object falls near a decision bound-
ary. In other words, a sample should be rejected if the sample falls into a narrow strip
around the decision boundaries. The second case occurs if the feature vector falls out-
side of the area occupied by the feature vectors of the known objects (see Figure 9.12).
The intelligent scale, for example, may encounter a Nashi pear that looks similar to an
apple. Customers might also be tempted to put unrelated items, e.g., a loaf of bread, on
the scale to see how the system reacts. In both cases, the task of classification should
be declined.
To treat both cases, the workflow of a classifier (see Figure 3.3) is extended by
a subsequent rejection stage, as shown in Figure 9.13. The rejection test might be
inherent to the specific classifier, but there are four classifier-independent rejection
[Figure 9.13: the feature vector m is fed to the decision functions k_1(m), . . . , k_c(m); the maximum is determined, a rejection test is applied, and the result is either the class estimate ω̂ or a rejection.]
Fig. 9.13. Schema of a classifier with rejection option. As the rejection option is a subsequent step
after maximum search, it can be applied to any classifier.
criteria that work with any classifier that outputs some measure of confidence. These
are (due to Schürmann [1996]):
– Maximum criterion: Reject if max{k i } < τ, i.e., if the decision functions show no
significant maximum. The corresponding rejection region in the decision space is
shown in Figure 9.14a.
– Difference criterion: Reject if max{k i } − max {{k j } \ max{k i }} < τ, i.e., the two top
ranked decision options have a similar confidence. The corresponding rejection
region in the decision space is shown in Figure 9.14b.
– Distance criterion: Reject if min{‖k − ωi ‖} > τ, i.e., if the confidence of the closest
class in the decision space is not large enough. The corresponding rejection region
in the decision space is shown in Figure 9.14c.
– Minimum criterion: Reject if min{k i } < τ < 0, i.e., if at least one decision function
expresses high confidence that the object does not belong to any of the classes
defined during design time. The corresponding rejection region in the decision
space is shown in Figure 9.14d.
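A minimal Python sketch of the four criteria (our own illustration: the decision vector k is assumed to contain one confidence value per class, the target vectors for the distance criterion are taken to be the unit vectors of the decision space, and the thresholds are arbitrary):

```python
import numpy as np

def rejection_reasons(k, tau_max=0.5, tau_diff=0.2, tau_dist=0.7, tau_min=-0.5):
    """Return the rejection criteria (Schürmann [1996]) triggered by the decision vector k."""
    k = np.asarray(k, dtype=float)
    top = np.sort(k)[::-1]                                    # decision values, largest first
    reasons = []
    if top[0] < tau_max:                                      # maximum criterion
        reasons.append("maximum")
    if top[0] - top[1] < tau_diff:                            # difference criterion
        reasons.append("difference")
    targets = np.eye(len(k))                                  # one target (unit) vector per class
    if np.min(np.linalg.norm(targets - k, axis=1)) > tau_dist:    # distance criterion
        reasons.append("distance")
    if np.min(k) < tau_min:                                   # minimum criterion (tau_min < 0)
        reasons.append("minimum")
    return reasons

print(rejection_reasons([0.45, 0.40, 0.15]))   # ['maximum', 'difference'] for these thresholds
```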
Formally, the rejection option can be treated as an additional class ω_0, with which the original class-partition of Ω is augmented (recall Section 3.3), i.e., Ω/∼ = {ω_0, ω_1, . . . , ω_c}.
[Figure 9.14: four panels (a)–(d) in the decision space spanned by k_1 and k_2, showing the rejection regions of the four criteria.]
Fig. 9.14. Rejection criteria and the corresponding rejection regions in the decision space.
9.5 Exercises
(9.1) A classifier for the two classes ω_1 and ω_2 is evaluated on a test set with the following results:
ω_1: N_{T,1} = 20, k_1 = 14
ω_2: N_{T,2} = 30, k_2 = 12
Here, NT,i denotes the number of test samples of class ω i and k i denotes the num-
ber of correctly classified samples of that class. The a priori probabilities are given
by P(ω1 ) = p and P(ω2 ) = 2p for p ∈ [0,1].
1. Give an estimate P̂ e of the error probability of the classifier.
2. In operation, an error rate of P e = 0.4 is observed. Assuming that the class-
dependent error rates are correct, what are the values of the true a priori prob-
abilities?
(9.2) A test of a classifier with three classes ω1 , ω2 , ω3 results in the following confu-
sion matrix:
True class
Prediction ω1 ω2 ω3
ω1 120 6 3
ω2 16 21 7
ω3 8 9 26
Give an estimate P̂_e of the error probability of the classifier. Assume the following a priori probabilities: P(ω_1) = 1/2, P(ω_2) = 1/5, and P(ω_3) = 3/10.
A Solutions to the exercises
A.1 Chapter 1
(1.2) The relation is not transitive and hence not an equivalence relation, as seen in
this family tree:
[Family tree diagram (including the persons Joel and Kathie) omitted.]
(1.4) x ∼ y ⇔ xT y ≥ 0 is reflexive and symmetric, but not transitive and hence not an
equivalence relation. Let x = (0, − 1)T , y = (1,0)T , and z = (0,2)T . Then:
xT y = 0 ⇒ x ∼ y
yT z = 0 ⇒ y ∼ z
xT z = −2 ⇒ x ≁ z
(1.5) The relation is not an equivalence relation, because symmetry does not hold if
f(x) < f(y) (strictly smaller) for any x,y ∈ ℕ. In this case, x ∼ y holds, but y ≁ x,
because f(y) ≰ f(x).
(1.6) The relation is not symmetric and hence not an equivalence relation:
Let r(X,n) ∈ O(n^a) and r(Y,n) ∈ O(a^n) for a > 1. Then r(X,n) ∈ O(r(Y,n)) = O(a^n) and therefore X ∼ Y. However, r(Y,n) ∉ O(r(X,n)) = O(n^a) and thus Y ≁ X.
A.2 Chapter 2
(2.1) mn allows any relabeling function, i. e., any injective function. mo allows only functions that preserve the ordering. mi allows only functions that also preserve relative distances, i. e., only linear functions. mr allows only functions that also preserve the zero, i. e., only scaling functions. ma allows only the identity mapping.
1. f(m) = 3 m + α is injective and strictly increasing and therefore allowed for
both mn and mo . Since 3 > 0, it can also be applied to mi , but is only allowed
for mr if α = 0.
2. mn and mo both allow f(m) = e^m, since the function is injective and strictly
increasing. It is not linear, and therefore not allowed with either mi or mr .
3. The function is only allowed with mn , as it is injective, but not strictly increas-
ing and not linear.
D_KL(P_1 ‖ P_2) = ∑_{x ∈ supp(P_1)} P_1(x) log( P_1(x)/P_2(x) )
= P_1(a) log(P_1(a)/P_2(a)) + P_1(b) log(P_1(b)/P_2(b)) + P_1(c) log(P_1(c)/P_2(c)) + P_1(d) log(P_1(d)/P_2(d))
= (1/3) log( (1/3)/(1/3) ) + 0 · log( 0/(1/6) ) + (1/3) log( (1/3)/(1/6) ) + (1/3) log( (1/3)/(1/3) )
= (1/3) log 1 + (1/3) log 2 + (1/3) log 1
= (1/3) log 2 ≈ 0.23
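A quick numerical cross-check in Python (assuming the natural logarithm, which matches the value 0.23, and the distributions P1, P2 as read off above):

```python
import numpy as np

P1 = {"a": 1/3, "b": 0.0, "c": 1/3, "d": 1/3}
P2 = {"a": 1/3, "b": 1/6, "c": 1/6, "d": 1/3}

# sum only over supp(P1), so the term with P1(b) = 0 drops out
d_kl = sum(p * np.log(p / P2[x]) for x, p in P1.items() if p > 0)
print(d_kl)    # 0.2310... = (1/3) * ln 2
```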
(2.10) A feature is invariant to translation if the location parameter c does not influ-
ence the feature. Similarly, a feature is invariant to rotation if its computation does
not involve the rotation parameter φ. All other measurements—perimeter P, area
A, and axis lengths l1 and l2 —are sensitive to scaling, but to different degrees; a
scaling factor s affects P, l1 and l2 linearly, but A will grow quadratically in s.
A.3 Chapter 3
(3.2) 1. If the a priori probability distribution is uninformative about the class, i. e.,
if P(ω i ) is the same for all classes ω i .
2. If the class-specific feature distribution is the same for both features, i. e.,
p(m1 | ω i ) = p(m2 | ω i ) for all ω i .
(3.3) 1. ω1 and ω2 can not be separated using m. The classification depends solely
on the a priori probabilities.
2. ω3 and ω2 can be separated without error. Since the class-specific feature
distributions for ω1 and ω2 are the same, ω3 and ω1 can also be perfectly
separated.
3. Class-dependent error probabilities:
For ω_3: P(ω̂ ≠ ω_3 | ω_3) = 0.
For ω_2: P(ω̂ ≠ ω_2 | ω_2) = 0 and for ω_1: P(ω̂ ≠ ω_1 | ω_1) = 1, since the classifier always decides on ω_2.
(3.4) 1. Sketch of the feature distributions p(m | ω) and the decision boundaries for parts (2) and (3). [Sketch omitted.] The resulting error probability for the first boundary (part 2) is
(1/2) · (1/8 + 1/8) = 1/8.
For the second boundary (part 3), it is
(3/4) · (1/32) + (1/4) · (9/32) = 3/32.
2. Class ω4 will be chosen, since it has highest a priori probability of all classes.
3. ω1 and ω4 have the highest class-specific feature densities at m1 . Since
P(ω4 ) > P(ω1 ),
p(m1 | ω1 )P(ω1 ) < p(m1 | ω4 )P(ω4 ) ⇔ P(ω1 | m1 ) < P(ω4 | m1 )
and hence ω4 is chosen again.
4. ω1 and ω4 as well as ω2 and ω4 are best separated using only m2 . ω1 and
ω2 cannot be separated, since the a priori probability and the class-specific
feature distributions, and hence the a posteriori probabilities, are the same
for these classes.
5. Since ω1 and ω2 can not be separated using m2 , but they can be separated
using m1 , one should use m1 instead of m2 .
A.4 Chapter 4
Since the weight m is in the interval [10,20], and the maximum of MSE(m̂ 1 ) in this
interval is 25 (at m = 10 and m = 20), the estimator m̂ 1 has lower mean squared
error than m̂ 2 .
(4.2) To find the maximum likelihood estimator, we maximize the log-likelihood func-
tion:
l(μ) = log ∏_{i=1}^{N} (1/(2σ)) exp(−|m_i − μ|/σ) = ∑_{i=1}^{N} ( log(1/(2σ)) − |m_i − μ|/σ )
     = −N log(2σ) − (1/σ) ∑_{i=1}^{N} |m_i − μ|
⇒ μ_ML = arg max_μ l(μ) = arg max_μ { −N log(2σ) − (1/σ) ∑_{i=1}^{N} |m_i − μ| }
       = arg max_μ { −(1/σ) ∑_{i=1}^{N} |m_i − μ| } = arg max_μ { −∑_{i=1}^{N} |m_i − μ| }
       = arg min_μ { ∑_{i=1}^{N} |m_i − μ| } = μ̂.
E{μ̂} = E{ (1/(N−4)) ∑_{i=3}^{N−2} x_i } = (1/(N−4)) ∑_{i=3}^{N−2} E{x_i}
     = (1/(N−4)) · (N−4) · μ = μ.
2. Both estimators are unbiased, but the variance of μ̂ is larger than the variance of μ̂_ML and therefore μ̂_ML is a better estimator. This can be shown using Equation (4.23):
Var{μ̂} = Var{ (1/(N−4)) ∑_{i=3}^{N−2} x_i } = (1/(N−4)²) ∑_{i=3}^{N−2} E{(x_i − μ)²}
       = (1/(N−4)²) ∑_{i=3}^{N−2} σ² = σ²/(N−4) > σ²/N = Var{μ̂_ML}
E{σ̂²} = (1/(α−N)) ∑_{i=1}^{N} E{(m_i − μ)²}    (E{·} linear)
      = (1/(α−N)) ∑_{i=1}^{N} E{(m − μ)²}      (m_i i. i. d.)
      = (1/(α−N)) · N · Var{m} = (N/(α−N)) · σ².
In order for the estimator to be unbiased, the expected value must be equal to the true value, which gives
E{σ̂²} = σ² ⇔ N/(α−N) = 1 ⇔ α = 2N.
E{μ̂} = E{ (N/(N−α)) ∑_{i=1}^{N} f(m_i) } = (N/(N−α)) ∑_{i=1}^{N} E{f(m_i)}
     = (N/(N−α)) ∑_{i=1}^{N} E{f(m)} = N²μ / (N−α).
It is unbiased if
E{μ̂} =! μ ⇔ N²μ/(N−α) = μ ⇒ N² = N − α ⇔ α = N − N².
A.5 Chapter 5
(5.1) The inverse mapping x = A−1 (y) = ±√y is not unique, therefore the inference
from y to x is ill-posed.
(5.2) The system of equations to solve this problem is over-determined, since there
are more data points (N > 2) than degrees of freedom (a,b). This means that
in general there is no solution that interpolates all data points. Therefore, the
problem is ill-posed.
The following variation is well-posed.
Find the parameters a,b ∈ ℝ of a straight line y = a x + b that minimizes the distance
between the line and the data points, i. e., find the parameters a,b ∈ ℝ that minimize
∑_{i=1}^{N} d(y_i, a·x_i + b).
p̂(x) = (1/N) ∑_{i=1}^{N} (1/V_N) φ( (x − x_i)/h_N )    where φ(y) := 1 if |y| ≤ 0.5 and 0 else.
Here, V_N = h_N = 1, and hence
p̂(6) = (1/10) · 0 = 0
p̂(8) = (1/10) · 2 = 0.2
p̂(10) = (1/10) · 3 = 0.3
p̂(12) = (1/10) · 1 = 0.1
p̂(14) = (1/10) · 0 = 0
(5.5) The dataset can be sorted to quickly find the nearest neighbors to a given m:
D = {7.1, 7.6, 8.0, 8.5, 9.3, 9.7, 10.0, 10.5, 12.2, 14.9}.
The density is estimated according to p̂(m) = k/(N·V). The volume V depends on the position of the neighbors, V(m) = 2 · |n_3(m) − m|, where n_3(m) denotes the third-closest neighbor of m. Putting everything together,
p̂(m = 6) = 3 / (10 · 2 · |8 − 6|) = 3/40
p̂(m = 8) = 3 / (10 · 2 · |8.5 − 8|) = 3/10
p̂(m = 10) = 3 / (10 · 2 · |10.5 − 10|) = 3/10
p̂(m = 12) = 3 / (10 · 2 · |10 − 12|) = 3/40
p̂(m = 14) = 3 / (10 · 2 · |10.5 − 14|) = 3/70
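The same values can be reproduced with a short Python sketch of the k-nearest-neighbor density estimate (k = 3, one-dimensional features):

```python
import numpy as np

D = np.array([7.1, 7.6, 8.0, 8.5, 9.3, 9.7, 10.0, 10.5, 12.2, 14.9])

def knn_density(m, data, k=3):
    # V(m) = 2 * |n_k(m) - m|: smallest symmetric interval around m containing the k nearest samples
    dists = np.sort(np.abs(data - m))
    V = 2.0 * dists[k - 1]
    return k / (len(data) * V)

for m in (6, 8, 10, 12, 14):
    print(m, knn_density(m, D))
# 0.075 (= 3/40), 0.3 (= 3/10), 0.3 (= 3/10), 0.075 (= 3/40), 0.0428... (= 3/70)
```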
(5.6) The feature space, sample, decision boundary, and samples to classify are shown in the diagram below:
[Diagram: feature space with axes m_1 and m_2 (each from −6 to 6), the training samples of ω_1 and ω_2, the decision boundary, and the query points m_1, m_2, and m_3.]
⇒ ω̂(m_1) = ω_2;  ω̂(m_2) = ω_1;  ω̂(m_3) = ω_1 or ω̂(m_3) = ω_2
A.6 Chapter 6
(6.1) Each of the three Gaussian components g k (m) is parametrized by a mean µk and
a covariance matrix Σk . The mean requires five (5) parameters since the feature
space is five-dimensional. The covariance matrix is symmetric and requires es-
timating fifteen (15) parameters. Two parameters are needed to estimate the α k ,
since α1 + α2 + α3 = 1. In all, there are 3 ⋅ (5 + 15) + 2 = 62 parameters to estimate
for the Gaussian mixture.
The Parzen window method does not require estimating any parameters. The meta-
parameters (window type, window size) are chosen beforehand.
(6.2) The linear classifier needs to estimate four (4) parameters: three for the normal
of the hyperplane w and one for the distance to the origin b.
For the Gaussian classifier, four (4) parameters are needed for each mean and ten
(10) parameters are needed for the covariance matrices, so there are 2⋅(4+10) = 28
parameters to estimate in all. The a priori probabilities are not estimated from the
sample.
(6.3) c · (6 + 6·7/2 + 1) − 1 = 28c − 1 < 256 ⇔ c < 257/28 = 9 + 5/28.
All in all, c = 9 classes can be separated, at a maximum, using this device.
256 − 9 ⋅ 28 + 1 = 5 parameters remain unused.
(6.4) The probability that the feature m_i lies in the interval [−2,5] is
Pr(m_i ∈ [−2,5]) = (5 − (−2)) / (11 − (−10)) = 7/21 = 1/3.
Pr(m ∉ [−2,5]^d) = 1 − Pr(m ∈ [−2,5]^d) = 1 − (1/3)^d =: P_d.
Plugging in different values for d yields
d = 1 ⇒ P_1 = 1 − 1/3 = 2/3 ≯ 9/10
d = 2 ⇒ P_2 = 1 − 1/9 = 8/9 ≯ 9/10
d = 3 ⇒ P_3 = 1 − 1/27 = 26/27 > 9/10
Therefore, more than 90 % of the probability mass is outside the hypercube
[−2,5]d when the dimensionality of the feature space is at least d = 3.
A.7 Chapter 7
(7.1) K(m, m′) = ‖m‖² does not depend on m′, hence it is not symmetric and not a kernel function.
(7.2) K(m, m′) = 4m^T m′ − m^T m − (m′)^T m′ is symmetric, but not positive definite, and therefore not a kernel function:
K((1, 0)^T, (0, 1)^T) = 4 · 0 − 1 − 1 = −2 < 0
K((1, 0)^T, (1, 1)^T) = 4 − 1 − 2 = 1 > 0
(7.3) The kernel function is the scalar product of the lifted features.
(7.4) 1. Sketch of the features and decision boundary. Note that there are no feature
vectors inside the margin, since this is a hard margin SVM:
[Sketch: feature space with axes m_1 and m_2 (each from 1 to 6), the samples of ω_1 and ω_2, and the maximum margin decision boundary.]
A.8 Chapter 8
[Decision tree diagram with root test “speed = medium?” (branches Y/N) and leaf nodes ω_1, ω_3, ω_1, ω_2.]
[Sketch: partitioning of the feature space (axes m_1 up to 8 and m_2 up to 6) induced by the decision tree, with the regions assigned to ω_1 and ω_2.]
Overfitting occurs in the region m1 > 8, i. e., in the partial tree reached by N→Y→N:
here, both leaves contain only one sample.
This overfitting is eliminated by replacing the partial tree reached by N→Y with a
leaf node ω1 , yielding the following decision tree:
m1 < 5?
├─ Y: m1 < 3?
│   ├─ Y: ω2
│   └─ N: m2 < 2?
│       ├─ Y: ω1
│       └─ N: ω2
└─ N: m2 < 5?
    ├─ Y: ω1
    └─ N: ω2
This tree misclassifies one of the training samples.
A.9 Chapter 9
(9.1) 1. P(ω_1) + P(ω_2) =! 1 ⇒ P(ω_1) = 1/3, P(ω_2) = 2/3. From Equation (9.12) it follows:
P̂_e = ∑_{i=1}^{2} P(ω_i) · (N_{T,i} − k_i)/N_{T,i} = (1/3) · (6/20) + (2/3) · (18/30) = 1/2
2. Approach: calculate P(ω_1) (and hence P(ω_2)) from the observed error rate
P_e = P(ω_1) · n_1/N_{T,1} + (1 − P(ω_1)) · n_2/N_{T,2}:
P(ω_1) · 6/20 + (1 − P(ω_1)) · 18/30 =! 4/10
⇔ 6/10 − 3/10 · P(ω_1) = 4/10
⇒ P(ω_1) = 2/3 ⇒ P(ω_2) = 1/3
(9.2) The numbers of test samples per class are N_{T,1} = 144, N_{T,2} = 36, and N_{T,3} = 36. The class-dependent error probabilities are estimated as
P̂(ω̂ ≠ ω_1 | ω_1) = (16 + 8)/144,   P̂(ω̂ ≠ ω_2 | ω_2) = (6 + 9)/36,   P̂(ω̂ ≠ ω_3 | ω_3) = (3 + 7)/36.
Putting both together yields
P̂_e = P(ω_1) P̂(ω̂ ≠ ω_1 | ω_1) + P(ω_2) P̂(ω̂ ≠ ω_2 | ω_2) + P(ω_3) P̂(ω̂ ≠ ω_3 | ω_3)
    = (1/2) · (24/144) + (1/5) · (15/36) + (3/10) · (10/36) = 1/12 + 1/12 + 1/12 = 1/4 = 0.25.
B A primer on Lie theory
The tangential distance in Section 2.4.6 as well as the construction of invariant features
in Section 2.6.3 used concepts from Lie theory, but gave no formal introduction of these
concepts. This section will give a concise introduction to the postponed mathematical
details.
Definition B.1 (Topological Manifold). Let Π be a Hausdorff space, i. e., a set of points
with a system of open sets such that for each pair of two distinct points the points can
be placed in two disjoint open sets.
1. A chart (or coordinate chart or coordinate map) is a pair (U, φ) of an open subset
U ⊆ Π and a corresponding injective map φ : U → V to an open subset V ⊆ ℝd of
Euclidean space such that φ is a homeomorphism. To be a homeomorphism means
that φ and φ−1 are both continuous, i. e., the pre-image of an open set is an open
set.
2. Let (U_i, φ_i), (U_j, φ_j), i ≠ j be two charts with a nonempty intersection U_ij = U_i ∩ U_j ≠ ∅ and let φ̄_i and φ̄_j denote the restrictions of φ_i and φ_j to the intersection U_ij. The map
τ_ij = φ̄_j ∘ φ̄_i^{−1} : φ_i(U_ij) → φ_j(U_ij)    (B.1)
is called the transition map between the two charts.
The above definition looks rather complicated, but can be understood intuitively—the
phrases “coordinate map” and “atlas” were not chosen by chance. A (topological)
manifold is a set that locally looks like Euclidean space.
The canonical example is a globe: a sphere and a plane have different global
geometries, but they look the same when you zoom in close enough. It is possible
to choose an open neighborhood on the sphere and map it to the Euclidean plane
the same way as one can flatten pieces of an orange peel. Otherwise, it would not be
possible to transfer a map printed on a flat sheet of paper onto a spherical globe.
On a grander scale, the earth looks flat (apart from the occasional hill or valley)
from a human perspective, but from outer space, it is clear that it is (approximately) a
sphere.
Most of the above definition is necessary to ensure that every point of the manifold
is on some map and that the same point on the manifold looks similar on different
maps. For example, two maps of Western and Middle Europe both include Germany,
but in different places. Still, the border of the country should look approximately the
same on both maps.
f = φ̃ ∘ F ∘ φ^{−1} : φ(U) ⊂ ℝ^d → ℝ^d̃    (B.2)
is differentiable at φ(x).
3. A function F : Π → Π̃ is a diffeomorphism if it is bijective and F as well as F^{−1} are both differentiable.
Note that in the first item, differentiability is well defined, because the τ_ij are functions from and to Euclidean space, where this concept already exists. In the second item, differentiability is independent of the choice of the chart functions φ and φ̃, because the transition function τ between different charts is required to be differentiable. Hence
F is either differentiable with respect to every chart or not differentiable at all.
The definition ensures that the manifold has no “edges” or “corners” but rather
looks smooth, as the name suggests. From this definition, it can be seen that the pre-
vious example for a topological manifold—the globe—is also a smooth manifold.
For the purposes of this book, this short introduction to manifolds will suffice. We
will now turn our attention to a different mathematical field, group theory, but come
back to manifolds at the end of this section.
Definition B.3 (Group). A group (G, ⊙) is a set G with a binary composition ⊙ : G×G → G
with the following properties:
1. Associativity: (g1 ⊙ g2 ) ⊙ g3 = g1 ⊙ (g2 ⊙ g3 ).
2. Neutral element: there exists e ∈ G with g ⊙ e = e ⊙ g = g for all g ∈ G.
3. Inverse element: for every g ∈ G, there is a g −1 ∈ G with g ⊙ g−1 = g −1 ⊙ g = e.
Definition B.4 (Group action). Let G be a group and S an arbitrary set. A (left) group action of G on S is a function A : G × S → S such that:
1. for all g_1, g_2 ∈ G and s ∈ S,
A(g_1, A(g_2, s)) = A(g_1 ⊙ g_2, s),    (B.3)
2. and for all s ∈ S,
A(e, s) = s.    (B.4)
Both items of the definition are reasonable when put into words: The first item requires
that the result of composing two group actions will be the same as the group action on
the composition of the elements. The second item requires that the neutral element of
the group has no effect with respect to the group action. One sometimes writes g ⊙ s
instead of A(g, s), i. e., the group composition ⊙ is also used to indicate the group
action.
In the following two definitions, let G be a group, S a set, A a group action of G on
S, and s ∈ S.
Definition B.5 (Group orbit). The set Gs = {A(g, s) | g ∈ G} ⊆ S is called the orbit of s.
Definition B.6 (Stabilizer). The subgroup G_s = {g ∈ G | A(g, s) = s} is called the stabilizer of s.
Although the definitions look similar and the notations differ only in the use of a concatenation (Gs) vs. the use of a subscript (G_s), the semantics are quite different. The group orbit Gs is a subset of S and contains all the points that can be reached from s by a transformation. The stabilizer G_s is a subset of G and contains all the group elements that do not affect s.
To see the difference, consider the group of all rotation matrices of two-dimensional
Euclidean space,
G = { ( cos α   sin α
       −sin α   cos α )  |  α ∈ ℝ },    (B.5)
where the group composition ⊙ is given by the usual matrix multiplication. This group
is also called the two-dimensional special orthogonal group SO(2).
It is easy to see that G is indeed a group. Associativity follows by the associativity
of matrix multiplication and the concatenation of two rotations by α and β gives the
same result as one rotation by α + β. The inverse of a rotation is the rotation by the
negative of its angle, and the neutral element is a rotation by 0.
Now, let S = ℝ2 be the Euclidean plane. The group action of G on S is given by the
usual multiplication of a vector by a matrix. Then the orbit of an arbitrary point s ∈ ℝ2
consists of all points with the same distance from the origin:
Gs = { A(g, s) | g ∈ G } = { ( cos α   sin α
                              −sin α   cos α ) s  |  α ∈ ℝ }
   = { s′ ∈ ℝ² | ‖s′‖ = ‖s‖ }.    (B.6)
This means that these orbits are circles around the origin, hence the name. For s = 0, the orbit is a degenerate circle with radius 0, i. e., a point. Hence, the stabilizer is given by
G_0 = G   and   G_s = {I} for s ≠ 0.    (B.7)
It is easy to see that any rotation maps the origin to the origin, which means that the stabilizer G_0 is the whole group. All other points are rotated along a circle around the origin, hence their stabilizer contains only the neutral element I.
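These properties are easy to verify numerically; the following Python sketch (our own illustration, not part of the text) composes two rotations and checks that the group action preserves the norm of s, so the orbit of s lies on a circle of radius ‖s‖:

```python
import numpy as np

def rot(alpha):
    # element of SO(2) as in Equation (B.5)
    return np.array([[ np.cos(alpha), np.sin(alpha)],
                     [-np.sin(alpha), np.cos(alpha)]])

a, b = 0.7, 1.9
print(np.allclose(rot(a) @ rot(b), rot(a + b)))    # True: composing rotations adds the angles

s = np.array([2.0, 1.0])
orbit = [rot(alpha) @ s for alpha in np.linspace(0.0, 2.0 * np.pi, 100)]
print(np.allclose([np.linalg.norm(p) for p in orbit], np.linalg.norm(s)))  # True: circle of radius ||s||

# the stabilizer of s != 0 contains only the neutral element:
# rot(alpha) @ s == s only for alpha = 0 (mod 2*pi)
```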
The special orthogonal group has an important property that links the world of algebra with manifolds: each rotation matrix can be decomposed into smaller and smaller rotations. In particular, a rotation can be decomposed into
( cos α   sin α        ( cos(α/n)   sin(α/n)
 −sin α   cos α )  =    −sin(α/n)   cos(α/n) )^n    (B.8)
for every n ∈ ℕ. Indeed, any rotation can be decomposed into infinitely many, in-
finitesimally small rotations.
This observation leads to the following definition.
Definition B.7 (Lie group). A group Π that is also a smooth manifold such that the
group operation p1 ⊙ p−12 is differentiable is a Lie group.
The two-dimensional special orthogonal group SO(2) is a Lie group and a one-
dimensional smooth manifold, which can be seen if one considers that a suitable
chart function maps a rotation matrix to its angle α ∈ ℝ.
The definition of a Lie group only makes a statement about the group operation,
but not about group actions. The combination yields the following definition.
Definition B.8 (Lie transformation group). Let M be a smooth manifold, Π a Lie group,
and A : Π × M → M a group action of Π on M. Π is called a Lie transformation group
with respect to M if A is differentiable.
Note that this definition uses some kind of differentiability three times: The group
Π is a Lie group and therefore equipped with a differentiable structure; the space M
is a smooth manifold, i. e., it possesses a differentiable structure; and the map A is
required to be differentiable, too.
To conclude, consider Figure 2.12 (reproduced in Figure B.1 for convenience) in
light of the new concepts. The feature space M is assumed to be a smooth manifold and
the disturbances are modeled as a Lie transformation group Π that acts on the feature
space. Then the orbit of a feature vector m_i under the group action is a smooth submanifold. Although this section skipped a mathematical definition of the tangent
space, it should be intuitively clear that a tangent exists at each point of each sub-
manifold, because all involved maps are differentiable.
[Figure B.1 (reproduction of Figure 2.12): the orbits M_i = {A(p, o_i) | p ∈ Π} and M_k = {A(p, o_k) | p ∈ Π} in the feature space, with the feature vectors m_i, m_k and the tangent spaces T_{m_i}, T_{m_k}.]
C Random processes
Definition C.1 ((Strictly) stationary process (of order m)). Let g be a random process,
k ∈ ℕ be a finite dimension, x1 , . . . , xk ∈ ℝ2 arbitrary points, and τ ∈ ℝ2 a translation
vector.
1. If
p(x1 , . . . , xk ) = p(x1 + τ, . . . , xk + τ) (C.7)
holds for all valid choices of k, x1 , . . . , xk , and τ, the process g is called (strictly)
stationary. This means that all fidis are invariant under translation.
2. A stochastic process is called (strictly) stationary of order m, if the above holds for
all k ≤ m.
Definition C.2 (Homogeneity (of order m)). A stochastic process is called homogeneous of order m if its moments up to order m are invariant under translation, i.e., if E{g^k}(x) = E{g^k}(x + τ) for all k ≤ m and all τ.
This means the first m moments do not depend on the point x. Obviously, stationarity
is much stronger than homogeneity. A stationary process is always homogeneous (up
to the same order) but not vice versa.
Definition C.3 (Weakly stationary process). A random process g is called weakly stationary if its expectation E{g}(x) is constant and its covariance satisfies Cov{g}(x, y) = Cov{g}(x + τ, y + τ) for all x, y, τ ∈ ℝ².
This means the expectation is constant for every point x and the covariance is also
constant in the sense that its value only depends on the relative position of x and y
but is not affected by a translation. Especially for x = y, this implies Var{g} (x) = σ2
is constant for all x ∈ ℝ2 .
The condition of weak stationarity is more restrictive than a homogeneity of order
two, but less restrictive than stationarity of order two. For a process to be homogeneous,
it is only required that its expectation and variance be constant: this does not say
anything about its covariance. In contrast, to be a stationary process of order two, it
is required that all two-dimensional marginal distributions be identical. The latter is
much stronger than having only identical covariances.
Note that the term “expectation free” is a bit misleading: an expectation free random
process is not free of having an expectation. It has an expectation: 0.
E{e}(x) = 0    (C.12)
Cov{e}(x, x + τ) = σ² if τ = 0, and 0 else.    (C.13)
Actually, both requirements already ensure that it is a weakly stationary process, but they demand much more. In particular, the last requirement implies that any two distinct states are uncorrelated with each other.
Lastly, we consider a certain assumption about random processes that makes rea-
soning about them easier in many circumstances: ergodicity. Informally, in an ergodic
process, a reasonably large sample from that process is representative of the process
as a whole. Formally, let E denote a probability space and let e ∈ E be an elementary
event. Moreover, let g(x) = g(x, e) denote the realization of the random process g(x)
with respect to the elementary event e.
Definition C.6 (Ergodic process). Let g be a stationary process and let μ(x) = E{g} (x)
denote the expectation of g. This means that the expectation μ(x) = μ is constant for
all x ∈ ℝ2 . The process g is said to be ergodic if for all events e ∈ E and all y ∈ ℝ2 ,
lim_{w→∞, h→∞} (1/(w·h)) ∫_{−w/2}^{w/2} ∫_{−h/2}^{h/2} g(x, e) dx = μ = E{g}(y).    (C.14)
On the right side of Equation (C.14), one arbitrary point y is fixed and the average over all possible realizations of the random process g is calculated. On the left, one realization g(x) = g(x, e) is
fixed and the average over all points x ∈ ℝ2 is calculated. Hence, under the assumption
that g is ergodic, one can determine the unknown expectation and variance of g by
taking the average over all points of only one single realization.
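A one-dimensional numerical illustration (our own sketch, not from the text): for a process g(x) = μ + e(x) with white noise e, the spatial average over a single realization approaches the ensemble expectation μ, which is exactly what ergodicity asserts:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 3.0, 2.0

# one long realization of g(x) = mu + e(x) with white noise e, Var{e} = sigma^2
realization = mu + sigma * rng.standard_normal(100_000)
print(realization.mean())    # spatial average of a single realization: ~3.0

# ensemble average at one fixed point x over many independent realizations
ensemble = mu + sigma * rng.standard_normal(100_000)
print(ensemble.mean())       # ~3.0 as well: both averages estimate E{g}(x) = mu
```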
Bibliography
R. Aster, B. Borchers, and C. Thurber. Parameter Estimation and Inverse Problems. International
Geophysics Series. Academic Press, 2013. ISBN 9780123850485.
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In
Advances in neural information processing systems, pages 153–160, 2007.
J. Beyerer. Analyse von Riefentexturen. PhD thesis, Düsseldorf, 1994.
J. Beyerer, F. Puente León, and C. Frese. Machine Vision. Springer, 2016.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In
Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152.
ACM, 1992.
L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature
verification using a "siamese" time delay neural network. IJPRAI, 7(4):669–688, 1993.
C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information
Theory, 13(1):21–27, Jan. 1967.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector
machines. Journal of machine learning research, 2(Dec):265–292, 2001.
A. Criminisi, J. Shotton, E. Konukoglu, et al. Decision forests: A unified framework for classification,
regression, density estimation, manifold learning and semi-supervised learning. Foundations
and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.
N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-
based learning methods. Cambridge university press, 2000.
G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints.
In Workshop on statistical learning in computer vision, ECCV, volume 1, pages 1–2. Prague,
2004.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em
algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
ISSN 00359246.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image
database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 248–255. IEEE, 2009.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. Wiley, New York, 2 edition, 2001.
B. Efron and T. Hastie. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science.
Cambridge University Press, New York, NY, USA, 1st edition, 2016. ISBN 9781107149892.
G. A. Fink. Mustererkennung mit Markov-Modellen. Vieweg+Teubner Verlag, 2003. ISBN 978-3-519-
00453-0.
G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, Mar. 1973. ISSN 0018-9219. 10.1109/PROC.1973.9030.
Y. Freund and R. Schapire. A tutorial on boosting, 2004.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an applica-
tion to boosting. Journal of Computer and System Sciences, 55(1):119 – 139, 1997.
A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models
for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelli-
gence, 23(6):643–660, 2001.
I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik. What size test set gives good error rate estimates?
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):52–64, 1998.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
C. Herrmann, D. Willersinn, and J. Beyerer. Low-resolution convolutional neural networks for video
face recognition. In Proceedings of the 13th IEEE International Conference on Advanced Video
and Signal Based Surveillance, Colorado Springs, USA, Aug. 2016. IEEE.
T. K. Ho. Random decision forests. In Document Analysis and Recognition, volume 1, pages 278–282.
IEEE, 1995.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780,
1997.
R. Hoffmann. Signalanalyse und –erkennung. Springer, 1998.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley &
Sons, 2004.
A. Jaglom and I. Jaglom. Wahrscheinlichkeit und Information. Deutscher Verlag der Wissenschaften,
Berlin, 1960. Translated from Russian.
A. N. Kolmogorov. On the representation of continuous functions of many variables by superpo-
sition of continuous functions of one variable and addition. American Mathematical Society
Translation, 28(2):55–59, 1963.
D. Krahe and J. Beyerer. A parametric method to quantify the balance of groove sets of honed
cylinder bores. In Intelligent Systems & Advanced Manufacturing, pages 192–201. International
Society for Optics and Photonics, 1997.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
K. Küpfmüller. Die Entropie der deutschen Sprache. Fernmeldetechnische Zeitung, 7(6):265–272, 1954.
A. Laubenheimer. Automatische Registrierung adaptiver Modelle zur Typerkennung technischer
Objekte. PhD thesis, Universität Karlsruhe, 2004.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324, 1998.
A. H. Lipkus. A proof of the triangle inequality for the tanimoto distance. Journal of Mathematical
Chemistry, 26(1-3):263–265, 1999.
K. Liu and G. Mattyus. Fast multiclass vehicle detection on aerial images. Geoscience and Remote
Sensing Letters, IEEE, PP(99):1–5, 2015. ISSN 1545-598X. 10.1109/LGRS.2015.2439517.
D. O. Loftsgaarden and C. P. Quesenberry. A nonparametric estimate of a multivariate density
function. The Annals of Mathematical Statistics, 36(3):1049–1051, 1965.
D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of
computer vision, 60(2):91–110, 2004.
V. Maz’ya and G. Schmidt. On approximate approximations using Gaussian kernels. IMA Journal of Numerical Analysis, 16(1):13–29, 1996.
J. Mercer. Functions of positive and negative type, and their connection with the theory of integral
equations. Philosophical transactions of the royal society of London, 209:415–446, 1909.
T. K. Moon and W. C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice
Hall, Upper Saddle River, NJ, 2000. ISBN 0-201-36186-8.
J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. In
Breakthroughs in statistics, pages 73–108. Springer, 1992.
A. B. J. Novikoff. On convergence proofs on perceptrons. Proceedings of the Symposium on the
Mathematical Theory of Automata, 12:615–622, 1962.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptative computation
and machine learning series. University Press Group Limited, 2006. ISBN 9780262182539.
M. Richter, T. Längle, and J. Beyerer. Knowing when you don’t: Bag of visual words with reject option for automatic visual inspection of bulk materials. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, Dec. 2016.
A. Rieder. Keine Probleme mit Inversen Problemen: Eine Einführung in ihre stabile Lösung.
Vieweg+Teubner Verlag, 2003. ISBN 9783528031985.
H. Ritter, T. Martinetz, and K. Schulten. Neuronale Netze. Addison-Wesley, 1990.
C. P. Robert. A comparison of the Bayesian and frequentist approaches to estimation by Francisco J. Samaniego. International Statistical Review, 79(1):117–118, 2011.
L. Rokach. Pattern classification using ensemble methods, volume 75. World Scientific, 2010.
F. Rosenblatt. The Perceptron: A Perceiving and Recognizing Automaton, volume Report 85-60-1.
Cornell Aeronautical Laboratory, 1957.
F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, 1962.
J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
B. Schölkopf and C. J. Burges. Advances in Kernel Methods: Support Vector Learning. MIT press,
1999.
B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International
Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
J. Schürmann. Pattern classification: a unified view of statistical and neural approaches. Wiley
Online Library, 1996.
C. E. Shannon. Claude Elwood Shannon: Collected Papers. N. J. A. Sloane and A. D. Wyner, editors. IEEE Press, New York, 1993. ISBN 0-7803-0434-9. IEEE Information Theory Society.
L. W. Sommer, T. Schuchert, and J. Beyerer. Deep learning based multi-category object detection in
aerial images. In Proc. SPIE 10202, Automatic Target Recognition XXVII, Anaheim, United States,
May 2017.
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):
1929–1958, 2014.
S. S. Stevens. On the theory of scales of measurement. Science, 103(2684):677–680, June 1946.
L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
V. N. Vapnik and V. Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.
A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algo-
rithm. IEEE Transactions on Information Theory, 13(2):260–269, Apr. 1967. ISSN 0018-9448.
10.1109/TIT.1967.1054010.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European
Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.
Z.-H. Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
Glossary
A posteriori distribution The distribution of the classes with respect to a fixed feature.
A priori distribution The distribution of the classes without knowledge of the features.
Absolute norm A special type of Minkowski norm.
Absolute scale Scale of measurement for counting quantities.
AR model Autoregressive signal model.
Autoregressive signal model Representation of a type of random process.
Dataset The set of all objects that were collected to define, validate and test a pattern recognition
system.
Decision boundary The boundary of a decision region. The entirety of the boundaries is an equiva-
lent description of the classifier.
Decision function A function that maps a feature vector to one component of the decision space.
Decision region A partition in the feature space.
Decision space An intermediate space to unify the mathematical description of the classes.
Decision tree Tree structured classifier where the inner nodes correspond to tests, the edges corre-
spond to the outcomes of the tests, and the leaf nodes govern the class decision.
Decision vector The vector of decision functions of all classes.
Dirac sequence A sequence of probability distributions that converges to the Dirac distribution.
Discrepancy A function that quantifies the similarity between two (mathematical) objects that lacks
some properties of a metric.
Distance function Usually a synonym for metric (usage may vary depending on context).
Distribution Mathematical object that encapsulates the properties of random variables.
Divergence A discrepancy between probability distributions.
EM Expectation maximization.
Emission probability In hidden Markov models: Probability of seeing an observable given a chain
of states.
Empirical operation Mathematical operation that corresponds to an experiment, e.g., addition of
the masses of two objects by putting both on a scale at the same time.
Empirical relation Mathematical relations that emerge from experiments, e.g., by comparing the
weight of two objects.
Empirical risk minimization From statistical learning theory: Minimization of the average loss on
a training set.
Empiricism Philosophy of science that emphasizes evidence and experiments.
Entropy measure Impurity measure corresponding to the entropy of the empirical class distribution
of that data set.
Estimator A measurable function from the space of all finite datasets into the parameter space of a
parametric distribution assumption.
Euclidean norm A special type of Minkowski norm.
Expectation maximization Iterative technique to maximize the likelihood function of an estimator.
Gaussian mixture A random variable whose density is a convex combination of Gaussian densities.
Generalization Ability of a classifier to perform well on unseen data.
Gini impurity Impurity measure corresponding to the expected error probability of random class
assignment on that data set.
Hidden Markov model Markov model where the states and state transitions are hidden and can
only be inferred from observations.
HMM Hidden Markov model.
Homogeneous process A random process whose moments do not depend on the point of evalua-
tion.
Hyper parameters Parameters that govern a classifier but are not estimated from the training set.
Impurity measure Measure that assesses the class distribution in a data set.
Interval scale Scale of measurement for measuring intervals but lacking a natural zero.
Joint distribution The distribution of several random quantities in a joint probability space.
k-nearest neighbor method A parameter-free technique to define a density given a number of finite
samples. See also Parzen window method.
Kullback–Leibler divergence Measure (but not a metric) of the difference between probability dis-
tributions.
Leave-one-out cross-validation Cross-validation where only one sample is used for evaluation,
and the rest are used to train the classifier.
Likelihood function A function of the parameters of a statistical model for a given data set.
Likelihood ratio The ratio of two likelihood functions with different models. Used in hypothesis testing.
Linear discriminant A basic classifier that draws hyperplanes between classes in the feature space.
Log-likelihood function The logarithm of the likelihood function.
Long short term memory Type of deep learning architecture suitable for sequential data.
Mahalanobis norm Norm of a vector with respect to some positive definite matrix.
Manhattan metric Metric deduced from the absolute norm; also: taxicab metric.
MAP classifier Maximum a posteriori classifier.
Marginal distribution The projection of a joint distribution onto one of the axes.
Markov model Probabilistic model of states and transitions between states with certain restric-
tions.
Maximum a posteriori classifier A classifier that decides on the class with the highest a posteriori
probability with respect to a given feature.
Maximum norm A special type of Minkowski norm.
Maximum-likelihood estimator An estimator that chooses the parameter that makes the given ob-
servation most likely under the model.
Mean squared error Mean of the squared derivations of an estimator to the target variable.
Median The middle entry in a sorted list of items.
Metric A function that defines a distance.
Metric space A set with a distance measure.
Minimax classifier A special type of classifier that estimates the class such that the maximal risk
with respect to any a priori distribution is minimized. See also classifier.
Minkowski norm A parametrized norm for real vector spaces.
Misclassification measure Impurity measure corresponding to the empirical error probability of the
dominant class in that data set.
ML estimator Maximum-likelihood estimator.
Mode In statistics: The global maximum of a probability mass or probability density, i.e., the most
probable value.
Nearest neighbor classifier A classifier that assigns an object the same class as the nearest (in the
feature space) sample of the training set.
Nominal scale Scale of measurement made up of labels.
Norm Function to measure the length of a vector.
Parameter space The (vector) space of all quantities that define a classifier.
Parameter vector A point in the parameter space.
Parzen window method A parameter-free technique to define a density given a number of finite
samples. See also k-nearest neighbor method.
Pattern The raw data from a sensor.
Pattern space The set of all possible patterns.
PCA Principal component analysis.
Permutation metric A metric for features on the ordinal scale.
Principal component analysis A method for finding a lower-dimensional subspace such that the
projection of the dataset has a minimal squared reconstruction error.
Probability simplex A subset in the decision space.
Scale of measurement Defines certain types of variables and permissible operations on the vari-
ables of a given type.
Score In statistics: Measure of how much a parameter influences the density of a random variable.
Sensitivity True-positive rate.
Slack The event that a binary classifier incorrectly decides for “negative” although the sample is
positive.
Slack variable In SVMs: Variables associated with the training samples to measure the violation of
the maximum margin constraint.
Specificity True-negative rate.
State transition probability In Markov models: Probability to switch between states.
Stationary process A random process that does not change the joint distribution of a derived time
series when shifted in time.
Stochastic gradient descent Randomized version of the gradient descent optimization algorithm.
Structural risk minimization From statistical learning theory: Joint minimization of the average
loss on a training set and the model complexity.
Supervised learning Learning when the classes of the training samples are known, e.g., classifica-
tion.
Support vector machine A linear classifier that maximizes the margin between the decision bound-
ary and the training samples.
SVM Support vector machine
Target vector A unit vector in the decision space and a corner of the probability simplex.
Taxicab metric Metric deduced from the absolute norm; also: Manhattan metric.
Test set A special subset of the dataset that is used to test the performance of a classifier.
Training set A special subset of the dataset that is used to define the parameters of a classifier.
True-negative rate The probability that a binary classifier decides on “negative” if the sample actually belongs to the negative class.
True-positive rate The probability that a binary classifier decides on “positive” if the sample actually belongs to the positive class.
Unbiased estimator A special estimator whose expectation value equals the parameter being estimated, if considered as a random variable on its own.
Unbiasedness See unbiased estimator.
Unsupervised learning Learning when the classes of the training samples are not known or not
needed, e.g., clustering, density estimation, etc.
Validation set A special subset of the dataset that is used to define the design parameters of a
classifier.
Vapnik–Chervonenkis dimension Measure of complexity of a given family of classifiers.
Weak classifier A classifier that performs only marginally better than random guessing.
Weakly stationary process A random process whose expectation and covariance are constant at
every point.
Window function A function that is nonzero only in some interval, often used to assign a weight
according to some distance, e.g., in the Parzen window method.
Index
activation function 182
AR model 43
autoencoder 183
backpropagation 182
bag of words 88
– bag of visual words 89
bagging 223
Bayes’ law 99
Bayesian classifier 104
Bayesianism 123
bias 128
boosting 242
bootstrap aggregating 223
bootstrapping 223
central limit theorem 113
class 1, 98
classifier 2, 98
CNN 189
confusion matrix 237, 240
convolutional neural network 189
correlation coefficient 76
cost function 104
Cramér–Rao bound 125
cross-validation 241
– leave-one-out cross-validation 242
curse of dimensionality 8, 162
data matrix 60
dataset 6, 7
decision boundary 2
decision function 100
decision region 2, 101
decision space 4, 99
decision tree 215
decision vector 100
degree of compactness 40
degree of convexity 40
degree of filling 39
differential entropy 78
– conditional differential entropy 78
Dirac sequence 147
discrepancy 19
distance function 19
distribution 98
– a posteriori distribution 99
– a priori distribution 98
– class-specific feature distribution 99
– conditional distribution 98, 99
– joint distribution 98
– marginal distribution 98
divergence 19
dropout 191
eigenfaces 65, 86
EM 213
emission probability 211
empirical operation 11
empirical relation 11
empirical risk minimization 234
equivalence relation 1
ERM 234
estimator 122
– consistent estimator 128
– CR-efficient estimator 126
– unbiased estimator 125
expectation maximization 213
feature 3, 10
feature space 5, 13, 162
feed-forward network 180
ferret box 40
fidis 269
Fisher information 127
Fisherfaces 86
form factor 40
frequentism 123
Gaussian distribution
– multivariate 114
– univariate 113
Gaussian mixture 119
general representation theorem 182
generalization 170
group 265
– group action 265, 266
– Lie group 267
– Lie transformation group 267
– stabilizer 266
hidden Markov model 211