Pattern Recognition
De Gruyter Graduate
Pattern Recognition
Matthias Richter
Institute of Anthropomatics and Robotics, Chair IES, Karlsruhe Institute of Technology
Adenauerring 4
76131 Karlsruhe
[email protected]
Matthias Nagel
Institute of Theoretical Informatics, Cryptography and IT Security
Karlsruhe Institute of Technology Am Fasanengarten 5
76131 Karlsruhe
[email protected]
ISBN 978-3-11-053793-2
e-ISBN (PDF) 978-3-11-053794-9
e-ISBN (EPUB) 978-3-11-053796-3
www.degruyter.com
Preface
Pattern Recognition ⊂ Machine Learning ⊂ Artificial Intelligence:
This relation could give the impression that pattern recognition is only a tiny, very spe-
cialized topic. That, however, is misleading. Pattern recognition is a very important
field of machine learning and artificial intelligence with its own rich structure and
many interesting principles and challenges. For humans, and also for animals, their
natural abilities to recognize patterns are essential for navigating the physical world
which they perceive with their naturally given senses. Pattern recognition here per-
forms an important abstraction from sensory signals to categories: on the most basic
level, it enables the classification of objects into “Eatable” or “Not eatable” or, e.g.,
into “Friend” or “Foe.” These categories (or, synonymously, classes) do not always
have a tangible character. Examples of non-material classes are, e.g., “secure situa-
tion” or “dangerous situation.” Such classes may even shift depending on the context,
for example, when deciding whether an action is socially acceptable or not. Therefore,
everybody is very much acquainted, at least at an intuitive level, with what pattern
recognition means to our daily life. This fact is surely one reason why pattern recogni-
tion as a technical subdiscipline is a source of so much inspiration for scientists and
engineers. In order to implement pattern recognition capabilities in technical systems,
it is necessary to formalize it in such a way that the designer of a pattern recognition
system can systematically engineer the algorithms and devices necessary for a techni-
cal realization. This textbook summarizes a lecture course about pattern recognition
that one of the authors (Jürgen Beyerer) has been giving for students of technical and
natural sciences at the Karlsruhe Institute of Technology (KIT) since 2005. The aim of
this book is to introduce the essential principles, concepts and challenges of pattern
recognition in a comprehensive and illuminating presentation. We will try to explain
all aspects of pattern recognition in a well understandable, self-contained fashion.
Facts are explained with a sufficiently deep mathematical treatment, but
without going into the very last technical details of a mathematical proof. The given
explanations will aid readers to understand the essential ideas and to comprehend
their interrelations. Above all, readers will gain the big picture that underlies all of
pattern recognition.
The authors would like to thank their peers and colleagues for their support:
Special thanks are owed to Dr. Ioana Gheța who was very engaged during the early
phases of the lecture “Pattern Recognition” at the KIT. She prepared most of the many
slides and accompanied the course along many lecture periods.
Thanks as well to Dr. Martin Grafmüller and to Dr. Miro Taphanel for supporting
the lecture Pattern Recognition with great dedication.
Moreover, many thanks to Prof. Michael Heizmann and Prof. Fernando Puente
León for inspiring discussions, which have positively influenced the evolution of
the lecture.
Thanks to Christian Hermann and Lars Sommer for providing additional figures
and examples of deep learning. Our gratitude also to our friends and colleagues Alexey
Pak, Ankush Meshram, Chengchao Qu, Christian Hermann, Ding Luo, Julius Pfrom-
mer, Julius Krause, Johannes Meyer, Lars Sommer, Mahsa Mohammadikaji, Mathias
Anneken, Mathias Ziearth, Miro Taphanel, Patrick Philipp, and Zheng Li for providing
valuable input and corrections for the preparation of this manuscript.
Lastly, we thank De Gruyter for their support and collaboration in this project.
List of Tables | XI
Notation | XVII
Introduction | XIX
2 Features | 10
2.1 Types of features and their traits | 10
2.1.1 Nominal scale | 10
2.1.2 Ordinal scale | 12
2.1.3 Interval scale | 13
2.1.4 Ratio scale and absolute scale | 13
2.2 Feature space inspection | 13
2.2.1 Projections | 14
2.2.2 Intersections and slices | 15
2.3 Transformations of the feature space | 17
2.4 Measurement of distances in the feature space | 17
2.4.1 Basic definitions | 19
2.4.2 Elementary norms and metrics | 20
2.4.3 A metric for sets | 22
2.4.4 Metrics on the ordinal scale | 23
2.4.5 The Kullback–Leibler divergence | 23
2.4.6 Tangential distance measure | 28
2.5 Normalization | 32
2.5.1 Alignment, elimination of physical dimension, and leveling of
proportions | 32
2.5.2 Lighting adjustment of images | 33
2.5.3 Distortion adjustment of images | 37
Bibliography | 271
Glossary | 275
Index | 281
List of Tables
Table 1 Capabilities of humans and machines in relation to pattern recognition | XXI
Table 7.1 Character sequences generated by Markov models of different order | 212
List of Figures
Fig. 1 Examples of artificial and natural objects | XX
Fig. 2 Industrial bulk material sorting system | XXI
Fig. 3.1 Example of a random distribution of mixed discrete and continuous quantities | 99
Fig. 3.2 The decision space K | 100
Fig. 3.3 Workflow of the MAP classifier | 102
Fig. 3.4 3-dimensional probability simplex in barycentric coordinates | 103
Fig. 3.5 Connection between the likelihood ratio and the optimal decision region | 106
Fig. 3.6 Decision of an MAP classifier in relation to the a posteriori probabilities | 108
Fig. 3.7 Underlying densities in the reference example for classification | 109
Fig. 3.8 Optimal decision regions | 110
Fig. 3.9 Risk of the Minimax classifier | 112
Fig. 3.10 Decision boundary with uneven priors | 115
Fig. 3.11 Decision regions of a generic Gaussian classifier | 116
Fig. 3.12 Decision regions of a generic two-class Gaussian classifier | 117
Fig. 3.13 Decision regions of a Gaussian classifier with the reference example | 118
Fig. 7.1 Techniques for extending linear discriminants to more than two classes | 174
Fig. 7.2 Nonlinear separation by augmentation of the feature space. | 176
Fig. 7.3 Decision regions of a linear regression classifier | 177
Fig. 7.4 Four steps of the perceptron algorithm | 179
Fig. 7.5 Feed-forward neural network with one hidden layer | 181
Fig. 7.6 Decision regions of a feed-forward neural network | 183
Fig. 7.7 Neuron activation of an autoencoder with three hidden neurons | 185
Fig. 7.8 Pre-training with stacked autoencoders. | 187
Fig. 7.9 Comparison of ReLU and sigmoid activation functions | 189
Fig. 7.10 A single convolution block in a convolutional neural network | 189
Fig. 7.11 High level structure of a convolutional neural network. | 190
Fig. 7.12 Types of features captured in convolution blocks of a convolutional neural
network | 191
Fig. 7.13 Detection and classification of vehicles in aerial images with CNNs | 192
Fig. 7.14 Structure of the CNN used in Herrmann et al. [2016] | 193
Fig. 7.15 Classification with maximum margin | 195
Fig. 7.16 Decision regions of a hard margin SVM | 203
Fig. 7.17 Geometric interpretation of the slack variables ξ i , i = 1, . . . ,N. | 204
Fig. 7.18 Decision regions of a soft margin SVM | 205
Fig. 7.19 Decision boundaries of hard margin and soft margin SVMs | 206
Fig. 7.20 Toy example of a matched filter | 207
Fig. 7.21 Discrete first order Markov model with three states ω i . | 211
Fig. 7.22 Discrete first order hidden Markov model | 213
Fig. 9.1 Relation of the world model P(m,ω) and the training and test sets D and T | 232
Fig. 9.2 Sketch of different class assignments under different model families | 233
Fig. 9.3 Expected test error, empirical training error, and VC confidence vs. VC
dimension | 234
Fig. 9.4 Classification error probability | 235
Fig. 9.5 Classification outcomes in a 2-class scenario | 236
Fig. 9.6 Performance indicators for a binary classifier | 237
Fig. 9.7 Example of ROC curves | 239
Fig. 9.8 Converting a multi-class confusion matrix to binary confusion matrices | 240
Fig. 9.9 Five-fold cross-validation | 242
Fig. 9.10 Schematic example of AdaBoost training. | 244
Fig. 9.11 AdaBoost classifier obtained by training in Figure 9.10 | 245
Fig. 9.12 Reasons to refuse to classify an object | 245
Fig. 9.13 Classifier with rejection option | 246
Fig. 9.14 Rejection criteria and the corresponding rejection regions | 247
Notation
General identifiers
Special identifiers
c Number of classes
ℂ Set of complex numbers
d Dimension of feature space
D Set of training samples
i, j, k Indices along the dimension, i.e., i, j, k ∈ {1, . . . , d}, or along the number of sam-
ples, i.e., i, j, k ∈ {1, . . . , N}
I Identity matrix
j Imaginary unit, j2 = −1
J Fisher information matrix
k(⋅,⋅) Kernel function
k(⋅) Decision function
K Decision space
l Cost function l : Ω0 /∼ × Ω/∼ → ℝ
L Cost matrix ∈ ℝ(c+1)×c
m Feature vector
mi Feature vector of the i-th sample
m ij The j-th component of the i-th feature vector
M ij The component at the i-th row and j-th column of the matrix M
M Feature space
N Number of samples
ℕ Set of natural numbers
o Object
ω Class of objects, i.e., ω ⊆ Ω
ω0 Rejection class
Ω Set of objects (the relevant part of the world) Ω = {o1 , . . . , o N }
Ω/∼ The domain factorized w.r.t. the classes, i.e., the set of classes Ω/∼ =
{ω1 , . . . , ω c }
Ω0 /∼ The set of classes including the rejection class, Ω0 /∼ = Ω/∼ ∪ {ω0 }
p(m) Probability density function for random variable m evaluated at m
P(ω) Probability mass function for (discrete) random variable ω evaluated at ω
Pr(e) Probability of an event e
P(A) Power set, i.e., the set of all subsets of A
ℝ Set of real numbers
S Set of all samples, S = D ⊎ T ⊎ V
T Set of test samples
V Set of validation samples
U Unit matrix, i.e., the matrix all of whose entries are 1
θ Parameter vector
Θ Parameter space
ℤ Set of integer numbers
Special symbols
∝ “proportional to”-relation
→ᴾ Convergence in probability
→ʷ Weak convergence
⇝ Leads to (not necessarily in a strict mathematical sense)
⊎ Disjoint union of sets, i.e., C = A ⊎ B ⇔ C = A ∪ B and A ∩ B = ∅.
⟨⋅,⋅⟩ Scalar product
∇, ∇e Gradient, Gradient w.r.t. e
Cov{⋅} Covariance
δᵢʲ Kronecker delta/symbol; δᵢʲ = 1 iff i = j, else δᵢʲ = 0
δ[⋅] Generalized Kronecker symbol, i.e., δ[Π] = 1 iff Π is true and δ[Π] = 0 otherwise
E{⋅} Expected value
N(μ, σ 2 ) Normal/Gaussian distribution with expectation μ and variance σ 2
N(µ, Σ) Multivariate normal/Gaussian distribution with expectation µ and covariance ma-
trix Σ
tr A Trace of the matrix A
Var{⋅} Variance
Pattern acquisition, Sensing, Measuring In the first step, suitable properties of the
objects to be classified have to be gathered and put into computable representa-
tions. Although pattern might suggest that this (necessary) step is part of the actual
pattern recognition task, it is not. However, this process has to be considered insofar
as it provides an awareness of any possible complications it may cause in the
subsequent steps. Measurements of any kind are usually affected by random noise
and other disturbances that, depending on the application, can not be mitigated
by methods of metrology alone: for example, changes of lighting conditions in
uncontrolled and uncontrollable environments. A pattern recognition system has
to be designed so that it is capable of solving the classification task regardless of
such factors.
Feature definition, Feature acquisition Suitable features have to be selected based
on the available patterns and methods for extracting these features from the pat-
terns have to be defined. The general aim is to find the smallest set of the most
informative and discriminative features. A feature is discriminative if it varies lit-
tle with objects within a single class, but varies significantly with objects from
different classes.
Design of the classifier After the features have been determined, rules to assign a
class to an object have to be established. The underlying mathematical model has
to be selected so that it is powerful enough to discern all given classes and thus
solve the classification task. On the other hand, it should not be more complicated
than necessary.
These lecture notes on pattern recognition are mainly concerned with the last two
issues. The complete process of designing a pattern recognition system will be covered
in its entirety and the underlying mathematical background of the required building
blocks will be given in depth.
Pattern recognition systems are generally parts of larger systems, in which pattern
recognition is used to derive decisions from the result of the classification. Industrial
sorting systems are typical of this (see Figure 2). Here, products are processed differ-
ently depending on their class memberships.
Hence, as a pattern recognition system is not an end in itself, the design of such a
system has to consider the consequences of a bad decision caused by a misclassifica-
tion. This puts pattern recognition between human and machine. The main advantage
of automatic pattern recognition is that it can execute recurring classification tasks
with great speed and without fatigue. However, an automatic classifier can only dis-
cern the classes that were considered in the design phase and it can only use those
features that were defined in advance. A pattern recognition system to tell apples from
oranges may label a pear as an apple and a lemon as an orange if lemons and pears
were not known in the design phase. The features used for classification might be
chosen poorly and not be discriminative enough. Different environmental conditions
(e.g., lighting) in the laboratory and in the field that were not considered beforehand
might impair the classification performance, too. Humans, on the other hand, can use
their associative and cognitive capabilities to achieve good classification performance.
Fig. 2. Industrial bulk material sorting system: a line-scan camera with illumination observes the bulk material on a conveyor in front of a background plate; a computer classifies the objects and controls the ejection stage.
Definition 1.1 (Equivalence relation). Let Ω be a set of elements with some relation ∼.
Suppose further that o, o1 , o2 , o3 ∈ Ω are arbitrary. The relation ∼ is said to be an
equivalence relation if it fulfills the following conditions:
1. Reflexivity: o ∼ o.
2. Symmetry: o1 ∼ o2 ⇔ o2 ∼ o1 .
3. Transitivity: o1 ∼ o2 and o2 ∼ o3 ⇒ o1 ∼ o3 .
An equivalence relation partitions Ω into equivalence classes: the equivalence class [o]∼ = {o′ ∈ Ω | o′ ∼ o} of an element o ∈ Ω is the set
of all elements that are equivalent to o. The object o is also called a representative of
the set [o]∼ . In the context of pattern recognition, each o ∈ Ω denotes an object and
each [o]∼ denotes a class. A different approach to classifying every element of a set is
given by partitioning the set into non-empty, pairwise disjoint subsets whose union is the whole set.
It is easy to see that equivalence relations and partitions describe synonymous con-
cepts: every equivalence relation induces a partition, and every partition induces an
equivalence relation.
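To make the correspondence concrete, here is a minimal Python sketch (not from the book) that groups objects into equivalence classes [o]∼ given a relation; the modulo-3 relation used for the demonstration is a made-up example.

```python
def equivalence_classes(objects, rel):
    """Partition `objects` into equivalence classes of the relation `rel`.

    Assumes `rel` really is reflexive, symmetric, and transitive;
    each class is represented by its first encountered element.
    """
    classes = []  # list of lists; each inner list is one class [o]~
    for o in objects:
        for cls in classes:
            if rel(o, cls[0]):      # transitivity: comparing with one
                cls.append(o)       # representative is enough
                break
        else:
            classes.append([o])     # o starts a new equivalence class
    return classes

# Toy relation: two integers are equivalent iff they leave the same
# remainder modulo 3 (reflexive, symmetric, transitive).
same_mod3 = lambda a, b: a % 3 == b % 3
print(equivalence_classes(range(10), same_mod3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```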
The underlying principle of all pattern recognition is illustrated in Figure 1.1.
On the left it shows—in abstract terms—the world and a (sub)set Ω of objects that
Fig. 1.1. The principle of pattern recognition: the domain Ω ⊆ World is partitioned into classes ω1 , . . . , ω c ; sensing, measuring, and characterizing maps each object o i to a feature vector mi ; decision boundaries partition the feature space into decision regions R1 , . . . , R c .
live within the world. The set Ω is given by the pattern recognition task and is also
called the domain. Only the objects in the domain are relevant to the task; this is the
so called closed world assumption. The task also partitions the domain into classes
ω1 , ω2 , ω3 , . . . ⊆ Ω. A suitable mapping associates every object o i with a feature vector
mi ∈ M inside the feature space M. The goal is now to find rules that partition M
along decision boundaries so that the resulting partition of M matches the partition of the domain into classes.
Hence, the rule for classifying an object o is
ω̂(o) = ω i if m(o) ∈ Ri , i = 1, . . . , c. (1.2)
This means that the estimated class ω̂ (o) of object o is set to the class ω i if the
feature vector m (o) falls inside the region Ri . For this reason, the Ri are also called
decision regions. The concept of a classifier can now be stated more precisely:
Definition 1.3 (Classifier). A classifier is a collection of rules that state how to evaluate
feature vectors in order to sort objects into classes. Equivalently, a classifier is a system
of decision boundaries in the feature space.
Readers experienced in machine learning will find these concepts very familiar. In fact,
machine learning and pattern recognition are closely intertwined: pattern recognition
is (mostly) supervised learning, as the classes are known in advance. This topic will be
picked up again later in this chapter.
1.2 Structure of a pattern recognition system
In the previous section it was already mentioned that a pattern recognition system maps objects onto feature vectors (see Figure 1.1) and that the classification is carried out in the feature space. This section focuses on the steps involved and defines the terms pattern and feature.
Fig. 1.2. Processing pipeline of a pattern recognition system: objects from Ω are sensed, preprocessed, and segmented into patterns, from which features are extracted and passed to the classifier.
Figure 1.2 shows the processing pipeline of a pattern recognition system. In the
first steps, the relevant properties of the objects from Ω must be converted into a machine-
readable representation. These first steps (yellow boxes in Figure 1.2) are usually per-
formed by methods of sensor engineering, signal processing, or metrology, and are
not directly part of the pattern recognition system. The result of these operations is
the pattern of the object under inspection.
Definition 1.4 (Pattern). A pattern is the collection of the observed or measured prop-
erties of a single object.
The most prominent pattern is the image, but patterns can also be (text) documents,
audio recordings, seismograms, or indeed any other signal or data. The pattern of
an object is the input to the actual pattern recognition, which is itself composed of
two major steps (gray boxes in Figure 1.2): previously defined features are extracted
from the pattern and the resulting feature vector is passed to the classifier, which then
outputs an equivalence class according to Equation (1.2).
A feature is any quality or quantity that can be derived from the pattern, for example,
the area of a region in an image, the count of occurrences of a key word within a text,
or the position of a peak in an audio signal.
From an abstract point of view, pattern recognition is mapping the set of objects Ω to
be classified to the equivalence classes ω ∈ Ω/ ∼, i.e., Ω → Ω/ ∼ or o → ω. In some
cases, this view is sufficient for treating the pattern recognition task. For example, if
the objects are e-mails and the task is to classify the e-mails as either “ham” =̂ ω1 or
“spam” =̂ ω2 , this view is sufficient for deriving the following simple classifier: The
body of an incoming e-mail is matched against a list of forbidden words. If it contains
more than S of these words, it is marked as spam, otherwise it is marked as ham.
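As a minimal sketch of this word-count rule (not the book's implementation), the following Python snippet treats the e-mail body as object, pattern, and feature vector at once; the forbidden-word list and the threshold S are placeholder choices.

```python
FORBIDDEN_WORDS = {"lottery", "winner", "prince", "jackpot"}  # placeholder list
S = 2  # threshold: more than S forbidden words => spam

def classify_email(body: str) -> str:
    """Return 'spam' if the body contains more than S forbidden words, else 'ham'."""
    words = body.lower().split()
    hits = sum(1 for w in words if w in FORBIDDEN_WORDS)
    return "spam" if hits > S else "ham"

print(classify_email("you are a winner of the lottery says the prince"))  # spam
print(classify_email("meeting moved to three o'clock"))                    # ham
```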
For a more complicated classification system, as well as for many other pattern
recognition problems, it is helpful and can provide additional insights to break up the
mapping Ω → Ω/ ∼ into several intermediate steps. In this book, the pattern recog-
nition process is subdivided into the following steps: observation, sensing, measure-
ment; feature extraction; decision preparation; and classification. This subdivision is
outlined in Figure 1.3.
To come back to the example mentioned above, an e-mail is already digital data,
hence it does not need to be sensed. It can be further seen as an object, a pattern, and
a feature vector, all at once. A spam classification application that takes the e-mail as
input and accomplishes the desired assignment to one of the two categories could be
considered as a black box that performs the mapping Ω → Ω/ ∼ directly.
In many other cases, especially if objects of the physical world are to be classified,
the intermediate steps of Ω → P → M → K → Ω/ ∼ will help to better analyze and
understand the internal mechanisms, challenges and problems of object classification.
It also supports engineering a better pattern recognition system. The concept of the
pattern space P is especially helpful if the raw data acquired about an object has a
very high dimension, e.g., if an image of an object is taken as the pattern. Explicit
use of P will be made in Section 2.4.6, where the tangent distance is discussed, and
in Section 2.6.3, where invariant features are considered. The concept of the decision
space K helps to generalize classifiers and is especially useful to treat the rejection
Objects Ω → (observation, sensing, measurement) → pattern space P → (feature extraction) → feature space M → (decision preparation) → decision space K → (classification) → classes Ω/∼
Fig. 1.3. Subdividing the pattern recognition process allows deeper insights and helps to better
understand important concepts such as: the curse of dimensionality, overfitting, and rejection.
problem in Section 9.4. Lastly, the concept of the feature space M is fundamental
to pattern recognition and permeates the whole textbook. Features can be seen as a
concentrated extract from the pattern, which essentially carries the information about
the object which is relevant for the classification task.
Overall, any pattern recognition task can be formally defined by a quintuple
(Ω, ∼, ω0 , l, S), where Ω is the set of objects to be classified, ∼ is an equivalence
relation that defines the classes in Ω, ω0 is the rejection class (see Section 9.4), l is a
cost function that assesses the classification decision ω̂ compared to the true class ω
(see Section 3.3), and S is the set of examples with known class memberships. Note
that the rejection class ω0 is not always needed and may be empty. Similarly, the cost
function l may be omitted, in which case it is assumed that incorrect classification
creates the same costs independently of the class and no cost is incurred by a correct
classification (0–1 loss).
These concepts will be further developed and refined in the following chapters.
For now, we will return to a more concrete discussion of how to design systems that
can solve a pattern recognition task.
1.4 Design of a pattern recognition system
Figure 1.4 shows the principal steps involved in designing a pattern recognition sys-
tem: data gathering, selection of features, definition of the classifier, training of the
classifier, and evaluation. Every step is prone to making different types of errors, but
the sources of these errors can broadly be sorted into four categories:
1. Too small a dataset,
2. A non-representative dataset,
3. Features with little discriminative power, and
4. An unsuitable classifier model.
Fig. 1.4. Principal design steps of a pattern recognition system: data gathering (training, validation, and test samples), selection and definition of features (operators for feature extraction), definition of the classifier (mathematical model), training of the classifier, and evaluation of the classifier (performance of the classifier).
The following section will describe the different steps in detail, highlighting the chal-
lenges faced and pointing out possible sources of error.
The first step is always to gather samples of the objects to be classified. The result-
ing dataset is labeled S and consists of patterns of objects where the corresponding
classes are known a priori, for example because the objects have been labeled by a
domain expert. As the class of each sample is known, deriving a classifier from S con-
stitutes supervised learning. The complement to supervised learning is unsupervised
learning, where the class of the objects in S is not known and the goal is to uncover
some latent structure in the data. In the context of pattern recognition, however, un-
supervised learning is only of minor interest.
A common mistake when gathering the dataset is to pick only especially charac-
teristic, prototypical samples from each class. At first glance, this simplifies the following steps,
because it seems easier to determine the discriminative features. Unfortunately, these
seemingly discriminative features are often useless in practice. Furthermore, in many
Data set S: training set D (50 % of S), validation set V (25 % of S), testing set T (25 % of S).
Fig. 1.5. Rule of thumb to partition the dataset into training, validation and test sets.
situations, the most informative samples are those that represent edge cases. Consider
a system where the goal is to pick out defective products. If the dataset only consists
of the most perfect samples and the most defective samples, it is easy to find highly
discriminative features and one will assume that the classifier will perform with high
accuracy. Yet in practice, imperfect, but acceptable products may be picked out or
products with a subtle, but serious defect may be missed. A good dataset contains
both extreme and common cases. More generally, the challenge is to obtain a dataset
that is representative of the underlying distribution of classes. However, an unrepre-
sentative dataset is often intentional or practically impossible to avoid when one of the
classes is very sparsely populated but representatives of all classes are needed. In the
above example of picking out defective products, it is conceivable that on average only
one in a thousand products has a defect. In practice, one will select an approximately
equal number of defective and intact products to build the dataset S. This means that
the so called a priori distribution of classes must not be determined from S, but has to
be obtained elsewhere.
The dataset S is further partitioned into a training set D, a validation set V, and a
test set T. A rule of thumb is to use 50 % of S for D, 25 % of S for V, and the remaining
25 % of S for T (see Figure 1.5). The test set T is held back and not considered during
most of the design process. It is only used once to evaluate the classifier in the last
design step (see Figure 1.4). The distinction between training and validation set is not
always necessary. The validation set V is needed if the classifier in question is governed
not only by parameters that are estimated from the training set D, but also depends
on so called design parameters or hyper parameters. The optimal design parameters
are determined using the validation set.
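One way to realize the 50/25/25 rule of thumb is a random permutation of the sample indices; the sketch below is not from the book and assumes the samples and labels are already available as NumPy arrays.

```python
import numpy as np

def split_dataset(samples, labels, rng=np.random.default_rng(0)):
    """Split S into training D (50 %), validation V (25 %), and test T (25 %)."""
    n = len(samples)
    idx = rng.permutation(n)                  # random order of the samples
    n_train, n_val = n // 2, n // 4
    d_idx = idx[:n_train]                     # D: 50 % of S
    v_idx = idx[n_train:n_train + n_val]      # V: 25 % of S
    t_idx = idx[n_train + n_val:]             # T: remaining 25 % of S
    return ((samples[d_idx], labels[d_idx]),
            (samples[v_idx], labels[v_idx]),
            (samples[t_idx], labels[t_idx]))

# Usage with toy data: 100 two-dimensional feature vectors and binary labels.
m = np.random.randn(100, 2)
y = np.random.randint(0, 2, size=100)
D, V, T = split_dataset(m, y)
print(len(D[0]), len(V[0]), len(T[0]))  # 50 25 25
```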
A general issue is that the available dataset is often too small. The reason is that
obtaining and (manually) pre-classifying a dataset is typically very time consuming
and thus costly. In some cases, the number of samples is naturally limited, e.g., when
the goal is to classify earthquakes. The partition into training, test and validation sets
further reduces the number of available samples, sometimes to a point where carry-
ing out the remaining design phases is no longer reasonable. Chapter 9 will suggest
methods for dealing with small datasets.
The second step of the design process (see Figure 1.4) is concerned with choosing
suitable features. Different types of features and their characteristics will be covered
in Chapter 2 and will not be discussed at this point. However, two general design
principles should be considered when choosing features:
1. Simple, comprehensible features should be preferred. Features that correspond
to immediate (physical) properties of the objects or features which are otherwise
meaningful, allow understanding and optimizing the decisions of the classifier.
2. The selection should contain a small number of highly discriminative features.
The features should show little deviation within classes, but vary greatly between
classes.
The latter principle is especially important to avoid the so called curse of dimension-
ality (sometimes also called the Hughes effect): a higher dimensional feature space
means that a classifier operating in this feature space will depend on more parameters.
Determining the appropriate parameters is a typical estimation problem. The more
parameters need to be estimated, the more samples are needed to adhere to a given
error bound. Chapter 6 will give more details on this topic.
The third design step is the definition of a suitable classifier (see Figure 1.4). The
boundary between feature extraction and classifier is arbitrary and was already called
“blurry” in Figure 1.2. In the example in Figure 2.4c, one has the option to either stick
with the features and choose a more powerful classifier that can represent curved
decision boundaries, or to transform the features and choose a simple classifier that
only allows linear decision boundaries. It is also possible to take the output of one
classifier as input for a higher order classifier. For example, the first classifier could
classify each pixel of an image into one of several categories. The second classifier
would then operate on the features derived from the intermediate image. Ultimately,
it is mostly a question of personal preference where to put the boundary and whether
feature transformation is part of the feature extraction or belongs to the classifier.
After one has decided on a classifier, the fourth design step (see Figure 1.4) is to
train it. Using the training and validation sets D and V, the (hyper-)parameters of
the classifier are estimated so that the classification is in some sense as accurate as
possible. In many cases, this is achieved by defining a loss function that punishes
misclassification, then optimizing this loss function w.r.t. the classifier parameters.
As the dataset can be considered as a (finite) realization of a stochastic process, the
parameters are subject to statistical estimation errors. These errors will become smaller
the more samples are available.
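As an illustrative sketch of training by loss minimization (not the book's method), the following fits a linear two-class classifier by gradient descent on the logistic loss; the learning rate and number of steps are arbitrary assumptions.

```python
import numpy as np

def train_linear_classifier(M, y, lr=0.1, steps=500):
    """Fit weights w, b by minimizing the mean logistic loss over the training set.

    M: (N, d) feature vectors, y: (N,) labels in {0, 1}.
    """
    N, d = M.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = M @ w + b
        p = 1.0 / (1.0 + np.exp(-z))      # predicted class-1 probability
        grad_w = M.T @ (p - y) / N        # gradient of the mean logistic loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w                  # gradient descent step
        b -= lr * grad_b
    return w, b

# Toy training set: two Gaussian blobs.
rng = np.random.default_rng(1)
M = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_linear_classifier(M, y)
accuracy = np.mean(((M @ w + b) > 0).astype(int) == y)
print(f"training accuracy: {accuracy:.2f}")
```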
An edge case occurs when the sample size is so small and the classifier has so
many parameters that the estimation problem is under-determined. It is then possible
to choose the parameters in such a way that the classifier classifies all training samples
correctly. Yet novel, unseen samples will most probably not be classified correctly, i.e.,
the classifier does not generalize well. This phenomenon is called overfitting and will
be revisited in Chapter 6.
In the fifth and last step of the design process (see Figure 1.4), the classifier is
evaluated using the test set T, which was previously held back. In particular, this step
is important to detect whether the classifier generalizes well or whether it has been
overfitted. If the classifier does not perform as needed, any of the previous steps—in
particular the choice of features and classifier—can be revisited and adjusted. Strictly
speaking, the test set T is already depleted and must not be used in a second run.
Instead, each separate run should use a different test set, which has not yet been seen
in the previous design steps. However, in many cases it is not possible to gather new
samples. Again, Chapter 9 will suggest methods for dealing with such situations.
1.5 Exercises
(1.1) Let S be the set of all computer science students at the KIT. For x,y ∈ S, let x ∼ y
be true iff x and y are attending the same class. Is x ∼ y an equivalence relation?
(1.5) Let x,y ∈ ℕ and f : ℕ → ℕ be a function on the natural numbers. Is the relation
x ∼ y ⇔ f(x) ≤ f(y) an equivalence relation?
(1.6) Let A be a set of algorithms and for each X ∈ A let r(X,n) be the runtime of
that algorithm for an input of length n. Is the following relation an equivalence
relation?
X ∼ Y ⇔ r(X,n) ∈ O (r(Y,n)) for X,Y ∈ A.
where O (f(n)) denotes the set of all functions of n that are asymptotically bounded
above by f(n).
2 Features
A good understanding of features is fundamental for designing a proper pattern recog-
nition system. Thus this chapter deals with all aspects of this concept, beginning with
a classification of the kinds of features and ending with methods for reducing the dimen-
sionality of the feature space. A typical beginner's mistake is to apply mathematical
operations to the numeric representation of a feature just because it is syntactically
possible, even though these operations have no meaning whatsoever for the underlying prob-
lem. Therefore, the first section elaborates on the different types of possible features
and their traits.
2.1 Types of features and their traits
2.1.1 Nominal scale
The nominal scale is made up of pure labels. The only meaningful question to ask is
whether two variables have the same value: the nominal scale only allows comparing
two values w.r.t. equivalence. There is no meaningful transformation besides relabel-
ing. No empirical operation is permissible, i.e., there is no mathematical operation of
nominal features that is also meaningful in the material world.
Table 2.1. Taxonomy of scales of measurement. Empirical relations are mathematical relations that emerge from experiments, e.g., comparing the volume of
two objects by measuring how much water they displace. Likewise, empirical operations are mathematical operations that can be carried out in an experiment,
e.g., adding the mass of two objects by putting them together, or taking the ratio of two masses by putting them on a balance scale and noting the point of the
fulcrum when the scale is balanced.
Nominal scale — Empirical relation: equivalence. Allowed transformation: any one-to-one relabeling. Typical domain: names, integers. Expressiveness: very low. Examples: telephone numbers, postal codes, gender.
Ordinal scale — Empirical relations: equivalence, ordering ≺. Allowed transformation: any strictly increasing mapping. Typical domain: integers. Expressiveness: low. Examples: school grades, degree of hardening, wind intensity.
Interval scale — Empirical relations and operations: equivalence, ordering ≺, addition ⊕. Allowed transformation: m ↦ a·m + b with a > 0. Typical domain: real numbers. Expressiveness: medium. Examples: temperature in °F, calendar time, geographic altitude.
Ratio scale — Empirical relations and operations: equivalence, ordering ≺, addition ⊕, multiplication ⊗. Allowed transformation: m ↦ a·m with a > 0. Typical domain: real numbers. Expressiveness: high. Examples: temperature in K, electric current, bank account balance.
Absolute scale — Empirical relations and operations: equivalence, ordering ≺, addition ⊕, multiplication ⊗. Allowed transformation: identity. Typical domain: natural numbers. Expressiveness: very high. Examples: electron count, Euler characteristic, number of test failures.
A typical example is the sex of a human. The two possible values can be either
written as “f” vs. “m,” “female” vs. “male,” or be denoted by the special symbols ♀
vs. ♂. The labels are different, but the meaning is the same. Although nominal values
are sometimes represented by digits, one must not interpret them as numbers. For
example, the postal codes used in Germany are digits, but there is no meaning in, e.g.,
adding two postal codes. Similarly, nominal features do not have an ordering, i.e., the
postal code 12345 is not “smaller” than the postal code 56789. Of course, most of the
time there are options for how to introduce some kind of lexicographic sorting scheme,
but this is purely artificial and has no meaning for the underlying objects.
With respect to statistics, the permissible average is not the mean (since summa-
tion is not allowed) or the median (since there is no ordering), but the mode, i.e., the
most common value in the dataset.
2.1.2 Ordinal scale
The next higher scale is made of values on an ordinal scale. The ordinal scale allows
comparing values w.r.t. equivalence and rank. Any transformation of the domain must
preserve the order, which means that the transformation must be strictly increasing.
But there is still no way to add an offset to one value in order to obtain a new value or
to take the difference between two values.
Probably the best known example is school grades. In the German grading system,
the grade 1 (“excellent”) is better than 2 (“good”), which is better than 3 (“satisfactory”)
and so on. But quite surely the difference in a student’s skills is not the same between
the grades 1 and 2 as between 2 and 3, although the “difference” in the grades is unity
in both cases. In addition, teachers often report the arithmetic mean of the grades
in an exam, even though the arithmetic mean does not exist on the ordinal scale. In
consequence, it is syntactically possible to compute the mean, even though the result,
e.g., 2.47 has no place on the grading scale, other than it being “closer” to a 2 than
a 3. The Anglo-Saxon grading system, which uses the letters “A” to “F”, is somewhat
immune to this confusion.
The correct average involving an ordinal scale is obtained by the median: the value
that separates the lower half of the sample from the upper half. In other words, 50 %
of the sample is smaller, and 50 % is larger than the median. One can also measure
the scatter of a dataset using the quantile distance. The p-quantile of a dataset is the
value that separates the lower p ⋅ 100 % from the upper (1 − p) ⋅ 100 % of the dataset
(the median is the 0.5-quantile). The p-quantile distance is the distance (number of
values) between the p and (1 − p)-quantile. Common values for p are p = 0, which
results in the range of the data set, and p = 0.25, which results in the inter-quartile
range.
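A short sketch (not from the book) of the permissible statistics on the ordinal scale, using school grades coded as integers; NumPy's nearest-rank quantile is used so that all results stay on the original scale.

```python
import numpy as np

# German school grades (ordinal, coded as integers 1..6)
grades = np.array([1, 2, 2, 3, 3, 3, 4, 5, 2, 1, 3])

# Median: the permissible average on the ordinal scale.
# method="nearest" keeps the result on the original grading scale.
median = np.quantile(grades, 0.5, method="nearest")

def quantile_distance(x, p):
    """p-quantile distance: spread between the p- and the (1-p)-quantile."""
    return (np.quantile(x, 1 - p, method="nearest")
            - np.quantile(x, p, method="nearest"))

print(median)                           # 3
print(quantile_distance(grades, 0.25))  # inter-quartile range
print(quantile_distance(grades, 0.0))   # range of the dataset
```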
2.1.3 Interval scale
The interval scale allows adding an offset to one value to obtain a new one, or to calcu-
late the difference between two values—hence the name. However, the interval scale
lacks a naturally defined zero. Values from the interval scale are typically represented
using real numbers, which contain the symbol “0,” but this symbol has no special
meaning and its position on the scale is arbitrary. For this reason, scalar multiplication
of values from the interval scale is meaningless. Permissible transformations
preserve the order, but may shift the position of the zero.
A prominent example is the (relative) temperature in °F and °C. The conversion
from Celsius to Fahrenheit is given by T_F = (9/5 °F/°C) · T_C + 32 °F. The temperatures 10 °C
and 20 °C on the Celsius scale correspond to 50 °F and 68 °F on the Fahrenheit scale.
Hence, one cannot say that 20 °C is twice as warm as 10 °C: this statement does not
hold w.r.t. the Fahrenheit scale.
The interval scale is the first of the discussed scales that allows computing the
arithmetic mean and standard deviation.
2.1.4 Ratio scale and absolute scale
The ratio scale has a well defined, non-arbitrary zero, and therefore allows calculating
ratios of two values. This implies that there is a scalar multiplication and that any
transformation must preserve the zero. Many features from the field of physics belong
to this category and any transformation is merely a change of units. Note that although
there is a semantically meaningful zero, this does not mean that features from this
scale may not attain negative values. An example is one’s account balance, which
has a defined zero (no money in the account), but may also become negative (open
liabilities).
The absolute scale shares these properties, but is equipped with a natural unit
and features of this scale can not be negative. In other words, features of the absolute
scale represent counts of some quantities. Therefore, the only allowed transformation
is the identity.
2.2 Feature space inspection
For a well-working system, the question of how to find “good,” i.e., discriminative, fea-
tures of objects needs to be answered. The primary course of action is to visually
inspect the feature space for good candidates.
In order to find discriminative features, one needs to get an idea about the structure
of the feature space. In the one- or two-dimensional case, this can be easily done by
looking at a visual representation of the dataset in question, e.g., a histogram or a
Fig. 2.1. Iris flower dataset as an example of how projection helps the inspection of the feature space: (a) three-dimensional feature space (petal length, petal width, sepal length); (b) two-dimensional projection onto petal length and petal width with aligned histograms.
scatter plot. Even with three dimensions, a perspective view of the data might suffice.
However, this approach becomes problematic when the number of dimensions is larger
than three.
2.2.1 Projections
Fig. 2.2. Difference between the full projection and the slice projection techniques: (a) 3D scatter plot of all samples; (b) projection of all samples onto the m1,m2-plane; (c) 3D scatter plot of samples in a slice; (d) projection of samples in the slice onto the m1,m2-plane.
A classic example is the Iris flower dataset, which quantifies the morphological variation of Iris flowers of three
related species.
Figure 2.1a depicts a perspective drawing of the three features petal width, petal
length and sepal length. Figure 2.1b shows a two-dimensional projection and two
aligned histograms of the same data by omitting the sepal length. The latter clearly
shows that the features petal length and petal width are already sufficient to distin-
guish the species Iris setosa from the others. Further two-dimensional projections
might show that Iris versicolor and Iris virginica can also be easily separated from
each other.
2.2.2 Intersections and slices
If the distribution of the samples in the feature space is more complex, simple projec-
tions might fail. Even worse, this approach might lead to the wrong conclusion that the
samples of two different classes cannot be separated by the features in question even
Fig. 2.3. Construction of a slice: the mean plane is spanned by the directional vectors a and b, with normal vector n1 and oriented distance u from the origin.
though they can be. Figure 2.2 shows this issue using artificial data. The objects of the
first class are all distributed within a solid sphere. The samples of the second class lie
close to the surface of a second, larger sphere. This sphere encloses the samples of the
first class, but the radius is large enough to separate the classes.
The initial situation is depicted in Figure 2.2a. Even though the samples can be
separated, any projection to a two-dimensional subspace will suggest that the classes
overlap each other, as shown in Figure 2.2b. However, if one projects slices of the data
instead of all of it at once, the structure becomes apparent. Figure 2.2c shows the result of
such a slice in the three dimensional space and Figure 2.2d shows the corresponding
projection. The latter clearly shows that one class only encloses the other but can be
distinguished nonetheless.
The principal idea of the construction is illustrated in Figure 2.3. The slice is de-
fined by its mean plane (yellow) and a bound ε that defines half of the thickness of
the slice. Any sample that is located at a distance less than this bound is projected
onto the plane. The mean plane itself is given by its two directional vectors a, b and
its oriented distance u from the origin. The mean plane on its own, i.e., a slice with
zero thickness (ε = 0), does not normally suffice to “catch” any sample points: if the
samples are continuously distributed, the probability that a sample is intersected by the
mean plane is zero.
Let d ∈ ℕ be the dimension of the feature space. A two-dimensional plane is
defined either by its two directional vectors a and b or as the intersection of d − 2
linearly independent hyperplanes. Hence, let
{a, b, n1 , . . . , n_{d−2} } (2.1)
denote an orthonormal basis of the feature space, where each nj is the normal vector
of a hyperplane. Let u1 , . . . , u d−2 be the oriented distances of the hyperplanes from
the origin. The two-dimensional plane is defined by the solution of the system of linear
equations
n_1ᵀ m − u_1 = 0
⋮
n_{d−2}ᵀ m − u_{d−2} = 0. (2.2)
Let m = (m_1 , . . . , m_d )ᵀ be an arbitrary point of the feature space. The distance of
m from the plane in the direction of n_j is given by n_jᵀ m − u_j , hence the total Euclidean
distance of m from the plane is
v = √( ∑_{j=1}^{d−2} (n_jᵀ m − u_j)² ). (2.3)
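The slice technique follows directly from Equations (2.2) and (2.3); the sketch below (not from the book) represents the mean plane by a point on it and the orthonormal directional vectors a and b, and returns the in-plane coordinates of all samples within distance ε of the plane.

```python
import numpy as np

def slice_projection(M, a, b, p0, eps):
    """Project all samples within distance eps of the plane onto the plane.

    M:    (N, d) array of feature vectors
    a, b: orthonormal directional vectors spanning the plane
    p0:   a point on the plane (fixes the offsets u_1, ..., u_{d-2})
    eps:  half thickness of the slice
    Returns the (n, 2) in-plane coordinates of the selected samples.
    """
    X = M - p0                                   # work relative to the plane
    coords = np.stack([X @ a, X @ b], axis=1)    # in-plane coordinates
    residual = X - coords[:, :1] * a - coords[:, 1:2] * b
    v = np.linalg.norm(residual, axis=1)         # distance to the plane, cf. Eq. (2.3)
    return coords[v <= eps]

# Toy usage: a random 5-dimensional dataset, sliced along the first two axes.
rng = np.random.default_rng(0)
M = rng.normal(size=(1000, 5))
a = np.array([1.0, 0, 0, 0, 0])
b = np.array([0, 1.0, 0, 0, 0])
print(slice_projection(M, a, b, p0=np.zeros(5), eps=0.5).shape)
```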
2.3 Transformations of the feature space
Because the sample size is limited, it is usually advisable to restrict the number of
features used. Apart from limiting the selection, this can also be achieved by a suit-
able transformation of the feature space (see Figure 2.4). In Figure 2.4a it is possible
to separate the two classes using the feature m1 alone. Hence, the feature m2 is not
needed and can be omitted. In Figure 2.4b, both features are needed, but the classes
are separable by a straight line. Alternatively, the feature space could be rotated in
such a way that the new feature m2 is sufficient to discriminate between the classes.
The annular classes in Figure 2.4c are not linearly separable, but a nonlinear transfor-
mation into polar coordinates shows that the classes can be separated by the radial
component. Section 2.7 will present methods for automating such transformations to
some degree. Especially the principal component analysis will play a central role.
2.4 Measurement of distances in the feature space
As will be shown in later chapters, many classifiers need to calculate some kind of
distance between feature vectors. A very simple, yet surprisingly well-performing clas-
sifier is the so-called nearest neighbor classifier: Given a dataset with known points
Fig. 2.4. Transformations of the feature space: (a) a single feature m1 suffices to separate the classes; (b) linearly separable classes; (c) annular classes that become separable by the radial component after a transformation to polar coordinates (r, φ).
in the feature space and known class memberships for each point, a new point with
unknown membership is assigned to the same class as the nearest known point. Obvi-
ously, the concept “being nearest to” requires a measure of distance.
If the feature vector was an element of a standard Euclidean vector space, one
could use the well known Euclidean distance
‖m − m′‖ = √( ∑_{i=1}^{d} |m_i − m′_i|² ), (2.5)
but this approach relies on some assumptions that are generally not true for real-world
applications. The cause of this can be summarized by the heterogeneity of the compo-
nents of the feature space, meaning
– features on different scales of measurement,
– features with different (physical) units,
– features with different meanings and
– features with differences in magnitude.
Above all, Equation (2.5) requires that all components m_i , m′_i , i = 1, . . . , d are at least
on an interval scale. In practice, the components are often a mixture of real numbers,
ordinal values and nominal values. In these cases, the Euclidean distance in Equa-
tion (2.5) does not make sense; even worse, it is syntactically incorrect.
In cases where all the components are real numbers, there is still the problem of
different scales or units. For example, the same (physical) feature, “length,” can be
given in “inches” or “miles.” The problem gets even worse if the components stem from
different physical magnitudes, e.g., if the first component is a mass and the second
component is a length. A simple solution to this problem is a weighted sum of the
individual component distances, i.e.,
d d
D (m, m ) = ∑ α i D i (m i , mi ) for α i > 0 and ∑ α i = 1. (2.6)
i=1 i=1
To discuss the oncoming concepts, we must first define the terms that will be used.
Definition 2.1 (Metric, metric space). Let M be a set and m, m′, m″ ∈ M. A func-
tion D : M × M → ℝ≥0 is called a metric iff
1. D(m, m′) ≥ 0 (non-negativity)
2. D(m, m′) = 0 ⇔ m = m′ (reflexivity, coincidence)
3. D(m, m′) = D(m′, m) (symmetry)
4. D(m, m″) ≤ D(m, m′) + D(m′, m″) (triangle inequality)
A set M equipped with a metric D is called a metric space.
With respect to real-world applications, having a metric feature space is an ideal, but
unrealistic situation. Luckily, fewer requirements will often suffice. As will be seen
in Section 2.4.5, the Kullback–Leibler divergence is not a metric because it lacks the
symmetry property and violates the triangle inequality, but it is quite useful nonethe-
less. Those functions that fulfil some, but not all of the above requirements are usually
called distance functions, discrepancies, or divergences. None of these terms is pre-
cisely defined. Moreover, “distance function” is also used as a synonym for metric and
should be avoided to prevent confusion. “Divergence” is generally only used for func-
tions that quantify the difference between probability distributions, i.e., the term is
used in a very specific context. Another important concept is given by the term (vector)
norm:
Definition 2.2 (Norm, normed vector space). Let M be a vector space over the real
numbers and let m, m′ ∈ M. A function ‖⋅‖ : M → ℝ≥0 is called a norm iff
1. ‖m‖ ≥ 0 and ‖m‖ = 0 ⇔ m = 0 (positive definiteness)
2. ‖αm‖ = |α| ‖m‖ with α ∈ ℝ (homogeneity)
3. ‖m + m′‖ ≤ ‖m‖ + ‖m′‖ (triangle inequality)
A vector space M equipped with a norm ‖⋅‖ is called a normed vector space.
Due to the prerequisite of the definition, a normed vector space can only be applied to
features on a ratio scale. A norm can be used to construct a metric, which means that
every normed vector space is a metric space, too.
Definition 2.3 (Induced metric). Let M be a normed vector space with norm ‖⋅‖ and
let m, m′ ∈ M. Then
D(m, m′) := ‖m − m′‖ (2.7)
defines an induced metric on M.
Note that because of the homogeneity property, Definition 2.2 requires the value to
be on a ratio scale; otherwise the scalar multiplication would not be well defined.
However, the induced metric from Definition 2.3 can be applied to an interval scale,
too, because the proof does not need the scalar multiplication. Of course, one must
not say that the metric D(m, m′) = ‖m − m′‖ stems from a norm, because there is no
such thing as a norm on an interval scale.
2.4.2 Elementary norms and metrics
Inarguably, the most familiar example of a norm is the Euclidean norm. But this norm
is just a special embodiment of a whole family of vector norms that can be used to
quantify the distance of features on a ratio scale. The norms of this family are called
Minkowski norms or p-norms.
Definition 2.4 (Minkowski norm, p-norm). Let M denote a real vector space of finite
dimension d and let r ∈ ℤ ∪ {∞} be a constant parameter. Then
‖m‖_r = ( ∑_{i=1}^{d} |m_i|^r )^{1/r}   if r < ∞,
‖m‖_r = max_{i=1,...,d} |m_i|           if r = ∞ (2.8)
is a norm on M.
The name “p-norm” comes from the fact that the parameter is traditionally denoted
by p and not r as seen here. This book uses r to avoid a clash of names, because p is
already used to denote a probability density function.
Fig. 2.5. Unit circles for Minkowski norms with different choices of r (r = 0.4, 0.6, 1, 2, 5, −1, −2, −5). Only the upper right quadrant of the two-dimensional Euclidean space is shown.
Although r can be any integer or infinity, only a few choices are of greater impor-
tance. For r = 2, one obtains the Euclidean norm
‖m‖_e = ‖m‖_2 = √( ∑_{i=1}^{d} |m_i|² ). (2.9)
For r = 1, the result is the sum of absolute values,
‖m‖_1 = ∑_{i=1}^{d} |m_i|. (2.10)
This norm—or more precisely: the induced metric—is also known as taxicab metric
or Manhattan metric. One can visualize this metric as the distance that a car must go
between two points of a city with a rectilinear grid of streets like in Manhattan. For
r = ∞, the resulting norm
‖m‖_max = ‖m‖_∞ = max_{i=1,...,d} |m_i| (2.11)
is called maximum norm or Chebyshev norm. Figure 2.5 depicts the unit circles for
different choices of r in the upper right quadrant of the two-dimensional Euclidean
space.
Furthermore, the Mahalanobis norm is another common metric for real vector
spaces:
Definition 2.5 (Mahalanobis norm). Let M denote a real vector space of finite dimen-
sion d and let A ∈ ℝ^{d×d} be a positive definite matrix. Then
‖m‖_A = √( mᵀ A m ) (2.12)
is a norm on M.
To a certain degree, the Mahalanobis norm is another way to generalize the Euclidean
norm: they coincide for A = Id . More generally, elements A ii on the diagonal of A can
be thought of as scaling the corresponding dimension i, while off-diagonal elements
A ij , i ≠ j assess the dependence between the dimensions i and j. The Mahalanobis norm also
appears in the multivariate normal distribution (see Definition 3.3), where the matrix
A is the inverse of the covariance matrix Σ of the data.
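Both families of norms are easily evaluated numerically; in the sketch below (not from the book), the Mahalanobis matrix A is chosen as the inverse sample covariance, which is a common but not the only possible choice.

```python
import numpy as np

def minkowski_norm(m, r):
    """p-norm of Eq. (2.8); r = np.inf gives the maximum (Chebyshev) norm."""
    if np.isinf(r):
        return np.max(np.abs(m))
    return np.sum(np.abs(m) ** r) ** (1.0 / r)

def mahalanobis_norm(m, A):
    """Mahalanobis norm sqrt(m^T A m) for a positive definite matrix A."""
    return float(np.sqrt(m @ A @ m))

m = np.array([3.0, -4.0])
print(minkowski_norm(m, 1), minkowski_norm(m, 2), minkowski_norm(m, np.inf))
# 7.0 5.0 4.0

# Mahalanobis distance of two samples w.r.t. the inverse covariance of a dataset.
data = np.random.default_rng(0).normal(size=(200, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
A = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis_norm(data[0] - data[1], A))
```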
So far only norms and their induced metrics that require at least an interval scale
were considered. The metrics handle all quantitative scales of Table 2.1. The next sec-
tions will introduce metrics for features on other scales.
2.4.3 A metric for sets
Let us assume one has a finite set U and the features in question are subsets of U. In
other words, the feature space M is the power set P(U) of U. On the one hand the
features are clearly not ordinal, because the relation “⊆” induces only a partial order.
Of course, it is possible to artificially define an ad hoc total order because M is finite,
but the focus shall remain on generally meaningful metrics. On the other hand, a mere
nominal feature only allows to state if two values (here: two sets) are equal or not.
However, two sets can also be said to be “nearly equal” when both the intersection
and the set difference is non-empty (i.e., they share some, but not all elements). The
Tanimoto metric reflects these situations.
Definition 2.6 (Tanimoto metric). Let U be a finite set, M = P (U) and S1 , S2 ∈ M, i.e.,
S1 , S2 ⊆ U. Then
DTanimoto (S1 , S2 ) = ( |S1 | + |S2 | − 2 |S1 ∩ S2 | ) / ( |S1 | + |S2 | − |S1 ∩ S2 | ) ∈ [0, 1] (2.13)
defines a metric on M.
Here, we will omit the proof that DTanimoto is indeed a metric (interested readers are re-
ferred to, e.g., the proof of Lipkus [1999]) and instead investigate its properties. If S1 and
S2 denote the same set, then |S1 | = |S2 | = |S1 ∩ S2 | and therefore DTanimoto (S1 , S2 ) = 0.
Conversely, if S1 and S2 do not have any element in common, |S1 ∩ S2 | = 0 holds and
DTanimoto (S1 , S2 ) = 1. Altogether, the Tanimoto metric varies on the interval from 0
(identical) to 1 (completely different).
Moreover, the Tanimoto metric takes the overall number of elements into account.
Two sets that differ in one element are judged to be increasingly similar, as the number
of shared elements increases. For example, let U = {a, . . . , z}, S1 = {a, b, c}, S2 =
{a, b, d}, S1′ = {a, b, d, e, f} and S2′ = {a, b, d, e, g}. It follows that
DTanimoto (S1 , S2 ) = (3 + 3 − 4)/(3 + 3 − 2) = 1/2 and (2.14)
DTanimoto (S1′ , S2′ ) = (5 + 5 − 8)/(5 + 5 − 4) = 1/3. (2.15)
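The Tanimoto metric translates directly into code; the sketch below (not from the book) reproduces the two example values and assumes non-empty sets.

```python
def tanimoto(s1: set, s2: set) -> float:
    """Tanimoto metric of Eq. (2.13): 0 for identical sets, 1 for disjoint sets.

    Both sets are assumed to be non-empty."""
    inter = len(s1 & s2)
    return (len(s1) + len(s2) - 2 * inter) / (len(s1) + len(s2) - inter)

print(tanimoto({"a", "b", "c"}, {"a", "b", "d"}))                      # 0.5
print(tanimoto({"a", "b", "d", "e", "f"}, {"a", "b", "d", "e", "g"}))  # 0.333...
```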
2.4.4 Metrics on the ordinal scale
It is not immediately clear how to define a meaningful metric for ordinal features,
since there is no empirical addition on that scale. A possible solution is to consider the
metric D(m, m′) of two ordinal features m, m′ as the number of swaps of neighboring
elements needed to reach m′ from m.
Consider, for example, the set of characters in the English language {A, B, C, . . . , Z},
where the order corresponds to the position in the alphabet. The metric informally de-
fined above would yield D(A,C) = 2 and D(A,A) = 0. This example can be generalized
as follows:
Definition 2.7 (Permutation metric). Let M be a locally finite and totally ordered set
with a unique successor function succ, i.e., for each element x ∈ M there is a unique next
element x′ ∈ M. Then
D(x, y) := min {k ∈ ℕ0 | succᵏ(x) = y or succᵏ(y) = x} (2.16)
is a metric on M.
Another way to look at Definition 2.7 is to homomorphically map M into the integers
(i.e., successive elements of M are mapped to successive integers) and calculate the
absolute difference of the numbers corresponding to the elements.
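Following this remark, the permutation metric can be computed by mapping each element to its rank and taking the absolute difference; a sketch (not from the book) for the Latin alphabet:

```python
import string

ALPHABET = string.ascii_uppercase                 # totally ordered set A < B < ... < Z
RANK = {ch: i for i, ch in enumerate(ALPHABET)}   # homomorphic map into the integers

def permutation_metric(x: str, y: str) -> int:
    """Number of successor steps between two elements of the ordered set."""
    return abs(RANK[x] - RANK[y])

print(permutation_metric("A", "C"))  # 2
print(permutation_metric("A", "A"))  # 0
```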
2.4.5 The Kullback–Leibler divergence
The Kullback–Leibler divergence (KL divergence) does not directly quantify a differ-
ence between features m, but between probability distributions (characterized by the
probability mass function or the probability density) over the features. It is often used
as a meta metric to compare objects o i , i = 1, . . . ,N that are in turn characterized
by a set of features Oi = {mj | j = 1, . . . , M i }. To this end, the features in Oi are
used to estimate the probability mass function P̂(m | o i ) or the probability density p̂(m | o i ) for each
object o i . The KL divergence is then used to compute the distance between two object-
dependent distributions and, by proxy, the distance between two objects. Here, the
P̂(m | o i ) or p̂(m | o i ) can themselves be interpreted as features of o i . An extended
example of this approach is given below.
Definition 2.8 (Kullback–Leibler divergence).
1. Let P, P′ be two probability mass functions on the same discrete space M. The Kullback–
Leibler divergence of P with respect to P′ is given by
D(P ‖ P′) = ∑_{m ∈ supp P} P(m) ln ( P(m) / P′(m) ). (2.17)
2. Let p, p′ be two probability density functions on the same space M. The Kullback–
Leibler divergence of p with respect to p′ is given by
D(p ‖ p′) = ∫⋅⋅⋅∫_{supp p} p(m) ln ( p(m) / p′(m) ) dm. (2.18)
In both cases, the KL divergence can be interpreted as the expected value of the log-likelihood ratio if m is distributed according to p(m),
D(p ‖ p′) = E{ log ( p(m) / p′(m) ) }. (2.20)
The likelihood ratio is the crucial quantity of optimal statistical tests to decide between
two competing hypotheses H1 : m ∼ p(m) and H2 : m ∼ p′(m) (Neyman and Pearson
[1992]). In other words, the Kullback–Leibler divergence measures the mean discrim-
inability between H1 and H2 .
To become accustomed to the Kullback–Leibler divergence, we will now dis-
cuss some simple examples. First, consider the family of Bernoulli distributions
parametrized by the probability of success τ ∈ [0,1], that is, P_τ(m) = τ^m (1 − τ)^{1−m} for m ∈ {0, 1}.
Fig. 2.6. Kullback–Leibler divergence D(P_{a=0.3} ‖ P_b ) of two Bernoulli distributions as a function of the parameter b.
Second, consider two Gaussian densities p1 (m) = p(m | μ1 , σ1 ) and p2 (m) = p(m | μ2 , σ2 ). For their Kullback–Leibler divergence, one obtains
D(p1 ‖ p2 ) = ∫_ℝ p(m | μ1 , σ1 ) ln [ (2πσ1²)^{−1/2} exp(−(m − μ1 )²/(2σ1²)) / ( (2πσ2²)^{−1/2} exp(−(m − μ2 )²/(2σ2²)) ) ] dm
= −(1/2) ln(σ1²/σ2²) − (1/(2σ1²)) ∫_ℝ p(m | μ1 , σ1 )(m − μ1 )² dm + (1/(2σ2²)) ∫_ℝ p(m | μ1 , σ1 )(m − μ2 )² dm
= −(1/2) ln(σ1²/σ2²) − σ1²/(2σ1²) + (σ1² + (μ1 − μ2 )²)/(2σ2²)
= (1/2)( σ1²/σ2² − ln(σ1²/σ2²) − 1 ) + (μ1 − μ2 )²/(2σ2²). (2.24)
Fig. 2.7. Pairs of Gaussian distributions with equal variance σ² = 0.5 and their KL divergences: (a) μ_1 = −1, μ_2 = 1, D_KL = 4; (b) μ_1 = −2, μ_2 = 2, D_KL = 16.
Fig. 2.8. Pairs of Gaussian distributions with different variances and their KL divergences: (a) μ_1 = 0, σ_1² = 1, μ_2 = 1, σ_2² = 0.5, D_KL = 1.153; (b) μ_1 = 1, σ_1² = 0.5, μ_2 = 0, σ_2² = 1, D_KL = 0.597.
Fig. 2.10. Combustion engine, microscopic image of bore texture and detail with texture model with
groove parameters. Source: Krahe and Beyerer [1997].
where i = 1, 2 indicates the first or second set. Here, µi is the expected value of u in
the ith groove set,
µ_i = E_i{u} = ∭ u p_i(u, ∆) du d∆ = (μ_{a_i}, μ_{b_i})^T,  (2.26)
and ρ_i = E{(a_{ij} − μ_{a_i})(b_{ij} − μ_{b_i})} / (σ_{a_i} σ_{b_i}) is the correlation coefficient within the ith groove set. The covariance matrix of u in the ith groove set is
C_i = E{(u − µ_i)(u − µ_i)^T} = ( σ_{a_i}²  ρ_i σ_{a_i} σ_{b_i} ;  ρ_i σ_{a_i} σ_{b_i}  σ_{b_i}² ).  (2.27)
The parameter λ_i denotes the groove density in the ith set, i.e.,
λ_i = 1 / E{∆_{ij}}.  (2.28)
This model will be used to construct a measure of the distance between two groove
sets. Recall from Definition 2.8 that the KL divergence between p1 and p2 is asymmetric.
In order to derive a symmetric measure of the distance between two groove sets, we
simply take the sum of the KL divergence between p_1 and p_2 and the KL divergence
with these arguments transposed, D(p_1, p_2) := D(p_1 ∥ p_2) + D(p_2 ∥ p_1).
To conclude this section on metrics, we will discuss the tangential distance measure.
This method does not introduce a new metric, but rather builds on top of a given one
and makes this metric more robust against small, systematic disturbances of the fea-
ture vectors that may be caused by varying lighting conditions, out of focus images,
small rotations of the pattern, etc. The key is that these disturbances should be sys-
tematic and not due to random noise.
Consider, for example, the problem of optical character recognition. Figure 2.11
shows two possible systematic variations of a character (the pattern): rotation and line
thickness. This variation causes differing patterns, and will therefore result in different
feature vectors. However, since the variations in the pattern are systematic, so will
the variations in the feature vectors. More precisely, small variations of the pattern
will move the feature vector within a small neighborhood of the feature vector of the
original pattern (given that the feature mapping is smooth in the appropriate sense).
This observation leads to the following assumption: the feature vectors O_i = {m_j | j =
1, . . . , M_i} derived from the patterns of an object o_i lie on a topological manifold. The
mathematical details and implications of this insight are quite profound and outside
the scope of this book. Nonetheless, Appendix B gives a primer of the most important
terms and concepts of the underlying theory. For the purposes of this section, it is
sufficient to interpret “manifold” as a lower-dimensional hypersurface embedded in
the feature space. In other words, the features mj ∈ Oi of the object o i do not populate
the feature space arbitrarily, but are restricted to some surface within the feature space.
An example is shown in Figure 2.12, where the black curves show the manifolds to which the features of two objects o_i and o_k are restricted. In the context of the OCR example, the two objects stand for different characters, e.g., o_i for the character “A” and o_k for the character “B.” Formally, the manifolds are given by the set of all points generated by an action A of some transformation p on the object o,
M_i := {A(p, o_i) | p ∈ Π}.
Again, the mathematical definitions of the terms are found in Appendix B. Here,
it is sufficient to interpret A(p,o i ) as the feature vector that is extracted from some
systematic variation parametrized by p.
Figure 2.12 also highlights an important issue: given a feature vector m to clas-
sify and two feature vectors mi and mk derived from the objects o i and o k , respectively,
computing the distance between the features might lead to the wrong conclusions.
Here, m is closest to mk and hence one could conclude that the object that produced
Fig. 2.12. Tangential distance measure: Improving the distance measure of two feature vectors by
linear interpolation of the manifold corresponding to the underlying object.
m is more similar to o_k than to o_i. If, however, one considers the entire manifold of
features of both objects, one arrives at a different picture: the closest point m_i′ to m on
M_i is closer than the closest point m_k′ on M_k. In consequence, m is closer to o_i
than to o_k, which is the opposite of what was deduced from the distances of the given
features. This motivates the following improved distance measure:
D_Manifold(m, m_i) := min_{m′ ∈ M_i} ‖m − m′‖, i.e., the distance from m to the manifold that contains m_i.
Unfortunately, the manifold is generally not known and even if it were, computing
the minimal distance is generally computationally infeasible. A solution to this is
given by the tangential distance measure: similar to a first order Taylor expansion,
the true distance D_Manifold(m, m_i) is approximated using a tangential (that is, linear)
approximation at m_i,
D_Tangent(m, m_i) := min_{‖a‖ < ε} ‖m − (m_i + T_{m_i} a)‖.  (2.33)
Here, Tmi denotes the tangent space (or, more precisely, the projection onto the tangent
space) of the manifold at mi . The search for the closest distance is further restricted to
a small neighborhood {mi + Tmi a | ‖a‖ < ε} around mi . The reason is that, similar to
the Taylor expansion, the linear approximation becomes more and more inaccurate
the further one deviates from mi .
Figure 2.12 illustrates this approach: The feature vector m is closer to the tangent
identified by Tmi (purple line) than to the tangent identified by Tmk (orange line).
Hence, m is correctly assigned to the object o i instead of o k . Note that in the figure,
the neighborhoods (denoted by the perpendicular stops on the tangents) are chosen
to be very large. In consequence, the approximation to the manifold Mk does not
hold. In practice, one would probably choose a smaller neighborhood. However, the
neighborhood must not be chosen too small either, because then the distance to the
tangent T_{m_i} will not differ much from the distance to the original feature vector m_i.
Note: If one chooses the Euclidean norm, the minimization w.r.t. a in Equa-
tion (2.33) reduces to a quadratic optimization problem, which can be solved using
standard tools.
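The following NumPy sketch solves exactly this quadratic problem with np.linalg.lstsq and optionally clips ‖a‖ to the neighborhood ε; the function name and the toy tangent are our own choices, not the book's implementation.

import numpy as np

def tangential_distance(m, m_i, T, eps=np.inf):
    """Distance from m to the tangent {m_i + T a}, with ||a|| restricted to eps."""
    a, *_ = np.linalg.lstsq(T, m - m_i, rcond=None)   # least-squares solution of T a = m - m_i
    norm_a = np.linalg.norm(a)
    if norm_a > eps:                                   # stay inside the trusted neighborhood
        a *= eps / norm_a
    return np.linalg.norm(m - (m_i + T @ a))

# toy example: a one-dimensional tangent in the plane
m_i = np.array([0.0, 0.0])
T = np.array([[1.0], [0.0]])          # tangent direction along the first axis
m = np.array([3.0, 1.0])
print(tangential_distance(m, m_i, T))            # 1.0: only the offset orthogonal to the tangent remains
print(tangential_distance(m, m_i, T, eps=1.0))   # about 2.236: the clipped neighborhood increases the distance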
However, exactly computing the tangent space T_m requires the evaluation of the
gradient of A(p, o_k) at m. Unfortunately, this information is rarely available in practice.
The tangent space can, however, be approximated by a secant t̂(m):
where det [∆p1 , . . . , ∆pq ] ≠ 0, i.e., the ∆pj are linearly independent. The small dis-
turbances ∆p_j, j = 1, . . . , q can be obtained by recording the objects under various
conditions, or (more commonly) by simulating these conditions in software from a
small number of actual measurements.
An example of this is shown in Figure 2.13, where only the patterns in the orange
boxes were obtained by measurement. The remaining variations were approximated
using linear interpolation. The corresponding features lie on the secants between the
features of the measured patterns. Comparing Figure 2.13 to the true variations in
Figure 2.11 highlights another important trade-off when using this method: how many
measurements, or sampling points of the manifold, should one obtain? Measuring the
variations takes a considerable effort, but measuring too few variations will reduce
the quality of the approximation.
Lastly, Figure 2.12 illustrates another drawback of the method: the manifolds Mk =
{A (p, o k ) | p ∈ Π} and Mi = {A (p, o i ) | p ∈ Π} are drawn as closed curves. However,
this is not necessarily true. Indeed, the manifold need not even be connected, but
might consist of several, disconnected strips. As a result, the tangential approximation
may significantly underestimate the actual distance to the manifold. Furthermore,
the secant approximation may assume a manifold where there is none, and therefore
invalidate the whole method. However, these issues rarely occur in practice.
2.5 Normalization
The previous section has already presented an approach to enhance the robustness of
a distance measure. Normalization has a similar goal, but is applied at an earlier stage
in the processing chain. Instead of improving a metric, normalization tries
– to eliminate extraneous disturbances of the patterns,
– to eliminate extraneous variations of the patterns and
– to eliminate extraneous variations of the features.
If modifications are already avoided at the stage of the patterns, the deduced features
become independent of those modifications. Surely, this task is highly domain specific,
and requires a good understanding of the concrete pattern recognition system. For this
reason, this section can only present some examples of what can be done in certain
cases. These examples are:
1. Planimetric adjustment of images;
2. lighting adjustment of images;
3. amplitude recovery of audio signals, e.g., by automatic gain adjustment;
4. distortion adjustment of images (due to lens aberrations);
5. alignment, elimination of physical dimension, and leveling of proportions; and
6. dynamic time warping.
Let us assume one has features with values on an interval scale at least, and let m be
the feature vector modeled as a random variable. Then Item 5 can be realized by
m′ = ( (m_1 − E{m_1}) / √Var{m_1}, . . . , (m_d − E{m_d}) / √Var{m_d} )^T.  (2.35)
In practice, the expectation and the variance are unknown and are replaced by their empirical estimates computed from a sample of N feature vectors:
m̄_j = (1/N) ∑_{i=1}^{N} m_{ij},  (2.37)
s_j = √( (1/(N−1)) ∑_{i=1}^{N} (m_{ij} − m̄_j)² )   and  (2.38)
m′ = ( (m_1 − m̄_1)/s_1, . . . , (m_d − m̄_d)/s_d )^T.  (2.39)
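A minimal NumPy sketch of Equations (2.37)–(2.39); the assumption that samples are stored row-wise and the function name are our own choices.

import numpy as np

def zscore_normalize(samples: np.ndarray) -> np.ndarray:
    """Normalize each feature to zero mean and unit sample standard deviation.

    samples: array of shape (N, d), one feature vector per row.
    """
    mean = samples.mean(axis=0)            # Equation (2.37)
    std = samples.std(axis=0, ddof=1)      # Equation (2.38), N - 1 in the denominator
    return (samples - mean) / std          # Equation (2.39)

data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
print(zscore_normalize(data))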
The adjustment and normalization of the lighting conditions of images is a very broad
field. As this textbook is about pattern recognition, this section can only touch on
this topic. A more detailed discussion can be found in the relevant literature, e.g., in
Machine Vision by Beyerer et al. [2016]. The examples shown here are taken from that
book.
Chromaticity normalization
Here and in the discussions below, the color values of the pixels are assumed to be
within the interval [0, 1]. Chromaticity normalization transforms each pixel so that all
the pixels (except the black ones) have the same intensity. More formally, given the
color components (r, g, b) of a pixel, the transformation maps to the color value
(r′, g′, b′) = { (1/(r + g + b)) (r, g, b)   if r + g + b > 0,
                 (0, 0, 0)                   if r + g + b = 0.  (2.40)
Figure 2.14 shows an example.
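A possible NumPy realization of Equation (2.40) on an (H, W, 3) RGB image with values in [0, 1]; the function name is our own and the sketch assumes this memory layout.

import numpy as np

def chromaticity_normalize(img: np.ndarray) -> np.ndarray:
    """Apply Equation (2.40) pixel-wise to an RGB image with values in [0, 1]."""
    s = img.sum(axis=2, keepdims=True)          # r + g + b per pixel
    out = np.zeros_like(img)
    np.divide(img, s, out=out, where=s > 0)     # black pixels (sum == 0) stay (0, 0, 0)
    return out

img = np.random.rand(4, 4, 3)
print(chromaticity_normalize(img).sum(axis=2))  # every non-black pixel now sums to 1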
Illumination normalization tries to equalize the overall proportions of the color
channels in order to mitigate the effect of different light sources with varying color.
Fig. 2.15. Normalization of lighting conditions by iterated normalization of the chromaticity and
illumination. Source: Beyerer et al. [2016].
A filter T is linear if T(αf + βg) = αT(f) + βT(g), where f and g are signals and α, β ∈ ℝ are scalars. Informally, this requires that the result of applying T to a combination of signals has to be the same as first applying T to each signal and then combining the results.
Let us now assume a very simple signal model for the image generation process,
given by
g(x) = s(x) △ b(x),   x = (x, y)^T.  (2.45)
Here g : ℝ2 → ℝ denotes the final image as seen by the pattern recognition system,
s : ℝ2 → ℝ the true underlying image we wish to recover, b : ℝ2 → ℝ the disturbance,
and △ a binary operator on two signals.
If △ is addition, then s is said to be subject to additive noise. Additive systems
are often preferred because they can usually be treated by linear filters, which have
been intensively studied and are well understood tools. If △ is not addition, one can
try to map the system equation into a different space so that the transformed system
is additive. Generally, let T denote a linear filter and U such a transformation. Then
the filter can be applied by
U −1 TUg = U −1 TU (s △ b) = U −1 (TUs + TUb) . (2.46)
Fig. 2.16. Images of the surface of agglomerated cork. Source: Beyerer et al. [2016].
Furthermore, assume that g, s, b > 0 and that the support of the Fourier transforms
of ln s and ln b is not identical. This implies that the logarithm of the disturbance
ln b varies much more slowly than the logarithm ln s of the true signal. Under these
assumptions, the image can be improved by a high pass filter (H) in combination with
a logarithmic transformation. It follows that
(exp ∘ H ∘ ln) g = (exp ∘ H ∘ ln)(s ⋅ b) = exp( H ln s + H ln b ) ≈ exp(ln s + 0) = s,  (2.48)
since H ln s ≈ ln s and H ln b ≈ 0 by assumption.
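A possible realization of Equation (2.48), assuming the high-pass H is built as "identity minus Gaussian low-pass" via scipy.ndimage.gaussian_filter; the cutoff parameter sigma and the synthetic example are our own choices.

import numpy as np
from scipy.ndimage import gaussian_filter

def homomorphic_filter(g: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """exp(H ln g) with H realized as identity minus a Gaussian low-pass, cf. Equation (2.48).

    g must be strictly positive; sigma controls which slow variations are removed.
    """
    log_g = np.log(g)
    high_pass = log_g - gaussian_filter(log_g, sigma)   # H ln g
    return np.exp(high_pass)

# synthetic example: multiplicative slow disturbance b on a fast texture s
x = np.linspace(0, 1, 256)
s = 1.0 + 0.2 * np.sin(2 * np.pi * 40 * np.outer(x, np.ones_like(x)))  # fast texture
b = 1.0 + 0.8 * np.outer(x, x)                                         # slowly varying disturbance
restored = homomorphic_filter(s * b, sigma=20)   # approximates s up to a constant factor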
As before, g denotes the final image as seen by the system, s the true but unknown
image, and b a disturbance. It is assumed that s is a homogeneous process (a reasonable assumption for a regular texture like the cork surface in Figure 2.16) and that b is a slowly varying disturbance.
[Figure: distortion adjustment. The real-world scene is mapped by the distortion V to the preliminary image g(x, y); the reconstructed image γ(ξ, η) is obtained by applying the inverse mapping V⁻¹.]
Usually, (x, y) = V(ξ, η) does not denote a valid lattice point of the preliminary
image, hence g(V(ξ, η)) must be interpolated. In practice, these three methods are
customary (a minimal sketch of the bilinear case follows the list):
– nearest-neighbor interpolation,
– bilinear interpolation, and
– bicubic interpolation.
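As announced above, a minimal sketch of bilinear interpolation at a non-lattice position (x, y) of the preliminary image g; the indexing convention g[y, x] and the function name are assumptions of this sketch.

import numpy as np

def bilinear(g: np.ndarray, x: float, y: float) -> float:
    """Bilinearly interpolate image g (indexed g[y, x]) at the real-valued position (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, g.shape[1] - 1), min(y0 + 1, g.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * g[y0, x0] + fx * g[y0, x1]
    bottom = (1 - fx) * g[y1, x0] + fx * g[y1, x1]
    return (1 - fy) * top + fy * bottom

g = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear(g, 1.5, 2.5))   # 11.5: the average of the four surrounding pixels 9, 10, 13, 14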
Dynamic time warping is necessary if one has to match two signals of different lengths.
A typical example is a pair of audio recordings with different speed profiles. The goal is
to find the best mapping between the two signals that equalizes the different temporal
speed courses (see Figure 2.18). Let
A = (a_1, . . . , a_M)  (2.53)
B = (b_1, . . . , b_L)  (2.54)
be two discrete signals with lengths M and L, respectively. The goal is to find a sequence of pairs of indices C = ((i_1, j_1), . . . , (i_K, j_K)) such that the accumulated distance along the pairing is minimized,
C_opt = argmin_C ∑_{k=1}^{K} ‖a_{i_k} − b_{j_k}‖.  (2.60)
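The classic dynamic-programming solution of this problem is sketched below; the step pattern (diagonal, horizontal, and vertical moves) is an assumption here, since the constraint equations are not reproduced above, and the example sequences are our own.

import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """Minimal accumulated distance between sequences a (length M) and b (length L)."""
    M, L = len(a), len(b)
    D = np.full((M + 1, L + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, L + 1):
            cost = abs(a[i - 1] - b[j - 1])          # |a_i - b_j|
            D[i, j] = cost + min(D[i - 1, j - 1],    # match both samples
                                 D[i - 1, j],        # stretch b
                                 D[i, j - 1])        # stretch a
    return float(D[M, L])

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0])    # the same shape, stretched in time
print(dtw(a, b))                                     # 0.0: perfect alignment despite different lengths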
Though the whole chapter has already been about features, it has been implicitly assumed that those features are already present. Of course, every section provided some examples of features where they helped to illustrate the concepts.
The first section classified features according to their scale of measurement, the
third section illustrated some transformations of the feature space, the fourth section
dealt with distance measures, and the previous section gave some examples of feature
normalization. In summary, all the sections relied on the fact that there were already
available features that could be handled, modified, or transformed. Only the second
section focused a little bit on how to obtain the features. But even that section assumed
that there already was a pool of features from which to select.
In order to fill this gap, this section will put the focus on the question of how to
initially find good features. The first subsection will give some examples of descriptive
features and why descriptive features should be preferred. The second subsection is
about features derived from a model of the generation process of the object. The third
subsection will present a way of systematically constructing invariant features, and is
closely related to Section 2.4.6.
The most straightforward approach is to select standard descriptive features that char-
acterize obvious traits of the object’s class and that carry a natural interpretation. De-
spite the simplicity of this heuristic method, it is often quite successful. Moreover,
descriptive features have the distinct advantage that they can easily be understood by
the system designer, which simplifies debugging when the pattern recognition system fails.
Fig. 2.19. An object and its minimum bounding rectangle.
Geometric features
If the border of an object can be identified, the object’s area is computable as well. The
degree of filling m is defined as
m = (Area of the object) / (Area of the bounding rectangle) ∈ [0, 1].  (2.61)
Usually there are two options for how a minimum bounding rectangle can be defined (see Figure 2.19). The minimum bounding rectangle whose edges are aligned parallel to the coordinate axes is sometimes called the Feret box. But normally the term minimum bounding rectangle (MBR) denotes the rectangle that ignores this constraint and is rotated so that it is properly aligned with the enclosed object.
For both definitions of the box, m is invariant w.r.t. the position and the scale of the object. In addition, when using the MBR, m is also invariant w.r.t. rotation. But the Feret box is easier to compute.
A natural generalization of a bounding box is the convex hull (see Figure 2.20). Accordingly, the degree of convexity is defined as
m = (Area of the object) / (Area of the convex hull) ∈ [0, 1].  (2.62)
Again, m is invariant w.r.t. translation, scaling and rotation. The degree of com-
pactness or form factor relates the perimeter to the area and is defined as
m = 4π ⋅ Area / Perimeter² ∈ [0, 1].  (2.63)
The coefficient 4π has been chosen so that 0 ≤ m ≤ 1, because the circle has the
smallest perimeter of all areas of the same size in Euclidean geometry (see Figure 2.21).
Fig. 2.21. Degree of compactness (form factor) for triangles, squares, and circles.
Tab. 2.2. Number of connected components B, genus (number of holes) L, and Euler number E = B − L for selected letters.

Letter:  A  Ä  B  C  D  E  F  ...  O  Ö  P  ...  U  Ü  V  W  X  Y  Z
B:       1  3  1  1  1  1  1  ...  1  3  1  ...  1  3  1  1  1  1  1
L:       1  1  2  0  1  0  0  ...  1  1  1  ...  0  0  0  0  0  0  0
E:       0  2 −1  1  0  1  1  ...  0  2  0  ...  1  3  1  1  1  1  1
Topological features
Topological features depart from the tangible geometry and describe an object such
that the features become invariant with respect to rubber-sheeting transformations.
Suitable features include, for example, the number of connected components (B) or
the genus, i.e., the number of holes (L). The Euler number is defined as
E = B − L. (2.64)
Table 2.2 lists the number of connected components, the genus, and the Euler
number of the letters.
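As a hedged illustration, the following sketch estimates B, L, and E from a binary character image with scipy.ndimage.label; counting a hole as a background component that does not touch the image border is an assumption of this sketch, not a prescription of the text.

import numpy as np
from scipy.ndimage import label

def euler_number(mask: np.ndarray):
    """Return (B, L, E) for a binary image: components, holes, Euler number E = B - L."""
    B = label(mask)[1]                          # connected components of the foreground
    bg_labels, n_bg = label(~mask)              # components of the background
    border = np.unique(np.concatenate([bg_labels[0, :], bg_labels[-1, :],
                                       bg_labels[:, 0], bg_labels[:, -1]]))
    L = n_bg - len(border[border > 0])          # background regions not touching the border = holes
    return B, L, B - L

# a crude 'O': one component, one hole, Euler number 0
o = np.zeros((7, 7), dtype=bool)
o[1:6, 1:6] = True
o[2:5, 2:5] = False
print(euler_number(o))   # (1, 1, 0)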
Features can also be derived in the frequency domain. Let a gray-scale image be given as the mapping
g : ℝ² → ℝ,   (x, y) ↦ g(x, y),  (2.65)
and let F denote the operator of Fourier transformation. Figures 2.22c and 2.22f illustrate the corresponding Fourier transforms F g of a flawless and a faulty texture.
These figures clearly show a difference. The left periodogram has only one sym-
metric peak at the fundamental frequency, but the right periodogram has some addi-
tional peaks at the subharmonic frequencies. Hence, a reasonable feature is derived from the ratio of the spectral energy at the subharmonic frequencies to the energy at the fundamental frequency.
In the flawless case, this ratio becomes small, because in the numerator the sub-
harmonics vanish. In contrast, the ratio becomes large in the faulty case. For more
details on this topic, see Beyerer et al. [2016].
In its basic form, an autoregressive (AR) model predicts the current state of a system as a linear combination of previous states plus a random disturbance, g_n = ∑_{i ∈ U} a_i g_{n−i} + e_n, where g_n denotes the state of the system at n and e_n denotes the disturbance, while the a_i are the parameters of the model.
In the context of the analysis of time series, the causal neighborhood is naturally
in the past, but the AR model can also be applied to structured images (textures) if
the pixels are enumerated appropriately. A pragmatic approach is to define all pixels
below and to the left of a given pixel as the neighborhood of that pixel.
The number |U| of elements in the neighborhood is called the order of the AR model.
As U only refers to “past” states, and there is a defined starting point (the origin), a
recursive evaluation of the system model is possible. However, first one needs to find
the AR parameters for the given image. To simplify notation, we write the system
equation as
g_{mn} = a^T γ_{mn} + e_{mn},  (2.70)
where γ_{mn} collects the gray values of the pixels in the causal neighborhood U_{mn} of the pixel (m, n).
The unknown parameters are the coefficient vector a and the variance σ² of the noise. Hence the feature vector is given by m = (σ², a^T)^T. The objective of the optimization is to minimize the variance of the noise, which here can be interpreted as the prediction error of the AR model:
σ² = Var{e_{mn}} = E{e²_{mn}} = E{(g_{mn} − a^T γ_{mn})²}
  = E{g²_{mn} − 2 a^T γ_{mn} g_{mn} + a^T γ_{mn} γ_{mn}^T a}
  = E{g²_{mn}} − 2 a^T E{γ_{mn} g_{mn}} + a^T E{γ_{mn} γ_{mn}^T} a → minimize.  (2.72)
Do not be confused by the fact that the left side of the equation is a constant (σ2 ),
but the remainder of the equation seems to depend on the position (m, n): since all
involved processes are at least weakly stationary (see Appendix C), the value of the
last line does not actually depend on (m, n).
To calculate the expectations of γmn and g mn , we assume that the process is lo-
cally ergodic (see, again, Appendix C) in the neighborhood Umn . This means that the
expectation can be estimated by an average over a neighborhood within the same re-
alization. With this in mind, the necessary condition for finding the optimal a is that the derivative of Equation (2.72) with respect to a vanishes.
1 See Appendix C for an explanation of the terms “weakly stationary”, “white noise”, etc.
With the local sums
G = ∑_{(m,n) ∈ U_{mn}} g²_{mn}  (2.74)
and
H = ∑_{(m,n) ∈ U_{mn}} γ_{mn} g_{mn},  (2.75)
the expectations required to determine a and σ² can be estimated from the image data.
Fig. 2.23. Synthetic honing textures using an AR model as an example of model-driven features.
The AR model is generic and does not make use of any context-specific knowledge. The question of what order the AR model should have cannot be answered in general. Usually, one has to resort to a trial-and-error approach until the achieved result is acceptable. Generally, this leads to an unnecessarily high dimension of the parameter space. Furthermore, the AR coefficients do not have an easily interpretable meaning in terms of the modeled pattern.
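To make the estimation concrete, the following sketch fits the coefficient vector a and the noise variance σ² of a causal 2D AR model by ordinary least squares; the neighborhood offsets (pixels above and to the left) and all names are our own choices under the assumptions stated in the text.

import numpy as np

def fit_ar(image: np.ndarray, offsets=((0, -1), (-1, 0), (-1, -1))):
    """Estimate AR coefficients a and noise variance sigma^2, cf. Equations (2.70) and (2.72).

    offsets: causal neighborhood U given as (row, column) displacements.
    """
    H, W = image.shape
    rows, cols = np.meshgrid(np.arange(1, H), np.arange(1, W), indexing="ij")
    g = image[rows, cols].ravel()                                   # g_mn
    gamma = np.stack([image[rows + dr, cols + dc].ravel()           # neighborhood vector gamma_mn
                      for dr, dc in offsets], axis=1)
    a, *_ = np.linalg.lstsq(gamma, g, rcond=None)                   # minimizes the mean squared prediction error
    sigma2 = np.mean((g - gamma @ a) ** 2)
    return a, sigma2

rng = np.random.default_rng(0)
a_hat, s2 = fit_ar(rng.random((64, 64)))
print(a_hat, s2)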
The next section deals with an adjusted, purpose-specific model for the same ex-
ample, i.e., honing textures, as a counterpart to the general AR model.
A single groove with direction e_v and distance d_v from the origin contributes a gray-scale profile of the form x ↦ g_v(x^T e_v − d_v).  (2.81)
Fig. 2.24. Physical formation process and parametric model of a honing texture: (a) generation of a honing texture by simultaneous rotation and stroke movements; (b) parametric model of a single honing groove.
As the supports of different grooves are not disjoint, their gray-scale values are
added in the overlapping regions. This is an error of the model, but the error is negli-
gible and this assumption significantly simplifies the calculation.
With α v ∈ [0, π), the directional vector of the groove can be written as ev =
(cos α v , sin α v )T . Then the parameters of the groove model are the angle α v , the dis-
tance d v , and the groove profile function g v . Due to the movement of the honing tool,
the grooves normally have one out of two principal directions, i.e., a simplified para-
metric stochastic model is
p(α_v) = ½ δ(α_v − β_1) + ½ δ(α_v − β_2)  (2.83)
with two parameters β1 and β2 and with δ(α) denoting the Dirac distribution that is
nonzero only at α = 0.
Likewise, the density of the grooves depends primarily on the density and distribu-
tion of the abrasive grain material on the honing tool. The distances d v of the grooves
from the origin are chosen such that they are uniformly distributed. The number q of
grooves in an interval of size L is assumed to be Poisson distributed
P(q) = ((λL)^q / q!) e^{−λL},   q = 0, 1, 2, . . . .  (2.84)
Lastly, the groove profile function g v (⋅) is assumed to be from a parametric family
of functions that is totally defined by its expectation (E{g v })(⋅).
In summary, in order to learn the model, the parameters β1 , β2 , λ, and (E{g v })(⋅)
need to be estimated. Figure 2.25 illustrates the results. In comparison to the general
Fig. 2.25. Synthetic honing texture as an example of model-driven features using a physically moti-
vated model. Compare this result to the AR model in Figure 2.23.
AR model (see Figure 2.23), the artificially generated surface resembles the original
surface much better, even though the number of parameters is much smaller. For more
details on this approach, see Beyerer [1994].
A general problem of choosing features is to choose those that are invariant with re-
spect to variations within the same class. These variations can be of two types:
firstly, the observable patterns vary because they belong to different objects within
the same class. Secondly, the same object can induce different patterns due to distur-
bances (see Figure 2.26). This section sheds light on the question of how to construct
mappings from varying patterns onto invariant features in a systematic way.
For this reason this section is strongly related to Section 2.4.6. The reader is advised
to recall the important concepts and definitions from that section. Section 2.4.6 tried to
enforce the robustness of the distance measurement in the feature space against variations of
the features. The new contribution of this section is to make the feature itself invariant
with respect to variations of the pattern.
Recall the situation from Section 2.4.6 and Figure 2.12. The feature space M is a
smooth manifold and the disturbance is modeled as a Lie transformation group Π that
acts on the feature space. The group action is denoted by A : Π × M → M. For an
arbitrary but fixed feature mi , the orbit Πmi is given by {A(p, mi ) | p ∈ Π}. Refer to
Appendix B for the definitions of these terms.
The objective is to find a new feature m̃ and a suitable feature transformation m_i ↦ m̃(m_i) such that m̃(m_i) is constant on each orbit {A(p, m_i) | p ∈ Π}.
Fig. 2.26. Variations of the objects and variations of the patterns due to the measurement lead to variations of the features.
To begin with, consider a toy example to illustrate the computational complexity
of a brute force solution. Consider a two-dimensional point x = (x_1, x_2)^T ∈ ℝ². There
are several options for what a suitable Lie transformation group Π could be:
1. The translation group Π = τ = ℝ², acting by x′ = x + a for a ∈ τ. As the name suggests, the points are just moved around in the plane. The number of degrees of freedom of this group, i.e., the dimension of the Lie transformation group, equals two.
2. The congruence group
   Π = C = { (R, a) | R = ( cos α  sin α ; −sin α  cos α ), α ∈ ℝ, a ∈ ℝ² }  (2.85)
   additionally comprises rotations and has three degrees of freedom. It acts by x′ = Rx + a.
3. The similarity group
   Π = S = { (T, a) | T = k ( cos α  sin α ; −sin α  cos α ), k ∈ ℝ_{>0}, α ∈ ℝ, a ∈ ℝ² }  (2.86)
   also allows scaling. The dimension of the Lie transformation group is four. Like the congruence group, it acts by x′ = Tx + a.
4. The affine group
   Π = A = { (P, a) | det P ≠ 0, a ∈ ℝ² }  (2.87)
   includes translations, rotations, scalings, and shearing, and has six degrees of freedom. It also acts by x′ = Px + a.
Assume that each class is represented by one feature vector mj for j = 1, . . . , c and
m is the previously unseen feature vector that should be classified. One approach for
classification would be to choose ω̂ = ω_i with
(i, p*) = argmin_{j ∈ {1,...,c}, p ∈ Π} ‖m − A(p, m_j)‖.  (2.88)
This means that one has to calculate all the transformations of all the classes
and choose the class that comes closest to the provided unseen feature vector. Un-
fortunately, the complexity grows exponentially with the dimension of the Lie trans-
formation group. In order to obtain a feeling for this implication, consider the six-
dimensional affine group from above and assume that each dimension is discretized
into 10³ steps. This results in the computation of 10^18 values per class. A machine able
to perform 10^9 of these computations per second would need 31.7 years to classify
just one sample. These numbers clearly show that a brute force approach is not an
option.
There are three approaches to systematically constructing invariant features:
1. The integral method: average a suitable function f over the orbit,
   m̃ = ∫_Π f(A(p, m)) dp.  (2.89)
2. The differential method: require that m̃ be constant along the orbit, i.e., that the derivative of m̃(A(p, m)) with respect to the group parameters vanishes.
3. Normalization: trace the feature vector back to a designated point of the orbit.
For the integral method, the function f must be chosen such that not too much information is lost, because the integral must still return different values for different classes. For example, the choice f ≡ 0 forces the integral to be always zero. Without a doubt, this is a perfect, but obviously too restrictive, invariant. There is no choice of f that is generally applicable to just any feature.
As an example, consider the group G of rotations about the origin acting on points m = (m_1, m_2)^T ∈ ℝ². The group acts on the points by the usual multiplication of a vector by a matrix, and the orbits are circles centered at the origin. The integral approach leads to
m̃ = ∫_G f(A(g, m)) dg = ∫_{−π}^{π} f( ( cos α  sin α ; −sin α  cos α ) (m_1 ; m_2) ) dα.
Choosing, for example, f(x) = x^T x, which a rotation leaves unchanged, the integrand equals m_1² + m_2² for every α, and the integral evaluates to m̃ = 2π (m_1² + m_2²).
This result is correct because, up to a missing root, m_1² + m_2² is the (squared) distance
of the point from the origin. This is an invariance with respect to rotations around the
origin, because it equals the radius of the orbit. Although we lose information about
the precise location of the feature m, we do not lose too much information, because
the distance from the origin still suffices to distinguish different orbits. The drawback
is that we had to guess the suitable function f .
The general idea of calculating an integral with respect to a Lie transformation
group is to express a group element g ∈ G by its so-called normal coordinates. In
this case the normal coordinate representation is g(α) = ( cos α  sin α ; −sin α  cos α ). The domain of the normal coordinates is (an isometric copy of) a subset of ℝ^l, and hence the integral can be pulled back to an already known integral over the real numbers. But finding the normal coordinate representation of a Lie group is not always as easy as this example might suggest.
The differential method instead requires that m̃ be constant on the orbit, i.e., that its derivative with respect to the group parameter vanishes:
0 ≐ ∂ m̃(A(g, m)) / ∂g
  = ∂/∂α  m̃( ( cos α  sin α ; −sin α  cos α ) (m_1 ; m_2) )
(assumption: m̃ : ℝ² → ℝ, (u, v)^T ↦ m̃(u, v))
  = (m_2 cos α − m_1 sin α) ∂/∂u m̃(m_1 cos α + m_2 sin α, m_2 cos α − m_1 sin α)
    − (m_1 cos α + m_2 sin α) ∂/∂v m̃(m_1 cos α + m_2 sin α, m_2 cos α − m_1 sin α)
(substitution: χ = m_1 cos α + m_2 sin α, ξ = m_2 cos α − m_1 sin α)
  = ξ ∂/∂u m̃(χ, ξ) − χ ∂/∂v m̃(χ, ξ).  (2.94)
Close inspection of the last line reveals that
m̃(u, v) = u² + v²  (2.95)
is one solution of the partial differential equation, because ∂m̃/∂u = 2u and ∂m̃/∂v = 2v, and therefore ξ ⋅ 2χ − χ ⋅ 2ξ = 0 follows.
By definition, m̃ is invariant under the group action, and we obtain
m̃(m_1, m_2) = m̃( ( cos α  sin α ; −sin α  cos α ) (m_1 ; m_2) ) = m_1² + m_2²  (2.96)
for any α. This is the same result as obtained by the integral approach. Instead of
guessing some function f , the problem is rather to find a formula for the solution.
Normalization by example
In contrast to the first two methods, normalization does not provide a predetermined
course of action. The general idea is to pull back each feature vector to a designated
point on the orbit (see Appendix B). In other words, each orbit is characterized by a
single canonical representative. Hence, if m is the feature vector, one needs to find the
corresponding group element g ∈ G such that A(g, m) maps to the representative of
the orbit of m. Obviously, g depends on m, and the question of how g can be calculated
remains an open question for the general case. In this section, an example of two-
dimensional contours is presented.
A two-dimensional contour (see Figure 2.27) can be considered as a continuous
closed curve in the Euclidean plane given by
z(l) = (x(l), y(l))^T   with x : [0, L] → ℝ and y : [0, L] → ℝ  (2.97)
with boundary condition x(0) = x(L) and y(0) = y(L). As x and y are continuous
functions with period L, they are especially suited to be expanded in Fourier series,
x(l) = ∑_{n=−∞}^{∞} X_n e^{j n 2πl/L}   with   X_n = (1/L) ∫_0^L x(l) e^{−j n 2πl/L} dl  (2.98)
y(l) = ∑_{n=−∞}^{∞} Y_n e^{j n 2πl/L}   with   Y_n = (1/L) ∫_0^L y(l) e^{−j n 2πl/L} dl.  (2.99)
Note that X_n and Y_n are complex values, but the additional restriction X_n^* = X_{−n} and Y_n^* = Y_{−n} holds, so that the imaginary parts cancel pairwise, because x and y are real functions. Hence, z(l) can be written as
z(l) = ∑_{n=−∞}^{∞} Z_n e^{j n 2πl/L}   with   Z_n = (X_n, Y_n)^T.  (2.100)
The Fourier series can be written in a more compact form if z(l) is not regarded as a two-dimensional real vector but as a complex function, i.e., z(l) = x(l) + j y(l) with coefficients Z_n = X_n + j Y_n.
The property X_n^* = X_{−n} and Y_n^* = Y_{−n} does not carry over to the coefficients Z_n, because z(l) is a true complex function and therefore the imaginary parts do not cancel in general. On the one hand, one has
Z_n^* = X_n^* + (j Y_n)^* = X_n^* − j Y_n^*,  (2.103)
but on the other hand Z_{−n} = X_{−n} + j Y_{−n} = X_n^* + j Y_n^*, which differs from Z_n^* in general.
From now on, we look at z(l) as a true complex function, and it is assumed that the coefficients Z_0, Z_1, Z_{−1}, Z_2, Z_{−2}, . . . are not restricted to be pairwise complex conjugates.
At this point, one has a feature vector
m = (Z_0, Z_1, Z_{−1}, . . . , Z_n, Z_{−n})^T.
A translation only affects the first coefficient Z_0. Actually, this coefficient is nothing else than the center of mass of the contour. Hence, omitting this coefficient (or implicitly setting it to zero) describes the same contour with its center of mass moved to the origin. A translation invariant feature vector is therefore
m̃ = (Z_1, Z_{−1}, Z_2, Z_{−2}, . . . , Z_n, Z_{−n})^T.
Scaling invariance is also easy to obtain. If the contour z(l) is scaled by a real, positive value a ∈ ℝ_{>0} to z′(l) = a z(l), all coefficients are scaled by the same value, Z′_n = a Z_n. Hence, dividing all coefficients by the absolute value of the first element yields a scaling invariant feature vector
m̃ = ( Z_1/|Z_1|, Z_{−1}/|Z_1|, . . . , Z_n/|Z_1|, Z_{−n}/|Z_1| )^T.
This step resulted in a feature vector whose first component has an absolute value
of one, but an arbitrary direction, on the complex unit circle. A rotation (but not scale)
invariant feature vector is obtained if all coefficients are multiplied in such a way that
the coefficient Z1 points in the direction of the real axis, i.e., if the first coefficient
becomes a positive real number. This is true because all coefficients are multiplied
by the same value ejα if the contour z(l) is multiplied by ejα . This is a rotation in the
complex plane. Let
φ_1 = Arg Z_1 ∈ (−π, π]  (2.110)
denote the argument of the first coefficient. This means that Z_1 e^{−jφ_1} = |Z_1|. Hence
m̃ = (Z_1 e^{−jφ_1}, Z_{−1} e^{−jφ_1}, . . . , Z_n e^{−jφ_1}, Z_{−n} e^{−jφ_1})^T
  = (|Z_1|, Z_{−1} e^{−jφ_1}, . . . , Z_n e^{−jφ_1}, Z_{−n} e^{−jφ_1})^T  (2.112)
is a rotation (but not scale) invariant feature vector. In other words, the orientation of
the contour is encoded in the phases of the coefficients.
If the last two steps are combined, one obtains a scale and rotation invariant fea-
ture vector. Of course, this actually means that all coefficients are divided by Z1 . As
the first coefficient becomes one, it can implicitly be omitted.
In summary, let
m = (Z_1, Z_{−1}, Z_2, Z_{−2}, . . . , Z_n, Z_{−n})^T  (2.113)
be the feature vector of the Fourier series approximation of a contour with n coefficients. Then
m̃ = (Z̃_{−1}, Z̃_2, Z̃_{−2}, . . . , Z̃_n, Z̃_{−n})^T = ( Z_{−1}/Z_1, Z_2/Z_1, Z_{−2}/Z_1, . . . , Z_n/Z_1, Z_{−n}/Z_1 )^T  (2.114)
is an example of a translation, scale and rotation invariant feature vector of the contour. The values Z̃_0 = 0 and Z̃_1 = 1 are implicitly omitted. More generally, because z′(l) = a e^{jα} z(l), with a ∈ ℝ_{>0} and α ∈ ℝ, leads to Z′_k = Z_k a e^{jα}, each ratio m̃ = Z_n/Z_m with n, m ≠ 0 is invariant with respect to scaling and rotation of the contour.
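A minimal NumPy sketch of Equation (2.114) computed from sampled contour points via the FFT; the sampling of the contour, the number n of coefficients, and the test curve are assumptions of this sketch.

import numpy as np

def fourier_descriptor(contour: np.ndarray, n: int = 4) -> np.ndarray:
    """Invariant descriptor (Z_-1/Z_1, Z_2/Z_1, Z_-2/Z_1, ..., Z_n/Z_1, Z_-n/Z_1), cf. (2.114).

    contour: complex samples z(l) = x(l) + j*y(l) of a closed curve, equally spaced in l.
    """
    Z = np.fft.fft(contour) / len(contour)       # Fourier coefficients Z_k; Z_0 is the center of mass
    ratios = [Z[-1] / Z[1]]                      # Z_-1 / Z_1
    for k in range(2, n + 1):
        ratios += [Z[k] / Z[1], Z[-k] / Z[1]]    # Z_k / Z_1 and Z_-k / Z_1
    return np.array(ratios)

# an ellipse with a small extra harmonic, and a translated, scaled, rotated copy of it
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
contour = 2 * np.cos(t) + 1j * np.sin(t) + 0.1 * np.exp(-3j * t)
moved = (3 + 2j) + 0.5 * np.exp(0.7j) * contour
print(np.allclose(fourier_descriptor(contour), fourier_descriptor(moved)))  # True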
Generally, a high dimension of the feature space is unfavorable, for reasons that will
be explained in Section 6.1. The last section of this chapter will treat the question of
how a high dimension of the feature space can be reduced.
The presentation starts with the concepts of principal component analysis (PCA)
and independent component analysis (ICA). Both methods derive new features by
combining the original features and projecting the result to a subspace of smaller di-
mension. They share the objective of approximately representing the collection of all
samples D with the desired (lower) dimensional space, so that when one would recon-
struct the samples in the original feature space, the mean square error between the
original features and the reconstructed features is minimized. In other words, these
methods do not take the class affiliation into consideration but regard the whole col-
lection of samples D at once.
The third method this section will treat is multiple discriminant analysis (MDA).
This method initially focuses on optimal class separation, but apart from that, it op-
erates similarly to the other two methods. That is, this method also calculates com-
binations of the original features and projects them to a subspace. But the way the
combinations and projections are calculated differs.
All these methods suffer from two drawbacks. First, they only work for features
on an interval scale at least, because subtraction and scalar multiplication must be de-
fined. Second, the descriptive meaning of the original features might get lost. Instead
of concrete features these methods might generally return opaque transformations.
For these reasons, the last method presents a systematic way of selecting a subset
of the original feature vector such that the smaller set of features is still good enough.
Throwing away some components is the same as a projection, but it keeps the axes “as
is.” The advantage of this method is that it works for arbitrary kinds of features and
retains the meaning of the original features. The disadvantage is that it is less powerful,
because it does not make use of combinations of the features.
The idea of principal component analysis is to find a lower dimensional subspace such
that the data is optimally represented in terms of the mean square error. This subsec-
tion proceeds as follows. First, the method is presented for the case where the subspace
is chosen to be zero-, one-, or two-dimensional, because these cases can be easily de-
picted and they descriptively convey the underlying idea. Then the general case is
presented for an arbitrary number of dimensions. At the end, the method of principal
component analysis is generalized to kernelized principal component analysis, which
uses in addition a nonlinear transformation in order to improve the representation.
This shows that the point with the least squared distance to the points m_1, . . . , m_N is the
center of mass of these points (see Figure 2.28),
m̄ = (1/N) ∑_{k=1}^{N} m_k.
Iteratively, one can now seek the one-dimensional line that best represents the points. That line is given by m̄ + a e, where e denotes the normalized directional vector and a the scalar parameter of the line. Let m̆_k = m̄ + a_k e denote the orthogonal projection of the feature m_k onto the line. The optimization functional is
J_1(a_1, . . . , a_N, e) = ∑_{k=1}^{N} ‖m̆_k − m_k‖² = ∑_{k=1}^{N} ‖m̄ + a_k e − m_k‖²  (2.119)
  = ∑_{k=1}^{N} a_k² − 2 ∑_{k=1}^{N} a_k e^T (m_k − m̄) + ∑_{k=1}^{N} ‖m_k − m̄‖².  (2.120)
As the minimum is an inner point, it suffices to find the point where the first derivatives are zero:
∂J_1(a_1, . . . , a_N, e) / ∂a_k = 2 a_k − 2 e^T (m_k − m̄) ≐ 0   ⇔   a_k = e^T (m_k − m̄).  (2.121)
Putting this solution into the last line of Equation (2.120) yields
J_1(e) = ∑_{k=1}^{N} a_k² − 2 ∑_{k=1}^{N} a_k² + ∑_{k=1}^{N} ‖m_k − m̄‖²
  = −∑_{k=1}^{N} a_k² + ∑_{k=1}^{N} ‖m_k − m̄‖²
  = −∑_{k=1}^{N} e^T (m_k − m̄)(m_k − m̄)^T e + ∑_{k=1}^{N} ‖m_k − m̄‖²
  = −e^T S e + ∑_{k=1}^{N} ‖m_k − m̄‖²,   where the second term is fixed and independent of e.  (2.122)
The matrix
S := ∑_{k=1}^{N} (m_k − m̄)(m_k − m̄)^T ∈ ℝ^{d×d}  (2.123)
is the so-called scatter matrix. Minimizing J_1(e) is hence equivalent to maximizing e^T S e under the constraint ‖e‖ = 1. Introducing a Lagrange multiplier λ and setting the derivative with respect to e to zero yields
S e = λ e
and therefore
e^T S e = λ.
The line before the last line shows that the sought value of λ is an eigenvalue of
the matrix S. Since S is symmetric by construction (see Equation (2.123)), it is diago-
nalizable and such an eigenvalue must exist. The last line reveals that the greatest
eigenvalue must be picked to maximize eT Se.
In summary, the best line has a base point at the center of mass and the same
direction as the eigenvector with the largest eigenvalue of the scatter matrix (see Fig-
ure 2.29).
In order to complete the usual notation, let the column-wise concatenation of the zero-mean feature vectors,
M := (m_1 − m̄, . . . , m_N − m̄) ∈ ℝ^{d×N},  (2.126)
denote the so-called data matrix. Then the scatter matrix can be written as
S = ∑_{k=1}^{N} (m_k − m̄)(m_k − m̄)^T = M M^T.  (2.127)
We now turn to the general case. Again, m̆_k will denote the projection of m_k onto a d′-dimensional affine subspace given by
m̄ + ∑_{i=1}^{d′} a_i e_i  (2.128)
with {e_1, . . . , e_{d′}} constituting an orthonormal basis. Then the objective function is
J_{d′}(a_{1,1}, . . . , a_{N,d′}, e_1, . . . , e_{d′}) = ∑_{k=1}^{N} ‖m̆_k − m_k‖² = ∑_{k=1}^{N} ‖ m̄ + ∑_{i=1}^{d′} a_{k,i} e_i − m_k ‖²  (2.129)
for d′ < d. A generalized variant of the same course of action as above leads to the following result: the optimal affine subspace of dimension d′ has its base point at the average value m̄ and is spanned by the eigenvectors belonging to the d′ largest eigenvalues of the scatter matrix S (see Figure 2.30).
Hence, the usual procedure to calculate the d′-dimensional principal component analysis consists of the following steps:
1. Calculate the empirical mean m̄ = (1/N) ∑_{k=1}^{N} m_k and the scatter matrix S = ∑_{k=1}^{N} (m_k − m̄)(m_k − m̄)^T.
2. Calculate the eigenvectors e_1, . . . , e_d of S, ordered by decreasing eigenvalues λ_1 ≥ ⋯ ≥ λ_d.
3. Project the centered feature vectors onto the first d′ eigenvectors, m̃ = (e_1, . . . , e_{d′})^T (m − m̄), to obtain the transformed feature vector m̃ of smaller dimension d′.
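The following minimal NumPy sketch carries out exactly these steps; it uses np.linalg.eigh because S is symmetric, and the sample data and all names are our own choices.

import numpy as np

def pca(samples: np.ndarray, d_prime: int):
    """PCA of samples with shape (N, d); returns mean, the d' leading eigenvectors, projections."""
    mean = samples.mean(axis=0)
    M = (samples - mean).T                        # data matrix, one zero-mean feature per column
    S = M @ M.T                                   # scatter matrix, Equation (2.127)
    eigval, eigvec = np.linalg.eigh(S)            # ascending eigenvalues of the symmetric matrix S
    order = np.argsort(eigval)[::-1][:d_prime]    # keep the d' largest
    E = eigvec[:, order]
    return mean, E, (samples - mean) @ E          # projected features per row

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.diag([3.0, 0.3]) @ np.array([[0.8, -0.6], [0.6, 0.8]])
mean, E, X_proj = pca(X, d_prime=1)
print(E.T, X_proj.var())    # the first principal direction and the variance along it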
If the support of the random process is chosen to be a bounded set of natural numbers, i.e., t ∈ {1, . . . , d}, then the stochastic process can be written as a vector (m(1), . . . , m(d))^T and one is in the same situation as for principal component analysis.
Let m be a random vector and let
µ = E{m},  (2.135)
Σ = Cov{m},  (2.136)
E = (ϵ_1, . . . , ϵ_d),  (2.137)
Λ = diag(κ_1, . . . , κ_d)  (2.138)
denote its expectation, its covariance matrix, the matrix whose columns ϵ_i are the orthonormal eigenvectors of Σ, and the diagonal matrix of the corresponding eigenvalues κ_i. Then
Λ = E^T Σ E.  (2.140)
The transformed random vector
m̃ = E^T (m − µ)  (2.141)
has zero mean,
E{m̃} = 0,  (2.142)
and its component-wise variances are the eigenvalues,
κ_i = Var{m̃_i}.  (2.144)
Fig. 2.31. The variance of the dataset is encoded in the principal components so that the variance
along a component is proportional to the corresponding eigenvalue.
Moreover, the components of m̃ are pairwise uncorrelated,
Cov{m̃_i, m̃_j} = 0   for i ≠ j.  (2.145)
That being said, we now return to principal component analysis. Instead of a ran-
dom vector m, one has a set of feature vectors mk that are nothing else than realizations
of m, and the empirical mean m̄ is an unbiased estimator of the expectation vector,
µ̂ = m̄.  (2.146)
Except for a correction factor, something similar holds for the scatter matrix. An
unbiased estimator for the covariance matrix is
Σ̂ = S / (N − 1).  (2.147)
The component-wise variance of the transformed feature can be unbiasedly esti-
mated by the scaled eigenvalues of the scatter matrix
κ̂_i = λ_i / (N − 1).  (2.148)
This situation is depicted in Figure 2.31.
The ith component of the transformed feature vector is m̃_i = e_i^T (m − m̄), and the reconstruction of the ith component in the original feature space is given by
e_i m̃_i = e_i e_i^T (m − m̄),   with e_i e_i^T ∈ ℝ^{d×d}.  (2.150)
Hence m^[1] = (I − e_1 e_1^T)(m − m̄) is the feature vector with the first component removed. More generally, we denote the transformed vector m without the entries corresponding to the first i eigenvectors e_i by
m^[i] = ( I − ∑_{j=1}^{i} e_j e_j^T ) (m − m̄).
Because distinct eigenvectors are orthogonal, the sequence m^[1], m^[2], m^[3], . . . can be calculated recursively:
m^[1] = (I − e_1 e_1^T)(m − m̄),  (2.152)
m^[2] = (I − e_2 e_2^T) m^[1],  (2.153)
m^[3] = (I − e_3 e_3^T) m^[2].  (2.154)
The Equations (2.152) to (2.154) and so on can be thought of as follows: At first, the
direction of maximum variance is determined and the variation of the data w.r.t. this
direction is removed. Then, within the data modified in this way, again the direction
of maximum variance is determined, and so on, and so on. Therefore, in a greedy
manner, the maximum variance directions are identified recursively and the pertaining
components of the data are consecutively subtracted.
For a single zero-mean feature vector m_k − m̄ with k = 1, . . . , N, the squared projection error onto the d′-dimensional subspace is
‖ ( I − ∑_{i=1}^{d′} e_i e_i^T ) (m_k − m̄) ‖² = ‖ ( ∑_{i=d′+1}^{d} e_i e_i^T ) (m_k − m̄) ‖²  (2.155)
and the total squared error for all feature vectors is the sum of the remaining eigenvalues:
∑_{k=1}^{N} ‖ ( I − ∑_{i=1}^{d′} e_i e_i^T ) (m_k − m̄) ‖² = ∑_{i=d′+1}^{d} λ_i.  (2.156)
By construction, the principal component analysis yields the best (w.r.t. mean square error) d′-dimensional approximation. Furthermore, we already know that the eigenvalues are proportional to the variance of the data with respect to the corresponding direction. Now, assume some other arbitrary d′-dimensional projection and calculate the variance of the data in the dimensions being thrown away. These variances will be greater than the term above. In this sense, the sum ∑_{i=d′+1}^{d} λ_i is minimal, i.e., as little information as possible is lost.
Abusing some concepts and notation, we can clarify what is meant by “loss of information.” By virtue of the coerced normalization, one can regard the sequence of eigenvalues λ_1, . . . , λ_d as a probability distribution
χ_i := λ_i / ∑_{j=1}^{d} λ_j.  (2.157)
For any other linear transformation of the feature space, the corresponding entropy
of the variances in each direction is larger. In other words, the principal component
analysis yields that linear transformation for which the entropy of the “variances” be-
comes minimal. Hence, the variances are as unequally distributed as possible. In a lax
interpretation, one could say that the first dimension bears as much information about
the data as possible, the second dimension bears most of the remaining information,
the third dimension bears most of the information without the first two dimensions,
and so on.
The following list recapitulates the essential characteristics of a principal compo-
nent analysis.
– The components of the transformed feature vectors are pairwise uncorrelated.
– The variances of the components of the transformed feature are maximally un-
equally distributed for all linear transformations. (The variances have minimal
entropy.)
– The PCA yields the best d′-dimensional approximation in terms of the squared deviation.
– The PCA does not aim for the optimal separability of the classes, but tries to provide
the best representation of all the data D as a whole. Nonetheless, experience shows
that the PCA yields feature spaces of good quality with low dimensions.
– The descriptive meaning of the original features is lost.
Fig. 2.32. Mean face computed from the YALE faces dataset of
Georghiades et al. [2001].
In the eigenfaces approach (one of the classical face recognition methods), faces are represented as the deviations from a mean face. The mean face as well
as the “directions” of the deviation are calculated using PCA.
Let g(x, y) denote the gray-scale image of a face with (x, y) ∈ {1, . . . , n}2 . Note
that all images are required to be of the same size, but there is no technical reason to
require them to be square. However, this restriction simplifies the following discussion.
In addition, all images should show the face in the same pose and be aligned with a
common reference frame (e.g., eye centers on the same height) for this technique to
work well. The pixels are arranged into a vector m ∈ ℝ^d with dimension d = n². Note
that here the pattern itself is used as the feature vector. As above, let M = (m_1 − m̄, . . . , m_N − m̄) ∈ ℝ^{d×N}
denote the data matrix and S = M M^T ∈ ℝ^{d×d} denote the scatter matrix.
Usually, the next step would be to calculate the eigenvectors and eigenvalues of
S. In practice, however, this is infeasible due to the size of S and the resulting compu-
tational complexity of the eigen-decomposition. Consider, for example, small facial
images measuring 32 × 32 pixels, i.e., n = 32 (in real applications the images will
be larger). Then the “feature vectors” m will be of dimension d = n² = 32² = 1024 and the
scatter matrix S will have d² = n⁴ = 1,048,576 entries.
The costly eigen-decomposition can be avoided by exploiting the structure of the
problem: the dimensionality of the space induced by the training sample is smaller
than the dimensionality of the feature space. In other words, the number N of features
in the training sample is much smaller than d. This is an odd situation: in most cases,
the number of samples is much larger than d. As we will see in Chapter 4, N ≫ d
is (often) actually required in order to successfully estimate the decision boundaries
in the feature space. Note, however, that at the moment we do not wish to derive a
classifier, rather, we wish to find a compact representation of the facial images that
can be used with a classifier.
Nevertheless, as here N < d, consider instead the matrix
K := M^T M ∈ ℝ^{N×N}.  (2.160)
Fig. 2.33. First 20 eigenfaces computed from the YALE faces dataset of Georghiades et al. [2001].
The first components clearly correspond to different lighting conditions, while the other components
correspond to changes in pose and facial structure.
Fig. 2.34. First 20 eigenvalues corresponding to the eigenfaces in Figure 2.33. Note that most of the variation is captured by just the first two components, which correspond to lighting directions.
If η_i denotes an eigenvector of K with eigenvalue λ_i, then M η_i is an eigenvector of the scatter matrix S with the same eigenvalue. The normalized eigenvectors of S are therefore obtained as
e_i = M η_i / ‖M η_i‖   for i = 1, . . . , N.  (2.161)
Since the eigenvectors are computed from images, they can themselves be con-
verted into images. Figures 2.32 and 2.33 show the mean vector m and the eigenvectors
ei , i = 1, . . . ,10 of the extended YALE face dataset B (Georghiades et al. [2001]) inter-
preted as gray-scale images. This dataset contains pictures of the faces of 39 subjects.
The images were recorded under different lighting conditions and cropped and ro-
tated so that the faces of two different images are aligned. One can clearly see that
the eigenvectors represent major modes of change: lighting, pose, and facial structure.
The eigenvalues corresponding to the eigenvectors are shown in Figure 2.34. As ex-
pected from the previous discussion, the eigenvalues are very unequally distributed;
most of the variation in the dataset is represented by the first two components. The
third, fourth, etc. eigenvalues are of much smaller magnitude, which means that the
associated components explain finer, but less common, details.
For classification, a d′-dimensional feature vector (where d′ ≪ d) according to
m̃ = (e_1, . . . , e_{d′})^T (m − m̄) is used. Note that this approach is not restricted to facial
image recognition. It can also be used with 3D facial data from depth sensors, or indeed
any other type of data.
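The following sketch demonstrates the computational trick of Equation (2.161) with random data standing in for face images; the sizes and names are our own, and the final check verifies that the mapped vector is indeed an eigenvector of the large scatter matrix.

import numpy as np

# N images with d = 32*32 pixels each, N << d
rng = np.random.default_rng(2)
N, d = 50, 32 * 32
faces = rng.random((N, d))

mean_face = faces.mean(axis=0)
M = (faces - mean_face).T                    # data matrix, shape (d, N)
K = M.T @ M                                  # small N x N matrix, Equation (2.160)
eta_val, eta = np.linalg.eigh(K)             # eigenpairs of K (ascending order)

# map the eigenvectors of K to eigenvectors of S = M M^T, Equation (2.161)
E = M @ eta[:, ::-1]                         # largest eigenvalues first
E /= np.linalg.norm(E, axis=0)

S = M @ M.T
print(np.allclose(S @ E[:, 0], eta_val[-1] * E[:, 0]))   # True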
This model, however, can only represent planes that are somewhat similar to the
models chosen as the basis. Of course, one can simply provide a larger selection of
different planes, but then the size of the feature vector will increase, which typically
has adverse effects on classification performance. A better model should capture the
principal modes of change in an airplane: the size, the position of the wings, the ori-
entation of the wings, etc. Such a model can be derived using PCA. The base models
are collected into a data matrix, and PCA is used to extract the mean model ḡ and the eigenvectors e_i. A new model is then represented by
g_new = ḡ + m_1 e_1 + m_2 e_2 + m_3 e_3 + ⋯ .  (2.164)
As it turns out, the first eigenvector e1 mainly accounts for the size of the airplane
body, the second and third eigenvectors e2 and e3 mainly encode the position and
orientation of the wings, and the other eigenvectors encode minor details. PCA has
found the most relevant modes of change purely from the data provided, without any
guidance from a human expert. With this model, most of the inherent variation in
the airplane models can be expressed using only the first three or four eigenvectors.
The resulting feature descriptor m = (m1 ,m2 ,m3 ,m4 )T is very compact, yet sufficient
for the task of describing different types of airplanes. More information about this
approach can be found in Laubenheimer [2004].
Standard principal component analysis aims to find the best orthogonal transforma-
tion of the feature space such that the projection onto the first d′ dimensions (or prin-
cipal components) is the best approximation among all other orthogonal transforma-
tions.
Fig. 2.36. Kernelized PCA: the feature space M is first mapped by ϕ into the intermediate space F; standard PCA applied in F then yields the reduced feature space M′.
But sometimes it might yield better results if PCA is not applied to the original
feature space, but to an intermediate higher dimensional space. In other words, there is
a nonlinear function ϕ : M → F that maps from the feature space M with dimension d
to a new Hilbert space F with higher—possibly infinite—dimension. Then the principal
component analysis is applied to this intermediate space in order to obtain the feature
space M′ with reduced dimension d′. This idea is depicted in Figure 2.36. Only if ϕ is
chosen to be truly nonlinear does this approach provide any benefits in comparison
with standard principal component analysis. Otherwise, the results do not differ.
The reason behind the name kernelized PCA, and why it is covered by a whole
section on its own, is that there is a clever calculation trick. This trick is referred to as
the kernel trick and will be revisited and explained in greater detail in Section 7.7.
The naïve approach to realize Figure 2.36 would be to explicitly choose F and ϕ
and perform a principal component analysis in the high-dimensional space F. In this
case, one needs to compute the inner products of vectors explicitly mapped by ϕ with
possibly prohibitive computational costs. As the final goal is to ease the complexity
by dimension reduction, this seems like a step in the wrong direction. The trick is to
rewrite all the formulas so that the map ϕ only occurs in pairs within the inner product
of F. This means that terms like ϕ(m_i)^T ϕ(m_j) (2.165) are the only places where ϕ appears. Then all these terms can be replaced by a so-called kernel function
k: M × M → ℝ (2.166)
that absorbs two mappings and the inner product into one simply evaluable function
so that ϕ never needs to be calculated explicitly.
This being said, the upcoming course of action in this section is already clear. One
starts with the “regular” PCA on F and tries to rewrite all formulas so that all ϕ vanish
and k remains.
Standard PCA centers the data first, i.e., in the first step mk − m is calculated.
Without explicit knowledge about ϕ it is neither possible nor computationally feasi-
ble to calculate ϕ(mk ) − ϕ(m) in the same way. Hence, one assumes that ϕ already
generates zero-mean data ϕ(m1 ), . . . , ϕ(mN ) with ∑Nk=1 ϕ(mk ) = 0. Of course, this
assumption is rather far-fetched, and at the end this condition will be dropped again,
but provisionally this is assumed to be true.
The following derivation mainly follows that of Schölkopf et al., which can be
found in Schölkopf et al. [1997].
Let m1 , . . . , mN ∈ M denote the original features. Then ϕ(m1 ), . . . , ϕ(mN ) ∈ F
are the non-linearly transformed, high-dimensional features. Furthermore, let
D := (ϕ(m_1), . . . , ϕ(m_N))  (2.167)
denote the data matrix, so that
C = (1/N) ∑_{k=1}^{N} ϕ(m_k) ϕ(m_k)^T = (1/N) D D^T  (2.168)
is the scatter matrix (see Equation (2.126)). Two aspects are important to note: first, the scatter matrix is additionally normalized by the factor 1/N; second, these terms assume that (1/N) ∑_{k=1}^{N} ϕ(m_k) = 0.
Following the usual procedure, the eigenvalue equation
λv = Cv for v ∈ F, λ ∈ ℝ (2.169)
must be solved next. As always, C is diagonalizable and all eigenvalues are non-
negative λ ≥ 0, because of the special form C = DDT . As Equation (2.169) will not be
explicitly solved, the definition of C is inserted into Equation (2.169),
λ v = C v = ( (1/N) ∑_{k=1}^{N} ϕ(m_k) ϕ(m_k)^T ) v = (1/N) ∑_{k=1}^{N} (ϕ(m_k)^T v) ϕ(m_k),  (2.170)
where each ϕ(m_k)^T v is a real scalar. Hence, for λ ≠ 0,
v = ∑_{k=1}^{N} α_k ϕ(m_k) = D α   for some α ∈ ℝ^N,  (2.171)
i.e., every eigenvector with nonzero eigenvalue lies in a subspace that has dimension N at most. There are at most N eigenvectors with
a nonzero eigenvalue and these eigenvectors are in the span {ϕ(m1 ), . . . , ϕ(mN )};
all other (possibly infinitely many) eigenvalues are zero and their eigenvectors are or-
thogonal to that subspace. Intuitively, this is not a surprise. If there are only N feature
vectors (data points), these points can span at most an (N − 1)-dimensional subspace.
(Two points are always on one line, three points are always on one plane, and so on.)
This does not change, if the points are mapped into a space with higher dimension
first. As one is only interested in principal components with nonzero variance (no
other components bear any information at all), one can presume λ > 0 from now on.
The eigenvector Equation (2.170) corresponds to a system of linear equations with
possibly infinitely many rows. Because we know that all interesting solutions with
λ > 0 are in the span {ϕ(m1 ), . . . , ϕ(mN )}, it suffices to consider the projection onto
this space. This means one can multiply Equation (2.170) from the left by DT (see Equa-
tion (2.167)) without losing any interesting solution.
In conclusion, Equation (2.170) together with Equation (2.171) and a left multiplication by D^T lead to an eigenvalue problem for the so-called kernel matrix
K := D^T D ∈ ℝ^{N×N}.  (2.173)
Now compare the definition of the kernel matrix K (Equation (2.173)) with the definition of the scatter matrix C (Equation (2.168)) and note that this is the same trick that has already been used to reduce the complexity of the eigenface problem (Equation (2.160)). Furthermore, one can see that
K_{ij} = ϕ(m_i)^T ϕ(m_j)  (2.175)
holds. This means each matrix entry K_{ij} is the inner product of the corresponding
feature vectors in the high-dimensional space. For later use, we introduce the kernel
function
k : M × M → ℝ,   (m_i, m_j) ↦ ϕ(m_i)^T ϕ(m_j),  (2.176)
and set
K_{ij} = k(m_i, m_j).  (2.177)
Again, because K is symmetric and positive-definite, and because only nonzero
solutions are of interest, one factor K can be canceled out in Equation (2.174). There
remains
λα = Kα (2.178)
The eigenvectors α of Equation (2.178) are organized into a matrix Ã, which corresponds to the projection matrix A on F via

A = DÃ.   (2.182)
Again, with the usual PCA this would require computing m′ = A^T ϕ(m). This can be rewritten as

m′ = A^T ϕ(m) = (DÃ)^T ϕ(m) = Ã^T (D^T ϕ(m)) = Ã^T (ϕ(m_1)^T ϕ(m), . . . , ϕ(m_N)^T ϕ(m))^T = Ã^T (k(m_1, m), . . . , k(m_N, m))^T.   (2.183)
The last step finishes the derivation of the kernelized PCA. In summary, Equation (2.178) must be solved with K defined as in Equation (2.177) under the normalization condition from Equation (2.179). The eigenvectors found must be organized
into a projection matrix Ã, and Equation (2.183) returns the principal components m′ for any feature vector m. All steps require at most evaluations of the kernel function k; the transformation ϕ is never needed explicitly.
Attention: Although k : M × M → ℝ seems to come out of the blue, it is still
assumed that it corresponds to some inner product on some unknown vector space F
of unknown dimension and that there is a (nonlinear) mapping ϕ from M into F such
that k(⋅, ⋅) = ⟨ϕ(⋅), ϕ(⋅)⟩. Moreover, it is assumed that ∑Nk=1 ϕ(mk ) = 0. This means
that the transformed dataset has zero mean.
Two last questions still need to be answered:
– Which functions are allowed for the kernel k : M × M → ℝ without a map ϕ being
explicitly given?
– How can the condition ∑Nk=1 ϕ(mk ) = 0 be relaxed if ϕ is not given?
The first question is answered by Mercer's theorem: if the kernel k is symmetric and positive semidefinite (i.e., every kernel matrix built from k is symmetric and positive semidefinite), then there is a vector space V with an inner product ⟨⋅, ⋅⟩ and a mapping φ : M → V such that

k(⋅, ⋅) = ⟨φ(⋅), φ(⋅)⟩.   (2.186)
We now tackle the problem of data with nonzero mean. Recall that the data matrix is
defined as
D = (ϕ(m1 ), . . . , ϕ(mN )) (2.187)
and the kernel matrix as
K = DT D ∈ ℝN×N . (2.188)
Let further

D̃ = (ϕ(m_1) − ϕ̄, . . . , ϕ(m_N) − ϕ̄)   (2.189)

with ϕ̄ = (1/N) ∑_{k=1}^N ϕ(m_k) be the centered data matrix and

K̃ = D̃^T D̃ ∈ ℝ^{N×N}   (2.190)

the corresponding centered kernel matrix. With U ∈ ℝ^{N×N} denoting the matrix whose entries are all equal to 1, the centered data matrix can be written as

D̃ = D − (1/N) D U   (2.191)
and putting this into the definition of the centered kernel matrix yields
K̃ = (D − (1/N) D U)^T (D − (1/N) D U)
  = D^T D − (1/N) D^T D U − (1/N) U D^T D + (1/N²) U D^T D U
  = K − (1/N) K U − (1/N) U K + (1/N²) U K U.   (2.192)
The eigenvector equation (2.178) needs to be solved with K̃ instead of K, but as one can see, no explicit evaluation of ϕ is required.
Moreover, the projection must be redefined. As in Equation (2.183), let m be the vector under consideration and à be the projection matrix. Write u ∈ ℝ^N for the vector of ones. A similar calculation as above leads to

m′ = Ã^T ((k(m_1, m), . . . , k(m_N, m))^T − (1/N) K u)   (2.193)

as the new projection formula.
To finish this section, the following list gives a ready-to-use sequence of instructions for the kernelized PCA, as the section before did for the usual PCA. Let m_1, . . . , m_N ∈ M = ℝ^d be a training set and m ∈ M an additional feature vector. k : M × M → ℝ denotes a permissible kernel function. Fix some d′ < min{d, N}.
1. Calculate the matrices

K with K_ij = k(m_i, m_j)   (2.194)

and

K̃ = K − (1/N) K U − (1/N) U K + (1/N²) U K U   with U ∈ ℝ^{N×N} the matrix whose entries are all equal to 1.   (2.195)

The remaining steps (solving the eigenvalue problem (2.178) with K̃, organizing the normalized eigenvectors into the projection matrix Ã, and applying the projection formula (2.193)) yield the projected feature vector m′ of smaller dimension d′.

Fig. 2.37. Kernelized PCA with radial kernel function k(m, s) = exp(−(1/2) ‖m − s‖²).
Figure 2.37 gives an example of a kernelized PCA. In this example, the final dimension d′ is not chosen to be smaller but is equal to the original dimension, d′ = d = 2.
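To make the listed steps concrete, the following Python/NumPy sketch applies them with a radial kernel to a toy dataset. All function and variable names (rbf_kernel, kernel_pca, the choice gamma = 0.5, and the scaling of the eigenvectors by 1/√λ, one common way to satisfy the normalization condition (2.179)) are illustrative assumptions, not part of the text.

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # radial kernel k(a, b) = exp(-gamma * ||a - b||^2)
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def kernel_pca(M, m_new, d_prime=2, gamma=0.5):
        """Project m_new onto the first d_prime kernel principal components
        of the training set M (one sample per row)."""
        N = M.shape[0]
        # Kernel matrix K (Eq. (2.177)) and centered version K~ (Eq. (2.195))
        K = np.array([[rbf_kernel(M[i], M[j], gamma) for j in range(N)] for i in range(N)])
        U = np.ones((N, N))
        K_tilde = K - K @ U / N - U @ K / N + U @ K @ U / N**2
        # Eigenvectors of K~ (Eq. (2.178)), largest d_prime eigenvalues first
        eigval, eigvec = np.linalg.eigh(K_tilde)
        order = np.argsort(eigval)[::-1][:d_prime]
        lam, A_tilde = eigval[order], eigvec[:, order]
        # Assumed normalization: scale each eigenvector by 1/sqrt(lambda)
        A_tilde = A_tilde / np.sqrt(np.maximum(lam, 1e-12))
        # Projection of the new sample (Eq. (2.193))
        kappa = np.array([rbf_kernel(M[i], m_new, gamma) for i in range(N)])
        return A_tilde.T @ (kappa - K @ np.ones(N) / N)

    rng = np.random.default_rng(0)
    M = rng.normal(size=(20, 2))                 # toy training set
    print(kernel_pca(M, np.array([0.1, -0.2])))  # two kernel principal components

Note that only kernel evaluations appear in the sketch; the mapping ϕ is never used explicitly, exactly as argued above.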
In particular, uncorrelatedness of the components,

Cov{m′_i, m′_j} = 0   for i ≠ j,   (2.200)

follows as well. Thus, one could say that the independent component analysis goes one step further than the PCA. Actually, ICA can be seen as a two-step process. First, the components are decorrelated. Second, the components are orthogonally transformed so that they become independent (see Figure 2.38).
Before we dive more deeply into ICA, we will review some fundamentals from
probability theory.
Fig. 2.38. ICA as a two-step process: decorrelation ("whitening"), e.g., by PCA, followed by a transformation that makes the components independent.
The random variables a, b are called uncorrelated iff either of the following two equivalent conditions is met: Cov{a, b} = 0 or, equivalently, E{ab} = E{a} E{b}.
Definition 2.11 (Independence). Let p(a) and p(b) denote the marginal densities of the random variables a and b respectively. Moreover, let p(a | b) and p(b | a) be their conditional densities. a and b are called independent if any one (and therefore all) of the following equivalent conditions holds: p(a, b) = p(a) p(b), p(a | b) = p(a), or p(b | a) = p(b).
m′ = Y Z (m − E{m}) = Y m̃,   (2.207)

where m̃ := Z (m − E{m}) denotes the decorrelated feature vector.
(a) Original feature space  (b) Intermediate space of decorrelated features (after whitening)  (c) Transformed space of independent features
The decorrelation ("whitening") matrix Z is obtained from the eigendecomposition of the covariance matrix,

Z = √(Λ^{-1}) E^T = diag(1/√κ_1, . . . , 1/√κ_d) E^T.   (2.208)
See Equations (2.137) and (2.138) for the definition of Λ and E. The scaling is neces-
sary, because otherwise the covariance matrix of the transformed feature would equal
Λ (see Equations (2.143) to (2.145)) but we want it to be the identity. In practice, if
only the set of features m1 , . . . , mN is known, one uses the PCA to obtain an unbiased
estimator
Ẑ = diag(√((N − 1)/λ_1), . . . , √((N − 1)/λ_d)) A^T.   (2.209)
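As an illustration of the decorrelation step, the following NumPy sketch estimates Ẑ from a sample as in Equation (2.209) and checks that the whitened features have (approximately) unit covariance; the function name whiten and the toy data are made up for this example.

    import numpy as np

    def whiten(M):
        """Decorrelate ("whiten") the rows of M (samples) as in Eq. (2.209)."""
        N = M.shape[0]
        mean = M.mean(axis=0)
        S = (M - mean).T @ (M - mean)          # scatter matrix
        lam, A = np.linalg.eigh(S)             # eigenvalues/eigenvectors (PCA)
        Z_hat = np.diag(np.sqrt((N - 1) / lam)) @ A.T
        M_tilde = (M - mean) @ Z_hat.T         # whitened samples
        return M_tilde, Z_hat

    rng = np.random.default_rng(1)
    M = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
    M_tilde, _ = whiten(M)
    print(np.cov(M_tilde, rowvar=False))       # approximately the identity matrix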
Unfortunately, this goal is not always achievable, nor is such a matrix unique. The
idea is to find an objective function that measures the “magnitude of independence”
with respect to Y. Actually, there are two major approaches to defining something like a "magnitude of independence;" they lead to different algorithms for ICA. The first
approach leads to the non-Gaussian family of ICA algorithms, which, as the name sug-
gests, is inspired by the central limit theorem. This approach is not part of the subject
matter of this textbook. The second approach is inspired by Shannon’s information
theory and uses the concept of mutual information to measure the independence of
random variables. This textbook follows this second approach.
Definition 2.13 (Differential entropy). Let a and b be two absolutely continuous ran-
dom variables with density p(a,b). Furthermore, let p(a) and p(b) denote their
marginal densities and p(a | b) and p(b | a) denote their conditional densities.
1. The differential entropy of the random variable a (respectively b) is defined as h(a) := −∫ p(a) log p(a) da (and analogously for b).
Differential entropy is the generalization of the entropy H(⋅) to the case of continuous
random variables. The original concept of entropy is only defined for the discrete case.
Therefore, sometimes the term continuous entropy is also used. Also, the generalization is syntactically straightforward: h(⋅) uses densities and integrals where H(⋅) uses discrete probabilities and sums (but the result must be handled with care). The differential entropy is not always positive and is not invariant under continuous coordinate transformations. Both properties would be expected of a "real" entropy, i.e., a measure of information or uncertainty, and both are fulfilled by the original, discrete entropy. Nevertheless, for the purposes of this book, these problems can be ignored.
In addition, let p(a)p(b) denote the product density of the marginal densities. Then the mutual information of a and b can be expressed as D(p(a, b) ‖ p(a) p(b)). This shows that the mutual information between two random variables is the Kullback–Leibler divergence (see Section 2.4.5) between the joint distribution and the product of the marginal distributions. This divergence becomes zero if and only if the random variables are independent.
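For discrete (or discretized) random variables, this quantity can be computed directly. The following sketch, with a hypothetical helper mutual_information, evaluates D(p(a, b) ‖ p(a) p(b)) from a joint probability table; it is meant only to illustrate the definition, not any particular ICA algorithm.

    import numpy as np

    def mutual_information(joint):
        """I(a; b) = D_KL(p(a,b) || p(a) p(b)) for a discrete joint distribution."""
        p_ab = joint / joint.sum()
        p_a = p_ab.sum(axis=1, keepdims=True)        # marginal of a (column vector)
        p_b = p_ab.sum(axis=0, keepdims=True)        # marginal of b (row vector)
        mask = p_ab > 0
        return np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask]))

    # Independent variables: mutual information is (numerically) zero
    print(mutual_information(np.outer([0.3, 0.7], [0.5, 0.5])))
    # Strongly dependent variables: mutual information is clearly positive
    print(mutual_information(np.array([[0.45, 0.05], [0.05, 0.45]])))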
This yields the objective function for the second step of the ICA. Let m̃ = (m̃_1, . . . , m̃_d)^T denote the random vector after decorrelation as in Equation (2.207). Then

J(Y) = D(p(m′) ‖ p(m′_1) ⋅ ⋅ ⋅ p(m′_d))   with m′ = Y m̃ and Y^T Y = I   (2.217)

is to be minimized; the minimizer Y* is the transformation matrix sought. In practice, this optimization problem can only be solved numerically, and J(Y*) = 0 cannot be guaranteed. Still, the resulting transform will make the features approximately independent.
(a) First component suffices to separate the classes.  (b) Both components are necessary to separate the classes.
Fig. 2.40. The case for multiple discriminant analysis: PCA does not take class information into
account. In particular, it does not aim for optimal class separability.
In the situation of Figure 2.40b, reducing the features to the first principal component discards the information needed to separate the classes, i.e., the PCA projection is suboptimal with respect to the actual problem. In contrast, in Figure 2.40a, a reduction to the first component still suffices to distinguish between the classes.
This shortcoming is tackled by multiple discriminant analysis (MDA). As the name
suggests, MDA considers different classes and aims for an optimal separation right
from the beginning. If the problem has c classes, then MDA finds the best projection
onto a (c − 1)-dimensional subspace.
In the two-class case, D = D1 ∪ D2 will denote the partition of the dataset. Additionally, |D1| = N1 and |D2| = N2 denote the cardinalities of these sets. The goal is to find a vector w ∈ ℝ^d such that

m′ = w^T m   (2.220)
yields a projection that optimally separates both classes. Figure 2.41 illustrates the
situation.
Fig. 2.41. Quantities in the two-class case of multiple discriminant analysis.
A good choice of w is one that, on the one hand, pushes the projected mean points of the two classes far apart and, on the other hand, keeps the spread (standard deviation) of each projected class small. This means one should optimize the ratio between these two quantities. To that end, one defines the mean of the projected classes by

m′_i = (1/N_i) ∑_{m′ ∈ D′_i} m′ = (1/N_i) ∑_{m ∈ D_i} w^T m = w^T ((1/N_i) ∑_{m ∈ D_i} m) = w^T m_i   for i = 1, 2   (2.221)
and the squared standard deviation of the projected classes by

s′_i² = (1/N_i) ∑_{m′ ∈ D′_i} (m′ − m′_i)² = (1/N_i) ∑_{m ∈ D_i} (w^T m − w^T m_i)²
     = w^T ((1/N_i) ∑_{m ∈ D_i} (m − m_i)(m − m_i)^T) w   for i = 1, 2.   (2.222)
The objective function is then the ratio

J(w) = (m′_1 − m′_2)² / (s′_1² + s′_2²)   (2.223)

and needs to be maximized. This form of the objective function is called the Fisher linear discriminant. This functional is not in an optimal form to solve the maximization problem, because the dependence on w is not explicit, nor is this form suited to be generalized to higher dimensions. So the next step is to rewrite Equation (2.223) in terms of quadratic forms in w.
A close look at Equation (2.222) reveals that the middle factor is again the scatter
matrix (here normalized by N i ):
S_i := (1/N_i) ∑_{m ∈ D_i} (m − m_i)(m − m_i)^T   for i = 1, 2   (2.224)

and their sum, the intra-class scatter matrix,

S_W := S_1 + S_2.   (2.225)

With these definitions, the denominator of Equation (2.223) becomes s′_1² + s′_2² = w^T S_W w. Analogously, the numerator can be written as (m′_1 − m′_2)² = w^T S_B w with the inter-class scatter matrix S_B := (m_1 − m_2)(m_1 − m_2)^T. The suffix B in S_B stands for "between." Then the objective function takes the form
J(w) = (w^T S_B w) / (w^T S_W w).   (2.229)
This form is called the Rayleigh coefficient or Rayleigh quotient. The aim is to max-
imize the ratio of the deviation between the classes compared to the deviation within
the classes.
The Rayleigh quotient is invariant under scaling w, hence it suffices to maximize
the numerator wT SB w for all w such that the denominator wT SW w equals 1,
max_{w ∈ ℝ^d} (w^T S_B w) / (w^T S_W w) = max_{w ∈ ℝ^d, w^T S_W w = 1} w^T S_B w.   (2.230)
This constrained maximization problem leads, via a Lagrangian approach, to the generalized eigenvalue problem S_B w = λ S_W w. Under the assumption that S_W is invertible, one obtains the standard eigenvalue problem

S_W^{-1} S_B w = λ w.   (2.233)
Luckily, this equation does not need to be solved directly. From the definition of
SB one can see that
S_B w = (m_1 − m_2) (m_1 − m_2)^T w   (2.234)

always has the same direction as (m_1 − m_2), because (m_1 − m_2)^T w is a scalar. Therefore, Equation (2.233) can be simplified by setting S_B w = λ′ (m_1 − m_2) for some unknown scalar λ′,

S_W^{-1} λ′ (m_1 − m_2) = λ w   ⇔   w = λ^{-1} λ′ S_W^{-1} (m_1 − m_2).   (2.235)
As only the direction of w matters, the unknown scalars can be dropped and w can be normalized with respect to S_W,

w ← w / √(w^T S_W w) = S_W^{-1} (m_1 − m_2) / √((m_1 − m_2)^T S_W^{-T} S_W S_W^{-1} (m_1 − m_2))
  = S_W^{-1} (m_1 − m_2) / √((m_1 − m_2)^T S_W^{-1} (m_1 − m_2)).   (2.236)
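The two-class solution can be computed in a few lines; the following NumPy sketch follows Equations (2.224), (2.225), and (2.236) on made-up data (the function name fisher_direction and all parameter values are illustrative).

    import numpy as np

    def fisher_direction(D1, D2):
        """Two-class Fisher linear discriminant direction, Eq. (2.236). Rows are samples."""
        m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
        S1 = (D1 - m1).T @ (D1 - m1) / len(D1)    # per-class scatter, Eq. (2.224)
        S2 = (D2 - m2).T @ (D2 - m2) / len(D2)
        SW = S1 + S2                              # intra-class scatter, Eq. (2.225)
        w = np.linalg.solve(SW, m1 - m2)          # direction S_W^{-1} (m1 - m2)
        return w / np.sqrt(w @ SW @ w)            # normalization as in Eq. (2.236)

    rng = np.random.default_rng(2)
    D1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
    D2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(100, 2))
    w = fisher_direction(D1, D2)
    print(w, (D1 @ w).mean(), (D2 @ w).mean())    # projected class means are well separated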
In the general case of c classes, the projection is carried out by a matrix W ∈ ℝ^{d×(c−1)} instead of a single vector,

m′ = W^T m ∈ ℝ^{c−1}.   (2.238)

As before, let

m_i = (1/N_i) ∑_{m ∈ D_i} m   (2.239)

denote the class means and

m̄ = (1/N) ∑_{m ∈ D} m = (1/N) ∑_{i=1}^c N_i m_i   (2.240)

the overall mean. The scatter matrices for each class are

S_i = (1/N_i) ∑_{m ∈ D_i} (m − m_i)(m − m_i)^T,   (2.241)

the intra-class scatter matrix is

S_W = ∑_{i=1}^c S_i,   (2.242)

and the inter-class scatter matrix is

S_B = ∑_{i=1}^c N_i (m_i − m̄)(m_i − m̄)^T.   (2.243)

Note that for c = 2, this definition differs from the previous definition.
The corresponding definitions can be set up for the projected feature m′ = W^T m:

m′_i = (1/N_i) ∑_{m′ ∈ D′_i} m′   (class means)   (2.244)
m̄′ = (1/N) ∑_{m′ ∈ D′} m′   (overall mean)   (2.245)
S′_i = (1/N_i) ∑_{m′ ∈ D′_i} (m′ − m′_i)(m′ − m′_i)^T   (scatter matrices of each class)   (2.246)
S′_W = ∑_{i=1}^c S′_i   (intra-class scattering)   (2.247)
S′_B = ∑_{i=1}^c N_i (m′_i − m̄′)(m′_i − m̄′)^T   (inter-class scattering)   (2.248)

As in the two-class case, one finds the relations

S′_W = W^T S_W W   (2.249)
S′_B = W^T S_B W   (2.250)

between the scattering of the original features and the projected features (see Equations (2.226) and (2.228)). This leads again to the Rayleigh quotient
J(W) = |S′_B| / |S′_W| = |W^T S_B W| / |W^T S_W W|   (2.251)
as the objective function. Here, |M| denotes the determinant of the matrix M. The columns w_1, . . . , w_{c−1} of the matrix W that maximizes J(W) are the eigenvectors of the generalized eigenproblem S_B w_i = λ_i S_W w_i that belong to the (c − 1) greatest eigenvalues λ_1, . . . , λ_{c−1}. The task is to find the roots of the characteristic polynomial det(S_B − λ S_W) = 0.
Example: Fisherfaces
In Section 2.7.1 it was shown how PCA can be used to represent the images of faces in
an approach called eigenfaces. Similarly, MDA can be used to extract Fisherfaces from
a dataset of images: the images are collected into vectors mi , from which the MDA
matrix W is computed according to Equation (2.254). The columns wi , the Fisherfaces,
of the matrix W can then be reorganized into images and inspected.
As an example, Fisherfaces were extracted from the extended YALE face dataset B
(which was used to extract the eigenfaces, too, see Section 2.7.1). Images of the same
subject were grouped into the same class, yielding 39 classes, and hence 38 Fisherfaces,
in all. The corresponding images of ten Fisherfaces are shown in Figure 2.42. Unlike
the eigenfaces in Figure 2.33, humans have trouble interpreting the meaning of these
images. If one concentrates enough, it is possible to see outlines of the eyes, the nose,
and the mouth. One can also see the outline of the chin in the fifth picture from the
left of the upper row, but it is difficult to imagine how Fisherfaces could be useful in
determining the identity of a person. Yet, when the feature vector m = WT m of a given
unknown image m is used in classification, Fisherfaces prove to be quite effective.
In an experiment, a linear soft margin support vector machine (see Section 7.7)
was trained to recognize the 39 identities in the extended YALE dataset using both
eigenfaces and Fisherfaces as representations. With eigenfaces, 19 % accuracy was
achieved with 76 components, 42 % accuracy with 127 components, and 65 % accuracy
with 200 components. With Fisherfaces, on the other hand, the classification was 75 %
accurate with only 38 components. This experiment shows that MDA is much more
efficient at encoding discriminative information than PCA.
Fig. 2.42. First ten Fisher faces computed from the YALE faces dataset of Georghiades et al. [2001].
Unlike with the eigenfaces in Figure 2.33, there is no directly human-interpretable structure in the
images.
A common disadvantage of all the previous methods to reduce the dimension of the
feature space is that the result is an opaque transformation matrix. The conversion
leads to a new feature vector whose components are nebulous combinations of the
former feature components. Hence, a (potentially existing) descriptive meaning gets
lost. Moreover, the previous methods only work for features on at least an interval
scale.
Feature selection means choosing a subset of features from a wider set of features that are considered reasonable for the problem at hand. In terms of the previous methods, feature selection projects onto subspaces that are aligned with the original axes, i.e., components are just left out or kept, but neither combined nor rotated. For this reason, this method also works for features on a lower scale.
Instead of an objective function that only depends on the data, the performance
of a selection of features is directly evaluated with respect to a previously chosen
classifier. For each selected set of features, the classifier is tuned on the training set, the
classifier is applied to the test set, and the estimated class assignments are compared
with the real classes of the data. As opposed to the other methods, the outcome of
the dimension reduction thus depends on the established classifier. The workflow is
depicted in Figure 2.43.
More formally, let D ∈ M denote a training set with d = dim M and D =
{m1 , . . . , mN }. Let I = {1, . . . , d} be the index set of all dimensions and I ∈ P(I) a se-
lection of indices of the dimensions. For a feature vector m, let m|I denote the feature
vector restricted to the selected components. Similarly, D|I denotes the restricted set.
Fig. 2.43. Workflow of feature selection: for each candidate selection I ∈ P(I), the features D|I are extracted, a classifier is trained and applied, and the performance evaluation drives the selection of the next subset.
The task is to find the I* ∈ P(I) such that the classifier has the best performance on D|I* among all I ∈ P(I). To test every subset of I, 2^d runs are necessary. Hence, this is only possible if d is small, because for each subset I the classifier needs to be trained anew, tested, and evaluated on D|I. For reasonable values of d this is already prohibitive. If the desired dimension d′ < d is already given in advance, the number of subsets is still the binomial coefficient (d over d′). Thus a brute force approach is normally impossible.
A suboptimal, but feasible approach is a greedy technique. First, the single-element set with the best feature component is selected. This requires d runs. The best component is kept. Then the component out of the d − 1 remaining ones is chosen that shows the best performance in conjunction with the already chosen first one. This procedure is repeated until the desired number of components is chosen or until joining a new component does not improve the performance. This way only d + (d − 1) + ⋅ ⋅ ⋅ + (d − d′ + 1) subsets need to be evaluated.
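A sketch of this greedy wrapper in Python is given below; it assumes a hypothetical evaluate(selection) function that trains and tests the chosen classifier on D restricted to the selected components and returns a score.

    def greedy_selection(all_indices, evaluate, d_target):
        """Greedy forward selection of at most d_target feature indices.
        evaluate(selection) is assumed to return a score (higher is better)."""
        selected, best_score = [], float("-inf")
        while len(selected) < d_target:
            candidates = [i for i in all_indices if i not in selected]
            scored = [(evaluate(selected + [i]), i) for i in candidates]
            score, best_i = max(scored)
            if score <= best_score:      # no improvement: stop early
                break
            selected.append(best_i)
            best_score = score
        return selected

    # Toy usage: pretend features 2 and 5 carry all the information
    toy_scores = {frozenset([2]): 0.7, frozenset([2, 5]): 0.95}
    evaluate = lambda sel: toy_scores.get(frozenset(sel), 0.5)
    print(greedy_selection(range(8), evaluate, d_target=3))   # -> [2, 5]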
This approach is called a wrapper approach, because it wraps the feature selection
around a classifier. Besides wrappers, there are embedded approaches, where a classi-
fier implicitly performs the feature selection, and filter approaches, that do not depend
on a classifier at all. However, these methods are outside the scope of this book.
Especially when dealing with images or video, one often has the problem that the
feature extraction step assumes that all the patterns are of the same size. For example,
the eigenfaces and Fisherfaces approaches in Sections 2.7.1 and 2.7.4 assume that
the facial images are of the same size. One possible solution, the one pursued in these
examples, is to crop and rescale the images so that they fulfill this constraint. However,
doing so will inevitably remove information that is then unavailable for classification.
An alternative solution is given by the bag of visual words approach. Here, several
low level features are extracted from different parts of the pattern and then combined
into one higher level descriptor that characterizes the whole pattern. By construction,
this descriptor always has the same dimensionality, irrespective of the size of the un-
derlying pattern or the number of extracted low level features.
The approach has its roots in the bag of words model from natural language pro-
cessing. Without going into too much detail, this model can be described as follows.
Document generation is modeled as the repeated and independent drawing of words
w k that follow a probability distribution P(w). The overall probability of generating
the document τ = (w1 ,w2 , . . . ,w K ), that is, the sequence of K words w k , is thus given
by
P(τ) = ∏_{k=1}^K P(w_k).   (2.255)
In a classification setting, e.g., when the goal is to classify e-mails into “ham”
=̂ ω1 or “spam” =̂ ω2 , characteristic word distributions P(w | ω1 ) for ham and P(w | ω2 )
for spam can be estimated from a collection of documents by counting the words
that occur in them. An unseen e-mail p = (w1 , . . . ,wK ) with K words can then be
classified by assigning the class that maximizes the likelihood
L(ω) = ∏_{k=1}^K P(w_k | ω)   for ω ∈ {ω_1, ω_2}.   (2.256)
Alternatively, one can estimate the word distributions for every e-mail in the train-
ing set and train a classifier using these distributions as feature vectors. In other words,
the P(w) estimated from the document p (the pattern) acts as the feature vector m, i.e.,
m = (P(w1 ), . . . , P(w k ))T .
Two things are important here. First, the order of the words in the document does
not matter, nor does their surrounding text. Second, this method works irrespectively
of the length of the documents. Both of these are caused by the underlying idea of
treating a document as an unordered collection—a bag—of words.
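A minimal sketch of the likelihood rule in Equation (2.256): the word distributions P(w | ω) are assumed to have been estimated beforehand; here they are simply made up for illustration.

    import math

    # Hypothetical word distributions P(w | ham) and P(w | spam)
    p_ham  = {"meeting": 0.4, "report": 0.4, "offer": 0.1, "winner": 0.1}
    p_spam = {"meeting": 0.1, "report": 0.1, "offer": 0.4, "winner": 0.4}

    def log_likelihood(words, p_w):
        # log L(omega) = sum_k log P(w_k | omega), cf. Equation (2.256)
        return sum(math.log(p_w[w]) for w in words)

    email = ["winner", "offer", "offer", "meeting"]
    scores = {"ham": log_likelihood(email, p_ham), "spam": log_likelihood(email, p_spam)}
    print(max(scores, key=scores.get))   # -> "spam"

As in the text, neither the order of the words nor the length of the document enters the decision.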
The same idea can be used to classify images of varying size. Here, it is assumed
that the images are composed of visual words from some (for now) nondescript visual
vocabulary. Similar to document classification, an image can then be characterized
by observing the words that occur in the image (see Figure 2.44). Note that in gen-
eral the original image cannot be reconstructed from the bag representation, because
the vocabulary might not contain all of the possible visual words that appear in the
image and because the position of the words in the image is not specified in the bag
representation.
Fig. 2.44. Illustration of the underlying idea of bag of visual words: An image can be thought of as
the composition of words from some visual vocabulary and can therefore be characterized by the
words that appear in it.
This approach is divided into two steps: learning the visual vocabulary from a
training set, i.e., defining the visual words, and extracting a higher level descriptor
from an image. Formally, let D = {pn | n = 1, . . . ,N} be a set of N patterns pn ∈ P.
These patterns should be representative for the patterns that are to be classified later
on, but information about their classes is not needed. Low level features mt (pn ) ∈
ℝd , t = 1, . . . ,T(pn ) are extracted from each of the N patterns pn . Note that the number
of extracted low level features T(pn ) depends on pn . In general, a different number of
low level features is extracted from each pattern, T(pn ) ≠ T(pm ) for n ≠ m, e.g., when
the features are extracted on key points as in the example below. The mt (pn ) are then
used to partition the ℝd into K non-overlapping tiles z k , k = 1, . . . ,K (see Figure 2.45),
i.e.,
⨄_{k=1}^K z_k = ℝ^d.   (2.257)
The z k form the visual vocabulary, that is, each z k corresponds to a visual word.
It is tempting to think of the low level features m_t(p_n) as the alphabet from which the visual words are constructed, but this is a false analogy: unlike characters, every low level feature appears in one and only one word z_k. A better analogy is to consider the m_t(p_n) ∈ z_k as different spellings ("colour" and "color") or perhaps synonyms ("color" and "hue") of the same word z_k.
Once the vocabulary is determined, a K-dimensional high level descriptor f can be extracted from an unseen pattern p′. Once again, low level features m_t(p′) ∈ ℝ^d with t = 1, . . . , T(p′) are extracted from p′. Note that the type of the features must be the same as the type that was used to determine the vocabulary. The high level descriptor f is built as a count statistic (i.e., a histogram) over the m_t(p′) with respect to the z_k,
f_k = (1/T(p′)) ∑_{t=1}^{T(p′)} δ[m_t(p′) ∈ z_k]   for k = 1, . . . , K,

where δ[⋅] denotes the generalized Kronecker symbol. In other words, the entry f_k is the fraction (i.e., the relative frequency) of low level feature descriptors that fall into the tile z_k. Figure 2.46 shows an example descriptor derived from T = 20 low level features using the vocabulary from Figure 2.45.

Fig. 2.45. A visual vocabulary: the two-dimensional feature space (axes m_1, m_2) is partitioned into K = 10 tiles z_1, . . . , z_10.
Fig. 2.46. Example of a bag of visual words descriptor constructed using the vocabulary in Figure 2.45. T(p′) = 20 low level features were extracted from the underlying pattern.
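Both steps (learning the vocabulary and extracting the descriptor) can be sketched as follows, assuming that the tiles z_k are defined by a simple k-means clustering of the training features, which is one common but not the only choice; all names and parameter values are illustrative.

    import numpy as np

    def kmeans(X, K, iters=20, seed=0):
        # Learn K cluster centers: each center defines one tile z_k of the vocabulary.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), K, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        return centers

    def bow_descriptor(features, centers):
        # f_k = fraction of the pattern's low level features falling into tile z_k
        labels = np.argmin(((features[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        return np.bincount(labels, minlength=len(centers)) / len(features)

    rng = np.random.default_rng(3)
    training_features = rng.normal(size=(400, 2))    # low level features from all patterns
    vocabulary = kmeans(training_features, K=10)
    new_pattern_features = rng.normal(size=(20, 2))  # T(p') = 20 features of a new pattern
    print(bow_descriptor(new_pattern_features, vocabulary))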
The bag of visual words approach has some important properties:
1. As in text processing, the size of f does not depend on the size of the underlying
pattern p.
2. Determining the vocabulary is unsupervised, i.e., information about the patterns’
classes ω(pn ), n = 1, . . . ,N is not needed to determine the vocabulary.
3. Invariance properties of the low level features m_t propagate to the high level descriptor f, e.g., if the m_t are invariant under rotation, scale or illumination, then so is f.
4. Spatial (and in the case of a video, temporal) relations between image patches
are discarded. On the one hand, this makes f robust against translation; on the
other hand, this may remove discriminative information and prevent localization
of objects in an image.
5. If there is a semantic interpretation of the low level features mt , the high level
descriptor f can also be interpretable.
6. The dimensionality of f is often much larger than the dimensionality of the low
level features mt , but the exact number of the z k usually does not have a significant
impact on the classification performance.
2 Maximum a posteriori classification according to Equation (3.23) under the assumption that the
features are statistically independent
(a) Primitive features: features that require little or no computation time, e.g., color channels, gradient magnitude, texture codes.  (b) Dense sampling: features are extracted from every foreground pixel.
Fig. 2.47. Modifications to use bag of words in bulk material sorting. Images from Richter et al.
[2016].
et al. [2004]). Nowadays, the method has been superseded by convolutional neural
networks (see Section 7.6), but the approach is still useful in other domains.
Fig. 2.48. Structure of the bag of words approach in Richter et al. [2016].
2.8 Exercises
(2.2) Assign the following features to their corresponding feature scale (nominal, or-
dinal, interval, ratio, or absolute): (School) grades, car brands, date of birth, area
of the canvas of a sail, number of cows in a herd, motor temperature in ∘ C, engine
speed/revolution, height of body, clothing size, optical magnification, account
balance, electrical voltage, place in a race, gender, variety of apple, display of a
Geiger counter, population density, annual income in EUR, intelligence quotient
(IQ).
(2.3) Compute the Kullback–Leibler divergence DKL between the following probability
distributions P1 and P2 :
(2.4) Suppose there are three states, referred to as 1, 2, and 3, and two random variables with the probability mass functions P1(X) and P2(X) such that P1(1) = P1(2) = P1(3) = 1/3 and P2(1) = 1/6. How must the remaining probabilities P2(2) := a and P2(3) := b be chosen so that the Kullback–Leibler divergence DKL(P1 ‖ P2) is minimized?
(2.5) The traffic police invented a new, innovative test to assess a road user’s ability
to drive a vehicle: Drivers are prompted to fire a gun at a target. The hit pattern is
compared to the following reference, obtained from drunk and sober drivers:
(Scatter plot: hit patterns of sober and drunk drivers in the x-y plane, roughly within [−0.5, 0.5] × [−0.5, 0.5].)
(2.6) A well-known car manufacturer manipulated the software that controls the am-
monium nitrate injection to the catalytic converter so that less ammonium nitrate
was used. The resulting engine fumes contain less smelly ammonia (NH3 ), but
more pollutant nitrogen oxides (NOX ). In a randomized test of engine fumes for
these two compounds, the following scatter plot resulted:
(Scatter plot: m_1 (% NH3) versus m_2 (% NOX), with clusters of "manipulated" and "not manipulated" cars at multiples of a.)
Construct a one-dimensional feature m′ that can be used to classify cars into "manipulated" (ω1) or "not manipulated" (ω2). Give a formula to compute m′.
(2.7) The following feature is derived from a Fourier contour descriptor, i.e., the coef-
ficients Z i ∈ ℂ, i ∈ ℤ of the Fourier series expansion of the contour:
m := Z_8/|Z_1| − Z_6/|Z_1| + Z_4/|Z_1| − Z_2/|Z_1| + Z_0/|Z_1|.
Is m invariant under scaling, translation, and rotation? Why or why not?
(2.8) The following feature is derived from a Fourier contour descriptor, i.e., the coef-
ficients Z i ∈ ℂ, i ∈ ℤ of the Fourier series expansion of the contour:
m := (α|Z_0| + |Z_2|) / √(|Z_3|² + |Z_4|^β + γ),   α, β, γ ∈ ℝ.
How must α, β, γ ∈ ℝ be chosen for m to be invariant under translation and scal-
ing?
(2.10) In an optical inspection system the following variables are computed for each
recorded object: center of mass c, perimeter of the contour P, area of the object
A, length of the main and secondary axes l1 and l2 , and the rotation angle φ of
the main axis w.r.t. the image coordinate system. The variables are shown in the
following sketch:
(Sketch: object with center of mass c, area A, perimeter P, main and secondary axis lengths l_1 and l_2, and rotation angle φ.)

m_1 = ‖c‖_2 / A
m_2 = l_1/l_2 + P²/A
m_3 = (l_2/U) φ
m_4 = A/l_1 − cos φ
m_5 = φ/‖c‖_2 + (l_1 − l_2)/A
3 Bayesian decision theory
Up to now, this book has dealt with the question of how to select, define, and extract
features from observed patterns of objects. In Figure 1.2, this is depicted as the first
step of a pattern recognition system (first blue box) after the preparatory steps. From
now on, our attention will be turned to the second step: how the features are used to
assign objects to classes. Eventually, this task will be solved by classifiers.
Firstly, the entirety of classifiers can be divided into methods that use a proba-
bilistic description of the problem and those which do not. The latter are considered in
Chapter 7. The class of probabilistic methods is rather extensive, hence its discussion
is divided into the Chapters 3 to 6. Under the assumption that the probabilistic descrip-
tion of the system is already given, the question of how the actual classification is to
be performed needs to be answered. This is the subject of this chapter. The problem
of how to obtain the probabilistic description temporarily remains open and will be
answered later (Chapters 4 to 6).
In summary, the assumptions of this chapter are that the probabilistic description
of the system is already given and that the features of the objects are already defined
and extracted.
The general idea is to think of the world as a random generator that outputs pairs of
features with associated object classes (m, ω). The collection of all pairs is described by
a probability distribution. In addition, all elements of a sequence of pairs are pairwise
independent, i.e., the joint distribution of N pairs (m1 , ω1 ), . . . , (mN , ω N ) equals the
product of the individual distributions,
p((m_1, ω_1), . . . , (m_N, ω_N)) = ∏_{j=1}^N p(m_j, ω_j).   (3.1)
The marginal distribution of the classes, P(ω), is called the a priori distribution (of the classes), and the marginal distribution of the features is

p(m) = ∑_{ω=ω_1}^{ω_c} P(m, ω).   (3.3)

The class-specific (conditional) feature distribution

p(m | ω)   (3.4)

is also called the likelihood.
(a) Full joint distribution  (b) Marginal distributions
Fig. 3.1. Example of a random distribution P(m, ω) of mixed discrete and continuous quantities.
Finally, the conditional distribution of the classes given the features,

P(ω | m),   (3.5)

is called the a posteriori distribution. So far, the set of classes was written as Ω/∼ = {ω_1, . . . , ω_c}, where each ω_i ⊆ Ω denoted one class and was formally defined as a subset of Ω. Now, we identify each of them with a unit vector in a c-dimensional space K,

ω_i ∈ K = ℝ^c   with ω_{ij} = δ_i^j,   (3.9)
ω_i = (0, . . . , 0, 1, 0, . . . , 0)^T,   with the 1 at the ith position.   (3.10)

Fig. 3.2. The decision space K for c = 3 classes. The orange arrow shows a decision vector k, the blue triangle shows the probability simplex, where ∑_i k_i(m) = 1 and k_i ≥ 0, i = 1, . . . , c.
The unit vectors ω_i are also called target vectors, k ∈ K a decision vector, and its components k_i decision functions.
In light of this new perspective, a classifier follows a two-step construction scheme.
In the first step, a feature vector is mapped to a value k in the decision space spanned
by the ωi (see Figure 3.2),
k ∈ span{ω_1, . . . , ω_c} = { ∑_{i=1}^c λ_i ω_i | λ_i ∈ ℝ }.   (3.11)
The second step is to take the target vector ω̂ as an estimation that has the shortest
distance from k, i.e.,
ω̂ = arg min_{ω_i : i=1,...,c} ‖k − ω_i‖.   (3.12)
While the second step is uniform for all classifiers, the actual logic of a classifier
is merged into the mapping in the first step,
k : M → K,   m ↦ k(m).   (3.13)
Beyond a uniform description of classifiers, the vectorized approach prevents a
misconception of the classes’ numbering. While the numbering misleadingly implies
an ordering of the classes, the vectorized approach clearly shows that each pair of
target vectors has the same distance. More concisely: the classes stem from a nominal,
not an ordinal, scale, and the vector description reflects this.
A parametric decision function is a mapping k(m, θ) that additionally depends
on a parameter vector θ = (θ1 , . . . , θ k )T ∈ Θ from a parameter space Θ. Learning
by examples means to find the parameter vector θ̂ such that the mapping k(m_i, θ̂)
approximates the true class ω(mi ) for each of the training samples in some opti-
mal way. Each decision region R_i ⊆ M in the feature space corresponds to a subset {k | ‖k − ω_i‖ < ‖k − ω_j‖ ∀j ≠ i} ⊆ K in the decision space. More precisely,

R_i = {m | ‖k(m, θ) − ω_i‖ < ‖k(m, θ) − ω_j‖ ∀j ≠ i}   (3.14)
∂R_i = {m | ‖k(m, θ) − ω_i‖ ≤ ‖k(m, θ) − ω_j‖ ∀j ≠ i} \ R_i.   (3.15)
This shows that the structure and parametrization of k(⋅, θ) determines the deci-
sion boundaries in M.
After these preliminary remarks, we now start with the first specific classifier. As the
first objective, one can require that the expected squared Euclidean distance between
the decision vector and the true target vector is minimal. Let P(m, ω) denote the joint
distribution of the feature vector and the target vectors. Then the objective function is
f(k) = E{‖k(m) − ω‖²} = ∑_{i=1}^c ∫_M ‖k(m) − ω_i‖² P(m, ω_i) dm   (3.16)
As k is an optimal solution, the inequality must hold for any k∆ . This is surely
true if there is a k such that the second term is identically zero. Note that it is not
required that the second term be zero: this is a sufficient but not a necessary condition.
Fig. 3.3. Workflow of the MAP classifier: the decision functions k_i(m) = P(ω_i | m) are evaluated and the maximum determines the estimate ω̂.
In the last line, the dependence of k on m was made explicit again and the term was reordered so that k_Δ^T(m) was moved out of the brackets, because this term is outside of our control. But the overall formula becomes zero if the term in the brackets is zero, and this term is only made up of known quantities and k. One obtains

0 = ∑_{i=1}^c (k(m) − ω_i) P(ω_i | m) = k(m) ∑_{i=1}^c P(ω_i | m) − ∑_{i=1}^c ω_i P(ω_i | m)
  = k(m) − E{ω | m}   ⇒   k(m) = E{ω | m},   (3.21)

where ∑_{i=1}^c P(ω_i | m) = 1 was used.
The last line concludes the derivation. Going one step back and explicitly writing
out the expectation gives a more illustrative representation of the optimal target vector:
k(m) = E{ω | m} = ∑_{i=1}^c ω_i P(ω_i | m)
     = (1, 0, . . . , 0)^T ⋅ P(ω_1 | m) + ⋅ ⋅ ⋅ + (0, . . . , 0, 1)^T ⋅ P(ω_c | m) = (P(ω_1 | m), P(ω_2 | m), . . . , P(ω_c | m))^T.   (3.22)
The target vector is formed by the conditional distributions or a posteriori distributions of the classes, but this is only the first step of the classifier. Putting the last two steps together, the decision rule becomes

ω̂ = arg max_ω P(ω | m).   (3.23)
Fig. 3.4. 3-dimensional probability simplex in barycentric coordinates, i.e., projected onto the two-
dimensional plane so that every point in the simplex is identified by three coordinates k1 ,k2 ,k3 ≥ 0
with k1 + k2 + k3 = 1.
Intuitively, this result does not come as a surprise. The optimal classifier with
respect to the least expected square error always takes the class with the highest a
posteriori probability. Therefore, this classifier is called the maximum a posteriori
(MAP) classifier. Every point that can be described by k shares the property that the
sum of its components equals one. For this reason, the blue simplex in Figure 3.2 is
also called the probability simplex. Figure 3.3 illustrates the workflow of the MAP
classifier; Figure 3.4 depicts a 2-dimensional projection of the probability simplex
from Figure 3.2.
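In code, the two steps of the MAP classifier reduce to computing the a posteriori probabilities (up to the common factor p(m)) and taking the argmax. The following sketch assumes a toy two-class problem with known Gaussian likelihoods and priors; all values are made up.

    import numpy as np

    def gaussian_pdf(m, mu, sigma):
        return np.exp(-0.5 * ((m - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    priors = np.array([0.7, 0.3])                       # P(omega_1), P(omega_2)
    likelihoods = lambda m: np.array([gaussian_pdf(m, 0.0, 1.0),
                                      gaussian_pdf(m, 2.0, 1.0)])

    def map_classify(m):
        # k_i(m) proportional to P(omega_i | m); the common factor p(m) does not affect the argmax
        k = likelihoods(m) * priors
        return np.argmax(k) + 1                         # class index (1-based)

    print([map_classify(m) for m in (-1.0, 0.9, 1.3, 3.0)])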
Note that although this classifier is optimal in the sense of least square error in
the decision space, this does not mean that it will classify without error. The error
probability is given by
P e := P(ω̂ ≠ ω), (3.24)
i.e., this is the probability that the estimated class ω̂ does not match the true class ω. Since the a posteriori probability P(ω | m) depends on the class-specific feature
distribution p(m | ω), the asymptotic (i.e., the true theoretical) error probability can
only be zero if the class-specific feature distributions do not overlap, or if the a priori
probability of overlapping distributions is nonzero only for one of the classes. In prac-
tical applications, the former is almost never the case, while the latter is nonsensical
as it prevents the classifier from deciding on one or more of the classes.
In the previous Section, the MAP classifier was derived as the optimal classifier with
respect to the least expected square error. Of course, this criterion is only correct if each
error is equally bad. Quite often one is faced with applications in which one kind of
classification error is worse or more costly than an error of another kind. For example,
an undetected cause for alarm might be worse than a false alarm.
Therefore, the goal of this chapter is to extend the Bayesian framework by a cost
function:
l : Ω0 /∼ × Ω/∼ → ℝ (3.25)
where Ω0 /∼ := Ω/∼ ∪ {ω0 } and ω0 denotes the rejection class, which will be discussed
in more detail in Section 9.4. For now, it suffices to treat it as just another class without
any deeper meaning. The cost function l expresses the costs of deciding on the class
ω̂ if the true class is actually ω. In the case of a finite number of classes, the costs can
also be expressed as a matrix with entries l_ij = l(ω_i, ω_j). Normally, one has l(ω_i, ω_i) = 0 and l(ω_i, ω_j) ≥ 0 for i ≠ j. Instead of reducing the average error, the new objective is to reduce the expected cost
R = E{l(ω̂(m), ω)}.   (3.27)

To minimize R, define the a posteriori risk of deciding on class ω_i given the observation m as

R(ω_i | m) = ∑_{j=1}^c l(ω_i, ω_j) P(ω_j | m).   (3.28)

The overall risk can then be written as

R = E{l(ω̂(m), ω)}
  = ∫_M ∑_{j=1}^c l(ω̂(m), ω_j) P(m, ω_j) dm
  = ∫_M (∑_{j=1}^c l(ω̂(m), ω_j) P(ω_j | m)) p(m) dm
  = ∫_M R(ω̂(m) | m) p(m) dm.   (3.30)
In summary, the optimal decision is to choose the class with the smallest a posteriori
risk.
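The following sketch makes this concrete: the a posteriori risk of Equation (3.28) is evaluated for every candidate class and the class with the smallest risk is chosen. The cost matrix is invented for illustration.

    import numpy as np

    # l[i, j]: cost of deciding omega_i when the true class is omega_j (illustrative values)
    costs = np.array([[0.0, 10.0],     # missing a true alarm (omega_2) is expensive
                      [1.0,  0.0]])

    def bayes_decision(posteriors, costs):
        # R(omega_i | m) = sum_j l_ij P(omega_j | m), Eq. (3.28); pick the smallest risk
        risks = costs @ posteriors
        return int(np.argmin(risks)) + 1

    # Even though omega_1 is more probable here, its risk is larger:
    print(bayes_decision(np.array([0.8, 0.2]), costs))   # -> 2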
Viewed from a distance, all this is not very surprising. Taking the less risky choice
is probably what everyone would intuitively do, but now this is even mathematically
proven to be the optimum. Hence, the optimal Bayesian decision is a benchmark for
any other approach and defines an upper bound on the performance of any classifier.
At this point, one could ask, why is there any reason to consider a different classifier
if the optimum is already achieved? What is the rest of this book about? The answer
is simple: the Bayesian classifier requires probability densities that are typically not
(fully) known in real-world scenarios. The Bayesian classifier is optimal because it
uses every piece of information that can eventually be known about the entirety of
all the features and objects of the domain. Of course, an omniscient classifier has no
difficulties in making the best decision.
Bayesian decision theory uses a cost function and the a posteriori probability
of the classes. While the cost of a wrong decision can hopefully be determined with
some degree of certainty, an accurate a posteriori probability or more precisely its
determining pieces are hard to find. The class-specific distribution of the features and
the a priori distribution of the classes are normally unknown. Note that the marginal
distribution of the features p(m) in the denominator of
P(ω | m) = p(m | ω) P(ω) / p(m) ∝ p(m | ω) P(ω)   (3.32)
Fig. 3.5. Connection between the likelihood ratio and the optimal decision region.
is not required to find the minimum or maximum with respect to ω for a fixed m.
The quality of the Bayesian classifier stands or falls with the accuracy of these
quantities. Therefore, the next two chapters will treat the question of how those quanti-
ties are estimated in practice. The next chapter will focus on the so-called parametrized
methods, and the following one will deal with parameter-free methods that work with-
out a model assumption. Besides the technical challenge of how those distributions
are mathematically obtained, another issue becomes evident, too. The question is
what the concept “probability” actually means. This more philosophical matter will
constitute the introductory part of the coming chapter.
But before that, the remainder of this chapter will discuss some simple examples
and deduce another estimator that limits the maximal risk in the worst case.
Once again, consider a simple two-class scenario. Let l_ij = l(ω_i, ω_j) be a shorter notation for the cost function. The a posteriori risk for a fixed feature m is

R(ω_1 | m) = l_11 P(ω_1 | m) + l_12 P(ω_2 | m)   and   R(ω_2 | m) = l_21 P(ω_1 | m) + l_22 P(ω_2 | m).

The estimator decides on the class ω_1 iff R(ω_1 | m) < R(ω_2 | m). This can be equivalently rewritten as a threshold test on the likelihood ratio p(m | ω_1)/p(m | ω_2) (see Figure 3.5). An important special case is the cost function l_ij = 1 − δ_ij, i.e., every misclassification is equally costly and a correct decision costs nothing. Then the a posteriori risk

R(ω_i | m) = ∑_{j≠i} P(ω_j | m) = 1 − P(ω_i | m)   (3.37)

equals the converse of the a posteriori probability. As the estimator chooses the least risk, this leads to the already known MAP classifier. Hence, the MAP classifier is a special case of the general Bayesian classifier with this particular cost function.
The overall risk is
R = ∫_M R(ω̂(m) | m) p(m) dm
  = ∫_M (1 − P(ω̂(m) | m)) p(m) dm        (using Equation (3.37))
  = 1 − ∫_M P(ω̂(m) | m) p(m) dm
  = 1 − ∫_M P(ω̂(m), m) dm.   (3.38)
To interpret the last term, reconsider Figure 3.1 and compare it with Figure 3.6.
Generally there will be tuples (ω(o1 ),m(o1 )), (ω(o2 ),m(o2 )) whose underlying ob-
jects belong to different classes, ω(o1 ) ≠ ω(o2 ), but show the same feature vector
m(o1 ) = m(o2 ). This always happens if the supports of the likelihoods overlap.
The probability of the whole ensemble of all tuples (ω, m) is one by definition, i.e.,
(a) Joint distribution  (b) Marginal distributions
Fig. 3.6. Decision of an MAP classifier in relation to the a posteriori probabilities. The a priori probabilities are P(ω_i) = 1/5 for i = 1, . . . , 5. The MAP classifier always chooses the class with the highest a posteriori probability (shaded areas/thick lines). The overall risk can be computed by summing the individual class-specific risks for each of these regions separately.
∑_{j=1}^c ∫_M P(ω_j, m) dm = 1, but the estimator ω̂ is a function of the feature vector and always chooses one (fixed) class, depending on the features. In the case of the MAP classifier, the estimator ω̂ always decides on the class with the highest a posteriori probability (thick lines in Figure 3.6b). Hence, the ensemble (ω̂(m), m) is only a subset of all the options. Rewriting the last term of Equation (3.38) in an artificially complicated manner leads to
∫_M P(ω̂(m), m) dm = ∑_{j=1}^c ∫_M δ[ω̂(m) = ω_j] P(ω_j, m) dm
  = ∑_{j=1}^c ∫_{{m ∈ M | ω̂(m) = ω_j}} P(ω_j, m) dm
  = ∑_{j=1}^c ∫_{R_j} P(ω_j, m) dm ≤ 1,   (3.39)
where δ[ ⋅ ] denotes the generalized Kronecker symbol. This means that only those
regions of the density are integrated that correspond to classes the estimator chooses
(shaded areas in Figure 3.6a). In summary, the term in Equation (3.39) represents
the probability that the classifier decides correctly. Conversely, Equation (3.38) is the
probability of a wrong decision. This can also be rewritten as
R = 1 − ∑_{j=1}^c ∫_{R_j} P(ω_j, m) dm = ∑_{j=1}^c ∫_{M\R_j} P(ω_j, m) dm.   (3.40)
As a final result, one can state that the overall risk of the MAP classifier equals the
probability of a wrong decision.
Fig. 3.7. (a) Class-specific feature distributions p(m | ω_1) and p(m | ω_2) of the reference example; (b) the corresponding a posteriori probabilities.
Fig. 3.8. Optimal decision regions derived from the true underlying distribution. In region R_1, one has that P(ω_1 | m) > P(ω_2 | m). In region R_2, the opposite is true. The dataset consists of 100 samples for each class (200 samples in all). The optimal classifier has an empirical classification error of 16/200 = 8 % for the dataset shown.
The posterior class probabilities can be computed using the Bayes rule and are shown
in Figure 3.7b.
The optimal decision boundaries correspond to the set of points where P(ω1 | m) =
P(ω2 | m), as shown in Figure 3.8. Since the relevant distributions are known, it is
possible to compute the theoretical classification error using Equation (3.40). The
probability of error turns out to be approximately P e ≈ 6.16 %.
This model was also used to create a training sample D and a test sample T to use
in the classifier examples throughout this book. Both D and T consist of 200 samples
each, where 100 samples were drawn from p(m | ω1 ) and 100 samples were drawn
from p(m | ω2 ). Figure 3.8 shows the training set D alongside the decision regions.
It can be seen that some samples fall into the wrong decision regions. In fact, the
empirical classification error is 8 %. Note that this error deviates from the theoretical
classification error calculated above. The important lesson to take away from this
is that the empirical classification error is subject to random perturbations, since it
depends on the dataset. More reliable estimates of the true probability of error can
be obtained by using a larger test sample, or by using the techniques described in
Section 9.2.
The remainder of this section will introduce another kind of classifier, the Minimax
classifier. All of the concepts that have been discussed so far need either the a pos-
teriori distribution P(ω | m) or must resort to the a priori distribution P(ω), thanks to
Equation (3.32). Knowing the former is nearly a forlorn hope in real-world applications,
but even having a feasible estimate of the latter can be a challenging task. Normally,
one has to rely on expert knowledge of the application. Hence, a natural question to
ask is what happens if one implements a pattern recognition system with a specific a
priori distribution in mind that does not reflect the reality. To start, reconsider the risk
R = ∫_M R(ω̂(m) | m) p(m) dm = ∑_{i=1}^c ∫_{R_i} R(ω_i | m) p(m) dm        (using Equation (3.30))
  = ∑_{i=1}^c ∫_{R_i} ∑_{j=1}^c l_ij P(ω_j | m) p(m) dm        (using Equation (3.28))
  = ∑_{i=1}^c ∫_{R_i} ∑_{j=1}^c l_ij p(m | ω_j) P(ω_j) dm.   (3.42)
The second line holds because the decision regions Ri are a partition of M and
the classifier ω̂ is equivalent to those regions. Hence, ω̂ is constant on each of them
and so ω̂(m) = ω_i for m ∈ R_i.
Although the Minimax classifier is not limited to the binary case, the following
discussion will be restricted to this case in order to keep the formalism simple. With
c = 2 and P(ω2 ) = 1 − P(ω1 ), it follows that
R = l_22 + (l_12 − l_22) ∫_{R_1} p(m | ω_2) dm
  + P(ω_1) [(l_11 − l_22) + (l_21 − l_11) ∫_{R_2} p(m | ω_1) dm − (l_12 − l_22) ∫_{R_1} p(m | ω_2) dm].   (3.43)
One can see that the risk is a linear function of the a priori probability of ω1 . While
the costs l ij and the class-specific distributions of the feature vector are not adjustable,
but rather determined by the environment, the decision regions Ri are the definable
design parameters. They determine both the ordinate and the slope of this linear function. This can lead to a problem, as illustrated in Figure 3.9.

Fig. 3.9. Risk R as a function of the a priori probability P(ω_1): the risk curve of the optimal Bayesian classifiers, the risk R_design of a classifier designed for P_design, the resulting risk at the true a priori probability P_true, and the constant Minimax risk R_Minimax.
For every a priori probability, there exists a particular choice of the decision re-
gions such that the risk is minimized. This relation is depicted by the blue curve in
Figure 3.9, which represents the optimal Bayesian classifiers. This curve is always zero
at the end points, because if any a priori probability equals 1, an error-free classifi-
cation is possible. It attains its maximum somewhere in the middle, say at PMinimax ,
where the uncertainty is high. Of course, the uncertainty is maximized if all classes
are equally distributed, but the maximal risk can lie somewhat off the mark if different
costs are associated with each class.
Now, assume that the a priori probability was Pdesign and the classifier was origi-
nally constructed with this assumption in mind (hence, the classifier was thought to
have a risk Rdesign ). But if for some reason this assumption was wrong, and the clas-
sifier is deployed in a scenario in which the true a priori probability is Ptrue ≠ Pdesign ,
then Equation (3.43) holds and the true risk Rtrue lies on the tangent (red line) that
goes through the initially assumed point. This might even lead to a risk that is much
higher than the minimal risk RMinimax in the worst case.
The idea is to construct a classifier such that Equation (3.43) becomes independent
of the a priori distribution. This means the slope in Equation (3.43) must be set to zero
by an appropriate choice of the decision regions Ri . Implicitly, this choice belongs to
the optimal Bayesian classifier for the worst-case a priori probability PMinimax . Then the
classifier has the risk RMinimax , which remains constant even if the a priori probability
diverges, because the tangent is constant (yellow line in Figure 3.9).
In summary, the objective is not to construct a classifier that has the minimal risk
for a specific a priori probability, but a classifier that has the minimal risk in the worst
case. The name “Minimax” comes from the fact that one tries to minimize the maximal
risk.
Anyway, in classical pattern recognition the optimal Bayesian classifier usually
performs better than the Minimax classifier. The situation in Figure 3.9 is somewhat
far-fetched because Pdesign and Ptrue are extremely different. Though not optimal, a
Bayesian classifier that is tuned for Pdesign will still be better for Ptrue than the Minimax
classifier if Ptrue is not too far away. (In Figure 3.9 these are all points where the red
tangent is still below the yellow line.) The origin of the Minimax approach is located
in game theory. Here, two players have to consecutively perform moves and one has
to decide what the best move is. The a priori probability describes what the adversary
is likely to do for its next move, but of course the adversary can adapt its own strategy
to the outcome of the decision one still has to make. Governed by the assumption that
the adversary will always do what is worst for the player, the idea is to minimize the
maximal risk. In classical pattern recognition, however, the adversary is the environ-
ment, and is therefore passive. Under normal conditions, one should be able to make
a reasonable assumption about the a priori distribution that is hopefully not too far
off.
The last Section of this Chapter repeats the previously introduced concepts in the case
that the class-specific feature distributions are Gaussian. Besides the goal of presenting
some concrete examples, this section also serves to reinforce the understanding of the
previous information.
A one-dimensional Gaussian (normal) random variable m with mean μ and standard deviation σ has the density

p(m) = (1/(√(2π) σ)) exp(−(1/2) ((m − μ)/σ)²).   (3.44)
The importance of the normal distribution is explained by the central limit theorem. In its most basic and classical form, it states that the normalized sum of a sequence of independent and identically distributed random variables with existing expectation and variance converges in distribution to a normally distributed random variable. Stated more precisely: if x_1, x_2, . . . are i.i.d. with E{x_i} = μ and Var{x_i} = σ² < ∞, then the distribution of (1/(σ√n)) ∑_{i=1}^n (x_i − μ) converges to the standard normal distribution N(0, 1) as n → ∞.
Note that this theorem does not explain the ubiquity of the normal distribution. Indeed,
there are more sophisticated and generalized variants of the central limit theorem, but
convergence (in distribution) can still be guaranteed under very mild assumptions.
Without going into too much detail, the individual random variables do not even need
to be identically distributed, and the pairwise independence can be replaced by some
limit value condition that ensures that no subsequence has too much influence. Hence,
the normal distribution gained extreme popularity in modeling naturally occurring
phenomena.
Loosely formulated, one could say that if a feature is generated by summing many
independent contributions, it is reasonable to approximate the distribution of the fea-
ture by a Gaussian distribution. This holds in one dimension, but also in many dimen-
sions. The multi-dimensional normal distribution is given by

N(m; µ, Σ) = (1/((2π)^{d/2} √|Σ|)) exp(−(1/2) (m − µ)^T Σ^{-1} (m − µ)).

Note that Σ is a covariance matrix and therefore symmetric and positive definite. In the 2-dimensional case, Σ can be decomposed as

Σ = ( σ_1²  ρσ_1σ_2 ; ρσ_1σ_2  σ_2² )

for σ_1, σ_2 > 0 and ρ ∈ [−1, 1]. Here, ρ is called the correlation coefficient.
For the rest of this chapter, we will assume that the class-specific feature distributions are normal. This means for each j = 1, . . . , c,

p(m | ω_j) = N(m; µ_j, Σ_j).

Note that such an assumption is only reasonable if the features are at least on an interval scale.
Because the MAP classifier chooses the class with the highest a posteriori probability, and due to the strict monotonicity of the logarithm and Equation (3.32), the following holds for every fixed feature vector m:

ω̂ = arg max_ω P(ω | m) = arg max_ω (ln p(m | ω) + ln P(ω)).
(a) Decision boundaries for P(ω_1) = 0.7 and P(ω_2) = 0.3   (b) Decision boundaries for P(ω_1) = 0.95 and P(ω_2) = 0.05
Fig. 3.10. Decision boundary of a two-class Gaussian classifier with unequal a priori probabilities.
Left: One-dimensional feature space; Right: Two-dimensional feature space.
As the logarithmic representation is easier and numerically more stable with respect to normally distributed features, the decision function is reformulated as k_i(m) = ln p(m | ω_i) + ln P(ω_i).
Fig. 3.11. Decision regions of a generic Gaussian classifier (i.e., full covariances) with c = 4 classes and two features (d = 2). The diagram shows p(m) = ∑_{i=1}^4 P(m, ω_i), where the regions are colored according to the decision made in the region.
In the simplest case, assume that all classes share the same isotropic covariance matrix, Σ_i = σ² I. The decision boundary between two classes ω_i and ω_j is then given by

0 = k_i(m) − k_j(m) = (1/σ²) (µ_i − µ_j)^T m − (1/(2σ²)) (‖µ_i‖² − ‖µ_j‖²) + ln (P(ω_i)/P(ω_j)).   (3.54)

The decision boundaries are hyperplanes that are perpendicular to the connection lines between the expectation vectors. If the a priori probabilities of the classes are equal, i.e., ln(P(ω_i)/P(ω_j)) = 0, then the hyperplanes lie at the center points between the expectation
vectors. If the a priori probabilities are not equal, the hyperplanes move toward the
component with lower a priori probability. Examples with one and two features are
shown in Figure 3.10.
For the second step, continue assuming an equal covariance matrix Σi = Σ for
all classes, but no longer are they necessarily diagonal. Again, the crucial part of the
decision function is
k_i(m) = −(1/2) (m − µ_i)^T Σ^{-1} (m − µ_i) + ln P(ω_i)   (3.55)

and the decision boundary is given by

0 = (µ_i − µ_j)^T Σ^{-1} m − (1/2) (µ_i^T Σ^{-1} µ_i − µ_j^T Σ^{-1} µ_j) + ln (P(ω_i)/P(ω_j)).   (3.56)
Again, the decision boundaries are hyperplanes. But unlike before (see Equation (3.54)),
the hyperplanes are rotated by Σ−1 and thus are not perpendicular to the connection
lines between the centers of the Gaussians µi .
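For illustration, the following sketch evaluates decision functions of the form k_i(m) = ln p(m | ω_i) + ln P(ω_i) for Gaussian class densities with arbitrary (class-specific) covariance matrices; with a shared Σ this reduces to the linear case above. All parameter values are made up.

    import numpy as np

    def gaussian_decision_functions(m, mus, Sigmas, priors):
        """k_i(m) = ln p(m | omega_i) + ln P(omega_i) for Gaussian class densities."""
        k = []
        for mu, Sigma, P in zip(mus, Sigmas, priors):
            diff = m - mu
            # the common term -(d/2) ln(2*pi) is omitted; it does not affect the argmax
            k.append(-0.5 * diff @ np.linalg.solve(Sigma, diff)
                     - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(P))
        return np.array(k)

    mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
    priors = [0.7, 0.3]
    m = np.array([1.5, 0.5])
    k = gaussian_decision_functions(m, mus, Sigmas, priors)
    print(k, "-> decide omega_%d" % (np.argmax(k) + 1))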
(a) Parabolic decision boundary  (b) Hyperbolic decision boundaries  (c) Linear decision boundaries  (d) Elliptic decision boundary
Fig. 3.12. Decision regions of a generic Gaussian classifier (i.e., full covariances) with c = 2 classes and two features (d = 2) are conic sections. The diagram shows p(m) = ∑_{i=1}^2 P(m, ω_i), where the regions are colored according to the decision made in the region.
Fig. 3.13. Application to the reference example of Section 3.3.2. Decision regions of a classifier with
Gaussian densities p(m | ω j ) = N(m; µj ,Σj ) (j = 1,2). The parameters µj and Σj are estimated
from a training sample. The training error is etrain = 12.5 %; the testing error is etest = 7 %, but
asymptotically approaches etest ≈ 9.6 %. The training set is the same as in Figure 3.8. Test samples
are shown with hollow marks.
A Gaussian mixture is a random variable whose density equals the convex combination of Gaussian densities. More formally, m is called a Gaussian mixture with K components if its density can be written as p(m) = ∑_{k=1}^K π_k N(m; µ_k, Σ_k) with π_k ≥ 0 and ∑_{k=1}^K π_k = 1.
Note that the term “Gaussian mixture” is misleading in that m is not a mixture of
Gaussian random variables (which would itself be a Gaussian random variable, see
Theorem 3.2). Instead, its probability density function is a mixture of Gaussian probability density functions.
Gaussian mixtures are very popular because they are easy to handle and enjoy a
powerful approximation property: Every density (within reason) can be approximated
by Gaussian mixtures with arbitrary precision. More precisely, let f be a density with
a finite number of discontinuities on every compact subset of its support. Let
fn = ∑_{k=1}^{Kn} πn,k N(m; µn,k, Σn,k)   (3.63)
be a sequence of Gaussian mixtures. Then there are K n , π n,k , µn,k , Σn,k such that f n
converges uniformly to f except at the points of discontinuity (see Maz’ya and Schmidt
[1996]). Note that K n is the number of components of the nth member f n and that, in
general, each f n has different components. Furthermore, the µn,k and Σn,k are not
necessarily a superset of the components of the previous member.
Unfortunately, there is no general rule for how many components K are necessary to obtain a specific approximation error. But assume the task is to approximate each d-dimensional class-specific feature distribution with K components. Then for every class there are (1/2)K(d² + 3d + 2) − 1 parameters:

πj,1, . . . , πj,K   (K − 1 parameters),
µj,1, . . . , µj,K   (Kd parameters),
Σj,1, . . . , Σj,K   ((1/2)Kd(d + 1) parameters).
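As an illustration, the following minimal sketch (not from the book) evaluates a Gaussian mixture density of the form in Equation (3.63) and counts its free parameters; the mixture weights, means, and covariances are made-up example values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal sketch (not from the book): a Gaussian mixture density with d = 2, K = 3.
weights = np.array([0.5, 0.3, 0.2])                    # pi_k, sum to one
means   = [np.array([0.0, 0.0]),
           np.array([3.0, 1.0]),
           np.array([-2.0, 2.0])]                      # mu_k (assumed values)
covs    = [np.eye(2), np.diag([2.0, 0.5]),
           np.array([[1.0, 0.4], [0.4, 1.0]])]         # Sigma_k (assumed values)

def mixture_pdf(m):
    """Convex combination of Gaussian densities, cf. Eq. (3.63)."""
    return sum(w * multivariate_normal.pdf(m, mean=mu, cov=S)
               for w, mu, S in zip(weights, means, covs))

K, d = len(weights), 2
n_params = (K - 1) + K * d + K * d * (d + 1) // 2      # = (1/2) K (d^2 + 3d + 2) - 1
print(mixture_pdf(np.array([1.0, 0.5])), n_params)     # density value and 17 parameters
```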
3.4 Exercises
(3.1) Show: If two random events a, b are stochastically independent, i.e., P(a | b) = P(a), then b is also independent of the opposite event ā, i.e., P(ā | b) = P(ā).
(3.2) Let m1 and m2 be two feature vectors that are to be classified using a maximum a posteriori (MAP) classifier.
1. When is the result of classification according to maximum a posteriori proba-
bility, ω̂ = arg maxω P(ω | m), the same as that of classification according to
maximum likelihood, ω̂ = arg maxω p(m | ω)?
2. Under what conditions will the result of the MAP classifier only depend on
the a priori probabilities P(ω)?
(3.3) In a classification problem with three classes ω1 , ω2 , ω3 , with P(ω1 ) = 0.1 and
P(ω2 ) = 0.6, the following is known about the feature vectors m ∈ ℝd :
(3.4) Suppose there are two classes ω1 and ω2 and a feature m ∈ ℝ with the following class-dependent feature distributions:

p(m | ω1) =  m       if 0 < m ≤ 1,
             2 − m   if 1 < m ≤ 2,
             0       else,

and

p(m | ω2) =  m − 1   if 1 < m ≤ 2,
             3 − m   if 2 < m ≤ 3,
             0       else.
1. Sketch the class-dependent feature distributions p(m | ω1 ) and p(m | ω2 ) in a
single diagram.
2. Calculate the decision boundary of the Bayesian optimal classifier under the
assumption that P(ω1 ) = P(ω2 ) = 0.5. Mark the boundary in your diagram.
3. Calculate the decision boundary for P(ω1 ) = 0.25 and mark it in your diagram.
4. Calculate the error probabilities in both cases.
(3.5) Let ω1 to ω4 be four classes with P(ω1 ) = P(ω2 ), P(ω3 ) = 0.3 and P(ω4 ) = 0.5.
For a feature m1 , the following are to hold:
(3.6) Let ω be a class and let m = (m1 ,m2 )T be a feature vector of stochastically inde-
pendent features m1 and m2 . Give the a posteriori class probability P(ω | m) using
Bayes’ law. Simplify as much as possible.
(3.7) A bulk material sorter is used to separate healthy wheat grains (ω1 ) from grains
infected with ergot (a fungus that produces a very potent toxin, ω2 ) and assorted
foreign bodies like dirt and the grains of other plants (ω3 ). If an infected grain
remains undetected, the infection will spread and 100,000 grains with a value of
1 EUR will have to be discarded, on average. If a foreign body remains undetected,
the damage will only be to the brand image, which is calculated at 0.01 EUR.
The sorting system uses a Bayesian classifier to classify each individual grain,
where only the length of the object is used as a feature. The sensor used can only
detect whether a grain is longer or shorter than 7 mm.
It is known that the material stream consists of 97 % healthy grains, 2 % infected
grains, and 1 % foreign materials. The manufacturer of the length sensor gives the
following performance characteristics:
P(length < 7 mm | ω1) = 90/100
P(length < 7 mm | ω2) = 3/100
P(length < 7 mm | ω3) = 5/100
1. Construct the cost matrix L for classification according to minimal a posteriori
risk.
2. Which class will be chosen by a maximum a posteriori classifier if the sensor
signals length < 7 mm?
3. Which class will be chosen by a risk minimizing classifier if the sensor signals
length < 7 mm?
4 Parameter estimation
The previous chapter assumed that the quantities

P(ω|m) = p(m|ω)P(ω)/p(m) ∝ p(m|ω)P(ω)   (4.1)

(see Equation (3.32)) are known, or at least the right-hand side (without p(m)). As al-
ready stated, the methods for determining these quantities can basically be divided
into two groups, parametric and non-parametric methods. This chapter deals with the
parametric approach.
In principle, one already has a mathematical model p(m|ωi, θ) of the distribution that captures its major traits and restricts the degrees of freedom to a finite-dimensional parameter vector θ ∈ ℝq. For example, p(m|ωi, θ) might be the family of normal densities with unknown expectation and variance, i.e., θ = (μ, σ²)ᵀ. Furthermore,
a dataset D = {m1 , . . . , mN } is given which is assumed to have been generated by
p(m|ω i , θ) for a fixed, but unknown θ. The goal is to find the “true” value of the
parameter vector θ given the samples D.
This definition of an estimator is rather broad and especially does not make any state-
ment about quality. A constant function is an estimator, too, but intuition tells us that
this should not be a good one. Performance indicators, which help to decide if an es-
timator is reasonable with respect to the application, are a subsequent topic of this
chapter.
First, however, there will be a short excursus on two doctrines about the meaning
(semantics) of probability. Both philosophies share the same syntactical foundation:
The axiom system of Kolmogorov governs how to calculate with probabilities.
Moreover, the probability Pr(B|C) = Pr(B ∩ C)/Pr(C) is called the conditional probability of B given C.
dataset D = {m1, . . . , mN}, a reasonable approach is to choose θ̂(D) such that the observation D becomes most likely,

θ̂(D) = arg max_{θ∈Θ} ∏_{m∈D} p(m | θ).   (4.3)
This approach is called the likelihood method and this estimator is called the maximum
likelihood estimator (see Definition 4.7).
In the Bayesian framework, the parameter vector is assumed to be a random quan-
tity θ, too, so that there is a joint distribution of (m, θ). For practical purposes the
joint distribution is rather uninteresting, but the distribution assumption is expressed
as a conditional distribution p(m|θ). Applying Bayes’ law, this can be rewritten to a
conditional distribution of θ given an observation D. Then θ̂(D) is assigned the value
for which the a posteriori probability attains a maximum. Unsurprisingly, this is called
the maximum a posteriori estimator.
Let us repeat this so that the difference between both philosophies becomes abun-
dantly clear. In classical statistics, the distribution of the features is given by p(m | θ).
The parameter vector θ is an unknown, but constant quantity. On the Bayesian view,
the distribution of the features is a conditional distribution given by p(m|θ). The pa-
rameter vector θ is a random variable itself. Moreover, in classical statistics the pa-
rameter is chosen such that the observation becomes most likely. There is no point in
speaking about something like the probability of the parameter. In the Bayesian world,
on the contrary, the parameter vector θ is chosen to have the maximum probability
conditioned by the observation.
In maximum a posteriori pattern classification, there is one feature distribution
per class ω i . Consequently, one needs to estimate one parameter vector θi per class.
More precisely, one seeks the distributions p(m|ω i , θi ) with i = 1, . . . , c. The general
assumption is that one uses supervised learning. This means that the number of classes
and the class assignment of samples is given beforehand. The whole dataset can be
decomposed into a partition D = D1 ⊎ ⋅ ⋅ ⋅ ⊎ Dc and furthermore samples from Di bear
information about the unknown parameter vector θi , but do not have any influence on
the parameter vectors θj (j ≠ i) of the other classes. In conclusion, the task of parameter
estimation can be independently repeated for each class and one can assume without
loss of generality that only one class exists. Here and below, the explicit notation of a
class will be suppressed.
Definition 4.1 stated that any statistic that maps into the parameter space is an
estimator. One of the minimal requirements for a reasonable estimator is its unbiased-
ness. The principal idea is to put random variables (instead of observations) into the
estimator, so that the estimator becomes a random quantity on its own. The estimator
is called unbiased if its expectation equals the parameter being estimated.
E{θ̂} = E{θ̂(m1, . . . , mN)} = θ   for all θ ∈ Θ.
Definition 4.4 (Cramér–Rao bound). Let the hypotheses be the same as in Defini-
tion 4.3, first item, with the simplification Θ = ℝ and the following additions:
1. θ̂ is unbiased,
2. θ̂ : MN → Θ does not depend on the unknown value θ,
3. E{θ̂²} = E{θ̂(m1, . . . , mN)²} < ∞,
4. The density p(m | θ) is differentiable with respect to θ, and
5. ∂/∂θ ∫_M p(m | θ) dm = ∫_M ∂/∂θ p(m | θ) dm.

Then the variance of the estimator is bounded from below by

Var{θ̂} ≥ 1 / (N E{(∂/∂θ ln p(m | θ))²}) = 1 / (N ∫_M (∂/∂θ ln p(m | θ))² p(m | θ) dm).   (4.6)
An estimator whose variance equals this bound for all θ is called a CR-efficient estima-
tor. Before a sketch of the proof is given, let us discuss the prerequisites. The first two
are rather natural. The need for unbiasedness was already explained. An estimator
that would depend on the value being estimated is not forbidden by definition, but
rather pointless for practical purposes. Hence, this is no real limitation. The last three requirements are rather technical and are normally summarized by the term regularity conditions. Distributions that comply with them are called regular distributions. For all practical (engineering) purposes, one can take them for granted.
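The following minimal simulation sketch (not from the book) illustrates the bound: for m ∼ N(μ, σ²) the Fisher information of μ is 1/σ², so Equation (4.6) gives Var{μ̂} ≥ σ²/N, and the empirical mean attains this bound. All numerical values are made-up example parameters.

```python
import numpy as np

# Minimal simulation sketch (not from the book): Cramer-Rao bound for the mean
# of a Gaussian. The empirical mean is CR-efficient, so its variance equals sigma^2/N.
rng = np.random.default_rng(0)
mu, sigma, N, runs = 3.0, 2.0, 50, 20_000   # made-up example values

estimates = rng.normal(mu, sigma, size=(runs, N)).mean(axis=1)
print("empirical Var{mu_hat}:", estimates.var())
print("Cramer-Rao bound     :", sigma**2 / N)   # both are approximately 0.08
```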
The proof is mainly an application of the Cauchy–Schwarz inequality

∫ x(m)² dm ∫ y(m)² dm ≥ (∫ x(m) y(m) dm)²   (4.7)
for square-integrable functions. The sketch of the proof is only presented for N = 1 to avoid cumbersome integrals. Because θ̂ is unbiased, E{θ̂ − θ} = 0 for all θ, and likewise for the partial derivative, ∂/∂θ E{θ̂ − θ} = 0. Altogether, this leads to

0 = ∂/∂θ E{θ̂ − θ} = ∂/∂θ ∫ (θ̂(m) − θ) p(m | θ) dm
  = ∫ ∂/∂θ (θ̂(m) − θ) · p(m | θ) dm + ∫ (θ̂(m) − θ) ∂/∂θ p(m | θ) dm
  = −∫ p(m | θ) dm + ∫ (θ̂(m) − θ) p(m | θ) ∂/∂θ ln p(m | θ) dm,   (4.8)

where the last step uses Equation (4.9) and the first integral in the last line equals 1.
Hence, ∫ (θ̂(m) − θ) p(m | θ) ∂/∂θ ln p(m | θ) dm = 1, and squaring yields

1 = (∫ (θ̂(m) − θ) p(m | θ) ∂/∂θ ln p(m | θ) dm)²
  = (∫ (θ̂(m) − θ) √p(m | θ) · √p(m | θ) ∂/∂θ ln p(m | θ) dm)²
  ≤ ∫ (θ̂(m) − θ)² p(m | θ) dm · ∫ (∂/∂θ ln p(m | θ))² p(m | θ) dm   (by (4.7))
  = Var{θ̂} · E{(∂/∂θ ln p(m | θ))²},   (4.10)

where the last expectation is the Fisher information J(θ).
Of course, Equation (4.14) requires an explanation of what “⪰” means in the context of matrices. We say Cov{θ̂} ⪰ (1/N) J⁻¹(θ) iff (Cov{θ̂} − (1/N) J⁻¹(θ)) is a positive semi-definite matrix. This is equivalent to

αᵀ Cov{θ̂} α ≥ (1/N) αᵀ J⁻¹(θ) α   for all α ∈ ℝᵏ   (4.16)

and, in particular, implies

tr Cov{θ̂} ≥ (1/N) tr J⁻¹(θ).   (4.17)
After unbiased estimators and CR-efficient estimators, the third type of estimator
we consider is the consistent estimator.
Definition 4.5 (Consistent estimator). Again, the setting is the same as in Definition 4.3, first item. An estimator is called a consistent estimator iff θ̂ converges almost surely to θ as N → ∞, with θ̂ = θ̂(m1, . . . , mN).
This means an estimator is consistent if its value converges almost surely to the true
value. Actually, this should be a minimal requirement of a reasonable estimator. One
should realize that neither is an unbiased estimator necessarily consistent, nor is a
consistent estimator necessarily unbiased. These properties are independent of each
other.
Before looking at some examples of estimators that illustrate the above concepts, the terms will be discussed and weighed against each other. From a purely theoretical perspective, the unbiasedness of an estimator is a minimal requirement. The gap between the expectation of the estimator and the true value,

b(θ̂) = E{θ̂} − θ,   (4.19)

is called the bias of the estimator. If an estimator is truly biased (not only biased, but even asymptotically biased), then the bias will remain no matter how many samples are used. For this reason a bias is also called a systematic error. In contrast, the variance Var{θ̂} of an estimator typically diminishes when more samples are considered. Hence, the variance is called the stochastic error of the estimator. As the theory can use as many samples as needed, the stochastic error is not crucial. From a practical perspective this is not entirely true, because the number of samples is limited and cannot be arbitrarily increased. The mean squared error (MSE) of an estimator,

E{(θ̂ − θ)²} = b(θ̂)² + Var{θ̂},   (4.20)

equals the sum of the squared bias and the variance. Given two estimators θ̂1 and θ̂2 with Var{θ̂2} ≪ Var{θ̂1}, a biased estimator θ̂2 can have a much smaller mean squared error than an unbiased estimator θ̂1. This is depicted in Figure 4.1.
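A minimal simulation sketch (not from the book) of this trade-off: the unbiased empirical mean is compared with a simple shrinkage estimator that is biased toward zero but has a smaller variance; all numerical values are made-up example parameters.

```python
import numpy as np

# Minimal simulation sketch (not from the book): MSE decomposition of Eq. (4.20).
rng = np.random.default_rng(1)
mu, sigma, N, runs = 1.0, 3.0, 10, 50_000

samples = rng.normal(mu, sigma, size=(runs, N))
theta_1 = samples.mean(axis=1)          # unbiased, variance sigma^2 / N = 0.9
theta_2 = 0.5 * theta_1                 # biased toward 0, only a quarter of the variance

mse = lambda est: np.mean((est - mu) ** 2)
print("MSE of unbiased estimator :", mse(theta_1))   # ~ 0.9
print("MSE of shrinkage estimator:", mse(theta_2))   # ~ 0.25*0.9 + 0.5^2 = ~0.475
```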
As a starting example, consider an estimator of the expectation. For this purpose, the mi (i = 1, . . . , N) are independently and identically distributed with p(m | μ) and parameter θ = μ. Moreover, E{mi} = μ and σ² = Var{mi} = E{(mi − μ)²}. Note that although the distribution is assumed to be parametrized by its expectation μ, the distribution is not assumed to be Gaussian. The expectation and variance are called μ and σ² for convenience only.
The empirical mean suggests itself as an estimator for the expectation:

μ̂ = (1/N) ∑_{i=1}^N mi.   (4.21)
Fig. 4.1. Comparison of an unbiased estimator with large variance (θ̂1, blue) with a biased estimator with small variance (θ̂2, red).
Var{μ̂} = (1/N²) ∑_{k=1}^N ∑_{l=1}^N E{(mk − μ)(ml − μ)}
        = (1/N²) ∑_{k=1}^N E{(mk − μ)²} + (1/N²) ∑_{k=1}^N ∑_{l≠k} E{(mk − μ)(ml − μ)}
        = (1/N²) ∑_{k=1}^N σ² + (1/N²) ∑_{k=1}^N ∑_{l≠k} E{mk − μ} E{ml − μ}
        = σ²/N,   (4.23)

where the third equality uses the independence of the samples and the double sum vanishes because E{mk − μ} = 0.
The variance of the estimator vanishes linearly with respect to the sample size. In other words, the standard deviation √Var{μ̂} decreases with 1/√N. This asymptotic behavior is usual for most applications. Applying Chebyshev's inequality,

Pr(|μ̂ − E{μ̂}| ≥ ε) ≤ Var{μ̂}/ε²   for all ε > 0,   (4.24)

yields

Pr(|μ̂ − μ| ≥ ε) ≤ σ²/(N ε²),   (4.25)
which shows that the estimator is also consistent.
The empirical mean as an estimator for the expectation can more or less be found
through an educated guess. We now turn to a more systematic approach to finding
estimators. The maximum likelihood estimator has already been mentioned in the
introduction of this chapter (see Equation (4.3)). For many distribution assumptions, the maximum likelihood estimator of the expectation equals the empirical mean.
Due to the strict monotonicity of the logarithm, the likelihood function and the log-
likelihood function share the same extremal points. In practice, the log-likelihood
function is often easier to use, since it involves sums instead of products. The maximum
likelihood estimator determines the parameter that maximizes the likelihood given
the observation. In other words, maximum likelihood estimation chooses the value
θ = θ̂ which makes the given observation D maximally probable under the model.
Definition 4.7 (Maximum likelihood estimator). The hypothesis will be the same as
in Definition 4.6. Then
θ̂ML(D) = arg max_{θ∈Θ} ∏_{m∈D} p(m | θ) = arg max_{θ∈Θ} ∑_{m∈D} ln p(m | θ)   (4.28)
Under the usual implicit assumption that all functions are sufficiently smooth,
0 =! ∇θ l(θ) = ∑_{m∈D} ∇θ ln p(m | θ)   with   ∇θ = (∂/∂θ1, . . . , ∂/∂θq)ᵀ   (4.29)
is a necessary condition.
The first example will be to find the ML estimator for the expectation value of a
d-dimensional normal distribution. Let mk ∼ N(µ, Σ) with µ unknown but known Σ.
It follows that
ln p(mk) = −(1/2)(mk − µ)ᵀ Σ⁻¹ (mk − µ) − (d/2) ln 2π − (1/2) ln det Σ
⇒ ∇µ ln p(mk) = Σ⁻¹ (mk − µ).   (4.30)
0 = ∑_{k=1}^N Σ⁻¹ (mk − µ̂ML)  ⇔  µ̂ML = (1/N) ∑_{k=1}^N mk.   (4.31)
For a univariate normal distribution with both parameters unknown, θ = (θ1, θ2)ᵀ = (μ, σ²)ᵀ, one obtains

ln p(mk) = −(1/(2θ2))(mk − θ1)² − (1/2) ln θ2 − (1/2) ln 2π
⇒ ∇θ ln p(mk) = ( (1/θ2)(mk − θ1),  (1/(2θ2²))(mk − θ1)² − 1/(2θ2) )ᵀ =! 0.   (4.32)

Summing over all samples and solving the resulting system of equations yields

θ1 = μ̂ML = (1/N) ∑_{k=1}^N mk,   (4.33)
θ2 = σ̂²ML = (1/N) ∑_{k=1}^N (mk − μ̂ML)².   (4.34)

Analogously, for a d-dimensional normal distribution with both µ and Σ unknown, the ML estimators are

µ̂ML = (1/N) ∑_{k=1}^N mk,   (4.35)
Σ̂ML = (1/N) ∑_{k=1}^N (mk − µ̂ML)(mk − µ̂ML)ᵀ.   (4.36)
Note that the ML estimator for the variance is biased. It would be unbiased if the true expectation μ were known. But as the ML estimate μ̂ is put into the estimator for the variance, this estimator systematically underestimates the variance due to the additional uncertainty coming from μ̂. It can be shown that the unbiased estimator is (N/(N − 1)) σ̂²ML. In any case, both estimators are consistent.
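A minimal sketch (not from the book) of Equations (4.33)/(4.34) and the bias correction; the sample is drawn from made-up parameters.

```python
import numpy as np

# Minimal sketch (not from the book): ML estimates of mean and variance of a
# univariate normal sample and the bias-corrected variance N/(N-1) * sigma_ML^2.
rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=25)   # made-up example sample
N = data.size

mu_ml     = data.mean()                          # Eq. (4.33)
sigma2_ml = np.mean((data - mu_ml) ** 2)         # Eq. (4.34), biased (divides by N)
sigma2_ub = N / (N - 1) * sigma2_ml              # unbiased correction

print(mu_ml, sigma2_ml, sigma2_ub)
# numpy's ddof argument switches between the two conventions:
assert np.isclose(sigma2_ml, np.var(data, ddof=0))
assert np.isclose(sigma2_ub, np.var(data, ddof=1))
```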
4.2 Bayesian estimation of the class-specific distributions

In this section, the estimation problem is reconsidered under the Bayesian framework.
Unlike the former approach, the parameter vector θ is also regarded as a random quan-
tity. Moreover, the classical approach introduced the parameter right from the start
and aimed to estimate the parameter directly from the given dataset. In the Bayesian
concept, the parameter vector fades a little bit into the background, because here the
starting point is the original aim of estimating the class of an unknown object given
the training samples. The parameter is introduced as an intermediate link between
the training samples and the unknown object.
The fundamental quantity of the Bayesian classification is the a posteriori distribution of the classes,

P(ωi|m) = p(m|ωi) P(ωi) / p(m),   i = 1, . . . , c.   (4.37)

As usual, the data set is D = D1 ⊎ ⋅⋅⋅ ⊎ Dc with m ∈ Di ⇔ ω(m) = ωi. Taking into account that all quantities in Equation (4.37) are based on the data D, the formula can be extended to

P(ωi|m, D) = p(m|ωi, D) P(ωi|D) / p(m|D).   (4.38)
The conceptual difference of the Bayesian view is to regard every probability as a conditional probability. Any unconditional distribution is just a convenient shorthand for cases where the condition is negligible. This means one actually wants to know the probability that a realized feature m of a random feature m belongs to class ωi, given that the concrete dataset D has been observed before. In this sense, P(ω|m) is only an abbreviation for P(ω|m, D); it is justified because the conditional probabilities are nearly the same for any sufficiently large dataset.
Equation (4.38) can immediately be simplified again, because supervised sam-
pling is assumed. This means that the membership of a sample m in one of the parti-
tions Di is controlled, because its class is known. This has two consequences:
First, though the a priori distribution of the classes P(ω|D) depends on D, one
must not use a realization of the random variable D, because the realization is gen-
erally not truly sampled but artificially composed. This means that the proportions
of the partition D1 ⊎ ⋅ ⋅ ⋅ ⊎ Dc do not reflect the distribution of the classes. Hence, the
assumption is that an a priori distribution P(ω) is known.
Second, one assumes that the class-specific feature distribution does not depend
on samples of a different class. This means that
P(m|ω i , D) = P(m|ω i , Di ). (4.39)
Applying these considerations to Equation (4.38) and replacing the denominator by a
summation over all classes yields
P(ωi|m, D) = p(m|ωi, Di) P(ωi) / ∑_{j=1}^c p(m|ωj, Dj) P(ωj).   (4.40)
Note the additional indices on the right-hand side of the above equation. Hence, the only quantity to be determined is the class-specific feature distribution p(m|ωi, Di) given the matching partition of a specific dataset. This quantity can be calculated independently for each of the c classes, and it is only required for matching indices of ωi and Di. For this reason, the explicit notation of the class is omitted below, but it is implicitly stipulated that m is conditioned on the same class as the samples in Di.
Until now, no parameter vector θ has been introduced; everything was based on the data D. We now assume that the feature distribution has a known parametric form with an unknown parameter θ that is a random quantity. Then one can write

p(m|D) = ∫_Θ p(m|θ, D) p(θ|D) dθ = ∫_Θ p(m|θ) p(θ|D) dθ.   (4.42)

The latter equality assumes that the distribution of the feature is conditionally independent of the data D given the parameter vector θ.
The open question is whether the last line of Equation (4.42) must be calculated
every time and for each class when a new feature m is to be classified. (Recall that the
indices were suppressed and that Equation (4.42) is only a sub-term in Equation (4.40).)
Under certain conditions the answer is that this is not ultimately necessary and the
calculation can be decoupled into two steps.
Assume the data D imply strong evidence for one singular parameter, i.e., the density p(θ|D) has a sharp and singular maximum at

θ̂(D) = arg max_{θ∈Θ} p(θ|D).   (4.43)

Then

p(m|D) = ∫_Θ p(m|θ) p(θ|D) dθ ≈ p(m|θ̂(D))
and the integral calculation can be avoided. In summary, the conditional feature dis-
tribution with respect to the dataset can be approximately replaced by a conditional
feature distribution with respect to the parameter vector with the highest a posteriori
distribution given the data.
The first example considers a univariate normal distribution with random expec-
tation μ but known variance σ2 , i.e., m k ∼ N(μ, σ2 ). The expectation is also normally
distributed with μ ∼ N(μ0 , σ20 ). We start with the calculation of the a posteriori distri-
bution of μ given the data,

p(μ|D) = p(D|μ) p(μ) / ∫ p(D|μ) p(μ) dμ
       ∝ p(μ) ∏_{k=1}^N p(mk|μ)
       ∝ exp{−(1/2)((μ − μ0)/σ0)²} · ∏_{k=1}^N exp{−(1/2)((mk − μ)/σ)²}
       ∝ exp{−(1/2)[((μ − μ0)/σ0)² + ∑_{k=1}^N ((mk − μ)/σ)²]}
       ∝ exp{−(1/2)[(N/σ² + 1/σ0²) μ² − 2((1/σ²) ∑_{k=1}^N mk + μ0/σ0²) μ]}
⇒ p(μ|D) = α exp{−(1/2)((μ − μN)/σN)²},   (4.45)
where the quantities in the last line are

μN = (Nσ0² / (Nσ0² + σ²)) μ̂N + (σ² / (Nσ0² + σ²)) μ0,   with μ̂N = (1/N) ∑_{k=1}^N mk,   (4.46)
σN² = σ0² σ² / (Nσ0² + σ²),   and   (4.47)
α = 1 / (√(2π) σN).   (4.48)
The quantities μ N , σ2N and μ̂ N can be found by comparing the coefficients of the
last and the second last line. The factor α can be easily determined, because the last
line shows that p(μ | D) is a Gaussian density.
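A minimal sketch (not from the book) of this intermediate result: it computes μN and σN² of Equations (4.46)/(4.47) and the predictive variance σ² + σN² of Equation (4.53), using the parameter values of Figure 4.2 (true μ = 3, σ² = 2, prior μ0 = −1, σ0² = 0.5).

```python
import numpy as np

# Minimal sketch (not from the book): Bayesian estimation of a Gaussian mean with
# known variance sigma^2 and prior mu ~ N(mu_0, sigma_0^2).
rng = np.random.default_rng(3)
sigma2, mu0, sigma0_2 = 2.0, -1.0, 0.5           # known variance and prior parameters
data = rng.normal(3.0, np.sqrt(sigma2), size=20)

N = data.size
mu_hat_N = data.mean()
mu_N     = (N * sigma0_2 * mu_hat_N + sigma2 * mu0) / (N * sigma0_2 + sigma2)  # Eq. (4.46)
sigma_N2 = (sigma0_2 * sigma2) / (N * sigma0_2 + sigma2)                        # Eq. (4.47)

print("posterior  :", mu_N, sigma_N2)            # shifts from mu_0 toward the sample mean
print("predictive :", mu_N, sigma2 + sigma_N2)   # p(m|D) = N(mu_N, sigma^2 + sigma_N^2)
```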
Before we go on to finally calculate the feature distribution p(m|D), let us discuss
this intermediate result. The estimate of μ given the data D is a Gaussian density on
its own. We consider the two extreme cases with respect to the sample number N. If
there is no sample, N = 0, then
μN = (Nσ0² / (Nσ0² + σ²)) μ̂N + (σ² / (Nσ0² + σ²)) μ0 = μ0   and   (4.49)
σN² = σ0² σ² / (Nσ0² + σ²) = σ0²,   (4.50)

because for N = 0 the first coefficient vanishes and the second equals one.
Fig. 4.2. Sequence of Bayesian a posteriori densities estimating the mean μ of a Gaussian distribution for N = 0, 10, 20, 30, 50 samples; the true Gaussian has μ = 3, σ² = 2, and the prior distribution of μ was assumed to have μ0 = −1 and σ0² = 0.5.
This is a reasonable result, as the distribution of the best estimate equals the prior if
no data is given. In contrast, as N → ∞,
lim_{N→∞} μN = lim_{N→∞} (Nσ0² / (Nσ0² + σ²)) μ̂N + lim_{N→∞} (σ² / (Nσ0² + σ²)) μ0 = lim_{N→∞} μ̂N = lim_{N→∞} (1/N) ∑_{k=1}^N mk,   (4.51)
lim_{N→∞} σN² = lim_{N→∞} σ0² σ² / (Nσ0² + σ²) = 0,   (4.52)

because the first coefficient tends to one and the second to zero.
In conclusion, for infinitely many samples, the uncertainty of the estimation vanishes
and the a posteriori distribution converges to a Dirac distribution at the empirical mean
of the samples. This means that any resemblance to the a priori assumption vanishes
and the result depends solely on the data and actually equals the ML estimator. An
example of such a sequence of a posteriori distributions is depicted in Figure 4.2.
To conclude the example, we must still calculate the conditional feature distribu-
tion given the dataset p(m|D). As all densities are Gaussian, the calculation of Equa-
tion (4.42) needs little effort. Again, α denotes a universal normalizing constant in
p(m|D) = ∫ p(m|μ) p(μ|D) dμ
       = α ∫ exp{−(1/2)((m − μ)/σ)²} exp{−(1/2)((μ − μN)/σN)²} dμ
       = α exp{−(1/2) (m − μN)² / (σ² + σN²)}.   (4.53)
The analogous multivariate case, mk ∼ N(µ, Σ) with known Σ and prior µ ∼ N(µ0, Σ0), yields

µN = NΣ0 (NΣ0 + Σ)⁻¹ µ̂N + Σ (NΣ0 + Σ)⁻¹ µ0,   with µ̂N = (1/N) ∑_{k=1}^N mk,   (4.55)
ΣN = Σ0 Σ (NΣ0 + Σ)⁻¹,   and   (4.56)
α = 1 / √((2π)^d det ΣN).   (4.57)
Analogously to Equation (4.53), the feature distribution equals
p(m|D) = ∫ p(m|µ) p(µ|D) dµ
       ∝ ∫ exp{−(1/2)(m − µ)ᵀ Σ⁻¹ (m − µ)} · exp{−(1/2)(µ − µN)ᵀ ΣN⁻¹ (µ − µN)} dµ
       ∝ exp{−(1/2)(m − µN)ᵀ (Σ + ΣN)⁻¹ (m − µN)}.   (4.58)
In summary, the multivariate result is m ∼ N(µN , Σ + ΣN ).
At the end of this section, the following list recapitulates the principal steps of the Bayesian approach to estimating the feature distribution. The implicitly suppressed notation of the classes is re-introduced. These steps must be performed for each class ωi, i = 1, . . . , c:
1. p(m|θi, ωi) is assumed to be structurally known; the parameter vector θi is a random quantity, too.
2. p(θi | ωi) includes the a priori knowledge about the θi.
3. The dataset Di bears the additional knowledge about θi; Di is assumed to be a set of independently and identically distributed feature vectors m1, . . . , mNi ∼ p(m | θi, ωi),

p(Di|θi, ωi) = ∏_{k=1}^{Ni} p(mk|θi, ωi),   (4.59)
followed by
p(m|D, ω i ) = ∫ p(m|θi , ω i )p(θi |Di , ω i ) dθi . (4.61)
4.3 Bayesian parameter estimation

The goal of the ML estimator is to find the best value θ̂ that can be plugged into the parametric density p(m | θ). In contrast, the Bayesian technique does not yield a single value θ̂, but a whole a posteriori distribution p(θ|D). Hence, the classical
approach and the Bayesian approach are not directly comparable. The class-specific
feature distribution is calculated by integrating over the a posteriori distribution of the parameter, p(m|D) = ∫_Θ p(m|θ) p(θ|D) dθ (see Equation (4.42)). As already stated on page 133, this computational effort can be
avoided if the a posteriori distribution can be approximated by a Dirac distribution. In
this case, the additional integration degenerates into a simple replacement of θ by the
value of θ̂ with the highest a posteriori probability. But setting θ̂ = arg maxθ∈Θ p(θ|D)
(see Equation (4.43)) is only one option for condensing a full density into a single value
of the parameter. This section will present the two most important ways of Bayesian
parameter estimation.
The basic approach is to find the estimate θ̂(D) such that the expectation

E{l(θ̂(D), θ)}   (4.63)

of a loss function l(θ̂, θ) is minimized. For the quadratic loss l(θ̂, θ) = ‖θ̂ − θ‖², this expectation becomes

E{‖θ̂(D) − θ‖²} = ∫_{M^N} ∫_Ω (θ̂(D) − θ)ᵀ (θ̂(D) − θ) p(θ|D) dθ p(D) dD,   (4.65)

so it suffices to minimize the inner integral

I(D) = ∫_Ω (θ̂(D) − θ)ᵀ (θ̂(D) − θ) p(θ|D) dθ   (4.66)

point-wise for every dataset D.
where U denotes the unit matrix, i.e., the matrix all of whose entries are unity. In summary, the estimator with the least quadratic error is

θ̂(D) = E{θ | D},   (4.70)

i.e., the mean of the a posteriori density p(θ|D). The second option is the 0–1 loss
l(θ̂, θ) = 0 if ‖θ̂ − θ‖ < ∆, and 1 else,   (4.71)
for an arbitrary but fixed ∆ > 0. An interesting special case is surely ∆ = 0. But as
{θ̂ = θ} is a null set, the direct approach does not lead to any result.
With this loss function, the a posteriori expected loss

E{l(θ̂(D), θ) | D} = ∫_{{θ : ‖θ̂(D)−θ‖ > ∆}} p(θ|D) dθ = 1 − ∫_{{θ : ‖θ̂(D)−θ‖ < ∆}} p(θ|D) dθ

is minimized point-wise. The last line is minimal if the integration is over a region where p(θ|D) is large. If ∆ becomes small enough and if the density is sufficiently smooth, this is achieved for arg max_θ p(θ | D). Hence, it follows that
θ̂(D) = arg max_θ p(θ | D)   (4.73)
for ∆ → 0. This is called the maximum a posteriori estimator.
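A minimal sketch (not from the book) contrasting the two point estimates: for a skewed a posteriori density the posterior mean (quadratic loss) and the posterior mode (0–1 loss, ∆ → 0) differ. A Gamma density is used here purely as a made-up stand-in for some posterior p(θ|D).

```python
import numpy as np
from scipy.stats import gamma

# Minimal sketch (not from the book): posterior mean vs. posterior mode (MAP).
a, scale = 3.0, 2.0                       # assumed shape/scale of the stand-in posterior
posterior = gamma(a, scale=scale)

theta_mmse = posterior.mean()             # minimizes the expected quadratic loss: 6.0
theta_map  = (a - 1.0) * scale            # mode of the Gamma density: 4.0

# the mode can also be found numerically from the density itself
grid = np.linspace(0.01, 30.0, 10_000)
theta_map_num = grid[np.argmax(posterior.pdf(grid))]
print(theta_mmse, theta_map, theta_map_num)
```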
4.4 Additional remarks on Bayesian classification

Now, we briefly turn our attention back to Bayesian classification. With the results of Chapter 4, it is possible to discuss in greater depth the errors that arise in Bayesian classification.
Although Bayesian classification is the optimal classification, it is not free of errors.
Basically, three different sources of errors can be distinguished:
4.5 Exercises
(4.1) The weight of a letter m in grams varies between m = 10 and m = 20. There are
two possibilities for estimating the weight of a given letter:
– Estimate m̂1 = 15, independently of the true weight of the letter.
– Estimate m̂2 = x, where x is the display of an inaccurate scale with E{x} = m and Var{x} = 36.
How large is the mean squared error (MSE) for each estimator? Which estimator
has the smaller MSE?
(4.4) Let X be a random variable over a population with expectation μ and variance
σ2 . Further, let x1 , . . . ,x N be an i.i.d. sample of size N > 4 over the population.
The following estimator of the expected value μ of X is proposed:
μ̂ := (1/(N − 4)) ∑_{i=3}^{N−2} xi,   (4.77)
i.e., the first and last two elements of the sample are discarded.
1. Show that μ̂ is an unbiased estimator of μ.
2. Is μ̂ a better estimator than the maximum likelihood estimator
μ̂ML = (1/N) ∑_{i=1}^N xi ?   (4.78)
(4.5) Let m1, . . . , mN be a sample of N i.i.d. elements and consider the following estimator of the variance σ² = Var{m}:

σ̂² = (1/(α − N)) ∑_{i=1}^N (mi − μ)²,   (4.79)
(4.6) Let m1, . . . , mN be a sample of N i.i.d. elements and consider the following estimator of the expected value μ = E{f(m)}:

μ̂ = (N/(N − α)) ∑_{i=1}^N f(mi),   (4.80)
for some function f(⋅). For which values of α will μ̂ be an unbiased estimator of μ?
5 Parameter free methods
At the beginning of this chapter we will first review what has been done so far. The
principal goal is to assign a class to an unknown object given its features and a training
set of objects with known features and classes. From a more abstract point of view,
one wants to learn some kind of rule, given a finite training sample of special cases.
The rule will then be applied to a new situation, where one hopes that the proposed
rule has general significance.
This is a two-step process: In the first step, the general rule must be found from
specific instantiations, the second step is to apply the (hopefully) general rule to a
new specific situation. If necessary, the rule found from the first step will need some
intermediate formal rewriting into a form that is applicable in the second step. The
first step, from the special to the general, is called “induction,” the second step, from
the general to the special, is called “deduction” (see Figure 5.1). In the context of this
textbook, the induction is to find a class-specific feature distribution given a dataset D;
the deduction is to apply the a posteriori probability to an unknown feature vector. The
necessary formal rewriting is Bayes’ law in order to obtain the a posteriori probability.
Instead of following this indirection, it is sometimes possible to directly infer from
the given data to the new situation. The term “transduction” was introduced by Vapnik
in the context of support vector machines (see Section 7.7) to describe this shortcut. In
this chapter, however, we are concerned with the induction step.
The induction step from D to p(m|ω) is an ill-posed inverse problem. For a deeper
understanding of inverse problems, see, e.g., the work of Aster et al. [2013] or Rieder
[2003]. For our purposes, the following intuition is sufficient: The induction is called
“inverse,” because the deduction is thought of as the forward model. It is ill-posed if
one of the following conditions (going back to Jacques Hadamard) holds:
– The inverse mapping is not well defined,
– the inverse mapping is not unique, or
– the inverse mapping is not continuous.
Fig. 5.1. The triangle of inference: induction from the training set D to p(m|ω) (an ill-posed inverse problem), deduction from p(m|ω) to P(ω|m) for unknown objects (the forward problem), and the direct shortcut called “transduction.”
In this context, the dataset poses only a finite number of conditions on the infinite-
dimensional solution space of all density functions. This means that in general, the
data does not suffice to determine a solution. Regularization, i.e., enforcing further
restrictions on the space of solutions, can offer a way out. Such additional restrictions
might be
– to make additional assumptions, e.g., on the range of parameters,
– to bring in additional prior knowledge, and
– to formulate desirable traits of the solution as auxiliary constraints.
The risk of regularization lies in restricting the space of permissible solutions in such
a way that the true solution is unintentionally excluded. In the previous chapter, the
class-specific feature distribution was assumed to be structurally known. Hence, the
space of all densities was restricted to a finitely parametrized family of densities. Un-
fortunately, there are only a handful of standard densities that are still analytically and
computationally feasible. But it is questionable how well these densities fit the appli-
cations. Especially the assumption that a multi-dimensional feature space is governed
by a product of simple densities seems bold.
Although this chapter still considers the induction step and tries to find a density
p(m | ω), it follows a totally different approach. The parameter-free methods do not
constitute a specific form of the density right from the beginning, but try to look at the
samples as a kind of discrete approximation of the true density. While the number of
samples increases, a sequence of densities is created that eventually converges to the
true density. This textbook will present two such methods: the Parzen window method
and the k-nearest neighbor method¹.
Let m denote a random vector, p(m) its density, m a realization of m, and Am ⊆ M
a neighborhood around m. Eventually, p(m) will be the unknown, true density we
want to approximate. Then
Pm = ∫_{Am} p(m̆) dm̆   (5.1)
= ∫_M p(m̆) (1/V) ∫_{Am̆} dm dm̆ = ∫_M p(m̆) dm̆ = 1,   (5.3)

because the inner integral equals V.
The aforementioned conditions are only necessary, not sufficient, to ensure convergence. In particular, the order of the limits matters: without any additional assumptions, one has

p(m) = lim_{V→0} lim_{N→∞} kN,V/(NV) ≠ lim_{N→∞} lim_{V→0} kN,V/(NV) = 0.   (5.11)
The latter is easy to see: if N is arbitrary but fixed, then for a sufficiently small volume V all the points are located outside the neighborhood and k is constantly zero.
In theory this is no problem, because one can first define a sequence of decreas-
ing volumes (outer limit) and then take as many samples as necessary to get a good
approximation of lim k/N (inner limit). In practice, the situation is more complicated.
Generally, the number of samples N is given in advance or is at least bounded and
there is normally no option for getting a fresh sequence of samples for each volume V.
Hence the question is: What is a reasonable size for V, given some samples?
Choosing a large V helps to get a reliable approximation of PV = kN,V/N, because many samples fall in A. But unfortunately, the approximation PV/V of the density becomes too coarse. In the extreme case, the neighborhood A = M equals the whole support. Then k = N, because all the samples must fall in M, and Pm = 1 is a perfect approximation, but the moving average degenerates to the uniform distribution. In contrast, choosing a small V is appropriate for getting a good local approximation of the density, provided that PV ≈ k(N,V)/N can still be reliably estimated. This becomes more difficult the smaller V is chosen, because the event of a sample falling in A becomes more unlikely. In the extreme case, there is no sample at all, and so PV = 0. Then the approximate
density is ragged, with areas that are constantly zero and small peaks around each
sample.
Informally, the volume V must not diminish too fast with respect to N. In Section 5.1, a proof of convergence is presented if V⁻¹ = O(√N). Two approaches are well established in practice. The Parzen window method assigns the volume V ∝ 1/√N and k is estimated from the sampling. The k-nearest neighbor method assigns k ∝ √N and the volume V is estimated from the sampling, i.e., the neighborhood around each point is blown up until exactly k points are included.
Figure 5.2 shows an example of a comparison between the Parzen window method
and the k-nearest neighbor method. The samples (blue points) are drawn uniformly
within the unit disk with radius r = 1. The center point of the neighborhood is m = 0
and A0 is chosen to be a disc (red line). Note that for the Parzen window method, the radius decreases with N^(−1/4), because the area of the disk is proportional to N^(−1/2).
5.1 The Parzen window method

The Parzen window method assigns the volume of the neighborhood with respect to the sample size. For now, the neighborhood is chosen to be a simple d-dimensional cube with edge length hN and volume VN = hN^d. To this end, let
φ(u) = rect(u) := 1 if |uj| ≤ 1/2 for all j = 1, . . . , d, and 0 else,   (5.12)
denote the indicator function of the unit cube centered at the origin. Then u ↦ φ((m − u)/hN) denotes the indicator function of a cube centered at m with edge length hN. Let
m1 , . . . , mN denote the samples. For each m ∈ M, the number of samples within its
neighborhood can be counted by
kN(m) = ∑_{i=1}^N φ((m − mi)/hN).   (5.13)
p̂N(m) = (1/N) ∑_{i=1}^N (1/VN) φ((m − mi)/hN) = (1/N) ∑_{i=1}^N δN(m − mi)   (5.15)
The window function φ satisfies this condition as it is constantly zero outside the unit
cube.
The symbol δ N for the sequence of scaled window functions in Equation (5.15) was
not chosen arbitrarily, but serves to highlight a connection with Dirac sequences.
The first two requirements state that δ N is a formal density function. The third require-
ment demands that all the probability is eventually concentrated in an arbitrary small
neighborhood around the origin. In other words, the δ N approach the Dirac distribu-
tion δ. Unlike δ, however, all the δ N are regular distributions. Quite often, δ is even
defined as the weak limit of a Dirac sequence. In this case, the convergence holds by
definition.
Here, δN(m) = (1/VN) φ(m/hN) was initially chosen to be the uniform distribution over a rectangular neighborhood. A natural generalization is to replace the uniform window function by a Gaussian density. To this end, redefine

φ(u) := exp{−(1/2)‖u‖²}

and set

δN(m) = (1/VN) φ(m/hN) = (1/((2π)^(d/2) hN^d)) exp{−‖m‖²/(2 hN²)}.   (5.18)

Here, VN = (2π)^(d/2) hN^d denotes the volume of the (infinite) support of the window function scaled by hN. Hence, the parameter hN can be understood as controlling the variance of the Gaussian window function.
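A minimal sketch (not from the book) of the Parzen window estimate of Equation (5.15) with the Gaussian window of Equation (5.18) for a one-dimensional sample; the sample and the spread hN are made-up example values.

```python
import numpy as np

# Minimal sketch (not from the book): Parzen window density estimate with a
# Gaussian window for d = 1.
rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=100)     # m_1, ..., m_N (made-up sample)
h_N = 0.5                                    # spread of the window (assumed value)
d = 1
V_N = (2.0 * np.pi) ** (d / 2) * h_N ** d    # volume of the scaled Gaussian window

def parzen_estimate(m):
    """p_hat_N(m) = 1/N * sum_i (1/V_N) * phi((m - m_i)/h_N)."""
    u = (m - samples) / h_N
    return np.mean(np.exp(-0.5 * u ** 2) / V_N)

grid = np.linspace(-4.0, 4.0, 9)
print(np.round([parzen_estimate(m) for m in grid], 3))   # roughly bell-shaped values
```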
The estimated density p̂ N (m) can be regarded as a random quantity p̂ N (m) on
its own if the samples that support the density are considered as random variables.
Hence, it is possible to examine its expectation μ N (m) and variance σ2N (m) point-wise
for every m. The estimated density p̂ N (m) converges to the true density p(m) in terms
of mean squared error if
lim_{N→∞} VN = 0,   (5.23)
lim_{N→∞} N VN = ∞.   (5.24)
Note that the second condition is actually a repetition of the definition of a window
function in Equation (5.16). The first condition forces the window function to be modest
in the neighborhood of the origin. The last two conditions force the spread of the
window function to vanish but not faster than the number of samples increases.
The expectation is calculated in two steps, because the first equality is needed
again later. For any i = 1, . . . , N,
μN(m) = E{p̂N(m)} = (1/N) ∑_{i=1}^N E{δN(m − mi)} = [δN ∗ p](m)   (5.26)
follows. As the Dirac distribution is the neutral element with respect to convolution, lim_{N→∞} μN(m) = lim_{N→∞} [δN ∗ p](m) = [δ ∗ p](m) = p(m) yields the desired result. Note that in the above line, the limit and the integral were
silently swapped. This step used the requirement Equation (5.22).
The calculation of the variance uses Equation (5.26) and exploits the fact that the
variance of a sum of independent variables is the sum of the individual variances:
σN²(m) = Var{p̂N(m)} = (1/N²) ∑_{i=1}^N Var{δN(m − mi)}
        = (1/N²) ∑_{i=1}^N [E{δN(m − mi)²} − (E{δN(m − mi)})²]
        ≤ (1/N) ∫_M δN²(m − u) p(u) du
        = (1/N) ∫_M (1/VN) φ((m − u)/hN) δN(m − u) p(u) du
        ≤ (sup_u φ(u))/(N VN) ∫_M δN(m − u) p(u) du = (sup_u φ(u))/(N VN) μN(m),   (5.28)

where the first equality uses the i.i.d. assumption, the first inequality drops the non-negative term (E{δN(m − mi)})² = μN(m)², the following equality uses (5.14), and the last step uses (5.21) and (5.26).
Fig. 5.3. Application to the reference example of Section 3.3.2. Decision regions of a classifier with Parzen window density estimators (hN ≈ 0.5, N = 100 for either class) with Gaussian window function. The training error is etrain = 6 %; the testing error is etest = 6.5 %, but asymptotically approaches etest ≈ 7 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
With hN = h · N^(−1/(2d)), so that VN = (2π)^(d/2) hN^d ∝ 1/√N, the estimate with Gaussian windows becomes

p̂N(m) = (1/(N (2π)^(d/2) hN^d)) ∑_{i=1}^N exp{−‖m − mi‖²/(2 hN²)}
       = (1/(√N (2π)^(d/2) h^d)) ∑_{i=1}^N exp{−(N^(1/d)/(2 h²)) ‖m − mi‖²}.   (5.30)
The Parzen window method can be used to estimate the class-specific feature dis-
tributions p(m | ω i ), i = 1, . . . , c, which are in turn used in an MAP classifier. Figure 5.3
shows the decision regions of the ongoing example, where the class-specific densities
p(m | ω1 ) and p(m | ω2 ) were estimated using Parzen windows. One can see that the
decision boundary comes very close to the decision boundary of the Bayesian optimal
classifier, and with 7 %, the asymptotic test error is only slightly larger than the 6.16 %
of the optimal classifier. Note, however, that the outcome very much depends on the
number and the location of the training samples, as well as the dimensionality of the
feature space itself (see Section 6.1).
In summary, the Parzen window method is characterized by three crucial traits.
Universality The Parzen window method does not require any prior knowledge about
the probability distribution. Even sophisticated multi-modal distributions can be
estimated.
Choice of parameter Although the theoretical convergence to the true density holds
in any case, the quality of the result in practice depends heavily on the initial
choice of the volume V or the associated spread h.
Data-independent size of neighborhood For a fixed sample size N, any point of the
feature space is covered by the same neighborhood, independently of the sample
density.
Figures 5.4 and 5.5 depict the estimation of a normal distribution with a Gaussian
window function for different sample sizes N and initial spreads h for m ∈ ℝ and
m ∈ ℝ2 , respectively.
5.2 The k-nearest neighbor method

Similar to the Parzen window method, the k-nearest neighbor method aims to estimate a density in virtue of

p̂(m) = (k/N)/V = k/(NV).   (5.31)
But in contrast to the Parzen window method, the number of considered samples k
only depends on the total number of samples N and instead the volume V is fitted so
that exactly k samples fall in the neighborhood of m. Review Figure 5.2 for a graphical
illustration of the difference between both methods.
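A minimal sketch (not from the book) of the k-nearest neighbor density estimate of Equation (5.31): the volume V is that of the smallest ball around m containing the k nearest samples. The sample and the choice of k are made-up example values.

```python
import numpy as np
from math import gamma, pi, sqrt

# Minimal sketch (not from the book): k-NN density estimate p_hat(m) = k / (N * V).
rng = np.random.default_rng(5)
samples = rng.normal(0.0, 1.0, size=(256, 2))   # N = 256 two-dimensional samples
N, d = samples.shape
k = int(round(sqrt(N)))                         # k proportional to sqrt(N)

def knn_density(m):
    dists = np.sort(np.linalg.norm(samples - m, axis=1))
    r = dists[k - 1]                                # radius to the k-th nearest neighbor
    V = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d   # volume of a d-ball of radius r
    return k / (N * V)

print(knn_density(np.array([0.0, 0.0])))   # roughly 1/(2*pi) ~ 0.16, the N(0, I) mode
print(knn_density(np.array([3.0, 3.0])))   # in the tail, much smaller
```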
The neighborhood of a point m can be thought of as a cell centered at m that is
inflated until it contains k N samples; the cell is small if the neighborhood is dense
and large if the neighborhood is sparsely populated. Unfortunately, this intuition is
Fig. 5.4. Parzen window density estimation with a Gaussian window function for varying sample sizes N ∈ {1, 10, 50, 100} (blue curves) and spreads h ∈ {1.0, 0.5, 0.1} with m ∈ ℝ; the true density is drawn in red.
Fig. 5.5. Parzen window density estimation with a Gaussian window function for varying sample sizes N and spreads h ∈ {2.0, 1.0, 0.5} with m ∈ ℝ²; the true density is a single Gaussian with µ = 0 and Σ = I, p(m) = N(µ, I) (not shown). Note that the scale of the applicate is different in each row.
Fig. 5.6. k-nearest neighbor density estimation for varying sample sizes N and k = √N (blue curve); the true density is drawn in red. Panels: N = 1, k = 1; N = 16, k = 4; N = 64, k = 8; N = 256, k = 16.
lim_{N→∞} kN = ∞,   (5.32)
lim_{N→∞} kN/N = 0   (5.33)

are sufficient conditions for p̂(m) → p(m) in probability for every m where p(m) is continuous.
The first condition ensures that kN/N is a good local approximation of the probability (the stochastic error vanishes); the second condition ensures that k grows sufficiently slowly so that the volume of the cell becomes zero (the systematic error vanishes).
Instead of reproducing the proof, let us highlight a substantial difference from the Parzen window method: when the Parzen window method is employed with a differentiable window function, each function of the sequence is differentiable, too. Moreover, every approximation fulfills the formal requirements of a density.
with one pole at m1 (here, α denotes the volume of the d-dimensional unit sphere).
For k > 1, the volume of the neighborhood of m changes smoothly with respect to m,
as long as the k samples that define the neighborhood stay the same. If m moves into
the area of influence of a different sample, the volume of the neighborhood changes in
a non-differentiable way. Generally, the points of nondifferentiability of the estimated
function do not match the samples, but are placed in between.
Moreover, in the 1-dimensional case, i.e., m ∈ ℝ, the integral of each approximated density is not equal to one but diverges to infinity even for k > 1. Far away from the finite number of samples N, every approximation asymptotically behaves like m ↦ 1/m, and therefore the integral becomes infinite. This means that although the density estimate converges to the true density function point-wise in probability, the approximation is not a density on its own.
5.3 k-nearest neighbor classification

If ki of the k samples inside the cell around m belong to class ωi, then

P̂(m, ωi) = (ki/N)/V = ki/(NV)   (5.35)

is an estimator of the joint probability. Applying Bayes' law leads to

P̂(ωi|m) = P̂(m, ωi) / ∑_{j=1}^c P̂(m, ωj) = ki/k.   (5.36)
The appealing point of the last line is that it not only provides the a posteriori prob-
ability directly, but that it does not suffer from the technical difficulties of the previous result. Because there are only finitely many classes, P̂(ω|m) is not a probability density
function, but a probability mass function and sums to one. Moreover, the volume V
and the number of samples N that caused those difficulties cancel out.
Together with the usual maximum a posteriori rule, the estimated class is the
class of the most frequently represented samples in the neighborhood. For k = 1, this
classifier is called the nearest neighbor classifier. The formal decision rule is
ω̂(m) = ωi  ⇔  arg min_{mj∈D} ‖m − mj‖ ∈ Di,   (5.37)
where α is a normalization constant that ensures that the ki(m) sum to one:

α := 1 / ∑_{j=1}^c kj(m).   (5.39)
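A minimal sketch (not from the book) of k-nearest neighbor classification: the estimated a posteriori probabilities are the class frequencies ki/k among the k nearest training samples, and the decision is the majority class. Training data are drawn from two made-up Gaussian classes.

```python
import numpy as np

# Minimal sketch (not from the book): k-NN classifier using Eq. (5.36).
rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)      # class labels: omega_1 -> 0, omega_2 -> 1
k = 5

def knn_classify(m):
    nearest = np.argsort(np.linalg.norm(X - m, axis=1))[:k]
    counts = np.bincount(y[nearest], minlength=2)   # k_i for each class
    return counts.argmax(), counts / k              # decision and P_hat(omega_i | m)

print(knn_classify(np.array([0.5, 0.2])))   # -> (0, [1.0, 0.0]) or similar
print(knn_classify(np.array([2.5, 2.8])))   # -> (1, ...)
```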
Plotting the decision boundaries in the feature space yields a Voronoi tessellation
(see Figure 5.7). The only free design parameter is the scaling of the individual com-
ponents, or, equivalently, the choice of the metric. This parameter heavily influences
which neighbor is considered to be “near.” Figure 5.8 illustrates this effect for a fixed
standard Euclidean metric but different scales of the first axis.
A natural extension is to base the classification decision not only on the nearest
neighbor but on the majority class of several neighbors, i.e., assign the most frequent
class among the k > 1 neighbors. Let Ak(m) denote a neighborhood of m that includes exactly the k nearest samples.
P∗ = R = 1 − ∫_M P(ω̂(m), m) dm.   (5.41)
In the case of the nearest neighbor classifier, the error probability depends on the num-
ber of samples being drawn. The classification is erroneous if the sample mi nearest to
the test sample m has a different class than the true class of m. Let P N denote the error
probability of the nearest neighbor classifier with N samples. We state without proof
that P = limN→∞ P N exists and call P the asymptotic error probability of the nearest
neighbor classifier. With c classes, both error probabilities are at most (c − 1)/c, because
this is the probability of being wrong if one merely guesses. The optimal Bayes error
probability P∗ is a lower bound for P. But it is also possible to bound the error probabil-
ity of the nearest neighbor classifier from above in terms of P∗ . Cover and Hart [1967]
Fig. 5.10. Decision regions of a nearest neighbor classifier with k = 1. The training error is etrain = 0 % by definition of the classifier. The testing error is etest = 10 %, and asymptotically approaches etest ≈ 9.4 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Fig. 5.11. Decision regions of a nearest neighbor classifier with k = 3. The training error is etrain = 4.5 %. The testing error is etest = 8 %, and asymptotically approaches etest ≈ 7.5 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Fig. 5.12. Decision regions of a nearest neighbor classifier with k = 5. The training error is etrain = 7 %. The testing error is etest = 6 %, and asymptotically approaches etest ≈ 7.1 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Fig. 5.13. Asymptotic error bounds of the nearest neighbor classifier: the asymptotic error P as a function of the Bayes error P∗, bounded by P = P∗ from below and by P = P∗(2 − (c/(c−1)) P∗) from above; the line P = 2P∗ and the maximal error (c−1)/c are shown for reference.
If P∗ = 0, then the supports of the class-specific feature distributions must be disjoint. Hence, for infinitely many samples, the nearest sample asymptotically has the correct class and the nearest neighbor classifier always decides correctly. In contrast, if P∗ = (c − 1)/c, then the Bayes classifier is not better than guessing and the nearest neighbor classifier is not worse.
By dropping the last term in Equation (5.42), the weaker upper bound P ≤ 2P∗ (5.43) is obtained.
starting points to control the quality of the approximation. Even worse, in the most
general case, examples can be constructed such that convergence is arbitrarily slow
and not even monotonic.
5.4 Exercises
(5.1) Given the mapping y = A(x) = x², why is the inference from y to x an ill-posed inverse problem?
(5.2) One wishes to establish a functional relation between two scalar measurements
x and y. Considering the underlying physics, it is known that the relation must
be linear. Suppose that there are N > 2 noisy measurements (x i ,y i ), i = 1, . . . , N.
The task is formulated as follows:
Find the parameters a,b ∈ ℝ of a straight line that interpolates the data points, i.e., y i =
a x i + b for all i = 1, . . . ,N.
Why is this (inverse) problem ill-posed? How can the task be reformulated so that
the inverse problem is well-posed?
(5.3) Suppose there are given six mappings y = A i (x), i = 1, . . . , 6 with the properties
shown in the table below. For which of the mappings A i is the inference from y to
x an ill-posed inverse problem?
Property                         Mapping A1 A2 A3 A4 A5 A6
Ai⁻¹ is well defined             × × × × ×
Ai⁻¹ is injective                × ×
Ai is injective                  × × × × ×
Ai⁻¹ is surjective               × × × ×
Ai is surjective                 ×
Ai⁻¹ is continuous               × × × ×
Ai⁻¹ is linear                   × × × ×
x = Ai⁻¹(y) is unique            × × × ×
D = {8.1, 8.9, 7.6, 9.7, 12.2, 7.1, 10.4, 9.3, 14.9, 10.1},
D = {8.0, 8.5, 7.6, 9.7, 12.2, 7.1, 10.5, 9.3, 14.9, 10.0},
(5.6) Use the following sample D = {m1 , . . . , m6 } to graphically classify the points
m1 = (−2,2)T , m2 = (2,0)T and m3 = (−1,−5)T using the nearest neighbor method:
m1 = (0, 0)ᵀ,  m2 = (4, 2)ᵀ,  m3 = (2, 6)ᵀ,
m4 = (4, −4)ᵀ,  m5 = (−6, −6)ᵀ,  m6 = (−4, 2)ᵀ.
6 General considerations

Section 2.7 introduced techniques to reduce the dimension of a feature space but lacked an explanation of the reasons why a small number of dimensions is favorable. The introduction of this book established some design principles for how to select features and referred to the “curse of dimensionality” (see Section 1.4, page 8), but did not give an explanation of the term. This section will fill this gap.
We begin with the exact opposite and first give an example that seems to support the commonsense (but false) belief that a large number of features should lead to better classification. After that, this belief is disproved and it will be shown why the example is misleading.
Recall an example from Section 3.3.4. The number of classes is c = 2 and both class-specific feature distributions are p(m|ωi) = N(m; µi, Σ) with shared covariance matrix Σ. Moreover, the a priori distribution P(ω1) = P(ω2) = 1/2 is assumed. As already known from Equation (3.56), the decision boundary of the corresponding classifier is a hyperplane given by

Λ(m) = (Σ⁻¹(µ1 − µ2))ᵀ (m − (1/2)(µ1 + µ2)) = 0   (6.1)

and the decision rule is

ω̂(m) = ω1 if Λ(m) > 0, and ω2 else.   (6.2)
Putting the random variable m into Λ(m) makes this a Gaussian distributed ran-
dom variable itself, because it is a linear transformation of m. Thus, it is possible to
calculate the conditional expectation and variance with respect to the true class. There
follows

E{Λ(m) | ω = ω1} = E{(Σ⁻¹(µ1 − µ2))ᵀ (m − (1/2)(µ1 + µ2)) | ω = ω1}
                 = (µ1 − µ2)ᵀ Σ⁻¹ (E{m | ω = ω1} − (1/2)(µ1 + µ2))
                 = (1/2)(µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2),   (6.3)

and likewise ω = ω2 yields

E{Λ(m) | ω = ω2} = −(1/2)(µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2).   (6.4)
s := ‖µ1 − µ2‖_M = √((µ1 − µ2)ᵀ Σ⁻¹ (µ1 − µ2))   (6.6)

as the Mahalanobis distance w.r.t. Σ⁻¹ between the expectation values µ1 and µ2. In summary, this leads to
The last line shows that the error probability vanishes (R → 0) if the Mahalanobis
distance of the expectation values increases (s → ∞). Until now, no assumption about
the dimension d of the feature space has been made. The mutual positions of the ex-
pectation values cannot be easily changed, because they are given by the nature of the
application, but one could try to put more features into the feature vector and thereby
increase its dimension. As long as these additional features contain new information
about the problem, the Mahalanobis distance is increased. In mathematical terms,
s → ∞ if d → ∞. Note that simply duplicating components does not increase the
Mahalanobis distance: the increase depends on the correlation between the existing
and the additional features. In consequence, one could argue the more data the better,
or, at greater length, the more data, the higher the dimension, the greater the distance,
the lower the error. Figure 6.1 seems to support this statement. The plot indicates the
support of two uniform distributions in different dimensions. Under a projection onto
the first dimension, both supports overlap each other by one-half. This overlapping
region is responsible for a false classification. In two dimensions, the overlapping re-
gion of the rectangles only counts for one-quarter of the area. So the proportion of
possible false classifications declines. In three dimensions, the cubes can be perfectly
separated.
Although this seems to support the belief that a larger number of dimensions
improves the classification, the statement is generally wrong in real-world applica-
tions. The conclusion is only true if the class-specific feature distributions are perfectly
known or if there were infinitely many training samples.
Consider another, this time factual, example that illustrates the real effect of in-
creasing the number of dimensions (Beyerer [1994]). The task was to automatically
assess the quality of honed surfaces of cylinders given a catalog of N = 33 pictures
of such a surface with manually assigned grades on an ordinal scale between 1 and
10. To solve the problem, 25 heuristic and model-driven features were defined. The
idea for classification was to estimate the grade n by a linear regression n̂ = Am
with the smallest mean square error. A feature selection according to Section 2.7.5
showed that increasing the number of features improved the classification result at
the beginning, but beyond a certain point, additional features increased the error again
(see Figure 6.2).

Fig. 6.2. Dependence of the error rate on the dimension of the feature space in Beyerer [1994]: the estimation error is plotted against the number of features d (5 to 25).
This phenomenon can be best understood by the following allegory. Each feature
bears some net payload and some irrelevant payload that counts as a disturbance.
As long as a new feature adds more important information into the system than dis-
turbance, the classification performance increases. But if all the net payload that a
feature potentially could add is already included by the existing features, then only
the disturbance goes on top, and the performance degrades. To rectify the misleading
perception, the concept of interval probability will be introduced with the help of an
example. Let m ∼ N(µ, σ2 I) with dimension d ∈ ℕ. As all components are indepen-
dent, the probability that a sample falls in the d-dimensional cube with edge length
4σ around its expectation value equals

P = ∏_{i=1}^{d} P(|m_i − µ_i| ≤ 2σ) = (Φ(2) − Φ(−2))^d ≈ 0.95^d.

This is a strictly decreasing function with respect to d. For d = 1, the bulk of the
samples lies within the interval [µ1 − 2σ, µ1 + 2σ]. But for d = 100, it follows that
P ≈ 0.95^100 ≈ 0.0059, and only a small fraction of the samples lies within the cube.
Generally, the higher the dimension, the more sparsely the samples are scattered.
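The numbers above follow directly from the per-component probability; a minimal sketch (using only the Python standard library):

from math import erf, sqrt

p1 = erf(2 / sqrt(2))     # P(|m_i - mu_i| <= 2*sigma) for one component, ~0.9545

for d in (1, 2, 10, 100):
    print(f"d = {d:3d}:  P(cube) ~ {p1 ** d:.4f}")
# d =   1:  P(cube) ~ 0.9545
# d = 100:  P(cube) ~ 0.0095   (with the rounded value 0.95 per component: ~0.0059)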
This leads to the notion of sample density, to conceptualize this statement more
precisely. Given a finite number of samples N_i per class, the smallest axis-aligned
enclosing cuboid is considered (see Figure 6.3). The sample density is defined as
the number of samples per unit volume, γ_i := N_i / V_i. So as to better compare the cuboids,
and abstract from the different ratios of the edges, the geometric mean of the edge
lengths, s_i := (∏_{j=1}^{d} s_ij)^{1/d}, is used. Then γ_i = N_i / V_i = N_i / (s_i)^d. In order to keep the density
constant as the dimension grows, the number of samples would have to grow exponentially with d.
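The collapse of the sample density can be observed directly; a minimal sketch (assuming numpy; the number of samples and the use of a standard normal distribution are arbitrary choices) computes γ = N/(s)^d for the enclosing cuboid of random samples:

import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # a fixed number of samples per class

for d in (1, 2, 3, 5, 10):
    samples = rng.standard_normal((N, d))
    edges = samples.max(axis=0) - samples.min(axis=0)   # edge lengths s_ij of the enclosing cuboid
    s = np.exp(np.log(edges).mean())                    # geometric mean of the edge lengths
    print(f"d = {d:2d}:  s = {s:4.2f},  density gamma = N / s^d = {N / s ** d:.3g}")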
Fig. 6.3. Density of a sample for feature spaces of increasing dimensionality (d = 1, 2, 3). In each plot, the number of samples per class is the same.
Fig. 6.4. Examples of feature dimension d and parameter dimension q. If q = d, the decision boundary is linear; larger q enables more complicated decision boundaries.
⇒ ∇_µ ln p(m) = (1/σ²)(m − µ)
⇒ (∇_µ ln p(m))(∇_µ ln p(m))^T = (1/σ⁴)(m − µ)(m − µ)^T
⇒ J(µ) = E{(∇_µ ln p(m))(∇_µ ln p(m))^T} = (1/σ⁴) E{(m − µ)(m − µ)^T} = (1/σ²) I.    (6.12)

So the inverse of the Fisher information matrix is J^{-1}(µ) = σ² I and it follows that

tr Cov{µ̂} ≥ (1/N) tr J^{-1}(µ) = qσ²/N → ∞   (q → ∞).    (6.13)
The trace of the covariance matrix is the sum of the variances of the individual
components, and grows linearly in the number of parameters. Unfortunately, the trace has no direct geometric
interpretation. But since in this example the features are all pairwise independent, the
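The bound of Equation (6.13) can be checked with a small Monte Carlo experiment; a minimal sketch (assuming numpy; the sample size, the number of trials, and σ are arbitrary) estimates µ by the sample mean and compares the empirical trace of the covariance with qσ²/N:

import numpy as np

rng = np.random.default_rng(1)
sigma, N, trials = 1.0, 50, 1000

for q in (1, 5, 25, 100):
    # maximum likelihood estimate of mu (the sample mean) from N observations, repeated many times
    estimates = rng.normal(0.0, sigma, size=(trials, N, q)).mean(axis=1)
    tr_cov = estimates.var(axis=0, ddof=1).sum()      # empirical tr Cov{mu_hat}
    print(f"q = {q:3d}:  tr Cov ~ {tr_cov:6.3f}   (bound q*sigma^2/N = {q * sigma**2 / N:.3f})")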
Learning (Estimation)
k_i(m) = −(1/2) (m − µ̂_i)^T Σ̂_i^{-1} (m − µ̂_i) − (d/2) ln 2π − (1/2) ln det Σ̂_i + ln P(ω_i)    (6.15)

The individual terms require estimating µ̂_i (O(dN) operations), Σ̂_i (O(d²N)), and P(ω_i) (O(N)).
The covariance matrix Σ̂ has d(d+1)/2 distinct entries, hence the estimation is asymp-
totically dominated by O(d²N). The necessary matrix inversion Σ̂^{-1} requires O(d^{2.4})
operations, but d^{2.4} < d²N due to the fact that d < N. Therefore, the overall complexity
to determine k_i is O(d²N). As there are i = 1, . . . , c such decision functions, the total
cost is O(cd²N).
Classification
k_i(m) = −(1/2) (m − µ̂_i)^T Σ̂_i^{-1} (m − µ̂_i) − (d/2) ln 2π − (1/2) ln det Σ̂_i + ln P(ω_i)    (6.16)

Evaluating the quadratic form requires O(d²) operations; the remaining terms are precomputed constants, i.e., O(1) each.
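The estimation and evaluation steps of Equations (6.15) and (6.16) can be written down compactly; the following is a minimal sketch (assuming numpy; the function names and the toy data are illustrative only):

import numpy as np

def learn_gaussian_classifier(M, labels, c):
    """Estimate the parameters of Equation (6.15) for each class.
    Estimating mu_i is O(dN), Sigma_i is O(d^2 N), and P(omega_i) is O(N)."""
    params, N = [], len(labels)
    for i in range(c):
        Mi = M[labels == i]
        mu = Mi.mean(axis=0)
        Sigma = np.cov(Mi, rowvar=False)
        params.append((mu, np.linalg.inv(Sigma),
                       np.log(np.linalg.det(Sigma)), np.log(len(Mi) / N)))
    return params

def classify(params, m, d):
    """Evaluate Equation (6.16) for each class; the quadratic form costs O(d^2)."""
    scores = []
    for mu, Sigma_inv, logdet, logprior in params:
        diff = m - mu
        k = -0.5 * diff @ Sigma_inv @ diff - 0.5 * d * np.log(2 * np.pi) \
            - 0.5 * logdet + logprior
        scores.append(k)
    return int(np.argmax(scores))

# toy usage with two 2-dimensional Gaussian classes
rng = np.random.default_rng(2)
M = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([3, 3], 1, (100, 2))])
labels = np.repeat([0, 1], 100)
params = learn_gaussian_classifier(M, labels, c=2)
print(classify(params, np.array([2.5, 3.1]), d=2))   # expected: 1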
6.2 Overfitting
The term overfitting denotes a phenomenon that generally occurs when a model with
a large number of parameters is fit to a set with too few samples. After the model is
chosen and the number of parameters is fixed, the remaining objective is to minimize
the classification error over the dataset. If the model is powerful enough, the error
with respect to the specifically given dataset can be reduced to zero: the model learns
the data by heart (see Figure 6.5b). However, this usually does not coincide with a
good general solution and the classification error on new and unseen samples will be
large. An overly simple model, on the other hand, is not able to sufficiently reduce
the error at all (see Figure 6.5a), because it lacks the necessary flexibility to match the
data. The ability to achieve a low error rate on both the training data and the testing
data is called generalization.
In order to check whether a chosen model fits the problem, the dataset can be
divided into a training set D and a test set T (see Figure 1.5). The training set is used to
estimate the parameters of the model, the test set is used to assess the model’s perfor-
mance and ability to generalize. Nonetheless, such a check can never be a strict proof
that the model is the correct one, but only a test for plausibility. Hence, the question
remains how to find the right model. In Figure 6.5c, the optimal decision boundary (of
a Bayesian classifier) can be given, because the example was artificially created and
the underlying model from which the data set was generated was known. In reality,
this is hardly ever true. Hence, the viable approach is to employ Occam’s Razor. This
principle states that among different competing hypotheses that are equally consistent
with the given data, the hypothesis with the fewest assumptions should be selected.
Figure 6.5 illustrates the effect of overfitting by means of an example. The decision
boundary tries to optimally separate the classes, within the limits of its ability.
The next example is of overfitting in the context of regression analysis. Let f(x) =
(1/2)x² − x and y = f(x) + r, where r is some Gaussian distributed noise. Five samples
(x_1, y_1), . . . , (x_5, y_5) are given and it is only known that x and y are governed by some
polynomial rule. The task is to find the best estimate f̂ with f̂(x) = ∑_{i=0}^{k} a_i x^i. For order
k = 4, the regression f̂ is able to perfectly fit the given samples, but overall the regression
with order k = 2 resembles the true function much better, even though it exhibits a small,
nonzero training error (see Figure 6.6). If there had been a sixth sample (x_6, y_6) and both
regression functions had been kept fixed, the quadratic polynomial would very likely
have been a better fit.
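This behavior is easy to reproduce; a minimal sketch (assuming numpy; the sample positions, noise level, and random seed are arbitrary, so the exact numbers will vary):

import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 0.5 * x**2 - x                        # the true underlying function

x = np.array([-1.0, 0.5, 2.0, 3.0, 4.5])            # five training samples ...
y = f(x) + rng.normal(scale=0.5, size=x.size)       # ... disturbed by Gaussian noise

grid = np.linspace(-1.0, 4.5, 200)                  # unseen positions inside the sample range
for k in (2, 4):
    coeffs = np.polyfit(x, y, deg=k)                # least-squares polynomial of order k
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_mse = np.mean((np.polyval(coeffs, grid) - f(grid)) ** 2)
    print(f"k = {k}: training MSE = {train_mse:.3f},  deviation from f = {true_mse:.3f}")
# The order-4 fit interpolates the samples (training MSE ~ 0), but for most noise
# realizations its deviation from the true function is larger than that of the order-2 fit.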
Fig. 6.5. Trade-off between generalization and training error: (a) linear decision boundary: simple model, but large training error; (b) overfitting with a highly flexible decision boundary: the training error is zero; (c) optimal decision boundary. The classifier should neither be too simple to represent the underlying classes, nor so complex that it fails to generalize from the training data.
Fig. 6.6. Overfitting in a regression scenario. Two polynomials of order k = 2 and k = 4 are fitted to samples that were generated from a polynomial of order k = 2 (the underlying function).
But if there had been many more samples (possibly infinitely many), both regres-
sion functions would eventually converge to the true function. Hence, the model of
order k = 4 is not generally worse than the model of order k = 2, it is only worse for
a small number of samples. This leads to the following rules of thumb, which are in
accordance with Occam’s Razor:
– The smaller the dataset, the simpler the model should be, and
– the higher the number of parameters of the model (or classifier), the more samples
are required.
6.3 Exercises
where g k (m) denotes a Gaussian density? How many parameters need to be esti-
mated when using a Parzen window method instead?
(6.2) Given two classes ω1 and ω2 that are to be classified using four-dimensional
features m ∈ ℝ4 , how many parameters must be estimated when using a linear
classifier? How many parameters must be estimated for a maximum a posteriori
classifier, under the assumption that the features are class-conditionally normally
distributed, i.e., p(m | ω c ) = N(µc ,Σc ), c = 1,2?
(6.3) A micro-controller for Internet of Things applications can only save up to 256
parameters. You are tasked to use this micro-controller for classification in a six-
dimensional feature space.
1. How many linear classifiers can be realized using this micro-controller? How
many parameters will remain unused?
2. How many classes can be separated using a maximum a posteriori classifier
with multivariate Gaussian distributions as class-dependent feature distribu-
tions? How many parameters will remain unused?
In other words, what is the smallest dimension d in which more than 90% of the
probability mass will be outside of the hypercube [−2,5]d ?
7 Special classifiers
The remaining chapters of this book collect some further topics of pattern recognition.
Except for Section 9.4, these chapters deal with certain important classifier methods.
In contrast to the techniques of Chapters 3 to 5, these classifiers do not estimate a
distribution first, but try to find a classification rule directly from the given data instead.
This means that these classifiers follow the transduction path from Figure 5.1.
7.1 Linear discriminants

A linear discriminant is a linear function that operates on the feature space. With two
classes (c = 2), w ∈ ℝd and b ∈ ℝ, it is given by
k(m) = wT m + b (7.1)
ω̂ = ω̂(m) = { ω1        if k(m) > 0
            { ω2        if k(m) < 0          (7.2)
            { whatever  if k(m) = 0.
This means that the decision boundary is given by the (d − 1)-dimensional hyper-
plane H = {m ∈ 𝕄 | k(m) = 0}. Here, w is a vector perpendicular to the hyperplane and
b determines the distance to the origin. The oriented distance of a point to the plane
is given by D(m, H) = k(m), provided that w is normalized (‖w‖2 = 1).
The resemblance to the decision function of Chapter 3 is not an accident. In case
of a Bayesian classifier for normally distributed features with identical covariances,
the decision function is linear. However, here and in the following sections, the ap-
proach is reversed: instead of inspecting the decision boundary of a Bayesian classifier,
the decision boundary is explicitly stated. The parameters of the plane are directly
determined from the training samples in such a way that the classification error is
minimized. How to carry out such a minimization will be discussed in the following
sections.
Equation (7.2) shows the decision rule for the case of c = 2 classes, but the approach
can be extended to more classes as well. In the upcoming discussion, it is implicitly
assumed that the classes are linearly separable.
The most straightforward solution is to determine one hyperplane per class. Each
hyperplane divides the space into two half-spaces so that the samples that belong to
Fig. 7.1. Different techniques for extending linear discriminants to more than two classes: (a) one linear discriminant function per class; (b) one linear discriminant for each pair of classes. All these methods except for the linear machine introduce ambiguous regions.
the class fall in one half-space, while the samples that belong to the other classes fall
in the other half-space. An unseen sample is classified as belonging to class ω i if it
falls in the corresponding half-space, but not in a half-space that corresponds to the
other classes, i.e., as before
ω̂(m) = ω_i  ⇔  k_i(m) > 0 and k_j(m) < 0 for all j ≠ i.    (7.3)
The resulting decision regions are depicted in Figure 7.1a. A major drawback of
this approach is that a large volume of the feature space belongs to an ambiguous
region, where no classification is possible.
Another approach is to determine one hyperplane H_{i,j} with i, j = 1, . . . , c, i ≠ j,
for each pair of classes, resulting in c(c − 1)/2 linear discriminants. A sample is classified
as being in class ω i if its feature vector lies on the correct side of all hyperplanes that
separate ω i from the other classes ω j ,
ω̂(m) = ω_i  ⇔  k_{i,j}(m) > 0 for all j ≠ i.    (7.4)
This approach is depicted in Figure 7.1b and usually leads to smaller ambiguous re-
gions, at the cost of a more complicated classifier.
A third approach is given by the linear machine. As with the first approach, there
is only one linear discriminant k i for each class, but the decision rule is different: A
point is assigned to a class if the corresponding linear discriminant is larger than any
other:
ω̂(m) = ω_i  ⇔  k_i(m) > k_j(m) for all j ≠ i.    (7.5)
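A minimal sketch of the linear machine rule in Equation (7.5) (assuming numpy; the weight vectors, offsets, and the test point are arbitrary placeholders):

import numpy as np

def linear_machine(W, b, m):
    """Linear machine: one linear discriminant k_i(m) = w_i^T m + b_i per class,
    decide for the class with the largest value (Eq. (7.5))."""
    return int(np.argmax(W @ m + b))

# toy usage with c = 3 classes in a 2-dimensional feature space
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  0.0],    # w_2
              [ 0.0,  1.0]])   # w_3
b = np.array([0.0, 0.0, -1.0])
print(linear_machine(W, b, np.array([2.0, 0.5])))   # expected: 0 (class omega_1)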
Going back to the two-class case c = 2, there is another possibility of extending the
linear discriminants. Explicitly writing out the vectorized term,

k(m) = w^T m + b = ∑_{i=1}^{d} w_i m_i + b,    (7.7)

suggests that it can be extended by higher order combinations, e.g., quadratic terms,

k(m) = ∑_{j=1}^{d} ∑_{i=1}^{d} w_{ij} m_i m_j + ∑_{i=1}^{d} w_i m_i + b = m^T W m + w^T m + b.    (7.8)
Fig. 7.2. Nonlinear separation by augmentation of the feature space: (a) linear separation in the augmented, 3-dimensional feature space and (b) the corresponding nonlinear separation in the original 2-dimensional feature space. The purple surface in (a) shows the embedding of the original feature space, the orange plane is the decision boundary of the linear discriminant in ℝ³. The augmented feature vector is defined as y := (1, m1, m2, m1m2)^T and the parameters of the linear discriminant are a = (1, 0, 0, 1)^T.
k(m) = ∑_{i=0}^{d*} a_i y_i(m) = (a_0 , . . . , a_{d*}) (1, y_1(m), . . . , y_{d*}(m))^T = a^T y.    (7.9)
Fig. 7.3. Application to the reference example of Section 3.3.2. Decision regions of a linear regression classifier with augmented feature vector y = (1, m1, m2, m1m2, m1², m2²)^T. The training and testing errors are etrain = 8.5 % and etest = 8.5 %. The testing error asymptotically approaches etest ≈ 8.7 %. The training set is the same as in Figure 3.8. Test samples are shown with hollow marks.
Figure 7.3 shows the decision regions of a linear regression classifier with an aug-
mented feature vector (see the figure caption). By design of the feature vector, the decision
boundary is a conic section and is visually very similar to the decision boundary of the
Gaussian classifier in Figure 3.13. The linear regression classifier does not make any
explicit assumption about the density of the features. However, such assumptions are
implicit in the choice of feature augmentation.
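A minimal sketch of such an augmentation (assuming numpy; the parameter vector a is an arbitrary placeholder): a linear discriminant a^T y(m) in the augmented space describes a conic section in the original space.

import numpy as np

def augment(m):
    """Augmented feature vector y(m) = (1, m1, m2, m1*m2, m1^2, m2^2)^T,
    as used for the classifier shown in Figure 7.3."""
    m1, m2 = m
    return np.array([1.0, m1, m2, m1 * m2, m1**2, m2**2])

a = np.array([-9.0, 0.0, 0.0, 0.0, 1.0, 1.0])   # k(m) = m1^2 + m2^2 - 9
m = np.array([1.0, 2.0])
print(a @ augment(m))    # negative: m lies inside the circle of radius 3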
What remains are techniques to determine a separating hyperplane from the given
samples. The perceptron algorithm introduced by Rosenblatt [1957, 1962] serves as
the first example. A perceptron is a binary classifier (c = 2) and requires that the
training set D is linearly separable. The pseudocode to learn the classifier is shown in
Algorithm 7.1. For each sample mi , an indicator variable
z_i = { +1 if ω(m_i) = ω1
      { −1 if ω(m_i) = ω2          (7.10)
is introduced so that both correct and false classifications can be covered with a single
statement (see line 4 of Algorithm 7.1). This indicator variable is often found in binary
classifiers and will reappear in Section 7.7.
The perceptron algorithm starts with an arbitrary but fixed hyperplane, and it-
eratively constructs a sequence of hyperplanes until the training error is 0. This can
only work if the training data is linearly separable, but if it is, then the algorithm is
guaranteed to converge.
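Algorithm 7.1 is not reproduced here, but its core loop can be sketched as follows (assuming numpy; the stopping rule and the toy data are illustrative choices, and the exact bookkeeping may differ from the algorithm in the text):

import numpy as np

def perceptron(M, z, eta=1.0, max_epochs=1000):
    """Perceptron learning on a linearly separable set.
    M: (N, d) feature matrix, z: labels in {+1, -1} (z_i = +1 for omega_1)."""
    N, d = M.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for m_i, z_i in zip(M, z):
            if z_i * (w @ m_i + b) <= 0:      # sample on the wrong side (or on the plane)
                w += eta * z_i * m_i           # move the hyperplane towards the sample
                b += eta * z_i
                errors += 1
        if errors == 0:                        # training error is zero: stop
            return w, b
    raise RuntimeError("did not converge; data may not be linearly separable")

# toy usage
M = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
z = np.array([1, 1, -1, -1])
w, b = perceptron(M, z)
print(w, b)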
The speed of convergence depends on the samples, the norm of the normal vector w, and the
learning rate η. Novikoff [1962] showed that the influence of a single
update eventually vanishes and that the sequence of hyperplanes converges.
Theorem 7.1 (Novikoff’s Perceptron Theorem). Let D be linearly separable with a mar-
gin γ > 0. This means there exists a hyperplane given by w and b with ‖w‖ = 1 and
for all m_i ∈ D one has

z_i (w^T m_i + b) ≥ γ.    (7.11)
Fig. 7.4. Four steps of the perceptron algorithm (preliminary hyperplanes, final hyperplane, and an intuitively better hyperplane). The algorithm converged to a separating, but suboptimal hyperplane.
The hyperplane found by the perceptron algorithm depends on the initial hyperplane w0, b0, the learning rate η, and the order in which the samples
in D are processed. Theorem 7.1 already required the existence of a hyperplane with
margin γ. Surely, it would be desirable to find such a hyperplane with γ as large as
possible. This idea is picked up in Section 7.7.
The aim of linear regression is to find a linear function (see Equation (7.1)) that best
maps a set of input vectors to their corresponding output. Note that “best” is only
loosely defined and can, for example, mean minimal squared error, minimal absolute
error, or any other loss function.
In the context of pattern recognition, linear regression can be applied to learn a
linear decision function for each class ω i . The input is given by the dataset D and the
corresponding output is the (perfect) decision function, that is, the objective is to find
a decision function k_i(m) = a_i^T m with a_i ∈ ℝ^d that ideally satisfies

k_i(m) = a_i^T m = { 1 if ω(m) = ω_i
                   { 0 otherwise          (7.13)

for every i = 1, . . . , c. Given the training sample D, the optimization goals become

m_k^T a_i = z_ki := { 1 if ω(m_k) = ω_i
                    { 0 otherwise,        (7.14)
(m_1 , . . . , m_N)^T a_i = (z_1i , . . . , z_Ni)^T ,   i.e.,   M a_i = z_i ,    (7.15)

with M := (m_1 , . . . , m_N)^T and z_i := (z_1i , . . . , z_Ni)^T.
Writing out the squared error of Equation (7.16), taking the gradient with respect to a_i,
setting ∇_{a_i} = 0, and solving for a_i yields the optimal (in the sense of minimal squared
error) solution

â_i = (M^T M)^{-1} M^T z_i ,   i = 1, . . . , c.    (7.19)
The term (M^T M)^{-1} M^T is called a pseudo-inverse of the matrix M. Substituting this
result into Equation (7.13) gives the decision functions

k_i(m) = z_i^T M (M^T M)^{-1} m ,   i = 1, . . . , c.    (7.20)
As the pseudo-inverse and the feature vector m do not depend on ω_i, the entire decision
vector can be written as

k(m) = (z_1 , . . . , z_c)^T M (M^T M)^{-1} m.    (7.21)
Note that the decision function does not take into account an offset b. However,
an offset, as well as nonlinearities, can be included using the techniques discussed in
Section 7.1.2. The structure of the classifier remains the same.
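A minimal sketch of this classifier (assuming numpy; the function names and the toy training set are illustrative) computes the pseudo-inverse solution of Equation (7.19) and classifies by the largest decision function:

import numpy as np

def fit_linear_regression_classifier(M, labels, c):
    """Least-squares decision functions k_i(m) = a_i^T m (cf. Eq. (7.19)).
    M: (N, d) matrix of training features, labels: class indices 0..c-1."""
    N, d = M.shape
    Z = np.zeros((N, c))
    Z[np.arange(N), labels] = 1.0            # target z_ki = 1 iff m_k belongs to class i
    A = np.linalg.pinv(M) @ Z                # pseudo-inverse solution, shape (d, c)
    return A

def classify(A, m):
    return int(np.argmax(m @ A))             # assign the class with the largest k_i(m)

# toy usage with an augmented constant feature acting as offset (cf. Section 7.1.2)
M = np.array([[1.0, 0.2, 0.1], [1.0, 0.3, 0.2], [1.0, 0.9, 0.8], [1.0, 1.0, 0.9]])
labels = np.array([0, 0, 1, 1])
A = fit_linear_regression_classifier(M, labels, c=2)
print(classify(A, np.array([1.0, 0.25, 0.15])))   # expected: 0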
7.4 Artificial neural networks

Artificial neural networks are biologically inspired networks of artificial neurons and
synapses. The neurons sum the input from all incoming synapses, apply a nonlinear
function, and output the result of the computation to all outgoing synapses. Artificial
neural networks are modeled as directed graphs, where the nodes correspond to neu-
rons and the edges correspond to synapses. A special type of neural network is the
feed-forward network. These networks are directed acyclic graphs that are organized
into layers such that the neurons of one layer have outgoing edges only to the neurons
of the next layer, but not to neurons in the same or other layers. The first layer of a
feed-forward network is called the input layer and the last layer is called the output
layer. The layers in between are called hidden layers. The processing flow goes from
the input to the output layer, but not the other way around.
Figure 7.5 shows a feed-forward neural network with one hidden layer. The input
layer consists of (d+1) neurons, where d neurons distribute the features m1 , . . . ,m d to
the neurons of the hidden layer and one neuron outputs a constant level. Such neurons
are called bias neurons. The hidden layer consists of n neurons, where one neuron is,
again, a bias neuron and the other (n − 1) neurons compute the features from the
input layer. Finally, the output layer consists of c neurons, where in this example each
output computes the decision function k l for the corresponding class. Each synapse
is endowed with a weight. Here, w ji denotes the weight from the i-th input neuron to
the j-th hidden neuron, w̃ lj denotes the weight from the j-th hidden neuron to the l-th
output neuron, and b j and b̃ l denote the weights from the bias neurons.
Overall, this network computes the discriminant functions
k_l(m) = f( ∑_{j=1}^{n} w̃_lj f( ∑_{i=1}^{d} w_ji m_i + b_j ) + b̃_l ),   l = 1, . . . , c
       = f( w̃_l^T h + b̃_l )    with    h := ( f(w_1^T m + b_1), . . . , f(w_n^T m + b_n) )^T.    (7.22)
The activation function f(⋅) is not further specified, but a typical choice is the Fermi
function f(ξ) := 1/(1 + e^{−ξ}), which approaches 0 as ξ goes to −∞, approaches 1 as ξ goes
to ∞, and is 1/2 for ξ = 0 (a sigmoid activation). Surprisingly, feed-forward neural net-
works with one hidden layer can represent any continuous function m ↦ k that maps
features to decision vectors. This result can be proven using Kolmogorov's General
Representation Theorem (Kolmogorov [1963]):

Theorem 7.2 (Kolmogorov). Every continuous function k : [0,1]^d → ℝ can be written as a superposition

k(m) = ∑_{j=1}^{2d+1} Ξ_j ( ∑_{i=1}^{d} ψ_ij(m_i) ).    (7.23)
In general, Ξ j and ψ ij are nonlinear functions. Because of this result, neural networks
are sometimes also called universal function approximators.
A feed-forward neural network is typically trained using backpropagation. Back-
propagation iteratively minimizes the training error by propagating the error from the
output layer to the input layer and adjusting the weights on the way. More formally,
the training performs a gradient descent on the squared training error

e := ∑_{v=1}^{N} ‖k(m_v) − ω(m_v)‖² ,    (7.24)

updating every weight in the direction of the negative gradient, e.g., w_ji ← w_ji − η ∂e/∂w_ji,
where η denotes the learning rate (likewise for w̃_lj, b_j, and b̃_l). Details on the algorithm can
be found, e.g., in Duda et al. [2001].
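The forward pass of Equation (7.22) is compact in code; a minimal sketch (assuming numpy; the weights here are random placeholders, not trained):

import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))       # the Fermi function used as activation

def forward(m, W, b, W_tilde, b_tilde):
    """Forward pass of a feed-forward net with one hidden layer (Eq. (7.22)).
    W: (n, d) input-to-hidden weights, W_tilde: (c, n) hidden-to-output weights."""
    h = sigmoid(W @ m + b)                  # hidden activations
    return sigmoid(W_tilde @ h + b_tilde)   # decision vector k(m)

# toy usage: d = 2 features, n = 3 hidden neurons, c = 2 outputs (random weights)
rng = np.random.default_rng(4)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
W_tilde, b_tilde = rng.normal(size=(2, 3)), rng.normal(size=2)
print(forward(np.array([0.5, -1.0]), W, b, W_tilde, b_tilde))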
The advantages of feed-forward neural networks are that they require no prior
knowledge and are relatively easy to configure, yet allow approximating arbitrary
decision functions and afford a very fast classification.
Fig. 7.6. Application to the reference example of Section 3.3.2. Decision regions of a feed-forward
neural network with two hidden layers with 10 neurons each. Note that a different initialization
usually results in vastly different decision boundaries. The training and testing errors are etrain =
8 % and etest = 5.5 %. The testing error asymptotically approaches etest ≈ 7.6 %. The training set is
the same as in Figure 3.8. Test samples are shown with hollow marks.
On the other hand, large neural networks involve a large number of parameters
to estimate, with all the associated problems (see Section 6.1). Training large neural
networks is computationally very expensive, especially if the training set is also large.
There are no clear guidelines for how to structure a neural network for a given prob-
lem, and it is difficult to interpret the underlying computation—a neural network is a
mathematical black box that only allows numerical interpretation. Neural networks
are prone to overfitting the training data and there is no guarantee that the learned
parameters constitute a global optimum, as the loss function in Equation (7.24) is non-
convex. Nonetheless, neural networks achieve remarkable results in many different
application domains.
Figure 7.6 shows the decision regions of a feed-forward neural network with two
hidden layers, each of which contains ten neurons. The decision boundary is compli-
cated and does not seem to match the optimal decision boundary well, yet with 7.6 %,
the asymptotic testing error is only 1.5 percentage points larger than the Bayes error
rate. However, the decision regions can look vastly different when the architecture
is changed, e.g., when using a different number of hidden layers, or if the layers con-
tain a different number of hidden neurons. Even a different choice of initial weights
can have a significant impact on the decision boundary. For these reasons, a thor-
ough evaluation of the parameter space is paramount when using this highly flexible
classifier.
7.5 Autoencoders
An artificial neural network can also be used to compress data by letting it learn to
reproduce its input (i.e., to learn the identity function). If the hidden layer consists
of fewer neurons than the input layer, the network is forced to learn some lower-
dimensional encoding of the data. Such networks are called autoencoders. As an ex-
ample (which is due to Ritter et al. [1990]), consider a dataset of eight training vectors
D = {m_1 , . . . , m_8} with m_i ∈ ℝ⁸ for i = 1, . . . , 8. Let the i-th entry of m_i be 9/10 and
every other entry be 1/10:

m_i := (1/10 , . . . , 9/10 , . . . , 1/10)^T ,    (7.26)

where the entry 9/10 is at the i-th position.
Figure 7.7 shows the activation of each neuron (except the bias neurons) of an
autoencoder that was trained on this dataset. The net was trained for 5000 iterations
with a learning rate of η = 0.25 and the final mean squared training error (see Equa-
tion (7.24)) was e = 0.047. It can be seen that while the input is reconstructed almost
perfectly, the compression in the hidden layer is not lossless.
Note that the compression is only valid for the seen data: unseen training samples
may not be compressed with a low reconstruction error. For example, this network pro-
duces a squared error of e = 6.51 when reconstructing the vector m = 1 = (1, . . . ,1)T .
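A small re-implementation of this experiment can be sketched as follows (assuming numpy; the initialization is random, so the final error values will differ from the ones reported above):

import numpy as np

rng = np.random.default_rng(5)

# training vectors of Equation (7.26): 9/10 at the i-th position, 1/10 elsewhere
X = np.full((8, 8), 0.1) + 0.8 * np.eye(8)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 8-3-8 network: the 3-neuron hidden layer forces a compressed encoding
W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(8, 3)), np.zeros(8)
eta = 0.25                                    # learning rate as in the text

for _ in range(5000):
    H = sigmoid(X @ W1.T + b1)                # hidden codes, shape (8, 3)
    Y = sigmoid(H @ W2.T + b2)                # reconstructions, shape (8, 8)
    dY = 2 * (Y - X) * Y * (1 - Y)            # backpropagated error at the output layer
    dH = (dY @ W2) * H * (1 - H)              # ... propagated to the hidden layer
    W2 -= eta * dY.T @ H
    b2 -= eta * dY.sum(axis=0)
    W1 -= eta * dH.T @ X
    b1 -= eta * dH.sum(axis=0)

train_error = np.sum((sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2) - X) ** 2)
unseen = np.ones(8)                           # an input the network has never seen
unseen_error = np.sum((sigmoid(W2 @ sigmoid(W1 @ unseen + b1) + b2) - unseen) ** 2)
print(train_error, unseen_error)              # the second value is typically much larger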
7.6 Deep learning

Theorem 7.2 implies that a neural network with one hidden layer is sufficient to rep-
resent any function and therefore derive arbitrarily complicated decision boundaries,
provided that it contains enough neurons. Yet there are still reasons to prefer networks
with many layers with fewer neurons: deeper networks have the same approximation
capabilities as shallow networks, but generally require fewer parameters (see Schmid-
huber [2015]). For example, there are functions that require a polynomial number of
parameters (w.r.t. the number of inputs) in a network with n hidden layers, but require
an exponential number of parameters with n − 1 layers (Schmidhuber [2015]). As dis-
cussed in Section 6.1, having fewer parameters typically leads to better generalization
properties and reduces the risk of overfitting the training data.
Multiple layers can also be used to model hierarchical part–object relationships,
where the first few layers model the individual parts, and the following layers model the
composition of those parts. A typical example of such a model is a car, where the first
few layers could model car parts, such as wheels, doors, windows, headlights, etc., and
the following layers model the car’s frame and body. Lastly, the human visual cortex is
also organized in many hierarchical layers, that fulfill increasingly complicated tasks.
Unfortunately, there is no clear definition of what constitutes a “deep” network.
Generally, a neural network with one hidden layer is considered shallow, whereas
a network with ten hidden layers is already considered deep. One of the pioneering
works in deep learning, LeNet-5 by LeCun et al. [1998] had seven layers, but other
architectures can have more than 1000 layers (He et al. [2016]).
Artificial neural networks with multiple layers are typically trained using the back-
propagation algorithm, which, as mentioned above, performs a gradient descent to
minimize the squared prediction error on the training set. However, deep networks
cause two major issues with backpropagation: First, the gradient may vanish or ex-
plode during the backward pass due to exponential changes from one layer to another.
In effect, gradient descent may require an unfeasibly large number of iterations to con-
verge. Second, since gradient descent only considers local information about the first
derivative, backpropagation often becomes stuck at saddle points or in local optima.
This means that even if the gradient descent converges, there is no guarantee that the
solution found is also a good solution.
Deep networks also tend to have large parameter spaces and therefore need more
training data than models with fewer parameters. Such training data requires storage
and computation time, neither of which were always as abundant as they are today.
Furthermore, there were no clear concepts for representing practical problems or en-
coding prior knowledge in deep neural networks. At the same time, alternatives such
as the SVM achieved a similar classification performance, but had a much more solid
theoretical foundation.
In recent years, these issues have largely been solved. There are huge datasets,
such as ImageNet (Deng et al. [2009]), that contain millions of labeled training samples.
Graphical processing units allow significantly accelerating the computation of the
gradient and therefore allow many more iterations with small learning rates. This
means that even an almost vanishingly small gradient can be sufficient to escape a
saddle point, provided that it is followed for long enough.
But there have also been significant theoretical advances to improve the training
algorithm. Unsupervised pre-training allows using unlabeled training data to initialize
a network with a good solution, before the supervised gradient descent is performed.
Stochastic gradient descent, momentum, and weight decay speed up the training and
avoid falling into local optima. Rectified linear units instead of sigmoid activation
avoid vanishing gradients with large activations.
Similarly, specialized architectures have led to breakthroughs in certain areas.
The long short term memory (LSTM) (see Hochreiter and Schmidhuber [1997]) ap-
proach is well suited to handle sequential data, such as audio, text, or time series.
Since LSTMs “remember” training errors during backpropagation, they also solve the
problem of vanishing gradients. Convolutional neural networks convolve the input
data with banks of trainable filters and are especially well suited for multidimensional
data that contain repeating structures, e.g., image data.
A detailed discussion of these techniques is outside the scope of this book, but we
will briefly explore the fundamental ideas and motivations in the following.
In unsupervised pre-training, the weights are initialized by treating each layer individ-
ually. Here, the goal is to find a good initialization for the supervised backpropagation.
The intuition is that the pre-training will move the parameter vector near a local opti-
mum, which can then be quickly found by gradient descent. Unsupervised pre-training
also does not require annotated training data, but only unlabeled data, which is much
easier to obtain in large quantities. This reduces the need for labeled data, since the
supervised training will only run for a short time. The only requirement is that the
unlabeled data must be from the same domain as the labeled data, e.g., images of cars,
if the goal is to classify car models, etc.
Fig. 7.8. Pre-training with stacked autoencoders decomposes training a deep network layer by layer.
Each layer is trained as an autoencoder of its input, where the input is the output of the previous
layer.
Stochastic gradient descent approximates the gradient using random subsets of the
training data. In particular, in the t-th iteration, only the batch Bt ⊂ D is used to
approximate the gradient,
∂e/∂w_t = ∑_{v∈D} ∂e_v/∂w_t ≈ ∑_{v∈B_t} ∂e_v/∂w_t .    (7.27)
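The quality of the batch approximation in Equation (7.27) can be illustrated numerically; a minimal sketch (assuming numpy; the least-squares toy objective, batch size, and seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(6)

# toy squared-error objective e(w) = sum_v (x_v^T w - y_v)^2 over a large training set D
X = rng.standard_normal((10000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(10000)
w = np.zeros(5)

grad_full = 2 * X.T @ (X @ w - y)                        # gradient over all of D
batch = rng.choice(len(X), size=64, replace=False)       # a random batch B_t
grad_batch = 2 * X[batch].T @ (X[batch] @ w - y[batch])  # batch approximation

cos = grad_full @ grad_batch / (np.linalg.norm(grad_full) * np.linalg.norm(grad_batch))
print(cos)   # typically well above 0.9: the batch gradient points in almost the same direction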
A rectified linear unit (ReLU) denotes an activation function that avoids the problem
of vanishing gradients. In particular, the ReLU activation function is f(ξ) := max(0, ξ). In
contrast to the sigmoid activation f(ξ) = e^ξ/(1 + e^ξ), its derivative does not decay towards
zero for large positive activations (see Figure 7.10a).

Fig. 7.10. (a) The ReLU activation f(ξ) = max(0, ξ) compared to the sigmoid activation f(ξ) = e^ξ/(1 + e^ξ). Panels (b) and (c) illustrate the ReLU nonlinearity and the max pooling operation applied between consecutive layers.
Fig. 7.11. High level structure of a toy example convolutional neural network with Q = 2 convolution
blocks (convolution layer, ReLU, max pooling) and R = 2 fully connected layers. Stride and padding
are not shown in the figure. All nodes in the last feature maps are connected without restriction to
the nodes in the following (non-convolutional) hidden layer. In the figure this is indicated by the
gray shading between these layers. Note: Real networks, e.g., in Krizhevsky et al. [2012], are usually
much larger.
To reduce the computational effort and the number of parameters to learn, the
convolution is often combined with a downscaling operation. A stride of n computes
the convolution only at every n-th position in both spatial directions and therefore
shrinks the output by a factor of n² (a factor of n in both directions). For example, the
output of a convolution layer with stride two is one-fourth the size of the input, because
the convolution is computed only on every odd row and column of the input. Stride is
also often used to replace the pooling layer, as both have a similar effect, but strides
significantly reduce the computation time.
Another reduction due to convolution is that positions on the boundary are omit-
ted, because the convolution can only be computed at positions where the convolution
matrix fully fits into the image. As this reduction is typically undesired, the input can
be padded by a certain number of pixels to allow convolution at these positions. There
is no clear guideline on how to fill the padded area, but common approaches are to
tile the input periodically, or to reflect the input at the border pixels.
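The bookkeeping for stride and padding reduces to a simple formula for the spatial output size; a minimal helper (an illustration, not taken from the text; the examples assume a 5×5 convolution matrix on a 32×32 input):

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution layer along one dimension.
    A stride of n shrinks the output roughly by a factor of n; padding
    compensates for the positions lost at the border."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, 5))                       # 28: border positions are lost
print(conv_output_size(32, 5, padding=2))            # 32: padding restores the size
print(conv_output_size(32, 5, stride=2, padding=2))  # 16: stride 2 halves each direction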
The next layer performs a nonlinear mapping of the convolution result, usually
using the ReLU activation function (see Figure 7.10b). Lastly, a so called max pooling
layer subsamples the result by propagating only the maximum activation within a
small neighborhood to the next layer (Figure 7.10c). These max pooling layers cause
the features to become tolerant to translation and improve the tolerance for noise in the
input data. At the same time, max pooling leads to data reduction (Figure 7.11). Com-
bined with successive convolutions, this means that later stages see a larger portion
of the input image.

Fig. 7.12. Image patches that produce large filter responses in a convolutional neural network with
five convolution blocks. For each block, six convolution matrices are shown. The first layer responds
mainly to color, edges, and corners. The following layers capture increasingly complicated structures
and even whole concepts, e.g., “group of humans,” in the fifth layer. The images were generated as
described in Zeiler and Fergus [2014].
These Q convolution blocks are followed by R fully connected layers (the final
layers in Figure 7.11), where typically Q > R. The fully connected layers play the role of
a regular multilayer feed-forward neural network that classifies the features extracted
by the convolutional layers. However, the parameters of both the feature extraction
and the classification are learned at the same time. In effect, the features are very well
tuned to the classifier and vice versa.
CNNs are usually trained to minimize the error on the training set using (stochastic)
gradient descent, and not using pre-training. This way, it is ensured that the convo-
lution matrices in the same layer learn different weights, and that the convolution
matrices of consecutive layers are geared to each other.
To avoid overfitting in the fully connected layer, one can use so called dropout,
where a random fraction of neurons are deactivated in each training iteration (Srivas-
tava et al. [2014]). The idea is that dropout forces the network to introduce redundan-
cies and therefore reduces the risk of overfitting.
CNNs have been shown to be remarkably successful at difficult image recognition
tasks. For example, prior to the advent of CNNs, the state of the art in image categoriza-
tion (the task of classifying an image into one of many categories) achieved an error
rate of 45.7 % on the ImageNet dataset. The convolutional neural network approach
by Krizhevsky et al. [2012] reduced this error rate to 37.5 %! Their network consisted
of five convolution blocks followed by two fully connected layers.

Fig. 7.13. Detection and classification of vehicles in aerial images with CNNs. Red and blue boxes
show detection of cars and trucks, green and turquoise boxes show the corresponding ground truth.
Results of Sommer et al. [2017], images from the DLR3K dataset (Liu and Mattyus [2015]).
Here, and with CNNs in general, the different convolution blocks produce different
types of features: The first block essentially detects edges and corners; the second
block responds to primitive textures; the third block detects parts of objects, etc. This
can be seen in Figure 7.12. The figure shows image patches that produce large filter
responses for six distinct convolution matrices for each layer in a convolutional neural
network with five convolution blocks. Although the details depend on the input data
and the architecture of the network, it is common to all CNNs that the deeper the layer,
the higher the level of abstraction in the corresponding features. Remarkably, a similar
structure is found in the human visual cortex.
on a training set, where mi and mj denote the output of the network for the i-th and
j-th training sample, and y ij ∈ {−1,1} is an indicator variable that is 1 if the training
samples show the same person and −1 otherwise. b > 0 is the classification threshold,
i.e., mi and mj are said to belong to the same person iff ‖mi − mj ‖2 ≤ b. Once trained,
only one of the CNNs is used in classification and the other one is discarded. Such a
training procedure with two or more networks with the same structure (but different
parameters) is also known as a Siamese setup (Bromley et al. [1993]).
7.7 Support vector machines

The support vector machine (SVM) is one of the most versatile classifiers. It is relatively
simple, yet extremely powerful, and provides good generalization even with a small
number of training samples. The SVM is a linear discriminant classifier for two classes
(c = 2), but can be extended to multiple classes using the techniques discussed in
Section 7.1.1. The SVM can be explained based on five fundamental ideas:
1. Linear separation with maximum distance of the separating hyperplane to the
nearest training samples (the support vectors).
2. Dual formulation of the linear classifier to reduce the number of parameters to
estimate.
3. Nonlinear mapping of the features to a high-dimensional feature space Φ.
4. Implicit use of the (possibly ∞-dimensional) space of eigenfunctions of a so-called
kernel function K as the transformed feature space Φ. The transformed features do
not have to be explicitly computed and the classifier has a small number of free
parameters even though dim(Φ) is large (kernel trick).
5. Relaxation of the linear separability requirement by introducing slack variables.
These ideas will be discussed in the following. For now, it is assumed that the training
set D = {m1 , . . . ,mN }, m ∈ ℝd is linearly separable. This assumption will be relaxed
in the fifth step.
As was already seen in Section 7.2, typically more than one linear discriminant can sep-
arate a given dataset. Yet, some of the discriminants are intuitively “better” than others,
because they generalize better. One method of finding such
discriminants is to impose additional conditions on the separating hyperplane. With
SVMs, the goal is to find those hyperplane parameters (w,b) such that the margin γ,
the distance between the hyperplane and the closest training samples, is maximized:
Fig. 7.15. A linearly separable training set (classes ω1 and ω2), the optimal separating hyperplane with maximum margin γ, and the support vectors {m+} and {m−}.
The intuition is that a larger margin results in better generalization. This intuition
was already expressed in Figure 7.4, where the final hyperplane of the perceptron
does separate the training data, but the margin is very small. The dashed line marks a
hyperplane with a larger margin.
Interestingly, there is exactly one hyperplane that satisfies Equation (7.33), pro-
vided that D is linearly separable. This hyperplane is fully defined by the support
vectors {m+ } and {m− }, i.e., the vectors that are closest to the hyperplane (see Fig-
ure 7.15). The SVM concentrates only on the boundaries between classes and therefore
only on the most difficult samples.
But how does one estimate the parameters w and b from a training set D and how
does the margin γ relate to w and b? To derive an answer, recall the linear decision
function
k(m) = w^T m + b = ( ∑_{i=1}^{d} w_i m_i ) + b = ⟨w, m⟩ + b,    (7.34)
where m is assigned to the class ω1 if k(m) > 0 and to ω2 otherwise; that is, the
classification depends on the sign of the decision function k(m). Observe further that
the sign of the decision function does not change when the parameters are scaled
by some factor β > 0, that is, sign(wT m + b) = sign(βwT m + βb). By definition, the
support vectors {m+ } of ω1 and {m− } of ω2 lie exactly on the margin and the optimal
separating hyperplane has the same distance from all support vectors (see Figure 7.15).
w^T m+ + b = +1   and   w^T m− + b = −1   ⇒   w^T m+ − w^T m− = 2.    (7.35)
In other words, the margin γ = ‖w‖^{-1} assumes its maximum iff ‖w‖ becomes
minimal. This translates into the following optimization problem:
Taking the derivative of L with respect to w and b and setting the partial derivatives
to 0 yields
∂L/∂w = w − ∑_{i=1}^{N} z_i α_i m_i = 0   ⇔   w = ∑_{i=1}^{N} z_i α_i m_i   and    (7.39)

∂L/∂b = ∑_{i=1}^{N} z_i α_i = 0.    (7.40)
The primal formulation of the decision function in Equation (7.34) suggests that there
are (d + 1) parameters to estimate: d parameters for w and one parameter for b. If w is
normalized, the number of free parameters reduces to d. Yet, above it was hinted that
the hyperplane is fully determined by the support vectors. If the number of support
vectors is smaller than d, this means that there are actually fewer parameters that need
to be estimated.
Indeed, Equation (7.39) shows that the weight vector w can be written as a linear
combination of the training samples (the same result is used in the perceptron algo-
rithm, see line 5 in Algorithm 7.1). Substituting w = ∑Ni=1 α i z i mi in Equation (7.34)
yields
N
k(m) = ⟨w, m⟩ + b = ⟨ ∑ α i z i mi , m⟩ + b
i=1
N
= ∑ α i z i ⟨mi , m⟩ + b. (7.41)
i=1
Equation (7.41) is called the dual form of Equation (7.34). In this formulation, the
number of free parameters is N: (N − 1) parameters for the α i (recall Equation (7.40))
and one parameter for b. This number does not depend on the dimensionality d of
the feature space! Below it will be seen that the α i are nonzero only for the support
vectors. Also note that the feature vectors mi in Equation (7.41) only appear inside of
inner products, but not on their own. This will become important in Section 7.7.4.
Substituting Equations (7.39) and (7.40) in Equation (7.38) yields the dual formu-
lation of the Lagrange function:
L(α) = ∑_{i=1}^{N} α_i − (1/2) ∑_{(i,j)=(1,1)}^{(N,N)} z_i z_j α_i α_j ⟨m_i , m_j⟩ → max.    (7.42)
The dual formulation depends only on the dual variables α i and must be max-
imized instead of minimized. That is, if α∗ solves the following dual constrained
quadratic optimization problem,
Maximize:    ∑_{i=1}^{N} α_i − (1/2) ∑_{(i,j)=(1,1)}^{(N,N)} z_i z_j α_i α_j ⟨m_i , m_j⟩

Subject to:  ∑_{i=1}^{N} z_i α_i = 0   and
             α_i ≥ 0,   i = 1, . . . , N,    (7.43)
then the vector w* = ∑_{i=1}^{N} z_i α*_i m_i realizes the linear classifier with maximum margin
γ = ‖w*‖^{-1}. For b* it follows that

b* = −(1/2) ( max_{z_i = −1} {⟨w*, m_i⟩} + min_{z_i = 1} {⟨w*, m_i⟩} ).    (7.44)
The above constrained quadratic optimization problem can be solved very effi-
ciently, e.g., with quadratic programming techniques. Note that the solution (w∗ ,b∗ )
is unique. Non-support vector samples mi , i.e., the mi that do not fall on the boundary
of the margin, do not influence the solution at all. During the optimization, the corre-
sponding weights α∗i vanish: only the α∗j that correspond to support vectors are ≠ 0.
This is, in fact, the reason for calling these samples support vectors.
Following this reasoning, the number of parameters that need to be estimated for
the SVM classifier is neither d (as suggested by Equation (7.34)) nor N (as suggested
by Equation (7.41)). Rather, the number of non-vanishing parameters is equal to the
number of support vectors, |SV|, where SV denotes the set of support vectors. This
explains why an SVM is able to find a separating hyperplane in very high-, and even
infinite-dimensional spaces.
ϕ : 𝕄 → Φ ,    m ↦ ϕ(m) = (φ_1(m), . . . , φ_{d*}(m))^T ,    (7.45)
where w ∈ ℝ^{d*}.
In this formulation, the number of free parameters is (d∗ + 1), which means that
additional parameters have to be estimated and all the discussed drawbacks apply
(see Sections 6.1 and 7.1.2). Note that although the separation is nonlinear in 𝕄, it
is linear in Φ. In essence, the linear separation with maximum margin to the near-
est samples {ϕ(m)} is conducted in the space Φ instead of the space 𝕄. If d∗ > d
(which is generally, but not necessarily, the case), the mapping ϕ(m) determines a
d-dimensional sub-manifold in Φ.
Note that in a two-class problem (c = 2), a dataset D = {m1 , . . . ,mN } of d-
dimensional feature vectors mi can always be linearly separated if d ≥ N − 1 and
the {mi } do not reside in a (d − 1)-dimensional subspace of 𝕄. As a consequence,
linear separation may always be achieved by a suitable mapping ϕ(⋅).
Recall the dual form of the decision function in Equation (7.41). When applying
the mapping ϕ(⋅), the decision function becomes
k(m) = ∑_{i=1}^{N} α_i z_i ⟨ϕ(m_i), ϕ(m)⟩ + b.    (7.47)
Definition 7.3 (Kernel function). A function K is said to be a kernel function (or kernel
for short) if for all m, m′ ∈ 𝕄

K(m, m′) = ⟨ϕ(m), ϕ(m′)⟩.
This is the basic insight of the kernel trick: an inner product of mapped feature
vectors may be replaced by a kernel function that computes both in one step. The
necessary and sufficient conditions for an arbitrary bivariate function K(⋅,⋅) to be a
kernel function are given by Mercer’s theorem (Mercer [1909]).
with positive coefficients λ_j > 0 (i.e., K denotes an inner product in the feature space
Φ associated with K) iff

∫_𝕄 ∫_𝕄 K(m, m′) f(m) f(m′) dm dm′ ≥ 0

for all f ≢ 0 with ∫ f²(m) dm < ∞. The λ_j and φ_j(⋅) are the solutions to the eigenvalue
problem

∫_𝕄 K(m, m′) φ(m′) dm′ = λ φ(m).    (7.52)
Given a function K(⋅, ⋅) that satisfies the hypotheses of Mercer’s theorem (and hence is
a kernel function), one can use this function in the dual formulation of the classifier in
Equation (7.49) to implicitly use the possibly infinite-dimensional transformed feature
space Φ without needing to explicitly compute the corresponding feature vectors {ϕj }.
In other words: the kernel function K induces the feature space Φ.
Note, again, that even though the feature vector ϕ(m) may have a very high, pos-
sibly even infinite dimensionality d∗ , the classifier in Equation (7.49) is still fully de-
termined by only N free parameters.
The simplest kernel function is the scalar product itself, K(m, u) := ⟨m, u⟩. The corresponding mapping is the identity function ϕ(m) = m; therefore, this kernel is also called the linear kernel.
A more interesting kernel arises by squaring the scalar product:
K(m,u) := ⟨m, u⟩² = ( ∑_{i=1}^{d} m_i u_i )² = ( ∑_{i=1}^{d} m_i u_i )( ∑_{j=1}^{d} m_j u_j )
        = ∑_{i=1}^{d} ∑_{j=1}^{d} m_i m_j u_i u_j = ∑_{(i,j)=(1,1)}^{(d,d)} (m_i m_j)(u_i u_j).    (7.54)
The corresponding mapping is ϕ(m) = (m_1 m_1 , m_1 m_2 , . . . , m_d m_d)^T, i.e., the vector
of all monomials of degree 2 of the entries {m_i}. The dimensionality of Φ is
d* = (1/2)(d + 1)d, because the terms m_i m_j appear twice for every i ≠ j, but must
only be counted once.
The above kernel function can be modified by adding a constant c ∈ ℝ before
squaring:
K(m,u) := (⟨m, u⟩ + c)² = ( ∑_{i=1}^{d} m_i u_i + c )( ∑_{j=1}^{d} m_j u_j + c )
        = ∑_{i=1}^{d} ∑_{j=1}^{d} m_i m_j u_i u_j + 2c ∑_{i=1}^{d} m_i u_i + c²
        = ∑_{(i,j)=(1,1)}^{(d,d)} (m_i m_j)(u_i u_j) + ∑_{i=1}^{d} (√(2c) m_i)(√(2c) u_i) + c².    (7.56)
The dimensionality of the corresponding feature space is

d* = (d + 2 choose 2) = (1/2)(d + 2)(d + 1).    (7.57)
Akin to the reasoning above, the transformed feature vectors contain all monomi-
als of degree ≤ q. For this reason, this kernel function is also known as the polynomial
kernel. Together with the linear kernel, the polynomial kernel is one of the standard
kernels often used with SVMs.
Another popular kernel is the Gaussian kernel, or radial basis function (RBF) ker-
nel:
K(m,u) := exp{ −‖m − u‖² / σ² }.    (7.59)
Unlike the linear and polynomial kernels, this kernel produces a mapping into an
infinite-dimensional space Φ, that is, the eigenfunction decomposition in Theorem 7.4
has an infinite number of solutions (Rasmussen and Williams [2006]).
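The kernel identities above can be verified numerically; the following is a minimal sketch (assuming numpy; the vectors and the constant c are arbitrary) that checks that (⟨m, u⟩ + c)² equals the inner product of the explicitly mapped feature vectors from Equation (7.56):

import numpy as np

rng = np.random.default_rng(7)
m, u = rng.standard_normal(3), rng.standard_normal(3)
c = 2.0

def phi(v, c):
    # explicit mapping for K(m, u) = (<m, u> + c)^2: all products v_i v_j,
    # the scaled entries sqrt(2c) v_i, and the constant c
    return np.concatenate([np.outer(v, v).ravel(), np.sqrt(2 * c) * v, [c]])

kernel = (m @ u + c) ** 2
inner = phi(m, c) @ phi(u, c)
print(kernel, inner)     # both values agree (up to rounding)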
A decision function can also be found when a kernel function is used. The opti-
mization problem simply swaps the scalar product in Equation (7.43) for a kernel:
Maximize:    ∑_{i=1}^{N} α_i − (1/2) ∑_{(i,j)=(1,1)}^{(N,N)} z_i z_j α_i α_j K(m_i , m_j)

Subject to:  ∑_{i=1}^{N} z_i α_i = 0   and
             α_i ≥ 0,   i = 1, . . . , N.    (7.60)
is equivalent to the hyperplane in the induced space Φ with maximum margin (Cris-
tianini and Shawe-Taylor [2000])

γ = 1 / √( ∑_{m_i ∈ SV} α*_i ).    (7.62)
Note that it follows from Mercer's theorem that the (kernel) matrix (K(m_i , m_j))_{i,j=1}^{N}
is positive definite. This means that the optimization problem in Equation (7.60) is
convex and has a unique global optimum that can be found using, e.g., quadratic
programming.
Figure 7.16 shows the decision regions of a hard margin SVM with a Gaussian kernel
(σ = 4), learned from the ongoing reference dataset. The decision regions are very com-
plicated and rugged. Clearly, this classifier overfits the data and does not generalize
well. This can also be seen in the low training error of etrain = 3.5 %, but rather large
testing error of etest = 11 %. Note that in theory a hard margin SVM should have no
training error (etrain = 0). In practice this is rarely achieved, as the optimization of
Equation (7.60) is usually terminated before the (true) optimum is reached.
Probability of error
The probability of making an error when classifying unseen samples mi ∈ ̸ D may be
eye-balled as
P( ω̂(m) ≠ ω(m), m ∉ D ) ≈ |SV| / N .    (7.63)
The reasoning goes as follows. If some training sample mi ∈ D is left out in train-
ing, it will be correctly classified if it is not a support vector. If it is a support vector,
there is a chance that it will be misclassified by the SVM trained on the reduced train-
ing set. If this is repeated for all mi ∈ D, one makes at most |SV| errors (leave-one-out
argument). It follows that for a fixed training set, the classifier with fewer support
vectors will perform better (consistently with Occam’s razor).
Fig. 7.16. Application to the reference example of Section 3.3.2. Decision regions of a hard margin
SVM classifier with Gaussian kernel (σ = 4). The training error is etrain = 3.5 % and the testing error
is etest = 11 %. The latter asymptotically approaches etest ≈ 11.2 %. The training set is the same as
in Figure 3.8. Test samples are shown with hollow marks.
So far we have assumed that the dataset D is linearly separable (either in 𝕄 or in Φ).
However, there are cases where D is not linearly separable or separable only with
a small margin. To allow the SVM to work well in these cases, one introduces the so
called slack variables ξ i ≥ 0, i = 1, . . . ,N that measure how much a training sample mi
violates the margin or even how far mi lies on the wrong side of the separating hyper-
plane, see Figure 7.17. For the linear classifier with maximum margin, the optimization
goal becomes
Minimize:    ⟨w, w⟩ + C ∑_{i=1}^{N} ξ_i²
The design parameter C > 0 defines how much emphasis should be put on correct
classification (i.e., C is large) versus a large margin (i.e., C is small). A dual formulation
of the above and the use of a kernel K instead of the scalar product ⟨w, m⟩ leads to
Fig. 7.17. Geometric interpretation of the slack variables ξ_i, i = 1, . . . , N: the slack variables measure (w.r.t. the margin γ) if, and how far, training samples penetrate the margin.
α_i ≥ 0,   i = 1, . . . , N,    (7.65)

where the offset b* is chosen so that z_i k(m_i) = 1 − α*_i/C for all i with α*_i ≠ 0. A proof of
the above can be found, for example, in Cristianini and Shawe-Taylor [2000].
The decision regions of a soft margin SVM can be seen in Figure 7.18. Again, the
classifier was trained using the ongoing reference dataset from Section 3.3.2. As with
Figure 7.16, the SVM uses a Gaussian kernel, albeit here the kernel parameter was
chosen to be σ = 1. The design parameter was chosen to be C = 1. Unlike with a hard
margin, the soft margin SVM does not overfit the training data, but generalizes well.
The decision boundaries are reasonably close to the decision regions of the Bayesian
optimal classifier and the asymptotic testing error is only 0.7 percentage points above
the optimal Bayes error rate. Different choices for σ and C vary the shape of the decision
boundary: higher values of σ generally lead to a smooth decision boundary, whereas
higher values of C lead to a more complicated boundary.
Fig. 7.18. Application to the reference example of Section 3.3.2. Decision regions of a soft margin
SVM classifier with Gaussian kernel (σ = 1) and C = 1. The training and testing errors are etrain =
etest = 6.5 %. The testing error asymptotically approaches etest ≈ 6.8 %. The training set is the
same as in Figure 3.8. Test samples are shown with hollow marks.
In practice, one can estimate the hyperparameter C and the kernel parameters (σ
in the example) using the validation set V. To this end, an SVM classifier with fixed
hyperparameters is trained using the training set D and the classification performance
on V is recorded. The process is repeated for various combinations of the parameters,
where the parameters are often determined using a rule (e.g., grid search, where the
parameters are drawn from a regular grid) or are randomly sampled (randomized search).
Finally, the parameters that yield the highest classification performance are kept.
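A minimal sketch of such a parameter search (assuming scikit-learn is available; the synthetic two-class data and the candidate values for C and σ are placeholders, not the reference example; gamma = 1/σ² corresponds to the Gaussian kernel of Equation (7.59)):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(8)

# two Gaussian classes as a stand-in for a real dataset
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.5, 1.0, (200, 2))])
y = np.repeat([0, 1], 200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

best = None
for C in (0.1, 1.0, 10.0):
    for sigma in (0.5, 1.0, 4.0):
        clf = SVC(C=C, kernel="rbf", gamma=1.0 / sigma**2)   # soft margin SVM, Gaussian kernel
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)                      # performance on the validation set
        if best is None or score > best[0]:
            best = (score, C, sigma, len(clf.support_vectors_))

score, C, sigma, n_sv = best
print(f"best: C = {C}, sigma = {sigma}, validation accuracy = {score:.3f}, |SV| = {n_sv}")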
7.7.6 Discussion
SVMs are very powerful classifiers, which are applicable to a wide range of problems.
As discussed in Section 7.7.4, the complexity of an SVM is determined not by the di-
mensionality of the feature space, but only by the number of support vectors. Because
of this, an SVM is generally less prone to overfitting than other classifiers. Unlike ar-
tificial neural networks, which often yield suboptimal classifiers, an SVM classifier
is uniquely determined by the training data and the learning algorithm will always
produce a globally optimal classifier. Furthermore, the SVM algorithm allows for a
geometric interpretation and is easy to apply without the need of prior knowledge
about the problem. Using appropriate kernel functions, an SVM classifier can even be
used to classify complicated objects like genome sequences or the words of a (natural)
language. Most of all, it is based on a very well developed theoretical foundation (see
Boser et al. [1992], Cortes and Vapnik [1995], Schölkopf and Burges [1999]). One major
drawback, however, is that the presented SVM is only a binary classifier. Extension
to multiple classes typically requires training at least one SVM for each class (see Sec-
tion 7.1.1), but extensions to true multi-class SVMs exist as well (e.g., Crammer and
Singer [2001]).

Fig. 7.19. Decision boundaries (shown in the original feature space) of a hard margin (left) and a soft margin (right) SVM with Gaussian kernel on the same dataset. The shown feature vectors are the training data D. The support vectors are marked with circles around them. Example according to Cristianini and Shawe-Taylor [2000].
Figure 7.19 shows an example of the different decision boundaries that a hard
margin SVM (left) and a soft margin SVM (right) can derive. The boundaries are shown
in the original, two-dimensional feature space. Both SVMs use a Gaussian kernel (see
Equation (7.59)) with the same kernel parameter σ2 = 0.5 and were trained on the
same data. The shown data are the training data D. Support vectors are marked with
orange circles. The hard margin SVM shows perfect classification on the training data,
but has a relatively complicated decision boundary. The decision region of the soft
margin SVM, on the other hand, is smooth and relatively simple, but results in three
errors on the training set. Furthermore, the number of support vectors is significantly
higher with the soft margin SVM than with the hard margin SVM.
7.8 Matched filters
Often the goal is not only classifying an object, but also locating that object within an
image. Consider the toy example in Figure 7.20. Here, the goal is to find the location
of three characters A, B, and C against a noisy background.
Fig. 7.20. (a) Noisy image with three objects (the letters A, B, and C). (b) A matched filter for the
letter A is moved across the image.
In other words: the image
not only contains the object to be classified, but also unwanted noise. Matched filters,
also known as template matching, are a popular tool to achieve just that.
The idea behind matched filters is that the objects to be found are known in
advance—which is always the case in a classification setting—and that a prototypical
template can be derived for each of the objects. Matched filters provide a mathemati-
cal mechanism whose response assumes extremal values in places where the image
matches the template. In the above example, a matched filter for the character “A”
produces an image that is (nearly) black everywhere except at the center of the “A” in
the original image. In other words, objects are found within an image (or any other
type of signal) by moving the template over the image and recording the positions
where the filter response is maximal, i.e., where the template matches.
In the following, this intuitive description is formalized in a discrete notation, that
is, the images are considered to be two-dimensional discrete signals, as opposed to
continuous signals. In particular, let g ij denote the value of the image at the pixel
position (i,j) and let
g mn := (. . . ,g m−i,n−j , . . .)T , ∀(i,j) ∈ U (7.67)
denote the image patch around the position (m,n). In other words, g mn is the vector
of all image pixels in the region U around the patch origin (m,n). The image patch
is modeled as being composed of a true, underlying object image omn and additive
stationary noise rmn with E{rmn } = 0,
g mn = omn + rmn . (7.68)
The object and noise terms are defined in the same way as the image patch, which
means that g mn , omn , rmn ∈ ℝ|U| are all of the same size. Given a filter v ∈ ℝ|U| , the
response of the filter to the image patch g mn is obtained by taking the inner product
of the two vectors:
k mn = vT g mn = vT omn + vT rmn .
In the following, the indices mn are dropped for the sake of notational brevity. For
example, the above equation will be written as simply k = vT o + vT r.
The question now is how to find a suitable filter. Many different approaches are
possible. For example, one could train a linear SVM classifier and take the weight
vector as a filter. With matched filters, however, the filter v := (. . . ,v ij , . . .)T is chosen
so that it maximizes the signal-to-noise ratio
with some constant c ∈ ℝ. Using the relation between v and w defined above finally
yields the discrete matched filter
v = cQ−1 (Q−1 )T o = cK−1rr o . (7.76)
Here, (Q−1 )T acts as a whitening filter that de-correlates the noise in the image patch
g mn . However, the transformed image patch resides now in a different space than the
object image o. To correct for this, Q−1 modifies o to match after the whitening of the
image.
Note that while matched filters correct for (pixel) noise, they are still very sensitive
to rotation, scale, and other distortions of the input image. Since these perturbations
are very common in detection tasks, the image has to be normalized before applying
the filter.
In order to use matched filters for classification, one matched filter vi is created
for each class ω i to be recognized. The filters are moved over the image and for each
position the best match is recorded. More formally, the feature vector at position x =
(x,y)T is given by m(x) := (. . . ,g x−i,y−j , . . .)T with (i,j) ∈ U,
i.e., the column vector of image pixels within the (shifted) region U around x. The
decision function for each filter vi , i = 1, . . . ,c, is given by k i (m(x)) = vTi m(x) and the
decision vector becomes
k(m(x)) = (vT1 , . . . , vTc )T m(x) = V m(x). (7.79)
This decision vector is evaluated at every pixel of the input image. As there is always
a maximal entry in the decision vector, there will be a match at every pixel, even though
this certainly cannot be the case. In practice, match candidates with low responses
will be discarded as “none of the classes.” This maximum criterion is an example of
classification with rejection, which will be explored in more detail in Section 9.4.
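To make the sliding-window evaluation of Equations (7.76) to (7.79) concrete, the following sketch computes matched filter responses for the special case of white noise, in which the whitening step is the identity and the filter v i is simply the (zero-mean) object template. The tiny templates, the noise level, and the rejection threshold are made-up values for illustration only.

# Matched filtering by sliding class templates over an image (sketch).
import numpy as np

def matched_filter_responses(image, templates):
    """Return a (c, H', W') array of responses k_i at every position."""
    th, tw = templates[0].shape
    H, W = image.shape
    out = np.empty((len(templates), H - th + 1, W - tw + 1))
    for i, t in enumerate(templates):
        v = t - t.mean()                      # zero-mean template used as filter v_i
        for r in range(H - th + 1):
            for c in range(W - tw + 1):
                patch = image[r:r + th, c:c + tw]
                out[i, r, c] = np.sum(v * patch)   # inner product v_i^T g_mn
    return out

# Hypothetical 2x2 "letter" templates and a noisy image containing template 0.
templates = [np.array([[1., 0.], [0., 1.]]), np.array([[0., 1.], [1., 0.]])]
rng = np.random.default_rng(1)
image = rng.normal(0.0, 0.1, (8, 8))
image[3:5, 4:6] += templates[0]

k = matched_filter_responses(image, templates)
best = np.unravel_index(np.argmax(k), k.shape)     # class index and position of best match
if k[best] > 0.5:                                  # reject weak responses ("none of the classes")
    print("class", best[0], "detected at position", best[1:])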
More details about matched filters can be found, for example, in Beyerer et al.
[2016].
7.9 Classification of sequences
An underlying assumption in the discussion so far was that the data are independent
and identically distributed, i.e., the feature vector mi does not depend on the feature
vectors seen before. So far this assumption has been valid, but it does not hold, e.g.,
for videos, where the content of the next frame depends on the content of the current
frame, for speech, where grammar restricts which words can follow another word in
a valid sentence, or for games, where the next move depends on the moves that have
been played before. In general, any sequence in which the probability of drawing an
object depends on which objects have been drawn before, i.e., any sequence that depends
on some state, violates the i.i.d. assumption.
As an example, consider the recognition of spoken words, or more specifically,
the classification of utterances into characters that make up a word. In this scenario,
each character constitutes a class, i.e., ω1 =̂ A, ω2 =̂ B, ω3 =̂ C, etc. Words are gener-
ated by some source that sequentially attains one of the classes as its internal state
and produces the corresponding character. Clearly, the characters of a word are not
independent of the surrounding characters. For example, if the letters observed so far
are “T” and “E”, the characters “A”, “N” and “D” are more probable to be observed
next than the character “X”. A classifier will be more powerful if these dependencies
are modeled into it.
In our example, however, it is not possible to observe the characters directly. In
other words, the classes are hidden from our view. It is possible, after some suitable
signal processing, to observe associated phonemes—the smallest indivisible parts of
speech—v i , e.g., (in IPA phoneme notation) v1 =̂ /i:/, v2 =̂ /e/, v3 =̂ /æ/, etc. Given a
sequence of such phonemes, the goal is to recognize the corresponding word, i.e., the
sequence of states (characters) that will produce the observed sequence of phonemes.
More concretely, consider the word “sequence.” This word is produced by reaching
the states S, E, Q, U, E, N, C, and E one after another. When spoken, however, one
can only observe the phoneme sequence /ˈsiːkwəns/. The goal is to work this process
backwards, that is, to map the observed phonemes to the word “sequence” by virtue of
a model of the generation process. Note that here the number of phonemes is the same
as the number of characters. This does not always have to be the case: “model” has five
characters, but the (US English) pronunciation /’madl/ contains only four phonemes.
Sequences of any type can be modeled using discrete Markov models. A Markov model
describes the probability distribution to switch to the state ω(t) at time t, given the
system’s states ω(t − 1), ω(t − 2), . . . in the previous time steps (t − 1), (t − 2), . . . If
all of these time steps were to be taken into account, such a model would not be very
useful: inference and learning would take place in a very high-dimensional space,
7.9 Classification of sequences | 211
a22
ω2
a12 a23
a21 a32
Fig. 7.21. Discrete first order Markov
a13 model with three states ω i . The probabil-
ity of a transition from state ω i to state
a11 ω1 ω3 a33
ω k is denoted a ik . Image after Duda et al.
a31 [2001].
with all the associated problems (see Section 6.1). A discrete Markov model of the l-th
order therefore assumes that the probability of going into a state depends on the l
preceding states, but not more:
P(ω(t) | ω(t − 1), ω(t − 2), ω(t − 3), . . .) = P(ω(t) | ω(t − 1), . . . , ω(t − l)). (7.81)
Here, the above is referred to as the state transition probability. In a first order model,
the probability of switching from state ω i to state ω k is abbreviated as
a ik := P(ω(t) = ω k | ω(t − 1) = ω i ).
Finally, a Markov model is completed by the a priori state probabilities P(ω i ), which
encode the probability of starting in the state ω i .
First order Markov models may be represented by a stochastic automaton, where
the states correspond to the classes ω i and the state transition probabilities are given
by the a ik . An example is shown in Figure 7.21.
Markov models can be, and have been, used to generate sequences of characters.
Table 7.1 presents some examples of sequences that were generated by Markov models
of increasing order. The state transition probabilities and the a priori probabilities
were estimated from large corpora of German, English, and Russian texts.
The sequences generated by the 0th order models were fully determined by the
prior probabilities and do not resemble words from the corresponding languages very
much. Higher order models, on the other hand, include the transition probabilities
and even generate valid words like “IN” and “WHEY”—even though the model does
not explicitly encode the concept of a word!
Table 7.1. Character sequences generated by Markov models of different order. Table reproduced
from Hoffmann [1998].
German:
0  EME GKNEET ERS TITBL BTZENFNDBGD EAI E LASZ BETEATR IASMIRCH EGEOM
1  AUSZ KEINU WONDINGLIN DUFRN ISAR STEISBERER ITEHM ANORER
2  PLANZEUDGES PHIN INE UNDEN VEBEICHT GES AUF ES SO UNG GAN DICH WANDERSO
3  ICH FOLGEMAESZIG BIS STEHEN DISPONIN SEELE NAMEN
English:
0  OCRO HLI RGWR NMIELWIS EULL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
1  ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANOY TOBE SEACE CTISBE
2  IN NO IST LAT WHEY CRACTICT FROURE BIRS GROCID PONDENOME OF DEMONSTRURES OF THE REPTAGIN IS REGOACTIONA OF CRE
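Generating such sequences only requires the a priori probabilities P(ω i ) and the transition probabilities a ik . The following sketch samples from a first order model over a hypothetical three-symbol alphabet; the probabilities are invented and do not correspond to the corpora behind Table 7.1.

# Sampling a character sequence from a discrete first order Markov model (sketch).
import numpy as np

symbols = ["A", "B", "C"]                      # states omega_i (hypothetical)
prior = np.array([0.5, 0.3, 0.2])              # a priori state probabilities P(omega_i)
A = np.array([[0.1, 0.6, 0.3],                 # transition probabilities a_ik:
              [0.4, 0.2, 0.4],                 # row i is the distribution of the next
              [0.3, 0.5, 0.2]])                # state given the current state omega_i

def sample_sequence(length, rng):
    state = rng.choice(len(symbols), p=prior)          # start state drawn from P(omega_i)
    seq = [symbols[state]]
    for _ in range(length - 1):
        state = rng.choice(len(symbols), p=A[state])   # next state drawn from a_ik
        seq.append(symbols[state])
    return "".join(seq)

rng = np.random.default_rng(42)
print(sample_sequence(20, rng))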
So far we have assumed that the model state at a time t is known with absolute certainty,
but this is not always the case. In the introductory speech recognition example on p. 210,
the underlying states (characters) were only indirectly observable via the associated
phonemes. A hidden Markov model (HMM) is an extension to a Markov model that can
deal with such situations.
In addition to states and state transition probabilities, an HMM consists of observa-
tions v j and emission probabilities that denote the probability of seeing an observable
given a chain of states ω(t), . . . , ω(t − l), where l is the order of the model. In a first or-
der HMM, the emission probability of observing v(t) = v j depends only on the current
state ω(t) = ω i and is denoted by
b ij := P(v(t) = v j | ω(t) = ω i ).
Note that with a hidden Markov model, the states are not directly observable. In-
stead, the state sequence can only be inferred from the sequence of observations v j .
Figure 7.22 shows a first order hidden Markov model with three hidden states and four
observations.
There are three important tasks when working with hidden Markov models:
1. Evaluation: given the model parameters, compute the probability of an observed sequence (the forward problem).
2. Decoding: given the model parameters and an observed sequence, infer the most probable sequence of hidden states (the backward problem).
3. Learning: given one or more observed sequences, estimate the model parameters a ik and b ij .
Fig. 7.22. Discrete first order hidden Markov model with three hidden states ω i and four possible
observations v j . The a ik denote the state transition probabilities, while the b ij denote the probability
that v j is observed when entering state ω i . Image after Duda et al. [2001].
The first task is straightforward, as it only involves known quantities and is mostly of
theoretical importance. The second and third tasks, on the other hand, are very impor-
tant in the context of pattern recognition. The second task is analogous to classification
in the conventional setting: observations correspond to feature vectors, whereas the
states correspond to classes. At first, it seems that this task is not much more com-
plicated than the forward problem, but unfortunately it is much more complicated.
The usual approach to the backward problem is the Viterbi algorithm (Viterbi [1967],
Forney [1973]), a variant of dynamic programming. The last task corresponds to learn-
ing a conventional classifier. Here, the goal is to construct an HMM from the data.
Most common approaches estimate the parameters a ik and b ij using an expectation
maximization (EM) algorithm (Dempster et al. [1977]). Without going into detail, EM al-
ternates between computing the expected value of the likelihood of the data given the
parameters (expectation step) and maximizing the expected likelihood by changing
the parameters (maximization step). This procedure eventually converges to a (local)
maximum of the likelihood of the parameters. EM is especially useful in cases where the
likelihood function cannot be maximized analytically.
Naive brute-force methods to solve the inference and estimation tasks fail because
the number of possible sequences grows exponentially with the length T of the se-
quence: the number of possible sequences of length T is c^T , where c is the number of
states.
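The Viterbi algorithm avoids this exponential blow-up by dynamic programming over a c × T table. The following sketch shows the recursion in the log domain; the two-state model and its parameters are hypothetical and only serve to illustrate the bookkeeping.

# Viterbi algorithm: most probable hidden state sequence of a first order HMM (sketch).
import numpy as np

def viterbi(obs, prior, A, B):
    """obs: observation indices; prior: P(omega_i); A: a_ik; B: b_ij = P(v_j | omega_i)."""
    c, T = A.shape[0], len(obs)
    logd = np.full((T, c), -np.inf)        # log of best path probability ending in state i
    back = np.zeros((T, c), dtype=int)     # backpointers
    logd[0] = np.log(prior) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for i in range(c):
            scores = logd[t - 1] + np.log(A[:, i])
            back[t, i] = np.argmax(scores)
            logd[t, i] = scores[back[t, i]] + np.log(B[i, obs[t]])
    # Backtrack from the best final state.
    path = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state, 3-observation HMM.
prior = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], prior, A, B))   # prints [0, 0, 1, 1]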
Although HMMs can be employed whenever one has to deal with sequences,
the most successful applications lie in the recognition of handwriting, gestures, and
speech. A full treatment of the training and inference algorithms as well as applica-
tions are outside the scope of this book. Interested readers are instead referred to other
sources, e.g., Moon and Stirling [2000] or Fink [2003].
7.10 Exercises
(7.3) Construct the kernel function K(p,q) with p = (p1 ,p2 ,p3 )T and q = (q1 ,q2 ,q3 )T
that implicitly performs the following feature mapping φ : ℝ3 → ℝ4 :
φ(m) = φ((m1 ,m2 ,m3 )T ) = (m1 m2 , m2 m3 , m3 m1 , (m1 + m2 + m3 )3 )T .
8 Classification with nominal features
Consider again the classification of an unknown plant. A botanist will ask a sequence
of questions regarding different nominal properties like the shape of the leaf, the for-
mation of the buds of the plant, the shape of the blossom, the color of the blossom,
etc. Every question rules out certain plant candidates, until there is only one option
left—the final decision.
In such situations it is typical that the next question depends on the answer to the
previous question. Formally, this technique of asking questions can be represented
by a decision tree. In this sense, a floral field guide is a “manual” decision tree clas-
sifier. Other examples of such decision trees are medical diagnoses, error detection
procedures for technical equipment, and codified business procedures.
A decision tree is a tree structure, where the inner nodes correspond to questions,
the links between the nodes represent the answers, and the leaves represent the deci-
sions, or classes. An example of a decision tree to classify fruit is shown in Figure 8.1.
In the example, which was taken from Duda et al. [2001], the classes are
Ω/∼ = {ω1 , . . . ,ω7 }
= {Apple, Watermelon, Grape, Grapefruit, Lemon, Cherry, Banana}
and the discrete, four dimensional space of nominal features is given by
m ∈ 𝕄 ⊆ 𝕄1 × 𝕄2 × 𝕄3 × 𝕄4
= {green, yellow, red} × {big, medium, small} × {round, thin} × {sweet, sour}.
Decision trees are easy to understand and can, unlike most of the classifiers dis-
cussed so far, be intuitively interpreted by humans. Of course, the interpretability
breaks down with deeper, more complex trees, but in principle every classification
decision can be reproduced and understood by a human.
Figure 8.1 also shows some key properties of decision tree classifiers: The branches
of a node must be mutually exclusive (no answer may appear on two branches) and
exhaustive (all answers that are possible at a particular node must be covered). Fur-
thermore, a question in a node must have a deterministic and unambiguous answer.
The classification of an unknown object is achieved by sequential decisions along a
path through the tree, until a leaf node is reached. The path itself is determined by the
answers to the questions in the inner nodes. Note that the same question may appear
at multiple points in the decision tree, even on a single path through the tree. For exam-
ple, the question “Size?” appears three times in the example in Figure 8.1. Depending
on where the question is asked, it may have a different number of possible answers
(outgoing edges). Leaf nodes represent classes. An object that reaches
a given leaf node is assigned to the class that is represented by that node. Multiple
leaf nodes may (and usually do) represent the same class. The tree in Figure 8.1, for
example, contains two leaf nodes each for the classes “Apple” and “Grape”.
Decision trees are generally very fast classifiers, provided the tree is well structured
and not too deep. Decision trees allow easily incorporating prior knowledge about the
pattern recognition task into the classifier. For example, one can augment a decision
tree learned from a sample (see Section 8.1.1) by rules that represent expert knowledge.
Decision trees are also applicable to features of a higher scale, e.g., the interval scale.
Then, the nodes typically quantize the features into two or more possible subsets of
values. If, for example, the size were measured in mm, the question “Size?” in the third
node from the left in the second level of Figure 8.1 may be quantized so that “small”
means a size of ≤ 20 mm and “medium” means a size of > 20 mm. Then, the question
would become “Size ≤ 20 mm?”.
Using grouping and quantization, every decision tree can be transformed into a
binary decision tree, i.e., a tree where each internal node has exactly two outgoing
edges. In a binary decision tree, the nodes usually represent yes/no questions. A bina-
rized tree of Figure 8.1 is shown in Figure 8.2. In the following, we will only consider
binary trees.
Fig. 8.1. Decision tree to classify fruit. Inner nodes represent questions about the features, edges
represent possible answers, and leaf nodes represent classes. Recreated according to Duda et al.
[2001].
Fig. 8.2. Binarized version of the decision tree of Figure 8.1.
Learning a decision tree corresponds to determining the structure, that is, the nodes
and branches, of a tree using the training set D. The training procedure is recursive.
At each step, a node is constructed according to some splitting criterion (see below).
The resulting branches split the training set into two disjoint subsets, D = DYes ⊎ DNo .
The subset DYes is associated with the left branch of the split and DNo is associated
with the right branch. On each branch, a node is again constructed according to the
splitting criterion, but only using the samples that reach that node. This procedure is
repeated recursively until a stopping criterion is met.
Like other classifiers, decision trees may overfit the training set. Overfitting hap-
pens if the structure of the tree is too fine grained and the corresponding decision paths
are too detailed. This can be spotted in the learning phase, when too few samples of
the training set reach the deeper nodes and leaves of the decision tree. In the extreme
case, only one training sample is allotted to each leaf node.
To prevent this scenario, a learning algorithm should create a compact tree, i.e., a
tree with as few nodes as possible. A common greedy approach chooses a question for
the current node so that the training sets in the resulting split DYes and DNo are as pure
as possible. The pureness of a dataset is measured using a heterogeneity or impurity
measure, denoted by i(⋅). The impurity measure i(n) should assume a minimum if
the dataset Dn at node n consists of samples of only one class, and should attain a
maximum if the classes are uniformly distributed, i.e., if each class is represented by
the same number of samples in Dn .
There are three standard measures that fulfill these properties: the entropy mea-
sure, the Gini impurity measure, and the misclassification measure. A qualitative com-
parison of the measures for a two-class scenario is shown in Figure 8.3.
Definition 8.1 (Entropy measure). The entropy measure corresponds to the entropy of
the empirical class distribution in the training set,
i(n) = − ∑_{k=1}^{c} P̂(ω k | m,n) log2 ( P̂(ω k | m,n) ) . (8.2)
Here, n denotes the current node and the probability distribution is estimated as the
ratio of the number N nk of samples of class ω k that reach node n to the total number
N n of samples that reach node n:
P̂(ω k | m,n) := N nk / N n . (8.3)
Definition 8.2 (Gini impurity measure). The Gini impurity measure (or simply the Gini
measure) estimates the expected error probability if the class were to be randomly
assigned according to the class distribution at the node n:
i(n) = ∑_{k=1}^{c} ∑_{l=1, l≠k}^{c} P̂(ω l | m,n) P̂(ω k | m,n) = 1 − ∑_{k=1}^{c} ( P̂(ω k | m,n) )² . (8.4)
Definition 8.3 (Misclassification measure). The misclassification measure estimates the probability
of misclassification if every sample at node n were assigned to the most frequent class at that node,
i(n) = 1 − max_k P̂(ω k | m,n) . (8.5)
Fig. 8.3. Qualitative comparison of impurity measures in dependence on the class probability
P(ω(m) = ω1 ) in a two-class scenario.
Impurity minimization
Given an impurity measure i(n), the question to ask at node n is chosen so that the
impurity of the split is minimized, i.e., chosen to maximize the decrease in impurity
∆i(n) = i(n) − (N nYes /N n ) i(nYes ) − (N nNo /N n ) i(nNo ), (8.7)
where nYes and nNo denote the child nodes of the split and N nYes , N nNo the number of
training samples that reach them.
This process is iterated, each time using the corresponding partitions of the train-
ing data, until the decrease in impurity falls below a threshold (∆i(n) < τ i ), or until
the number of training samples that reach the node n becomes too low. In addition
to stopping the training procedure on these conditions, a fully grown decision tree
may be post-processed by merging or pruning nodes after training. These topics are,
however, outside the scope of this textbook.
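The greedy choice of a split can be written down directly from the Gini measure (8.4) and the impurity decrease ∆i(n) described above. The following sketch scores candidate thresholds on a single quantitative feature; the data and thresholds are made up for illustration.

# Greedy split selection by impurity minimization (sketch).
import numpy as np

def gini(labels):
    """Gini impurity i(n) of the samples reaching a node, Equation (8.4)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def impurity_decrease(feature, labels, threshold):
    """Decrease in impurity of the split 'feature <= threshold', weighted by subset sizes."""
    yes = labels[feature <= threshold]
    no = labels[feature > threshold]
    n = len(labels)
    return gini(labels) - len(yes) / n * gini(yes) - len(no) / n * gini(no)

# Hypothetical one-dimensional training data with two classes.
m = np.array([1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5])
omega = np.array([1, 1, 1, 2, 2, 2, 2, 2])

candidates = (m[:-1] + m[1:]) / 2           # midpoints between sorted feature values
best = max(candidates, key=lambda t: impurity_decrease(m, omega, t))
print("best split: m <= %.2f, impurity decrease %.3f"
      % (best, impurity_decrease(m, omega, best)))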
In Figure 8.4, the reference dataset from Section 3.3.2 was used to train a decision
tree with the greedy strategy outlined above. Gini impurity was used as split criterion
and recursion was stopped when fewer than 15 training samples were available for a split.
The decision rules of the tree are shown in Figure 8.5. The testing error is relatively
large and from a visual inspection of the decision regions it is evident that the tree is
more complicated than it should be. The reasons for this will be explored in the next
section.
Fig. 8.4. Application to the reference example of Section 3.3.2. Decision regions of a decision tree
classifier. Training was stopped when there were fewer than 15 samples available for a split. The
training and testing errors are etrain = 7 % and etest = 12 %, respectively. The testing error asymp-
totically approaches etest ≈ 14.6 %. The training set is the same as in Figure 3.8. Test samples are
shown with hollow marks.
Fig. 8.5. Decision rules of the decision tree trained on the reference dataset of Section 3.3.2.
Fig. 8.6. Impact of the features used in decision tree learning. This example descriptively underlines
what is meant by qualifying the boundary between feature extraction and classification as “blurry”
in Figure 1.2.
Fig. 8.7. A decision tree that does not generalize well. The filled marks show the training sample D.
As mentioned in Chapter 2, the choice of features can have a significant impact on the
performance of the classifier. Since decision trees are transparent classifiers, they are
well suited to exemplify this point. Figure 8.6 shows the impact of inadequate features
on the classifier. Here, the features are on the quantitative scale, and each node of the
decision tree splits the feature space parallel to the axes. As can be seen in Figure 8.6a,
the features m1 and m2 produce an unnecessarily complicated and overfitted decision
tree. Yet, a simple feature transformation produces the minimal decision tree shown in
Figure 8.6b. Similarly, the decision tree in Figure 8.7 does not generalize well. Allowing
the misclassification of one training sample would eliminate the thin decision region
around m2 = 0.4 and produce a much simpler decision tree without overfitting.
8.2 Random forests
In practice, decision trees are often observed to overfit the data, especially when the
trees are relatively deep. As mentioned above, one method to address this issue is to
prune the tree after construction. An alternative solution is to train an ensemble of
several classifiers and average their predictions. The idea is that while every classifier
in the ensemble may give inaccurate predictions, the random part of the pertaining
error will average out over the collective predictions. In other words: averaging over
the ensemble will reduce the variance of the classification system. We will state this
intuition more precisely below.
Methods that take this approach, i.e., methods that derive a decision from a set of
classifiers, are called ensemble methods. Ensemble methods subsume a broad range
of techniques, much more than can be covered in this book. Rokach [2010] and Zhou
[2012] give a much more thorough discussion of this subject. Here, we will restrict
ourselves to the following general understanding: ensemble methods predict the class
of a sample according to a weighted average
k(m) = ∑_{j=1}^{M} α j kj (m), (8.8)
where the kj (m) denote the base classifiers in the ensemble, the α j ∈ ℝ denote weights
associated with each classifier, and k(m) is the overall decision function of the ensem-
ble.
One particularly successful instance of ensemble methods is the method of random
forests, sometimes also called random decision forests or randomized forests. As the
name suggests, a random forest is composed of decision trees kj (m), j = 1, . . . ,M,
where each tree is weighted equally, α j = 1/M . We will discuss another ensemble method,
AdaBoost, in Section 9.3, where the weights α j of the classifiers are adapted during the
training phase. Interestingly, under certain conditions, AdaBoost can be interpreted
as a special case of a random forest (Breiman [2001]).
Similar to the SVM, random forests have shown remarkable classification perfor-
mance out of the box in many practical classification tasks. This success can largely
be attributed to a key idea that lends random forests the first half of their name: ran-
domization.
Random forests use randomization during training in two ways: First, each tree
in the ensemble is trained on a random subsample of the training set. Second, at each
node in each tree, only a random subspace of the feature space is considered for a
split.
More formally, let D = {m1 , . . . ,mN } be the training set of N d-dimensional train-
ing samples mi = (m i1 , . . . ,m id )T with known class memberships ω(mi ), i = 1, . . . , N.
To train the decision tree kj (m), first a new training set D̃ j = {mr(1) , . . . ,mr(B) } is con-
structed by randomly sampling from D with replacement. Here, r(l) is a function that
maps l to a random integer 1 ≤ k ≤ N and B ≤ N is the size of the sub-sampled training
set D̃ j (typically B ≈ 0.7N). Note that r(⋅) may map to the same integer more than once,
meaning that one and the same training sample mi may occur more than once in D̃ j .
The motivation behind the resampling is that each tree will be sensitive to a slightly
different version of the classification problem. As a result, each tree will make different
errors in classification, but these errors will presumably be corrected by the other trees
in the ensemble. In a broader context, D̃ j is called a bootstrap sample (bootstrapping
is a statistical technique to deal with small datasets) and the aggregation of decision
trees trained on different bootstrap samples is referred to as bootstrap aggregating or
bagging for short.
Still, the trees are likely to choose the same features in the first few splits, since
these splits favor features that are highly correlated with the class memberships ω(mi ).
To circumvent this issue, only a random subset of d′ ≤ d features (typically with d′ ≈ √d)
is considered when finding each split. In other words, during the decision
tree learning, the question to ask at node n is chosen to minimize the impurity of the
split according to Equation (8.7), but only d′ randomly selected features are considered
as candidates. Note that each split considers a different set of features, that is, the d′
feature candidates are chosen anew in each individual iteration of the decision tree
learning.
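Both sources of randomness can be sketched in a few lines. The tree learner below is taken from scikit-learn purely as a stand-in, with B ≈ 0.7N and d′ ≈ √d as suggested above; the data are synthetic.

# Bagging and per-split feature sub-sampling for a random forest (sketch).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, z, M=10, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    N, d = X.shape
    B = int(0.7 * N)                       # size of the bootstrap sample D~_j
    d_sub = max(1, int(np.sqrt(d)))        # number d' of feature candidates per split
    forest = []
    for _ in range(M):
        idx = rng.integers(0, N, size=B)   # r(l): draw indices with replacement
        # max_features delegates the per-split feature sub-sampling to the tree learner.
        tree = DecisionTreeClassifier(max_features=d_sub).fit(X[idx], z[idx])
        forest.append(tree)
    return forest

# Hypothetical data: two Gaussian classes in four dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
z = np.hstack([np.zeros(100, int), np.ones(100, int)])
forest = train_random_forest(X, z, M=10, rng=rng)
print("number of trees:", len(forest))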
The main effect of bagging and feature sub-sampling is that the trees will become
decorrelated, because they specialize on different training samples and emphasize
different features. As a result, this minimizes the variance of the whole ensemble.
This result can be seen from the following calculation, where the weights α j = 1/M
in Equation (8.8) are pulled in front of the sum (Hastie et al. [2001]). Note that the
discussion uses real-valued decision functions k j (⋅) instead of the vectorial kj (⋅) in
Equation (8.8) to simplify the notation, but the argument still holds for vector-valued
decision functions.
With the assumptions E{k j (m)} = 0, Var{k j (m)} = σ² and Cov{k j (m), k l (m)} =
ρσ² for all j ≠ l, j,l = 1, . . . , M:
Var{ (1/M) ∑_{j=1}^{M} k j (m) } = (1/M²) ∑_{j=1}^{M} ∑_{l=1}^{M} Cov{k j (m), k l (m)}
= (1/M²) ∑_{j=1}^{M} ( ∑_{l≠j} Cov{k j (m), k l (m)} + Var{k j (m)} )
= (1/M²) ∑_{j=1}^{M} ( (M − 1)ρσ² + σ² )
= ( M(M − 1)ρσ² + Mσ² ) / M²
= (M − 1)ρσ²/M + σ²/M
= ρσ² + σ² (1 − ρ)/M . (8.9)
It can be seen that the variance of k(m) is reduced when
– the correlation ρ between the trees is reduced, or
– M is increased, i.e., more trees are added to the ensemble.
Of course, adding more trees to the ensemble will probably increase the correlation
between the trees and thereby increase the first term of the above equation. At the
same time, removing too many trees will increase the second term. The “correct” num-
ber of trees in the ensemble depends on the classification performance, but in many
applications a number M on the order of tens will provide a good baseline.
What remains to be discussed is classification with random forests, i.e., how to im-
plement Equation (8.8). Breiman [2001] suggests deriving a class probability estimate
Fig. 8.8. Application to the reference example of Section 3.3.2. Decision regions of a random forest
classifier. The forest was composed of 10 decision trees. Training was stopped when there were
fewer than two samples available for a split. The training and testing errors are etrain = 1 % and
etest = 8.9 %, respectively. The testing error asymptotically approaches etest ≈ 8.31 %, which is very
close to the 6.16 % asymptotic testing error of the optimal classifier. The training set is the same as
in Figure 3.8. Test samples are shown with hollow marks.
P̂(ω k | m) from a majority vote. Each tree kj (⋅) in the ensemble will classify the sample
m and the probability estimate for class ω k is the fraction of trees that voted for ω k ,
P̂(ω k | m) = (1/M) ∑_{j=1}^{M} δ[arg maxω P̂ j (ω | m) = ω k ] , (8.10)
where P̂ j (ω | m) is the a posteriori probability derived from the decision tree kj (m) and
δ[⋅] denotes the generalized Kronecker symbol.
In practice, Equation (8.10) does not require probability estimates P̂ j (ω | m), but
only a class assignment ω̂ j (m). In other words, one can simply use the class assignment
stored in the leaf nodes of the tree kj (⋅), as described in Section 8.1.
An alternative approach is due to Ho [1995], where the class probability estimate
is the average of the probability estimates P̂ j (ω | m) of the trees,
P̂(ω k | m) = (1/M) ∑_{j=1}^{M} P̂ j (ω k | m) . (8.11)
Here, simple class assignments are not enough: instead, this method requires
full probability estimates. A straightforward approach to obtain these estimates is to
store the class membership probabilities P̂ j (ω k | m,n) in each leaf node n of each tree
kj (⋅). The overall estimate of the tree, P̂ j (ω k | m), is then given by the leaf node n that
is reached by m. The P̂ j (ω k | m,n) can be estimated as in Equation (8.3), i.e., with N jnk
denoting the number of training samples of class ω k that reach node n of tree kj (⋅),
and N jn denoting the total number of training samples that reach that node,
P̂ j (ω k | m,n) := N jnk / N jn . (8.12)
In either case, the final class estimate is given by the maximum a posteriori classi-
fier as in Equation (3.23),
ω̂(m) = arg max_{ω k ∈ Ω/∼} P̂(ω k | m) . (8.13)
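The two combination rules (8.10) and (8.11) and the final decision (8.13) are easily implemented once the per-tree estimates are available. In the following sketch, the per-tree estimates P̂ j (ω k | m) for a single sample m are given as a hypothetical M × c array.

# The two ensemble rules (8.10), (8.11) and the final decision (8.13), sketched
# for M hypothetical trees whose per-tree estimates are given for one sample m.
import numpy as np

P_trees = np.array([[0.7, 0.3],      # tree 1: P^_1(omega_1 | m), P^_1(omega_2 | m)
                    [0.4, 0.6],      # tree 2
                    [0.8, 0.2],      # tree 3
                    [0.9, 0.1],      # tree 4
                    [0.3, 0.7]])     # tree 5
M, c = P_trees.shape

# Equation (8.10): fraction of trees that vote for omega_k (majority vote).
votes = P_trees.argmax(axis=1)
P_vote = np.bincount(votes, minlength=c) / M

# Equation (8.11): average of the per-tree probability estimates.
P_avg = P_trees.mean(axis=0)

# Equation (8.13): maximum a posteriori decision.
print("vote-based estimate", P_vote, "-> class", P_vote.argmax() + 1)
print("averaged estimate  ", P_avg, "-> class", P_avg.argmax() + 1)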
Fig. 8.9. Strict string matching: the pattern m = “bdac” is found in the text t at shift s = 5. Figure
according to Duda et al. [2001].
8.3 String matching
Another specialized technique may be used when the pattern is a sequence of symbols.
As an example, consider the classification of DNA sequences in biomedical applica-
tions. DNA can be thought of as a series of pairs of the bases adenine, thymine, cytosine,
and guanine. However, as thymine always pairs with adenine and cytosine always pairs
with guanine, DNA can be sufficiently described as a sequence of symbols in the al-
phabet A = {A,C,G,T}. Here, A stands for adenine, C for cytosine, G for guanine, and T
for the base thymine.
AGCTTCGAATC. Long sequences of symbols (where the meaning of “long” depends
on the context) are also called a text and a substring of a sequence is denoted a factor.
Symbols may represent either nominal or ordinal features. With the DNA example, the
symbols are nominal.
The task in string matching is to find a sequence m in a given text t, where t is
usually much longer than m (see Figure 8.9). In other words, the task is to answer the
question: is the sequence m a factor of the text t and where is it located? With the DNA
example, string matching can be used to find markers for genetic disorders in the DNA
of a patient.
Often, it is not necessary or even possible to find exact matches. With genetic
markers, random mutations may cause individual genes to change, without affecting
the overall behavior of the factor. One method to deal with such situations is to use the
nearest neighbor classifier (see Section 5.3), where the distance between two sequences
m1 and m2 is the edit distance of the sequences. The edit distance between m1 and m2
is the minimum number of string operations—insertions, deletions and substitutions—
to transform the sequence m1 into the sequence m2 (see Figure 8.10).
During the training of the classifier, all given strings (factors) and their class mem-
berships are stored. When classifying an unknown sequence m, all edit distances of
m to the stored factors are computed and the class of the sequence with the minimum
distance to m is assigned.
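The edit distance itself is computed by the standard dynamic programming recurrence over prefixes of the two sequences. The following sketch also shows the nearest neighbor step over a small, invented set of stored factors.

# Edit (Levenshtein) distance between two symbol sequences (sketch).
def edit_distance(a, b):
    # d[i][j] = minimum number of insertions, deletions and substitutions
    # needed to transform a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(a)][len(b)]

# Nearest neighbor classification over stored factors (hypothetical training data).
factors = {"GATTACA": "marker 1", "AGCTTCG": "marker 2"}
query = "GATTTACA"
best = min(factors, key=lambda f: edit_distance(query, f))
print(best, "->", factors[best], "distance", edit_distance(query, best))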
Another strategy is to include in the alphabet the special wildcard symbol ⋆, which
matches with any character in A. The wildcard symbol may appear both in the text
and the target sequence m. An example of string matching with wildcard symbols is
shown in Figure 8.11.
Fig. 8.10. Approximate string matching: the pattern m = “structures” matches the factor “strictures”
in the text t at shift s = 11 with edit distance 1. Figure according to Duda et al. [2001].
Fig. 8.11. String matching with the wildcard symbol ⋆, which matches any single character: the
pattern m = “patt⋆r⋆s” matches the factor “pa⋆ter⋆s” in the text t. Figure according to Duda et al.
[2001].
8.4 Grammars
Besides string matching, another approach to classifying sequences is the use of gram-
mars, where every class is represented by a grammar G i , i = 1, . . . ,c. All sequences that
are generated by the grammar G i are considered equivalent. In this sense, a grammar
G i corresponds to a model of the patterns of the class ω i . Classification with grammars
corresponds to parsing: A pattern is assigned to the class whose grammar generates
the pattern.
Formally, a grammar G is a quadruple G = (A,V,S,P) of an alphabet A of terminal
symbols, variables V, the starting variable S ∈ V, and a set of production rules P that
replace variables v ∈ V with strings of variables or terminal symbols. L(G) denotes the
language of the grammar G, where the language is the set of all sequences of terminal
symbols that can be produced by G.
An example of a grammar (according to Duda et al. [2001]) is given by the alphabet
A := {a,b,c}, the variables V = {A,B,C,S} and the production rules
{ p1 : S → AB or BC }
{
{ }
}
{ p2 : A → BA or a }
P={ } .
{
{ p3 : B → CC or b }
}
{ }
{ p4 : C → AB or a }
Parsing a grammar can be approached bottom–up or top–down. Bottom–up pars-
ing starts at the sequences and applies rules in reverse until the starting variable S
is reached (see Figure 8.12, left). Top–down parsing starts with S and applies rules
until the sequence is matched (Figure 8.12, right). The details of both approaches are
outside the scope of this textbook.
Fig. 8.12. Bottom up (left) and top down (right) parsing of the sequence “baaba” given the example
grammar (see text). In both cases, the sequence is accepted.
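A bottom-up parse as in the left half of Figure 8.12 can be carried out with the CYK algorithm, since the example grammar is in Chomsky normal form (every rule produces either two variables or a single terminal). The following sketch fills the same triangular table of variables; it illustrates one possible implementation, not the book's.

# Bottom-up (CYK) parsing of a sequence with the example grammar (sketch).
rules = {                     # p1 ... p4 from the text
    "S": [("A", "B"), ("B", "C")],
    "A": [("B", "A"), ("a",)],
    "B": [("C", "C"), ("b",)],
    "C": [("A", "B"), ("a",)],
}

def cyk_accepts(sequence, start="S"):
    n = len(sequence)
    # table[i][l] = set of variables that can generate sequence[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, symbol in enumerate(sequence):                       # length-1 factors
        for v, productions in rules.items():
            if (symbol,) in productions:
                table[i][1].add(v)
    for l in range(2, n + 1):                                   # longer factors
        for i in range(n - l + 1):
            for split in range(1, l):
                for v, productions in rules.items():
                    for p in productions:
                        if len(p) == 2 and p[0] in table[i][split] \
                                and p[1] in table[i + split][l - split]:
                            table[i][l].add(v)
    return start in table[0][n]

print(cyk_accepts("baaba"))     # True: the sequence is generated by the grammar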
The learning problem corresponds to the construction of a grammar from the data
in D and is also known as grammar induction. However, there are generally infinitely
many grammars that are consistent with a finite number of sample sequences. A solu-
tion is again offered by Ockham’s razor: the simplest grammar that is consistent with
the data is to be preferred over more complex grammars that also fit the data. Like
parsing, methods to learn a grammar from data are outside the scope of this textbook.
8.5 Exercises
(8.1) Use the features animal, weight, and speed to construct a balanced decision tree
that classifies the following sample without error:
(8.2) Given below is a training sample of two-dimensional features for two classes ω1
and ω2 and a decision tree learned from the data.
Sketch the decision boundary of this classifier. Which leaves are a symptom of
overfitting? Can the tree be simplified to eliminate the overfitting?
The decision tree reads:
m1 < 5?
  Y: m1 < 3?
    Y: ω2
    N: m2 < 2?
      Y: ω1
      N: ω2
  N: m2 < 5?
    Y: m1 < 8?
      Y: ω1
      N: m2 < 4?
        Y: ω2
        N: ω1
    N: ω2
9 Classifier-independent concepts
In the final chapter of this book, we will explore topics that are, in a sense, orthogonal
to the classifiers we have seen so far. The first section deals with fundamental concepts
and the limits of all statistical learning. The following sections will give an overview of
methods for empirically evaluating a classifier’s performance. The last two sections
will introduce boosting, a meta-technique for combining the predictions of several
weak classifiers into one strong classifier, and will discuss techniques for classifying
with the option of rejecting a sample.
9.1 Learning theory
The first two issues are part of the topic of proper sampling, and so of lesser interest in
the context of learning theory. The third point addresses the quality of the features and
the problem of extracting the relevant information from the patterns. The fourth point,
which here is reformulated compared to point 4 on p. 6 in Section 1.4, gives rise to the
central problem of statistical learning. In the following discussion of this problem, we
assume the classification is binary (i.e. c = 2). In this case, it is sufficient to consider a
single decision function k(m | θ) instead of ω̂(m | θ).
Fig. 9.1. Relation of the world model P(m,ω) and the training and test sets D and T. The training set D
and the test set T are drawn from a stochastic process with unknown joint probability distribution
P(m,ω). The training error etraining (m,θ) and the test error etest (m,θ) of a model θ are estimated on
the sets D and T.
Under what conditions does a small training error lead to a small test error? More
generally, statistical learning theory is concerned with the ability of a classifier to
generalize, that is, to classify unseen samples with little error. Note that the two are
inversely related: if the ability to generalize is great, the mean test error will be small,
and vice versa.
More formally, given a decision function k(m,θ) governed by a controllable pa-
rameter vector θ, what can be known about the expected test error ε(θ) = E{etest (m,θ)}?
Under what conditions will ε(θ) be minimal and what is the probably approximately
correct (PAC, Valiant [1984]) upper bound on ε(θ)? With probability of at least 1 − η,
ε(θ) ≤ etraining (θ) + Φ(ν,N,η) (9.1)
holds. Here, N = |D| is the number of samples in the training set, ν is the VC dimension
(see below) of the set K = {k(m,θ)|θ ∈ Θ} of decision functions and Φ(ν,N,η) denotes
the VC confidence (Vapnik and Vapnik [1998]), defined by
Φ(ν,N,η) = √[ ( ν (log(2N/ν) + 1) − log(η/4) ) / N ] . (9.2)
This bound holds regardless of the underlying distribution of the data if this distri-
bution is the same for all samples and all the samples were produced independently,
Fig. 9.2. Sketch of different class assignments to a sample using the model families Kh of two-
dimensional hyperplanes and Ke of ellipses for separation. The VC dimension of Kh is ν = 3,
whereas that of Ke is ν ≥ 4.
that is, if the data is distributed i.i.d. A key quantity in Equation (9.2) is the Vapnik–
Chervonenkis dimension (VC dimension) ν of the set of decision functions K. Briefly,
ν acts as a measure of the complexity of the model family represented by K. Since a
rigorous definition requires concepts that are outside the scope of this text, we will
give only an informal, intuitive description of ν:
Consider a given set of N samples to be assigned to two classes. Because each
sample can belong to either class, but not to both, there are 2^N constellations of pos-
sible class assignments in total. The VC dimension ν of a set K of decision functions
is defined as the maximum number of samples that can be separated by K for all pos-
sible class assignments, independently of the spatial distribution of the samples. An
example with two-dimensional features m ∈ ℝ2 is shown in Figure 9.2. In the first
row, the set K is the set of two-dimensional hyperplanes, Kh = {wT m − b = 0}. In the
second row, K is the set of ellipses, Ke = {(m − µ)T Λ(m − µ) − b = 0}, where µ denotes
the center of the ellipse and Λ ∈ ℝ2×2 is a positive semidefinite matrix. Note that in
general a set of four two-dimensional samples cannot be separated by a hyperplane
(e.g., the configuration in the first panel of the second row of Figure 9.2—the XOR prob-
lem), but three samples can. Thus, the VC dimension is ν = 3. In general, if K denotes
the set of hyperplanes in ℝd , then ν = d + 1. For polynomials, ν grows with the degree
of the polynomial.
Since the VC dimension ν depends on the model parameters θ, all three quantities
in Equation (9.1), the expected test error ε(θ), the empirical training error etraining (θ),
and the VC confidence Φ(ν,N,η), change with varying ν. Figure 9.3 outlines how the
three quantities change with increasing VC dimension ν. It can be seen that with larger
ν, the VC confidence Φ also increases, while the empirical training error decreases. The expected test
error ε(θ) first decreases, but increases again when the classifier begins to overfit the
data. The optimal ν, that is, the model family with optimal complexity with respect to
ε, is found where etraining (θ) + Φ(ν,N,η) is minimal.
Fig. 9.3. Qualitative plot of the expected test error ε(θ), the empirical training error etraining (θ), and
VC confidence Φ(ν,N,η), against VC dimension ν.
Inserting Equation (9.2) into the bound yields
ε(θ) ≤ etraining (θ) + √[ ( ν (log(2N/ν) + 1) − log(η/4) ) / N ] , (9.3)
which, together with Figure 9.3, gives rise to the following observations:
1. If the number of training samples is vastly larger than the VC dimension, i.e., N/ν →
∞, then Φ → 0 and in turn ε → etraining . In this case, it is sufficient to minimize
the training error, as a small training error guarantees a small test error. This is
known as the empirical risk minimization (ERM) principle (Vapnik and Vapnik
[1998]).
2. If, on the other hand, the ratio N/ν is small, then Φ dominates the bound. A small
training error no longer guarantees a small test error. Here, etraining and Φ must
be minimized simultaneously instead. This principle is known as the structural
risk minimization (SRM) principle (Vapnik and Vapnik [1998]).
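To get a feeling for the interplay of N and ν, the following sketch simply evaluates the VC confidence of Equation (9.2) for a fixed ν and increasing sample sizes; the numbers are arbitrary.

# VC confidence Phi(nu, N, eta) of Equation (9.2) for a few ratios N / nu (sketch).
import numpy as np

def vc_confidence(nu, N, eta=0.05):
    return np.sqrt((nu * (np.log(2 * N / nu) + 1) - np.log(eta / 4)) / N)

nu = 10                                    # e.g. hyperplanes in a 9-dimensional feature space
for N in (20, 100, 1000, 100000):
    print(f"N = {N:6d}, N/nu = {N / nu:7.1f}, Phi = {vc_confidence(nu, N):.3f}")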
Fig. 9.4. Classification error probability for (a) a non-optimal and (b) the optimal choice of the
decision boundary. Even with an optimal decision boundary, there is a remaining, irreducible error
probability.
Most practical applications of pattern recognition fall in the second case: there is a
relatively small dataset on which to train the classifier, which means one should either
choose a model with low VC dimension or train the model using SRM.
9.2 Empirical evaluation of classifier performance
The design phases of a pattern recognition system in Figure 1.4 listed the evaluation
of the classifier as the last step. In the previous chapters, estimators of the error proba-
bility P e were given for some specific classifiers, but this is not the only possibility for
assessing the performance of a classifier. This section will fill that gap by introducing
some common performance measures and techniques to validate classifiers with a
finite test sample.
Fig. 9.5. (a) Terms for classification outcomes: correct rejection, false alarm, slack, and discovery
(hit). (b) Confusion matrix: rows correspond to the true class ω, columns to the predicted class ω̂,
with the cells TN, FP, FN, and TP.
We will first focus on a binary classifier that decides between only c = 2 classes.
Usually, one class is called the “negative class” and the other one the “positive class.”
Here, we associate ω1 with the negative class and ω2 with the positive class. Binary
classifiers are always used if the goal is to detect the presence or absence of some qual-
ity, e.g., diseased vs. healthy fruit, defective vs. intact workpiece, etc. Moreover, every
multi-class classification task can be solved using a combination of binary classifiers
(see, for example, Section 7.1.1). Because binary classifiers play such an important role
in pattern recognition, many technical terms have been invented to precisely describe
their characteristics. In order to simplify the upcoming discussion, we restrict the sce-
nario to a one-dimensional feature space (d = 1) and a linear classifier. This means
the decision boundary is just a single point m∗ and the decision regions are given by
R1 = {m ∈ M|m < m∗ } and R2 = {m ∈ M|m > m∗ }.
As already discussed in the previous chapters, the feature distributions usually
overlap, which means that there is some minimum error probability that cannot be
reduced any further. This situation is depicted in Figure 9.4. In both sub-figures, the
decision boundary is marked by m∗ and the optimal decision boundary according
to Bayes is labeled mB . A classification error occurs if a feature falls in a different
decision region than that to which the true class belongs. In this example, the overall
error equals
P e = ∫_{R1} p(m,ω2 ) dm + ∫_{R2} p(m,ω1 ) dm , (9.4)
which is the sum of the red and blue areas in Figure 9.4. In Figure 9.4b, this sum attains
its minimum, as the decision boundary equals the optimal boundary mB = m∗ . In
Figure 9.4a, the sum is larger, because the boundary introduces an additional reducible
error (purple frame).
Depending on the true class ω and the classifier prediction ω̂, one can distinguish
four cases (see Figure 9.5a):
– Correct rejection (true negative): the true class is ω1 and the sample is classified as ω1 .
– False alarm (false positive): the true class is ω1 , but the sample is classified as ω2 .
– Slack (false negative): the true class is ω2 , but the sample is classified as ω1 .
– Discovery (hit, true positive): the true class is ω2 and the sample is classified as ω2 .
Fig. 9.6. Class-conditional decision probabilities of the four outcomes: P(m ∈ R1 | ω1 ) (correct
rejection), P(m ∈ R2 | ω1 ) (false alarm), P(m ∈ R1 | ω2 ) (slack), and P(m ∈ R2 | ω2 ) (discovery).
Figure 9.6 shows the class-conditional decision probabilities that correspond to the
probabilities for each of the cases. Please note that these are not the same as the error
probability, which is the unconditional probability that the classification is false.
When evaluating a classifier using a test set T, the outcome of each case is counted
and recorded in a confusion matrix, as shown in Figure 9.5b. In the two-class setting,
the cells are often labeled as follows: true negatives, which counts the number of cor-
rect rejections, false positives, which counts the number of false alarms, false negatives,
which is the number of falsely rejected samples, and true positives, which is the num-
ber of discoveries in the dataset. We denote the number of samples in each cell by TN,
FP, FN, and TP, respectively. Using these quantities, one can compute higher order
performance measures that characterize the classifier. These measures usually approx-
imate a characteristic classification probability. A list of common measures is given in
Table 9.1.
These measures are coupled in interesting ways. Changing a parameter to improve
one measure will also change the values of the other measures. Unfortunately, the
coupling is sometimes reversed: improving one measure may have a negative impact
on another. Consider the example in Figure 9.7a, where the classifier has only one
parameter m∗ . Increasing m∗ will decrease the false positive rate (the fall-out), but
the true positive rate (the recall) will also be decreased. Moving the decision boundary
in the opposite direction will increase the recall, but also increase the fall-out.
Table 9.1. Common binary classification performance measures derived from a confusion matrix.
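The measures are simple ratios of the four counts and are easily computed from a confusion matrix. The sketch below uses the common definitions of recall, precision, fall-out, specificity, and accuracy on invented counts; the exact selection of measures in Table 9.1 is not reproduced here.

# Common binary performance measures derived from a confusion matrix (sketch).
def binary_measures(TP, FP, FN, TN):
    return {
        "recall (true positive rate)":    TP / (TP + FN),
        "precision":                      TP / (TP + FP),
        "fall-out (false positive rate)": FP / (FP + TN),
        "specificity":                    TN / (TN + FP),
        "accuracy":                       (TP + TN) / (TP + FP + FN + TN),
    }

# Hypothetical counts from evaluating a detector on a test set T.
for name, value in binary_measures(TP=40, FP=10, FN=5, TN=45).items():
    print(f"{name:33s} {value:.3f}")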
We now turn our attention back to the generic multi-class setting. Generalizing the
classification error probability P e of the binary classifier (see Equation (9.4)) to the
multi-class setting yields
P e = 1 − ∑_{k=1}^{c} ∫_{R k} p(m,ω k ) dm .
Fig. 9.7. Examples of ROC curves for two Gaussian feature distributions with variance σ 2 = 1 and
distance d between the expectations μ1 and μ2 : (a) underlying class-specific feature distributions
with recall and fall-out marked, (b) ROC curves for d = 0.7, 1.3, 2.0, and d → ∞.
Intuitively, the error probability P e is the probability that the feature vector m does
not fall in the correct decision region.
Other measures from the binary case, like precision and recall, cannot directly
be applied in a multi-class setting. However, these measures can still be computed
individually for each class: the c-class classification problem is treated as c separate
binary classification problems, where the goal is to separate the target class ω i from
all other classes ω k , k ≠ i, denoted ω̄ i for short.
Formally, let C(ω j ,ω k ) denote the number of samples with true class ω = ω j that
were classified as ω̂ = ω k . The number of true positives, false positives, false negatives,
Fig. 9.8. Converting a multi-class confusion matrix to binary confusion matrices: A multi-class
confusion matrix of c classes can be subsumed into c binary confusion matrices, one for each class
ω i . Example here: the reduced confusion matrix with respect to ω2 , i.e., i = 2.
and true negatives for class ω i are then computed by (see Figure 9.8):
TPi = C(ω i ,ω i ), FPi = ∑_{j≠i} C(ω j ,ω i ), FNi = ∑_{k≠i} C(ω i ,ω k ), TNi = ∑_{j≠i} ∑_{k≠i} C(ω j ,ω k ).
Given these counts, the measures from Table 9.1 can be derived for each class
individually. Sometimes, the class-wise measures are further averaged over all classes
to grade the overall performance with a single number.
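The reduction of Figure 9.8 amounts to summing the appropriate row and column of the confusion matrix C(ω j ,ω k ). A sketch with a hypothetical 3 × 3 matrix:

# Per-class TP, FP, FN, TN from a multi-class confusion matrix (sketch).
import numpy as np

C = np.array([[50,  3,  2],    # hypothetical counts C(omega_j, omega_k):
              [ 4, 45,  6],    # row = true class, column = predicted class
              [ 1,  5, 44]])

def per_class_counts(C, i):
    TP = C[i, i]
    FP = C[:, i].sum() - TP        # predicted as omega_i although the true class differs
    FN = C[i, :].sum() - TP        # true class omega_i but predicted differently
    TN = C.sum() - TP - FP - FN
    return TP, FP, FN, TN

for i in range(C.shape[0]):
    TP, FP, FN, TN = per_class_counts(C, i)
    print(f"class {i + 1}: recall {TP / (TP + FN):.3f}, precision {TP / (TP + FP):.3f}")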
In the following, assume that a classifier with c classes was trained on the training set
D and the error probability will be estimated from the test set T. Again, all elements in
the training and test sets are i.i.d. The number of samples in T is denoted by |T| = NT .
Denote by Tj the set of the |Tj | = NT,j test samples that belong to class ω j . Let P j denote
the (unknown) error probability for class ω j and let n j denote the number of incorrectly
classified samples in Tj . The probability of incorrectly classifying n j items of Tj is
Pr{n j items of Tj misclassified} = (NT,j choose n j ) P j^{n j} (1 − P j )^{NT,j − n j} . (9.10)
Note, again, that this error probability cannot be computed, because the P j are
not known. It can, however, be estimated from the test sample using the maximum
likelihood estimate of P j , P̂ j = n j /NT,j , which leads to the maximum likelihood estimate
of the overall error probability of the classifier,
P̂ e = ∑_{j=1}^{c} P(ω j ) n j /NT,j . (9.11)
This highlights two problems. First, the variance of the estimator is inversely pro-
portional to NT,j , which means that small test subsets will result in a relatively high
variance of the estimated error probability. Second, the comparison of two different
classifiers with respect to their error probability is only valid if the difference in P̂ e is
significant w.r.t. the stochastic error of the estimate √Var{P̂ e }.
The question, then, is how to choose a good test sample in order to get good esti-
mates P̂ e . General approaches to choosing adequate test sample sizes are described by
Guyon et al. [1998]. Here, however, we are only interested in choosing the number of
test samples NT so that, with probability (1 − a), the true error rate P e does not exceed
the estimated error rate P̂ e by more than some small quantity ε(NT , a):
Pr{P e > P̂ e + ε(NT ,a)} ≤ a .
Defining ε(NT ,a) := βP e as a fraction of the true classification error, one can
compute suitable test set sizes. For example, for a = 0.05 and β = 0.2, i.e., that there
should be a 95 % probability that the true error P e does not exceed the estimated error
P̂ e by more than 20 %, it follows that the number of test samples should be NT ≈ 100/P e
(Guyon et al. [1998]). Note that this result is independent of the number of classes c.
D1 := S1 ∪ S2 ∪ S3 ∪ S4 , T1 := S5 , and analogously for the folds 2 to 5; the fold-wise
estimates are combined as
P̂ e = (1/5) ∑_{f=1}^{5} P̂ e,f and Var{P̂ e } ≈ (1/4) ∑_{f=1}^{5} ( P̂ e,f − P̂ e )² .
Fig. 9.9. Example of a five-fold cross-validation: The dataset S is partitioned into five equally sized
subsets Si . In each fold, one subset is used as the test set T and the training set D is the union of the
remaining subsets.
In the jth of the m rounds (also called folds), the learning set D = S \ Sj is con-
structed from all but the jth subset, which is used as the test set T = Sj . The classifier
is trained using D and evaluated using T as usual, and the classification error—or any
other performance measure—is recorded. The process is repeated for each j = 1, . . . ,m,
so that every sample is used for testing once. The final estimate of the test error is taken
as the arithmetic mean of the m test errors estimated in the folds. Cross-validation also
makes it possible to estimate the variance of the test error estimate. Typical values for the number of folds are m = 5 or m = 10. A schematic example of a five-fold cross-validation
is shown in Figure 9.9.
A special case occurs if m = N = |S|, i.e., when the number of folds is equal to
the number of samples. Here, the classifier is trained with all but one sample in S
and evaluated on the sample that was left out. This scheme is therefore also known
as the leave-one-out cross-validation. The estimates obtained with a leave-one-out
cross-validation are typically more reliable than with m-fold cross-validation, but this
increased precision is paid for by a much larger evaluation effort.
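The procedure translates directly into code. The following Python sketch is a minimal illustration (NumPy only; `train` and `predict` are hypothetical callables standing in for an arbitrary classifier), not a reference implementation; in practice the folds would usually also be stratified by class.

```python
import numpy as np

def cross_validate(X, z, train, predict, m=5, seed=0):
    """m-fold cross-validation estimate of the classification error.

    train(X, z) -> model and predict(model, X) -> labels are placeholders
    for an arbitrary classifier.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(z)), m)    # partition S into m subsets S_f
    errors = []
    for f in range(m):
        test_idx = folds[f]                               # T = S_f
        train_idx = np.concatenate([folds[j] for j in range(m) if j != f])  # D = S \ S_f
        model = train(X[train_idx], z[train_idx])
        errors.append(np.mean(predict(model, X[test_idx]) != z[test_idx]))
    errors = np.array(errors)
    # mean of the fold errors and variance estimate as in Figure 9.9
    return errors.mean(), errors.var(ddof=1)
```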
9.3 Boosting
[Excerpt of Algorithm 9.1 (AdaBoost), concluding steps:
α_j ← log((1 − ε_j)/ε_j)
w_i ← w_i · exp(α_j · δ[z_i ≠ k_j(m_i)])   for i = 1, . . . , N
return k(m) = sign(∑_{j=1}^{M} α_j k_j(m))]
The underlying idea of boosting is that an ensemble of many weak classifiers, i.e., classifiers that perform marginally better than a random guess, can form a strong classifier.
The decision function k(m) of the strong classifier is a weighted sum of those of the
weak classifiers k j (m),
k(m) = sign( ∑_{j=1}^{M} α_j k_j(m) ).    (9.15)
The weights α j are adapted so that the weak classifiers k j (m) are weighted accord-
ing to their classification performance on the training set. Note that in Section 8.2, we
have discussed another instance of ensemble methods: Random Forests. Here, how-
ever, the base learners k j (⋅) may be any type of classifier, not just decision trees. The
only restriction is, again, that the k j (⋅) must perform better than random guessing.
Note that here we restrict ourselves to binary classification, hence the scalar decision
function k j (⋅) is used in place of the vectorial kj (⋅).
One of the best-known boosting algorithms is AdaBoost (short for adaptive boost-
ing) of Freund and Schapire [1997]. The algorithm is very simple, yet produces very
strong classifiers out of the box. The basic idea of AdaBoost is to iteratively generate
weak classifiers that minimize the weighted training error on the training set. Initially,
all training samples are weighted equally, but after each iteration more weight is put on
misclassified samples. Consequently, the classifier chosen in the subsequent iterations
will put more emphasis on correctly classifying samples that were often misclassified.
Pseudo-code of the algorithm is given in algorithm 9.1. If the weak classifier is cho-
sen from a set of base classifiers, line 3 of the algorithm is replaced with choosing the
[Figure 9.10, first panel annotations: initial weights w_i = 1/10 for i = 1, . . . , 10; first weak classifier k_1 with ε_1 = 0.3 and α_1 ≈ 0.42; the subsequent panels show the reweighted samples.]
Fig. 9.10. Schematic example of AdaBoost training (M = 3) for two classes with the decision functions k_i(m) := sign(m_{d_i} − τ_i) with d_i ∈ {1, 2}. The size of a mark indicates the magnitude of the associated weight. Example adapted from Freund and Schapire [2004].
classifier that minimizes the weighted training error. A visual example of AdaBoost
training and the resulting strong classifier is given in Figures 9.10 and 9.11.
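As an illustration of the training loop, here is a compact Python sketch of binary AdaBoost with decision stumps as weak classifiers. It follows the update rules quoted from Algorithm 9.1 above; the stump learner, the weight normalization, and all names are our own additions.

```python
import numpy as np

def fit_stump(X, z, w):
    """Weak learner: decision stump k(m) = s * sign(m[d] - tau) minimizing the weighted error."""
    best = None
    for d in range(X.shape[1]):
        for tau in np.unique(X[:, d]):
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, d] - tau >= 0, 1.0, -1.0)
                eps = np.sum(w * (pred != z))
                if best is None or eps < best[0]:
                    best = (eps, d, tau, s)
    return best                                    # (eps_j, dimension, threshold, sign)

def stump_predict(X, d, tau, s):
    return s * np.where(X[:, d] - tau >= 0, 1.0, -1.0)

def adaboost(X, z, M=3):
    """Labels z in {-1, +1}; returns the weighted ensemble [(alpha_j, d, tau, s), ...]."""
    N = len(z)
    w = np.full(N, 1.0 / N)                        # initially all samples weighted equally
    ensemble = []
    for _ in range(M):
        eps, d, tau, s = fit_stump(X, z, w)
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # guard against division by zero
        alpha = np.log((1 - eps) / eps)            # alpha_j <- log((1 - eps_j) / eps_j)
        wrong = stump_predict(X, d, tau, s) != z
        w *= np.exp(alpha * wrong)                 # put more weight on misclassified samples
        w /= w.sum()                               # normalization keeps eps_j a fraction
        ensemble.append((alpha, d, tau, s))
    return ensemble

def strong_classify(ensemble, X):
    votes = sum(a * stump_predict(X, d, tau, s) for a, d, tau, s in ensemble)
    return np.sign(votes)                          # k(m) = sign(sum_j alpha_j k_j(m))
```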
Boosting algorithms are meta-algorithms that can be used with many different
weak classifiers or even combinations of classifiers of different types. Popular choices
are shallow decision trees, i.e., decision trees that are only a few levels deep, and linear
SVMs. While simple, the approach is very powerful. Given that the weak classifiers are
simple classifiers, the strong classifier is usually very fast to compute. At the same time,
boosted classifiers typically generalize well and do not overfit easily. Weak classifiers that perform well on the training set will receive larger weights α_j than classifiers that do not.
The main design parameters are the types of the weak classifiers k j (⋅) or a set of
pre-selected classifiers to choose from, and the number of iterations M. There is no
need for prior knowledge of the distribution of the features, but boosting typically
needs large training sets to work well.
9.4 Rejection
In Chapter 1, and throughout the book so far, all objects were assumed to belong to one
of the classes ω1 , . . . ,ω c . In other words, the relevant part of the world Ω is partitioned
into equivalence classes ω i , Ω/∼ = {ω1 , . . . ,ω c }. The underlying (implicit) assumption
Fig. 9.11. Visual representation of the AdaBoost classifier obtained after three iterations in Fig-
ure 9.10.
Fig. 9.12. Reasons to refuse to classify an object: (i) the feature vector is placed near a border between classes or (ii) the feature vector is placed within unpopulated regions of the feature space.
that all classes are known at design time and that every object o ∈ Ω belongs to one
of the classes ω i is called the closed world assumption.
In practice, this assumption rarely holds, but it is often close enough to the truth
to not cause any harm. An intelligent scale that classifies fruit and vegetables, for example, will work with a model of just the produce items offered in the supermarket. Other
products in the store are not included in the model, and will be misclassified when
put on the scale, but the harm due to the misclassification is negligible.
In any case, there are at least two valid reasons to refuse a classification:
1. Ambiguous situation: the decision functions k i do not exhibit a significant maxi-
mum.
2. Unknown object: The object lies outside the domain explained by Ω/∼.
The first case occurs when the feature vector of an object falls near a decision bound-
ary. In other words, a sample should be rejected if the sample falls into a narrow strip
around the decision boundaries. The second case occurs if the feature vector falls out-
side of the area occupied by the feature vectors of the known objects (see Figure 9.12).
The intelligent scale, for example, may encounter a Nashi pear that looks similar to an
apple. Customers might also be tempted to put unrelated items, e.g., a loaf of bread, on
the scale to see how the system reacts. In both cases, the task of classification should
be declined.
To treat both cases, the workflow of a classifier (see Figure 3.3) is extended by
a subsequent rejection stage, as shown in Figure 9.13. The rejection test might be
inherent to the specific classifier, but there are four classifier-independent rejection
[Figure 9.13: the feature vector m is fed to the decision functions k_1(m), . . . , k_c(m); the maximum is determined, a rejection test is applied, and the result is either the class estimate ω̂ or a rejection.]
Fig. 9.13. Schema of a classifier with rejection option. As the rejection option is a subsequent step
after maximum search, it can be applied to any classifier.
criteria that work with any classifier that outputs some measure of confidence. These
are (due to Schürmann [1996]):
– Maximum criterion: Reject if max{k i } < τ, i.e., if the decision functions show no
significant maximum. The corresponding rejection region in the decision space is
shown in Figure 9.14a.
– Difference criterion: Reject if max{k i } − max {{k j } \ max{k i }} < τ, i.e., the two top
ranked decision options have a similar confidence. The corresponding rejection
region in the decision space is shown in Figure 9.14b.
– Distance criterion: Reject if min{‖k − ωi ‖} > τ, i.e., if the confidence of the closest
class in the decision space is not large enough. The corresponding rejection region
in the decision space is shown in Figure 9.14c.
– Minimum criterion: Reject if min{k i } < τ < 0, i.e., if at least one decision function
expresses high confidence that the object does not belong to any of the classes
defined during design time. The corresponding rejection region in the decision
space is shown in Figure 9.14d.
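A minimal Python sketch of the four criteria (our own illustration: the decision vector k is assumed to contain one confidence value per class, the target vectors for the distance criterion are taken to be the unit vectors of the decision space, and the thresholds are arbitrary):

```python
import numpy as np

def rejection_reasons(k, tau_max=0.5, tau_diff=0.2, tau_dist=0.7, tau_min=-0.5):
    """Return the rejection criteria (Schürmann [1996]) triggered by the decision vector k."""
    k = np.asarray(k, dtype=float)
    top = np.sort(k)[::-1]                                    # decision values, largest first
    reasons = []
    if top[0] < tau_max:                                      # maximum criterion
        reasons.append("maximum")
    if top[0] - top[1] < tau_diff:                            # difference criterion
        reasons.append("difference")
    targets = np.eye(len(k))                                  # one target (unit) vector per class
    if np.min(np.linalg.norm(targets - k, axis=1)) > tau_dist:    # distance criterion
        reasons.append("distance")
    if np.min(k) < tau_min:                                   # minimum criterion (tau_min < 0)
        reasons.append("minimum")
    return reasons

print(rejection_reasons([0.45, 0.40, 0.15]))   # ['maximum', 'difference'] for these thresholds
```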
Formally, the rejection option can be treated as an additional class ω_0, with which the original class-partition of Ω is augmented (recall Section 3.3), i.e., Ω/∼ = {ω_0, ω_1, . . . , ω_c}.
[Figure 9.14: four panels (a)–(d) in the decision space spanned by k_1 and k_2, showing the rejection regions of the four criteria.]
Fig. 9.14. Rejection criteria and the corresponding rejection regions in the decision space.
9.5 Exercises
(9.1) A classifier for the two classes ω_1 and ω_2 is evaluated on a test set with the following results:
ω_1: N_{T,1} = 20, k_1 = 14
ω_2: N_{T,2} = 30, k_2 = 12
Here, NT,i denotes the number of test samples of class ω i and k i denotes the num-
ber of correctly classified samples of that class. The a priori probabilities are given
by P(ω1 ) = p and P(ω2 ) = 2p for p ∈ [0,1].
1. Give an estimate P̂ e of the error probability of the classifier.
2. In operation, an error rate of P e = 0.4 is observed. Assuming that the class-
dependent error rates are correct, what are the values of the true a priori prob-
abilities?
(9.2) A test of a classifier with three classes ω1 , ω2 , ω3 results in the following confu-
sion matrix:
True class
Prediction ω1 ω2 ω3
ω1 120 6 3
ω2 16 21 7
ω3 8 9 26
Give an estimate P̂_e of the error probability of the classifier. Assume the following a priori probabilities: P(ω_1) = 1/2, P(ω_2) = 1/5, and P(ω_3) = 3/10.
A Solutions to the exercises
A.1 Chapter 1
(1.2) The relation is not transitive and hence not an equivalence relation, as seen in
this family tree:
[Family tree diagram (including the persons Joel and Kathie) omitted.]
(1.4) x ∼ y ⇔ xT y ≥ 0 is reflexive and symmetric, but not transitive and hence not an
equivalence relation. Let x = (0, − 1)T , y = (1,0)T , and z = (0,2)T . Then:
xT y = 0 ⇒ x ∼ y
yT z = 0 ⇒ y ∼ z
xT z = −2 ⇒ x ≁ z
(1.5) The relation is not an equivalence relation, because symmetry does not hold if
f(x) < f(y) (strictly smaller) for any x,y ∈ ℕ. In this case, x ∼ y holds, but y ≁ x,
because f(y) ≰ f(x).
(1.6) The relation is not symmetric and hence not an equivalence relation:
Let r(X,n) ∈ O(n^a) and r(Y,n) ∈ O(a^n) for a > 1. Then r(X,n) ∈ O(r(Y,n)) = O(a^n) and therefore X ∼ Y. However, r(Y,n) ∉ O(r(X,n)) = O(n^a) and thus Y ≁ X.
A.2 Chapter 2
(2.1) mn allows any relabeling function, i. e., any injective function. mo allows only functions that preserve the ordering. mi allows only functions that also preserve relative distances, i. e., only linear functions. mr allows only functions that also preserve the zero, i. e., only scaling functions. ma allows only the identity mapping.
1. f(m) = 3 m + α is injective and strictly increasing and therefore allowed for
both mn and mo . Since 3 > 0, it can also be applied to mi , but is only allowed
for mr if α = 0.
2. mn and mo both allow f(m) = e^m, since the function is injective and strictly
increasing. It is not linear, and therefore not allowed with either mi or mr .
3. The function is only allowed with mn , as it is injective, but not strictly increas-
ing and not linear.
D_KL(P_1 ‖ P_2) = ∑_{x ∈ supp(P_1)} P_1(x) log( P_1(x)/P_2(x) )
= P_1(a) log(P_1(a)/P_2(a)) + P_1(b) log(P_1(b)/P_2(b)) + P_1(c) log(P_1(c)/P_2(c)) + P_1(d) log(P_1(d)/P_2(d))
= (1/3) log( (1/3)/(1/3) ) + 0 · log( 0/(1/6) ) + (1/3) log( (1/3)/(1/6) ) + (1/3) log( (1/3)/(1/3) )
= (1/3) log 1 + (1/3) log 2 + (1/3) log 1
= (1/3) log 2 ≈ 0.23
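A quick numerical cross-check in Python (assuming the natural logarithm, which matches the value 0.23, and the distributions P1, P2 as read off above):

```python
import numpy as np

P1 = {"a": 1/3, "b": 0.0, "c": 1/3, "d": 1/3}
P2 = {"a": 1/3, "b": 1/6, "c": 1/6, "d": 1/3}

# sum only over supp(P1), so the term with P1(b) = 0 drops out
d_kl = sum(p * np.log(p / P2[x]) for x, p in P1.items() if p > 0)
print(d_kl)    # 0.2310... = (1/3) * ln 2
```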
(2.10) A feature is invariant to translation if the location parameter c does not influ-
ence the feature. Similarly, a feature is invariant to rotation if its computation does
not involve the rotation parameter φ. All other measurements—perimeter P, area
A, and axis lengths l1 and l2 —are sensitive to scaling, but to different degrees; a
scaling factor s affects P, l1 and l2 linearly, but A will grow quadratically in s.
A.3 Chapter 3
(3.2) 1. If the a priori probability distribution is uninformative about the class, i. e.,
if P(ω i ) is the same for all classes ω i .
2. If the class-specific feature distribution is the same for both features, i. e.,
p(m1 | ω i ) = p(m2 | ω i ) for all ω i .
(3.3) 1. ω1 and ω2 can not be separated using m. The classification depends solely
on the a priori probabilities.
2. ω3 and ω2 can be separated without error. Since the class-specific feature
distributions for ω1 and ω2 are the same, ω3 and ω1 can also be perfectly
separated.
3. Class-dependent error probabilities:
For ω_3: P(ω̂ ≠ ω_3 | ω_3) = 0.
For ω_2: P(ω̂ ≠ ω_2 | ω_2) = 0 and for ω_1: P(ω̂ ≠ ω_1 | ω_1) = 1, since the classifier always decides on ω_2.
(3.4) 1. Sketch of the feature distributions p(m | ω) and the decision boundaries for parts (2) and (3). [Sketch omitted.] The resulting error probability for the first boundary (part 2) is
(1/2) · (1/8 + 1/8) = 1/8.
For the second boundary (part 3), it is
(3/4) · (1/32) + (1/4) · (9/32) = 3/32.
2. Class ω4 will be chosen, since it has highest a priori probability of all classes.
3. ω1 and ω4 have the highest class-specific feature densities at m1 . Since
P(ω4 ) > P(ω1 ),
p(m1 | ω1 )P(ω1 ) < p(m1 | ω4 )P(ω4 ) ⇔ P(ω1 | m1 ) < P(ω4 | m1 )
and hence ω4 is chosen again.
4. ω1 and ω4 as well as ω2 and ω4 are best separated using only m2 . ω1 and
ω2 cannot be separated, since the a priori probability and the class-specific
feature distributions, and hence the a posteriori probabilities, are the same
for these classes.
5. Since ω1 and ω2 can not be separated using m2 , but they can be separated
using m1 , one should use m1 instead of m2 .
A.4 Chapter 4
Since the weight m is in the interval [10,20], and the maximum of MSE(m̂ 1 ) in this
interval is 25 (at m = 10 and m = 20), the estimator m̂ 1 has lower mean squared
error than m̂ 2 .
(4.2) To find the maximum likelihood estimator, we maximize the log-likelihood func-
tion:
l(μ) = log ∏_{i=1}^{N} (1/(2σ)) exp(−|m_i − μ|/σ) = ∑_{i=1}^{N} ( log(1/(2σ)) − |m_i − μ|/σ )
     = −N log(2σ) − (1/σ) ∑_{i=1}^{N} |m_i − μ|
⇒ μ_ML = arg max_μ l(μ) = arg max_μ { −N log(2σ) − (1/σ) ∑_{i=1}^{N} |m_i − μ| }
       = arg max_μ { −(1/σ) ∑_{i=1}^{N} |m_i − μ| } = arg max_μ { −∑_{i=1}^{N} |m_i − μ| }
       = arg min_μ { ∑_{i=1}^{N} |m_i − μ| } = μ̂.
E{μ̂} = E{ (1/(N−4)) ∑_{i=3}^{N−2} x_i } = (1/(N−4)) ∑_{i=3}^{N−2} E{x_i}
     = (1/(N−4)) · (N−4) · μ = μ.
2. Both estimators are unbiased, but the variance of μ̂ is larger than the variance of μ̂_ML and therefore μ̂_ML is a better estimator. This can be shown using Equation (4.23):
Var{μ̂} = Var{ (1/(N−4)) ∑_{i=3}^{N−2} x_i } = (1/(N−4)²) ∑_{i=3}^{N−2} E{(x_i − μ)²}
       = (1/(N−4)²) ∑_{i=3}^{N−2} σ² = σ²/(N−4) > σ²/N = Var{μ̂_ML}
E{σ̂²} = (1/(α−N)) ∑_{i=1}^{N} E{(m_i − μ)²}    (E{·} linear)
      = (1/(α−N)) ∑_{i=1}^{N} E{(m − μ)²}      (m_i i. i. d.)
      = (1/(α−N)) · N · Var{m} = (N/(α−N)) · σ².
In order for the estimator to be unbiased, the expected value must be equal to the true value, which gives
E{σ̂²} = σ² ⇔ N/(α−N) = 1 ⇔ α = 2N.
E{μ̂} = E{ (N/(N−α)) ∑_{i=1}^{N} f(m_i) } = (N/(N−α)) ∑_{i=1}^{N} E{f(m_i)}
     = (N/(N−α)) ∑_{i=1}^{N} E{f(m)} = N²μ / (N−α).
It is unbiased if
E{μ̂} =! μ ⇔ N²μ/(N−α) = μ ⇒ N² = N − α ⇔ α = N − N².
A.5 Chapter 5
(5.1) The inverse mapping x = A−1 (y) = ±√y is not unique, therefore the inference
from y to x is ill-posed.
(5.2) The system of equations to solve this problem is over-determined, since there
are more data points (N > 2) than degrees of freedom (a,b). This means that
in general there is no solution that interpolates all data points. Therefore, the
problem is ill-posed.
The following variation is well-posed.
Find the parameters a,b ∈ ℝ of a straight line y = a x + b that minimizes the distance
between the line and the data points, i. e., find the parameters a,b ∈ ℝ that minimize
∑_{i=1}^{N} d(y_i, a·x_i + b).
p̂(x) = (1/N) ∑_{i=1}^{N} (1/V_N) φ( (x − x_i)/h_N )    where φ(y) := 1 if |y| ≤ 0.5 and 0 else.
Here, V_N = h_N = 1, and hence
p̂(6) = (1/10) · 0 = 0
p̂(8) = (1/10) · 2 = 0.2
p̂(10) = (1/10) · 3 = 0.3
p̂(12) = (1/10) · 1 = 0.1
p̂(14) = (1/10) · 0 = 0
(5.5) The dataset can be sorted to quickly find the nearest neighbors to a given m:
D = {7.1, 7.6, 8.0, 8.5, 9.3, 9.7, 10.0, 10.5, 12.2, 14.9}.
The density is estimated according to p̂(m) = k/(N·V). The volume V depends on the position of the neighbors, V(m) = 2 · |n_3(m) − m|, where n_3(m) denotes the third-closest neighbor of m. Putting everything together,
p̂(m = 6) = 3 / (10 · 2 · |8 − 6|) = 3/40
p̂(m = 8) = 3 / (10 · 2 · |8.5 − 8|) = 3/10
p̂(m = 10) = 3 / (10 · 2 · |10.5 − 10|) = 3/10
p̂(m = 12) = 3 / (10 · 2 · |10 − 12|) = 3/40
p̂(m = 14) = 3 / (10 · 2 · |10.5 − 14|) = 3/70
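The same values can be reproduced with a short Python sketch of the k-nearest-neighbor density estimate (k = 3, one-dimensional features):

```python
import numpy as np

D = np.array([7.1, 7.6, 8.0, 8.5, 9.3, 9.7, 10.0, 10.5, 12.2, 14.9])

def knn_density(m, data, k=3):
    # V(m) = 2 * |n_k(m) - m|: smallest symmetric interval around m containing the k nearest samples
    dists = np.sort(np.abs(data - m))
    V = 2.0 * dists[k - 1]
    return k / (len(data) * V)

for m in (6, 8, 10, 12, 14):
    print(m, knn_density(m, D))
# 0.075 (= 3/40), 0.3 (= 3/10), 0.3 (= 3/10), 0.075 (= 3/40), 0.0428... (= 3/70)
```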
(5.6) The feature space, sample, decision boundary, and samples to classify are shown in the diagram below:
[Diagram: feature space with axes m_1 and m_2 (each from −6 to 6), the training samples of ω_1 and ω_2, the decision boundary, and the query points m_1, m_2, and m_3.]
⇒ ω̂(m_1) = ω_2;  ω̂(m_2) = ω_1;  ω̂(m_3) = ω_1 or ω̂(m_3) = ω_2
A.6 Chapter 6
(6.1) Each of the three Gaussian components g k (m) is parametrized by a mean µk and
a covariance matrix Σk . The mean requires five (5) parameters since the feature
space is five-dimensional. The covariance matrix is symmetric and requires es-
timating fifteen (15) parameters. Two parameters are needed to estimate the α k ,
since α1 + α2 + α3 = 1. In all, there are 3 ⋅ (5 + 15) + 2 = 62 parameters to estimate
for the Gaussian mixture.
The Parzen window method does not require estimating any parameters. The meta-
parameters (window type, window size) are chosen beforehand.
(6.2) The linear classifier needs to estimate four (4) parameters: three for the normal
of the hyperplane w and one for the distance to the origin b.
For the Gaussian classifier, four (4) parameters are needed for each mean and ten
(10) parameters are needed for the covariance matrices, so there are 2⋅(4+10) = 28
parameters to estimate in all. The a priori probabilities are not estimated from the
sample.
(6.3) c · (6 + 6·7/2 + 1) − 1 = 28c − 1 < 256 ⇔ c < 257/28 = 9 + 5/28.
All in all, c = 9 classes can be separated, at a maximum, using this device.
256 − 9 ⋅ 28 + 1 = 5 parameters remain unused.
(6.4) The probability that the feature m_i lies in the interval [−2,5] is
Pr(m_i ∈ [−2,5]) = (5 − (−2)) / (11 − (−10)) = 7/21 = 1/3.
Pr(m ∉ [−2,5]^d) = 1 − Pr(m ∈ [−2,5]^d) = 1 − (1/3)^d =: P_d.
Plugging in different values for d yields
d = 1 ⇒ P_1 = 1 − 1/3 = 2/3 ≯ 9/10
d = 2 ⇒ P_2 = 1 − 1/9 = 8/9 ≯ 9/10
d = 3 ⇒ P_3 = 1 − 1/27 = 26/27 > 9/10
Therefore, more than 90 % of the probability mass is outside the hypercube
[−2,5]d when the dimensionality of the feature space is at least d = 3.
A.7 Chapter 7
(7.1) K(m, m′) = ‖m‖² does not depend on m′, hence it is not symmetric and not a kernel function.
(7.2) K(m, m′) = 4m^T m′ − m^T m − (m′)^T m′ is symmetric, but not positive definite, and therefore not a kernel function:
K((1, 0)^T, (0, 1)^T) = 4 · 0 − 1 − 1 = −2 < 0
K((1, 0)^T, (1, 1)^T) = 4 − 1 − 2 = 1 > 0
(7.3) The kernel function is the scalar product of the lifted features.
(7.4) 1. Sketch of the features and decision boundary. Note that there are no feature
vectors inside the margin, since this is a hard margin SVM:
[Sketch: feature space with axes m_1 and m_2 (each from 1 to 6), the samples of ω_1 and ω_2, and the maximum margin decision boundary.]
A.8 Chapter 8
[Decision tree diagram with root test “speed = medium?” (branches Y/N) and leaf nodes ω_1, ω_3, ω_1, ω_2.]
[Sketch: partitioning of the feature space (axes m_1 up to 8 and m_2 up to 6) induced by the decision tree, with the regions assigned to ω_1 and ω_2.]
Overfitting occurs in the region m1 > 8, i. e., in the partial tree reached by N→Y→N:
here, both leaves contain only one sample.
This overfitting is eliminated by replacing the partial tree reached by N→Y with a
leaf node ω1 , yielding the following decision tree:
m1 < 5?
├─ Y: m1 < 3?
│   ├─ Y: ω2
│   └─ N: m2 < 2?
│       ├─ Y: ω1
│       └─ N: ω2
└─ N: m2 < 5?
    ├─ Y: ω1
    └─ N: ω2
This tree misclassifies one of the training samples.
A.9 Chapter 9
(9.1) 1. P(ω_1) + P(ω_2) =! 1 ⇒ P(ω_1) = 1/3, P(ω_2) = 2/3. From Equation (9.12) it follows:
P̂_e = ∑_{i=1}^{2} P(ω_i) · (N_{T,i} − k_i)/N_{T,i} = (1/3) · (6/20) + (2/3) · (18/30) = 1/2
2. Approach: calculate P(ω_1) (and hence P(ω_2)) from the observed error rate
P_e = P(ω_1) · n_1/N_{T,1} + (1 − P(ω_1)) · n_2/N_{T,2}:
P(ω_1) · 6/20 + (1 − P(ω_1)) · 18/30 =! 4/10
⇔ 6/10 − 3/10 · P(ω_1) = 4/10
⇒ P(ω_1) = 2/3 ⇒ P(ω_2) = 1/3
(9.2) The numbers of test samples per class are N_{T,1} = 144, N_{T,2} = 36, and N_{T,3} = 36. The class-dependent error probabilities are estimated as
P̂(ω̂ ≠ ω_1 | ω_1) = (16 + 8)/144,   P̂(ω̂ ≠ ω_2 | ω_2) = (6 + 9)/36,   P̂(ω̂ ≠ ω_3 | ω_3) = (3 + 7)/36.
Putting both together yields
P̂_e = P(ω_1) P̂(ω̂ ≠ ω_1 | ω_1) + P(ω_2) P̂(ω̂ ≠ ω_2 | ω_2) + P(ω_3) P̂(ω̂ ≠ ω_3 | ω_3)
    = (1/2) · (24/144) + (1/5) · (15/36) + (3/10) · (10/36) = 1/12 + 1/12 + 1/12 = 1/4 = 0.25.
B A primer on Lie theory
The tangential distance in Section 2.4.6 as well as the construction of invariant features
in Section 2.6.3 used concepts from Lie theory, but gave no formal introduction of these
concepts. This section will give a concise introduction to the postponed mathematical
details.
Definition B.1 (Topological Manifold). Let Π be a Hausdorff space, i. e., a set of points
with a system of open sets such that for each pair of two distinct points the points can
be placed in two disjoint open sets.
1. A chart (or coordinate chart or coordinate map) is a pair (U, φ) of an open subset
U ⊆ Π and a corresponding injective map φ : U → V to an open subset V ⊆ ℝd of
Euclidean space such that φ is a homeomorphism. To be a homeomorphism means
that φ and φ−1 are both continuous, i. e., the pre-image of an open set is an open
set.
2. Let (U_i, φ_i), (U_j, φ_j), i ≠ j be two charts with a nonempty intersection U_ij = U_i ∩ U_j ≠ ∅ and let φ̄_i and φ̄_j denote the restrictions of φ_i and φ_j to the intersection U_ij. The map
τ_ij = φ̄_j ∘ φ̄_i^{−1} : φ_i(U_ij) → φ_j(U_ij)    (B.1)
is called the transition map between the two charts.
The above definition looks rather complicated, but can be understood intuitively—the
phrases “coordinate map” and “atlas” were not chosen by chance. A (topological)
manifold is a set that locally looks like Euclidean space.
The canonical example is a globe: a sphere and a plane have different global
geometries, but they look the same when you zoom in close enough. It is possible
to choose an open neighborhood on the sphere and map it to the Euclidean plane
the same way as one can flatten pieces of an orange peel. Otherwise, it would not be
possible to transfer a map printed on a flat sheet of paper onto a spherical globe.
On a grander scale, the earth looks flat (apart from the occasional hill or valley)
from a human perspective, but from outer space, it is clear that it is (approximately) a
sphere.
Most of the above definition is necessary to ensure that every point of the manifold
is on some map and that the same point on the manifold looks similar on different
maps. For example, two maps of Western and Middle Europe both include Germany,
but in different places. Still, the border of the country should look approximately the
same on both maps.
f = φ̃ ∘ F ∘ φ^{−1} : φ(U) ⊂ ℝ^d → ℝ^d̃    (B.2)
is differentiable at φ(x).
3. A function F : Π → Π̃ is a diffeomorphism if it is bijective and F as well as F^{−1} are both differentiable.
Note that in the first item, differentiability is well defined, because the τ_ij are functions from and to Euclidean space, where this concept already exists. In the second item, differentiability is independent of the choice of the chart functions φ and φ̃, because the transition function τ between different charts is required to be differentiable. Hence
F is either differentiable with respect to every chart or not differentiable at all.
The definition ensures that the manifold has no “edges” or “corners” but rather
looks smooth, as the name suggests. From this definition, it can be seen that the pre-
vious example for a topological manifold—the globe—is also a smooth manifold.
For the purposes of this book, this short introduction to manifolds will suffice. We
will now turn our attention to a different mathematical field, group theory, but come
back to manifolds at the end of this section.
Definition B.3 (Group). A group (G, ⊙) is a set G with a binary composition ⊙ : G×G → G
with the following properties:
1. Associativity: (g1 ⊙ g2 ) ⊙ g3 = g1 ⊙ (g2 ⊙ g3 ).
2. Neutral element: there exists e ∈ G with g ⊙ e = e ⊙ g = g for all g ∈ G.
3. Inverse element: for every g ∈ G, there is a g −1 ∈ G with g ⊙ g−1 = g −1 ⊙ g = e.
Definition B.4 (Group action). Let G be a group and S an arbitrary set. A (left) group action of G on S is a function A : G × S → S such that:
1. for all g_1, g_2 ∈ G and s ∈ S,
A(g_1, A(g_2, s)) = A(g_1 ⊙ g_2, s),    (B.3)
2. and for all s ∈ S,
A(e, s) = s.    (B.4)
Both items of the definition are reasonable when put into words: The first item requires
that the result of composing two group actions will be the same as the group action on
the composition of the elements. The second item requires that the neutral element of
the group has no effect with respect to the group action. One sometimes writes g ⊙ s
instead of A(g, s), i. e., the group composition ⊙ is also used to indicate the group
action.
In the following two definitions, let G be a group, S a set, A a group action of G on
S, and s ∈ S.
Definition B.5 (Group orbit). The set Gs = {A(g, s) | g ∈ G} ⊆ S is called the orbit of s.
Definition B.6 (Stabilizer). The subgroup G_s = {g ∈ G | A(g, s) = s} is called the stabilizer of s.
Although the definitions look similar and the notations differ only in the use of a concatenation (Gs) vs. the use of a subscript (G_s), the semantics are quite different. The group orbit Gs is a subset of S and contains all the points that can be reached from s by a transformation. The stabilizer G_s is a subset of G and contains all the group elements that do not affect s.
To see the difference, consider the group of all rotation matrices of two-dimensional
Euclidean space,
G = { ( cos α   sin α
       −sin α   cos α )  |  α ∈ ℝ },    (B.5)
where the group composition ⊙ is given by the usual matrix multiplication. This group
is also called the two-dimensional special orthogonal group SO(2).
It is easy to see that G is indeed a group. Associativity follows by the associativity
of matrix multiplication and the concatenation of two rotations by α and β gives the
same result as one rotation by α + β. The inverse of a rotation is the rotation by the
negative of its angle, and the neutral element is a rotation by 0.
Now, let S = ℝ2 be the Euclidean plane. The group action of G on S is given by the
usual multiplication of a vector by a matrix. Then the orbit of an arbitrary point s ∈ ℝ2
consists of all points with the same distance from the origin:
Gs = { A(g, s) | g ∈ G } = { ( cos α   sin α
                              −sin α   cos α ) s  |  α ∈ ℝ }
   = { s′ ∈ ℝ² | ‖s′‖ = ‖s‖ }.    (B.6)
This means that these orbits are circles around the origin, hence the name. For s = 0, the orbit is a degenerate circle with radius 0, i. e., a point. Hence, the stabilizer is given by
G_0 = G   and   G_s = {I} for s ≠ 0.    (B.7)
It is easy to see that any rotation maps the origin to the origin, which means that the stabilizer G_0 is the whole group. All other points are rotated along a circle around the origin, hence their stabilizer contains only the neutral element I.
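These properties are easy to verify numerically; the following Python sketch (our own illustration, not part of the text) composes two rotations and checks that the group action preserves the norm of s, so the orbit of s lies on a circle of radius ‖s‖:

```python
import numpy as np

def rot(alpha):
    # element of SO(2) as in Equation (B.5)
    return np.array([[ np.cos(alpha), np.sin(alpha)],
                     [-np.sin(alpha), np.cos(alpha)]])

a, b = 0.7, 1.9
print(np.allclose(rot(a) @ rot(b), rot(a + b)))    # True: composing rotations adds the angles

s = np.array([2.0, 1.0])
orbit = [rot(alpha) @ s for alpha in np.linspace(0.0, 2.0 * np.pi, 100)]
print(np.allclose([np.linalg.norm(p) for p in orbit], np.linalg.norm(s)))  # True: circle of radius ||s||

# the stabilizer of s != 0 contains only the neutral element:
# rot(alpha) @ s == s only for alpha = 0 (mod 2*pi)
```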
The special orthogonal group has an important property that links the world of algebra with manifolds: each rotation matrix can be decomposed into smaller and smaller rotations. In particular, a rotation can be decomposed into
( cos α   sin α        ( cos(α/n)   sin(α/n)
 −sin α   cos α )  =    −sin(α/n)   cos(α/n) )^n    (B.8)
for every n ∈ ℕ. Indeed, any rotation can be decomposed into infinitely many, in-
finitesimally small rotations.
This observation leads to the following definition.
Definition B.7 (Lie group). A group Π that is also a smooth manifold such that the
group operation p1 ⊙ p−12 is differentiable is a Lie group.
The two-dimensional special orthogonal group SO(2) is a Lie group and a one-
dimensional smooth manifold, which can be seen if one considers that a suitable
chart function maps a rotation matrix to its angle α ∈ ℝ.
The definition of a Lie group only makes a statement about the group operation,
but not about group actions. The combination yields the following definition.
Definition B.8 (Lie transformation group). Let M be a smooth manifold, Π a Lie group,
and A : Π × M → M a group action of Π on M. Π is called a Lie transformation group
with respect to M if A is differentiable.
Note that this definition uses some kind of differentiability three times: The group
Π is a Lie group and therefore equipped with a differentiable structure; the space M
is a smooth manifold, i. e., it possesses a differentiable structure; and the map A is
required to be differentiable, too.
To conclude, consider Figure 2.12 (reproduced in Figure B.1 for convenience) in
light of the new concepts. The feature space M is assumed to be a smooth manifold and
the disturbances are modeled as a Lie transformation group Π that acts on the feature
space. Then the orbit of a feature vector m_i under the group action is a smooth submanifold. Although this section skipped a mathematical definition of the tangent
space, it should be intuitively clear that a tangent exists at each point of each sub-
manifold, because all involved maps are differentiable.
[Figure B.1 (reproduction of Figure 2.12): the orbits M_i = {A(p, o_i) | p ∈ Π} and M_k = {A(p, o_k) | p ∈ Π} in the feature space, with the feature vectors m_i, m_k and the tangent spaces T_{m_i}, T_{m_k}.]
C Random processes
Definition C.1 ((Strictly) stationary process (of order m)). Let g be a random process,
k ∈ ℕ be a finite dimension, x1 , . . . , xk ∈ ℝ2 arbitrary points, and τ ∈ ℝ2 a translation
vector.
1. If
p(x1 , . . . , xk ) = p(x1 + τ, . . . , xk + τ) (C.7)
holds for all valid choices of k, x1 , . . . , xk , and τ, the process g is called (strictly)
stationary. This means that all fidis are invariant under translation.
2. A stochastic process is called (strictly) stationary of order m, if the above holds for
all k ≤ m.
Definition C.2 (Homogeneity (of order m)). A stochastic process is called homogeneous of order m if its moments up to order m are invariant under translation, i.e., if E{g^k}(x) = E{g^k}(x + τ) for all k ≤ m and all τ.
This means the first m moments do not depend on the point x. Obviously, stationarity
is much stronger than homogeneity. A stationary process is always homogeneous (up
to the same order) but not vice versa.
Definition C.3 (Weakly stationary process). A random process g is called weakly stationary if its expectation E{g}(x) is constant and its covariance satisfies Cov{g}(x, y) = Cov{g}(x + τ, y + τ) for all x, y, τ ∈ ℝ².
This means the expectation is constant for every point x and the covariance is also
constant in the sense that its value only depends on the relative position of x and y
but is not affected by a translation. Especially for x = y, this implies Var{g} (x) = σ2
is constant for all x ∈ ℝ2 .
The condition of weak stationarity is more restrictive than a homogeneity of order
two, but less restrictive than stationarity of order two. For a process to be homogeneous,
it is only required that its expectation and variance be constant: this does not say
anything about its covariance. In contrast, to be a stationary process of order two, it
is required that all two-dimensional marginal distributions be identical. The latter is
much stronger than having only identical covariances.
Note that the term “expectation free” is a bit misleading: an expectation free random
process is not free of having an expectation. It has an expectation: 0.
E{e}(x) = 0    (C.12)
Cov{e}(x, x + τ) = σ² if τ = 0, and 0 else.    (C.13)
Actually, both requirements already ensure that it is a weakly stationary process, but they demand much more. In particular, the last requirement implies that any two distinct states are uncorrelated with each other.
Lastly, we consider a certain assumption about random processes that makes rea-
soning about them easier in many circumstances: ergodicity. Informally, in an ergodic
process, a reasonably large sample from that process is representative of the process
as a whole. Formally, let E denote a probability space and let e ∈ E be an elementary
event. Moreover, let g(x) = g(x, e) denote the realization of the random process g(x)
with respect to the elementary event e.
Definition C.6 (Ergodic process). Let g be a stationary process and let μ(x) = E{g} (x)
denote the expectation of g. This means that the expectation μ(x) = μ is constant for
all x ∈ ℝ2 . The process g is said to be ergodic if for all events e ∈ E and all y ∈ ℝ2 ,
lim_{w→∞, h→∞} (1/(w·h)) ∫_{−w/2}^{w/2} ∫_{−h/2}^{h/2} g(x, e) dx = μ = E{g}(y).    (C.14)
On the right side of Equation (C.14), one arbitrary point y is fixed and the average over all possible realizations of the random process g is calculated. On the left, one realization g(x) = g(x, e) is
fixed and the average over all points x ∈ ℝ2 is calculated. Hence, under the assumption
that g is ergodic, one can determine the unknown expectation and variance of g by
taking the average over all points of only one single realization.
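A one-dimensional numerical illustration (our own sketch, not from the text): for a process g(x) = μ + e(x) with white noise e, the spatial average over a single realization approaches the ensemble expectation μ, which is exactly what ergodicity asserts:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 3.0, 2.0

# one long realization of g(x) = mu + e(x) with white noise e, Var{e} = sigma^2
realization = mu + sigma * rng.standard_normal(100_000)
print(realization.mean())    # spatial average of a single realization: ~3.0

# ensemble average at one fixed point x over many independent realizations
ensemble = mu + sigma * rng.standard_normal(100_000)
print(ensemble.mean())       # ~3.0 as well: both averages estimate E{g}(x) = mu
```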
Bibliography
R. Aster, B. Borchers, and C. Thurber. Parameter Estimation and Inverse Problems. International
Geophysics Series. Academic Press, 2013. ISBN 9780123850485.
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In
Advances in neural information processing systems, pages 153–160, 2007.
J. Beyerer. Analyse von Riefentexturen. PhD thesis, Düsseldorf, 1994.
J. Beyerer, F. Puente León, and C. Frese. Machine Vision. Springer, 2016.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In
Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152.
ACM, 1992.
L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature
verification using a "siamese" time delay neural network. IJPRAI, 7(4):669–688, 1993.
C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information
Theory, 13(1):21–27, Jan. 1967.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector
machines. Journal of machine learning research, 2(Dec):265–292, 2001.
A. Criminisi, J. Shotton, E. Konukoglu, et al. Decision forests: A unified framework for classification,
regression, density estimation, manifold learning and semi-supervised learning. Foundations
and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.
N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-
based learning methods. Cambridge university press, 2000.
G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints.
In Workshop on statistical learning in computer vision, ECCV, volume 1, pages 1–2. Prague,
2004.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em
algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
ISSN 00359246.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image
database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 248–255. IEEE, 2009.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. Wiley, New York, 2 edition, 2001.
B. Efron and T. Hastie. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science.
Cambridge University Press, New York, NY, USA, 1st edition, 2016. ISBN 9781107149892.
G. A. Fink. Mustererkennung mit Markov-Modellen. Vieweg+Teubner Verlag, 2003. ISBN 978-3-519-
00453-0.
G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, Mar. 1973. ISSN 0018-9219. 10.1109/PROC.1973.9030.
Y. Freund and R. Schapire. A tutorial on boosting, 2004.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an applica-
tion to boosting. Journal of Computer and System Sciences, 55(1):119 – 139, 1997.
A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models
for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelli-
gence, 23(6):643–660, 2001.
I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik. What size test set gives good error rate estimates?
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):52–64, 1998.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
C. Herrmann, D. Willersinn, and J. Beyerer. Low-resolution convolutional neural networks for video
face recognition. In Proceedings of the 13th IEEE International Conference on Advanced Video
and Signal Based Surveillance, Colorado Springs, USA, Aug. 2016. IEEE.
T. K. Ho. Random decision forests. In Document Analysis and Recognition, volume 1, pages 278–282.
IEEE, 1995.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780,
1997.
R. Hoffmann. Signalanalyse und –erkennung. Springer, 1998.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley &
Sons, 2004.
A. Jaglom and I. Jaglom. Wahrscheinlichkeit und Information. Deutscher Verlag der Wissenschaften,
Berlin, 1960. Translated from Russian.
A. N. Kolmogorov. On the representation of continuous functions of many variables by superpo-
sition of continuous functions of one variable and addition. American Mathematical Society
Translation, 28(2):55–59, 1963.
D. Krahe and J. Beyerer. A parametric method to quantify the balance of groove sets of honed
cylinder bores. In Intelligent Systems & Advanced Manufacturing, pages 192–201. International
Society for Optics and Photonics, 1997.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
K. Küpfmüller. Die Entropie der deutschen Sprache. Fernmeldetechnische Zeitung, 7(6):265–272, 1954.
A. Laubenheimer. Automatische Registrierung adaptiver Modelle zur Typerkennung technischer
Objekte. PhD thesis, Universität Karlsruhe, 2004.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324, 1998.
A. H. Lipkus. A proof of the triangle inequality for the tanimoto distance. Journal of Mathematical
Chemistry, 26(1-3):263–265, 1999.
K. Liu and G. Mattyus. Fast multiclass vehicle detection on aerial images. Geoscience and Remote
Sensing Letters, IEEE, PP(99):1–5, 2015. ISSN 1545-598X. 10.1109/LGRS.2015.2439517.
D. O. Loftsgaarden and C. P. Quesenberry. A nonparametric estimate of a multivariate density
function. The Annals of Mathematical Statistics, 36(3):1049–1051, 1965.
D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of
computer vision, 60(2):91–110, 2004.
V. Maz’ya and G. Schmidt. On approximate approximations using Gaussian kernels. IMA Journal of Numerical Analysis, 16(1):13–29, 1996.
J. Mercer. Functions of positive and negative type, and their connection with the theory of integral
equations. Philosophical transactions of the royal society of London, 209:415–446, 1909.
T. K. Moon and W. C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice
Hall, Upper Saddle River, NJ, 2000. ISBN 0-201-36186-8.
J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. In
Breakthroughs in statistics, pages 73–108. Springer, 1992.
A. B. J. Novikoff. On convergence proofs on perceptrons. Proceedings of the Symposium on the
Mathematical Theory of Automata, 12:615–622, 1962.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptative computation
and machine learning series. University Press Group Limited, 2006. ISBN 9780262182539.
M. Richter, T. Längle, and J. Beyerer. Knowing when you don’t: Bag of visual words with reject option for automatic visual inspection of bulk materials. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, Dec. 2016.
A. Rieder. Keine Probleme mit Inversen Problemen: Eine Einführung in ihre stabile Lösung.
Vieweg+Teubner Verlag, 2003. ISBN 9783528031985.
H. Ritter, T. Martinetz, and K. Schulten. Neuronale Netze. Addison-Wesley, 1990.
C. P. Robert. A comparison of the Bayesian and frequentist approaches to estimation by Francisco J. Samaniego. International Statistical Review, 79(1):117–118, 2011.
L. Rokach. Pattern classification using ensemble methods, volume 75. World Scientific, 2010.
F. Rosenblatt. The Perceptron: A Perceiving and Recognizing Automaton, volume Report 85-60-1.
Cornell Aeronautical Laboratory, 1957.
F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, 1962.
J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
B. Schölkopf and C. J. Burges. Advances in Kernel Methods: Support Vector Learning. MIT press,
1999.
B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International
Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
J. Schürmann. Pattern classification: a unified view of statistical and neural approaches. Wiley
Online Library, 1996.
C. E. Shannon. Claude Elwood Shannon: Collected Papers. N. J. A. Sloane and A. D. Wyner, editors. IEEE Press, New York, 1993. ISBN 0-7803-0434-9. IEEE Information Theory Society.
L. W. Sommer, T. Schuchert, and J. Beyerer. Deep learning based multi-category object detection in
aerial images. In Proc. SPIE 10202, Automatic Target Recognition XXVII, Anaheim, United States,
May 2017.
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):
1929–1958, 2014.
S. S. Stevens. On the theory of scales of measurement. Science, 103(2684):677–680, June 1946.
L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
V. N. Vapnik and V. Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.
A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algo-
rithm. IEEE Transactions on Information Theory, 13(2):260–269, Apr. 1967. ISSN 0018-9448.
10.1109/TIT.1967.1054010.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European
Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.
Z.-H. Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
Glossary
A posteriori distribution The distribution of the classes with respect to a fixed feature.
A priori distribution The distribution of the classes without knowledge of the features.
Absolute norm A special type of Minkowski norm.
Absolute scale Scale of measurement for counting quantities.
AR model Autoregressive signal model.
Autoregressive signal model Representation of a type of random process.
Dataset The set of all objects that were collected to define, validate and test a pattern recognition
system.
Decision boundary The boundary of a decision region. The entirety of the boundaries is an equiva-
lent description of the classifier.
Decision function A function that maps a feature vector to one component of the decision space.
Decision region A partition in the feature space.
Decision space An intermediate space to unify the mathematical description of the classes.
Decision tree Tree structured classifier where the inner nodes correspond to tests, the edges corre-
spond to the outcomes of the tests, and the leaf nodes govern the class decision.
Decision vector The vector of decision functions of all classes.
Dirac sequence A sequence of probability distributions that converges to the Dirac distribution.
Discrepancy A function that quantifies the similarity between two (mathematical) objects that lacks
some properties of a metric.
Distance function Usually a synonym for metric (usage may vary depending on context).
Distribution Mathematical object that encapsulates the properties of random variables.
Divergence A discrepancy between probability distributions.
EM Expectation maximization.
Emission probability In hidden Markov models: Probability of seeing an observable given a chain
of states.
Empirical operation Mathematical operation that corresponds to an experiment, e.g., addition of
the masses of two objects by putting both on a scale at the same time.
Empirical relation Mathematical relations that emerge from experiments, e.g., by comparing the
weight of two objects.
Empirical risk minimization From statistical learning theory: Minimization of the average loss on
a training set.
Empiricism Philosophy of science that emphasizes evidence and experiments.
Entropy measure Impurity measure corresponding to the entropy of the empirical class distribution
of that data set.
Estimator A measurable function from the space of all finite datasets into the parameter space of a
parametric distribution assumption.
Euclidean norm A special type of Minkowski norm.
Expectation maximization Iterative technique to maximize the likelihood function of an estimator.
Gaussian mixture A random variable whose density is a convex combination of Gaussian densities.
Generalization Ability of a classifier to perform well on unseen data.
Gini impurity Impurity measure corresponding to the expected error probability of random class
assignment on that data set.
Hidden Markov model Markov model where the states and state transitions are hidden and can
only be inferred from observations.
HMM Hidden Markov model.
Homogeneous process A random process whose moments do not depend on the point of evalua-
tion.
Hyper parameters Parameters that govern a classifier but are not estimated from the training set.
Impurity measure Measure that assesses the class distribution in a data set.
Interval scale Scale of measurement for measuring intervals but lacking a natural zero.
Joint distribution The distribution of several random quantities in a joint probability space.
k-nearest neighbor method A parameter-free technique to define a density given a number of finite
samples. See also Parzen window method.
Kullback–Leibler divergence Measure (but not a metric) of the difference between probability dis-
tributions.
Leave-one-out cross-validation Cross-validation where only one sample is used for evaluation,
and the rest are used to train the classifier.
Likelihood function A function of the parameters of a statistical model for a given data set.
Likelihood ratio The ratio of two likelihood functions with different models. Used in hypothesis testing.
Linear discriminant A basic classifier that draws hyperplanes between classes in the feature space.
Log-likelihood function The logarithm of the likelihood function.
Long short term memory Type of deep learning architecture suitable for sequential data.
Mahalanobis norm Norm of a vector with respect to some positive definite matrix.
Manhattan metric Metric deduced from the absolute norm; also: taxicab metric.
MAP classifier Maximum a posteriori classifier.
Marginal distribution The projection of a joint distribution onto one of the axes.
Markov model Probabilistic model of states and transitions between states with certain restric-
tions.
Maximum a posteriori classifier A classifier that decides on the class with the highest a posteriori
probability with respect to a given feature.
Maximum norm A special type of Minkowski norm.
Maximum-likelihood estimator An estimator that chooses the parameter that makes the given ob-
servation most likely under the model.
Mean squared error Mean of the squared derivations of an estimator to the target variable.
Median The middle entry in a sorted list of items.
Metric A function that defines a distance.
Metric space A set with a distance measure.
Minimax classifier A special type of classifier that estimates the class such that the maximal risk
with respect to any a priori distribution is minimized. See also classifier.
Minkowski norm A parametrized norm for real vector spaces.
Misclassification measure Impurity measure corresponding to the empirical error probability of the
dominant class in that data set.
ML estimator Maximum-likelihood estimator.
Mode In statistics: The global maximum of a probability mass or probability density, i.e., the most
probable value.
Nearest neighbor classifier A classifier that assigns an object the same class as the nearest (in the
feature space) sample of the training set.
Nominal scale Scale of measurement made up of labels.
Norm Function to measure the length of a vector.
Parameter space The (vector) space of all quantities that define a classifier.
Parameter vector A point in the parameter space.
Parzen window method A parameter-free technique to define a density given a number of finite
samples. See also k-nearest neighbor method.
Pattern The raw data from a sensor.
Pattern space The set of all possible patterns.
PCA Principal component analysis.
Permutation metric A metric for features on the ordinal scale.
Principal component analysis A method for finding a lower-dimensional subspace such that the
projection of the dataset has a minimal squared reconstruction error.
Probability simplex A subset in the decision space.
Scale of measurement Defines certain types of variables and permissible operations on the vari-
ables of a given type.
Score In statistics: Measure of how much a parameter influences the density of a random variable.
Sensitivity True-positive rate.
Slack The event that a binary classifier incorrectly decides for “negative” although the sample is
positive.
Slack variable In SVMs: Variables associated with the training samples to measure the violation of
the maximum margin constraint.
Specificity True-negative rate.
State transition probability In Markov models: Probability to switch between states.
Stationary process A random process that does not change the joint distribution of a derived time
series when shifted in time.
Stochastic gradient descent Randomized version of the gradient descent optimization algorithm.
Structural risk minimization From statistical learning theory: Joint minimization of the average
loss on a training set and the model complexity.
Supervised learning Learning when the classes of the training samples are known, e.g., classifica-
tion.
Support vector machine A linear classifier that maximizes the margin between the decision bound-
ary and the training samples.
SVM Support vector machine
Target vector A unit vector in the decision space and a corner of the probability simplex.
Taxicab metric Metric deduced from the absolute norm; also: Manhattan metric.
Test set A special subset of the dataset that is used to test the performance of a classifier.
Training set A special subset of the dataset that is used to define the parameters of a classifier.
True-negative rate The probability that a binary classifier decides on “negative” if the sample actually belongs to the negative class.
True-positive rate The probability that a binary classifier decides on “positive” if the sample actually belongs to the positive class.
Unbiased estimator A special estimator whose expectation value equals the parameter being estimated, if considered as a random variable on its own.
Unbiasedness See unbiased estimator.
Unsupervised learning Learning when the classes of the training samples are not known or not
needed, e.g., clustering, density estimation, etc.
Validation set A special subset of the dataset that is used to define the design parameters of a
classifier.
Vapnik–Chervonenkis dimension Measure of complexity of a given family of classifiers.
Weak classifier A classifier that performs only marginally better than random guessing.
Weakly stationary process A random process whose expectation and covariance are constant at
every point.
Window function A function that is nonzero only in some interval, often used to assign a weight
according to some distance, e.g., in the Parzen window method.
Index
activation function 182
AR model 43
autoencoder 183
backpropagation 182
bag of words 88
– bag of visual words 89
bagging 223
Bayes’ law 99
Bayesian classifier 104
Bayesianism 123
bias 128
boosting 242
bootstrap aggregating 223
bootstrapping 223
central limit theorem 113
class 1, 98
classifier 2, 98
CNN 189
confusion matrix 237, 240
convolutional neural network 189
correlation coefficient 76
cost function 104
Cramér–Rao bound 125
cross-validation 241
– leave-one-out cross-validation 242
curse of dimensionality 8, 162
data matrix 60
dataset 6, 7
decision boundary 2
decision function 100
decision region 2, 101
decision space 4, 99
decision tree 215
decision vector 100
degree of compactness 40
degree of convexity 40
degree of filling 39
differential entropy 78
– conditional differential entropy 78
Dirac sequence 147
discrepancy 19
distance function 19
distribution 98
– a posteriori distribution 99
– a priori distribution 98
– class-specific feature distribution 99
– conditional distribution 98, 99
– joint distribution 98
– marginal distribution 98
divergence 19
dropout 191
eigenfaces 65, 86
EM 213
emission probability 211
empirical operation 11
empirical relation 11
empirical risk minimization 234
equivalence relation 1
ERM 234
estimator 122
– consistent estimator 128
– CR-efficient estimator 126
– unbiased estimator 125
expectation maximization 213
feature 3, 10
feature space 5, 13, 162
feed-forward network 180
ferret box 40
fidis 269
Fisher information 127
Fisherfaces 86
form factor 40
frequentism 123
Gaussian distribution
– multivariate 114
– univariate 113
Gaussian mixture 119
general representation theorem 182
generalization 170
group 265
– group action 265, 266
– Lie group 267
– Lie transformation group 267
– stabilizer 266
hidden Markov model 211