Arcangelo Distante • Cosimo Distante
Handbook of Image Processing and Computer Vision
Volume 3: From Pattern to Object
Arcangelo Distante, Institute of Applied Sciences and Intelligent Systems, Consiglio Nazionale delle Ricerche, Lecce, Italy
Cosimo Distante, Institute of Applied Sciences and Intelligent Systems, Consiglio Nazionale delle Ricerche, Lecce, Italy
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my parents and my family, Maria and
Maria Grazia—Arcangelo Distante
1. The first interacts with the environment for the acquisition of data of the domain
of interest, using appropriate sensors (for the acquisition of Signals and Images);
2. The second analyzes and interprets the data collected by the first component,
also using learning techniques to build/update adequate representations of the
complex reality in which the system operates (Computational Vision);
3. The third chooses the most appropriate actions to achieve the objectives assigned to the intelligent system (choice of Optimal Decision Models), interacting with the first two components and with human operators in the case of application solutions based on man–machine cooperative paradigms (the current evolution of automation, including the industrial one).
The content of this manuscript is framed within this scenario of advancing knowledge for the development of Intelligent Systems: it reports the authors' multi-year research and teaching experience, together with the scientific insights available in the literature. In particular, the manuscript, divided into three parts (volumes), deals with the aspects of the sensory subsystem needed to perceive the environment in which an intelligent system is immersed and able to act autonomously.
The first volume describes the set of fundamental processes of artificial vision that lead to the formation of the digital image from luminous energy. The phenomena of light propagation (Chaps. 1 and 2), the theory of color perception (Chap. 3), and the impact of the optical system (Chap. 4) are analyzed, together with the transduction of luminous energy (the luminous flux) into an electrical signal (by the photoreceptors) and the conversion of that continuous-valued electrical signal into discrete values (pixels), i.e., the conversion of the signal from analog to digital (Chap. 5). These first five chapters summarize the process of acquisition of the 3D scene, represented numerically, in symbolic form, by the pixels of the digital image (the 2D projection of the 3D scene).
Chapter 6 describes the geometric, topological, quality, and perceptual information of the digital image. Metrics are defined, together with the aggregation and correlation modalities between pixels, useful for defining symbolic structures of the scene at a higher level than the pixel. The organization of the data for the different processing levels is described in Chap. 7, while Chap. 8 presents the representation and description of the homogeneous structures of the scene.
Chapter 9 begins the description of the image processing algorithms for the improvement of the visual qualities of the image, based on point, local, and global operators. Algorithms operating in the spatial domain and in the frequency domain are shown, highlighting with examples the significant differences between the various algorithms, also from the point of view of the computational load.
The second volume begins with the chapter describing the boundary extraction
algorithms based on local operators in the spatial domain and on filtering techniques
in the frequency domain.
Chapter 2 presents the fundamental linear transformations that have immediate application in the field of image processing, in particular to extract the essential characteristics contained in the images. These characteristics, which effectively summarize the global informational content of the image, are then used for the other image processing tasks: classification, compression, description, etc. Linear transforms are also used, as global operators, to improve the visual qualities of the image (enhancement), to attenuate noise (restoration), or to reduce the dimensionality of the data (data reduction).
In Chap. 3, the geometric transformations of images are described. These are necessary in different applications of artificial vision, both to correct any geometric distortions introduced during acquisition (for example, in images acquired while the objects or the sensors are moving, as in the case of satellite and/or aerial acquisitions) and to introduce desired geometric visual effects. In both cases, the geometric operator must be able to reproduce the image as accurately as possible, with the same initial information content, through the image resampling process.
In Chap. 4, Reconstruction of the degraded image (image restoration), a set of techniques is described that performs quantitative corrections on the image to compensate for the degradations introduced during the acquisition and transmission process. These degradations are represented by the fog or blurring effect caused by the optical system and by the motion of the object or the observer, by the noise caused by the opto-electronic system and by the nonlinear response of the sensors, and by random noise due to atmospheric turbulence or, more generally, to the process of digitization and transmission. While the enhancement techniques tend to reduce the degradations present in the image in qualitative terms, improving its visual quality even when no knowledge of the degradation model is available, the restoration techniques are used instead to eliminate or quantitatively attenuate the degradations present in the image, also starting from the hypothesis of known degradation models.
Chapter 5, Image Segmentation, describes different segmentation algorithms. Segmentation is the process of dividing the image into homogeneous regions, where all the pixels that correspond to an object in the scene are grouped together. The grouping
of pixels in regions is based on a homogeneity criterion that distinguishes them
from one another. Segmentation algorithms based on criteria of similarity of pixel
attributes (color, texture, etc.) or based on geometric criteria of spatial proximity of
pixels (Euclidean distance, etc.) are reported. These criteria are not always valid,
and in different applications, it is necessary to integrate other information in relation
to the a priori knowledge of the application context (application domain). In this last
case, the grouping of the pixels is based on comparing the hypothesized regions
with the a priori modeled regions.
Chapter 6, Detectors and descriptors of points of interest, describes the most
used algorithms to automatically detect significant structures (known as points of
interest, corners, features) present in the image corresponding to stable physical
parts of the scene. The strength of such algorithms lies in detecting and identifying physical parts of the same scene in a repeatable way, even when the images are acquired under varying lighting conditions and from a different observation point, with a possible change of the scale factor.
The third volume describes the artificial vision algorithms that detect objects in the scene and attempt their identification, their 3D reconstruction, their arrangement and location with respect to the observer, and the estimation of their possible motion.
Chapter 1, Object recognition, describes the fundamental algorithms of artificial vision for automatically recognizing the objects of the scene, an essential capability of all vision systems of living organisms. While a human observer recognizes even complex objects apparently easily and quickly, for a vision machine the recognition process is difficult, requires considerable computation time, and the results are not always optimal. The algorithms for selecting and extracting features become fundamental to the process of object recognition. In various applications, it is possible to have a priori knowledge of all the objects to be classified because we know the sample patterns (meaningful features) from which useful information can be extracted for the decision to associate (decision-making) each individual of the population with a certain class. These sample patterns (training set) are used by the recognition system to learn significant information about the object population (extraction of statistical parameters, relevant characteristics, etc.). The recognition process compares the features of the unknown objects with the model pattern features, in order to uniquely identify their class of membership. Over the years, various disciplinary sectors (machine learning, image analysis, object recognition, information retrieval, bioinformatics, biomedicine, intelligent data analysis, data mining, ...) and application sectors (robotics, remote sensing, artificial vision, ...) have led different researchers to propose different methods of recognition and to develop different algorithms based on different classification models. Although the proposed algorithms have a common purpose, they differ in the properties attributed to the classes of objects (the clusters) and in the model with which these classes are defined (connectivity, statistical distribution, density, ...). The diversity of disciplines, especially between data mining and machine learning, has led to subtle differences, especially in the use of results and in terminology, sometimes contradictory, perhaps caused by the different objectives. For example, in data mining the dominant interest is the automatic extraction of groups, while in automatic classification the discriminating power of the pattern classes is fundamental. The topics of this chapter overlap between aspects related to machine learning and those of recognition based on statistical methods. For simplicity, the algorithms described are grouped, according to the method of classifying objects, into supervised methods (based on deterministic, statistical, neural, and nonmetric models such as syntactic models and decision trees) and unsupervised methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong.
In Chap. 2, RBF, SOM, Hopfield and deep neural networks, four different types of neural networks are described: Radial Basis Functions (RBF), Self-Organizing Maps (SOM), Hopfield networks, and deep neural networks. RBF takes a different approach to the design of a neural network, based on a hidden layer (the only one in the network) composed of neurons in which radial basis functions are defined, hence the name Radial Basis Functions, and which performs a nonlinear transformation of the input data supplied to the network. These neurons form the basis for the input data (vectors). The reason a nonlinear transformation is used in the hidden layer, followed by a linear one in the output layer, is that a pattern classification problem cast into a much larger space (by the nonlinear transformation from the input layer to the hidden one) is more likely to be linearly separable than in a small-sized space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is greater than the cardinality of the input signal).
The SOM network, on the other hand, has an unsupervised learning model and has the originality of autonomously grouping the input data on the basis of their similarity, without evaluating the convergence error with external information on the data. It is useful when there is no exact knowledge of the data with which to classify them. It is inspired by the topology of the cerebral cortex, considering the connectivity of the neurons and, in particular, the behavior of an activated neuron and its influence on neighboring neurons, whose connections are reinforced compared to those further away, which become weaker.
In the Hopfield network, the learning model is supervised, with the ability to store information and retrieve it even from partial content of the original information. Its originality is based on physical foundations that have revitalized the entire field of neural networks. The network is associated with an energy function to be minimized during its evolution through a succession of states, until reaching a final state corresponding to the minimum of the energy function. This feature allows it to be used to set up and solve an optimization problem in terms of an objective function to be associated with an energy function.
Algorithms are also described that reconstruct the 3D surface of the scene from the shading of the image, given some lighting conditions and the reflectance model. Other 3D surface reconstruction algorithms based on the Shape from xxx paradigm are also described, where xxx can be texture, structured light projected onto the surface to be reconstructed, or 2D images of the focused or defocused surface.
In Chap. 6, Motion Analysis, the algorithms for perceiving the dynamics of the scene are reported, analogous to what happens in the vision systems of different living beings. With motion analysis algorithms, it is possible to derive the 3D motion, almost in real time, from the analysis of sequences of time-varying 2D images.
Paradigms of motion analysis have shown that the perception of motion derives from the information of the objects, evaluating the presence of occlusions, texture, contours, etc. The algorithms described concern the perception of the motion occurring in the physical reality, and not of the apparent motion. Different methods of motion analysis are examined, from those with limited computational load, such as those based on the difference of time-varying images, to the more complex ones based on optical flow, considering application contexts with different levels of motion and scene-environments of different complexity.
In the context of rigid bodies, starting from the motion analysis derived from a sequence of time-varying images, the algorithms are described that, in addition to the movement (translation and rotation), estimate the reconstruction of the 3D structure of the scene and the distance of this structure from the observer. Useful information is obtained in the case of a mobile observer (robot or vehicle) to estimate the collision time. In fact, the methods for solving the problem of the 3D reconstruction of the scene operate by acquiring a sequence of images with a single camera whose intrinsic parameters remain constant even if not known (uncalibrated camera), together with unknown motion. The proposed methods fall within the framework of solving an inverse problem. Algorithms are described to reconstruct the 3D structure of the scene (and the motion), i.e., to calculate the coordinates of the 3D points of the scene whose 2D projections are known in each image of the time-varying sequence.
Finally, in Chap. 7, Camera Calibration and 3D Reconstruction, the algorithms for calibrating the image acquisition system (normally a single camera or a stereo vision system) are described; they are fundamental for recovering metric information of the scene from the image (detecting an object's size or determining accurate measurements of the object–observer distance). The various camera calibration methods are described that determine the intrinsic parameters (focal length, horizontal and vertical dimension of the single photoreceptor of the sensor or the aspect ratio, the size of the sensor matrix, the coefficients of the radial distortion model, and the coordinates of the principal point or optical center) and the extrinsic parameters that define the geometric transformation from the reference system of the world to that of the camera. The epipolar geometry introduced in Chap. 5 is used in this chapter to solve the problem of the correspondence of homologous points in a stereo vision system, with the two cameras calibrated or not. Epipolar geometry simplifies the search for the homologous points between the stereo images by introducing the Essential matrix and the Fundamental matrix. The algorithms for estimating these matrices are also described, given a priori the corresponding points of a calibration platform.
With epipolar geometry, the problem of searching for homologous points is reduced to mapping a point of one image onto the corresponding epipolar line in the other image. It is possible to simplify the correspondence problem further, to a one-dimensional point-to-point search between the stereo images. This is accomplished with the image alignment procedure known as stereo image rectification. The different algorithms are described: some based on the constraints of epipolar geometry (uncalibrated cameras, where the fundamental matrix includes the intrinsic parameters) and others on the knowledge, or not, of the intrinsic and extrinsic parameters of calibrated cameras. Chapter 7 ends with the section on the 3D reconstruction of the scene in relation to the knowledge available on the stereo acquisition system. The triangulation procedures for the unambiguous 3D reconstruction of the geometry of the scene are described, given the 2D projections of the homologous points of the stereo images and the calibration parameters of the stereo system. If only the intrinsic parameters are known, the 3D geometry of the scene is reconstructed by estimating the extrinsic parameters of the system, up to an undeterminable scale factor. If the calibration parameters of the stereo system are not available but only the correspondences between the stereo images are known, the structure of the scene is recovered up to an unknown homography (projective) transformation.
We thank all the fellow researchers of the Department of Physics of Bari, of the
Institute of Intelligent Systems for Automation of the CNR (National Research
Council) of Bari, and of the Institute of Applied Sciences and Intelligent Systems
“Eduardo Caianiello” of the Unit of Lecce, who have indicated errors and parts to
be reviewed. We mention them in chronological order: Grazia Cicirelli, Marco Leo,
Giorgio Maggi, Rosalia Maglietta, Annalisa Milella, Pierluigi Mazzeo, Paolo
Spagnolo, Ettore Stella, and Nicola Veneziani. We also thank Arturo Argentieri for his support with the graphics of the figures and the cover. Finally, special thanks go to Maria Grazia Distante, who helped us realize the electronic composition of the volumes by verifying the accuracy of the text and the formulas.
Contents
1 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Prior Knowledge and Features Selection . . . . . . . . . . . . . . . . . . 4
1.4 Extraction of Significant Features . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.2 Selection of Significant Features . . . . . . . . . . . . . . . . . . 10
1.5 Interactive Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Deterministic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.1 Linear Discriminant Functions . . . . . . . . . . . . . . . . . . . 19
1.6.2 Generalized Discriminant Functions . . . . . . . . . . . . . . . 20
1.6.3 Fisher’s Linear Discriminant Function . . . . . . . . . . . . . . 21
1.6.4 Classifier Based on Minimum Distance . . . . . . . . . . . . . 28
1.6.5 Nearest-Neighbor Classifier . . . . . . . . . . . . . . . . . . . . . . 30
1.6.6 K-means Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.6.7 ISODATA Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6.8 Fuzzy C-means Classifier . . . . . . . . . . . . . . . . . . . . . . . 35
1.7 Statistical Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.7.1 MAP Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.7.2 Maximum Likelihood Classifier—ML . . . . . . . . . . . . . . 39
1.7.3 Other Decision Criteria . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.7.4 Parametric Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . 48
1.7.5 Maximum Likelihood Estimation—MLE . . . . . . . . . . . . 49
1.7.6 Estimation of the Distribution Parameters with the Bayes Theorem . . . . . . . . . . . . . . . . . . . 52
1.7.7 Comparison Between Bayesian Learning and Maximum Likelihood Estimation . . . . . . . . . . . . . . 56
1.8 Bayesian Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . 57
1.8.1 Classifier Based on Gaussian Probability Density . . . . . . 58
1.8.2 Discriminant Functions for the Gaussian Density . . . . . . 61
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
1 Object Recognition
1.1 Introduction
The ability to recognize objects is an essential feature of all living organisms. Various creatures have different abilities and modes of recognition. Very important are the sensory nature and the modality of interpretation of the available sensory data.
Evolved organisms like humans can recognize other humans through sight, voice, or handwriting, while less evolved organisms like the dog can recognize other animals or humans simply using the olfactory and visual sense organs. These activities are classified as recognition.
While a human observer performs the recognition of even complex objects apparently easily and quickly, for a vision machine the recognition process is difficult, requires considerable computation time, and the results are not always optimal. The goal of a vision machine is to automatically recognize the objects that appear in the scene. Normally, a generic object observed for the recognition process is called a pattern. In several applications, the pattern adequately describes a generic object with the purpose of recognizing it.
A pattern recognition system can be specialized to recognize people, animals, territory, artifacts, electrocardiograms, biological tissues, etc. The most general ability of a recognition system is to discriminate within a population of objects and determine those that belong to the same class. For example, an agro-food company needs a vision system to recognize different qualities of fruit (apples, pears, etc.), depending on the degree of ripeness and size. This means that the recognition system will have to examine the whole population and classify each fruit, thus obtaining different groupings that identify certain quality classes. The recognition system is, de facto, a classification system.
If we consider, for simplicity, the population of only apples, then to determine the class to which each apple belongs it is necessary that each apple be adequately described, that is, to find its intrinsic characteristics (features),1 functional to determining its correct class of membership (classification). The expert, in this case, can propose to characterize the apple population using the color (which indicates the state of ripeness) and the geometric shape (almost circular), possibly with a measure of the area; to give greater robustness to the classification process, the weight characteristic could also be used. We have actually illustrated how to describe a pattern (in this case, the apple pattern is seen as a set of features that characterizes the apple object), an activity that in the literature is known as selection of significant features (namely, feature selection). The activity of producing the measures associated with the characteristics of the pattern is instead called feature extraction.
The selection and extraction of the features are important activities in the design of a recognition system. In various applications, it is possible to have a priori knowledge of the population of the objects to be classified because we know the sample patterns from which useful information can be extracted for the decision to associate (decision-making) each individual of the population with a specific class. These sample patterns (e.g., the training set) are used by the recognition system to learn meaningful information about the population (extraction of statistical parameters, relevant features, etc.).
In this context of object recognition, Clustering (cluster analysis) becomes central, i.e., the task of grouping a collection of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those of the other groups. In the example of the apples, the concept of similarity is associated with the color, area, and weight measurements used as descriptors of the apple pattern to define apple clusters of different qualities. A system for the recognition of very different objects, for example the apple and pear patterns, requires a more complex description of the patterns, in terms of selection and extraction of significant features and of clustering methods, for a correct classification.
In this case, for the recognition of the quality classes of the apple or pear pattern, it is reasonable to add the shape feature (elliptical), in the context of feature selection, to discriminate between the two patterns. The recognition process compares the features of the unknown objects with the features of the sample patterns, in order to uniquely identify the class to which they belong. In essence, classification is the final goal of a recognition system based on clustering.
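To make this comparison step concrete, the short Python sketch below describes each fruit by a feature vector (color index, area, weight; the names, values, and units are invented for the example, not taken from the text) and assigns an unknown pattern to the class whose mean training vector is closest. It is only a minimal sketch of recognition by comparison with sample patterns, not the book's algorithm.

```python
import numpy as np

# Hypothetical training samples: each row is a feature vector (color, area, weight)
training = {
    "ripe apple":   np.array([[0.80, 28.0, 150.0], [0.85, 30.0, 155.0]]),
    "unripe apple": np.array([[0.30, 22.0, 120.0], [0.35, 24.0, 125.0]]),
}

# Learn one prototype per class: the mean feature vector of its training set
prototypes = {label: samples.mean(axis=0) for label, samples in training.items()}

def classify(x):
    """Assign x to the class whose prototype is nearest (Euclidean distance)."""
    return min(prototypes, key=lambda label: np.linalg.norm(x - prototypes[label]))

unknown = np.array([0.78, 29.0, 148.0])   # feature vector of an unseen fruit
print(classify(unknown))                  # -> "ripe apple"
```

In the terminology used above, the dictionary of mean vectors plays the role of the model patterns learned from the training set, and the classifier is the decision component that compares features.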
In this introduction to the analysis of the observed data of a population of objects, two activities have emerged: the selection and the extraction of the features. The feature extraction activity also has the task of expressing them with an adequate metric to be used appropriately by the decision component, that is, the classifier that determines, based on the chosen clustering method, which class to associate with each object. In reality, the features of an object do not always describe it correctly; they often represent only a good approximation and, therefore, while developing a complex classification process with different recognition strategies, it can be difficult to assign the object to a unique class of membership.
1 From now on, the two words “characteristic” and “feature” will be used interchangeably.
1.2 Classification Methods
The first cluster analysis studies were introduced to classify the psychological traits of the personality [1,2]. Over the years, several disciplinary sectors (machine learning, image analysis, object recognition, information retrieval, bioinformatics, biomedicine, intelligent data analysis, data mining, ...) and application sectors (robotics, remote sensing, artificial vision, ...) have led several researchers to propose different clustering methods and to develop different algorithms based on different types of clusters. Although the proposed algorithms have a common purpose, they differ in the properties attributed to the clusters and in the model with which the clusters are defined (connectivity, statistical distribution, density, ...).
The diversity of disciplines, especially those of data mining2 and machine learning, has led to subtle differences, especially in the use of results, and to sometimes contradictory terminologies, perhaps caused by the different objectives. For example, in data mining the dominant interest is the automatic extraction of groups, while in automatic classification the discriminating power of the classes to which the patterns belong is fundamental. The topics of this chapter overlap between aspects related to machine learning and those of recognition based on statistical methods. Object classification methods can be divided into two categories:
Supervised Methods, i.e., methods that use the a priori knowledge available on the classes, through sample patterns (the training set), to learn how to assign each pattern to its class of membership; they are based on deterministic, statistical, neural, and nonmetric models (such as syntactic models and decision trees).
Non-Supervised Methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong. Often, in the statistical literature, these methods are referred to as clustering. In this context, the objects initially cannot be labeled, and the goal is to explore the data to find an approach that groups them, once features that distinguish one group from another have been selected. The assignment of the labels to each cluster takes place subsequently, with the intervention of the expert. An intermediate phase can be considered in which a supervised approach is applied to the first classes generated, to extract representative samples of some classes.
2 Literally, it means automatic data extraction, normally coming from a large population of data.
This category includes the hierarchical and partitioning clustering methods, which in turn differ in the clustering criteria (models) adopted.
The description of an object is defined by a set of scalar quantities of the object itself that are combined to constitute a vector of features $\mathbf{x}$ (also called feature vector). Considering the features studied in Chap. 8 on Forms, Vol. I, an object can be completely described by a vector $\mathbf{x}$ whose components $\mathbf{x} = (x_1, x_2, \ldots, x_M)$ (represented, for example, by measures of dimension, compactness, perimeter, area, etc.) are some of the geometric and topological information of the object. In the case of multispectral images, the problem of feature selection becomes the choice of the most significant bands, where the feature vector represents the pixel pattern with the radiometric information related to each band (for example, the spectral bands from the visible to the infrared). In other contexts, for example in the case of a population of economic data, the problem of feature selection may be more complex because some measurements are not accessible.
An approach to the analysis of the features to be selected consists in considering the nature of the available measures, i.e., whether they are of a physical nature (as in the case of spectral bands and color components), of a structural nature (such as the geometric ones), or derived from mathematical transformations of the previous measures. The physical features are derived from sensory measurements, for which information on the behavior of the sensor and on the level of uncertainty of the observed measurements may be available. Furthermore, information can be obtained (or found experimentally) on the correlation level of the measurements observed by the various sensors. Similarly, information on the characteristics of the structural measures can be derived.
Fig. 1.2 Functional scheme of an object recognition system based on template matching
The analysis and selection of the features should also be carried out considering the strategies to be adopted in the recognition process. In general, the functioning mechanism of the recognition process involves the analysis of the features extracted from the observed data (population of objects, images, ...) and the formulation of some hypotheses to define elements of similarity between the observed data and those extracted from the sample data (in the supervised context). Furthermore, it provides for the formal verification of the hypotheses using the models of the objects, possibly reformulating the object-similarity approach. Hypothesis generation can also be useful to reduce the search domain by considering only some features. Finally, the recognition process selects among the various hypotheses, as the correct pattern, the one with the highest value of similarity on the basis of the evidence (see Fig. 1.1).
The vision systems typical of object recognition formulate hypotheses and identify the object on the basis of the best similarity. Vision systems based on a priori knowledge consider the hypotheses only as the starting point, while the verification phase is assigned the task of selecting the object. For example, in the recognition of an object based on the comparison of the characteristics extracted from the observed scene with those of the sample prototypes, the approach called template matching is used, and the hypothesis formation phase is completely eliminated (see Fig. 1.2).
From the previous considerations, it emerges that a recognition system is characterized by different components: the representation of the object, the extraction of its features, and the formation and verification of hypotheses. These components are mutually interdependent. The representation of an object depends on the type of object itself. In some cases, certain geometric and topological characteristics are significant; in other cases, they may be of little significance and/or redundant. Consequently, the feature extraction component must determine, among those foreseen for the representation of the object, the ones that are more adequate, considering their robustness and the difficulty of extracting them from the input data. At the same time, in the selection of features, those that can best be extracted from the model and that can best be used for comparison must be selected.
Using many exhaustive features can make the recognition process more difficult and slow. Hypothesis formation is a normally heuristic approach that can reduce the search domain. Depending on the type of application, a priori knowledge can be formalized by associating an a priori probability or a confidence level with the various model objects. These prediction measures constitute the elements for evaluating the likelihood of the presence of objects based on the determined characteristics. The verification of these hypotheses leads to the methods for selecting the models of the objects that best resemble those extracted from the input data. All plausible hypotheses must be examined to verify the presence of the object or to discard it.
In vision machine applications where geometric modeling is used, objects can be modeled and verified using the camera location information or other known information of the scene (e.g., known references in the scene). In other applications, the hypotheses cannot be verified and one can proceed with unsupervised approaches. From the above considerations, the functional scheme of a recognition system can be modified by eliminating the verification phase or the hypothesis formation phase (see Fig. 1.1).
In Chap. 5 Vol. II on Segmentation and in Chap. 8 Vol. I on Forms, we have examined the image processing algorithms that extract some significant characteristics to describe the objects of a scene. Depending on the type of application, the object to be recognized as a physical entity can be anything. In the study of terrestrial resources and their monitoring, the objects to be recognized are, for example, the various types of woods and crops, lakes, rivers, roads, etc. In the industrial sector, on the other hand, for the vision system of a robotic cell, the objects to be recognized are, for example, the individual components of a more complex object to be assembled or inspected. For a food industry, for example, a vision system can be used to recognize different qualities of fruit (apples, pears, etc.) in relation to the degree of ripeness and size.
In all these examples, similar objects can be divided into several distinct subsets. This new grouping of objects makes sense only if we specify the meaning of similar object and if we find a mechanism that correctly separates similar objects to form the so-called classes of objects. Objects with common features are considered similar. For example, the set of ripe and first-quality apples are those characterized by a particular color, for example yellow-green, with a given almost circular geometric shape,
and a measure of the area higher than a certain threshold value. The mechanism that associates a given object with a certain class is called clustering. The object recognition process essentially uses the most significant features of the objects to group
them by classes (classification or identification).
Question: is the number of classes known before the classification process? Answer: normally the classes are known and intrinsically defined in the specifications imposed by the application, but often they are not known and will have to be explored by analyzing the observed data.
For example, in the application that automatically separates classes of apples, the number of classes is imposed by the application itself (for commercial reasons, three classes would be sufficient), deciding to classify different qualities of apples considering the parameters of shape, color, weight, and size.
In the application to remotely sensed images, the classification of forests, in relation to their deterioration caused by environmental impact, should take place without knowing in advance the number of classes, which should emerge automatically in relation to the types of forests actually damaged or polluted. The selection of significant features is closely linked to the application. It is usually made based on experience and intuition. Some considerations can, however, be made for an optimal choice of the significant features.
The first consideration concerns their ability to be discriminating: different values must correspond to different classes. In the previous example, the measurement of the area was a significant feature for the classification of apples into three classes of different sizes (small, medium, and large).
The second consideration regards reliability. Similar values of a feature must always identify the same class for all the objects belonging to that class. Considering the same example, it can happen that a large apple has a different color from that of the class of ripe apples. In this case, the color may be a nonsignificant feature.
The third consideration relates to the correlation between the various features. The two features area and weight of the apple are strongly correlated, since it is foreseeable that the weight increases in proportion to the measure of the area. This means that these two characteristics express the same property and are therefore redundant features, which it may not be meaningful to select together for the classification process. Correlated features can instead be used together when one wants to attenuate the noise. In fact, in multisensory applications, it occurs that some sensors are strongly correlated with each other but the relative measurements are affected by different noise models.
In other situations, it is possible to accept the redundancy of a feature on one condition: that it is not correlated with at least one of the selected features (in the case of the ripe-apple class, it has been seen that the color and area features alone may not discriminate; in this case, the weight characteristic becomes useful, although strongly correlated with the area, under the hypothesis, assumed true, that ripe apples are on average less heavy than the less mature ones).
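As a small numerical illustration of this point, the Pearson coefficient between area and weight can be computed to decide whether one of the two features is redundant; the apple measurements and the 0.95 threshold below are arbitrary choices for the example, not values prescribed by the text.

```python
import numpy as np

# Hypothetical measurements for a handful of apples
area   = np.array([20.0, 24.0, 27.0, 30.0, 33.0])    # cm^2
weight = np.array([110., 128., 141., 160., 172.])    # grams

r = np.corrcoef(area, weight)[0, 1]   # Pearson correlation coefficient
print(f"r(area, weight) = {r:.3f}")

# If the two features are almost perfectly correlated, keeping both adds
# little discriminating power (unless they are combined to attenuate noise).
if abs(r) > 0.95:
    print("area and weight are nearly redundant: one of them may be dropped")
```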
The number of selected features must be adequate and contained, to limit the level of complexity of the recognition process. A classifier that uses few features can produce inappropriate results. In contrast, the use of a large number of features leads to an exponential growth of computational resources without the guarantee of obtaining good results. The set of all the components $x_i$ of the pattern vector $\mathbf{x} = (x_1, x_2, \ldots, x_M)$ is the feature space with M dimensions. We have already considered how, for remote sensing images, different spectral characteristics are available for each pixel pattern (infrared, visible, etc.) in order to extract from the observed data the homogeneous regions corresponding to different areas of the territory. In these applications, the classification of the territory using a single characteristic (a single spectral component) would be difficult. In this case, the analyst must select the significant bands, filtered from the noise, and possibly reduce the dimensionality of the pixels.
In other contexts, the features associated with each object are extracted by producing features with higher level information (for example, normalized spatial moments, Fourier descriptors, etc., described in Chap. 8 Vol. I on Forms) related to elementary regions representative of the objects. An optimal feature selection is obtained when the pattern vectors $\mathbf{x}$ belonging to the same class lie close to each other when projected into the feature space.
These vectors constitute the set of similar objects represented in a single class (cluster). In the feature space, different classes can accumulate (corresponding to different types of objects), which can be separated using appropriate discriminant functions. The latter represent the cluster-separation hypersurfaces in the feature space of M dimensions, which control the classification process. The hypersurfaces can be simplified to hyperplanes, and in this case we speak of linearly separable discriminant functions.
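The linearly separable case can be made concrete with a toy sketch: a single hyperplane g(x) = w·x + w0 = 0 in the feature space, with the sign of g deciding the cluster. The weights and the sample points below are invented for illustration and do not come from any figure in the chapter.

```python
import numpy as np

# Hypothetical linear discriminant g(x) = w.x + w0 separating two clusters
w, w0 = np.array([1.5, -0.8]), -0.4

def g(x):
    return float(np.dot(w, x) + w0)

patterns = np.array([[1.0, 0.2], [0.1, 1.1], [0.9, 0.5]])  # points in (x1, x2)
for x in patterns:
    cluster = "class 1" if g(x) > 0 else "class 2"
    print(x, "->", cluster, f"(g = {g(x):+.2f})")
```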
Figure 1.3 shows a two-dimensional (2D) example of the feature space $(x_1, x_2)$ where 4 clusters are represented, separated by linear and nonlinear discriminant functions, which represent the set of homogeneous pixels corresponding in the spatial domain to 5 regions. This example shows that the two selected features $x_1$ and $x_2$ exhibit an adequate discriminating ability to separate the population of pixels in the feature space $(x_1, x_2)$, which in the spatial domain belong to 5 regions corresponding to 4 different classes (the example schematizes a portion of land wet by the sea with a river: class 1 indicates the ground, 2 the river, 3 the shoreline, and 4 the sea).
The capability of the classification process is based on the ability to separate without error the various clusters, which in different applications are located very close to each other or are superimposed, generating an incorrect classification (see Fig. 1.4). The latter example shows that the selected features do not intrinsically exhibit a good discriminating power to separate, in the feature space, the patterns that belong to different classes, regardless of the discriminant function that describes the cluster-separation surface.
Fig. 1.3 Spatial domain represented by a multispectral image with two bands and the 2D feature domain where 4 homogeneous pattern classes are grouped, corresponding to different areas of the territory: bare terrain, river, shoreline, and sea
(a) The characteristics of the objects must be analyzed, eliminating those that are strongly correlated and do not help discriminate between objects. They can instead be used to filter out any noise between correlated features.
(b) The underlying problem of a classifier depends on the fact that the classes of objects are not always well separated in the feature space. Very often the feature pattern vectors can belong to more than one cluster (see Fig. 1.4), and the classifier can make mistakes, associating some of them with an incorrect class of membership. This can be avoided by eliminating features that have little discriminating power (for example, the feature $x_1$ in Fig. 1.4 does not separate the object classes {A, B, C}, while $x_2$ separates the classes {A, C} well).
Let us analyze with an example (see Fig. 1.4) the hypothesized normal distribution of three classes of objects in the one-dimensional feature spaces $x_1$ and $x_2$ and in the two-dimensional space $(x_1, x_2)$. It is observed that class A is clearly separated from classes B and C, while the latter show overlapping zones in both the features $x_1$ and $x_2$. From the analysis of the features $x_1$ and $x_2$, it is observed that they are correlated (the patterns are distributed in a dominant way along the diagonal of the plane $(x_1, x_2)$), that is, to similar values of $x_1$ correspond similar values of $x_2$. From the analysis of the one-dimensional distributions, it is observed that class A is well separable with the single feature $x_2$, while classes B and C cannot be accurately separated with the distribution of the features $x_1$ and $x_2$, as the latter are strongly correlated.
In general, it is convenient to select only those features that have a high level of orthogonality, that is, the distribution of the classes over the various features should be very different: while in one feature the distribution is located toward the low values, in another feature the same class must have a different distribution. This can be achieved by selecting or generating uncorrelated features. From the geometric point of view, this can be visualized by imagining a transformation of the original variables such that, in the new system of orthogonal axes, they are ordered in terms of
quantity of variance of the original data (see Fig. 1.5).
Fig. 1.5 The transformation of the original variables into the principal components $y_1, y_2$ is equivalent to rotating the coordinate axes until the maximum variance is obtained when all the patterns are projected on the $OY_1$ axis
In the Chapter Linear Transformations, Sect. 2.10.1 Vol. II, we have described the transform proper orthogonal
decomposition (POD), better known as principal component analysis (PCA), which has this property and allows the original data to be represented in a significant way through a small group of new variables, precisely the principal components. With this data transformation, the expectation is to describe most of the information (variance) of the original data with a few components. The PCA is a direct transformation of the original data; no assumption is made about their distribution or about the number of classes present, and therefore it behaves like an unsupervised feature extraction method.
To better understand how the PCA behaves, consider the pattern distribution (see Fig. 1.5) represented by the features $(x_1, x_2)$, which can express, for example, length measurements (meters, centimeters, ...) and weight measurements (grams, kilograms, ...), respectively, that is, quantities with different units of measurement. For better graphical visibility of the pattern distribution in the domain of the features $(x_1, x_2)$, we imagine that these realizations have a Gaussian distribution.
With this hypothesis, in the feature domain the patterns are arranged in an ellipsoidal form. If we now rotate the axes from the reference system $(x_1, x_2)$ to the system $(y_1, y_2)$, the ellipsoidal shape of the patterns remains the same, while only the coordinates change. In this new system, there can be a convenience in representing these realizations. Since the axes are rotated, the relationship between the two reference systems can be expressed as follows:
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \quad\text{or}\quad \begin{cases} y_1 = x_1\cos\theta + x_2\sin\theta \\ y_2 = -x_1\sin\theta + x_2\cos\theta \end{cases} \qquad (1.1)$$
where $\theta$ is the angle between the homologous (horizontal and vertical) axes of the two reference systems. It can be observed from these equations that the new coordinate, i.e., the transformed feature $y_1$, is a linear combination of the length and weight measurements (with both coefficients positive), while the second new coordinate, i.e., the feature $y_2$, is again a linear combination of the length and weight measurements but with coefficients of opposite sign. That said, we observe from Fig. 1.5 that there is an imbalance in the pattern distribution, with a more pronounced dispersion along the first axis.
This means that the projection of the patterns on the new axis $y_1$ can be a good approximation of the entire ellipsoidal distribution. This is equivalent to saying that the set of realizations represented by the ellipsoid can be significantly represented by the single new feature $y_1 = x_1\cos\theta + x_2\sin\theta$ instead of indicating for each pattern the original measures $x_1$ and $x_2$.
It follows that, considering only the new feature $y_1$, we get a dimensionality reduction from 2 to 1 to represent the population of the patterns. But be careful: the concept of meaningful representation with the new feature $y_1$ must be specified. In fact, $y_1$ can take different values as the angle $\theta$ varies. It is, therefore, necessary to select the value of $\theta$ which gives the best representation of the relationship that exists between the patterns of the population in the feature domain. This is guaranteed by selecting the value of $\theta$ which minimizes the displacement of the points under the projection with respect to their original position.
Given that the coordinates of the patterns with respect to the $Y_1$ axis are their orthogonal projections on the $OY_1$ axis, the solution is given by the line whose distance from the points is minimal. Indicating with $P_k$ a generic pattern and with $P'_k$ its orthogonal projection on the axis $OY_1$, the orientation of the best line is the one that minimizes the sum, given by [3]

$$\sum_{k=1}^{N} \overline{P_k P'_k}^{\,2}$$
Repeating for all the N patterns, adding, and dividing by $N-1$, we have the following:

$$\frac{1}{N-1}\sum_{k=1}^{N} \overline{OP_k}^{\,2} \;=\; \frac{1}{N-1}\sum_{k=1}^{N} \overline{P_k P'_k}^{\,2} \;+\; \frac{1}{N-1}\sum_{k=1}^{N} \overline{OP'_k}^{\,2} \qquad (1.2)$$
Analyzing (1.2), it results that the first member is constant for all the patterns and is independent of the reference system. It follows that choosing the orientation of the $OY_1$ axis is equivalent to minimizing the expression of the first addend of (1.2) or to maximizing the expression of the second addend of the same equation. Under the hypothesis that O represents the center of mass of all the patterns,3 the expression of the second addend $\frac{1}{N-1}\sum_{k=1}^{N}\overline{OP'_k}^{\,2}$ corresponds to the variance of the projections of the patterns on the new axis $Y_1$.
3 Without losing generality, this is achieved by expressing both the input variables $x_i$ and the output variables $y_i$ in terms of deviations from the mean.
Choosing the $OY_1$ axis that minimizes the sum of the squares of the perpendicular distances from this axis is equivalent to selecting the $OY_1$ axis in such a way that the projections of the patterns on it have the maximum variance. These assumptions are at the basis of the search for the principal components formulated by Hotelling [4]. It should be noted that with the principal components approach, reported above, the sum of the squared perpendicular distances between the patterns and the axis is minimized; the least squares approach is different, since the squared distances of the patterns from the line (represented here by the $OY_1$ axis) are measured along a coordinate axis rather than perpendicular to the line. This leads to a different solution (linear regression).
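The two criteria can be compared numerically. The sketch below (with arbitrary synthetic data) rotates correlated 2D patterns according to Eq. (1.1), searches for the angle that maximizes the variance of the projections on OY1, and checks that it agrees with the direction of the leading eigenvector of the covariance matrix; it is an illustrative experiment, not part of the book's material.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D Gaussian patterns, centered on their mean (origin = center of mass)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500)
X -= X.mean(axis=0)

def variance_along(theta):
    # Projection on the rotated axis OY1: y1 = x1*cos(theta) + x2*sin(theta), Eq. (1.1)
    y1 = X[:, 0] * np.cos(theta) + X[:, 1] * np.sin(theta)
    return y1.var(ddof=1)

thetas = np.linspace(0.0, np.pi, 1800)
theta_best = thetas[np.argmax([variance_along(t) for t in thetas])]

# Direction of the first principal component from the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
a1 = eigvecs[:, np.argmax(eigvals)]
theta_pca = np.arctan2(a1[1], a1[0]) % np.pi

print(f"angle of maximum projection variance: {np.degrees(theta_best):.1f} deg")
print(f"angle of the first eigenvector:       {np.degrees(theta_pca):.1f} deg")
```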
Returning to the principal component approach, the second component is defined in the direction orthogonal to the first and represents the maximum of the remaining variance of the pattern distribution. For patterns with M dimensions, the subsequent components are obtained in a similar way. It is understood that the peculiarity of the PCA is to represent a set of patterns in the most spread-out way possible along the principal axes. Imagining a Gaussian distribution of patterns with M dimensions, the dispersion assumes an ellipsoidal shape,4 the axes of which are oriented along the principal components.
In Sect. 2.10.1 Vol. II, we have shown that to calculate the axes of these ellipsoidal structures (they represent the level sets of the density), the covariance matrix $K_S$ was calculated for a multispectral image S of N pixels with M bands (which here represent the variables of each pixel pattern). Recall that the covariance matrix $K_S$ collects the variances of the variables and the covariances between different variables.
Furthermore, in the same section, we showed how to obtain the transform to the principal components:

$$Y_{PCA} = A \cdot S$$

through the diagonalization of the covariance matrix $K_S$, obtaining the orthogonal matrix A with the eigenvectors $a_k$, $k = 1, 2, \ldots, M$, and the diagonal matrix of the eigenvalues $\lambda_k$, $k = 1, 2, \ldots, M$. It follows that the i-th principal component is given by

$$y_i = \langle \mathbf{x}, \mathbf{a}_i \rangle = \mathbf{a}_i^T \mathbf{x}, \qquad i = 1, \ldots, M \qquad (1.3)$$
with the variances and covariances of the new components expressed by

$$\mathrm{Var}(y_i) = \lambda_i, \qquad \mathrm{Cov}(y_i, y_j) = 0 \ \ \text{for } i \neq j \qquad (1.4)$$
4 Indeed, an effective way to represent the graph of the multivariate normal density function $N(0, \Sigma)$ is by level curves of value c. In this case, the function is positive and the level curves to be examined concern values c > 0, with a positive definite (hence invertible) covariance matrix. It is shown that the equation of a level curve is that of an ellipsoid, $\mathbf{x}^T \Sigma^{-1} \mathbf{x} = c$, centered at the origin. In the reference system of the principal components, these are expressed in the bases of the eigenvectors of the covariance matrix, and the equation of the ellipsoid becomes $\frac{y_1^2}{\lambda_1} + \cdots + \frac{y_M^2}{\lambda_M} = c$, with the lengths of the semi-axes proportional to $\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_M}$, where $\lambda_i$ are the eigenvalues of the covariance matrix. For M = 2, we have elliptic contour lines. If $\mu \neq 0$, the ellipsoid is centered at $\mu$.
Finally, we highlight the property that the initial total variance of the pattern population is equal to the total variance of the principal components:

$$\sum_{i=1}^{M} \mathrm{Var}(x_i) = \sigma_1^2 + \cdots + \sigma_M^2 = \sum_{i=1}^{M} \mathrm{Var}(y_i) = \lambda_1 + \cdots + \lambda_M \qquad (1.5)$$

although it is distributed with different and decreasing weights over the principal components, given the decreasing order of the eigenvalues:

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M \ge 0$$
where $\mathbf{x}$ indicates the generic input pattern. The process can be extended to all M principal axes, mutually orthogonal, even if the graphic display becomes difficult beyond the three-dimensional (3D) representation. Often the graphical representation is useful at an exploratory level, to observe how the patterns are grouped in the feature space, i.e., how the features are related to each other. For example, in the classification of multispectral images, it may be useful to explore different 2D projections of the principal components to see how the ellipsoidal structures are arranged to separate the homogeneous pattern classes (as informally anticipated with Fig. 1.4). An elongated shape of the ellipsoid indicates that one axis is very short with respect to the other; this informs us of the little variability of that component in that direction and, consequently, by projecting all the patterns onto $\langle \mathbf{x}, \mathbf{a}_1 \rangle$, we get the least loss of information.
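The whole construction can be summarized in a few lines of Python (a generic sketch on synthetic data, not the book's implementation): diagonalize the covariance matrix, order the eigenpairs by decreasing eigenvalue, project the patterns as in Eq. (1.3), and verify the variance-preservation property of Eq. (1.5).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # synthetic patterns, M = 4 features
Xc = X - X.mean(axis=0)                                   # work with deviations from the mean

K = np.cov(Xc, rowvar=False)                # covariance matrix K_S (M x M)
eigvals, A = np.linalg.eigh(K)              # diagonalization: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]           # decreasing order: lambda_1 >= ... >= lambda_M
eigvals, A = eigvals[order], A[:, order]

Y = Xc @ A                                  # principal components, Eq. (1.3): y_i = a_i^T x

# Eq. (1.5): the total variance is preserved by the transformation
print(np.allclose(Xc.var(axis=0, ddof=1).sum(), Y.var(axis=0, ddof=1).sum()))  # True
print(np.allclose(Y.var(axis=0, ddof=1), eigvals))   # variances of the components = eigenvalues
```

The last check also illustrates the statement of Eq. (1.4): the variance of each new component equals the corresponding eigenvalue.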
This last aspect, the variability of the new components, is connected to the problem of the variability of the input data, which can differ both in the kind of measurement (for example, in the area-weight case, the first is a measure linked to length, the second indicates a measure of the force of gravity) and in the dynamics of the range of variability, even when expressed in the same unit of measurement. The different variability of the input data tends to influence the first principal components, thus distorting the exploratory analysis. The solution to this problem is obtained by applying a standardization procedure to the input data before applying the transform to the principal components. This procedure consists in transforming the original data $\mathbf{x}_i$, $i = 1, \ldots, N$ into the normalized data $\mathbf{z}_i$ as follows:

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \qquad (1.7)$$
where $\mu_j$ and $\sigma_j$ are, respectively, the mean and the standard deviation of the j-th feature, that is,

$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij} \qquad\qquad \sigma_j = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_{ij} - \mu_j\right)^2} \qquad (1.8)$$
In this way, each feature has the same mean, zero, and the same standard deviation, 1, and if we calculate the covariance matrix $K_z$, it coincides with the correlation matrix $R_x$. Each element $r_{jk}$ of the latter represents the normalized covariance between two features, called Pearson's correlation coefficient (a measure of linear relationship) between the features $x_j$ and $x_k$, obtained as follows:

$$r_{jk} = \frac{\mathrm{Cov}(x_j, x_k)}{\sigma_{x_j}\sigma_{x_k}} = \frac{1}{N-1}\sum_{i=1}^{N} z_{ij}\, z_{ik} \qquad (1.9)$$
By virtue of the inequality of Hölder [5], it is shown that the correlation coefficients
ri j have value |ri j | ≤ 1 and in particular, the elements rkk of the principal diagonal
are all equal to 1 and represent the variance of a standardized feature. Great absolute
value of r jk corresponds to a high linear relationship between the two features. For
|r jk | = 1, the values of x j and xk lie exactly on a line (with positive slope if the coeffi-
cient is positive; with negative slope if the coefficient is negative). This property of the
correlation matrix explains the invariance for change of unit of measure (scale invari-
ance) unlike the covariance matrix. Another property of the correlation coefficient
is the fact that if the features x_j and x_k are unrelated (E[x_j x_k] = μ_{x_j} μ_{x_k}), we have r_{jk} = 0, while a zero covariance Cov(x_j, x_k) = 0 does not, by itself, guarantee that the features are independent.
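The relationship between standardization (Eq. 1.7-1.8) and the Pearson correlation matrix (Eq. 1.9) can be verified with a short sketch (Python with NumPy, not part of the original text; the data X are hypothetical):

import numpy as np

def standardize(X):
    """Transform each feature to zero mean and unit standard deviation (Eqs. 1.7-1.8)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)        # (N-1) denominator as in Eq. 1.8
    return (X - mu) / sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * [1.0, 50.0, 0.01]   # features with very different scales
Z = standardize(X)

# Covariance of the standardized data coincides with the Pearson correlation matrix R_x.
Kz = np.cov(Z, rowvar=False)
Rx = np.corrcoef(X, rowvar=False)
print(np.allclose(Kz, Rx))               # True: r_jk = (1/(N-1)) sum_i z_ij z_ik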
Having two operating modes available, with the original data and standardized
data, the principal components could be calculated, respectively, by diagonalizing
the covariance matrix (data sensitive to the change of scale) or the correlation matrix
(normalized data). This choice must be carefully evaluated based on the nature of
the available data considering that the analysis of the principal components leads to
different results in using the two matrices on the same data.
A reasonable criterion could be to standardize data when these are very different
in terms of scale. For example, in the case of multispectral images, a feature represented by the broad dynamics of one band can contrast with another band with a very restricted dynamic range. If instead the data are homogeneous across the various features, it is better to carry out the principal component analysis without performing data standardization. A direct advantage of operating with the correlation matrix R is the possibility of comparing data of the same nature but acquired at different times, for example, the classification of multispectral images relating to the same territory, acquired with the same platform but at different times.
We now return to the interpretation of the principal components (in the litera-
ture also called latent variables). The latter term is indicated precisely to express
the impossibility of giving a direct formalization and meaning to the principal com-
ponents (for example, the creation of a model). In essence, the mathematical tool
indicates the direction of the components where the information is significantly con-
centrated, but the interpretation of these new components is left to the analyst. For a
multispectral image, the features represent the spectral information associated with the various bands (in the visible, in the infrared, ...). When projected onto the principal components, the original physical data are mathematically redistributed and represented by new hidden variables that have lost their original physical meaning, even though the explained variances give us quantitative information on the energy content of each component in this new space. In various applications, this property of the principal components is used
in reducing the dimensionality of the data considering the first p most significant
components of the M-original dimensions.
For example, the first 3 principal components can be used as the components
of a suitable color space (RGB, ...) to display a multispectral image of dozens of
bands (see an example shown in Sect. 2.10.1 Vol. II). Another example concerns
the aspects of image compression, where the analysis of the principal components is strategic to evaluate the feasible compression level, even if the data compression is then performed with computationally more efficient transforms (see Chap. 2 Vol. II).
In the context of clustering, the analysis of principal components is useful in
selecting the most significant features and eliminating the redundant ones. The analyst can decide the acceptable percentage d of explained variance when keeping the first p components, computable (considering Eq. 1.5) with the following ratio:
d = 100\, \frac{\sum_{k=1}^{p} λ_k}{\sum_{k=1}^{M} λ_k}    (1.10)
Another approach is to graph the value of the eigenvalues on the ordinates with
respect to their order of extraction and choose the number p of the most significant
components where an abrupt change of the slope occurs with the rest of the graph
almost flat.
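A small sketch of this component-selection criterion follows (Python, not part of the original text; the eigenvalue spectrum is hypothetical): the cumulative percentage of Eq. (1.10) is computed for increasing p, mimicking the scree-plot choice of keeping the components before the curve flattens.

import numpy as np

def explained_variance(eigvals, p):
    """Percentage d of total variance retained by the first p components (Eq. 1.10)."""
    eigvals = np.sort(eigvals)[::-1]
    return 100.0 * eigvals[:p].sum() / eigvals.sum()

# Hypothetical eigenvalue spectrum with an "elbow" after the second component.
lam = np.array([6.1, 2.4, 0.3, 0.15, 0.05])
for p in range(1, len(lam) + 1):
    print(p, round(explained_variance(lam, p), 1))
# One may keep the smallest p whose cumulative percentage exceeds, e.g., 95%.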
Fig. 1.6 Functional scheme of the interactive deterministic method. a Spatial domain represented
by two bands (x1 , x2 ); b 2D features space where the expert checks how the patterns cluster in
nonoverlapping clusters; c definition of the look-up table after having interactively defined the
limits of separation of classes in the features domain; d thematic map, obtained using the values
of the features x1 , x2 of each pixel P as a pointer to the 2D look-up table, where the classes to be
associated are stored
d(x) = yr (1.11)
where d(x) is called decision function or discriminant function. While in the inter-
active classification method, the regions associated with the different classes were
defined by the user observing the data projections in the features domain, with the
deterministic method, instead, these regions are delimited by the decision func-
tions that are defined by analyzing sample data for each observable class. The deci-
sion functions partition, in practice, the feature space into R disjoint classes ω_r, r = 1, . . . , R, each of which constitutes the subset of the M-dimensional pattern vectors x for which the decision d(x) = y_r is valid. The ω_r classes are separated by discriminating hypersurfaces.
In relation to the R classes ωr , the discriminating hypersurfaces can be defined by
means of the scalar functions dr (x) which are precisely the discriminating functions
of the classifier, with the following property:

d_r(x) > d_j(x) \quad ∀ j ≠ r, \;\text{if } x ∈ ω_r    (1.12)

That said, with (1.12), a pattern vector x is associated with the class with the largest value of the discriminant function, i.e., x ∈ ω_r if d_r(x) = \max_{j=1,...,R} d_j(x).
Various discriminant functions are used in the literature: linear (with d(x) defined as a linear combination of the features x_j) and nonlinear multiparametric (defined as d(x, γ), where γ represents the parameters of the model d to be determined in the training phase, given the sample patterns, as is the case for the multilayer perceptron).
Discriminant functions can also be considered as a linear regression of data to a
model where y is the class to be assigned (the dependent variable) and the regressors
are the pattern vectors (the independent variables).
The linear discriminant functions are the simplest and are normally the most used.
They are obtained as a linear combination of the features of the x patterns:
d_r(x) = w_r^T x = \sum_{i=1}^{M+1} w_{r,i}\, x_i \;\; \begin{cases} > 0, & x ∈ ω_r \\ < 0, & \text{otherwise} \end{cases}    (1.15)
where x = (x_1, . . . , x_M, 1)^T is the augmented pattern vector (to get a simpler notation), r = 1, . . . , R indicates the class, while

w_r = (w_{r,1}, . . . , w_{r,M+1})^T

is the corresponding augmented weight vector and

w_o = (w_1, w_2, . . . , w_M)^T

denotes the weight vector deprived of the threshold component w_{M+1}.
Fig. 1.7 (figure labels: hyperplane d(x) = 0, region d(x) < 0, pattern z, distances D_z and D_o)
Fig. 1.8 Linear and nonlinear discriminant function. a A linear function that separates two classes
of patterns ω1 and ω2 ; b Absolute separation between classes ω1 , ω2 , and ω3 separated respectively
by lines with equations d1 (x) = 0, d2 (x) = 0, and d3 (x) = 0; c Example of separation between
two classes with a nonlinear discriminant function described by a parabolic curve
and it is shown [7] to have perpendicular distance D_o = \frac{|w_{M+1}|}{\|w_o\|} from the origin and distance D_z = \frac{w_o^T z + w_{M+1}}{\|w_o\|} from an arbitrary pattern vector z (see Fig. 1.7). If D_o = 0, the hyperplane passes through the origin. The value of the discriminant function associated with a pattern vector x represents the measure of its perpendicular distance from the hyperplane given by (1.16).
For M = 2, the linear discriminant function (1.16) corresponds to the equation
of the straight line (separation line between two classes) given by
d(x) = w1 x1 + w2 x2 + w3 = 0 (1.17)
where the coefficients wi , i = 1, 2, 3 are chosen to separate the two classes ω1 and
ω2 , i.e., for each pattern x ∈ ω1 , we have that d(x) > 0, while for every x ∈ ω2 results
d(x) < 0 as shown in Fig. 1.8a. In essence, d(x) results in the linear discriminant
function of the class ω1 . More generally, we can say that the set of R classes are
absolutely separable if each class ωr , r = 1, . . . , R is linearly separated from the
remaining pattern classes (see Fig. 1.8b).
For M = 3, the linear discriminant function is represented by the plane and for
M > 3 by the hyperplane.
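The following minimal sketch (Python with NumPy, not part of the original text; the weights are hypothetical) illustrates the two-class decision of Eq. (1.17) via the sign of the linear discriminant evaluated on the augmented pattern:

import numpy as np

# Hypothetical weights of Eq. (1.17): d(x) = w1*x1 + w2*x2 + w3.
w = np.array([1.0, -2.0, 0.5])

def classify_two_classes(x):
    """Assign x to omega_1 if d(x) > 0, to omega_2 if d(x) < 0 (Fig. 1.8a)."""
    x_aug = np.append(x, 1.0)            # augmented pattern (x1, x2, 1)
    d = w @ x_aug
    return 1 if d > 0 else 2

print(classify_two_classes([2.0, 0.1]))  # -> 1
print(classify_two_classes([0.0, 1.0]))  # -> 2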
where φ_i(x) are the M scalar functions associated with the pattern x with M features (x ∈ R^M). In vector form, introducing the augmented vectors w and z, with the substitution of the original variables by z_i = φ_i(x), we have
d(x) = \sum_{i=1}^{M+1} w_i\, φ_i(x) = w^T z    (1.19)
where z = (φ1 (x), . . . , φ M (x), 1)T is the vector function of x and w = (w1 , . . . ,
w M , w M+1 )T . The discriminant function (1.19) is linear in z i through the functions
φi (i.e., in the new transformed variables) and not in the measures of the original
features x_i. In essence, by transforming the input patterns x, via the scalar functions φ_i, into the new augmented (M+1)-dimensional domain of the features z_i, the classes can be separated by a linear function as described in the previous paragraph.
In the literature, several functions φ_i have been proposed to linearly separate patterns. The most common discriminant functions are polynomial, quadratic, radial basis,5 and multilayer perceptron. For example, for M = 2, the quadratic generalized discriminant function results
5 Function of real variables and real values dependent exclusively on the distance from a fixed
point, called centroid xc . An RBF function is expressed in the form φ : R M → R such that
φ(x) = φ(|x − xc |).
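As a hedged illustration of this idea (Python with NumPy, not part of the original text; the feature map and weights are hypothetical), a quadratic map φ makes a circular boundary in x linear in the transformed variables z:

import numpy as np

def phi_quadratic(x):
    """Quadratic feature map for M = 2: z = (x1, x2, x1^2, x2^2, x1*x2, 1)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1*x2, 1.0])

# Hypothetical weight vector in the transformed space; d(x) = w^T phi(x) is linear
# in z even though its boundary in x is a conic (here x1^2 + x2^2 - 1 = 0, a circle).
w = np.array([0.0, 0.0, 1.0, 1.0, 0.0, -1.0])

def d(x):
    return w @ phi_quadratic(x)

print(d([0.2, 0.2]) < 0)   # inside the circle  -> one class
print(d([2.0, 0.0]) > 0)   # outside the circle -> the other class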
Fig. 1.9 Fisher linear discriminant function. a Optimal projection line to separate two classes of patterns ω1 and ω2; b nonoptimal line of separation of the two classes, where the partial overlap of different patterns is noted
is the vector that defines the orientation of the line onto which the patterns x_i, i = 1, . . . , N, are projected; the goal is to find for every x_i a scalar value y_i that represents the distance from the origin of its projection on the line. This distance is given by

y_i = v^T x_i \qquad i = 1, . . . , N    (1.21)

The figure shows the projection of a sample in the one-dimensional case. Now let us see how to determine v, the optimal direction of the sample projection line that best separates the K classes ω_k, each consisting of n_k samples. In other words, we need to find v so that, after the projection, the ratio of the between-class variance to the within-class variance is maximized.
A criterion for defining a separation measure of the two classes consists in considering the distance (see Fig. 1.9) between the projected means |μ̂_2 − μ̂_1|, which represents the inter-class distance (measure of separation). The measure of dispersion (Ŝ_1^2 + Ŝ_2^2) obtained is called intra-class dispersion (within-class scatter) of the samples projected in the direction v, in this case with two classes.
The Fisher linear discriminating criterion is given by the linear function defined
by the (1.21) which projects the samples on the line in the direction v and maximizes
the following linear function:
J(v) = \frac{|μ̂_2 − μ̂_1|^2}{Ŝ_1^2 + Ŝ_2^2}    (1.25)
The goal of (1.25) is to project the samples of each class compactly (that is, with very small Ŝ_k^2) and, simultaneously, to keep the projected centroids as far apart as possible (i.e., a very large distance |μ̂_2 − μ̂_1|^2). This is achieved by finding a vector v^* which maximizes J(v) through the following procedure.
Ŝ_k^2 = \sum_{y_i ∈ ω_k} (y_i − μ̂_k)^2 = v^T S_k v

where S_k is the dispersion (scatter) matrix in the original feature space. From (1.27), we get

Ŝ_1^2 + Ŝ_2^2 = v^T S_v v    (1.28)
which includes the separation measures between the centroids of the two classes before the projection. It is observed that S_B is obtained from the outer product of two vectors and has rank at most one.
4. Calculation of the difference between the centroids, after the projection, expressed
in terms of the averages in the space of the features of origin:
(μ̂_1 − μ̂_2)^2 = (v^T μ_1 − v^T μ_2)^2 = v^T (μ_1 − μ_2)(μ_1 − μ_2)^T v = v^T S_B v    (1.30)

J(v) = \frac{|μ̂_2 − μ̂_1|^2}{Ŝ_1^2 + Ŝ_2^2} = \frac{v^T S_B v}{v^T S_v v}    (1.31)
6. Find the maximum of the objective function J (v). This is achieved by deriving
J with respect to the vector v and setting the result to zero.
\frac{d}{dv} J(v) = \frac{d}{dv}\left[ \frac{v^T S_B v}{v^T S_v v} \right] = \frac{[v^T S_v v]\,\frac{d[v^T S_B v]}{dv} − [v^T S_B v]\,\frac{d[v^T S_v v]}{dv}}{(v^T S_v v)^2} = \frac{[v^T S_v v]\, 2 S_B v − [v^T S_B v]\, 2 S_v v}{(v^T S_v v)^2} = 0    (1.32)

\Longrightarrow [v^T S_v v]\, 2 S_B v − [v^T S_B v]\, 2 S_v v = 0
7. Solve the problem with the generalized eigenvalue method of Eq. (1.33), provided that S_v has full rank (so that its inverse exists). Solving for v, the maximizer v^* is obtained as follows:

v^* = \arg\max_v J(v) = \arg\max_v \frac{v^T S_B v}{v^T S_v v} = S_v^{-1}(μ_1 − μ_2)    (1.34)
With (1.34), we have thus obtained the Fisher linear discriminant, although more than a discriminant function it is rather an appropriate choice of the direction of the one-dimensional projection of the data.
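A compact sketch of the two-class Fisher projection of Eq. (1.34) follows (Python with NumPy, not part of the original text; the two synthetic class samples are hypothetical):

import numpy as np

def fisher_direction(X1, X2):
    """Fisher projection v* = S_v^{-1} (mu1 - mu2) for two classes (Eq. 1.34)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)       # class scatter matrices S_k
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sv = S1 + S2                         # within-class scatter
    return np.linalg.solve(Sv, mu1 - mu2)

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], [1.0, 0.3], size=(50, 2))
X2 = rng.normal([3, 1], [1.0, 0.3], size=(50, 2))
v = fisher_direction(X1, X2)
y1, y2 = X1 @ v, X2 @ v                  # one-dimensional projections (Eq. 1.21)
print(abs(y1.mean() - y2.mean()) / np.sqrt(y1.var() + y2.var()))  # separation on the line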
where the second equation expresses in matrix compact form the vector y of the
C − 1 projections generated by the C − 1 projection vectors vi assembled in the
C − 1 columns of the projection matrix V.
Let us now see how the equations seen above for LDA to C-classes are generalized
(Fig. 1.10 presents an example of 2D features with 3 classes and samples with 2
dimensions).
where

S_i = \sum_{x_j ∈ ω_i} (x_j − μ_i)(x_j − μ_i)^T \qquad μ_i = \frac{1}{n_i}\sum_{x_j ∈ ω_i} x_j    (1.37)
μ̂_i = \frac{1}{n_i}\sum_{y_j ∈ ω_i} y_j \qquad μ̂ = \frac{1}{N}\sum_{i=1}^{C} n_i\, μ̂_i    (1.39)
Ŝ_V = V^T S_V V    (1.41)

The dispersion matrix Ŝ_B = V^T S_B V remains valid in the same form for LDA with C classes.
6. Our goal is to find a projection that maximizes the relationship between inter-class
and intra-class dispersion. Since the projection is no longer one-dimensional but
has dimensions C − 1, the determinant of the dispersion matrices is used to obtain
a scalar objective function, as follows:
J(V) = \frac{|Ŝ_B|}{|Ŝ_V|} = \frac{|V^T S_B V|}{|V^T S_V V|}    (1.42)
It is now necessary to find the projections defined by the column vectors of the
projection matrix V, or a projection matrix V∗ that maximizes the ratio of the
objective function J (V).
7. Calculation of the matrix V^*. In analogy to the 2-class case, the maximum of J(V) is found by differentiating the objective function (1.42) and setting the result to zero. Subsequently, the problem is solved with the eigenvalue method, generalizing Eq. (1.33) previously obtained for 2 classes. It is shown that the optimal projection matrix V^* is the matrix whose columns are the eigenvectors of S_V^{-1} S_B associated with its largest eigenvalues.
Fig. 1.11 Application of LDA for a 2D dataset with 2 and 3 classes. a Calculation of the linear discriminant projection vector for the 2-class example. In the figure, produced with MATLAB, the principal component of PCA is reported together with the FDA projection vector. It is highlighted how FDA separates the classes better than PCA, whose principal component is more oriented toward capturing the greater variance of the data distribution. b Calculation of the 2 projection vectors for the example with 3 classes
Figure 1.11 shows the application of Fisher’s discriminant analysis (FDA) for two
datasets with 2 features but with 2 and 3 classes. Figure (a) also shows the prin-
cipal component of the PCA applied for the same 2-class dataset. As previously
highlighted, PCA tends to project data in the direction of maximum variance which
is useful for concentrating data information on a few possible components while
it is less useful for separating classes. FDA on the other hand determines the pro-
jection vectors where the data are better separated and therefore more useful for
classification.
Let us now analyze some limitations of the LDA. A first aspect concerns the
the dimensionality reduction, which is limited to C − 1 components, unlike PCA, which can reduce the dimensionality down to a single feature. For complex data, not even the best one-dimensional projection can separate the samples of different classes. Similarly
to PCA, if the class distributions are very spread out, the classes can overlap substantially on any projection line, even with a large J(v) value. LDA is in fact a parametric approach, in that it assumes an essentially Gaussian and unimodal distribution of the samples.
For the classification problem, if the distributions are significantly non-Gaussian, the
LDA projections will not be able to correctly separate complex data (see Fig. 1.8c).
In literature, there are several variants of LDA [9,10] (nonparametric, orthonormal,
generalized, and in combination with neural networks).
If there are several minimum candidates, the pattern x is assigned to the class ωr
corresponding to the first r-th found. This classifier can be considered as a special
case of a classifier based on discriminating functions. In fact, if we consider the
Euclidean distance Di between the generic pattern x and the prototype pi , we have
that this pattern is assigned to the class ω_i which satisfies the relation D_i < D_j for all j ≠ i. Finding the minimum of D_i is equivalent to finding the minimum of D_i^2 (the distances being positive), for which we have
D_i^2 = \|x − p_i\|^2 = (x − p_i)^T (x − p_i) = x^T x − 2\left( x^T p_i − \frac{1}{2}\, p_i^T p_i \right)    (1.45)
where, in the final expression, the term x^T x can be neglected, being independent of the index i, and the problem reduces to maximizing the expression in parentheses. It follows that we can express the classifier in terms of the following discriminant function:
d_i(x) = x^T p_i − \frac{1}{2}\, p_i^T p_i \qquad i = 1, . . . , R    (1.46)
and the generic pattern x is assigned to the class ω_i if d_i(x) > d_j(x) for each j ≠ i. The discriminant functions d_i(x) are linear, expressed in the form d_i(x) = w_i^T x (1.47), where x is given in the form of an augmented vector (x_1, . . . , x_M, 1)^T, while the weights w_i = (w_{i1}, . . . , w_{iM}, w_{i,M+1}) are determined as follows:
w_{ij} = p_{ij} \qquad w_{i,M+1} = −\frac{1}{2}\, p_i^T p_i \qquad i = 1, . . . , R;\; j = 1, . . . , M    (1.48)
It can be shown that the discriminating surface that separates each pair of prototype
patterns pi and p j is the hyperplane that bisects perpendicularly the segment joining
the two prototypes. Figure 1.12 shows an example of minimum distance classification
for a single prototype with three classes. If the prototype patterns p_i coincide with the class mean patterns μ_i, we have a minimum distance from the mean classifier.
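A minimal sketch of the minimum distance classifier of Eq. (1.46) follows (Python with NumPy, not part of the original text; the prototypes are hypothetical):

import numpy as np

def min_distance_classify(x, prototypes):
    """Assign x to the class maximizing d_i(x) = x^T p_i - 0.5 p_i^T p_i (Eq. 1.46)."""
    x = np.asarray(x, dtype=float)
    scores = [x @ p - 0.5 * p @ p for p in prototypes]
    return int(np.argmax(scores)) + 1    # class index omega_i (1-based)

# Hypothetical class prototypes p_i (e.g., class mean vectors).
prototypes = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
print(min_distance_classify([3.2, 0.5], prototypes))   # -> 2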
As before, also in this case, the discriminant function is determined for the generic
class ωi as follows:
d_i(x) = \max_{k=1,...,n_i} d_i^{(k)}(x)    (1.50)

d_i^{(k)}(x) = x^T p_i^{(k)} − \frac{1}{2}\,(p_i^{(k)})^T p_i^{(k)} \qquad i = 1, . . . , R;\; k = 1, . . . , n_i    (1.51)
The pattern x is assigned to the class ω_i for which the discriminant function d_i(x) assumes the maximum value, i.e., d_i(x) > d_j(x) for each j ≠ i. In other words, the pattern x is assigned to the class ω_i that has the closest pattern prototype. The linear discriminant functions given by (1.51) partition the feature space into \sum_{i=1}^{R} n_i regions, known in the literature as the Dirichlet tessellation.6
(a) Among the n sample-class pairs (p_i, z_i), determine the k samples closest to the pattern x to be classified (always measuring the distance with an appropriate metric).
(b) The class to assign to x is the most representative class (the most voted class),
that is, the class that has the greatest number of samples among the nearest k
found.
With the classifier k-NN, the probability of erroneous attribution of the class is
reduced. Obviously the choice of k must be adequate. A high value reduces the
sensitivity to data noise, while a very low value reduces the possibility of extending
the concept of proximity to the domain of other classes. Finally, it should be noted
that, as k increases, the error probability of the k-NN classifier approaches the error probability of the Bayes classifier, which will be described in the following paragraphs.
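A compact k-NN sketch follows (Python with NumPy, not part of the original text; the labeled training samples are hypothetical):

import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """k-NN rule: the k samples nearest to x (Euclidean metric) vote for the class."""
    d = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training samples.
samples = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([1, 1, 1, 2, 2, 2])
print(knn_classify(np.array([0.8, 0.7]), samples, labels, k=3))   # -> 1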
The K-means method [11] is also known as C-means clustering, applied in different
contexts, including the compression of images and vocal signals, and the recognition
of thematic areas for satellite images. Compared to the previous supervised classifiers,
6 Also called the Voronoi diagram (from the name of Georgij Voronoi), it is a particular type of
decomposition of a metric space, determined by the distances with respect to a given finite set of
space points. For example, in the plane, given a finite set of points S, the Dirichlet tessellation for
S is the partition of the plane that associates a region R( p) to each point p ∈ S, so such that, all
points of R( p) are closer to p than to any other point in S.
K-means does not have a priori knowledge of the patterns to be classified. The only
information available is the number of classes k in which to group the patterns.
So far, we have adopted a clustering criterion based on the minimum Euclidean
distance to establish a similarity7 measure between two patterns to decide whether
they are elements of the same class or not. Furthermore, this similarity measure can
be considered r elative by associating a thr eshold that defines the acceptability level
of a pattern as similar to another or belonging to another class. K-means introduces
a clustering criterion based on a performance index that minimizes the sum of the
squares of the distances between all the points of each cluster with respect to its own
cluster center.
Suppose we have available the dataset X = {x_i}_{i=1}^{N} consisting of N observations
of a physical phenomenon M-dimensional. The goal is to partition the dataset into
a number K of groups.8 Each partition group is represented by a prototype that
on average has an intra-class distance9 smaller than distances taken between the
prototype of the group and an observation belonging to another group (inter-class
distance). Then we represent with μk , a vector M-dimensional representing the
prototype of the k-th group (with k = 1, . . . , K ). In other words, the prototype
represents the center of the group. We are interested in finding the set of prototypes
of the X dataset with the aforementioned clustering criterion, so that the sum of the
squares of the distances of each observation xi with the nearest prototype is minimal.
We now introduce a notation to define the way to assign each observation to a
prototype with the following binary variable rik = 0, 1 indicating whether the i-th
observation belongs to the k-th group if rik = 1, or rik = 0 if it belongs to some
other group other than k. In general, we will have a matrix R of membership with
dimension N × K of the binary type which highlights whether the i-th observation
belongs to the k-th class. Suppose for now we have the K prototypes μ1 , μ2 , . . . , μ K
(later we will see how to calculate them analytically), and therefore we will say that
an observation x belongs to the class ωk if the following is satisfied:
\|x − μ_k\| = \min_{j=1,...,K} \|x − μ_j\| \;\Longrightarrow\; x ∈ ω_k    (1.53)
At this point, temporarily assigning all the dataset patterns to the K cluster, one can
evaluate the error that is made in electing the prototype of each group, introducing
a functional named distortion measure of the data or total reconstruction error with
the following function:
J = \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik}\, \|x_i − μ_k\|^2    (1.54)
Since a given observation x can belong to only one group, the R matrix has the
following property:
\sum_{k=1}^{K} r_{ik} = 1 \qquad ∀ i = 1, . . . , N    (1.56)

and

\sum_{k=1}^{K}\sum_{i=1}^{N} r_{ik} = N.    (1.57)
We now derive the update formulas for rik and μk in order to minimize the function
J . If we consider the optimization of μk with respect to rik fixed, we can see that
the function J in (1.54) is a quadratic function of μk , which can be minimized by
setting the first derivative to zero:
\frac{∂J}{∂μ_k} = 2\sum_{i=1}^{N} r_{ik}\,(x_i − μ_k) = 0    (1.58)
from which we obtain μ_k = \frac{\sum_{i=1}^{N} r_{ik}\, x_i}{\sum_{i=1}^{N} r_{ik}} (1.59). Note that the denominator of (1.59) represents the number of points assigned to the k-th cluster, i.e., μ_k is calculated as the average of the points that fall within the cluster. For this reason, the method is called K-means.
So far we have described the batch version of the algorithm, in which the whole dataset is used at once to update the prototypes, as described in Algorithm 1. A stochastic online version of the algorithm has also been proposed, with η_i the learning parameter that is monotonically decreased based on the number of observations composing the dataset. Figure 1.13 shows the result of the quantization, or classification, of color pixels. In particular, Fig. 1.13a gives the original image and the following ones show the result of the method for different numbers of prototypes K = 3, 5, 6, 7, 8. Each color indicates a particular cluster, so the value of the nearest prototype (representing the RGB color triple) has replaced the original pixel value. The computational load is O(KNt), where t indicates the number of iterations, K the number of clusters, and N the number of patterns to classify. In general, K, t ≪ N.
3: for x_i with i = 1, . . . , N do
4:    r_{ik} = 1 if k = \arg\min_j \|x_i − μ_j\|, and r_{ik} = 0 otherwise
5: end for
6: for μ_k with k = 1, 2, . . . , K do
7:    μ_k = \frac{\sum_{i=1}^{N} r_{ik}\, x_i}{\sum_{i=1}^{N} r_{ik}}
8: end for
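A NumPy sketch of this batch scheme follows (Python, not part of the original text; the synthetic dataset and initialization are hypothetical):

import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Batch K-means: alternate the assignment r_ik and the prototype update mu_k."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]                 # initial prototypes
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # N x K distances
        r = np.argmin(d, axis=1)                                 # hard assignment (Eq. 1.53)
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)                   # prototype = cluster mean
    J = np.sum(np.linalg.norm(X - mu[r], axis=1) ** 2)           # distortion (Eq. 1.54)
    return mu, r, J

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in ([0, 0], [3, 0], [0, 3])])
mu, r, J = kmeans(X, K=3)
print(mu.round(2), round(J, 2))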
Fig. 1.13 Classification of RGB pixels with the K-means method of the image in (a). In b for
K = 3; c for K = 5; d for K = 6; e for K = 7; and f for K = 8
same dataset of data, or the minimum distortion. It is often useful to proceed by trial
and error by varying the number of K classes. In general, this algorithm converges
in a dozen steps, although there is no rigorous proof of its convergence.
It is also influenced by the order in which the patterns are presented. Furthermore, it is sensitive to noise and outliers; in fact, a small number of the latter can substantially influence the average value. It is not suitable when the cluster distribution has non-convex geometric shapes. Another limitation is given by the membership variables, also called responsibility variables z_it, that assign the i-th datum to the cluster t in a hard, binary way. In the fuzzy C-means, but also in the mixture of Gaussians dealt with below, these variables are treated as soft, with values that vary between zero and one.
The ISODATA classifier has been applied with good results to multispectral images with a high number of bands. The heuristics adopted, which limit the poorly significant clusters and provide the ability to split dissimilar clusters and merge similar ones, make the classifier very flexible and effective. The problem remains that geometrically curved clusters are difficult to handle even with ISODATA. Obviously, the initial parameters must be carefully tuned, with several attempts, repeating the procedure multiple times. Like K-means, ISODATA does not guarantee convergence a priori, even if, in real applications, with clusters that do not overlap much, convergence is obtained after dozens of iterations.
The Fuzzy C-Means (FCM) classifier is the fuzzy version of the K-means and is characterized by the Fuzzy theory, which allows three conditions:
The fuzzy version of the K-means proposed by Bezdek [15], also known as Fuzzy ISODATA, differs from the previous one for the membership function. In this algorithm, each pattern x has a membership function r of the smooth type, i.e., it is not binary but defines the degree to which the datum belongs to each cluster. This algorithm partitions the dataset X = {x_i}_{i=1}^{N} of N observations into K fuzzy groups and finds the cluster centers in a similar way to K-means, so as to minimize the similarity cost function J. The partitioning of the dataset is therefore done in a fuzzy way, so that each datum x_i ∈ X has a membership value for each cluster between 0 and 1. Therefore, the membership matrix R is not binary, but has values between 0 and 1. In any case,
J = \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik}^{m}\, \|x_i − μ_k\|^2    (1.61)
Now, differentiating (1.62) with respect to μ_k and λ_i and setting the result to zero, we obtain the following:

μ_k = \frac{\sum_{i=1}^{N} r_{ik}^{m}\, x_i}{\sum_{i=1}^{N} r_{ik}^{m}},    (1.63)

and

r_{ik} = \frac{1}{\sum_{t=1}^{K} \left( \frac{\|x_i − μ_k\|}{\|x_i − μ_t\|} \right)^{2/(m−1)}} \qquad k = 1, . . . , K;\; i = 1, . . . , N    (1.64)
In the batch version, the algorithm is reported in Algorithm 2. We observe the iter-
ativity of the algorithm that alternately determines the centroids μk of the clusters
and the memberships rik until convergence.
It should be noted that, if the exponent m = 1 in the objective function (1.61), the fuzzy C-means algorithm approaches the hard K-means algorithm, since the membership levels of the patterns to the clusters produced by the algorithm become 0 and 1. At the extreme value m → ∞, the objective function has value J → 0. Normally m is chosen equal to 2. The FCM classifier is often applied, in particular, to the classification of multispectral images. However, performance remains limited by the intrinsic geometry of the clusters. As with K-means, for FCM too an elongated or curved grouping of the patterns in the feature space can produce unrealistic results.
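A sketch of the alternating updates (1.63) and (1.64) follows (Python with NumPy, not part of the original text; the synthetic data and random initialization are hypothetical):

import numpy as np

def fuzzy_cmeans(X, K, m=2.0, n_iter=50, seed=0):
    """Fuzzy C-means: alternate centroid update (Eq. 1.63) and soft memberships (Eq. 1.64)."""
    rng = np.random.default_rng(seed)
    R = rng.random((len(X), K))
    R /= R.sum(axis=1, keepdims=True)            # memberships sum to 1 per pattern
    for _ in range(n_iter):
        W = R ** m
        mu = (W.T @ X) / W.sum(axis=0)[:, None]  # Eq. (1.63)
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) + 1e-12
        R = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)  # Eq. (1.64)
    return mu, R

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in ([0, 0], [4, 0])])
mu, R = fuzzy_cmeans(X, K=2)
print(mu.round(2))        # cluster centers
print(R[0].round(2))      # soft membership of the first pattern to the two clusters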
The statistical approach, in analogy to the deterministic one, uses a set of decision
rules based, however, on statistical theory. In particular, the discriminating functions
can be constructed by estimating the density functions and applying the Bayes rules.
In this case, the proposed classifiers are of the parametric type extracting information
directly from the observations.
This rule assigns all the patterns to a single class, namely the class with the highest a priori probability. This rule makes sense if the a priori probabilities of the classes are very different from each other, that is, p(ω_i) ≫ p(ω_k). We can now assume to know, for each class ω_i, an adequate number of sample patterns x, from which we can evaluate the conditional probability distribution p(x|ω_i) of x given the class ω_i, that is, estimate the probability density of x assuming it belongs to the class ω_i. At this point, it is possible to adopt a probabilistic decision rule to associate a generic pattern x with a class ω_i in terms of conditional probability: the probability p(ω_i|x) of the class ω_i given the generic pattern x must be greater than that of all the other classes. In other words, the generic pattern x is assigned to the class ω_i if the following condition is satisfied:
The probability p(ω_i|x) is known as the posterior probability of the class ω_i given x, that is, the probability that, having observed the generic pattern x, the class to which it belongs is ω_i. Applying the Bayes theorem,11 it is computed as
p(ω_i|x) = \frac{p(x|ω_i)\, p(ω_i)}{p(x)}    (1.67)
where
(a) ωi is the class, not known, to be estimated, to associate it with the observed
pattern x;
(b) p(ω_i) is the a priori probability of the class ω_i, that is, it represents the prior knowledge on which the classification is based (the classes can also be equiprobable);
(c) p(x|ω_i) is the conditional probability density function of the class, interpreted as the likelihood of the pattern x occurring when its features are known to belong to the class ω_i;
11 The Bayes theorem can be derived from the definition of conditional probability and the total
probability theorem. If A and B are two events, the probability of the event A when the event B has
already occurred is given by
p(A|B) = \frac{p(A ∩ B)}{p(B)} \qquad \text{if } p(B) > 0
and is called conditional probability of A conditioned on B or simply probability of A given B.
The denominator p(B) simply normalizes the joint probability p(A, B) of the events that occur
together with B. If we consider the space S of the events partitioned into B1 , . . . , B K , any event A
can be represented as
A = A ∩ S = A ∩ (B_1 ∪ B_2 ∪ · · · ∪ B_K) = (A ∩ B_1) ∪ (A ∩ B_2) ∪ · · · ∪ (A ∩ B_K).
and replacing the conditional probabilities, the total probability of any event A is given by

p(A) = p(A|B_1)p(B_1) + · · · + p(A|B_K)p(B_K) = \sum_{k=1}^{K} p(A|B_k)\, p(B_k)
By combining the definition of conditional probability and the total probability theorem, we obtain the probability of the event B_i, supposing that the event A has occurred, as

p(B_i|A) = \frac{p(A|B_i)\, p(B_i)}{\sum_{k=1}^{K} p(A|B_k)\, p(B_k)}

known as the Bayes Rule or Theorem, which represents one of the most important relations in the
field of statistics.
(d) p(x) is known as the evidence, i.e., the absolute probability density given by

p(x) = \sum_{k=1}^{K} p(x|ω_k)\, p(ω_k) \quad \text{with} \quad \sum_{k=1}^{K} p(ω_k) = 1    (1.68)
which represents a normalization constant and does not influence the decision.
which, up to a constant factor, corresponds to the value of the posterior probability p(ω_k|x), which expresses how often a pattern x belongs to the class ω_k. The (1.65) can, therefore, be rewritten in terms of the optimal rule for classifying the generic pattern x: it is associated with the class ω_k if the posterior probability p(ω_k|x) is the highest of all possible a posteriori probabilities:

x ∈ ω_k \quad \text{if} \quad p(ω_k|x) > p(ω_j|x) \quad ∀ j ≠ k    (1.70)

known as the maximum a posteriori (MAP) probability decision rule, also known as the Bayes optimal rule for the minimum classification error.
The MAP decision rule12 can be re-expressed in another form. For simplicity, we consider the rule for two classes ω_1 and ω_2, and we apply the rule defined by (1.66), which assigns a generic pattern x to the class with the highest posterior probability. In this case, applying the Bayes rule (1.67) to (1.66), and eliminating the common term p(x), we would have

p(x|ω_1)\, p(ω_1) > p(x|ω_2)\, p(ω_2)    (1.71)

which assigns x to the class ω_1 if satisfied, otherwise to the class ω_2. This last relationship can be rewritten as follows:
ℓ(x) = \frac{p(x|ω_1)}{p(x|ω_2)} > \frac{p(ω_2)}{p(ω_1)}    (1.72)

which assigns x to the class ω_1 if satisfied, otherwise to the class ω_2. ℓ(x) is called the likelihood ratio and the corresponding decision rule is known as the likelihood test. We
observe that in the likelihood test, the evidence p(x) does not appear (while it is
necessary for the MAP rule for the calculation of the posterior probability p(ωk |x)),
since it is a constant not influenced by the class ω_k. The likelihood test is, in effect, a test that decides the assignment by comparing the ratio of the likelihoods (the conditional densities, i.e., the a priori knowledge) with the ratio of the a priori probabilities. If the latter p(ω_k) are equiprobable, the test is performed only by comparing the likelihoods p(x|ω_k), thus becoming the Maximum Likelihood (ML) rule. This last rule is also used when the p(ω_k) are not known.
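A compact sketch of the MAP and likelihood-ratio decisions follows (Python with NumPy, not part of the original text; the numerical values are purely illustrative):

import numpy as np

def map_rule(likelihoods, priors):
    """MAP rule (1.70): pick the class maximizing p(x|omega_k) p(omega_k)."""
    return int(np.argmax(np.asarray(likelihoods) * np.asarray(priors))) + 1

def likelihood_ratio_test(p_x_w1, p_x_w2, p_w1, p_w2):
    """Likelihood test (1.72): assign to omega_1 if l(x) = p(x|w1)/p(x|w2) > p(w2)/p(w1)."""
    return 1 if (p_x_w1 / p_x_w2) > (p_w2 / p_w1) else 2

# Hypothetical values for one observed pattern x; both rules give the same decision.
print(map_rule([0.22, 0.14], [0.26, 0.74]))              # -> 2
print(likelihood_ratio_test(0.22, 0.14, 0.26, 0.74))     # -> 2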
The decision rule (1.71) can also be expressed in geometric terms by defining the decision regions. Figure 1.14 shows the decision regions Ω_1 and Ω_2 for the separation of two classes, assuming that the classes ω_1 and ω_2 both have Gaussian distribution. In the figure, the graphs of p(x|ω_i) p(ω_i), i = 1, 2 are displayed, with different a priori probabilities p(ω_i). The theoretical boundary between the two regions is determined by

p(x|ω_1)\, p(ω_1) = p(x|ω_2)\, p(ω_2)

In the figure, the boundary corresponds to the point of intersection of the two Gaussians. Alternatively, the boundary can be determined by calculating the likelihood ratio ℓ(x) and setting the threshold θ = p(ω_2)/p(ω_1). Therefore, with the likelihood test, the decision regions are obtained by comparing ℓ(x) with this threshold.
Fig. 1.15 Example of the nonparametric Bayes classifier that classifies 2 types of territory (land
and river) in the spectral domain starting from the training sets extracted from two bands (x1 , x2 )
of a multispectral image
multispectral image. In the 2D spectral domain, the training set samples of which
we know the membership class are projected. A generic pixel pattern with spectral
measurements x = (x1 , x2 ) (in the figure indicated with the circle) is projected in
the features domain and associated with one of the classes using the nonparametric
MAP classifier. From the training sets, we can have a very rough estimate of the a
priori probabilities p(ωi ), of the likelihoods p(x|ωi ), and of the evidence p(x), as
follows:
p(ω_1) = \frac{n_{ω_1}}{N} = \frac{18}{68} = 0.26 \qquad p(ω_2) = \frac{n_{ω_2}}{N} = \frac{50}{68} = 0.74

p(x|ω_1) = \frac{ni_{ω_1}}{n_{ω_1}} = \frac{4}{18} = 0.22 \qquad p(x|ω_2) = \frac{ni_{ω_2}}{n_{ω_2}} = \frac{7}{50} = 0.14

p(x) = \sum_{i=1}^{2} p(x|ω_i)\, p(ω_i) = 0.22 × 0.26 + 0.14 × 0.74 = 0.1608
where n ω1 and n ω2 indicate the number of samples in the training sets belonging,
respectively, to the earth and water class, ni ω1 and ni ω2 indicate the number of
samples belonging, respectively, to the earth and water class found in the window
centered in x the pattern to classify.
Applying the Bayes rule (1.67), we obtain the posterior probabilities:
p(ω_1|x) = \frac{p(ω_1)\, p(x|ω_1)}{p(x)} = \frac{0.26 × 0.22}{0.1608} = 0.36 \qquad p(ω_2|x) = \frac{p(ω_2)\, p(x|ω_2)}{p(x)} = \frac{0.74 × 0.14}{0.1608} = 0.64
For the MAP decision rule (1.70), the pattern x is assigned to the class ω2 (water
zone).
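The computation of this example can be reproduced with a few lines of Python (not part of the original text; the counts are those reported above):

# Numerical check of the land/river example above (values from the text).
p_w = {'land': 18/68, 'river': 50/68}               # a priori probabilities
p_x_given_w = {'land': 4/18, 'river': 7/50}         # likelihoods in the window around x
p_x = sum(p_x_given_w[c] * p_w[c] for c in p_w)     # evidence, about 0.16
posterior = {c: p_x_given_w[c] * p_w[c] / p_x for c in p_w}
print(round(p_x, 4), {c: round(v, 2) for c, v in posterior.items()})
# The posterior of the river class (about 0.64) is the largest: x is assigned to omega_2.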
p(error) = \sum_{k=1}^{K} p(error|ω_k)\, p(ω_k)    (1.73)
With C[Ω_k] we indicate the complement of the region Ω_k, that is, C[Ω_k] = \bigcup_{j=1,\, j≠k}^{K} Ω_j. That being said, we can rewrite the probability of incorrect classification of a pattern in the following form:

p(error) = \sum_{k=1}^{K} \int_{C[Ω_k]} p(x|ω_k)\, p(ω_k)\, dx = \sum_{k=1}^{K} p(ω_k) \left[ 1 − \int_{Ω_k} p(x|ω_k)\, dx \right] = 1 − \sum_{k=1}^{K} p(ω_k) \int_{Ω_k} p(x|ω_k)\, dx    (1.75)
From which it is observed that the minimization of the error is equivalent to maxi-
mizing the probability of correct classification given by
\sum_{k=1}^{K} p(ω_k) \int_{Ω_k} p(x|ω_k)\, dx    (1.76)
This goal is achieved by maximizing the integral of (1.76), which is equivalent to choosing the decision regions Ω_k for which p(ω_k) p(x|ω_k) is the highest value over all regions, exactly as imposed by the MAP rule (1.70). This ensures that the MAP rule minimizes the probability of error.
It is observed (see Fig. 1.16) how the decision region translates with respect to
the point of equal probability of likelihood p(x|ω1 ) = p(x|ω2 ) for different values
of the a priori probability.
Fig. 1.16 Elements that characterize the probability of error by considering the conditional density
functions of the classes with normal distribution of equal variance and unequal a priori probability.
The blue area corresponds to the probability of error in assigning a pattern of the class ω_1 (lying in the region Ω_1) to the class ω_2. The area in red represents the opposite situation
R(α_i|x) = \sum_{j=1}^{K} C(α_i|ω_j)\, p(ω_j|x) \qquad i = 1, . . . , a    (1.77)
The conditional risk R, considering the zero-one cost function, is defined by

C(α_i|ω_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i ≠ j \end{cases} \qquad i, j = 1, . . . , K    (1.78)
from which it can be deduced that we can minimize the conditional risk by selecting the action that minimizes R(α_i|x) to classify the observed pattern x. It follows that we need to find a decision rule α(x) that maps the feature space into the action space, and calculate the overall risk R_T given by

R_T = \sum_{i=1}^{K} \int_{Ω_i} \sum_{j=1}^{K} C(α_i|ω_j)\, p(ω_j|x)\, p(x)\, dx    (1.80)
which will be minimal by selecting αi for which R(αi |x) is minimum for all x. The
Bayes rule guarantees overall risk minimization by selecting the action α ∗ which
minimizes the conditional risk (1.77):
α^* = \arg\min_{α_i} R(α_i|x) = \arg\min_{α_i} \sum_{j=1}^{K} C(α_i|ω_j)\, p(ω_j|x)    (1.81)
thus obtaining the Bayes Risk which is the best achievable result.
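A minimal sketch of the minimum-risk decision (1.77)-(1.81) follows (Python with NumPy, not part of the original text; the cost matrix and posteriors are hypothetical):

import numpy as np

def bayes_min_risk_action(posteriors, cost):
    """Pick the action minimizing R(alpha_i|x) = sum_j C(alpha_i|omega_j) p(omega_j|x)."""
    risks = cost @ np.asarray(posteriors)     # cost[i, j] = C(alpha_i | omega_j)
    return int(np.argmin(risks)) + 1, risks

# Hypothetical 2-class example: misclassifying omega_2 as omega_1 is 5 times costlier.
cost = np.array([[0.0, 5.0],
                 [1.0, 0.0]])
posteriors = [0.7, 0.3]
action, risks = bayes_min_risk_action(posteriors, cost)
print(action, risks)   # risks = [1.5, 0.7] -> action alpha_2 despite p(omega_1|x) = 0.7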
Let us now calculate the minimum risk for an example of binary classification. Let α_1 be the action of deciding that the correct class is ω_1, and similarly α_2 for ω_2. We evaluate the conditional risks with (1.77) written out for the two actions,
highlighting that the posterior probability is scaled by the cost differences (normally positive). Applying the Bayes rule to the latter (remembering (1.71)), we decide for ω_1 if

(C_{21} − C_{11})\, p(x|ω_1)\, p(ω_1) > (C_{12} − C_{22})\, p(x|ω_2)\, p(ω_2)    (1.82)
Assuming that C_{21} > C_{11} and remembering the definition of the likelihood ratio expressed by (1.72), the previous Bayes rule can be rewritten as a likelihood test with threshold ℓ(x) > \frac{(C_{12} − C_{22})\, p(ω_2)}{(C_{21} − C_{11})\, p(ω_1)}.
It is shown that the threshold t to be chosen to carry out the rejection must be t < \frac{K−1}{K}, where K is the number of classes. In fact, if the classes are equiprobable, the minimum value reachable by \max_i p(ω_i|x) is 1/K, because the following relation must be satisfied:

1 = \sum_{i=1}^{K} p(ω_i|x) ≤ K \max_{i=1,...,K} p(ω_i|x)    (1.86)
Figure 1.18 shows the rejection region Ω_0 associated with a threshold t for the two Gaussian classes of Fig. 1.16. The patterns that fall into the Ω_1 and Ω_2 regions are regularly classified with the Bayes rule. It is observed that the value of the threshold t strongly influences the size of the rejection region.
For a given threshold t, the probability of correct classification c(t) is given by (1.76), considering only the acceptance regions (Ω_0 is excluded):

c(t) = \sum_{k=1}^{K} p(ω_k) \int_{Ω_k} p(x|ω_k)\, dx    (1.87)
The unconditional probability of rejection r(t), that is, the probability that a pattern falls into the region Ω_0, is given by

r(t) = \int_{Ω_0} p(x)\, dx    (1.88)
The value of the error e(t) associated with the probability of accepting to classify a
pattern and classifying it incorrectly is given by
e(t) = \sum_{k=1}^{K} \int_{Ω_k} \left[ 1 − \max_i p(ω_i|x) \right] p(x)\, dx = 1 − c(t) − r(t)    (1.89)
From this relation, it is evident that a given value of correct classification c(t) = 1 − r(t) − e(t) can be obtained by choosing to reduce the error e(t) while simultaneously increasing the rejection rate r(t), that is, by balancing the error-rejection trade-off, the two being inversely related.
If a Ci j cost is considered even in the assignment of a pattern to the rejected class
ω0 (normally lower than the wrong classification one), the cost function is modified
as follows:
C_{ij} = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i ≠ j \\ t & \text{if } i = 0 \text{ (rejection class } ω_0) \end{cases} \qquad i = 0, . . . , K;\; j = 1, . . . , K    (1.90)
In [16], the following decision rule (see Fig. 1.18) with optimal rejection α(x) is
demonstrated, which is also the minimum risk rule if the cost function is uniform
within each decision class:
α(x) = \begin{cases} ω_i & \text{if } (p(ω_i|x) > p(ω_j|x)) ∧ (p(ω_i|x) > t) \quad ∀ j ≠ i \\ ω_0 & \text{otherwise (reject)} \end{cases}    (1.91)
where the rejection threshold t is expressed according to the cost of error e, the cost
of rejection r , and the cost of correct classification c, as follows:
t = \frac{e − r}{e − c}    (1.92)
where with c ≤ r it is guaranteed that t ∈ [0, 1], while if e = r we fall back to the Bayes rule. In essence, Chow's rejection rule attempts to reduce the error by rejecting the border patterns between regions whose classification is uncertain.
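A minimal sketch of the rule with rejection (1.91) follows (Python with NumPy, not part of the original text; the posterior values are hypothetical):

import numpy as np

def chow_rule(posteriors, t):
    """Decision with rejection (Eq. 1.91): accept the MAP class only if its posterior exceeds t."""
    posteriors = np.asarray(posteriors)
    i = int(np.argmax(posteriors))
    return i + 1 if posteriors[i] > t else 0    # 0 stands for the rejection class omega_0

print(chow_rule([0.55, 0.45], t=0.7))   # -> 0 (rejected: too uncertain)
print(chow_rule([0.92, 0.08], t=0.7))   # -> 1 (accepted)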
The Bayes criteria described, based on the MAP decider or maximum likelihood
ML, need to define the cost values Ci j and know the probabilities a priori p(ωi ).
In applications where this information is not known, in the literature [17,18], the
decision criterion Minimax and that of Neyman–Pearson are proposed. The Minimax
criterion is used in applications where the recognition system must guarantee good
behavior over a range of possible values rather than for a given priori probability
value. In these cases, although the a priori probability is not known, its variability
can be known for a given interval. The strategy used in these cases is to minimize
the maximum value of the risk by varying the prior probability.
The Neyman–Pearson criterion is used in applications where there is a need to limit
the probability of error within a class instead of optimizing the overall conditional
risk as with the Bayes criterion. For example, we want to fix a certain attention on
the probability of error associated with a false alarm and minimize the probability
of failure to alarm as required in radar applications. This criterion evaluates the
probability of error ε_1 in classifying patterns of ω_1 in the class ω_2 and, vice versa, the probability of error ε_2 for patterns of the class ω_2 attributed to the class ω_1.
The strategy of this criterion is to minimize the error on the class ω_1, that is, to find the minimum of ε_1 = \int_{Ω_2} p(x|ω_1)\, dx, while constraining the other error to remain below a value α, that is, ε_2 = \int_{Ω_1} p(x|ω_2)\, dx < α. The criterion is set as a constrained optimization problem, whose solution is obtained with the Lagrange multipliers approach, minimizing the objective function:

F = ε_1 + λ(ε_2 − α)

Note the absence of the p(ω_i) and of the costs C_{ij}, while the decision regions Ω_i are to be determined by the minimization procedure.
The Bayes decision rule (1.67) requires knowledge of all the conditional probabil-
ities of the classes and the a priori probabilities. The functional form and exact parameters of these density functions are rarely available. Once the nature of the observed patterns
is known, it is possible to hypothesize a parametric model for the probability den-
sity functions and estimate the parameters of this model through sample patterns.
Therefore, an approach used to estimate conditional probabilities p(x|ωi ) is based
on a training set of patterns Pi = {x1 , . . . , xn i } xi j ∈ Rd associated with the class
ωi . In the parametric context, we assume the form (for example, Gaussian) of the
probability distribution of the classes and the unknown parameters θi that describe
it. The estimation of the parameters θk , k = 1, . . . , n p (for example in the Gaussian
form are θ1 = μ; θ2 = σ and p(x) = N (μ, σ )) can be done with known approaches
of maximum likelihood or Bayesian estimation.
The form of the hypothesized model (e.g., Gaussian) is assumed known, but its parameters (for example, mean and variance) are to be determined (they represent the unknowns). The estimation of the parameters can be influenced by the choice
of the training sets and an optimum result is obtained using a significant number
of samples. With the MLE method, the goal is to estimate the parameters θ̂_i that maximize the likelihood function p(x|ω_i) = p(x|θ_i) defined using the training set P_i:

θ̂_i = \arg\max_{θ_i} p(P_i|θ_i) = \arg\max_{θ_i} p(x_1, . . . , x_{n_i}|θ_i)    (1.93)
If we assume that the patterns of the training set Pi = {x1 , . . . , xn i } form a sequence
of variables random independent and identically distributed (iid),13 the likelihood
function p(Pi |θi ) associated with class ωi can be expressed as follows:
p(P_i|θ_i) = \prod_{k=1}^{n_i} p(x_k|θ_i)    (1.94)
The logarithmic function has the property of being monotonically increasing and, in addition, it allows expressing (1.93) in terms of sums instead of products, thus simplifying the procedure of finding the maximum, especially when the probability function model has exponential terms, as happens with the assumption of a Gaussian distribution. Given the independence of the training sets P_i = {x_1, . . . , x_{n_i}} of patterns associated with the K classes ω_i, i = 1, . . . , K, we will omit the index i indicating the class when estimating the related parameters θ_i. In essence, the parameter estimation procedure is repeated independently for each class.
13 Implies that the patterns all have the same probability distribution and are all statistically inde-
pendent.
The maximum likelihood value of the function for the sample patterns P is obtained by differentiating with respect to the parameter θ and setting the result to zero:

\frac{∂}{∂θ} \sum_{k=1}^{n} \log p(x_k|θ) = \sum_{k=1}^{n} Σ^{-1}(x_k − μ) = 0    (1.97)
μ̂ = \frac{1}{n}\sum_{k=1}^{n} x_k    (1.98)
It is observed that the estimate of the mean (1.98) obtained with the MLE approach
leads to the same result of the mean calculated in the traditional way with the average
of the training set patterns.
\sum_{k=1}^{n} \frac{1}{θ̂_2}(x_k − θ̂_1) = 0 \qquad −\sum_{k=1}^{n} \frac{1}{2θ̂_2} + \sum_{k=1}^{n} \frac{(x_k − θ̂_1)^2}{2θ̂_2^2} = 0
from which, solving for θ̂_1 and θ̂_2, we get the estimates of μ and σ^2, respectively, as follows:

θ̂_1 = μ̂ = \frac{1}{n}\sum_{k=1}^{n} x_k \qquad θ̂_2 = σ̂^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k − μ̂)^2    (1.100)
The expressions (1.100) MLE estimate of variance and mean correspond to the
traditional variance and mean calculated on training set patterns. Similarly, it can
be shown [17] that the MLE estimates, for a multivariate Gaussian distribution in d dimensions, are the traditional mean vector μ and the covariance matrix Σ, given by

μ̂ = \frac{1}{n}\sum_{k=1}^{n} x_k \qquad Σ̂ = \frac{1}{n}\sum_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^T    (1.101)
from which it results that the estimated mean is not distorted (unbiased), while for
the estimated variance with MLE, we have
E[σ̂^2] = E\left[ \frac{1}{n}\sum_{k=1}^{n} (x_k − μ̂)^2 \right] = \frac{n−1}{n}\, σ^2 ≠ σ^2    (1.103)
from which it emerges that the variance is distorted (biased). It is shown that the
magnitude of a distorted estimate is related to the number of samples considered,
for n → ∞ asymptotically the bias is zero. A simple estimate unbiased for the
covariance matrix is given by
Σ̂_U = \frac{1}{n−1}\sum_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^T    (1.104)
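A short numerical sketch of the MLE estimates (1.101) and the unbiased covariance (1.104) follows (Python with NumPy, not part of the original text; the synthetic Gaussian data are hypothetical):

import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

n = len(X)
mu_hat = X.mean(axis=0)                        # MLE of the mean (Eq. 1.101)
D = X - mu_hat
sigma_mle = (D.T @ D) / n                      # biased MLE covariance (Eq. 1.101)
sigma_unbiased = (D.T @ D) / (n - 1)           # unbiased estimate (Eq. 1.104)

# The bias factor (n-1)/n vanishes as n grows (Eq. 1.103).
print(mu_hat.round(2))
print(np.allclose(sigma_mle * n / (n - 1), sigma_unbiased))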
The starting conditions with the Bayesian approach are identical to those of the max-
imum likelihood, i.e., from the training set of pattern P = {x1 , . . . , xn } x j ∈ Rd
associated with the generic class ω, we assume the form (for example, Gaussian) of
the probability distribution and the unknown parameter vector θ describing it. With
the Bayesian estimate (also known as Bayesian learning), θ is assumed as a ran-
dom variable whose a priori probability distribution p(θ ) is known and intrinsically
contained in the training set P.
The goal is to derive the a posteriori probability distribution p(θ |x, P) from the
training set of patterns of the class ω. Having said this, the formula of Bayes theorem
(1.67) is rewritten as follows:
p(ω|x, P) = p(θ̂|x, P) = \frac{p(x|θ̂, P)\, p(ω)}{p(x)}    (1.105)
and by the total probability theorem, we can calculate the conditional density function p(x|P) (as close as possible to p(x)) by integrating the joint probability density p(x, θ|P) over the variable θ:

p(x|P) = \int p(x|θ)\, p(θ|P)\, dθ    (1.106)
where integration is extended over the entire parametric domain. With the (1.106),
we have a relationship between the conditional probability of the class with the
parametric conditional probability of the class (whose form is known) and the pos-
terior probability p(θ|P) of the variable θ to be estimated. With the Bayes theorem, it is possible to express the posterior probability p(θ|P) as follows:

p(θ|P) = \frac{p(P|θ)\, p(θ)}{p(P)}    (1.107)
Assuming that the patterns of the training set P form a sequence of independent
and identically distributed (iid) random variables, the likelihood probability function
p(P|θ) of the (1.107) can be calculated with the product of the conditional probability
densities of the class ω:
p(P|θ) = \prod_{k=1}^{n} p(x_k|θ)    (1.108)
p(μ) = \frac{1}{\sqrt{2π}\,σ_0}\, e^{−\frac{1}{2σ_0^2}(μ − μ_0)^2}    (1.109)
Applying the Bayes rule (1.107), we can calculate the posterior probability p(μ|P):

p(μ|P) = \frac{p(P|μ)\, p(μ)}{p(P)} = \frac{p(μ)}{p(P)} \prod_{k=1}^{n} p(x_k|μ) = \frac{1}{\sqrt{2π}\,σ_0} e^{−\frac{1}{2σ_0^2}(μ−μ_0)^2}\, \frac{1}{p(P)} \prod_{k=1}^{n} \frac{1}{\sqrt{2π}\,σ} e^{−\frac{1}{2σ^2}(x_k−μ)^2}    (1.110)
We observe that the posterior probability p(μ|P) depends on the a priori probability
p(μ) and therefore from the training set of the selected patterns P. This dependence
influences the Bayesian estimate, that is, the value of p(μ|P), observable with the
increment n of the training set samples. The maximum of p(μ|P) is obtained by computing the partial derivative of the logarithm of (1.110) with respect to μ, that is, ∂/∂μ log p(μ|P), and setting it equal to zero, we have
\frac{∂}{∂μ}\left[ −\frac{1}{2σ_0^2}(μ − μ_0)^2 + \sum_{k=1}^{n} −\frac{1}{2σ^2}(x_k − μ)^2 \right] = 0    (1.111)
μ_n = \underbrace{\frac{σ^2}{σ^2 + nσ_0^2}\, μ_0}_{\text{initial estimate}} + \underbrace{\frac{nσ_0^2}{σ^2 + nσ_0^2}\, \frac{1}{n}\sum_{k=1}^{n} x_k}_{\text{MLE estimate}}    (1.112)

\frac{1}{σ_n^2} = \frac{n}{σ^2} + \frac{1}{σ_0^2} \;\Longrightarrow\; σ_n^2 = \frac{σ_0^2\, σ^2}{nσ_0^2 + σ^2}    (1.113)
from which it emerges that the posterior variance of μ, σ_n^2, tends to zero as 1/n for n → ∞. In other words, with the posterior probability p(μ|P) calculated with (1.110), we get the best estimate μ_n of μ starting from the training set of n observed patterns, while σ_n^2 represents the uncertainty of μ, i.e., its posterior variance.
Figure 1.19 shows how Bayesian learning works: as the number of samples in the training set increases, p(μ|P) develops an increasingly accentuated and narrow peak around the true value of the mean μ. The extension to the multivariate case [18] of the Bayesian estimate for a Gaussian distribution with unknown mean μ and known covariance matrix Σ is more complex, as is the calculation of the estimates of the mean and of the covariance matrix when both are unknown for a normal distribution [17].
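The univariate updates (1.112)-(1.113) can be sketched as follows (Python with NumPy, not part of the original text; the prior parameters and synthetic data are hypothetical):

import numpy as np

def bayes_update_mean(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) of the mean for known variance (Eqs. 1.112-1.113)."""
    n = len(x)
    xbar = np.mean(x)
    mu_n = (sigma_sq / (sigma_sq + n * sigma0_sq)) * mu0 \
         + (n * sigma0_sq / (sigma_sq + n * sigma0_sq)) * xbar
    sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(6)
true_mu, sigma = 2.0, 1.0
for n in (1, 5, 10, 20, 100):
    x = rng.normal(true_mu, sigma, size=n)
    mu_n, var_n = bayes_update_mean(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=sigma**2)
    print(n, round(mu_n, 3), round(var_n, 4))   # mu_n -> true mean, sigma_n^2 -> 0 as 1/n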
Fig. 1.19 Bayesian learning of the mean of a Gaussian distribution with known variance starting
from a training set of patterns
θ = μ) the posterior density p(μ|P) given by the (1.110) and assumed with normal
distribution N (μn , σn2 ):
p(x|P) = \int p(x|μ)\, p(μ|P)\, dμ = \int \frac{1}{\sqrt{2π}\,σ} \exp\left[ −\frac{1}{2}\left(\frac{x − μ}{σ}\right)^2 \right] \frac{1}{\sqrt{2π}\,σ_n} \exp\left[ −\frac{1}{2}\left(\frac{μ − μ_n}{σ_n}\right)^2 \right] dμ    (1.114)

= \frac{1}{2π\,σ σ_n} \exp\left[ −\frac{1}{2}\,\frac{(x − μ_n)^2}{σ^2 + σ_n^2} \right] f(σ, σ_n)
where

f(σ, σ_n) = \int \exp\left[ −\frac{1}{2}\,\frac{σ^2 + σ_n^2}{σ^2 σ_n^2}\left( μ − \frac{σ_n^2 x + σ^2 μ_n}{σ^2 + σ_n^2} \right)^2 \right] dμ
We highlight that the density p(x|P), as a function of x, results with normal distri-
bution:
p(x|P) ∼ N (μn , σ 2 + σn2 ) (1.115)
being proportional to the expression exp[−(1/2)(x − μ_n)^2/(σ^2 + σ_n^2)]. In conclusion, to get the conditional density of the class p(x|P) = p(x|ω, P), with a known parametric form described by the normal distribution p(x|μ) ∼ N(μ, σ^2), the mean is replaced by μ_n and the variance σ^2 by σ^2 + σ_n^2. In other words, the value of μ_n is considered as the true mean, while the known initial variance σ^2, once the posterior density of the mean p(μ|P) has been calculated, is increased by σ_n^2 to account for the uncertainty about the significance of the training set due to the poor knowledge of the mean μ. This contrasts with the MLE approach, which gets a point estimate of the parameters μ̂ and σ̂^2 instead of directly estimating the class distribution p(x|ω, P).
With the MLE approach, a point value of the parameter θ is estimated which
maximizes the likelihood density p(P|θ ). Therefore with MLE, we get an esti-
mated value θ̂, not considering the parameter a random variable. In other words, with reference to the Bayes equation (1.107), MLE treats the ratio p(θ)/p(P) = prior probability/evidence as a constant and does not take the a priori probability into account in the calculation procedure of the estimate of θ.
In contrast, Bayesian learning instead considers the parameter to be estimated
θ as a random variable. Known the conditional density and a priori probability,
the Bayesian estimator obtains a probability distribution p(θ |P) associated with θ
instead of a point value as it happens for MLE. The goal is to select an expected
value of θ with the smallest possible variance of the posterior density p(θ|P). If the variance is very large, the estimate of θ is considered poor.
The Bayesian estimator incorporates the information a priori and if this is not
significant, the posterior density is determined by the training set (data-driven esti-
mator). If it is significant, the posterior density is determined by the combination of
the prior density and the training set of patterns. If the training set has a significant cardinality of patterns, these in fact dominate over the a priori information, making it less important. From this, it follows that the two estimators are related when the number of patterns n of the training set is very high. Considering the Bayes equation (1.107), we observe that the denominator can be neglected, being independent of θ, and we have that p(θ|P) ∝ p(P|θ) p(θ),
where the likelihood density has a peak at the maximum θ = θ̂ . With n very large, the
likelihood density shrinks around its maximum value while the integral that estimates
the conditional density of the class with the Bayesian method can be approximated
(see Eq. 1.106) as follows:
p(x|P) = \int p(x|θ)\, p(θ|P)\, dθ ≅ p(x|θ̂) \int p(θ|P)\, dθ = p(x|θ̂)    (1.117)

remembering that \int p(θ|P)\, dθ = 1. In essence, Bayesian learning instead of finding a
precise value of θ calculates a mean over all values θ of the density p(x, θ ), weighted
with the posterior density of the parameters p(θ |P).
In conclusion, the two estimators tend, approximately, to similar results, when n
is very large, while for small values, the results are very different.
where the sign is motivated by the fact that the minimum conditional risk corresponds
to the maximum discriminating function.
In the case of the minimum-error zero-one cost function, the further simplified Bayesian discriminant function is given by g_i(x) = p(ω_i|x). The choice of discriminating
functions is not unique since a generic function gi (x) can be replaced with f (gi (x))
where f (•) is a growing monotonic function that does not affect the accuracy of
the classification. We will see that these transformations are useful for simplifying
expressions and calculation.
Fig. 1.20 Functional scheme of a statistical classifier. The computational model is of the type
bottom-up as shown by the arrows. In the first level are the features of the patterns processed in the
second level with the discriminating functions to choose the one with the highest value that assigns
the pattern to the class to which it belongs
The discriminating functions for the classification with minimum error are
gi(x) = p(ωi|x) = p(x|ωi) p(ωi) / ∑_{k=1}^{K} p(x|ωk) p(ωk)   (1.120)

gi(x) = p(x|ωi) p(ωi)   (1.121)

gi(x) = log p(x|ωi) + log p(ωi)   (1.122)
which produce the same classification results. As already described in Sect. 1.6, the discriminant functions partition the feature space into the K decision regions Ωi corresponding to the classes ωi according to the following rule: a pattern x is assigned to the region Ωi if gi(x) > gj(x) for every j ≠ i.
The decision boundaries that separate the regions correspond to the valleys between the discriminant functions and are described by the equation gi(x) = gj(x). If we consider a two-class classification, ω1 and ω2, we have a single discriminant function g(x) = g1(x) − g2(x), so that x is assigned to ω1 if g(x) > 0 and to ω2 otherwise.
The Gaussian probability density function, also called normal distribution, has already been described in different contexts in this volume, considering its particular ability to model well the observations of various physical phenomena and its analytical tractability. In the context of classification, it is widely used to model the observed measurements of the various classes, often subject to random noise.
We also know from the central limit theorem that the distribution of the sum of a high number n of independent and identically distributed random variables tends to a normal distribution, independently of the distribution of the individual random variables.
A Bayesian classifier is based on the conditional probability density p(x|ωi) and the a priori probability p(ωi) of the classes. Now let's see how to obtain the
discriminant functions of a Bayesian classifier by assuming the classes with the
multivariate normal distribution (MND). The objective is to derive simple forms of
the discriminating functions by exploiting some properties of the covariance matrix
of the MNDs.
A univariate normal density is completely described by the mean μ and the variance σ², abbreviated as p(x) ∼ N(μ, σ²). A multivariate normal density is
described by the mean vector μ and by the covariance matrix Σ, and in short form is indicated with p(x) ∼ N(μ, Σ) (see Eq. 1.101).
For an arbitrary class ωi with patterns described by the d-dimensional vectors x = (x1, …, xd) with normal density, the mean vector is given by μ = (μ1, …, μd) with μi = E[xi], while the covariance matrix is Σ = E[(x − μ)(x − μ)^T], whose diagonal components represent the variances of the features. The squared Mahalanobis distance of a pattern x from the mean μ is then defined as
D²_M = (x − μ)^T Σ^{-1} (x − μ)   (1.128)
14 If Σ = I, where I is the identity matrix, (1.128) becomes the Euclidean distance (norm 2). If Σ is diagonal, the resulting measure becomes the normalized Euclidean distance D(x, μ) = √(∑_{i=1}^{d} (xi − μi)²/σi²). It should also be pointed out that the Mahalanobis distance can also be defined as a dissimilarity measure between two vector patterns x and y with the same probability density function and with covariance matrix Σ, defined as D(x, y) = √((x − y)^T Σ^{-1} (x − y)).
Fig. 1.21 2D geometric representation in the feature domain of a Gaussian pattern distribution (before and after the whitening transform). We observe their grouping centered on the mean vector μ and the contour lines, which in the 2D domain are ellipses, representing the set of points with equal probability density of the Gaussian distribution. The orientation of the grouping is determined by the eigenvectors of the covariance matrix, while the eigenvalues determine the extension of the grouping
The dispersion of the class patterns centered on the mean vector is measurable by the volume of the hyperellipsoid in relation to the values of D_M and Σ.
In this context, it may be useful to proceed with a linear transformation (see Chap. 2 Vol. II) of the patterns x to analyze the correlation level of the features and reduce their dimensionality, or to normalize the vectors x to have uncorrelated components with unit variance. The normalization of the features is obtained through the so-called whitening of the observations, that is, by means of a linear transformation (known as whitening transform [19]) such as to have uncorrelated features with unit variance.15
With this transformation, the ellipsoidal distribution in the feature space becomes spherical (see Fig. 1.21); the covariance matrix is equal to the identity matrix after the transformation (Σ_y = I), and the Euclidean metric can be used instead of the Mahalanobis distance (Eq. 1.128).
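A minimal sketch of a whitening transform based on the eigendecomposition of the sample covariance matrix is given below (the function name and the synthetic data are illustrative only, not taken from the text):

import numpy as np

def whiten(X, eps=1e-10):
    """Whitening transform (sketch): decorrelate the features of X (n x d)
    and rescale them to unit variance via the eigendecomposition of the
    sample covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)      # cov = Phi diag(lambda) Phi^T
    W = eigvec / np.sqrt(eigval + eps)        # W = Phi Lambda^{-1/2}
    Y = Xc @ W                                # whitened data: cov(Y) ~ I
    return Y, W, mu

# usage: after whitening, Euclidean distances on Y correspond to
# Mahalanobis distances on the original X
rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], size=500)
Y, W, mu = whiten(X)
print(np.round(np.cov(Y, rowvar=False), 2))   # approximately the identity matrix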
15 The whitening transform is always possible and the method used is based on the eigendecomposition of the covariance matrix.
Among the Bayesian discriminant functions gi(x) described above for the classification with minimum error, we consider (1.122), which under the hypothesis of multivariate conditional normal density p(x|ωi) ∼ N(μi, Σi) is rewritten in the form:
gi(x) = −(1/2)(x − μi)^T Σi^{-1}(x − μi) − (1/2) log|Σi| − (d/2) log 2π + log p(ωi)   (1.129)

where the term (d/2) log 2π is constant,
obtained by replacing in the discriminant function (1.122), for the class ωi, its conditional density of multivariate normal distribution p(x|ωi) given by
p(x|ωi) = 1/((2π)^{d/2} |Σi|^{1/2}) exp[−(1/2)(x − μi)^T Σi^{-1}(x − μi)]
Equation (1.129) is strongly characterized by the covariance matrix Σi, for which different assumptions can be made.
1.8.2.1 Assumption: Σi = σ²I
With this hypothesis, the features (x1, x2, …, xd) are statistically independent and have the same variance σ² with different means μi. The patterns are distributed in the feature space forming hyperspherical groupings of equal size centered in μi. In this case, the determinant and the inverse of Σi are, respectively, |Σi| = σ^{2d} and Σi^{-1} = (1/σ²)I (I being the identity matrix). Moreover, since the constant term in (1.129) and |Σi| are both independent of i, they can be ignored as irrelevant. It follows that a simplification of the discriminant functions is obtained:
gi(x) = −‖x − μi‖²/(2σ²) + log p(ωi) = −(x − μi)^T(x − μi)/(2σ²) + log p(ωi)
      = −(1/(2σ²))[x^T x − 2μi^T x + μi^T μi] + log p(ωi)   (1.130)
From (1.130), it can be noted that the discriminant functions are characterized by the Euclidean distance between the patterns and the mean of each class (‖x − μi‖²) and by the normalization terms given by the variance (2σ²) and the prior density (the offset log p(ωi)). It is not really necessary to compute these distances explicitly. In fact, expanding the quadratic form (x − μi)^T(x − μi) in (1.130), it is evident that the quadratic term x^T x is identical for all i and can be eliminated. This allows us to obtain the equivalent linear discriminant functions:

gi(x) = wi^T x + wi0   (1.131)

where
wi = (1/σ²) μi     wi0 = −(1/(2σ²)) μi^T μi + log p(ωi)   (1.132)
Fig. 1.22 1D geometric representation for two classes in the feature space. If the covariance matrices for the two distributions are equal and with identical a priori density p(ωi) = p(ωj), the Bayesian surface of separation in the 1D representation is the line passing through the intersection point of the two Gaussians p(x|ωi). For d > 1, the separation surface is instead the hyperplane in (d − 1) dimensions, with the groupings of the spherical patterns in d dimensions
The term wi0 is called the threshold (or bias) of the i-th class. The Bayesian decision
surfaces are hyperplanes defined by the equations:
gi (x) = g j (x) ⇐⇒ gi (x) − g j (x) = (wi − w j )T x + (wi0 − w j0 ) = 0 (1.133)
Considering (1.131) and (1.132), the hyperplane equation can be rewritten in the form:
w T (x − x0 ) = 0 (1.134)
where
w = μi − μ j (1.135)
x0 = (1/2)(μi + μj) − (σ²/‖μi − μj‖²) log[p(ωi)/p(ωj)] (μi − μj)   (1.136)
Equation (1.134) describes the decision hyperplane separating the class ωi from ωj and is perpendicular to the line joining the centroids μi and μj. The point x0, through which the hyperplane normal to the vector w passes, is determined by the values of p(ωi) and p(ωj). A special case occurs when p(ωi) = p(ωj) for each class. Figures 1.22 and 1.16 show the distributions in feature space for two classes, respectively, for equal a priori densities p(ωi) = p(ωj) and in the case of an appreciable difference with p(ωi) > p(ωj).
In the first case, from the (1.136), we observe that the second addend becomes zero
and the separation point of the classes x0 is at the midpoint between the vectors μi
and μ j , and the hyperplane bisects perpendicularly the line joining the two averages.
Moreover, if the a priori probabilities are equal, the terms independent of the class can be dropped and the discriminant function reduces to

gi(x) = −‖x − μi‖²   (1.137)

thus obtaining a classifier called minimum-distance classifier. Considering that (1.137) computes the Euclidean distance between x and the means μi, which in this context represent the prototypes of each class, the discriminant function is the typical one of a template matching classifier.
In the second case, with p(ωi) ≠ p(ωj), the point x0 moves away from the most probable class. As can be guessed, if the variance has low values (patterns more tightly grouped) compared to the distance between the means ‖μi − μj‖, the a priori densities will have a lower influence.
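The following sketch evaluates the linear discriminants (1.131)–(1.132) for the case Σi = σ²I; the data, means, and priors are hypothetical and only illustrate the computation:

import numpy as np

def linear_discriminants(x, means, sigma2, priors):
    """Evaluate g_i(x) = mu_i^T x / sigma^2 - mu_i^T mu_i / (2 sigma^2) + log p(omega_i)
    for all classes (eqs. 1.131-1.132) and return the index of the winning class."""
    g = [(m @ x) / sigma2 - (m @ m) / (2 * sigma2) + np.log(p)
         for m, p in zip(means, priors)]
    return int(np.argmax(g)), g

# hypothetical 2-class example with spherical covariance sigma^2 I
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]
label, scores = linear_discriminants(np.array([1.0, 1.2]), means, sigma2=1.0, priors=priors)
print(label, np.round(scores, 3))
# with equal priors this reduces to the minimum (Euclidean) distance classifier of (1.137)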
having eliminated the constant term and the term with |Σi|, both being independent of i. If the a priori densities p(ωi) were identical for all classes, their contribution could be ignored, and (1.138) would reduce only to the Mahalanobis distance term. Basically, we have a classifier based on the following decision rule: a pattern x is classified by assigning it to the class whose centroid μi is at the minimum Mahalanobis distance. If p(ωi) ≠ p(ωj), the separation boundary moves in the direction of the lower prior probability. It is observed that the Mahalanobis distance becomes the Euclidean distance ‖x − μ‖² = (x − μ)^T(x − μ) if Σ = I.
Expanding in Eq. (1.138) only the expression of the Mahalanobis distance and eliminating the terms independent of i (i.e., the quadratic term x^T Σ^{-1} x), linear discriminant functions are again obtained:

gi(x) = wi^T x + wi0   (1.139)

where

wi = Σ^{-1} μi     wi0 = −(1/2) μi^T Σ^{-1} μi + log p(ωi)   (1.140)
The term wi0 is called the threshold (or bias) of the i-th class. The linear discriminant functions, also in this case, geometrically represent the separation hyperplanes defined by
w T (x − x0 ) = 0 (1.141)
where
w = Σ^{-1}(μi − μj)   (1.142)
It is observed that the vector w, given by (1.142), is not, in general, in the direction of the vector (μi − μj). It follows that the separation hyperplane (of the regions Ωi and Ωj) is not perpendicular to the line joining the two means μi and μj. As in the previous case, the hyperplane always intersects the line joining the means at x0, whose position depends on the values of the a priori probabilities. If the latter are different, the hyperplane moves toward the class with the lower prior probability.
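A minimal sketch of the decision rule described above (minimum Mahalanobis distance with an optional log-prior term, in the spirit of (1.138)) is shown below; the class means, the shared covariance matrix, and the test pattern are illustrative assumptions:

import numpy as np

def mahalanobis_classify(x, means, cov, priors=None):
    """Assign x to the class whose centroid is at minimum Mahalanobis distance
    (covariance matrix shared by all classes); optional priors add log p(omega_i)."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for k, mu in enumerate(means):
        d2 = (x - mu) @ cov_inv @ (x - mu)      # squared Mahalanobis distance
        prior = np.log(priors[k]) if priors is not None else 0.0
        scores.append(-0.5 * d2 + prior)        # discriminant to maximize
    return int(np.argmax(scores)), scores

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[2.0, 0.6], [0.6, 1.0]])
print(mahalanobis_classify(np.array([1.0, 0.5]), means, cov, priors=[0.5, 0.5]))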
having only been able to eliminate the constant term (d/2) log 2π, while the other terms all depend on i.
The discriminant functions (1.144) are quadratic functions and can be rewritten as follows:

gi(x) = x^T Wi x + wi^T x + wi0   (1.145)

with a quadratic term in x, a linear term in x, and a constant term,
where
Wi = −(1/2) Σi^{-1}     wi = Σi^{-1} μi   (1.146)

wi0 = −(1/2) μi^T Σi^{-1} μi − (1/2) log|Σi| + log p(ωi)   (1.147)
The term wi0 is called the threshold (or bias) of the i-th class. The decision surface between two classes is a quadric hypersurface.16 These decision surfaces may not be connected even in the one-dimensional case (see Fig. 1.23).
Any of the quadric hypersurfaces can be generated from two Gaussian distributions. The surfaces of separation become more complex when the number of classes is greater than 2,
even with Gaussian distributions. In these cases, it is necessary to identify the pair
of classes involved in that particular area of the feature space.
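The quadratic discriminant of (1.145)–(1.147) can be sketched as follows (the two hypothetical classes and the test point are chosen only to illustrate the computation):

import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    """g_i(x) = x^T W_i x + w_i^T x + w_i0 with the coefficients of
    (1.146)-(1.147); a sketch of the general Gaussian (quadratic) case."""
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv                              # eq. (1.146)
    w = cov_inv @ mu
    w0 = (-0.5 * mu @ cov_inv @ mu                  # eq. (1.147)
          - 0.5 * np.log(np.linalg.det(cov))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

# hypothetical two classes with different covariance matrices
x = np.array([0.5, 1.5])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 1.5]]), 0.5)
print("assigned to", "omega_1" if g1 > g2 else "omega_2")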
1.8.2.4 Conclusions
Let us briefly summarize the specificities of the Bayesian classifiers described. In the
hypothesis of Gaussian distribution of classes, in the most general case, the Bayesian
classifier is quadratic. If the hypothesized Gaussian classes all have the same covariance matrix, the Bayesian classifier is linear. It is also highlighted that a classifier based on the Mahalanobis distance is optimal in the Bayesian sense if the classes have a normal distribution, equal covariance matrices, and equal a priori probabilities. Finally, it is pointed out that a classifier based on the Euclidean distance is optimal in the Bayesian sense if the classes have a normal distribution, equal covariance matrices proportional to the identity matrix, and equal a priori probabilities.
Both the Euclidean and Mahalanobis distance classifiers are linear. In various applications, different distance-based classifiers (Euclidean or Mahalanobis) are used making implicit statistical assumptions. Often such assumptions, for example, those on the normality of the class distributions, are rarely true, leading to poor results. The strategy is to verify pragmatically whether these classifiers solve the problem.
16 In geometry, a quadric surface is defined as a hypersurface of a d-dimensional space described by a second-degree (quadratic) polynomial equation.
In the parametric approach, where assumptions are made about class distribu-
tions, it is important to carefully extract the associated parameters through signifi-
cant training sets and an interactive pre-analysis of sample data (histogram analysis,
transformation to principal components, reduction of dimensionality, verification of
significant features, number of significant classes, data normalization, noise attenu-
ation, ...).
In the implementation (framework) of the Gaussian mixtures, this generating
distribution of the P dataset consists of a set composed of K Gaussian distributions
(see Fig. 1.24); each of them is therefore

p(xi|zi = k, θ) = N(xi|μk, Σk)   (1.148)

with μk and Σk, respectively, the mean and the covariance matrix of the k-th Gaussian distribution, defined as follows:

N(xi|μk, Σk) = (2π)^{-d/2} |Σk|^{-1/2} exp[−(1/2)(xi − μk)^T Σk^{-1}(xi − μk)]   (1.149)
that is, the probability of generating the observation xi using the k-th model is provided by the Gaussian distribution having the parameter vector θk = (μk, Σk) (mean and covariance matrix, respectively).
zi = [0 0 … 0 1 0 … 0 0]   (K elements)   (1.150)
Since we do not know the corresponding zi for each xi,17 these variables are called hidden variables. Our problem is now reduced to the search for the parameters μk and Σk of each of the K models and the respective a priori probabilities πk which, incorporated in the generative model, give a high probability of generating the observed distribution of the data. The density of the mixture is given by

p(xi|θ) = ∑_{k=1}^{K} πk N(xi|μk, Σk)   (1.151)

with the mixing coefficients satisfying

∑_{k=1}^{K} πk = 1.   (1.152)
Also note that both p(x) ≥ 0 and N(x|μk, Σk) ≥ 0 require πk ≥ 0 for each k. Combining these conditions with (1.152), we get

0 ≤ πk ≤ 1   (1.153)

and so the mixture coefficients are probabilities. We are interested in maximizing the likelihood L(θ) = p(P; θ) of generating the observed data with the parameters of the model θ = {μk, Σk, πk}_{k=1}^{K}.
17 If we knew them, we would group all the xi based on their zi and we would model each grouping with a single Gaussian.
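A short sketch of the mixture density (1.149)–(1.151) evaluated at a point is given below; the number of components, parameters, and test point are illustrative assumptions:

import numpy as np

def gaussian(x, mu, cov):
    """Multivariate normal density N(x | mu, cov), as in (1.149)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(cov) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def mixture_density(x, pis, mus, covs):
    """Mixture density p(x | theta) = sum_k pi_k N(x | mu_k, Sigma_k), eq. (1.151)."""
    return sum(pi * gaussian(x, mu, cov) for pi, mu, cov in zip(pis, mus, covs))

# hypothetical 2-component mixture in 2D
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]
print(mixture_density(np.array([1.0, 1.0]), pis, mus, covs))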
Our goal is, therefore, to find the parameters of the K Gaussian distributions and the
coefficients πk based on the dataset P we have. Calculating the mixture density on
the entire dataset of statistically independent measurements, we have
p(X|θ) = ∏_{i=1}^{n} ∑_{k=1}^{K} πk p(xi|zi = k, μk, Σk)   (1.155)

and, taking the logarithm, the log-likelihood

L = ∑_{i=1}^{n} log ∑_{k=1}^{K} πk p(xi|zi = k, μk, Σk)   (1.156)
which we differentiate with respect to θk = {μk, Σk} and set to zero to find the maximum, that is,

∂L/∂θk = ∑_{i=1}^{n} [πk/(∑_{j=1}^{K} πj Pij)] ∂Pik/∂θk = ∑_{i=1}^{n} [πk Pik/(∑_{j=1}^{K} πj Pij)] ∂log Pik/∂θk = ∑_{i=1}^{n} rik ∂log Pik/∂θk   (1.157)

where Pik stands for p(xi|zi = k, μk, Σk),
in which we used the identity ∂p/∂θ = p ∂log p/∂θ and defined rik as the responsibility, i.e., the variable representing how likely the i-th point is modeled (or explained) by the k-th Gaussian, namely

rik = p(xi, zi = k|μk, Σk)/p(xi|μk, Σk) = p(zi = k|xi, μk, Σk)   (1.158)
which are a posteriori probabilities of class membership, with ∑_{k=1}^{K} rik = 1. Now, we have to calculate the derivative with respect to πk. We take the objective function L and add a Lagrange multiplier λ which enforces the constraint that the prior probabilities must sum to 1:

L̃ = L + λ(1 − ∑_{k=1}^{K} πk)   (1.159)
therefore, we differentiate with respect to πk and λ, and set the derivatives to zero as before, obtaining

∂L̃/∂πk = ∑_{i=1}^{n} Pik/(∑_{j=1}^{K} πj Pij) − λ = 0  ⇔  ∑_{i=1}^{n} rik − λπk = 0   (1.160)

∂L̃/∂λ = 1 − ∑_{k=1}^{K} πk = 0   (1.161)
and considering that summing (1.160) over k gives ∑_{k=1}^{K} ∑_{i=1}^{n} rik − λ ∑_{k=1}^{K} πk = n − λ = 0, we get λ = n, so the prior probability for the k-th class is given by

πk = (1/n) ∑_{i=1}^{n} rik   (1.162)
We now find the mean and covariance from the objective function L as follows:

∂L/∂μk = ∑_{i=1}^{n} rik (xi − μk)^T Σk^{-1}   (1.163)

∂L/∂Σk^{-1} = ∑_{i=1}^{n} rik [Σk − (xi − μk)(xi − μk)^T]   (1.164)
A smart way to find maximum likelihood estimates for models with latent variables is the Expectation–Maximization (EM) algorithm [20]. Instead of finding the maximum likelihood (ML) estimate of the observed data p(P; θ), we will try to maximize the likelihood of the joint distribution of P and Z = {zi}_{i=1}^{n},

lc(θ) = log p(P, Z; θ),
quantity known as complete log-likelihood. Since we cannot observe the values of the
random variables zi , we have to work with the expected values of the quantity lc (θ)
with respect to some distribution Q(Z). The logarithm of the complete likelihood
function is defined as follows:
lc(θ) = log p(P, Z; θ)
      = log ∏_{i=1}^{n} p(xi, zi; θ)
      = log ∏_{i=1}^{n} ∏_{k=1}^{K} [p(xi|zik = 1; θ) p(zik = 1)]^{zik}   (1.166)
      = ∑_{i=1}^{n} ∑_{k=1}^{K} [zik log p(xi|zik = 1; θ) + zik log πk].
Since we have assumed that each of the models is a Gaussian, the quantity p(xi|zik = 1; θ) represents the conditional probability of generating xi given the k-th model:

p(xi|zik = 1; θ) = (1/((2π)^{d/2} |Σk|^{1/2})) exp[−(1/2)(xi − μk)^T Σk^{-1}(xi − μk)]   (1.167)
Taking the expectation with respect to Q(Z), we obtain

⟨lc(θ)⟩_{Q(Z)} = ∑_{i=1}^{n} ∑_{k=1}^{K} [zik log p(xi|zik = 1; θ) + zik log πk].   (1.168)
Differentiating (1.168) with respect to μk and setting it to zero gives

∂⟨lc(θ)⟩_{Q(Z)}/∂μk = ∑_{i=1}^{n} zik ∂/∂μk log p(xi|zik = 1; θ) = 0   (1.169)
where the last equality derives from the relation ∂/∂x (x^T A x) = x^T(A + A^T). By replacing the result of (1.170) in (1.169), we get

∑_{i=1}^{n} zik (xi − μk)^T Σk^{-1} = 0   (1.171)

from which, solving for μk, the update μk = ∑_{i=1}^{n} zik xi / ∑_{i=1}^{n} zik follows (cf. Algorithm 3).
Let us now calculate the estimate for the covariance matrix by differentiating Eq. (1.168) with respect to Σk^{-1}. We can calculate ∂/∂Σk^{-1} log p(xi|zik = 1; θ) using (1.167) as follows:
∂/∂Σk^{-1} log p(xi|zik = 1; θ) = ∂/∂Σk^{-1} log[ (1/((2π)^{d/2} |Σk|^{1/2})) exp(−(1/2)(xi − μk)^T Σk^{-1}(xi − μk)) ]
  = ∂/∂Σk^{-1} [ (1/2) log|Σk^{-1}| − (1/2)(xi − μk)^T Σk^{-1}(xi − μk) ]   (1.174)
  = (1/2) Σk − (1/2)(xi − μk)(xi − μk)^T
Substituting this result into the derivative of (1.168) and setting it to zero, we obtain

∑_{i=1}^{n} zik [ (1/2) Σk − (1/2)(xi − μk)(xi − μk)^T ] = 0   (1.175)
which gives us the update equation of the covariance matrix for the k-th component of the mixture:

Σk = ∑_{i=1}^{n} zik (xi − μk)(xi − μk)^T / ∑_{i=1}^{n} zik   (1.176)
Now, we need to find the update equation of the prior probability πk for the k-th component of the mixture. This means maximizing the expected value of the log-likelihood lc (Eq. 1.168) subject to the constraint ∑_k πk = 1. To do this, we introduce a Lagrange multiplier λ by augmenting (Eq. 1.168) as follows:

L(θ) = ⟨lc(θ)⟩_{Q(Z)} − λ(∑_{k=1}^{K} πk − 1)   (1.177)
∑_{k=1}^{K} ∑_{i=1}^{n} zik − λ ∑_{k=1}^{K} πk = 0   (1.180)

and since ∑_{k=1}^{K} πk = 1, we have

λ = ∑_{k=1}^{K} ∑_{i=1}^{n} zik = n   (1.181)
Replacing this result in Eq. (1.179), we obtain the following update formula:

πk = (1/n) ∑_{i=1}^{n} zik   (1.182)

which satisfies the constraint ∑_{k=1}^{K} πk = 1.
zik = p(zik = 1|xi; θ) = p(xi|zik = 1; θ) πk / ∑_{j=1}^{K} p(xi|zij = 1; θ) πj
1.9.3 EM Theory
That is, if we want to find the value of the function between the two points x1 and x2, say at x* = λx1 + (1 − λ)x2, then the value of f(x*) will be found below the chord joining f(x1) and f(x2) (if the function is convex, and vice versa if concave). We are interested
in evaluating the logarithm, which is actually a concave function and for which we will consider the last inequality (1.185). We rewrite log p(P; θ) as follows:

log p(P; θ) = log ∫ p(P, Z; θ) dZ   (1.186)
Now let us multiply and divide by an arbitrary distribution Q(Z) in order to find a lower bound of log p(P; θ), and use the result of Jensen's inequality to continue from Eq. (1.186):

log ∫ p(P, Z; θ) dZ = log ∫ Q(Z) [p(P, Z; θ)/Q(Z)] dZ
  ≥ ∫ Q(Z) log [p(P, Z; θ)/Q(Z)] dZ   (Jensen)
  = ∫ Q(Z) log p(P, Z; θ) dZ − ∫ Q(Z) log Q(Z) dZ   (1.187)
  = F(Q, θ)

where the first term is the expected value of the complete log-likelihood and the second is the entropy of Q(Z). The bound becomes an equality, F(Q, θ) = log p(P; θ), when Q(Z) is the true posterior of the hidden variables.
This means that when we calculate the expected value of the complete log-likelihood ⟨log p(P, Z; θ)⟩_{Q(Z)}, it should be taken with respect to the true posterior probability p(Z|P; θ) of the hidden variables (hence the step "E").
Now let us multiply and divide by the same arbitrary distribution q(y) defined on the latent variables y. We can then take advantage of Jensen's inequality, since we have a convex combination, weighted by q(y), of functions of the latent variables. In practice, we consider as f(y) the function of the latent variables y indicated in (1.191). Any distribution q(y) over the hidden variables can be used to get a lower bound of the log-likelihood function:

L(θ) = log ∫ q(y) [p(x, y|θ)/q(y)] dy ≥ ∫ q(y) log [p(x, y|θ)/q(y)] dy = F(q, θ)   (1.191)
This lower bound derives from Jensen's inequality and from the fact that the logarithm function is concave.18 In the EM algorithm, we alternately optimize F(q, θ) with respect to q(y) and θ. It can be proved that this mode of operation never decreases L(θ). In summary, the EM algorithm alternates between the following two steps:
1. Step E optimizes F(Q, θ) with respect to the distribution of the hidden variables, keeping the parameters fixed:

Q^{(k)}(z) = arg max_{Q} F(Q(z), θ^{(k−1)})   (1.192)

2. Step M maximizes F(Q, θ) with respect to the parameters, keeping the distribution of the hidden variables fixed:

θ^{(k)} = arg max_{θ} F(Q^{(k)}(z), θ) = arg max_{θ} ∫ Q^{(k)}(z) log p(x, z|θ) dz   (1.193)
where the second equality derives from the fact that the entropy of q(z) does not
depend directly on θ .
The intuition at the basis of the EM algorithm can be schematized as follows:
Step E finds the values of the hidden variables according to their posterior proba-
bilities;
Step M learns the model as if the hidden variables were not hidden.
The EM algorithm is very useful in many contexts, since in many models, if the hidden variables were no longer hidden, learning would become very simple (as in the case of Gaussian
18 The logarithm of the average is greater than or equal to the average of the logarithms.
mixtures). Furthermore, the algorithm breaks down the complex learning problem into a sequence of simpler learning problems. The pseudo-code of the EM algorithm for Gaussian mixtures is reported in Algorithm 3.
3: for i = 1, 2, …, n do
4:   for k = 1, 2, …, K do
5:     p(xi|zik = 1; θ) = (2π)^{-d/2} |Σk|^{-1/2} exp[−(1/2)(xi − μk)^T Σk^{-1}(xi − μk)]
6:     zik = p(xi|zik = 1; θ) πk / ∑_{j=1}^{K} p(xi|zij = 1; θ) πj
7:   end for
8: end for
9: for k = 1, 2, …, K do
10:    Σk = ∑_{i=1}^{n} zik (xi − μk)(xi − μk)^T / ∑_{i=1}^{n} zik
11:    μk = ∑_{i=1}^{n} zik xi / ∑_{i=1}^{n} zik
12:    πk = (1/n) ∑_{i=1}^{n} zik
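A runnable sketch of Algorithm 3 with numpy is given below; the initialization scheme (means drawn from the data, identity covariances, uniform priors), the standard ordering of the M-step updates (means before covariances), and the synthetic data are assumptions made for illustration:

import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal sketch of EM for Gaussian mixtures. X is an (n, d) data matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, K, replace=False)]          # initial means
    covs = np.array([np.eye(d) for _ in range(K)])    # initial covariances
    pis = np.full(K, 1.0 / K)                         # initial priors

    for _ in range(n_iter):
        # E-step: responsibilities z_ik
        resp = np.empty((n, K))
        for k in range(K):
            diff = X - mus[k]
            cov_inv = np.linalg.inv(covs[k])
            norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(covs[k]) ** (-0.5)
            resp[:, k] = pis[k] * norm * np.exp(-0.5 * np.sum(diff @ cov_inv * diff, axis=1))
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update parameters
        Nk = resp.sum(axis=0)
        mus = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / n
    return pis, mus, covs

# usage on synthetic data drawn from two Gaussians
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 0.7, (200, 2))])
pis, mus, covs = em_gmm(X, K=2)
print(np.round(pis, 2), np.round(mus, 2))

In practice, several random restarts are usually tried, since the algorithm may converge to a local optimum.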
the maximum likelihood (ML) method), with a supervised approach, the relative characteristic parameters. Moreover, these densities have been considered unimodal, whereas in real applications they are often multimodal. The extension with Gaussian mixtures is possible, even if it is necessary to determine the number of components and to hope that the algorithm (for example, EM) that estimates the relative parameters converges toward a global optimum.
With the nonparametric methods, no assumption is made on the knowledge of the
density functions of the various classes that can take arbitrary forms. These methods
can be divided into two categories, those based on the Density Estimation—DE and
those that explicitly use the features of the patterns considering the training sets
significant for the classification. In Sect. 1.6.5, some of these have been described
(for example, the k-Nearest-Neighbor algorithm). In this section, we will describe the
simple method based on the Histogram and the more general form for estimating
density together with the Parzen Window.
where the density is constant over the whole width of the bin, which is normally chosen with the same size for all bins, Δi = Δ (see Fig. 1.26). With (1.194), the objective is to model the normalized density p(x) from the N observed patterns P.
From the figure, it can be observed how the approximation of p(x) is attributable to a mixture of Gaussians. The approximation of p(x) is characterized by Δ. For very large values of Δ, the density is too flat and the bimodal configuration of p(x) is lost, while for very small values of Δ, a good approximation of p(x) is obtained, recovering its bimodal structure.
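A small sketch of the histogram estimator, with the bin width Δ controlling the smoothness of the resulting density, is shown below (the bimodal synthetic data and the function name are illustrative assumptions):

import numpy as np

def histogram_density(samples, n_bins, lo=0.0, hi=1.0):
    """Sketch of the histogram estimator: piecewise-constant density
    p_i = k_i / (N * Delta) in each bin of width Delta = (hi - lo)/n_bins."""
    N = len(samples)
    counts, edges = np.histogram(samples, bins=n_bins, range=(lo, hi))
    delta = edges[1] - edges[0]
    return counts / (N * delta), edges        # integrates to 1 over [lo, hi]

# bimodal data in [0, 1]: the bin width Delta controls the smoothness
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.3, 0.05, 500), rng.normal(0.7, 0.08, 500)])
x = x[(x >= 0) & (x <= 1)]
for n_bins in (4, 20, 200):
    p, edges = histogram_density(x, n_bins)
    print(n_bins, "bins -> max density", round(p.max(), 2))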
The histogram method to estimate p(x), although very simple for calculating the estimate from the training set as the pattern sequence is observed, has some limitations:
(a) Discontinuity of the estimated density due to the bin boundaries rather than to an intrinsic property of the density.
(b) Problem of scaling the number of bins with the pattern dimensionality d: we would have M^d bins, with M bins for each dimension.
Normally the histogram is used for a fast qualitative display (up to 3 dimensions) of the pattern distribution. A general formulation of DE is obtained based on probability theory. Consider pattern samples x ∈ R^d with associated density p(x). Let R be a bounded region in the feature domain; the probability P that a pattern x falls in the region R is given by

P = ∫_R p(x′) dx′   (1.195)
It is shown by the properties of the binomial distribution that the mean and the variance of the ratio k/N (considered as a random variable) are given by

E[k/N] = P     Var[k/N] = E[(k/N − P)²] = P(1 − P)/N   (1.197)
For N → ∞, the distribution becomes more and more peaked, with small variance (Var(k/N) → 0), and we can expect that a good estimate of the probability P can be obtained from the fraction of samples that fall into R:

P ≅ k/N   (1.198)
At this point, if we assume that the region R is very small and p(x) is continuous and does not vary appreciably within it (i.e., it is approximately constant), we can write

∫_R p(x′) dx′ ≅ p(x) ∫_R dx′ = p(x)V   (1.199)

where x is a pattern inside R and V is the volume enclosed by the region. By virtue of (1.195), (1.198), and the last equation, combining the results, we obtain

P = ∫_R p(x′) dx′ ≅ p(x)V,  P ≅ k/N  ⟹  p(x) ≅ (k/N)/V   (1.200)
In essence, this last result assumes that the two approximations are identical. Furthermore, the estimate of the density p(x) becomes more and more accurate with the increase in the number of samples N and the simultaneous contraction of the volume V. Note, however, that this leads to the following contradiction:
1. Reducing the volume implies a sufficiently small region R (with the density approximately constant within it), but with the risk that no samples fall inside it.
2. Alternatively, a sufficiently large region R would contain enough samples k to produce a pronounced binomial peak (see Fig. 1.27).
If instead the volume is fixed (and consequently R) and we increase the number of samples of the training set, then the ratio k/N will converge as desired. But this only produces an estimate of the spatial average of the density:

P/V = ∫_R p(x′) dx′ / ∫_R dx′   (1.201)
In reality, we cannot have V very small, considering that the number of samples N is always limited. It follows that we must accept that the density estimate is a spatial average associated with a variance other than zero. Now let us see whether these limitations can be avoided when an unlimited number of samples is available. To evaluate p(x) in x, let us consider a sequence of regions R1, R2, … containing x, with R1 having 1 sample, R2 having 2 samples, and so on. Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the n-th estimate of p(x); we have

pn(x) = (kn/N)/Vn   (1.202)
For pn(x) to converge to p(x), three conditions are required:

lim_{N→∞} Vn = 0   (1.203)

lim_{N→∞} kn = ∞   (1.204)

lim_{N→∞} kn/N = 0   (1.205)
The (1.203) ensures that the spatial average P/V converges to p(x). The (1.204)
essentially ensures that the ratio of the frequencies k/N converges to the probability
P with the binomial distribution sufficiently peaked. The (1.205) is required for the
convergence of pn (x) given by the (1.202).
There are two ways to obtain regions that satisfy the three conditions indicated above (Eqs. (1.203), (1.204), and (1.205)): shrink the regions as a function of the number of samples, for example Vn = 1/√n (Parzen windows), or fix the number of samples to include, for example kn = √n, and grow the region until it contains them (kn-nearest neighbors). Figure 1.28 shows a graphical representation of the two methods. The two sequences represent random variables that generally converge and allow us to estimate the probability density at a given point in the circular region.
Fig. 1.28 Two methods for estimating density, that of Parzen windows (a), with Vn = 1/√n, and that of the kn-nearest neighbors (b), with kn = √n. The two sequences represent random variables that generally converge to estimate the probability density at a given point in the circular (or square) region. The Parzen method starts with a large initial value of the region which decreases as n increases, while the knn method specifies a number of samples kn and the region V increases until the predefined samples are included near the point under consideration x
Vn = hn^d   (1.206)

where d indicates the dimensionality of the hypercube of edge hn. To find the number of samples kn that fall within the region Rn, the window function φ(u) (also called kernel function) is defined as

φ(u) = { 1 if |uj| ≤ 1/2 ∀ j = 1, …, d;  0 otherwise }   (1.207)
It follows that the total number of samples inside the hypercube is given by

kn = ∑_{i=1}^{N} φ((x − xi)/hn)   (1.209)
Fig. 1.29 One-dimensional example of density estimation with Parzen windows. a The training set consists of 7 samples P = {2, 3, 5, 6, 7, 9, 10}, and the window has width hn = 3. The estimate is calculated with (1.210) starting from x = 0.5, and subsequently the window is centered in each sample, finally obtaining p(x) as the sum of 7 rectangular functions, each of height 1/(N hn^d) = 1/(7·3) = 1/21. b Analogy between the density estimation carried out with the histogram and with the rectangular (hypercubic in the d-dimensional case) Parzen windows, where we observe strong discontinuities of p(x) for very small hn (as happens for small values of Δ for the bins), while a very smooth shape of p(x) is obtained for large values of hn, in analogy to the histogram for high values of Δ
By replacing (1.209) in Eq. (1.202), we get the KDE density estimate:

pn(x) = (kn/N)/Vn = (1/N) ∑_{i=1}^{N} (1/Vn) φ((x − xi)/hn)   (1.210)
The kernel function φ, in this case called the Parzen window, tells us how to weigh all the samples in Rn to determine the density pn(x) at a particular point x. The density estimate is obtained as the average of the kernel functions evaluated in x and xi. In other words, each sample xi contributes to the estimate of the density in relation to its distance from x (see Fig. 1.29a). It is also observed that the Parzen window has an analogy with the histogram, with the exception that the bin locations are determined by the samples (see Fig. 1.29b).
Now consider a more general form of the kernel function instead of the hypercube. A kernel function can be thought of as an interpolator placed at the various samples xi of the training set P, instead of considering only the position x. This means that the kernel function φ must satisfy the conditions of a density function, that is, be nonnegative and integrate to 1:

φ(u) ≥ 0     ∫ φ(u) du = 1   (1.211)
For the hypercube previously considered, with volume Vn = hn^d, it follows that the density pn(x) satisfies the conditions indicated by (1.211):

∫ pn(x) dx = (1/N) ∑_{i=1}^{N} (1/Vn) ∫ φ((x − xi)/hn) dx = (1/N) ∑_{i=1}^{N} (1/Vn) Vn = 1   (1.212)

since each integral equals the hypercube volume Vn.
If instead we consider an interpolating kernel function that satisfies the density conditions (1.211), integrating by substitution with u = (x − xi)/hn, for which dx = hn^d du, we get

∫ pn(x) dx = (1/N) ∑_{i=1}^{N} (1/Vn) ∫ φ((x − xi)/hn) dx = (1/N) ∑_{i=1}^{N} (1/Vn) hn^d ∫ φ(u) du = (1/N) ∑_{i=1}^{N} ∫ φ(u) du = 1   (1.213)
Using a univariate Gaussian kernel (1.214), the resulting Parzen estimate becomes

pφ(x) = (1/N) ∑_{i=1}^{N} (1/(hn√(2π))) exp[−(x − xi)²/(2hn²)]   (1.215)
The Gaussian Parzen window eliminates the problem of the discontinuity of the rectangular window. The samples of the training set P that are closest to the evaluation point x have a higher weight, thus obtaining a smoothed density pφ(x) (see Fig. 1.30). It is observed how the shape of the estimated density is modeled by the Gaussian kernel functions located on the observed samples.
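The following sketch implements the one-dimensional Parzen estimate with either the rectangular window of (1.207)/(1.210) or a Gaussian kernel as in (1.215); the evaluation grid is arbitrary, while the sample set reproduces the example of Fig. 1.29:

import numpy as np

def parzen_estimate(x, samples, h, kernel="gaussian"):
    """Parzen-window estimate p_n(x) at the points x (1D sketch):
    rectangular (hypercube) kernel or Gaussian kernel, bandwidth h."""
    x = np.atleast_1d(x)[:, None]          # (m, 1)
    u = (x - samples[None, :]) / h         # (m, N)
    if kernel == "rect":
        phi = (np.abs(u) <= 0.5).astype(float)      # window function (1.207)
    else:
        phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.sum(axis=1) / (len(samples) * h)

# the 7-sample training set of Fig. 1.29 with a rectangular window of width h = 3
P = np.array([2, 3, 5, 6, 7, 9, 10], dtype=float)
grid = np.array([0.5, 1.0, 2.0, 6.0])
print(np.round(parzen_estimate(grid, P, h=3, kernel="rect"), 4))     # steps of height k/(7*3)
print(np.round(parzen_estimate(grid, P, h=3, kernel="gaussian"), 4))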
We will now analyze the influence of the window width (also called smoothing parameter) hn on the final density estimate pφ(x). A large value tends to produce a smoother density that blurs the structure of the samples; on the contrary, small values of hn tend to produce a very peaked density function that is difficult to interpret. To have a quantitative measure of the influence of hn, we consider the function δn(x) defined as follows:

δn(x) = (1/Vn) φ(x/hn) = (1/hn^d) φ(x/hn)   (1.216)
where hn affects the horizontal scale (width), while the volume hn^d affects the vertical scale (amplitude). The function δn(x) also satisfies the conditions of a density function; in fact, integrating by substitution with u = x/hn, we get

∫ δn(x) dx = (1/hn^d) ∫ φ(x/hn) dx = (1/hn^d) ∫ φ(u) hn^d du = ∫ φ(u) du = 1   (1.217)

so that the estimate (1.210) can be rewritten as

pn(x) = (1/N) ∑_{i=1}^{N} δn(x − xi)   (1.218)
The effect of the parameter hn (i.e., of the volume Vn) on the function δn(x), and consequently on the density pn(x), is as follows:
(a) For hn tending toward high values, a contrasting action is observed on the function δn(x): on one side there is a reduction of the vertical scale factor (amplitude), and on the other an increase of the horizontal scale factor (width). In this case, we get a poor (very smooth) resolution of the density pn(x), considering that it will be the sum of many broad delta functions centered on the samples (analogous to a convolution process).
(b) For hn tending toward small values, δn(x) becomes very peaked and pn(x) will result in the sum of N sharp pulses with high resolution, but with an estimate affected by strong statistical variability (noise).
Fig. 1.31 Parzen window estimates associated with a univariate normal density for different values of the parameters h1 and N, respectively the window width and the number of samples. It is observed that for very large N, the influence of the window width is negligible
The theory suggests that for an unlimited number N of samples, with the volume Vn tending to zero, the density pn(x) converges toward the unknown density p(x). In reality, having a limited number of samples, the best that can be done is to find a compromise between the choice of hn and the limited number of samples. Considering the training set of samples P = (x1, …, xN) as random variables on which the density pn(x) depends, for any value of x it can be shown that, if pn(x) has an estimated mean p̂n(x) and an estimated variance σ̂n²(x), then [18]

lim_{n→∞} p̂n(x) = p(x)     lim_{n→∞} σ̂n²(x) = 0   (1.219)
Let us now consider a training set of i.i.d. samples deriving from a normal distribution p(x) ∼ N(0, 1). If a Parzen Gaussian window given by (1.214) is used, setting hn = h1/√N, where h1 is a free parameter, the resulting estimate of the density is given by (1.215), which is the average of normal densities centered in the samples xi. Figure 1.31 shows the estimate of the true density p(x) = N(0, 1) using a Parzen window with Gaussians, as the free parameter varies, h1 = 1, 0.4, 0.1, and the number of samples N = 1, 10, 100, ∞. It is observed how, in the approximation
(a) The conditional densities p(x|ωi) are estimated for each class (Eq. (1.215)) and the test samples are classified according to the corresponding maximum posterior probability (Eq. (1.70)). If necessary, the a priori probabilities can be taken into account, in particular when they are very different.
(b) The decision regions for this type of classifier depend heavily on the choice of the kernel function used.
19 Cross-validation is a statistical technique that can be used in the presence of an acceptable number of observed samples (training set). In essence, it is a statistical method to validate a predictive model. Given a sample of data, it is divided into subsets, some of which are used for the construction of the model (the training sets) and others to be compared with the predictions of the model (the validation set). By averaging the quality of the predictions over the various validation sets, we have a measure of the accuracy of the predictions. In the context of classification, the training set consists of samples whose class is known in advance, ensuring that this set is significant and complete, i.e., with a sufficient number of representative samples of all classes. For the verification of the recognition method, a validation set is used, also consisting of samples whose class is known, used to check the generalization of the results. It consists of a set of samples different from those of the training set.
A further limitation concerns the limited number of samples available in concrete applications, which makes the choice of hn not easy. The method also requires a high computational complexity: the classification of a single sample requires the evaluation of a function that potentially depends on all the samples, and the number of samples required grows exponentially with the dimensionality of the feature space.
The Neural Network—NN has seen an explosion of interest over the years, and is
successfully applied in an extraordinary range of sectors, such as finance, medicine,
engineering, geology, and physics.
Neural networks are applicable in practically every situation in which a relationship exists between the predictor (independent, input) variables and the predicted (dependent, output) variables, even when this relationship is very complex and not easy to define in terms of correlation or similarity between the various classes. In particular, they are also used for the classification problem, which aims to determine which of a defined number of classes a given sample belongs to.
Studies on neural networks are inspired by the attempt to understand the functioning mechanisms of the human brain and to create models that mimic this functionality. This has been possible over the years with the advancement of knowledge in neurophysiology, which has allowed various physicists, physiologists, and mathematicians to create simplified mathematical models, exploited to solve problems with new computational models called neurocomputing.
source of inspiration for developing neural networks by imitating the functionality of
the brain regardless of its actual model of functioning. The nervous system plays the
fundamental role of intermediary between the external environment and the sensory
organs to guarantee the appropriate responses between external stimuli and internal
sensory states. This interaction occurs through the receptors of the sense organs,
which, excited by the external environment (light energy, ...), transmit the signals to
other nerve cells which are in turn processed, producing informational patterns useful
to the executing organs (effectors, an organ, or cell that acts in response to a stimu-
lus). The neur on is a nerve cell, which is the basic functional construction element
of the nervous system, capable of receiving, processing, storing and transmitting
information.
Ramón y Cajal (1911) introduced the idea of neurons as elements of the structure of the human brain. The response times of neurons are 5–6 orders of magnitude slower than the gates of silicon circuits: the propagation of the signals in a silicon chip takes a few nanoseconds (10⁻⁹ s), while the neural activity propagates with times of the order of milliseconds (10⁻³ s). However, the human brain is made up of about 100
Fig. 1.32 Structure of a biological neuron: soma (with the nucleus), dendrites, axon with myelin sheaths and nodes of Ranvier, and synaptic terminals; the arrow indicates the direction of propagation of the impulses
billion (10¹²) nerve cells, also called neurons, interconnected with each other through up to a trillion (10¹⁸) special structures called synapses or connections. The number of synapses per neuron can range from 2,000 to 10,000. In this way, the brain is in fact a massively parallel, efficient, complex, and nonlinear computational structure. Each neuron constitutes an elementary processing unit. The computational power of the brain depends above all on the high degree of interconnection of the neurons, their hierarchical organization, and the multiple activities of the neurons themselves.
This organizational capacity of the neurons constitutes a computational model that
is able to solve complex problems such as object recognition, perception, and motor
control with a speed considerably higher than that achievable with traditional large-
scale computing systems. It is in fact known how spontaneously a person is able to
perform functions of visual recognition for example of a particular object with respect
to many other unknowns, requiring only a few milliseconds of time. Since birth, the
brain has the ability to acquire information about objects, with the construction of
its own rules that in other words constitute knowledge and experience. The latter is
realized over the years together with the development of the complex neural structure
that occurs particularly in the first years of birth. The growth mechanism of the neural
structure involves the creation of new connections (synapses) between neurons and
the modification of existing synapses. The dynamics of development of the synapses
is 1.8 million per second (from the first 2 months of birth to 2–3 years of age then it
is reduced on average by half in adulthood). The structure of a neuron is schematized
in Fig. 1.32. It consists of three main components: the cellular body or soma (the
central body of the neuron that includes the genetic heritage and performs cellular
functions), the axon (filiform nerve fiber), and the dendrites.
A synaptic connection is made by the axons, which constitute the output transmission lines of the electrochemical signals of the neurons, whose signal reception structure (input) consists of the dendrites (the name derives from their similarity to a tree structure), which have different ramifications. Therefore, a neuron can be seen
as an elementary unit that receives electrochemical impulses from different dendrites
and once processed in the soma several electrochemical impulses are transmitted to
other neurons through the axon. The end of the latter branches forming terminal
fibers from which the signals are transmitted to the dendrites of other neurons. The
transmission between axon and dendrite of other neurons does not occur through a
direct connection but there is a space between the two cells called synaptic fissure
or cleft or simply synapse.
A synapse is a junction between two neurons. A synapse is configured as a mushroom-shaped protrusion, called the synaptic node or knob, that extends from the axon toward the surface of the dendrite. The space between the synaptic node and the dendritic surface is precisely the synaptic fissure, through which the excited neuron propagates the signal by emitting chemical substances called neurotransmitters. These come into contact with the dendritic structure (consisting of post-synaptic receptors), causing an exchange of electrically charged atoms (ions), entering and leaving the dendritic structure, thus modifying the electrical charge of the dendrite.
In essence, an electrical signal is propagated from the axon, a chemical transmission
is propagated in the synaptic cleft, and then an electrical signal is propagated in the
dendritic structure.
The body of the neuron receiving the signals from its dendrites processes them by summing them and triggers an excitatory response (increasing the frequency of discharge of the signals) or an inhibitory one (decreasing the discharge frequency) in the post-synaptic neuron. Each post-synaptic neuron accumulates signals from other
neurons that add up to determine its excitation level. If the excitation level of the
neuron has reached a threshold level limit, this same neuron produces a signal
guaranteeing the further propagation of the information toward other neurons that
repeat the process.
During the propagation of each signal, the synaptic permeability, as well as the thresholds, are slightly adapted in relation to the signal intensity; for example, the activation (firing) threshold is lowered if the transfer is frequent, or is increased if the neuron has not been stimulated for a long time. This represents the plasticity of the neuron, that is, the ability to adapt to stimulations and stresses that lead the nerve cells to reorganize. Synaptic plasticity leads to the continuous remodeling of the synapses (removal or addition) and is the basis of the learning of the brain's abilities, in particular during the period of development of a living organism (in the early years of a child, new synaptic connections are formed at a rate of one million per second), during adult life (when plasticity is reduced), and also in the phases of functional recovery after any injuries.
The simplified functional scheme of the biological neural network, described in the previous paragraph, is sufficient to formulate a model of an artificial neural network (ANN) from the mathematical point of view. An ANN can be made using electronic components or it can be simulated in software on traditional digital computers. An ANN is also called neuro-computer, connectionist network, parallel distributed processor (PDP), associative network, etc.
Fig. 1.33 Mathematical model of the perceptron (a) and its geometric representation used as a linear binary classifier (b). In this 2D example of the feature space, the hyperplane g(x) = 0 is the line dividing the two classes ω1 and ω2
In analogy to the behavior of the human brain that learns from experience, even a
neural computational model must solve the problems in the same way without using
an algorithmic approach. In other words, an artificial neural network is seen as an
adaptive machine with some features. The network must adapt the nodes (neurons)
for the learning of knowledge through a learning phase observing examples, organiz-
ing and modeling this knowledge through the synaptic weights of the connections,
and finally making this knowledge available for its generalized use.
Before addressing how a neural network can be used to solve the classification
problem, it is necessary to analyze how an artificial neuron can be modeled, what the
neuron connection architecture can be, and how the learning phase can be realized.
ξ = ∑_{i=1}^{N} wi xi   (1.220)

The excitation level ξ, compared with a suitable threshold θ associated with the neuron, determines its final state, i.e., it produces the output y of the neuron, which models the electrochemical signal generated by the axon.
The nonlinear growth of the output value y after the excitation value has reached
the θ threshold level is determined by the activation function (i.e., the transfer func-
tion) σ for which, we have
y = σ(ξ) = σ(∑_{i=1}^{N} wi xi) = σ(wx) = { 1 if ξ ≥ θ;  0 if ξ < θ }   (1.221)
where the output y is binary with the step activation function σ (see Fig. 1.33a). A
more compact mathematical formulation of the neuron is obtained with a simple arti-
fice, considering the activation function σ with threshold zero and the true threshold
with opposite sign, seen as an additional input x0 = 1 with constant unitary value,
appropriately modulated by a weight coefficient w0 = −θ (also called bias) which
has the effect of controlling the translation of the activation threshold with respect
to the origin of the signals. It follows that the (1.221) becomes
y = σ(ξ) = σ(∑_{i=0}^{N} wi xi) = σ(wx) = { 1 if ξ ≥ 0;  0 if ξ < 0 }   (1.222)
where the vector of the input signals x and that of the weights w are augmented,
respectively, with x0 and w0 .
The activation function σ considered in the (1.222) is inspired by the functionality
of the biological neuron. Together with the θ threshold, they have the effect of limiting
the amplitude of the output signal y of a neuron. Alternatively, different activation
functions are used to simulate different functional models of the neuron based on
mathematical or physical criteria.
The simplest activation function used in neural network problems is the linear one, given by y = σ(ξ) = ξ (the output is the same as the input, and the function is defined in the range [−∞, +∞]). Other types of activation functions (see Fig. 1.34) are more
used: binary step, linear piecewise with threshold, nonlinear sigmoid, and hyperbolic
tangent.
1. Binary step:

σ(ξ) = { 1 if ξ ≥ 0;  0 if ξ < 0 }   (1.223)
Fig. 1.34 Activation functions that model the neuron output. From left to right: step activation function; linear piecewise; sigmoid; hyperbolic tangent
2. Linear piecewise:

σ(ξ) = { 1 if ξ > 1;  ξ if 0 ≤ ξ ≤ 1;  0 if ξ < 0 }   (1.224)
All activation functions assume values between 0 and 1 (except in some cases where
the interval can be defined between −1 and 1, as is the case for the hyperbolic tangent
activation function). As we shall see later, when we analyze the learning methods of a
neural network, these activation functions do not properly model the functionality of
a neuron and above all of a more complex neural network. In fact, synapses modeled
with simple weights are only a rough approximation of the functionality of biological
neurons that are a complex nonlinear dynamic system.
In particular, the nonlinear sigmoid and hyperbolic tangent activation functions are also inadequate when they saturate near the extremes of the interval, 0 or 1 (or −1 and +1), where the gradient tends to vanish. In these regions, the activation functions effectively stop operating, almost no output signal is generated by the neuron, and the update of the weights stalls. Other aspects concern the non-zero-centered output of the sigmoid activation function and the resulting slow convergence. The hyperbolic tangent function, although zero-centered, also presents the saturation problem. Despite these limitations, sigmoid and hyperbolic tangent functions have been frequently used in machine learning applications.
In recent years, with the great diffusion of deep learning, new activation functions have become very popular: ReLU, Leaky ReLU, Parametric ReLU, and ELU. These functions are simple and overcome the limitations of the previous ones. The description of these new activation functions is reported in Sect. 2.13 on Deep Learning.
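The following sketch collects the activation functions mentioned above (the exact parameterizations used in the book, e.g., a slope parameter in the sigmoid, may differ; the ones shown are the common standard forms):

import numpy as np

# Common activation functions (a sketch; parameterizations are assumptions).
def step(xi):        return np.where(xi >= 0, 1.0, 0.0)          # binary step (1.223)
def piecewise(xi):   return np.clip(xi, 0.0, 1.0)                # linear piecewise (1.224)
def sigmoid(xi):     return 1.0 / (1.0 + np.exp(-xi))            # logistic sigmoid
def tanh_act(xi):    return np.tanh(xi)                          # hyperbolic tangent
def relu(xi):        return np.maximum(0.0, xi)                  # ReLU
def leaky_relu(xi, a=0.01):  return np.where(xi > 0, xi, a * xi) # Leaky ReLU

xi = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, piecewise, sigmoid, tanh_act, relu, leaky_relu):
    print(f.__name__, np.round(f(xi), 3))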
The neuron model of McCulloch and Pitts (MP) does not learn: the weights and thresholds are determined analytically, and it operates with binary, discrete input and output signal values. The first successful neuro-computational model was the perceptron devised by Rosenblatt, based on the neuron model defined by MP.
The objective of the perceptron is to classify a set of input patterns (stimuli) x = (x1, x2, …, xN) into two classes ω1 and ω2. In geometric terms, the classification is characterized by the hyperplane that divides the space of the input patterns. This hyperplane is determined by the linear combination of the weights w = (w1, w2, …, wN) of the perceptron with the features of the pattern x. According to (1.222), the hyperplane separating the two decision regions is given by

w^T x = 0  or  ∑_{i=1}^{N} wi xi + w0 = 0   (1.227)
where the vector of the input signals x and that of the weights w are augmented, respectively, with x0 = 1 and w0 = −θ to include the bias20 of the excitation level. Figure 1.33b shows the geometrical interpretation of the perceptron used to
determine the hyperplane of separation, in this case a straight line, to classify two-
dimensional patterns in two classes. In essence, the N input stimuli to the neuron are
interpreted as the coordinates of a N -dimensional pattern projected in the Euclidean
space and the synaptic weights w0 , w1 , . . ., w N , including the bias, are seen as the
coefficients of the hyperplane equation we denote with g(x). A generic pattern x is
classified as follows:
g(x) = ∑_{i=1}^{N} wi xi + w0  { > 0 ⟹ x ∈ ω1;  < 0 ⟹ x ∈ ω2;  = 0 ⟹ x on the hyperplane }   (1.228)
20 In this context, the bias is seen as a constant that makes the perceptron more flexible. It has a function analogous to the constant b of a linear function y = ax + b, which, geometrically representing a line, allows the line to be positioned without necessarily passing through the origin (0, 0). In the context of the perceptron, it allows a more flexible displacement of the separating line to better adapt the prediction to the data.
vector form:
w^T x = w0 x0   (1.229)
where the signal and weight vectors do not include the bias. Equation (1.229) in this form is useful to observe that, if w and w0 x0 are constant, the projection xw of the vector x on w is constant, since21

xw = w0 x0 / |w|   (1.230)
We also observe (see Fig. 1.33b) that w determines the orientation of the decision plane (1.228), being orthogonal to it (in the space of augmented patterns) by (1.227), while the bias w0 determines the location of the decision surface. Equation (1.228) also tells us that, in 2D space, all the patterns x on the line separating the two classes have the same projection xw.
It follows that a generic pattern x can be classified, in essence, by evaluating whether the excitation level of the neuron is greater or less than the threshold value, as follows:

xw { > w0 x0/|w| ⟹ x ∈ ω1;  < w0 x0/|w| ⟹ x ∈ ω2 }   (1.231)
21 By definition, the scalar or inner product between two vectors x and w belonging to a vector space R^N is a symmetric bilinear form that associates these vectors with a scalar in the real number field R, indicated in analytic geometry with

⟨w, x⟩ = w · x = (w, x) = ∑_{i=1}^{N} wi xi

In matrix notation, considering the product between matrices, where w and x are seen as N × 1 matrices, the formal scalar product is written

w^T x = ∑_{i=1}^{N} wi xi

The (convex) angle θ between the two vectors in any Euclidean space is given by

θ = arccos( w^T x / (|w||x|) )

from which a useful geometric interpretation can be derived, namely, to find the orthogonal projection of one vector on the other (without calculating the angle θ). For example, considering that xw = |x| cos θ is the length of the orthogonal projection of x over w (or vice versa calculating wx), this projection is obtained considering that w^T x = |w| · |x| cos θ = |w| · xw, from which we have

xw = w^T x / |w|
The synaptic weights and threshold of the perceptron can be determined in two ways:
1. Direct calculation. Possible in the case of simple problems such as the creation of logic circuits (AND, OR, ...).
2. Calculation of the synaptic weights through an iterative process. To emulate biological learning from experience, the synaptic weights are adjusted to reduce the error between the output value generated by the appropriately stimulated perceptron and the correct output defined by the pattern samples (training set).
The latter is the most interesting aspect if one wants to use the perceptron as a neural
model for supervised classification. In this case, the single perceptron is trained (learning phase) by offering as input stimuli the features xi of the sample patterns and the values yj of the classes they belong to, in order to calculate the synaptic weights describing a classifier with a linear separation surface between the two classes.
Recall that the single perceptron can classify only linearly separable patterns (see
Fig. 1.35a).
It can be shown that the perceptron performs the learning phase by minimizing a cost function that compares the current response y(t) of the neuron at time t with the desired value d(t), appropriately adjusting the synaptic weights during the various iterations until converging to the optimal result.
Let P = {(x1, d1), (x2, d2), . . . , (xM, dM)} be the training set consisting of M pairs of sample patterns xk = (xk0, xk1, . . . , xkN), k = 1, . . . , M (augmented vectors with xk0 = 1) and the corresponding desired classes dk ∈ {0, 1} selected by the expert. Let w = (w0, w1, . . . , wN) be the augmented weight vector. Considering that xk0 = 1, w0 corresponds to the bias that will be learned instead of the constant bias θ. We will denote by w(t) the value of the weight vector at time t during the iterative process of perceptron learning.
The convergence algorithm of the perceptron learning phase consists of the fol-
lowing phases:
1. Initialization. At time t = 0, the weights are initialized with small random values (or w(t) = 0).
2. For each adaptation step t = 1, 2, 3, . . ., a pair (xk, dk) of the training set P is presented.
3. Activation and adaptation. The perceptron is activated by providing the feature vector xk. Its current response yk(t) is computed and compared with the desired output dk, and the weights are adapted according to the error dk − yk(t) as follows:

w(t + 1) = w(t) + η[dk − yk(t)] xk    (1.233)
where 0 < η ≤ 1 is the parameter that controls the degree of learning (known as
learning rate).
Note that the expression [dk − yk(t)] in the (1.233) indicates the discrepancy between the actual perceptron response yk(t) calculated for the input pattern xk and the desired output dk associated with this pattern. Essentially, the error of the t-th output of the perceptron is determined with respect to the k-th training pattern. Considering that dk, yk(t) ∈ {0, 1}, it follows that (dk − yk(t)) ∈ {−1, 0, 1}. Therefore, if this error is zero, the relative weights are not changed. Alternatively, this discrepancy can take the value 1 or −1, because only binary values are considered in output. In other words, the perceptron-based classification process is optimized by iteratively adjusting the synaptic weights so as to minimize the error.
A very small training parameter η (η ≈ 0) implies a very limited modification of the current weights, which remain almost unchanged with respect to the values reached with the previous adaptations. With high values (η ≈ 1), the synaptic weights are significantly modified, resulting in a high influence of the current training pattern together with the error dk − yk(t), as shown in the (1.233). In the latter case, the adaptation process is very fast. Normally, if one has a good knowledge of the training set, a fast adaptation with high values of η tends to be used.
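To make the rule concrete, here is a minimal sketch in Python/NumPy of a single perceptron trained with the update (1.233); the AND dataset, the step activation, and the stopping criterion are illustrative assumptions and not part of the text.

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Single perceptron trained with the rule w(t+1) = w(t) + eta*(d_k - y_k)*x_k.

    X : (M, N) training patterns, d : desired outputs in {0, 1}.
    Patterns are augmented with x0 = 1 so that w0 plays the role of the bias.
    """
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])      # augmented patterns
    w = np.zeros(Xa.shape[1])                          # w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for xk, dk in zip(Xa, d):
            yk = 1 if w @ xk >= 0 else 0               # step activation
            if yk != dk:                               # weights change only on error
                w += eta * (dk - yk) * xk
                errors += 1
        if errors == 0:                                # linearly separable: converged
            break
    return w

# Illustrative use: learning the AND logic function (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w = train_perceptron(X, d)
print(w, [(1 if w @ np.r_[1, x] >= 0 else 0) for x in X])
```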
By contrast, a statistical classifier that knows the distribution of the Gaussian classes controls the possible overlap of the class distributions through the statistical parameters of the covariance matrix.
The learning algorithm of the perceptron, not depending on the statistical param-
eters of the classes, is effective when it has to classify patterns whose features are
dependent on nonlinear physical phenomena and the distributions are strongly dif-
ferent from the Gaussian ones as assumed in the statistical approach. The perceptron
learning approach is adaptive and very simple to implement, requiring only memory
space for synaptic weights and thresholds.
where M is the set of patterns x misclassified by the perceptron using the weight vector w [18]. If all the samples were correctly classified, the set M would be empty and consequently the cost function J(w) would be zero. The effectiveness of this cost function is due to its differentiability with respect to the weight vector w. In fact, differentiating J(w) (Eq. 1.234) with respect to w, we get the gradient vector:
∇J(w) = Σ_{x∈M} (−x)    (1.235)
22 It should be noted that the optimization approach based on the gradient descent guarantees to find a local minimum of a function. It can also be used to search for a global minimum, randomly choosing a new starting point once a local minimum has been found and repeating the operation many times. In general, if the number of minima of the function is limited and the number of attempts is very high, there is a good chance of converging toward the global minimum.
where 0 < η ≤ 1 is still the parameter that controls the degree of learning (learning rate), defining the extent of the modification of the weight vector at each gradient descent step. We recall the criticality highlighted earlier in the choice of η for convergence.
The perceptron batch update rule based on the gradient descent, considering the
(1.235), has the following form:
w(t + 1) = w(t) + η Σ_{x∈M} x    (1.238)
The denomination batch rule derives from the fact that the adaptation of the weights at the t-th iteration occurs with the sum, weighted by η, of all the misclassified sample patterns in M.
From the geometrical point of view, this perceptron rule represents the sum of the algebraic distances between the hyperplane given by the weight vector w and the sample patterns in M of the training set for which there is a classification error. From the (1.238), we can derive the adaptation rule (also called on-line) of the perceptron based on a single misclassified sample xM, given by

w(t + 1) = w(t) + η xM    (1.239)
Fig. 1.37 Geometric interpretation of the perceptron learning model based on the gradient descent. a In the example, a pattern xM misclassified by the current weight w(t) is on the wrong side of the dividing line (that is, w(t)^T xM < 0); with the addition of η xM to the current vector, the weight vector moves the decision line in the appropriate direction to obtain a correct classification of the pattern xM. b Effect of the learning parameter η: large values, after adapting to the new weight w(t + 1), can misclassify a previous pattern, denoted by xt, which was instead correctly classified. c Conversely, small values of η are likely to leave xM still misclassified
From Fig. 1.37a, we can observe the effect of weight adaptation considering the single sample xM misclassified by the weight vector w(t), since w(t)^T xM ≤ 0.23 Applying the rule (1.239) to the weight vector w(t), that is, adding η·xM to it, we obtain the displacement of the decision hyperplane (remember that w is perpendicular to it) in the correct direction with respect to the misclassified pattern. It is useful to recall in this context the role of η. If it is too large (see Fig. 1.37b), a previous pattern xt correctly classified with w(t) would, after adaptation to the weight w(t + 1), now be classified incorrectly.
If it is too small (see Fig. 1.37c), the pattern xM after the adaptation to the weight w(t + 1) would still not be classified correctly. Applying the rule to the single sample, once the training set samples are augmented and normalized, these are processed individually in sequence: if at iteration t for the j-th sample we have w^T xj < 0, i.e., the sample is misclassified, we perform the adaptation of the weight vector with the (1.239); otherwise we leave the weight vector unchanged and go to the next (j + 1)-th sample.
For a predefined constant value of η, if the classes are linearly separable, the perceptron converges to a correct solution both with the batch rule (1.238) and with the single-sample rule (1.239). The iterative process of adaptation can be stopped to
23 In this context, the pattern vectors of the training set P, besides being augmented (x0 = 1), are also normalized, that is, all the patterns belonging to the class ω2 are replaced with their negative vector:

xj = −xj   ∀ xj ∈ ω2

It follows that a sample is classified incorrectly if:

w^T xj = Σ_{k=0}^{N} wk xjk < 0.
limit processing times or to avoid infinite oscillations in the case of nonlinearly separable classes.
The stop can take place by setting a maximum number of iterations or by imposing a minimum threshold on the cost function J(w(t)), while being aware that the quality of the generalization is not guaranteed. In addition, with reference to the Robbins–Monro algorithm [22], convergence can be analyzed by imposing an adaptation also for the learning rate η(t), starting from an initial value and then decreasing it over time according to η(t) = η0/t, where η0 is a constant and t the current iteration.
This type of classifier can also be used for nonlinearly separable classes, although convergence toward an optimal solution is not ensured: the weight adaptation procedure oscillates in the attempt to minimize the error, even when the trick of also updating the learning parameter is used.
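The batch rule (1.238) on augmented and normalized patterns can be sketched as follows (Python/NumPy); the synthetic Gaussian data and the maximum-iteration stop are assumptions for illustration.

```python
import numpy as np

def batch_perceptron(X1, X2, eta=0.1, max_iter=1000):
    """Batch rule (1.238): w <- w + eta * sum of misclassified (augmented, normalized) patterns."""
    X1a = np.hstack([np.ones((len(X1), 1)), X1])        # augment with x0 = 1
    X2a = -np.hstack([np.ones((len(X2), 1)), X2])       # normalize: negate class omega_2
    Z = np.vstack([X1a, X2a])
    w = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        M = Z[Z @ w <= 0]                               # misclassified: w^T x <= 0
        if len(M) == 0:                                 # all patterns correctly classified
            break
        w = w + eta * M.sum(axis=0)
    return w

rng = np.random.default_rng(0)
w = batch_perceptron(rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2)))
```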
w^T xi = bi    (1.240)

(Fig. 1.38: distance w^T xk/||w|| of the pattern vectors xk from the class-separation hyperplane g(x) = 0)
The MSE approach thus reduces the classification problem to the solution of a system of linear equations. Moreover, with the MSE algorithm, all the training set patterns are considered simultaneously, not just the misclassified ones. From the geometrical point of view, the MSE algorithm with w^T xi = bi proposes to calculate for each sample xi the distance bi from the hyperplane, normalized with respect to |w| (see Fig. 1.38). The compact matrix form of the (1.240) is given by
⎡ x10  x11  · · ·  x1d ⎤ ⎡ w0 ⎤   ⎡ b1 ⎤
⎢ x20  x21  · · ·  x2d ⎥ ⎢ w1 ⎥ = ⎢ b2 ⎥    ⇐⇒   Xw = b        (1.241)
⎢  ··             ··   ⎥ ⎢ ·· ⎥   ⎢ ·· ⎥
⎣ xN0  xN1  · · ·  xNd ⎦ ⎣ wd ⎦   ⎣ bN ⎦
     N × (d+1)         (d+1) × 1    N × 1
The goal is now to solve the system of linear equations (1.241). If the number of equations N is equal to the number of unknowns, i.e., the number of augmented features d + 1, we have the exact formal solution:

w = X^{-1} b    (1.242)

In general, however, N > d + 1, X is rectangular and an exact solution does not exist; we therefore define the error vector

ε = Xw − b    (1.243)
One approach is to try to minimize the module of the error vector, but this corresponds
to minimizing the sum function of the squared error:
JMSE(w) = ||Xw − b||² = Σ_{i=1}^{N} (w^T xi − bi)²    (1.244)
The minimization of the (1.244) can be solved by analytically calculating the gradient
and setting it to zero, differently from what is done with the perceptron. From the
∇JMSE(w) = dJMSE/dw = d/dw Σ_{i=1}^{N} (w^T xi − bi)²
         = Σ_{i=1}^{N} 2(w^T xi − bi) d/dw (w^T xi − bi)            (1.245)
         = Σ_{i=1}^{N} 2(w^T xi − bi) xi
         = 2 X^T (Xw − b)
Setting the gradient (1.245) to zero, instead of solving the system Xw = b we solve the equation

X^T X w = X^T b    (1.246)

with the advantage that X^T X is a square matrix of size (d + 1) × (d + 1) and is often non-singular. Under these conditions, we can solve the (1.246) with respect to w, obtaining the sought MSE solution:

w = (X^T X)^{-1} X^T b = X† b    (1.247)

where

X† = (X^T X)^{-1} X^T

is the pseudo-inverse of X, which satisfies X† X = I, where I is the identity matrix; X† is therefore an inverse on the left (in general it is not an inverse on the right, XX† ≠ I). Furthermore, it is observed that if X is square and non-singular, the pseudo-inverse coincides with the normal inverse matrix, X† = X^{-1}.
Like all regression problems, the solution can be conditioned by the uncertainty of the initial data, which then propagates to the error committed on the final result. If the training data are very correlated, the X^T X matrix could become almost singular and therefore not admit an inverse, preventing the use of the (1.247).
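As an illustration, a minimal sketch of the MSE solution (1.247) via the pseudo-inverse, assuming augmented and normalized patterns and an arbitrary positive margin vector b:

```python
import numpy as np

def mse_weights(X1, X2, b_value=1.0):
    """MSE solution (1.247): w = pinv(X) b, with augmented/normalized patterns and margins b > 0."""
    X1a = np.hstack([np.ones((len(X1), 1)), X1])
    X2a = -np.hstack([np.ones((len(X2), 1)), X2])   # class omega_2 patterns negated
    X = np.vstack([X1a, X2a])
    b = np.full(len(X), b_value)                    # arbitrary positive margins
    return np.linalg.pinv(X) @ b                    # X† b

rng = np.random.default_rng(1)
w = mse_weights(rng.normal(1.5, 1, (30, 2)), rng.normal(-1.5, 1, (30, 2)))
# classification of a new pattern x: class omega_1 if w^T [1, x] > 0
```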
This type of ill-conditioning can be approached with the linear regularization method, also known as ridge regression. The ridge estimator is defined in this way [23]:

wλ = (X^T X + λI)^{-1} X^T b    (1.248)

where λ ≥ 0 is the regularization parameter, chosen to guarantee an appropriate balance between the variance and the bias (distortion) of the estimator. For λ = 0, the ridge regression (1.248) coincides with the pseudo-inverse solution.
Normally, the proper choice of λ is found through a cross-validation approach. A
graphical exploration that represents the components of w in relation to the values
of λ is useful when analyzing the curves (traces of the ridge regressions) that tend to
stabilize for acceptable values of λ.
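A small sketch of the ridge estimator (1.248), useful for inspecting how the components of w stabilize as λ grows (the ridge traces); the synthetic data are an assumption.

```python
import numpy as np

def ridge_weights(X, b, lam):
    """Ridge estimator (1.248): w_lambda = (X^T X + lambda*I)^(-1) X^T b."""
    d1 = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d1), X.T @ b)

rng = np.random.default_rng(2)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])   # augmented patterns
b = np.ones(50)                                               # unit margins
for lam in (0.0, 0.1, 1.0, 10.0):                             # ridge trace exploration
    print(lam, ridge_weights(X, b, lam))
```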
The MSE solution also depends on the initial value of the margin vector b, which conditions the expected result w∗. An arbitrary choice of positive values of b can give an MSE solution whose discriminant function separates both linearly separable classes (even if this is not guaranteed) and non-separable ones. For b = 1, the MSE solution becomes identical to the Fisher linear discriminant solution. If the number of samples tends to infinity, the MSE solution approximates the Bayes discriminant function g(x) = p(ω1|x) − p(ω2|x).
It can be shown that if η(t) = η(1)/t, with an arbitrary positive value of η(1), this rule generates a weight vector w(t) that converges to the MSE solution, i.e., to a weight w such that X^T (Xw − b) = 0. Although the memory required for this update rule is reduced, considering the dimensions (d + 1) × (d + 1) of the X^T X matrix with respect to the (d + 1) × N matrix X†, a further memory reduction is obtained with the Widrow–Hoff procedure (or Least Mean Squared rule, LMS), which considers single samples sequentially:

w(t + 1) = w(t) + η(t)(bk − w(t)^T xk) xk
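A possible sketch of the Widrow–Hoff (LMS) single-sample update with decreasing learning rate η(t) = η0/t; the epoch loop and the schedule are illustrative assumptions.

```python
import numpy as np

def lms_widrow_hoff(X, b, eta0=0.1, epochs=50):
    """Widrow-Hoff (LMS) rule: w <- w + eta(t)*(b_k - w^T x_k)*x_k, single samples in sequence."""
    w = np.zeros(X.shape[1])
    t = 1
    for _ in range(epochs):
        for xk, bk in zip(X, b):
            w += (eta0 / t) * (bk - w @ xk) * xk   # single-sample correction
            t += 1                                 # decreasing learning rate eta0/t
    return w
```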
In the Ho–Kashyap procedure, both the weight vector w and the margin vector b are adapted. As a first step, the gradient ∇b JMSE of the MSE functional (1.244) with respect to the margin vector b is calculated, given by

∇b JMSE(w, b) = −2(Xw − b) = −2ε    (1.252)
which suggests a possible update rule for b. Since b is subject to the constraint b > 0, we start from this condition and, following the gradient descent, we prevent any component of the vector b from being reduced to negative values. In other words, the gradient descent is not free to move in every direction but is always constrained to move so that b remains positive. This is achieved through the following rule of adaptation of the margin vector (t is the iteration index):

b(t + 1) = b(t) − η(t) ∇b JMSE(w(t), b(t))    (1.253)

setting to zero all the positive components of ∇b JMSE or, equivalently, keeping only the positive components of the error in the second term of the last expression. Choosing the first option, the adaptation rule (1.253) for b becomes
b(t + 1) = b(t) − η(t) (1/2)[ ∇b JMSE(w(t), b(t)) − |∇b JMSE(w(t), b(t))| ]    (1.254)
where | • | indicates a vector to which we apply the absolute value to all of its
components. Remember that η indicates the learning parameter.
Summing up, the equations used for the Ho–Kashyap algorithm are the (1.252), for the calculation of the gradient ∇b JMSE(w, b); the (1.254), that is the adaptation rule to find the margin vector b once the weight vector w is fixed; and the (1.247) to minimize the gradient ∇w JMSE(w, b) with respect to the weight vector w, which we rewrite as w(t) = X† b(t). At this point, we can express the Ho–Kashyap algorithm with the following adaptation equations for the iterative calculation of both margin and weight vectors:
b(t + 1) = b(t) + 2η ε⁺(t)
w(t) = X† b(t),   t = 1, 2, . . .        (adaptation equations of Ho–Kashyap)    (1.256)

where ε⁺(t) = (1/2)[ε(t) + |ε(t)|] is the positive part of the error vector ε(t) = Xw(t) − b(t).
If the two classes are linearly separable, the Ho–Kashyap algorithm always produces a solution, reaching the condition ε(t) = 0, and stops (otherwise it continues the iteration as long as some components of the error vector are positive). In the case of non-separable classes, ε(t) will eventually have only negative components, proving the condition of non-separability. It is not possible to know in advance after how many iterations this condition of non-separability is encountered. The pseudo-inverse matrix is
calculated only once depending only on the samples of the training set. Considering
the high number of iterations required to limit the computational load, the algo-
rithm can be terminated by defining a maximum number of iterations or by setting a
minimum threshold for the error vector.
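A compact sketch of the adaptation equations (1.256) follows; the initial margin, the learning rate, the tolerance, and the stopping tests are illustrative assumptions.

```python
import numpy as np

def ho_kashyap(X, eta=0.3, b0=1.0, max_iter=5000, tol=1e-6):
    """Ho-Kashyap adaptation (1.256): b <- b + 2*eta*eps+, w = pinv(X) b,
    where eps = Xw - b and eps+ is its positive part."""
    X_pinv = np.linalg.pinv(X)              # computed only once from the training set
    b = np.full(X.shape[0], b0)             # initial margins b > 0
    w = X_pinv @ b
    for _ in range(max_iter):
        eps = X @ w - b
        if np.all(np.abs(eps) < tol):       # separable classes: solution reached
            break
        if np.all(eps <= 0) and np.any(eps < 0):
            break                           # evidence of non-separable classes
        eps_plus = 0.5 * (eps + np.abs(eps))
        b = b + 2 * eta * eps_plus
        w = X_pinv @ b
    return w, b, eps
```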
With the (1.260), the feature space is partitioned into K regions (see Fig. 1.40). The k-th discriminant function with the largest value gk(x) assigns the pattern x under consideration to the region Rk. In the case of equality, the pattern can be considered unclassified (it lies on the separation hyperplane).
Fig. 1.40 Classifier based on the MSE algorithm in the multiclass context. a and b show the ambiguous regions that arise when binary classifiers are used to separate the 3 classes. c and d instead show the correct classification of a multiclass MSE classifier that uses a number of discriminant functions gk(x) up to the maximum number of classes
From the (1.261), it follows that the difference of the weight vectors wj − wk is normal to the hyperplane Hjk and that the distance of a pattern x from Hjk is given by

(gj(x) − gk(x)) / ||wj − wk||
It follows that with the linear machine, the difference of the vectors is important
and not the vectors themselves. Furthermore not all K (K − 1)/2 region pairs must
be contiguous and a lower number of separation hyperplanes may be required (see
Fig. 1.40d). A multiclass classifier can be implemented as a direct extension of the
MSE approach used for two classes based on the pseudo-inverse matrix.
In this case, the N × (d + 1) matrix of the training set X = {X1, . . . , XK} can be organized by partitioning the rows so that it contains the patterns ordered by the K classes, that is, all the samples associated with a class ωk are contained in the submatrix Xk. Likewise, the weight matrix W = [w1, w2, . . . , wK] of size (d + 1) × K is constructed. Finally, the margin matrix B = [B1, B2, . . . , BK] of size N × K is partitioned into submatrices Bj (like X) whose elements are zero except those in the j-th column, which are set to 1. In essence, the problem is set as K MSE solutions in the generalized form:
XW = B (1.262)
The overall criterion to be minimized is

J(W) = Σ_{i=1}^{K} ||X wi − bi||²    (1.263)

whose minimization leads again to the pseudo-inverse solution W = X† B.
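A minimal sketch of the multiclass MSE solution W = X†B built from class-ordered submatrices, under the assumption that the patterns are already augmented:

```python
import numpy as np

def multiclass_mse(X_list):
    """Multiclass MSE (1.262): solve XW = B with W = pinv(X) B.
    X_list[k] holds the (already augmented) patterns of class k."""
    X = np.vstack(X_list)
    N, K = len(X), len(X_list)
    B = np.zeros((N, K))
    row = 0
    for k, Xk in enumerate(X_list):          # B_k: ones in column k for the samples of class k
        B[row:row + len(Xk), k] = 1.0
        row += len(Xk)
    return np.linalg.pinv(X) @ B             # (d+1) x K weight matrix

def classify(W, x_aug):
    """Assign x to the class with the largest discriminant g_k(x) = w_k^T x."""
    return int(np.argmax(x_aug @ W))
```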
1.10.4.5 Summary
A binary classifier based on the perceptron always finds the hyperplane separating the two classes only if these are linearly separable; otherwise it oscillates without ever converging. Convergence can be controlled by adequately updating the learning parameter, but there is no guarantee on the convergence point. A binary classifier that uses the MSE method converges both for linearly separable and for non-separable classes, but in some cases it may not find the separating hyperplane even for linearly separable classes. The solution with the pseudo-inverse matrix is used if the sample matrix X^T X is non-singular and not too large. Alternatively, the Widrow–Hoff algorithm can be used. In the following sections, we will describe how to develop a multiclass classifier based on multilayer perceptrons able to classify nonlinearly separable patterns.
1.11 Neural Networks
In Sects. 1.10.1 and 1.10.2 we described, respectively, the biological and mathematical model of a neuron, explored in the 1940s by McCulloch and Pitts with the aim of verifying the computational capability of a network made up of simple neurons. A first application of a neural network was the perceptron described previously for binary classification and applied to solve logical functions.
An artificial neural network (ANN) consists of simple neurons connected to each other in such a way that the output of each neuron serves as an input to many neurons, in a similar way as the axon terminals of a biological neuron are connected via synaptic connections with the dendrites of other neurons. The number of neurons and the way in which they are connected (topology) determine the architecture of a neural network. After the perceptron, in 1959, Bernard Widrow and Marcian Hoff of Stanford University developed the first neural network models (based on the Least Mean Squares, LMS, algorithm) to solve a real problem. These models are known as ADALINE (ADAptive LInear NEuron) and MADALINE (multilayer network of ADALINE units), realized, respectively, to eliminate echo in telephone lines and for pattern recognition.
Research on neural networks went through a period of darkness in the 1970s after the Perceptrons book of 1969 (by M. Minsky and S. Papert), which questioned the ability of neural models, limited to solving only linearly separable functions. This led to a limited availability of funds in this sector, which was revitalized only in the early 1980s when Hopfield [24] demonstrated, through a mathematical analysis, what could and could not be achieved through neural networks (he introduced the concepts of bidirectional connections between neurons and associative memory). Subsequently, research on neural networks took off intensely with the contribution of various researchers after whom the proposed neural network models were named: Grossberg–Carpenter for the ART (Adaptive Resonance Theory) network; Kohonen for the SOM (Self-Organizing Map); Y. LeCun, D. Parker, and Rumelhart–Hinton–Williams, who independently proposed the learning algorithm known as Backpropagation for an ANN network; Barto, Sutton, and Anderson for incremental learning based on Reinforcement Learning, ...
While it is understandable how to organize the topology of a neural network,
years of research have been necessary to model the computational aspects of state
change and the aspects of adaptation (configuration change). In essence, neural net-
works have been developed only gradually defining the modalities of interconnection
between neurons, their dynamics (how their state changes), and how to model the
process of adaptation of synaptic weights. All this in the context of a neural network
created by many interconnected neurons.
A linear machine that implements linear discriminant functions with the minimum error approach, in general, does not sufficiently satisfy the requirements of complex classification problems.
Fig. 1.41 Notations and symbols used to represent an MLP neural network with three layers and its extension for the calculation of the objective function

In an MLP (Multi-Layer Perceptron) network, the data flow leads from the input layer to the output layer through the individual neurons contained in each layer.
The ability of an NN-Neural Network to process information depends on the
inter-connectivity and the states of neurons that change and the synaptic weights that
are updated through an adaptation process that represents the learning activity of the
network starting from the samples of the training set. This last aspect, i.e., the network
update mode is controlled by the equations or rules that determine the dynamics and
functionality over time of the NN. The computational dynamics specifies the initial
state of an NN and the update rule over time, once the configuration and topology
of the network itself has been defined. A feedforward NN is characterized by a time-independent data flow (static system) where the output of each neuron depends only on the current input, in the manner specified by the activation function.
The adaptation dynamic specifies the initial configuration of the network and
the method of updating weights over time. Normally, the initial state of synaptic
weights is assigned with random values. The goal of the adaptation is to achieve
a network configuration such that the synaptic weights realize the desired function
from the input data (training pattern) provided. This type of adaptation is called
supervised learning. In other words, it is the expert that provides the network with the input samples and the desired output values, and during learning it is verified how much the network response agrees with the desired target value known a priori. A supervised feedforward NN is normally used as a function approximator. This is done with different learning models, for example, the backpropagation, which we will describe later.
Let us now see in detail how a supervised MLP network can be used for the clas-
sification in K classes of d-dimensional patterns. With reference to Fig. 1.41, we
describe the various components of an MLP network following the flow of data from
the input layer to the output layer.
(a) Input layer. With supervised learning, each sample pattern x = (x1 , . . . , xd ) is
presented to the network input layer.
(b) Intermediate layer. The neuron j-th of the middle layer (hidden) calculates the
activation value net j obtained from the inner product between the input vector
x and the vector of synaptic weights coming from the first layer of the network:
net_j = Σ_{i=1}^{d} w_ji x_i + w_j0    (1.265)
where the pattern and weight vectors are augmented to include the fictitious input component x0 = 1 and the corresponding bias weight.
(c) Activation function for hidden neurons. The j-th neuron of the intermediate layer
emits an output signal y j through the nonlinear activation function σ , given by
y_j = σ(net_j) = { 1  if net_j ≥ 0;  −1  if net_j < 0 }    (1.266)
(d) Output layer. Each output neuron k calculates the activation value netk obtained
with the inner product between the vector y (the output of the hidden neurons)
and the vector wk of the synaptic weights from the intermediate layer:
net_k = Σ_{j=1}^{Nh} w_kj y_j + w_k0    (1.267)
where Nh is the number of neurons in the intermediate layer. In this case, the
weight vector wk is augmented by considering a neuron bias which produces a
constant output y0 = 1.
(e) Activation function for output neurons. The neuron k-th of the output layer emits
an output signal z k through the non-linear activation function σ , given by
z_k = σ(net_k) = { 1  if net_k ≥ 0;  −1  if net_k < 0 }    (1.268)
The output z_k of each output neuron can be considered as a direct function of an input pattern x through the feedforward operations of the network. Furthermore, we can consider the entire feedforward process associated with a discriminant function gk(x) capable of separating a class (of the K classes) represented by the k-th output neuron. This discriminant function is obtained by combining the last four equations as follows:
gk(x) = z_k = σ( Σ_{j=1}^{Nh} w_kj σ( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 )    (1.269)

where the argument of the outer σ(•) is the activation net_k of the k-th output neuron,
where the internal expression (•) instead represents the activation of the j-th hidden
neuron net j given by the (1.265).
The activation function σ (net) must be continuous and differentiable. It can also be
different in different layers or even different for each neuron. The (1.269) represents
a category of discriminant functions that can be implemented by a three-layer MLP
network starting from the samples of the training set {x1 , x2 , . . . , x N } belonging to
K classes. The goal now is to find the network learning paradigm to get the synaptic
weights wk j and w ji that describe the functions gk (x) for all K classes.
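A minimal sketch of the feedforward computation of the discriminant functions gk(x) in (1.269); here a differentiable tanh activation is assumed in place of the threshold of (1.266) and (1.268), and the random weights are purely illustrative.

```python
import numpy as np

def mlp_feedforward(x, W_hidden, W_out, sigma=np.tanh):
    """Discriminant functions g_k(x) of a three-layer MLP, in the spirit of Eq. (1.269).

    W_hidden : (Nh, d+1) weights w_ji (column 0 is the bias w_j0)
    W_out    : (K, Nh+1) weights w_kj (column 0 is the bias w_k0)
    """
    x_aug = np.r_[1.0, x]                     # fictitious input x0 = 1
    net_j = W_hidden @ x_aug                  # hidden activations (1.265)
    y = np.r_[1.0, sigma(net_j)]              # hidden outputs, y0 = 1 for the bias
    net_k = W_out @ y                         # output activations (1.267)
    return sigma(net_k)                       # z_k = g_k(x)

rng = np.random.default_rng(3)
d, Nh, K = 4, 5, 3
g = mlp_feedforward(rng.normal(size=d),
                    rng.normal(scale=0.5, size=(Nh, d + 1)),
                    rng.normal(scale=0.5, size=(K, Nh + 1)))
predicted_class = int(np.argmax(g))           # assign x to the class with the largest g_k(x)
```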
It is shown that an MLP network, with three layers, an adequate number of nodes
per layer, and appropriate nonlinear activation functions, is sufficient to generate
discriminating functions capable of separating classes also nonlinearly separable in
a supervised context. The backpropagation algorithm is one of the simplest and most general methods for supervised learning in an MLP network. On the one hand, the theory demonstrates that it is possible to implement any continuous function from the training set through an MLP network; from a practical point of view, however, it does not give explicit indications on the network configuration in terms of the number of layers and neurons necessary. The network has two operating modes: the feedforward mode, in which an input pattern is propagated from the input layer to the output layer, and the learning mode, in which the synaptic weights are adapted. In the supervised case, learning is set up as the minimization of the following objective function (the sum of the squared errors over the training set):
J(W) = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{K} (t_nk − z_nk)² = (1/2) Σ_{n=1}^{N} ||t_n − z_n||²    (1.270)
where N indicates the number of samples in the training set, K is the number of neurons in the output layer (coinciding with the number of classes), and the factor of 1/2 is included to cancel the contribution of the exponent with the differentiation, as we will see later.
The backpropagation learning rule is based on the gradient descent. Once the
weights are initialized with random values, their adaptation to the t-th iteration occurs
in the direction that will reduce the error:
w(t + 1) = w(t) + Δw = w(t) − η ∂J(W)/∂w    (1.271)
where η is the learning parameter that establishes the extent of the weight change. The (1.271) drives the minimization of the objective function (1.270), which by construction never becomes negative. The learning rule guarantees that the adaptation process converges once all the input samples of the training set have been presented. Now let us look at the essential steps of supervised learning based on the backpropagation. The data of the problem are the samples xk of the training set, the output of the MLP network, and the desired target values tk. The unknowns are the weights of all the layers, to be updated with the (1.271), for which we must determine the adaptation Δw with the gradient descent:
Δw = −η ∂J(W)/∂w    (1.272)
for each network weight (weights are updated in the opposite direction to the gradient).
For simplicity, we will consider the objective function (1.270) for a single sample
(N = 1):
J(W) = (1/2) Σ_{k=1}^{K} (t_k − z_k)² = (1/2) ||t − z||²    (1.273)
We first compute the partial derivative of J with respect to the hidden → output weights w_kj by applying the chain rule:

∂J/∂w_kj = (∂J/∂z_k)(∂z_k/∂net_k)(∂net_k/∂w_kj)    (1.274)
Let us now calculate each partial derivative of the three terms of the (1.274)
separately.
Fig. 1.42 Reverse path, with respect to the feed-forward one shown in Fig. 1.41, followed during the learning phase for the backward propagation of the error: the backpropagation error δ_k^o of the k-th output neuron and the backpropagation error δ_j^h associated with the j-th hidden neuron
The first term, differentiating the (1.273) with respect to z_k, results

∂J/∂z_k = −(t_k − z_k) = z_k − t_k    (1.275)

The second term, considering the activation value of the k-th output neuron given by the (1.267) and the corresponding output signal z_k given by its nonlinear activation function σ, Eq. (1.268), results
∂z_k/∂net_k = ∂σ(net_k)/∂net_k = σ′(net_k)    (1.276)
The activation function is generally nonlinear; commonly the sigmoid function24 given by the (1.225) is chosen, which, substituted in the (1.276), gives
∂z_k/∂net_k = ∂/∂net_k [ 1/(1 + exp(−net_k)) ]
            = exp(−net_k) / (1 + exp(−net_k))² = (1 − z_k) z_k    (1.277)
24 The sigmoid or sigmoid curve function (in the shape of an S) is often used as a transfer function in neural networks considering its nonlinearity and easy differentiability. In fact, the derivative is given by

dσ(x)/dx = d/dx [ 1/(1 + exp(−x)) ] = σ(x)(1 − σ(x))

and is easily implementable.
The third term, considering the activation value netk given by the (1.267), results
∂net_k/∂w_kj = ∂/∂w_kj Σ_{n=1}^{Nh} w_kn y_n = y_j    (1.278)
From the (1.278), we observe that only one element in the sum netk (that is of the
inner product between the output vector y of the hidden neurons and the weight
vector wk of the output neuron) depends on wk j .
Combining the results obtained for the three terms, (1.275), (1.277), and (1.278),
we get
∂J/∂w_kj = (z_k − t_k)(1 − z_k) z_k y_j = δ_k y_j    (1.279)

with δ_k = (z_k − t_k)(1 − z_k) z_k.
Let us now calculate the partial derivative of J with respect to the input → hidden weights w_ji, again applying the chain rule:

∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)    (1.280)
In this case, the first term ∂ J/∂ y j of the (1.280) cannot be determined directly
because we do not have a desired value t j to compare with the output y j of a hid-
den neuron. The error signal must instead be recursively inherited from the error
signal of the neurons to which this hidden neuron is connected. For the MLP in
question, the derivative of the error function must consider the backpropagation
of the error of all the output neurons. In the case of multilayer MLP, reference
would be made to the neurons of the next layer. Thus, the derivative of the error
on the output y j of the j-th hidden neuron is obtained by considering the errors
propagated backward by the output neurons:
∂J/∂y_j = Σ_{n=1}^{K} (∂J/∂z_n)(∂z_n/∂net_n^o)(∂net_n^o/∂y_j)    (1.281)
The first two terms in the summation of the (1.281) have already been calculated, respectively, with the (1.275) and (1.277) in the previous step, and their product corresponds to the backpropagation error δ_n^o associated with the n-th output neuron:
(∂J/∂z_n)(∂z_n/∂net_n^o) = (z_n − t_n)(1 − z_n) z_n = δ_n^o    (1.282)
where here the propagation error is explicitly reported with the superscript “o” to indicate the association with the output neuron. The third term in the summation of the (1.281) is given by
∂net_n^o/∂y_j = ∂/∂y_j Σ_{s=1}^{Nh} w_ns y_s = w_nj^o    (1.283)
From the (1.283), it is observed that only one element in the sum netn (that is
of the inner product between the output vector y of the hidden neurons and the
weight vector wno of output neuron) depends on y j . Combining the results of the
derivatives (1.282) and (1.283), the derivative of the error on the output y j of
the j-th hidden neuron given by the (1.281), becomes
∂J/∂y_j = Σ_{n=1}^{K} (∂J/∂z_n)(∂z_n/∂net_n^o)(∂net_n^o/∂y_j)
        = Σ_{n=1}^{K} (z_n − t_n)(1 − z_n) z_n w_nj = Σ_{n=1}^{K} δ_n^o w_nj    (1.284)
From the (1.284), we highlight how the error propagates backward on the j-th
hidden neuron accumulating the error signals coming backward from all the K
neurons of output to which it is connected (see Fig. 1.42).
Moreover, this backpropagation error is weighted by the connection strength of the hidden neuron with all the output neurons. Returning to the (1.280), the second term ∂y_j/∂net_j and the third term ∂net_j/∂w_ji are calculated in a similar way to those of the output layer, from the Eqs. (1.277) and (1.278), which in this case are
∂y_j/∂net_j = (1 − y_j) y_j    (1.285)

∂net_j/∂w_ji = x_i    (1.286)
The final result of the partial derivatives ∂ J/∂w ji of the objective function, with
respect to the weights of the hidden neurons, is obtained by combining the results
of the single derivatives (1.284) and of the last two equations, as follows:
∂J/∂w_ji = [ Σ_{n=1}^{K} δ_n^o w_nj ] (1 − y_j) y_j x_i = δ_j^h x_i    (1.287)
where δ_j^h = [ Σ_{n=1}^{K} δ_n^o w_nj ] (1 − y_j) y_j indicates the backpropagated error related to the j-th hidden neuron. Recall that for the bias weight, the associated input value is x_i = 1.
4. Weights update. Once all the partial derivatives have been calculated, all the weights of the MLP network are updated in the direction of the negative gradient with the (1.271), considering the (1.279) and the (1.287). For the weights of the hidden → output neurons w_kj, we have

w_kj(t + 1) = w_kj(t) − η ∂J/∂w_kj = w_kj(t) − η δ_k^o y_j    k = 1, . . . , K;  j = 0, 1, . . . , Nh    (1.288)

and, similarly, for the weights of the input → hidden neurons w_ji

w_ji(t + 1) = w_ji(t) − η ∂J/∂w_ji = w_ji(t) − η δ_j^h x_i    j = 1, . . . , Nh;  i = 0, 1, . . . , d    (1.289)
Let us now analyze the weight update based on Eqs. (1.279) and (1.287) and see how it affects the network learning process. The gradient descent procedure is conditioned by the initial values of the weights; it is customary to set the initial weight values randomly. The update amount for the k-th output neuron is proportional to (z_k − t_k). It follows that no update occurs when the output of the neuron and the desired value coincide.
The sigmoid activation function is always positive and controls the output of the neurons. According to the (1.279), y_j and (z_k − t_k) together, depending on their signs, determine whether the weight value is adequately decreased or increased. It can also happen that a pattern presented to the network produces no signal (y_j = 0), and this implies no update of the corresponding weights.
The learning methods concern how the samples of the training set are presented and how the weights are updated. The three most common methods are the following (a code sketch of the stochastic variant is given after the list):
1. Online. Each sample is presented only once and the weights are updated after
the presentation of the sample (see Algorithm 6).
2. Stochastic. The samples are randomly chosen from the training set and the
weights are updated after the presentation of each sample (see Algorithm 7).
3. Batch Backpropagation. Also called off-line, the weights are updated after the
presentation of all the samples of the training set. The variations of the weights
for each sample are stored and the update of the weights takes place only when
all the samples have been presented only once. In fact, the objective function of
batch learning is the (1.270) and its derivative is the sum of the derivatives for
each sample:

∂J(W)/∂w = Σ_{n=1}^{N} ∂/∂w [ (1/2) Σ_{k=1}^{K} (t_nk − z_nk)² ]    (1.290)
where the partial derivatives of the expression [•] have been calculated previ-
ously and are those related to the objective function of the single sample (see
Algorithm 8).
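As an illustration of the stochastic variant, here is a compact backpropagation sketch for a three-layer MLP with sigmoid activations; the XOR data, the number of hidden neurons, and the learning rate are assumptions, not prescriptions from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, Nh=8, eta=0.5, epochs=2000, seed=0):
    """Stochastic backpropagation for a three-layer MLP with sigmoid activations.

    X : (N, d) input patterns, T : (N, K) target vectors (e.g., one-hot classes).
    Updates follow (1.288)-(1.289): w_kj -= eta*delta_k*y_j, w_ji -= eta*delta_j*x_i.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = T.shape[1]
    Wh = rng.uniform(-1, 1, (Nh, d + 1)) / np.sqrt(d)     # input -> hidden (bias column 0)
    Wo = rng.uniform(-1, 1, (K, Nh + 1)) / np.sqrt(Nh)    # hidden -> output (bias column 0)
    for _ in range(epochs):
        for n in rng.permutation(N):                      # stochastic sample presentation
            x = np.r_[1.0, X[n]]
            y = np.r_[1.0, sigmoid(Wh @ x)]               # hidden outputs, y0 = 1
            z = sigmoid(Wo @ y)                           # network outputs
            delta_o = (z - T[n]) * z * (1 - z)            # (1.282): output errors
            delta_h = (Wo[:, 1:].T @ delta_o) * y[1:] * (1 - y[1:])   # (1.287): hidden errors
            Wo -= eta * np.outer(delta_o, y)              # (1.288)
            Wh -= eta * np.outer(delta_h, x)              # (1.289)
    return Wh, Wo

# Illustrative use: XOR, a nonlinearly separable problem the single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)
Wh, Wo = train_backprop(X, T, Nh=4, eta=1.0, epochs=5000)
```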
From the experimental analysis, the stochastic method is faster than the batch
even if the latter fully uses the direction of the gradient descent to converge. Online
training is used when the number of samples is very large but is sensitive to the order
in which the samples of the training set are presented.
An MLP network is able to approximate any nonlinear function if the training set of samples (input data/desired output data) presented is adequate. Let us now see what the level of generalization of the network is, that is, the ability to recognize a pattern not presented in the training phase and not very different from the sample patterns. The learning dynamics of the network is such that at the beginning the error on the samples is very high and then decreases, tending asymptotically to a value that depends on: the Bayesian error of the samples, the size of the training set, the network configuration (number of neurons and layers), and the initial value of the weights.
A graphical representation of the learning dynamics (see Fig. 1.43) is obtained by plotting the error against the number of epochs performed. From the learning curve obtained, one can decide the level of training and stop it. Normally, the learning is stopped when the imposed error is reached or when an asymptotic value is reached. A situation of saturation in learning can occur, in the sense that an attempt is made to approximate the training data excessively (for example, when the samples are presented many times), generating the phenomenon of overfitting (in this context, overtraining) with the consequent loss of the generalization capability of the network when the test context is then entered.
A strategy to control the adequacy of the level of learning achieved is to use test
samples, other than the training samples, and validate the generalization behavior of
the network. On the basis of the results obtained, it is also possible to reconfigure
the network in terms of number of nodes. A strategy that allows an appropriate
configuration of the network (and avoid the problem of overfitting) is that of having
a third set of samples called validation. The dynamics of learning are analyzed with
the two curves that are obtained from the training set and the one related to the
(Fig. 1.43: error curves for the training, testing, and validation sets as a function of the epochs, with the point of early stopping on the validation curve)
validation set (see Fig. 1.43). From the comparison of the curves, one can decide to stop the learning at the local minimum reached on the validation curve (early stopping).
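A generic early-stopping loop can be sketched as follows; the callbacks train_step and validation_error, as well as the patience criterion, are hypothetical names introduced only for illustration.

```python
import numpy as np

def early_stopping(train_step, validation_error, max_epochs=500, patience=20):
    """Stop training when the validation error has not improved for `patience`
    consecutive epochs and keep the best weights found so far.

    train_step()        -> performs one epoch of training and returns the current weights
    validation_error(w) -> error of the network with weights w on the validation set
    """
    best_err, best_w, wait = np.inf, None, 0
    for _ in range(max_epochs):
        w = train_step()
        err = validation_error(w)
        if err < best_err:
            best_err, best_w, wait = err, w, 0
        else:
            wait += 1
            if wait >= patience:            # point of early stopping
                break
    return best_w, best_err
```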
We have seen that a neural network like the MLP is based on mathematical/computational foundations inspired by biological neural networks. The backpropagation algorithm used for supervised learning is set up as a problem of minimizing the error associated with the training set patterns. Under these conditions, it can be shown that the convergence of the backpropagation is possible, both in probabilistic and in deterministic terms. Nevertheless, in real applications it is useful to introduce heuristics aimed at optimizing the implementation of an MLP network, in particular for the aspects of classification and pattern recognition.
(Figure: error function with a local minimum and the global minimum along the weight axis w)
A commonly used heuristic is the momentum, which updates the weights taking into account the past iterations. Let Δw(t) = w(t) − w(t − 1) be the variation of the weights at the t-th iteration; the adaptation rule of the weights (for example, considering the 1.289) is modified as follows:

w(t + 1) = w(t) + (1 − α)[ −η ∂J/∂w ] + α Δw(t − 1)    (1.291)
where α (also called momentum) is a positive number with values between 0 and 1, and the expression [•] is the variation of the weights associated with the gradient descent, as expected for the backpropagation rule. In essence, the α parameter determines the amount of influence of the previous iterations on the current one. The momentum introduces a sort of damping on the dynamics of adaptation of the weights, avoiding oscillations in the irregular areas of the surface of the error function by averaging the components of the gradient with opposite sign, and speeding up the convergence in the flat areas. This helps to prevent the search process from being blocked in a local minimum. For α = 0, we have the pure gradient descent rule; for α = 1, the gradient descent is ignored and the weights are updated with a constant variation (equal to the previous one). Normally, α = 0.9 is used.
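A minimal sketch of the momentum update in the spirit of (1.291); grad_fn is a hypothetical function returning ∂J/∂w.

```python
import numpy as np

def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Weight update with momentum: blend the gradient step with the previous variation."""
    delta = (1 - alpha) * (-eta * grad) + alpha * prev_delta
    return w + delta, delta

# Usage inside a training loop (grad_fn assumed to return dJ/dw):
# delta = np.zeros_like(w)
# for t in range(epochs):
#     w, delta = momentum_update(w, grad_fn(w), delta)
```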
The activation function that satisfies all the properties described above is the
following sigmoid function:
σ(net) = a · tanh(b · net) = a (e^{b·net} − e^{−b·net}) / (e^{b·net} + e^{−b·net})    (1.292)
with the following optimal values for a = 1.716 and b = 2/3, and the linear interval
−1 < net < 1.
Too large initial weights can saturate the neurons in the hidden layer, with the consequent slowing of the learning process. According to the interval of definition of the sigmoid function, the heuristic used for the choice of the weights of the layer I-H (input → hidden) is that of a uniform random distribution in the interval (−1/√d, 1/√d), while for the layer H-O (hidden → output) the interval is (−1/√Nh, 1/√Nh), where d indicates the dimensionality of the samples and Nh the number of hidden neurons.
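A small sketch of the activation (1.292) and of the uniform initialization heuristic just described; the a, b values and the intervals follow the text, while the function names are assumptions.

```python
import numpy as np

def tanh_activation(net, a=1.716, b=2.0 / 3.0):
    """Sigmoid activation (1.292): sigma(net) = a*tanh(b*net), with the suggested a and b."""
    return a * np.tanh(b * net)

def init_weights(d, Nh, K, seed=0):
    """Uniform initialization heuristic: I-H weights in (-1/sqrt(d), 1/sqrt(d)),
    H-O weights in (-1/sqrt(Nh), 1/sqrt(Nh)); bias columns included."""
    rng = np.random.default_rng(seed)
    Wh = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), (Nh, d + 1))
    Wo = rng.uniform(-1 / np.sqrt(Nh), 1 / np.sqrt(Nh), (K, Nh + 1))
    return Wh, Wo
```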
The backpropagation algorithm is applicable for MLP with more than one hidden
layer. The increase in the number of hidden layers does not improve the approxi-
mation power of any function. A 3-layer MLP is sufficient. From the experimental
analysis, for some applications, it was observed that the configuration of an MLP
with more than 3 layers presents a faster learning phase with the use of a smaller
number of hidden neurons altogether. However, there is a greater predisposition of
the network to the problem of the local minimum.
equivalent to the descent of the gradient using an objective function based on the regularization method.25

25 In the fields of machine learning and inverse problems, the regularization consists of the introduction of additional information (for example, a penalty term) in order to solve an ill-posed problem or to prevent overfitting.

1.13 Decision Tree

The decision tree is a model used for classification that is generated by analyzing the various attributes (in this context, the attributes or features constitute the instance space, i.e., the attribute/value pairs) that describe a pattern (object) to be classified.
Fig. 1.45 Functional scheme of a classifier based on a decision tree. The induction learning of the decision tree prediction model takes place by analyzing the instances (attribute/value pairs and associated class) of the training set samples. The validity of the model is then verified by deduction on the test set patterns, predicting the classes of samples whose true class is known
(Figure: decision tree built starting from the root node, using the samples of the training set whose membership class is known)
Figure 1.47 shows the structure of a decision tree created with a training set of patterns x characterized by attributes of the type color, size, shape, taste. We can see how easy it is to understand and interpret the decision tree for a classification problem. For example, the pattern x = (yellow, medium, thin, sweet) is classified as banana because the path from the root to the banana leaf encodes a conjunction of tests on attributes: (color = yellow) and (shape = thin). It is also observed that different paths can lead to the same class by encoding a disjunction of conjunctions. In the figure, the tree shows two paths leading to the class apple: (color = green) and (size = medium), or (color = red) and (size = medium).
Although decision trees give a concise representation of a classification process, they can be difficult to interpret, especially for an extended tree. In this case, we can use a simpler representation through classification rules, easily obtainable from the tree. In other words, a path can be transformed into a rule. For example, the following rule can be derived from the tree in Fig. 1.47 to describe the grape pattern: (size = small) and (taste = sweet) and not (color = yellow).
We have already seen the structure of a decision tree (see Fig. 1.45) and how easy it is to interpret it for pattern classification (see Fig. 1.47). When the training set D of patterns, with defined attributes and classes, is very extensive, building a decision tree to classify generic patterns can be very complex. In this case, the tree can grow significantly and become hardly interpretable. Control criteria are therefore needed to limit its growth, based on the depth reachable by the tree or on the minimum number of pattern samples present in each node, in order to carry out the appropriate partition
(Fig. 1.47: decision tree on the attributes Color, Dimension, Form, and Taste. Color = Green → Dimension test (Large: watermelon, Medium: apple, Small: grapes); Color = Yellow → Form test (Round → Dimension test: Large: grapefruit, Small: lemon; Thin: banana); Color = Red → Dimension test (Medium: apple, Small → Taste test: acrid: cherry, sweet: grapes))
(division) of the training set into ever smaller subsets. The top-down approach to building the learning tree, which partitions the training set using logical test conditions on one attribute at a time, involves the following steps:
1. Create the root node by assigning it the most significant attribute to trigger the classification process, assign it the whole training set D, and create the arcs for all possible test values associated with the significant attribute. The samples of the training set are distributed to the descendant nodes in relation to the value of the attribute of the source node. The process continues recursively considering the samples associated with the descendant nodes, choosing for these the most significant attribute for the test.
2. Examine the current node:
   a. If all the samples of the node have the same membership class ω, assign this class to the node, which becomes a leaf, and the process stops.
   b. Evaluate a measure of significance for each attribute of the partitioned samples.
   c. Associate with the node the test that maximizes the significance measure.
   d. For each test value, create a descendant node, associating with the arc the condition of maximum significance and creating a subtree with the samples that satisfy this condition.
This type of algorithm for the top-down construction of the decision tree, which recursively extends the tree and partitions the training set, is known as the divide and conquer (from the Latin divide et impera) algorithm. In each iteration, the algorithm partitions the training set using the results of the attribute test. The various algorithms differ in relation to the test function used to limit the growth of the tree and to the partitioning mode that controls the distribution of the training set into smaller subsets
in each node until a stop criterion is encountered that does not compromise the
accuracy of the classification or prediction.
The main algorithms known in the literature are ID3 and C4.5 by Quinlan [25,26], and CART [27]. The ID3 algorithm performs the top-down construction of the tree by growing it and appropriately choosing an attribute in each node, while the C4.5 and CART algorithms in addition implement a pruning phase, checking that the tree does not become too extensive and complex in terms of number of nodes, number of leaves, depth, and number of attributes. Several other algorithms available in the literature are characterized by the introduction of some variants with respect to those mentioned above. An evaluation and comparison of these algorithms are reported in [28].
Recall that the concept of entropy defined in thermodynamics provides a measure of disorder. In information theory, this concept is used as a measure of the uncertainty or information in a random variable. In this context, entropy is used to measure the information content of an attribute, that is, a measure of the impurity (or homogeneity) associated with the training set samples. In other words, in this context, the measure of entropy indicates the value of the disorder (or diversity or impurity) when a set of samples belongs to different classes, while, if the samples of this set all have the same class, the information content is zero.
Given a training set D = {(x1 , ω(x1 )), (x2 , ω(x2 )), . . . , (x N , ω(x N ))}, contain-
ing samples belonging to classes ωi , i = 1, . . . , C, the entropy of training set D is
defined by:
H(D) = − Σ_{i=1}^{C} p(ωi) log2 p(ωi)    (1.294)
where p(ωi) indicates the fraction of the samples in D belonging to the class ωi. If we consider a training set D with samples belonging to two classes with boolean value (see Table 1.1), nominally indicated with the symbols “⊕” and “⊖”, the value of the entropy H(D) for this boolean classification is

H(D) = −p⊕ log2 p⊕ − p⊖ log2 p⊖ = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940    (1.295)

where p⊕ = 9/14 is the fraction of samples with positive class and p⊖ = 1 − p⊕ = 5/14 is the fraction of samples belonging to the negative class. It is observed that the entropy has zero value if the samples in D all belong to the same class (purity), in this case if they are all positive or all negative. In fact, we have the following:

H(D) = −1 · log2(1) − 0 · log2(0) = 0
considering the properties of the logarithm function.26 Figure 1.48 shows the plot
of the entropy H (D) for the Boolean classification according to p⊕ between 0
26 Normally, logarithm values are given with respect to base 10 or to the Napier number e. With the change of base, the logarithm in base 2 of a number x is obtained as log2(x) = log10(x)/log10(2). The entropy values calculated above are obtained considering that log2(1) = 0; log2(2) = 1; log2(1/2) = −1; and (1/2) log2(1/2) = (1/2)(−1) = −1/2.
Fig. 1.48 Entropy graph H(D) associated with the binary classification of the training set samples D, expressed as a function of the distribution p⊕ of the positive examples. The entropy reaches its maximum value (1.0) with an equal distribution of the classes and its minimum value (0) if D contains samples of the same class
(corresponding to all samples with negative class) and 1 (all samples positive). The maximum value of the entropy occurs when the distribution of the classes is uniform (in the example, p⊕ = p⊖ = 1/2). We can interpret the entropy as a value of the disorder (impurity) of the distribution of classes in D.
According to information theory, entropy in this context has the meaning of the minimal information content needed to encode, in terms of bits, the class to which a generic sample x belongs. The logarithm is expressed in base 2 precisely in relation to this binary coding.
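A minimal sketch of the entropy (1.294) computed in bits, verified on the boolean training set of the example (9 positive and 5 negative samples); the list encoding of the labels is an assumption.

```python
import numpy as np

def entropy(labels):
    """Entropy H(D) of a set of class labels, Eq. (1.294), in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# The boolean training set of the tennis example: 9 positive, 5 negative samples
print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940
```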
Let us now look at a measure based on entropy that evaluates the effectiveness of an attribute in the classification process of a training set. The Information Gain measures the reduction of entropy that is obtained by partitioning the training set with respect to a single attribute. Let A be a generic attribute; the information gain G(D, A) of A relative to the training set of samples D is defined as follows:
G(D, A) = H(D) − Σ_i (|Di|/|D|) H(Di)    (1.296)
where Di , i = 1, . . . , n A are the subsets deriving from the partition of the entire set
D using the attribute A with n A values.
It is observed that the first term of the (1.296) is the entropy of the whole set D calculated with the (1.295), while the second term represents the mean entropy, that is, the sum of the entropies of each subset Di weighted by the fraction |Di|/|D| of samples belonging to Di. The most significant attribute to choose is the one that maximizes the information gain G, which is equivalent to minimizing the average entropy (the second term of the 1.296), since the first term, the entropy H(D), is constant for all attributes. In other words, the attribute that maximizes G is the one that most reduces the entropy (i.e., the disorder).
Let us go back to the samples of Table 1.1 and calculate the significance of the attribute Humidity, which has 7 occurrences with value high and 7 with value normal. For Humidity = high, 3 samples are positive and 4 negative, while for Humidity = normal, 6 are positive and 1 negative. Applying the (1.296) and using the value of the entropy H(D) given by the (1.295), the information gain G(D, Humidity), partitioning the training set D with respect to the attribute Humidity, results
G(D, Humidity) = H(D) − Σ_i (|Di|/|D|) H(Di)
               = 0.940 − (7/14) H(D_High) − (7/14) H(D_Normal)            (1.297)
               = 0.940 − (7/14) 0.985 − (7/14) 0.592 = 0.151
where H(D_High) and H(D_Normal) are the entropies calculated for the subsets D_High and D_Normal, selected for the samples associated with the Humidity attribute with value, respectively, High and Normal.
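A small sketch of the information gain (1.296), reproducing the Humidity computation of (1.297); the list encoding of the attribute values and labels is an assumption.

```python
import numpy as np

def entropy(labels):
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute_values, labels):
    """Information gain G(D, A), Eq. (1.296): H(D) minus the average entropy
    of the subsets D_i induced by the values of attribute A."""
    attribute_values, labels = np.asarray(attribute_values), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Humidity example of (1.297): 7 High (3 yes / 4 no), 7 Normal (6 yes / 1 no)
humidity = ["High"] * 7 + ["Normal"] * 7
play = ["yes"] * 3 + ["no"] * 4 + ["yes"] * 6 + ["no"]
print(information_gain(humidity, play))   # ~0.151
```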
The ID3 algorithm for building the tree selects as significant attribute the one with the highest value of the information gain G. Therefore, with the training set of Table 1.1, to decide the best weather day to play tennis in a 2-week time frame, ID3 can build the decision tree. The classification leads to a binary result: one plays if the path on the tree leads to the positive class ω1 = yes, one does not play if it leads to the negative class ω2 = no. The root node of the tree is generated by evaluating G for the attributes (Outlook, Wind, Temperature, and Humidity). The information gain for these attributes is calculated by applying the (1.297), as done for the Humidity attribute, obtaining
G(D, Outlook) = 0.246
G(D, Wind) = 0.048
G(D, Temperature) = 0.029
G(D, Humidity) = 0.151
The choice falls on the Outlook attribute, which has the highest value of the information gain, G(D, Outlook) = 0.246. The top-down construction of the tree (see Fig. 1.49) continues by creating 3 branches from the root node, as many as the values that the Outlook attribute can take: Sunny, Overcast, Rain. The question to ask now is the following: are the nodes corresponding to the 3 branches leaf nodes, or root nodes of subtrees to be created by growing the tree further?
Analyzing the training set samples (see Table 1.1), we have the 3 subsets D_Sunny = {D1, D2, D8, D9, D11}, D_Overcast = {D3, D7, D12, D13}, and D_Rain = {D4, D5, D6, D10, D14} associated, respectively, with the 3 values of the Outlook attribute. It is observed that all the samples of the subset D_Overcast have positive class (associated with Outlook = Overcast) and therefore the corresponding node is a leaf node with class/action PlayTennis = yes.
Fig. 1.49 Generation of the decision tree associated with the training set D consisting of the samples of Table 1.1. a Once the Outlook attribute is selected as the most significant, 3 child nodes are created, as many as the values of that attribute, which partition D into 3 subsets. The child node associated with the value Outlook = Overcast is a leaf node since it includes samples with the same class ω1 = Yes, while the other two nodes are to be partitioned, having samples with different classes. b For the first child node, associated with the value Outlook = Sunny, the information gain G is calculated for the remaining attributes (Wind, Temperature, and Humidity) to select the most significant one. c The attribute Humidity is selected, which generates two leaf nodes associated with the two values (High, Normal) of this attribute. The process of building the tree continues, for the third node associated with Outlook = Rain, as done in (b), selecting the most significant attribute
For the other two branches, Sunny and Rain, the associated nodes are not leaves and will become root nodes of subtrees to be built. For the node associated with the Sunny branch, what was done on the initial node of the tree is repeated, i.e., the most significant attribute must be chosen. Since we have already used the Outlook attribute, it remains to select among the 3 attributes Humidity, Temperature, and Wind. Considering the subset D_Sunny and applying the (1.297), the information gain for the 3 attributes results
G(D Sunny , H umidit y) = 0.970
G(D Sunny , T emperatur e) = 0.570
G(D Sunny , W ind) = 0.019
The choice falls on the attribute Humidity, which has the highest value of G. This attribute has two values, high and normal; therefore, from this node there will be two branches and the process of building the tree proceeds by selecting attributes (excluding those already selected at the higher levels of the tree) and partitioning the training set until leaf nodes are obtained, i.e., until all the associated samples have the same class (zero entropy of the associated subset). Figure 1.50 shows the entire decision tree built. The same tree can be expressed by rules (see the rules in Algorithm 9). These are generated, for each leaf node, by testing each attribute along a path that
[Fig. 1.50: root node Outlook over D = {D1, . . . , D14} [9 Yes, 5 No]; Sunny branch → Humidity ({D1, D2, D8}: 3 No for High; {D9, D11}: 2 Yes for Normal); Overcast branch → leaf Yes ({D3, D7, D12, D13}: 4 Yes); Rain branch → Wind ({D6, D14}: 2 No for Strong; {D4, D5, D10}: 3 Yes for Weak).]
Fig. 1.50 The complete decision tree, associated with the training set D relative to the samples of
Table 1.1, generated with the ID3 algorithm. Starting from the root node to which all training set D
is associated, for each child node are then reported the subsets deriving from the partitions together
with the frequency of the two classes in these subsets
starts from the root node and arrives at the leaf node (precondition of the rule), while
the classification of the leaf is the rule post-condition. The rules and the pseudo-code of the ID3 recursive step are reported below.
The rules (Algorithm 9) derived from the tree of Fig. 1.50 are the following:

if (Outlook = Sunny) and (Humidity = High) then Play = No;
if (Outlook = Sunny) and (Humidity = Normal) then Play = Yes;
if (Outlook = Overcast) then Play = Yes;
if (Outlook = Rain) and (Wind = Strong) then Play = No;
if (Outlook = Rain) and (Wind = Weak) then Play = Yes;

The recursive step of the ID3 algorithm can be summarized as follows:

if all the samples associated with the current node have the same class then
   the node is a leaf labeled with that class
else
   1. Select a significant attribute A according to the Entropy and Information Gain function
   2. Create a new node in DT and use the attribute A as a test
   3. For each value vi of A, partition the samples accordingly and repeat the procedure on each subset
end if
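A minimal Python sketch of the procedure just described (entropy, information gain, and the greedy recursive growth) is given below; the dictionary-based sample representation, the target name Play, and the function names are illustrative assumptions and not notation from the text.

# Minimal ID3 sketch: entropy, information gain, and recursive tree construction.
# Samples are assumed to be dicts of attribute values plus a 'Play' class label.
import math
from collections import Counter

def entropy(samples, target='Play'):
    counts = Counter(s[target] for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(samples, attribute, target='Play'):
    n = len(samples)
    gain = entropy(samples, target)
    for value in {s[attribute] for s in samples}:
        subset = [s for s in samples if s[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset, target)
    return gain

def id3(samples, attributes, target='Play'):
    classes = {s[target] for s in samples}
    if len(classes) == 1:                       # pure node: create a leaf
        return classes.pop()
    if not attributes:                          # no attribute left: majority class
        return Counter(s[target] for s in samples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(samples, a, target))
    tree = {best: {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree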
The peculiarity of the ID3 algorithm is its greedy search, that is, it chooses the best
attribute and never goes back to reconsider previous choices. ID3 is a non-
incremental algorithm, meaning that it derives the classes from the training set of
instances available at the start. An incremental algorithm [29] instead revises the
current class definition, if necessary, when a new sample is presented.
The classes created by ID3 are inductive, that is, the classification carried out by
ID3 is based on the intrinsic knowledge of the instances contained in the training set,
which is assumed to hold also for the future instances presented in the test
phase. The induction of classes cannot be shown to always work, since an infinite
number of patterns could be presented for classification. Therefore, ID3 (or any
inductive algorithm) may misclassify.
The description and examples provided show that ID3 is easy to use. It is
mainly used to replace the expert who would normally build a decision tree classifier
manually. ID3 has been used in various industrial applications, for medical diagnosis
and for the assessment of credit risk (or insolvency).
Information gain as a partitioning indicator has the tendency to favor tests on
attributes with many values. Another problem arises when the training set has a
limited number of samples or the data is affected by noise. In all these cases, the
problem of overfitting27 may occur, i.e., the selection of a nonoptimal attribute for the
prediction, together with the problem of strong fragmentation, when the training subsets
become too small to represent significantly the samples of a certain class. Returning
to the samples of Table 1.1, if we add a Date attribute, we get a situation in which
the information gain G favors this attribute for the partition precisely because it has a
high number of possible values. The Date attribute, with its large number
of values, would correctly predict the samples of the training set and would be preferred
at the root node, producing a very wide tree of depth 1. The
problem occurs when test samples are presented and the tree is unable to generalize.
In fact, since the Date attribute has many values, it blocks further growth of the
tree by partitioning the training set into very small subsets. It follows that the entropy of the
partition caused by Date would be zero (each day would produce a different and pure
subset, as it consists of a unique sample with a single class), with the corresponding
maximum value of G(D, Date) = 0.940 bits.
To eliminate this problem, Quinlan [25] introduced an alternative measure to G,
known as the Gain Ratio GR, which normalizes the information gain. This new measure
penalizes attributes with many values through the term
Split Information SplitIn(D, A), which measures the information due to the partition
of D in relation to the values of the attribute A. The Gain Ratio GR is defined as
follows:

GR(D, A) = G(D, A) / SplitIn(D, A)    (1.298)

where

SplitIn(D, A) = − Σ_{i=1}^{C} (|Di|/|D|) log2(|Di|/|D|)    (1.299)
with Di, i = 1, . . . , C, the subsets of the training set D partitioned by the
C values of the attribute A. It should be noted that SplitIn(D, A) is the entropy of
D calculated with respect to the values of the attribute A. Furthermore,
SplitIn(D, A) penalizes attributes with a large number of values which partition
27 In the context of supervised learning, a learning algorithm uses the training set samples to predict
the class to which other samples, not presented during the learning phase, belong in the test phase.
In other words, it is assumed that the learning model is able to generalize. It can happen instead,
especially when the learning is too adapted to the training samples or when there is a limited number
of training samples, that the model adapts to characteristics that are specific only to the training set
and does not have the same prediction capacity (for example, to classify) on the samples of
the test phase. We are, therefore, in the presence of overfitting, where performance (i.e., the ability
to adapt/predict) on training data increases, while performance on unseen data worsens. In
general, the problem of overfitting can be limited with cross-validation in statistics or with
early stopping in the learning context. Decision trees that are too large are not easily understood and
often exhibit overfitting, known in the literature as a violation of Occam's Razor (the philosophical
principle which suggests the futility of formulating more hypotheses than those strictly
necessary to explain a given phenomenon when the initial ones may be sufficient).
D into many subsets all of the same cardinality. For the attribute Date, we would
have

SplitIn(D, Date) = − Σ_{i=1}^{14} (|Di|/|D|) log2(|Di|/|D|) = −14 · (1/14) · log2(1/14) = 3.807

and

GR(D, Date) = G(D, Date) / SplitIn(D, Date) = 0.940 / 3.807 = 0.246
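The following short Python sketch reproduces the Split Information and Gain Ratio computation of (1.298)–(1.299); the function names and the list-based representation of the attribute values are assumptions made for illustration.

# Sketch of SplitIn and GR = G / SplitIn (Eqs. 1.298-1.299).
import math
from collections import Counter

def split_info(values):
    # values: the attribute value observed for each sample of D
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    si = split_info(values)
    return gain / si if si > 0 else 0.0   # guard against a zero denominator

# Worked check on the Date attribute: 14 distinct values on 14 samples.
dates = ['D%d' % i for i in range(1, 15)]     # illustrative unique values
print(round(split_info(dates), 3))            # -> 3.807
print(round(gain_ratio(0.940, dates), 3))     # -> 0.247 (0.246 in the text, up to rounding)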
It should be pointed out that for the training set example considered, the Date
attribute would still be the winner, having a value of GR higher than that of the other
attributes. Nevertheless, GR proves more reliable than the information gain G. The
choice of attributes must, however, be made carefully, first selecting those with a value
of G higher than the average information gain of all attributes; the final
choice is then made on the attribute with the greatest GR. The same heuristic is used in
cases where the denominator of (1.298) tends toward zero.
In summary, the ID3 algorithm uses an approach based on the information gain
measure to navigate the attribute space in building the decision tree. The search converges
on a single hypothesis, proceeding greedily without ever backtracking. The
construction of the tree (learning phase) is based only on the samples of the training
set and therefore represents a non-incremental method of learning. This produces the
drawback of not being able to update the tree when a new sample is badly classified,
requiring the regeneration of the tree. It uses the statistical information of the whole training
set, and this makes it less sensitive to the noise of the individual training
samples. ID3 is limited to working with only one attribute at a time and cannot handle
numeric attributes.
The C4.5 algorithm is an evolution of ID3, also proposed by Quinlan [26]. C4.5
likewise uses the Gain Ratio as the partitioning criterion. Compared to the ID3 algorithm,
it has the following improvements:
1. It handles both discrete and continuous attributes. In the case of attributes with
numerical values, the test is performed over an interval, for example by appropri-
ately dividing it in binary mode. If A is an attribute with numeric values (as is the
case for the attribute Temperature of the training set of Table 1.1), it can be dis-
cretized or represented with a boolean value by dividing the interval with an appropriate
threshold t (for example, if A < t → Ac = True, otherwise Ac = False).
The appropriate choice of the threshold t can be evaluated with the maximum
information gain G (or considering the Gain Ratio) by partitioning the training set
according to the values of A into the two subsets A ≤ t and A > t (a sketch of this
threshold search is given after this list).
2. Partitioning stops when the number of instances to be partitioned is less than a
certain threshold.
3. It manages training set samples with missing attribute values. If an attribute A with
missing values needs to be tested in the learning phase, we can use the following
approaches:
(a) We choose the most probable value, that is, the one most frequently observed for
A in the samples associated with the node.
(b) We consider all the values vi of x.A, assigning an estimated prob-
ability p(vi) computed on the samples belonging to the node under examination. This
probability is calculated from the observed frequency of the various values of
A among the samples of the node under examination. We assign a fraction
p(vi) of x to each descendant of the node, which thus receives a weight proportional
to its importance. The calculation of the information gain G(D, A) then occurs
using these proportionality weights.
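A possible sketch of the threshold search mentioned in item 1 is the following; taking candidate thresholds at the midpoints between consecutive sorted values is a common convention assumed here and is not necessarily the exact C4.5 procedure.

# Sketch: choose a threshold t for a numeric attribute by maximizing the
# information gain of the binary split A <= t / A > t.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base, n = entropy(ys), len(ys)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue
        t = (xs[i] + xs[i - 1]) / 2.0            # midpoint between consecutive values
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Example with hypothetical temperature values and Yes/No classes:
t, g = best_threshold([64, 65, 68, 69, 70, 71], ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'No'])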
In the construction of the decision tree, we have previously highlighted the possibility
of blocking growth through a stop criterion based, for example, on the maximum
depth or on a maximum number of partitions. These methods are limited
in that they tend to create either trees that are too small and undersized to classify correctly,
or trees that are very large and over-sized compared to the training set. This last aspect
concerns the problem of overfitting highlighted in the previous paragraphs.
A method to avoid the problem of overfitting is pruning the tree [27].
The strategy is to grow the tree without restrictive stop criteria, initially accepting
a tree over-sized with respect to the training set, and then, in a second phase, to prune
the tree by adequately removing branches that do not contribute to the generalization
and accuracy of the classification process. Before describing the methods of pruning,
we give the formal definition of overfitting in this context.
Figure 1.51 shows the consequences of overfitting in the context of learning with a
decision tree. (Fig. 1.51 plots the accuracy against the number of nodes, i.e., the tree size,
for the training set, for the test set, and for the test set after post-pruning.)
The graph shows the accuracy trend of the predictions made by the
tree in the learning phase using the training set and in the test phase considering the
samples of the test set not processed by the tree in the training phase. Accuracy varies
with the number of nodes as the tree grows by examining the training set samples in
the learning phase. It is observed how the accuracy decreases, with the samples of
the test set, after the tree reaches a certain size (depending on the type of application)
in terms of number of nodes. This decrease in accuracy is due to the random noise
of the training set samples and the test set.
There are various methods of pruning decision trees which, for simplicity, can be
grouped as follows:
present the same statistical fluctuation. Normally, the validation set can help verify
the existence of some abnormal training set samples.
This is made possible in practice when the two sets of samples are adequately
sized, in the ratio of 2/3 for the training set and 1/3 for the validation set.
An alternative measure of tree performance after pruning is given by calculating
the minimum length (Minimum Description Length—MDL) to describe the tree [30].
This measurement is expressed as the number of bits required to code the decision
tree. This evaluation method selects decision trees with a shorter length.
The basic idea of these algorithms is first to build the complete tree, including all possible
attributes, and later, with pruning, to remove the parts of the tree associated with attributes
selected due to random effects. The simplification of the tree takes
place using a post-pruning strategy based on two post-pruning operators: subtree
replacement and subtree raising. In the first case, with the substitution of a subtree,
the initial tree is modified after all its subtrees have been analyzed and possibly replaced
with leaf nodes. For example, the entire subtree shown in Fig. 1.52, which includes
two internal nodes and 4 leaf nodes, is replaced with a single leaf node. This pruning
involves less accuracy if evaluated on the training set, while the accuracy can increase if
the test set is considered.
This operator is applied starting from the leaf nodes and proceeds backward
toward the root node. In fact, in the example of Fig. 1.52, the replacement of
the 3 child nodes of the subtree with root X with a single leaf node is considered first. Later, going
backward, it is evaluated whether to prune the subtree with root B, which now has only two
child nodes, and replace it with a single leaf node, as shown in the figure.
In the second case, the subtree raising operator deletes a node (and consequently the
subtree of which this node is the root), causing the raising of an
entire subtree, as shown in Fig. 1.53.
Fig. 1.52 Pruning with the subtree replacement operator. In the example, the subtree with the root
B is replaced with a leaf node ω1
Fig. 1.53 Pruning with the subtree raising operator. In the example, the subtree is raised with root
C deleting the subtree with root B and redistributing the instances of leaf nodes 4 and 5 in the
node C
Although in the figure the child nodes of B and C are referred to as leaves, these
can be subtrees. Furthermore, using the subtree raising operator, it is necessary to
reclassify the samples associated with the suppressed nodes, which in the example
correspond to nodes 4 and 5, in the new subtree with root node C. This explains
why the child nodes of C, after pruning, are labeled 1′, 2′, and 3′ (the
instances are redistributed to include the samples associated with the initial nodes
4 and 5) with respect to the initial configuration before pruning. This last operator, based on
deleting a node, is slower.
We now describe two strategies of pruning (Reduced Error Pruning and Rule
Post-Pruning) and how to evaluate the accuracy of the pruned tree.
The Rule Post-Pruning strategy involves the following steps:
1. Build the decision tree DT from the training set D, possibly allowing overfitting.
2. Convert the decision tree DT into an equivalent set of rules by creating a rule
for each path from the root node to the leaf nodes (see the rules in Algorithm 9).
3. Generalize (prune) each rule, i.e., try to remove any precondition of the rule
whose removal generates an improvement in accuracy.
4. Order the rules thus obtained by their estimated accuracy and consider
them in this sequence when classifying new instances.
The preconditions considered for removal are (Outlook = Sunny) and
(Humidity = High). The precondition to be pruned is the one whose removal produces an
improvement in accuracy; pruning is not performed if the elimination of a precondi-
tion produces a decrease in accuracy. The process is iterated for each rule.
The accuracy can be evaluated, as done previously, using the validation set, if the
data are numerous enough to keep the training set separated from
the validation set.
The C4.5 algorithm evaluates the performance based on the training set itself,
checking whether the estimated error is reduced, by deriving confidence intervals with
a statistical test on the learning data. In particular, C4.5 assumes that the realistic
error is at the upper limit of this confidence interval (pessimistic error estimate) [26].
The accuracy estimate on the training set D is made by assuming that the probability
of error has a binomial distribution.
With this assumption, the standard deviation σ is calculated. For a given confi-
dence level d (for example, d = 0.95), the realistic error e falls d% of the times
within the confidence interval dependent on σ. As a pessimistic estimate, the worst end
of the interval is chosen, which corresponds to the esti-
mated accuracy − 1.96 · σ. This method of pessimistic pruning, despite being
a heuristic approach without strong statistical foundations, produces acceptable
results in practice. If an internal node is pruned, then all the nodes descending from it are removed,
thus obtaining a fast pruning procedure.
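As an illustration of the pessimistic estimate, the following sketch inflates the observed error rate to the upper end of a confidence interval under a normal approximation of the binomial error; the exact formula used by C4.5 differs in its details (e.g., continuity correction), so this is only a minimal sketch.

# Pessimistic error estimate for a node evaluated on its own training samples:
# the observed error rate is inflated by z standard deviations (z = 1.96 ~ 95%).
import math

def pessimistic_error(errors, n, z=1.96):
    e = errors / n                        # observed error rate on the node
    sigma = math.sqrt(e * (1.0 - e) / n)  # std. dev. of the binomial proportion
    return e + z * sigma                  # pessimistic (upper) bound on the error

# A node covering 20 samples with 3 misclassified:
print(round(pessimistic_error(3, 20), 3))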
CART [27] stands for Classification And Regression Trees. The CART algorithm
is among the best known for constructing classification and regression trees.28 The
supervised CART algorithm generates binary trees, i.e., trees in which each node
has only two arcs. This does not limit the possibility of modeling complex trees. A
CART tree is built in a greedy way, as for C4.5 trees, but the type of tree produced
is substantially different. An important feature of CART is the ability to generate
regression trees (see note 28). CART uses discrete and continuous attributes. CART
constructs the binary tree using the information gain (see the previous paragraph) or
the Gini index as the criterion for splitting the training set in each node.
The splitting criterion based on the Gini index29 was applied in the CART algorithm
by Breiman [27] for the construction of binary decision trees, with the advantage
of being usable also for regression problems. As an alternative to the
entropy, the Gini index, indicated with Gini(D), is a measure of impurity (disorder)
of the training set D to be partitioned at a certain node t of the tree. The generalized
form of this index is given by

Gini(D) = 1 − Σ_{i=1}^{C} p²(ωi)    (1.300)
28 Decision trees that predict a categorical variable (i.e., the class to which a pattern vector belongs)
are commonly called classification trees, while those that predict continuous-type variables (i.e., real
numbers) and not a class are named regression trees. However, classification trees can describe
the attributes of a pattern also in the form of discrete intervals.
29 Introduced by the Italian statistician Corrado Gini in 1912 in Variability and Mutability to represent the inequality in the distribution of a statistical quantity (typically income or wealth) within a population.
where p(ωi) indicates the relative frequency (i.e., the probability) of the class ωi in
the training set D at the node under examination t, and C indicates the number of
classes. If the training set D consists of samples belonging to the same class, the Gini
index is 1 − 1² = 0, considering that the probability distribution satisfies Σ_{i=1}^{C} p(ωi) = 1;
this corresponds to the maximum purity of D. If instead the probability distribution is
uniform, i.e., p(ωi) = 1/C ∀i, the Gini index reaches its maximum value
1 − Σ_{i=1}^{C} (1/C)² = 1 − 1/C.
As a splitting criterion, when a node t is partitioned into K subsets Dj, j =
1, . . . , K in relation to a generic attribute A, the average Gini index (as an alternative
to the average entropy) Gini_split(D, A) is used, defined by

Gini_split(D, A) = Σ_{j=1}^{K} (|Dj|/|D|) Gini(Dj)    (1.301)

where | • | indicates the number of samples of the training set D and of the subsets Dj. Basically,
in correspondence with the training set D associated with the node under examination
t, the various values of the Gini index relative to the subsets Dj of the partition
of D are weighted by the ratios |Dj|/|D|. The best attribute selected for the node t is
the one corresponding to the minimum value of Gini_split(D, A), evaluated for each
attribute.
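A compact Python sketch of (1.300) and (1.301) follows; the list-of-labels representation of the subsets is an assumption made for illustration.

# Gini impurity (1.300) and weighted Gini of a partition (1.301);
# the attribute minimizing gini_split is selected for the node.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    # partitions: list of label lists, one per subset D_j of the node
    n = sum(len(p) for p in partitions)
    return sum((len(p) / n) * gini(p) for p in partitions)

print(gini(['y', 'y', 'n', 'n']))                 # 0.5, the two-class maximum
print(gini_split([['y', 'y'], ['n', 'n', 'y']]))  # weighted impurity of the split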
In the simplest case, that is, with a training set D consisting of samples belonging
to only two classes (ω1 and ω2), the Gini index given by (1.300) and the average
Gini index become

Gini(D) = p(ω1) p(ω2)    (1.302)

Gini_split(D1, D2, A) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)    (1.303)

where the training set D relative to the node under examination t is partitioned into
the two subsets D1 and D2. In two-class classification problems, the Gini index (1.302)
can be interpreted as a variance of impurity. In fact, we can imagine for the node t that
the associated samples D are the realizations of a Bernoulli random experiment,
where the Bernoulli random variable is the class ω to be assigned to each sample.
Assigning the value 1 to the class ω1 and 0 to the class ω2, we will have

ω = 1 with probability p(ω = 1) = p(ω1),   ω = 0 with probability p(ω = 0) = 1 − p(ω1)    (1.304)

Thus we obtain, respectively, the expected value E(ω) = μ = p(ω1) and the
variance Var(ω) = σ² = p(ω1)(1 − p(ω1)) = p(ω2)(1 − p(ω2)) = p(ω1) p(ω2) =
Gini(D). If the two classes are equiprobable, that is, p(ω1) = p(ω2) = 1/2 (the worst
condition for classification), the variance reaches its maximum value, while it becomes
zero (its minimum) when the samples all belong to a single class, i.e., p(ω1) equal to
0 or 1 (the best condition for classification).
The splitting criterion based on the Gini index, therefore, involves partitioning the
training set D associated with the node under examination t into two subsets, mini-
mizing the value of the variance. Figure 1.54 shows the comparison between the split-
ting criteria based on the entropy equation (1.294), on the Gini index equation (1.303),
and on the misclassification error MC(D) given by MC(D) = min(p(ω1), p(ω2)),
for a problem with 2 equiprobable classes, remembering that p(ω1) and p(ω2) indi-
cate, respectively, the probability that a generic sample belongs to the class ω1 or to
the class ω2.
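The comparison can be reproduced numerically with the following sketch; the Gini curve is computed here directly from (1.300) specialized to two classes, i.e., 2p(1 − p), which differs from the compact form (1.302) only by a constant factor.

# Two-class impurity measures as a function of p = p(w1):
# entropy, Gini (2p(1-p) form), and misclassification error min(p, 1-p).
import math

def entropy2(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini2(p):
    return 2 * p * (1 - p)     # 1 - p^2 - (1-p)^2

def misclass(p):
    return min(p, 1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print("p=%.2f  H=%.3f  Gini=%.3f  MC=%.3f" % (p, entropy2(p), gini2(p), misclass(p)))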
Returning to the CART algorithm for the construction of the binary tree, the Gini
index (measure of impurity) given by (1.303) is used, choosing the attribute
corresponding to the minimum value of the impurity in partitioning the training set
of the node under examination into two subsets (relative to the two child nodes).
Equivalently, the optimal splitting criterion can be expressed as the measure that maximizes
the impurity gradient (i.e., the reduction of the impurity), given by

Gini_split(D1, D2, A) = 1 − (|D1|/|D|) Gini(D1) − (|D2|/|D|) Gini(D2)    (1.305)

Therefore, the splitting strategy is to choose the attribute A that maximizes
(1.305). If instead the entropy is used as a measure of impurity, the
corresponding strategy is the choice of the attribute that produces the highest
value of the information gain. The Gini index tends to isolate the largest class from
all the others, while the entropy-based criteria tend to find sets of more balanced
classes.
The construction of a multiclass binary decision tree is made with the Twoing
criterion, an extension of the Gini index. The strategy is to find the partition that best
divides the groupings of samples of the C classes. The approach is to optimally divide the
samples of all the classes into two superclasses: C1, which contains a subset of all
the classes, and C2, which contains the rest of the remaining samples. The strategy
consists in choosing the attribute A by applying the two-class Gini index to these two superclasses.
The classification based on decision trees requires the prediction of a discrete value,
the class to which a sample belongs. This is achieved by learning the decision
tree starting from a training set of samples whose class is known in advance.
The tree is constructed by iteratively selecting the best attribute and partitioning the
training set according to this attribute, once the relative information content has been
evaluated using the entropy, the information gain, or the impurity measure of the
partitioned training set in each node. The pruning process is applied after the tree is
built to eliminate redundant nodes or subtrees. In essence, the inductive decision tree
is a nonparametric method for creating a classification model. Therefore, it does not
require any prior knowledge of the class probability distribution.
Among the advantages of decision trees are the following:
(a) They are self-explanatory, in particular when they are compact, and therefore easily
understandable. If the number of leaf nodes is small, they are accessible even to
nonexperts.
(b) Their immediate conversion into a set of easily understandable rules.
(c) They can handle samples whose attributes can be nominal or numerical (discrete
and continuous, CART).
(d) They allow accurate estimates even when the training set data contains noise,
for example samples with incorrect class or attributes with inaccurate values
(CART).
(e) They can also manage training sets with missing attributes.
(f) They manage attributes with different costs. In some applications it is convenient
to build the decision tree by placing the least expensive attributes as close as
possible to the root node with the highest probability to verify them. One way
to influence the choice of the meaningful attribute in relation to the cost of the
attribute A is achieved by inserting the term Cost (A) into the function that
chooses the optimal attribute.
(g) Decision trees are based on nonparametric approaches and this implies that they
have no assumptions about the distribution of attributes and the structure of the
classifier.
Among the disadvantages are the following:
(a) Limited scalability to very large training sets, in particular when there are many
attributes, which also implies a considerable computational complexity.
(b) Different algorithms, such as ID3 and C4.5, require target attributes with only
discrete values.
(c) The greedy nature of the tree growth algorithms, based on selecting an attribute
to partition the training set, does not take into account the relevance (or irrelevance)
that an attribute may turn out to have on future partitions in the top-down approach.
(d) The approach used for the construction of decision trees is of the divide and
conquer type, which tends to work correctly if there are highly relevant attributes,
but less so if many complex interactions are present. One reason is that other classifiers
can compactly describe a classifier that would be very difficult to represent with
a decision tree. Furthermore, since most decision trees divide the instance space into
mutually exclusive regions to represent a concept, in some cases the tree must
contain several duplicates of the same subtree in order to represent the classifier.
30 In the theory of computational complexity, decision problems are grouped into classes, among which
P and NP. The first includes all those decision problems that can be solved
(on a deterministic Turing machine) in a time that is polynomial with respect
to the size of the problem, that is, they admit algorithms whose worst-case running time is polynomial;
it includes the tractable problems. The second class, NP, includes the decision problems whose
solutions can be verified in polynomial time (by a nondeterministic Turing machine in polynomial time).
For many problems in NP no polynomial-time resolving algorithm is known, and the known algorithms
require, in the worst case, a running time that grows more than polynomially; such problems are
also called intractable in terms of calculation time. The NP-complete problems are the most difficult
problems of the class NP: if an algorithm were found that solves any
NP-complete problem in a reasonable (i.e., polynomial) time, then it could be used to
solve every NP problem in polynomial time. The theory of complexity has not yet established whether
the class NP is strictly larger than the class P or whether the two coincide.
leaf node and the root node which is of the order of O(log N ) (average cost). The total
number of reclassifications is O(N log N ). It follows that the overall computational
load of the inductive decision tree is
The algorithm can end earlier if you also impose a predefined number of clusters
Cmax to be extracted. As described in the algorithm, the essential step (line 5) is
the calculation of similarity, i.e., closeness between two clusters. The similarity
calculation method characterizes the various hierarchical agglomerative clustering
algorithms. The similarity measure is normally associated with the measurement of
Fig. 1.56 Graphical representation of the types of calculation of distance measurement (proximity)
between clusters: a MIN, minimum distance (single linkage); b MAX, maximum distance (complete
linkage); c average distance (average linkage); d distance between centroids
distance between the patterns of two clusters. Figure 1.56 graphically shows 4 ways
to calculate the distance between the patterns of two clusters: minimum distance dmin ,
maximum distance dmax , average distance dmed , and distance between centroids dcnt .
31 In graph theory, given a spanning tree with weighted arcs, it is also possible to define the
minimum spanning tree (MST), that is, a spanning tree such that the sum of the weights of its arcs
is minimal.
where Ni and Nj are the numbers of patterns, respectively, of the classes ωi and
ωj.
Distance measurements based on dcnt and dmed are more robust in the presence
of noisy patterns (outliers) than the distance measures given by dmin and dmax, which
are more sensitive to outliers. Furthermore, it should be noted that the distance
between centroids dcnt is computationally simpler than dmed, which instead requires
the calculation of Ni × Nj distances.
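The four inter-cluster distances of Fig. 1.56 can be sketched in Python as follows, assuming clusters represented as lists of feature vectors and the Euclidean metric; the function names are illustrative.

# Inter-cluster distances: single, complete, average linkage, and centroid.
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_min(A, B):    # single linkage (minimum distance)
    return min(euclid(a, b) for a in A for b in B)

def d_max(A, B):    # complete linkage (maximum distance)
    return max(euclid(a, b) for a in A for b in B)

def d_avg(A, B):    # average linkage: Ni x Nj distances
    return sum(euclid(a, b) for a in A for b in B) / (len(A) * len(B))

def d_centroid(A, B):
    ca = [sum(x) / len(A) for x in zip(*A)]
    cb = [sum(x) / len(B) for x in zip(*B)]
    return euclid(ca, cb)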
There are several algorithms developed for divisive hierarchical clustering that
differ in the way in which the worst cluster is defined (for example, the cluster with the
largest number of patterns, the largest diameter, the highest variance, or the largest
sum of squared errors) and in how clusters are divided (for example, median-median in
Step 0 Initially, we start with all the patterns as individual clusters: {Foggia},
{Bari}, {Taranto}, {Brindisi}, {Lecce}. From Table 1.2, we observe that the min-
imum distance between clusters is the one relative to the two clusters {Brindisi}
and {Lecce}, with value 39.
Step 1 Once the clusters with minimum distance are found, they are merged into
a single cluster and the distance table is updated considering the current clusters
{Foggia}, {Bari}, {Taranto}, {Brindisi, Lecce}. From Table 1.3, it results
that the minimum distance between clusters is the one relative to the clusters
{Brindisi, Lecce} and {Taranto}, with value 55.
Step 2 Once the clusters with minimum distance are found, they are merged into a sin-
gle cluster and the distance table is updated considering the current clusters
{Foggia}, {Bari}, {Brindisi, Lecce, Taranto}. From Table 1.4, it results
that the minimum distance between clusters is the one relative to the clusters
{Brindisi, Lecce, Taranto} and {Bari}, with value 97.
Step 3 Once the clusters with minimum distance are found, they are merged into a sin-
gle cluster and the distance table is updated considering the current clusters
{Foggia} and {Brindisi, Lecce, Taranto, Bari} (see Table 1.5). This last table
indicates that the minimum distance between the only two remaining clusters
is 137; they are merged, forming the single cluster {Brindisi, Lecce,
Taranto, Bari, Foggia}.
Step 4 The procedure ends, obtaining the single cluster that includes all the
patterns.
Figure 1.57 shows, through a dendrogram, the results of the hierarchical agglomerative algorithm in the
single linkage version with the procedure described above.
The vertical position (height) at which two clusters are merged in the dendrogram
represents the distance between the two clusters. It is observed that in step 0, the clus-
ters {Brindisi} and {Lecce} are merged at the height 39, corresponding to their
minimum distance.
Fig.1.57 Dendrogram generated with the hierarchical agglomerative algorithm in the single linkage
version which considers the minimum distance between two clusters. In the example considered
with 5 cities, initially the clusters are 5 and in the final step, the algorithm ends with a single cluster
containing all the patterns (cities)
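A dendrogram of this kind can be reproduced, for example, with SciPy's single-linkage implementation, assuming SciPy and Matplotlib are available; the distance matrix below is purely illustrative and does not reproduce the actual values of Table 1.2.

# Single-linkage dendrogram in the style of Fig. 1.57 (illustrative distances).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

labels = ['Foggia', 'Bari', 'Taranto', 'Brindisi', 'Lecce']
D = np.array([[  0, 130, 210, 250, 280],    # symmetric, zero-diagonal distance
              [130,   0, 100, 115, 150],    # matrix: NOT the Table 1.2 values,
              [210, 100,   0,  70, 100],    # only an illustrative example
              [250, 115,  70,   0,  40],
              [280, 150, 100,  40,   0]], dtype=float)

Z = linkage(squareform(D), method='single')  # condensed matrix, min-distance merges
dendrogram(Z, labels=labels)
plt.ylabel('Distance')
plt.show()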
While the statistical approach for the recognition of an object (pattern) is based on
a quantitative evaluation of its descriptors based on the prior knowledge of a model,
the syntactic approach is based on a more qualitative description. This last approach
can be used when a complex object is not easily described by its features and when
it is possible to have its own hierarchical representation, composed of elementary
components of the same object for which it is possible to consider some relational
information. In fact, the basic idea of a syntactic method (also called str uctural)
for recognition is to decompose a complex image into a hierarchy of elementary
structures and to develop the r ules with which these elementary structures (sub-
pattern or primitive) have a mutual relationship that allows their recombination to
generate higher level structures. With this syntactic or structural approach, it is there-
fore possible to describe complex patterns by adequately breaking them down into
robust primitives and using composition or production rules for them that include the
Fig. 1.58 Syntactic description of a scene whose objects are decomposed into oriented segments
representing primitives (terminal symbols)
In essence, the first two processes constitute the learning phase, the third process
concerns the grammar construction that normally requires user intervention and is
rarely automated, the last two processes constitute the recognition and verification
phase.
3. Production rules P. A set of production rules, also called rewrite rules. Each pro-
duction rule is a pair of strings of the type ⟨α, β⟩ which, as a binary relation
of finite cardinality on V × V, is indicated as

α → β    (1.311)

where the left string α of the rule must contain at least one nonterminal symbol,
i.e.,

α ∈ (VT ∪ VN)* ◦ VN ◦ (VT ∪ VN)*   and   β ∈ (VT ∪ VN)*    (1.312)

where α is a string (word) that contains at least one nonterminal symbol and β
is any string of terminal and nonterminal symbols, or the empty string.32
4. Nonterminal initial symbol S. Also called axiom, where S ∈ VN.
It should therefore be noted that every production rule (1.311) states that the string
α can be replaced by the string β and defines how, starting from the axiom, strings
of primitives and nonterminals can be generated, until a string of only primitives is reached.
Given a grammar G, we can say that the language L(G) generated by the grammar
is the set of words (strings) of primitives derivable by applying a finite
sequence of production rules of the form (1.311).
32 The characters "*" and "◦" have the following meaning. If V is an alphabet that defines strings
or words as sequences of characters (symbols) of V, the set of all strings defined on the alphabet
V (including the empty string) is normally denoted by V*. The string 110011001 is a string of
length 9 defined on the alphabet {0, 1} and therefore belongs to {0, 1}*. The symbol "◦" instead
defines the concatenation or product operation ◦ : V* × V* → V*, which consists in
juxtaposing two words of V*. This operation is not commutative but only associative (for example,
mono ◦ block = monoblock and abb ◦ bba = abbbba ≠ bba ◦ abb = bbaabb). It should also
be noted that an empty string x consists of 0 symbols, therefore of length |x| = 0, and is normally
denoted with the neutral symbol ε. It follows that x ◦ ε = ε ◦ x = x, ∀x ∈ V*; and besides |ε| = 0.
It can be shown that given an alphabet V, the triple ⟨V*, ◦, ε⟩ is a monoid, that is, a set closed
with respect to the concatenation operator "◦" and for which ε is the neutral element. It is called the
syntactic monoid defined on V because it is the basis of the syntactic definition of languages. The
set of non-empty strings is indicated with V+ and it follows that V* = V+ ∪ {ε}.
Finally, applying the second rule S → ab, we get as a result the strings of the type
a^n b^n. A more compact way to represent production rules is possible using the "|"
symbol with the meaning of OR, writing

α → β1 | β2 | · · · | βn    (1.314)

in place of the n rules α → β1, α → β2, . . . , α → βn. For example, given the production rules

P = {S → aS | B, B → bB | bC, C → cC | c}

we have the derivation

aS ⇒ aaS ⇒ aaB ⇒ aabB ⇒ aabbC   ⟺   aS ⟹* aabbC

obtained by applying, in order, rule 1, rule 2, rule 3, and rule 4, where rule i indicates the
i-th rule contained in P applied in the direct derivation.
4. A sentential form is any string x derivable from the axiom S of the grammar G,
such that x ∈ V* and S ⟹* x.
5. According to point 1, a sentence or proposition of the language generated by G
consists of sentential forms made up of terminals only. For example, a grammar may generate the language
{a^n b^n | n ≥ 1}
while another grammar generates the language
{a^n b | n ≥ 1 and n even}
7. The application of the production rules does not guarantee the generation of a
language string; in fact, it can happen that a sentential form is produced in which
it is not possible to apply any production rule.
(a) Primitives:
VT = {Lecce, is, a, baroque, city}.
(b) Nonterminal words:
Starting from the axiom, the root of the tree structure, and applying the above production
rules, the intermediate words belonging to VN (intermediate levels of the tree) are
generated, down to the terminal words (the tree's leaf nodes), which constitute a
proposition of the natural language.
A grammar can be used in two ways:
The formal grammars used in the computer science context are known as Chomsky
grammars [37], introduced by linguists to study the syntactic analysis of natural
languages, even if they turned out to be more adequate for the study of the syntactic
characteristics of computer programming languages. Basically, Chomsky introduces
restrictions on the types of production to differentiate the various grammars, each
expressing a particular class of languages.
Type 0 grammars, called unrestricted grammars, are based on productions of the following form:

α → β   with α ∈ V* ◦ VN ◦ V*, β ∈ V*    (1.315)

remembering that V = VT ∪ VN and that V* also includes the empty string ε. The lan-
guages that can be generated by type 0 grammars are type 0 languages. These
grammars do not impose any particular restriction on the productions. In fact, the lan-
guage with strings {a^n b^n, n ≥ 0} is of type 0, since it can be generated with the
grammar G0 = ({a, b}, {S, A, X}, P, S) with nonrestrictive productions (differ-
ent from those considered in the previous example for the same language) given
by

P = {S → aAbX | ε, aA → aaAb, bX → b, A → ε}
Type 1 grammars, called Context Sensitive—CS, have productions of the following form:

α → β   with α ∈ V* ◦ VN ◦ V*, β ∈ V+    (1.316)

with |α| ≤ |β|, remembering that V+ does not include the empty string. It follows
from (1.316) that type 1 productions do not reduce the length of the sentential
forms, unlike the type 0 productions, which can admit empty strings in β,
generating derivations that shorten the strings in α. The languages that can be
generated by type 1 grammars are type 1 languages.
For example, the language {a^n b^n c^n, n ≥ 1} is of type 1 because it is generated by the
grammar of type 1:
G1 = ({a, b, c}, {S, B, C}, P, S)
These productions respect the restriction imposed by (1.316) (they do not reduce the length of the
sentential forms), i.e., for each production αi → βi we have |αi| ≤ |βi|. The derivation of the strings of this grammar G1
is obtained as follows:

S ⇒ aSBC ⇒ aaSBCBC ⇒ aaaBCBCBC ⇒ aaaBCBBCC ⇒ aaaBBCBCC

obtained by applying, in order, rule 1, rule 1, rule 2, rule 3, and rule 3, where rule i indicates
the i-th rule contained in P applied in each direct derivation step.
Type 2 grammars, called Context Free—CF, have productions of the following
form:

A → β   with A ∈ VN, β ∈ V+    (1.317)

It should be noted that the left part of a rule is formed by a single nonter-
minal symbol. The languages that can be generated by type 2 grammars are the
non-contextual (context-free) languages of type 2. For example, the language {a^n b^n, n ≥ 1}
considered in the previous examples is of type 2 because it is generated by the
type 2 grammar G2 = ({a, b}, {S}, P, S) with the two production rules
P = {S → aSb | ab}.
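A minimal membership test for this language, written directly from the two rules of G2, could look as follows (the function name is an illustrative assumption).

# Membership test for {a^n b^n | n >= 1}, derived from S -> aSb | ab.
def in_anbn(s):
    if s == 'ab':                      # rule S -> ab
        return True
    if len(s) >= 4 and s[0] == 'a' and s[-1] == 'b':
        return in_anbn(s[1:-1])        # rule S -> aSb: peel one outer a...b pair
    return False

assert in_anbn('aaabbb') and not in_anbn('aabbb')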
Type 3 grammars, called Regular or Finite State—FS, are based on productions of
the following form:

A → aB  or  A → a,   with A, B ∈ VN, a ∈ VT    (1.318)

Similarly to non-contextual grammars (type 2), the regular grammar (type
3) also has production rules with a single nonterminal symbol on the left side.
Furthermore, in a regular grammar the right side of a production is restricted
to be ε (the empty string), a terminal, or a terminal followed by a nonterminal
symbol. The strings {a^n b, n ≥ 0} form a regular language, since they can be
generated with the productions P = {S → aS, S → b}, which agree with
the production rules given by (1.318). The regular languages generated by type 3
grammars are recognizable by finite state automata (a minimal sketch of such a
recognizer is given below). FS grammars are the most widespread in the computer
automation sector and are often used with graphical representations derived from graph theory.
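A minimal finite-state recognizer for this regular language, derived directly from the two productions, could be sketched as follows.

# Finite state recognizer for {a^n b | n >= 0} generated by P = {S -> aS, S -> b}.
def accepts_anb(s):
    state = 'S'
    for ch in s:
        if state == 'S' and ch == 'a':
            state = 'S'          # S -> aS: stay in S
        elif state == 'S' and ch == 'b':
            state = 'ACCEPT'     # S -> b: move to the accepting state
        else:
            return False         # no transition available: reject
    return state == 'ACCEPT'

assert accepts_anb('aaab') and accepts_anb('b') and not accepts_anb('aabb')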
We have seen that the various grammars are characterized by the restrictions imposed
on the production rules. It can be shown that the languages Li generated by the
grammar classes of type i satisfy the inclusions

L0 ⊃ L1 ⊃ L2 ⊃ L3    (1.319)

where the symbol ⊃ indicates the concept of superset; these inclusions represent the hierarchy
of Chomsky grammars. A language is strictly of type i if it is generated by a grammar
of type i and there is no higher level grammar, of type j > i, that can generate it. It
can be shown that the language {a^n b^n, n ≥ 1}, reported in the previous examples, is
strictly of type 2, since it cannot be generated by any higher level grammar. Likewise,
the language {a^n b^n c^n, n ≥ 1} is strictly of type 1, because there are no type 2 or type
3 grammars that can generate it.
All the languages presented are based on nondeterministic grammars. In fact, in
the various production rules, the same string on the left of the productions can have
different forms to the right of the production rules and there is no specific criterion for
choosing the rule to be selected. Therefore, a language, generated by a nondetermin-
istic grammar, does not present words with certain preferences. In some cases, it is
possible to keep track of the number of substitutions made with some productions to
learn the frequency of some generated words and have an estimate of the probability
of how often a certain rule is applied. If a probability of application is associated
with the productions, the grammar is called stochastic. In cases where probability
properties cannot be applied, it may be useful to apply the fuzzy approach [38].
The theory of syntactic analysis introduced in the previous paragraphs can be used to
define a syntactic classifier that assigns the class of membership to a pattern (word).
Given a pattern x (a string characterized by the pattern feature set, seen as a sequence
of symbols) and a given language L generated by an appropriate grammar G, the
problem of recognition reduces to determining whether the pattern string belongs to
the language (x ∈ L(G)). In problems with C classes, the classification of a pattern x
can be solved by associating a grammar with each class. As shown in Fig. 1.60,
an unknown pattern x is syntactically analyzed to find the language of membership
Li associated with the grammar Gi that identifies the class ωi, i = 1, . . . , C. To
avoid assigning the pattern to different classes, it is necessary to define the Gi
grammars adequately, i.e., the Li languages must be strictly disjoint. In the presence of noise
in the patterns, it can happen that a pattern does not belong to any language.
From a practical point of view, it is necessary that the patterns of the various
classes consist of features or terminal elements (primitives) that represent the set
VT. The selection and extraction of the primitives, and in particular of their
structural relationships, which depend on the type of application, is strategic. Once the primitives are
defined, it is necessary to design the appropriate grammar through which the associated
patterns are described. The construction of a grammar in the context of classification
is difficult to achieve automatically; generally it is defined by the expert who knows
the application context well. Once the grammar has been defined, a syntactic analysis
process must be carried out for the recognition of the patterns generated by this
grammar, i.e., given the string x the recognition problem consists in finding L(Gi)
such that

x ∈ L(Gi),   i = 1, . . . , C    (1.320)
This syntactic analysis process, also called parsing, consists in constructing the deriva-
tion tree, useful for understanding the structure of the pattern (string) x and verifying
its syntactic correctness. The syntactic analysis process is based on the attempt to
construct the pattern string to be classified by applying an appropriate sequence of
productions to the starting axiom symbol. If the applied productions are successful,
the process converges to the test string and we have x ∈ L(G); otherwise, the pattern string is
not classified. Now let us see how to formalize a syntactic tree (or derivation tree)
The search for a string of a language with syntactic analysis is a generally non-
deterministic operation and can involve exponential calculation times. In order to
be efficient, normally the syntactic analysis requires that the syntactic tree be con-
structed by analyzing a few symbols of the pattern string at a time (usually a symbol
of the string at a time). This implies that the production rules are chosen with the char-
acteristics of adaptability to syntactic analysis. The ambiguous grammars, which
generate multiple syntactic trees for a given string, should be avoided. The recon-
struction of the syntactic tree can take place in two ways: top-down (descending
derivation) or bottom-up (ascending reduction).
[Fig. 1.61: top-down construction of the derivation tree for x = cabbc using the rules (1) S → AB, (2) S → cSc, (3) A → a, (4) B → bB, (5) B → b.]
Fig. 1.61 Descending syntactic analysis: the result of the derivation is shown in red, while the left
part of the production is shown in green. Rule i indicates the rule i applied, up to the final tree on
the right, whose leaves represent the derived string x = cabbc
P = {S → AB | cSc, A → a, B → bB | b}
The construction of the tree (see Fig. 1.61) with the top-down approach to derive the
string pattern x = cabbc is given by the following derivations:

S ⇒ cSc ⇒ cABc ⇒ caBc ⇒ cabBc ⇒ cabbc

obtained by applying, in order, rule 2, rule 1, rule 3, rule 4, and rule 5.
It is noted that the top-down process starts by expanding the axiom with the appropri-
ate rule 2, S → cSc, instead of rule 1, S → AB, and that the third derivation expands
the nonterminal symbol A with rule 3, A → a, instead of the nonterminal symbol
B. The choice of another sequence of rules would have generated a pattern string
different from x.
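A minimal recursive-descent (top-down) recognizer for this grammar, which chooses the production by looking at the next input symbol as described above, could be sketched as follows; the greedy handling of B → bB | b is an implementation choice valid for this particular grammar.

# Recursive-descent recognizer for P = {S -> AB | cSc, A -> a, B -> bB | b},
# i.e., strings of the form c^k a b^m c^k with m >= 1.
def parse(x):
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos < len(x) and x[pos] == ch:
            pos += 1
            return True
        return False

    def S():
        if pos < len(x) and x[pos] == 'c':        # lookahead 'c': S -> cSc
            return expect('c') and S() and expect('c')
        return A() and B()                        # otherwise: S -> AB

    def A():
        return expect('a')                        # A -> a

    def B():
        if not expect('b'):                       # at least one b (B -> b)
            return False
        while pos < len(x) and x[pos] == 'b':     # B -> bB, consumed greedily
            expect('b')
        return True

    return S() and pos == len(x)

assert parse('cabbc') and parse('ab') and not parse('abc')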
Figure 1.61 shows the entire tree built for the pattern string x = cabbc. As can be
seen from the example, a strategy is needed to choose the production
rule to apply when expanding a nonterminal symbol, to avoid, in case of a wrong
choice, having to go back to the root of the tree and start the parsing again. To this end,
several syntactic analysis algorithms have been developed. A deterministic analyzer
is the following.
At any time, the analyzer reads only the next symbol of the string to be analyzed
(one at a time, from left to right) and finds the set of productions of P that can pro-
duce that symbol. Subsequently, the production rule is applied iteratively to expand
a nonterminal symbol. If a is the leftmost current symbol of the string x under con-
sideration and A is the current nonterminal symbol that must be replaced, only those
productions whose expansion starts with a are chosen.
This syntactic analyzer is based on the construction of a binary matrix which, for
each nonterminal symbol A, determines all the symbols β, terminal and not, such that
there is a production A → β · · · . In the top-down analysis, the presence of left recursions
must also be managed, i.e., symbols X ∈ VN for which there is
a derivation of the type X ⟹* Xα (where α ∈ V*), as well as the presence of common
prefixes in the rules, that is, distinct rules such as A → aα, A → aβ
(where a ∈ VT and α, β ∈ V*). Left recursions and common prefixes are eliminated through
equivalent transformations of the rules of G, leaving the generated language L(G)
unchanged. For example, the rules with recursion to the left

A → Aα1 | Aα2  and  A → β1 | β2

are replaced by the equivalent rules

A → β1 A′ | β2 A′  and  A′ → α1 A′ | α2 A′ | ε
P = {S → AB | cSc, A → a, B → bB | b}
Fig. 1.62 Ascending syntactic analysis: in red the reduction result, while in green the right part of
the production
The reconstruction of the tree (see Fig. 1.62) with the bottom-up approach to reduce
the pattern string x = cabbc is given by the following reductions:

cabbc →r cAbbc →r cAbBc →r cABc →r cSc →r S

obtained by applying, in order, rule 3, rule 5, rule 4, rule 1, and rule 2, where →r indicates
the reduction operator; at each step the reduced substring is the one that matches the right
part of the production rule applied (rule i) and corresponds to the handle, and it is replaced
by the nonterminal on the left part of the rule.
The bottom-up process, as highlighted in the example, finds in each reduction step
a substring (the handle) that corresponds to the right side of a production of G. These
reduction steps may not be unique; in fact, in the previous grammar, we could have
reduced the symbol b with rule 5 instead of the symbol a with rule 3. Therefore, it is important to decide which
substring (handle) is to be reduced and which is the most appropriate reduction to
select.
The substring to be reduced is defined by formalizing the characteristics of the
handle. Let n be a positive integer, let x and β be two strings of symbols, and let A be
a nonterminal symbol such that A → β is a production of the grammar G. We call
handle for the string (sentential form) x, indicated by the pair (A → β, n), if
there is a string of symbols γ such that
position 1. For the sentential form x = cAB, the resulting handle is (S → AB, 2).
It can be shown that, for an unambiguous grammar G, any right sentential form has a
single handle.
The implementation of a bottom-up analyzer based on the handle uses different
data structures: a stack to contain the grammar symbols of G (initially the stack
is empty), an input vector (buffer) to contain the part of the input string x still to
be examined (at the beginning the pointer ↑ is positioned at the beginning of x, i.e.,
the whole string is still to be examined), and a decision table. The bottom-up analyzer uses
the following Shift/Reduce approach:
Shift: move the pointer of the vector containing the input string one character to
the right and move the terminal symbol under consideration to the top of the stack.
For example, if the input string is ABC ↑ abc, the shift produces ABCa ↑ bc,
that is, it moves the pointer and loads the terminal symbol a onto the stack.
Reduce: the analyzer knows that the right side of the handle is at the top of the
stack; it locates the handle in the stack and decides with which nonterminal
symbol to replace it.
A parser that uses the shift/reduce paradigm decides between the two actions by recog-
nizing the presence of a handle on the top of the stack. The procedure ends correctly,
issuing an accept message, when the symbol S appears at the top of the stack
and the input string has been completely analyzed. An error is reported when the
parser detects a syntax error. The following (Algorithm 14) is a simple algorithm for
a bottom-up parser based on shift/reduce operations.
Example of the operations of a shift/reduce parser.
Given the grammar G with the following productions:

(1) S → E + E   (2) E → E ∗ E   (3) E → (E)   (4) E → ID

the useful reductions, with the bottom-up approach, to reduce the string
x = ID + ID ∗ ID are

ID + ID ∗ ID →r E + ID ∗ ID →r E + E ∗ ID →r E + E ∗ E →r E + E →r S

obtained by applying, in order, rule 4, rule 4, rule 4, rule 2, and rule 1.
In Table 1.6 are reported the results of the previous parsing algorithm,
based on the shift/reduce operations, for the syntactic analysis of the input string x = ID + ID ∗ ID
using the grammar G with the 4 productions listed above.
Figure 1.63a shows the syntactic tree obtained with the results shown in Table
1.6, related to the parsing of the string x = ID + ID ∗ ID. A shift/reduce parser,
even for a context-free (CF) grammar, can produce conflicts in deciding which action
to apply, shift or reduce, or in deciding which production to choose for the reduction.
For example, in step 6 the action performed is a shift (appropriate), but the
action of reduce could have been applied using the production S → E + E, since
the handle β = E + E is in the stack. This potential shift/reduce conflict means
that the parser is not able to automatically decide which of the two actions to apply.
7: if the top of the stack (TOF) contains a handle A → β then
8:   Execute REDUCE: reduction of β to A
9:   Remove |β| symbols from the stack
10:  Insert A in the stack
11: else if the input string has not been completely analyzed then
12:  Execute SHIFT:
13:  Insert car on the stack
14:  k ← k + 1
15:  car ← x(k)
16: else
17:  Report an error: the input string has been analyzed without reaching the root S
18: end if
Table 1.6 Applied shift/reduce actions for the parsing of the input string x = ID + ID ∗ ID
Step n. | Stack | Input string | Action | Handle
1  | ↑             | ID + ID ∗ ID | Shift  | –
2  | ↑ ID          | + ID ∗ ID    | Reduce | E → ID
3  | ↑ E           | + ID ∗ ID    | Shift  | –
4  | ↑ E +         | ID ∗ ID      | Shift  | –
5  | ↑ E + ID      | ∗ ID         | Reduce | E → ID
6  | ↑ E + E       | ∗ ID         | Shift  | –
7  | ↑ E + E ∗     | ID           | Shift  | –
8  | ↑ E + E ∗ ID  | (empty)      | Reduce | E → ID
9  | ↑ E + E ∗ E   | (empty)      | Reduce | E → E ∗ E
10 | ↑ E + E       | (empty)      | Reduce | S → E + E
11 | ↑ S           | (empty)      | Accept | –
Fig. 1.63 Bottom-up syntactic analysis with shift/reduce operations: a Generation of the bottom-up
parsing tree applied to the string x = ID + ID ∗ ID relative to the actions shown in Table
1.6. In red is shown the result of the derivation, while in green the left part of the production. b
Construction of the tree as in (a), but in step 6, instead of shift, the action of reduce was applied
with production 1, highlighting the shift/reduce conflict situation which breaks the parsing of the
string
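A tiny shift/reduce recognizer that reproduces the behaviour of Table 1.6 could be sketched as follows; the conflict-resolution policy (reduce E + E only when the input is exhausted, i.e., prefer shift in step 6) is an illustrative choice and not a general LR table construction.

# Minimal shift/reduce recognizer for the grammar
#   (1) S -> E + E   (2) E -> E * E   (3) E -> ( E )   (4) E -> ID
def shift_reduce(tokens):
    stack, i = [], 0
    while True:
        if stack[-1:] == ['ID']:
            stack[-1:] = ['E']                        # rule 4
        elif stack[-3:] == ['(', 'E', ')']:
            stack[-3:] = ['E']                        # rule 3
        elif stack[-3:] == ['E', '*', 'E']:
            stack[-3:] = ['E']                        # rule 2
        elif stack[-3:] == ['E', '+', 'E'] and i == len(tokens):
            stack[-3:] = ['S']                        # rule 1, only at end of input
        elif i < len(tokens):
            stack.append(tokens[i]); i += 1           # shift the next token
        else:
            break                                     # no action possible
    return stack == ['S']

assert shift_reduce(['ID', '+', 'ID', '*', 'ID'])
assert not shift_reduce(['ID', '+', '*', 'ID'])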
Generally, the bottom-up parser handles a wider class of grammars (the LR(1) grammars)
than the LL(1) grammar classes on which top-down parsers are based. This can be explained by
the fact that the bottom-up approach uses the information of the input string more effectively,
considering that it begins the construction of the syntactic tree from the terminal
symbols, the leaf nodes, instead of from the axiom. The disadvantage of the bottom-up
parser is that only at the end of the procedure does it check whether the tree created
ends in the axiom S. Therefore, all trees are expanded, even those that cannot
converge to the root node S. A possible strategy is a mixed one that combines the
top-down and bottom-up approaches, with the aim of achieving a parser that prevents
the expansion of trees that cannot converge to the axiom S and prevents the expansion
of trees that cannot end on the input string.
33 We define an isomorphism between two complex structures when trying to evaluate the correspon-
dence between the two structures, or a level of similarity between their structural elements. In math-
ematical terms, an isomorphism is a bijective map f between two sets consisting of similar
structures belonging to the same structural nature, such that both f and its inverse preserve the same
structural characteristics.
[Fig. 1.64: (a) text T = "davide d. molto alto" with the pattern x = "molto" occurring at shift s = 10; (b) text T = "aabbabazzvkvkvkzzf" with the pattern x = "vkv" occurring at the overlapping shifts s = 9 and s = 11]
Fig. 1.64 a The problem of exact string matching: given a text string T and a substring pattern x, normally much shorter than the text, we want to find the occurrences of x in T, with relative displacements s. In the example, the pattern x occurs in T only once, for s = 10. b Occurrences of x in T can also overlap, as happens for s = 9 and s = 11
length m < n. The key problem of string recognition (known as string matching) is to find the set of all positions in the text T at which the pattern x occurs, if it occurs at all (see Fig. 1.64a). The occurrences of x in T can be multiple and can overlap (see Fig. 1.64b).
The pseudo-code of a simple string-matching algorithm is given below (Algorithm 15), assuming that the text T[1 : n] is longer than the pattern x[1 : m] and denoting by s + 1 the position in the text where the occurrence x[1 : m] = T[s + 1 : s + m] is found, that is, where x appears aligned in T (s is the displacement needed to align the first character of x with the character in position s + 1 of T). This simple algorithm starts by aligning (initially s = 0) the first character of x
5: if x[1 : m] = T [s + 1 : s + m] then
6: Report the occurrence of x at position s + 1 of T
7: end if
8: s ← s + 1
9: end while
and of T, then compares, from left to right, the corresponding characters of x and T, until either two different characters are found (no-match) or the end of x is reached (a match is found, and the occurrence is reported as the position s + 1 of the character of T that corresponds to the first character of x). In both the match and the no-match situation, the pattern x is translated to the right by one character, and the procedure is repeated until the last character of x, that is x[m], exceeds the last character of the text T, that is, T[s + m].
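A minimal Python sketch of this naive procedure follows (our illustration, not the book's Algorithm 15); the text string used in the test is our reconstruction of the example of Fig. 1.64a.

def naive_match(text, pattern):
    n, m = len(text), len(pattern)
    occurrences = []
    for s in range(n - m + 1):            # shift s aligns x[1] with T[s + 1]
        if text[s:s + m] == pattern:      # character-by-character comparison
            occurrences.append(s)         # occurrence found at shift s (position s + 1)
        # in both the match and the no-match case the shift grows by 1
    return occurrences

print(naive_match("davide d. molto alto", "molto"))   # [10]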
The considerable computational load of this simple procedure is immediately evident: in the worst case (for example, x = a^m and T = a^n, strings made of the same character a), it has complexity O((n − m + 1)m). If, on the other hand, the characters of x and T vary randomly, the procedure tends to perform much better. The weakness of this algorithm is the systematic shift of the pattern x by a single position to the right, which is not always necessary.
A possible strategy is to consider larger shifts without risking the loss of occurrences. More efficient algorithms have been developed that tend to learn, through appropriate heuristics, the structure of the pattern x and of the text T. In 1977, the KMP algorithm of Knuth, Morris, and Pratt [45] was published, which goes in this direction of making the string-matching procedure more efficient. It breaks the process down into two steps: preprocessing and searching. In the first step, the pattern is analyzed to learn its internal structure (the prefix function π, introduced later in this section), so that in the searching step shifts larger than one character can be applied without missing any occurrence.
Fig. 1.65 At the displacement s, the comparison between each character of the pattern and the text
occurs from right to left, until meeting the discordant characters, respectively, x[ j] and T [s + j]
(in red in the picture)
The algorithm that over the years has turned out to be an excellent reference for
the scientific community of the sector is that of Boyer–Moore [46]. From the other
algorithms it is essentially distinguished by the following aspects:
(a) Compare the pattern x and the text T from right to left. The characters are compared from right to left, starting from x[m] down to x[1]; when a discordance (no-match) between text and pattern is found, the pointer j of the pattern marks the mismatch, and j + 1 corresponds to the last position in which the characters of the pattern and of the text matched (see Fig. 1.65).
(b) Reduce comparisons by calculating shifts greater than 1 without compromising
the detection of valid occurrences. This occurs with the use of two heuristics that
operate in parallel as soon as a discordance is signaled in the comparison step.
The two heuristics are
1. Heuristic of the discordant character (bad character rule): it uses the position at which the discordant character of the text, T[s + j], occurs (if it occurs) in the pattern x to propose an appropriate new displacement s. This heuristic proposes to increase the displacement by the amount necessary to align the rightmost occurrence of the discordant character in x with the one identified in the text T. However, it must be guaranteed that no valid occurrence is skipped (see Fig. 1.66).
Fig. 1.66 Boyer–Moore heuristics proposing the size of the increase in the shift of the pattern x with respect to the text T. With the current s, we do not have an occurrence of the pattern: scanning from right to left, the character "a" of the text is discordant with the character "o" of the pattern in position j = 2 (both in red). The heuristic of the discordant character, not finding the character "a" in the pattern, proposes to increase s by j = 2, that is, to move the pattern just past the discordant character. The heuristic of the good suffix, instead, verifies whether a substring identical to the good suffix (in the example "olto") occurs elsewhere in the pattern, and proposes to move the pattern by the amount needed to align this new substring with the good suffix previously found in the text (colored green). In this example, such a substring does not exist in the pattern, so the heuristic proposes to move the complete pattern just past the good suffix of the text, that is, to increase s by the length m of the pattern
2. Heuristic of the good suffix (good suffix rule),34 which, operating in parallel with the bad character rule, reduces the number of comparisons. The search for the good suffix is made efficient by the adoption of the right-to-left scan. This heuristic too proposes, independently, to increase the displacement by the amount necessary to align the next occurrence of the good suffix in x with the one identified in the text T, always scanning from right to left (see Fig. 1.66).
When a discordance between pattern and text characters is found, it has in fact been determined that the current shift s does not correspond to a valid occurrence of the pattern; at this point, each heuristic proposes a value by which the shift s can be increased without compromising the detection of occurrences. The authors of the algorithm choose the larger of the two values suggested by the heuristics to increase the shift s, and then the search for occurrences continues. Returning to the example of Fig. 1.66, the heuristic of the good suffix is chosen because it suggests the larger increment, namely a shift of the pattern to the right by 5 characters proposed by the
34 Prefix and suffix string formalism. A string y over the alphabet V is a substring of x if there exist two strings α and β over V such that x = α ◦ y ◦ β (concatenation of strings); we then say that y occurs in x. It follows that the string α is a prefix of x (denoted α ⊂ x), i.e., it corresponds to the initial characters of x. Similarly, β is a suffix of x (denoted β ⊃ x) and coincides with the final characters of x. In this context, a good suffix is defined as the suffix substring of the pattern x that occurs in the text T for a given value of the shift s, starting from the character j + 1 + s. It should be noted that the relations ⊂ and ⊃ enjoy the transitive property.
good suffix, instead of the 2 characters proposed by the heuristic of the discordant character.
Now let us see in more detail how the two heuristics reduce the number of comparisons without missing any occurrence. The heuristic of the discordant character looks in the pattern x for the rightmost occurrence of the character of the text that caused the discordance, that is, for the position k such that

x[k] = T[s + j]    (1.322)

where k is the maximum position in x at which the discordant character is found. The action proposed by the heuristic is to move the pattern x to the right, updating the shift s as follows:
s ← s + ( j − k) j = 1, . . . , m k = 1, . . . , m (1.323)
which has the effect of aligning the discordant character of the text T [s + j] with the
identical character found inside the pattern in the position k. Figure 1.67 schematizes
the functionality of this heuristic for the 3 possible configurations depending on the
value of the index k. Let’s now analyze the 3 configurations:
1. k = 0: the configuration in which Eq. (1.322) is not satisfied, that is, the discordant character does not appear in the pattern x; the proposed action, according to (1.323), is to increase the shift s by j:

s ← s + j    (1.324)

which has the effect of aligning the first character of the pattern, x[1], with the character of T that follows the discordant character (see Fig. 1.67a).
2. k < j: the configuration in which the discordant character of T is present in the pattern x at position k, to the left of the position j of the discordance in x, so that j − k > 0. It follows that we can increase the shift s by j − k (to the right), according to (1.323), which has the effect of aligning the character x[k] with the discordant character in the text T (see Fig. 1.67b).
3. k > j: the configuration in which the discordant character T[s + j] is present in the pattern x, but to the right of the position j of the discordance in x, so that j − k < 0. This would imply a negative shift, to the left; therefore, the increment proposed by (1.323) is ignored, and s can only be increased by 1 character to the right (see Fig. 1.67c).
Fig. 1.67 The different configurations of the heuristic of the discordant character. a The discordant character T[s + j] is not found in the pattern x, and it is proposed to move the pattern so that its first character falls immediately after the discordant character (s incremented by 5). b The discordant character occurs in the pattern, at the rightmost position k, with k < j, and a shift of the pattern by j − k characters (s incremented by 4) is proposed. This is equivalent to aligning the discordant character T[s + j] = "T" of the text with the identical character x[k] found in the pattern. c Situation identical to (b), but the discordant character T[s + j] = "A" is found in the pattern at its rightmost position with k > j. In the example, j = 6 and k = 8, and the heuristic, proposing a negative shift, is ignored
In the Boyer–Moore algorithm, the heuristic of the discordant character is realized through the function of the last occurrence λ, defined for every symbol of the alphabet as follows:

λ[σi] = max{k : 1 ≤ k ≤ m and x[k] = σi}  if σi occurs in x,    λ[σi] = 0 otherwise    (1.325)
where σi is the i-th symbol of the alphabet V. The function of the last occurrence
defines λ[σi ] as the pointer of the rightmost position (i.e., of the last occurrence) in x
where the character σi appears, for all the characters of the alphabet V. The pointer
is zero if σi does not appear in x. The pseudo-code that implements the algorithm of
the last occurrence function is given below (Algorithm 16).
Fig. 1.68 The different configurations of the good suffix heuristic (string with a green background). a No k exists: in the pattern x there is no other copy of the good suffix, and no prefix of x is also a suffix of T[s + j + 1 : s + m]. A shift of the pattern equal to its length m is proposed. b No k exists, but there is a prefix α of x (in the example α = "CA") which is also a suffix of T[s + j + 1 : s + m], indicated with β. A shift of the pattern is proposed so as to match its prefix α with the text suffix β (s incremented by 7 characters). c k exists: in the pattern there is a substring (in the example "ACA", colored orange) coinciding with the suffix that occurs in T[s + j + 1 : s + m] and satisfying the condition x[k] ≠ x[j]. It is proposed, as in (b), to shift the pattern so as to align the substring found in x with the suffix of the text indicated above (in the example, the increment is 3 characters)
3: for each symbol σ of the alphabet V do
4: λ[σ] ← 0
5: end for
6: for j ← 1 to m do
7: λ[x[j]] ← j
8: end for
9: return λ
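A minimal Python sketch of the last occurrence function of Eq. (1.325) follows (our illustration, not the book's Algorithm 16); positions are 1-based as in the text.

def last_occurrence(pattern, alphabet):
    lam = {sigma: 0 for sigma in alphabet}        # 0 = symbol absent from the pattern
    for j, c in enumerate(pattern, start=1):      # later positions overwrite earlier ones
        lam[c] = j
    return lam

print(last_occurrence("CTAGCGGCT", "ACGT"))       # {'A': 3, 'C': 8, 'G': 7, 'T': 9}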
The first expression indicates the existence of a copy of the good suffix starting from
the position k + 1, while the second one indicates that this copy is preceded by a
character different from the one that caused the discordance or x[ j]. Having met the
(1.326), the heuristic suggests updating the s shift as follows:
s ← s + ( j − k) j = 1, . . . , m k = 1, . . . , m (1.327)
and to move the pattern x to the new position s + 1. Comparing the characters of x
from the position k to k + m − j is useless. Figure 1.68 schematizes the functionality
of the good suffix heuristic for the 3 possible configurations as the index k varies.
In the Boyer–Moore algorithm, the good suffix heuristic is realized with the good suffix function γ[j] which defines, once the discordant character has been found at position j (j < m), i.e., x[j] ≠ T[s + j], the minimum amount by which the shift s must be increased, given as follows:

γ[j] = m − max{k : 0 ≤ k < m and x[j + 1 : m] ∼ x[1 : k] with x[k] ≠ x[j]}    (1.328)
7: γ [ j] ← m − π [m]
8: end for
9: for k ← 1 to m do
10: j ← m − π [k]
11: if (γ [ j] > (k − π [k])) then
12: γ [ j] ← k − π [k]
13: end if
35 Let α and β be two strings; we define a similarity relation α ∼ β (read: α is similar to β) with the meaning that one of the two is a suffix of the other (we recall that the symbol ⊃ denotes the suffix relation). It follows that, if two strings are similar, we can align them at their identical rightmost characters, and no pair of aligned characters will be discordant. The similarity relation ∼ is symmetric, that is, α ∼ β if and only if β ∼ α. It can also be shown that the following implication holds:
α ⊃ β and y ⊃ β ⟹ α ∼ y.
From the pseudo-code, we observe the presence of the prefix function π applied to the pattern x and to its reverse (the pattern read from right to left). This function is used in the preprocessing step of the string-matching algorithm of Knuth–Morris–Pratt and is formalized as follows: given a pattern x[1 : m], the prefix function for x is the function π : {1, 2, . . . , m} → {0, 1, 2, . . . , m − 1} such that

π[q] = max{k : k < q and x[1 : k] ⊃ x[1 : q]}    (1.329)

In essence, (1.329) indicates that π[q] is the length of the longest prefix of the pattern x that is also a suffix of x[1 : q].
Returning to the good suffix algorithm (Algorithm 17), the first for-loop fills the vector γ with the difference between the length of the pattern x and the values returned by the prefix function π. With the second for-loop, the already initialized vector γ is updated with the values derived from π whenever they yield shifts smaller than those computed in the initialization. The pseudo-code of the prefix function π given by (1.329) is given in Algorithm 18.
7: k ← π [k]
8: if (x[k + 1] = x[i]) then
9: k ←k+1
10: end if
11: π [i] ← k
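A runnable Python sketch of the prefix function of Eq. (1.329) follows (our illustration, not the book's exact Algorithm 18); it uses 1-based positions in the returned list, as in the text.

def prefix_function(x):
    m = len(x)
    pi = [0] * (m + 1)                       # pi[1] = 0 by definition; pi[0] unused
    k = 0
    for q in range(2, m + 1):
        while k > 0 and x[k] != x[q - 1]:    # in 1-based terms: x[k + 1] != x[q]
            k = pi[k]
        if x[k] == x[q - 1]:                 # x[k + 1] == x[q]: the border grows
            k += 1
        pi[q] = k
    return pi[1:]                            # pi[1..m]

print(prefix_function("CTAGCGGCT"))          # [0, 0, 0, 0, 1, 0, 0, 1, 2]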
We are now able to report the Boyer–Moore algorithm, having defined the two preprocessing functions: that of the discordant character (Last_Occurrence) and that of the good suffix (Good_Suffix). The pseudo-code is given in Algorithm 19. Figure 1.69 shows a simple example of the Boyer–Moore algorithm, which finds the first occurrence of the pattern x[1:9] = "CTAGCGGCT" in the text T[1:29] = "CTTATAGCTGATCGCGGCCTAGCGGCTAA" after 6 steps, having previously
7: j ← m
8: while j > 0 and x[j] = T[s + j] do
9: j ← j − 1
10: if j = 0 then
11: Report the occurrence of x in T at position s + 1
13: else
14: s ← s + max{γ[j], j − λ[T[s + j]]}
15: end if
pre-calculated the tables of the two heuristics, appropriate for the alphabet V =
{A, C, G, T } and the pattern x considered.
From the analysis of the Boyer–Moore algorithm, we can observe the similarity with the simple Algorithm 15, from which it differs substantially in the comparison between pattern and text, which proceeds from right to left, and in the use of the two heuristics to shift the pattern by more than 1 character. In fact, while in the simple string-matching algorithm the shift s always grows by 1 character, in the Boyer–Moore algorithm, when the discordant character is found, the instruction associated with line 14 is executed, increasing s by a quantity that corresponds to the maximum of the values suggested by the two heuristic functions.
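A simplified Python sketch follows (our illustration, not the book's Algorithm 19): it implements the right-to-left comparison with the bad character heuristic only; the good suffix function γ is omitted for brevity, so the shift falls back to 1 whenever the bad character rule would propose a negative move or after a full match.

def boyer_moore_bad_char(text, pattern, alphabet):
    n, m = len(text), len(pattern)
    lam = {c: 0 for c in alphabet}            # last occurrence table, Eq. (1.325)
    for j, c in enumerate(pattern, start=1):
        lam[c] = j
    occurrences, s = [], 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1                            # compare from right to left
        if j == 0:
            occurrences.append(s)             # occurrence at shift s (position s + 1)
            s += 1
        else:
            s += max(1, j - lam[text[s + j - 1]])   # bad character proposal, at least 1
    return occurrences

print(boyer_moore_bad_char("CTTATAGCTGATCGCGGCCTAGCGGCTAA", "CTAGCGGCT", "ACGT"))
# [18]: the only occurrence, at shift s = 18 (position 19), as in Fig. 1.69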
The computational complexity [47,48] of the Boyer–Moore algorithm is O(nm). In particular, the preprocessing complexity due to the Last_Occurrence function is O(m + |V|) and that of the Good_Suffix function is O(m), while the one due to the search phase is O((n − m + 1)m). The comparisons saved in the search phase depend very much on the heuristics, which learn useful information about the internal structure of the pattern and of the text. To operate in linear time, the two heuristics are implemented through tables, pre-computed over the entire alphabet and the symbols of the pattern string, which store
[Fig. 1.69: the six steps of the Boyer–Moore search of x = "CTAGCGGCT" in T = "CTTATAGCTGATCGCGGCCTAGCGGCTAA"; the shift grows as s = 0, 7, 9, 10, 11, 18]
Fig. 1.69 Complete example of the Boyer–Moore algorithm. In step 1, we have a character discordance for j = 6 and a match between the suffix "CT" (of the good suffix T[7:9] = "GCT") and the prefix α = x[1:2] = "CT"; between the two heuristics, the good suffix (Gs = 7) wins over that of the discordant character (Lo = 3), moving the pattern by 7 characters as shown in the figure. In step 2, the discordant-character heuristic wins, having found the rightmost occurrence of the discordant character in the pattern, and proposes a shift of Lo = 2, greater than the Gs = 1 of the good suffix heuristic. In steps 3 and 4, both heuristics suggest moving by 1 character. Step 5 instead chooses the good suffix heuristic with Gs = 7 (a configuration identical to step 1), while the other heuristic, which proposes a negative shift Lo = −1, is ignored. In step 6, we have the first occurrence of the pattern in the text
the rightmost position of the occurrence of each symbol in x and the rightmost positions of the occurrences of the suffixes of x. Experimental results show that the Boyer–Moore algorithm performs well for long patterns x and for large alphabets V. To further optimize the computational complexity, several variants of the Boyer–Moore algorithm have been developed [47,49,50].
Insertion, deletion, and substitution of a single character: these are the elementary operations used to calculate the edit distance. Further elementary operations can be considered, such as the transposition, which interchanges adjacent characters of a string x. For example, transforming x = "marai" into y = "maria" requires a single transposition operation, equivalent to 2 substitution operations. The edit distance between the strings "dived" and "davide" is equal to 3 elementary operations: 2 substitutions, "dived" → "daved" (substitution of "i" with "a") and "daved" → "david" (substitution of "e" with "i"), and 1 insertion, "david" → "davide" (of the character "e").
The edit distance is the minimum number of operations required to make two
strings identical. Edit distance can be calculated by giving different costs to each
elementary operation. For simplicity, we will consider unit costs for each edit ele-
mentary operation.
Given two strings x = x1x2 · · · xn and y = y1y2 · · · ym, it is possible to define the matrix D(i, j) as the edit distance between the prefix strings (x1..xi) and (y1..yj), and consequently to obtain as final result the edit distance D(n, m) between the two strings x and y, as the minimum number of edit operations needed to transform the entire string x into y.
The calculation of D(i, j) can be set up recursively, in terms of immediately shorter prefixes, considering that only three cases can occur, associated with the corresponding edit operations:
1. Substitution: the character xi is replaced with yj, and we have the following edit distance:

D(i, j) = D(i − 1, j − 1) + cs(i, j)    (1.330)

where D(i − 1, j − 1) is the edit distance between the prefixes (x1..xi−1) and (y1..yj−1), and cs(i, j) indicates the cost of the substitution operation between the characters xi and yj, given by

cs(i, j) = 1 if xi ≠ yj,    cs(i, j) = 0 if xi = yj    (1.331)
2. Deletion: the character xi is deleted, and we have the following edit distance:

D(i, j) = D(i − 1, j) + cd(i, j)    (1.332)

where D(i − 1, j) is the edit distance between the prefixes (x1..xi−1) and (y1..yj), and cd(i, j) indicates the cost of the deletion operation, normally set equal to 1.
3. Insertion: the character yj is inserted, and we have the following edit distance:

D(i, j) = D(i, j − 1) + cin(i, j)    (1.333)

where D(i, j − 1) is the edit distance between the prefixes (x1..xi) and (y1..yj−1), and cin(i, j) indicates the cost of the insertion operation, normally set equal to 1.
Given that there are no other cases, and that we are interested in the minimum value,
the correct edit distance recursively defined is given by
D(i, j) = min{D(i −1, j)+1, D(i, j −1)+1, D(i −1, j −1)+cs (i, j)} (1.334)
with strictly positive i and j. The edit distance D(n, m) between two length strings
of n and m characters, respectively, can be calculated with a recursive procedure that
implements the (1.334) starting from the basic conditions:
D(i, 0) = i i = 1, . . . , n D(0, j) = j j = 1, . . . , m D(0, 0) = 0 (1.335)
where D(i, 0) is the edit distance between the prefix string (x1..xi) and the null string ε, D(0, j) is the edit distance between the null string ε and the prefix string (y1..yj), and D(0, 0) represents the edit distance between null strings.
A purely recursive procedure based on (1.334) and (1.335) would be inefficient, since the same subproblems would be recomputed many times. The strategy used, instead, is based on dynamic programming, which fills a table of (n + 1) × (m + 1) entries only once (see Algorithm 20).
With this algorithm, a cost matrix D is used to calculate the edit distance, starting from the base conditions given by (1.335); then, using these base values (lines 3–12), the edit distance (minimum cost, line 20) is calculated with (1.334) for each element D(i, j) (i.e., for pairs of longer and longer prefixes as i and j vary), thus filling the matrix up to the element D(n, m), which represents the edit distance of the strings x and y of length n and m, respectively. In essence, instead of directly calculating the distance D(n, m) of the two strings of interest, the strategy of Algorithm 20 is to determine the distance of all the prefixes of the two strings (reduction of the problem into subproblems), from which to derive, by induction, the distance for the entire length of the strings.
5: D(i, 0) ← i
6: end for
7: for j ← 1 to m do
8: D(0, j) ← j
9: end for
10: for i ← 1 to n do
11: for j ← 1 to m do
12: if x[i] = y[j] then
13: c ← 0
14: else
15: c ← 1
16: end if
17: D(i, j) = min{D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + c}
where the three terms correspond, respectively, to the deletion of xi, the insertion of yj, and the substitution (or no substitution) of xi with yj.
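A compact Python sketch of this dynamic-programming computation follows (our illustration of Algorithm 20, not the book's exact code), using Eq. (1.334), the base conditions (1.335), and unit costs.

def edit_distance(x, y):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                          # distance between x[1:i] and the null string
    for j in range(m + 1):
        D[0][j] = j                          # distance between the null string and y[1:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if x[i - 1] == y[j - 1] else 1          # substitution cost c_s(i, j)
            D[i][j] = min(D[i - 1][j] + 1,                # deletion of x_i
                          D[i][j - 1] + 1,                # insertion of y_j
                          D[i - 1][j - 1] + c)            # substitution / no substitution
    return D[n][m]

print(edit_distance("Frainzisk", "Francesca"))            # 5, as in Fig. 1.70b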
The symmetry property is maintained if the edit operations (insertion and deletion) have identical costs, as is the case in the algorithm reported here. Furthermore, we could consider a cost c(i, j) if we wanted to differentiate the elementary editing costs between the character xi and the character yj. Figure 1.70a shows the schema of the matrix D, with dimensions (n + 1) × (m + 1), for the calculation of the minimum edit distance between the strings x = "Frainzisk" and y = "Francesca". The elements D(i, 0) and D(0, j) are filled first (respectively, the first column and the first row of D); these are the base values representing the lengths of all the prefixes of the two strings with respect
[Fig. 1.70: (a) layout of the matrix D with the base values D(i, 0), D(0, j) and the three neighbors D(i−1, j−1), D(i−1, j), D(i, j−1) used to compute each D(i, j), together with the worked-out calculation of D(1, 1) and D(1, 2); (b) the complete matrix for x = "Frainzisk" and y = "Francesca", with the path of substitution/no-substitution, insertion, and deletion moves leading to D(9, 9) = 5]
Fig. 1.70 Calculation of the edit distance to transform the string x = "Frainzisk" into y = "Francesca". a Construction of the matrix D starting from the base values given by (1.335) and then iterating with the dynamic programming method to calculate the other elements of D using Algorithm 20. b The complete D calculated by scanning the matrix from left to right and from the first row to the last. The edit distance between the two strings is equal to D(9, 9) = 5, and the required edit operations are 1 deletion, 3 substitutions, and 1 insertion
to the null string ε. The element D(i, j) is the edit distance between the prefixes x(1..i) and y(1..j). The value D(i, j) is calculated by induction based on the last characters of the two prefixes.
If these characters are equal D(i, j) is equal to the edit distance between the two
shorter prefixes of 1 character (x(1..i − 1) and y(1.. j − 1)) or D(i, j) = 0 + D(i −
1, j − 1). If the last two characters are not equal D(i, j) results in a unit greater than
the minimum edit distances relative to the 3 shortest prefixes (the adjacent elements:
upper, left, and upper left), that is, D(i, j) = 1+min{D(i −1, j), D(i, j −1), D(i −
1, j −1)}. It follows, as shown in the figure, that for each element of D the calculation
of the edit distance depends only on the values previously calculated for the shortest
prefixes of 1 character.
The complete matrix is obtained by iterating the calculation for each element of D, operating from left to right and from the first row to the last, thus obtaining the edit distance in the last element D(n, m). Figure 1.70b shows the complete matrix D for the calculation of the edit distance to transform the string x = "Frainzisk" into y = "Francesca". The path is also reported (indicating the type of edit operation performed) that leads to the final result D(9, 9) = 5, i.e., to the minimum number of required edit operations.
Returning to the calculation time, the algorithm reported requires a computational
load of O(nm) while in space-complexity, it requires O(n) (the space-complexity is
O(nm) if the whole of the matrix is kept for a trace-back to find an optimal alignment).
Ad hoc algorithms have been developed in the literature that reduce computational
complexity up to O(n + m).
In several applications, where the information may be affected by error or the nature
of the information itself evolves, the exact pattern matching is not useful. In these
cases, it is very important to solve the problem of the approximate pattern matching
which consists in finding in the text string an approximate version of the pattern
string according to a predefined similarity level.
In formal terms, approximate pattern matching is defined as follows. Given a text string T of length n and a pattern x of length m, with m ≤ n, the problem is to find the k-approximate occurrences of the pattern string x in the text T, i.e., the occurrences with at most k (0 ≤ k ≤ m) different characters (or errors). A simple version of an approximate matching algorithm (Algorithm 21), shown below, is obtained with a modification of the exact matching algorithm presented in Algorithm 15.
5: count ← 0
6: for j ← 1 to m do
7: if x[j] ≠ T[s + j − 1] then
8: count ← count + 1
9: end if
10: end for
11: if count ≤ k then
12: Report the k-approximate occurrence of x at position s of T
13: end if
The first for-loop (statement at line 4) slides over the text one character at a time, while in the second for-loop (statement at line 6) the variable count accumulates the number of different characters found between the pattern x and the text window T[s : s + m − 1], reporting the positions s in T of the approximate patterns found according to the required k differences. We would return to the exact matching algorithm by setting k = 0 in Algorithm 21. We recall that the computational complexity of this algorithm is O(nm).
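A short Python sketch of this counting strategy follows (our illustration, not the book's exact Algorithm 21): for every alignment of the pattern, the differences inside the window of length m are counted.

def k_mismatch_positions(text, pattern, k):
    n, m = len(text), len(pattern)
    positions = []
    for s in range(1, n - m + 2):                    # 1-based starting position s in T
        count = sum(1 for j in range(m) if pattern[j] != text[s - 1 + j])
        if count <= k:
            positions.append(s)
    return positions

print(k_mismatch_positions("CCTATAGCTGATC", "CTAG", 1))
# [2, 4]: the windows T[2:5] = "CTAT" and T[4:7] = "ATAG" differ from x in one character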
An efficient algorithm of approximate string matching, based on the edit distance, is reported in Algorithm 22.
5: D(i, 0) ← i
6: end for
7: for j ← 1 to n do
8: D(0, j) ← 0
9: end for
10: for i ← 1 to m do
11: for j ← 1 to n do
12: if x[i] = T[j] then
13: c ← 0
14: else
15: c ← 1
16: end if
17: D(i, j) = min{D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + c}
23: end if
This algorithm differs substantially from Algorithm 20 in having zeroed the row D(0, j), j = 1, . . . , n, instead of assigning it the value j (line 8 of Algorithm 22). In this way, the matrix D indicates that a null prefix of the pattern x corresponds to a null occurrence in the text T, which involves no cost. Each element D(i, j) of the matrix contains
[Fig. 1.71: the modified edit matrix D for x = "CTAG" and T = "CCTATAGCTGATC"; the last row D(4, ∗) contains the value 1 at the columns j = 4, 5, 7, and 10, which mark the 1-approximate occurrences]
Fig. 1.71 Detection, using the algorithm Algorithm 22, of the 1-approximate occurrences of the
pattern x = “CTAG” in the text T = “CCTATAGCTGATC”. The positions of the approximate
occurrences in T are in the line D(m, ∗) of the modified edit matrix, where the value of the edit
distance is minimum, i.e., D(4, j) ≤ 1. In the example, the occurrences in T are 4 in the positions
for j = 4, 5, 7 and 10
the minimum value k for which there exists an approximate occurrence (with at most k different characters) of the prefix x[1 : i] ending at position j of T. It follows that the k-approximate occurrences of the entire pattern x are found in T at the positions shown in the last row of the matrix, D(m, ∗) (line 22 of Algorithm 22). In fact, each element D(m, j), j = 1, . . . , n, reports the minimum number of edit operations required to transform x into a substring of T ending at position j. The end positions j of the occurrences of x in T are therefore found where D(m, j) ≤ k.
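A Python sketch of this modified edit-matrix computation follows (our illustration of Algorithm 22, not the book's exact code): the row D[0][j] is zeroed so that an occurrence may start anywhere in T, and the k-approximate occurrences are read off the last row.

def approximate_matches(text, pattern, k):
    n, m = len(text), len(pattern)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                       # deleting i characters of the pattern prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if pattern[i - 1] == text[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + c)
    return [j for j in range(1, n + 1) if D[m][j] <= k]    # end positions in T

print(approximate_matches("CCTATAGCTGATC", "CTAG", 1))     # [4, 5, 7, 10], as in Fig. 1.71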
Figure 1.71 shows the modified edit matrix D calculated with the algorithm
Algorithm 22 to find the approximate occurrences of k = 1 of the pattern x =
“CTAG” in the text T = “CCTATAGCTGATC”. The 1-approximate pattern x, as
shown in the figure, occurs in the text T in positions j = 4, 5, 7 and 10, or where
D(4, j) ≤ k. In literature [51], there are several other approaches based on different
methods of calculating the distance between strings (Hamming, Episode, ...) and on
dynamic programming with the aim also of reducing computational complexity in
time and space.
Ad hoc solutions have also been developed in the literature [52] for the optimal comparison between strings in the presence of special (wildcard) characters.
References
1. R.B. Cattell, The description of personality: basic traits resolved into clusters. J. Abnorm. Soc.
Psychol. 38, 476–506 (1943)
2. R.C. Tryon, Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the
Isolation of Unities in Mind and Personality (Edward Brothers Inc., Ann Arbor, Michigan,
1939)
3. K. Pearson, On lines and planes of closest fit to systems of points in space. Philos. Mag. 2(11),
559–572 (1901)
4. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ.
Psychol. 24, 417–441 and 498–520 (1933)
5. W. Rudin, Real and Complex Analysis (Mladinska Knjiga McGraw-Hill, 1970). ISBN 0-07-
054234-1
6. R. Larsen, R.T. Warne, Estimating confidence intervals for eigenvalues in exploratory factor
analysis. Behav. Res. Methods 42, 871–876 (2010)
7. M. Friedman, A. Kandel, Introduction to Pattern Recognition: Statistical, Structural, Neural
and Fuzzy Logic Approaches (World Scientific Publishing Co Pte Ltd, 1999)
8. R.A. Fisher, The statistical utilization of multiple measurements. Ann Eugen 8, 376–386 (1938)
9. K. Fukunaga, J.M. Mantock, Nonparametric discriminant analysis. IEEE Trans. Pattern Anal.
Mach. Intell. 5(6), 671–678 (1983)
10. T. Okada, S. Tomita, An optimal orthonormal system for discriminant analysis. Pattern Recog-
nit. 18, 139–144 (1985)
11. J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-fuzzy and Soft Computing (Prentice Hall, 1997)
12. J. MacQueen, Some methods for classification and analysis of multivariate observations, in
Proceedings of the Fifth Berkeley Symposium on Mathematical statistics and Probability, vol.
1, ed. by L.M. LeCam, J. Neyman (University of California Press, 1977), pp. 282–297
13. G.H. Ball, D.J. Hall, Isodata: a method of data analysis and pattern classification. Technical
report, Stanford Research Institute, Menlo Park, United States. Office of Naval Research.
Information Sciences Branch (1965)
14. J.R. Jensen, Introductory Digital Image Processing: A Remote Sensing Perspective, 2nd edn.
(Prentice Hall, Upper Saddle River, NJ, 1996)
15. L.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press,
New York, 1981)
16. C.K. Chow, On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theory 16,
41–46 (1970)
17. A.R. Webb, K.D. Copsey, Statistical Pattern Recognition, 3rd edn. (Prentice Hall, Upper Saddle
River, NJ, 2011). ISBN 978-0-470-68227-2
18. R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd edn. (Wiley, 2001). ISBN
0471056693
19. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edn. (Academic Press Pro-
fessional, Inc., 1990). ISBN 978-0-470-68227-2
20. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the
EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977)
21. W. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull.
Math. Biophys. 5, 115–133 (1943)
192 1 Object Recognition
22. H. Robbins, S. Monro, A stochastic approximation method. Ann. Math. Stat. 22, 400–407
(1951)
23. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288
(1996)
24. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci 79, 2554–2558 (1982)
25. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
26. J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, CA,
1993)
27. L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees (Wadsworth
Books, 1984)
28. X. Lim, W.Y. Loh, X. Shih, A comparison of prediction accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Mach. Learn. 40, 203–228 (2000)
29. P.E. Utgoff, Incremental induction of decision trees. Mach. Learn. 4, 161–186 (1989)
30. J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length prin-
ciple. Inf. Comput. 80, 227–248 (1989)
31. J.R. Quinlan, Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221–234 (1987)
32. L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis
(Wiley, 2009)
33. T. Zhang, R. Ramakrishnan, M. Livny, Birch: an efficient data clustering method for very large
databases, in Proceedings of SIGMOD’96 (1996)
34. S. Guha, R. Rastogi, K. Shim, Rock: a robust clustering algorithm for categorical attributes, in
Proceedings in ICDE’99 Sydney, Australia (1999), pp. 512–521
35. G. Karypis, E.-H. Han, V. Kumar, Chameleon: a hierarchical clustering algorithm using
dynamic modeling. Computer 32, 68–75 (1999)
36. K.S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, Englewood Cliffs, NJ,
1982)
37. N. Chomsky, Three models for the description of language. IRE Trans. Inf. Theory 2, 113–124
(1956)
38. H. J. Zimmermann, B.R. Gaines, L.A. Zadeh, Fuzzy Sets and Decision Analysis (North Holland,
Amsterdam, New York, 1984). ISBN 0444865934
39. D.E. Knuth, On the translation of languages from left to right. Inf. Control 8(6), 607–639 (1965)
40. D. Marcus, Graph Theory: A Problem Oriented Approach, 1st edn. (The Mathematical Asso-
ciation of America, 2008). ISBN 0883857537
41. D.H. Ballard, C.M. Brown, Computer Vision (Prentice Hall, 1982). ISBN 978-0131653160
42. A. Barrero, Three models for the description of language. Pattern Recognit. 24(1), 1–8 (1991)
43. R.E. Woods, R.C. Gonzalez, Digital Image Processing, 2nd edn. (Prentice Hall, 2002). ISBN
0201180758
44. P.H. Winston, Artificial Intelligence (Addison-Wesley, 1984). ISBN 0201082594
45. D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings. SIAM J. Comput. 6(1), 323–350 (1977)
46. R.S. Boyer, J.S. Moore, A fast string searching algorithm. Commun. ACM 20(10), 762–772
(1977)
47. A. Hume, D. Sunday, Fast string searching. Softw. Pract. Exp. 21(11), 1221–1248 (1991)
48. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press
and McGraw-Hill, 2001). ISBN 0-262-03293-7
49. R.N. Horspool, Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
50. D.M. Sunday, A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990)
51. N. Gonzalo, A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88
(2001)
52. P. Clifford, R. Clifford, Simple deterministic wildcard matching. Inf. Process. Lett. 101(2),
53–54 (2007)
2 RBF, SOM, Hopfield, and Deep Neural Networks
2.1 Introduction
We begin by describing three different neural network architectures: Radial Basis Functions (RBF), Self-Organizing Maps (SOM), and the Hopfield network. The Hopfield network has the ability to memorize information and to recover it from partial contents of the original information. As we shall see, its originality rests on physical foundations that have revitalized the entire field of neural networks. The network is associated with an energy function to be minimized during its evolution through a succession of states, until it reaches a final state corresponding to the minimum of the energy function. This characteristic allows it to be used to set up and solve an optimization problem, by associating the objective function with an energy function.
The SOM network, instead, has an unsupervised learning model, and its originality is to autonomously group the input data on the basis of their similarity, without evaluating a convergence error against external information on the data. It is useful when we have no exact knowledge of the data with which to classify them. It is inspired by the topology of the cortex of the brain, considering the connectivity of neurons and, in particular, the behavior of an activated neuron and its influence on neighboring neurons, whose bonds are reinforced, while those further away become weaker. Extensions of the SOM bring it back to supervised versions, as in the Learning Vector Quantization versions SOM-LVQ1, LVQ2, etc., which essentially serve to label the classes and to refine the decision boundaries.
The RBF network uses the same neuron model as the MLP but differs in the architectural simplification of the network and in the activation function (based on radial basis functions), which implements Cover's theorem. In fact, the RBF provides only one hidden layer, and the output layer consists of a single neuron. The MLP network is more vulnerable in the presence of noise in the data, while the RBF is more robust to noise, thanks to the radial basis functions and to the linear combination of the outputs of the hidden neurons (the MLP instead uses nonlinear activation functions).
The design of a supervised neural network can be carried out in a variety of ways. The backpropagation algorithm for a multilayer (supervised) network, introduced in the previous chapter, can be seen as the application of a recursive technique known in statistics as stochastic approximation. RBF uses a different approach, treating the design of a neural network as a "curve fitting" problem, that is, the resolution of an approximation problem in a very high-dimensional space: learning reduces to finding the surface, in a multidimensional space, that provides the best "fit" for the training data, where the best fit is measured in a statistical sense. Similarly, the generalization phase is equivalent to interpolating test data, never seen before by the network, on this multidimensional surface found with the training data.
The network is structured in three layers: input, hidden, and output. The input layer is directly connected with the environment, that is, with the sensory units (raw data) or with the output of a feature extraction subsystem. The hidden layer (unique in the network) is composed of neurons in which radial basis functions are defined, hence the name radial basis functions, and it performs a nonlinear transformation of the input data supplied to the network. These neurons form a basis for the input data (vectors). The output layer is linear, and it provides the network response to the input pattern presented.
The reason for using a nonlinear transformation in the hidden layer, followed by a linear one in the output layer, is described in an article by Cover (1965), according to which a pattern classification problem cast in a much larger space (i.e., through the nonlinear transformation from the input layer to the hidden one) is more likely to be linearly separable than in a low-dimensional space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is much greater than the cardinality of the input signal).
Let C = {x1, . . . , xN} be a set of patterns in the m0-dimensional input space, each assigned to one of two classes C1 and C2, and let {ϕi(x) | i = 1, . . . , m1} be a family of real-valued functions through which the m0-dimensional input space is transformed into a new m1-dimensional space, as follows:

ϕ(x) = [ϕ1(x), ϕ2(x), . . . , ϕm1(x)]^T    (2.1)

The function ϕ, therefore, performs the nonlinear spatial transformation from a space to a larger one (m1 > m0); it corresponds to the neurons of the hidden layer of the RBF network. A binary partition (dichotomy) [C1, C2] of C is said to be ϕ-separable if there exists an m1-dimensional vector w such that we can write

w^T ϕ(x) > 0 if x ∈ C1,    w^T ϕ(x) < 0 if x ∈ C2
Cover's theorem on the separability of patterns thus involves two basic ingredients:
1. A nonlinear formulation of the functions of the hidden layer, defined by ϕi(x), with x the input vector and i = 1, . . . , m1 the cardinality of the hidden layer.
2. A dimensionality of the hidden space greater than that of the input space, determined, as we have seen, by the value of m1 (i.e., by the number of neurons in the hidden layer).
It should be noted that in some cases it may be sufficient to satisfy only point 1, i.e., the nonlinear transformation, without increasing the dimension of the input space by increasing the neurons of the hidden layer (point 2), in order to obtain linear separability. The XOR example illustrates this last observation. Consider 4 points in a 2D space, (0, 0), (1, 0), (0, 1), and (1, 1), on which we construct an RBF neural network that solves the XOR function. It was observed earlier that the single perceptron is not able to represent this type of function, the problem being nonlinearly separable. Let us see how, using Cover's theorem, it is possible to obtain linear separability following a nonlinear transformation of the four points. We define two Gaussian transformation functions as follows:
ϕ1(x) = exp(−‖x − t1‖²),    t1 = [1, 1]^T
ϕ2(x) = exp(−‖x − t2‖²),    t2 = [0, 0]^T
In the Table 2.1 the values of the nonlinear transformation of the four points
considered are shown and in Fig. 2.2 their representation in the space ϕ. We can
observe how they become linearly separable after the nonlinear transformation with
the help of the Gaussian functions defined above.
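The transformation can be reproduced with the short Python sketch below (our illustration, not the book's code): the four 2D points are mapped with the two Gaussian functions ϕ1, ϕ2 centered in t1 = [1, 1] and t2 = [0, 0], and the transformed points become linearly separable.

import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def phi(x, t):
    return np.exp(-np.sum((x - t) ** 2))     # exp(-||x - t||^2)

for x in points:
    print(x, "->", (round(phi(x, t1), 4), round(phi(x, t2), 4)))
# (0,0) -> (0.1353, 1.0)      XOR class 0
# (0,1) -> (0.3679, 0.3679)   XOR class 1
# (1,0) -> (0.3679, 0.3679)   XOR class 1
# (1,1) -> (1.0, 0.1353)      XOR class 0
# In the (phi1, phi2) plane the two classes can now be separated by a straight line.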
Cover's theorem shows that there is a clear benefit in operating a nonlinear transformation from the input space into a new space of larger dimension in order to obtain separable patterns in a pattern recognition problem. Mainly, a nonlinear mapping is used to transform a nonlinearly separable classification problem into a linearly separable one. Similarly, a nonlinear mapping can be used to transform a nonlinear filtering problem into one that involves linear filtering.
For simplicity, consider a feedforward network with an input layer, a hidden layer, and an output layer, the latter consisting of only one neuron. The network, in this case, operates a nonlinear transformation from the input layer into the hidden one, followed by a linear one from the hidden layer into the output one. If m0 indicates, as always, the dimensionality of the input layer (therefore, m0 neurons in the input layer), the hidden layer performs the mapping ϕ from this m0-dimensional space into the m1-dimensional hidden space.
Fig. 2.2 Representation of the nonlinear transformation of the four points for the XOR problem
that become linearly separable in the space ϕ
In this curve-fitting view of learning, two phases can be distinguished:
(a) the training phase is the optimization of a fitting procedure for the surface, starting from known examples (i.e., training data) that are presented to the network as input–output pairs (patterns);
(b) the generalization phase is synonymous with interpolation between the data, the interpolation being performed along the surface obtained by the training procedure, using optimization techniques that allow obtaining a surface close to the real one.
The interpolation problem, in its strict sense, may then be stated as follows: given a set of N distinct points {xi ∈ R^m0 | i = 1, . . . , N} and a corresponding set of real numbers {di ∈ R | i = 1, . . . , N}, find a function F : R^m0 → R that satisfies the interpolation condition

F(xi) = di    i = 1, 2, . . . , N    (2.6)
For better interpolation, the interpolating surface (therefore, the function F) passes
through all the points of the training data. The RBF technique consists of choosing
a function F with the following form:
F(x) = Σ_{i=1}^{N} wi ϕ(‖x − xi‖)    (2.7)
Imposing the interpolation conditions (2.6) on (2.7), we obtain the system of linear equations

Σ_{i=1}^{N} wi ϕ(‖xj − xi‖) = dj    j = 1, 2, . . . , N    (2.8)

with
ϕji = ϕ(‖xj − xi‖)    (j, i) = 1, 2, . . . , N    (2.9)
Let
d = [d1, d2, . . . , dN]^T
w = [w1, w2, . . . , wN]^T
and let Φ denote the N × N matrix with elements ϕji, which will be called the interpolation matrix. Rewriting Eq. (2.8) in compact form, we get

Φ w = d    (2.11)
Assuming that Φ is a non-singular matrix, its inverse Φ⁻¹ exists, and therefore the solution of Eq. (2.11) for the weights w is given by

w = Φ⁻¹ d    (2.12)
Gaussian
ϕ(r) = exp(−r²/(2σ²))    (2.13)

Multiquadrics
ϕ(r) = √(r² + σ²)    (2.14)

Inverse multiquadrics
ϕ(r) = σ²/√(r² + σ²)    (2.15)

Cauchy
ϕ(r) = σ²/(r² + σ²)    (2.16)
The four functions described above are depicted in Fig. 2.3. For the radial functions defined in Eqs. (2.13)–(2.16) to give rise to a non-singular interpolation matrix, all the points of the dataset {xi}, i = 1, . . . , N, must necessarily be distinct from one another, regardless of the sample size N and of the cardinality m0 of the input vectors xi.
The inverse multiquadrics (2.15), the Cauchy function (2.16), and the Gaussian function (2.13) share the same property: they are localized functions, in the sense that ϕ(r) → 0 for r → ∞. In all these cases, Φ is a positive definite matrix. In
[Fig. 2.3: plots of the Gaussian, multiquadric, inverse multiquadric, and Cauchy radial basis functions]
contrast, the family of multiquadric functions defined in (2.14) is nonlocal, because ϕ(r) grows without bound for r → ∞, and the corresponding interpolation matrix Φ has N − 1 negative eigenvalues and only one positive eigenvalue, with the consequence of not being positive definite. It can nevertheless be established that an interpolation matrix Φ based on multiquadric functions (introduced by Hardy [1]) is not singular, and is therefore suitable for designing an RBF network. Furthermore, it can be remarked that radial basis functions that grow to infinity, such as multiquadrics, can be used to approximate a smooth input–output mapping with greater accuracy than those that make the interpolation matrix positive definite (this result can be found in Powell [2]).
The interpolation procedure described so far may not give good results when the network has to generalize (i.e., on examples never seen before). This problem arises when the number of training samples is far greater than the degrees of freedom of the underlying physical process to be modeled; in that case, being bound to have as many radial functions as there are training data results in an oversized problem. The network then attempts to fit the mapping function exactly, responding precisely when a data item seen during the training phase is presented, but failing on data never seen before. The result is that the network generalizes poorly, giving rise to the problem of overfitting. In general, learning means finding the hypersurface (for multidimensional problems) that allows the network to respond (generate an output) to the input provided. This mapping is defined by the equation of the hypersurface found in the learning phase. So learning can be seen as a hypersurface reconstruction problem, given a set of examples that can be scattered.
There are two types of problems that are generally encountered: ill-posed problems and well-posed problems. Let us see what they consist of. Suppose we have a domain X and a set Y in some metric space, related to each other by an unknown functional f, which is the objective of learning. The problem of reconstructing the mapping function f is said to be well-posed if it satisfies the following three conditions:

1. Existence: for every input x ∈ X there exists an output y = f(x) ∈ Y.
2. Uniqueness: for every pair of inputs x, t ∈ X, f(x) = f(t) if and only if x = t.
3. Continuity (stability): for every ε > 0 there exists δ = δ(ε) > 0 such that ρ(x, t) < δ implies ρ(f(x), f(t)) < ε,

where ρ(•, •) represents the distance between the two arguments in their respective spaces. The continuity property is also referred to as the property of stability.
If any of these conditions is not met, the problem is said to be ill-posed. In ill-posed problems, even very large datasets of examples may contain little information about the problem to be solved. The physical phenomena responsible for generating the training data (for example, speech, radar signals, sonar signals, images, etc.) are well-posed problems. However, learning from these forms of physical signals, seen as the reconstruction of a hypersurface, is an ill-posed problem for the following reasons.
The existence criterion can be violated when a distinct output does not exist for each input. There may not be enough information in the training dataset to reconstruct the input–output mapping function univocally; therefore, the uniqueness criterion could be violated. The noise or inaccuracies present in the training data add uncertainty to the input–output mapping surface. This last problem violates the continuity criterion since, if there is a lot of noise in the data, it is likely that the desired output y falls outside the range Y for a specified input vector x ∈ X.
Paraphrasing Lanczos [3], we can say that there is no mathematical artifice that can remedy the information missing from the training data. An important result on how to turn an ill-posed problem into a well-posed one is provided by the theory of regularization.
Regularization theory was introduced by Tikhonov in 1963 for the solution of ill-posed problems. The basic idea is to stabilize the solution of the hypersurface reconstruction by introducing some nonnegative functional that integrates a priori information on the solution. The most common form of a priori information involves the assumption that the input–output mapping function (i.e., the solution of the reconstruction problem) is smooth, in the sense that similar inputs correspond to similar outputs. Let the input and output data sets (which represent the training set) be described as follows:

Input:  xi ∈ R^m0,   i = 1, 2, . . . , N;    Desired response:  di ∈ R,   i = 1, 2, . . . , N    (2.17)
The fact that the output is one-dimensional does not affect generality in the extension to multidimensional output cases. Let F(x) be the mapping function to be sought (the weight variable w has been dropped from the arguments of F); the Tikhonov regularization theory includes two terms:
1. Standard Error. Denoted by ξs(F), it measures the error (distance) between the desired response (target) di and the actual network response yi over the training samples i = 1, 2, . . . , N:

ξs(F) = (1/2) Σ_{i=1}^{N} (di − yi)²    (2.18)
      = (1/2) Σ_{i=1}^{N} (di − F(xi))²    (2.19)

2. Regularization Term. Denoted by ξc(F), it depends on the geometric (smoothness) properties of the approximating function F and can be written as ξc(F) = (1/2)‖DF‖², where D is a linear differential (smoothing) operator that embeds the a priori information on the solution.
So the quantity that must be minimized in regularization theory is

ξ(F) = ξs(F) + λ ξc(F)    (2.20)

where λ is a positive real number called the regularization parameter, which can be viewed as an indicator of the sufficiency of the training set in specifying the solution Fλ(x). In particular, in the limiting case when
λ → 0 implies that the problem is unconstrained, the solution Fλ (x) is completely
determined by the training examples. The other case, where λ → ∞ implies that the
continuity constraint introduced by the smooth operator D is sufficient to specify the
solution Fλ (x), or another way of saying that the examples are unreliable. In practical
applications, the parameter λ is assigned a value between the two boundary condi-
tions, so that both training examples and information a priori contribute together for
the solution Fλ(x). After a series of steps, we arrive at the following expression for the solution of the regularization problem:
Fλ(x) = (1/λ) Σ_{i=1}^{N} [di − F(xi)] G(x, xi)    (2.21)
where G(x, xi ) is called the Green function which we will see later on as one of
the radial-based functions. The Eq. (2.21) establishes that the minimum solution
Fλ (x) to the regularization problem is the superposition of N Green functions. The
vectors of the sample xi represent the expansion centers, and the weights [di −
F(xi )]/λ represent the expansion coefficients. In other words, the solution to the
regularization problem lies in a N -dimensional subspace of the space of smoothing
functions, and the set of Green functions {G(x, xi )} centered in xi , i = 1, 2, . . . , N
form a basis for this subspace. Note that the expansion coefficients in (2.21) are:
linear in the error estimation defined as the difference between the desired response
di and the corresponding output of the network F(xi ); and inversely proportional to
the regularization parameter λ.
Let us now calculate the expansion coefficients, which are not known, defined by

wi = (1/λ) [di − F(xi)],    i = 1, 2, . . . , N    (2.22)
We rewrite (2.21) as follows:

Fλ(x) = Σ_{i=1}^{N} wi G(x, xi)    (2.23)

Evaluating (2.23) at the training points xj, we obtain

Fλ(xj) = Σ_{i=1}^{N} wi G(xj, xi)    j = 1, 2, . . . , N    (2.24)
We now introduce the following vector and matrix definitions:

Fλ = [Fλ(x1), Fλ(x2), . . . , Fλ(xN)]^T    (2.25)
d = [d1, d2, . . . , dN]^T    (2.26)

G = ⎡ G(x1, x1)  G(x1, x2)  · · ·  G(x1, xN) ⎤
    ⎢ G(x2, x1)  G(x2, x2)  · · ·  G(x2, xN) ⎥
    ⎢     ⋮           ⋮       ⋱        ⋮      ⎥
    ⎣ G(xN, x1)  G(xN, x2)  · · ·  G(xN, xN) ⎦    (2.27)

w = [w1, w2, . . . , wN]^T    (2.28)
we can rewrite (2.22) and (2.24) in matrix form as follows:

w = (1/λ) (d − Fλ)    (2.29)
and
Fλ = G w    (2.30)

Eliminating Fλ between (2.29) and (2.30), we obtain

(G + λI) w = d    (2.31)
where I is the N × N identity matrix. The matrix G is named the Green matrix. The Green functions are symmetric (for the classes of functions seen above), namely

G(xi, xj) = G(xj, xi)    i, j = 1, 2, . . . , N    (2.32)

and therefore the Green matrix is also symmetric, and it is positive definite if all the points of the sample are distinct from one another; we have

G^T = G    (2.33)

Solving (2.31) for the weight vector, we obtain

w = (G + λI)⁻¹ d    (2.34)
This equation allows us to obtain the vector of weights w having identified the Green
function G(xj , xi ) for i = 1, 2, . . . , N ; the desired answer d; and an appropriate value
of the regularization parameter λ. In conclusion, it can be established that a solution
to the regularization problem is provided by the following expansion:
Fλ(x) = Σ_{i=1}^{N} wi G(x, xi)    (2.35)
Two observations follow:
(a) the approach based on regularization theory is equivalent to the expansion of the solution in terms of Green functions, whose characterization depends only on the form of the stabilizer D and on the associated boundary conditions;
(b) the number of Green functions used in the expansion is equal to the number of examples used in the training process.
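A small numerical sketch of the regularized solution w = (G + λI)⁻¹d of Eq. (2.34) follows (our illustration, not the book's code); it uses the Gaussian radial function (2.13) as Green function centered on the training points, and the data are purely hypothetical.

import numpy as np

def regularized_weights(X, d, lam, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)    # ||x_j - x_i||^2
    G = np.exp(-sq / (2.0 * sigma ** 2))                         # Green (Gram) matrix
    return np.linalg.solve(G + lam * np.eye(len(X)), d)          # (G + lambda*I) w = d

X = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])                # toy 1D inputs
d = np.sin(np.pi * X[:, 0])                                      # toy desired responses
print(np.round(regularized_weights(X, d, lam=0.1), 3))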
The characterization of the Green functions G(x, xi) for a specific center xi depends only on the stabilizer D, i.e., on the a priori information assumed about the input–output mapping. If this stabilizer is invariant under translation, then the Green function centered in xi depends only on the difference between its two arguments, that is,

G(x, xi) = G(x − xi)    (2.36)

If, instead, the stabilizer must be invariant both under translation and under rotation, then the Green function will depend on the Euclidean distance between its two arguments, namely

G(x, xi) = G(‖x − xi‖)    (2.37)
Under these conditions, the solution (2.35) takes the form

Fλ(x) = Σ_{i=1}^{N} wi G(‖x − xi‖)    (2.38)
Therefore, the solution is entirely determined by the N training vectors that help to
construct the interpolating surface F(x).
The output layer consists of a single linear neuron (but can also be composed of several output neurons) fully connected to the hidden layer. By linear, we mean that the output neuron calculates its output value as the weighted sum of the outputs of the neurons of the hidden layer. The weights wi of the output layer represent the unknown variables, which depend on the Green functions G(‖x − xi‖) and on the regularization parameter λ.
The Green functions G(‖x − xi‖) are positive definite for each i, and one of the forms satisfying this property is the Gaussian one:

G(x, xi) = exp( −‖x − xi‖² / (2σi²) )    (2.39)

remembering that xi represents the center of the function and σi its width. With the condition that the Green functions are positive definite, the solution produced by the network will be an optimal interpolation in the sense that it minimizes the cost
function seen previously ξ(F). We remind you that this cost function indicates how
much the solution produced by the network deviates from the true data represented
by the training data. Optimality is, therefore, closely related to the search for the
minimum of this cost function ξ(F).
Fig. 2.4 also shows the bias (a variable independent of the data) applied to the output layer. This is represented by setting one of the linear weights equal to the bias, w0 = b, and treating the associated radial function as a constant equal to +1. Concluding, to solve an RBF network, knowing in advance the input data and the shape of the radial basis functions, the variables to be searched are the linear weights wi and the centers xi of the radial basis functions.
Let {ϕi(x) | i = 1, 2, . . . , m1} be the family of radial functions of the hidden layer, which we assume to be linearly independent. We therefore define

ϕi(x) = G(‖x − ti‖),   i = 1, 2, . . . , m1

where ti are the centers of the radial functions to be determined. In the case in which the training data are few or computationally tractable in number, these centers coincide with the training data, that is, ti = xi for i = 1, 2, . . . , N. Therefore, the new interpolating solution F* is given by the following equation:
F*(x) = Σ_{i=1}^{m1} wi G(x, ti) = Σ_{i=1}^{m1} wi G(‖x − ti‖)
which defines the new interpolating function with the new weights {wi | i = 1, 2, . . . , m1} to be determined in order to minimize the new cost function

ξ(F*) = Σ_{i=1}^{N} ( di − Σ_{j=1}^{m1} wj G(‖xi − tj‖) )² + λ‖DF*‖²    (2.41)

The first term on the right-hand side of this equation can be expressed as the squared Euclidean norm ‖d − Gw‖², where
d = [d1, d2, . . . , dN]^T    (2.42)

G = [ G(x1, t1)  G(x1, t2)  ···  G(x1, tm1) ]
    [ G(x2, t1)  G(x2, t2)  ···  G(x2, tm1) ]
    [    ...        ...     ···      ...    ]
    [ G(xN, t1)  G(xN, t2)  ···  G(xN, tm1) ]    (2.43)
and minimizing the (2.41) with respect to the weight vector w, we arrive at the following equation:

(G^T G + λG0) w = G^T d    (2.47)

where G0 denotes the m1 × m1 Green matrix evaluated at the centers, with elements G(tj, ti). As the regularization parameter λ tends to zero, the weight vector converges to the pseudo-inverse (minimum-norm) solution of the over-determined least-squares problem:

w = G^+ d,   λ = 0    (2.48)
The (2.48) represents the solution to the problem of learning the weights of an RBF network. Let us now see the RBF learning strategies which, starting from a training set, describe different ways of obtaining (in addition to the weight vector w) the centers of the radial basis functions of the hidden layer and their standard deviations.
So far the solution of the RBF has been found in terms of the weights between the hidden and output layers, which are closely related to how the activation functions of the hidden layer are configured and possibly evolve over time. There are different approaches for the initialization of the radial basis functions of the hidden layer. In the following, we will show some of them.
The simplest approach is to fix the Gaussian centers (radial basis functions of the hidden layer) by choosing them randomly from the available training dataset. We can use an isotropic Gaussian function whose standard deviation σ is fixed according to the dispersion of the centers, that is, a normalized version of the radial basis function centered in ti:
G(‖x − ti‖) = exp( −(m1/dmax²) ‖x − ti‖² ),   i = 1, 2, . . . , m1    (2.50)

where m1 is the number of centers (i.e., of neurons of the hidden layer) and dmax is the maximum distance between the chosen centers. The standard deviation (width) of the Gaussian radial functions is fixed to

σ = dmax / √(2 m1)    (2.51)
The latter ensures that the identified radial functions are neither too peaked nor too flat. As an alternative to Eq. (2.51), differently sized radial functions can be used, that is, larger standard deviations in areas where the data are very dispersed and vice versa. This, however, presupposes a preliminary study of the distribution of the available training data. The remaining network parameters to be determined are the weights of the connections going from the hidden layer to the output layer, obtained with the pseudo-inverse method of G described above in (2.48) and (2.49). The G matrix is defined as follows:
G = {gji}    (2.52)

with

gji = exp( −(m1/dmax²) ‖xj − ti‖² ),   i = 1, 2, . . . , m1;  j = 1, 2, . . . , N    (2.53)
with xj the jth vector of the training set. Note that if the samples are reasonably few (while still satisfying Cover's theorem described above) so as not to affect the computational complexity considerably, the centers of the radial functions can also be fixed at the training observations, i.e., ti = xi.
The computation of the pseudo-inverse matrix is done by the Singular Value Decomposition (SVD) as follows. Let G be an N × M matrix of real values; then there exist two orthogonal matrices

U = [u1, u2, . . . , uN]    (2.54)

and

V = [v1, v2, . . . , vM]    (2.55)

such that

U^T G V = diag(σ1, σ2, . . . , σK),   K = min(M, N)    (2.56)
where

σ1 ≥ σ2 ≥ · · · ≥ σK > 0    (2.57)

The column vectors of the matrix U are called the left singular vectors of G, while those of V the right singular vectors of G. The values σ1, σ2, . . . , σK are called the singular values of the matrix G. According to the SVD theorem, the M × N pseudo-inverse matrix of G is defined as follows:

G^+ = V Σ^+ U^T    (2.58)

where Σ^+ = diag(1/σ1, 1/σ2, . . . , 1/σK, 0, . . . , 0).
Experience with randomly selected centers shows that this method is relatively insensitive to the use of regularization.
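As an illustration of this learning strategy, the following sketch (our own, with illustrative data and variable names) fixes randomly chosen centers, sets the width with (2.51), and obtains the output weights through the SVD-based pseudo-inverse of G:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))                # training inputs x_j
d = np.sin(X[:, 0]) * np.cos(X[:, 1])                # desired responses d_j

m1 = 20                                              # number of centers (hidden neurons)
centers = X[rng.choice(len(X), m1, replace=False)]   # centers t_i picked at random
d_max = np.max(np.linalg.norm(centers[:, None] - centers[None, :], axis=-1))
sigma = d_max / np.sqrt(2 * m1)                      # width, Eq. (2.51)

# Interpolation matrix exp(-||x_j - t_i||^2 / (2 sigma^2)); with sigma from (2.51)
# this is the same as Eq. (2.53), since 1/(2 sigma^2) = m1 / d_max^2
G = np.exp(-np.sum((X[:, None] - centers[None, :]) ** 2, axis=-1) / (2 * sigma ** 2))

# Weights via the SVD-based pseudo-inverse, w = G^+ d, Eqs. (2.48) and (2.58)
U, s, Vt = np.linalg.svd(G, full_matrices=False)
G_plus = Vt.T @ np.diag(1.0 / s) @ U.T               # assumes nonzero singular values
w = G_plus @ d                                       # equivalent to np.linalg.pinv(G) @ d
```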
One of the problems encountered with very large datasets is the impossibility of setting the centers based on the size of the training dataset, whether they are randomly selected or coincide with the training data themselves. Training datasets with millions of examples would involve millions of neurons in the pattern layer, with the consequence of raising the computational complexity of the classifier. To overcome this, one can find a number of centers lower than the cardinality of the training dataset but still descriptive of the probability distribution of the available examples. To do this, clustering techniques such as fuzzy K-means, K-means, or self-organizing maps can be used.
Therefore, in a first phase, the number of prototypes to be learned with the cluster-
ing technique is set, and subsequently, we find the weights of the RBF network with
the radial basis functions centered in the prototypes learned in the previous phase.
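A minimal sketch of this two-phase procedure is given below; the plain K-means implementation, the synthetic data, and the parameter choices are ours and only meant to illustrate the idea:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # Plain Lloyd's algorithm: returns k prototype vectors (the future RBF centers)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# Phase 1: learn m1 prototypes from a large training set
X = np.random.default_rng(2).normal(size=(5000, 3))
d = (X ** 2).sum(axis=1)
m1 = 30
centers = kmeans(X, m1)

# Phase 2: fix the widths from the dispersion of the centers and solve for the weights
d_max = np.max(np.linalg.norm(centers[:, None] - centers[None, :], axis=-1))
sigma = d_max / np.sqrt(2 * m1)                      # Eq. (2.51)
G = np.exp(-np.sum((X[:, None] - centers[None, :]) ** 2, axis=-1) / (2 * sigma ** 2))
w = np.linalg.pinv(G) @ d                            # Eq. (2.48)
```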
In essence, this layer of neurons adapts during the learning phase, in the sense
that the positions of the individual neurons are indicators of the significant statistical
characteristics of the input stimuli. This process of spatial adaptation of input pattern
characteristics is also known as feature mapping. SOMs learn in unsupervised mode, without a priori knowledge, hence the name self-organizing networks: they are able to interact with the data, training themselves without a supervisor.
Like all neural networks, SOMs have a neuro-biological motivation, based on the spatial organization of brain functions, as observed especially in the cerebral cortex. Kohonen developed the SOM based on the studies of C. von der Malsburg [5] and on the neural field models of Amari [6]. A first emulative feature of the SOM concerns the behavior of the human brain when subjected to an input signal. When a layer of the neural network of the human brain receives an input signal, very close neurons are strongly excited with stronger bonds, while those at an intermediate distance are inhibited, and distant ones are weakly excited.
Similarly in SOM, during learning, the map is partitioned into regions, each of
which represents a class of input patterns (principle of topological map formation).
Another characteristic of biological neurons, when stimulated by input signals, is that
of manifesting an activity in a coordinated way such as to differentiate themselves
from the other less excited neurons. This feature was modeled by Kohonen restricting
the adaptation of weights only to neurons in the vicinity of what will be considered
the winner (competitive learning). This last aspect is the essential characteristic of
unsupervised systems, in which the output neurons compete with each other before
being activated, with the result that only one is activated at any time. The winning
neuron is called the winner-takes-all neuron.
Similarly to the multilayer neural networks, the SOM presents a feed-forward ar-
chitecture with a layer of input neurons and a single layer of neurons, arranged on
a regular 2D grid, which combine the computational output functions (see Fig. 2.5).
The computation-output layer (Kohonen layer) can also be 1D (with a single row or
column of neurons) and rarely higher than 2D maps. Each input neuron is connected
to all the neurons of the Kohonen layer. Let x = (x1, x2, . . . , xd) be the generic d-dimensional pattern among the N input patterns to be presented to the network, and let PEj (Processing Element) be the generic neuron of the 2D grid composed of M = Mr × Mc neurons arranged in Mr rows and Mc columns.
The input neurons xi, i = 1, . . . , d, only perform a memory function and are connected to the neurons PEj, j = 1, . . . , M, through the weight vectors wj, j = 1, . . . , M, of the same dimensionality d as the input pattern vectors. For a configuration with d input neurons and M PE neurons of the Kohonen layer, we have in total d·M connections. The activation potential yj of the single neuron PEj is given by the
Fig. 2.5 Kohonen network architecture. The input layer has d neurons that only have a memory
function for the input patterns x d -dimensional, while the Kohonen layer has 63 PE neurons (process
elements). The window size, centered on the winning neuron, gradually decreases with the iterations
and includes the neurons of the lateral interaction
inner product between the generic pattern vector x and the weight vector wj:

yj = wj^T x = Σ_{i=1}^{d} wji xi    (2.60)
Competition: for each input pattern x, all PE neurons calculate their respective activation potential, which provides the basis of their competition. Once the discriminant evaluation function is defined, only one PE neuron must win the competition. The (2.60) can be used as a discriminant function, choosing as the winning neuron the one with maximum activation potential:

v = arg max_{j=1,...,M} yj = arg max_{j=1,...,M} Σ_{i=1}^{d} wji xi    (2.61)

Equivalently, the minimum Euclidean distance between the input vector and the weight vectors can be used:

v = arg min_{j=1,...,M} ‖x − wj‖    (2.62)

With the (2.62), the neuron whose weight vector is closest to the pattern presented as input to the network is selected as the winning neuron. With the inner product it is necessary to normalize the vectors to unitary norm (‖x‖ = ‖w‖ = 1), while with the Euclidean distance the vectors need not be normalized. The two discriminant functions are equivalent, i.e., the weight vector with minimum Euclidean distance from the input vector is the weight vector with maximum inner product with the same input vector. With this process of competition between the PE neurons, the continuous input space is transformed (mapped) into the discrete output space (Kohonen layer).
Cooperation, this process is inspired by the neuro-biological studies that demon-
strate the existence of a lateral interaction that is a state of excitation of neurons
close to the winning one. When a neuron is activated, neurons in its vicinity
tend to be excited with less and less intensity as their distance from it increases.
It is shown that this lateral interaction between neurons can be modeled with a
function with circular symmetry properties such as the Gaussian and Laplacian
function of the Gaussian (see Sect. 1.13 Vol.II). The latter can achieve an action
of lateral reinforcement interaction for the neurons closer to the winning one and
an inhibitory action for the more distant neurons. For the SOM a similar topology
of proximity can be used to delimit the excitatory lateral interactions for a limited
neighborhood of the winning neuron. If Djv is the lateral distance between the
neuron jth and the winning one v, the Gaussian function of lateral attenuation φ
is given by
φ(j, v) = e^{ −Djv² / (2σ²) }    (2.63)
where σ indicates the circular amplitude of the lateral interaction centered on the
winning neuron. The (2.63) has the property of having the maximum value in the
position of the winning neuron, circular symmetry, decreases monotonically to
zero as the distance tends to infinity and is invariant with respect to the position
of the winning neuron. Neighborhood topologies can be different (for example, 4-neighborhood, 8-neighborhood, hexagonal, etc.); what is important is the variation over time of the extension of the neighborhood σ(t), which it is useful to reduce over time until only the winning neuron remains. A method to reduce it progressively over time, that is, as the iterations of the learning process grow, is given by the following exponential form:

σ(t) = σ0 e^{−t/Dmax}    (2.64)
where σ0 indicates the size of the neighbourhood at the iteration t0 and Dmax
is the maximum dimension of the lateral interaction that decreases during the
training phase. These parameters of the initial state of the network must be selected
appropriately.
Adaptation, this process carries out the formal learning phase in which the Koho-
nen layer self-organizes by adequately updating the weight vector of the winning
neuron and the weight vectors of the neurons of the lateral interaction accord-
ing to the Gaussian attenuation function (2.63). In particular, for the latter, the
adaptation of the weights is smaller than that of the winning neuron. This happens as input pattern vectors are presented to the network. The equation of adaptation (also known as Hebbian learning) of all the weights is applied immediately after determining the winning neuron v, and is given by

wj(t + 1) = wj(t) + η(t) φ(j, v) [x − wj(t)]    (2.65)
where j indicates the jth neuron included in the lateral interaction defined by
the Gaussian attenuation function, t + 1 indicates the current iteration (epoch)
and η(t) controls the learning speed. The expression η(t)φ(j, v) in the (2.65)
represents the weight factor with which the weight vector of the winning neuron
and of the neurons included in the neighborhood of the lateral interaction φ(j, v)
are modified. The latter, given by the (2.63), also depends on σ(t), which, as seen from the (2.64), it is useful to reduce over time (iterations). Likewise, it is useful to vary η(t) over time, starting from a maximum initial value and then reducing it exponentially as follows:

η(t) = η0 e^{−t/tmax}    (2.66)
where η0 indicates the maximum initial value of the learning function η and tmax
indicates the maximum number of expected iterations (learning periods).
From the geometric point of view, the effect of learning for each epoch, obtained
with the (2.65), is to adjust the weight vectors wv of the winning neuron and those
of the neighborhood of the lateral interaction and move them in the direction of
the input vector x. Repeating this process for all the patterns in the training set
realizes the self-organization of the Kohonen map or its topological ordering. In
particular, we obtain a bijection between the feature space (input vectors x) and
the discrete map of Kohonen (winning neuron described by the weight vector wv ).
The weight vectors w can be used as pointers to identify the vector of origin x in
the feature space (see Fig. 2.6).
Starting with completely random weight vectors, the initial state of the Kohonen map is totally disordered.
Presenting to the network the patterns of the training set gradually triggers the
process of self-organization of the network (see Fig. 2.8) during which the topological
ordering of the output neurons is performed with the weight vectors that map as much
as possible the input vectors (network convergence phase).
It may occur that the network converges toward a metastable state, i.e., the network
converges toward a disordered state (in the Kohonen map we have topological de-
fects). This occurs when the lateral interaction function φ(t) decreases very quickly.
Fig. 2.6 Self-organization of the SOM network. It happens through a bijection between the input space X and the Kohonen map (in this case 2D). Presenting to the network a pattern vector x, the winning neuron PEv, whose associated weight vector wv is the most similar to x, is determined. Repeating this competitive process for all the training set patterns produces a tessellation of the input space into regions represented by the winning neurons, whose weight vectors wvj realize the discretization once reprojected into the input space, thus producing the Voronoi tessellation. Each Voronoi region is represented by the weight vectors of the winning neurons (prototypes of the input vectors included in the relative regions)
for i = 1 to N do
    t ← t + 1;
    Present to the network a pattern vector x chosen randomly from the training set;
    Compute the winning neuron wv with the (2.62);
    Update the weight vectors, including those of the neighboring neurons, with the (2.65);
end for
Reduce η(t) and σ(t) according to (2.66) and (2.64)
Normally the convergence phase requires a number of iterations related to the number
of neurons (at least 500 times the number of neurons M ).
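The following sketch (ours; the grid size, decay constants, and the choice Dmax = tmax are illustrative assumptions) condenses the competition (2.62), cooperation (2.63)–(2.64), and adaptation (2.65)–(2.66) steps of the algorithm above:

```python
import numpy as np

def train_som(X, rows=10, cols=10, epochs=30, eta0=0.1, sigma0=3.0, seed=0):
    """Minimal 2D Kohonen SOM: competition (2.62), Gaussian cooperation (2.63),
    adaptation (2.65) with exponentially decaying eta(t) (2.66) and sigma(t) (2.64)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.01, size=(rows * cols, d))      # weight vectors w_j near zero
    # grid coordinates of each PE, used for the lateral distance D_jv
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    t_max = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            t += 1
            eta = eta0 * np.exp(-t / t_max)                # Eq. (2.66)
            sigma = sigma0 * np.exp(-t / t_max)            # Eq. (2.64), with D_max = t_max
            v = np.argmin(np.linalg.norm(W - x, axis=1))   # winning neuron, Eq. (2.62)
            D2 = np.sum((grid - grid[v]) ** 2, axis=1)     # squared lateral distances
            phi = np.exp(-D2 / (2 * sigma ** 2))           # Eq. (2.63)
            W += eta * phi[:, None] * (x - W)              # Eq. (2.65)
    return W.reshape(rows, cols, d)

# usage: uniformly distributed 2D inputs as in Fig. 2.9
X = np.random.default_rng(1).uniform(0, 1, size=(1000, 2))
W = train_som(X)
```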
Like the MLP network, once the synaptic weights have been calculated with the
training phase, the SOM network is used in the test context to classify a generic pattern
vector x not presented in the training phase. Fig. 2.7 shows a simple example of
classification with the SOM 1D network. The number of classes is 6, each of which
has 10 2D input vectors (indicated with the “+” symbol).
The network is configured with 6 neurons with associated initial weight vectors
wi = (0.5, 0.5), i = 1, . . . , 6, and the initial learning parameter η = 0.1. After the
training, the weight vectors, adequately modified by the SOM, each represent the
prototype of the classes. They are indicated with the symbol “◦” and are located in the
center of each cluster. The SOM network is then presented with some input vectors
(indicated with black squares), for testing the network, each correctly classified in
the class to which they belong.
The Kohonen network, by projecting a d-dimensional vector onto a discrete 2D grid, actually performs a transformation that reduces the dimensionality of the data, as happens with the principal component transformation (PCA). In essence, it realizes a nonlinear generalization of the PCA. Let us now examine some peculiar
properties of the Kohonen network.
Fig. 2.8 Example of Kohonen 1D network that groups 6 2D input vectors in 3 classes; using Matlab
the network is configured with 3 neurons. Initially, the 3 weight vectors take on small values and
are randomly oriented. As the input vectors are presented (indicated with “+”), the weight vectors
tend to move toward the most similar input vectors until they reach the final position to represent
the prototype vectors (indicated with “◦” ) of each grouping
One measure of the quality of the topological ordering obtained is the quantization error, i.e., the average distance between each input vector xi and the weight vector wv of its winning neuron:

Eq = (1/N) Σ_{i=1}^{N} ‖xi − wv‖    (2.67)
Fig. 2.9 Simulation of a Kohonen SOM network with 10 × 10 neurons: a uniformly distributed 2D input vectors in the range [0, 1] × [0, 1] with the overlapping weight vectors, whose initial assigned values are around zero; b position of the weight vectors, linked to each other, after 25 iterations; c weights after 300 iterations
The PCA achieves the same objective by diagonalizing the correlation matrix to obtain the
associated eigenvectors and eigenvalues. If the data do not have a linear distribu-
tion, the PCA does not work correctly while the SOM overcomes this problem
by virtue of its topological ordering property. In other words, the SOM is able to
sufficiently approximate a nonlinear distribution of data by finding the principal
surface, and can be considered as a nonlinear generalization of the PCA.
Many applications have been developed with the Kohonen network. An important
fallout is represented by the fact that this simple network model offers plausible
explanations of some neuro-biological phenomena. The Kohonen network is used in the field of combinatorial optimization to solve the traveling salesman problem, in the fields of Economic Analysis, Data Mining, Data Compression, real-time Phoneme Recognition, and in the field of robotics to solve the problem of inverse kinematics. Several applications have been developed for signal and image
processing (segmentation, classification, texture, ...). Finally, various academic and
commercial software packages are available.
To improve the classification process, in some applications, Kohonen maps can
be given in input to a linear classification process of the supervised type. In this case,
we speak of a hybrid neural network that combines the SOM algorithm that produces
the unsupervised feature maps with the supervised linear one of a backpropagation
MLP network to achieve a more accurate and more efficient adaptive classification
requiring a smaller number of iterations.
Fig. 2.10 Learning Vector Quantization Network: a Functional scheme with the SOM component
for competitive learning and the LVQ component for supervised learning; b architecture of the
LVQ network, composed of the input layer, the layer of Kohonen neurons and the computation
component to reinforce the winning neuron if the input has been correctly classified by the SOM
component
The LVQ component compares the class label of the input vector x with that of the class to which the prototype wv belongs and reinforces it appropriately if they belong to the same class.
Figure 2.10b shows the architecture of the LVQ network. At the schematic level,
we can consider it with three linear layers of neurons: input, Kohonen, and output.
In reality, the M process neurons are only those of the Kohonen layer. The d input
neurons only have the function of storing the input vectors {x} randomly presented
individually. Each input neuron is connected with all the neurons of the Kohonen
layer. The number of neurons in the output layer is equal to the C number of the
classes.
The network is strongly conditioned by the number of neurons used in the Kohonen layer. Each neuron of this layer represents a prototype of a class, whose values are defined by the weight vector wv, i.e., by the synaptic connections of the neuron with all the input neurons. The number of neurons M in the middle layer is a multiple of the number of classes C. In Fig. 2.10b, the output layer shows the possible clusters of neurons that LVQ has detected to represent the same class ωj, j = 1, . . . , C.
We now describe the sequential procedure of the basic LVQ algorithm as follows:
1. Initialize the M prototype vectors (code-books) and assign a class label to each of them.
2. Select an input vector xi at random from the training set and determine the winning prototype wv, i.e., the one at minimum distance from xi.
3. If the selected input vector xi and the weight vector wv of the winning neuron have the same class label (ωxi = ωwv), then move the prototype wv toward the input vector, as follows:

   wv(t + 1) = wv(t) + η(t)[xi − wv(t)]    (2.68)

   where t indicates the previous iteration, and η(t) indicates the current value of the learning parameter (variable in the range 0 < η(t) ≤ 1), analogous to that of the SOM.
4. If the selected input vector xi and the weight vector wv of the winning neuron have a different class label (ωxi ≠ ωwv), then modify the prototype wv, moving it away from the input vector, as follows:

   wv(t + 1) = wv(t) − η(t)[xi − wv(t)]    (2.69)
The described classifier (also known as LVQ1) is more efficient than using the
SOM algorithm only. It is also observed that with respect to the SOM it is no longer
necessary to model the neurons of the Kohonen layer with the function φ of lateral
interaction since the objective of LVQ is vector quantization and not the creation
of topological maps. The goal of the LVQ algorithm is to adapt the weights of the
neurons so as to optimally represent the prototypes of the training set patterns and obtain a correct partition of the training set.
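A compact sketch of the LVQ1 update loop described above (with an illustrative, linearly decaying learning rate of our choosing, not prescribed by the text):

```python
import numpy as np

def train_lvq1(X, y, prototypes, proto_labels, eta0=0.1, epochs=20, seed=0):
    """Minimal LVQ1 sketch: the winning prototype is attracted (2.68) or repelled (2.69)
    depending on whether its class label matches that of the input vector."""
    rng = np.random.default_rng(seed)
    W = prototypes.copy()
    for epoch in range(epochs):
        eta = eta0 * (1 - epoch / epochs)                     # decaying learning rate
        for i in rng.permutation(len(X)):
            v = np.argmin(np.linalg.norm(W - X[i], axis=1))   # winning prototype
            if proto_labels[v] == y[i]:
                W[v] += eta * (X[i] - W[v])                   # same class: Eq. (2.68)
            else:
                W[v] -= eta * (X[i] - W[v])                   # different class: Eq. (2.69)
    return W

# usage with two synthetic 2D classes and one prototype per class
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
protos = np.array([[0.5, 0.5], [1.5, 1.5]], dtype=float)
W = train_lvq1(X, y, protos, proto_labels=np.array([0, 1]))
```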
This architecture allows us to classify the N input vectors of the training set into C classes, each of which may be subdivided into subclasses represented by the initial M prototypes (code-books). The sizing of the LVQ network is linked to the number of prototypes, which defines the number of neurons in the Kohonen layer. An undersizing results in partitions with few regions, with the consequent problem of having regions containing patterns belonging to different classes. An over-dimensioning instead involves the problem of overfitting.
A problem related to the initial values of the prototype weight vectors is that a prototype may need to cross a region of a class it does not represent in order to reach the region it should represent. Since along the way it is repelled by the vectors of the regions it must cross, it may never reach, and never be classified into, the correct region to which it belongs.
The LVQ2 algorithm [9] which introduces a learning variant with respect to LVQ1,
can solve this problem. During the learning, for each input vector x, the simultaneous
update is carried out considering the two prototype vectors wv1 and wv2 closer to x
(always determined with the minimum distance from x). One of them must belong
to the correct class and the other to a wrong class. Also, x must fall within a window between the vectors wv1 and wv2 that delimit the decision boundary (the perpendicular bisecting plane). In these conditions, the two weight vectors wv1 and wv2 are updated
appropriately using the Eqs. (2.68) and (2.69), respectively, of correct and incorrect
class of membership of x. All other weight vectors are left unchanged.
The LVQ3 algorithm [8] is the analogue of LVQ2 but adds a further update of the weights in the cases in which x, wv1, and wv2 represent the same class, involving a stabilization constant that reflects the width of the window associated with the borders of the regions represented by the prototypes wv1 and wv2. For very narrow windows, this constant must take very small values.
With the changes introduced by LVQ2 and LVQ3 during the learning process,
it is ensured that weight vectors (code-book vectors) continue to approximate class
distributions and prevent them from moving away from their optimal position if
learning continues.
Fig. 2.11 Computational dynamics of a neural network: a Static system; b Dynamic continuous-
time system; and c Discrete-time dynamic system
a dynamic temporal behavior and the last input signal. Two classes of dynamical
systems are distinguished: dynamic systems with continuous time and discrete time.
The dynamics of continuous-time systems depend on functions whose continuous
variable is time (spatial variables are also used). This dynamic is described by dif-
ferential equations. A model of the most useful dynamics is that described only by first-order differential equations, y′(t) = f[x(t), y(t)], where y′(t) = dy(t)/dt, which models the output signal as a function of its derivative with respect to time, requiring an integration operator and the feedback signal inherent to dynamic systems (see Fig. 2.11). In many cases, a discrete-time computational system is assumed. In these cases, a discrete-time system is modeled by discrete-time variable
functions (even spatial variables are considered). The dynamics of the network, in
these cases, starts from the initial state at time 0 and in the subsequent discrete steps
for t = 1, 2, 3, . . . the state of the network changes in relation to the computational
dynamics foreseen by the activation function of one or more neurons. Thus, each
neuron acquires the related inputs, i.e., the output of the neurons connected to it, and
updates its state with respect to them.
The dynamics of a discrete-time network is described by difference equations whose first discrete difference is given by Δy(n) = y(n + 1) − y(n), where y(n + 1) and y(n) are, respectively, the future (predicted) value and the current value of y, and n indicates the discrete variable that replaces the continuous independent variable t. To model the dynamics of a discrete-time system, that is, to obtain the output signal y(n + 1) = f[x(n), y(n)], the integration operator is replaced with the delay operator D, which has the function of a unit delay (see Fig. 2.11c).2
The state of neurons can change independently of each other or can be controlled
centrally, and in this case, we have asynchronous or synchronous neural network
models, respectively. In the first case, the neurons are updated one at a time, while
in the second case all the neurons are updated at the same time. Learning with a
recurrent network can be accomplished with a procedure similar to the gradient
descent as used with the backpropagation algorithm.
2 The D operator derives from the Z transform applied to the discrete signals y(n) : n = 0, 1, 2, 3, . . .
to obtain analytical solutions to the difference equations. The delay unit is introduced simply to
delay the activation signal until the next iteration.
A particular recurrent network was proposed in 1982 by J. Hopfield [10]. The orig-
inality of Hopfield’s network model was such as to revitalize the entire scientific
environment in the field of artificial neural networks. Hopfield showed how a collec-
tion of simple process units (for example, perceptrons by McCulloch-Pitts), appro-
priately configured, can exhibit remarkable computing power. Inspired by physical
phenomenologies,3 he demonstrated that a physical system can be used as a potential
memory device, once such a system has a dynamic of locally stable states to which it
is attracted. Such a system, with its stability and well localized attractors,4 constitutes
a model of CAM memory (Content-Addressable Memory).5
A CAM memory is a distributed memory that can be realized by a neural network if each content of the memory (in this context, a pattern) corresponds to a stable configuration of the neural network, reached after its evolution starting from an initial configuration. In other words, starting from an initial configuration, the neural network reaches a stable state, that is, an attractor associated with the stored pattern most similar to that of the initial configuration. Therefore, the network recognizes a pattern when the initial stimulus corresponds to something that, although not equal to the stored pattern, is very similar to it.
Let us now look at the structural details of the Hopfield network that differs greatly
from the two-layer network models of input and output. The Hopfield network is
realized with M neurons configured in a single layer of neurons (process unit or
process element PE) where each is connected with all the others of the network
except with itself. In fact it is a recurrent symmetric network, that is, with a matrix
of synaptic weights
wij = wji , ∀i, j (2.72)
3 Ising’s model (from the name of the physicist Ernst Ising who proposed it) is a physical-
mathematical model initially devised to describe a magnetized body starting from its elementary
constituents. The model was then used to model variegated phenomena, united by the presence of
single components that, interacting in pairs, produce collective effects.
4 In the context of neural networks, an attractor is the final configuration achieved by a neural
network that, starting from an initial state, reaches a stable state after a certain time. Once an
attractor is known, the set of initial states that determine evolution of the network that ends with
that attractor is called the attraction basin.
5 Normally different memory devices store and retrieve the information by referring to the memory
location addresses. Consequently, this mode of access to information often becomes a limiting factor
for systems that require quick access to information. The time required to find an item stored in
memory can be considerably reduced if the object can be identified for access through its contents
rather than by memory addresses. A memory accessed in this way is called addressable memory for
content or CAM-Content-Addressable Memory. CAM offers an advantage in terms of performance
over other search algorithms, such as binary tree or look-up table based searches, comparing the
desired information against the entire list of pre-stored memory location addresses.
Fig. 2.12 Model of the discrete-time Hopfield network with M neurons. The output of each neuron feeds back to all the others, excluding itself. The pattern vector (x1, x2, . . . , xM) forms the input to the network, i.e., the initial state (y1(0), y2(0), . . . , yM(0)) of the network
wii = 0, ∀i    (2.73)

The state of each neuron is updated according to

yi = 1 if Σ_{j=1; j≠i}^{M} wij yj ≥ θi,  and yi = 0 otherwise    (2.74)

where yi indicates the output of the ith neuron and θi the corresponding threshold. Rewritten with the activation function σ(·) of each neuron, the (2.74) becomes

yi = σ( Σ_{j=1; j≠i}^{M} wij yj − θi )    (2.75)

where

σ(z) = 1 if z ≥ 0,  0 if z < 0    (2.76)
It is observed that in Eq. (2.75) only the states of the neurons yi are updated and not the relative thresholds θi. Furthermore, the update dynamics of the discrete Hopfield network is asynchronous, i.e., the neurons are updated sequentially, one at a time, in random order. This is the opposite of a synchronous dynamic system, where the update at time t + 1 takes place considering the state at time t and updating all the neurons simultaneously. This mode requires a buffer memory to maintain the state of the neurons and a synchronization signal. Normally, networks with asynchronous updating are preferred in applications with neural networks, since each neural unit updates its state as soon as the input information is available.
The Hopfield network associates a scalar value with the overall state of the network, defined by the energy E. Recalling the concept of kinetic energy expressed in quadratic form, the energy function for a network with M neurons is given by

E = −(1/2) Σ_{i,j=1; j≠i}^{M} wij yi yj + Σ_{i=1}^{M} yi θi    (2.77)
Hopfield demonstrated the convergence of the network in a stable state, after a finite
number of sequential (asynchronous) updates of neurons, in the conditions expressed
by the (2.72) and (2.73) that guarantee the decrease of the energy function E for each
updated neuron.
In fact, let us consider how the energy of the system changes when the state of a neuron changes through the activation function (for example, from 0 to 1, or from 1 to 0):

ΔE = E_{t+1} − E_t = −[ Σ_{j=1; j≠i}^{M} wij yi(t+1) yj − yi(t+1) θi ] + [ Σ_{j=1; j≠i}^{M} wij yi(t) yj − yi(t) θi ]
   = [yi(t) − yi(t+1)] ( Σ_{j=1; j≠i}^{M} wij yj − θi )    (2.78)

where neti = Σ_{j≠i} wij yj is the activation potential. The second factor of (2.78) is less than zero if neti < θi; moreover, by (2.75) and (2.76) we would have yi(t+1) = 0 and yi(t) = 1, so the first factor [•] is greater than zero, thus resulting in ΔE < 0. If instead neti ≥ θi, the second factor is greater than or equal to zero, yi(t+1) = 1 and yi(t) = 0; it follows that the first factor [•] is less than zero, thus resulting in ΔE ≤ 0.
Therefore, Hopfield has shown that for any change of the ith neuron, if the activation equations (2.75) and (2.76) are maintained, the energy variation ΔE is negative or zero. The energy function E decreases in a monotonic way (see Fig. 2.13) if the activation rules and the symmetry of the weights are maintained. This allows the network, after repeated updates, to converge toward a stable state that is a local minimum of the energy function (also considered as a Lyapunov function6).
6 Lyapunov functions, named after the Russian mathematician Aleksandr Mikhailovich Lyapunov,
are scalar functions that are used to study the stability of an equilibrium point of an ordinary au-
tonomous differential equation, which normally describes a dynamic system. For dynamic physical
systems, conservation laws often provide candidate Lyapunov functions.
Fig. 2.13 1D and 2D energy map with some attractors in a Hopfield network. The locations of
the attractors indicate the stable states where the patterns are associated and stored. After having
initialized the network with the pattern to be memorized, the network converges following the
direction indicated by the arrows to reach the location of an attractor to which the pattern is associated
The generalized form of the network update equation also includes the additional bias term Ii (which can also be the direct input of a sensor) for each neuron:

yi = 1 if Σ_{j≠i} wij yj + Ii > θi,  and yi = 0 if Σ_{j≠i} wij yj + Ii < θi    (2.79)

and the corresponding energy function becomes

E = −(1/2) Σ_{i,j=1; j≠i}^{M} wij yi yj − Σ_{i=1}^{M} yi Ii + Σ_{i=1}^{M} yi θi    (2.80)
Also in this case, it is shown that for a change of state of the network due to a single neuron update, the energy change ΔE is always zero or negative.
(b) Solution of combinatorial optimization problems, in which convergence toward local minima of the energy function does not always guarantee the optimal solution. The classic application example is the Traveling Salesman Problem.
(c) Calculation of logical functions (OR,XOR, ...).
(d) Miscellaneous Applications: pattern classification, signal processing, control,
voice analysis, image processing, artificial vision. Generally used as a black box
to calculate some output resulting from a certain self-organization caused by the
same network. Many of these applications are based on Hebbian learning.
Hebbian learning obeys the following two rules:
(a) If two neurons have the same state of activation, their synaptic connection is reinforced, i.e., the associated weight wij is increased.
(b) If two neurons exhibit opposite states of activation, their synaptic connection is
weakened.
Starting with the weight matrix W of zero weights, applying these learning rules and presenting the patterns to be memorized, the weights are modified as follows:

Δwij = ε xki xkj,   i ≠ j    (2.81)

where k indicates the pattern to memorize, i and j indicate the indices of the components of the binary pattern vector to be learned, and ε is a constant that prevents the weights from becoming too large or too small (normally ε = 1/N). The (2.81) is iterated for all N patterns to be stored and, once all have been presented, the final weights result in
wij = Σ_{k=1}^{N} xki xkj,   wii = 0;  i, j = 1, . . . , M;  i ≠ j    (2.82)

In vector form, the M × M weight matrix W for the set P of stored patterns is given by

W = Σ_{k=1}^{N} xk xk^T − N I = (x1 x1^T − I) + (x2 x2^T − I) + ··· + (xN xN^T − I)    (2.83)
where I is the identity matrix M × M that subtracted from the weights matrix W
guarantees that the latter has zeroed all the elements of the main diagonal. It is also
assumed that the binary responses of neurons are bipolar (+1 or −1). In the case of
binary unipolar pattern vectors (that is, with values 0 or 1) the weights Eq. (2.82) is
thus modified by introducing scale change and translation
wij = Σ_{k=1}^{N} (2xki − 1)(2xkj − 1),   wii = 0;  i, j = 1, . . . , M;  i ≠ j    (2.84)
In the recall phase, the activation function σ(·) used is the sign function sgn(·), given the bipolarity of the patterns (−1 or +1). Recall that the neurons are updated asynchronously and randomly. The network activity continues until the output of the neurons remains unchanged, thus representing the final result, namely the pattern x recovered from the set P, stored earlier in the learning phase, that best approximates the input pattern s. This approximation is evaluated with the Hamming distance H, which determines the number of differing bits between two binary patterns and, in the bipolar case, is calculated as H = (M − Σ_i si xi)/2.
The Hopfield network can be used as a pattern classifier. The set of patterns P,
in this case, are considered the prototypes of the classes and the patterns s to be
classified are compared with these prototypes by applying a similarity function to
decide the class to which they belong.
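The following sketch (ours) reproduces the Hebbian storage rule (2.83) and the asynchronous recall with the sign activation, using the two patterns of the example discussed next and assuming thresholds θi = 0:

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian storage, Eq. (2.83): W = sum_k x_k x_k^T - N*I (zero diagonal)."""
    patterns = np.asarray(patterns, dtype=float)
    N, M = patterns.shape
    return patterns.T @ patterns - N * np.eye(M)

def recall(W, probe, steps=100, seed=0):
    """Asynchronous recall with sign activation; returns the retrieved pattern."""
    rng = np.random.default_rng(seed)
    y = np.array(probe, dtype=float)
    for _ in range(steps):
        i = rng.integers(len(y))                  # pick one neuron at random (asynchronous)
        y[i] = 1.0 if W[i] @ y >= 0 else -1.0     # threshold theta_i = 0 here
    return y

# the two stable states of the worked example below
x1 = np.array([ 1, -1,  1])
x2 = np.array([-1,  1, -1])
W = store_patterns([x1, x2])
print(W)                       # [[0,-2,2],[-2,0,-2],[2,-2,0]]
print(recall(W, [1, 1, 1]))    # converges to the stored pattern closest to the probe
```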
Consider a first example with two stable states (two auto-associations), x1 = (1, −1, 1)^T and x2 = (−1, 1, −1)^T. The symmetric weight matrix W with zero diagonal, given by the (2.83), results in

W = x1 x1^T + x2 x2^T − 2I
  = [  0  −2   2 ]
    [ −2   0  −2 ]
    [  2  −2   0 ]
Fig. 2.14 Structure of the Hopfield network at discrete time in the example with M = 3 neurons
and N = 23 = 8 states of which two are stable representing the patterns (1, −1, 1) and (−1, 1, −1)
To verify that a stored pattern x is a stable state of the network, consider the activation potential of the generic neuron i when x is presented to the network:

yi = neti = Σ_{j=1; j≠i}^{M} wij xj
Recognized pattern “0”   Recognized pattern “1”   Recognized pattern “3”   Recognized pattern “3”
Fig. 2.15 Results of numerical pattern retrieval using the Hopfield network as a CAM memory.
By giving non-noisy patterns to the input, the network always retrieves the memorized characters.
By giving instead noisy input patterns at 25% the network does not correctly recover only the “5”
character, recognized as “3”
Substituting the weights given by the (2.82), we obtain

yi ≅ Σ_{j=1; j≠i}^{M} ( Σ_{k=1}^{N} xki xkj ) xj = Σ_{k=1}^{N} xki ( Σ_{j=1}^{M} xkj xj ) ≈ xi M̃    (2.86)

where M̃ ≤ M is the result of the inner product between vectors with M bipolar binary elements (+1 or −1). If a stored pattern xk and the pattern x are statistically independent, i.e., orthogonal, their inner product is zero. In the limit case in which the two vectors are identical, we obtain M̃ = M; that is, the pattern x does not generate any updates and the network is stable.
This stable state for the recovery of the pattern x is obtained, considering the Eq. (2.83), with the activation potential net given by

y = net = Wx = ( Σ_{k=1}^{N} xk xk^T − N I ) x    (2.87)
1. The stored patterns P are orthogonal (or statistically independent) and M > N, that is,

   xi^T xj = 0 if i ≠ j,   M if i = j    (2.88)

   In this case, for a stored pattern x, the (2.87) gives y = Wx = (M − N)x, which has the same sign as x. With the assumption that M > N, it follows that x is a stable state of the Hopfield network.
2. The stored patterns P are not orthogonal (nor statistically independent): in this case, the activation potential of the neuron is given by the stable-state term previously determined plus a noise (crosstalk) term. The vector x is a stable state when M > N, the noise term is very small, and a concordant sign is maintained between the activation potential y and the vector x (sgn(y) = sgn(x)). Conversely, x will not be a stable state if the noise is dominant with respect to the equilibrium term, as happens when the number N of patterns to be stored increases (i.e., M − N decreases).
As is the case for all associative memories, the best results occur when the patterns to be memorized are represented by orthogonal vectors or vectors very close to orthogonality. The Hopfield network used as a CAM has been shown to have a memory capacity equal to about 0.138 · M, where M is the number of neurons (the theoretical limit is 2M patterns). The network is then able to recover the patterns even if they are noisy, within a certain tolerance dependent on the application context.
In the continuous version of the Hopfield network, the state of each neuron assumes continuous values in the interval between 0 and 1, as with MLP networks. In this case, the selected activation function is the sigmoid (or hyperbolic tangent) described in Sect. 1.10.2. The dynamics of the network remain asynchronous and the new state of the ith neuron is given by a generic monotone increasing function. Choosing the sigmoid function, the state of the neuron results in

yi = σ(neti) = 1 / (1 + e^{−neti})    (2.91)
where neti indicates the activation potential of the neuron ith. It is assumed that the
state of a neuron changes slowly over time. In this way, the change of state of the
other neurons does not happen instantaneously but with a certain delay. The activation
potential of the ith neuron changes over time according to the following:
M
M
d
neti = η −neti + wij yi = η −neti + wij σ (netj ) (2.92)
dt
j=1 j=1
where η indicates the learning parameter (positive) and wij the synaptic weight be-
tween the neuron i and j. With the (2.92) a discrete approximation of the differential
d (neti ) is calculated which is added to the current value of the activation potential
neti which leads the neuron to the new state given by yi expressed with the (2.91).
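A minimal sketch of these continuous dynamics, obtained by integrating (2.92) with a simple Euler step (the step size, η, and the small weight matrix are illustrative choices of ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def continuous_hopfield_step(net, W, eta=0.5, dt=0.1):
    """One Euler step of the continuous-time dynamics of Eq. (2.92):
    d(net_i)/dt = eta * (-net_i + sum_j w_ij * sigma(net_j))."""
    d_net = eta * (-net + W @ sigmoid(net))
    return net + dt * d_net

# small symmetric weight matrix with zero diagonal (Hopfield conditions 2.72-2.73)
W = np.array([[ 0.0, 1.0, -0.5],
              [ 1.0, 0.0,  0.8],
              [-0.5, 0.8,  0.0]])
net = np.array([0.2, -0.1, 0.05])        # initial activation potentials
for _ in range(200):
    net = continuous_hopfield_step(net, W)
y = sigmoid(net)                          # neuron states, Eq. (2.91)
```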
The objective is now to demonstrate how this new, more realistic network model, with continuous-state dynamics, can converge toward fixed or cyclic attractors. To demonstrate convergence, Hopfield proposed [12] an energy functional slightly different from that of the discrete model, given by

E = −(1/2) Σ_{i,j=1; j≠i}^{M} wij yi yj + Σ_{i=1}^{M} ∫_0^{yi} σ^{−1}(y) dy    (2.93)
At this point, it is sufficient to demonstrate that the energy of the network decreases after each update of the state of the neuron. This energy decrease is calculated with the following:

dE/dt = − Σ_{i,j=1; j≠i}^{M} wij yj (dyi/dt) + Σ_{i=1}^{M} σ^{−1}(yi) (dyi/dt)    (2.94)
Considering the symmetry property of the network (that is, wij = wji) and since the inverse of the sigmoid function exists, neti = σ^{−1}(yi), the (2.94) can be simplified as

dE/dt = − Σ_{i=1}^{M} (dyi/dt) ( Σ_{j=1}^{M} wij yj − neti )    (2.95)
The expression in round brackets (•) is replaced considering the (2.92) and we get the following:

dE/dt = −(1/η) Σ_{i=1}^{M} (dyi/dt) (d neti/dt)    (2.96)
and replacing dyi/dt = σ′(neti) (d neti/dt) (applying the rule of derivation of composite functions to yi = σ(neti)) we obtain

dE/dt = −(1/η) Σ_{i=1}^{M} σ′(neti) (d neti/dt)²    (2.97)
The derivative σ′(neti) is always positive, since the sigmoid is a strictly monotone increasing function.7 The parameter η is also positive, as is (d neti/dt)². It can then be affirmed, considering the negative sign, that the expression on the right-hand side of the (2.97) will never be positive, and consequently the derivative of the energy function with respect to time will never be positive, i.e., the energy can never grow while the network dynamics evolves over time:

dE/dt ≤ 0    (2.98)
The (2.98) implies that the dynamics of the network is such that the energy E is reduced or remains stable at each update. Furthermore, a stable state is reached when the (2.98) vanishes, and it corresponds to an attractor of the state space. This happens when d(neti)/dt ≈ 0 ∀i, i.e., when the state of all neurons no longer changes significantly. Convergence can take a long time since d(neti)/dt gets smaller and smaller; the network is, however, guaranteed to converge to a local minimum in a finite time.
2.12.3.1 Summary
The Hopfield network has been an important step in advancing knowledge on neural
networks and has revitalized the entire research environment in this area. Hopfield
established the connection between neural networks and physical systems of the
type considered in statistical mechanics. Other researchers had already considered
the associative memory model in the 1970s more generally. With the architecture of
the symmetrical connection network with a diagonal zero matrix, it was possible to
design recurrent networks of stable states. Furthermore, with the introduction of the
7 Recall the nonlinearity property of the sigmoid function σ (t) ensuring the limited definition range.
In the previous section, we described the Hopfield network based on the minimization of the energy function, without the guarantee of achieving a global optimum, even if, once the synaptic weights have been determined, the network spontaneously converges to stable states. Taking advantage of this property of the Hopfield network, it was possible to introduce variants of this model to avoid the problem of the local minima of the energy function. At the conceptual level, we can consider that the network reaches a state of minimum energy but could accidentally jump into a higher state of energy. In other words, a stochastic variant of the network dynamics could help the network avoid a local minimum of the energy function. This is possible through the best known stochastic dynamics model, known as the Boltzmann Machine (BM).
A neural network based on the BM [13] can be seen as the stochastic dynamic
version of the Hopfield network. In essence, the deterministic rule for updating the neuron states used in the Hopfield network (Eq. 2.74) is replaced with a stochastic rule that updates the state yi of the ith neuron, always in asynchronous mode, according to the following:

yi = 1 with probability pi,  0 with probability (1 − pi)    (2.99)

where

pi = 1 / (1 + e^{−ΔEi/T}) = 1 / (1 + exp( −( Σ_{j=1}^{M} wij yj − θi ) / T ))    (2.100)
where wij are the synaptic weights, θi denotes the bias term associated with the ith neuron, and T is the parameter that describes the temperature of the BM network (it simulates the temperature of a physical system). This latter parameter is motivated by statistical physics, whereby neurons normally enter a state that reduces the system energy E, but can sometimes enter a wrong state, just as a physical system sometimes (but not often) can visit states corresponding to higher energy values.8 With the simulated annealing approach, considering that the connections between neurons are symmetric (wij = wji for all pairs), each neuron calculates the energy gap ΔEi as the difference between the energy of the inactive state E(yi = 0) and that of the active state E(yi = 1).
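A toy sketch of the stochastic update (2.99)–(2.100) combined with a simple geometric cooling schedule (all numerical values are illustrative assumptions, not taken from the text):

```python
import numpy as np

def boltzmann_update(y, W, theta, T, rng):
    """Stochastic asynchronous update of one neuron according to Eqs. (2.99)-(2.100):
    p_i = 1 / (1 + exp(-DeltaE_i / T)) with DeltaE_i = sum_j w_ij y_j - theta_i."""
    i = rng.integers(len(y))
    delta_E = W[i] @ y - theta[i]
    p = 1.0 / (1.0 + np.exp(-delta_E / T))
    y[i] = 1.0 if rng.random() < p else 0.0
    return y

# tiny symmetric network with a geometric annealing schedule
rng = np.random.default_rng(0)
M = 8
W = rng.normal(size=(M, M)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
theta = np.zeros(M)
y = rng.integers(0, 2, size=M).astype(float)

T = 10.0
while T > 0.05:
    for _ in range(50):
        y = boltzmann_update(y, W, theta, T, rng)
    T *= 0.9                                  # slow cooling (simulated annealing)
```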
Returning to the (2.100), we point out that for low values of T we have

T → 0  ⟹  exp(−ΔEi/T) → 0  ⟹  pi → 1

having assumed that ΔEi > 0. This situation brings the update rule back to the deterministic dynamics of the Hopfield network in the discrete case. If instead T is very large, as happens at the beginning of the activation of the BM network, we have

T → ∞  ⟹  exp(−ΔEi/T) → 1  ⟹  pi → 0.5

that is, the update of the neuron state is essentially random. The energy of the BM network, analogous to that of the discrete Hopfield network, is given by

E = −(1/2) Σ_{i,j=1; j≠i}^{M} wij yi yj + Σ_{i=1}^{M} yi θi    (2.101)
The network update activity drives the network into a local minimum configuration of the energy function, associating (memorizing) the patterns with the various local minima. With the BM, this occurs by starting with high temperature values; then, through the simulated annealing process, the network tends to stay longer in attraction basins with deeper minima, and there is a good chance of ending in a global minimum.
The BM network compared to Hopfield’s also differs because it divides neurons
into two groups: visible neurons and hidden neurons. The visible ones interface with
the external environment (they perform the function of input/output) while the hidden
ones are used only for internal representation and do not receive signals from the
8 This stochastic optimization method that attempts to find a global minimum in the presence of
local minima, is known as simulated annealing. In essence, the physical process that is adopted
in the heat treatment (heating then slow cooling and then fast cooling) of the ferrous materials is
simulated to make them more resistant and less fragile. At high temperatures the atoms of these
materials are excited but during the slow cooling phase they have the time to assume an optimal
crystalline configuration such that the material is free of irregularities reaching an overall minimum.
This heat treatment of the material can avoid local minima of the lattice energy because the dynamics of the particles include a temperature-dependent component. In fact, during cooling, the particles lose energy but sometimes acquire energy, thus entering states of higher energy. This phenomenon prevents the system from settling in less deep minima.
outside. The state of the network is given by the states of the two types of neurons. The learning algorithm has two phases.
In the first phase, the network is activated keeping the visible neurons clamped to the values of the input pattern, and the network tries to reach thermal equilibrium toward low temperatures. Then the weights between any pair of neurons that are both in the active state are increased (in analogy with the Hebbian rule).
In the second phase, the network is freely activated without clamping the visible neurons, and the states of all neurons are determined through the process of simulated annealing. Once the state of thermal equilibrium toward low temperatures is (possibly) reached, there are sufficient samples to obtain reliable averages of the products yi yj.
It is shown that the learning rule [13–15] of the BM is given by the following relation:

Δwij = (η/T) ( ⟨yi yj⟩_fixed − ⟨yi yj⟩_free )    (2.102)

where η is the learning parameter; ⟨yi yj⟩_fixed denotes the expected average value of the product of the states of neurons i and j during the training phase with the visible neurons clamped; ⟨yi yj⟩_free denotes the expected average value of the product of the states of neurons i and j when the network is freely activated without clamping the visible neurons.
In general, the Boltzmann machine has found widespread use with good perfor-
mance in applications requiring decisions on stochastic bases. In particular, it is used
in all those situations where the Hopfield network does not converge to the global minima of the energy function. Boltzmann's machine has had considerable importance from a theoretical and engineering point of view. One of its architecturally
efficient versions is known as Restricted Boltzmann Machine-RBM [16].
An RBM network keeps the set of visible neurons distinct from the hidden ones, with the particularity that no connection exists between neurons of the same set. Only connections between visible and hidden neurons are allowed. With these restrictions, the hidden neurons are independent of each other given the visible neurons, allowing the calculation, in a single parallel step, of the expected average value of the product ⟨yi yj⟩_fixed of the states of neurons i and j during the training phase with the visible neurons clamped. The calculation of the ⟨yi yj⟩_free products instead requires several iterations involving parallel updates of all visible and hidden neurons.
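A minimal sketch of one block Gibbs step in an RBM (ours; the weight shapes, biases, and sigmoid conditionals are the standard choices, and full training by contrastive divergence is not shown):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, b_h, b_v, rng):
    """One block Gibbs step of an RBM sketch: since there are no connections within a
    layer, all hidden units are sampled in parallel given the visible ones, and vice versa."""
    p_h = sigmoid(v @ W + b_h)                       # P(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)                     # P(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = 0.01 * rng.normal(size=(n_visible, n_hidden))    # only visible-hidden connections
b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
v = rng.integers(0, 2, size=n_visible).astype(float)
v, h = gibbs_step(v, W, b_h, b_v, rng)
```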
The RBM network has been applied with good performance for speech recogni-
tion, dimensionality reduction [17], classification, and other applications.
In the previous paragraphs, we have described several neural network architectures applied to machine learning and, above all, to supervised and unsupervised automatic classification, based on the backpropagation method, commonly used in conjunction with a gradient descent optimization algorithm to update the weights of the neurons by calculating the gradient of the cost or objective function.
In recent years, starting in 2000, with the advancement of Machine Learning research, the neural network sector has vigorously recovered thanks to the wider availability of low-cost multiprocessor computing systems (computer clusters, GPU graphics processors, ...), the need to process large amounts of data (big data) in various applications (industry, public administration, the social telematic sector, ...), and above all the development of new algorithms for automatic learning applied to artificial vision [18,19], speech recognition [20], and text and language processing [21].
This advancement of research on neural networks has led to the development of new machine learning algorithms, based also on the architecture of the traditional neural networks developed since the 1980s. The strategic progress came with the better results achieved by the newly developed neural networks, known as Deep Learning (DL) [22].
In fact, Deep Learning is intended as a set of algorithms that use neural networks as a computational architecture to automatically learn significant features from the input data. In essence, with the Deep Learning approach, neural networks have overcome the limitations of traditional neural networks in approximating nonlinear functions and in automatically solving feature extraction in complex application contexts with large amounts of data, exhibiting excellent adaptability (recurrent neural networks, convolutional neural networks, etc.).
The term deep learning is also understood in the sense of the multiple layers involved in the architecture. In more recent years, deeper networks have relied on rectified linear units [22] as activation function in place of the traditional sigmoid function (see Sect. 1.10.2) and on regularization techniques (dropout) [20] to mitigate the problem of overfitting in neural networks (already described in Sect. 1.11.5).
In Sect. 1.11.1 we have already described the MLP network, which can in fact be considered a deep neural network since it can have more than two intermediate layers of neurons between input and output, even though it belongs to the traditional neural networks, its neurons being fully connected between adjacent layers. Therefore, even an MLP with many hidden layers could be called deep learning. However, with the learning algorithm based on backpropagation, as the number of hidden layers increases it becomes increasingly difficult to adapt the weights in a significant way, even if from a theoretical point of view this would be possible.
Unfortunately, this is observed experimentally: the weights are randomly initialized and then, during the training phase, the backpropagation algorithm propagates the error backward from the output layer toward the input layer by calculating the partial derivatives with respect to each weight, moving in the direction opposite to the gradient of the objective function. This weight-adjustment process becomes more and more difficult as the number of hidden layers increases, since the weight updates and the objective function may not converge toward optimal values for the given data set. Consequently, the backpropagation process does not, in fact, optimally parametrize a traditional network, even one that is deep in terms of the number of hidden layers.
The limits of learning with backpropagation derive from the following problems:
1. Prior knowledge about the data to be classified is not always available, especially when dealing with large volumes of data such as color images, time-varying image sequences, multispectral images, etc.
2. The learning process based on backpropagation with gradient descent tends to get stuck in local minima, especially when there are many fully connected layers. A related difficulty is known as the vanishing gradient problem: the process of adaptation of the weights through activation functions (log-sigmoid or hyperbolic tangent) tends to vanish. Even though the theory of weight adaptation by error backpropagation with gradient descent is mathematically rigorous, it fails in practice, since the gradient values (calculated with respect to the weights), which determine how much each neuron should change, become smaller and smaller as they are propagated backward from the deeper layers of the neural network.9 This means that the neurons of the earlier layers learn much less and more slowly than the neurons of the deeper layers.
3. The computational load becomes noticeable as the depth of the network and the
data of the learning phase increase.
4. The network optimization process becomes very complex as the hidden layers
increase significantly.
9 We recall that weight adaptation in the MLP is obtained by differentiating an objective function that uses the sigmoid, whose derivative is less than 1. Applying the chain rule leads to multiplying many terms smaller than 1, with the consequent problem of considerably reducing the gradient values as one proceeds toward the layers furthest from the output.
Automatic learning algorithms are negatively influenced when the extracted features they operate on are not significant.
Therefore, the goal behind CNNs is to automatically learn the significant features from input data that are normally affected by noise. In this way significant features are provided to the deep network so that it can learn more effectively. We can think of deep learning as a family of algorithms for the automatic detection of features, designed to overcome the problem of gradient descent in nonoptimal situations and to facilitate learning in a network with many hidden layers.
In the context of image classification, a CNN uses the concept of ConvNet, that is, a neural convolution that uses windows (convolution masks), for example of size 5 × 5, conceived as receptive fields associated with the neurons of the adjacent layer, which slide over the image to generate a feature map, also called activation map, which is propagated to the next layers. In essence, the convolution mask of fixed size is the object of learning, based on the data supplied as input (the sequence of image/label pairs contained in the dataset).
Let us see how a generic convolution layer is implemented. Starting from a volume of size H × W × D, where H is the height, W the width, and D the number of channels (D = 3 in the case of a color image), we define one or more convolution filters of size h × w × D. Note that the number of filter channels and the number of input channels must be the same. Figure 2.16 shows an example of convolution. The filter generates a scalar when positioned on a single location of the volume, while sliding it over the entire input volume (implementing the convolution operator) generates a new image called the activation map.
The activation map will have high values where the structures of the image or input volume are similar to the filter coefficients (which are the parameters to be learned), and low values where there is no similarity. If x is the image or input volume, w denotes the filter parameters and b the bias, the result obtained is given by w x + b. For example, for a color image with D = 3 and a filter with height and width both equal to 5, we have a number of parameters equal to 5 × 5 × 3 = 75, plus the bias b, for a total of 76 parameters to learn. The same image can be analyzed with K filters, generating several different activation maps, as shown in Fig. 2.16.
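As a rough illustration of this operation, the following NumPy sketch (array names and sizes are assumptions, not taken from the text) slides a single h × w × D filter over the input volume and produces one activation map; it uses the correlation form commonly adopted in CNN implementations:

import numpy as np

def activation_map(x, f, b, stride=1):
    """Slide a 3D filter f (h, w, D) over the input volume x (H, W, D),
    computing the inner product plus bias at every location."""
    H, W, D = x.shape
    h, w, Df = f.shape
    assert D == Df, "filter and input must have the same number of channels"
    out_h = (H - h) // stride + 1
    out_w = (W - w) // stride + 1
    a = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+h, j*stride:j*stride+w, :]
            a[i, j] = np.sum(patch * f) + b   # inner product w·x + b
    return a

# A 32x32x3 input and a 5x5x3 filter: 75 weights plus 1 bias = 76 parameters.
x = np.random.rand(32, 32, 3)
f = np.random.rand(5, 5, 3)
print(activation_map(x, f, b=0.1).shape)   # (28, 28), i.e., W0 - w + 1 for P = 0, S = 1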
There is a certain analogy between the local functionality (defined by the receptive field) of biological neurons (simple and complex cells), described in Sect. 4.6.4 for capturing movement and elementary patterns in the visual field, and the functionality of the neurons in the convolutional layer of a CNN. In both cases each neuron propagates to the successive layers only the information relative to the area of its receptive field, greatly reducing the total number of connections. In addition, in both cases the spatial information of the visual field (retinal map and feature map) propagated to subsequent layers is preserved.
Let us now look at the multilayered architecture of a CNN network and how the network evolves from the convolutional layer; for clarity we will refer to a CNN implemented for classification.
defined with the padding parameter P which affects the size of the 2D base of
the output volume or the size of the activation maps (or even feature maps), as
follows:
W_1 = \frac{W_0 - w + 2P}{S} + 1 \qquad H_1 = \frac{H_0 - w + 2P}{S} + 1    (2.103)
For P = 0 and S = 1 the size of the feature maps results W1 = H1 = W0 − w + 1.
The parameters defined above are known as hyperparameters of the CNN network.10
Having defined the hyperparameters of the convolutional layer, the dimensions of the feature volume are

W_1 × H_1 × D_1

where W_1 and H_1 are given by Eq. (2.103), while D_1 = K coincides with the number of convolution masks. It should be noted that the number of weights shared for each feature map is w × w × D_0, for a total of w × w × D_0 × K weights for the entire volume of the features.
For a 32 × 32 × 3 RGB image and 5 × 5 × 3 masks, each neuron of the feature map would have a value resulting from the convolution given by an inner product of size 5 × 5 × 3 = 75 (convolution between the 3D input volume and the 3D convolution mask, neglecting the bias). The corresponding activation map would be

W_1 × H_1 × D_1 = 28 × 28 × K

assuming P = 0 and S = 1. For K = 4 masks we would have, between the input layer and the convolutional layer, a total number of connections equal to (28 × 28 × 4) × (5 × 5 × 3) = 235200 and a total number of weights (5 × 5 × 3) · 4 = 300, plus the 4 biases b, giving 304 parameters to be learned. In this example, for each portion of size (5 × 5 × 3) of the input volume, as extended as the size of the mask, 4 different neurons observe the same portion, extracting 4 different features.
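This bookkeeping can be reproduced with a short function based on Eq. (2.103) (a sketch with hypothetical names; the numbers printed are those of the example above):

def conv_output_size(W0, H0, w, P=0, S=1):
    """Spatial size of the activation maps according to Eq. (2.103)."""
    W1 = (W0 - w + 2 * P) // S + 1
    H1 = (H0 - w + 2 * P) // S + 1
    return W1, H1

def conv_layer_cost(W0, H0, D0, w, K, P=0, S=1):
    """Learnable parameters and connections of a Conv layer with K masks."""
    W1, H1 = conv_output_size(W0, H0, w, P, S)
    params = (w * w * D0) * K + K                # shared weights plus one bias per map
    connections = (W1 * H1 * K) * (w * w * D0)   # each output neuron sees w*w*D0 inputs
    return (W1, H1, K), params, connections

print(conv_layer_cost(32, 32, 3, 5, 4))   # ((28, 28, 4), 304, 235200)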
10 In the field of machine learning there are two types of parameters: those that are learned during the learning process (for example, the weights of a logistic regression or of a synaptic connection between neurons), and the intrinsic parameters of a learning algorithm or model, whose optimization takes place separately. The latter are known as hyperparameters, or optimization parameters associated with a model (for example, a regularization parameter, the depth of a decision tree, or, in the context of deep neural networks, the number of neuron layers and the other parameters that define the architecture of the network).
Fig. 2.17 a Rectifier Linear Unit ReLU. b Leaky ReLU; c exponential Linear Units ELU
The sigmoid activation function presents some problems when the neurons go into saturation: it cancels the gradient and does not allow the weights of the net to be modified. Furthermore, its output is not centered with respect to 0, and the exponential is slightly more expensive computationally, slowing down the gradient descent process. More efficient activation functions have subsequently been proposed, such as the ReLU [19], of which we also describe some variants.

ReLU. It performs the same role as the sigmoid used in MLPs but has the characteristic of increasing the nonlinearity property and eliminates the problem of gradient cancellation highlighted previously (see Fig. 2.17a). There is no saturation for positive values of net; the derivative is zero for negative or null values of net and 1 for positive values. This layer is not characterized by parameters and the convolution volume remains unchanged in size. Besides being computationally less expensive, it allows the training phase to converge much faster than the sigmoid and is more biologically plausible.
Leaky ReLU. As shown in Fig. 2.17b, it does not vanish when net becomes negative. It has the characteristic of not saturating the neurons and does not cancel the gradient.
ParametricReLU. Proposed in [26], it is defined by

f(net, \alpha) = \begin{cases} net & \text{if } net \ge 0 \\ \alpha \cdot net & \text{if } net < 0 \end{cases}

It has been proposed to speed up the convergence phase and to alleviate the problem of the vanishing gradient [27]. In the negative part it does not vanish, as for the Leaky ReLU (see Fig. 2.17c).
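The activation functions plotted in Fig. 2.17 can be written compactly as follows (a NumPy sketch; the default parameter values are only indicative):

import numpy as np

def relu(net):
    """Rectified Linear Unit: max(0, net)."""
    return np.maximum(0.0, net)

def leaky_relu(net, alpha=0.01):
    """Leaky ReLU: small fixed slope alpha for negative net."""
    return np.where(net >= 0, net, alpha * net)

def parametric_relu(net, alpha):
    """Parametric ReLU: same form, but alpha is a learned parameter."""
    return np.where(net >= 0, net, alpha * net)

def elu(net, alpha=1.0):
    """Exponential Linear Unit: smooth, non-zero response for negative net."""
    return np.where(net >= 0, net, alpha * (np.exp(net) - 1.0))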
where

W_2 = \frac{W_1 - w}{S} + 1 \qquad H_2 = \frac{H_1 - w}{S} + 1 \qquad D_2 = D_1    (2.106)
Fig. 2.18 a Subsampling of the activation maps produced in the convolution layer. b Operation of
max pooling applied to the linearly adjusted map (ReLU) of size 4 × 4, pooling window 2 × 2 and
stride = 2. We get a map of dimensions 2 × 2 in output
These relations are analogous to (2.103), but in this case, for the pooling, there is no padding and therefore the parameter P = 0. Normally the parameter values used are (w, S) = (2, 2) and (w, S) = (3, 2). Note that, as in the ReLU layer, there are no parameters to learn in this layer. An example of max pooling is shown in Fig. 2.18b.
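A compact sketch of the max pooling operation described above is the following (the 4 × 4 values are illustrative, not those of the figure):

import numpy as np

def max_pool(a, w=2, stride=2):
    """Max pooling of a 2D map with a w x w window and the given stride."""
    H, W = a.shape
    out_h = (H - w) // stride + 1
    out_w = (W - w) // stride + 1
    p = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            p[i, j] = a[i*stride:i*stride+w, j*stride:j*stride+w].max()
    return p

a = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [2, 1, 9, 8],
              [0, 3, 4, 7]], dtype=float)
print(max_pool(a))   # a 4x4 map reduced to a 2x2 map, as in Fig. 2.18b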
Fig. 2.19 Architectural scheme of a CNN network very similar to the original LeNet network [24] (designed for automatic character recognition) used to classify 10 types of animals. The network consists of ConvNet components for extracting the significant features of the input image, given as input to the traditional MLP network component, based on three fully connected FC layers, which performs the function of classifier
Having analyzed the architecture of a CNN network, we now see its operation in the
context of classification. In analogy to the operation of a traditional MLP network,
the CNN network envisages the following phases in the forward/backward training
process:
1. Dataset normalization. The images are normalized with respect to the average. There are two ways to calculate the average. Suppose we have color images, i.e., with three channels. In the first case, the average is calculated for each channel of the single image and subtracted from the pixels of the corresponding channel; in this case the average is a scalar referred to the R, G, and B bands of that image. In the second case, all the training images of the dataset are analyzed, calculating an average color image, which is then subtracted pixel by pixel from each image of the dataset. In the latter case, the average image must be stored in order to subtract it also from a test image. Furthermore, there are data augmentation techniques that serve to increase the cardinality of the dataset and to improve the generalization of the network. These techniques modify each image by flipping it along both axes, generating rotated versions of it, and inserting slight geometric and radiometric transformations.
2. Initialization. The hyperparameter values are defined (size w of the convolution masks, number of filters K, stride S, padding P) for all the Conv, Pool, and ReLU layers, and the weights are initialized according to the Xavier method. The Xavier initialization method, introduced in [28], is an effective way to initialize weights with respect to the activation functions, avoiding convergence problems. The weights are initialized as follows:

W_{i,j} = U\left(-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right)    (2.107)

with U the uniform distribution and n the number of connections with the previous layer (the columns of the weight matrix W). The initialization of weights is a fairly active research area; other methods are discussed in [29–31]. (A small sketch of this initialization, together with the dataset normalization of step 1, is given after this list.)
3. Forward propagation. The images of the training set are presented as input (input volume W_0 × H_0 × D_0) to the Conv layer, which generates the feature maps subsequently processed in cascade by the ReLU and Pool layers, and finally passed to the FC layer that computes the final class scores.
4. Learning error calculation. The FC layer evaluates the learning error to verify whether it is still above a predefined threshold. Normally the vector of the C classes contains the labels of the objects to be classified in terms of probability, and the difference between the target probabilities and the current FC output is computed with a suitable metric to estimate this error. Considering that at the beginning the weights are random, the error will surely be high and the back-propagation of the error is triggered as in MLPs.
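The sketch announced in step 2 is the following (Python/NumPy; shapes and names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def normalize_dataset(images):
    """Second normalization method of step 1: subtract the mean image of the
    training set. images: array (N, H, W, 3). The mean image is returned so
    that it can also be subtracted from test images."""
    mean_image = images.mean(axis=0)
    return images - mean_image, mean_image

def xavier_init(n_in, n_out):
    """Xavier initialization of Eq. (2.107): uniform in [-1/sqrt(n), 1/sqrt(n)],
    with n the number of connections with the previous layer."""
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_init(n_in=75, n_out=4)   # e.g., flattened 5x5x3 masks feeding 4 maps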
With the learning process CNN has calculated (learned) all the optimal parameters
and weights to classify an image of the training set in the test phase. What happens
if an image never before seen in the learning phase is presented to the network?
As with traditional networks, an image that has never been seen can be recognized as
one of the images learned if it presents some similarity, otherwise it is hoped that the
network will not recognize it. The learning phase, in relation to the available calcu-
lation system and the size of the image dataset, can also require days of calculation.
The time required to recognize an image with the test phase is a few milliseconds,
but it always depends on the complexity of the model and the type of GPU you have.
The level of generalization of the network in recognizing images also depends on the ability to construct a CNN with an adequate number of replicated convolutional layers (Conv, ReLU, Pool) and of FC layers, the MLP component of a CNN. In image recognition applications it is strategic to design the convolutional layers so that they extract as many features as possible from the training set of images, using optimal filters to extract in the first Conv layers the primal sketch information (for example, lines and corners) and, in the subsequent Conv layers, other types of filters to extract higher level structures (shapes, texture).
The monitoring of the parameters during the training phase is of fundamental importance: it makes it possible to avoid continuing the training in the presence of an error in the model or in the initialization of the parameters or hyperparameters, and to take specific actions. When the dataset is very large it is useful to divide it into three parts, leaving, for example, 60% for training, 20% of the data for the validation phase, and the remaining 20% for the test phase, to be used only at the end, once the best model has been found with the training and validation partitions.
the training set. The use of SGD in neural network settings is motivated by the high
computational cost of the backpropagation procedure on the complete training set.
SGD can overcome this problem and still lead to rapid convergence.
The SGD technique is also known as an incremental gradient descent, which
uses a stochastic technique to optimize a cost function or objective function—the
loss function. Consider a dataset (x_i, y_i), i = 1, ..., N, consisting of pairs of images and their membership classes, respectively. The output of the network in the final layer is given by the score generated by the j-th neuron for a given input x_i, f_j(x_i, W) = W x_i + b. Normalizing the outputs of the output layer, we obtain values that can be interpreted as probabilities; for the k-th neuron

P(Y = k \mid X = x_i) = \frac{e^{f_k}}{\sum_j e^{f_j}}    (2.108)

which is known as the Softmax function, whose output for each output neuron lies in the range [0, 1]. We define the cost function for the i-th pair of the dataset as

L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)    (2.109)

and the total cost function over the whole dataset, including a regularization term, as

L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i, W), y_i\big) + \lambda R(W).    (2.110)
The first term of the total cost function is calculated over all the pairs supplied as input during the training process and evaluates the consistency of the model contained in W up to that point. The second term is the regularization term: it penalizes very high values of W and prevents the problem of overfitting, that is, the situation in which the CNN behaves very well on the training data but then performs poorly on data never seen before (poor generalization).
The R(W) function can be one of the following:

R(W) = \sum_k \sum_l W_{kl}^2 \qquad \text{(L2 regularization)}

R(W) = \sum_k \sum_l |W_{kl}| \qquad \text{(L1 regularization)}
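A minimal sketch of Eqs. (2.108)–(2.110) for a linear score function (NumPy; the names and the L2 choice for R(W) are assumptions for illustration):

import numpy as np

def softmax(scores):
    """Eq. (2.108): turn the scores f_j into probabilities in [0, 1]."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def total_loss(W, X, y, lam=1e-3):
    """Eqs. (2.109)-(2.110): average cross-entropy plus L2 regularization.
    X: (N, d) inputs, y: (N,) integer class labels, W: (d, C) weights."""
    scores = X @ W                        # f_j(x_i, W), bias omitted for brevity
    p = softmax(scores)
    N = X.shape[0]
    Li = -np.log(p[np.arange(N), y])      # -log probability of the true class
    R = np.sum(W ** 2)                    # R(W) = sum_kl W_kl^2
    return Li.mean() + lam * R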
For very large training datasets, the stochastic gradient descent technique is adopted: the sum is approximated with a subset of the dataset, called a minibatch. The gradient of the cost function to be minimized is

\nabla_W L(W) = \frac{1}{N}\sum_{i=1}^{N} \nabla_W L_i\big(f(x_i, W), y_i\big) + \lambda \nabla_W R(W).    (2.111)
The weights are then updated by moving in the direction opposite to the gradient (hence the sign “−”, so as to minimize the cost function), where η represents the learning parameter (learning rate or step size). The gradient of the cost function is propagated backward through the use of computational graph representations, which make the weights in each layer of the network simple to update. The discussion of computational graphs is not covered in this text, nor are the variants of SGD that integrate the momentum (discussed in the previous chapter), the adaptive gradient (ADAGRAD) [33], root mean square propagation (RMSProp) [34], the adaptive moment estimator (ADAM) [35], and the Kalman-based Stochastic Gradient Descent (kSGD) [36].
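The minibatch update the text refers to can be sketched as follows (a plain SGD loop; grad_fn is a hypothetical function assumed to return the gradient of the loss on a minibatch):

import numpy as np

rng = np.random.default_rng(0)

def sgd(W, X, y, grad_fn, eta=0.01, batch_size=64, epochs=10):
    """Plain minibatch SGD: W <- W - eta * gradient estimated on a minibatch."""
    N = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            W = W - eta * grad_fn(W, X[idx], y[idx])
    return W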
2.13.3.2 Dropout
We had already highlighted the problem of overfitting for traditional neural networks, and to reduce it we used heuristic methods and verified their effectiveness (see Sect. 1.11.5). For example, one approach was to monitor the learning phase and stop it (early stop) when the values of some parameters, checked with a certain metric and thresholds, required it (a simple way was to check the number of epochs). Validation datasets (not used during network training) were also used to verify the level of generalization of the network as the number of epochs changed and to assess whether the network's accuracy improved or deteriorated.
For deep neural networks, with a much larger number of parameters, the problem of overfitting becomes even more serious. A regularization method that reduces the problem of overfitting in the context of deep neural networks is dropout, proposed in [37] in 2014. The dropout method and the use of rectified linear units (ReLUs) are two fundamental ideas for improving the performance of deep networks.
Fig. 2.20 Neural network with dropout [37]. a A traditional neural network with 2 hidden layers.
b An example of a reduced network produced by applying the dropout heuristic to the network in
(a). Neural units in gray have been excluded in the learning phase
The key idea of the dropout method11 is to randomly exclude neural units (together with their connections) from the neural network during each iteration of the training phase. This prevents neural units from co-adapting too much. In other words, the exclusion consists in ignoring, during the training phase, a certain set of neurons chosen in a stochastic way. These neurons are therefore not considered during a particular forward and backward propagation. More precisely, in each training step the individual neurons are either dropped from the network with probability 1 − p or kept with probability p, so as to obtain a reduced neural network (see Fig. 2.20). Note that a neuron excluded in one iteration can be active in the subsequent iteration, because the sampling of neurons occurs in a stochastic manner.
The basic idea of excluding or keeping a neural unit is to prevent an excessive adaptation at the expense of generalization, that is, to avoid a network highly trained on the training dataset that then shows little generalization on the test or validation dataset. To achieve this, dropout stochastically excludes neurons in the training phase, so that their contribution to the activation of the neurons in the downstream layer is temporarily zeroed in the forward propagation and,
11 The dropout heuristic is better understood through the following analogy. Imagine that in a patent office the expert is a single employee. As often happens, if this expert is always present, all the other employees of the office have no incentive to acquire skills on patent procedures. But if the expert decides every day, in a stochastic way (for example, by tossing a coin), whether to go to work or not, the other employees, unable to block the activities of the office, are forced, at least occasionally, to adapt by acquiring those skills. The office can therefore no longer rely only on the single experienced employee: all the other employees are forced to acquire these skills, and a sort of collaboration between them is generated when necessary, without their number being predefined. This makes the office as a whole much more flexible, increasing the quality and competence of the employees. In the jargon of neural networks, we would say that the network generalizes better.
consequently, the weights of their synaptic connections are not updated in the backward propagation.
During the learning phase of the neural network the weights of each neuron intrin-
sically model an internal state of the network on the basis of an adaptation process
(weight update) that depends on the specific features providing a certain specializa-
tion. Neurons in the vicinity rely on this specialization which can lead to a fragile
model that is too specialized for training data. In essence, dropout tends to prevent
the network from being too dependent on a small number of neurons and forces each
neuron to be able to operate independently (reduces co-adaptations). In this situation,
if some neurons are randomly excluded during the learning phase, the other neurons
will have to intervene and attempt to manage the representativeness and prediction
of the missing neurons. This leads to more internal, independent representations
learned from the network. Experiments (in the context of supervised learning in vi-
sion, speech recognition, document classification, and computational biology) have
shown that with such training the network becomes less sensitive to specific neuron
weights. This, in turn, translates into a network that is able to improve generalization
and reduces the probability of over-adaptation with training data.
The functionality of the dropout is implemented by inserting a dropout layer between any two layers of a CNN network. Normally it is inserted between the layers that have a high number of parameters, and in particular between the FC layers. For example, in the network of Fig. 2.19 a dropout layer can be inserted immediately before the FC layers with the function of randomly selecting the neurons to be excluded with a certain probability (for example, p = 0.25, which implies that 1 input neuron out of 4 is randomly excluded) in each forward step (zero contribution to the activation of the neurons in the downstream layer) and in the backward propagation phase (weight update). The dropout feature is not used during the network validation phase. The hyperparameter p must be tuned starting from a low probability value (for example, p = 0.2) and then increased up to p = 0.5, taking care to avoid very high values so as not to compromise the learning ability of the network.
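A minimal sketch of a dropout layer in the "inverted" form commonly used in practice is given below; the rescaling of the surviving activations at training time is an assumption of this sketch (the original formulation in [37] rescales the weights at test time instead):

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_drop=0.25, training=True):
    """Zero each unit with probability p_drop during training and rescale
    the survivors; at validation/test time the layer is the identity."""
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask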
In summary, the dropout method forces a neural network to learn more robust features, useful in conjunction with many different random subsets of the other neurons. The weights of the net are learned in conditions in which a part of the neurons is temporarily excluded, while in the test phase the number of neurons involved is greater. This tends to reduce overfitting, since the network trained with the dropout layer actually turns out to be a sort of average of different networks, each of which could potentially present an over-adaptation, even if with different characteristics; in the end, with this heuristic, one hopes to reduce (by averaging) the phenomenon of overfitting. In essence, dropout reduces the co-adaptation of neurons, since a neuron, in the absence of other neurons, is forced to adapt differently in the hope of capturing (learning) more significant features, useful together with different random subgroups of other neurons. Most CNN networks include a dropout layer. A further widespread regularization method, known as Batch Normalization, is described in the following paragraph.
Instead, inserting the normalization process applies the BN function before the activation function, obtaining

\hat{y}^{(l)} = f^{(l)}\big(BN(net^{(l)}(x))\big) = f^{(l)}\big(BN(\underbrace{W^{(l)}\,\hat{y}^{(l-1)}}_{y^{(l)}})\big)
where in this case the bias b is ignored for now since, as we shall see, its role will be played by the parameter β of BN. In principle the normalization could be performed over the whole training data set but, used together with the stochastic optimization process described above, it is not practical to use the entire dataset. Therefore, the normalization is limited to each minibatch B = {x_1, ..., x_m} of m samples during the stochastic training of the network. For a layer of the network with d-dimensional input y = (y^{(1)}, ..., y^{(d)}), according to the previous equations, the normalization of each k-th feature is given by

\hat{y}^{(k)} = \frac{y^{(k)} - E[y^{(k)}]}{\sqrt{Var[y^{(k)}]}}    (2.113)
where the expected value and the variance are calculated on the single minibatch B_y = {y_1, ..., y_m} consisting of the m activations of the current layer. The average μ_B and the variance σ_B² are approximated using the data of the minibatch as follows:

\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} y_i

\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (y_i - \mu_B)^2
\hat{y}_i \leftarrow \frac{y_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
where ŷ_i are the normalized values of the inputs of the l-th layer, while ε is a small number added to the denominator to guarantee numerical stability (it avoids division by zero). Note that simply normalizing each input of a layer can change what the layer can represent; most activation functions present problems when the normalization is applied before the activation function. For example, for the sigmoid activation function, the normalized region is more linear than nonlinear.
Therefore, it is necessary to perform a transformation that can move the distribution away from zero. For each activation we introduce a pair of parameters γ and β that, respectively, scale and translate each normalized value ŷ_i, as follows:

\tilde{y}_i \leftarrow \gamma\,\hat{y}_i + \beta

These additional parameters are learned together with the weight parameters W of the network during the training process through backpropagation, in order to improve accuracy and speed up the training phase. Furthermore, the normalization does not alter the network's representative power: if one decides not to use BN, it is sufficient to set γ = \sqrt{\sigma_B^2 + \epsilon} and β = μ_B, thus obtaining \tilde{y}_i = y_i, that is, the BN layer operates as an identity function that returns the original values.
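A compact sketch of the BN forward pass described above (NumPy; y is a minibatch of activations of shape (m, d), gamma and beta are the learned per-feature parameters):

import numpy as np

def batch_norm_forward(y, gamma, beta, eps=1e-5):
    """Per-feature mean and variance over the minibatch, normalization as in
    Eq. (2.113), then the learned scale gamma and shift beta."""
    mu = y.mean(axis=0)                     # mu_B
    var = y.var(axis=0)                     # sigma_B^2
    y_hat = (y - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * y_hat + beta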
The use of BN has several advantages:
1. It improves the flow of the gradient through the net, in the sense that the descent
of the gradient can reduce the oscillations when it approaches the minimum point
and converge faster.
2. Reduces network convergence times by making the network layers independent
of the dynamics of the input data of the first layers.
3. It allows a higher learning rate to be used. In a traditional network, high values of the learning factor can scale the parameters, which could amplify the gradients, thus leading to saturation. With BN, small changes in the parameters of a layer are not propagated to the other layers. This allows higher learning factors to be used for the optimizers, which otherwise would not have been possible; moreover, it makes the propagation of the gradient in the network more stable, as indicated above.
4. Reduces dependence on accurate initialization of weights.
5. It acts in some way as a form of regularization motivated by the fact that using
reduced minibatch samples reduces the noise introduced in each layer of the
network.
6. Thanks to the regularization effect introduced by BN, the need to use the dropout
is reduced.
AlexNet, described in [19], is a deeper and much wider version of LeNet. It was the winning entry of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) of 2012, and it made convolutional networks popular in the Computer Vision field. It turned out to be an important breakthrough compared to previous approaches, and the spread of CNNs can be attributed to this work: it showed good performance and won by a wide margin the difficult challenge of visual recognition on a large dataset (ImageNet) in 2012.
ZFNet, described in [39], is an evolution of AlexNet with better performance, and was the winner of ILSVRC 2013. Compared to AlexNet, the hyperparameters of the architecture were modified, in particular by expanding the size of the central convolutional layers and reducing the stride and the window size of the first layer.
GoogLeNet, described in [40], differs from the previous ones for having introduced the inception module, which heavily reduces the number of network parameters, down to 4M compared to the 60M used in AlexNet. It was the winner of ILSVRC 2014.
VGGNet, described in [41], was ranked second in ILSVRC 2014. Its main contribution was the demonstration that the depth of the network (number of layers) is a critical component for achieving an optimal CNN.
ResNet, described in [42], was ranked first in ILSVRC 2015. The originality of this network lies in the absence of the FC layer, the heavy use of batch normalization and, above all, the winning idea of the so-called identity shortcut connection, which skips the connections of one or more layers. ResNet is currently by far the most advanced of the CNN models and is the default choice for using CNN networks. ResNet makes it possible to train up to hundreds or even thousands of layers and achieves high performance. Taking advantage of its powerful representation capacity, the performance of many artificial vision applications other than image classification, such as object detection and facial recognition [43,44], has been enhanced.
The authors of ResNet, following the intuitions of the first version, have refined the residual block and proposed a pre-activation variant of it [45], in which the gradients can flow without obstacles through the shortcut connections to any previous layer. They have shown experimentally that a ResNet with 1001 layers can now be trained and outperforms its shallower counterparts. Thanks to its convincing results, ResNet has quickly become one of the strategic architectures in various artificial vision applications.
A more recent evolution has modified the original architecture by renaming the
network to ResNeXt [46] by introducing a hyperparameter called cardinality, the
number of independent paths, to provide a new way to adapt the model’s capacity.
Experiments show that accuracy can be improved more efficiently by increasing
cardinality rather than going deeper or wider. The authors say that, compared to
the Inception module of GoogLeNet, this new architecture is easier to adapt to new
datasets and applications, as it has a simple paradigm and only one hyperparameter
to adjust, while the Inception module has many hyperparameters (such as the size of
the convolution layer mask of each path) to fine-tune the network.
A further evolution is described in [47,48] with a new architecture called DenseNet, which further exploits the effects of shortcut connections by connecting all the layers directly to each other. In this new architecture, the input of each layer consists of the feature maps of all the previous layers, and its output is passed to every successive layer. The feature maps are aggregated with depth concatenation.
References
1. R.L. Hardy, Multiquadric equations of topography and other irregular surfaces. J. Geophys.
Res. 76(8), 1905–1915 (1971)
2. J.D. Powell, Radial basis function approximations to polynomials, in Numerical Analysis, eds.
by D.F. Griffiths, G.A. Watson (Longman Publishing Group White Plains, New York, NY,
USA, 1987), pp. 223–241
3. C. Lanczos, A precision approximation of the gamma function. SIAM J. Numer. Anal. Ser. B 1, 86–96 (1964)
4. T. Kohonen, Selforganized formation of topologically correct feature maps. Biol. Cybern. 43,
59–69 (1982)
5. C. Von der Malsburg, Self-organization of orientation sensitive cells in the striate cortex. Ky-
bernetik 14, 85–100 (1973)
6. S. Amari, Dynamics of pattern formation in lateral inhibition type neural fields. Biol. Cybern.
27, 77–87 (1973)
7. T. Kohonen, Self-Organizing Maps, 3rd edn. ISBN 3540679219 (Springer-Verlag New York,
Inc. Secaucus, NJ, USA, 2001)
8. T. Kohonen, Learning vector quantization. Neural Netw. 1, 303 (1988)
9. T. Kohonen, Improved versions of learning vector quantization. Proc. Int. Joint Conf. Neural
Netw. (IJCNN) 1, 545–550 (1990)
10. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci 79, 2554–2558 (1982)
11. R.P. Lippmann, B. Gold, M.L. Malpass, A comparison of hamming and hopfield neural nets
for pattern classification. Technical Report ESDTR-86-131,769 (MIT, Lexington, MA, 1987)
12. J.J. Hopfield, Neurons with graded response have collective computational properties like those
of two-state neurons. Proc. Nat. Acad. Sci 81, 3088–3092 (1984)
13. D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for boltzmann machines. Cogn.
Sci. 9(1), 147–169 (1985)
14. J.A. Hertz, A.S. Krogh, R. Palmer, Introduction to the Theory of Neural Computation. (Addison-
Wesley, Redwood City, CA, USA, 1991). ISBN 0-201-50395-6
15. R. Rojas, Neural Networks: A Systematic Introduction (Springer Science and Business Media, Berlin, 1996)
16. P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, eds. by D.E. Rumelhart, J.L. McClelland, vol. 1, Chapter 6 (MIT Press, Cambridge, 1986), pp. 194–281
17. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks.
Science 313(5786), 504–507 (2006)
18. D.C. Ciresan, U. Meier, J. Masci, L.M. Gambardella, J. Schmidhuber, Flexible, high performance convolutional neural networks for image classification, in International Joint Conference on Artificial Intelligence, pp. 1237–1242, 2011
19. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional
neural networks, in Advances in Neural Information Processing Systems, eds. by F. Pereira,
C.J.C. Burges, L. Bottou, K.Q. Weinberger, vol. 25 (Curran Associates, Inc., Red Hook, NY,
2012), pp. 1097–1105
20. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P.
Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech
recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
21. T. Mikolov, Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012
22. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015). ISSN 0028-0836. https://doi.org/10.1038/nature14539
23. Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in Advances in Neural Information Processing Systems, ed. by D.S. Touretzky, vol. 2 (Morgan-Kaufmann, Burlington, 1990), pp. 396–404
24. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
25. A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic
models. Int. Conf. Mach. Learn. 30, 3 (2013)
26. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification, in International Conference on Computer Vision ICCV,
2015a
27. D.A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by expo-
nential linear units (elus), in ICLR, 2016
28. X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural net-
works. AISTATS 9, 249–256 (2010)
29. D. Mishkin, J. Matas, All you need is a good init. CoRR (2015). http://arxiv.org/abs/1511.
06422
30. P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell, Data-dependent initializations of convolu-
tional neural networks, CoRR (2015). http://arxiv.org/abs/1511.06856
31. D. Sussillo, Random walks: Training very deep nonlinear feed-forward networks with smart
initialization. CoRR (2014). http://arxiv.org/abs/1412.6558
32. Q. Liao, T. Poggio, Theory of deep learning ii: Landscape of the empirical risk in deep learning.
Technical Report Memo No. 066 (Center for Brains, Minds and Machines (CBMM), 2017)
33. J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). ISSN 1532-4435. http://dl.acm.org/citation.cfm?id=1953048.2021068
34. S. Ruder, An overview of gradient descent optimization algorithms. CoRR (2016). http://arxiv.
org/abs/1609.04747
35. D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR (2014). http://arxiv.
org/abs/1412.6980
36. V. Patel, Kalman-based stochastic gradient method with stop condition and insensitivity to
conditioning. SIAM J. Optim. 26(4), 2620–2648 (2016). https://doi.org/10.1137/15M1048239
37. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
http://jmlr.org/papers/v15/srivastava14a.html
38. S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing
internal covariate shift, CoRR (2015). http://arxiv.org/abs/1502.03167
39. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, CoRR (2013).
http://arxiv.org/abs/1311.2901
40. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A.
Rabinovich, Going deeper with convolutions, In Computer Vision and Pattern Recognition
(CVPR) (IEEE, Boston, MA, 2015). http://arxiv.org/abs/1409.4842
41. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recogni-
tion. CoRR (2014). http://arxiv.org/abs/1409.1556
42. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. CoRR, 2015b.
http://arxiv.org/abs/1512.03385
43. M. Del Coco, P. Carcagn, M. Leo, P. Spagnolo, P. L. Mazzeo, C. Distante, Multi-branch cnn
for multi-scale age estimation, in International Conference on Image Analysis and Processing,
pp. 234–244, 2017
44. M. Del Coco, P. Carcagn, M. Leo, P. L. Mazzeo, P. Spagnolo, C. Distante, Assessment of deep learning for gender classification on traditional datasets, in Advanced Video and Signal Based Surveillance (AVSS), pp. 271–277, 2016
45. K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, CoRR, 2016.
http://arxiv.org/abs/1603.05027
46. S. Xie, R.B. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep
neural networks. CoRR (2016). http://arxiv.org/abs/1611.05431
47. G. Huang, Z. Liu, K.Q. Weinberger, Densely connected convolutional networks, CoRR, 2016a.
http://arxiv.org/abs/1608.06993
48. G. Huang, Y. Sun, Z. Liu, D. Sedra, K.Q. Weinberger, Deep networks with stochastic depth,
CoRR, 2016b. http://arxiv.org/abs/1603.09382
3 Texture Analysis

3.1 Introduction
Texture is an important component for recognizing objects. In the field of image processing, the term texture indicates, in a consolidated way, any geometric and repetitive arrangement of the gray levels of an image [1]. In this field, texture becomes an additional strategic component for solving the problem of object recognition, image segmentation, and synthesis problems. Much research has been oriented toward the mechanisms of human visual perception of texture, to be emulated for the development of systems for the automatic analysis of the information content of an image (partitioning of the image into regions with different textures, reconstruction and orientation of the surface, etc.).
The human visual system easily determines and recognizes different types of textures, characterizing them in a subjective way. In fact, there is no general definition of texture, nor a method of measuring texture accepted by all. Without going into the merits of our ability to perceive texture, our qualitative approach of characterizing texture with attributes such as coarse, granular, random, ordered, threadlike, dotted, fine-grained, etc., is well known. Several studies have shown [2,3] that the quantitative analysis of texture passes through statistical and structural relationships between the basic elements (known as primitives of the texture, also called texels) of what we call texture. Our visual system easily determines the relationships between the fundamental geometric structures that characterize a specific texture composed of macrostructures, such as the regular covering of a building or a floor. Similarly, our
Fig. 3.2 In the first row, images acquired with the microscope; in the second row, natural images (the first three) and artificial images (the last two)
The first computational texture studies were conducted by Julesz [2,3]. Several experiments demonstrated the importance of perception through the statistical analysis of the image for various types of textures, in order to understand how low-level vision responds to variations in the order of the statistics. Various images with different statistics were used, with patterns (such as points, lines, symbols, ...) distributed and positioned randomly, each corresponding to a particular order statistic. Examples
Fig. 3.3 Texture pairs with identical and different second-order statistics. In a the two textures have pixels with different second-order statistics and are easily distinguishable by people; b the two textures are easily distinguishable but have the same second-order statistics; c textures with different second-order statistics that are not easy for a human to distinguish unless the differences are closely scrutinized; d the textures share the same second-order statistics and an observer does not immediately discriminate between them
of order statistics used are those of the first order (associated with contrast), of the second order to characterize homogeneity, and of the third order to characterize curvature. Textures with sufficient differences in brightness are easily separable into different regions using first-order statistics.
Textures with differently oriented structures are also easily separable. Initially, Julesz found that textures with similar first-order statistics (gray-level histogram) but different second-order statistics (variance) are easily discriminable. However, no textures were found with identical first- and second-order statistics but different third-order statistics (moments) that could be discriminated. This led to Julesz's conjecture: textures with identical second-order statistics are indistinguishable. For example, the textures of Fig. 3.3a have different second-order statistics and are immediately distinguishable, while the textures of Fig. 3.3d share the same statistics and are not easily distinguishable [4].
Later, Caelli, Julesz, and Gilbert [5] found textures with identical first- and second-order statistics but different third-order statistics that are distinguishable (see Fig. 3.3b) by pre-attentive visual perception, thus violating Julesz's conjecture. Further studies by Julesz [3] showed that the mechanisms of human visual perception do not necessarily use third-order statistics to distinguish textures with identical second-order statistics, but instead use second-order statistics of patterns (structures) called textons. Consequently, the previous conjecture was revised as follows: the human pre-attentive visual system does not compute statistical parameters of order higher than the second.
It is also stated that the pre-attentive human visual system1 uses only the first-order statistics of these textons, which are, in fact, local structural characteristics such as edges, oriented lines, blobs, etc.
Psychophysiological research has provided evidence that the human brain performs a spectral analysis of the retinal image, which can be emulated on a computer by modeling it with a filter bank [6]. This research has motivated the development of mathematical models of texture perception based on appropriate filters. Bergen [7] suggests that the texture present in an image can be decomposed into a series of image sub-bands using a bank of linear filters with different scales and orientations. Each sub-band image is associated with a particular type of texture. In essence, the texture is characterized by the empirical distribution of the filter responses, and suitable similarity metrics (distances between distributions) can be evaluated to discriminate the potential differences between textures.
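A minimal sketch of this decomposition is given below (Python/NumPy with SciPy; Gabor-like kernels are assumed as the linear filters, a common but not the only possible choice, and every name and parameter value is illustrative):

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor kernel with the given orientation and wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def filter_bank_responses(image, wavelengths=(4, 8), thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4)):
    """Decompose the image into sub-bands at different scales and orientations
    and summarize each sub-band with the mean of its absolute response."""
    features = []
    for lam in wavelengths:
        for th in thetas:
            k = gabor_kernel(size=15, wavelength=lam, theta=th, sigma=lam / 2)
            response = convolve2d(image, k, mode='same', boundary='symm')
            features.append(np.abs(response).mean())
    return np.array(features)

The empirical distributions of these responses (here summarized only by their mean magnitude) can then be compared with a distance between distributions to discriminate textures.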
The research described in the previous paragraph did not lead to a rigorous definition of texture, but it did lead to a variety of approaches (physical-mathematical models) for analyzing and interpreting texture. In developing a vision machine, texture analysis (texture exhibits a myriad of properties) can be guided by the application and can be required at various stages of the automatic vision process: for example, in the context of segmentation to detect homogeneous regions, in the feature extraction phase, or in the classification phase, where texture characteristics can provide useful information for object recognition. Given the complexity of the information content of the local structures of a texture, which can be expressed in several perceptual terms (color, light intensity, density, orientation, spatial frequency, linearity, random arrangement, etc.), texture analysis methods can be grouped into four classes:
1 Stimulus processing does not always require the use of attentional resources. Many experiments have shown that the elementary characteristics of a stimulus derived from the texture (as happens for color, shape, and movement) are detected without the intervention of attention. The processing of the stimulus is therefore defined as pre-attentive. In other words, the pre-attentive information process makes it possible to detect the most salient features of the texture very quickly, and only afterwards does focused attention complete the recognition of the particular texture (or of an object in general).
Other approaches, based on fractal models [13], have been used to better characterize natural textures, although they have the limitation of losing local information and the orientation of texture structures. Another method of texture analysis is to compare the characteristics of the observed texture with a previously synthesized texture pattern.
These texture analysis approaches are used in the following application contexts: extraction of image features described in terms of texture properties; segmentation, where each region is characterized by a homogeneous texture; classification, to determine homogeneous classes of texture; reconstruction of the surface of objects (shape from texture) starting from the information (such as density, dimension, and orientation) associated with the macrostructures of the texture; and synthesis, to create extensive textures starting from small texture samples, very useful in rendering applications (computer graphics).
For this last aspect, compared with classification and segmentation, the synthesis of texture requires a richer characterization of the texture in terms of description of the details (accurate discrimination of the structures).
Texture classification concerns the assignment of particular texture regions to one of various predefined texture classes. This type of analysis can be seen as an alternative to supervised classification and cluster analysis techniques, or as a complement to further refine the classes found with these techniques. For example, a satellite image can be classified with techniques based on texture analysis or with clustering methods (see Chap. 1) and then further investigated to search for detailed subclasses that characterize a given texture, for example, in the case of a forest, cropland, or other regions.
The statistical approach is particularly suitable when the texture consists of very
small and complex elementary primitives, typical of microstructures. When a texture
is composed of primitives of large dimensions (macrostructures) it is fundamental
to first identify the elementary primitives, that is, to evaluate their shape and their
spatial relationship. These last measures are often also altered by the effect of the
perspective view that modifies the shape and size of the elementary structures, and
by the lighting conditions.
First-order statistics measure the likelihood of observing a gray value at a random position in the image. They can be calculated from the histogram of the gray levels of the image pixels and depend only on the single gray level of a pixel, not on the co-occurring interaction with the surrounding pixels. The average intensity of the image is an example of a first-order statistic.
Recall that from the histogram we can derive the approximate probability density2 p(L) of the occurrence of an intensity level L (see Sect. 6.9 Vol. I), given by
2 It is assumed that L is a random variable that expresses the gray level of an image deriving from a stochastic process.
p(L) = \frac{H(L)}{N_{tot}} \qquad L = 0, 1, 2, \ldots, L_{max}    (3.1)
where H(L) indicates the frequency of pixels with gray level L and N_tot is the total number of pixels in the image (or in a portion of it). We know that the shape of the histogram gives information on the characteristics of the image: a narrow distribution of levels around a single peak indicates an image with low contrast, while several isolated peaks suggest the presence of different homogeneous regions that differ from the background. The parameter that characterizes the first-order statistics is the average μ, calculated as follows:
\mu = \sum_{L=0}^{L_{max}} L \cdot p(L)    (3.2)
To obtain useful parameters (features of the image) from the histogram, one can
derive quantitative information from the statistical properties of the first order of the
image. In essence we can derive the central moments (see Sect. 8.3.2 Vol. I) from
the probability density function p(L) and characterize the texture with the following
measures:
\mu_2 = \sigma^2 = \sum_{L=0}^{L_{max}} (L - \mu)^2\, p(L)    (3.3)
where μ_2 is the central moment of order 2, that is, the variance, traditionally indicated with σ². The average describes the position and the variance the dispersion of the distribution of levels. The variance in this case provides a measure of the contrast of the image and can be used to express a measure of relative smoothness S, given by
S = 1 - \frac{1}{1 + \sigma^2}    (3.4)
which becomes 0 for regions with constant gray levels, while it tends to 1 in rough areas (where the variance is very large). The central moments of higher order (normally not greater than the 6th), associated with the probability density function p(L), can characterize some properties of the texture. The n-th central moment in this case is given by
\mu_n = \sum_{L=0}^{L_{max}} (L - \mu)^n\, p(L)    (3.5)
For natural textures, the measurements of asymmetry (or Skewness) S and Kurtosis
K are useful, based, respectively, on the moments μ3 and μ4 , given by
S = \frac{\mu_3}{\sigma^3} = \frac{\sum_{L=0}^{L_{max}} (L - \mu)^3\, p(L)}{\sigma^3}    (3.6)

K = \frac{\mu_4}{\sigma^4} = \frac{\sum_{L=0}^{L_{max}} (L - \mu)^4\, p(L)}{\sigma^4}    (3.7)
The measure S is zero if the histogram H (L) has a symmetrical form with respect to
the average μ. Instead it assumes positive or negative values in relation to its defor-
mation (with respect to that of symmetry). This deformation occurs, respectively, on
the right (that is, shifting of the shape toward values higher than the average) or on
the left (deformation toward values lower than the average). If the histogram has a
normal distribution S is zero, and consequently any symmetric histogram will have
a value of S which tends to zero.
The K measure indicates whether the shape of the histogram is flat or peaked with respect to a normal distribution. In other words, if K has high values, the histogram has a peak near the average value and decays quickly at both ends. A histogram with small K tends to have a flat top near the mean value rather than a sharp peak; a uniform histogram represents the extreme case. If the histogram has a normal distribution, K takes the value three. Often the kurtosis measurement is indicated as K_e = K − 3 so as to have a value of zero in the case of a histogram with normal distribution.
Other useful texture measures, known as Energy E and Entropy H, are given by the following:
E = \sum_{L=0}^{L_{max}} [p(L)]^2    (3.8)

H = -\sum_{L=0}^{L_{max}} p(L) \log_2 p(L)    (3.9)
The energy measures the homogeneity of the texture and corresponds to the second moment of the probability density of the gray levels. Entropy is a quantity that in this case measures the level of disorder of a texture; its maximum value is reached when the gray levels are all equiprobable (that is, p(L) = 1/(L_max + 1)).
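Taken together, these first-order measures can be computed directly from the normalized histogram, as in the following sketch (NumPy; the image is assumed to hold integer gray levels):

import numpy as np

def first_order_features(image, n_levels=256):
    """First-order texture measures of Eqs. (3.2)-(3.9) from the histogram p(L)."""
    hist = np.bincount(image.ravel(), minlength=n_levels).astype(float)
    p = hist / hist.sum()                       # Eq. (3.1): p(L) = H(L) / Ntot
    L = np.arange(n_levels)
    mu = np.sum(L * p)                          # mean, Eq. (3.2)
    var = np.sum((L - mu) ** 2 * p)             # variance, Eq. (3.3)
    sigma = np.sqrt(var)
    smoothness = 1.0 - 1.0 / (1.0 + var)        # Eq. (3.4)
    skewness = np.sum((L - mu) ** 3 * p) / sigma ** 3   # Eq. (3.6)
    kurtosis = np.sum((L - mu) ** 4 * p) / sigma ** 4   # Eq. (3.7)
    energy = np.sum(p ** 2)                     # Eq. (3.8)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))     # Eq. (3.9)
    return mu, var, smoothness, skewness, kurtosis, energy, entropy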
Characteristic measures of local texture, called module and state, based on the
local histogram [14], can be derived considering a window of N pixels centered in the pixel (x, y). The module I_MH is defined as follows:

I_MH = Σ_{L=0}^{L_max} [H(L) − N/N_Liv] / √{H(L)[1 − p(L)] + (N/N_Liv)(1 − 1/N_Liv)}    (3.10)

where N_Liv = L_max + 1 indicates the number of levels in the image. The state of the
the histogram is the gray level that corresponds to the highest frequency in the local
histogram.
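As a quick illustration of the first-order measures above, the following Python sketch (ours, not the book's code; it assumes a non-constant 8-bit grayscale image stored in a NumPy array) computes them from the normalized histogram p(L).

import numpy as np

def first_order_features(img, levels=256):
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()                        # p(L), probability of each level
    L = np.arange(levels)
    mu = np.sum(L * p)                           # mean, Eq. (3.2)
    var = np.sum((L - mu) ** 2 * p)              # variance mu_2, Eq. (3.3)
    sigma = np.sqrt(var)
    smoothness = 1.0 - 1.0 / (1.0 + var)         # relative smoothness S, Eq. (3.4)
    skew = np.sum((L - mu) ** 3 * p) / sigma**3  # skewness, Eq. (3.6)
    kurt = np.sum((L - mu) ** 4 * p) / sigma**4  # kurtosis, Eq. (3.7)
    energy = np.sum(p ** 2)                      # energy, Eq. (3.8)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # entropy, Eq. (3.9)
    return dict(mean=mu, variance=var, smoothness=smoothness,
                skewness=skew, kurtosis=kurt, energy=energy, entropy=entropy)

# Example: features of a random 64x64 test patch
print(first_order_features(np.random.randint(0, 256, (64, 64))))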
The size of P depends on the number of levels in the image I. A binary image generates a 2 × 2 matrix, while an RGB color image would require a matrix of 2²⁴ × 2²⁴; typically, grayscale images with at most 256 levels are used.
The spatial relationship between pixel pairs, defined in terms of increments (d_x, d_y), generates co-occurrence matrices sensitive to image rotation. Except for rotations of 180°, any other rotation would generate a different distribution of P. To obtain rotation invariance for texture analysis, the co-occurrence matrices are calculated considering rotations of 0°, 45°, 90°, and 135°.
For this purpose it is useful to define co-occurrence matrices of the type P_R(L_1, L_2), where the spatial relation R = (θ, d) indicates the co-occurrence of pixel pairs at a distance d and in the direction θ (see Fig. 3.4). These matrices can be calculated to characterize texture microstructures considering the unit distance between pairs, that is, d = 1. The matrices for the four rotations then result: P_(dx,dy) = P_(0,1) ⟺ P_(θ,d) = P_(0°,1) (horizontal direction); P_(dx,dy) = P_(−1,1) ⟺ P_(θ,d) = P_(45°,1) (right diagonal direction); P_(dx,dy) = P_(−1,0) ⟺ P_(θ,d) = P_(90°,1) (vertical direction); and P_(dx,dy) = P_(−1,−1) ⟺ P_(θ,d) = P_(135°,1) (left diagonal direction).
Figure 3.5 shows the GLCM matrices calculated for the 4 angles indicated above, for a test image of 4 × 4 size which has maximum level L_max = 4. Each matrix has size N × N, where N = L_max + 1 = 5. Consider now the element P_(0°,1)(2, 1), which has value 3. This means that in the test image there are 3 pairs of horizontally adjacent pixels with gray values, respectively, L_1 = 2 for the pixel under consideration and L_2 = 1 for the adjacent co-occurring pixel. Similarly, there are adjacent pairs of pixels with values (1, 2) with frequency 3 when examining in the opposite direction. It follows that the matrix is symmetric, like the other three calculated.
In general, the co-occurrence matrix is not always symmetrical, that is, not always P(L_1, L_2) = P(L_2, L_1). The symmetric co-occurrence matrix S_d(L_1, L_2) is defined as the sum of the matrix P_d(L_1, L_2), associated with the distance vector d, and the matrix associated with −d:
Sd (L 1 , L 2 ) = Pd (L 1 , L 2 ) + P−d (L 1 , L 2 ) (3.12)
Fig. 3.5 Calculation of 4 co-occurrence matrices, relative to the 4 × 4 test image with 5 levels, for the following directions: a 0°, b +45°, c +90°, and d +135°. The distance is d = 1 pixel. It is also shown that the element P_(0°,1)(2, 1) = 3, since there exist in the image three pairs of pixels with L_1 = 2, L_2 = 1 arranged spatially according to the relation (0°, 1)
The co-occurrence matrix captures some properties of a texture, but a texture is not normally characterized by using the elements of this matrix directly. From the co-occurrence matrix, some significant parameters T_i are derived to describe a texture more compactly. Before describing these parameters, it is convenient to normalize the co-occurrence matrix by dividing each of its elements P_{θ,d}(L_1, L_2) by the total sum of the frequencies of all pairs of pixels spatially related by R(θ, d):

p_{θ,d}(L_1, L_2) = P_{θ,d}(L_1, L_2) / Σ_{L_1=0}^{L_max} Σ_{L_2=0}^{L_max} P_{θ,d}(L_1, L_2)    (3.13)
Fig. 3.6 Examples of co-occurrence matrices calculated on 3 different images with different textures: a a complex texture of an electronic board, with few and small homogeneous regions; b a less complex texture with larger macrostructures and greater correlation between the pixels; and c a more homogeneous texture with the distribution of frequencies concentrated along the main diagonal (strongly correlated pixels)
where the joint probabilities assume values between 0 and 1. From the normalized
co-occurrence matrix, given by the (3.13), the following characteristic parameters
Ti of the texture are derived.
3.4.5.1 Energy

The energy (or angular second moment) measures the global homogeneity of the gray-level pairs:

Energy = T_1 = Σ_{L_1=0}^{L_max} Σ_{L_2=0}^{L_max} [p_{θ,d}(L_1, L_2)]²    (3.14)

Higher energy values correspond to very homogeneous textures, i.e., the differences in gray values are almost zero in most pixel pairs, for example, at a distance of 1 pixel. For low energy values, the differences are equally spatially distributed.
3.4.5.2 Entropy

The entropy is a parameter that measures the random distribution of the gray levels in the image:

Entropy = T_2 = − Σ_{L_1=0}^{L_max} Σ_{L_2=0}^{L_max} p_{θ,d}(L_1, L_2) · log₂[p_{θ,d}(L_1, L_2)]    (3.15)
It is observed that entropy is high when each element of the co-occurrence matrix
has an equal value, that is, when the p(L 1 , L 2 ) probabilities are equidistributed.
Entropy has low values if the co-occurrence matrix is diagonal, i.e., there are spatially
dominant gray level pairs for a certain direction and distance.
3.4.5.4 Contrast

The contrast is a parameter that measures the local variation of the gray levels of the image; it corresponds to the moment of inertia:

Contrast = T_4 = Σ_{L_1=0}^{L_max} Σ_{L_2=0}^{L_max} (L_1 − L_2)² p_{θ,d}(L_1, L_2)    (3.17)
A low value of the contrast is obtained if the image has almost constant gray levels,
vice versa it presents high values for images with strong local variations of intensity
that is with very pronounced texture.
3.4.5.7 Correlation

Correlation = T_7 = [Σ_{L_1=0}^{L_max} Σ_{L_2=0}^{L_max} (L_1 − μ_x)(L_2 − μ_y) p_{θ,d}(L_1, L_2)] / (σ_x σ_y)    (3.20)

where the means μ_x and μ_y, and the standard deviations σ_x and σ_y, are related to the marginal probabilities p_x(L_1) and p_y(L_2). The latter correspond, respectively, to the rows and columns of the co-occurrence matrix p_{θ,d}(L_1, L_2), and are defined as follows:

p_x(L_1) = Σ_{L_2=0}^{L_max} p_{θ,d}(L_1, L_2)        p_y(L_2) = Σ_{L_1=0}^{L_max} p_{θ,d}(L_1, L_2)
From the implementation point of view, the extraction of texture characteristics based on the co-occurrence matrices requires considerable memory and computation. However, in the literature there are solutions that quantize the image to few gray levels, thus reducing the dimensionality of the co-occurrence matrix, with care taken to limit the resulting degradation of the structures. In addition, fast ad hoc algorithms have been proposed [18]. The complexity, in terms of memory and computation, of the co-occurrence matrix increases further with the management of color images.
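The following Python sketch (illustrative, not the book's implementation) builds a co-occurrence matrix for an already quantized image, symmetrizes and normalizes it as in (3.12)–(3.13), and derives the energy, entropy, contrast, and correlation parameters described above; the function name and the quantized input are our assumptions.

import numpy as np

def glcm_features(img, levels, dx=0, dy=1):
    P = np.zeros((levels, levels), dtype=np.float64)
    rows, cols = img.shape
    for i in range(max(0, -dx), rows - max(0, dx)):
        for j in range(max(0, -dy), cols - max(0, dy)):
            P[img[i, j], img[i + dx, j + dy]] += 1
    P = P + P.T                                   # symmetric matrix, Eq. (3.12)
    p = P / P.sum()                               # normalization, Eq. (3.13)
    L1, L2 = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    energy = np.sum(p ** 2)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    contrast = np.sum((L1 - L2) ** 2 * p)
    mu_x, mu_y = np.sum(L1 * p), np.sum(L2 * p)
    sx = np.sqrt(np.sum((L1 - mu_x) ** 2 * p))
    sy = np.sqrt(np.sum((L2 - mu_y) ** 2 * p))
    correlation = np.sum((L1 - mu_x) * (L2 - mu_y) * p) / (sx * sy)
    return dict(energy=energy, entropy=entropy,
                contrast=contrast, correlation=correlation)

img = np.random.randint(0, 8, (32, 32))           # image quantized to 8 levels
print(glcm_features(img, levels=8, dx=0, dy=1))   # horizontal relation (0°, 1)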
3.5 Texture Features Based on Autocorrelation

A feature of the texture can be evaluated by spatial frequency analysis. In this way, the repetitive spatial structures of the texture are identified. Texture primitives characterized by fine structures present high spatial frequencies, while primitives with larger structures result in low spatial frequencies. The autocorrelation function of an image can be used to evaluate spatial frequencies, that is, to measure the level of homogeneity or roughness (fineness/coarseness) of the texture present in the image.
With the autocorrelation function (see Sect. 6.10.2 Vol. I) we measure the level of spatial correlation between neighboring pixels seen as texture primitives (gray-level values). The spatial arrangement of the texture is described by the correlation coefficient that measures the linear spatial relationship between pixels (primitives). For an image f(x, y) of size N × N with L gray levels, the autocorrelation function ρ_f(d_x, d_y) is given by:
Fig. 3.8 Autocorrelation function along the x axis for 4 different textures
ρ_f(d_x, d_y) = [Σ_{r=0}^{N−1} Σ_{c=0}^{N−1} f(r, c) · f(r + d_x, c + d_y)] / [N² · Σ_{r=0}^{N−1} Σ_{c=0}^{N−1} f²(r, c)]    (3.24)

ρ_f(d_x, d_y) = f(d_x, d_y) ∗ f(−d_x, −d_y) = Σ_{r=0}^{N−1} Σ_{c=0}^{N−1} f(r, c) f(r + d_x, c + d_y)    (3.25)
An immediate way to calculate the autocorrelation function is obtained by virtue of
the convolution theorem (see Sect. 9.11.3 Vol. I) which states: the Fourier transform
of a convolution is the product of the Fourier transforms of each function.
In this case, applied to (3.25), we will have

F{ρ_f(d_x, d_y)} = F(u, v) · F*(u, v) = |F(u, v)|²    (3.26)

where the symbol F{•} indicates the Fourier transform operation, F indicates the Fourier transform of the image f, and F* its complex conjugate. It is observed that the complex conjugate of a real function does not influence the function itself.
In essence, the Fourier transform of the autocorrelation function F{ρ_f(d_x, d_y)}, defined by (3.26), represents the Power Spectrum.
We can then say that the autocorrelation of a function is the inverse Fourier transform of its power spectrum:

ρ_f(d_x, d_y) = F⁻¹{|F(u, v)|²}    (3.27)

Moreover, with f real the autocorrelation is also real and symmetric, ρ_f(−d_x, −d_y) = ρ_f(d_x, d_y). Figure 3.8 shows the autocorrelation function calculated with (3.27) for four different types of textures.
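A minimal sketch of this computation, assuming a real-valued NumPy image, is the following; it obtains the autocorrelation as the inverse transform of the power spectrum, as in (3.26)–(3.27).

import numpy as np

def autocorrelation(f):
    F = np.fft.fft2(f)                    # Fourier transform of the image
    power = np.abs(F) ** 2                # power spectrum |F(u,v)|^2, Eq. (3.26)
    rho = np.fft.ifft2(power).real        # inverse transform, Eq. (3.27)
    rho /= rho.flat[0]                    # normalize so that rho(0,0) = 1
    return np.fft.fftshift(rho)           # put the zero-lag peak at the center

# Example: a vertically striped texture has strong periodic peaks along x
stripes = np.tile(np.sin(2 * np.pi * np.arange(256) / 16), (256, 1))
rho = autocorrelation(stripes)
print(rho.shape, rho.max())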
3.6 Texture Spectral Method

An alternative method for measuring the spatial frequencies of the texture is based on the Fourier transform. From Fourier theory, it is known that many real surfaces can be represented in terms of sinusoidal basis functions. In the Fourier spectral domain, it is possible to characterize the texture present in the image in terms of the energy distributed along the basis vectors. The analysis of the texture in the spectral domain is effective when the texture is composed of repetitive, possibly oriented, structures. This approach has already been used to improve image quality (see Chap. 9 Vol. I) and for noise removal (Chap. 4 Vol. II) using filtering techniques in the frequency domain. In this context, the characterization of the texture in the spectral domain takes place
Fig. 3.9 Images with 4 different types of textures and the corresponding power spectra
by analyzing the peaks that give the orientation information of the texture and the
location of the peaks that provide the spatial periodicity information of the texture
structures. Statistical texture measurements (described above) can be derived after
filtering the periodic components.
The first methods that made use of these spectral features divide the frequency
domain into concentric rings (based on the frequency content) and into segments
(based on oriented structures). The spectral domain is, therefore, divided into regions
and the total energy of each region is taken as the feature characterizing the texture.
Let us denote by F(u, v) the Fourier transform of the image f(i, j) whose texture is to be measured, and by |F(u, v)|² the power spectrum (the symbol |•| represents the modulus of a complex number), which we know to coincide with the Fourier transform of the autocorrelation function ρ_f.
Figure 3.9 shows 4 images with different types of textures and their power spec-
trum. It can be observed how texture structures, linear vertical and horizontal, and
those curves are arranged in the spectral domain, respectively, horizontal, vertical,
and circular. The more textured information is present in the image, the more ex-
tended is the energy distribution. This shows that it is possible to derive the texture
characteristics in relation to the energy distribution in the power spectrum.
In particular, the characteristics of the texture in terms of spectral features are obtained by dividing the Fourier domain into concentric circular regions of radius r. The energy contained in each region characterizes the level of fineness/coarseness of the texture: high energy at large values of r (high frequencies) implies the presence of fine structures, while high energy at small values of r (low frequencies) implies the presence of coarse structures.
The energy evaluated in sectors of the spectral domain identified by the angle θ
reflects the directionality characteristics of the texture. In fact, for the second and
third image of Fig. 3.9, we have a localized energy distribution in the sectors in the
range 40◦ –60◦ and in the range 130◦ –150◦ corresponding to the texture of the spaces
between inclined bricks and inclined curved strips, respectively, present in the third
image. The rest of the energy is distributed across all sectors and corresponds to the
variability of the gray levels of the bricks and streaks.
The functions that can, therefore, be extracted per ring (centered at the origin) are

t_{r_1,r_2} = Σ_{θ=0}^{π} Σ_{r=r_1}^{r_2} |F(r, θ)|²    (3.28)
The power spectrum |F(r, θ)|² is expressed in polar coordinates (r, θ) and, considering its symmetry with respect to the origin (u, v) = (0, 0), only the half-plane above the u frequency axis is analyzed. It follows that the polar coordinates r and θ vary, respectively, as r = 0, ..., R, where R is the maximum radius of the outer ring, and θ varies from 0° to 180°. An analogous function t_{θ_l,θ_k} accumulates |F(r, θ)|² over the circular sector between the angles θ_l and θ_k. From the functions t_{r_i,r_j} and t_{θ_l,θ_k}, n_a × n_s texture measures T_{m,n}, m = 1, ..., n_a; n = 1, ..., n_s, can be defined by sampling the entire spectrum in n_a rings and n_s radial sectors, as shown in Fig. 3.10.
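A possible implementation sketch of these ring/sector energy features is given below; the number of rings and sectors, and the function name, are illustrative choices.

import numpy as np

def spectral_features(img, n_rings=3, n_sectors=4):
    F = np.fft.fftshift(np.fft.fft2(img))
    P = np.abs(F) ** 2                                   # power spectrum
    rows, cols = img.shape
    v, u = np.meshgrid(np.arange(cols) - cols // 2,
                       np.arange(rows) - rows // 2)
    r = np.hypot(u, v)
    theta = np.mod(np.arctan2(v, u), np.pi)              # fold onto the upper half-plane
    r_max = r.max()
    rings = [P[(r >= k * r_max / n_rings) & (r < (k + 1) * r_max / n_rings)].sum()
             for k in range(n_rings)]                    # energy per ring, Eq. (3.28)
    sectors = [P[(theta >= k * np.pi / n_sectors) & (theta < (k + 1) * np.pi / n_sectors)].sum()
               for k in range(n_sectors)]                # energy per angular sector
    return np.array(rings), np.array(sectors)

rings, sectors = spectral_features(np.random.rand(128, 128))
print(rings / rings.sum(), sectors / sectors.sum())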
As an alternative to Fourier, other transforms can be used to characterize the
texture. The choice must be made in relation to the better invariance of the texture
characteristics with respect to the noise. The most appropriate choice is to consider
combined spatial and spectral characteristics to describe the texture.
Fig. 3.10 Textural features from the power spectrum. On the left, the image of the spectrum is partitioned into circular rings, each representing a frequency band (from zero to the maximum frequency), while on the right is shown the subdivision of the spectrum into circular sectors to obtain the directional information of the texture in terms of directional energy distribution
3.7 Texture Based on the Edge Metric
A simple measure based on the edge metric is the local density of the edges, computed from the binary edge image B(i, j) (obtained by thresholding the gradient module) over a window:

T_D(i, j) = (1/W²) Σ_{l=−W}^{W} Σ_{k=−W}^{W} B(i + l, j + k)    (3.31)
where W indicates the image window over which the average value of the module is calculated. A high texture contrast occurs at the maximum values of the module. The contrast expressed by (3.32) can be normalized by dividing it by the maximum pixel value in the window.
The boundary density obtained with the (3.31) has the problem of finding an
adequate threshold to extract the edges. This is not always easy considering that the
threshold is applied to the entire image and is often chosen by trial and error. Instead
of extracting the edges by first calculating the module with an edging operator, an
alternative is given by calculating the gradient gd (i, j) as an approximation of the
distance function between adjacent pixels for a defined window.
The procedure involves two steps:
2. The texture measure T (d), based on the density of the edges, is given as the
mean value of the gradient gd (i, j) for a given distance d (for example, d = 1)
T(d) = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} g_d(i, j)    (3.34)
The micro and macro texture structures are evaluated by the edge density expressed by T(d) in relation to the distance d. This implies that the dimensionality of the feature vector depends on the number of distances d considered. The microstructures of the image are detected for small values of d, while the macrostructures are determined for large values (normally d assumes values from 1 to 10, yielding up to 10 edge-density features) [19]. It can be verified that the function T(d) behaves like a negative autocorrelation function, with inverted peaks: its minimum corresponds to the maximum of the autocorrelation function, while its maximum corresponds to the minimum of the autocorrelation function.
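A minimal sketch of the measure T(d) follows; the distance-based gradient g_d is approximated here by a sum of absolute differences at distance d, which may differ from the book's exact definition.

import numpy as np

def edge_density(f, d=1):
    f = f.astype(np.float64)
    gx = np.abs(f[:, :-d] - f[:, d:])[: f.shape[0] - d, :]   # horizontal differences
    gy = np.abs(f[:-d, :] - f[d:, :])[:, : f.shape[1] - d]   # vertical differences
    g = gx + gy                                              # distance-based gradient g_d
    return g.mean()                                          # T(d), Eq. (3.34)

img = np.random.randint(0, 256, (128, 128))
print([round(edge_density(img, d), 2) for d in range(1, 11)])  # 10 edge-density features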
A measure based on edge randomness is expressed as the Shannon entropy of the gradient module and of the gradient direction:

T_{Er} = Σ_{i=1}^{N} Σ_{j=1}^{N} g_m(i, j) log₂ g_m(i, j)    (3.35)

T_{Eθ} = Σ_{i=1}^{N} Σ_{j=1}^{N} g_θ(i, j) log₂ g_θ(i, j)    (3.36)
Other measures on the periodicity and linearity of the edges are calculated using the direction of the gradient, respectively, through the co-occurrence of pairs of edges with identical orientation and the co-occurrence of pairs of collinear edges (for example, edge 1 with direction ← and edge 2 with the same direction ←, or with opposite directions ← →).
Fig. 3.11 Example of calculating GLRLM matrices for horizontal direction and at 45◦ for a test
image with gray levels between 0 and 3
The texture can also be characterized by runs of consecutive pixels, which in fact represent pixels belonging to oriented segments of a certain length and the same gray level.
In particular, this information is described by GLRLM (Gray Level Run Length Matrix) matrices, reporting how many times sequences of consecutive pixels with identical gray level appear in a given direction [8,20]. In essence, any matrix defined for a given direction θ of the primitives (also called runs) can be seen as a two-dimensional histogram where each of its elements p_θ(z, r), identified by the gray level z and by the length r of the primitives, represents the frequency of these primitives present in the image with a maximum of L gray levels and dimensions M × N. Therefore, a GLRLM matrix has dimensions L × R, where L is the number of gray levels and R is the maximum length of the primitives.
Figure 3.11 shows an example of GLRLM matrices calculated for an image with a size of 5 × 5 and only 4 gray levels. Normally, for an image, 4 GLRLM matrices are calculated for the directions 0°, 45°, 90°, and 135°. To obtain a rotation-invariant matrix p(z, r), the GLRLM matrices can be summed. Several texture measures are then extracted from the statistics of the primitives captured by the rotation-invariant p(z, r).
The original 5 texture measurements [20] are derived from the following 5 statistics:

1. Short Run Emphasis (SRE)

T_SRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)/r²    (3.37)

where N_r = Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) is the total number of runs (primitives).

2. Long Run Emphasis (LRE)

T_LRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · r²    (3.38)
3. Gray Level Nonuniformity (GLN)

T_GLN = (1/N_r) Σ_{z=1}^{L} [Σ_{r=1}^{R} p(z, r)]²    (3.39)

This texture measure evaluates the distribution of the runs over the gray values. The value of the feature is low when the runs are evenly distributed among the gray levels.
4. Run Length Nonuniformity (RLN)

T_RLN = (1/N_r) Σ_{r=1}^{R} [Σ_{z=1}^{L} p(z, r)]²    (3.40)
5. Run Percentage (RP)

T_RP = N_r/(M · N)    (3.41)

This feature measure evaluates the ratio between the number of realized runs and the maximum number of potential runs.
The above measures mostly emphasize the length of the primitives (i.e., the vector p_r(r) = Σ_{z=1}^{L} p(z, r), which represents the distribution of the number of primitives having length r), without considering the gray-level information expressed by the vector p_z(z) = Σ_{r=1}^{R} p(z, r), which represents the distribution of the number of primitives having gray level z [21]. To also consider the gray-level information, two new measures have been proposed [22].
6. Low Gray-Level Run Emphasis (LGRE)

T_LGRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)/z²    (3.42)

This texture measure, based on the gray level of the runs, is analogous to the SRE but, instead of emphasizing the short primitives, it emphasizes those with low gray levels.
7. High Gray-Level Run Emphasis (HGRE)

T_HGRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · z²    (3.43)

This texture measure, based on the gray level of the runs, is analogous to the LRE but, instead of emphasizing the long primitives, it emphasizes those with high gray levels.
Subsequently, by combining the statistics associated with the length of the primitives and with the gray level, 4 further measures have been proposed [23].

8. Short Run Low Gray-Level Emphasis (SRLGE)

T_SRLGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)/(z² · r²)    (3.44)

This texture measure emphasizes the primitives found in the upper left part of the GLRLM matrix, where the primitives with short length and low gray levels are accumulated.
9. Short Run High Gray-Level Emphasis (SRHGE)
T_SRHGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · z²/r²    (3.45)
This texture measure emphasizes the primitives shown in the lower left part of
the GLRLM matrix, where the primitives with short length and high gray levels
are accumulated.
10. Long Run Low Gray-Level Emphasis (LRLGE)
T_LRLGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · r²/z²    (3.46)
This texture measure emphasizes the primitives found in the upper right part of the GLRLM matrix, where the primitives with long length and low gray levels are accumulated.
11. Long Run High Gray-Level Emphasis (LRHGE)
T_LRHGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · r² · z²    (3.47)
Table 3.1 Texture measurements derived from the GLRLM matrices according to the 11 statistics
given by the equations from (3.37) to (3.47) calculated for the images of Fig. 3.12
Image Tex_1 Tex_2 Tex_3 Tex_4 Tex_5 Tex_6 Tex_7
SRE 0.0594 0.1145 0.0127 0.0542 0.0174 0.0348 0.03835
LRE 40.638 36.158 115.76 67.276 85.848 52.37 50.749
GLN 5464.1 5680.1 4229.2 4848.5 6096.3 6104.4 5693.6
RLN 1041.3 931.18 1194.2 841.44 950.43 1115.2 1037
RP 5.0964 5.1826 4.5885 4.8341 5.2845 5.342 5.2083
LGRE 0.8161 0.8246 0.7581 0.7905 0.8668 0.8408 0.82358
HGRE 2.261 2.156 2.9984 2.6466 1.7021 1.955 2.084
SRLGE 0.0480 0.0873 0.0102 0.0435 0.0153 0.0293 0.03109
SRHGE 0.1384 0.3616 0.0312 0.1555 0.0284 0.0681 0.08397
LRLGE 34.546 30.895 86.317 52.874 74.405 44.812 42.847
LRHGE 78.168 67.359 363.92 180.54 146.15 95.811 97.406
Fig. 3.12 Images with natural textures that include fine and coarse structures
This texture measure emphasizes the primitives shown in the lower right part of
the GLRLM matrix, where the primitives with long length and high gray levels
are accumulated.
Table 3.1 reports the results of the 11 texture measures described above applied to the images in Fig. 3.12. The GLRLM matrices were calculated by rescaling the images to 16 gray levels, and the statistics were extracted from the matrix p(z, r) obtained as the sum of the 4 directional matrices.
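A sketch of the GLRLM computation for the horizontal direction, together with a few of the statistics above (SRE, LRE, RP), is given below; the function names, the maximum run length, and the pre-quantized input are illustrative assumptions.

import numpy as np

def glrlm_0deg(img, levels, max_run):
    P = np.zeros((levels, max_run), dtype=np.float64)   # p(z, r), run length r = 1..max_run
    for row in img:
        run_val, run_len = row[0], 1
        for v in row[1:]:
            if v == run_val:
                run_len += 1
            else:
                P[run_val, min(run_len, max_run) - 1] += 1
                run_val, run_len = v, 1
        P[run_val, min(run_len, max_run) - 1] += 1       # close the last run of the row
    return P

def run_length_features(P, n_pixels):
    r = np.arange(1, P.shape[1] + 1, dtype=np.float64)
    Nr = P.sum()                                         # total number of runs
    sre = np.sum(P / r**2) / Nr                          # Short Run Emphasis, Eq. (3.37)
    lre = np.sum(P * r**2) / Nr                          # Long Run Emphasis, Eq. (3.38)
    rp = Nr / n_pixels                                   # Run Percentage, Eq. (3.41)
    return dict(SRE=sre, LRE=lre, RP=rp)

img = np.random.randint(0, 16, (64, 64))                 # 16-level test image
P = glrlm_0deg(img, levels=16, max_run=64)
print(run_length_features(P, img.size))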
3.9 Texture Based on MRF, SAR, and Fractals Models

Let us now examine some model-based methods, originally developed for texture synthesis. When an analytical description of the texture is possible, the texture can be modeled by some characteristic parameters that are subsequently used for the analysis of the texture itself. If this is possible, these parameters are used to describe the texture and to obtain its representation (synthesis). The most widespread texture modeling is that of the discrete Markov Random Field (MRF), which is well suited to represent the local structural information of an image [24] and to classify the
texture [25]. These models are based on the hypothesis that the intensity of each pixel of the image depends only on the intensities of the pixels in its neighborhood, apart from an additive noise term. With this model, each pixel of the image f(i, j) is modeled as a linear combination of the intensity values of the neighboring pixels plus an additive noise n:
f(i, j) = Σ_{(l,k)∈W} f(i + l, j + k) · h(l, k) + n(i, j)    (3.48)
where W indicates the neighborhood window centered on the current pixel (i, j) (almost always of size 3 × 3) and n(i, j) is normally considered random Gaussian noise with zero mean and variance σ². In this MRF model, the parameters are represented by the weights h(l, k) and by the noise term, which are calculated with the least-squares approach, i.e., they are estimated by minimizing the error E expressed by the following functional:
E = Σ_{(i,j)} [ f(i, j) − Σ_{(l,k)∈W} f(i + l, j + k) · h(l, k) + n(i, j) ]²    (3.49)
The texture of the model image is completely described with these parameters, which
are subsequently compared with those estimated by the observed image to determine
the texture class.
A method similar to the MRF is the Simultaneous Autoregressive (SAR) model [26], which likewise uses the spatial relationship between neighboring pixels to characterize the texture and classify it. The SAR model is expressed by the following relationship:
f(i, j) = μ + Σ_{(l,k)∈W} f(i + l, j + k) · h(l, k) + n(i, j)    (3.50)
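A least-squares estimation of the SAR parameters of (3.50) can be sketched as follows; the 3 × 3 neighborhood, the function name, and the interpretation of μ as the mean (bias) term are our assumptions, and the MRF case of (3.48)–(3.49) is analogous without μ.

import numpy as np

def estimate_sar(f):
    f = f.astype(np.float64)
    offsets = [(l, k) for l in (-1, 0, 1) for k in (-1, 0, 1) if (l, k) != (0, 0)]
    rows, cols = f.shape
    # one regression row per interior pixel: [1, neighbors...] -> center value
    A, b = [], []
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            A.append([1.0] + [f[i + l, j + k] for l, k in offsets])
            b.append(f[i, j])
    A, b = np.asarray(A), np.asarray(b)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)    # [mu, h(l,k)...] minimizing the error
    residuals = b - A @ theta
    return theta[0], dict(zip(offsets, theta[1:])), residuals.var()  # mu, weights, noise variance

mu, h, sigma2 = estimate_sar(np.random.rand(64, 64))
print(round(mu, 3), round(sigma2, 4))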
The fractal dimension D is useful for characterizing the texture, since it expresses a measure of surface roughness. Intuitively, the larger the fractal dimension, the rougher the surface. In [13] it is shown that images of various natural textures can be modeled with spatially isotropic fractals.
Generally, the texture related to many natural surfaces cannot be modeled with deterministic fractal models because it shows statistical variations. From this, it follows that the estimation of the fractal dimension of an image is difficult. There are several methods for estimating the parameter D, one of which is described in [29]
as follows. Given the closed set A, we consider windows of side L_max that cover the set A. A version of A scaled down by a factor r results in N = 1/r^D similar sets. This new set can be covered by windows of side L = r·L_max, and therefore their number is related to the fractal dimension D:

N(L) = 1/r^D = (L_max/L)^D    (3.52)
The fractal dimension is, therefore, estimated by the equation (3.52) as follows. For
a given value of L, the n-dimensional space is divided into squares of side L and
the number of squares covering A is counted. The procedure is repeated for different
values of L and therefore the value of the fractal dimension D is estimated with the
slope of the line
ln(N (L)) = −D ln(L) + D ln(L max ) (3.53)
where n(m, L) is the number of windows containing m points and M is the total number of pixels in the image. When overlapping windows of size L on the image, the value (M/m)P(m, L) represents the expected number of windows with m points inside. The expected number of windows covering the entire image is given by
E[N(L)] = M Σ_{m=1}^{N} (1/m) P(m, L)    (3.54)
where M is the mass (understood as the number of pixel entities) of the fractal set and E(M) its expected value. This quantity measures the discrepancy between the actual mass and the expected mass. Small lacunarity values occur when the texture is fine, while large values correspond to coarse textures. The mass of the fractal set is related to the length L in the following way:
M(L) = K L D (3.56)
It is highlighted that M(L) and M 2 (L) are, respectively, the first and second moment
of the probability distribution P(m, L). This lacunarity measurement of the image
is used as a feature of the texture for segmentation and classification purposes.
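A box-counting sketch of the estimate in (3.52)–(3.53) is the following; the choice of box sizes and the use of a binary point set as input are illustrative assumptions.

import numpy as np

def box_counting_dimension(A, sizes=(2, 4, 8, 16, 32)):
    counts = []
    for L in sizes:
        rows, cols = A.shape
        # number of LxL boxes that contain at least one point of A
        n = sum(A[i:i + L, j:j + L].any()
                for i in range(0, rows, L) for j in range(0, cols, L))
        counts.append(n)
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return -slope                                  # D is minus the slope, Eq. (3.53)

A = np.random.rand(256, 256) > 0.7                 # sparse random point set as test
print(round(box_counting_dimension(A), 2))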
3.10 Texture by Spatial Filtering

The texture characteristics can also be determined by spatial filtering (see Sect. 9.9.1 Vol. I), choosing a filter impulse response that effectively accentuates the texture's microstructures. For this purpose, Laws [31] proposed texture measurements based on the convolution of the image f(i, j) with filtering masks h(i, j) of dimensions 5 × 5 that represent the impulse responses of filters designed to detect the different characteristics of textures in terms of uniformity, density, granularity, disorder, directionality, linearity, roughness, frequency, and phase.
From the results of the convolutions g_e = f(i, j) ∗ h_e(i, j) with the various masks, the relative texture measurements T_e are calculated, which express the energy of the detected texture microstructures, such as edges, ripples, homogeneous, point-like, and spot structures. This diversity of texture structures is captured with different convolution operations using appropriate masks defined as follows. It starts with three simple 1D masks

L3 = [ 1  2  1 ]     L − Level
E3 = [ −1  0  1 ]    E − Edge       (3.58)
S3 = [ −1  2  −1 ]   S − Spot

where L3 represents a local mean filter, E3 represents an edge detector filter (first difference), and S3 represents a spot detector filter (second difference). Through the convolution of these masks with themselves and with each other, the following basic 1D 5 × 1 masks are obtained
L5 = L3 ∗ L3 = [1 2 1] ∗ [1 2 1] = [ 1  4  6  4  1 ]         (3.59)
E5 = L3 ∗ E3 = [1 2 1] ∗ [−1 0 1] = [ −1  −2  0  2  1 ]      (3.60)
S5 = L3 ∗ S3 = [1 2 1] ∗ [−1 2 −1] = [ −1  0  2  0  −1 ]     (3.61)
R5 = S3 ∗ S3 = [−1 2 −1] ∗ [−1 2 −1] = [ 1  −4  6  −4  1 ]   (3.62)
W5 = E3 ∗ (−S3) = [−1 0 1] ∗ [1 −2 1] = [ −1  2  0  −2  1 ]  (3.63)
These basic 5 × 1 masks represent, respectively, smoothing (e.g., Gaussian) L5, edge detection (e.g., gradient) E5, spot detection (e.g., Laplacian of Gaussian, LoG) S5, ripple (crest) detection R5, and wave structures W5. From these basic masks, the two-dimensional 5 × 5 masks can be derived through the outer product of the 1D masks with themselves and with each other. For example, the masks E5L5 and L5E5 are obtained from the outer products between E5 and L5, and between L5 and E5, respectively, as follows:
E5L5 = E5ᵀ × L5 = [−1 −2 0 2 1]ᵀ × [1 4 6 4 1] =
⎡ −1  −4   −6  −4  −1 ⎤
⎢ −2  −8  −12  −8  −2 ⎥
⎢  0   0    0   0   0 ⎥    (3.64)
⎢  2   8   12   8   2 ⎥
⎣  1   4    6   4   1 ⎦
L5E5 = L5ᵀ × E5 = [1 4 6 4 1]ᵀ × [−1 −2 0 2 1] =
⎡ −1   −2  0   2  1 ⎤
⎢ −4   −8  0   8  4 ⎥
⎢ −6  −12  0  12  6 ⎥    (3.65)
⎢ −4   −8  0   8  4 ⎥
⎣ −1   −2  0   2  1 ⎦
The mask E5L5 detects the horizontal edges and simultaneously executes a local
average in the same direction, while the mask L5E5 detects the vertical edges. The
number of Laws 2D masks that can be obtained is 25, useful for extracting different
texture structures present in the image. The essential steps of the Laws algorithm for
extracting texture characteristics based on local energy are the following:
With (3.67) we will have the set of 25 texture energy images if the 5 basic 1D masks are used, or 16 if only the first 4 (L5, E5, S5, R5) are used.
A second approach for the calculation of the texture energy images is to consider, instead of the absolute value, the square root of the sum of the squared values of the neighboring pixels, as follows:

T(i, j) = √( Σ_{m=i−W}^{i+W} Σ_{n=j−W}^{j+W} g²(m, n) )    (3.68)
T̂(i, j) = 1/(2W + 1)² Σ_{m=i−W}^{i+W} Σ_{n=j−W}^{j+W} |g(m, n) − μ(m, n)|    (3.70)
T̂(i, j) = 1/(2W + 1)² Σ_{m=i−W}^{i+W} Σ_{n=j−W}^{j+W} [g(m, n) − μ(m, n)]²    (3.71)
where μ(i, j) is the local average of the texture measure g(i, j), relative to
the window (2W + 1) × (2W + 1) centered in the pixel being processed (i, j),
estimated by
μ(i, j) = 1/(2W + 1)² Σ_{m=i−W}^{i+W} Σ_{n=j−W}^{j+W} g(m, n)    (3.72)
5. Significant texture energy images. From the original image f(i, j) the set of energy images is obtained. The energy image T_L5L5(i, j) is not meaningful to characterize the texture, unless we want to consider the contrast of the texture. The remaining 24 or 15 energy images can be further reduced by combining some symmetrical pairs, replacing them with the average of their sum. For example, we know that T_E5L5 and T_L5E5 represent the energy of vertical and horizontal structures (rotation-variant), respectively. If they are added, the resulting energy image T_E5L5/L5E5 corresponds to the edge module (a rotation-invariant texture measurement). The other energy images T_E5E5, T_S5S5, T_R5R5, and T_W5W5 are used directly (rotation-invariant measures). Therefore, using the 5 one-dimensional bases L5, E5, S5, R5, W5, after the combination we have the following 14 energy images
TE5L5/L5E5 TS5L5/L5S5 TW 5L5/L5W 5 TR5L5/L5R5
TE5E5 TS5E5/E5S5 TW 5E5/E5W 5 TR5E5/E5R5
(3.73)
TS5S5 TW 5S5/S5W 5 TR5S5/S5R5 TW 5W 5
TR5W 5/W 5R5 TR5R5
while, using the first 4 masks, one-dimensional bases, we have the following 9
energy images
TE5L5/L5E5 TS5L5/L5S5 TR5L5/L5R5
TE5E5 TS5E5/E5S5 TR5E5/E5R5 (3.74)
TS5S5 TR5R5 TR5S5/S5R5
These texture measurements are used in different applications for image segmentation and classification. In relation to the nature of the texture, to better characterize the microstructures present at various scales, it is useful to verify the impact of the size of the filtering masks on the discriminating power of the texture measurements T. In fact, the method of Laws has also been tested using the 3 × 1 one-dimensional masks given in (3.58), from which the two-dimensional 3 × 3 masks were derived through the outer product of the 1D masks with themselves and with each other. In this case, the one-dimensional ripple mask R3 is excluded, as it cannot be reproduced as a 3 × 3 mask, and the derivable 2D masks are the following:
L3L3 = ⎡ 1 2 1 ⎤    L3E3 = ⎡ 1 0 −1 ⎤    L3S3 = ⎡ −1 2 −1 ⎤
       ⎢ 2 4 2 ⎥           ⎢ 2 0 −2 ⎥           ⎢ −2 4 −2 ⎥
       ⎣ 1 2 1 ⎦           ⎣ 1 0 −1 ⎦           ⎣ −1 2 −1 ⎦

E3L3 = ⎡ −1 −2 −1 ⎤  E3E3 = ⎡ 1 0 −1 ⎤   E3S3 = ⎡ −1 2 −1 ⎤
       ⎢  0  0  0 ⎥         ⎢ 0 0  0 ⎥          ⎢  0 0  0 ⎥
       ⎣  1  2  1 ⎦         ⎣ −1 0 1 ⎦          ⎣  1 −2 1 ⎦

S3L3 = ⎡ −1 −2 −1 ⎤  S3E3 = ⎡ −1 0  1 ⎤  S3S3 = ⎡  1 −2  1 ⎤
       ⎢  2  4  2 ⎥         ⎢  2 0 −2 ⎥         ⎢ −2  4 −2 ⎥
       ⎣ −1 −2 −1 ⎦         ⎣ −1 0  1 ⎦         ⎣  1 −2  1 ⎦
                                                            (3.75)
With the 3 × 3 masks, after the combination of the symmetrical masks, 5 energy images are available.
Laws' energy masks have been applied to the images in Fig. 3.12, and Table 3.2 reports, for each image, the texture measurements derived from the 9 significant energy images obtained by applying the process described above. The 9 energy images reported in (3.74) were used. A window of size 7 × 7 was used to estimate the local energy measurements with (3.68), then normalized with respect to the original image (smoothed with a mean filter). Using a larger window does not change the results and only increases the computational time. Laws tested the proposed method on a sample mosaic of Brodatz textures, obtaining a correct identification of about 90%. Laws' texture measurements have been extended for the volumetric analysis of 3D textures [32].
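A sketch of Laws' energy maps with the first 4 one-dimensional bases is given below; it assumes SciPy's convolve2d is available for the 2D convolutions, and the window size and the normalization by the L5L5 (local mean) response follow the description above.

import numpy as np
from scipy.signal import convolve2d   # assumed available for 2D convolution

BASES = {"L5": np.array([1, 4, 6, 4, 1], float),
         "E5": np.array([-1, -2, 0, 2, 1], float),
         "S5": np.array([-1, 0, 2, 0, -1], float),
         "R5": np.array([1, -4, 6, -4, 1], float)}

def laws_energy_maps(img, W=7):
    img = img.astype(np.float64)
    box = np.ones((W, W))                      # window for the local energy sum
    maps = {}
    for a, va in BASES.items():
        for b, vb in BASES.items():
            mask = np.outer(va, vb)            # 5x5 mask, e.g. E5L5 = E5^T x L5
            g = convolve2d(img, mask, mode="same", boundary="symm")
            maps[a + b] = convolve2d(np.abs(g), box, mode="same", boundary="symm")
    l5l5 = maps.pop("L5L5")                    # used only for contrast normalization
    return {name: m / (l5l5 + 1e-9) for name, m in maps.items()}

maps = laws_energy_maps(np.random.rand(64, 64))
print(sorted(maps.keys()))                     # 15 energy images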
In analogy to the masks of Laws, Haralick proposed, for texture measurement, masks based on those used for edge extraction, starting from the following basic masks:
Table 3.2 Texture measurements related to the images in Fig. 3.12, derived from the energy images of (3.74)
Image Tex_1 Tex_2 Tex_3 Tex_4 Tex_5 Tex_6 Tex_7
L5E5/E5L5 1.3571 2.0250 0.5919 0.9760 0.8629 1.7940 1.2034
L5R5/R5L5 0.8004 1.2761 0.2993 0.6183 0.4703 0.7778 0.5594
E5S5/S5E5 0.1768 0.2347 0.0710 0.1418 0.1302 0.1585 0.1281
S5S5 0.0660 0.0844 0.0240 0.0453 0.0455 0.0561 0.0441
R5R5 0.1530 0.2131 0.0561 0.1040 0.0778 0.1659 0.1068
L5S5/S5L5 0.8414 1.1762 0.3698 0.6824 0.6406 0.9726 0.7321
E5E5 0.4756 0.6873 0.2208 0.4366 0.3791 0.4670 0.3986
E5R5/R5E5 0.2222 0.2913 0.0686 0.1497 0.1049 0.1582 0.1212
S5R5/R5S5 0.0903 0.1178 0.0285 0.0580 0.0445 0.0713 0.0523
h_1 = (1/3) [ 1  1  1 ]ᵀ      h_2 = (1/2) [ 1  0  −1 ]ᵀ      h_3 = (1/2) [ 1  −2  1 ]ᵀ
Another spatial filtering approach to extract texture features is based on the Gabor filters [33]. These filters are widely used for texture analysis, motivated by their spatial localization, orientation selectivity, and frequency characteristics. Gabor filters are seen as precursors of wavelets (see Sect. 2.12 Vol. II), where each filter captures the energy at a particular frequency and for a specific direction. Their diffusion is motivated by mathematical properties and neurophysiological evidence.
In 1946, Gabor showed that the specificity of a signal, simultaneously in time and frequency, is fundamentally limited by a lower bound given by the product of its bandwidth and duration. This limit is Δx·Δω ≥ 1/(4π). Furthermore, he found that signals of the form

s(t) = exp(−t²/α² + jωt)

reach this theoretical limit. The Gabor functions form a complete set of basis (non-orthogonal) functions and allow any function to be expanded in terms of them. Subsequently, Gabor's functions were generalized to two-dimensional space [34,35] to model the profile of the receptive fields of the simple cells of the primary visual cortex (also known as striate cortex or V1).⁵
5 Psychovisual redundancy studies indicate that the human visual system processes images at different scales. In the early stages of vision, the brain performs a sort of analysis at different spatial frequencies and, consequently, the visual cortex is composed of different cells that correspond to different frequencies and orientations. It has also been observed that the responses of these cells are similar to those of the Gabor functions. This multiscale process, which successfully takes place in human vision for texture perception, has motivated the development of texture analysis methods that mimic the mechanisms of human vision.
As we shall see, these functions are substantially bandpass filters that can be
treated together in the 2D spatial domain or in the Fourier 2D domain. These specific
properties of Gabor 2D functions have motivated research to describe and discrim-
inate the texture of images using the power spectrum calculated with Gabor filters
[36]. In essence, it is verified that the texture characteristics found with this method
are locally spatially invariant.
Now let’s see how it is possible to define a bank of Gabor 2D filters to capture the
energy of the image and detect texture measurements at a particular frequency and a
specified direction. In the 2D spatial domain, the canonical elementary function of
Gabor h(x, y) is a complex harmonic function (i.e., composed of the sine and cosine
functions) modulated by a Gaussian oriented function g(xo , yo ), given in the form
where γ is the spatial aspect ratio that specifies the ellipticity of the 2D Gaussian (the support of the Gabor function), and σ is the standard deviation of the Gaussian, which characterizes the extension (scale) of the filter in the spatial domain and its band in the Fourier domain. If γ = 1, the angle θ is no longer relevant because the Gaussian (3.79) becomes circularly symmetric, simplifying the filter (3.78).
The Gabor filter h(x, y) in the Fourier domain is given by

H(u, v) = exp{−2π²σ²[(u_o − U_o)²γ² + (v_o − V_o)²]}    (3.80)
3.10 Texture by Spatial Filtering 297
The Gaussian support of the filter in the frequency domain is an ellipse centered at (U, V), oriented with respect to the u axis, with aspect ratio 1/γ. The complex exponential represents a complex 2D harmonic with radial central frequency

F = √(U² + V²)    (3.81)

where φ is the orientation angle of the sinusoidal harmonic, with respect to the frequency axis u, in the Fourier domain (u, v) (see Fig. 3.13).
Figure 3.14 shows instead the 3D and 2D graphic representation of the real and
imaginary components of a Gabor function.
Although Gabor filters may have a modulating Gaussian support with arbitrary orientation, in many applications it is useful that the modulating Gaussian function has the same orientation as the complex sinusoidal harmonic, i.e., θ = φ. In that case, (3.78) and (3.80) reduce, respectively, to the simplified spatial form (3.83) and, in the frequency domain, to

H(u, v) = exp{−2π²σ²[(u_o − F)²γ² + v_o²]}    (3.84)
Fig. 3.14 Perspective representation of the real component (cosine) and of the imaginary compo-
nent (sine) of a Gabor function with a unitary aspect ratio
Fig. 3.15 Support in the frequency domain of the Gabor filter bank. Each elliptical region represents a range of frequencies for which some filters respond strongly. Regions that lie on the same ring correspond to filters with the same radial frequency F, while regions at different distances from the origin but with identical direction correspond to different scales. In the example shown on the left, the filter bank has 3 scales and 3 directions. The figure on the right shows the frequency responses in the spectral domain H(u, v) of a filter bank with 5 scales and 8 directions
Regions included in the same ring correspond to filters with the same radial frequency, while regions at different distances from the origin but with the same direction correspond to filters with different scales. The goal of defining the filter bank is to map the different textures of an image into the appropriate region that represents the filter's characteristics in terms of frequency and direction.
Gabor’s basic 2D functions are generally spatially localized, oriented and with an
octave bandwidth.6
6 We recall that it is customary to divide the bands with constant percentage amplitudes. Each band is characterized by a lower frequency f_i, a higher frequency f_s, and a central frequency f_c. The most frequently used bandwidths are the octave bands, where the lower and upper extremes are in the ratio 1:2, i.e., f_s = 2f_i. The percentage bandwidth is given by (f_s − f_i)/f_c = constant, with f_c = √(f_i · f_s). There are also 1/3-octave bands, f_s = ∛2 · f_i, where the width of each band is narrower, equal to 23.2% of the nominal central frequency of each band.
where α = √((ln 2)/2). A bank of Gabor filters of arbitrary direction and bandwidth can be defined by varying the 4 free parameters θ, F, σ, γ (or Ω, B, σ, γ) and extending the elliptical regions of the spatial frequency domain with the major axis passing through the origin (see Fig. 3.15). In general, we tend to cover the frequency domain with a limited number of filters and to minimize the overlap of the filter support regions. From (3.83) we observe that, for the sinusoidal component, the Gabor function h(x, y) is a complex function with a real and an imaginary part. The sinusoidal component is given by
and the real (cosine) and imaginary (sine) components of h(x, y) are (see Fig. 3.14)
The functions h_{c,F,θ} and h_{s,F,θ} are, respectively, even (symmetric with respect to the x axis) and odd (symmetric with respect to the origin), and oriented in the direction θ. To obtain the Gabor texture measurements T_{F,θ}, an image I(x, y) is filtered with the Gabor filters (3.88) and (3.89) through the convolution operation, as follows:

T_{c,F,θ}(x, y) = I(x, y) ∗ h_{c,F,θ}(x, y)        T_{s,F,θ}(x, y) = I(x, y) ∗ h_{s,F,θ}(x, y)    (3.90)

The result of the two convolutions is almost identical, apart from a phase difference of π/2 in the θ direction. From the obtained texture measures T_{c,F,θ} and T_{s,F,θ}, it is useful to calculate the energy E_{F,θ} and the amplitude A_{F,θ}.
Fig. 3.16 Bandwidth B and orientation of the frequency domain support of a Gabor filter, expressed by (3.83), whose real and imaginary components in the spatial domain are also represented
Figure 3.17 shows a simple application of Gabor filters to segment the texture of an image. The input image (a) contains 5 types of natural textures, not completely uniform. The features of the textures are extracted with a bank of 32 Gabor filters with 4 scales and 8 directions (figure (d)). The number of available features is, therefore, 32 × 51 × 51 after subsampling the 204 × 204 image by a factor of 4. Figure (b) shows the result of the segmentation obtained by applying the K-means algorithm to the feature images extracted with the Gabor filter bank, while figure (c) shows the result of the segmentation obtained with the Gabor filters defined in (d) after reducing the features to 5 × 51 × 51 by applying data reduction with principal component analysis (PCA) (see Sect. 2.10.1 Vol. II).
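The following sketch assembles a small Gabor filter bank and clusters the filter amplitude responses with K-means, in the spirit of the experiment of Fig. 3.17; the kernel parameters, the circular Gaussian envelope (γ = 1), and the minimal K-means implementation are illustrative assumptions, not the book's code.

import numpy as np

def gabor_kernel(F, theta, sigma=4.0, size=31):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)           # rotate coordinates
    gauss = np.exp(-(x**2 + y**2) / (2 * sigma**2))      # circular Gaussian envelope (gamma = 1)
    return gauss * np.exp(2j * np.pi * F * xr)           # complex harmonic along theta

def gabor_features(img, freqs=(0.1, 0.2, 0.3, 0.4), n_dirs=8):
    feats = []
    for F in freqs:                                      # scales
        for k in range(n_dirs):                          # orientations
            h = gabor_kernel(F, k * np.pi / n_dirs)
            resp = np.abs(np.fft.ifft2(np.fft.fft2(img, img.shape) *
                                       np.fft.fft2(h, img.shape)))   # filtering via FFT
            feats.append(resp)
    return np.stack(feats, axis=-1)                      # one amplitude image per filter

def kmeans(X, k=5, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

img = np.random.rand(96, 96)
F = gabor_features(img)
labels = kmeans(F.reshape(-1, F.shape[-1])).reshape(img.shape)
print(labels.shape, np.unique(labels))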
Another approach to texture analysis is based on the wavelet transform where the
input image is decomposed at various levels of subsampling to extract different image
Fig. 3.17 Segmentation of 5 textures not completely uniform. a Input image; b segmented by
applying K-means algorithm to the features extracted with the Gabor filter bank shown in (d); c
segmented image after reducing the feature images to 5 with the PCA; d the bank of Gabor filters
used with 4 scales and 8 directions
details [37,38]. Texture measurements are extracted from the energy and variance
of the subsampled images. The main advantage of wavelet decomposition is that it
provides a unified multiscale context analysis of the texture.
The syntactic description of the texture is based on the analogy between the spatial relations of texture primitives and the structure of a formal language [39]. The descriptions of the various classes of textures form a language that can be represented by a grammar, whose rules are constructed by analyzing the primitives of sample textures (training set) in the learning phase. The syntactic description of the texture is based on the idea that the texture is composed of primitives repeated and arranged in a regular manner in the image. To fully describe a texture, syntactic methods must essentially determine the primitives and the rules by which these primitives are spatially arranged and repeated. A typical syntactic solution involves using a grammar with rules that generate the texture from primitives, applying transformation rules to a limited number of symbols. The symbols represent in practice the various types of texture primitives, while the transformation rules represent the spatial relations between the primitives.
The syntactic approach must, however, take into account that real-world textures are normally irregular, with errors in the repeated structures occurring in an unpredictable way and with considerable distortions. This means that the rules of a grammar may not efficiently describe real textures if they are not variable, and the grammar must then be of a different type (stochastic grammar). Let us consider a simple grammar for the generation of the texture, starting with a start symbol S and applying transformation rules called shape rules. The texture is generated through various phases:
3.12 Method for Describing Oriented Textures

In different applications, we are faced with so-called oriented textures, that is, the primitives exhibit a local orientation that varies across the image. In other words, the texture shows a dominant local orientation, and in this case we speak of a texture with a high degree of local anisotropy. To describe and visualize this type of texture, it is convenient to think of the gray-level image as representing a flow map where each pixel represents a fluid element subjected to a motion in the dominant direction of the texture, that is, in the direction of maximum variation of the gray levels. In analogy to what happens in the study of fluid dynamics, where each particle is subject to a velocity vector composed of a magnitude and a direction, also in the case of images with oriented textures we can define a texture orientation field, simply called Oriented Texture Field (OTF), which is actually composed of two images: the orientation image and the coherence image. The orientation image includes the local orientation information of the texture for each pixel, while the coherence image represents the degree of anisotropy
in each pixel of the image. The images of the oriented texture fields, as proposed by Rao [40], are calculated with the following five phases:

1. Filter the image with a Gaussian filter of standard deviation σ1 and compute the gradient components.
2. Compute the gradient magnitude G(x, y).
3. Compute the local orientation θ(x, y) of the texture.
4. Smooth the orientation estimates with a Gaussian filter of standard deviation σ2.
5. Compute the coherence of the dominant local orientation.
The first two phases, as known, are realized with standard edge extraction algorithms (see Chap. 1 Vol. II). Recall that the Gaussian gradient operator is an optimal solution for edge extraction. We also point out that the Gaussian filter is characterized by the standard deviation σ1 of the Gaussian distribution, which defines the level of detail with which the geometric structures of the texture are extracted. This parameter, therefore, indicates the degree of detail (scale) of the texture to be extracted.
In the third phase, the local orientation of the texture is calculated by means of the inverse tangent function, which requires only one argument as input and provides a unique result in the interval (−π/2, π/2). With edge extraction algorithms, the maximum gradient direction is normally calculated with the arctangent function, which requires two arguments and does not provide a unique result.
In the fourth phase, the orientation estimates are smoothed by a Gaussian filter with standard deviation σ2. This second filter must have a larger standard deviation than the previous one (σ2 ≫ σ1) and must produce a significant smoothing among the various orientation estimates. The value of σ2 must, however, be smaller than the distance over which the orientation of the texture shows its widest variations and, finally, it must not attenuate (blur) the details of the texture itself.
The fifth phase calculates the texture coherence with respect to the dominant local orientation estimate, i.e., the normalized sum of the projections of the directional vectors of the neighboring pixels onto that direction. If the orientations are coherent, the normalized projections will have a value close to unity; in the contrary case, the projections tend to cancel each other, producing a result close to zero.
Consider an area of the image with different segments whose orientations indicate
the local arrangement of the texture. One could calculate as the dominant direction
the one corresponding to the resulting vector sum of the single local directions.
This approach would have the disadvantage of not being able to determine a single
direction as there would be two angles θ and θ + π . Another drawback would occur if
we considered oriented segments, as some of these with opposite signs would cancel
each other out, instead of contributing to the estimation of the dominant orientation.
Rao suggests the following solution (see Fig. 3.20). Let N be the number of local segments, and consider a line oriented at an angle θ with respect to the horizontal axis x. Consider a segment j with angle θ_j and denote by R_j its length. The sum of the absolute values of the projections of all the segments onto the line is given by

S_1 = Σ_{j=1}^{N} |R_j · cos(θ_j − θ)|    (3.94)
where S_1 varies with the orientation θ of the considered line. The dominant orientation is obtained for the value of θ where S_1 is maximum. In this case, θ is calculated by setting the derivative of the function S_1 with respect to θ to zero. To eliminate the problem of differentiating the absolute value function (not differentiable), it is convenient to consider and differentiate the following sum S_2:

S_2 = Σ_{j=1}^{N} R_j² · cos²(θ_j − θ)    (3.95)

dS_2/dθ = − Σ_{j=1}^{N} 2R_j² · cos(θ_j − θ) sin(θ_j − θ)
Recalling the trigonometric double-angle formulas and then the sine addition formulas, and setting the derivative equal to zero, we obtain the following equations:

− Σ_{j=1}^{N} R_j² · sin 2(θ_j − θ) = 0   ⟹   Σ_{j=1}^{N} R_j² · sin 2θ_j cos 2θ = Σ_{j=1}^{N} R_j² · cos 2θ_j sin 2θ

from which

tan 2θ = [Σ_{j=1}^{N} R_j² · sin 2θ_j] / [Σ_{j=1}^{N} R_j² · cos 2θ_j]    (3.96)
If we denote by θ̂ the value of θ for which the maximum value of S_2 is obtained, this coincides with the best estimate of the local dominant orientation. Now let us see how the
previous equation (3.96) is used for the calculation of the dominant orientation in each pixel of the image. Let us denote by g_x and g_y the horizontal and vertical components of the gradient at each point of the image, and consider the complex quantity g_x + i·g_y, which constitutes the representation of the same pixel in the complex plane. The gradient vector at a point (m, n) of the image can thus be represented in polar coordinates as R_{m,n} e^{iθ_{m,n}}. At this point we can calculate the dominant local orientation angle θ̂ for a neighborhood of (m, n), defined by an N × N pixel window, as follows:

θ̂ = (1/2) tan⁻¹ [ Σ_{m=1}^{N} Σ_{n=1}^{N} R²_{m,n} · sin 2θ_{m,n} / Σ_{m=1}^{N} Σ_{n=1}^{N} R²_{m,n} · cos 2θ_{m,n} ]    (3.97)

The dominant orientation θ at the point (m, n) is given by θ̂ + π/2, because the gradient vector is perpendicular to the direction of anisotropy.
Let G(x, y) be the magnitude of the gradient calculated in phase 2 (see Sect. 3.12) at the point (x, y) of the image plane. The measure of the coherence at the point (x_0, y_0) is calculated by considering a window of size W × W centered at this point (see Fig. 3.21). For each point (x_i, y_j) of the window, the gradient vector G(x_i, y_j), with direction θ(x_i, y_j), is projected onto the unit vector in the direction θ(x_0, y_0). The normalized sum of the absolute values of these projections of the gradient vectors included in the window is considered as an estimate κ of the coherence measure:

κ = Σ_{(i,j)∈W} |G(x_i, y_j) · cos[θ(x_0, y_0) − θ(x_i, y_j)]| / Σ_{(i,j)∈W} G(x_i, y_j)    (3.98)
Fig. 3.22 Calculation of the orientation map and coherence measurement for two images with
vertical and circular dominant textures
This measure is correlated with the dispersion of the directional data. A better coherence measure ρ is obtained by weighting the estimate κ, given by (3.98), with the magnitude of the gradient at the point (x_0, y_0):

ρ = G(x_0, y_0) · Σ_{(i,j)∈W} |G(x_i, y_j) · cos[θ(x_0, y_0) − θ(x_i, y_j)]| / Σ_{(i,j)∈W} G(x_i, y_j)    (3.99)
In this way the coherence is presented with high values, in correspondence with high
values of the gradient, i.e., where there are strong local variations of intensity in the
image (see Fig. 3.22).
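A direct sketch of the computation of the dominant orientation (3.97) and of the coherence (3.98) over a sliding W × W window is the following; central-difference gradients and the window size are illustrative choices, not the book's implementation.

import numpy as np

def oriented_texture_field(f, W=15):
    f = f.astype(np.float64)
    gy, gx = np.gradient(f)                    # vertical and horizontal gradients
    R = np.hypot(gx, gy)                       # gradient magnitude
    theta = np.arctan2(gy, gx)                 # gradient direction
    half = W // 2
    orient = np.zeros_like(f)
    coher = np.zeros_like(f)
    for i in range(half, f.shape[0] - half):
        for j in range(half, f.shape[1] - half):
            Rw = R[i - half:i + half + 1, j - half:j + half + 1]
            tw = theta[i - half:i + half + 1, j - half:j + half + 1]
            num = np.sum(Rw**2 * np.sin(2 * tw))
            den = np.sum(Rw**2 * np.cos(2 * tw))
            th = 0.5 * np.arctan2(num, den)    # theta_hat, Eq. (3.97)
            orient[i, j] = th + np.pi / 2      # dominant texture orientation
            coher[i, j] = (np.sum(Rw * np.abs(np.cos(th - tw)))
                           / (np.sum(Rw) + 1e-9))   # kappa, Eq. (3.98)
    return orient, coher

orient, coher = oriented_texture_field(np.random.rand(64, 64))
print(orient.shape, float(coher.max()))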
The images of coherence and of the dominant orientation previously calculated are
considered as intrinsic images (primal sketch) according to the paradigm of Marr
[41] which we will describe in another chapter.7 These images are obtained with
an approach independent of the applicability domain. They are also independent
of light conditions. Certain conditions can be imposed in relation to the type of
application to produce appropriate intrinsic images. These intrinsic images find a
field of use for the inspection of defects in the industrial automation sector (wood
7 Primal sketch indicates the first information that the human visual system extracts from the scene; in the context of image processing, these are the first features extracted, such as edges, corners, homogeneous regions, etc. We can think of a primal sketch image as equivalent to the significant strokes that an artist draws to express the scene.
defects, skins, textiles, etc.). From these intrinsic images, it is possible to model
primitives of oriented textures (spirals, ellipses, radial structures) to facilitate image
segmentation and interpretation.
Coarseness: has a direct relation to the scale and repetition frequency of the primitives (texels), i.e., it is related to the spatial distance over which strong gray-level variations of the textural structures occur. Coarseness is referred to in [42] as the fundamental characteristic of the texture. The extremes of the coarseness property are coarse and fine; these properties help to identify texture macrostructures and microstructures, respectively. Basically, the measure of coarseness is calculated using local operators with windows of various sizes: a local operator with a large window can be used for coarse textures, while operators with small windows are adequate for fine textures. The coarseness measure is calculated as follows:
3. For each pixel, choose the value of k which maximizes the difference E_k in both directions, so as to obtain the highest difference value:

S_best(x, y) = 2^k    with    k = arg max_{k=1,...,5} max_{d=h,v} E_{k,d}(x, y)    (3.103)
These 4 factors are considered separately to develop the estimation of the contrast measure. In particular, to evaluate the polarization of the distribution of gray levels, the statistical index of kurtosis is considered, to detect how flat or peaked a distribution is with respect to the normal distribution.⁸
To take into account the dynamics of the gray levels (factor 1), Tamura includes the variance σ² in the contrast calculation. The contrast measure T_con is defined as follows:

T_con = σ/α_4^m    (3.105)

with

α_4 = μ_4/σ⁴    (3.106)
8 Normally a distribution is evaluated with respect to the normal distribution by considering two indexes: the asymmetry (or skewness) index γ_1 = μ_3/μ_2^{3/2} and the kurtosis index γ_2 = μ_4/μ_2² − 3, where μ_n indicates the central moment of order n. From the analysis of the two indexes we detect the deviation of a distribution from the normal one:
• γ_1 < 0: negative asymmetry, that is, the left tail of the distribution is very long;
• γ_1 > 0: positive asymmetry, that is, the right tail of the distribution is very long;
• γ_2 < 0: the distribution is platykurtic, that is, much flatter than the normal one;
• γ_2 > 0: the distribution is leptokurtic, that is, much more peaked than the normal one;
• γ_2 = 0: the distribution is mesokurtic, meaning that its kurtosis is similar to that of the normal distribution.
where μ_n indicates the central moment of order n (see Sect. 8.3.2 Vol. I) and m = 1/4 is the value experimentally determined by Tamura in 1978 as the one producing the best results. It is pointed out that the contrast measure (3.105) is based on the kurtosis index α_4 defined by Tamura with (3.106), which differs from the index currently in use, reported in footnote 8. The contrast measure expressed by (3.105) does not include factors 3 and 4 above.
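A minimal sketch of (3.105)–(3.106), assuming a grayscale NumPy array; the function name is illustrative.

```python
import numpy as np

def tamura_contrast(img, m=0.25):
    """Contrast T_con = sigma / alpha_4^m, with alpha_4 = mu_4 / sigma^4 and m = 1/4."""
    img = img.astype(np.float64)
    mu = img.mean()
    sigma2 = ((img - mu) ** 2).mean()      # variance (second central moment)
    if sigma2 == 0:
        return 0.0                         # perfectly flat image: no contrast
    mu4 = ((img - mu) ** 4).mean()         # fourth central moment
    alpha4 = mu4 / sigma2 ** 2             # kurtosis index as defined by Tamura, Eq. (3.106)
    return np.sqrt(sigma2) / alpha4 ** m   # Eq. (3.105)
```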
Directionality: intended not as an orientation in itself, but as the relevance of the presence of orientation in the texture. This is because it is not always easy to describe the orientation of a texture, while it is easier to assess whether two textures differ only in orientation, in which case their directionality can be considered the same. The directionality measure is evaluated by taking into consideration the module and the direction of the edges. Following the Prewitt operator (for edge extraction, see Sect. 1.7 Vol. II), the directionality is estimated from the horizontal derivative Δ_x(x, y) and the vertical derivative Δ_y(x, y), calculated by convolving the image f(x, y) with the 3 × 3 Prewitt kernels, and then evaluating, for each pixel (x, y), the module |Δ| and the direction θ of the edge:

|Δ| = sqrt(Δ_x^2 + Δ_y^2) ≅ |Δ_x| + |Δ_y|,   θ = tan^{-1}(Δ_y / Δ_x) + π/2          (3.107)
Subsequently a histogram Hdir (θ ) is constructed from the values of the quan-
tized directions (normally in 16 directions) evaluating the frequency of the edge
pixels corresponding to high values of the module, higher than a certain thresh-
old. The histogram is relatively uniform, for images without strong orientations,
and presents different peaks, for images with oriented texture. The directionality
measure T_dir proposed by Tamura considers the sum of the second-order moments of H_dir computed only on the values around the peaks of the histogram, between adjacent valleys, and is given by

T_dir = 1 − r · n_p Σ_{p=1}^{n_p} Σ_{θ∈w_p} (θ − θ_p)^2 H_dir(θ)          (3.108)
where n p indicates the number of peaks, θ p is the position of the p-th peak, w p
indicates the range of angles included in the p-th peak (i.e., the interval between
valleys adjacent to the peak), r is a normalization factor associated with the quan-
tized values of the angles θ , and θ is the quantized angle. Alternatively, we can
consider the sum of moments of the second order of all the values of the histogram
as the measure of directionality Tdir instead of considering only those around the
peaks.
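A minimal sketch of the directionality measure, assuming a grayscale NumPy array; the module threshold, the number of histogram bins, and the single-peak simplification of (3.108) are illustrative assumptions rather than Tamura's exact formulation.

```python
import numpy as np
from scipy.ndimage import prewitt

def tamura_directionality(img, n_bins=16, t=12.0, r=1.0):
    img = img.astype(np.float64)
    dx = prewitt(img, axis=1)                  # horizontal derivative (Delta_x)
    dy = prewitt(img, axis=0)                  # vertical derivative (Delta_y)
    mag = (np.abs(dx) + np.abs(dy)) / 2.0      # approximated module |Delta|
    theta = np.arctan2(dy, dx) % np.pi         # edge direction folded into [0, pi)
    # Histogram H_dir of quantized directions, counting only strong edges (8-bit range assumed).
    strong = mag > t
    hist, edges = np.histogram(theta[strong], bins=n_bins, range=(0.0, np.pi))
    hist = hist / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Single-peak variant of Eq. (3.108), i.e. n_p = 1: second moment of H_dir around its main peak.
    p = centers[np.argmax(hist)]
    return 1.0 - r * np.sum(((centers - p) ** 2) * hist)
```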
Line-Likeness: describes a local texture structure composed of lines; when the direction of an edge and the directions of the nearby edges are almost equal, they are considered similar linear structures. The measure of line-likeness is calculated similarly to the co-occurrence matrix GLCM (described in Sect. 3.4.4), except that in this case the frequency of the directional co-occurrence is computed between edge pixels that are at a distance d and have similar direction (more precisely, if the orientation of
the relative edges is kept within an orientation interval). In the computation of the directional co-occurrence matrix P_Dd, only edges with module higher than a predefined threshold are considered, thus filtering out the weak ones. The directional co-occurrence is weighted with the cosine of the difference of the angles of the edge pair: in this way, co-occurrences in the same direction are weighted with +1 and those with perpendicular directions with −1. The line-likeness measure T_lin is given by

T_lin = Σ_{i=1}^{n} Σ_{j=1}^{n} P_Dd(i, j) cos[(i − j) 2π/n]  /  Σ_{i=1}^{n} Σ_{j=1}^{n} P_Dd(i, j)          (3.109)
where the directional co-occurrence matrix PDd (i, j) has dimensions n × n and
has been calculated by Tamura using the distance d = 4 pixels.
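A minimal sketch of (3.109); for simplicity the directional co-occurrences are collected only along a horizontal displacement d, whereas Tamura considers the displacement along the edge direction. The threshold and the number of direction bins are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import prewitt

def tamura_linelikeness(img, n_bins=16, d=4, t=12.0):
    img = img.astype(np.float64)
    dx, dy = prewitt(img, axis=1), prewitt(img, axis=0)
    mag = (np.abs(dx) + np.abs(dy)) / 2.0
    theta = np.arctan2(dy, dx) % np.pi
    codes = np.minimum((theta / np.pi * n_bins).astype(int), n_bins - 1)
    strong = mag > t                            # keep only strong edge pixels
    h, w = img.shape
    # Directional co-occurrences P_Dd between edge pixels at horizontal distance d.
    P = np.zeros((n_bins, n_bins))
    a, b = codes[:, :w - d], codes[:, d:]
    m = strong[:, :w - d] & strong[:, d:]
    np.add.at(P, (a[m], b[m]), 1)
    if P.sum() == 0:
        return 0.0
    i, j = np.meshgrid(np.arange(n_bins), np.arange(n_bins), indexing="ij")
    # Eq. (3.109): cosine-weighted co-occurrences normalized by their total number.
    return float(np.sum(P * np.cos((i - j) * 2 * np.pi / n_bins)) / P.sum())
```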
Regularity: intended as a measure that captures information on the spatial regularity of texture structures. A texture without repetitive spatial variations is considered regular, unlike a texture that has strong spatial variations, which is observed as irregular. The measure of regularity proposed by Tamura is derived from the combination of the previous texture measures of coarseness, contrast, directionality, and line-likeness.
These four measures are calculated by partitioning the image into regions of equal size, obtaining a vector of measures for each region. The measure of regularity is thought of as a measure of the variability of the four measures over the entire image (i.e., over all regions): a small variation in the first four measures indicates a regular texture. Therefore, the regularity measure T_reg is defined as follows:

T_reg = 1 − r (σ_crs + σ_con + σ_dir + σ_lin)          (3.110)
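A minimal sketch of (3.110), reusing the sketch functions given above; the block size and the normalizing factor r are illustrative assumptions.

```python
import numpy as np

def tamura_regularity(img, block=64, r=0.25):
    """Regularity T_reg = 1 - r * (sum of the std. deviations of the 4 per-block measures)."""
    img = img.astype(np.float64)
    feats = []
    for y in range(0, img.shape[0] - block + 1, block):
        for x in range(0, img.shape[1] - block + 1, block):
            b = img[y:y + block, x:x + block]
            feats.append([tamura_coarseness(b), tamura_contrast(b),
                          tamura_directionality(b), tamura_linelikeness(b)])
    feats = np.asarray(feats)
    sigma = feats.std(axis=0)   # sigma_crs, sigma_con, sigma_dir, sigma_lin over all blocks
    return 1.0 - r * sigma.sum()
```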
Figure 3.25 shows the first three Tamura texture measures estimated for some sample images. Tamura texture measures are widely used in image retrieval applications based on the visual attributes contained in the image, known in the literature as CBIR (Content-Based Image Retrieval) [43].
Fig. 3.23 Generation of the set {A_k}_{k=0,1,...,5} of average images at different scales for each pixel of the input image

Fig. 3.24 From the set {A_k}_{k=0,1,...,5} of average images with different scales k, for each pixel P(x, y) the absolute differences E_{k,h} and E_{k,v} of the averages between non-overlapping pairs on opposite sides are calculated, in the horizontal direction h and in the vertical direction v, respectively, as shown in the figure

Fig. 3.25 The first three Tamura texture measures (coarseness, contrast, and directionality) calculated for some types of images
These measures, however, have many limitations in discriminating fine textures. Often the first three Tamura measures are used, treated as a 3D image whose three components, Coarseness-coNtrast-Directionality (CND), are considered in analogy with the RGB components. One-dimensional and multidimensional histograms can then be calculated from the CND image. More accurate measures can be obtained using other edge extraction operators (for example, Sobel). Tamura measures have also been extended to deal with 3D images [44].
References
1. R.M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification. IEEE
Trans. Syst. Man Cybern. B Cybern. 3(6), 610–621 (1973)
2. B. Julesz, Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
3. B. Julesz, Textons, the elements of texture perception, and their interactions. Nature 290, 91–97
(1981)
4. R. Rosenholtz, Texture perception, in The Oxford Handbook of Perceptual Organization, ed. by J. Wagemans (Oxford University Press, 2015), pp. 167–186. ISBN 9780199686858
5. T. Caelli, B. Julesz, E.N. Gilbert, On perceptual analyzers underlying visual texture discrimi-
nation. Part II. Biol. Cybern. 29(4), 201–214 (1978)
6. J.R. Bergen, E.H. Adelson, Early vision and texture perception. Nature 333(6171), 363–364
(1988)
7. R. Rosenholtz, Computational modeling of visual texture segregation, in Computational Models
of Visual Processing, ed. by M. Landy, J.A. Movshon (MIT Press, Cambridge, MA, 1991), pp.
253–271
8. R. Haralick, Statistical and structural approaches to texture. Proc. IEEE 67(5), 786–804 (1979)
9. Y. Chen, E. Dougherty, Grey-scale morphological granulometric texture classification. Opt.
Eng. 33(8), 2713–2722 (1994)
10. C. Lu, P. Chung, C. Chen, Unsupervised texture segmentation via wavelet transform. Pattern
Recognit. 30(5), 729–742 (1997)
11. A.K. Jain, F. Farrokhnia, Unsupervised texture segmentation using gabor filters. Pattern Recog-
nit. 24(12), 1167–1186 (1991)
12. A. Bovik, M. Clark, W. Geisler, Multichannel texture analysis using localised spatial filters. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 55–73 (1990)
13. A. Pentland, Fractal-based description of natural scenes. IEEE Trans. Pattern Anal. Mach.
Intell. 6(6), 661–674 (1984)
14. G. Lowitz, Can a local histogram really map texture information? Pattern Recognit. 16(2),
141–147 (1983)
15. R. Lerski, K. Straughan, L. Schad, D. Boyce, S. Blüml, I. Zuna, Mr image texture analysis an
approach to tissue characterisation. Magn. Reson. Imaging 11, 873–887 (1993)
16. W.K. Pratt, Digital Image Processing, 2nd edn. (Wiley, 1991). ISBN 0-471-85766-1
17. S.W. Zucker, D. Terzopoulos, Finding structure in co-occurrence matrices for texture analysis.
Comput. Graphics Image Process. 12, 286–308 (1980)
18. L. Alparone, F. Argenti, G. Benelli, Fast calculation of co-occurrence matrix parameters for
image segmentation. Electron. Lett. 26(1), 23–24 (1990)
19. L.S. Davis, A. Mitiche, Edge detection in textures. IEEE Comput. Graphics Image Process.
12, 25–39 (1980)
20. M.M. Galloway, Texture classification using grey level runlengths. Comput. Graphics Image
Process. 4, 172–179 (1975)
21. X. Tang, Texture information in run-length matrices. IEEE Trans. Image Process. 7(11), 1602–
1609 (1998)
22. A. Chu, C.M. Sehgal, J.F. Greenleaf, Use of gray value distribution of run lengths for texture
analysis. Pattern Recogn. Lett. 11, 415–420 (1990)
23. B.R. Dasarathy, E.B. Holder, Image characterizations based on joint gray-level run-length
distributions. Pattern Recogn. Lett. 12, 497–502 (1991)
24. G.C. Cross, A.K. Jain, Markov random field texture models. IEEE Trans. Pattern Anal. Mach.
Intell. 5, 25–39 (1983)
25. R. Chellappa, S. Chatterjee, Classification of textures using gaussian markov random fields.
IEEE Trans. Acoust. Speech Signal Process. 33, 959–963 (1985)
26. J.C. Mao, A.K. Jain, Texture classification and segmentation using multiresolution simultane-
ous autoregressive models. Pattern Recognit. 25, 173–188 (1992)
27. D. Rao, S. Sharma, R. Mohan, Classification of image at different resolution using rotation invariant model. Int. J. Innovative Res. Adv. Eng. 1(4), 109–113 (2014)
28. B.B. Mandelbrot, The Fractal Geometry of Nature (Freeman, San Francisco, 1983)
29. J.M. Keller, S. Chen, R.M. Crownover, Texture description and segmentation through fractal geometry. Comput. Vis. Graphics Image Process. 45(2), 150–166 (1989)
30. R.F. Voss, Random fractals: Characterization and measurement, in Scaling Phenomena in
Disordered Systems, ed. by R. Pynn, A. Skjeltorp (Plenum, New York, 1985), pp. 1–11
31. K.I. Laws, Texture energy measures. in Proceedings of Image Understanding Workshop, pp.
41–51 (1979)
32. M.T. Suzuki, Y. Yaginuma, A solid texture analysis based on three dimensional convolution
kernels. in Proceedings of the SPIE, vol. 6491, pp. 1–8 (2007)
33. D. Gabor, Theory of communication. IEEE Proc. 93(26), 429–441 (1946)
34. J.G. Daugman, Uncertainty relation for resolution, spatial frequency, and orientation optimized
by 2d visual cortical filters. J. Opt. Soc. Am. A 2, 1160–1169 (1985)
35. J. Malik, P. Perona, Preattentive texture discrimination with early vision mechanism. J. Opt.
Soc. Am. A 5, 923–932 (1990)
36. I. Fogel, D. Sagi, Gabor filters as texture discriminator. Biol. Cybern. 61, 102–113 (1989)
37. T. Chang, C.C.J. Kuo, Texture analysis and classification with tree-structured wavelet trans-
form. IEEE Trans. Image Process. 2(4), 429–441 (1993)
38. J.L. Chen, A. Kundu, Rotation and gray scale transform invariant texture identification using
wavelet decomposition and hidden markov model. IEEE Trans. PAMI 16(2), 208–214 (1994)
39. M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis and Machine Vision. CL Engineer-
ing, third edition (2007). ISBN 978-0495082521
40. A.R. Rao, R.C. Jain, Computerized flow field analysis: oriented texture fields. IEEE Trans.
Pattern Anal. Mach. Intell. 14(7), 693–709 (1992)
41. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing. in Proceedings
of the Royal Society of London. Series B, Biological Sciences, vol. 211, pp. 151–180 (1981)
42. H. Tamura, S. Mori, T. Yamawaki, Textural features corresponding to visual perception. IEEE Trans. Syst. Man Cybern. SMC-8(6), 460–473 (1978)
43. S.H. Shirazi, A.I. Umar, S. Naz, N. ul Amin Khan, M.I. Razzak, B. AlHaqbani, Content-based image retrieval using texture color shape and region. Int. J. Adv. Comput. Sci. Appl. 7(1), 418–426 (2016)
44. T. Majtner, D. Svoboda, Extension of tamura texture features for 3d fluorescence microscopy.
in Proceedings of 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIM-
PVT), pp. 301–307. IEEE, 2012. ISBN 978-1-4673-4470-8
4 Paradigms for 3D Vision
The human visual system addresses the problems of 3D vision using a binocular
visual system, a remarkable richness of elementary processors (neurons) and a model
of reconstruction based also on the a priori prediction and knowledge of the world.
In the field of artificial vision, the current trend is to develop 3D systems oriented to specific domains but with characteristics that go in the direction of imitating some functions of the human visual system: for example, using systems with multiple cameras, analyzing time-varying image sequences, observing the scene from multiple points of view, and making the most of prior knowledge about the specific application. With 3D vision systems based on these features, it is possible to try
to optimize the 2D to 3D inversion process, obtaining the least ambiguous results
possible.
Once the scene is reconstructed, the vision system performs the perception phase
trying to make hypotheses that are verified with the predicted model and evaluating
its validity. If a hypothesis cannot be accepted, a new hypothesis of description of
the scene is reformulated until the comparison with the model is acceptable. For the
formation of the hypothesis and the verification with the model, different processing
steps are required for the acquired data (2D images) and for the data known a priori
that represent the models of the world (the a priori knowledge for a certain domain).
A 3D vision system must be incremental in the sense that its elementary process
components (tasks) can be extended to include new descriptions to represent the
model and to extract new features from the images of the scene. In a vision system,
the 3D reconstruction of the scene and the understanding of the scene (perception),
are the highest level tasks, based on the results achieved by the lower level tasks (acquisition, pre-processing, feature extraction, ...) and the intermediate ones (segmentation, clustering, etc.). The understanding of the scene can be achieved only
through the cooperation of the various elementary calculation processes and through
an appropriate control strategy for the execution of these processes.
In fact, the biological visual systems have different control strategies including sig-
nificant parallel computing skills, a complex computational model based on learning,
remarkable adaptive capacity, and high incremental capacity in knowledge learning.
An artificial vision system to imitate some functions of the biological system should
include the following features:
1. Several image processing algorithms are applied to the raw data (pre-
processing and data transformation) to make them available to higher-level
processes (for example, to the segmentation process).
2. Extraction of higher-level information such as homogeneous regions corre-
sponding to parts of the object and objects of the scene.
3. Reconstruction and understanding of the scene based on the results of point
(2) and on the basis of a priori knowledge.
The control strategy of such a system can be summarized as an iterative procedure:

1. Make the best choice based on the knowledge and current status achieved.
2. Use the results achieved with the last choice to improve and increase the
available information about the problem.
3. If the goal has not been reached, return to the first step otherwise the procedure
ends.
(a) The level of algorithms in the Marr model tacitly includes the level of robustness
and stability.
(b) In developing a component of the vision system, the three levels (computational theory, algorithms, and implementation) are often considered, but when activating the vision process on real data (images) it is possible to obtain absurd results because, for example, the noise present in the input images was neglected (or not well modeled) at the computational level.
(c) Need to introduce stability criteria for the algorithms, assuming, for example, a
noise model.
(d) Another source of noise is given by the use of uncertain intermediate data (such
as points, edges, lines, contours, etc.) and in this case, stability can be feasible
using statistical analysis.
(e) Analysis of the stability of results is a current and future theme of research that
will avoid the problem of algorithms that have an elegant mathematical basis but
do not operate properly on real data.
The bottom-up approaches are potentially more general, since they operate only on the information extracted from the 2D images, together with the calibration data of the acquisition system, to interpret the 3D objects of the scene. Vision systems of the bottom-up type are therefore oriented toward more general applications. The top-down approaches assume the presence of particular objects, or classes of objects, to be localized in the 2D images, and the problems are solved in a more deterministic way. Vision systems of the top-down type are oriented to solving more specific applications rather than pursuing a more general theory of vision.
Marr observes that the complexity of vision processes imposes a sequence of ele-
mentary processes to improve the geometric description of the visible surface. From
the pixels, it is necessary to delineate the surface and derive some of its character-
istics, for example, the orientation and the depth with respect to the observer, and
finally arrive at the complete 3D description of the object.
The input to the biological visual system is the set of images formed on the retina, seen as a matrix of intensity values of the light reflected by the physical structures of the observed external environment.
The goal of the first stages of vision (early vision) is to create, from the 2D image, a
description of the physical structures: the shape of the surface and of the objects, their
orientation, and distance from the observer. This goal is achieved by constructing
a distinct number of representations, starting from the variations in light intensity
observed in the image. This first representation is called Primal Sketch.
This primary information describes the variations in intensity present in the image
and makes some global structures explicit. This first stage of the vision process locates the discontinuities of light intensity in correspondence with the edge points, which often coincide with the geometric discontinuities of the physical structures of the observed scene.
The primal sketches correspond to the edges and small homogeneous areas present in the image, including their location, orientation, and whatever else can be determined. From this primary information, by applying adequate algorithms based on grouping principles, more complex primary structures (contours, regions, and texture) can be derived, called the full primal sketch.
The ultimate goal of early vision processes is to describe the surface and shape of
objects with respect to the observer, i.e., to produce a world representation observed in
the reference system centered with respect to the observer (viewer-centred). In other
words, the early vision process in this viewer-centered reference system produces
a representation of the world, called 2.5D sketch. This information is obtained by
analyzing the information on depth, movement, and derived shape, analyzing the
primal sketch structures. The extracted 2.5D structures describe the structures of the
world with respect to the observation point. A vision system must fully recognize
an object. In other words, it is necessary that the 2.5D viewer centered structures
are expressed in the object reference system (object-centered) and not referred to
the observer. Marr indicates this level of representation of the world as 3D model
representation.
In this process of 3D formation of the world model, all the primary informa-
tion extracted from the primary stages of vision are used, proceeding according to
a bottom-up model based on general constraints of 3D reconstruction of objects,
rather than on specific hypotheses of the object. The Marr paradigm foresees a dis-
tinct number of levels of representation of the world, each of them is a symbolic
representation of some aspects of information derived from the retinal image.
Marr’s paradigm sees the vision process based on a computational model of a set
of symbolic descriptions of the input image. The process of recognizing an object, for
example, can be considered achieved when one, among the many descriptions derived
from the image, is comparable with one of those memorized, which constitutes the
representation of a particular class of the known object. Different computational
models are developed for the recognition of objects, and their diversity is based on
how concepts are represented as distributed activities on different elementary process
units. Some algorithms have been implemented, based on the neural computational
model to solve the problem of depth perception and object recognition.
From the 2D image, the primary information, or primal sketch, is extracted; this can be any elementary structure such as edges, straight edges and right-angle corners, texture, and other discontinuities present in the image. These elementary structures
are then grouped to represent higher-level physical structures (contours, parts of an
object) that can be used later to provide 3D information of the object, for example, the
superficial orientation with respect to the observer. These primal sketch structures
can be extracted from the image at different geometric resolutions just to verify their
physical consistency in the scene.
The primal sketch structures, derived by analyzing the image, are based on the
assumption that there is a relationship between zones in the image where the light
intensity and the spectral composition varies, and the areas of the environment where
the surface or objects are delimited (border between different surfaces or different
objects). Let us immediately point out that this relationship is not univocal, and it is
not simple. There are reasonable considerations for not assuming that any variation
in luminous or spectral intensity in the image corresponds to the boundary or edge of
an object or a surface of the scene. For example, consider an environment consisting
Fig. 4.1 Brightness fluctuations even in homogeneous areas of the image, as can be seen from the graph of the intensity profile (intensity versus distance in pixels) relative to line 320 of the image
of objects with matte surfaces, i.e., that the light reflected in all directions from every
point on the surface has the same intensity and spectral composition (Lambertian
model, described in Chap. 2 Vol. I).
In these conditions, the boundaries of an object or the edges of a surface will
emerge in the image in correspondence of the variations of intensity. It is found that
in reality, these discontinuities in the image emerge also for other causes, for example,
due to the effect of the edges derived from a shadow that falls on an observed surface.
The luminous intensity varies even in the absence of geometric discontinuities of the surface, as a consequence of the fact that the intensity of the reflected light is a function of the angle of the surface with respect to the direction of the incident light. The intensity of
the reflected light has maximum value if the surface is perpendicular to the incident
light, and decreases as the surface rotates in other directions. The luminous intensity
changes in the image, in the presence of a curved surface and in the presence of
texture, especially in natural scenes. In the latter case, the intensity and spectral
composition of the light reflected in a particular direction by a surface with texture
varies locally and consequently generates a spatial variation in luminous intensity
within a particular region of the image corresponding to a particular surface.
To get an idea about the complex relationship between natural surface and the
reflected light intensity resulting in the image, see Fig. 4.1 where it is shown how
the brightness varies in a line of the image of a natural scene. Brightness variations
are observed in correspondence of variation of the physical structures (contours and
texture) but also in correspondence of the background and of homogeneous physical
structures. Brightness fluctuations are also due to the texture present in some areas
and to the change in orientation of the surface with respect to the direction of the
source.
To solve the problem of the complex relationship between the structures of natural
scenes and the structures present in the image, Marr proposes a two-step approach.
Fig. 4.2 Results of the LoG filter applied to the Koala image. The first line shows the extracted
contours with the increasing scale of the filter (from left to right) while in the second line the images
of the zero crossing are shown (closed contours)
In the first step the image is processed making explicit the significant variations
in brightness, thus obtaining what Marr calls a representation of the image, in terms
of raw primal sketch.
In the second step the edges are identified with particular algorithms that process
the information raw primal sketch of the previous step to describe information and
structures of a higher level, called Perceptual Chunks.
Marr’s approach has the advantage of being able to use raw primal sketch informa-
tion for other perceptual processes that operate in parallel, for example, to calculate
depth or movement information. The first stages of vision are influenced by the noise
of the image acquisition system. In Fig. 4.1 it can be observed how the brightness
variations in the image are present also in correspondence of uniform surfaces. These
variations at different scales are partly caused by noise.
In Chap. 4 Vol. II we have seen how it is possible to reduce the noise present in the image by applying an adequate smoothing filter that does not alter the significant structures of the image (the high frequencies corresponding to the edges). Marr and Hildreth [2] proposed an original algorithm to extract raw primal
sketch information by processing images of natural scenes.
The algorithm has been described in Sect. 1.13 Vol. II, also called the Laplacian
of Gaussian (LoG) filter operator, used for edge extraction and zero crossing. In the
context of extracting the raw primal sketch information, the LoG filter is used with
different Gaussian filters to obtain raw primal sketch at different scales for the same
image as shown in Fig. 4.2. It is observed how the various maps of raw primal sketch
(zero crossing in this case) represent the physical structures at different scales of
representation.
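A minimal sketch of this multiscale use of the LoG filter, assuming a grayscale NumPy array; the set of scales, the zero-crossing test, and the function names are illustrative choices, not taken from the original algorithm.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def zero_crossings(resp, eps=1e-3):
    """Mark pixels where the LoG response changes sign horizontally or vertically."""
    zc = np.zeros(resp.shape, dtype=bool)
    zc[:, :-1] |= (resp[:, :-1] * resp[:, 1:] < 0) & (np.abs(resp[:, :-1]) > eps)
    zc[:-1, :] |= (resp[:-1, :] * resp[1:, :] < 0) & (np.abs(resp[:-1, :]) > eps)
    return zc

def raw_primal_sketch(img, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """One zero-crossing map per scale: small sigma keeps fine detail (and noise),
    large sigma keeps only coarse, significant intensity variations."""
    img = img.astype(np.float64)
    return {s: zero_crossings(gaussian_laplace(img, sigma=s)) for s in sigmas}
```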
In particular, a very narrow Gaussian filter highlights the noise together with the significant variations in brightness (small variations in brightness are due both to noise and to physical structures), while, as a wider filter is used, only the zero crossings corresponding to significant variations remain (although they are not accurately localized), to be associated with the real structures of the scene, with the noise almost eliminated.
Marr and Hildreth proposed to combine the various raw primal sketch maps extracted
at different scales to obtain more robust primal sketch than the original image with
contours, edges, homogeneous areas (see Fig. 4.3).
Marr and Hildreth assert that at least in the early stages of biological vision the LoG
filter is implemented for the extraction of the zero crossing at different filtering scales.
The biological evidence of the theory of Marr and Hildreth has been demonstrated
by several researchers. In 1953 Kuffler had discovered the spatial organization of the
receptive fields of retinal ganglion cells (see Sect. 3.2 Vol. I).
In particular, Kuffler [3] studied the effect of a luminous spot on ganglion cells and observed concentric receptive fields with circular symmetry, with a central excitatory region (sign +) and a surrounding inhibitory one (see Fig. 4.4). Some ganglion cells instead presented receptive fields with the concentric regions excited with opposite sign.
In 1966 Enroth-Cugell and Robson [4] discovered, in relation to temporal response
properties, the existence of two types of ganglion cells, called X and Y cells. The
X cells have a linear response, proportional to the difference between the intensity
of light that affects the two areas and this response is maintained over time. The
Y cells do not have a linear response and are transient. This cellular distinction is
also maintained up to the lateral geniculate nucleus of the visual cortex. Enroth-
Cugell and Robson showed that the intensity contribution for both areas is weighted
according to a Gaussian distribution, and the resulting receptive field is described
as the difference of two Gaussians (called Difference of Gaussian (DoG) filter, see
Sect. 1.14 Vol. II).
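A minimal sketch of the DoG receptive-field model and of its similarity to the LoG response; the ratio of the two standard deviations (about 1.6, as often reported for this approximation), the sigma values, and the test image are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def dog_response(img, sigma=2.0, ratio=1.6):
    """Difference of Gaussians: narrow (excitatory center) minus wide (inhibitory surround)."""
    img = img.astype(np.float64)
    return gaussian_filter(img, sigma) - gaussian_filter(img, ratio * sigma)

if __name__ == "__main__":
    # Quick shape comparison with the LoG on a smoothed random test image.
    rng = np.random.default_rng(0)
    img = gaussian_filter(rng.random((128, 128)), 3.0)
    d = dog_response(img, sigma=2.0)
    l = -gaussian_laplace(img, sigma=2.0)   # sign chosen to match center-ON polarity
    corr = np.corrcoef(d.ravel(), l.ravel())[0, 1]
    print(f"correlation between DoG and LoG responses: {corr:.3f}")
```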
Fig. 4.4 Receptive fields of the center-ON and center-OFF ganglion cells. They are characterized by an almost circular receptive field divided into two parts, an internal area called center and an external one called periphery. Both respond well to changes in lighting between
the center and the periphery of their receptive fields. They are divided into two classes of ganglion
cells center-ON and center-OFF, based on the different responses when excited by a light beam. As
shown in the figure, the first (center-ON) respond with excitement when the light is directed to the
center of the field (with spotlight or with light that illuminates the entire center), while the latter
(center-OFF) behave in the opposite way, that is, they are very little excited. Conversely, if the
light beam affects the peripheral part of both, they are the center-OFF that respond very well (they
generate a short electrical excitable membrane signal) while the center-ON are inhibited. Ganglion
cells respond primarily to differences in brightness, making our visual system sensitive to local spatial variations rather than to the absolute amount of light reaching the retina
From this, it follows that the LoG operator can be seen as functionally equivalent to the DoG, and the output of the operator ∇²G ∗ I is analogous to the response of the retinal X cells and of the cells of the lateral geniculate nucleus (LGN). Positive values of ∇²G ∗ I correspond to the central zone of the X cells and negative values to the surrounding concentric zone. In this hypothesis, the problem arises that both positive and negative values must be available for the determination of the zero crossings in the image ∇²G ∗ I. This would not be possible, since nerve cells cannot operate with negative values in their responses to compute the zero crossings. Marr and Hildreth explained, for this reason, the existence of cells in the visual cortex that are excited with opposite signs, as shown in Fig. 4.4.
This hypothesis is weak if we consider the inadequacy of the concentric areas of the receptive fields of the X cells for the accurate calculation of the function ∇²G ∗ I.
With the interpretation of Marr and Hildreth, cells with concentric receptive fields
cannot determine the presence of edges in a classical way, as done for all the other
edge extraction algorithms (see Chap. 1 Vol. II).
A plausible biological rule for extracting the zero crossings from the X cells would be to find adjacent active cells that operate with positive and negative values in the central receptive area, respectively, as shown in Fig. 4.5. The zero crossings are determined with the logical AND connection of two cells, center-ON and center-OFF (see Fig. 4.5a).
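A minimal computational sketch of this AND scheme, assuming that the center-ON and center-OFF activities are modeled as the positive and negative parts of the LoG response; the threshold, sigma, and function names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def zc_from_on_off(img, sigma=2.0, eps=1e-3):
    """Signal a zero crossing where an active 'ON' cell is adjacent to an active 'OFF' cell."""
    resp = gaussian_laplace(img.astype(np.float64), sigma=sigma)
    on = resp > eps                              # "center-ON" activity (positive response)
    off = resp < -eps                            # "center-OFF" activity (negative response)
    zc = np.zeros(resp.shape, dtype=bool)
    # Logical AND between horizontally or vertically adjacent ON/OFF pairs (Fig. 4.5a).
    zc[:, :-1] |= (on[:, :-1] & off[:, 1:]) | (off[:, :-1] & on[:, 1:])
    zc[:-1, :] |= (on[:-1, :] & off[1:, :]) | (off[:-1, :] & on[1:, :])
    return zc
```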
With this idea it is also possible to extract segments of zero crossings, by organizing cells with receptive fields of opposite sign into two ordered columns (see Fig. 4.5b). Two X cells of opposite sign are connected through a logical AND connection
Fig. 4.5 Functional scheme proposed by Marr and Hildreth for the detection of zero crossing from
cells of the visual cortex. a Overlapping receptive fields of two cells center-ON and center-OFF
of LGN; if both are active, a zero crossing ZC is located between the two cells and detected by
these if they are connected with a logical AND conjunction. b Several different AND logic circuits
associated with pairs of cells center-ON and center-OFF (operating in parallel) are shown which
detect an oriented segment of zero crossing
producing an output signal only if the two cells are active indicating the presence of
zero crossing between the cells. The biological evidence of the LoG operator explains
some features of the visual cortex and how it works at least in the early stages of
visual perception for the calculation of segments. It is not easy to explain how the
nervous system combines this elementary information (zero crossing) generated by
ganglion cells to obtain the information called primal sketch.
Another important feature of the theory of Marr and Hildreth is to provide a
plausible explanation for the existence of cells, in the visual cortex, that operate with
different spatial frequencies in a similar way to the filter ∇ 2 G varying the width of
the filter itself through the parameter σ of the Gaussian. Campbell and Robson [5]
in 1968 discovered with their experiments that visual input is processed in multiple
independent channels, each of which analyzes a different band of spatial frequencies.
Following Marr's theory, in the early stages of vision the first elementary information, called primal sketch, is extracted. In reality, the visual system uses this elementary information and organizes it at a higher level to generate more important perceptual structures called chunks. This is done to reach a perception of the world made not of elementary structures (borders and homogeneous areas) but of 3D objects whose visible surface is well reconstructed. How this higher-level perceptual organization is realized has been studied by several researchers in the nineteenth and twentieth centuries.
We can immediately affirm that, while for the retina all its characteristics have been studied, as well as how the signals are transmitted through the optic nerve to the visual cortex, for the latter, called Area 17 or sixth zone, the mechanisms of perception for the reconstruction of objects and of their motion are not yet clear. Hubel and Wiesel [6] have shown that the signals coming from the retina through the fibers of the optic nerve arrive in the fourth layer of Area 17, passing through the lateral geniculate nucleus
(see Fig. 3.12 of Vol. I). In this area of the visual cortex it is hypothesized that the
retinal image is reconstructed maintaining the information from the first stages of
vision (first and second derivatives of luminous intensity).
From this area of the visual cortex, different information is transmitted to the various layers of the visual cortex in relation to the tasks of each layer (motor control of the eyes, perception of motion, perception of depth, integration of different primal sketches to generate chunks, etc.).
The in-depth study of human perception began in the nineteenth century, when psychology was established as a modern autonomous discipline detached from philosophy [7]. The first psychologists (von Helmholtz [8], influenced by J. Stuart Mill) studied perception on the basis of associationism. It was assumed that the perception of an object can be conceived in terms of a set of sensations that emerge from past experience and that the sensations that make up the object have always presented themselves together to the perceiving subject. Helmholtz asserted that the past perceptive experience of the observer imposes an unconscious inference that automatically links the dimensional aspects of the perception of the object, taking into account its distance.
In the early twentieth century, the association theory was opposed by a group of psychologists (M. Wertheimer 1923, W. Kohler 1947, and K. Koffka 1935) who founded the school of Gestalt psychology (i.e., the psychology of form), and who gave the most important contributions to the study of perception. The Gestaltists argued
that it is wrong to say that perception can be seen as a sum of stimuli linked by
associative laws, based on past experience (as the associationists thought). At the
base of the Gestalt theory is this admission:
we do not perceive sums of stimuli, but forms, and the whole is much more
than the sums of the components that compose it.
In Fig. 4.6 it is observed how each of the represented forms is perceived as three squares, regardless of the fact that its components are completely different (stars, lines, circles).
The Gestalt idea can be stated as: the observed whole is greater than the sum of its elementary components. The natural world is perceived as composed of discrete objects of various sizes, which appear well highlighted with respect to the background.
Fig. 4.7 Ambiguous figures that produce different perceptions. a Figure with two possible inter-
pretations: two human face profiles or a single black vase; b perception of a young or old woman,
designed by William Ely Hill 1915 and reported in a paper by the psychologist Edwin Boring in
1930
Even if the surfaces of the objects have a texture, there is no difficulty in perceiving
the contours of the objects, unless they are somehow camouflaged, and generally,
the homogeneous areas belonging to the objects (foreground ) with respect to the
background. Some graphic and pictorial drawings, made by man can present some
ambiguity when they are interpreted, and their perception can lead to errors. For example, this happens when trying to distinguish the objects from the background in some famous figures (see Fig. 4.7, which highlights the perceptual ambiguity indicated above).
In fact, figure (a), conceived by Gestalt psychologist Edgar Rubin, can be inter-
preted by perceiving the profiles of two human figures or perceived as a single black
vase. It is impossible to perceive both human figures and the vase simultaneously.
Figure (b) on the other hand can be interpreted by perceiving an old woman or a
young woman.
Some artists have produced paintings or engravings with ambiguities between
the background and the figure represented, based on the principles of perceptive
reversibility. Examples of reversible perception are given by the Necker cube in
1832 (see Fig. 4.8), which consists of a two-dimensional representation of a three-
dimensional wire-frame cube. The intersections between two lines do not show which
line is above the other and which is below, so the representation is ambiguous. In
other words, it is not possible to indicate which face is facing the observer and which
is behind the cube. Looking at the cube (figure (a)) or at the corner (figure (d)) for a long time, they appear alternately concave and convex, in relation to the perceptual reactivity of the person.
Some first perceive the lower left face of the cube as the front face (figure (b))
facing the observer or alternatively it is perceived as further back as the lower rear
face of the cube (figure (c)). In a similar way, the same perceptive reversibility
Fig. 4.8 a The Necker cube is a wire-frame drawing of a cube with no visual cues (like depth or orientation). b One possible interpretation of the Necker cube; it is often claimed to be the most common interpretation because people view objects from above (seeing the lower left face as being in front) more often than from below. c Another possible interpretation. d The same perceptive reversibility occurs by observing a corner of the cube, which appears alternately concave and convex
Fig. 4.9 Examples of stable real figures: a hexagonal figure which also corresponds to the two-
dimensional projection of a cube seen from a corner; b predominantly stable perception of over-
lapping circular disks
occurs by observing a corner of the cube that appears alternately concave and convex
(figure (d)).
Several psychologists have tried to study what are the principles that determine
the ambiguities in the perception of these figures through continuous perceptual
change. In the examples shown, the perceptible data (the objects) are invariant, that
is, they remain the same, while, during the perception, only the interpretation between
background and objects varies. It would seem that the perceptual organization was
of the top-down type.
High-level structures of perceptive interpretation seem to condition and guide
continuously, low-level structures extracted from image analysis. Normally in nat-
ural scenes and in many artificial scenes, no perceptual ambiguity is presented. In
these cases, there is a stable interpretation of the components of the scene and its
organization.
As an example of stable perception we have Fig. 4.9a which, seen individually by several people, is perceived as a hexagon, while, if we recall the 3D cube of Fig. 4.8a, Fig. 4.9a also represents the cube seen by an observer positioned on one of its corners. Essentially, perceptual ambiguity does not occur because the hexagonal figure of Fig. 4.9a is a correct and real two-dimensional projection of the 3D cube seen from one of its corners. Also Fig. 4.9b gives a stable perception of a set of overlapping circular disks, instead of being interpreted, alternatively, as disks interlocked with each other (for example, thinking that two disks have a circular notch because a circular portion has been removed).
Gestalt psychologists have formulated some ideas about the perceptive organiza-
tion to explain why, by observing the same figure, there are some different perceptions
between them. Some principles of the Gestalt theory were based mainly on the idea
of grouping elementary regions of the figures to be interpreted and other principles
were based on the segregation of the objects extracted with respect to the back-
ground. In the first case, it is fundamental to group together the elementary regions
of a figure (object) to obtain a larger region which constitutes, as a whole, the figure
to be perceived. Some of the perceptive principles of the Gestalt theory are briefly
described below.
4.4.3.1 Proximity
A basic principle of the perceptive organization of a scene is the proximity of
its elementary structures. Elementary structures of the scene, which are close to
each other, are perceived grouped. In Fig. 4.10a, b we observe vertical and horizontal linear structures, respectively: the vertical spacing of the elementary structures (points) is smaller than the horizontal one for the vertical structures (Fig. 4.10a), while the horizontal spacing is smaller than the vertical one for the horizontal structures (Fig. 4.10b). If the points are equally spaced horizontally and vertically (see Fig. 4.10c), the
ciple of proximity is in the visual perception of depth as we will describe later in the
paragraph on binocular vision.
4.4.3.2 Similarity
Elementary structures of the scene that appear similar tend to be grouped together
from the perceptive point of view (see Fig. 4.11). The regions appear distinct because,
the visual stimulus elements that are similar, are perceived grouped and components
of the same grouping (figure (a)). In Fig. 4.11b vertical linear structures are perceived
although, due to the proximity principle, they should be perceived as horizontal struc-
tures. This shows that the principle of similarity can prevail over information per-
ceived by the principle of proximity. There is no strict formulation of how structures
are aggregated with the principle of similarity.
Fig. 4.11 Similar elements of visual stimulation tend to be perceived grouped as components of the
same object: a two distinct perceived regions consisting of groups of similar elements; b perceive
the vertical structures because the dominant visual stimulus is the similarity, between the visual
elements, rather than the proximity
Fig. 4.12 The elements of visual stimulation that move, in the same direction and speed, are
perceived as components of the same whole: in this case a swarm of birds
Fig. 4.13 Looking at the figure a we have the perception of two curves that intersect in X for the
principle of continuity and not two separate structures. The same happens for the figure b where in
this case the mechanisms of perception combine perceptual stimuli of continuity and proximity
4.4.3.4 Continuity
As shown in Fig. 4.13a, an observer tends to perceive two intersecting curves at the
point X , instead of perceiving two separate irregular graphic structures that touch
at the point X . The Gestalt theory justifies this type of perception in the sense that
we tend to preserve the continuity of a curvilinear structure instead of producing
structures with strong discontinuities and graphic interruptions. Some mechanisms of
perception combine the Gestalt principles of proximity and continuity. This explains
why completely dissimilar elementary structures can be perceived as belonging to
the same graphic structure (see Fig. 4.13b).
4.4.3.5 Closure
The perceptive organization of different geometric structures, such as shapes, letters,
pictures, etc., can generate closed figures as shown in Fig. 4.14 (physically nonexis-
tent) thanks to inference with other forms. This happens even when the figures are
partially overlapping or incomplete. If the closure law did not exist, the image would
represent an assortment of different geometric structures with different lengths, rota-
tions, and curvatures, but with the law of closure, we perceptually combine the
elementary structures into whole forms.
Fig. 4.15 Principle of symmetry. a Symmetrical elements (in this case with respect to the central
vertical axis) are perceived combined and appear as a coherent object. b This does not happen in a
context of asymmetric elements where we tend to perceive them as separate
4.4.3.6 Symmetry
The principle of symmetry indicates the tendency to perceive symmetrical elementary geometric structures rather than asymmetric ones. Visual perception favors the
connection of symmetrical geometric structures forming a region around the point
or axis of symmetry. In Fig. 4.15a it is observed how symmetrical elements are
easily perceptively connected to favor a symmetrical form rather than considering
them separated as instead happens for the figure (b). Symmetry plays an important
role in the perceptual organization by combining visual stimuli in the most regular
and simple way possible. Therefore, the similarities between symmetrical elements
increase the probability that these elements are grouped together forming a single
symmetrical object.
Fig. 4.16 Principle of relative dimension and adjacency. a The dominant perception aided by the
contour circumference is that of the black cross (or seen as a helix) because it is smaller than the
possible white cross. b The perception of the black cross becomes even more evident by becoming
even smaller. c After the rotation of the figure, the situation is reversed, the perception of the white
cross is dominant because the vertical and horizontal parts are better perceived than the oblique
ones. d The principle of relative size eliminates the ambiguity of the vase/face figure shown in
Fig. 4.7, in fact, in the two vertical images on the left, the face becomes background and the vase
is easily perceived, vice versa, in the two vertical images on the right the two faces are perceived
while the vase is background
salience, simplicity, and orderliness. Humans will perceive and interpret ambiguous or complex objects as the simplest forms possible. The form that is perceived is as good as the prevailing conditions allow. In other words, what is perceived
or what determines perceptive stimuli to make a form appear is intrinsically the
characteristic of prägnanz or good form it possesses in terms of regularity, sym-
metry, coherence, homogeneity, simplicity, conciseness, and compactness. These
characteristics contribute to the greater probability of favoring various stimuli for
the perception process. Thanks to these properties the forms take on a good shape
and certain irregularities or asymmetries are attenuated or eliminated, also due to the
stimuli deriving from the spatial relations of the elements of the field of view.
Fig. 4.18 Perception of homogeneous and nonhomogeneous regions. a In the circular field of view
three different homogeneous regions are easily visible characterized by linear segments differently
oriented and with a fixed length. b In this field of view some curved segments are randomly located
and oriented, and one can just see parts of homogeneous regions. c Homogeneous regions are
distinguished, with greater difficulty, not by the different orientation of the patterns but by the
different types of the same patterns
between the regions. In this case, it is more difficult to perceive different regions and
the process of grouping by similarity is difficult. It can be concluded by saying that
the pattern orientation variable is a significant parameter for the perception of the
elementary structures of different regions.
Julesz [10] (1965) emphasized that in the process of grouping by similarity, vision
occurs spontaneously through a perceptual process with pre-attention, which pre-
cedes the identification of elementary structures and objects in the scene. In this
context, a region with a particular texture is perceived by carefully examining and
comparing the individual elementary structures in analogy to what happens when
one wants to identify a camouflaged animal through a careful inspection of the
scene. The hypothesis that is made is that the mechanism of pre-attentive grouping is implemented in the early stages of the vision process (early stage), and the biological evidence is that the brain responds if texture patterns differ significantly in orientation between the central and the external area of their receptive fields.
Julesz extends the study of grouping by similarity, including other significant
variables to characterize the patterns, such as brightness and color, which are indis-
pensable for the perception of complex natural textures. He showed that two regions are perceived as separate if they have a significant difference in brightness and color.
These new parameters, the color and brightness, used for grouping, seem to influ-
ence the perceptive process by operating on the mean value instead of considering
punctual values of the differences between brightness and color values.
A region whose one half presents dominant patterns with black squares and gray levels toward black, and whose other half presents dominant patterns with white squares and light gray levels, is perceived as two different regions thanks to the perception of the boundary between its subregions.
The perceptual process discriminates the regions based on the average value of
the brightness of the patterns in the two subregions and is not based on the details of
the composition of these average brightness values. Similarly, it occurs to separate
regions with different colored patterns. Julesz also used parameters based on the
granularity or spatial distribution of patterns for the perception of different regions.
Regions that have an identical average brightness value, but a different spatial pattern
arrangement, have an evident separation boundary and are perceived as different
regions (see Fig. 4.19).
In analogy with what has been pointed out by Beck and Attneave, Julesz has also found the orientation parameter of the patterns to be important for the perception of the different regions. In particular, he observed the process of perception based
on grouping and evaluated in formal terms the statistical properties of the patterns to
be perceived. At first, Julesz argued that two regions cannot be perceptually separated
if their first and second-order statistics are identical.
The statistical properties were derived mathematically based on Markovian pro-
cesses in the one-dimensional case or by applying random geometry techniques
generating two-dimensional images. The differences in the statistical properties of
the first order of the patterns highlight the differences in the global brightness of the
patterns (for example, the difference in the value of the average brightness of the
regions). The difference in the statistical properties of the second order highlights
instead the differences in the granularity and orientation of the patterns.
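A minimal sketch illustrating this distinction, not taken from Julesz's original formulation: a first-order statistic (the mean gray level of a region) and a simple second-order statistic (a co-occurrence-based contrast at a given displacement), which is sensitive to granularity and orientation. An 8-bit gray-level range and non-negative displacements are assumed; names are illustrative.

```python
import numpy as np

def first_order_mean(region):
    """First-order statistic: average gray level of the region."""
    return float(region.astype(np.float64).mean())

def second_order_contrast(region, dy=0, dx=1, levels=8):
    """Second-order statistic: GLCM 'contrast' for displacement (dy, dx), dy, dx >= 0."""
    q = np.clip((region.astype(np.float64) / 256.0 * levels).astype(int), 0, levels - 1)
    h, w = q.shape
    a = q[dy:, dx:]                       # pixel shifted by the displacement
    b = q[:h - dy, :w - dx]               # reference pixel
    P = np.zeros((levels, levels))
    np.add.at(P, (a.ravel(), b.ravel()), 1)
    P /= P.sum()
    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    # Two regions with identical mean gray level can still give different values here.
    return float(np.sum(P * (i - j) ** 2))
```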
Julesz's initial approach, which attempted to model in mathematical terms the significant variables that determine grouping by similarity, was challenged when applied to concrete artificial examples, and Julesz himself modified his theory into the theory of textons (the elementary structures of texture). He believed that only a difference in the textons, or in their density, can be detected pre-attentively.
He stated that the pre-attentive process operates in parallel while the attentive
(i.e., focused attention) operates in a serial way. The elementary structures of
the texture (texton) can be elementary homogeneous areas (blob) or rectilinear
segments (characterized by parameters such as appearance ratio and orienta-
tion) that correspond to the primal sketch representation proposed by Marr
(studies on the perception of texture will influence the same Marr).
Figure 4.20 shows an image with a texture generated by two types of texton.
Although the two texton (figure (a) and (b)) are perceived as different when viewed
in isolation, they are actually structurally equivalent, having the same size and con-
sisting of the same number of segments and each having two ends of segments. The
figure image (c) is made from texton of type (b) representing the background and
from texton of type (a) in the central area representing the region of attention. Since
both textons are randomly oriented and in spite of being structurally equivalent,
the contours of the two textures are not pre-attentively perceptible except with focused attention. It would seem that only the number of textons, together with their shape characteristics, is important, while their spatial orientation, together with the closure and continuity characteristics, does not seem to be important.
Julesz states that vision with attention does not determine the position of sym-
bols but evaluates their number, or density or first-order statistics. In other
words, Julesz’s studies emphasize that the elementary structures of texture do
not necessarily have to be identical to be considered together by the processes
of grouping. These processes seem to operate between symbols of the same
brightness, orientation, color, and granularity. These properties correspond to those that we know to be extracted from the first stages of vision, and correspond
1 Various studies have highlighted mechanisms of co-processing of perceptive stimuli, reaching the
distinction between automatic attentive processes and controlled attentive processes. In the first
case, the automatic processing (also called pre-attentive or unconscious) of the stimuli would take
place in parallel, simultaneously processing different stimuli (relating to color, shape, movement,
. . .) without the intervention of attention. In the second case, the focused attention process requires a
sequential analysis of perceptive stimuli and the combination of similar ones. It is hypothesized that
the pre-attentive process is used spontaneously as a first approach (in fact it is called pre-attentive)
in the perception of familiar and known (or undemanding) things, while the focused attention
intervenes subsequently, in situations of uncertainty or in sub-optimal conditions, sequentially
analyzing the stimuli. In other words, the mechanism of perception would initially tend toward a pre-attentive process, processing in parallel all the stimuli of the perceptual field, and then, when faced with an unknown situation, activate a focused attention process if necessary to adequately integrate the information from the individual stimuli. The complexity of the perception mechanisms has not yet been rigorously explained.
Fig. 4.20 Example of two non-separable textures despite being made with structurally similar elements (textons) (figures a and b) in terms of dimensions, number of segments, and terminations. The region of interest, located at the center of the image (figure c) and consisting of textons of type (b), is difficult to perceive with respect to the background (distractor component), represented by textons of type (a), when both textons are randomly oriented
So far we have quantitatively examined some Gestalt principles, such as similarity and the evaluation of form, to justify perception based on grouping. All the studies used only artificial images. It remains to be shown whether this perceptive organization is still valid when applied to natural scenes. Several studies have demonstrated the applicability of the Gestalt principles to explain how some animals, with their particular texture, are able to camouflage themselves in their habitat, making it difficult for their predatory antagonists to perceive them. Various animals, prey and predators, have developed particular systems of perception that differ from each other and are also based on the process of grouping by similarity or on the process of grouping by proximity.
We have analyzed how the Gestalt principles are useful descriptions to analyze
perceptual organization in the real world, but we are still far from having found
an adequate theory that explains why these principles are valid and how perceptual
organization is realized. The Gestalt psychologists themselves have tried to answer these questions by hypothesizing a model of the brain based on force fields that characterize objects.
Marr’s theory attempts to give a plausible explanation of the process of perception, emphasizing how such principles can be incorporated into a computational model that detects structures hidden in uncertain data obtained from natural images.
Fig. 4.21 Example of a raw primal sketch map derived from the zero crossings calculated at different scales by applying the LoG filter to the input image
With Marr’s theory, the first stages of vision are not understood as processes that directly solve the segmentation problem or extract objects in the traditional sense (a generally complex and ambiguous activity).
The goal of the early vision process, as seen by Marr, is to describe the surface of objects by analyzing the real image, even in the presence of noise, complex elementary structures (textures), and shaded areas. We have already described in Sect. 1.13 Vol. II the biological evidence for the extraction of zero crossings in the early vision process proposed by Marr-Hildreth, to extract the edges of the scene, called the raw primal sketch. The primitive structures found in the raw primal sketch maps are edges, homogeneous elementary areas (blobs), bars, and terminations in general, which can be characterized by attributes such as orientation, brightness, length, width, and position (see Fig. 4.21).
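To make the Marr-Hildreth scheme just described concrete, the following minimal Python sketch (added here as an illustration, not taken from the book) approximates a raw primal sketch by filtering an image with a Laplacian of Gaussian at several scales and marking the zero crossings; the function names, the chosen scales, and the contrast threshold are illustrative assumptions.

import numpy as np
from scipy import ndimage


def zero_crossings(log_img, threshold=0.0):
    """Binary map of sign changes (zero crossings) in a LoG-filtered image."""
    zc = np.zeros(log_img.shape, dtype=bool)
    # A sign change between horizontally or vertically adjacent pixels
    # marks a zero crossing of the LoG response.
    signs = np.sign(log_img)
    zc[:, :-1] |= (signs[:, :-1] * signs[:, 1:]) < 0
    zc[:-1, :] |= (signs[:-1, :] * signs[1:, :]) < 0
    # Optionally keep only crossings with sufficient local slope (contrast).
    grad = np.hypot(*np.gradient(log_img))
    return zc & (grad > threshold)


def raw_primal_sketch(image, sigmas=(1.0, 2.0, 4.0)):
    """Zero-crossing maps of the LoG at several scales (one map per sigma)."""
    image = image.astype(float)
    return {s: zero_crossings(ndimage.gaussian_laplace(image, sigma=s))
            for s in sigmas}


if __name__ == "__main__":
    # Synthetic test image: a bright square on a dark background.
    img = np.zeros((128, 128))
    img[40:90, 30:100] = 1.0
    maps = raw_primal_sketch(img)
    for s, m in maps.items():
        print(f"sigma={s}: {m.sum()} zero-crossing pixels")

As in Fig. 4.21, coarser scales (larger sigma) retain only the more significant contours, while finer scales also respond to small details and noise.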
The raw primal sketch map is normally very complex, and from it higher-level global structures must be extracted, such as the surfaces of objects and the various textures present. This is achieved in the successive stages of the processes of early vision by recursively assigning place tokens (elementary structures such as bars, oriented segments, contours, points of discontinuity, etc.) to small structures of the visible surface or to aggregations of structures, generating primal sketch maps.
These place-token structures are then aggregated to form larger structures and, where possible, the process is repeated cyclically. The place-token structures can be characterized (at different scales of representation) by the position of the elementary structures they represent (elementary homogeneous areas, edges, short straight segments or curved lines), by the termination of a longer edge, line, or elongated elementary area, or by a small aggregation of symbols.
The process of aggregation of the place-token structures can proceed by clustering those close to each other on the basis of changes in spatial density (see Fig. 4.22) and proximity, by curvilinear aggregation, which produces contours joining aligned structures that are close to each other (see Fig. 4.23), or through the aggregation of oriented texture structures (see Fig. 4.24).
Fig. 4.22 Full primal sketch maps derived from raw primal sketch maps by organizing elementary
structures (place tokens) into larger structures, such as regions, lines, curves, etc., according to
continuity and proximity constraints
(a) Edge and curve alignment (b) Termination alignment (c) Curved alignment
Fig. 4.23 Full primal sketch maps derived from raw primal sketch maps by aggregating elementary structures (place tokens), such as elements of boundaries or terminations of segments, into larger structures (for example, contours), according to constraints of spatial alignment and continuity. a The shading, generated by the illumination in the indicated direction, creates local variations of brightness, and by applying closure principles the contours related to the shadows are also detected, in addition to the edges of the object itself; b The circular contour is extracted from the curvilinear aggregation of the terminations of elementary radial segments (place tokens); c Curved edge detection through curvilinear aggregation and alignment of small elementary structures
This latter type of aggregation implies grouping similar structures oriented in a given direction and other groups of similar structures oriented in other directions. In this context, it is easy to extract rectilinear or curved structures by aligning the terminations or the discontinuities, as shown in the figures.
The process of grouping the place-token structures is based on local proximity (i.e., adjacent elementary structures are combined) and on similarity (i.e., similarly oriented elementary structures are combined), but the resulting structures of the visible surface can be influenced by more global considerations. For example, in the context of curvilinear aggregation, a closure principle can allow two segments that are edge elements to be joined even if the change in brightness across the segments is appreciable, due to the effect of the lighting conditions (see Fig. 4.23a).
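To give a concrete feeling for grouping place tokens by proximity, the sketch below clusters a set of token positions with a simple distance-threshold rule (single-linkage grouping via union-find). It is only an illustrative stand-in for the aggregation processes described above; the distance threshold and the token representation as 2D points are assumptions.

import numpy as np


def group_by_proximity(points, max_dist=10.0):
    """Single-linkage grouping: tokens closer than max_dist end up in one cluster."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    pts = np.asarray(points, dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= max_dist:
                union(i, j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


if __name__ == "__main__":
    # Two spatially separated groups of place tokens (x, y positions).
    tokens = [(0, 0), (3, 2), (5, 1), (50, 48), (52, 51)]
    print(group_by_proximity(tokens, max_dist=8.0))   # [[0, 1, 2], [3, 4]]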
Marr’s approach combines many of the Gestalt principles discussed above. From
the proposed grouping processes, we derive primal sketch structures at different
scales to physically locate the significant contours in the images. It is important to
derive different types of visible surface properties at different scales, as Marr claims.
The contours due to the change in the reflectance of the visible surface (for example,
due to the presence of two overlapping objects), or due to the discontinuity of the
orientation or depth of the surface can be detected in two ways.
Fig. 4.24 Full primal sketch maps derived from raw primal sketch maps by organizing elemen-
tary oriented structures (place tokens) in vertical edges, according to constraints of directional
discontinuity of the oriented elementary structures
In the first way, the contours can be identified with place tokens structures. The
circular contour perceived in Fig. 4.23b can be derived through the curvilinear aggre-
gation of place tokens structures assigned to the termination of each radial segment.
In the second way, the contours can be detected by the discontinuity in the param-
eters that describe the spatial organization of the structures present in the image.
In this context, changes in the local spatial density of the place tokens structures,
their spacing or their dominant orientation could be used together to determine the
contours. The contour that separates the two regions in Fig. 4.24b is not detected by analyzing the individual place-token structures, but is determined by the discontinuity in
the dominant orientation of the elementary structures present in the image. This
example demonstrates how Marr’s theory is applicable to the problem of texture and
the separation of regions in analogy to what Julesz realized (see Sect. 3.2).
In fact, the procedures of early vision have no knowledge of the possible shape of the teddy bear's head and do not find the contours of the eyes, since they have no prior knowledge with which to find them.
We can say that Marr’s theory, based on the processes of early vision, contrasts
strongly with the theories of perceptive vision that are based on the expectation
and hypothesis of objects that drive each stage of perceptual analysis. These
latter theories are based on knowledge of the objects of the world, for which a representation model is stored in the computer.
In fact, Fig. 4.23a shows the edges of the object but also those caused by shadows, which do not belong to the object itself. In this case, the segmentation procedure may fail to separate the object’s contour from the background if it does not have a priori information about the lighting conditions. In general, ambiguities cannot be solved by analyzing a single image, but by considering additional information such as that
deriving from stereo vision (observation of the world from two different points of
view) or from the analysis of motion (observing time-varying image sequences) in
order to extract depth and movement information present in the scene, or by knowing
the lighting conditions.
In Marr’s theory, the goal of the first stages of vision is to produce a description of the
visible surface of the objects observed together with the information indicating the
structure of the objects with respect to the observer’s reference system. In other words,
all the information extracted from the visible surface is referred to the observer.
The primal sketch data are analyzed to perform the first level of reconstruction
of the visible surface of the observed scene. These data (extracted as a bottom-up
approach) together with the information provided by some modules of early vision,
such as depth information (between scene and observer) and orientation of the visible
surface (with respect to the observer) form the basis for the first 3D reconstruction of
the scene. The result of this first reconstruction of the scene is called the 2.5D sketch, in the sense that the result obtained, generally orientation maps (also called needle maps) and depth maps, is something more than 2D information but cannot yet be considered a 3D reconstruction.
The contour, texture, depth, orientation, and motion information (called by Marr the full primal sketch), extracted from the processes of early vision such as stereo vision, motion analysis, and texture and color analysis, all together contribute to the production of 2.5D sketch maps, seen as intermediate information, temporarily stored, which provide a partial solution waiting to be processed by the perceptual process for the reconstruction of the observed visible surface.
Figure 4.25 shows an example of a 2.5D sketch map representing the visible surface of a cylindrical object in terms of the orientation information (with respect to the observer) of elementary portions (patches) of the visible surface of the object. To make this representation effective, the orientation information is represented with oriented needles, whose inclination indicates how the patches are oriented with respect to the observer.
Fig. 4.25 2.5D sketch map derived from the full primal sketch map and from the orientation map.
The latter adds the orientation information (with respect to the observer) for each point of the
visible surface. The orientation of each element (patch) of visible surface is represented by a vector
(seen as a little needle) whose length indicates how much it is inclined with respect to the observer
(maximum length means the direction perpendicular to the observer, zero indicates pointing toward
the observer), while the direction coincides with the normal at the patch
These 2.5D sketch maps are called needle maps or orientation maps of the observed visible surface.
The length of each oriented needle describes the degree of inclination of the patch with respect to the observer. A needle with zero length indicates that the patch is perpendicular to the vector that joins the center of the patch with the observer. The increase in the length of the needle implies an increase in the inclination of the patch with respect to the observer; the maximum length is reached when the inclination angle of the patch reaches 90◦. In a similar way, a depth map can be represented, indicating for each patch its distance from the observer.
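As a concrete, simplified illustration of this needle-map encoding, the sketch below derives surface normals from a synthetic depth map under orthographic viewing along the z axis and encodes each needle's length as the sine of the angle between the normal and the viewing direction: zero for a patch facing the observer, maximum for a patch inclined at 90 degrees. The synthetic hemisphere and the orthographic-viewing assumption are illustrative choices, not taken from the text.

import numpy as np


def needle_map(depth):
    """Needle length and image-plane direction for each patch of a depth map Z(x, y).

    Length is sin(inclination): 0 when the patch faces the observer,
    1 when the patch is inclined 90 degrees with respect to the observer.
    """
    q, p = np.gradient(depth)            # dZ/dy, dZ/dx
    slope = np.hypot(p, q)
    length = slope / np.sqrt(1.0 + slope**2)
    direction = np.arctan2(-q, -p)       # image-plane direction of the surface normal
    return length, direction


if __name__ == "__main__":
    # Synthetic depth map: a hemisphere bulging toward the observer.
    y, x = np.mgrid[-1:1:64j, -1:1:64j]
    r2 = np.clip(1.0 - x**2 - y**2, 0.0, None)
    depth = np.sqrt(r2)
    length, direction = needle_map(depth)
    print("needle length at the center:", round(float(length[32, 32]), 3))  # ~0 (faces viewer)
    print("needle length near the rim:", round(float(length[32, 60]), 3))   # close to 1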
It can be verified that the processes of early vision do not produce information
(of orientation, depth, etc.) in some points of the visual field and this involves the
production of a 2.5D sketch map with some areas without information. In this case,
interpolation processes can be applied to produce the missing information on the map.
This also happens with the stereo vision process in the occlusion areas where no depth
information is produced, while in the shape from shading process2 (for the extraction of orientation maps) this occurs in areas of strong brightness (or depth) discontinuity, in which the orientation information is not produced.
2 Shape-from-shading, as we will see later, is a method to reconstruct the surface of a 3D object from a 2D image based on the shading information, i.e., it uses shading and the light direction as cues to recover the 3D surface. This method is used to obtain a 3D surface orientation map.
The final goal of vision is the recognition of the observed objects. Marr’s vision processes, through the 2.5D sketch maps, produce a representation, with respect to the observer, of the visible surface of 3D objects. This implies that the representation of the observed object varies in relation to the point of view, thus making the recognition process more complex. To discriminate between the observed objects, the recognition process must operate by representing the objects with respect to their center of mass (or with respect to an absolute reference system) and not with respect to the observer. This third level of representation is called object centered. Recall that the shape
that characterizes the object observed expresses the geometric information of the
physical surface of an object. The representation of the shape is a formal scheme
to describe a form or some aspects of the form together with rules that define how
the schema is applied to any particular form. In relation to the type of representation
chosen (i.e., the chosen scheme), a description can define a shape with a rough or
detailed approximation.
There are different models of 3D representation of objects. The representation of
the objects viewer centered or object centered, becomes fundamental to characterize
the same process of recognition of the objects. If the vision system uses a viewer
centered representation to fully describe the object it will be necessary to have
a representation of it through different views of the object itself. This obviously
requires more memory to maintain an adequate representation of all the observed
views of the objects. Minsky [11] proposed to optimize the multi-view representation
of objects, choosing significant primitives (for example, 2.5D sketch information)
representing the visible surface of the observed object from appropriate different
points of view.
An alternative to the multi-view representation is given by the object centered
representation that certainly requires less memory to maintain a single description of
the spatial structures of the object (i.e., it uses a single model of 3D representation of
the object). The recognition process will have to recognize the object, whatever its
spatial arrangement. We can conclude by saying that an object centered description presents more difficulties for the reconstruction of the object, since a single coordinate system is used for each object and this coordinate system must be identified from the image before the description of the shape can be built. In other words, the form is described not with respect to the observer but relative to the object centered coordinate system that is based on the form itself.
For the recognition process, the viewer centered description is easier to produce,
but is more complex to use than the object centered one. This depends on the fact
that the viewer centered description depends very much on the observation points
of the objects. Once the coordinate system (viewer centered or object centered) has been defined, that is, the way of organizing the representation that imposes a description of the objects, the fundamental role of the vision system is to extract the primitives in a stable and unique way (invariance of the information associated with the primitives), since they constitute the basic information used for the representation of the form. We have previously mentioned the 2.5D sketch information, extracted from the processes of early vision,
which essentially is the orientation and depth (calculated with respect to the observer)
of the visible surface observed in the field of view and calculated for each point of
the image. In essence, primitives contain 3D information of the visible surface or 3D
volumetric information. The latter include the spatial information on the form.
The complexity of primitives, considered as the first representation of objects,
is closely linked to the type of information that can be derived from the vision
system. While it is possible to define even complex primitives by choosing a sophisticated model of representation of the world, we are nevertheless limited by the ability of the vision processes, which may not be able to extract consistent primitives. An adequate choice must therefore be made between the model of representation of the objects and the type of primitives, which must guarantee stability and uniqueness [12].
The choice of the most appropriate representation will depend on the application
context. A limited representation of the world, made in blocks (cubic primitives), can
be adequate to represent for example, the objects present in an industrial warehouse
that produces packed products (see Fig. 4.26). A plausible representation for people and animals is to consider, as solid primitives, cylindrical or conical figures [12] properly organized (see Fig. 4.27), known as generalized cones.3 With this 3D
model, according to the approach of Marr and Nishihara [12], the object recogni-
tion process involves the comparison between the 3D reconstruction obtained from
the vision system and the 3D representation of the object model using generalized
cones, previously stored in memory. In the following paragraphs, we will analyze
the problems related to the recognition of objects.
In conclusion, the choice of the 3D model of the objects to be recognized, that is, of the primitives, must be adequate with respect to the application context, which characterizes the model of representation of the objects, the modules of the vision system (which must reconstruct the primitives), and the recognition process (which must compare the reconstructed primitives with those of the model).
3A generalized cone can be defined as a solid created by the motion of an arbitrarily shaped cross-
section (perpendicular to the direction of motion) of constant shape but variable in size along the
symmetry axis of the generalized cone. More precisely, a generalized cone is defined by sweeping a 2D cross-section curve, held at a fixed angle called the eccentricity of the generalized cone, along a space curve, called the spine of the generalized cone, and expanding the cross-section according to a sweeping rule function. Although spine, cross-section, and sweeping rule can be arbitrary analytic
functions, in reality only simple functions are chosen, in particular, a spine is a straight or circular
line, the sweeping rule is constant or linear, and the cross-section is generally rectangular or circular.
The eccentricity is always chosen with a right angle and consequently the spine is normal to the
cross-section.
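Following the definition given in the footnote, here is a minimal parametric sketch (an illustration added here) of a simple generalized cone: straight spine along the z axis, circular cross-section, linear sweeping rule, and right-angle eccentricity. All parameter values are illustrative.

import numpy as np


def generalized_cone(length=10.0, r0=2.0, r1=0.5, n_spine=50, n_section=36):
    """Surface points of a generalized cone with a straight spine along z,
    a circular cross-section, and a linear sweeping rule r(z) from r0 to r1."""
    z = np.linspace(0.0, length, n_spine)            # spine (straight line)
    radius = r0 + (r1 - r0) * z / length             # linear sweeping rule
    theta = np.linspace(0.0, 2.0 * np.pi, n_section, endpoint=False)
    # Cross-sections lie perpendicular to the spine (right-angle eccentricity).
    zz, tt = np.meshgrid(z, theta, indexing="ij")
    rr = np.repeat(radius[:, None], n_section, axis=1)
    x, y = rr * np.cos(tt), rr * np.sin(tt)
    return np.stack([x, y, zz], axis=-1)             # shape (n_spine, n_section, 3)


if __name__ == "__main__":
    pts = generalized_cone()
    print("surface points:", pts.shape)              # (50, 36, 3)

With r0 = r1 the same code produces a cylinder, which is the kind of primitive used in Fig. 4.27 to model limbs and bodies.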
With this approach, invariance was achieved thanks to the coordinate system centered on the object whose 3D components were modeled. Therefore, regardless of the point of view and of most viewing conditions, the same 3D structural description, invariant with respect to the point of view, would be retrieved by identifying the appropriate characteristics of the image (for example, the skeleton or the principal axes of the object), retrieving a canonical set of 3D components, and comparing the resulting 3D representation with similar representations obtained from the vision system. Following Marr’s approach, in the next chapter we will describe the various vision methods that can extract all possible information in terms of 2.5D sketch from one or more 2D images, in order to represent the shape of the visible surface of the objects of the observed scene. Subsequently, we will describe the various models of representation of the objects, analyzing the most adequate 2.5D sketch primitives, also in relation to the type of recognition process.
A very similar model to the theory of Marr and Nishihara is Biederman’s model
[13,14] known as Recognition By Components-RBC. The limited set of volumet-
ric primitives used in this model is called geons. RBC assumes that the geons are
retrieved from the images based on some key features, called non-accidental prop-
erties since they do not change when looking at the object from different angles, i.e.,
shape configurations that are unlikely to arise by pure chance. Examples of
non-accidental properties are parallel lines, symmetry, or Y-junctions (three edges
that meet at a single point).
These qualitative descriptions may be sufficient to distinguish different classes of objects, but not to discriminate within a class of objects having the same basic components. Moreover, this model is inadequate for differentiating forms whose perceptual similarity can lead them to be recognized as similar objects even though they are composed of different geons.
4.5 Toward 3D Reconstruction of Objects
This section describes how a vision system can describe the observed surface of
objects by processing the intensity values of the acquired image. The vision system can include different procedures according to Marr’s theory, which provides for different processes of early vision whose output gives a first description of the surface as seen by the observer.
In the previous paragraphs, we have seen that this first description extracted from
the image is expressed in terms of elementary structures (primal sketch) or in terms
of higher-level structures (full primal sketch). Such structures essentially describe the
2D image rather than the surface of the observed world. To describe instead the 3D surface of the world, it will be necessary to develop further vision modules that, given as input the image and/or the structures extracted from it, make it possible to extract the information associated with the physical structures of the 3D surface.
This information can be the distance of different points of the surface from the
observer, the shape of the solid surface at different distances and different inclination
with respect to the observer, the motion of the objects between them and with respect
to the observer, the texture, etc. The human vision system is considered the vision machine par excellence from which to draw inspiration in studying how the nervous system, by processing the images of each retina (retinal image), correctly describes the 3D surface of the world. The goal is to study the various modules of the vision system. In other words, the modules of the vision system will be studied not on the basis of ad hoc heuristics but according to Marr’s theory, introduced in the previous paragraphs, which provides for each module a methodological approach, known as the Marr paradigm, divided into three levels of analysis: computational, algorithmic, and implementation.
The first level, the computational model, deals with the physical and mathematical
modeling with which we intend to derive information from the physical structures
of the observable surface, considering the aspects of the image acquisition system
and the aspects that link the physical properties of the structures of the world
(geometry and reflectance of the surface) with the structures extracted from the
images (zero crossing, homogeneous areas, etc.).
The second level, the algorithmic level, addresses how a procedure is realized that
implements the computational model analyzed in the first level and identifies the
input and output information. The procedure chosen must be robust, i.e., it must
guarantee acceptable results even in conditions of noisy input data.
The third level, the implementation level, deals with the physical modality by which the algorithms and the information representation are actually implemented, using hardware with an adequate architecture (for example, a neural network) and software (programs). Usually, a vision system operates in real-time conditions. The human vision system implements vision algorithms using a neural architecture.
Basically, each of the above methods produces, according to Marr’s theory, a 2.5D sketch map: for example, the depth map (with the Shape from Stereo and structured light methodologies), the surface orientation map (with the Shape from Shading methodology), and the orientation of the light source (with the Shape from Shading and Photometric Stereo methodologies).
The human visual system perceives the 3D world by integrating (it is not yet known how this happens) the 2.5D sketch information, deriving the primal sketch structures from the pair of 2D images formed on the retinas. The various shading and shape cues are normally used, for example, with great skill by an artist when representing in a painting a two-dimensional projection of the 3D world.
Fig. 4.28 Painting by Canaletto, known for his effectiveness in creating wide-angle perspective views of Venice with particular care for the lighting conditions
4.6 Stereo Vision
A system based on stereo vision extracts distance information and the surface geometry of the 3D world by analyzing two or more images corresponding to two or more different observation points. The human visual system uses binocular vision for the perception of the 3D world. The brain processes the pair of images, 2D projections on the retinas of the 3D world, observed from two slightly different points of view lying on the same horizontal line.
The basic principle of binocular vision is as follows: objects positioned at different distances from the observer are projected onto the retinal images at different locations. A direct demonstration of the principle of stereo vision can be obtained by looking at your own thumb at different distances, closing first one eye and then the other. It can be observed that in the two retinal images the finger appears in two different, horizontally shifted locations. This relative difference in retinal location is called disparity.
From Fig. 4.29 it can be understood how, in human binocular vision, the retinal images are slightly different in relation to the interocular distance of the eyes and to the distance and angle of the observed scene. It is observed that the image generated by the fixation point P is projected onto the foveal area thanks to the convergence movements of the eyes, that is, onto corresponding points of the two retinas. In principle, the distance of any object being fixated could be determined from the observation directions of the two eyes, through triangulation. Obviously, with this approach it would be fundamental to know precisely the orientation of each eye, which humans can do only to a limited extent, whereas binocular vision machines can be built with good precision of the fixation angles, as we will see in the following paragraphs.
Given the binocular configuration, points closer to or farther from the point of fixation project their images at a certain distance from each fovea. The distance between the image of the fixation point and the image of each of the other points constitutes the retinal disparity of those points.
Figure 4.29 shows the human binocular optical system of which the interocular dis-
tance D is known with the ocular axes assumed coplanar. The ocular convergence
allows the focusing and fixing of a point P of the space. Knowing the angle θ between
Fig. 4.30 The horopter curve is the set of points of the visual field that, projected onto the two retinas, stimulate corresponding areas and are perceived as single elements. The fixation point P stimulates the main corresponding zones, while other points, such as Q, that lie on the horopter simultaneously stimulate corresponding zones and are seen as single elements. The point R, outside the horopter, is perceived as a double point (diplopia) since it stimulates unmatched retinal zones. The shape of the horopter curve depends on the distance of the fixation point
the ocular axes, with a simple triangulation it would be possible to calculate the distance d = D/(2 tan(θ/2)) between the observer and the fixation point. Looking at the scene from two slightly different viewpoints, the points of the scene will appear slightly shifted in the two retinal images.
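As a quick numerical check of the triangulation relation d = D/(2 tan(θ/2)), the snippet below (added here for illustration) computes the fixation distance for a typical interocular distance; the numerical values are only illustrative.

import math

def fixation_distance(D, theta):
    """Distance of the fixation point given the interocular distance D
    and the vergence angle theta (radians) between the two ocular axes."""
    return D / (2.0 * math.tan(theta / 2.0))

# An interocular distance of about 6.5 cm and a vergence angle of 2 degrees
# give a fixation distance of roughly 1.86 m.
print(round(fixation_distance(0.065, math.radians(2.0)), 2))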
The problem of the fusion process is to find homologous points of the scene in the two retinas in order to produce a single image, as if the scene were observed by a cyclopean eye placed midway between the two. In human vision, this occurs at the retinal level if the visual field is spatially correlated with each fovea, where the disparity assumes a value of zero. In other words, the visual fields of the two eyes have a reciprocal bond between them, such that a retinal zone of one eye placed at a certain distance from the fovea finds in the other eye a corresponding homologous zone, positioned on the same side and at the same distance from its own fovea.
Spatially scattered elementary structures of the scene projected onto the corresponding retinal areas stimulate them, propagating signals to the brain that allow the fusion of the retinal images and the perception of a single image. Therefore, when the eyes converge while fixating an object, this is seen as single, since the two main corresponding (homologous) areas of the foveas are stimulated. Simultaneously, other elementary structures of the object, included in the visual field although not fixated, can be perceived as single because they fall on and stimulate other corresponding (secondary) retinal zones.
Points of space (seen as single, since they stimulate corresponding areas on the two retinas) that are at the same distance from the point of fixation form a circumference that passes through the point of fixation and through the nodal points of the two eyes (see Fig. 4.30). The set of points that are at the same distance from the point of fixation, and that therefore induce corresponding positions on the retinas (zero disparity), forms the horopter. It can be shown that these points subtend the same vergence angle and lie on the Vieth-Müller circle, which contains the fixation point and the nodal points of the eyes.
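The following sketch (an illustration added here) makes the geometric statement concrete: for a point in the horizontal plane, the binocular subtense (vergence) angle is the angle subtended at that point by the two nodal points, and the angular disparity of a point can be taken as the difference between its subtense angle and that of the fixation point, so that points on the Vieth-Müller circle through the fixation point and the nodal points have zero disparity. The nodal-point coordinates and the sign convention are simplifying assumptions.

import math


def subtense_angle(point, D=0.065):
    """Angle (radians) subtended at `point` by the two nodal points,
    assumed at (-D/2, 0) and (+D/2, 0) in the horizontal plane."""
    x, y = point
    a = math.atan2(y, x + D / 2.0)   # direction toward the left nodal point
    b = math.atan2(y, x - D / 2.0)   # direction toward the right nodal point
    return abs(a - b)


def angular_disparity(point, fixation, D=0.065):
    """Difference of subtense angles: zero for points on the Vieth-Mueller
    circle through the fixation point and the nodal points."""
    return subtense_angle(point, D) - subtense_angle(fixation, D)


if __name__ == "__main__":
    F = (0.0, 1.0)                        # fixation point, 1 m straight ahead
    nearer = (0.0, 0.8)                   # in front of the fixation point
    farther = (0.0, 1.3)                  # beyond the fixation point
    print(angular_disparity(nearer, F))   # positive (crossed disparity)
    print(angular_disparity(farther, F))  # negative (uncrossed disparity)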
In reality, the shape of the horopter changes with the distance of the fixation point, the contrast, and the lighting conditions. When the fixation distance increases, the shape of the horopter tends to become a straight line and then again a curve with its convexity facing the observer.
All the points of the scene included in the visual field but located outside the horopter stimulate non-corresponding retinal areas and, therefore, will be perceived as blurred or tending to double, thus giving rise to diplopia (see Fig. 4.30). Although all the points of the scene outside the horopter curve would thus be classified as diplopic, in 1858 Panum showed that there is an area near the horopter curve within
which these points, although stimulating retinal zones that are not perfectly corre-
spondent, are still perceived as single. This area in the vicinity of the horopter is
called Panum area (see Fig. 4.31a). It is precisely these minimal differences between
the retinal images, relative to the points of the scene located in the Panum area,
which are used in the stereoscopic fusion process to perceive depth. Points at the
same distance as that of fixation from the observer produce zero disparity. Points
closer to and farther from the fixation point produce disparity measurements through
a neurophysiological process.
Retinal disparity is a condition in which the lines of sight of the two eyes do not intersect at the point of fixation, but in front of or behind it (within the Panum area). When processing a closer object, the visual axes converge, and the visual projection of an object in front of the fixation point leads to Crossed Retinal Disparity (CRD). On the other hand, when processing a more distant object, the visual axes diverge, and the visual projection of an object behind the fixation point leads to Uncrossed Retinal Disparity (URD). In particular, as shown in Fig. 4.31b, the point V, nearer than the fixation point P, induces a crossed disparity (CRD): its image VL is shifted to the left in the left eye and its image VR to the right in the right eye.
Fig. 4.31 a Panum area: the set of points of the field of view, in the vicinity of the horopter curve, which, despite stimulating retinal zones that are not perfectly corresponding, are still perceived as single elements. b Objects within the horopter (closer to the observer than the fixation point P) induce crossed retinal disparity. c Objects outside the horopter (farther from the observer than the fixation point P) induce uncrossed retinal disparity
For the point L, farther than the fixation point P, its projection induces an uncrossed disparity (URD), that is, the image LL is shifted to the right in the left eye and the image LR to the left in the right eye (see Fig. 4.31c). It is observed that the point V, being nearest to the eyes, has the greater disparity.
The physiological evidence of depth perception through the estimated disparity value
is demonstrated with the stereoscope invented in 1832 by the physicist C. Wheatstone
(see Fig. 4.32). This tool allows the viewing of stereoscopic images.
The first model consists of a reflection viewer composed of two mirrors positioned at an angle of 45◦ with respect to the respective figures placed at the ends of the viewer. The two figures represent the image of the same object (in the figure, a truncated rectangular pyramid) drawn from slightly different angles. An observer could approach the mirrors and move the supports of the two drawings until the two images reflected in the mirrors overlapped, perceiving a single 3D object. The principle of operation is based on the fact that, to observe an object closely, the brain usually tends to converge the visual axes of the eyes, while in the stereoscopic view the visual axes must point separately and simultaneously at the left and right images in order to merge them into a single 3D object.
The observer actually looks at the two sketch images IL and IR of the object through
the mirrors SL and SR , so that the left eye sees the sketch IL of the object, simulating
the acquisition made with the left monocular system, and the right eye similarly sees
the sketch IR of the object that simulates the acquisition of the photo made by the right
monocular system. In these conditions, the observer, looking at the images formed
in the two mirrors SL and SR , has a clear perception of depth having the impression
of being in front of a solid figure. An improved and less cumbersome version of the
reflective stereoscope was made by D. Brewster. The reflective stereoscope is still used today for photogrammetric surveying.
Fig. 4.32 The mirror stereoscope by Sir Charles Wheatstone from 1832 that allows the viewing of
stereoscopic images
Fig. 4.33 a Diagram of a simple stereoscope that allows the 3D recomposition of two stereograms with the help of magnifying lenses (optional). With the lenses, the eyes are helped to remain parallel, as if looking at a distant object, and to focus simultaneously, at a distance of about 400 mm, the individual stereograms with the left and right eye, respectively. b Acquisition of the two stereograms
An anaglyph is obtained by printing the two images of a stereogram, superimposed, in complementary colors (red and blue or green) on a single support, which is then observed with glasses having, in place of lenses, two filters of the same complementary colors.
The effect of these filters is to show each eye only one of the two stereo images. In this way, the perception of the 3D object is only an illusion, since the physical object does not exist: what the human visual system really acquires from the anaglyph are the 2D images of the object projected onto the two retinas, and the brain evaluates a disparity measure for each elementary structure present in the images (remember that the image pair makes up the stereogram).
The 3D perception of the object, realized with the stereoscope, is identical to the
same sensation that the brain would have when seeing the 3D physical object directly.
Today anaglyphs can be generated electronically by displaying the pair of images of
a stereogram on a monitor, perceiving the 3D object by observing the monitor with
glasses containing a red filter on one eye, and a green filter on the other. The pairs of
images can also be displayed superimposed on the monitor, with shades of red for the left image and shades of green for the right image. Since the two images are 2D projections of the object observed from slightly different points of view, when they are observed superimposed on the monitor the original 3D object would appear confused; with glasses with red and green filters, however, the brain fuses the two images, perceiving the original 3D object.
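A minimal sketch (added here as an illustration) of the electronic generation of an anaglyph from a stereo pair of grayscale images: the left image is placed in the red channel and the right image in the green channel, so that red/green filter glasses deliver one image per eye. The array shapes and the channel assignment are illustrative assumptions.

import numpy as np


def make_anaglyph(left, right):
    """Combine two grayscale images (values in [0, 1]) into a red/green anaglyph."""
    if left.shape != right.shape:
        raise ValueError("the two images of the stereogram must have the same size")
    rgb = np.zeros(left.shape + (3,))
    rgb[..., 0] = left     # red channel   -> seen through the red filter
    rgb[..., 1] = right    # green channel -> seen through the green filter
    return rgb


if __name__ == "__main__":
    # Tiny synthetic stereo pair: the same bright square, horizontally shifted.
    left = np.zeros((100, 100));  left[40:60, 40:60] = 1.0
    right = np.zeros((100, 100)); right[40:60, 44:64] = 1.0
    anaglyph = make_anaglyph(left, right)
    print(anaglyph.shape)   # (100, 100, 3)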
A better visualization is obtained by alternately displaying on the monitor the two images of the pair at a frequency compatible with the persistence of the images on the retina.
4.6.3 Stereopsis
Julesz [16] in 1971 showed how it can be relatively simple to evaluate disparity to obtain depth perception, and provided neurophysiological evidence that neural cells in the cortex are able to select elementary structures in the pairs of retinal images and measure the disparity present. It is not yet clear in biological terms how the brain fuses the two images, giving the perception of a single 3D object.
He conceived random-dot stereograms (see Fig. 4.34) as a tool to study the working mechanisms of the binocular vision process. Random-dot stereograms are generated by a computer producing two identical images of point-like structures randomly arranged with uniform density, essentially generating a texture of black and white dots. Next, a central window is selected in each stereogram (see Fig. 4.35a and b) and shifted horizontally by the same amount D, to the right in the left stereogram and to the left in the right one. After the shift of the central square (homonymous or uncrossed disparity), the area left without texture is filled with the same background texture (random black and white dots). In this way, the two central windows are immersed and camouflaged in the background by an identical texture.
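A minimal Python sketch of the construction just described (added here as an illustration): a single random texture is copied twice, a central window is shifted horizontally by D pixels in opposite directions in the two copies, and the uncovered strip is refilled with fresh random dots. The window size and the disparity value are illustrative parameters.

import numpy as np


def random_dot_stereogram(size=200, window=80, D=6, seed=0):
    """Return (left, right) random-dot stereograms with a central window
    shifted by +D pixels in the left image and -D pixels in the right one."""
    rng = np.random.default_rng(seed)
    base = rng.integers(0, 2, size=(size, size))      # black/white dot texture
    r0 = c0 = (size - window) // 2
    r1, c1 = r0 + window, c0 + window

    left, right = base.copy(), base.copy()
    patch = base[r0:r1, c0:c1]
    left[r0:r1, c0 + D:c1 + D] = patch                # window moved to the right
    left[r0:r1, c0:c0 + D] = rng.integers(0, 2, size=(window, D))   # refill the gap
    right[r0:r1, c0 - D:c1 - D] = patch               # window moved to the left
    right[r0:r1, c1 - D:c1] = rng.integers(0, 2, size=(window, D))  # refill the gap
    return left, right


if __name__ == "__main__":
    L, R = random_dot_stereogram()
    # Each image alone looks like pure noise; fused binocularly, the central
    # square should appear at a different depth than the background.
    print(L.shape, R.shape, int((L != R).sum()), "pixels differ")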
When each stereogram is seen individually, it appears (see Fig. 4.34) as a single texture of randomly arranged black and white dots. When the pair of stereograms is instead seen binocularly, i.e., one image is seen by one eye and the other by
Fig. 4.35 Construction of the Julesz stereograms of Fig. 4.34. a The central window is shifted
to the right horizontally by D pixels and the void left is covered by the same texture (white and
black random points) of the background. b The same operation is repeated as in (a) but with a
horizontal shift D to the left. c If the two stereograms constructed are observed simultaneously with
a stereoscope, due to the disparity introduced, the central window is perceived as raised with respect to the background
the other eye, the brain performs the stereopsis process, that is, it fuses the two stereograms and perceives depth information, noting that the central square rises with respect to the background texture, toward the observer (see Fig. 4.35c).
In this way, Julesz showed that the perception of the central square emerging at a different distance (as a function of D) from the background is to be attributed only to the disparity determined by the binocular visual system, which evidently first performs the correspondence between points of the background (zero disparity) and between points of the central texture (disparity D). In other words, the perception of the central square raised from the background is
due only to the disparity measure (no other information is used) that the brain
realizes through the process of stereogram fusion.
This demonstration makes it difficult to support any other stereo vision theory that
is based, for example, on the a priori knowledge of what is being observed or on the
fusion of particular structures of monocular images (for example, the contours). If in
the construction of the stereograms the central windows are moved in the directions opposite to those of Fig. 4.35a and b, that is, to the left in one stereogram and to the right in the other (crossed disparity), with stereoscopic vision the central window is perceived as emerging on the opposite side of the background, that is, moving away from the observer (see Fig. 4.36). Within certain limits, the higher the disparity (homonymous or crossed), the more easily the relief is perceived.
Different biological systems use stereo vision, while others (rabbits, fish, etc.) observe the world with panoramic vision, that is, their eyes are placed so as to observe different parts of the world and, unlike in stereo vision, the pairs of images do not contain overlapping areas of the observed scene for depth perception.
Studies by Wheatstone and Julesz have shown that binocular disparity is the key
feature for stereopsis. Let us now look at some neurophysiological mechanisms
related to retinal disparity. In Sect. 3.2 Vol. I it was shown how the visual system
propagates the light stimuli on the retina and how impulses propagate from this to the
brain components. The stereopsis process uses information from the striate cortex
and other levels from the visual binocular system to represent the 3D world.
The stimuli coming from the retina through the optic tract (containing fibers of both eyes) are transmitted up to the lateral geniculate nucleus (LGN), which functions as a thalamic relay station, subdivided into 6 laminae, for the sorting of the
different information (see Fig. 4.37). In fact, the fibers coming from the individual retinas are composed of axons deriving from the large ganglion cells (of type M), from the small ganglion cells (of type P), and from small ganglion cells of type non-M and non-P, called koniocellular or K cells. The receptive fields of the ganglion
cells are circular and of center-ON and center-OFF type (see Sect. 4.4.1). The M cells
are connected with a large number of photoreceptor cells (cones and rods) through
the bipolar cells and for this reason, they are able to provide information on the
movement of an object or on rapid changes in brightness. The P cells are connected
with fewer receptors and are suitable for providing information on the shape and
color of an object.
In particular, some peculiarities that differentiate the M and P cells should be highlighted. The former are not very sensitive to different wavelengths, are selective for low spatial frequencies, and have a high temporal response, a high conduction velocity, and wide dendritic branching. The P cells, on the other hand, are selective for different wavelengths (color) and for high spatial frequencies (useful for capturing details, having small receptive fields), and have low conduction speed and temporal resolution. The K cells are very selective for different wavelengths and do not respond to orientation. Laminae 1 and 2 receive the signals of the M cells, while the remaining 4 laminae receive the signals of the P cells. The interlaminar layers receive the signals from the koniocellular K cells. The receptive fields of the K cells are also circular and of center-ON and center-OFF type. Like the P cells, they are color-sensitive, but with the specificity that their receptive fields are opponent for red-green and blue-yellow.4
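Circular center-ON/center-OFF receptive fields of the kind mentioned above are commonly modeled as a difference of Gaussians (DoG); the following sketch (an illustration added here, with invented kernel sizes and sigmas) builds such a kernel and shows that its response is strong for a small centered spot and nearly zero for uniform illumination.

import numpy as np


def dog_kernel(size=21, sigma_center=1.5, sigma_surround=3.0):
    """Center-ON difference-of-Gaussians receptive field (negate for center-OFF)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r2 = x**2 + y**2
    center = np.exp(-r2 / (2 * sigma_center**2)) / (2 * np.pi * sigma_center**2)
    surround = np.exp(-r2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    return center - surround


if __name__ == "__main__":
    rf = dog_kernel()
    # A small bright spot in the center excites the cell; uniform illumination
    # gives a near-zero response because center and surround nearly cancel.
    spot = np.zeros_like(rf); spot[9:12, 9:12] = 1.0
    uniform = np.ones_like(rf)
    print("spot response:   ", round(float((rf * spot).sum()), 4))
    print("uniform response:", round(float((rf * uniform).sum()), 4))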
As shown in the figure, the information of the two eyes is transmitted separately to the different LGN laminae, in such a way that the nasal hemiretina covers the visual hemifield of the temporal side, while the temporal hemiretina of the opposite eye covers the visual hemifield of the nasal side. Only in the first case does the information of the two eyes cross. In particular, laminae 1, 4, and 6 receive information from the nasal retina of the opposite eye (contralateral), while laminae 2, 3, and 5 receive it from
4 In addition to the trichromatic theory (based on three types of cones sensitive to red, green, and blue, the combination of which determines the perception of color in relation to the incident light spectrum), Hering (1834–1918) proposed the theory of opponent colors. According to this theory, we perceive colors by combining 3 pairs of opponent colors: red-green, blue-yellow, and an achromatic channel (white-black) used for brightness. This theory predicts the existence in the visual system of two classes of cells, one selective for the opponent colors (red-green and yellow-blue) and one for brightness (black-white opponency). In essence, downstream of the cones (sensitive
to red, green, and blue) adequate connections with bipolar cells would allow to have ganglion cells
with the typical properties of chromatic opponency, having a center-periphery organization. For
example, if a red light affects 3 cones R, G, and B connected to two bipolar cells β1 and β2 with
the following cone-cell connection configuration β1 (+R, −G) and β2 (−R, −G, +B), we would
have an excitation of the bipolar cell β1 stimulated by the cone R, sending the signal +R − G on
its ganglion cell. On the other hand, a green light inhibits both bipolar cells. A green or red light
inhibits the bipolar cell β2 while a blue light signals to its cell ganglion the signal +B − (G + R).
Hubel and Wiesel demonstrated the presence of cells in the retina and in the lateral geniculate nucleus (the P and K cells) that respond with chromatic opponency, organized with the properties of center-ON and center-OFF receptive fields.
the temporal retina of the eye of the same side (ipsilateral). In this way, each lamina
contains a representation of the contralateral visual hemifield (of the opposite side).
With this organization of information, in the LGN the spatial arrangement of the
receptive fields associated with ganglion cells is maintained and in each lamina the
complete map of the field of view of each hemiretina is stored (see Fig. 4.37).
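A toy numerical version of the cone-to-bipolar wiring described in footnote 4 (added here as an illustration), with the hypothetical configuration beta1 = +R − G and beta2 = +B − (R + G): it simply computes the two opponent signals for a few lights. The normalized cone responses are invented for illustration, not measured values.

def opponent_signals(R, G, B):
    """Opponent responses of two hypothetical bipolar cells:
    beta1 = +R - G (red-green), beta2 = +B - (R + G) (blue-yellow)."""
    beta1 = R - G
    beta2 = B - (R + G)
    return beta1, beta2


if __name__ == "__main__":
    # Normalized cone responses (illustrative values).
    lights = {"red light": (1.0, 0.1, 0.0),
              "green light": (0.1, 1.0, 0.0),
              "blue light": (0.1, 0.1, 1.0)}
    for name, (R, G, B) in lights.items():
        b1, b2 = opponent_signals(R, G, B)
        print(f"{name:12s} beta1={b1:+.2f}  beta2={b2:+.2f}")

Consistent with the footnote, a red light excites beta1 and inhibits beta2, a green light inhibits both, and a blue light excites beta2.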
Fig. 4.37 Propagation of visual information from the retinas to the Lateral Geniculate Nucleus (LGN) through the optic nerves, the chiasm, and the optic tracts. The information from the right field of view passes to the left LGN and vice versa. The left field of view information seen by the right eye does not cross and is processed by the right LGN; the opposite occurs for the right field of view information seen by the left eye, which is processed by the left LGN. The spatial arrangement of the field of view is reversed and mirrored on the retina, but the information propagates toward the LGN while maintaining the topographic arrangement of the retina. The relative disposition of the hemiretinas is mapped on each lamina (in the example, the points A, B, C, D, E, F)
Fig. 4.38 Cross-section of the primary visual cortex (V1) of the monkey. The six layers and relative
substrates with different cell density and connections with other components of the brain are shown
The layers V and VI, with a high density of cells, including pyramidal cells, transmit their output back to the superior colliculus5 and to the LGN, respectively. As shown in Fig. 4.37, the area of the left hemisphere of V1 receives only the visual information related to the right field of view and vice versa. Furthermore, the information that reaches the
cortex from the retina is organized in such a way as to maintain the hemiretinas of
origin, the cell type (P or M ) and the spatial position of the ganglion cells inside the
retina (see Fig. 4.42). In fact, the axons of the cells M and P transmit the information
of the retinas, respectively, in the substrates IV Cα and IV Cβ. In addition, the cells
close together in these layers receive information from the local areas of the retina,
thus maintaining the topographical structure of origin.
5 Organ that controls saccadic movements, coordinates visual and auditory information, directing
the movements of the head and eyes in the direction where the stimuli are generated. It receives
direct information from the retina and from different areas of the cortex.
Fig. 4.39 Receptive fields of cortical cells associated with the visual system and their response
to the different orientations of the light beam. a Receptive field of the simple cell of ellipsoidal
shape with respect to the circular one of ganglion cells and LGN. The diagram shows the maximum
stimulus only when the light bar is totally aligned with the ON area of the receptive field while it
remains inhibited when the light falls on the OFF zone. b Responses of the complex cell, with a
rectangular receptive field, when the inclination of a moving light bar changes. The arrows indicate
the direction of motion of the stimulus. From the diagram, we note the maximum stimulation when
the light is aligned with the axis of the receptive field and moves to the right, while the stimulus
is almost zero with motion in the opposite direction. c Hypercomplex cell responses when stimulated by a light bar that increases in length, exceeding the size of the receptive field. The behavior of these cells (called end-stopped cells) is such that the stimulus increases, reaching its maximum when the light bar completely covers the receptive field, but their activity decreases if they are stimulated with a larger light bar
1. Simple cells present rather narrow and elongated excitatory and inhibitory areas, with a specific orientation axis. These cells function as detectors of linear structures; in fact, they are strongly stimulated when a rectangular light beam is located in an area of the field of view and oriented in a particular direction (see Fig. 4.39a). The receptive fields of simple cells seem to be realized by the convergence of the receptive fields of different adjacent cells of the substrate IV C. The latter, known as stellate cells, are small neurons with circular receptive fields that receive signals from the cells of the geniculate body (see Fig. 4.40a) and, like retinal ganglion cells, are of center-ON and center-OFF type.
2. Complex cells have extended receptive fields but no clear zones of excitation or inhibition. They respond well to the motion of an edge with a specific orientation and direction of motion (good motion detectors, see Fig. 4.39b). Their receptive fields seem to be realized by the convergence of the receptive fields of several
Fig. 4.40 Receptive fields of simple and complex cortical cells, generated by multiple cells with
circular receptive fields. a Simple cell generated by the convergence of 4 stellate cells receiving the
signal from adjacent LGN neurons with circular receptive fields. The simple cell with an elliptic
receptive field responds better to the stimuli of a localized light bar oriented in the visual field. b A
complex cell generated by the convergence of several simple cells that responds better to the stimuli
of a localized and oriented bar (also in motion) in the visual field
simple cells (see Fig. 4.40b). The peculiarity of motion detection is due to two phenomena.
The first occurs when the axons of different adjacent simple cells with the same orientation, but not with identical receptive fields, converge on a complex cell, which determines the motion from the difference between these receptive fields.
The second occurs when the complex cell can determine motion through the different latency times in the responses of adjacent simple cells. Complex cells are very selective for a given direction, responding only when the stimulus moves in one direction and not in the other (see Fig. 4.39b). Compared to simple cells, complex cells are not conditioned by the position of the light beams (in stationary conditions) within the receptive field. The amount of the stimulus also depends on the length of the rectangular light beam that falls within the receptive field. They are located in layers II and III of the cortex and in the boundary areas between layers V and VI.
3. Hypercomplex cells represent a further extension of the process of visual information processing and an advancement in the knowledge of the biological visual system.
Hypercomplex cells (known as end-stopped cells) respond only if a light stimulus
has a given ratio between the illuminated surface and the dark surface, or comes
from a certain direction, or includes moving forms. Some of these hypercomplex
cells respond well only to rectangular beams of light of a certain length (com-
pletely covering the receptive field), so that if the stimulus extends beyond this
length, the response of the cells is significantly reduced (see Fig. 4.39c). Hubel
and Wiesel characterize these receptive fields as containing activating and antago-
nistic regions (similar to excitatory/inhibitory regions). For example, the left half
of a receptive field can be the activating region, while the antagonistic region is
on the right. As a result, the hypercomplex cell will respond, with the spatial sum-
mation, to stimuli on the left side (within the activation region) to the extent that
it does not extend further into the right side (antagonistic region). This receptive
field would be described as stopped at one end (i.e., the right). Similarly, hyper-
complex receptive fields can be stopped at both ends. In this case, a stimulus that
extends too far in both directions (for example, too far to the left or too far to the right) will
begin to stimulate the antagonistic region and reduce the signal strength of the
cell. The hypercomplex cells occur when the axons of some complex cells, with
adjacent receptive fields and different in orientation, converge in a single neuron.
These cells are located in the secondary visual area (also known as V5 and MT).
Following the experiments of Hubel and Wiesel, it was discovered that even some
simple and complex cells exhibit the same property as the hypercomplex, that is,
they have end-stopping properties when the luminous stimulus exceeds a certain
length overcoming the margins of the same receptive field.
From the properties of the neural cells of the primary visual cortex a computational
model emerges with principles of self-learning that explain the sensing and motion
capacities of structures (for example, lines, points, bars) present in the visual field.
Furthermore, we observe a hierarchical model of visual processing that starts from
the lowest level, the level of the retinas that contains the scene information (in the
field of view); the LGN level that captures the position of the objects; the level of
simple cells that see the orientation of elementary structures (lines); the level of
complex cells that see (detect) their movement; and the level of hypercomplex cells
that perceive the object's edges and their orientation. The functionality of simple
cells can be modeled using Gabor filters to describe their orientation sensitivity
to a linear light beam.
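As an illustration of this last point, the sketch below builds a 2D Gabor kernel (a Gaussian envelope modulating an oriented sinusoid), the standard mathematical model often used for simple-cell receptive fields. The function name, parameter values, and the toy bar stimulus are illustrative choices, not taken from the text.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, phase=0.0):
    """2D Gabor kernel: a sinusoidal carrier at orientation theta modulated
    by a Gaussian envelope (a common mathematical model of a simple-cell
    receptive field)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + y_r**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_r / wavelength + phase)
    return envelope * carrier

if __name__ == "__main__":
    # Toy stimulus: a vertical light bar on a dark background.
    bar = np.zeros((31, 31))
    bar[:, 14:17] = 1.0
    # The filter response changes with the relative orientation of the bar
    # and of the kernel, mimicking orientation selectivity.
    for theta_deg in (0, 45, 90):
        k = gabor_kernel(31, wavelength=10.0, theta=np.radians(theta_deg), sigma=6.0)
        print(theta_deg, "deg ->", round(float(np.sum(k * bar)), 3))
```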
Figure 4.41 summarizes the connection scheme between the retinal photoreceptors
and the neural cells of the visual cortex. In particular, it is observed how groups of
cones and rods are connected with a single bipolar cell, which in turn is connected
with one of the ganglion cells from which the fibers afferent to the optic nerve
originate whose exit point (papilla) is devoid of photoreceptors. This architecture
suggests that stimuli from retinal areas of a certain extension (associated, for example,
with an elementary structure of the image) are conveyed into a single afferent fiber of
the optic nerve. In this architecture, a hierarchical organization of the cells emerges,
starting from the bipolar cells that feed the ganglion cells up to the hypercomplex
cells.
The fibers of the optic nerve coming from the medial half of the retina (nasal field)
cross, at the level of the optic chiasm, those coming from the temporal field and
continue laterally. From this it follows (see also Fig. 4.37) that after crossing in the
chiasm the right optic tract contains the signals coming from the left half of the
visual field and the left one the signals of the right half. The fibers of the optic tracts
reach the lateral geniculate bodies that form part of the thalamus nuclei: here there
is the synaptic junction with neurons that send their fibers to the cerebral cortex of
the occipital lobes where the primary visual cortex is located. The latter occupies the
terminal portion of the occipital lobes and extends over the medial surface of them
along the calcarine fissure (or calcarine sulcus).
Fig. 4.41 The visual pathway with the course of information flow from photoreceptors and retinal-
visual cortex cells of the brain. In particular, the ways of separation of visual information coming
from nasal and temporal hemiretinas for each eye are highlighted
Fig. 4.42 Columnar organization, perpendicular to the layers, of the cells in the cortex V 1. The
ocular dominance columns of the two eyes are indicated with I (the ipsilateral) and with C (the
contralateral). The orientation columns are indicated with oriented bars. Blob cells are located
between the columns of the layers II , III and V , VI
Each column crosses the 6 layers of the cortex and represents an orientation in the
visual field with an angular resolution of about 10◦ . The cells crossed by each column
respond to the stimuli deriving from the same orientation (orientation column) or
to the input of the same eye (ocular dominance column, or ocular dominance slab). An
adjacent column includes cells that respond to a small difference in orientation from
the nearby one and possibly to the input of the same eye or of the other. The neurons in the
IV layer are an exception, as they may respond to any orientation but only to one eye.
From Fig. 4.42 it is observed how the signals of the M and P cells, relative to the
two eyes, coming from the LGN, are kept separated in the IV layer of the cortex and in
particular projected, respectively, in the substrate IVCα and IVCβ where monocular
cells are found with center-ON and center-OFF circular receptive fields. Therefore,
the signals coming from LGN are associated with one of the two eyes, never to both,
while each cell of the cortex can be associated with input from one eye or that of
the other. It follows that we have ocular dominance columns arranged alternately
associated with the two eyes (ipsi or contralateral), which extend horizontally in the
cortex, consisting of simple and complex cells.
The cells of the IVCα substrate propagate the signals to the neurons (simple cells)
of the substrate IV B. The latter responds to stimuli from both eyes (binocular cells),
unlike the cells of the IV C substrate whose receptive fields are monocular. Therefore,
the neurons of the IV B layer begin the process of integration useful for binocular
vision. These neurons are selective for movement and also for direction, responding
only if stimulated by a beam of light that moves in a given direction.
A further complexity of the functional architecture of V 1 emerges with the dis-
covery (in 1987) by contrast medium, along with the ocular dominance columns of
another type of column, regularly spaced and localized in the layers II − III and
V − VI of the cortex. These columns are made up of arrays of neurons that receive
input from the parvocellular pathways and from the koniocellular pathways. They
are called blobs, appearing (under the contrast medium) as leopard spots when viewed in
tangential sections of the cortex. The characteristic of the neurons included in the
blobs is that of being sensitive to color (i.e., to the different wavelengths of light,
thanks to the information of the channels K and P) and to the brightness (thanks to
the information coming from the channel M ).
Among the blobs, there are regions with neurons that receive signals from the mag-
nocellular pathways. These regions called interblobs contain orientation columns and
ocular dominance columns whose neurons are motion-sensitive and nonselective for
color. Therefore, the blobs are in fact modules in which the signals of the three
channels P, M, and K converge, where it is assumed that these signals (i.e., the
spectral and brightness information), on which the perception of color and
brightness variation depends, are combined.
This organization of the cortex V1 in hypercolumns (also known as cortical modules),
each of which receives input from the two eyes (orientation columns and ocular
dominance columns), is able to analyze a portion of the visual field. Therefore, each
module includes neurons sensitive to color, movement, linear structures (lines or
edges) for a given orientation and for an associated area of the visual field, and
integrates the information of the two eyes for depth perception.
The orientation resolution in the various parallel cortical layers is 10◦ and the
whole module can cover an angle of 180◦ . It is estimated that a cortical module
that includes a region of only 2 × 2 mm of the visual cortex is able to perform a
complete analysis of a visual stimulus. The complexity of the brain is such that the
functionalities of the various modules and their total number have not been clearly
defined.
The visual pathways begin with each retina (see Fig. 4.37), then leave
the eye by means of the optic nerve that passes through the optic chiasm (in which
there is a partial crossing of the nerve fibers coming from the two hemiretinas of each
eye), and then it becomes the optic tract (seen as a continuation of the optic nerve).
The optic tract goes toward the lateral geniculate body of the thalamus. From here
the fibers, which make up the optic radiations, reach the visual cortex in the occipital
lobes.
The primary visual cortex mainly transmits to the adjacent secondary visual cortex
V2, also known as area 18, most of the first processed information of the visual field.
Although most neurons in the V 2 cortex have properties similar to those of neurons
in the primary visual cortex, many others have the characteristic of being much more
complex. From areas V1 and V2, the processed visual information continues toward
the so-called associative areas, which process information at a more global
level. These areas, in a progressive way, combine (associate) the first-level visual
information with information deriving from other sensors (hearing, touch, . . .) thus
creating a multisensory representation of the observed world.
Several studies have highlighted dozens of cortical areas that contribute to
visual perception. Areas V1 and V2 are surrounded by several of these cortical
and associative visual areas, called V3, V4, V5 (or MT), PO, TEO, etc. (see
Fig. 4.43). From the visual area V 1 two cortical pathways of propagation and pro-
cessing of visual information branch out [17]: the ventral path which extends to the
temporal lobe and the dorsal pathway projected to the parietal lobe.
Fig. 4.43 Neuronal pathways involved in visuospatial processing. Distribution of information from
the retina to other areas of the visual cortex that interface with the primary visual cortex V 1. The
dorsal pathway, which includes the parietal cortex and its projections to the frontal cortex, is involved
in the processing of spatial information. The ventral pathway, which includes the inferior and lateral
temporal cortex and their projections to the medial temporal cortex, is involved in the processing
of recognition and semantic information
The main function of the ventral visual pathway (the channel of what is observed, i.e.,
the object recognition pathway) seems to be that of conscious perception, that is, to make
us recognize and identify objects by processing their intrinsic visual properties, such
as shape and color, and storing such information in long-term memory. The
basic function of the dorsal visual pathway (the channel of where an object is, i.e., the spatial
vision pathway) seems to be the one associated with visual-motor control on objects,
processing their extrinsic properties, which are essential for their localization
(and mobility), such as their size, position, and orientation in space, and for saccadic
movements.
Figure 4.43 shows the connectivity between the various main areas of the cortex.
The signals start from the ganglion cells of the retina and, through the LGN and V1,
branch out toward the ventral pathway (from V1 to V4, reaching the
inferior temporal cortex IT) and the dorsal pathway (from V1 to V5, reaching the posterior parietal
cortex), thus realizing a hierarchical connection structure. In particular, the parieto-
medial temporal area integrates information from both pathways and is involved in
the encoding of landmarks in spatial navigation and in the integration of objects into
the structural environment.
The flow of information is summarized in the ventral visual channel for the
perception of objects:
complex properties yet to be known. Some of the latter are sensitive to color and
movement, characteristics most commonly analyzed in other stages of the visual
process.
Area V4 receives the information flow after processing in V1, V2, and V3, and continues
the processing of the color information (received from the blobs and interblobs
of V1) and of form. In this area, there are neurons with properties similar to those of other
areas but with more extensive receptive fields than those of V1. This area still has to be
analyzed in depth; it seems to be essential for the perception of extended and
more complex contours.
Area IT (inferior temporal cortex) receives many connections from area V4 and
includes complex cells that have shown little sensitivity to the color and size of the
perceived object but are very sensitive to its shape. Studies have led to considering
this area relevant for face recognition and endowed with an important visual memory capacity.
The cortical areas of the dorsal pathway that terminate in the parietal lobe elaborate
the spatial and temporal aspects of visual perception. In addition to spatially locating
the visual stimulus, these areas are also linked to aspects of movement including eye
movement. In essence, the dorsal visual pathway integrates the spatial information
between the visual system and the environment for a correct interaction. The dorsal
pathway includes several cortical areas, including in particular the Middle Temporal
(MT) area, also called area V 5, the Medial Superior Temporal (MST ) area, and the
lateral and ventral intraparietal areas (LIP and VIP, respectively).
The MT area is believed to contribute significantly to the perception of move-
ment. This area receives the signals from V 2, V 3, and the substrate IVB of V 1 (see
Fig. 4.43). We know that the latter is part of the magnocellular pathways involved in
the analysis of movement. Neurons of MT have properties similar to those of V 1, but
have the most extensive receptive field (up to covering an angle of tens of degrees).
They have the peculiarity of being activated only if the stimulus, which falls on its
receptive field, moves in a preferred direction.
The area MST is believed to contribute to the analysis of movement as well; it is
sensitive to radial motion (that is, approaching or moving away from a point)
and to circular motion (clockwise or counterclockwise). The neurons of MST are
also selective for movements in complex configurations. The LIP area is considered
to be the interface between the visual system and the oculomotor system. Moreover, the
neurons of the areas LIP and VIP (which receive the signals from V5 and MST) are sensitive
to stimuli generated by a limited area of the field of view and are active for stimuli
resulting from an eye movement (also known as a saccade) toward a
given point in the field of view.
The brain can use this wealth of movement-related information, acquired through the
dorsal pathway, for various purposes. It can acquire information about
objects moving in the field of view, distinguish their motion from that induced by
its own eye movements, and then act accordingly.
The activities of the visual cortex take place through various hierarchical levels
with the serial propagation of the signals and their processing also in parallel through
the different communication channels thus forming a highly complex network of
Fig. 4.44 Overall representation of the connections and the main functions performed by the various
areas of the visual cortex. The retinal signals are propagated segregated through the magnocellular
and parvocellular pathways, and from V1 they continue on the dorsal and ventral pathways. The
ventral pathway specializes in the perception of form and color, while the dorsal pathway is selective
in the perception of movement, position, and depth
circuits. This complexity is attributable in part to the many feedback loops that each
of these cortical areas forms with its connections to receive and return information,
considering all the ramifications that, for the visual system, originate from the receptors
(with the ganglion neurons) and are then transmitted to the visual cortical areas through the
optic nerve, chiasm, optic tract, and lateral geniculate nucleus of the thalamus.
Figure 4.44 summarizes schematically, at the current state of knowledge, the main
connections and the activities performed by the visual areas of the cortex (of the
macaque), as the signals from the retinas propagate (segregated by the two eyes) in
such areas, through the parvocellular and magnocellular channels, and the dorsal and
ventral pathways.
From the analysis of the responses of different neuronal cells, it is possible to
summarize the main functions of the visual system realized through the cooperation
of the various areas of the visual cortex.
Color perception. The selective response to color is given by the ganglion cells
P that, through the parvocellular channel of the LGN, reach the cells of the substrate
IVCβ of V1. From here it propagates in the other layers II and III of V1, in
vertically organized cells that form the blobs. From there the signal propagates
to the V4 area directly and through the thin stripes of V2. V4 includes cells with
larger receptive fields with selective capabilities to discriminate color even with
lighting changes.
Perception of the form. As for the color, the ganglion cells P of the retina, through
the parvocellular channel of LGN transmit the signal to the cells of the substrate
IVCβ of V1, but it later propagates in the interblob cells of the other layers II
and III of V1. From here the signal propagates to the V4 area directly and via
the interstripes (also known as pale stripes) of V2 (see Fig. 4.44). V4 includes
cells with larger receptive fields with also selective capabilities to discriminate
orientation (as well as color).
Perception of movement. The signal from the ganglion cells M of the retina,
through the magnocellular channel of LGN, reaches the cells of the substrate
IVCα of V 1. From here, it propagates in the IVB layer of V 1, which we high-
lighted earlier, including very selective complex cells in orientation also in relation
to movement. From the layer IVB the signal propagates directly in the area V5
(MT) and through the thick strips of V 2.
Depth perception. Signals from the LGN cells that enter the IVC substrates of the
V 1 cortex keep the information in the two eyes segregated. Subsequently, these
signals are propagated in the other layers II and III of V1, and in these appear, for
the first time, cells with afferents coming from cells (but neither M nor P cells)
of the two eyes, that is, we have binocular cells. Hubel and Wiesel have classified
the cells of V 1 in relation to their level of excitation deriving from one and the
other eye. Those deriving from the exclusive stimulation of a single eye are called
ocular dominance cells, and therefore, monocular cells. The binocular cells are
instead those excited by cells of the two eyes whose receptive fields simultane-
ously see the same area of the visual field. The contribution of the cells of a single
eye can be dominant with respect to the other, or both contribute with the same
level of excitation, and in the latter case we have perfectly binocular cells. With
binocular cells it is possible to evaluate depth by estimating binocular disparity
(see Sect. 4.6) at the base of stereopsis. Although the neurophysiological basis
of stereopsis is still not fully known, the functionality of binocular neurons is
assumed to be guaranteed by the monocular cells of both eyes stimulated by the
corresponding receptive fields (also with different viewing angles) as much as
possible compatible in terms of orientation and position compared to the point
of fixation. With reference to Fig. 4.29 the area of the fixation point P gener-
ates identical receptive fields in the two eyes stimulating a binocular cell (zero
disparity) with the same intensity, while the stimulation of the two eyes will be
differentiated (receptive fields slightly shifted with respect to the fovea), deriving
from the farthest zone (the point L) and the nearest one (the point V ) with respect
to the observer. In essence, the action potential of the most distant monocular cells
(corresponding) is higher than those closer to the fixation point and this behavior
becomes a property of binocular disparity.
The current state of knowledge is based on the functional analysis of the neurons
located in the various layers of the visual areas, their interneural connectivity, and
the effects caused by the lesions in one or more components of the visual system.6
6 Biological evidence has been demonstrated which shows that the stimulation of the nerve cells of the
primary visual cortex, through weak electrical impulses, causes the subject to see elementary visual
After having analyzed the complexity of the biological visual system of primates, let
us now see how it is possible to imitate some of its functional capabilities by creating
a binocular vision system for calculating the depth map of a 3D scene, locating an
object in the scene and calculating its attitude. These functions are very useful for
navigating an autonomous vehicle and for various other applications (automation of
robot cells, remote monitoring, etc.).
Although we do not yet have a detailed knowledge of how the human visual system
operates for the perception of the world, as highlighted in the previous paragraphs, it
is hypothesized that different modules cooperate together for the perception of color,
texture, movement, and to estimate depth. Modules of the primary visual cortex have
the task of merging images from the two eyes in order to perceive the depth (through
stereopsis) and the 3D reconstruction of the visible surface observed.
events, such as a colored spot or a flash of light, in the expected areas of the visual field. Given the
spatial correspondence, one by one, between the retina and the primary visual area, the lesion of
areas of the latter leads to blind areas (blind spots) in the visual field even if some visual patterns are
left unchanged. For example, the contours of a perceived object are spatially completed even if they
overlap with the blind area. In humans, two associative or visual-psychic areas are located around
the primary visual cortex, the parastriate area and the peristriate area. The electrical stimulation of
the cells of these associative areas is found to generate the sensation of complex visual hallucinations
corresponding to images of known objects or even sequences of significant actions. The lesion or
surgical removal of areas of these visual-psychic areas does not cause blindness but prevents the
maintenance of old visual experiences; moreover, it generates disturbances in perception in general,
that is, the impossibility of combining individual impressions in complete structures and the inability
to recognize complex objects or their pictorial representation. However, new visual learnings are
possible, at least until the temporal lobe is removed (ablation). Subjects, with lesions in the visual-
psychic areas, can describe single parts of an object and correctly reproduce the contour of the
object but are unable to recognize the object as a whole. Other subjects cannot see more than one
object at a time in the visual field. The connection between the visual-psychic areas of the two
hemispheres is important for comparing the received retinal images of the primary visual cortex to
allow 3D reconstruction of the objects.
Fig. 4.45 Map of the disparity resulting from the fusion of Julesz stereograms. a and b are the
stereograms of left and right with different levels of disparity; c and d, respectively, show the pixels
with different levels of disparity (representing four depth levels) and the red-blue anaglyph image
generated by the random-dot stereo pair from which the corresponding depth levels can be observed
with red-blue glasses
regions on each retina, can be assumed homologous, i.e., corresponding to the same
physical part of the observed 3D object.
Julesz demonstrated, using random-dot stereogram images, that this matching
process applied to random-dot points succeeds in finding a large number of matches
(homologous points in the two images) even with very noisy random-dot images.
Figure 4.45 shows another example of a random-dot stereogram with central
squares of different disparities, from which a depth map at different heights is perceived. On the
other hand, by using more complex images the matching process produces false
targets, i.e., the search for homologous points in the two images fails [18].
In these cases the observer can find himself in the situation represented in Fig. 4.46
where both eyes see three points, but the correspondence between the two retinal
projections is ambiguous. In essence, we have the problem of correspondence, that
is, how do you establish the true correspondence between the three points seen from
the left retina with the right retina that are possible projections of the nine points
present in the field of view?
Nine candidate matches are plausible and the observer could see different depth
planes corresponding to the perceived false targets. Only three matches are correct
(colored squares), while the remaining six are generated by false targets (false coun-
terparts), indicated with black squares.
To solve the problem of ambiguous correspondences (basically an ill-posed problem),
Julesz suggested considering global constraints in the correspondence process. For
example, one can consider as candidates for correspondence structures more complex
than simple points, that is to say, segments of oriented contours or particular
textures; or consider some physical constraints of 3D objects; or impose some
constraints on the search modes of the homologous structures in the two images
(for example, the search for structures only along horizontal lines).
Julesz called this stereo vision process global stereo vision, which is
probably based on a more complex neural process that selects local structures in images
composed of elements with the same disparity. For the perception of more extended
depth intervals, the human visual system uses the movement of the eyes.
The global stereo vision mechanism introduced by Julesz is not inspired by neuro-
physiology but uses the physical phenomenon associated with the magnetic dipole,
consisting of two point-like magnetic masses of equal value and opposite polarity,
placed at a small distance from each other. This model also includes the hysteresis
phenomenon. In fact, once Julesz's stereograms have been fused by the observer, the disparity
can be increased up to twenty times the limit of Panum's fusion area (the range in which
one can fuse two stereo images, normally 6−18 minutes of arc), without losing the sensation
of stereoscopic vision.
The estimate of the distance of an object from the observer, i.e., the perceived depth
is determined in two phases: first, the disparity value is calculated (having solved
the correspondence of the homologous points) and subsequently, this measurement
is used, together with the geometry of the stereo system, to calculate the depth
measurement.
Following the Marr paradigm, these phases will have to include three levels for
estimating disparity: the level of computational theory, the level of algorithms, and the
level of algorithm implementation. Marr, Poggio, and Grimson [19] have developed
all three levels inspired by human stereo vision. Several researchers subsequently
applied some ideas of computational stereo vision models proposed by Marr-Poggio,
to develop artificial stereo vision systems.
In the previous paragraph, we examined the elements of uncertainty in the
estimate of the disparity, known as the correspondence problem. Any computational
model chosen will have to minimize this problem, i.e., correctly search for homolo-
gous points in stereo images through a similarity measure that represents an estimate
of how similar such homologous points (or structures) are. In the computational
model of Marr and Poggio [20], different constraints (based on physical
considerations) are considered to reduce the correspondence problem as much as possible.
These constraints are:
Compatibility. The homologous points of stereo images must have a very similar
intrinsic physical structure if they represent the 2D projection of the same point
(local area) of the visible surface of the 3D object. For example, in the case of
random-dot stereograms, homologous candidate points are either black or white.
Uniqueness. A given point on the visible surface of the 3D object has a unique
position in space at any time (static objects). It follows that a point (or structure)
in an image has only one homologous point in the other image, that is, it has only
one candidate point as comparable: the constraint of uniqueness.
Continuity. The disparity varies only slightly across any area of the stereo image. This
constraint is motivated by the physical coherence of the visible surface, which is
assumed to be continuous without abrupt discontinuities. This
constraint is obviously violated in the areas of surface discontinuity of the object
and, in particular, along the object contours.
Epipolarity. Homologous points must lie on corresponding epipolar lines of the two stereograms.
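A minimal sketch of how these constraints can drive an iterative (cooperative) matching network, in the spirit of the Marr-Poggio scheme, is given below. It is an illustrative simplification, not the published implementation: the function name and parameters are invented, the disparity convention assumes matches of the form (y, x) in the left image with (y, x − d) in the right image, and the uniqueness inhibition is applied only along the left line of sight.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def cooperative_stereo(left, right, d_max, iters=10,
                       excit=2.0, inhib=1.0, theta=3.0):
    """Illustrative cooperative network: node (y, x, d) = 1 means 'pixel
    (y, x) of the left image matches pixel (y, x - d) of the right image'.
    Same-disparity neighbours excite each other (continuity), competing
    disparities at the same left pixel inhibit each other (uniqueness),
    and only value-compatible pairs are ever allowed (compatibility)."""
    h, w = left.shape
    compat = np.zeros((h, w, d_max + 1))
    for d in range(d_max + 1):
        compat[:, d:, d] = (left[:, d:] == right[:, :w - d]).astype(float)
    state = compat.copy()
    for _ in range(iters):
        new_state = np.zeros_like(state)
        for d in range(d_max + 1):
            # Continuity: sum of same-disparity neighbours in a 5x5 window
            support = uniform_filter(state[:, :, d], size=5) * 25.0
            # Uniqueness: inhibition from the other disparities of this pixel
            rivals = state.sum(axis=2) - state[:, :, d]
            new_state[:, :, d] = excit * support - inhib * rivals
        # Compatibility plus threshold: keep binary node states
        state = ((new_state * compat) >= theta).astype(float)
    return state.argmax(axis=2)
```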
Fig. 4.47 Results of the Marr-Poggio stereo cooperative algorithm. The initial state of the network,
which includes all possible matches within a predefined disparity range, is indicated with the map
0. With the evolution of iterations, the geometric structure present in the random-dot stereograms
emerges, and the different disparity values are represented with gray levels
much as possible the number of correspondences and to become invariant to the
possible different contrast of the stereograms.
The functional scheme of this second algorithm is the following:
1. Each stereo image is analyzed with different spatial resolution channels and the
comparison is made between the elementary structures present in the images
associated with the same channel and for disparity values that depend on the
resolution of the channel considered.
2. We can use the disparity estimates calculated with coarse resolution channels
to guide the ocular movements of vergence of the eyes to align the elementary
structures by comparing the channel disparities with a finer resolution to find the
correct match.
3. Once the correspondence is determined, associated with a certain resolution, the
disparity values are maintained in a disparity map (2.5D sketch) acting as a memory
buffer (the function of this memory suggested to Marr and Poggio an explanation
for the hysteresis phenomenon).
The stereo vision process with this algorithm begins by analyzing the images
with coarse resolution channels that generate elementary structures well separated
from each other, and then the matching process is guided to corresponding channels
with finer resolution, thus improving the robustness in determining the homologous
points.
The novelty of this second algorithm consists in selecting contour points in the two
images as elementary structures for comparison; in particular, Marr
and Poggio used the zero crossings, characterized by the sign of the contrast
variation and by their local orientation. Grimson [19] has implemented this algorithm
using random-dot stereograms at 50% density.
The stereograms are convolved with the LoG filter (Laplacian of Gaussian), with
different values of σ to produce multi-channel stereograms with different spatial
resolution. Remember the relation W = 2 × 3√2 σ, introduced in Sect. 1.13 Vol. II, which
links the W dimension of the convolution mask with σ which controls the smooth-
ing effect of the LOG filter. In Fig. 4.48 the three convolutions of the stereograms
obtained with square masks of different sizes of 35, 17 and 9 pixels, respectively, are
shown.
The zero crossings obtained from the convolutions are shown, and it is observed
how the structures become more and more detailed as the convolution filter becomes smaller.
Points of zero crossing are considered homologous in the two images, if they have
the same sign and their local orientation remains within an angular difference not
exceeding 30◦ .
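The multi-channel filtering just described can be sketched as follows. This is an illustrative example only, assuming SciPy's gaussian_laplace as the LoG operator and deriving σ from the mask width through the relation W = 2 × 3√2 σ recalled above; function names and the random-dot test image are invented for the example.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def zero_crossings(response):
    """Mark pixels where the LoG response changes sign between horizontal
    or vertical neighbours (a simple zero-crossing detector)."""
    zc = np.zeros(response.shape, dtype=bool)
    sign = response > 0
    zc[:, 1:] |= sign[:, 1:] != sign[:, :-1]
    zc[1:, :] |= sign[1:, :] != sign[:-1, :]
    return zc

def multiscale_zero_crossings(image, mask_sizes=(35, 17, 9)):
    """LoG filtering at several scales; sigma is derived from the mask
    width W through W = 2 * 3 * sqrt(2) * sigma, as recalled in the text."""
    out = {}
    for w_mask in mask_sizes:
        sigma = w_mask / (6.0 * np.sqrt(2.0))
        out[w_mask] = zero_crossings(gaussian_laplace(image.astype(float), sigma))
    return out

# Example on a synthetic random-dot image (50% density, as in Grimson's tests)
rng = np.random.default_rng(0)
dots = (rng.random((128, 128)) > 0.5).astype(float)
for w_mask, zc in multiscale_zero_crossings(dots).items():
    print(f"mask {w_mask}: {int(zc.sum())} zero-crossing pixels")
```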
The comparison activity of the zero crossing starts with the coarse channels and
the resulting disparity map is very coarse. Starting from this map of rough disparity,
the comparison process analyzes the images convolved with the medium-sized filter
and the resulting disparity map is more detailed and precise. The process continues
using this intermediate disparity map that guides the process of comparing the zero
Fig. 4.48 Zero crossings obtained through the multiscale LoG filtering applied to the pair of random-
dot stereo images of Fig. 4.45 (first column). The other columns show the results of the filtering
performed at different scales by applying the convolution mask of the LOG filter, of the square
shape, respectively, of sizes 35, 17 and 9
crossing in the last channel, obtaining a final disparity map with the finest resolution
and the highest density.
The compatibility constraint is satisfied considering the zero crossing structures
as candidates for comparison to those that have the same sign and local orientation.
The larger filter produces few candidates for zero crossing due to the increased
smoothing activity of the Gaussian component of the filter and only the structures with
strong variations in intensity are maintained (coarse channel). In these conditions,
the comparison concerns the zero crossing structures that, in the stereo images, fall
within a predefined disparity interval (the Panum fusion interval, which depends on the
binocular system used) and within the width W of the filter used, which we know depends
on the parameter σ of the LoG filter.
The constraints of the comparison, together with the quantitative relationship that
links these parameters (filter width and default range of disparities), allow to
optimize the number of positive comparisons between the homologous zero crossing
structures, reducing both false negatives (failing to detect homologous zero crossing
structures that actually exist) and false positives (detecting as homologous zero crossing
structures that should not match; for example, a zero crossing generated by noise is
matched in the other image, or the homologous point in the other image is not visible
because it is occluded and another nonhomologous zero crossing is chosen instead;
see Fig. 4.49).
Once the candidate homologous points are found in the pair of images with low
spatial resolution channels, these constitute the constraints to search for zero crossing
structures in the images filtered at higher resolution and the search for zero crossing
Fig. 4.49 Scheme of the matching process to find zero crossing homologous points in stereo images.
a A zero crossing L in the left image has a high probability of finding its homologue R with
disparity d in the right image if d < W/2. b Another possible configuration is to find a counterpart
in the whole range W or a false counterpart F with 50% probability, but the true homologue R can
still be found. c To disambiguate the false homologues, the comparison between the zero crossings
is carried out first from the left image to the right one and then vice versa, obtaining that L2 can have
as homologue R2, while R1 has as homologue L1
homologues must take place within a range of disparity which is twice the size of
the current filter W. Given a zero crossing L in a given position in the left image (see
Fig. 4.49a), let R be the zero crossing in the right image which is homologous to L
(i.e., has the same sign and orientation) with a disparity value d, while F indicates a
possible false match near R. From the statistical analysis, it is shown that R is the
counterpart of L within a range ±W/2 with a probability of 95% if the maximum
disparity is d = W/2. In other words, given a zero crossing in some position in the
filtered image, it has been shown that the probability of the existence of another zero
crossing in the range ±W/2 is 5%. If the correct disparity is not in the range ±W/2,
the probability of a match is 40%.
For d > W/2 and d ≤ W we have the same probability of 95% that R is the only
homologous candidate for L, with disparity from 0 to W if the value of d is positive
(see Fig. 4.49b). A probability of 50% is also statistically determined for a false
correspondence within 2W of disparity, in the interval between d = −W and d = W.
This means that 50% of the time there is ambiguity in determining the correct
match, both in the disparity interval (0, W) (convergent disparity) and in the interval
(−W, 0) (divergent disparity), where only one of the two cases is correct. If d is
around zero, the probability of a correct match is 90%. Therefore, from the figure
we have that F is a false match candidate with a probability of 50%, but the possible
match with R also remains.
To determine the correct match, Grimson proposed a matching procedure that first
compares the zero crossings from the left to the right image, and then from the right
to the left image (see Fig. 4.49c). In this case, starting the comparison from the left to
the right image, L1 can ambiguously correspond to R1 or to R2, but L2 has only R2 as its
counterpart. From the right-hand side, the correspondence for R1 is unique, namely
L1, but is ambiguous for R2.
Combining the two situations together, the two unique matches provide the correct
solution (constraint of uniqueness). It is shown that if more than 70% of the zero
crossings match in the range (−W, +W), then the disparity interval is correct
(satisfies the continuity constraint). Figure 4.50 shows the results of this algorithm
applied to the random-dot stereograms of Fig. 4.48 with four levels of depth.
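The left-to-right and right-to-left comparison can be sketched, for generic integer disparity maps, as the consistency check below. This is a simplified illustration of the idea, not Grimson's implementation; the convention that a left pixel x matches the right pixel x − d is an assumption of the example.

```python
import numpy as np

def left_right_check(disp_left, disp_right, tol=1):
    """Keep only disparities confirmed in both directions: a left pixel x
    with disparity d must map to a right pixel x - d whose own disparity
    maps back to (approximately) x. Unconfirmed pixels are marked -1."""
    h, w = disp_left.shape
    out = np.full_like(disp_left, -1)
    for y in range(h):
        for x in range(w):
            d = disp_left[y, x]
            xr = x - d
            if 0 <= xr < w and abs(disp_right[y, xr] - d) <= tol:
                out[y, x] = d
    return out
```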
As previously indicated, the disparity values, found in the intermediate steps of
this coarse-to-fine comparison process, are saved in a temporary memory buffer also
called 2.5D sketch map. The function of this temporary memory of the correspon-
dence process is considered by Marr and Poggio to be the equivalent of the hysteresis
phenomenon initially proposed by Julesz to explain the biological fusion process.
This algorithm does not fully account for all the psycho-biological evidence of
human vision. This computational model of stereo vision has been revised by other
researchers, and others have proposed different computational modules where the
matching process is seen as integrated with the extraction process of the candidate
elementary structures for comparison (primal sketch). These latest computational
models contrast with Marr's idea of stereo vision, which sees the early vision modules
for the extraction of elementary structures as separate.
Figure 4.51 shows a simplified diagram of a monocular vision system, and we can
see how the 3D scene is projected onto the 2D image plane, essentially reducing
the original information of the scene by one dimension. This loss of information
is caused by the perspective nature of the projection, which makes the apparent
dimension of a geometric structure of an object ambiguous in the image plane. In fact,
structures of different size can appear of the same size in the image, depending on
whether they are nearer to or farther from the capture system.
This ambiguity depends on the inability of the monocular system to recover the
information lost with the perspective projection process. To solve this problem, in
analogy to human binocular vision, the artificial binocular vision scheme shown
in Fig. 4.52 is proposed, consisting of two cameras located in slightly different positions along
the X-axis. The acquired images are called stereoscopic image pairs or
stereograms. A stereoscopic vision system produces a depth map, that is, the distance
Fig. 4.51 Simplified scheme of a monocular vision system. Each pixel in the image plane captures
the light energy (irradiance) reflected by a surface element of the object, in relation to the orientation
of this surface element and the characteristics of the system
Fig. 4.52 Simplified diagram of a binocular vision system with parallel and coplanar optical axes.
a 3D representation with the Z axis parallel to the optical axes; b Projection in the plane X −Z of
the binocular system
between the cameras and the visible points of the scene projected in the stereo image
planes.
The gray level of each pixel of the stereo images is related to the light energy
reflected by the visible surface projected in the image plane, as shown in Fig. 4.51.
With binocular vision, part of the 3D information of the visible scene is recov-
ered through the gray level information of the pixels and through the triangulation
process that uses the disparity value for depth estimation. Before proceeding to the
formal calculation of the depth, we analyze some geometric notations of stereometry.
Figure 4.52 shows the simplest geometric model of a binocular system, consisting of
two cameras arranged with parallel and coplanar optical axes, separated by a distance b,
called the baseline, in the direction of the X axis. In this geometry, the two image
planes are also coplanar at the focal distance f with respect to the optical center of
the left lens which is the origin of the stereo system.
An element P of the visible surface is projected by the two lenses on the respective
retinas in PL and in PR. The plane passing through the optical centers CL and CR of
the lenses and the visible surface element P is called epipolar plane. The intersection
of the epipolar plane with the plane of the retinas defines the epipolar line. The Z axis
coincides with the optical axis of the left camera. Stereo images are also vertically
aligned and this implies that each element P of the visible surface is projected onto
the two retinas maintaining the same vertical coordinate Y . The constraint of the
epipolar line implies that the stereo system does not present any vertical disparity.
Two points found in the two retinas along the same vertical coordinate are called
homologous points if they derive from the perspective projection of the same element
of the visible surface P. The disparity measure is obtained by superimposing the two
retinas and calculating the horizontal distance of the two homologous points.
7 In Fig. 4.52b the depth of the point P is indicated with ZP, but in the text we will continue to
indicate with Z the depth of a generic point of the object.
By eliminating the X from the (4.8), and resolving with respect to Z, we get the
following relation:

$$Z = \frac{b \cdot f}{x_R - x_L} \qquad (4.9)$$
which is the triangulation equation for calculating the perpendicular distance (depth)
for a binocular system with the geometry defined in Fig. 4.52a, that is, with the
constraints of the parallel optical axes and with the projections PL and PR lying on
the epipolar line. We can see how in the (4.9) the distance Z is correlated only to the
disparity value (xR − xL ), induced by the observation of the point P of the scene,
and is independent of the system of reference of the local coordinates or, from the
absolute values of xR and xL .
Recall that the parameter b is the baseline, that is, the separation distance of the
optical axes of the two cameras and f is the focal length, identical for the optics of
the two cameras. Furthermore, b and f have positive values. In the (4.9) the value of
Z must be positive, and consequently the denominator (xR − xL) must also be positive.
The geometry of the human binocular vision system is such that the numerator b·f
of (4.9) assumes values in the range (390−1105 mm²), considering the interval
(6−17 mm) of the focal length of the crystalline lens (corresponding, respectively,
to the vision of closer objects, at about 25 cm, with contracted ciliary muscle, and
to the vision of more distant objects with relaxed ciliary muscle) and the
baseline b = 65 mm. Associated with the corresponding range Z = 0.25−100 m,
the interval of disparity xR − xL would result (2−0.0039 mm). The denominator
value of (4.9) tends to assume very small values to calculate large values of depth Z
(for (xR − xL ) → 0 ⇒ Z → ∞). This can determine a non-negligible uncertainty
in the estimate of Z.
For a binocular vision system, the uncertainty of the estimate of Z can be limited
by using cameras with a good spatial resolution (not less than 512 × 512 pixels)
and by minimizing the error in the estimation of the position of the elementary
structures detected in the stereo images as candidate homologous structures. These two
aspects can easily be addressed considering the availability of HD cameras (res-
olution 1920 × 1080 pixels) equipped with sensors with photoreceptors of 4 µm. For
example, for a pair of these cameras, configured with a baseline of b = 120 mm and
optics with focal lengths of 15 mm, to detect an object at a distance of 10 m,
according to (4.9) the corresponding disparity would be 0.18 mm, which in terms of pixels
corresponds to several tens (an adequate resolution to evaluate the position of
homologous structures in the two HD stereo images).
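The figures above can be verified directly from Eq. (4.9); the short snippet below repeats the calculation with the values taken from the text.

```python
# Numeric check of Eq. (4.9) with the HD-camera values mentioned above
# (b = 120 mm, f = 15 mm, Z = 10 m, photoreceptor pitch 4 micrometres).
b_mm, f_mm, Z_mm, pixel_mm = 120.0, 15.0, 10_000.0, 0.004

disparity_mm = b_mm * f_mm / Z_mm      # xR - xL, from Z = b*f / (xR - xL)
disparity_px = disparity_mm / pixel_mm
print(f"disparity = {disparity_mm:.2f} mm = {disparity_px:.0f} pixels")
# -> disparity = 0.18 mm = 45 pixels, i.e. several tens of pixels as stated.
```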
Let us return to Fig. 4.52 and use similar right-angled triangles in the 3D
context, whose hypotenuses are PL CL and CL P. We can get the following expression:

$$\frac{D}{Z} = \frac{\overline{P_L C_L}}{f} \;\Longrightarrow\; \frac{D}{Z} = \frac{\sqrt{f^2 + x_L^2 + y_L^2}}{f} \qquad (4.10)$$
in which Z can be replaced using Eq. (4.9) (the perpendicular distance of P), obtaining

$$D = \frac{b\,\sqrt{f^2 + x_L^2 + y_L^2}}{x_R - x_L} \qquad (4.11)$$
which is the equation of the Euclidean distance D of the point P in the three-
dimensional reference system, whose origin always coincides with the optical center
CL of the left camera. When calibrating the binocular vision system, if it is necessary
to verify the spatial resolution of the system, it may be convenient to use Eq. (4.9)
or (4.11) to predict, knowing the positions of the points P in space and the position xL in
the left retina, what the value of the disparity should be, i.e., to estimate xR and the
position of the point P when projected in the right retina.
In some applications it is important to evaluate well the constant b · f (of the
Eq. 4.9) linked to the intrinsic parameter of the focal length f of the lens and to the
extrinsic parameter b that depends on the geometry of the system.
Fig. 4.53 Field of view in binocular vision. a In systems with parallel optical axes, the field of
view decreases with the increase of baseline but a consequent increase in the accuracy is obtained
in determining the depth. b In systems with converging optical axes, the field of view decreases as
the vergence angle and baseline increase but decreases the level of depth uncertainty
images increases with the increase of b, all to the disadvantage of the stereo fusion
process which aims to search, in stereo images, for the homologous points deriving
from the same point P of the scene.
Proper calibration is strategic when the vision system has to interact with the
world to reconstruct the 3D model of the scene and when it has to refer to it (for
example, an autonomous vehicle or a robotic arm must self-locate with a given accuracy).
Some calibration methods are well described in [21–24].
According to Fig. 4.52, the equations for reconstructing the 3D coordinates of
each point P(X, Y, Z) visible from the binocular system (with parallel and coplanar
optical axes) are summarized as follows:

$$Z = \frac{b \cdot f}{x_R - x_L}, \qquad X = x_L\,\frac{Z}{f} = \frac{x_L \cdot b}{x_R - x_L}, \qquad Y = y_L\,\frac{Z}{f} = \frac{y_L \cdot b}{x_R - x_L} \qquad (4.12)$$
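A direct transcription of (4.12) into code might look as follows. This is an illustrative sketch: the function name is invented, image coordinates are assumed to be expressed in the same metric unit as b and f, and the sign convention follows the equations as written, with xR − xL positive for points at finite depth.

```python
import numpy as np

def reconstruct_point(x_left, y_left, x_right, b, f):
    """Apply Eq. (4.12): reconstruct (X, Y, Z) from a pair of homologous
    image points for the parallel-axis geometry of Fig. 4.52. All lengths
    (image coordinates, b, f) are assumed to be in the same metric unit."""
    disparity = x_right - x_left
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at infinity or behind")
    Z = b * f / disparity
    X = x_left * b / disparity       # equivalently x_left * Z / f
    Y = y_left * b / disparity       # equivalently y_left * Z / f
    return np.array([X, Y, Z])

# Example with the values used earlier (b = 120 mm, f = 15 mm): a disparity
# of 0.18 mm gives Z = 10 000 mm = 10 m.
print(reconstruct_point(x_left=1.0, y_left=0.5, x_right=1.18, b=120.0, f=15.0))
```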
To mitigate the limit situations indicated above with the stereo geometry proposed
in Fig. 4.52 (to be used when Z ≫ b), it is possible to use a different configuration of
the cameras, arranging them with converging optical axes, that is, inclined toward
the fixation point P which is at a finite distance from the stereo system, as shown
in Fig. 4.54.
With this geometry the points of the scene projected on the two retinas lie along the
lines of intersection (the epipolar lines) between the image planes and the epipolar
plane which includes the point P of the scene and the two optical centers CL and CR
of the two cameras, as shown in Fig. 4.54a. It is evident that with this geometry, the
epipolar lines are no longer horizontal as they were with the previous stereo geometry,
Fig. 4.54 Binocular system with converging optical axes. a Epipolar geometry: the baseline
intersects each image plane at the epipoles eL and eR. Any plane containing the baseline is called
an epipolar plane and intersects the image planes at the epipolar lines lL and lR. In the figure, the
epipolar plane considered is the one passing through the fixation point P. As the 3D position of P
changes, the epipolar plane rotates around the baseline and all the epipolar lines pass through the
epipoles. b The epipolarity constraint imposes the coplanarity, in the epipolar plane, of the point P
of the 3D space, of the projections PL and PR of P in the respective image planes, and of the two
optical centers CL and CR. It follows that a point PL of the left image is back-projected through the
center CL into the 3D space along the ray CL PL. The image of this ray is projected into the right
image and corresponds to the epipolar line lR on which to search for PR, that is, the
homologue of PL
with the optical axes of the cameras arranged parallel and coplanar. Furthermore,
the epipolar lines always occur in corresponding pairs lying in the same epipolar plane. The
potential homologous points, projections of P in the two retinas, respectively PL
and PR, lie on the corresponding epipolar lines lL and lR by the epipolarity constraint.
The baseline b is always the line joining the optical centers, and the epipoles eL and eR
of the optical systems are the intersection points of the baseline with the respective
image planes. The right epipole eR is the virtual image of the left optical center CL
observed in the right image, and vice versa the left epipole eL is the virtual image of
the optical center CR .
Knowing the intrinsic and extrinsic parameters of the binocular system (calibrated)
and the epipolarity constraints, the correspondence problem is simplified by restrict-
ing the search for the homologous point of PL (supposedly known) to the associated
epipolar line lR, coplanar with the epipolar plane determined by PL, CL, and the baseline
(see Fig. 4.54b). Therefore, the search is restricted to the epipolar line lR and not to
the entire right image.
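In practice, for a calibrated pair this restriction is often expressed through the fundamental matrix F (introduced with camera calibration, not in this section): in homogeneous coordinates, the right-image epipolar line associated with a left point pL is lR = F·pL. The sketch below assumes such an F is available and is only illustrative.

```python
import numpy as np

def epipolar_line_right(p_left, F):
    """Coefficients (a, b, c) of the right-image epipolar line
    a*x + b*y + c = 0 associated with the left-image point p_left,
    given the fundamental matrix F of the calibrated pair."""
    x = np.array([p_left[0], p_left[1], 1.0])      # homogeneous coordinates
    line = F @ x
    return line / np.linalg.norm(line[:2])         # normalise (a, b) to unit length

def epipolar_distance(p_right, line):
    """Distance of a candidate right-image point from the epipolar line:
    small values mean the epipolarity constraint is satisfied."""
    a, b, c = line
    return abs(a * p_right[0] + b * p_right[1] + c)
```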
For a binocular system with converging optical axes, the binocular triangulation
method (also called binocular parallax) can be used for the calculation of the coor-
dinates of P, but the previous Eq. (4.12) is no longer valid, since it substantially
assumed a binocular system with the point of fixation at infinity (parallel optical axes).
In fact, in this geometry, instead of calculating the linear disparity, it is necessary to
calculate the angular disparities θL and θR , which depend on the angle of convergence
ω of the optical axes of the system (see Fig. 4.55).
In analogy to the human vision, the optical axes of the two cameras intersect
at a point F of the scene (fixation point) at the perpendicular distance Z from the
baseline. We know that with stereopsis (see Sect. 4.6.3) we get the perception
of relative depth if, simultaneously, another point P is seen nearer or farther with
respect to F (see Fig. 4.55a).
In particular, we know that all points located around the horopter stimulate the
stereopsis caused by the retinal disparity (local difference between the retinal images
caused by the different observation point of each eye). The disparity in the point F
is zero, while there is an angular disparity for all points outside the horopter curve
and each presents a different angle of vergence β. Analyzing the geometry of the
binocular system of Fig. 4.55a it is possible to derive a binocular disparity in terms
of angular disparity δ, defined as follows:
δ = α − β = θR − θL (4.13)
where α and β are the vergence angles subtended, from the optical axes, by the
fixation point F and by the point P outside the horopter curve; θL and θR are the angles
included, in the left and right camera, between the retinal projections of the fixation point F
and of the target point P. The functional relationship that binds the angular disparity δ
(expressed in radians) and the depth ΔZ is obtained by applying elementary geom-
etry (see Fig. 4.55b). Considering the right-angled triangles with base b/2, where b
is the baseline, it results from trigonometry that b/2 = tan(α/2) Z. For small angles
one can approximate tan(α/2) ≈ α/2. Therefore, the angular disparity between PL
and PR is obtained by applying the (4.13) as follows:

$$\delta = \alpha - \beta = \frac{b}{Z} - \frac{b}{Z + \Delta Z} = \frac{b\,\Delta Z}{Z^2 + Z\,\Delta Z} \qquad (4.14)$$
For very small distances Z (less than 1 meter) of the fixation point and with depth values
ΔZ very small compared to Z, the second term in the denominator of (4.14)
becomes almost zero and the expression of the angular disparity simplifies to

$$\delta = \frac{b\,\Delta Z}{Z^2} \qquad (4.15)$$
The same result for the angular disparity is achieved if we consider the difference
between the angles θL and θR, in this case taking into account the sign of the angles.
The estimate of the depth ΔZ and of the absolute distance Z of an object with
respect to a known reference object (fixation point), on which the binocular system
converges with the angles ω and ψ, can be calculated considering different angular
reference coordinates, as shown in Fig. 4.55c. Knowing the angular configuration ω and
ψ with respect to the reference object F, the absolute distance Z of F with respect
to the baseline b is given by8:

$$Z \approx b\,\frac{\sin\omega\,\sin\psi}{\sin(\omega + \psi)} \qquad (4.16)$$

while the depth ΔZ results in

$$\Delta Z = b\,\frac{\Delta\omega + \Delta\psi}{2} \qquad (4.17)$$

where Δω and Δψ are the angular offsets of the left and right image planes required to align
the binocular system with P starting from the initial reference configuration.
In human vision, considering the baseline b = 0.065 m and fixating an object at the
distance Z = 1.2 m, for an object ΔZ = 0.1 m away from the fixated one, applying
the (4.15) we would have an angular disparity of δ = 0.0045 rad. A person with normal
vision is able to pass a wire through the eye of a needle fixated at Z = 0.35 m,
working around the eyelet with a resolution of ΔZ = 0.1 mm. The visual capacity
of human stereopsis is such as to perceive depth, around the point of fixation,
of fractions of millimeters, requiring, according to the (4.15), a resolution of the
angular disparity of δ = 0.000053 rad = 10.9 s of arc. Fixating an object at the distance
of Z = 400 m, the depth around this object is no longer perceptible (perceived as a
flattened background) since the resolution of the required angular disparity would be
very small, less than 1 s of arc. In systems with converging optical axes, the field of
view decreases with increasing vergence angle and baseline, but the level
of depth uncertainty also decreases (see Fig. 4.53b).
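The numerical values quoted above follow directly from Eq. (4.15); the snippet below reproduces them with the values taken from the text.

```python
import math

# Numeric check of Eq. (4.15), delta = b * dZ / Z**2, with the human-vision
# values used in the text (b = 0.065 m).
b = 0.065
for Z, dZ in [(1.2, 0.1), (0.35, 0.0001)]:
    delta = b * dZ / Z**2                          # angular disparity, radians
    arcsec = math.degrees(delta) * 3600
    print(f"Z = {Z} m, dZ = {dZ} m -> delta = {delta:.6f} rad ({arcsec:.1f} arcsec)")
# First case: ~0.0045 rad; second case: ~0.000053 rad (~10.9 arcsec),
# matching the values given in the text.
```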
Active and passive vision systems have been experimented with to estimate the angular
disparity together with other parameters (position and orientation of the cameras)
of the system. These parameters are evaluated and dynamically checked for the
calculation of the depth of various points in the scene. The estimation of the position
and orientation of the cameras requires their calibration (see Chap. 7 on Camera
calibration and 3D Reconstruction). If the positions and orientations of the cameras are
known (calculated, for example, with active systems), the reconstruction of the 3D
points of the scene is realized with the roto-translation transformation of the points
PL = (xL, yL, zL) (projection of P(X, Y, Z) in the left retina), which are projected
8 From Fig. 4.55c it is observed that Z = AF · sin ω, where AF is calculated by recalling the
law of sines, for which b / sin(π − ω − ψ) = AF / sin ψ, and that the sum of the inner angles of the
triangle AFB is π. Therefore, solving with respect to AF and substituting, it is obtained

Z = b\,\frac{\sin\psi}{\sin[\pi - (\omega + \psi)]}\,\sin\omega = b\,\frac{\sin\omega\,\sin\psi}{\sin(\omega + \psi)}.
For strategy 1, the use of contour points or elementary areas has already been
proposed. More recently, the Points of Interest (POI) described in Chap. 6 Vol. II (for
example, SIFT and SUSAN) are also used.
For strategy 2, two classes of algorithms are obtained for the measurement of
similarity (or dissimilarity) of point-like elementary structures or extended areas. We
immediately highlight the importance of similarity (or dissimilarity) algorithms that
discriminate well between structures that are not very different from each other
(homogeneous distribution of pixel gray levels), in order to keep the number of false
matches to a minimum.
The calculation of the depth is made only for the elementary structures found
in the images and in particular by choosing only the homologous structures (punc-
tual or areas). For all the other structures (features) for which the depth cannot be
calculated with stereo vision, interpolation techniques are used to have a more com-
plete reconstruction of the visible 3D surface. The search for homologous structures
(strategy 2) is simplified when the geometry of the binocular system is conditioned
by the constraint of the epipolarity. With this constraint, the homologous structures
are located along the corresponding epipolar lines and the search area in the left and
right image is limited.
The extent of the search area depends on the uncertainty of the intrinsic and
extrinsic parameters of the binocular system (for example, uncertainty about the position
and orientation of the cameras). This makes it necessary to search for the homologous
structure in a small neighborhood around the estimated position of the structure in the
right image (the geometry of the system being known), slightly relaxing the
constraint of epipolarity (search in a horizontal and/or vertical neighborhood).
In the simple binocular system, with parallel and coplanar optical axes, or in the
case of rectified stereo images, the search for homologous structures takes place by
considering corresponding lines with the same vertical coordinate.
The extraction of the elementary structures present in the pair of stereo images can
be done by applying to these images some filtering operators described in Chap. 1
Vol. II Local Operations: Edging. In particular, point-like structures, contours, edge
elements, and corners can be extracted. Julesz and Marr used random-dot images
(black or white point-like synthetic structures) and the corresponding zero-crossing
structures extracted with the LoG filtering operator.
With the constraint of epipolar geometry, a stereo vision algorithm includes the
following essential steps:
where SD represents the inner product between the vectors si and sj, which denote
two generic elementary structures characterized by n parameters, φ is the angle
between the two vectors, and ‖·‖ indicates the length of a vector. SE, instead,
represents the Euclidean distance weighted by the weights wk of each characteristic
sk of the elementary structures.
The two similarity measures SD and SE can be normalized so as not to depend too
much on the variability of the characteristics of the elementary structures. In this
case, (4.18) can be normalized by dividing each term of the summation by the
product of the moduli of the vectors si and sj, given by

\|s_i\|\,\|s_j\| = \sqrt{\sum_k s_{ki}^2}\;\sqrt{\sum_k s_{kj}^2}
The weighted Euclidean distance measure can be normalized by dividing each
addend of (4.19) by the term R²k, which represents the square of the maximum range of
variability of the k-th component. The Euclidean distance measure is used as a similarity
estimate in the sense that the more different the components describing a pair of
candidate homologous structures are, the greater their difference, that is, the value
of SE. The weights wk, relative to each characteristic sk, are calculated by analyzing
a certain number of pairs of elementary structures for which it can be guaranteed
that they are homologous structures.
An alternative normalization can be chosen on a statistical basis by subtracting from
each characteristic sk its average s̄k and dividing by its standard deviation. Obviously,
this is possible, for both measures SD and SE, only if the probability distribution of
the characteristics is known, which can be estimated using known pairs {(si , sj ), . . .}
of elementary structures.
In conclusion, in this step, the measure of similarity of pairs (si , sj ) is estimated
to check whether they are homologous structures and then, for each pair, the
measure of disparity dij = si (xR ) − sj (xL ) is calculated.
5. The previous steps can be repeated to have different disparity estimates by iden-
tifying and calculating the correspondence of the structures at different scales,
analogous to the coarse-to-fine approach proposed by Marr-Poggio, described
earlier in this chapter.
6. Calculation with Eq. (4.12) of 3D spatial coordinates (depth Z and coordinates X
and Y ), for each point of the visible surface represented by the pair of homologous
structures.
7. Reconstruction of the visible surface, at the points where it was not possible to
measure the depth, through an interpolation process, using the measurements of
the stereo system estimated in step 6.
Fig. 4.56 Search in stereo images of homologous points through the correlation function between
potential homologous windows, with and without epipolarity constraint
9 In the context of image processing, it is often necessary to compare the level of equality (similarity) of two image regions.
The correlation between the window WL, centered in (xL, yL) in the left image, and the
window WR, centered in (i + m, j + n) within the search region R of the right image, is given by

C(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_L(x_L + k,\, y_L + l)\; W_R(i + m + k,\, j + n + l) \qquad (4.20)

with i, j = −(N − M − 1), +(N − M − 1). The size of the square window WL
is given by (2M + 1), with values of M = 1, 2, . . . generating windows of
size 3 × 3, 5 × 5, etc., respectively. The size of the square search region Rm,n,
located at (m = xL + dminx , n = yL + dminy ) in the right image IR , is given by
(2N + 1), related to the size of the WR window with N = M + q, q = 1, 2, . . ..
For each value of (i, j) where WR is centered in the search region R we have a value
of C(i, j; m, n), and to move the window WR in R it is sufficient to vary the indices
i, j = −(N − M − 1), +(N − M − 1) inside R, whose dimensions and position can
be defined a priori in relation to the geometry of the stereo system, which allows
a maximum and minimum interval of disparity. The accuracy of the correlation
measurement depends on the variability of the gray levels between the two stereo
images. To minimize this drawback, the correlation measurements C(i, j; m, n) can
be normalized using the correlation coefficient¹⁰ r(i, j; m, n) as the new correlation
estimation value, given by
r(i, j; m, n) = \frac{\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_L(k, l; x_L, y_L)\; W_R(k, l; i + m, j + n)}{\sqrt{\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_L^2(k, l; x_L, y_L)\;\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_R^2(k, l; i + m, j + n)}} \qquad (4.21)
where

W_L(k, l; x_L, y_L) = W_L(x_L + k, y_L + l) - \bar{W}_L(x_L, y_L)
W_R(k, l; i + m, j + n) = W_R(i + m + k, j + n + l) - \bar{W}_R(i + m, j + n) \qquad (4.22)
with i, j = −(N − M − 1), +(N − M − 1) and (m, n) prefixed as above. W̄L and
W̄R are the mean values of the intensity values in the two windows. The correlation
coefficient is also known as Zero Mean Normalized Cross-Correlation - ZNCC. The
numerator of the (4.21) represents the covariance of the pixel intensities between
the two windows while the denominator is the product of the respective standard
deviations. It can easily be deduced that the correlation coefficient r(i, j; m, n) takes
scalar values in the range between −1 and +1, no longer depending on the variability
of the intensity levels in the two stereo images. In particular, r = 1 corresponds to
10 The correlation coefficient has been described in Sect. 1.4.2 and in this case it is used to evaluate
the statistical dependence of the intensity of the pixels between the two windows, without knowing
the nature of this statistical dependence.
the exact equality of the elementary structures (homologous structures, up to a
constant factor c, WR = cWL ; that is, the two windows are strongly correlated but with
a uniform scaling of intensity, one brighter than the other). r = 0 means that they are completely
different, while r = −1 indicates that they are anticorrelated (i.e., the intensities of
the corresponding pixels are equal but of opposite sign).
The previous correlation measures may not be adequate for some applications,
due to the noise present in the images and in particular when the search regions
are very homogeneous, with little variability in the intensity values. This generates very
uncertain or uniform correlation values C or r, with consequent uncertainty in the
estimation of the horizontal and vertical disparities (dx , dy ).
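To make the ZNCC of Eqs. (4.21)–(4.22) concrete, here is a minimal NumPy sketch (not from the text; the function name and the small epsilon guard against division by zero are illustrative additions):

import numpy as np

def zncc(wl, wr, eps=1e-12):
    # Zero-mean normalized cross-correlation (Eq. 4.21) between two
    # equal-sized windows; returns a value in [-1, +1].
    dl = wl - wl.mean()
    dr = wr - wr.mean()
    denom = np.sqrt((dl * dl).sum() * (dr * dr).sum()) + eps
    return float((dl * dr).sum() / denom)

# windows identical up to a scale factor are perfectly correlated
w = np.random.rand(5, 5)
print(zncc(w, 2.5 * w))   # ~ +1.0
print(zncc(w, -w))        # ~ -1.0 (anticorrelated)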
More precisely, if the windows WL and WR represent the intensities in two
images obtained under different lighting conditions of a scene and the correspond-
ing intensities are linearly correlated, a high similarity between the images will be
obtained. Therefore, the correlation coefficient is suitable for determining the sim-
ilarity between the windows with the assumed intensities to be linearly correlated.
When, on the other hand, the images are acquired under different conditions (sensors
and nonuniform illumination), so that the corresponding intensities are correlated in
a nonlinear way, the two perfectly matched windows may not produce sufficiently
high correlation coefficients, causing misalignments.
Another drawback is given by the intensive calculation required especially when
the size of the windows increases. In [25] an algorithm is described that optimizes the
computational complexity for the problem of template matching between images.
Simpler dissimilarity measures are based on the sum of the squared differences (SSD) and
on the sum of the absolute differences (SAD) between the intensities of the two windows:

C_{SSD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left[ W_L(x_L + k, y_L + l) - W_R(i + m + k, j + n + l) \right]^2 \qquad (4.23)

C_{SAD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left| W_L(x_L + k, y_L + l) - W_R(i + m + k, j + n + l) \right| \qquad (4.24)

with i, j = −(N − M − 1), +(N − M − 1).
For these two dissimilarity measures CSSD (i, j; m, n) and CSAD (i, j; m, n) (based,
respectively, on the L2 and L1 norms), the minimum value is chosen as the best match
between the windows WL (xL , yL ) and WR (xR , yR ) = WR (i + m, j + n), which are then
taken as homologous local structures with an estimate of the disparity
(dx = xR − xL , dy = yR − yL ) (see Fig. 4.56). The SSD metric is computationally less
expensive than the correlation coefficient (4.21) and, like the latter, can be normalized
to obtain equivalent results; in the literature there are several methods of normalization.
The SAD metric is the most used as it requires the least computational load.
All the matching measures described are sensitive to geometric deformations
(skewing, rotation, occlusions, . . .) and to radiometric distortions (vignetting, impulse noise, . . .).
The latter can be attenuated, also for the SSD and SAD metrics, by subtracting from each
pixel the mean value W̄ calculated on the windows to be compared, as
already done for the calculation of the correlation coefficient (4.21). The two metrics
become the Zero-mean Sum of Squared Differences (ZSSD) and the Zero-mean Sum
of Absolute Differences (ZSAD) and their expressions, considering Eq. (4.22), are
C_{ZSSD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left[ W_L(k, l; x_L, y_L) - W_R(k, l; i + m, j + n) \right]^2 \qquad (4.25)

C_{ZSAD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left| W_L(k, l; x_L, y_L) - W_R(k, l; i + m, j + n) \right| \qquad (4.26)
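As an illustration, the following Python sketch (not from the text) estimates the disparity of a single pixel by minimizing the SAD cost of Eq. (4.24) along the same row, assuming rectified images; the sign convention xR = xL − d and the parameter names are assumptions, and replacing the cost with its zero-mean variant gives Eq. (4.26):

import numpy as np

def sad_disparity(img_l, img_r, x, y, M=2, d_max=32):
    # Disparity of pixel (x, y) of the left image by SAD minimization
    # along the same row of the right (rectified) image.
    wl = img_l[y - M:y + M + 1, x - M:x + M + 1].astype(np.float64)
    best_d, best_cost = 0, np.inf
    for d in range(d_max + 1):
        xr = x - d                    # assumed convention: right abscissa decreases with disparity
        if xr - M < 0:
            break
        wr = img_r[y - M:y + M + 1, xr - M:xr + M + 1].astype(np.float64)
        cost = np.abs(wl - wr).sum()  # Eq. (4.24); use (wl - wl.mean()) - (wr - wr.mean()) for ZSAD
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

left = np.random.randint(0, 255, (60, 80))
right = np.roll(left, -5, axis=1)     # synthetic horizontal shift of 5 pixels
print(sad_disparity(left, right, x=40, y=30))   # expected 5 for this synthetic shift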
Fig. 4.57 Calculation of the depth map from a pair of stereo images by detecting homologous
local elementary structures using similarity functions. The first column shows the stereo images
and the real depth map of the scene. The following columns show the depth maps obtained
with windows of increasing size starting from 3 × 3, where the corresponding windows in the stereo
images are compared with the normalized correlation similarity function (first row), SSD (second row),
and SAD (third row)
The rank transform is applied to the two windows of the stereo images, where for
each pixel the intensity is replaced with its rank Rank(i, j). For a given window W in the
image, centered in the pixel p(i, j), the rank transform RankW (i, j) is defined as the
number of pixels in W whose intensity is less than the value of p(i, j). For example, if

W = | 79 42 51 |
    | 46 36 34 |
    | 37 30 28 |

then RankW (i, j) = 3, there being three pixels with intensity less than the central pixel
of value 36. Note that the values obtained are based on the relative order of the pixel
intensities rather than on the intensities themselves; the position of the pixels inside
the window is also lost. Using the preceding symbolism (see also Fig. 4.56), the
dissimilarity measure of rank distance RD(i, j), based on the rank transform Rank(i, j),
is given by the following:
RD(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left| Rank_{W_L}(x_L + k, y_L + l) - Rank_{W_R}(i + m + k, j + n + l) \right| \qquad (4.27)
In the (4.27) the value of RankW for a window centered in (i, j) in the image is
calculated as follows:
Rank_W(i, j) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} L(i + k, j + l) \qquad (4.28)
where L(k, l) is given by

L(k, l) = \begin{cases} 1 & \text{if } W(k, l) < W(i, j) \\ 0 & \text{otherwise} \end{cases} \qquad (4.29)
Equation (4.29) counts the number of pixels in the window W whose intensity is less
than that of the central pixel (i, j). Once the rank has been calculated
with (4.28) for the window WL (xL , yL ) and for the windows WR (i + m, j + n), i, j =
−(N − M − 1), +(N − M − 1), in the search region R(i, j) located at (m, n) in the image
IR , the comparison between the windows is evaluated with the SAD method (sum
of absolute differences) using (4.27).
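A minimal Python sketch of the rank transform of Eqs. (4.28)–(4.29), applied to a whole image (illustrative only; in this simplified version pixels closer than M to the border are left at zero):

import numpy as np

def rank_transform(img, M=1):
    # Replace each pixel with the number of neighbors in the
    # (2M+1)x(2M+1) window whose intensity is lower (Eqs. 4.28-4.29).
    img = img.astype(np.float64)
    out = np.zeros(img.shape, dtype=np.int32)
    rows, cols = img.shape
    for i in range(M, rows - M):
        for j in range(M, cols - M):
            w = img[i - M:i + M + 1, j - M:j + M + 1]
            out[i, j] = int((w < img[i, j]).sum())
    return out

W = np.array([[79, 42, 51],
              [46, 36, 34],
              [37, 30, 28]])
print(rank_transform(W, M=1)[1, 1])   # 3, as in the worked example above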
The rank distance is not a metric,¹¹ like all other measures based on ordering. The
dissimilarity measure based on the rank transform actually compresses the information
content of the image (the information of a window is encoded in a single value),
thus reducing the potential discriminating ability of the comparison between windows.
The choice of the window size becomes even more important in this method. The
computational complexity of the rank distance is reduced compared to the correlation
and ordering methods, being evaluated in the order of n log₂ n, where n indicates the
number of pixels of the window.
11 The rank distance is not a metric because it does not satisfy the reflexivity property of metrics. If
WL = WR , each corresponding pixel has the same intensity value and it follows that RD = 0.
Conversely, RD = 0, being the sum of the nonnegative terms of Eq. (4.27), requires
|RankWL (i, j) − RankWR (i, j)| = 0 for each pair of corresponding pixels of the two windows.
But the pixel intensities can differ by an offset or a scale factor and still produce RD = 0;
therefore RD = 0 does not imply WL = WR , violating the reflexivity property of a metric,
which would require RD(WL , WR ) = 0 ⇔ WL = WR .
Fig. 4.58 Calculation of the depth map, for the same stereo images of Fig. 4.57, obtained using
the census and rank transforms. a Census transform applied to the left image; b disparity map
based on the census transform and the Hamming distance; c rank transform applied to the left image;
d disparity map based on the rank distance and the SAD matching method
A related nonparametric measure is based on the census transform (CT) [27], which encodes,
in a bit string, the result of the comparison between the central pixel of the window and its neighbors:

Census_W(i, j) = Bitstring_W\left[ W(k, l) < W(i, j) \right] = \begin{cases} 1 & \text{if } W(k, l) < W(i, j) \\ 0 & \text{otherwise} \end{cases} \qquad (4.30)
For example, applying the census function to the same window considered above, for the
central pixel with value 36 the following binary coding is obtained:

W = | 79 42 51 |      | 0   0   0 |
    | 46 36 34 |  ⇔   | 0 (i,j) 1 |  ⇔  (0 0 0 0 1 0 1 1)₂  ⇔  (11)₁₀
    | 37 30 28 |      | 0   1   1 |
From the result of the comparison, we obtain a binary mask whose bits, not considering
the central bit, are concatenated row by row, from left to right, forming the final 8-bit
code which represents the central pixel of the window W. Finally, the decimal value
representing this 8-bit code is assigned as the central pixel value of W. The same
operation is performed for the window to be compared.
The dissimilarity measure DCT , based on the CT, is evaluated by comparing, with
the Hamming distance (which counts the number of differing bits), the bit strings
relating to the windows to be compared. The dissimilarity measure DCT is
calculated according to (4.30) as follows:
D_{CT}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} Dist_{Hamming}\big[ Census_{W_L}(x_L + k, y_L + l),\; Census_{W_R}(i + m + k, j + n + l) \big] \qquad (4.31)
where CensusWL represents the census bit string of left window WL (xL , yL ) on the
left stereo image and CensusWR (i + m, j + n) represents the census bit string of the
windows in the search area R in the right stereo image that must be compared (see
Fig. 4.56).
The function (4.30) generates the census bit string whose length depends, as seen
in the example, on the size of the window, i.e., Ls = (2M + 1)² − 1, which is the number
of pixels in the window minus the central one (in the example Ls = 8). Compared to
the rank transform there is a considerable increase in the dimensionality of the data,
depending on the size of the windows to be compared, with a consequent increase in the
required computational load. For this method, real-time implementations based on
ad hoc hardware (FPGA, Field Programmable Gate Arrays) have been developed for
the treatment of the binary strings. Experimental results have shown that these last two
dissimilarity measures, based on the rank and census transforms, are more efficient than
correlation-based methods in obtaining disparity maps from stereo images in the presence
of occlusions and radiometric distortions. Figure 4.58 shows the result of the census
transform (figure a) applied to the left image of Fig. 4.57 and the disparity map (figure b)
obtained on the basis of the census transform and the Hamming distance.
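The following Python sketch (illustrative, not from the text) computes the census bit string of Eq. (4.30) for a single window and the Hamming distance used in Eq. (4.31); it reproduces the worked example above:

import numpy as np

def census(window):
    # Census bit string (Eq. 4.30) of a square window: 1 where the
    # neighbor is darker than the central pixel, row by row, center skipped.
    c = window[window.shape[0] // 2, window.shape[1] // 2]
    bits = (window < c).astype(np.uint8).flatten()
    return np.delete(bits, window.size // 2)        # length (2M+1)^2 - 1

def hamming(a, b):
    return int((a != b).sum())                      # number of differing bits

W = np.array([[79, 42, 51],
              [46, 36, 34],
              [37, 30, 28]])
bits = census(W)
print(bits)                                         # [0 0 0 0 1 0 1 1]
print(int("".join(map(str, bits)), 2))              # 11, as in the example
print(hamming(bits, np.zeros(8, dtype=np.uint8)))   # 3 bits differ from the all-zero string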
differential equations generated by applying (4.32) to each pixel of the window
centered on the pixel being processed, imposing the constraint that the disparity
varies continuously over the pixels of the same window. This process of minimizing
the functional is repeated at each pixel of the image to estimate the disparity. This
methodology is described in detail in Sect. 6.4.
Another way to use the gradient is to accumulate the horizontal and vertical gradient
differences calculated for each pixel of the two windows WL and WR to be compared.
The result is a measure of dissimilarity between the accumulated gradients of the
two windows. We can consider the dissimilarity measures DG_SAD and DG_SSD
based, respectively, on the sum of the absolute differences (SAD) and on the sum
of the squared differences (SSD) of the horizontal (∇x W ) and vertical (∇y W )
components of the gradient of the local structures to be compared. In this case the DG
dissimilarity measures, based on the components of the gradient vector accumulated
according to the SSD or SAD sum, are given, respectively, by the following functions:
D_{G_{SSD}}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left[ \nabla_x W_L(x_L + k, y_L + l) - \nabla_x W_R(i + m + k, j + n + l) \right]^2 + \left[ \nabla_y W_L(x_L + k, y_L + l) - \nabla_y W_R(i + m + k, j + n + l) \right]^2

D_{G_{SAD}}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \left| \nabla_x W_L(x_L + k, y_L + l) - \nabla_x W_R(i + m + k, j + n + l) \right| + \left| \nabla_y W_L(x_L + k, y_L + l) - \nabla_y W_R(i + m + k, j + n + l) \right|
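A minimal sketch of the gradient-based SAD dissimilarity above, using np.gradient as an illustrative derivative operator (the function name is an assumption):

import numpy as np

def gradient_sad(wl, wr):
    # Accumulated absolute difference of the horizontal and vertical
    # gradient components of two windows (the D_G_SAD measure above).
    gly, glx = np.gradient(wl.astype(np.float64))
    gry, grx = np.gradient(wr.astype(np.float64))
    return float(np.abs(glx - grx).sum() + np.abs(gly - gry).sum())

wl = np.random.rand(5, 5)
print(gradient_sad(wl, wl))          # 0.0 for identical windows
print(gradient_sad(wl, wl + 10.0))   # still 0.0: insensitive to a constant intensity offset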
Global methods based on minimizing a global energy function, formulated as the sum of
evidence and compatibility terms, have also been proposed. Methods based on dynamic
programming attempt to reduce the complexity of the problem into smaller and simpler
subproblems by setting a functional cost in the various stages with appropriate constraints.
The best known algorithms are: Graph Cuts [30], Belief Propagation [31], Intrinsic Curves
[32], Nonlinear Diffusion [33]. Further methods have been developed, known as
hybrid or semiglobal methods, which essentially use approaches similar to the global
ones, but operate on parts of the image, such as line by line, to reduce the
considerable computational load required by the global methods.
The PMF stereo vision algorithm: A stereo correspondence algorithm using a disparity
gradient limit, proposed in 1985 by Pollard, Mayhew, and Frisby [28], is
based on a computational model in which the problem of stereo correspondence is
seen as intimately integrated with the problem of identifying and describing the
candidate homologous elementary structures (primal sketch). This contrasts with the
computational model proposed by Marr and Poggio, which treats stereo vision with
separate modules for the identification of the elementary structures and for the
correspondence problem.
The PMF algorithm differs mainly because it includes the constraint of the continuity
of the visible surface (figural continuity). Structures that are candidates as homologous
must have a disparity value contained in a certain range, and if there are several
candidate homologous structures, some will have to be eliminated by verifying whether
the structures in the vicinity of the candidate support the same relation of surface
continuity (figural continuity).
Fig. 4.59 Concept of the disparity gradient: projections AL, Ac, AR and BL, Bc, BR of the points A and B in the left, cyclopean, and right images
The constraint of figural continuity eliminates many false pairs of homologous
structures when applied to natural and simulated images. In particular, PMF exploits the
findings of Burt and Julesz [34], who observed that, in human binocular vision, the
image fusion process is tolerated for homologous structures with a disparity gradient
of value up to 1. The disparity gradient of two elementary structures A and B (points,
contours, zero crossings, SdI, etc.), identified in the pair of stereo images, is given by the
ratio of the difference between their disparity values and their cyclopean separation
(see Fig. 4.59, concept of disparity gradient).
The disparity gradient measures the relative disparity of two pairs of homologous
elementary structures. The authors of the PMF algorithm assert that the fusion process
of binocular images can tolerate homologous structures within the unit value of the
disparity gradient. In this way, false homologous structures are avoided and the constraint
of the continuity of the visible surface, introduced previously, is implicitly satisfied.
Now let’s see how to calculate the disparity gradient by considering pairs of
points (A, B) in the two stereo images (see Fig. 4.59). Let AL = (xAL , yAL ) and
BL = (xBL , yBL ), AR = (xAR , yAR ) and BR = (xBR , yBR ) the projection in the two
stereo images of the points A and B of the visible 3D surface. The disparity value for
this pair of points A and B is given by
dA = xAR − yAL and dB = xBR − yBL (4.33)
A cyclopic image (see Fig. 4.59) is obtained by projecting the considered points A
and B, respectively, in the points Ac and Bc whose coordinates are given by the
average of the coordinates of the same points A and B projected in the stereo image
pair. The coordinates of the cyclopic points Ac and Bc are
x_{Ac} = \frac{x_{AL} + x_{AR}}{2} \quad \text{and} \quad y_{Ac} = y_{AL} = y_{AR} \qquad (4.34)

x_{Bc} = \frac{x_{BL} + x_{BR}}{2} \quad \text{and} \quad y_{Bc} = y_{BL} = y_{BR} \qquad (4.35)
The cyclopean separation S is given by the Euclidean distance between the cyclopean points:

S(A, B) = \sqrt{(x_{Ac} - x_{Bc})^2 + (y_{Ac} - y_{Bc})^2}
        = \sqrt{\left(\frac{x_{AL} + x_{AR}}{2} - \frac{x_{BL} + x_{BR}}{2}\right)^2 + (y_{Ac} - y_{Bc})^2}
        = \sqrt{\frac{1}{4}\,(x_{AL} - x_{BL} + x_{AR} - x_{BR})^2 + (y_{Ac} - y_{Bc})^2}
        = \sqrt{\frac{1}{4}\,\left[d_x(A_L, B_L) + d_x(A_R, B_R)\right]^2 + (y_{Ac} - y_{Bc})^2} \qquad (4.36)
where dx(AL , BL) and dx(AR , BR) are the horizontal distances of the points A and B
projected in the two stereo images. The difference in disparity between the pairs of
points (AL , AR ) and (BL , BR ) is given as follows:
D(A, B) = d_A - d_B = (x_{AR} - x_{AL}) - (x_{BR} - x_{BL}) = (x_{AR} - x_{BR}) - (x_{AL} - x_{BL}) = d_x(A_R, B_R) - d_x(A_L, B_L) \qquad (4.37)
The disparity gradient G for the pair of homologous points (AL , AR ) and (BL , BR ) is
given by the ratio of the difference of disparity D(A, B) to the cyclopean separation
S(A, B) given by Eq. (4.36):

G(A, B) = \frac{D(A, B)}{S(A, B)} = \frac{d_x(A_R, B_R) - d_x(A_L, B_L)}{\sqrt{\frac{1}{4}\left[d_x(A_L, B_L) + d_x(A_R, B_R)\right]^2 + (y_{Ac} - y_{Bc})^2}} \qquad (4.38)
With the definition given for the disparity gradient G, the constraint of the disparity
gradient limit is immediate, since Eq. (4.38) shows that G can never exceed the
unit. It follows that even small differences in disparity are not acceptable if the points
A and B, considered in 3D space, are very close to each other. This is easy to
understand and is supported by the physical evidence gathered by the PMF authors.
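As a small numeric aid, the following Python sketch (illustrative, not from the text) evaluates the disparity gradient of Eqs. (4.33)–(4.38) for a pair of matched points, assuming the epipolar constraint so that the y coordinates coincide in the two images:

import math

def disparity_gradient(AL, AR, BL, BR):
    # Disparity gradient G(A, B) of Eqs. (4.33)-(4.38); each argument is
    # an (x, y) projection, with y equal in the left and right images.
    dxL = AL[0] - BL[0]                 # d_x(A_L, B_L)
    dxR = AR[0] - BR[0]                 # d_x(A_R, B_R)
    D = dxR - dxL                       # difference in disparity, Eq. (4.37)
    yAc, yBc = AL[1], BL[1]             # cyclopean y coordinates, Eqs. (4.34)-(4.35)
    S = math.sqrt(0.25 * (dxL + dxR) ** 2 + (yAc - yBc) ** 2)   # Eq. (4.36)
    return D / S

# two points with equal disparity give G = 0
print(disparity_gradient((10, 5), (8, 5), (20, 9), (18, 9)))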
The PMF algorithm includes the following steps:
4. Choose the homologous structures with the highest likelihood index. The uniqueness
constraint removes the other incorrect homologous pairs, which are excluded from
further consideration.
5. Return to step 2 and recompute the indices, considering the homologous points
already derived.
6. The algorithm terminates when all possible pairs of homologous points have been
extracted.
From the procedure described it is observed that the PMF algorithm assumes that a set
of candidate homologous points is found in each stereo image and proposes
to find the correspondence for pairs of points (A, B), i.e., for pairs of homologous
points. The calculation of the correspondence of the single points is facilitated by
the constraint of epipolar geometry, and the uniqueness of the correspondence
is used in step 4 to prevent the same point from being used more than once in the
calculation of the disparity gradient.
The likelihood index expresses the fact that the more unlikely the correspondences are,
the more distant they are from the limit value of the disparity gradient. In fact, candidate
pairs of homologous points are considered to be those that have a disparity gradient close
to unity. It is reasonable to consider only pairs that fall within a circular area of radius
seven, although this value depends on the geometry of the vision system and of the scene.
This means that even small values of the disparity difference D are easily detected
and discarded when caused by points that are very close together in 3D space. The
PMF algorithm has been successfully tested on several natural and artificial scenes.
When the constraints of uniqueness, epipolarity, and the disparity gradient limit
are violated, the results are not good. In these latter conditions, it is possible to use
algorithms that calculate the correspondence for a number of points greater than two.
For example, we can organize the Structures of Interest (SdI) as a set of nodes
related to each other in topological terms. Their representation can be organized with
a graph H(V, E), where V = {A1 , A2 , . . . , An } is the set of Structures of Interest Ai
and E = {e1 , e2 , . . . , en } is the set of arcs that constitute the topological relations
between nodes (for example, based on the Euclidean distance).
In this case, the matching process reduces to finding the structures of interest
SdIL and SdIR in the pair of stereo images, organizing them in the form of graphs,
and subsequently comparing the graphs or subgraphs (see Fig. 4.60).
In graph theory, the comparison of graphs or subgraphs is called graph or subgraph
isomorphism.
The problem of graph isomorphism, however, emerges considering that the graphs
HL and HR that can be generated with the set of potentially homologous points can be
many, and such graphs can never be identical due to the diversity of the stereo image
pair. The problem can be solved by evaluating the similarity of the graphs or subgraphs.
In the literature, several algorithms are proposed for the graph comparison problem [35–38].
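As a minimal sketch of the idea (not the comparison algorithms of [35–38] themselves), the following Python code builds a graph H(V, E) from a set of Structures of Interest, connecting nodes whose Euclidean distance is below a threshold; the function name and the max_dist parameter are illustrative assumptions:

import numpy as np

def build_sdi_graph(points, max_dist=50.0):
    # Nodes are the indices of the Structures of Interest; arcs connect
    # structures closer than max_dist (a topological relation based on
    # the Euclidean distance, as suggested in the text).
    pts = np.asarray(points, dtype=np.float64)
    n = len(pts)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= max_dist:
                edges.append((i, j))
    return {"V": list(range(n)), "E": edges}

HL = build_sdi_graph([(10, 10), (40, 12), (200, 200)])
print(HL["E"])   # [(0, 1)]: only the two nearby structures are connected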
[Fig. 4.60: graphs HL and HR built from the structures of interest in the two stereo images]
References
1. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing, in Proceedings
of the Royal Society of London. Series B, Biological Sciences, vol. 211 (1981), pp. 151–180
2. D. Marr, E. Hildreth, Theory of edge detection, in Proceedings of the Royal Society of London.
Series B, Biological Sciences, vol. 207 (1167) (1980), pp. 187–217
3. S.W. Kuffler, Discharge patterns and functional organization of mammalian retina. J. Neuro-
physiol. 16(1), 37–68 (1953)
4. C. Enroth-Cugell, J.G. Robson, The contrast sensitivity of retinal ganglion cells of the cat. J.
Neurophysiol. 187(3), 517–552 (1966)
5. F.W. Campbell, J.G. Robson, Application of fourier analysis to the visibility of gratings. J.
Physiol. 197, 551–566 (1968)
6. D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in
the cat’s visual cortex. J. Physiol. 160(1), 106–154 (1962)
7. V. Bruce, P. Green, Visual Perception: Physiology, Psychology, and Ecology, 4th edn. (Lawrence
Erlbaum Associates, 2003). ISBN 1841692387
8. H. von Helmholtz, Handbuch der physiologischen optik, vol. 3 (Leopold Voss, Leipzig, 1867)
9. R.K. Olson, F. Attneave, What variables produce similarity grouping? J. Physiol. 83, 1–21
(1970)
10. B. Julesz, Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
11. M. Minsky, A framework for representing knowledge, in The Psychology of Computer Vision,
ed. by P. Winston (McGraw-Hill, New York, 1975), pp. 211–277
12. D. Marr, H. Nishihara, Representation and recognition of the spatial organization of three-
dimensional shapes. Proc. R. Soc. Lond. 200, 269–294 (1978)
13. I. Biederman, Recognition-by-components: a theory of human image understanding. Psychol.
Rev. 94, 115–147 (1987)
14. J.E. Hummel, I. Biederman, Dynamic binding in a neural network for shape recognition.
Psychol. Rev. 99(3), 480–517 (1992)
15. T. Poggio, C. Koch, Ill-posed problems in early vision: from computational theory to analogue
networks, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol.
226 (1985), pp. 303–323
16. B. Julesz, Foundations of Cyclopean Perception (The MIT Press, 1971). ISBN 9780262101134
17. L. Ungerleider, M. Mishkin, Two cortical visual systems, in ed. by D.J. Ingle, M.A.
Goodale, R.J.W. Mansfield, Analysis of Visual Behavior (MIT Press, Cambridge MA, 1982),
pp. 549–586
18. D. Marr, Vision: A Computational Investigation into the Human Representation and Processing
of Visual information, 1st edn. (The MIT Press, 2010). ISBN 978-0262514620
19. W.E.L. Grimson, From Images to Surfaces: A Computational Study of the Human Early Visual
System, 4th edn. (MIT Press, Cambridge, Massachusetts, 1981). ISBN 9780262571852
20. D. Marr, T. Poggio, A computational theory of human stereo vision, in Proceedings of the
Royal Society of London. Series B, vol. 204 (1979), pp. 301–328
21. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Approach (MIT Press, Cam-
bridge, Massachusetts, 1996)
22. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge,
2003)
23. R.Y. Tsai, A versatile camera calibration technique for 3D machine vision. IEEE J. Robot.
Autom. 4, 323–344 (1987)
24. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach.
Intell. 22(11), 1330–1334 (2000)
25. A. Goshtasby, S.H. Gage, J.F. Bartholic, A two-stage cross correlation approach to template
matching. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 374–378 (1984)
26. W. Zhang, K. Hao, Q. Zhang, H. Li, A novel stereo matching method based on rank transfor-
mation. Int. J. Comput. Sci. Issues 2(10), 39–44 (2013)
27. R. Zabih, J. Woodfill, Non-parametric local transforms for computing visual correspondence,
in Proceedings of the 3rd European Conference Computer Vision (1994), pp. 150–158
28. S.B. Pollard, J.E.W. Mayhew, J.P. Frisby, PMF: a stereo correspondence algorithm using a
disparity gradient limit. Perception 14, 449–470 (1985)
29. D. Scharstein, Matching images by comparing their gradient fields, in Proceedings of 12th
International Conference on Pattern Recognition, vol. 1 (1994), pp. 572–575
30. Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts. IEEE
Trans. Pattern Anal. Mach. Intell. 11(23), 1222–1239 (2001)
31. J. Sun, N.N. Zheng, H.Y. Shum, Stereo matching using belief propagation, in Proceedings of
the European Conference Computer Vision (2002), pp. 510–524
32. C. Tomasi, R. Manduchi, Stereo matching as a nearest-neighbor problem. IEEE Trans. Pattern
Anal. Mach. Intell. 20, 333–340 (1998)
33. D. Scharstein, R. Szeliski, Stereo matching with nonlinear diffusion. Int. J. Comput. Vis.
28(2), 155–174 (1998)
34. P. Burt, B. Julesz, A disparity gradient limit for binocular fusion. Science 208, 615–617 (1980)
35. N. Ayache, B. Faverjon, Efficient registration of stereo images by matching graph descriptions
of edge segments. Int. J. Comput. Vis. 2(1), 107–131 (1987)
36. D.H. Ballard, C.M. Brown, Computer Vision (Prentice Hall, 1982). ISBN 978-0131653160
37. R. Horaud, T. Skordas, Stereo correspondence through feature grouping and maximal cliques.
IEEE Trans. Pattern Anal. Mach. Intell. 11(11), 1168–1180 (1989)
38. A. Branca, E. Stella, A. Distante, Feature matching by searching maximum clique on high order
association graph, in International Conference on Image Analysis and Processing (1999), pp.
642–658
5 Shape from Shading

5.1 Introduction
With Shape from Shading, in the field of computer vision, we intend to reconstruct
the shape of the visible 3D surface using only the brightness variation information,
i.e., the gray-level shades present in the image.
It is well known that an artist is able to represent the geometric shape of the
objects of the world in a painting (black and white or color) by creating shades of gray or
color. Looking at the painting, the human visual system analyzes these shades of
brightness and can perceive the shape information of the 3D objects even if they are
represented in the two-dimensional painting. The artist's ability consists in projecting
the 3D scene onto the 2D plane of the painting, creating for the observer, through the
shades of gray (or color) level, the impression of a 3D view of the scene.
The Shape from Shading approach essentially poses the analogous problem:
from the variation of the luminous intensity of the image we intend to reconstruct the
visible surface of the scene. In other words, the inverse problem of reconstructing the
shape of the visible surface from the brightness variations present in the image is
known as the Shape from Shading problem.
In Chap. 2 Vol. I, the fundamental aspects of radiometry involved in the image
formation process were examined, culminating in the definition of the fundamental
formula of radiometry. These aspects will have to be considered to solve the problem
of Shape from Shading, finding solutions based on reliable physical–mathematical
foundations and taking into account the complexity of the problem.
The statement reconstruction of the visible surface must not be strictly understood
as a 3D reconstruction of the surface. We know, in fact, that from a single point of
observation of the scene, a monocular vision system cannot estimate a distance
measure between observer and visible object.1 Horn [1] in 1970 was the first to
introduce the Shape from Shading paradigm by formulating a solution based on
the knowledge of the light source (direction and distribution), the scene reflectance
model, the observation point and the geometry of the visible surface, which together
contribute to the process of image formation.
In other words, Horn has derived the relations between the values of the luminous
intensity of the image and the geometry of the visible surface (in terms of the orienta-
tion of the surface point by point) under some lighting conditions and the reflectance
model. To understand the paradigm of the Shape from Shading, it is necessary to
introduce two concepts: the reflectance map and the gradient space.
The basic concept of the reflectance map is to determine a function that calculates
the orientation of the surface from the brightness of each point of the scene. The
fundamental relation Eq. (2.34), of the image formation process, described in Chap. 2
Vol. I, allows to evaluate the brightness value of a generic point a of the image,
generated by the luminous radiance reflected from a point A of the object.2 We also
know that the process of image formation is conditioned by the characteristics of
the optical system (focal length f and diameter of the lens d) and by the model of
reflectance considered. We recall that the fundamental equation linking the image
irradiance E to the scene radiance is given by the following:

E(a, \psi) = \frac{\pi}{4}\left(\frac{d}{f}\right)^2 \cos^4\psi \; L(A, \theta) \qquad (5.1)
where L is the radiance of the object's surface and θ is the angle formed with the
optical axis by the light ray coming from a generic point A of the object
(see Fig. 5.1). It should be noted that the image irradiance E is linearly related to
the radiance of the surface, is proportional to the area of the lens (defined by the
diameter d), is inversely proportional to the square of the distance between lens
and image plane (dependent on the focal length f), and decreases with
the increase of the angle ψ between the optical axis and the line of sight.
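A minimal Python sketch evaluating Eq. (5.1) (purely illustrative; the numeric values of L, d, f and ψ are assumptions chosen only to show the cos⁴ψ fall-off):

import math

def image_irradiance(L, d, f, psi):
    # Eq. (5.1): irradiance of the image point as a function of the scene
    # radiance L, lens diameter d, focal length f and off-axis angle psi.
    return (math.pi / 4.0) * (d / f) ** 2 * math.cos(psi) ** 4 * L

print(image_irradiance(L=100.0, d=0.01, f=0.05, psi=0.0))
print(image_irradiance(L=100.0, d=0.01, f=0.05, psi=math.radians(20)))  # smaller, off-axis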
1 While with the stereo approach we have a quantitative measure of the depth of the visible surface,
with shape from shading we obtain a nonmetric but qualitative (ordinal) reconstruction of the
surface.
2 In this paragraph, we will indicate with A the generic point of the visible surface and with a its
projection in the image plane, instead of, respectively, P and p, indicated in the radiometry chapter,
to avoid confusion with the Gradient coordinates that we will indicate in the next paragraph with
( p, q).
Fig. 5.1 Relationship between the radiance of the object and the irradiance of the image
Fig. 5.2 Diagram of the components of a vision system for the shape from shading. The surface
element receives in A the radiant flow L i from the source S with direction s = (θi , φi ) and a part
of it is emitted in relation to the typology of material. The brightness of the I (x, y) pixel depends
on the reflectance properties of the surface, on its shape and orientation defined by the angle θi
between the normal n and the vector s (source direction), from the properties of the optical system
and from the exposure time (which also depends on the type of sensor)
We also rewrite (with the symbols according to Fig. 5.2), Eq. (2.16) of the image
irradiance, described in Chap. 2 Vol. I Radiometric model, given by
I(x, y) \cong L_e(X, Y, Z) = L_e(A, \theta_e, \phi_e) = F(\theta_i, \phi_i; \theta_e, \phi_e) \cdot E_i(\theta_i, \phi_i) \qquad (5.2)
Figure 5.2 schematizes all the components of the image formation process highlight-
ing the orientation of the surface in A and of the lighting source. The orientation of
the surface in A(X, Y, Z ) is given by the vector n, which indicates the direction of the
normal to the tangent plane to the visible surface passing through A. Recall that the
BRDF function defines the relationship between L e (θe , φe ) the radiance reflected by
the surface in the direction of the observer and E i (θi , φi ) the irradiance incident on
the object at the point A coming from the source with known radiant flux L i (θi , φi )
(see Sect. 2.3 of Vol. I).
In the Lambertian modeling conditions, the image irradiance I L (x, y) is given by
Eq. (2.20) described in Chap. 2 Vol. I Radiometric model, which we rewrite as
I_L(x, y) \cong E_i \cdot \frac{\rho}{\pi} = \frac{\rho}{\pi} L_i \cos\theta_i \qquad (5.3)
where 1/π is the BRDF of a Lambertian surface, θi is the angle between the vector n,
normal to the surface in A, and the vector s = (θi , φi ), which represents the direction
of the incident irradiance Ei in A generated by the source S, which emits a radiant
flux incident in the direction (θi , φi ). ρ indicates the albedo of the surface, seen
as the reflectance coefficient that expresses the ability of the surface to reflect/absorb
the incident irradiance at any point. The albedo has values in the range 0 ≤ ρ ≤ 1 and,
by energy conservation, the missing part is due to absorption. Often a surface with a
3 In this context, the influence of the optical system, the exposure time, and the characteristics of the
capture sensor are excluded.
In this context, we can indicate with R(A, θi) the radiance of the object expressed
by (5.4), which, substituted in (5.5), gives

E(a) = R(A, \theta_i) \qquad (5.6)

which is the fundamental equation of Shape from Shading. Equation (5.6) directly
links the luminous intensity, i.e., the irradiance E(a) of the image, to the orientation
of the visible surface at the point A, given by the angle θi between the surface normal
vector n and the vector s of the incident light direction (see Fig. 5.2).
We summarize the properties of this simple Lambertian radiometric model:
Fig. 5.3 Graphical representation of the gradient space. Planes parallel to the x−y plane (for
example, the plane Z = 1, whose normal has components p = q = 0) have zero gradients in both
the x and y directions. For a generic patch (not parallel to the x−y plane) of the visible surface, the
orientation of the normal vector (p, q) is given by (5.10) and the equation of the plane containing
the patch is Z = px + qy + k. In the gradient space p−q the orientation of each patch is represented
by a point indicated by the gradient vector (p, q). The direction of the source, given by the vector s,
is also reported in the gradient space and is represented by the point (ps , qs )
to review both the reference systems of the visible surface and that of the source,
which becomes the reference system of the image plane (x, y) as shown in Figs. 5.2
and 5.3.
It can be observed how, in the hypothesis of a convex visible surface, for each
of its generic point A we have a tangent plane, and its normal outgoing from A
indicates the attitude (or orientation) of the 3D surface element in space represented
by the point A(X, Y, Z ). The projection of A in the image plane identifies the point
a(x, y) which can be obtained from the perspective equations (see Sect. 3.6 Vol. II
Perspective Transformation):
x = f\,\frac{X}{Z} \qquad y = f\,\frac{Y}{Z} \qquad (5.7)
remembering that f is the focal length of the optical system. If the distance of the
object from the vision system is very large, the geometric projection model can be
simplified by assuming the following orthographic projection to be valid:

x = X \qquad y = Y \qquad (5.8)

which means projecting the points of the surface through parallel rays so that, up to
a scale factor, the horizontal and vertical coordinates of the reference system of the
image plane (x, y) and of the world reference system (X, Y, Z) coincide.
Letting f → ∞ implies that Z → ∞, so that f/Z tends to unity, which justifies
(5.8) for the orthographic geometric model.
Under these conditions, for the geometric reconstruction of the visible surface,
i.e., to determine the distance Z of each point A from the observer, Z can be thought
of as a function of the coordinates of the same point A projected in the image plane
in a(x, y). Reconstructing the visible surface then means finding the function:

z = Z(x, y) \qquad (5.9)
The fundamental equation of Shape from Shading (5.6) does not allow calculating the
function of the distances Z(x, y) directly; rather, through the reflectance map R(A, θn),
the orientation of the surface can be reconstructed point by point, obtaining the
so-called orientation map. In other words, this involves calculating, for each point A
of the visible surface, the normal vector n, i.e., the local slope of the visible surface,
which expresses how the tangent plane passing through A is oriented with respect to
the observer.
The local slopes at each point A(X, Y, Z) are estimated by evaluating the partial
derivatives of Z(x, y) with respect to x and y. The gradient of the surface Z(x, y)
at the point A(X, Y, Z) is given by the vector (p, q) obtained with the following
partial derivatives:

p = \frac{\partial Z(x, y)}{\partial x} \qquad q = \frac{\partial Z(x, y)}{\partial y} \qquad (5.10)
∂x ∂y
where p and q are, respectively, the components relative to the x-axis and y-axis of
the surface gradient.
Having calculated, through the gradient of the surface, the orientation of the tangent
plane, the gradient vector (p, q) is linked to the normal n of the surface element centered
in A by the following relationship:

n = (p, q, 1)^T \qquad (5.11)

which shows how the orientation of the surface in A can be expressed by the normal
vector n, of which we are interested only in the direction and not in the module. More
precisely, (5.11) tells us that, in correspondence with unit variations of the distance
Z, the variations Δx and Δy in the image plane around the point a(x, y) must be
p and q, respectively.
To obtain a unit normal vector, it is necessary to divide the normal n to the
surface by its length:

n_N = \frac{n}{|n|} = \frac{(p, q, 1)}{\sqrt{1 + p^2 + q^2}} \qquad (5.12)
The pair ( p, q) constitutes the gradient space that represents the orientation of each
element of the surface (see Fig. 5.3). The gradient space expressed by the (5.11)
can be seen as a plane parallel to the plane X-Y placed at the distance Z = 1. The
geometric characteristics of the visible surface can be specified as a function of the
coordinates of the image plane x-y while the coordinates p and q of the gradient
space have been defined to specify the orientation of the surface. The map of the
The origin of the gradient space is given by the normal vector (0, 0, 1), which is
normal to the image plane, that is, with the visible surface parallel to the image
plane. The more the normal vectors move away from the origin of the gradient
space, the larger the inclination of the visible surface is compared to the observer.
In the gradient space is also reported the direction of the light source expressed by
the gradient components ( ps , qs ). The shape of the visible surface, as an alternative
to the gradient space ( p, q), can also be expressed by considering the angles (σ, τ )
which are, respectively, the angle between the normal n and the Z -axis (reference
system of the scene) which is the direction of the observer, and the angle between
the projection of the normal n in the image plane and the x-axis of the image plane.
The reflectance map (i.e., the reflected radiance L e (A, θn ) of the Lambertian surface)
expressed by Eq. (5.4) can be rewritten with the coordinates p and q of the gradient
space and by replacing the (5.14) becomes
R(A, \theta_n) = \frac{\rho}{\pi} L_i\, \frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \qquad (5.15)
This equation represents the starting point for applying the Shape from Shading, to
the following conditions:
3. The optical system has a negligible impact and the visible surface of the objects
is illuminated directly by the source with incident radiant flow L i ;
4. The optical axis coincides with the Z -axis of the vision system and the visible
surface Z (x, y) is described for each point in terms of orientation of the normal
in the gradient space ( p, q);
5. Local point source very far from the scene.
The second member of Eq. (5.15), which expresses the radiance L(A, θn ) of the
surface, is called the reflectance map R( p, q) which when rewritten becomes
R(p, q) = \frac{\rho}{\pi} L_i\, \frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \qquad (5.16)
The reflectance can be calculated for a certain type of material and, for a defined
type of illumination, it can be calculated for all possible orientations p and q of the
surface to produce the reflectance map R(p, q), which can have normalized values
(maximum value 1) so as to be invariant with respect to the variability of the acquisition
conditions. Finally, admitting the invariance of the radiance of the visible surface,
namely that the radiance L(A), expressed by (5.15), is equal to the irradiance of
the image E(x, y), as already expressed by Eq. (5.5), we finally obtain the following
equation of image irradiance:

E(x, y) = R_{l,s}(p, q) \qquad (5.17)
where l and s indicate the Lambertian reflectance model and the source direction,
respectively. This equation tells us that the irradiance (or luminous intensity) in the
image plane at the location (x, y) is equal to the value of the reflectance map R(p, q)
corresponding to the orientation (p, q) of the surface of the scene. If the reflectance
map is known (computable with (5.16)) for a given position of the source, the
reconstruction of the visible surface z = Z(x, y) is possible in terms of its orientation
(p, q) for each point (x, y) of the image.
Let us remember that the orientation of the surface is given by the gradient (p = ∂Z/∂x, q = ∂Z/∂y).
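A minimal NumPy sketch of the Lambertian reflectance map of Eq. (5.16) (illustrative only; the clipping of negative values, which correspond to surface patches facing away from the source, and the default Li = π chosen so that the maximum is 1 are added assumptions):

import numpy as np

def reflectance_map(p, q, ps, qs, albedo=1.0, Li=np.pi):
    # Lambertian reflectance map of Eq. (5.16); negative values are clipped to zero.
    num = 1.0 + ps * p + qs * q
    den = np.sqrt(1.0 + p**2 + q**2) * np.sqrt(1.0 + ps**2 + qs**2)
    return np.clip(albedo / np.pi * Li * num / den, 0.0, None)

print(reflectance_map(0.7, 0.3, ps=0.7, qs=0.3))   # 1.0: normal aligned with the source
print(reflectance_map(0.0, 0.0, ps=0.7, qs=0.3))   # < 1: patch facing the observer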
Figure 5.4 shows two examples of reflectance maps, represented graphically by
iso-brightness curves, under the Lambertian reflectance model conditions, with a point
source illuminating a sphere in the direction (ps = 0.7, qs = 0.3) and in the direction
(ps = 0.0, qs = 0.0). In the latter condition, the incident light comes from the same
direction as the observer, that is, source and observer see the visible surface from the
same direction. Therefore, the alignment of the normal n with the source vector s implies
that θn = 0° with cos θn = 1, and consequently, by (5.4), the reflectance takes its maximum
value R(p, q) = (ρ/π)Li. When the two vectors are orthogonal, the surface in A is not
illuminated, having θn = π/2 and cos θn = 0, with reflectance value R(p, q) = 0.
The iso-brightness curves represent the set of points that in the gradient space
have the different orientations ( p, q) but derive from points that in the image plane
(x, y) have the same brightness. In the two previous figures, it can be observed how
these curves are different in the two reflectance maps for having simply changed the
Fig. 5.4 Examples of reflectance maps. a Lambertian spherical surface illuminated by a source
placed in the position ( ps , qs ) = (0.7, 0.3); b iso-brightness curves (according to (5.14), where
R( p, q) = constant, that is, points of the gradient space that have the same brightness) calculated
for the source of the example of figure (a), which is the general case. It is observed that for R( p, q) =
0 the corresponding curve is reduced to a line, for R( ps , qs ) = 1 the curve is reduced to a point
where there is the maximum brightness (levels of normalized brightness between 0 and 1), while for
the intermediate brightness values the curves are ellipsoidal, parabolic, and then hyperbolae, until they
asymptotically become a straight line (zero brightness). c Lambertian spherical surface illuminated
by a source placed at the position ( ps , qs ) = (0, 0), that is, at the top in the same direction as the
observer. In this case, the iso-brightness curves are concentric circles with respect to the point
of maximum brightness (corresponding to the origin (0, 0) of the gradient space). Considering
the image irradiance Eq. (5.16), for the different values of the intensity E i , the equations of the iso-
brightness circumferences are given by p 2 + q 2 = (1/E i − 1) by virtue of (5.14) for ( ps , qs ) = (0, 0)
direction of illumination, oblique to the observer in one case and coinciding with the
direction of the observer in the other. In particular, two points in the gradient
space that lie on the same curve indicate two different orientations of the visible
surface that reflect the same amount of light and are, therefore, perceived with the
same brightness by the observer even if the local surface is oriented differently in
space. The iso-brightness curves highlight the nonlinearity of Eq. (5.16), which links
the radiance to the orientation of the surface.
It should be noted that using a single image I(x, y), even knowing the direction
of the source (ps , qs) and the reflectance model Rl,s(p, q), the SfS problem is not
solved, because with (5.17) it is not possible to calculate the orientation (p, q) of each
surface element in a unique way. In fact, with a single image each pixel has only
one value, the luminous intensity E(x, y), while the orientation of the corresponding
patch is defined by the two components p and q of the gradient according to Eq. (5.17)
(we have only one equation and two unknowns p and q). Figure 5.4a and b shows
how a single pixel intensity E(x, y) corresponds to different orientations (p, q)
belonging to the same iso-brightness curve of the reflectance map. In the following
paragraphs, some solutions to the SfS problem are reported.
5.4 Shape from Shading (SfS) Algorithms
Several researchers [2,3] have proposed solutions to the SfS problem inspired by
the visual perception of 3D surfaces. In particular, in [3] it is reported that the
brain retrieves shape information not only from shading, but also from contours,
elementary features, and visual knowledge of objects. The SfS algorithms developed
in computer vision use ad hoc solutions very different from those hypothesized for
human vision. Indeed, the proposed solutions use minimization methods [4] based on an
energy function, propagation methods [1], which extend the shape information from a set
of surface points to the whole image, and local methods [5], which derive the shape from
the luminous intensity assuming a locally spherical surface.
Let us now return to the fundamental equation of Shape from Shading, Eq. (5.17),
with which we propose to reconstruct the orientation (p, q) of each visible surface
element (i.e., to calculate the orientation map of the visible surface), knowing the
irradiance E(x, y), that is, the energy re-emitted at each point of the image in relation to
the reflectance model. We are faced with an ill-conditioned problem,
since for each pixel E(x, y) we have a single equation with two unknowns p and q.
This problem can be addressed by imposing additional constraints, for example, the
condition that the visible surface is continuous, in the sense that it has minimal geometric
variability (smooth surface). This constraint implies that, in the gradient space, p and
q vary little locally, in the sense that nearby points in the image plane presumably
represent orientations whose positions in the gradient space are also very close to
each other. From this, it follows that the conditions of continuity of the visible surface
will be violated only where large geometric variations occur, which normally happen
at edges and contours.
A strategy to solve the image irradiance Eq. (5.17) consists not in finding an exact
solution but in defining a function to be minimized that includes a term representing
the error of the image irradiance equation, eI, and a term that enforces the constraint
of geometric continuity, ec.
The first term eI is given by the difference between the irradiance of the image
E(x, y) and the reflectance function R(p, q):
e_I = \iint \left[ E(x, y) - R(p, q) \right]^2 dx\, dy \qquad (5.18)
The second term ec , based on the constraint of the geometric continuity of the surface
is derived from the condition that the directional gradients p and q vary very slowly
(and to a greater extent their partial derivatives), respectively, with respect to the
direction of the x and y. The error ec due to geometric continuity is then defined by
minimizing the integral of the sum of the squares of such partial derivatives, and is
given by
e_c = \iint \left( p_x^2 + p_y^2 + q_x^2 + q_y^2 \right) dx\, dy \qquad (5.19)
The total error function eT to be minimized, which includes the two terms of previous
errors, is given by combining these errors as follows:
e_T = e_I + \lambda e_c = \iint \left\{ \left[ E(x, y) - R(p, q) \right]^2 + \lambda \left( p_x^2 + p_y^2 + q_x^2 + q_y^2 \right) \right\} dx\, dy \qquad (5.20)
where λ is a positive parameter that weighs the influence of the geometric continuity
error ec with respect to that of the image irradiance. A possible way to minimize
the total error function is the variational calculus, which through an iterative process
determines the minimum acceptable value of the error. It should be noted that the
function to be minimized (5.20) depends on the surface orientations p(x, y) and
q(x, y), which are functions of the variables x and y of the image plane.
Recall from (5.5) that the irradiance E(x, y) can be represented by the digital
image I (i, j), where i and j are the row and column coordinates, respectively, to
locate a pixel at location (i, j) containing the observed light intensity, coming from
the surface element, whose orientation is denoted in the gradient space with pi j and
qi j . Having defined these new symbols for the digital image, the procedure to solve
the problem of the Shape from Shading, based on the minimization method, is the
following:
1. Orientation initialization. For each pixel I(i, j), initialize the orientations p_ij^0 and q_ij^0.
2. Equivalence constraint between image irradiance and reflectance map. The
luminous intensity I (i, j) for each pixel must be very similar to that produced
by the reflectance map, derived analytically under Lambertian conditions or evaluated experimentally knowing the optical properties of the surface and the orientation of the source.
3. Constraint of geometric continuity. Calculation of the partial derivatives of the reflectance map (∂R/∂p, ∂R/∂q), analytically when R(p_ij, q_ij) is Lambertian (by virtue of Eqs. 5.14 and 5.17), or estimated numerically from the reflectance map obtained experimentally.
4. Iterative calculation of the gradient estimate (p, q). Iterative process, based on the Lagrange multiplier method, which minimizes the total error e_T defined with (5.20) through the following update rules that find p_ij and q_ij to reconstruct the unknown surface z = Z(x, y):

p_ij^(n+1) = p̄_ij^n + λ [I(i, j) − R(p̄_ij^n, q̄_ij^n)] ∂R/∂p
q_ij^(n+1) = q̄_ij^n + λ [I(i, j) − R(p̄_ij^n, q̄_ij^n)] ∂R/∂q     (5.21)
Fig. 5.5 Examples of orientation maps calculated from images acquired with Lambertian illumi-
nation with source placed at the position ( ps , qs ) = (0, 0). a Map of the orientation of the spherical
surface of Fig. 5.4c. b Image of real objects with flat and curved surfaces. c Orientation map relative
to the image of figure b
where p̄_ij and q̄_ij denote the means of p_ij and q_ij computed over the four neighboring pixels of location (i, j) in the image plane, with the following equations:

p̄_ij = (p_{i+1,j} + p_{i−1,j} + p_{i,j+1} + p_{i,j−1}) / 4
q̄_ij = (q_{i+1,j} + q_{i−1,j} + q_{i,j+1} + q_{i,j−1}) / 4     (5.22)
The iterative process (step 4) continuously updates p and q until the total error eT
reaches a reasonable minimum value after several iterations, or stabilizes. It should
be noted that, although in the iterative process the estimates of p and q are evaluated
locally, a global consistency of surface orientation is realized with the propagation
of constraints (2) and (3), after many iterations.
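A minimal NumPy sketch of this iterative scheme is given below, assuming a Lambertian reflectance map with a source at (p_s, q_s); the convergence parameters, the periodic border handling via np.roll, and the numerical derivatives of R are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def sfs_iterative(I, ps, qs, lam=0.1, n_iter=200):
    """Sketch of the minimization-based SfS scheme (Eqs. 5.20-5.22):
    alternates the 4-neighbour average of (p, q) with a correction
    proportional to the irradiance error times dR/dp and dR/dq."""
    p = np.zeros_like(I, dtype=float)   # step 1: initialize orientations
    q = np.zeros_like(I, dtype=float)

    def R(p, q):  # assumed Lambertian reflectance map, source at (ps, qs)
        num = 1.0 + ps * p + qs * q
        den = np.sqrt(1 + p**2 + q**2) * np.sqrt(1 + ps**2 + qs**2)
        return np.clip(num / den, 0.0, None)

    eps = 1e-3
    for _ in range(n_iter):
        # 4-neighbour averages p_bar, q_bar (Eq. 5.22); borders wrap for brevity
        pb = 0.25 * (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
                     np.roll(p, 1, 1) + np.roll(p, -1, 1))
        qb = 0.25 * (np.roll(q, 1, 0) + np.roll(q, -1, 0) +
                     np.roll(q, 1, 1) + np.roll(q, -1, 1))
        err = I - R(pb, qb)                            # irradiance error
        dRdp = (R(pb + eps, qb) - R(pb, qb)) / eps     # numerical dR/dp
        dRdq = (R(pb, qb + eps) - R(pb, qb)) / eps     # numerical dR/dq
        p = pb + lam * err * dRdp                      # update rules (Eq. 5.21)
        q = qb + lam * err * dRdq
    return p, q
```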
Other minimization procedures can be considered to improve the convergence of
the iterative process. The procedure described above to solve the problem of Shape
from Shading is very simple but presents many difficulties when trying to apply it
concretely in real cases. This is due to the imperfect knowledge of the reflectance
characteristics of the materials and the difficulties in controlling the lighting condi-
tions of the scene.
Figure 5.5 shows two examples of orientation maps calculated from images acquired with Lambertian illumination with the source placed at the position (p_s, q_s) = (0, 0). Figure a refers to the sphere of Fig. 5.4c, while figure c refers to a more complex real scene, consisting of overlapping opaque objects whose dominant surfaces (cylindrical and spherical) present good geometric continuity except for the contour zones and the shadow areas (i.e., shading that contributes to errors in reconstructing the shape of the surface), where the intensity levels vary abruptly. Even if the reconstruction of the orientation of the surface does not appear perfect at every pixel of the image, overall, the results of the algorithm of
Shape from Shading are acceptable for the purposes of the perception of the shape
of the visible surface.
For a better visual rendering of the results of the algorithm, Fig. 5.5 graphically displays the gradient information (p, q) in terms of the orientation of the normals (representing the orientation of a surface element) with respect to the observer, drawn as oriented segments (perceived as oriented needles). In the literature, such orientation maps are also called needle maps.
The orientation map together with the depth map (when known and obtained
by other methods, for example, with stereo vision) becomes essential for the 3D
reconstruction of the visible surface, for example, by combining together, through
an interpolation process, the depth and orientation information (a problem that we
will describe in the following paragraphs).
With stereo photometry, we want to recover the 3D surface of the observed objects through the orientation map obtained from different images acquired from the same point of view, but illuminated by known light sources with known directions.
We have already seen above that it is not possible to derive the 3D shape of a surface using a single image with the SfS approach, since there is an indefinite number of orientations that can be associated with the same intensity value (see Fig. 5.4a and b). Moreover, remembering (5.17), the intensity E(x, y) of a pixel has only one degree of freedom while the orientation of the surface has two, p and q. Therefore, additional information is needed to calculate the orientation of a surface element.
One solution is given by the stereo photometry approach [6] which calculates the
( p, q) orientation of a patch using different images of the same scene, acquired
from the same point of view but illuminating the scene from different directions as
shown in Fig. 5.6.
The figure shows three different positions of the light source with the same obser-
vation point, subsequently acquiring images of the scene with different shading. For
each image acquisition, only one lamp is on. The different lighting directions lead to
different reflectance maps. Now let's see how the stereo photometry approach solves the problem of the poor conditioning of the SfS approach.
Figure 5.7a shows two superimposed reflectance maps obtained as prescribed by stereo photometry. For clarity, only the iso-brightness curve 0.4 (in red) of the second reflectance map R2 is superimposed. We know from (5.14) that the Lambertian reflectance function is not linear, as shown in the figure by the iso-brightness curves. The latter represent the different (p, q) orientations of the surface with the same luminous intensity I(i, j), related to each other by (5.17), rewritten as Eq. (5.23).
Fig. 5.6 Acquisition system for the stereo photometry approach. In this experiment, three lamps are used, positioned at the same height and arranged at 120° on the base of an inverted cone, with the base of the objects to be acquired placed at the apex of the cone
The intensity I(i, j) of the pixels, for the images acquired with stereo photometry, varies with the different local orientation of the surface and with the different orientation of the sources (see Fig. 5.6). Therefore, if we consider two images of stereo photometry, I1(i, j) and I2(i, j), according to the Lambertian reflectance model (5.14) we will have two different reflectance maps R1 and R2, associated with the two different orientations s1 and s2 of the sources. Applying (5.23) to the two images, we obtain one image irradiance equation for each reflectance map.
Fig. 5.7 Principle of stereo photometry. The orientation of a surface element is determined through
multiple reflectance maps obtained from images acquired with different orientations of the lighting
source (assuming Lambertian reflectance). a Considering two images I1 and I2 of stereo photom-
etry, they are associated with the corresponding reflectance maps R1 ( p, q) and R2 ( p, q) which
establish a relationship between the pair of intensity values of the pixels (i, j) and orientation of the
corresponding surface element. In the gradient space where the reflectance maps are superimposed,
the orientation ( p, q) associated with the pixel (i, j) can be determined by the intersection of the two
iso-brightness curves which in the corresponding maps represent the respective intensities I1 (i, j)
and I2 (i, j). Two curves can intersect in one or two points thus generating two possible orienta-
tions. b To obtain a single value ( p(i, j), q(i, j)) of the orientation of the patch, it is necessary to
acquire at least another image I3 (i, j) of photometric stereo, superimpose in the gradient space the
corresponding reflectance map R3 ( p, q). A unique orientation ( p(i, j), q(i, j)) is obtained with
the intersection of the third curve corresponding to the value of the intensity I3 (i, j) in the map R3
In the gradient space where the reflectance maps are superimposed, the orientation (p, q) associated with the pixel (i, j) can be determined by the intersection of the two iso-brightness curves which, in the corresponding maps, represent the respective intensities I1(i, j) and I2(i, j).
The figure graphically shows this situation where, for simplicity, several iso-brightness curves for normalized intensity values between 0 and 1 have been plotted only for the map R1^{l,s1}(p, q), while for the reflectance map R2^{l,s2}(p, q) only the curve associated with the luminous intensity I2(i, j) = 0.4, relative to the same pixel (i, j), is plotted. Two curves can intersect in one or two points, thus generating two possible orientations for the same pixel (i, j) (due to the non-linearity of Eq. 5.15). The figure shows two gradient points P(p1, q1) and Q(p2, q2), the intersections of the two iso-brightness curves corresponding to the intensities I1(i, j) = 0.9 and I2(i, j) = 0.4, candidates as possible orientations of the surface corresponding to the pixel (i, j).
To obtain a single value (p(i, j), q(i, j)) of the orientation of the patch, it is necessary to acquire at least one more stereo photometry image I3(i, j), always from the same point of observation but with a different orientation s3 of the source. This involves the calculation of a third reflectance map R3^{l,s3}(p, q), obtaining a third image irradiance equation:

I(i, j) = R(p, q) = L_i (ρ(i, j)/π) · (1 + p_s p + q_s q) / (√(1 + p² + q²) · √(1 + p_s² + q_s²))     (5.26)
where the term ρ, which varies between 0 and 1, indicates the albedo, i.e., the
coefficient that controls in each pixel (i, j) the reflecting power of a Lambertian
surface for various types of materials.
In particular, the albedo takes into account, in relation to the type of material, how
much of the incident luminous energy is reflected toward the observer.4
Now, let’s always consider the Lambertian reflectance model where a source with
diffuse lighting is assumed. In these conditions, if L S is the radiance of the source,
4 From a theoretical point of view, the quantification of the albedo is simple: it would be enough to measure with an instrument the incident and the reflected radiation of a body and take their ratio. In physical reality, the measurement of the albedo is complex for various reasons:
1. the incident radiation does not come from a single source but normally comes from different directions;
2. the energy reflected by a body is never unidirectional but multidirectional, the reflected energy is not uniform in all directions, and a portion of the incident energy can be absorbed by the body itself;
3. the measured reflected energy is only partial due to the limited angular aperture of the detector sensor.
Therefore, the reflectance measurements are to be considered as samples of the BRDF function. The albedo is considered as a coefficient of global average reflectivity of a body. With the BRDF function, it is instead possible to model the directional distribution of the energy reflected by a body within a solid angle.
the radiance received by a patch A of the surface, that is, its irradiance, is given by I_A = π L_S (considering the visible hemisphere).
Considering the Lambertian model of surface reflectance (BRDF_l = 1/π), the brightness of the patch, i.e., its radiance L_A, is given by

L_A = BRDF_l · I_A = (1/π) · π L_S = L_S
Therefore, a Lambertian surface emits the same radiance as the source, and its luminosity does not depend on the observation point but may vary from point to point, up to the multiplicative factor of the albedo ρ(i, j).
In the Lambertian reflectance conditions, (5.26) suggests that the variables become
the orientation of the surface ( p, q) and the albedo ρ. The unit vector of the normal
n to the surface is given by Eq. (5.12) while the orientation vector of a source is
known, expressed by s = (S X , SY , S Z ). The mathematical formalism of the stereo
photometry involves the application of the irradiance Eq. (5.26) for the 3 images
Ik (i, j), k = 1, 2, 3 acquired subsequently for the 3 different orientations Sk =
(Sk,X , Sk,Y , Sk,Z ), k = 1, 2, 3 of the Lambertian diffuse light sources.
Assuming the albedo ρ constant and remembering Eq. (5.14) of the reflectance
map for a Lambertian surface illuminated by a generic orientation sk , the image
irradiance equation Ik (i, j) becomes
I_k(i, j) = (L_i/π) ρ(i, j) (s_k · n)(i, j) = (L_i/π) ρ(i, j) cos θ_k     k = 1, 2, 3     (5.27)
where θk represents the angle between the source vector sk and the normal vector n
of the surface. Equation (5.27) reiterates that the incident radiance L_i, generated by a source in the context of diffuse (Lambertian) light, is modulated by the cosine of the angle formed between the incident light vector and the unit vector of the surface normal, i.e., the light intensity that reaches the observer is proportional to the inner product of these two unit vectors, assuming the albedo ρ constant.
and the unit normal vector of the resulting surface is n = (− p, −q, 1)T . Finally, we
can also calculate the albedo ρ with Eq. (5.30) and considering that the normal n is
a unit vector, we have
ρ = (π/L_i) |S⁻¹ I|   ⟹   ρ = √(n_x² + n_y² + n_z²)     (5.32)
If more than 3 stereo photometric images are available, we have an overdetermined system, with the source matrix S of size m × 3, m > 3. In this case, the system of Eq. (5.28) in matrix terms becomes

  I   =  (1/π)  S  ·  b     (5.33)
(m×1)         (m×3) (3×1)
where, to simplify, b(i, j) = ρ(i, j) n(i, j) indicates the unit normal vector scaled by the albedo ρ(i, j) for each pixel of the image and, similarly, the source direction vectors s_k are scaled by the intensity factor of the incident radiance L_i (s_k L_i); we assume that the sources have identical radiant intensity. The calculation of the normals with the overdetermined system (5.33) can be done with the least squares method, which finds the solution b that minimizes the squared 2-norm of the residual r:

min_b ‖r‖₂² = min_b ‖I − S b‖₂²
b = (SᵀS)⁻¹ Sᵀ I     (5.34)
Equations such as (5.34) are known as normal equations which, when solved, lead to the solution of the least squares problem, provided that the matrix S has rank equal to the number of columns (3 in this case) and the problem is well conditioned.6 Once the system is solved, the normals and the albedo for each pixel are calculated with the analogous Eqs. (5.31) and (5.32), replacing (n_x, n_y, n_z) with (b_x, b_y, b_z), as follows:

ρ = π |(SᵀS)⁻¹ Sᵀ I|   ⟹   ρ = √(b_x² + b_y² + b_z²)     (5.35)
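A compact NumPy sketch of this per-pixel least-squares solution follows; it assumes registered gray-level images, unit-norm source directions in a matrix S, and that the scale factors π and L_i have been absorbed into the measured intensities (all assumptions for illustration, not part of the text).

```python
import numpy as np

def photometric_stereo(images, S):
    """Per-pixel least-squares solution of I = S b (cf. Eqs. 5.33-5.35).
    images : array (m, H, W), m >= 3 registered gray-level images
    S      : array (m, 3), unit direction of the source for each image
    Returns the albedo map (H, W) and the unit-normal map (H, W, 3)."""
    m, H, W = images.shape
    I = images.reshape(m, -1)                  # one column per pixel
    # normal equations b = (S^T S)^(-1) S^T I, solved for all pixels at once
    b = np.linalg.lstsq(S, I, rcond=None)[0]   # shape (3, H*W)
    rho = np.linalg.norm(b, axis=0)            # albedo = |b|  (cf. Eq. 5.35)
    n = b / np.maximum(rho, 1e-12)             # unit normals n = b / rho
    return rho.reshape(H, W), n.T.reshape(H, W, 3)
```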
In the case of color images, for a given kth source there are three image irradiance equations for every pixel, one for each RGB color component, each characterized by its albedo ρ_c:

b_c = ρ_c n = (SᵀS)⁻¹ Sᵀ I_c     (5.37)

and once b_c is calculated we obtain, as before, with (5.32) the value of the albedo ρ_c relative to the color component considered. Normally, the surface of the objects is
5 In fact, we have r² = ‖I − S b‖² = (I − S b)ᵀ(I − S b), and the gradient ∇_b(r²) = 2 SᵀS b − 2 Sᵀ I which, set equal to zero and solved with respect to b, produces (5.34) as the minimum-residual solution.
6 We recall that a problem is well conditioned if small perturbations of the measurements (the images, in this context) generate small variations, of the same order, in the quantities to be calculated (the orientations of the normals, in this context).
reconstructed using gray-level images, and it is rarely of interest to work with RGB images for the reconstruction of the surface in color.
Several research activities have demonstrated the use of stereo photometry to extract the map of the normals of objects even in the absence of knowledge of the lighting conditions (without calibration of the light sources, i.e., without knowing their orientations s_k and intensity L_i) and without knowing the real reflectance model of the objects themselves. Several results based on the Lambertian reflectance model are reported in the literature. The reconstruction of the observed surface is feasible, albeit still with ambiguity, starting from the map of normals. It is also assumed that the images are acquired with orthographic projection and with a linear response of the sensor.
In this situation, uncalibrated stereo photometry implies that in the system of
Eq. (5.33) the unknowns are the source direction matrix S and the vector of normals
given by:
b(i, j) ≡ ρ(i, j)n(i, j)
A solution to the problem is the factorization of a matrix [7], which consists in its singular value decomposition (SVD, see Sect. 2.11 Vol. II).
In this formulation, it is useful to consider the matrix I = (I1, I2, . . . , Im) of the m images, of size m × k (with m ≥ 3), which organizes the pixels of the image Ii, i = 1, . . . , m, in the ith row of the global image matrix I, each row having dimension k equal to the total number of pixels per image (pixels organized in lexicographic order). Therefore, the matrix Eq. (5.33), to arrange the factorization of the intensity matrix, is reformulated as follows:

  I   =   S  ·  B     (5.38)
(m×k)   (m×3) (3×k)
where S is the unknown matrix of the sources, whose directions for the ith image are arranged by row (s_ix, s_iy, s_iz), while the matrix of normals B organizes by column the components (b_jx, b_jy, b_jz) of the scaled normal of the jth pixel. With this formalism, the irradiance equation under Lambertian conditions for the jth pixel of the ith image results in the following:

 I_ij  =  S_i · B_j     (5.39)
(1×1)   (1×3) (3×1)
For any matrix of size m × k there always exists its singular value decomposition, according to the SVD theorem, given by

  I   =   U  ·  Σ  ·  Vᵀ     (5.40)
(m×k)   (m×m) (m×k) (k×k)

  Î   =   U′ ·  Σ′ ·  V′ᵀ     (5.41)
(m×k)   (m×3) (3×3) (3×k)
where the submatrices are renamed with a prime to distinguish them from the originals. With (5.41), we get the best rank-3 approximation of the original image matrix, whose complete decomposition (5.40) uses all the singular values. In an ideal context, with noise-free images, the SVD can represent the original image matrix well with a few singular values. With the SVD it is possible to evaluate, from the analysis of the singular values, the acceptability of the approximation based on the first three singular values (in the presence of noise the rank of the image matrix is greater than three).
Using the first three singular values of Σ, and the corresponding columns (singular vectors) of U and V from (5.41), we can define the pseudo-matrices Ŝ and B̂ of the sources and of the normals, respectively, as follows:

  Ŝ   =   U′ · √Σ′            B̂   =  √Σ′ ·  V′ᵀ     (5.42)
(m×3)   (m×3) (3×3)         (3×k)   (3×3) (3×k)
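The rank-3 factorization of (5.41)-(5.42) can be obtained with a few lines of NumPy; the sketch below is illustrative and, as discussed next, its output is defined only up to an invertible 3 × 3 matrix.

```python
import numpy as np

def factorize_uncalibrated(I):
    """Rank-3 factorization of the image matrix I (m x k) into pseudo-sources
    S_hat (m x 3) and pseudo-normals B_hat (3 x k), as in Eqs. (5.41)-(5.42)."""
    U, s, Vt = np.linalg.svd(I, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]   # keep the 3 largest singular values
    S_hat = U3 * np.sqrt(s3)                   # U' * sqrt(Sigma')
    B_hat = np.sqrt(s3)[:, None] * Vt3         # sqrt(Sigma') * V'^T
    return S_hat, B_hat                        # I_hat ≈ S_hat @ B_hat
```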
The decomposition obtained with (5.43) is not unique. In fact, if A is an arbitrary invertible matrix of size 3 × 3, the matrices ŜA⁻¹ and AB̂ are still a valid decomposition of the approximate image matrix Î, such that

Î = Ŝ B̂ = (Ŝ A⁻¹)(A B̂)     ∀ A ∈ GL(3)     (5.44)

where GL(3) is the group of all invertible matrices of size 3 × 3. In essence, with (5.44) we establish an equivalence relation in the space of the solutions T = IR^{m×3} × IR^{3×k}, where IR^{m×3} represents the space of all possible matrices Ŝ of source directions and IR^{3×k} the space of all possible matrices B̂ of scaled normals.
The ambiguity generated by the factorization Eq. (5.44) of Î in Ŝ and B̂ can be
managed by considering the matrix A associated with a linear transformation such
that
S̄ = Ŝ A⁻¹          B̄ = A B̂     (5.45)
Equation (5.45) tells us that the two solutions (Ŝ, B̂) ∈ T and (S̄, B̄) ∈ T are equivalent if there exists a matrix A ∈ GL(3). With the SVD, through Eq. (5.42), the matrix I of the stereo photometric images selects an equivalence class T(I). This class contains the matrix S of the true source directions and the matrix B of the true scaled normals, but it is not possible to distinguish (S, B) from the other members of T(I) based only on the content of the images assembled in the image matrix I.
The matrix A can be determined with at least 6 pixels with the same or known reflectance, or by considering that the intensity of at least six sources is constant or known [7]. It is shown in [8] that, by imposing the constraint of integrability,7 the ambiguity can be reduced by introducing the Generalized Bas-Relief (GBR) transformations, which satisfy the integrability constraint. A GBR transforms a surface z(x, y) into a new surface ẑ(x, y), combining a flattening operation (or a scale change) along the z-axis with the addition of a plane:

ẑ(x, y) = λ z(x, y) + μ x + ν y     (5.46)

where λ ≠ 0 and μ, ν ∈ IR are the parameters that represent the group of the GBR transformations. The matrix A which solves Eq. (5.45) is given by the matrix G associated with the group of GBR transformations, given in the form:
          ⎡ 1  0  0 ⎤
A = G =   ⎢ 0  1  0 ⎥     (5.47)
          ⎣ μ  ν  λ ⎦
7 The constraint of integrability requires that the normals estimated by stereo photometry correspond
to a curved surface. Recall that from the orientation map the surface z(x, y) can be reconstructed
by integrating the gradient information { p(x, y), q(x, y)} or the partial derivatives of z(x, y) along
any path between two points in the image plane. For a curved surface, the constraint of integrability
[9] means that the chosen path does not matter in obtaining an approximation of ẑ(x, y). Formally, this means

∂²ẑ/∂x∂y = ∂²ẑ/∂y∂x

having already estimated the normals b̂(x, y) = (b̂_x, b̂_y, b̂_z)ᵀ, with first-order partial derivatives ∂ẑ/∂x = b̂_x(x, y)/b̂_z(x, y) and ∂ẑ/∂y = b̂_y(x, y)/b̂_z(x, y).
The GBR transformation defines the pseudo-orientations of the sources S̄ and the pseudo-normals B̄ according to Eq. (5.45), replacing the matrix A with the matrix G. Solving then with respect to Ŝ and B̂, we have

                                     ⎡  λ   0  0 ⎤
Ŝ = S̄ G          B̂ = G⁻¹ B̄ = (1/λ) ⎢  0   λ  0 ⎥ B̄     (5.48)
                                     ⎣ −μ  −ν  1 ⎦
Thus, the problem is reduced from the 9 unknowns of the matrix A of size 3 × 3 to the 3 parameters of the matrix G associated with the GBR transformation. Furthermore, the GBR transformation has the peculiar property of not altering the shading configuration of a surface z(x, y) (with Lambertian reflectance) illuminated by any source s with respect to that of the surface ẑ(x, y) obtained from the GBR transformation with G and illuminated by the source whose direction is given by ŝ = G⁻ᵀ s. In other words, when the orientations of both surface and source are transformed with the matrix G, the shading configurations are identical in the images of the original and of the transformed surface.
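As a small illustration of (5.47)-(5.48), the following sketch builds G and its inverse for given (λ, μ, ν); the function name and its use to map pseudo-normals back are only an example of how the transformation could be applied.

```python
import numpy as np

def gbr_matrix(lam, mu, nu):
    """GBR transformation matrix G of Eq. (5.47) and its inverse (cf. Eq. 5.48)."""
    assert lam != 0, "lambda must be non-zero for a valid GBR transformation"
    G = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [mu,  nu,  lam]])
    G_inv = np.array([[lam, 0.0, 0.0],
                      [0.0, lam, 0.0],
                      [-mu, -nu, 1.0]]) / lam
    return G, G_inv

# pseudo-normals B_bar and pseudo-sources S_bar would be mapped back with
# B_hat = G_inv @ B_bar  and  S_hat = S_bar @ G   (cf. Eq. 5.48)
```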
The reconstruction of the orientation of the visible surface can be carried out experi-
mentally using a look-up table (LUT), which associates the orientation of the normal
with a triad of luminous intensity (I1 , I2 , I3 ) measured after appropriately calibrating
the acquisition system.
Consider as stereo photometry system the one shown in Fig. 5.6, which provides three sources arranged on the base of a cone at 120° from each other, a single acquisition camera located at the center of the cone base, and the work plane placed at the apex of the inverted cone, where the objects to be acquired are located. Initially, the system is calibrated to account for the reflectivity of the material of the objects and to acquire all the possible intensity triples to be stored in the LUT, each associated with a given orientation of the surface normal. To be in Lambertian reflectance conditions, objects made of opaque material are chosen, for example, objects made of PVC plastic.
Therefore, the calibration of the system is carried out by means of a sphere of
PVC material (analogous to the material of the objects), for which the three images
I1 , I2 , I3 provided with stereo photometry are acquired. The images are assumed to be
registered (aligned with each other), that is, during the three acquisitions the object and the observer remain fixed while the corresponding sources are turned on in succession. The calibration process associates, for each pixel (i, j) of the
image, a set of luminous intensity values (I1 , I2 , I3 ) (the three measurements of
stereo photometry) as measured by the camera (single observation point) in the three
successive acquisitions while a single source is operative, and the orientation value
( p, q) of the calibration surface is derived from the knowledge of the geometric
description of the sphere.
Fig. 5.8 Stereo photometry: calibration of the acquisition system using a sphere of material with identical reflectance properties to the objects
In essence, the calibration sphere is chosen as the only solid whose visible surface
has all the possible orientations of a surface element in the space of the visible
hemisphere. Figure 5.8 indicates on the calibration sphere two generic surface elements, centered at the points R and T, visible by the camera and projected in the image plane at the pixels located in (i_R, j_R) and (i_T, j_T), respectively. The orientations of the normals corresponding to these surface elements of the sphere are given by n_R(p_R, q_R) and n_T(p_T, q_T), calculated analytically knowing the parametric equation of the sphere, assumed of unit radius.
Once the projections of these points of the sphere in the image plane are known,
after the acquisition of the three stereo photometry images {I1 , I2 , I3 }spher e , it is
possible to associate, to the surface elements considered R and T , their normals
orientation (knowing the geometry of the sphere) and the triad of luminous intensity
measurements determined by the camera. These associations are stored in the LUT
table of dimensions (3 × 2 × m) using as pointers the triples of the luminous intensity
measurements to which the values of the two orientation components ( p, q) are made
to correspond. The number of triples m depends on the level of discretization of the
sphere or the resolution of the images.
In the example shown in Fig. 5.8, the associations for the surface elements R and T are, respectively, the following:

[I1(i_R, j_R), I2(i_R, j_R), I3(i_R, j_R)]  ⟹  n_R(p_R, q_R)
[I1(i_T, j_T), I2(i_T, j_T), I3(i_T, j_T)]  ⟹  n_T(p_T, q_T)

Once the system has been calibrated, the type of material with Lambertian reflectance characteristics chosen, the lamps appropriately positioned, and all the associations between orientations (p, q) and measured triples (I1, I2, I3) stored using the calibration sphere, the latter is removed from the acquisition plane and replaced with the objects to be acquired.
Fig. 5.9 Stereo photometric images of a real PVC object with the final orientation map obtained by applying the calibrated stereo photometry approach with the PVC sphere
The three stereo photometry images of the objects are acquired in the same con-
ditions in which those of the calibration sphere were acquired, that is, by making
sure that the image Ik is acquired keeping only the source Sk on. For each pixel (i, j)
of the image, the triad (I1, I2, I3) is used as a pointer into the LUT table, this time not to store but to retrieve the orientation (p, q) to be associated with the surface element corresponding to the pixel located at (i, j).
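A minimal sketch of the two phases (building the LUT with the calibration sphere and querying it on the object images) is given below; the quantization of the intensity triples, the image normalization to [0, 1], and the NaN fallback for unseen triples are illustrative assumptions, not part of the described system.

```python
import numpy as np

def build_lut(I1, I2, I3, p_sphere, q_sphere, mask, levels=32):
    """Calibration phase: quantized intensity triples measured on the sphere
    become pointers to the known sphere orientations (p, q)."""
    lut = {}
    q1, q2, q3 = [np.floor(I * (levels - 1)).astype(int) for I in (I1, I2, I3)]
    for (i, j) in zip(*np.nonzero(mask)):          # only pixels on the sphere
        lut[(q1[i, j], q2[i, j], q3[i, j])] = (p_sphere[i, j], q_sphere[i, j])
    return lut

def lookup_orientation(lut, I1, I2, I3, levels=32):
    """Measurement phase: the same quantized triples, measured on the object,
    are used as pointers to retrieve the orientation map (p, q)."""
    q1, q2, q3 = [np.floor(I * (levels - 1)).astype(int) for I in (I1, I2, I3)]
    p = np.full(I1.shape, np.nan)
    q = np.full(I1.shape, np.nan)
    for (i, j) in np.ndindex(I1.shape):
        val = lut.get((q1[i, j], q2[i, j], q3[i, j]))
        if val is not None:
            p[i, j], q[i, j] = val
    return p, q
```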
Figure 5.9 shows the results of the process of stereo photometry, which builds the
orientation map (needle map) of an object of the same PVC material as the calibration
sphere. Figure 5.10 instead shows another experiment based on stereo photometry, aimed at detecting the best-placed object, in a stack of similar objects (identical to the one of the previous example), for automatic gripping (known as the bin picking problem) in the context of robotic cells. In this case, once the orientation map of the stack has been obtained, it is segmented to isolate the object best placed for gripping and to estimate the attitude of the isolated object (normally, the one at the top of the stack).
[Figure panels: object at the top of the stack; orientation map; segmented map]
Fig. 5.10 Stereo photometric images of a stack of objects identical to that of the previous example.
In this case, with the same calibration data, the calculated orientation map is used not for the purpose
of reconstructing the surface but to determine the attitude of the object higher up in the stack
in the iso-brightness curves (see Fig. 5.11). Once the direction of the sources is
estimated, the value of the albedo can be estimated according to (5.32).
Once the map of the orientations (unit normals) of the surface has been obtained with stereo photometry for each pixel of the image, it is possible to reconstruct the surface z = Z(x, y), that is, to determine the depth map, through a data integration algorithm. In essence, this requires a transition from the gradient space (p, q) to the depth map in order to recover the surface.
The problem of surface reconstruction, starting from the discrete gradient space with noisy data (the surface continuity constraint imposed by stereo photometry is often violated), is an ill-posed problem. In fact, the estimated surface normals do not faithfully reproduce the local curvature (slope) of the surface itself. It can happen that several surfaces with different height values have the same gradients. A check on the acceptability of the estimated normals can be done with the integrability test (see Note 7), which evaluates at each point the value ∂p/∂y − ∂q/∂x; this should theoretically be zero, but small values are acceptable. Once this test has been passed, the reconstruction of the surface is obtained up to an additive constant of the heights and with an adequate depth error.
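A discrete version of this test is straightforward; the sketch below assumes p = ∂Z/∂x with x along the columns and q = ∂Z/∂y with y along the rows, and the acceptance threshold is purely illustrative.

```python
import numpy as np

def integrability_residual(p, q):
    """Discrete check of the integrability condition dp/dy - dq/dx ~ 0 on the
    estimated gradient maps; large absolute values flag unreliable normals."""
    dp_dy = np.gradient(p, axis=0)   # rows assumed along y
    dq_dx = np.gradient(q, axis=1)   # columns assumed along x
    return dp_dy - dq_dx

# e.g. accept the map where np.abs(integrability_residual(p, q)) < 0.1
```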
One approach to constructing the surface is to consider the gradient information
( p(x, y), q(x, y)) which gives the height increments between adjacent points of the
surface in the direction of the x- and y-axes. Therefore, the surface is constructed by
adding these increments starting from a point and following a generic path. In the
continuous case, by imposing the integrability constraint, integration along different
paths would lead to the same value of the estimated height for a generic point (x, y)
starting from the same initial point (x0 , y0 ) with the same arbitrary height Z 0 . This
reconstruction approach is called local integration method.
A global integration method is based on a cost function C that minimizes the quadratic error between the ideal gradient (Z_x, Z_y) and the estimated one (p, q):

C = ∫∫_Ω (|Z_x − p|² + |Z_y − q|²) dx dy     (5.49)

where Ω represents the domain of all the points (x, y) of the map of the normals N(x, y), while Z_x and Z_y are the partial derivatives of the ideal surface Z(x, y) with respect to the x- and y-axes, respectively. This function is invariant when a constant value is added to the surface height function Z(x, y). The optimization
problem posed by (5.49) can be solved with the variational approach, with the direct
discretization method or with the expansion methods. The variational approach [12]
uses the Euler–Lagrange equation as the necessary condition to reach a minimum.
The numerical solution to minimize (5.49) is realized through a conversion process from continuous to discrete. The expansion methods, instead, are based on expressing the function Z(x, y) as a linear combination of a set of basis functions.
n_{x,y} = [p(x, y), q(x, y)]          n_{x+1,y} = [p(x+1, y), q(x+1, y)]
n_{x,y+1} = [p(x, y+1), q(x, y+1)]    n_{x+1,y+1} = [p(x+1, y+1), q(x+1, y+1)]
Now let’s consider the normals of the second column of the grid (along the x-axis).
The line connecting these points z[x, y +1, Z (x, y +1)] e z[x +1, y +1, Z (x +1, y +
1)] of the surface is approximately perpendicular to the normal average between these
two points. It follows that the inner product between the vector (slope) of this line
and the average normal vector is zero. This produces the following:
1
Z (x + 1, y + 1) = Z (x, y + 1) + [ p(x, y + 1) + p(x + 1, y + 1)] (5.50)
2
Similarly, considering the adjacent points of the second row of the grid (along the y-axis), that is, the line connecting the points z[x+1, y, Z(x+1, y)] and z[x+1, y+1, Z(x+1, y+1)] of the surface, we obtain the relation:

Z(x+1, y+1) = Z(x+1, y) + ½ [q(x+1, y) + q(x+1, y+1)]     (5.51)
The height map thus estimated has values that are influenced by the choice of the arbitrary initial values. Therefore, it is useful to perform a final step, taking the average of the values of the two scans to obtain the final map of the surface heights.
Figure 5.12 shows the height map obtained starting from the map of the normals of
the visible surface acquired with the calibrated stereo photometry.
Fig. 5.12 Results of the reconstruction of the surface starting from the orientation map obtained
from the calibrated stereo photometry
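A minimal sketch of this local two-scan integration is shown below, assuming p = ∂Z/∂x with x along the columns, q = ∂Z/∂y with y along the rows, unit pixel spacing, and Z(0, 0) = 0 as the arbitrary reference height.

```python
import numpy as np

def integrate_two_scans(p, q):
    """Local integration based on Eqs. (5.50)-(5.51): heights are accumulated
    from Z(0, 0) = 0 along two different scan orders (first the leftmost
    column then the rows, and vice versa) and the two estimates are averaged."""
    H, W = p.shape
    Z1 = np.zeros((H, W))
    Z2 = np.zeros((H, W))
    # scan 1: integrate q down the first column, then p along each row
    Z1[1:, 0] = np.cumsum(0.5 * (q[:-1, 0] + q[1:, 0]))
    Z1[:, 1:] = Z1[:, :1] + np.cumsum(0.5 * (p[:, :-1] + p[:, 1:]), axis=1)
    # scan 2: integrate p along the first row, then q down each column
    Z2[0, 1:] = np.cumsum(0.5 * (p[0, :-1] + p[0, 1:]))
    Z2[1:, :] = Z2[:1, :] + np.cumsum(0.5 * (q[:-1, :] + q[1:, :]), axis=0)
    return 0.5 * (Z1 + Z2)   # heights up to an additive constant
```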
It follows that for each pixel (x, y) of the height map, a system of equations can be defined by combining (5.56) with the derivatives of the surface Z(x, y), represented with the gradient according to (5.31), obtaining
where n(x, y) = (n_x(x, y), n_y(x, y), n_z(x, y)) is the normal vector at the point (x, y) of the normal map in 3D space. If the map includes M pixels, the complete system of Eq. (5.57) consists of 2M equations. To improve the estimate of Z(x, y) for each pixel of the map, the system (5.57) can be extended by also considering the adjacent pixels, respectively the one to the left (x − 1, y) and the one above (x, y − 1) with respect to the pixel (x, y) being processed. In that case, the previous system extends to 4M equations. The system (5.57) can be solved as an overdetermined linear system. It should be noted that Eqs. (5.57) are valid for points not belonging to the edges of the objects, where the component n_z → 0.
This additional constraint imposes the equality of the second partial derivatives, Z_xx = p_x and Z_yy = q_y, and the cost function to be minimized becomes

C(Z) = ∫∫_Ω [ (|Z_x − p|² + |Z_y − q|²) + λ₀ (|Z_xx − p_x|² + |Z_yy − q_y|²) ] dx dy     (5.58)

where Ω represents the domain of all the points (x, y) of the map of the normals N(x, y) = (p(x, y), q(x, y)) and λ₀ > 0 controls the trade-off between the curvature of the surface and the variability of the acquired gradient data. The integrability constraint still remains p_y = q_x ⇔ Z_xy = Z_yx.
An additional constraint can be added with a smoothing (smoothness) term, and the new function becomes:

C(Z) = ∫∫_Ω (|Z_x − p|² + |Z_y − q|²) dx dy + λ₀ ∫∫_Ω (|Z_xx − p_x|² + |Z_yy − q_y|²) dx dy
     + λ₁ ∫∫_Ω (|Z_x|² + |Z_y|²) dx dy + λ₂ ∫∫_Ω (|Z_xx|² + 2|Z_xy|² + |Z_yy|²) dx dy     (5.59)
where λ₁ and λ₂ are two additional nonnegative parameters that control the smoothing level of the surface and its curvature, respectively. The minimization of the cost function C(Z), which estimates the unknown surface Z(x, y), can be carried out with two algorithms, both based on the Fourier transform.
The first algorithm [12] is formulated as a minimization problem expressed by the function (5.49) with the constraint of integrability. The proposed method uses the theory of projection8 onto convex sets. In essence, the gradient map of the normals N(x, y) = (p(x, y), q(x, y)) is projected onto the gradient space that is integrable in the least squares sense, then using the Fourier transform for the optimization in the frequency domain. Consider the surface Z(x, y) represented by the functions φ(x, y, ω) as follows:

Z(x, y) = Σ_{ω∈Ω} K(ω) φ(x, y, ω)     (5.60)

where ω is a 2D map of indexes associated with a finite set Ω, and the coefficients K(ω) that minimize the function (5.49) can be expressed as
8 In mathematical analysis, the projection theorem, also known as the Hilbert projection theorem and descending from convex analysis, is often used in functional analysis. It establishes that for every point x in a Hilbert space H and for each closed convex set C ⊂ H there exists a unique y ∈ C such that the distance ‖x − y‖ assumes its minimum value on C. In particular, this is true for any closed subspace M of H. In this case, a necessary and sufficient condition for y is that the vector x − y be orthogonal to M.
where P_x(ω) = |φ_x(x, y, ω)|² and P_y(ω) = |φ_y(x, y, ω)|². The derivatives of the Fourier basis functions φ can be expressed as follows:

φ_x = jω_x φ          φ_y = jω_y φ     (5.62)
whereby the derivatives give P_x ∝ ω_x² and P_y ∝ ω_y², and also K_1(ω) = K_x(ω)/(jω_x) and K_2(ω) = K_y(ω)/(jω_y). Expanding the surface Z(x, y) with the Fourier basis functions, the function (5.49) is then minimized with

K(ω) = [ jω_x K_x(ω) − jω_y K_y(ω) ] / (ω_x² + ω_y²)     (5.63)
where K_x(ω) and K_y(ω) are the Fourier coefficients of the heights of the reconstructed surface. These Fourier coefficients can be calculated from the following relationships:

Z(u, v) = (1/√(N_r N_c)) Σ_{x=0}^{N_c−1} Σ_{y=0}^{N_r−1} Z(x, y) e^{−j2π(u x/N_c + v y/N_r)}     (5.66)
where the transform is calculated for each point of the normal map ((x, y) ∈ Ω), j = √−1 is the imaginary unit, and u and v represent the frequencies in the Fourier domain. We now report the derivatives of the function Z(x, y) in the spatial and frequency domains:
Z_x(x, y)  ⇔  juZ(u, v)
Z_y(x, y)  ⇔  jvZ(u, v)
Z_xx(x, y) ⇔  −u²Z(u, v)     (5.68)
Z_yy(x, y) ⇔  −v²Z(u, v)
Z_xy(x, y) ⇔  −uvZ(u, v)
which establishes the equivalence of the two representations (the spatial one and the frequency-domain one) of the function Z(x, y) from the energetic point of view, useful in this case to minimize the energy of the function (5.59).
Let P(u, v) and Q(u, v) be the Fourier transform of the gradients p(x, y) and
q(x, y), respectively. Applying the Fourier transform to the function (5.59) and
considering the energy theorem (5.69), we obtain the following:
Σ_{(u,v)∈Ω} [ |juZ(u, v) − P(u, v)|² + |jvZ(u, v) − Q(u, v)|² ]
+ λ₀ Σ_{(u,v)∈Ω} [ |−u²Z(u, v) − juP(u, v)|² + |−v²Z(u, v) − jvQ(u, v)|² ]
+ λ₁ Σ_{(u,v)∈Ω} [ |juZ(u, v)|² + |jvZ(u, v)|² ]
+ λ₂ Σ_{(u,v)∈Ω} [ |−u²Z(u, v)|² + 2|−uvZ(u, v)|² + |−v²Z(u, v)|² ]   ⟹   minimum
9 In fact, Rayleigh's theorem is based on Parseval's theorem. If x₁(t) and x₂(t) are two real signals and X₁(u) and X₂(u) are their Fourier transforms, Parseval's theorem proves that

∫_{−∞}^{+∞} x₁(t) · x₂*(t) dt = ∫_{−∞}^{+∞} X₁(u) · X₂*(u) du

If x₁(t) = x₂(t) = x(t), then we have Rayleigh's theorem, or energy theorem:

E = ∫_{−∞}^{+∞} |x(t)|² dt = ∫_{−∞}^{+∞} |X(u)|² du

The asterisk indicates the complex conjugate operator. The theorem is often used to calculate the energy of a function (or signal) in the frequency domain.
where the asterisk denotes the complex conjugate operator. By differentiating the latter expression with respect to Z* and setting the result to zero, it is possible to impose the necessary condition for a minimum of the function (5.59) as follows:

[ λ₀(u⁴ + v⁴) + (1 + λ₁)(u² + v²) + λ₂(u² + v²)² ] Z(u, v) + j(u + λ₀u³)P(u, v) + j(v + λ₀v³)Q(u, v) = 0
Solving the above equation, except for (u, v) = (0, 0), we finally get

Z(u, v) = [ −j(u + λ₀u³)P(u, v) − j(v + λ₀v³)Q(u, v) ] / [ λ₀(u⁴ + v⁴) + (1 + λ₁)(u² + v²) + λ₂(u² + v²)² ]     (5.70)
Therefore, with (5.70), we have arrived at the Fourier transform of the heights of
an unknown surface starting from the Fourier transforms P(u, v) and Q(u, v) of the
gradient maps p(x, y) and q(x, y) calculated with stereo photometry. The details of
the complete algorithm are reported in [15].
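A compact NumPy sketch of this frequency-domain reconstruction follows; the discrete-frequency convention (angular frequencies from np.fft.fftfreq) and the handling of the zero frequency are assumptions of this illustration, and with λ₀ = λ₁ = λ₂ = 0 it reduces to the classic Frankot-Chellappa-style integration of (p, q).

```python
import numpy as np

def fourier_integration(p, q, lam0=0.0, lam1=0.0, lam2=0.0):
    """Surface heights from gradient maps via Eq. (5.70): the FFTs P, Q of
    (p, q) give the FFT of the heights Z, then an inverse FFT recovers Z(x, y)."""
    H, W = p.shape
    P = np.fft.fft2(p)
    Q = np.fft.fft2(q)
    # discrete angular frequencies: u along x (columns), v along y (rows)
    u = 2 * np.pi * np.fft.fftfreq(W)[None, :]
    v = 2 * np.pi * np.fft.fftfreq(H)[:, None]
    num = -1j * (u + lam0 * u**3) * P - 1j * (v + lam0 * v**3) * Q
    den = lam0 * (u**4 + v**4) + (1 + lam1) * (u**2 + v**2) + lam2 * (u**2 + v**2)**2
    Zf = np.zeros_like(P)
    nz = den > 0                        # leave the (0, 0) frequency at zero
    Zf[nz] = num[nz] / den[nz]
    return np.real(np.fft.ifft2(Zf))    # heights up to an additive constant
```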
5.5 Shape from Texture

When a scene is observed, the image captures, in addition to the information on the variation of light intensity (shading), also the texture information, if present. With Shape From Texture (SFT), we mean the vision paradigm that analyzes the texture information to recover the orientation and the 3D shape of the visible surface.
Any method of Shape From Texture must then evaluate the geometric parameters of
the texture primitives characterized by these two distortions, which are essential for
the reconstruction of the surface and the calculation of its structure. The orientation of
a plane must be estimated starting from the knowledge of the geometry of the texture, from the possibility of extracting these primitives without ambiguity, and by appropriately estimating the invariant parameters of the geometry of the primitives, such as:
relationships between horizontal and vertical dimensions, variations of areas, etc. In particular, by extracting all the primitives present, it is possible to evaluate invariant parameters such as the texture gradient, which indicates the rapidity of change of the density of the primitives as they recede from the observer. In other words, the texture gradient in the image provides a continuous metric of the scene, analyzing the geometry of the primitives, which appear ever smaller as they move away from the observer. The information measured with the texture gradient allows humans to perceive the orientation of a flat surface, the curvature of a surface, and depth. Figure 5.14 shows some images illustrating how the texture gradient information gives the perception of the depth of the primitives on a flat surface that recedes from the observer, and how the local appearance of the visible surface changes with the change of the texture gradient. Other information considered is the perspective gradient and the compression
gradient, defined, respectively, by the change in the width and in the height of the projections of the texture primitives in the image plane. As the distance between the observer and the points of the visible surface increases, the perspective and compression gradients decrease. This perspective and compression gradient information has been widely used in computer graphics to give a good perception of the 3D surface observed on a monitor or 2D screen.
In the context of Shape From Texture, it is usual to define the structure of the flat surface to be reconstructed, with respect to the observer, through the slant angle σ, which indicates the angle between the normal vector of the flat surface and the z-axis (coinciding with the optical axis), and through the tilt angle τ, indicating the angle between the X-axis and the projection vector, in the image plane, of the normal vector n (see Fig. 5.15). The figure shows a case in which the slant angle is such that the textured flat surface is inclined with respect to the observer so that its upper part is further away, while the tilt angle is zero; consequently, all the texture primitives arranged horizontally are at the same distance from the observer.
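The slant-tilt pair maps directly to a surface normal; the small sketch below uses the standard parameterization usually associated with these angles (assumed here, since the text does not write it explicitly).

```python
import math

def normal_from_slant_tilt(sigma, tau):
    """Unit normal of a planar surface from the slant sigma (angle with the
    optical axis z) and the tilt tau (angle of the normal's image-plane
    projection with the X-axis), using the standard parameterization."""
    nx = math.sin(sigma) * math.cos(tau)
    ny = math.sin(sigma) * math.sin(tau)
    nz = math.cos(sigma)
    return (nx, ny, nz)

# e.g. sigma = 0 gives a fronto-parallel plane with normal (0, 0, 1)
```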
A general algorithm of Shape From Texture includes the following essential steps:
1. Define the texture primitives to be considered for the given application (lines, disks, ellipses, rectangles, curved lines, etc.).
2. Choose the invariant parameters (texture, perspective, and compression gradients) appropriate for the texture primitives defined in step 1.
3. Use the invariant parameters of step 2 to calculate the attitude of the textured surface.
5.6 Shape from Structured Light

[Fig. 5.16: geometry of the structured-light acquisition system: light source (laser) and camera separated by the baseline L along the X-axis through O and Q, with projection angles α and β toward the illuminated point]
A depth map can be obtained with a range imaging system,10 where the object to be reconstructed is illuminated by so-called structured lighting, in which the geometry of the projected light structures is known.
In essence, recalling the binocular vision system, one camera is replaced by a projector of luminous patterns, and the correspondence problem is solved (in a simpler way) by searching for the (known) patterns in the image of the camera that captures the scene with the superimposed light patterns.
Figure 5.16 shows the functional scheme of a range acquisition device based on structured light. The scene is illuminated with a projector (for example, based on low-power lasers) emitting known patterns of light (structured light); projector and observer (camera) are separated by a distance L, and the distance measurement (range) can be calculated from a single image (scene with superimposed light patterns) by triangulation, in a similar way to the binocular stereo system. Normally, the scene can be illuminated by a luminous spot, by a thin lamina of light (a vertical light plane perpendicular to the scene), or with more complex luminous patterns (for example, a rectangular or square luminous grid, or binary or gray-level luminous strips; Microsoft's Kinect is a low-cost device that projects a scattered infrared pattern with a laser and uses an infrared-sensitive camera).
The relation between the coordinates (X, Y, Z) of a point P of the scene and the coordinates (x, y) of its projection in the image plane depends on the calibration parameters of the capture system, such as the focal length f of the camera's optical system, the separation distance L between projector and camera, the angle of inclination α of the projector with respect to the X-axis, and the projection angle β of the point P of the object illuminated by the light spot (see Fig. 5.16). In the hypothesis of the 2D
10 Indicates a set of techniques that are used to produce a 2D image to calculate the distance of
points in a scene from a specific point, normally associated with a particular sensory device. The
pixels of the resulting image, known as the depth image, have the information content from which
to extrapolate values of distances between points of the object and sensory device. If the sensor that
is used to produce the depth image is correctly calibrated, the pixel values are used to estimate the
distance information as in a stereo binocular device.
[Fig. 5.17: projection geometry of the light plane: image plane with projected point p(x, y), focal length f, light source (laser) inclined at angle α, baseline L between O and Q along the X-axis, and angle γ at the illuminated point]
projection of a single light spot, this relation is used to calculate the position of P by triangulation, considering the triangle OPQ and applying the law of sines:

d / sin α = L / sin γ
from which it follows:

d = L · sin α / sin γ = L · sin α / sin[π − (α + β)] = L · sin α / sin(α + β)     (5.71)
The angle β (given by β = arctan(f/x)) is determined by the projection geometry of the point P in the image plane at p(x, y), considering the focal length f of the optical system and only the horizontal coordinate x. Once the angle β is determined, and the parameters L and α are known from the system configuration, the distance d is calculated with Eq. (5.71). Considering the triangle OPS, the polar coordinates (d, β) of the point P in the plane (X, Z) are converted into Cartesian coordinates (X_P, Y_P) as follows11:
11 Obtained according to the trigonometric formulas of the complementary angles (their sum is a right angle), where in this case the complementary angle is (π/2 − β).
Considering the right triangle with base (L − X) on the baseline OQ (see Fig. 5.17), we get

tan α = Z / (L − X)     (5.74)
Therefore, equating the relation (5.73) with the last expression of (5.75), we get the 3D coordinates of P, given by

[X  Y  Z] = [x  y  z] · L · tan α / (f + x · tan α)     (5.76)
It should be noted that the resolution of the depth measurement Z given by (5.76) is
related to the accuracy with which α is measured and the coordinates (x, y) deter-
mined for each point P of the scene (illuminated) projected in the image plane. It is
also observed that, to calculate the distance of P, the angle γ was not considered (see
Fig. 5.17). This depends on the fact that the projected structured light is a vertical
light plane (not a ray of light) perpendicular to the X Z plane and forms an angle α
with the X -axis. To calculate the various depth points, it is necessary to project the
light spot in different areas of the scene to obtain a 2D depth map by applying (5.76)
for each point. This technique using a single mobile light spot (varying α) is very
slow and inadequate for dynamic scenes.
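A minimal sketch of this single-spot triangulation is given below; the use of atan2 for β and the assumption that the third component of the image vector is the focal length f are illustrative choices, not statements from the text.

```python
import math

def spot_depth(x, y, f, L, alpha):
    """Single-spot triangulation sketch (cf. Eqs. 5.71 and 5.76).
    x, y  : image coordinates of the illuminated spot (same units as f)
    f     : focal length; L: baseline; alpha: projection angle of the light."""
    beta = math.atan2(f, x)                               # beta = arctan(f / x)
    d = L * math.sin(alpha) / math.sin(alpha + beta)      # Eq. (5.71)
    k = L * math.tan(alpha) / (f + x * math.tan(alpha))   # scale factor of Eq. (5.76)
    X, Y, Z = k * x, k * y, k * f                         # assuming the third component is f
    return d, (X, Y, Z)
```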
Normally, structured light systems are used that consist of a vertical light lamina (light plane) that scans the scene by tilting the lamina with respect to the Y-axis, as shown in Fig. 5.18. In this case, the projection angle of the laser light plane is gradually changed to capture the entire scene in width.
The most used techniques are those based on the sequential projection of coded light patterns (binary, gray-level, or color) to eliminate the ambiguity in identifying the patterns associated with the surface of objects at different depths. It is, therefore, necessary to uniquely determine the patterns of the multiple strips of light seen by the camera and projected onto the image plane, comparing them with those of the original pattern. The process that compares the projected patterns (for example, binary light strips) with the corresponding original patterns (known a priori) is known as the pattern decoding process, the equivalent of the correspondence search process in binocular vision. In essence, the decoding of the patterns consists in locating them in the image and finding their correspondence in the plane of the projector, for which it is known how they were coded.
[Fig. 5.19: a laser pattern of known geometry, projected onto a 3D object and observed by a camera; the deformation points of interest are detected to reconstruct the shape of the observed curved surface]
Binary light pattern projection techniques involve projecting light planes onto
objects where each light plane is encoded with appropriate binary patterns [16].
These binary patterns are uniquely encoded by black and white strips (bands) for
each plane, so that when projected in a time sequence (the strips increase their width
over time) each point on the surface of the objects is associated with a single binary
code distinct from the other codes of different points. In other words, each point is
identified by the intensity sequence it receives. If the patterns are n (i.e., the number of planes to be projected), then 2^n strips can be coded (that is, 2^n regions are identified in the image). Each strip corresponds to a specific angle α of the projected light plane (which can be vertical or horizontal, or in both directions, depending on the type of scan).
Figure 5.20 shows a set of luminous planes encoded with binary strips to be projected in a temporal sequence on the scene to be reconstructed. In the figure, the number of patterns is 5, and the coding of each plane represents the binary configuration of patterns 0 and 1, indicating light off and on, respectively. The figure also shows the temporal sequence of the patterns with the binary coding that uniquely associates the code (lighting code) to each of the 2^5 strips. Each acquired image relative to the projected pattern is in fact a bit-plane, and together they form a bit-plane block. This block contains the n-bit sequences that establish the correspondence between all the points of the scene and their projection in the image plane (see Fig. 5.20).
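A minimal sketch of how the captured bit-planes could be turned into per-pixel strip indices follows; the fixed threshold, the normalization of the images to [0, 1], and the choice of projecting the most significant bit first are assumptions of this illustration.

```python
import numpy as np

def decode_bitplanes(images, threshold=0.5):
    """Decode n captured bit-plane images (pattern on = 1, off = 0) into a
    per-pixel strip index in [0, 2**n - 1]; pattern 1 is assumed to carry
    the most significant bit."""
    bits = [(img > threshold).astype(np.int64) for img in images]
    code = np.zeros_like(bits[0])
    for b in bits:                    # accumulate the binary code, MSB first
        code = (code << 1) | b
    return code
```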
[Fig. 5.20 panels: projection sequence of the 5 binary pattern planes, projector and camera, observed code P(10100) for a scene point]
Fig. 5.20 3D reconstruction of the scene by projecting in time sequence 5 pattern planes with binary coding. The observed surface is partitioned into 32 regions and each pixel is encoded, in the example, by a unique 5-digit binary code
Binary coding provides two levels of light intensity, encoded with 0 and 1. Binary coding can be made more robust by using the Gray code,12 where each band is encoded in such a way that two adjacent bands differ by a single bit, which is the maximum error possible in the encoding of the bands (see Fig. 5.21). The number of images with the Gray code is the same as with the binary code, and each image is a bit-plane of the Gray code that represents the luminous pattern plane to be projected. The transformation algorithm from binary code to Gray code is a simple recursive procedure (see Algorithm 24). The inverse recursive procedure that transforms a Gray code into a binary sequence is shown in Algorithm 25.
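An equivalent, non-recursive formulation of the two conversions (not the book's Algorithms 24 and 25, but producing the same codes) can be written in a few lines of Python using the closed form g = b XOR (b >> 1):

```python
def binary_to_gray(b):
    """Gray code of an integer band index: adjacent indices differ by one bit."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """Inverse conversion: XOR of all right shifts of the Gray code recovers
    the plain binary index."""
    b = g
    shift = 1
    while (g >> shift) > 0:
        b ^= g >> shift
        shift += 1
    return b

# example: the 32 band codes of a 5-bit pattern set
codes = [binary_to_gray(i) for i in range(32)]
```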
Once the images are acquired with the patterns superimposed on the surface of the objects, the 2^n bands are uniquely decoded through segmentation, and finally it is possible to calculate the relative 3D coordinates with a triangulation process and
12 Named after Frank Gray, a researcher at Bell Laboratories, who patented it in 1953. Also known as Reflected Binary Code (RBC), it is a binary coding method in which two successive values differ by only one bit (binary digit). RBC was originally designed to prevent spurious errors in various electronic devices and today is widely used in digital transmission. Basically, the Gray code is based on the Hamming distance (in this case 1), which evaluates the number of digit substitutions needed to make two strings of the same length equal.
[Fig. 5.21: the five bit-planes of the Gray code patterns (projection sequence 1-5, horizontal spatial distribution)]
Fig. 5.21 Example of a 5-bit Gray code that generates 32 bands with the characteristic that adjacent bands differ by only 1 bit. Compare with the structured light planes with binary coding, also at 5 bits, of Fig. 5.20
obtain a depth map. The coordinates (X, Y, Z) of each pixel (along the 2^n horizontal bands) are calculated from the intersection between the plane passing through the vertical band and the optical center of the projector, and the straight line passing through the optical center of the calibrated camera and the points of the band (see Fig. 5.20), according to Eq. (5.77).
The required segmentation algorithm is simple, since the bands are normally well-contrasted binary bars on the surface of the objects and, except in shadow areas, the projected luminous pattern plane does not optically interfere with the surface itself. However, to obtain an adequate spatial resolution, several pattern planes must be projected. For example, to have a resolution of 1024 bands, log₂ 1024 = 10 pattern planes must be projected and then 10 bit-plane images processed. Overall, the method
has the advantage of producing depth maps with high resolution and accuracy on the order of µm, and it is reliable when the Gray code is used. The limits are related to the static nature of the scene and to the considerable computational time required when a high spatial resolution is needed.
To improve the 3D resolution of the acquired scene and, at the same time, reduce the number of pattern planes, it is useful to project luminous pattern planes at gray levels [17] (or in color) [18]. In this way, the code base is increased with respect to binary coding. If m is the number of gray (or color) levels and n is the number of pattern planes (known as n-ary codes), we will have m^n bands, and each band is seen as a point in an n-dimensional space. For example, with n = 3 and using only m = 4 gray levels, we would have 4³ = 64 unique codes to characterize the bands, against the 6 pattern planes required with binary coding.
We have previously considered patterns based on binary coding, Gray code, and on
n-ary coding that have the advantage of encoding individual pixel regions without
spatially depending on neighboring pixels. A limitation of these methods is given by
the poor spatial resolution. A completely different approach is based on the Phase
Shift Modulation [19,20], which consists of projecting different modulated periodic
light patterns with a constant phase shift in each projection. In this way, we have
a high-resolution spatial analysis of the surface with the projection of sinusoidal
luminous patterns (fringe patterns) with constant phase shift (see Fig. 5.22).
Fig. 5.22 The phase shift based method involves the projection of 3 planes of sinusoidal light patterns modulated with a constant phase shift (on the right, an image of one of the luminous fringe planes projected onto the scene is displayed)
If we consider an ideal model of image formation, every point of the scene receives the luminous
fringes perfectly in focus and is not affected by other light sources.
Therefore, the intensity of each pixel (x, y) of the images $I_k$, k = 1, 2, 3, acquired
by projecting three planes of sinusoidal luminous fringes with a constant shift of the
phase angle θ, is given by the following:

$$I_1(x, y) = I_o(x, y) + I_a(x, y)\cos\!\big(\phi(x_p, y_p) - \theta\big)$$
$$I_2(x, y) = I_o(x, y) + I_a(x, y)\cos\!\big(\phi(x_p, y_p)\big)$$
$$I_3(x, y) = I_o(x, y) + I_a(x, y)\cos\!\big(\phi(x_p, y_p) + \theta\big)$$
where Io (x, y) is an offset that includes the contribution of other light sources in the
environment, Ia (x, y) is the amplitude of the modulated light signal,13 φ(x p , y p ) is
the phase of the luminous pixel of the projector which illuminates the point of the
scene projected in the point (x, y) in the image plane. The phase φ(x p , y p ) provides
the matching information in the triangulation process. Therefore, to calculate the
depth of the observed surface, it is necessary to recover the phase of each pixel
(known as the wrapped phase) relative to the three projections of sinusoidal
fringes, starting from the three images Ik. The wrapped phases thus recovered are then
combined to obtain a unique, unambiguous phase φ through a procedure known as
phase unwrapping.14 Phase unwrapping is a trivial operation if the context
of the wrapped phases is ideal. However, in real measurements various factors (e.g.,
presence of shadows, low modulation fringes, nonuniform reflectivity of the object’s
surface, fringe discontinuities, noise) influence the phase unwrapping process. As
we shall see, it is possible to use a heuristic solution to the phase unwrapping
problem which attempts to exploit continuity information on the measured surface to shift the data
where a period boundary has obviously been crossed, even though this is not an ideal solution and
does not completely handle the discontinuities.
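As a minimal illustration of the wrapping operator of Note 14 and of the ideal unwrapping step, the following sketch (assuming noise-free data; numpy's unwrap is used in place of the more robust heuristics mentioned above) wraps a synthetic phase ramp and recovers it:

import numpy as np

def wrap(phi):
    """Wrapping operator W: map the true phase into the principal interval (-pi, pi]."""
    return phi - 2.0 * np.pi * np.round(phi / (2.0 * np.pi))

# A synthetic phase ramp spanning several periods, its wrapped version, and the
# recovery of the original values with a simple 1D unwrapping (ideal, noise-free case).
phi_true = np.linspace(0.0, 6.0 * np.pi, 500)
psi = wrap(phi_true)
phi_rec = np.unwrap(psi)          # adds multiples of 2*pi where jumps exceed pi
print(np.allclose(phi_rec, phi_true))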
13 The value of Ia (x, y) is conditioned by the BRDF function of the point of the scene, by the
response of the camera sensor, by the arrangement of the tangent plane in that point of the scene
(as seen from foreshortening by the camera) and by the intensity of the projector.
14 It is known that the phase of a periodic signal is univocally defined in the main interval (−π ÷ π ).
As shown in the figure, fringes with sinusoidal intensity are repeated for different periods to cover
the entire surface of the objects. But this creates ambiguity (for example, 20° is indistinguishable
from 380° or 740°): when deriving the phase from the gray levels of the acquired images (5.77), it can
only be computed up to multiples of 2π, which is known as the wrapped phase. The recovery of the original
phase values from the values in the main interval is a classic problem in signal processing known as
phase unwrapping. Formally, phase unwrapping means that, given the wrapped phase
ψ ∈ (−π, π), we need to find the true phase φ, which is related to ψ as follows:
$$\psi = \mathcal{W}(\phi) = \phi - 2\pi\left\lfloor \frac{\phi}{2\pi} \right\rceil$$

where $\mathcal{W}$ is the phase wrapping operator and $\lfloor\cdot\rceil$ rounds its argument to the nearest
integer. It is shown that the phase unwrapping operator is generally a mathematically ill-posed
problem and is usually solved through algorithms based on heuristics that give acceptable solutions.
In this context, the phase unwrapping process must recover the absolute (true)
phase φ from the wrapped phase ψ, which is computed from the observed intensities
given by the images Ik, k = 1, 2, 3, of light fringes, that is, Eq. (5.77). It should be
noted that in these equations the terms Io and Ia are not known (we
will see later that they are removed), while the phase angle φ is the unknown.
According to the algorithm proposed by Huang and Zhang [19], the wrapped phase
is given by combining the intensities Ik as follows:
$$\frac{I_1(x, y) - I_3(x, y)}{2 I_2(x, y) - I_1(x, y) - I_3(x, y)} = \frac{\cos(\phi - \theta) - \cos(\phi + \theta)}{2\cos\phi - \cos(\phi - \theta) - \cos(\phi + \theta)} = \frac{2\sin\phi\,\sin\theta}{2\cos\phi\,[1 - \cos\theta]} = \frac{\tan\phi\,\sin\theta}{1 - \cos\theta} = \frac{\tan\phi}{\tan(\theta/2)} \qquad (5.79)$$

(obtained using the sum/difference trigonometric identities and the tangent half-angle formula)
from which the dependence on the terms Io and Ia is removed. Considering the
final result15 reported in (5.79), the phase angle, expressed in relation to the observed
intensities, is obtained as follows:
$$\psi(0, 2\pi) = \arctan\!\left(\sqrt{3}\,\frac{I_1(x, y) - I_3(x, y)}{2 I_2(x, y) - I_1(x, y) - I_3(x, y)}\right) \qquad (5.80)$$

where θ = 120° is considered, for which tan(θ/2) = √3. Equation (5.80) gives the phase
angle of the pixel in the local period from the intensities.
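A minimal sketch of Eq. (5.80) for the three-step case with θ = 120°, assuming the three fringe images are available as floating-point arrays (names are illustrative); arctan2 is used so that the quadrant of the wrapped phase is resolved:

import numpy as np

def wrapped_phase(i1, i2, i3):
    """Wrapped phase of Eq. (5.80) for a three-step phase shift with theta = 120 deg.

    i1, i2, i3 are the images acquired under the three shifted fringe patterns;
    arctan2 resolves the quadrant, so the result lies in (-pi, pi] and can then
    be unwrapped (e.g., row-wise with np.unwrap along the fringe direction)."""
    return np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)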
To remove the ambiguity of the discontinuity of the arctangent function in 2π, we
need to add or subtract multiples of 2π to the calculated phase angle ψ, which amounts to
performing the phase unwrapping (see Note 14 and Fig. 5.23), given by

$$\phi(x, y) = \psi(x, y) + 2\pi\, k(x, y) \qquad (5.81)$$

with k(x, y) an integer number of periods.
15 Obtained considering the tangent half-angle formula in the version tan(θ/2) = (1 − cos θ)/sin θ, valid for θ ≠ k · 180°.
Fig. 5.23 Illustration of the phase unwrapping process. The graph on the left shows the phase
angle φ(x, y) wrapped modulo 2π, while the graph on the right shows the result of the
unwrapped phase
[Fig. 5.24: triangulation geometry between the camera and the projector, separated by the distance L, with respect to the reference plane]

From the triangulation geometry of Fig. 5.24, by similar triangles we have

$$\frac{z}{Z - z} = \frac{d}{L} \qquad (5.82)$$
where z is the height of a pixel with respect to the reference plane, L is the separation
distance between projector and camera, Z is the perpendicular distance between the
reference plane and the segment joining the optical centers of camera and projector,
and d is the separation distance of the projection points of P (point of the object
surface) in the reference plane obtained by the optical rays (of the projector and
camera) passing through P (see Fig. 5.24). Considering Z ≫ z, Eq. (5.82) can be
simplified as follows:

$$z \approx \frac{Z}{L}\, d \;\propto\; \frac{Z}{L}\,(\phi - \phi_{ref}) \qquad (5.83)$$
where the unwrapped phase φref is obtained by projecting and acquiring the fringe
patterns on the reference plane in the absence of the object, while φ is obtained by
repeating the scan in the presence of the object. In essence, once the scanning system
has been calibrated (with known L and Z) and d has been determined by triangulation
(a sort of disparity of the unwrapped phase), the heights (depths) of the object's
surface are calculated with Eq. (5.83).
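A sketch of the final depth computation of Eq. (5.83), assuming the unwrapped phase maps with and without the object are available and that the factor converting the phase difference into the displacement d (here called k_phase, a hypothetical name) has been obtained by calibration:

def depth_from_phase(phi_obj, phi_ref, Z, L, k_phase):
    """Approximate height map from Eq. (5.83): z ~ (Z/L) * d.

    phi_obj, phi_ref : unwrapped phase maps with and without the object
    Z, L             : calibrated geometry of the projector-camera setup
    k_phase          : calibration factor converting the phase difference
                       (phi - phi_ref) into the displacement d on the
                       reference plane (hypothetical name)."""
    d = k_phase * (phi_obj - phi_ref)
    return (Z / L) * d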
Previously, we highlighted the problem of ambiguity in the method based on phase shift modulation
and the need for a phase unwrapping solution, which does not resolve the absolute phase unequivocally.
This ambiguity can be resolved by combining this method, which projects periodic patterns, with the
Gray code pattern projection described above.
For example, projecting only 3 binary patterns would divide the surface of the object
into just 8 regions, while also projecting the periodic patterns increases the spatial
resolution, yielding a more accurate reconstruction of the depth map. In fact,
once the phase of a given pixel has been calculated, the period of the sinusoid in which
the pixel lies is obtained from the region of belonging associated with the binary
code.
Figure 5.25 gives an example [21] which combines the binary code method (Gray
code) and the one with phase shift modulation. There are 32 binary code sequences
to partition the surface, determining the phase interval unambiguously, while phase
shift modulation reaches a subpixel resolution beyond the number of split regions
expected by the binary code. As shown in the figure, the phase modulation is achieved
by approximating the sine function with on/off intensities of the patterns generated
by a projector. These patterns are then shifted in steps of π/2, for a total of 4
shifts.
These last two methods have the advantage of operating independently of the
environmental lighting conditions, but have the disadvantage of requiring several
light projection patterns and are not suitable for scanning dynamic objects.
Methods based on the projection of sequential patterns therefore have the problem of being
unsuitable for acquiring depth maps in the context of dynamic scenes (such as moving objects or people).
Fig. 5.25 Method that combines the projection of 4 Gray-coded binary pattern planes with phase shift
modulation achieved by approximating the sine function with 4 angular shifts of step
π/2, thus obtaining a sequence of 32 coded bands
As with all 3D reconstruction systems, structured light approaches also require a calibration
phase of the camera and of the projection system. In the literature, there are
various methods [25]. The objective is to estimate the intrinsic and extrinsic parame-
ters of the camera and projection system with appropriate calibration according to the
resolution characteristics of the camera and the projection system itself. The camera
is calibrated, assuming a perspective projection (pinhole) model, by observing from different
angles a black-and-white checkerboard reference plane (with known 3D geometry) and
establishing a nonlinear relationship between the spatial coordinates (X, Y, Z ) of
3D points of the scene and the coordinates (x, y) of the same points projected in the
image plane. The calibration of the projector depends on the scanning technology
used. In the case of projection of a pattern plane with known geometry, the cali-
bration is done by calculating the homography matrix (see Sect. 3.5 Vol. II) which
establishes a relationship between points of the plane of the patterns projected on a
plane and the same points observed by the camera considering known the separation
distance between the projector and the camera, and known the intrinsic parameters
(focal length, sensor resolution, center of the sensor plane, ...) of the camera itself. Once the
homography matrix has been estimated, a relationship can be established, for each projected point,
between the projector plane coordinates and those of the image plane.
The calibration of the projector is twofold: the calibration of the active light source
that projects the light patterns, and the geometric calibration of the projector seen
as a normal camera in reverse. The calibration of the light source of the projector
must ensure the stability of the contrast through the analysis of the intensity curve
providing for the projection of light patterns acquired by the camera and establishing
a relationship between the intensity of the projected pattern and the corresponding
values of the pixels detected from the camera sensor. The relationship between the
intensities of the pixels and that of the projected patterns determines the function to
control the linearity of the lighting intensity.
The geometric calibration of the projector consists of considering it as a reverse
camera. The optical model of the projector is the same as that of the camera (pinhole
model); only the direction changes. With this inverse geometry, the difficult problem to
solve is locating, in the projector plane, the point that corresponds to a given point of the
image plane, that is, to the projection of the same 3D point of the scene. In essence, the homographic
correspondence between points of the scene seen simultaneously by the camera and
the projector must be established.
Normally the camera is first calibrated with respect to a calibration plane, for which
a homography H is established between the coordinates of the calibration
plane and those of the image plane. Then light patterns of known geometry
are projected onto the calibration plane and acquired by the camera. Through
the homography transformation H, the known-geometry patterns projected onto the
calibration plane are known in the reference system of the camera, that is, they
are projected homographically in the image plane. This actually accomplishes the
calibration of the projector with respect to the camera, having established with H
the geometric transformation between points of the projector pattern plane (via
the calibration plane) and, with the inverse transform H⁻¹, the mapping between the
image plane and the pattern plane (see Sect. 3.5 Vol. II). The accuracy of the geometric
calibration of the projector is strictly dependent on the initial calibration of the camera
itself.
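A possible sketch of the homography step using OpenCV (the point arrays are placeholders; in practice they come from detecting the projected calibration pattern in the camera image): findHomography estimates H between projector-plane and image-plane points, and its inverse maps image points back to the pattern plane.

import cv2
import numpy as np

# Corresponding points: features of the projected pattern expressed in the
# projector plane, and the same features detected by the camera on the
# calibration plane (placeholder values).
proj_pts = np.array([[0, 0], [100, 0], [100, 100], [0, 100]], dtype=np.float32)
cam_pts = np.array([[32, 41], [210, 38], [215, 220], [29, 224]], dtype=np.float32)

# Homography H mapping projector-plane points to image-plane points.
H, mask = cv2.findHomography(proj_pts, cam_pts, cv2.RANSAC)

# The inverse transform H^-1 maps image-plane points back to the pattern plane.
img_pt = np.array([[[120.0, 130.0]]], dtype=np.float32)
pattern_pt = cv2.perspectiveTransform(img_pt, np.linalg.inv(H))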
The computed depth maps, and in general the 3D surface reconstruction technologies
based on the shape from structured light approach, are widely used in industrial
vision applications, where the lighting conditions are highly variable and a passive
binocular vision system would be inadequate. In this case, structured light systems
can be used to have a well-controlled environment as required, for example, for
robotized cells with the movement of objects for which the measurements of 3D
shapes of the objects are to be calculated at time intervals. They are also applied for
the reconstruction of parts of the human body (for example, facial reconstruction,
dentures, 3D reconstruction in plastic surgery interventions) and generally in support
of CAD systems.
5.7 Shape from (de)Focus
This technique is based on the fact that the depth of field of optical systems is finite. Therefore,
only objects that lie within a given depth interval, which depends on the distance between the object
and the observer and on the characteristics of the optics used, are perfectly in focus in the image.
Outside this range, the object in the image is blurred in proportion to its distance from the optical
system. Recall that the convolution process with appropriate filters (for example, Gaussian or
binomial filters) was used as a tool for blurring the image (see Sect. 9.12.6 Vol. I), and that the
image formation process itself can be modeled as a convolution (see Chap. 1 Vol. I), which
intrinsically introduces blurring into the image.
By proceeding in the opposite direction, that is, from the estimate of blurring
observed in the image, it is possible to estimate a depth value knowing the parameters
of the acquisition system (focal length, aperture of the lens, etc.) and the transfer
function with which it is possible to model the blurring (for example, convolution
with Gaussian filter). This technique is used when one wants to obtain qualitative
information of the depth map or when one wants to integrate the depth information
with that obtained with other techniques (data fusion integrating, for example, with
depth maps obtained from stereo vision and stereo photometry).
Depth information is estimated with two possible strategies:
1. Shape from Focus (SfF): a sequence of images is acquired while varying the relative geometry of optics, sensor, and object, and the depth of each point is obtained from the configuration in which the point appears in sharpest focus;
2. Shape from Defocus (SfD): the depth map is reconstructed from the blurring information of a few images acquired with different settings of the acquisition system.
Figure 5.26 shows the basic geometry of the image formation process on which the
shape from focus proposed in [26] is based. The light reflected from a point P of the
scene is refracted by the lens and converges at the point Q in the image plane. From
the Gaussian law of a thin lens (see Sect. 4.4.1 Vol. I), we have the relation, between
the distance p of the object from the lens, distance q of the image plane from the
lens and focal length f of the lens, given by
$$\frac{1}{p} + \frac{1}{q} = \frac{1}{f} \qquad (5.84)$$
According to this law, points of the object plane are projected into the image plane
(where the sensor is normally placed) and appear as well-focused luminous points
thus forming in this plane the image I f (x, y) of the scene resulting perfectly in focus.
If the plane of the sensor does not coincide with the image plane but is shifted by a
distance δ (before or after the focused image plane; in the figure it is shifted after),
the light coming from the point P of the scene and refracted by the lens undergoes
a dispersion, and in the sensor plane the projection of P in Q appears as a blurred
circular luminous spot, assuming a circular aperture of the lens.16 This physical blurring process occurs at all points in
the scene, resulting in a blurred image in the sensor plane Is (x, y). Using similar
triangles (see Fig. 5.26), it is possible to derive a formula to establish the relationship
between the radius of the blurred disk r and the displacement δ of the sensor plane
from the focal plane, obtaining
$$\frac{r}{R} = \frac{\delta}{q} \quad\text{from which}\quad r = \frac{\delta R}{q} \qquad (5.85)$$
where R is the radius of the lens (or aperture). From Fig. 5.26, we observe that the
displacement of the sensor plane from the focal image plane is given by δ = i − q. It is
pointed out that the intrinsic parameters of the optical and camera system are
(i, f, and R). The dispersion function that describes the blurring of a point in the sensor plane
can be modeled in physical optics.17 An approximation of the physical model of
point blurring can be achieved with a two-dimensional Gaussian function, under the
hypothesis of limited diffraction and incoherent illumination.18
Thus, the blurred image Is (x, y) can be obtained through the convolution of the
image in focus I f (x, y) with the PSF Gaussian function h(x, y), as follows:
16 This circular spot is also known as confusion circle or confusion disk in photography or blur
circle, blur spot in image processing.
17 Recall from Sect. 5.7 Vol. I that in the case of circular openings the light intensity distribution
occurs according to the Airy pattern, a series of concentric rings that are always less luminous due
to the diffraction phenomenon. This distribution of light intensity on the image (or sensor) plane is
known as the dispersion function of a luminous point (called PSF—Point Spread Function).
18 Normally, the formation of images takes place in conditions of illumination from natural (or arti-
ficial) incoherent radiation or from (normally extended) non-monochromatic and unrelated sources
where diffraction phenomena are limited and those of interference cancel each other out. The lumi-
nous intensity in each point is given by the sum of the single radiations that are incoherent with
each other or that do not maintain a constant phase relationship. The coherent radiations are instead
found in a constant phase relation between them (for example, the light emitted by a laser).
[Fig. 5.26: image formation geometry with a thin lens of radius R and focal length f: a point P on the object plane at distance p is focused at Q on the image plane at distance q, while a displaced sensor plane records the blur circle of diameter 2r around Q′, giving the focused image If(x, y) and the blurred image Is(x, y)]
$$I_s(x, y) = I_f(x, y) * h(x, y) \qquad (5.86)$$

with

$$h(x, y) = \frac{1}{2\pi \sigma_h^2}\, e^{-\frac{x^2 + y^2}{2\sigma_h^2}} \qquad (5.87)$$
where the symbol "∗" indicates the convolution operator, σh is the dispersion param-
eter (constant for each point P of the scene, assuming the convolution a spatially
invariant linear transformation) that controls the level of blurring corresponding to the
standard deviation of the 2D Gaussian Point Spread Function (PSF), and is assumed
to be proportional [27,28] to the radius r .
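A minimal sketch of the blur model of Eqs. (5.85)–(5.87), assuming the focused image is available as an array and using a Gaussian filter as the PSF, with σh proportional to the blur-circle radius r:

from scipy.ndimage import gaussian_filter

def defocus(i_f, delta, R, q, k=1.0):
    """Blur model of Eqs. (5.85)-(5.87): I_s = I_f * h, with sigma_h = k * r.

    i_f   : focused image (2D array)
    delta : displacement of the sensor plane from the focused image plane
    R, q  : lens radius and lens-to-image-plane distance
    k     : camera-dependent proportionality factor (from calibration)."""
    r = delta * R / q            # radius of the blur circle, Eq. (5.85)
    sigma_h = k * r              # dispersion of the Gaussian PSF
    return gaussian_filter(i_f, sigma=sigma_h)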
Blurred image formation can also be analyzed in the frequency domain, where it is
characterized by the Optical Transfer Function (OTF), which corresponds to the Fourier
transform of the PSF. Indicating with I_f(u, v), I_s(u, v), and H(u, v) the Fourier
transforms, respectively, of the image in focus, the blurred
H(u, v), the Fourier transforms, respectively, of the image in focus, the blurred
image and the Gaussian PSF, the convolution expressed by the (5.86) in the Fourier
domain results in the following:
$$I_s(u, v) = I_f(u, v)\, H(u, v) \qquad (5.88)$$

where

$$H(u, v) = e^{-\frac{u^2 + v^2}{2}\,\sigma_h^2} \qquad (5.89)$$
From Eq. (5.89), which represents the optical transfer function of the blurring process,
its dependence on the dispersion parameter σh is explicitly observed; it also depends
indirectly on the intrinsic parameters of the optical and camera system, considering that
σh = k · r depends on r up to a proportionality factor k which, in turn, depends on
the characteristics of the camera and can be determined from a previous calibration of
the same camera. Considering the circular symmetry of the OTF, expressed by (5.89),
which still has a Gaussian form, the blurring amounts to passing the low frequencies and cutting
the high frequencies as σh increases, which is in turn conditioned by the displacement of the
sensor plane and hence by the depth of the scene point. Images with different levels of blurring
can be acquired in three ways:
1. Translating the sensor plane with respect to the image plane where the scene is
in perfect focus.
2. Translating the optical system.
3. Translating the objects of the scene relative to the object plane, onto which the optical
system focuses when forming the image. Normally, only the points of a 3D object
belonging to the object plane are perfectly in focus; all the other
points, before and after the object plane, are more or less acceptably in focus, in relation
to the depth of field of the optical system.
The mutual translation between the optical system and the sensor plane (modes 1 and
2) introduces a scale factor (of apparent reduction or enlargement) of the scene by
varying the coordinates in the image plane of the points of the scene and a variation
of intensity in the acquired image, caused by the different distribution of irradiance
in the sensor plane. These drawbacks are avoided by acquiring images translating
only the scene (mode 3) with respect to a predetermined setting of the optical-sensor
system, thus keeping the scale factor of the acquired scene constant.
Figure 5.27 shows the functional scheme of an approach shape from focus pro-
posed in [26]. We observe the profile of the surface S of the unknown scene whose
depth is to be calculated and in particular a surface element ( patch) s is highlighted.
We distinguish a reference base with respect to which the distance d f of the focused
object plane is defined and the distance d from the object-carrying translation basis
is defined simultaneously. These distances d f and d can be measured with controlled
resolution. Now consider the patch s and the situation where the base moves toward
the focused object plane (i.e., toward the camera).
In the acquired images, the patch s will tend to be more and more in focus, reaching
the maximum when the base reaches the distance d = dm, and then the defocusing
process begins as soon as the patch passes the focused object plane. If, for each
translation step Δd, the distance d of the base and the blur level of the patch s are
recorded, we can estimate the height (depth) ds = df − dm at the value d = dm
where the patch has the highest level of focus.
This procedure is applied for any patch on the surface S. Once the system has been
calibrated, from the height ds , the depth of the surface can also be calculated with
respect to the sensor plane or other reference plane.
[Fig. 5.27: functional scheme of the shape from focus approach: the surface S rests on a mobile base at distance d from the reference base and is moved toward the focused object plane, located at distance df, observed by the sensor plane]
Once the mode of acquisition of image sequences has been defined, to determine
the depth map it is necessary to define a measurement strategy of the level of blurring
of the points of the 3D objects, not known, placed on a mobile base. In the literature
various metrics have been proposed to evaluate in the case of Shape from Focus (SfF)
the progression of focusing of the sequence of images until the points of interest of
the scene are in sharp focus, while in the case of Shape from Defocus (SfD) the
depth map is reconstructed from the blurring information of several images. Most of
the proposed SfF metrics [26,29,30] measure the level of focus by considering local
windows (which include a surface element) instead of the single pixel.
The goal is to automatically extract the patches of interest, characterized by strong local
intensity variations, through ad hoc operators that evaluate the level of focus of the patches
from the presence of high frequencies. In fact, patches with high texture, perfectly in focus,
give high responses to high-frequency components. Such patches, with maximum responses to high
frequencies, can be detected by analyzing the sequence of images in the Fourier domain or in the
spatial domain.
In Chap. 9 Vol. I, several local operations have been described for both domains
characterized by different high-pass filters. In this context, the linear Laplacian operator
(see Sect. 1.12 Vol. II) is used, based on second-order differentiation, which accentuates the
variations in intensity and is isotropic. Applied to the image I(x, y), the Laplacian ∇² is given by
$$\nabla^2 I(x, y) = \frac{\partial^2 I(x, y)}{\partial x^2} + \frac{\partial^2 I(x, y)}{\partial y^2} = I(x, y) * h_\nabla(x, y) \qquad (5.90)$$
calculable in each pixel (x, y) of the image. Equation (5.90), in the last expres-
sion, also expresses the Laplacian operator in terms of convolution, considering the
function PSF Laplacian h ∇ (x, y) (described in detail in Sect. 1.21.3 Vol. II). In the
frequency domain, indicating with F the Fourier transform operator, the Laplacian
of image I (x, y) is given by
where we remember that h(x, y) is the Gaussian PSF function. For the associative
property of convolution, the previous equation can be rewritten as follows:
Equation (5.93) informs us that, instead of directly applying the Laplacian operator
to the blurred image Is with (5.92), it is also possible to apply it first to the focused
image I f and then blur the result obtained with the Gaussian PSF. In this way, with the
Laplacian only the high spatial frequencies are obtained from the I f and subsequently
attenuated with the Gaussian blurring, useful for attenuating the noise normally
present in the high-frequency components. In the Fourier domain, the application of
the Laplacian operator to the blurred image, considering also Eqs. (5.89) and (5.91),
results in the following:
We highlight how, in the Fourier domain, for each frequency (u, v), the transfer function
H(u, v) · H∇(u, v) (the product of the Laplacian operator and the Gaussian
blurring filter) has a Gaussian behavior controlled by the blurring parameter σh.
Therefore, a sufficiently textured image of the scene will present a richness of high
frequencies emphasized by the Laplacian filter H ∇ (u, v) and attenuated by the con-
tribution of the Gaussian filter according to the value of σh . The attenuation of high
frequencies is almost nil (ignoring any blurring due to the optical system) when the
image of the scene is in focus with σh = 0.
If the image is not well and uniformly textured, the Laplacian operator does not
guarantee a good measure of image focusing as the operator would hardly select dom-
inant high frequencies. Any noise present in the image (due to the camera sensor)
would introduce high spurious frequencies altering the focusing measures regardless
of the type of operator used. Normally, noise would tend to violate the spatial invari-
ance property of the convolution operator (i.e., the PSF would vary spatially in each
pixel h σ (x, y)).
To mitigate the problems caused by noise when working with real images, the
focus measures obtained with the Laplacian operator are computed locally at
each pixel (x, y) by summing the significant values included in a support window Ω_{x,y}
of n × n size, centered at the pixel being processed (x, y). The focus measures are then given by

$$F(x, y) = \sum_{(i, j)\,\in\,\Omega_{x,y}} \nabla^2 I(i, j) \qquad \text{for} \qquad \nabla^2 I(i, j) \geq T \qquad (5.95)$$
where T indicates a threshold value beyond which the Laplacian value of a pixel is
considered significant within the support window of the Laplacian operator. The size
of Ω_{x,y} (normally a square window of size equal to or greater than 3 × 3) is chosen in relation to
the size of the texture of the image. It is evident that with the Laplacian, the partial
second derivatives in the direction of the horizontal and vertical components can
have equal and opposite values, i.e., I x x = −I yy =⇒ ∇ 2 I = 0, thus canceling
each other out. In this case, the operator would produce incorrect responses, even in the
presence of texture, as the contributions of the high frequencies associated with this
texture would cancel. To prevent the cancelation of such high frequencies, Nayar and Nakagawa [26]
proposed a modified version of the Laplacian, known as the
Modified Laplacian (ML), given by
$$\nabla^2_M I = \left|\frac{\partial^2 I}{\partial x^2}\right| + \left|\frac{\partial^2 I}{\partial y^2}\right| \qquad (5.96)$$
Compared to the original Laplacian, the modified one is always greater than or equal to it. To
adapt to the possible dimensions of the texture, it is also proposed to calculate the partial
derivatives using a variable step s ≥ 1 between the pixels belonging to the window Ω_{x,y}.
x,y . The discrete approximation of the modified Laplacian is given by
$$\nabla^2_{M_D} I(x, y) = \big|{-I(x+s, y)} + 2 I(x, y) - I(x-s, y)\big| + \big|{-I(x, y+s)} + 2 I(x, y) - I(x, y-s)\big| \qquad (5.97)$$
The final focus measure SML(x, y), known as the Sum Modified Laplacian, is calculated
as the sum of the values of the modified Laplacian $\nabla^2_{M_D} I$, given by

$$SML(x, y) = \sum_{(i, j)\,\in\,\Omega_{x,y}} \nabla^2_{M_D} I(i, j) \qquad \text{for} \qquad \nabla^2_{M_D} I(i, j) \geq T \qquad (5.98)$$
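A minimal sketch of the modified Laplacian (5.97) and of the SML measure (5.98), with border handling kept deliberately simple (wrap-around) and the window sum computed with a uniform filter:

import numpy as np
from scipy.ndimage import uniform_filter

def modified_laplacian(img, s=1):
    """Discrete modified Laplacian of Eq. (5.97) with step s (borders wrap around)."""
    ml_x = np.abs(-np.roll(img, -s, axis=1) + 2.0 * img - np.roll(img, s, axis=1))
    ml_y = np.abs(-np.roll(img, -s, axis=0) + 2.0 * img - np.roll(img, s, axis=0))
    return ml_x + ml_y

def sml(img, window=5, s=1, T=0.0):
    """Sum Modified Laplacian, Eq. (5.98): window sum of the values >= T."""
    ml = modified_laplacian(np.asarray(img, dtype=float), s)
    ml = np.where(ml >= T, ml, 0.0)
    # uniform_filter returns the local mean; multiplying by the window area gives the sum
    return uniform_filter(ml, size=window) * (window * window)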
Several other focusing operators are reported in the literature based on the gradient
(i.e., on the first derivative of the image) which in analogy to the Laplacian operator
evaluates the edges present in the image; on the coefficients of the discrete wavelet
transform by analyzing the content of the image in the frequency domain and the spa-
tial domain, and using these coefficients to measure the level of focus; on the discrete
cosine transform (DCT), on the median filter and statistical methods (local variance,
texture, etc.). In [31], the comparative evaluation of different focus measurement
operators is reported.
Once the focus measurement operator has been defined, the depth estimate of
each point (x, y) of the surface is obtained from the set of focusing measurements
related to the sequence of m images acquired according to the scheme of Fig. 5.27.
For each image of the sequence, the focus measure is calculated with (5.98) (or
with another measurement method) for each pixel, using a support window Ω_{x,y} of n × n size;
the depth of each pixel is then assigned from the image of the sequence in which its focus
measure is maximum.
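A possible sketch of this per-pixel selection over the focus stack, assuming the distances of the mobile base registered at each step are available and using a generic focus operator (for example, the sml of the previous sketch):

import numpy as np

def depth_from_focus(stack, distances, d_f, focus_measure):
    """Depth d_s = d_f - d_m per pixel, where d_m is the base distance of the image
    of the sequence in which the focus measure is maximum.

    stack         : sequence of m images acquired while the base advances
    distances     : base distance d registered for each image of the sequence
    d_f           : distance of the focused object plane from the reference base
    focus_measure : per-pixel focus operator, e.g. the sml of the previous sketch."""
    measures = np.stack([focus_measure(img) for img in stack], axis=0)  # (m, H, W)
    j_max = np.argmax(measures, axis=0)          # index of the sharpest image per pixel
    d_m = np.asarray(distances)[j_max]           # base distance at peak focus
    return d_f - d_m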
In the previous paragraph, we have seen the method of focusing based on setting
the parameters of the optical-sensor–object system according to the formula of thin
lenses (5.84) to have an image of the scene in focus. We have also seen what the
parameters are and how to model the process of defocusing images. We will take
up this last aspect in order to formulate the method of the S f D which relates the
object–optical distance (depth), the sensor–optical parameters, and the parameters
that control the level of blur to derive the depth map. Pentland [27] has derived from
(5.84) an equation that relates the radius r of the blurred circular spot with the depth
p of a scene point. We analyze this relationship to extract a dense depth map with
the S f D approach.
Returning to Fig. 5.26, if the sensor plane does not coincide with the focal image
plane, a blurred image is obtained in the sensor plane Is where each bright point of
the scene is a blurred spot involving a circular pixel window known precisely as a
circle of confusion of radius r . We have seen in the previous paragraph the relation
(5.85), which links the radius r of this circle with the circular opening of the lens of
radius R, the translation δ of the sensor plane with respect to the focal image plane
and the distance i between the sensor plane P S and the lens center. The figure shows
the two situations in which the object is displaced by Δp in front of the object plane PO (on
this plane, it would be perfectly in focus in the focal plane PF) and the opposite
situation with the object translated by Δp but closer to the lens. In the two situations,
according to Fig. 5.26, the translation δ is given by:

$$\delta = i - q \qquad\qquad \delta = q - i \qquad (5.100)$$
where i indicates the distance between the lens and the sensor plane and q the distance
of the focal image plane from the lens. A characteristic of the optical system is given
by the so-called f/number, here indicated with f # = f /2R which expresses the ratio
between the focal f and the diameter 2R of the lens (described in Sect. 4.5.1 Vol. I).
If we express the radius R of the lens in terms of the f/number $f_\#$, we have
$R = \frac{f}{2 f_\#}$, which, substituted into (5.85) together with the first equation of (5.100),
gives the following relation for the radius r of the blurred circle:

$$r = \frac{f \cdot i - f \cdot q}{2 f_\# \cdot q} \qquad (5.101)$$
In addition, solving (5.101) for q and substituting into the thin-lens
formula (5.84), q is eliminated and we obtain:

$$r = \frac{p\,(i - f) - f \cdot i}{2 f_\# \cdot p} \qquad (5.102)$$
Solving (5.102) for the depth p, we finally obtain

$$p = \begin{cases} \dfrac{f \cdot i}{\,i - f - 2 f_\# \cdot r\,} & \text{if } \delta = i - q\\[2ex] \dfrac{f \cdot i}{\,i - f + 2 f_\# \cdot r\,} & \text{if } \delta = q - i \end{cases} \qquad (5.103)$$
It is pointed out that Eq. (5.103), valid in the context of geometric optics, relates the
calculation of the depth p of a point of the scene to the corresponding radius r
of the blurring circle. Furthermore, Pentland proposed to consider the size σh of
the Gaussian PSF hσ proportional to the radius r of the blurring circle up to a
factor k:

$$\sigma_h = k \cdot r \qquad (5.104)$$
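A minimal sketch of Eqs. (5.103)–(5.104): given the estimated spread σh of the Gaussian PSF and the calibrated parameters of the system, the depth p is recovered for the two cases of Eq. (5.100):

def depth_from_blur(sigma_h, i, f, f_number, k, sensor_beyond_focus=True):
    """Depth p from the estimated blur, Eqs. (5.103)-(5.104).

    sigma_h  : estimated spread of the Gaussian PSF
    i, f     : lens-sensor distance and focal length
    f_number : f/# = f / (2R)
    k        : factor such that sigma_h = k * r (from calibration)."""
    r = sigma_h / k                                    # Eq. (5.104)
    if sensor_beyond_focus:                            # case delta = i - q
        return f * i / (i - f - 2.0 * f_number * r)
    return f * i / (i - f + 2.0 * f_number * r)        # case delta = q - i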
[Figure: behavior of the blur as a function of the depth (×100 mm), for a system focused at 1 m and with the distance i of the sensor plane kept constant while the depth p varies]
With (5.103) and (5.104), the problem of calculating the depth p is led back to
the estimation of the blurring parameter σh and the radius of the blurring circle r
once known (through calibration) the intrinsic parameters ( f, f # , i) and the extrinsic
parameter (k) of the acquisition system. In fact, Pentland [27] has proposed a method
that requires the acquisition of at least two images with different settings of the system
parameters to detect the different levels of blurring and derive with the (5.103) a
depth estimate. To calibrate the system, an image of the perfectly focused scene
(zero blurring) is initially acquired by adequately setting the acquisition parameters
(large f # ) against which the system is calibrated.
We assume an orthographic projection of the scene (pinhole model) to derive the
relation between the blurring function and the setting parameters of the optical-sensor
system. The objective is to estimate the depth by evaluating the difference of
the PSF between a pair of differently blurred images; hence the name SfD. Pentland's idea
is to emulate human vision, which is capable of evaluating the depth of the scene
based on similar principles, since the focal length of the human visual
system varies sinusoidally at a frequency of about 2 Hz.
In fact, the blurring model considered is the one modeled with the convolution
(5.86) between the image of the perfectly focused scene and the Gaussian blurring
function (5.87), which in this case is indicated with h_{σ(p,e)}(x, y) to make explicit that
the defocusing (blurring) depends on the distance p of the scene
from the optics and on the setting parameters e = (i, f, f_#) of the optical-sensor
system. We rewrite the blurring model given by the convolution equations and the
Gaussian PSF as follows:
$$I_s(x, y) = I_f(x, y) * h_{\sigma(p,e)}(x, y) \qquad (5.105)$$

with

$$h_{\sigma(p,e)}(x, y) = \frac{1}{2\pi \sigma_h^2}\, e^{-\frac{x^2 + y^2}{2\sigma_h^2}} \qquad (5.106)$$
Analyzing these equations, it is observed that the defocused image Is(x, y) is known
from the acquisition, while the parameters e are known from the calibration of the
system. The depth p and the focused image If(x, y) are instead unknown. The idea
is to acquire at least two defocused images with different settings e1 and e2 of the
system parameters to obtain at least a theoretical estimate of the depth p. Equation (5.105) is
not linear with respect to the unknown p, and therefore cannot be used to solve the problem
directly, also because of numerical stability issues if a minimization functional
were used. Pentland proposed to solve for the unknown p by operating in the Fourier
domain. In fact, if we consider the two defocused images acquired, modeled by the
two spatial convolutions, we have

$$I_{s_1}(x, y) = I_f(x, y) * h_{\sigma(p,e_1)}(x, y) \qquad\qquad I_{s_2}(x, y) = I_f(x, y) * h_{\sigma(p,e_2)}(x, y) \qquad (5.107)$$

and executing the ratio of the corresponding Fourier transforms (remembering (5.106)), we obtain

$$\frac{I_{s_1}(u, v)}{I_{s_2}(u, v)} = \frac{I_f(u, v)\, H_{\sigma_1}(u, v)}{I_f(u, v)\, H_{\sigma_2}(u, v)} = \frac{H_{\sigma_1}(u, v)}{H_{\sigma_2}(u, v)} = \exp\!\left[-\frac{1}{2}(u^2 + v^2)(\sigma_1^2 - \sigma_2^2)\right] \qquad (5.108)$$

and, taking the logarithm,

$$\ln\frac{I_{s_1}(u, v)}{I_{s_2}(u, v)} = \frac{1}{2}(u^2 + v^2)\big[\sigma^2(p, e_2) - \sigma^2(p, e_1)\big] \qquad (5.109)$$
where the ideal, perfectly focused image If is eliminated. Since the transforms I_{s1}
and I_{s2} are known, and the functions σ(p, e1) and σ(p, e2) can be calibrated, it is possible
to derive from (5.109) the term (σ1² − σ2²), given by
! "
1 Is1 (u, v)
σ12 − σ22 = −2 2 ln (5.110)
u +v 2 Is2 (u, v) W
where ⟨·⟩_W denotes an average calculated over an extended area W of the
spectral domain, instead of considering the single frequencies (u, v) of a point in that
domain. If one of the images is perfectly in focus, we have σ1 = 0, and σ2 is estimated
by (5.110) while the depth p is calculated with (5.103). If, on the other hand, the two
images are defocused due to the different settings of the system parameters, we will
images are defocused due to the different settings of the system parameters, we will
have σ1 > 0 and σ2 > 0, two different values of the distance from the image plane
and from the lens i 1 and i 2 . Substituting these values in (5.103) we have
$$p = \frac{f \cdot i_1}{i_1 - f - 2 r_1 f_\#} = \frac{f \cdot i_2}{i_2 - f - 2 r_2 f_\#} \qquad (5.111)$$
and considering the proportionality relation σh = kr we can derive a linear relation-
ship between σ1 and σ2 , given by:
σ1 = ασ2 + β (5.112)
where:
$$\alpha = \frac{i_1}{i_2} \qquad\text{and}\qquad \beta = \frac{f \cdot i_1 \cdot k}{2 f_\#}\left(\frac{1}{i_2} - \frac{1}{i_1}\right) \qquad (5.113)$$
In essence, we now have two equations that establish a relationship between σ1 and
σ2: Eq. (5.112), in terms of the known parameters of the optical-sensor system, and
Eq. (5.110), in terms of the level of blur between the two defocused images derived from
the convolutions. Both are useful for determining depth. In fact, from (5.110), we
have σ1² − σ2² = C; replacing in this the value of σ1 given by
(5.112), we get an equation with a single unknown from which to calculate σ2, given by [32]:

$$(\alpha^2 - 1)\,\sigma_2^2 + 2\alpha\beta\,\sigma_2 + \beta^2 - C = 0 \qquad (5.114)$$

where:
$$C = \frac{1}{A}\iint_W \frac{-2}{u^2 + v^2}\,\ln\frac{I_{s_1}(u, v)}{I_{s_2}(u, v)}\, du\, dv \qquad (5.115)$$
The measurement of the defocusing difference C = σ1² − σ2² in the Fourier domain
is calculated considering the average of the values over a window W of frequencies
centered at the point (u, v) being processed, where A is the area of the
window W. With (5.114), we have a quadratic equation to estimate σ2. If the principal
distances are equal, i1 = i2, then α = 1 and a single value of σ2 is obtained. Once the
parameters of the optical-sensor system are known, the depth p can be calculated
with one of the two Eq. (5.103). The procedure is repeated for each pixel of the image
thus obtaining a depth-dense map having acquired only two defocused images with
different settings of the acquisition system.
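A possible sketch of the measurement of C = σ1² − σ2² via the Fourier-domain ratio of Eqs. (5.110)/(5.115), computed here over the whole image rather than over per-pixel windows for brevity:

import numpy as np

def blur_difference(i_s1, i_s2, eps=1e-8):
    """Estimate C = sigma_1^2 - sigma_2^2 from two defocused images, Eqs. (5.110)/(5.115).

    The ratio of the Fourier transforms cancels the unknown focused image; the
    average is taken over all non-zero frequencies of the whole image."""
    F1 = np.fft.fft2(i_s1)
    F2 = np.fft.fft2(i_s2)
    u = 2.0 * np.pi * np.fft.fftfreq(i_s1.shape[0])[:, None]
    v = 2.0 * np.pi * np.fft.fftfreq(i_s1.shape[1])[None, :]
    w2 = u ** 2 + v ** 2
    band = w2 > eps                                   # exclude the DC component
    log_ratio = np.log(np.abs(F1[band]) + eps) - np.log(np.abs(F2[band]) + eps)
    return float(np.mean(-2.0 * log_ratio / w2[band]))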
The approaches of S f D described are essentially based on the measurement of
the defocusing level of multiple images with different settings of the parameters of
the acquisition system. This measurement is estimated for each pixel often consid-
ering also the pixels of the surroundings included in a square window of adequate
dimensions, assuming that the projected points of the scene have constant depth.
The use of this local window also tends to average out noise and minimize artifacts.
In the literature, SfD methods have been proposed based on global algorithms that
operate simultaneously on the whole image, under the hypothesis that image intensity
and shape are spatially correlated, although the image formation process tends to
lose intensity-shape information. This leads to a typically ill-posed problem, for which
solutions have been proposed based on regularization [33], which introduces minimization
functionals and thus turns the ill-posed problem into a problem of numerical approximation
or energy minimization, or on formulations with Markov random fields (MRF)
[34], or on a diffusion process based on differential equations [35].
References
1. B.K.P. Horn, Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque
Object from One View. Ph.D. thesis (MIT, Boston-USA, 1970)
2. E. Mingolla, J.T. Todd, Perception of solid shape from shading. Biol. Cybern. 53, 137–151
(1986)
3. V.S. Ramachandran, Perceiving shape from shading. Sci. Am. 159, 76–83 (1988)
4. K. Ikeuchi, B.K.P. Horn, Numerical shape from shading and occluding boundaries. Artif. Intell.
17, 141–184 (1981)
5. A.P. Pentland, Local shading analysis. IEEE Trans. Pattern Anal. Mach. Intell. 6, 170–184
(1984)
6. R.J. Woodham, Photometric method for determining surface orientation from multiple images.
Opt. Eng. 19, 139–144 (1980)
7. H. Hayakawa, Photometric stereo under a light source with arbitrary motion. J. Opt. Soc.
Am.-Part A: Opt., Image Sci., Vis. 11(11), 3079–3089 (1994)
8. P.N. Belhumeur, D.J. Kriegman, A.L. Yuille, The bas-relief ambiguity. J. Comput. Vis. 35(1),
33–44 (1999)
9. B. Horn, M.J. Brooks, The variational approach to shape from shading. Comput. Vis., Graph.
Image Process. 33, 174–208 (1986)
10. E.N. Coleman, R. Jain, Obtaining 3-dimensional shape of textured and specular surfaces using
four-source photometry. Comput. Graph. Image Process. 18(4), 1309–1328 (1982)
11. K. Ikeuchi, Determining the surface orientations of specular surfaces by using the photometric
stereo method. IEEE Trans. Pattern Anal. Mach. Intell. 3(6), 661–669 (1981)
12. R.T. Frankot, R. Chellappa, A method for enforcing integrability in shape from shading algo-
rithms. IEEE Trans. Pattern Anal. Mach. Intell. 10, 439–451 (1988)
13. R. Basri, D.W. Jacobs, I. Kemelmacher, Photometric stereo with general, unknown lighting.
Int. J. Comput. Vis. 72(3), 239–257 (2007)
14. T. Wei, R. Klette, On depth recovery from gradient vector fields, in Algorithms, Architectures
and Information Systems Security, ed. by B.B. Bhattacharya (World Scientific Publishing,
London, 2009), pp. 75–96
15. K. Reinhard, Concise Computer Vision, 1st edn. (Springer, London, 2014)
16. J.L. Posdamer, M.D. Altschuler, Surface measurement by space-encoded projected beam sys-
tems. Comput. Graph. Image Process. 18(1), 1–17 (1982)
17. E. Horn, N. Kiryati, Toward optimal structured light patterns. Int. J. Comput. Vis. 17(2), 87–97
(1999)
18. D. Caspi, N. Kiryati, J. Shamir, Range imaging with adaptive color structured light. IEEE
Trans. PAMI 20(5), 470–480 (1998)
19. P.S. Huang, S. Zhang, A fast three-step phase shifting algorithm. Appl. Opt. 45(21), 5086–5091
(2006)
20. J. Gühring, Dense 3-D surface acquisition by structured light using off-the-shelf components.
Methods 3D Shape Meas. 4309, 220–231 (2001)
21. C. Brenner, J. Böhm, J. Gühring, Photogrammetric calibration and accuracy evaluation of a
cross-pattern stripe projector, in Videometrics VI 3641 (SPIE, 1999), pp. 164–172
22. Z.J. Geng, Rainbow three-dimensional camera: new concept of high-speed three-dimensional
vision systems. Opt. Eng. 35(2), 376–383 (1996)
23. K.L. Boyer, A.C. Kak, Color-encoded structured light for rapid active ranging. IEEE Trans.
PAMI 9(1), 14–28 (1987)
24. M. Maruyama, S. Abe, Range sensing by projecting multiple slits with random cuts. IEEE
Trans. PAMI 15(6), 647–651 (1993)
25. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach.
Intell. 22(11), 1330–1334 (2000)
26. S.K. Nayar, Y. Nakagawa, Shape from focus. IEEE Trans. PAMI 16(8), 824–831 (1994)
27. A.P. Pentland, A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4),
523–531 (1987)
28. M. Subbarao, Efficient depth recovery through inverse optics, in Machine Vision for Inspection
and Measurement (Academic press, 1989), pp. 101–126
29. E. Krotkov, Focusing. J. Comput. Vis. 1, 223–237 (1987)
30. M. Subbarao, T.S. Choi, Accurate recovery of three dimensional shape from image focus. IEEE
Trans. PAMI 17(3), 266–274 (1995)
31. S. Pertuza, D. Puiga, M.A. Garcia, Analysis of focus measure operators for shape-from-focus.
Pattern Recognit. 46, 1415–1432 (2013)
32. C. Rajagopalan, Depth recovery from defocused images, in Depth From Defocus: A Real
Aperture Imaging Approach (Springer, New York, 1999), pp. 14–27
33. V.P. Namboodiri, C. Subhasis, S. Hadap. Regularized depth from defocus, in ICIP (2008), pp.
1520–1523
34. A.N. Rajagopalan, S. Chaudhuri, An mrf model-based approach to simultaneous recovery of
depth and restoration from defocused images. IEEE Trans. PAMI 21(7), 577–589 (1999)
35. P. Favaro, S. Soatto, M. Burger, S. Osher, Shape from defocus via diffusion. IEEE Trans. PAMI
30(3), 518–531 (2008)
6 Motion Analysis
6.1 Introduction
So far we have considered the objects of the world and the observer both stationary,
that is, not in motion. We are now interested in studying a vision system capable of
perceiving the dynamics of the scene, in analogy to what happens in the vision systems
of different living beings. We are aware that these latter vision systems require
remarkable computing capabilities, instant by instant, to realize visual perception,
through a symbolic description of the scene, deriving various depth and shape
information with respect to the objects of the scene.
For example, in the human visual system the dynamics of the scene is captured by
binocular stereo images, slightly different in time, acquired simultaneously by the two
eyes and adequately combined to produce a single 3D perception of the objects of
the scene. Furthermore, by observing the scene over time, it is able to reconstruct the
scene completely, differentiating moving 3D objects from stationary ones. In
essence, it realizes the visual tracking of moving objects, deriving useful qualitative
and quantitative information on the dynamics of the scene.
This is possible given the capacity of biological systems to manage spatial and tem-
poral information through different elementary processes of visual perception, ade-
quate and fundamental, for interaction with the environment. The temporal dimension
in visual processing plays a role of primary importance for two reasons:
1. the apparent motion of the objects in the image plane is an indication to understand
the structure and 3D motion;
2. the biological visual systems use the information extracted from time-varying
image sequences to derive properties of the 3D world with a little a priori knowl-
edge of the same.
Motion analysis has long been treated as a specialized field of research that had
nothing to do with image processing in general, for two reasons:
1. the techniques used to analyze movement in image sequences were quite different;
2. the large amount of memory and computing power required to process image
sequences made this analysis available only to specialized research laboratories
that could afford the necessary resources.
These two reasons no longer exist, because the methods used in motion analysis do
not differ from those used for image processing, and the image sequence analysis
algorithms can also be developed on normal personal computers. The perception of
movement in analogy to other visual processes (color, texture, contour extraction,
etc.) is an inductive visual process. Visual photoreceptors derive motion information
by evaluating the variations in light intensity of the 2D (retinal) image formed in the
observed 3D world. The human visual system adequately interprets these changes in
brightness in time-varying image sequences to realize the perception of moving 3D
objects in the scene. In this chapter, we will describe how it is possible to derive 3D
motion, almost in real time, from the analysis of time-varying 2D image sequences.
Some studies on the analysis of movement have shown that the perception of
movement derives from the information of objects by evaluating the presence of
occlusions, texture, contours, etc. Psychological studies have shown that the visual
perception of movement is based on the activation of neural structures. In some
animals, it has been shown that the lesion of some parts of the brain has led to
the inability to perceive movement. These losses of visual perception of movement
were not associated with the loss of visual perception of color and sensitivity to
perceive different patterns. This suggests that some parts of the brain are specialized
for movement perception (see Sect. 4.6.4 describing the functional structure of the
visual cortex).
We are interested in studying the perception of movement that occurs in physical
reality and not apparent movement. A typical example of apparent motion
is observed in advertising light panels, in which sequences of bright zones light up and
switch off at different times with respect to others that always remain on. Other cases of
apparent motion can occur by varying the color or luminous intensity
of some objects.
Figure 6.1 shows two typical examples of movement, captured when a photo is
taken with the object of interest moving laterally while the photographer
remains still (photo (a)), whereas in photo (b) it is the observer who is moving toward the house.
The images a and b of Fig. 6.2 show two images (normally in focus) of a video
sequence acquired at a frequency of 50 Hz. Some differences between the images
are evident at a first direct comparison. If we subtract the images from each other,
the differences become immediately visible, as seen in Fig. 6.2c. In fact, the dynamics
of the scene indicate the movement of the players near the goal.
From the difference image c, it can be observed that all the parts not in motion
are black (in the two images their intensity remained constant), while the moving
parts are well highlighted, and one can even appreciate the different speeds of the moving
parts (for example, the goalkeeper moves less than the other players). Even from this
qualitative description, it is obvious that motion analysis helps us considerably in
understanding the dynamics of a scene. All the parts of the scene not in motion are
Fig. 6.1 Qualitative motion captured from a single image. Example of lateral movement captured
in the photo (a) by a stationary observer, while in photo (b) the perceived movement is of the
observer moving toward the house
Fig. 6.2 Pair of images of a sequence acquired at a frequency of 50 Hz. The dynamics of
the scene indicates the approach of the players toward the ball. The difference of images
(a) and (b) is shown in (c)
dark, while in the difference image the moving parts are enhanced (the areas
with brighter pixels).
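A minimal sketch of the image-difference idea of Fig. 6.2: the absolute difference of two consecutive frames, followed by a threshold, isolates the moving regions under the assumption of constant illumination (the threshold value is illustrative):

import numpy as np

def motion_mask(frame_a, frame_b, threshold=15):
    """Absolute difference of two consecutive frames: stationary parts stay near
    zero, moving parts are highlighted (assumes constant illumination)."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return diff > threshold     # boolean mask of the moving regions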
We can summarize what has been said by affirming that the movement (of the
objects in the scene or of the observer) can be detected by the temporal variation of
the gray levels; unfortunately, the inverse implication is not valid, namely that all
changes in gray levels are due to movement.
This last aspect depends on the possible simultaneous change of the lighting con-
ditions while the images are acquired. In fact, in Fig. 6.2 the scene is well illuminated
by the sun (you can see the shadows projected on the playing field very well), but if
a cloud suddenly changes the lighting conditions, it would not be possible to derive
motion information from the time-varying image sequence because the gray levels
of the images also change due to the change in lighting. Thus a possible analysis of
the dynamics of the scene, based on the difference of the space-time-varying images,
would produce artifacts. In other words, we can say that from a sequence of space-time-varying
images f(x, y, t) it is possible to derive motion information by analyzing the temporal
variations of the gray levels, provided that the lighting conditions remain constant.
Experiments have shown the analogy between motion perception and depth (distance)
of objects derived from stereo vision (see Sect. 4.6.4). It can easily be observed that, for
an observer in motion with respect to an object, in some areas of the retina there are
local variations of luminous intensity, deriving from the motion, which contain visual
information that depends on the distance of the observer from the various points of
the scene.
The variations in brightness on the retina change in a predictable manner in rela-
tion to the direction of motion of the observer and the distance between objects and
observer. In particular, objects that are farther away generally appear to move more
slowly than objects closer to the observer. Similarly, points along the direction of
motion of the observer move slowly with respect to points that lie in other direc-
tions. From the variations of luminous intensity perceived on the retina the position
information of the objects in the scene with respect to the observer is derived.
The motion field image is calculated from the motion information
derived from a sequence of time-varying images. Gibson [1] defined an important
approach that correlates the perception of movement and the perception of distance:
it calculates the flow field of the movement and proposes algorithms to
extract the ego-motion information, that is, to derive the motion of the observer and
the depth of the objects with respect to the observer, analyzing the information of the flow field
of the movement.
Motion and depth estimation have the same purpose and use similar types of
perceptive stimuli. The stereo vision algorithms use the information of the two retinas
to recover the depth information, based on the diversity of the images obtained by
observing the scene from slightly different points of view. Instead, motion detection
algorithms use coarser information deriving from image sequences that show slight
differences between them due to the motion of the observer or to the relative motion
of the objects with respect to the observer. As an alternative to measuring the
variations of gray levels for motion estimation, in analogy with the correspondence
problem of stereo vision, some characteristic elements (features) corresponding to the
same objects of the observed scene can be identified in the images of the sequence,
evaluating their spatial displacement in the image caused by the movement of the object.
While in stereo vision this spatial difference (called disparity) of a feature is due
to the different position from which the scene is observed, in the case of a sequence of time-
varying images acquired by the same stationary observer, the disparity is determined
by consecutive images acquired at different times. The temporal frequency of image acquisition is normally high, so the disparity between consecutive images remains small. Two difficulties, however, arise with the visual correspondence:
1. we can find the visual correspondence without the existence of a physical corre-
spondence, as in the case of indistinguishable objects;
2. a physical correspondence does not generally imply a visual correspondence, as
in the case in which we are not able to recognize a visual correspondence due to
variations in lighting.
The methods used to solve the correspondence and reconstruction problems are
based on the following assumption:
there is only one, rigid, relative motion between the camera and the observed
scene, moreover the lighting conditions do not change.
This assumption implies that the observed 3D objects cannot move according
to different motions. If the dynamics of the scene consists of multiple objects with
movements different from that of the observer, another problem has to be considered:
the flow image of the movement must be segmented to select the individual regions
that correspond to the different objects with different motions (the segmentation
problem).
In Fig. 6.3, we observe a sequence of images acquired with a very small time interval Δt,
such that the difference between each consecutive pair of images of the sequence is minimal.
This difference in the images depends on the variation of the geometric
relationships between the observer (for example, a camera), the objects of the scene,
and the light source. It is these variations, determined in each pair of images of the
sequence, that are at the base of the motion estimation and stereo vision algorithms.
In the example shown in the figure, the dynamics of the ball is the main objective
in the analysis of the sequence of images: once detected in the spatial domain
of an image, the ball is tracked in the successive images of the sequence (time domain)
while approaching the goal, to detect the Goal–NoGoal event.
Let P(X, Y, Z) be a 3D point of the scene, projected (with the pinhole model) at the
time t·Δt onto a point p(x, y) of the image I(x, y, t·Δt) (see Fig. 6.4). If the 3D
motion of P is linear with velocity V = (VX, VY, VZ), in the time interval Δt it will
move by V·Δt to Q = (X + VX·Δt, Y + VY·Δt, Z + VZ·Δt). The motion of P with
velocity V induces the motion of p in the image plane with velocity v = (vx, vy), moving
to the point q(x + vx·Δt, y + vy·Δt) of the image I(x, y, (t + 1)·Δt). The apparent
motion of the intensity variation of the pixels is called optical flow v = (vx, vy).
Fig. 6.3 Goal–NoGoal detection. Sequence of space–time-variant images and motion field calcu-
lated on the last images of the sequence. The main object of the scene dynamics is only the motion
of the ball
Fig. 6.4 Graphical representation of the formation of the velocity flow field produced on the retina
(by perspective projection) generated by the motion of an object of the scene considering the
observer stationary
Fig. 6.5 Different types of ideal motion fields induced by the motion of the observer relative to the object: moving toward the object, moving away from the object, rotation, and translation from right to left
Figure 6.5 shows ideal examples of optical flow with different vector fields gen-
erated by various types of motion of the observer (with uniform speed) which, with
respect to the scene, approaches or moves away, or moves laterally from right to left,
or rotates the head. The flow vectors represent an estimate of the variations of image points occurring in a limited space–time interval. The direction and length of each flow vector correspond to the local motion induced when the observer moves with respect to the scene, or vice versa, or when both move.
The projection of velocity vectors in the image plane, associated with each 3D
point of the scene, defines the motion field. Ideally, motion field and optical flow should coincide. In reality, this is not true, since the optical flow can be caused by an apparent motion induced by a change in the lighting conditions rather than by a real motion, or can be corrupted by the aperture problem mentioned above. An effective example is given by the barber pole¹ illustrated in Fig. 6.6, where the real motion of the cylinder is circular and the perceived optical flow is a vertical
1 Panel used in the Middle Ages by barbers. On a white cylinder, a red ribbon is wrapped helically. The cylinder continuously rotates around its vertical axis and all the points of the cylindrical surface move horizontally. It is observed instead that this rotation produces the illusion that the red ribbon moves vertically upwards. The motion is ambiguous because it is not possible to find corresponding points in motion in the temporal analysis, as shown in the figure. Hans Wallach, a psychologist, discovered in 1935 that the illusion is weaker if the cylinder is shorter and wider, in which case the perceived motion is correctly lateral. The illusion also disappears if a texture is present on the ribbon.
Fig. 6.6 Illusion with the barber pole. The white cylinder, with the red ribbon wrapped helically,
rotates clockwise but the stripes are perceived to move vertically upwards. The perceived optical
flow does not correspond to the real motion field, which is horizontal from right to left. This illusion
is caused by the aperture problem or by the ambiguity of finding the correct correspondence of points
on the edge of the tape (in the central area when observed at different times) since the direction of
motion of these points is not determined uniquely by the brain
motion field, while the real one is horizontal. The same happens when, sitting in a stationary train and observing an adjacent train, we have the feeling that we are moving when instead it is the other train that is moving. This happens because we have a limited aperture through the window and no precise references to decide which of the two trains is really in motion.
Returning to the sequence of space–time-variant images I(x, y, t), it can be effective to analyze the information content captured about the dynamics of the scene in a space–time graph. For example, with reference to Fig. 6.3, we could analyze in the space–time diagram (t, x) the dynamics of the ball structure (see Fig. 6.7). In the sequence of images, the dominant motion of the ball is horizontal (along the x axis), moving toward the goalposts. If the ball is stationary, its position in the time-varying images does not change, and in the diagram this state is represented by a horizontal line. When the ball moves at a constant speed, the trace described by its center of mass is an oblique straight line whose inclination with respect to the time axis depends on the speed of the ball and is given by
ν = Δx/Δt = tan(θ)    (6.1)
where θ is the angle between the time axis t and the direction of movement of the ball, given by the centers of mass of the ball located in the time-varying images of the sequence, or equivalently by the trajectory described by the displacement of the ball in the images of the sequence. As shown in Fig. 6.7, a moving ball is described in the plane (t, x) by an inclined trajectory while, for a stationary ball, in the sequence
Fig. 6.7 Space–time diagram of motion information. In the diagram on the left the horizontal line
indicates stationary motion while on the right the inclined line indicates motion with uniform speed
along the x axis
of images, the gray levels associated with the ball do not vary, and therefore in the
plane (t, x) we will see the trace of the motion of the ball which remains horizontal
(constant gray levels).
In other words, in the space–time (x, y, t) the dynamics of the scene is estimated directly from the orientation of the trace in the continuous space–time plane (t, x), rather than as discrete shifts obtained by directly comparing two consecutive images in the space (x, y). Therefore, motion analysis algorithms should be formulated in the continuous space–time (x, y, t), for which the level of discretization needed to adequately describe the motion becomes important. In this space, by observing the trace of the direction of motion and how it is oriented with respect to the time axis, an estimate of the speed is obtained. On the other hand, by observing only the motion of some points of the contour of the object, the orientation of the object itself would not be uniquely determined.
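As a minimal illustration of Eq. (6.1), the following sketch (Python with NumPy, assuming the x-coordinate of the ball has already been detected in each frame) fits a straight line to the trace in the (t, x) plane and reads the speed off its slope; the function name and the synthetic data are illustrative, not taken from the book.

import numpy as np

def speed_from_spacetime_trace(t, x):
    """Estimate the horizontal speed of an object from its trace in the
    (t, x) space-time diagram: the slope of the fitted line is
    nu = dx/dt = tan(theta), as in Eq. (6.1)."""
    slope, intercept = np.polyfit(t, x, deg=1)   # least-squares line fit
    return slope                                 # units of x per second

# Example: positions sampled every 2 ms; a stationary ball gives slope ~ 0,
# a ball moving at constant speed gives a constant non-zero slope.
t = np.arange(0, 0.02, 0.002)                    # seconds
x = 1.5 + 33.3 * t                               # metres, about 120 km/h along x
print(speed_from_spacetime_trace(t, x))          # ~33.3 m/s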
Fig. 6.8 Goal–NoGoal event detection. a The dynamics of the scene is taken for each goal by a
pair of cameras with high temporal resolution arranged as shown in the figure on the opposite sides
with the optical axes (aligned with the Z -axis of the central reference system (X, Y, Z )) coplanar
with the vertical plane α of the goal. The significant motion of the ball approaching toward the goal
box is in the domain time–space (t, x), detected with the acquisition of image sequences. The 3D
localization of the ball with respect to the central reference system (X, Y, Z ) is calculated by the
triangulation process carried out by the relative pair of opposite and synchronized cameras, suitably
calibrated, with respect to the known positions of the vertical plane α and the horizontal goal area.
b Local reference system (x, y) of the sequence images
The goal event occurs only when the ball (about 22 cm in diameter) completely crosses the goal box (plane α), that is, it passes completely beyond the goalposts, the crossbar, and the goal line inside the goal, as shown in the figure. The ball can reach a speed of 120 km/h. To capture the dynamics of the goal event, it is necessary to acquire a sequence of images by observing the scene as shown in the figure, from which it emerges that the significant and dominant motion is the lateral one: the trajectory of the ball, which moves toward the goal, is almost always orthogonal to the optical axis of the cameras.
Figure 6.8 shows the dynamics of the scene being acquired by discretely sampling the motion of the ball over time. As soon as the ball appears in the scene, it is detected in an image I(x, y, t1) of the time–space-varying sequence at time t1, and the lateral motion of the ball is tracked in the spatial domain (x, y) of the images of the sequence acquired in real time, with the frame rate of the camera defining the level of temporal discretization Δt of the dynamics of the event. In the figure, we can observe the discretized 3D and 2D trajectory of the ball for some consecutive images of the two sequences captured by the two opposite and synchronized cameras. It is also observed that the positions of the ball in consecutive images of the sequence are spaced by a value inversely proportional to the frame rate. We will now analyze the impact of time sampling on the dynamics of the goal event that we want to detect.
In Fig. 6.9a, the whole sequence of images (related to one camera) is represented
where it is observed that the ball moves toward the goal, with a uniform speed v,
leaving a cylindrical track of diameter equal to the real dimensions of the ball. In
essence, in this 3D space–time of the sequence I (x, y, t), the dynamics of the scene
is graphically represented by a parallelepiped where the images of the sequence
I (x, y, t) that vary over time t are stacked. Figure 6.9c, which is a section (t, x) of
the parallelepiped, shows the dominant and significant signal of the scene, that is,
the trajectory of the ball useful to detect the goal event if the ball crosses the goal
(i.e., the vertical plane α). In this context, the space–time diagram (t, y) is used to
Fig. 6.9 Parallelepiped formed by the sequence of images that capture the Goal–NoGoal event. a
The motion of the ball moving at uniform speed is represented in this 3D space–time by a slanted
cylindrical track. b A cross section of the parallelepiped, or image plane (x-y) of the sequence at
time tth, shows the current position of the entities in motion. c A parallelepiped section (t-x) at a
given height y represents the space–time diagram that includes the significant information of the
motion structure
indicate the position of the cylindrical track of the ball with respect to the goal box
(see Fig. 6.9b).
Therefore, according to Fig. 6.8b, we can affirm that from the time–space diagram
(t, x), we can detect the image tgoal of the sequence in which the ball crossed the
vertical plane α of the goal box with the x-axis indicating the horizontal position of
the ball, useful for calculating its distance from the plane α with respect to the central
reference system (X, Y, Z ). From the time–space diagram (t, y) (see Fig. 6.9b), we
obtain instead the vertical position of the ball, useful to calculate the coordinate Y
with respect to the central reference system (X, Y, Z ).
Once the centers of mass of the ball have been determined in the synchronized images of the opposite cameras C1 and C2, we have the information that the ball has crossed the plane α, useful to calculate the horizontal coordinate X in the central reference system; but, to detect the goal event, it is now necessary to determine whether the ball is inside the goal box by evaluating its position (Y, Z) in the central reference system (see Fig. 6.8a). This is possible through triangulation between the two cameras, having previously calibrated them with respect to the known positions of the vertical plane α and the horizontal goal area [2].
It should be noted that in the 3D space–time representation of Fig. 6.9 the dynamics of the scene is, for simplicity, drawn as a continuous motion even though the images of the sequence are acquired with a high frame rate; in the plane (t, x) the resulting trace is inclined by an angle θ whose tangent is directly proportional to the speed of the object.
Now let us see how to adequately sample the motion of an object in order to avoid the phenomenon known as time aliasing, which introduces distortions in the signal due to undersampling. With the technologies currently available, once the spatial resolution² is defined, the continuous motion represented in Fig. 6.9c can be discretized with a sampling frequency that can vary from a few images to thousands of images per second.
Figure 6.10 shows the relation between the speed of the ball and its displacement in the acquisition time interval between two consecutive images, a displacement that remains constant during the acquisition of the entire sequence. It can be observed that the displacement of the object decreases, from meters down to a few millimeters, as the temporal sampling frequency increases from low to very high frame rates. For example, for a ball traveling at 120 km/h, the acquisition of
2 We recall that we also have the phenomenon of the spatial aliasing already described in Sect. 5.10
Vol. I. According to the Shannon–Nyquist theorem, a sampled continuous function (in the time or
space domain) can be completely reconstructed if (a) the sampling frequency is equal to or greater
than twice the frequency of the maximum spectral component of the input signal (also called Nyquist
frequency) and (b) the spectrum replicas are removed in the Fourier domain, remaining only the
original spectrum. The latter process of removal is the anti-aliasing process of signal correction by
eliminating spurious space–time components.
Fig.6.10 Relationship between speed and displacement of an object as the time sampling frequency
of the sequence images I (x, y, t) changes
the sequence with a frame rate of 400 fps gives a displacement of the ball along the x-axis of about 8 cm between consecutive frames, while at much lower frame rates the displacement reaches the order of a meter. In the latter case, time aliasing occurs: the ball appears in the sequence images as an elongated ellipsoid and the motion can hardly be estimated. Time aliasing also generates the propeller effect observed when watching on television the motion of the propeller of an airplane, which seems to rotate in the direction opposite to the real one.
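The relation between object speed, frame rate, and inter-frame displacement shown in Fig. 6.10 reduces to Δx = v/fps; the short sketch below (ours, not from the book) tabulates it for a few frame rates and compares the displacement with the 22 cm ball diameter as a rough indicator of how densely the trajectory is sampled.

def displacement_per_frame(speed_kmh, fps):
    """Inter-frame displacement (metres) of an object moving at constant speed."""
    speed_ms = speed_kmh / 3.6          # km/h -> m/s
    return speed_ms / fps               # metres travelled between two frames

ball_diameter = 0.22                    # metres
for fps in (25, 100, 400, 1000):
    dx = displacement_per_frame(120, fps)
    print(f"{fps:5d} fps -> {dx*100:6.1f} cm per frame "
          f"({dx/ball_diameter:.1f} ball diameters)")
# 25 fps  -> ~133 cm (6 diameters): consecutive ball positions do not overlap.
# 400 fps -> ~8 cm (0.4 diameters): the trajectory is densely sampled.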
Figure 6.11 shows how the continuous motion represented in Fig. 6.9c is well represented by sampling the dynamics of the scene with a very high temporal sampling frequency, whereas, as the level of discretization of the trajectory of the ball decreases, the sampled trajectory departs more and more from the continuous motion. The speed ν of the object, in addition to determining the inclination of the trace in the space–time domain (t, x), determines in the Fourier domain (ut, vx) the relation between the temporal frequency ut and the spatial frequency vx:
Fig. 6.11 Analysis in the Fourier domain of the effects of time sampling on the 1D motion of an
object. a Continuous motion of the object in the time–space domain (t, x) and in the corresponding
Fourier domain (u t , vx ); in b, c and d the analogous representations are shown with a time sampling
which is decreased by a factor 4
u t = ν · vx (6.2)
In the continuous approach, the motion of each pixel of the image is estimated, obtain-
ing a dense map of estimated speed measurements by evaluating the local variations
of intensity, in terms of spatial and temporal variations, between consecutive images.
These speed measurements represent the apparent motion, in the two-dimensional
image plane, of 3D points in the motion of the scene and projected in the image
plane. In this context it is assumed that the objects of the scene are rigid, that is, all their points move with the same velocity, and that, during observation, the lighting conditions do not change. With this assumption, we analyze the two terms: motion field and optical flow.
The motion field represents the 2D speed in the image plane (observer) induced
by the 3D motion of the observed object. In other words, the motion field represents
the apparent 2D speed in the image plane of the real 3D motion of a point in the scene
projected in the image plane. Motion analysis algorithms aim to estimate, from a pair of images of the sequence and for each corresponding point of the image, the value of this 2D velocity (see Fig. 6.12).
The velocity vector estimated at each point of the image plane indicates the direc-
tion of motion and the speed which also depends on the distance between the observer
and observed objects. It should be noted immediately that the 2D projections of the
3D velocities of the scene points cannot be measured (acquired) directly by the acqui-
sition systems normally constituted, for example, by a camera. Instead, information
is acquired which approximates the motion field, i.e., the optical flow is calculated
by evaluating the variation of the gray levels in the pairs of time-varying images of
the sequence.
The optical flow and the motion field can be considered coincident only if the
following conditions are satisfied:
(a) The time distance is minimum for the acquisition of two consecutive images in
the sequence;
(b) The gray levels’ function is continuous;
(c) The Lambertian conditions are maintained;
(d) Scene lighting conditions do not change during sequence acquisition.
In reality, these conditions are not always maintained. Horn [7] has highlighted
some remarkable cases in which the motion field and the optical flow are not equal.
In Fig. 6.13, two cases are represented.
First case: a stationary sphere with a homogeneous surface (of any material) induces optical flow when a light source moves in the scene. In this case, by varying the lighting conditions, optical flow is detected by analyzing the image sequence since condition (d) is violated, and therefore there is a
Fig. 6.13 Special cases of noncoincidence between optical flow and motion field. a A Lambertian stationary sphere induces optical flow when a light source moves, producing a change in intensity, while the motion field is zero, as it should be. b The sphere rotates while the light source is stationary. In this case, the optical flow is zero (no motion is perceived) while the motion field is nonzero, as it should be
variation in the gray levels, while the motion field is null since the sphere is stationary.
Second case: The sphere rotates around its axis of gravity while the illumination
remains constant, i.e., the conditions indicated above are maintained. From the
analysis of the sequence, the induced optical flow is zero (no changes in the gray
levels between consecutive images are observed) while the motion field is different
from zero since the sphere is actually in motion.
In the discrete approach, the velocity estimate is calculated only for some points in the image, thus obtaining a sparse map of velocity estimates. The correspondence in pairs of consecutive images is calculated only at significant points of interest (SPI) (closed zero-crossing contours, windows with high variance, lines, texture, etc.). The discrete approach is used both for small and for large displacements of moving objects in the scene, and when the constraints of the continuous approach cannot be maintained.
In fact, in reality, not all the abovementioned constraints are satisfied (small shifts
between consecutive images do not always occur and the Lambertian conditions are
not always valid). The optical flow has the advantage of producing dense speed maps
and is calculated independently of the geometry of the objects of the scene unlike
the other (discrete) approaches that produce sparse maps and depend on the points
of interest present in the scene. If the analysis of the movement is also based on the
a priori knowledge of some information of the moving objects of the scene, some
assumptions are considered to better locate the objects:
Maximum speed. The position of the object in the next image, after a time Δt, can be predicted.
Homogeneous movement. All the points of the scene are subject to the same
motion.
Mutual correspondence. Except for problems with occlusion and object rotation,
each point of an object corresponds to a point in the next image and vice versa
(non-deformable objects).
Let I1 and I2 be two consecutive images of the sequence; an estimate of the movement is given by the binary image d(i, j) obtained as the difference between the two consecutive images:

d(i, j) = 0  if |I1(i, j) − I2(i, j)| ≤ S;   d(i, j) = 1  otherwise    (6.5)
where S is a positive number indicating the threshold value above which to consider
the presence of movement in the observed scene. In the difference image d(i, j),
the presence of motion in pixels with value one is estimated. It is assumed that the
images are perfectly recorded and that the dominant variations of the gray levels are
attributable to the motion of the objects in the scene (see Fig. 6.14). The difference
image d(i, j) which contains the qualitative information on the motion is very much
influenced by the noise and cannot correctly determine the motion of very slow
objects. The motion information in each point of the difference image d(i, j) is
associated with the difference in gray levels between the following:
– adjacent pixels that correspond to pixels of moving objects and pixels that belong
to the background;
– adjacent pixels that belong to different objects with different motions;
– pixels that belong to parts of the same object but with a different distance from the
observer;
– pixels with gray levels affected by nonnegligible noise.
The value of the S threshold of the gray level difference must be chosen experimen-
tally after several attempts and possibly limited to very small regions of the scene.
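A minimal NumPy sketch of Eq. (6.5), assuming two registered grayscale frames stored as arrays; the threshold S is an illustrative value to be tuned experimentally as discussed above.

import numpy as np

def difference_image(I1, I2, S):
    """Binary motion map of Eq. (6.5): d(i,j)=1 where the absolute gray-level
    difference between two consecutive, registered frames exceeds S."""
    diff = np.abs(I1.astype(np.float64) - I2.astype(np.float64))
    return (diff > S).astype(np.uint8)

# d = difference_image(frame_t, frame_t_plus_1, S=15)   # S chosen by trial and error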
The difference image d(i, j), obtained in the previous paragraph, qualitatively high-
lights objects in motion (in pixels with value 1) without indicating the direction
of motion. This can be overcome by calculating the cumulative difference image
Fig. 6.14 Motion detected with the difference of two consecutive images of the sequence and result
of the accumulated differences with Eq. (6.6)
dcum (i, j), which contains the direction of motion information in cases where the
objects are small and with limited movements. The cumulative difference dcum (i, j)
is evaluated considering a sequence of n images, whose initial image becomes the ref-
erence against which all other images in the sequence are subtracted. The cumulative
difference image is constructed as follows:
dcum(i, j) = Σ_{k=1}^{n} ak |I1(i, j) − Ik(i, j)|    (6.6)
where I1 is the first image of the sequence, against which all other images Ik are compared, and ak is a coefficient that takes increasingly higher values for the most recent images of the sequence; consequently, it highlights the location of the pixels associated with the current position of the moving object (see Fig. 6.14).
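A minimal sketch of the cumulative difference of Eq. (6.6), assuming the sequence is available as a list of frames; the linearly increasing weights a_k are only one possible choice consistent with the description above.

import numpy as np

def cumulative_difference(frames):
    """Cumulative difference image of Eq. (6.6): every frame is compared with
    the first (reference) frame and weighted by a coefficient a_k that grows
    with k, so recent object positions receive larger values."""
    ref = frames[0].astype(np.float64)
    dcum = np.zeros_like(ref)
    for k, frame in enumerate(frames[1:], start=2):
        a_k = k                                   # increasing weight (one possible choice)
        dcum += a_k * np.abs(ref - frame.astype(np.float64))
    return dcum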
The cumulative difference can be calculated if the reference image I1 is acquired
when the objects in the scene are stationary, but this is not always possible. In the
latter case, we try to learn experimentally the motion of objects or, based on a model
of motion prediction, we build the reference image. In reality, an image with motion information at every pixel is not always of interest. Often, instead, what matters is the trajectory, in the image plane, of the center of mass of the objects moving with respect to the observer. This means that in many applications it
may be useful to first segment the initial image of the sequence, identify the regions
associated with the objects in motion and then calculate the trajectories described by
the centers of mass of the objects (i.e., the regions identified).
In other applications, it may be sufficient to identify in the first image of the
sequence some characteristic points or characteristic areas (features) and then search
for such features in each image of the sequence through the process of matching
homologous features. The matching process can be simplified by knowing or learning
the dynamics of the movement of objects. In the latter case, tracking algorithms of
the features can be used to make the matching process more robust and reduce the
level of uncertainty in the evaluation of the motion and location of the objects. The
Kalman filter [8,9] is often used as a solution to the tracking problem. In Chap. 6 Vol.
II, the algorithms and the problems related to the identification of features and to their search have been described, considering the aspects of noise present in the images,
Fig. 6.15 Aperture problem. a The figure shows the position of a line (edge of an object) observed through a small aperture at time t1. At time t2 = t1 + Δt, the line has moved to a new position. The arrows indicate the possible line movements, which cannot be determined through a small aperture because only the component perpendicular to the line can be determined with the gradient. b In this example, again through a small aperture, we can see in two consecutive images the displacement of the corner of an object, with the determination of the direction of motion without ambiguity
6.4 Optical Flow Estimation
In Sect. 6.3, we introduced the concepts of motion field and optical flow. Now let’s
calculate the dense map of optical flow from a sequence of images to derive useful
information on the dynamics of the objects observed in the scene. Recall that the
variations of gray levels in the images of the sequence are not necessarily induced
by the motion of the objects which, instead, is always described by the motion field.
We are interested in calculating the optical flow in the conditions in which it can be
considered a good approximation of the motion field.
The motion estimation can be carried out assuming that, in small regions of the images of the sequence, moving objects cause a variation of the luminous intensity at some points while the intensity of each moving point itself does not vary appreciably: this is the constraint of continuity of the light intensity relative to moving points. In reality,
we know that this constraint is violated as soon as the position of the observer changes
with respect to the objects or vice versa, and as soon as the lighting conditions
are changed. In real conditions, it is known that this constraint can be considered
acceptable by acquiring sequences of images with an adequate temporal resolution
(a normal camera acquires sequences of images with a time resolution of 1/25 of a
second) and evaluating the brightness variations in the images through the constraint
of the spatial and temporal gradient, used to extract useful information on the motion.
The generic point P of a rigid body (see Fig. 6.12) that moves with velocity V with respect to a reference system (X, Y, Z) is projected, through the optical system, into the image plane at the position p, with respect to the coordinate system (x, y) attached to the image plane, and moves in this plane with an apparent velocity v = (vx, vy),
which is the projection of the velocity vector V. The motion field represents the set
of velocity vectors v = (vx , v y ) projected by the optical system in the image plane,
generated by all points of the visible surface of the moving rigid object. An example
of motion field is shown in Fig. 6.16.
In reality, the acquisition systems (for example, a camera) do not determine the 2D
measurement of the apparent speed in the image plane (i.e., they do not directly mea-
sure the motion field), but record, in a sequence of images, the brightness variations
of the scene in the hypothesis that they are due to the dynamics of the scene. There-
fore, it is necessary to find the physical-mathematical model that links the perceived
gray-level variations with the motion field.
We indicate with the following:
(a) I(x, y, t) the acquired sequence of images representing the gray-level informa-
tion in the image plane (x, y) in time t;
(b) (I x , I y ) and It , respectively, the spatial variations (with respect to the axes x and
y) and temporal variations of the gray levels.
Suppose further that the space–time-variant image I(x, y, t) is continuous and differ-
entiable, both spatially and temporally. In the Lambertian hypotheses of continuity
conditions, that is, that each point P of the object appears equally luminous from any
direction of observation and, in the hypothesis of small movements, we can consider
Fig. 6.16 Motion field coinciding with the optical flow, idealized in 1950 by Gibson. Each arrow represents the direction and speed (indicated by the length of the arrow) of surface elements visible in motion with respect to the observer, or vice versa. Elements closer to the observer move faster than those farther away. The 3D motion of the observer with respect to the scene can be estimated through the optical flow. This is the motion field perceived on the retina of an observer who moves toward the house in the situation of motion of Fig. 6.1b
the brightness constant in every point of the scene. In these conditions, the brightness
(irradiance) in the image plane I(x, y, t) remains constant in time and consequently
the total derivative of the time-variant image with respect to time becomes null [7]:
I[x(t), y(t), t] = constant   ⇒   dI/dt = 0    (6.7)

(constant irradiance constraint)
The dynamics of the scene is represented by the function I(x, y, t), which depends on the spatial variables (x, y) and on the time t. This implies that the value of the function I[x(t), y(t), t] varies in time at each position (x, y) of the image plane and, consequently, the partial derivative ∂I/∂t is distinct from the total derivative dI/dt. Applying the definition of total derivative to the function I[x(t), y(t), t], the expression (6.7) becomes

∂I/∂x · dx/dt + ∂I/∂y · dy/dt + ∂I/∂t = 0    (6.8)

that is,

Ix vx + Iy vy + It = 0    (6.9)
where the time derivatives dx/dt and dy/dt are the components of the motion field vector

v(vx, vy) = (dx/dt, dy/dt)
while the spatial derivatives of the image, Ix = ∂I/∂x and Iy = ∂I/∂y, are the components of the spatial gradient of the image ∇I = (Ix, Iy). Equation (6.8), written in vector terms,
becomes
∇I(I x , I y ) · v(vx , v y ) + It = 0 (6.10)
which is the sought brightness continuity equation of the image, linking the brightness variation information, i.e., the spatial gradient of the image ∇I(Ix, Iy) determined from the sequence of multi-temporal images I(x, y, t), to the motion field v(vx, vy), which must be estimated once the components Ix, Iy, It are evaluated. Under these conditions, the motion field v(vx, vy) calculated in the direction of the spatial gradient of the image (Ix, Iy) adequately approximates the optical flow; in other words, in these conditions the motion field coincides with the optical flow.
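As a quick numerical check of Eq. (6.9), the following sketch (ours, with an arbitrary synthetic pattern and an assumed sub-pixel translation) estimates Ix, Iy, It by finite differences and verifies that Ix vx + Iy vy + It stays close to zero away from the image borders.

import numpy as np

H, W = 128, 128
y, x = np.mgrid[0:H, 0:W].astype(np.float64)
pattern = lambda xc, yc: np.sin(0.2 * xc) * np.cos(0.15 * yc)   # smooth test image

vx, vy = 1.5, -0.8                      # known motion in pixels/frame (assumed)
I1 = pattern(x, y)
I2 = pattern(x - vx, y - vy)            # same pattern translated by (vx, vy)

Iy_, Ix_ = np.gradient(I1)              # np.gradient returns d/drows, d/dcols
It_ = I2 - I1                           # forward temporal difference

residual = Ix_ * vx + Iy_ * vy + It_    # Eq. (6.9), should be ~0
print(np.abs(residual[10:-10, 10:-10]).max())
# small relative to the unit pattern amplitude (first-order approximation error)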
The same Eq. (6.9) is obtained by the following reasoning. Consider the generic point p(x, y) which, in an image of the sequence at time t, has luminous intensity I(x, y, t). The apparent motion of this point is described by the velocity components (vx, vy) with which it moves; in the next image, at time t + Δt, it will have moved to the position (x + vxΔt, y + vyΔt), and for the constraint of continuity of luminous intensity the following relation will be valid (irradiance constancy constraint equation):

I(x + vxΔt, y + vyΔt, t + Δt) = I(x, y, t)    (6.11)

Expanding the left-hand side in a Taylor series around (x, y, t) gives

I(x, y, t) + Ix vxΔt + Iy vyΔt + ItΔt + (higher-order terms) = I(x, y, t)    (6.12)
Dividing by Δt, ignoring the terms above the first order and taking the limit for Δt → 0, the previous equation becomes Eq. (6.8), i.e., the equation of the total derivative dI/dt. For the brightness continuity constraint of the scene (Eq. 6.11) over a very small time, and for the constraint of spatial coherence of the scene (the points in the neighborhood of the point under examination (x, y) move with the same velocity during the unit time interval), we can consider equality (6.11) valid which, replaced in Eq. (6.12), generates
Ix vx + Iy vy + It = ∇I · v + It ≈ 0    (6.13)
which represents the gradient constraint equation already derived above (see
Eqs. (6.9) and (6.10)). Equation (6.13) constitutes a linear relation between spatial
and temporal gradient of the image intensity and the apparent motion components
in the image plane.
We now summarize the conditions to which Eq. (6.13) gradient constraint is sub-
jected for the calculation of the optical flow:
1. Subject to the constraint of preserving the intensity of the gray levels during the time Δt of acquisition of at least two images of the sequence. In real applications, we
know that this constraint is not always satisfied. For example, in some regions of
the image, in areas where edges are present and when lighting conditions change.
2. Also subject to the constraint of spatial coherence, i.e., it is assumed that in areas
where the spatial and temporal gradient is evaluated, the visible surface belongs
to the same object and all points move at the same speed or vary slightly in
the image plane. Also, this constraint is violated in real applications in the image
plane regions where there are strong depth discontinuities, due to the discontinuity
between pixels belonging to the object and the background, or in the presence of
occlusions.
3. Considering, from Eq. (6.13), that

−It = Ix vx + Iy vy = ∇I · v    (6.14)

it is observed that the brightness variation It, at the same location of the image plane during the time Δt between the acquisition of consecutive images, is given by the scalar product of the spatial gradient vector of the image ∇I and the component of the optical flow (vx, vy) in the direction of the gradient ∇I. It is not possible to determine the component orthogonal to the direction of the gradient, i.e., in the direction normal to the direction of variation of the gray levels (due to the aperture problem).
In other words, Eq. (6.13) shows that, once the spatial and temporal gradients Ix, Iy, It have been estimated from two consecutive images, it is possible to calculate the motion field only in the direction of the spatial gradient of the image, that is, we can determine only the component vn of the optical flow in the direction normal to the edge. From Eq. (6.14) it follows that

‖∇I‖ vn = ∇I · v = −It    (6.15)

from which

vn = −It / ‖∇I‖ = ∇I · v / ‖∇I‖    (6.16)

where vn is the measure of the optical flow component that can be calculated in the direction of the spatial gradient, normalized with respect to the norm ‖∇I‖ = √(Ix² + Iy²)
of the gradient vector of the image. If the spatial gradient ∇I is null (that is, there is no change in brightness along a contour), it follows from (6.15) that It = 0 (no temporal variation of irradiance), and therefore no motion information is available at the point under examination. If instead the spatial gradient is zero while the temporal gradient It ≠ 0, the optical flow constraint is violated at that point. This impossibility of observing the velocity components at the point under examination is known as the aperture problem, already discussed in Sect. 6.3.6 Ambiguity in motion analysis (see Fig. 6.15).
From the brightness continuity equations (6.13) and (6.16), also called brightness preservation, we highlight that at each pixel of the image it is not possible to determine the optical flow (vx, vy) starting from the spatial and temporal gradients of the image (Ix, Iy, It), since there are two unknowns vx and vy and a single linear equation. It follows that Eq. (6.13) has multiple solutions and the gradient constraint alone cannot uniquely estimate the optical flow. Equation (6.16), instead, can only calculate the component of the optical flow in the direction of the intensity variation, that is, of maximum variation of the spatial gradient of the image.
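A minimal sketch of Eq. (6.16), assuming two consecutive grayscale frames; the finite-difference derivative estimates and the small constant eps are our own choices, and the result is meaningful only where the spatial gradient is non-zero (aperture problem).

import numpy as np

def normal_flow(I1, I2, eps=1e-6):
    """Normal component of the optical flow, Eq. (6.16):
    vn = -It / ||grad I||, computable only in the direction of the spatial
    gradient; eps avoids division by zero in uniform regions."""
    Iy, Ix = np.gradient(I1.astype(np.float64))          # spatial derivatives
    It = I2.astype(np.float64) - I1.astype(np.float64)   # temporal derivative
    grad_norm = np.sqrt(Ix**2 + Iy**2)
    vn = -It / (grad_norm + eps)                         # magnitude along the gradient
    # unit direction of the gradient, i.e., the direction in which vn is measured
    nx, ny = Ix / (grad_norm + eps), Iy / (grad_norm + eps)
    return vn, nx, ny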
Equation (6.13), of brightness continuity, can be represented graphically in the velocity space as a motion constraint line, as shown in Fig. 6.17a, from which it is observed that all the possible solutions of (6.13) fall on this velocity constraint line. Once the spatial and temporal gradient has been calculated at a pixel of the image, in the plane of the flow components (vx, vy) the velocity constraint line intersects the axes vx and vy, respectively, at the points (−It/Ix, 0) and (0, −It/Iy). It is also observed that only the optical flow component vn can be determined.
If the real 2D motion is the diagonal one (v x , v y ) indicated by the red dot in
Fig. 6.17a and from the dashed vector in Fig. 6.17b, the estimable motion is only
the one given by its projection on the gradient vector ∇ I . In geometric terms, the
calculation of vn with Eq. (6.16) is equivalent to calculating the distance d of the
motion constraint line (I x · vx + I y · v y + It = 0) from the origin of the optical flow
(vx , v y ) (see Fig. 6.17a). This constraint means that the optical flow can be calculated
only in areas of the image in the presence of edges.
In Fig. 6.17c, there are two constraint-speed lines obtained at two points close
to each other of the image for each of which the spatial and temporal gradient is
calculated by generating the lines (1) and (2), respectively. In this way, we can
reasonably hypothesize, that in these two points close together, the local motion is
identical (according to the constraint of spatial coherence) and can be determined
geometrically as the intersection of the constraint-speed lines producing a good local
estimate of the optical flow components (vx , v y ).
In general, the knowledge of the spatial and temporal derivatives Ix, Iy, It (estimated from a pair of consecutive images) is not sufficient, using Eq. (6.13) alone, to calculate both optical flow components (vx, vy) at each point of the image, for example in the presence of edges. The optical flow is equivalent to the motion field only under the particular conditions defined above.
To solve the problem in a general way, we can impose the constraint of spatial coherence on Eq. (6.13), that is, locally, in the vicinity of the point (x, y) being processed, the velocity of the motion field does not change abruptly. The differential approach has limits in the estimation of the spatial gradient in image areas where there are no appreciable variations in gray levels. This suggests calculating the velocity of the motion field by considering windows of adequate size, centered on the point being processed, so as to satisfy the constraint of spatial coherence required for the validity of Eq. (6.15). This is also useful for mitigating errors in the estimated optical flow in the presence of noise in the image sequence.
Fig. 6.17 Graphical representation of the constraint of the optical flow based on the gradient. a
Equation (6.9) is represented by the straight line locus of the points (vx , v y ) which are possible
multiple solutions of the optical flow of a pixel of the image according to the values of the spatial
gradient ∇ I and time gradient It . b Of the real 2D motion (v x , v y ) (red point indicated in figure
a), it is possible to estimate the speed component of the optical flow in the direction of the gradient
(Eq. 6.16) perpendicular to the variation of gray levels (contour) while the component p parallel to
the edge cannot be estimated. c In order to solve the aperture problem and more generally to obtain
a reliable estimate of the optical flow, it is possible to calculate the velocities even in the pixels
close to the one being processed (spatial coherence of the flow) hypothesizing pixels belonging to
the same surface that they have the same motion in a small time interval. Each pixel in the speed
diagram generates a constraint straight line of the optical flow that tends to intersect in a small area
whose center of gravity represents the components of the speed of the estimated optical flow for
the set of locally processed pixels
The optical flow can be estimated by minimizing, over the whole image, an energy function of the form

E(vx, vy) = E1 + λE2 = Σ_{(x,y)∈Ω} (Ix vx + Iy vy + It)² + λ Σ_{(x,y)∈Ω} ( ‖∇vx‖² + ‖∇vy‖² )    (6.17)

where the minimization process involves all the pixels p(x, y) of an image I(:, :, t) of the sequence, whose domain for simplicity we indicate with Ω. The first term E1 represents the error of the measures (also known as data energy), based on Eq. (6.13); the second term E2 represents the constraint of spatial coherence (also known as smoothness energy or smoothness error); and λ is the regularization parameter that controls the relative importance of the constraint of continuity of intensity (Eq. 6.11) and that of the validity of spatial coherence. The introduction of the spatial coherence constraint E2, as an error term expressed by the squared partial derivatives of the velocity components, restricts the class of possible solutions for the flow velocity (vx, vy), transforming an ill-conditioned problem into a well-posed one. It should be noted that in this context the optical flow components vx and vy are functions of the spatial coordinates x and y. To avoid confusion on the symbols,
let us indicate for now the horizontal and vertical components of optical flow with
u = vx and v = v y .
Using the variational approach, Horn and Schunck have derived the differential
Eq. (6.22) as follows. The objective is the estimation of the optical flow components
(u, v) which minimizes the energy function

E(u, v) = Σ_{(x,y)∈Ω} [ (Ix u + Iy v + It)² + λ (ux² + uy² + vx² + vy²) ]    (6.18)

which is Eq. (6.17) reformulated with the new symbolism, where ux = ∂u/∂x, uy = ∂u/∂y, vx = ∂v/∂x and vy = ∂v/∂y represent the first derivatives of the optical flow components (now denoted by u and v) with respect to x and y. Differentiating the function (6.18) with respect to the unknown variables u and v, we obtain
∂E(u, v)/∂u = 2 Ix (u Ix + v Iy + It) + 2λ (uxx + uyy)

∂E(u, v)/∂v = 2 Iy (u Ix + v Iy + It) + 2λ (vxx + vyy)    (6.19)
where (u x x + u yy ) and (vx x + v yy ) are, respectively, the Laplacian of u(x, y) and
v(x, y) as shown in the notation.3 In essence, the expression corresponding to the
Laplacian controls the contribution of the smoothness term of (6.18) of the optical
flow, which when rewritten becomes
∂E(u, v)/∂u = 2 Ix (u Ix + v Iy + It) + 2λ ∇²u

∂E(u, v)/∂v = 2 Iy (u Ix + v Iy + It) + 2λ ∇²v    (6.20)
A possible solution to the minimization of the function (6.18) is to set to zero the partial derivatives given by (6.20), approximating the Laplacians with the differences between the flow components u and v and their local averages ū and v̄, computed on a local window W centered on the pixel (x, y) being processed:

∇²u ≈ u − ū,   ∇²v ≈ v − v̄    (6.21)
3 In fact, considering the smoothness term of (6.18) and differentiating with respect to u, we obtain the term (uxx + uyy), which corresponds to the second-order differential operator defined as the divergence of the gradient of the function u(x, y) in a Euclidean space. This operator is known as the Laplace operator, or simply Laplacian. Similarly, the Laplacian of the function v(x, y) is derived.
Replacing (6.21) in Eq. (6.20), setting these last equations to zero and reorganizing
them, we obtain the following equations as a possible solution to minimize the
function E(u, v), given by
(λ + Ix²) u + Ix Iy v = λ ū − Ix It

(λ + Iy²) v + Ix Iy u = λ v̄ − Iy It    (6.22)
The calculation of the optical flow is performed by applying the iterative method
of Gauss–Seidel, using a pair of consecutive images of the sequence. The goal is to
explore the space of possible solutions of (u, v) such that, for a given value found
at the kth iteration, the function E(u, v) is minimized within a minimum acceptable
error for the type of dynamic images of the sequence in question. The iterative
procedure applied to two images would be the following:
1. From the sequence of images, choose an adjacent pair of images I1 and I2, to each of which a two-dimensional spatial Gaussian filter with an appropriate standard deviation σ is applied to attenuate the noise. Apply a Gaussian filter also along the time component, considering the images adjacent to the pair according to the standard deviation σt of the temporal filter (a sketch of this filtering step is given after the procedure). The initial values of the velocity components u and v are assumed to be zero for each pixel of the image.
2. kth iterative step. Calculate the velocities u^(k) and v^(k) for all the pixels (i, j) of the image by applying the update equations (6.24):

u^(k)(i, j) = ū^(k−1)(i, j) − Ix(i, j) · [ Ix ū^(k−1) + Iy v̄^(k−1) + It ] / ( λ + Ix² + Iy² )

v^(k)(i, j) = v̄^(k−1)(i, j) − Iy(i, j) · [ Ix ū^(k−1) + Iy v̄^(k−1) + It ] / ( λ + Ix² + Iy² )    (6.24)
3. If the value of the error e is still greater than a certain threshold es, proceed with the next iteration, that is, return to step 2 of the procedure; otherwise, the iterative process ends and the last values of u and v are taken as the definitive estimates of the optical flow map, which has the same dimensions as the images. The regularization parameter λ is experimentally set at the beginning with a value between 0 and 1, choosing by trial and error the optimal value in relation to the type of dynamic images considered.
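A minimal sketch of the filtering of step 1, assuming the sequence is stored as a (T, H, W) NumPy array and using scipy.ndimage; the values of σ and σt are illustrative and must be adapted to the sequence.

import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_sequence(frames, sigma_t=1.0, sigma_s=1.5):
    """Spatio-temporal Gaussian pre-filtering of an image sequence stored as a
    (T, H, W) array: sigma_t acts along the time axis, sigma_s in the image plane."""
    return gaussian_filter(frames.astype(np.float64),
                           sigma=(sigma_t, sigma_s, sigma_s))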
The described algorithm can be modified to use all the images in the sequence.
In essence, in the iterative process, instead of always considering the same pair of
images, the following image of the sequence is considered at the kth iteration. The
algorithm is thus modified:
1. Similar to the previous one, applying the Gaussian spatial and time filter to all the
images in the sequence. The initial values of u and v instead of being set to zero
are initialized by applying Eq. (6.24) to the first two images of the sequence. The
iteration begins with k = 1, which represents the initial estimate of optical flow.
2. Calculation of the (k+1)th estimate of the optical flow velocity based on the current values of the kth iteration and on the next image of the sequence. Equations (6.25) are applied to all pixels of the image:

u^(k+1)(i, j) = ū^(k)(i, j) − Ix(i, j) · [ Ix ū^(k) + Iy v̄^(k) + It ] / ( λ + Ix² + Iy² )

v^(k+1)(i, j) = v̄^(k)(i, j) − Iy(i, j) · [ Ix ū^(k) + Iy v̄^(k) + It ] / ( λ + Ix² + Iy² )    (6.25)
3. Repeat step 2 and finish when the last image in the sequence has been processed.
The iterative process requires thousands of iterations, and only experimentally, one
can verify which are the optimal values of the regularization parameter λ and of the
threshold es that adequately minimizes the error function e.
The limits of the Horn and Schunck approach are related to the fact that in real
images the constraints of continuity of intensity and spatial coherence are violated.
In essence, the calculation of the gradient leads to two contrasting situations: on the
one hand, for the calculation of the gradient it is necessary that the intensity varies
locally in a linear way and this is generally invalid in the vicinity of the edges; on the
other hand, again in the areas of the edges that delimit an object, the smoothness
constraint is violated, since normally the surface of the object can have different
depths.
A similar problem occurs in areas where different objects move with different motions. In the border areas, notable variations in intensity occur, generating highly variable values of the flow velocity. Moreover, the smoothness component tends to propagate the flow velocity even into areas where the image does not
show significant speed changes. For example, this occurs when a single object moves
with respect to a uniform background where it becomes difficult to distinguish the
velocity vectors associated with the object from the background.
Fig. 6.18 Graphical interpretation of the Horn–Schunck iterative process for optical flow estimation. During an iteration, the new velocity (u, v) of a generic pixel (x, y) is obtained by subtracting from the local average velocity (ū, v̄) the update value given by Eq. (6.24), moving along the line perpendicular to the motion constraint line, in the direction of the spatial gradient
Fig. 6.19 Results of the optical flow calculated with the Horn–Schunck method. The first line
shows the results obtained on synthetic images [10], while in the second line the flow is calculated
on real images
A simpler approach to estimating the optical flow is based on the minimization of the function (6.17) with the least squares regression method (Least Squares Error, LSE), approximating the derivatives of the smoothness constraint E2 with simple symmetric or asymmetric differences. With these assumptions, if I(i, j) is the
pixel of the image being processed, the smoothness error constraint E 2 (i, j) is defined
as follows:
E2(i, j) = ¼ [ (u_{i+1,j} − u_{i,j})² + (u_{i,j+1} − u_{i,j})² + (v_{i+1,j} − v_{i,j})² + (v_{i,j+1} − v_{i,j})² ]    (6.28)
A better approximation would be obtained by calculating the symmetric differences
(of the type u_{i+1,j} − u_{i−1,j}). The term E1, based on Eq. (6.13), i.e., the optical flow error constraint, results in the following:

E1(i, j) = [ Ix(i, j) u_{i,j} + Iy(i, j) v_{i,j} + It(i, j) ]²    (6.29)

The regression process involves finding the set of unknowns of the flow components {u_{i,j}, v_{i,j}} which minimizes the following function:

e = Σ_{(i,j)∈Ω} [ E1(i, j) + λ E2(i, j) ]    (6.30)
According to the LSE method, differentiating the function e(i, j) with respect to the
unknowns u i, j and vi, j for E 1 (i, j) we have
∂E1(i, j)/∂u_{i,j} = 2 [ Ix(i, j) u_{i,j} + Iy(i, j) v_{i,j} + It(i, j) ] Ix(i, j)

∂E1(i, j)/∂v_{i,j} = 2 [ Ix(i, j) u_{i,j} + Iy(i, j) v_{i,j} + It(i, j) ] Iy(i, j)    (6.31)
In (6.32), we have the only unknown term u i, j and we can simplify it by putting it
in the following form:
(1/4) ∂E2(i, j)/∂u_{i,j} = 2 [ u_{i,j} − ¼ (u_{i+1,j} + u_{i,j+1} + u_{i−1,j} + u_{i,j−1}) ] = 2 [ u_{i,j} − ū_{i,j} ]    (6.33)

where ū_{i,j} is the local average of u around the pixel (i, j).
Differentiating E 2 (i, j) (Eq. 6.28) with respect to vi, j we obtain the analogous
expression:
(1/4) ∂E2(i, j)/∂v_{i,j} = 2 [ v_{i,j} − v̄_{i,j} ]    (6.34)
Combining together the results of the partial derivatives of E 1 (i, j) and E 2 (i, j) we
have the partial derivatives of the function e(i, j) to be minimized:
∂e(i, j)/∂u_{i,j} = 2 [ u_{i,j} − ū_{i,j} ] + 2λ [ Ix(i, j) u_{i,j} + Iy(i, j) v_{i,j} + It(i, j) ] Ix(i, j)

∂e(i, j)/∂v_{i,j} = 2 [ v_{i,j} − v̄_{i,j} ] + 2λ [ Ix(i, j) u_{i,j} + Iy(i, j) v_{i,j} + It(i, j) ] Iy(i, j)    (6.35)
Setting to zero the partial derivatives of the error function (6.35) and solving with
respect to the unknowns u i, j and vi, j , the following iterative equations are obtained:
u_{i,j}^(k+1) = ū_{i,j}^(k) − λ Ix(i, j) · [ Ix(i, j) ū_{i,j}^(k) + Iy(i, j) v̄_{i,j}^(k) + It(i, j) ] / ( 1 + λ [ Ix²(i, j) + Iy²(i, j) ] )

v_{i,j}^(k+1) = v̄_{i,j}^(k) − λ Iy(i, j) · [ Ix(i, j) ū_{i,j}^(k) + Iy(i, j) v̄_{i,j}^(k) + It(i, j) ] / ( 1 + λ [ Ix²(i, j) + Iy²(i, j) ] )    (6.36)
For the calculation of the optical flow, at least two adjacent images of the temporal sequence are used, at times t and t + 1. In the discrete case, the iterative equations (6.36) are used to calculate the value of the velocities u_{i,j}^(k) and v_{i,j}^(k) at the kth iteration for each pixel (i, j) of the image of size M × N. The spatial and temporal gradient at each pixel is calculated using one of the convolution masks (Sobel, Roberts, . . .) described in Chap. 1 Vol. II.
The original implementation of Horn estimated the spatial and temporal derivatives (the data of the problem) Ix, Iy, It by averaging the first differences (horizontal, vertical, and temporal) between the pixel being processed (i, j) and its spatially and temporally adjacent pixels, as follows:

Ix(i, j, t) ≈ ¼ [ I(i, j+1, t) − I(i, j, t) + I(i+1, j+1, t) − I(i+1, j, t) + I(i, j+1, t+1) − I(i, j, t+1) + I(i+1, j+1, t+1) − I(i+1, j, t+1) ]

Iy(i, j, t) ≈ ¼ [ I(i+1, j, t) − I(i, j, t) + I(i+1, j+1, t) − I(i, j+1, t) + I(i+1, j, t+1) − I(i, j, t+1) + I(i+1, j+1, t+1) − I(i, j+1, t+1) ]    (6.37)

It(i, j, t) ≈ ¼ [ I(i, j, t+1) − I(i, j, t) + I(i+1, j, t+1) − I(i+1, j, t) + I(i, j+1, t+1) − I(i, j+1, t) + I(i+1, j+1, t+1) − I(i+1, j+1, t) ]
To make the calculation of the optical flow more efficient in each iteration, it is useful
to formulate Eq. (6.36) as follows:
u_{i,j}^(k+1) = ū_{i,j}^(k) − α Ix(i, j)

v_{i,j}^(k+1) = v̄_{i,j}^(k) − α Iy(i, j)    (6.38)
where

α(i, j, k) = λ [ Ix(i, j) ū_{i,j}^(k) + Iy(i, j) v̄_{i,j}^(k) + It(i, j) ] / ( 1 + λ [ Ix²(i, j) + Iy²(i, j) ] )    (6.39)
Recall that the averages ū i, j and v̄i, j are calculated on the 4 adjacent pixels as
indicated in (6.33). The pseudo code of the Horn–Schunck algorithm is reported in
Algorithm 26.
Algorithm 26 Pseudo code for the calculation of the optical flow based on the discrete Horn–Schunck method.
1: Input: Maximum number of iterations Niter = 10; λ = 0.1 (adapt experimentally)
2: Output: The dense optical flow u(i, j), v(i, j), i = 1, M and j = 1, N
3: for i ← 1 to M do
4:    for j ← 1 to N do
5:       Calculate Ix(i, j, t), Iy(i, j, t), and It(i, j, t) with Eq. (6.37)
6:       Handle the image areas with edges
7:       u(i, j) ← 0
8:       v(i, j) ← 0
9:    end for
10: end for
11: for k ← 1 to Niter do
12:    for i ← 1 to M do
13:       for j ← 1 to N do
14:          Calculate the local averages ū(i, j) and v̄(i, j) on the 4 adjacent pixels, as in (6.33)
15:          Calculate α(i, j, k) with Eq. (6.39)
16:          u(i, j) ← ū(i, j) − α · Ix(i, j)    (Eq. 6.38)
17:          v(i, j) ← v̄(i, j) − α · Iy(i, j)
18:       end for
19:    end for
20: end for
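A compact NumPy sketch of the discrete Horn–Schunck scheme summarized in Algorithm 26; for brevity the derivative estimates of Eq. (6.37) are replaced by central differences on the frame average and a simple frame difference, so the code is an illustrative approximation rather than a faithful reproduction of the original implementation.

import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, lam=0.1, n_iter=100):
    """Dense optical flow (u, v) between two consecutive frames using the
    discrete Horn-Schunck iteration of Eqs. (6.38)-(6.39)."""
    I1 = I1.astype(np.float64)
    I2 = I2.astype(np.float64)

    # Simplified derivative estimates (in place of Eq. 6.37).
    Iy, Ix = np.gradient(0.5 * (I1 + I2))   # spatial derivatives (rows=y, cols=x)
    It = I2 - I1                            # temporal derivative

    # 4-neighbour average used for the local means of Eq. (6.33).
    avg_kernel = np.array([[0.0, 0.25, 0.0],
                           [0.25, 0.0, 0.25],
                           [0.0, 0.25, 0.0]])

    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg_kernel, mode="nearest")
        v_bar = convolve(v, avg_kernel, mode="nearest")
        # Common update factor alpha of Eq. (6.39).
        alpha = lam * (Ix * u_bar + Iy * v_bar + It) / (1.0 + lam * (Ix**2 + Iy**2))
        u = u_bar - alpha * Ix              # Eq. (6.38)
        v = v_bar - alpha * Iy
    return u, v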
The preceding methods used for the calculation of the optical flow have the drawback, being iterative, that convergence in minimizing the error function is not guaranteed. Furthermore, they require the calculation of derivatives of order higher than the first and, in areas with gray-level discontinuities, the optical flow velocities are estimated with considerable error. An alternative approach is based on the assumption that, if the optical flow velocity components (vx, vy) remain locally constant within windows of limited size, the brightness conservation equations yield a system of linear equations solvable with least squares approaches [12].
Figure 6.17c shows how the lines defined by each pixel of the window represent
geometrically in the domain (vx , v y ) the optical flow Eq. 6.13. Assuming that the
window pixels have the same speed, the lines intersect in a limited area whose
center of gravity represents the real 2D motion. The size of the intersection area of
the lines also depends on the error with which the spatial derivatives (I x , I y ) and
the temporal derivative It are estimated, caused by the noise of the sequence of
images. Therefore, the velocity v of the pixel being processed is estimated by a linear regression method (line fitting), setting up an overdetermined system of linear equations that defines an energy function.
Applying the brightness continuity equation (6.14) to the N pixels pi of a window W (centered on the pixel being processed) of the image, we have the following system of linear equations:

∇I(p1) · v(p1) = −It(p1)
∇I(p2) · v(p2) = −It(p2)
. . .
∇I(pN) · v(pN) = −It(pN)    (6.40)
and indicating with A the matrix of the components I x ( pi ) and I y ( pi ) of the spatial
gradient of the image, with b the matrix of the components It ( pi ) of the temporal
gradient and with v the speed of the optical flow, we can express the previous relation
in the compact matrix form:
A · v = −b    (6.42)

where A has size N × 2, v size 2 × 1, and b size N × 1.
With N > 2, the linear system (6.42) is overdetermined; this means that it is not possible to find an exact solution, but only an approximate estimate ṽ which minimizes the norm of the residual vector e obtained with the least squares approach:

e = A ṽ + b    (6.43)

Setting to zero the derivative of ‖e‖² with respect to ṽ gives the normal equations

(A^T · A) · ṽ = −A^T · b    (6.44)

where A^T A has size 2 × 2, so that the least squares estimate of the flow is

ṽ = −(A^T A)^{−1} A^T b    (6.45)

for the image pixel being processed, centered on the window W. The solution (6.45) exists if the matrix (A^T A)⁴ is invertible; this matrix is calculated as follows:
A^T A = [ Iαα  Iαβ ;  Iαβ  Iββ ],   with  Iαα = Σ_{i=1}^{N} Ix(pi)²,  Iαβ = Σ_{i=1}^{N} Ix(pi) Iy(pi),  Iββ = Σ_{i=1}^{N} Iy(pi)²    (6.46)
4 The matrix (A^T A) is known in the literature as the structure tensor of the image relative to a pixel p. The term derives from the concept of tensor, which generically indicates a linear algebraic structure able to describe mathematically a physical phenomenon invariant with respect to the adopted reference system. In this case, it concerns the analysis of the local motion associated with the pixels of the window W centered in p.
Fig. 6.20 Operating conditions of the Lucas–Kanade method. In homogeneous zones, the eigenvalues of the tensor matrix A^T A are small, while in zones with texture they are large. The eigenvalues indicate the robustness of the contours along the two main directions. On a contour, the matrix becomes singular (not invertible) if all the gradient vectors are oriented in the same direction along the contour (aperture problem: only the normal flow is computable)
Returning to Eq. (6.46), it is also observed that the temporal gradient component It does not appear in the matrix (A^T A); consequently, the accuracy of the optical flow estimate is closely linked to the correct calculation of the spatial gradient components Ix and Iy. Now, substituting into Eq. (6.45) the inverse matrix (A^T A)^{−1} obtained from (6.46), and the term A^T b given by

A^T b = [ Iαγ ;  Iβγ ],   with  Iαγ = Σ_{i=1}^{N} Ix(pi) It(pi),  Iβγ = Σ_{i=1}^{N} Iy(pi) It(pi)    (6.48)
it follows that the velocity components of the optical flow are calculated from the
following relation:
ṽx = −( Iαγ Iββ − Iβγ Iαβ ) / ( Iαα Iββ − Iαβ² )

ṽy = −( Iβγ Iαα − Iαγ Iαβ ) / ( Iαα Iββ − Iαβ² )    (6.49)
The process is repeated for all the points of the image, thus obtaining a dense map of optical flow. Window sizes are normally chosen as 3 × 3 or 5 × 5. Before calculating the optical flow, it is necessary to filter the noise of the images with a Gaussian filter of standard deviation σ in space and σt in the time direction. Several other methods, described in [13], have been developed in the literature.
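A per-pixel sketch of the Lucas–Kanade solution (Eqs. 6.46–6.49), assuming the derivatives Ix, Iy, It have already been estimated for the whole image; the unweighted square window and the threshold tau on the determinant are our own simplifications.

import numpy as np
from scipy.ndimage import uniform_filter

def lucas_kanade(Ix, Iy, It, win=5, tau=1e-2):
    """Dense Lucas-Kanade flow: for every pixel, accumulate the products of the
    derivatives over a win x win window (the entries of A^T A and A^T b in
    Eqs. 6.46 and 6.48) and solve the 2x2 system as in Eq. (6.49)."""
    # Window sums implemented as local averages; the constant scale factor
    # cancels out in the solution of the linear system.
    Ixx = uniform_filter(Ix * Ix, size=win)
    Ixy = uniform_filter(Ix * Iy, size=win)
    Iyy = uniform_filter(Iy * Iy, size=win)
    Ixt = uniform_filter(Ix * It, size=win)
    Iyt = uniform_filter(Iy * It, size=win)

    det = Ixx * Iyy - Ixy**2                 # determinant of A^T A
    valid = det > tau                        # reject nearly singular tensors (aperture problem)
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    # Closed-form 2x2 inversion, Eq. (6.49): v = -(A^T A)^{-1} A^T b.
    u[valid] = -(Iyy[valid] * Ixt[valid] - Ixy[valid] * Iyt[valid]) / det[valid]
    v[valid] = -(Ixx[valid] * Iyt[valid] - Ixy[valid] * Ixt[valid]) / det[valid]
    return u, v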
Figure 6.21 shows the results of the Lucas–Kanade method applied on synthetic
and real images with various types of simple and complex motion. The spatial and
temporal gradient was calculated with windows of sizes from 3 × 3 up to 13 × 13.
In the case of an RGB color image, Eq. (6.42) (which assumes locally constant motion) can still be used by considering a gradient matrix of size 3N × 2 and a temporal gradient vector of size 3N × 1: in essence, each pixel is represented by the triad of color components, thus extending the dimensions of the spatial and temporal gradient matrices.
WAv = Wb (6.50)
where both sides are weighted in the same way. To find the solution, some matrix manipulations are needed. We multiply both sides of the previous equation by (W A)^T and we get
Fig. 6.21 Results of the optical flow calculated with the Lucas–Kanade method. The first line
shows the results obtained on synthetic images [10], while in the second line the flow is calculated
on real images
\[
\underset{2\times 2}{(A^T W^2 A)}\; v = A^T W^2 b \tag{6.51}
\]
If the determinant of (A^T W^2 A) is different from zero, its inverse (A^T W^2 A)^{-1} exists, and we can solve Eq. (6.51) with respect to v, obtaining (6.52). The original algorithm is thus modified using, for each pixel p(x, y) of the image I(:, :, t), the system of Eq. (6.50), which includes the weight matrix; the least squares solution is given by (6.52).
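As an illustration of the weighted variant of Eqs. (6.50)–(6.52), the following minimal sketch solves the weighted normal equations for a single window; it assumes the window values and weights are supplied as flattened numpy arrays, and the names are purely illustrative.

```python
import numpy as np

def weighted_lk_pixel(ix, iy, it, weights):
    """Weighted least-squares solution v = (A^T W^2 A)^{-1} A^T W^2 b for one
    window (Eqs. 6.50-6.52); ix, iy, it, weights are flattened window values."""
    A = np.stack([ix, iy], axis=1)        # N x 2 spatial gradients
    b = -it                                # right-hand side from brightness constancy
    W2 = np.diag(weights ** 2)             # squared weights (e.g., Gaussian window)
    ATA = A.T @ W2 @ A
    if abs(np.linalg.det(ATA)) < 1e-9:     # not invertible: aperture problem
        return np.zeros(2)
    return np.linalg.solve(ATA, A.T @ W2 @ b)
```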
where ∇_{x,y} denotes the spatial gradient, limited to the derivatives with respect to the x and y axes (it does not include the time derivative). Compared to the quadratic functions of the Horn–Schunck method, which do not allow flow discontinuities (at the edges and in the presence of gray-level noise), Eq. (6.53) based on the gradient is a more robust constraint. With this new constraint, it is possible to derive an energy function E_1(u, v) (data term) that penalizes deviations from these assumptions of constancy of gray levels and spatial gradient, resulting in the following:
\[
E_1(u,v) = \sum_{(x,y)\in\Omega} \Big[ \big( I(x+u, y+v, t+1) - I(x,y,t) \big)^2 + \lambda_1 \big| \nabla_{x,y} I(x+u, y+v, t+1) - \nabla_{x,y} I(x,y,t) \big|^2 \Big] \tag{6.54}
\]
where λ_1 > 0 weighs one assumption relative to the other. Furthermore, the smoothness term E_2(u, v) must be considered, as was done with the function (6.17). In this case, the third component of the gradient is considered, which relates two temporally adjacent images from t to t + 1. Therefore, indicating with ∇_3 = (∂_x, ∂_y, ∂_t) the associated space–time gradient, the smoothness term E_2 is formulated as follows:
\[
E_2(u,v) = \sum_{(x,y)\in\Omega} \big( |\nabla_3 u|^2 + |\nabla_3 v|^2 \big) \tag{6.55}
\]
The total energy function E(u, v) to be minimized, which estimates the optical flow (u, v) for each pixel of the image sequence I(:, :, t), is given by the weighted sum of the data and smoothness terms:
\[
E(u,v) = E_1(u,v) + \lambda_2 E_2(u,v) \tag{6.56}
\]
where λ_2 > 0 appropriately weighs the smoothness term with respect to the data term (total variation with respect to the assumptions of constancy of the gray levels and of the gradient). Since the data term E_1 is set with a quadratic expression (Eq. 6.54), the outliers (due to the variation of pixel intensities and the presence of noise) heavily influence the flow estimation. Therefore, instead of the least squares approach, the optimization problem is set with a more robust energy function, based on the increasing concave function Ψ(s²), given by:
\[
\Psi(s^2) = \sqrt{s^2 + \epsilon^2} \approx |s|, \qquad \epsilon = 0.001 \tag{6.57}
\]
and
\[
E_2(u,v) = \sum_{(x,y)\in\Omega} \Psi\big( |\nabla_3 u|^2 + |\nabla_3 v|^2 \big) \tag{6.59}
\]
It is understood that the total energy, expressed by (6.56), is the weighted sum of the terms E_1 and E_2, respectively data and smoothness, controlled with the regularization parameter λ_2. It should be noted that with this approach the influence of the outliers is attenuated, since the optimization problem is based on the l_1 norm of the gradient (known as total variation, TV-l_1) instead of the l_2 norm.
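The penalty (6.57) and its derivative, which appears below as the robustness factor and diffusivity, can be written in a few lines; this is a simple sketch with an assumed ε = 0.001, not tied to any particular implementation.

```python
import numpy as np

def psi(s2, eps=1e-3):
    """Robust penalty of Eq. (6.57): Psi(s^2) = sqrt(s^2 + eps^2) ~ |s| (TV-l1)."""
    return np.sqrt(s2 + eps ** 2)

def psi_prime(s2, eps=1e-3):
    """Derivative Psi'(s^2) = 1 / (2 sqrt(s^2 + eps^2)), used as the robustness
    factor / diffusivity in Eqs. (6.60)-(6.68)."""
    return 0.5 / np.sqrt(s2 + eps ** 2)
```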
The goal now is to find the functions u(x, y) and v(x, y) that minimize these energy functions, trying to reach a global minimum. From the theory of the calculus of variations, a function that minimizes an energy functional must satisfy the Euler–Lagrange differential equations.
\[
\Psi'\!\big( I_t^2 + \lambda_1 (I_{xt}^2 + I_{yt}^2) \big) \cdot \big( I_x I_t + \lambda_1 (I_{xx} I_{xt} + I_{xy} I_{yt}) \big) - \lambda_2 \cdot \mathrm{div}\Big( \Psi'\!\big( \|\nabla_3 u\|^2 + \|\nabla_3 v\|^2 \big)\, \nabla_3 u \Big) = 0 \tag{6.60}
\]
\[
\Psi'\!\big( I_t^2 + \lambda_1 (I_{xt}^2 + I_{yt}^2) \big) \cdot \big( I_y I_t + \lambda_1 (I_{yy} I_{yt} + I_{xy} I_{xt}) \big) - \lambda_2 \cdot \mathrm{div}\Big( \Psi'\!\big( \|\nabla_3 u\|^2 + \|\nabla_3 v\|^2 \big)\, \nabla_3 v \Big) = 0 \tag{6.61}
\]
where Ψ' indicates the derivative of Ψ, applied with respect to u (in 6.60) and with respect to v (in 6.61). The divergence div indicates the sum of the space–time gradients of smoothness ∇_3 = (∂_x, ∂_y, ∂_t) related to u and v. Recall that I_x, I_y, and I_t are the derivatives of I(:, :, t) with respect to the spatial coordinates x and y of the pixels and with respect to the time coordinate t; I_xx, I_yy, I_xy, I_xt, and I_yt are their second derivatives. The problem data are all the derivatives calculated from two consecutive images of the sequence. The solution w = (u, v, 1), in each point p(x, y) ∈ Ω, of the nonlinear Eqs. (6.60) and (6.61) can be found with an iterative method of numerical approximation. The authors of BBPW used the one based on fixed point iterations5 on w. The iterative formulation with index k of the previous nonlinear equations, starting from the initial value w^{(0)} = (0, 0, 1)^T, results in the following:
\[
\begin{cases}
\Psi'\!\Big( \big(I_t^{(k+1)}\big)^2 + \lambda_1 \Big( \big(I_{xt}^{(k+1)}\big)^2 + \big(I_{yt}^{(k+1)}\big)^2 \Big) \Big) \cdot \Big( I_x^{(k)} I_t^{(k+1)} + \lambda_1 \big( I_{xx}^{(k)} I_{xt}^{(k+1)} + I_{xy}^{(k)} I_{yt}^{(k+1)} \big) \Big) \\
\quad - \lambda_2 \cdot \mathrm{div}\Big( \Psi'\!\big( \|\nabla_3 u^{(k+1)}\|^2 + \|\nabla_3 v^{(k+1)}\|^2 \big)\, \nabla_3 u^{(k+1)} \Big) = 0 \\[1.5ex]
\Psi'\!\Big( \big(I_t^{(k+1)}\big)^2 + \lambda_1 \Big( \big(I_{xt}^{(k+1)}\big)^2 + \big(I_{yt}^{(k+1)}\big)^2 \Big) \Big) \cdot \Big( I_y^{(k)} I_t^{(k+1)} + \lambda_1 \big( I_{yy}^{(k)} I_{yt}^{(k+1)} + I_{xy}^{(k)} I_{xt}^{(k+1)} \big) \Big) \\
\quad - \lambda_2 \cdot \mathrm{div}\Big( \Psi'\!\big( \|\nabla_3 u^{(k+1)}\|^2 + \|\nabla_3 v^{(k+1)}\|^2 \big)\, \nabla_3 v^{(k+1)} \Big) = 0
\end{cases} \tag{6.62}
\]
This new system is still nonlinear due to the non-linearity of the function Ψ' and of the derivatives I_∗^{(k+1)}. The non-linearity of the derivatives I_∗^{(k+1)} is removed with their first-order Taylor series expansion:
\[
I_t^{(k+1)} \approx I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)}, \quad
I_{xt}^{(k+1)} \approx I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)}, \quad
I_{yt}^{(k+1)} \approx I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \tag{6.63}
\]
5 This represents a generalization of iterative methods. In general, we want to solve a nonlinear equation f(x) = 0 by reducing it to the problem of finding a fixed point of a function y = g(x), that is, we want to find a solution α such that f(α) = 0 ⟺ α = g(α). The iteration function has the form x^{(k+1)} = g(x^{(k)}), which iteratively produces a sequence of x for each k ≥ 0 from an assigned initial x^{(0)}. Not all iteration functions g(x) guarantee convergence to the fixed point. It can be shown that, if g(x) is continuous and the sequence x^{(k)} converges, then it converges to a fixed point α, that is, α = g(α), which is also a solution of f(x) = 0.
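As a small illustration of the fixed point iteration recalled in the footnote, the following sketch iterates x^{(k+1)} = g(x^{(k)}) for a generic g; the example equation and tolerances are arbitrary choices of this sketch.

```python
import math

def fixed_point(g, x0, tol=1e-10, max_iter=100):
    """Generic fixed-point iteration x_{k+1} = g(x_k) (see footnote 5)."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: solve f(x) = x - cos(x) = 0 by iterating x = g(x) = cos(x).
root = fixed_point(math.cos, x0=1.0)   # converges to about 0.739085
```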
Therefore, we can separate the unknowns u^{(k+1)} and v^{(k+1)} into the solutions of the previous iterative step, u^{(k)} and v^{(k)}, and the unknown increments du^{(k)} and dv^{(k)}, having
\[
u^{(k+1)} = u^{(k)} + du^{(k)}, \qquad v^{(k+1)} = v^{(k)} + dv^{(k)} \tag{6.64}
\]
Substituting the expanded derivatives (6.63) in the first equation of the system (6.62)
and separating for simplicity the terms data and smoothness, we have by definition
the following expressions:
\[
(\Psi')_{E_1}^{(k)} := \Psi'\Big( \big( I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)} \big)^2 + \lambda_1 \Big[ \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)} \big)^2 + \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \big)^2 \Big] \Big) \tag{6.65}
\]
\[
(\Psi')_{E_2}^{(k)} := \Psi'\Big( \big| \nabla_3 \big( u^{(k)} + du^{(k)} \big) \big|^2 + \big| \nabla_3 \big( v^{(k)} + dv^{(k)} \big) \big|^2 \Big) \tag{6.66}
\]
The term (Ψ')_{E_1}^{(k)} defined by (6.65) is interpreted as a robustness factor in the data term, while the term (Ψ')_{E_2}^{(k)} defined by (6.66) is considered to be the diffusivity in the smoothness term. With these definitions, the first equation of the system (6.62) is rewritten as follows:
\[
\begin{aligned}
&(\Psi')_{E_1}^{(k)} \cdot \Big( I_x^{(k)} \big( I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)} \big) \Big) \\
&\quad + \lambda_1 (\Psi')_{E_1}^{(k)} \cdot \Big( I_{xx}^{(k)} \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)} \big) + I_{xy}^{(k)} \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \big) \Big) \\
&\quad - \lambda_2\, \mathrm{div}\Big( (\Psi')_{E_2}^{(k)}\, \nabla_3 \big( u^{(k)} + du^{(k)} \big) \Big) = 0
\end{aligned} \tag{6.67}
\]
Similarly, the second equation of the system (6.62) is redefined. For a fixed value of k, (6.67) is still nonlinear, but now, having already estimated u^{(k)} and v^{(k)} with the first fixed point approximation, the unknowns in Eq. (6.67) are the increments du^{(k)} and dv^{(k)}. Thus, there remains only the non-linearity due to the derivative Ψ'; but since Ψ was chosen as a convex function,6 the remaining optimization problem is convex, that is, a unique global minimum can exist.
The non-linearity in Ψ' can be removed by applying a second iterative process based on the search for the fixed point of Eq. (6.67). Now consider the unknown variables to iterate, du^{(k,l)} and dv^{(k,l)}, where l indicates the l-th iteration. We assume as initial values du^{(k,0)} = 0 and dv^{(k,0)} = 0, and we indicate with (Ψ')_{E_1}^{(k,l)} and (Ψ')_{E_2}^{(k,l)}, respectively, the robustness factor and the diffusivity expressed by the respective Eqs. (6.65) and (6.66) at the (k,l)-th iteration. Finally, we can formulate the first
6 A function f(x) with real values, defined on an interval, is called convex if the segment joining any two points of its graph lies above the graph itself. Convexity simplifies the analysis and solution of an optimization problem. It can be shown that the minimization of a convex function, defined on a convex set, either has no solution or has only global solutions; it cannot have exclusively local solutions.
linear system equation in an iterative form with the unknowns du (k,l+1) and dv(k,l+1)
given by
\[
\begin{aligned}
&(\Psi')_{E_1}^{(k,l)} \cdot \Big( I_x^{(k)} \big( I_t^{(k)} + I_x^{(k)} du^{(k,l+1)} + I_y^{(k)} dv^{(k,l+1)} \big) \Big) \\
&\quad + \lambda_1 (\Psi')_{E_1}^{(k,l)} \cdot \Big( I_{xx}^{(k)} \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k,l+1)} + I_{xy}^{(k)} dv^{(k,l+1)} \big) + I_{xy}^{(k)} \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k,l+1)} + I_{yy}^{(k)} dv^{(k,l+1)} \big) \Big) \\
&\quad - \lambda_2\, \mathrm{div}\Big( (\Psi')_{E_2}^{(k,l)}\, \nabla_3 \big( u^{(k)} + du^{(k,l+1)} \big) \Big) = 0
\end{aligned} \tag{6.68}
\]
This system can be solved using standard iterative numerical methods (Gauss–Seidel, Jacobi, Successive Over-Relaxation, SOR, also called the over-relaxation method) for linear systems, even of large size and with sparse matrices (presence of many null elements).
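As an illustration of this kind of iterative solver, the following is a minimal dense SOR sketch (ω = 1 reduces it to Gauss–Seidel); a practical implementation for the flow equations would exploit the sparsity of the system, and all names and defaults here are illustrative.

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-8, max_iter=10000):
    """Successive Over-Relaxation for A x = b (omega = 1 gives Gauss-Seidel).
    Dense version for illustration; with a sparse matrix only the nonzero
    neighbour coefficients would be visited."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x_old[i+1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x
```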
The model of motion considered until now is of pure translation. If we consider that a small region R at time t is subject to an affine motion model, at time t + Δt the speed (or displacement) of the corresponding pixels is given by
\[
u(x; p) = \begin{bmatrix} u(x; p) \\ v(x; p) \end{bmatrix} = \begin{bmatrix} p_1 + p_2 x + p_3 y \\ p_4 + p_5 x + p_6 y \end{bmatrix} \tag{6.69}
\]
remembering the affine transformation equations (described in Sect. 3.3 Vol. II) and
assuming that the speed is constant [12] for the pixels of the region R. By replacing
(6.69) in the equation of the optical flow (6.13), it is possible to set the following
error function:
\[
e(x; p) = \sum_{x\in R} \big( \nabla I^T u(x; p) + I_t \big)^2 = \sum_{x\in R} \big( I_x p_1 + I_x p_2 x + I_x p_3 y + I_y p_4 + I_y p_5 x + I_y p_6 y + I_t \big)^2 \tag{6.70}
\]
to be minimized with the least squares method. Given that there are 6 unknown motion parameters and each pixel provides only one linear equation, at least 6 pixels of the region are required to set up a system of linear equations. In fact, the minimization of the error function requires differentiating (6.70) with respect to the unknown vector p, setting the result of the differentiation equal to zero, and solving with respect to the motion parameters p_i, thus obtaining the following system of linear equations:
\[
\begin{bmatrix}
I_x^2 & x I_x^2 & y I_x^2 & I_x I_y & x I_x I_y & y I_x I_y \\
x I_x^2 & x^2 I_x^2 & x y I_x^2 & x I_x I_y & x^2 I_x I_y & x y I_x I_y \\
y I_x^2 & x y I_x^2 & y^2 I_x^2 & y I_x I_y & x y I_x I_y & y^2 I_x I_y \\
I_x I_y & x I_x I_y & y I_x I_y & I_y^2 & x I_y^2 & y I_y^2 \\
x I_x I_y & x^2 I_x I_y & x y I_x I_y & x I_y^2 & x^2 I_y^2 & x y I_y^2 \\
y I_x I_y & x y I_x I_y & y^2 I_x I_y & y I_y^2 & x y I_y^2 & y^2 I_y^2
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \end{bmatrix}
= -\begin{bmatrix} I_x I_t \\ x I_x I_t \\ y I_x I_t \\ I_y I_t \\ x I_y I_t \\ y I_y I_t \end{bmatrix} \tag{6.71}
\]
where each entry is summed over the pixels x ∈ R of the region.
As for pure translation, also for affine motion the Taylor series approximation constitutes only a rough estimate of the real motion. This imprecision can be mitigated with the iterative alignment approach proposed by Lucas–Kanade [12]. In the case of affine motion with large displacements, we can use the multi-resolution approach described in the previous paragraphs. With the same method, it is possible to manage other motion models (homography, quadratic, . . .).
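The following minimal sketch assembles and solves the least-squares problem for the six affine parameters of Eqs. (6.69)–(6.71); it assumes the region's derivatives and pixel coordinates are supplied as flattened arrays, and the names are illustrative rather than the authors' code.

```python
import numpy as np

def affine_flow_params(Ix, Iy, It, X, Y):
    """Least-squares affine motion parameters p1..p6 for a region R (Eq. 6.71).
    Ix, Iy, It, X, Y are flattened arrays of derivatives and pixel coordinates
    over the region's pixels."""
    # Each pixel contributes one linearized constraint (Eq. 6.70):
    # Ix*p1 + Ix*x*p2 + Ix*y*p3 + Iy*p4 + Iy*x*p5 + Iy*y*p6 = -It
    M = np.stack([Ix, Ix * X, Ix * Y, Iy, Iy * X, Iy * Y], axis=1)  # N x 6
    rhs = -It
    # Solve in the least squares sense (equivalent to the 6x6 normal
    # equations of Eq. 6.71)
    p, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return p
```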
In the case of a planar quadratic motion model, it is approximated by
u = p1 + p2 x + p3 y + p7 x 2 + p8 x y
(6.72)
v = p4 + p5 x + p6 y + p7 x y + p8 y 2
Fig. 6.22 Iterative refinement of the optical flow. Representation of the 1D signals I(x, t) and I(x, t + 1) relative to the temporal images observed in two instants of time. Starting from an initial speed, the flow is updated by imagining to translate the signal I(x, t) to superimpose it over that of I(x, t + 1). Convergence occurs in a few iterations by calculating, in each iteration, the space–time gradient on the window centered on the pixel being processed, which gradually shifts less and less until the two signals overlap. The flow update takes place with (6.75); as shown in the figure, the time derivative varies at each iteration while the spatial derivative remains constant
1. Compute for the pixel p the spatial derivative I_x using the pixels close to p;
2. Set the initial speed of p. Normally we assume u ← 0;
3. Repeat until convergence:
(a) Locate the pixel in the adjacent time image I(x', t + 1), assuming the current speed u. Let I(x', t + 1) = I(x + u, t + 1), obtained by interpolation, considering that the values of u are not always integers.
(b) Compute the time derivative I_t = I(x', t + 1) − I(x, t) as an approximation based on the difference of the interpolated pixel intensities.
(c) Update the speed u according to Eq. (6.74).
It is observed that during the iterative process the 1D spatial derivative I_x remains constant, while the speed u is refined in each iteration starting from a very approximate initial value or assumed zero. With Newton's method (also known as the Newton–Raphson method), it is possible to generate a sequence of values of u starting from a plausible initial value that, after a certain number of iterations, converges to an approximation of the root of Eq. (6.74), under the hypothesis that I(x, t) is differentiable. Therefore, given the signal I(x(t), t) and knowing an initial value u^{(k)}(x), calculated, for example, with the Lucas–Kanade approach, it is possible to obtain the next value of the speed, solution of (6.74), with the general iterative formula:
\[
u^{(k+1)}(x) \leftarrow u^{(k)}(x) - \frac{I_t(x)}{I_x(x)} \tag{6.75}
\]
where k indicates the iteration number.
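A minimal sketch of this 1D iterative refinement follows, assuming 1D signals sampled at integer positions and linear interpolation for the warped value; the temporal residual is taken as the warped next-frame value minus the current-frame value, and all names are illustrative.

```python
import math

def refine_speed_1d(I_t0, I_t1, x, u0=0.0, n_iter=10):
    """1D iterative refinement of the speed u at position x (Eqs. 6.74-6.75).
    I_t0 and I_t1 are 1D signals at times t and t+1; the spatial derivative
    Ix stays fixed while u is refined."""
    Ix = 0.5 * (I_t0[x + 1] - I_t0[x - 1])      # central spatial derivative at x
    u = u0
    for _ in range(n_iter):
        xf = x + u                               # displaced (subpixel) position
        i0 = math.floor(xf)
        frac = xf - i0
        I_warp = (1 - frac) * I_t1[i0] + frac * I_t1[i0 + 1]  # linear interpolation
        It = I_warp - I_t0[x]                    # residual temporal derivative
        if abs(Ix) < 1e-12:
            break
        u = u - It / Ix                          # Newton-like update, Eq. (6.75)
    return u
```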
The iterative approach for the optical flow (u, v) extended to the 2D case is implemented considering Eq. (6.13), which we know has two unknowns u and v but only one equation, yet is solvable with the simple Lucas–Kanade method described in Sect. 6.4.4. The 2D iterative procedure is as follows:
1. Compute the speeds (u, v) in each pixel of the image considering the adjacent
images I 1 and I 2 (of a temporal sequence) using the Lucas–Kanade method,
Eq. (6.49).
2. Transform (warp) the image I 1 into the image I 2 with bilinear interpolation
(to calculate the intensities at the subpixel level) using the optical flow speeds
previously calculated.
3. Repeat the previous steps until convergence.
Convergence occurs when, applying step 2, the translation of the image at time t leads to its overlap with the image at time t + 1; it follows that the ratio I_t/I_x becomes null and the speed value remains unchanged.
1. Generate the Gaussian pyramids associated with the images I(x, y, t) and I(x, y, t + 1) (normally 3-level pyramids are used).
2. Compute the optical flow (u_0, v_0) with the simple Lucas–Kanade (LK) method at the highest (coarsest) level of the pyramid;
3. Reapply LK iteratively over the current images to update (correct) the flow, normally converging within 5 iterations; this becomes the initial flow.
4. Then, at each level i of the pyramid, perform the following steps (a sketch of this coarse-to-fine scheme is given after Fig. 6.23):
Fig. 6.23 Estimation of the optical flow with a multi-resolution approach by generating two Gaus-
sian pyramids relative to the two temporal images. The initial estimate of the flow is made starting
from the coarse images (at the top of the pyramid) and this flow is propagated to the subsequent
levels until it reaches the original image. With the coarse-to-fine approach, it is possible to handle
large object movements without violating the assumptions of the optical flow equation
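The coarse-to-fine strategy of Fig. 6.23 can be sketched as follows, assuming image sides that are powers of two, a crude 2 × 2 average in place of a true Gaussian pyramid, and an externally supplied routine lk_flow that refines an initial flow with the iterative Lucas–Kanade step (all of these are assumptions of the sketch, not of the original method).

```python
import numpy as np

def pyramid_flow(I1, I2, lk_flow, levels=3):
    """Coarse-to-fine flow estimation sketch (Fig. 6.23): estimate the flow on
    the coarsest level, then upsample, rescale, and refine at each finer level."""
    def down(I):                       # crude 2x2 average as pyramid reduction
        return 0.25 * (I[0::2, 0::2] + I[1::2, 0::2] + I[0::2, 1::2] + I[1::2, 1::2])

    pyr1, pyr2 = [I1.astype(float)], [I2.astype(float)]
    for _ in range(levels - 1):
        pyr1.append(down(pyr1[-1])); pyr2.append(down(pyr2[-1]))

    u = np.zeros(pyr1[-1].shape); v = np.zeros(pyr1[-1].shape)
    for l in range(levels - 1, -1, -1):             # from coarse to fine
        u, v = lk_flow(pyr1[l], pyr2[l], u, v)      # iterative LK refinement at level l
        if l > 0:                                    # propagate to the next finer level
            u = 2 * np.kron(u, np.ones((2, 2)))      # upsample and rescale displacements
            v = 2 * np.kron(v, np.ones((2, 2)))
    return u, v
```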
Figure 6.24 shows instead the results of the Lucas–Kanade method for the management of large movements, calculated on real images. In this case, the multi-resolution approach based on the Gaussian pyramid was used, iteratively refining the flow calculation in each of the three levels of the pyramid.
Fig. 6.24 Results of the optical flow calculated on real images with Lucas–Kanade’s method for
managing large movements that involves an organization with Gaussian pyramid at three levels
of the adjacent images of the temporal sequence and an iterative refinement at every level of the
pyramid
The motion of the satellite caused complex geometric deformations in the images, requiring a process of geometric transformation and resampling of the images based on the knowledge of reference points of the territory (control points, landmarks). Of these points, windows (image samples) were available, to be searched for later in the images, useful for registration. The search for such sample patches of size n × n in the images to be aligned uses the classic gray-level comparison methods, which minimize an error function based on the sum of squared differences (SSD) described in Sect. 5.8 Vol. II, on the sum of absolute differences (SAD), or on the normalized cross-correlation (NCC). The latter functionals, SAD and NCC, have been introduced in Sect. 4.7.2.
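As an illustration of this template search, the following brute-force sketch scans an image with an n × n template and scores each position with SSD or NCC; it is deliberately unoptimized and the names are illustrative.

```python
import numpy as np

def match_patch(template, image, method="SSD"):
    """Exhaustive search of an n x n template in an image using SSD or NCC."""
    n, m = template.shape
    H, W = image.shape
    best_score, best_pos = None, None
    t = template - template.mean()                     # zero-mean template for NCC
    for y in range(H - n + 1):
        for x in range(W - m + 1):
            win = image[y:y+n, x:x+m]
            if method == "SSD":                        # minimize sum of squared differences
                score = np.sum((win - template) ** 2)
                better = best_score is None or score < best_score
            else:                                      # maximize normalized cross-correlation
                w = win - win.mean()
                denom = np.sqrt(np.sum(w * w) * np.sum(t * t)) + 1e-12
                score = np.sum(w * t) / denom
                better = best_score is None or score > best_score
            if better:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```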
In the preceding paragraphs, the motion field was calculated by considering the simple translational motion of each pixel of the image, by processing (with the differential method) consecutive images of a sequence. The differential method estimates the motion of each pixel by generating a dense motion field map (optical flow); it does not perform a local search around the pixel being processed (it only calculates the local space–time gradient) and can only estimate limited motion (it works on high frame rate image sequences). With a multi-resolution implementation, large movements can be estimated, but with a high computational cost.
The search for a patch model in the sequence of images can be formulated to estimate the motion with large displacements that are not necessarily translational. Lucas and Kanade [12] have proposed a general method that aligns a portion of an image, known as a sample image (template image) T(x), with respect to an input image I(x), where x = (x, y) indicates the coordinates of the processing pixel where the template is centered. This method can be applied to motion field estimation by considering a generic patch of size n × n from the image at time t and searching for it in the image at time t + Δt. The goal is to find the position of the template T(x) in the successive images of the temporal sequence (template tracking process).
In this formulation, the search for the alignment of the template in the sequence
of varying space–time images takes place considering the variability of the intensity
of the pixels (due to the noise or to the changing of the acquisition conditions), and
a model of motion described by the function W (x; p), parameterizable in terms of
the parameters p = ( p1 , p2 , . . . , pm ). Essentially, the motion (u(x), v(x)) of the
template T (x) is estimated through the geometric transformation (warping trans-
formation) W (x; p) that aligns a portion of the deformed image I (W (x; p)) with
respect to T (x):
I (W (x; p)) ≈ T (x) (6.76)
Basically, the alignment (registration) of T (x) with patch of I (x) occurs by deforming
I (x) to match it with T (x). For example, for a simple motion of translation the
transformation function W (x; p) is given by
\[
W(x; p) = \begin{bmatrix} x + p_1 \\ y + p_2 \end{bmatrix} = \begin{bmatrix} x + u \\ y + v \end{bmatrix} \tag{6.77}
\]
The minimization of this functional occurs by looking for the unknown vector p, calculating the pixel residuals of the template T(x) in a search region D × D in I. In the case of an affine motion model (6.78), the unknown parameters are 6. The minimization process assumes that the entire template is visible in the input image I and that the pixel intensity does not vary significantly. The error function (6.79) to be minimized is non-linear even if the deformation function W(x; p) is linear with
respect to p. This is because the intensity of the pixels varies regardless of their spatial position. The Lucas–Kanade algorithm uses the Gauss–Newton approximation method to minimize the error function through an iterative process. It assumes that an initial value of the estimate of p is known, which through the iterative process is incremented by Δp so as to minimize the error function expressed in the following form (parametric model):
\[
\sum_{x\in T} \big[ I(W(x; p + \Delta p)) - T(x) \big]^2 \tag{6.80}
\]
In this way, the assumed known initial estimate is updated iteratively by solving the error function with respect to Δp:
\[
p \leftarrow p + \Delta p \tag{6.81}
\]
until convergence, checking that the norm of Δp is below a certain threshold value ε. Since the algorithm is based on the nonlinear optimization of a quadratic function with the gradient descent method, (6.80) is linearized by approximating I(W(x; p + Δp)) with its Taylor series expansion up to the first order, obtaining
\[
I(W(x; p + \Delta p)) \approx I(W(x; p)) + \nabla I \, \frac{\partial W}{\partial p} \, \Delta p \tag{6.82}
\]
having considered W (x; p) = (Wx (x; p), W y (x; p)). In the case of affine deforma-
tion (see Eq. 6.78), the Jacobian results:
\[
\frac{\partial W}{\partial p} = \begin{bmatrix} x & 0 & y & 0 & 1 & 0 \\ 0 & x & 0 & y & 0 & 1 \end{bmatrix} \tag{6.84}
\]
If instead the deformation is of pure translation, the Jacobian corresponds to the 2 × 2 identity matrix. Replacing (6.82) in the error function (6.80), we get
\[
e_{SSD} = \sum_{x\in T} \Big[ I(W(x; p)) + \nabla I \, \frac{\partial W}{\partial p} \, \Delta p - T(x) \Big]^2 \tag{6.85}
\]
\[
\frac{\partial e_{SSD}}{\partial \Delta p} = 2 \sum_{x\in T} \Big[ \nabla I \, \frac{\partial W}{\partial p} \Big]^T \Big[ I(W(x; p)) + \nabla I \, \frac{\partial W}{\partial p} \, \Delta p - T(x) \Big] = 0 \tag{6.86}
\]
where ∇I ∂W/∂p is the steepest descent term. Solving with respect to the unknown Δp, the function (6.80) is minimized with the following correction parameter:
\[
\Delta p = H^{-1} \sum_{x\in T} \underbrace{\Big[ \nabla I \, \frac{\partial W}{\partial p} \Big]^T \big[ T(x) - I(W(x; p)) \big]}_{\text{steepest descent term weighted by the error}} \tag{6.87}
\]
where
\[
H = \sum_{x\in T} \Big[ \nabla I \, \frac{\partial W}{\partial p} \Big]^T \Big[ \nabla I \, \frac{\partial W}{\partial p} \Big] \tag{6.88}
\]
is the Hessian matrix (of second derivatives, of size m × m) of the deformed image I(W(x; p)). Equation (6.88) is justified by remembering that the Hessian matrix actually represents the Jacobian of the gradient (concisely, H = J · ∇). Equation (6.87) is the expression that calculates the increment Δp used to update p and, through the predict–correct cycle, to converge toward a minimum of the error function (6.80). Returning to the Lucas–Kanade algorithm, the iterative procedure applies Eqs. (6.80) and (6.81) in each step. The essential steps of the Lucas–Kanade alignment algorithm are
1. Compute I(W(x; p)) by transforming I(x) with the warp W(x; p);
2. Compute the similarity value (error): I(W(x; p)) − T(x);
3. Compute the warped gradients ∇I = (I_x, I_y) with the transformation W(x; p);
4. Evaluate the Jacobian of the warping ∂W/∂p at (x; p);
5. Compute the steepest descent term: ∇I ∂W/∂p;
6. Compute the Hessian matrix with Eq. (6.88);
7. Multiply the steepest descent term by the error, as indicated in Eq. (6.87);
8. Compute Δp using Eq. (6.87);
9. Update the parameters of the motion model: p ← p + Δp;
10. Repeat until ‖Δp‖ < ε (a sketch of this loop for a pure translation warp is given below).
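The following minimal sketch of the alignment loop uses a pure translation warp, for which the Jacobian is the identity, so the steepest descent images reduce to (Ix, Iy) and the Hessian to a 2 × 2 matrix; nearest-neighbour warping and the parameter values are simplifications of this sketch, not of the original algorithm.

```python
import numpy as np

def lk_align_translation(I, T, p0, n_iter=50, eps=1e-3):
    """Lucas-Kanade alignment sketch for a translation warp W(x;p) = (x+p1, y+p2)."""
    I = I.astype(float); T = T.astype(float)
    p = np.array(p0, dtype=float)
    h, w = T.shape
    ys, xs = np.mgrid[0:h, 0:w]
    Iy, Ix = np.gradient(I)                                  # image gradients
    for _ in range(n_iter):
        xw = np.clip(np.round(xs + p[0]).astype(int), 0, I.shape[1] - 1)
        yw = np.clip(np.round(ys + p[1]).astype(int), 0, I.shape[0] - 1)
        Iw, Gx, Gy = I[yw, xw], Ix[yw, xw], Iy[yw, xw]       # warped image and gradients
        err = T - Iw                                          # error T(x) - I(W(x;p))
        H = np.array([[np.sum(Gx * Gx), np.sum(Gx * Gy)],
                      [np.sum(Gx * Gy), np.sum(Gy * Gy)]])    # 2x2 Hessian, Eq. (6.88)
        b = np.array([np.sum(Gx * err), np.sum(Gy * err)])    # steepest descent * error
        dp = np.linalg.solve(H, b)                            # increment, Eq. (6.87)
        p += dp
        if np.linalg.norm(dp) < eps:                          # stop when the update is small
            break
    return p
```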
Figure 6.25 shows the functional scheme of the Lucas–Kanade algorithm. In summary, the described algorithm is based on the iterative prediction–correction approach. The prediction consists of the calculation of the transformed (deformed) input image I(W(x; p)) starting from an initial estimate of the parameter vector p, once the parametric motion model has been defined (translation, rigid, affine, . . .).
Fig. 6.25 Functional scheme of the Lucas–Kanade alignment algorithm (as reported by Baker [18]). With step 1 the input image I is deformed with the current estimate of W(x; p) (thus calculating the prediction) and the result obtained is subtracted from the template T(x) (step 2), thus obtaining the error function T(x) − I(W(x; p)) between prediction and template. The rest of the steps are described in the text
Therefore, the Hessian matrix thus obtained for the pure translational motion corre-
sponds to that of the Harris corner detector already analyzed in Sect. 6.5 Vol. II. It
follows that, from the analysis of the eigenvalues (both high values) of H it can be
verified if the template is a good patch to search for in the images of the sequence in
the translation context. In applications where the sequence of images has large move-
ments the algorithm can be implemented using a multi-resolution data structure, for
example, using the coarse to fine approach with pyramidal structure of the images.
The algorithm is also sensitive to cases where the actual motion model is very different from the predicted one and where the lighting conditions change. Any occlusions become a problem for convergence. A possible mitigation of these problems can be achieved by updating the template image.
From a computational point of view, the complexity is O(m 2 N + m 3 ) where m
is the number of parameters and N is the number of pixels in the template image. In
[18], the details of the computational load related to each of the first 9 steps above
of the algorithm are reported. In [19], there are some tricks (processing of the input
image in elementary blocks) to reduce the computational load due in particular to
the calculation of the Hessian matrix and the accumulation of residuals. Later, as
an alternative to the Lucas–Kanade algorithm, other equivalent methods have been
developed to minimize the error function (6.79).
First, (6.89) is minimized with respect to Δp in each iteration, and then the deformation estimate is updated as follows:
\[
W(x; p) \leftarrow W(x; p) \circ W(x; \Delta p) \tag{6.90}
\]
In this expression, the “◦” symbol actually indicates a simple combination of the parameters of W(x; Δp) and W(x; p), and the final form is rewritten as the compositional deformation W(W(•)). The substantial difference of the compositional algorithm with respect to the Lucas–Kanade algorithm is represented by the iterative incremental deformation W(x; Δp) rather than the additive update of the parameters Δp.
In essence, Eqs. (6.80) and (6.81) of the original method are replaced with (6.89) and (6.90) of the compositional method. In other words, this variant of the iterative approximation involves updating W(x; p) through the composition of the two deformations given by (6.90).
The compositional algorithm involves the following steps:
1. Compute I(W(x; p)) by transforming I(x) with the warp W(x; p);
2. Compute the similarity value (error): I(W(x; p)) − T(x);
3. Compute the gradient ∇I(W) of the image I(W(x; p));
4. Evaluate the Jacobian ∂W/∂p at (x; 0). This step is only performed at the beginning, pre-calculating the Jacobian at (x; 0), which remains constant.
5. Compute the steepest descent term: ∇I ∂W/∂p;
6. Compute the Hessian matrix with Eq. (6.88);
7. Multiply the steepest descent term by the error, as indicated in Eq. (6.87);
8. Compute Δp using Eq. (6.87);
9. Update the parameters of the motion model with Eq. (6.90);
10. Repeat until ‖Δp‖ < ε
Basically, this is the same procedure as in Lucas–Kanade, except for the steps shown in bold, that is, step 3, where the gradient of the image I(W(x; p)) is calculated, step 4, which is executed at the beginning, outside the iterative process, initially calculating the Jacobian at (x; 0), and step 9, where the deformation W(x; p) is updated with the new Eq. (6.90). This new approach is more suitable for more complex motion models such as the homography, where the Jacobian calculation is simplified, even if the computational load is equivalent to that of the Lucas–Kanade algorithm.
where, as can be seen, the role of the I and T images is reversed. In this case, as suggested by the name, the minimization problem of (6.91) is solved by updating the estimated current deformation W(x; p) with the inverted incremental deformation W(x; Δp)^{-1}, where the increment is given by
\[
\Delta p = H^{-1} \sum_{x\in T} \Big[ \nabla T \, \frac{\partial W}{\partial p} \Big]^T \big[ I(W(x; p)) - T(x) \big] \tag{6.94}
\]
\[
H = \sum_{x\in T} \Big[ \nabla T \, \frac{\partial W}{\partial p} \Big]^T \Big[ \nabla T \, \frac{\partial W}{\partial p} \Big] \tag{6.95}
\]
7–8. Compute the incremental value of the motion model parameters Δp using Eq. (6.94);
The substantial difference between the forward compositional algorithm and the inverse compositional algorithm concerns: the calculation of the similarity value (step 1), having exchanged the roles of input image and template; steps 3, 5, and 6, which calculate the gradient of T instead of the gradient of I, with the advantage of being able to precompute it outside the iterative process; the calculation of Δp, which is done with (6.94) instead of (6.87); and finally, step 9, where the incremental deformation W(x; Δp) is inverted before being composed with the current estimate.
Regarding the computational load, the inverse compositional algorithm requires a computational complexity of O(m²N) for the initial pre-calculation steps (executed only once), where m is the number of parameters and N is the number of pixels in the template image T. For the steps of the iterative process, a computational complexity of O(mN + m³) is required for each iteration. Essentially, compared to the Lucas–Kanade and forward compositional algorithms, we have a computational load saving of O(mN + m²) per iteration. In particular, the greatest computation time is required for the calculation of the Hessian matrix (step 6), although it is done only once, keeping in memory the matrix H and the steepest descent images ∇T ∂W/∂p.
the motion field can be estimated using techniques based on the identification, in the images of the sequence in question, of some significant structures (Points Of Interest, POI). In other words, the motion estimation is calculated by first identifying the homologous points of interest in the consecutive images (a correspondence problem analogous to that of stereo vision) and measuring the disparity value between the homologous points of interest. With this method, the resulting velocity map is sparse, unlike the previous methods that generated a dense velocity map. To determine the dynamics of the scene from the sequence of time-varying images, the following steps must be performed:
For images with n pixels, the computational complexity to search for the points of interest in the two consecutive images is O(n²). To simplify the calculation of these points, we normally consider windows of at least 3 × 3 with high brightness variance. In essence, the points of interest that emerge in the two images are those with high variance, which are normally found in correspondence of corners, edges, and in general in areas with strong discontinuity of brightness.
The search for points of interest (step 2) and the search for homologous points (step 3) between consecutive images of the time sequence is carried out using the appropriate methods (Moravec, Harris, Tomasi, Lowe, . . .) described in Chap. 6 Vol. II. In particular, the methods for finding homologous points of interest have also been described in Sect. 4.7.2 in the context of stereo vision for the problem of the correspondence between stereo images. The best known algorithm in the literature for the tracking of points of interest is KLT (Kanade–Lucas–Tomasi), which integrates the Lucas–Kanade method for the calculation of the optical flow, the Tomasi–Shi method for the detection of points of interest, and the Kanade–Tomasi method for the ability to track points of interest in a sequence of time-varying images. In Sect. 6.4.8, we have already described the method of aligning a patch of the image in the temporal sequence of images.
The essential steps of the KLT algorithm are (a sketch using the OpenCV implementation is given after this list):
1. Find the POIs in the first image of the sequence with one of the methods above that satisfy min(λ_1, λ_2) > λ (default threshold value);
2. For each POI, apply a motion model (translation, affine, . . .) to calculate the displacement of these points in the next image of the sequence. For example, alignment algorithms based on the Lucas–Kanade method can be used;
3. Keep track of the motion vectors of these POIs in the sequence images;
4. Optionally, it may be useful to reactivate the POI detector (step 1) to add more points to follow; this step is executed every m processed images of the sequence (for example, every 10–20 images);
5. Repeat steps 2–3 and optionally step 4;
6. KLT returns the vectors that track the points of interest found in the image sequence.
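A compact way to experiment with KLT tracking is through the OpenCV functions goodFeaturesToTrack (Shi–Tomasi detector) and calcOpticalFlowPyrLK (pyramidal Lucas–Kanade); the following sketch assumes a list of grayscale frames and illustrative parameter values.

```python
import cv2
import numpy as np

def klt_track(frames, max_corners=200, redetect_every=15):
    """Sketch of KLT tracking over a list of grayscale frames."""
    prev = frames[0]
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    tracks = [pts]
    for i, frame in enumerate(frames[1:], start=1):
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
            prev, frame, pts, None, winSize=(21, 21), maxLevel=3)
        pts = new_pts[status.ravel() == 1].reshape(-1, 1, 2)   # keep successfully tracked points
        if i % redetect_every == 0:                             # optionally add new POIs (step 4)
            extra = cv2.goodFeaturesToTrack(frame, maxCorners=max_corners,
                                            qualityLevel=0.01, minDistance=7)
            if extra is not None:
                pts = np.vstack([pts, extra])
        tracks.append(pts)
        prev = frame
    return tracks
```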
The KLT algorithm automatically tracks the points of interest in the images of the sequence, compatibly with the robustness of the point detection algorithms and with the reliability of the tracking, which is influenced by the variability of the contrast of the images, by the noise, by the lighting conditions (which must vary little) and, above all, by the motion model. In fact, if the motion model (for example, translational or affine) changes a lot, with objects that move by many pixels in the images or change scale, there will be tracking problems, with points of interest that may appear only partially or no longer be detectable. It may also happen that, in the tracking phase, a detected point of interest has identical characteristics but belongs to a different object.
Fig. 6.26 Points of interest detected with Lowe’s SIFT algorithm for a sequence of 5 images
captured by a mobile vehicle
Fig. 6.27 Results of the correspondence of the points of interest in Fig. 6.26. a Correspondence of the points of interest relative to the first two images of the sequence, calculated with the Harris algorithm. b As in a, but related to the correspondence of the SIFT points of interest. c Tracking of the SIFT points for the entire sequence. We observe the correct correspondence of the points (the trajectories do not intersect each other), invariant to rotation, scale, and brightness variation
the right side of the corridor is highlighted. In the second line, only the POIs found
corresponding between two consecutive images are shown, useful for motion detec-
tion. Given the real navigation context, tracking points of interest must be invariant
to the variation of lighting conditions, rotation, and, above all, scale change. In fact,
the mobile vehicle during the tracking of the corresponding points captures images
of the scene where the lighting conditions vary considerably (we observe reflections
with specular areas) and between one image and another consecutive image of the
sequence the points of interest can be rotated and with a different scale. The figure
also shows that some points of interest present in the previous image are no longer
visible in the next image, while the latter contains new points of interest not visible
in the previous image due to the dynamics of the scene [21].
Figure 6.27 shows the results of the correspondence of the SIFT points of interest of Fig. 6.26 and the correspondence of the points of interest calculated with the Harris algorithm. While for the SIFT points the similarity is calculated using the SIFT descriptors, for the Harris corners the similarity measurement is calculated with the SSD, considering a square window centered on the position of the corresponding corners located in the two consecutive images.
In figure (a), the correspondences found in the first two images of the sequence, detected with the Harris corners, are reported. We observe the correct correspondence (those relating to nonintersecting lines) for corners that are translated or slightly rotated (invariance to translation), while for scaled corners the correspondence is incorrect because they are not invariant to the change of scale. Figure (b) is the analogue of (a), but reports the correspondences of the SIFT points of interest which, being also invariant to the change of scale, are all correct (zero intersections). Finally, figure (c) shows the tracking of the SIFT points for the whole sequence. We observe the correct correspondence of the points (the trajectories do not intersect each other), being invariant with respect to rotation, scale, and brightness variation.
There are different methods for finding the optimal correspondence between a set of points of interest, considering also the possibility that, due to the dynamics of the scene, some points of interest may not be present in the following image. To eliminate possible false correspondences and to reduce the time of the computational process of the correspondence, constraints can be considered, in analogy to what happens for stereo vision, where the epipolarity constraints are imposed, or, knowing the kinematics of the objects in motion, it is possible to predict the position of the points of interest. For example, the correspondences of significant point pairs can be considered to make the correspondence process more robust, placing constraints on the basis of the a priori knowledge available on the dynamics of the scene.
would be found (also because they are no longer visible) and are excluded for motion
analysis.
y_k = x_j + vΔt = x_j + c_jk (6.96)
where the vector c_jk can be seen as the vector connecting the points of interest x_j and y_k. This pair of homologous points has a good correspondence if the following condition is satisfied:
|x_j − y_k| ≤ c_max (6.97)
where c_max indicates the maximum displacement (disparity) of x_j in the time interval Δt, found in the next image with the homologous point y_k. Two pairs of homologous points (x_j, y_k) and (x_p, y_q) are declared consistent if they satisfy the following condition:
|c_jk − c_pq| ≤ const
where I_1(s, t) and I_2(s, t) indicate the pixels of the windows W_j and W_k, centered, respectively, on the points of interest x_j and y_k in the corresponding images. An initial estimate of the correspondence of a generic pair of points of interest (x_j, y_k), expressed in terms of probability P_jk^{(0)}, is
\[
P_{jk}^{(0)} = \frac{1}{1 + \alpha S_{jk}} \tag{6.99}
\]
where α is a positive constant. The probability P_jk^{(0)} is determined by considering the fact that a certain number of POIs have a good similarity, excluding those that are inconsistent in terms of similarity, for which a probability value equal to 1 − max(P_jk^{(0)}) can be assumed. The probabilities of the various possible matches are given by
\[
P_{jk} = \frac{P_{jk}^{(0)}}{\sum_{t=1}^{n} P_{jt}^{(0)}} \tag{6.100}
\]
where P_jk can be considered as the conditional probability that x_j has the homologous point y_k, normalized over the sum of the probabilities of all the other potential points {y_1, y_2, . . . , y_n}, excluding those found with inconsistent similarity. The essential steps of the complete algorithm for calculating the optical flow of two consecutive images of the sequence are the following:
1. Calculate the sets of points of interest A_1 and A_2, respectively, in the two consecutive images I_t and I_{t+Δt}.
2. Organize a data structure of the potential correspondences, for each point x_j ∈ A_1, with the points y_k ∈ A_2:
{x_j ∈ A_1, (c_j1, P_j1), (c_j2, P_j2), . . . , (β, γ)}   j = 1, 2, . . . , m
given by Eq. (6.98) once you have chosen the appropriate size W of the window
in relation to the dynamics of the scene.
4. Iteratively calculate the probability of matching of a point x_j with all the potential points (y_k, k = 1, . . . , n) as the weighted sum of all the correspondence probabilities for all consistent pairs (x_p, y_q), where the points x_p are in the vicinity of x_j (while the points y_q are in the vicinity of y_k) and the consistency of (x_p, y_q) is evaluated with respect to the pair (x_j, y_k). A quality measure Q_jk of the matching pair is given by
\[
Q_{jk}^{(s-1)} = \sum_{p} \sum_{q} P_{pq}^{(s-1)} \tag{6.101}
\]
where s is the iteration step, the index p refers to all the points x_p that are in the vicinity of the point of interest x_j being processed, and the index q refers to all the points y_q ∈ A_2 that form pairs (x_p, y_q) consistent with the pair (x_j, y_k) (points that are not consistent or with probability below a certain threshold are excluded).
5. Update the correspondence probabilities for each pair (x_j, y_k) as follows:
\[
\hat P_{jk}^{(s)} = \hat P_{jk}^{(s-1)} \big( a + b\, Q_{jk}^{(s-1)} \big)
\]
with a and b preset constants. The probability \hat P_{jk}^{(s)} is then normalized with the following:
\[
P_{jk}^{(s)} = \frac{\hat P_{jk}^{(s)}}{\sum_{t=1}^{n} \hat P_{jt}^{(s)}} \tag{6.102}
\]
6. Iterate steps (4) and (5) until the best match (x j , yk ) is found for all the points
examined x j ∈ A1 .
7. The vectors c jk constitute the velocity fields of the analyzed motion.
The selection of the constants a and b conditions the convergence speed of the algorithm, which normally converges after a few iterations. The algorithm can be sped up by eliminating correspondences whose initial probability values P_jk^{(0)} are below a certain threshold.
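A minimal sketch of the relaxation updates (6.100)–(6.102) follows; it assumes that the initial probabilities and, for each candidate pair, the list of consistent supporting pairs have already been computed, and the constants and names are illustrative.

```python
import numpy as np

def relax_matches(P0, consistent, a=0.3, b=3.0, n_iter=10):
    """Iterative relaxation of match probabilities (Eqs. 6.100-6.102).
    P0[j, k] is the initial probability that x_j matches y_k; consistent[j][k]
    is a list of (p, q) index pairs whose match supports (x_j, y_k)."""
    P = P0 / P0.sum(axis=1, keepdims=True)            # normalize rows (Eq. 6.100)
    for _ in range(n_iter):
        Q = np.zeros_like(P)
        for j in range(P.shape[0]):
            for k in range(P.shape[1]):
                Q[j, k] = sum(P[p, q] for p, q in consistent[j][k])  # support, Eq. (6.101)
        P_hat = P * (a + b * Q)                        # support-weighted update
        P = P_hat / P_hat.sum(axis=1, keepdims=True)   # renormalize (Eq. 6.102)
    return P
```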
Fig. 6.28 Tracking of the ball by detecting its dynamics using the information of its displacement
calculated in each image of the sequence knowing the camera’s frame rate. In the last image on the
right, the Goal event is displayed
calculating the position of its center of mass, independently in each image of the
sequence, also based on a prediction estimate of the expected position of the object.
In some real applications, the dynamics of the objects is known a priori. For example, in the tracking of a pedestrian or of entities (players, ball, . . .) in sporting events, the scenes are normally shot with appropriate cameras having a frame rate adequate for the intrinsic dynamics of these entities. For example, the tracking of the ball (in the game of football or tennis) initially requires locating it in an image of the sequence, analyzing the image entirely, but in subsequent images of the sequence its localization can be simplified by the knowledge of the motion model, which predicts its current position.
This tracking strategy reduces the search time and improves, with the prediction
of the dynamics, the estimate of the position of the object, normally influenced by
the noise. In object tracking, the camera is stationary and the object must be visible
(otherwise the tracking procedure must be reinitialized to search for the object in
a sequence image, as we will see later), and in the sequence acquisition phase the
object–camera geometric configuration must not change significantly (normally the
object moves with lateral motion with respect to the optical axis of the stationary
camera). Figure 6.28 shows the tracking context of the ball in the game of football
to detect the Goal–NoGoal event.7 The camera continuously acquires sequences of
images and as soon as the ball appears in the scene a ball detection process locates
it in an image of the sequence and begins a phase of ball tracking in the subsequent
images.
Using the model of the expected movement, it is possible to predict where the
ball-object is located in the next image. In this context, the Kalman filter can be
used considering the dynamics of the event that presents uncertain information (the
ball can be deviated) and it is possible to predict the next state of the ball. Although
in reality, external elements interfere with the predicted movement model, with the
Kalman filter one is often able to understand what happened. The Kalman filter is
ideal for continuously changing dynamics. A Kalman filter is an optimal estima-
tor, i.e., it highlights parameters of interest from indirect, inaccurate, and uncertain
7 The Goal–NoGoal event, according to FIFA regulations, occurs when the ball entirely passes the ideal vertical plane parallel to the goal, passing through the inner edge of the horizontal white line separating the playing field from the inner area of the goal itself.
observations. It operates recursively by evaluating the next state based on the pre-
vious state and does not need to keep the historical data of the event dynamics. It
is, therefore, suitable for real-time implementations, and therefore strategic for the
tracking of high-speed objects.
It is an optimal estimator in the sense that if the noise of the data of the problem
is Gaussian, the Kalman filter minimizes the mean square error of the estimated
parameters. If the noise were not Gaussian (that is, for data noise only the average
and standard deviation are known), the Kalman filter is still the best linear estimator
but nonlinear estimators could be better. The word filter must not be associated with its most common meaning, of removing frequencies from a signal, but must be understood as the process that finds the best estimate from noisy data, or that filters (attenuates) the noise.
Now let’s see how the Kalman filter is formulated [8,9]. We need to define the
state of a deterministic discrete dynamic system, described by a vector with the
smallest possible number of components, which completely synthesizes the past of
the system. The knowledge of the state allows theoretically to predict the dynamics
and future (and previous) states of the deterministic system in the absence of noise.
In the context of the ball tracking, the state of the system could be described by the
vector x = (p, v), where p = (x, y) and v = (vx , v y ) indicate the position of
the center of mass of the ball in the images and the ball velocity, respectively. The
dynamics of the ball is simplified by assuming constant velocity during the tracking
and neglecting the effect of gravity on the motion of the ball.
This velocity is initially estimated by knowing the camera’s frame rate and evalu-
ating the displacement of the ball in a few images of the sequence, as soon as the ball
appears in the field of view of the camera (see Fig. 6.28). But nothing is known about
unforeseeable external events (such as wind and player deviations) that can change
the motion of the ball. Therefore, the next state is not determined with certainty and
the Kalman filter assumes that the state variables p and v may vary randomly with the
Gaussian distribution characterized by the mean μ and variance σ 2 which represents
the uncertainty. If the prediction is maintained, knowing the previous state, we can
estimate in the next image where the ball would be, given by:
p_t = p_{t−1} + Δt · v
where t indicates the current state associated with the t-th image of the sequence, t−1 indicates the previous state, and Δt indicates the time elapsed between two adjacent images, defined by the camera's frame rate. In this case, the two quantities, position and velocity of the ball, are correlated.
Therefore, in every time interval, we have that the state changes from xt−1 to
xt according to the prediction model and according to the new observed measures
zt evaluated independently from the prediction model. In this context, the observed
measure zt = (x, y) indicates the position of the center of mass of the ball in the
image tth of the sequence. We can thus evaluate a new measure of the tracking state
(i.e., estimate the new position and speed of the ball, processing the tth image of the
sequence directly). At this point, with the Kalman filter one is able to estimate the
current state x̂_t by filtering out the uncertainties (generated by the measurements z_t and/or by the prediction model x_t) optimally with the following equation:
where K_t is a coefficient called the Kalman Gain and t indicates the current state of the system. In this case, t is used as an index of the images of the sequence but in substance it has the meaning of discretizing time, such that t = 1, 2, . . . indicates t · Δt ms (milliseconds), where Δt is the constant time interval between two successive images of the sequence (defined by the frame rate of the camera).
From (6.103), we observe that with the Kalman filter the objective is to estimate optimally the state at time t, filtering through the coefficient K_t (which is the unknown of the equation) the intrinsic uncertainties deriving from the prediction estimated from the state x̂_{t−1} and from the new observed measures z_t. In other words, the Kalman filter behaves like a data fusion process (prediction and observation), optimally filtering the noise of those data. The key to this optimal process is reduced to the calculation of K_t in each state of the process.
Let us now analyze how the Kalman filter mechanism, useful for the tracking,
realizes this optimal fusion between the assumption of state prediction of the system,
the observed measures and the correction proposed by the Kalman Eq. (6.103). The
state of the system at time t is described by the random variable xt , which evolves
from the previous state t-1 according to the following linear equation:
xt = Ft · xt−1 + Gt ut + ε t (6.104)
where
– x_t is a state vector of size n_x whose components are the variables that characterize the system (for example, position, velocity, . . .). Each variable is normally assumed to have a Gaussian distribution N(μ, Σ) with mean μ, which is the center of the random distribution, and covariance matrix Σ of size n_x × n_x. The correlation information between the state variables is captured by the (symmetric) covariance matrix Σ, of which each element Σ_ij represents the level of correlation between the i-th and j-th variables. For the example of the ball tracking, the two variables p and v would be correlated, considering that the new position of the ball depends on the velocity, assuming no external influence (gravity, wind, . . .).
– F_t is the transition matrix that models the system prediction, that is, it is the state transition matrix that applies the effect of each system state parameter at time t−1 to the system state at time t (therefore, for the example considered, the position and velocity at time t−1 both influence the new ball position at time t). It should be noted that the components of F_t are assumed to be constant in the changes of state of the system, even if in reality they can change (for example, the variables of the system deviate from the hypothesized Gaussian distribution). This last situation is not a problem since we will see that the Kalman filter will converge toward a correct estimate even if the distribution of the system variables deviates from the Gaussian assumption.
– u_t is an input vector of size n_u whose components are the system control input parameters that influence the state vector x_t. For the ball tracking problem, if we wanted to consider the effect of the gravitational field, we should include in the state vector x_t also the vertical component y of the position p = (x, y), which depends on the acceleration −g according to the gravitational motion equation y = ½gt².
– G_t is the control matrix associated with the input parameters u_t.
– ε_t is the vector (normally unknown, it represents the uncertainty of the system model) that includes the process noise terms for each parameter associated with the state vector x_t. It is assumed that the process noise has a multivariate normal distribution with zero mean (white noise process) and with covariance matrix Q_t = E[ε_t ε_t^T].
Equation (6.104) defines a linear stochastic process, where each value of the state
vector xt is a linear combination of its previous value xt−1 plus the value of the
control vector ut and the process noise εt .
The equation associated with the observed measurements of the system, acquired in each state t, is given by
z_t = H_t · x_t + η_t (6.105)
where
– z_t is the vector of the observed measurements;
– H_t is the transformation (measurement) matrix that maps the state vector x_t into the space of the measurements;
– η_t is the vector of the measurement noise, assumed Gaussian with zero mean and covariance matrix R_t = E[η_t η_t^T].
It should be noted that ε_t and η_t are independent variables, and therefore, the uncertainty on the prediction model does not depend on the uncertainty on the observed measures and vice versa.
Returning to the example of the ball tracking, under the hypothesis of constant velocity, the state vector x_t is given by the velocity v_t and by the horizontal position x_t = x_{t−1} + Δt·v_{t−1}, as follows:
\[
x_t = \begin{bmatrix} x_t \\ v_t \end{bmatrix} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ v_{t-1} \end{bmatrix} + \varepsilon_t = F_{t-1} x_{t-1} + \varepsilon_t \tag{6.106}
\]
For simplicity, the dominant dynamics of lateral motion with respect to the optical axis of the camera was considered, thus considering only the position p = (x, 0) along the horizontal x-axis of the image plane (the height of the ball is neglected, that is, the y-axis as shown in Fig. 6.28). If instead we also consider the influence of gravity on the motion of the ball, two additional variables must be added to the state vector x to indicate the vertical fall motion component. These additional variables are the vertical position y of the ball and the velocity v_y of free fall of the ball along the y axis. The vertical position is given by
\[
y_t = y_{t-1} - \Delta t\, v_{y_{t-1}} - \frac{g(\Delta t)^2}{2}
\]
where g is the gravitational acceleration. Now let us indicate with v_x the horizontal speed of the ball, which we assume constant taking into account the high acquisition frame rate (Δt = 2 ms for a frame rate of 500 fps). In this case, the state vector results in x_t = (x_t, y_t, v_{x_t}, v_{y_t})^T and the linear Eq. (6.104) becomes
\[
x_t = \begin{bmatrix} x_t \\ y_t \\ v_{x_t} \\ v_{y_t} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ v_{x_{t-1}} \\ v_{y_{t-1}} \end{bmatrix} + \begin{bmatrix} 0 \\ -\frac{(\Delta t)^2}{2} \\ 0 \\ \Delta t \end{bmatrix} g + \varepsilon_t = F_{t-1} x_{t-1} + G_t u_t + \varepsilon_t \tag{6.107}
\]
with the control variable ut = g. For the ball tracking, the observed measurements
are the (x, y) coordinates of the center of mass of the ball calculated in each image
of the sequence through an algorithm of ball detection [23,24]. Therefore, the vector
of measures represents the coordinates of the ball z = [x y]T and the equation of
the observed measures, according to Eq. (6.105), is given by
\[
z_t = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} x_t + \eta_t = H_t x_t + \eta_t \tag{6.108}
\]
Having defined the problem of the tracking of an object, we are now able to adapt it
to the Kalman filter model. This model involves two distinct processes (and therefore
two sets of distinct equations): update of the prediction and update of the observed
measures.
The equations for updating the prediction are
\[
\hat x_{t|t-1} = F_t\, \hat x_{t-1|t-1} + G_t u_t \tag{6.109}
\]
\[
P_{t|t-1} = F_t\, P_{t-1|t-1}\, F_t^T + Q_t \tag{6.110}
\]
Equation (6.109), derived from (6.104), computes an estimate x̂_{t|t−1} of the current state t of the system on the basis of the previous state values at t−1 with the prediction matrices F_t and G_t (provided with the definition of the problem), assuming known and Gaussian the distributions of the state variable x̂_{t−1|t−1} and of the control u_t. Equation (6.110) updates the covariance matrix of the system state prediction, knowing the covariance matrix Q_t associated with the noise ε of the input control variables.
The variance associated with the prediction x̂t|t−1 of an unknown real value xt is
given by
P t|t−1 = E[(xt − x̂t|t−1 )(xt − x̂t|t−1 )T ],
8 Let us better specify how the uncertainty of a stochastic variable is evaluated, which we know to be its variance, in order to motivate Eq. (6.110). In this case, we are initially interested in evaluating the uncertainty of the state vector prediction x̂_{t−1}, which, being multidimensional, is given by its covariance matrix P_{t−1} = Cov(x̂_{t−1}). Similarly, the uncertainty of the next value of the prediction vector x̂_t at time t, after the transformation F_t obtained with (6.109), is given by
μ_y = E[y] = E[Ax + b] = A E[x] + b = A μ_x + b
with R_t being the covariance matrix associated with the noise of the observed measurements z_t (normally known from the uncertainty of the measurements of the sensors used), while H_t P_{t|t−1} H_t^T is the covariance matrix of the measures that captures the uncertainty of the previous prediction state (characterized by the covariance matrix P_{t|t−1}) propagated onto the expected measures, through the transformation matrix H_t provided by the model of the measures, according to Eq. (6.105).
At this point, the matrices R and Q remain to be determined, starting from the initial values of x_0 and P_0, thus starting the iterative process of updating the state (prediction) and updating the observed measures (state correction), as shown in Fig. 6.29. We will now analyze the various phases of the iterative process of the Kalman filter reported in the diagram of this figure.
In the prediction phase, step 1, an a priori estimate x̂_{t|t−1} is calculated, which is in fact a rough estimate made before the observation of the measurements z_t, that is, before the correction phase (measurement update). In step 2, the a priori covariance matrix of the propagated errors P_{t|t−1} is computed with respect to the previous state t − 1. These values are then used in the update equations of the observed measurements.
In the correction phase, the state vector x_t of the system is estimated by combining the a priori information with the measurements observed at the current time t, thus obtaining a better, corrected estimate of x̂_t and of P_t. These values are necessary in the prediction/correction cycle for the future estimate at time t + 1.
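As a compact illustration of the two phases just described, the sketch below implements one generic prediction/correction cycle with the standard Kalman equations referenced in the text (prediction as in (6.109)–(6.110), correction as in (6.128)–(6.130)); the function name and the NumPy formulation are illustrative choices, not the authors' implementation.

```python
import numpy as np

def kalman_step(x_prev, P_prev, z, F, G, u, Q, H, R):
    """One prediction/correction cycle of the Kalman filter."""
    # Prediction (a priori estimate), Eqs. (6.109)-(6.110)
    x_pred = F @ x_prev + G @ u                # \hat{x}_{t|t-1}
    P_pred = F @ P_prev @ F.T + Q              # P_{t|t-1}
    # Correction (measurement update), Eqs. (6.128)-(6.130)
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain K_t
    x_corr = x_pred + K @ (z - H @ x_pred)     # state corrected with the residual
    P_corr = P_pred - K @ H @ P_pred           # corrected covariance
    return x_corr, P_corr
```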
Returning to the example of the ball tracking, the Kalman filter is used to predict
the region where the ball would be in each image of the sequence, acquired in real
time with a high frame rate considering that the speed of the ball can reach 120 km/ h.
The initial state at t_0 is set as soon as the ball appears in an image of the sequence; only initially is the ball searched for over the entire image (normally HD, with a resolution of 1920 × 1080). The initial speed v_0 of the ball is estimated by processing multiple adjacent images of the sequence before triggering the tracking process.
Fig. 6.30 Diagram of the error filtering process between two successive states. a The position of the
ball at the time t1 has an uncertain prediction (shown by the bell-shaped Gaussian pdf whose width
indicates the level of uncertainty given by the variance) since it is not known whether external
factors influenced the model of prediction. b The position of the ball is shown by the measure
observed at time t1 with a level of uncertainty due to the noise of the measurements, represented by
the Gaussian pdf of the measurements. Combining the uncertainty of the prediction model and the
measurement one, that is, multiplying the two pdfs (prediction and measurements), there is a new
filtered position measurement obtaining a more precise measurement of the position in the sense
of the Kalman filter. The uncertainty of the filtered measurement is given by the third Gaussian pdf
shown
The accuracy of the position and of the initial speed of the ball is reasonably known, estimated in relation to the ball detection algorithm [24–26].
The next state of the ball (epoch t = 1) is estimated by the prediction update
Eq. (6.109) which for the ball tracking is reduced to Eq. (6.106) excluding the influ-
ence of gravity on the motion of the ball. In essence, in the tracking process there is
no control variable ut to consider in the prediction equation, and the position of the
ball is based only on the knowledge of the state x_0 = (x_0, v_0) at time t_0, and therefore with the uncertainty given by the Gaussian distribution x_t ∼ N(F_t x_{t−1}; Σ). This
uncertainty is due only to the calculation of the position of the ball which depends on
the environmental conditions of acquisition of the images (for example, the lighting
conditions vary between one state and the next). Furthermore, it is reasonable to
assume less accuracy in predicting the position of the ball at the time t1 compared to
the time t0 due to the noise that we propose to filter with the Kalman approach (see
Fig. 6.30a).
At the time t1 we have the measure observed on the position of the ball acquired
from the current image of the sequence which, for this example, according to
Eq. (6.105) of the observed measures, results:
$$ z_t = H_t\,\mathbf{x}_t + \eta_t = \begin{bmatrix} 1 & 0 \end{bmatrix}\begin{bmatrix} x_t \\ v_t \end{bmatrix} + \eta_t \qquad (6.115) $$
where we assume Gaussian noise η_t ∼ N(0; σ_{η_t}^2). With the observed measurement z_t,
we have a further measure of the position of the ball whose uncertainty is given by
the distribution z t ∼ N(μz ; σz2 ). An optimal estimate of the position of the ball is
obtained by combining that of the prediction x̂ t|t−1 and that of the observed measure
z t . This is achieved by multiplying the two Gaussian distributions together. The
product of two Gaussians is still a Gaussian (see Fig. 6.30b). This is fundamental
These equations are the update equations at the base of the prediction/correction process of the Kalman filter, rewritten according to the symbolism of the iterative process; we have
$$ \mu_{\hat{x}_{t|t}} = \frac{\mu_{\hat{x}_{t|t-1}}\,\sigma_{z_t}^2 + \mu_{z_t}\,\sigma_{x_{t|t-1}}^2}{\sigma_{x_{t|t-1}}^2 + \sigma_{z_t}^2} = \mu_{\hat{x}_{t|t-1}} + \underbrace{\frac{\sigma_{x_{t|t-1}}^2}{\sigma_{x_{t|t-1}}^2 + \sigma_{z_t}^2}}_{\textit{Kalman Gain}}\big(\mu_{z_t} - \mu_{\hat{x}_{t|t-1}}\big) \qquad (6.118) $$
By indicating with k the Kalman Gain, the previous equations are thus simplified:
The Kalman Gain and Eqs. (6.120) and (6.121) can be rewritten in matrix form to handle the multidimensional Gaussian distributions N(μ; Σ), given by
$$ K = \frac{\Sigma_{x_{t|t-1}}}{\Sigma_{x_{t|t-1}} + \Sigma_{z_{t|t}}} \qquad (6.122) $$
Finally, we can derive the general equations of prediction and correction in matrix form. This is possible considering the distribution of the prediction measurements x̂_t, given by (μ_{x̂_{t|t−1}}; Σ_{x_{t|t−1}}) = (H_t x̂_{t|t−1}; H_t P_{t|t−1} H_t^T), and the distribution of the observed measurements z_t, given by (μ_{z_{t|t}}; Σ_{z_{t|t}}) = (z_{t|t}; R_t). Replacing these values of the prediction and correction distributions in (6.123), in (6.124), and in (6.122), we get
$$ H_t\,\hat{\mathbf{x}}_{t|t} = H_t\,\hat{\mathbf{x}}_{t|t-1} + K\,(\mathbf{z}_t - H_t\,\hat{\mathbf{x}}_{t|t-1}) \qquad (6.125) $$
$$ H_t\,P_{t|t}\,H_t^T = H_t\,P_{t|t-1}\,H_t^T - K\,H_t\,P_{t|t-1}\,H_t^T \qquad (6.126) $$
$$ K = \frac{H_t\,P_{t|t-1}\,H_t^T}{H_t\,P_{t|t-1}\,H_t^T + R_t} \qquad (6.127) $$
We can now delete H_t from the front of each term of the last three equations (remembering that one is hidden in the expression of K), and H_t^T from Eq. (6.126); we finally get the following update equations:
$$ \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t\,(\mathbf{z}_t - H_t\,\hat{\mathbf{x}}_{t|t-1}) \qquad (6.128) $$
$$ P_{t|t} = P_{t|t-1} - K_t\,H_t\,P_{t|t-1} \qquad (6.129) $$
$$ K_t = \frac{P_{t|t-1}\,H_t^T}{H_t\,P_{t|t-1}\,H_t^T + R_t} \qquad (6.130) $$
Equation (6.128) calculates, for the time t, the best new estimate of the state vector x̂_{t|t} of the system, combining the prediction estimate x̂_{t|t−1} (calculated with (6.109)) with the residual (also known as innovation), given by the difference between the observed measurements z_t and the expected measurements ẑ_{t|t} = H_t x̂_{t|t−1}.
We highlight (for Eq. 6.128) that the measurements residual is weighted by the
Kalman gain K t , which establishes how much importance to give the r esidual
with respect to the predicted estimate x̂ t|t−1 . We also sense the importance of K
in filtering the r esidual. In fact, from Eq. (6.127), we observe that the value of K
$$ \hat{\mathbf{x}}_{t|t-1} = F\,\hat{\mathbf{x}}_{t-1|t-1} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}\begin{bmatrix} \hat{x}_{t-1|t-1} \\ v_{t-1} \end{bmatrix} = \begin{bmatrix} \hat{x}_{t-1|t-1} + v_{t-1}\,\Delta t \\ v_{t-1} \end{bmatrix}, $$
where the measurement matrix H has only one nonzero element, since only the horizontal position is measured, while H(2) = 0 because the speed is not measured. The measurement noise η is controlled by the covariance matrix R, which in this case is a scalar r associated only with the measurement x_t. The uncertainty of the measurements is modeled as Gaussian noise with r = σ_m², assuming constant variance during the update process.
The simplified update equations are
$$ K_t = \frac{P_{t|t-1}H^T}{H\,P_{t|t-1}\,H^T + r} \qquad\quad P_{t|t} = P_{t|t-1} - K_t\,H\,P_{t|t-1} \qquad\quad \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t\,(z_t - H\,\hat{\mathbf{x}}_{t|t-1}) $$
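The simplified filter can be reproduced with a short self-contained simulation; the sketch below uses the values quoted in the text (80 km/h, σ_s = 10 mm, σ_m = 20 mm, 500 fps, initial speed wrong by 50%), while the synthetic measurement generation and the random seed are illustrative assumptions.

```python
import numpy as np

fps = 500.0
dt = 1.0 / fps
v_true = 2.2222e4                      # 80 km/h in mm/s
sigma_s, sigma_m = 10.0, 20.0          # initial state std and measurement noise std (mm)

F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity model, state (x, v)
H = np.array([[1.0, 0.0]])             # only the position is measured
Q = np.zeros((2, 2))                   # zero process noise, as assumed in the text
r = sigma_m ** 2                       # scalar measurement variance

rng = np.random.default_rng(0)
x_true = np.array([0.0, v_true])
x_est = np.array([0.0, 0.5 * v_true])  # initial speed wrong by 50%
P = np.diag([sigma_s ** 2, (0.5 * v_true) ** 2])

for _ in range(int(2.5 * fps)):
    x_true = F @ x_true                                  # simulate the ball
    z = H @ x_true + rng.normal(0.0, sigma_m, 1)         # noisy position measurement
    x_est, P = F @ x_est, F @ P @ F.T + Q                # predict
    K = P @ H.T / (H @ P @ H.T + r)                      # gain (2x1)
    x_est = x_est + (K * (z - H @ x_est)).ravel()        # correct the state
    P = P - K @ H @ P                                    # correct the covariance
print("estimated speed (mm/s):", x_est[1])
```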
Figure 6.31 shows the results of the Kalman filter for the ball tracking considering
only the motion in the direction of the x-axis. The ball is assumed to have a speed of
80 km/h = 2.2222 · 10^4 mm/s. The uncertainty of the motion model (with constant speed) is assumed to be null (covariance matrix Q = 0), and any slowing down or acceleration of the ball is treated as noise. The covariance matrix P_t tracks the error due to the process at each time t and indicates whether more weight should be given to the new measurement or to the model estimate, according to Eq. (6.129).
In this example, assuming a zero-noise motion model, the state of the system is controlled by the variances of the state variables reported on the main diagonal of P_t. Previously, we indicated with σ_s² the variance (error) of the variable x_t; the confidence matrix of the model P_t is predicted, at every time, from its previous value through (6.110). The Kalman filter results shown in the figure are obtained with an initial value of σ_s = 10 mm. The measurement noise is instead characterized by a standard deviation σ_m = 20 mm (note that in this example the units of the state variables and of the observed measurements are homogeneous, expressed in mm).
The Kalman filter was applied with an initial speed value wrong by 50% (graphs in the first row) and by 20% (graphs in the second row) with respect to the real one. In the figure, it is observed that the filter nevertheless converges toward the real value of the speed, although at different epochs depending on the initial error. Convergence occurs through the filtering of the errors (of the model and of the measurements), and it is instructive to analyze the qualitative trend of the matrix P and of the gain K (which asymptotically tends to a minimum value) as the variances σ_s² and σ_m² vary. In general, if the variables are initialized with sensible values, the filter converges faster.
Fig. 6.31 Kalman filter results for ball tracking considering only the dominant horizontal motion (x-axis) and neglecting the effect of gravity. The first column shows the graphs of the estimated position of the ball, while the second column shows the estimated velocities. The two rows refer to two different initial velocity conditions (in the first row the initial velocity error is very high, 50%, while in the second row it is 20%); it can be observed that after a short time the initial speed error is quickly filtered out by the Kalman filter
If the model corresponds well to the real situation, the state of the system is well updated despite the presence of measurements observed with considerable error (for example, 20–30% error).
If instead the model does not reproduce the real situation well, even with low-noise measurements the state of the system drifts with respect to the true measurements. If the model is poorly defined, there will be no good estimate. In this case, it is worth making the model weigh less by increasing its estimated error. This allows the Kalman filter to rely more on the measurement values while still removing part of the noise. In essence, it is convenient to act on the measurement error η and verify the effects on the system. Finally, it should be noted that the gain K gives greater weight to the observed measurements when it has high values and, on the contrary, weighs the prediction model more when it has low values.
In real applications, the filter does not always achieve the optimality conditions
provided by the theory, but the filter is used anyway, giving acceptable results, in
various tracking situations, and in general, to model the dynamics of systems based
on prediction/correction to minimize the covariance of the estimated error. In the case of ball tracking it is essential, in order to optimally predict the ball position, to significantly reduce the ball search region and consequently the search time of the ball detection algorithm (essential in this application context, which requires processing several hundred images per second). For
nonlinear dynamic models or nonlinear measurement models, the Extended Kalman
Filter (EKF)[9,27] is used, which solves the problem (albeit not very well) by apply-
ing the classical Kalman filter to the linearization of the system around the current
estimate.
6.5 Motion in Complex Scenes

In this section, we will describe some algorithms that continuously (in real time)
detect the dynamics of the scene characterized by the different entities in motion and
by the continuous change of environmental conditions. This is the typical situation
that arises for the automatic detection of complex sporting events (soccer, basketball,
tennis, . . .) where the entities in motion can reach high speeds (even hundreds of
km/h) in changing environmental conditions and the need to detect the event in real
time. For example, the automatic detection of the offside event in football would require the simultaneous tracking of different entities, each recognized by the class to which it belongs (ball, the two goalkeepers, players of team A and team B, referee and assistants), in order to process the event data (for example, the player who has the ball, his position and that of the other players at the moment he hits the ball, the player who receives the ball) and to make the decision in real time (in a few seconds) [3].
In the past, technological limits of vision and processing systems prevented the
possibility of realizing vision machines for the detection, in real time, of such complex
events under changing environmental conditions. Many of the traditional algorithms
of motion analysis and object recognition fail in these operational contexts. The goal
is to find robust and adaptive solutions (with respect to changing light conditions,
recognizing dynamic and static entities, and arbitrary complex configurations of mul-
tiple moving entities) by choosing algorithms with adequate computational complex-
ity and immediate operation. In essence, algorithms are required that automatically
learn the initial conditions of the operating context (without manual initialization)
and automatically learn how the conditions change.
Several solutions are reported in the literature (for tracking people, vehicles, . . .), which are essentially based on fast and approximate methods such as background subtraction (BS), the most direct way to detect and trace the motion of the moving entities of a scene observed by stationary vision machines (with frame rates even higher than the standard 25 fps) [28–30]. Basically, the BS methods label the “dynamic pixels” at time t, i.e., the pixels whose gray-level or color information changes significantly with respect to the pixels belonging to the background. This simple and fast method, valid in the context of a stationary camera, is not always valid, especially when the
light conditions change and in all situations when the signal-to-noise ratio becomes
unacceptable, also due to the noise inherent to the acquisition system.
This has led to the development of background models also based on statistical approaches [29,31] to mitigate the instability of simple BS approaches. These new BS methods must not only robustly model the noise of the acquisition systems but must also adapt to rapid changes in environmental conditions. Another strategic aspect concerns the management of shadows and of temporary occlusions of the moving entities with respect to the background. There is also the need to handle the difference between recorded video sequences (broadcast video) and video sequences acquired in real time, which has an impact on the types of BS algorithms to be used. We now describe the most common BS methods.
Several BS methods are proposed that are essentially based on the assumption that
the sequence of images is acquired in the context of a stationary camera that observes
a scene with stable background B with respect to which one sees objects in motion
that normally have an appearance (color distribution or gray levels) distinguishable
from B. These moving pixels represent the foreground, or regions of interest of
ellipsoidal or rectangular shape (also known as blob, bounding box, cluster, . . .). The
general strategy, which distinguishes the pixels of moving entities (vehicle, person,
. . .) from the static ones (unchangeable intensity) belonging to the background, is
shown in Fig. 6.32. This strategy involves the continuous comparison between the
current image and the background image, the latter appropriately updated through a
model that takes into account changes in the operating context. A general expression
that for each pixel (x, y) evaluates this comparison between current background B
and image I t at time t is the following:
$$ D(x,y) = \begin{cases} 1 & \text{if } d[I_t(x,y),\,B(x,y)] > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (6.131) $$
The background is then updated with the last image acquired, B_t(x,y) = I_t(x,y), to be able to reapply (6.132) to the subsequent images. When a moving object is detected and then stops, with this simple BS method the object disappears from D_{t,τ}. Furthermore, it is difficult to detect and recognize the object when its dominant motion is not lateral (for example, if it moves away from or gets closer to the camera). The results of this simple method depend very much on the threshold value adopted, which can be chosen manually or automatically by previously analyzing the histograms of both background and object images.
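A minimal implementation of this basic scheme, with OpenCV used only for I/O and color conversion, could look as follows (the file name and the threshold value are hypothetical; the background is naively replaced by the last acquired image, as described above):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("scene.avi")    # hypothetical input sequence
tau = 30.0                             # global threshold on gray levels, chosen empirically

ok, frame = cap.read()
background = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Eq. (6.131): foreground where the distance from the background exceeds tau
    mask = (np.abs(gray - background) > tau).astype(np.uint8) * 255
    background = gray                  # naive update: the background becomes the last image
    cv2.imshow("foreground mask", mask)
    if cv2.waitKey(1) == 27:           # ESC to quit
        break
```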
A first step to get a more robust background is to consider the average or the median
of the n previous images. In essence, an attempt is made to attenuate the noise present
in the background due to the small movement of objects (leaves of a tree, bushes,
. . .) that are not part of the moving objects of interest. A filtering operation based on the mean or the median (see Sect. 9.12.4 Vol. I) of the n previous images is applied. With the method of the mean, the background is modeled with the arithmetic mean of the n images kept in memory:
$$ B_t(x,y) = \frac{1}{n}\sum_{i=0}^{n-1} I_{t-i}(x,y) \qquad (6.133) $$
where n is closely related to the acquisition frame rate and object speed.
Similarly, the background can be modeled with the median filter applied, for each pixel, to all the temporarily stored images. In this case, it is assumed that each pixel has a high probability of belonging to the background, i.e., that it remains background for more than half of the n stored images.
The mask image, for both the mean and the median, is
$$ D_{t,\tau}(x,y) = \begin{cases} 1 & \text{if } |I_t(x,y) - B_t(x,y)| > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (6.135) $$
Appropriate n and frame rate values produce a correct updated background and a
realistic foreground mask of moving objects, with no phantom or missing objects.
These methods are among the nonrecursive adaptive updating techniques of the
background, in the sense that they depend only on the images stored and maintained
in the system at the moment. Although easy to implement and fast, they have the drawbacks of nonadaptive methods, meaning that they can only be used for short-term tracking without significant changes to the scene. When an error occurs, it is necessary to reinitialize the background, otherwise the errors accumulate over time. They also require an adequate memory buffer to keep the last n images acquired. Finally, the choice of the global threshold can be problematic.
A more adaptive alternative updates the background recursively as an exponentially weighted running average of the incoming images:
$$ B_t(x,y) = (1-\alpha)\,B_{t-1}(x,y) + \alpha\,I_t(x,y) \qquad (6.136) $$
where the parameter α, seen as a learning parameter (it assumes a value between 0.01 and 0.05), models the update of the background B_t(x,y) at time t, weighing the previous value B_{t−1}(x,y) and the current value of the image I_t(x,y). In essence, the current image is blended into the model image of the background via the parameter α. If α = 0, (6.136) reduces to B_t(x,y) = B_{t−1}(x,y), the background remains unchanged, and the mask image is calculated by the simple subtraction method (6.133). If instead α = 1, (6.136) reduces to B_t(x,y) = I_t(x,y), producing the simple difference between images.
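A sketch of this exponential update, using OpenCV's accumulateWeighted (which computes exactly the running average of Eq. (6.136)), is shown below; the file name, α, and τ are illustrative choices:

```python
import cv2
import numpy as np

alpha, tau = 0.02, 30.0                        # learning rate in the suggested 0.01-0.05 range

cap = cv2.VideoCapture("scene.avi")            # hypothetical input sequence
ok, frame = cap.read()
background = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    mask = (np.abs(gray - background) > tau).astype(np.uint8) * 255
    # Eq. (6.136): B_t = (1 - alpha) B_{t-1} + alpha I_t
    cv2.accumulateWeighted(gray, background, alpha)
```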
This method [29] approximates the distribution of the values of each pixel over the last n images with a (unimodal) Gaussian probability density function. Therefore, in the hypothesis of gray-level images, two maps are maintained, one for the mean and one for the standard deviation. In the initialization phase, the two background maps μ_t(x,y) and σ_t(x,y) are created to characterize each pixel with
its own pdf, with the parameters given, respectively, by the mean μ_t(x,y) and the variance σ_t²(x,y). The maps of the original background are initialized by acquiring images of the scene without moving objects and calculating the mean and the variance for each pixel. To manage the changes of the existing background, due to the variations of the ambient light conditions and to the motion of the objects, the two maps are updated at each pixel, for each current image at time t, by calculating the moving average and the corresponding moving variance, given by
$$ \mu_t(x,y) = (1-\alpha)\,\mu_{t-1}(x,y) + \alpha\,I_t(x,y) \qquad\quad \sigma_t^2(x,y) = (1-\alpha)\,\sigma_{t-1}^2(x,y) + \alpha\,d^2(x,y) \qquad (6.137) $$
with
$$ d(x,y) = |I_t(x,y) - \mu_t(x,y)| \qquad (6.138) $$
where d indicates the Euclidean distance between the current value I_t(x,y) of the pixel and its mean, and α is the learning parameter of the background update model. Normally α = 0.01 and, as evidenced by (6.137), it tends to give little weight to the value of the current pixel I_t(x,y) if this is classified as foreground, to avoid merging it into the background. Conversely, for a pixel classified as background, the value of α should be chosen based on the need for stability (lower value) or fast update (higher value). Therefore, the current mean at each pixel is updated as a weighted average of the previous values and of the current value of the pixel. It is observed that with the adaptive process given by (6.137) the values of the mean and of the variance of each pixel are accumulated recursively, requiring little memory and allowing a high execution speed.
The pixel classification is performed by evaluating the absolute value of the dif-
ference between the current value and the current average of the pixel with respect
to a confidence value of the threshold τ as follows:
$$ I_t = \begin{cases} \textit{Foreground} & \text{if } \dfrac{|I_t(x,y) - \mu_t(x,y)|}{\sigma_t(x,y)} > \tau \\[2mm] \textit{Background} & \text{otherwise} \end{cases} \qquad (6.139) $$
The value of the threshold depends on the context (good results can be obtained with τ = 2.5), and it is normally expressed as a multiple k of the standard deviation, τ = kσ. This method is easily applicable also to color images [29] or multispectral images, maintaining two background maps for each color or spectral channel. It has been successfully applied in indoor scenarios, with the exception of cases with a multimodal background distribution.
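A compact NumPy sketch of this unimodal model (per-pixel running mean and variance, with the classification rule of Eq. (6.139)) could be written as follows; it is an illustrative reconstruction, with α and τ set to the values suggested in the text:

```python
import numpy as np

def init_model(frames):
    """Initialize the per-pixel mean and variance maps from object-free frames."""
    stack = np.stack(frames).astype(np.float32)
    return stack.mean(axis=0), stack.var(axis=0) + 1e-6

def classify_and_update(I, mu, var, alpha=0.01, tau=2.5):
    """Classify each pixel (Eq. 6.139) and update the background maps selectively."""
    I = I.astype(np.float32)
    foreground = np.abs(I - mu) / np.sqrt(var) > tau
    upd = ~foreground                     # update only pixels classified as background
    d2 = (I - mu) ** 2
    mu[upd] = (1 - alpha) * mu[upd] + alpha * I[upd]      # moving average
    var[upd] = (1 - alpha) * var[upd] + alpha * d2[upd]   # moving variance
    return foreground, mu, var
```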
background. The direct formula for classifying a pixel between moving object and
background is the following:
$$ I_t = \begin{cases} \textit{Foreground} & \text{if } |I_t(x,y) - I_{t-1}(x,y)| > \tau \quad \text{(no background update)} \\ \textit{Background} & \text{otherwise} \end{cases} \qquad (6.140) $$
The methods based on the moving average, seen in the previous paragraphs, are actually also selective.
So far methods have been considered where the background update model is based
on the recent pixel history. Only with the Gaussian moving average method was the
background modeled with the statistical parameters of average and variance of each
pixel of the last images with the assumption of a unimodal Gaussian distribution. No
spatial correlation was considered with the pixels in the vicinity of the one being pro-
cessed. To handle more complex application contexts where the background scene
includes structures with small movements not to be regarded as moving objects (for
example, small leaf movements, trees, temporarily generated shadows, . . .) differ-
ent methods have been proposed based on models of background with multimodal
Gaussian distribution [31].
In this case, rather than modeling the values of all pixels with a single type of distribution, the value of each pixel is modeled over time as a stochastic process. The method determines which Gaussian a background pixel is most likely to correspond to. The pixel values that do not fit the background are considered part of the moving objects, until they are associated with a Gaussian that includes them in a consistent and coherent way. In the analysis of the temporal sequence of images, the significant variations are due to the moving objects compared to the stationary ones. The distribution of each pixel is modeled with a mixture of K Gaussians N(μ_{i,t}(x,y), Σ_{i,t}(x,y)).
The probability P of the occurrence of an RGB pixel in the location (x, y) of the
current image t is given by
$$ P(I_t(x,y)) = \sum_{i=1}^{K} \omega_{i,t}(x,y)\, N\big(\mu_{i,t}(x,y),\, \Sigma_{i,t}(x,y)\big) \qquad (6.141) $$
where ω_{i,t}(x,y) is the weight of the ith Gaussian. To simplify, as suggested by the authors, the covariance matrix Σ_{i,t}(x,y) can be assumed diagonal, and in this case we have Σ_{i,t}(x,y) = σ_{i,t}²(x,y) I, where I is the 3×3 identity matrix in the case of RGB images. The number K of Gaussians depends on the operational context and on the available resources (in terms of computation and memory), and is normally between 3 and 5.
Now let’s see how the weights and parameters of the Gaussians are initialized
and updated as the images I t are acquired in real time. By virtue of (6.141), the
distribution of recently observed values of each pixel in the scene is characterized by
the Gaussian mixture. With the new observation, i.e., the current image I t , each pixel
will be associated with one of the Gaussian components of the mixture and must be
used to update the parameters of the model (the Gaussians). This is implemented as a
kind of classification, for example, the K-means algorithm. Each new pixel I t (x, y)
is associated with the Gaussian component for which the value of the pixel is within
2.5 standard deviations (that is, the distance is less than 2.5σi ) of its average. This 2.5
threshold value can be changed slightly, producing a slight impact on performance. If
a new pixel I_t(x,y) is associated with one of the Gaussian distributions, the corresponding parameters of the mean μ_{i,t}(x,y) and of the variance σ_{i,t}²(x,y) of that component are updated toward the new observation, while the weights of all the Gaussians are also updated, reinforcing the matched component and attenuating the others.
The components are then ranked (those with high weight and low variance are the most likely to represent the background) and the first B distributions, whose cumulative weight exceeds a threshold T, are taken as the background model:
$$ B = \arg\min_b\left(\sum_{i=1}^{b}\omega_i > T\right) \qquad (6.144) $$
where T indicates the minimum portion of the image that should be background
(characterized with distribution with high value of weight and low variance). Slowly
moving objects take longer to include in the background because they have more
variance than the background. Repetitive variations are also learned and a model
is maintained for the distribution of the background, which leads to faster recovery
when objects are removed from subsequent images.
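In practice, a Gaussian-mixture background subtractor of this kind is readily available in OpenCV (the MOG2 variant, an evolution of the model discussed here); a minimal usage sketch, with a hypothetical input file, is the following:

```python
import cv2

cap = cv2.VideoCapture("scene.avi")        # hypothetical input sequence
# history: frames used to learn the mixture; varThreshold: squared-distance gate on the match
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = mog.apply(frame)             # 0 = background, 127 = shadow, 255 = foreground
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) == 27:
        break
```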
The simple BS methods (difference of images, mean and median filtering), although very fast, use a global threshold to detect scene changes and are inadequate in complex real scenes. A method that models the background adaptively
with a mixture of Gaussians better controls real complex situations where often the
background is bimodal with long-term scene changes and confused repetitive move-
ments (for example caused by the temporary overlapping of objects in movement).
Often better results are obtained by combining the adaptive approach with temporal
information on the dynamics of the scene or by combining local information deriving
from simple BS methods.
A nonparametric alternative estimates the pdf of the background directly from the data, evaluating, for each pixel, a kernel density estimate (KDE) over the last n images:
$$ P_{kde}(I_t(x,y)) = \frac{1}{n}\sum_{i=t-n}^{t-1} K\big(I_t(x,y) - I_i(x,y)\big) \qquad (6.145) $$
where n is the number of the previous images used to estimate the pd f distribution
of the background using the Gaussian kernel function K .
A pixel I_t(x,y) is labeled as background if P_{kde}(I_t(x,y)) > T, where T is a predefined threshold; otherwise, it is considered a foreground pixel. The threshold T is appropriately adapted in relation to the number of false positives acceptable for the application context. The KDE method also extends to multivariate variables and is immediately usable for multispectral or color images. In this case, the kernel function is obtained from the product of one-dimensional kernel functions and (6.145) becomes
$$ P_{kde}(I_t(x,y)) = \frac{1}{n}\sum_{i=t-n}^{t-1}\;\prod_{j=1}^{m} K\!\left(\frac{I_t^{(j)}(x,y) - I_i^{(j)}(x,y)}{\sigma_j}\right) \qquad (6.146) $$
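The single-channel KDE classification of Eq. (6.145) can be sketched directly in NumPy as below; the kernel bandwidth σ and the threshold T are illustrative values to be tuned on the application:

```python
import numpy as np

def kde_background_prob(I_t, history, sigma=15.0):
    """Eq. (6.145): Gaussian-kernel density of the current image w.r.t. the last n images.

    I_t     : current gray-level image (H x W)
    history : the n previous images stacked in an array (n x H x W)
    """
    diff = I_t.astype(np.float32) - history.astype(np.float32)
    k = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return k.mean(axis=0)                  # average of the kernels over the n images

def kde_foreground_mask(I_t, history, T=1e-4, sigma=15.0):
    """A pixel is background if its KDE probability exceeds the threshold T."""
    return kde_background_prob(I_t, history, sigma) < T
```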
Learning phase. The pixels I_j of the ith image of the sequence are organized in a column vector I_i = (I_{1,i}, …, I_{j,i}, …, I_{N,i})^T of size N × 1, which allocates all the N pixels of the image. The entire sequence of images is organized in n columns in the matrix I of size N × n, of which the average image μ = (1/n)∑_{i=1}^{n} I_i is calculated. Then, the matrix X = [X_1, X_2, …, X_n] of size N × n is computed, where each of its column vectors (images) has zero mean, being given by X_i = I_i − μ. Next, the covariance matrix C = E{X_i X_i^T} ≈ (1/n) X X^T is calculated. By virtue of the PCA transform, it is possible to diagonalize the covariance matrix C by calculating the eigenvector matrix Φ, obtaining
$$ D = \Phi^T\,C\,\Phi \qquad (6.147) $$
where D is the diagonal matrix of the eigenvalues. Keeping only the m most significant eigenvectors (the rows of Φ_m), each new image I_t is projected onto the reduced eigenspace,
$$ \tilde{B}_t = \Phi_m\,(I_t - \mu) \qquad (6.148) $$
and then back-projected (reconstructed) as
$$ B_t = \Phi_m^T\,\tilde{B}_t + \mu \qquad (6.149) $$
At this point, considering that the eigenspace described by Φ_m mainly models the static scene and not the dynamic objects, the image B_t reconstructed from the eigenspace does not contain the moving objects; these can instead be highlighted by comparing, with a metric (for example, the Euclidean distance d_2), the input image I_t with the reconstructed one B_t, as follows:
$$ F_t(x,y) = \begin{cases} \textit{Foreground} & \text{if } d_2(I_t, B_t) > T \\ \textit{Background} & \text{otherwise} \end{cases} \qquad (6.150) $$
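The eigenbackground procedure just outlined can be sketched with NumPy as follows (here the eigenvectors are stored as the columns of Φ_m, so projection uses Φ_m^T; the number m of eigenvectors, the threshold T, and the per-pixel absolute difference used as comparison metric are illustrative choices):

```python
import numpy as np

def learn_eigenbackground(frames, m=10):
    """Learning phase: PCA of n training images, each flattened to a column of length N."""
    X = np.stack([f.ravel().astype(np.float32) for f in frames], axis=1)   # N x n
    mu = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mu, full_matrices=False)  # left singular vectors = eigenvectors of C
    return U[:, :m], mu                                   # Phi_m (N x m) and mean image

def eigen_foreground(I, Phi_m, mu, T=30.0):
    """Project the new image onto the eigenspace, reconstruct it, and compare (Eq. 6.150)."""
    x = I.ravel().astype(np.float32)[:, None] - mu
    b = Phi_m.T @ x                                       # projection coefficients
    recon = (Phi_m @ b + mu).reshape(I.shape)             # reconstructed background image
    return np.abs(I.astype(np.float32) - recon) > T       # per-pixel foreground mask
```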
$$ I = U\,\Sigma\,V^T \qquad (6.151) $$
In the parametric models, the probability density distribution (pdf) of the background
pixels is assumed to be known (for example, a Gaussian) described by its own charac-
teristic parameters (mean and variance). A semiparametric approach used to model
the variability of the background is represented for example by the Gaussian mixture
as described above. A more general method, used in different applications (in particular in Computer Vision), consists instead in estimating the pdf directly by analyzing the data, without assuming a particular form for their distribution. This approach is known as the nonparametric estimate of the distribution; the simplest examples are based on the calculation of the histogram or on the Parzen window (see Sect. 1.9.4), known as kernel density estimation.
Among the nonparametric approaches is the mean-shift algorithm (see Sect. 5.8.2 Vol. II), an iterative gradient-ascent method with good convergence properties that makes it possible to detect the peaks (modes) of a multivariate distribution and the related covariance matrix. The algorithm has been adopted as an effective technique both for blob tracking and for background modeling [35–37]. Like all nonparametric models, it can represent complex pdfs, but the implementation requires considerable computational and memory resources. A practical solution is to use the mean-shift method only to model the initial background (the pdf of the initial image sequence) and to use a propagation method to update the background model. This strategy is proposed in [38], which propagates and updates the pdf with the new images in real time.
6.6 Analytical Structure of the Optical Flow of a Rigid Body

In this section, we want to derive the geometric relations that link the motion parameters of a rigid body, represented by a flat surface, to the optical flow induced in the image plane (the observed 2D displacements of the intensity patterns in the image), hypothesized to correspond to the motion field (the projection of the 3D velocity vectors onto the 2D image plane).
In particular, given a sequence of space-time-variant images acquired while objects
of the scene move with respect to the camera or vice versa, we want to find solutions
to estimate:
1. the 3D motion of the objects with respect to the camera by analyzing the 2D flow
field induced by the sequence of images;
2. the distance of the object from the camera;
3. the 3D structure of the scene.
As shown in Fig. 6.33a, the camera can be considered stationary and the object in
motion with speed V or vice versa. The optical axis of the camera is aligned with
the Z -axis of the reference system (X, Y, Z ) of the camera, with respect to which
the moving object is referenced. The image plane is represented by the plane (x, y)
perpendicular to the Z-axis at the distance f, where f is the focal length of the optics.
In reality, the optical system is simplified with the pinhole model and the focal
distance f is the distance between the image plane and the perspective projection
center located at the origin O of the reference system (X, Y, Z ). A point P =
(X, Y, Z ) of the object plane, in the context of perspective projection, is projected
in the image plane at the point p = (x, y) calculated with the perspective projection
Fig. 6.33 Geometry of the perspective projection of a 3D point of the scene with the pinhole model. a Point P in motion with velocity V with respect to the observer, with reference system (X, Y, Z) centered in the perspective center of projection (CoP), the Z-axis coinciding with the optical axis and perpendicular to the image plane (x, y); b relative 3D velocity, in the plane (Y, Z), of the point P and 2D velocity of its perspective projection p in the image plane (only the y-axis is visible)
equations (see Sect. 3.6 Vol. II), derived with the properties of similar triangles (see
Fig. 6.33b), given by
$$ x = f\,\frac{X}{Z} \qquad\quad y = f\,\frac{Y}{Z} \quad\Longrightarrow\quad \mathbf{p} = f\,\frac{\mathbf{P}}{Z} \qquad (6.152) $$
$$ \mathbf{v} = \frac{d\mathbf{p}(t)}{dt} = \frac{d}{dt}\!\left(f\,\frac{\mathbf{P}(t)}{Z(t)}\right) = f\,\frac{Z\mathbf{V} - V_z\mathbf{P}}{Z^2} \qquad (6.154) $$
whose components are
$$ v_x = \frac{f\,V_x - x\,V_z}{Z} \qquad\quad v_y = \frac{f\,V_y - y\,V_z}{Z} \qquad (6.155) $$
while v_z = f V_z/Z − f V_z/Z = 0. From (6.154), it emerges that the apparent velocity is a
function of the velocity V of the 3D motion of P and of its depth Z with respect to the
image plane. We can reformulate the velocity components in terms of a perspective
The relative velocity of the P point with respect to the camera, in the context of a rigid
body (where all the points of the objects have the same parameters of motion), can
also be described in terms of the instantaneous rectilinear velocity T = (T_x, T_y, T_z)^T and the angular velocity Ω = (Ω_x, Ω_y, Ω_z)^T (around the origin), through the following equation [39,40]:
$$ \mathbf{V} = \mathbf{T} + \boldsymbol{\Omega}\times\mathbf{P} \qquad (6.157) $$
where the “×” symbol indicates the vector product. The components of V are
$$ \begin{aligned} V_x &= T_x + \Omega_y Z - \Omega_z Y \\ V_y &= T_y - \Omega_x Z + \Omega_z X \\ V_z &= T_z + \Omega_x Y - \Omega_y X \end{aligned} \qquad (6.158) $$
from which it emerges that the perspective motion of P in the image plane in p
induces a flow field v produced by the linear composition of the translational and
rotational motion. The translational flow component depends on the distance Z of
the point P, which instead does not affect the rotational component. For a better readability of the flow induced in the image plane by the different types of motion, Eq. (6.160) can be rewritten by separating the translational and rotational components:
$$ v_x = \underbrace{\frac{T_x f - T_z x}{Z}}_{\text{Transl. comp.}}\;\underbrace{-\,\Omega_y f + \Omega_z y + \frac{\Omega_x\,x y}{f} - \frac{\Omega_y\,x^2}{f}}_{\text{Rotational component}} \qquad\qquad (6.161) $$
$$ v_y = \underbrace{\frac{T_y f - T_z y}{Z}}_{\text{Transl. comp.}}\;\underbrace{-\,\Omega_x f + \Omega_z x + \frac{\Omega_y\,x y}{f} - \frac{\Omega_x\,y^2}{f}}_{\text{Rotational component}} $$
We have, therefore, defined the model of perspective motion for a rigid body, assum-
ing zero optical distortions, which relates to each point of the image plane the apparent
velocity (motion field) of a 3D point of the scene at a distance Z , subject to the trans-
lation motion T and rotation . Other simpler motion models can be considered as
the weak perspective model, orthographic or affine. From the analysis of the motion
field, it is possible to derive some parameters of the 3D motion of the objects.
In fact, once the optical flow (v_x, v_y) is known, Eq. (6.161) gives, for each point (x, y) of the image plane, two bilinear equations in 7 unknowns: the depth Z, the 3 translational velocity components T, and the 3 angular velocity components Ω. The optical flow is a linear combination of T and Ω once Z is known, or a linear combination of the inverse depth 1/Z and Ω once the translational velocity T is known. Theoretically, the 3D structure of the object (the inverse of the depth 1/Z for each image point) and the motion components (translational and rotational) can be determined by knowing the optical flow at different points of the image plane.
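For numerical experiments, the motion-field model of Eq. (6.161) can be evaluated directly; the following sketch computes the apparent velocities on a grid of image points for given T, Ω, depth Z, and focal length f (it follows the sign conventions of Eq. (6.161) as reported here, and all values in the example call are arbitrary, for illustration only):

```python
import numpy as np

def motion_field(x, y, Z, T, Omega, f):
    """Apparent velocity (v_x, v_y) of Eq. (6.161) at image points (x, y) with depth Z."""
    Tx, Ty, Tz = T
    Ox, Oy, Oz = Omega
    vx = (Tx * f - Tz * x) / Z - Oy * f + Oz * y + Ox * x * y / f - Oy * x**2 / f
    vy = (Ty * f - Tz * y) / Z - Ox * f + Oz * x + Oy * x * y / f - Ox * y**2 / f
    return vx, vy

# example: a purely translational motion along Z produces a radial (diverging) field
x, y = np.meshgrid(np.linspace(-100, 100, 9), np.linspace(-100, 100, 9))
vx, vy = motion_field(x, y, Z=1000.0, T=(0.0, 0.0, 50.0), Omega=(0.0, 0.0, 0.0), f=35.0)
```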
For example, if the dominant surface of an object is a flat surface (see Fig. 6.34)
it can be described by
P · nT = d (6.162)
where d is the perpendicular distance of the plane from the origin of the reference
system, P = (X, Y, Z ) is a generic point of the plane, and n = (n x , n y , n z )T is the
normal vector to the flat surface as shown in the figure. In the hypothesis of transla-
tory and rotatory motion of the flat surface with respect to the observer (camera), the
normal n and the distance d vary in time. Using Eq. (6.152) of the perspective projection and solving with respect to the vector P, the spatial position of a point belonging to the plane is
$$ \mathbf{P} = \frac{Z}{f}\,\mathbf{p}, \qquad \mathbf{p} = (x, y, f)^T \qquad (6.163) $$
which, substituted in the equation of the plane (6.162) and solved with respect to the inverse of the depth 1/Z of P, gives
$$ \frac{1}{Z} = \frac{\mathbf{p}\cdot\mathbf{n}}{f d} = \frac{n_x x + n_y y + n_z f}{f d} \qquad (6.164) $$
Substituting Eq. (6.164) in the equations of the motion field (6.161) we get
$$ v_x = \frac{1}{f d}\big(a_1 x^2 + a_2 x y + a_3 f x + a_4 f y + a_5 f^2\big) \qquad\quad v_y = \frac{1}{f d}\big(a_1 x y + a_2 y^2 + a_6 f y + a_7 f x + a_8 f^2\big) \qquad (6.165) $$
where
$$ \begin{aligned} a_1 &= -d\,\Omega_y + T_z n_x & a_2 &= d\,\Omega_x + T_z n_y \\ a_3 &= T_z n_z - T_x n_x & a_4 &= d\,\Omega_z - T_x n_y \\ a_5 &= -d\,\Omega_y - T_x n_z & a_6 &= T_z n_z - T_y n_y \\ a_7 &= -d\,\Omega_z - T_y n_x & a_8 &= d\,\Omega_x - T_y n_z \end{aligned} \qquad (6.166) $$
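The quadratic flow generated by a moving plane can thus be synthesized from the eight coefficients; the following sketch simply transcribes Eqs. (6.165)–(6.166) as reconstructed above (the argument names and the vector layout are illustrative choices):

```python
import numpy as np

def planar_flow_coefficients(T, Omega, n, d):
    """The eight coefficients a1..a8 of Eq. (6.166) for a plane P.n = d."""
    Tx, Ty, Tz = T
    Ox, Oy, Oz = Omega
    nx, ny, nz = n
    return np.array([-d * Oy + Tz * nx,    # a1
                      d * Ox + Tz * ny,    # a2
                      Tz * nz - Tx * nx,   # a3
                      d * Oz - Tx * ny,    # a4
                     -d * Oy - Tx * nz,    # a5
                      Tz * nz - Ty * ny,   # a6
                     -d * Oz - Ty * nx,    # a7
                      d * Ox - Ty * nz])   # a8

def planar_flow(x, y, a, f, d):
    """Quadratic flow of Eq. (6.165) induced in the image plane by the moving plane."""
    vx = (a[0] * x**2 + a[1] * x * y + a[2] * f * x + a[3] * f * y + a[4] * f**2) / (f * d)
    vy = (a[0] * x * y + a[1] * y**2 + a[5] * f * y + a[6] * f * x + a[7] * f**2) / (f * d)
    return vx, vy
```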
Let us now analyze what information can be extracted from the knowledge of optical
flow. In a stationary environment consisting of rigid bodies, whose depth can be
known, a sequence of images is acquired by a camera moving toward such objects
or vice versa. From each pair of consecutive images, it is possible to extract the
optical flow using one of the previous methods. From the optical flow, it is possible
to extract information on the structure and motion of the scene. In fact, from the
optical flow map, it is possible, for example, to observe that regions with small velocity variations correspond to single surfaces in the image and carry information on the structure of the observed surface. Regions with large velocity variations contain information on possible occlusions, or they correspond to areas of discontinuity between surfaces of objects at different distances from the observer (camera). A relationship between the orientation of the surface with respect to the observer and the small variations of the velocity gradients can be derived. Let us now look at the type
of motion field induced by the model of perspective motion (pinhole) described by
general Eq. (6.161) in the hypothesis of a rigid body subject to roto-translation.
Fig. 6.35 Example of flow generated by translational or rotary motion. a Flow field induced by
longitudinal translational motion T = (0, 0, Tz ) with the Z -axis coinciding with the velocity vector
T; b Flow field induced by translational motion T = (Tx + Tz ) with the FOE shifted with respect to
the origin along the x-axis in the image plane; c Flow field induced by the simple rotation (known
as roll) of the camera around the longitudinal axis (in this case, the Z-axis)
of the velocity vector T_z are coincident), where the flow velocity vanishes (see Fig. 6.35a). In the literature, the FOE is also known as the vanishing point.
In the case of lateral translational motion seen above, the FOE can be thought
to be located at infinity where the parallel flow vectors converge (the FOE cor-
responds to a vanishing point). Returning to the FOE, it represents the point of
intersection between the direction of motion of the observer and the image plane
(see Fig. 6.35a). If the motion also has a translation component in the direction of
X , with the resulting velocity vector T = Tx + Tz , the FOE appears in the image
plane displaced horizontally with the flow vectors always radial and convergent
in the position p0 = (x0 , y0 ) where the relative velocity is zero (see Fig. 6.35b).
This means that from a sequence of images, once the optical flow has been deter-
mined, it is possible to calculate the position of the FOE in the image plane, and
therefore know the direction of motion of the observer. We will see in the follow-
ing how this will be possible considering also the uncertainty in the localization
of the FOE, due to the noise present in the sequence of images acquired in the
estimation of the optical flow. The flow fields shown in the figure refer to ideal
cases without noise. In real applications, it can also be had that several indepen-
dent objects are in translational motion, the flow map is always radial but with
different FOEs, however each is associated with the motion of the corresponding
object. Therefore, before analyzing the motion of the single objects, it is necessary
to segment the flow field into regions of homogeneous motion relative to each
object.
9 In the aeronautical context, the attitude of an aircraft (integral with the axes (X, Y,
Z )), in 3D space,
is indicated with the angles of rotation around the axes, indicated, respectively, as lateral, vertical
and longitudinal. The longitudinal rotation, around the Z -axis, indicates the roll, the lateral one,
from the object, induces a flow that is represented from vectors oriented along the
points tangent to concentric circles whose center is the point of rotation of the points
projected in the image plane of the object in rotation. In this case, the FOE does not
exist and the flow is characterized by the center of rotation with respect to which
the vectors of the flow orbit (see Fig. 6.35c), tangent to the concentric circles. The pure rotation around the X-axis (called pitch rotation) or the pure rotation around the Y-axis (called yaw rotation) induces in the image plane a center of rotation of the vectors (no longer oriented tangentially along concentric circles but positioned according to the perspective projection) shifted toward the FOE, which in these two cases exists.
The collision time is the time required by the observer to reach the contact with the
surface of the object when the movement is of pure translation. In the context of
pure translational motion between scene and observer, the radial map of optical flow
in the image plane is analytically described by Eq. (6.167) valid for both Tz > 0
(observer movement towards the object) with the radial vectors emerging from the
FOE, and both for Tz < 0 (observer that moves away from the object) with the
radial vectors converging in the FOC. A generic point P = (X, Y, Z ) of the scene,
with translation velocity T = (0, 0, Tz ) (see Fig. 6.35a) in the image plane, it moves
radially with velocity v = (vx , v y ) expressed by Eq. (6.168) and, remembering the
perspective projection Eq. (6.152), is projected in the image plane in p = (x, y). We
have seen how in this case of pure translational motion, the vectors of the optical
flow converge at the FOE point p_0 = (x_0, y_0), where they vanish, that is, (v_x, v_y) = (0, 0) for all vectors. Therefore, at the FOE point the Eqs. (6.167) cancel out, thus giving the FOE coordinates
$$ x_0 = f\,\frac{T_x}{T_z} \qquad\quad y_0 = f\,\frac{T_y}{T_z} \qquad (6.169) $$
The same results are obtained if the observer moves away from the scene, and in this
case, it is referred to as a Focus Of Contraction (FOC).
We can now express the relative velocity (vx , v y ) of the points p = (x, y) projected
in the image plane with respect to their distance from p0 = (x0 , y0 ), that is, from
the FOE, combining Eq. (6.169) and the equations of the pure translational motion
around the X -axis, indicates the pitch, and the ver tical one, around the Y axis indicates the yaw. In
the robotic context (for example, in the case of an autonomous vehicle) the attitude of the camera
can be defined with 3 degrees of freedom with the axes (X, Y, Z ) indicating the lateral direction
(side-to-side), ver tical (up-down), and camera direction (looking). The rotation around the lateral,
up-down, and looking axes retain the same meaning as the axes considered for the aircraft.
Fig. 6.36 Geometry of the perspective projection of a 3D point in the image plane in two instants
of time while the observer approaches the object with pure translational motion
– The time to collision (TTC) of the observer with respect to a point of the scene, without knowing its distance or the approach velocity.
– The distance d of a point of the scene from the observer moving at a constant velocity T_z parallel to the Z-axis, in its direction.
This is the typical situation of an autonomous vehicle that tries to approach an object
and must be able to predict the collision time assuming that it moves with constant
velocity Tz . In these applications, it is strategic that this prediction occurs without
knowing the speed and the distance instant by instant from the object. While in other
situations, it is important to estimate a relative distance between vehicle and object
without knowing the translation velocity.
We have essentially obtained that the time to collision τ is given by the ratio between the observer–object distance Z_P and the velocity T_z, which is the classic way to estimate the TTC, but these are quantities that we do not know with a single camera. Above all, however, we have the important result we wanted, namely that the time to collision τ is also given by the ratio of two measurements derived from the optical flow: y_p (the distance y_p − y_0 of the point from the FOE) and v_y (the flow velocity ∂y/∂t), both of which can be estimated from the sequence of images, under the hypothesis of translational motion with constant speed. The accuracy of τ depends on the accuracy of the FOE position and of the optical flow. Normally, the value of τ is considered acceptable if y_p exceeds a threshold value, in terms of number of pixels, to be defined in relation to the velocity T_z.
We arrive at the same Eq. (6.172) for the time to collision τ by considering the observer in motion toward the stationary object, as represented in Fig. 6.36. The figure shows the projection of P in the image plane at two time instants, while the camera moves with velocity T_z toward the object. At the instant t, the perspective projection p of P in the image plane, according to (6.152), is y_p = f Y_P/Z_P. Over time, the projection p moves away from the FOE as the image plane approaches P, moving radially to p' at time t + 1. This dynamic is described by differentiating y_p with respect to time t, obtaining
$$ \frac{\partial y_p}{\partial t} = f\,\frac{\partial Y_P/\partial t}{Z_P} - f\,Y_P\,\frac{\partial Z_P/\partial t}{Z_P^2} \qquad (6.173) $$
Since the motion is purely translational along the Z-axis, ∂Y_P/∂t = 0; moreover, from (6.152) we have Y_P = y_p Z_P/f and ∂Z_P/∂t = T_z. Replacing in (6.173), we get
$$ \frac{\partial y_p}{\partial t} = -\,y_p\,\frac{T_z}{Z_P} \qquad (6.174) $$
Dividing, as before, both members by y_p and taking the reciprocal, we finally obtain the same expression (6.172) for the time to collision τ.
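In its simplest form, the image-only estimate of τ is therefore a ratio of two measurable quantities; a trivial sketch (with purely illustrative numbers) is:

```python
import numpy as np

def time_to_collision(y_p, v_y):
    """TTC from optical flow only: tau = (distance from the FOE) / (radial flow speed)."""
    return np.abs(y_p) / np.abs(v_y)       # expressed in the same time unit as v_y

# example: a point 40 px away from the FOE, receding from it at 2.5 px per frame, at 500 fps
tau_frames = time_to_collision(40.0, 2.5)  # = 16 frames
tau_seconds = tau_frames / 500.0           # = 0.032 s
```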
It is also observed that τ does not depend on the focal length of the optical system
or the size of the object, but depends only on the observer–object distance Z P and
the translation velocity T_z. With the same principle with which the TTC is estimated, we could also evaluate the size of the object (useful in navigation applications where we want to estimate the size of an obstacle), reformulating the problem in terms of τ = h/h_t, where h is the height (seen as a scale factor) of the obstacle projected in the image plane and h_t (in analogy to v_y) represents the time derivative of the object scale. The reformulation of the problem in these terms does not aim to estimate the absolute size of the object with τ, but to have an estimate of how its size varies temporally between one image and the next in the sequence. In this case, it is useful to estimate τ also in the X–Z plane according to Eq. (6.170).
from which it emerges that it is possible to calculate the relative 3D distance of any two points of the object, in terms of the ratio Z_2(t)/Z_1(t), using the optical flow measurements (velocity and distance from the FOE) derived between adjacent images of the
ments (velocity and distance from the FOE) derived between adjacent images of the
image sequence. If for any point of the object the accurate distance Z r (t) is known,
according to (6.176) we could determine the instantaneous depth Z i (t) of any point
as follows:
$$ Z_i(t) = Z_r(t)\cdot\frac{y_i(t)\,v_y^{(r)}(t)}{y_r(t)\,v_y^{(i)}(t)} \qquad (6.177) $$
In real applications of an autonomous vehicle, the flow field induced in the image plane is radial, generated by the dominant translational motion (assuming a flat floor), with negligible roll and pitch rotations and a possible yaw rotation (that is, a rotation around the Y-axis perpendicular to the floor; see note 9 for the conventions used for vehicle and scene orientation). With this radial typology of the flow field, the
coordinates (x0 , y0 ) of FOE (defined by Eq. 6.169) can be determined theoretically
by knowing the optical flow vector of at least two points belonging to a rigid object. In
these radial map conditions, the lines passing through two flow vectors intersect at the
point (x0 , y0 ), where all the other flow vectors converge, at least in ideal conditions
as shown in Fig. 6.35a.
In reality, the optical flow observable in the image plane is induced by the motion
of the visible points of the scene. Normally, we consider the corners and edges that are
not always easily visible and univocally determined. This means that the projections
in the image plane of the points, and therefore the flow velocities (given by (6.170)), are not always accurately measured from the sequence of images. It follows that the direction of the optical flow vectors is noisy, and consequently the flow vectors do not converge in a single point. In this case, the location of the FOE is estimated
approximately as the center of mass of the region [42], where the optical flow vectors
converge.
Calculating the location of the FOE is not only useful for obtaining information
on the structure of the motion (depth of the points) and the collision time, but it is also
useful for obtaining information relating to the direction of motion of the observer
(called heading) not always coinciding with the optical axis. In fact, the flow field
induced by the translational motion T = (Tx + Tz ) produces the FOE shifted with
respect to the origin along the x-axis in the image plane (see Fig. 6.35b).
We have already shown earlier that we cannot fully determine the structure of the
scene from the flow field due to the lack of knowledge of the distance Z (x, y) of
3D points as indicated by Eq. (6.170), while the position of the FOE is independent
of Z (x, y) according to Eq. (6.169). Finally, it is observed (easily demonstrated
geometrically) that the amplitude of the velocity vectors of the flow is dependent on
the depth Z (x, y) while the direction is independent.
There are several methods for the estimation of the FOE. Many of them use calibrated systems that separate the translational and rotational motion components, or compute an approximation of the FOE position by setting up a minimum-error function (which imposes constraints on the correspondence of the points of interest of the scene detected in the sequence of images) and solving it with least squares, or by simplifying the visible surface with elementary planes. After the error minimization process, an optimal FOE position is obtained.
Other methods use the direction of velocity vectors and determine the position of
the FOE by evaluating the maximum number of intersections (for example, using the
Hough transform) or by using a multilayer neural network [43,44]. The input flow
field does not necessarily have to be dense. It is often useful to consider the velocity
vectors associated with points of interest of which there is good correspondence in
the sequence of images. Generally, they are points with high texture or corners.
A least squares solution for the FOE calculation, using all the flow vectors in the
pure translation context, is obtained by considering the equations of the flow (6.170)
and imposing the constraint that eliminates the dependence of the translation velocity
Tz and depth Z , so we have
$$ \frac{v_x}{v_y} = \frac{x - x_0}{y - y_0} \qquad (6.178) $$
This constraint is applied to each vector (v_{x_i}, v_{y_i}) of the optical flow (dense or sparse) that contributes to the determination of the position (x_0, y_0) of the FOE. In fact, writing (6.178) in matrix form,
$$ \begin{bmatrix} v_{y_i} & -v_{x_i} \end{bmatrix}\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = x_i\,v_{y_i} - y_i\,v_{x_i} \qquad (6.179) $$
we have a highly overdetermined linear system and it is possible to estimate the FOE
position (x0 , y0 ) from the flow field (vxi , v yi ) with the least-squares approach:
$$ \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = (A^T A)^{-1} A^T \mathbf{b} \qquad (6.180) $$
where
$$ A = \begin{bmatrix} v_{y_1} & -v_{x_1} \\ \vdots & \vdots \\ v_{y_n} & -v_{x_n} \end{bmatrix} \qquad\quad \mathbf{b} = \begin{bmatrix} x_1 v_{y_1} - y_1 v_{x_1} \\ \vdots \\ x_n v_{y_n} - y_n v_{x_n} \end{bmatrix} \qquad (6.181) $$
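The least-squares estimate of Eqs. (6.179)–(6.181) translates directly into a few lines of NumPy; the synthetic radial field used below to check the function is, of course, only an illustration:

```python
import numpy as np

def estimate_foe(x, y, vx, vy):
    """Least-squares FOE (x0, y0) from a set of flow vectors, Eqs. (6.179)-(6.181)."""
    A = np.column_stack([vy, -vx])               # the n x 2 matrix A of Eq. (6.181)
    b = x * vy - y * vx                          # the right-hand side b of Eq. (6.181)
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)  # solves Eq. (6.180)
    return foe

# synthetic check: radial field diverging from the point (12, -7)
rng = np.random.default_rng(1)
x = rng.uniform(-200, 200, 100)
y = rng.uniform(-150, 150, 100)
vx, vy = 0.05 * (x - 12.0), 0.05 * (y + 7.0)
print(estimate_foe(x, y, vx, vy))                # approximately [12, -7]
```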
$$ v_x = -\,\frac{(x - x_0)\,I_t}{(x - x_0)I_x + (y - y_0)I_y} \qquad\quad v_y = -\,\frac{(y - y_0)\,I_t}{(x - x_0)I_x + (y - y_0)I_y} \qquad (6.183) $$
where I x , I y , and It are the first partial space–time derivatives of the adjacent images
of the sequence. Combining then, the equations of the flow (6.170) (valid in the
context of translational motion with respect to a rigid body) with the equations of
the flow (6.183) we obtain the following relations:
$$ \tau = \frac{Z}{T_z} = \frac{(x - x_0)I_x + (y - y_0)I_y}{I_t} \qquad\quad Z(x,y) = \frac{T_z}{I_t}\,\big[(x - x_0)I_x + (y - y_0)I_y\big] \qquad (6.184) $$
These equations express the collision time τ and the depth Z(x,y), respectively, in relation to the position of the FOE and the first derivatives of the images. The estimates of τ and Z(x,y) calculated with (6.184) may be more robust than those calculated with Eqs. (6.172) and (6.176), since the position of the FOE is determined (with the closed-form least-squares approach) considering only the direction of the optical flow vectors and not their magnitude.
Fig. 6.37 Geometry of the perspective projection of a point P, in the 3D space (X, Y, Z) and in the image plane (x, y), which moves to P' = (X', Y', Z') according to the translation (S_x, S_y, S_z) and rotation (R_x, R_y, R_z). In the image plane, the displacement vector (Δx, Δy) associated with the roto-translation is indicated
The richness of information present in the optical flow can be used to estimate the
motion parameters of a rigid body [45]. In the applications of autonomous vehicles
and in general, in the 3D reconstruction of the scene structure through the optical flow,
it is possible to estimate the parameters of vehicle motion and depth information.
In general, for an autonomous vehicle it is interesting to know its own motion (ego-
motion) in a static environment. In the more general case, there may be more objects
in the scene with different velocity and in this case the induced optical flow can be
segmented to distinguish the motion of the various objects. The motion parameters
are the translational and rotational velocity components associated with the points
of an object with the same motion.
We have already highlighted above that if the depth is unknown only the rotation can be determined univocally (it is invariant to depth, Eq. 6.161), while the translation parameters can be estimated only up to a scale factor. The high dimensionality of the problem and the nonlinearity of the equations derived from the optical flow make the problem complex.
The accuracy of the estimation of motion parameters and scene structure is related
to the accuracy of the flow field normally determined by sequences of images with
good spatial and temporal resolution (using cameras with high frame rate). In par-
ticular, the time interval Δt between images must be very small in order to estimate with a good approximation the velocity Δx/Δt of a point that has moved by Δx between two consecutive images of the sequence.
Consider the simple case of rigid motion where the object’s points move with
the same velocity and are projected prospectively in the image plane as shown in
Fig. 6.37. The direction of motion is always along the Z -axis in the positive direction
with the image plane (x, y) perpendicular to the Z-axis. The figure shows the position of a point P = (X, Y, Z) at time t and its new position P' = (X', Y', Z') after the movement, at time t'. The perspective projections of P and P' in the image plane, at the two instants of time, are, respectively, p = (x, y) and p' = (x', y').
The 3D displacement of P in the new position is modeled by the following geo-
metric transformation:
$$ \mathbf{P}' = R\,\mathbf{P} + \mathbf{S} \qquad (6.185) $$
where
$$ R = \begin{bmatrix} 1 & -R_z & R_y \\ R_z & 1 & -R_x \\ -R_y & R_x & 1 \end{bmatrix} \qquad\quad \mathbf{S} = \begin{bmatrix} S_x \\ S_y \\ S_z \end{bmatrix} \qquad (6.186) $$
It follows that, according to (6.185), the spatial coordinates of the new position of P depend on the coordinates of the initial position (X, Y, Z) multiplied by the matrix of the rotation parameters (R_x, R_y, R_z) and then added to the translation parameters (S_x, S_y, S_z). Replacing R and S in (6.185) we have
\begin{aligned} X' &= X - R_z Y + R_y Z + S_x \\ Y' &= Y + R_z X - R_x Z + S_y \\ Z' &= Z - R_y X + R_x Y + S_z \end{aligned} \;\Longrightarrow\; \begin{aligned} \Delta X &= X' - X = S_x - R_z Y + R_y Z \\ \Delta Y &= Y' - Y = S_y + R_z X - R_x Z \\ \Delta Z &= Z' - Z = S_z - R_y X + R_x Y \end{aligned}   (6.187)
The determination of the motion is closely related to the calculation of the motion
parameters (Sx , S y , Sz , Rx , R y , Rz ) which depend on the geometric properties of
projection of the 3D points of the scene in the image plane. With the perspective model
of image formation given by Eq. (6.152), the projections of P and P′ in the image plane are, respectively, p = (x, y) and p′ = (x′, y′), with the relative displacements (Δx, Δy) given by
\Delta x = x' - x = f\frac{X'}{Z'} - f\frac{X}{Z} \qquad \Delta y = y' - y = f\frac{Y'}{Z'} - f\frac{Y}{Z}   (6.188)
In the context of rigid motion and with very small 3D angular rotations, the 3D displacements (ΔX, ΔY, ΔZ) can be approximated by Eq. (6.187), which, when replaced in the perspective Eq. (6.188), gives the corresponding displacements (Δx, Δy) as follows:
\Delta x = x' - x = \frac{\dfrac{f S_x - S_z x}{Z} + f R_y - R_z y - R_x \dfrac{xy}{f} + R_y \dfrac{x^2}{f}}{1 + \dfrac{S_z}{Z} + R_x \dfrac{y}{f} - R_y \dfrac{x}{f}}
\qquad
\Delta y = y' - y = \frac{\dfrac{f S_y - S_z y}{Z} - f R_x + R_z x + R_y \dfrac{xy}{f} - R_x \dfrac{y^2}{f}}{1 + \dfrac{S_z}{Z} + R_x \dfrac{y}{f} - R_y \dfrac{x}{f}}   (6.189)
The equations of the displacements (Δx, Δy) in the image plane at p = (x, y) are thus obtained in terms of the parameters (S_x, S_y, S_z, R_x, R_y, R_z) plus the depth Z of the point P = (X, Y, Z) of the 3D object, assuming the perspective
projection model. Furthermore, if the images of the sequence are acquired with a high frame rate, the components of the displacement (Δx, Δy) in the time interval between one image and the next are very small; it follows that the variable terms in the denominator of Eq. (6.189) are small compared to unity, that is,

\frac{S_z}{Z} + R_x \frac{y}{f} - R_y \frac{x}{f} \ll 1   (6.190)
With these assumptions, it is possible to derive the equations that relate motion in the image plane with the motion parameters by differentiating Eq. (6.189) with respect to time t. In fact, dividing (6.189) by the time interval Δt and letting Δt → 0, the displacements (Δx, Δy) approximate (become) the instantaneous velocities (v_x, v_y) in the image plane, known as optical flow.
Similarly, in 3D space, the translation motion parameters (S_x, S_y, S_z) become the translation velocity, indicated with (T_x, T_y, T_z), and likewise the rotation parameters (R_x, R_y, R_z) become the rotation velocity, indicated with (Ω_x, Ω_y, Ω_z). The equations of the optical flow (v_x, v_y) correspond precisely to Eq. (6.160) of Sect. 6.6.
Therefore, these equations of motion involve velocities both in 3D space and in the 2D image plane. However, in real vision systems, the available information is that acquired from the sequence of space–time images, based on the displacements induced over very small time intervals according to the condition expressed in Eq. (6.190).
In other words, we can approximate the 3D velocity of a point P = (X, Y, Z) with the equation V = T + Ω × P (see Sect. 6.6) of a rigid body that moves with a translational velocity T = (T_x, T_y, T_z) and a rotational velocity Ω = (Ω_x, Ω_y, Ω_z), while the relative velocity (v_x, v_y) of the 3D point projected in p = (x, y), in the image plane, can be approximated by considering the displacements (Δx, Δy) if the constraint ΔZ/Z ≪ 1 is maintained, according to Eq. (6.190).
With these assumptions, we can now address the problem of estimating the motion parameters (S_x, S_y, S_z, R_x, R_y, R_z) and Z starting from the measurements of the displacement vector (Δx, Δy) given by Eq. (6.189), which we can break down into separate components of translation and rotation, as follows:

\Delta x = \Delta x_S + \Delta x_R \qquad \Delta y = \Delta y_S + \Delta y_R   (6.191)

with
\Delta x_S = \frac{f S_x - S_z x}{Z} \qquad \Delta y_S = \frac{f S_y - S_z y}{Z}   (6.192)
and
\Delta x_R = f R_y - R_z y - R_x \frac{xy}{f} + R_y \frac{x^2}{f} \qquad \Delta y_R = -f R_x + R_z x + R_y \frac{xy}{f} - R_x \frac{y^2}{f}   (6.193)
As previously observed, the rotational component does not depend on the depth Z
of a 3D point of the scene.
Given that the displacement vector (Δx, Δy) is available for each 3D point projected in the scene, the motion parameters (the unknowns) can be estimated with the least-squares approach by setting up an error function e(S, R, Z) to be minimized, given by

e(S, R, Z) = \sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{S_i} - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{S_i} - \Delta y_{R_i})^2 \right]   (6.194)
where (Δx_i, Δy_i) is the measurable displacement vector for each point i of the image plane, while (Δx_R, Δy_R) and (Δx_S, Δy_S) are, respectively, the rotational and translational components of the displacement vector. With the perspective projection model it is not possible to estimate an absolute value for the translation vector S and for the depth Z of each 3D point. In essence, they are estimated up to a scale factor. In fact, in the estimate of the translation component (Δx_S, Δy_S) from Eq. (6.192), we observe that multiplying both S and Z by a constant c leaves the equation unaltered.
Therefore, by scaling the translation vector by a constant factor and, at the same time, increasing the depth by the same factor, no change is produced in the displacement vector in the image plane. From the displacement vector it is possible to estimate only the direction of motion and the relative depth of the 3D point from the image plane. According to the strategy proposed in [45], it is useful to set up the error function by first eliminating the scale of S and Z with a normalization process. Let U = (U_x, U_y, U_z) be the normalized motion direction vector and r the magnitude of the translation component S = (S_x, S_y, S_z). The normalization of U is given by
(U_x, U_y, U_z) = \frac{(S_x, S_y, S_z)}{r}   (6.195)
Let \bar{Z}_i be the relative depth given by

\bar{Z}_i = \frac{r}{Z_i} \quad \forall i   (6.196)
At this point we can rewrite in normalized form the translation component (6.192)
which becomes
\Delta x_U = \frac{\Delta x_S}{\bar{Z}} = f U_x - U_z x \qquad \Delta y_U = \frac{\Delta y_S}{\bar{Z}} = f U_y - U_z y   (6.197)
Rewriting the error function (6.194) with respect to U (via Eq. 6.197) we get

e(U, R, \bar{Z}) = \sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{U_i} \bar{Z}_i - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{U_i} \bar{Z}_i - \Delta y_{R_i})^2 \right]   (6.198)
We are now interested in minimizing this error function for all possible values of Z̄ i .
Differentiating Eq. (6.198) with respect to \bar{Z}_i, setting the result to zero, and solving with respect to \bar{Z}_i we obtain

\bar{Z}_i = \frac{(\Delta x_i - \Delta x_{R_i})\Delta x_{U_i} + (\Delta y_i - \Delta y_{R_i})\Delta y_{U_i}}{\Delta x_{U_i}^2 + \Delta y_{U_i}^2} \quad \forall i   (6.199)
Finally, having estimated the relative depths Z̄ i , it is possible to replace them in the
error function (6.198) and obtain the following final formulation:
e(U, R) = \sum_{i=1}^{n} \frac{\left[ (\Delta x_i - \Delta x_{R_i})\Delta y_{U_i} - (\Delta y_i - \Delta y_{R_i})\Delta x_{U_i} \right]^2}{\Delta x_{U_i}^2 + \Delta y_{U_i}^2}   (6.200)
With this artifice, the depth Z has been eliminated from the error function, which is now formulated only in terms of U and R. Once the motion parameters have been estimated with (6.200), it is possible finally to estimate with (6.199) the optimal depth values associated with each ith point of the image plane.
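The two-step use of Eqs. (6.199) and (6.200) can be sketched as follows in NumPy. The function assumes that the measured displacements and their rotational and normalized translational components have already been computed (e.g., with Eqs. 6.193 and 6.197 for candidate values of R and U); the names and the small regularization term are illustrative additions.

```python
import numpy as np

def relative_depths_and_residual(dx, dy, dxR, dyR, dxU, dyU):
    """Optimal relative depths (Eq. 6.199) and residual error (Eq. 6.200).

    dx, dy   : measured displacement components for the n image points.
    dxR, dyR : rotational components of the displacements (Eq. 6.193).
    dxU, dyU : normalized translational components (Eq. 6.197).
    """
    ex, ey = dx - dxR, dy - dyR          # displacements minus rotational part
    den = dxU**2 + dyU**2 + 1e-12        # small term avoids division by zero
    Zbar = (ex * dxU + ey * dyU) / den   # Eq. (6.199)
    e = np.sum((ex * dyU - ey * dxU)**2 / den)  # Eq. (6.200)
    return Zbar, e
```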
In the case of motion with pure rotation, the error function (6.200) to be minimized
becomes
e(R) = \sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{R_i})^2 \right]   (6.201)
where the rotational motion parameters to be estimated are the three components of R. This is possible by differentiating the error function with respect to each of the components (R_x, R_y, R_z), setting the result to zero, and solving the three linear equations (remembering the nondependence on Z) as proposed in [46]. The three linear equations that are obtained are
\sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i})\, x y + (\Delta y_i - \Delta y_{R_i})(y^2 + 1) \right] = 0

\sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i})(x^2 + 1) + (\Delta y_i - \Delta y_{R_i})\, x y \right] = 0   (6.202)

\sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i})\, y + (\Delta y_i - \Delta y_{R_i})\, x \right] = 0
where (Δx_i, Δy_i) is the ith displacement vector for the image point (x, y) and (Δx_{R_i}, Δy_{R_i}) are the rotational components of the displacement vector. As reported in [46], to estimate the rotation parameters (R_x, R_y, R_z) for the image point (x, y), Eqs. (6.202) are expanded and rewritten in matrix form, obtaining
\begin{bmatrix} R_x \\ R_y \\ R_z \end{bmatrix} = \begin{bmatrix} a & d & f \\ d & b & e \\ f & e & c \end{bmatrix}^{-1} \begin{bmatrix} k \\ l \\ m \end{bmatrix}   (6.203)
where

a = \sum_{i=1}^{n} \left[ x^2 y^2 + (y^2 + 1) \right] \quad b = \sum_{i=1}^{n} \left[ (x^2 + 1) + x^2 y^2 \right] \quad c = \sum_{i=1}^{n} (x^2 + y^2)

d = \sum_{i=1}^{n} \left[ x y (x^2 + y^2 + 2) \right] \quad e = \sum_{i=1}^{n} y \quad f = \sum_{i=1}^{n} x   (6.204)

k = \sum_{i=1}^{n} \left[ u\, x y + v (y^2 + 1) \right] \quad l = \sum_{i=1}^{n} \left[ u (x^2 + 1) + v\, x y \right] \quad m = \sum_{i=1}^{n} (u\, y - v\, x)
In Eq. (6.204), (u, v) indicates the flow measured for each pixel (x, y). It can be proved that the matrix in (6.203) is diagonal and nonsingular if the image points are distributed symmetrically with respect to the x and y axes. Moreover, if the image plane is reduced in size, the matrix could be ill conditioned, so that errors in the summations (k, l, m), computed from the observed flow (u, v), would be greatly amplified. In practice this should not happen, because generally it is not required to determine the rotational motion around the optical axis of the camera when observing the scene with a small field of view.
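Rather than forming the summations of (6.204) explicitly, an equivalent minimal sketch solves the same least-squares problem directly from the rotational flow model of Eq. (6.193), assuming image coordinates normalized by the focal length (f = 1); the function name and interface are illustrative.

```python
import numpy as np

def estimate_rotation(x, y, u, v):
    """Least-squares (Rx, Ry, Rz) from the flow (u, v) measured at points (x, y).

    Uses the rotational flow model of Eq. (6.193) with coordinates normalized
    by the focal length (f = 1): each point contributes two linear equations.
    """
    # Coefficients of (Rx, Ry, Rz) for the horizontal and vertical flow components
    Ax = np.stack([-x * y, 1.0 + x**2, -y], axis=1)
    Ay = np.stack([-(1.0 + y**2), x * y, x], axis=1)
    A = np.vstack([Ax, Ay])
    b = np.concatenate([u, v])
    R, *_ = np.linalg.lstsq(A, b, rcond=None)
    return R  # array [Rx, Ry, Rz]
```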
6.7 Structure from Motion

In this section, we describe an approach known as Structure from Motion (SfM)
to obtain information on the geometry of the 3D scene starting from a sequence of
2D images acquired by a single uncalibrated camera (the intrinsic parameters and its
position are not known). The three-dimensional perception of the world is a feature
common to many living beings. We have already described stereopsis, the primary mechanism used by the human visual system (i.e., the lateral displacement of objects in the two retinal images) to perceive depth and obtain 3D information of the scene by fusing the two retinal images.
The computer vision community has developed different approaches for 3D recon-
struction of the scene starting from 2D images observed from multiple points of view.
One approach is to find the correspondence of points of interest of the 3D scene (as
happens in the stereo vision) observed in 2D multiview images and by triangulation
construct a 3D trace of these points. More formally (see Fig. 6.38), given the n points projected as x_{ij}, i = 1, …, n, j = 1, …, m, in the m 2D images, the goal is to find all the projection matrices P_j, j = 1, …, m (associated with the motion) together with the structure of the n observed 3D points X_i, i = 1, …, n. Fundamental to the SfM approach is
the knowledge of the camera projection matrix (or camera matrix) that is linked to
the geometric model of image formation (camera model).
Normally, the simple model of perspective projection is assumed (see Fig. 6.39), which corresponds to the ideal pinhole model where a 3D point Q = (X, Y, Z), whose coordinates are expressed in the reference system of the camera, is projected in the image plane in q, whose coordinates x = (x, y) are related to the 3D coordinates of Q through the canonical perspective projection matrix A.
Fig. 6.38 Observation of the scene from slightly different points of view obtaining a sequence of
m images. n points of interest of the 3D scene are taken from the m images to reconstruct the 3D
structure of the scene by estimating the projection matrices P j associated with the m observations
Fig. 6.39 Intrinsic and extrinsic parameters in the pinhole model. A point Q, in the camera’s 3D
reference system (X, Y, Z ), is projected into the image plane in q = (x, y), whose coordinates are
defined with respect to the principal point c = (cx , c y ) according to Eq. (6.206). The transformation
of the coordinates (x, y) of the image plane into the sensor coordinates, expressed in pixels, is defined
by the intrinsic parameters with Eq. (6.207) which takes into account the translation of the principal
point c and sensor resolution. The transformation of 3D point coordinates of a rigid body from
the world reference system (X w , Yw , Z w ) (with origin in O) to the reference system of the camera
(X, Y, Z ) with origin in the projection center C is defined by the extrinsic parameters characterized
by the roto-translation vectors R, T according to Eq. (6.210)
where ≈ indicates that the projection x̃ is defined up to a scale factor. Furthermore, x̃ is independent of the magnitude of X, that is, it depends only on the direction of the 3D point relative to the camera and not on how far away it is. The matrix A represents the geometric model of the camera and is known as the canonical perspective projection matrix.
10 The physical position of a point projected in the image plane, normally expressed with a metric unit, for example in mm, must be transformed into units of the image sensor, expressed in pixels, which typically do not correspond to a metric unit such as mm. The physical image plane is discretized by the pixels of the sensor, characterized by its horizontal and vertical spatial resolution.
reference system of the camera by a geometric relation that includes the camera orientation (through the rotation matrix R of size 3 × 3) and the translation vector T (3D vector indicating the position of the origin O_w with respect to the camera reference system). This transformation, in compact matrix form, is given by

X = R X_w + T   (6.209)
In essence, the perspective projection matrix P defined by (6.212) includes: the sim-
ple perspective transformation defined by A (Eq. 6.206), the effects of discretization
11 The roto-translation transformation expressed by (6.209) indicates that the rotation R is performed first and then the translation T. Often it is reported with the operations inverted, that is, the translation first and then the rotation, thus having

X = R(X_w − T) = R X_w + (−R T)

and in this case, in Eq. (6.210), the translation term T is replaced with −R T.
of the image plane associated with the sensor through the matrix K (Eq. 6.208), and
the transformation that relates the position of the camera with respect to the scene
by means of the matrix M (Eq. 6.210).
The transformation (6.211) is based only on the pinhole perspective projection
model and does not include the effects due to distortions introduced by the optical
system, normally modeled with other parameters that describe radial and tangential
distortions (see Sect. 4.5 Vol. I).
Starting from Eq. (6.211), we can now analyze the methods proposed to solve the problem of 3D scene reconstruction by capturing a sequence of images with a single camera whose intrinsic parameters remain constant even if not known (uncalibrated camera), and without knowledge of the motion [47,48]. The proposed methods amount to solving an inverse problem. In fact, with Eq. (6.211) we want to reconstruct the 3D structure of the scene (and the motion), that is, calculate the homogeneous coordinates of the n points X̃_i (for simplicity we indicate, from now on, the 3D points of the scene without the subscript w) whose projections are known in homogeneous coordinates ũ_{ij}, detected in m images characterized by the associated m unknown perspective projection matrices P_j.
Essentially, the problem reduces to estimating the m projection matrices P_j and the n 3D points X̃_i, given the m·n correspondences ũ_{ij} found in the sequence of m images (see Fig. 6.38). We observe that with (6.211) the scene is reconstructed up to a scale factor, having considered a perspective projection. In fact, if the points of the scene are scaled by a factor λ and we simultaneously scale the projection matrix by a factor 1/λ, the points of the scene projected in the image plane remain exactly the same:
\tilde{u} \approx P \tilde{X} = \left( \frac{1}{\lambda} P \right) (\lambda \tilde{X})   (6.213)
Therefore, the scene cannot be reconstructed with an absolute scale value. For recognition applications, even if the structure of the reconstructed scene only resembles the real one and is reconstructed with an arbitrary scale, it still provides useful information. The methods proposed in the literature use the algebraic approach [49] (based on the Fundamental matrix described in Chap. 7), the factorization approach (based on the singular value decomposition—SVD), and the bundle adjustment approach [50,51], which iteratively refines the motion parameters and the 3D structure of the scene by minimizing an appropriate cost functional. In the following section, we will describe the method of factorization.
The factorization method assumes an orthographic projection. This simplifies the geometric model of projection of the 3D points in the image plane, whose distance with respect to the camera can be considered irrelevant (ignoring the scale factor due to the object–camera distance). It is assumed that the depth of the object is very small compared to the observation distance. In this context, no motion information is detected along the optical axis (Z-axis). The orthographic projection is a particular case of the perspective12 one, where the orthographic projection matrix is:
\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \;\Longrightarrow\; \begin{cases} x = X \\ y = Y \end{cases}   (6.214)
Combining Eq. (6.214) with the roto-translation matrix (6.210), we obtain an affine
projection:
\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}   (6.215)
from which simplifying (eliminating the last column in the first matrix and the last
row in the second matrix) and expressing in nonhomogeneous coordinates the ortho-
graphic projection is obtained combined with the extrinsic parameters of the roto-
translation:

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} = R X + T   (6.216)
From (6.213), we know that we cannot determine the absolute positions of the 3D
points. To factorize, it is worth simplifying (6.216) further by assuming the origin of the reference system of the 3D points to coincide with their center of mass, namely:

\frac{1}{n} \sum_{i=1}^{n} X_i = 0   (6.217)
Now we can center the points in each image of the sequence, subtracting from the coordinates x_{ij} the coordinates of their center of mass x̄_j, obtaining
12 The distance between the projection center and the image plane is assumed infinite, with the focal length tending to infinity.
\tilde{x}_{ij} = x_{ij} - \frac{1}{n}\sum_{k=1}^{n} x_{kj} = R_j X_i + T_j - \frac{1}{n}\sum_{k=1}^{n} (R_j X_k + T_j) = R_j \Big( X_i - \underbrace{\frac{1}{n}\sum_{k=1}^{n} X_k}_{=0} \Big) = R_j X_i   (6.218)
We are now able, with (6.218), to factorize by aggregating the centered data in large matrices. In particular, the coordinates of the centered 2D points x̃_{ij} are placed in a single matrix W = [U; V], organized into two submatrices, each of size m × n. In the m rows of the submatrix U are placed the horizontal coordinates of the n centered 2D points relative to the m images. Similarly, in the m rows of the submatrix V the vertical coordinates of the n centered 2D points are placed. Thus we obtain the matrix W, called the matrix of the measures, of size 2m × n. In analogy to W, we can construct the rotation matrix M = [R1; R2] relative to all the m images, indicating with R1_j = [r_{11} r_{12} r_{13}] and R2_j = [r_{21} r_{22} r_{23}], respectively, the first two rows of the jth camera rotation matrix (Eq. 6.216) representing the motion information.13 Rewriting Eq. (6.218) in matrix form, we get
\begin{bmatrix} \tilde{x}_{11} & \tilde{x}_{12} & \cdots & \tilde{x}_{1n} \\ \vdots & & & \vdots \\ \tilde{x}_{m1} & \tilde{x}_{m2} & \cdots & \tilde{x}_{mn} \\ \tilde{y}_{11} & \tilde{y}_{12} & \cdots & \tilde{y}_{1n} \\ \vdots & & & \vdots \\ \tilde{y}_{m1} & \tilde{y}_{m2} & \cdots & \tilde{y}_{mn} \end{bmatrix} = \begin{bmatrix} R1_1 \\ \vdots \\ R1_m \\ R2_1 \\ \vdots \\ R2_m \end{bmatrix} \underbrace{\begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}}_{3 \times n}   (6.219)
W = MS (6.220)
By virtue of the rank theorem, the matrix of the observed centered measures W, of size 2m × n, has rank at most 3. This statement follows immediately from the properties of the rank. The rank of a matrix of size m × n is at most the minimum between m and n, and the rank of a product matrix A · B is at
13 Recall that the rows of R represent the coordinates in the original space of the unit vectors along the coordinate axes of the rotated space, while the columns of R represent the coordinates in the rotated space of the unit vectors along the axes of the original space.
most the minimum between the rank of A and that of B. Applying this rank theorem to the factoring matrices M · S, we immediately get that the rank of W is at most 3. The importance of the rank theorem is evidenced by the fact that the 2m × n measures taken from the sequence of images are highly redundant for reconstructing the 3D scene. It also tells us that, quantitatively, the 2m × 3 motion information and the 3 × n coordinates of the 3D points would be sufficient to reconstruct the 3D scene.
Unfortunately, neither of these is known, and to solve the problem of the reconstruction of the structure from motion the method of factorization has been proposed, seen as an overdetermined system that can be solved with the least-squares method based on singular value decomposition (SVD). The SVD approach involves the following decomposition of W:
\underset{2m \times n}{W} \;=\; \underset{2m \times 2m}{U} \;\; \underset{2m \times n}{\Sigma} \;\; \underset{n \times n}{V^T}   (6.221)
Truncating the decomposition to the three largest singular values, we obtain

\underset{2m \times n}{\hat{W}} \;=\; \underset{2m \times 3}{U} \;\; \underset{3 \times 3}{\Sigma} \;\; \underset{3 \times n}{V^T}   (6.222)
Essentially, according to the rank theorem, considering only the three greatest singular values of W and the corresponding left and right singular vectors, with (6.222) we get the best estimate of the motion and structure information. Therefore, Ŵ can be considered a good estimate of the ideal observed measures W, which we can decompose as

\underset{2m \times n}{\hat{W}} = \underset{2m \times 3}{\left( U\,\Sigma^{1/2} \right)} \; \underset{3 \times n}{\left( \Sigma^{1/2} V^T \right)} \;\Longrightarrow\; \hat{W} = \hat{M}\hat{S}   (6.223)
where the matrices M̂ and Ŝ, even if different from the ideal M and S, still carry the motion information of the camera and the structure of the scene, respectively. It is pointed out that, except for noise, the matrices M̂ and Ŝ are, respectively, a linear transformation of the true motion matrix M and of the true matrix of the scene structure S. If the observed measure matrix W is acquired with an adequate frame rate, appropriate to the camera motion, we can have a noise level low enough that it can be ignored. This can be checked by analyzing the singular values of W, verifying that the ratio of the third to the fourth singular value is sufficiently large.
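A minimal sketch of the rank-3 factorization of Eq. (6.223) with NumPy's SVD is shown below; it assumes the centered measurement matrix W has already been assembled and returns estimates M̂ and Ŝ that, as discussed next, are defined only up to a 3 × 3 transformation. Names are illustrative.

```python
import numpy as np

def factorize_measurements(W):
    """Rank-3 factorization of the centered measurement matrix W (2m x n).

    Returns M_hat (2m x 3) and S_hat (3 x n) as in Eq. (6.223); they are
    related to the true motion and structure by an unknown 3x3 transformation.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the three largest singular values (rank theorem)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]
    M_hat = U3 * np.sqrt(s3)            # U * Sigma^(1/2)
    S_hat = np.sqrt(s3)[:, None] * Vt3  # Sigma^(1/2) * V^T
    return M_hat, S_hat
```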
We also point out that the decomposition obtained with (6.223) is not unique. In fact, any invertible matrix Q of size 3 × 3 would produce an identical decomposition of Ŵ with the matrices M̂Q and Q^{-1}Ŝ, as follows:

\hat{W} = \hat{M}\hat{S} = (\hat{M} Q)(Q^{-1} \hat{S})   (6.224)
Another problem concerns the pairs of rows R1_j, R2_j of M̂, which may not necessarily be orthogonal.14 To solve these problems, we can find a matrix Q such that the rows of M̂ satisfy appropriate metric constraints. Indeed, considering the matrix M̂ as a linear transformation of the true motion matrix M (and similarly for the matrix of the scene structure), we can find a matrix Q such that M = M̂Q and S = Q^{-1}Ŝ. The Q matrix is found by observing that the rows of the true motion matrix M, considered as 3D vectors, must have unit norm and the first m rows R1_j must be orthogonal to the corresponding last m rows R2_j. Therefore, the solution for Q is found with the system of equations deriving from the following metric constraints:
\hat{R1}_i^T Q Q^T \hat{R1}_i = 1 \qquad \hat{R2}_i^T Q Q^T \hat{R2}_i = 1 \qquad \hat{R1}_i^T Q Q^T \hat{R2}_i = 0   (6.225)
14 Let us recall here, from the properties of the rotation matrix R, that it is normalized, that is, the squares of the elements in a row or in a column sum to 1, and it is orthogonal, i.e., the inner product of any pair of rows or any pair of columns is 0.
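A possible way to impose the metric constraints (6.225) is sketched below: the constraints are linear in the symmetric matrix B = QQ^T, so B can be estimated by least squares and Q recovered by Cholesky factorization (assuming B turns out positive definite, which holds up to noise); Q remains defined up to an arbitrary rotation. The function and variable names are illustrative.

```python
import numpy as np

def metric_upgrade(M_hat):
    """Find Q from the metric constraints of Eq. (6.225) by least squares.

    M_hat is the 2m x 3 matrix from the factorization: the first m rows are the
    estimated R1_j, the last m rows the estimated R2_j. Solves linearly for the
    symmetric matrix B = Q Q^T and recovers Q by Cholesky factorization.
    """
    m = M_hat.shape[0] // 2
    R1, R2 = M_hat[:m], M_hat[m:]

    def row(a, b):
        # Coefficients of a^T B b in the 6 independent entries of symmetric B
        return [a[0]*b[0], a[0]*b[1] + a[1]*b[0], a[0]*b[2] + a[2]*b[0],
                a[1]*b[1], a[1]*b[2] + a[2]*b[1], a[2]*b[2]]

    A, rhs = [], []
    for j in range(m):
        A.append(row(R1[j], R1[j])); rhs.append(1.0)   # unit norm of R1_j
        A.append(row(R2[j], R2[j])); rhs.append(1.0)   # unit norm of R2_j
        A.append(row(R1[j], R2[j])); rhs.append(0.0)   # orthogonality
    b6, *_ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)
    B = np.array([[b6[0], b6[1], b6[2]],
                  [b6[1], b6[3], b6[4]],
                  [b6[2], b6[4], b6[5]]])
    return np.linalg.cholesky(B)   # Q (defined up to a rotation)
```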
Summarizing, the factorization method consists of the following steps:

1. Acquire a sequence of m images and track the n points of interest in all the images, so that the points remain visible without occlusions and the distance from the scene is much greater than the depth of the objects of the scene to be reconstructed.
2. Organize the 2D centered measures in the W matrix such that each pair of rows jth and (j+m)th contains the horizontal and vertical coordinates of the n 3D points projected in the jth image. Thus, a column of W contains, in its first m elements, the horizontal coordinates of a 3D point observed in the m images, and in the last m elements the vertical coordinates of the same point. It follows that W has dimensions 2m × n.
3. Calculate the decomposition W = UΣV^T with the SVD method, which produces the following matrices:
– U of size 2m × 2m.
– Σ of size 2m × n.
– V^T of size n × n.
15 The Grotta dei Cervi (Deer Cave) is located in Porto Badisco near Otranto in Apulia, Italy, at a depth of 26 m below sea level. It is an important cave: in fact, it is the most impressive Neolithic pictorial complex in Europe, discovered only recently, in 1970.
Fig. 6.40 Results of the 3D scene reconstructed with the factorization method. The first row shows the three images of the sequence with the points of interest used. The second row shows the 3D image reconstructed as described in [53]

The 3D scene was reconstructed with VRML software16 using a triangular mesh built on the points of interest shown on the three images and superimposing the texture of the 2D images.
References
1. J.J. Gibson, The Perception of the Visual World (Sinauer Associates, 1995)
2. T. D’orazio, M. Leo, N. Mosca, M. Nitti, P. Spagnolo, A. Distante, A visual system for real
time detection of goal events during soccer matches. Comput. Vis. Image Underst. 113 (2009a),
622–632
3. T. D’Orazio, M. Leo, P. Spagnolo, P.L. Mazzeo, N. Mosca, M. Nitti, A. Distante, An investi-
gation into the feasibility of real-time soccer offside detection from a multiple camera system.
IEEE Trans. Circuits Syst. Video Surveill. 19(12), 1804–1818 (2009b)
4. A. Distante, T. D’Orazio, M. Leo, N. Mosca, M. Nitti, P. Spagnolo, E. Stella, Method
and system for the detection and the classification of events during motion actions. Patent
PCT/IB2006/051209, International Publication Number (IPN) WO/2006/111928 (2006)
5. L. Capozzo, A. Distante, T. D’Orazio, M. Ianigro, M. Leo, P.L. Mazzeo, N. Mosca, M. Nitti,
P. Spagnolo, E. Stella, Method and system for the detection and the classification of events
during motion actions. Patent PCT/IB2007/050652, International Publication Number (IPN)
WO/2007/099502 (2007)
16 Virtual Reality Modeling Language (VRML) is a programming language that allows the simulation
of three-dimensional virtual worlds. With VRML it is possible to describe virtual environments that
include objects, light sources, images, sounds, movies.
6. B.A. Wandell, Book Rvw: Foundations of vision. By B.A. Wandell. J. Electron. Imaging 5(1),
107 (1996)
7. B.K.P. Horn, B.G. Schunck, Determining optical flow. Artif. Intell. 17, 185–203 (1981)
8. Ramsey Faragher, Understanding the basis of the kalman filter via a simple and intuitive
derivation. IEEE Signal Process. Mag. 29(5), 128–132 (2012)
9. P. Musoff, H. Zarchan, Fundamentals of Kalman Filtering: A Practical Approach. (Amer-
ican Institute of Aeronautics and Astronautics, Incorporated, 2000). ISBN 1563474557,
9781563474552
10. E. Meinhardt-Llopis, J.S. Pérez, D. Kondermann, Horn-schunck optical flow with a multi-scale
strategy. Image Processing On Line, 3, 151–172 (2013). https://doi.org/10.5201/ipol.2013.20
11. H.H. Nagel, Displacement vectors derived from second-order intensity variations in image
sequences. Comput. Vis., Graph. Image Process. 21, 85–117 (1983)
12. B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo
vision, in Proceedings of Imaging Understanding Workshop (1981), pp. 121–130
13. J.L. Barron, D.J. Fleet, S. Beauchemin, Performance of optical flow techniques. Int. J. Comput. Vis. 12(1), 43–77 (1994)
14. T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow estimation based on
a theory for warping. in Proceedings of the European Conference on Computer Vision, vol. 4
(2004), pp. 25–36
15. M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise
smooth flow fields. Comput. Vis. Image Underst. 63(1), 75–104 (1996)
16. J. Wang, E. Adelson, Representing moving images with layers. Proc. IEEE Trans. Image
Process. 3(5), 625–638 (1994)
17. E. Mémin, P. Pérez, Hierarchical estimation and segmentation of dense motion fields. Int. J.
Comput. Vis. 46(2), 129–155 (2002)
18. Simon Baker, Iain Matthews, Lucas-kanade 20 years on: a unifying framework. Int. J. Comput.
Vis. 56(3), 221–255 (2004)
19. H.-Y. Shum, R. Szeliski, Construction of panoramic image mosaics with global and local
alignment. Int. J. Comput. Vis. 16(1), 63–84 (2000)
20. David G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis.
60, 91–110 (2004)
21. F. Marino, E. Stella, A. Branca, N. Veneziani, A. Distante, Specialized hardware for real-time
navigation. R.-Time Imaging, Acad. Press 7, 97–108 (2001)
22. Stephen T. Barnard, William B. Thompson, Disparity analysis of images. IEEE Trans. Pattern
Anal. Mach. Intell. 2(4), 334–340 (1980)
23. M. Leo, T. D’Orazio, P. Spagnolo, P.L. Mazzeo, A. Distante, Sift based ball recognition in
soccer images, in Image and Signal Processing, vol. 5099, ed. by A. Elmoataz, O. Lezoray, F.
Nouboud, D. Mammass (Springer, Berlin, Heidelberg, 2008), pp. 263–272
24. M. Leo, P.L. Mazzeo, M. Nitti, P. Spagnolo, Accurate ball detection in soccer images using
probabilistic analysis of salient regions. Mach. Vis. Appl. 24(8), 1561–1574 (2013)
25. T. D’Orazio, M. Leo, C. Guaragnella, A. Distante, A new algorithm for ball recognition using
circle hough transform and neural classifier. Pattern Recognit. 37(3), 393–408 (2004)
26. T. D’Orazio, N. Ancona, G. Cicirelli, M. Nitti, A ball detection algorithm for real soccer
image sequences, in Proceedings of the 16th International Conference on Pattern Recognition
(ICPR’02), vol. 1 (2002), pp. 201–213
27. Y. Bar-Shalom, X.R. Li, T. Kirubarajan, Estimation with Applications to Tracking and Navi-
gation (Wiley, 2001). ISBN 0-471-41655-X, 0-471-22127-9
28. S.C.S. Cheung, C. Kamath, Robust techniques for background subtraction in urban traffic
video. Vis. Commun. Image Process. 5308, 881–892 (2004)
29. C.R. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: real-time tracking of the human
body. IEEE Trans. Pattern Anal. 19(7), 780–785 (1997)
30. D. Makris, T. Ellis, Path detection in video surveillance. Image Vis. Comput. 20, 895–903
(2002)
31. C. Stauffer, W.E. Grimson, Adaptive background mixture models for real-time tracking, in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2 (1999)
32. A Elgammal, D. Harwood, L. Davis, Non-parametric model for background subtraction, in
European Conference on Computer Vision (2000), pp. 751–767
33. N.M. Oliver, B. Rosario, A.P. Pentland, A bayesian computer vision system for modeling
human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000)
34. R. Li, Y. Chen, X. Zhang, Fast robust eigen-background updating for foreground detection, in
International Conference on Image Processing (2006), pp. 1833–1836
35. A. Sobral, A. Vacavant, A comprehensive review of background subtraction algorithms evalu-
ated with synthetic and real videos. Comput. Vis. Image Underst. 122(05), 4–21 (2014). https://
doi.org/10.1016/j.cviu.2013.12.005
36. Y. Benezeth, P.M. Jodoin, B. Emile, H. Laurent, C. Rosenberger, Comparative study of back-
ground subtraction algorithms. J. Electron. Imaging 19(3) (2010)
37. M. Piccardi, T. Jan, Mean-shift background image modelling. Int. Conf. Image Process. 5,
3399–3402 (2004)
38. B. Han, D. Comaniciu, L. Davis, Sequential kernel density approximation through mode prop-
agation: applications to background modeling, in Proceedings of the ACCV-Asian Conference
on Computer Vision (2004)
39. D.J. Heeger, A.D. Jepson, Subspace methods for recovering rigid motion i: algorithm and
implementation. Int. J. Comput. Vis. 7, 95–117 (1992)
40. H.C. Longuet-Higgins, K. Prazdny, The interpretation of a moving retinal image. Proc. R. Soc.
Lond. 208, 385–397 (1980)
41. H.C. Longuet-Higgins, The visual ambiguity of a moving plane. Proc. R. Soc. Lond. 223,
165–175 (1984)
42. W. Burger, B. Bhanu, Estimating 3-D egomotion from perspective image sequences. IEEE
Trans. Pattern Anal. Mach. Intell. 12(11), 1040–1058 (1990)
43. A. Branca, E. Stella, A. Distante, Passive navigation using focus of expansion, in WACV96
(1996), pp. 64–69
44. G. Convertino, A. Branca, A. Distante, Focus of expansion estimation with a neural network, in
IEEE International Conference on Neural Networks, 1996, vol. 3 (IEEE, 1996), pp. 1693–1697
45. G. Adiv, Determining three-dimensional motion and structure from optical flow generated by
several moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 7(4), 384–401 (1985)
46. A.R. Bruss, B.K.P. Horn, Passive navigation. Comput. Vis., Graph., Image Process. 21(1), 3–20
(1983)
47. M. Armstrong, A. Zisserman, P. Beardsley, Euclidean structure from uncalibrated images, in
British Machine Vision Conference (1994), pp. 509–518
48. T.S. Huang, A.N. Netravali, Motion and structure from feature correspondences: a review. Proc.
IEEE 82(2), 252–267 (1994)
49. S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera. Int. J. Comput.
Vis. 8(2), 123–151 (1992)
50. D.C. Brown, The bundle adjustment - progress and prospects. Int. Arch. Photogramm. 21(3)
(1976)
51. W. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment - a modern synthesis,
in Vision Algorithms: Theory and Practice ed. by W. Triggs, A. Zisserman, R. Szeliski (Springer,
Berlin, 2000), pp. 298–375
52. C. Tomasi, T. Kanade, Shape and motion from image streams under orthography: a factorization
method. Int. J. Comput. Vis. 9(2), 137–154 (1992)
53. T. Gramegna, L. Venturino, M. Ianigro, G. Attolico, A. Distante, Pre-historical cave fruition through robotic inspection, in IEEE International Conference on Robotics and Automation (2005)
7 Camera Calibration and 3D Reconstruction
7.1 Introduction
In Chap. 2 and Sect. 5.6 of Vol.I, we described the radiometric and geometric aspects
of an imaging system, respectively. In Sect. 6.7.1, we instead introduced the geometric
projection model of the 3D world in the image plane. Let us now see, having defined a geometric projection model, the aspects involved in camera calibration to correct all the sources of geometric distortion introduced by the optical system (radial and tangential distortions, …) and by the digitization system (sensor noise, quantization error, …), information often not provided by vision system manufacturers.
In various applications, a camera is used to detect metric information of the scene
from the image. For example, in the dimensional control of an object it is required to
perform certain accurate control measurements, while for a mobile vehicle, equipped
with a vision system, it is required to self-locate, that is, estimate its position and
orientation with respect to the scene. Therefore, a calibration procedure of the camera becomes necessary, which determines the relative intrinsic parameters (focal length, horizontal and vertical dimensions of the single photoreceptor of the sensor or the aspect ratio, the size of the sensor matrix, the coefficients of the radial distortion model, the coordinates of the principal point or optical center) and the extrinsic parameters.
The latter define the geometric transformation to pass from the world reference sys-
tem to the camera system (the 3 translation parameters and the 3 rotation parameters
around the coordinate axes) described in the Sect. 6.7.1.
While the intrinsic parameters define the internal characteristics of the acquisition system, independently of the position and attitude of the camera, the extrinsic parameters describe its position and attitude regardless of its internal parameters. The accuracy of these parameters determines the accuracy of the measurements derivable from the image. With reference to Sect. 6.7.1, the geometric model underlying the image formation process is described by Eq. (6.211), which we rewrite as follows:
ũ ≈ P X̃ w (7.1)
where ũ indicates the coordinates in the image plane, expressed in pixels (taking
into account the position and orientation of the sensor in the image plane), of 3D
points with coordinates X̃ w , expressed in the world reference system, and P is the
perspective projection matrix, of size 3 × 4, expressed in the most general form:

P = K A M   (7.2)

The matrix P, defined by (7.2), represents the most general perspective projection matrix that includes:
1. The simple canonical perspective transformation defined by the matrix A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} (Eq. 6.206) according to the pinhole model;
2. The effects of discretization of the image plane associated with the sensor through
the matrix K (Eq. 6.208);
3. The geometric transformation that relates the position and orientation of the camera with respect to the scene through the matrix M = \begin{bmatrix} R & T \\ 0^T & 1 \end{bmatrix} (Eq. 6.210).
Essentially, in the camera calibration process, the matrix K is the matrix of intrinsic parameters (at most 5) that models the pinhole perspective projection of 3D points, expressed in camera coordinates, into 2D image coordinates, together with the further transformation needed to take into account the displacement of the sensor in the image plane, that is, the offset of the projection of the principal point on the sensor (see Fig. 6.39). The intrinsic parameters describe the internal characteristics of the camera regardless of its location and position with respect to the scene. Therefore, with the calibration process of the camera, all its intrinsic parameters are determined, which correspond to the elements of the matrix K.
In various applications, it is useful to define the observed scene with respect
to an arbitrary 3D reference system instead of the camera reference system (see
Fig. 6.39). The equation that performs the transformation from one system to the other is given by (6.209), where M is the roto-translation matrix of size 4 × 4
whose elements represent the extrinsic (or external) parameters that characterize the
transformation between coordinates of the world and camera. This matrix models
a geometric relationship that includes the orientation of the camera (through the
rotation matrix R of size 3 × 3) and the translation T (3D vector indicating the
position of the origin O w with respect to the reference system of the camera) as
shown in Fig. 6.39 (in the hypothesis of rigid movement).
The transformation (7.1) is based only on the pinhole perspective projection model
and does not contemplate the effects due to the distortions introduced by the optical
system, normally modeled with other parameters that describe the radial and tan-
gential distortions (see Sect. 4.5 Vol.I). These distortions are very accentuated when
using optics with a large angle of view and in low-cost optical systems.
The radial distortion generates an inward or outward displacement of a 3D point projected in the image plane with respect to its ideal position. It is essentially caused by a defect in the radial curvature of the lenses of the optical system. A negative radial displacement of an image point generates the barrel distortion, that is, it causes the more external points to crowd more and more toward the optical axis, decreasing by a scale factor as the axial distance decreases. A positive radial displacement instead generates the pincushion distortion, that is, it causes the more external points to spread out, increasing by a scale factor as the axial distance increases. This type of distortion has circular symmetry around the optical axis (see Fig. 4.25 Vol.I).
The tangential or decentering distortion is instead caused by the displacement of the lens center with respect to the optical axis. This error is accounted for through the coordinates of the principal point (x̃_0, ỹ_0) in the image plane.
Experimentally, it has been observed that radial distortion is dominant. Although
both optical distortions are generated by complex physical phenomena, they can
be modeled with acceptable accuracy with a polynomial function D(r ), where the
variable r is the radial distance of the image points (x̃, ỹ) (obtained with the ideal
pinhole projection, see Eq. 6.206) from the principal point (u 0 , v0 ). In essence, the
optical distortions influence the pinhole perspective projection coordinates and can
be corrected before or after the transformation in the sensor coordinates, in relation
to the camera’s chosen calibration method.
Therefore, having obtained with (6.206) the distorted projections (x̃, ỹ) of the 3D points of the world in the image plane, the coordinates corrected for radial distortion, indicated with (x̃_c, ỹ_c), are obtained as follows:
x̃c = x̃0 + (x̃ − x̃0 ) · D(r, k) ỹc = ỹ0 + ( ỹ − ỹ0 ) · D(r, k) (7.3)
where (x̃_0, ỹ_0) is the principal point and D(r, k) is the function that models the nonlinear effects of the radial distortion, given by

D(r, k) = 1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots   (7.4)

with

r = \sqrt{(\tilde{x} - \tilde{x}_0)^2 + (\tilde{y} - \tilde{y}_0)^2}   (7.5)
The terms with powers of r greater than the sixth give a negligible contribution and can be assumed null. Experimental tests have shown that 2–3 coefficients k_i are sufficient to correct almost 95% of the radial distortion for a medium-quality optical system.
To also include the tangential distortion, which attenuates the effects of the lens decentering, the tangential correction component is added to Eq. (7.3), obtaining the following equations:
\tilde{x}_c = \tilde{x}_0 + (\tilde{x} - \tilde{x}_0)\left[ 1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots \right] + \left[ p_1 \left( r^2 + 2(\tilde{x} - \tilde{x}_0)^2 \right) + 2 p_2 (\tilde{x} - \tilde{x}_0)(\tilde{y} - \tilde{y}_0) \right]\left[ 1 + p_3 r^2 + \cdots \right]   (7.6)
\tilde{y}_c = \tilde{y}_0 + (\tilde{y} - \tilde{y}_0)\left[ 1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots \right] + \left[ 2 p_1 (\tilde{x} - \tilde{x}_0)(\tilde{y} - \tilde{y}_0) + p_2 \left( r^2 + 2(\tilde{y} - \tilde{y}_0)^2 \right) \right]\left[ 1 + p_3 r^2 + \cdots \right]   (7.7)
As already mentioned, the radial distortion is dominant and is characterized, with this approximation, by the coefficients (k_i, i = 1, …, 3), while the tangential distortion is characterized by the coefficients (p_i, i = 1, …, 3). All of these can be obtained through an ad hoc calibration process by projecting sample patterns on a flat surface, for example, grids with vertical and horizontal lines (or other types of patterns), and then acquiring the distorted images. Knowing the geometry of the projected patterns and calculating the coordinates of the distorted patterns with Eqs. (7.6) and (7.7), we can find the optimal distortion coefficients through a system of nonlinear equations and a nonlinear regression method.
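As an illustration, the following sketch applies the correction of Eqs. (7.6) and (7.7), truncated to the coefficients shown above, to distorted image-plane coordinates; the function name and argument layout are illustrative, and the coefficients are assumed to have been estimated beforehand.

```python
import numpy as np

def correct_distortion(x, y, x0, y0, k, p):
    """Corrected coordinates from distorted ones, following Eqs. (7.6)-(7.7).

    (x, y)   : distorted image-plane coordinates (ideal pinhole units, e.g. mm).
    (x0, y0) : principal point.
    k = (k1, k2, k3) radial coefficients, p = (p1, p2, p3) tangential ones.
    """
    dx, dy = x - x0, y - y0
    r2 = dx**2 + dy**2
    radial = k[0]*r2 + k[1]*r2**2 + k[2]*r2**3   # k1 r^2 + k2 r^4 + k3 r^6
    tang_scale = 1.0 + p[2]*r2                   # (1 + p3 r^2 + ...)
    xc = x + dx*radial + (p[0]*(r2 + 2*dx**2) + 2*p[1]*dx*dy) * tang_scale
    yc = y + dy*radial + (2*p[0]*dx*dy + p[1]*(r2 + 2*dy**2)) * tang_scale
    return xc, yc
```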
and characterized by the calibration matrix of the camera K (see Sect. 6.7.1), to switch from the reference system of the image plane (coordinates x̃_c = (x̃_c, ỹ_c) corrected for radial distortion) to the sensor reference system with the new coordinates, expressed in pixels and indicated ũ = (ũ, ṽ). Recall that the 5 elements of the triangular calibration matrix K are the intrinsic parameters of the camera.
Fig. 7.2 Calibration platforms. a Flat calibration platform observed from different view points; b
Platform having at least two different planes of patterns observed from a single view; c Platform
with linear structures
and extrinsic parameters. Since the autocalibration is based on the pattern matches
determined between the images, it is important that with this method the detection
of the corresponding patterns is accurate. With this approach, the distortion of the
optical system is not considered.
In the literature, calibration methods based on vanishing points are proposed using
parallelism and orthogonality between the lines in 3D space. These approaches rely
heavily on the process of detecting edges and linear structures to accurately determine
vanishing points. Intrinsic parameters and camera rotation matrix can be estimated
from three mutually orthogonal vanishing points [2].
The complexity and accuracy of the calibration algorithms also depend on whether all the intrinsic and extrinsic parameters must be known and whether the optical distortions must be removed or not. For example, in some cases it may be sufficient to use methods that do not require the estimation of the focal length f and of the location of the principal point (u_0, v_0), since only the relationship between the coordinates of the world's 3D points and their 2D coordinates in the image plane is required.
The basic approaches to calibration derive from photogrammetry, which solves the problem by minimizing a nonlinear error function to estimate the geometric and physical parameters of a camera. Less complex solutions have been proposed for computer vision by simplifying the camera model, using linear and nonlinear systems, and operating in real time. Tsai [3] has proposed a two-stage algorithm that will be described in the next section, and a modified version of this algorithm [4] has a four-stage extension including the Direct Linear Transformation (DLT) method in two of the stages.
Other calibration methods [5] are based on the estimation of the perspective pro-
jection matrix P by acquiring an image of calibration platforms with noncoplanar
patterns (at least two pattern planes as shown in Fig. 7.2b) and selecting at least 6 3D
points and automatically detecting the 2D coordinates of the relative projections in
the image plane. Zhang [6] describes instead an algorithm that requires at least two
images acquired from different points of view of a flat pattern platform (see Fig. 7.2a,
c).
The Tsai method [3], also called direct parameter calibration (i.e., it directly recovers
the intrinsic and extrinsic parameters of the camera), uses as calibration platform
two orthogonal planes with black squares patterns on a white background equally
spaced. Of these patterns, we know all the geometry (number, size, spacing, …) and
their position in the 3D reference system of the world, integral with the calibration
platform (see Fig. 7.2b). The acquisition system (normally a camera) is placed in front
of the platform and through normal image processing algorithms all the patterns of
the calibration platform are automatically detected and the position of each pattern
in the reference system of the camera is determined in the image plane. It is then
possible to find the correspondence between the 2D patterns of the image plane and
the visible 3D patterns of the platform.
The relationship between the 3D world coordinates X_w = [X_w Y_w Z_w]^T and the 3D coordinates of the camera X = [X Y Z]^T in the context of a rigid roto-translation transformation (see Sect. 6.7) is given by Eq. 6.209, which, rewritten in explicit matrix form, results in the following:

X = R X_w + T = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}   (7.8)
u - u_0 = \alpha_u \frac{X}{Z} = \alpha_u \frac{r_{11} X_w + r_{12} Y_w + r_{13} Z_w + T_x}{r_{31} X_w + r_{32} Y_w + r_{33} Z_w + T_z}
\qquad
v - v_0 = \alpha_v \frac{Y}{Z} = \alpha_v \frac{r_{21} X_w + r_{22} Y_w + r_{23} Z_w + T_y}{r_{31} X_w + r_{32} Y_w + r_{33} Z_w + T_z}   (7.10)

where α_u and α_v are the horizontal and vertical scale factors expressed in pixels, and u_0 = [u_0 v_0]^T is the sensor's principal point. The parameters α_u, α_v, u_0, and v_0 represent the intrinsic (or internal) parameters of the camera that, together with the extrinsic (or external) parameters R and T, are the unknown parameters. Normally the pixels are square, for which α_u = α_v = α, and α is considered as the focal length of the optical system expressed in units of the pixel size. The sensor skew parameter s, neglected in this case (assumed zero), is a parameter that takes into account the non-rectangularity of the pixel area (see Sect. 6.7).
This method assumes that the image coordinates (u_0, v_0) of the principal point are known (normally assumed to be the center of the image sensor) and considers as unknowns the parameters α_u, α_v, R, and T. To simplify, let us assume (u_0, v_0) = (0, 0) and denote with (u_i, v_i) the ith projection in the image plane of the 3D calibration patterns. Thus the first members of the previous equations become, respectively, u_i − 0 = u_i and v_i − 0 = v_i. After these simplifications, dividing these projection equations member by member, and considering only the first and third members of (7.10), for each ith projected pattern we get the following equation:
u_i \alpha_v (r_{21} X_w^{(i)} + r_{22} Y_w^{(i)} + r_{23} Z_w^{(i)} + T_y) = v_i \alpha_u (r_{11} X_w^{(i)} + r_{12} Y_w^{(i)} + r_{13} Z_w^{(i)} + T_x)   (7.11)
If we now divide both members of (7.11) by α_v, indicate with α = α_u/α_v, which we recall is the aspect ratio of the pixel, and define the following symbols:

ν_1 = r_{21}   ν_5 = α r_{11}
ν_2 = r_{22}   ν_6 = α r_{12}
ν_3 = r_{23}   ν_7 = α r_{13}   (7.12)
ν_4 = T_y    ν_8 = α T_x
where:
A = \begin{bmatrix} u_1 X_w^{(1)} & u_1 Y_w^{(1)} & u_1 Z_w^{(1)} & u_1 & -v_1 X_w^{(1)} & -v_1 Y_w^{(1)} & -v_1 Z_w^{(1)} & -v_1 \\ u_2 X_w^{(2)} & u_2 Y_w^{(2)} & u_2 Z_w^{(2)} & u_2 & -v_2 X_w^{(2)} & -v_2 Y_w^{(2)} & -v_2 Z_w^{(2)} & -v_2 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ u_N X_w^{(N)} & u_N Y_w^{(N)} & u_N Z_w^{(N)} & u_N & -v_N X_w^{(N)} & -v_N Y_w^{(N)} & -v_N Z_w^{(N)} & -v_N \end{bmatrix}   (7.15)
It is shown [7] that if N ≥ 7 and the points are not coplanar, then the matrix A has rank 7 and there exists a nontrivial solution, which is the eigenvector corresponding to the zero eigenvalue of A^T A. In essence, the system can be solved by factorization of the matrix A with the SVD approach (Singular Value Decomposition), that is, A = UΣV^T, where we recall that the diagonal elements of Σ are the singular values of A, and the nontrivial solution ν of the system is proportional to the column of V corresponding to the smallest singular value. Denoting this column vector with ν̄, we have

ν = κ ν̄   or   ν̄ = γ ν   (7.16)
where γ = 1/κ.
According to the symbols reported in (7.12), using the last relation of (7.16)
assuming that the solution is given by the eigenvector ν̄, we have
(ν̄1 , ν̄2 , ν̄3 , ν̄4 , ν̄5 , ν̄6 , ν̄7 , ν̄8 ) = γ (r21 , r22 , r23 , Ty , αr11 , αr12 , αr13 , αTx ) (7.17)
Now let’s see how to evaluate the various parameters involved in the previous equa-
tion.
Since the rows of the rotation matrix have unit norm, from (7.17) it follows that

\|(\bar{\nu}_1, \bar{\nu}_2, \bar{\nu}_3)\|_2 = \sqrt{\bar{\nu}_1^2 + \bar{\nu}_2^2 + \bar{\nu}_3^2} = \sqrt{\gamma^2 \underbrace{(r_{21}^2 + r_{22}^2 + r_{23}^2)}_{=1}} = |\gamma|   (7.18)

\|(\bar{\nu}_5, \bar{\nu}_6, \bar{\nu}_7)\|_2 = \sqrt{\bar{\nu}_5^2 + \bar{\nu}_6^2 + \bar{\nu}_7^2} = \sqrt{\gamma^2 \alpha^2 \underbrace{(r_{11}^2 + r_{12}^2 + r_{13}^2)}_{=1}} = \alpha|\gamma|   (7.19)
1. R is normalized, that is, the sum of the squares of the elements of any row or column is 1 (‖r_i‖² = 1, i = 1, 2, 3).
2. The inner product of any pair of rows or columns is zero (r_1^T r_3 = r_2^T r_3 = r_2^T r_1 = 0).
3. From the first two properties follows the orthonormality of R, i.e., R R^T = R^T R = I, its invertibility, that is, R^{-1} = R^T (its inverse coincides with its transpose), and the determinant of R has modulus 1 (det(R) = 1), where I indicates the identity matrix.
4. The rows of R represent the coordinates in the space of origin of the unit vectors along the coordinate axes of the rotated space.
5. The columns of R represent the coordinates in the rotated space of the unit vectors along the axes of the space of origin. In essence, they represent the direction cosines of the rotated triad axes with respect to the triad of origin.
6. Each row and each column are orthonormal to each other, as they are orthonormal bases of the space. It therefore follows that, given two row or column vectors of the matrix, r_1 and r_2, it is possible to determine the third basis vector as the vector product r_3 = r_1 × r_2.
The scale factor γ is determined up to the sign with (7.18), while by definition the aspect ratio α > 0 is calculated with (7.19). Explicitly, α can be determined as α = ‖(ν̄_5, ν̄_6, ν̄_7)‖_2 / ‖(ν̄_1, ν̄_2, ν̄_3)‖_2. At this point, knowing α, the modulus |γ|, and the solution vector ν̄, we can calculate the rest of the parameters, even if without the sign, since the sign of γ is not known.
Calculation of T_x and T_y. Still considering the estimated vector ν̄ in (7.17), we have
T_x = \frac{\bar{\nu}_8}{\alpha|\gamma|} \qquad T_y = \frac{\bar{\nu}_4}{|\gamma|}   (7.20)
Verification of the orthogonality of the rotation matrix R. Recall that the elements
of R have been calculated starting from the estimated eigenvector ν̄, possible solu-
tion of system (7.14), obtained with the SVD approach. Therefore, it is necessary
to verify whether the computed matrix R satisfies the properties of an orthogonal matrix, where the rows and the columns constitute orthonormal bases, that is, R̂ R̂^T = I, considering R̂ the estimate of R. The orthogonality of R̂ can be imposed again using the SVD approach by calculating R̂ = U I V^T, where the diagonal matrix of the singular values is replaced with the identity matrix I.
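The orthogonality correction R̂ = U I V^T can be sketched as follows; the optional sign adjustment, which enforces det(R̂) = +1 so that R̂ is a proper rotation, is an addition not discussed in the text above.

```python
import numpy as np

def nearest_rotation(R_est):
    """Project an estimated matrix onto the closest orthogonal matrix.

    Implements the correction R_hat = U I V^T described in the text: the
    singular values of the estimate are replaced by ones.
    """
    U, _, Vt = np.linalg.svd(R_est)
    R = U @ Vt
    # Optionally force det(R) = +1 so that R is a proper rotation
    if np.linalg.det(R) < 0:
        U[:, -1] *= -1
        R = U @ Vt
    return R
```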
Determine the sign of γ and calculate T_z, α_u and α_v. The sign of γ is determined by checking the sign of u and X, and the sign of v and Y (see Eq. 7.10), which must be consistent with each other and with the geometric configuration of the camera and of the calibration platform (see Fig. 6.39). Observing Eq. 7.10, we have
that the two members are positive. If we replace in these equations the coordinates
(u, v) of the image plane and the coordinates (X, Y, Z ) of the camera reference
system of a generic point used for the calibration, we could analyze the sign of
Tx , Ty and the sign of the elements of r 1 and r 2 . In fact, we know that αu and
αv are positive as well as Z (that is, the denominator of Eq. 7.10) considering the
origin of the camera reference system and the position of the camera itself (see
Fig. 6.39). Therefore, there must be concordance of signs between the first member
and numerator for the two equations, that is, the following condition must occur:
If these conditions are satisfied, the sign of Tx , Ty is positive and the elements of
r 1 and r 2 are left unchanged together with the translation values Tx , Ty . Otherwise
the signs of these parameters are reversed.
Having determined the parameters R, T_x, T_y, and α, it remains to estimate T_z and α_u (recall from the definition of the aspect ratio α that α_u = α·α_v). Let us reconsider the first equation of (7.10) and rewrite it in the following form:
where

M = \begin{bmatrix} u_1 & (r_{11} X_w^{(1)} + r_{12} Y_w^{(1)} + r_{13} Z_w^{(1)} + T_x) \\ u_2 & (r_{11} X_w^{(2)} + r_{12} Y_w^{(2)} + r_{13} Z_w^{(2)} + T_x) \\ \cdots & \cdots \\ u_N & (r_{11} X_w^{(N)} + r_{12} Y_w^{(N)} + r_{13} Z_w^{(N)} + T_x) \end{bmatrix}   (7.26)
b = \begin{bmatrix} -u_1 (r_{31} X_w^{(1)} + r_{32} Y_w^{(1)} + r_{33} Z_w^{(1)} + T_x) \\ -u_2 (r_{31} X_w^{(2)} + r_{32} Y_w^{(2)} + r_{33} Z_w^{(2)} + T_x) \\ \cdots \\ -u_N (r_{31} X_w^{(N)} + r_{32} Y_w^{(N)} + r_{33} Z_w^{(N)} + T_x) \end{bmatrix}   (7.27)
The solution of this overdetermined linear system can be obtained with the SVD or with the Moore–Penrose pseudo-inverse method, obtaining
\begin{bmatrix} T_z \\ \alpha_u \end{bmatrix} = (M^T M)^{-1} M^T b   (7.28)
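In practice, the solution (7.28) can be computed with a standard least-squares routine rather than by forming the normal equations explicitly; the following sketch assumes M and b have already been built according to (7.26) and (7.27), and the function name is illustrative.

```python
import numpy as np

def solve_tz_alpha_u(M, b):
    """Least-squares solution of the overdetermined system for (Tz, alpha_u).

    M is the N x 2 matrix of Eq. (7.26) and b the N-vector of Eq. (7.27).
    np.linalg.lstsq computes the same solution as the explicit
    pseudo-inverse form (M^T M)^(-1) M^T b of Eq. (7.28), but more stably.
    """
    sol, *_ = np.linalg.lstsq(M, b, rcond=None)
    Tz, alpha_u = sol
    return Tz, alpha_u
```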
The coordinates of the principal point u 0 and v0 can be calculated by virtue of the
orthocenter theorem.2 If in the image plane a triangle is defined by three vanishing
points generated by three groups of parallel and orthogonal lines in the 3D space
(i.e., the vanishing points correspond to 3 orthogonal directions of the 3D world),
then the principal point coincides with the orthocenter of the triangle. The same
calibration platform (see Fig. 7.3) can be used to generate the three vanishing points
using pairs of parallel lines present in the two orthogonal planes of the same platform
[2]. It is shown that from 3 finite vanishing points we can estimate the focal length
f and the principal point (u 0 , v0 ).
The accuracy of the u_0 and v_0 coordinates improves if the principal point is calculated by observing the calibration platform from multiple points of view and then averaging the results. The camera calibration can be performed using the vanishing points, which on the one hand eliminate the problem of finding the correspondences, but on the other hand suffer from the possible presence of vanishing points at infinity and from the inaccuracy of their computation.
Another method to determine the parameters (intrinsic and extrinsic) of the camera is
based on the estimation of the perspective projection matrix P (always assuming the
pinhole projection model) defined by (7.2), which transforms, with Eq. (7.1), the 3D points of the calibration platform (for example, a cube of known dimensions with faces having a checkerboard pattern also of known size), expressed in world homogeneous coordinates X̃_w, into image homogeneous coordinates ũ, the latter expressed in pixels.
In particular, the coordinates (X i , Yi , Z i ) of the corners of the board are assumed to
be known and the coordinates (u i , vi ) in pixels of the same corners projected in the
image plane are automatically determined.
The relation that links the coordinates of the 3D points with the coordinates of their projections in the 2D image plane is given by Eq. (7.1), which, rewritten in extended matrix form, is

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}   (7.29)
From (7.29), we can get three equations, but dividing the first two by the third
equation, we have two equations, given by
u = \frac{p_{11} X_w + p_{12} Y_w + p_{13} Z_w + p_{14}}{p_{31} X_w + p_{32} Y_w + p_{33} Z_w + p_{34}} \qquad v = \frac{p_{21} X_w + p_{22} Y_w + p_{23} Z_w + p_{24}}{p_{31} X_w + p_{32} Y_w + p_{33} Z_w + p_{34}}   (7.30)
from which we have 12 unknowns, which are precisely the elements of the matrix P. Applying these equations to the ith 3D → 2D correspondence relative to a corner of the cube, and placing them in the following linear form, we obtain

p_{11} X_w^{(i)} + p_{12} Y_w^{(i)} + p_{13} Z_w^{(i)} + p_{14} - p_{31} u_i X_w^{(i)} - p_{32} u_i Y_w^{(i)} - p_{33} u_i Z_w^{(i)} - p_{34} u_i = 0
p_{21} X_w^{(i)} + p_{22} Y_w^{(i)} + p_{23} Z_w^{(i)} + p_{24} - p_{31} v_i X_w^{(i)} - p_{32} v_i Y_w^{(i)} - p_{33} v_i Z_w^{(i)} - p_{34} v_i = 0   (7.31)
We can rewrite them as a compact homogeneous linear system, considering that at least N = 6 points are needed to determine the 12 unknown elements of the matrix P:
Ap = 0 (7.32)
where
\[
A = \begin{bmatrix}
X_w^{(1)} & Y_w^{(1)} & Z_w^{(1)} & 1 & 0 & 0 & 0 & 0 & -u_1X_w^{(1)} & -u_1Y_w^{(1)} & -u_1Z_w^{(1)} & -u_1\\
0 & 0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & Z_w^{(1)} & 1 & -v_1X_w^{(1)} & -v_1Y_w^{(1)} & -v_1Z_w^{(1)} & -v_1\\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\
X_w^{(N)} & Y_w^{(N)} & Z_w^{(N)} & 1 & 0 & 0 & 0 & 0 & -u_NX_w^{(N)} & -u_NY_w^{(N)} & -u_NZ_w^{(N)} & -u_N\\
0 & 0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & Z_w^{(N)} & 1 & -v_NX_w^{(N)} & -v_NY_w^{(N)} & -v_NZ_w^{(N)} & -v_N
\end{bmatrix}
\qquad (7.33)
\]
and

\[
p = \begin{bmatrix} p_1 & p_2 & p_3 & p_4 \end{bmatrix}^T \qquad (7.34)
\]

with p_i, i = 1, ..., 4 representing the sub-blocks of the solution vector p of the system (7.32), and 0 representing the zero vector of length 2N.
If we use only N = 6 points, we have a homogeneous linear system with 12
equations and 12 unknowns (to be determined). The points (X i , Yi , Z i ) are projected
in the image plane in (u i , vi ) according to the pinhole projection model. From algebra,
it is known that a homogeneous linear system admits the trivial solution p = 0, that is, the vector p lying in the null space of the matrix A, and we are not interested in this trivial solution. Alternatively, it is shown that (7.32) admits infinitely many solutions if and only if the rank of A is less than the number of unknowns (here 12 = 2N).
In this context, the unknowns can be reduced to 11 if each element of p is divided by one of the elements themselves (for example, the element p_34), thus obtaining only 11 unknowns (with p_34 = 1). Therefore, it would result that rank(A) = 11 is less than the 2N = 12 unknowns and the homogeneous system admits infinitely many solutions. One of the possible nontrivial solutions, in the space of solutions of the homogeneous system, can be determined with the SVD method, where A = UΣV^T and the solution is the column vector of V corresponding to the zero singular value of the matrix A. In fact, a solution p is the eigenvector that corresponds to the smallest eigenvalue of A^TA, up to a proportionality factor. If we denote with p̄ the last column vector of V, which up to a proportionality factor κ is a solution p of the homogeneous system, we have
p = κ p̄ or p̄ = γ p (7.35)
where γ = 1/κ. Given a solution p̄ of the projection matrix P, even if only up to the factor γ, we can now find an estimate of the intrinsic and extrinsic parameters using the elements of p̄. For this purpose, two decomposition approaches of the matrix P are presented, one based on the equation of the perspective projection matrix and the other on the QR factorization. For the first approach, recall the structure of the perspective projection matrix

\[
P = K\begin{bmatrix} R & T \end{bmatrix} \qquad (7.36)
\]

where K is the triangular calibration matrix of the camera given by (6.208), R is the rotation matrix, and T the translation vector.
Before deriving the calibration parameters, considering that the solution obtained is defined up to a scale factor, it is better to normalize p̄, also to avoid a trivial solution of the type p̄ = 0. The normalization obtained by setting p_34 = 1 can cause a singularity if the value of p_34 is close to zero. The alternative is to normalize by imposing the unit-norm constraint on the vector r_3^T, that is, p̄_31² + p̄_32² + p̄_33² = 1, which eliminates the singularity problem [5]. In analogy to what was done in the previous paragraph with the direct calibration method (see Eq. 7.18), we will use γ to normalize p̄. Observing from (7.36) that the first three elements of the third row of P correspond to the vector r_3 of the rotation matrix R, we have

\[
\sqrt{\bar p_{31}^2 + \bar p_{32}^2 + \bar p_{33}^2} = |\gamma| \qquad (7.37)
\]

At this point, dividing the solution vector p̄ by |γ| normalizes it up to the sign. For simplicity, we define the following intermediate vectors q_1, q_2, q_3, q_4 on the matrix solution found, as follows:
\[
\bar P = \begin{bmatrix}
\bar p_{11} & \bar p_{12} & \bar p_{13} & \bar p_{14}\\
\bar p_{21} & \bar p_{22} & \bar p_{23} & \bar p_{24}\\
\bar p_{31} & \bar p_{32} & \bar p_{33} & \bar p_{34}
\end{bmatrix}
=
\left[\begin{array}{c|c}
\begin{matrix} q_1^T \\ q_2^T \\ q_3^T \end{matrix} & q_4
\end{array}\right]
\qquad (7.38)
\]

where q_1^T, q_2^T, q_3^T are the rows of the leading 3 × 3 block of p̄ and q_4 is its fourth column.
We can now determine all the intrinsic and extrinsic parameters by comparing the elements of the projection matrix given by (7.36) with those of the estimated matrix given by (7.38), expressed through the intermediate vectors q_i (p̄ being an approximation of the matrix P). Up to the sign, we can obtain T_z (the element p_34 of the matrix in 7.36) and the elements r_3i of the third row of R, by associating them, respectively, with the corresponding term p̄_34 of (7.38) (i.e., the third component of q_4) and with the elements p̄_3i (i.e., the components of q_3). The sign is determined by examining the equality T_z = ±p̄_34 and knowing whether the origin of the world reference system is positioned in front of or behind the camera. If it is in front, T_z > 0 and the sign must agree with that of p̄_34; otherwise T_z < 0 and the sign must be opposite to that of p̄_34.
According to the properties of the rotation matrix (see Note 1) the other parameters
are calculated as follows:
\[
\alpha_u = \sqrt{q_1^Tq_1 - u_0^2} \qquad \alpha_v = \sqrt{q_2^Tq_2 - v_0^2} \qquad (7.41)
\]

where i = 1, 2, 3 indexes the components of the vectors q_i and r_i.
It should be noted that the skew parameter s has not been determined, having assumed the rectangularity of the sensor pixel area; it can be determined with other, nonlinear methods. The accuracy of the computed P depends very much on the noise present in the starting data, that is, in the projections (X_i, Y_i, Z_i) → (u_i, v_i). This can be verified by checking that the rotation matrix R maintains the orthogonality constraint with det(R) = 1.
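For concreteness, a minimal sketch (assuming NumPy and at least N = 6 correspondences) of the linear estimation of P via the system (7.32)-(7.33); the function name is illustrative and the decomposition into K, R, T is omitted.

import numpy as np

def estimate_projection_matrix(Xw, uv):
    """Linear (DLT) estimate of the 3x4 projection matrix P.

    Xw : (N, 3) 3D points of the calibration object (world coordinates).
    uv : (N, 2) corresponding image points in pixels.
    Builds the 2N x 12 matrix A of Eq. (7.33) and takes the right singular
    vector of the smallest singular value as the solution of A p = 0.
    """
    Xw = np.asarray(Xw, float)
    uv = np.asarray(uv, float)
    N = Xw.shape[0]
    A = np.zeros((2 * N, 12))
    for i, ((X, Y, Z), (u, v)) in enumerate(zip(Xw, uv)):
        A[2 * i]     = [X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u]
        A[2 * i + 1] = [0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v]
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)                 # defined up to scale
    return P / np.linalg.norm(P[2, :3])      # normalize as in Eq. (7.37)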
An alternative approach, to recover the intrinsic and extrinsic parameters from the
estimated projection matrix P, is based on its decomposition into two submatrices
B and b, where the first is obtained by considering the first 3 × 3 elements of P,
while the second represents the last column of P. Therefore, we have the following
breakdown:

\[
P \equiv \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14}\\
p_{21} & p_{22} & p_{23} & p_{24}\\
p_{31} & p_{32} & p_{33} & p_{34}
\end{bmatrix}
= \begin{bmatrix} B & b \end{bmatrix}
\qquad (7.43)
\]

\[
P = K\begin{bmatrix} R & T \end{bmatrix} = K\begin{bmatrix} R & -RC \end{bmatrix}
\qquad (7.44)
\]

\[
B = KR \qquad b = KT
\qquad (7.45)
\]
It is pointed out that T = −RC expresses the translation vector of the origin of the
world reference system to the camera system or the position of the origin of the
world system in the camera coordinates. According to the decomposition (7.43),
considering the first equation of (7.45) and the intrinsic parameters defined with the
matrix K (see Eq. 6.208), we have
\[
\mathcal{B} \equiv BB^T = KR(KR)^T = K\underbrace{RR^T}_{I}K^T = KK^T =
\begin{bmatrix}
\alpha_u^2 + s^2 + u_0^2 & s\alpha_v + u_0v_0 & u_0\\
s\alpha_v + u_0v_0 & \alpha_v^2 + v_0^2 & v_0\\
u_0 & v_0 & 1
\end{bmatrix}
\qquad (7.46)
\]
\[
u_0 = \mathcal{B}_{13} \qquad v_0 = \mathcal{B}_{23} \qquad (7.47)
\]

\[
\alpha_v = \sqrt{\mathcal{B}_{22} - v_0^2} \qquad s = \frac{\mathcal{B}_{12} - u_0v_0}{\alpha_v} \qquad (7.48)
\]

\[
\alpha_u = \sqrt{\mathcal{B}_{11} - u_0^2 - s^2} \qquad (7.49)
\]

\[
R = K^{-1}B \qquad T = K^{-1}b \qquad (7.50)
\]
\[
B^{-1} = QL \qquad (7.51)
\]

where, by definition, Q is an orthogonal matrix and L an upper triangular matrix. From (7.51), it is immediate to derive the following:

\[
B = L^{-1}Q^{-1} = L^{-1}Q^T \qquad (7.52)
\]

Comparing with B = KR in (7.45), the calibration matrix K is identified (up to a scale factor) with the upper triangular factor L^{-1} and the rotation with Q^T; the translation vector then follows as

\[
T = K^{-1}b \qquad (7.53)
\]
The estimate of P can be further refined by minimizing an error function that measures, for each correspondence, the distance between the observed position (u_i, v_i) and the position predicted by the perspective projection equation (7.30). This error function is given by
\[
\min_{p}\sum_{i=1}^{N}\left[
\left(u_i - \frac{p_{11}X_w^{(i)} + p_{12}Y_w^{(i)} + p_{13}Z_w^{(i)} + p_{14}}{p_{31}X_w^{(i)} + p_{32}Y_w^{(i)} + p_{33}Z_w^{(i)} + p_{34}}\right)^2
+
\left(v_i - \frac{p_{21}X_w^{(i)} + p_{22}Y_w^{(i)} + p_{23}Z_w^{(i)} + p_{24}}{p_{31}X_w^{(i)} + p_{32}Y_w^{(i)} + p_{33}Z_w^{(i)} + p_{34}}\right)^2
\right]
\qquad (7.54)
\]
where N is the number of correspondences (X w , Yw , Z w ) → (u, v) which are
assumed to be affected by independent and identically distributed (iid) random noise.
The error function (7.54) is nonlinear and can be minimized using the Levenberg–
Marquardt minimization algorithm. This algorithm is applied starting from initial
values of p calculated with the linear least squares approach described in Sect. 7.4.2.
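A possible sketch of this nonlinear refinement, assuming SciPy's least_squares optimizer as a stand-in for Levenberg-Marquardt and a linear estimate P0 as the starting point; names and structure are illustrative.

import numpy as np
from scipy.optimize import least_squares

def refine_projection_matrix(P0, Xw, uv):
    """Refine P by minimizing the reprojection error of Eq. (7.54).

    P0 : (3, 4) initial estimate from the linear method.
    Xw : (N, 3) 3D points;  uv : (N, 2) measured pixel coordinates.
    """
    Xw = np.asarray(Xw, float)
    uv = np.asarray(uv, float)
    Xh = np.hstack([Xw, np.ones((len(Xw), 1))])      # homogeneous 3D points

    def residuals(p):
        P = p.reshape(3, 4)
        proj = Xh @ P.T                              # N x 3
        pred = proj[:, :2] / proj[:, 2:3]            # perspective division
        return (pred - uv).ravel()

    res = least_squares(residuals, P0.ravel(), method="lm")
    P = res.x.reshape(3, 4)
    return P / np.linalg.norm(P[2, :3])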
This method [6,9] uses a planar calibration platform, that is, a flat chessboard (see Fig. 7.2a), observed from different points of view or, keeping the camera position fixed, by changing the position and attitude of the chessboard. The 3D points of the chessboard, whose geometry is known, are automatically localized in the image plane (with well-known corner detector algorithms, for example that of Harris), and the correspondences (X_w, Y_w, 0) → (u, v) are detected. Without loss of generality, the world 3D reference system is assumed such that the chessboard plane is on Z_w = 0. Therefore, all the 3D points lying in the chessboard plane have the third coordinate
Z w = 0. If we denote by r i the columns of the rotation matrix R, we can rewrite the
projection relation (7.1) of the correspondences in the form:
\[
\tilde u = \begin{bmatrix} u\\ v\\ 1 \end{bmatrix}
= K\begin{bmatrix} r_1 & r_2 & r_3 & T \end{bmatrix}
\begin{bmatrix} X_w\\ Y_w\\ 0\\ 1 \end{bmatrix}
= \underbrace{K\begin{bmatrix} r_1 & r_2 & T \end{bmatrix}}_{homography\; H}
\begin{bmatrix} X_w\\ Y_w\\ 1 \end{bmatrix}
= H\tilde X_w
\qquad (7.55)
\]
from which it emerges that the third column of R (the matrix of the extrinsic parameters) is eliminated, and the homogeneous coordinates ũ in the image plane and the corresponding 2D points on the chessboard plane X̃_w = (X_w, Y_w, 1) are related by the homography matrix H of size 3 × 3, up to a scale factor λ:
λũ = H X̃ w (7.56)
with

\[
H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = K\begin{bmatrix} r_1 & r_2 & T \end{bmatrix} \qquad (7.57)
\]
From (7.58), we can get three equations, but dividing the first two by the third
equation, we have two nonlinear equations in the 9 unknowns which are precisely
the elements of the homography matrix H, given by3 :
\[
u = \frac{h_{11}X_w + h_{12}Y_w + h_{13}}{h_{31}X_w + h_{32}Y_w + h_{33}}
\qquad
v = \frac{h_{21}X_w + h_{22}Y_w + h_{23}}{h_{31}X_w + h_{32}Y_w + h_{33}}
\qquad (7.59)
\]
with (u, v) expressed in nonhomogeneous coordinates. By applying these last equations to the ith correspondence (X_w^{(i)}, Y_w^{(i)}, 0) → (u_i, v_i), related to the corners of the chessboard, we can rewrite them in linear form, as follows:
AH = 0 (7.61)
and
\[
H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix}^T \qquad (7.63)
\]
3 The coordinates (u, v) of the points in the image plane in Eq. (7.58) are expressed in homogeneous coordinates, while in the nonlinear equations (7.59) they are nonhomogeneous (u/λ, v/λ); for simplicity they keep the same notation. Once H has been calculated, from the third equation obtainable from (7.58) we can determine λ = h_31 X_w + h_32 Y_w + 1.
with h_i, i = 1, 2, 3 representing the rows of the solution matrix H of the system (7.61), and 0 representing the zero vector of length 2N.
The homogeneous system (7.61) can be solved with the SVD approach, which decomposes the data matrix (built from at least N correspondences) into the product of three matrices, A = UΣV^T; the solution is the column vector of V corresponding to the zero singular value of the matrix A. In reality, a solution H is the eigenvector that corresponds to the smallest eigenvalue of A^TA, up to a proportionality factor. Therefore, if we denote by h̄ the last column vector of V, it can be a solution of the homogeneous system up to a factor of proportionality. If the coordinates of the corresponding points are exact, the homography transformation is free of errors and the singular value found is zero. Normally this does not happen because of the noise present in the data, especially in the case of overdetermined systems with N > 4, and in this case the smallest singular value is always chosen, seen as the optimal solution of the system (7.61) in the least squares sense (i.e., ‖A·H‖² → minimum).
The system (7.58) can also be solved with the scale-factor constraint h_33 = 1; in this case the unknowns are reduced to 8, with the normal linear system of nonhomogeneous equations expressed in the form:

\[
AH = b \qquad (7.64)
\]

and

\[
b = \begin{bmatrix} u_1 & v_1 & \cdots & u_N & v_N \end{bmatrix}^T \qquad (7.67)
\]
where at least 4 corresponding points are always required to determine the homogra-
phy matrix H. The accuracy of the homography transformation could improve with
N > 4. In the latter case, we would have an overdetermined system solvable with
the least squares approach (i.e., minimizing A · H − b 2 ) or with the method of
the pseudo-inverse.
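A minimal sketch of this linear (DLT) estimation of H, assuming NumPy, at least 4 planar correspondences, and the two equations per point derived from (7.59); the explicit 2N x 9 design matrix is not reproduced in the text, so its row layout here is the standard one and should be read as an assumption.

import numpy as np

def estimate_homography(XY, uv):
    """DLT estimate of the 3x3 homography H with lambda*u = H*Xw (Eq. 7.56).

    XY : (N, 2) planar points (X_w, Y_w) on the chessboard (Z_w = 0).
    uv : (N, 2) corresponding image points in pixels; N >= 4.
    """
    XY = np.asarray(XY, float)
    uv = np.asarray(uv, float)
    A = []
    for (X, Y), (u, v) in zip(XY, uv):
        # From (7.59): h11*X + h12*Y + h13 - u*(h31*X + h32*Y + h33) = 0, etc.
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                 # fix the arbitrary scale (h33 = 1)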
The computation of H done with the preceding linear systems minimizes an algebraic error [10], and is therefore not associated with a physical notion of geometric distance. In the presence of errors in the coordinate measurements of the image points of the correspondences (X_w, Y_w, 0) → (u, v), assumed affected by Gaussian noise, the maximum likelihood estimate of H is obtained by minimizing the geometric reprojection error \(\sum_{i=1}^{N}\|\tilde u_i - H\tilde X_i\|^2\), where N is the number of matches, ũ_i are the points in the image plane affected by noise, and X̃_i are the points on the calibration chessboard, assumed accurate. This function is nonlinear and can be minimized with an iterative method like that of Levenberg–Marquardt.
\[
H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = \lambda K\begin{bmatrix} r_1 & r_2 & T \end{bmatrix} \qquad (7.69)
\]
where λ is an arbitrary nonzero scale factor. From (7.69), we can get the relations
that link the column vectors of R as follows:
h1 = λK r1 (7.70)
h2 = λK r2 (7.71)
from which ignoring the factor λ (not useful in this context) we have
\[
r_1 = K^{-1}h_1 \qquad (7.72)
\]

\[
r_2 = K^{-1}h_2 \qquad (7.73)
\]
We know that the column vectors r_1 and r_2 are orthonormal by virtue of the properties of the rotation matrix R (see Note 1); applying these properties to the previous equations, we get

\[
\underbrace{h_1^TK^{-T}K^{-1}h_2 = 0}_{r_1^Tr_2\,=\,0} \qquad (7.74)
\]

\[
\underbrace{h_1^TK^{-T}K^{-1}h_1 = h_2^TK^{-T}K^{-1}h_2}_{r_1^Tr_1\,=\,r_2^Tr_2\,=\,1} \qquad (7.75)
\]
which are the two relations to which the intrinsic unknown parameters associated
with a homography are constrained. We now observe that the matrix of the intrinsic
unknown parameters K is upper triangular and Zhang defines a new matrix B,
according to the last two constraint equations found, given by
\[
B = K^{-T}K^{-1} = \begin{bmatrix}
b_{11} & b_{12} & b_{13}\\
b_{21} & b_{22} & b_{23}\\
b_{31} & b_{32} & b_{33}
\end{bmatrix} \qquad (7.76)
\]
Now let’s rewrite the constraint equations (7.74) and (7.75), considering the matrix
B defined by (7.76), given by
From these constraint equations, we can derive a relation that links the column vectors h_i = (h_{i1}, h_{i2}, h_{i3})^T, i = 1, 2, 3 of the homography matrix H, given by (7.69), with
where v_ij is the vector obtained from the calculated homography H. Considering both constraint equations (7.74) and (7.75), we can rewrite them in the form of a homogeneous system of two equations, as follows:
\[
\begin{bmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{bmatrix} b = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \qquad (7.82)
\]
where b is the unknown vector. The information of the intrinsic parameters of the
camera is captured by the M observed images of the chessboard of which we have
independently estimated the relative homographies H k , k = 1, . . . , M. Therefore,
every homography projection generates 2 equations of the form (7.82) and, since the unknown vector b has 6 elements, at least 3 different homography projections of the chessboard are necessary (M ≥ 3). These are assembled, starting from Eq. (7.82), in a homogeneous linear system of 2M equations, obtaining the following system:
Vb = 0 (7.83)
Here, remembering that the vectors of the rotation matrix have unitary norm, the scale factor is

\[
\lambda = \frac{1}{\|K^{-1}h_1\|} = \frac{1}{\|K^{-1}h_2\|} \qquad (7.86)
\]

\[
r_3 = r_1\times r_2 \qquad (7.87)
\]
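A minimal sketch of the recovery of the extrinsic parameters of one view from its homography H and the calibration matrix K, following (7.86)-(7.88); the formula T = lambda*K^-1*h3 for the translation is the standard Zhang expression and is assumed here, since the corresponding equation is not reproduced above.

import numpy as np

def extrinsics_from_homography(H, K):
    """Recover (R, T) of one chessboard view from its homography H and K."""
    Kinv = np.linalg.inv(K)
    h1, h2, h3 = H[:, 0], H[:, 1], H[:, 2]
    lam = 1.0 / np.linalg.norm(Kinv @ h1)      # Eq. (7.86)
    r1 = lam * Kinv @ h1
    r2 = lam * Kinv @ h2
    r3 = np.cross(r1, r2)                      # Eq. (7.87)
    T = lam * Kinv @ h3                        # assumed Zhang formula
    R = np.column_stack([r1, r2, r3])
    # R may not be exactly orthogonal because of noise; see Eq. (7.88).
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt, T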
The extrinsic parameters are different for each homography because the points of view of the calibration chessboard are different. The rotation matrix R may not satisfy numerically the orthogonality properties of a rotation matrix, because of the noise in the correspondences. In [8,9], there are techniques to approximate the calculated R with a true rotation matrix. One technique is based on the SVD decomposition, imposing the orthogonality RR^T = I by forcing the matrix Σ to the identity matrix:
\[
\bar R = U\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}V^T \qquad (7.88)
\]
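A small sketch of this SVD-based projection of a noisy estimate onto a true rotation matrix, as in (7.88); the determinant sign correction is an extra safeguard not discussed in the text.

import numpy as np

def closest_rotation(R_noisy):
    """Project a noisy 3x3 matrix onto the closest true rotation (Eq. 7.88)."""
    U, _, Vt = np.linalg.svd(R_noisy)
    R = U @ Vt                         # singular values forced to the identity
    if np.linalg.det(R) < 0:           # enforce det(R) = +1 (safeguard)
        U[:, -1] *= -1
        R = U @ Vt
    return R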
\[
(u - u_0)\cdot D(r, k) = \bar u - u \qquad (7.89)
\]
where we recall that k is the vector of the coefficients of the nonlinear radial distortion function D(r, k), and r = ‖x − x_0‖ = ‖x‖ is the distance of the point x (associated with the projected point u) from the principal point x_0 = (0, 0); this distance is not computed in the image plane in pixels, but in the normalized coordinates of x. In this equation, knowing the ideal coordinates of the projected points u and the observed distorted ones ū, the unknown to be determined is the vector k = (k_1, k_2) (approximating the nonlinear distortion function with only 2 coefficients). Rewriting in matrix form, for each point of each observed image two equations are obtained:
\[
\begin{bmatrix}
(u - u_0)\,r^2 & (u - u_0)\,r^4\\
(v - v_0)\,r^2 & (v - v_0)\,r^4
\end{bmatrix}
\begin{bmatrix} k_1 \\ k_2 \end{bmatrix}
=
\begin{bmatrix} \bar u - u \\ \bar v - v \end{bmatrix}
\qquad (7.90)
\]
The estimate of the vector k = (k1 , k2 ) solution of such overdetermined system can
be determined with the least squares approach with the method of the pseudo-inverse
of Moore–Penrose for which k = ( D T D)−1 D T d, or with SVD or QR factorization
methods.
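A minimal least squares sketch of the estimation of k = (k1, k2) by stacking the two equations (7.90) for all points of all images; input conventions (ideal vs. observed points, separately supplied radii r) are assumptions of this illustration.

import numpy as np

def estimate_radial_distortion(uv_ideal, uv_observed, principal_point, r):
    """Least squares estimate of k = (k1, k2) from Eq. (7.90).

    uv_ideal    : (N, 2) ideal (undistorted) projections (u, v) in pixels.
    uv_observed : (N, 2) observed (distorted) points (u_bar, v_bar).
    principal_point : (u0, v0) in pixels.
    r : (N,) radial distances of the normalized points from x0 = (0, 0).
    """
    uv_ideal = np.asarray(uv_ideal, float)
    uv_observed = np.asarray(uv_observed, float)
    r = np.asarray(r, float)
    u0, v0 = principal_point

    du = uv_ideal[:, 0] - u0
    dv = uv_ideal[:, 1] - v0
    D = np.zeros((2 * len(r), 2))
    D[0::2, 0], D[0::2, 1] = du * r**2, du * r**4
    D[1::2, 0], D[1::2, 1] = dv * r**2, dv * r**4
    d = (uv_observed - uv_ideal).ravel()
    k, *_ = np.linalg.lstsq(D, d, rcond=None)   # k = (D^T D)^-1 D^T d
    return k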
\[
\sum_{j=1}^{N}\sum_{i=1}^{M}\left\|\bar u_{ij} - u_{ij}(K, k, R_i, T_i, X_j)\right\|^2 \qquad (7.92)
\]
This nonlinear least squares minimization problem can be solved by iterative methods such as the Levenberg–Marquardt algorithm. It is useful to start the iterative process with the estimates of the intrinsic parameters obtained in Sect. 7.4.3.2 and with the extrinsic parameters obtained in Sect. 7.4.3.3. The radial distortion coefficients can be initialized to zero or to the values estimated in the previous paragraph. It should be noted that the rotation matrix R has 9 elements despite having 3 degrees of freedom (that is, the three angles of rotation around the 3D axes). The Euler–Rodrigues method [11,12] is used in [6] to express a 3D rotation with only 3 parameters.
7. Refining the accuracy of intrinsic and extrinsic parameters and radial distortion
coefficients initially estimated with least squares methods. Basically, starting from
these initial parameters and coefficients, a nonlinear optimization procedure based
on the maximum likelihood estimation (MLE) is applied globally to all the param-
eters related to the M homography images and the N points observed.
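A small sketch of the 3-parameter (axis-angle) representation mentioned above, converting a rotation vector into a rotation matrix with the classical Rodrigues formula; it is only an illustration of the idea, not the exact parametrization used in [6] (the same conversion is provided, for example, by cv2.Rodrigues).

import numpy as np

def rotation_from_rodrigues(w):
    """Rotation matrix from a 3-parameter rotation vector w.

    The direction of w is the rotation axis and its norm the angle.
    """
    w = np.asarray(w, float)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])            # antisymmetric matrix [k]x
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)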
The camera calibration results, for the different methodologies used, are mainly
influenced by the level of accuracy of the 3D calibration patterns (referenced with
respect to the world reference system) and the corresponding 2D (referenced in the
reference system of the image plane). The latter are dependent on the automatic pat-
tern detection algorithms in the acquired images of the calibration platform. Another
important aspect concerns how the pattern localization error is propagated to deter-
mine the camera calibration parameters. In general, the various calibration methods,
at least theoretically, should produce the same results, but in reality they differ in the solutions adopted to minimize pattern localization and optical system errors.
The propagation of errors is highlighted in particular when the configuration of the
calibration images is modified (for example, the focal length varies), while the exper-
imental configuration remains intact. In this situation, the extrinsic parameters do not
remain stable. Similarly, the instability of the intrinsic ones occurs when the experi-
mental configuration remains the same while only the translation of the calibration
patterns varies.
In the previous paragraphs, we have described the methods for calibrating a sin-
gle camera, that is, we have defined what the characteristic parameters are, how to
determine them with respect to the known 3D points of the scene assuming the pin-
hole projection model. In particular, the following parameters have been described:
the intrinsic parameters that characterize the optical-sensor components defining the
camera intrinsic matrix K , and the extrinsic parameters, defining the rotation matrix
R and the translation vector T with respect to an arbitrary reference system of the
world that characterize the attitude of the camera with respect to the 3D scene.
In the stereo system (with at least two cameras), always considering the pinhole
projection model, a 3D light spot of the scene is seen (projected) simultaneously
in the image plane of the two cameras. While with the monocular vision the 2D
projection of a 3D point defines only the ray that passes through the optical center
and the 2D intersection point with the image plane, in the stereo vision the 3D point
is uniquely determined by the intersection of the homologous rays that generate their
2D projections on the corresponding image planes of the left and right camera (see
Fig. 7.4). Therefore, once the calibration parameters of the individual cameras are
known, it is possible to characterize and determine the calibration parameters of a
stereo system and establish a unique relationship between a 3D point and its 2D
projections on the stereo images.
[Fig. 7.4: geometry of a stereo system; the left and right cameras, with optical centers C_L and C_R, focal length f, image points p_L = (u_L, v_L) and p_R = (u_R, v_R), and projections P_L = (X_L, Y_L, f) and P_R = (X_R, Y_R, f), are related by the rotation R and the translation T.]
According to Fig. 7.4, we can model the projections of the stereo system from the
mathematical point of view as an extension of the monocular model seen as rigid
transformations (see Sect. 6.7) between the reference systems of the cameras and
the world. The figure shows the same nomenclature of monocular vision with the
addition of the subscript L and R to indicate the parameters (optical centers, 2D and
3D reference systems, focal length, 2D projections, …), respectively, of the left and
right camera.
If T is the column vector representing the translation between the two optical
centers C L and C R (the origins of the reference systems of each camera) and R is
the rotation matrix that orients the left camera axes to those of the right camera (or vice
versa), then the coordinates of a world point P w = (X, Y, Z ) of 3D space, denoted
by P L = (X L p , Y L p , Z L p ) and P R = (X R p , Y R p , Z R p ) in the reference system of
the two cameras, are related to each other by the following equations:
P R = R( P L − T ) (7.93)
P L = RT P R + T (7.94)
where R and T characterize the relationship between the left and right camera coor-
dinate systems, which is independent of the projection model of each camera. R and
T are essentially the extrinsic parameters that characterize the stereo system in the
pinhole projection model. Now let’s see how to derive the extrinsic parameters of
the stereo system R, T knowing the extrinsic parameters of the individual cameras.
Normally the cameras are individually calibrated considering known 3D points,
defined with respect to a world reference system. We indicate with
P w = (X w , Yw , Z w ) the coordinates in the world reference system, and with R L ,
T L and R R , T R the extrinsic parameters of the two cameras, respectively, the rota-
tion matrices and the translation column vectors. The relationships that project the
point P w in the image plane of the two cameras (according to the pinhole model),
in the respective reference systems, are the following:
P L = RL Pw + T L (7.95)
P R = RR Pw + T R (7.96)
We assume that the two cameras have been independently calibrated with one of the
methods described in Sect. 7.4, and therefore their intrinsic and extrinsic parameters
are known. The extrinsic parameters of the stereo system are obtained from Eqs.
(7.95) and (7.96) as follows:
\[
P_L = R_LP_w + T_L \overset{(7.96)}{=} R_LR_R^{-1}(P_R - T_R) + T_L
= \underbrace{(R_LR_R^{-1})}_{R^T}P_R \underbrace{-\,(R_LR_R^{-1})T_R + T_L}_{T}
\qquad (7.97)
\]

Comparing with (7.94), we obtain

\[
R^T = R_LR_R^{-1} \iff R = R_RR_L^T \qquad (7.98)
\]

\[
T = T_L - \underbrace{R_LR_R^{-1}}_{R^T}T_R = T_L - R^TT_R \qquad (7.99)
\]
where (7.98) and (7.99) define the extrinsic parameters (the rotation matrix R and the translation vector T) of the stereo system. At this point, the stereo system is
completely calibrated and can be used for 3D reconstruction of the scene starting from
2D stereo projections. This can be done, for example, by triangulation as described
in Sect. 7.5.7.
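A minimal sketch of Eqs. (7.98)-(7.99), computing the stereo extrinsics from the two single-camera calibrations under the convention P_R = R(P_L - T) of (7.93); inputs are assumed to be NumPy arrays of shapes (3, 3) and (3,).

import numpy as np

def stereo_extrinsics(R_L, T_L, R_R, T_R):
    """Extrinsics (R, T) of the stereo pair from the single-camera ones."""
    R = R_R @ R_L.T            # Eq. (7.98): R = R_R R_L^T
    T = T_L - R.T @ T_R        # Eq. (7.99): T = T_L - R^T T_R
    return R, T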
In Sect. 4.6.8, we have already introduced the epipolar geometry. This section
describes how to use epipolar geometry to solve the problem of matching homologous points in a stereo vision system, with the two cameras either calibrated or uncalibrated. In other words, with the epipolar geometry we want to simplify the search
for homologous points between the two stereo images.
Let us recall, with the help of Fig. 7.5a, how a point P in 3D space is acquired by a stereo system and projected (according to the pinhole model) in P_L in the left image plane and in P_R in the right image plane. Epipolar geometry establishes a relationship between the two corresponding projections P_L and P_R in the two stereo images acquired by the cameras (having optical centers C_L and C_R), which can have different intrinsic and extrinsic parameters.
Let us briefly summarize notations and properties of epipolar geometry:
Baseline is the line that joins the two optical centers and defines the inter-optical
distance.
Epipolar Plane is the plane (which we will indicate from now on with π) which contains the baseline. A family of epipolar planes is generated, rotating around the baseline and passing through the 3D points of the considered scene (see Fig. 7.5b). For each point P, an epipolar plane is generated containing the three points {C_L, P, C_R}. An alternative geometric definition is to consider the 3D epipolar plane containing the projection P_L (or the projection P_R) together with the left and right optical centers C_L and C_R.
Epipole is the intersection point of the baseline with the image plane. The epipole can also be seen as the projection of the optical center of a camera on the image plane of the other camera. Therefore, we have two epipoles, indicated with e_L and e_R, respectively, for the left and right image. If the image planes are coplanar (with parallel optical axes), the epipoles are located at infinity in opposite directions (intersection at infinity between the baseline and the image planes, since they are parallel to each other). Furthermore, the epipolar lines are parallel to an axis of each image plane (see Fig. 7.6).
Epipolar Lines indicated with lL and lR are the intersections between an epipolar
plane and the image planes. All the epipolar lines intersect in the relative epipoles
(see Fig. 7.5b).
From Fig. 7.5 and from the properties described above, it can be seen that given a
point P of the 3D space, its projections PL and PR in the image planes, and the optical
centers C L and C R are in the epipolar plane π generated by the triad {C L , P, C R }. It
also follows that the rays drawn backwards from the PL and PR points intersecting
in P are coplanar to each other and lie in the same epipolar plane identified. This last
property is of fundamental importance in finding the correspondence of the projected
points. In fact, if we know PL , to search for the homologous PR in the other image
we have the constraint that the plane π is identified by the triad {C L , P, C R } (i.e.,
from the baseline and from the ray defined by PL ) and consequently also the
ray corresponding to the point PR must lie in the plane π , and therefore PR itself
(unknown) must be on the line of intersection between the plane π and the plane of
the second image.
This intersection line is just the right epipolar line lR that can be thought of as the
projection in the second image of the backprojected ray from PL . Essentially, lR is the
searched epipolar line corresponding to PL and we can indicate this correspondence
as follows:
Fig. 7.5 Epipolar geometry. a The baseline is the line joining the optical centers C_L and C_R and intersects the image planes in the respective epipoles e_L and e_R. Each plane passing through the baseline is an epipolar plane. The epipolar lines l_L and l_R are obtained from the intersection of an epipolar plane with the stereo image planes. b To each point P of the 3D space corresponds an epipolar plane that rotates around the baseline and intersects the relative pair of epipolar lines in the image planes. In each image plane, the epipolar lines intersect in the relative epipole
Fig. 7.6 Epipolar geometry for a stereo system with parallel optical axes. In this case, the image planes are coplanar and the epipoles are at infinity in opposite directions. The epipolar lines are parallel to the horizontal axis of each image plane
PL → lR
which establishes a dual relationship between a point in an image and the associated
line in the other stereo image. For a binocular stereo system, with the epipolar geometry, that is, once the epipoles and the epipolar lines are known, it is possible to restrict the possible correspondences between the points of the two images by searching for the homologue of P_L only on the corresponding epipolar line l_R in the other image and not over the entire image (see Fig. 7.7). This process must be repeated for each 3D point of the scene.
Let us now look at how to formalize epipolar geometry in algebraic terms, using the Essential matrix [13], to find the correspondences P_L → l_R [11]. We denote by P_L and P_R the vectors of the projections of the point P in the left and right camera reference systems (see Fig. 7.8), related by the rigid transformation

\[
P_R = RP_L + T \qquad (7.100)
\]
4 We know that a 3D point, according to the pinhole model, projected in the image plane in P_L defines a ray passing through the optical center C_L (in this case the origin of the stereo system), the locus of aligned 3D points represented by λP_L. These points can be observed by the right camera and referenced in its reference system to determine the homologous points using the epipolar geometry approach. We will see that this is possible, so we will neglect the parameter λ in the following.
5 From now on, the perspective projection matrices will be indicated with P to avoid confusion with the notation used for the 3D points.
Fig. 7.8 Epipolar geometry. Derivation of the essential matrix from the coplanarity constraint
between the vectors CL PL , CR PR and CR CL
Pre-multiplying both members of (7.100) vectorially by T and then scalarly by P_R^T, we get

\[
\underbrace{P_R^T(T\times P_R)}_{=0} = P_R^T(T\times RP_L) + \underbrace{P_R^T(T\times T)}_{=0} \qquad (7.101)
\]

By the property of the vector product, T × T = 0, and by the coplanarity of the vectors in the scalar product, P_R^T(T × P_R) = 0; the previous relationship therefore becomes

\[
P_R^T(T\times RP_L) = 0 \qquad (7.102)
\]
From the geometric point of view (see Fig. 7.8), Eq. (7.102) expresses the coplanarity
of the vectors CL PL , CR PR and CR CL representing, respectively, the projection
rays of a point P in the respective image planes and the direction of the translation vector T.
At this point, from algebra we use the property of the antisymmetric matrix, consisting of only three independent elements that can be considered as the elements of a vector with three components.7 For the translation vector T = (T_x, T_y, T_z)^T, the corresponding antisymmetric matrix is

\[
[T]_\times = \begin{bmatrix} 0 & -T_z & T_y\\ T_z & 0 & -T_x\\ -T_y & T_x & 0 \end{bmatrix} \qquad (7.103)
\]
6 With reference to Note 1, we recall that the matrix R provides the orientation of the camera C R
with respect to the C L one. The column vectors are the direction cosines of C L axes rotated with
respect to the C R .
7 A matrix A is said to be antisymmetric when it satisfies the following properties: A + A^T = 0, i.e., A^T = −A. It follows that the elements on the main diagonal are all zero, while those outside the diagonal satisfy the relation a_ij = −a_ji. This means that the number of independent elements is only n(n − 1)/2, and for n = 3 we have a matrix of size 3 × 3 with only 3 independent elements, which can be considered as the components of a generic three-dimensional vector v. In this case, we use the symbolism [v]_× or S(v) to indicate the operator that transforms the vector v into an antisymmetric matrix, as reported in (7.103). Often this dual form of representation between vector and antisymmetric matrix is used to write the vector (cross) product between two three-dimensional vectors, replacing the traditional form x × y with the simple product [x]_× y or S(x)y.
where conventionally [•]× indicates the operator that transforms a 3D vector into an
antisymmetric matrix 3 × 3. It follows that we can express the vector T in terms of
antisymmetric matrix [T ]× and define the matrix expression:
\[
E = [T]_\times R \qquad (7.104)
\]

\[
P_R^TEP_L = 0 \qquad (7.105)
\]

where E, defined by (7.104), is known as the essential matrix, which depends only on the rotation matrix R and the translation vector T, and is defined up to a scale factor.
Equation (7.105) is still valid by scaling the coordinates from the reference system
of the cameras to those of the image planes, as follows:
\[
p_L = (x_L, y_L)^T = \left(\frac{X_L}{Z_L}, \frac{Y_L}{Z_L}\right)^T
\qquad
p_R = (x_R, y_R)^T = \left(\frac{X_R}{Z_R}, \frac{Y_R}{Z_R}\right)^T
\]

obtaining

\[
p_R^TEp_L = 0 \qquad (7.106)
\]
This equation realizes the epipolar constraints, i.e., for a 3D point projected in the
stereo image planes it relates the homologous vectors p L and p R . It also expresses
the coplanarity between any two corresponding points p L and p R included in the
same epipolar plane for the two cameras.
In essence, (7.105) expresses the epipolar constraint between the rays, coming from the two optical centers, that intersect in the point P of space, while Eq. (7.106) relates homologous points between the image planes. Moreover, for any projection p_L in the left image plane, the essential matrix E gives the corresponding epipolar line in the right image plane as
\[
l_R = Ep_L \qquad (7.107)
\]
\[
l_L^T = p_R^TE \implies l_L = E^Tp_R \qquad (7.109)
\]

which verifies that p_L is on the epipolar line l_L (defined by 7.109), according to Note 8.
Epipolar geometry requires that the epipoles lie at the intersection of the epipolar lines with the baseline defined by the translation vector T. Another property of the essential matrix is that its product with the epipoles e_L and e_R is equal to zero:

\[
e_R^TE = 0 \qquad Ee_L = 0 \qquad (7.111)
\]

In fact, for each point p_L (except e_L) in the left image plane, Eq. (7.107) of the right epipolar line l_R = Ep_L must hold, and the epipole e_R also lies on this line. Therefore, also the epipole e_R must satisfy (7.108), that is, e_R^Tl_R = e_R^TEp_L = 0 for every p_L, thus obtaining e_R^TE = 0. The epipole e_R is thus in the null space on the left of E. Similarly, it is shown for the left epipole e_L that Ee_L = 0, i.e., it is in the null space on the right of E. The equations (7.111) of the epipoles can be used to calculate their position when E is known.
The essential matrix has rank 2 (so it is also singular) since the antisymmetric matrix [T]_× has rank 2. It also has 5 degrees of freedom, 3 associated with the rotation angles and 2 with the vector T, defined up to a scale factor. We point out that, while the essential matrix E associates a point with a line, the homography matrix H associates a point with another point (p_L = Hp_R). An essential matrix has two equal singular values and a third equal to zero. This property can be demonstrated by decomposing it with the SVD method and verifying, with E = UΣV^T, that the first two elements of the main diagonal of Σ are σ_1 = σ_2 ≠ 0 and σ_3 = 0.
Equation (7.106) in addition to solving the problem of correspondence in the
context of epipolar geometry is also used for 3D reconstruction. In this case at least
5 corresponding points are chosen, in the stereo images, generating a linear system
of equations based on (7.106) to determine E and then R and T are calculated. We
will see in detail in the next paragraphs the 3D reconstruction of the scene with the
triangulation procedure.
In the previous paragraph, the coordinates of the points in relation to the epipolar
lines were expressed in the reference system of the calibrated cameras and in accor-
dance with the pinhole projection model. Let us now propose to obtain a relationship
analogous to (7.106) but with the points in the image plane expressed directly in
pixels. Suppose that for the same stereo system considered above the cameras cali-
bration matrices K L and K R are known with the projection matrices P L = K L [I | 0]
and P R = K R [R | T ] for the left and right camera, respectively. We know from Eq.
(6.208) that we can get, for a 3D point with (X, Y, Z ) coordinates, the homogeneous
coordinates in pixels ũ = (u, v, 1) in stereo image planes, which for the two left and
right images are given by
ũ L = K L p̃ L ũ R = K R p̃ R (7.112)
where p̃ L and p̃ R are the homogeneous coordinates, of the 3D point projected in the
stereo image planes, expressed in the reference system of the cameras. These last
coordinates can be derived from (7.112) obtaining
\[
\tilde p_L = K_L^{-1}\tilde u_L \qquad \tilde p_R = K_R^{-1}\tilde u_R \qquad (7.113)
\]
\[
(K_R^{-1}\tilde u_R)^T E\,(K_L^{-1}\tilde u_L) = 0
\;\;\Longrightarrow\;\;
\tilde u_R^T\underbrace{K_R^{-T}EK_L^{-1}}_{F}\tilde u_L = 0
\qquad (7.114)
\]
\[
F = K_R^{-T}EK_L^{-1} \qquad (7.115)
\]

\[
\tilde u_R^TF\tilde u_L = 0 \qquad (7.116)
\]
where the fundamental matrix F has size 3 × 3 and rank 2. As for the essential
matrix E, Eq. (7.116) is the fundamental algebraic tool based on the fundamental
matrix F for the 3D reconstruction of a point P of the scene observed from two
views. The fundamental matrix represents the constraint of the correspondence of
the homologous image points ũ L ↔ ũ R being 2D projections of the same point P
of 3D space. As done for the essential matrix we can derive the epipolar lines and
the epipoles from (7.116). For homologous points ũ L ↔ ũ R , we have for ũ R the
constraint to lie on the epipolar line l R associated with the point ũ L , which is given
by
l R = F ũ L (7.117)
l L = F T ũ R (7.118)
from which it emerges that the epipole e_L is directly associated with F, satisfying F ẽ_L = 0, i.e., it lies in the null space of F, while the epipole e_R satisfies ẽ_R^TF = 0, i.e., it lies in the null space of F^T. It should be noted that the position of the epipoles does not necessarily fall within the domain of the image planes (see Fig. 7.9a).
A further property of the fundamental matrix concerns transposition: if \(\vec F\) is the fundamental matrix relative to the stereo pair of cameras C_L → C_R, then the fundamental matrix \(\overleftarrow F\) of the stereo pair ordered in reverse, C_L ← C_R, is equal to \(\vec F^T\). In fact, applying (7.116) to the ordered pair C_L ← C_R we have

\[
\tilde u_L^T\overleftarrow F\tilde u_R = 0 \implies \tilde u_R^T\overleftarrow F^T\tilde u_L = 0
\quad\text{for which}\quad \overleftarrow F = \vec F^T
\]
Fig. 7.9 Epipolar geometry and projection of homologous points through the homography plane.
a Epipoles on the baseline but outside the image planes; b Projection of homologous points by
homography plane not passing through optical centers
Finally, we analyze a further feature of the fundamental matrix F also rewriting the
equation of the fundamental matrix (7.115) with the essential matrix E expressed
by (7.104), thus obtaining
\[
F = K_R^{-T}EK_L^{-1} = K_R^{-T}[T]_\times RK_L^{-1} \qquad (7.120)
\]
We know that the determinant of the antisymmetric matrix [T]_× is zero; it follows that det(F) = 0 and the rank of F is 2. Although both matrices include the constraints of the epipolar geometry of two cameras and simplify the correspondence problem by mapping points of an image only onto the epipolar line of the other image, from Eqs. (7.114) and (7.120), which relate the two matrices F and E, it emerges that the essential matrix uses the camera coordinates and depends on the relative extrinsic parameters (R and T), while the fundamental matrix operates directly with the coordinates in pixels and can abstract from the knowledge of the intrinsic and extrinsic parameters of the cameras. Knowing the intrinsic parameters (the matrices K), it is observed from (7.120) that the fundamental matrix reduces to the essential matrix, and therefore operates directly in camera coordinates. An important difference between the E and F matrices is the number of degrees of freedom: the Essential matrix has 5, while the Fundamental matrix has 7.
With reference to Fig. 7.9b, let π be a plane of the scene not passing through the optical centers: a 3D point P of π is projected in the left image plane in the point p_L with coordinates u_L and in the right image plane in the point p_R with coordinates u_R. Basically, the projection of P in the left and right image planes can be considered as occurring through the plane π. From epipolar geometry we know that p_R lies on the epipolar line l_R (projection of the ray P − p_L), which also passes through the right epipole e_R. Any other point of the plane π is projected in the same way in the stereo image planes, thus realizing a homographic projection H that maps each point p_Li of an image plane into the corresponding point p_Ri of the other image plane. Therefore, the homologous points between the stereo image planes can be considered as mapped by the 2D homography transformation:

u_R = Hu_L
Then, imposing the constraint that the epipolar line l_R is the straight line passing through p_R and the epipole e_R, with reference to Note 8, we have

\[
l_R = e_R\times u_R = [e_R]_\times u_R = [e_R]_\times Hu_L = \underbrace{[e_R]_\times H}_{F}\,u_L \qquad (7.121)
\]
from which, considering also (7.117), we obtain the searched relationship between
homography matrix and fundamental matrix, given by
F = [e R ]× H (7.122)
where H is the homography matrix with rank 3, F is the fundamental matrix of rank 2, and [e_R]_× is the epipole vector expressed as an antisymmetric matrix with rank 2. Equation (7.121) is valid for any ith point p_Li projected from the plane π and must satisfy the equation of epipolar geometry (7.116). In fact, replacing in (7.116) u_R as given by the homography transformation, and considering the constraint that the homologue of each p_Li must lie on the epipolar line l_R given by (7.121), we can verify that the constraint of the epipolar geometry remains valid, as follows:
\[
u_R^TFu_L = (\underbrace{Hu_L}_{u_R})^T\underbrace{[e_R]_\times u_R}_{l_R} = u_R^T\underbrace{[e_R]_\times H}_{F}u_L = 0 \qquad (7.123)
\]
Therefore, we have an equation for every correspondence uli ↔ uri , and with n
correspondences we can assemble a homogeneous system of n linear equations as
follows:
\[
\begin{bmatrix}
u_{l1}u_{r1} & u_{l1}v_{r1} & u_{l1} & v_{l1}u_{r1} & v_{l1}v_{r1} & v_{l1} & u_{r1} & v_{r1} & 1\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\
u_{ln}u_{rn} & u_{ln}v_{rn} & u_{ln} & v_{ln}u_{rn} & v_{ln}v_{rn} & v_{ln} & u_{rn} & v_{rn} & 1
\end{bmatrix}
\begin{bmatrix}
f_{11}\\ f_{12}\\ f_{13}\\ f_{21}\\ f_{22}\\ f_{23}\\ f_{31}\\ f_{32}\\ f_{33}
\end{bmatrix}
= Af = 0 \qquad (7.128)
\]
The vector f (and hence F) is defined up to a scale factor and can be determined with linear methods as the null-space solution of the system. Therefore, 8 correspondences are sufficient, from which the name of the algorithm follows. In reality, the coordinates of the homologous points in the stereo images are affected by noise and, to have a more accurate estimate of f, it is useful to use a number of correspondences n ≫ 8. In this case, the system is solved with the least squares method, finding a solution f that minimizes the following summation:
\[
\sum_{i=1}^{n}\left(u_{r_i}^Tf\,u_{l_i}\right)^2 \qquad (7.129)
\]
subject to the additional constraint ‖f‖ = 1, since the scale of f is arbitrary. The least squares solution of f corresponds to the smallest singular value of the SVD decomposition A = UΣV^T, taking the components of the last column vector of V (which corresponds to the smallest eigenvalue of A^TA).
Recalling some properties of the fundamental matrix, a few considerations are necessary. We know that F is a singular square matrix (det(F) = 0) of size 3 × 3 (9 elements) with rank 2. Moreover, F has 7 degrees of freedom, motivated as follows. The constraint of rank 2 implies that any column is a linear combination of the other two; for example, the third is a linear combination of the first two. Therefore, two coefficients specify the linear combination that gives the third column; together with the elements of the first two columns, this suggests that F has eight degrees of freedom. Furthermore, operating in homogeneous coordinates, the elements of F can be scaled by a scale factor without violating the epipolar constraint of (7.116). It follows that the degrees of freedom are reduced to 7.
Another aspect to consider is the effect of the noise present in the correspondence
data on the SVD decomposition of the matrix A. In fact, this causes the ninth singular
value obtained to be different from zero, and therefore the estimate of F is not really
with rank equal to 2. This implies a violation of the epipolar constraint when this
approximate value of F is used, and therefore the epipolar lines (given by Eqs. 7.117
and 7.118) do not exactly intersect in their epipoles. It is, therefore, advisable to
correct the F matrix obtained from the decomposition of A with SVD, effectively
reapplying a new SVD decomposition directly on the first estimate of F to obtain a
new estimate F̂, which minimizes the Frobenius norm,9 as follows:
9 The Frobenius norm is an example of a matrix norm that can be interpreted as the norm of the vector of the elements of a square matrix A, given by

\[
\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}^2} = \sqrt{Tr(A^TA)} = \sqrt{\sum_{i=1}^{r}\lambda_i} = \sqrt{\sum_{i=1}^{r}\sigma_i^2}
\]

where A is the n × n square matrix of real elements, r ≤ n is the rank of A, λ_i = σ_i² is the ith nonzero eigenvalue of A^TA, and σ_i = \sqrt{λ_i} is the ith singular value of the SVD decomposition of A. In the more general case, Tr(A^*A) should be considered, with A^* the conjugate transpose.
where

\[
\hat F = U\Sigma V^T \qquad (7.131)
\]

is the SVD of the first estimate of F. To obtain the rank-2 matrix that best approximates F, the third singular value of this decomposition is set to zero, that is, σ_33 = 0. The best approximation is then obtained by recalculating F with the updated matrix Σ, as follows:

\[
F = U\Sigma V^T = U\begin{bmatrix} \sigma_{11} & 0 & 0\\ 0 & \sigma_{22} & 0\\ 0 & 0 & 0 \end{bmatrix}V^T \qquad (7.132)
\]
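A minimal sketch of this rank-2 correction via SVD (Eqs. 7.131-7.132), assuming NumPy.

import numpy as np

def enforce_rank2(F):
    """Closest rank-2 matrix to F in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0                         # zero the smallest singular value
    return U @ np.diag(s) @ Vt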
Analogously, the essential matrix E can be estimated from correspondences expressed in normalized camera coordinates, p_l = (x_l, y_l, 1) ↔ p_r = (x_r, y_r, 1), by assembling a homogeneous linear system Be = 0,
where B is the data matrix of the correspondences, of size n × 9, the analog of the matrix A of the system (7.128) relative to the fundamental matrix. As with the fundamental matrix, the least squares solution of e corresponds to the smallest singular value of the SVD decomposition B = UΣV^T, taking the components of the last column vector of V (which corresponds to the smallest eigenvalue). The same considerations on data noise remain, so the solution obtained may not be exactly of rank 2; therefore, also for the essential matrix it is convenient to reapply the SVD decomposition directly on the first estimate of E to get a new estimate given by Ê = UΣV^T.
The only difference in the calculation procedure concerns the different properties of the two matrices. Indeed, the essential matrix, with respect to the fundamental one, has the further constraint that its two nonzero singular values are equal. To take this into account, the diagonal matrix Σ is modified by imposing Σ = diag(1, 1, 0), and the essential matrix is E = U diag(1, 1, 0)V^T, which is the best approximation of the normalized essential matrix that minimizes the Frobenius norm. It is also shown that, if from the SVD decomposition Ê = UΣV^T we have Σ = diag(a, b, c) with a ≥ b ≥ c, the closest essential matrix is obtained with Σ = diag((a + b)/2, (a + b)/2, 0), resulting in E = UΣV^T, which is the best approximation in agreement with the Frobenius norm.
For F, we can impose the constraint det(F) = 0, for which we have

det(αF_1 + (1 − α)F_2) = 0

such that F has rank 2. This constraint leads to a nonlinear cubic equation in the unknown α, with F_1 and F_2 known. The number of real solutions α is 1 or 3. In the case of 3 solutions, these must be verified by substituting them in (7.134), discarding the degenerate ones.
Recall that the essential matrix has 5 degrees of freedom; as before, a homogeneous system of linear equations Be = 0 can be set up, with the data matrix B of size 5 × 9 built from only 5 correspondences. Compared to overdetermined systems, its implementation is more complex. In [16], an algorithm is proposed for the estimation of E from just 5 correspondences.
The 8-point algorithm, described above for the estimation of the essential and fundamental matrices, uses the basic least squares approach and, if the error in experimentally determining the coordinates of the correspondences is contained, it produces acceptable results. As with all algorithms, to reduce the numerical instability due to data noise, and above all when, as in this case, the coordinates of the correspondences have a large numerical range (a badly conditioned data matrix alters the SVD, with singular values that should be equal and others that should vanish), it is advisable to apply a normalization process to the data before running the 8-point algorithm [14].
where the centroid (μu , μv ) and the average distance from the centroid μd are cal-
culated for n points as follows:
\[
\mu_u = \frac{1}{n}\sum_{i=1}^{n}u_i \qquad
\mu_v = \frac{1}{n}\sum_{i=1}^{n}v_i \qquad
\mu_d = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(u_i - \mu_u)^2 + (v_i - \mu_v)^2}
\qquad (7.136)
\]
After the normalization of the data, the fundamental matrix F_n is estimated with the approach indicated above, and it subsequently needs to be denormalized to be used with the original coordinates. The denormalized version F is obtained from the epipolar constraint equation as follows:

\[
u_R^TFu_L = \hat u_R^T\underbrace{T_R^{-T}FT_L^{-1}}_{F_n}\hat u_L = \hat u_R^TF_n\hat u_L = 0
\;\;\Longrightarrow\;\;
F = T_R^TF_nT_L
\qquad (7.138)
\]
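A possible end-to-end sketch of the normalized 8-point algorithm (normalization per (7.136), linear solution, rank-2 correction, denormalization per (7.138)); the sqrt(2) target for the average distance and the row layout of the design matrix, written here for the constraint u_R^T F u_L = 0, are the usual choices and should be read as assumptions.

import numpy as np

def _normalization(uv):
    """Similarity T mapping points to zero centroid and average distance sqrt(2)."""
    uv = np.asarray(uv, float)
    mu = uv.mean(axis=0)                                   # centroid, Eq. (7.136)
    md = np.mean(np.linalg.norm(uv - mu, axis=1))          # mean distance
    s = np.sqrt(2) / md
    return np.array([[s, 0, -s * mu[0]],
                     [0, s, -s * mu[1]],
                     [0, 0, 1]])

def normalized_eight_point(uv_L, uv_R):
    """Normalized 8-point estimate of F with u_R^T F u_L = 0 (Eq. 7.116)."""
    uv_L, uv_R = np.asarray(uv_L, float), np.asarray(uv_R, float)
    TL, TR = _normalization(uv_L), _normalization(uv_R)
    hL = np.column_stack([uv_L, np.ones(len(uv_L))]) @ TL.T
    hR = np.column_stack([uv_R, np.ones(len(uv_R))]) @ TR.T

    # One row per correspondence, f ordered as (f11, ..., f33).
    A = np.column_stack([hR[:, 0] * hL[:, 0], hR[:, 0] * hL[:, 1], hR[:, 0],
                         hR[:, 1] * hL[:, 0], hR[:, 1] * hL[:, 1], hR[:, 1],
                         hL[:, 0], hL[:, 1], np.ones(len(hL))])
    _, _, Vt = np.linalg.svd(A)
    Fn = Vt[-1].reshape(3, 3)

    # Enforce rank 2 (Eq. 7.132), then denormalize (Eq. 7.138).
    U, s, Vt2 = np.linalg.svd(Fn)
    Fn = U @ np.diag([s[0], s[1], 0.0]) @ Vt2
    F = TR.T @ Fn @ TL
    return F / np.linalg.norm(F)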
With the 8-point algorithm (see Sect. 7.5.3), we have calculated the fundamental matrix F and, knowing the matrices K of the stereo cameras, it is possible to calculate the essential matrix E with (7.115). Alternatively, E can be calculated directly with (7.106), which we know to include the extrinsic parameters, that is, the rotation matrix R and the translation vector T. R and T are precisely the result of the decomposition of E we want to accomplish. Recall from (7.104) that the essential matrix E can be expressed in the following form:
E = [T ]× R (7.139)
which suggests that we can decompose E into two components, the vector T
expressed in terms of the antisymmetric matrix [T ]× and the rotation matrix R.
By virtue of the theorems demonstrated in [17,18] we have
Theorem 7.2 Suppose that E can be factored into a product RS, where R is an orthogonal matrix and S is an antisymmetric matrix. Let the SVD of E be given by E = UΣV^T, where Σ = diag(k, k, 0). Then, up to a scale factor, the possible factorization is one of the following:

\[
S = UZU^T \qquad R = UWV^T \;\text{ or }\; R = UW^TV^T \qquad E = RS \qquad (7.140)
\]

where W and Z are a rotation matrix and an antisymmetric matrix, respectively, defined as follows:

\[
W = \begin{bmatrix} 0 & 1 & 0\\ -1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix}
\qquad
Z = \begin{bmatrix} 0 & -1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix}
\qquad (7.141)
\]
Since the scale of the essential matrix does not matter, it has 5 degrees of freedom. The reduction from 6 to 5 degrees of freedom produces an extra constraint on the singular values of E; moreover, we have det(E) = 0, and finally, since the scale is arbitrary, we can assume both nonzero singular values equal to 1, having an SVD given by

\[
E = U\,diag(1, 1, 0)\,V^T \qquad (7.142)
\]

But this decomposition is not unique. Furthermore, U and V being orthogonal matrices, det(U) = det(V^T) = ±1; if we have an SVD like (7.142) with det(U) = det(V^T) = −1, then we can change the sign of the last column of V. Alternatively, we can change the sign of E and then get the SVD −E = U diag(1, 1, 0)(−V)^T with det(U) = det(−V^T) = 1. It is highlighted that the SVD of −E generates a different decomposition, since it is not unique.
Now let's see, with the decomposition of E according to (7.142), the possible solutions, considering that

\[
ZW = diag(1, 1, 0) \qquad ZW^T = -\,diag(1, 1, 0) \qquad (7.143)
\]

We can then write E = S_1R_1, where

\[
S_1 = -UZU^T \qquad R_1 = UW^TV^T \qquad (7.144)
\]

and E = S_2R_2, where

\[
S_2 = UZU^T \qquad R_2 = UWV^T \qquad (7.145)
\]
Now let's see if these are two possible solutions for E, by first checking whether R_1 and R_2 are rotation matrices. In fact, remembering the properties (see Note 1) of rotation matrices, the following must hold:

\[
R_1^TR_1 = \left(UW^TV^T\right)^TUW^TV^T = VWU^TUW^TV^T = I \qquad (7.146)
\]

and therefore R_1 is orthogonal. It must also be shown that det(R_1) = 1:

\[
det(R_1) = det\!\left(UW^TV^T\right) = det(U)\,det(W^T)\,det(V^T) = det(W)\,det(UV^T) = 1 \qquad (7.147)
\]
To verify that the possible decompositions are valid or that the last equation of (7.140)
is satisfied, we must get E = S1 R1 = S2 R2 by verifying
\[
S_1R_1 = -UZU^TUW^TV^T = -UZW^TV^T \overset{(7.143)}{=} -U\left(-\,diag(1, 1, 0)\right)V^T = E \qquad (7.149)
\]
By virtue of (7.142), the last step of the (7.149) shows that the decomposition S1 R1
is valid. Similarly it is shown that the decomposition S2 R2 is also valid. Two possible
solutions have, therefore, been reached for each essential matrix E and it is proved
to be only two [10].
Similarly to what has been done for the possible solutions of R we have to examine
the possible solutions for the translation vector T which can assume different values.
We know that T is encapsulated in the antisymmetric matrix S, such that S = [T]_×, obtained from the two possible decompositions. By the definition of the vector product, we have

\[
ST = [T]_\times T = UZU^TT = T\times T = 0 \qquad (7.150)
\]

Therefore, the vector T is in the null space of S, which is the same as the null space of the matrices S_1 and S_2. It follows that the searched estimate of T from this decomposition, by virtue of (7.150), corresponds to the third column of U, as follows10:

\[
T = U\begin{bmatrix} 0\\ 0\\ 1 \end{bmatrix} = u_3 \qquad (7.151)
\]
10 For the decomposition predicted by (7.140), the solution must be T = U[0 0 1]^T, since it must satisfy (7.150), that is, ST = 0, according to the property of an antisymmetric matrix. In fact, for T = u_3 the following condition is satisfied:

\[
Su_3 = UZU^Tu_3 = U\begin{bmatrix} 0 & -1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix}U^Tu_3
= \begin{bmatrix} u_2 & -u_1 & 0 \end{bmatrix}\begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}^Tu_3
= u_2u_1^Tu_3 - u_1u_2^Tu_3 = 0
\]
Let us now observe that if T is in the null space of S, the same is true for λT; in fact, for any nonzero value of λ we have a valid solution, since [λT]_×R = λE, which is still a valid essential matrix defined up to an unknown scale factor λ. We know that this decomposition is not unique, given the ambiguity of the sign of E; consequently, also the sign of T is undetermined, considering that S = U(±Z)U^T.
Summing up, for a given essential matrix, there are 4 possible choices of projection
matrices P R for the right camera, since there are two choice options for both R and
T , given by the following:
\[
P_R = \left[\,UWV^T \mid \pm u_3\,\right] \;\text{ or }\; \left[\,UW^TV^T \mid \pm u_3\,\right] \qquad (7.153)
\]
By obtaining 4 potential pairs (R, T ) there are 4 possible configurations of the stereo
system by rotating the camera in a certain direction or in the opposite direction with
the possibility of translating it in two opposite directions as shown in Fig. 7.10. The
choice of the appropriate pair is made for each 3D point to be reconstructed by
triangulation by selecting the one where the points are in front of the stereo system
(in the direction of the positive z axis).
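A minimal sketch of the decomposition (7.153) of an essential matrix into the four candidate (R, T) pairs; the determinant sign adjustments on U and V reflect the sign ambiguity discussed around (7.142) and are an implementation choice.

import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, T) pairs of Eq. (7.153).

    The valid pair is selected afterwards by triangulating points and keeping
    the configuration with the points in front of both cameras.
    """
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:          # enforce proper rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0, 1, 0],
                  [-1, 0, 0],
                  [0, 0, 1]])                  # Eq. (7.141)
    t = U[:, 2]                                # third column of U, Eq. (7.151)
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]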
With epipolar geometry, the problem of searching for homologous points is reduced
to mapping a point of an image on the corresponding epipolar line in the other image.
It is possible to simplify the problem of correspondence through a one-dimensional
point-to-point search between the stereo images. For example, we can execute an
appropriate geometric transformation (e.g., projective) with resampling (see Sect. 3.9
Vol. II) on stereo images such as to make the epipolar lines parallel and thus simplify the search for homologous points as a 1D correspondence problem. This also
simplifies the correlation process that evaluates the similarity of the homologous
patterns (described in Chap. 1). This image alignment process is known as recti-
fication of stereo images and several algorithms have been proposed based on the
constraints of epipolar geometry (using uncalibrated cameras where the fundamental
matrix includes intrinsic parameters) and on the knowledge of intrinsic and extrinsic
parameters of calibrated cameras.
Rectification algorithms with uncalibrated cameras [10,19] operate without explicit camera parameter information, which is implicitly included in the essential and fundamental matrices used for image rectification. The nonexplicit use of the calibration parameters makes it possible to simplify the search for homologous points by operating on the aligned homography projections of the images; however, for the 3D reconstruction we have the problem that objects observed at different scales or from different perspectives may appear identical in the homography projections of the aligned images.
In the approaches with calibrated cameras, intrinsic and extrinsic parameters are
used to perform geometric transformations to horizontally align the cameras and
make the epipolar lines parallel to the x-axis. In essence, the images transformed
for alignment can be thought of as reacquired with a new configuration of the stereo
system where the alignment takes place by rotating the cameras around their optical
axes with the care of minimizing distortion errors in perspective reprojections.
For this canonical configuration, it emerges that the vertical coordinate y is the same for the homologous points and the equation of the epipolar line l_R = (0, −b, by_L), associated to the point p_L, is horizontal. Similarly, for the epipolar line l_L = E^Tp_R = (0, b, −by_R), associated with the point p_R. Therefore, a 3D point of the scene always appears on the same line in the two stereo images.
The same result is obtained by calculating the fundamental matrix F for the parallel stereo cameras. Indeed, assuming for the two cameras the perspective projection matrices P_L = K_L[I | 0] and P_R = K_R[R | T], with K_L = K_R = I, R = I, and T = (b, 0, 0), where b is the baseline, by virtue of Eq. (7.120) we get the fundamental matrix:

\[
F = K_R^{-T}\underbrace{[T]_\times R}_{E}K_L^{-1}
= \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 0 & 0 & 0\\ 0 & 0 & -b\\ 0 & b & 0 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 0 & 0 & 0\\ 0 & 0 & -1\\ 0 & 1 & 0 \end{bmatrix}
\qquad (7.156)
\]
We thus have that, even with F, the vertical coordinate v is the same for the homologous points and the equation of the epipolar line l_R = Fu_L = (0, −1, v_L), associated with the point u_L, is horizontal. Similarly, for the epipolar line l_L = F^Tu_R = (0, 1, −v_R), associated with the point u_R.
Now let’s see how to rectify the stereo images acquired in the noncanonical config-
uration, with the converging and non-calibrated cameras, of which we can estimate
the fundamental matrix (with the 8-point normalized algorithm) and consequently
calculate the epipolar lines relative to the two images for the similar points consid-
ered. Known the fundamental matrix and the epipolar lines, it is then possible to
calculate the relative epipoles.11
At this point, knowing the epipoles e_L and e_R, we can already check whether the stereo system is in the canonical configuration. From the epipolar geometry, we know (from Eq. 7.119) that the epipole is the vector in the null space of the fundamental matrix F, for which F · e = 0. Therefore, from (7.156) the fundamental matrix of a canonical configuration is known, and in this case we will have
F \cdot e = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = 0     (7.158)
11 According to epipolar geometry, we know that the epipolar lines intersect at the corresponding epipoles. Since noise is present in the coordinates of the correspondences, in reality the epipolar lines do not intersect at a single point e but within a small area. It is therefore necessary to optimize the computation of the position of each epipole, considering the center of gravity of this area; this is achieved with the least squares method that minimizes this error. Remembering that each line is represented by a 3D vector of the form l_i = (a_i, b_i, c_i), the set of epipolar lines {l_1, l_2, ..., l_n} can be grouped into an n × 3 matrix L, forming a homogeneous linear system L · e = 0 in the unknown epipole vector e, solvable with the SVD (singular value decomposition) method.
for which
e = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}     (7.159)
is the solution vector of the epipole corresponding to the configuration with parallel
cameras, parallel epipolar lines, and epipole at infinity in the horizontal direction.
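As an illustration of this check, the following sketch (Python with NumPy; the function names are illustrative) computes the two epipoles as null vectors of an estimated fundamental matrix F via SVD and tests whether they have the canonical form (7.159); the same SVD route applies to the system L · e = 0 of footnote 11.

import numpy as np

def epipoles_from_F(F):
    # Right null vector (F e_L = 0) and left null vector (F^T e_R = 0) of F.
    U, S, Vt = np.linalg.svd(F)
    e_L = Vt[-1]      # last right-singular vector
    e_R = U[:, -1]    # last left-singular vector
    return e_L, e_R

def is_canonical(e, tol=1e-6):
    # Epipole at infinity along the horizontal axis: e proportional to (1, 0, 0).
    e = e / np.linalg.norm(e)
    return abs(abs(e[0]) - 1.0) < tol and abs(e[1]) < tol and abs(e[2]) < tol

# Example with the fundamental matrix of the canonical configuration, Eq. (7.156):
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
e_L, e_R = epipoles_from_F(F)
print(is_canonical(e_L), is_canonical(e_R))   # True True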
If the configuration is not canonical, it is necessary to carry out an appropriate
homography transformation for each stereo image to make them coplanar with each
other (see Fig. 7.11), so as to obtain each epipole at infinity along the horizontal axis,
according to (7.159).
If we indicate with H_L and H_R the homography transforms that rectify, respectively, the original left and right images, and with û_L and û_R the homologous points in the rectified images, the latter are defined as follows:
û L = H L ũ L û R = H R ũ R (7.160)
where ũ L and ũ R are homologous points in the original images of the noncanonical
stereo system of which we know F. We know that the latter satisfy the constraint of
the epipolar geometry given by (7.116) so considering Eq. (7.160) we have
\tilde{u}_R^T F \tilde{u}_L = (H_R^{-1}\hat{u}_R)^T F (H_L^{-1}\hat{u}_L) = \hat{u}_R^T \underbrace{H_R^{-T} F H_L^{-1}}_{\hat{F}} \hat{u}_L = 0     (7.161)
from which it follows that the fundamental matrix F̂ of the rectified images must, according to (7.156), reduce to the following factorization:
\hat{F} = H_R^{-T} F H_L^{-1} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}     (7.162)
Therefore, once homography transforms H_L and H_R satisfying (7.162) are found, the images are rectified and the epipoles are mapped to infinity as required. The problem is that these homography transformations are not unique and, if chosen improperly, they generate distorted rectified images. One idea is to consider homography transformations close to rigid transformations, rotating and translating the image with respect to a point of the image (for example, the center of the image). This is equivalent to carrying out the rectification with the techniques described in Chap. 3 Vol. II with linear geometric transformations and image resampling. In [19], an approach is described which minimizes the distortions of the rectified images by decomposing the homographies into elementary transformations:
H = H_p H_r H_s
where H and L are the height and width of the image, respectively. After applying the translation, we apply a rotation R to position the epipole on the horizontal axis at a certain point (f, 0, 1). If the translated epipole T e_R is at position (e_{Ru}, e_{Rv}, 1), the applied rotation is
R = \begin{bmatrix}
\alpha \frac{e_{Ru}}{\sqrt{e_{Ru}^2+e_{Rv}^2}} & \alpha \frac{e_{Rv}}{\sqrt{e_{Ru}^2+e_{Rv}^2}} & 0 \\
-\alpha \frac{e_{Rv}}{\sqrt{e_{Ru}^2+e_{Rv}^2}} & \alpha \frac{e_{Ru}}{\sqrt{e_{Ru}^2+e_{Rv}^2}} & 0 \\
0 & 0 & 1
\end{bmatrix}     (7.164)
Therefore, the homography transformation H_R for the right image is given by the combination of the three elementary transformations as follows:
H_R = G R T     (7.166)
which, to first order, represents a rigid transformation with respect to the image center.
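A possible sketch of this construction (Python/NumPy) is shown below. The translation T moves the image center to the origin, R is the rotation (7.164) that brings the translated epipole onto the horizontal axis at (f, 0, 1), and G then maps that point to infinity; the explicit form of G used here, G = [[1,0,0],[0,1,0],[-1/f,0,1]], is not given in this excerpt and is an assumption consistent with the decomposition in [10].

import numpy as np

def rectifying_homography_right(e_R, width, height, alpha=1.0):
    # T: translate the image center to the origin.
    T = np.array([[1.0, 0.0, -width / 2.0],
                  [0.0, 1.0, -height / 2.0],
                  [0.0, 0.0, 1.0]])
    # Translated epipole (e_Ru, e_Rv, 1).
    e = T @ np.asarray(e_R, dtype=float)
    e = e / e[2]
    e_Ru, e_Rv = e[0], e[1]
    n = np.sqrt(e_Ru**2 + e_Rv**2)
    # R: rotation (7.164); alpha = +/-1 chooses the sign so that the epipole
    # ends up on the positive x axis, at (f, 0, 1) with f = alpha * n.
    R = np.array([[ alpha * e_Ru / n, alpha * e_Rv / n, 0.0],
                  [-alpha * e_Rv / n, alpha * e_Ru / n, 0.0],
                  [ 0.0,              0.0,              1.0]])
    f = alpha * n
    # G: maps (f, 0, 1) to infinity (f, 0, 0)  [assumed form].
    G = np.array([[1.0,      0.0, 0.0],
                  [0.0,      1.0, 0.0],
                  [-1.0 / f, 0.0, 1.0]])
    return G @ R @ T   # Eq. (7.166): H_R = G R T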
At this point, knowing the homography H_R, we need to find an optimal solution for the homography H_L such that the images rectified with these homographies are as similar as possible, with the least possible distortion. This is possible by searching for the homography H_L that minimizes the difference between the rectified images, setting up a function that minimizes the sum of the squared distances between homologous points of the two images:
\min_{H_L} \sum_i \| H_L u_{L_i} - H_R u_{R_i} \|^2     (7.167)
Without giving the algebraic details described in [10], it is shown that the homography
H L can be expressed in the form:
HL = HAHRM (7.168)
assuming that the fundamental matrix F of the stereo pair of input images is known,
which we express as
F = [e]× M (7.169)
and finally, the minimization problem can be set up as a simple least squares problem
solving a system of linear equations, where the unknowns are the components of the
vector a, given by
U a = b \iff \begin{bmatrix} \hat{u}_{L_1} & \hat{v}_{L_1} & 1 \\ \vdots & \vdots & \vdots \\ \hat{u}_{L_n} & \hat{v}_{L_n} & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \hat{u}_{R_1} \\ \vdots \\ \hat{u}_{R_n} \end{bmatrix}     (7.177)
Once we have calculated the vector a with (7.170), we can calculate H_A, estimate H_L with (7.168), and, with the other homography matrix H_R already computed, we can rectify each pair of acquired stereo images using the n correspondences.
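As a sketch of this least squares step (Python/NumPy): the system (7.177) is solved for a and H_L is assembled as in (7.168). The identification of û_L with the left points transformed by H_R M, of û_R with the right points transformed by H_R, and the affine form of H_A are not spelled out in this excerpt; they are assumptions consistent with the construction in [10], where M comes from the factorization F = [e]_× M of (7.169).

import numpy as np

def estimate_H_L(u_L, u_R, H_R, M):
    """u_L, u_R: n x 2 arrays of homologous points; H_R, M: 3 x 3 matrices."""
    n = u_L.shape[0]
    ones = np.ones((n, 1))

    def transform(H, pts):
        # Apply a homography and dehomogenize.
        q = (H @ np.hstack([pts, ones]).T).T
        return q[:, :2] / q[:, 2:3]

    hat_L = transform(H_R @ M, u_L)   # assumed \hat{u}_L of Eq. (7.177)
    hat_R = transform(H_R, u_R)       # assumed \hat{u}_R of Eq. (7.177)
    # Least squares solution of U a = b, Eq. (7.177).
    U = np.hstack([hat_L, ones])
    b = hat_R[:, 0]
    a, *_ = np.linalg.lstsq(U, b, rcond=None)
    # Assumed affine form of H_A: only the first image coordinate is adjusted.
    H_A = np.array([[a[0], a[1], a[2]],
                    [0.0,  1.0,  0.0 ],
                    [0.0,  0.0,  1.0 ]])
    return H_A @ H_R @ M   # Eq. (7.168)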
We summarize now the whole rectification procedure for stereo images, based on homography transformations, applied to a pair of images acquired by a stereo system (in the noncanonical configuration) whose epipolar geometry (the fundamental matrix) is known, so that the epipolar lines of the input images are mapped to horizontal lines in the rectified images. The essential steps are
1. Calibrate the cameras to get K , R and T and derive the calibration parameters
of the stereo system.
2. Compute the rotation matrix Rr ect with which to rotate the left camera to map
the left epipole e L to infinity along the x-axis and thus making the epipolar lines
horizontal.
3. Apply the same rotation to the right camera.
4. Calculate for each point of the left image the corresponding point in the new
canonical stereo system.
5. Repeat the previous step even for the right camera.
6. Complete the rectification of the stereo images by adjusting the scale and then
resample.
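For calibrated cameras, the whole procedure is available, for example, in OpenCV; the following is a minimal usage sketch in Python (the variable names are illustrative, and R, T must follow OpenCV's convention of mapping the first camera frame into the second).

import cv2

def rectify_pair(img_L, img_R, K_L, d_L, K_R, d_R, R, T):
    """Rectify a calibrated stereo pair; K: intrinsics, d: distortion, R, T: extrinsics."""
    h, w = img_L.shape[:2]
    # Step 2-3: rectifying rotations R_L, R_R and new projection matrices P_L, P_R.
    R_L, R_R, P_L, P_R, Q, roi_L, roi_R = cv2.stereoRectify(
        K_L, d_L, K_R, d_R, (w, h), R, T)
    # Steps 4-6: per-camera lookup maps realizing the inverse transformation.
    map_Lx, map_Ly = cv2.initUndistortRectifyMap(K_L, d_L, R_L, P_L, (w, h), cv2.CV_32FC1)
    map_Rx, map_Ry = cv2.initUndistortRectifyMap(K_R, d_R, R_R, P_R, (w, h), cv2.CV_32FC1)
    rect_L = cv2.remap(img_L, map_Lx, map_Ly, cv2.INTER_LINEAR)   # bilinear resampling
    rect_R = cv2.remap(img_R, map_Rx, map_Ry, cv2.INTER_LINEAR)
    return rect_L, rect_R, Q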
Fig. 7.12 Rectification of the stereo image planes knowing the extrinsic parameters of the cameras.
The left camera is rotated so that the epipole moves to infinity along the horizontal axis. The same
rotation is applied to the camera on the right, thus obtaining plane images parallel to the baseline.
The horizontal alignment of the epipolar lines is completed by rotating the right camera according
to R^{-1} and possibly adjusting the scale by resampling the rectified images
Step 1 calculates the parameters (intrinsic and extrinsic) of the calibration of the
individual cameras and the stereo system. Normally the cameras are calibrated con-
sidering known 3D points, defined with respect to a world reference system. We
indicate with P w (X w , Yw , Z w ) the coordinates in the world reference system, and
with R L , T L and R R , T R the extrinsic parameters of the two cameras, respectively,
the rotation matrices and the translation column vectors. The relationships that express the point P_w in the reference systems of the two cameras (according to the pinhole model) are the following:
P L = RL Pw + T L (7.178)
P R = RR Pw + T R (7.179)
We assume that the two cameras have been independently calibrated with one of the
methods described in Sect. 7.4, and therefore their intrinsic and extrinsic parameters
are known.
If T is the column vector representing the translation between the two optical cen-
ters (the origins of each camera’s reference systems) and R is the rotation matrix that
orients the right camera axes to those of the left camera, then the relative coordinates
of a 3D point P(X, Y, Z ) in the space, indicated with P L = (X L p , Y L p , Z L p ) and
P R = (X R p , Y R p , Z R p ) in the reference system of the two cameras, are related to
each other with the following:
P L = RT P R + T (7.180)
The extrinsic parameters of the stereo system are computed with Eqs. (7.98) and
(7.99) (derived in Sect. 7.4.4) that we rewrite here
R = R TL R R (7.181)
T = T L − RT T R (7.182)
In step 2, the rotation matrix R_rect is calculated for the left camera, with the purpose of mapping the relative epipole to infinity in the horizontal direction (x axis) and thus obtaining horizontal epipolar lines. From the properties of a rotation matrix we know that its row vectors represent the orientation of the rotated axes (see Note 1). Now let us see how to calculate the three vectors r_i of R_rect. The new x-axis must have the direction of the translation column vector T (the baseline vector joining the optical centers), given by the following unit vector:
joining the optical centers) given by the following unit vector:
⎡ ⎤
Tx
T 1 ⎣ Ty ⎦
r1 = = (7.183)
T T2 + T2 + T2 T
x y z z
The second vector r_2 (the direction of the new y axis) has the only constraint of being orthogonal to r_1. Therefore, it can be calculated as the normalized vector product of the direction vector (0, 0, 1) of the old z axis (the direction of the old optical axis) with r_1, given by
r_2 = \frac{[0\;0\;1]^T \times r_1}{\|[0\;0\;1]^T \times r_1\|} = \frac{1}{\sqrt{T_x^2+T_y^2}} \begin{bmatrix} -T_y \\ T_x \\ 0 \end{bmatrix}     (7.184)
The third vector r_3 represents the new z-axis, which must be orthogonal to the baseline (the vector r_1) and to the new y axis (the vector r_2), so it is obtained as the vector product of these two vectors:
r_3 = r_1 \times r_2 = \frac{1}{\sqrt{(T_x^2+T_y^2)(T_x^2+T_y^2+T_z^2)}} \begin{bmatrix} -T_x T_z \\ -T_y T_z \\ T_x^2+T_y^2 \end{bmatrix}     (7.185)
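A minimal sketch (Python/NumPy) of the construction of R_rect from the baseline vector T, following (7.183)-(7.185); it assumes the baseline is not parallel to the old optical axis (T_x and T_y not both zero).

import numpy as np

def rectifying_rotation(T):
    """Rotation R_rect whose rows are r1, r2, r3 built from the baseline T."""
    T = np.asarray(T, dtype=float)
    r1 = T / np.linalg.norm(T)             # Eq. (7.183): new x-axis along the baseline
    r2 = np.array([-T[1], T[0], 0.0])
    r2 = r2 / np.linalg.norm(r2)           # Eq. (7.184): new y-axis
    r3 = np.cross(r1, r2)                  # Eq. (7.185): new z-axis
    return np.vstack([r1, r2, r3])

# Example: R_rect maps the baseline onto the horizontal axis, as in Eq. (7.189).
T = np.array([0.2, 0.05, 0.01])
R_rect = rectifying_rotation(T)
print(R_rect @ T)   # approximately (||T||, 0, 0)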
We can verify the effect of the rotation matrix Rr ect on the stereo images to be
rectified as follows. Let us now consider the relationship (7.180), which orients the
right camera axes to those of the left camera. Applying R_rect to both members, we have
from which it emerges that the coordinates of the points of the left and right images are in fact rectified, obtaining
having indicated with P_Lr and P_Rr the rectified points, respectively, in the reference systems of the left and right camera. The correction of the points, according to (7.188), is obtained considering that
R_{rect}\, T = \begin{bmatrix} r_1^T T \\ r_2^T T \\ r_3^T T \end{bmatrix} = \begin{bmatrix} \|T\| \\ 0 \\ 0 \end{bmatrix}     (7.189)
(7.190) shows that the rectified points have the same Y and Z coordinates and differ only in the horizontal translation along the X-axis. Thus steps 2 and 3 are accomplished.
The corresponding 2D points rectified in the left and right image planes are
obtained instead from the following:
p_{Lr} = \frac{f}{Z_L} P_{Lr} \qquad\qquad p_{Rr} = \frac{f}{Z_R} P_{Rr}     (7.191)
Thus steps 4 and 5 are realized. Finally, with step 6, to avoid empty areas in the rectified images, the inverse geometric transformation (see Sect. 3.2 Vol. II) is applied to assign to each pixel of the rectified images the value of the corresponding pixel of the input stereo images, resampling when the inverse-transformed position falls between 4 pixels of the input image.
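A minimal sketch of this inverse mapping with bilinear resampling (Python/NumPy, gray-level images, simplified border handling; H_inv denotes the inverse of the rectifying transformation and is an illustrative parameter).

import numpy as np

def warp_inverse_bilinear(img, H_inv, out_shape):
    """Build the rectified image by inverse mapping each output pixel into img."""
    h_out, w_out = out_shape
    out = np.zeros(out_shape, dtype=float)
    for v in range(h_out):
        for u in range(w_out):
            x, y, w = H_inv @ np.array([u, v, 1.0])   # back to the input image
            x, y = x / w, y / w
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            if 0 <= x0 < img.shape[1] - 1 and 0 <= y0 < img.shape[0] - 1:
                a, b = x - x0, y - y0   # fractional offsets between the 4 pixels
                out[v, u] = ((1 - a) * (1 - b) * img[y0, x0]
                             + a * (1 - b) * img[y0, x0 + 1]
                             + (1 - a) * b * img[y0 + 1, x0]
                             + a * b * img[y0 + 1, x0 + 1])
    return out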
The 3D reconstruction of the scene can be realized in different ways, in relation to the knowledge available about the stereo acquisition system. The 3D geometry of the scene can be reconstructed without ambiguity, given the 2D projections of the homologous points of the stereo images, by triangulation, once the calibration parameters (intrinsic and extrinsic) of the stereo system are known. If instead only the intrinsic parameters are known, the 3D geometry of the scene can be reconstructed by estimating the extrinsic parameters of the system, up to an undeterminable scale factor. If the calibration parameters of the stereo system are not available, but only the correspondences between the stereo images are known, the 3D structure of the scene is recovered up to an unknown projective (homography) transformation.
[Fig. 7.13 (a) Rays back-projected from the optical centers c_L and c_R through the image points p_L and p_R toward the 3D point P; (b) distances to be minimized between the reprojected points p_L, p_R and the observed points u_L, u_R in the image planes]
In general, because of noise, the two back-projected rays do not intersect exactly, as suggested in Fig. 7.13a. However, there is only one segment of minimum length, indicated by the column vector v, which is perpendicular to both rays and joins them through the points P̂_L (the extreme obtained from the 3D intersection between the ray l_L and the segment) and P̂_R (the extreme obtained from the 3D intersection between the ray l_R and the segment), as shown in the figure. The problem is then reduced to finding the coordinates of the extreme points P̂_L and P̂_R of the segment.
We now express in vector form, as a p_L and b p_R with a, b ∈ R, the equations of the two rays, in the respective reference systems, passing through the optical centers C_L and C_R. The extremes of the segment to be found are expressed with respect to the reference system of the left camera, with origin in C_L, whereby according to (7.94) the equation of the right ray expressed in the reference system of the left camera is R^T b p_R + T, remembering that R and T represent the extrinsic parameters of the stereo system defined by Eqs. (7.98) and (7.99), respectively. The constraint that the segment, represented by the equation cv with c ∈ R, is orthogonal to the two rays defines the vector v, obtained as the vector product of the two ray directions:
v = p_L \times R^T p_R     (7.192)
where v is also expressed in the reference system of the left camera. The segment represented by cv intersects the ray a p_L for a certain value a_0, thus giving the coordinates of P̂_L, one extreme of the segment. We therefore look for the scalars a, b, and c such that, starting from the point a p_L on the left ray and moving along the direction v of the common perpendicular, we reach the point T + b R^T p_R on the right ray:
a\, p_L + c\, v = T + b\, R^T p_R     (7.193)
The vector equation (7.193) can be set up as a linear system of 3 equations (for three-dimensional vectors) in the 3 unknowns a, b, and c. In fact, replacing the vector v given by (7.192), we can solve the following system:
a\, p_L + c\, p_L \times R^T p_R - b\, R^T p_R = T     (7.194)
If a_0, b_0, and c_0 are the solution of the system, then the intersection between the ray l_L and the segment gives one extreme of the segment, P̂_L = a_0 p_L, while the other extreme is obtained from the intersection of the segment with the ray l_R, given by P̂_R = T + b_0 R^T p_R; the midpoint between the two extremes finally identifies the estimate of P̂ reconstructed in 3D, with the coordinates expressed in the reference system of the left camera.
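A minimal sketch of this midpoint triangulation (Python/NumPy), solving the 3 × 3 system (7.194); p_L and p_R are the ray directions in the respective camera frames and the result is expressed in the left camera frame.

import numpy as np

def triangulate_midpoint(p_L, p_R, R, T):
    """Midpoint of the common perpendicular between the rays a*p_L and T + b*R^T p_R."""
    p_L = np.asarray(p_L, dtype=float)
    T = np.asarray(T, dtype=float)
    q = np.asarray(R, dtype=float).T @ np.asarray(p_R, dtype=float)   # R^T p_R
    v = np.cross(p_L, q)                                              # Eq. (7.192)
    # Solve a*p_L - b*q + c*v = T, Eq. (7.194), for the unknowns (a, b, c).
    A = np.column_stack([p_L, -q, v])
    a0, b0, c0 = np.linalg.solve(A, T)
    P_hat_L = a0 * p_L          # extreme on the left ray
    P_hat_R = T + b0 * q        # extreme on the right ray
    return 0.5 * (P_hat_L + P_hat_R)   # estimated 3D point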
where P L i and P Ri indicate the rows of the two perspective projection matrices,
respectively. The perspective projections in Cartesian coordinates u L = (u L , v L )
and u R = (u R , v R ) are
u_L = \frac{P_{L_1}^T X}{P_{L_3}^T X} \qquad v_L = \frac{P_{L_2}^T X}{P_{L_3}^T X}     (7.196)
u_R = \frac{P_{R_1}^T X}{P_{R_3}^T X} \qquad v_R = \frac{P_{R_2}^T X}{P_{R_3}^T X}     (7.197)
12 The same equations can be obtained, for each camera, considering the properties of the vector
product p × (PX) = 0, that is, by imposing the constraint of parallel direction between the vectors.
Once the vector product has been developed, three equations are obtained but only two are linearly
independent of each other.
Proceeding in the same way for the homologous point u R , from (7.197) we get
two other linear equations that we can assemble in (7.199) and we thus have a
homogeneous linear system with 4 equations, given by
\begin{bmatrix} u_L P_{L_3}^T - P_{L_1}^T \\ v_L P_{L_3}^T - P_{L_2}^T \\ u_R P_{R_3}^T - P_{R_1}^T \\ v_R P_{R_3}^T - P_{R_2}^T \end{bmatrix} X = 0_{4\times1} \iff A_{4\times4}\, X_{4\times1} = 0_{4\times1}     (7.200)
where it is observed that each pair of homologous points determines the point P of the 3D space with homogeneous coordinates X = (X, Y, Z, W), the fourth component being unknown. Considering the noise present in the localization of the homologous points, the solution of the system is found with the SVD method, which estimates the best solution in the least squares sense. With this method, the 3D estimate of P can be improved by adding further observations, with N > 2 cameras. In this case, two equations of the type (7.198) are added to the matrix A for each camera, thus obtaining a homogeneous system with 2N equations, always in 4 unknowns, with the matrix A of size 2N × 4.
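A sketch of the linear (DLT) triangulation just described (Python/NumPy), valid for N ≥ 2 cameras as in (7.200).

import numpy as np

def triangulate_dlt(projections, points):
    """projections: list of 3x4 matrices P_i; points: list of observed (u_i, v_i)."""
    rows = []
    for P, (u, v) in zip(projections, points):
        rows.append(u * P[2] - P[0])   # u * P_3^T - P_1^T
        rows.append(v * P[2] - P[1])   # v * P_3^T - P_2^T
    A = np.vstack(rows)                # 2N x 4 system A X = 0, Eq. (7.200)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                         # least squares solution (smallest singular value)
    return X[:3] / X[3]                # Cartesian coordinates (X, Y, Z)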
Recall that the reconstruction of P based on this linear method minimizes an algebraic error without geometric meaning. To better filter the noise present in the correspondences and in the perspective projection matrices, the optimal estimate can be obtained by setting up a nonlinear minimization function (in the sense of maximum likelihood estimation) as follows:
\min_{\hat{X}} \left( \| P_L \hat{X} - u_L \|^2 + \| P_R \hat{X} - u_R \|^2 \right)     (7.201)
where X̂ represents the best estimate of the 3D coordinates of the point P. In essence, X̂ is the estimate that minimizes, in the least squares sense, the reprojection error of P in both images, seen as the distance in the image plane between its projection (given, for the respective cameras, by Eqs. 7.196 and 7.197) and the related observed measurement of P, also in the image plane (see Fig. 7.13b). In the function (7.201) the reprojection error for the point P is accumulated for both cameras; in the case of N cameras the errors are added and the function to be minimized becomes
\min_{\hat{X}} \sum_{i=1}^{N} \| P_i \hat{X} - u_i \|^2     (7.202)
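A sketch of this nonlinear refinement (Python, assuming SciPy is available; names are illustrative), which minimizes the reprojection residuals of (7.202) starting from the linear estimate.

import numpy as np
from scipy.optimize import least_squares

def refine_point(X0, projections, points):
    """X0: initial 3D point; projections: 3x4 matrices P_i; points: observed (u_i, v_i)."""
    def residuals(X):
        Xh = np.append(X, 1.0)
        res = []
        for P, (u, v) in zip(projections, points):
            x = P @ Xh
            res.extend([x[0] / x[2] - u, x[1] / x[2] - v])   # reprojection error
        return res
    return least_squares(residuals, X0).x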
Consider now the case in which only the intrinsic parameters K_L and K_R of the stereo cameras are known. The 3D reconstruction of the scene occurs up to an unknown scale factor because the camera poses (the attitude of the cameras) are not known. In particular, not knowing the baseline (the translation vector T) of the stereo system, it is not possible to reconstruct the 3D scene at the real scale: the reconstruction is unique, but only up to an unknown scale factor.
Given at least 8 corresponding points, it is possible to calculate the fundamental matrix F and, once the calibration matrices K are known, to calculate the essential matrix E (alternatively, E can be calculated directly with 7.106), which we know includes the extrinsic parameters, that is, the rotation matrix R and the translation vector T. R and T are precisely the unknowns we want to calculate in order to then perform the 3D reconstruction by triangulation. The essential steps of the 3D reconstruction process, known the intrinsic parameters of the stereo cameras and a set of homologous points, are the following:
In this context, only step 4 is analyzed, while the others are immediate since they have already been treated previously. From Sect. 7.5.5, we know that the essential matrix E = [T]_× R can be factorized with the SVD method, obtaining E = U Σ V^T, where by definition the essential matrix has rank 2 and must admit two equal singular values and a third equal to zero, so that Σ = diag(1, 1, 0). We also know, from Eqs. (7.142) and (7.143), of the existence of the rotation matrix W and the antisymmetric matrix Z such that their product is ZW = diag(1, 1, 0) = Σ, producing the following result:
E = U \Sigma V^T = U (Z W) V^T = \underbrace{(U Z U^T)}_{[T]_\times} \underbrace{(U W V^T)}_{R} = [T]_\times R     (7.203)
where the penultimate step is motivated by Eq. (7.140). The orthogonality characteristics of the obtained rotation matrix and the definition of the essential matrix are thus satisfied. We know, however, that the decomposition is not unique: E is defined up to a scale factor λ and the translation vector up to the sign. In fact, the decomposition leads to 4 possible solutions of R and T, and consequently we have 4 possible projection matrices P_R = K_R [R | T] of the stereo system for the right camera, given by Eq. (7.153).
Consider now the product E^T E which, writing the essential matrix as E = R S, with S the antisymmetric matrix associated with the translation vector T defined by (7.103), depends only on the translation:
E^T E = (R S)^T (R S) = S^T R^T R\, S = S^T S     (7.205)
Expanding the antisymmetric matrix in (7.205), we have
E^T E = \begin{bmatrix} T_y^2+T_z^2 & -T_x T_y & -T_x T_z \\ -T_y T_x & T_z^2+T_x^2 & -T_y T_z \\ -T_z T_x & -T_z T_y & T_x^2+T_y^2 \end{bmatrix}     (7.206)
from which
Tr(E^T E) = 2 \|T\|^2     (7.207)
To normalize the translation vector to unit length, the essential matrix is normalized as follows:
\hat{E} = \frac{E}{\sqrt{Tr(E^T E)/2}}     (7.208)
so that the normalized translation vector is
\hat{T} = \frac{T}{\|T\|} = \frac{[T_x\; T_y\; T_z]^T}{\sqrt{T_x^2+T_y^2+T_z^2}} = [\hat{T}_x\; \hat{T}_y\; \hat{T}_z]^T     (7.209)
According to the normalization defined with (7.208) and (7.209) the matrix (7.206)
is rewritten as follows:
\hat{E}^T \hat{E} = \begin{bmatrix} 1-\hat{T}_x^2 & -\hat{T}_x \hat{T}_y & -\hat{T}_x \hat{T}_z \\ -\hat{T}_y \hat{T}_x & 1-\hat{T}_y^2 & -\hat{T}_y \hat{T}_z \\ -\hat{T}_z \hat{T}_x & -\hat{T}_z \hat{T}_y & 1-\hat{T}_z^2 \end{bmatrix}     (7.210)
At this point, the components of the vector T̂ can be derived from any row or column of the matrix Ê^T Ê given by (7.210). Indeed, by indicating it for simplicity with E = Ê^T Ê, the components of the translation vector T̂ are derived from the following:
\hat{T}_x = \pm\sqrt{1-E_{11}} \qquad \hat{T}_y = -\frac{E_{12}}{\hat{T}_x} \qquad \hat{T}_z = -\frac{E_{13}}{\hat{T}_x}     (7.211)
Since the entries of E are quadratic in the components of T̂, the latter can differ from the true ones in the sign. The rotation matrix R can then be calculated knowing the normalized essential matrix Ê and the normalized vector T̂, albeit with the ambiguity in the sign. For this purpose, the 3D vectors
w_i = \hat{E}_i \times \hat{T}     (7.212)
are defined, where Ê_i indicates the three rows of the normalized essential matrix. From these vectors w_i, through simple algebraic calculations, the rows of the rotation matrix are calculated as
R = \begin{bmatrix} R_1^T \\ R_2^T \\ R_3^T \end{bmatrix} = \begin{bmatrix} (w_1 + w_2 \times w_3)^T \\ (w_2 + w_3 \times w_1)^T \\ (w_3 + w_1 \times w_2)^T \end{bmatrix}     (7.213)
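A sketch of this recovery of one candidate pair (T̂, R) from E, following (7.208)-(7.213) (Python/NumPy); the sign ambiguities are resolved afterwards as described in the text.

import numpy as np

def factor_essential(E):
    """One candidate (T_hat, R) from an essential matrix E, Eqs. (7.208)-(7.213)."""
    E = np.asarray(E, dtype=float)
    E_hat = E / np.sqrt(np.trace(E.T @ E) / 2.0)        # Eq. (7.208)
    G = E_hat.T @ E_hat                                  # Eq. (7.210)
    # Eq. (7.211), taking the '+' sign; if T_x is close to zero, another
    # row/column of G should be used instead.
    T_x = np.sqrt(max(1.0 - G[0, 0], 0.0))
    T_hat = np.array([T_x, -G[0, 1] / T_x, -G[0, 2] / T_x])
    w = [np.cross(E_hat[i], T_hat) for i in range(3)]    # Eq. (7.212)
    R = np.vstack([w[0] + np.cross(w[1], w[2]),          # Eq. (7.213)
                   w[1] + np.cross(w[2], w[0]),
                   w[2] + np.cross(w[0], w[1])])
    return T_hat, R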
Due to the double ambiguity in the sign of Ê and T̂, we have 4 different pairs of possible solutions for (T̂, R). In analogy to what was done in the previous paragraph, the choice of the appropriate pair is made through the 3D reconstruction starting from the projections, in order to resolve the ambiguity. In fact, for each 3D point, the depth component is calculated in the reference systems of the two cameras considering the 4 possible pairs of solutions (T̂, R). The relation that, for a point P of the 3D space, links the coordinates P_L = (X_L, Y_L, Z_L) and P_R = (X_R, Y_R, Z_R) in the reference systems of the stereo cameras is given by (7.93), that is, P_R = R(P_L − T), which refers P to the right camera; from it we can derive the third component Z_R:
Z R = R3T ( P L − T̂ ) (7.214)
and from the relation (6.208), which links the point P and its projection in the right image, we have
p_R = \frac{f_R}{Z_R} P_R = \frac{f_R\, R(P_L - \hat{T})}{R_3^T (P_L - \hat{T})}     (7.215)
whose first component is
x_R = \frac{f_R\, R_1^T (P_L - \hat{T})}{R_3^T (P_L - \hat{T})}     (7.216)
In analogy to (7.215), we have the equation that links the coordinates of P to its projection in the left image plane:
p_L = \frac{f_L}{Z_L} P_L     (7.217)
Substituting P_L = (Z_L / f_L)\, p_L from (7.217) into (7.216) and solving for Z_L, we obtain
Z_L = f_L \frac{(f_R R_1 - x_R R_3)^T \hat{T}}{(f_R R_1 - x_R R_3)^T p_L}     (7.218)
From (7.217), we get P_L and, considering (7.218), we finally get the 3D coordinates of P in the reference systems of the two cameras:
P_L = \frac{(f_R R_1 - x_R R_3)^T \hat{T}}{(f_R R_1 - x_R R_3)^T p_L}\; p_L \qquad\qquad P_R = R(P_L - \hat{T})     (7.219)
Therefore, being able to calculate, for each point to be reconstructed, the depth coordinates Z_L and Z_R for both cameras, it is possible to choose the appropriate pair (R, T̂), that is, the one for which both depths are positive, because the scene to be reconstructed is in front of the stereo system. Let us summarize the essential steps of the algorithm:
a. If both depths are negative for some points, change the sign of T̂ and go back to step 4.
b. Otherwise, if one depth is negative and the other positive for some point, change the sign of each element of the matrix Ê and go back to step 3.
c. Otherwise, if both depths of the reconstructed points are positive, the algorithm terminates.
Recall that the 3D points of the scene are reconstructed up to an unknown scale factor.
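A sketch of the depth test used to select the correct pair (Python/NumPy), implementing (7.214), (7.218), and (7.219); here p_L is the left image point in the form (x_L, y_L, f_L) and x_R is the first image coordinate of the corresponding right projection.

import numpy as np

def depths_positive(R, T_hat, points_L, x_R_list, f_L=1.0, f_R=1.0):
    """Return True if Z_L > 0 and Z_R > 0 for every correspondence."""
    R = np.asarray(R, dtype=float)
    T_hat = np.asarray(T_hat, dtype=float)
    for p_L, x_R in zip(points_L, x_R_list):
        p_L = np.asarray(p_L, dtype=float)       # (x_L, y_L, f_L)
        w = f_R * R[0] - x_R * R[2]              # f_R R_1 - x_R R_3
        Z_L = f_L * (w @ T_hat) / (w @ p_L)      # Eq. (7.218)
        P_L = (Z_L / f_L) * p_L                  # from Eq. (7.217)
        Z_R = R[2] @ (P_L - T_hat)               # Eq. (7.214)
        if Z_L <= 0 or Z_R <= 0:
            return False
    return True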
as described in [10]. Then, with these matrices, the 3D points are triangulated by back-projecting the corresponding image projections.
In summary, it is shown that, in the context of uncalibrated cameras, the ambiguity in the reconstruction is attributable only to an arbitrary projective transformation. In particular, given a set of correspondences for a stereo system, the fundamental matrix is uniquely determined, the camera matrices can then be estimated, and the scene can be reconstructed from these correspondences alone. It should be noted, however, that any two reconstructions obtained from these correspondences are equivalent from the projective point of view, that is, the reconstruction is not unique but is determined only up to a projective transformation (see Fig. 7.14).
The ambiguity of 3D reconstruction from uncalibrated cameras is formalized by
the following projective reconstruction theorem [10]:
Fig. 7.14 Ambiguous 3D reconstruction from a non-calibrated stereo system with only the projections of the homologous points known. The 3D reconstruction, although the structure of the scene emerges, is determined only up to an unknown projective transformation
If the homologous points p_{L_i} ↔ p_{R_i} of the two images determine the fundamental matrix F uniquely, then any two reconstructions computed from these correspondences are related by a projective transformation H_{4×4}, that is,
P_i' = H_{4\times4} P_i     (7.220)
for all i, except for those i such that F p_{L_i} = p_{R_i}^T F = 0 (coincident with the epipoles of the stereo images), while the camera matrices transform as P_L' = P_L H^{-1} and P_R' = P_R H^{-1}. The original projection points p_{L_i} ↔ p_{R_i} remain the same (together with F), as verified by
p_{L_i} = P_L P_i = (P_L H^{-1})(H P_i) = P_L' P_i'
p_{R_i} = P_R P_i = (P_R H^{-1})(H P_i) = P_R' P_i'     (7.222)
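The invariance expressed by (7.222) can be checked numerically with a short sketch (Python/NumPy): applying an arbitrary invertible 4 × 4 transformation H to the scene point and H^{-1} to the camera matrices leaves the image projections unchanged (the matrices below are random and purely illustrative).

import numpy as np

rng = np.random.default_rng(0)
P_L = rng.standard_normal((3, 4))              # illustrative camera matrices
P_R = rng.standard_normal((3, 4))
X = np.append(rng.standard_normal(3), 1.0)     # a scene point, homogeneous
H = rng.standard_normal((4, 4)) + 5.0 * np.eye(4)   # arbitrary, invertible

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

# Transformed reconstruction: X' = H X and P' = P H^{-1}, Eqs. (7.220), (7.222).
X_p = H @ X
P_L_p, P_R_p = P_L @ np.linalg.inv(H), P_R @ np.linalg.inv(H)
print(np.allclose(project(P_L, X), project(P_L_p, X_p)))   # True
print(np.allclose(project(P_R, X), project(P_R_p, X_p)))   # True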
only five 3D points (no four of which coplanar), among the N of the scene, are used to define a basic projective transformation. The first approach [20], starting from the projective basis, finds the projection matrices (the epipoles being known) with algebraic methods, while the second approach [24] uses a geometric method based on the epipolar geometry to select the reference points in the image planes.
References
1. S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera. Int. J. Comput.
Vis. 8(2), 123–151 (1992)
2. B. Caprile, V. Torre, Using vanishing points for camera calibration. Int. J. Comput. Vis. 4(2),
127–140 (1990)
3. R.Y. Tsai, A versatile camera calibration technique for 3d machine vision. IEEE J. Robot.
Autom. 4, 323–344 (1987)
4. J. Heikkila, O. Silvén, A four-step camera calibration procedure with implicit image correction,
in IEEE Proceedings of Computer Vision and Pattern Recognition (1997), pp 1106–1112
5. O.D. Faugeras, G. Toscani, Camera calibration for 3d computer vision, in International Work-
shop on Machine Vision and Machine Intelligence (1987), pp. 240–247
6. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
7. R.K. Lenz, R.Y. Tsai, Techniques for calibration of the scale factor and image center for high accuracy 3-d machine vision metrology. IEEE Trans. Pattern Anal. Mach. Intell. 10(5), 713–720 (1988)
8. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins, 1996). ISBN
978-0-8018-5414-9
9. Z. Zhang, A flexible new technique for camera calibration. Technical Report MSR- TR-98-71
(Microsoft Research, 1998)
10. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge University Press, 2003)
11. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Approach (MIT Press, Cam-
bridge, Massachusetts, 1996)
12. J. Vince, Matrix Transforms for Computer Games and Animation (Springer, 2012)
13. H.C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections.
Nature 293, 133–135 (1981)
14. R.I. Hartley, In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 19(6), 580–593 (1997)
15. Q.-T. Luong, O. Faugeras, The fundamental matrix: theory, algorithms, and stability analysis. Int. J. Comput. Vis. 1(17), 43–76 (1996)
16. D. Nistér, An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 756–777 (2004)
17. O. Faugeras, S. Maybank, Motion from point matches: multiplicity of solutions. Int. J. Comput. Vis. 4, 225–246 (1990)
18. T.S. Huang, O.D. Faugeras, Some properties of the E matrix in two-view motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 11(12), 1310–1312 (1989)
19. C. Loop, Z. Zhang, Computing rectifying homographies for stereo vision, in IEEE Conference
of Computer Vision and Pattern Recognition (1999), vol. 1, pp. 125–131
20. E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision (Prentice Hall, 1998)
21. R. Mohr, L. Quan, F. Veillon, B. Boufama, Relative 3d reconstruction using multiples uncali-
brated images. Technical Report RT 84-I-IMAG LIFIA 12, Lifia-Irimag (1992)
22. O.D. Faugeras, What can be seen in three dimensions from an uncalibrated stereo rig, in ECCV
European Conference on Computer Vision (1992), pp. 563–578
23. R. Hartley, R. Gupta, T. Chang, Stereo from uncalibrated cameras, in IEEE CVPR Computer
Vision and Pattern Recognition (1992), pp. 761–764
24. R. Mohr, L. Quan, F. Veillon, Relative 3d reconstruction using multiple uncalibrated images.
Int. J. Robot. Res. 14(6), 619–632 (1995)
Index
Symbols
2.5D Sketch map, 342
3D representation
  object centered, 344
  viewer centered, 344
3D stereo reconstruction
  by linear triangulation, 658
  by triangulation, 656
  knowing intrinsic parameters & Essential matrix, 661
  knowing only correspondences of homologous points, 664
  knowing only intrinsic parameters, 659
3D world coordinates, 605

A
active cell, 324
Airy pattern, 466
albedo, 416
aliasing, 490, 491
alignment
  edge, 340
  image, 533, 534, 646
  pattern, 180
ambiguous 3D reconstruction, 665
angular disparity, 355
anti-aliasing, 490, 491
aperture problem, 483, 484, 498, 499, 514, 515
artificial vision, 316, 348, 393
aspect ratio, 599, 606, 608, 609
associative area, 369
associative memory, 229
autocorrelation function, 276, 282

B
background modeling
  based on eigenspace, 564, 565
  based on KDE, 563, 564
  BS based on GMM, 561, 562
  BS with mean/median background, 558, 559
  BS with moving average background, 559, 560
  BS with moving Gaussian average, 559, 560
  BS-Background Subtraction, 557, 558
  non-parametric, 566, 567
  parametric, 565, 566
  selective BS, 560, 561
backpropagation learning algorithm
  batch, 119
  online, 118
  stochastic, 118
Bayes, 30
  classifier, 48
  rules, 37, 39
  theorem, 38
Bayesian learning, 56
bias, 51, 62, 91, 93
bilinear interpolation, 525, 526
binary coding, 455, 458, 462
binary image, 231, 496, 497
binocular fusion, 351
binocular fusion
  fixation point, 352
  horopter, 353
  Vieth-Müller circumference, 353
binocular vision
  angular disparity calculation, 389
  computational model, 377
J
Jacobian
  function, 529, 530
  matrix, 531, 532

K
KDE-Kernel Density Estimation, 563, 564
kernel function, 81
KF-Kalman filter, 544, 545
  ball tracking example, 546, 547, 553, 554
  gain, 545, 546
  object tracking, 543, 544
  state correction, 549, 550
  state prediction, 545, 546
KLT algorithm, 536, 537
kurtosis, 269

L
Lambertian model, 321, 416, 430
Laplace operator, 281, see also LOG-Laplacian of Gaussian
LDA-Linear Discriminant Analysis, 21
least squares approach, 514, 515, 618
lens
  aperture, 466
  crystalline, 387
  Gaussian law, 465
line fitting, 513, 514
local operator, 308
LOG-Laplacian of Gaussian, 290, 380
LSE-Least Square Error, 509, 510
LUT-Look-Up-Table, 436

M
Mahalonobis distance, 59
MAP-Maximum A Posterior, 39, 67
mapping function, 200
Marr's paradigm
  algorithms and data structures level, 318
  computational level, 318
  implementation level, 318
Maximum Likelihood Estimation, 49
  for Gaussian distribution & known mean, 50
  for Gaussian with unknown µ and Σ, 50
mean-shift, 566, 567
Micchelli's theorem, 199
minimum risk theory, 43
MLE estimator distortion, 51
MND-Multivariate Normal Distribution, 58
MoG-Mixtures of Gaussian, 66, see also EM-Expectation–Maximization

moment
  central, 268, 309
  inertia, 274
  normalized spatial, 8
momentum, 121, 273
motion discretization
  aperture problem, 498, 499
  frame rate, 487, 488
  motion field, 494, 495
  optical flow, 494, 495
  space–time resolution, 492, 493
  space-time frequency, 493, 494
  time-space domain, 492, 493
  visibility area, 492, 493
motion estimation
  by compositional alignment, 532, 533
  by inverse compositional alignment, 533, 534
  by Lucas–Kanade alignment, 526, 527
  by OF pure rotation, 572, 573
  by OF pure translation, 571, 572
  by OF-Optical Flow, 570, 571
  cumulative images difference, 496, 497
  image difference, 496, 497
  using sparse POIs, 535, 536
motion field, 485, 486, 494, 495
MRF-Markov Random Field, 286, 476
MSE-Mean Square Error, 124, 544, 545
multispectral image, 4, 13, 16, 17

N
NCC-Normalized Cross-Correlation, 527, 528
needle map, 426, see also orientation map
neurocomputing biological motivation
  mathematical model, 90
  neurons structure, 88
  synaptic plasticity, 89
neuron activation function, 90
  ELU-Exponential Linear Units, 245
  for traditional neural network, 91
  Leaky ReLU, 245
  Parametric ReLU, 245
  properties of, 122
  ReLU-Rectified Linear Units, 244
Neyman–Pearson criterion, 48
nodal point, 352
normal map, 441
normal vector, 417, 419, 430
normalized coordinate, 632
NP-complete problem, 147
U
unsupervised learning
  brain, 89
  Hebbian, 213

Z
zero crossing, 322, 324, 380, 495, 496
ZNCC-Zero Mean Normalized Cross-Correlation, 398