
Andrea Fusiello

Computer Vision: Three-dimensional Reconstruction Techniques

Andrea Fusiello
University of Udine, Udine, Italy

ISBN 978-3-031-34506-7 e-ISBN 978-3-031-34507-4


https://doi.org/10.1007/978-3-031-34507-4

Translation from the Italian language edition: “Visione Computazionale” by Andrea Fusiello, © Edizioni Franco Angeli 2022. Published by Edizioni Franco Angeli. All Rights Reserved.

© Edizioni Franco Angeli under exclusive license to


Springer Nature Switzerland AG 2024
This work is subject to copyright. All rights are solely and
exclusively licensed by the Publisher, whether the whole
or part of the material is concerned, specifically the
rights of reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other
physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names,
trademarks, service marks, etc. in this publication does
not imply, even in the
absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations
and therefore free for general
use.
The publisher, the authors, and the editors are safe to
assume that the advice and information in this book are
believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a
warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions
that may have
been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional
affiliations.
This Springer imprint is published by the registered
company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11,
6330 Cham, Switzerland
To my mother
Foreword
Deep learning has brought undeniable successes and some
breakthroughs in image recognition and scene description.
It is
nevertheless true that geometric computer vision remains a
fundamental field. Given the impressive state-of-the-art and
the rapid pace of progress in deep learning, it would be
of course risky to rule out the possibility that the solution to
many geometric vision problems, for instance
reconstructing 3D structure from multiple images, can be
learned from millions of examples. Yet we believe that a
principled,
approach that obtains the geometric structure of what we
see through applied mathematics provides more insight.
We would also go as far as suggesting that, in the end,
such an approach can be even more fun to study and
implement.
The book on geometric computer vision that we
published in 1998 was received so well that it became a
standard textbook for the field. As it happens inevitably to
any advanced textbook, the relevance of its
contents has tarnished as the state of the art of research
has moved on.
Today, Andrea Fusiello’s book offers an admirable
contemporary
view on geometric computer vision. The Matlab code of
the algorithms offers a hands-on counterpart to the
algebraic derivations, making the material truly complete
and useful. Ultimately, the book makes the
theoretical basis on which 3D computer vision can be built
and
developed accessible. With this book, geometry is back into
the picture!
Andrea’s book also fills a gap in the current panorama
of computer vision textbooks. It gives students and
practitioners a compact yet
exhaustive overview of geometric computer vision. It tells
a story that is logical and easy to follow, yet rigorous in its
mathematical
presentation. We have no doubt that it will be well received
by teachers, students and practitioners alike. We are also
sure that it will help remind the community of the
beauty of building mathematical
models for computer vision, and that more than a
superficial
understanding of algebra and geometry has a lot to offer,
both in terms of results and intellectual pleasure.
As d’Alembert reportedly wrote, algebra is generous: she
often gives more than is asked of her.
Emanuele Trucco
University of Dundee

Alessandro Verri
Università degli Studi di Genova
Preface
People are usually more convinced by reasons they discovered themselves than by those found by others.
—B. Pascal

Computer vision is a rapidly advancing field that has had a profound impact on the way we interact with technology. From facial
recognition to self-driving cars, the applications of
computer vision are vast and
ever-expanding. Geometry plays a fundamental role in this
discipline,
providing the necessary mathematical framework to
understand the underlying principles of how we perceive
and interpret visual
information in the world around us.
This text delves into the theories and computational
techniques
used for determining the geometric properties of solid
objects through images. It covers the fundamental concepts
and provides the necessary mathematical background for
more advanced studies. The book is
divided into clear and concise chapters that cover a wide
range of
topics, including image formation, camera models, feature
detection, and 3D reconstruction. Each chapter includes
detailed explanations of the theory, as well as practical
examples to help readers understand and apply the
concepts presented.
With a focus on teaching, the book aims to find a balance
between
complexity of the theory and its practical applicability in
terms of
implementation. Instead of providing an all-encompassing
overview of the current state of the field, it offers a
selection of specific methods
with enough detail for readers to implement them. To aid
the reader in implementation, most of the methods
discussed in the book are
accompanied by a MATLAB listing and the sources are
available on Github at
https://github.com/fusiello/Computer_Vision_Toolkit.
This approach results in leaving out several valuable
topics and algorithms, but this does not mean that they
are any less important than the ones that have been
included; it is simply a personal choice.
The book has been written with the intention of being
used as a
primary resource for students of university courses on
computer
vision, specifically final-year undergraduates or
postgraduates of
computer science or engineering degrees. It is also useful
for self-study
and for those who are using computer vision for practical
applications outside of academia.
Basic knowledge of linear algebra is necessary, while
other
mathematical concepts can be introduced as needed
through included appendices. The modular structure
allows instructors to adapt the
material to fit their course syllabus, but it is
recommended to cover at least the chapters on
fundamental geometric concepts, namely Chaps. 3, 4, 5, 6,
7, 8.
This edition has been updated to ensure that it is
accessible to a
global audience, while also ensuring that the material is
current and up-to-date with the latest developments in the
field. To accomplish this, the book has been translated to
English and has undergone extensive
revision from its previous version, which was published by
Franco
Angeli, Milano. The majority of the chapters have
undergone changes, which include the inclusion of new
material, as well as the
reorganisation of existing content.
I hope that you will find this book to be a valuable
resource as you explore the exciting world of computer
vision.
Andrea Fusiello
Udine, Italy
December 2022
Acknowledgements
This text is derived from the handouts I have prepared for
academic
courses or seminar presentations over the past 20 years.
The first
chapters were born, in embryonic form, in 1997 and then
evolved and expanded to the current version. I would like
to thank the students of the University of Udine and the
University of Verona who, during these years, have pointed
out errors, lacks and unclear parts. Homeopatic
traces of my PhD thesis can also be found here and there.
The text benefited from the timely corrections suggested
by
Federica Arrigoni, Guido Maria Cortelazzo, Fabio Crosilla, Riccardo
Riccardo
Gherardi, Luca Magri, Francesco Malapelle, Samuele
Martelli, Eleonora Maset, Roberto Rinaldo, and Roberto
Toldo, whom I sincerely thank. The residual errors are
solely my responsibility.
Credits for figures are recognised in the respective
captions.
Acronyms
PPM Perspective Projection Matrix
COP Centre of Projection
SVD Singular Value Decomposition
OPA Orthogonal Procrustes analysis
GPA Generalised Procrustes analysis
BA Bundle Adjustment
AIM Adjustment of Independent Models
DLT Direct Linear Transform
ICP Iterative Closest Point
GSD Ground Sampling Distance
IRLS Iteratively Reweighted Least-Squares
LMS Least Median of Squares
RANSAC Random Sample Consensus
SFM Structure from Motion
SGM Semi Global Matching
SO Scanline Optimisation
WTA Winner Takes All
DSI Disparity Space Image
LS Least-Squares
LM Levenberg-Marquardt
SIFT Scale Invariant Feature Transform
TOF Time of Flight
SSD Sum of Squared Difference
SAD Sum of Absolute Difference
NCC Normalised Cross Correlation
Contents
1 Introduction
1.1 The Prodigy of Vision
1.2 Low-Level Computer Vision
1.3 Overview of the Book
1.4 Notation
References
2 Fundamentals of Imaging
2.1 Introduction
2.2 Perspective
2.3 Digital Images
2.4 Thin Lenses
2.4.1 Telecentric Optics
2.5 Radiometry
References
3 The Pinhole Camera Model
3.1 Introduction
3.2 Pinhole Camera
3.3 Simplified Pinhole Model
3.4 General Pinhole Model
3.4.1 Intrinsic Parameters
3.4.2 Extrinsic Parameters
3.5 Dissection of the Perspective Projection Matrix
3.5.1 Collinearity Equations
3.6 Radial Distortion
Problems
References
4 Camera Calibration
4.1 Introduction
4.2 The Direct Linear Transform Method
4.3 Factorisation of the Perspective Projection Matrix
4.4 Calibrating Radial Distortion
4.5 The Sturm-Maybank-Zhang Calibration Algorithm
Problems
References
5 Absolute and Exterior Orientation
5.1 Introduction
5.2 Absolute Orientation
5.2.1 Orthogonal Procrustes Analysis
5.3 Exterior Orientation
5.3.1 Fiore’s Algorithm
5.3.2 Procrustean Method
5.3.3 Direct Method
Problems
References
6 Two-View Geometry
6.1 Introduction
6.2 Epipolar Geometry
6.3 Fundamental Matrix
6.4 Computing the Fundamental Matrix
6.4.1 The Seven-Point Algorithm
6.4.2 Preconditioning
6.5 Planar Homography
6.5.1 Computing the Homography
6.6 Planar Parallax
Problems
References
7 Relative Orientation
7.1 Introduction
7.2 The Essential Matrix
7.2.1 Geometric Interpretation
7.2.2 Computing the Essential Matrix
7.3 Relative Orientation from the Essential Matrix
7.3.1 Closed Form Factorisation of the
Essential Matrix
7.4 Relative Orientation from the Calibrated
Homography
Problems
References
8 Reconstruction from Two Images
8.1 Introduction
8.2 Triangulation
8.3 Ambiguity of Reconstruction
8.4 Euclidean Reconstruction
8.5 Projective Reconstruction
8.6 Euclidean Upgrade from Known Intrinsic Parameters
8.7 Stratification
Problems
References
9 Non-linear Regression
9.1 Introduction
9.2 Algebraic Versus Geometric Distance
9.3 Non-linear Regression of the PPM
9.3.1 Residual
9.3.2 Parameterisation
9.3.3 Derivatives
9.3.4 General Remarks
9.4 Non-linear Regression of Exterior Orientation
9.5 Non-linear Regression of a Point in Space
9.5.1 Residual
9.5.2 Derivatives
9.5.3 Radial Distortion
9.6 Regression in the Joint Image Space
9.7 Non-linear Regression of the Homography
9.7.1 Residual
9.7.2 Parameterisation
9.7.3 Derivatives
9.8 Non-linear Regression of the Fundamental
Matrix
9.8.1 Residual
9.8.2 Parameterisation
9.8.3 Derivatives
9.9 Non-linear Regression of Relative Orientation
9.9.1 Parameterisation
9.9.2 Derivatives
9.10 Robust Regression
Problems
References
10 Stereopsis: Geometry
10.1 Introduction
10.2 Triangulation in the Normal Case
10.3 Epipolar Rectification
10.3.1 Calibrated Rectification
10.3.2 Uncalibrated Rectification
Problems
References
11 Feature Points
11.1 Introduction
11.2 Filtering Images
11.2.1 Smoothing
11.2.2 Derivation
11.3 LoG Filtering
11.4 Harris-Stephens Operator
11.4.1 Matching and Tracking
11.4.2 Kanade-Lucas-Tomasi Algorithm
11.4.3 Predictive Tracking
11.5 Scale Invariant Feature Transform
11.5.1 Scale-Space
11.5.2 SIFT Detector
11.5.3 SIFT Descriptor
11.5.4 Matching
References
12 Stereopsis: Matching
12.1 Introduction
12.2 Constraints and Ambiguities
12.3 Local Methods
12.3.1 Matching Cost
12.3.2 Census Transform
12.4 Adaptive Support
12.4.1 Multiresolution Stereo Matching
12.4.2 Adaptive Windows
12.5 Global Matching
12.6 Post-Processing
12.6.1 Reliability Indicators
12.6.2 Occlusion Detection
References
13 Range Sensors
13.1 Introduction
13.2 Structured Lighting
13.2.1 Active Stereopsis
13.2.2 Active Triangulation
13.2.3 Ray-Plane Triangulation
13.2.4 Scanning Methods
13.2.5 Coded-Light Methods
13.3 Time-of-Flight Sensors
13.4 Photometric Stereo
13.4.1 From Normals to Coordinates
13.5 Practical Considerations
References
14 Multi-View Euclidean Reconstruction
14.1 Introduction
14.1.1 Epipolar Graph
14.1.2 The Case of Three Images
14.1.3 Taxonomy
14.2 Point-Based Approaches
14.2.1 Adjustment of Independent Models
14.2.2 Incremental Reconstruction
14.2.3 Hierarchical Reconstruction
14.3 Frame-Based Approach
14.3.1 Synchronisation of Rotations
14.3.2 Synchronisation of Translations
14.3.3 Localisation from Bearings
14.4 Bundle Adjustment
14.4.1 Jacobian of Bundle Adjustment
14.4.2 Reduced System
References
15 3D Registration
15.1 Introduction
15.1.1 Generalised Procrustes Analysis
15.2 Correspondence-Less Methods
15.2.1 Registration of Two Point Clouds
15.2.2 Iterative Closest Point
15.2.3 Registration of Many Point Clouds
References
16 Multi-view Projective Reconstruction and
Autocalibration
16.1 Introduction
16.1.1 Sturm-Triggs Factorisation Method
16.2 Autocalibration
16.2.1 Absolute Quadric Constraint
16.2.2 Mendonça-Cipolla Method
16.3 Autocalibration via H∞
16.4 Tomasi-Kanade’s Factorisation
16.4.1 Affine Camera
16.4.2 The Factorisation Method for Affine
Camera
Problems
References
17 Multi-view Stereo Reconstruction
17.1 Introduction
17.2 Volumetric Stereo in Object-Space
17.2.1 Shape from Silhouette
17.2.2 Szeliski’s Algorithm
17.2.3 Voxel Colouring
17.2.4 Space Carving
17.3 Volumetric Stereo in Image-Space
17.4 Marching Cubes
References
18 Image-Based Rendering
18.1 Introduction
18.2 Parametric Transformations
18.2.1 Mosaics
18.2.2 Image Stabilisation
18.2.3 Perspective Rectification
18.3 Non-parametric Transformations
18.3.1 Transfer with Depth
18.3.2 Transfer with Disparity
18.3.3 Epipolar Transfer
18.3.4 Transfer with Parallax
18.3.5 Ortho-Projection
18.4 Geometric Image Transformation
Problems
References
A Notions of Linear Algebra
A.1 Introduction
A.2 Scalar Product
A.3 Matrix Norm
A.4 Inverse Matrix
A.5 Determinant
A.6 Orthogonal Matrices
A.7 Linear and Quadratic Forms
A.8 Rank
A.9 QR Decomposition
A.10 Eigenvalues and Eigenvectors
A.11 Singular Value Decomposition
A.12 Pseudoinverse
A.13 Cross Product
A.14 Kronecker’s Product
A.15 Rotations
A.16 Matrices Associated with Graphs
References
B Matrix Differential Calculation
B.1 Derivatives of Vector and Matrix Functions
B.2 Derivative of Rotations
B.2.1 Axis/Angle Representation
B.2.2 Euler Representation
References
C Regression
C.1 Introduction
C.2 Least-Squares
C.2.1 Linear Least-Squares
C.2.2 Non-linear Least-Squares
C.2.3 The Levenberg-Marquardt Method
C.3 Robust Regression
C.3.1 Outliers and Robustness
C.3.2 M-Estimators
C.3.3 Least Median of Squares
C.3.4 RANSAC
C.4 Propagation of Uncertainty
C.4.1 Covariance Propagation in Least-Squares
References
D Notions of Projective Geometry
D.1 Introduction
D.2 Perspective Projection
D.3 Homogeneous Coordinates
D.4 Equation of the Line
D.5 Transformations
Reference
E MATLAB Code
Index
Listings
3.1 Projective transformation

3.2 Parameterisation of K (Jacobian omitted for space reasons)

3.3 Construction of the PPM

3.4 Radial distortion: direct model

3.5 Radial distortion: inverse model

4.1 DLT method

4.2 Resection with preconditioning

4.3 Factorisation of the PPM

4.4 Regression of radial distortion

4.5 Calibration of radial distortion with resection

4.6 SMZ calibration

5.1 Weighted OPA


5.2 Fiore’s linear exterior orientation

5.3 Iterative exterior orientation

6.1 Eight-point algorithm

6.2 Linear computation of F

6.3 Preconditioning

6.4 Linear computation of H

7.1 Linear computation of E

7.2 Relative orientation from the factorisation of E

7.3 Relative orientation from E in closed form

7.4 Relative orientation from calibrated H

8.1 Triangulation

8.2 Reconstruction from two images


8.3 Euclidean upgrade of a projective reconstruction

9.1 Non-linear calibration

9.2 Computation of the residual of reprojection and its

derivatives

9.3 Non-linear regression of exterior orientation

9.4 Non-linear regression of a point (triangulation)

9.5 Non-linear regression of H

9.6 Parameterisation of H

9.7 Sampson distance for H

9.8 Non-linear regression of F

9.9 Parameterisation of F

9.10 Sampson distance for F

9.11 Non-linear regression of relative orientation


9.12 Parameterisation of E

9.13 Robust estimate of F

9.14 Robust estimate of H

10.1 Triangulation from disparity

10.2 Calibrated epipolar rectification

10.3 Uncalibrated epipolar rectification

11.1 Harris-Stephens operator

12.1 Stereo with NCC

12.2 Stereo with SSD

12.3 Stereo with SCH

13.1 Photometric stereo

14.1 Rotation synchronisation


14.2 Translation synchronisation

14.3 Localisation from bearings

15.1 Generalised Procrustean Analysis

15.2 Iterative closest point

16.1 Sturm-Triggs projective reconstruction

16.2 Mendonça-Cipolla self-calibration

16.3 LVH calibration

17.1 Reconstruction from silhouettes

17.2 General scheme (stub) of space carving

18.1 Homography synchronisation

18.2 Image transform with backward mapping

C.1 Gauss-Newton method
C.2 IRLS

C.3 Least median of squares

C.4 MSAC

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_1

1. Introduction
Andrea Fusiello1
(1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

Airplanes do not flap their wings.
—Frederick Jelinek

1.1 The Prodigy of Vision


If we pause to reflect detachedly on vision as a sensory
ability, we must agree with Ullman that it is prodigious:
As seeing agents, we are so used to the benefits of
vision, and so unaware of how we actually use it, that
it took a long time to
appreciate the almost miraculous achievements of our
visual
system. If one tries to adopt a more objective and
detached
attitude, by considering the visual system as a device
that
records a band of electromagnetic radiation as an input,
and
then uses it to gain knowledge about surrounding
objects that emit and reflect it, one cannot help but
be struck by the richness of information this system
provides. (Ullman 1996)

Computer vision was born as a branch of artificial
intelligence in the 1970s, essentially as a


confluence of pattern
recognition, signal and image processing, photogrammetry,
perceptual psychology and neurophysiology/brain
research, but has since evolved into an autonomous
discipline with its own methods, paradigms and
problems.
In the modern approach it does not endeavour to
replicate human vision. In fact, this attempt would
probably be doomed to failure
because of the inherent difference in the underlying hardware. The
comparison that is often evoked is that of flight: the efforts
to replicate animal flight in history have all failed, while
airplanes, with a
completely different approach, have solved the problem
more than satisfactorily, surpassing in some aspects the
birds themselves.

1.2 Low-Level Computer Vision


Computer vision is the discipline of computer science
concerned with the extraction of information from images.
The information can be
numerical in nature (e.g. spatial coordinates) or symbolic
(e.g.
identities and relationships between objects). Simplifying,
we could say that it is about finding out what is present in
the scene and where.
Following Ullman (1996), it is customary to distinguish between
low-level and high-level vision. The former is concerned with
extracting certain physical properties of the visible
environment, such as depth, three- dimensional shape
and object contours.
Conversely, high-level vision is concerned with the
extraction of
shape properties, spatial relations, object recognition and
classification.
The traditional distinction between low-level and high-level,
reflected in the architectures of traditional vision systems,
is now very much blurred by deep learning. This book
focuses on reconstructing a precise and accurate
geometric model of the scene, a task that can
hardly be delegated to a neural network, however deep it
may be. We will study computational methods
(algorithms) that aim to obtain a
representation of the solid (sterèos) structure of the three-
dimensional world sensed through two-dimensional
projections of it, the images.
This approach can be easily framed within the
theory of reconstructionism developed by Marr in the
late 1970s:
In the theory of visual processes, the underlying task is
to
reliably derive properties of the world from images of it;
the
business of isolating constraints that are both
powerful enough to allow a process to be defined
and generally true of the world is a central theme of
our inquiry. (Marr 1982)
This, of course, is not the only possible paradigm in
vision. Aloimonos and Shulman (1989), for example,
argue that the description of the
world should not be generic, but dependent on the goal.
Low-level computer vision can be effectively
described as the inverse of computer graphics (Fig.
1.1), in which, given:
the geometric description of the scene,
the radiometric description of the scene (light sources
and surface properties),
the description of the acquisition device (camera),
the computer produces the synthetic image as seen by the
camera.

Fig. 1.1 Relation between vision and graphics

The dimensional reduction operated by the projection


and the
multiplicity of the causes that concur to determine the
brightness and the colour make the inverse problem ill-posed and non-trivial. The
human visual system exploits multiple visual cues (Palmer
1999) to
solve this problem. Several computational techniques,
collectively
referred to as shape from X algorithms, exploit the same
range of visual cues and other optical phenomena to
recover the shape of objects from their images. Shape from
silhouette, shape from stereo, shape from
motion, shape from shading, shape from texture and shape
from
focus/defocus are the most commonly studied. Some of
them will be discussed in this book.
As for the others, let us briefly mention them here. Shape
from
shading uses the gradients of the light intensity in an
image to infer the 3D shape of an object (similarly to
photometric stereo, but with a single image), while shape
from texture uses the texture patterns
present in an image to infer the 3D shape of an object.
Shape from focus/defocus takes advantage of the fact
that blur of an object in an image is related to its depth.
More detailed descriptions of these
algorithms can be found, for example, in Jain et al.
(1995); Trucco and Verri (1998).

1.3 Overview of the Book


The first eight chapters of the book cover the basics of the
discipline, and Chaps. 3–8 in particular should be
studied sequentially as a block.
The next ten can be selected to form a tailor-made course;
the
dependencies are shown in Fig. 1.2. Some (Chaps. 11 and
12) focus on imaging techniques to determine dense or
sparse correspondences,
while others deal more with geometric aspects such as
multi-view
reconstruction and self-calibration (Chaps. 14 and 16). The
last chapter represents a departure from the main strand of
reconstruction in that it deals with the synthesis of images
from other images. The appendices report some
mathematical facts that make the book self-contained.
Fig. 1.2 Structure of the book with dependencies. Solid lines indicate a
requirement, while dashed ones specify a recommendation

1.4 Notation
The notation largely follows that of Faugeras (1993),
with 3D points called M (from the French “monde”) and
camera matrices called P
(from “projection” and “perspective”), although P for
points and M for matrices would have sounded more
intuitive. Vectors are in bold, the 2D ones in lower case,
the 3D ones in upper case. Matrices are
capitalised. The ˜ above a vector is used to distinguish its
Cartesian
representation (with ˜) from the homogeneous one
(without ˜). This choice is the opposite of Faugeras
(1993) and is motivated by the fact
that homogeneous coordinates are the default in this
book, while Cartesian coordinates are rarely used.

References
J. Aloimonos and D. Shulman. Integration of Visual Modules. An Extension to the Marr
Paradigm. Academic Press, Waltham, MA, 1989.
O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. The MIT
Press, Cambridge, MA, 1993.
R. Jain, R. Kasturi, and B.G. Schunk. Machine Vision. Computer Science
Series. McGraw-Hill International Editions, 1995.
D. Marr. Vision. Freeman, San Francisco, CA, 1982.

S. E. Palmer. Vision Science: Photons to Phenomenology. MIT Press, Cambridge, MA, 1999.
E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice-Hall, Upper Saddle River, NJ, 1998.
S. Ullman. High-level Vision. The MIT Press, Cambridge, MA, 1996.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_2

2. Fundamentals of Imaging
Andrea Fusiello1
(1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

Sarà adunque pittura non altro che intersegazione della pirramide visiva, sicondo data distanza, posto il centro e constituiti i lumi, in una certa superficie con linee e colori artificiose representata.
(“Painting, then, will be nothing other than the intersection of the visual pyramid, at a given distance, with its centre fixed and its lights placed, artfully represented with lines and colours on a given surface.”)
—L. B. Alberti

2.1 Introduction
An imaging device works by collecting light reflected from
objects in the scene and creating a two-dimensional
image. If we want to use the image to gain information
about the scene, we need to be familiar with the nature of
this process that we would like to be able to reverse.

The word “camera” is the Latin word for “chamber”. It


was originally used in reference to the camera obscura
(dark chamber), a dark,
enclosed room used by artists and scientists in the
sixteenth century to observe the projected image of an

external object through a small hole. Over time, the term


was adapted to refer to the devices used to capture and
record images, such as cameras.

2.2 Perspective
The simplest geometric model of image formation is the pinhole camera, represented in Fig. 2.1; it is based on the same principle as the camera obscura. The tiny hole (pinhole) in one


wall of the room lets in a ray of light for each point in the
scene so that an image of the outside
world is drawn on the opposite wall (image plane).

Fig. 2.1 Image formation in the camera obscura/pinhole camera

Let M be a point in the scene, of coordinates (X, Y, Z), and let m, of coordinates (x, y), be its projection onto the image plane through the pinhole C. If f is the distance of C from the image plane (focal length), then from the similitude of the triangles we get:

-x/f = X/Z,   -y/f = Y/Z      (2.1)

and therefore

x = -f X/Z,   y = -f Y/Z      (2.2)
Note that the image is inverted with respect to the scene,
both left-right and top-bottom, as indicated by the minus
sign. These equations define the image formation process,
which is called perspective projection.
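As a small numerical illustration (a sketch with made-up values, not taken from the book's listings), equations (2.2) can be evaluated directly in MATLAB:

% Sketch: perspective projection of a single scene point (hypothetical values).
f = 0.05;                  % focal length in metres (assumed)
X = 0.2; Y = 0.1; Z = 2;   % scene point coordinates in metres (assumed)
x = -f*X/Z;                % image coordinates on the plane behind the pinhole;
y = -f*Y/Z;                % the minus sign accounts for the image inversion

Doubling Z halves both x and y, which is the foreshortening effect discussed below.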
The parameter f determines the magnification factor of the image: if the image plane is close to the pinhole, the image is reduced in size, while it will become larger and larger as f increases. Since
the working area of the image plane is framed, moving the
image plane away also
reduces the field of view, that is the solid angle that contains
the portion of the scene that appears in the image.
The division by Z is responsible for the foreshortening effect,
whereby the apparent size of an object in the image
decreases
according to its distance from the observer, such as the
sleepers of the tracks of Fig. 2.2, which, although being of
the same length in reality, appear shortened to different
extents in the image. In the words of
Leonardo da Vinci: “Infra le cose d’egual grandezza quella che sarà più distante dall’ochio si dimostrerà di minore figura”.1

Fig. 2.2 The projection on the left is definitely perspective—note the


converging lines—while the aerial image on the right approximates an
orthographic projection— the distance to the
object is certainly very large compared to its depth. Photos by Markus
Winkler (left) and Max Böttinger (right) from Unsplash

If the object being framed is relatively thin, compared to


its average distance from the observer, one can
approximate the perspective
projection with the orthographic projection. The idea is as
follows: if the depth Z of the object points varies over an interval centred at Z0 with half-width ΔZ, and ΔZ is small compared to Z0, then the perspective scale factor f/Z can be approximated by the constant f/Z0. The projection equations then become:

x = -(f/Z0) X,   y = -(f/Z0) Y      (2.3)

which is an orthographic projection composed with a scaling by a factor of f/Z0.
Leonardo da Vinci recommended using this
approximation
whenever .
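The following sketch (made-up numbers) contrasts the exact perspective projection with the scaled orthographic approximation for a point whose depth deviates slightly from the average depth Z0; the relative error remains small as long as the deviation is small compared to Z0:

% Sketch: perspective vs. scaled orthographic projection (hypothetical values).
f  = 0.05;  Z0 = 10;                 % focal length and average depth (assumed)
X  = 1; Y = 0.5; Z = Z0 + 0.2;       % a point 0.2 m away from the average depth
persp = [-f*X/Z;  -f*Y/Z];           % exact perspective projection (2.2)
ortho = [-f*X/Z0; -f*Y/Z0];          % orthographic projection with scaling f/Z0 (2.3)
rel_err = norm(persp - ortho) / norm(persp);   % small when |Z - Z0| << Z0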

2.3 Digital Images


In a digital camera, the image plane consists of a silicon
image sensor, which can be a CCD (Charge-Coupled Device)
or CMOS (Complementary
Metal-Oxide Semiconductor) array. An image sensor measures
approximately 5 0 mm and contains an order of
photosensitive elements, arranged in an array of
rectangular cells, each of
which converts the energy of incident light radiation into
an electrical potential.
This array is converted into a digital image (Fig. 2.3),
that is, into a matrix of integer values (e.g. in [0, 255]). The elements of the matrix are
named pixels (picture elements).
Fig. 2.3 A digital (greyscale) image is a matrix of values between 0 and
255. The elements (pixels) become visible if the image is magnified
We will denote by the brightness value of the
image in the pixel identified by row v and column u. The
size of the sensor matrix ( ) is not necessarily the
same as the pixel matrix, or image (
); for this reason, one pixel in the image corresponds
to a
rectangular area on the sensor, called footprint of the pixel,
not
necessarily equal to a cell. Its size is called effective pixel size (of the order of micrometres).

2.4 Thin Lenses


Vertebrate eyes and cameras use lenses. A lens, being
larger than a
pinhole, can gather more light. The downside is that not the
whole
scene can be in focus at the same time. The approximation
we make for the optics of the imaging system—which in
general is very complex,
being made up of multiple lenses—is that of the thin lens.
Thin lenses are ideal devices that enjoy the following
properties:

1. The rays parallel to the optical axis incident on the lens


are
refracted so that they pass through a point on the
optical axis called focus.
2. The rays passing through the centre of the lens are
unaffected.
The distance of the focus F from the centre of the lens C
is called
focal length D (Fig. 2.4). It depends on the radii of curvature
of the two lens surfaces and the refractive index of the
constituent material.

Fig. 2.4 Thin lens

For non-thin lenses there is no “center” of the lens, but


two nodal points (front and rear) are defined to
characterise its behaviour. By definition, an input ray
directed at the front nodal point leads to an output ray
which appears to come from the back nodal point,
forming the same angle with respect to the optical
axis. For a thin lens, the two nodal points coincide in
the centre of the lens.

Given a point of the scene M it is possible to construct graphically the image (or conjugate point) using two special rays that depart from M: the ray parallel to the optical axis, which, after refraction, passes through F, and the ray that passes unchanged
through C (Fig. 2.5).
Fig. 2.5 Construction of the conjugate point
Due to this construction and the similitude of the
triangles, we obtain the conjugate points formula (or
equation of the thin lens):

1/Z + 1/z = 1/D      (2.4)

The image of a point in the scene at distance Z from the lens is produced (in focus) at a distance z from the lens, which depends on the depth Z of the point and the focal length D of the lens.
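For instance (a minimal sketch, with lens parameters chosen arbitrarily), the thin lens equation can be solved for the distance at which the image is in focus:

% Sketch: conjugate distance from the thin lens equation (2.4), assumed values.
D = 0.05;              % focal length of the lens in metres (assumed)
Z = 2.0;               % depth of the scene point in metres (assumed)
z = 1/(1/D - 1/Z);     % in-focus distance behind the lens, from 1/Z + 1/z = 1/D
% As Z grows, z approaches D: distant objects are focused near the focal plane.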
The image of the point M, when in focus, is formed on the
image
plane at the same point predicted by the pinhole model with
the
pinhole coincident with the centre of the lens C: in fact, the
ray passing through C is common to the two geometric
constructions. The other
light rays that leave M and are collected by the lens
increase the light reaching .
If (2.4) is not satisfied one gets a blurred image of the
point, that is, a circle which is named circle of confusion. The
photosensitive elements have a small but finite size. As
long as the circle of confusion does not exceed the size
of the photosensitive element, the image is in focus.
Therefore, there is an interval of depths for which the points
are in
focus. This range is called depth of field. It is inversely
proportional to the diameter of the lens; in fact, the
pinhole camera has an infinite
depth of field. The light that is collected, on the other
hand, is directly proportional to the diameter of the lens.

The focal length of the lens is a different notion from the


focal length of the pinhole camera. Two different
notations, f and D, were used on purpose, although,
unfortunately, the two quantities have the same
name.

2.4.1 Telecentric Optics

It is interesting to note that by placing a diaphragm with a hole in correspondence of the focus F, as in Fig. 2.6, rays leaving M in a
direction parallel to the optical axis are selected, while
other rays are blocked. This results in a device that
produces an image according to a pattern that is obviously
different from the pinhole/perspective model. The image of
the point M, in fact, does not depend on its depth: this is
indeed a telecentric camera, which realises an orthographic
projection.

Fig. 2.6 Principle of the telecentric camera

2.5 Radiometry
Light is emitted by the sun or other sources and interacts
with objects in the scene; part is absorbed, and part is
scattered and propagates in new directions. Among these
directions, some point towards the camera sensor and contribute to the image formation.
The brightness of a pixel in the image I is proportional to the “amount of light” that the surface patch centred at a point M scatters in the direction of the camera, where the patch is the area of the surface that projects into the pixel.
This in turn depends on the surface reflectance and the
spatial distribution of the light sources.

A material is characterised from the radiometric point of


view by its reflectance, defined as the ratio of energy
reflected to the total
energy incident on a surface. Reflectance is largely
determined by the physical properties of the surface in
relation to the wavelength and incident/viewing angle.

The “amount of light” travelling along a ray is formally


measured by the radiance: radiance is defined as
the power of light
radiation per unit area per unit solid angle emitted (or
received) by point M along the direction .
Neglecting attenuation due to the atmosphere, the
radiance leaving a point M of the surface in the
direction of coincides with the radiance reaching
from the same direction, therefore:
(2.5)
The reflectance of a material is encoded by its
bidirectional reflectance distribution function (BRDF). The
BRDF
depends on the surface point M, the lighting direction and
the viewing direction (refer to Fig. 2.7). In general, given
a point M, is a 4D
function parameterised by the angular coordinates of
and . For many materials, however, it depends only
on the mutual angles
between the surface normal , and (Woodham 1980), so
it reduces
to 3D.
Fig. 2.7 Elements involved in the reflectance equation
The radiance leaving M in the viewing direction is
given by the reflectance equation:

(2.6)

where the integral computes the radiance incident at


point M from all directions in a hemisphere above the
surface centred on the surface normal .
The scalar product takes into account the fact that
the
radiance is measured on a unit area that is perpendicular to
, while the surface is oriented as . In order to accurately
calculate the reflectance equation, all direct and indirect
light sources should be considered in
the integral, and the radiance reaching M would then
depend on every other surface patch visible from it.
However, in a first approximation, only the primary
sources are taken into account and, thanks to the
principle of superposition of effects, only one source can be
considered at a time. If the light source is assumed to be a
distant point, the
integral reduces to a single evaluation of the BRDF, and
the reflectance equation can be written as follows:
(2.7)
where is the emitted radiance, is the direction
under which M sees the light source and is the versor of
the normal in M (refer to Fig. 2.7). The Iverson brackets
are introduced to account only for light contributions
that form an acute angle with .

The Iverson brackets construct a binary function from a


predicate and are defined as follows:

[P] = 1 if the predicate P is true, and 0 otherwise      (2.8)

In MATLAB this is implicitly done by identifying logical


values with 1/0.

The simplest BRDF is the constant one, which


describes materials that scatter the light in all directions
uniformly. This phenomenon is called diffuse scattering and
follows the so-called Lambert’s law:
(2.9)
The radiance emitted by M does not depend on the
viewing
direction, so diffusive surfaces appear equally bright from
every angle. ρ is the albedo of M, which takes values
between 0 (the light is not reflected by the object and the
material appears black) and 1 (all light is reflected and the
material appears of the colour of the light). At the
opposite side of an ideal spectrum of behaviours lies the
specular
reflection, where scattering is concentrated along one
direction only, the one for which the reflected and
incident rays lie in the same plane and the angle of
reflection is equal to the angle of incidence: this is the
behaviour of a perfect mirror.
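As a minimal sketch of the diffuse case (all numbers invented, and the common 1/pi normalisation of the Lambertian BRDF is assumed), the point-source reflectance equation (2.7) combined with Lambert's law (2.9) reduces to a scalar product between the normal and the lighting direction:

% Sketch: radiance scattered by a Lambertian patch lit by a distant point source.
rho = 0.7;                          % albedo of the patch (assumed)
E   = 1.0;                          % radiance of the point source (assumed)
n   = [0; 0; 1];                    % unit surface normal (assumed)
s   = [1; 1; 2];  s = s/norm(s);    % unit direction towards the light (assumed)
L   = rho/pi * E * dot(n, s) * (dot(n, s) > 0);   % Iverson bracket as a logical factor
% L does not depend on the viewing direction: a diffusive surface looks
% equally bright from every angle.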
This topic is extensively discussed in computer graphics
textbooks (Akenine-Möller et al. 2008); for our purposes
we will assume that the surfaces of the scene are diffusive,
and treat deviations from this model as disturbances.
References
Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. Real-Time Rendering, 3rd Edition. A. K. Peters, Ltd., Natick, MA, USA, 2008. ISBN 978-1-56881-424-7.
Robert J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144, 1980.

Footnotes
1 “Of things of equal size, that which is most distant from the eye will appear
of lesser figure”

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_3

3. The Pinhole Camera Model


Andrea Fusiello1
(1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

3.1 Introduction
Referring back to the geometric aspect of image formation,
covered in
Chap. 2, we shall now introduce a geometric model for the
camera, that is, we will study how the position of a point in
the scene and the position of the corresponding point in the
image are related. We make the
simplifying assumption that lenses can be neglected
as far as the geometry of projection is concerned.

3.2 Pinhole Camera


The geometric model of the camera that we will adopt is
the so-
called pinhole or perspective model, which consists (refer to
Fig. 3.1) of the image plane and the centre of projection
(Centre of Projection
(COP)) C, located at a distancef (focal length) from the image
plane. The bundle of lines with centre in C is also called the
projective bundle.
Fig. 3.1 Pinhole camera. C is the COP, is the image plane and f is the focal
length
The line through C orthogonal to is the principal axis of
the
camera and its intersection with is called the principal
point. The plane parallel to and containing C is
called the focal plane. The points in are projected to
infinity on .
To analytically describe the perspective projection
operated by the camera, we need to introduce
appropriate Cartesian coordinate systems in which to
represent the points in the three-dimensional (3D) space
and the points in the image plane.
Let us consider, initially, a very special case, in which the
coordinate systems are chosen so as to obtain particularly
simple equations. In Sect. 3.4 we will enhance the model by
studying the more general case.

3.3 Simplified Pinhole Model


The coordinate system in which the points of the three-
dimensional
space are represented is called the world (or object)
coordinate system.
Let us introduce the camera reference frame, a right-handed triplet of axes centred in C and with the Z
axis coincident with the principal axis. In this simplified
model, the world reference frame is
chosen coincident with the camera reference frame.
We also introduce a reference frame for the plane
centred at the principal point and with the axes u and v
oriented as X and Y ,
respectively, as shown in Fig. 3.2.
Fig. 3.2 Simplified model of the camera with reference frames. The point M is
projected onto the point m where the line CM intersects the image plane

Now consider a point M of coordinates (X, Y, Z) in 3D space and let m of coordinates (u, v) be its projection onto the image plane through C. In the following, unless
otherwise noted, we will always use ˜ to denote Cartesian
coordinates and distinguish them from homogeneous
coordinates.
By simple considerations on the similitude of triangles
(Fig. 3.3), we arrive at the following relation:

u/f = X/Z,   v/f = Y/Z      (3.1)

that is

u = f X/Z,   v = f Y/Z      (3.2)
Fig. 3.3 Nadiral view of the camera model. The two triangles that have
and for hypotenuse are similar to each other
This is the perspective projection. The transformation in
Cartesian coordinates is clearly non-linear because of the
division by Z. Using homogeneous coordinates1 instead
it becomes linear.
Let therefore be

m = (u, v, 1)ᵀ,   M = (X, Y, Z, 1)ᵀ      (3.3)

the homogeneous coordinates of m and M, respectively. Note


that by
setting the last coordinate to 1, we have excluded the
points at infinity (to include them we would have used a
generic last component). So the equation for perspective
projection, in this simplified case, rewrites:

(u, v, 1) ≃ (f X, f Y, Z)      (3.4)

Switching to matrix notation:

m ≃ [ f 0 0 0 ; 0 f 0 0 ; 0 0 1 0 ] M      (3.5)

or, also:

m ≃ P M      (3.6)

where ≃ means “equal up to a scaling factor”.
The matrix P represents the geometric model of the
camera and is called Perspective Projection Matrix (PPM)
or camera matrix.
Listing 3.1 reports the implementation of a generic
projective
transformation, which includes perspective projection as a
special case.
Listing 3.1 Projective transformation
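Independently of the toolkit code, a minimal numerical sketch of (3.5)–(3.6) (values are made up) projects a homogeneous scene point with the simplified PPM and normalises the result:

% Sketch: projection with the simplified PPM (hypothetical values).
f = 0.05;                            % focal length in metres (assumed)
P = [f 0 0 0; 0 f 0 0; 0 0 1 0];     % simplified PPM of (3.5)
M = [0.3; -0.2; 5; 1];               % homogeneous coordinates of a scene point
m = P * M;                           % projection, defined up to a scale factor
m = m / m(3);                        % normalise: (u, v, 1) on the image plane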

3.4 General Pinhole Model


A realistic model of a camera, describing the transformation
from 3D
coordinates to pixel coordinates, as well as the perspective
transformation (Fig. 3.4), must take into account:
The change of coordinates from metres in the sensor
plane to pixels in the image.
The change of coordinates between the camera
reference frame and the world reference frame.
Fig. 3.4 General model of the camera with reference frames. Note that the
image plane should be viewed from the half-space opposite to C, so that the
origin of the image reference is in the upper left corner of the image


3.4.1 Intrinsic Parameters

The transformation from the sensor coordinates to pixels is modelled by an affine transformation A that takes into
account the translation of the origin at the upper left
corner, the independent rescaling of the u and v axes for
changing the unit of measurement from metres to pixels and
the reversal of the direction of the axes:

(3.7)

where (u0, v0) are the coordinates in pixels of the principal point and ku (kv) is the inverse of the effective pixel size (see Sect. 2.3) along the direction u (v), measured in pixel/m. The matrix
A transforms the
coordinates from metres to pixels; hence, it pre-
multiplies the PPM, which becomes:
(3.8
)
wher
e
(3.9)

K is the matrix of intrinsic parameters, while encodes the


essence of the perspective transformation, without any
parameters. It is
obtained when the world reference frame coincides with the
camera
reference frame and the image points are expressed in
normalised image coordinates, that is,
(3. 10)
Note that whereas pixel coordinates are always measurable
in the image,
to step to normalised image coordinates the matrix
of intrinsic parameters is needed.
The most general K also includes the skew angle, which corresponds to the angle between the u and v axes and is normally 90°:

(3.11)

Let us define the focal length expressed in (horizontal) pixels, f_u, and the pixel aspect ratio, r; then the most general matrix of intrinsic parameters writes:

(3.12)

It has five degrees of freedom, for it depends on five intrinsic parameters. Reduced parameterisations, from five down to one parameter, can be obtained by making educated guesses on some of them. Normally, it is safe to assume a skew angle of 90°, as we implicitly did in the beginning. Also, very often the footprint of the pixel is square, that is, the aspect ratio is 1. Pushing this further, one could reduce the intrinsic parameters to the sole focal length by fixing the principal point in the centre of the image.
Some authors parameterise K with its entries, by introducing the focal length expressed in vertical pixels, f_v, and the skew parameter s:

K = [ f_u  s  u_0 ; 0  f_v  v_0 ; 0  0  1 ]      (3.13)
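As a sketch (every number is invented), a plausible calibration matrix with square pixels and zero skew can be assembled following the entry-wise parameterisation of (3.13):

% Sketch: building a matrix of intrinsic parameters (hypothetical values).
fu = 1200;  fv = 1200;     % focal length in horizontal/vertical pixels (square pixels assumed)
u0 = 640;   v0 = 360;      % principal point, roughly the image centre (assumed)
s  = 0;                    % skew parameter, normally zero
K  = [fu  s  u0;
      0  fv  v0;
      0   0   1];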

Although from the physical point of view the image plane


can only be behind the pinhole, ideally it can also be placed
in front, at the same
distance f, obtaining the so-called positive image, which has the
same
size of the negative image (refer to Fig. 3.5), but which is not
inverted. This trick was first recommended in the
Renaissance by Leon Battista Alberti. The choice of
whether to have positive or negative image is only a
convention: what changes is the sign of f.

Fig. 3.5 Positive and negative images

So far we have modelled a real camera, with negative focal length (f < 0), in which the image plane is placed behind the pinhole and the image is inverted (if we entered a real camera obscura, this is how we would see the image). However, the images we receive from electronic devices are not inverted, so the positive focal length (f > 0) model is the one most consistent with the digital image formation process. We
will therefore choose this model from now on. Beware that
positive focal implies and .
3.4.1.1 Field of View

The field of view is the apex angle of the pyramid of vision, that is, the pyramid with apex in the COP that extends along the optical axis to infinity and encompasses the extent of the world that is imaged by the camera. The same apex angle is shared by the pyramid that has the image as its base and its apex in the COP, whose height is the focal length (Fig. 3.6).2

Fig. 3.6 The pyramid of vision (extending to the right of C) and the field of view

Since the base of the pyramid is—in general—a


rectangle, there are two apex angles, one horizontal and
one vertical. The horizontal field of view θ_u, for example, is related to the focal length f_u through the image width w (in pixels) by:

θ_u = 2 arctan( w / (2 f_u) )      (3.14)

And, conversely:

f_u = w / (2 tan(θ_u / 2))      (3.15)
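A quick sketch of the conversion in both directions (image width and angle are made up):

% Sketch: focal length in pixels from the horizontal field of view and back.
w     = 1920;                     % image width in pixels (assumed)
theta = deg2rad(60);              % horizontal field of view (assumed)
fu    = w / (2*tan(theta/2));     % focal length in pixels, as in (3.15)
theta_check = 2*atan(w/(2*fu));   % recovers the field of view, as in (3.14)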

Focal length and field of view are related, but the latter is
a more
intelligible parameter than the focal length in pixels: an
educated guess for lies in the range from to . For
this reason, in our MATLAB
toolkit we choose to parameterise as in (3.15), obtaining
the
following five intrinsic parameters: . Thanks to
the modular approach of the toolkit, however, this aspect
remains encapsulated in
the function par2K (Listing 3.2) that computes the
parameterisation and its derivative. We will henceforth use
as the first parameter in the
text, but it is understood that the implementation uses
instead.

Listing 3.2 Parameterisation of K (Jacobian omitted for space reasons)

3.4.2 Extrinsic Parameters

To account for the fact that—in general—the world reference frame does not coincide with the camera reference frame, we must introduce a coordinate change consisting of a direct isometry, that is, a rotation R followed by a translation t. Let us denote by M_c the homogeneous coordinates of a point in the camera coordinate system and by M_w the homogeneous coordinates of the same point in the world coordinate system. We can therefore write:

M_c = G M_w      (3.16)

where

G = [ R  t ; 0ᵀ  1 ]      (3.17)

The matrix G encodes the exterior orientation (position and


angular attitude3 ) of the camera with respect to the world
reference frame.
Specifically, the rows of R are the X, Y and Z axes of
the camera reference frame expressed in the world
coordinate system.
By appropriately parameterising the rotation R (see
Sect. A. 15), the exterior orientation of the camera is
encoded by six extrinsic
parameters: three Euler angles that specify the
rotation and three components for the
translation.
From (3.8) we obtain:

m ≃ K [I | 0] M_c      (3.18)

and substituting (3.16) into the equation above gives:

m ≃ K [I | 0] G M_w      (3.19)

and therefore

P = K [I | 0] G      (3.20)

This is the most general form of the perspective projection matrix: G encodes the exterior orientation of the camera, K contains its intrinsic parameters (by analogy, we say that it encodes the interior orientation), while the matrix [I | 0] represents a perspective transformation from
camera coordinates to normalised image coordinates.
Depending on convenience, we may also consider
the following expression for the PPM:
(3.21)
which is obtained by substituting into (3.20) the block
form of G and developing the product with the central
matrix.
By letting

(3.22
)
we get the following expression for P as a function of the
elements of the matrices K and G:

(3.23)

The rows of R could be further exploded into sines and


cosines of the angles and (see Sect. A. 15).
A PPM is defined up to an arbitrary scale factor. In fact, if
one
replaces P with λP in (3.6), one obtains the same projection, for every non-zero real λ. Thus, if one wants to bring a generic PPM into the form of (3.23), it is necessary to rescale it so that the vector formed by the first three entries of its third row has unit norm.
P has 12 entries but has 11 degrees of freedom since it
loses one due to the scaling factor: this implies that it must
depend on 11 independent parameters, namely the 5 intrinsic parameters plus the 6 extrinsic ones.
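Putting the pieces together, a sketch with invented numbers (not the toolkit's camera-building function): given K, a rotation R and a translation t, the PPM of (3.21) maps homogeneous world points to pixel coordinates.

% Sketch: assembling a PPM and projecting a world point (hypothetical values).
K   = [1200 0 640; 0 1200 360; 0 0 1];    % intrinsic parameters (assumed)
ang = deg2rad(10);                        % rotation angle about the Z axis (assumed)
R   = [cos(ang) -sin(ang) 0;
       sin(ang)  cos(ang) 0;
       0         0        1];
t   = [0.1; 0; 2];                        % translation (assumed)
P   = K * [R t];                          % PPM as in (3.21), defined up to scale
M   = [0.5; 0.2; 3; 1];                   % homogeneous world point (assumed)
m   = P * M;   m = m / m(3);              % pixel coordinates (u, v, 1)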

Let us summarise, with the help of Fig. 3.7, the


coordinate systems we introduced and their relationship.
World coordinates: are the generic coordinates that
represent
points in 3D space, denoted by .
Camera coordinates: are the coordinates of the 3D points in
a
reference frame with origin in the COP and integral with
the camera. They will often be referred to with the same
notation as the world coordinates and
only in special cases, where they need to be
distinguished, will a subscript be used ( ).
Normalised image coordinates: are measured in metres with
origin at the principal point and denoted by .
Image coordinates: are measured in pixels with origin in
the upper left corner of the image and denoted by
.
3.5 Dissection of the Perspective Projection Matrix
In this section we shall see how to derive some
properties of the perspective projection directly from
the PPM.
Let us write the PPM according to its rows
(3.24)
and insert this into (3.6):
(3.25)
so that the perspective projection equation becomes
(this is the generalisation of (3.2)):

(3.26)

The focal plane has equation and contains


the points that project to infinity (excluding C). The
planes of equation and contain the
points that project into the image axes and ,
respectively (Fig. 3.8).
Fig. 3.8 Geometric determination of the centre of projection
The COP is defined by the intersection of these three
planes. The
intersection of and is the line projecting
into , which must necessarily pass through C, which
belongs to . Therefore, C must belong to all three of
these planes:

P C = 0      (3.27)

In other words, C is the null space of P,


which, by the rank-nullity theorem (Theorem A.2) has
dimension one. Note that the COP is the only point in
space for which the projection is not defined,
and, in fact, (0, 0, 0)ᵀ is the only triplet that does not represent any
point in the projective plane.
If we partition P as P = [Q | q], the system (3.27) rewrites:

Q C̃ + q = 0      (3.28)

The coordinates of the COP are:

C̃ = -Q⁻¹ q      (3.29)
Setting Q = KR and q = Kt in the above equation yields the relation between the COP and the translation component of the exterior orientation:

t = -R C̃      (3.30)
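A sketch of two equivalent ways of recovering the COP from a given PPM (the matrix below is arbitrary but full rank): as the null space of P, and in closed form from the partition P = [Q | q] as in (3.29).

% Sketch: centre of projection from an arbitrary PPM (hypothetical values).
P = [1200 0 640 100; 0 1200 360 50; 0 0 1 2];
C_h = null(P);                % homogeneous COP: the one-dimensional null space of P
C_h = C_h / C_h(4);           % normalise the last coordinate
Q = P(:, 1:3);  q = P(:, 4);  % partition P = [Q | q]
C_tilde = -Q \ q;             % Cartesian COP from (3.29); same point as C_h(1:3)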

The parameterisation of the exterior orientation of a PPM


through the Euler angles and the translation, although useful and appropriate from the mathematical point of view, is not very intuitive to the human user. The main graphics libraries, for example, allow one to specify a camera by:
position (“where the eye is”), a point on the
optical axis (“where it is looking”) and a point “above” to
fix the
vertical direction. Listing 3.3 shows a MATLAB
implementation of this idea; it can be used in simulated
experiments to get a realistic camera
placement. Note that the function returns only
the exterior orientation of the camera, that is,
.

Listing 3.3 Construction of the PPM

The optical ray of point m is the straight line through the


COP C and m itself, that is, the geometric locus of points
such that m ≃ P M. On the optical ray of m lie all
points in space of which the point m is the
projection.
One point that belongs to this line is C, by definition.
Another is the point (at infinity) that has coordinates (Q⁻¹ m, 0)ᵀ: in fact, it suffices to project this point to verify that:

P (Q⁻¹ m, 0)ᵀ = Q Q⁻¹ m = m      (3.31)

The parametric equation of the optical ray of m is therefore the following:

M(λ) = (C̃, 1)ᵀ + λ (Q⁻¹ m, 0)ᵀ,   λ ∈ ℝ      (3.32)
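A sketch of the optical ray (values made up): starting from the COP and moving along the direction Q⁻¹m generates points whose projection is again m, up to scale.

% Sketch: optical ray of an image point (hypothetical PPM and point).
P = [1200 0 640 100; 0 1200 360 50; 0 0 1 2];
m = [700; 400; 1];                          % homogeneous image point (assumed)
Q = P(:, 1:3);  q = P(:, 4);
C = [-Q\q; 1];                              % homogeneous COP
lambda = 3;                                 % any value of the ray parameter
M = C + lambda * [Q\m; 0];                  % a point on the optical ray (3.32)
m_check = P * M;  m_check = m_check / m_check(3);   % equals m after normalisation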

3.5.1 Collinearity Equations

Let us rewrite the perspective projection in normalised image coordinates, with the points M in camera coordinates:

(3.33)

Using (3.30) we get that


then

(3.34)

Furthermore, being:

(3.35)

one finally has:

(3.36)
We obtained what in photogrammetry are known as collinearity equations (Kraus 2007). They express the fact that the COP C, the object point M and the image point m are aligned on
the same optical ray. The rows of R can be further
expanded into their components, which are
combinations of trigonometric functions of the angles
.

3.6 Radial Distortion


Real photographic objectives are complex optical systems made of non-thin lenses. In such devices one can identify
several nodal points, placed at a certain distance from each
other.
If these lie along the optical axis of the system and other
alignment conditions hold (orthoscopy condition), we can
simplify the system and consider only two nodal points: one
for the outer air-glass interface and one for the inner glass-
air interface. In this condition the ray passing
through the outer nodal point and the ray passing through
the inner
nodal point are parallel, and therefore, we can consider the distance between the two nodal points to be zero, thus obtaining the thin lens model.
If the orthoscopy condition is violated, the thin lens
model is not
applicable and this in turn results in a deviation from the
pinhole model, which causes a distortion of the image,
called radial distortion, whose
effect is illustrated in Fig. 3.9.

Fig. 3.9 Cushion (outer dashed red line) and barrel (inner dashed blue line)
radial distortion, obtained with and , respectively
The standard model in computer vision4 is a
transformation defined on the normalised image coordinates
that brings the ideal (undistorted)
ones onto the observable (distorted) ones :

(3.37)

where . The degree of the polynomial in


determines the accuracy with which one wants to model the
phenomenon; however, be careful that too high a degree
may lead to modelling the noise that
inevitably plagues the measurements, instead of the
phenomenon itself. Very often we stop at the first
coefficient and in any case rarely exceed the fourth.
We call the function defined by (3.37), such
that

(3.38)

where the indicates that the function works in Cartesian


coordinates.
Listing 3.4 reports the implementation of the radial
distortion,
including the computation of the Jacobian matrix of the
map, both with respect to the ideal coordinates and with
respect to the coefficients.
In order to apply radial distortion in pixel coordinates,
one has to
switch to normalised image coordinates, apply the distortion
and switch back to pixel coordinates, that is:

(3.39)

where is the homogeneous counterpart of .
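As a concrete illustration of (3.37) and (3.39), the following MATLAB sketch applies the radial distortion to points given in pixel coordinates (the function name and the coefficient ordering are assumptions of this sketch, not the book's listing):

% Sketch: radial distortion applied in pixel coordinates, as in (3.39):
% normalise with K, distort according to (3.37), map back to pixels.
% m: 2xn pixel points, k: distortion coefficients, K: intrinsic matrix.
function md = rdx_pix_sketch(m, k, K)
    n  = size(m, 2);
    u  = K \ [m; ones(1, n)];             % normalised image coordinates (third row = 1)
    r2 = u(1,:).^2 + u(2,:).^2;           % squared radial distance
    f  = ones(1, n);
    for j = 1:numel(k)
        f = f + k(j) * r2.^j;             % 1 + k1*r^2 + k2*r^4 + ...
    end
    md = K * [u(1,:).*f; u(2,:).*f; ones(1, n)];
    md = md(1:2, :);                      % third row is 1, since K(3,:) = [0 0 1]
end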


The inverse map of , the one that brings the distorted
coordinates into ideal ones, that is, compensates for the
radial distortion, is not easy to write in closed form.
Although intuition tells us that a barrel
distortion can be compensated for by applying a cushion one
and vice
versa, the polynomial derived from the analytic inversion
(Drap and
Lefèvre 2016) is of higher degree than the original (e.g. 10
coefficients are needed to invert a 4-coefficient model),
and the coefficients are not always decreasing: in some
situations they have increasing modulus and alternate signs,
producing “unstable” behaviour. For these reasons we
prefer to rely on the numerical inversion, which entails the
solution of a system of polynomial equations (for each
point) in the unknowns x and
y:

(3.40
)

If (e.g.) the Gauss-Newton method is applied, the Jacobian


matrix of must be computed:

(3.41
)

Listing 3.4 Radial distortion: direct model

In the case of a single coefficient , one must solve the


system of two equations of degree three:
(3.42)
whose Jacobian matrix is:
(3.43)

The development of the Jacobian up to the fourth


coefficient can be found in Listing 3.4, while Listing 3.5
reports the implementation of the inverse radial distortion
model. The function first normalises the
coordinates, then computes the inverse model and finally
transforms back into pixels.
Listing 3.5 Radial distortion: inverse model
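The following sketch shows a numerical alternative: instead of the Gauss-Newton scheme of the text, it compensates the distortion by a simple fixed-point iteration on the same polynomial model (this is a choice of the sketch, not the book's listing):

% Sketch: compensation of radial distortion by fixed-point iteration
% (an alternative to the Gauss-Newton scheme described in the text).
% xd: 2xn distorted normalised coordinates, k: distortion coefficients.
function xu = irdx_sketch(xd, k, niter)
    xu = xd;                                    % initial guess: undistorted = distorted
    for it = 1:niter
        r2 = xu(1,:).^2 + xu(2,:).^2;
        f  = ones(1, size(xu, 2));
        for j = 1:numel(k)
            f = f + k(j) * r2.^j;
        end
        xu = xd ./ [f; f];                      % invert xd = xu .* f at the current estimate
    end
end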

The inverse model is only used to correct sparse points. If, on the other hand, we intend to transform the whole image, the direct transformation is sufficient because backward mapping is used in that case (see Sect. 18.4).

Problems
3.1 Prove that if M is a point at infinity, its projection is given by where .

3.2 Let us consider a circle in space, centred on the


optical axis and orthogonal to it: (in
the camera frame). Its projection onto the image is:
that is, still a circle. If the centre of the
circle moves off the optical axis, does the projection
remain a circle?
3.3 Consider the situation, typical in aerial photogrammetry, when the scene consists of a plane parallel to the image plane (called ground, in this context). In this case the image is related to the ground
plane by a
similitude; the photographic scale of the image is defined as the
ratio
between a distance measured on the image (in pixels) and
the
corresponding distance on the ground (in metres). Derive
the formula
for the photographic scale given the focal length in pixels
and the
distance to the ground. Note that if the framed object is not
planar or not parallel to the image plane, the scale is not
constant across the image.
3.4 With reference to Problem 3.3 observe that the inverse of the photographic scale coincides with the pixel footprint on the ground, or Ground Sampling Distance (GSD).

3.5 Equation (3.6) with the explicit scaling factor writes:


(3.44)
Prove that if P is normalised as in (3.23), then is equal to the distance of point M from the focal plane, a.k.a. depth of the point.
3.6 Prove that if P is normalised as in (3.23), the parameter in the equation of the optical ray (3.32) is equal to the depth of the point, as defined in Problem 3.5.

3.7 Prove that in a pinhole camera the translation of the COP along the optical axis and the variation of the focal length (a.k.a. zoom) do not produce the same effect.
3.8 The perspective projection of a bundle of parallel lines in the object space is a bundle of lines in the image plane passing through a common point, called vanishing point. Find the coordinates of the vanishing point for a given direction in the object space.

3.9 Prove that if the intrinsic parameters are known, the angle between two optical rays can be computed and derive an expression for it.

References
Pierre Drap and Julien Lefèvre. An Exact Formula for Calculating Inverse Radial
Lens Distortions. Sensors, 16 (6): 807, 2016.
Karl Kraus. Photogrammetry - Geometry from Images and Laser Scans - 2nd edition. Walter de Gruyter, Berlin, 2007.

Footnotes
1 The homogeneous coordinates of a 2D point in the image are the ,
where
are the corresponding Cartesian coordinates. Thus, there is a one-
to-many
correspondence between Cartesian and homogeneous coordinates.
Homogeneous coordinates can represent any point in the Euclidean plane and
also points at infinity, which have the third component equal to zero and
therefore do not have a corresponding Cartesian representation. See
Appendix D.

2 Humans instead have a circular sensor; hence, we speak of the cone of vision.
People have an approximate angle of undistorted vision that extends as a
fictitious cone from their eyes forward.

3 In navigation orientation is the determination of the east cardinal point


(oriens in latin); by extension, in topography and photogrammetry, it indicates
the determination of position and direction with respect to cardinal points or
other reference frames.

4 Beware that other models are in use in the field of photogrammetry.


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_4

4. Camera Calibration
Andrea Fusiello1
( 1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

4.1 Introduction
In the previous chapter we introduced a geometric model
for the camera and we shall now address the problem of
precisely and accurately
determining the parameters of that model, a process that
goes by the name of calibration of the camera.
The idea of the resection methods for calibration is that
by knowing the image projections of 3D points of known
coordinates (control
points), it should be possible to obtain the unknown
parameters by
solving the perspective projection equation. The specific
algorithm we will illustrate, called Direct Linear Transform
(DLT), directly estimates the PPM, from which the intrinsic
and extrinsic parameters can
subsequently be derived.
Resection is not the only way to perform calibration (e.g.
Caprile and Torre 1990 uses vanishing points). In Sect. 4.5
we will also see a non-resection calibration method.

4.2 The Direct Linear Transform Method
Given the correspondences between n non-coplanar
control points and their projections in the image , it
is required to compute a
matrix P such that:
(4. 1)
For example, the control points can be the vertices of the black squares on the two visible facets of the trihedron depicted in Fig. 4.1 and their coordinates are known by construction (the world reference frame is fixed integral with the edges of the trihedron). The image coordinates of the control points can be obtained manually or by classical image processing methods (e.g. using template matching or the Harris-Stephens operator which we will see in Sect. 11.4).

Fig. 4.1 Calibration object with the world reference frame superimposed.
One of the control points is highlighted

To eliminate the scaling factor from (4.1) we


exploit the cross product, rewriting the previous
equation as: (4.2)

Then, using the properties of the Kronecker product (Sect.


A. 14), and in particular the “vec trick” (A.59), we derive
that:
(4.3)
These are three equations in 12 unknowns, but only 2 of
them are
independent; in fact, the rank of is two
since it is the Kronecker product of a matrix of rank one
with a matrix of rank two.
With n points we obtain a system of 2n
homogeneous linear equations, which we can write
in matrix form as

(4.4)
where A is the matrix of coefficients and depends on
the
coordinates of the control points, while the vector of
unknowns contains the 12 elements of P read by
columns. In theory, then, six non- coplanar points are
sufficient for the computation of P; if any four of
them are coplanar, the rank of A drops and the DLT
algorithm fails (Faugeras 1993).
In practice, it is advisable to use all the available
points (the more, the better) to compensate for the
inevitable measurement errors.
System (4.4) is therefore solved with the least-squares
method:
according to Proposition A. 13, the solution is the least
right singular vector of A.
Listing 4.1 implements this method.

Listing 4.1 DLT method
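A sketch in the spirit of Listing 4.1 (a reconstruction under the stated conventions, not the book's code) may help fix the ideas:

% Sketch of the DLT: estimate P from 3D control points M (3xn) and their
% image projections m (2xn), via the "vec trick" (4.3) and the SVD (4.4).
% Function name and argument layout are assumptions of this sketch.
function P = dlt_sketch(m, M)
    n = size(m, 2);
    A = zeros(3*n, 12);
    for i = 1:n
        mi = [m(:,i); 1];                                        % homogeneous image point
        Mi = [M(:,i); 1];                                        % homogeneous 3D point
        S  = [0 -mi(3) mi(2); mi(3) 0 -mi(1); -mi(2) mi(1) 0];   % [m]_x (cross-product matrix)
        A(3*i-2:3*i, :) = kron(Mi', S);                          % (M' (x) [m]_x) vec(P) = 0
    end
    [~, ~, V] = svd(A);
    P = reshape(V(:, end), 3, 4);                                % least right singular vector, read by columns
end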


Be aware that, while preconditioning of image points is always appropriate, as regards the 3D points it is recommended by Hartley and Zisserman (2003) only if they are “shallow”, that is, their depth variation is small compared to the distance from the camera.

Listing 4.2 Resection with preconditioning

4.3 Factorisation of the Perspective Projection Matrix
Given a matrix P of full rank, we are required to
decompose it as

(4.5)
Let us focus on the submatrix in : by
comparison with , we have that with K
upper triangular and R orthogonal (with positive determinant). Let
(4.6)
be the QR factorisation of , with Q orthogonal
and U upper triangular. Since , it
suffices to set
(4.7
)
where the product by is used to make the determinant
of R
positive, as required for a rotation matrix. We are allowed to
change sign
to R since this results in changing sign to P, which can
absorb an arbitrary scaling factor. For the same
reason, we can rescale K by imposing .
The translation is eventually calculated with
(4.8
)
where we take into account the possible change of sign of P.
Listing 4.3 shows the MATLAB implementation of the
method.

Listing 4.3 Factorisation of the PPM
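The following sketch illustrates one possible implementation of this factorisation (a reconstruction, not Listing 4.3; in particular, the sign handling shown is just one admissible convention):

% Sketch of the factorisation P ~ K [R | t] via the QR decomposition of
% inv(P(:,1:3)). The choice of a positive focal length is an assumption.
function [K, R, t] = krt_sketch(P)
    M = P(:, 1:3);
    [Q, U] = qr(inv(M));            % inv(M) = Q*U, Q orthogonal, U upper triangular
    R = Q';  K = inv(U);            % hence M = K*R, with K upper triangular
    S = diag(sign(diag(K)));        % make the diagonal of K positive
    K = K * S;  R = S * R;          % K*S*S*R = K*R, since S*S = I
    s = det(R);                     % +1 or -1: flipping R amounts to flipping the sign of P
    R = s * R;
    t = s * (K \ P(:, 4));          % translation, with the same sign change applied to P(:,4)
    K = K / K(3, 3);                % remove the arbitrary scale of K
end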

Note: by Theorem A.3, the QR factorisation is unique up to the sign of the diagonal elements of R. This gives us the freedom to impose that the focal length is positive or negative, depending on the convention adopted (line 6 of Listing 4.3).

4.4 Calibrating Radial Distortion


Let us now see how to calibrate the radial distortion: the goal
is to
derive the coefficients . Suppose initially that we
know the
PPM with its parameters and the 3D coordinates of a
number of control points, let be one of them and let
be the
corresponding ideal image point, according to the pinhole
model:

(4.9) Due to radial distortion the point observed


in the image does not coincide with the ideal point,
according to (3.37). Each point
gives two linear equations (in normalised image coordinates) in the unknown coefficients :
(4.10)

Since there are usually more points than coefficients,


we obtain an overdetermined system to be solved with
least-squares, as we do in Listing 4.4.
Listing 4.4 Regression of radial distortion
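The following sketch illustrates the linear regression of (4.10), assuming the ideal and the observed points are given in normalised image coordinates (names and argument layout are assumptions of the sketch, not the book's listing):

% Sketch: linear least-squares estimation of the radial distortion
% coefficients. u: 2xn ideal points, d: 2xn observed (distorted) points,
% nk: number of coefficients. Each point gives two equations, linear in k.
function k = rdx_regression_sketch(u, d, nk)
    r2 = u(1,:).^2 + u(2,:).^2;          % squared radii, 1xn
    b  = d - u;  b = b(:);               % residuals d - u, stacked as a 2n x 1 vector
    A  = zeros(2*size(u, 2), nk);
    for j = 1:nk
        W = u .* [r2.^j; r2.^j];         % the j-th coefficient multiplies u * r^(2j)
        A(:, j) = W(:);
    end
    k = A \ b;                           % least-squares solution of A*k = b
end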

The assumption that the PPM is known, however, is


unrealistic, since the radial distortion must be estimated at
the same time as the calibration, thus the ideal points are
not accessible. It is therefore necessary to
proceed iteratively, alternating between resection and
radial distortion estimation, by which the distorted
coordinates are corrected, until
convergence is achieved, as in Listing 4.5. Note that in the
iterative step
we use the inverse model of radial distortion to correct
the observed coordinates given the current estimate of
the radial distortion
coefficients (thus approximating the ideal undistorted
coordinates).
Listing 4.5 Calibration of radial distortion with resection

4.5 The Sturm-Maybank-Zhang Calibration Algorithm
In calibration by resection (which we described in Sect. 4.1), one
needs a non-planar object on which the coordinates of the
control points have to be measured with great accuracy. In
practice, such an object is
difficult to construct without a machine shop at hand; much
easier is to get hold of a planar object. The Sturm-
Maybank-Zhang (SMZ) calibration method (Sturm and
Maybank 1999; Zhang 2000) relies precisely on
many (at least three) images of one plane, instead of one
image of many (at least two) planes that is required by
resection.
We begin by establishing that the map between a plane
in object space and its perspective image is a
homography (see Appendix D), as can be easily verified by
choosing the world coordinate system so that the plane
has equation (see Fig. 4.2). Writing the perspective
projection equation for a point belonging to the plane, we
get
(4. 11)

where we have represented P with its columns:


. Thus, the map from to the image is
represented by a non-singular matrix
, that is, a homography.

Fig. 4.2 The map bringing any plane of on the image plane is a
homography
We assume that there is a plane in the scene of which we
are able to calculate the homography that maps it to the
image. It is customary to prepare for this purpose a
planar calibration object, on which a grid or a chessboard is
drawn, as in Fig. 4.3. The matching between the points m in the image and the corresponding control points M in the plane should be easy to obtain. The homography H is computed from these correspondences using the DLT algorithm. As a matter of fact, the function dlt given in Listing 4.1 serves for both resection and homography calculation (see Sect. 6.5.1).
Fig. 4.3 Some images of the calibration chessboard
Given that , and , we have:
(4. 12)
where is an unknown scalar that accounts for the fact that the homography H computed with DLT is given up to a scaling factor.
Listing 4.6 SMZ calibration

By writing , we therefore get:
(4.13)
Due to the fact that the columns of R are orthonormal, one
can
constrain the intrinsic parameters. In particular, the orthogonality yields , or, equivalently
(4.14)
where B = K^{-T} K^{-1}. Similarly, the condition on the norm:
(4.15)
Thanks to the “vec trick”, the last two equations are rewritten as:
(4.16)

(4.17)
B is a symmetric matrix; therefore, its unique
elements are only six. This can be formally taken into
account by introducing the operator and the
duplication matrix (see Sect. A. 14):
(4. 18)

(4. 19)
In summary, one image provides two equations in six
unknowns. If one takes n images of the plane (with
different orientations, as in Fig. 4.3), one can stack the
resulting 2n equations into a linear system of equations
(4.20)
where A is a 2n × 6 matrix. If n ≥ 3, there is a solution, determined up to a scaling factor. From the solution we derive B and then K (by Cholesky factorisation), from which we then go back to R and .
The implementation is given in Listing 4.6. Some caveats
are in order, which are related to the fact that homographies
are known up to a scale, including a sign. Therefore, B is
not necessarily positive definite as one would expect from its definition, but either B or -B is positive definite. To test which of these two cases occurs, we exploit the fact that the trace of a positive definite matrix is positive, being the sum of positive eigenvalues. Finally, a sign on the extrinsic parameters (derived from H) is arbitrarily fixed by imposing that the cameras lie in the positive half-space of the plane.
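A simplified sketch of the core SMZ step may clarify the construction: it estimates the full nine-vector vec(B) instead of using the duplication matrix, recovers K by Cholesky factorisation, and omits the extrinsic parameters (a reconstruction under these simplifications, not Listing 4.6):

% Simplified sketch of the SMZ intrinsic calibration from a set of
% homographies Hs (cell array, at least three). It assumes noise is small
% enough for B (or -B) to be numerically positive definite.
function K = smz_sketch(Hs)
    n = numel(Hs);  A = zeros(2*n, 9);
    for i = 1:n
        h1 = Hs{i}(:, 1);  h2 = Hs{i}(:, 2);
        A(2*i-1, :) = kron(h2, h1)';                    % h1' B h2 = 0
        A(2*i,   :) = (kron(h1, h1) - kron(h2, h2))';   % h1' B h1 = h2' B h2
    end
    [~, ~, V] = svd(A);
    B = reshape(V(:, end), 3, 3);  B = (B + B') / 2;    % symmetrise the solution
    if trace(B) < 0, B = -B; end                        % either B or -B is positive definite
    U = chol(B);                                        % B = U'*U, with U upper triangular
    K = inv(U);  K = K / K(3, 3);                       % U = K^-1 up to scale
end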

The SMZ method does not minimise a geometric residual and as such produces suboptimal results, so it is advisable to refine the obtained solution afterwards.

Problems

4.1 Use (3.23) to extract R in closed form from P and then also K and .
4.2 Prove that the principal point is the orthocentre of the triangle identified by the vanishing points of three orthogonal directions in space. This observation is part of the calibration method of Caprile and Torre (1990).
4.3 If some interior orientation parameters are known (or can be assumed to be so), it becomes possible to estimate the
remaining ones from a single homography, exploiting the
same equations as in the SMZ method. In fact, knowledge
of some of the entries of K results in
constraints on the entries of B, which add up to the two
equations
provided by the homography. In particular, if and
are known (or assumed to be in the centre of the
image), it is possible to estimate directly from a
single homography. Work out the four additional
equations.
4.4 Experiment with the camera calibration of your smartphone using the SMZ algorithm implemented in the MATLAB Calibration Toolkit: https://github.com/fusiello/Calibration_Toolkit.

References
B. Caprile and V. Torre. Using vanishing points for camera calibration.
International Journal of Computer Vision, 4: 127– 140, 1990.

O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. The


MIT Press, Cambridge, MA, 1993.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, Cambridge, 2nd edition, 2003.
Peter F. Sturm and Stephen J. Maybank. On plane-based camera calibration: A
general algorithm, singularities, applications. In CVPR, pages 1432– 1437. IEEE
Computer Society, 1999.
Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22 ( 11): 1330– 1334, 2000.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_5

5. Absolute and Exterior Orientation
Andrea Fusiello1
( 1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

5.1 Introduction
In computer vision and photogrammetry several orientation problems are defined that require the computation of a 3D direct isometry (rotation + translation) from correspondences between points.
According to the space where the corresponding points
belong, the following problems are established:
Absolute orientation (3D-3D): The coordinates of some 3D
points with respect to two different 3D reference frames
are given; we are
required to determine the transformation between the
two reference frames.
Exterior orientation (3D-2D): The position of some 3D points
and
their projection in the image are given; we are required
to determine the transformation between the camera
reference frame and the
world reference frame. The intrinsic parameters are
assumed to be known.
Relative orientation (2D-2D): The projections of some 3D points in two distinct images are given; we are required to determine the
transformation between the two camera reference
frames. The intrinsic parameters are assumed to be
known.
Absolute orientation will be addressed in Sect. 5.2
where a method based on the Orthogonal Procrustes

analysis (OPA) (Arun et al. 1987),


which ultimately uses the Singular Value Decomposition
(SVD), is
described. Exterior orientation will be reduced to the former in Sect. 5.3.1, while relative orientation will be dealt with in Chap. 7.

5.2 Absolute Orientation


Suppose we have two sets of n 3D point coordinates, which correspond to a single object, but which are expressed in two different reference frames, linked by a direct isometry. We will call one of these sets and the other . We assume that for every point in the corresponding point in is known. The absolute orientation (or registration) problem is to find the direct isometry to apply to that brings it to coincide with or, in the presence of noise, that minimises the distance between the corresponding points of the two sets.
Consider the more general absolute orientation problem
with scale, in which the relation between the two sets is a
similitude:
(5. 1)
where , is a vector representing a
translation, s is a
scalar and are corresponding points (in Cartesian
coordinates1)
in the sets and . The goal of registration is to solve:

(5.2)

5.2.1 Orthogonal Procrustes Analysis

Let be the matrix obtained by juxtaposing next to each other the vectors and the matrix obtained in the same way with the vectors . This allows us to rewrite (5.1) in matrix form:
(5.3)
and the objective function as well:

(5.4)
Note that if is a vector of d elements and is a vector of n
elements all equal to 1, then replicates the vector n
times, thus producing a
matrix.
If there were no scale and translation, this would be an orthogonal Procrustes problem, which we could solve by applying Proposition A.14; we will now see how to fall back into this case anyway.
Observe that if A is a matrix, is the vector
containing the average computed over the rows of A.
Thus, averaging over the rows of both members of (5.3)
produces:

(5.5) which immediately gives rise to an expression for the


translation:
(5.6)
To eliminate the translation, let us replace this expression
in (5.3):
(5.7)
thus obtaining
(5.8)

Note that the matrix , which is symmetric


and
idempotent, has the effect of subtracting the column
average from each column of the matrix to which it is
applied:
(5.9)
So, if the columns are points, as in our case, it
subtracts from the coordinates of each point those of
the centroid of the set.
This problem is equivalent to the original one, but the
translation has been removed; it will be computed at the end
after recovering R and s. If it were not for the scale, one
could solve (5.8) by applying the OPA
(Proposition A. 14) between the two matrices and
. Let us note, however, that the scale is irrelevant
in determining the rotation, since it is absorbed by the
diagonal matrix in the SVD, which does not
contribute to determine R. Thus, the rotation is given by
the SVD of , being symmetric and
idempotent:

(5. 10)

where = .
The scale is finally computed as the solution of a one-
dimensional
least-squares problem. Given two matrices A and B, the
problem
is rewritten: , whose least-squares
solution is:

(5.
11)
thanks to the properties of the trace. In our case we
obtain:

(5.
12)

Referring to the proof of Proposition A. 14, we observe


that the
numerator of (5.12) coincides with , and thus, in the
implementation we can save some computation by
exploiting the matrix S computed in the previous SVD.
It is easily shown that the solution we derived minimises
the
residual (5.4), which expresses the sum of the squares of
the distances between corresponding points.
Both with scale (similitude) and without scale
(isometry), a minimum of three points is required.
The function opa given in Listing 5.1 implements an
extension of the method just described, in which the
matches are weighted by a diagonal matrix W. With
we obtain the unweighted method
described in the text.
Listing 5.1 Weighted OPA
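An unweighted sketch of the procedure (the book's opa also accepts a weight matrix) is the following:

% Unweighted sketch of the absolute orientation with scale by OPA.
% A, B: dxn matrices of corresponding points; the result minimises the
% residual of B = s*R*A + t (applied column-wise) in the least-squares sense.
function [R, t, s] = opa_sketch(A, B)
    n  = size(A, 2);
    ca = mean(A, 2);  cb = mean(B, 2);                       % centroids
    Ac = A - ca * ones(1, n);  Bc = B - cb * ones(1, n);     % centred point sets
    [U, S, V] = svd(Bc * Ac');                               % cross-covariance matrix
    D = eye(size(A, 1));  D(end, end) = det(U * V');         % force a proper rotation
    R = U * D * V';
    s = trace(D * S) / sum(Ac(:).^2);                        % least-squares scale
    t = cb - s * R * ca;                                     % translation from the centroids
end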

Absolute orientation can also be formulated as the


solution of a system in which each point provides
three equations. This
formulation has the advantage of being able to exploit
partial
information about the coordinates of the control points;
in fact, each coordinate provides one equation. In the
case of aerial
photogrammetry (Kraus 2007) one can have, for example, planimetric and/or altimetric control points. A similitude is then determined by at least seven such equations, since it has seven degrees of freedom.
5.3 Exterior Orientation


Suppose we know the coordinates of some 3D control points
together with their projections in the image and the camera
intrinsic parameters. The exterior orientation problem (a.k.a. perspective pose or Perspective-n-Point (PnP)) consists in determining the direct isometry between the camera reference frame and the world reference frame
(in which the control points are given).
In essence, this is a problem similar to the camera
calibration, which consists in recovering interior and
exterior orientation of the camera. If the intrinsic
parameters are already known, one is left with the problem
of computing the extrinsic parameters only, which encodes
the exterior orientation of the camera.
Exterior orientation can be recovered with a resection
procedure, if we assume that enough 3D control points
together with their projections in the image are given (Fig.
5.1). We already encountered the resection algorithm in
the context of calibration. In fact, it generically denotes the
estimation of the orientation (only exterior or interior as
well) of the
camera from 3D control points and their images.

Fig. 5.1 Three control points and their projections determine the exterior
orientation of the
camera

Let be the n control points, whose coordinates


(in the
world coordinate system) can be measured, and let
be the
respective projections onto the image. Since the intrinsic
parameters are known, one can express the image points in
normalised image
coordinates and write the perspective

projection equation as:


(5. 13)
If n ≥ 6, the DLT method (Sect. 4.1) can be applied, obtaining a perspective projection matrix that is identified with .
This approach is not satisfactory because: (i) the matrix R
thus
obtained is not necessarily orthogonal (due to the noise in
the data), and therefore, this property must be forced a
posteriori (Proposition A. 14); (ii) more control points (six)
than the bare minimum (three) are used.
To see that three control points are sufficient to compute
the exterior orientation, let us expand (5.13) into:

(5. 14)

where and . If the rotation matrix is expressed as a function of three parameters (e.g. with three Euler angles), we will eventually end up with a non-linear system in six unknowns, which can be determined if at least three control points are given, since each control point generates two equations. As a matter of
fact, exterior orientation can be recovered in closed form
with three points.

5.3.1 Fiore’s Algorithm

We introduce here a method proposed by Fiore (2001) that reduces the problem to that of absolute orientation, thus producing a rotation matrix by construction.
Let us rewrite (5.13) with explicit scaling factors :
(5. 15)
We can clearly see that if the factors were known, we
would be faced with an absolute orientation problem. Let
us therefore focus on the
computation of . To this end, we rewrite (5.15) in matrix
form:
(5. 16)

where and constructs the diagonal


matrix given
a vector. Given the SVD of M: , let be the
matrix
consisting of the last columns of V , with ,
which
generate the null space of M. Multiplying both members of
(5.16) by has the effect of cancelling out the right-hand
member:
(5. 17)
Taking the of both members and exploiting the “vec
trick” (A.59) yields:
(5.18)
The vector does not only contain the unknowns but also zeros from the off-diagonal elements
of . To fix this, similarly to what is done with the
vectorisation of symmetric matrices, we introduce an
appropriate matrix D such that and then
substitute in the previous equation, obtaining:
(5. 19)
By stacking enough equations like this one can derive
—up to a
multiplicative constant— as the generator of the null space
of the
coefficient matrix. How many control points are needed?
The size of the coefficient matrix in (5.19) is ,
and to determine a one-
dimensional family of solutions, it must have rank , so
it must be: . It follows that
control points are
needed. For example, six points in general positions ( ) are needed, but they drop to four if they are coplanar ( ).
Now that the right-hand member of (5.15) is known up to
a scaling factor, all that remains is to solve the absolute
orientation problem (with scaling) represented by (5.15)
with the OPA method described in Sect.
5.2.1
Listing 5.2 implements the procedure just described. To
limit the size of the intermediate matrices that are used in
the computation of the
coefficient matrix, instead of using (5.19), the
following property (Proposition A.20) is used in place
of the “vec-trick”:

(5.20)
where denotes the Hadamard (element by element)
product.

Listing 5.2 Fiore’s linear exterior orientation

Fiore’s method is linear and fast, but in general, it needs


more than
three control points. In contrast, the non-linear method
that we will describe in Sect. 9.4 only needs the
minimum number of points, but it is iterative, being based
on the Gauss-Newton algorithm.

5.3.2 Procrustean Method

The method proposed by Garro et al. (2012) formulates the exterior orientation as a Procrustean problem. Let us rewrite (5.16) as:

(5.21)
As already noted, if is known, this reduces to an
absolute
orientation problem, which is solved with OPA (see Sect.
5.2.1). If, on the other hand, R and are known, one is left
with the linear problem of
solving for . The algorithm (Listing 5.3) proceeds by
alternating between these two steps until
convergence.
Solving for in (5.21) follows a path similar to the
solution of
(5.19). Instead of the “vec-trick” we exploit Proposition A.20,
to write

(5.22)
and the fact that is the diagonal
block matrix which has the columns of Q as blocks,
implemented by the MATLAB
function diagc.

Listing 5.3 Iterative exterior orientation

The geometrical interpretation is that are the lengths of


the segments of optical rays from the COP to the control
points, which we will call
“legs” (Fig. 5.2). The method therefore alternates
between estimating the length of the legs and calculating
the orientation of the camera by aligning the control
points with the endpoints of the legs.
Fig. 5.2 The blue edges connecting the control points are known, and the
angles as well. The length of the legs can be recovered (up to some
ambiguities) without knowing the COP

In their seminal paper, Fischler and Bolles (1981) gave


the following definition of the PnP problem:

Given the relative spatial locations of n control points,


and
given the angle to every pair of control points from an
additional point called the Center of Perspective
(CP), find the lengths of the line segments (“legs”)
joining the CP to each of the control points. We
call this the “perspective-n-point”
problem (PnP).

With three control points, the legs can be computed in closed form (and the problem is then called “P3P”).
Consider the tetrahedron with the COP and the control
points as
vertices (Fig. 5.2). The edges of the triangle defined by the
control points are known, the COP and the legs are
unknown, but one can compute the angles between the legs
(Problem 3.9). From these inputs, the triangles can be
solved by applying the law of cosines and finding the roots
of a
fourth-degree polynomial, as described by Grunert (1841).
An
additional point is needed to discriminate between the four
solutions.
For a discussion on these ambiguities and a more
accessible account of the original method, the reader is
referred to (Wang et al. 2020). An
excellent review can be found in (Haralick et al. 1994). For
an example of recent methods, see (Gao et al. 2003; Kneip
et al. 2011).
5.3.3 Direct Method
A weakness of all the methods that attack the problem with
an analytical approach is the fact that they require the
identification of the
projections of the control points in the images, a trivial
operation if done manually, but subject to errors if done by
the computer, which can
negatively affect the final solution.
Marchand et al. (1999) describe an alternative technique
which,
assuming we have a digital model of an object (instead of a
set of control points), avoids matching at all. The idea is
simple: the correct exterior
orientation is the one for which the projection of the model
coincides with its image. The “coincidence” is measured
along edges, by summing the modulus of the image
gradient at the projected model contour ( see Fig. 5.3).

Fig. 5.3 Example application of the direct method. The spray bottle is the
object whose model is known. On the right is the gradient map on which the
camera orientation retrieval is based

The computation of the exterior orientation is therefore


reduced to the maximisation of an appropriate objective
function that indicates
how much the contours of the model correspond with
those visible in the image, represented by maxima of the
gradient:

(5.23)

where is the gradient of the image (grey scale), is


the
(visible) projection of the model contour according to the
extrinsic
parameters and is its
characteristic function. Every point belonging to and
with non-zero gradient
makes a contribution to the objective function.
In order to be more selective on the gradients that
contribute to the objective function and reduce the effect of
noise or fine textures, it is
advisable to also take into account the direction of the
gradient by projecting it onto the normal of the
projected model contour :

(5.24)

To have a larger and smoother convergence basin,


one can also consider a smoothed characteristic
function that decreases with the distance from the
contour according to a Gaussian law.
The objective function is then iteratively maximised by
employing
direct techniques (using only the value of the function, not
its
derivative), such as Hooke and Jeeves (1961). As in any
iterative
optimisation, an initial solution close to the global minimum
is needed. Therefore, this method can be effectively
employed in a video tracking scheme, where the exterior
orientation obtained for the current frame provides the
starting solution for computing the orientation for the next
frame (Valinetti et al. 2001).

The same idea of the direct method of Sect. 5.3.3 was


employed in the calibration technique introduced by
Robert (1996). The author uses the DLT method with
six points to obtain an initial estimate of the
PPM, which is then refined by minimising an objective function similar to (5.23).

Problems
5.1 Given two vectors and , prove that the least-squares solution of is
(5.25)
Convince yourself that this is different from or when .
References
K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9 (5): 698–700, September 1987.
Paul D. Fiore. Efficient linear solution of exterior orientation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 23 (2): 140– 148, 2001.

M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24 (6): 381–395, June 1981.

Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete
solution
classification for the perspective-three-point problem. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 25: 930–943, August 2003.

V. Garro, F. Crosilla, and A. Fusiello. Solving the PnP problem with anisotropic
orthogonal
procrustes analysis. In Second Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling,
Processing, Visualization and Transmission (3DIMPVT), 2012.
Johann August Grunert. Das pothenotische Problem in erweiterter Gestalt nebst über seine Anwendungen in der Geodäsie. Grunerts Archiv für Mathematik und Physik, pages 238–248, 1841.
Robert M. Haralick, Chung-Nan Lee, Karsten Ottenberg, and Michael Nölle.
Review and analysis of solutions of the three point perspective pose estimation
problem. International Journal of
Computer Vision, 13: 331– 356, December 1994.

R. Hooke and T. A. Jeeves. Direct search solution of numerical and statistical problems. Journal of the Association for Computing Machinery (ACM), pages 212–229, 1961.
L Kneip, D Scaramuzza, and R Siegwart. A novel parametrization of the
perspective-three-point problem for a direct computation of absolute camera
position and orientation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2011.
Karl Kraus. Photogrammetry - Geometry from Images and Laser Scans - 2nd edition. Walter de Gruyter, Berlin, 2007.

Éric Marchand, Patrick Bouthemy, Francois Chaumette, and Valérie Moreau.


Robust real-time visual tracking using a 2-D-3-D model-based approach.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1999.
L Robert. Camera calibration without feature extraction. Computer Vision,
Graphics, and Image Processing, 63 (2): 314– 325, March 1996.
A. Valinetti, A. Fusiello, and V. Murino. Model tracking for video-based virtual reality. In Proceedings of the 11th International Conference on Image Analysis and Processing (ICIAP), pages 372–377, Palermo, IT, 2001. IAPR, IEEE Computer Society Press. doi: 10.1109/ICIAP.2001.957038.
Bo Wang, Hao Hu, and Caixia Zhang. Geometric interpretation of the multi-solution phenomenon in the P3P problem. J. Math. Imaging Vis., 62 (9): 1214–1226, November 2020. ISSN 0924-9907. https://doi.org/10.1007/s10851-020-00982-5.

Footnotes
1 In this section, in order to lighten the notation, the Cartesian coordinates of the points are denoted without the .

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_6

6. Two-View Geometry
Andrea Fusiello1
( 1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

6.1 Introduction
We shall now study the relationship that links two images
framing the same scene from two different points of view.
In particular, we will be concerned with the relationship
that exists between the homologous points, which are
defined as the points in the two images that are
projections of the same point in object space (Fig. 6.1).
Fig. 6.1 Stereo pair. Two homologous (image) points are the projection of the
same object point
Epipolar geometry describes the relationship between
two images of the same scene, so it is fundamental to any
computer vision technique
based on more than one image.

6.2 Epipolar Geometry


Given a point in the first image, we shall investigate
what
constraints exist on the position of its homologous in the
second
image. Some simple geometric considerations indicate that
the
homologous point of must lie on a straight line in the
second image, called the epipolar line of .
Consider the case illustrated in Fig. 6.2. Given a point
in the first image, its homologous in the second
image is constrained to lie on
the intersection of the image plane with the plane
determined by , and , called the epipolar plane. This is
because the point can be the
projection of any point in space lying on the optical ray of
.
Furthermore, it is observed that all epipolar lines of an
image pass
through the same point, called epipole, and that the epipolar
planes
constitute a bundle of planes having in common the baseline,
that is, the line passing through the COP and .

Fig. 6.2 Epipolar geometry


Fixed a world coordinate system, given
and
two PPMs , we know that:
(6. 1)
The epipolar line corresponding to is the projection
according to of the optical ray of , which has equation:
(6.2)
Since:
(6.3)
and

(6.4)

the epipolar has equation:


line of (6.5)
This is the equation (in homogeneous coordinates) of the
line through the points (epipole) and the projection of
the point at infinity
.
Figure 6.3 shows an example where some epipolar lines
are plotted in the right image.

Fig. 6.3 The epipolar lines corresponding to the points marked with a square
in the left image are drawn on the right

When the COP is in the focal plane of the conjugate


camera, the
epipole lies at infinity and the epipolar lines form a
bundle of parallel lines. A very special case is when both
epipoles lie at infinity, which
happens when the baseline is contained in both focal
planes, that is, the two image planes are parallel to the
baseline. In this case, which is
called normal configuration, the epipolar lines form a bundle
of parallel lines in both images.

6.3 Fundamental Matrix


To prove that there is a bilinear relationship between the
homologous points, we need to further manipulate (6.5). We
multiply to the left and to the right by , the
antisymmetric matrix that acts as the cross product with
, obtaining (recall that ):
(6.6)
The left-hand side is a vector orthogonal to , so if we
multiply both sides by , we get:
(6.7)
This is the Longuet-Higgins equation and represents a bilinear
form in and .
Geometrically, it is consistent with the fact that, as is
explained in Appendix D, the line through two points is
represented by the cross product of the two points, so
the epipolar line of is represented (in homogeneous
coordinates) by the vector:

(6.8)
The matrix
(6.9)
which contains the coefficients of the bilinear form is


called the
fundamental matrix. The Longuet-Higgins equation then
rewrites:
(6. 10)
It states that the corresponding point of in the second
image must lie on its epipolar line ; in fact, the line
defined by is the epipolar line of m.
Since , we have that . In particular, since is non-singular, has the same rank as , which is two.
Furthermore, the matrix is defined up to a scaling
factor, since if it is multiplied by an arbitrary scalar, (6.10)
is still satisfied.
In general, a matrix has nine degrees of freedom;
each one of the above constraints removes one degree of
freedom, so a fundamental matrix has only seven degrees of
freedom.
Observe that by transposing (6.10), we obtain the
symmetric relation from the second image to the first
image:
(6. 11)
The epipole is easily derived from F as a generator of the null space of . In fact, since the epipole belongs to all epipolar lines, it must be for every , and therefore . Similarly .
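In practice the epipoles can be extracted from an estimated F via the SVD, which also works when F is only numerically rank deficient; which epipole belongs to which image depends on the convention adopted for F (fragment, assuming F is available):

% Sketch: epipoles as generators of the null spaces of F and F'.
[U, ~, V] = svd(F);
e1 = V(:, 3);     % right null space: F  * e1 = 0 (up to noise)
e2 = U(:, 3);     % left  null space: F' * e2 = 0 (up to noise)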

6.4 Computing the Fundamental Matrix
The fundamental matrix can be computed from
homologous points, using the method described in the
following section. Excellent
references for the computation of F are (Zhang 1998) and
(Luong and Faugeras 1996).
Given a set of point correspondences (at least eight, as
we shall see) , we want to determine
the fundamental
matrix F that connects the points in the bilinear relation:
(6. 12)
The unknown matrix can be easily derived through
the use of the Kronecker product and the “vec trick”
(Sect. A. 14). In fact:

So each point correspondence generates a linear


homogeneous equation in the nine unknown elements of the
matrix (read by columns). From n corresponding points,
we obtain a linear system of n equations:
(6. 13)

The solution to the homogeneous linear system is the null


space of . With at least points in general position,
this null space reaches dimension one, so the solution is
determined up to a multiplicative
constant. Therefore, this method is called the eight-point
algorithm.

The eight-point algorithm does not always succeed in uniquely determining a fundamental matrix: a degenerate case occurs
whenever the null space of has dimension greater
than one. This happens if the eight points lie on a plane
or on the intersection of
three quadrics, like, for example, the eight vertices of a
cube. The study of the rank of is described in
(Agarwal et al. 2017), while the critical loci, which also
include the position of the cameras, are studied in
(Hartley and Kahl 2007). A degeneracy also occurs when the two COPs are the same (no translation).

When more than eight-point correspondences are


available, the
linear system is overdetermined and can be solved with the
least-
squares method. According to Proposition A. 13, the
solution is the least right singular vector of . The MATLAB
implementation is given in
Listing 6.1. It is understood that despite its name, the
function accepts as input any number of points .
Listing 6.1 Eight-point algorithm
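A sketch in the spirit of Listing 6.1 (a reconstruction; it assumes the convention m2' F m1 = 0, which may differ from the book's notation):

% Sketch of the eight-point algorithm: estimate F from n >= 8 corresponding
% points m1, m2 (2xn, Cartesian), assuming the convention m2' * F * m1 = 0.
function F = eight_points_sketch(m1, m2)
    n = size(m1, 2);
    A = zeros(n, 9);
    for i = 1:n
        p = [m1(:,i); 1];  q = [m2(:,i); 1];
        A(i, :) = kron(p, q)';            % vec trick: (m1' (x) m2') vec(F) = 0
    end
    [~, ~, V] = svd(A);
    F = reshape(V(:, end), 3, 3);         % least right singular vector, read by columns
    % (the rank-2 constraint can be enforced afterwards, see Sect. 6.4.1)
end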

6.4.1 The Seven-Point Algorithm

The matrix obtained as a solution of the system (6.13) will not, in general, have rank two. This can be forced a posteriori by replacing with , the matrix of rank two
closest to F in Frobenius norm, following the Eckart-Young
theorem (A.6). Alternatively, it is possible to easily
derive a “seven-point algorithm” that yields an F of rank two
by
construction. Indeed, since F possesses only seven degrees
of freedom, seven correspondences must be sufficient to
compute it, with the
addition of the rank constraint.
The matrix of coefficients that is generated from
seven matches has dimension and a null space of
dimension two, so the generic solution is written as a
linear combination of two particular solutions:

(6. 14)
where and are the two matrices corresponding to the
last two
singular vectors in the Singular Value Decomposition
(SVD) of . Only one coefficient is needed because of
the scale ambiguity of F. The
specific that makes F singular is the solution of:

(6. 15)
which is a third-degree polynomial in and therefore
can be solved analytically, yielding one or three real
solutions.
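The following sketch illustrates this final step, assuming the 7×9 coefficient matrix A7 has already been built from the seven correspondences (e.g. with the same rows as in the eight-point sketch above); the cubic is obtained here by sampling the determinant at four values and fitting a degree-3 polynomial, which is a choice of this sketch rather than the book's implementation:

% Sketch of the last step of the seven-point algorithm.
[~, ~, V] = svd(A7);
F1 = reshape(V(:, 8), 3, 3);                    % two generators of the
F2 = reshape(V(:, 9), 3, 3);                    % two-dimensional null space
a  = [0 1 2 3];
d  = arrayfun(@(x) det(x*F1 + (1-x)*F2), a);
p  = polyfit(a, d, 3);                          % exact fit: det(.) is cubic in the parameter
r  = roots(p);  r = real(r(abs(imag(r)) < 1e-9));
Fs = arrayfun(@(x) x*F1 + (1-x)*F2, r, 'UniformOutput', false);   % one or three solutions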
6.4.2 Preconditioning
The eight-point algorithm (but the same can be said about
the seven- point algorithm) has been criticised for being
too sensitive to noise and thus not very useful in practical
applications. However, Hartley (1995) showed that the
instability of the method is mainly due to a
malconditioning problem rather than to its linear nature.
He observed, in fact, that homogeneous pixel coordinates
most likely yield an ill-
conditioned system of linear equations. In fact, typical
image
coordinates are , so the entries of range from 1 to , and consequently, has a high
conditioning number, which makes the solution of the
system unstable.
By applying a simple preconditioning technique— which
consists of transforming the points so that all the three
coordinates have the same orders of magnitude— the
conditioning number is reduced and the
results are more stable. The points are translated so that
their centroid coincides with the origin, and then they are
scaled so that the average distance from the origin is
. The latter choice comes from the
observation that the best conditioning occurs when the
mean
coordinates are , and this point has distance
from the origin.
Let and be the resulting transformations in the two
images and , the transformed points.
Using and in the eight-point algorithm, we obtain a
fundamental matrix that can be
related to the original one by , as can be easily
shown.
The MATLAB implementation of the computation
ofF with preconditioning is given in Listings 6.2
and 6.3.
Listing 6.2 Linear computation of F

Listing 6.3 Preconditioning
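A sketch in the spirit of Listing 6.3 (a reconstruction; the function name is hypothetical):

% Sketch of the preconditioning: translate the points so that their centroid
% is at the origin and scale them so that the average distance from the
% origin is sqrt(2). m: 2xn Cartesian image points.
function [T, mt] = precond_sketch(m)
    n = size(m, 2);
    c = mean(m, 2);                                      % centroid
    d = mean(sqrt(sum((m - c*ones(1,n)).^2, 1)));        % mean distance from the centroid
    s = sqrt(2) / d;
    T = [s 0 -s*c(1); 0 s -s*c(2); 0 0 1];               % scaling + translation
    mt = T * [m; ones(1, n)];                            % transformed homogeneous points
end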

Preconditioning is useful not only in the computation of F


but also in
all those cases where the coefficient matrix is obtained
from
homogeneous image coordinates, so, for example, also
in the case of resection (Sect. 4.2) or in the computation
of a homography (Sect. 6.5.1).

6.5 Planar Homography


While in the most general case homologous points in two
images are related by the fundamental matrix, there are
instances of practical
interest in which point correspondences between two
images are encoded by a projectivity of , or
homography or collineation. This occurs when the
observed points lie on a plane in object space.
We already established in Sect. 4.5 that the map
between a plane in object space and its perspective
image is a homography. Here we
observe that the projections of the points of in two
images are related by a homography as well. In fact, we
have a homography between and the left image, and
likewise a homography between and the right
image. By composing the inverse of the former with the
latter (see Fig. 6.4), we obtain a homography from the left
image to the right image.

Fig. 6.4 The plane induces a homography of between the two images

We will then say that induces a homography between


the two images, represented by a matrix , in the
sense that corresponding points in the two images are
associated via . In formulae:
(6.16)
The matrix is non-singular and, being defined up to a scale factor, has eight degrees of freedom. Three of these are due
to the choice of the plane , the other five arise from the
compatibility with the
epipolar geometry, represented by F. In fact, all points
related via must also satisfy the epipolar constraint,
that is:
(6. 17)
which implies:
(6.
18)

This is equivalent to saying that must be


antisymmetric, that is, . Due to the
homogeneity of the equation and the symmetry of the
matrix on the left-hand side, this results in five
constraints.
To derive an expression for we need to start again
from the
epipolar geometry. If we take the reference frame of the
first camera as the world reference frame, we can write
the following two PPMs:
(6. 19)
Substituting these PPMs into the equation of the
epipolar line of (6.5), we obtain
(6.20)
with
(6.21)
In general, two homologous points and , which are
the
projection of the 3D point onto the first and second
cameras,
respectively, are related by (6.20). However, if the 3D point
lies on a plane , then (6.20) can be specialised. Using
the equation of the plane , where d is the distance
of the plane from the origin and is its normal, after some
rewriting (see Problem 6.4) one gets:

(6.22)

The latter equation states that there is a homography


induced by the plane between the two images, defined
by the matrix:

(6.23)
If approaches the plane at infinity, that is, , in
the limit we get the matrix of the homography induced by
the plane at infinity
. This homography relates the projections of
points
lying on the plane at infinity, that is, it maps vanishing points
of one image into vanishing points of the other image.
Formula (6.22) applies to any pair of cameras framing
a plane , with a totally general relative orientation.
There is, however, another special case when two images
are linked by a homography, which
depends on the motion of the camera and not on the
structure of the scene. This occurs when the baseline
(i.e. the relative translation) is zero. In fact, plugging
into (6.20) one gets:
(6.24)
Thus, plays a dual role:
Like all homographies induced by a plane, it relates the
projections of points on the plane at infinity in the two
images.
In addition, associates the projections of all the points
in the
scene (no matter what plane they belong to) between the
two images when the camera makes a purely rotational
motion.
6.5.1 Computing the Homography
Given n homologous points , we are required to
determine the homography matrix H such that:
(6.25)
Taking advantage of the cross product to eliminate the
multiplicative factor, the equation is rewritten:

(6.26) Along the same lines as the derivation made in


Sect. 4.1, we get:

and similarly, we conclude that the matrix has


rank two, so only two out of three equations are linearly
independent. For n points we obtain a system of 2n
homogeneous linear equations, which we can write as
(6.27)
where A is the matrix of coefficients obtained by
stacking two
equations for each correspondence, while the vector of
unknowns
contains the nine elements of H read by columns. So,
four points in general position1 determine a matrix of
coefficients A of rank eight
whose one-dimensional null space is the sought solution up
to a scale
factor. Thus, a homography is determined by its action on
four points, as illustrated in Fig. 6.5.

Fig. 6.5 Four corresponding points determine a homography


For points, the least-squares solution is the least
right singular vector of A (see Proposition A. 13).
As seen in Sect. 6.3 for the fundamental matrix, also in
the
computation of H, it is useful to transform the points so that
the problem becomes better conditioned. Let and
. Using the
correspondences between the transformed points and
, one finds a homography of the plane that is related
to the sought one by
, as can easily be seen.
The MATLAB implementation of the homography
computation with preconditioning is given in Listing 6.4
and is based on the dlt function given in Listing 4.1.
Listing 6.4 Linear computation of H
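The following sketch puts the pieces together: it builds the coefficient matrix of (6.26) and (6.27) from preconditioned points and then undoes the preconditioning. It reuses the hypothetical precond_sketch of Sect. 6.4.2 and assumes the convention m2 ~ H m1 (a reconstruction, not Listing 6.4):

% Sketch: homography estimation with preconditioning. m1, m2: 2xn arrays
% of corresponding Cartesian points, with the assumed convention m2 ~ H*m1.
[T1, p1] = precond_sketch(m1);   [T2, p2] = precond_sketch(m2);
n = size(m1, 2);  A = zeros(3*n, 9);
for i = 1:n
    q = p2(:, i);
    S = [0 -q(3) q(2); q(3) 0 -q(1); -q(2) q(1) 0];       % [m2]_x
    A(3*i-2:3*i, :) = kron(p1(:, i)', S);                 % (m1' (x) [m2]_x) vec(H) = 0
end
[~, ~, V] = svd(A);
Ht = reshape(V(:, end), 3, 3);                            % homography of the preconditioned points
H  = T2 \ (Ht * T1);                                      % undo the preconditioning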

6.6 Planar Parallax


Let us consider (6.20) which we report here for the
convenience of the reader:

(6.28)
One way to interpret it is that a point m is associated with its
homologous in two steps: first is applied and then a
correction, called parallax, is added along the epipolar
line.
This observation is not only true for but
generalises to any plane (Shashua and Navab 1996). In
fact, after a few steps along the lines of the procedure
used to derive (6.22), we obtain:

(6.29
)
where a is the distance of the point (of which and
are the
projections) from the plane and is its depth with
respect to the first
camera.
When belongs to the plane , then ;

otherwise, there is a residual term called parallax .


The parallax is the projection onto the plane of the second
camera of the segment of the optical ray
between M and the intersection with (see Fig. 6.6).

Fig. 6.6 Epipolar geometry and planar parallax. Parallax with respect to plane
is the
projection of the segment joining the point M and its projection onto along
the optical ray

Equation (6.29) is an alternative representation


of epipolar geometry, taking a particular plane as a
reference. Note that:
is independent of the choice of the second image.
is proportional to the inverse of depth .
When the reference plane is the plane at infinity, .
The parallax field is radial with centre in the epipole.
A simple experiment to visualise parallax is shown in
Fig. 6.7. Take two images of a scene containing at least
one plane (the facade of the
building); apply the homography induced by that plane to
the whole
first image. Observe that the points in the plane coincide
when the two images are superimposed while those off the
plane (the statue of Dante) do not. The difference in
position between corresponding points is the parallax.

Fig. 6.7 The first two images are the original ones. The third one is the
superimposition of the second one with the first one transformed according to
the homography of the building facade (the points used to calculate the
homography are highlighted)
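Assuming the Image Processing Toolbox is available, this experiment can be reproduced along the following lines (a sketch; H is the estimated facade homography with the column-vector convention m2 ~ H m1, and I1, I2 are the two images):

% Sketch of the experiment of Fig. 6.7 (Image Processing Toolbox assumed).
tform = projective2d(H');                                        % MATLAB uses the row-vector convention
Jw = imwarp(I1, tform, 'OutputView', imref2d([size(I2,1) size(I2,2)]));   % warp the first image
imshow(imfuse(I2, Jw, 'blend'));                                 % points on the plane should coincide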

Problems
6.1 If the PPMs are normalised as in (3.23), we can
rewrite the
equation of the epipolar line with the point depths (
and ) made explicit:
(6.30)

6.2 Prove that the epipole is in the null space of


by algebraic manipulations using the fact that
.

6.3 In the step that led to (6.7), we could arrive at the


same result by observing that after multiplication by
the left member is a quadratic form and is
antisymmetric.

6.4 Derive (6.22).


6.5 Show that, given two images whose PPMs are
and
, the homography of the plane at infinity between
the two
is .

6.6 Show that if the homography H transfers points between two images, then the homography transfers straight lines between the same two images.

6.7 Prove that, whatever the plane , it always holds: It means that to compute the homography, three points belonging to the plane and the epipoles are enough, to have a total of four homologous points.
6.8 Using the result of Problem 6.7, show that, given two
homographies and induced by two different planes
in the same image pair, then the epipole is the
eigenvector corresponding to the distinct eigenvalue of
. What do the other two eigenvectors
represent?

References
Sameer Agarwal, Hon-Leung Lee, Bernd Sturmfels, and Rekha R. Thomas.
On the existence of epipolar matrices. International Journal of Computer Vision,
121 (3): 403–415, Feb 2017.

R. I. Hartley. In defence of the 8-point algorithm. In Proceedings of the International


Conference on Computer Vision, pages 1064– 1071, Washington, DC, USA, 1995.
IEEE Computer Society.
Richard Hartley and Fredrik Kahl. Critical configurations for projective
reconstruction from multiple views. International Journal of Computer Vision, 71 (
1): 5–47, Jan 2007.
[Crossref]

Q.-T. Luong and O. D. Faugeras. The fundamental matrix: Theory,


algorithms, and stability analysis. International Journal of Computer Vision, 17:
43– 75, 1996.
[Crossref]

A. Shashua and N. Navab. Relative affine structure: Canonical model for 3D


from 2D geometry and applications. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18 (9): 873– 883, September 1996.
[Crossref]

Z. Zhang. Determining the epipolar geometry and its uncertainty: A review.


International Journal of Computer Vision, 27 (2): 161– 195, March/April 1998.
Footnotes
1 There must not be three aligned.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_7

7. Relative Orientation
Andrea Fusiello1
(1) University of Udine, Udine, Italy
Email: [email protected]

7.1 Introduction
Consider the situation where some points of the 3D
space—called in this context tie points—are projected into
two different cameras, and for each point in one camera the
homologous point in the other camera is given. The
relative orientation problem consists in determining the
orientation (which includes position and angular attitude) of
one
camera with respect to the other, assuming known intrinsic
parameters.
It is customary to refer also to motion recovery, implicitly assuming a scenario equivalent to the previous one, in which a static scene is framed by a moving camera with known intrinsic parameters. In each case, orientation or rigid motion is represented by a direct isometry.
The approach that we will describe in Sect. 7.2 follows
the one
proposed by Longuet-Higgins (1981): it is based on the
essential matrix, which describes the epipolar geometry of
two perspective images with known intrinsic parameters
and from which the rigid motion of the
camera can be derived with a factorisation described in
Sect. 7.3.
Note that the translational component of the
displacement can only be calculated up to a scaling factor,
because it is impossible to determine ( without additional
information) whether the motion observed in the
image is caused by a nearby object with the camera moving
slowly or by
a distant object with the camera moving faster. This fact is
known as the depth-speed ambiguity.

7.2 The Essential Matrix


Suppose we have a camera, with known intrinsic
parameters, that is moving in a static environment
following an unknown trajectory.
Consider two images taken by the camera at two different
time instants and assume that a number of homologous
points, in normalised image coordinates, are given.
Let and be the PPM of the cameras
corresponding to the two time instants and
the normalised
coordinates of the homologous points in the images,
which are projections of the same tie point M (Fig.
7.1).

Fig. 7.1 Reference frames and normalised image coordinates system

Without loss of generality (see Problem 7.1) we represent the 3D points in the reference frame of the first camera and thus write the following two PPMs:

(7.1)  P = [I | 0],   P' = [R | t]
By retracing the derivation of the Longuet-Higgins equation (6.7) with these two particular PPMs, we obtain the bilinear form that links homologous points in normalised image coordinates (also called the "Longuet-Higgins equation"):

(7.2)  m'^T E m = 0
The matrix

(7.3)  E = [t]× R

which contains the coefficients of the form is called the essential matrix. It depends on three parameters for rotation and on two parameters for translation. In fact, (7.2) is homogeneous with respect to t, that is, the modulus of the vector t does not matter. This reflects the
depth-speed
ambiguity, that is, the fact that we cannot derive the
absolute scale of the scene without an additional parameter,
for example knowledge of the
distance between two tie points. Therefore, an essential
matrix has only five degrees of freedom, accounting for
rotation (three parameters) and translation up to a scale
factor (two parameters).
In terms of constraints, we can observe that the
essential matrix is defined up to a scaling factor and is
singular, since . This
brings the degrees of freedom to seven; in order to match
the five
degrees of freedom of the parameterisation, we need to be
able to
exhibit two further constraints. The theorem in the next
section proves that these two constraints are the equality
of the two non-zero singular values of E (a polynomial in
the elements of E, which yields two
independent constraints).
The essential matrix and the fundamental matrix are
related, since both encode rigid motion between two
cameras. The former relates the normalised image coordinates
of the homologous points, while the latter relates the pixel
coordinates of the same points. It is easy to verify that:
(7.4)  E = K'^T F K
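As a quick numerical illustration of (7.4) (a hedged sketch; K and Kp denote the intrinsic matrices of the first and second camera, and F a fundamental matrix estimated elsewhere):

% Convert a fundamental matrix into an essential matrix via (7.4).
E = Kp' * F * K;
E = E / norm(E, 'fro');    % remove the irrelevant overall scale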
7.2.1 Geometric Interpretation
The Longuet-Higgins equation (7.2):

(7.5)  m'^T [t]× R m = 0

holds for any pair of homologous points m, m' in normalised image coordinates. We can visualise m and m' (Fig. 7.2) as vectors of ℝ³ that connect the COP (origin of the camera reference frame) and the corresponding point on the image plane (of equation ζ = 1, in front of the camera). Then the Longuet-Higgins equation can be interpreted as the coplanarity of the three vectors m, m' and t (where m' is rotated to bring it into the reference frame of the left camera), since the left member is nothing but the triple product of these vectors, which vanishes if and only if they are coplanar (see Appendix A).

Fig. 7.2 The Longuet-Higgins equation as the coplanarity of three vectors

Equation (7.2) is at the basis of the method proposed by Horn (1990b) for the computation of relative orientation. He observed that by dividing (7.2) by an appropriate normalisation factor, one obtains a geometrically meaningful quantity that can be iteratively minimised.
This geometric interpretation also allows the position of M to be calculated very easily, a process called triangulation (which will be discussed in Chap. 8). Since the three vectors are coplanar, we can scale m and m' so that they meet in M; their difference must then be equal to t. In other words:

(7.6)  ζ' m' - ζ R m = t

where ζ and ζ' are the unknown scaling factors. This is a system of three equations in two unknowns that can be solved for ζ and ζ'. It is easy to see that these are the depths of the point M in the first and in the second camera, respectively.
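The following hedged sketch solves (7.6) in the least-squares sense for the two depths, given one pair of normalised homologous points m and mp (3x1 vectors with third component equal to one) and a candidate motion (R, t); the variable names are illustrative:

A  = [-R*m, mp];         % coefficient matrix of the 3x2 system (7.6)
zz = A \ t;              % zz(1): depth in the first camera, zz(2): in the second
in_front = all(zz > 0);  % true when the point lies in front of both cameras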

7.2.2 Computing the Essential Matrix
The essential matrix can be computed from the correspondences of at least eight points, using the method described for the fundamental matrix in Sect. 6.4. All it takes is replacing the pixel coordinates with the normalised image coordinates.
Note that, due to the inevitable errors that plague
measurements, in general, the matrix found by solving
the system of linear equations
will not satisfy the requirements of Theorem 7.1, that is, it
will not have two equal singular values and one singular
value equal to zero. However, this can be forced a posteriori
via SVD.
Let E be such a matrix and U D V^T its SVD, with D = diag(σ1, σ2, σ3) and σ1 ≥ σ2 ≥ σ3. It can be shown that (similarly to the Eckart-Young theorem)

(7.7)  E' = U diag((σ1+σ2)/2, (σ1+σ2)/2, 0) V^T

is the closest matrix to E, in Frobenius norm, that satisfies the requirements of Theorem 7.1.
Since E depends on five parameters, it is possible, in
principle, to
compute it with five correspondences plus the polynomial
constraints
provided by Theorem 7.1. Faugeras and Maybank (1990)
showed that
this is indeed the case and that ten distinct solutions exist.
Subsequently, algorithms for computing E with five points
were proposed by Nister
(2003) and Li and Hartley (2006).
Listing 7.1 Linear computation of E

The MATLAB function that computes the essential matrix from point correspondences (Listing 7.1) makes use of the function eight_points that we introduced for computing the fundamental matrix. What changes is the input (pixel coordinates for F versus normalised image coordinates for E) and the type of constraints that are forced a posteriori.
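A minimal sketch of such a function is given below; it is not the book's Listing 7.1, and it assumes that eight_points accepts two 3 x n arrays of homologous points and returns the linear estimate of the bilinear form:

function E = essential_lin(m1, m2)
% Linear estimate of E from normalised homologous points (sketch).
    E = eight_points(m1, m2);        % 8-point linear solution, as in Chap. 6
    [U, D, V] = svd(E);
    s = (D(1,1) + D(2,2)) / 2;       % force two equal singular values ...
    E = U * diag([s, s, 0]) * V';    % ... and a null one (Theorem 7.1)
end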

7.3 Relative Orientation from the Essential Matrix
The following theorem, due to Huang and Faugeras (1989), characterises the essential matrix and allows us to factorise it into rotation and translation.
Theorem 7.1 (Essential Matrix Factorisation) A real matrix E can be factorised as the product of a non-zero antisymmetric matrix and a rotation matrix if and only if E has two equal non-zero singular values and one singular value equal to zero.
Proof Let E = U D V^T be the SVD of E, with D = diag(1, 1, 0) (with no loss of generality, since E is defined up to a scale factor) and U and V orthogonal. The key observation is that:

(7.8)  diag(1, 1, 0) = Z W

where Z is antisymmetric and W is a rotation matrix (of π/2 about the z-axis). Thus,

(7.9)  E = U D V^T = U Z W V^T = (U Z U^T)(U W V^T)

Taking S = U Z U^T and R = U W V^T, the sought factorisation is E = S R. In fact, the matrix U Z U^T is antisymmetric (Problem 7.2) and the matrix U W V^T is a rotation (Problem 7.3). This proves one implication; let us now see the other direction.
Let E = S R, where R is a rotation matrix and S is antisymmetric. By virtue of Proposition A.3, we can write S = U Z U^T for some orthogonal U (up to the irrelevant scale factor). Also, from (7.8) we get: Z = diag(1, 1, 0) W^T. Therefore:

(7.10)  E = S R = U Z U^T R = U diag(1, 1, 0) (W^T U^T R)

Since W^T U^T R is orthogonal, this is a singular value decomposition of E with two equal singular values and one null, as was to be proved.

Observe that t spans the null space of S; hence, it belongs to the null space of E^T, which means that t is equal to the third column of U.
The factorisation of E constructed in the first part of the proof is not unique. In fact, it is possible to transpose both Z and W, obtaining in all four combinations, of which two change the sign of E (irrelevant, since it is defined up to multiplication by a scalar). In fact, we see that Z W^T = -Z W, while by the definition of antisymmetric: Z^T = -Z.
In summary, then, the four possible factorisations are given by:

(7.11)  S = ±U Z U^T,   R = U W V^T  or  R = U W^T V^T
The geometric interpretation of the four solutions is as follows: changing the sign of S swaps right with left, while choosing R = U W V^T or R = U W^T V^T results in a rotation of the camera by π around the baseline. In all cases it is possible to triangulate the tie points, obtaining a three-dimensional model, but only in one case do the tie points lie in front of both cameras (Fig. 7.3), and this requirement identifies the only physically feasible solution. A single tie point is sufficient to discriminate.

Fig. 7.3 The four possible solutions of the factorisation of E. Between the left
and right columns there is a left-right inversion, while between the top and
bottom rows the camera B rotates around the baseline. Only in the top-left
case the triangulated point lies in front of both cameras
The property that specifies whether a point is in front of or behind a given camera is called the chirality of the point.1
The MATLAB implementation is given in Listing 7.2. Note that for the chirality test we solved (7.6) using the function icomb that implements Proposition A.18.

Listing 7.2 Relative orientation from the factorisation of E
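A hedged sketch of such a routine (not the book's Listing 7.2, which relies on icomb) is the following; it enumerates the four factorisations and uses the chirality of a single tie point, given by its normalised projections m1 and m2, to select the feasible one:

function [R, t] = rel_orient_sketch(E, m1, m2)
% Factorise E into the four (R, t) candidates and pick the one for which
% the tie point lies in front of both cameras (chirality test via (7.6)).
    [U, ~, V] = svd(E);
    W  = [0 -1 0; 1 0 0; 0 0 1];        % rotation of pi/2 about the z-axis
    Rs = {U*W*V', U*W'*V'};             % the two candidate rotations
    ts = {U(:,3), -U(:,3)};             % the two candidate translations
    for i = 1:2
        Ri = Rs{i} * sign(det(Rs{i}));  % force a proper rotation (det = +1)
        for j = 1:2
            zz = [-Ri*m1, m2] \ ts{j};  % depths from (7.6)
            if all(zz > 0)
                R = Ri;  t = ts{j};  return
            end
        end
    end
    error('no feasible factorisation found');
end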

7.3.1 Closed-Form Factorisation of the Essential Matrix
Horn (1990a) proved that the factorisation of E can also be obtained without SVD, in closed form.
We will preliminarily note that (based on Problem 7.5) multiplying E by a suitable factor leads us back to the case where ‖t‖ = 1. We therefore assume that this is the case.
Let us write E = [t]× R; then

(7.12)  t^T E = 0

means that each column of E is perpendicular to t, so the latter is parallel to the cross product of any pair of columns of E, for example:

(7.13)

by a triple product property.
As for the recovery of R given , let us turn to the
adjoint matrix, defined as
(7.
14)

(7.
15)

From which we get

(7.
16)
Regarding the ambiguity of the solutions, we have two
solutions of opposite sign for translation, which generate
two solutions for rotation, respectively. Combining them
yields four solutions. This factorisation is implemented in
Listing 7.3.
Listing 7.3 Relative orientation from E in closed form

7.4 Relative Orientation from the Calibrated Homography
In analogy to what we saw in the previous section with
reference to the essential matrix, knowledge of the
homography induced by a plane
between two images, along with the intrinsic parameters of
the cameras, can also serve to recover the relative
orientation of the two images, plus the parameters of the
plane that induced the homography.
From the definition of H_Π (6.23) we immediately get:

(7.17)

This means that the left-hand side term, which we dub calibrated homography, encodes the motion parameters (R and t) and also the plane parameters n and d. As a matter of fact, this equation can be solved (with a double ambiguity in the solution) to obtain: (i) the normal n to the plane, (ii) the rotation R and (iii) the translation scaled with the distance from the plane (t/d). The algorithm we will describe follows, to some extent, the one introduced by Faugeras and Lustman (1988).
The calibrated homography will in general contain an unknown scaling factor, which we can however eliminate by dividing it by its second singular value (Problem 7.7). We therefore assume that the second singular value is equal to one.
We begin by deriving the normal to the plane, and to
this end we consider the SVD of :
(7.
with . 18)
Therefore:

(7.
19)

Note that is only orthogonal, to get a rotation one


would need to fix the determinant, but this is not
necessary.
Let us multiply both members by
getting

and so
(7.20)

(7.21)
(7.22)
Since  and  we have
(7.23)
and finally
(7.24)

where diag(·) consumes a square matrix and returns the vector containing the diagonal. Expanding the components, the last equation becomes:

(7.25)

By collecting the unknowns we obtain a


homogeneous linear system in the squares of the
components of (recall that ):

(7.26)

whose solution, up to a scale, is:


.
Since , we have a total of four possibilities for the
signs of and . The solution is up to a scale, which we
can fix by imposing
.
Finally, we get: , with , V being
orthogonal.
We then proceed to determine the rotation and multiply
both
members of (7.17) by in order to eliminate the term
containing the translations:
(7.27)
Transposing the previous equation
(7.28)
we observe that it is a matter of decomposing the left-hand
term into an antisymmetric matrix and a rotation matrix, as
we did in Sect. 7.3.1. We then use (7.16) where E is
substituted by therefore:

(7.29) Finally, from (7.17) we derive:


(7.30)
To obtain we multiply both members by , remembering
that
:

(7.31)

From each possible  (there are four of them) we get one R, but since the sign of  does not affect R, we actually have only two different R. From each pair we get a , so we
have a total of four possible solutions that return . To
these we must add four more that return
, for a total of eight solutions. In fact, of these two
groups only one satisfies a physical constraint involving
. Listing 7.4 implements this factorisation.
The solutions can be further reduced to two by using
constraints on the visibility of the points, in analogy to the
previous section. See Malis and Vargas (2007) for details
on eliminating impossible solutions and other methods for
decomposition.
Listing 7.4 Relative orientation from calibrated H

Problems
7.1 Derive the essential matrix with two generic PPMs and observe that: a) the translation and the rotation that appear in the expression for E are the relative ones; b) by applying the same isometry to the right of each PPM, the expression for E remains unchanged. This allows us, without loss of generality, to fix P = [I | 0].

7.2 Prove that if Z is antisymmetric, then U Z U^T is also antisymmetric.

7.3 Prove that, given a rotation matrix W and two orthogonal matrices U and V, then U W V^T is a rotation matrix (i.e. it has positive determinant).

7.4 Obtain the direction of the relative translation in the case where the relative rotation is given. Two points are enough.

7.5 Prove that:

(7.32)

7.6 Derive the relationship between the fundamental matrix and the essential matrix.

7.7 Prove that any matrix A of the form A = Q + a b^T with Q orthogonal must have σ2 = 1, where σ2 is the second singular value.

References
O. Faugeras and S. Maybank. Motion from point matches: multiplicity of solutions. International Journal of Computer Vision, 4(3):225-246, June 1990.

O. D. Faugeras and F. Lustman. Motion and structure from motion in a piecewise planar environment. International Journal of Pattern Recognition and Artificial Intelligence, 2:485-508, 1988.

Berthold Horn. Recovering baseline and orientation from essential matrix. Unpublished, 1990a.

B. K. P. Horn. Relative orientation. International Journal of Computer Vision, 4(1):59-78, January 1990b.

T. S. Huang and O. D. Faugeras. Some properties of the E matrix in two-view motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1310-1312, December 1989.

Hongdong Li and Richard Hartley. Five-point motion estimation made easy. In Proceedings of the International Conference on Pattern Recognition, pages 630-633, Washington, DC, USA, 2006. IEEE Computer Society.

H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133-135, September 1981.

Ezio Malis and Manuel Vargas. Deeper understanding of the homography decomposition for vision-based control. Research Report RR-6303, INRIA, 2007. https://hal.inria.fr/inria-00174036.

David Nister. An efficient solution to the five-point relative pose problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, page 195, Los Alamitos, CA, USA, 2003. IEEE Computer Society.

Footnotes
1 A geometric configuration is said to be chiral if it is not superimposable on
its mirror image, such as right hand and left hand.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_8

8. Reconstruction from Two Images
Andrea Fusiello1
(1) University of Udine, Udine, Italy
Email: [email protected]

8.1 Introduction
We will deal in this chapter with triangulation, which allows
us to obtain the 3D coordinates of the tie points, that is points
whose projections
have been matched in the images. The set of 3D points
obtained from
triangulation is also called model. We will see that the model
differs from the real one by transformations that reflect our
degree of knowledge
about the sensor and/or the scene.

Triangulation and resection are somehow dual operations: in one case, by observing a point from at least two different known locations, one derives the position of the point, while in the other case one orients the camera by measuring the position of at least three points.

8.2 Triangulation
Given homologous points in the two images and given the
two PPMs, triangulation (also known as forward intersection)
aims to reconstruct the position in object space of the

respective tie points (refer to Fig. 6.1).


Consider m, the projection of the point M onto the camera whose PPM is P. From the perspective projection equation (3.26), we derived in Sect. 4.2 that:

(8.1)

These are three equations in the four unknowns of M (the fourth component of M is left free and figures among the unknowns), but only two of them are linearly independent, the rank of [m]× being two.
Let us now also consider m', the homologous point of m in the second image, and let P' be the second PPM. Since m and m' are projections of the same tie point M, the equations given by both can be stacked, yielding a homogeneous linear system of six equations (of which no more than four are independent) in four unknowns:

(8.2)
The solution is the null space of the matrix A, which therefore must have rank three; otherwise, the only solution would be the trivial one M = 0.
In the presence of noise, however, the condition on the rank is not satisfied exactly, and so a least-squares solution is obtained as the right singular vector of A associated with the smallest singular value (see Proposition A.13). Hartley and Sturm (1997) call this method linear-eigen, because the right singular vectors are the eigenvectors of the matrix A^T A.
The above generalises to the case of m images: each image adds two equations and we obtain a homogeneous system of 2m equations in four unknowns. The MATLAB code for the triangulation of one point in multiple images is shown in Listing 8.1.

Listing 8.1 Triangulation
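A minimal sketch of linear-eigen triangulation (not the book's Listing 8.1; m is a 3 x n array of homogeneous image points and P a cell array of the n corresponding PPMs):

function M = triang_lin_sketch(m, P)
% Linear-eigen triangulation of one point seen in n images (sketch).
    A = [];
    for j = 1:numel(P)
        S = [0 -m(3,j) m(2,j); m(3,j) 0 -m(1,j); -m(2,j) m(1,j) 0];
        A = [A; S * P{j}];          % stack the equations (8.1) of each image
    end
    [~, ~, V] = svd(A);
    M = V(1:3,end) / V(4,end);      % least right singular vector, dehomogenised
end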
Iterative refinement of the linear-eigen method. With reference, for example, to the second equation of the linear system (8.2), the residual that is minimised is an algebraic one, while instead we would like to minimise the geometric residual, that is, the difference between the measured u coordinate and the u coordinate of the projection of M (see also Chap. 9). Observe that the two residuals differ by a multiplicative factor: if the equation were weighted with the inverse of the third component of the projection P M, we would be minimising exactly the geometric residual. The problem is that the weights depend on M, which is precisely what we want to compute. The solution is to iterate the linear-eigen method, weighting the equations with values calculated from the coordinates of M at the previous step.
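A hedged sketch of this refinement, built on the triangulation sketch above (illustrative variable names: m is 3 x n, P a cell array of PPMs, M the current Cartesian estimate of the point):

for it = 1:10                            % a few reweighting passes usually suffice
    A = [];
    for j = 1:numel(P)
        S = [0 -m(3,j) m(2,j); m(3,j) 0 -m(1,j); -m(2,j) m(1,j) 0];
        w = 1 / (P{j}(3,:) * [M; 1]);    % inverse of the third component of P*M
        A = [A; w * S * P{j}];           % re-weighted equations
    end
    [~, ~, V] = svd(A);
    M = V(1:3,end) / V(4,end);           % updated estimate of the point
end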

8.3 Ambiguity of Reconstruction


Consider a set of 3D points, framed by two cameras
represented by P and . Let be the (homogeneous)
coordinates of the projection of the j- th point in the first
image and the ( homogeneous) coordinates of the
projection of the j-th point in the second image.
The reconstruction problem can be posed as follows: given
the
coordinates (in pixels) of the homologous points and
find the PPMs and the model such that:
(8.3)
We will also denote by the term reconstruction the pair
formed by the set of PPMs and the 3D model such that
(8.3) holds.
This reconstruction can be unique or ambiguous
depending on what is known about the PPMs. We shall see
that the reconstruction differs
from the true one by transformations that become
progressively more general as the information on the
PPMs decreases.
We will preliminarily note that if the two PPMs and the 3D points are a reconstruction, that is, they satisfy (8.3), then also the PPMs multiplied on the right by any non-singular 4 x 4 matrix T, together with the points transformed by T^-1, satisfy (8.3); T specifies a linear transformation of the 3D projective space.
If the PPMs are unconstrained (no information), any T
produces a valid reconstruction. On the other hand, if we
know that the PPMs must satisfy certain constraints, this
limits the generality of T to those
transformations that preserve the constraints.
At one extreme of the spectrum there is the full
knowledge of
(barring the usual irrelevant scaling factor). In this
case T can only be the identity; in fact, any other T would
change at least one of the two PPMs (see Problem 8.1). The
reconstruction is identical to the true one, expressed in the
same reference frame in which the exterior
orientation of the camera is given.
If the two PPMs are known modulo a direct isometry (the same for both), it means that the exterior orientation is unknown but the relative orientation is assigned. In this case T is constrained to be a direct isometry, which leaves the relative orientation unchanged (see Problem 7.1). Correspondingly, the points are pre-multiplied by T^-1, which is itself a direct isometry. The reconstruction that is obtained differs from the true one by an (unknown) direct isometry and is called metric reconstruction. Note that in this case the interior orientation is known, so the PPMs must have the structure K[R | t] with R a rotation, a property that a direct isometry preserves.
In practice, the modulus of translation is hardly known
(only in
special cases such as a pair of cameras rigidly mounted on
a bar). Much more frequent is the case of one camera that
moves freely, so one can
only recover the relative orientation up to the translation
modulus, as
seen in Chap. 7. In this case T is a similitude and the global
scaling that it contains only affects the translation
component of the PPM (see
Problem 8.2). The resulting reconstruction differs from the
true one by a similitude and is called Euclidean reconstruction.
The opposite extreme is when nothing is known about
the PPM. In this case, as already observed, T can be any
non-singular matrix, and we get a reconstruction differing
from the true one by an arbitrary
projectivity, which is why it is called projective reconstruction.
At the end of the chapter we will pick up this
discussion and summarise it in a table.

8.4 Euclidean Reconstruction


Consider a single moving camera; the intrinsic parameters
(i.e. K) are
known, but the motion of the camera is unknown (i.e. the
extrinsic
parameters are missing). In this case, a Euclidean
reconstruction is
obtained, as we have already discussed. The procedure to
obtain such a reconstruction is also called Structure from
Motion (SFM) in the
literature. It is based on recovering the relative orientation
(up to the modulus of translation) from E and the
observation that the following pair of canonical PPMs

(8.4)  P = [I | 0],   P' = [R | t]

yields the essential matrix E = [t]× R. This pair is not the only one to have this property: as already noted, we can obtain the same E by multiplying the two PPMs by the same similitude T. The Euclidean reconstruction we obtain is defined up to a similitude, which is arbitrarily fixed by choosing P = [I | 0] and ‖t‖ = 1.
Once the two PPMs are instantiated, the model is
obtained by triangulation. The
procedure is illustrated in Listing 8.2, while Fig. 8.1 shows
a sample of the output.
Fig. 8.1 Two images with the homologous points (top row). Two views of the
reconstructed 3D model (bottom row). The units on the axes are meaningless,
as the model is known up to a scale
Listing 8.2 Reconstruction from two images
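A hedged sketch of the whole procedure (not the book's Listing 8.2) can be assembled from the sketches given earlier, assuming intrinsic matrices K1, K2 and pixel correspondences p1, p2 stored as 3 x n homogeneous arrays:

m1 = K1 \ p1;   m2 = K2 \ p2;                       % normalised coordinates
E  = essential_lin(m1, m2);                         % linear estimate of E
[R, t] = rel_orient_sketch(E, m1(:,1), m2(:,1));    % factorisation + chirality
P  = {[eye(3), zeros(3,1)], [R, t]};                % canonical pair (8.4)
M  = zeros(3, size(m1,2));
for i = 1:size(m1,2)
    M(:,i) = triang_lin_sketch([m1(:,i), m2(:,i)], P);   % Euclidean model
end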

The upgrade from Euclidean to metric reconstruction is obtained as soon as the length of the baseline, or the distance between two control points, is known, which fixes the unknown scale of the model. In order to upgrade this to the identical reconstruction, it is necessary to determine also the unknown isometry, through the knowledge of the coordinates of at least three control points, by solving an absolute orientation problem (Sect. 5.2).

This is not the only route to follow in order to obtain an


identical
reconstruction when a sufficient number of control points
is given.
Other procedures are well studied and codified in
photogrammetry
(Kraus 2007). One alternative is to solve for the exterior
orientation of each image separately (Chap. 5). While in
the previous case
(relative + absolute orientation) three control points in all
are
sufficient, for exterior orientation three control points must
be visible
in each image, which makes three to six control
points in total, depending on visibility. A third path
would be to run bundle adjustment (see Chap.
14) with three control points.
8.5 Projective Reconstruction
When the intrinsic parameters are unknown, the only
information that can be extracted from a pair of images
are the homologous points,
and from them it is possible to compute the fundamental
matrix. By following mutatis mutandis the procedure of
Sect. 8.4 based on the essential matrix, we arrive at a
reconstruction of the scene, defined however up to a
projectivity of the space.
A fundamental matrix can be factorised into

(8.5)  F = [e']× A

with an appropriate matrix A, which is said to be compatible with F. The reader will note a resemblance with the factorisation E = [t]× R. Unfortunately, however, for the fundamental matrix there are infinitely many such factorisations. In fact, if a matrix A is compatible with F, then also A + e' v^T is compatible, for every vector v, since

(8.6)  [e']× (A + e' v^T) = [e']× A = F
If A is any matrix compatible with F, it is easily verified that the following pair of PPMs:

(8.7)  P = [I | 0],   P' = [A | e']

yields the given fundamental matrix. Once the two PPMs are instantiated, the (projective) model is obtained by triangulation.
We are left with the problem of obtaining a matrix A compatible with F. We will see in Problem 8.3 that any homography induced by a plane, including H∞, is compatible with F. Another matrix that is compatible with F but is not a homography (being singular) is the following:

(8.8)  A = [e']× F

Substituting into (8.7), we obtain the so-called canonical representation, introduced by Luong and Viéville (1996).
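A hedged sketch of building the canonical pair from a given F (the epipole e' is obtained as the left null vector of F):

[U, ~, ~] = svd(F);
ep  = U(:,3);                                  % epipole of the second image: ep'*F = 0
Sep = [0 -ep(3) ep(2); ep(3) 0 -ep(1); -ep(2) ep(1) 0];
P1 = [eye(3), zeros(3,1)];                     % first camera
P2 = [Sep * F, ep];                            % second camera of the canonical pair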
Upgrade from projective to Euclidean reconstruction is
achieved by knowledge of the intrinsic parameters (Sect.
8.6) or through a process known as self-calibration, which
we will study in Chap. 16. Otherwise, one can step
directly to the identical reconstruction through the
knowledge of the coordinates of at least five control
points, which are sufficient to compute a projectivity of
the space, for example with the DLT algorithm.
8.6 Euclidean Upgrade from Known
Intrinsic Parameters
In this section we shall see how to upgrade to Euclidean a
projective pair of PPM when the intrinsic parameters are
given. The problem was
addressed by Bougnoux (1998) who formulated it as a
system of linear equations. The closed form reported here
is taken from Gherardi and Fusiello (2010).
(8.9)
and suppose we know the intrinsic parameters of the two
PPMs, K and , respectively. There must exists a non-
singular matrix ,
which transforms the projective reconstruction into the
Euclidean one, that is:
(8. 10)
It is easy to see that the expression for P in (8.9) fixes the
following form of T:

(8.
11)
where is a vector of three elements and s is a
scalar. From the expression for in (8.9) instead we
get:

(8.
and then, for each block: 12)

(8.
13)
While translation (up to a scale) is immediately derived,
for rotation it is necessary to also recover the unknown
vector . A solution can be found by observing that there
always exists a rotation Q such that:
Then, after multiplying both members of (8.13) by Q:
(8.
14)

calling W the first right-hand term and its rows, we arrive


at an
expression where the last two rows of are explicit,
since they are
independent of :

(8.
15)

The first row of is obtained as the cross product of the


last two, and from this we finally derive R.
In the presence of errors (8.15) may be only
approximately verified; in that case we cannot assume that
and are orthonormal, so
before computing the first line they must be forced to be so,
for example via SVD, as we did in Listing 8.3.
Listing 8.3 Euclidean upgrade of a projective
reconstruction
Even if one proceeds to Euclidean reconstruction
starting from the fundamental matrix and performing
the final upgrade with the
known interior orientation, the chirality problem that we
addressed in the factorisation of E cannot be avoided.
Thus, the second PPM of
the resulting pair must eventually be adjusted
so that the triangulated points are in front of
the camera.

8.7 Stratification
This section may be challenging because of the technical
detail, but it is supplementary and can be omitted without
affecting the understanding of the rest.
We have seen in the previous sections that depending on
the
information held on the cameras— that provide images from
which
measurements are made— one can have access to different
descriptions of the three-dimensional structure of physical
space, via the
reconstruction. This hierarchy is called stratification by
Faugeras (1995), and each of the levels is called stratum.
The table below summarises the strata that are relevant to our discussion. The second row reports the transformation
that links the obtained reconstruction to the true one,
while the third shows the
invariants, that is quantities that remain unchanged
after this transformation. They are defined as
follows:
(8. 16)
(8. 17)
(8. 18)
(8. 19)
(8.20)
(8.21)
The denotes a normalisation. For the PPM it means
dividing by the norm of the third row of .
The last row gives the canonical pair of PPM that can be
written in
that case, using only the invariants. Transforming by T the
canonical pair yields other possible pairs of cameras that
are consistent with the
invariant above.
Rows 4 and 5 describe how the canonical pair is
obtained from what is known or measurable. “Match” stands for “a sufficient number of homologous points”.

Reconstruction Identical Metric Euclidean Affine Projective
Transformation T Identity Isometry Similitude Affinity Projectivity
Unchanged by , ,
T (E)

Known + match + match + pts match


+ baseline match
Procedure Void Relative o. + Relative o. Compute Compute F,
set scale ,
Canonical pair

The affine stratum corresponds to the knowledge of the


plane at
infinity and its homography, that can be used to
instantiate a camera pair. The corresponding
transformation T that preserves the
structure of the cameras is an affinity, that does not
change the plane at infinity. More on this in Luong and
Viéville (1996).

Problems
8.1 Prove that, given a PPM P, any transformation  leaves it unchanged (modulo the scale), that is . However, if the same T must leave unaltered two different PPMs, then it can only be the identity (up to scale).

8.2 Show that if T is a scaling and P a PPM, the product only affects the fourth column of P.

8.3 Prove that, given any plane , the induced homography is always compatible.

8.4 Compare the statement of the previous problem with the observation about the degrees of freedom of  in Sect. 6.5, and show that  if and only if .

References
S. Bougnoux. From projective to Euclidean space under any practical situation, a criticism of self-calibration. In Proceedings of the International Conference on Computer Vision, pages 790-796, Bombay, 1998.

Olivier Faugeras. Stratification of three-dimensional vision: projective, affine, and metric representations. J. Opt. Soc. Am. A, 12(3):465-484, Mar 1995.

Riccardo Gherardi and Andrea Fusiello. Practical autocalibration. In Proceedings of the European Conference on Computer Vision, Lecture Notes in Computer Science, pages 790-801. Springer, Berlin, 2010.

R. I. Hartley and P. Sturm. Triangulation. Computer Vision and Image Understanding, 68(2):146-157, November 1997.

Karl Kraus. Photogrammetry - Geometry from Images and Laser Scans - 2nd edition. Walter de Gruyter, Berlin, 2007.

Q.-T. Luong and T. Viéville. Canonical representations for the geometries of multiple projective views. Computer Vision and Image Understanding, 64(2):193-229, 1996.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_9

9. Non-linear Regression
Andrea Fusiello1
(1) University of Udine, Udine, Italy
Email: [email protected]

9.1 Introduction
In this chapter we will see how to refine the results
obtained from the linear algorithms studied so far in
order to compute better estimates (in terms of precision and
accuracy) in the face of input data affected by
errors. We will do this at the cost of having to use non-linear
minimisation iterative algorithms, such as the Levenberg-
Marquardt
algorithm, which involve a greater computational burden
and have no guarantee of global convergence to the
optimal value. Our job is to
provide the cost function and its derivatives, which will be
the bulk of the work done in this chapter. Dealing with
derivatives forces us to be overly pedantic. However, the
chapter is not essential to the economy of the text if one
wishes to limit oneself to linear methods. At least the first
two sections (Sects. 9.3 and 9.4) are necessary, though, if
one wants to understand bundle adjustment in Chap. 14.

9.2 Algebraic Versus Geometric Distance
Consider the problem of regressing a conic on a set of points of the plane. The conic is represented by the implicit equation:  where . Because  is a polynomial, this is a special case of an algebraic variety.1
The immediate approach for regressing the conic on
the points is based on minimising the following cost
function:

(9. 1)

The term is a polynomial in , but it is linear in the


unknown
parameters , so this is a linear least-squares
problem (see Sect. C. 1).
If the parameter vector is a solution, then so is any non-zero multiple of it; hence it needs to be somehow normalised, typically by setting to a constant value a function of its coefficients, for example one of them or the norm of the vector. The latter case corresponds to the least-squares solution described in Proposition A.13.
Let us call the polynomial after normalisation and
define the
algebraic distance of from the conic as .
Different
normalisations yield different algebraic distances. The
essential feature of an algebraic distance is that it is
computed by evaluating a normalised polynomial where
the normalisation does not depend on the point . An
important aspect that makes algebraic distances attractive
is that they always lead to a regression that is solved by
eigendecomposition
(or by SVD).
This approach, however, has some problems. First, not
all algebraic distances that can be defined are invariant to
the change of reference frame, although Bookstein
(1979) proposes one that is invariant to a
similitude. Second, algebraic distances yield biased
regression models. Bookstein shows that is
proportional to where d and are defined
as in Fig. 9.1. This causes, for example, deviations from an
ellipse to be overweighted where the curvature is low and
underweighted elsewhere, with the effect of producing
conics that fit points better in the low curvature sections
than in the high curvature sections: this is the bias effect.
Fig. 9.1 Algebraic distance in the regression of conics (taken from Bookstein
1979)
What we would like to minimise instead is the geometric distance, defined as the distance from the point to the conic, that is, the distance from the point to the closest point on the conic. This distance is invariant to the choice of the reference frame and produces unbiased regression models. The price to pay for this is that the expression of the point-conic distance is non-linear and thus one must solve a non-linear least-squares problem.
In addition, the point-conic distance is cumbersome to
calculate. For this reason one must be content with
surrogates, in the form of non-
algebraic distances such as the Sampson distance, defined as the
distance of the point to the first-order approximation of
the conic. Let us consider the Taylor expansion of around
, the point on the conic that realises the minimum
distance from :
(9.2)
Since belongs to the conic: and the normal in
points towards , that is, has the direction of
. Confusing with we get:
(9.3)
from which

(9.4)

Since the normalisation depends on the current point ,


this is not an algebraic distance, although it is also not
geometric.
It offers the following advantages: first, the Sampson
distance is
computationally more tractable than the geometric distance,
while being invariant for translations and rotations and
varying in proportion to the geometric distance under
uniform scaling. Second, it coincides with the
geometric distance for flat surfaces and approximates
it to the first order for non-flat surfaces (in the sense
that it approximates the surface). Finally, it is
invariant to scaling of .
In this section we used the example of conics to introduce
some
general concepts. In particular, the fact that the geometric
distance
enjoys good properties but leads to non-linear least-
squares problems, while algebraic distances give rise to
linear least-squares problems but yield biased (hence less
accurate) regression models, is a notion that applies to
all the problems we will address.
On the other hand, the fact that the geometric distance
is onerous to compute is not general. It is true for conics,
homographies and
fundamental matrices (for which we will use the Sampson
distance,
indeed), while for other problems (e.g. calibration) the
geometric
distance is easy to compute and there is no need to resort to
surrogates.

The eight-point method minimises an algebraic distance


defined by
the polynomial with the normalisation .
It is not invariant to the reference frame, but
preconditioning in practice
mitigates this aspect. In analogy to the work of Bookstein
(1979) on conics, Torr and Fitzgibbon (2004) suggest
fixing the scale of F via the Frobenius norm of its principal
submatrix consisting of its first two
rows and columns, so as to obtain an algebraic error
invariant to the reference frame. Be aware though that
this parameterisation fails
when the principal submatrix vanishes, which
corresponds to parallel epipolar lines.

9.3 Non-linear Regression of the PPM


The DLT method for calibration by resection illustrated in
Sect. 4.2 is simple and fast but minimises an algebraic
distance (4.2), which,
although being invariant to similitudes, overweights
distant points and is biased towards small focal lengths
(Hartley and Zisserman 2003). To get things right we
shall therefore minimise a geometric distance.
Let us write the perspective projection as a
composition of transformations, in order to exploit
the chain rule to compute its
derivative. We begin by introducing the function that
performs the so- called perspective division:
(9.5)

Therefore, the perspective projection writes:


(9.6)
The last step is justified by the fact that K is affine and will
allow us to insert the radial distortion before the
transformation into pixels (Sect. 9.5.3). Finally, to step to
the Cartesian coordinates, we have to remove the third
component, through pre-multiplication by the matrix

(9.7)

thus obtaining:
(9.8)
where we defined as the composition of and :

(9.9)

9.3.1 Residual
The relevant distance for this problem is the distance between the image point and the projection of the corresponding control point, also called reprojection error. In formula, the cost function is:

(9. 10)

The minimisation of this cost function corresponds to the


least-squares solution of a system of 2n non-linear
equations (two for each point)
where each pair of equations is given by
(the left part of the equation is the residual vector).
The solution involves iterative techniques with local
convergence— such as the Levenberg-Marquardt algorithm
(see Sect. C.2.2)—which
therefore require an initial estimate of the solution,
achievable, for example, with the DLT method.
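As an illustration only, the refinement can also be prototyped with MATLAB's lsqnonlin (Optimization Toolbox) and numerical derivatives, instead of the hand-coded Levenberg-Marquardt with analytic Jacobians developed in this chapter; par2ppm and ppm2par are hypothetical helpers mapping the 11 parameters to and from the PPM, M the 3 x n control points and m the 2 x n measured image points:

res = @(p) proj(par2ppm(p), M) - m;     % 2 x n reprojection residuals
p0  = ppm2par(P_dlt);                   % initial estimate from the DLT
P   = par2ppm(lsqnonlin(res, p0));      % refined PPM

function q = proj(P, M)                 % pinhole projection of 3 x n points
    w = P * [M; ones(1, size(M,2))];
    q = w(1:2,:) ./ w(3,:);
end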

9.3.2 Parameterisation
Since P has 11 degrees of freedom, while the matrix has 12 entries, it must be appropriately parameterised with 11 parameters, which are the 6 extrinsic parameters and the 5 intrinsic parameters.
We will refer, nominally, to this set of five intrinsic parameters, but it is clear that any parameterisation of K (also a reduced one) will do.

9.3.3 Derivatives
The Levenberg-Marquardt algorithm needs the derivative of the residual with respect to the parameters that determine the unknowns to find its way to the solution. In this specific case the unknown is the matrix P, assigned through its parameterisation; therefore, the derivative that must be computed is:

(9. 11)

This Jacobian matrix can be expanded into ( we will


henceforth omit the index i):

The Jacobian matrix of the system (9.10) is obtained by


stacking n of these matrices, one for each equation of
the system.
If we consider separately the extrinsic parameters
and the intrinsic ones
, the Jacobian is partitioned into two
blocks, corresponding respectively to intrinsic and
extrinsic parameters:
(9. 12)

We will work out these two blocks separately.


Thanks to the chain rule, the derivative with respect to
the extrinsic parameters expands to:

(9.
13)

because

(9.
where 14)

(9.
15)
and , the Jacobian of the rotation matrix with respect
to the Euler angles, has been derived in Sect. B.2.
As for the intrinsic parameters, thanks to the chain rule,
we have:

(9.
16) where is the Jacobian of the parameterisation of K
and is easily
computed entry-wise.
Listing 9.1 reports the implementation of the non-linear regression of the PPM, or non-linear calibration. Notice how the initial estimate for P is used to transform the control points. In this way the Euler angles are close to zero and the singularities of the Euler parameterisation are avoided.
Listing 9.1 Non-linear calibration

Listing 9.2 Computation of the reprojection residual and its derivatives

The computation of the reprojection residual and its derivatives prior to the parameterisation of the unknowns is implemented in Listing 9.2. Block A in the code implements (9.13), while block C implements (9.16), without the derivative of the parameterisation, which will be added by the caller.
This function will be invoked in other contexts, so it contains more than is needed in this section. In particular, block B will be introduced in Sect. 9.5, while the radial distortion and block D will be introduced in Sect. 9.5.3.
Please note that the removal of the third row for blocks A and B is obtained by using a submatrix of K in place of K.
Kin place of K.

9.3.4 General Remarks
This first example of non-linear regression gives us an opportunity to focus on the elements that are needed in similar cases, which have been highlighted with subsections, namely:
1. Identify a significant residual to be minimised

2. Define a parameterisation of the unknowns

3. Select a minimisation scheme

4. Compute the derivative of the residual (if required).

As for point 1, the least-squares cost function is a


summation:
(9. 17)
where is the residual, in general a vector. Its norm is
interpreted as a distance (algebraic, geometric or other), so
the cost is a sum of
squared distances. It is evident that the cost function is
equal to the sum of squares of the components of each
residual, so it is equivalent to the Frobenius norm of the
matrix whose columns are the residual vectors:
(9. 18)
As for point 3, we choose once and for all the
Levenberg-Marquardt (LM) algorithm (Sect. C.2.3), and
therefore, the derivatives (point 4) are always required.

9.4 Non-linear Regression of Exterior Orientation
Regressing the exterior orientation is similar to
calibration; what changes is the fact that the intrinsic
parameters are known and
therefore image points are represented in normalised image
coordinates . The cost function is:

(9.
19)

The Jacobian matrix of the residuals corresponds to that defined in (9.13) with K = I, since we work in normalised coordinates.
The MATLAB implementation is shown in Listing 9.3. As in the non-linear calibration, the transformation of the control points avoids the singularities of the Euler angle representation.
Listing 9.3 Non-linear regression of exterior orientation
9.5 Non-linear Regression of a Point in Space
The point triangulation introduced in Sect. 8.2 leads to a
linear
regression but minimises an algebraic distance. In the same
section it
has been noted that the difference between algebraic and
geometric
distance lies in the weights that are given to the equations,
and these
weights bias the algebraic solutions. In addition, the
algebraic distance is not invariant to affine
transformations (Hartley and Zisserman 2003), a feature
that makes the method unsuitable to operate when the
reference is not Euclidean.

9.5.1 Residual
Let m and m' be projections of the same 3D point M through two PPMs P and P', respectively. As in the calibration (Sect. 9.3), we consider the reprojection error, that is, the distance between the image point and the projection of the corresponding 3D point:
(9.20)
In this case the unknowns are not the intrinsic and extrinsic
parameters
that determine the PPM but the coordinates of the point
in space. The previous equation generalises to m
images:

(9.21)

where j is the image index.


This is a non-linear least-squares problem, each term is
the squared distance between the given point in the j-th
image and the projection of through ; the residuals
are the differences .
Since the unknown does not require any
parameterisation, we proceed directly to calculate the
derivatives.
9.5.2 Derivatives
Let us compute the derivative of the residual, namely (9.21), or:
(9.22
)

The Jacobian matrix of the system (9.21) is obtained by stacking m of these matrices, one for each image.
Let us see in detail how the derivatives are computed.
By applying the chain rule (and omitting the index) we
have:

(9.23
)

where is given in (9.15).


The MATLAB code of the non-linear regression of one point, or non-linear triangulation, is reported in Listing 9.4. The Jacobian (9.23) is implemented in block B of Listing 9.2.
Listing 9.4 Non-linear regression of a point (triangulation)
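For illustration, non-linear triangulation can be prototyped as follows (a hedged sketch, not the book's Listing 9.4, using lsqnonlin with numerical derivatives instead of the analytic Jacobian (9.23); m is 2 x n, P a cell array of the n PPMs, M0 the linear estimate):

function M = triang_nonlin_sketch(m, P, M0)
    M = lsqnonlin(@residual, M0);
    function r = residual(M)
        r = zeros(2, numel(P));
        for j = 1:numel(P)
            q = P{j} * [M; 1];
            r(:,j) = m(:,j) - q(1:2) / q(3);   % reprojection residual in image j
        end
    end
end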

9.5.3 Radial Distortion
With reference to the projection model defined by (9.6), the radial distortion (defined by (3.38)) is introduced before the transformation into pixels operated by K, that is:
(9.24)
Taking this into account, it is easy to verify that the
derivatives that we previously computed for the
reprojection residual are modified as
follows:

(9.25)
(9.26)
(9.27)

where and .
Finally, we have to compute the derivative of the
reprojection
residual with respect to the radial distortion coefficients,
gathered in the vector :

(9.28) The derivative of the radial distortion with respect to


the parameters is easily computed component by
component. This expression
corresponds to block D in Listing 9.2.
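A hedged sketch of applying a polynomial radial distortion to normalised coordinates is given below; it assumes a two-coefficient even-order model, which may differ in detail from the book's (3.38):

function md = rad_distort(m, k)
% m: 2 x n normalised coordinates; k: distortion coefficients k(1), k(2).
    r2 = sum(m.^2, 1);                         % squared radial distance
    md = m .* (1 + k(1)*r2 + k(2)*r2.^2);      % distorted coordinates
end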

9.6 Regression in the Joint Image Space
In the case of homography and fundamental matrix, we are
in a context much more similar to the regression of conics
introduced at the
beginning. The analogy becomes apparent when we map the
input data
into a suitable joint image space, containing the
direct sum (juxtaposition) of the input
coordinates.
In fact, estimating the homography (and the fundamental
matrix)
between two sets of homologous points can be interpreted
as a non-
linear regression problem in , where we want to determine
the
algebraic variety (we can visualise it as a surface) that best
approximates (i.e. passes as closely as possible) a set of
points in .
These points are given by the juxtaposition of the Cartesian
coordinates
of the homologous points and in the two images, and
the variety is defined by the implicit equation
or .
What should be minimised in this case is the square of the
(geometric) distance between the points and this variety
of .
This distance is onerous to compute, so we settle for an
approximation of it, introduced by Sampson (1982) in the
context of the regression of conics.
It is proved in Hartley and Zisserman (2003) that the
Sampson
distance between a point and a variety of defined by
the implicit equation is given by
(9.29
where )

(9.30
)

and is the pseudoinverse of , which, if has full


row rank, is given by:
(9.31

The Sampson residual is the vector inside the norm:


)

(9.32
)
The regression on n points thus minimises the following cost function:

(9.33
)

This general form of the residual holds for both


homographies and fundamental matrices; in the specific
cases we need to instantiate and its derivative , which
we will do in the next two paragraphs.

9.7 Non-linear Regression of the Homography
Let us see how to use the Sampson residual in the
regression of the homography.
9.7.1 Residual
In the case of the homography, if m and m' are two homologous points (from now on we shall leave out the subscript i):

(9.34)

where the pre-multiplication by the matrix has the effect


of removing the third row, since only two of the equations
are independent, as we
noted with regard to DLT. So in this case and
the algebraic variety is defined by the zeros of two
polynomials.
As regards the computation of , we first
observe that the derivative of the cross product is:
(9.35)
(in the derivative with respect to the sign changes). So we
get:
(9.36)
and consequently:
(9.37)
These derivatives are with respect to the homogeneous
vector; to get the derivatives with respect to Cartesian
coordinates alone, we must
eliminate the third column, through post-
multiplication by the projection matrix , that is:
(9.38)
9.7.2 Parameterisation
The parameterisation of H is straightforward; it only needs to take homogeneity into account, that is, fix its scale. The simplest (and most common) choice would be to fix the entry in position (3,3) to one, but this would exclude all the homographies that have this element set to zero (these are non-affine mappings that send the origin to infinity). Note, however, that the entries of the third row cannot be all zero; otherwise, the matrix would be singular. It is therefore safe to fix the scale by setting the norm of the last row of H to one, as this would exclude only singular matrices (which are not homographies).
Therefore, the third row of H is identified with a point on
the unitary sphere of parameterised with two angles
:

(9.39)

9.7.3 Derivatives
To proceed to the solution with Gauss-Newton, we must compute the derivative of the Sampson residual:
(9.40)
with  given by (9.34) and  given by (9.38). Setting , the above equation rewrites:
(9.41)
Let us apply the rule for the derivative of the product
and the chain rule to obtain:

(9.42)
There are still three terms to be expanded in (9.42).
The first one is the derivative of the pseudoinverse that is
given by
(see Sect. B. 1):
(9.43
)
The second one is

(9.44)

where is a matrix of zeros and ones which is


easily derived element by element.
The last term is the derivative of , which is
computed straightforwardly starting from:
(9.45)
Considering that the residual is given only by the first two
components of , we pre-multiply by the matrix :

(9.46)

Finally, to obtain the derivative of with respect to the


parameters which determine H, we must post-multiply
(9.42) by the derivative of the parameterisation of H,
which is computed from (9.39) element by element.
Listings 9.5, 9.6 and 9.7 implement this method.
Listing 9.5 Non-linear regression of H

Listing 9.6 Parameterisation of H


Listing 9.7 Sampson distance for H

9.8 Non-linear Regression of the Fundamental Matrix
As already anticipated, the case of the fundamental matrix falls in the same class as conics and homographies.
9.8.1 Residual
In the case of the fundamental matrix, is a scalar
function
(9.47
)
Therefore, is a row vector of four columns.
We shall now obtain the expression for .
Remembering that , we see that:

(9.48
)
These derivatives are with respect to the
homogeneous vector; to obtain the derivatives with
respect to the Cartesian components, we need to
eliminate the third component, via the projection matrix
introduced earlier, hence:
(9.49
)
The Sampson residual is then:

(9.50
)
with given by (9.47) and given by (9.49).
Consider the square norm of the residual, that is, the
square of the Sampson distance:

(9.51)

This expression highlights that the Sampson distance


coincides with suitably scaled (as we have seen for the
conics).

9.8.2 Parameterisation
We must represent F by a set of parameters equal in number to its degrees of freedom, that is, seven, so that F is singular and the scaling factor is fixed.
To fix the scale, we can arbitrarily fix the norm of a
subset (possibly improper) of F entries. Note, however, that
any choice excludes from
parameterisation all F that have those entries null. The
singularity of the matrix can be obtained by writing one row
or column as a linear
combination of the other two. This parameterisation does
not exhaust all matrices of rank two; in fact, by choosing
, for example, to write the first column of F as a linear
combination of the other two, we give up
representing matrices that have the second or third column
null or
linearly dependent on each other. To explore the pros and
cons of these choices, consult (Luong and Faugeras 1996;
Zhang 1998; Torr and
Fitzgibbon 2004).
As for us, we will follow the approach of Bartoli and
Sturm (2004), which avoids the problems just mentioned
by imposing the rank and
scale constraints through the SVD. The fundamental matrix is decomposed as F = U diag(σ1, σ2, 0) V^T, where U and V are rotation matrices and σ1, σ2 are two positive reals. The scaling is fixed by setting the norm of the vector (σ1, σ2) to one, that is, by parameterising its two entries as the sine and cosine of an angle. The two rotation matrices are parameterised by three Euler angles each. The careful reader might object that U and V in the SVD are only orthogonal; however, any sign change required to make their determinant positive is absorbed by the homogeneity of F.
The parameterisation of rotations with Euler angles has
some
singularities; to avoid them we work in a neighbourhood
of the origin, making the parameterisation incremental,
that is:
(9.52)
where  and  are constant rotation matrices, obtained from an initial estimate of F such that the parameters are close to zero (see Listing 9.9).

9.8.3 Derivatives
To proceed to the solution with LM, we must compute the derivative of the Sampson residual:
(9.53)
with  and . The procedure is similar to the one followed for the homography, with the difference that in this case  is a scalar and  a row vector, which will make the expressions simpler.

We define and apply the rule for the


derivative of the product and the chain rule to obtain:
(9.54)
Since is a row vector, the expression for the
derivative of Y is simple:

(9.55)
The other two derivatives are:
(9.56)
(9.57)

To obtain the derivative of with respect to the parameters determining F, it is finally necessary to post-multiply (9.54) by the derivative of the parameterisation of F given by (9.52). It is easy to derive that:
(9.58)
with
(9.59)
(9.60)
(9.61)
Since , the derivative of D is
straightforward. Listings 9.8, 9.9 and 9.10 implement
the method.
Listing 9.8 Non-linear regression of F

Listing 9.9 Parameterisation of F

Preconditioning, in this case, does not serve to mitigate the ill-conditioning associated with the algebraic distance, since we are using the Sampson distance, but it does make the problem better conditioned numerically.
Listing 9.10 Sampson distance for F

9.9 Non-linear Regression of Relative Orientation
The regression of the fundamental matrix is easily adapted to the essential matrix: one only has to replace the pixel coordinates with the normalised image coordinates and F with E.

9.9.1 Parameterisation
The essential matrix has five degrees of freedom and is factored into . Therefore, it can be parameterised with three Euler angles for the rotation R and two angles and which identify as a point on the unit sphere of :
(9.62)
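As an illustration, a minimal MATLAB sketch of this five-parameter construction (it assumes the factorisation E = [t]x R with t a unit vector located by two spherical angles; the conventions and names are assumptions):

function E = E_from_params(p)
% Rebuild E from its 5 parameters (illustrative sketch only).
% p(1:3): Euler angles of R; p(4:5): angles locating the unit vector t.
Rx = @(a) [1 0 0; 0 cos(a) -sin(a); 0 sin(a) cos(a)];
Ry = @(a) [cos(a) 0 sin(a); 0 1 0; -sin(a) 0 cos(a)];
Rz = @(a) [cos(a) -sin(a) 0; sin(a) cos(a) 0; 0 0 1];
R  = Rz(p(3)) * Ry(p(2)) * Rx(p(1));
t  = [cos(p(4))*sin(p(5)); sin(p(4))*sin(p(5)); cos(p(5))]; % point on the unit sphere
skew = @(v) [0 -v(3) v(2); v(3) 0 -v(1); -v(2) v(1) 0];     % antisymmetric matrix
E  = skew(t) * R;
end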

9.9.2 Derivatives
In this section we only have to provide the derivative of the parameterisation, while the rest is as in the case of F (Sect. 9.8). Thanks to the rule for the derivative of the product:
(9.63)

Because the first term depends only on the first three parameters and the second term depends only on the last two, their respective derivatives have blocks of zeros in complementary positions, so the sum of the two terms gives rise to a partitioned matrix:
(9.64)
The derivative of the rotation was obtained in Sect. B.2, while as regards the right block we observe that, by the chain rule, the derivative is:
(9.65)
The derivative of the vector with respect to its parameters is computed easily:
(9.66)
as well as the Jacobian of the constructor of the
antisymmetric matrix (already encountered in the
regression of H).
Listing 9.11 reports the MATLAB implementation of the algorithm. Note also in this case the transformation that brings the Euler angles near zero (followed by the inverse transformation of the result). The parameterisation is implemented in Listing 9.12.
Listing 9.11 Non-linear regression of relative orientation

Listing 9.12 Parameterisation of E

In light of this parameterisation of E, this method can also be interpreted as the non-linear regression of the relative orientation.

In the functions for calibration, exterior orientation and relative orientation, we avoided the singularities of the Euler representation by means of a transformation of the points that leads to working close to the origin, whereas with F we adopted an incremental parameterisation. How come? They are essentially the same thing. The approach followed for F is more general, but the implementation is less clean because of the static part of the parameterisation that has to be carried along ( and in the specific case). Point transformation has a more straightforward implementation, but it is not always viable. It can be done when the transformation is rigid; for other transformations it would distort the noise statistics on the image.
9.10 Robust Regression
In conclusion, let us briefly touch upon the problem of
anomalous
samples, or outliers. The methods of least-squares regression,
either
linear or non-linear, assume that all point correspondences
are
inherently correct and that the measurements are affected
by a certain random error distributed as a Gaussian
around the true value. If even
one of these matches is wrong, the estimate can be
arbitrarily skewed, to the point of being useless. These
mismatches are outliers, for the
purposes of the regression problem.
In practice, we have to assume that data coming from
automatic
matching will be corrupted by outliers; therefore, a robust
regression method is needed to deal with these outliers.
This topic is dealt with in more detail in Sect. C.3.
Let us take as an example the regression of the fundamental matrix. This example is of considerable importance, for computing F serves to validate a set of
homologous points with respect to epipolar geometry
even in situations in which the fundamental matrix is not of
interest per
se.

A more stringent validation of matches can be obtained


by imposing the so-called trifocal constraint (Hartley
1997) on image triples.
Listing 9.13 Robust estimate of F

Listing 9.13 reports a function for the robust estimation of the fundamental matrix with three different methods described in Sect. C.3. In the case of Random Sample Consensus (RANSAC) (or M-estimator SAmple Consensus, MSAC), in each iteration a test fundamental matrix is computed from a random sample of seven matches (or eight if we use the linear algorithm). Consensus is calculated by counting how many of the remaining corresponding points reciprocally lie on the epipolar line determined by the test fundamental matrix, with a certain tolerance in Sampson distance. When finished, the solution with the greatest consensus is selected and matches outside the consensus set are discarded (Fig. 9.2).

Fig. 9.2 Scale Invariant Feature Transform (SIFT) point correspondences before (right) and after (left) validation of epipolar geometry through RANSAC
RANSAC can be applied likewise to homographies, as
reported in Listing 9.14.
Listing 9.14 Robust estimate of H

Note that in this context, methods that use the minimum


number of
points (seven for F, five for E, four for H, three for exterior
orientation) are important to reduce the computational
burden. Indeed, the number of iterations required to
achieve a confidence given the size of the test sample
and the fraction of outliers is given by:

(9.67)
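As an illustration, a minimal MATLAB sketch of the usual iteration-count computation (the formula below is the standard RANSAC one and is assumed to be the content of (9.67); the numerical values are arbitrary examples):

% Number of RANSAC trials needed to draw, with confidence conf, at least
% one sample free of outliers (illustrative sketch).
conf = 0.99;      % required confidence
w    = 0.6;       % assumed fraction of inliers
s    = 7;         % minimal sample size (seven matches for F)
N    = ceil(log(1 - conf) / log(1 - w^s));
fprintf('%d iterations\n', N);   % grows rapidly as s increases or w decreases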

While in the case of homographies the minimal solver is


linear, in the other cases it entails finding roots of
polynomials. Our implementation of the robust estimate of
F uses the linear algorithm for simplicity.
In summary, both errors and rogue measurements must be taken into account, using a robust procedure (e.g. RANSAC) to discard outliers, followed by a non-linear least-squares regression on the clean data to obtain the optimal estimate.
Finally, beware that the errors that inevitably affect the measures propagate to the regression results, which should therefore always be accompanied by an indication of the error, usually in the form of a standard deviation. Section C.4 deals with this topic.

Problems
9.1 Verify that the conic equation can be written as a quadratic form with the matrix in homogeneous coordinates.

9.2 Write a conic regression algorithm that minimises an algebraic distance.

9.3 Derive the Sampson distance for the conic and write an algorithm that minimises the latter using the Gauss-Newton method.

References
A. Bartoli and P. Sturm. Nonlinear estimation of the fundamental matrix with minimal parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):426–432, March 2004.
F. L. Bookstein. Fitting conic sections to scattered data. Computer Graphics and Image Processing, 9(1):56–71, 1979.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2nd edition, 2003.
R. I. Hartley. Lines and points in three views and the trifocal tensor. International Journal of Computer Vision, 22(2):125–140, 1997.
Q.-T. Luong and O. D. Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. International Journal of Computer Vision, 17:43–75, 1996.
P. D. Sampson. Fitting conic sections to "very scattered" data: An iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing, 18:97–108, 1982.
P. H. S. Torr and A. W. Fitzgibbon. Invariant fitting of two view geometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):648–650, May 2004.
Z. Zhang. Determining the epipolar geometry and its uncertainty: A review. International Journal of Computer Vision, 27(2):161–195, March/April 1998.

Footnotes
1 Set of zeros of a family of polynomials.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_10

10. Stereopsis: Geometry


Andrea Fusiello
(1) University of Udine, Udine, Italy
Email: [email protected]

10.1 Introduction
Traditionally, the term stereopsis (from sterèos, solid, and opsis, sight) denotes a human perceptual capacity; Poggio and Poggio (1984) give the following definition:

What is stereopsis? Because our two eyes are located


in different positions in the head, their views of a 3-D
scene are slightly
disparate. One can easily experience directly this
binocular
disparity by looking at objects not too distant and noting
their
different relative positions when closing each eye in turn.
The
disparity of each “point” depends on its distance from
the fixation point of the two eyes. Our brain is capable
of measuring this
disparity and using it to produce the sensation of depth
that is
the subjective estimate of relative distance. This is
stereopsis and its sole basis is the horizontal disparity
between the two retinal images.
Along the lines of this definition, we can say that
computational stereopsis is the process of obtaining
information about the solid structure of a scene from
two images taken from two different
viewpoints.
We have already addressed the problem of reconstruction
from two views in Chap. 8: given the homologous points of
the two images and the two PPMs, it is possible to
reconstruct by triangulation the position in
space of the tie points, thus obtaining a solid model
consisting of the tie points. In this chapter we tackle the
same problem but under a special epipolar geometry,
called the normal configuration. The substance of the problem
does not change, but the normal case allows for the
introduction of ad hoc techniques, which opens the way to
recovering a denser model, by ideally matching all the
pixels.
In the normal configuration the homologous points lie on
horizontal lines at the same height, and the
correspondences can be encoded by the binocular disparity,
that is, the difference of their respective horizontal
coordinates. Searching for homologous points— or
calculating disparity, in this context—is straightforward,
and triangulation also takes a special form (Sect. 10.2).
Given the advantages associated with the normal case, we
will introduce in Sect. 10.3 the epipolar rectification process
that
virtually brings the cameras into this configuration even
where they are not in reality.
We shall assume, for now, that the disparity is known,
and we will see in Chap. 12 how to compute it.

10.2 Triangulation in the Normal Case
In the so-called normal configuration we consider two
identical
cameras in which the second is related to the first by a
translation along the X-axis (conventionally, the horizontal
axis). It is easy to verify that, as a consequence, disparity is
purely horizontal and therefore the two-
dimensional construction of Fig. 10.1 is justified.
Fig. 10.1 Stereoscopic triangulation in the normal case. Equation (10.3) can also be derived immediately from the similarity of the triangle having base b and height Z and the one, split into two parts, having base and height
Let b be the distance between the COPs, called the baseline. Having arbitrarily fixed the world reference frame coincident with that of the first camera, and working in normalised image coordinates, we can write the following two PPMs:
(10.1)
From which we get:

( 10.2)

and after processing:
(10.3)
Equation (10.3) shows that it is possible to derive the third coordinate Z once the binocular disparity is known ( ). It can also be seen that the baseline b behaves like a scaling factor: the disparity associated with a fixed scene point depends directly on b. If the parameter b is unknown, it is possible to reconstruct the three-dimensional model up to a scaling factor.
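As a simple numerical illustration (a sketch only: in normalised coordinates the relation reduces to Z = b/d, and with pixel coordinates a focal length in pixels appears, as in (10.9); all values below are arbitrary assumptions):

% Depth from horizontal disparity in the normal case (illustrative sketch).
b = 0.12;              % assumed baseline [m]
f = 700;               % assumed focal length [pixels]
d = 35;                % disparity of a given pixel [pixels]
Z = b * f / d;         % depth of the corresponding scene point [m]
% Doubling the disparity halves the depth; Z scales linearly with b.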
Let us see how to translate what we just saw into matrix
form. If we add a third component equal to the disparity
to the image
coordinates, we can modify (10.1) into:

( 10.4)

This matrix Q differs from P by one row and can be


inverted to derive the 3D coordinates. In essence, if we
augment the image
coordinates with disparity, we can derive the 3D coordinates
with a
projective transformation. It is easily verified that the
expression for the Z component is (10.3).

This is closely reminiscent of the way projection is done in


computer graphics, where instead of a true projection,
which irreversibly erases the depth information, an
invertible transformation is applied in 3D space that
produces a vector whose first two components are the
image coordinates and the third is a monotonic function
of the depth (called pseudo-depth), which is thus preserved
for further processing.

If we want to make the situation more realistic, we need


to take into account the intrinsic parameters of the two
cameras. We relax the
assumption that the two cameras are identical and allow for
a different horizontal coordinate of the principal point,
and , respectively (the reason will be made clear in Sect.
10.3).
The disparity in image space (measured in pixels) is derived as:
(10.5)
then the matrix that accounts for (10.5) is the following:
(10.6)

Combining it with Q yields:
(10.7)
And finally, taking the inverse:
(10.8)
Considering the third component, we arrive at the expression:
(10.9)

Listing 10.1 implements the procedure just described,


and Fig. 10.2 shows a sample result.
Fig. 10.2 Disparity map and corresponding 3D model

Listing 10.1 Triangulation from disparity

10.3 Epipolar Rectification


The epipolar rectification consists in transforming a pair of
generic stereoscopic images into a pair in normal
configuration (Sect. 10.2)
while keeping the two COPs unchanged.
In this configuration, image planes are parallel to the
baseline and the epipolar lines are parallel and horizontal.
This is a particularly
favourable situation for calculating disparity, since the
homologous points lie on the same image line.
We will describe a method that applies to calibrated images (PPMs are known) introduced by Fusiello et al. (2000) and one that applies to uncalibrated images (PPMs unknown) proposed by Fusiello and Irsara (2011).

10.3.1 Calibrated Rectification
Let and be the two PPMs of the two cameras. The idea behind the rectification is to define two new PPMs and obtained by rotating the original cameras around their COPs until the focal planes become coplanar (and in doing so they will both contain the baseline). This implies that the epipoles are at infinity, so the epipolar lines
are parallel. To have horizontal epipolar lines, the baseline
must be parallel to the
new -axis of both cameras. In addition, we want to
ensure a stronger rectification property by requiring that
the homologous points have the same vertical coordinates. This
is achieved by forcing the two new
cameras to have the same intrinsic parameters.1 It can be
seen that since
the focal length is the same, the two rectified image
planes are coincident, as shown in Fig. 10.3.
Fig. 10.3 After rectification, the image planes are coincident and parallel to the
baseline
To summarise: the positions (i.e. the COPs) of the new
PPMs are the same as those of the old cameras, while the
new angular attitude (the same for both cameras) must
be defined ad hoc; the intrinsic
parameters are the same for both cameras. 1 Thus, the two
resulting
PPMs can be thought of as a single camera translated
along the -axis of its reference frame as in the normal
case.
Let us write the new PPMs in terms of the COPs and the
rotation:
( 10. 10)
The intrinsic parameters matrix is the same for both
PPMs and can be fixed arbitrarily. The COPs and are
given by the old COPs
computed with (3.29). The matrix , which determines the
camera
angular attitude, is the same for both PPMs. Let us write it
with its rows:
( 10. 11)

we have that are, respectively, the and axes


of the
camera reference frame, expressed in world coordinates. In
accordance
with the above, we determine by setting:
1. the new -axis parallel to the baseline:

2. the new -axis orthogonal to and an arbitrary versor


:

3. the new -axis orthogonal to (necessarily):

At point 2, fixes the position of the new -axis in the


plane orthogonal to (the vertical direction). We take it
equal to the versor of the old camera, thus forcing the
new -axis to be orthogonal to both the new
-axis and the old -axis.
This algorithm fails when the optical axis is parallel to
the baseline, that is, when there is a purely forward
motion.
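A minimal MATLAB sketch of this construction (a sketch only, not Listing 10.2; c1 and c2 denote the two COPs, and R1 the rotation of the first original camera, whose rows are its axes expressed in world coordinates, as noted above):

% Rotation shared by the two rectified cameras (illustrative sketch).
r1 = (c2 - c1) / norm(c2 - c1);   % new x-axis: parallel to the baseline
k  = R1(3,:)';                    % arbitrary versor: old z-axis of camera 1
r2 = cross(k, r1);                % new y-axis: orthogonal to x and to k
r2 = r2 / norm(r2);
r3 = cross(r1, r2);               % new z-axis (necessarily orthogonal to both)
Rn = [r1'; r2'; r3'];             % rows are the new axes in world coordinates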
The MATLAB implementation is given in Listing 10.2.
Listing 10.2 Calibrated epipolar rectification

To rectify the image produced by camera (e.g.), we must compute the transformation that brings the image plane of into the image plane of (for we proceed likewise). For any 3D point we can write:
(10.12)

In agreement with (3.32), the optical ray equations are as follows (since rectification does not displace the COP):
(10.13)
therefore:
(10.14)
The sought transformation is the homography defined by the matrix: . Recalling that and , we have that
(10.15)

So the homography that, applied to the original image,


produces the rectified image has the form of the
homography of the plane at infinity
associated with the rotation . The latter is precisely
the rotation that must be applied to the camera to bring it
into the normal
configuration.

To understand rectification it is useful to think of an image


as the
intersection of the image plane with the projective
bundle, consisting of the rays passing through the COP
and through each visible point in the scene (Fig. 10.4).
By rotating the image and holding the projective bundle
steady, we will get a new image of the same visible points.

Fig. 10.4 An image is the intersection of a plane with the projective bundle

The rectified images together with their respective PPMs


can be used in all circumstances in place of the
original images and PPMs.
In conclusion, let us address the problem of the size of
the rectified images. The pixel coordinates of the rectified
images can, in principle,
vary in regions of the plane very different from the window
occupied by the original image. In the example in Fig. 10.5
(top) the origin of the
coordinates has been kept in , and the size of the
images has been increased to fit the content. However, the
part corresponding to the
negative coordinates is cut off and there are empty areas on
the sides
that unnecessarily take up space. The first problem is solved
by applying a vertical translation (change of ) to the
coordinates of both images so that they become positive; the
second problem is addressed by applying two different
horizontal translations by changing independently in
the two images, since this do not affect the rectification.
The result is shown in Fig. 10.5 (bottom).
Fig. 10.5 Top: Rectified stereo pair in original coordinate space. Bottom:
Rectified stereo pair after centring; the two reference frames differ by a
horizontal translation of 994 pixels

10.3.2 Uncalibrated Rectification
The previous method relies on the knowledge of the PPMs, which is why it is called "calibrated". When this assumption is not verified, we can assume that only homologous points in the two images are available, which allows us to compute the fundamental matrix F.
It is easy to verify (Problem 10.3) that the fundamental
matrix of a pair of cameras in normal configuration is the
antisymmetric matrix associated with the versor
:

( 10. 16)

Let H and be the two unknown rectifying homographies.


When they are applied to the homologous points ,
respectively, the pair of transformed points must satisfy
the epipolar geometry of a camera pair in normal
configuration, that is:
( 10. 17)
This equation implies that in any factorisation of the
fundamental matrix F of the original camera pair of the
type
( 10. 18)
with and invertible matrices, and are
rectifying homographies. These are not unique, and there
are several ways to
constrain the problem, each leading to solutions with
different
characteristics (Hartley 1999; Isgrò and Trucco 1999; Wu and Yu 2005; Mallon and Whelan 2005), where the underlying goal is to limit the perspective distortion.
The method of Fusiello and Irsara (2011) is based on the idea of
approximating what happens in the calibrated case. Indeed,
we know
that rectification occurs by rotating the camera around its
COP, and thus, the induced homographies are (cf. (10.15)):
( 10. 19)
The rotations are the unknowns of the calibrated rectification, while in the uncalibrated case the intrinsic parameters are also part of the unknowns. We can reduce the number of unknowns by making some educated guesses about the intrinsic parameters: no skew, unit aspect ratio and principal point in the centre of the image. The only remaining parameter is the focal length in pixels , which we also assume to be identical in the two cameras. Therefore:

( 10.20)

where w and h are the dimensions (in pixels) of the image.


Regarding rotations, it should be noted that if is any
rotation
about the X-axis, we have:
( 10.21)
In fact, a rotation around the X-axis, which coincides with
the baseline, of both cameras does not alter the
rectification, but only the portion of the scene that is
embraced. Therefore, we reduce the unknowns by
fixing the angle of rotation around X in one of the two
cameras to zero.
The intrinsic parameters of the rectified images (
) can be
determined arbitrarily, as long as the vertical focal length
and the
vertical coordinate of the principal point are the same. In
fact, it is easily verified that
( 10.22)
as soon as the second (and third) rows of and are
equal. For this
reason, they are not included in the parameterisation of
homographies. To summarise, the method aims at
solving
( 10.23)
for the rotation matrices , and the focal
length in (10.20). In practice, a least-squares solution is
sought that minimises the squares of the Sampson residuals
of the original homologous points
with respect to the fundamental matrix F (see Sect. 9.8):
( 10.24)
As with the non-linear regression problems addressed in Chap. 9, we need to compute the derivative of the residuals and to hand the minimisation over to an iterative method such as Levenberg-Marquardt (LM). This derivative is obtained by applying the chain rule to the composition of the Sampson residual for F with the parameterisation given by (10.24); only the derivative of the latter needs to be computed here, while for the former we refer to Sect. 9.8:

( 10.25)

where contains the last two columns of , being . The three terms are:
(10.26)
(10.27)
(10.28)
and is computed element by element, since is easily written in closed form.
In the absence of any a priori estimates, it makes sense
to assign zero as the initial value of the rotation angles;
however, in this way the last
column of the Jacobian matrix (corresponding to ) cancels
out, making it singular (as noted by Monasse 2011). We can
solve the problem by
assigning non-zero values to the rotation angles (e.g. we
can always expect a small rotation around Y to
compensate for vergence) or by
performing two cascading minimisations, the first with
fixed and the second with all the unknowns.
The MATLAB implementation is given in Listing 10.3.

Listing 10.3 Uncalibrated epipolar rectification


Note that the focal length is parameterised with the field of view, which is an angle, like the other unknowns.

Problems

10.1 With reference to the triangulation in the normal case, how is the computed depth Z affected by an error on the disparity?

10.2 Referring to triangulation in the normal case, plot the iso-disparity surfaces in space.

10.3 Prove that a pair of cameras in normal configuration possesses the following fundamental matrix:

10.4 In light of the result of Problem 10.3, show that the epipolar rectification procedure illustrated in Sect. 10.3 is correct.

10.5 The rectification can be easily extended to the trifocal case (i.e. three images). In such a case the focal plane of the new PPMs is completely determined by the three COPs, and thus, the versor is the normal to this plane.

10.6 What about rectifying images?

References
A. Fusiello and L. Irsara. Quasi-Euclidean epipolar rectification of uncalibrated images. Machine Vision and Applications, 22(4):663–670, 2011.
A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1):16–22, 2000.
R. I. Hartley. Theory and practice of projective rectification. International Journal of Computer Vision, 35(2):1–16, November 1999.
Hsien-Huang P. Wu and Yu-Hua Yu. Projective rectification with reduced geometric distortion for stereo vision and stereoscopic video. Journal of Intelligent and Robotic Systems, 42(1):71–94, January 2005.
F. Isgrò and E. Trucco. Projective rectification without epipolar geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages I:94–99, Fort Collins, CO, June 23–25, 1999.
J. Mallon and P. F. Whelan. Projective rectification from the fundamental matrix. Image and Vision Computing, 23(7):643–650, 2005.
P. Monasse. Quasi-Euclidean epipolar rectification. Image Processing On Line, 1:187–199, 2011.
G. F. Poggio and T. Poggio. The analysis of stereopsis. Annual Review of Neuroscience, 7:379–412, 1984.

Footnotes
1 Actually the horizontal coordinate of the principal point, , can be
different in the two images.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_11

11. Feature Points
Andrea Fusiello
(1) University of Udine, Udine, Italy
Email: [email protected]

11.1 Introduction
In the discussion so far, we have always assumed that it
was possible to perform the preliminary computation of a
number of point
correspondences. In this chapter we will address the
practical problem of how to obtain such correspondences.
We begin by noting that not all points in an image are equally suitable for computing correspondences. The salient points or feature points are points belonging to a region of the image that differs from its neighbourhood and therefore can be detected repeatably and with positional accuracy.
The definition of salient point that we have given is necessarily vague, because it depends implicitly on the algorithm under consideration: a posteriori we could say that the salient points are those that the algorithm extracts.
In order to match such points in different images, we
need to
characterise them. Since the intensity of a single point is
poorly
discriminative, we typically abstract some property that is
a function of the pixel intensities of a surrounding region.
The vector that
summarises the local structure around the salient point is
called the
descriptor. Point matching thus reduces to descriptor
comparison. For this to be effective, it is important that
the descriptors remain invariant (to some extent) to
changes in viewpoint and illumination, and possibly even to


some degradation of the image.
We will illustrate in the following two different and
complementary approaches to detecting features. The first
consists of identifying image points that can best be
matched with others based on local window
autocorrelation analysis (Sect. 11.4). The second (Sect.
11.5) is based on scale invariant detection of high contrast
regions, a.k.a. blob. Points of
the former type will be matched simply by comparing pixel
values in a local window, while for the latter, a descriptor
invariant to geometric and radiometric transformations
will be computed. The second
approach is more complex but is better suited to images
taken under very different conditions.
For a complete critical review on detectors and
descriptors, see Mikolajczyk and Schmid (2005),
Mikolajczyk et al. (2005).

11.2 Filtering Images


The image I is a matrix of integers (we shall consider
monochromatic images, for simplicity). In this context we
treat it as a two-dimensional digital signal
which is a discrete representation of an
underlying analogue (continuous) signal . In
other words, we regard it as obtained by sampling and
quantisation of a surface
whose height is proportional to the grey value.
Analogous to what is commonly done for one-dimensional signals, it is possible to define a convolution operation for images. Linear filtering consists in the convolution of the image with a constant matrix called kernel or mask.
Let I be a image and let K be a kernel. The filtered version of I is given by the discrete convolution (denoted by ):
(11.1)

A linear filter replaces the value with a weighted


sum of
values of I itself in a neighbourhood of , where
the weights are the kernel entries.
Convolution enjoys the following properties:
Commutative:
Associative:
Linear:
Commutes with differentiation:
Note that due to the minus sign in the neighbourhood
indexing in I, the convolution coincides with the cross-
correlation defined as

( 11.2)

provided that the mask K is flipped both up-down and left-


right before combining with I. If the mask is symmetric the
convolution coincides with the correlation.
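In MATLAB, for instance, one can experiment with these definitions directly (an illustrative sketch; the test image and the kernel are arbitrary assumptions):

% Linear filtering as discrete convolution vs cross-correlation (sketch).
I = double(imread('cameraman.tif'));  % any greyscale test image
K = ones(3) / 9;                      % 3x3 box (average) kernel
J = conv2(I, K, 'same');              % convolution: K is flipped
C = filter2(K, I, 'same');            % cross-correlation: K is not flipped
% For a symmetric kernel such as this box filter, the two results coincide.
max(abs(J(:) - C(:)))                 % essentially zero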
It is worth noting that cross-correlation can be interpreted as a comparison between two signals, and its maximum indicates the position where the two match best. In other words, it is like searching within the image I for the position of a smaller image K, which in this context is called template, and the whole operation is referred to as template matching.
The effects of a linear filter on a signal can best be appreciated in the frequency domain. By the convolution

appreciated in the frequency domain. By the convolution


theorem, the Fourier transform of the convolution of I and K
is simply the product of their Fourier
transforms and . Therefore, the result of
convolving a signal with K is to attenuate or suppress the
frequencies of the signal
corresponding to small or zero values of . From
this point of view, the kernel is the (finite) impulse response of the filter.

11.2.1 Smoothing
If all entries of K are positive or zero, a smoothing effect is obtained from the convolution: linear filtering replaces the pixel value with a weighted average of its surroundings. The simplest kernel of this type is the box filter or average filter:
(11.3)
The practical effect of such a filter is to smooth out the
noise, since the mean intuitively tends to level out small
variations. Formally, we see that averaging noisy values
divides the standard deviation of the
noise by m. The obvious disadvantage is that smoothing
reduces the
sharp details of the image, thereby introducing blurring. The
size of the kernel controls the amount of blurring: a larger
kernel produces more blurring, resulting in greater loss of
detail.
In the frequency domain, we know that the Fourier transform of a 1D box signal is the sinc function: something similar holds in 2D (Fig. 11.1). Since the frequencies of the signal that fall within the main lobe are weighted more than the frequencies that fall in the secondary lobes, the average filter is a low-pass filter. However, it is a very poor low-pass filter, as the frequency cut-off is not sharp, due to the secondary lobes.

Fig. 11.1 Top: kernel (impulse response). Bottom: Fourier transform


We then consider the Gaussian smoothing filter, which
corresponds
to the function:

(11.4)
The Fourier transform of a Gaussian is still a Gaussian and therefore has no secondary lobes. The following mask corresponds to a Gaussian with :
(11.5)

Gaussian smoothing can be implemented efficiently


due to the fact that the kernel is separable, that is, that
where g denotes a 1D vertical Gaussian kernel.
Indeed, for example:

( 11.6)

In the case of two 1D kernels oriented in orthogonal directions, their convolution reduces to an outer product, and therefore, due to the associativity of the convolution, we have:
(11.7)
This means that convolving an image I with a 2D Gaussian
kernel G is the same as convolving first all rows and then all
columns with a 1D
Gaussian kernel. The advantage is that the time
complexity depends linearly on the mask size rather
than quadratically.
Another interesting property of Gaussian filtering is
that repeated convolution with a Gaussian kernel is
equivalent to convolution with a larger Gaussian kernel:
( 11.8)

To construct a discrete Gaussian mask one must sample a continuous Gaussian (due to the separability of the Gaussian kernel we need only consider 1D masks). The width of the mask and the variance of the Gaussian are not independent: once one is fixed, the other is determined. A rule of thumb is: .

11.2.1.1 Non-linear Filters
Gaussian smoothing does a good job in removing the noise, but since it cancels the high frequencies, it also smooths image content such as the sharp variations of intensity, called edges.
The bilateral filter is a non-linear smoothing filter, which
reduces
image noise but preserves edges. Like any linear filter, it
replaces the
intensity of each pixel with a weighted average of the
intensity values of neighbouring pixels. The difference is
that these weights depend not
only on the distance of the pixels from the centre of the
window, as in a Gaussian kernel, but also on differences in
intensities. For this reason, it is non-linear.
In flat regions, the pixel values in a small neighbourhood
are similar to each other and the bilateral filter basically
acts as a standard linear
filter. Let us now consider a sharp boundary between a dark
and a bright region. When the bilateral filter is centred, for
example, on a pixel on the bright side of the boundary, the
weights are high for pixels on the same side and low for
pixels on the dark side. Consequently, the filter replaces the
bright central pixel with an average of the bright pixels in its
neighbourhood, essentially ignoring the dark pixels.
Conversely, when the filter is centred on a dark pixel,
bright pixels are ignored. Thus, the bilateral filter
preserves sharp edges.
Another example of a non-linear filter is the median filter, in which the value of each pixel is replaced by the median of the values of its neighbours. Since the median cannot be expressed as a weighted sum, the filter is non-linear.

11.2.2 Derivation
Let be a differentiable function. Its gradient at a point is the vector whose components are the partial derivatives of f at :
(11.9)

The vector points in the direction of steepest ascent and


is
perpendicular to the tangent to the contour line.
The modulus of the gradient is related to
the slope at the
point and thus takes high values corresponding to steep
changes in the
value of the function. The phase of the gradient
represents its direction instead.
Returning now to images, we can think of using the
gradient to
detect edges or edge points, which are the pixels at—or
around—which the image intensity changes sharply (Fig.
11.2): edges are detected as maxima of the modulus of
the gradient (Fig. 11.3).

Fig. 11.2 Edges are points of sharp contrast in an image where the intensity of
the pixels
changes abruptly. The edge direction points to the direction in which intensity
is constant, while the normal to the edge points in the direction of the maximal
change in intensity
Fig. 11.3 Top: Original image and gradient magnitude. Bottom: Directional derivatives in u and v
There are several reasons for our interest in edges. The main one is that edges often (but not always) define the contours of solid objects.
To calculate the gradient we need the directional
derivatives. Since the image is a function assigned on a
discrete domain, we will need to consider its numerical
derivatives. Combining the truncated Taylor
expansions of and we get:

( 11. 10)
Considering the image I as the discrete representation of f, setting and neglecting the factor, we have that:
( 11. 11)
( 11. 12)
We immediately see that the numerical derivation of an
image is implemented as a linear filtering, namely a
convolution with the mask for and with its
transpose for .
The frequency interpretation of the convolution with a
derivative
mask is a high-pass filtering, which amplifies the noise.
The solution is to cut down the noise before deriving by
smoothing the image. Let D be a derivation mask and S a
smoothing mask. For the associativity of the convolution:
( 11. 13)
This means that one needs to filter the image only once
with a kernel given by .
In the case of two 1D kernels D and S oriented in
orthogonal
directions, their convolution reduces to an outer product and
thus
results in a separable kernel. For example, the Prewitt
operator is
obtained from the box filter convolved with the
derivative mask
:

( 11. 14)

Sobel’s operator, on the other hand, is a rudimentary


Gaussian filter convolved with :

( 11. 15)
A 2D Gaussian can be used for smoothing: we can either
filter with a Gaussian kernel and then calculate the
derivatives or utilise the
commutativity property between differential operators and
convolution and simply convolve with the derivative of the
Gaussian kernel:
( 11. 16)
In practice, convolution is carried out with the two partial
derivatives of the Gaussian (Fig. 11.4).
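As an illustration, a minimal MATLAB sketch of gradient computation with the Sobel masks (the test image and the display threshold are arbitrary assumptions):

% Image gradient with Sobel masks (illustrative sketch).
I  = double(imread('cameraman.tif'));
D  = [1 0 -1; 2 0 -2; 1 0 -1];      % Sobel mask (sign is immaterial here)
Iu = conv2(I, D,  'same');          % derivative along u
Iv = conv2(I, D', 'same');          % derivative along v
G  = sqrt(Iu.^2 + Iv.^2);           % gradient magnitude
P  = atan2(Iv, Iu);                 % gradient phase (direction)
imshow(G > 0.25 * max(G(:)));       % crude edge map by thresholding the magnitude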

Fig. 11.4 Partial derivatives of the Gaussian kernel

11.3 LoG Filtering


Abrupt changes in the grey level of the input image
correspond to the
extreme points (maxima and minima) of the first derivative.
If we
consider the second derivative instead, these same points
correspond to the zero crossings of the latter (Fig. 11.5).

Fig. 11.5 From left: signal (edge), first derivative, second derivative
To obtain the numerical approximation of the second
derivative, we proceed similarly to what we did for the first
derivative, obtaining:

( 11. 17)

and thus the corresponding convolution mask is , so the second partial derivatives are:
(11.18)
Summing them yields the Laplacian operator:
(11.19)
is a scalar and can be found using a single mask.
Since it is a
second-order differential operator, it is very sensitive to
noise, more
than the first derivative. Thus, it is always combined with a
smoothing operation, for example with a Gaussian kernel.
An input image is first convolved with a Gaussian kernel (at some scale ) to obtain a smoothed version of it:
(11.20)
then the Laplacian operator is applied:
(11.21)
Due to the commutativity between the differential
operators and the convolution, the Laplacian of the
smoothed image can be equivalently
computed as the convolution with the Laplacian of the
Gaussian kernel, hence the name Laplacian of Gaussian (LoG):

( 11.22)
In addition to identifying the edges as zero crossings
(Marr and
Hildreth 1980), observe that LoG filtering provides strong
positive
responses for dark blobs and strong negative responses for
bright blobs (Fig. 11.7). In fact, the LoG kernel is precisely
the template of the dark
blob on a light background (Fig. 11.6), and we already noted
that the
correlation can be read as an index of similarity with the
mask (a.k.a.
template matching). Moreover, we see that the response is
dependent on the ratio of the blob size to the size of the
Gaussian kernel and is
maximal for blobs of diameter close to . In summary,
extremal points of the LoG detect blobs at a scale given by
(Fig. 11.7).

Fig. 11.6 Two plots of the LoG filter kernel, with the classic “sombrero” shape

Fig. 11.7 Test image and response of the Gaussian Laplacian with and
. As can be seen, the peaks of the response correspond to dark blobs of a
certain size, which depends on the . Sunflowers photo by Todd Trapani on
Unsplash
LoG can be approximated by a difference of two
Gaussians (DoG) with different scales (Fig. 11.8). The
separability and cascadability of Gaussians applies to
DoG, so an efficient implementation is achieved:

( 11.23)
Fig. 11.8 Approximation of LoG as difference of Gaussians (DoG)
From a frequency point of view, the LoG operator is a
bandpass filter, and in fact, it responds well to spots
characterised by a spatial frequency in its passband, while
attenuating the rest. The approximation with DoG further
confirms this interpretation, as it shows that the LoG is the
difference of two low-pass filters.
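A minimal MATLAB sketch of LoG filtering and of its DoG approximation (it assumes the Image Processing Toolbox function fspecial and an arbitrary greyscale test image; all numerical choices are assumptions):

% LoG filtering and its DoG approximation (illustrative sketch).
I     = double(imread('cameraman.tif'));
sigma = 8;                                   % scale of the blobs of interest
hsize = 2 * ceil(4 * sigma) + 1;
LoG   = fspecial('log', hsize, sigma);       % Laplacian of Gaussian kernel
R1    = conv2(I, LoG, 'same');               % large |R1| at blobs; sign separates dark/bright
k     = 1.6;
DoG   = fspecial('gaussian', hsize, k*sigma) - fspecial('gaussian', hsize, sigma);
R2    = conv2(I, DoG, 'same');               % similar response, up to a scale factor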

11.4 Harris-Stephens Operator


This operator detects feature points with a discontinuity in
intensity in two directions, which are traditionally referred
to as “corners”, as it also responds to the junction of two
edges. This discontinuity can be
detected by measuring the amount of variation when a small
region
centred on the point is translated in its neighbourhood. If
the region can be translated in one direction without
significant variation, then there is an edge in that direction.
If, however, there is a significant variation in all directions,
then a “corner” is present.
Let us calculate the variation, in terms of sum of squared
differences (SSD), that is obtained by translating in the
direction a region
centred at the point of the image I:

(11.24)
By truncating the Taylor series we obtain:
(11.25)
where . The matrix is often referred to in the literature as the structure tensor.
So, moving a region centred on in the direction yields an SSD equal to:
(11.26)
Since encodes a direction, it can be assumed to have
unit norm, and thanks to Proposition A.8:
( 11.27)
where and are the minimum and maximum eigenvalues
of , respectively. Thus, if we consider all possible
directions , the maximum of the variation is , while the
minimum is . We can then classify the image structure
around each pixel by analysing the eigenvalues and .
The cases that can occur are the following:
flat (no structure):
edge: ,
corner: and both
The condition for having a corner point then
translates to where c is a predefined threshold.
Harris and Stephens (1988), however, do not explicitly
compute the eigenvalues but the quantity:
( 11.28)
and consider as corners the points where the value of r exceeds a certain threshold (the constant k is set to 0.04). The rationale for the above formula is that
(11.29)
and
(11.30)

The Harris-Stephens operator responds with positive


values at corners, negative values at edges and values
close to zero in uniform regions (Fig. 11.9).
Fig. 11.9 Response of the Harris-Stephens operator as a function of the two
eigenvalues of
, and . There are two curves of interest: the zero-level one separates
flat points from edges, while a threshold-level curve (4.0 in the figure) separates
flat points from those
considered as corners
Starting from the same considerations about eigenvalues,
Noble
(1988) proposes an operator also based on the trace and the
determinant of the matrix but without parameters. For
each point in the image, the following ratio is computed:
( 11.31)
which responds with high values at corners (Fig. 11.10).
Zuliani et al. (2004) provide an insightful comparison of
corner detectors.

Fig. 11.10 From left: test image, Harris-Stephens operator response (note that
it takes negative values), Noble operator response. The circles indicate the
corners detected after selecting local maxima larger than a threshold (4.0 for
HS and 1.0 for Noble). With these thresholds, low
contrast corners on the right side of the image are not detected

Observe that the matrix can be defined


alternatively (up to a constant) as:

( 11.32)

where is a box filter implementing the sum over . In


practice, the box is replaced by a Gaussian smoothing
kernel whose standard
deviation determines the spatial scale at which the corners
are detected.
In the Harris-Stephens operator there are two scales: the
integration scale and the derivation scale (Zuliani et al.
2004). The former is the standard deviation of the
Gaussian kernel convolved with the images and
is the one that most affects the scale at which the
corner points are detected. The derivation scale, on the
other hand, is the standard deviation of the Gaussian
with which the image is
filtered before calculating the derivatives, which in our
implementation is fixed by the Sobel window. In
order to deal with the scale in a more principled way,
one should consider the
latter as well and vary it along with the former, typically
setting them to be equal. See, for example, (Gueguen and
Pesaresi 2011).

Listing 11.1 reports the implementation of the Harris-


Stephens
operator. To obtain a detector, we need to select the
points where the operator (i) has a value above a certain
threshold and (ii) is a local
maximum, that is, in its surroundings there are no points
with a larger
response.
Figure 11.10 shows an example of corner detection.
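For illustration, a minimal MATLAB sketch of the Harris-Stephens response and of the corner selection (it is not the book's Listing 11.1; the Sobel derivatives, the Gaussian integration window, k = 0.04 and the threshold are assumed choices, and fspecial and imregionalmax belong to the Image Processing Toolbox):

% Harris-Stephens corner response (illustrative sketch).
I  = double(imread('cameraman.tif'));
D  = [1 0 -1; 2 0 -2; 1 0 -1];             % Sobel derivative mask
Iu = conv2(I, D,  'same');
Iv = conv2(I, D', 'same');
g  = fspecial('gaussian', 9, 1.5);         % integration (smoothing) window
A  = conv2(Iu.^2,  g, 'same');             % smoothed entries of the
B  = conv2(Iv.^2,  g, 'same');             % structure tensor
C  = conv2(Iu.*Iv, g, 'same');
k  = 0.04;
R  = (A.*B - C.^2) - k * (A + B).^2;       % determinant minus k times squared trace
corners = imregionalmax(R) & (R > 0.01 * max(R(:)));  % threshold + local maxima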

Listing 11.1 Harris-Stephens operator


11.4.1 Matching and Tracking
Once corners have been detected as described in the
previous section, they can be matched, typically taking
into account the proximity
(position) and/or the similarity of a small region
centred on the corner itself, similarly to stereo
matching (see Chap. 12).
When a video sequence needs to be processed, the matching is repeated in each pair of consecutive frames; this is referred to as feature tracking. This process can be difficult, as features may vanish or reappear because of occlusions or detector errors.

11.4.2 Kanade-Lucas-Tomasi Algorithm
The tracking problem for small displacements can be characterised as follows: given the position of a feature point in a frame at time t, we want to find the position of the same feature point in the frame at time .
Assuming that the content of a window of pixels
around the feature point remains unchanged, Tomasi and
Kanade (1991)
formalise the problem as the least-squares solution of
the following system of non-linear equations in the
unknown speed :
(11.33)
In other words, it is a matter of finding the translation
that
produces the best alignment of the window centred on the
feature point between the frame at time t and the frame at
time .
A non-linear least-squares problem can be solved by the
Gauss-
Newton method, which involves calculating the Jacobian of
the system:
(11.34)
In our case
(11.35)
where represents the spatial gradient of the image.
The solution according to the Gauss-Newton method
(See Appendix C.2.2) proceeds iteratively by computing at
each step an increment
as the solution of the normal equations:

(11.36)
with
(11.37)

In the iterations after the first one, it is necessary to


compute
, which will not, in general, be found on
the pixel grid, and therefore, an interpolation is needed.
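A minimal MATLAB sketch of one Gauss-Newton update of the translational KLT tracker (purely illustrative; the use of the gradient of the first frame, the square window and the interpolation details are simplifying assumptions):

% One KLT update for a translational model (illustrative sketch).
% I0, I1: consecutive frames (double); p = [u; v]: feature position in I0;
% d = [du; dv]: current displacement estimate; r: window radius.
[gu, gv] = gradient(I0);                       % spatial gradient of the image
[X, Y] = meshgrid(p(1)-r : p(1)+r, p(2)-r : p(2)+r);
Gu = interp2(gu, X, Y);   Gv = interp2(gv, X, Y);
W0 = interp2(I0, X, Y);                        % window in the first frame
W1 = interp2(I1, X + d(1), Y + d(2));          % displaced window in the second frame
S  = [sum(Gu(:).^2)     sum(Gu(:).*Gv(:));     % matrix of the normal equations
      sum(Gu(:).*Gv(:)) sum(Gv(:).^2)];
e  = [sum(Gu(:).*(W0(:) - W1(:)));             % right-hand side
      sum(Gv(:).*(W0(:) - W1(:)))];
d  = d + S \ e;                                % updated displacement estimate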
Note that the points suitable to be tracked are those
that guarantee good conditioning of the system of normal
equations, namely those for which the eigenvalues and
of are both far from zero and not too different:
where c is a fixed threshold. Since the
matrix is essentially identical to the matrix S of the
Harris-Stephens method, these points coincide with the
corners detected by Harris-
Stephens.
The KLT algorithm, proposed by Tomasi and Kanade
(1991) and
based on previous work by Lucas and Kanade (1981), is
extended by Shi and Tomasi (1994) with an affine model for
window deformation
(instead of just translational). In fact, one can complicate
the motion model in (11.33) at will, as long as one is
able to compute its Jacobian.

11.4.3 Predictive Tracking


The KLT method illustrated in the previous section
assumed a small displacement between two consecutive
images. When this is instead significant, one can
compensate the motion by exploiting the spatio-
temporal coherence to predict where the points should be
in the next frame.
The classical scheme is based on the iteration of three
steps:
extraction, prediction and association, and typically includes
a Kalman filter, which uses a model of motion to predict
where the point will be at the next time instant, given its
previous positions. In addition, the filter maintains an
estimate of the uncertainty in the prediction, thus allowing a
search window to be defined. Features are detected within
the search window, one of them is associated with the
tracked point, and the filter
updates the position estimate using the position of the
found point and its uncertainty.
The association operation may present ambiguities (Fig.
11.11).
When more than one point is found in the search window,
more
sophisticated filters are needed, like the Joint Probabilistic Data
Association Filter. See (Bar-Shalom and Fortmann 1988) for a
discussion of this problem.

Fig. 11.11 Predictive tracking (a) and data association of single-track (b) and
multiple-tracks (c)

11.5 Scale Invariant Feature Transform
The Scale Invariant Feature Transform (SIFT) operator introduced by Lowe (2004) detects blob-like feature points characterised by high contrast via an approximation of the LoG filter (Laplacian of Gaussian) applied to a scale-space representation of the image.
11.5.1 Scale-Space
Recall that both the LoG and the HS operator detect
feature points
(of different nature) at a given scale. The choice of scale is not
secondary, and although initially overlooked, it must be
addressed at some point. In fact, real scenes are composed
of different structures that possess
different intrinsic scales. Moreover, the projected size of
objects varies due to the distance from the camera. This
implies that a real object may appear in different ways
depending on contingent factors. Since there is no way to
know a priori which scale is appropriate to describe each of
the interesting structures in the images, one approach is
to embrace them all in a multiscale description.
For this purpose, we introduce the notion of scale-space.
The image is represented as a family of smoothed versions,
parameterised by the size of the smoothing kernel. The
parameter is referred to as scale and together with the
two spatial variables locates a point
in scale-space.
We will use the linear scale-space, which is obtained with
a Gaussian kernel. The choice is not arbitrary but derives as
a necessary
consequence from the formalisation of the criterion that
filtering should not create new spurious structures when
moving from a fine to a
coarser scale. The reader can refer to Lindeberg
(2012) for further details.
For a given image , its linear scale-space
representation is the family of images defined by
the convolution of with a Gaussian kernel of
variance :
( 11.38)
where denotes the convolution and

( 11.39)

As the variance of the Gaussian kernel increases, there is


an
increasing removal of image detail, for L results from the
convolution of I with a low-pass kernel of increasing spatial
support (Fig. 11.12). In
particular, blobs that are significantly smaller than are
faded out in
.
Fig. 11.12 Scale-space for an image of a field of sunflowers. The first one in
the upper left is the original. The others correspond to increasing variances of
the Gaussian kernel, in geometric
progression (standard deviation doubles every three images). Sunflowers photo
by Todd Trapani on Unsplash
This framework will enable us to define operations
(such as feature detection) that are scale invariant, in the
next section.
In (11.38) we defined the Gaussian scale-space with
reference to the original image. A more precise way to
define it is to refer to an ideal image of infinite
resolution, which we call . We then define the
scale-space as the collection of filtered images:

However, is not accessible, and we must therefore


anchor the scale-space to the input image. The latter
is conventionally
considered as the result of filtering with a kernel at
to
account for the finite pixel size:
So,
in practice, the scale-space is calculated with:
11.5.2 SIFT Detector
As we have seen, the Laplacian combined with Gaussian
filtering detects feature points (or rather blobs) at the scale
set by the Gaussian kernel.
In order to obtain a multiscale blob detector, Lindeberg (1998) proposed to apply the (normalised) Laplacian to the entire linear scale-space:
(11.40)
and to look for its extremal points, that is, local maxima or minima (in scale-space).
Normalisation is an essential detail when dealing with
scale-space analysis. Going from finer to coarser scales,
the image will become
increasingly blurred, thus resulting in the decrease of the
amplitude of the image derivatives. Without normalisation,
the maximum amplitude will always be found at the finest
scale and the minimum at the coarsest scale. As
demonstrated by Lindeberg (1998), the correct scaling
factor for this is .
In the same paper it is shown that this method of detecting feature points leads to scale invariance, in the sense that under scale transformations the feature points are preserved and the scale of a point of interest transforms accordingly.
The SIFT detector proposed by Lowe (2004) works
similarly to the method of Lindeberg (1998), of which it
can be seen as a
computationally efficient variant, due to the replacement of
the
normalised Laplacian scale-space with a pyramid of
differences of Gaussian (DoG):
( 11.41)
where the scale levels follow a geometric progression
with a fixed k and a given . Indeed, it can be
seen that:
( 11.42)
The constant k does not affect the location of the extremal points, and it is typically chosen as so that the scale doubles every n levels (interval which is denoted as octave).
The Gaussian pyramid is constructed by repeatedly applying Gaussian filtering and subsampling at each octave change, so that in the end we obtain a stepped pyramid in which the size changes at each
octave. Differences (DoGs) are computed between adjacent
levels within the same octave, and then the extremal points
in space and scale in the DoG pyramid are detected
through comparisons with neighbours in a
small window. Only extremal points that have a response
above a given threshold (in absolute value) are considered.
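As an illustration, a minimal MATLAB sketch of one octave of the Gaussian/DoG pyramid (it uses the values n = 3 and sigma_0 = 1.6 quoted below and the Image Processing Toolbox function imgaussfilt; the subsampling between octaves is omitted):

% One octave of a SIFT-like Gaussian/DoG pyramid (illustrative sketch).
I      = double(imread('cameraman.tif'));
n      = 3;                    % levels per octave
k      = 2^(1/n);              % scale ratio between adjacent levels
sigma0 = 1.6;
L = cell(1, n+3);              % n+3 Gaussian images per octave
for i = 1:n+3
    L{i} = imgaussfilt(I, sigma0 * k^(i-1));
end
D = cell(1, n+2);              % n+2 DoG images per octave
for i = 1:n+2
    D{i} = L{i+1} - L{i};      % difference of adjacent Gaussian levels
end
% Extrema are then searched in 3x3x3 neighbourhoods of D{2}, ..., D{n+1}.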
SIFT employs levels per octave, so , and
the value of is set to 1.6. It uses a window,
which requires a DoG
overlay image with the previous and the next octave, for a
total of DoG images per octave. The resulting pyramid
is shown in Fig. 11.13.
Fig. 11.13 SIFT Gaussian pyramid with three levels per octave. Actually, in
order to search for
extremal points on three levels of DoG for each octave, there must be five
DoGs (right pyramid) per octave (to always have one above and one below)
and consequently, there must be six
filtered images (left pyramid) per octave. Note that the last three images of an
octave correspond to the first three of the upper octave (subsampled). DoG
images are equalised to improve
intelligibility. Images courtesy of R. Toldo

Where does the original image I fit in the Gaussian pyramid of SIFT? The first image at the bottom of the pyramid (index −1) is not I, but instead corresponds to filtering with a standard deviation Gaussian . According to the box at page 157, the image at level −1 is obtained from I after filtering with a Gaussian with standard deviation . With the values used by SIFT, this makes
The DoG (like the LoG) also has opposite maxima and
minima near the edges (where it vanishes), but these are
of little use for
correspondences and should be eliminated. We therefore
want to select the extremal points of the DoG that exhibit
strong localisation, that is,
such that the two principal curvatures1 are both large. This
criterion can be expressed in terms of the eigenvalues of the
Hessian matrix of the
DoG image (denoted D):

( 11.43)

calculated at the location and scale of the feature point.


In particular, the two principal curvatures are the
eigenvalues of , which are therefore both required to be
large. As with the Harris-
Stephens operator, this criterion can be reformulated in
terms of the trace and determinant of the matrix to
make the calculation more efficient:

( 11.44)

where denotes the allowed upper bound on the ratio


of maximum to minimum eigenvalue (SIFT uses ).
Finally, in order to increase the localisation accuracy
in space and scale, a second-degree polynomial is fitted
to the values in the
neighbourhood of the extremal point and its maximum is
taken, which in general will correspond to a sub-pixel (and
sub-scale) position.

The relationship between Hessian matrix and structure tensor. The Hessian matrix of image I at a given point is

H = [ I_uu  I_uv ; I_uv  I_vv ]

with tr(H) = I_uu + I_vv = ∇²I being the Laplacian. The Hessian matrix contains information on the curvature of the function. The two main curvatures (maximum and minimum) are the two eigenvalues of H, and since a feature point corresponds to a peak of the function, both should be large. There are several indicators of this condition; one of these is the Gaussian curvature K = λ_1 λ_2 = det(H). The product is large when both factors are large; therefore, we are interested in maxima of K. The criterion is reminiscent of the Harris and Stephens operator, which however applies to the structure tensor S, which is not the same as H. In fact, barring the convolution with the low-pass kernel, the structure tensor (or second-moment matrix) writes

S = [ I_u²  I_u I_v ; I_u I_v  I_v² ]

The entries of S are products of first-order derivatives, whereas the entries of H are second-order derivatives, so they are different. Nevertheless, a relationship exists, for one can regard S as an approximation of H in the same way as JᵀJ approximates the Hessian in the derivation of the Gauss-Newton algorithm.

11.5.3 SIFT Descriptor


Each SIFT feature point is associated with a descriptor,
which essentially
reduces to a histogram of the gradient direction in a
neighbourhood. In principle, to achieve scale invariance,
the neighbourhood size must be appropriately normalised
by making it scale dependent. Equivalently, SIFT
considers a window of fixed size in the level of the
Gaussian pyramid where the point was detected.
To achieve invariance to rotation, a dominant direction of
the
gradients in the window is estimated. Specifically,
a histogram of the direction of the gradients discretised into 36 bins is constructed, where votes are weighted by the magnitude of the gradient and by a
Gaussian centred on the window. The maximum of the
histogram
identifies the dominant direction. This value is refined by
parabolic
interpolation with the two neighbouring bins to improve the
10-degree
resolution. Multiple dominant directions are accepted within
80% of the maximum. In this case one descriptor is
produced for each direction.
To construct the descriptor (Fig. 11.14) the window is virtually rotated to align it with the dominant direction, by adding a constant angle to the gradients. In each of the 4 × 4 sub-quadrants of the window, an eight-bin histogram of the gradient
directions is computed, weighting the votes by the
magnitude of the gradient itself. The Gaussian-weighted
gradients used in the computation of the dominant direction
are
considered also in this step, to gradually reduce their
contribution as they move away from the centre of the
window.

Fig. 11.14 SIFT descriptor. In each quadrant of the window, an eight-bin histogram of the gradient directions is computed

The resulting 128 non-negative values are collected into a


vector that is normalised (to unit length) to make the
descriptor invariant to
changes in contrast, while additive variations in grey levels
are already removed by the gradient. Finally, the values above 0.2 are clipped to 0.2 and the vector is normalised again to obtain the SIFT descriptor.

11.5.4 Matching

Descriptors associated with feature points in two images are matched using a simple closest neighbour strategy: for each descriptor in one image, we search for the
descriptor in the other image that minimises the
Euclidean distance in the 128-dimensional space.
To suppress matches that can be considered ambiguous, pairs for which the ratio of the distance of the
nearest to the second-nearest
descriptor is above a certain threshold (typically between
0.6 and 0.8) are rejected. Figure 11.15 shows an example
of SIFT point matching.
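A minimal sketch of this matching strategy is given below, assuming descriptors stored as rows of D1 and D2; the function name and the specific threshold passed in are illustrative assumptions.

% Illustrative sketch: nearest-neighbour matching with the distance-ratio test.
% D1, D2: matrices with one 128-dimensional descriptor per row.
function matches = match_ratio(D1, D2, ratio)
    matches = zeros(0, 2);
    for i = 1:size(D1, 1)
        d = sqrt(sum((D2 - D1(i, :)).^2, 2));    % Euclidean distances to all of D2
        [ds, idx] = sort(d);
        if ds(1) / ds(2) < ratio                 % unambiguous: well below 2nd best
            matches(end+1, :) = [i, idx(1)];     %#ok<AGROW>
        end
    end
end

For instance, match_ratio(D1, D2, 0.8) keeps only pairs whose nearest neighbour is clearly better than the second nearest.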
Fig. 11.15 Top left: test image with a subset of 50 SIFT points highlighted (the
radius of the
circle indicates scale while the segment represents the orientation). Top right:
some descriptors are represented as grids. Bottom: a subset of matches

The images in this chapter were created using the SIFT


implementation by A. Vedaldi (http://www.vlfeat.org). See also https://www.vlfeat.org/api/sift.html for detailed documentation on the implementation.

References
Y. Bar-Shalom and T. E. Fortmann. Tracking and Data Association. Academic Press, Waltham, MA, 1988.
Lionel Gueguen and Martino Pesaresi. Multiscale Harris corner detector based on differential morphological decomposition. Pattern Recognition Letters, 32: 1714–1719, 2011. https://doi.org/10.1016/j.patrec.2011.07.021.
C. Harris and M. Stephens. A combined corner and edge detector. Proceedings of the 4th Alvey Vision Conference, pages 189–192, August 1988.
T. Lindeberg. Scale invariant feature transform. Scholarpedia, 7 (5): 10491, 2012.
Tony Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30: 79–116, 1998.
David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60 (2): 91–110, 2004.
B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 674–679, 1981.
D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London, Series B, 207: 187–217, February 1980.
Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27 (10): 1615–1630, 2005.
Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65 (1/2): 43–72, 2005.
J. A. Noble. Finding corners. Image and Vision Computing, 6: 121–128, May 1988.
J. Shi and C. Tomasi. Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, June 1994.
C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, Pittsburgh, PA, April 1991.
M. Zuliani, C. Kenney, and B. S. Manjunath. A mathematical comparison of point detectors. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 172–172, 2004. https://doi.org/10.1109/CVPR.2004.282.

Footnotes
1 Maximum and minimum curvature of a curve contained in the surface and
passing through the point.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_12

12. Stereopsis: Matching


Andrea Fusiello
( 1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

12.1 Introduction
Having already discussed the geometric aspect of
stereopsis in Chap. 10, we will focus here on stereo matching,
an important technique used in computer vision for finding
correspondences between two images of a scene taken
from different perspectives. The goal is to establish a one-
to-one correspondence between pixels in the two images,
which can be customarily represented as a disparity map
(Fig. 12.1).

Fig. 12.1 Pair of stereo images (rectified) and ideal disparity map (images taken from http://vision.middlebury.edu/stereo/)

For the reader’s convenience, we are providing here


some definitions given in the previous chapters. A homologous
pair consists of two points in two different images that are
projections of the same object point.
The binocular disparity is the (vector) difference between two
homologous points, assuming that the two images are
superimposed.

12.2 Constraints and Ambiguities


The matching of homologous points is made possible by the
assumption that, locally, the two images differ only slightly,
so that a detail of the
scene appears similar in the two images (Sect. 12.1).
However, based on this similarity constraint alone, many
false matches are possible, thus necessitating additional
constraints to avoid them.
The most important of these is the epipolar constraint (Chap.
6),
which states that the homologous of a point in one image is
found on the epipolar line in the other image. Thanks to this,
the search for
correspondences becomes one-dimensional,
instead of two- dimensional.
We assume, without loss of generality, that the images
are rectified (Sect. 10.3), meaning that the epipolar lines
are parallel and horizontal in both images. This allows us
to search for homologous points along
horizontal scan lines of the same height. The disparity is
then reduced to a scalar value, which can be easily
represented in a disparity map (see
Fig. 12.1). In each pixel of the map, the horizontal
translation between the same point of the reference image
and its homologous point is
recorded.
Consider the example in Fig. 12.2: moving from image 1
(right
camera) to image 2 (left camera), the horizontal coordinates
of the
points in image 2 are greater than those of the
corresponding points in image 1. It follows that the disparity, defined as d = u_2 − u_1, is positive in this case. When
computing disparity we will always assume that it is
positive, so we will have to identify images 1 and 2
so that this assumption is verified.
Fig. 12.2 Cameras in normal configuration and corresponding rectified
images
In addition to the similarity constraint and the epipolar
constraint, there are other constraints that can be
exploited in the calculation of correspondences:
smoothness: the scene is composed of regular surfaces, so
the
disparity is almost everywhere smooth (transitions
between different objects are excluded);
uniqueness: a point in one image can be matched with only
one point in the other image, and vice versa (fails with
transparent objects or
occlusions);
ordering: if the point m in one image matches m′ in the other, the homologous of a point lying to the right (left) of m must lie to the right (left) of m′. It fails for
points that lie in the “forbidden zone” of a given point
(corresponding to the grey area in Fig. 12.3). Normally, for
an opaque, cohesive surface, points in the forbidden zone
are not
visible, so the constraint holds.
Fig. 12.3 Point B violates the ordering constraint with respect to C, while point A fulfils it. The grey cone is the forbidden zone of C
The constraints just listed, however, are not sufficient to
make the
matching unambiguous. Even if they are fulfilled, a point in
one image
can be matched with many points in the other image: this is
the problem of false matches. In addition to this, there are
other problems that plague the calculation of
correspondences, due to the fact that the scene is
framed from two different points of view:

occlusions: due to discontinuities in the surfaces, there are


parts of the scene that appear in only one of the images,
that is, there are points in one image that do not have a
corresponding in the other image. Clearly no disparity can
be defined for such points;
non-Lambertian surfaces: due to surfaces violating the Lambertian assumption (see Chap. 2), the intensity observed by the two cameras (the radiance) is different for the same point in the scene;
perspective distortion: perspective projections of geometric
shapes are different in the two images.

All these problems are exacerbated the farther apart the cameras are. On the other hand, in order to have a
meaningful disparity, the cameras must be well separated
from each other.
All matching methods attempt to pair pixels in one image
with pixels in the other image by taking advantage of the
constraints listed above
and trying to work around the problems just mentioned. Local methods impose constraints on a small neighbourhood of the pixel being matched, while global methods impose constraints on the entire image.

12.3 Local Methods


The methods of this class, also called block matching or correlation-based, compute an aggregate matching cost on a local support, called window, and determine for each pixel the disparity that yields the lowest cost. This cost is a measure of difference or dissimilarity between the window and the underlying image.
In more detail, a small area of one image is considered,
and the most similar area in the other image is identified by
minimising a matching cost that depends on the grey
levels, or a function of them (see Fig.
12.4). This process is repeated for each point,
resulting in a dense disparity map.
Fig. 12.4 Illustration of the block matching method. A rectangular window is
(ideally)
cropped in the reference image and slid over the other image along the scan
line until it finds the translation value d that yields the maximum similarity
between the window and the underlying image
In formulae, for each pixel (u, v) in image I_1, let us consider a window centred in (u, v) of a given (odd) size. This is compared with a window of the same size in I_2 moving along the epipolar line corresponding to (u, v); since the images are rectified, we consider the positions (u + d, v), d ∈ [d_min, d_max]. Let c(u, v, d) be the value of the resulting matching cost; the computed disparity for (u, v) is the displacement that corresponds to the minimum cost c:

d(u, v) = arg min_{d ∈ [d_min, d_max]} c(u, v, d)    (12.1)

12.3.1 Matching Cost


The matching costs we will analyse in this section
fall into three categories:
based on correlation (NCC, ZNCC);
based on intensity differences (SSD, SAD);
based on transformations of intensities (e.g. census
transform).
One of the most common, especially in photogrammetry,
is
Normalised Cross Correlation (NCC). It can be seen as the
scalar product of the two vectorised windows divided by the
product of their respective norms:
NCC(u, v, d) = Σ_{(i,j)} I_1(u+i, v+j) I_2(u+d+i, v+j) / sqrt( Σ_{(i,j)} I_1(u+i, v+j)² · Σ_{(i,j)} I_2(u+d+i, v+j)² )    (12.2)

where I(u, v) denotes the grey level of the pixel (u, v) and the sums range over the offsets (i, j) spanning the window. Be aware that NCC is not actually a cost, being instead a similarity measure between 0 and 1. To convert it to a cost, just take 1 − NCC. The MATLAB implementation is given in Listing 12.1.
To obtain invariance to brightness changes between the
two images (of additive type), one can subtract from each
window its average,
obtaining the Zero-mean NCC or ZNCC.

Listing 12. 1 Stereo with NCC
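As a rough illustration of this kind of block matching (a minimal sketch, not the book's listing), assuming rectified greyscale images I1 and I2 stored as double matrices, a maximum disparity dmax and a window radius N:

% Illustrative sketch: NCC block matching with winner-takes-all selection.
function dmap = ncc_stereo(I1, I2, dmax, N)
    [rows, cols] = size(I1);
    box  = ones(2*N + 1);                     % summation kernel over the window
    dmap = zeros(rows, cols);
    best = -inf(rows, cols);                  % best NCC value found so far
    E1   = conv2(I1.^2, box, 'same');         % windowed energy of I1
    E2   = conv2(I2.^2, box, 'same');         % windowed energy of I2
    for d = 0:dmax
        I2s = [I2(:, 1+d:end), zeros(rows, d)];   % I2 shifted left by d
        E2s = [E2(:, 1+d:end),  inf(rows, d)];    % shifted energy (Inf at border)
        num = conv2(I1 .* I2s, box, 'same');      % windowed cross term
        ncc = num ./ max(sqrt(E1 .* E2s), eps);
        upd = ncc > best;                     % winner-takes-all update
        dmap(upd) = d;
        best(upd) = ncc(upd);
    end
end

A real implementation would restrict the computation to valid windows; the sketch only conveys the structure of (12.2) combined with a winner-takes-all selection.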

Another cost, popular especially in computer vision, is


the Sum of Squared Difference (SSD):

SSD(u, v, d) = Σ_{(i,j)} ( I_1(u+i, v+j) − I_2(u+d+i, v+j) )²    (12.3)

It can be seen as the squared norm of the difference of the vectorised
windows.
The smaller the value of (12.3), the more similar the
portions of the images considered. The MATLAB
implementation is given in Listing 12.2.
Listing 12.2 Stereo with SSD

Similar to SSD is Sum of Absolute Difference (SAD), where


the square is replaced by the absolute value. In this way
the cost is less sensitive to
impulsive noise: for example, two windows that are equal in
all but one pixel are more similar according to SAD than
according to SSD, since the square weighs much more the
differences than the absolute value:

SAD(u, v, d) = Σ_{(i,j)} | I_1(u+i, v+j) − I_2(u+d+i, v+j) |    (12.4)
Following the same line of reasoning, one could replace the
absolute
value with a more robust penalty function, such as an M-estimator (Sect. C.1). This class includes, for example, the truncated cost functions proposed by Scharstein and Szeliski (2002).

12.3.2 Census Transform

We now look at a more sophisticated technique, in which first a transformation based on local sorting of grey levels is applied to the images and then the similarity of the windows on the transformed images is measured.
The census transform (Zabih and Woodfill 1994) is based on
the
comparison of intensities. Let I(p) and I(q) be the intensity values of pixels p and q, respectively. If we denote the concatenation of bits by the symbol ⊗, the census transform for a pixel p in the image I is the bit string:

C(p) = ⊗_{q ∈ W_r(p)} [ I(q) < I(p) ]    (12.5)

where W_r(p) denotes a window centred in p of radius r and [ · ] are the Iverson brackets.
The census transform summarises the local spatial
structure. In fact, it associates with a window a bit string
that encodes its intensity in
relation to the central pixel, as exemplified in Fig. 12.5.

Fig. 12.5 Example of census transform with

The matching takes place between windows of the


transformed
images, comparing strings of bits. By denoting with the symbol H( · , · ) the Hamming distance between two bit strings, that is, the number of bits in which they differ, the SCH (Sum of Census Hamming Distances) matching cost is written:

SCH(u, v, d) = Σ_{(i,j)} H( C_1(u+i, v+j), C_2(u+d+i, v+j) )    (12.6)

Each term of the summation is the number of pixels
inside the
transformation window , whose relative order (i.e.
having
higher or lower intensity) with respect to the considered
pixel changes from to .
This method is invariant to any monotonic transformation
of
intensities, whether linear or not. In addition, this method
is tolerant to errors due to occlusions. In fact, the study of
Hirschmuller and
Scharstein (2007) identifies SCH as the best matching
cost for stereo matching.
The MATLAB implementation is reported in Listing 12.3.
Listing 12.3 Stereo with SCH

The census transform is also efficiently computable: the basic


operations are simple integer comparisons, thus avoiding
floating point operations or even integer multiplications.
The computation is purely local and the same function is
evaluated in every pixel, so it can be
applied in parallel.
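To make the transform concrete, a minimal sketch for a 3 × 3 neighbourhood follows (the function name census3x3 and the bit ordering are assumptions of this illustration, distinct from Listing 12.3):

% Illustrative sketch: census transform on a 3x3 neighbourhood (radius 1).
% Each pixel of I (double) is encoded as an 8-bit integer; bit k is set when
% the k-th neighbour is darker than the centre (border pixels compare against
% zero padding).
function C = census3x3(I)
    [rows, cols] = size(I);
    C = zeros(rows, cols, 'uint8');
    offsets = [-1 -1; -1 0; -1 1; 0 -1; 0 1; 1 -1; 1 0; 1 1];
    for k = 1:8
        di = offsets(k, 1);  dj = offsets(k, 2);
        shifted = zeros(rows, cols);
        shifted(max(1,1+di):min(rows,rows+di), max(1,1+dj):min(cols,cols+dj)) = ...
            I(max(1,1-di):min(rows,rows-di), max(1,1-dj):min(cols,cols-dj));
        C = bitor(C, uint8(shifted < I) * 2^(k-1));  % one comparison bit per neighbour
    end
end

The SCH cost of (12.6) then amounts to summing, over the window, the number of set bits of bitxor between the census codes of the two images.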
In Fig. 12.6 we show, as an example, the disparity
maps obtained with SSD, NCC and SCH on the pair in
Fig. 12.1.

Fig. 12.6 Disparity maps produced by SSD, NCC and SCH (left to right) with
window

These block matching methods inevitably fail in the


presence of uniform areas or repetitive structures
because in both cases the
considered window does not have a unique match in the
other image. They also implicitly assume that all pixels
belonging to the window
have the same disparity (i.e. the same distance from the
focal plane).
This implies that the surfaces of the scene must be oriented
like the focal plane. Tilted surfaces and depth discontinuities
lead to mismatched
portions of the image being included in the window, and thus the calculated disparity will be fatally wrong.
Robust cost functions can partially mitigate this problem by treating samples of one of the two surfaces as outliers; non-parametric cost functions like SCH can do even better. However, the problem
can be
addressed more systematically by revising the cost
aggregation step: changing the shape, size and weighting
of the support .

12.4 Adaptive Support


We observed that if the window covers a region where the
depth
varies, the computed disparity will inevitably be affected by
error, since there is no one disparity that can be attributed
to all the support. Ideally, therefore, one would like to
include only points with the same
disparity, but this disparity is unknown when we set out to
compute it.
Reducing the window size helps mitigate the problem;
however, this choice makes the matching problem more
ambiguous, especially when dealing with uniform regions
and repetitive patterns.
To appreciate the phenomenon in a controlled
experiment, we
consider the random-dots stereogram of Fig. 12.7,
consisting of a
synthetic stereo pair obtained in the following way: a
background image is generated by assigning random
intensity values to the pixels. Using
the same method, we generate a square (or a region of any
shape). Then, we copy the square over the background in
two different positions that differ by a horizontal
translation. The pair thus constructed has a
horizontal disparity: observing one image with the right
eye and the other with the left eye, it is possible to
perceive the square as it was in the foreground.
Fig. 12.7 Random-dots stereogram. In the right image, a central square is
translated to the right by five pixels
Figure 12.8 shows the results of applying the block matching algorithm with different window sizes to the random-dots stereogram. It can be seen that a small window yields
more accurate disparity on edges but with random
errors (less reliable), while large windows
remove random errors but introduce systematic errors at
disparity
discontinuities (less accurate). Thus, we are in the presence
of two
opposing demands for simultaneous reliability and
accuracy. An ideal cost aggregation strategy should have
a support that includes only
points at the same depth—which is not known—and
extends as far as possible to maximise the signal (intensity
variation) to noise ratio.

Fig. 12.8 Disparity maps obtained with SSD correlation on a random-dots


stereogram with Gaussian noise added. The grey level—
normalised—represents the disparity. The correlation window sizes are, from
left to right, , and

Several solutions have been proposed to adapt the shape


and size of the support to meet the two requirements.
12.4.1 Multiresolution Stereo Matching
Multiresolution stereo matching methods address the problem of window size (but not shape). They are hierarchical methods that operate
at different resolutions. The idea is that at the coarse level
large windows provide an
inaccurate but reliable result. At the finer levels, smaller
windows and smaller search intervals improve accuracy.
We distinguish two
techniques (Fig. 12.9):

coarse-to-fine: the search interval is the same but operates on images at gradually increasing resolutions. The disparity obtained at one level is used as the centre of the search interval at the next, finer level;
fine-to-fine: always operates on the same image but with
smaller
windows and intervals. As before, the disparity at one
level is used as the center for the interval at the next
level.

Fig. 12.9 Coarse-to-fine (left) and fine-to-fine (right) methods


12.4.2 Adaptive Windows
In this category we find methods that adapt window shape
and/or size based on the image content. The ancestor of
this class is the method
proposed by Kanade and Okutomi (1994), which employs a window whose size is locally selected based on the signal-to-noise ratio and
disparity variation. The ideal window should include as
much intensity variation and as little disparity variation as
possible. Since the disparity is initially unknown, we start
with an estimate obtained with a
fixed window and iterate, approximating at each step the
optimal
window for each point, until convergence (if any) is
achieved. Building on this, a simpler algorithm based on
fixed size but eccentric windows was proposed by Fusiello et al. (1997). Nine windows are employed for each
point, each with the center at a different location (Fig.
12.10). The window among the nine with the smallest SSD
is the one most likely to cover the area with the least
variation in disparity.

Fig. 12.10 An eccentric window can cover a constant disparity zone even near
a disparity jump

It is possible to improve the efficiency of this scheme by


not
computing the matching cost nine times. Instead, the costs
for eccentric windows can be computed using the costs of
the neighbouring points calculated with the centred
support. By relaxing the disparity of the
point in its neighbourhood, the disparity with the lowest
matching cost is selected.
This method does not adjust the window size, which is assumed to be set to a value that provides a sufficient signal-to-noise ratio. Veksler (2003) adds a strategy to
adjust the window size, while Hirschmüller et
al. (2002) introduce a scheme that adapts the shape by
assembling the support from the aggregation of small
square windows with the
lowest matching score.
Other strategies include segmenting the image into
regions of similar intensity (assuming that the depth in these
regions is also nearly
constant) and intersecting the window with the region to
which the
pixel belongs, so as to obtain an irregularly shaped support
(Gerrits and Bekaert 2006). Similar to this strategy is the one proposed by Yoon and Kweon (2005) that assigns a weight to the pixels of the support
based on a
segmentation criterion (as in the bilateral filtering
illustrated in Chap. 11).

12.5 Global Matching


We have observed how cost aggregation near depth
discontinuities is a source of error. An orthogonal approach
to window adaptation is to
forgo aggregating costs altogether. However, this loses
reliability, since the individual pixels do not contain
enough information for an
unambiguous match, and thus constraints must be added
to regularise the problem. This results in the global
optimisation of an objective
function that includes the matching cost and a penalty for
discontinuities.
These methods, generically called global, can be well
understood by referring to the so-called Disparity Space
Image (DSI). This is a three-dimensional image (a volume) c, where c(u, v, d) is the value of the matching cost between the pixel (u, v) in the first image and the pixel (u + d, v) in the second image. In principle, the cost is computed pixel-wise, meaning that the support reduces to one pixel, although a small window can sometimes be used.
The disparity map we expect the algorithm to produce can be seen as a surface inside the DSI, described by a function d(u, v) (Fig. 12.11). In this formulation, we look for the disparity map d that minimises an objective function E(d):

E(d) = Σ_{(u,v)} c(u, v, d(u, v)) + Σ_{(u,v)} Σ_{(p,q) ∈ N(u,v)} V( d(u, v), d(p, q) )    (12.7)

where N(u, v) denotes a neighbourhood of (u, v). The first term of the function sums all pixel matching costs over the entire image, while the second term V adds a penalty for all pixels with neighbours that have a different disparity; in the simplest case it could be

V( d(u, v), d(p, q) ) = λ [ d(u, v) ≠ d(p, q) ]    (12.8)

Fig. 12.11 The disparity map as a surface in DSI


In this way, discontinuities are allowed if the pixel match is stronger than the penalty λ, that is, if the signal strongly indicates a discontinuity.
Note that the second term links all pixels in the image,
making the
problem a global one. If this term were omitted, the sum of the (positive) costs would be minimised by a local strategy, known as Winner Takes All (WTA), which consists in taking the lowest cost for each pixel. This is exactly what is applied in our MATLAB implementation of the local methods reported in the previous section (plus the aggregation of the cost over the window).
Among the best methods in this area are those based on
the
minimum graph cut (Roy and Cox 1998; Kolmogorov and
Zabih 2001). In a nutshell, they create a flow network,
where nodes correspond to
cells of the DSI and arcs connect adjacent cells, with an
associated
capacity that is a function of the costs of the incident cells.
The minimum cost cut represents the sought surface. The
disadvantage of these and
many other global methods is the high computational cost
in both time and memory consumption.
Although the problem, as we have formulated it, is
inherently two- dimensional (the solution is a surface), a
compromise to reduce the
computational cost is to decompose it into many
independent (simpler) one-dimensional problems.
Scanline Optimisation (SO) operates on individual
sections of the DSI (i.e. on scanlines ) and
optimises one scanline at a time independently of the
others. A disparity value d is assigned at each point u such
that the overall cost along the scanline is minimised with
respect to a cost function that incorporates the matching
cost and the
discontinuity penalty. Therefore, an SO algorithm optimises the cost function (12.7) with the only difference being that the neighbourhood is one-dimensional (it only extends horizontally) and thus vertical discontinuities are not penalised. If the horizontal discontinuities were not penalised as well, this would be equivalent to forgoing the regularisation term and would result in a WTA, as we have already observed.
In this class there are algorithms that operate in
sections of the DSI (Intille and Bobick 1994; Hirschmuller
2005) or in the so-called match space, the matrix
containing the costs of each pixel pair of two
corresponding scanlines (Cox et al. 1996; Ohta and
Kanade 1985), as illustrated in Fig. 12.12.

Fig. 12.12 Idealised cost matrices for the random point stereogram. Section
of the DSI (left) and match space. Intensities represent cost, with white
corresponding to the minimum cost. Choosing a point in the matrix is
equivalent to setting a match
In both cases, it is a matter of computing a minimum cost
path
through a cost matrix, and dynamic programming has been
shown to be particularly well suited for this task. However,
since the optimisation is performed independently along
the horizontal scanlines, horizontal
artefacts in the form of streakings are present. Several
authors add other penalties to the cost function for violating,
e.g. uniqueness or ordering constraints.
A compromise between global methods, which
undoubtedly produce better results (Fig. 12.13) but at high
computational cost, and those
operating on a single scanline is the Semi-Global Matching (SGM) of Hirschmuller (2005), which can be regarded as a variation of
SO that
considers many different scanlines instead of just one (Fig.
12.14). The method minimises a one-dimensional cost
function over n (typically n = 8) sections of the DSI along the cardinal directions of
the image (the horizontal one corresponds to the
section of the SO methods
described above). This way of proceeding can be
seen as an approximation of the optimisation of
the global cost function.

Fig. 12.13 Disparity maps produced by a SO algorithm (left) and SGM


(right). Note the characteristic streaks in the left map
Fig. 12.14 To the left, section of the DSI along the plane; to the right,
the eight scan lines considered for one pixel
In addition, the semi-global method introduces a specific cost function to penalise differently small disparity jumps, which are often part of sloped surfaces, and real discontinuities of surfaces:

V(d_1, d_2) = 0 if d_1 = d_2;  P_1 if |d_1 − d_2| = 1;  P_2 if |d_1 − d_2| > 1,  with P_1 < P_2    (12.9)

SGM computes, for each direction r, an aggregate cost L_r(p, d) defined recursively as follows (starting from the borders):

L_r(p, d) = c(p, d) + min{ L_r(p − r, d),  L_r(p − r, d − 1) + P_1,  L_r(p − r, d + 1) + P_1,  min_k L_r(p − r, k) + P_2 } − min_k L_r(p − r, k)    (12.10)

For each pixel and each disparity, the costs are summed over the eight paths, resulting in an aggregate cost volume:

S(p, d) = Σ_r L_r(p, d)    (12.11)

in which per-pixel minima (as in WTA) are chosen as the computed disparity.
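To illustrate the recursion (12.10), the following minimal sketch aggregates the cost along a single left-to-right path; the function name, the cost volume layout and the penalties P1 < P2 are assumptions of this illustration, and a full SGM would sum eight such passes as in (12.11).

% Illustrative sketch: SGM path aggregation along the left-to-right direction,
% for a cost volume C (rows x cols x ndisp) and penalties P1 < P2.
function L = sgm_left_to_right(C, P1, P2)
    [rows, cols, ndisp] = size(C);
    L = zeros(rows, cols, ndisp);
    L(:, 1, :) = C(:, 1, :);                       % initialise at the border
    for u = 2:cols
        prev = squeeze(L(:, u-1, :));              % rows x ndisp
        minprev = min(prev, [], 2);                % best previous cost per row
        for d = 1:ndisp
            same  = prev(:, d);
            small = min(prev(:, max(d-1,1)), prev(:, min(d+1,ndisp))) + P1;
            large = minprev + P2;
            L(:, u, d) = squeeze(C(:, u, d)) + min([same, small, large], [], 2) - minprev;
        end
    end
end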
Hirschmuller (2005) described SGM in conjunction with a matching cost based on mutual information; however, any cost can be plugged in. This algorithm achieves an
excellent trade-off between result quality
and execution time, as well as being suitable for parallel
implementations in hardware. Therefore, it has been widely
adopted in application contexts such as robotics and driver
assistance systems,
where there are real-time constraints and limited
computational capabilities.
When the two images of a stereo pair have different
lighting
conditions, it is necessary to normalise them by matching
their
respective histograms, that is, to transform one histogram
so that it is as similar as possible to the other.

Let H_X and H_Y be the cumulative histograms (counting the total number of pixels in all of the bins up to the current bin) of images X and Y, respectively. Let us assume that Y = g(X), where g is an unknown map that we would like to retrieve; it can then be recovered as g = H_Y^{-1} ∘ H_X.
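As a sketch, such a histogram matching can be implemented with a lookup table; the function name, the uint8 images and the Image Processing Toolbox functions imhist and intlut are assumptions of this illustration.

% Illustrative sketch: histogram matching of image X onto image Y (both uint8),
% implementing g = H_Y^{-1} o H_X via the cumulative histograms.
function Xm = match_histogram(X, Y)
    Hx = cumsum(imhist(X)) / numel(X);     % cumulative histogram of X
    Hy = cumsum(imhist(Y)) / numel(Y);     % cumulative histogram of Y
    lut = zeros(256, 1, 'uint8');
    for g = 0:255
        [~, idx] = min(abs(Hy - Hx(g+1))); % invert H_Y (nearest value)
        lut(g+1) = idx - 1;
    end
    Xm = intlut(X, lut);                   % apply the grey-level map
end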

12.6 Post-Processing
Downstream of the calculation of the “raw” disparity,
there are several steps of optimisation and post-
processing that can be applied. First,
interpolating the cost function near the minimum (e.g. with
a parabola) can yield a sub-pixel resolution corresponding
to fractional values of
disparity. Further, various image processing techniques
such as median filtering, morphological operators and
bilateral filtering (Chap. 11) can be used to reduce
isolated pixels and make the map more regular,
particularly if it has been created using non-global methods.
In the
following paragraphs, we will explore two important post-
processing steps: the computation of reliability
indicators and occlusion detection.
12.6.1 Reliability Indicators
The depth information provided by stereopsis is not
everywhere equally reliable. In particular, there is no
information for occlusion zones and
uniform intensity (or non-textured) areas. This incomplete
information can be integrated with information from other
sensors, but then it needs to be accompanied by a reliability
or confidence estimate, which plays a primary role in the
integration process.
We now briefly account for the most popular confidence indicators; more details are found in Hu and Mordohai (2012). In the following, c(d) denotes the matching cost, normalised in [0, 1], related to the disparity d (in the case of NCC we take 1 − NCC). The minimum of the cost function is denoted by c_1 and the corresponding disparity value by d_1. The second best cost is denoted by c_2, while the best second local minimum is denoted by c_2m (Fig. 12.15). The confidence varies in [0, 1], where 0 means unreliable and 1 reliable.

Fig. 12.15 Matching cost profile. Courtesy of F. Malapelle

Matching cost: 1 − c_1
Curvature: c(d_1 − 1) − 2 c(d_1) + c(d_1 + 1)
Peak Ratio: 1 − c_1 / c_2m
Maximum Margin: c_2m − c_1
Winner Margin: (c_2m − c_1) / Σ_d c(d)
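As a sketch, the Peak Ratio indicator can be computed directly from a cost volume; here the second-best cost over all disparities is used as a simple stand-in for the second local minimum c_2m, and the function name and normalisation are assumptions of this illustration.

% Illustrative sketch: peak-ratio confidence from a cost volume C
% (rows x cols x ndisp), using best and second-best costs per pixel.
function conf = peak_ratio(C)
    Cs = sort(C, 3);                    % costs sorted along the disparity axis
    c1 = Cs(:, :, 1);                   % best (lowest) cost
    c2 = Cs(:, :, 2);                   % second best cost
    conf = 1 - c1 ./ max(c2, eps);      % close to 1 when the minimum is distinctive
end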

12.6.2 Occlusion Detection

Occlusions generate points lacking homologous counterparts, and are related to depth discontinuities. For example, in one image of a random-dots stereogram,
there are two depth discontinuities along a horizontal
scanline, located on the left and right edges of the square
(see Fig.
12.12).
By definition, occlusions can be identified when points
in one image fail to have corresponding homologous points
in the other image. The
matching procedure, however, is going to find a best match
for any point, even if the pairing is weak. To detect
occlusions one can analyse the
quality of the matches and remove any matches that are
likely to be
inaccurate, using, e.g. the metrics introduced in the
previous section. However, a more stringent detection
can be implemented by exploiting the uniqueness
constraint.
In the matching procedure, for each point in I_1, the corresponding point in I_2 is searched (see Fig. 12.16). If, for example, a portion of the scene is visible in I_1 but not in I_2, a pixel q of I_1, whose homologous is occluded, will be paired with a certain pixel q′ of I_2 according to the matching cost employed. If the true homologous of q′ is p, and assuming that the matching operates correctly, p is also matched to q′, violating the uniqueness constraint. This can be easily detected by checking the injectivity of the disparity map.
Fig. 12.16 Left-right consistency. The point q whose homologous is occluded does not have a one-to-one correspondence, while p does
The next problem is to determine which of the two
potential
homologous is the correct one, and this can be done by
selecting the lowest matching cost (or the most reliable
match, in general).
More effective1 (but also more expensive) is the left-right consistency check (or bidirectional matching), which prescribes that if p is coupled to q′ by performing the search from I_1 to I_2, then q′ must be coupled to p by performing the search from I_2 to I_1. Thus, in the previous example, q′ is paired with its true homologous p, so the point q can be recognised as occluded and left without a correspondent.
Eventually, for points without a correspondent, one can
estimate a disparity by interpolation of neighbouring
values, or leave them as they
are.
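A minimal sketch of the left-right consistency check on two integer-valued disparity maps follows; the names dL and dR, the sign convention and the tolerance tol are assumptions of this illustration.

% Illustrative sketch: left-right consistency check.
% dL(v,u): disparity of pixel (u,v) of I1, whose match in I2 is at (u + dL(v,u), v).
% dR(v,u): disparity of pixel (u,v) of I2, whose match in I1 is at (u - dR(v,u), v).
function dL = lr_check(dL, dR, tol)
    [rows, cols] = size(dL);
    for v = 1:rows
        for u = 1:cols
            u2 = u + round(dL(v, u));               % where (u,v) is matched in I2
            if u2 < 1 || u2 > cols || abs(dL(v, u) - dR(v, u2)) > tol
                dL(v, u) = NaN;                     % inconsistent: likely occluded
            end
        end
    end
end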

Scharstein and Szeliski (2002) introduced a standard protocol for the evaluation of stereo algorithms. The results can be found on the web at http://vision.middlebury.edu/stereo/.

References
I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63 (3): 542–567, May 1996.
L. Di Stefano, M. Marchionni, S. Mattoccia, and G. Neri. Quantitative evaluation of area-based stereo matching. In 7th International Conference on Control, Automation, Robotics and Vision, ICARCV 2002, volume 2, pages 1110–1114, 2002.
A. Fusiello, V. Roberto, and E. Trucco. Efficient stereo with multiple windowing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 858–863, Puerto Rico, June 1997. IEEE Computer Society Press.
Mark Gerrits and Philippe Bekaert. Local stereo matching with segmentation-based outlier rejection. In Proceedings of the 3rd Canadian Conference on Computer and Robot Vision, CRV '06, page 66, USA, 2006. IEEE Computer Society.
Heiko Hirschmüller, Peter R. Innocent, and Jonathan M. Garibaldi. Real-time correlation-based stereo vision with reduced border errors. International Journal of Computer Vision, 47 (1-3): 229–246, 2002.
H. Hirschmuller and D. Scharstein. Evaluation of cost functions for stereo matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 807–814, Washington, DC, USA, 2005. IEEE Computer Society.
X. Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34 (11): 2121–2133, Nov 2012.
S. S. Intille and A. F. Bobick. Disparity-space images and large occlusion stereo. In Jan-Olof Eklundh, editor, European Conference on Computer Vision, pages 179–186. Springer, Berlin, May 1994.
T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16 (9): 920–932, September 1994.
Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions using graph cuts. Proceedings of the International Conference on Computer Vision, 2: 508, 2001.
Kuk-Jin Yoon and In-So Kweon. Locally adaptive support-weight approach for visual correspondence search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 924–931, 2005.
Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7 (2): 139–154, March 1985.
Sébastien Roy and Ingemar J. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In Proceedings of the International Conference on Computer Vision, page 492, Washington, DC, USA, 1998. IEEE Computer Society.
D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47 (1): 7–42, May 2002.
Olga Veksler. Fast variable window for stereo correspondence using integral images. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '03, pages 556–561, USA, 2003. IEEE Computer Society.
R. Zabih and J. Woodfill. Non-parametric local transform for computing visual correspondence. In Proceedings of the European Conference on Computer Vision, volume 2, pages 151–158. Springer, Berlin, 1994.

Footnotes
1 The two approaches are equivalent in the proposed example, but in more
complex cases they are not, as pointed out by Di Stefano et al. (2002).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_13

13. Range Sensors


Andrea Fusiello
( 1) University of Udine, Udine, Italy

Email: [email protected]
Andrea Fusiello

13.1 Introduction
The recovery of a 3D model, or 3D shape acquisition, can be
achieved
through a variety of approaches that do not necessarily
involve cameras and images. These techniques include
contact (probes), destructive
(slicing), transmissive (tomography) and reflective non-
optical (SONAR, RADAR) methods. Reflective optical
techniques, which rely on back-
scattered visible (including near infrared)
electromagnetic radiation, have been the focus of our
discussion so far, as they offer several
advantages over other methods. These include not
requiring contact, speed and cost-effectiveness.
However, these methods also have some limitations, such
as only being able to acquire the visible portion of
surfaces and dependency on surface reflectance.
Within optical techniques, we distinguish between active
and passive ones. The attribute active refers to the fact that
they involve a control on the illumination of the scene, that
is, radiating it in a specific and
structured way (e.g. by projecting a pattern of light or a
laser beam), and exploit this knowledge in the 3D model
reconstruction. In contrast, the passive methods seen so far
rely only on analysing the images as they
are, without any assumptions about the illumination, except
that there is enough to let the camera see.
Active methods are reified as stand-alone sensors that
incorporate a light source and a light detector, which is not
necessarily a camera.
These are also called range sensors, for they return the range
of visible points in the scene, that is, their distance from
the sensor (Fig. 13.1).
This definition can be extended to include passive sensors
as well, such as stereo heads.

Fig. 13.1 Colour image and range image of the same subject, captured by a
Microsoft Kinect device. Courtesy of U. Castellani
Full-field sensors return a matrix of depths called a range image, which is acquired in a single temporal instant, similar to how a camera with a global shutter operates.
Scanning sensors (or scanners) sweep the scene with a
beam or a light plane, and capture a temporal sequence
of different depth
measurements. This sequence can then be used to generate
a range
image or a point cloud, depending on the desired output.
Similarly to rolling shutter cameras, artefacts will
appear if the sensor or the object
moves.

Global shutter captures an entire image simultaneously,


while rolling shutter captures the image line by line in a
temporal sequence. This can cause distortions in the
image, such as objects appearing skewed or slanted if
they are moving quickly.

13.2 Structured Lighting


The techniques of structured lighting have in common
that they
purposively project a light pattern characterised by an
informative
content on the scene. This “special” lighting has the
advantage of making
the result virtually independent of the texture of the
object surface (which is not true for passive methods,
such as stereopsis).
Among these techniques, the first two that we will
illustrate are
based on triangulation: in active stereo structured light is
simply used to better condition the stereo matching in a
binocular setup, while in active triangulation there is only one
camera and the triangulation takes place with the
projector. Finally, we will mention photometric stereo, based on
the acquisition of many images in which the viewpoint is
static but the lighting direction varies in a controlled way
(in this sense we consider it to be “structured”).
13.2.1 Active Stereopsis
Active stereopsis works similarly to its passive
counterpart (Chap. 12), but uses structured light
projected onto the scene to improve
results in textureless areas. Structured light can take
several forms, as will be illustrated in the examples below:
1. A random dot pattern is projected onto the scene,
facilitating the matching, as in Fig. 13.2.

Fig. 13.2 Active stereo with random dots. Stereo pair with projected
“salt and pepper” artificial texture and resulting disparity map

2. A laser beam scans the scene, projecting a point onto the


surfaces
that is easily detected and matched in the two images. It
is necessary to take many images since a single pair of
images is used to calculate the depth of a single point.
3. A laser sheet is swept over the scene1 (Fig. 13.3). The
lines
determined by the light plane in the two images
intersected with their respective epipolar lines
provide the corresponding points. In
this way, for each pair of images, disparity is assigned
to the points in the lighted line. It is faster than the
previous solution, but still
many images are needed.

Fig. 13.3 Example of active stereo system with laser sheet. The top row
shows the two images acquired by the cameras, in which the line formed
by the laser sheet is visible. Below are shown, in overlay, the points
detected after a sweep

The first solution configures a full-field sensor, while the


other two
involve a scanning behaviour. Active stereo does not have much practical impact, but it paves the way for the next method, which is active triangulation.

13.2.2 Active Triangulation


Active triangulation, like stereopsis, is based on the
principle of
triangulation between two devices, with one device being
the light
projector and the other the camera. The projector is treated
as an
inverse camera, where the light rays exit from the Centre of
Projection
(COP) instead of entering, but the underlying geometry
remains the same. Thus, a calibrated camera-projector
system is geometrically equivalent to a calibrated
camera pair.

A projector of planes can be modelled, analogous to the pinhole camera, with a 2 × 4 projection matrix P that maps 3D points onto the 1D coordinate of the projected lines (each light plane corresponds to one line of the pattern):

λ x̃′ = P M̃    (13.1)
The calibration and triangulation are entirely
analogous to the passive stereo case (Trobina
1995).

The camera shoots a scene in which a device projects a


pattern of
structured light, that is, that contains the information
necessary to
identify its elements. The range of points in the scene is
obtained by
intersecting the optical ray of an image point with the
corresponding
light ray or plane emitted by the projector. Calibration is
necessary to
orient the the light plane in object space. The matching
phase is avoided, as correspondences are obtained from the
information contained in the pattern itself. Some examples
follow:
1. The projector scans the scene with a beam (or sheet)
of laser light. The laser dot (or line) in the image is
uniquely identified.
2. Instead of one plane, one can project many planes
simultaneously using a projector of light bands. In this
case, the stripes must be encoded in some way to
distinguish them from each other in the image.
Compared to the previous solution, more points are
measured with a single image.

3. The projector illuminates the scene with a pattern of


random dots, so that the configuration of dots in a
neighbourhood of a point
identifies it uniquely.2

Because the camera and projector see the scene


from different positions, shadow areas are created
where the range cannot be
measured (Fig. 13.4), similarly to occlusions in stereo.

Fig. 13.4 Image of intensity and range image obtained from a commercial laser
active
triangulation system. The missing parts in the range image (in white) are due
to the different position of the laser source and the camera. Courtesy of S.
Fantoni

13.2.3 Ray-Plane Triangulation


By knowing the geometry of the system, the equation of
the optical ray from the camera and the equation of the
corresponding light plane, the 3 D coordinates of the
observed scene point can be calculated by
intersection.
Consider a point M with coordinates (x, y, z) in the camera reference frame. The direct isometry (R, t) that brings the camera reference frame onto the projector reference frame is known from calibration; therefore, the coordinates of the same point in the projector reference frame are

M′ = R M + t    (13.2)
The projection of point M onto the camera image plane is (in normalised coordinates) m = (u, v, 1)ᵀ, and is obtained from the projection equation:

z m = M    (13.3)

As for the projector, we model it as a camera, in which the vertical coordinate of the projected point is undefined (we assume that the light sheets are vertical, in the interior reference frame of the projector). Let u′ be the coordinate of the plane illuminating M; then, keeping with the language of the camera, we will say that M is projected onto the point (in normalised coordinates) m′ = (u′, ∗, 1)ᵀ, in the reference frame of the projector, where ∗ is only a placeholder for an undefined value:

z′ m′ = M′    (13.4)
Using Eqs. (13.2), (13.3) and (13.4), we get:

z′ m′ = R (z m) + t    (13.5)

which decomposes into a system of three scalar equations (r_iᵀ is the i-th row of R and t_i the i-th entry of t):

z′ u′ = z r_1ᵀ m + t_1
z′ ∗ = z r_2ᵀ m + t_2
z′ = z r_3ᵀ m + t_3    (13.6)

of which the second cannot be used since ∗ is undefined. We derive z′ from the third equation and substitute it into the first, obtaining, after a few steps, the result we sought, namely, the depth of point M (in the camera reference frame):

z = (t_1 − u′ t_3) / (u′ r_3ᵀ m − r_1ᵀ m)    (13.7)
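A minimal sketch of this ray-plane triangulation, with the function name and argument layout as assumptions of this illustration:

% Illustrative sketch of ray-plane triangulation, following (13.7).
% m  : normalised image point [u; v; 1] in the camera frame
% up : normalised (horizontal) coordinate of the light plane in the projector
% R,t: isometry from the camera frame to the projector frame (from calibration)
function M = ray_plane_triangulation(m, up, R, t)
    z = (t(1) - up * t(3)) / (up * R(3,:) * m - R(1,:) * m);  % depth, (13.7)
    M = z * m;                                                % 3D point, (13.3)
end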

13.2.4 Scanning Methods


In laser-based methods, the determination of the ray-plane
correspondence is straightforward due to the fact that only
one light
plane is projected. Consequently, the sensor measures the
3D position of
one line at a time. To obtain a range image, it is necessary to
mechanically move the sensor in a controlled manner
and with great precision. This results in a laser scanning
sensor, or scanner, in brief.
An alternative to the controlled scanning is to move the
sensor freely (e.g. by hands) and use a coupled coordinate-
measuring machine (CMM) that, through a magnetic,
optical or ultrasonic device, returns the
position and attitude of the sensor in a local reference
frame.

13.2.5 Coded-Light Methods


In the coded-light methods, the stripes or the points encode
the
information which serves to uniquely identify them.
This has the advantage that moving parts are
eliminated and the number of
measurable points in a frame increases. See Salvi et al.
(2010) for a review on this topic.
As for the stripes, which are one of the most widely
used methods, the simplest encoding is accomplished by
assigning a different colour (Boyer and Kak 1987) to each
stripe, so that they can be distinguished from each other
in the image.
The colour of a pixel is determined by the combination
of light and surface reflectance. Although we can control
the light, the surface
reflectance is usually unknown, making it difficult to
accurately predict the resulting pixel colour. To ensure
accurate identification, one can
associate the identity of the stripe not only with its colour
but also with that of neighbouring stripes, thus providing
more reliable results.

A de Bruijn sequence of order n on an alphabet A of length k, denoted by B(k, n), is a cyclic (start and end coincide) sequence of length k^n in which all possible strings of length n of A occur exactly once as substrings. For example, on A = {0, 1}, we have that B(2, 3) = 00010111. The binary strings of length 3 occur only once as substrings of 00010111 (cyclic).

For example, if the colours of the bands are arranged


according to a de Bruijn sequence, each one can be
uniquely identified by its colour
and that of the neighbouring bands. Another very robust
technique is
the so-called temporal coding, which is accomplished by projecting a temporal sequence of n black-and-white stripe patterns,
appropriately generated by a computer-controlled
projector. Each projection direction is thus associated with
a code of n bits, where the i-th bit indicates
whether the corresponding stripe was in light or shadow
in the i-th pattern (see Fig. 13.5). This method allows to distinguish 2^n different directions of projection.
Fig. 13.5 Coded light. Images of a subject with stripes of projected light. In the
lower right
image, the stripes are very thin and are not visible at reduced resolution.
Courtesy of A. Giachetti A camera acquires n grey-level images
of the object lighted with the
stripe pattern. These are then converted into binary form so
as to
separate the areas illuminated by the projector from the
dark areas, and for each pixel the n-bit string is stored,
which, by the above, uniquely
encodes the illumination direction of the thinnest stripe the
pixel
belongs to. To minimise the effect of errors Gray coding is
customarily used, so that adjacent stripes differ by one bit
only.
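As an aside, the reflected binary Gray code can be generated and inverted with simple bit operations; a minimal sketch (function names are illustrative) is:

% Illustrative sketch: Gray encoding/decoding of stripe indices (non-negative
% integers). In a single file the second would be a local function.
function g = bin2gray(b)
    g = bitxor(b, bitshift(b, -1));          % reflected binary Gray code
end

function b = gray2bin(g)
    b = g;
    s = bitshift(g, -1);
    while any(s(:))                          % undo the xor cascade
        b = bitxor(b, s);
        s = bitshift(s, -1);
    end
end

With this encoding, adjacent stripe indices indeed differ by a single bit.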

13.3 Time-of-Flight Sensors


A Time of Flight (TOF) laser sensor, commonly referred to
as LiDAR (an acronym for Light Detection And Ranging),
works on the same principle as RADAR (Radio Detection
And Ranging). The surface of an object
reflects laser light back to a receiver, which measures the
time elapsed between the transmission and reception of
the light— the time of flight. This time can be measured in
two ways: directly as or indirectly
after conversion to a phase delay through amplitude
modulation.
In the first case, also called pulsed wave (PW) LiDAR, the
laser
source emits a pulse, the receiver records the instant of
time when it returns after being reflected by the target
and computes the TOF τ. A
simple calculation allows to recover the distance as ρ = c τ / 2,
where c indicates the speed of light. As the reader can
easily deduce, the depth resolution of the sensor is related
to its ability to measure extremely small timescales.
Light travels 1 mm in 3.3 ps, and it is difficult to
measure intervals smaller than 5–10 ps.
To obtain a more versatile sensor, a continuous wave
(CW) is emitted instead of a pulse; this is an amplitude-
modulated radiation with a sine wave, and the phase of
the reflected signal is measured. Let therefore f_m be the modulation frequency. The distance ρ to the target is related to the phase φ of the reflected signal by

φ = (4π f_m / c) ρ    (13.8)

Therefore, we derive:

ρ = c φ / (4π f_m)    (13.9)

The phase is measured by the cross-correlation of the emitted and received signals. Note the periodicity: since the phase is measured modulo 2π, the distances also turn out to be measured modulo c / (2 f_m).
The choice of modulation frequency depends on the type of application. A choice of f_m = 10 MHz, which corresponds to an unambiguous range of 15 m, is typically suitable for indoor robotic vision. For close-range object modelling, we will choose a lower unambiguous range (i.e. a higher f_m). This ambiguity can be eliminated by scanning
with decreasing
wavelengths, or equivalently with a chirp type signal.
What we described so far is a laser range finder, a sensor
that emits a single beam and therefore obtains distance
measurements of only one
point in the scene at a time. A laser scanner, instead, uses a
beam of laser light to scan the scene in order to derive the
distance to all visible points.
In a typical terrestrial laser scanner, an actuator rotates a
mirror
along the horizontal axis (thus scanning in the vertical
direction), while the instrument body, rotating on the
vertical axis, scans horizontally.
Each position then corresponds to a 3D point
detected in polar coordinates (two angles and the
distance).
In an airborne laser scanner, on the other hand, the
instrument
performs an angular scan (or “swath”) along the
perpendicular to the flight direction, while the other scan
direction is given by the motion of the aircraft itself, which
must therefore be able to measure its
orientation with extreme precision, thanks to inertial
sensors
(accelerometer and gyroscope) and a global navigation
satellite system (GNSS), such as the GPS. The same
principle is also applied in the
terrestrial domain (mobile mapping) to different carriers
such as road vehicles, robots or humans.
Simultaneous Localization and Mapping (SLAM) is a
powerful tool for navigation in GNSS-denied
environments, such as indoors. It utilises 3D registration
techniques (see Chap. 15) to compute the incremental
rigid motion between consecutive scans. For further
information, see
(Thrun and Leonard 2008).
Recently, full-field time-of-flight sensors, also known as
Flash LiDAR or TOF cameras, have been developed. These
sensors allow for
simultaneous time-of-flight detection of a 2D matrix of
points through an array of electronic circuits on the chip.
While the lateral resolution of these sensors is low (e.g.
128x128), it is compensated for by their high acquisition
rate (e.g. 50 fps).

13.4 Photometric Stereo


Photometric stereo (Woodham 1980) is a technique for
deriving
information about the shape of an object from a series of
photographs taken by illuminating the scene from
different angles. Typically, the
input to photometric stereo consists of a series of images
captured by a fixed camera, where the illumination
changes are due to the
displacement of a single light source, and the output is the field of normals of the 3D surface under observation.
The idea behind this approach is that the light radiation
(radiance) reflected from the object falling on the camera
depends mainly on the direction of the light illuminating it and on the normals of the surface: the more the
direction of the light coincides with the normal, the more
light will be reflected and therefore the higher the
intensity of the pixel in the corresponding image.
The brightness of point (u, v) on the image is equal to the radiance L of the point M of the scene projecting into (u, v):

I(u, v) = L(M)    (13.10)

The radiance at point M, in turn, depends on the shape (the normal), the reflectance of the surface and the light sources.
We assume that the surface is Lambertian, so we apply (2.9), which we rewrite as

L(M) = ρ(M) [ n(M)ᵀ s(M) > 0 ] n(M)ᵀ s(M)    (13.11)

where ρ is the effective albedo (it incorporates the incident radiance), s is the versor of the illumination direction (it points in the direction of the light source) and n is the versor of the normal; all these quantities depend, in principle, on the point M considered. If we assume constant albedo and parallel illumination (light source at infinite distance), then only the normal depends on the point M, and we can therefore write:

I(u, v) = ρ n(M)ᵀ s    (13.12)

where we have neglected the Iverson bracket, so that the problem can be solved using simple linear algebra tools.
In practice, given f greyscale images taken under f different light directions, we consider the matrix B of size p × f obtained by juxtaposing by column the p pixels of each image. It is
important that the images are radiometrically calibrated,
that is, that the pixel intensities
correspond to the physical values of the scene radiance.
This can be ensured if one works directly with images in
linear RGB format.
Assuming the photographed object is Lambertian, the reflectance equation (2.6) translates into matrix form as
(13.13)
where one matrix collects the normals by columns and S is the matrix that has by columns the directions of the light sources used during the acquisition. If the light directions S are known, we speak of calibrated lights. In this situation, we are led to solve a linear least-squares problem of the type:
(13.14)
which is determined if we have more than two images available (f ≥ 3). The rows of the matrix X that realise the minimum, once normalised, give the normal versors of the surface, while the albedo corresponds to the norm of each row of X. The MATLAB implementation is given in Listing 13.1, and an example result is shown in Fig. 13.6.

Fig. 13.6 Top: 12 images of a figurine taken with different light source
positions. Bottom: the normals map obtained by photometric stereo (the hue
encodes the normal), on the right the reconstructed surface. Courtesy of L.
Magri
Listing 13. 1 Photometric stereo
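A minimal MATLAB sketch of this calibrated least-squares solution, assuming B is the p × f intensity matrix and S the 3 × f matrix of light directions arranged as described above (variable names are illustrative, not the book's listing):

X = B / S;                      % least-squares solution of X*S = B, one row per pixel
albedo = sqrt(sum(X.^2, 2));    % albedo: norm of each row of X
N = X ./ max(albedo, eps);      % unit normals, one per row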

The solution of the problem becomes more complicated if the directions of the light sources are not known. In this case, the problem is called uncalibrated photometric stereo. A possible solution consists in iteratively estimating X and the lighting directions S, starting from the Singular Value Decomposition (SVD) factorisation of the matrix B. It can also be shown that, if the lighting directions are not known, the normals field can only be determined up to an invertible transformation called the bas-relief ambiguity. The name is due to the fact that this ambiguity is exploited in the bas-relief technique to give the feeling of seeing a subject that is in low relief as if it were in the round. Intuitively, the matrix B can be factorised both as X S and as (X A)(A⁻¹ S), where A is the transformation matrix representing the ambiguity. More details can be found in (Belhumeur et al. 1999).
A further complication is due to the presence of phenomena that violate the Lambertian model, such as shadows and specularities. Specular reflection departs from the Lambertian model in that light is not scattered uniformly in space, but is reflected primarily along one direction. Shadows, on the other hand, correspond to those pixels for which the normal faces away from the light direction (self-shadows), or to points that are not reached by the light source due to occlusions (cast shadows). In some cases, specular and shadowed pixels can be handled by treating them as outliers.
For example, after observing that the matrix B must
have at most rank 3 (because it is the product of two
matrices with at most rank 3), one can exploit matrix
factorisation techniques to clean up the data as explained
by Wu et al. (2010). The idea is to robustly determine the
linear subspace of dimension three that represents the
Lambertian measurements so as to discard
specularities as rogue measurements and treat shadows
as missing data.
If, on the other hand, non-diffusive phenomena are
dominant, a different function must be used to more
closely approximate the surface reflectance.
13.4.1 From Normals to Coordinates
Once the normals field is obtained, the surface of the observed object can be reconstructed. An overview of some of the more efficient methods is given in (Agrawal et al. 2006). Here we present a simple but not robust method that assumes orthographic projection. If we denote by Z(x, y) the depth of the surface at pixel (x, y), in the orthographic projection case the vector
(13.15)
approximates the tangent vector to the surface in the x direction and must therefore be orthogonal to the normal (Fig. 13.7).

Fig. 13.7 Discrete approximation of normal. Courtesy of L. Magri

Developing the calculations, we obtain a link between the depths and the entries of the normal:
(13.16)
A similar relationship can be derived for the vector
(13.17)
which approximates the tangent vector to the surface in the y direction, yielding
(13.18)
In this way, as the normals vary, we can derive conditions on the depths Z. It is then possible to determine Z by solving the corresponding overdetermined sparse linear system (about two equations per pixel in the unknown depths).
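A MATLAB sketch of this integration step, assuming forward differences, a normal map N of size h × w × 3 and the orthogonality conditions written as above (the variable names and the discretisation are illustrative):

% Recover depth Z from a normal map N (h x w x 3) by solving a sparse
% overdetermined system with forward differences (orthographic case).
[h, w, ~] = size(N);
p = h * w;
idx = reshape(1:p, h, w);                  % linear index of each pixel
rows = []; cols = []; vals = []; b = []; eq = 0;
for y = 1:h
    for x = 1:w-1                          % n1 + n3*(Z(x+1,y) - Z(x,y)) = 0
        n = squeeze(N(y, x, :)); eq = eq + 1;
        rows = [rows; eq; eq]; cols = [cols; idx(y, x+1); idx(y, x)];
        vals = [vals; n(3); -n(3)]; b = [b; -n(1)];
    end
end
for y = 1:h-1
    for x = 1:w                            % n2 + n3*(Z(x,y+1) - Z(x,y)) = 0
        n = squeeze(N(y, x, :)); eq = eq + 1;
        rows = [rows; eq; eq]; cols = [cols; idx(y+1, x); idx(y, x)];
        vals = [vals; n(3); -n(3)]; b = [b; -n(2)];
    end
end
eq = eq + 1;                               % fix the free constant: Z(1,1) = 0
rows = [rows; eq]; cols = [cols; idx(1,1)]; vals = [vals; 1]; b = [b; 0];
A = sparse(rows, cols, vals, eq, p);
Z = reshape(A \ b, h, w);                  % least-squares depths (up to the fixed constant)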

Among the structured light methods, we also mention


the method of the moiré fringes. The idea of this method is
to project a grid onto an object and take an image of it
through a second reference grid. This image interferes
with the reference grid and creates interference
figures known as moiré fringes, which appear as bands
of light and shadow. Analysis of these bands provides
information about the depth variation.

13.5 Practical Considerations


Figure 13.8 summarises the taxonomy of the active methods discussed in this chapter. When discussing range sensors, the most important distinction is usually drawn between active triangulation and time-of-flight, with the classes of "structured lighting" and "active triangulation" considered coincident. This implies neglecting active stereo and photometric stereo, which are not implemented in any physical range sensor.
Fig. 13.8 Taxonomy of the active methods we covered in this chapter
Active triangulation and time-of-flight sensors have fairly well separated operating ranges. Triangulation sensors are the most accurate, reaching a maximum resolution of the order of tens of micrometres, but they work within a few metres of the object. This is because the baseline must be of the same order of magnitude as the sensor-object distance. Time-of-flight sensors reach much greater distances, with lower resolutions. In particular, PW class sensors reach distances of several kilometres, with a resolution of approximately 10 cm at 1 km; in any case their resolution does not go below 2 mm. CW sensors, on the other hand, reach only hundreds of metres, with a resolution of approximately 1 mm at 300 m; the smallest achievable resolution is of the order of hundredths of millimetres. Figure 13.9 illustrates the different application areas. It should be noted that the boundaries of the boxes representing the different classes of sensors are not sharp and that there are partial overlaps.
Fig. 13.9 Comparison of characteristics of active systems. Adapted from (Guidi
et al. 2009)
In summary, for small to medium measurement volumes (up to human size) and consequent working distances, triangulation sensors are used, while for larger volumes and distances (buildings, geographical areas), time-of-flight sensors are used. The former can possibly be coupled with CMMs, the latter with GPS and inertial sensors.
operating range: minimum and maximum value of the measurement.
In the specific case of range sensors, a distinction is made between lateral and depth resolution. The lateral resolution is defined as the smallest detectable distance between two adjacent points measured in a plane perpendicular to the sensor axis; it therefore depends on the distance at which the plane is placed. Often the size in points (or pixels) of the image is given, and the resolution is obtained by dividing the size of the imaged area by the size of the image. Of course, sensors based on different principles will have different operating characteristics.

References
Amit Agrawal, Ramesh Raskar, and Rama Chellappa. What is the range of surface reconstructions from a gradient field? In European Conference on Computer Vision, pages 578–591. Springer, Berlin, 2006.
Peter N. Belhumeur, David J. Kriegman, and Alan L. Yuille. The bas-relief ambiguity. International Journal of Computer Vision, 35(1): 33–44, 1999.
K. Boyer and A. Kak. Color-encoded structured light for rapid active ranging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(10): 14–28, 1987.
Gabriele Guidi, Michele Russo, and Jean-Angelo Beraldin. Acquisizione 3D e modellazione poligonale. McGraw-Hill, Milano, 2009.
Joaquim Salvi, Sergio Fernandez, Tomislav Pribanic, and Xavier Llado. A state of the art in structured light patterns for surface profilometry. Pattern Recognition, 43(8): 2666–2680, August 2010.
Sebastian Thrun and John J. Leonard. Simultaneous Localization and Mapping, pages 871–889. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. ISBN 978-3-540-30301-5. https://doi.org/10.1007/978-3-540-30301-5_38.
Marjan Trobina. Error model of a coded-light range sensor. Technical Report BIWI-TR-164, ETH-Zentrum, 1995.
Robert J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1): 191139, 1980.
Lun Wu, Arvind Ganesh, Boxin Shi, Yasuyuki Matsushita, Yongtian Wang, and Yi Ma. Robust photometric stereo via low-rank matrix completion and recovery. In Asian Conference on Computer Vision, pages 703–717. Springer, Berlin, 2010.

Footnotes
1 A plane or sheet can be obtained by passing a laser beam through a
cylindrical lens.

2 This is also the working principle of the first Microsoft Kinect sensor. The
pattern is generated by an infrared laser through appropriate diffraction grating
that creates a speckle pattern.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_14

14. Multi-View Euclidean Reconstruction
Andrea Fusiello1
(1) University of Udine, Udine, Italy
Email: [email protected]

14.1 Introduction
In this chapter we will deal with the problem of
reconstruction from many calibrated images, which is the
most relevant in practice and leads to a Euclidean
reconstruction.
Consider a set of 3D points viewed by m cameras, and let the (homogeneous) coordinates of the projection of the j-th point into the i-th camera be given. The reconstruction problem can be posed as follows: given the set of pixel coordinates, find the set of Perspective Projection Matrices (PPM) and the 3D points (called the model in this context) such that:
(14.1)
As already observed in the case of two images, without
further
constraints one will obtain, in general, a reconstruction
defined up to an arbitrary projectivity, and for this reason it
is called projective
reconstruction. Indeed, if a set of PPMs and 3D points satisfies (14.1), then (14.1) remains satisfied after transforming cameras and points with any nonsingular
matrix T that represents a projectivity. If the


intrinsic parameters are known, a Euclidean
reconstruction can be achieved, which differs from the
true one by a similitude.
In this chapter, we will focus on the problem of
reconstruction in the case of multiple calibrated images,
also known in the literature as
Structure from Motion (SFM) (Fig. 14.1). Here, when we
refer to
“structure”, we mean the 3D model and by “motion”, we
mean the
exterior orientation of a set of cameras. The uncalibrated case will be dealt with in Chap. 16.

Fig. 14.1 Visualisation of the cherub reconstruction, consisting of a 3D model


and the oriented images, shown as pyramids having the vertex in the Centre of
Projection (COP)

For the case of two images with known interior orientation, Chap. 8 discussed how to achieve a Euclidean reconstruction through the computation of the essential matrix, its factorisation and triangulation. The question now is: how do we generalise this process to more than two images?

14.1.1 Epipolar Graph
Given a set of unordered images, we assume that for a number of pairs of these images (ideally, whenever possible) a set of corresponding points is provided. Such points are typically produced by a detector and are matched automatically, so they are perturbed by small errors and contain a significant percentage of outliers.
This leads to the preliminary problem of establishing which of them form epipolar pairs and of computing the essential or fundamental matrix (depending on whether the interior orientation is known or not) in order to geometrically validate the matches, discarding the incorrect ones.
The correspondences between pairs of images are then
propagated to the whole set, resulting in so-called tracks,
which span multiple
images (Fig. 14.2). The corresponding 3D points are
called tie points: they are the ones that will constitute the
model.

Fig. 14.2 Some tracks overlaid on the cherub images

It is useful in this context to represent the available information by means of the epipolar graph G = (V, E), whose nodes V are the m images and whose edges E correspond to the computed fundamental or essential matrices (Fig. 14.3). The adjacency matrix A of the graph contains a one in the entry (i, j) if the corresponding matrix is available, and zero otherwise. We can also associate with the edges a weight proportional to the number of corresponding points between the two images. In some cases it will be convenient to invert this value, turning it into a cost.
Fig. 14.3 An epipolar graph. The nodes are the images and the edges
correspond to the epipolar pairs, that is, they are labelled with an essential
matrix
The edge (i, j) is labelled with the essential matrix of the corresponding image pair. Observe that, with the notation introduced above, the essential matrix between two PPMs, which in the previous chapters we simply called E, now carries the indices of the two images involved.
In the construction of the epipolar graph, we assumed that point correspondences are provided for a certain number of image pairs. If the number n of images is small, one can attempt to match all pairs of images, which has a quadratic cost, but in general one should single out in advance, for each image, a fixed number of images with which to attempt point matching. In some cases one has auxiliary information about the camera trajectory (e.g. in aerial photogrammetry, or in rotating table acquisitions), but when this information is missing, one has to exploit the images themselves.

A possible approach to the construction of the epipolar graph is the following. Extract salient points and descriptors (e.g. Scale Invariant Feature Transform (SIFT)) in all images and store them in a data structure that supports spatial queries (e.g. a k-d tree). For each descriptor, determine its first k neighbours and the images in which they occur. In a two-dimensional histogram, increment the accumulator (i, j) whenever a descriptor of image i has a descriptor of image j among its first k neighbours. In this way, the histogram approximates the number of features that the two images have in common, and provides a first approximation of the adjacency matrix of the epipolar graph. We must now decide for which edges of this graph to attempt point matching, and, to avoid quadratic growth, we must do this for a linear number of edges. Brown and Lowe (2003) choose k neighbours for each node, obtaining kn candidate pairs. Toldo et al. (2015) instead sequentially construct k spanning trees for the graph, removing each time the edges that were considered in the tree. The union of the resulting k disjoint spanning trees, each containing n − 1 edges, gives k(n − 1) candidate pairs. The first strategy favours the creation of loosely connected cliques, whereas overlapping spanning trees guarantee a more strongly connected graph.
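A MATLAB sketch of the neighbour-counting histogram described above, assuming desc holds all descriptors (one per row) and imgid records the image each descriptor comes from (names are illustrative; knnsearch belongs to the Statistics and Machine Learning Toolbox):

% Approximate the number of shared features between image pairs.
k = 6;                                    % neighbours per descriptor (illustrative value)
nimg = max(imgid);
idx = knnsearch(desc, desc, 'K', k + 1);  % nearest neighbours; the first is the point itself
H = zeros(nimg);                          % two-dimensional histogram (accumulator)
for a = 1:size(desc, 1)
    for b = idx(a, 2:end)
        H(imgid(a), imgid(b)) = H(imgid(a), imgid(b)) + 1;
    end
end
H = H + H';                               % symmetrise to approximate the adjacency weights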

14.1.2 The Case of Three Images


It is instructive to start by considering the case of three images. We assume that we know the essential matrices related to the image pairs (1,2) and (2,3), as shown in Fig. 14.4. We outline three ways of attacking the problem, which will be expanded in the following sections.

Fig. 14.4 Epipolar graph for the case of three images


Strategy 1
Let us proceed in an incremental manner. We apply the reconstruction process from two images (Listing 8.2) to the pair (I1, I2), obtaining a model. Assuming that some points in the model are visible in I3 (the dots in Fig. 14.5), these can be used to solve for the exterior orientation of I3 by resection (Listing 5.2). The model is then updated and augmented by the contribution of I3, allowing new points to be triangulated (Listing 8.1).

Fig. 14.5 The points shared by I1 and I2 (stars and dots) are those that make up the initial model. Image I3 is added, thanks to the model points visible in I3 (dots); their position will be refined by triangulation from three views. Other tie points are added to the model: those visible from I1 and I3 (crosses) and those visible from I2 and I3 (rhombuses)

Strategy 2
Let us independently apply the procedure for the reconstruction from two images (Listing 8.2) to each of the two pairs (I1, I2) and (I2, I3), obtaining two independent models separated by a similitude, and merge them into one by solving an absolute orientation problem (Listing 5.1).
Strategy 3
We aim to instantiate three mutually consistent PPMs and proceed to the final triangulation. From the factorisation of the essential matrices (Listing 7.2), we obtain two rigid motions in which each translation is known only up to its modulus, so we consider its versor instead.
If the moduli of the translations were known, we could concatenate the isometries in the following way:1
(14.2)
(14.3)
and then instantiate three mutually consistent PPMs and reconstruct the model by triangulation (Listing 8.1). Unfortunately, however, this composition does not work for versors. In other words, while the modulus of the only translation involved could be ignored in the two-view case (yielding a scaled reconstruction), in the three (or more)-view case the moduli of the relative translations cannot be fixed arbitrarily, since the relative translations are not independent of each other.
Adding the edge 1–3 to the graph solves this problem. Let us rewrite (14.3) so that the ratios between the moduli are made explicit:
(14.4)
From the essential matrix of the pair 1–3 one obtains a third relative motion and solves the above equation for the unknown ratios, as described in Proposition A.18. Note, however, that a global scaling factor remains undetermined; it is fixed arbitrarily through the norm of one of the translations, as in the case of two images.

This method is generalised by Zeller and Faugeras (1996) to m images, provided that the essential matrices of the pairs (2,1), (i,1) and (i,2) are available for every i = 3, …, m. In essence, the epipolar graph must have the particular structure illustrated in Fig. 14.6.
Fig. 14.6 Epipolar graph for the Zeller-Faugeras method: in the right-hand
arrangement, we see that it is made of circuits of length 3 sharing
edge 1– 2

14.1.3 Taxonomy


Methods for reconstruction from many images can be
divided, at a first level, between those that are global and
those that are based on partial reconstructions (Fig. 14.7).
The former are those that simultaneously consider all
images and all tie points, such as projective factorisation
(Sect. 16.1.1) and bundle adjustment (Sect. 14.4).
Fig. 14.7 Taxonomy of reconstruction methods from many images
In the second class fall the methods illustrated in the
previous
examples. They are characterised by processes that employ
subsets of the points and/or cameras. The final result is
obtained by an alignment or fusion of partial results.
Among these we distinguish point-based
methods from reference frame-based methods. The former
(Sect. 14.2) employ the points to compute the alignment,
that is, they use them to
solve some orientation problems (relative, exterior,
absolute), while the latter (Sect. 14.3) align (synchronise) the
reference frames of the
cameras, without taking the points into account, and
compute the model only at the end.

Triples of images can be used instead of pairs in the partial


strategies.

14.2 Point-Based Approaches


This class includes methods that stem from the first
two strategies outlined in Sect. 14.1.2.
14.2.1 Adjustment of Independent Models
The Adjustment of Independent Models (AIM) method,
well known in photogrammetry, is based on the
construction and subsequent
integration of many independent models built from
image pairs. In summary:
the images are grouped in pairs (with overlaps), and
many
independent models are computed by reconstruction
from two frames;
these models are related to each other by similitude
transformations; the 3D points they have in common are
used to bring them to a single reference frame.
In the first step, the image pairs correspond to a subset
of the edges of the epipolar graph. For example, one can
take the edges of a minimum spanning tree, where the cost
associated with an edge is inversely
proportional to the overlap.
With the tools we have seen so far, the last step can be
implemented with a series of cascading Orthogonal
Procrustes analysis (OPA);
however, this would be suboptimal. We will see in Sect.
15.1.1 how the alignment of two point clouds with OPA
can be generalised to the
simultaneous alignment of many point clouds, with
Generalised Procrustes analysis (GPA) (Crosilla and
Beinat 2002).
14.2.2 Incremental Reconstruction
Starting with an initial reconstruction from two images,
the
incremental approach grows it by iterating between the
following two steps:
orient one image with respect to the current model
via resection; update the model through triangulation
(or intersection).
This is why it is also called the resection-intersection method.
Initialisation. Two images are chosen, with a criterion that
must
balance a high number of homologous points with
sufficient separation of the two cameras (if too close, the
triangulation is poorly
conditioned). Reconstruction is then performed with these
two frames, as described in Sect. 8.4. After this
initialisation, the algorithm enters the main resection-
intersection loop, in which the next image is added
to the reconstruction:
Resection. The correspondences relative to the points
already
reconstructed are used to obtain 2D-3D matches. Based
on these, the exterior orientation of camera i with respect
to the current model is estimated;
Intersection. The current model is then updated in two ways:
the position of existing 3D points that were observed in image i is re-computed, by adding one equation to the triangulation;
new 3D points are triangulated, thanks to the correspondences that become available as a result of adding image i.
The cycle ends when all images have been considered.
The sequential order of processing successive images can
be
obtained, for example, by a depth-first visit of the
minimum spanning tree of the epipolar graph (a.k.a.
preordering), starting with the second image of the initial
pair.

Some comments and clarifications on the incremental


method:

immunity to outliers is crucial: one must use


Random Sample Consensus (RANSAC) (or other
robust techniques) not only in calculating epipolar
geometry but also in resection;
error containment is also essential: in triangulation it is
advisable to keep an eye on some indicator of bad
conditioning (near-parallel rays) and possibly discard
the ill-conditioned 3D points. In
addition, Bundle Adjustment (BA) should be run
with some frequency during the iteration and not
only at the end.
14.2.3 Hierarchical Reconstruction
A variant of the incremental reconstruction has been proposed by Gherardi et al. (2010). Instead of being ordered sequentially, the images are grouped hierarchically, resulting in a tree (or dendrogram) in which the leaves are the images and the nodes represent progressively larger clusters (Fig. 14.8). The distances used for clustering are derived from the weighted adjacency matrix of the epipolar graph, so the images are grouped according to their overlap.

Fig. 14.8 Dendrogram relative to the cherub reconstruction. Pairs ( 10, 11),
(6,7) and (2,3)
initiate the reconstructions; other images are added by resection; in the root
node and its left child, two reconstructions are merged
The algorithm proceeds from the leaves to the root of the
tree:
1. in the nodes where two leaves are joined (image pairs),
we proceed with the reconstruction as described in
Sect. 8.4;
2. in the nodes where a leaf joins an internal node, the reconstruction corresponding to the latter is increased by adding a single image via resection and updating the model via intersection (as in the previous sequential method);
3. in the nodes where two internal nodes join, the two independent reconstructions are merged by solving an absolute orientation (with OPA).
Step 1 is identical to the initialisation of the sequential
method, but is performed on many pairs instead of just
one; step 2 coincides with the iteration of the sequential
method. Step 3, on the other hand, is
reminiscent of the AIM. In fact, if the tree is perfectly
balanced, step 2 is never executed and we obtain a
cascade of absolute orientations that align the
independent models built in step 1.
If, on the contrary, the tree is totally unbalanced,
step 3 is never performed and we get the sequential
method.
The approach is computationally less expensive and
exhibits better error containment than the sequential
approach, while also mitigating the dependence on the
initial pair.

14.3 Frame-Based Approach


The approach of this section is based on the network of
relative
transformations between reference frames, and computes
the model
only at the end. It follows the third strategy outlined in
Sect. 14.1.2, so it aims to instantiate the PPMs:
(14.5)
given the (error-prone) estimates of a number of relative orientations, which correspond to the edges of the epipolar graph (Fig. 14.9).
Fig. 14.9 Epipolar graph for cherub images. The path touching the images in
numerical order forms a spanning tree. If the relative orientations were
composed using only these elements, precious redundancy would be
eliminated
If we fix the image I1 as a reference, the rotations and translations that are needed to instantiate the PPMs are not directly available, since, in general, image I1 does not overlap with all the other images. They must be computed by composing the relative orientations (we assume for now that the moduli of the translations are known). An immediate solution is to compute a spanning tree of the epipolar graph rooted in I1 and concatenate the relative orientations along the tree. In this way, each orientation is uniquely determined (there is only one path between two nodes in a tree), but not all the (redundant) measurements are taken into account, resulting in the loss of the ability to compensate for errors. The synchronisation procedure, which we will see in Sects. 14.3.1 and 14.3.2, instead implements a globally compensated solution, which takes into account all the relative orientations available as edges of the epipolar graph.
Before proceeding, we fix the notation and derive some relations. We associate with each node i of the epipolar graph the exterior orientation of the corresponding camera, expressed as position and angular attitude with respect to a world reference frame. This orientation can be assigned by a matrix representing a direct isometry, which is the inverse of the usual matrix found in the definition of the PPM, that is:
(14.6)
Observe that the COP appears explicitly in this matrix. The edge (i, j) is labelled with the relative orientation:
(14.7)
inferred from the corresponding essential matrix. The following compatibility relationship exists:
(14.8)
which, considering rotation and translation separately, becomes
(14.9)
(14.10)
In the following paragraphs, we will focus first on
rotations and then on translations. As we observed in the
case of three images, the
composition of isometries only works if the translations are
known with their moduli, but this condition is false for the
relative orientations
computed from the essential matrices. We will deal with this
problem in
Sect. 14.3.3.
14.3.1 Synchronisation of Rotations
Let us focus initially on recovering the angular attitude of the cameras: the goal is to compute the rotations that satisfy the compatibility constraint (14.9), that is:
(14.11)
The problem is known in the literature as rotation averaging (Hartley et al. 2013) or rotation synchronisation (Singer 2011). We immediately note that the solution is defined up to an arbitrary rotation applied to all the unknowns. To remain consistent with the solution for two images, we fix it so that the first rotation is the identity.
In the presence of noise, we want to solve a minimisation problem instead:
(14.12)

We will now devise a solution based on the eigendecomposition of a matrix, which solves a relaxation of problem (14.12). For this purpose, we introduce the 3m × 3m matrix Z that contains all the relative rotations as 3 × 3 blocks, the (i, j) block being the relative rotation between images i and j (we initially assume that all image pairs allow the computation of the relative orientation):
(14.13)
Note that the (j, i) block is the transpose of the (i, j) block, so Z is symmetric. Let X be the 3m × 3 matrix constructed by stacking the unknown rotation matrices:
(14.14)
From the compatibility constraint (14.9), it follows that Z can be written as
Z = X X^T    (14.15)
So rank(Z) = 3 and Z possesses three nonzero eigenvalues. We multiply both members by X, obtaining
Z X = m X    (14.16)
since X^T X = m I.
The relation (14.16) tells us that the columns of X are the three eigenvectors corresponding to the nonzero eigenvalues of Z, which are equal to m.
In a practical scenario, hardly all relative rotations are available; in such a case, the blocks of Z corresponding to the missing rotations are set to zero.
The i-th block of ZX then equals the i-th block of X multiplied by a scalar:
(14.17)
where d_i is the number of nonzero blocks in block-row i of Z, which is equal to m if all the relative orientations are available. Let us collect these d_i into a diagonal matrix D; then (14.16) becomes:
(D ⊗ I)^{-1} Z X = X    (14.18)
where the Kronecker product creates a diagonal matrix in which each d_i occurs three consecutive times on the diagonal, because each d_i must multiply a 3 × 3 block of X. In other words, the columns of X are the three eigenvectors of (D ⊗ I)^{-1} Z corresponding to the eigenvalue 1, the other eigenvalues being zero.
In the presence of noise, we take the three dominant eigenvectors of (D ⊗ I)^{-1} Z, and this is the solution to a relaxation of problem (14.12), in which the constraint that the blocks of X be rotation matrices is disregarded. The blocks of X are thus projected onto SO(3) only at the end, by computing the nearest rotation matrix according to Proposition A.14.
Considering the epipolar graph, it is easy to see that d_i is the degree of node i and D is the degree matrix of the graph. In the complete graph, all nodes have degree m, if we assume that each node has a self-loop, which in Z corresponds to the identity blocks on the diagonal.
The MATLAB implementation is given in Listing 14.1.
Listing 14.1 Rotation synchronisation
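A sketch of the eigenvector solution in MATLAB, assuming Z is the 3m × 3m block matrix of relative rotations (zero blocks where missing, identities on the diagonal) and A the adjacency matrix of the epipolar graph with ones on the diagonal (names are illustrative, not the book's Listing 14.1):

m = size(A, 1);
D = diag(sum(A, 2));                          % degree matrix of the graph
[V, ~] = eigs(kron(inv(D), eye(3)) * Z, 3);   % three dominant eigenvectors
R = cell(m, 1);
for i = 1:m
    B = real(V(3*i-2:3*i, :));                % 3x3 block of the eigenvector matrix
    [U, ~, W] = svd(B);                       % project onto the nearest rotation matrix
    R{i} = U * diag([1, 1, det(U*W')]) * W';
end
for i = m:-1:1, R{i} = R{i} * R{1}'; end      % gauge fixing: the first rotation becomes the identity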

14.3.2 Synchronisation of Translations
We now deal with the recovery of the COPs, under the assumption that the relative translations are known together with their moduli. The translation components of the PPMs will eventually be derived from the usual relation linking the translation to the rotation and to the centre of the i-th camera in Cartesian coordinates.
We consider the compatibility equation for translations (14.10):
(14.19)
which we rewrite as
(14.20)
where the relative translation is now expressed in the world reference frame, which we fixed to coincide with that of the first camera.
We now consider the matrix X obtained by juxtaposing the COPs as columns. Then (14.20) is written as
(14.21)
where the indicator vector
(14.22)
selects columns i and j from X. We see that this vector represents an edge of the epipolar graph G, and in fact it is one of the columns of the incidence matrix B of the graph. Hence, the equations that we can write, one for each edge, are expressed in matrix form as
(14.23)
where the matrix U contains the relative translations, expressed in the world frame, as columns. The matrix B has one row per node and one column per edge, but its rank is at most m − 1 if the epipolar graph is connected, so the system is underdetermined. However, the solution is clearly defined up to a common translation of all the centres, which is why we can arbitrarily fix the first centre at the origin (to remain consistent with the solution for two images) and remove it from the unknowns, thus leaving a matrix of full rank. Thanks to the "vec-trick", we can write:
(14.24)
If one considers the errors that inevitably plague the measurements, then an approximate solution must be sought. Russell et al. (2011) show that, among all the solutions satisfying the compatibility constraints, the least-squares solution of (14.24) is the one closest to the measurements in Euclidean norm.
The MATLAB implementation is given in Listing 14.2.

Listing 14.2 Translation synchronisation
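A MATLAB sketch of this least-squares solve, assuming E is an e × 2 list of edges (i, j), Ue is a 3 × e matrix whose columns are the relative translations rotated into the world frame, and the column for edge (i, j) equals c_i − c_j (all names and sign conventions are illustrative):

m = max(E(:));  e = size(E, 1);
B = sparse([E(:,1); E(:,2)], [(1:e)'; (1:e)'], [ones(e,1); -ones(e,1)], m, e);  % incidence matrix
Bred = B(2:end, :);                       % fix the first centre at the origin and drop it
C = [zeros(3,1), (Bred' \ Ue')'];         % least-squares centres, one per column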


The term synchronisation refers, in general, to the problem of recovering quantities attached to the nodes of a graph from relative measurements between adjacent nodes. The name comes from clock synchronisation, an application in which pairs of devices measure their time lags, and all must agree on a single "global" time that accounts for these lags.

14.3.3 Localisation from Bearings


In the context of SFM, the synchronisation of translations
does not solve
the problem of localizing the cameras, since the moduli of
the
translations are unknowns. In other words, instead of the
translation , we only can measure the unit vector:
( 14.25)
which is called bearing, 2 and represents the direction under
which the i- camera sees the j-camera, expressed in the
world reference frame, while the corresponds to the
same direction in the i-camera reference
frame .
Following Strategy 3, one should first compute the unknown moduli of the translations and then run translation synchronisation. This has been done by Arrigoni et al. (2015), but here we will follow a simpler approach (Brand et al. 2004) that directly computes the positions of the cameras.
We start with the translation synchronisation equation (14.24) and multiply both members by a block diagonal matrix built from the skew-symmetric matrices of the bearings:
(14.26)
obtaining
(14.27)
The right-hand member cancels out, as each column of U is multiplied on the left by the skew-symmetric matrix of the corresponding bearing, which amounts to the cross product of a vector with its own versor. The net effect is to replace U, which contains the unknown moduli, with a matrix that depends only on the known bearings. In the noise-free case, (14.27) has a unique solution up to a scale if and only if the coefficient matrix possesses a one-dimensional null space.
While translation synchronisation has a solution as soon as the epipolar graph is connected, the conditions on the graph topology under which (14.27) is solvable are more complex and involve the notion of rigidity of the graph (Arrigoni and Fusiello 2019). Generally, in order to be rigid, the graph must have more edges than those ensuring connectivity. For example, in the three-view case, the edge 1–3 was needed to solve the problem.

Since equation (14.27) is homogeneous, the solution is defined up to a scale, which potentially includes a sign, and this introduces a reflection of the solution (i.e. if X is a solution, so is −X). Geometrically, it means that the solution of (14.27) considers the bearings only as orientations and neglects the sense.3 The latter can be taken into account a posteriori by choosing, between X and −X, the solution that agrees with the sense of the bearings (Listing 14.3).
In the case of noise-affected bearings, the coefficient matrix will in general have full rank, and an approximate solution is sought by solving
(14.28)
As usual, the solution is the right singular vector of the coefficient matrix associated with the smallest singular value.
Listing 14.3 Localisation from bearings
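A MATLAB sketch of this step, assuming E is an e × 2 list of edges (i, j) and Uw is a 3 × e matrix of bearings in the world frame, taken proportional to c_i − c_j, with the first centre fixed at the origin (all names and sign conventions are illustrative):

m = max(E(:));  e = size(E, 1);
A = zeros(3*e, 3*(m-1));                               % unknowns: centres 2 ... m
for k = 1:e
    u = Uw(:, k);
    S = [0 -u(3) u(2); u(3) 0 -u(1); -u(2) u(1) 0];    % skew-symmetric matrix of the bearing
    i = E(k, 1);  j = E(k, 2);
    if i > 1, A(3*k-2:3*k, 3*(i-1)-2:3*(i-1)) =  S; end
    if j > 1, A(3*k-2:3*k, 3*(j-1)-2:3*(j-1)) = -S; end
end
[~, ~, V] = svd(A);                                     % least right singular vector
C = [zeros(3, 1), reshape(V(:, end), 3, m-1)];          % centres, up to scale and sign
% choose the sign that agrees with the sense of the bearings (check the first edge):
if dot(C(:, E(1,1)) - C(:, E(1,2)), Uw(:, 1)) < 0, C = -C; end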
The COPs, along with the rotations previously recovered, allow us to instantiate the PPMs:
(14.29)
and then proceed to triangulate the points.

14.4 Bundle Adjustment


Both point-based and frame-based methods are not
statistically
optimal, the former because they produce the final model
incrementally, the latter because they are global only in
compensating for relative
orientations but neglect points, which only played a role in
determining the relative orientations between pairs of
images. A genuinely global
method should optimise a cost that simultaneously includes all images and all tie points.
Optimal reconstruction from many views is naturally
formulated as the Bundle Adjustment (BA) problem: one
wants to minimise the overall reprojection error of the tie
points in the images in which they are
visible, with respect to the 3D coordinates of the tie
points and the orientation (exterior and interior) of the
images.
The name comes from a geometric interpretation of
the problem: each image with its optical rays
corresponding to the tie points can be seen as a bundle of
rays, and “adjustment” refers to the process of
orienting bundles in space until homologous optical rays
intersect. If control points are present, the rays should
also touch them.
Analytically, it is a matter of adjusting both the m cameras and the n tie points so that the sum of squared distances between the j-th tie point reprojected via the i-th camera and the corresponding measured point is as small as possible, in each image where the point appears. In formulae (the notation resumes that introduced in (9.8)):
(14.30)

This is the reprojection error that we have already


employed in Chap. 9, and in fact BA is nothing more than a
simultaneous regression of the
PPMs (Sect. 9.3) and the 3D points (Sect. 9.5).
Equation (14.30) is the cost function of a non-linear least-
squares
problem, for which there are no closed-form solutions, but
only iterative
techniques (such as the Levenberg-Marquardt (LM)
algorithm) that
require a starting point close enough to the global
minimum. Thus, BA does not solve the SFM problem by
itself, but can be understood as a way to improve
suboptimal solutions, such as those obtained by the
methods described in the first part of the chapter.

Let us consider BA in the case of two images. We know that five points are in principle sufficient to determine the essential matrix, which leads to a reconstruction up to a similitude. Let us confirm this result by a simple equation count. Five tie points in two images give 20 equations; the unknowns are the 3D coordinates of the tie points, which makes 15, plus the 12 unknowns corresponding to the orientations of the 2 cameras, thus giving 27 unknowns. The remaining seven degrees of freedom correspond to those of a similitude. In other words, out of the 12 camera unknowns, only 5 can be determined, which represent the degrees of freedom of the relative orientation up to scale.

For further discussion of BA, see Triggs et al. (2000), Engels et al. (2006), Konolige (2010).

14.4.1 Jacobian of Bundle Adjustment


The BA residual is the reprojection error, and the unknowns are the camera parameters (intrinsic and extrinsic) and the point coordinates. As such, the derivatives with respect to each of these two sets of unknowns have already been computed in Chap. 9. In particular, expressions (9.12), (9.13), (9.16) and (9.23) are summarised here for the readers' convenience:
(14.31)
The Jacobian of the least-squares problem corresponding to (14.30) is therefore composed of blocks of this type. Let us define:
(14.32)
the matrix of partial derivatives of the residual of point j in image i with respect to the parameters of camera k. And similarly:
(14.33)
the matrix of partial derivatives of the residual of point j in image i with respect to the coordinates of point k. It is easy to see that these blocks vanish unless k coincides with the camera i or with the point j, respectively. Thus, the Jacobian matrix has a sparse block structure, called primary structure and represented graphically in the following diagram:

If we consider only one image and control points instead of tie points, only the A blocks remain, which correspond to the Jacobian of the residual of non-linear calibration (n is the number of control points). If, on the other hand, the cameras are known, only the B blocks remain, which correspond to the Jacobian of the residual of the triangulation. This is consistent with the observation that BA is tantamount to a simultaneous regression of the PPMs and of the tie points.
In the beginning of this chapter, we assumed that
intrinsic
parameters are known, whereas in this section we are
dealing with the general case in which both exterior and
interior orientations are
unknown. It is understood that the calibrated case is easily
dealt with by omitting the intrinsic parameters from the
unknowns.
Note that in practice the structure of the Jacobian matrix (Fig. 14.10) also reflects the visibility of the points, since the two rows corresponding to the residual of point j in image i are present only if that point is visible in that image. The shape of the matrix derived from this observation is called secondary structure.
Fig. 14.10 Example of primary (left) and secondary (right) structure for a
Jacobian matrix of a BA with 10 images and 15 tie points
Because one can scale, rotate and translate the entire reconstruction without altering the reprojection error, the Jacobian has a rank deficiency corresponding to the degrees of freedom of a similitude, which is seven. As a consequence, the normal equation of the Gauss-Newton method is underdetermined. There are several ways to solve this problem:
remove the degrees of freedom by fixing the coordinates of some points (which become control points) and/or some COPs;
solve the normal equations with the pseudoinverse, which in the underdetermined case returns the minimum norm solution, thus implicitly imposing a constraint on the solution, as in our Gauss-Newton implementation (Listing C.1);
add a diagonal damping term in the normal equation, which makes up for the rank drop, as the LM method does (Sect. C.2.3).
With reference to the terminology introduced in Chap. 8,
the first
solution leads to an identical reconstruction, while the last
two lead to a Euclidean reconstruction.

In photogrammetry, an overabundant number of control points is used in the BA, which, in addition to fixing the degrees of freedom, also allows one to "bend" the model and compensate for any systematic errors.

14.4.2 Reduced System
The primary structure of the Jacobian matrix that we observed in the previous section can be exploited to reduce the computational cost of solving the normal equations, through the formulation of a reduced system of equations.
The normal equation that is solved at each Gauss-Newton step is
(14.34)
where the vector of unknowns contains 3n parameters for the points plus the parameters of the m cameras (only the exterior orientation if we exclude the intrinsic parameters), and the residuals are those appearing in (14.30). The residuals number 2mn if all points are visible in all cameras, although typically their number is much smaller. The matrix H is square, with one row and one column per unknown: its size is largely dominated by n, which can be a few orders of magnitude larger than m.
It is therefore a matter of solving a linear system that can be very large, but one can take advantage of the sparse structure of the Jacobian matrix. In particular, we partition the Jacobian into two parts, the one related to the cameras and the one related to the points:
(14.35)
Then the matrix H (Fig. 14.11) and the normal equation also turn out to be partitioned as
(14.36)

Fig. 14.11 Structure of matrix H for the same example of Fig. 14.10. The fact that the two NW and SE blocks are block diagonal depends on the primary structure, while the sparsity pattern of the NE and SW blocks depends on the secondary structure, that is, visibility
To simplify the notation, we rewrite it as
(14.37)

Note that the two diagonal blocks are block diagonal matrices. We multiply both members on the left by
(14.38)
thereby transforming the coefficient matrix into a lower block triangular one:
(14.39)
and therefore solve for the block of unknowns related to the cameras (cf. Gaussian elimination):
(14.40)
This system is smaller than the original, and the inversion of the point block is straightforward due to its block diagonal structure, so we get a benefit from this rewriting. The other block of unknowns is derived with
(14.41)
The secondary structure of J is reflected in the reduced system matrix. If each image sees only a fraction of the points, that matrix will be sparse; in particular, if the images are ordered such that index differences reflect the degree of overlap, the reduced system matrix has a band structure, and this can be usefully exploited to reduce the computational cost.
The function bundleadj (listing omitted for space reasons)
implements BA with the LM method and the reduced
system. It is
possible to choose how many intrinsic parameters to
include among the unknowns, and whether they are all the
same or different for each
image. Also, one can designate some 3D points as control
points.
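A sketch of the reduced-system solve at the core of such an implementation, assuming the partitioned blocks of the normal equations have already been formed (names are illustrative):

% Normal equations partitioned as [Hcc Hcp; Hcp' Hpp] * [dc; dp] = [gc; gp],
% with Hpp block diagonal (one 3x3 block per point).
np = size(Hpp, 1) / 3;
Hpp_inv = sparse(3*np, 3*np);
for k = 1:np
    r = 3*k-2:3*k;
    Hpp_inv(r, r) = inv(Hpp(r, r));       % invert each 3x3 block separately
end
S  = Hcc - Hcp * Hpp_inv * Hcp';          % reduced system matrix (Schur complement)
dc = S \ (gc - Hcp * Hpp_inv * gp);       % solve for the camera unknowns
dp = Hpp_inv * (gp - Hcp' * dc);          % back-substitute for the point unknowns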

In our implementation of BA, what guarantees that we do not run into some singularity of the Euler representation of rotations? Nothing, in fact. We would need to adopt an incremental parameterisation of the exterior orientation of each image, as was done in the parameterisation of F.
References
F. Arrigoni and A. Fusiello. Bearing-based network localizability: A unifying view. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9): 2049–2069, 2019. ISSN 0162-8828. https://doi.org/10.1109/TPAMI.2018.2848225.
Federica Arrigoni and Andrea Fusiello. Synchronization problems in computer vision with closed-form solutions. International Journal of Computer Vision, 128(1): 26–52, Jan 2020. ISSN 1573-1405. https://doi.org/10.1007/s11263-019-01240-x.
Federica Arrigoni, Andrea Fusiello, and Beatrice Rossi. On computing the translations norm in the epipolar graph. In Proceedings of the International Conference on 3D Vision (3DV), pages 300–308. IEEE, 2015. https://doi.org/10.1109/3DV.2015.41.
M. Brand, M. Antone, and S. Teller. Spectral solution of large-scale extrinsic camera calibration as a graph embedding problem. 2004.
M. Brown and D. Lowe. Recognising panoramas. In Proceedings of the 9th International Conference on Computer Vision, volume 2, pages 1218–1225, Nice, October 2003.
F. Crosilla and A. Beinat. Use of generalised procrustes analysis for the photogrammetric block adjustment by independent models. ISPRS Journal of Photogrammetry & Remote Sensing, 56(3): 195–209, April 2002.
C. Engels, H. Stewénius, and D. Nistér. Bundle adjustment rules. In Photogrammetric Computer Vision (PCV). ISPRS, September 2006.
Riccardo Gherardi, Michela Farenzena, and Andrea Fusiello. Improving the efficiency of hierarchical structure-and-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010.
Richard I. Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging. International Journal of Computer Vision, 2013.
Kurt Konolige. Sparse sparse bundle adjustment. In British Machine Vision Conference, Aberystwyth, Wales, 2010.
W. J. Russell, D. J. Klein, and J. P. Hespanha. Optimal estimation on the graph cycle space. IEEE Transactions on Signal Processing, 59(6): 2834–2846, June 2011.
A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1): 20–36, 2011.
Roberto Toldo, Riccardo Gherardi, Michela Farenzena, and Andrea Fusiello. Hierarchical structure-and-motion recovery from uncalibrated images. Computer Vision and Image Understanding, 2015.
Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment—a modern synthesis. In Proceedings of the International Workshop on Vision Algorithms, pages 298–372. Springer, Berlin, 2000.
C. Zeller and O. Faugeras. Camera self-calibration from video sequences: the Kruppa equations revisited. Research Report 2793, INRIA, February 1996.
Footnotes
1 Why the composition follows these formulae will be clear in light of the
notation we will introduce in Sect. 14.3.

2 In analogy to the bearing angle of a point, which in topography is defined as


the angle between the cardinal north direction and the observer-point direction.

3 Orientation and sense together determine the direction of a vector.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_15

15. 3D Registration
Andrea Fusiello1
(1) University of Udine, Udine, Italy
Email: [email protected]

15.1 Introduction
The purpose of registration or alignment of partial 3D models is to bring
to bring
them all into the same reference frame by means of a
suitable
transformation. This problem has strong similarities
with image mosaicking, which we will discuss in
Chap. 18.
This chapter takes its starting point from the Adjustment of Independent Models (AIM) introduced in Chap. 14 and completes it by
describing (in Sect. 15.1.1) how to extend the alignment
of two sets of 3D points to the simultaneous alignment of
many sets, with given
correspondences. Such sets of 3D points are also called
point clouds in this context.
The registration problem also arises when considering model acquisition by a range sensor. In fact, range sensors such as the ones we studied in Chap. 13 typically fail to capture the shape of an object with a single acquisition: many are needed, each spanning a part of the object's surface, possibly with some overlap (Fig.
15.1). These partial models of the object are each
expressed in its own reference frame (related to the
position of the sensor).
Fig. 15.1 Registration of partial 3D models. On the right, eight range images of
an object, each in its own reference frame. On the left, all images superimposed
after registration. Courtesy of F. Arrigoni
While in the AIM the correspondences are given by
construction, the partial point clouds produced by range
sensors do not come with
correspondences. In Sect. 15.2.2 we will present an
algorithm called Iterative Closest Point (ICP) that aligns
two sets of points without
knowing the correspondences, as long as the initial
misalignment is moderate. We will finally (Sect. 15.2.3)
mention how to extend ICP for registration of many point
clouds by exploiting synchronisation (Sect. 14.3).
For a general framing of the problem, see also
Bernardini and Rushmeier (2002).
15.1.1 Generalised Procrustes Analysis
The registration of more than two point clouds in a single
reference
frame can be simply obtained by concatenating registrations
of
overlapping pairs. This procedure, however, does not lead to
the optimal solution. For example, if we have three
overlapping point clouds and we
align sets one and two and then sets one and three, we are
not
necessarily minimising the same cost function between
sets two and three. We therefore need adjustment
procedures that operate globally, that is, take into account
all sets simultaneously.
If correspondences between the different point clouds
are available, we can merge them in a single operation,
applying to each the similitude that brings it into a common
reference frame, thus generalising the
absolute orientation via Orthogonal Procrustes analysis
(OPA) that we
saw earlier (Sect. 5.2.1). This technique is in fact
referred to as Generalised Procrustes analysis
(GPA).
Given the matrices representing the m sets of points, the alignment is achieved by minimising the following cost function:
(15.1)
where the unknowns are the parameters of the similitudes. At the core of this technique lies the following observation by Commandeur (1991):
(15.2)
where the mean, or centroid, is defined as:
(15.3)

The minimisation of the residual rewritten as in (15.2) turns out to be expressible as an iterative process, in which one alternately computes the centroid and aligns the sets to it, each separately with OPA, as seen in Listing 15.1. Convergence of the algorithm is proved in (Commandeur 1991), although not necessarily to the global minimum.
In the realistic case in which not all correspondences are available in all point clouds, one must multiply each set by a diagonal matrix that contains a 1 on the diagonal where the corresponding column of the set is specified and 0 otherwise. The formulae for the solution of the absolute orientation are adapted straightforwardly, with the only caveat that the vector used for centring must be replaced accordingly and that the formula for the centroid becomes (Crosilla & Beinat 2002):
(15.4)
Listing 15.1 Generalised Procrustes analysis
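A MATLAB sketch of this alternation, assuming full correspondences and a cell array D of m point sets, each 3 × p with columns in correspondence (names, the fixed iteration count and the scale formula are illustrative, not the book's listing):

% Generalised Procrustes alignment: alternate centroid computation and OPA.
m = numel(D);
for it = 1:50
    C = zeros(size(D{1}));
    for i = 1:m, C = C + D{i} / m; end          % centroid of the current sets
    for i = 1:m                                 % align each set to the centroid with OPA
        mu_d = mean(D{i}, 2);  mu_c = mean(C, 2);
        [U, ~, V] = svd((C - mu_c) * (D{i} - mu_d)');
        R = U * diag([1, 1, det(U*V')]) * V';   % rotation (excluding reflections)
        s = trace(R * (D{i} - mu_d) * (C - mu_c)') / sum(sum((D{i} - mu_d).^2));
        D{i} = s * R * (D{i} - mu_d) + mu_c;    % apply the similitude
    end
end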

15.2 Correspondence-Less Methods


When correspondences are not given, one can either compute them and fall back to OPA/GPA methods, or devise different techniques that do not explicitly need them. As for the first option, 3D salient points can be extracted and matched, along the lines of what is done in Structure from Motion (SFM) (Chap. 14). See the review of Tombari et al. (2013) in this regard. In this chapter we will deal with the second option.
15.2.1 Registration of Two Point Clouds
The OPA computes the alignment (or registration) of two sets
of
corresponding 3D points. There are practical cases,
however, where two point clouds must be registered for
which the correspondences are not known. In such a case,
an algorithm called ICP (Besl & McKay 1992;
Chen & Medioni 1992) can be used, which
simultaneously solves the correspondence problem and estimates the direct isometry.
15.2.2 Iterative Closest Point
Given two sets of points, the ICP algorithm computes, for each point of the first set, the point of the second set closest to it. It then assumes that the closest points are the true correspondents and computes the direct isometry that best aligns the corresponding points (with OPA, Sect. 5.2). The obtained isometry is applied to the first set and the process is iterated until convergence.
The idea is that, even if the correspondences assumed in the iterations are not the true ones, the resulting transformation brings the two sets closer together. Figure 15.2 shows an example of how the correspondences converge to the correct ones. Besl and McKay (1992) prove that the algorithm converges to a local minimum of the error:
(15.5)

To ensure convergence to the global minimum, a good initial estimate of the isometry between the sets of points is required (in practice, the initial rotation must not differ too much from the correct one). The initial alignment can be obtained:
by measuring the motion of the sensor;
by sampling the set of all possible transformations;
by considering the moments of the two sets (mean and principal axes).
Fig. 15.2 Evolution of correspondences assumed by ICP in subsequent
iterations. It can be seen that by starting with very wrong matches, the
algorithm quickly converges to correct ones,
represented in this example by the identity matrix
The algorithm just described assumes that one set is contained in the other, while in reality the two sets overlap only partially. In this case there will necessarily be mismatches, since the nearest-neighbour-based procedure always finds a match for every point of the first set, but not every such point actually has a corresponding point in the other set. These mismatches spoil the estimate obtained by OPA and prevent convergence: even if the two sets were initially perfectly aligned, ICP could diverge. Mismatches can be treated as outliers and the problem addressed with the tools of robust statistics.
The code shown in Listing 15.2 applies a robust weight
function to underweight corresponding points according
to their distance (Fig.
15.3).
Fig. 15.3 Histogram of distances between closest points in two partially
overlapping but aligned sets. Matches with distances higher than the vertical
line are deemed as outliers
Listing 15.2 Iterative closest point
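A MATLAB sketch of a basic ICP iteration with a simple robust weighting, assuming A (3 × n) is moved onto B (3 × N), a reasonable initial alignment, and knnsearch from the Statistics and Machine Learning Toolbox (names, iteration count and threshold are illustrative):

R = eye(3);  t = zeros(3, 1);
for it = 1:30
    Amov = R * A + t;
    idx = knnsearch(B', Amov')';            % closest point in B for each point of A
    Bc = B(:, idx);
    d = sqrt(sum((Amov - Bc).^2, 1));
    w = double(d < 3 * median(d));           % downweight (here: discard) distant matches
    muA = (Amov .* w) * ones(size(w, 2), 1) / sum(w);   % weighted centroids
    muB = (Bc .* w) * ones(size(w, 2), 1) / sum(w);
    M = ((Bc - muB) .* w) * (Amov - muA)';
    [U, ~, V] = svd(M);
    Rk = U * diag([1, 1, det(U*V')]) * V';   % incremental rotation (OPA step)
    tk = muB - Rk * muA;
    R = Rk * R;  t = Rk * t + tk;            % compose with the current estimate
end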

ICP can also be applied to surfaces, that is, to points connected in a polygonal mesh. In this case, given a set of points and a surface, for every point there exists at least one point on the surface that is closest to it, and we take it as the corresponding one.
In addition, it is possible to introduce a heuristic that eliminates matches on the border of the mesh, based on the observation that, in the case of partially overlapping surfaces, it is likely that points belonging to the unique part of one surface will be matched with points on the border of the other surface. These correspondences are obviously wrong (since these points should not be paired) and are therefore eliminated.
In the case of surfaces, to make point matching faster, one can also abandon the search for nearest points and match a point with the point of the surface encountered along the normal at that point (normal shooting). For further details we refer the reader to (Rusinkiewicz 2001).
The step that dominates the computational cost of ICP is the search for matches. The naive method takes O(n²) time, where n is the cardinality of the sets. Using k-d trees it can be done in O(n log n). Other speed-up techniques include caching of the matches (Simon et al. 1994) and computing the distance transform in the space of the set that remains stationary (Fitzgibbon 2003).
15.2.3 Registration of Many Point Clouds
In the case of simultaneous alignment of many point clouds without correspondences, one is faced with the same problem that ICP solves for two sets, so one immediately thinks of generalising the latter. In fact this is possible, for example, by extending the cost function (15.5) to many sets and numerically solving the resulting non-linear least-squares problem, as in Fantoni et al. (2012) (Fig. 15.4). Or one can keep the main loop of ICP with nearest point search and replace OPA with GPA, as done by Toldo et al. (2010).

Fig. 15.4 From left to right: 27 point clouds in the initial reference frames, point cloud after alignment (points are coloured according to the source image), surface reconstructed using the technique of Kazhdan et al. (2006). Images taken from Fantoni et al. (2012)

The methods we have mentioned so far involve points; however, if we wanted to avoid this step and work in the space of reference frames, we could exploit the technique of synchronisation already introduced in Chap. 14. In this case the measurements attached to the graph edges are rotations and translations (direct isometries) obtained, for example, by ICP applied to pairs of point clouds (Fig. 15.5). Arrigoni et al. (2016) suggest proceeding as follows:
apply ICP to all pairs with sufficient overlap to obtain the direct isometries that link the pairs;
synchronise the resulting graph for rotations and translations, as seen in Sect. 14.3 (here we do not have the complication of unknown moduli);
to each set apply the isometry resulting from synchronisation, to obtain the alignment.

Fig. 15.5 Synchronisation of direct isometries applied to the alignment of 3D point clouds. Courtesy of F. Arrigoni

References
Federica Arrigoni, Beatrice Rossi, and Andrea Fusiello. Global registration of 3D point sets via LRS decomposition. In Proceedings of the 14th European Conference on Computer Vision, pages 489–504, 2016.
Fausto Bernardini and Holly E. Rushmeier. The 3D model acquisition pipeline. Computer Graphics Forum, 21(2): 149–172, 2002.
P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2): 239–256, February 1992.
Y. Chen and G. Medioni. Object modeling by registration of multiple range images. Image and Vision Computing, 10(3): 145–155, 1992.
J. J. F. Commandeur. Matching configurations. DSWO Press, Leiden, 1991.
F. Crosilla and A. Beinat. Use of generalised procrustes analysis for the photogrammetric block adjustment by independent models. ISPRS Journal of Photogrammetry & Remote Sensing, 56(3): 195–209, April 2002.
S. Fantoni, U. Castellani, and A. Fusiello. Accurate and automatic alignment of range surfaces. In Second Joint 3DIM/3DPVT Conference: 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012.
A. Fitzgibbon. Robust registration of 2D and 3D point sets. Image and Vision Computing, 21(13-14): 1145–1153, 2003.
Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, pages 61–70. Eurographics Association, 2006.
S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In IEEE International Conference on 3-D Digital Imaging and Modeling, Quebec City (Canada), 2001.
D. A. Simon, M. Hebert, and T. Kanade. Real-time 3-D pose estimation using a high-speed range sensor. In Proceedings of the 1994 IEEE International Conference on Robotics and Automation, pages 2235–2241, May 1994.
R. Toldo, A. Beinat, and F. Crosilla. Global registration of multiple point clouds embedding the generalized procrustes analysis into an ICP framework. In International Symposium on 3D Data Processing, Visualization and Transmission, 2010.
Federico Tombari, Samuele Salti, and Luigi Di Stefano. Performance evaluation of 3D keypoint detectors. International Journal of Computer Vision, 102(1): 198–220, March 2013.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_16

16. Multi-view Projective Reconstruction and Autocalibration
Andrea Fusiello
(1) University of Udine, Udine, Italy
Email: [email protected]

16.1 Introduction
In this chapter we will study the problem of computing
a projective reconstruction from multiple uncalibrated
images. The point-based methods of Chap. 14 can be—
to some extent—adapted to the
uncalibrated case. For example, with the Adjustment of
Independent Models (AIM) one would obtain a set of
projective reconstructions
related to each other by an unknown projectivity (in other
words, each of these defines its own projective reference
frame). Less obvious is how to modify frame-based
techniques.
In the following we will describe a global method
proposed by Sturm and Triggs (1996) that is specific for
projective reconstruction and is
based on the factorisation method of Tomasi and Kanade
(1992). We will then address the problem of self-
calibration, that is, promoting the reconstruction from
projective to Euclidean, which is equivalent to
estimating the intrinsic parameters.

16.1.1 Sturm-Triggs Factorisation Method
Consider m cameras framing n points in three-dimensional space. The usual perspective projection equation (with the scaling factors made explicit) writes
(16.1)
or, in matrix form:
(16.2)
It tells us that W factorises into the product of a 3m × 4 matrix P and a 4 × n matrix M, and because of this it has rank four.
Let us rewrite (16.2) in compact form as
(16.3)
where
(16.4)
and the symbol denotes the Hadamard product, that is, the element-by-element product. Note that the matrix Z has m rows, while Q has 3m rows since its elements are 3 × 1 blocks, and so in order to match the dimensions, one must triplicate each row of Z via the Kronecker product by the vector (1, 1, 1)ᵀ.
In this equation only Q is known and everything else is
unknown.
However, if we provisionally assume that Z is known, the
matrix W
becomes known, and we can compute its Singular Value
Decomposition (SVD):
( 16.5)
In the ideal case where the data is unaffected by error,
the rank of W is four and therefore
. Then, only the
first four columns of U and V contribute to the product
between
matrices. Let therefore and be the matrices
formed by the first four columns of U and V , respectively.
We can then write the
compact SVD of W as
( 16.6)
Comparing (16.6) with (16.2), we can identify:
(16.7)
thus obtaining the sought reconstruction. Note that the choice to include the singular values in P is arbitrary. It could have been included in M or split between the two. This however is consistent with the fact that the reconstruction is up to a projectivity, which therefore also absorbs this arbitrary choice.
In the real case where the data is affected by errors, the rank of W is not 4. By the Eckart-Young theorem, forcing the rank to four yields the solution that minimises the error:
(16.8)
We are now left with the problem of estimating the unknown depths, which can be calculated once P and M are known. In fact, for image i (16.2) writes:
(16.9)
or:
(16.10)
which is analogous to (5.21) already solved in Sect. 5.3.2.
An iterative scheme that is often adopted in cases like this is to alternate the solution of two problems: in one step we estimate Z given P and M; in the next step, we estimate P and M given Z; and iterate.
To avoid convergence to the trivial solution Z = 0, the matrix Z must be normalised in some way at each iteration, for example, by fixing the norm of its rows or columns, or other subsets of its entries. Nasihatkon et al. (2015) show that there are other trivial solutions besides Z = 0 and that the choice of normalisation criterion is crucial to exclude all or part of them.
In the implementation shown in Listing 16.1, we chose
to normalise the rows of Z, a simple and widely used
solution, but one that actually leaves open the possibility
of convergence to trivial solutions in which an entire
column vanishes (see Nasihatkon et al. 2015 for details).
Note also that preconditioning is applied to the
coordinates of the
input image points, as in the fundamental matrix
computation (Sect. 6. 4): experimentally it is observed
that this step is crucial to obtain a good result.
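A minimal MATLAB sketch of the alternation just described is shown below; it omits the preconditioning step and uses row normalisation of Z, and the function name and data layout (a 3 x n x m array of homogeneous image points) are illustrative assumptions, not the book's Listing 16.1.

function [P, M] = st_factorization_sketch(q, n_iter)
% q: 3 x n x m array of homogeneous image points (m cameras, n points).
% Alternates rank-4 factorisation of W with re-estimation of the depths Z.
[~, n, m] = size(q);
Q = reshape(permute(q, [1 3 2]), 3*m, n);    % stacked image points, 3m x n
Z = ones(m, n);                              % initial projective depths
for it = 1:n_iter
    Zr = Z ./ sqrt(sum(Z.^2, 2));            % normalise the rows of Z
    W  = kron(Zr, ones(3,1)) .* Q;           % rescaled measurement matrix
    [U, S, V] = svd(W, 'econ');
    P  = U(:,1:4) * S(1:4,1:4);              % stacked 3x4 camera matrices
    M  = V(:,1:4)';                          % 4 x n projective structure
    for i = 1:m                              % re-estimate the depths given P and M
        Wi = P(3*i-2:3*i, :) * M;            % reprojected homogeneous points
        Z(i,:) = sum(Wi .* q(:,:,i), 1) ./ sum(q(:,:,i).^2, 1);
    end
end
end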
16.2 Autocalibration
Although some useful information can be derived from a
projective
reconstruction (Robert and Faugeras 1995), what we
would ultimately like to obtain is at least a Euclidean
reconstruction of the model, which
differs from the true model by a similitude. The latter
consists of a direct isometry (due to the arbitrary choice of
world reference frame) plus a
uniform change of scale (due to the well-known
depth-speed ambiguity).
Fortunately, if some reasonable conditions are met (e.g.
at least three cameras with constant intrinsic parameters), it
is possible to switch to the Euclidean reconstruction, that
is, the one we would get in the
calibrated case. This process is called autocalibration, and can
be
achieved through two different ways: we can start from
the projective reconstruction and then upgrade it to
Euclidean (Sect. 16.2.1), or
recover the intrinsic parameters from the fundamental
matrices and fall back to the calibrated case (Sect. 16.2.2).

In practice, however, the camera parameters are not


completely
unknown, since an approximation of them can be inferred
from EXIF (Exchangeable Image File Format) data of the
image. EXIF is a format for specifying metadata
implemented by the major image file formats (including
JPEG). Metadata includes static information such as the
model and manufacturer of the camera, and
information about the shot, such as dates and times,
geographical location (if available), aperture, speed,
focal length, white balance, etc.

16.2.1 Absolute Quadric Constraint
In this section we will follow the first of the two paths
outlined above, namely, performing a projective
reconstruction followed by Euclidean upgrade.
The projective reconstruction differs from the Euclidean
reconstruction by an unknown projectivity T, which can be
interpreted as an appropriate change of projective basis.
The following passages replicate those of Sect. 8.6. From
the
projective reconstruction, we obtain a set of perspective
projection matrices . Without loss of generality,
we assume a particular expression for the first of
them, which arbitrarily fixes the projective basis:
( 16. 11)
We want to arrive at a Euclidean reconstruction, which,
without loss of generality, can be written as
( 16. 12)
via estimation of the non-singular matrix , which
transforms the projective reconstruction into the Euclidean
one:
( 16. 13)
As we have already observed in Sect. 8.6, the assumptions about the specific form of the first Perspective Projection Matrix (PPM) imply that T should be written as:
(16.14)
where an arbitrary vector of three elements and a scalar s appear. With this parameterisation T is clearly non-singular, and, being defined up to a scaling factor, it depends on eight parameters (if we fix, e.g., s = 1).
Proceeding as in Sect. 8.6, we obtain:
(16.15)
This is the basic equation, which links the unknowns to the available data. Eliminating the latter, we obtain
(16.16)
Note that (16.16) contains five equations, because the matrices of both members are symmetric, and the homogeneity reduces the number of equations by one.
This is the basic equation for autocalibration, called the absolute quadric constraint (Triggs 1997), relating the unknowns to the available data. The name comes from its geometrical interpretation, where one member is a special conic, called the dual image of the absolute conic, and the other is a quadric, the dual absolute quadric. The interested reader can look up the details, e.g., in Hartley and Zisserman (2003).
16.2.1.1 Solution Strategies
We are thus required to solve (16.16) with respect to the unknowns. The collineation T that upgrades cameras from projective to Euclidean is obtained by decomposing the dual absolute quadric as
(16.17)
The dual absolute quadric is a symmetric matrix that is also singular and defined up to a scale factor, which results in 8 d.o.f., the same as T.
Solving (16.16) entails getting rid of the unknown scale factor contained in the proportionality symbol. Several strategies have been proposed:
Introduce the scale factor explicitly as an additional unknown (Heyden and Åström 1996):
(16.18)
This gives six equations but introduces one additional unknown (the net sum is 5).
Eliminate it by using the same idea as the cross product for 3D vectors (Triggs 1997):
where the operator denotes the column-wise vectorisation with the upper portion excluded, as the matrices in (16.16) are symmetric. This is tantamount to saying that every 2 × 2 minor of B is zero. There are 15 different order-2 minors of a 6 × 2 matrix, but only 5 equations are independent.
Normalise both sides of the equation (Pollefeys et al. 1998):
(16.19)

In any case, a non-linear least-squares problem has to be solved. Available numerical techniques (based on the Gauss-Newton method) are iterative, and require an estimate of the solution to start. This can be obtained by making an educated guess about skew angle, principal point and aspect ratio, and solving the linear problem that results.
Linear Solution
Linear equations on the dual absolute quadric are generated from the zero entries of the dual image of the absolute conic, because this eliminates the scale factor (Pollefeys et al. 2002):
where the k-th row of the camera matrix appears. If the principal point is known (and moved to the origin), two off-diagonal entries vanish and this gives two linear constraints. If, in addition, there is no skewing, a third entry vanishes. A known aspect ratio provides a further constraint.
Likewise, linear constraints on can be obtained from
the equality of elements in the upper (or lower) triangular
part of (because is symmetric).
In order to be able to solve linearly for , at least ten
linear
equations must be stacked up, to form a homogeneous linear
system,
which can be solved as usual (via SVD). Singularity of can
be enforced a posteriori by forcing the smallest singular
value to zero.
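The following MATLAB sketch illustrates this linear estimation under the assumptions of known principal point (at the origin), zero skew and unit aspect ratio; the function names and the ordering of the ten unknowns are illustrative choices, not the book's code.

function Omega = daq_linear_sketch(P)
% P: cell array of 3x4 projective camera matrices.
% Stacks the linear constraints on the 10 entries of the (symmetric) dual
% absolute quadric and solves the homogeneous system via SVD.
A = [];
for i = 1:numel(P)
    p = P{i};
    A = [A;
         coef(p(1,:), p(3,:));                         % entry (1,3) of K*K' is zero
         coef(p(2,:), p(3,:));                         % entry (2,3) is zero
         coef(p(1,:), p(2,:));                         % entry (1,2) is zero (no skew)
         coef(p(1,:), p(1,:)) - coef(p(2,:), p(2,:))]; % unit aspect ratio
end
[~, ~, V] = svd(A);
x = V(:, end);
Omega = [x(1) x(2) x(3) x(4);  x(2) x(5) x(6) x(7);
         x(3) x(6) x(8) x(9);  x(4) x(7) x(9) x(10)];
[U, S, W] = svd(Omega);  S(4,4) = 0;                   % enforce singularity a posteriori
Omega = U * S * W';
end

function r = coef(a, b)
% coefficients of a*Omega*b' as a linear function of the 10 entries of Omega
M = a'*b;  M = M + M' - diag(diag(M));
r = [M(1,1) M(1,2) M(1,3) M(1,4) M(2,2) M(2,3) M(2,4) M(3,3) M(3,4) M(4,4)];
end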

Constant Intrinsic Parameters
If all the cameras have the same intrinsic parameters, so that the intrinsic matrix K is the same for all of them, then (16.16) becomes:
(16.20)

We observe that (16.20) contains five equations, because the matrices of both members are symmetric and the arbitrary scaling factor reduces the number of equations by one. Thus, each camera matrix, excluding the first one, gives five equations in the eight unknowns given by the five intrinsic parameters plus the three entries of the vector. A single solution is determined as soon as three cameras are available. The constraints expressed in (16.20) are also known as Kruppa constraints. Although interesting from a theoretical point of view, solving (16.20) was found experimentally to be numerically unstable, and so it is of little practical use.
Given a projective reconstruction, if the intrinsic parameters of the first two cameras were known, it would be possible to compute the projective transformation T that promotes it to a Euclidean reconstruction, as seen in Sect. 8.6. Gherardi and Fusiello (2010) observed that this T works for all other cameras as well, so a pair of intrinsic parameter matrices determines a solution, to which a cost can be associated, which depends on the plausibility of the intrinsic parameter matrices of the remaining cameras after the Euclidean promotion determined by that pair. We expect no skewing, unit aspect ratio and principal point at the centre of the image; deviations from these assumptions are penalised. The cost function is then optimised with respect to the intrinsic parameters of the first two cameras:
(16.21)

16.2.2 Mendonça-Cipolla Method


The second route to Euclidean reconstruction is through the
explicit
computation of the intrinsic parameters, which then leads
us back to the calibrated case.
Given the correspondences between homologous
points, without knowing anything about the camera, it is
possible to derive the
fundamental matrix .
We know that the fundamental matrix F has seven
degrees of
freedom, while the essential matrix E has five, since it
must satisfy two more constraints, which are those arising
from the equality of the two singular values (Theorem
7.1). Moreover, we know that E and F are
related via the intrinsic parameters K by (7.4). This means that for every fundamental matrix, there exist two intrinsic parameter matrices and a rigid motion, represented by a translation and a rotation R, that generate it. This translates into two constraints available for computing the intrinsic parameters for every image pair.
With only two images available, it is not possible to derive
all the
intrinsic parameters, but only two of them: Hartley (1992),
for example, shows how to derive the focal lengths of the
two cameras. More than
two images are needed, to accumulate constraints. If the
intrinsic
parameters are constant (i.e. the same for all the cameras),
it is shown
that only three images are needed. In fact, there are five
unknowns, each pair of images gives two equations, and
there are three image pairs: 1–2, 1– 3 and 2– 3.
The way of writing the constraints may vary. Mendonça
and Cipolla (1999), who first proposed the method, directly
used Theorem 7.1. An alternative formulation, which
avoids SVD , is based on the following:
Proposition 16. 1 (Huang and Faugeras 1989) The
condition that the matrix E has a singular value of zero and two nonzero
singular values of equal value is equivalent to
( 16.22)

The second condition, in particular, is equivalent to the


equality of the two singular values.
The cost function takes the (unknown) intrinsic
parameters as
arguments, the fundamental matrices (computed from point
correspondences) as parameters, and returns a positive
value
proportional to the violation of the constraint (16.22). The
sought
intrinsic parameters are those that satisfy the constraint or
that
minimise the violation. In formulae, let the fundamental matrix relating images i and j be given, and let K be the matrix of intrinsic parameters, which we assume for simplicity to be constant, although the method has a
general formulation. The least-squares cost function is

( 16.23)

This is minimised using an iterative algorithm such as Gauss-Newton, for which it is necessary to calculate the derivative. We rewrite the single term of the cost function (we omit the indices for simplicity) as a composition of functions:
(16.24)
where
(16.25)
(16.26)
(16.27)
and the domains and codomains are sets of real matrices. Since we are dealing with matrix functions, we employ the definition of the Jacobian given by Magnus-Neudecker (Sect. B.1), which allows us to apply the chain rule (Theorem B.1) to the compound function, obtaining
(16.28)
The first Jacobian is immediately derived from (B.26), while for the second, using the chain rule and formulae (B.27) and (B.4), we arrive at
(16.29)
Finally, it is immediate to verify that
(16.30)
Since K is parameterised through the intrinsic parameters, it will be necessary to multiply the Jacobian (16.28) on the right by the derivative of K with respect to its parameters.
Listing 16.2 Mendonça-Cipolla self-calibration

The Mendonça-Cipolla autocalibration algorithm is implemented in Listing 16.2, where we assume that only three intrinsic parameters are unknown.
Gauss-Newton convergence is not guaranteed for any initial value; however, in practice the algorithm often converges to the correct solution from a reasonable initial guess. An algorithm that minimises (16.23) with global convergence has been proposed by Fusiello et al. (2004).
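A minimal MATLAB sketch of the cost (16.23), minimised here with lsqnonlin (Optimization Toolbox) instead of a hand-written Gauss-Newton, and parameterised by focal length and principal point only, could look as follows; the function names and the parameterisation are illustrative assumptions.

function K = mc_autocal_sketch(F, K0)
% F: cell array of fundamental matrices F{i,j} (i<j); K0: initial intrinsics.
x0 = [K0(1,1); K0(1,3); K0(2,3)];              % focal length and principal point
x  = lsqnonlin(@(x) residuals(x, F), x0);
K  = [x(1) 0 x(2); 0 x(1) x(3); 0 0 1];
end

function r = residuals(x, F)
K = [x(1) 0 x(2); 0 x(1) x(3); 0 0 1];
r = [];
for i = 1:size(F,1)
    for j = i+1:size(F,2)
        if isempty(F{i,j}), continue, end
        E = K' * F{i,j} * K;                   % essential matrix of the pair
        s = svd(E);                            % the two nonzero singular values
        r(end+1,1) = (s(1) - s(2)) / s(2);     % must be equal for a true E
    end
end
end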

16.3 Autocalibration via the Homography of the Plane at Infinity
This calibration method, which we will denote by LVH, since it was invented independently by Luong and Viéville (1996) and Hartley (1997), is based on the knowledge of the homography of the plane at infinity, so it is less general than the previous ones that assumed no knowledge about the camera or the scene.
Consider two cameras with identical and unknown
intrinsic
parameters (or, equivalently, the same camera in motion)
and suppose that we can somehow arrive at the knowledge
of the homography
induced by the plane at infinity . Observe that does
not carry any scale ambiguity; in fact it must have unit
determinant being similar to a rotation matrix.
If the two cameras have the same intrinsic parameters, from the expression of the infinite homography we obtain the rotation and, since rotations are orthogonal, we have:
( 16.31)
Elaborating with Kronecker’ s product, we arrive at

( 16.32) where and is the duplication matrix.


This is a
homogeneous linear system of nine equations in six
unknowns.
However, since (16.31) is an equality between symmetric
matrices, only six equations are unique, and of these Luong
and Viéville
(1996) showed that only four are independent. The
solution of the homogeneous system requires a matrix
of rank five, so at least three images are needed, with
the same intrinsic parameters.
The implementation is given in Listing 16.3.
Listing 16.3 LVH calibration
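A minimal MATLAB sketch of this linear solution is given below; it solves for the six entries of KKᵀ and then recovers K by a triangular factorisation. The function names and the vectorisation ordering are illustrative assumptions, not the book's Listing 16.3.

function K = lvh_sketch(H)
% H: cell array of infinite-plane homographies taken with the same intrinsics.
D = dup3();  A = [];
for i = 1:numel(H)
    Hi = H{i} / nthroot(det(H{i}), 3);        % force unit determinant
    A  = [A; (eye(9) - kron(Hi, Hi)) * D];    % nine equations per homography
end
[~, ~, V] = svd(A);
x = V(:, end);
W = [x(1) x(2) x(3); x(2) x(4) x(5); x(3) x(5) x(6)];  % W = K*K' up to scale
W = W / W(3,3);
K = inv(chol(inv(W), 'upper'));               % upper-triangular K with K*K' = W
K = K / K(3,3);
end

function D = dup3()
% 9x6 duplication matrix: vec(S) = D * vech(S) for a symmetric 3x3 S
D = zeros(9, 6);  idx = [1 2 3; 2 4 5; 3 5 6];
for r = 1:3
    for c = 1:3
        D((c-1)*3 + r, idx(r,c)) = 1;
    end
end
end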

Regarding the estimation of , we briefly mention


three different approaches:
use three vanishing points plus the epipole (see
Problem 6.7) in the DLT;
use any plane sufficiently far from the camera (Viéville et
al. 1996);
rotate the camera around the Centre of Projection
(COP) (Hartley 1997).

16.4 Tomasi-Kanade’s Factorisation


We consider here a different camera model than the perspective camera; in fact this method assumes that the affine camera approximation applies, which realises a parallel projection instead of a
central
(perspective) one. In this case one can directly obtain the
reconstruction by the SVD of a matrix containing the
coordinates of the tie points in the different images. If the
intrinsic parameters of the camera are unknown,
the reconstruction is affine. If, on the other hand, the
intrinsic
parameters are known, as in the original work of Tomasi
and Kanade (1992), a metric reconstruction can be
obtained.
16.4.1 Affine Camera
The most general affine camera is represented by a full rank 3 × 4 matrix that has the last row equal to (0, 0, 0, 1). It is easy to verify that such a camera projects points at infinity to infinity and points of the affine space to the affine plane, hence the name.
Let us write the matrix in blocks, as we did for the PPM. In this case the leftmost 3 × 3 block is clearly singular, so the COP cannot be computed with (3.29). Instead, we have
(16.33)
hence the COP of an affine camera is at infinity, and the camera realises a parallel projection. As a matter of fact, the affine camera may be thought of as the parallel projection version of the projective camera.
The most general affine camera matrix has 8 d.o.f. and can be factored, similarly to the PPM, as
(16.34)
The central matrix represents an orthographic projection (instead of a perspective one), the rightmost matrix represents the transformation from the world reference frame to the camera reference frame (as in the PPM), and the leftmost matrix is an affine transformation of the image plane that plays the same role as K in the PPM, with the difference that the principal point is not defined. Indeed, in a parallel projection, a translation of the image and a translation of the scene yield the same effect.
16.4.2 The Factorisation Method for Affine
Camera
Let us write the affine camera matrix as

( 16.35)
with and
In Cartesian coordinates, the affine projection is written: 1
( 16.36)
The point is the image of the origin of the world
reference frame, indeed .
Consider a set of n 3D points, viewed by m affine cameras
with
matrices . Let be the Cartesian coordinates of
the
projection of the j-th point in the i-th camera. The goal of the
reconstruction is to estimate the PPM of the cameras, that
is, , and the 3D points such that
( 16.37)
We consider the “centralised” coordinates obtained by
subtracting the centroid in each image:
( 16.38)
(this step implies that all n points are visible in all m images).
We choose the 3D reference frame in the centroid of the points, so that the centroid is the origin. Taking this into account, we substitute (16.37) into (16.38), thus managing to eliminate the translation vector in (16.38), which becomes
(16.39)
In this way we immediately obtain the sought relation. We rewrite the
equations in matrix form:

( 16.40)

The matrix W of measurements represents the location of n points tracked along m images. It is a 2m × n matrix that has rank at most 3, being the product of a 2m × 3 matrix (motion) and a 3 × n matrix (structure). This observation is the key to Tomasi and Kanade's factorisation method. In fact, to recover A and M from W, we factorise the latter into the product of two matrices of rank three using the SVD. Let
(16.41)
be the SVD of W. In the absence of noise, only the first three singular values would be different from zero; in practice, however, the rank of W is only approximately three. We then consider only the first three singular values, getting
(16.42)
where the truncated product is the best (in Frobenius norm) rank three approximation of W, by the Eckart-Young theorem. The factorisation sought is obtained by setting
(16.43)
The factorisation is not unique: the matrix of singular values could be arbitrarily distributed between the two factors; in fact one can insert an arbitrary matrix of rank three into the factorisation without altering the result:
(16.44)
Thus, there is an affine ambiguity in the reconstruction.
If instead the intrinsic parameters are known, working in
normalised coordinates yields:

( 16.45)

So the centroid provides the translation component


(in X and Y ), and the rows of A are two orthogonal vectors,
a condition that can be exploited to fix the ambiguity. Note,
however, that the translation of each camera along its
optical axis Z cannot be derived.
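A minimal MATLAB sketch of the affine factorisation (without the metric upgrade) follows; the data layout, with two rows per camera in W, and the function name are illustrative assumptions.

function [A, M] = tk_factorization_sketch(W)
% W: 2m x n matrix of image coordinates (two rows per camera, one column per point).
t  = mean(W, 2);                 % image centroids = projections of the 3D centroid
Wc = W - t;                      % centralised coordinates
[U, S, V] = svd(Wc, 'econ');
A = U(:,1:3) * S(1:3,1:3);       % stacked 2x3 affine cameras (motion)
M = V(:,1:3)';                   % 3 x n structure, up to an affine transformation
end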

16.5 Problems
16.1 Adapt (i) AIM, (ii) incremental reconstruction and (iii) bundle adjustment to the uncalibrated case.

References
A. Fusiello, A. Benedetti, M. Farenzena, and A. Busti. Globally convergent autocalibration using interval analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12): 1633–1638, December 2004.
Riccardo Gherardi and Andrea Fusiello. Practical autocalibration. In Proceedings of the European Conference on Computer Vision, Lecture Notes in Computer Science, pages 790–801. Springer, Berlin, 2010.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2nd edition, 2003.
R. I. Hartley. Estimation of relative camera position for uncalibrated cameras. In Proceedings of the European Conference on Computer Vision, pages 579–587, Santa Margherita L., 1992.
Richard Hartley. Self-calibration of stationary cameras. International Journal of Computer Vision, 22(1): 5–24, February 1997.
A. Heyden and K. Åström. Euclidean reconstruction from constant intrinsic parameters. In Proceedings of the International Conference on Pattern Recognition, pages 339–343, Vienna, 1996.
T. S. Huang and O. D. Faugeras. Some properties of the E matrix in two-view motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12): 1310–1312, December 1989.
Q.-T. Luong and T. Viéville. Canonical representations for the geometries of multiple projective views. Computer Vision and Image Understanding, 64(2): 193–229, 1996.
P. R. S. Mendonça and R. Cipolla. A simple technique for self-calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 500–505, 1999.
Behrooz Nasihatkon, Richard Hartley, and Jochen Trumpf. A generalized projective reconstruction theorem and depth constraints for projective factorization. International Journal of Computer Vision, 115(2): 87–114, November 2015.
M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In Proceedings of the International Conference on Computer Vision, pages 90–95, Bombay, 1998.
M. Pollefeys, F. Verbiest, and L. Van Gool. Surviving dominant planes in uncalibrated structure and motion recovery. In Proceedings of the European Conference on Computer Vision, pages 837–851, 2002.
L. Robert and O. Faugeras. Relative 3-D positioning and 3-D convex hull computation from a weakly calibrated stereo pair. Image and Vision Computing, 13(3): 189–197, 1995.
P. Sturm and B. Triggs. A factorization based algorithm for multi-image projective structure and motion. In Proceedings of the European Conference on Computer Vision, pages 709–720. Springer, Berlin, 1996.
C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2): 137–154, November 1992.
B. Triggs. Autocalibration and the absolute quadric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 609–614, Puerto Rico, 1997.
T. Viéville, C. Zeller, and L. Robert. Using collineations to compute motion and structure in an uncalibrated image sequence. International Journal of Computer Vision, 20(3): 213–242, 1996.

Footnotes
1 In this paragraph, in order not to clutter the notation, the Cartesian
coordinates of the points are denoted without the ˜.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Fusiello, Computer Vision: Three-dimensional Reconstruction Techniques
https://doi.org/10.1007/978-3-031-34507-4_17

17. Multi-view Stereo Reconstruction
Andrea Fusiello
(1) University of Udine, Udine, Italy
Email: [email protected]

17.1 Introduction
We treat in this chapter the problem of multi-view stereo reconstruction,1 that is, recovering the surface of an object from many (more than two) images.
The key notion to understand multi-view stereo
reconstruction
algorithms is that of photo-consistency. A point in the scene
is defined as photo-consistent in a set of images if, for each
image in which it is visible, its radiance in the
direction of the pixel is equal to the intensity of the
corresponding pixel . The surface
reconstruction problem then reduces to that of
determining the photo- consistency of the points in space:
the photo-consistent ones belong to the surface, and the
others correspond to empty space.
If the surface is Lambertian, the radiance is the same in
all
directions, and therefore for a photo-consistent point all
values are equal . In fact the equality of the
colour (modulo radiometric and geometric perturbations) of
the projections of the point in the images in which it is
visible is commonly taken as a test of photo-consistency
(Fig. 17.1), usually including a small region around it (a.k.a.

Binocular stereopsis (Chap. 12) can be interpreted as the


window).

determination of a photo-consistent surface from two


© The Author(s), under exclusive license to Springer Nature
Switzerland AG 2024 A. Fusiello, Computer Vision: Three-dimensional

https://doi.org/10.1007/978-3-031-34507-4_17
Reconstruction Techniques

images.
Fig. 17.1 Illustration of the concept of photo-consistency. The patch A is
photo-consistent, and its two projections are similar, whereas patch B is not,
and in fact its two projections are very different
An aspect that can be ignored in the binocular case, as it reduces to occlusion handling, but becomes crucial in multi-view stereo, is that of visibility. In fact, the problem of determining photo-consistency is closely related to that of visibility. In order to correctly evaluate the photo-consistency of a point, it is necessary to consider only the images in which it is visible; however, determining in which images a point is visible implies knowledge of the shape of the surface, as is made clear in Fig. 17.2.
Fig. 17.2 The visibility problem. In order to recover the surface by photo-
consistency, the
visibility is needed. At the same time, to calculate the visibility the surface is
needed. Adapted from Hernández and Vogiatzis (2010)
Multi-view stereo algorithms can be grouped into four
classes.
Volumetric stereo approaches (on which we will focus in this
chapter) assume that there exists a known, finite volume
within which the objects of interest lie. This volume can
be viewed as a parallel epiped surrounding the scene and is
represented by a grid of “cubes” called
voxels, the 3D equivalent of pixels. The goal is to assign a
binary
occupancy label (or, possibly, a colour) to each element in
the working volume. The desired final representation is a
polygonal mesh
representing the surface of the object. To step from a
volumetric
representation to a polygonal mesh, marching cubes (Sect.
17.4) is the standard algorithm.
Within these approaches, we have object-space techniques
(see Dyer 2001 for an excellent review), which proceed by
scanning the working volume and assigning an occupancy
label to each voxel according to its photo-consistency
(Sect. 17.2), and image-space techniques, which start by
calculating correlation (or any other matching cost)
between
windows in the images (similarly to binocular stereo), and
only
eventually assign a photo-consistency value to the voxel
(Sect. 17.3).
The second class of techniques starts directly from a polygonal mesh that roughly represents the object and evolves it in the direction of optimising its photo-consistency (Faugeras and Keriven 1998; Delaunoy et al. 2008). This class can also be considered as a post-processing for other methods.
The third class consists of algorithms that compute a set
of depth
maps from binocular stereo and merge them into a 3D
model.2 These
methods immediately generalise binocular stereo and can
exploit fusion methods originally conceived for range images
(Goesele et al. 2006).
The last class starts from photo-consistent points that
definitely
belong to the surface and expand them by generating a
“cloud” of planar elements or patches oriented in space
(Furukawa and Ponce 2010).
The last three classes will not be covered in this chapter. The reader is referred to Seitz et al. (2006) for a comprehensive review.

To obtain a number of images of an object in a controlled


way, one
can use a fixed calibrated camera and a turntable on
which the object is placed. For each rotation step of the
plate, one image is taken. The situation is equivalent to
a static object and a camera moving around, but the
rotation of the plate is easily measurable, while the
freehand motion of the camera is more complex to
gauge.

17.2 Volumetric Stereo in Object-Space
Object-space methods scan the working volume voxel by
voxel and
assign its occupancy value. The implementation can start
from a
completely empty volume as in Seitz and Dyer (1999) and
mark as
opaque the voxels that pass the photo-consistency test or it
can start
from a full space (Kutulakos and Seitz 2000) and selectively
remove the voxels that do not pass the test. In the end a
colour is assigned to each opaque voxel taken from the
images on which it is projected (e.g., by
averaging).
Compared to image-space techniques, volumetric stereo
in object
space offers some advantages: it avoids the difficult
problem of finding correspondences (it also works for non-
textured surfaces), it allows
explicit treatment of occlusions, and it yields directly a
three-
dimensional model of the object by simultaneously
integrating all views, instead of aligning portions of the
model.
On the other hand, deciding the photo-consistency of a voxel is a non-trivial task, and to evaluate it correctly would require, in principle, information about the geometry of the
objects, the reflectance
properties of the surfaces, and the direction of
illumination. In image- space, instead, one is confronted
with the simpler task of choosing the best photo-
consistency value in a given set.

17.2.1 Shape from Silhouette
A special case of volumetric stereo is the reconstruction from binary silhouettes, a.k.a. shape from silhouette (Martin and Aggarwal 1983).
Intuitively, a silhouette is the outline of an object, including
its internal part. More precisely, we define silhouette as a
binary image whose value at some point indicates
whether or not the optical ray passing
through the pixel intersects the surface of an object in
the scene.
Each white point on the silhouette therefore identifies an
optical ray that intersects the object. The union of all white
optical rays defines a generalised cone that contains the
object. The intersection of the
generalised cones associated with all the frames
identifies a volume within which the object lies (Fig.
17.3).

Fig. 17.3 Illustration of reconstruction from silhouettes

Listing 17.1 shows a (not particularly efficient)


implementation of shape from silhouette. It starts with a
cubic lattice of points sampling
the working volume (corresponding to the voxel centres),
passed to the function via the variable M. Each point is
processed once, its projections
in the images are determined, and it is defined as
opaque if all projections fall within the silhouette,
transparent otherwise.
Listing 17. 1 Reconstruction from silhouettes
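A minimal MATLAB sketch in the spirit of that listing is reported below; the function name, the data layout and the helper variables are illustrative assumptions rather than the book's actual code.

function occupied = sfs_sketch(M, P, S)
% M: 3 x v voxel centres; P: cell array of camera matrices; S: binary silhouettes.
v = size(M, 2);
occupied = true(1, v);                        % start with all voxels opaque
for i = 1:numel(P)
    m = P{i} * [M; ones(1, v)];               % project all voxel centres
    u = round(m(1,:) ./ m(3,:));              % pixel column
    r = round(m(2,:) ./ m(3,:));              % pixel row
    [h, w]  = size(S{i});
    inside  = u >= 1 & u <= w & r >= 1 & r <= h;
    in_sil  = false(1, v);
    in_sil(inside) = S{i}(sub2ind([h, w], r(inside), u(inside))) > 0;
    occupied = occupied & in_sil;             % opaque only if inside every silhouette
end
end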

The resulting volume is taken as an approximation of the


object (Fig. 17.4). More precisely, it is an outer
approximation of its visual hull,
defined by Laurentini (1994) as the maximal volume that
returns the same silhouettes as the real object for all
views outside the convex hull of the object. In practice,
since only a finite number of silhouettes are available,
only an outer approximation of the visual hull is
achievable.
Fig. 17.4 The object and the approximation of its visual hull computed from
four frames
The shape from silhouettes is a proven technique that
produces
stable results, and often the resulting 3D model is a good
approximation of the real object and is an excellent
initialisation for other methods. An example result is shown
in Fig. 17.5.

Fig. 17.5 Volumetric reconstruction of the cherub obtained from 65


silhouettes, 6 of which are shown as examples. The image was obtained from
the volume after surface extraction with
marching cubes (Sect. 17.4)

17.2.2 Szeliski’s Algorithm


In order to save memory and computational time, Szeliski
(1993)
exploits an octonary tree (octree) data structure to store
voxel labels.
These are eight-way trees in which each node represents a
cubic region of space and the child nodes are the eight
equispaced subdivisions of
that region (octants). The octree is a more space-efficient
representation when the scene contains large empty areas.
The octree is constructed recursively by dividing each
cube into eight cubes, starting with the root node
representing the working volume.
Each node represents a cube, to which a label is associated:

White: leaf representing an occupied volume


Black: leaf representing an empty volume
Grey: internal node whose classification is still undecided

For each octant, the algorithm checks whether its projection onto the i-th image is entirely contained in the silhouette. If this is the case for all N silhouettes, the octant is marked as “white”. If, on the other hand, the octant projection is entirely contained in the background, even for a single camera, the octant is marked as “black”. If either of these cases occurs, the octant becomes a leaf of the octree and is no longer processed. Otherwise, the octant is classified as “grey” and is in turn subdivided into eight children. To limit the size of the tree, the “grey” octants of minimum size are marked as “white”. In the end an octree that approximates the visual hull of the object is obtained.
The shape from silhouette can also be implemented
efficiently on the GPU with a plane sweep method (Li et al.
2003).
out the parts of the clay block that were outside the
silhouette.
Repeating this procedure for all images, the result is the
visual hull of the subject in clay. The sculpture was then
finished by hand to round off the edges and add detail.

This note is taken from Sturm (2011), which contains


much more interesting information about the very
early versions of computer vision algorithms.

17.2.3 Voxel Colouring


When the input images are not simple binary silhouettes but are grey-level or colour images, additional radiometric information can be exploited to improve the reconstruction.
The order in which the voxels are visited is not irrelevant here. In fact, when testing the photo-consistency of a voxel, it is necessary to know on which images that voxel projects, otherwise it could be mistakenly considered not photo-consistent, as would happen to voxel B in Fig. 17.6 if one ignored that it is occluded by voxel A in the right image. In this case, the photo-consistency of voxel A must be evaluated before that of B. In general, the photo-consistency of a voxel should only be evaluated after all voxels that could affect its visibility have been evaluated (i.e. have a final label).
Fig. 17.6 Photo-consistency and visibility. If we ignored that voxel B is not
visible in the right image, we could mistakenly consider it not photo-
consistent
To simplify the calculation of voxel visibility so that a
scene can be reconstructed in a single voxel scan, Seitz
and Dyer (1999) introduced the ordinal visibility constraint on
camera positions. It requires that
there exists an ordering of voxels such that the visibility
of a certain voxel can only be affected by the voxels
preceding it in the ordering.
Evidently, if the constraint is satisfied, it becomes
possible to
implement an algorithm that verifies photo-consistency
with a single pass. This happens, for example, when the
cameras are all on the same side of a plane. In such a case
the voxels can be visited by sliding that
plane front to back, that is, from the closest voxels to the
voxels farthest from the COP of the cameras.
The Voxel Colouring (VC) is a one-pass algorithm that,
starting from empty space, marks voxels that pass the
photo-consistency test as
opaque. The occlusions are stored in a simple binary map
for each of the input images. When a voxelis found to be
consistent (and therefore
opaque), the occlusion bits corresponding to the projections
of the voxel on the various images are set to 1. At any given
time, the visibility set of a voxelis given by the pixels onto
which the voxel projects whose
occlusion bits are 0.
More generally, the ordinal visibility constraint is
satisfied when no point in the scene falls inside the convex
hull formed by the COPs. This generalisation allows the
algorithm to be implemented by sliding not a simple plane,
but a front of surfaces at an increasing distance from the
convex hull.
VC is an elegant and efficient technique, but the
ordinal visibility constraint poses a significant
limitation: cameras cannot surround the
scene.

In the case of semitransparent objects, an opacity value should be added to the voxel, besides the colour. There is an interesting analogy with computed tomography, which has been described in Gering and Wells (1999) and Dachille et al. (2000).

17.2.4 Space Carving


The typical algorithm of this class starts with a volume
composed of
initially opaque voxels that enclose the scene to be
reconstructed. As the algorithm proceeds, the photo-
consistency of the opaque voxels is
evaluated, and those that are found to be inconsistent are
rendered
transparent, that is, carved. The algorithm stops when all
remaining
opaque voxels are photo-consistent. As already observed, for
generic
camera configurations there is no ordering of voxels that
guarantee that the visibility of a voxel does not change after
its photo-consistency has been evaluated. Therefore,
multiple passes are needed.
Listing 17.2 describes the general approach for Space
Carving (SC) algorithms. In the inner loop, the visibility
of voxels is determined, their photo-consistency is checked,
and they are finally carved if they are
found to be inconsistent. When a voxel is carved, the
visibility of other voxels potentially changes, invalidating
any consistency tests they may have passed. Thus, there is
an outer loop that repeats the consistency check until no
carving occurs in the inner loop. In this way, the final set
of opaque voxels is guaranteed to be photo-consistent.
Listing 17.2 General scheme (stub) of space carving
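In the same vein, a MATLAB-flavoured stub of the two nested loops might look as follows; visible_pixels and photoconsistent are hypothetical placeholder functions (e.g., gathering the unoccluded projections of a voxel and thresholding their colour variance), not routines of the book.

% M: voxel centres, P: cameras, I: images (placeholders for the actual data).
opaque = true(1, size(M, 2));            % start from a fully opaque volume
carved = true;
while carved                             % outer loop: repeat until nothing is carved
    carved = false;
    for v = find(opaque)                 % inner loop over currently opaque voxels
        pix = visible_pixels(v, M, P, I, opaque);  % pixels that currently see voxel v
        if ~photoconsistent(pix)         % e.g. colour variance above a threshold
            opaque(v) = false;           % carve the inconsistent voxel
            carved = true;
        end
    end
end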

At first, the reconstruction bears little resemblance to the


real object;
however, the SC algorithm calculates the visibility for
voxels and sculpts those deemed inconsistent, based on this
reconstruction. It is
reasonable to ask, then, whether the algorithm might fail by
sculpting
voxels too early that should then have been found photo-
consistent
eventually. Seitz and Dyer (1999) have shown that, in fact,
this cannot
happen if the consistency measure is monotonic, that is, if a
certain set of pixels is inconsistent, then every superset of
those pixels will also be
inconsistent.
Since the algorithm only changes the assignment of voxels from opaque to transparent and never vice versa, the remaining opaque voxels can only become more visible as the algorithm continues,
and the pixels that see a voxel at a certain instant of time
will necessarily be a subset of those that see the voxel at any
later time. Thus, if the monotonic photo- consistency
measure results in the inconsistency of a voxel, the voxel
will also be inconsistent in the final reconstruction.
Therefore, the
algorithm is conservative, that is, it never carves a voxel
that it should not (i.e. a truly opaque voxel) and thus
correctness is guaranteed.
Kutulakos and Seitz (2000) proposed a variant of SC
that evaluates voxels according to a sweeping scheme: at
each iteration the scene is
swept with a different plane (for simplicity, the three
planes parallel to the coordinate axes are chosen). The
visibility of a voxelis evaluated by considering only the
cameras that are behind the current plane. Thus, when a
voxelis evaluated, the transparency of other voxels that
might occlude it from the cameras currently in use is
already known.
Note that in the photo-consistency evaluation not all
cameras are
used but only those behind the current plane, even if the
voxelis visible from other cameras. This introduces an
approximation of visibility that manifests itself in a
tendency to retain more voxels than necessary. An
additional photo-consistency test is applied at the end of
each iteration with respect to all the cameras to remove
any excess voxels.

17.3 Volumetric Stereo in Image-Space
We will now describe a multi-view (volumetric) stereo
method inspired by the work of Hernández and Vogiatzis
(2010).
We assume that we start with a set of oriented
images: typically, before this step, the PPM and
a sparse model have been
computed, as described in Chap. 14.
The goal is to obtain a three-dimensional photo-
consistency map
that indicates how photo-consistent a particular point
is in a given set of images in which it is visible and
then recover the surface of the object.
The approach of Hernández and Vogiatzis (2010) is to
compute, for each image, a depth map by aggregating
information from a number of neighbouring images
(generalising binocular stereo). Then the
individual depth maps are combined in object space to
obtain the photo- consistency map .
This algorithm does not explicitly treat visibility.
However, due to the locality of the images used, if the
correlation measure is robust, it is
possible to ignore the visibility problem by regarding
occlusions as disturbances.
Let us now see how a depth map is generated for each image. Having fixed a reference image, for each of its points we compute correlation functions along the point's epipolar lines in neighbouring images. More
precisely, let be the indices of the k images closest to
and march along the epipolar line of in each of them,
computing (for example) the normalised cross-correlation
(NCC) over a window of fixed size. We thus have k
correlation profiles3 , which must be
aggregated into one .
Note that in order to compare these functions with
each other, it is necessary to find a common parameter to
describe the epipolar lines, which, as we will see below, is
precisely the depth .
Regarding aggregation, the immediate (but not
necessarily the best) choice falls on the average:

( 17. 1)

The depth to be associated with the point in the map being constructed is the value for which the aggregate correlation reaches its maximum.
The merging of the maps into 3D is done by a voting
mechanism in the discretised space, divided into voxels:
each point in each depth map votes for a specific voxel
of centre with its aggregate correlation value .
The procedure for calculating the photo-
consistency map is summarised in Algorithm 17.1.
Algorithm 17. 1 MULTI-VIEW STEREOPSIS

Alignment of Epipolar Lines
Consider a pair of images, and let a point in the first image be given. Let us write its epipolar line in the second image parameterised with the depth of the point (see Problem 6.1):
(17.2)
Using Proposition A.18 to solve (17.2), we can derive the expression of a generic point of the epipolar line:
(17.3)
This relation allows us to align (via the depth) the epipolar lines of the point in different images.

Aggregation
The choice of the mean as an aggregator proves to be not very robust in dealing with occlusions, specular reflections, and textureless areas.
Hernández and Vogiatzis (2010) observe that although the global maximum of the correlation curves (and of their mean) may be wrong, it is seldom the case that an individual correlation curve lacks a local maximum at the correct depth, provided the point is visible in that image (Fig. 17.7). They therefore propose to aggregate
the local maxima of the
individual correlation curves with a voting scheme. A
triangular or
Gaussian Parzen window, W, is employed to compute an
aggregate
correlation that accounts for local maxima with their value
and
reinforces the presence of nearby local maxima in different
correlation curves. In formulae, let be the local maxima
of above a certain threshold (e.g., 0.6 out of 1):

( 17.4)

Fig. 17.7 On the left are the k correlation curves and their mean (dashed),
in the centre are highlighted with a triangle the local maxima of each curve,
and on the right is the aggregate correlation curve with the Parzen window
and its maximum (triangle), which corresponds to the true depth, different from
the maximum of the mean. Images courtesy of G. Vogiatzis
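A minimal MATLAB sketch of this voting-based aggregation, using a Gaussian Parzen window and findpeaks from the Signal Processing Toolbox, is given below; names and parameter values are illustrative assumptions.

function C = aggregate_sketch(c, d, sigma, thr)
% c: k x numel(d) correlation profiles sampled at depths d; thr: e.g. 0.6.
C = zeros(size(d));
for i = 1:size(c, 1)
    [pks, locs] = findpeaks(c(i,:), 'MinPeakHeight', thr);     % local maxima above threshold
    for j = 1:numel(pks)
        C = C + pks(j) * exp(-(d - d(locs(j))).^2 / (2*sigma^2));  % Parzen-window vote
    end
end
end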
The depth map calculation step can be further improved (Hernández and Vogiatzis 2010) by relaxing the approach that simply takes the maximum of the aggregate correlation function, using instead a Markov random field in which the best local maxima of the correlation are all candidates for the depth of a point.

Surface Recovery
At this point one has a photo-consistency map defined on a volume and wants to determine from it a surface that is photo-consistent and regular (Fig. 17.8). The problem can be seen as a segmentation of the volume into object and background, the result of which is precisely the surface of the object. The solution can be obtained by applying the minimum-cut algorithm (dual of the maximum flow) on a suitably constructed graph, where the nodes are the voxel centres and the edge connecting two voxels has a weight determined by their photo-consistency values. The rationale is that the cost of separating two adjacent voxels (i.e., assigning one to the object and the other to the background) is proportional to the photo-consistency of the two.

Fig. 17.8 Graphic rendering of the 3D model of the cherub. This is a


polygonal mesh with colours associated with the vertices. The detail on
the right shows the structure of the mesh
Visibility is implicitly taken into account by the photo-
consistency measure, which is robust to occlusions.
However, the accuracy of the reconstruction improves if
visibility is taken into account in the
formulation of the minimum cut problem.
Alternatively, one can instead obtain a photo-consistent
point cloud by thresholding on and fallback to the
problem of reconstructing a surface from scattered points,
which we will mention at the end of the
chapter.

17.4 Marching Cubes


The marching cubes algorithm, introduced by Lorensen
and Cline (1987), is a well-known technique for
extracting a surface from a volume. More precisely,
given a 3D scalar field sampled on a
cubic lattice, one wants to extract the isosurface
corresponding to a given value c.
The isosurface divides the space into two regions, the one where the field is greater than c and the one where it is smaller. We mark all points of the cubic lattice that belong to the first region. Assuming that the underlying function is reasonably regular, it takes the value c once on each edge connecting a marked vertex with an unmarked one.
Given any cube in the lattice, it is then possible to
determine the
shape that the isosurface must take relative to that single
cube. There are 256 cases, but they are reduced to 15 by
symmetry and can be
tabulated (Fig. 17.9). It is seen that in this way the cubes
are separately processable, and the algorithm is
implemented in the function
isosurface of MATLAB.
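For instance, on a synthetic scalar field (an illustrative example, not taken from the book), the surface can be extracted and rendered as follows:

[x, y, z] = meshgrid(linspace(-1, 1, 64));          % cubic lattice of samples
f  = sqrt(x.^2 + y.^2 + z.^2);                      % scalar field: distance from the origin
fv = isosurface(x, y, z, f, 0.5);                   % triangular mesh of the surface f = 0.5
patch(fv, 'FaceColor', 'cyan', 'EdgeColor', 'none') % render the mesh
axis equal; camlight; lighting gouraud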

Fig. 17.9 The 15 possible configurations of the vertices of a cube of the


lattice, modulo symmetries. From
https://commons.wikimedia.org/w/index.php?curid=1282165

As a first approximation, we decide to place the point where the isosurface crosses the edge exactly in its middle, as in Fig. 17.9. However, it is possible to do better by determining the point of intersection of the isosurface with the edge of the cube by a linear interpolation of the values of f at the endpoints of the edge.
The marching cubes algorithm also finds use as the last
stage of
surface extraction from point clouds (Hoppe et al. 1992;
Kazhdan et al. 2006). Unlike the previous case, where the
lattice points can be seen as
sampling a 3D scalar field, the point cloud does not have
an associated scalar value and this must be constructed as
part of the procedure.
We then define the field function as the
distance from to the nearest point on the surface
of the object and
approximate it through the data points, which are on the
surface by
definition. In this way, the surface sought is the zero-level
set of the
scalar field: . To finally move from this implicit
representation of the surface to a triangular mesh, the
marching cubes algorithm is used.

References
Frank Dachille, Klaus Mueller, and Arie Kaufman. Volumetric backprojection. In Proceedings of the 2000 IEEE Symposium on Volume Visualization, pages 109–117. ACM Press, 2000.
Amael Delaunoy, Emmanuel Prados, Pau Gargallo, Jean-Philippe Pons, and Peter Sturm. Minimizing the multi-view stereo reprojection error for triangular surface meshes. In British Machine Vision Conference, 2008.
C. R. Dyer. Volumetric scene reconstruction from multiple views. In L. S. Davis, editor, Foundations of Image Understanding, chapter 16. Kluwer, Boston, 2001.
O. Faugeras and R. Keriven. Variational principles, surface evolution, PDEs, level set methods, and the stereo problem. IEEE Transactions on Image Processing, 7(3): 336–344, March 1998.
Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8): 1362–1376, August 2010.
D. T. Gering and W. M. Wells. Object modeling using tomography and photography. In Proceedings of the IEEE Workshop on Multi-View Modeling and Analysis of Visual Scenes, pages 11–18, 1999.
Michael Goesele, Brian Curless, and Steven M. Seitz. Multi-view stereo revisited. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2402–2409, Washington, DC, USA, 2006. IEEE Computer Society.
Carlos Hernández and George Vogiatzis. Shape from photographs: A multi-view stereo pipeline. In Roberto Cipolla, Sebastiano Battiato, and Giovanni Maria Farinella, editors, Computer Vision: Detection, Recognition and Reconstruction, volume 285 of Studies in Computational Intelligence, pages 281–311. Springer, Berlin, 2010.
Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. Computer Graphics, 26(2): 71–78, July 1992.
Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, pages 61–70. Eurographics Association, 2006.
K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3): 199–218, 2000.
A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2): 150–162, 1994.
Ming Li, Marcus Magnor, and Hans-Peter Seidel. Hardware-accelerated visual hull reconstruction and rendering. In Graphics Interface 2003, pages 65–71, 2003.
W. E. Lorensen and H. E. Cline. Marching cubes: a high resolution 3-D surface construction algorithm. In M. C. Stone, editor, SIGGRAPH: International Conference on Computer Graphics and Interactive Techniques, pages 163–170, Anaheim, CA, July 1987.
W. N. Martin and J. K. Aggarwal. Volumetric descriptions of objects from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2): 150–158, March 1983.
S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2): 151–173, 1999.
Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, pages 519–528, 2006.
Peter Sturm. A historical survey of geometric computer vision. In Computer Analysis of Images and Patterns, pages 1–8. Springer, Berlin, 2011.
R. Szeliski. Rapid octree construction from image sequences. CVGIP: Image Understanding, 58(1): 23–32, 1993.

Footnotes
1 In this chapter the term reconstruction is used with a different meaning from the one defined in Chap. 16.
2 If the fusion algorithm is volumetric, the boundary with image-space volumetric stereo becomes blurred.
3 Note that correlation is a measure of photo-consistency, so it is the opposite of the matching cost introduced in Chap. 12.


18. Image-Based Rendering


Andrea Fusiello
University of Udine, Udine, Italy
Email: [email protected]

18.1 Introduction

In this last chapter, techniques are presented that have in common the fact that they synthesise (or render) an image obtained from the geometric transformation of one or more input images; this is also called image-based rendering. Synthetic images can be interpreted as obtained from a virtual camera that differs from the real one in some aspects, such as a greater field of view or a different orientation.

Image synthesis from other images (image-based rendering), as opposed to image synthesis from three-dimensional models (model-based rendering), originates from the idea that a scene can be represented by a collection of its images. Those that are missing can be synthesised from existing ones. It is also referred to as image interpolation.

18.2 Parametric Transformations

There are several types of geometric image transformations: affine, projective, piecewise affine and non-parametric transformations, as in the case of a disparity or parallax field. The simplest case of synthesis is when an image is transformed through a homography. We will describe in this section some applications based on projective transformations.

18.2.1 Mosaics

Image mosaicing is the alignment (or registration) of multiple images into larger aggregates through the application (in general) of homographies. Mosaicing synthesises an image taken from a virtual camera with a larger field of view. Since there are two cases where images of a scene are linked by homographies, there are correspondingly two types of mosaics:

panoramic mosaics: images are taken from a camera rotating about its COP, as in the case of Fig. 18.1;

Fig. 18.1 Panoramic mosaic from three images. Note the stretching both
horizontally and vertically as one moves away from the central reference image

planar mosaics: different images of a planar scene, or approximately so, are aggregated, such as the terrain seen from the zenith (Fig. 18.2).
Fig. 18.2 Planar mosaic of a terrain taken from a helicopter. Courtesy of Helica srl
The construction of a mosaic is normally accomplished by
aligning the input images with respect to a common
reference frame and then blending them into a single
composite image. Thus, it proceeds through two stages:
alignment: for each image calculate and apply the homography
that aligns it with the reference image;
blending: assign a colour to each pixel of the mosaic, based on the colours of the various images that cover it.

An excellent reference to delve into mosaicing is Szeliski (2006).

When a panoramic mosaic is projected onto a plane, the images are "stretched" horizontally (obvious) and vertically (less obvious) as one moves away from the reference frame, as seen in Fig. 18.1. To cope with large rotations and overcome the 180-degree limit, one can switch to cylindrical or spherical coordinates.

18.2.1.1 Alignment

Alignment of images can be approached by methods analogous to those seen for 3D model registration in Chap. 15: what changes is the nature of the data, from 3D point clouds to 2D images, and the type of transformation, homographies in place of isometries (or similitudes).

Therefore, in the wake of ICP, one can devise correspondence-less methods that use all pixels to compute the best homography that aligns the two images (such as Szeliski 1996). This approach would become impractical if extended to many images, where feature extraction and matching is to be preferred, as in Brown and Lowe (2007). Another approach would be to compute homographies between pairs of images with any method and then synchronise them, as we did for rotations in Chap. 14 and Sect. 15.2.3.

Synchronisation of Homographies

The synchronisation that we have seen at work in the case of rotations (Sect. 14.3.1) can be employed in the same way for homographies (Fig. 18.3). The homogeneous scaling factor, however, must be eliminated in order for the synchronisation to work: for this purpose we force the homography matrices to have unit determinant, and exploit the fact that they constitute a group, since the product of two matrices with unit determinant is also a matrix with unit determinant.
Fig. 18.3 Putative graph representing available homographies between a set of
eight images. The spanning tree (thick line) provides a unique path from each
node to the root (the reference image), along which to concatenate the
homographies. Synchronisation, on the other hand, uses all available edges to
compensate for error
The known homographies are packed into a matrix; the first three eigenvectors of this matrix are computed and stacked into the matrix U. The 3 × 3 blocks of U correspond to the homographies to be applied to each image to realise the alignment. See Listing 18.1.

Listing 18.1 Homography synchronisation
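A minimal sketch along the lines just described, assuming the pairwise homographies are available in an n-by-n cell array H, with H{i,j} mapping image j onto image i and H{i,i} = eye(3) (the variable names are ours):

% Homography synchronisation (sketch). Pack the pairwise homographies, scaled
% to unit determinant, into a big block matrix and take its leading eigenvectors.
n = size(H, 1);
Z = zeros(3*n);
for i = 1:n
    for j = 1:n
        Z(3*i-2:3*i, 3*j-2:3*j) = H{i,j} / nthroot(det(H{i,j}), 3);
    end
end
[V, D] = eig(Z);
[~, idx] = sort(real(diag(D)), 'descend');
U = real(V(:, idx(1:3)));              % first three eigenvectors, stacked
Hsync = cell(n, 1);
for i = 1:n
    Hsync{i} = U(3*i-2:3*i, :);        % homography aligning image i with the reference
end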

The benefit of synchronisation can be appreciated in Fig. 18.4, where two solutions are compared in the case of a sequence of homographies affected by small random errors; the first is obtained from the concatenation of consecutive transformations, creating a chain from the first to the last image; the second also considers the transformation that links the last image to the first, thus creating a loop on which synchronisation is performed. While in the first solution a drift that makes the solution diverge from the truth is clearly visible, the synchronisation exploits the presence of the loop to compensate for the error (a.k.a. loop closing), producing a solution much closer to the ground truth.
Fig. 18.4 Synchronisation of a single loop. The left figure shows the graph and its
spanning tree
(thick line). The two differ only in the presence of the edge that closes the loop.
On the right, the
transformation applied to a square is used to display the sequence of homographies
. The vertices marked with a circle (green) represent the outcome of
synchronisation, while the cross (red) represents simple concatenation. The ground
truth is drawn in blue

18.2.1.2 Blending

After alignment, a pixel of the mosaic is generally covered by more than one image, which raises the question of how to assign the colour to the mosaic.
The most commonly used mixing operators are the average, the median and the selection of a single image as the colour source according to appropriate criteria. The feathering operator performs a weighted average of the colours, with the weight decreasing according to the distance of the pixel from the image edge. This remedies the vignetting effect, which makes images brighter in the centre than in the periphery. Brown and Lowe (2007) show very good results with multi-band blending.
In the presence of misalignments due to residual alignment errors, parallax (deviations from flatness) or moving objects, the techniques just listed are nevertheless bound to produce visible artefacts. Blending cannot remedy problems inherently present in the data; however, it can mask them as much as possible. This is the approach that informs methods that attempt to compose the mosaic with tiles, each cut from a single image, so that the transitions between them (or "seams") are as little visible as possible.
The first criterion for reducing the impact of seams is to minimise their total length, so a tessellation of the plane with minimal perimeter is sought. Consider the centroidal Voronoi tessellation induced by the image centres projected on the mosaic reference plane. Recall that, given a set of centres, the Voronoi tessellation assigns to each centre a convex cell, to which belong the points in the plane closer to that centre than to any other. If the centres are themselves the centroids of those regions, such a tessellation is called a centroidal Voronoi tessellation.

Two facts are relevant here: (i) as the number of centroidal Voronoi cells in a bounded region increases, each cell approaches a hexagon (Du et al. 1999); (ii) the "honeycomb conjecture" (Hales 2001) states that the tessellation that divides the plane into regions of equal area with the smallest total perimeter is the one with regular hexagons. For these reasons, the centroidal Voronoi tessellation of the image centres can be considered a cost-effective approximation of the lowest-perimeter tessellation.
Figure 18.5 shows the area covered by each image after its
transformation into the mosaic reference plane (left) and the
area assigned to that image after decomposition into Voronoi
cells (right).
Fig. 18.5 Representation of the frames of the images transferred into the mosaic
reference plane and Voronoi tessellation. Images taken from Santellani et al. (2018)
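Computing such a tessellation is straightforward with MATLAB's built-in Voronoi functions; a minimal sketch with hypothetical projected centres:

% Voronoi tessellation of the image centres projected onto the mosaic plane.
C = [120 340; 560 300; 980 360; 1400 310; 760 720];   % hypothetical centres (pixels)
voronoi(C(:,1), C(:,2))                                % plot the cell boundaries
[V, cells] = voronoin(C);                              % vertices and, per centre, its cell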
Boundaries between tiles are straight segments that do not
take into
account the content of the image; they should be refined by
evolving them into paths connecting two vertices through
pixels where colour differences between adjacent cells are
minimal, as suggested by Davis (1998). In this way, if the
original cell boundary crosses, for example, a moving object,
the refined path will avoid it, since there are likely to be large
differences in the affected pixels due to object-background
contrast.
Figure 18.6 shows an example of the seams before and after
optimisation. One can see how the cell boundary follows the
space between the trees and avoids crossing them, since the
canopies are more textured
than the ground and affected by parallax.

Fig. 18.6 Detail of the seams produced by Voronoi tessellation and optimised
seams. Images taken from Santellani et al. (2018)
18.2.2 Image Stabilisation

Given a sequence of images, the purpose of stabilisation is to compensate for camera motion, that is, to geometrically deform each image so that a given plane of the scene appears motionless, despite the camera being in motion. This can be easily accomplished by transforming each image with respect to a reference one (arbitrarily chosen) with the homography of the plane. This is the same procedure used in mosaics, but without blending: only the last transformed image is displayed, as in Fig. 18.7. If the scene is planar or the motion is rotational (in this case the plane is the one at infinity), the sequence will be completely stabilised; otherwise, the plane will appear motionless, and the rest of the scene in motion.

Fig. 18.7 Top row: two images from an aerial sequence. Bottom row: the same
images stabilised
with respect to the ground plane. In white the frame of the reference image.
Images taken from Censi et al. (1999)
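A minimal sketch of the warping step, assuming the homographies H{k} that map each frame onto the reference one have already been estimated from points on the plane to be stabilised (frames and H are our hypothetical variable names):

% Stabilise each frame by warping it with the homography of the reference plane.
[h, w, ~] = size(frames{1});
ref = imref2d([h w]);                        % output grid equal to the reference image
for k = 1:numel(frames)
    T = projective2d(H{k}');                 % MATLAB's convention uses row vectors, hence the transpose
    stab = imwarp(frames{k}, T, 'OutputView', ref);
    imshow(stab); drawnow
end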
18.2.3 Perspective Rectification

There are cases where one would like to use an image as if it were a map, a blueprint or an elevation, that is, a drawing with a constant scale of representation, thus allowing real distances to be measured on it. Photographs result from a perspective projection, and the scale is not constant across the image, except in the special case where the framed object is planar and parallel to the focal plane, as noted in Problem 3.3. In general, to have a constant scale, a scaled orthographic projection is required, which must be synthesised from the actual images and the depth of the points.
Let us first concentrate on the special case of approximately planar objects, such as portions of territory without great orographic discontinuity or architectural items such as paving or facades. Let such a plane be our proxy of the object; then the image can be transformed, or rectified, so that the virtual focal plane is parallel to it. The transformation between the plane and its perspective image is a homography, which is completely defined by four points whose coordinates on the plane are known. Once such a homography is determined, the image can be projected backwards onto the plane. This is equivalent to synthesising an image taken by a virtual camera whose optical axis is perpendicular to the plane. Such an image is also called a photoplane. An example is shown in Fig. 18.8. The photoplane can be obtained from a single photo or from a mosaic (in this case it is also called a controlled mosaic).

Fig. 18.8 To the right a photograph of Porta Udine (Palmanova, IT). On the left the
photoplane of the facade. The four points used in the rectification are the vertices
of the yellow quadrilateral, which in the photoplane becomes a rectangle
The photoplane is correct only if applied to perfectly flat objects or to objects with such small depth variations that the errors generated are negligible at the chosen scale of representation. When the object to be represented is not exactly a plane, we can evaluate the extent of the error that is made. With reference to Fig. 18.9, consider a generic point M of depth z, which projects in M′ onto the reference plane, and let a be the elevation of M with respect to that plane, or relief. The difference ε between the image of M, which is x away from the principal point, and that of M′ represents the error in the image plane. From the perspective projections of M and M′, one obtains:

ε = a x / d    (18.1)

where d does not depend on the point and is equal to the distance of the plane from the camera.

Fig. 18.9 The relief displacement. The actual object point is M, with relief a. Ignoring a is tantamount to confusing M with M′, which lies on the reference plane
In the photogrammetric literature, ε is called relief displacement, since it manifests as a (radial) displacement of the image point caused by the relief; it is in fact a form of planar parallax. We observe that ε is small when:

the height a is small compared to the distance d from the plane;

the distance x of the projected point from the principal point is small.
If the maximum parallax falls within a tolerance threshold,1 then the photoplane can be considered an orthographic projection. Since parallax increases with the distance of the image point from the principal point, using only the central part of the image in the composition of a mosaic allows parallax to be kept under control.
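As a back-of-the-envelope check with hypothetical numbers:

% Relief displacement for a relief of 5 m, a plane 500 m away and an image
% point 50 mm from the principal point (all numbers are illustrative).
a = 5; d = 500;       % metres
x = 50e-3;            % metres, on the image plane
eps_img = a * x / d   % 5e-4 m = 0.5 mm

The resulting 0.5 mm is then compared with the tolerance threshold (e.g. the 0.2 mm graphical error of footnote 1, referred to the scale of representation) to decide whether the photoplane is acceptable.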

18.3 Non-parametric Transformations


As long as the COP does not vary (or the scene is a plane), a
homography is sufficient to synthesise new images, as we have
seen (Fig. 10.4). If the
COP changes, however, the structure of the scene comes into
play in
establishing the (non-parametric) function that maps the given
images into the synthetic one.

18.3.1 Transfer with Depth

Consider the following two PPMs:

(18.2)

Substituting these PPMs into the equation of the epipolar line with the explicit depths (6.30), we get:

(18.3)

Thus, if the depth of a point in the reference image is known, its position in the synthetic image is obtained by

(18.4)

where the rotation and translation specify the orientation of the virtual camera with respect to the reference (real) camera, together with the intrinsic parameters of the virtual camera and those of the reference camera (which is therefore calibrated).
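A minimal sketch of the transfer, in our own notation (m1 = [u; v; 1] pixel in the reference image, z its depth, K1 and K2 the intrinsic parameter matrices, R and t the orientation of the virtual camera with respect to the reference one):

% Transfer a pixel of known depth into the virtual camera.
function m2 = transfer_depth(m1, z, K1, K2, R, t)
p  = z * (K1 \ m1);      % back-project to a 3D point in the reference camera frame
q  = K2 * (R * p + t);   % project it into the virtual camera
m2 = q / q(3);           % homogeneous pixel coordinates in the synthetic image
end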
18.3.2 Transfer with Disparity
Ultimately, the depth is not really needed to transfer
points onto the synthetic image. Also disparity, being a
depth proxy, would do.
Assume that two calibrated reference images with a
dense map of correspondences are given. One can exploit
epipolar geometry to predict the position of a point in the
third (synthetic) image . The COP of the
virtual camera is constrained to lie on the baseline segment
between the COPs of the two reference cameras: hence, this
is an interpolation.
We assume that the cameras are in normal
configuration, that is, the images are rectified. In this case,
(18.3) specialises to:

( 18.5)

where is the first component of . The difference


is the
disparity. It is a vector, but since only the first component is
nonzero, we normally identify the disparity with it. The
transfer equation in the virtual image is
(18.6)

It is easily verified that interpolating the disparity in this case is equivalent to moving the camera to intermediate positions along the baseline. The algorithm was introduced by Chen
and Williams (1993) and extended to unrectified images by
Seitz and Dyer (1996) under the name of view morphing. The
latter simply rectifies the two images,
performs interpolation and eventually de-rectifies the resulting
image.

18.3.3 Epipolar Transfer

Transfer is possible even in the uncalibrated case, that is, without knowing the intrinsic parameters. In fact, all that is needed are correspondences and epipolar geometry. Given two reference images with a dense map of correspondences, one can exploit epipolar geometry to predict the position of a point in the third (synthetic) image. We assume that the fundamental matrices have been computed. Given two homologous points in the two reference images, the position of their homologous point in the third image is determined by the constraint of simultaneously belonging to the epipolar line of the first and that of the second, as illustrated in Fig. 18.10. In formula:

(18.7)
Fig. 18.10 The point in the third image is determined by the intersection of the
epipolar lines of the other two
This method fails when the three optical rays are coplanar
(the two lines in the third image become coincident).
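A minimal sketch, in our own notation (F13 and F23 are assumed to be the fundamental matrices such that F13*m1 and F23*m2 are the epipolar lines of m1 and m2 in the third image):

% Epipolar transfer: intersect the two epipolar lines in the third image.
function m3 = epipolar_transfer(m1, m2, F13, F23)
l1 = F13 * m1;        % epipolar line of m1 in the third image
l2 = F23 * m2;        % epipolar line of m2 in the third image
m3 = cross(l1, l2);   % their intersection (fails if the lines are nearly coincident)
m3 = m3 / m3(3);
end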
18.3.4 Transfer with Parallax
In the method of Sect. 18.3.2 disparity is used as a substitute
for unknown depth. The planar parallax can also serve the
same purpose. Recall that the relation between two images
via a reference plane is given by (6.29), which we rewrite here
as
( 18.8)
Given a number ( ) of homologous points, the
homography and the epipole can be easily computed
(see Problem 18.1). The parallax at each
point on the reference image is also calculated (Proposition A.
18) by solving with respect to in (18.8):
( 18.9)
Beware that and are only known up to a scaling
factor, so the modulus of the parallax also contains an
unknown scaling factor.
Since the parallax does not depend on the second image,
this can be replaced by a third one (Shashua and Navab
1996) and (18.8) can therefore be used for transferring points
into the synthetic image :
( 18. 10)
where the homography induced by between the
first and third images and the epipole in the third
image specify—indirectly—the
orientation of the virtual camera.
Note the similarity of this technique to transfer with depth.
The
equation is similar, but in one case we employ “calibrated”
quantities such as , R and , while in the other
“uncalibrated” quantities such as , H and , respectively. In
other words, in the former case, we work in a Euclidean
frame, while in the latter in a projective frame. The second
requires less
input information (no calibration is needed), but specifying
the orientation of the virtual frame is problematic (Fusi ello
2007), whereas in the
Euclidean case, it is immediate.
An example of synthesis with parallax is shown in Fig. 18.11.

Fig. 18.11 Some frames of a sequence (Palmanova, IT) synthesised using parallax.
The values and correspond to the reference images. For values between
0 and 1, there is interpolation; for values outside the range, there is extrapolation

18.3.5 Ortho-Projection

An orthophoto is an orthographic projection of the object along a direction perpendicular to a reference plane, such as the ground or a facade. Contrary to the creation of the photoplane, where it is sufficient to identify the reference plane with at least four points, for rendering the orthophoto (a.k.a. ortho-projection) it is necessary to know the structure of the scene in order to define a non-parametric transformation of the original image. Hence, the orthophoto is synthesised in a similar way to the transfer with depth (Sect. 18.3.1), with the difference that now the virtual camera is affine. Consider then the following two projection matrices, the first perspective, the second affine:
(18.11)

where the rotation and translation specify the orientation of the virtual camera relative to the reference (real) camera, and the affine intrinsic parameter matrix corresponds to the principal submatrix of the usual intrinsic parameters matrix; the affine rotation and translation are the perspective ones deprived of the third row, respectively.
To obtain the expression of the epipolar line, which is used
for the
transfer, we write the optical ray of a point in the first
camera and
project it (orthographically) into the second in , obtaining
after a few
steps:

(18.12)

In this case the structure of the scene is assigned through a depth map referred to the original image, but usually it is given equivalently through a Digital Elevation Model (DEM) that for each point of the object gives the distance from the reference plane.

In the generation of an orthophoto, one can use a DEM relative to the ground (Digital Terrain Model, DTM) or a DEM including whatever is above the ground (Digital Surface Model, DSM). In the first case, the orthophoto is correct only for the points that are on the ground and has parallax errors for buildings and vegetation. To distinguish between the two cases, when a DSM is used, the result is called a true orthophoto.
Often the photoplane is confused with the orthophoto, but as we said, the photoplane is only correct for a particular plane, so points outside the plane are affected by parallax, while in the orthophoto the parallax is removed. Figure 18.12 allows one to appreciate the differences between a photoplane (in a situation where it is clearly not applicable) and an orthophoto.
Fig. 18.12 From left: image taken from above (nadiral), photoplane relative to the
ground, and
orthophoto. The (vertical) walls in the orthophoto are not visible, while in the
nadiral photo, they are visible, as well as in the photoplane, which does not change
the visibility of the object points. Images taken from the “Benchmark SIFT2017”,
processed by E. Maset with 3DF Zephyr

18.4 Geometric Image Transformation


Geometric image transformation is an operation that
alters the spatial arrangement of pixels. It is also referred
to in the literature as image warping. It shares
problems and solutions with texture mapping in
computer graphics.
To perform a geometric transformation of an image, we need the mapping that associates corresponding points between the source image and the target image: one can be given either the direct transformation, from source coordinates to target coordinates, or the inverse transformation. The forward mapping uses the direct transformation and scans the source image, according to the following scheme:
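A sketch of the scheme (f is a function handle implementing the direct transformation; the target image J is assumed preallocated):

% Forward mapping: scan the source image and copy each pixel to its
% transformed position in the target image (no resampling).
for v = 1:size(I, 1)
    for u = 1:size(I, 2)
        w  = f([u; v]);                              % direct transformation
        up = round(w(1)); vp = round(w(2));
        if up >= 1 && up <= size(J, 2) && vp >= 1 && vp <= size(J, 1)
            J(vp, up) = I(v, u);
        end
    end
end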
The geometric transformation of an image requires a
resampling step, that is, the conversion of a digital (2D)
signal from one sampling grid to another. If, by literally
following the previous pseudo-code, one simply
copies the pixels, the results will be unsatisfactory. In
particular:
if the transformation enlarges the image, the
operation results in oversampling. If not done well, it
results in the repetition of pixels;
if the transformation shrinks the image, the
operation results in undersampling. If it is not done
well, it results in aliasing.
In addition, for non-invertible transformations, there are artefacts due to violation of injectivity or surjectivity, namely (Fig. 18.13):

folding: occurs when two or more pixels of the source image are associated with the same pixel of the target image;

holes: occur when points not visible in the source image are visible in the target image;

Fig. 18.13 Artefacts in image synthesis. From left to right: folding, holes and
magnification

Folding can be avoided by following an appropriate pixel evaluation order (McMillan and Bishop 1995), which ensures that the
order (McMillan and Bishop 1995), which ensures that the
points closest to the camera are transformed last, so that they
eventually overwrite the
others. For example, if the synthetic camera is shifted to
the left of the reference camera, the pixels should be
processed from right to left.
Holes are difficult to fix because the information needed to
fill them is missing. Interpolation or inpainting techniques
can be used (Criminisi et al. 2004).
Magnification artefacts are resolved by splatting, that is, drawing a footprint of the transferred point that is larger than the pixel itself (Westover 1991; Zwicker et al. 2001).
When the inverse of the transformation is available (as in
the case of homographies), one can apply the more
effective technique of backward mapping. It proceeds by
scanning the target image, according to the
following scheme:
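A sketch of the scheme (finv is a function handle implementing the inverse transformation):

% Backward mapping: scan the target image and fetch the colour from the
% source at the inversely transformed position, with bilinear resampling.
for vp = 1:size(J, 1)
    for up = 1:size(J, 2)
        w = finv([up; vp]);                                       % inverse transformation
        J(vp, up) = interp2(double(I), w(1), w(2), 'linear', 0);  % sample the source
    end
end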

With backward mapping, the problem of magnification/minification has an effective solution through interpolation.

Bilinear Interpolation

A good resampling technique that works for a scale factor of up to two can be achieved with bilinear interpolation. This is a continuous function that is inexpensive to compute and interpolates the data on a square grid. Consider, for simplicity, a square of four adjacent pixels in the source image I (Fig. 18.14). The interpolated value at a point inside the square is calculated as a weighted average of the four corner values.
Fig. 18.14 Bilinear interpolation
The implementation of backward mapping, shown in Listing
18.2, is
based on the interp2 function of MATLAB and assumes a
monochrome image. When working with RGB images, the
same operation is performed on each of the three channels,
independently.
Listing 18.2 Image transform with backward mapping
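A sketch in the same spirit, where the inverse homography Hinv and the output size are our assumptions:

% Backward mapping of a monochrome image I through the inverse homography Hinv.
[h, w] = size(I);
[U, V] = meshgrid(1:w, 1:h);                   % target pixel grid
P = Hinv * [U(:)'; V(:)'; ones(1, h*w)];       % inverse map all target pixels
X = reshape(P(1,:) ./ P(3,:), h, w);
Y = reshape(P(2,:) ./ P(3,:), h, w);
J = interp2(double(I), X, Y, 'linear', 0);     % bilinear sampling of the source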

In bilinear interpolation, to save on multiplications, one can apply the following formula, which uses three multiplications instead of eight:

(18.13)
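One factorisation with this property, consistent with the count of three multiplications, can be checked numerically:

% Bilinear interpolation with three multiplications; f00, f10, f01, f11 are the
% corner values and (a, b) the offsets within the unit square (numbers are illustrative).
f00 = 10; f10 = 20; f01 = 30; f11 = 60;
a = 0.25; b = 0.5;
v = f00 + a*(f10 - f00) + b*((f01 - f00) + a*(f11 - f10 - f01 + f00))
% identical to the direct form: (1-a)*(1-b)*f00 + a*(1-b)*f10 + (1-a)*b*f01 + a*b*f11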

Problems

18.1 How to calculate the epipole given six homologous points, four of which are coplanar?

18.2 In an autonomous driving application (e.g. of a car), if one fixes the road plane as reference, all points for which the parallax is significantly different from zero are immediately detected as obstacles. Expand this idea to build an obstacle avoidance algorithm.

18.3 In the game of soccer, the signed parallax of the ball with respect to the plane formed by the posts and the crossbar can detect scoring. Expand this idea to build an automatic score detection algorithm.

References
M. Brown and D. G. Lowe. Automatic panoramic image stitching using
invariant features. International Journal of Computer Vision, 74 ( 1): 59– 73,
2007.
A. Censi, A. Fusiello, and V. Roberto. Image stabilization by features tracking. In Proceedings of the 10th International Conference on Image Analysis and Processing, pages 665–667, Venice, Italy, 27–29 September 1999.


Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis.
In James T. Kajiya, editor, SIGGRAPH: International Conference on Computer Graphics and
Interactive Techniques, volume 27, pages 279–288, August 1993.
A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by
exemplar-based image inpainting. Image Processing, IEEE Transactions on, 13 (9):
1200– 1212, 2004.

James Davis. Mosaics of scenes with moving objects. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 354–360, 1998.
Qiang Du, Vance Faber, and Max Gunzburger. Centroidal voronoi tessellations:
Applications and
algorithms. SIAM Review, 41 (4): 637–676, 1999.
https://doi.org/10.1137/S0036144599352836.
A. Fusiello. Specifying virtual cameras in uncalibrated view synthesis. IEEE
Transactions on Circuits and Systems for Video Technology, 17 (5): 604–611, May 2007.

T. C. Hales. The honeycomb conjecture. Discrete Comput. Geom., 25 (1): 1–22, January 2001. ISSN 0179-5376. https://doi.org/10.1007/s004540010071.
D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 482–
488, 1998.
L. McMillan and G. Bishop. Head-tracked stereo display using image warping. In
Stereoscopic Displays and Virtual Reality Systems II, number 2409 in SPIE Proceedings,
pages 21–30, San Jose, CA, 1995.
E. Santellani, E. Maset, and A. Fusiello. Seamless image mosaicking via
synchronization. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information
Sciences, IV-2: 247–254, 2018.
Steven M. Seitz and Charles R. Dyer. View morphing: Synthesizing 3D
metamorphoses using image
transforms. In Holly Rushmeier, editor, SIGGRAPH: International Conference on Computer
Graphics and Interactive Techniques, pages 21–30, New Orleans, Louisiana, August 1996.
A. Shashua and N. Navab. Relative affine structure: Canonical model for 3D from
2D geometry and applications. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18 (9): 873– 883,
September 1996.
R. Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and
Applications, 16 (2): 22–30, March 1996.
Richard Szeliski. Image alignment and stitching: a tutorial. Foundations and Trends
in Computer Graphics and Vision, 2 ( 1): 1– 104, 2006.

Lee Alan Westover. Splatting: A parallel, feed-forward volume rendering algorithm.


Technical Report TR91-029, University of North Carolina at Chapel Hill, USA,
1991.
Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus H. Gross.
Surface splatting. In SIGGRAPH ’01, 2001.

Footnotes
1 In photogrammetric survey this threshold is given by the graphical error, equal
to the thickness of the mark on the paper, conventionally fixed at 0.2 mm.

Appendix A

Notions of Linear Algebra


«If you were to be stranded on a deserted island, make sure to bring SVD
with you.» [Self-citation]

A.1 Introduction

This appendix reports a set of linear algebra results useful for understanding the methods and techniques outlined in the previous chapters. For a systematic treatment, see Strang (1988).

A.2 Scalar Product

Definition A.1 (Scalar Product) The scalar product of two vectors is defined as

(A.1)

The scalar product is commutative, that is, .


The scalar product induces the Euclidean norm:
(A.2)
Geometrically, the norm represents the length of the vector. If
is unitary, that is, , then represents the length
of the projection of along the direction of . The angle
between two vectors and is
defined as

(A.3)

Definition A.2 (Orthogonality) Two vectors are orthogonal if

(A.4)
Considering the vectors as matrices formed by a single
column, we can write the scalar product using the matrix
product and transposition
(superscript ):
(A.5)

A.3 Matrix Norm


The Euclidean vector norm naturally induces a Euclidean
matrix norm in the following way (applies in general to any
norm):

(A.6)

We will see later a property that allows this norm to be computed from the eigenvalues.
There is another way to generalise the Euclidean norm
to matrices, which starts from a scalar product on
matrices that generalises the one defined for vectors.
Define the following scalar product between two matrices
and
:

(A.7)

which reduces to the scalar product between vectors previously defined in the case where the two matrices consist of only one column. The corresponding induced norm is called the Frobenius norm.

Definition A.3 (Frobenius Norm) Given a matrix A, its Frobenius norm is defined as:

(A.8)

In the special case where the matrix is a vector, the two norms
coincide:
(A.9)
A.4 Inverse Matrix
Definition A.4 (Square Matrix) A matrix A is said to be
square if it has the same number of rows and columns.
A square matrix in which all elements below (or above) the
diagonal are null is said to be triangular.
A square matrix in which all elements other than the
diagonal are null is said to be diagonal:

(A. 10)

A particular square matrix is the identity: .


The identity matrix is the neutral element for the matrix
product:
Definition A.5 (Inverse Matrix) Let A be a square matrix. If there exists a matrix B such that AB = BA = I, then B is called the inverse of A and is denoted by A⁻¹.

Properties:
1. The inverse, if it exists, is unique

2.
3. .

Given a linear system of n equations in n unknowns


, we can write , but recall that to solve a
linear system numerically, one should not compute the
inverse (Strang 1988).
We will see ahead how the notion of inverse generalises to
non-square matrices with the Moore-Penrose inverse.
A.5 Determinant
Definition A.6 (Determinant) Let A be a square matrix
. The determinant of A is defined (recursively) as
follows, for a fixed i:

(A. 11)

where is the submatrix of A obtained by suppressing row i


and column j.
This definition is also named Laplace expansion along row i. It
could be defined analogously along a column. The choice
of row (or column) is arbitrary.
For a matrix:

(A. 12)

Remark A.1 The determinant of a triangular matrix is the


product of the diagonal elements.
Properties:
1.

2. with

3.
4.

5. A is invertible

6. Swapping two rows (or columns) of A changes the sign of


.

Definition A.7 (Minor of Order q) Let A be a matrix. We call a minor of order q extracted from A the determinant of a q × q submatrix of A.
Definition A.8 (Adjoint Matrix) The adjoint matrix of a square matrix A is defined as

(A.13)

where the relevant submatrix of A is obtained by suppressing row j and column i. The entries of the adjoint are also called cofactors of A and are in fact minors of A, up to the sign.
The adjoint matrix is closely related to the inverse via
the following relationship:
(A. 14)

Definition A.9 (Singular Matrix) A matrix A such that det A = 0 is said to be singular.
By the Property 5, singular is synonymous with non-invertible.
The determinant is related to the notion of linear
independence of a set
of vectors.
Definition A. 10 (Linear Independence) The vectors
of are said to be linearly independent if

(A. 15)

If the vectors are not linearly independent, then (at least) one of them is a linear combination of the others.
Proposition A. 1 The vectors of are linearly
independent if and only if
(A. 16)

A.6 Orthogonal Matrices

A set of vectors is said to be mutually orthogonal if each is orthogonal to every other. If the vectors are also all unitary, then they are said to be orthonormal.

Proposition A.2 If the vectors are mutually orthogonal, then they are linearly independent.

Definition A.11 (Orthogonal Matrix) A square real matrix A is said to be orthogonal if

(A.17)

So for an orthogonal matrix, the transpose coincides with the inverse. It also follows from the definition that if A is orthogonal, its columns (or rows) are mutually orthogonal unit vectors (i.e. orthonormal).
Remark A.2 A rectangular matrix B can satisfy BᵀB = I or BBᵀ = I (but not both). It means that its columns or rows are orthonormal vectors. It is called in such a case semi-orthogonal.
Properties: if A is orthogonal, then:
1. its determinant equals ±1
2. it preserves the norm of vectors.

The second property means that transforming a vector with an orthogonal matrix does not alter its length. In fact, orthogonal matrices are all and only the linear rigid transformations. In particular, those with a positive determinant are a pure rotation, while if the determinant is negative, they contain a specular reflection.
Definition A. 12 (Group) A group is an algebraic structure
formed by a
nonempty set endowed with an internal binary operation,
satisfying the
properties of associativity, existence of the neutral element
and existence of the inverse of each element. The operation is
internal if it always returns an element of the set itself.
Orthogonal matrices form a group. In particular, the group
of orthogonal matrices in is called the orthogonal group
and is denoted by .
The rotation matrices also form a group, called the special
orthogonal group and denoted (in ) by . Clearly,
.
A.7 Linear and Quadratic Forms

Definition A.13 (Symmetric and Antisymmetric Matrix) A square matrix A is said to be symmetric if A = Aᵀ. It is said to be antisymmetric (or skew-symmetric) if A = −Aᵀ.

Properties: let x, y be vectors and A a matrix of suitable dimensions:
1. xᵀy is a linear form in x (for fixed y)
2. xᵀAx is a quadratic form in x
3. xᵀAy is a bilinear form in x and y.

Remark A.3 In a quadratic form, one can replace A with its symmetric part (A + Aᵀ)/2 (symmetric by construction) without changing the result, so it can always be assumed that the matrix A in a quadratic form is symmetric.

Definition A.14 (Positive Definite Matrix) A symmetric matrix A is said to be positive definite if xᵀAx > 0 for every x ≠ 0. It is said to be positive semidefinite if xᵀAx ≥ 0 for every x.

Properties: let A and B


1. and are positive semidefinite by construction

2.

3. is antisymmetric

4. .

Property 4 derives from the fact that the null matrix is the only one simultaneously symmetric and antisymmetric.
Theorem A. 1 (Cholesky) A positive semidefinite matrix A can be
uniquely decomposed as
(A. 18)
where K is an upper triangular matrix with positive diagonal.

A.8 Rank

Definition A.15 (Rank of a Matrix) The rank of a matrix A is the maximum order of the nonzero minors extracted from A.

Proposition A.3 Let A be a matrix. Its rank is equal to the number of linearly independent columns or rows it contains.

Remark A.4 If the rank equals the smaller of the two dimensions, it is said that A has full rank.
Properties:
1. A square has full rank

2.

3. The rank does not change after swapping two rows (or
columns)

4. .

5. If B has full rank equal to the number of rows:

6. If B has full rank equal to the number of columns:

Definition A.16 (Null Space of a Matrix) Let A be a matrix. Its null space (or kernel) is the subspace defined as follows:

(A.19)

The null space is a subspace because:

it always contains the null vector;

it is closed under linear combination.
Definition A. 17 (Image of a Matrix) Let A be a
matrix. Its image, denoted by , is the subspace of
defined as follows:
(A.20)

Remark A.5 The image of A is the subspace generated by the columns of A, that is, the vectors of the image are linear combinations of the columns of A. For this reason it is also called the column space of A.
This is easily seen, thanks to the block matrix product:

(A.21)

where by , we have denoted the columns of A.


Properties:
1. It follows from the previous observation
and from Prop. A.3

2. .

Theorem A.2 (Rank Nullity) Let A be a matrix , then

(A.22)

Corollary A.1 A matrix A has full rank if and only if its null space is trivial.

A.9 QR Decomposition

Theorem A.3 (QR Decomposition) Let A be a matrix; then there exist an orthogonal matrix Q and an upper triangular matrix R such that

(A.23)

Note that, because Q is nonsingular, .


Remark A.6 If , the last rows of R are zero, and
so we can write the reduced QR decomposition:

(A.24)

where is upper triangular and is semi-


orthogonal, that is, .
Remark A.7 If and A has full rank, the QR decomposition
is unique as soon as R is required to have positive diagonal
elements.

A.10 Eigenvalues and Eigenvectors

Definition A.18 (Eigenvalues) Let A be an n × n matrix. The eigenvalues of A are the roots of the characteristic polynomial:

(A.25)

Let λ be an eigenvalue of A; then there exists a nonzero vector such that

(A.26)

Such a vector is called an eigenvector of A.
Let be the eigenvalues of the matrix A .
Then:

(A.27)

(A.28)

Remark A.8 A singular matrix possesses at least one null eigenvalue.

Proposition A.4 A symmetric real matrix has real eigenvalues.

Proposition A.5 A symmetric matrix is positive definite if and only if all its eigenvalues are positive.

Proposition A.6 An antisymmetric real matrix of even dimension has purely imaginary eigenvalues, occurring in conjugate pairs. In the case of odd dimension, a null eigenvalue also appears.

Proposition A.7 An orthogonal matrix has eigenvalues of unit


modulus.

Theorem A.4 (Schur’s Real Canonical Form) If A is real,


there exists an orthogonal real matrix V such that is upper
block triangular, with blocks of size 1 or 2 on the diagonal. The
eigenvalues of A coincide with the eigenvalues of the diagonal blocks,
where blocks of dimension 1
correspond to real eigenvalues and blocks of dimension 2 to conjugate pairs
of complex eigenvalues.
Observe that in general V does not contain the eigenvectors of A.
Two remarkable corollaries follow from this theorem, of
which the first is the spectral theorem for symmetric real
matrices.
Corollary A.2 (Schur’s Diagonalisation) Let A be a real
symmetric matrix with eigenvalues . Then, there exists a
real diagonal matrix and a real orthogonal
matrix V
whose columns are the eigenvectors of A, such that:
(A.29
)

The proof is simple: since A is symmetric, the eigenvalues are all real, so T is upper triangular, but since it must also be symmetric, it can only be diagonal.
For antisymmetric matrices, instead, the following applies:

Corollary A.3 Let A be a real antisymmetric matrix with


eigenvalues . Then there exists a block diagonal matrix
with and an orthogonal matrix V
, such that:
(A.30)
Proof: since A is antisymmetric, T is upper block triangular, but since it must also be antisymmetric, it can only have the structure of the statement.
Remark A.9 If the order of A is odd, then one eigenvalue of A
is null and consequently the last element of the diagonal of
S is zero.
Remark A.10 Antisymmetric matrices have even rank.

Proposition A.8 (Rayleigh Quotient) Let A be a real


symmetric matrix with eigenvalues ordered from largest to
smallest. Then:

(A.31)

This is easily proved by Schur’s diagonalisation. From this it


also follows that
(A.32)

Proposition A.9 (Similar Matrices) Let A be a matrix


and G be a nonsingular matrix, then has the same
eigenvalues as A. A and are said to be similar.

A.11 Singular Value Decomposition


Theorem A.5 (Singular Values Decomposition) Let A
be a matrix. There exists a matrix D with positive
diagonal elements
and null elsewhere, an orthogonal matrix
U and an orthogonal matrix V such that
(A.33)
The diagonal elements of D are called singular values. The columns
of U are the left singular vectors ; the columns of V are the right
singular vectors.
As we will make clearer later, the Singular Value
Decomposition (SVD) is related to the spectral
decomposition, and singular values are related to
eigenvalues.
The following proposition justifies the importance of SVD.
Proposition A.10 (Null Space and Image of the Matrix) Let A be a matrix and let its singular value decomposition be given. The columns of U corresponding to nonzero singular values are a basis for the image of A, while the columns of V corresponding to null singular values are a basis for the null space of A.

Compare this result with Theorem A.2. Figure A.1 gives


a graphical representation of the decomposition.

Fig. A.1 Graphical representation of the SVD, in which the partitions


corresponding to the null and non-null singular values are highlighted

The previous proposition provides, via SVD, an algorithm for determining bases for the image and the null space of a matrix.

Remark A.11 The number of nonzero singular values is equal


to the rank of the matrix.
Remark A.12 If A is a square matrix, .

More specifically, the condition number indicates the extent to which the matrix is close to singularity. If the smallest singular value is close to zero, then the condition number is large and the matrix A is ill-conditioned, that is, it is singular for practical purposes.
The following proposition specifies the relationship with
eigenvalues.

Proposition A.11 Let A be a matrix and let σ1, …, σr be its nonzero singular values. Then σ1², …, σr² are the nonzero eigenvalues of AᵀA and AAᵀ. The columns of U are the eigenvectors of AAᵀ; the columns of V are the eigenvectors of AᵀA.

Proof This is the Schur diagonalisation of AᵀA (it is symmetric by construction, so it satisfies the assumptions of the spectral theorem, Corollary A.2). So the diagonal matrix contains the eigenvalues of AᵀA, and the columns of V are its eigenvectors.
Remark A.13
Proposition A.12 (Compact SVD) Let A be a matrix and let its singular value decomposition be given. Let

(A.34)

be its nonzero singular values. Then A equals the product of the matrix composed of the first r columns of U, the diagonal matrix of the nonzero singular values, and the transpose of the matrix composed of the first r columns of V.

The following theorem exploits the SVD to determine the lower-rank approximation of a matrix.

Theorem A.6 (Eckart-Young) Let A be a matrix of rank r and let its compact SVD be given. The matrix of lower rank closest to A in Frobenius norm is the matrix defined as

(A.35)

obtained by keeping only the largest singular values.

Proof (Sketch) We can rewrite A as a sum of rank-one terms built from the columns of U and V. Each of the terms of the summation is a matrix of rank one, so we write A (which has rank r) as a weighted sum of r matrices of rank one, with weights equal to the corresponding singular values. If a lower-rank approximation is sought, some terms must be cancelled, and the cost to pay, in terms of the Frobenius norm (by Remark A.13), is the corresponding singular value. So the least-cost solution is to delete the terms starting from the end.

A primary use for the SVD in this text is the solution of homogeneous systems of linear equations. Consider the system Ax = 0. Two cases arise:

the rank of A is full, that is, the system has only the trivial solution x = 0;

the rank of A is not full, that is, the system has non-trivial solutions. Note that if x is a solution, so is any scalar multiple of x.

In practice the case often arises where A has full rank, so strictly speaking there are no meaningful solutions, but this is actually due to measurement errors or noise. In such cases we then look for a non-trivial approximate solution, that is, we solve the least-squares problem:

(A.36)

The constraint on the norm of x is placed to avoid the trivial solution, and the value 1 is completely arbitrary.

Proposition A.13 (Least-Squares Solution to a Linear System of Homogeneous Equations) Let A be a matrix. The solution of the least-squares problem

(A.37)

is the last column of V in the singular value decomposition of A.

Proof After making a change of variable through V (recall that V is orthogonal), we get:

(A.38)

The constraint prevents us from choosing the null vector, and since the singular values in D are ordered, the least-cost solution is the vector with a single 1 in the last position, which transformed back through V gives the last column of V.

We will refer to it as the least right singular vector of A.
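In MATLAB this solution is obtained directly from the SVD; a minimal sketch:

% Least-squares solution of A*x = 0 subject to norm(x) = 1.
A = rand(10, 4);          % any m-by-n matrix (random here, for illustration)
[~, ~, V] = svd(A);
x = V(:, end);            % least right singular vector: minimises norm(A*x)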
Proposition A.14 (Orthogonal Procrustes Problem) Given two matrices A and B, the solution of the problem

(A.39)

is obtained from the SVD of the product of the two matrices.

Proof Expanding the norm, the problem becomes that of maximising a trace; with a change of variable Z built from the orthogonal factors of the SVD, then

(A.40)

Since Z is orthogonal, the maximum is reached when Z is the identity, hence the thesis.

In the case where the constraint is added that Q is a


rotation, that is, , the solution is modified as
follows:
(A.41)
Indeed, . So, if
, we take as before, while if ,
we are forced to take a Z with negative determinant. Since the singular values are ordered, the least-cost choice is to take Z = diag(1, …, 1, −1).
A special case of this problem occurs when . In that
case it is a
matter of finding the closest orthogonal matrix Q to the given
matrix A, that is:
(A.42)
The solution reduces to replacing the matrix D with the
identity in the SVD of A. Similarly we proceed in the case of a
rotation matrix.
On the Procrustean problem, see also Goodall (1991).
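A minimal sketch of the rotation-constrained solution, stated for the formulation Q = argmin ||Q*A − B|| over rotations (which may differ in notation from the proposition above), with A and B two d-by-n point sets:

% Orthogonal Procrustes with the constraint det(Q) = +1.
A = rand(3, 10); B = rand(3, 10);            % hypothetical point sets
d = size(A, 1);
[U, ~, V] = svd(A * B');
s = sign(det(V * U'));                       % flip the last direction if needed
Q = V * diag([ones(1, d-1), s]) * U';        % closest rotation to the data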

A.12 Pseudoinverse

The pseudoinverse, also known as the Moore-Penrose inverse, is a generalisation of the inverse from square matrices to non-square matrices. It is defined as a matrix that satisfies certain properties resembling those of the ordinary inverse. Let A be a matrix. The Moore-Penrose inverse of A, denoted by A⁺, satisfies a set of defining properties, and it can be shown that for each A the pseudoinverse exists and is unique.
Some properties derived from the defining equations:

A⁺ = A⁻¹ for nonsingular A;

A⁺ = (AᵀA)⁻¹Aᵀ if A has full column rank;

A⁺ = Aᵀ(AAᵀ)⁻¹ if A has full row rank.

Please note that the last two properties follow from the first two. They imply that if A has full column rank, then A⁺ is a left inverse of A, that is, A⁺A = I; if A has full row rank, then A⁺ is a right inverse of A, that is, AA⁺ = I.
Proposition A.15 (Least-Squares Solution to a Linear System of Equations) Let A be a matrix and b a vector. If x is a solution of the system of equations (called normal equations):

(A.43)

then x also solves the least-squares problem

(A.44)

Normal equations always have solutions. If A has full column rank, then AᵀA is non-singular and therefore the unique solution writes2

(A.45)

Otherwise, there are many solutions, and one is arbitrarily chosen as the representative.
The pseudoinverse can be used to find a least-squares
solution to a linear system of equations with
(A.46)
When , the unique solution is exactly (A.45);
otherwise, it is the one with minimum Euclidean norm.
In general, the pseudoinverse of A can be calculated using
the SVD as
(A.47)
where is the diagonal matrix with the reciprocals of
the nonzero singular values on the diagonal, and zeroes
elsewhere.
Ultimately, the least-squares solution of inhomogeneous systems also reduces to the SVD.
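A minimal sketch of the computation of (A.47), along the lines of MATLAB's built-in pinv:

% Pseudoinverse via the SVD.
A = rand(6, 3);                              % any matrix (random, for illustration)
[U, D, V] = svd(A, 'econ');
d = diag(D);
tol = max(size(A)) * eps(max(d));            % threshold for "nonzero" singular values
dinv = zeros(size(d));
dinv(d > tol) = 1 ./ d(d > tol);             % reciprocals of the nonzero singular values
Apinv = V * diag(dinv) * U';                 % agrees with pinv(A)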
A.13 Cross Product
This is a product between two vectors that returns a vector,
and is defined only in .
Definition A. 19 (Cross Product) The cross product
of two vectors is defined as the vector:

(A.48)

The cross product is used (among other things) to check whether two vectors differ only by a multiplicative constant.

Remark A.14 The cross product of two vectors is null if and only if one is a multiple of the other. In fact, this is equivalent to stating that all determinants of order two extracted from the two vectors are null, so the matrix they form has rank one, from which the thesis follows, and vice versa.

The cross product is naturally associated with an antisymmetric matrix:

Remark A.15 Given a vector , the matrix

(A.49)
acts as the cross product for , that is, .

The matrix is antisymmetric, and , since


It is therefore singular, and in particular it
has rank two, since the rank must be even (Remark A.10).
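In code, with the standard definition (our symbols):

% Antisymmetric matrix associated with a 3-vector, acting as the cross product.
skew = @(a) [0 -a(3) a(2); a(3) 0 -a(1); -a(2) a(1) 0];
a = [1; 2; 3]; b = [4; 5; 6];
norm(skew(a)*b - cross(a, b))      % = 0: skew(a)*b equals cross(a,b)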

Remark A.16 Given a vector

(A.50)

Proposition A.16 (Triple Product) Given three vectors, we have:

(A.51)

This quantity also takes the name of the triple product. The proof follows immediately using the Laplace expansion and the definition of the cross product.

The sign of the triple product depends on the order of the arguments, but is invariant under a cyclic permutation. The absolute value of the triple product is equal to the volume of the parallelepiped constructed on the three vectors. Consequently, assuming that none of the three vectors is zero, the triple product is zero if and only if the vectors are coplanar. In other words, since the triple product is a determinant, it can be used to check if three vectors are linearly dependent.

Proposition A. 17 (Lagrange’s Formula) Givenfour ar


vectors e
,we have gi
ve
n
by
Proposition A. 18 Given three vectors such
the two scalars such that
(A.52)

(A.53)

(A.54)
and likewisefor .

Proof The hypothesis guarantees the


existence of the solution, since one of the vectors is linear
combination of the other two. Regarding the formula
(A.54), we have:

thesis.
It is shown (Kanatani 1993) that the formula (A.54) also
solves the least- squares problem:
(A.55
whe )
.
n

A. 14 Kronecker’s Product

Definition A.20 (Kronecker’s Product) Let A be a


matrix and B a matrix. Kronecker’s product3 of A and B is
the matrix
defined by

(A.56)

Note that Kronecker’s product is defined for any pair


of matrices. Properties:
1.

2. if the dimensions are


compatible

3.
4.

5. .
Proposition A. 19 The eigenvalues of are Kronecker’s
product of the eigenvalues of A by the eigenvalues of B.

Proof Let the Schur decompositions of A and B be given, where L and M are upper triangular matrices whose diagonal elements are the eigenvalues of A and B, respectively:
(A.57)
Since and are similar, they have the same
eigenvalues, and since is upper triangular, the
eigenvalues of are the
diagonal elements of .

It follows immediately from the proposition that:

Corollary A.4 (Kronecker’s Product Rank)


(A.58)

The vectorisation of a matrix is a linear transformation that converts the matrix to a (column) vector.
Definition A.21 (Vectorisation) The vectorisation of a
matrix A, denoted by , is the vector that is
obtained by stacking the
columns of A one below the other.4
The connection between the Kronecker product and
vectorisation is given by the following relation, also known as
the vec-trick:
(A.59)
for matrices of compatible dimensions. This will be
very useful for extracting the unknown X from a matrix
equation.
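The identity can be checked numerically (the standard form being vec(AXB) = kron(B', A) vec(X)):

% Numerical check of the vec-trick.
A = rand(3, 4); X = rand(4, 5); B = rand(5, 2);
lhs = reshape(A * X * B, [], 1);           % vec(A*X*B)
rhs = kron(B', A) * reshape(X, [], 1);     % kron(B',A) * vec(X)
norm(lhs - rhs)                            % ~ 0, up to round-off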
When dealing with symmetric matrices, the vectorisation
must consider only the non-repeated elements. For this
purpose we define the following.
Definition A.22 (Half-Vectorisation) The half-
vectorisation of a symmetric matrix – denoted by
– is the vector
obtained by vectorising only the lower triangular
part of A.
To go from to , there exists an appropriate

matrix , called duplication matrix, such that


. In this way we can therefore write:
(A.60)
Similarly, we can treat the case where X is diagonal. Let us
define
as the operator that consumes a square matrix and
returns the vector containing the diagonal of its argument,
while the operator consumes a vector and constructs a
diagonal matrix.5 We can therefore write:
(A.61)
for a suitable matrix (different from the previous one). However, for computational efficiency reasons, it is preferable to avoid forming this matrix, which has many more columns than the n unknowns.
Proposition A.20 Let A be an matrix, B be an matrix
and X be an diagonal matrix. Then, the following relation holds:
(A.62)

where ⊙ denotes the Hadamard product (element by element). From this, we obtain:

(A.63)
It is easy to see that is nothing else but
the diagonal block matrix which has the columns of A as
blocks. This construction is
implemented in the CV Toolkit by the MATLAB function diagc.

A.15 Rotations
The rotations in are represented by orthogonal
matrices with positive determinant and form the special
orthogonal group denoted by .
According to a well-known theorem of Euler, every rotation in three dimensions can be written as a rotation of some angle θ around an axis passing through the origin, identified by a versor n. The corresponding rotation matrix can be obtained from θ and n using Rodrigues' formula:

R = I + sin θ [n]× + (1 − cos θ) [n]×²    (A.64)

where [n]× is the antisymmetric matrix associated with n.

Note that it only takes three numbers to encode a rotation, since n is unitary. In fact, we can use the vector:
rotation, since is unitary. In fact, we can use the vector:
(A.65)
Euler also proved that every rotation in around an axis
passing through the origin can be considered as a sequence
of three rotations around the Cartesian axes.
Let be the rotation angles along the Cartesian
axes x, y and z, respectively, identified by the three unit
vectors . Based on
Rodriguez’s formula, the three rotation matrices are
(A.66)
(A.67)
(A.68)
By convention we define the rotation matrix corresponding
to the triplet of angles as follows:
(A.69)
Beware that the order of composition of the matrices
matters (since
they do not commute). This representation of the rotation
matrices is called the Euler system.6 Like the representation
with axis and angle, it has three degrees of freedom.
The Euler system has singularities that go by the name of gimbal lock, at which the degrees of freedom are reduced. In particular, when the rotation about the middle axis equals ±π/2, the roles of the other two angles become interchangeable:

(A.70)

that is, there is only one degree of freedom. See the MATLAB functions of the CV Toolkit eul, ieul, rod and irod.

A.16 Matrices Associated with Graphs


Let be a simple undirected graph with n nodes
and m edges. The adjacency matrix of G is defined as the
matrix where

(A.71)

The incidence matrix of the simple directed graph with n nodes and m edges is defined as

(A.72)

The rows of the incidence matrix correspond to the vertices


of G and its columns to the edges of G.
A spanning tree of a connected graph is a tree that contains all
the
vertices of the graph and only the edges necessary to connect
any pair of
vertices with one and only one path. Removing any edge from
the spanning tree disconnects it.
Remark A.17 Let G be a directed and connected graph. A square submatrix of the incidence matrix is non-singular if and only if its columns correspond to the set of edges of a spanning tree of G.
It follows that G is connected if and only if .
The degree matrix of a graph G is defined as
(A.73)
or, equivalently: .
Regardless of the orientation assigned to the edges of G,
the following relation holds:
(A.74)
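A small numerical check consistent with the relation just stated (assuming it is the classical identity relating incidence, degree and adjacency matrices; the graph below is arbitrary):

% Adjacency, incidence and degree matrices of a small graph.
E = [1 2; 2 3; 3 1; 3 4];                 % edges of a 4-node graph
n = 4; m = size(E, 1);
A = zeros(n); B = zeros(n, m);
for k = 1:m
    i = E(k,1); j = E(k,2);
    A(i,j) = 1; A(j,i) = 1;               % adjacency (undirected)
    B(i,k) = 1; B(j,k) = -1;              % incidence (arbitrary orientation)
end
D = diag(sum(A, 2));                      % degree matrix
norm(B*B' - (D - A))                      % = 0, regardless of the edge orientation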

References

Colin Goodall. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B (Methodological), 53: 285–339, 1991.

K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press, Oxford, 1993.

Gilbert Strang. Linear Algebra and Its Applications. Harcourt Brace & Co., San Diego, CA, 1988.

Appendix B

Matrix Differential Calculus

B.1 Derivatives of Vector and Matrix Functions

The derivatives of functions involving vectors and matrices ultimately lead back to the partial derivatives of the individual components, and it is all about how to arrange these partial derivatives. There are several conventions, but the decisive requirement for our purposes is that the chosen one must allow the calculation of derivatives through formal manipulations of matrix formulae. In this book we follow the notation of Magnus and Neudecker (1999), of which this chapter gives a brief summary.

Definition B.1 Let  be a differentiable function. Its derivative is defined as follows:
(B.1)
or
(B.2)
By this definition, it coincides with the usual Jacobian matrix of f.

Example B.1 The derivative of the vector function is


.

As special cases of the Definition B.1, we have:


1. , whose derivative is a column vector

2. , whose derivative is a row vector,7 which coincides with the gradient of f, also denoted as .

Example B.2 Consider the scalar vector function . Its derivative is .

Definition B.1 is widely adopted, and it is a good definition,


since it leads, e.g. to the possibility of applying the chain
rule. However, when matrices
come into play, things get complicated, as it is not obvious
how to extend it to functions . In this case
there are many ways of
arranging the mnpd partial derivatives; the notation of Magnus
and
Neudecker (1999) has the advantage of generalising the
previous one if the
matrices become vectors and, more importantly, of allowing
the application of the chain rule.
Definition B.2 Let F be a differentiable function . The derivative of F in X is the (Jacobian) matrix :

(B.3)

Warning According to this notation, the derivatives of scalar functions of matrices and matrix functions of a scalar are vectors. Instead, the
expressions and found elsewhere are matrices of the
same
dimension as and X, respectively, whose entries are the
same partial derivatives appearing and but
differently arranged:
1.
2.
.

Example B.3 Consider the matrix scalar function
It is immediately verified that
while its Jacobian matrix is, according to our definition:


(B.4)

Theorem B.1 (Chain Rule) Let F be a function  and G a function . If the composite function  is differentiable in , its derivative is
(B.5)
where .

The following theorem is very useful, as it transforms the


problem of finding the Jacobian of a matrix function into
the problem of finding its differential, which is usually
easier.
Theorem B.2 (Identification) Let F be a differentiable
function and let be the differential. Then:
(B.6)
is equivalent to

(B.7)

Remark B.1 Given two differentiable matrix functions


and , let us compute the derivative of the
product
function:
(B.8)
(B.9)

therefore, by the identification theorem


(B.10)

We now consider several cases of a matrix function and


compute its Jacobian by exploiting the identification
theorem.

Example B.4 Derivative of .
(B.11)
Therefore, by the identification theorem
(B.12)

Example B.5 Derivative of  where X is .
(B.13)
We introduce the commutation matrix K, a specific permutation matrix that transforms  into . In particular, let A be an  matrix:
(B.14)
So
(B.15)
and, by the identification theorem
(B.16)
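The machinery of the identification theorem can always be checked numerically by comparing a formally derived Jacobian with a finite-difference approximation. Since the specific maps of Examples B.4 and B.5 are elided in this copy, the sketch below does the check for the linear map X ↦ AXB, whose Jacobian kron(B',A) follows from the identity vec(AXB) = (Bᵀ⊗A)vec(X); it is an illustration, not a transcription of the examples.

% Finite-difference check of a Jacobian obtained by formal manipulation
A = randn(3,4);  X = randn(4,2);  B = randn(2,5);

J = kron(B', A);                           % claimed Jacobian of vec(A*X*B) w.r.t. vec(X)

F = @(X) reshape(A*X*B, [], 1);            % the map, vectorised
J_fd = zeros(numel(A*X*B), numel(X));
h = 1e-6;
for k = 1:numel(X)
    dX = zeros(size(X));  dX(k) = h;
    J_fd(:,k) = (F(X + dX) - F(X)) / h;    % forward differences, column by column
end
disp(norm(J - J_fd))                       % essentially zero (the map is linear in X)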
Example B.6 Derivative of the inverse. First let us work out the
differential of the inverse:
(B.17)
and therefore
(B.18)
Then:
(B.19)
hence, by the identification theorem
(B.20)

Example B.7 Derivative of the pseudoinverse where X is


. The differential of is derived in Golub and
Pereyra (1973):
(B.21)
From the differential, the derivative of  is easily obtained:
(B.22)
This is a general expression that makes no assumption on the rank or the shape of X. If X has full column rank, the  term vanishes, whereas if X has full row rank the  term vanishes, by the properties of the pseudoinverse.
Example B.8 Derivative of where X is

(B.2
3)
Thanks to the linearity of , we obtain:

So (B.2

and by exploiting (A.59) and Theorem B.2, we arrive at 4)

(B.2

5)

(B.2
6)

Example B.9 As a special case of the previous example with


, we get
, whose derivative is
(B.27)

B.2 Derivative of Rotations


B.2.1 Axis/Angle Representation
Let  be a versor representing the axis of rotation and  an angle. The corresponding rotation matrix is given by Rodrigues' formula:
(B.28)
The derivative is
(B.29)
or
(B.30)
For example, for , we have:
(B.31)
(B.32)

B.2.2 Euler Representation

Consider an Euler representation of a rotation with angles , defined as follows:
(B.33)
where  denotes the versor of the i-th axis (e.g. ).
For  (angle ) the derivative is
(B.34)
and similarly one can proceed for  and . In the end:
(B.35)

References

G. H. Golub and V. Pereyra. The differentiation of pseudo-inverses and nonlinear least-squares problems whose variables separate. SIAM Journal on Numerical Analysis, 10: 413–432, 1973. doi: https://doi.org/10.1137/0710036.
J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, New York, revised edition, 1999.

Appendix C

Regression
C.1 Introduction
Fitting of a model to noisy data, or regression, is an important
statistical
tool, frequently employed in computer vision for a wide variety
of purposes.
The aim of regression analysis is to fit the parameters of a
function
(model) to the observations (data), which are obtained by
measuring a
dependent variable in correspondence to m data values of n
independent variables , with . The
model function has the form , where contains the p
parameters .
The adequacy of the model in fitting to the data is
measured through its residual, defined as the difference
between the measured value of the
dependent variable and the value predicted by the model
(at the same values of the independent variables):
(C. 1)
The goal is to compute the parameter estimates
that best explain the observations, in the
sense that they minimise the residuals defined above.

C.2 Least-Squares

The Least-Squares (LS) method is the most common regression estimator. The regression traces back to the solution of a system of m equations in p unknowns:
(C.2)
which can be rewritten in compact form as
(C.3)
The cost function to be minimised is
(C.4)
It is statistically convenient to minimise this weighted
cost function instead:8
(C.5)

where and is the variance of the


measure, which also coincides with the variance of the
corresponding residual.
Division by the standard deviation has the effect of
properly weighting the contribution of each equation:
residuals affected by less error weigh more than those with
greater error. Also, if the residuals have different
physical dimensions (and thus are inherently
incommensurable), division by makes them dimensionless.
It can be shown that the minimiser of (C.5) is also the maximum
likelihood estimator, under the assumption that the error on
has
Gaussian distribution with zero mean. In that case, the
least-squares estimator is also statistically optimal, in
the sense that it has minimum variance and its expected
value coincides with the true value
(unbiasedness).
When the Gaussian error assumption fails, the least-
squares estimator can be completely misleading, as we will
discuss later.

C.2.1 Linear Least-Squares

In linear regression we consider a linear function of the parameters, that is:
(C.6)
where . The entries of A are functions of the independent variables: .
Weighting the equations with the variance, we get:
(C.7)

Following Proposition A.15, the least-squares regression


parameters can be obtained by solving the normal equation:
(C.8)
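As a concrete illustration of (C.7)–(C.8), the following sketch (with synthetic data and variances assumed known) solves a weighted linear least-squares problem in MATLAB; in practice a QR-based solve of the scaled system is preferable to forming the normal equations.

% Sketch of weighted linear least-squares (synthetic example)
m = 50;  p = 3;
A = randn(m, p);  theta_true = [1; -2; 0.5];
sigma2 = 0.01*ones(m,1);                       % variances of the measurements (assumed known)
y = A*theta_true + sqrt(sigma2).*randn(m,1);

W = diag(1./sigma2);                           % weights = inverse variances
theta = (A'*W*A) \ (A'*W*y);                   % normal equations (C.8); "\" avoids an explicit inverse

% numerically preferable alternative: scale the rows and use a QR-based solve
theta_qr = (A./sqrt(sigma2)) \ (y./sqrt(sigma2));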

C.2.2 Non-linear Least-Squares

If the function is nonlinear in the parameters, we have the generic relation:
(C.9)
from which
(C.10)
In the following, to lighten the notation, we neglect the
weighting with .

C.2.2.1 Gauss-Newton Method

We therefore consider the minimisation of (C.4). Starting from a suboptimal current solution , we want to determine the increment  that minimises the cost function. We
approximate with the Taylor series truncated to first order:
(C. 11)
Observe that since is constant, ; let us call
this matrix J.
We substitute in the formula of the cost function
, obtaining

(C. 12)

The increment that minimises the equation (C.12) is


obtained by setting to zero its derivative (with respect to ),
from which we get:
(C. 13)
This is the normal equation corresponding to the least-squares
solution of the equation:

(C.14)
Because of the first-order approximation in (C.11),  does not minimise the cost function but, under appropriate conditions on , it is a
better approximation of the solution, so the iteration
produces a sequence of values converging
to the minimum of (C.4).
This iteration is known as the Gauss-Newton
method, and an implementation is shown in Listing
C.1.
Listing C.1 Gauss-Newton method
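The body of Listing C.1 is not reproduced in this copy. The following minimal sketch (function name and interface are illustrative assumptions, not the Toolkit code) implements the iteration around (C.13)–(C.14), with f returning the model predictions, J its Jacobian and y the data.

function theta = gauss_newton_sketch(f, J, y, theta)
    % Sketch of the Gauss-Newton iteration (not the Toolkit's Listing C.1)
    for it = 1:100
        r  = y - f(theta);               % residuals at the current estimate
        Jk = J(theta);                   % Jacobian at the current estimate
        dtheta = Jk \ r;                 % least-squares solution of J*dtheta = r, cf. (C.14)
        theta  = theta + dtheta;
        if norm(dtheta) < 1e-8, break, end
    end
end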

In the case where the Jacobian matrix of does not have full
rank, one can
generalise the procedure by computing the
increment with the pseudoinverse:
(C. 15)
Finally, remember that we dropped the variance weighting, so the normal equation (C.13) should read:
(C.16)

C.2.3 The Levenberg-Marquardt Method

Since  is an approximation of the Hessian matrix, which in turn is a quadratic approximation of the function at a point, the Gauss-Newton step can be interpreted as the one leading to the minimum of the paraboloid that locally approximates the cost function.
This approximation becomes more accurate the closer we are to the minimum. On the other hand, when the quadratic
minimum. On the other hand, when the quadratic
approximation does not hold, Gauss-Newton does not reduce
the error quickly enough, or not at all. In these cases it would
be better to use the gradient descent technique,
which always guarantees (under mild conditions)
convergence to a stationary point, however slowly.
Consider again (C.4); if we wanted to find its minimum
with gradient descent, we would have to calculate its
gradient:

(C.17)
and iterate with
(C.18)
where  is the length of the step. Gradient descent has a somewhat
complementary behavior to Gauss-Newton; in fact it has a very
slow
convergence rate right near the minimum when contour
lines are very stretched in one direction, just where
Gauss-Newton works best.
Levenberg’s idea was to blend the two methods with a
strategy to choose which of the two to weigh more
depending on local conditions.
Combining (C.18) and (C.13) gives:
(C. 19)
For , one gets Gauss-Newton, while for one
obtains
which is a gradient descent. The strategy to
modulate the mixing factor is as follows: when the error
decreases, then Gauss-Newton is working and we make it the
prevailing strategy by decreasing (e.g.
divide by 10). When, on the contrary, the error is
increasing, we drift towards gradient descent by increasing
(e.g. multiply by 10).
Marquardt observed that even when the strategy inclines
towards
gradient descent, we can still exploit the local curvature
information by
scaling the components of the diagonal matrix to make longer
steps where the gradient is small. So the formula for
calculating the Levenberg-
Marquardt (LM) step becomes:
(C.20)
Let it be clear that is not actually computed by
inverting a matrix but by solving a linear system of
equations.
The LM method can be seen as an instance of the more
general class of damped Gauss-Newton methods, where a
diagonal matrix, called damping
term, is added to . The damping term also solves the
problem of rank-deficient Jacobian matrices.
The lsq_nonlin function provided in the CV Toolkit, not
listed here for space reasons, is an instructional
implementation of the Levenberg-
Marquardt method.
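For reference, one step of the update (C.20) with the acceptance strategy described above could look as follows; this is a didactic sketch with an assumed interface, not the Toolkit's lsq_nonlin.

function [theta, lambda] = lm_step_sketch(f, J, y, theta, lambda)
    % Sketch of one Levenberg-Marquardt step (not the Toolkit's lsq_nonlin)
    r  = y - f(theta);
    Jk = J(theta);
    H  = Jk'*Jk;                                     % Gauss-Newton approximation of the Hessian
    dtheta = (H + lambda*diag(diag(H))) \ (Jk'*r);   % damped normal equations (C.20)
    if norm(y - f(theta + dtheta)) < norm(r)
        theta  = theta + dtheta;                     % success: accept and lean towards Gauss-Newton
        lambda = lambda / 10;
    else
        lambda = lambda * 10;                        % failure: drift towards gradient descent
    end
end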

C.3 Robust Regression


In this section we will introduce some concepts concerning
robust statistics and briefly describe some robust estimators.
The interested reader might want to expand on these topics
in Meer et al. (1991) and Stewart (1999).

C.3.1 Outliers and Robustness

The most common regression technique is least-squares, which is only
optimal in the case where the errors that affect the data are
Gaussian
distributed. In the case where the distribution is
contaminated by outliers, that is, observations that do not
follow the general trend of the data, these can corrupt the
least-squares estimate in an arbitrarily large way (see Fig.
C.1). The outliers arise in most cases from gross
measurement errors or impulsive noise that plagues the
data.

Fig. C.1 Example of data corrupted by an outlying sample: least-squares regression line and robust regression line

In order to remedy these drawbacks, robust regression


techniques have been developed that provide reliable results
even in the presence of
contaminated data.
Three indicators are used to evaluate a regression
method: relative efficiency, breakpoint and computational
complexity.
The relative efficiency of a regression method is defined as
the ratio of the minimum achievable variance for the
estimated parameters to the
actual variance provided by the given method. The efficiency
also depends on the noise distribution.
The breakpoint of a regression method is the minimum
fraction of
outlier samples that can arbitrarily cause the estimator to
diverge from the true value. For example, least-squares has a breakpoint equal to 1/m, where m is the sample size, since it
only takes one outlier sample out of m to
arbitrarily corrupt the estimate. Asymptotically (for
) the breakpoint is 0.

Fig. C.2 Loss functions "Bisquare", "Cauchy" and "Huber" (top row) and related weight functions (bottom row). The square function is also drawn for comparison

C.3.2 M-Estimators

The M-estimators are one of the most widely used robust regression methods. The idea is to replace the squares of the residuals in the traditional least-squares method with a function of the residuals:
(C.21)
where is the standard deviation of the residuals, and is
a symmetric, sub-quadratic function having a unique minimum
in zero, called loss
function. Examples of loss functions and their weights are shown in Fig. C.2 under the names "Bisquare", "Cauchy" and "Huber".
M-estimators have reduced sensitivity to outlying samples thanks to the sub-quadratic loss function. If the function becomes flat after a certain point, then the M-estimator can reach a breakpoint of 50%. Otherwise, it benefits from the reduced influence of outliers, but the asymptotic breakpoint remains zero.
The in (C.21) is the standard deviation of the Gaussian
noise that
plagues the measurements, without considering the outliers. If
it is not
specified a priori, it must be estimated from the data itself,
but in a robust way. One possibility is to use the Median
Absolute Deviation (MAD):
(C.22)
where are the residuals. Under the assumptions of normal
noise
distribution, a robust estimate of the standard deviation is
calculated as .
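Since (C.22) is not legible in this copy, the sketch below uses the standard form of the MAD it describes, together with the usual normal-consistency constant, to estimate the noise scale robustly from residuals contaminated by outliers.

% Robust scale estimate from residuals via the MAD (standard form assumed)
r = [randn(100,1); 10*rand(10,1)];                 % residuals with a few gross errors
sigma_mad = 1.4826 * median(abs(r - median(r)));   % robust estimate of the standard deviation
disp([std(r) sigma_mad])                           % std is inflated by the outliers, the MAD is not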
The minimisation of (C.21) is obtained by zeroing out the
derivative:
(C.23)
where is the derivative of . A widely used, though not the
only, way to proceed is to introduce the weight function w
such that . In
this way the previous equation becomes:

(C.2
4)
The solution of the latter equation is equivalent to the
solution of the weighted least-squares problem:

(C.2
5)
The weights, however, depend on the residuals, the residuals
depend on the estimate , and the latter depends on the
weights. Thus, an iterative
solution is needed, called Iteratively Reweighted Least-
Squares (IRLS) and implemented in Listing C.2. If is not
convex, convergence to local minima may occur. The initial
estimate can be obtained, for example, with ordinary least-
squares; however, Stewart (1999) notes that even in the
presence of
low percentages of outliers ( 15%), IRLS may not converge to
the correct
solution from this estimate. Thus, a robust initial estimate is
needed,
obtained for example, by one of the methods we will see
below. The same author also suggests freezing the
estimate after a number of iterations to allow IRLS to
converge better.
Listing C.2 IRLS
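The body of Listing C.2 is not reproduced in this copy. The following is a generic IRLS sketch for a linear model, with the Huber weight function and tuning constant as illustrative assumptions rather than the Toolkit's choices.

function theta = irls_sketch(A, y, sigma)
    % Sketch of Iteratively Reweighted Least-Squares (not the Toolkit's Listing C.2)
    theta = A \ y;                                 % ordinary LS as initial estimate
    k = 1.345 * sigma;                             % Huber tuning constant (assumed)
    for it = 1:50
        r = y - A*theta;
        w = ones(size(r));                         % Huber weights: w = 1 for small residuals ...
        idx = abs(r) > k;
        w(idx) = k ./ abs(r(idx));                 % ... and w = k/|r| beyond the threshold
        sw = sqrt(w);
        theta_new = (A .* sw) \ (y .* sw);         % weighted LS step, cf. (C.25)
        if norm(theta_new - theta) < 1e-8, theta = theta_new; break, end
        theta = theta_new;
    end
end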

C.3.3 Least Median of Squares

The Least Median of Squares (LMS) method, introduced by Rousseeuw and Leroy (1987), was the most widely used robust estimation technique in
computer vision before the advent of RANSAC. In LMS, the
parameters are estimated by solving the non-linear
minimisation problem:
(C.26)

This problem does not have closed-form solutions, and


direct numerical minimisation of the objective function
immediately becomes impractical as the number of variables
increases. One possible approach is the Monte
Carlo method based on random sampling.
Given a model that requires a minimum number of p
observations to instantiate its independent variables, we
take p points at random and
compute a test estimate . We compute the residuals of the
test estimate and take the median of the squares. After
repeating these steps a number of times (ideally for all p-tuples), we choose the test
estimate that realised the least median of the squared
residuals (Listing C.3).
Listing C.3 Least median of squares
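The body of Listing C.3 is not reproduced in this copy. The sketch below implements the random-sampling procedure just described for a linear model A*theta ~ y; names and interface are illustrative assumptions.

function theta_best = lms_sketch(A, y, nsamples)
    % Sketch of Least Median of Squares by random sampling (not the Toolkit's Listing C.3)
    [m, p] = size(A);
    best_med = inf;  theta_best = [];
    for k = 1:nsamples
        s = randperm(m, p);                        % minimal sample of p equations
        theta = A(s,:) \ y(s);                     % test estimate
        med = median((y - A*theta).^2);            % median of the squared residuals
        if med < best_med
            best_med = med;  theta_best = theta;
        end
    end
end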

How many samples of size p should be considered? If m is the


cardinality of
the set of all data points, there are different subsets of p
elements. If only k of them are considered, the question arises
of how large k must be for the procedure to compute the
correct estimate with good
probability. Since the minimum of (C.26) is realised by the
outlier-free samples, the capability to compute the
correct estimate is related to the presence of at least one of these subsets among the k p-tuples considered. Given a
percentage of outlier samples, the following formula
relates k to the probability of computing the correct
estimate:
(C.27)
from which, by fixing a value of close to 1, one can
determine the number k, given p and :

(C.28)
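Equations (C.27)–(C.28) are not legible in this copy; the sketch below uses the standard random-sampling formula they describe to obtain the number of samples k from the outlier fraction, the sample size p and the desired success probability.

% Number of samples for random sampling (standard formula assumed for (C.28))
P = 0.99;   % desired probability of drawing at least one outlier-free sample
p = 4;      % minimal sample size
eps = 0.3;  % assumed fraction of outliers
k = ceil( log(1 - P) / log(1 - (1 - eps)^p) );
disp(k)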

LMS has a breakpoint of 50%, but its relative efficiency is poor in the presence of Gaussian noise. To overcome this
disadvantage, after
termination of LMS, a weighted least-squares problem is
solved:

(C.29)

where the robust estimate of the standard deviation can be


obtained with MAD or with
(C.30)

C.3.4 RANSAC

The Random Sample Consensus (RANSAC) algorithm by Fischler and Bolles (1981) uses the same random sampling
scheme as LMS, but with a different objective function. In fact,
RANSAC seeks to maximise the “consensus” of the test
estimate, that is, the number of points whose residuals are
smaller
(in absolute value) than a given threshold T. This is equivalent
to
minimising the number of points with residuals greater than a
threshold T, so RANSAC can be seen as an M-estimator with a
step loss function, which is zero for residuals contained in  and one otherwise, as shown in Fig. C.3.
Fig. C.3 Loss functions of RANSAC and MSAC viewed as M-estimators
The algorithm operates under the implicit assumption that
there is a
high probability that the consensus for the test estimate
computed on a set of p samples free of outliers is higher than
the support obtained from the “wrong” test estimates, that
is, computed on a sample contaminated with outliers.
The theoretical breakpoint of RANSAC is 50%;9 however, in
practice, if the threshold T is carefully set, RANSAC can
withstand even higher levels of contamination. The price to
pay for this higher robustness compared to LMS is the
presence of a parameter, the threshold T.
Torr and Zisserman (2000) propose to use instead the following loss function:
(C.31)
and call MSAC (M-estimator SAmple Consensus) the resulting variant of RANSAC. They observe that the performance of MSAC is always better than that of RANSAC, so there is no reason to prefer the latter to MSAC. The implementation of MSAC is shown in Listing C.4.
Listing C.4 MSAC
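The body of Listing C.4 is not reproduced in this copy. The sketch below implements MSAC for a linear model with the truncated quadratic loss (C.31); names and interface are illustrative assumptions, not the Toolkit code.

function theta_best = msac_sketch(A, y, T, nsamples)
    % Sketch of MSAC (not the Toolkit's Listing C.4)
    [m, p] = size(A);
    best_cost = inf;  theta_best = [];
    for k = 1:nsamples
        s = randperm(m, p);                        % minimal sample of p equations
        theta = A(s,:) \ y(s);                     % test estimate
        r2 = (y - A*theta).^2;
        cost = sum(min(r2, T^2));                  % residuals beyond T pay a fixed penalty
        if cost < best_cost
            best_cost = cost;  theta_best = theta;
        end
    end
end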
More on RANSAC developments can be found in Raguram
et al. (2013); Baráth et al. (2020).

C.4 Propagation of Uncertainty

The errors that inevitably occur in the measurement operations make the estimation of the input variables


uncertain, and this uncertainty necessarily propagates to the
output variables, which constitute the results of the
regression. The latter should therefore be accompanied by an
indication of the error, usually provided by the standard
deviation.
An uncertain variable is represented by a probability
distribution, that is, a probability density that assigns each
value of the variable a probability. Detailed knowledge of the
distribution is not necessary in most
applications, and moreover, it is often not accessible due to the lack of an accurate model of the measurements. For these reasons we choose to identify the distribution by its first two moments, the mean and the covariance, defined respectively as:
(C.32)
where represents the expected value (or expectation).
When two variables are related by a transformation, the
problem of
propagating uncertainty through the transformation arises.
Precisely if two variables and are related by a linear
relation and the
former is characterised by , one wants to obtain
mean and covariance of the latter.
Using the definitions and linearity of the expectation operator,
it is easily
verified that
(C.33)
and also
(C.34)
If the relation that links and is not linear, it is
necessary to approximate it by its Taylor series expansion
around the point obtaining
(C.35)
where
(C.36)
is the Jacobian matrix of f evaluated in .
Truncating the expansion to the linear term and computing the expectation yields the linear estimate of the mean:
(C.37)
Similarly, the first-order estimate of the covariance is
(C.38)
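A short numerical sketch of (C.37)–(C.38) follows, with the Jacobian approximated by finite differences; the map f and the input statistics are illustrative assumptions.

% First-order propagation of uncertainty through a non-linear map
f = @(x) [x(1)*cos(x(2)); x(1)*sin(x(2))];     % e.g. polar -> Cartesian coordinates
mu_x    = [2; pi/6];
Sigma_x = diag([0.01, 0.001]);

n = numel(mu_x);  h = 1e-6;
J = zeros(numel(f(mu_x)), n);
for k = 1:n
    dx = zeros(n,1);  dx(k) = h;
    J(:,k) = (f(mu_x + dx) - f(mu_x)) / h;     % numerical Jacobian at the mean
end

mu_y    = f(mu_x);                             % linearised mean, cf. (C.37)
Sigma_y = J * Sigma_x * J';                    % first-order covariance, cf. (C.38)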

C.4.1 Covariance Propagation in Least-Squares

For example, with reference to the notation of Sect. C.2.1, let us compute the covariance of the linear least-squares regression coefficients , which are a linear function of the measures:
(C.39)
Then, applying (C.34), we obtain:
(C.40)

In the non-linear case, we concentrate on the normal


equations of the Gauss-Newton method, and we get:

(C.41)
assuming that the Jacobian matrix has full rank. If not, the pseudoinverse replaces the inverse (Hartley and Zisserman 2003, p. 144).

References

Dániel Baráth, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. MAGSAC++, a fast, reliable and accurate robust estimator. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 1301–1309. Computer Vision Foundation / IEEE, 2020. doi: https://doi.org/10.1109/CVPR42600.2020.00138.
M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24: 381–395, June 1981.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2nd edition, 2003.
P. Meer, D. Mintz, D. Y. Kim, and A. Rosenfeld. Robust regression methods in computer vision: a review. International Journal of Computer Vision, 6: 59–70, 1991.
Rahul Raguram, Ondrej Chum, Marc Pollefeys, Jiri Matas, and Jan-Michael Frahm. USAC: A universal framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell., 35(8): 2022–2038, 2013. URL http://dblp.uni-trier.de/db/journals/pami/pami35.html#RaguramCPMF13.
P. J. Rousseeuw and A. M. Leroy. Robust Regression & Outlier Detection. John Wiley & Sons, New York, 1987.
C. V. Stewart. Robust parameter estimation in computer vision. SIAM Review, 41(3): 513–537, 1999.
P. H. S. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1): 138–156, 2000.

Appendix D

Notions of Projective Geometry


«A colonnade [...] when its whole length is seen from the top end, little by little it contracts to the pointed head of a narrow cone, joining roof with floor, and all the right hand with the left, until it has brought all together into the point of a cone that passes out of sight.» [Lucretius]

D.1 Introduction

In order to introduce the use of projective geometry, and therefore of homogeneous coordinates, we follow Leon Battista Alberti
and study first the perspective projection of a plane. For a
formal treatment of analytic projective geometry, see, e.g.
Semple and Kneebone (1952).

D.2 Perspective Projection

Perspective projection is the type of projection that governs image formation in the vertebrate eye and in cameras.

Fig. D.1 Perspective projection of a ground plane  orthogonal to the image plane . The lines h and f are special cases

Definition D.1 (Perspective Projection) Perspective projection carries a point M of 3D space onto the point of the plane  (image plane) obtained by intersecting the plane with the line passing through M and C (centre of projection).
The basic elements are the image plane and the centre of
projection C. The lines of space are transformed into lines in
.
If we now consider the projection of a plane
orthogonal to (the ground plane), we find that:
the line f determined by the intersection of the plane  with the plane parallel to  and containing C does not project onto ;
the line h, intersection of  with the plane parallel to  and passing through C, is not the projection of any line on the plane .
This situation is not satisfactory, as we would like to have a
bijection
between and . To remedy this we add to the usual
Euclidean plane (of which and are instances) an
“ideal” line, called the line at infinity.
We can then say that f projects onto the line at infinity of
and that h is the projection of the line at infinity of . We can
think of the line at infinity of a plane as the place where the
parallel lines of that plane
(metaphorically) meet. Indeed:

Remark D.1 With reference to Fig. D.1, the images of parallel lines in  intersect at a point on h.
Let us consider (Fig. D.2), for example, the special case of
parallel lines
contained in and orthogonal to (called depth lines in the
Albertian construction). Their perspective image converges
to the principal point
, that is, the orthogonal projection of C onto . In
fact, the planes containing C and any depth line all contain
the half-line exiting C and
parallel to , called the principal ray, and thus intersect the
image in a line passing through the principal point O.

Fig. D.2 The depth line containing M projects onto a line containing P and
passing through the principal point O

Definition D.2 (Projective Plane) The projective plane


is defined as the union of the Euclidean plane with the line
at infinity :
(D. 1)

Similarly, one can define the projective space , where


the points at infinity constitute a hyperplane.
In the projective plane, two lines have a point of
intersection even when they are parallel, but the point is at
infinity. Every line, therefore, possesses a point at infinity that
is approached by travelling along the line in a certain
direction, and all lines parallel to that line meet it at the same
point at
infinity. For now, the term “point at infinity” is just a
metaphor, but we can give it a precise mathematical
meaning. One way to do this is to identify
points at infinity with the absolute direction of a line. Consider
the following three axioms of Euclidean geometry:
(i) there exists a single line through two distinct points;

(ii) there exists a single line passing through a point and


having a given direction;
(iii) two distinct lines have a single point in common or
have the same direction.
If we agree to replace “direction” with “point at infinity”
and then group all points at infinity into a line at infinity, we
get that the previous three
axioms are equivalent to:
(i) there exists a single line passing through two distinct

points;

(ii) two distinct lines have a single point in common.

These are the two axioms of incidence in the projective


plane.

D.3 Homogeneous Coordinates

Points at infinity are ideal (or improper) points that are added to the usual Euclidean plane. Like complex points, which are introduced by means of their algebraic representation, the same can be done for points at infinity, using homogeneous coordinates.
Fixed two reference axes, all points of  admit a unique representation as a pair of real numbers . Two lines of equations  and  meet, if they are not parallel, at a point of  whose coordinates are (by Cramer's rule):

(D.2)

When the lines are parallel, the denominator vanishes, so the ratio is meaningless (it becomes "infinite"). We note, however, that of the three quantities that appear in (D.2), at least one is nonzero (if the lines are distinct). The proposal is then to take these three quantities as representations of the points of .
Essentially, instead of representing the points in the plane
by pairs of coordinates we agree to represent them by
triplets of homogeneous coordinates connected to the
Cartesian coordinates of by the relation and
.
We note that proportional triplets represent the same point
– this is why they are called homogeneous – and that the triplet
is excluded.
When the triplet represents a proper point (finite),
while every triplet for which represents an improper
point (at infinity).
Definition D.3 (Projective Plane) The projective plane
consists of the
triplets of real numbers modulo the following
equivalence relation: . The triplets
are the homogeneous coordinates of the point of .
Similarly, one can define the projective space .

D.4 Equation of the Line

In the familiar Euclidean plane with Cartesian coordinates, each pair of numbers identifies a point and, conversely, each point is represented by only one pair. This is not true for lines: the equation of the line is
(with a and b not both zero), so the
three
coefficients of the equation represent a line, but the same line
is
represented by all the linear equations with
. In
the projective plane, this asymmetry disappears, as points
and lines are dual elements, as can be inferred from the
axioms.
A line in is represented by a triplet in the equation
, and proportional triplets represent the same
line. This line coincides with the proper line if
. Otherwise, the equation represents
the line at infinity .
By identifying triplets with vectors of , the equation
of the line can be written
(D.3)

where  and . Note the symmetry


between the role of the vector representing the line and
the one representing the point.

Proposition D.1 The line through two distinct points  and  is represented by the triplet .
Proof The line passes through both points; in fact:  and .

As for the point determined by two lines, the dual


proposition applies, which can also be deduced directly
from (D.2).
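In homogeneous coordinates both constructions reduce to a cross product, as the following small MATLAB sketch illustrates (the numerical values are arbitrary examples).

% Line through two points and intersection of two lines, via cross products
p1 = [0; 0; 1];  p2 = [1; 1; 1];               % two proper points
l  = cross(p1, p2);                            % line through them (here the line y = x)

m1 = [1; 1; 1];  m2 = [1; 1; 0];               % parallel lines x + y + 1 = 0 and x + y = 0
q  = cross(m1, m2);                            % their intersection
disp(q')                                       % third coordinate is 0: a point at infinity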
Proposition D.2 The line l through two distinct points  and  has parametric equation:
(D.4)
Proof Letting  and accepting the convention that  when , the line l has parametric equation:
(D.5)

D.5 Transformations
Definition D.4 (Projectivity) A projectivity or projective transformation is a linear map  in homogeneous coordinates:
(D.6)
where H is a non-singular matrix.

Projectivities also go by the name of homographies or collineations (since they preserve the collinearity of points); they form a group.
The perspective projection of the plane defined before is
a
projectivity of (we introduced exactly to make this
mapping bijective).
Because of the homogeneous representation of points, the
two matrices H and with represent the same
projectivity.
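This scale ambiguity is easy to verify numerically; in the sketch below the matrix and point are arbitrary examples.

% H and any non-zero multiple of it represent the same projectivity
H = [1 0.2 3; 0 1.5 -1; 0.001 0 1];            % an arbitrary non-singular matrix
x = [2; 5; 1];                                 % a point in homogeneous coordinates
y1 = H*x;       y1 = y1(1:2)/y1(3);            % Cartesian image point
y2 = (7*H)*x;   y2 = y2(1:2)/y2(3);            % same point, scaled matrix
disp(norm(y1 - y2))                            % zero: the two matrices act identically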
Definition D.5 (Affinity) Projectivities that transform
the Euclidean space into itself and points at infinity into
points at infinity are called affinities.
One can alternatively say that an affinity preserves the hyperplane at infinity (as a set). Affinities are all and only the
projectivities in which
the last row of the matrix is :

(D.7)

They constitute a subgroup of the projectivities.


In the hyperplane at infinity lies a special conic, the absolute
conic which has equation .
Definition D.6 (Similitude) Affinities that preserve the
absolute conic are called similitudes (or similarities).
Similitudes are all and only affinities in which the matrix has
this form:
(D.8)
where and s is a scalar. The similitudes form a
subgroup of the projectivities.
The action of the similitude in can be written as
where and are the Cartesian
coordinates.
Definition D.7 (Isometry) Similitudes that preserve
distances between points are called isometries (or rigid
transformations).
Isometries are all and only the similitudes in which .
They thus form a subgroup of the similitudes which is called the Euclidean group and is denoted by . In the case where , the matrix  represents a rotation ( ) and the isometry is said to be "direct". Direct isometries constitute the special Euclidean group denoted by .

References

J. G. Semple and G. T. Kneebone. Algebraic Projective Geometry. Oxford University Press, Oxford, 1952.

Appendix E

MATLAB Code
The MATLAB code listed in the text, along with other support
functions not listed, can be downloaded from
https://github.com/fusiello/computer_vision_Toolkit. Keep
in mind that this is code written with educational
intent and therefore:
whenever more efficient code could have been written at
the expense of readability, we have preferred the latter;
to avoid cluttering up the listings, there are no checks on
whether the parameters passed to the functions match
what is assumed within the functions themselves;
for the same reason, there is no documentation at the top
of the function and comments are used sparingly;
for reasons of compactness and simplicity, special cases and
pathological situations that might occur are not dealt with;
the input parameters to the functions are limited to the
indispensable arguments; other control parameters are
wired into the function itself;
the use of a variable number of input and output parameters, and of default assignments, is less than would be expected in a well-built library.
Conventions related to the interface of functions:
the points are in Cartesian coordinates;
both 2D and 3D points are packed into a matrix by columns;
items (points, PPM, etc.) referring to multiple images are represented with cell arrays indexed by the image;
in cases where the function computes a transformation, it is such that the second argument is transformed into the first.
The list of the main implemented functions follows
(auxiliary ones are omitted). All were tested with MATLAB R2021a and OCTAVE 4.2.1 under macOS 12.3.1.
Index

A

Adjacency matrix 302
Adjoint matrix 80
Adjustment of independent models 207
Affinity 331
Algebraic distance 100
Alignment 225
Autocalibration 236, 242

B

Baseline 57, 126, 127
Bearing 216
Bilinear interpolation 280
Block matching 168, 173
Breakpoint 318, 319
Bundle adjustment 217– 224

C

Calibration 35–43

Camera
affine 244
pinhole 5, 10, 15– 28
telecentric 10
Camera matrix, see Perspective Projection Matrix (PPM) 334
Camera obscura 5, 21
Camera reference frame 16
Census transform 171
Centre of projection 15, 26
Chirality 78
Circle of confusion 10
Coded-light 191
Collinearity equations 28
Collineation 64
Commutation matrix 308
Compatible matrix 92
Conjugate points 9
Control points 35, 49, 50
Convolution 140
Corresponding points, see Homologous Points 334
Cross-correlation 140

D

Degree matrix 303
Depth of field 10
Depth-speed ambiguity 73, 74, 236
Diffuse scattering 13
Direct Linear Transform (DLT) 35, 37, 50, 64, 67, 103
Disparity 125, 127, 137, 165– 182, 274, 275
Disparity map 166

E

Effective pixel size 8, 19
Eight-point algorithm 61, 63


Epipolar constraint 165
Epipolar geometry 57– 64, 73
Epipolar graph 203
Epipolar line 57– 60, 70, 168, 258, 260
Epipolar plane 57
Epipolar rectification 126, 129– 136
Epipole 57, 59, 61, 68, 70, 71
Essential matrix 73– 80, 92, 202, 241
computation 76
factorisation 77
Euler 310
Extrinsic parameters 23, 49, 90, 240, 333

F

Features
detection 139–152, 157–160


matching 153, 161
tracking 153
Field of view 6, 21
Filtering 140
Focal length 6, 8, 10, 15, 20, 130
Focal plane 15
Foreshortening effect 6
Fundamental matrix 60– 64, 75, 202
computation 63
factorisation 92

G

Generalised Procrustean analysis (GPA) 208, 226
Geometric distance 100


Gimbal lock 302

H

Hierarchical reconstruction 209
Homogeneous coordinates 330


Homography 40, 64– 68, 96, 132, 266, 270, 272
calibrated 81
Homologous points 57, 74, 125, 333

I

Image plane 5, 6, 9, 15

Image warping 279


Incidence matrix 215, 302
Intersection, see Triangulation 334
Intrinsic parameters 20, 24, 45, 49, 73, 92, 130, 241, 245, 247, 274, 333
Isometry 22, 332
Iterative closest point (ICP) 228, 267
Iverson brackets 13

J

Jacobian matrix 30, 103, 108, 153, 154, 219–222, 305–309, 315–318
Joint image space 110

K

Kernel 140

derivative 144
Gaussian 141
smoothing 141
Sobel 146
Kronecker product 36, 61, 67, 215, 244, 299
Kruppa constraints 240

L

Lambert's law 13

Light Detection And Ranging (LiDAR) 192


Linear regression 314
Longuet-Higgins equation 60, 74, 75
Loss function 319

M

Mapping

backward 280
forward 279
Marching cubes 250, 262
Match space 177
M-estimators 319
Model 201
Mosaic
panoramic 266
planar 266
Mosaicing 266– 270
Motion 246

N

Nodal points 9, 29

Non-linear
calibration 105
exterior orientation 106
fundamental matrix 115
homography 111
relative orientation 118
triangulation 107, 108
Normal configuration 59, 125, 126
Normal equation 296, 315
Normalised cross-correlation, NCC 169, 259
Normalised image coordinates 20, 50, 74– 76

O

Optical axis 8–10, 29, 131, 247

Optical ray 27, 33


Orientation 333
absolute 45–48, 50, 210
exterior 23, 24, 45, 49– 55, 208
interior 24
relative 45, 73– 80, 212
Orthogonal Procrustes Analysis (OPA) 46, 51, 52, 295
Orthographic projection 7, 245
Orthophoto 277, 278
Ortho-projection 277
Orthoscopy 29
Outliers 121, 318– 324

P

Parallax 69–70, 265, 270, 273, 276, 278

Parameters
extrinsic (see Extrinsic parameters) 334
intrinsic (see Intrinsic parameters) 334
Perspective pose 49
Perspective Projection Matrix (PPM) 18– 28, 333
photo-consistency 249, 251, 258, 259, 261
Photogrammetry 23, 28, 32, 49, 91, 207
Photographic scale 32
Photoplane 272
Pinhole camera 5
Pinhole camera, see camera
pinhole 334
Planar mosaics 266
Pose, see Orientation 334
Preconditioning 37, 63
Principal axis 15, 16
Principal point 15
Projection
orthographic 10
perspective 6, 17, 18, 26, 327
Projective bundle 15
Projectivity 64, 92, 233, 331

R

Radial distortion 29, 38

Radiance 11– 13, 194


radiance 249
Range sensors 185
Reconstruction 89, 201, 208
Euclidean 89, 90, 201, 202
identical 89
incremental 208
metric 90
projective 90, 92– 93, 201, 233– 235
from silhouettes 252
volumetric 249
Reflectance 11, 194, 252
Reflectance equation 12
Registration 225
Regression 313
Relief displacement 273
Reprojection error 102, 107
Resection 35, 39, 49, 87
Resection-intersection 208
Residual 102, 103, 106, 107, 111, 115, 313
Robust regression 121, 318

S

Salient points 139

Sampson distance 100


Scale Invariant Feature Transform (SIFT) 155– 162
Scale-space 155– 156
Scanner 186, 193
Seven-point algorithm 62
shape from silhouette 252
Shape from X 3
Singular value decomposition (SVD) 36, 51, 62, 63, 76, 77, 81, 234, 241, 244, 247, 292–295, 334
Singular values 292
Singular vectors 292
Spanning tree 303
Stereo 125, 168, 180, 187, 189, 333
active 187
correlation based 168
photometric 194– 197
Stereo matching 165
Stereopsis, see Stereo 334
Stratification 94
Structure 246
Structured lighting 186– 192
Sum of squares of differences (SSD) 149, 150, 169, 173, 175
Synchronisation 210, 231
of direct isometries 231
of homographies 267– 268
of rotations 212– 214
of translations 214– 215
Synthesis 274– 281

T

Template matching 141

Thin lens 8, 29
Tie points 73, 87, 203
Triangulation 75, 87– 89, 92, 125– 129, 205, 208, 333
active 186– 188
normal case 126
Triple product 75, 298

V

Vanishing point 33, 66, 244

visual hull 253


volumetric stereo 250
Voronoi tessellation 269
voxel 251– 258

Footnotes
1 in the sense that for .

2 Nobody would really invert , though. MATLAB, for example, uses QR


decomposition.

3 Implemented by the function kron in MATLAB.

4 In MATLAB we obtain with x(:).

5 In MATLAB both operations are performed by the overloaded function diag.


6 Other authors may follow different conventions. In fact, there are 12
different Euler angle conventions.

7 Only exception in this text to the convention that vectors are arranged in
columns.

8 This is also a norm, which is called “Mahalanobis” norm.

9 The theoretical breakpoint cannot exceed 50%, because it cannot be excluded


that a wrong test estimate will gather more consensus than the right ones, if the
outliers are arranged appropriately.
